TallVocabL2Fi: Measurements of 15 L2 Finnish learners' vocabularies


TallVocabL2Fi: Mitat 15 S2-opiskelijan sanavarastosta


Persistent Identifier of this resource:


Access location:

The TallVocabL2Fi dataset comprises responses from 15 participants to two tasks: a "tall" 12,000-word self-rating task on a 5-point scale, and a 100-word confirmatory word translation task. The participants were split by native language (5 English, 4 Hungarian and 6 Russian) and by self-reported CEFR reading level (5 B1, 4 B2, 5 C1 and 2 C2). The data was gathered through a website from paid participants resident in Finland over a period of 3 months, from September to November 2021. In total there are 180 thousand word knowledge self-rating responses and 1.5 thousand word translation responses.

The dataset is unique in its combination of the tall data collection setup, where responses are collected for many words, the varied backgrounds of the learners, the use of Finnish prompt words, and the triangulation with a word translation test. The dataset can be used for vocabulary acquisition research in general, but it is particularly suited to evaluating Vocabulary Inventory Prediction (VIP), including techniques based on Computer-Adaptive Testing (CAT).

The dataset is relational/tabular. It is distributed as a series of TSV files along with a SQL schema exported from DuckDB.
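
As a rough illustration of working with tab-separated files of this kind, the following Python sketch parses TSV text and tallies self-rating values. The column names (participant_id, word, rating) and the sample rows are invented for the example and are not taken from the dataset; the actual table and column names are documented in the readme and the SQL schema.

```python
import csv
import io
from collections import Counter


def read_tsv(handle):
    """Parse a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def rating_counts(rows, column="rating"):
    """Count how often each self-rating value occurs in the given column."""
    return Counter(int(row[column]) for row in rows)


# Toy stand-in for one TSV file. The header and rows are hypothetical;
# consult the dataset readme for the real schema.
SAMPLE = (
    "participant_id\tword\trating\n"
    "1\ttalo\t5\n"
    "1\tkissa\t4\n"
    "2\ttalo\t5\n"
)

rows = read_tsv(io.StringIO(SAMPLE))
counts = rating_counts(rows)
print(counts)
```

For a file on disk, the same read_tsv function can be given a handle opened with open(path, newline="", encoding="utf-8"), where path is whichever TSV file from the distribution is of interest.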

The TallVocabL2Fi dataset is available for download via Kielipankki – The Language Bank of Finland.

Further information about the schema and the collection process is available in the readme included with the data, and in the accompanying publication:

Robertson, F., Chang, L., & Söyrinki, S. (2022). TallVocabL2Fi: An Extensive Mapping of 15 Finnish L2 Learners' Vocabulary. In Language Resources and Evaluation Conference (LREC 2022).
