Parallel Sentence Aligned Corpus of Finnish and Easy-to-read Finnish from the Yle News Archive 2014-2020, Korp

Lausetasolla kohdistettu suomi–selkosuomi-rinnakkaiskorpus Ylen suomenkielisestä uutisarkistosta 2014-2020, Korp


This resource will be available via Korp in Kielipankki – the Language Bank of Finland.

This is a parallel corpus created from Yle news articles from 2014-2020 by aligning the standard Finnish versions with their easy-language versions. The dataset, created by Anna Dmitrieva and available in CSV format, is aligned at the sentence level. It is based on the two parallel document-level datasets of Yle News articles available in Kielipankki. The dataset spans the period from September 2014 to December 2020.

This dataset consists of the following parts:
1) Sentence alignments: parallel documents from regular and Easy Finnish Yle news articles aligned sentence-by-sentence. Only the "positive" documents were taken from the 2019-2020 dataset. All but 50 documents were aligned automatically with Vecalign using LASER embeddings. Each document has the following columns:
1.1) pair_id: an id made up of three parts separated by double underscores: the id of the regular document, the id of the Easy Finnish document (which itself contains a single underscore), and the sentence pair number.
1.2) regular_string: a sentence from the regular Finnish article.
1.3) selko_string: a corresponding sentence from the Easy Finnish article.
1.4) score: the confidence score given by Vecalign. The lower the score, the more similar the sentences. The "good" pairs are estimated to have a score less than or equal to 0.65; however, the score is not definitive proof that the sentences in a pair truly match in meaning. A score of zero is assigned when a sentence has no pair. The scores for all non-zero sentence pairs in manually aligned documents are set to 0.(3).
2) Golden sentence alignments: 50 documents aligned manually by a human assessor (text). Also available in the ladder format (indexes).
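
The column layout described in part 1 can be read with a short script. The following is a minimal sketch, not an official loader: it assumes the CSV files carry a header row with the four column names listed above, that the sentence pair number is an integer, and it uses invented ids and sentences in place of real data. The names parse_pair_id, classify, GOOD_THRESHOLD and SAMPLE are hypothetical helpers introduced only for this illustration.

```python
import csv
import io

# Threshold for "good" pairs as stated in the description (score <= 0.65).
GOOD_THRESHOLD = 0.65


def parse_pair_id(pair_id):
    """Split a pair_id into (regular id, Easy Finnish id, pair number).

    The three parts are separated by double underscores; the Easy Finnish
    id keeps its own single underscore. Treating the pair number as an
    integer is an assumption of this sketch.
    """
    regular_id, easy_id, pair_number = pair_id.split("__")
    return regular_id, easy_id, int(pair_number)


def classify(score):
    """Label a row by its Vecalign score, following the description."""
    if score == 0:
        return "unpaired"  # a zero score means the sentence has no pair
    return "good" if score <= GOOD_THRESHOLD else "doubtful"


# Invented sample rows; the ids and sentences are placeholders, not real data.
SAMPLE = """pair_id,regular_string,selko_string,score
3-100__3-200_1__0,Regular sentence A.,Easy sentence A.,0.41
3-100__3-200_1__1,Regular sentence B.,,0
3-100__3-200_1__2,Regular sentence C.,Easy sentence C.,0.82
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
labels = [classify(float(row["score"])) for row in rows]

for row, label in zip(rows, labels):
    print(parse_pair_id(row["pair_id"]), label)
```

To use this on an actual file, the io.StringIO(SAMPLE) argument would be replaced with an opened CSV file. Since the score is described as an estimate rather than proof of a match, the "good" label is best treated as a filter for candidates, not as a guarantee.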

