Parallel Sentence Aligned Corpus of Finnish and Easy-to-read Finnish from the Yle News Archive 2014-2020, Korp

Lausetasolla kohdistettu suomi–selkosuomi-rinnakkaiskorpus Ylen suomenkielisestä uutisarkistosta 2014-2020, Korp

ylenews-fi-2014-2020-selko-par-sent-korp

Persistent Identifier of this resource:

http://urn.fi/urn:nbn:fi:lb-2024031301

This resource will be available via Korp in Kielipankki – the Language Bank of Finland.

This is a parallel corpus created from Yle news articles published in 2014-2020 by aligning the standard Finnish versions with their easy-language versions. The dataset, created by Anna Dmitrieva and available in CSV format, is aligned at the sentence level. It is based on the two parallel document-level datasets of Yle News articles available in Kielipankki (http://urn.fi/urn:nbn:fi:lb-2022111625 and http://urn.fi/urn:nbn:fi:lb-2024011701). The dataset spans the period from September 2014 to December 2020.

This dataset consists of the following parts:
1) Sentence alignments: parallel documents from regular and Easy Finnish Yle news articles aligned sentence-by-sentence. Only the "positive" documents were taken from the 2019-2020 dataset (http://urn.fi/urn:nbn:fi:lb-2022111625). All but 50 documents were aligned automatically with Vecalign (https://github.com/thompsonb/vecalign) using LASER embeddings (https://github.com/facebookresearch/LASER). Each document has the following columns:
1.1) pair_id: an id consisting of three parts separated by a double underscore: the id of the regular document, the id of the Easy Finnish document (which itself contains a single underscore), and the sentence pair number.
1.2) regular_string: a sentence from the regular Finnish article.
1.3) selko_string: a corresponding sentence from the Easy Finnish article.
1.4) score: the confidence score given by Vecalign. The lower the score, the more similar the sentences. The "good" pairs are estimated to have a score less than or equal to 0.65; however, the score is not definitive proof that the sentences in a pair truly match in meaning. A score of zero is assigned when a sentence has no pair. The scores for all non-zero sentence pairs in manually aligned documents are set to 0.(3).
2) Golden sentence alignments: 50 documents aligned manually by a human assessor (text). Also available in the ladder format (indexes).
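
The column layout described above can be illustrated with a minimal Python sketch, assuming the CSV uses a header row with exactly the four column names listed in 1.1 to 1.4. The document ids and sentences in the sample are hypothetical placeholders, not taken from the corpus, and the helper names parse_pair_id and good_pairs are invented for illustration. The sketch also assumes that the sentence pair number is an integer.

```python
import csv
import io

# Threshold for "good" pairs, taken from the description of the score column.
GOOD_THRESHOLD = 0.65


def parse_pair_id(pair_id):
    """Split a pair_id into (regular_doc_id, selko_doc_id, pair_number).

    The three parts are separated by a double underscore; the Easy Finnish
    document id keeps its own single underscore.
    Assumption: the pair number is an integer.
    """
    regular_id, selko_id, pair_no = pair_id.split("__")
    return regular_id, selko_id, int(pair_no)


def good_pairs(rows):
    """Yield rows that have a pair (score != 0) and a score <= GOOD_THRESHOLD."""
    for row in rows:
        score = float(row["score"])
        if 0 < score <= GOOD_THRESHOLD:
            yield row


# Hypothetical sample data in the assumed CSV layout.
sample = io.StringIO(
    "pair_id,regular_string,selko_string,score\n"
    "3-100__3-200_1__0,Regular sentence A.,Easy sentence A.,0.41\n"
    "3-100__3-200_1__1,Regular sentence B.,,0.0\n"
    "3-100__3-200_1__2,Regular sentence C.,Easy sentence C.,0.9\n"
)

rows = list(csv.DictReader(sample))
kept = list(good_pairs(rows))

for row in kept:
    print(parse_pair_id(row["pair_id"]), row["score"])
```

Since a low score is not definitive proof of a match, a filter like this is best treated as a first pass rather than a guarantee of alignment quality.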
