Parallel Corpus of Finnish and Easy-to-read Finnish from the Yle News Archive 2014-2018, source

Suomi-selkosuomi-rinnakkaiskorpus Ylen suomenkielisestä uutisarkistosta 2014-2018, lähdeaineisto

ylenews-fi-2014-2018-selko-par-src

Persistent Identifier of this resource:

http://urn.fi/urn:nbn:fi:lb-2024011701

Access location:

This resource is available for download in Kielipankki – the Language Bank of Finland.

This is a parallel corpus created from Yle news articles from 2014-2018 by aligning the standard Finnish versions with their easy-language versions. The dataset, created by Anna Dmitrieva and available in CSV format, is aligned at the document level. The news articles were obtained from the datasets available via Kielipankki (http://urn.fi/urn:nbn:fi:lb-2017070501 and http://urn.fi/urn:nbn:fi:lb-2019050901).

This dataset extends the previously published Parallel Corpus of Finnish and Easy-to-read Finnish from the Yle News Archive 2019-2020 (http://urn.fi/urn:nbn:fi:lb-2022111625). Please note that this dataset has not been assessed by a human expert. The articles were aligned automatically with the Vecalign document alignment algorithm (https://github.com/thompsonb/vecalign), without candidate rescoring, using LASER embeddings (https://github.com/facebookresearch/LASER).

Description of all columns in the dataset:
- index_in_selko: This index consists of two parts separated by an underscore. The first (longer) part identifies the entire Easy Finnish article in the original dataset. The second (shorter) part is the paragraph number. Since Yle Selkosuomi articles usually consist of multiple paragraphs, each describing a separate piece of news, each paragraph is represented as an individual short article in this dataset. Paragraph numbering starts at 0.
- index_in_regular: The identifier of the regular Finnish article taken from the original dataset.
- selko_text: A piece of news in Easy Finnish.
- regular_text: A corresponding piece of news in regular Finnish.
- distance: The cosine distance between the document vectors. The lower the distance, the more similar the documents are.
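
The column layout above can be read with a few lines of Python. The following is only a minimal sketch, not part of the resource itself: it assumes a comma-separated file with a header row that uses the column names listed above, the helper names split_selko_index and read_pairs are hypothetical, the sample rows are made up for illustration, and the threshold 0.4 is an arbitrary example value rather than a recommended cut-off.

```python
import csv
import io


def split_selko_index(index_in_selko):
    """Split an index_in_selko value into (article_id, paragraph_number)."""
    # Split on the last underscore only, so that an article id which itself
    # contains underscores would still be handled.
    article_id, paragraph = index_in_selko.rsplit("_", 1)
    return article_id, int(paragraph)


def read_pairs(handle, max_distance=None):
    """Yield rows as dicts; optionally keep only pairs with distance <= max_distance."""
    for row in csv.DictReader(handle):
        row["distance"] = float(row["distance"])
        article_id, paragraph = split_selko_index(row["index_in_selko"])
        row["selko_article"] = article_id
        row["selko_paragraph"] = paragraph
        if max_distance is None or row["distance"] <= max_distance:
            yield row


# Made-up sample rows in the documented column layout (not real corpus data).
sample = io.StringIO(
    "index_in_selko,index_in_regular,selko_text,regular_text,distance\n"
    "article123_0,regular456,Selko A,Regular A,0.21\n"
    "article123_1,regular789,Selko B,Regular B,0.58\n"
)

# Keep only the closer pair; lower distance means more similar documents.
pairs = list(read_pairs(sample, max_distance=0.4))
for pair in pairs:
    print(pair["selko_article"], pair["selko_paragraph"], pair["distance"])
```

For the downloaded file, the in-memory sample would be replaced by something like open(path, newline="", encoding="utf-8"), where the path, the encoding, and the delimiter are assumptions to check against the actual download. Since the alignment has not been assessed by a human expert, filtering on distance is one possible way to trade corpus size for alignment quality.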
