Parallel Corpus of Finnish and Easy-to-read Finnish from the Yle News Archive 2019-2020, source

Suomi-selkosuomi-rinnakkaiskorpus Ylen suomenkielisestä uutisarkistosta 2019-2020, lähdeaineisto

ylenews-fi-2019-2020-selko-par-src

Persistent Identifier of this resource:

http://urn.fi/urn:nbn:fi:lb-2022111625

Access location:

The resource is available via Kielipankki – The Language Bank of Finland.

This parallel dataset can be used to train text simplification models or to study the simplification strategies that experts apply to Finnish news articles. The languages of the dataset are Finnish and Easy-to-read Finnish. The articles in the dataset date from 2019-2020, but the dataset itself was created in 2022. The resource also contains an automatically computed similarity score for each pair of articles, as well as human similarity judgements. The news articles were obtained from datasets available via Kielipankki (http://urn.fi/urn:nbn:fi:lb-2021050401 and http://urn.fi/urn:nbn:fi:lb-2021050701).

Description of all columns in the dataset:
- index_in_selko: An index consisting of two parts separated by an underscore. The first (longer) part is the identifier of the entire Easy Finnish article, taken from the original dataset. The second (shorter) part is the paragraph number. Since Yle Selkosuomi articles usually consist of multiple paragraphs, each describing a separate piece of news, each paragraph is represented as a separate short article in our dataset. Paragraph numbering starts at 0.
- index_in_regular: The identifier of the regular Finnish article taken from the original dataset.
- selko_text: A piece of news in Easy Finnish.
- regular_text: A corresponding piece of news in regular Finnish.
- cos_sim: The cosine similarity score between the first 15 sentences of the articles in the pair (each sentence was vectorized with a SentenceTransformer model, then an average vector for each article in the pair was obtained, and finally, these two average vectors were compared).
- status: A score given to this pair of articles by the human assessor. Positive means that the two articles definitely cover the same news event. Negative means that they definitely cover different events. Neutral means that it is unclear whether they cover the same event.
- comments: Comments given by the human assessor.
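
The index_in_selko and cos_sim columns described above can be illustrated with a minimal Python sketch. It is not the code used to build the dataset: the function names are hypothetical, the example identifier is made up, and the small toy vectors stand in for the SentenceTransformer embeddings (the specific model is not named here). Splitting at the last underscore is an assumption that only matters if an article identifier itself contains an underscore.

```python
import math
from typing import List, Sequence, Tuple


def split_selko_index(index_in_selko: str) -> Tuple[str, int]:
    """Split 'ARTICLEID_PARAGRAPH' into (article id, paragraph number).

    Assumption: the paragraph number follows the LAST underscore.
    """
    article_id, paragraph = index_in_selko.rsplit("_", 1)
    return article_id, int(paragraph)


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a list of equal-length sentence vectors element-wise."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def article_similarity(
    selko_vectors: Sequence[Sequence[float]],
    regular_vectors: Sequence[Sequence[float]],
    max_sentences: int = 15,
) -> float:
    """Compare two articles via the mean of their first sentence vectors.

    Mirrors the cos_sim description: keep at most the first 15 sentence
    vectors per article, average them, then compare the two averages.
    """
    selko_mean = mean_vector(selko_vectors[:max_sentences])
    regular_mean = mean_vector(regular_vectors[:max_sentences])
    return cosine_similarity(selko_mean, regular_mean)


if __name__ == "__main__":
    # Hypothetical identifier, for illustration only.
    print(split_selko_index("3-1234567_0"))
    # Toy 2-dimensional "embeddings" instead of real sentence vectors.
    print(article_similarity([[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0]]))
```

In practice the sentence vectors would come from the embedding model, and the function would be applied to the sentences of selko_text and regular_text for each row.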
