Parallel Corpus of Finnish and Easy-to-read Finnish from the Yle News Archive 2014-2018, source


Last update: 2024-02-01

Suomi-selkosuomi-rinnakkaiskorpus Ylen suomenkielisestä uutisarkistosta 2014-2018, lähdeaineisto

ylenews-fi-2014-2018-selko-par-src

Persistent Identifier of this resource:

http://urn.fi/urn:nbn:fi:lb-2024011701

Access location: http://urn.fi/urn:nbn:fi:lb-2024011702

This resource is available for download in Kielipankki – the Language Bank of Finland.

This is a parallel corpus created from Yle news articles published in 2014-2018 by aligning the standard Finnish versions with their easy-language versions. The dataset, created by Anna Dmitrieva and available in CSV format, is aligned at the document level. The news articles were obtained from the datasets available via Kielipankki (http://urn.fi/urn:nbn:fi:lb-2017070501 and http://urn.fi/urn:nbn:fi:lb-2019050901).

This dataset extends the previously published Parallel Corpus of Finnish and Easy-to-read Finnish from the Yle News Archive 2019-2020 (http://urn.fi/urn:nbn:fi:lb-2022111625). Please note that this dataset has not been assessed by a human expert. The articles have been aligned automatically with the Vecalign document alignment algorithm (https://github.com/thompsonb/vecalign) without candidate rescoring, using LASER embeddings (https://github.com/facebookresearch/LASER).

Description of all columns in the dataset:
- index_in_selko: This index consists of two parts separated by an underscore. The first (longer) part identifies the entire Easy Finnish article from the original dataset. The second (shorter) part is the number of the paragraph. Since the Yle Selkosuomi articles usually consist of multiple paragraphs, each paragraph describing a separate piece of news, we represent each paragraph as an individual little article in our dataset. Paragraph numbering starts with 0.
- index_in_regular: The identifier of the regular Finnish article taken from the original dataset.
- selko_text: A piece of news in Easy Finnish.
- regular_text: A corresponding piece of news in regular Finnish.
- distance: The cosine distance between the document vectors. The lower the distance, the more similar the documents are.
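The column layout above can be read with a few lines of Python. The following is a minimal sketch, not the corpus author's own tooling. It assumes that the CSV is comma-delimited and that its header row uses the column names listed above. The sample rows, the identifier format and the 0.3 threshold are hypothetical placeholders, not values taken from the corpus. The `cosine_distance` helper uses the common definition (1 minus cosine similarity), which is assumed to match the `distance` column.

```python
import csv
import io
import math


def split_selko_index(index_in_selko):
    """Split 'ARTICLEID_PARAGRAPH' at the last underscore.

    Returns (article_id, paragraph_number); paragraph numbering starts at 0.
    """
    article_id, _, paragraph = index_in_selko.rpartition("_")
    return article_id, int(paragraph)


def cosine_distance(u, v):
    """Cosine distance as 1 - cosine similarity (assumed definition)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def read_pairs(handle, max_distance=None):
    """Read aligned pairs, optionally keeping only rows at or below max_distance."""
    pairs = []
    for row in csv.DictReader(handle):
        distance = float(row["distance"])
        if max_distance is not None and distance > max_distance:
            continue  # skip pairs whose document vectors are too far apart
        article_id, paragraph = split_selko_index(row["index_in_selko"])
        pairs.append({
            "selko_article": article_id,
            "selko_paragraph": paragraph,
            "regular_article": row["index_in_regular"],
            "selko_text": row["selko_text"],
            "regular_text": row["regular_text"],
            "distance": distance,
        })
    return pairs


# Hypothetical placeholder rows; real identifiers and texts come from the corpus.
SAMPLE_CSV = """index_in_selko,index_in_regular,selko_text,regular_text,distance
3-0000001_0,3-0000002,Placeholder easy text A,Placeholder regular text A,0.21
3-0000001_1,3-0000003,Placeholder easy text B,Placeholder regular text B,0.58
"""

if __name__ == "__main__":
    # The 0.3 threshold is an arbitrary illustration, not a recommended cutoff.
    for pair in read_pairs(io.StringIO(SAMPLE_CSV), max_distance=0.3):
        print(pair["selko_article"], pair["selko_paragraph"], pair["distance"])
```

Because the alignment has not been assessed by a human expert, filtering on `distance` is one way to trade corpus size for alignment quality. Any threshold should be chosen by inspecting the actual data.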


Distribution

Availability

Available - Restricted Use

Start date: 02/01/2024

Licence

CLARIN ACA - NC

Restrictions: Academic - Non Commercial Use, Attribution, No Redistribution, Other

User Nature: Academic

Distribution Access/Medium: Downloadable

Licensors:

Yleisradio Oy, Finnish Broadcasting Company (Yle)

Distribution rights holders:

University of Helsinki

IPR Holder

Yleisradio Oy, Finnish Broadcasting Company (Yle)

Contact Person

Anna Dmitrieva

text

Bilingual text corpus

Languages

Finnish

Variety: Standard Finnish (Type: Other) (7,004 Texts)

Finnish

Variety: Easy-to-read Finnish (Type: Other) (7,004 Texts)

Linguality

Linguality type: Bilingual

Multi-linguality type: Parallel

Size

7,004 Articles

Modalities

Written Language

Time Coverage

2014 - 2018

Creation

Original Sources

Yle Finnish News Archive 2011-2018, source http://urn.fi/urn:nb...
Yle News Archive Easy-to-read Finnish 2011-2018, source http://urn.fi/urn:nb...

Resource Creation

Resource Creator

Anna Dmitrieva

Yleisradio Oy, Finnish Broadcasting Company (Yle)

Metadata

Created: 01/17/2024

Last Updated: 02/01/2024

Revision: access location added

Metadata Creator

Ute Dieckmann

Relation

Related Resource: Yle Finnish News Archive 2011-2018, source http://urn.fi/urn:nb...

Relation Type: IsVariantFormOf

Related Resource: Yle News Archive Easy-to-read Finnish 2011-2018, source http://urn.fi/urn:nb...

Relation Type: IsVariantFormOf

Related Resource: Parallel Corpus of Finnish and Easy-to-read Finnish from the Yle News Archive 2019-2020, source http://urn.fi/urn:nb...

Relation Type: Continues

Documentation

Document Type: Other

Lisenssi (Ylen uutisarkisto, kokotekstiversiot), License (Yle News Archive, full text versions), http://urn.fi/urn:nb...

Editor: FIN-CLARIN

Document Language: English

Document Type: Other

Aineistoryhmäsivu (Ylen uutisarkisto), Resource group page (Yle News Archive), http://urn.fi/urn:nb...

Editor: FIN-CLARIN

How to cite: https://www.kielipan...
