Yle machine translated subtitles evaluation dataset


3 Last update: 2022-09-20


yle-av-dataset-3

Persistent Identifier of this resource:

http://urn.fi/urn:nbn:fi:lb-2022092025

Access location: http://urn.fi/urn:nbn:fi:lb-2022092026

This dataset contains semi-automatically cleaned, parallel professional subtitles from 44 programs, containing 10.3k aligned sentence pairs for these language pairs: FIN-SWE, FIN-ENG, SWE-ENG.

This dataset does not contain video or audio, but the total content length covered by the subtitles is 22.46 hours.

---
Yle has released three datasets under an experimental license for a limited time to support the development of language- and media-related technologies. These datasets were originally created by the MeMAD research and innovation project, a collaboration between media industry members and research groups. The MeMAD project received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 780069.

LICENSE INFORMATION:

The data is available for research purposes upon specific request from Yle. The requesting party must be located in Finland to gain access to the data, although its other project partners need not be.

Please see the website at https://developer.yle.fi/en/data/avdata/index.html for more detailed terms and conditions. Requests can be made until the end of 2022 by submitting the form available on the website.


Distribution

Availability

Available - Restricted Use

Start date: 03/22/2021

End date: 12/31/2022

Licence

Other

Restrictions: Academic - Non Commercial Use, Attribution, No Redistribution, Other

User Nature: Academic

Distribution Access/Medium: Other

Licensors:

Yleisradio Oy, Finnish Broadcasting Company (Yle)

Distribution rights holders:

Yleisradio Oy, Finnish Broadcasting Company (Yle)

Contact Persons

Lauri Saarikoski

Tuomas Nolvi

text

Multilingual text corpus

Languages

Swedish Finnish English

Linguality

Linguality type: Multilingual

Multi-linguality type: Parallel (Semi-automatically cleaned, professional subtitles)

Size

10,300 Sentences

Modalities

Spoken Language

Resource Creation

Funding Project

Methods for Managing Audiovisual Data (MeMAD - 780069)

URL: https://memad.eu/

Funding Type: EU Funds

Funder: The European Union's Horizon 2020 research and innovation programme

Project duration: 01/01/2018 - 12/31/2020

Metadata

Created: 11/03/2021

Last Updated: 09/21/2022

Source: https://developer.yle.fi/en/data/avdata/index.html

Metadata Creator

Mietta Lennes

Documentation

Document Type: Other

Yle audiovisual data and subtitles datasets, https://developer.yl... , 2021

Publisher: Yle - Finnish Broadcasting Company

Document Language: English
