Europarl Parallel Corpus

Persistent Identifier of this resource:

http://urn.fi/urn:nbn:fi:lb-20140730195

Access location: http://www.statmt.org/europarl/

The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.

The goal of the extraction and processing was to generate sentence-aligned text for statistical machine translation systems. For this purpose, we extracted matching items and labeled them with corresponding document IDs. We identified sentence boundaries with a preprocessor, then sentence-aligned the data using a tool based on the Gale and Church algorithm.
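The length-based alignment step described above can be illustrated with a small sketch. This is not the tool used to build the corpus: it replaces the probabilistic length model of the published algorithm with a plain relative length difference, and the bead penalties in `BEADS`, the function names, and the sample sentences are all assumptions made for illustration.

```python
from typing import List, Tuple

# Alignment "beads": (source sentences consumed, target sentences consumed).
# The penalties are hand-picked for this sketch, not taken from the literature.
BEADS = {(1, 1): 0.0, (2, 1): 2.0, (1, 2): 2.0, (1, 0): 4.0, (0, 1): 4.0}


def length_cost(src_len: int, tgt_len: int) -> float:
    """Relative character-length mismatch, between 0.0 and 1.0."""
    total = src_len + tgt_len
    return abs(src_len - tgt_len) / total if total else 0.0


def align(src: List[str], tgt: List[str]) -> List[Tuple[List[str], List[str]]]:
    """Align two sentence lists by dynamic programming over sentence lengths."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                candidate = cost[i][j] + penalty + length_cost(s_len, t_len)
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)

    # Walk the back-pointers from the end to recover the chosen beads.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    source = ["aaaaa", "bbbbb", "ccc"]
    target = ["xxxxxxxxxx", "yyy"]
    for src_group, tgt_group in align(source, target):
        print(src_group, "<->", tgt_group)
```

With the placeholder input above, the two short source sentences are merged against the single long target sentence, because a 2-to-1 bead with matching lengths costs less than two mismatched 1-to-1 beads plus a deletion.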

Distribution

Availability

Available - Unrestricted Use

Licence

CC - ZERO

Download location: hidden

Distribution Access/Medium: Downloadable

IPR Holder

Philipp Koehn

Contact Person

Philipp Koehn

text

Multilingual text corpus

Languages

Slovenian (12,665,974 Words)
Finnish (33,708,706 Words)
Spanish (54,806,927 Words)
English (53,974,751 Words)
Italian (50,259,169 Words)
French (54,202,850 Words)
Danish (47,761,381 Words)
Greek, Modern (1453-) (1,517,141 Words)
German (47,236,849 Words)
Swedish (45,665,947 Words)
Portuguese (52,300,149 Words)
Estonian (11,358,009 Words)
Czech (13,195,311 Words)
Lithuanian (11,512,131 Words)
Hungarian (12,606,986 Words)
Polish (7,087,016 Words)
Latvian (12,085,228 Words)
Slovak (13,116,301 Words)
Romanian (9,663,544 Words)
Dutch; Flemish (53,487,257 Words)
Bulgarian (411,636 Sentences)

Linguality

Linguality type: Multilingual

Multi-linguality type: Parallel

Size

650,000,000 Words

Modalities

Written Language

Time Coverage

1996-2011

Metadata

Created: 09/23/2012

Last Updated: 10/07/2014

Metadata Language: English (en)

Metadata Creator

Saara Pöyhönen

Usage

Foreseen Use

NLP Applications

Use NLP Specific: Machine Translation
