Finnish News Agency Archive 1992-2018, CoNLL-U, source

Last update: 2023-05-10

STT:n uutisarkisto 1992-2018, CoNLL-U, lähdemateriaali

stt-fi-1992-2018-conllu-src

Persistent Identifier of this resource:

http://urn.fi/urn:nbn:fi:lb-2020031201

Access location: http://urn.fi/urn:nbn:fi:lb-2020031202

This is the parsed version of the Finnish News Agency Archive 1992-2018 corpus (http://urn.fi/urn:nbn:fi:lb-2019041501). The corpus was parsed by Khalid Alnajjar (University of Helsinki) using the Turku neural parser pipeline (http://turkunlp.org/Turku-neural-parser-pipeline/).

The Finnish News Agency Archive corpus comprises newswire articles in Finnish sent to media outlets by the Finnish News Agency (STT) between 1992 and 2018. The corpus includes about 2.8 million items in total. Most of the material consists of news articles, ranging from short “news flashes” to telegrams and longer articles. News articles are categorized by department (domestic, foreign, economy, politics, culture, entertainment and sports) as well as by metadata (IPTC subject categories or keywords and location data). The archive also includes other material that STT has created or forwarded, such as news planning lists, sports results, analysis articles and press releases.

The corpus is available as whole texts for non-commercial research through the download service korp.csc.fi/download. Access is granted on the basis of a research plan submitted with the application in Language Bank Rights.

Notes:
-) Headlines and news content were parsed, and the output is in CoNLL-U format.
-) Filenames in the original corpus are preserved; only the file extension was changed. This allows mapping the parsed corpus to the original corpus to obtain additional metadata if needed.
-) Files with the prefix "h_" contain the parsed headline. All other files contain the parsed news content.
-) Not all documents in the corpus contained a headline and/or news content. In such cases, the file was skipped.
-) The corpus contained some English documents, and for these the output of the parser is usually incorrect. Language identification could be used to handle the English documents appropriately.
-) UralicNLP (https://github.com/mikahama/uralicNLP/wiki/UD-parser) can be used to read and work with the parsed corpus in Python.
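
The file naming and output format described in the notes above can be illustrated with a minimal Python sketch. It uses only the standard library and is not the UralicNLP API. The helper names original_name and read_conllu are hypothetical, and the file extensions ".conllu" (parsed) and ".xml" (original) are assumptions, since this record only states that the extension was changed. The sample sentence is invented for illustration and is not taken from the corpus.

```python
from pathlib import Path

# The ten tab-separated columns of a CoNLL-U token line.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def original_name(parsed_name, original_ext=".xml"):
    """Map a parsed file name to the original file name and the part it holds.

    ASSUMPTION: the original extension is ".xml"; adjust it to match the
    files in the source corpus.
    """
    path = Path(parsed_name)
    is_headline = path.name.startswith("h_")
    # Drop the "h_" prefix that marks a parsed headline.
    stem = path.stem[2:] if is_headline else path.stem
    return stem + original_ext, "headline" if is_headline else "content"


def read_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences, tokens = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if tokens:
                sentences.append(tokens)
                tokens = []
        elif not line.startswith("#"):
            # Comment lines start with "#"; everything else is a token line.
            tokens.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    if tokens:
        sentences.append(tokens)
    return sentences


# Invented sample in CoNLL-U format (not from the corpus).
SAMPLE = (
    "# text = Hallitus kokoontui.\n"
    "1\tHallitus\thallitus\tNOUN\t_\tCase=Nom|Number=Sing\t2\tnsubj\t_\t_\n"
    "2\tkokoontui\tkokoontua\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

sentences = read_conllu(SAMPLE)
lemmas = [token["lemma"] for token in sentences[0]]
```

With real data, the text passed to read_conllu would come from something like Path(parsed_file).read_text(encoding="utf-8"), and the name returned by original_name could then be used to look up the matching file in the original corpus for additional metadata.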

Acknowledgments:
-) This work has been supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 825153, project EMBEDDIA (Cross-Lingual Embeddings for Less-Represented Languages in European News Media).
-) The corpus was processed on the Finnish Grid and Cloud Infrastructure (urn:nbn:fi:research-infras-2016072533).

Licence: http://urn.fi/urn:nbn:fi:lb-2019041502

Distribution

Availability

Available - Restricted Use

Licence

CLARIN RES

Restrictions: Academic - Non Commercial Use, Attribution, No Redistribution, Other, Redeposit

User Nature: Academic

Distribution Access/Medium: Downloadable

Licensors:

Oy Suomen Tietotoimisto Finska Notisbyrån Ab

Distribution rights holders:

University of Helsinki

IPR Holder

Oy Suomen Tietotoimisto Finska Notisbyrån Ab

Contact Person

User support FIN-CLARIN

text

Monolingual text corpus

Languages

Finnish

Linguality

Linguality type: Monolingual

Text Format

NewsML-G2

Size

2,848,322 Texts

Modalities

Written Language

Annotation

Syntactic Annotation - Treebanks

Annotation Tools:

http://turkunlp.org/...

Annotators:

Khalid Alnajjar

Time Coverage

1992-2018

Metadata

Created: 04/14/2019

Last Updated: 03/22/2021

Metadata Language: English (en)

Metadata Creator

Tommi Jauhiainen

Ute Dieckmann

Relation

Related Resource: http://urn.fi/urn:nb...

Relation Type: IsDerivedFrom

Documentation

Resource group page http://urn.fi/urn:nb...

How to cite: https://www.kielipan...

Format info https://iptc.org/sta...

Document Type: Other

Lisenssi (stt-fi, kokotekstiversiot), Licence (stt-fi, full text versions), http://urn.fi/urn:nb... , 2019

Editor: FIN-CLARIN

Document Language: English

Change log: 2023-05-10 Added the missing license condition +DEP, according to the original agreement.
