The Finnish N-grams 1820-2000 of the Newspaper and Periodical Corpus of the National Library of Finland 
Kansalliskirjaston sanoma- ja aikakauslehtikokoelman suomenkieliset n-grammit 1820-2000
FNC1
Persistent Identifier of this resource:
http://urn.fi/urn:nbn:fi:lb-2014073038
Access location:
The corpus is available for download in Kielipankki - the Language Bank of Finland (see Access location).
The National Library of Finland has digitized a large proportion of the Finnish newspapers, magazines, and periodicals published between 1820 and 2000. This resource contains sets of unigrams, bigrams and trigrams extracted by the University of Helsinki from the source data of the Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version (see http://urn.fi/urn:nbn:fi:lb-2016050302).
The resource consists of plain UTF-8 encoded text files, each containing a list of n-grams that have been ordered by their frequencies from highest to lowest. Each line in a file consists of two or more fields separated by a whitespace character. The first field indicates the absolute frequency of a unique n-gram, and the remaining fields contain the tokens (strings of non-whitespace characters) of the n-gram itself. Uppercase letters have been retained as such and have not been converted into lowercase letters. Punctuation characters are treated as separate tokens except when they are part of an abbreviation ("etc.", "mm.") or when they separate a case ending or an enclitic from an abbreviation or a sign ("EU:ssa", "%:iin"), as per the typographic principles of standard Finnish. The n-grams have been computed across sentence boundaries for each decade (from the 1820s to the 2000s) as well as for the entire corpus, with unigrams, bigrams and trigrams in separate files.
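The line format described above can be read with a few lines of Python. The following is a minimal sketch rather than an official reader: the function name `parse_ngram_lines` and the sample lines (including their frequencies) are made up for illustration, and the decision to skip blank or malformed lines is an assumption, not something the resource description specifies.

```python
from typing import Iterable, Iterator, Tuple


def parse_ngram_lines(lines: Iterable[str]) -> Iterator[Tuple[int, Tuple[str, ...]]]:
    """Yield (frequency, tokens) pairs from lines in the described format.

    Each line is expected to hold an absolute frequency followed by one or
    more whitespace-separated tokens. Lines that do not fit this shape are
    skipped (an assumption made for this sketch).
    """
    for line in lines:
        fields = line.split()  # splits on any run of whitespace
        if len(fields) < 2:
            continue  # blank line, or a frequency with no tokens
        try:
            frequency = int(fields[0])
        except ValueError:
            continue  # first field is not an integer count
        yield frequency, tuple(fields[1:])


# Hypothetical sample lines; the counts are invented for illustration only.
sample = [
    "1500 ja",
    "42 EU:ssa on",
    "",
    "7 mm. ,",
]

parsed = list(parse_ngram_lines(sample))
for frequency, tokens in parsed:
    print(frequency, tokens)
```

To read one of the downloaded files, the same function can be applied to an open file object, for example `open(path, encoding="utf-8")`, where `path` points to whichever unigram, bigram or trigram file is of interest. Because case is retained in the data, "Ja" and "ja" arrive as distinct tokens; any case folding has to be done by the user.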
Since the source material has been digitized by means of optical character recognition (OCR), the resource also contains erroneous word forms and non-word strings of characters. Furthermore, due to the long time span of the original corpus, the resource contains several lexical items and spelling variants that have since become obsolete in standard Finnish.
The resource will be updated in the future as improvements are made to the source material.
Referring to the Finnish N-gram Corpus
If you use material from the Finnish N-gram Corpus and want to cite
it, you may use the following information:
Bibliographic references
The Finnish N-gram Corpus, version 1 (FNC1). 2014. Distributed by the
University of Helsinki on behalf of the FIN-CLARIN Consortium.
URL: http://www.helsinki.fi/finclarin/fnc1
Data from the FNC1
Our policy is to request that citations of the Finnish N-gram
Corpus include the corpus identifier and version number (a 4-letter
code). A suitable way of crediting the FNC1 would
be:
"N-grams from the Finnish N-gram Corpus, version 1 (FNC1), and the frequencies derived from it were obtained under the CC BY 4.0 license."
log
25.11.2018 link http://islrn.org/resources/926-345-171-872-3 removed
People who looked at this resource also viewed the following:
- The Swedish N-grams 1770-1940 of the Newspaper and Periodical Corpus of the National Library of Finland
- The Newspaper and Periodical OCR Corpus of the National Library of Finland (1771-1874)
- The HS.fi News and Comments Corpus
- The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version