The Swedish N-grams 1770-1940 of the Newspaper and Periodical Corpus of the National Library of Finland

Last update: 2021-09-15

Name in Finnish: Kansalliskirjaston sanoma- ja aikakauslehtikokoelman ruotsinkieliset n-grammit 1770-1940

Short name: SNC1

Persistent Identifier of this resource:

http://urn.fi/urn:nbn:fi:lb-2014091902

Access location: http://urn.fi/urn:nbn:fi:lb-2014091903

The corpus is available in Kielipankki - the Language Bank of Finland, download: https://korp.csc.fi/download/SNC1/

The National Library of Finland has digitized a large proportion of Finland’s Swedish newspapers, magazines, and periodicals published between 1770 and 1940. This resource contains sets of unigrams, bigrams and trigrams extracted from a corpus that has been compiled from the digitized newspapers by the University of Helsinki.

The resource consists of plain UTF-8 encoded text files, each containing a list of n-grams ordered by frequency from highest to lowest. Each line in a file consists of two or more fields separated by a whitespace character. The first field gives the absolute frequency of a unique n-gram, and the remaining fields contain the tokens (strings of non-whitespace characters) of the n-gram itself. Case is preserved: uppercase letters have not been converted into lowercase letters. Punctuation characters are treated as separate tokens except when they are part of an abbreviation ("etc.", "mm."). The n-grams have been computed across sentence boundaries for each decade (from the 1770s to the 1940s) as well as for the entire corpus, with unigrams, bigrams and trigrams in separate files.
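The line format described above can be read with a few lines of Python. The following is only a minimal sketch: the function name `parse_ngram_lines` and the sample lines are made up for illustration and are not taken from the corpus, and the code assumes that any line without a leading integer frequency followed by at least one token should be skipped.

```python
from typing import Iterable, Iterator, Tuple

NGram = Tuple[int, Tuple[str, ...]]


def parse_ngram_lines(lines: Iterable[str]) -> Iterator[NGram]:
    """Yield (frequency, tokens) pairs from lines of the form 'FREQ tok1 [tok2 [tok3]]'.

    Hypothetical helper, not part of the SNC1 distribution.
    """
    for line in lines:
        # Split on any run of whitespace; tokens themselves contain no whitespace.
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or incomplete line
        try:
            frequency = int(fields[0])
        except ValueError:
            continue  # first field is not an integer frequency
        # Case is kept as-is, matching the description of the files.
        yield frequency, tuple(fields[1:])


# Invented sample lines (NOT real corpus data): a unigram, a bigram and a trigram.
sample = [
    "1523 och",
    "87 i Helsingfors",
    "12 , och den",
    "",
]
parsed = list(parse_ngram_lines(sample))
for frequency, tokens in parsed:
    print(frequency, " ".join(tokens))
```

To read an actual file, the same function can be given an open file object, for example `open(path, encoding="utf-8")`, where `path` points to one of the extracted n-gram files; the file names inside the download are not listed here, so no specific name is assumed.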

Since the source material has been digitized by means of optical character recognition (OCR), the resource also contains erroneous word forms and non-word strings of characters. Furthermore, due to the long time span of the original corpus, the resource contains several lexical items and spelling variants that have since become obsolete in standard Swedish.

The resource will be updated in the future as improvements are made to the source material.

Referring to the Swedish N-gram Corpus

If you use material from the Swedish N-gram Corpus and want to quote it, you may want to use the following information:

Bibliographic references

The Swedish N-gram Corpus, version 1 (SNC1). 2014. Distributed by the University of Helsinki on behalf of the FIN-CLARIN Consortium.

URL: http://www.helsinki.fi/finclarin/snc1

Data from the SNC1

We request that citations from the Swedish N-gram Corpus include the corpus identifier and version number (a four-letter code). A suitable way of crediting the SNC1 would be:

"N-grams from the Swedish N-gram Corpus, version 1, (SNC1) and the frequencies derived from it were obtained under the CC BY 4.0 license."

Distribution

Availability

Available - Unrestricted Use

Licence

CC BY

Restrictions: Attribution

Download location: hidden

Distribution Access/Medium: Downloadable

Attribution Details: See Documentation section.

Licensors:

University of Helsinki

Distribution rights holders:

University of Helsinki

IPR Holder

University of Helsinki

Contact Person

User support FIN-CLARIN

textngram

Monolingual textngram corpus

Languages

Swedish

Linguality

Linguality type: Monolingual

Size

10,558 Mb

Character encoding

UTF-8

Modalities

Written Language

Time Coverage

1770-1949

Geographic coverage

Finland

NGram

Order: 3

Base item: Word

Metadata

Created: 09/15/2014

Last Updated: 09/15/2021

Metadata Language: English (en)

Revision: Link to resource group page added

Metadata Creator

Ute Dieckmann

Imre Bartis

Usage

Foreseen Use - Human Use

Use NLP Specific: Linguistic Research

Actual Use - Human Use

Use NLP Specific: Linguistic Research

Relation

Related Resource: The Swedish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version http://urn.fi/urn:nb...

Relation Type: IsDerivedFrom

Documentation

Resource group page: http://urn.fi/urn:nb...

How to cite: https://www.kielipan...

CHANGE LOG: 12.3.2020: Added common subdirectory SNC1 to zip files.
