The Newspaper and Periodical OCR Corpus of the National Library of Finland (1771-1874)

Kansalliskirjaston sanoma- ja aikakauslehtikokoelman OCR-korpus (1771-1874)


Persistent Identifier of this resource:

Access location:

The corpus is available for download in Kielipankki, the Language Bank of Finland, as well as on the Taito server, directory name: /appl/kielipankki/Digilib-pub.

This corpus consists of the OCR results of the material published before 1875 in the corpus of publications digitized by the National Library of Finland. This part of the corpus is so old that any copyrights in it must have expired before 2015.

The full corpus, as FIN-CLARIN has it, is organized in eleven branches named arc01, ..., arc11. Each document is stored as a zip archive containing scanned image files in different resolutions, and the OCR results as XML documents. This distribution has the same structure but contains only the OCR results.
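
Since each document is a zip archive holding its OCR results as XML, the layout can be walked with a few lines of Python. The following is only a minimal sketch under stated assumptions: it assumes a distribution file has already been unpacked into a local directory, that the per-document archives end in ".zip", and that the OCR members end in ".xml". The function names list_ocr_xml and iter_ocr_documents are hypothetical and not part of the corpus or of any Kielipankki tooling.

```python
import zipfile
from pathlib import Path


def list_ocr_xml(archive_path):
    """Return the sorted names of the XML members in one zip archive."""
    with zipfile.ZipFile(archive_path) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.lower().endswith(".xml")  # assumption: OCR files use .xml
        )


def iter_ocr_documents(root):
    """Yield (archive path, member name, raw bytes) for every XML file
    found in any zip archive below the directory `root`."""
    # Assumption: document archives use the .zip extension.
    for archive_path in sorted(Path(root).rglob("*.zip")):
        with zipfile.ZipFile(archive_path) as archive:
            for name in sorted(archive.namelist()):
                if name.lower().endswith(".xml"):
                    yield archive_path, name, archive.read(name)
```

A loop such as "for path, name, data in iter_ocr_documents(unpacked_dir)" would then visit each OCR file once, where unpacked_dir stands for wherever a branch was extracted locally. The bytes are returned undecoded because the XML schema and encoding are not described here.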

Each of the distribution files, ..., contains the material extracted from one branch of the full corpus. Branch arc10 is currently missing from this distribution for technical reasons, and branch arc11 did not contain any relevant material.

The distribution file "" contains all 10 branches in one archive.

Change Log:
- Corrected time coverage typo in Metadata
- Changed shortname from Digilib-Pub to Digilib-Pub-1874-dl

The Newspaper and Periodical OCR Corpus of the National Library of Finland (1771-1874) is available for download in Kielipankki, as well as on the Taito server, in the directory /appl/kielipankki/Digilib-pub.

This corpus consists of the OCR results of those documents digitized by the National Library of Finland that were published before 1875. This part of the corpus is so old that any copyrights covering it must have expired before 2015.

