Psycholinguistic Descriptives – META-SHARE

Last update: 2021-08-17

Psycholinguistic Descriptives

Resource name (Finnish): Psykolingvistiset tunnusluvut

Short name: psychlingdesc

Persistent Identifier of this resource:

http://urn.fi/urn:nbn:fi:lb-2018081601

Access location: http://urn.fi/urn:nbn:fi:lb-2018081602

The material is available at the Language Bank of Finland (Kielipankki) download service, access location http://urn.fi/urn:nbn:fi:lb-2018081602.

This material comprises a dataset and a query tool for obtaining commonly used psycholinguistic descriptives for Finnish words. The dataset is based on six large corpora drawn from sources such as magazines, newspapers, movie and TV series subtitles, encyclopedia articles and Internet discussions.
The material includes word surface form frequencies, lemma frequencies, syllable frequencies and letter n-gram frequencies. In addition, the query tool can be used to obtain descriptives such as orthographic neighbors for lists of words. More information on the datasets and the query tool can be found in the readme file.
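To illustrate what two of these descriptives mean, the following is a minimal Python sketch, not the query tool's own implementation. It assumes the common definitions: letter n-gram frequencies are counted over words weighted by word frequency, and an orthographic neighbor is a word of the same length that differs in exactly one letter. The toy frequency table and the function names are hypothetical and are not taken from the dataset or the readme.

```python
from collections import Counter

# Hypothetical toy frequency table (surface form -> count); not real dataset values.
word_freq = {"kala": 120, "kana": 95, "sana": 300, "sata": 80, "talo": 210}


def letter_ngram_freqs(freqs, n):
    """Count letter n-grams, weighting each occurrence by the word's frequency."""
    counts = Counter()
    for word, freq in freqs.items():
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += freq
    return counts


def orthographic_neighbors(target, vocabulary):
    """Return words of the same length that differ from target in exactly one letter."""
    return sorted(
        w for w in vocabulary
        if len(w) == len(target)
        and sum(a != b for a, b in zip(w, target)) == 1
    )


bigrams = letter_ngram_freqs(word_freq, 2)
neighbors = orthographic_neighbors("kana", word_freq)
print(bigrams.most_common(3))
print(neighbors)
```

With this toy table, the neighbors of "kana" are "kala" and "sana". The definitions actually used by the query tool may differ, so the readme file remains the authoritative source.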

Descriptives:
Word lemma and surface form tokens: 2,500 million
Unique lemmas: 0.7 million
Unique surface forms: 1.5 million

The corpora used:
The Suomi24 Corpus: http://urn.fi/urn:nbn:fi:lb-2017021630

Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version (KLK, only from 1980 onwards): http://urn.fi/urn:nbn:fi:lb-2016050302

Finnish Magazines and Newspapers from the 1990s and 2000s, Version 2: http://urn.fi/urn:nbn:fi:lb-2017091901

Finnish Wikipedia 2017: http://urn.fi/urn:nbn:fi:lb-2018060401

Finnish Opensubtitles 2017: http://urn.fi/urn:nbn:fi:lb-2018060403

Unpublished corpus source:

Comments posted to the Finnish-language Reddit forum https://old.reddit.com/r/Suomi/ between January 2012 and December 2017

Change log:
This description was replaced on December 12, 2018

Description (translated from Finnish):

This material is available at the download service of the Language Bank of Finland (Kielipankki), access location http://urn.fi/urn:nbn:fi:lb-2018081602.

The material comprises word frequencies collected from six different text corpora and a simple query tool for computing commonly used psycholinguistic descriptives for words. The word frequency tables have been filtered so that they correspond better to the actual frequencies of the words. More details on the filtering and the query tool can be found in the readme file.

The lemma (base form) and surface form datasets together cover about 2,500 million tokens/lemmas, 1.5 million unique words and 0.7 million unique lemmas.

The corpora on which the word frequency tables are based:

The Suomi 24 Corpus: http://urn.fi/urn:nbn:fi:lb-2017021630

The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version (KLK, only from 1980 onwards): http://urn.fi/urn:nbn:fi:lb-2016050302

Corpus of Finnish Magazines and Newspapers from the 1990s and 2000s, Version 2: http://urn.fi/urn:nbn:fi:lb-2017091901

Finnish Wikipedia 2017: http://urn.fi/urn:nbn:fi:lb-2018060401

Finnish Opensubtitles 2017: http://urn.fi/urn:nbn:fi:lb-2018060403

In addition, data for compiling the word frequency tables was retrieved from the following website:

Comments posted to the Finnish-language Reddit forum https://old.reddit.com/r/Suomi/ (January 2012 – December 2017)

Distribution

Availability

Available - Unrestricted Use

Licence

CC - BY

Restrictions: Attribution

Licensors:

Tatu Huovilainen

Distribution rights holders:

University of Helsinki

IPR Holder

Tatu Huovilainen

Contact Person

User support at CSC - IT Center for Science Ltd. The Language Bank of Finland

textngram

Monolingual textngram corpus

Languages

Finnish

Linguality

Linguality type: Monolingual

Size

2,500,000,000 Words

NGram

Order: 3

Base item: Word

Resource Creation

Resource Creator

Tatu Huovilainen

Metadata

Created: 2018-08-16

Last Updated: 2021-08-17

Metadata Language: English (en)

Revision: Link to resource group page added

Metadata Creator

Hanna Westerlund

Relation

Related Resource: The Suomi 24 Corpus (2016H2) http://urn.fi/urn:nb...

Relation Type: IsDerivedFrom

Related Resource: The Suomi 24 Sentences Corpus (2016H2) http://urn.fi/urn:nb...

Relation Type: IsDerivedFrom

Related Resource: The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version http://urn.fi/urn:nb...

Relation Type: IsDerivedFrom

Related Resource: Corpus of Finnish Magazines and Newspapers from the 1990s and 2000s, Version 2 http://urn.fi/urn:nb...

Relation Type: IsDerivedFrom

Related Resource: Finnish Wikipedia 2017 http://urn.fi/urn:nb...

Relation Type: IsDerivedFrom

Related Resource: Finnish Opensubtitles 2017 http://urn.fi/urn:nb...

Relation Type: IsDerivedFrom

Related Resource: The Reddit/r/Suomi https://old.reddit.c...

Relation Type: IsDerivedFrom

Documentation

How to cite (in English): https://www.kielipan...

CHANGE LOG: 30.1.2020 Availability changed from Restricted to Unrestricted; 21.6.2020 restriction Attribution added.

How to cite (in Finnish): https://www.kielipan...

Document Type: Other

License: Creative Commons Attribution 4.0 International (CC BY 4.0), https://creativecomm...

Document Language: English

Resource group page: http://urn.fi/urn:nb...
