Word embeddings trained with word2vec from the Finnish Text Collection
word2vec-menetelmällä harjoitetut sanaupotukset Suomen kielen tekstikokoelmasta
ftc-wordvec
Persistent Identifier of this resource:
http://urn.fi/urn:nbn:fi:lb-2022041405
Access location:
This package contains word embeddings trained with word2vec from newspaper text in Kielipankki's Finnish Text Collection (FTC)
(http://urn.fi/urn:nbn:fi:lb-2016050206). The following files were used:
aamulehti.tar.gz
demari.tar.gz
hameensanomat.tar.gz
hyvinkaansanomat.tar.gz
iltalehti.tar.gz
kangasalansanomat.tar.gz
karjalainen.tar.gz
kauppalehti.tar.gz
keskisuomalainen.tar.gz
optio.tar.gz
suomenkuvalehti.tar.gz
tekniikanmaailma.tar.gz
turunsanomat.tar.gz
The embeddings were trained on the lemmas from the text annotations rather than on surface forms. Inflected forms such as "koiralta" therefore do not appear in the vocabulary; they are all represented by the base form "koira".
All lemmas were also converted to lowercase, so names such as "Niinistö" are represented as "niinistö".
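Because the vocabulary holds only lowercased lemmas, a query word has to be normalized the same way before lookup. The following is a minimal sketch of the lowercasing step only. The function name lookup_key and the toy vocabulary are hypothetical, and lemmatization (for example "koiralta" to "koira") is assumed to be done beforehand with a separate tool that is not shown here.

```python
def lookup_key(lemma: str) -> str:
    """Return the form under which a lemma would be stored: lowercase.

    Assumes the input is already a lemma; this does NOT lemmatize.
    """
    return lemma.lower()


# Hypothetical toy vocabulary standing in for the real embedding file.
toy_vocabulary = {"koira", "niinistö"}

for query in ["Koira", "Niinistö", "koiralta"]:
    key = lookup_key(query)
    status = "found" if key in toy_vocabulary else "missing (lemmatize first?)"
    print(f"{query!r} -> {key!r}: {status}")
```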
The embedding file contains 247 305 entries. The dimension of the vector space is 100.
The embedding file is in the simple, easily parsed text format produced by word2vec. The first line of the file gives the vocabulary size and the dimension. Each subsequent line begins with a vocabulary item, followed by a space, followed by 100 floating point numbers (one per dimension, in textual form), each followed by a space. For efficient processing, consider converting the file to a binary representation.
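The format described above can be read with a few lines of standard-library Python. This is a sketch under stated assumptions, not the tooling used to build the resource. The function name read_word2vec_text, the three-dimensional sample data, and the choice of 32-bit floats for the binary form are all illustrative. The dimension is taken from the header line rather than hard-coded.

```python
import io
from array import array


def read_word2vec_text(lines):
    """Parse word2vec text format from an iterable of lines.

    Returns (vocab, vectors, dim): vocab is a list of items and vectors is
    a flat array of 32-bit floats holding dim values per item, in order.
    """
    it = iter(lines)
    header = next(it).split()
    vocab_size, dim = int(header[0]), int(header[1])

    vocab = []
    vectors = array("f")
    for line in it:
        line = line.rstrip()  # drops the trailing space and the newline
        if not line:
            continue
        # Split from the right so that exactly dim numbers are taken.
        parts = line.rsplit(" ", dim)
        if len(parts) != dim + 1:
            raise ValueError(f"expected {dim} numbers: {line[:40]!r}")
        vocab.append(parts[0])
        vectors.extend(float(x) for x in parts[1:])

    if len(vocab) != vocab_size:
        raise ValueError(f"header says {vocab_size} items, read {len(vocab)}")
    return vocab, vectors, dim


# Hypothetical miniature file with 2 entries of dimension 3.
sample = "2 3\nkoira 0.5 -1.0 0.25 \nniinistö 1.0 2.0 3.0 \n"
vocab, vectors, dim = read_word2vec_text(io.StringIO(sample))

# Vector of the i-th item is the slice [i * dim, (i + 1) * dim).
index = vocab.index("niinistö")
print(list(vectors[index * dim:(index + 1) * dim]))

# One possible binary representation: the raw float bytes.
blob = vectors.tobytes()
restored = array("f")
restored.frombytes(blob)
```

For the real file, the same function could be called on an open file object, for example one opened with encoding="utf-8". That encoding is an assumption and should be checked against the downloaded package. The raw bytes could then be written to disk together with the vocabulary list so that later runs skip the text parsing.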