Ajatella, miettiä, pohtia, harkita corpus



The corpus is available for download in Kielipankki - the Language Bank of Finland: You should be able to download it simply by logging in with your university credentials. If you cannot log in even though you are affiliated with a university, see instructions at

A copy of the uncompressed corpus is also available at, instructions on how to gain access rights:

The amph micro-corpus consists of a total of 3,404 occurrences of the four most common Finnish THINK lexemes, ajatella, miettiä, pohtia, and harkita 'think, reflect, ponder, consider'.

These occurrences were extracted from a corpus consisting of two months’ worth (January–February 1995) of written text from Helsingin Sanomat (1995), Finland’s major daily newspaper, and six months’ worth (October 2002 – April 2003) of written discussion in the SFNET (2002-2003) Internet discussion forum, specifically the newsgroups on (personal) relationships (sfnet.keskustelu.ihmissuhteet) and politics (sfnet.keskustelu.politiikka). The newspaper corpus comprised 3,304,512 words of body text, excluding headers and captions (as well as punctuation tokens), and included 1,750 instances of the studied THINK verbs. The Internet corpus comprised 1,174,693 words of body text, excluding quotes of previous postings as well as punctuation tokens, and included 1,654 instances of the studied THINK verbs. The overall frequencies of the individual THINK lexemes in the corpus were 1,492 for ajatella, 812 for miettiä, 713 for pohtia, and 387 for harkita.
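The counts reported above can be cross-checked and put on a common scale with a few lines of arithmetic. The following is only a minimal sketch in Python: the variable and function names are hypothetical and are not part of the scripts or R functions distributed with the corpus. It checks that the per-verb and per-source counts add up to the same total, and computes the rate of THINK verbs per million words in each subcorpus.

```python
# Figures copied from the corpus description; all names here are
# illustrative assumptions, not identifiers used in the amph distribution.
verb_counts = {"ajatella": 1492, "miettiä": 812, "pohtia": 713, "harkita": 387}
subcorpora = {
    "newspaper": {"words": 3_304_512, "think_verbs": 1750},
    "internet": {"words": 1_174_693, "think_verbs": 1654},
}


def per_million(hits: int, words: int) -> float:
    """Normalize a raw count to occurrences per million words."""
    return hits / words * 1_000_000


# Totals computed two ways: by lexeme and by source.
total_by_verb = sum(verb_counts.values())
total_by_source = sum(s["think_verbs"] for s in subcorpora.values())

# Relative frequency of THINK verbs in each subcorpus.
rates = {
    name: per_million(s["think_verbs"], s["words"])
    for name, s in subcorpora.items()
}

# Share of each lexeme among all THINK-verb occurrences.
shares = {verb: n / total_by_verb for verb, n in verb_counts.items()}

if __name__ == "__main__":
    print("total by verb:", total_by_verb, "| total by source:", total_by_source)
    for name, rate in rates.items():
        print(f"{name}: {rate:.1f} THINK verbs per million words")
    for verb, share in shares.items():
        print(f"{verb}: {share:.1%} of all occurrences")
```

Normalizing per million words makes the two sources comparable despite their different sizes: the Internet subcorpus is roughly a third the size of the newspaper subcorpus but contributes nearly as many occurrences.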

The corpus contents were first automatically analyzed, both syntactically and morphologically, using a computational implementation of Functional Dependency Grammar (Tapanainen and Järvinen 1997; Järvinen and Tapanainen 1997) for Finnish, namely the FI-FDG parser (Connexor 2007). After this, all instances of the studied THINK lexemes, together with their syntactic arguments, were manually validated, corrected where necessary, and subsequently supplemented with semantic classifications. In addition, some extra-linguistic features (newspaper section or specific newsgroup, author ID when available, unique document index) were incorporated when they could be identified and extracted from the original corpora.

For each occurrence of the four selected THINK verbs in the original research corpora, the amph micro-corpus contains all relevant contextual features within the immediate sentential context, including the verb itself, analyzed at the aforementioned morphological, syntactic, and semantic levels, as well as all pertinent extra-linguistic features. In addition, the amph micro-corpus includes scripts for processing this data, R functions for its statistical analysis, and a comprehensive set of the ensuing results as R-format data tables.

For a more detailed description of the corpus see

