This software package provides finnish-postag, a part-of-speech and morphology tagger for Finnish, and finnish-nertag, a named entity recogniser for Finnish.
Both tools take running text from standard input and produce tabular output (one token per line) to standard output. See --help messages for more details.
An installer is provided in the form of a Makefile. More information can be found in the README file in the download folder.
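Because both tools read running text from standard input and write one token per line to standard output, they can be driven from a script. The following is a minimal Python sketch, not part of the package. The helper names run_tagger and parse_tabular are hypothetical, and the parser assumes that columns are tab-separated and that blank lines carry no token; check the --help message of each tool for the actual column layout.

```python
import subprocess
from typing import List, Sequence


def run_tagger(text: str, tool: str = "finnish-postag",
               extra_args: Sequence[str] = ()) -> str:
    """Pipe running text to the tool's standard input and return its standard output.

    Assumes the tool is installed and can be found on the PATH.
    """
    result = subprocess.run(
        [tool, *extra_args],
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if the tool exits with an error
    )
    return result.stdout


def parse_tabular(output: str) -> List[List[str]]:
    """Split one-token-per-line output into rows of columns.

    Assumption: columns are separated by tabs and blank lines carry no token.
    """
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        rows.append(line.split("\t"))
    return rows
```

With the package installed, parse_tabular(run_tagger("Helsinki on Suomen pääkaupunki.")) would return one list of columns per token. Passing tool="finnish-nertag" runs the named entity recogniser in the same way.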
1.1 Initial release
1.2 Intermediate version (not published at the Language Bank)
- finnish-nertag and finnish-postag tokenize identically; tokenization no longer allows multi-word tokens
- fixed tokenization-related bugs
- added new version of OMorFi
- fixed several glaring FinnPOS-related bugs and improved POS tagging and lemmatization
- reduced the size of omorfi_tokenize.pmatch and ftb.omorfi.model
- implemented nested annotations
- added options --no-tokenize, --show-analyses and --show-nested
- more reliable and extensive lemma normalization with normalize-lemmas.py
- Capture() memory is wiped at XML closing tags such as </text>, </body> etc.
- FiNER rules:
  - added subcategory EnamexPrsAnm (animals)
  - restored EnamexPrsTit (titles)
  - restored and expanded numerical expressions (NumexMsrXxx, NumexMsrCur)
  - rewrote and expanded EnamexProXxx rules to include foods and cultivars
  - TimexTmeDat: years that are divisible by 10 are now recognized more reliably
  - greatly improved recall and precision
- fixed the all-caps bug: consecutive all-caps input strings no longer cause finnish-nertag to slow down or freeze
1.3.1 Maintenance update
- now uses a natively compiled hfst-pmatch if one is found in PATH (the precompiled version can be slow)
- added --no-tokenize option to finnish-postag
1.3.2 Bugfix update
- fixed a bug in finnish-tokenize