Helsinki Corpus of Swahili 2.0 (HCS 2.0) Not Annotated Version
Helsinki Swahili -korpus 2.0 (HCS 2.0), ei annotoitu versio
The Helsinki Corpus of Swahili 2.0 Not Annotated Version consists of plain text without linguistic codes and contains about 25 million words. The corpus is available in Kielipankki, the Language Bank of Finland. Download location: http://urn.fi/urn:nbn:fi:lb-2016042801
Preparation of the material
Most of the corpus material was retrieved from the Web. This method was used increasingly once texts became available on the Web. Only texts from news media and open government pages were retrieved. Some types of texts, such as books, were scanned and proofread. Part of the oldest news material, from the 1980s before scanners were available, was typed manually.
The corpus material has gone through a series of formatting and correction routines.
1. Converting the text into ASCII format, as required by the tagger. Web texts use a wide variety of codes for representing diacritics. These had to be standardized.
2. Proofreading and correcting the text with a speller.
3. Analyzing the proofread text to find remaining typos and possible new words.
4. Constructing a correction program that automatically corrects those typos that can be safely corrected. More than 8000 such mistake types were identified.
5. Adding new words found in the corpus to the parser.
6. Texts were corrected using the constructed correction program.
7. Metadata in text files were formalized.
8. Texts were converted into sentence-per-line format.
9. Text within each file was randomly shuffled to mix the sentence order.
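The steps above can be illustrated with a minimal Python sketch. It is not the program used to build the corpus: the function names, the one-entry correction table, and the naive punctuation-based sentence splitter are all assumptions made for illustration. It covers only ASCII conversion (step 1), table-driven typo correction (step 6), sentence-per-line splitting (step 8), and shuffling (step 9).

```python
import random
import re
import unicodedata

# Hypothetical correction table. The real program is described as covering
# more than 8000 mistake types; this single entry is only an illustration.
CORRECTIONS = {"kwasababu": "kwa sababu"}


def to_ascii(text):
    """Decompose accented characters and drop everything outside ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def apply_corrections(text, table=CORRECTIONS):
    """Replace whole-word typos that appear in the correction table."""
    if not table:
        return text
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, table)) + r")\b")
    return pattern.sub(lambda match: table[match.group(1)], text)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def shuffle_sentences(sentences, seed=None):
    """Return a shuffled copy of the sentence list; the input is unchanged."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def prepare(text, seed=None):
    """Run ASCII conversion, correction, splitting and shuffling in order."""
    corrected = apply_corrections(to_ascii(text))
    return shuffle_sentences(split_sentences(corrected), seed)


if __name__ == "__main__":
    sample = "Nilienda kwasababu nilitaka. Habari za leo? Karibu sana!"
    # Print one sentence per line, in shuffled order.
    print("\n".join(prepare(sample, seed=1)))
```

Because shuffling happens within each file, the sentences themselves are preserved but the original discourse order cannot be recovered from the output.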
The result of these routines comprises the Helsinki Corpus of Swahili 2.0 Not Annotated Version.
Metadata were added to each file.
Structure of the corpus
HCS 2.0 contains the following types of material:
The old material dates from before 2003. Much of it is also in the Helsinki Corpus of Swahili 1.0. The main difference is that the earlier corpus included only sections of books, whereas the new corpus includes whole texts. In addition, text sections in the old corpus are in their original order, whereas in the new corpus the sentences are randomly shuffled.
Most of the new material consists of news texts from 2004-2015. The section ‘Bunge’ contains Hansards of the Tanzanian Parliament from the years 2004, 2005 and 2006. Metadata at the beginning of each file give more information. The file names also indicate the contents of the files.