Helsinki Corpus of Swahili 2.0 (HCS 2.0) Not Annotated Version


Last update: 2021-10-11

Helsinki Swahili -korpus 2.0 (HCS 2.0), ei annotoitu versio

hcs-na-v2

Persistent Identifier of this resource:

http://urn.fi/urn:nbn:fi:lb-2016011302

Access location: http://urn.fi/urn:nbn:fi:lb-2016042801

The Helsinki Corpus of Swahili 2.0 Not Annotated Version consists of plain text without linguistic codes and contains about 25 million words. The corpus is available for download from Kielipankki, the Language Bank of Finland: http://urn.fi/urn:nbn:fi:lb-2016042801

Preparation of the material
Most of the corpus material was retrieved from the Web. This method was used increasingly once texts became available on the Web. Only texts from news media and open government pages were retrieved. Some types of texts, such as books, were scanned and proofread. Part of the oldest news material, from the 1980s before scanners were available, was typed manually.
The corpus material has gone through a series of formatting and correction routines.
1. Converting the text into ASCII format, as required by the tagger. Web texts use a wide variety of codes to represent diacritics, and these had to be normalized.
2. Proofreading and correcting the text with a speller.
3. Analyzing the proofread text to find remaining typos and possible new words.
4. Constructing a correction program that automatically corrects those typos that can be safely corrected. More than 8000 such mistake types were identified.
5. New words found in the corpus were added to the parser.
6. Texts were corrected using the constructed correction program.
7. Metadata in text files were formalized.
8. Texts were converted into sentence-per-line format.
9. Text within each file was randomly shuffled to mix the sentence order.

The result of these routines comprises the Helsinki Corpus of Swahili 2.0 Not Annotated Version.

Metadata were added to each file.

Structure of the corpus
HCS 2.0 contains the following types of material:
Old material
1. Books
2. News
New material
1. Bunge
2. News

The old material dates from before 2003. Much of it is also in the Helsinki Corpus of Swahili 1.0. The main difference is that the earlier corpus included only sections of books, whereas the new corpus includes whole texts. The other difference is that text sections in the old corpus are in their original order, whereas sentences in the new corpus are randomly shuffled.

Most of the new material consists of news texts from 2004-2015. The section ‘Bunge’ contains Hansards of the Tanzanian Parliament from the years 2004, 2005 and 2006. Metadata at the beginning of each file give more information. The file names also hint at the contents of the files.

Distribution

Availability

Available - Unrestricted Use

Licence

CC - BY

Restrictions: Attribution

Download location: hidden

Distribution Access/Medium: Downloadable

Attribution Details: See Documentation section.

Licensors:

Hurskainen Arvi

Distribution rights holders:

University of Helsinki

IPR Holder

Hurskainen Arvi

Contact Person

User support at CSC - IT Center for Science Ltd. The Language Bank of Finland

text

Monolingual text corpus

Languages

Swahili

Linguality

Linguality type: Monolingual

Size

25,000,000 Tokens

Modalities

Written Language

Metadata

Created: 01/13/2016

Last Updated: 10/11/2021

Metadata Language: English (en)

Revision: relation to former roof page removed, relation to annotated version added

Metadata Creator

Ute Dieckmann

Imre Bartis

Relation

Related Resource: Helsinki Corpus of Swahili 2.0 (HCS 2.0) Annotated Version http://urn.fi/urn:nb...

Relation Type: IsOriginalFormOf

Documentation

Resource group page: http://urn.fi/urn:nb...

How to cite: www.kielipankki.fi/v...
