Developing a Broadband Automatic Speech Recognition System for Afrikaans
INTERSPEECH 2011

Febe de Wet (1, 2), Alta de Waal (1) & Gerhard B van Huyssteen (3)
(1) Human Language Technology Competency Area, CSIR Meraka Institute, Pretoria, South Africa
(2) Department of Electrical and Electronic Engineering, Stellenbosch University, South Africa
(3) Centre for Text Technology (CTexT), North-West University, Potchefstroom, South Africa
fdwet@csir.co.za, adewaal@csir.co.za, Gerhard.VanHuyssteen@nwu.ac.za

Abstract
Afrikaans is one of the eleven official languages of South Africa. It is classified as an under-resourced language. No annotated broadband speech corpora currently exist for Afrikaans. This article reports on the development of speech resources for Afrikaans, specifically a broadband speech corpus and an extended pronunciation dictionary. Baseline results for an ASR system that was built using these resources are also presented. In addition, the article suggests different strategies to exploit the close relationship between Afrikaans and Dutch for the purposes of technology development.

Index Terms: Afrikaans, under-resourced languages, automatic speech recognition, speech resources

... the South African languages, followed by the local vernacular of South African English [3]. This position can be ascribed to various factors, including the fact that more linguistic expertise and foundational work are available for Afrikaans and South African English than for the other languages, the availability of text (e.g. newspapers) and speech sources, the fact that Afrikaans is still somewhat used in the business domain and in commercial environments (thereby increasing supply-and-demand for Afrikaans-based technologies), and also the fact that Afrikaans could leverage on HLT developments for Dutch (as a closely related language).

Notwithstanding its profile as the most technologically developed language in South Africa, Afrikaans can still be considered an under-resourced language when compared to languages such as English, Spanish, Dutch or Japanese.