Automatic Collecting of Text Data for Cantonese Language Modeling
Jiang CAO¹, Xiaojun WU¹, Yu Ting Yeung², Tan LEE², Thomas Fang Zheng¹*

S4-4

1. Center for Speech and Language Technologies, Division of Technical Innovation and Development, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University
2. Department of Electronic Engineering, The Chinese University of Hong Kong
*. Contact: Room 4-416 Information Science and Technology Building, Tsinghua University, Beijing, 100084, China; +86-10-6279-6393, [email protected]; http://cslt.riit.tsinghua.edu.cn/~fzheng

Abstract

It is hard to collect corpora for training good language models for many minority languages. Cantonese, one of the most popular Chinese dialects, is such a language: it lacks language materials for language model training. This is a major obstacle to Cantonese language processing.

Unlike many other languages, there are great differences between written and colloquial Cantonese. What's more, people in Hong Kong mix Cantonese and English when they talk, which is also a special characteristic of this language. Beyond these, the materials collected from different sources have different proportions of colloquial Cantonese sentences, which means that different sources should not be treated equally.

We developed a filter model built at the lexical and grammar levels. We trained this model using a development set and achieved a precision rate of 99.89% and a recall rate of 88.2% on the test set.

With this model, we found a method to define the credibility of the different material sources. It is an iterative process, and its result decides the proportion of sentences chosen from each source for model training.

1 Introduction

Language modeling has recently been used successfully for speech recognition, part-of-speech tagging, syntactic parsing, information retrieval, and so on (Song 1999). In recent years, Grammar-Based Language Models (GLMs) and Statistical Language Models (SLMs) have been the most widely used types of language models (Hockey 2005).

For large-vocabulary continuous speech recognition (LVCSR), the SLM, typically an N-gram model, plays an important role (Duchatcau 2005). To estimate the model parameters, a large text database in the target language is necessary. A lack of language materials is generally a serious problem in statistical language model training and may affect the performance of a language processing system.

For popular languages like English, Spanish and Chinese, there are many existing language databases available for language model training (e.g. ATIS (Ward 1990), CallHome (Ries 2000) and SogouT¹). It is also relatively easy to collect more text data for these languages. For languages or dialects that are mainly spoken, data availability is not trivial. In Cantonese, for example, there is very little speech data and even less text data available, simply because it is not an official written language (Fung 1999). The research progress of Cantonese language processing has been limited by the lack of text databases. Collecting more language materials is a straightforward and feasible breakthrough point for improving the performance of existing systems.

In this study, we first developed a filter model, built at the lexical and grammar levels, to decide whether a retrieved sentence is in Cantonese or not. This model was trained using a development set made up of 500 colloquial Cantonese sentences and 500 standard Chinese sentences. These sentences were all collected from web pages. The test set contained 2,251 colloquial Cantonese sentences and 7,172 written sentences. The proposed filter model achieved a precision rate of 99.89% and a recall rate of 88.2%, which was very encouraging. It must be pointed out that the precision rate is more important than the recall rate in our intended application of data collection, because undesirable sentences would degrade the performance of language models.

We designed an iterative method to assess the credibility of a source of text materials, e.g., a website. The credibility level of each source was updated in every iteration according to the proportion of colloquial Cantonese sentences contained in the documents from this source. Over the iterations, documents from different sources were treated differently. This method helped identify the more useful sources, which then received more attention, and thus improved the efficiency of data collection.

¹ http://www.sogou.com/labs/dl/t.html

2 Cantonese

2.1 Introduction to Cantonese

Cantonese or Yue (粤语) is one of the most popular Chinese dialects, and is a member of the Sino-Tibetan language family (Li 2006). It is also known as 广东话/廣東話 (Mandarin: Guǎngdōng huà, Cantonese: gwong dung waa). The population of Cantonese speakers is over 55 million. The areas with the highest concentration of speakers are Guangdong Province, some parts of Guangxi Province, Hong Kong and Macau (see the Introduction in Lau 1999). It is the de facto official spoken language in Hong Kong. Most Chinese people in Southeast Asia and North America also speak Cantonese. The name of this language, "Cantonese", is said to derive from Canton, an English name for both Guangdong Province and Guangzhou (Graham 2006). Cantonese speech is generally unintelligible to people who live in other provinces.

Cantonese is a monosyllabic and tonal language. Each Chinese character is pronounced as a single syllable in Cantonese (Chan 2005). There are about 5,500 commonly used characters in Cantonese, slightly fewer than in the standard Chinese spoken in Mainland China and Taiwan. Some of the characters are unique to Cantonese and are never used in standard Chinese. At the same time, there are also some characters used in standard Chinese but not in Cantonese. In this paper, we mainly focus on Cantonese as spoken in Hong Kong.

2.2 Written Cantonese and Written Chinese

Standard written Chinese is, in essence, in agreement with Mandarin in Taiwan. When standard written Chinese is read out with Cantonese pronunciations, the speech sounds strange and unnatural. It is very different from spoken Cantonese. If we transcribe spoken Cantonese into Chinese characters, the resulting text is referred to as written Cantonese.

There are many differences between Cantonese and standard Chinese. A standard Chinese speaker will have difficulty reading written Cantonese. Cantonese has some unique characters, like “嘅” and “係”, and unique words, such as “尋日”, “噚日”, “擒日” and “琴日”. There also exist some special grammatical forms and phrase expressions in Cantonese. Cantonese is a language in which code-switching is quite common (Chan 2005). In Hong Kong, people are used to embedding English words in a sentence. Although this is becoming popular in Mainland China in recent years, code-switching in Hong Kong Cantonese is considered an integral part and a special feature of contemporary Cantonese.

For language model training in Cantonese speech recognition, we need text materials that match the content of real-life spoken Cantonese. But in most Cantonese-speaking regions, including Hong Kong SAR, Cantonese is not an official written language (Zhang 1999). In Hong Kong, standard Chinese in traditional characters (繁体中文) is used in official communications. Most newspapers and formal publications are printed in standard Chinese. On the internet, which hosts a large amount of text materials, written Cantonese remains a minority.

3 The Cantonese Filter Model

As discussed earlier, there are quite a few differences between written Cantonese and standard Chinese. Our goal is to develop an effective method to distinguish written Cantonese from standard Chinese, so as to facilitate massive collection of Cantonese text materials from the internet.

3.1 Construction of the Filter Model

The basic function of the filter model is to determine whether a given sentence is in written Cantonese or not. This function can be implemented in two steps. The first step is to indicate whether the sentence is Chinese in a "broad" sense, i.e., written Cantonese, standard Chinese or another kind of Chinese text. This is done by checking the encoding method of the text. The second step is to indicate whether the sequence of Chinese characters is in Cantonese or not. Cantonese text contains a set of Cantonese-specific characters. It may also contain special grammar phrases and code-switching contents. For Cantonese language processing, methods of word segmentation and sentence parsing are not as well developed as for standard Chinese, and relevant research is rare. In our filter model, an obvious feature is used to detect Cantonese-specific content: a sentence that contains at least one of these characters is considered to be in Cantonese. This simple method was found to perform very well in our study.

In (Chan 2005), some representative Cantonese sentences are listed. We chose 500 sentences from the list and used them as the training data in our experiments. For standard Chinese, there is a "Common Chinese Character List" published by the government in 1988². We then defined the characters unique to Cantonese as those which appear in the training set but not in the Common Chinese Character List, taken, of course, in its traditional Chinese form. These characters are called positive characters. We also tried to define negative characters as those which appear only in Chinese but not in Cantonese. However, a simple experiment showed that almost all Chinese characters can be found in Cantonese.

... different training sets, different numbers of positive characters were found, and thus the precision and recall rates were different. All these experiments were done on the same test set described above and without the help of the negative keywords, as shown in Table 1.

Table 1 (column headings only):
Training Set Sentences    No. of Positive Characters    Precision Rate    Recall Rate
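The positive-character filter described in Section 3.1 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy sentences and the tiny stand-in for the Common Chinese Character List are all hypothetical, the CJK range check is a simplification, and the first step (the encoding check) is omitted.

```python
def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block (a simplification)."""
    return "\u4e00" <= ch <= "\u9fff"


def positive_characters(cantonese_sentences, common_chars):
    """Characters seen in the Cantonese training sentences but absent from the common list."""
    seen = {ch for s in cantonese_sentences for ch in s if is_cjk(ch)}
    return seen - set(common_chars)


def is_cantonese(sentence, positives):
    """A sentence is taken to be Cantonese if it contains at least one positive character."""
    return any(ch in positives for ch in sentence)


def precision_recall(predictions, labels):
    """Precision and recall for the 'Cantonese' class."""
    tp = sum(1 for p, l in zip(predictions, labels) if p and l)
    fp = sum(1 for p, l in zip(predictions, labels) if p and not l)
    fn = sum(1 for p, l in zip(predictions, labels) if not p and l)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (hypothetical): two Cantonese training sentences and a tiny
# stand-in for the Common Chinese Character List.
training = ["佢係我嘅朋友", "你食咗飯未"]
common = set("我的朋友你食飯未是他")
positives = positive_characters(training, common)

# Labelled toy test set: (sentence, is it really Cantonese?)
test_set = [("佢哋喺邊度", True), ("我是他的朋友", False), ("今日好熱", True)]
predictions = [is_cantonese(s, positives) for s, _ in test_set]
labels = [label for _, label in test_set]
precision, recall = precision_recall(predictions, labels)
print(sorted(positives), precision, recall)
```

In this toy run the third sentence is Cantonese but contains no positive character, so it is missed. This mirrors the trade-off noted in the Introduction: the rule favours precision over recall, which suits data collection because wrongly admitted sentences would harm the language model.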
