Child Language Corpus: Data Collection and Annotation

Jane S. Tsay
Institute of Linguistics, Chung Cheng University
Min-Hsiung, Chia-Yi 621, Taiwan

Abstract

Taiwan Child Language Corpus contains scripts transcribed from about 330 hours of recordings of fourteen young children from Chinese speaking families in Taiwan. The format of the corpus adopts the Child Language Data Exchange System (CHILDES). The size of the corpus is about 1.6 million words. In this paper, we describe data collection, transcription, word segmentation, and part-of-speech annotation of this corpus. Applications of the corpus are also discussed.

1 Data Collection

Taiwan Child Language Corpus (TAICORP) is a corpus of text files transcribed from child speech recorded between October 1997 and May 2000. The target language is Southern Min Chinese spoken in Taiwan.

1.1 Children

All fourteen children who participated were from Taiwanese-speaking families in Min-Hsiung Village, Chiayi County, Taiwan.

There were nine boys and five girls, aged from one year and two months to three years and eleven months at the beginning of the project. More than half of the children were recorded over more than two years.

1.2 Recordings

The recordings were made through regular home visits. Spontaneous speech of these children at play was recorded using Mini Disc recorders. The interval between sessions was about two weeks. There were 431 recording sessions in total, each 40 to 60 minutes long, totaling about 330 hours.

1.3 Transcription

Each recording session was transcribed into a separate text file, using Chinese orthography. For words that do not have a conventionalized written form, the Taiwan Southern Min romanization system (see Section 5.1) was used.

About half of the sessions (from children under two and a half years old) also have phonetic transcription in Unicode IPA (International Phonetic Alphabet).

The three primary transcribers, who were also the investigators who did the recordings, were well-trained linguists. All recordings were first transcribed by the investigator of the specific session and then checked by the other two transcribers.

2 Text files in CHILDES format

TAICORP adopts the format of CHILDES (Child Language Data Exchange System), originally set up by Elizabeth Bates, Brian MacWhinney, and Catherine Snow, to transcribe and code the recordings of child speech into machine-readable text (MacWhinney & Snow 1985, MacWhinney 1995).

The main components of CHILDES format are headers and tiers.

2.1 Headers

Obligatory headers are necessary for every file. They mark the beginning, the end and the participants of the file.

Constant headers mark the name of the file and the background information of the children.

Changeable headers contain information that can change within the file, such as the recording date, duration, coders and so on.

These headers begin with @, for example:

Obligatory headers:
@Begin
@End
@Participants

Constant headers:
@Age of XXX:
@Birth of XXX:
@Coder:
@Educ of XXX:
@Filename:
@ID:
@Language:
@Language of XXX:
@SES of XXX: social and economic status of a specific speaker
@Sex of XXX:
@Warning: the defects of the file

Changeable headers:
@Activities:
@Comment:
@Date:
@Location:
@New Episode:
@Room Layout:
@Situation:
@Tape Location:
@Time Duration:
@Time Start:

2.2 Tiers

The content of a file is presented in tiers, including main tiers and dependent tiers. A main tier, indicated by *, contains the utterance of the speaker.

Main tiers

The main tiers used in TAICORP include the following:

*INV: the utterance of the investigator
*CHI: the utterance of the target child
*MOT: the utterance of mother
*FAT: the utterance of father
*SIS: the utterance of sister
*BRO: the utterance of brother
*GRM: the utterance of grandmother
*GRF: the utterance of grandfather
*OTH: the utterance of other people

The main tier is the most important tier because it is where the utterances are listed. The utterances in the main tier were transcribed in the romanization (pinyin) system of Taiwan Southern Min (to be explained and illustrated in Section 5).

Dependent Tiers

Additional information is given in dependent tiers, indicated by %, following the main tier. Dependent tiers can be changed according to the design and goals of each corpus.

The dependent tiers used in TAICORP include the following:

%ort: transcription in standard orthography
%cod: part-of-speech coding
%pho: phonetic transcription in IPA
%ton: tone value on a 5-point scale

For adults' speech, only %ort and %cod tiers are used. For younger children's speech, %pho and %ton tiers are also used. The following text is an example from TAICORP. ([m…] = speech in Mandarin; SHI = 是 "be")

@Begin
@Participants: CHI Lin Target_Child, INV Rose Investigator, MOT Mother, OTH Great Grandmother
@Age of CHI: 2;1.22
@Birth of CHI: 28-AUG-1995
@Sex of CHI: Male
@Coder: Rose, Kay, Joyce
@Language: Taiwanese
@Date: 20-OCT-1997
@Tape Location: Lin D1-1-56
@Comment: Time Duration 37 minutes
@Location: Chiayi, Taiwan
@Transcriber: Rose
@Comment: Track number is D1-1
*INV: bo2lin2@s [:=m], li2 tha5tu2a2 khi3 to2?
%ort: [m 柏林], 你 頭拄仔 去 陀?
%cod: Nb Nh Nd VCL Ncd
*CHI: hia1/hin1.
%ort: 遐 1.
%cod: Ncd
%pho: h i a
%ton: 55
*INV: hia1 si7 to2ui7?
%ort: 遐 是 陀位?
%cod: Ncd SHI Ncd
*CHI: hm0.
%ort: hm0.
%cod: I
%pho: ??
%ton: ??
*MOT: li2 kin1a2 ciah8 bi2ko1 si7 bo0?
%ort: 你 今仔1 食 米糕 是 無3?
%cod: Nh Nd VC Na SHI T
@End

3 Statistics of the corpus

The corpus size is about 1.6 million words (more than 2 million morphemes/Chinese characters). The numbers of utterances (lines) and words and the mean length of utterance (MLU) are listed in Table 1.

            Lines      Words        MLU
Children    161,253    434,557      2.695
Adults      336,173    1,211,946    3.605
Total       497,426    1,646,503    3.150

Table 1 Statistics of the corpus

It might be worth mentioning that the MLU of adults in this corpus is relatively short. This could be attributed to the nature of this corpus as being child-directed speech.

4 Part-of-speech annotation

Southern Min and Mandarin are both Sinitic languages. They are very similar in their morphology and syntactic structures. Therefore, we adopted the part-of-speech coding system of the Sinica Corpus, Academia Sinica, Taiwan (see various CKIP technical reports). However, among the 115 categories used in the Sinica Corpus (CKIP 1993), only 46 codes were used in TAICORP. In other words, categorization in TAICORP is broader. These codes are listed in Table 2.

Table 2 Part-of-Speech Tagset in TAICORP

Coding   Part-of-speech
A        non-predicative adjective
Caa      coordinate conjunction
Cab      listing conjunction
Cba      conjunction occurring at the end of a sentence
Cbb      following a subject
Da       possibly preceding a noun
Dfa      preceding VH through VL
Dfb      following adverb
Di       post-verbal
Dk       sentence initial
D        adverbial
Na       common noun
Nb       proper noun
Nc       location noun
Ncd      localizer
Nd       time noun
Neu      numeral determiner
Nes      specific determiner
Nep      anaphoric determiner
Neqa     classifier determiner
Neqb     postposed classifier determiner
Nf       classifier
Ng       postposition
Nh       pronoun
I        interjection
P        preposition
T        particle
VA       active intransitive verb
VAC
VB       active pseudo-transitive verb
VC       active transitive verb
VCL      transitive verb taking a locative argument
VD       ditransitive verb
VE       active transitive verb with sentential object
VF       active transitive verb with VP object
VG       classificatory verb
VH       stative intransitive verb
VHC      stative causative verb
VI       stative pseudo-transitive verb
VJ       stative transitive verb
VK       stative transitive verb with sentential object
VL       stative transitive verb with VP object
V_2
DE       *special tag for the word "的"
SHI      special tag for the word "是"
FW       foreign words
*Di/T    *marker following pseudo-transitive active verb
*CIT     *special tag for the word "得 2"

5 Orthography-related issues for a speech-based corpus of Southern Min

5.1 Romanization system

As mentioned in Section 2, utterances in the main tier are transcribed in romanization (Southern Min pinyin). The romanization system used in TAICORP is the Taiwan Southern Min Phonetic Alphabet (also known as Taiwan Language Phonetic Alphabet, TLPA, originally proposed by the Taiwan Language Society in 1991), announced officially by the Ministry of Education of Taiwan in 1998.

5.2 Standard orthography: Chinese characters

Chinese characters are used in the dependent tier %ort as the standard orthography. This is a reasonable choice because most Southern Min words are cognates of Mandarin words. However, because the orthography of Southern Min is not as conventionalized as that of Mandarin, quite a few words in Southern Min do not have a consistent written form. Some of them do not even have obvious corresponding characters.

In order to ensure consistency in the corpus, Southern Min dictionaries were used. These dictionaries are listed after the References.

This issue is particularly important for a corpus based on spontaneous speech, rather than written text. For example, the written forms of the following common Southern Min words have to be checked in the dictionaries because the words do not occur in Mandarin:

蠓罩 /bang2tah4/ "mosquito net"
挽 /ban2/ "to pick"
奇巧 /ki5kha2/ "unusual"

If a written form cannot be found in one of the major Southern Min dictionaries, romanization is used.

Romanization is also used if the written form of a word is found in the dictionary but its frequency is so low that it cannot be found in the computer coding system.

For homonyms, a number is added after the character to indicate different lemmas. For example:

蓋 1 /kah4/ "to cover with a blanket"
蓋 2 /kham3/ "to cover"
蓋 3 /kua3/ "a cover"

6 The Autosegmentation program and the Spell-checker

In order to speed up the building of the corpus, a word auto-segmentation program is necessary. Moreover, while the program is segmenting words from the text, it can also deal with some related problems at the same time, such as the consistency of the transcription, adding romanization, and expanding the lexicon.

The Lexicon Bank

As the basis of the auto-segmentation program and the spell-checker, a corpus-based lexicon has been constructed which includes the lemma (both in romanization and in Chinese characters), alternative forms, synonyms, and part-of-speech. (See the Appendix for a sample of the lexicon.)

Consistency in the transcription

Taiwanese speech recognition is still developing, so there is no way to transcribe the data with a machine. Hence, transcription can only be done manually. The transcribers might be inconsistent in choosing the written form. For example, 按怎 (an3cuann2) "how" can be transcribed as 怎樣, 怎麼樣, 按怎, 怎麼, 什麼 and so on. Therefore, it is very important to design a program that can identify the inconsistency.

When the program is segmenting the text, it tries to match a string which matches the
word in the column of "Chinese character" in the lexicon bank. It then segments the word and codes its pinyin. Figure 1 shows the input text in the frame, and Figure 2 shows the output after segmentation. The word segmentation standard mostly follows that of the Sinica Corpus (Chen et al. 1996).

If the transcription happens to be one of the "other forms," it will be replaced with the standard form listed under the "Chinese character."

Adding new words to the Lexicon

If a word does not exist in the lexicon, it will be added to the lexicon after the file manager confirms its status.

In short, the word auto-segmentation program is able to do four things at the same time:

1 segment words in the text
2 code the pinyin for the characters
3 correct the inconsistent written forms
4 expand the lexicon bank

7 Applications of the corpus

This corpus has been used for studies on various aspects of child language acquisition, including tone acquisition (Tsay and Huang, 1998; Tsay, Myers, and Chen, 2000; Tsay, 2001), consonant acquisition (Liu and Tsay, 2000), classifier acquisition (Myers and Tsay, 2000), final particle acquisition (Hung, Li, and Tsay, 2004), verb acquisition (Lee and Tsay, 2001; Lin and Tsay, 2005), and vocabulary acquisition (Tsay and Cheng, in progress). More studies are on the way.

Because this corpus is based on spontaneous speech, it also has applications beyond linguistic research. For example, this corpus can be used in extracting important speech features.

This corpus will be released by the Association for Computational Linguistics and Chinese Language Processing, Taiwan, in the fall of 2005.

Acknowledgements

This project was supported by grants from the National Science Council, Taiwan (Grant no. NSC89-2411-H-194-06, NSC90-2411-H-194-031, NSC91-2411-H-194-029). We thank all the children and their families. Research assistants at Chung Cheng University, especially Tingyu Rose Huang, Hui-Chuan Joyce Liu, and Xiao-Chun Kay Chen, have made remarkable contributions to this project.

References

Chinese Knowledge Information Processing Group (CKIP). 1993. Chinese Part-of-Speech Analysis. Technical Report 93-05. Taipei: Academia Sinica.

Chen, Keh-Jiann, Chu-Ren Huang, Li-Ping Chang, Hui-Li Hsu. 1996. SINICA CORPUS: Design Methodology for Balanced Corpora. Language, Information, and Computation (PACLIC), 11: 167-176.

Huang, Chu-Ren, Keh-jiann Chen, Feng-yi Chen, and Li-Li Chang. 1997. "Segmentation Standard for Chinese Natural Language Processing." Computational Linguistics and Chinese Language Processing, Vol. 2, no. 2, pp. 47-62.

Huang, Chu-Ren, Keh-Jiann Chen and -Shin Lin. 1997. "Corpus on Web: Introducing the First Tagged and balanced Chinese Corpus." Conference Proceedings of Pacific Neighborhood Consortium 1997.

Hung, Jia-Fei, Cherry Li, and Jane Tsay. 2004. "The Child's Utterance Final Particles in Taiwanese: A Case Study." Proceedings of the 9th International Symposium of Chinese Languages and Linguistics, 477-498. Taipei: National Taiwan University.

Lee, Thomas Hun-tak and Jane Tsay. 2001. "Argument structure in the early speech of -speaking and Taiwanese-speaking children." The Joint Meeting of the 10th IACL and the 13th NACCL. June 22-24, 2001. UC Irvine.

Lin, Huei-ling and Jane Tsay. 2005. "Acquiring Causatives in Taiwanese." Paper presented at the 14th IACL. Leiden University.

Liu, Joyce H. C. and Jane Tsay. 2000. "An Optimality-Theoretic Analysis of Taiwanese Consonant Acquisition." Proceedings of the 7th International Symposium on Chinese Languages and Linguistics, 107-126. Chung Cheng University, Taiwan.

MacWhinney, Brian. 1995. The CHILDES Project: Tools for Analyzing Talk. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates Inc., Publishers.

MacWhinney, Brian, and Catherine Snow. 1985. The Child Language Data Exchange System. Journal of Child Language, 12: 271-296.

Myers, James and Jane Tsay. 2000. "The Acquisition of the Default Classifier in Taiwanese." Proceedings of the 7th International Symposium on Chinese Languages and Linguistics, 87-106. Chung Cheng University, Taiwan.

Myers, James and Jane Tsay. 2002. "Grammar and Cognition in Sinitic Noun Classifier Systems." Proceedings of the First Cognitive Linguistic Conference, pp. 199-216. Taipei: Chengchi University.

Tsay, Jane. 2001. "Phonetic Parameters of Tone Acquisition in Taiwanese." In Mineharu Nakayama (ed.) Issues in East Asian Language Acquisition, 205-226. Tokyo: Kuroshio Publishers.

Tsay, Jane and Ting-Yu Huang. 1998. "Phonetic Parameters in the Acquisition of Entering Tones in Taiwanese." The Proceedings of the Conference on Phonetics of the Languages in . 109-112. City University of Hong Kong.

Tsay, Jane, James Myers, and Xiao-Jun Chen. 2000. " as Evidence for Segmentation in Taiwanese." Proceedings of the 30th Child Language Research Forum, 211-218. Stanford, California: Center for the Study of Language and Information.

Dictionaries

Chen, Xiu. 1998. Taiwanhua Dacidian [Taiwanese Dictionary]. Taipei: Yuanliu Publishing Co.

Dong, Zhongsi. 2001. Taiwan Minnanyu Cidian [Taiwan Southern Min Dictionary]. Taipei: Wunan Publisher.

Li, Rong. 1998. Fangyan Cidian [Xiamen Dictionary]. Jiangsu: Education Publisher.

Wu, Shouli. 2000. Guotaiyu Duizhao Huoyong Cidian [Mandarin-Taiwanese Comparative Dictionary]. Taipei: Yuanliu Publishing Co.

Xu, Jidun. 1992. Changyong Hanzi Taiyu Cidian [Taiwanese Dictionary of Frequently Used Chinese Characters]. Taipei: Culture Department, Zili Evening News.

Yang, Qingchu. 1993. Guotai Shuangyu Cidian [Mandarin-Taiwanese Bilingual Dictionary]. Kaohsiung: Duli Publishing Co.

Yang, Xiufang. 2001. Minnanyu Cihui [Southern Min Vocabulary]. Taipei: Ministry of Education.

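The CHILDES layout of Section 2 and the MLU figures of Section 3 can be illustrated with a short script. The following is a minimal Python sketch, not the tooling used to build TAICORP. The function names parse_childes and mlu_by_group, the abridged sample text, and the tab after each tier label are assumptions made for illustration. Words are counted as the number of %cod tags, on the assumption that each segmented word carries exactly one tag, as in the sample file of Section 2.

```python
from collections import Counter

# Abridged, hypothetical transcript modelled on the sample in Section 2.
SAMPLE = """\
@Begin
@Participants:\tCHI Lin Target_Child, INV Rose Investigator
@Age of CHI:\t2;1.22
*INV:\thia1 si7 to2ui7?
%cod:\tNcd SHI Ncd
*CHI:\thm0.
%cod:\tI
*MOT:\tli2 kin1a2 ciah8 bi2ko1 si7 bo0?
%cod:\tNh Nd VC Na SHI T
@End
"""


def parse_childes(text):
    """Split a CHILDES-style transcript into headers and utterances."""
    headers, utterances = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        marker, body = line[0], line[1:]
        name, _, value = body.partition(":")
        name, value = name.strip(), value.strip()
        if marker == "@":            # header line
            headers.append((name, value))
        elif marker == "*":          # main tier: a new utterance
            utterances.append({"speaker": name, "text": value, "tiers": {}})
        elif marker == "%" and utterances:
            # dependent tier: attach to the most recent main tier
            utterances[-1]["tiers"][name] = value
    return headers, utterances


def mlu_by_group(utterances):
    """Mean length of utterance (words per line) for child and adult speech."""
    lines, words = Counter(), Counter()
    for utt in utterances:
        group = "child" if utt["speaker"] == "CHI" else "adult"
        lines[group] += 1
        # Assumption: one %cod tag per segmented word.
        words[group] += len(utt["tiers"].get("cod", "").split())
    return {group: words[group] / lines[group] for group in lines}


headers, utterances = parse_childes(SAMPLE)
print(mlu_by_group(utterances))
```

Applied to the counts in Table 1, the same words-per-line ratio gives 434,557 / 161,253 ≈ 2.695 for the children and 1,211,946 / 336,173 ≈ 3.605 for the adults. The pooled ratio 1,646,503 / 497,426 is about 3.310, so the 3.150 listed in the Total row appears to be the mean of the two group values rather than total words divided by total lines.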
Southern Min Spell Checker

Figure 1 Input text

Figure 2 Output text
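The step from the input text of Figure 1 to the output text of Figure 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' program: the paper says only that the program matches strings against the "Chinese character" column, so the greedy longest-match strategy, the function names, and the three-entry lexicon (built from forms mentioned in Section 6 and in the Appendix) are all hypothetical.

```python
# Hypothetical mini-lexicon: standard form -> (pinyin, other written forms).
LEXICON = {
    "按怎": ("an3cuann2", ["怎樣", "怎麼樣", "怎麼"]),
    "賣了": ("be7liau2", []),
    "賣了了": ("be7liau2liau2", []),
}


def build_index(lexicon):
    """Map every written form (standard or variant) to its standard form."""
    index = {}
    for standard, (_pinyin, variants) in lexicon.items():
        index[standard] = standard
        for variant in variants:
            index[variant] = standard
    return index


def segment(text, lexicon):
    """Greedy longest-match segmentation (an assumed strategy).

    Returns (tokens, unknown): tokens are (standard form, pinyin) pairs,
    and unknown lists characters not found in the lexicon, which a file
    manager could review before adding them.
    """
    index = build_index(lexicon)
    longest = max(len(form) for form in index)
    tokens, unknown = [], []
    i = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in index:
                standard = index[piece]      # replace a variant by the standard form
                tokens.append((standard, lexicon[standard][0]))
                i += size
                break
        else:
            # No lexicon entry starts here: keep the character, flag it.
            tokens.append((text[i], None))
            unknown.append(text[i])
            i += 1
    return tokens, unknown


tokens, unknown = segment("伊怎樣賣了了", LEXICON)
print(tokens)
print(unknown)
```

In this sketch the variant 怎樣 is rewritten to the standard form 按怎 and coded with its pinyin, 賣了了 is preferred over the shorter entry 賣了, and 伊 is reported as a candidate new word, which mirrors the four functions listed at the end of Section 6.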

Appendix: Sample of the Lexicon

Chinese character   Southern Min Pinyin                    Part-of-speech   Meaning (or Mandarin synonyms)      Example
未記e0              be7ki3e0                               VK
未見笑              be7kian3siau3/bue7kian3siau3           VH               不要臉
賣了                be7liau2/bue7liau2                     VB               賣完
賣了了              be7liau2liau2/bue7liau2liau2           VB               賣光光
賣了了去            be7liau2liau2khi3/bue7liau2liau2khi3   VB               賣光光去
未使                be7sai2/bue7sai2                       D                不行、不能、不可以、未saih6        你未使去 （修飾動詞）
未使                be7sai2/bue7sai2                       VH               不行、不能、不可以、未saih6        伊知道這樣未使
未使得              be7sai2cit4/bue7sai2cit4               D                不能
未輸                be7su1/bue7su1                         D                好像、未su1                         好像
未當                be7tang3/bue7tang3                     D                不能、不可以、不行、未能、未tang3
未當得              be7tang3cit4/bue7tang3cit4             D                不能、不可以
賣掉                be7tiau7/bue7tiau7                     VC
賣掉去              be7tiau7khi3/bue7tiau7khi3             VB
未振未動            be7tin2be7tang7                        VA               一動也不動
賣著                be7tioh8/bue7tioh8                     VC
賣場                be7tiunn5/bue7tiunn5                   Nc
未拄好              be7tu2ho2/bue7tu2ho2                   VH               配合得不好
賣完                be7uan5/bue7uan5                       VC
欲                  beh4/bueh4                             D                會、愛、要、要1 (+動詞)             我要去市仔買菜
欲                  beh4/bueh4                             D                快要                                快來不及了
欲                  beh4/bueh4                             VC               會、愛、要、要1 (+名詞組)           我要這領衫
欲愛                beh4ai3/bueh4ai3                       D                欲愛、想要、欲要、要愛              欲愛耍 [m 玩具]
欲愛                beh4ai3/bueh4ai3                       VC               欲愛、想要、欲要、要愛
欲無                beh4bo5                                Cbb              bue4無、bue2bo5                     （連接詞）本來…但…
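The sample shows two properties that any program reading the lexicon has to handle: a pinyin field may hold two pronunciations separated by a slash, and the same form may be listed under more than one part-of-speech. The following Python sketch indexes a few rows by pinyin to surface such cases. The row selection, the function names, and the idea of indexing by pinyin are illustrative assumptions; the paper does not describe how the lexicon is stored.

```python
from collections import defaultdict

# (pinyin field, part-of-speech) pairs copied from the sample rows above.
ROWS = [
    ("be7sai2/bue7sai2", "D"),
    ("be7sai2/bue7sai2", "VH"),
    ("be7tin2be7tang7", "VA"),
    ("beh4/bueh4", "D"),
    ("beh4/bueh4", "VC"),
    ("beh4bo5", "Cbb"),
]


def index_by_pinyin(rows):
    """Map each pronunciation variant to the set of tags it is listed with."""
    index = defaultdict(set)
    for field, pos in rows:
        for variant in field.split("/"):   # e.g. "beh4/bueh4" -> two keys
            index[variant.strip()].add(pos)
    return index


def ambiguous(index):
    """Return the pronunciations that carry more than one part-of-speech."""
    return {pinyin: sorted(tags) for pinyin, tags in index.items() if len(tags) > 1}


index = index_by_pinyin(ROWS)
print(ambiguous(index))
```

Entries flagged this way (here beh4/bueh4 and be7sai2/bue7sai2) are the ones for which a segmenter cannot assign a part-of-speech from the lexicon alone and context or manual checking would be needed.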
