Taiwan Child Language Corpus: Data Collection and Annotation
Total Page:16
File Type:pdf, Size:1020Kb
Taiwan Child Language Corpus: Data Collection and Annotation Jane S. Tsay Institute of Linguistics, Chung Cheng University Min-Hsiung, Chia-Yi 621, Taiwan [email protected] recording sessions, each 40 to 60 minutes long, Abstract totaling about 330 hours. 1.3 Transcription Taiwan Child Language Corpus contains scripts transcribed from about 330 hours Each recording session was transcribed into a of recordings of fourteen young children separate text file, using Chinese orthography. from Southern Min Chinese speaking For words that do not have a conventionalized families in Taiwan. The format of the written form, the Taiwan Southern Min corpus adopts the Child Language Data romanization system, i.e., Taiwan Southern Min Exchange System (CHILDES). The size Pinyin was used. of the corpus is about 1.6 million words. About half of the sessions (from children In this paper, we describe data collection, under two and a half years old) also have transcription, word segmentation, and phonetic transcription in unicode IPA part-of-speech annotation of this corpus. (International Phonetic Alphabet). Applications of the corpus are also The three primary transcribers, who were discussed. also the investigators who did the recordings, were well-trained linguists. All recordings were 1 Data Collection first transcribed by the investigator of the specific session and then checked by the other Taiwan Child Language Corpus (TAICORP) is two transcribers. a corpus of text files transcribed from the child speech recorded between October 1997 through 2 Text files in CHILDES format May 2000. The target language is Southern Min Chinese spoken in Taiwan. TAICORP adopts the format of CHILDES (Child Language Data Exchange System), 1.1 Children originally set up by Elizabeth Bates, Brian MacWhinney, and Catherine Snow, to transcribe All fourteen children participated were from and code the recordings of child speech into Taiwanese-speaking families in Min-Hsiung machine-readable text (MacWhinney & Snow Villlage, Chiayi County, Taiwan. 1985, MacWhinney 1995). There were nine boys and five girls, aged from one year two months to three years and The main components of CHILDES format eleven months at the beginning of the project. are headers and tiers. More than half of the children were recorded over more than two years. 2.1 Headers 1.2 Recordings Obligatory headers are necessary for every file. They mark the beginning, the end and The recordings were made through regular the participants of the file. home visits. Spontaneous speech of these Constant headers mark the name of the file children at play was recorded using Mini Disc and the background information of the recorders. The interval of the sessions was children. about two weeks. There were totally 431 56 Changeable headers contain information that ɀʳSIS: the utterance of sister can change within the file, such as the ɀʳBRO: the utterance of brother recording date, duration, coders and so on. ɀʳGRM: the utterance of grandmother These headers begin with @, for example: ɀʳGRF: the utterance of grandfather ɀʳOTH: the utterance of other people Obligatory headers: @Begin The main tier is the most important tier @End because it is where the utterances are listed. The @Participants utterances in the main tier were transcribed in the romanization (pinyin) system of Taiwan Constant headers: Southern Min (to be explained and illustrated in @Age of XXX: Section 5). @Birth of XXX: @Coder: Dependent Tiers @Educ of XXX: @Filename: Additional information is given in dependent @ID: tiers, indicated by %, following the main tier. @Language: Dependent tiers can be changed according to @Language of XXX: the design and goals of each corpus. @SES of XXX: social and economic The dependent tiers used in TAICORP status of a specific speaker include the following: @Sex of XXX: @Warning: the defects of the file %ort: transcription in standard orthography %cod: part-of-speech coding Changeable headers: %pho: phonetic transcription in IPA @Activities: %ton: tone value in 5-point scale @Comment: @Date: For adults' speech, only %ort and %cod @Location: tiers are used. For younger children's speech, @New Episode: %pho and %ton tiers are also used. The @Room Layout: following text is an example from TAICORP. @Situation: ([m…] = speech in Mandarin; SHI = ࢂ "be") @Tape Location: @Time Duration: @Begin @Time Start: @Participants: CHI Lin Target_Child, INV Rose Investigator, MOT Mother, OTH Great 2.2 Tiers Grandmother @Age of CHI: 2;1.22 The content of a file is presented in tiers, @Birth of CHI: 28-AUG-1995 including main tiers and dependent tiers. A main @Sex of CHI: Male tier, indicated by *, contains the utterance of the @Coder: Rose, Kay, Joyce speaker. @Language: Taiwanese @Date:20-OCT-1997 Main tiers @Tape Location: Lin D1-1-56 @Comment: Time Duration 37 minutes The main tiers used in TAICORP include the @Location: Chiayi, Taiwan following: @Transcriber: Rose @Comment: Track number is D1-1 ɀʳINV: the utterance of the investigator ɀʳCHI: the utterance of the target child *INV: bo2lin2@s [:=m], li2 tha5tu2a2 khi3 ɀʳMOT: the utterance of mother to2? ?ॲ װ גᙰࢰ܃ ,[ɀʳFAT: the utterance of father %ort: [m ਹࣥ 57 %cod: Nb Nh Nd VCL Ncd Table 2. *CHI: hia1/hin1. %ort: ሕ 1. Table 2 Part-of-Speech Tagset in TAICORP %cod: Ncd Coding Part-of-speech %pho: h i a A non-predicative adjective %ton: 55 Caa coordinate conjunction *INV: hia1 si7 to2ui7? Cab listing conjunction Cba conjunction occurring at the end ?ۯort: ሕ ਢ ॲ% %cod: Ncd SHI Ncd of a sentence *CHI: hm0. Cbb following a subject %ort: hm0. Da possibly preceding a noun %cod: I Dfa preceding VH through VL %pho: ?? Dfb following adverb %ton: ?? Di post-verbal *MOT:li2 kin1a2 ciah8 bi2ko1 si7 bo0? Dk sentence initial ˒ˆ ྤ ᗶʳ ਢʳۏ ʳ ଇʳ˄ גʳ վ܃ :ort% D adverbial %cod: Nh Nd VC Na SHI T Na common noun @End Nb proper noun Nc location noun 3 Statistics of the corpus Ncd localizer Nd time noun The corpus size is about 1.6 million words Neu numeral determiner (more than 2 million morphemes/Chinese Nes specific determiner characters). The number of utterances/lines, Nep anaphoric determiner words, mean length of utterances (MLU) are listed in Table 1. Neqa classifier determiner Neqb postposed classifier determiner Lines Words MLU Nf classifier Ng postposition Children ˄ˉ˄ʿ˅ˈˆ ˇˆˇʿˈˈˊ ˅ˁˉˌˈ Nh pronoun Adults ˆˆˉʿ˄ˊˆ ˄ʿ˅˄˄ʿˌˇˉ ˆˁˉ˃ˈ I interjection Total ˇˌˊʿˇ˅ˉ ˄ʿˉˇˉʿˈ˃ˆ ˆˁ˄ˈ˃ P preposition Table 1 Statistics of the corpus T particle VA active intransitive verb It might be worth mentioning that the VAC MLU of adults in this corpus is relatively short. VB active pseudo-transitive verb This could be attributed to the nature of this VC active transitive verb corpus as being child-directed speech. VCL transitive verb taking a locative argument 4 Part-of speech annotation VD ditransitive verb VE active transitive verb with Southern Min and Mandarin are both Sinitic sentential object languages. They are very similar in their VF active transitive verb with VP morphology and syntactic structures. Therefore, object we adopted the part-of-speech coding system of VG classifactory verb the Sinica Corpus, Academia Sinica, Taiwan VH stative intransitive verb (see various CKIP technical reports). However, VHC stative causitive verb among the 115 categories used in the Sinica VI stative pseudo-transitive verb Corpus (CKIP 1993), only 46 codes were used VJ stative transitive verb in TAICORP. In other words, categorization in VK stative transitive verb with TAICORP is broader. These codes are listed in sentential object 58 "ѯ 0ki5kha2/ "unusualڻ VL stative transitive verb with VP object V_2 If a written form cannot be found in one of DE *special tag for the word "ޑ" the major Southern Min dictionaries, SHI special tag for the word "ࢂ" romanization is used. FW foreign words Romanization is also used if the written *Di/T *marker following pseudo- form of a word is found in the dictionary but transitive active verb has so low frequency that it can't be found in the computer coding system. *CIT *special tag for the word "ள 2" For homonyms, a number is added after 5 Orthography-related issues for a the character to indicate different lemmas. For speech-based corpus of Southern Min example: ᇂ 1 0kah4/!!"to cover with a blanket" 5.1 Romanization system ᇂ 2 0kham3/!!"to cover" ᇂ 3 0kua3/!!"a cover" As mentioned in Section 2, utterances in the main tier are transcribed in romanization (Southern Min pinyin). The romanization 6 The Autosegmentation program and the system used in TAICORP is the Taiwan Spell-checker Southern Min Phonetic Alphabetic (also known as Taiwan Language Phonetic Alphabet, TLPA, In order to speed up the building of the corpus, originally proposed by the Taiwan Language a word auto-segmentation program is necessary. Society in 1991) announced officially by the Yet, when the program is segmenting words Ministry of Education of Taiwan in 1998. from the text, it can also deal with some related problems at the same time, such as the consistency of the transcription, adding 5.2 Standard orthography: Chinese romanization, and expanding the lexicon. characters The Lexicon Bank Chinese characters are used in the dependent tier %ort as the standard orthography. This is a As the basis of the auto-segmentation program reasonable way because most of the Southern and the spell-cheker, a corpus-based lexicon has Min words are cognates of Mandarin words. been constructed which includes the lemma However, because Southern Min does not have (both in romanization and in Chinese as conventionalized orthography as Mandarin, characters), alternative forms, synonyms, and quite a few words in Southern Min do not have part-of-speech. (See the Appendix for a sample a consistent way of writing them.