Testing the Impact of Syllable Aggregation in Romanized Fields of Chinese Language Bibliographic Records
Clement Arsenault
Faculty of Information Studies, University of Toronto, CANADA

Abstract: Today, two Romanization systems for Chinese data are in use in most libraries in the Western world: 1) Wade-Giles, and 2) Hanyu pinyin (simply referred to as pinyin). In 1997, the Library of Congress officially announced the adoption of pinyin for Romanizing Chinese data in its bibliographic records. One of the main problems in implementing the pinyin standard for library use is that pinyin, as opposed to Wade-Giles, aggregates Chinese "words" into single linguistic units. Chinese characters represent monosyllabic morphemes rather than words and are equally spaced from one another, and the Chinese text, in its original form, does not provide visual cues as to where a word starts or ends. When the script is romanized it is, however, essential that syllables or words be separated from one another, since most information retrieval techniques require the identification of "visual words". In this respect, the Romanized strings could be divided either into monosyllables or into polysyllabic words. This study aims to explore the impact of using either unaggregated pinyin (monosyllabic) or aggregated pinyin (polysyllabic) Romanization in Chinese-language bibliographic records. An experiment, using transaction log analysis, was carried out to observe variations in the retrieval performance of title searches, both phrase and keyword, in a large OPAC of Chinese-language records. General results are presented and a summary of the pros and cons of using either method is given.

1. Introduction

The first online public access catalogues (OPACs) developed in large institutions, and the bibliographic databases produced and maintained by cataloguing agencies, did not have, until the mid 1980s, built-in capabilities to handle non-Roman scripts.
Mainly because of limitations of coding space for large character sets, non-Roman scripts were solely represented by romanized fields. Entering non-Roman vernacular script in MARC records is now technically possible, but it should nevertheless be noted that even today, most local OPACs in the Western world are still not equipped with the necessary typographical utilities to display the characters contained in these records, let alone with a proper interface to input them into query strings, leaving the end-user back at square one, that is, with romanized entries.

Today, two Romanization systems for Chinese data are in use in most libraries in the Western world: 1) Wade-Giles, the system used in most North American libraries; 2) Pinyin, the system developed and officially adopted in 1958 by the People's Republic of China (PRC), called Hanyu pinyin but simply referred to as pinyin, used mainly in European and Australian libraries. With the recent adoption of the Hanyu pinyin Romanization standard (pinyin) by the Library of Congress (LC), the replacement of Wade-Giles strings with pinyin entries in bibliographic records is imminent and will affect many libraries in North America in the coming years.

Using pinyin over Wade-Giles will have a significant impact on retrieval in OPACs. The conversion from Wade-Giles to pinyin will likely be beneficial since end-users are, for the great majority, more familiar with pinyin than with Wade-Giles (Young, 1992). Pinyin entries in bibliographic records can be constructed following either a monosyllabic or a polysyllabic pattern. The goal of the current study is to investigate how polysyllabic transcription affects retrieval performance in item-specific title searching in OPACs.

2. Background of the Research

2.1. Basic Characteristics of Chinese Language

There exists a quasi one-to-one syllable-morpheme-character pattern in Chinese (Kratochvil, 1968, 156), in the sense that virtually each character represents, at a given time, a single syllable. This quasi one-to-one relationship between syllables, morphemes and characters has often been a source of confusion in defining what, in Chinese, constitutes a word. It is estimated that around 28% of Chinese words are composed of one character, while 67% are two-character words; the remaining 5% are formed with three or more characters (Suen, 1986, 8).

While there exist several thousand Chinese characters, modern standard Chinese (Mandarin) has only about 1300 different syllables (counting tones). There is inevitably a large number of homophone characters. This problem is further compounded by the fact that, when tones are ignored, as is the case in Romanized fields of bibliographic records, the number of unique syllables is reduced to around 408; so unless tones are marked, there are a little over 400 different syllables that can be used to represent the thousands of existing Chinese characters. This is, needless to say, a source of great confusion for users who rely solely on monosyllabic Romanized fields for the identification and retrieval of their bibliographic references. Expressing linguistic word units in aggregated polysyllabic form greatly helps reduce the number of homonyms produced by the monosyllabic transcription method (Anderson, 1972, 12; King, 1983).

2.2. Conversion of Chinese Script in Bibliographic Records

Transliteration has been defined as the process of representing the characters of one alphabet, the target script, into those of another alphabet, the host script (Wellisch, 1978, 28). Because Chinese is a non-phonological writing system, it is impossible to transliterate, in the strict sense of the term, Chinese characters into Roman letters.
The only type of script conversion possible is indirect transcription, that is, using the writing system of one language to represent the sounds of the Chinese characters.

Some studies have shown that library users are usually not very successful at retrieving items for which only a Romanized form has been entered in the bibliographic record (Aissing, 1992; Young, 1992). However, in North America, where most automated systems function primarily with the Roman script, Romanization, if used alongside the original script, could be used to enhance access.

2.3. Parsing of Romanized Chinese Entries

In a Chinese text, apart from punctuation, which indicates the end of sentences and their syntactic division, there are no visual cues as to where syntactic words start and end. This lack of visual boundaries does not mean that syntactic words do not exist in Chinese. In a Romanized text the level of ambiguity created by homophony is such that it is often nearly impossible to make any sense of unaggregated (monosyllabic) Romanized Chinese text. Research has shown that the ambiguity is resolved about 95% of the time when syllables are aggregated into words (King, 1983, 57). Word segmentation is not an easy task, largely because the delimitation of words as syntactic units is often based on both historical and cultural conventions. To this day, no definitive standard on word segmentation of Chinese has been unanimously adopted.

For bibliographic control, pinyin entries in bibliographic records can be constructed following either a monosyllabic or a polysyllabic pattern. Although the former is easier and less costly to implement, it seems rational to believe that, because monosyllabic pinyin transcription can only produce around 410 different syllables for indexing (408 if diacritic marks are ignored by the retrieval algorithm), it creates data strings that are inadequate for effective and efficient online retrieval of records.
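The indexing argument above can be illustrated with a minimal sketch. It is not the retrieval system used in this study: the three toy titles, the function names, and the assumption that a keyword search is a Boolean AND over index terms are all hypothetical, chosen only to show how monosyllabic tokens collide where aggregated tokens do not.

```python
from collections import defaultdict

# Toy titles (hypothetical, for illustration only): each title is a list of
# words, and each word is a list of toneless pinyin syllables.
TITLES = [
    [["zhong", "guo"], ["wen", "xue"], ["shi"]],
    [["guo", "zhong"], ["shu", "xue"]],
    [["zhong", "xue"], ["guo", "wen"]],
]

def tokens(title, aggregated):
    """Return index terms: whole words if aggregated, else single syllables."""
    if aggregated:
        return ["".join(word) for word in title]
    return [syllable for word in title for syllable in word]

def build_index(titles, aggregated):
    """Build an inverted index mapping each term to the set of title ids."""
    index = defaultdict(set)
    for doc_id, title in enumerate(titles):
        for term in tokens(title, aggregated):
            index[term].add(doc_id)
    return index

def keyword_search(index, query_terms):
    """Assumed keyword search: Boolean AND over all query terms."""
    postings = [index.get(term, set()) for term in query_terms]
    return set.intersection(*postings) if postings else set()

mono = build_index(TITLES, aggregated=False)
poly = build_index(TITLES, aggregated=True)

# The same one-word query, tokenized under each convention.
query = [["zhong", "guo"]]
mono_hits = keyword_search(mono, tokens(query, aggregated=False))
poly_hits = keyword_search(poly, tokens(query, aggregated=True))

print("monosyllabic hits:", sorted(mono_hits))
print("polysyllabic hits:", sorted(poly_hits))
print("distinct index terms:", len(mono), "vs", len(poly))
```

In this toy set the monosyllabic query matches every title, because the syllables "zhong" and "guo" each occur somewhere in all three, while the aggregated query matches only the title that actually contains the word. The aggregated index also holds more distinct terms, which is the effect described in the text.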
Using the polysyllabic method is potentially beneficial for end-users since combining single syllables into linguistic units greatly reduces ambiguity and dramatically increases the number of individual units available for indexing. The decision to use mono- or polysyllabic Romanization will have direct implications on browsing, indexing and retrieval in OPACs and will have vast repercussions on the services we offer to library users. Recognizing the significance of that problem, the Research Libraries Group (RLG) published, in 1987, the Chinese Aggregation Guidelines (RLG, 1987) and adopted a policy where Chinese characters and romanized syllables could be joined with a special "aggregator" character. RLG will soon offer its subscribers the possibility of downloading records with or without these aggregators, which means that local OPACs could contain records in either mono- or polysyllabic Romanization.

3. Research Methodology

The experiment was primarily designed to measure the difference in retrieval performance in OPAC searches when replacing Wade-Giles (WG) entries with monosyllabic pinyin (mPY) and with polysyllabic pinyin (pPY) in Chinese-language bibliographic records. The focus was on item-specific retrieval using the exact-title and the keywords-in-title search modes. Data were obtained by asking 30 library users to perform a specific retrieval task. All participants were native Chinese speakers and were all graduate students at the University of Toronto. Three treatment groups were defined, namely WG, mPY, and pPY. Each participant was assigned to a specific treatment, 6 for WG and 12 for each of the pinyin groups. The task consisted of using Romanization to search two lists of 20 monograph titles each, in a database containing ca. 50 000 bibliographic records for Chinese monographs. The database records contained Romanized fields only while the printed lists of titles were given in original Chinese characters, so the