

Clement Arsenault Faculty of Information Studies, University of Toronto, CANADA

Testing the Impact of Aggregation in Romanized Fields of Bibliographic Records

Abstract: Today, two Romanization systems for Chinese data are in use in most libraries in the Western world: 1) Wade-Giles, and 2) Hanyu pinyin (simply referred to as pinyin). In 1997, the Library of Congress finally announced officially the adoption of pinyin for Romanizing Chinese data in its bibliographic records. One of the main problems in implementing the pinyin standard for library use is that pinyin, as opposed to Wade-Giles, aggregates Chinese "words" into single linguistic units. Chinese characters represent monosyllabic morphemes rather than words and are equally spaced from one another, so the Chinese text, in its original form, does not provide visual cues as to where a word starts or ends. When the script is romanized, however, it is essential that syllables or words be separated from one another, since most information retrieval techniques require the identification of "visual words". In this respect, the Romanized strings can be divided either into monosyllables or into polysyllabic words. This study explores the impact of using either unaggregated (monosyllabic) or aggregated (polysyllabic) pinyin Romanization in Chinese-language bibliographic records. An experiment, using transaction log analysis, was carried out to observe variations in the retrieval performance of title searches, both phrase and keyword, in a large OPAC of Chinese-language records. General results are presented and a summary of the pros and cons of using either method is given.

1. Introduction

The first online public access catalogues (OPACs) developed in large institutions, and the bibliographic databases produced and maintained by cataloguing agencies, had no built-in capability to handle non-Roman scripts until the mid-1980s. Mainly because of limitations of coding space for large character sets, non-Roman scripts were represented solely by romanized fields. Entering non-Roman vernacular script in MARC records is now technically possible, but it should nevertheless be noted that even today most local OPACs in the Western world are still not equipped with the typographical utilities needed to display the characters contained in these records, let alone with a proper interface to input them into query strings. This leaves the end-user back at square one, that is, with romanized entries.

Today, two Romanization systems for Chinese data are in use in most libraries in the Western world: 1) Wade-Giles, the system used in most North American libraries; and 2) Hanyu pinyin (simply referred to as pinyin), the system developed and officially adopted in 1958 by the People's Republic of China (PRC) and used mainly in European and Australian libraries. With the recent adoption of the Hanyu pinyin Romanization standard by the Library of Congress (LC), the replacement of Wade-Giles strings with pinyin entries in bibliographic records is imminent and will affect many libraries in North America in the coming years.

Using pinyin rather than Wade-Giles will have a significant impact on retrieval in OPACs. The conversion from Wade-Giles to pinyin will likely be beneficial since the great majority of end-users are more familiar with pinyin than with Wade-Giles (Young, 1992). Pinyin entries in bibliographic records can be constructed following either a monosyllabic or a polysyllabic pattern. The goal of the current study is to investigate how polysyllabic transcription affects retrieval performance in item-specific title searching in OPACs.

2. Background of the Research

2.1. Basic Characteristics of Chinese Language

There exists a quasi one-to-one syllable-morpheme-character pattern in Chinese (Kratochvil, 1968, 156), in the sense that virtually each character represents, at a given time, a single syllable. This quasi one-to-one relationship between syllables, morphemes and characters has often been a source of confusion in defining what, in Chinese, constitutes a word. It is estimated that around 28% of Chinese words are composed of one character, while 67% are two-character words; the remaining 5% are formed with three or more characters (Suen, 1986, 8).

While there exist several thousand Chinese characters, modern Chinese has only about 1300 different syllables (counting tones). There is inevitably a large number of homophone characters. This problem is further compounded by the fact that when tones are ignored, as is the case in Romanized fields of bibliographic records, the number of unique syllables is reduced to around 408. Unless tones are marked, then, a little over 400 different syllables must represent the thousands of existing Chinese characters. This is, needless to say, a source of great confusion for users who rely solely on monosyllabic Romanized fields for the identification and retrieval of their bibliographic references. Expressing linguistic word units in aggregated polysyllabic form greatly helps reduce the number of homonyms produced by the monosyllabic transcription method (Anderson, 1972, 12; King, 1983).
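The effect of aggregation on homonymy can be illustrated with a minimal Python sketch. The four two-syllable words below are a toy sample chosen for illustration, not records from the study, and `index_terms` is a hypothetical helper that mimics a simple keyword index.

```python
from collections import defaultdict

# Toy sample: toneless pinyin syllables for four common two-syllable words.
# These are illustrative assumptions, not records from the study's database.
WORDS = [
    ("shi", "jian"),   # "time"
    ("shi", "jie"),    # "world"
    ("jie", "shi"),    # "to explain"
    ("jian", "shi"),   # "to monitor"
]

def index_terms(words, aggregate):
    """Map each index term to the set of word positions it would retrieve."""
    postings = defaultdict(set)
    for pos, syllables in enumerate(words):
        # Aggregated: one term per word. Unaggregated: one term per syllable.
        terms = ["".join(syllables)] if aggregate else list(syllables)
        for term in terms:
            postings[term].add(pos)
    return dict(postings)

mono = index_terms(WORDS, aggregate=False)
poly = index_terms(WORDS, aggregate=True)

print(len(mono), "monosyllabic terms; 'shi' retrieves", len(mono["shi"]), "words")
print(len(poly), "polysyllabic terms; 'shijian' retrieves", len(poly["shijian"]), "word")
```

In this toy sample the monosyllabic index holds only three distinct terms and the term shi matches all four words, whereas the aggregated index holds four distinct terms that each match a single word.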

2.2. Conversion of Chinese Script in Bibliographic Records

Transliteration has been defined as the process of representing the characters of one alphabet, the target script, into those of another alphabet, the host script (Wellisch, 1978, 28). Because Chinese is a non-phonological writing system, it is impossible to transliterate, in the strict sense of the term, Chinese characters into Roman letters. The only type of script conversion possible is indirect transcription, that is, using the writing system of one language to represent the sounds of the Chinese characters. Some studies have shown that library users are usually not very successful at retrieving items for which only a Romanized form has been entered in the bibliographic record (Aissing, 1992; Young, 1992). However, in North America, where most automated systems function primarily with the Roman script, Romanization, if used alongside the original script, could be used to enhance access.

2.3. Parsing of Romanized Chinese Entries

In a Chinese text, apart from punctuation, which indicates the end of sentences and their syntactic division, there are no visual cues as to where syntactic words start and end. This lack of visual boundaries does not mean that syntactic words do not exist in Chinese. In a Romanized text the level of ambiguity created by homophony is such that it is often nearly impossible to make any sense of unaggregated (monosyllabic) Romanized Chinese text. Research has shown that the ambiguity is resolved about 95% of the time when syllables are aggregated into words (King, 1983, 57). Word segmentation is not an easy task, largely because the delimitation of words as syntactic units is often based on both historical and cultural conventions. To this day, no definitive standard on word segmentation of Chinese has been unanimously adopted.

For bibliographic control, pinyin entries in bibliographic records can be constructed following either a monosyllabic or a polysyllabic pattern. Although the former is easier and less costly to implement, it seems rational to believe that, because monosyllabic pinyin transcription can only produce somewhere around 410 different syllables for indexing (408 if marks are ignored by the retrieval algorithm)¹, it creates data strings that are inadequate for effective and efficient online retrieval of records. Using the polysyllabic method is potentially beneficial for end-users since combining single syllables into linguistic units greatly reduces ambiguity and dramatically increases the number of individual units available for indexing. The decision to use mono- or polysyllabic Romanization will have direct implications on browsing, indexing and retrieval in OPACs and will have vast repercussions on the services we offer to library users. Recognizing the significance of that problem, the Research Libraries Group (RLG) published, in 1987, the Chinese Aggregation Guidelines (RLG, 1987) and adopted a policy where Chinese characters and romanized syllables could be joined with a special "aggregator" character. RLG will soon offer its subscribers the possibility of downloading records with or without these aggregators, which means that local OPACs could contain records in either mono- or polysyllabic Romanization.
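As a rough sketch of how one stored field could yield either form, the Python fragment below derives a monosyllabic and a polysyllabic string from a field that carries aggregators. The `+` marker, the sample title, and both function names are assumptions made for illustration; the actual RLG aggregator character and record format are not reproduced here.

```python
# Hypothetical aggregator marker; the real RLG character is not specified here.
AGGREGATOR = "+"

def to_monosyllabic(field: str) -> str:
    """Turn every aggregator into a space so that each syllable stands alone."""
    return " ".join(field.replace(AGGREGATOR, " ").split())

def to_polysyllabic(field: str) -> str:
    """Drop the aggregators so that joined syllables form one visual word."""
    return " ".join(field.replace(AGGREGATOR, "").split())

# Made-up title field with aggregators between the syllables of each word.
stored = "Zhong+guo tu+shu+guan shi"

print(to_monosyllabic(stored))   # Zhong guo tu shu guan shi
print(to_polysyllabic(stored))   # Zhongguo tushuguan shi
```

For simplicity the sketch also splits names in the monosyllabic form, which differs from the practice of keeping multi-character place and personal names joined.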

3. Research Methodology

The experiment was primarily designed to measure the difference in retrieval performance in OPAC searches when replacing Wade-Giles (WG) entries with monosyllabic pinyin (mPY) and with polysyllabic pinyin (pPY) in Chinese-language bibliographic records. The focus was on item-specific retrieval using the exact-title and the keywords-in-title search modes. Data were obtained by asking 30 library users to perform a specific retrieval task. All participants were native Chinese speakers and graduate students at the University of Toronto. Three treatment groups were defined, namely WG, mPY, and pPY. Each participant was assigned to a specific treatment, 6 for WG and 12 for each of the pinyin groups.² The task consisted of using Romanization to search two lists of 20 monograph titles each, in a database containing ca. 50,000 bibliographic records for Chinese monographs. The database records contained Romanized fields only, while the printed lists of titles were given in original Chinese characters, so the participants had to mentally convert the Chinese script into Roman script to build their search queries. Transaction logs were automatically generated during the search process by a logging program concealed from the end-user.

3.1. Experimental Design

All participants were provided with identical lists³ of specific titles to search, so all the searches were known-item title searches. The 40 titles were broken down into two lists of 20 titles each. After being randomly assigned to a specific treatment (Romanization), each participant had to search the first list using the exact-title search mode and the second list using the keyword mode, or vice-versa. Participants were informed that they were free to issue as many or as few queries as desired, as long as each title in the list was searched at least once and the order of the titles in the lists was respected. The experiment was repeated over the moderator variable, namely the two search modes, exact-title search and keyword search. The generic experimental model can thus be regarded as a single 2 x 3 factorial design, or as two independent 1 x 3 factorial designs. Table 1 illustrates how the 30 participants were distributed in the cells of the model.

                      WG    mPY    pPY    Total
Exact-title search     6     12     12       30
Keyword search         6     12     12       30

Table 1: Allocation of Participants to Cells
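The allocation can be restated as a short Python sketch that enumerates the six cells of the design. The variable names are illustrative, and the count of title searches per cell simply multiplies the group size by the 20 titles searched in each mode; it is an upper-level tally, not a figure reported by the study.

```python
# Participants per Romanization group, as stated in the methodology.
GROUP_SIZES = {"WG": 6, "mPY": 12, "pPY": 12}
# Every participant searched one list in each mode.
SEARCH_MODES = ("exact-title", "keyword")
# Each list contained 20 titles.
TITLES_PER_LIST = 20

# One cell per (search mode, Romanization group) pair: 2 x 3 = 6 cells.
cells = {(mode, group): n for mode in SEARCH_MODES for group, n in GROUP_SIZES.items()}
total_participants = sum(GROUP_SIZES.values())
title_searches = {cell: n * TITLES_PER_LIST for cell, n in cells.items()}

for (mode, group), n in cells.items():
    print(f"{mode:12s} {group:4s} participants={n:2d} "
          f"title searches={title_searches[(mode, group)]}")
print("participants in total:", total_participants)
```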

4. Findings

4.1. Classification of Unsuccessful Queries

The unsuccessful queries recorded in the transaction logs (i.e., queries that were not followed by the display of the record of the item sought) were gathered in a file for analysis. These queries were first classified into four subgroups, labelled type-I to type-IV.⁴ Type-IV problems consist of structural errors that are of little interest here, and for that reason they were discarded from further analysis. The remainder was categorized as follows:

Type-I problems: Queries that retrieved no item (zero-hit queries);
Type-II problems: Queries that retrieved at least one item, but the set did not contain the item sought;
Type-III problems: Queries that retrieved at least one item and the set contained the item sought, but the item was not displayed by the participant, usually because the set was too large to browse.

The number of queries falling in each of these three categories was added up for each individual trial, and the sum of these counts was tabulated for each of the six cells of the 2 x 3 experimental model (WG/Exact-title, mPY/Exact-title, ...). The proportion of each problem type was obtained by dividing these sums by the total number of errors in each cell. These proportions are shown in the three tables below:

                    WG              mPY              pPY
Exact-title mode    94.1% (n=80)    78.4% (n=223)    82.5% (n=239)
Keyword mode        41.6% (n=24)    47.2% (n=109)    79.2% (n=248)

Table 2: Proportion of Type-I Problems

                    WG              mPY              pPY
Exact-title mode     3.5% (n=3)     14.2% (n=40)     14.7% (n=42)
Keyword mode        39.3% (n=23)    38.0% (n=80)     16.0% (n=50)

Table 3: Proportion of Type-II Problems

                    WG              mPY              pPY
Exact-title mode     2.3% (n=2)      7.4% (n=21)      2.8% (n=8)
Keyword mode        19.1% (n=11)    14.9% (n=34)      4.8% (n=15)

Table 4: Proportion of Type-III Problems

It is interesting to note that in the keyword search mode the proportion of type-III problems (sets containing the record of the item sought, but discarded as they were considered too large to browse) is much smaller for polysyllabic pinyin searches than for the other two Romanization methods. This seems to indicate, as expected, that polysyllabic searches have a higher precision rate, in the sense that they generate fewer sets that end-users consider "too large to handle". Also of interest is the fact that the proportion of type-I problems (zero-hit queries) is much larger in exact-title searches than in keyword searches (roughly twice the size), except for polysyllabic searches, in which case the proportion is stable. This again shows that in keyword mode, polysyllabic searches tend to be more precise, with a higher proportion of zero-hit searches.
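The three-way classification and the per-cell proportions can be sketched in Python as follows. This is an assumed reconstruction of the rule as described in the text, not the program used in the study: the `LoggedQuery` fields are hypothetical names, the sample log is made up, and type-IV (structural) problems are assumed to have been filtered out beforehand.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass
class LoggedQuery:
    set_size: int           # number of records retrieved by the query
    contains_target: bool   # the retrieved set includes the item sought
    target_displayed: bool  # the participant displayed the record of the item sought

def problem_type(q: LoggedQuery) -> Optional[str]:
    """Return "I", "II" or "III" for an unsuccessful query, None for a successful one."""
    if q.target_displayed:
        return None          # successful query, not a problem
    if q.set_size == 0:
        return "I"           # zero-hit query
    if not q.contains_target:
        return "II"          # non-empty set without the item sought
    return "III"             # item retrieved but never displayed

def proportions(queries):
    """Share of each problem type among the unsuccessful queries of one cell."""
    counts = Counter(t for t in map(problem_type, queries) if t is not None)
    total = sum(counts.values())
    return {t: counts[t] / total for t in ("I", "II", "III")} if total else {}

# Made-up log for one cell: 8 zero-hit queries, 1 miss, 1 unbrowsed set, 2 successes.
cell = (
    [LoggedQuery(0, False, False)] * 8
    + [LoggedQuery(5, False, False)]
    + [LoggedQuery(120, True, False)]
    + [LoggedQuery(3, True, True)] * 2
)
print(proportions(cell))  # {'I': 0.8, 'II': 0.1, 'III': 0.1}
```

Successful queries are excluded from the denominator, which matches the statement that proportions are taken over the total number of errors in each cell.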

4.2. Cause of Failure

The text of all unsuccessful queries was also analysed to determine the cause of the failure (type-III problem queries were excluded since in these cases the failure was not due to the text of the query itself). Errors were classified following a grounded theory approach, meaning that categories were generated, as required, by the variety of error types revealed in the query text. Data were coded by the researcher over a period of one week. Results were compared and conflicts were resolved on a case-by-case basis.⁵ These data are summarized for each Romanization type in the tables below. Table 5 shows the distribution of errors by type over Romanization method, with structural errors ignored.

1st level       2nd level    WG              mPY                     pPY
Aggregation                  0.66 (65.7%)    0.67 (46.9%)            0.85 (60.8%)
                A1           0.08 (11.6%)    0.06 (9.4%)             0.70 (82.4%)
                A2           0.02 (3.5%)     0.10 (15.3%)            0.11 (12.3%)
                A3           0.56 (84.9%)    0.51 (75.3%)            0.05 (5.3%)
Romanization                 0.35 (34.3%)    0.76 (53.1%)            0.55 (39.2%)
                R1           0.11 (31.0%)    0.14 (19.0%)            0.10 (18.8%)
                R2           0.05 (15.6%)    [illegible] (0.9%)      0.00 (0.0%)
                R3           0.14 (40.0%)    0.50 (65.8%)            0.42 (76.5%)
                R4           0.05 (13.4%)    0.11 (14.3%)            0.03 (4.7%)

Key:
A1: Monosyllabic entries were aggregated or vice-versa
A2: Two normal syllables joined as if place or personal name
A3: Place or personal name not joined with [illegible]
R1: Character was mispronounced or misread
R2: Pinyin was used instead of Wade-Giles or vice-versa
R3: Confusion in sound of character
R4: Other types of Romanization errors that do not fit in either category

Table 5: Average number of errors per unsuccessful query

The data in the two tables above reveal several interesting facts. First, we can see that Romanization errors account for roughly one third to one half of all errors, depending on the Romanization method under investigation. This is a relatively high proportion, which reveals that end-users still have problems using proper Romanized strings to construct their search queries. Further observation at the 2nd level reveals that most Romanization errors are R3 errors, that is, they are caused by sound confusion of a phonetic nature.⁶ This confusion is typically observed with end-users whose pronunciation is not standard. Notice that the number of phonetic errors is surprisingly much smaller for Wade-Giles searches. This can be explained by the fact that Wade-Giles is in a way much more "forgiving" than pinyin when aspirated initials are used in search queries, since the aspiration mark (a reverse mark curved right to left, but incorrectly input as an ayn in MARC records) is not indexed. Therefore, even if the end-user confuses, for example, the sounds chen and ch'en, this has no consequence since all syllables chen and ch'en are indexed as chen. This phenomenon is not manifest in pinyin searches since the distinction between aspirated and unaspirated palatal consonants is expressed by using a distinct Roman letter or group of letters (zh/ch, z/c ...), and in the above example the syllables are written zhen and chen respectively, which produces two different queries. In Wade-Giles searches, we also observe a relatively high proportion of R2 errors (using pinyin instead of Wade-Giles) compared to the two pinyin groups. This shows that even when end-users are told to use Wade-Giles, they still unwittingly use pinyin from time to time, probably by force of habit.
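The "forgiving" behaviour of Wade-Giles indexing can be illustrated with a minimal Python sketch. The `index_key` function and the set of characters it strips are assumptions made for illustration and do not describe any particular OPAC; the pinyin forms zhen and chen are the standard equivalents of Wade-Giles chen and ch'en.

```python
# Apostrophe-like characters that may stand for the Wade-Giles aspiration mark.
# Which code points a given system actually ignores is an assumption here.
ASPIRATION_MARKS = {"'", "\u2018", "\u2019", "\u02bb", "\u02bf"}

def index_key(syllable: str) -> str:
    """Lower-case a syllable and drop aspiration marks, as an index ignoring them would."""
    return "".join(ch for ch in syllable.lower() if ch not in ASPIRATION_MARKS)

wade_giles = ["chen", "ch'en"]   # unaspirated vs aspirated initial
pinyin = ["zhen", "chen"]        # the same two syllables written in pinyin

print({index_key(s) for s in wade_giles})  # a single key: the confusion is harmless
print({index_key(s) for s in pinyin})      # two keys: the confusion changes the query
```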
As for aggregation errors, we can see from the data in Table 5 that not only the number but also the proportions of 2nd-level errors are almost identical between Wade-Giles and monosyllabic pinyin; this is normal since there exists virtually no difference in aggregation between these two methods (apart from the hyphens that are used in Wade-Giles to transcribe multi-syllabic place and personal names). On the other hand, the polysyllabic pinyin group exhibits completely different characteristics, with about ten times as many aggregation errors in common words (A1 errors), but ten times fewer aggregation errors in proper words for places and persons (A3 errors) compared to the other two groups. The dramatic decrease in A3 errors is explained by the fact that it makes more sense to the end-user to join the syllables of multi-syllable place and personal names when the rest of the entries are also in polysyllabic form. When using Wade-Giles and monosyllabic pinyin, participants made a lot of A3 errors simply because, being accustomed to entering queries in monosyllabic form, they forgot to join the syllables of multi-syllable place or personal names, even though this was clearly stated and explained in the general instructions, with a reminder and examples given just before the start of the search sessions. It is also possible that end-users were unable at times to detect these place names and personal names in the title, and thus failed to enter them in joined syllables. This clearly illustrates the fact that having a mixed aggregation format, as is the case with the current Wade-Giles method and the monosyllabic pinyin method proposed by LC, is confusing to the end-user. This effect is probably greater in real-life situations where end-users are not necessarily reminded of this peculiarity or may even be completely unaware of it. The problem just described is greatly decreased if polysyllabic transcription is chosen, as we can see from the data.
However, this is counterbalanced by the fact that participants were quite uncertain about the aggregation formats of common words (i.e., all words except personal and place names). This was the cause of many errors and, in polysyllabic sessions, end-users had to input, on average, a higher number of queries per title found.⁷ There is indeed a factorial effect when a participant is trying to detect the error(s) in a query string if, apart from being uncertain about the exact Romanization, he or she is also uncertain about aggregation. Unless the participant was willing to retry the search several times, the item might not be found. See the example in Table 6 (title: shengsichang).

Entry                                Set size       Display
QUERY: shenshi                       0              0
QUERY: shengshi                      3              0
QUERY: s[illegible] chang            0              0
QUERY: shengsi                       9              0
QUERY: [illegible] sichan            [illegible]    [illegible]

Table 6: Example of Trial-and-Error Search for Proper Romanization Form

As we can see from this log excerpt, the first query contains one aggregation error and two Romanization errors. In the second query, the participant tries a variation by changing the ending of the first syllable from a front to a back nasal; then, in the third query, the second Romanization error is corrected. At that point the Romanization is correct but the aggregation is still incorrect. Two more queries are necessary to get to the proper form.
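The multiplicative effect of combining the two kinds of uncertainty can be sketched in Python. The confusion sets below (front versus back nasal, shi versus si) are taken from the kind of errors discussed in this example, but treating them as the complete set of alternatives is an assumption, and `aggregation_patterns` is a hypothetical helper.

```python
from itertools import product

def aggregation_patterns(syllables):
    """Yield every way of joining adjacent syllables: 2**(n-1) patterns for n syllables."""
    for joins in product([False, True], repeat=len(syllables) - 1):
        words = [syllables[0]]
        for syllable, join in zip(syllables[1:], joins):
            if join:
                words[-1] += syllable   # glue onto the previous word
            else:
                words.append(syllable)  # start a new word
        yield " ".join(words)

# Assumed alternatives a participant might hesitate between for each syllable.
CONFUSIONS = [("sheng", "shen"), ("si", "shi"), ("chang",)]

candidates = {
    pattern
    for combo in product(*CONFUSIONS)
    for pattern in aggregation_patterns(list(combo))
}

print(len(candidates))               # 4 Romanizations x 4 aggregation patterns = 16
print("shengsichang" in candidates)  # True
```

With only two doubtful syllables and three syllables to aggregate, a participant already faces sixteen candidate query strings, only one of which has both the correct Romanization and the correct aggregation.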

5. Conclusions

Classification of unsuccessful queries revealed that in the keyword search mode the proportion of "large" sets (sets discarded by the end-user as they were considered too large to be browsed) is smaller for polysyllabic pinyin searches than for the other two Romanization methods. The proportion of zero-hit sets was also higher for keyword searches in polysyllabic mode than in the two monosyllabic groups (WG and mPY). This seems to indicate that in keyword mode, polysyllabic searches tend to be more precise, producing a higher proportion of zero-hit searches and a lower proportion of "large" sets than monosyllabic searches. These facts were not observed in phrase searches, which seems to indicate that in that mode aggregation of Romanized syllables does not affect the precision rate.

Polysyllabic searches generated more aggregation errors overall, which meant that end-users needed to issue more queries per title found. This, however, did not affect the success rate or the time required to complete the task, and thus has a minimal impact on retrieval performance.

Classification of error types revealed that many of the errors detected in the two monosyllabic groups (WG and mPY) were aggregation errors caused by the fact that end-users either failed to detect the presence of a multi-character place or personal name in the title, or simply forgot to aggregate the syllables of a multi-character place or personal name. It would therefore be beneficial, if monosyllabic pinyin is used to replace Wade-Giles, to transcribe everything in monosyllables, including multi-character personal and place names, as this seems to be a great source of confusion.

Classification of error types also revealed that, regardless of aggregation problems, a high proportion of errors were caused by incorrect Romanization of characters. This corroborates findings from previous studies, which revealed that relying solely on Romanization for retrieval is not very effective.

Notes

1 Note that even in monosyllabic transcription, multi-character place and personal names are transcribed in polysyllabic form: with a hyphen in Wade-Giles (e.g. Ts'e-tung) and without a hyphen in pinyin.
2 Since detecting differences in retrieval performance between the two pinyin groups was a more important issue, a larger number of participants were allocated to those groups. Fewer participants were assigned to the Wade-Giles group since it was expected that the WG/mPY and WG/pPY differences would be easier to observe than the mPY/pPY ones, thus requiring a smaller n.
3 The order of the titles in the lists was randomized to equalize the learning factor from title to title.
4 A type-IV problem may be that the participant searched for the item by entering the author's name rather than the title, which is simply invalid since the interface did not allow searching the author index.
5 Note that there was a greater than 95% match between the results of the three coding sessions.
6 The usual errors include confusions between dental and guttural nasals, between aspirated and unaspirated dental sibilants, retroflexes and palatals, and between dental sibilant and retroflex fricatives.
7 Analysis of the logs revealed that the success rate and the completion time were not affected by this.

References

Aissing, Alena L. 1992. Computer-oriented bibliographic control for Cyrillic documents with or without script conversion. Information Technology and Libraries, 11(4): 340-44.
Anderson, James D. 1972. A Comparative Study of Methods of Arranging Chinese Language Author-title Catalogs in Large American Chinese Language Collections. Ph.D. diss., Columbia University.
King, Paul L. 1983. Contextual Factors in Chinese Pinyin Writing. Ph.D. diss., Cornell University.
Research Libraries Group. 1987. RLG Chinese Aggregation Guidelines. Stanford: Research Libraries Group, Inc.
Suen, Ching Y. 1986. Computational Studies of the Most Frequent Chinese Words and Sounds. World Scientific.
Wellisch, Hans H. 1978. The Conversion of Scripts: Its Nature, History, and Utilization. New York: John Wiley & Sons.
Young, Joann S. 1992. Chinese Romanization change: A study on user preference. Cataloging & Classification Quarterly, 15(2): 15-35.