The First International Chinese Word Segmentation Bakeoff
Richard Sproat
AT&T Labs – Research
180 Park Avenue, Florham Park, NJ 07932, USA
[email protected]

Thomas Emerson
Basis Technology
150 CambridgePark Drive
Cambridge, MA 02140, USA
[email protected]

Abstract

This paper presents the results from the ACL-SIGHAN-sponsored First International Chinese Word Segmentation Bakeoff held in 2003 and reported in conjunction with the Second SIGHAN Workshop on Chinese Language Processing, Sapporo, Japan. We give the motivation for having an international segmentation contest (given that there have been two within-China contests to date) and we report on the results of this first international contest, analyze these results, and make some recommendations for the future.

1 Introduction

Chinese word segmentation is a difficult problem that has received a lot of attention in the literature; reviews of some of the various approaches can be found in (Wang et al., 1990; Wu and Tseng, 1993; Sproat and Shih, 2001). The problem with this literature has always been that it is very hard to compare systems, due to the lack of any common standard test set. Thus, an approach that seems very promising based on its published report is nonetheless hard to compare fairly with other systems, since the systems are often tested on their own selected test corpora. Part of the problem is also that there is no single accepted segmentation standard: there are several, including the four standards used in this evaluation.

A number of segmentation contests have been held in recent years within Mainland China, in the context of more general evaluations for Chinese-English machine translation. See (Yao, 2001; Yao, 2002) for the first and second of these; the third evaluation will be held in August 2003. The test corpora were segmented according to the Chinese national standard GB 13715 (GB/T 13715–92, 1993), though some lenience was granted in the case of plausible alternative segmentations (Yao, 2001); so while GB 13715 specifies the segmentation 毛/泽东 for Mao Zedong, 毛泽东 was also allowed. Accuracies in the mid 80's to mid 90's were reported for the four systems that participated in the first evaluation, with higher scores (many in the high nineties) being reported for the second evaluation.

The motivations for holding the current contest are twofold. First of all, by making the contest international, we are encouraging participation from people and institutions who work on Chinese word segmentation anywhere in the world. The final set of participants in the bakeoff includes two from Mainland China, three from Hong Kong, one from Japan, one from Singapore, one from Taiwan and four from the United States.

Secondly, as we have already noted, there are at least four distinct standards in active use, in the sense that large corpora are being developed according to those standards; see Section 2.1. It has also been observed that different segmentation standards are appropriate for different purposes: the segmentation standard that one might prefer for information retrieval applications is likely to be different from the one that one would prefer for text-to-speech synthesis; see (Wu, 2003) for useful discussion. Thus, while we do not subscribe to the view that any of the extant standards are, in fact, appropriate for any particular application, it nevertheless seems desirable to have a contest where people are tested against more than one standard.

A third point is that we decided early on that we would not be lenient in our scoring, so that alternative segmentations, as in the case of 毛泽东 Mao Zedong cited above, would not be allowed. While it would be fairly straightforward (in many cases) to automatically score both alternatives, we felt we could provide a more objective measure if we went strictly by the particular segmentation standard being tested on, and simply did not get into the business of deciding upon allowable alternatives.

Comparing segmenters is difficult. This is not only because of differences in segmentation standards but also due to differences in the design of systems: systems based exclusively (or even primarily) on lexical and grammatical analysis will often be at a disadvantage in the comparison, compared to systems trained exclusively on the training data. Competitions also may fail to predict the performance of a segmenter on new texts outside the training and testing sets. The handling of out-of-vocabulary words becomes a much larger issue in these situations than is accounted for within the test environment: a system that performs admirably in the competition may perform poorly on texts from different registers.

Another issue that is not accounted for in the current collection of evaluations is the handling of short strings with minimal context, such as queries submitted to a search engine. This has been studied indirectly through the cross-language information retrieval work performed for the TREC 5 and TREC 6 competitions (Smeaton and Wilkinson, 1997; Wilkinson, 1998).

This report summarizes the results of this First International Chinese Word Segmentation Bakeoff, provides some analysis of the results, and makes specific recommendations for future bakeoffs. One thing we do not do here is get into the details of specific systems; each of the participants was required to provide a four-page description of their system along with detailed discussion of their results, and these papers are published in this volume.

2 Details of the contest

2.1 Corpora

The corpora are detailed in Table 1. Links to descriptions of the corpora can be found at http://www.sighan.org/bakeoff2003/bakeoff_instr.html; publications on specific corpora are (Huang et al., 1997) (Academia Sinica) and (Xia, 1999) (Chinese Treebank); the Beijing University standard is very similar to that outlined in (GB/T 13715–92, 1993). Table 1 lists the abbreviations for the four corpora that will be used throughout this paper. The suffixes "o" and "c" will be used to denote open and closed tracks, respectively: thus "ASo,c" denotes the Academia Sinica corpus, both open and closed tracks, and "PKc" denotes the Beijing University corpus, closed track.

Corpus                    Abbrev.  Encoding                     # Train. Words  # Test. Words
Academia Sinica           AS       Big Five (MS Codepage 950)   5.8M            12K
U. Penn Chinese Treebank  CTB      EUC-CN (GB 2312-80)          250K            40K
Hong Kong CityU           HK       Big Five (HKSCS)             240K            35K
Beijing University        PK       GBK (MS Codepage 936)        1.1M            17K

Table 1: Corpora used.

During the course of this bakeoff, a number of inconsistencies in segmentation were noted in the CTB corpus by one of the participants. This was done early enough that it was possible for the CTB developers to correct some of the more common cases, both in the training and the test data. The revised training data was posted for participants, and the revised test data was used during the testing phase.

Inconsistencies were also noted by another participant for the AS corpus. Unfortunately this came too late in the process to correct the data. However, some informal tests on the revised testing data indicated that the differences were minor.

2.2 Rules and Procedures

The contest followed a strict set of guidelines and a rigid timetable. The detailed instructions for the bakeoff can be found at http://www.sighan.org/bakeoff2003/bakeoff_instr.html (with simplified and traditional Chinese versions also available). Training material was available starting March 15, testing material was available April 22, and the results had to be returned to the SIGHAN ftp site by April 25, no later than 17:00 EDT.

Upon initial registration, sites were required to declare which corpora they would be training and testing on, and whether they would be participating in the open or closed tracks (or both) on each corpus, where these were defined as follows:

For the open test, sites were allowed to train on the training set for a particular corpus, and in addition they could use any other material, including material from other training corpora, proprietary dictionaries, material from the WWW, and so forth. However, if a site selected the open track, the site was required to explain what percentage of the results came from which sources. For example, if the system did particularly well on out-of-vocabulary words, then the participants were required to explain if, for example, those results could mostly be attributed to having a good dictionary.

In the closed test, participants could only use training material from the training data for the particular corpus being tested on. No other material was allowed.

Other obvious restrictions applied: participants were prohibited from testing on corpora from their

2.3 Participating sites

Participating sites are shown in Table 2. These are a subset of the sites that had registered for the bakeoff, as some sites withdrew due to technical difficulties.

3 Further details of the corpora

An unfortunate, and sometimes unforeseen, complexity in dealing with Chinese text on the computer is the plethora of character sets and character encodings used throughout Greater China. This is demonstrated in the Encoding column of Table 1:

1. Both AS and HK utilize complex-form (or "traditional") characters, using variants of the Big Five character set. The Academia Sinica corpus is composed almost entirely of characters in pure Big Five (four characters, 0xFB5B, 0xFA76, 0xFB7A, and 0xFAAF, are outside the encoding range of Big Five), while the City University corpus utilizes 38 (34 unique) characters from the Hong Kong Supplementary Character Set (HKSCS) extension to Big Five.

2.