
Character-Level Dependencies in Chinese: Usefulness and Learning

Hai Zhao
Department of Chinese, Translation and Linguistics
City University of Hong Kong
Tat Chee Avenue, Kowloon, Hong Kong
[email protected]

Abstract

We investigate the possibility of exploiting character-based dependency for Chinese information processing. As Chinese text is made up of character sequences rather than word sequences, word in Chinese is not so natural a concept as in English, nor is word easy to define without argument for such a language. We therefore propose a character-level dependency scheme to represent the primary linguistic relationships within a Chinese sentence. The usefulness of character dependencies is verified through two specialized dependency parsing tasks. The first handles trivial character dependencies that are equally transformed from traditional word boundaries. The second furthermore considers the case that annotated internal character dependencies inside a word are involved. Both of these results from character-level dependency parsing are positive. This study provides an alternative way to formularize basic character- and word-level representation for Chinese.

1 Introduction

In many human languages, words can be naturally identified from writing. However, this is not the case for Chinese, for Chinese is born to be written in character[1] sequences rather than word sequences; namely, no natural separators such as blanks exist between words. As word does not appear in as natural a way as in most European languages[2], this brings the argument about how to determine word-hood in Chinese. Linguists' views about what a Chinese word is diverge so greatly that multiple word segmentation standards have been proposed for computational linguistics tasks since the first Bakeoff (Bakeoff-1, or Bakeoff-2003)[3] (Sproat and Emerson, 2003).

Up to Bakeoff-4, seven word segmentation standards have been proposed. However, this does not effectively solve the open problem of what a Chinese word should exactly be, but rather raises another issue: which segmentation standard should be selected for the subsequent application. As word often plays a basic role for further language processing, if it cannot be determined in a unified way, then all subsequent tasks will be affected to some degree.

Motivated by the dependency representation for syntactic parsing, which has drawn more and more interest since (Collins, 1999), we suggest that character-level dependencies can be adopted to alleviate this difficulty in Chinese processing. If we regard a traditional word boundary as a linear representation of the relation between neighbored characters, then character-level dependencies can provide a way to represent non-linear relations between non-neighbored characters. To show that character dependencies can be useful, we develop a parsing scheme for the related learning task and demonstrate its effectiveness.

The rest of the paper is organized as follows. The next section shows the drawbacks of the current word boundary representation through some language examples. Section 3 describes a character-level dependency parsing scheme for the traditional word segmentation task and reports its evaluation results. Section 4 verifies the usefulness of annotated character dependencies inside a word. Section 5 looks into a few issues concerning the role of character dependencies. Section 6 concludes the paper.

[1] Character here stands for the various tokens occurring in naturally written Chinese text, including Chinese characters (hanzi), punctuation, and foreign letters. However, Chinese characters often cover the most part.
[2] Even in European languages, a naive but necessary method to properly define word is to list them all by hand. We thank the first anonymous reviewer for pointing out this fact.
[3] First International Chinese Word Segmentation Bakeoff, available at http://www.sighan.org/bakeoff2003.

Proceedings of the 12th Conference of the European Chapter of the ACL, pages 879–887, Athens, Greece, 30 March – 3 April 2009. © 2009 Association for Computational Linguistics

2 To Segment or Not: That Is the Question

Though most words can be unambiguously defined in Chinese text, some word boundaries are not so easily determined. We show three such examples in the following.

The first example is from the MSRA segmented corpus of Bakeoff-2 (Bakeoff-2005) (Emerson, 2005), given here with its word-by-word gloss:

  a / piece of / “ / Beijing City Beijing Opera OK Sodality / member / entrance / ticket / ”

As the guideline of the MSRA standard requires any organization's full name to be a single word, many long words of this form are frequently encountered. Though this type of 'word' may be regarded as an effective unit to some extent, some smaller meaningful constituents can still be identified inside it. Some researchers argue that such items should be seen as phrases rather than words. In fact, a machine translation system, for example, will have to segment this type of word into smaller units for a proper translation.

The second example is from the PKU corpus of Bakeoff-2:

  中国 / 驻 / 南非 / 大使馆
  China / in / South Africa / embassy
  (the Chinese embassy in South Africa)

This example demonstrates how researchers can also feel inconvenienced if an organization name is segmented into pieces. Though the word '大使馆' (embassy) comes right after '南非' (South Africa) in the above phrase, the embassy does not belong to South Africa but to China; it is only located in South Africa.

The third example is an abbreviation that makes use of the characteristics of Chinese:

  星期 / 一 / 三 / 五
  week / one / three / five
  (Monday, Wednesday and Friday)

This example shows that there is a dilemma in segmenting these characters. If a segmentation position is placed before '三' (three) or '五' (five), then this makes them meaningless, or at least makes them lose their original meaning, because either of these two characters should logically follow the substring '星期' (week) to construct the expected word '星期三' (Wednesday) or '星期五' (Friday). Otherwise, taking all of the above five characters as one word means ignoring the logical dependency relations among these characters, and the word will have to be segmented later for proper handling, as with the first example above.

All these examples suggest that dependencies exist between discontinuous characters, and that the word boundary representation is insufficient to handle these cases. This motivates us to introduce character dependencies.

3 Character-Level Dependency Parsing

Character dependency is proposed as an alternative to word boundaries. The idea itself is extremely simple: character dependencies inside a sequence are annotated or formally defined in a similar way to how syntactic dependencies over words are usually annotated.

We initially develop a character-level dependency parsing scheme in this section. Especially, we show that character dependencies, even the trivial ones that are equally transformed from pre-defined word boundaries, can be effectively captured in a parsing way.

3.1 Formularization

Using a character-level dependency representation, we first show how a word segmentation task can be transformed into a dependency parsing problem. Since word segmentation has traditionally been formularized as an unlabeled character chunking task since (Xue, 2003), only unlabeled dependencies are concerned in the transformation. There are many ways to transform chunks in a sequence into a dependency representation. However, for the sake of simplicity, only well-formed and projective output sequences are considered in our processing.

Borrowing the notation from (Nivre and Nilsson, 2005), an unlabeled dependency graph is formally defined as follows:

An unlabeled dependency graph for a string of cliques (i.e., words or characters) W = w_1...w_n is an unlabeled directed graph D = (W, A), where

(a) W is the set of ordered nodes, i.e., clique tokens in the input string, ordered by a linear precedence relation <,
(b) A is a set of unlabeled arcs (w_i, w_j), where w_i, w_j ∈ W.

If (w_i, w_j) ∈ A, w_i is called the head of w_j and w_j a dependent of w_i. Traditionally, the notation w_i → w_j means (w_i, w_j) ∈ A, and w_i →* w_j denotes the reflexive and transitive closure of the (unlabeled) arc relation. We assume that the designed dependency structure satisfies the following common constraints from the existing literature (Nivre, 2006).

(1) D is weakly connected, that is, the corresponding undirected graph is connected. (CONNECTEDNESS)

(2) The graph D is acyclic, i.e., if w_i → w_j then not w_j →* w_i. (ACYCLICITY)

(3) There is at most one arc (w_i, w_j) ∈ A for every w_j ∈ W. (SINGLE-HEAD)

(4) An arc w_i → w_k is projective iff, for every word w_j occurring between w_i and w_k in the string (w_i < w_j < w_k or w_i > w_j > w_k), w_i →* w_j. (PROJECTIVITY)

We say that D is well-formed iff it is acyclic and connected, and D is projective iff every arc in A is projective. Note that the above four conditions entail that the graph D is a single-rooted tree. For an arc w_i → w_j, if w_i < w_j, it is called a right-arc, otherwise a left-arc.

Following the above four constraints and considering segmentation characteristics, we construct two character dependency representation schemes, as shown in Figure 1, by using a series of trivial dependencies inside or outside a word. Note that we use the arc direction to distinguish the connected and segmented relations among characters. The scheme with the assistant root node before the sequence in Figure 1 is called Scheme B, and the other Scheme E.

[Figure 1: Two character dependency schemes]
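For concreteness, the following minimal sketch (an illustrative example only, not the implementation used in our experiments) checks well-formedness and projectivity for a structure encoded as a head vector; the SINGLE-HEAD constraint is implicit in that encoding.

def is_well_formed_and_projective(heads):
    """heads[j] is the index of the head of node j, or None for the root."""
    n = len(heads)

    # CONNECTEDNESS and ACYCLICITY: there is exactly one root, and following
    # head links from any node must reach it without revisiting a node.
    if sum(1 for h in heads if h is None) != 1:
        return False
    for j in range(n):
        seen, k = set(), j
        while heads[k] is not None:
            if k in seen:                      # a cycle was found
                return False
            seen.add(k)
            k = heads[k]

    # PROJECTIVITY: for every arc i -> k, each node j strictly between i and k
    # must satisfy i ->* j under the reflexive and transitive closure.
    def dominates(i, j):
        while j is not None:
            if j == i:
                return True
            j = heads[j]
        return False

    for k in range(n):
        i = heads[k]
        if i is None:
            continue
        lo, hi = min(i, k), max(i, k)
        if not all(dominates(i, j) for j in range(lo + 1, hi)):
            return False
    return True

# A trivial chain over five characters: each character depends on its
# successor and the last one is the root.
print(is_well_formed_and_projective([1, 2, 3, 4, None]))   # True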

3.2 Shift-reduce Parsing

According to (McDonald and Nivre, 2007), all data-driven models for dependency parsing that have been proposed in recent years can be described as either graph-based or transition-based. Since both dependency schemes that we construct for parsing are well-formed and projective, the latter is chosen as the parsing framework for the sake of efficiency. In detail, a shift-reduce method is adopted as in (Nivre, 2003).

The method is step-wise, and a classifier is used to make the parsing decision at each step. In each step, the classifier checks a clique pair[4], namely TOP, the top of a stack that consists of the processed cliques, and INPUT, the first clique in the unprocessed sequence, to determine whether a dependency relation should be established between them. Besides the two arc-building actions, a shift action and a reduce action are also defined, as follows:

  Left-arc: Add an arc from INPUT to TOP and pop the stack.
  Right-arc: Add an arc from TOP to INPUT and push INPUT onto the stack.
  Reduce: Pop TOP from the stack.
  Shift: Push INPUT onto the stack.

In this work, we adopt a left-to-right arc-eager parsing model, which means that the parser scans the input sequence from left to right and right dependents are attached to their heads as soon as possible (Hall et al., 2007). In the implementation, all four actions are required to pass through an input sequence for Scheme E, while only three actions are needed for Scheme B, i.e., the reduce action is never used.

[4] Here, a clique means a character or a word in a sequence, depending on what constructs the sequence.
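As a concrete illustration of this transition system, the following minimal sketch (again illustrative only, not the implementation evaluated below) applies the four actions to a clique sequence; the function decide stands in for the classifier described in the next subsection and is assumed to enforce the usual arc-eager preconditions. Under Scheme B, decide would simply never return the reduce action.

def arc_eager_parse(cliques, decide):
    """Return unlabeled arcs (head, dependent) over clique positions 0..n-1."""
    stack, arcs = [], []
    i = 0                                      # position of INPUT
    while i < len(cliques):
        top = stack[-1] if stack else None
        action = decide(top, i, arcs) if stack else "shift"
        if action == "left-arc":               # arc from INPUT to TOP; pop TOP
            arcs.append((i, stack.pop()))
        elif action == "right-arc":            # arc from TOP to INPUT; push INPUT
            arcs.append((top, i))
            stack.append(i)
            i += 1
        elif action == "reduce":               # pop TOP (it already has a head)
            stack.pop()
        else:                                  # shift: push INPUT
            stack.append(i)
            i += 1
    return arcs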

3.3 Learning Model and Features

While memory-based and margin-based learning approaches such as support vector machines are popularly applied to shift-reduce parsing, we apply a maximum entropy model as the learning model, for efficient training and for producing comparable results. Our implementation of maximum entropy adopts the L-BFGS algorithm for parameter optimization, as usual. No additional feature selection techniques are used.

With the notations defined in Table 1, the feature set shown in Table 2 is adopted. Here, we explain some terms in Tables 1 and 2.

Table 1: Feature Notations

  Notation     Meaning
  s            The character on the top of the stack
  s−1, ...     The first character below the top of the stack, etc.
  i, i+1, ...  The first (second) character in the unprocessed sequence, etc.
  dprel        Dependent label
  h            Head
  lm           Leftmost child
  rm           Rightmost child
  rn           Right nearest child
  char         Character form
  .            's, i.e., 's.dprel' means the dependent label of the character on the top of the stack
  +            Feature combination, i.e., 's.char+i.char' means that both s.char and i.char work as a feature function

Table 2: Features for Parsing

  Basic        Extension
  x.char       itself, its previous two and next two characters, and all bigrams within the five-character window (x is s or i)
  s.h.char
  s.dprel
  s.rm.dprel
  s−1.cnseq
  s−1.cnseq+s.char
  s−1.curroot.lm.cnseq
  s−1.curroot.lm.cnseq+s.char
  s−1.curroot.lm.cnseq+i.char
  s−1.curroot.lm.cnseq+s−1.cnseq
  s−1.curroot.lm.cnseq+s.char+s−1.cnseq
  s−1.curroot.lm.cnseq+i.char+s−1.cnseq
  s.avn+i.avn, n = 1, 2, 3, 4, 5
  preact−1
  preact−2
  preact−2+preact−1

Since we only consider unlabeled dependency parsing, dprel means the arc direction from the head, either left or right. The feature curroot returns the root of the partial parsing tree that includes the specified node. The feature cnseq returns a substring starting from a given character: it checks the direction of the arc that passes the given character and collects all characters with the same arc direction to yield an output substring, until the arc direction changes. Note that all combinational features concerned with this one can be regarded as word-level features. The feature preact_n returns the type of a previous parsing action, and the subscript n stands for its order before the current action.

The feature av is derived from unsupervised segmentation as in (Zhao and Kit, 2008a), and the accessor variety (AV) (Feng et al., 2004) is adopted as the unsupervised segmentation criterion. The AV value of a substring s is defined as

  AV(s) = min{L_av(s), R_av(s)},

where the left and right AV values L_av(s) and R_av(s) are defined, respectively, as the numbers of its distinct predecessor and successor characters. In this work, AV values for substrings are derived from the unlabeled training and test corpora by substring counting. Multiple features are used to represent substrings of various lengths identified by the AV criterion. Formally put, the feature function for an n-character substring s with score AV(s) is defined as

  av_n = t,  if 2^t ≤ AV(s) < 2^(t+1),        (1)

where t is an integer that logarithmizes the score and is taken as the feature value. For a character covered by several overlapping substrings, we only choose the substring with the greatest AV score to activate the above feature function for that character.
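To make the accessor variety features concrete, the following minimal sketch (illustrative only; the function names are ours) counts AV(s) from an unlabeled corpus by substring counting and maps it to the integer feature value t of equation (1).

from collections import defaultdict

def accessor_variety(corpus_lines, max_len=5):
    """Count distinct predecessors/successors of every substring up to max_len."""
    left, right = defaultdict(set), defaultdict(set)
    for line in corpus_lines:
        for i in range(len(line)):
            for n in range(1, max_len + 1):
                if i + n > len(line):
                    break
                s = line[i:i + n]
                if i > 0:
                    left[s].add(line[i - 1])
                if i + n < len(line):
                    right[s].add(line[i + n])
    return {s: min(len(left.get(s, ())), len(right.get(s, ())))
            for s in set(left) | set(right)}

def av_feature(av_table, s):
    """Return t with 2^t <= AV(s) < 2^(t+1), or None if AV(s) is zero or unseen."""
    value = av_table.get(s, 0)
    return value.bit_length() - 1 if value > 0 else None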
3.4 Decoding

Without Markovian features like preact−1, a shift-reduce parser can scan through an input sequence in linear time. That is, the decoding of this parsing method for word segmentation is extremely fast. The time complexity of decoding is 2L for Scheme E and L for Scheme B, where L is the length of the input sequence.

However, decoding becomes somewhat more complicated when Markovian features are involved. Following the work of (Duan et al., 2007), decoding in this case searches for the parsing action sequence with the maximal probability,

  S_d = argmax ∏_i p(d_i | d_{i−1} d_{i−2} ...),

where S_d is the object parsing action sequence, p(d_i | d_{i−1} ...) is the conditional probability, and d_i is the i-th parsing action. We use a beam search algorithm as in (Ratnaparkhi, 1996) to find the object parsing action sequence. The time complexity of this beam search algorithm is 4BL for Scheme E and 3BL for Scheme B, where B is the beam width.
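The following minimal sketch (illustrative only, with hypothetical callback names) shows such a beam search over action sequences: partial sequences are scored by the sum of the log probabilities of their actions, only the best B hypotheses are kept at each step, and the highest-scoring complete sequence is returned.

import math

def beam_search(initial_state, action_probs, expand, is_final, beam_width=5):
    """
    action_probs(state, history) gives {action: probability} from the classifier;
    expand(state, action) returns the next state; is_final(state) tests completion.
    """
    beam = [(0.0, initial_state, [])]           # (log-probability, state, actions)
    while not all(is_final(state) for _, state, _ in beam):
        candidates = []
        for logp, state, history in beam:
            if is_final(state):
                candidates.append((logp, state, history))
                continue
            for action, p in action_probs(state, history).items():
                candidates.append((logp + math.log(p),
                                   expand(state, action),
                                   history + [action]))
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return max(beam, key=lambda c: c[0])[2]     # best complete action sequence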

3.5 Related Methods

Among character-based learning techniques for word segmentation, we may identify two main types: classification (GOH et al., 2004) and tagging (Low et al., 2005). Both character classification and tagging need to define the position of a character inside a word. Traditionally, the four tags b, m, e, and s stand, respectively, for the beginning, middle, and end of a word, and a single character as a word, since (Xue, 2003). The following n-gram features from (Xue, 2003; Low et al., 2005) are used as basic features:

  (a) C_n (n = −2, −1, 0, 1, 2),
  (b) C_n C_{n+1} (n = −2, −1, 0, 1),
  (c) C_{−1} C_1,

where C stands for a character and the subscripts for its relative order to the current character C_0. In addition, the feature av defined in equation (1) is also taken as an option; av_n (n = 1, ..., 5) is applied as a feature for the current character.

When word segmentation is conducted as a classification task, each individual character is simply assigned the tag with the maximal probability given by the classifier. In this case, we restore word boundaries only according to the two tags b and s. However, the output tag sequence given by character classification may include illegal tag transitions (e.g., m right after e). In (Low et al., 2005), a dynamic programming algorithm is adopted to find the tag sequence with the maximal joint probability among all legal tag sequences. If such dynamic programming decoding is adopted, then this method for word segmentation is regarded as character tagging[5].
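A minimal sketch of this decoding step (illustrative only; it is not the implementation of (Low et al., 2005)) is given below: each character receives a classifier probability for each of the tags b, m, e, and s, and a dynamic program keeps only legal transitions while maximizing the joint probability.

import math

LEGAL = {"b": {"m", "e"}, "m": {"m", "e"}, "e": {"b", "s"}, "s": {"b", "s"}}
START, END = {"b", "s"}, {"e", "s"}

def best_tag_sequence(char_probs):
    """char_probs[i][t] is the classifier probability of tag t for character i."""
    # chart[i][t] = (best log-probability of a legal sequence ending in t, previous tag)
    chart = [{t: (math.log(p), None) for t, p in char_probs[0].items() if t in START}]
    for probs in char_probs[1:]:
        row = {}
        for t, p in probs.items():
            candidates = [(score, prev) for prev, (score, _) in chart[-1].items()
                          if t in LEGAL[prev]]
            if candidates:
                score, prev = max(candidates)
                row[t] = (score + math.log(p), prev)
        chart.append(row)
    tag = max((t for t in chart[-1] if t in END), key=lambda t: chart[-1][t][0])
    tags = [tag]
    for i in range(len(chart) - 1, 0, -1):      # follow back-pointers
        tag = chart[i][tag][1]
        tags.append(tag)
    return list(reversed(tags))

probs = [{"b": .6, "s": .4}, {"m": .3, "e": .5, "b": .2}, {"e": .7, "s": .3}]
print(best_tag_sequence(probs))                 # ['b', 'm', 'e']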

The time complexity of the character-based classification method for decoding is L, which is the best possible decoding speed. When dynamic programming is applied, the time complexity becomes 16L with four tags.

Recently, conditional random fields (CRFs) have become popular for word segmentation, since they provide slightly better performance than the maximum entropy method does (Peng et al., 2004). However, a CRF is a structural learning tool rather than a simple classification framework. As shift-reduce parsing is a typical step-wise method that checks each character one by one, it is reasonable to compare it to a classification method over characters.

[5] Someone may argue that the maximum entropy Markov model (MEMM) is the true tagging tool here; indeed, that method was initialized by (Xue, 2003). However, our empirical results show that MEMM never outperforms maximum entropy plus dynamic programming decoding as in (Low et al., 2005) for Chinese word segmentation. We also know that the latter reports the best results in Bakeoff-2. This is why the MEMM method is excluded from our comparison.

3.6 Evaluation Results

The experiments in this section are performed on all four corpora of Bakeoff-2. Corpus size information is given in Table 3.

Table 3: Corpus size of Bakeoff-2 in number of words

               AS     CityU   MSRA   PKU
  Training(M)  5.45   1.46    2.37   1.1
  Test(K)      122    41      107    104

Traditionally, word segmentation performance is measured by the F-score, F = 2RP/(R + P), where the recall R and precision P are the proportions of correctly segmented words to all words in, respectively, the gold-standard segmentation and the segmenter's output. To compute the word F-score, all parsing results are restored to word boundaries according to the direction of the output arcs.
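For clarity, the following minimal sketch (illustrative only, not the official Bakeoff scorer) computes this word F-score by comparing the character spans of gold-standard and system words.

def segmentation_f_score(gold_words, system_words):
    """F = 2RP / (R + P), with words compared by their character spans."""
    def spans(words):
        out, start = set(), 0
        for w in words:
            out.add((start, start + len(w)))
            start += len(w)
        return out

    gold, system = spans(gold_words), spans(system_words)
    correct = len(gold & system)
    if correct == 0:
        return 0.0
    recall, precision = correct / len(gold), correct / len(system)
    return 2 * recall * precision / (recall + precision)

# Two of the three system words below match gold spans, so F is about 0.571.
print(segmentation_f_score(["星期", "一", "三", "五"], ["星期一", "三", "五"]))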

Table 4: The results of parsing and classification/tagging approaches using different feature combinations

  S.[a]  Feature      AS    CityU  MSRA  PKU
  B      Basic[b]     .935  .922   .950  .917
         +AV[c]       .941  .933   .956  .927
         +Prev[d]     .937  .923   .951  .918
         +AV+Prev     .942  .935   .958  .929
  E      Basic        .940  .932   .957  .926
         +AV          .948  .947   .964  .942
         +Prev        .944  .940   .962  .931
         +AV+Prev     .949  .951   .967  .943
  C[f]   n-gram/c[e]  .933  .923   .948  .923
         +AV/c        .942  .936   .957  .933
         n-gram/d[g]  .945  .938   .956  .936
         +AV/d        .950  .949   .966  .945

  [a] Scheme.  [b] Features in the top two blocks of Table 2.  [c] The five av features added on top of the basic features.  [d] The three Markovian features of Table 2 added on top of the basic features.  [e] /c: classification.  [f] Character classification or tagging using maximum entropy.  [g] /d: search only among legal tag sequences.

Our comparison with existing work is conducted under the closed test rules of Bakeoff: no information beyond the training corpus is allowed, while the open test of Bakeoff has no such restriction.

The results with the different dependency schemes are given in Table 4. When the feature preact is involved, a beam search algorithm with width 5 is used for decoding; otherwise, simple shift-reduce decoding is used. We see that the performance given by Scheme E is much better than that given by Scheme B. The results of the character-based classification and tagging methods are at the bottom of Table 4[6]. It is observed that the parsing method outperforms the classification and tagging methods when the latter use no Markovian features or decoding over the whole sequence. When full features are used, the two provide similar performance.

Due to using a global model like CRFs, our previous work (Zhao et al., 2006; Zhao and Kit, 2008c) has reported the best results so far over the evaluated corpora of Bakeoff-2[7]. Though those results are slightly better than the results here, we still see that the results of the character-level dependency parsing approach (Scheme E) are comparable to those state-of-the-art results on each evaluated corpus.

[6] Only the results of the open track are reported in (Low et al., 2005), while we give a comparison following closed-track rules, so our results here are not comparable to those of (Low et al., 2005).
[7] As n-gram features are used, the F-scores in (Zhao et al., 2006) are AS: 0.953, CityU: 0.948, MSRA: 0.974, PKU: 0.952.

4 Character Dependencies inside a Word

We further consider exploiting annotated character dependencies inside a word (internal dependencies). A parsing task for these internal dependencies, incorporated with the trivial external dependencies[8] that are transformed from common word boundaries, is correspondingly proposed, using the same parsing approach as in the previous section.

[8] We correspondingly call the dependencies that mark word boundaries external dependencies, in contrast to the internal dependencies.

4.1 Annotation of Internal Dependencies

In Subsection 3.1, we assign trivial character dependencies inside a word for the parsing task of word segmentation, i.e., each character is taken as the head of its predecessor or successor. These trivially defined dependencies may go against the syntactic or semantic senses of the characters, as we have discussed in Section 2. Now we consider human-annotated character dependencies inside a word.

As a corpus with annotated internal dependencies has not been available until now, we launched an annotation job based on the UPUC segmented corpus of Bakeoff-3 (Bakeoff-2006) (Levow, 2006). The training corpus contains 880K characters and the test corpus 270K. However, the essential part of the annotation job is actually conducted on a lexicon.

After a lexicon is extracted from the CTB segmented corpus, we use a top-down strategy to annotate internal dependencies inside the words of the lexicon. A long word is first split into smaller constituents, dependencies among these constituents are determined, and character dependencies inside each constituent are then annotated. Some simple rules are adopted to determine the dependency relation, e.g., modifiers are marked as dependents and the single remaining constituent is marked as the head at last. For some words the internal dependency relation is hard to determine, such as foreign names, e.g., '葡萄牙' (Portugal) and '马拉多纳' (Maradona), and uninterrupted words (连绵词), e.g., '蚂蚁' (ant) and '苜蓿' (clover). In this case, we simply adopt a series of linear dependencies with the last character as head to mark these words.

In the previous section, we have shown that Scheme E is the better dependency representation for encoding word boundaries. Annotated internal dependencies are therefore used to replace the trivial internal dependencies of Scheme E to obtain the corpus that we require. Note that now we cannot distinguish internal and external dependencies according to the arc direction alone any more, as both left- and right-arcs can appear in the internal character dependency representation. Thus two labeled left arcs, external and internal, are used for disambiguation in the annotation. As internal dependencies are introduced, we find that some words (about 10%) are constructed from two or more parallel constituent parts according to our annotations. This not only makes two labeled arcs insufficient to distinguish internal and external dependencies, but also makes parsing extremely difficult; namely, a great amount of non-projective dependencies will appear if we introduce these internal dependencies directly. Again, we adopt a series of linear dependencies with the last character as head to represent the internal dependencies of these words, ignoring their parallel constituents. To handle the remaining non-projectivities, a strengthened pseudo-projectivization technique as in (Zhao and Kit, 2008b) is used during parsing. An annotated example is illustrated in Figure 2.

[Figure 2: Annotated internal dependencies (arc label e denotes trivial external dependencies)]

4.2 Learning of Internal Dependencies

To demonstrate that internal character dependencies are helpful for further processing, a series of word segmentation experiments similar to those in Subsection 3.6 is performed. Note that this task is slightly different from the previous one, as it is a five-class parsing action classification task: the left arc now carries two labels to distinguish internal and external dependencies. Thus a different feature set has to be used. However, all input sequences are still projective.

The features listed in Table 5 are adopted for this parsing task in which annotated character dependencies exist inside words. The feature curtree in Table 5 is similar to cnseq of Table 2. It first greedily searches all connected characters, starting from the given one, until an arc with the external label is found over some character. It then collects all characters that have been reached to yield an output substring as the feature value.

Table 5: Features for internal dependency parsing

  Basic            Extension
  s.char           itself, its next two characters, and all bigrams within the three-character window
  i.char           its previous one and next three characters, and all bigrams within the four-character window
  s.char+i.char
  s.h.char
  s.rm.dprel
  s.curtree
  s.curtree+s.char
  s−1.curtree+s.char
  s.curroot.lm.curtree
  s−1.curroot.lm.curtree
  s.curroot.lm.curtree+s.char
  s−1.curroot.lm.curtree+s.char
  s.curtree+s.curroot.lm.curtree
  s−1.curtree+s−1.curroot.lm.curtree
  s.curtree+s.curroot.lm.curtree+s.char
  s−1.curtree+s−1.curroot.lm.curtree+s.char
  s−1.curtree+s−1.curroot.lm.curtree+i.char
  x.avn, n = 1, ..., 5 (x is s or i)
  s.avn+i.avn, n = 1, ..., 5
  preact−1
  preact−2
  preact−2+preact−1

A comparison of the classification/tagging and parsing methods is given in Table 6. To evaluate the results with the word F-score, all external dependencies in the outputs are restored as word boundaries. Three models are evaluated in Table 6. It is shown that there is a significant performance enhancement when annotated internal character dependencies are introduced. This positive result shows that annotated internal character dependencies are meaningful.

Table 6: Comparison of different methods

  Approach[a]    basic  +AV   +Prev[b]  +AV+Prev
  Class/Tag[c]   .918   .935  .928      .941
  Parsing/wo[d]  .921   .937  .924      .942
  Parsing/w[e]   .925   .940  .929      .945

  [a] The highest F-score in Bakeoff-3 is 0.933.  [b] For the tagging method, this means dynamic programming decoding; for the parsing method, the three Markovian features.  [c] Character-based classification or tagging method.  [d] Using trivial internal dependencies in Scheme E.  [e] Using annotated internal character dependencies.
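The following minimal sketch (illustrative only; the exact attachment convention is an assumption made for this example) shows how words are read off such an output: each character carries the label of the arc attaching it, and an external arc marks a word boundary after that character.

def restore_words(chars, labels):
    """labels[i] is the label of the arc attached to chars[i]: 'internal' or 'external'."""
    words, current = [], ""
    for ch, label in zip(chars, labels):
        current += ch
        if label == "external":                 # a word boundary follows this character
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words

print(restore_words(list("中国驻南非大使馆"),
                    ["internal", "external", "external", "internal",
                     "external", "internal", "internal", "external"]))
# ['中国', '驻', '南非', '大使馆']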
5 Is Word Still Necessary?

Note that this work is not about the joint learning of word boundaries and syntactic dependencies as in (Luo, 2003), where a character-based tagging method is used for syntactic constituent parsing from unsegmented Chinese text. Instead, this work explores an alternative way to represent "word-hood" in Chinese, based on character-level dependencies instead of the traditional definition of word boundaries.

Though considering dependencies among words is not novel (Gao and Suzuki, 2004), we recognize that this study is the first work concerned with character dependency. This study originally intends to lead us to consider an alternative representation that can play the same role as word boundary annotations.

In Chinese, not the word but the character is the actual minimal unit of writing and speaking. Word-hood has been carefully defined by many means, and this effort has resulted in the multi-standard segmented corpora provided by the series of Bakeoff evaluations. However, from the view of linguistics, Bakeoff does not solve the problem but technically skirts round it: when one asks what a Chinese word is, Bakeoff just answers that we have many definitions and each one is fine. Instead, motivated by the results of the previous two sections, we suggest that the character dependency representation could present a natural and unified way to alleviate the drawbacks of the word boundary representation, which is only able to represent relations between neighbored characters.

Table 7: What we have done for character dependency

  Internal    External    Our work
  trivial     trivial     Section 3
  annotated   trivial     Section 4
  annotated   annotated   ?

If we regard our current work as stepping into more and more annotated character dependencies, as shown in Table 7, then it is natural to extend annotated internal character dependencies to the whole sequence, without those unnatural word boundary constraints. In this sense, internal and external character dependencies would not need to be distinguished any more. A full character-level dependency tree is illustrated in Figure 3(a)[9]. With the help of such a tree, we may define a word or even a phrase according to which part of the subtree is picked up. Word-hood, if we still need this concept, can be freely determined later, as the further processing purpose requires.

[Figure 3: Extended character dependencies, with (a) a full character-level dependency tree and (b) internal dependencies extended with part-of-speech tags]

Basically we only consider unlabeled dependencies in this work, and the dependent labels remain free to be used for something else; e.g., Figure 3(b) shows how to extend the internal character dependencies of Figure 2 to accommodate part-of-speech tags. This extension can also be transplanted to a full character dependency tree as in Figure 3(a), which may then lead to a character-based labeled syntactic dependency tree. In brief, we see that character dependencies provide a more general and natural way to reflect character relations within a sequence than word boundary annotations do.

[9] We may easily build such a corpus by embedding annotated internal dependencies into a word-level dependency treebank. As the UPUC corpus of Bakeoff-3 just follows the word segmentation convention of the Chinese treebank, we have built such a full character-level dependency tree corpus.

6 Conclusion and Future Work

In this study, we initially investigate the possibility of exploiting character dependencies for Chinese. To show that character-level dependency can be a good alternative to the word boundary representation for Chinese, we carry out a series of parsing experiments, developing the techniques step by step. Firstly, we show that the word segmentation task can be effectively re-formularized as character-level dependency parsing; the results of a character-level dependency parser are comparable with those of traditional methods. Secondly, we consider annotated character dependencies inside a word, and show that a parser can still effectively capture both these annotated internal character dependencies and the trivial external dependencies that are transformed from word boundaries. The experimental results show that annotated internal dependencies even bring a performance enhancement, which indirectly verifies their usefulness. Finally, we suggest that a fully annotated character dependency tree can be constructed over all possible character pairs within a given sequence, though its usefulness remains to be explored in future work.

Acknowledgements

This work has benefited from many sources, including the three anonymous reviewers. Especially, the authors are grateful to two colleagues: one reviewer from EMNLP-2008 who gave some very insightful comments that helped us extend this work, and Mr. SONG Yan, who annotated the internal dependencies of the 22K most frequent words extracted from the UPUC segmentation corpus. Of course, it remains the duty of the first author if anything in this work is still wrong.

References

Michael Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

Xiangyu Duan, Jun Zhao, and Bo Xu. 2007. Probabilistic parsing action models for multi-lingual dependency parsing. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 940–946, Prague, Czech Republic, June 28-30.

Thomas Emerson. 2005. The second international Chinese word segmentation bakeoff. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, pages 123–133, Jeju Island, Korea, October 14-15.

Haodi Feng, Kang Chen, Xiaotie Deng, and Weimin Zheng. 2004. Accessor variety criteria for Chinese word extraction. Computational Linguistics, 30(1):75–93.

Jianfeng Gao and Hisami Suzuki. 2004. Capturing long distance dependency in language modeling: An empirical study. In K.-Y. Su, J. Tsujii, J. H. Lee, and O. Y. Kwong, editors, Natural Language Processing - IJCNLP 2004, volume 3248 of Lecture Notes in Computer Science, pages 396–405, Sanya, Hainan Island, China, March 22-24.

Chooi-Ling GOH, Masayuki Asahara, and Yuji Matsumoto. 2004. Chinese word segmentation by classification of characters. In ACL SIGHAN Workshop 2004, pages 57–64, Barcelona, Spain, July. Association for Computational Linguistics.

Johan Hall, Jens Nilsson, Joakim Nivre, Gülsen Eryiğit, Beáta Megyesi, Mattias Nilsson, and Markus Saers. 2007. Single malt or blended? A study in multilingual parser optimization. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 933–939, Prague, Czech Republic, June.

Gina-Anne Levow. 2006. The third international Chinese language processing bakeoff: Word segmentation and named entity recognition. In Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, pages 108–117, Sydney, Australia, July 22-23.

Jin Kiat Low, Hwee Tou Ng, and Wenyuan Guo. 2005. A maximum entropy approach to Chinese word segmentation. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, pages 161–164, Jeju Island, Korea, October 14-15.

Xiaoqiang Luo. 2003. A maximum entropy Chinese character-based parser. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing (EMNLP 2003), pages 192–199, Sapporo, Japan, July 11-12.

Ryan McDonald and Joakim Nivre. 2007. Characterizing the errors of data-driven dependency parsing models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 122–131, Prague, Czech Republic, June 28-30.

Joakim Nivre and Jens Nilsson. 2005. Pseudo-projective dependency parsing. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-2005), pages 99–106, Ann Arbor, Michigan, USA, June 25-30.

Joakim Nivre. 2003. An efficient algorithm for projective dependency parsing. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT 03), pages 149–160, Nancy, France, April 23-25.

Joakim Nivre. 2006. Constraints on non-projective dependency parsing. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-2006), pages 73–80, Trento, Italy, April 3-7.

Fuchun Peng, Fangfang Feng, and Andrew McCallum. 2004. Chinese segmentation and new word detection using conditional random fields. In COLING 2004, pages 562–568, Geneva, Switzerland, August 23-27.

Adwait Ratnaparkhi. 1996. A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 133–142, University of Pennsylvania.

Richard Sproat and Thomas Emerson. 2003. The first international Chinese word segmentation bakeoff. In The Second SIGHAN Workshop on Chinese Language Processing, pages 133–143, Sapporo, Japan.

Nianwen Xue. 2003. Chinese word segmentation as character tagging. Computational Linguistics and Chinese Language Processing, 8(1):29–48.

Hai Zhao and Chunyu Kit. 2008a. Exploiting unlabeled text with different unsupervised segmentation criteria for Chinese word segmentation. In Research in Computing Science, volume 33, pages 93–104.

Hai Zhao and Chunyu Kit. 2008b. Parsing syntactic and semantic dependencies with two single-stage maximum entropy models. In Twelfth Conference on Computational Natural Language Learning (CoNLL-2008), pages 203–207, Manchester, UK, August 16-17.

Hai Zhao and Chunyu Kit. 2008c. Unsupervised segmentation helps supervised learning of character tagging for word segmentation and named entity recognition. In The Sixth SIGHAN Workshop on Chinese Language Processing, pages 106–111, Hyderabad, India, January 11-12.

Hai Zhao, Chang-Ning Huang, Mu Li, and Bao-Liang Lu. 2006. Effective tag set selection in Chinese word segmentation via conditional random field modeling. In Proceedings of the 20th Asian Pacific Conference on Language, Information and Computation, pages 87–94, Wuhan, China, November 1-3.