Character-Level Dependencies in Chinese: Usefulness and Learning
Hai Zhao
Department of Chinese, Translation and Linguistics
City University of Hong Kong
Tat Chee Avenue, Kowloon, Hong Kong, China
[email protected]

Abstract

We investigate the possibility of exploiting character-based dependencies for Chinese information processing. As Chinese text is made up of character sequences rather than word sequences, word is not as natural a concept in Chinese as it is in English, nor is it easy to define a word without dispute for such a language. We therefore propose a character-level dependency scheme to represent the primary linguistic relationships within a Chinese sentence. The usefulness of character dependencies is verified through two specialized dependency parsing tasks. The first handles trivial character dependencies that are equivalently transformed from traditional word boundaries. The second further considers the case in which annotated internal character dependencies inside a word are involved. Both sets of results from character-level dependency parsing are positive. This study provides an alternative way to formularize basic character- and word-level representations for Chinese.

1 Introduction

In many human languages, words can be naturally identified from writing. This is not the case for Chinese: Chinese is natively written as a character[1] sequence rather than a word sequence, with no natural separators such as blanks between words. As words do not appear as naturally as in most European languages[2], an argument arises over how to determine word-hood in Chinese. Linguists' views about what a Chinese word is diverge so greatly that multiple word segmentation standards have been proposed for computational linguistics tasks since the first Bakeoff (Bakeoff-1, or Bakeoff-2003)[3] (Sproat and Emerson, 2003).

Up to Bakeoff-4, seven word segmentation standards have been proposed. However, this does not effectively solve the open problem of what exactly a Chinese word should be; it rather raises another issue: which segmentation standard should be selected for the subsequent application. As words often play a basic role in further language processing, if they cannot be determined in a unified way, then all subsequent tasks are affected to a greater or lesser degree.

Motivated by the dependency representation for syntactic parsing, which has drawn more and more interest since (Collins, 1999), we suggest that character-level dependencies can be adopted to alleviate this difficulty in Chinese processing. If we regard a traditional word boundary as a linear representation of the relation between neighboring characters, then character-level dependencies provide a way to represent non-linear relations between non-neighboring characters. To show that character dependencies can be useful, we develop a parsing scheme for the related learning task and demonstrate its effectiveness.

The rest of the paper is organized as follows. The next section shows the drawbacks of the current word boundary representation through some language examples. Section 3 describes a character-level dependency parsing scheme for the traditional word segmentation task and reports its evaluation results. Section 4 verifies the usefulness of annotated character dependencies inside a word. Section 5 looks into a few issues concerning the role of character dependencies. Section 6 concludes the paper.

[1] Character here stands for any token occurring in naturally written Chinese text, including Chinese characters (hanzi), punctuation, and foreign letters; however, Chinese characters usually account for the largest part.
[2] Even in European languages, a naive but necessary method to properly define words is to list them all by hand. We thank the first anonymous reviewer for pointing out this fact.
[3] The First International Chinese Word Segmentation Bakeoff, available at http://www.sighan.org/bakeoff2003.
2 To Segment or Not: That Is the Question

Though most words can be unambiguously defined in Chinese text, some word boundaries are not so easily determined. We give three such examples below.

The first example is from the MSRA segmented corpus of Bakeoff-2 (Bakeoff-2005) (Emerson, 2005):

[Chinese characters not legible in this copy]
a / piece of / “ / Beijing City Beijing Opera OK Sodality / member / entrance / ticket / ”

As the MSRA guideline requires any organization's full name to be a single word, many long words of this form are frequently encountered. Though this type of 'word' may be regarded as an effective unit to some extent, smaller meaningful constituents can still be identified inside it, and some researchers argue that such items should be seen as phrases rather than words. In practice, a machine translation system, for example, will have to segment this type of word into smaller units for a proper translation.

The second example is from the PKU corpus of Bakeoff-2:

[Chinese characters not legible in this copy]
China / in / South Africa / embassy
(the Chinese embassy in South Africa)

This example demonstrates that researchers can also feel uncomfortable when an organization name is segmented into pieces. Though the word for 'embassy' comes right after 'South Africa' in the above phrase, the embassy does not belong to South Africa but to China; it is merely located in South Africa.

The third example is an abbreviation that exploits the characteristics of Chinese characters:

星期 / 一 / 三 / 五
week / one / three / five
(Monday, Wednesday and Friday)

This example shows the dilemma of segmenting such characters. If a segmentation position is placed before '三' (three) or '五' (five), each of them becomes meaningless, or at least loses its original meaning, because either of these two characters should logically follow the substring '星期' (week) to form the expected word '星期三' (Wednesday) or '星期五' (Friday). Otherwise, treating all five characters above as a single word means ignoring the logical dependency relations among them and segmenting the word again later for proper handling, as in the first example.

All these examples suggest that dependencies exist between discontinuous characters and that the word boundary representation is insufficient to handle such cases. This motivates us to introduce character dependencies.

3 Character-Level Dependency Parsing

Character dependency is proposed as an alternative to word boundaries. The idea itself is extremely simple: character dependencies inside a sequence are annotated or formally defined in much the same way that syntactic dependencies over words are usually annotated.

We initially develop a character-level dependency parsing scheme in this section. In particular, we show that character dependencies, even the trivial ones that are equivalently transformed from pre-defined word boundaries, can be effectively captured by parsing.

3.1 Formularization

Using a character-level dependency representation, we first show how a word segmentation task can be transformed into a dependency parsing problem. Because word segmentation has traditionally been formularized as an unlabeled character chunking task since (Xue, 2003), only unlabeled dependencies are considered in the transformation. There are many ways to transform chunks in a sequence into a dependency representation; for the sake of simplicity, however, only well-formed and projective output sequences are considered for our processing.

Figure 1: Two character dependency schemes

Borrowing the notation of (Nivre and Nilsson, 2005), an unlabeled dependency graph is formally defined as follows. An unlabeled dependency graph for a string of cliques (i.e., words and characters) W = w_1...w_n is an unlabeled directed graph D = (W, A), where

(a) W is the set of ordered nodes, i.e., clique tokens in the input string, ordered by a linear precedence relation <;
(b) A is a set of unlabeled arcs (w_i, w_j), where w_i, w_j ∈ W.

If (w_i, w_j) ∈ A, then w_i is called the head of w_j and w_j a dependent of w_i. Traditionally, the notation w_i → w_j means (w_i, w_j) ∈ A, and w_i →* w_j denotes the reflexive and transitive closure of the (unlabeled) arc relation. We assume that the designed dependency structure satisfies the following constraints, which are common in the existing literature (Nivre, 2006).

(1) D is weakly connected, that is, the corresponding undirected graph is connected. (CONNECTEDNESS)
(2) The graph D is acyclic, i.e., if w_i → w_j then not w_j →* w_i.
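Before moving to the parsing algorithm, it may help to see what such a transformation can look like in practice. The sketch below is a minimal illustration under one assumed convention; the excerpt does not define the exact arc directions of the two schemes in Figure 1, so this is not the paper's scheme. Under the assumption, a word-internal character attaches to its left neighbour, while a word-initial character attaches to the first character of the following word, or to a virtual ROOT for the last word. The result is a single-headed, acyclic, connected, projective tree, and the segmentation can be read back off the head positions. All function names and the toy sentence are hypothetical.

```python
# Illustrative sketch only: the arc conventions below are an assumption, since the
# excerpt does not define the two schemes of Figure 1.
#   * a word-internal (non-initial) character takes its left neighbour as head;
#   * a word-initial character takes the first character of the next word as head,
#     and the last word's initial character attaches to a virtual ROOT (= len(chars)).

def words_to_char_deps(words):
    """Map a segmented sentence (a list of words) to one head position per character."""
    chars = [c for w in words for c in w]
    heads = [None] * len(chars)
    starts, start = [], 0
    for w in words:
        starts.append(start)
        for j in range(1, len(w)):
            heads[start + j] = start + j - 1          # word-internal arc: head on the left
        start += len(w)
    for k, s in enumerate(starts):
        heads[s] = starts[k + 1] if k + 1 < len(starts) else len(chars)  # cross-word arc / ROOT
    return chars, heads

def char_deps_to_words(chars, heads):
    """Recover the segmentation: a character is word-initial iff its head lies to its right."""
    words, current = [], ""
    for i, c in enumerate(chars):
        if heads[i] > i and current:
            words.append(current)
            current = ""
        current += c
    if current:
        words.append(current)
    return words

if __name__ == "__main__":
    segmented = ["北京", "的", "天气"]      # toy example: "Beijing / 's / weather"
    chars, heads = words_to_char_deps(segmented)
    assert char_deps_to_words(chars, heads) == segmented
    print(list(zip(chars, heads)))          # [('北', 2), ('京', 0), ('的', 3), ('天', 5), ('气', 3)]
```

The recovery step works because a word-internal character always has its head immediately to its left, whereas a word-initial character's head lies to its right; this is one simple way in which trivial character dependencies can be made exactly equivalent to word boundaries.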
3.2 Shift-reduce Parsing

According to (McDonald and Nivre, 2007), all data-driven models for dependency parsing proposed in recent years can be described as either graph-based or transition-based. Since both dependency schemes that we construct for parsing are well-formed and projective, the latter is chosen as our parsing framework for the sake of efficiency. Specifically, a shift-reduce method is adopted, as in (Nivre, 2003).

The method is step-wise, and a classifier is used to make a parsing decision at each step. In each step, the classifier checks a clique pair[4], namely TOP, the top of a stack that consists of the processed cliques, and INPUT, the first clique in the unprocessed sequence, to determine whether a dependency relation should be established between them. Besides the two arc-building actions, a shift action and a reduce action are also defined, as follows:

Left-arc: Add an arc from INPUT to TOP and pop the stack.
Right-arc: Add an arc from TOP to INPUT and push INPUT onto the stack.
Reduce: Pop TOP from the stack.
Shift: Push INPUT onto the stack.

In this work, we adopt a left-to-right arc-eager
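The transition mechanics just listed can be sketched in a few lines. The following is a minimal, illustrative implementation; the preconditions in `legal` are standard arc-eager assumptions not spelled out in the text (Left-arc requires that TOP not yet have a head, Reduce requires that it does), and the class name `ArcEagerParser`, the `decide` stub standing in for the trained classifier, and the fallback policy are all hypothetical rather than the paper's actual implementation.

```python
from collections import deque

class ArcEagerParser:
    """Minimal arc-eager shift-reduce transition system (illustrative sketch)."""

    def __init__(self, cliques):
        self.stack = []                            # processed cliques; TOP = stack[-1]
        self.buffer = deque(range(len(cliques)))   # unprocessed cliques; INPUT = buffer[0]
        self.arcs = set()                          # unlabeled arcs as (head, dependent) positions
        self.has_head = set()                      # positions that already received a head

    def left_arc(self):       # Left-arc: add an arc from INPUT to TOP and pop the stack
        top, inp = self.stack.pop(), self.buffer[0]
        self.arcs.add((inp, top))
        self.has_head.add(top)

    def right_arc(self):      # Right-arc: add an arc from TOP to INPUT and push INPUT
        top, inp = self.stack[-1], self.buffer.popleft()
        self.arcs.add((top, inp))
        self.has_head.add(inp)
        self.stack.append(inp)

    def reduce(self):         # Reduce: pop TOP from the stack
        self.stack.pop()

    def shift(self):          # Shift: push INPUT onto the stack
        self.stack.append(self.buffer.popleft())

    def legal(self, action):
        """Standard arc-eager preconditions (assumed; not spelled out in the text)."""
        if action in ("left_arc", "right_arc"):
            if not (self.stack and self.buffer):
                return False
            if action == "left_arc" and self.stack[-1] in self.has_head:
                return False
            return True
        if action == "reduce":
            return bool(self.stack) and self.stack[-1] in self.has_head
        return bool(self.buffer)   # shift

    def parse(self, decide):
        """Run step-wise parsing; decide(parser) stands in for the trained classifier."""
        while self.buffer:
            action = decide(self)
            if not self.legal(action):
                action = "shift"   # shift is always legal while the buffer is non-empty
            getattr(self, action)()
        return self.arcs


if __name__ == "__main__":
    # With a trivial policy that always shifts, no arcs are built; a real system
    # plugs in a classifier that chooses among the four actions at each step.
    parser = ArcEagerParser(list("星期一三五"))
    print(parser.parse(lambda p: "shift"))      # -> set()
```

In the full system, the `decide` stub would be replaced by the classifier described above, which inspects TOP, INPUT, and their context before choosing one of the four actions at each step.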