A CRF Sequence Labeling Approach to Chinese Punctuation Prediction
Total Page:16
File Type:pdf, Size:1020Kb
A CRF Sequence Labeling Approach to Chinese Punctuation Prediction Yanqing Zhao, Chaoyue Wang, Guohong Fu School of Computer Science and Technology, Heilongjiang University Harbin 150080, China [email protected], [email protected], [email protected] information extraction (Matusov et al., 2006; Lu Abstract and Ng, 2010). Over the past years, numerous studies have been This paper presents a conditional random performed on the insertion of punctuations in fields based labeling approach to Chinese speech transcripts using supervised techniques. punctuation prediction. To this end, we However, it is actually very difficult or even first reformulate Chinese punctuation impossible to develop a large high-quality corpus prediction as a multiple-pass labeling task to achieve reliable models for predicting on a sequence of words, and then explore punctuations in speech transcripts or ASR outputs various features from three linguistic levels, (Takeuchi et al., 2007). Furthermore, most namely words, phrase and functional previous research on punctuation prediction chunks for punctuation prediction under the exploited very shallow linguistic features such as framework of conditional random fields. lexical features or prosodic cues (viz. pitch and Our experimental results on the Tsinghua pause duration) (Lu and Ng, 2010), few studies Chinese Treebank show that using multiple have been done on the exploration of deeper deeper linguistic features and multiple-pass linguistic features like syntactic structural labeling consistently improves performance. information for punctuation prediction, particularly in Chinese (Guo et al., 2010). 1 Introduction In this paper we draw our motivation from speech transcripts to written texts. On the one hand, Punctuation prediction, also referred to as a number of large annotated corpora of written punctuation restoration, aims at inserting proper texts are available to date. On the other hand, we punctuation marks at right position of an intend to examine the role of different linguistic unpunctuated text (Gravano et al., 2009; Guo et al., features on Chinese punctuation prediction. To this 2010). Punctuation is obviously an essential end, we reformulate Chinese punctuation indicator for sentence construction. For Chinese, prediction as a multiple-pass labeling task on word adding proper punctuation marks can not only sequences, and then explore multiple features at enhance the readability of text, but also can three linguistic levels, namely words, phrases and provide additional information for further language functional chunks, for punctuation labeling under analysis, such as word segmentation, phrasing and the framework of conditional random fields syntactic analysis (Guo et al., 2010; Chen and (CRFs). Furthermore, we have also performed Huang, 2011; Xue and Yang, 2011). As such, evaluation on the Tsinghua Chinese Treebank punctuation prediction plays a critical role in many (Zhou, 2004). natural language processing applications such as The rest of the paper is organized as follows: In automatic speech recognition (ASR), machine Section 2, we will provide a brief review of the translation, automatic summarization, and related work on punctuation prediction. In Section 508 Copyright 2012 by Yanqing Zhao, Chaoyue Wang, and Guohong Fu 26th Pacific Asia Conference on Language,Information and Computation pages 508–514 3, we will describe in detail a labeling method to prediction. This might be that a well-annotated Chinese punctuation prediction. Section 4 will corpus of speech texts is not available to date. As summarize the experimental results. Finally in such, in the present study we address the problem Section 6, we will give our conclusion and some of Chinese punctuation prediction from the possible directions for future work. perspective of written texts. Specially, we attempt to exploit multiple levels of features under the 2 Related Work framework of CRF-based sequence labeling and thus examine the role of for Chinese punctuation Punctuation prediction has been well studied in the communities of ASR, and a variety of techniques 3 Approach have been attempted, including n-grams (Takeuchi et al., 2007; Gravano et al., 2009), maximum This section details the CRFs-based multiple pass entropy models (MEMs) (Huang and Zweig, 2002; labeling method to Chinese punctuation prediction. Guo et al., 2010), and CRFs (Liu et al., 2005; Tomanek et al., 2007; Lu and Ng, 2010). 3.1 Task Formulation Current research focuses on seeking informative Chinese punctuation prediction is a process of features for punctuation prediction. Huang and inserting proper punctuation marks into a raw Zweig (2002) attempted to explore POS features Chinese text without punctuation marks. In the and prosodic features for inserting punctuations in present study, we reformulate Chinese punctuation automatically recognized speech texts using prediction as a multiple-pass labeling task on an MEMs. Takeuchi et al. (2007) exploited silence unpunctuated word string with the help of word information from ASR systems and head or tail pattern tags defined in Table 1. Furthermore, we phrases within sentences. They showed that using consider eleven main punctuation marks as shown head and tail phrases could result in improvement in Table 2. of performance in sentence boundary detection. Gravano et al. (2009) examined the effect of Tag Definition different orders of n-grams on performance in B The preceding word of the current punctuation prediction. More recently, Huang and punctuation. Chen (2011) used CRFs to combine different A The following word of the current features for labeling pause and stop in Chinese punctuation. texts, including the beginning and end features of O Words not adjacent to the current voice fragments, character features, word features, punctuation. POS features, syntactic features and topic features. BOT The head word of a text. In addition to speech transcripts or ASR outputs, EOT The tail word of a text. recently a number of researchers start to study Table 1: Patterns of words in punctuation labeling punctuation prediction via written texts. Tomanek et al (2007) employed CRFs to phrase a biological No. Name Punctuation Tag article, and then inserted punctuation to sentences 1 Comma , COM during sentence segmentation. Xue and Yang 2 full stop 。 FUL (2011) used MEMs to explore contextual words, ! POS features and syntax trees for inserting 3 exclamation mark EXC : commas in Chinese texts. Laboreiro and Sarmento 4 Colon COL (2010) applied support vector machines to exploit 5 Bracket () {}[] BRA multiple cues, including such as characters, 6 question mark ? QUE character types, symbols and punctuations, for 7 Semicolon ; SEM sentence segmentation and punctuation correction 8 enumeration comma 、 ENU in micro-blog texts. 9 book title mark 《》〈〉 BOO From these studies, it is clear that systems with 10 quotation mark “”‘’ QUO more and deeper features outperform systems only 11 Ellipsis …… ELL using simple features. However, most previous studies only used lexical cues for punctuation Table 2: Types of Chinese punctuation marks 509 In order to reduce the interference between and the construction of a clean government.), along different types of punctuation marks and to with its unpunctuated word string, three levels of simplify the problem of punctuation prediction as linguistic annotations and the corresponding well, we take the following order to perform punctuation labeling representation. punctuation labelling: sentence-final delimiters It is worth noting the major motivation of this (viz. period, question mark and exclamation mark) study is to investigate the effects of different levels → sentence-internal delimiters (viz. comma, of linguistic cues on Chinese punctuation semicolon, colon, and enumeration comma) → prediction. To achieve this, we take the following indicators (viz. bracket, book title mark, quotation three steps: First, we remove all punctuation marks mark, and ellipsis). within a given original punctuated text like line (a) After punctuation labeling, each word within the in Figure 1 and reduce it to an unpunctuated text unpunctuated text will receive a hybrid punctuation (viz. line (b)) before punctuation labeling. Then, tag of the form t1-t2, if it is adjacent to a we explore three levels of linguistic information to punctuation mark, or is at the beginning or end of a restore the removed punctuation marks using the text. Otherwise, it will only receive a tag O. Here, CRF-based multiple-pass labeling strategy. Finally, t1 denotes the pattern of the current word in we evaluate punctuation prediction performance by punctuation labeling (as shown in Table 1), and t2 comparing the automatically restored punctuation stands for the type of the punctuation mark (as marks with the corresponding original ones. defined in Table 2) that precedes or follows the Considering the availability of linguistic current word if applicable. information, we perform punctuation prediction on the Tsinghua Chinese Treebank (Zhou, 2004), a (a) Punctuated text: 执法部门是反腐败斗争 、 corpus of written Chinese with a variety of 搞好廉政建设的重点部门之一。 linguistic annotation information, including word (b) Unpunctuated word string: 执法 /部门 /是/反 segmentation, POS, phrases and functional chunks. /腐败 /斗争 /搞好 /廉政 /建设 /的/重点 /部门 /之 Also, the relevant annotation scheme is used throughout our present study. 一/ (c) POS: 执法 /vN 部门 /n 是/vC 反/v 腐败 /a 3.2 CRFs for Punctuation Labeling 斗争 /vN 搞好 /v 廉政 /vN 建设 /vN 的 We choose CRFs as the basic framework for /uJDE