A CRF Sequence Labeling Approach to Chinese Punctuation Prediction Yanqing Zhao, Chaoyue Wang, Guohong Fu School of Computer Science and Technology, Heilongjiang University Harbin 150080, China
[email protected],
[email protected],
[email protected] information extraction (Matusov et al., 2006; Lu Abstract and Ng, 2010). Over the past years, numerous studies have been This paper presents a conditional random performed on the insertion of punctuations in fields based labeling approach to Chinese speech transcripts using supervised techniques. punctuation prediction. To this end, we However, it is actually very difficult or even first reformulate Chinese punctuation impossible to develop a large high-quality corpus prediction as a multiple-pass labeling task to achieve reliable models for predicting on a sequence of words, and then explore punctuations in speech transcripts or ASR outputs various features from three linguistic levels, (Takeuchi et al., 2007). Furthermore, most namely words, phrase and functional previous research on punctuation prediction chunks for punctuation prediction under the exploited very shallow linguistic features such as framework of conditional random fields. lexical features or prosodic cues (viz. pitch and Our experimental results on the Tsinghua pause duration) (Lu and Ng, 2010), few studies Chinese Treebank show that using multiple have been done on the exploration of deeper deeper linguistic features and multiple-pass linguistic features like syntactic structural labeling consistently improves performance. information for punctuation prediction, particularly in Chinese (Guo et al., 2010). 1 Introduction In this paper we draw our motivation from speech transcripts to written texts. On the one hand, Punctuation prediction, also referred to as a number of large annotated corpora of written punctuation restoration, aims at inserting proper texts are available to date.