Extending the Punctuation Module for European Portuguese
Total Page:16
File Type:pdf, Size:1020Kb
INTERSPEECH 2010 Extending the punctuation module for European Portuguese Fernando Batista1,2, Helena Moniz1,3, Isabel Trancoso1,4, Hugo Meinedo1, Ana Isabel Mata3, Nuno Mamede1,4 1INESC-ID, Lisbon, Portugal 2ISCTE-IUL - Lisbon University Institute, Lisbon, Portugal 3FLUL/CLUL, University of Lisbon, Portugal 4IST, Lisbon, Portugal {fmmb;helenam;isabel.trancoso;meinedo;njm}@l2f.inesc-id.pt and [email protected] Abstract 2. Related work This paper describes our recent work on extending the punc- Recent studies (e.g., [2, 3, 4, 5, 6, 7]) show that the analysis of tuation module of automatic subtitles for Portuguese Broad- prosodic features is crucial to improve sentence boundary de- cast News. The main improvement was achieved by the use of tection systems. Prosodic features, such as pause, final length- prosodic information. This enabled the extension of the previ- ening, pitch reset, and energy, are among the most salient cues ous module which covered only full stops and commas, to cover in these studies. This is supported on evidences that such cues question marks as well. The approach uses lexical, acoustic and are language-independent [8] and that languages have prosodic prosodic information. Our results show that the latter is rel- strategies to delimit sentence-like units (SU). evant for all types of punctuation. An analysis of the results also shows what type of interrogative is better dealt with by Recent studies have also pointed out that there are prosodic our method, taking into account the specificities of Portuguese. similarities across speaking styles [3]. The authors extracted This may lead to different results for different types of corpora, prosodic features and ranked them accordingly to their impact depending on the types of interrogatives that are more frequent. on the identification of dialog act boundaries. The scaled fea- tures are (in order): pause, pitch (log ratio of the pitch at the Index Terms: automatic punctuation, sentence boundary detec- end of the last word and at the beginning of the first word after tion, rich transcription, broadcast news subtitling. a boundary), energy (similar to the analysis window for pitch) and duration. 1. Introduction Detecting positions where a punctuation mark is missing, The main motivation of this work is the improvement of the roughly corresponds to the task of detecting a SU, or finding punctuation module of our automatic broadcast news captioning SU boundaries. SU boundary detection has gained increas- system. Like the speech recognition module and the modules ing attention during recent years, and it has been part of the that precede it, the punctuation and capitalization modules share NIST rich transcription evaluations. A general HMM (Hidden low latency requirements, given their on-the-fly usage. Markov Model) framework that allows the combination of lexi- Although the use of prosodic features in automatic punc- cal and prosodic clues for recovering full stop, comma and ques- tuation methods is well studied for some languages, the first tion marks is used by [9] and [10]. A similar approach was also implemented version for European Portuguese (henceforth EP) used for detecting sentence boundaries by [11, 2, 12]. [10] also deals only with full stop and comma recovery, and explores a combines 4-gram language models with a CART (Classifica- limited set of features, mostly lexical and acoustic, simultane- tion and Regression Tree) and concludes that prosodic infor- ously targeting at low latency, and language independence [1]. mation highly improve the results. [13] describes a maximum entropy (ME) based method for inserting punctuation marks The aim of this work is to improve the punctuation module, into spontaneous conversational speech, where the punctuation first by exploring additional features, namely prosodic, and later task is considered as a tagging task and words are tagged with by weighting the lexical and prosodic features impact on the the appropriate punctuation. It covers three punctuation marks: baseline system when encompassing interrogatives. To the best comma, full stop, and question mark; and the best results on the of our knowledge, this is the first study to quantify the distinct ASR output are achieved using bigram-based features and com- interrogative types and also to discuss the weight of the lexical bining lexical and prosodic features. [7] proposes a multi-pass and prosodic properties of these structures, based on planned linear fold algorithm for sentence boundary detection in spon- and spontaneous speech data for EP. taneous speech, which uses prosodic features, focusing on the The next section reviews related work. The baseline mod- relation between sentence boundaries, break indices and dura- ule and its improvement are described in Sections 3 and 4, re- tion, covering their local and global structural properties. Other spectively. Section 5 deals with the implementation of prosodic recent studies have shown that the best performance for the features to detect question marks. Conclusions and future work punctuation task is achieved when prosodic, morphologic and are presented in Section 6. syntactic information are combined [6, 12, 14]. Copyright 2010 ISCA 1509 26-30 September 2010, Makuhari, Chiba, Japan alignment between the manual and the automatic transcriptions. Table 1: Portuguese BN corpus properties. This is a non-trivial task mainly because of the recognition er- #Words Dur. (h) Planned Spont. WER rors. The NIST SCLite tool 1 was used for this task, followed Train 477k 46 55% 32% 14% by a post-processing step, either by aligning words which can Devel 66k 6 51% 38% 19% be written differently or by correcting some SCLite basic er- Test 135k 18 56% 36% 19% rors. The data was automatically annotated with part-of-speech information, using MARv [20]. The results achieved with this baseline version are shown 3. Baseline punctuation module in the first line of Table 2. The evaluation used the performance metrics Precision, Recall and SER (Slot Error Rate) [21]. Only Our on-line broadcast news processing system consists of a punctuation marks are considered as slots and used by these pipeline of modules that starts with jingle detection, audio di- metrics. Hence, the SER is computed by dividing the number arization, and automatic speech recognition (ASR) [15]. Our of punctuation errors by the number of punctuation marks in the ASR system follows the connectionist paradigm. The hy- reference data. brid models have been recently improved by the inclusion of multiple-state phone units, and a fixed set of phone transition units aimed at specifically modeling the most frequent intra- 4. Improved model for full stop and comma word phone transitions [16]. Although the system can be im- In order to introduce prosodic features for detecting SUs, we proved by using dynamic vocabulary and language models up- have performed a number of additional steps. The first step con- dated daily, the experiments reported in this paper used a fixed sisted of extracting the pitch and the energy from the speech sig- vocabulary of 100k words. nal, which was achieved using the Snack Sound Toolkit2. Du- Next in our pipeline, come the punctuation and capitaliza- rations of phones, words, and interword-pauses are extracted tion modules [1, 17]. Both use a discriminative approach, based from the recognizer output. By combining the pitch values with on maximum entropy (ME) models, which provides a clean way the phone boundaries, we have removed micro-intonation and of expressing and combining different properties of the infor- octave jump effects from the pitch track. Another important mation. This is specially useful for the punctuation task, given step consisted of marking the syllable boundaries as well as the the broad set of available lexical, acoustic and prosodic fea- syllable stress. A set of syllabification rules was designed and tures. This approach requires all information to be expressed applied to the lexicon. The rules account fairly well for native in terms of features, causing the resultant data file to become words, but need improvements for words of foreign origin. Fi- several times larger than the original one. The classification is nally, we have calculated the maximum, minimum, median and straightforward, making it interesting for on-the-fly usage. slope values for pitch and energy in each word, syllable, and The experiments described in this paper used the MegaM phone. Duration was also calculated for each one of the previ- tool [18] for training the maximum entropy models, using con- ous units. jugate gradient and logistic regression. Our baseline experi- As previously mentioned, our experiments aim at analyzing ments targeted only the two most frequent punctuation marks: the weight and contribution of each prosodic feature per se and full stop and comma. The following features were used for a the impact of the combination of prosodic features. Underlying given word w in the position i of the corpus: wi, wi+1, 2wi 2, the feature extraction process are linguistic evidences that pitch − 2wi 1, 2wi, 2wi+1, 3wi 2, 3wi 1, pi, pi+1, 2pi 2, 2pi 1, contour, boundary tones, energy slopes, and pauses are crucial − − − − − 2pi, 2pi+1, 3pi 2, 3pi 1, GenderChgs1, SpeakerChgs1, to delimit sentence-like units across languages. First, we have − − and TimeGap1, where: wi is the current word, wi+1 is the tested if the features would perform better on different units of word that follows and nwi x is the n-gram of words that starts analysis: phones, syllables and/or words. Supported on linguis- ± x positions after or before the position i; pi is part-of-speech tic findings for EP [22, 23], we hypothesized that the stressed of the current word, and npi x is the n-gram of part-of-speech and post-stressed syllables would be relevant units of analysis ± of words that starts x positions after or before the position i.