Babler - Data Collection from the Web to Support Speech Recognition and Keyword Search

Gideon Mendels, Erica Cooper, Julia Hirschberg
Columbia University, New York, USA
[email protected], {ecooper, julia}@cs.columbia.edu

Abstract

We describe a system to collect web data for Low Resource Languages (LRLs), in order to augment the language model training data used for Automatic Speech Recognition (ASR) and keyword search by reducing Out-of-Vocabulary (OOV) rates, that is, the proportion of words in the test set that did not appear in the ASR training set. We test this system on seven Low Resource Languages from the IARPA Babel Program: Paraguayan Guarani, Igbo, Amharic, Halh Mongolian, Javanese, Pashto, and Dholuo. The success of our system compared with other web collection systems is due to its targeted collection sources (blogs, Twitter, forums) and the inclusion of a separate language identification component in its pipeline, which filters the initially collected data before it is saved. Our results show a major reduction in OOV rates relative to those calculated from the training corpora alone, as well as major reductions in OOV rates calculated in terms of keywords in the training development set. We also describe differences among genres in this reduction, which vary by language but show a pronounced benefit from Twitter data for most languages.

1 Introduction

Collecting data from the web for commercial and research purposes has become a popular task, used for a wide variety of purposes in text and speech processing. However, to date, most of this data collection has been done for English and other High Resource Languages (HRLs). These languages are characterized by extensive computational tools and large amounts of readily available web data, and include languages such as French, Spanish, Mandarin, and German. Low Resource Languages (LRLs), although many are spoken by millions of people, are much more difficult to mine, due largely to their smaller presence on the web. These include languages such as Paraguayan Guarani, Igbo, Amharic, Halh Mongolian, Javanese, Pashto, and Dholuo, inter alia.

In this paper we describe a new system, Babler, which addresses the problem of collecting large amounts of LRL data from multiple web sources. Unlike current HRL collection systems, Babler provides a targeted collection pipeline for social networks and conversational-style text. The purpose of this data collection is to augment the data used to train language models for Automatic Speech Recognition (ASR) and Keyword Search (KWS) in LRLs. The more specific goal is to reduce Out-of-Vocabulary (OOV) rates for languages where the amount of training data is small, so that many words in the test set may not occur in the training set. Web data can add many additional words to the ASR and KWS lexicon, which has been shown to improve both Word Error Rate (WER) and keyword hit rate. Critically, this web data must be in a genre close to that of the ASR training and test sets, which is the main reason we developed a pipeline that focuses on conversational-style text. In this paper we describe the properties which LRL web collection requires of a system, compare our system with other popular web collection and scraping software, and describe the results achieved in reducing WER for ASR, reducing OOV rates, and improving performance on the IARPA Babel keyword search task.

In Section 2 we describe previous research in web collection for speech recognition and keyword search. In Section 3 we briefly describe the IARPA Babel project and its language resources. In Section 4 we describe the components of our web collection system. In Section 5 we identify the web sources we use. In Section 6 we compare our system to other tools for web data collection. In Section 7 we describe the text normalization we subsequently use to prepare the collected material for language modeling. In Section 8 we describe the results of adding the collected web data to the available Babel training data in reducing OOV rates. We conclude in Section 9 and discuss future research.

2 Previous Research

A number of tools and methodologies have been proposed for web scraping to build web corpora for speech and NLP applications. Baroni and Bernardini (2004) developed BootCat, which generates queries in an iterative process in order to create a corpus, typically for a specific domain. De Groc et al. (2011) optimized the query generation process by modeling the relationships between queries, documents and terms as a graph; this approach improved mean precision by 25% over the BootCat method. Hoogeveen and De Pauw (2011) used a similar query generation method but incorporated language identification into their pipeline. In text-based research, web resources have been mined to collect social media and review data for sentiment analysis (Wang et al., 2014; C. Argueta and Chen, 2016), to improve language identification (Lui et al., 2014), to find interpretations of compound nominals (Nicholson and Baldwin, 2006), to find variants of proper names (Andrews et al., 2012), to provide parallel corpora for training Machine Translation engines, to develop corpora for studies of code-switching (Solorio et al., 2014), and to predict chat responses in social media to facilitate response completion (Pang and Ravi, 2012), inter alia. In each case the data collected differs depending upon the application.

In speech research, however, web data collection has largely focused on improving ASR and KWS, where insufficient data may be available from existing training corpora. Until recently, most attempts at data augmentation from the web were confined to HRLs such as English, French, and Mandarin. In ASR research, improved performance has been achieved by supplementing language model training data with web data from different domains (Iyer et al., 1997), particularly when that data closely matches the genre of the available training material and the task at hand (Bulyko et al., 2003). While earlier work focused on English, Ng et al. (2005) extended this approach to the recognition of Mandarin conversational speech, and Schlippe et al. (2013) explored the use of web data to perform unsupervised language model adaptation for French Broadcast News using RSS feeds and Twitter data. Creutz et al. (2009) presented an efficient method for selecting queries to extract useful web text for general or user-dependent vocabularies. Most of this research has used perplexity to measure the improvement resulting from the addition of web text to the original language model corpus (Bulyko et al., 2007), although Sarikaya et al. (2005) have also proposed the use of BLEU scores when augmenting language model training data for Spoken Dialogue Systems.

In recent years, web data has also begun to be used to improve OOV rates for ASR and KWS performance on LRLs in the IARPA Babel project (Harper, 2011), which presents major new challenges. Web data for these languages is typically much scarcer than for HRLs, particularly in genres similar to the telephone conversations used in this project. Moreover, since many of these LRLs are spoken with significant amounts of code-switching, which must be identified during web scraping, collecting data for Babel LRLs is much more complex than for other languages. Language ID is thus also an important component of LRL web data collection.

Gandhe et al. (2013) used simple word seeding from the Babel lexicon on Wikipedia data, news articles and the results of 30 Google queries for five of the Babel Base Period languages: Cantonese, Pashto, Tagalog, Turkish and Vietnamese. This approach improved OOV rates by up to 50% and improved Actual Term Weighted Value (ATWV) (Fiscus et al., 2007) by 0.0424 in the best case (larger values of ATWV represent better performance), compared to a baseline system trained only on the Babel Limited Language Pack data provided for the recognition and search task; each corpus consisted of ten hours of transcribed conversational speech. On average, ATWV improved by 0.0243 across all five languages. Zhang et al. (2015) used automatically generated query terms followed by simple language identification techniques to reduce OOV rates for the Babel Very Limited Language Packs (three hours of transcribed telephone conversations) on Cebuano, Kazakh, Kurdish, Lithuanian, Telugu and Tok Pisin. Using a variety of web genres, they managed to halve the OOV rate on the development set and to improve keyword spotting by an absolute 2.8 points of ATWV.

In our own earlier work (Mendels et al., 2015), working on the same data, we used a variety of additional web genres (blogs, TED talks, and online news sources obtained from keyword searches seeded by the 1000 most common words in each language, together with BBN-collected movie subtitles, all filtered by several language ID methods) and reduced OOV rates by 39-66%, improving Maximum Term Weighted Value (MTWV) by 0.0076-0.1059 absolute points over the best language models trained without web data. In this paper, we describe an enhanced version of our system for collecting LRL data from the web, including collection of Paraguayan Guarani, Igbo, Amharic, Halh Mongolian, Javanese, Pashto, and Dholuo.

3 The Babel Program

The work presented here has been done within the context of the IARPA Babel program (Harper, 2011), which targets the rapid development of speech processing technology for LRLs, focusing on keyword search in large speech corpora via ASR transcripts.

The Babel program currently provides language packs for 24 languages: IARPA-babel101-v0.4c Cantonese, 102b-v0.5a Assamese, 103b-v0.4b Bengali, 104b-v0.4a Pashto, 105b-v0.4 Turkish, 106-v0.2f Tagalog, 107b-v0.7 Vietnamese, 201b-v0.2b Haitian Creole, 202b-v1.0d Swahili, 203b-v3.1a Lao, 204b-v1.1b Tamil, 205b-v1.0a Kurmanji Kurdish, 206b-v0.1e Zulu, 207b-v1.0b Tok Pisin, 301b-v1.0b Cebuano, 302b-v1.0a Kazakh, 303b-v1.0a Telugu, 304b-v1.0b Lithuanian, 305b-v1.0b Paraguayan Guarani, 306b-v2.0c Igbo, 307b-v1.0b Amharic, 401b-v2.0b Halh Mongolian, 402b-v1.0b Javanese, and 403b-v1.0b Dholuo. We describe our system and evaluate it on the last six languages (the current phase languages) as well as Pashto. This data was collected by Appen and is released in three subsets: Full Language Packs (FLPs), consisting of 80 hours of transcribed (primarily) telephone conversations between two speakers, recorded on separate channels under a variety of recording conditions; Limited Language Packs (LLPs), with 10 hours of transcribed speech; and Very Limited Language Packs (VLLPs), with 3 hours of transcribed speech drawn from the FLP corpus. We evaluate here on the LLP lexicons (derived from the 10-hour transcripts) for the seven languages examined. The speakers are diverse in terms of age and dialect, and the gender ratio is approximately equal. A main goal of the Babel program is determining how speech recognition and keyword search technology can be developed for LRLs using increasingly smaller training data sets; the major goal is determining how quickly ASR and KWS systems can be developed for new languages when little transcribed speech data is initially available. This makes data augmentation via web collection increasingly important.

4 Web Data Collection

A major constraint on our data collection effort is that we must collect and process as much data as possible in a given (very short) amount of time. This constraint is designed to simulate a situation in which speech processing tools must be created quickly for a new language for which ASR and keyword search tools are not already available. With that requirement in mind, we designed a highly customizable, multi-threaded pipeline for the task (Figure 1). The pipeline consists of the following components:

1. Seeding language models
2. Search producer
3. Job queue
4. Scraper
5. Language identification
6. Database

[Figure 1: Data Collection Pipeline]

We first provide an overview of the source-independent components (shown in Figure 1) and then describe in detail how we collect data from each source.

4.1 Seeding Language Models

The first component in the pipeline depicted in Figure 1 is responsible for generating keywords for seeding searches. Independent of the actual search provider (e.g. the Bing API or the Twitter API), this component is based on pre-computed unigram language models for each of the languages we want to collect. The unigram model provides the search query, as explained below in Section 4.2. We compute the frequency of each token in the dataset and then remove all tokens shorter than 4 characters, as well as tokens that occur in a standard English word list (SIL, 1999). The primary reason for removing these tokens is to reduce the number of English search results in later steps: we discovered that a query containing an English word is likely to produce mainly English results, even if that word is shared with another language, due to the heavy preponderance of English material on the web. The data for the unigram models is obtained from the Babel program; from the Leipzig corpora (Quasthoff et al., 2006), a multilingual corpus collected from the web; and from the Crubadan project (Scannell, 2007), another multilingual resource providing trigram counts for more than 2000 languages and dialects. Our system also supports generating bigram and trigram queries, which improves the accuracy of the target-language results but lowers recall.
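To make the seeding step concrete, the sketch below (Python, with toy data; illustrative only, not an excerpt from our implementation) ranks candidate keywords by corpus frequency and applies the two filters just described: a minimum length of four characters and absence from an English word list.

```python
from collections import Counter

def build_seed_keywords(tokens, english_wordlist, min_len=4):
    # Rank candidate seed keywords by corpus frequency, dropping tokens
    # shorter than min_len and tokens found in a standard English word
    # list, so that queries do not pull in mostly-English results.
    counts = Counter(t.lower() for t in tokens)
    return [tok for tok, _ in counts.most_common()
            if len(tok) >= min_len and tok not in english_wordlist]

# Toy data; the real counts come from the Babel transcripts, the Leipzig
# corpora, or the Crubadan project.
tokens = ["amakuru", "amakuru", "house", "iki", "umuryango"]
print(build_seed_keywords(tokens, english_wordlist={"house"}))
# -> ['amakuru', 'umuryango']  ("iki" is too short; "house" is English)
```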

4.2 Search Production

The search production component polls a keyword from the seeding model and generates a search query. Different search providers are implemented against the same interface, allowing flexibility in adding further search providers later. Our system currently supports the Bing Search API, the DuckDuckGo API, Google Search, the Twitter API, and the Topsy API.

4.3 Job Queue

The search producer described in Section 4.2 adds jobs to the queue; each job contains the URL or data that should be inspected by the scraper. Using a concurrent blocking queue in a producer-consumer design pattern, we allow the search producer and the scraper components to work concurrently and independently, thus reducing the overhead of waiting for HTTP requests.
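This is the standard producer-consumer idiom; a minimal Python sketch follows (names and data are illustrative only, and the print call stands in for the actual HTTP fetch). The blocking put() and get() calls are what decouple the two components.

```python
import queue
import threading

job_queue = queue.Queue(maxsize=100)  # a concurrent blocking queue

def search_producer(keywords):
    # put() blocks while the queue is full, so the producer
    # never runs far ahead of the scraper.
    for kw in keywords:
        job_queue.put({"query": kw})  # in the real pipeline: a URL or API query
    job_queue.put(None)               # sentinel: no more jobs

def scraper():
    # get() blocks while the queue is empty; the scraper works
    # independently of the producer's request latency.
    while True:
        job = job_queue.get()
        if job is None:
            break
        print("fetching:", job["query"])  # stand-in for the HTTP fetch and parse

threading.Thread(target=search_producer, args=(["amakuru", "isoko"],)).start()
scraper()
```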

4.4 Scraper

This component is the heart of the pipeline. It is responsible for fetching a data source, extracting the data from that source, and passing it further down the pipeline.

4.5 Language Identification

The raw data collected is examined using our multi-classifier, majority-vote language identification approach. Lui and Baldwin (2014) showed that a majority vote over three independent language classifiers consistently outperforms any individual system, so we use the following classifiers:

• LingPipe: a language identification classifier built with LingPipe (http://alias-i.com/lingpipe/) and described in Mendels et al. (2015).

• TextCat: our implementation of the TextCat algorithm (Cavnar et al., 1994), using pre-computed counts from the Crubadan project (Scannell, 2007).

• Google's Compact Language Detector 2 (CLD2, https://github.com/CLD2Owners/cld2): a Naive Bayesian classifier that supports 83 languages. We implemented a Java native interface to the original CLD2 distribution.
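One simple reading of the voting rule is sketched below in Python, with trivial stand-ins for the three real classifiers: a document is kept only when at least two of the three classifiers select the target language.

```python
from collections import Counter

def keep_document(text, classifiers, target_lang):
    # Majority vote over independent language identifiers: keep the
    # document only if at least two of the three classifiers agree
    # that it is in the target language.
    votes = Counter(classify(text) for classify in classifiers)
    return votes[target_lang] >= 2

# Trivial stand-ins for the real classifiers (LingPipe, TextCat, CLD2).
lingpipe = lambda text: "ig"
textcat = lambda text: "ig"
cld2 = lambda text: "en"

print(keep_document("Kedu ka i mere?", [lingpipe, textcat, cld2], "ig"))  # True
```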

4.6 Database

We use MongoDB, a NoSQL document-oriented database system, to store the filtered data. MongoDB allows us to process the data easily via its built-in map-reduce component, and it provided significant improvements over saving documents as text files: for example, in a single task of counting the number of tokens in the entire data set, we found MongoDB to be approximately three orders of magnitude faster than using the ext4 file system on Ubuntu. By overriding MongoDB's internal id field, we also solve the issue of duplicates, which we encounter in many sources, especially Twitter, where tweets are often retweeted. To avoid saving duplicates or laboriously checking the entire dataset, we compute the SHA256 hash of each data source and save it as the internal id field. Since this field is defined to be unique over the entire MongoDB collection, we avoid duplicates by definition.

5 Web Sources

5.1 Blogs - By Rich Site Summary (RSS)

RSS feeds are structured XML feeds that usually contain the latest posts from a blog. Since the data is completely structured, the task essentially involves fetching and parsing the XML file and extracting the correct node. We collect blog data from blogspot.com and wordpress.com. Once the search producer polls a keyword from the unigram model, it constructs a Bing search query of the following form: site:blogspot.com unigram NOT lang:en (a sketch of this query construction appears at the end of this section). The query consists of a domain filter, a keyword, and a language filter that removes all results classified as English by Bing. The result of this query is a list of blog posts that contain the keyword. We classify the raw text using our language identifier and, if it matches the language we seek, we save the blog post.

In some cases, RSS feeds are either unavailable or contain only the first paragraph of a blog post. In such cases it is necessary to separate the actual content of the post from ads, menus and other boilerplate. To collect these posts we explored two methods of boilerplate removal:

• DiffBot, a commercial service that builds a structured representation of an HTML page by rendering it and breaking it down into its component parts using computer vision techniques.

• A pre-trained machine learning model (Kohlschütter et al., 2010) that uses shallow text features, such as the number of words and text density, to separate content from boilerplate.

5.2 Forums

For web forums, we target forums created with phpBB, an open-source forum/bulletin management system. Once the search producer polls a keyword from the unigram model, it constructs a Bing search query of the following form: Powered by phpBB AND unigram NOT lang:en. Many phpBB forums follow the same Document Object Model (DOM) structure, for which we have written a custom scraper based on Cascading Style Sheets (CSS) queries. Once a thread is found to be a match, we crawl the entire forum for additional threads.

5.3 Twitter By Query

We poll a keyword from the unigram model and produce a search query on the Twitter and Topsy.com APIs. Both APIs are the same in terms of content, but using both provides higher throughput. The tweets in the search results are cleaned of mentions, URLs, hashtags and emojis prior to language identification.

5.4 Twitter By User

An independent service revisits all the user pages from which we have successfully collected tweets in the desired language and crawls through their public history to find more tweets from the same user. This is based on the assumption that a user who tweets in a specific language is likely to have more tweets in that language.

5.5 TED Talks

TED.com is a website devoted to spreading ideas, usually in the form of short, powerful talks. Many of the talks are offered with user-translated subtitles. We use CSS queries and simple URL manipulation to download all the subtitles.

5.6 News

In some cases we have also implemented custom CSS query-based scrapers for news sites. This approach provides data with very little noise but requires implementing a manual scraper for each site.

5.7 Wikipedia

Our system also supports downloading and processing Wikipedia's XML dumps, which are available for many LRLs.
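For concreteness, the sketch below assembles the source-specific queries described in Sections 5.1-5.3 from a seed keyword. The query strings are exactly those given above; the function names are illustrative, not part of our released code.

```python
def blog_query(keyword):
    # Domain filter + seed keyword + Bing language filter (Section 5.1).
    return f"site:blogspot.com {keyword} NOT lang:en"

def forum_query(keyword):
    # phpBB boards share footer text that we can search for (Section 5.2).
    return f"Powered by phpBB AND {keyword} NOT lang:en"

def twitter_query(keyword):
    # The Twitter and Topsy search APIs take the bare keyword (Section 5.3).
    return keyword

print(blog_query("amakuru"))   # site:blogspot.com amakuru NOT lang:en
print(forum_query("amakuru"))  # Powered by phpBB AND amakuru NOT lang:en
```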
6 Comparison to Other Data Collection Tools

Most tools for bootstrapping corpus building from the web were designed for languages with a large presence on the web, and for building corpora around specific topics and terminology. Keyword search and ASR language modeling for the telephone conversations collected in LRLs require a different type of corpus: we aim to build a topic-independent, conversational corpus with very little noise in the form of HTML, JavaScript and out-of-language tokens. With this in mind, we compare our system to these tools in three main respects.

6.1 Query Generation and Sources

Topic- and terminology-oriented corpus building requires robust query generation (similar to our search producer step), since it is preferable to fetch a specific subset of the documents available from the search engine. BootCat (Baroni and Bernardini, 2004) randomly generates n-gram queries from the unigram seeding model. GrawlTCQ (De Groc et al., 2011) further develops the query generation process by modeling the links between documents, terms and queries. CorpusCollie (Hoogeveen and De Pauw, 2011) uses a similar approach but also removes tokens that are considered stop-words in other languages.

Our system, by contrast, queries only documents from specific sources that are most suitable for our corpus (blogs, forums, Twitter and subtitles) rather than the entire web. This choice is dictated by the fact that the ASR language modeling and keyword search tasks we target involve conversational telephone speech: thus, more "conversational" text is most useful. Furthermore, when working with LRLs, we optimize the initial query generation process for recall rather than precision, which explains our use of basic unigrams. Since very few resources are available, we filter documents using language identification rather than by query design. Nonetheless, we have also implemented support for bigram and trigram seeding models for cases where they would be desirable.

6.2 Language Identification and Boilerplate Removal

BootCat (Baroni and Bernardini, 2004) and GrawlTCQ (De Groc et al., 2011) have no language identification support or boilerplate removal. CorpusCollie (Hoogeveen and De Pauw, 2011) uses filtering based on regular expressions to remove boilerplate; for example, an HTML element containing a copyright symbol is likely to be boilerplate. Such rule-based methods are language dependent and considered less robust than machine learning models, as shown by Kohlschütter et al. (2010). Our system uses state-of-the-art boilerplate removal and language identification as part of its pipeline.

6.3 Performance

Our system uses multithreading to reduce the overhead of the many HTTP requests required in web data collection. Furthermore, all the tools described above use the operating system's file system to manage collected documents; as discussed in Section 4, we have found a production-level database system preferable in both performance and scale.

7 Text Normalization

As previously noted, we collect web data in order to include it in the language models for ASR that will be used to transcribe data for a spoken keyword search task. Due to the noisy nature of text found on the web, we must clean our collected data to make it appropriate for this task. Our text normalization proceeds in three distinct steps:

• Pre-normalization: a first pass in which non-standard punctuation is standardized;

• Sentence segmentation: accomplished using the Punkt module of NLTK (Kiss and Strunk, 2006); and

• Post-normalization: sentence-by-sentence cleaning of any out-of-language text and standardization of numerals.

7.1 Pre-normalization

During pre-normalization, we first remove list entries and titles, since these are generally not full sentences. We replace non-standard characters with a standard version; these include ellipses, whitespace, hyphens, and apostrophes. Hyphens and apostrophes are removed as extraneous punctuation, except in word-internal cases such as hyphenated words or contractions. Finally, any characters not part of the language's character set, the Latin character set, numerals, or allowed punctuation are removed. This cleans special characters such as symbols from the data. Latin characters are preserved, even for languages which use a different alphabet, to enable more accurate removal of entire sentences containing foreign words and URLs during post-normalization.
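A minimal Python sketch of the character-standardization portion of this pass follows. The set of allowed characters here is an assumption standing in for the per-language character sets; the replacement rules mirror the description above.

```python
import re

# Assumed stand-in for the per-language character set: Latin letters
# (including extended Latin), digits, whitespace and basic punctuation.
# Digits are kept here because numerals are written out later, in
# post-normalization.
ALLOWED = re.compile(r"[^A-Za-z\u00C0-\u024F0-9\s.,!?'-]")

def pre_normalize(line):
    line = line.replace("\u2026", "...")               # standardize ellipses
    line = re.sub(r"[\u2018\u2019\u02BC]", "'", line)  # curly quotes -> apostrophe
    line = re.sub(r"[\u2010-\u2015]", "-", line)       # dash variants -> hyphen
    line = ALLOWED.sub(" ", line)                      # drop out-of-set symbols
    # Drop hyphens/apostrophes unless word-internal (hyphenated words,
    # contractions).
    line = re.sub(r"(?<!\w)[-']|[-'](?!\w)", " ", line)
    return re.sub(r"\s+", " ", line).strip()           # standardize whitespace

print(pre_normalize("It\u2019s a well\u2014known \u2018cafe\u2019 \u2605"))
# -> "It's a well-known cafe"
```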
7.2 Sentence Segmentation

We perform sentence tokenization using the Punkt module of NLTK. Punkt uses a language-independent, unsupervised approach to sentence boundary detection. It learns which words are abbreviations, as opposed to sentence-final words, based on three criteria: first, abbreviations appear as a tight collocation of a truncated word and a final period; second, abbreviations tend to be very short; third, abbreviations sometimes contain internal periods. Once the abbreviations in the training corpus are learned and identified, periods after non-abbreviation words can be designated as sentence boundaries. Punkt then performs additional classification to detect abbreviations that are also ends of sentences, ellipses at the ends of sentences, initials, and ordinal numbers. Punkt does not require knowledge of upper and lower case letters, so it is well suited to languages or data which may not use them.
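Because Punkt trains in an unsupervised fashion on raw text, it can be retrained on each language's collected web data. A minimal usage sketch with NLTK (the input file name is hypothetical):

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Unsupervised training on raw in-language text; no sentence-annotated
# data is required, which suits LRLs. The file name is hypothetical.
raw_text = open("collected_web_text.txt", encoding="utf-8").read()
tokenizer = PunktSentenceTokenizer(train_text=raw_text)

for sentence in tokenizer.tokenize("Dr. Okafor arrived at 9 p.m. He spoke briefly."):
    print(sentence)
```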

7.3 Post-normalization

Our final pass, post-normalization, examines the segmented data sentence by sentence. First, any sentences in languages which do not use the Latin script but which nonetheless contain words in the Latin alphabet are removed. We also remove sentences containing URLs, and we put abbreviations into a standard form, using underscores instead of periods. Finally, we replace numerals with their written-out forms, where possible, based on the Language Specific Peculiarities (LSP) document provided to Babel participants by Appen Butler Hill.

This type of normalization, while specific to our application, should be reasonable for other tasks as well, especially where language modeling is the target.

8 Experiments and Results

Our goal in collecting web data is to supplement the language models for ASR and KWS by increasing the lexicon available from the ASR training corpus, thereby reducing the number of OOV words for ASR and KWS. That is, if new words can be added to the lexicon from sources similar in genre to the training and test data, then there is a greater chance that these words can be identified in ASR and KWS on the test corpus. For evaluation purposes here, we calculate OOV reduction by comparing the web-data-augmented lexicon with each of the Babel LLP lexicons for the six Babel OP3 languages (Paraguayan Guarani, Igbo, Amharic, Halh Mongolian, Javanese, and Dholuo) as well as Pashto; results are shown in Table 1. "LLP" refers to the original lexicon distributed with the Limited Language Pack for each language, and "+web" is the union of the words in the LLP lexicon and the words found in our web data. The "%rel.ch" row shows the percent relative change in each measure when the web data is added to the lexicon. "OOV KW Rate %" is the percentage of KWS development queries containing an out-of-vocabulary token, before and after our web data is added to the lexicon. "OOV Hit Rate %" is a similar measure, except that each query term is weighted by the number of times it actually appears in the development transcripts; in this metric, keywords that appear more often have a greater impact. Finally, "Voc. Size (K)" shows the size of the vocabulary, in thousands of words, before and after adding web data.

Language           | Lexicon | OOV KW Rate % | OOV Hit Rate % | Voc. Size (K)
Pashto             | LLP     | 24.18         | 7.35           | 6.2
                   | +web    | 7.44          | 1.51           | 2461.6
                   | %rel.ch | -69.21        | -79.39         | 39693.8
Paraguayan Guarani | LLP     | 34.84         | 6.65           | 9.1
                   | +web    | 32.00         | 5.75           | 40.3
                   | %rel.ch | -8.17         | -13.66         | 339.93
Igbo               | LLP     | 30.50         | 6.52           | 6.7
                   | +web    | 21.74         | 3.43           | 50.5
                   | %rel.ch | -28.71        | -47.39         | 650.1
Amharic            | LLP     | 34.67         | 9.96           | 11.6
                   | +web    | 32.96         | 9.27           | 84.1
                   | %rel.ch | -4.91         | -6.91          | 627.4
Halh Mongolian     | LLP     | 32.95         | 15.67          | 8.5
                   | +web    | 5.37          | 0.44           | 2427.6
                   | %rel.ch | -83.71        | -97.16         | 28450.1
Javanese           | LLP     | 33.61         | 14.37          | 5.7
                   | +web    | 4.35          | 0.17           | 1723.2
                   | %rel.ch | -87.06        | -98.78         | 30037.3
Dholuo             | LLP     | 31.61         | 22.26          | 7.2
                   | +web    | 25.46         | 3.12           | 48.0
                   | %rel.ch | -19.45        | -85.99         | 561.6

Table 1: OOV Reduction on Unnormalized Data

We see that, for each language, the percentage of OOV queries is significantly reduced; in particular, most of the Halh Mongolian and Javanese OOV keywords missing from the original lexicons are in fact added to the lexicon by the web data collection.
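As a sketch of our reading of these two measures (toy keywords, counts and lexicons; not our evaluation code):

```python
def oov_rates(keywords, dev_counts, lexicon):
    # A query is OOV if any of its tokens is missing from the lexicon.
    is_oov = {kw: any(tok not in lexicon for tok in kw.split()) for kw in keywords}
    # OOV KW Rate: percentage of queries that are OOV.
    kw_rate = 100.0 * sum(is_oov.values()) / len(keywords)
    # OOV Hit Rate: the same, weighting each query by its count in the
    # development transcripts.
    total = sum(dev_counts[kw] for kw in keywords)
    hit_rate = 100.0 * sum(dev_counts[kw] for kw in keywords if is_oov[kw]) / total
    return kw_rate, hit_rate

keywords = ["amakuru", "isoko ryiza", "umuryango"]
dev_counts = {"amakuru": 5, "isoko ryiza": 2, "umuryango": 3}
llp_lexicon = {"amakuru", "isoko"}
print(oov_rates(keywords, dev_counts, llp_lexicon))  # approx (66.7, 50.0)
# Adding web-collected words to the lexicon lowers both rates:
print(oov_rates(keywords, dev_counts, llp_lexicon | {"ryiza", "umuryango"}))  # (0.0, 0.0)
```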
While text normalization is important if we are to use the web data to train a language model for ASR, we must also consider the extent to which normalization may in fact remove useful words. Table 2 shows the OOV reduction when the normalized web data is added. Surprisingly, using the normalized web data to augment the vocabulary actually helps in some instances over using the unnormalized data. This is probably because the removal of special characters and punctuation attached to words results in exact matches for keywords.

Language           | Lexicon | OOV KW Rate % | OOV Hit Rate % | Voc. Size (K)
Pashto             | LLP     | 24.18         | 7.35           | 6.2
                   | +web    | 5.73          | 0.75           | 801.9
                   | %rel.ch | -76.32        | -89.74         | 12863.6
Paraguayan Guarani | LLP     | 34.84         | 6.65           | 9.2
                   | +web    | 31.35         | 5.64           | 22.8
                   | %rel.ch | -10.02        | -15.21         | 149.1
Igbo               | LLP     | 30.50         | 6.52           | 6.7
                   | +web    | 20.98         | 3.33           | 28.1
                   | %rel.ch | -31.21        | -48.91         | 317.8
Amharic            | LLP     | 34.67         | 9.96           | 11.6
                   | +web    | 9.54          | 1.59           | 646.7
                   | %rel.ch | -72.48        | -83.99         | 5495.4
Halh Mongolian     | LLP     | 32.95         | 15.67          | 8.5
                   | +web    | 5.28          | 0.44           | 1190.1
                   | %rel.ch | -83.96        | -97.19         | 13896.8
Javanese           | LLP     | 33.61         | 14.37          | 5.7
                   | +web    | 4.10          | 0.15           | 950.1
                   | %rel.ch | -87.81        | -98.94         | 16516.7
Dholuo             | LLP     | 31.61         | 22.26          | 7.3
                   | +web    | 25.22         | 3.10           | 24.0
                   | %rel.ch | -20.23        | -86.07         | 231.1

Table 2: OOV Rate on Normalized Data

Finally, we are interested in the individual contribution of each of the web data genres we collected. Table 3 shows the percent relative reduction in both OOV keywords (KW) and OOV hit rate (Hits) on the development data when our normalized web data is added, by language and by genre ("n/a" marks genres for which no data was collected). The genre that best reduces OOVs clearly varies by language, but tweets were the most generally useful, yielding the largest OOV reduction for Pashto, Igbo, Halh Mongolian, Javanese, and Dholuo; in fact, tweets were the only useful genre for Dholuo. Paraguayan Guarani saw its largest OOV reduction from forum posts, and Amharic from blogs.

Language           | %rel.ch | Blogs  | Forums | TED    | Tweets
Pashto             | KW      | -64.59 | -64.48 | 3.77   | -73.20
                   | Hits    | -79.57 | -79.57 | -8.00  | -87.65
Paraguayan Guarani | KW      | -4.70  | -4.70  | n/a    | -6.44
                   | Hits    | -8.09  | -8.44  | n/a    | -10.39
Igbo               | KW      | -3.47  | -0.42  | -0.14  | -30.37
                   | Hits    | -9.97  | 0.29   | -0.18  | -47.86
Amharic            | KW      | -66.09 | -60.44 | -4.30  | -61.30
                   | Hits    | -76.42 | -72.61 | -6.35  | -76.12
Halh Mongolian     | KW      | -73.11 | -72.98 | -28.16 | -82.32
                   | Hits    | -95.16 | -95.33 | -74.94 | -96.86
Javanese           | KW      | -77.26 | -73.12 | n/a    | -83.17
                   | Hits    | -97.16 | -96.42 | n/a    | -97.82
Dholuo             | KW      | 0.0    | 0.0    | n/a    | -20.23
                   | Hits    | 0.0    | 0.0    | n/a    | -86.07

Table 3: OOV Rates for Languages by Genre

9 Conclusions and Future Research

We have presented a system for collecting conversational web text data for Low Resource Languages. Our system gathers data from a variety of text sources (blogs, forums, Twitter, TED talks) which have proven useful for substantially reducing OOV rates for language models based on telephone conversations in a KWS task. Despite the noisy and highly variable nature of text found on the web, by including language identification and text normalization in our pipeline, we can be much more confident that the data we collect is in the target language. Our results show significantly reduced OOV rates for KWS in LRLs, resulting in significantly higher KWS scores. Our future work will explore additional sources of conversational web data, such as Facebook pages and other public social media. We also plan to release our system in the near future as an open-source tool for the research community.

10 Acknowledgements

This work is supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Defense U.S. Army Research Laboratory (DoD/ARL) contract number W911NF-12-C-0012. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoD/ARL, or the U.S. Government.

References

Nicholas Andrews, Jason Eisner, and Mark Dredze. 2012. Name phylogeny: A generative model of string variation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 344–355. Association for Computational Linguistics.

Marco Baroni and Silvia Bernardini. 2004. BootCaT: Bootstrapping corpora and terms from the web. In LREC.

Ivan Bulyko, Mari Ostendorf, and Andreas Stolcke. 2003. Getting more mileage from web text sources for conversational speech language modeling using class-dependent mixtures. In Proceedings of HLT-NAACL 2003, short papers, Volume 2, pages 7–9. Association for Computational Linguistics.

Ivan Bulyko, Mari Ostendorf, Manhung Siu, Tim Ng, Andreas Stolcke, and Özgür Çetin. 2007. Web resources for language modeling in conversational speech recognition. ACM Transactions on Speech and Language Processing (TSLP), 5(1):1.

F. H. Calderon, C. Argueta, and Y.-S. Chen. 2016. Multilingual emotion classifier using unsupervised pattern extraction from microblog data. Intelligent Data Analysis, 29(6).

William B. Cavnar, John M. Trenkle, et al. 1994. N-gram-based text categorization. Ann Arbor MI, 48113(2):161–175.

Mathias Creutz, Sami Virpioja, and Anna Kovaleva. 2009. Web augmentation of language models for continuous speech recognition of SMS text messages. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 157–165. Association for Computational Linguistics.

Clément De Groc, Xavier Tannier, and Javier Couto. 2011. GrawlTCQ: terminology and corpora building by ranking simultaneously terms, queries and documents using graph random walks. In Proceedings of TextGraphs-6: Graph-based Methods for Natural Language Processing, pages 37–41. Association for Computational Linguistics.

Jonathan G. Fiscus, Jerome Ajot, John S. Garofolo, and George Doddington. 2007. Results of the 2006 spoken term detection evaluation. In Proc. SIGIR, volume 7, pages 51–57. Citeseer.

Ankur Gandhe, Long Qin, Florian Metze, Alex Rudnicky, Ian Lane, and Matthias Eck. 2013. Using web text to improve keyword spotting in speech. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 428–433. IEEE.

Mary Harper. 2011. IARPA solicitation IARPA-BAA-11-02. IARPA BAA.

D. Hoogeveen and G. De Pauw. 2011. CorpusCollie: a web corpus mining tool for resource-scarce languages. In Proceedings of the Conference on Human Language Technology for Development, Alexandria, Egypt, pages 44–49.

Rukmini Iyer, Mari Ostendorf, and Herb Gish. 1997. Using out-of-domain data to improve in-domain language models. IEEE Signal Processing Letters, 4(8):221–223.

Tibor Kiss and Jan Strunk. 2006. Unsupervised multilingual sentence boundary detection. Computational Linguistics, 32(4):485–525.

Christian Kohlschütter, Peter Fankhauser, and Wolfgang Nejdl. 2010. Boilerplate detection using shallow text features. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, pages 441–450. ACM.

Marco Lui and Timothy Baldwin. 2014. Accurate language identification of twitter messages. EACL 2014, pages 17–25.

Marco Lui, Ned Letcher, Oliver Adams, Long Duong, Paul Cook, Timothy Baldwin, and NICTA Victoria. 2014. Exploring methods and resources for discriminating similar languages. COLING 2014, page 129.

Gideon Mendels, Erica Cooper, Victor Soto, Julia Hirschberg, Mark Gales, Kate Knill, Anton Ragni, and Haipeng Wang. 2015. Improving speech recognition and keyword search for low resource languages using web data. Proc. Interspeech, Dresden, Germany.

Tim Ng, Mari Ostendorf, Mei-Yuh Hwang, Man-Hung Siu, Ivan Bulyko, and Xin Lei. 2005. Web-data augmented language models for Mandarin conversational speech recognition. In ICASSP (1), pages 589–592.

Jeremy Nicholson and Timothy Baldwin. 2006. Interpretation of compound nominalisations using corpus and web statistics. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 54–61. Association for Computational Linguistics.

Bo Pang and Sujith Ravi. 2012. Revisiting the predictability of language: Response completion in social media. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1489–1499. Association for Computational Linguistics.

Uwe Quasthoff, Matthias Richter, and Christian Biemann. 2006. Corpus portal for search in monolingual corpora. In Proceedings of the Fifth International Conference on Language Resources and Evaluation, pages 1799–1802.

Ruhi Sarikaya, Agustin Gravano, and Yuqing Gao. 2005. Rapid language model development using external resources for new spoken dialog domains.

Kevin P. Scannell. 2007. The Crúbadán project: Corpus building for under-resourced languages. In Building and Exploring Web Corpora: Proceedings of the 3rd Web as Corpus Workshop, volume 4, pages 5–15.

SIL. 1999. English wordlists. http://www01.sil.org/linguistics/wordlists/english/. Accessed: 2015-09-30.

Thamar Solorio, Elizabeth Blair, Suraj Maharjan, Steven Bethard, Mona Diab, Mahmoud Gohneim, Abdelati Hawwari, Fahad AlGhamdi, Julia Hirschberg, Alison Chang, et al. 2014. Overview for the first shared task on language identification in code-switched data. In Proceedings of the First Workshop on Computational Approaches to Code Switching, pages 62–72. Citeseer.

Yu Wang, Tom Clark, Eugene Agichtein, and Jeffrey Staton. 2014. Towards tracking political sentiment through microblog data. ACL 2014, page 88.

Le Zhang, Damianos Karakos, William Hartmann, Roger Hsiao, Richard Schwartz, and Stavros Tsakalidis. 2015. Enhancing low resource keyword spotting with automatically retrieved web documents. In Sixteenth Annual Conference of the International Speech Communication Association.
