Improving Trigram Language Modeling with the World Wide Web

Xiaojin Zhu and Ronald Rosenfeld
November 2000
CMU-CS-00-171

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213

Abstract

We propose a novel method for using the World Wide Web to acquire trigram estimates for statistical language modeling. We submit an N-gram as a phrase query to web search engines. The search engines return the number of web pages containing the phrase, from which the N-gram count is estimated. The N-gram counts are then used to form web-based trigram probability estimates. We discuss the properties of such estimates, and methods to interpolate them with traditional corpus-based trigram estimates. We show that the interpolated models improve word error rate significantly over a small test set.

Keywords: Language models, Speech recognition and synthesis, Web-based services

1 Introduction

A language model is a critical component for many applications, including speech recognition. Enormous effort has been spent on building and improving language models. Broadly speaking, this effort develops along two orthogonal directions. The first direction is to apply increasingly sophisticated estimation methods to a fixed training data set (corpus) to achieve better estimation. Examples include various interpolation and backoff schemes for smoothing, variable-length N-grams, vocabulary clustering, decision trees, probabilistic context-free grammars, maximum entropy models, etc. [1]. We can view these methods as trying to “squeeze out” more benefit from a fixed corpus. The second direction is to acquire more training data. However, automatically collecting and incorporating new training data is non-trivial, and there has been relatively little research in this direction. Adaptive models are examples of the second direction. For instance, a cache language model uses recent utterances as additional training data to create better N-gram estimates.

The recent rapid development of the World Wide Web (WWW) makes it an extremely large and valuable data source. Just-in-time language modeling [2] submits previous user utterances as queries to WWW search engines, and uses the retrieved web pages as unigram adaptation data. In this paper, we propose a novel method for using the WWW and its search engines to derive additional training data for N-gram language modeling, and show significant improvements in terms of speech recognition word error rate.

The rest of the paper is organized as follows. Section 2 gives the outline of our method, and discusses the relevant properties of the WWW and search engines. Section 3 investigates the problem of combining a traditional corpus with data from the web. Section 4 presents our experimental results. Finally, Section 5 discusses both the potential and the limitations of our proposed method, and lists some possible extensions.

2 The WWW as trigram training data

The basic problem in trigram language modeling is to estimate $p(w_3|w_1 w_2)$, i.e. the probability of a word given the two words preceding it. This is typically done by smoothing the maximum likelihood estimate

$$p_{ML}(w_3|w_1 w_2) = \frac{c(w_1 w_2 w_3)}{c(w_1 w_2)}$$

with various methods, where $c(w_1 w_2 w_3)$ and $c(w_1 w_2)$ are the counts of $w_1 w_2 w_3$ and $w_1 w_2$ in some training data, respectively. The main idea behind our method is to obtain the counts of $w_1 w_2 w_3$ and $w_1 w_2$ as they appear on the WWW, to estimate

$$p_{web}(w_3|w_1 w_2) = \frac{c_{web}(w_1 w_2 w_3)}{c_{web}(w_1 w_2)}$$

and to combine $p_{web}$ with the estimates from a traditional corpus (here and elsewhere, when $c_{web}(w_1 w_2) = 0$, we regard $p_{web}(w_3|w_1 w_2)$ as unavailable). Essentially, we are using the searchable web as additional training data for trigram language modeling.

There are several questions to be addressed. First, how to obtain the counts from the web? What is the quality of these web estimates? How could they be used to improve language modeling? We will examine these questions in the following sections, in the context of N-best list rescoring for speech recognition.
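As a concrete illustration, the following Python sketch shows how such a web-based estimate could be formed once the two counts are available. The function get_web_count is a hypothetical stand-in for the search-engine querying and count estimation described in Section 2.1; it is not part of the original system.

    def web_trigram_estimate(w1, w2, w3, get_web_count):
        """Relative-frequency trigram estimate from web counts (a sketch).

        get_web_count(phrase) is a hypothetical callable returning the estimated
        number of occurrences of `phrase` on the web (see Section 2.1).
        Returns None when the bigram count is zero, i.e. the web estimate
        is regarded as unavailable.
        """
        c_bigram = get_web_count(f"{w1} {w2}")
        if c_bigram == 0:
            return None
        c_trigram = get_web_count(f"{w1} {w2} {w3}")
        return c_trigram / c_bigram

    # Toy usage with made-up counts:
    fake_counts = {"the stock market": 120000.0, "the stock": 900000.0}
    p = web_trigram_estimate("the", "stock", "market",
                             lambda q: fake_counts.get(q, 0.0))
    print(p)  # 0.1333...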

2.1 Obtaining N-gram counts from the WWW

To obtain the count of an N-gram $w_1 \ldots w_n$ from the web, we use the ‘exact phrase search’ function of web search engines. That is, we send $w_1 \ldots w_n$ as a single quoted phrase query to a search engine. Ideally, we would like the search engine to report the phrase count, i.e. the total number of occurrences of the phrase in all its indexed web pages. However, in practice most search engines only report the web page count, i.e. the number of web pages containing the phrase. Since one web page may contain one or more occurrences of the phrase, we need to estimate the phrase count from the web page count.

Many web search engines claim they can perform exact phrase search. However, most of them seem to use an internal list to remove common words from a query phrase. An interesting test phrase is “to be or not to be”: some search engines return totally irrelevant web pages for this query, since most, if not all, of its words are ignored. In addition, a few search engines perform stemming, so the query “she say” will return some web pages containing only “she says” or “she said”. Furthermore, some search engines report neither phrase counts nor web page counts. We experimented with a dozen popular search engines, and found three that meet our criteria: AltaVista [3] advanced search mode, Lycos [4], and FAST [5]. They all report web page counts.

One brute-force method to get the phrase counts is to actually download all the web pages the search engine finds. However, queries of common words typically result in tens of thousands of web pages, and this method is clearly infeasible. Fortunately, at the time of our experiment AltaVista had a simple search mode, which reported both the phrase count and the web page count for a query. Figure 1 shows the phrase count vs. the web page count for 1200 queries. Trigram queries (phrases consisting of three consecutive words), bigram queries and unigram queries are plotted separately. There are horizontal branches in the bigram and trigram plots that don't make sense (more web pages than total phrase counts). We regard these as outliers due to idiosyncrasies of the search engine, and exclude them from further consideration. The three plots are largely log-linear. This prompted us to perform the following log-linear regression separately for trigrams, bigrams, and unigrams:

$$\log c = \log \alpha + \beta \log pg, \qquad \text{i.e.} \qquad c = \alpha \cdot pg^{\beta}$$

where $c$ is the phrase count and $pg$ is the web page count. Table 1 lists the coefficients. The three regression functions are also plotted in Figure 1. We assume these functions apply to other search engines as well. In the rest of the paper, all web N-gram counts are estimated by applying the corresponding regression function to the web page counts reported by search engines.

[Figure 1: Web phrase count vs. web page count reported by search engines, on log-log axes, shown separately for unigram, bigram and trigram queries, together with the fitted regression lines.]

(Our selection is admittedly incomplete. In addition, since search engines develop and change rapidly, all our comments are only valid during the period of this experiment.)


            alpha    beta
Unigram     2.427    1.019
Bigram      1.209    1.014
Trigram     1.174    1.025

Table 1: Coefficients of the log-linear regression $c = \alpha \cdot pg^{\beta}$ for estimating web N-gram counts from the web page counts reported by search engines.
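For illustration, the conversion could be applied as in the sketch below, which assumes the two columns of Table 1 are the multiplier alpha and the exponent beta of the fit c = alpha * pg^beta; the function name and the example numbers are ours, not from the original experiment.

    # Convert a reported web page count into an estimated phrase count using
    # the log-linear fit c = alpha * pg**beta (coefficients from Table 1).
    # A sketch: the fit is specific to the search engines and the time period
    # of the original experiment.
    REGRESSION = {
        1: (2.427, 1.019),   # unigram: (alpha, beta)
        2: (1.209, 1.014),   # bigram
        3: (1.174, 1.025),   # trigram
    }

    def estimate_phrase_count(page_count, ngram_order):
        alpha, beta = REGRESSION[ngram_order]
        return alpha * page_count ** beta

    print(estimate_phrase_count(10000, 3))   # roughly 1.5e4 occurrences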

2.2 The quality of web estimates

To investigate the quality of web estimates, we needed a baseline corpus for comparison. The baseline we used is a 103 million word Broadcast News corpus.

2.2.1 Web N-gram coverage

The first experiment we ran was an N-gram coverage test on unseen text. That is, we wanted to see how many N-grams in the test text are not on the web, and/or not in the baseline corpus. We were hoping to show that the web covers many more N-grams than the baseline corpus. Note that by ‘the web’ we mean the searchable portion of the web as indexed by the search engines we chose.

The unseen news test text consisted of 24 randomly chosen sentences from 4 web news sources (CNN, ABC, Fox, BBC) and 6 categories (world, domestic, technology, health, entertainment, politics). All the sentences were selected from the day's news stories, on the day the experiment was carried out. This was to make sure that the search engines hadn't had the time to index the web pages containing these sentences. After the experiment was completed, we checked each sentence, and indeed none of them had been found by the search engines yet. Therefore the test text is truly unseen to both the web search engines and the baseline corpus. (The test text is of written news style, which might be slightly different from the broadcast news style in the baseline corpus.) There are 327 unigram types (i.e. unique words), 462 bigram types and 453 trigram types in the test text. Table 2 lists the number of N-gram types not covered by the different search engines and the baseline corpus, respectively.

            Unique types   Not covered by:
                           AltaVista   Lycos   FAST   Corpus
Unigram          327            0         0      0        8
Bigram           462            4         5      5       68
Trigram          453           46        46     46      189

Table 2: Novel N-gram types in 24 news sentences

Clearly, the web’s coverage, under any of the search engines, is much better than that of the baseline corpus. It is also worth noting that for this test text, any N-gram not covered by the web was also not covered by the baseline corpus.

In the next experiment, we focused on the trigrams in the test text to answer the question: if one randomly picks a trigram from the test text, what's the chance the trigram has appeared $c$ times in the training data? Figure 2 shows the comparison, with the training data being the baseline corpus and the web through the different search engines, respectively. This figure is also known as a “frequency-of-frequency” plot. According to this figure, a trigram from the test text has more than a 40% chance of being absent from the baseline corpus, while the chance goes down to about 10% on the web, regardless of the search engine. This is consistent with Table 2. Moreover, a trigram has a much larger chance of having a small count in the baseline corpus than on the web. Since small counts usually mean unreliable estimates, resorting to the web could be beneficial.

[Figure 2: Empirical frequency-of-frequency plot for trigrams in the test text, comparing the baseline corpus with the web through AltaVista, Lycos and FAST.]

2.2.2 The effective size of the web

Recently, Fienberg et al. [6] estimated the size of the indexable web as of 1997 to be close to 1 billion pages. The web grows exponentially, and as of this writing some search engines claim they have indexed more than 1 billion pages. We would like to estimate the effective size of the web as a language model training corpus.

Let's assume that the web and the baseline corpus are homogeneous (which is patently false, since the web has much more than news, but we will ignore this for the time being). Then the probability of a particular N-gram appearing in the baseline corpus is the same as the probability that it appears on the web:

$$p_{corpus}(\text{N-gram}) = p_{web}(\text{N-gram})$$

Since the probabilities can be approximated by their respective frequencies, we have

$$\frac{c_{corpus}(\text{N-gram})}{|corpus|} = \frac{c_{web}(\text{N-gram})}{|web|}$$

from which we can estimate $|web|$, the size of the web in words. Note that it doesn't matter whether the N-gram is a unigram, bigram or trigram, though N-grams with small counts are unreliable and should be excluded. In our experiment, we considered all unigrams, bigrams and trigrams in the test text with sufficiently large $c_{corpus}$. Each such N-gram gave us an estimate, and we took the median of all these estimates for robustness. Table 3 gives our estimates of the size with different search engines. Some points to notice:

1. The ‘effective web size’ estimates we obtained are very rough at best. Moreover, they are defined relative to the specific baseline corpus and specific test set we happened to choose. Therefore, Table 3 should not be used to rank the performance of individual search engines.

            Effective size of the web
AltaVista       108 billion words
Lycos            79 billion words
FAST             83 billion words

Table 3: The effective size of the web for language model training

2. This method tends to underestimate the web size. We assumed homogeneity, which in actuality does not hold. The test text comes from a news domain, and so does the baseline corpus. We used N-grams from the test text to estimate the web size, which gives rise to a selectional bias. Intuitively, only “news terms” are in the test text, and since the corpus is in the news domain, as a whole we have $p_{web}(\text{news terms}) \leq p_{corpus}(\text{news terms})$. This is what leads to underestimation.
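A minimal sketch of the median-of-ratios size estimate described above; the function names, the count lookups, and the cutoff min_corpus_count are placeholders (the exact count threshold used in the experiment is not specified here).

    from statistics import median

    def effective_web_size(ngrams, c_corpus, c_web, corpus_size, min_corpus_count=2):
        """Estimate |web| in words from c_corpus/|corpus| = c_web/|web|.

        ngrams           : N-grams drawn from the test text
        c_corpus, c_web  : callables returning counts in the corpus / on the web
        corpus_size      : |corpus| in words (103 million here)
        min_corpus_count : placeholder cutoff; N-grams with smaller corpus
                           counts are skipped as unreliable
        """
        estimates = []
        for g in ngrams:
            cc = c_corpus(g)
            if cc < min_corpus_count:
                continue
            estimates.append(c_web(g) * corpus_size / cc)
        return median(estimates)   # median for robustness against outliers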

2.2.3 Normalization of the web counts

An interesting sanity check is to see whether

$$\sum_{w_3 \in V} c_{web}(w_1 w_2 w_3) \leq c_{web}(w_1 w_2)$$

holds for any bigram $w_1 w_2$. If this is true, the relative frequency estimate $p_{web}(w_3|w_1 w_2)$ would already be normalized, i.e.

$$\sum_{w_3 \in V} p_{web}(w_3|w_1 w_2) \leq 1$$

Of course there are too many $w_1 w_2$ combinations to verify this directly. Instead, we randomly chose six $w_1 w_2$ pairs from the baseline corpus. For each pair, we chose 2000 $w_3$'s according to the following heuristic: First, we selected words from a list of all $w_3$'s such that the trigram $w_1 w_2 w_3$ appeared in the baseline corpus, sorted by decreasing frequency; if fewer than 2000 words were chosen that way, we added words from a list of all $w_3$'s such that the bigram $w_2 w_3$ appeared in the baseline corpus, in decreasing frequency order; if this was still not enough, we added $w_3$'s according to their unigram frequencies. We expected this heuristic to give us a list of $w_3$'s that covers the majority of the conditional probability mass given the history $w_1 w_2$.

Table 4 shows web bigram count estimates obtained with FAST search, together with their respective cumulative web trigram count estimates as described above. Ideally, the ratio should be close to, but less than, 100%. It is evident from the table that the web counts are not perfectly normalized. The reasons are not entirely clear to us, but the fact that the N-gram counts are estimated from page counts is an obvious candidate. The web N-gram count estimates should therefore be used with caution.
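The candidate-selection heuristic and the ratio reported in Table 4 could be sketched as follows; the lookup functions, data structures and names are illustrative only.

    def select_candidates(w1, w2, trigram_followers, bigram_followers,
                          unigram_freq, k=2000):
        """Pick up to k candidate w3's expected to carry most of p(.|w1 w2)'s mass.

        trigram_followers(w1, w2) -> [(w3, corpus trigram count), ...]
        bigram_followers(w2)      -> [(w3, corpus bigram count), ...]
        unigram_freq              -> [(w, corpus unigram count), ...]
        """
        chosen, seen = [], set()
        for source in (trigram_followers(w1, w2), bigram_followers(w2), unigram_freq):
            for w3, _ in sorted(source, key=lambda x: -x[1]):   # decreasing frequency
                if w3 not in seen:
                    chosen.append(w3)
                    seen.add(w3)
                if len(chosen) == k:
                    return chosen
        return chosen

    def normalization_ratio(w1, w2, candidates, c_web):
        """Cumulative web trigram count over the web bigram count (the Table 4 ratio)."""
        cumulative = sum(c_web(f"{w1} {w2} {w3}") for w3 in candidates)
        return cumulative / c_web(f"{w1} {w2}")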

2.2.4 The variance and bias of web trigram estimates

As stated earlier, we are interested in estimating conditional trigram probabilities based on their relative frequency on the web:

$$p_{web}(w_3|w_1 w_2) = \frac{c_{web}(w_1 w_2 w_3)}{c_{web}(w_1 w_2)}$$

It would be informative to compare $p_{web}(w_3|w_1 w_2)$ to a traditional (corpus-derived) trigram probability estimate.


bigram $w_1 w_2$    $c_{web}(w_1 w_2)$   $\sum c_{web}(w_1 w_2 w_3)$ over 2000 $w_3$'s   ratio
about seventy            16498.3                 14807.7                                  90%
and there's             662697.0                724870.0                                 109%
group being              20248.4                 16246.5                                  80%
lewinsky after            1431.9                  1631.7                                 114%
two hundred             389949.0                457656.0                                 117%
willy b.                  1334.6                   607.2                                  45%

Table 4: Sanity check: are the web counts normalized?

[Figure 3: Ratio of web trigram estimates to corpus trigram estimates, $p_{web}(w_3|w_1 w_2)\,/\,p(w_3|w_1 w_2)$, plotted against the count of $w_1 w_2 w_3$ in the baseline corpus, for AltaVista, Lycos and FAST.]

To this end, we created a baseline trigram language model $LM$ from the 103 million word baseline corpus. We used modified Kneser-Ney smoothing [7] [8] which, according to [8], is one of the best smoothing methods available. In building $LM$, we discarded all singleton trigrams in the baseline corpus, a common practice to reduce language model size. We denote $LM$'s probability estimates by $p(\cdot)$.

With $LM$, we were able to compare $p_{web}(w_3|w_1 w_2)$ with $p(w_3|w_1 w_2)$. We computed the ratio

$$r(w_1 w_2 w_3) = \frac{p_{web}(w_3|w_1 w_2)}{p(w_3|w_1 w_2)}$$

between these two estimates. We expected $r$ to be more spread out (having larger variance) when $c_{corpus}(w_1 w_2 w_3)$ is small, since in this case $p(w_3|w_1 w_2)$ tends to be unreliable.

We computed $r(w_1 w_2 w_3)$ for every trigram in the test text, excluding those with $c_{web}(w_1 w_2) = 0$ (for which $p_{web}$ is unavailable). We plot $r(w_1 w_2 w_3)$ vs. $c_{corpus}(w_1 w_2 w_3)$ in Figure 3. We found that:

1. For trigrams with large $c_{corpus}(w_1 w_2 w_3)$, $r$ averages to about 1. Thus the web estimates are consistent with $LM$ in this case.

2. As we expected, the variance of $r$ is largest when $c_{corpus}(w_1 w_2 w_3) = 0$, and decreases as the count gets larger. Hence the ‘funnel’ shape.

3. When $c_{corpus}(w_1 w_2 w_3)$ is small, especially 0 and 1, $r$ is biased upward. This is of course good news, as it suggests that this is where the web estimates tend to improve on the corpus estimates.

4. All the search engines give similar results.

3 Combining web estimates with an existing language model

In the previous section, we saw the potential of the web: it is huge, it has better trigram coverage, and its trigram estimates are largely consistent with the corpus-based estimates. Nevertheless, querying the web for each and every N-gram is infeasible. This prevents us from building a full-fledged language model from the web via search engines. Moreover, Table 4 indicates that web estimates are not well normalized. In addition, the content of the web is heterogeneous and usually doesn't coincide with our domain of interest.

Based on these considerations, we decided not to try to build an entire language model from the web. Rather, we will start from a traditional language model $LM$, and interpolate its least reliable trigram estimates with the appropriate estimates from the web.

Unreliable trigram estimates, especially those involving backing off to lower order N-grams, have been shown to be correlated with increased speech recognition errors [9] [10]. By going to the much larger web for reliable estimates, our hope was to alleviate this problem. We used the trigram counts in the baseline corpus as a heuristic to decide the reliability of trigram estimates in $LM$. A trigram estimate $p(w_3|w_1 w_2)$ is deemed unreliable if

$$c_{corpus}(w_1 w_2 w_3) \leq \tau$$

where $\tau$, the ‘reliability threshold’, is a predetermined small integer, e.g. 1. Admittedly this definition of unreliable estimates is biased.

Even with this definition, there are still too many unreliable trigram estimates to query the web for. Since we were interested in N-best list rescoring, we further restricted the queries to those unreliable trigrams that appeared in the particular N-best list being processed. This greatly reduces the number of web queries, at the price of some further bias. Let $U_{w_1 w_2}$ be the set of words that have unreliable trigram estimates with history $w_1 w_2$ in the current N-best list, i.e.

$$U_{w_1 w_2} = \{\, w_3 \mid w_1 w_2 w_3 \in \text{N-best list},\; c_{corpus}(w_1 w_2 w_3) \leq \tau \,\} \qquad (1)$$

We obtain $c_{web}(w_1 w_2 u)$ for $u \in U_{w_1 w_2}$ and $c_{web}(w_1 w_2)$ via search engines, and compute $p_{web}(u|w_1 w_2)$, the web relative frequency estimates, from these web counts.

Let $\hat{p}(u|w_1 w_2)$ denote the final interpolated estimates, which combine $p(u|w_1 w_2)$ and $p_{web}(u|w_1 w_2)$. We would like to have a tunable parameter so that on one extreme $\hat{p}(u|w_1 w_2) = p(u|w_1 w_2)$, while on the other extreme $\hat{p}(u|w_1 w_2) = p_{web}(u|w_1 w_2)$. We now present three different methods for doing this.
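A small sketch of how the sets $U_{w_1 w_2}$ of equation (1) could be collected from an N-best list; the corpus-count lookup and the data layout are placeholders.

    from collections import defaultdict

    def collect_unreliable(nbest_hypotheses, c_corpus, tau=1):
        """Return {(w1, w2): set of w3} for trigrams in the N-best list whose
        corpus count is at most the reliability threshold tau (equation (1))."""
        U = defaultdict(set)
        for hyp in nbest_hypotheses:                 # each hypothesis: a list of words
            for w1, w2, w3 in zip(hyp, hyp[1:], hyp[2:]):
                if c_corpus((w1, w2, w3)) <= tau:
                    U[(w1, w2)].add(w3)
        return U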

3.1 Exponential Models with Gaussian Priors

We define a set of binary functions, or ‘features’, as follows:

$$f_{w_1 w_2 u}(w_3) = \begin{cases} 1 & \text{if } w_3 = u \\ 0 & \text{otherwise} \end{cases}$$

for all $w_1 w_2$ in the N-best list and all $u \in U_{w_1 w_2}$. Next, for any given $w_1 w_2$, we define a conditional exponential model $p_E$ with these features:

$$p_E(w_3|w_1 w_2) = \frac{p(w_3|w_1 w_2)\, \exp\!\left( \sum_{u \in U_{w_1 w_2}} \lambda_u f_{w_1 w_2 u}(w_3) \right)}{Z(w_1 w_2)} \qquad (2)$$

where $p(\cdot)$ is the estimate provided by $LM$, the $\lambda_u$'s are parameters to be optimized, and $Z(w_1 w_2)$ is a normalization factor. This model has exactly the same form as a conventional Maximum Entropy / Minimum Discriminative Information (ME/MDI) model [11] [12]. Let $\Lambda$ denote the set of parameters. If we maximize the likelihood of the web counts

$$L(\Lambda) = \prod_{w_1 w_2 w_3} p_E(w_3|w_1 w_2)^{c_{web}(w_1 w_2 w_3)}$$

with the standard Generalized Iterative Scaling algorithm (GIS) [13], we get the ME/MDI solution that satisfies the following constraints:

$$p_E(u|w_1 w_2) = p_{web}(u|w_1 w_2) = \frac{c_{web}(w_1 w_2 u)}{c_{web}(w_1 w_2)}, \qquad \forall\, u \in U_{w_1 w_2} \qquad (3)$$

This corresponds to one extreme of the interpolation. But since we want to control the degree of interpolation, we introduce a Gaussian prior with mean 0 and variance $\sigma^2$ over $\Lambda$:

$$p(\Lambda) \propto \prod_u \exp\!\left( -\frac{\lambda_u^2}{2\sigma^2} \right)$$

And instead of seeking the maximum likelihood solution, we seek the maximum a posteriori (MAP) solution that maximizes

$$L(\Lambda)\, p(\Lambda)$$

This can be done by slightly modifying the GIS algorithm, as described in [14].

With this Gaussian prior, we can control the degree of interpolation by choosing the value of $\sigma^2$, which acts as a tuning parameter: if $\sigma^2 \rightarrow \infty$, the Gaussian prior is flat and places virtually no restriction on the values of the $\lambda_u$'s. Thus the $\lambda_u$'s can reach their ME/MDI solutions, and hence $p_E$ reaches one extreme as in (3). On the other hand, if $\sigma^2 \rightarrow 0$, the Gaussian prior forces the $\lambda_u$'s to stay close to the mean, which is 0. From (2) we know that in this case $p_E = p$. This corresponds to the other extreme of the interpolation. A $\sigma^2$ between 0 and $\infty$ results in an intermediate distribution $p_E$.

For the purpose of comparison, we experimented with two other interpolation methods, which are easy to implement but may be theoretically less well motivated: linear interpolation and geometric interpolation.
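As an illustration, here is a minimal sketch of the MAP fit for a single history $w_1 w_2$. It uses plain gradient ascent rather than the modified GIS of [14], and the step-size and iteration settings are placeholders; the baseline distribution and web counts are assumed to be given as dictionaries.

    import math

    def fit_exponential_map(p_base, web_counts, web_bigram_count, sigma2, iters=2000):
        """MAP fit of the lambdas of equation (2) for one history w1 w2.

        p_base           : dict word -> p(word | w1 w2); assumed to sum to 1 over
                           the vocabulary and to contain every u in U_{w1 w2}
        web_counts       : dict u -> c_web(w1 w2 u) for u in U_{w1 w2}
        web_bigram_count : c_web(w1 w2)
        sigma2           : variance of the zero-mean Gaussian prior
        Returns the interpolated distribution p_E as a dict.
        """
        lr = 1.0 / max(web_bigram_count, 1.0)        # crude, placeholder step size
        lam = {u: 0.0 for u in web_counts}

        def model():
            Z = sum(p * math.exp(lam.get(w, 0.0)) for w, p in p_base.items())
            return {w: p * math.exp(lam.get(w, 0.0)) / Z for w, p in p_base.items()}

        for _ in range(iters):
            pE = model()
            for u in lam:
                # gradient of (web-count log-likelihood + log Gaussian prior)
                grad = web_counts[u] - web_bigram_count * pE[u] - lam[u] / sigma2
                lam[u] += lr * grad
        return model()

With a very large sigma2 the fitted distribution approaches the web relative frequencies on $U_{w_1 w_2}$, as in constraint (3); with sigma2 near zero it stays at the baseline distribution, mirroring the two extremes described above.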

3.2 Linear Interpolation

In linear interpolation, we have

$$p_L(w_3|w_1 w_2) = \begin{cases} \alpha\, p_{web}(w_3|w_1 w_2) + (1-\alpha)\, p(w_3|w_1 w_2) & \text{if } w_3 \in U_{w_1 w_2} \\[1ex] \left(1 - \sum_{u \in U_{w_1 w_2}} p_L(u|w_1 w_2)\right) \dfrac{p(w_3|w_1 w_2)}{1 - \sum_{u \in U_{w_1 w_2}} p(u|w_1 w_2)} & \text{otherwise} \end{cases} \qquad (4)$$

In this case, $\alpha$ is the tuning parameter. If $\alpha = 0$, $p_L = p$; if $\alpha = 1$, $p_L$ satisfies (3). An $\alpha$ in between results in an intermediate $p_L$.
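A sketch of equation (4) for one history; p_base, p_web and U stand for the quantities defined above, and p_base is assumed to sum to one over the vocabulary.

    def linear_interpolate(p_base, p_web, U, alpha):
        """Equation (4): mix baseline and web estimates on U, then rescale the
        remaining words so that the distribution still sums to one."""
        pL = {u: alpha * p_web[u] + (1.0 - alpha) * p_base[u] for u in U}
        mass_left = 1.0 - sum(pL.values())
        base_left = 1.0 - sum(p_base[u] for u in U)
        for w, p in p_base.items():
            if w not in U:
                pL[w] = mass_left * p / base_left
        return pL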

3.3 Geometric Interpolation

In geometric interpolation, we have

$$p_G(w_3|w_1 w_2) = \begin{cases} p(w_3|w_1 w_2)^{1-\beta} \left( \dfrac{c_{web}(w_1 w_2 w_3) + \epsilon}{c_{web}(w_1 w_2) + \epsilon |V|} \right)^{\beta} & \text{if } w_3 \in U_{w_1 w_2} \\[2ex] \left(1 - \sum_{u \in U_{w_1 w_2}} p_G(u|w_1 w_2)\right) \dfrac{p(w_3|w_1 w_2)}{1 - \sum_{u \in U_{w_1 w_2}} p(u|w_1 w_2)} & \text{otherwise} \end{cases} \qquad (5)$$

Note that here we have to smooth the web estimates to avoid zeros (which is not a problem in the previous two methods). To do this, we simply add a small positive value $\epsilon$ to the web counts; this is known as additive smoothing [8]. The value of $\epsilon$ is determined beforehand to minimize the perplexity, and once $\epsilon$ is chosen it is fixed while we tune $\beta$. $\beta$ is the interpolation parameter. If $\beta = 0$, $p_G = p$; if $\beta = 1$, $p_G$ matches the smoothed web estimates. A $\beta$ in between results in an intermediate $p_G$.
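And a corresponding sketch of equation (5); epsilon, the vocabulary size and the count lookups are placeholders, and as above p_base is assumed to sum to one.

    def geometric_interpolate(p_base, c_web_tri, c_web_bi, U, beta, eps, vocab_size):
        """Equation (5): geometric mix of the baseline estimate and the additively
        smoothed web estimate on U, with the remaining probability mass
        redistributed in proportion to the baseline."""
        pG = {}
        for u in U:
            p_web_smoothed = (c_web_tri[u] + eps) / (c_web_bi + eps * vocab_size)
            pG[u] = (p_base[u] ** (1.0 - beta)) * (p_web_smoothed ** beta)
        mass_left = 1.0 - sum(pG.values())
        base_left = 1.0 - sum(p_base[u] for u in U)
        for w, p in p_base.items():
            if w not in U:
                pG[w] = mass_left * p / base_left
        return pG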

4 Experimental Results

We randomly selected 200 utterance segments from the TREC-7 Spoken Document Retrieval track data [15] as our test set for this experiment. For each utterance we have its correct transcript and an N-best list with N = 1000, i.e. 1000 decoding hypotheses. We performed N-best list rescoring to measure the word error rate (WER) improvement, and computed the perplexity of the transcript. Note that the test set is relatively small and N is not very deep, since we wanted to limit the number of web queries to within a practical range.

4.1 Word Error Rate

If we rescore the N-best lists with $LM$ and pick the top hypotheses, the WER is 33.45%. This is our baseline WER. The oracle WER, i.e. if we were able to pick the least errorful hypothesis among the 1000 for each N-best list, is 25.26%. Of course we cannot achieve the oracle WER, but it indicates there is room for improvement over $LM$.

Since each utterance has 1000 hypotheses in the N-best list, the total number of trigrams is very large. Table 5 lists the number of trigram tokens (occurrences) and types (unique ones) in all the N-best lists combined, together with the percentage of unreliable trigram types and tokens as determined by the reliability threshold $\tau$. Note that trigrams containing the start-of-sentence or end-of-sentence markers (commonly designated by <s> and </s>) are excluded from the table, since they can't be queried from the web. For each N-best list, we queried the unreliable trigrams (and associated bigrams) in the list, from which we computed $\hat{p}$ with the three different interpolation methods. We then used $\hat{p}$ to rescore the N-best list, and calculated the WER of the top hypothesis after rescoring.
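For concreteness, the rescoring step can be sketched as below. Only the language-model side is shown; in the actual system the acoustic score and the usual language-model weight would also enter the hypothesis score, and the sentence-boundary handling here is our simplification.

    import math

    def rescore_nbest(nbest, prob):
        """Pick the hypothesis with the highest total trigram log-probability.

        nbest : list of hypotheses, each a list of words
        prob  : callable (w3, w1, w2) -> interpolated probability p_hat(w3 | w1 w2)
        """
        def lm_score(hyp):
            padded = ["<s>", "<s>"] + hyp
            return sum(math.log(prob(w3, w1, w2))
                       for w1, w2, w3 in zip(padded, padded[1:], padded[2:]))
        return max(nbest, key=lm_score)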

First, we set the reliability threshold $\tau = 0$, i.e. we regard only those trigrams that never occur in the baseline corpus as unreliable. Figure 4(a) shows the WER with exponential models and Gaussian priors. The three curves stand for different search engines, which turn out to be very similar. The horizontal dashed line is the baseline WER. As predicted, when the variance $\sigma^2$ of the Gaussian prior goes to 0 (the left side of the figure), $p_E$ converges to $p$ and the WER converges to the baseline WER. On the other hand, when $\sigma^2 \rightarrow \infty$, the estimates of the unreliable trigrams come solely from the web. Such estimates seem inferior, and the model has a higher WER than the baseline. Between these two extremes, the WER reaches its minimum (32.53% with AltaVista) at an intermediate value of $\sigma^2$.

                 total        τ=0        τ=1        τ=2        τ=3        τ=4        τ=5
tokens       5,311,303  2,002,530  2,310,416  2,496,312  2,650,340  2,772,500  2,889,348
                            37.7%      43.5%      47.0%      49.9%      52.2%      54.4%
types           57,107     36,190     39,059     40,893     42,158     43,110     43,863
                            63.4%      68.4%      71.6%      73.8%      75.5%      76.8%

Table 5: Number of unreliable trigrams in the N-best lists (counts and percentages for reliability thresholds 0 through 5)

[Figure 4: Word error rates of web-improved language models as a function of the smoothing parameter for several different interpolation schemes, based on N-best rescoring: (a) exponential models vs. $\sigma^2$, (b) linear interpolation vs. $\alpha$, (c) geometric interpolation vs. $\beta$, (d) reliability threshold $\tau$.]

Figure 4(b) is the WER with linear interpolation. Again, the minimum WER, 32.56%, is reached between the two extremes, at an intermediate $\alpha$, by AltaVista.

To use geometric interpolation, we needed to choose a value for $\epsilon$ first; we chose the value that minimized the perplexity. Next we varied $\beta$ while keeping $\epsilon$ fixed, and plotted the WER of the interpolated model in Figure 4(c). As with the previous interpolation methods, the WER reaches its minimum when the interpolation factor is near the middle. The minimum, 32.69%, is obtained with FAST.

Next, we adjusted the reliability threshold $\tau$ and observed its effect on WER. The interpolation method used here is the exponential model with a Gaussian prior. We varied $\tau$ from 0 to 5. With a larger threshold, more trigrams are regarded as unreliable, and hence more web queries had to be issued. As shown in Figure 4(d), there is a slight but definite improvement in WER when we increase $\tau$ from 0 to 1. For example, the WER with $\tau = 1$ and AltaVista is 32.45%. Further increments result in about the same WER, averaged over search engines. Note that $LM$, the language model we are incorporating the web estimates into, was built after excluding all singleton trigrams in the corpus. This may explain why $\tau = 1$ is better, since trigrams with counts 0 or 1 in the corpus are indeed unreliable: in $LM$ they must back off to a bigram or unigram.

To analyze the source of improvement, we broke down the WER according to the trigram backoff modes in $LM$. First, we marked each word $w_i$ in the transcript with one of several labels, using the following rules: Let $w_{i-2}$ and $w_{i-1}$ be the two words preceding $w_i$. If the trigram $w_{i-2} w_{i-1} w_i$ exists in $LM$, label $w_i$ as ‘3’. Otherwise, if the trigram doesn't exist in $LM$ but the bigram $w_{i-1} w_i$ does, label $w_i$ as ‘3-2’, meaning $LM$ has to back off to the bigram for $w_i$. If the bigram doesn't exist in $LM$ either, label $w_i$ as ‘3-2-1’, since $LM$ has to back off to the unigram. In the second step, we compared the transcript with the top hypotheses after rescoring the N-best lists with $p$. Each word in the transcript obtains a second label of either “correct” or “wrong” depending on whether the word is correct in the corresponding top hypothesis. We then collected the percentage of correct words within categories ‘3’, ‘3-2’ and ‘3-2-1’ respectively. In the third step we repeated the second step, except that the top hypotheses are now obtained by rescoring the N-best lists with $p_E$ (with the parameter settings described above), using AltaVista as the search engine. We compare the percentage of errors in step 2 and step 3 in Table 6. Note that insertion errors are not counted in our error breakdown.

Not surprisingly, the ‘3-2-1’ category has the highest error rate for both $p$ and $p_E$, since the words in this category are the hardest from the language model's point of view. The ‘3-2’ category has a lower error rate, and ‘3’ has the lowest. The interpolated language model $p_E$ improves the error rate in all three categories, compared to $p$. The largest improvement is in the ‘3-2-1’ category, which suggests the web helps most with the hardest cases. It is not clear though why the ‘3-2’ category is not improved as much.

category   words   error rate ($p$)   error rate ($p_E$)
3           3480        23.3%              22.8%
3-2         2236        30.7%              30.1%
3-2-1        479        50.1%              46.1%

Table 6: Error breakdown by $LM$ backoff mode
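A sketch of the labeling step behind this breakdown; the trigram and bigram membership sets stand in for lookups into $LM$, and the sentence-start padding is our simplification.

    def backoff_labels(transcript, lm_trigrams, lm_bigrams):
        """Label each transcript word with the backoff mode LM would use for it:
        '3' (trigram present), '3-2' (backoff to bigram), '3-2-1' (to unigram)."""
        labels = []
        for i, w in enumerate(transcript):
            w1 = transcript[i - 2] if i >= 2 else "<s>"
            w2 = transcript[i - 1] if i >= 1 else "<s>"
            if (w1, w2, w) in lm_trigrams:
                labels.append("3")
            elif (w2, w) in lm_bigrams:
                labels.append("3-2")
            else:
                labels.append("3-2-1")
        return labels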


4.2 Approximate Perplexity

There are 6195 words in the transcript. The baseline perplexity of the transcript with $LM$ is 196.7. We wanted to compute the perplexity of the transcript with different interpolated language models. We define $U_{w_1 w_2}$ in (1) based on the transcript. However, this introduces a subtle bias: the interpolated models now depend on the transcript. In other words, we are dynamically choosing models according to the words we will be predicting. The resulting scores are therefore not strictly interpretable as probabilities. For this reason we consider the perplexities we get on the transcript to be approximate only. We still report these values in this section because we believe that the distortion is not too severe, and the approximation still provides useful insight into the true perplexity of web-improved language models. Note that, although the same kind of bias exists in the WER computation, it doesn't diminish the validity of the WER improvement we get there, since in classification it is not the particular probability value but the ranking that matters.
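For reference, the (approximate) perplexity is computed in the usual way; a minimal sketch, where prob stands for the interpolated model and the sentence-boundary handling is our simplification.

    import math

    def perplexity(transcript_sentences, prob):
        """Perplexity of the transcript under a trigram model.

        transcript_sentences : list of sentences, each a list of words
        prob                 : callable (w3, w1, w2) -> p(w3 | w1 w2)
        """
        log_sum, n_words = 0.0, 0
        for sent in transcript_sentences:
            padded = ["<s>", "<s>"] + sent
            for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
                log_sum += math.log(prob(w3, w1, w2))
                n_words += 1
        return math.exp(-log_sum / n_words)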

Figure 5(a–c) compares the different interpolation methods when the reliability threshold is $\tau = 0$. There are 2274 unique unreliable trigrams in the transcript. We submitted them (and the corresponding bigrams) as queries to the search engines, and computed $\hat{p}$ with the three different interpolation methods described in the last sections, respectively. From $\hat{p}$ we computed the approximate perplexities.

[Figure 5: Approximate perplexity of web-improved language models as a function of the smoothing parameter for several different interpolation schemes: (a) exponential models vs. $\sigma^2$, (b) linear interpolation vs. $\alpha$, (c) geometric interpolation vs. $\beta$, (d) reliability threshold $\tau$.]

Figure 5(a) shows the approximate perplexity with the exponential model and a Gaussian prior. Like the WER in Figure 4(a), the approximate perplexity converges to the baseline when the Gaussian prior variance $\sigma^2 \rightarrow 0$. The approximate perplexity worsens when $\sigma^2 \rightarrow \infty$. The best value, 156.9, is achieved by FAST, also between these two extremes at an intermediate $\sigma^2$. Again, the different search engines are similar.

Figure 5(b) is the approximate perplexity with linear interpolation. It is also similar to the WER in Figure 4(b). The minimum, 156.2, is reached by FAST at an intermediate $\alpha$.

Figure 5(c) shows the approximate perplexity with geometric interpolation and the same $\epsilon$ as before. As with the previous interpolation methods, the approximate perplexity converges to the baseline when $\beta \rightarrow 0$ and is worse when $\beta \rightarrow 1$. But unlike the other methods, the approximate perplexity seems to be always worse than the baseline, and increases monotonically with $\beta$.

Figure 5(d) compares the effect of the reliability threshold $\tau$ on the approximate perplexity. As in Figure 4(d), the interpolation method used is the exponential model with a Gaussian prior. Again we see improvement when we increase $\tau$ from 0 to 1. For example, FAST's approximate perplexity goes down to 147.5. We believe this can be explained similarly to Figure 4(d).

5 Discussions

In this paper, we demonstrated that trigram estimates obtained from the web can significantly improve WER relative to pure corpus-based estimates, even though the web estimates are noisy, and the web and the test set are not in the same domain. We believe the improvement largely comes from better trigram coverage due to the sheer size of the web, which acts as a ‘general English’ knowledge source. Interestingly, which search engine is used doesn't make much difference. Furthermore, which interpolation method is used doesn't make much difference either (at least for WER), as long as an appropriate interpolation parameter is chosen.

Our method has certain advantages. Besides having better N-gram coverage, the content of the web is constantly changing, so our method would enable automatic, up-to-date language modeling. However, there are also several disadvantages. The most severe one is the large number of web queries. In our experiment, we needed to submit an average of 340 queries to the web for each utterance. This results in heavy web traffic, a heavy workload on the search engines, and a very slow rescoring process. Another concern is privacy: one may be sending fragments of potentially sensitive utterances to the web. Both problems, however, can be partly solved by using a web-in-a-box setting, i.e. if we have a snapshot of the text content of the whole WWW on local storage. Yet another problem is the lack of focus on domain-specific language.

This might be solved by querying specific domain hosts instead of the whole web, although by doing so the N-gram coverage may deteriorate.

The method proposed in this paper is only one crude way of exploiting the web as a knowledge source for language modeling. Instead of focusing on trigrams, one could look for more complex phenomena, e.g. semantic coherence [16] among the content words in a hypothesis. Intuitively, if a hypothesis has content words that ‘go with each other’, it is more likely than one whose content words seldom appear together in a large training text set. The web + search engine approach seems well suited for this purpose. We are currently pursuing this direction.

Acknowledgement

The authors are grateful to Stanley Chen, Matthew Siegler, Chris Paciorek and Kevin Lenzo for their help. The first author has been supported in part by NSF LIS under grant REC-9720374.

References

[1] Ronald Rosenfeld. Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 88(8), 2000.

[2] Adam Berger and Robert Miller. Just-in-time language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume II, pages 705–708, Seattle, Washington, 1998.

[3] AltaVista. http://www.altavista.com/.

[4] Lycos. http://www.lycos.com/.

[5] FAST Search. http://www.alltheweb.com/.

[6] Stephen E. Fienberg and Adrian Dobra. How big is the world wide web? Technical report, Department of Statistics, Carnegie Mellon University, 2000. Submitted for publication.

[7] Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume I, pages 181–184, Detroit, Michigan, May 1995.

[8] Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, 1998. Available from ftp://ftp.das.harvard.edu/techreports/tr-10-98.ps.gz.

[9] Lin Chase, Ronald Rosenfeld, and Wayne Ward. Error-responsive modifications to speech recognizers: Negative N-grams. In Proceedings of the ICSLP, 1994.

[10] Stanley F. Chen, Douglas Beeferman, and Ronald Rosenfeld. Evaluation metrics for language models. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, pages 275–280, 1998.

[11] S. Della Pietra, V. Della Pietra, R.L. Mercer, and S. Roukos. Adaptive language modeling using minimum discriminant estimation. In Proceedings of the Speech and Natural Language DARPA Workshop, February 1992.

[12] Adam Berger, Stephen Della Pietra, and Vincent Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71, 1996.

[13] J.N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43:1470–1480, 1972.

[14] Stanley F. Chen and Ronald Rosenfeld. A Gaussian prior for smoothing maximum entropy models. Technical Report CMU-CS-99-108, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, 1999.

[15] John S. Garofolo, Ellen M. Voorhees, Cedric G. P. Auzanne, Vincent M. Stanford, and Bruce A. Lund. 1998 TREC-7 spoken document retrieval track overview and results. In Proceedings of TREC-7: The Seventh Text Retrieval Conference, 1998.

[16] Can Cai, Larry Wasserman, and Roni Rosenfeld. Exponential language models, logistic regression, and semantic coherence. In Proceedings of the NIST/DARPA Speech Transcription Workshop, May 2000.
