
Web Text Corpus for Natural Language Processing

Vinci Liu and James R. Curran
School of Information Technologies
University of Sydney
NSW 2006, Australia
{vinci,james}@it.usyd.edu.au

Abstract

Web text has been successfully used as training data for many NLP applications. While most previous work accesses web text through search engine hit counts, we created a Web Corpus by downloading web pages to create a topic-diverse collection of 10 billion words of English. We show that for context-sensitive spelling correction the Web Corpus results are better than using a search engine. For thesaurus extraction, it achieved similar overall results to a corpus of newspaper text. With many more words available on the web, better results can be obtained by collecting much larger web corpora.

1 Introduction

Traditional written corpora for linguistics research are created primarily from printed text, such as newspaper articles and books. With the growth of the World Wide Web as an information resource, it is increasingly being used as training data in Natural Language Processing (NLP) tasks.

There are many advantages to creating a corpus from web data rather than printed text. All web data is already in electronic form and therefore readable by computers, whereas not all printed data is available electronically. The vast amount of text available on the web is a major advantage, with Keller and Lapata (2003) estimating that over 98 billion words were indexed by Google in 2003.

The performance of NLP systems tends to improve with increasing amounts of training data. Banko and Brill (2001) showed that for context-sensitive spelling correction, increasing the training data size increases the accuracy, for up to 1 billion words in their experiments.

To date, most NLP tasks that have utilised web data have accessed it through search engines, using only the hit counts or examining a limited number of results pages. The tasks are reduced to determining n-gram probabilities, which are then estimated by hit counts from search engine queries. This method only gathers information from the hit counts, but does not require the computationally expensive downloading of actual text for analysis. Unfortunately, search engines were not designed for NLP research, and the reported hit counts are subject to uncontrolled variations and approximations (Nakov and Hearst, 2005). Volk (2002) proposed a linguistic search engine to extract relationships more accurately.

We created a 10 billion word topic-diverse Web Corpus by spidering websites from a set of seed URLs. The seed set is selected from the Open Directory to ensure that a diverse range of topics is included in the corpus. A process of text cleaning transforms the HTML text into a form useable by most NLP systems – tokenised words, one sentence per line. Text filtering removes unwanted text from the corpus, such as non-English sentences and most lines of text that are not grammatical sentences. We compare the vocabulary of the Web Corpus with newswire.

Our Web Corpus is evaluated on two NLP tasks. Context-sensitive spelling correction is a disambiguation problem, where the correct word in a confusion set (e.g. {their, they're}) needs to be selected for a given context. Thesaurus extraction is a similarity task, where synonyms of a target word are extracted from a corpus of unlabelled text. Our evaluation demonstrates that web text can be used for the same tasks as search engine hit counts and newspaper text. However, there is a much larger quantity of freely available web text to exploit.

2 Existing Web Corpora

The web has become an indispensable resource with a vast amount of information available. Many NLP tasks have successfully utilised web data, including machine translation (Grefenstette, 1999), prepositional phrase attachment (Volk, 2001), and other-anaphora resolution (Modjeska et al., 2003).

2.1 Search Engine Hit Counts

Most NLP systems that have used the web access it via search engines such as Altavista and Google. N-gram counts are approximated by literal queries "w1 ... wn". Relations between two words are approximated in Altavista by the NEAR operator (which locates word pairs within 10 tokens of each other). The overall coverage of the queries can be expanded by morphological expansion of the search terms.

Keller and Lapata (2003) demonstrated a high degree of correlation between n-gram estimates from search engine hit counts and n-gram frequencies obtained from traditional corpora such as the British National Corpus (BNC). The hit counts also had a higher correlation to human plausibility judgements than the BNC counts.

The web count method contrasts with traditional methods where the frequencies are obtained from a corpus of locally available text. While the corpus is much smaller than the web, an accurate count and further analysis is possible because all of the contexts are readily accessible. The web count method obtains only an approximate number of matches on the web, with no control over which pages are indexed by the search engines and with no further analysis possible.

There are a number of limitations in the search engine approximations. As many search engines discard punctuation information (especially when using the NEAR operator), words considered adjacent to each other could actually lie in different sentences or paragraphs. For example, in Volk (2001) the system assumes that a preposition attaches to a noun simply when the noun appears within a fixed context window of the preposition. The preposition and noun could in fact be related differently or be in different sentences altogether.

The speed of querying search engines is another concern. Keller and Lapata (2003) needed to obtain the frequency counts of 26,271 test adjective pairs from the web and from the BNC for the task of prenominal adjective ordering. While extracting this information from the BNC presented no difficulty, making so many queries to Altavista was too time-consuming. They had to reduce the size of the test set to obtain a result.

Lapata and Keller (2005) performed a wide range of NLP tasks using web data by querying Altavista and Google. This included a variety of generation tasks (e.g. machine translation candidate selection) and analysis tasks (e.g. prepositional phrase attachment, countability detection). They showed that while web counts usually outperformed BNC counts and consistently outperformed the baseline, the best performing system is usually a supervised method trained on annotated data. Keller and Lapata concluded that having access to linguistic information (accurate n-gram counts, POS tags, and parses) outperforms using a large amount of web data.
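The hit-count approach reduces each n-gram estimate to a ratio of query counts. The following minimal sketch in Python illustrates the reduction; the hit_count function is a hypothetical stand-in for a search engine API, not an interface from any of the systems cited above:

    # A minimal sketch of n-gram estimation from search engine hit counts.
    def hit_count(query):
        # Hypothetical placeholder: a real implementation would submit the
        # query to a search engine and return the reported number of
        # matching pages (a page count, not a true n-gram frequency).
        raise NotImplementedError

    def bigram_probability(w1, w2):
        # P(w2 | w1) is approximated by the ratio of hit counts for the
        # literal query "w1 w2" and the single-word query "w1".
        pair = hit_count('"%s %s"' % (w1, w2))
        single = hit_count('"%s"' % w1)
        return pair / single if single else 0.0

As noted above, the counts returned are page counts subject to the engine's own approximations, which is one source of the estimation error.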
2.2 Spidered Web Corpora

A few projects have utilised data downloaded from the web. Ravichandran et al. (2005) used a collection of 31 million web pages to produce noun similarity lists. They found that most NLP algorithms are unable to run on web-scale data, especially those with quadratic running time. Halacsy et al. (2004) created a Hungarian corpus from the web by downloading text from the .hu domain. From an 18 million page crawl of the web, a 1 billion word corpus was created (removing duplicates and non-Hungarian text).

A terabyte-sized corpus of the web was collected at the University of Waterloo in 2001. A breadth-first search from a seed set of university home pages yielded over 53 billion words, requiring 960GB of storage. Clarke et al. (2002) and Terra and Clarke (2003) used this corpus for their question answering system. They obtained increasing performance with increasing corpus size but began reaching asymptotic behaviour at the 300-500GB range.

3 Creating the Web Corpus

There are many challenges in creating a web corpus, as the World Wide Web is unstructured and without a definitive directory. No simple method exists to collect a large representative sample of the web. Two main approaches exist for collecting representative web samples – IP address sampling and random walks. The IP address sampling technique randomly generates IP addresses and explores any websites found (Lawrence and Giles, 1999). This method requires substantial resources as many attempts are made for each website found. Lawrence and Giles reported that 1 in 269 tries found a web server.

Random walk techniques attempt to simulate a regular undirected web graph (Henzinger et al., 2000). In such a graph, a random walk would produce a uniform sample of the nodes (i.e. the web pages). However, only an approximation of such a graph is possible, as the web is directed (i.e. you cannot easily determine all web pages linking to a particular page). Most implementations of random walks approximate the number of backward links by using information from search engines.

3.1 Web Spidering

We created a 10 billion word Web Corpus by spidering the web. While the corpus is not designed to be a representative sample of the web, we attempt to sample a topic-diverse collection of websites. Our web spider is seeded with links from the Open Directory (The Open Directory Project, http://www.dmoz.org).

The Open Directory has a broad coverage of many topics on the web and allows us to create a topic-diverse collection of pages. Before the directory can be used, we had to address several coverage skews. Some topics have many more links in the Open Directory than others, simply due to the availability of editors for different topics. For example, we found that the topic University of Connecticut has roughly the same number of links as Ontario Universities. We would normally expect universities in a whole province of Canada to have more coverage than a single university in the United States. The directory was also constructed without keeping more general topics higher in the tree. For example, we found that Chicken Salad is higher in the hierarchy than Catholicism. The Open Directory is flattened by a rule-based algorithm which is designed to take into account the coverage skews of some topics, producing a list of 358 general topics.

From the seed URLs, the spider performs a breadth-first search. It randomly selects a topic node from the list and the next unvisited URL from that node. It visits the website associated with the link and samples pages within the same section of the website until a minimum number of words have been collected or all of the pages have been visited. External links encountered during this process are added to the link collection of the topic node regardless of the actual topic of the link. Although websites of one topic tend to link to other websites of the same topic, this process contributes to a topic drift. As the spider traverses away from the original seed URLs, we are less certain of the topic of the pages included in the collection.
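A minimal sketch of this spidering loop is given below. The fetch_page helper is a hypothetical stand-in for the real downloader, and the directory flattening, page storage, and politeness controls (robots.txt, rate limiting) are omitted:

    import random
    from collections import deque

    # Hypothetical helper standing in for the real downloader: returns the
    # number of words on the page, links within the same site, and links
    # to external sites.
    def fetch_page(url):
        return 0, [], []

    def spider(topic_queues, min_words=10000):
        # topic_queues maps each of the 358 general topics to a deque of
        # unvisited URLs, seeded from the flattened Open Directory.
        while any(topic_queues.values()):
            # Randomly select a topic node, then its next unvisited URL.
            topic = random.choice([t for t, q in topic_queues.items() if q])
            site = deque([topic_queues[topic].popleft()])
            collected = 0
            # Breadth-first within the website until a minimum number of
            # words is collected or all of its pages are visited.
            # (Saving each visited page to the corpus is omitted here.)
            while site and collected < min_words:
                words, internal, external = fetch_page(site.popleft())
                collected += words
                site.extend(internal)
                # External links join the topic node's queue regardless of
                # their actual topic, which contributes to topic drift.
                topic_queues[topic].extend(external)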
3.2 Text Cleaning

Text cleaning is the term we use to describe the overall process of converting raw HTML found on the web into a form useable by NLP algorithms – white space delimited words, separated into one sentence per line. It consists of many low-level processes which are often accomplished by simple rule-based scripts. Our text cleaning process is divided into four major steps.

First, the different character encodings of HTML pages are transformed into ISO Latin-1, and HTML named-entities (e.g. &nbsp; and &amp;) are translated into their single character equivalents.

Second, sentence boundaries are marked. Such boundaries are difficult to identify in web text, as it does not always consist of grammatical sentences. A section of a web page may be mathematical equations or lines of C++ code. Grammatical sentences need to be separated from each other and from other non-sentence text. Sentence boundary detection for web text is a much harder problem than for newspaper text.

We use a machine learning approach to identifying sentence boundaries. We trained a Maximum Entropy classifier following Ratnaparkhi (1998) to disambiguate sentence boundaries in web text, training on 153 manually marked web pages. Systems for newspaper text only use regular text features, such as words and punctuation. Our system for web text uses HTML tag features in addition to regular text features. HTML tag features are essential for marking sentence boundaries in web text, as many boundaries in web text are only indicated by HTML tags and not by the text. Our system using HTML tag features achieves 95.1% accuracy in disambiguating sentence boundaries in web text, compared to 88.9% without such features.
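The sketch below illustrates the kind of feature extraction involved. The regular text features follow the style of Ratnaparkhi (1998); since the exact feature set of the classifier is not enumerated in this paper, the particular features shown are representative examples only:

    def boundary_features(tokens, i, tags_after):
        # Features for a candidate sentence boundary after tokens[i].
        # tags_after holds any HTML tags (e.g. <p>, <br>, </td>) appearing
        # between tokens[i] and tokens[i+1].
        prev_tok = tokens[i]
        next_tok = tokens[i + 1] if i + 1 < len(tokens) else '<END>'
        features = {
            # Regular text features, as used for newspaper text.
            'prev': prev_tok,
            'next': next_tok,
            'prev_is_punct': not prev_tok.isalnum(),
            'next_capitalised': next_tok[:1].isupper(),
        }
        # HTML tag features: many boundaries in web text are indicated
        # only by markup, not by the text itself.
        for tag in tags_after:
            features['tag=%s' % tag] = True
        return features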

Third, tokenisation is accomplished using the sed script used for the Penn Treebank project (MacIntyre, 1995), modified to correctly tokenise URLs, emails, and other web-specific text.

The final step is filtering, where unwanted text is removed from the corpus. A rule-based component analyses each web page and each sentence within a page to identify sections that are unlikely to be useful text. Our rules are similar to those employed by Halacsy et al. (2004), where the percentage of non-dictionary words in a sentence or document helps identify non-Hungarian text. We classify tokens into dictionary words, word-like tokens, numbers, punctuation, and other tokens. Sentences or documents with too few dictionary words or too many numbers, punctuation, or other tokens are discarded.

4 Corpus Statistics

Comparing the vocabulary of the Web Corpus and existing corpora is revealing. We compared it with the Gigaword Corpus, a 2 billion token collection (1.75 billion words before tokenisation) of newspaper text (Graff, 2003). For example, what types of tokens appear more frequently on the web than in newspaper text? From each corpus, we randomly selected a 1 billion word sample and classified the tokens into seven disjoint categories:

Numeric – At least one digit and zero or more punctuation characters, e.g. 2, 3.14, $5.50
Uppercase – Only uppercase, e.g. REUTERS
Title Case – An uppercase letter followed by one or more lowercase letters, e.g. Dilbert
Lowercase – Only lowercase, e.g. violin
Alphanumeric – At least one alphabetic and one digit (allowing for other characters), e.g. B2B, mp3, RedHat-9
Hyphenated Word – Alphabetic characters and hyphens, e.g. serb-dominated, vis-a-vis
Other – Any other tokens

4.1 Token Type Analysis

An analysis by token type shows big differences between the two corpora (see Table 1). The same size samples of the Gigaword and the Web Corpus have very different numbers of token types. While 2.2 million token types are found in the 1 billion word sample of the Gigaword, about twice as many (4.8 million) are found in an equivalent sample of the Web Corpus. Title case tokens are a significant percentage of the token types encountered in both corpora, possibly representing named-entities in the text. There are also a significant number of tokens classified as other in the Web Corpus, possibly representing URLs and email addresses.

               Gigaword          Web Corpus
Tokens         1 billion         1 billion
Token Types    2.2 million       4.8 million
Numeric          343k   15.6%      374k    7.7%
Uppercase         95k    4.3%      241k    5.0%
Title Case       645k   29.3%      946k   19.6%
Lowercase        263k   12.0%      734k   15.2%
Alphanumeric     165k    7.6%      417k    8.6%
Hyphenated       533k   24.3%      970k   20.1%
Other            150k    6.8%    1,146k   23.7%

Table 1: Classification of corpus tokens by type
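These seven categories translate directly into an ordered series of tests over token strings. A minimal sketch follows; the regular expressions approximate the category definitions above and may differ in detail from the exact rules used to produce Table 1:

    import re

    def classify_token(token):
        # Categories follow Section 4; the tests are ordered so that each
        # token receives exactly one label.
        if re.fullmatch(r'[0-9.,$%:/-]*[0-9][0-9.,$%:/-]*', token):
            return 'Numeric'          # at least one digit, e.g. 2, 3.14, $5.50
        if re.fullmatch(r'[A-Z]+', token):
            return 'Uppercase'        # e.g. REUTERS
        if re.fullmatch(r'[A-Z][a-z]+', token):
            return 'Title Case'       # e.g. Dilbert
        if re.fullmatch(r'[a-z]+', token):
            return 'Lowercase'        # e.g. violin
        if re.fullmatch(r'(?=.*[A-Za-z])(?=.*[0-9]).+', token):
            return 'Alphanumeric'     # e.g. B2B, mp3, RedHat-9
        if re.fullmatch(r'[A-Za-z]+(-[A-Za-z]+)+', token):
            return 'Hyphenated Word'  # e.g. serb-dominated, vis-a-vis
        return 'Other'                # e.g. URLs, email addresses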
4.2 Misspellings

One factor contributing to the larger number of token types in the Web Corpus, as compared with the Gigaword, is the misspelling of words. Web documents are authored by people with a widely varying command of English and their pages are not as carefully edited as newspaper articles. Thus, we anticipate a significantly larger number of misspellings and typographical errors.

We identify some of the misspellings as letter combinations that are one transformation away from a correctly spelled word. Consider a target word, correctly spelled. Misspellings can be generated by inserting, deleting, or substituting one letter, or by reordering any two adjacent letters (although we keep the first letter of the original word, as very few misspellings change the first letter).

Table 2 shows some of the misspellings of the word receive found in the Gigaword and the Web Corpus. While only 5 such misspellings were found in the Gigaword, 16 were found in the Web Corpus. For all words found in the Unix dictionary, an average of 1.7 misspellings are found per word in the Gigaword by type. The proportion of mistakes found in the Web Corpus is roughly double that of the Gigaword, at 3.7 misspellings per dictionary word. However, misspellings only represent a small portion of tokens (5.6 million out of 669 million instances of dictionary words are misspellings in the Web Corpus).

Gigaword               Web Corpus
rreceive               reeceive   receieve
recceive               recesive   recive
receieve               recieive   recveive
recive                 receivce   receivve
receiv                 receivee   receve
                       receivea   receiv
                       rceive     reyceive

1.7 misspellings per   3.7 misspellings per
dictionary word        dictionary word
3.1m misspellings in   5.6m misspellings in
699m dict. words       669m dict. words

Table 2: Misspellings of receive
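The candidate generation just described is a one-edit enumeration with the first letter held fixed. A minimal sketch:

    import string

    def one_edit_misspellings(word):
        # All strings one transformation away from `word`, preserving the
        # first letter (very few misspellings change the first letter).
        letters = string.ascii_lowercase
        candidates = set()
        for i in range(1, len(word) + 1):            # insertions
            for c in letters:
                candidates.add(word[:i] + c + word[i:])
        for i in range(1, len(word)):                # deletions
            candidates.add(word[:i] + word[i + 1:])
        for i in range(1, len(word)):                # substitutions
            for c in letters:
                candidates.add(word[:i] + c + word[i + 1:])
        for i in range(1, len(word) - 1):            # adjacent transpositions
            candidates.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
        candidates.discard(word)
        return candidates

For example, one_edit_misspellings('receive') includes recieve (a transposition) and receve (a deletion). Generated candidates that occur in a corpus but are not themselves dictionary words are counted as misspellings of the target.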

5 Context-Sensitive Spelling Correction

A confusion set is a collection of words which are commonly misused, even by native speakers of a language, because of their similarity. For example, the words {it's, its}, {affect, effect}, and {weather, whether} are often mistakenly interchanged. Context-sensitive spelling correction is the task of selecting the correct confusion word in a given context. Two different metrics have been used to evaluate the performance of context-sensitive spelling correction algorithms. The Average Accuracy (AA) is the performance by type, whereas the Weighted Average Accuracy (WAA) is the performance by token.

5.1 Related Work

Golding and Roth (1999) used the Winnow multiplicative weight-updating algorithm for context-sensitive spelling correction. They found that when a system is tested on text from a different corpus than the training set, the performance drops substantially (see Table 3). Using the same algorithm and 80% of the Brown Corpus as training data, the WAA dropped from 96.4% to 94.5% when tested on 40% WSJ instead of 20% Brown.

For cross corpus experiments, Golding and Roth devised a semi-supervised algorithm that is trained on a fixed training set but also extracts information from the same corpus as the testing set. Their experiments showed that even if up to 20% of the testing set is corrupted (using wrong confusion words), a system trained on both the training and testing sets outperformed the system trained only on the training set. The Winnow Semi-Supervised method increases the WAA back up to 96.6%.

Lapata and Keller (2005) utilised web counts from Altavista for confusion set disambiguation. Their unsupervised method uses collocation features (one word to the left and right) where co-occurrence estimates are obtained from web counts of bigrams. This method achieves a stated accuracy of 89.3% AA, similar to the cross corpus experiment for Unpruned Winnow.

Algorithm         Training    Testing     AA    WAA
Unpruned Winnow   Brown 80%   Brown 20%   94.1  96.4
Unpruned Winnow   Brown 80%   WSJ 40%     89.5  94.5
Winnow Semi-Sup.  Brown 80%*  WSJ 40%     93.1  96.6
Search Engine     Altavista   Brown 100%  89.3  N/A

Table 3: Context-sensitive spelling correction (* denotes also using 60% WSJ, 5% corrupted)

5.2 Implementation

Context-sensitive spelling correction is an ideal task for unannotated web data, as unmarked text is essentially labelled data for this particular task: words in a reasonably well-written text are positive examples of the correct usage of confusion words.

To demonstrate the utility of a large collection of web data on a disambiguation problem, we implemented the simple memory-based learner from Banko and Brill (2001). The learner trains on simple collocation features, keeping a count of (wi−1, wi+1), wi−1, and wi+1 for each confusion word wi. The classifier first chooses the confusion word which appears with the context bigram most frequently, followed by the left unigram, right unigram, and then the most frequent confusion word. Three data sets were used in the experiments: the 2 billion word Gigaword Corpus, a 2 billion word sample of our 10 billion word Web Corpus, and the full 10 billion word Web Corpus.
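This learner is simple enough to sketch in full. The sketch below is a reimplementation from the description above, not Banko and Brill's original code:

    from collections import defaultdict

    class MemoryBasedLearner:
        def __init__(self, confusion_set):
            self.confusion = set(confusion_set)
            self.counts = defaultdict(int)

        def train(self, tokens):
            # Every occurrence of a confusion word in running text is a
            # positive training example of its correct usage.
            for i, w in enumerate(tokens):
                if w in self.confusion:
                    left = tokens[i - 1] if i > 0 else '<S>'
                    right = tokens[i + 1] if i + 1 < len(tokens) else '</S>'
                    self.counts[(w, left, right)] += 1  # context bigram
                    self.counts[(w, left, None)] += 1   # left unigram
                    self.counts[(w, None, right)] += 1  # right unigram
                    self.counts[(w, None, None)] += 1   # word frequency

        def predict(self, left, right):
            # Back off: context bigram, then left unigram, then right
            # unigram, then the most frequent confusion word overall.
            for l, r in [(left, right), (left, None), (None, right), (None, None)]:
                scores = {w: self.counts[(w, l, r)] for w in self.confusion}
                if any(scores.values()):
                    return max(scores, key=scores.get)
            return None

For example, a learner built with the confusion set {'their', "they're"} and trained on tokenised corpus text predicts the confusion word for a test context from the counts of its neighbouring words.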

5.3 Results

Our experiments compare the results when the three corpora are trained using the same algorithm. The memory-based learner was tested using the 18 confusion word sets from Golding (1995) on the WSJ section of the Penn Treebank and the Brown Corpus.

Training               Testing     AA    WAA
Gigaword 2 billion     WSJ 100%    93.7  96.1
Web Corpus 2 billion   WSJ 100%    92.7  94.1
Web Corpus 10 billion  WSJ 100%    93.3  95.1
Gigaword 2 billion     Brown 100%  90.7  94.6
Web Corpus 2 billion   Brown 100%  90.8  94.8
Web Corpus 10 billion  Brown 100%  91.8  95.4

Table 4: Memory-based learner results

For the WSJ testing set, the 2 billion word Web Corpus does not achieve the performance of the Gigaword (see Table 4). However, the 10 billion word Web Corpus results approach those of the Gigaword. Training on the Gigaword and testing on WSJ is not considered a true cross-corpus experiment, as the two corpora belong to the same genre of newspaper text. Compared to the Winnow method, the 10 billion word Web Corpus outperforms the cross corpus experiment but not the semi-supervised method.

For the Brown Corpus testing set, the 2 billion word Web Corpus and the 2 billion word Gigaword achieved similar results. The 10 billion word Web Corpus achieved 95.4% WAA, higher than the 94.6% from the 2 billion word Gigaword. This and the above result on the WSJ suggest that the Web Corpus approach is comparable with training on a corpus of printed text such as the Gigaword.

The 91.8% AA of the 10 billion word Web Corpus testing on the Brown Corpus is better than the 89.3% AA achieved by Lapata and Keller (2005) using the Altavista search engine. This suggests that a web collected corpus may be a more accurate method of estimating n-gram frequencies than search engine hit counts.

6 Thesaurus Extraction

Thesaurus extraction is a word similarity task. It is a natural candidate for using web corpora, as most systems extract synonyms of a target word from an unlabelled corpus. Automatic thesaurus extraction is a good alternative to manual construction methods, as such thesauri can be updated more easily and quickly. They do not suffer from the bias, low coverage, and inconsistency that human creators of thesauri introduce.

Thesauri are useful in many NLP and Information Retrieval (IR) applications. Synonyms help expand the coverage of systems by providing alternatives to the inputted search terms. For n-gram estimation using search engine queries, some NLP applications can boost the hit count by offering alternative combinations of terms. This is especially helpful if the initial hit counts are too low to be reliable. In IR applications, synonyms of search terms help identify more relevant documents.

6.1 Method

We use the thesaurus extraction system implemented in Curran (2004). It operates on the distributional hypothesis that similar words appear in similar contexts. This system only extracts one word synonyms of nouns (and not multi-word expressions or synonyms of other parts of speech). The extraction process is divided into two parts.

First, target nouns and their surrounding contexts are encoded in relation pairs. Six different types of relationships are considered:

• Between a noun and a modifying adjective
• Between a noun and a noun modifier
• Between a verb and its subject
• Between a verb and its direct object
• Between a verb and its indirect object
• Between a noun and the head of a modifying prepositional phrase

The nouns (including subjects and objects) are the target headwords and the relationships are represented in context vectors. In the second stage of the extraction process, a comparison is made between the context vectors of headwords in the corpus to determine the most similar terms, as sketched below.
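As an illustration of this second stage, the sketch below ranks headwords by the similarity of their context vectors. Curran (2004) evaluates a range of weight and similarity functions; the unweighted Jaccard measure used here is an illustrative stand-in, not the system's actual configuration:

    def jaccard_similarity(vec1, vec2):
        # vec1 and vec2 map relation-pair contexts, e.g. ('adjective',
        # 'red'), to their counts for two headwords.
        contexts = set(vec1) | set(vec2)
        num = sum(min(vec1.get(c, 0), vec2.get(c, 0)) for c in contexts)
        den = sum(max(vec1.get(c, 0), vec2.get(c, 0)) for c in contexts)
        return num / den if den else 0.0

    def rank_synonyms(target_vec, headword_vecs, n=200):
        # Synonyms of the target are the headwords with the most similar
        # context vectors, ranked by decreasing similarity.
        scores = [(jaccard_similarity(target_vec, vec), word)
                  for word, vec in headword_vecs.items()]
        return [word for _, word in sorted(scores, reverse=True)[:n]]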
6.2 Evaluation

The evaluation of a list of synonyms of a target word is subject to human judgement. We use the evaluation method of Curran (2004), against gold standard thesaurus results. The gold standard list is created by combining the terms found in four thesauri: Macquarie, Moby, Oxford and Roget's.

The inverse rank (InvR) metric allows a comparison to be made between the extracted ranked list of synonyms and the unranked gold standard list. For example, if the extracted terms at ranks 3, 5, and 28 are found in the gold standard list, then InvR = 1/3 + 1/5 + 1/28 ≈ 0.569.
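InvR is straightforward to compute from the ranked list; a minimal implementation reproducing the example above:

    def inverse_rank(extracted, gold_standard):
        # Sum of 1/rank for each extracted term found in the (unranked)
        # gold standard list; ranks start at 1.
        gold = set(gold_standard)
        return sum(1.0 / rank
                   for rank, term in enumerate(extracted, start=1)
                   if term in gold)

With matches at ranks 3, 5, and 28, this returns 1/3 + 1/5 + 1/28 ≈ 0.569, as above.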

6.3 Results

We used the same 300 evaluation headwords as Curran (2004) and extracted the top 200 synonyms for each headword. The synonyms were extracted from two corpora for comparison – a 2 billion word sample of our Web Corpus and the 2 billion words of the Gigaword Corpus. Table 5 shows the average InvR scores over the 300 headwords for the two corpora – one of web text and the other of newspaper text. The InvR values differ by a negligible 0.05 (out of a maximum of 5.92).

Corpus       InvR   InvR Max
Gigaword     1.86   5.92
Web Corpus   1.81   5.92

Table 5: Average InvR for 300 headwords

6.4 Analysis

However, on a per word basis one corpus can significantly outperform the other. Table 6 ranks the 300 headwords by the difference in their InvR scores. While much better results were extracted for words like home from the Gigaword, much better results were extracted for words like chain from the Web Corpus.

      Word        InvR Scores       Diff.
1     picture     3.322 to 0.568    2.754
2     star        2.380 to 0.119    2.261
3     program     3.218 to 1.184    2.034
4     aristocrat  2.056 to 0.031    2.025
5     box         3.194 to 1.265    1.929
6     cent        2.389 to 0.503    1.886
7     home        2.306 to 0.523    1.783
...
296   game        1.097 to 2.799   -1.702
297   bloke       0.425 to 2.445   -2.020
298   point       1.477 to 3.540   -2.063
299   walk        0.774 to 3.184   -2.410
300   chain       0.224 to 3.139   -2.915

Table 6: InvR scores ranked by difference, Gigaword to Web Corpus

Table 7 shows the top 50 synonyms extracted for the headword home from the Gigaword and the Web Corpus. While similar numbers of correct synonyms were extracted from both corpora, the Gigaword matches were higher in the extracted list and received a much higher InvR score. In the list extracted from the Web Corpus, web-related collocations such as home page and search home appear.

Gigaword (24 matches out of 200)
house apartment building run office resident residence headquarters victory native place mansion room trip mile family night hometown town win neighborhood life suburb school restaurant hotel store city street season area road homer day car shop hospital friend game farm facility center north child land weekend community loss return hour ...

Web Corpus (18 matches out of 200)
page loan contact house us owner search finance mortgage office map links building faq equity news center estate privacy community info business car site web improvement extention heating rate directory room apartment family service rental credit shop life city school property place location job online vacation store facility library free ...

Table 7: Synonyms for home
Table 8 shows the top 50 synonyms extracted for the headword chain from both corpora. While there are only a total of 9 matches from the Gigaword Corpus, there are 53 matches from the Web Corpus. A closer examination shows that the synonyms extracted from the Gigaword belong only to one sense of the word chain, as in chain stores. The gold standard list and the Web Corpus results both contain the necklace sense of the word chain. The Gigaword results show a skew towards the business sense of the word chain, while the Web Corpus covers both senses of the word.

Gigaword (9 matches out of 200)
store retailer supermarket restaurant outlet operator shop shelf owner grocery company hotel manufacturer retail franchise clerk maker discount business sale superstore brand clothing food giant shopping firm retailing industry drugstore distributor supplier bar insurer inc. conglomerate network unit apparel boutique mall electronics carrier division brokerage toy producer pharmacy airline inc ...

Web Corpus (53 matches out of 200)
necklace supply bracelet pendant rope belt ring earring gold bead silver pin wire cord reaction clasp jewelry charm frame bangle strap sterling loop timing plate metal collar turn hook arm length string retailer repair strand plug diamond wheel industry tube surface neck brooch store molecule ribbon pump choker shaft body ...

Table 8: Synonyms for chain

While individual words can achieve better results in either the Gigaword or the Web Corpus than in the other, the aggregate results of synonym extraction for the 300 headwords are the same. For this task, the Web Corpus can replace the Gigaword without affecting the overall result. However, as some words perform better under different corpora, an aggregate of the Web Corpus and the Gigaword may produce the best result.

7 Conclusion

In this paper, the accuracy of natural language applications trained on a 10 billion word Web Corpus is compared with other methods using search engine hit counts and corpora of printed text.

In the context-sensitive spelling correction task, a simple memory-based learner trained on our Web Corpus achieved better results than methods based on search engine queries. It also rivals some of the state-of-the-art systems, exceeding the accuracy of the Unpruned Winnow method (the only other true cross-corpus experiment). In the task of thesaurus extraction, the same overall results are obtained extracting from the Web Corpus as from a traditional corpus of printed text.

The Web Corpus contrasts with other NLP approaches that access web data through search engine queries. Although the 10 billion word Web Corpus is smaller than the number of words indexed by search engines, better results have been achieved using the smaller collection. This is due to the more accurate n-gram counts in the downloaded text. Other NLP tasks that require further analysis of the downloaded text, such as PP attachment, may benefit even more from the Web Corpus.

We have demonstrated that carefully collected and filtered web corpora can be as useful as newswire corpora of equivalent sizes. Using the same framework described here, it is possible to collect a much larger corpus of freely available web text than our 10 billion word corpus. As NLP algorithms tend to perform better when more data is available, we expect state-of-the-art results for many tasks will come from exploiting web text.

Acknowledgements

We would like to thank our anonymous reviewers and the Language Technology Research Group at the University of Sydney for their comments. This work has been supported by the Australian Research Council under Discovery Project DP0453131.
References

Michele Banko and Eric Brill. 2001. Scaling to very very large corpora for natural language disambiguation. In Proceedings of the ACL, pages 26–33, Toulouse, France, 9–11 July.

Charles L.A. Clarke, Gordon V. Cormack, M. Laszlo, Thomas R. Lynam, and Egidio Terra. 2002. The impact of corpus size on question answering performance. In Proceedings of the ACM SIGIR, pages 369–370, Tampere, Finland.

James Curran. 2004. From Distributional to Semantic Similarity. PhD thesis, University of Edinburgh, UK.

Andrew R. Golding and Dan Roth. 1999. A winnow-based approach to context-sensitive spelling correction. Machine Learning, 34(1-3):107–130.

Andrew R. Golding. 1995. A Bayesian hybrid method for context-sensitive spelling correction. In Proceedings of the Third Workshop on Very Large Corpora, pages 39–53, Somerset, NJ USA. ACL.

David Graff. 2003. English Gigaword. Technical Report LDC2003T05, Linguistic Data Consortium, Philadelphia, PA USA.

Gregory Grefenstette. 1999. The WWW as a resource for example-based MT tasks. In the ASLIB Translating and the Computer Conference, London, UK, October.

Peter Halacsy, Andras Kornai, Laszlo Nemeth, Andras Rung, Istvan Szakadat, and Viktor Tron. 2004. Creating open language resources for Hungarian. In Proceedings of LREC, Lisbon, Portugal.

M. R. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork. 2000. On near-uniform URL sampling. In Proceedings of the 9th International World Wide Web Conference.

Frank Keller and Mirella Lapata. 2003. Using the web to obtain frequencies for unseen bigrams. Computational Linguistics, 29(3):459–484.

Mirella Lapata and Frank Keller. 2005. Web-based models for natural language processing. ACM Transactions on Speech and Language Processing.

Steve Lawrence and C. Lee Giles. 1999. Accessibility of information on the web. Nature, 400:107–109, 8 July.

Robert MacIntyre. 1995. Sed script to produce Penn Treebank tokenization on arbitrary raw text. From http://www.cis.upenn.edu/~treebank/tokenizer.sed.

Natalia N. Modjeska, Katja Markert, and Malvina Nissim. 2003. Using the web in machine learning for other-anaphora resolution. In Proceedings of EMNLP, pages 176–183, Sapporo, Japan, 11–12 July.

Preslav Nakov and Marti Hearst. 2005. A study of using search engine page hits as a proxy for n-gram frequencies. In Recent Advances in Natural Language Processing.

Adwait Ratnaparkhi. 1998. Maximum Entropy Models for Natural Language Ambiguity Resolution. PhD thesis, University of Pennsylvania, Philadelphia, PA USA.

Deepak Ravichandran, Patrick Pantel, and Eduard Hovy. 2005. Randomized algorithms and NLP: Using locality sensitive hash functions for high speed noun clustering. In Proceedings of the ACL, pages 622–629.

E. L. Terra and Charles L.A. Clarke. 2003. Frequency estimates for statistical word similarity measures. In Proceedings of HLT, Edmonton, Canada, May.

Martin Volk. 2001. Exploiting the WWW as a corpus to resolve PP attachment ambiguities. In Proceedings of Corpus Linguistics 2001, Lancaster, UK, March.

Martin Volk. 2002. Using the web as corpus for linguistic research. In Tähendusepüüja. Catcher of the Meaning. A Festschrift for Professor Haldur Õim.
