4 Processing Text

"I was trying to comprehend the meaning of the words"
Spock, Star Trek: The Final Frontier

4.1 From Words to Terms

After gathering the text we want to search, the next step is to decide if it should be modified or restructured in some way to simplify searching. The types of changes that are made at this stage are called text transformation or, more often, text processing. The goal of text processing is to convert the many forms in which words can occur into more consistent index terms. Index terms are the representation of the content of a document that are used for searching.

The simplest decision about text processing would be to not do it at all. A good example of this is the "find" feature in your favorite word processor. By the time you use the find command, the text you wish to search has already been gathered: it's on the screen. After you type the word you want to find, the word processor scans the document and tries to find the exact sequence of letters that you just typed. This feature is extremely useful, and nearly every text editing program can do this because users demand it.

The trouble is that exact text search is rather restrictive. The most annoying restriction is case-sensitivity: suppose you want to find "computer hardware", and there is a sentence in the document that begins with "Computer hardware." Your search query does not exactly match the text in the sentence, because the first letter of the sentence is capitalized. Fortunately, most word processors have an option for ignoring case during searches. You can think of this as a very rudimentary form of online text processing. Like most text processing techniques, ignoring case increases the probability that you will find a match for your query in the document.

Many search engines do not distinguish between uppercase and lowercase letters. However, they go much further. As we will see in this chapter, search engines can strip punctuation from words to make them easier to find. Words are split apart in a process called tokenization. Some words may be ignored entirely in order to make query processing more effective and efficient; this is called stopping. The system may use stemming to allow similar words (like "run" and "running") to match each other. Some documents, like Web pages, may have formatting changes (like bold or large text), or explicit structure (like titles, chapters, and captions) that can also be used by the system. Web pages also contain links to other Web pages, which can be used to improve document ranking. All of these techniques are discussed in this chapter.

These text processing techniques are fairly simple, even though their effects on search results can be profound. None of these techniques involves the computer doing any kind of complex reasoning or understanding of the text. Search engines work because much of the meaning of text is captured by counts of word occurrences and co-occurrences¹, especially when that data is gathered from the huge text collections available on the Web. Understanding the statistical nature of text is fundamental to understanding retrieval models and ranking algorithms, so we begin this chapter with a discussion of text statistics. More sophisticated techniques for natural language processing that involve syntactic and semantic analysis of text have been studied for decades, including their application to information retrieval, but to date have had little impact on ranking algorithms for search engines.
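To make these steps concrete, here is a minimal Python sketch of such a pipeline, combining case folding, tokenization, stopping, and stemming. It is an illustration only, not the book's implementation: the stopword list is a tiny assumed sample, and the suffix-stripping rules are far cruder than those of a real stemmer such as Porter's.

    import re

    # A tiny illustrative stopword list; real systems use hundreds of words.
    STOPWORDS = {"the", "of", "and", "a", "an", "to", "in", "is", "it"}

    def process_text(text):
        """Turn raw text into index terms using case folding, tokenization,
        stopping, and a deliberately crude suffix-stripping stemmer."""
        # Case folding: uppercase and lowercase letters are not distinguished.
        text = text.lower()
        # Tokenization: splitting on runs of non-alphanumeric characters
        # also strips punctuation.
        tokens = re.split(r"[^a-z0-9]+", text)
        terms = []
        for token in tokens:
            # Stopping: discard empty strings and very common words.
            if not token or token in STOPWORDS:
                continue
            # Crude stemming rules, for illustration only:
            # "running" -> "run", "computers" -> "computer".
            if token.endswith("ning") and len(token) > 5:
                token = token[:-4]
            elif token.endswith("ing") and len(token) > 4:
                token = token[:-3]
            elif token.endswith("s") and len(token) > 3:
                token = token[:-1]
            terms.append(token)
        return terms

    print(process_text("Computer hardware is running."))
    # ['computer', 'hardware', 'run']

The example shows why this helps matching: the query term "running" and the document term "Running." both reduce to the same index term "run".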
These techniques are, however, being used for the task of question answering, which is described in Chapter 11. In addition, techniques involving more complex text processing are being used to identify additional index terms or features for search. Information extraction techniques for identifying people's names, organization names, addresses, and many other special types of features are discussed here, and classification, which can be used to identify semantic categories, is discussed in Chapter 9.

Finally, even though this book focuses on retrieving English documents, information retrieval techniques can be used with text in many different languages. In this chapter, we show how different languages require different types of text representation and processing.

¹ Word co-occurrence measures the number of times groups of words (usually pairs) occur together in documents. A collocation is the name given to a pair, group, or sequence of words that occur together more often than would be expected by chance. The term association measures that are used to find collocations are discussed in Chapter 6.

4.2 Text Statistics

Although language is incredibly rich and varied, it is also very predictable. There are many ways to describe a particular topic or event, but if the words that occur in many descriptions of an event are counted, then some words will occur much more frequently than others. Some of these frequent words, such as "and" or "the", will be common in the description of any event, but others will be characteristic of that particular event. This was observed as early as 1958 by Luhn, when he proposed that the significance of a word depended on its frequency in the document. Statistical models of word occurrences are very important in information retrieval, and are used in many of the core components of search engines, such as the ranking algorithms, query transformation, and indexing techniques. These models will be discussed in later chapters, but we start here with some of the basic models of word occurrence.

One of the most obvious features of text from a statistical point of view is that the distribution of word frequencies is very skewed. There are a few words that have very high frequencies and many words that have low frequencies. In fact, the two most frequent words in English ("the" and "of") account for about 10% of all word occurrences. The most frequent six words account for 20% of occurrences, and the most frequent 50 words are about 40% of all text! On the other hand, about one-half of all words occur only once. This distribution is described by Zipf's law², which states that the frequency of the rth most common word is inversely proportional to r, or, alternatively, that the rank of a word times its frequency (f) is approximately a constant (k):

    r · f = k

We often want to talk about the probability of occurrence of a word, which is just the frequency of the word divided by the total number of word occurrences in the text. In this case, Zipf's law is:

    r · P_r = c

where P_r is the probability of occurrence for the rth ranked word, and c is a constant. For English, c ≈ 0.1. Figure 4.1 shows the graph of Zipf's law with this constant. The graph clearly shows how the frequency of word occurrence falls rapidly after the first few most common words.

² Named after the American linguist George Kingsley Zipf.
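As a quick empirical check of this relationship, the following Python sketch (an illustration, not taken from the book) counts the words in a body of text, ranks them by frequency, and prints r · P_r for the top ranks; for English text the product should stay roughly constant, near 0.1. The corpus file name in the usage comment is a placeholder.

    from collections import Counter
    import re

    def zipf_check(text, top_n=10):
        """Rank words by frequency and report r * Pr for the top ranks.
        Under Zipf's law this product is roughly constant (about 0.1
        for English)."""
        words = re.findall(r"[a-z]+", text.lower())
        total = len(words)
        counts = Counter(words)
        for rank, (word, freq) in enumerate(counts.most_common(top_n), 1):
            pr = freq / total  # probability of occurrence of this word
            print(f"{rank:>4}  {word:<12} {freq:>8}  r*Pr = {rank * pr:.3f}")

    # Usage: pass any reasonably large body of English text, e.g.
    # zipf_check(open("corpus.txt").read())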
Fig. 4.1. Rank vs. probability of occurrence for words, assuming Zipf's law (rank × probability = 0.1).

To see how well Zipf's law predicts word occurrences in actual text collections, we will use the Associated Press collection of news stories from 1989 (called AP89) as an example. This collection was used in TREC evaluations. Table 4.1 shows some statistics for the word occurrences in AP89. The vocabulary size is the number of unique words in the collection. Even in this relatively small collection, the vocabulary size is quite large (nearly 200,000). A large proportion of these words (70,000) occur only once. Words that occur once in a text corpus or book have long been regarded as important in text analysis, and have been given the special name of Hapax Legomena³.

³ The name was created by scholars studying the Bible. Since the 13th century, people have studied the word occurrences in the Bible and, of particular interest, created concordances, which are indexes of where words occur in the text. Concordances are the ancestors of the inverted files that are used in modern search engines. The first concordance was said to have required 500 monks to create.

    Total documents                     84,678
    Total word occurrences          39,749,179
    Vocabulary size                    198,763
    Words occurring > 1000 times         4,169
    Words occurring once                70,064

Table 4.1. Statistics for the AP89 collection.

Table 4.2 shows the fifty most frequent words from the AP89 collection, together with their frequencies, ranks, probabilities of occurrence (converted to percentages of total occurrences), and the r · P_r values. From this table, we can see that Zipf's law is quite accurate, in that the value of r · P_r is approximately constant, and close to 0.1. The biggest variations are for some of the most frequent words. In fact, it is generally observed that Zipf's law is inaccurate for low and high ranks (high-frequency and low-frequency words). Table 4.3 gives some examples for lower-frequency words from AP89.

[Figure: log-log plot of probability of occurrence (P_r) against rank, from 1 to 1,000,000, comparing Zipf's law to the observed AP89 data.]
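Both the collection statistics in Table 4.1 and the data behind a log-log comparison like the one above can be gathered in a single pass over a collection. The following Python sketch shows one way to do it; it uses the same simple-tokenizer assumption as before, so its exact counts would differ from the published AP89 numbers.

    from collections import Counter
    import re

    def word_counts(documents):
        """Count word occurrences over an iterable of document strings."""
        counts = Counter()
        for doc in documents:
            counts.update(re.findall(r"[a-z0-9]+", doc.lower()))
        return counts

    def collection_stats(counts):
        """Table 4.1-style statistics from a word-count table."""
        return {
            "Total word occurrences": sum(counts.values()),
            "Vocabulary size": len(counts),
            "Words occurring > 1000 times":
                sum(1 for c in counts.values() if c > 1000),
            "Words occurring once":
                sum(1 for c in counts.values() if c == 1),
        }

    def rank_probability_pairs(counts, c=0.1):
        """Return (rank, observed Pr, Zipf-predicted c/rank) for every
        word; plotted on log-log axes, these reproduce a comparison of
        Zipf's law against the real collection."""
        total = sum(counts.values())
        return [(rank, freq / total, c / rank)
                for rank, (word, freq) in enumerate(counts.most_common(), 1)]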