A PROXIMITY-BASED METHOD FOR APPROXIMATE PHRASE SEARCHING: MOVIE SCRIPTS AND SONG LYRICS

by

Kathryn Patterson

Submitted in partial fulfillment of the requirements for the degree of Master of Computer Science

at

Dalhousie University
Halifax, Nova Scotia
July 2008

© Copyright by Kathryn Patterson, 2008


Table of Contents

List of Tables

List of Figures

Abstract

Acknowledgements

Chapter 1 Introduction

Chapter 2 Background
  2.1 Exact Phrase Matching
  2.2 Approximate Phrase Matching
  2.3 Keyword Phrase Matching Algorithms
    2.3.1 Vector Space Model
    2.3.2 Boolean Retrieval
    2.3.3 Search Engines
  2.4 Proximity-based Search Algorithms
  2.5 Summary

Chapter 3 Methodology
  3.1 Datasets
  3.2 Queries
  3.3 Exact Phrase Matching
  3.4 Proximity-based Phrase Searching
    3.4.1 Term Counting
    3.4.2 Term Adjacencies
    3.4.3 Maximum Windows
    3.4.4 Adjacencies in Maximum Windows
    3.4.5 Proximity-based Methods
  3.5 Experimental Procedure
    3.5.1 Exact Phrase
    3.5.2 Vector Space Model
    3.5.3 Proximity Search
    3.5.4 Search Engines
    3.5.5 Result Sets
    3.5.6 Experimental Conditions
  3.6 Summary

Chapter 4 Results
  4.1 Exact Phrase Queries
  4.2 Proximity-based Phrase-Matching Methods Compared to the Vector Space Model and Search Engines Results
  4.3 Elements of PBP Explored

Chapter 5 Discussion
  5.1 Exact Phrases
  5.2 Vector Space Model
  5.3 Search Engines
  5.4 Proximity-based Algorithm
  5.5 Proximity-based Algorithm Factors
  5.6 Users
  5.7 Summary

Chapter 6 Conclusion

Bibliography

Appendix A Queries

List of Tables

Table 2.1 Advanced search methods in search engines Google, Yahoo and Ask

Table 3.1 The percentage of queries which contained no gap, a small gap (...), a large gap (... ...) or both
Table 3.2 Experimental Conditions

Table 4.1 Reasons for and percentage of failed exact phrase matching queries
Table 4.2 Number of documents as highest ranked documents in movie script collection
Table 4.3 Number of documents with highest ranked documents in lyrics collection
Table 4.4 Reasons for and numbers of failed queries using PBP-reg
Table 4.5 Average ties per query of the PBP elements and PBP-reg on the movie script corpus (best result shown in bold)
Table 4.6 Performance of the PBP elements and PBP-reg on the movie script collection (best results shown in bold)
Table 4.7 Average ties per query of the PBP elements and PBP-reg on the song lyrics collection (best result shown in bold)
Table 4.8 Performance of the PBP elements and PBP-reg on the song lyrics collection (best results shown in bold)

Table A.1 Movie Script Collection Queries
Table A.2 Lyrics Collection Queries

List of Figures

Figure 2.1 Knuth-Morris-Pratt pattern matching example

Figure 3.1 (a) IMSDB genres and scripts, (b) LyricsDownload.com lyrics
Figure 3.2 Maximum window size calculation for here's looking at you kid
Figure 3.3 Maximum window size calculation for My name is ... you kill my father, prepare to die

Figure 4.1 Accuracy on movie script collection
Figure 4.2 Accuracy on song lyric collection
Figure 4.3 Comparison of the "top 10" accuracy in each collection
Figure 4.4 Percentage of results correctly found within the top 100 lost when only the top 50 results are considered
Figure 4.5 Percentage of results correctly found within the top 50 lost when only the top 20 results are considered
Figure 4.6 Percentage of results correctly found within the top 20 lost when only the top 10 results are considered
Figure 4.7 Linear trend of the performance by PBP-reg and Google-all as the number of terms in the query decreases on the movie script collection
Figure 4.8 Linear trend of the performance by PBP-reg and Google-all as the number of rare terms in the query decreases on the movie script collection
Figure 4.9 Linear trend of the performance by PBP-reg and Google-all as the number of terms in the query decreases on the lyrics collection
Figure 4.10 Linear trend of the performance by PBP-reg and Google-all as the number of rare terms in the query decreases on the lyrics collection
Figure 4.11 Percentage of query terms being rare terms for each query, independently sorted by percentage; 91 lyrics queries and 85 movie script queries
Figure 4.12 Example Document Ranking - the correct document is shown in italics
Figure 4.13 A comparison of the accuracy of each factor in ranking the correct document within the top number of documents specified on the movie script corpus
Figure 4.14 A comparison of the accuracy of each factor in ranking the correct document within the top number of documents specified on the lyrics corpus
Figure 4.15 Rank and tie results of the PBP elements and PBP-reg on the song lyrics collection and movie script collection

Abstract

Search engines provide an effective means of retrieving a document when the query contains uncommon terms or the query is an exact phrase. However, phrase queries usually contain common terms, including determiners, and users may not remember phrases exactly. Search engines discard common terms or assign them little importance, which may lead to poor retrieval results. In this work, we examine the use of proximity-based phrase searching to search for quotes from song lyrics and movie scripts. We compare the results of our new method to popular search engines, Google.ca, Yahoo.com and Ask.com, as well as the Vector Space Model. Our proximity-based method demonstrates an average accuracy of 80.5%. An improvement of approximately 25% on search engine results shows that an additional method to complement the common search engine methods could be beneficial. This method may also prove useful in a search system for offline corpora containing similar document types.

Acknowledgements

Thanks to my supervisor, Dr. Carolyn Watters, for guiding me and keeping me on track; to my examining committee for their time and helpful feedback; to Rob for saving the day on short notice; to Vlado and Mike for reducing the pain inflicted on me by LaTeX; to Dave, Sean, Rob, Ian, Mike, Chris, Andrew, and Jon for queries galore; to Oliver for the gift of distributed computing and Scott for fixing the cluster every time it broke; to Chris for saving me from a sinking ship; to Joel and Vanessa for helping me keep my sanity; to Ian for his blind encouragement; and to my family for their love and support. I have been supported financially by the Faculty of Computer Science and the Natural Sciences and Engineering Research Council of Canada.

Chapter 1

Introduction

Searching the web has become synonymous with "Googling" [28]. Google [11] and the other popular search engines, Yahoo [44], MSN [27] and Ask [1], made up 98% of U.S. searches in 2007 [12]. These popular search engines provide a variety of features that allow users to customize their query for their task. For the task of finding web pages on a topic, simple keyword search is available in two basic forms: "any of these terms" and "all of these terms". For the task of searching for the source of a phrase, for example a lyric from a song or a line from a movie, phrase matching is available in search engines in several forms.

If we remember the phrases "credit check" or "check credit" and "bank" in a song, we may be able to locate the complete phrase, "will frequently check credit at (moral) bank (hole in the wall)", the complete lyrics or the name of the song containing those phrases - Fitter Happier by Radiohead. One way to format a query when there is uncertainty about the order of "credit" and "check" is credit AND check AND bank. The result may turn up many financial web pages. At this point you may want to add AND (lyrics OR song). Ads on web pages may still cause too many unrelated pages to appear. If you notice some common terms among the unrelated web pages, such as "loan" and "debt", you could specify that pages with those terms should be ignored. The query then becomes: credit AND check AND bank AND (lyrics OR song) AND NOT loan AND NOT debt.

Queries aimed at finding specific passages of text can easily become complex. As the level of uncertainty increases, the use of the OR operator may increase as well, making the query even more complex. Another form of searching for the source of a phrase is exact phrase matching. These types of search options differ significantly. "Exact phrase" matching requires that all terms be present in the order provided in the query. Using exact phrases can

help create a more specific query. For example, the previously mentioned query could be revised to: ("credit check" OR "check credit") AND bank AND (lyrics OR song) AND NOT loan AND NOT debt. This is likely a better query for correctly retrieving the desired web page, but it is more complex.

"All of these terms" searches require that all of the terms provided be found in the retrieved web page, but order does not matter. "Any of these terms" searching requires that at least one of the terms provided be found in the retrieved web page and order, again, does not matter. "None of these terms" also allows users to eliminate web pages containing specific terms. These four options can also be combined in queries. As a result, specific terms and phrases are used interchangeably in the query. An example query using all four options is:

Using algorithms incorporating links, aspects of web page content, and complex text matching techniques [11], search engines attempt to locate and rank highly those web pages that are most "relevant" to a user's query. Other query features, not considered in this thesis, allow users to specify languages, domains, etc. in order to narrow or broaden their search results.

If a piece of text from a document on the web is known, this text provides an exact phrase that is more likely to retrieve the desired document than is a query using the terms of the phrase as keywords. In either case, the correct document will be retrieved if the correct text is provided. If the terms provided occur frequently, the retrieved document set will be large and the desired page may be low in the ranked results. Consequently, users often attempt to find a specific document by submitting an exact phrase believed to be contained in the text. If the terms and term order of the exact phrase are not correct and this inaccurate phrase is not contained in the document, the page will not be retrieved.

In the case that the order is incorrect, the user can retrieve the document by using the query terms as keywords in an "all of these terms" search. If some of the terms are incorrect, the user can retrieve the document with an "any of these terms" search. While these approaches may result in the desired document being retrieved, the precision of the results may become too low for the document to be ranked within the number of results a user is willing to examine. The query may, of course, be developed further by the user as combinations of keywords and partial phrases, including as phrases the parts they are confident to be part of an exact phrase and leaving the terms in which they are not confident as keywords. However, this trial-and-error form of query development is complex and tedious. The problem of retrieving the source document of a specific piece of text within the top ranked results, when the query phrase is inaccurate, is considerable.

Text-based document retrieval often comes in the form of one of three search tasks: ad-hoc retrieval, topic distillation, and named page finding. In ad-hoc retrieval, the user has an abstract information need [6]; any page containing the answer to the question they are searching for is relevant [6]. In topic distillation, the user is looking for a web page that contains general information on a topic [6], such as a Wikipedia [43] article. In the named page finding task, the query expresses the name of the specific page the user is looking for [8], such as "The Best Page In The Universe" [41].

Finding a document or web page with specific details on a query topic, as in ad-hoc retrieval, is a more difficult task than finding a web page related to a query topic, as in topic distillation. The likelihood of retrieving a web page with the specific details is dependent upon the information provided in the query, the information contained in a desired web page, and the algorithms used by the search engine for matching and ranking web pages with respect to the query. Moreover, these tasks have been shown to perform better with different techniques suited to the tasks [6]. We hypothesize that the task considered in this thesis will also require a different technique to improve document retrieval performance as it is a very different task from the three described above. We further believe that currently available search algorithms are insufficient for the task.

In this thesis, we first examine current search methods suitable for keyword searching and the logic behind these algorithms. These are the most flexible query methods available, as they allow us to perform queries regardless of the structural information known about the desired text. We use search engines to compare keyword search methods to exact phrase matching. In addition, we consider previous work where the search methods allow us to include Boolean properties of query terms, probabilistic properties of query terms, and structural information about query terms. We consider the advantages and disadvantages of more complex queries and what the appropriate level of complexity is for various search tasks.

Second, we provide the details of retrieving a specific document using a piece of its text, both with a search engine and with various algorithms on a static corpus, to determine how these query methods perform. We hypothesize that these methods are insufficient for attaining a satisfactory level of document retrieval accuracy.

Third, we examine new search methods for phrase source searching based on proximity operators and their application to different problems in previous studies presented by other researchers. Proximity operators can be used to place structural requirements when matching query terms to full text. Our comparison of proximity-based searching and the aforementioned query methods indicated that the proximity-based method is more effective for specific phrase source searching tasks. We also examine the impact of factors such as term order, term length, term frequency and assigning equal weight to matched terms.

This thesis presents a proximity-based method of phrase source retrieval that uses approximate phrase matching techniques. Approximate phrase matching is the task of locating a phrase in a passage of text which may or may not be exactly the same as the target phrase. The similarity between the phrases, or the degree to which the phrases are approximately the same, depends on identical terms, terms sharing stems, terms sharing meaning, term order, etc. We propose this method as an additional search method to augment those provided by search engines and digital libraries. This method is further developed to increase performance and broaden its usefulness to the task of retrieving a specific document from a phrase query, whether the exact phrase is known or the phrase contains errors. The objective was to achieve a high level of document retrieval accuracy without raising the level of query complexity undesirably for the user.

To examine the performance of the proximity-based method, we carry out several studies performing phrase queries on two document sets, song lyrics and movie scripts. In these studies we compare the accuracy in ranking the desired document within the top ranked documents between the proximity-based method, the vector space model, and the more popular search engines: Google, Ask, Yahoo, and MSN. In demonstrating the effectiveness of our method, we analyse how well the algorithms incorporating factors of our proximity-based method perform. This analysis shows that counting terms with a proximity factor is necessary to achieve a reasonable retrieval performance. It also shows that counting terms and adjacencies with a proximity factor improves performance, while counting terms over the whole document, counting adjacencies over the whole document, and counting adjacencies with a proximity factor fail to achieve a sufficiently high performance on their own.

As part of the evaluation of the proximity-based method, we perform an algorithmic analysis of the method, compare its complexity to the complexities of the other methods in our studies, and detail the worst-case efficiency impact that incorporating this method into current search engines and digital libraries would have.

The remainder of the thesis is organized as follows. Chapter 2 describes the background of keyword searching and phrase matching as it relates to the currently best-performing and commonly known methods of document retrieval and the tasks for which they are suited. Chapter 3 details the system by which search method accuracy is determined, the complexity of query formulation, the implementation of the proximity-based method, the creation of corpora for comparing search methods, and the metrics by which the accuracy and efficiency of the search methods are compared. The results of the studies performed are detailed in Chapter 4. A discussion of the hypotheses, results, contributions, and future work is provided in Chapter 5.

Chapter 2

Background

While the problem examined in this thesis can be loosely described as approximate phrase matching, the tasks we consider are rather specific and closely related works are uncommon. Approximate phrase matching could be considered to include such problems as finding documents on certain topics, named pages, homepages, or pages that contain the query in the title, by looking for documents containing phrases similar to the phrase provided in the query. Similarity, in this case, would be having common terms, common ordering of terms, common approximate distances between terms, and terms with the same meaning.

Here we consider the specific task where a user provides a phrase query they believe to be correctly or nearly correctly quoted from a static passage of text, such as a movie script or song lyrics. The phrases provided may be incorrect or partially incorrect, and we require a method that will work whether or not the phrases were correctly remembered.

"The diverse nature of both documents and information needs on the Web hints at the need for specialized search systems that cater to specific situations, or flexible systems that can respond appropriately to a variety of situations" [33]. This statement is consistent with the informal observations made in trying to perform the approximate phrase searching task. While the current methods are technically usable for our task, we hypothesize that they require too complex a query and can be outperformed by a method which takes into consideration the nature of the task.

In the following sections we examine current search methods which could, theoretically, be applied to our task. To the best of our knowledge, little prior work has been done to examine this task. Hence, we will first examine the simplest and most intuitive approach to the problem, exact phrase matching, where any mistake will result in a failed retrieval. The imperfections in human memory should make it quite apparent that exact phrase matching is not a sufficiently robust method.


More relaxed query methods, including the vector space model and the proprietary algorithms employed by several search engines, are also available. An important requirement of all of the methods examined in our experiments is that they can be used without complex query formulation and its cognitive burden. In the following sections, we discuss notable search methods common in Information Retrieval.

2.1 Exact Phrase Matching

Exact phrase matching is the process of taking terms provided in a specific order and matching the terms against text such that the terms remain consecutive and in the order provided. Depending on the query system, case may be considered or ignored, certain types of punctuation may be deleted or replaced by whitespace, and numeric characters may be ignored. For example, consider the query "you're writing, it said The Cookbook". After removing non-alphanumeric characters, the query becomes "youre writing it said the cookbook" [38]. This processing is applied to both the query and the passage, as shown below.

Original Passage: Cleveland: You're both cooks? Anna: Who, him? He can't cook; he's banned from the kitchen. Cleveland: I don't understand something. My ladder accidentally bumped into your desk. I ended up seeing the title of what it is you're writing, it said "The Cookbook". Vick Ran: I know, it's a bad title, right? It's actually just my thoughts on all our cultural problems and thoughts on leaders and stuff.

Processed Passage: cleveland youre both cooks anna who him he cant cook hes banned from the kitchen cleveland i dont understand something my ladder accidentally bumped into your desk i ended up seeing the title of what it is youre writing it said the cookbook vick ran i know its a bad title right its actually just my thoughts on all our cultural problems and thoughts on leaders and stuff

The query matches the terms "youre writing it said the cookbook" in the above passage. This process is a simple algorithm which involves scanning the text for the first query term and matching the following terms until a mismatch occurs, then continuing the scan. The brute force method of pattern matching has a complexity of O(mn), where m is the number of terms in the phrase to be matched and n is the number of terms in the passage. More efficient than the brute force method, the Knuth-Morris-Pratt algorithm can achieve such a scan in O(m + n) time [26]. The algorithm enables the skipping of terms in the passage based on the existence of duplicates among the query terms. An array is constructed which determines how many terms to shift ahead when a term fails to match. An example of the phrase "b a b a b c a" being matched against the passage "c b b a b a b b a b a b a b a c a b a" is found in Figure 2.1 [26].

Figure 2.1: Knuth-Morris-Pratt pattern matching example
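To make the scan concrete, the following is a minimal Python sketch (not from the thesis) of the Knuth-Morris-Pratt scan described above, operating over word tokens rather than characters; the function name and list representation are illustrative.

```python
def kmp_find(pattern, text):
    """Find the first occurrence of `pattern` (a list of terms) in
    `text` (a list of terms); return its start index or -1.
    Runs in O(m + n), as described above."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return 0 if m == 0 else -1
    # shift[i]: length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix -- the shift-ahead array described above.
    shift = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = shift[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        shift[i] = k
    # Scan the passage without ever backtracking in `text`.
    k = 0
    for i in range(n):
        while k > 0 and text[i] != pattern[k]:
            k = shift[k - 1]
        if text[i] == pattern[k]:
            k += 1
        if k == m:
            return i - m + 1
    return -1

passage = ("cleveland youre both cooks anna who him he cant cook hes banned "
           "from the kitchen cleveland i dont understand something my ladder "
           "accidentally bumped into your desk i ended up seeing the title "
           "of what it is youre writing it said the cookbook vick ran").split()
print(kmp_find("youre writing it said the cookbook".split(), passage))  # 37
```

Run on the processed passage above, the query phrase is found starting at word position 37.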

Exact phrase matching has a linear worst-case complexity. This method is therefore efficient for phrases which can be located with an exact phrase query. For tasks where a user has only some of the text from a document in the form of a phrase and wishes to find more of the text, or to determine which document the phrase originated from, exact phrase matching will find the correct document. If the phrase is common, the number of retrieved documents may be large, but the correct document will nevertheless be among this set.

2.2 Approximate Phrase Matching

Having a written record of the phrase to be used in an exact phrase query guarantees accuracy. The problem addressed in this thesis is the difficulty that arises when users attempt to recall a phrase from memory or perhaps transcribe text from some audio source. The phrase may be recalled incorrectly or pieces of the phrase may be missing. There must, therefore, be some flexibility in how the phrase is matched. There are several possible ways in which a phrase could be incorrect and thus would not be matched, including:

• Incorrect word(s) i.e. "glass" instead of "cup"

• Incorrect order i.e. "found the broken table" instead of "found the table broken"

• Missing or additional words i.e. "I'll meet you at the Computer Science building" instead of "I'll meet you at the Second Cup in the Computer Science building"

In the case of known phrase fragmentation, the query can be expressed as a combination of several partial phrases. For example, the complete phrase "I'd like mention some obscure singer or band, and you'd know such a lot about it. But not right at the moment, just a few minutes later. After you had the chance to look it up on the net, maybe?" [15] may be recalled by the segments "obscure singer or band", "after you", and "look it up on". While the partial phrases are more relaxed than a more complete exact phrase, they may still be uncommon enough to yield the desired document in the retrieved document set. In this case, we can use a Boolean search method, where the operands are phrases and the operators are AND, to perform this query. For example, ("I'd" OR "I would" OR "I'll" OR "I will") AND "obscure singer or band" AND "after you" AND "look it up" AND ("on the internet" OR "on the web" OR "online").

If there remain inaccuracies in the partial phrases, or we have some combination of this problem and other inaccuracies, the problem of searching for the source document of a phrase is no longer simple. A user could iteratively reformulate their query to contain modified or reordered terms, or the phrases could be broken down into smaller partial phrases. The query, however, becomes more complex and problematic from the user perspective. Ideally, an algorithm would locate phrases as similar as possible to a provided phrase or partial phrases, while allowing for flexibility in ranking documents. Our objective is to achieve an approximate phrase matching that improves the accuracy of query results when the query phrase is not certain.

2.3 Keyword Phrase Matching Algorithms

The simplest search method alternative to exact phrase searching is a keyword search. No knowledge of the phrase structure is required, only which terms are to be searched for, and keyword search methods can succeed even if some of the terms are incorrect. Methods based on the Vector Space Model rank documents according to the frequency of matched terms, and so do not require that all terms of the query be found in a given document.

2.3.1 Vector Space Model

The Vector Space Model is a method of expressing the relevancy of a document to a query based on the similarity of terms [36]. The similarity between the terms of the document and the terms of the query is determined using the frequency of the occurrence of the terms in the document and the frequency of terms in the document collection. The higher the frequency of the occurrence of query terms in a specific document, the higher the document will be ranked.

This method has the advantage of not depending on the order of keywords, as the query terms are treated simply as a bag of words. If a user is searching for a document on a particular topic, then the name of the topic is likely to appear many times in a useful result and reasonably sparsely in the entire document set. This method becomes disadvantageous, however, if the keywords used in the query are high frequency terms, such as determiners. These terms will be common over many documents and so the desired document is less likely to be ranked highly. The similarity function is given by Equation 2.1.

$$\mathrm{sim}(d_i, q) = \frac{d_i \cdot q}{|d_i| \times |q|} = \frac{\sum_{j=1}^{m} d_{ij}\, q_j}{\sqrt{\sum_{j=1}^{m} d_{ij}^2}\, \sqrt{\sum_{j=1}^{m} q_j^2}} \qquad (2.1)$$

where $d_i$ is the term vector for the ith document, $q$ is the term vector for the query, and $d_i \cdot q$ is the dot product of these term vectors [3]. Some applications of the vector space model involve removing stop words [3], which are common terms such as determiners. In this thesis, stop words are not removed since we are considering phrases and we wish to match the common terms from queries to the terms in the documents. The performance of the vector space model varies depending on the problem type. We use this prominent algorithm as a benchmark to examine performance at the task of phrase source searching.
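As an illustration, the following sketch computes Equation 2.1 over raw term-frequency vectors, with stop words retained as described above. It is a minimal example, not the thesis implementation; a fuller system would also fold in collection-level term frequencies, which are omitted here.

```python
import math
from collections import Counter

def cosine_sim(doc_terms, query_terms):
    """Equation 2.1: cosine of the angle between the document and
    query term vectors, built here from raw term counts."""
    d, q = Counter(doc_terms), Counter(query_terms)
    dot = sum(d[t] * q[t] for t in q)                       # d_i . q
    norm = (math.sqrt(sum(v * v for v in d.values()))
            * math.sqrt(sum(v * v for v in q.values())))    # |d_i| x |q|
    return dot / norm if norm else 0.0

print(cosine_sim("heres looking at you kid".split(),
                 "looking at you kid".split()))  # ~0.894
```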

2.3.2 Boolean Retrieval

The Boolean Retrieval model allows a user some control in query creation. With this model, the user can state that certain terms must be present in the document, that some must not exist in the document, or that any one of a set of terms may be present. Queries are expressed with the Boolean operators AND, OR and NOT. An example of a Boolean query is: cheap AND (((rates OR prices) AND hotel) OR accommodations), which means that the document must match one of the following keyword queries: cheap hotel rates, cheap hotel prices, or cheap accommodations. This model allows for very specific queries to be made and for the system to quickly evaluate the query. However, the outcome of the query is a binary evaluation of the relevance of documents. Documents which are deemed relevant will not be ranked in any way, as there is no basis for comparison among "relevant" documents.
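The binary nature of the model is easy to see in a small sketch. Here the example query above is pre-parsed into nested tuples; a real system would parse the operator syntax and evaluate against an inverted index, but the yes/no outcome is the same.

```python
def eval_bool(node, terms):
    """Evaluate a pre-parsed Boolean query against a document's term set.
    Returns only True or False: relevant documents cannot be ranked."""
    if isinstance(node, str):                 # a bare keyword
        return node in terms
    op, *args = node
    if op == "AND":
        return all(eval_bool(a, terms) for a in args)
    if op == "OR":
        return any(eval_bool(a, terms) for a in args)
    if op == "NOT":
        return not eval_bool(args[0], terms)
    raise ValueError(op)

# cheap AND (((rates OR prices) AND hotel) OR accommodations)
query = ("AND", "cheap",
         ("OR", ("AND", ("OR", "rates", "prices"), "hotel"),
                "accommodations"))
print(eval_bool(query, {"cheap", "hotel", "prices"}))   # True
print(eval_bool(query, {"cheap", "rates"}))             # False
```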

Boolean queries require careful thought in their formulation, as shown in the previous example. One of our objectives is to find or create a method which maintains simplicity. This objective is motivated by the recurring observation that users keep their queries short and simple [18] [39]. In a study by [21] using two search engines, WebCrawler and Magellan, only 12% of the 2,000 queries examined contained Boolean operators. Advanced search methods, when available, are used sparingly and sometimes incorrectly [19]. "If we believe that humans seek the path of least cognitive resistance, then reducing the cognitive load for users should be one of the primary goals of IR system design" [45]. Even supposing Boolean logic were properly understood by more people and encouraged for use, it would still be difficult to obtain correct results when the queries are uncertain or contain errors, because the requirements for matching a Boolean query are so strict.

2.3.3 Search Engines

"[Search] engines are indispensable for searching the Web, they employ a variety of relatively advanced IR techniques, and there are some peculiar aspects of search engines that make searching the Web different than more conventional information retrieval" [13]. With the massive amounts of documents and diversity of documents on the web, the document which is the source of a phrase a user is searching for likely exists on the web. The user's success in finding the document on the web will depend on many of the same conditions of success in finding the document within a stable corpus. The success in finding the source of a phrase on the web will also largely be affected by the number of documents on the web. As the number of documents containing the desired text increases, the number of these that are irrelevant will also increase. The web "will grow without any upper bound" [4] and the currently indexable web is more than 11.5 billion pages [14], so the difficulty of this problem will only increase with time. While searching the web and improving web searching has appeal, in this thesis we concentrate on finding source documents in specific collections. In general, the complexity of queries used in search engines has been decreasing and the voluntary usage of query operators remains quite low [17]. In addition, 81% of users did not look at results past the first page in [17] and 68% of search requests 13 were for the first page of results in [39]. Whether this is a result of search engine performance or user behavior, maintaining the level of simplicity which users are familiar with is important when providing them with a new method of performing certain types of queries. Increasing the complexity of queries also runs the risk of increasing the likelihood of errors [16]. The advantages of searching the entire web as opposed to specific corpora of documents, i.e. collections of lyrics, include:

• A higher probability of retrieving the document, as a result of including many more relevant collections than the specific ones included in your chosen corpora. While the collections you include may be the correct type of documents, the correct document may not be contained in the collections, or the particular copy that they contain may have inaccuracies which prevent it from being retrieved.

• The correct document may not exist on the web, while there may be a document which does not have the complete original text but contains the desired quote and names the desired document.

• The user may have misheard or misread the phrase and would only find the correct text if the phrase was similarly mistyped somewhere on the web. For example, collections of misheard lyrics already exist on the web, such as "The Archive of Misheard Lyrics", KissThisGuy.com [40].

Developing an algorithm for phrase source finding tasks which performs well on a static collection of documents is useful. The next step would be to expand a good algorithm to the entire web or to multiple collections on the web. We examine the behaviour of search engines to determine what problems they suffer from when phrase matching, to inform our design of a new algorithm for phrase source searching.

Applying an algorithm directly to all documents in a large collection is computationally inefficient and generally impossible on the web. Some sort of pre-processing must be done, such as indexing, to make the phrase matching task feasible. Search engines retrieve results seemingly instantaneously, so this may enable an algorithm which performs well on a static document set to be expanded to the web.

The proprietary nature of many of the algorithms employed in search engines prevents information about these algorithms from being publicized [33]. This puts researchers at a disadvantage in terms of understanding and learning from the implementation of those algorithms. However, the "approximate workings" of the algorithms can be understood by reading the search tutorials provided by the search engines [13]. Google provides some details on how their search engine works [5], including details of their PageRank algorithm.

The limited non-proprietary knowledge indicates that in order to utilize search engines to apply a phrase source searching algorithm, the search engines would have to pre-process the documents. A relaxed query could be provided to reduce the documents potentially relevant to the original query down to those retrieved by the search engine. Provided that a search engine retrieves the correct document anywhere within the results it returns, an algorithm could then be applied to the retrieved results to re-rank the documents. Any collection containing plain text documents, HTML documents, PDF documents, etc. can be queried with a search engine simply by providing the data on the web and allowing the collection to be crawled. In this way, the search capabilities for many document collections, including static ones, are not limited to the search capabilities of a digital library, customized algorithms, or publicly available information retrieval algorithms. Search engines are very accessible for phrase matching on documents containing text.

We will consider the search capabilities of the following search engines: Google, Yahoo, and Ask. These three search engines, of the four previously mentioned, made up 90.34% of U.S. searches in 2007 [12], and so are an excellent representation of search engines. Search engines perform two types of bag-of-words [3] searches, "any of these terms" and "all of these terms", but are also able to perform exact-phrase searches, typically through the placement of quotation marks around the pieces of text that you wish to represent as exact phrases. The search methods provided are similar to applying an AND operator to all query terms, an OR operator to all query terms, and exact phrases. The descriptions provided by each search engine for these common methods are shown in Table 2.1. The actual algorithms used by the search engines are more involved than Boolean queries and exact-phrase matching. The additional information used in the algorithms allows the documents to be ranked and improves retrieval performance.

Table 2.1: Advanced search methods in search engines Google, Yahoo and Ask.

Search Engine | AND                   | OR                              | NOT                 | Exact Phrase
Google        | with all of the words | with at least one of the words  | without the words   | with the exact phrase
Yahoo         | all of these words    | any of these words              | none of these words | the exact phrase
Ask           | all of the words      | at least one of the words       | none of the words   | the exact phrase

"The findings of traditional IR are mostly based on experiments conducted with small, stable document collections and sets of relatively descriptive and specific queries" [45]. Similarly, our study focuses on stable document collections, which allows us to apply customized algorithms. In order to compare the performance of our algorithms to search engine perfor­ mance, we limit our web searches to specific domains or web directories. The search engines tested The use of "site: domain, com" in the query, where domain.com is any domain or web directory you wish to search in allows us to create this limitation. In these experiments, the domains specified contain either song lyrics as documents or movie scripts as documents. In addition to ensuring that we are searching lyrics or scripts, we also eliminate unrelated documents and spam documents by specifying a domain.

2.4 Proximity-based Search Algorithms

One method used for phrase matching, where the sections of a phrase are known but the joining parts are not, is the proximity search [9] using the NEAR operator [20]. Proximity searching uses a maximum distance and, optionally, an ordering [31]. Proximity queries are written as expressions with two terms as operands and a proximity operator. For example, trailer NEAR3 boys means that there may be no more than three words between trailer and boys and that these terms can be in any order. In order to specify that trailer must come before boys, the operator would become trailer WITHIN3 boys.

Consider a text that contains the phrase "Get your head stuck in your niece's dollhouse because you wanted to see what it was like to be a giant and it's Uncle Conan, you went to Harvard!?". If the user remembers "head stuck", and dollhouse and "you went to Harvard", the correct query would be: "head stuck" NEAR5 dollhouse WITHIN20 "you went to Harvard".

Sadakane et al. [35] examined the use of a k-word proximity search in which k words were provided by the user. They searched for the smallest interval containing all k words in an arbitrary order. They looked at two methods to determine the efficiency of finding this minimum interval: a plane sweep algorithm and a divide-and-conquer approach. They found that both algorithms were efficient enough for practical use. This method would fail, however, if any of the words were incorrect. A user may also not want the document where the terms are simply closest together, since they may be querying for phrases which are not close together in the desired document. Terms within a phrase should be close together, but not necessarily terms in different phrases.

Keen [20] examined five methods of proximity searching. He found that a simple method involving setting a range of 5 words resulted in a Recall of 59% and a Precision of 43%. As he increased the complexity of the proximity methods, the Recall increased and the Precision decreased. The results for all five searches are relatively high compared to those of vector space model searches, as in [22]. This method suffers from problems similar to those of the method used by Sadakane et al. [35]: a range of 5 words may be too small, and if any of the words are incorrect, then the method fails.

Rasolofo et al. examined the use of a term proximity measurement with the Okapi probabilistic system [34]. They applied proximity scoring to the top document returned through a keyword search to enhance scoring. They reported that the method was stable and that their method generally improved results, raising the precision of 71 queries, worsening 43 and leaving 11 unchanged.

Mishne and Rijke used phrase and proximity terms for the task of topic distillation [25]. Sets of consecutive terms in the query are treated as phrases and searched for in the documents. They found that "even on top of a good basic ranking scheme for web retrieval, phrases and proximity terms may bring improvements in retrieval effectiveness". While this is a very different task from the task examined in this thesis, the idea that consecutive terms provided in a query may appear consecutively within a certain distance in a relevant document is similar to part of the reasoning used in creating our proximity-based method.
Pickens and Croft examined the use of Boolean, adjacency, and proximity evaluations on phrase terms, independently of any specific retrieval approach, to determine their general impacts on the number of relevant documents that could be retrieved [30]. They found that adjacency is a positive indicator of relevance more often than Boolean. However, approximately a third of the time adjacency had a negative impact. The results of the proximity ranges tested varied. Each of these factors is examined in this thesis.

Adding conditions such as proximity will improve results only if the user has a reasonably accurate recollection of the maximum distance between sentence segments. Our interest is in the use of proximity-based phrase search methods in which the user does not need to make these decisions. A sketch of the basic proximity operators follows.
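The sketch below implements NEAR and WITHIN with the semantics described at the start of this section: at most k terms between the two operands, with WITHIN additionally requiring order. The function names and position-list representation are illustrative, not drawn from any particular system.

```python
def positions(term, tokens):
    """All positions at which `term` occurs in the token list."""
    return [i for i, t in enumerate(tokens) if t == term]

def near(a, b, k, tokens):
    """a NEARk b: no more than k terms between a and b, either order."""
    pa, pb = positions(a, tokens), positions(b, tokens)
    return any(abs(i - j) - 1 <= k for i in pa for j in pb)

def within(a, b, k, tokens):
    """a WITHINk b: like NEAR, but a must precede b."""
    pa, pb = positions(a, tokens), positions(b, tokens)
    return any(0 < j - i <= k + 1 for i in pa for j in pb)

text = "the trailer for the park boys show".split()
print(near("trailer", "boys", 3, text))    # True: 3 terms between them
print(within("boys", "trailer", 3, text))  # False: wrong order
```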

2.5 Summary

While exact phrase matching is the simplest and most efficient method for the task of locating the source document of a phrase, it can easily fail with any inaccuracies, and we examine how often these inaccuracies occur. Various methods can be used to perform an approximate phrase matching, including various keyword search methods and algorithms (e.g. the vector space model, Boolean searches, and the "any" and "all" search engine methods). These keyword methods have proved useful for other tasks and some of them will provide a benchmark for comparison with the proximity-based method we have developed.

Proximity-based methods have been used effectively on keyword searching tasks, including topic distillation where terms were actually treated as sub-phrases, either individually or in combination with another retrieval method to improve the already reasonable performance. To the best of our knowledge, no work has been done involving the use of proximity-based algorithms for the phrase searching task examined in this thesis.

Chapter 3

Methodology

The methodology used in this study consisted of building two document sets and two sets of manually generated user queries, supplying the query terms in site-specific queries to the search engines, and evaluating the resulting document rankings. We then compared the effectiveness of the vector space algorithm and our new algorithms.

3.1 Datasets

All HTML tags, images, etc., were removed from the web pages. All words were converted to lowercase and punctuation characters were removed.

For the first dataset, a corpus of 712 complete movie scripts was obtained from the Internet Movie Script Database [42]. An example of the documents is shown in Figure 3.1 (a). The movie scripts had an average length of 23,766 words.

For the second dataset, a corpus of 685,079 song lyric documents was obtained from LyricsDownload.com [24]. An example of the documents is shown in Figure 3.1 (b). This dataset was chosen as it contained several orders of magnitude more documents than the movie script collection, and the documents were several orders of magnitude smaller than the movie scripts. The lyrics had an average length of 216 words.

The proximity search algorithms developed for this study were based solely on words and the relative locations of the words within the documents.

3.2 Queries

For the movie scripts document set, a collection of 91 queries was obtained from six university students. The average number of terms per query was 13.98. This dataset will be referred to as IMSDB. For the song lyrics document set, a collection of 85 queries was obtained from 8 university students.


Figure 3.1: (a) IMSDB genres and scripts, (b) LyricsDownload.com lyrics

The average number of terms per query was 7.34. This dataset will be referred to as LYRICS. For each collection of queries, the students were asked to write down quotes in the form of a complete or incomplete sentence, indicating "small" and "large" gaps between sentence segments. The students were not told what the interpreted maximum size of the gaps would be.

Queries are entered in the form $t_1\, o_1\, t_2\, o_2 \ldots o_{n-1}\, t_n$, where $t_i$ is a term or word and $o_i$ is an operator between the words $t_i$ and $t_{i+1}$. Each term is a word and each operator is whitespace, an ellipsis or a double-ellipsis. Whitespace operators imply that the adjacent terms ought to be consecutive; however, our constraints will be more relaxed. A single ellipsis implies that there is a small gap between the adjacent terms and a double-ellipsis implies that there is a large gap between the adjacent terms. Three sample queries are as follows:

• here's looking at you kid

• Mos Eisley ... a wretched hive of scum and villainy

• My name is ... you kill my father, prepare to die

Table 3.1: The percentage of queries which contained no gap, a small gap (...), a large gap (... ...) or both

              | None        | Small       | Large     | Both
Movie Scripts | 59 (64.83%) | 30 (32.97%) | 1 (1.10%) | 1 (1.10%)
Lyrics        | 70 (76.12%) | 10 (11.76%) | 4 (4.71%) | 1 (1.18%)

3.3 Exact Phrase Matching

Exact phrase matching can be applied to the queries provided by quoting subsections of text which do not contain ellipses and then removing the ellipses from this result. For example, given the query:

Mos Eisley a wretched hive of scum and villainy we add quotation marks as follows "Mos Eisley" "a wretched hive of scum and villainy". 22

and remove any ellipses "Mos Eisley" "a wretched hive of scum and villainy".

Queries of this form can be submitted to search engines. Applying this method to a collection of queries can be used to determine the frequency at which exact phrase matching fails in that collection.
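A minimal sketch of this transformation (assuming gap markers are whitespace-delimited runs of ellipses; the function name is illustrative):

```python
import re

def to_exact_phrase_query(query):
    """Quote the maximal subsections of a query that contain no gap
    markers, then drop the markers, as described above. Both '...'
    and '... ...' are treated as gap markers."""
    segments = [s.strip() for s in re.split(r"(?:\.\.\.\s*)+", query)]
    return " ".join('"%s"' % s for s in segments if s)

print(to_exact_phrase_query(
    "Mos Eisley ... a wretched hive of scum and villainy"))
# "Mos Eisley" "a wretched hive of scum and villainy"
```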

3.4 Proximity-based Phrase Searching

In the proximity-based algorithms which were developed in Equations 3.7-3.9, several factors were considered when applying scores to documents for ranking. The equations were developed based on hypotheses regarding term proximity and which additional factors should be considered. The hypotheses were tested and modified iteratively using a few queries and a few documents at a time. Each of these factors was then tested individually and in smaller combinations, where appropriate, in Equations 3.1-3.6 to examine their effects. Each equation is denoted beginning with PBP, for "proximity-based phrase". Although not all of our methods use a proximity factor, for simplicity this prefix was used consistently.

3.4.1 Term Counting

When deciding the relevance of a document to a query, the number of matched query terms should be maximized. In addition to the presence of a query term, the frequency of that term in the document or in an entire corpus may be considered in deciding its importance. The Vector Space Model incorporates a weighting for terms. In this section we weigh all terms equally. While the number of matched terms may not be the most important factor, this simple metric is used to determine a value for term weighting. If a term appears in the query multiple times, then that term may be counted as many times as it appears in the query. We assume that the term is intended to be found multiple times, whereas some bag-of-words models may ignore repeat occurrences.

Given the text of a document, whether it is the entire document or a section of the document, a binary vector $f$ of size $|t|$ is created which indicates whether or not each term $t_i$ was found in the text. If $t_i$ is found in the text, $f_i = 1$, otherwise

$f_i = 0$. Each term $t_i$ from the phrase may only be counted once, but since terms are not necessarily unique, some words may be counted more than once. In order to assign scores to documents based only on the number of terms matched over the entire document, we apply the following formula to the entire document:

$$\mathrm{PBP}\text{-}c = |f| \qquad (3.1)$$

where $f$ is a binary vector representing which query terms were found in the entire document.
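To make this concrete, here is a minimal sketch (not the thesis implementation) of Equation 3.1. The binary vector $f$ is built exactly as described, so a repeated query term contributes once per occurrence in the query.

```python
def pbp_c(query_terms, doc_terms):
    """Equation 3.1: |f|, where f_i = 1 if query term t_i occurs
    anywhere in the considered text and 0 otherwise."""
    vocab = set(doc_terms)
    f = [1 if t in vocab else 0 for t in query_terms]
    return sum(f)

print(pbp_c("you kill my father".split(),
            "my name is inigo montoya you killed my father".split()))  # 3
```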

3.4.2 Term Adjacencies

When matching a phrase query to a passage of text, we expect that terms that are adjacent in the query are likely to appear adjacent in the desired text. We expect that there may be inaccuracies such as missing terms and reordered terms in the query. However, if the considered passage has few or none of the query terms appearing adjacently, it is less likely to be the correct passage. Phrases with the same meaning may have sections of the phrase reordered and modified terms, but may have many similarities in their term adjacencies. The following example shows two phrases which share many common terms and adjacencies (underlined in the original):

"When I was in Hawaii, I saw dolphins for the first time. In the evening, there was a baby dolphin."

"For the first time in my life, I saw dolphins when I was in Hawaii. I saw a baby dolphin in the evening."

In this section, we consider adjacencies between any query term and any other query term in the passage of text. We could limit ourselves to strictly considering adjacencies between terms which were adjacent in the query, but our approach is simpler and more computationally efficient. We can adjust our method to incorporate this limitation if it is found that this condition is too relaxed.

Using only adjacencies, we could match the two phrases in the above example closely. In some cases, such as queries where there are few or no adjacent terms provided in the query, using only adjacencies may not be sufficient. For example:

Desired passage: "I never realize how much I like being home unless I've been somewhere really different for a while." [42]

Provided passage: "I only realize that I love being here when I have been someplace else for quite some time"

In this case, we would need to at least consider term counting. There also may be queries where terms are provided adjacently, but inaccuracies in the query cause the number of adjacencies in the desired passage of text to be low or non-existent. Again, we would require term counting.

In order to determine how well adjacency counting performs on its own, we test it separately from term counting. We also examine a combination of these factors. Together, they may cover enough situations to perform well. However, one or the other may prove to be significantly more useful, such that it may be computationally wise to eliminate the other factor. For example, if the term counting algorithm fails to retrieve the correct document, then the term adjacency algorithm will fail as well. At best, considering term adjacencies on their own may improve the ranking of the correct document and increase the likelihood of the document occurring within the top ranked documents.

Similar to 3.4.1, the number of occurrences of adjacent query terms within the considered text is counted. Again, each term may only be considered once, and since there may be multiple ways one can select matched terms from the considered text, we let $a$ be the value where terms are selected to maximize the number of adjacencies.

$$\mathrm{PBP}\text{-}a = a \qquad (3.2)$$

where $a$ is the number of pairs of adjacent query terms found in the entire document. Maximizing the number of term adjacencies does not necessarily increase the number of matched terms in the considered text. We shall also consider the results of placing high priority on the number of matched terms, as in Eq. 3.1, and then breaking ties by considering the number of adjacencies:

$$\mathrm{PBP}\text{-}ca = |f|^2 + a \qquad (3.3)$$

where $f$ is a binary vector representing which query terms were found in the entire document and $a$ is the number of pairs of adjacent query terms found in the entire document.
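Continuing the sketch (and reusing pbp_c from above), adjacency is counted here greedily as consecutive document positions whose tokens are both query terms; the thesis' selection of matched terms to maximize $a$ is simplified away for brevity.

```python
def pbp_a(query_terms, doc_terms):
    """Equation 3.2 (simplified): count positions where two consecutive
    tokens of the text are both query terms. The full method selects
    matched terms so as to maximize this count."""
    qset = set(query_terms)
    return sum(1 for x, y in zip(doc_terms, doc_terms[1:])
               if x in qset and y in qset)

def pbp_ca(query_terms, doc_terms):
    """Equation 3.3: |f|^2 + a, so the term count dominates and
    adjacencies only break ties."""
    return pbp_c(query_terms, doc_terms) ** 2 + pbp_a(query_terms, doc_terms)

doc = "for the first time in my life i saw dolphins when i was in hawaii".split()
print(pbp_a("when i was in hawaii i saw dolphins".split(), doc))  # 7
```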

3.4.3 Maximum Windows

A proximity factor is a sort of compromise between term counting and term adjacencies. Terms can be disjoint, but they should still be close. Using the queries provided in the form described in 3.2, we incorporate a proximity factor into our algorithm by calculating a maximum window of terms against which we will match our query terms. This task is similar to that of phrase selection in the problem of question answering [7]. In some systems for question answering, the phrase is selected by taking the smallest substring containing the query terms and then extending the substring by n terms on either side. However, in this thesis we are not concerned with the surrounding terms; once we have the maximum window calculated, an actual window size, which is the smallest window containing the matched terms, will be determined. Each operator from the query is converted into an allowable distance between the adjacent terms. The following conversions were used:

• Whitespace → maximum of 2 terms apart

• Single ellipsis → maximum of 10 terms apart

• Double ellipsis → maximum of 20 terms apart

The distances were chosen by experimentation and made larger than is likely needed, to ensure that the conditions were not too strict. Furthermore, the distances were chosen to be consistent among all documents and not dynamic (e.g. proportional to the length of the document), because a dynamic distance could make it impossible to indicate a small enough gap: a very long document would make even the single-ellipsis distance too large.

Using these distances, a maximum phrase window size, M, is determined in terms of the maximum number of words in the window, by adding the total allowable distances and the number of words. For example, the window sizes for the example queries above are calculated in Figures 3.2 and 3.3, resulting in 13 and 36, respectively.

here's (1) NEAR_2 (2) looking (1) NEAR_2 (2) at (1) NEAR_2 (2) you (1) NEAR_2 (2) kid (1); M = 13

Figure 3.2: Maximum window size calculation for here's looking at you kid

The positions of the first and last words that match phrase terms are the first and last words of the actual window, W. Given the length of the document, length, there are length - M + 1 windows in that document, and each window is denoted by W_j, where j indicates that the first term in the original window of size M is the jth word in the document. When ranking documents using windows, we can either assign importance to smaller values of W or treat equally all matched text which satisfies M. In either case, we clearly want to consider the number of terms which were found in the window; otherwise, if we sort by W, all windows which contain just one of the query terms will be ranked higher than those containing multiple terms, and if we only require that M is satisfied, all documents which had at least one term will have the same score. Higher priority should be placed on the number of terms matched. Ties between documents with the same number of matched terms within a window of size M can then be broken by sorting those documents according to the value of W. A window-based ranking can then be expressed with the following equation:

PBP-cw = |f_j|^2 + (W_j + |f_j|) / W_j    (3.4)

where f_j is a binary array representing with a 1 or a 0 which query terms were found in the jth window and W_j represents the window length in the number of terms it contains.

my (1) NEAR_2 (2) name (1) NEAR_2 (2) is (1) NEAR_10 (10) you (1) NEAR_2 (2) kill (1) NEAR_2 (2) my (1) NEAR_2 (2) father (1) NEAR_2 (2) prepare (1) NEAR_2 (2) to (1) NEAR_2 (2) die (1); M = 36

Figure 3.3: Maximum window size calculation for my name is ... you kill my father, prepare to die
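For illustration, the conversion from a gapped query to M can be sketched as follows. This is our own sketch, not the thesis implementation; it assumes gaps are written exactly as three or six consecutive periods, and max_window_size is a hypothetical name:

    import re

    def max_window_size(query):
        # M = number of query words plus the allowable distance for each
        # gap between consecutive words (whitespace -> 2, "..." -> 10,
        # "......" -> 20), per the conversions in Section 3.4.3.
        tokens = re.findall(r"\.{6}|\.{3}|[a-z']+", query.lower())
        m, pending = 0, None
        for tok in tokens:
            if tok == "......":
                pending = 20
            elif tok == "...":
                pending = 10
            else:                       # a query word
                if m > 0:               # add the gap distance before it
                    m += pending if pending is not None else 2
                m += 1
                pending = None
        return m

    # max_window_size("here's looking at you kid")                         -> 13
    # max_window_size("my name is ... you kill my father, prepare to die") -> 36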

3.4.4 Adjacencies in Maximum Windows

Using subsections of size M as calculated in 3.4.3, we apply Equations 3.1 and 3.2 to these considered pieces of text in Equations 3.5 and 3.6, respectively. As in Equation 3.4, we will again break ties in the scores using the actual window sizes, but we compare the number of adjacencies to the window size in order to ensure that the window size has the lowest priority in these equations.

PBP-aw = a_j + (W_j + a_j) / W_j    (3.5)

where a_j represents the number of pairs of adjacent query terms found in the jth window and W_j represents the window length in the number of terms it contains.

PBP-caw = |f_j|^2 + a_j + (W_j + a_j) / W_j    (3.6)

where f_j is a binary array representing with a 1 or a 0 which query terms were found in the jth window, a_j represents the number of pairs of adjacent query terms found in the jth window, and W_j represents the window length in the number of terms it contains.

3.4.5 Proximity-based Methods

A proximity-based scoring system was developed that ranked highest those documents which maximized the number of query terms found within a minimized window, with consideration given to the number of adjacencies. The method is derived from Equation 3.6, with more emphasis placed on the number of terms found and less on the number of adjacencies found. The number of query terms found is squared, since we placed higher priority on maximizing the occurrence of query terms than on any other factor. The score is increased by the number of occurrences of adjacent query terms. Finally, an additional factor considering the number of matched terms is added, normalized by the window size to a value between 1 and 2. This scoring system is expressed in Equation 3.7.

PBP-reg = |f_j|^2 + a_j + (W_j + |f_j|) / W_j    (3.7)

where f_j is a binary array representing with a 1 or a 0 which query terms were found in the jth window, a_j represents the number of pairs of adjacent query terms found in the jth window, and W_j represents the window length in the number of terms it contains.
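A sliding-window scorer for PBP-reg might look as follows. This is our own sketch rather than the thesis implementation; it reuses the adjacency_count helper sketched in 3.4.2 above and assumes tokenized, lower-cased documents:

    def pbp_reg(doc_tokens, query_words, M):
        # Best PBP-reg score over the length - M + 1 windows of size M
        # (Eq. 3.7): |f_j|^2 + a_j + (W_j + |f_j|) / W_j.
        qset, best = set(query_words), 0.0
        for j in range(max(1, len(doc_tokens) - M + 1)):
            window = doc_tokens[j:j + M]
            hits = [i for i, t in enumerate(window) if t in qset]
            if not hits:
                continue
            f = len(qset & set(window))       # distinct matched query terms
            W = hits[-1] - hits[0] + 1        # actual window: first to last match
            a = adjacency_count(window, query_words)
            best = max(best, f ** 2 + a + (W + f) / W)
        return best

Since the distinct matched terms cannot outnumber the tokens spanned by the actual window, the final term stays between 1 and 2 and serves only to break ties among windows with equal term and adjacency counts.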

Two modified methods were developed to incorporate other factors while maintaining the highest importance on the factors already incorporated in PBP-reg. The first incorporates the lengths of the query terms and was designed to determine whether assigning importance to documents which matched longer query terms would improve results:

PBP-length = |f_j|^2 + a_j + (W_j + |f_j|) / W_j + (Σ_k length(t_k) · f_{j,k}) / 4    (3.8)

where f_j is a binary array representing with a 1 or a 0 which query terms were found in the jth window, a_j represents the number of pairs of adjacent query terms found in the jth window, W_j represents the window length in the number of terms it contains, and length(t_k) represents the number of characters in t_k, where k is the index of a term in the query and f_{j,k} is the kth entry of f_j. In order to reduce the weight of the length factor relative to the weight of counting, different divisors were tested, and 4 was found to work well on a small number of queries and documents. This is not necessarily the optimal divisor, but much larger or much smaller divisors reduced performance.

A second variation was made by assigning importance to documents which matched the query terms that were less frequent, relative to the entire corpus:

PBP-tf = |f_j|^2 + a_j + (W_j + |f_j|) / W_j + Σ_k f_{j,k} / totalTF(t_k)    (3.9)

where f_j is a binary array representing with a 1 or a 0 which query terms were found in the jth window, a_j represents the number of pairs of adjacent query terms found in the jth window, W_j represents the window length in the number of terms it contains, and totalTF(t_k) is the number of times t_k occurs within the entire corpus, where k is the index of a term in the query.
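For illustration, the two additional factors reduce to simple bonus terms added to the PBP-reg window score. This is our own sketch; matched_terms and total_tf are hypothetical names for the matched query terms and the corpus-wide term frequencies:

    def length_bonus(matched_terms):
        # PBP-length (Eq. 3.8): total character length of the matched
        # query terms, down-weighted by the divisor 4.
        return sum(len(t) for t in matched_terms) / 4

    def tf_bonus(matched_terms, total_tf):
        # PBP-tf (Eq. 3.9): rarer terms (lower corpus-wide frequency)
        # contribute more to the score.
        return sum(1 / total_tf[t] for t in matched_terms)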

3.5 Experimental Procedure

3.5.1 Exact Phrase

For each dataset, a collection of exact phrase queries is created by applying the method described in 3.3 to the queries. These exact phrase queries are then submitted to Google. The results of this experiment are found in section 4.1.

3.5.2 Vector Space Model

All documents were converted into term-frequency vectors. All queries were converted into term vectors. Stop words were not removed in doing this. The similarity between the query term vectors and the documents was calculated and documents were ranked using the VSM. The top 100 ranked documents were recorded. This method is referred to as VSM and the results of this experiment are found in section 4.2.
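A minimal sketch of this baseline follows. It is our own illustration; the thesis specifies raw term-frequency vectors with no stop-word removal, and we assume the standard cosine measure for the VSM similarity:

    from collections import Counter
    from math import sqrt

    def cosine(q, d):
        # Cosine similarity between two term-frequency vectors.
        dot = sum(n * d.get(t, 0) for t, n in q.items())
        norm = sqrt(sum(n * n for n in q.values())) * sqrt(sum(n * n for n in d.values()))
        return dot / norm if norm else 0.0

    def vsm_top100(query, docs):
        # Rank all documents against the query and keep the top 100,
        # mirroring the procedure described above.
        qv = Counter(query.lower().split())
        scored = sorted(((cosine(qv, Counter(doc.lower().split())), i)
                         for i, doc in enumerate(docs)), reverse=True)
        return scored[:100]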

3.5.3 Proximity Search

The PBP methods defined in Equations 3.7-3.9 were tested on all queries over the whole corpus for both the movie script collection and the lyrics collection. The top 100 ranked documents for each query were recorded. The results of this experiment are found in section 4.2. The PBP methods defined in Equations 3.1-3.6 were tested in the same way, and the results of that experiment are found in section 4.3.

3.5.4 Search Engines

Three search engines, Google.ca, Yahoo.com and Ask.com, were used to perform site-specific searches. For example, in order to search for all of the terms "a", "b" and "c" on lyricsdownload.com, we provided the query "a b c site:lyricsdownload.com" to Google.ca. We performed 4 additional searches over all queries: an "any of these terms" search with Google.ca, an "all of these terms" search with Google.ca, an "all of these terms" search with Yahoo.com and an "all of these terms" search with Ask.com.
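The site-specific queries are plain strings. A minimal sketch of their construction follows (our own; the OR syntax shown for the "any of these terms" variant differs between engines and is illustrative only):

    def site_query(terms, domain, mode="all"):
        # "all of these terms": a b c site:domain.com
        # "any of these terms": a OR b OR c site:domain.com
        joined = " ".join(terms) if mode == "all" else " OR ".join(terms)
        return "%s site:%s" % (joined, domain)

    # site_query(["a", "b", "c"], "lyricsdownload.com")
    #   -> "a b c site:lyricsdownload.com"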

These methods will be referred to as Google-any, Google-all, Yahoo-all and Ask-all, respectively. The "any of these terms" method was not used with Yahoo.com or Ask.com, as the Google.ca searches yielded very poor results and both of these search engines exhibited the same behavior in preliminary tests. For each query, the rank of the correct document among the first 100 results was recorded for each search engine method. The results of this experiment are found in section 4.2.

3.5.5 Result Sets

The number of top ranked documents to examine was chosen conservatively. Searching through 100 results is certainly more than human searchers have the tolerance to examine. However, in order to determine whether the methods used are having any positive effect in cases where the results are less than desirable, recording a large number of documents can prove useful in improving the method used.

3.5.6 Experimental Conditions

In Table 3.2, each of the experiments conducted is listed and the conditions of each experiment are shown.

Table 3.2: Experimental Conditions

ID   Algorithm            Corpus         Queries  Documents  Results
1A   Google-exact         Movie scripts  91       712        100
1B   Google-exact         Song lyrics    85       685,079    100
2A   Vector Space Model   Movie scripts  91       712        100
2B   Vector Space Model   Song lyrics    85       685,079    100
3A   PBP-reg              Movie scripts  91       712        100
3B   PBP-reg              Song lyrics    85       685,079    100
4A   PBP-length           Movie scripts  91       712        100
4B   PBP-length           Song lyrics    85       685,079    100
5A   PBP-tf               Movie scripts  91       712        100
5B   PBP-tf               Song lyrics    85       685,079    100
6A   PBP-c                Movie scripts  91       712        100
6B   PBP-c                Song lyrics    85       685,079    100
7A   PBP-a                Movie scripts  91       712        100
7B   PBP-a                Song lyrics    85       685,079    100
8A   PBP-cw               Movie scripts  91       712        100
8B   PBP-cw               Song lyrics    85       685,079    100
9A   PBP-aw               Movie scripts  91       712        100
9B   PBP-aw               Song lyrics    85       685,079    100
10A  Google-any           Movie scripts  91       712        100
10B  Google-any           Song lyrics    85       685,079    100
11A  Google-all           Movie scripts  91       712        100
11B  Google-all           Song lyrics    85       685,079    100
12A  Ask-all              Movie scripts  91       712        100
12B  Ask-all              Song lyrics    85       685,079    100
13A  Yahoo-all            Movie scripts  91       712        100
13B  Yahoo-all            Song lyrics    85       685,079    100

3.6 Summary

The methodology used in this thesis consisted of building a lyrics corpus by selecting all web pages containing lyrics for songs from LyricsDownload.com, obtaining manually generated user queries, cleaning this data, building term-document matrices, applying the vector space model to all documents for each query, building a corpus-term matrix from the lyrics, building term-term proximity matrices from query terms, applying a proximity-based scoring system to all documents for each query, supplying the query terms in site-specific queries to the search engines, and evaluating the resulting document rankings.

Chapter 4

Results

In this chapter, we first examine the results of an experiment testing what portion of the queries failed when submitted as exact phrase queries to a search engine (Google). The second section compares the document ranking accuracy of the Vector Space Model, three forms of the proximity-based methods (PBP-reg, PBP-len, PBP-tf), and the search engine methods available from Google.ca, Yahoo.com, and Ask.com. The third section examines the individual and combined factors of the PBP-reg method.

4.1 Exact Phrase Queries

Many of the exact phrase queries failed for each corpus. On the movie script corpus, 77 of the 91 queries failed (84.62%). On the song lyrics corpus, 40 of the 85 queries failed (47.06%). The breakdown of the reasons these queries failed is found in Table 4.1. Various queries had multiple reasons for failing, so the sum of the percentages of failed queries for individual reasons will not equal the percentage of total failed queries. The top three reasons are shown in bold. The highest contributing factor for failed movie script queries was two or more words being incorrect. In general, the intended meaning of the phrase in the query and the phrase it was intended to match were very similar.

4.2 Proximity-based Phrase-Matching Methods Compared to the Vector Space Model and Search Engines Results

The results of all eight scoring systems for the movie script experiment and the song lyric experiment are shown in Table 4.2 and Table 4.3, respectively. Each row gives the number of queries which ranked the correct document within a top range of documents for that scoring system.


Table 4.1: Reasons for and percentage of failed exact phrase matching queries

Reason for failed query                           Movie Scripts  Song Lyrics  Average
2 or 3 incorrect words                            19.78%         2.35%        31.51%
4 or more incorrect words                         37.36%         3.53%        20.45%
only 1 incorrect word                             12.09%         15.29%       13.69%
Gap exists where none indicated                   15.38%         3.53%        9.46%
Incorrect order                                   7.69%          10.59%       9.14%
Typo                                              5.49%          9.41%        7.45%
Token concatenation                               2.20%          5.88%        4.04%
Correct, but not in text                          6.59%          0.00%        3.30%
Contraction                                       3.30%          2.35%        2.82%
Search engine not handling new-line correctly     0.00%          4.71%        2.35%
Incorrect honorific                               0.00%          1.18%        0.59%
Numbers as words                                  2.20%          1.18%        1.69%
Larger gap than indicated                         2.20%          0.00%        1.10%
Too common                                        1.10%          0.00%        0.55%
Total Queries Failed                              84.62%         47.06%       65.84%

For each number of top ranked documents considered, the proximity-based methods consistently retrieved the correct document within those top ranked documents the highest number of times. In the movie script experiment, the additional length and term frequency factors increased the number of successful queries over the regular proximity-based method when considering the top ten or fewer documents. In the song lyrics experiment, the proximity-based method with the additional length factor achieved the highest number of successful queries in all cases.

Table 4.2: Number of queries ranking the correct document within the top N documents, movie script collection

Movie Scripts (91 queries)  Top 1  Top 5  Top 10  Top 20  Top 50  Top 100
VSM                         1      2      3       8       17      23
PBP-reg                     55     63     66      68      72      79
PBP-len                     56     63     65      73      74      77
PBP-tf                      59     67     69      72      73      78
Google-any                  4      17     21      26      37      46
Google-all                  47     51     52      54      59      59
Yahoo-all                   18     31     32      39      42      46
Ask-all                     36     42     45      47      47      47

Table 4.3: Number of queries ranking the correct document within the top N documents, lyrics collection

Lyrics (85 queries)  Top 1  Top 5  Top 10  Top 20  Top 50  Top 100
VSM                  0      0      0       0       0       0
PBP-reg              68     73     76      78      80      81
PBP-len              69     76     78      80      82      83
PBP-tf               68     73     76      77      78      80
Google-any           0      1      1       2       2       4
Google-all           42     46     46      47      48      49
Yahoo-all            39     44     46      49      51      55
Ask-all              42     45     47      49      50      52

A comparison of the accuracy among all search methods in ranking the correct document within the top documents is shown in Figures 4.1 and 4.2. A comparison of the occurrences of the correct document within the top ten ranked documents in both collections is shown in Figure 4.3 in terms of percentage of valid queries. The average accuracies of the proximity methods were 80.5%, 81.1% and 82.1% for PBP-reg, PBP-len and PBP-tf, respectively. VSM's average accuracy was approximately 1.6%. The highest average accuracy among the search engines was 55.3%.

Figure 4.1: Accuracy on movie script collection

Figure 4.2: Accuracy on song lyric collection

The VSM algorithm had low performance in both experiments. In the song lyrics experiment, where there were 685,079 documents, it failed to retrieve the correct document within the top 100 ranked documents for all queries. In the movie scripts experiment, the VSM algorithm was only able to achieve an accuracy of 25.27% when considering the top 100 ranked documents. If, for a very large number of queries, the documents were ranked randomly, then considering the top 100 documents in a set of 712 documents would yield an accuracy of approximately 14.04%. Therefore, the VSM algorithm's accuracy was only 11.23 percentage points higher than random ranking. Similarly, if the set contained 685,079 documents, then a random ranking would yield an accuracy of approximately 0.01460%. The accuracy of the VSM algorithm on the song lyrics dataset was 0%; as a result, a random ranking could outperform the VSM algorithm on a dataset of this size. The minimum improvement of the average PBP accuracies over the other tested methods was 24.79%. The amount of improvement over search engines in the experiment on song lyric queries was more than 10% higher than the improvement in the experiment on movie script queries.

Figure 4.3: Comparison of the "top 10" accuracy in each collection

In order to determine how the results of each method are affected when the number of top ranked documents considered is reduced, we compare the percentage of successful queries from Figure 4.1 and Figure 4.2. In Figure 4.4, we look at the percentage of queries correctly ranked within the top 100 which are not correctly ranked within the top 50. In Figure 4.5, we look at the percentage of queries correctly ranked within the top 50 which are not correctly ranked within the top 20. In Figure 4.6, we look at the percentage of queries correctly ranked within the top 20 which are not correctly ranked within the top 10. The large drops in accuracy when reducing the result set in both the VSM method and the Google-any method indicate not only that the methods did not perform well, as can be seen in Figure 4.1 and Figure 4.4, but also that many of the successful queries were ranked lower in the result set. Some of these successful queries could be coincidental, especially in the movie scripts corpus, where up to 12.6% of the document set is being retrieved. The proximity-based methods had very little decline in performance, and the three search engine methods which required that all terms be present had approximately the same decline as the proximity-based methods.


Figure 4.4: Percentage of results correctly found within the top 100 lost when only the top 50 results are considered

In Figures 4.7, 4.8, 4.9 and 4.10, we examine the rank of the correct document produced by the PBP-reg and Google-all methods in the movie scripts experiment and the song lyrics experiment. The graphs compare the number of words in the query and the number of "rare" words to the trend of performance for each method. A rare word is considered to be any word which does not appear in the British National Corpus list of the 1000 most frequent words [2] or the Brown corpus list of the 2000 most frequent words [10]. The British National Corpus (BNC) [2] is a very large corpus of English texts from a wide range of sources with a total of 100 million words. The Brown corpus contains general non-academic English and has a total of 1,015,945 words. Google-all was chosen for comparison to a proximity-based method as it was the highest performing "all of these terms" search engine method in finding the correct document within the top 10 returned documents.

Figure 4.5: Percentage of results correctly found within the top 50 lost when only the top 20 results are considered

Any results where both methods failed to retrieve the document were omitted, as were results where both methods found the correct document within the top 10. Results which were ranked first by both methods or which failed in both methods are not of interest here; we wish to see how the results of our proximity-based method differed from the Google-all method with respect to the number of terms. The number of rare terms was calculated as the total number of terms minus the average of the number of terms which appeared in the BNC list and the number of terms which appeared in the Brown list.
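A sketch of this rare-term calculation follows (our own illustration; bnc_top1000 and brown_top2000 are hypothetical names for the two frequency lists cited above):

    def rare_term_count(query_terms, bnc_top1000, brown_top2000):
        # Rare terms = total terms minus the average of the number of
        # terms found in each of the two frequency lists.
        in_bnc = sum(1 for t in query_terms if t in bnc_top1000)
        in_brown = sum(1 for t in query_terms if t in brown_top2000)
        return len(query_terms) - (in_bnc + in_brown) / 2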

In Figure 4.7, the performance of PBP-reg is generally unaffected and the performance of Google-all generally decreases as the total number of words in the query decreases. In Figure 4.8, the performance of PBP-reg is generally unaffected and the performance of Google-all generally decreases as the number of rare terms in the query decreases. In Figure 4.9, the performance of PBP-reg generally decreases and the performance of Google-all generally increases as the total number of words in the query decreases. In Figure 4.10, the performance of PBP-reg is generally unaffected and the performance of Google-all generally increases as the number of rare terms in the query decreases. A comparison between the PBP-reg and Google-any methods reveals no trends relating to the number of words or the number of rare words.

Figure 4.6: Percentage of results correctly found within the top 20 lost when only the top 10 results are considered

Our previous comparison of the proximity methods to the vector space model [29] had shown very little difference in accuracy among the three PBP methods. Although the length and term-frequency factors were shown to improve the results of some queries, the number of queries affected is very small. In addition, the term-frequency factor is very costly to generate on dynamic corpora, as it requires that the term frequency over the whole corpus be known. While the length factor does not decrease efficiency noticeably, it does not prove generally useful either. Many query results were improved by the proximity-based method, and several queries failed in all methods except for the proximity-based methods. One example is the query "tammy baker, boy did she loose some weight!", which all three PBP methods matched against the text "... Kathy Straker, boy could she lose some weight ..." and correctly ranked this document first.

Figure 4.7: Linear trend of the performance by PBP-reg and Google-all as the number of terms in the query decreases on the movie script collection

Figure 4.8: Linear trend of the performance by PBP-reg and Google-all as the number of rare terms in the query decreases on the movie script collection

Figure 4.9: Linear trend of the performance by PBP-reg and Google-all as the number of terms in the query decreases on the lyrics collection

Figure 4.10: Linear trend of the performance by PBP-reg and Google-all as the number of rare terms in the query decreases on the lyrics collection

Figure 4.11: Percentage of query terms being rare terms for each query, independently sorted by percentage; 91 movie script queries and 85 lyrics queries

The document contained the relatively uncommon query terms "tammy" and "baker" in a different line of the lyrics, but the search engine methods and the vector space model were still unable to locate the correct document within the top 100 results. Some queries were successfully ranked within the top 10 by the proximity-based method, but were not ranked first. One query, "I wasn't even supposed to come in today", ranked the correct document highly, but since the query was poorly remembered, the result was not the first. The correct document is Clerks. The top five ranked documents from the PBP-tf method matched the following text:

• Ed-TV: even talks to me and then today I come

• Pet-Sematary: wasn't even supposed to be a sprain today, my friend-that's what I

• 25th-Hour: supposed to be at work in a couple hours. Christ, I can't even imagine working today

• Clerks: even supposed to be here today!

• Wonder-Boys: in Sewickley Heights. I dropped him there once, but... (remembering) Come to think of it, it-wasn't even

Some other examples of lower performing queries are "na ... na ... not going to work her any more", which was intended to match "Naga ... Naga-worker here anyway!", and "as soon as you stop looking at the reason you become a numerologist", which was intended to match "You become a numerologist. What you need to do is take a break from your research". An example of one query which failed in all methods and appears to be difficult to improve is "thank you for calling initrode, please hold. Thank you for calling initrode, please hold", which was intended to match "Corporate Counsels Payroll, Nina speaking. Just a moment". Some of the terms are found in the desired document and one term, "initrode", is unique to it, but a method which is designed to make use of uncommon terms, such as VSM, also failed. There were many instances where the correct document was found within the top 100 but tied with so many other documents in its score that the resulting rank was unsatisfactory. This typically occurred with queries made up mainly of common terms. These common terms often appeared near one another. The correct document could have its rank improved by awarding value to text more closely resembling the structure of the original query.

Table 4.4: Reasons for and percentage of failed queries using PBP-reg

Reason for failed query    Movie Scripts  Song Lyrics  Average
not enough correct words   4.40%          2.33%        3.36%
mistaken phrase common     4.40%          2.33%        3.36%
too many tied scores       3.30%          1.16%        2.23%
spelling error             1.10%          2.33%        1.71%
too common                 1.10%          2.33%        1.71%
smaller window found       2.20%          1.16%        1.68%
inaccurate document        3.30%          0.00%        1.65%
text does not exist        2.47%          0.00%        1.23%
window too small           1.10%          0.00%        0.55%
TOTAL FAILED               23.35%         11.63%       17.49%

Table 4.4 shows the results of an examination of failed queries, that is, queries for which the correct document was not ranked within the top ten results using the regular proximity-based algorithm.

4.3 Elements of PBP Explored

In examining how the individual factors compared in terms of accuracy, we raised the level of difficulty by considering the worst possible rank of the correct document, based on the rank of the lowest ranked document with the same score as the correct document. For example, in Figure 4.12, suppose the correct document was ranked 14th with a score of 35.88, as in the table below. Four documents share this score. The highest the rank could have been is 13th and the lowest the rank could have been is 16th. In addition to raising the actual rank as much as we can, we also wish to decrease the number of ties. Decreasing the number of ties decreases the probability that the correct document will have a lower rank, as the order of tied documents is arbitrary. So, considering the highest potential rank could be useless if there are a high number of ties. If we always aim to raise the lowest potential rank, we achieve both objectives. In all of the tables and figures which follow in section 4.3, the lowest potential rank is treated as the rank of the correct document.

Rank  Document  Score
1     A         80
2     B         80
3     C         75.6
...
12    G         40
13    H         35.88
14    I         35.88   (correct document)
15    J         35.88
16    K         35.88
17    L         29
18    M         29

The correct document has been ranked 14th. Three other documents share the same score as the correct document. The highest rank held by a document with the same score is 13th and the lowest rank held by a document with the same score is 16th.

Figure 4.12: Example document ranking; the correct document is marked

The ranks produced by these algorithms, in practice, will be greater than or equal to the results outlined in this section.
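The two potential ranks follow directly from the list of document scores; a minimal sketch of our own (potential_ranks is a hypothetical helper name):

    def potential_ranks(scores, correct_index):
        # Highest and lowest potential 1-based rank of the correct document
        # when documents with tied scores may be ordered arbitrarily.
        s = scores[correct_index]
        better = sum(1 for x in scores if x > s)
        ties = sum(1 for x in scores if x == s)   # includes the correct doc
        return better + 1, better + ties

    # For Figure 4.12: 12 documents score above 35.88 and 4 documents share
    # that score, so potential_ranks returns (13, 16).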

Table 4.5: Average ties per query of the PBP elements and PBP-reg on the movie script corpus (best result shown in bold)

                        PBP-a  PBP-aw  PBP-c  PBP-cw  PBP-reg
Average Ties Per Query  11.84  8.33    23.65  13.69   3.32

In the movie script experiment, counting term adjacencies over the whole document outperformed counting term occurrences over the whole document in both accuracy and minimizing tied scores. Incorporating the proximity window improved the accuracy of both adjacency counting and term occurrence counting and reduced the number of tied scores. Incorporating the proximity window also caused term counting to outperform term adjacency counting. However, the number of ties remained lower when counting term adjacencies. Incorporating all three factors, term counting, term adjacency and a proximity window, had the highest performance in both accuracy and minimizing the number of ties.

Figure 4.13: A comparison of the accuracy of each factor in ranking the correct document within the top number of documents specified on the movie script corpus

Table 4.6: Performance of the PBP elements and PBP-reg on the movie script collection (best results shown in bold)

                  PBP-a   PBP-aw  PBP-c   PBP-cw  PBP-reg
Worst in Top 1    25.27%  38.46%  16.48%  47.25%  58.82%
Worst in Top 5    39.56%  56.04%  27.47%  59.34%  71.76%
Worst in Top 10   45.05%  59.34%  34.07%  63.74%  74.12%
Worst in Top 20   51.65%  61.54%  36.26%  63.74%  75.29%
Worst in Top 50   56.04%  65.93%  49.45%  67.03%  78.82%
Worst in Top 100  68.13%  73.93%  68.13%  80.22%  84.71%

Table 4.7: Average ties per query of the PBP elements and PBP-reg on the song lyrics collection (best result shown in bold)

                        PBP-a  PBP-aw  PBP-c  PBP-cw  PBP-reg
Average Ties Per Query  9.13   10.05   24.45  4.87    2.38

In the song lyrics experiment, counting term occurrences over the whole document typically achieved a higher accuracy than counting term adjacencies over the whole document. However, the number of tied scores was higher for counting term occurrences. The accuracy of counting term adjacencies was also higher when fewer top documents were considered. Incorporating the proximity window reduced the performance of term adjacency counting in terms of accuracy and minimizing tied scores. Incorporating the proximity window improved the performance of counting term occurrences in terms of accuracy and minimizing the number of tied scores. Incorporating all three factors, term counting, term adjacency and a proximity window, had the highest performance in both accuracy and minimizing the number of ties.

Table 4.8: Performance of the PBP elements and PBP-reg on the song lyrics collection (best results shown in bold)

                  PBP-a   PBP-aw  PBP-c   PBP-cw  PBP-reg
Worst in Top 1    8.24%   5.88%   2.35%   12.94%  5.88%
Worst in Top 5    23.53%  20.00%  18.82%  74.12%  77.65%
Worst in Top 10   29.41%  22.35%  31.76%  83.53%  85.88%
Worst in Top 20   30.59%  22.35%  45.88%  85.88%  87.06%
Worst in Top 50   34.12%  30.59%  60.00%  88.24%  89.41%
Worst in Top 100  41.18%  37.65%  77.65%  88.24%  90.59%

Figure 4.14: A comparison of the accuracy of each factor in ranking the correct document within the top number of documents specified on the lyrics corpus

Figure 4.15: Rank and tie results of the PBP elements and PBP-reg on the song lyrics collection and movie script collection

Chapter 5

Discussion

In this chapter we discuss trends and observations resulting from the experimental results in Chapter 4. We discuss each type of method used and provide hypotheses for the differing performances in the results. The limitations and dependencies of each method are discussed. We also consider special cases and noise found in the results. New information learned from our experiments regarding approximate phrase matching is outlined, as well as the implications for future research.

5.1 Exact Phrases

A significantly higher percentage of the exact movie script corpus queries failed (84.62%) than the exact song lyrics queries (47.06%). The difference can be attributed to the fact that movie script queries contained, on average, almost twice as many terms as the song lyrics queries: 13.98 and 7.34, respectively. Another reason is that the language was more complex and likely to vary in ordering. For song lyrics, the cognitive connection between the words and the music means there is a lower probability of providing incorrect words. As indicated in Table 4.1, 69.23% of the movie script queries and 21.17% of the song lyric queries contained 1 or more incorrect words. So, providing incorrect words was a higher source of query failures in the movie script experiment. The results of this experiment show quite clearly that exact phrase matching is far too restrictive for these types of queries. Quotes from movie scripts and lyrics which are generated from memory are prone to user error. Only 6.59% of the movie script queries and 1.18% of the song lyrics queries failed for reasons beyond the users' control. Exact phrase queries are very straightforward and exact phrase matching is the most efficient method of phrase searching. When the quote provided in the query is correct, this method of searching is the optimal method to use. However, the high failure rate of this method indicates that more options are necessary for users in a situation where quotes are being recalled from memory. Ideally, a method will be developed that eliminates the need for exact phrase queries in situations where there is potential uncertainty in the query. This would eliminate the need to submit two or more different types of queries when the first type fails. The proximity-based method will not necessarily succeed on, or perform as well on, every query that exact phrase matching handles successfully, and on such queries it will certainly do no better than exact phrase matching. So, exact phrase matching is still necessary.

5.2 Vector Space Model

The benchmark method chosen for our experiments, the Vector Space Model, performed quite poorly, as expected. This algorithm is a powerful tool in information retrieval [32] and is known to perform certain tasks very well, such as automatic indexing [37] and, more commonly, retrieving documents on topics related to a query [22]. However, none of the tasks in which we found the vector space model employed closely resemble the task examined in this thesis. While this algorithm has proven useful in common document retrieval tasks, its use in this special case of document retrieval is rarely successful on a small document set, such as the movie script documents, and virtually impossible on a large document set, such as the song lyrics document set. Some of the queries in each experiment failed for all methods except for the proximity-based method. In these cases, the Vector Space Model may have failed due to the individual terms in the query being present in too many documents to distinguish which documents were relevant. The proximity-based method requires that the query terms be found within a certain proximity of each other, whereas the vector space model allows the terms to be found anywhere in the document. Therefore, the number of documents which could match the query using the proximity-based method must be less than or equal to the number of documents matched using the vector space model. The number of documents matched by the proximity-based method is likely to be much smaller, as most of the queries contained no gaps (64.83% of the movie script queries and 76.12% of the song lyrics queries, as in Table 3.1), resulting in smaller proximity windows.

Assuming our hypotheses regarding the generally poor performance of the Vector Space Model are correct, we consider a few of the exceptions where the correct document was ranked highly. All of the exceptions listed below are from the movie script data set, since the Vector Space Model failed for all queries in the song lyrics data set.

"that's no moon! that's a space station" ranked 12th. Only one query term was incorrect and the query contained three uncommon words: "moon", "space" and "station".

"captain ship decloaking dead ahead shields delay that order mr worf tactical report ... 38 photon torpedos ... 27 phaser cannons captain, should we raise our shields no she is a predator" ranked 12th. Many terms were incorrect, but 10 of the terms were still found in the intended phrase, 8 of which were uncommon: "captain", "shields", "worf", "tactical", "photon", "captain", "raise", and "shields".

"na ... na ... not going to work her any more" ranked 2nd. None of the query terms were found in the intended phrase, so the terms must have appeared elsewhere in the document. The high rank was likely coincidental. The proximity-based methods ranked the document very low for this query.

The two exceptions that actually performed well, as opposed to being coincidentally ranked highly, were also ranked highly by all of the other methods. As a result, the Vector Space Model was the worst performing algorithm examined in this thesis for this task in terms of ranking the correct document highly.

5.3 Search Engines

All of the search engines performed anywhere from 2 to 47 times better than the Vector Space Model in the movie scripts experiment. The search engines, in general, performed well, ranking the correct document within the top ten for between 18.68% and 56.04% of queries in the movie scripts experiment and between 1.18% and 52.94% of queries in the song lyrics experiment. In both experiments, the Google-all method achieved the highest percentage and the Google-any method the lowest percentage among the search engines tested.

Similarly to the low performance of the Vector Space Model algorithm, the "any of these words" search engine method, Google-any, was unable to retrieve 95.29% of the correct documents within the top 100 results. As a result, it would be unwise to use this method for the task examined in this thesis. Our initial observation of the Google-any method and the similar performance of the "any of these words" method for the other search engines motivated us to omit the other "any of these words" methods from our experiment for the remaining search engines. Intuitively, the high rate of error mentioned in the exact phrase matching experiments may lead one to believe that "any of these words" methods would outperform "all of these words" methods. However, the "all of these words" methods are misleading in that they will ignore some terms. The conditions which cause the search engines to do so are unclear, as they are proprietary methods, but we hypothesize that one of the conditions would be a query term appearing in no documents.

For the queries which failed in all methods except the proximity-based method, the "all of these words" search engine methods may have failed due to the existence of terms in the query which were not in the document, such as "loose" which should have been "lose". While some incorrect terms were ignored, resulting in a higher performance than Google-any, some incorrect terms were not ignored. This could include terms that were not typographical errors but simply words present in the query and in some documents, but not in the desired document. The Google-all method, achieving a performance of 56.04% and 52.94% in the movie script and song lyrics experiments, respectively, performs fairly well. It is possible that a few reformulations of the query would yield a higher performance. However, it is unlikely that the performance already achieved by the proximity-based methods would be matched, and it should not be necessary to issue multiple queries to get the desired result. The significant drawback of all search engines tested is the limitations implied by the "all of these terms", "any of these terms" and "exact phrase" methods. These search engines have indexed massive quantities of movie scripts and song lyrics, but are ill-equipped for this task. Providing more options to users which take this task into consideration would likely improve performance significantly. It is one of our objectives in this thesis to demonstrate that current search engine methods are insufficient for this task, and the results we have obtained are indicative of this insufficiency.

5.4 Proximity-based Algorithm

The proximity-based method was developed based on the hypothesis that a method making use of information regarding the occurrence of matched terms and the relative locations of matched terms would be an effective means of locating similar phrases. Our hypothesis was correct: the proximity-based method was the highest performing method of all those tested. The performance in ranking the correct document within the top ten documents was at least 35.29% higher using the proximity-based methods than all other methods in the movie script experiment, and at least 14.29% higher in the song lyrics experiment. Thus, there was an average minimum improvement of 24.79% over the other methods. The higher performance of the proximity-based method on the song lyrics collection over the movie script collection may be explained by the fact that song lyrics very often contain phrases which would rarely, if ever, be said in conversation. Song lyrics are also typically mentally connected with the music in the song, making it less likely that a user provides incorrect words or additional words that were not present. The amount of improvement over search engines in the experiment on song lyric queries was more than 20% higher than the improvement in the experiment on movie script queries. This may be explained by the fact that locating the correct document corresponding to a query of generally common words is much easier when documents in the collection are shorter, but search engines have a tendency to ignore common words. So, the proximity-based method would likely perform better. When dealing with long documents, a query containing generally common terms is likely to perform poorly regardless of the method, as natural spoken language and common terms make up a significant portion of the documents. So, searching for documents using queries with common terms is easier to improve with a method which considers all terms, if the documents are shorter. While the search engines' "all of these words" methods performed fairly well, the proximity-based methods performed considerably better, as previously mentioned.

One theory to explain the improvement of the proximity-based methods is that as the number of words and, typically, the number of rare words increases, the query becomes more specific. With more information, the proximity-based methods can distinguish between documents which match some of the same query terms. However, as more information is provided in an "all of these terms" search, the higher the probability of providing an incorrect term. An incorrect term can cause the document to not be found even when some of the terms were correct.

In Figures 4.7 through 4.10, PBP-reg and Google-all are compared to determine the relation of the difference in rank to the number of words in the query and the number of "rare" words. Decreasing the number of query terms and the number of rare query terms lowered the performance of Google-all on the movie script collection and increased the performance of Google-all on the lyrics collection. The different effects on the performance of Google-all are likely due to the size of the documents in the collections. The change in performance is greater when decreasing the number of rare words, as in Figure 4.8 and Figure 4.10, than when decreasing the number of total words, as in Figure 4.7 and Figure 4.9. The performance of the proximity-based method was virtually unaffected. An examination of the performance of Google-any when the number of query terms and rare query terms is decreased appears to show no correlation. This is what we would expect, since increasing the number of words should not impede the results of the Google-any method. Provided at least one correct word, uncommon or possibly common, the Google-any method should retrieve the result, but this method is too relaxed to rank highly enough those documents which contain many of the provided words. Given the large performance improvement the proximity-based methods provide over the search engine methods, it is quite probable that adding such a method to search engines would prove useful. It may be the case that a proximity-based method such as this is not easily implemented in current search engine algorithms. The proximity-based method can still be employed for web searching by applying one or more keyword searches on subsets of the query to quickly gather a reasonably sized set of documents to which an independent proximity-based algorithm can then be applied.

An examination of queries for which the correct document was ranked highly but not first by the proximity-based methods provides insight into why the correct document was not ranked higher. Looking at the number of matched terms, the lengths of the phrases and guessing the relative frequencies of the words makes it fairly easy to see how the top ranked documents were ranked higher than the correct document. For some queries, the only difference was the distance between the matched query terms and, since the proximity-based method prioritizes documents where the matched terms are closer together, this is expected. A smaller distance between the matched terms does not necessarily imply that the document is the correct match; the terms need only fall within the maximum window. But the probability that such a document is the correct one is higher.

Some queries failed as a result of not containing enough rare terms or partial phrases. Interestingly, other queries failed because they contained rare terms not found in the correct document but contained in other documents, which were ranked higher as a result. Another notable reason for query failure was that the user's remembered phrase did not exist in the documents. A common reason for failed queries was beyond the user's control: the user remembered the quote correctly or fairly accurately, but the document did not contain the text. Many of these conditions are unlikely to be improved, resulting in a minimum expected continued failure rate of about 10%. It is unsurprising that search engines performed poorly compared to the proximity-based methods. Approximate phrase matching is based on a high probability that many of the terms provided will be present, combined with a low probability that the phrase provided will be exactly correct. The "any" and "all" searches are, respectively, too relaxed or too strict to adequately handle queries of this nature. The modified algorithms for our proximity-based method, PBP-len and PBP-tf, were based on the hypothesis that placing priority on certain matched terms would improve results. We supposed that finding a longer term or an infrequent term should be given more weight than a short or common term. In general, there was little or no improvement of the results by making the modifications. Only a few queries appeared to benefit in that the correct document was ranked higher or the number of other documents with the same score as the correct document was reduced. However, a similar number of queries had their results worsened by the modified methods. Therefore, we find that the modifications are unnecessary. Moreover, the PBP-tf algorithm cannot be used on a dynamic corpus even if it were found to improve results.

5.5 Proximity-based Algorithm Factors

When the proximity-based algorithms were developed, they were tested and modified based on initial observations made on a small number of queries over a small set of the movie script documents. While our hypotheses regarding the usefulness of the combination of factors proved to be correct, we return to the algorithms to examine the individual factors and determine how important each factor was on its own. The information gathered can be used to further modify and improve the method. In the movie script experiment, the relative performance of PBP-reg and its factors, as shown in Figure 4.15, is PBP-reg > PBP-cw > PBP-aw > PBP-a > PBP-c. In the song lyrics experiment, the relative performances do not have a strictly greater-than or less-than relationship, according to Figure 4.15. The relative performances could be roughly described as PBP-reg ≈ PBP-cw > PBP-a > PBP-aw > PBP-c, with the exception that PBP-c is actually greater than both PBP-a and PBP-aw for ranking the results within the top 20, 50, or 100. The PBP-a method was the highest performing method which did not incorporate a window, achieving a minimum accuracy of 29.41% and a maximum accuracy of 45.05% over the two data sets in ranking the correct document within the top 10. The outperformance of evaluating adjacencies over term counting was also shown in [30], except that all terms were required to be present, as opposed to rewarding the presence of terms or adjacencies as in this thesis. Simply counting term occurrences or adjacencies outperformed VSM and Google-any. The PBP-cw method was the higher performing of the two methods which incorporate windows, PBP-cw and PBP-aw, achieving a minimum accuracy of 63.74% and a maximum accuracy of 83.53%. These results indicate that the most important factor is incorporating a proximity window. The PBP-a method essentially did this by rewarding adjacent terms. Incorporating an actual proximity window in PBP-aw and PBP-cw resulted in a higher performance, and the highest was with PBP-cw. Thus, incorporating a window eliminates the need to consider adjacencies to achieve a high performance. However, the proximity-based method performing the highest on both data sets indicates that using adjacencies in combination with term counting within a window improves performance.

5.6 Users

The interface for the proximity-based method requires only a text box, and some sort of option box if the method is being made available alongside other methods in the search system. The form in which users express the existence of a gap need not be ellipses, but this form seemed intuitive to the users. Errors in the query formulations included typographical errors and an incorrect number of periods to form the ellipses. Suggested spellings could easily be provided, and dynamic automatic correction of the number of periods could be implemented as well. In order to correct the issue with ellipses, an alternative form of indicating gaps, such as one underscore for a small gap and two underscores for a large gap, could be used. An investigation into whether users would find this straightforward and simple to use would be needed. Despite the errors encountered, this form of query expression proved to be quite effective, as indicated by the high performance of the results for the proximity-based method. In addition to the high performance, users generally commented that the method was simple and straightforward to use. The gaps were not used very often, as they were not always required, but the users understood what needed to be done after a short initial explanation. Users also commented that, while they typically search for songs and movies by title, as they tend to remember the names, they would find this method useful for the occasional time that they cannot remember the title.

5.7 Summary

One of the major patterns in our observations was that, while an increased number of rare terms in the query may help, it is far more useful to simply count the number of matched terms, and so a term-weighting system is not useful for this task. Results were also greatly improved by the introduction of methods which incorporate structure and distances, such as term adjacencies and windows. One of the questions we aimed to answer in this work was which factors, considered among all methods examined in this thesis, were the most important for improving the results of approximate phrase matching. The patterns we observed coincide with our hypotheses that approximate phrase matching benefits more from methods that consider how similar two passages are in terms of common words, the location of words and the distances between words than it does from traditional search methods, such as the Vector Space Model, which tend to be more concerned with the characteristics of the terms provided. The factors incorporated in our proximity-based method are fairly simple. While we have been unable to find research which incorporates these factors in phrase matching tasks, it is highly likely that similar methods have been utilized in some sort of search task. The reason for the seeming lack of these factors in present methods used by search engines and database search systems is not clear. Perhaps it is the very simplicity of these factors that has caused them to be overlooked. The positive results found in this thesis will hopefully motivate future research incorporating these factors and methods similar to our proximity-based method into search systems. Even an uncommon problem in information retrieval ought to be researched to find the most effective method of handling the task. It will benefit users and can reveal new information relevant to other tasks. The simplicity and effectiveness of our proximity-based method makes it a very promising solution to the task of approximate phrase matching, and thus a significant contribution to information retrieval, specifically document retrieval.

Chapter 6

Conclusion

The simplest and most efficient method of handling phrase queries, exact phrase matching, is clearly insufficient for our task, where inaccuracies are probable. The failure rate on the movie script collection was 84.62% and the failure rate on the song lyrics collection was 47.06%. Users are prone to error when they do not have an exact quote to reference and are recalling a phrase from memory. As a result, users should have access to a method which allows them to search for the source document in a way which approximates a phrase. Keyword searching, as performed by the vector space model and the search engines tested in this study, has rather poor performance for the purpose of retrieving the document containing a phrase which may not be remembered correctly. The worst performing method was the vector space model, achieving accuracies of 3.30% and 0.00% on the movie script and song lyrics collections, respectively. This supports our hypothesis that the vector space model was likely to perform poorly, as it focuses more on the terms used and less on the structure of the terms, which is an intuitively incorrect approach for dealing with phrases. The highest performance seen among the keyword methods tested was 57.14% by Google-all, which also happened to be the highest performing search engine method tested on both document collections. While this search engine method improves upon the vector space model by a considerable amount, its performance is still quite low and not likely to be useful for users. The significant drawback of the search engines tested is their limited search options, which are often far too strict or far too relaxed to handle minor inaccuracies. The Boolean operators available would be an effective means of communicating fairly detailed information about a query when a user is capable of using these operators, but most often users are not, and even when they are capable they often do not wish to put much effort into query formulation.

The minimum improvement of the average PBP accuracies over the other tested methods was 24.79%. Our proximity-based method achieved an impressive performance and level of improvement over the other methods on both document collections. With average accuracies of 80.5%, 81.1% and 82.1% for PBP-reg, PBP-len and PBP-tf, respectively, the proximity-based method provides an effective means of approximating phrase searches. There was very little difference in performance between the proximity-based algorithms.

In examining the different factors incorporated in the proximity-based method, the proximity or window factor had the largest impact. Accounting for adjacencies only proved more useful than accounting for term occurrences only, which supports the hypothesis that incorporating structure into the search algorithm is, in general, useful for this task of approximate phrase matching. While counting term occurrences within a proximity window achieved a fairly high performance, combining term occurrences, adjacencies and proximity proved to be the more effective method. Counting term occurrences or adjacencies alone performs so poorly that neither is useful as a search algorithm on its own. The small percentage of queries which failed using the proximity-based method can be attributed mainly to significant user error and document inaccuracy. Based on the number of queries which failed for reasons beyond our control, we predict that our method may be unable to be improved to surpass an accuracy of approximately 90% in ranking the correct document within the top 10, so we set this as our goal for future experiments.

The datasets used in our experiments varied greatly in the number of documents and document length. The considerably shorter documents in the song lyrics collection, with an average document length of 216, appear to have helped the proximity-based method to improve upon the search engines and the vector space model to a larger degree than its improvement on the movie script collection, with an average document length of 23,766. The proximity-based method also appears to handle a significant increase in the size of a document collection, from 712 movie scripts to 685,079 song lyrics, much better than the other methods tested.

The common property of the datasets tested here is that the words are likely to be memorized, due to the repetitive nature of songs, some lines in movies being memorable due to humor or shock, or simply from repeated exposure in both instances. It is far less likely for a person to remember a large portion of a line from an academic paper, for example. Documents which discuss a certain topic are more likely to be remembered by concept and not by content. However, it would be interesting to see what sorts of queries users might have when trying to locate documents relevant to a topic, or specific documents on a topic, and how often, if at all, a proximity-based method may aid in their search.

In addition to the effectiveness of the proximity-based method, the increase in complexity of formulating our queries over that of forming keyword queries is very small, and the users who generated the queries commented that they found the method very simple and straightforward after one short explanation. The consensus was that, while these users generally searched for songs or movies by title, since they often remember the title, they would find this method useful for the times where they do not have that information. The gap operators were used sparingly, but were necessary for many of the queries in which they were used.

The most likely improvement to be made to these algorithms is that of distinguishing between documents which have the same score. The use of the length-based and frequency-based score values did not achieve this effect well enough. In future work, we will examine a new scoring system that awards value to pairs of terms correctly ordered relative to one another and to pairs of terms which meet proximity constraints. An example of a query which appears as though it would benefit from this approach is "yeah, I'm gonna have to get you to come in on sunday too", intended to match "Oh, oh, yea... forgot. I'm gonna also need you to come in Sunday too." Most of the terms in the query are very frequent terms that appear often near one another in a wide variety of structures.
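A sketch of such a pairwise scoring component is given below; the unit weights and the distance limit are hypothetical placeholders that would need to be tuned experimentally:

    def pairwise_score(query_terms, doc_terms, limit=5):
        # Index each document term by its positions.
        positions = {}
        for i, t in enumerate(doc_terms):
            positions.setdefault(t, []).append(i)
        score = 0
        for a, b in zip(query_terms, query_terms[1:]):
            if a in positions and b in positions:
                gaps = [j - i for i in positions[a]
                              for j in positions[b]]
                if any(g > 0 for g in gaps):           # order preserved
                    score += 1
                if any(0 < g <= limit for g in gaps):  # proximity met
                    score += 1
        return score

Documents that tie under the window score could then be separated by this secondary score, rewarding structures in which frequent terms appear in the same order and at similar distances as in the query.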

Our goal is to improve the robustness of the proximity-based method to achieve accuracy as close to 90% as possible and to ensure users are able to perform phrase searches on document collections, and potentially the web, without further increasing the complexity of query formulation. We plan to explore further modifications to the proximity-based method which will incorporate other phrase structure properties, such as the relative order of terms and the proximity of pairs of query terms to one another, in order to better separate documents with the same score and to further improve the rank of documents which are already ranked highly.

One weakness is that the queries were generated entirely by students, most of whom are in computer science or a related field. As a result, it is possible that the method by which queries are expressed is not as straightforward for the general public as it is for the users who contributed queries. In the future, we may explore the possibility of using the statistics on user error observed in the queries provided to enhance the search interface such that it makes suggestions to the user on how to improve their query. Lucas and Topi [23] found that there is a need for providing users with improved support in query formulation and the usage of operators. So, if the average user does find difficulty in using our method, we can likely resolve this with more thorough search guidance.

A second weakness is that the proximity-based method does not allow for preprocessing, such as the indexing done by search engines [14] in order to perform fast keyword searches. This implies that our proximity-based method will be less efficient than all of the other methods examined in this thesis. We plan to examine ways in which we can enable the proximity-based method to operate more efficiently than it would by processing every document in the collections considered and calculating the scores every time a query is issued. This may include using search engine queries to create a smaller set of potentially relevant documents which would then be handled by our method.
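One possible shape for this two-stage approach is sketched below. Here fetch_candidates is a hypothetical stand-in for whatever inverted index or search engine API would supply the reduced candidate set, and score is a proximity-based scoring function such as the one sketched earlier:

    def search(query_terms, fetch_candidates, score, k=10):
        # Stage 1: a cheap keyword pre-filter (e.g. an inverted
        # index or a search engine API) returns a small candidate set.
        candidates = fetch_candidates(query_terms)
        # Stage 2: the more expensive proximity-based score is
        # computed only on the candidates; the top-k are returned.
        ranked = sorted(candidates,
                        key=lambda doc: score(query_terms, doc),
                        reverse=True)
        return ranked[:k]

The overall cost is then dominated by the pre-filter, with the proximity scoring applied to a small fraction of the collection.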

A third weakness is the limitation on the window size, with only three ways to specify a general distance between terms: adjacent, small gap and large gap. Given the typical lengths of the queries submitted and the low number of queries which utilized the gaps, this weakness did not hinder the performance of our method in a noticeable way. However, it may be useful in the future to allow more categories of distances for the user to specify. For example, the small and large ellipses could be replaced by "_words_" and "_sentence_", respectively, and additional operators such as "_paragraph_" and "_page_" could be added. In addition to this, there could also be the option of submitting multiple phrases which could be located anywhere in the document, where the distance between these phrases would be of no consequence. The values given to the phrase gaps introduced in this thesis may not be optimal, so it may serve to improve results in the future if experiments were run to test different values over numerous document sets to fine-tune these values.
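As a sketch of how such operators might be interpreted, the query could be parsed into terms plus a maximum allowed distance for each gap; the operator names and distance limits below are illustrative placeholders, not values used in this thesis:

    GAP_LIMITS = {"_words_": 10, "_sentence_": 30,
                  "_paragraph_": 150, "_page_": 500}

    def parse_query(query):
        # Returns (terms, gaps): gaps[i] is the maximum allowed
        # distance between terms[i] and terms[i+1]; a value of 1
        # means the two terms must be adjacent.
        terms, gaps, pending = [], [], None
        for token in query.lower().split():
            if token in GAP_LIMITS:
                pending = GAP_LIMITS[token]
            else:
                if terms:
                    gaps.append(pending if pending is not None else 1)
                pending = None
                terms.append(token)
        return terms, gaps

    print(parse_query("come in _sentence_ sunday too"))
    # (['come', 'in', 'sunday', 'too'], [1, 30, 1])

The per-gap limits would then replace the single window size when checking whether a candidate span of the document satisfies the query's proximity constraints.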

Overall, our method of locating the target document of an approximate phrase query is very effective for this task and has the potential for improvement. We have shown that term occurrence counting, adjacency counting and proximity operators, which have been proven useful for very different search tasks in previous research, work very well on small and large document collections with documents of largely different sizes. The impressive accuracy rate and the simplicity of our proximity-based method indicate that it could make a promising addition to search engines for phrase-searching tasks, in addition to being provided as an option in search systems for static corpora.

Bibliography

[1] Ask.com search engine - better web search, http://www.ask.com, site: accessed 2007.

[2] G. Aston and L. Burnard. The BNC Handbook: Exploring the British National Corpus with SARA. Edinburgh University Press, 1998.

[3] R. Baeza-Yates, B. Ribeiro-Neto, et al. Modern Information Retrieval. Addison-Wesley, Harlow, England, 1999.

[4] D.C. Blair. Information retrieval and the philosophy of language. Annual Review of Information Science and Technology, 37(1):3-50, 2003.

[5] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7):107-117, 1998.

[6] S. Büttcher, C.L.A. Clarke, and B. Lushman. Term proximity scoring for ad-hoc retrieval on very large text collections. Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 621-622, 2006.

[7] C. Clarke, G. Cormack, G. Kemkes, M. Laszlo, T. Lynam, E. Terra, and P. Tilker. Statistical selection of exact answers (MultiText experiments for TREC 2002). Proc. of TREC, pages 823-831.

[8] N. Craswell, D. Hawking, R. Wilkinson, and M. Wu. Overview of the TREC 2003 Web Track. Proceedings of TREC, 2003, 2003.

[9] C.M. Eastman and B.J. Jansen. Coverage, relevance, and ranking: The impact of query operators on Web search engine results. ACM Transactions on Information Systems (TOIS), 21(4):383-411, 2003.

[10] W.N. Francis and H. Kucera. Brown corpus manual, 1979.

[11] Google, http://www.google.ca, site: accessed 2007.

[12] Google received 64 percent of U.S. searches in October. http://www.hitwise.com/press-center/hitwiseHS2004/google64ussearches.php, site: accessed 2007.

[13] M. Gordon and P. Pathak. Finding information on the World Wide Web: the retrieval effectiveness of search engines. Information Processing and Management, 35(2):141-180, 1999.


[14] A. Gulli and A. Signorini. The indexable web is more than 11.5 billion pages. International World Wide Web Conference, pages 902-903, 2005.

[15] Hard candy script. http://www.horrorlair.com/movies/hard-candy-script.html, site: accessed 2008.

[16] B.J. Jansen. The effect of query complexity on Web searching results. Information Research, 6(1):6-1, 2000.

[17] B.J. Jansen and A. Spink. How are we searching the World Wide Web? A comparison of nine search engine transaction logs. Information Processing and Management, 42(1):248-263, 2006.

[18] B.J. Jansen, A. Spink, J. Bateman, and T. Saracevic. Real life information retrieval: a study of user queries on the Web. ACM SIGIR Forum, 32(1):5-17, 1998.

[19] B.J. Jansen, A. Spink, and T. Saracevic. Failure analysis in query construction: data and analysis from a large sample of Web queries. Proceedings of the third ACM conference on Digital libraries, pages 289-290, 1998.

[20] E.M. Keen. Some aspects of proximity searching in text retrieval systems. Journal of Information Science, 18(2):89, 1992.

[21] L. Keily. Improving resource discovery on the Internet: the user perspective, pages 205-212, 1997.

[22] D.L. Lee, H. Chuang, and K. Seamons. Document ranking and the vector-space model. IEEE Software, 14(2):67-75, 1997.

[23] W. Lucas and H. Topi. Form and function: The impact of query term and operator usage on web search results. Journal of the American Society for Information Science and Technology, 53(2):95-108, 2002.

[24] Lyricsdownload.com. http://www.lyricsdownload.com/, site: accessed 2007.

[25] G. Mishne and M. de Rijke. Boosting web retrieval through query operations. Proceedings of ECIR 2005, 2005.

[26] S. Mitra, T. Acharya, and J. Luo. Data Mining: Multimedia, Soft Computing, and Bioinformatics. Journal of Electronic Imaging, 15:019901, 2006.

[27] Msn.com. http://www.msn.com, site: accessed 2007.

[28] Official google blog: Do you "google?". http://googleblog.blogspot.com/2006/10/do-you-google.html, site: accessed April 14, 2008.

[29] K. Patterson, C. Watters, and M. Shepherd. Document Retrieval Using Proximity-Based Phrase Searching. Hawaii International Conference on System Sciences, Proceedings of the 41st Annual, pages 137-137, 2008.

[30] J. Pickens and W.B. Croft. An exploratory analysis of phrases in text retrieval. Proceedings of RIAO 2000 Conference, Paris, pages 1179-1195, 2000.

[31] Proximity searching - Rhodes University Library. http://www.ru.ac.za/library/infolit/proximity.html, site: accessed May, 2007.

[32] V.V. Raghavan and S.K.M. Wong. A critical analysis of vector space model for information retrieval. Journal of the American Society for Information Science, 37(5):279-287, 1986.

[33] E.M. Rasmussen. Information Retrieval on the Web. Annual Review of Information Science and Technology, ASIS&T, 39, 2005.

[34] Y. Rasolofo and J. Savoy. Term proximity scoring for keyword-based retrieval systems. Proceedings of the 25th European Conference on IR Research (ECIR 2003), pages 207-218, 2003.

[35] K. Sadakane and H. Imai. Text Retrieval by using k-word Proximity Search. Database Applications in Non-Traditional Environments, pages 183-188, 1999.

[36] G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Co., New York, 1983.

[37] G. Salton, A. Wong, and C.S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613-620, 1975.

[38] M. Night Shyamalan. The lady in the water [motion picture]. Warner Bros. Pictures, 2006.

[39] Craig Silverstein, Hannes Marais, Monika Henzinger, and Michael Moricz. Analysis of a very large web search engine query log. SIGIR Forum, 33(1):6-12, 1999.

[40] The archive of misheard lyrics. http://www.kissthisguy.com, site: accessed April 14, 2008.

[41] The best page in the universe. http://maddox.xmission.com, site: accessed 2008.

[42] The internet movie script database (IMSDb). http://www.imsdb.com/, site: accessed 2007.

[43] Wikipedia. http://www.wikipedia.org, site: accessed 2008.

[44] Yahoo! http://www.yahoo.com, site: accessed 2007.

[45] K. Yang. Combining Text-, Link-, and Classification-based Retrieval Methods to Enhance Information Discovery on the Web. Ph.D. thesis, UNC, 2002.

Appendix A

Queries

Table A.1: Movie Script Collection Queries

Query | Movie | Intended Match
I love those moments ... I like to wave at them as they pass me by | pirates of the caribbean the curse of the black pearl | none
nice hat | pirates of the caribbean the curse of the black pearl | none
it was a gift ... keep it | the lord of the rings two towers | It was a gift. Keep it.
my friends you kneel to no one | the lord of the rings return of the king | My friends ... you bow to no-one.
It's the genie of the lamp | aladdin | none
why can't I have a normal boyfriend | as good as it gets | Why can't I have a normal boyfriend???
only steers and queers come from texas ... and you don't look like a cowboy to me ... what is your major malfunction | full metal jacket | Only steers and queers come from Texas, Private Cowboy! And you don't look much like a steer to me, so that kinda narrows it down! [gap] What is your major malfunction, numbnuts?
you know what the call a quarter-pounder in france | pulp fiction | Also, you know what they call a Quarter Pounder with Cheese in Paris?
royale with cheese | pulp fiction | Royale with Cheese.
english motherfucker! do you speak it! | pulp fiction | English-motherfucker-can-you-speak-it?
say what again I dare you | pulp fiction | C'mon, say "What" again! I dare ya
does marcellus Wallace look like a bitch | pulp fiction | Now describe to me what Marsellus Wallace looks like! [gap] does he look like a bitch?!

we should have shotguns for this | pulp fiction | We should have shotguns for this kind of deal.
bring out the gimp | pulp fiction | Bring out The Gimp.
it's not a motorcycle baby, it's a chopper | pulp fiction | Where did you get this motorcycle? [gap] It's a chopper, baby, hop on.
zed's dead baby, zed's dead | pulp fiction | Zed's dead, baby, Zed's dead.
did you see a sign on my lawn that said dead nigger storage | pulp fiction | When you drove in here, did you notice a sign out front that said, "Dead nigger storage?"
this is some gourmet shit | pulp fiction | Goddamn Jimmie, this is some serious gourmet shit.
I buy my coffee. I know how good it is. | pulp fiction | I'm the one who buys it, I know how fuckin' good it is.
the have a name for people like that. they call them bums. | pulp fiction | They got a word for 'em, they're called bums.
It's the one that says bad-ass motherfucker | pulp fiction | It's the one that says Bad Motherfucker on it.
Luke ... I am your father | star wars the empire strikes back | No. I am your father.
that's no moon! that's a space station | star wars a new hope | That's no moon! It's a space station.
Mos Eisley a wretched hive of scum and villainy | star wars a new hope | Mos Eisley Spaceport. You will never find a more wretched hive of scum and villainy.
help me obi-wan ... you're my only hope | star wars a new hope | Help me, Obi-Wan Kenobi. You're my only hope.
pew! pew! aargh! | star wars a new hope | none
37. My girlfriend suck 37 dicks. Hey where the fuck are you going | clerks | Thirty-seven?! (to CUSTOMER) My girlfriend sucked thirty-seven dicks!
I wasn't even supposed to come in today | clerks | I'm not even supposed to be here today.
Would you like to making fuck berserker | clerks | WOULD YOU LIKE SOME MAKING FUCK? BERSERKER!
make me a motherfucking profit | dogma | At least there I can get turned down while trying to make myself a profit.
leeloo dallas moolteepass | fifth element | none
yeah I'm gonna have to get you to come in on sunday too | office space | Oh, oh, yea I forgot. I'm gonna also need you to come in Sunday too.

I could burn the building down | office space | Ok. I'll set the building on fire.
Excuse me I believe that's my stapler | office space | Excuse me. I believe you have my stapler?
My name is Inigo Montoya. You killed my father. Prepare to die. | the princess bride | Hello, my name is Inigo Montoya. You killed my father. Prepare to die.
No no no ... dance ... sexy | true lies | Now dance for me. Go on.
The name's Constantine. John Constantine. Asshole | constantine | This is Constantine. John Constantine, asshole.
There is either do or don't do. There is no try | star wars the empire strikes back | No! Try not. Do. Or do not. There is no try.
Carpe diem seize the day boys | dead poet's society | Carpe Diem. Seize the day boys, make your lives extraordinary.
captain ship decloaking dead ahead shields delay that order mr worf tactical report ... 38 photon torpedos ... 27 phaser cannons captain, should we raise our shields no she is a predator | star trek nemesis | Should I raise shields? PICARD No! WORF Captain - ! PICARD (firm) Tactical analysis, Mister Worf. WORF (quickly analyzing tactical display) Fifty-two disruptor banks, twenty-seven photon torpedo bays, primary and secondary phased shields.
baby steps baby steps baby steps down the hallway baby steps to the elevator baby steps into the elevator daahh | what about bob | BOB (O.S.) Baby steps to the hall, [gap] Sorry. Baby steps. Baby steps... [gap] Baby step to the elevator. Baby step to the elevator, [gap] BOB Baby step to the elevator. Baby step to the elevator. The elevator doors close and it starts down. Bob screams.
Heil Hitler | indiana jones and the last crusade | HEIL HITLER!
This is my boomstick | army of darkness | This is my boomstick.
My name is ... you kill my father prepare to die | the princess bride | Hello, my name is Inigo Montoya. You killed my father. Prepare to die.
should have sent a poet | contact | I keep saying that but I can't... my mind can't... words... should've sent a poet.
only queer and steers come from texas and you don't look like a steer boy | full metal jacket | Only steers and queers come from Texas, Private Cowboy! And you don't look much like a steer to me, so that kinda narrows it down!
sorry babe looks like I let you down again, seems like I always do ... love you ... | crank | Hey doll. Looks like I let you down again. You were right about me funny, you really have time to reflect on things when you know you're going to die ... [gap] you were the greatest, baby.
how much did you take ... hard-on made of steel? ...check | crank | DOC MILES You shot the whole thing, didn't you? [gap] You got a steel hard on CHEV Let me check. Looks down. CHEV (CONT'D) Check.
why did you stop what so you can fall asleep like you always do? I don't think so | crank | CHEV (CONT'D) (flustered) What's the matter? VE (satisfied) So you can fall asleep like you always do? I don't think so.
shock me | crank | Now juice me.
I'm not a game programmer I'm a professional assassin. But I'm getting out for you | crank | I told you I was a video game programmer. That was a lie. Actually... [gap] I kill people. I'm a professional hitman.
this is the crazy Chinese shit they give to horse | crank | The shit they gave you ... it's the Chinese shit. There is no antidote. I wish there was something I could do. CHEV What, so that's it? CARLITO Honestly, you should be dead already. It's a miracle. CHEV A miracle. CARLITO We give that shit to horses ...
what kind of hippie are you ... a capitalist one | harold and kumar go to whitecastle | KUMAR Jesus, what the hell kind of hippie are you? HIPPIE ASSHOLE One who understands the concept of supply and demand, dude.
do you have to pee here ... what is this your bush? do you own the bush? fuck it. I don't want to get stabbed tonight | harold and kumar go to whitecastle | KUMAR Excuse me ... I'm sorry, I just have to ask you ... why are you peeing here? [gap] CREEPY GUY Is this like your special bush or something? KUMAR No, I just ... (beat) You know what? Forget about it. I'm not in the mood to get stabbed right now.
what are you doing ... Trimming my pubes look, it's a bonsai tree | harold and kumar go to whitecastle | HAROLD Kumar, what the hell are you doing! KUMAR I'm trimming my pubes. Kumar looks at himself in the mirror as he makes a couple more snips. On the floor, we see LARGE CLUMPS OF HAIR. HAROLD Why aren't you doing this in your room! KUMAR The mirror's in here, (re: his crotch) Hey, check it out! It looks like a Bonsai tree!
you're not a eunich are you | pirates of the carribean the curse of the black pearl | none
they trade you for a flat of heineken! at least it was imported beer | almost famous | none
Observe which of the following is real, santa claus the easter bunny, the sweet man-loving lesbian or the man hating ... the man-hating lesbian because none of the others fucking exist | chasing amy | BANKY Okay, now see this? This is a four way road, okay? [gap] BANKY V.O. And dead in the center, is a crisp, new, hundred dollar bill. Now at the end of each of the streets, are four people, okay? You following? Up here, we got a male-affectionate, easy-to-get-along-with, no political agenda lesbian. Okay? Now down here, we have a man-hating, angry-as-fuck, agenda-of-rage, bitter dyke. To this side, we got Santa Claus, right? And over to this side - the Easter Bunny, [gap] BANKY Which one's going to get to the hundred dollar bill first? [gap] HOLDEN (beat; then pissed) The man-hating dyke. BANKY Good. Why? HOLDEN I don't know. BANKY (wildly crossing out the other three) BECAUSE THESE OTHER THREE ARE FIGMENTS OF YOUR FUCKING IMAGINATION!
rollercoaster of love ... rollercoaster of love | beavis and butthead do america | none
we'll let you go when a monkey comes out of my ass | bruce almighty | Oh, yeah. I'll apologize... The day a monkey climbs out of my butt.
do you like apples ... ya sure ... well I got her number, how about those apples? | good will hunting | DO YOU LIKE APPLES?! CLARK Yeah? [gap] WILL WELL I GOT HER NUMBER! HOW DO YA LIKE THEM APPLES?!!
looks like somebody has a case of the mondays | office space | Uh oh. Sounds like somebody's got a case of the Mondays.
Laurance what would you do with a million dollars ... two chicks at once ... that's it? ... the kind of chicks that will double up on a dude like me, ya. I figure with a million dollars I could hook that up | office space | PETER That's a really good idea, (sits on the couch) Lawrence, what would you do if you had a million dollars? [gap] LAWRENCE I'll tell you what I'll do, man-Two chicks at the same time, [gap] PETER That's it? If you had a million dollars, that's what you'd do, two chicks at the same time? LAWRENCE Damn straight, man. I've always wanted to do that. I figure if I were a millionaire, I could hook that up. Chicks dig guys with money.
it's always the small things a decimal here or there .... This isn't a small thing | office space | Ok! Ok! I must have, I must have put a decimal point in the wrong place or something. Shit. I always do that. I always mess up some mundane detail.
they won't send you club fed they'll send you to pound me in the ass prison | office space | If we're caught while laundering money, we're not going to go to white-collar-resort-prison. No, no, no. We're gonna go to federal-reserve-pound-me-in-the-ass-prison.
I sentence you to five years in federal pound me in the ass prison | office space | I hereby sentence you, Michael Bolton and Samir Na... Ananajibad... to a term of no less than four years in federal-pound-me-in-the-ass-prison.
two squirrels were married and I used to be able to see out | office space | I used to be by the window, where I could see the squirrels and they were merry.

so you're stealing ... no over time we just take the parts nobody is going to miss, think of it like that tray of pennies at the store ... you're stealing from the handicapped kids? ... no, no, we're talking small bits over a very long period of time | office space | JOANNA So you're stealing. PETER Ah, no. No. You don't understand. It's, uh, very complicated. It's, uh, it's, it's aggregate so I'm talking about fractions of a cent that, uh, over time, they add up to a lot. JOANNA Ok. So you're gonna make a lot of money, right? PETER Yeah. JOANNA Ok. That's not yours? PETER Well, it, it becomes ours. JOANNA How's that not stealing? PETER I don't think, I don't think I'm explaining this very well. Um, this Seven Eleven, right? If you take a penny from the tray - JOANNA From the crippled children?! PETER No, that's the tray. I'm talking about the tray. The penny's for everybody.
I take the order from the customer and bring it to the engineers ... so you physically take the order from the customer and bring it to the engineers ... well no my secretary does that ... what exactly is it you say you do here? ... don't you people get it? I'm a people person, why can't you understand that? | office space | BOB SLYDELL So what you do is you take the specifications from the customers and you bring them down to the software engineers? TOM That, that's right, [gap] BOB SLYDELL You physically take the specs from the customer? TOM Well, no, my, my secretary does that, or, or the fax. BOB SLYDELL Ah. BOB PORTER Then you must physically bring them to the software people. TOM Well... no. Yeah, I mean, sometimes. BOB SLYDELL Well, what would you say you do here? TOM Well, look, I already told you. I deal with the goddamn customers so the engineers don't have to!! I have people skills!! I am good at dealing with people!!! Can't you understand that?!? WHAT THE HELL IS WRONG WITH YOU PEOPLE?!!!!!!!
as soon as you stop looking at the reason you become a numerologist | pi | But Max, as soon as you discard scientific rigor, you are no longer a mathematician. You become a numerologist.

Like the guy invented the pet rock, he got rich. ... the pet rock sucked ... I had an idea like that once it was a jump to conclusions mat. you'd jump to conclusion | office space | Like that guy that invented the pet rock. You see, that's what you have to do. You have to use your mind and come up with some really great idea like that and you never have to work again! MICHAEL I don't think the pet rock was really such a good idea. TOM The guy made a million dollars! Y'know I had an idea like that once. PETER Really? What was it, Tom? TOM Well, all right. It was a Jump to Conclusions-mat. You see, it would be this mat that you would put on the floor and it would have different conclusions written on it that you could...jump to.
not right now lumberg. I've got a meeting with the bobs ... alright then we'll get this cleared up for you peter | office space | PETER Not right now, Lumbergh. I'm, I'm kinda busy. In fact, I'm going to have to ask you to go ahead and just come back another time. I have a meeting with the Bobs in a couple of minutes. BILL Uh, I wasn't aware of a meeting with them. PETER Yeah, they called me at home. BILL That sounds good, Peter. Uh, and we'll go ahead and, uh, get this all fixed up for you later.
fuck | office space | What the fuck does that mean?!!
na ... na ... not going to work her any more | office space | BOB PORTER First, Mr. Samir Naga... Naga... BOB SLYDELL Naga... BOB PORTER Naga-worker here anyway!
we could probably get you a job at penetrode if you want peter ... naww I'm good | office space | PETER It's not too bad. Not too bad. How's Penetrode? MICHAEL Initrode. PETER Initrode. SAMIR It's work. PETER Yeah. Yeah. MICHAEL I could probably get you a job if you want. PETER No, thanks. I'm doing good here.
rule 1: you do not talk about fight club, rule 2: you do not talk about fight club, rule 3: you stop when the tap out go limp, pass out | fight club | The first rule of fight club is - you don't talk about fight club. The second rule of fight club is - you don't talk about fight club. The third rule of fight club is - when someone says "stop" or goes limp, the fight is over.
you stole my stapler ... I'll burn the building down | office space | And if, if they take my stapler, I will, I will set this building on fire.
nobody hangs up on me Oh ya - I cheated on you | office space | NO ONE HANGS UP ON ME. WE'RE THROUGH!!! AND HA- ONE MORE THING. I'VE BEEN CHEATING ON YOU!!!! (BEEP)
Um ya ... I'm gonna have to ask you to come in on Sunday we let a bunch of people go and we've got to play catch up | office space | Oh, oh, yea I forgot. I'm gonna also need you to come in Sunday too. We, uh, lost some people this week and we need to sorta catch up. Thanks.
thank you for calling initrode please hold. Thank you for calling initrode, please hold | office space | Corporate Counsels Payroll, Nina speaking. Just a moment.
do you want to express yourself ... yes ... well we like to encourage our employees to do more | office space | Look, we want you to express yourself, ok? If you think the bare minimum is enough, then ok. But some people choose to wear more and we encourage that, ok? You do want to express yourself, don't you?
we can't stop here ... this is bat country | fear and loathing in las vegas | We can't stop here - this is bat country!
a whole myriad of uppers and downers ... the tendency is to push as far as you can | fear and loathing in las vegas | DUKE (V/O) We had two bags of grass, seventy-five pellets of mescaline, five sheets of high powered blotter acid, a salt shaker half full of cocaine, a whole galaxy of multi-colored uppers, downers, screamers, laughers... Also a quart of tequila, a quart of rum, a case of beer, a pint of raw ether and two dozen amyls. [gap] DUKE (V/O) Not that we needed all that for the trip, but once you get locked into a serious drug collection, the tendency is to push it as far as you can.
look what god did to our cocaine ... god didn't do it you did it narc! | fear and loathing in las vegas | GONZO Oh, Jesus! Did you see what god just did to us? DUKE God didn't do that! You did it! You're a fucking narcotics agent, that was our cocaine, you pig!
tell me about the fucking gold shoes | fear and loathing in las vegas | But what about our room? And the golf shoes?
just admiring the shape of your skull | fear and loathing in las vegas | He's admiring the shape of your skull.
there is an art to police you have to give him a chase a thrill you can't stop immediately you must make him want it | fear and loathing in las vegas | Few people understand the psychology of dealing with a Highway Traffic Cop. Your normal speeder will panic and immediately pull over to the side. This is wrong. DUKE floors the gas pedal. DUKE (V/O) It arouses contempt in the cop heart. THE SPEEDOMETER CLIMBS STEADILY. DUKE (V/O) Make the bastard chase you. He will follow. But he won't know what to make of your blinker signal that says you're about to turn right.
there is nothing more terrible than a man in the grips of another binge | fear and loathing in las vegas | There is nothing in the world more helpless and irresponsible and depraved than a man in the depths of an ether binge.
dr ... is studying ... studying ... little green men | contact | And Dr. Arroway here will be spending her precious time listening for little green men. All yours, guys.
you better hope there is some fucking thorizine in there ... you want me to throw the radio in when the white rabbit peaks | fear and loathing in las vegas | You better hope there's some Thorazine in that bag, because if there's not, you're in bad trouble. DUKE Okay. You're right. This is probably the only solution. (holds the PLUGGED IN TAPE/RADIO over the tub) Let me make sure I have it all lined up. You want me to throw this thing into the tub when "WHITE RABBIT" peaks. Is that it?
as your attorney I advise you to get a foot car and some hawaiian shirts | fear and loathing in las vegas | As your attorney I must advise you that you'll need a very fast car with no top and after that, the cocaine. And then the tape recorder, for special music, and some Acapulco shirts...

Table A.2: Lyrics Collection Queries

Query | Song | Intended Match
like a dog in heat, a freak without warning? | live crew me so horny | I'm like a dog in heat, a freak without warning.
what i've done ... what I've become | linkin park figure 09 | Youve become a part of me ... Myself from what Ive done
Here I am in silence, Looking round without a clue | information society whats on your mind | Here I am in silence Looking round without a clue
Can I have another piece of chocolate cake, | crowded house chocolate cake | Can I have another piece of chocolate cake
tammy baker, boy did she loose some weight! | crowded house chocolate cake | Tammy Baker, Tammy Baker ... Kathy Straker, boy could she lose some weight
nothing ever grows in this rotten old hole | meat loaf bat out of hell | Nothin' ever grows in this rottin' old hole
any wharhowl must be laughing in his grave | crowded house chocolate cake | Andy Warhol must be laughing in his grave
here comes ms. hairy legs! | crowded house chocolate cake | And here comes Mrs. Hairy Legs
I saw elvis presley walking out of a 7-11 | crowded house chocolate cake | I saw Elvis Presley walk out of a Seven Eleven
she keeps ... in a pretty ... let them eat | queen killer queen | She keeps Moet et Chandon In her pretty cabinet 'Let them eat cake' she says
al we've got is this moment | jacynthe need you tonight | All we've got is this moment
I know that I can love you much better than this. | sarah mclachlan full of grace | I know I could love you much better than this
hit me baby one more time! | britney spears | hit me baby one more time
this is ground control to major tom | david bowie space oddity | HIT ME BABY ONE MORE TIME!
you're circuit's dead is there anything wrong? | david bowie space oddity | Your circuit's dead, there's something wrong
can you hear me major tom? | david bowie space oddity | Can you hear me, Major Tom?
put your space face close to mine, doll | david bowie moonage dream | Press your space face close to mine, love
our love is like the flowers | new order the village | Oh, our love is like the flowers

put your raygun to my head | david bowie moonage dream | Put your ray gun to my head
keep your electric eye on me, babe | david bowie moonage dream | Keep your 'lectric eye on me babe
freak out in a moon age daydream oh yeah | david bowie moonage dream | Freak out in a moonage daydream, oh! yeah!
I come from downtown | tragically hip grace too | I come from downtown, born and ready for you.
born waiting for you | tragically hip grace too | I come from downtown, born and ready for you.
she was the best damn woman that i'd ever seen | acdc shook me all night long | She was the best damn woman that I ever seen,
he would get down on his hands and knees | veggie tales his cheeseburger | He would get down on his hands and knees,
I want to hold you till the fear in me subsides | dan hill sometimes when we touch | I wanna hold you 'til the fear in me subsides
never knowing who to cling to when the rain set in | elton john candle in the wind | Never fading with the sunset When the rain set in
I'm henry the eigth I am | hermans hermits im henry the eight i am | I'm Henry the eighth I am
once i was the king of spain | moxy fruvous king of spain | Once I was the King of Spain, now I eat humble pie
I've got problems | ice t 99 problems | I've got 99 problems and a bitch ain't one!
PROFESSIONAL KILLER! | kmfdm professional killer | Professional killer
A skilled criminal mind | kmfdm professional killer | A skilled criminal mind
in the age of superboredom, rape, and mediocrity | kmfdm megalomaniac | in the age of super boredom hype and mediocrity
celebrate relentlessness meanace to society | kmfdm megalomaniac | celebrate relentlessness menace to society
I want to feel you from the inside | nine inch nails closer | I want to feel you from the inside
It's a man eating god creator | kmfdm virus | IT'S A MAN-EATING GOD-CREATOR
god bless the telegraph | omd telegraph | God's got a telegraph on his side.
when I say innocent I should say naive | depeche mode lie to me | And when I say innocent I should say naive

I met a guy in a rover | the pixies trompe le monde | i met a guy in a rover
all right when I'm on the stage tonight | abba super trouper | Suddenly I feel all right (And suddenly it's gonna be) And it's gonna be so different When I'm on the stage tonight
can't remember if I cried the day the music died | don mclean american pie | I can't remember if I cried When I read about his widowed bride But something touched me deep inside The day the music died
I am un chien andalusia | the pixies debaser | don't know about you but i am un chien andalusia
it's more fun to compute | kraftwerk its more fun to compute | It's more fun to compute
hips like Cinderella | the pixies tame | Hips like Cinderella
we know major tom's a junkie | david bowie ashes to ashes | We know Major Tom's a junkie
the only boy... was a son of a preacher man | dusty springfield son of a preacher man | The only boy who could ever teach me Was the son of a preacher man
I wish they could all be California girls | david lee roth California girls | I wish they all could be California girls.
I don't know, but I've been told, you never grow old. | modest mouse i came as a rat | I don't know but I've been told You'll never die and you never grow old
They're trying to make us scream and cancelling the dream | hawkwind lost johnny | They're trying to make us scream By sticking thumb tacks in her flesh And cancelling the dream
never a frown ... golden brown | the stranglers golden brown | Never a frown With golden brown
you say it's ... you need that you want to | pop will eat itself wise up sucker | You say it's love that you need, But it's war that you've got... That you want to live your life and "to have"
can I buy another cheap picasso fake | crowded house chocolate cake | Can I buy another cheap Picasso fake
disperate in the pursuit of cool. | nutopia jack off jill | desperate in the pursuit of cool
here eye yam... .rawk you like a hurrrrwicane | scorpions rock you like a hurricane | Here I am, rock you like a hurricane
hell's bells! | acdc hells bells | Hells bells
If you're into evil, you're a friend of mine! | acdc hells bells | If you're into evil you're a friend of mine

I want my MTV! | dire straits money for nothing | You play the guitar on the MTV
Look at them yoyo's , that the way you do it. | dire straits money for nothing | Now look at them yo-yo's that's the way you do it
Money for nothing, cheques for free | dire straits money for nothing | Money for nothin' and chicks for free
wild saweet cool. | the crystal method | wild sweet and cool
and we run, through the moonlight, far beyond | the knife girls night out | and we ran through the moonlight far beyond
my heart beats ... I can hardly speak | doris day cheek to cheek | And my heart beats so that I can hardly speak.
you and I in a little toy shop | nena 99 red balloons | You and I in a little toy shop
honey put on that party dress | tom petty mary janes last dance | Honey, put on that party dress.
NEVER SURRENDER BABY! | corey hart never surrender | YOU CAN NEVER SURRENDER
I'VE BEEN DOWNHEARTED BABY | primitive radio gods standing outside a broken phonebooth | I been downhearted baby
I WEAR MY SUNGLASSES AT NIGHT SO I CAN SO I CAN | corey hart sunglasses at night | And I wear my sunglasses at night So I can, So I can
it makes me feel so fine I can't control my brain | island in the sun | And it makes me feel so fine I cant control my brain
I shot the sherrif, but I did not shoot the deputy | eric clapton i shot the sheriff | I shot the sheriff, but I did not shoot the deputy.
and I would run away... run away with you | the corrs runaway | and i would run away i would run away, yeah yeah i would run away i would run away with, you
And when I'm riding on that wing | yello swing | And when I'm riding on that wing
on the night ... time | loreena mckennit mummers dance | We've been rambling all the night And some time of this day
never let go till we're one | celine dion my heart will go on | And never let go till we're one
I never found the words ... you're the one | danielle white i never had a dream come true | I never found the words to say. You're the one I think about each day.
here I am ... hurricane | scorpion rock you like a hurricane | Here I am, rock you like a hurricane
you come around my london ... bridge wanna go down | fergie london bridge | How come every time you come around My London london bridge wanna go down like london london london
set an open course for the virgin sea | styx come sail away | I'm sailing away set an open course for the virgin sea
minutes seem like hours you know it takes my breath away | hanson hansen with your love | 'Cause when the minutes seem like hours and the hours seem like days Then a week goes by you know it takes my breath away
WAAAAAAAAAAY OUT IN THE WATER, SEE IT SWIMMING | pixies where is my mind | Way out in the water See it swimmin'?
one down, one to go | yes leave it | One down one to go
you couldn't take it on the tight rope, no you had to take it on the side | kings of leon red morning light | You couldn't take it on the tight rope No you had to take it on the side
Walk like an egyptian! | the bangles walk like an egyptian | Walk like an Egyptian
make it last forever friendship never ends | spice girls wannabe | Make it last forever friendship never ends,
sheriff don't like it | the refreshments preacher's daughter | Sheriff don't like it
people always ask me when I'll grow up to understand why the girls I knew at school were already pushing prams | billy bragg new england | People ask when will you grow up to be a man But all the girls I loved at school are already pushing prams