Project Report
Comparative Analysis of Data Structures for Inverted File Indexing in Web Search Engines
Ingrid Biswas, Vikram Phadke
CSE 598: Design and Analysis of Algorithms Project
Computer Science & Engineering Department
Arizona State University

ABSTRACT
1 INTRODUCTION
2 BACKGROUND: INFORMATION RETRIEVAL SYSTEMS
3 WEB SEARCH ENGINE ARCHITECTURE
3.1 Crawler
3.2 Repository
3.3 Parser
3.4 Indexer
3.5 Page ranking module and Query Engine
4 THE GOOGLE SEARCH ENGINE
5 TEXT INDEXING AND RETRIEVAL
5.1 Signature files
5.2 Vector space models
5.3 Latent semantic indexing (LSI)
5.4 Inverted File Indexing
5.5 Inverted File Compression
5.6 Representing and Accessing Lexicons
6 IMPLEMENTATION
7 ANALYSIS AND RESULTS
7.1 Inverted File Indexing Using Sorted Array
7.2 Inverted File Indexing Using Hash Table
7.3 Inverted File Indexing Using BTrees
7.4 Comparative analysis of all three data structures
7.5 Search and retrieval efficiency of the data structures
7.6 BTrees and External memory
8 FUTURE RESEARCH DIRECTIONS
9 CONCLUSION
10 REFERENCES

ABSTRACT
Today's search engines serve as portals to the millions of web pages that form the World Wide Web (WWW). They are probably the most popular examples of information retrieval tools. They contain several major components that interact with one another, namely the Crawler, Storage module, Parser, Indexer, Query Processor and Ranking module. Efficient algorithms and data structures can make the difference between an average and an exceptional search engine. Search engines today have to index millions of pages. Our work studies text indexing in the context of web search engines. In particular, the inverted file indexing algorithm is studied in detail. Different data structures are compared in terms of the time required to create the index, the time required to query the index, and the space footprint.
1 INTRODUCTION
Search engines are extremely useful information retrieval tools. They are used for just about everything, from shopping for electronics to looking for research papers. With the size of the WWW growing rapidly, search engine technology faces increasing challenges. Our work had the following objectives: (1) gain an in-depth understanding of search engine technology; (2) look at search engines from the perspective of algorithms and data structures; (3) study the different modules of search engines in detail, analyzing their algorithms and data structures.
We focus on the indexing module of search engines and analyze the inverted file indexing algorithm. Different kinds of data structures, such as sorted arrays, tries, Btrees and hash tables, can be used to implement the index. Issues such as the time required to create the index, the space footprint of the index, and the time required for retrieval arise when choosing an efficient data structure for the indexing algorithm. Our work focuses on comparing data structures for the inverted file indexing algorithm in terms of the time required to create the index.
An outline of this report follows. Section 2 provides some background on information retrieval techniques. Section 3 discusses web search engines and their various modules. Section 4 describes in detail the working of the Google search engine. Section 5 describes the various algorithms used for text indexing, in particular the inverted file indexing algorithm and the data structures that can be used to store the index. Section 6 describes the design and implementation of the "evaluation environment" that was used to compare the performance of the inverted file indexing algorithm when different data structures are used. Section 7 explains the results of the experiments. Section 8 outlines future research directions based on the experience gained in this work.
2 BACKGROUND: INFORMATION RETRIEVAL SYSTEMS
Information retrieval is a general term for all the activities that enable us to choose from a given collection of documents. These could be documents that belong to a particular domain of interest or a particular topic. The activities we are concerned with are those that let us automatically select the documents that are probably relevant to the initial information need. The main criterion for automatic information retrieval is that the collection of documents is available in digital form. In traditional IR, the collection is a set of documents that has been put together because it relates to a specific context of interest for its intended users. An IR collection is thus a set of documents that have certain properties or features in common. These features are used to cluster similar documents, enabling faster retrieval of documents pertaining to the user query.
It is possible to use a traditional IR system and its documents collection in a web based IR system. But there are issues that need to be looked into in this aspect. The IR system needs to be made available to the end user through a program that connects the IR system sitting on the web server to a Web page that acts as an interface between the user and the IR system.
Retrieving information from the Internet is a common practice for Internet users. However, the size and heterogeneity of the web make it very challenging and reduce the effectiveness of information retrieval techniques designed for traditional data sources. Many software tools are available these days for web information retrieval, such as search engines (Google, AltaVista), hierarchical directories (Yahoo) and many other software agents.
Proper tools for accessing documents on the Internet became available to web users during 1994. Before that year, the available tools indexed and managed only the title, the URL, and some small parts of web pages [Maud98]. Since then there have been so many advances in this field that it can be regarded as a major event in the history of information retrieval technology. WebCrawler, developed at the University of Washington (USA), was the first tool that allowed the user to search the full text of entire web documents; it became available in April 1994 [Maud98]. Lycos, another web search engine developed at Carnegie Mellon University (USA) [Herr99], followed in July 1994. So we can say that from 1994 on it has been possible to have web tools with effective IR functionality. Since then, IR for the web has flourished, with innovative and better tools for effective and faster information retrieval. AltaVista entered the scene in 1995 with a number of innovative features, and in the following years many other search tools were made available.
3 WEB SEARCH ENGINE ARCHITECTURE
3.1 Crawler
The crawler module retrieves pages from the Web. It typically starts with an initial set of URLs, which is fed into the crawler as a queue. The crawler then takes one URL at a time from this queue. There are different ways to choose which URL to visit next, namely depth-first, breadth-first or random, depending on the implementation of the crawler. The crawler downloads the web page, extracts any URLs it contains, and adds them to the same queue. This continues until the crawler decides to stop, or until it has visited all the URLs in its queue.
Several issues regarding the crawler's behavior need to be considered, most of them stemming from the enormous size of the Internet. It is impossible for the crawler to download all pages on the Web; even the most comprehensive search engine can index only a small fraction of the entire Internet. It is therefore necessary for the crawler to prioritize URLs so that it visits "important" pages first. This ensures that the part of the Web visited by the crawler is more meaningful.
The main steps that the crawler takes can be summarized as follows. First, it is fed a URL or a set of URLs. The crawler picks a URL from this queue and fetches the web page at that URL. It then parses the page and extracts links to other URLs. It filters out unwanted links and links that it has already visited, and adds the remaining URLs to the queue. This is the basic working of all crawlers. The main difference between crawlers lies in the algorithm they use for choosing the next URL. Some crawlers use simple policies such as random, FIFO or LIFO; others use a priority algorithm such as the one in [Kwon00].
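The basic crawl loop described above can be sketched as follows. This is only an illustration under stated assumptions: the "web" is a hypothetical in-memory dictionary (`TOY_WEB`), `fetch_links` stands in for downloading and parsing a real page, and the FIFO policy is just one of the URL-selection policies mentioned in the text.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of URLs linked from that page.
TOY_WEB = {
    "a.example/": ["a.example/x", "b.example/"],
    "a.example/x": ["a.example/", "c.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}

def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return TOY_WEB.get(url, [])

def crawl(seed_urls, max_pages=100):
    """Visit pages in FIFO (breadth-first) order, skipping URLs already seen."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()          # FIFO; queue.pop() would give LIFO
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:       # filter links already queued or visited
                seen.add(link)
                queue.append(link)
    return visited

order = crawl(["a.example/"])
```

A priority-based crawler would replace the deque with a priority queue keyed on some importance estimate; the rest of the loop would stay the same.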
After the crawler has downloaded a number of pages, it sends them to the repository module to be stored. It then needs to keep the stored collection of web pages fresh. To do so, the crawler revisits the same URLs in order to detect changes in the downloaded pages and refresh the collection. Because web pages change at very different rates [Cho00], and because of the enormous size of the web, the crawler cannot return to all pages quickly enough to refresh them. Hence it needs to decide which pages to revisit and which to skip. This decision significantly impacts the "freshness" of the downloaded documents. For example, if a certain page changes rarely, the crawler may choose not to revisit it very often, so that it can visit more of the pages that change frequently.
[Kwon00] gives a prediction algorithm that estimates when a particular web page will be updated, helping the crawler decide when to visit it. The update frequency of each page is calculated from the following factors. First, LA(P), the local average of page P, is the average update frequency of the pages in the proximity of P whose frequencies are close to one another within a certain threshold. Second, HA(P), the history average of P, is the average frequency calculated from the page's modification history. Third, the tolerance of the page defines how close the page is to other pages; this value is used in calculating LA(P). The formula for the update frequency of a given page P, written FR(P), is given below.
FR(P)=HA(P)*(1-LW(n)) + LA(P)*LW(n)
where LW(n) is a weight factor associated with the local average LA(P) and n is the number of history records. The algorithm makes two simple but useful assumptions. First, recent history is much more important than old history. Second, the history data of a page is more trustworthy than locality data, provided that there are enough history records. The equations for the history average and the local weight are defined based on these two assumptions.
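As a small worked example of the formula for FR(P), the sketch below evaluates it for made-up inputs. The definition of LW(n) is not reproduced in this report, so the weight is passed in directly; the numeric values and the restriction of the weight to the range [0, 1] are assumptions made for illustration only.

```python
def update_frequency(history_avg, local_avg, local_weight):
    """FR(P) = HA(P) * (1 - LW(n)) + LA(P) * LW(n)."""
    # Assumption: the local weight acts as a blending factor between 0 and 1.
    if not 0.0 <= local_weight <= 1.0:
        raise ValueError("local_weight must lie in [0, 1]")
    return history_avg * (1.0 - local_weight) + local_avg * local_weight

# Hypothetical values: the page changed 2 times per week historically, its
# neighbours change 4 times per week, and the local average gets weight 0.25.
fr = update_frequency(history_avg=2.0, local_avg=4.0, local_weight=0.25)
# fr == 2.0 * 0.75 + 4.0 * 0.25 == 2.5
```

With a small local weight the estimate stays close to the page's own history, which matches the second assumption stated above.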
Due to the enormous size of the Web, crawlers often run on multiple machines and download pages in parallel [Cho00]. This parallel processing is needed so that the crawler can download a substantial number of pages in a reasonable amount of time. These parallel crawlers need to be coordinated so that they do not visit the same URL more than once.
3.2 Repository
The page repository is a scalable storage system for managing large collections of web pages. It needs to perform two main functions. First, it must provide an interface for the crawler to store the web pages it has crawled. Second, it must provide an efficient access API that the indexer module can use to retrieve the pages. There are a few challenges that the storage module needs to address. Because of the large volume of data involved, it needs to be scalable and distributed, so that the data it stores can be spread over a network of servers. The repository must also support two access modes, namely random access and streaming access.
Random access is used to quickly retrieve a specific web page, given the page's unique identifier. The query engine module uses random access to serve web pages to the end user depending on their query string. Streaming access is used to receive the entire collection, or a significant subset, as a stream of pages. The indexer module uses streaming access to process and analyze pages in bulk. The repository also needs to handle updates when newer versions of web pages arrive, and it needs to be able to identify pages that are obsolete (deleted from their websites). When web pages are removed from their websites, the repository is not notified. It therefore needs a mechanism to identify and remove obsolete pages from its storage.
3.3 Parser
The parser module sits between the repository and the indexer. The indexer uses it to extract web pages from the repository and strip their HTML tags. The parser then takes the page content, i.e. the web page without any tags, and parses it again to remove any Stop List words. Stop List words are words that occur very frequently and do not help in any way to differentiate between documents; in other words, they appear in almost all documents. Examples of Stop List words are a, and, the, if and how. The parser then extracts the keywords from the remaining page content and creates a forward index for each page. A forward index is a structure that stores a list of all the keywords that appear in the web page along with the occurrences of each keyword. The indexer module takes this output and uses it to index the text.
3.4 Indexer
The indexer module builds a variety of indexes on the pages in the repository. It takes the forward index structure built by the parser module and creates an inverted index structure. An inverted index contains a list of all keywords and, for each keyword, the list of URLs in which that keyword appears. The inverted index structure is indexed on the keywords. The indexer module creates two main indexes: a Text Index to index all the keywords and a Link Index to index all the links on the web pages.
Text-based retrieval, namely searching for pages containing some keywords, is the main method for identifying pages relevant to a query. Various methods have been used to support text-based retrieval over document collections. Examples include suffix arrays [Manb90], inverted files or inverted indexes [Salt89, Witt94], and signature files [Falo84]. Inverted indexes have traditionally been the index structure of choice on the Web. They are discussed in detail in Section 5.
The whole Web is modeled as a graph with nodes and edges [Brod00]. Each node in the graph is a web page, and a directed edge from node A to node B represents a hypertext link in page A that points to page B. A Link Index is a subset of this graph that contains the web pages (nodes) that have been visited and the links (edges) that have been found on them. The most common structural information used by search algorithms [Brin98] is neighborhood information: for a given page P, the outward links are the set of pages pointed to by P, and the incoming links are the set of pages pointing to P. Neighborhood information of the original graph and its subgraphs can be easily retrieved using adjacency list representations [Aho83] of the graph. The information stored in these adjacency lists can be used to extract other structural properties of the Web graph.
For example, if we need to retrieve pages that are related to a given page, the notion of sibling pages is often used. Information about siblings can be easily derived from the adjacency list structures described above. Small graphs of hundreds or even thousands of nodes can be efficiently represented by any one of a variety of well-known data structures [Aho83]. The biggest challenge, however, is to do the same for a graph with several million nodes and edges. The Connectivity Server of the AltaVista search engine, which delivers linkage information for all pages retrieved and indexed, is described in [Bhar98]. Even though link-based techniques are used to enhance the quality and relevance of search results, text-based structures are the most important ones used.
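A Link Index held as adjacency lists can be sketched as follows. The page names are hypothetical, and the report does not define "sibling" precisely, so the sketch assumes that two pages are siblings when at least one page links to both of them.

```python
from collections import defaultdict

def build_link_index(edges):
    """Return (outward, incoming) adjacency lists for directed edges (A, B)."""
    outward = defaultdict(set)
    incoming = defaultdict(set)
    for src, dst in edges:
        outward[src].add(dst)      # pages pointed to by src
        incoming[dst].add(src)     # pages pointing to dst
    return outward, incoming

def siblings(page, outward, incoming):
    """Pages that share at least one parent with `page` (assumed definition)."""
    result = set()
    for parent in incoming.get(page, ()):
        result |= outward.get(parent, set())
    result.discard(page)
    return result

# Hypothetical link structure: A -> B, A -> C, D -> C, D -> E.
edges = [("A", "B"), ("A", "C"), ("D", "C"), ("D", "E")]
outward, incoming = build_link_index(edges)
related = siblings("C", outward, incoming)   # pages B and E
```

Keeping both directions costs twice the storage of a single adjacency list, but it makes outward and incoming neighborhood lookups equally cheap.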
3.5 Page ranking module and Query Engine
The Query Engine takes the query string from the user, containing the terms to search for, and retrieves pages that are likely to be relevant to the query. The relevant pages that are retrieved then need to be ranked. Traditional information retrieval (IR) techniques do not rank web query results effectively, for the following reasons. First, the Web is very large and varies greatly in the content, amount and quality of the information in its pages. Hence many pages that contain the search terms may not be relevant to the user or may be of poor quality. Second, most web pages are not very self-descriptive, so traditional IR techniques that examine the contents of a page do not work very well. An often-cited example of this issue is the search for "search engines" [Klei99]: the homepages of most of the important search engines do not contain the text "search engine". Spamming is also a big issue when ranking pages. Web developers add misleading terms to their pages so that search engines will rank them higher. This is another reason why the content of pages alone cannot be used to rank them.
As mentioned earlier, the web can be viewed as a graph. The information carried by the link structure can be used in ranking pages. For example, a link to page B in web page A implies that page A is recommending page B. This recommendation can be used to assign an importance to a web page based on how many pages refer to it. Several algorithms have been proposed that make use of this link structure. Because they are based not only on the content of a page but also on the link structure, they are generally better than traditional IR algorithms. Spamming has reached even this aspect of the web, with web developers adding more links to particular web pages.
The advantage, however, is that spammers are not able to influence the link structure at a global level. Hence link analysis algorithms working at a global level are relatively robust against spamming. Page and Brin describe a global ranking scheme, called PageRank, in [Page98] that tries to capture the notion of the "importance" of a page.
The simplest way to define the rank of a page is by the number of pages that link to it; in other words, a page is more important than another page if it has more incoming links. The rank of a web page A is then the number of pages in the Web that point to A, and it can be used to rank the results of a search query. This is known as citation ranking. It does not work very well against spamming, as it is very easy to artificially create a huge number of pages that point to the desired page.
The PageRank algorithm extends the basic citation-ranking algorithm by taking into consideration how important the pages are that point to a given page. Thus page A gains more importance from a link on an important page than from a link on an unimportant one. The definition of PageRank is recursive: the importance of a page both depends on and influences the importance of other pages. A simple definition of PageRank that captures this intuition is given below. Let the pages on the Web be denoted 1, 2, ..., m. forward(i) denotes the number of outgoing links (forward links) from page i. back(i) denotes the set of all pages that contain a link to page i (back links). We assume that every page can be reached from any given page, i.e., that the web forms a strongly connected graph. A simple formula for the PageRank of page i, denoted rank(i), is given by
$$\mathrm{rank}(i) = \sum_{j \in \mathrm{back}(i)} \frac{\mathrm{rank}(j)}{\mathrm{forward}(j)}$$
The division by forward(j) captures the intuition that pages which point to page i distribute their rank boost evenly among all of the pages they point to.
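Because the definition is recursive, one common way to compute the ranks is to start from a uniform guess and apply the formula repeatedly until the values settle. The sketch below does this for a hypothetical three-page web. The fixed iteration count and the example graph are assumptions, and the code relies on the strong-connectivity assumption stated above, so every page has at least one outgoing link.

```python
def simple_pagerank(links, iterations=100):
    """Iterate rank(i) = sum over j in back(i) of rank(j) / forward(j).

    `links` maps each page to the list of pages it points to. The graph is
    assumed strongly connected, as in the text, so every page has out-links.
    """
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}   # start from a uniform guess
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for j, targets in links.items():
            share = rank[j] / len(targets)        # rank(j) / forward(j)
            for i in targets:
                new_rank[i] += share              # j is in back(i)
        rank = new_rank
    return rank

# Hypothetical three-page web: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1.
ranks = simple_pagerank({1: [2, 3], 2: [3], 3: [1]})
```

For this graph the ranks settle at about 0.4 for page 1, 0.2 for page 2 and 0.4 for page 3: page 3 passes all of its rank to page 1, while page 1 splits its rank between pages 2 and 3.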
4 THE GOOGLE SEARCH ENGINE
In this section we look at the architecture and working of a very popular search engine, Google. Most of Google is implemented in C and C++ for efficiency, and it can run on Linux and Solaris servers. Google uses several distributed servers for web crawling, i.e. finding URLs and downloading web pages from the Internet. This allows parallel processing, which is needed because there are millions of web pages on the Internet. At the start of each run, the URL Server has a list of URLs that need to be crawled and sends lists of these URLs to the crawlers. The crawlers use these URLs as starting points from which to fetch further URLs from the web pages. The fetched web pages are sent to the store server, where they are compressed and passed to the repository for storage. Crawling is an ongoing process: the crawler keeps returning to the same URLs to see whether the web pages have been updated since they were last fetched. If a page has been updated, the crawler fetches the new version and sends it to the store server. Every web page is given a Document ID number, which is assigned when its URL is parsed out of a web page.
The next step is to index and sort the web pages, which is done by the indexer and the sorter. The indexer takes the web pages from the repository, uncompresses them and parses them. Each web page is converted to a forward index structure that contains all the words in the page along with their occurrences, i.e. the number of times each word occurs in the document. The position of each word in the document, along with its font size and capitalization, is also stored in the forward index. The indexer then distributes this structure into a set of barrels, creating a forward index that is partially sorted. The indexer also parses out all the links in the web page and stores them in the anchors file. This file records where each link points from and to, as well as the text of the link.
The URL Resolver reads the URLs from the anchors file and converts them to absolute URLs and in turn to Document IDs. It puts the anchor text into the forward index, associated with the Document ID. It also generates a Links database, which consists of pairs of Document IDs. This Links database is used later by the page ranking algorithm.
The sorter takes the barrels, which are sorted by Document ID, and creates an inverted index. The inverted index lists, for each word, the documents that contain it and its occurrences in each document. The DumpLexicon program takes this inverted index together with the lexicon produced by the indexer and creates a new lexicon to be used by the searcher. The searcher, which is run by the web server, takes the query words from the user and uses the lexicon produced by DumpLexicon, the inverted index and PageRank to answer the query.
Figure 1. High level Google Architecture [Huan00]
5 TEXT INDEXING AND RETRIEVAL
Indexing addresses the issue of how information from a collection of documents should be organized so that queries can be resolved efficiently and relevant portions of the data extracted quickly. We will describe a variety of indexing methods. To be as general as possible, a document collection or document database can be treated as a set of separate documents, each described by a set of representative terms, or simply terms (each term might have additional information, such as its location within the document). An index must be capable of identifying all documents that contain combinations of specified terms, or that are in some other way judged to be relevant to the set of query terms. The process of identifying the documents based on the terms is called a search or query of the index.
Applications of indexing
Indexing has been used for many years in a wide variety of applications. It has recently gained particular interest in the area of web searching (e.g. AltaVista, Hotbot, Lycos, Excite, ...). Applications include web searches, library article and catalog searches, law and patent searches, and information filtering (e.g. selecting interesting New York Times articles).
The goals of these applications are:
Speed -- minimal information retrieval latency
Space -- storing the documents and the indexing information in minimal space
Accuracy -- returning the "right" set of documents
Updates -- ability to modify the index on the fly (only required by some applications)
Figure 2 provides an Overview of Indexing and Searching process.
Figure 2: Overview of indexing and searching (diagram labels: Document Collections, Index, Query, Document List)
The main approaches that are used for text indexing are as follows:
Full text scanning (e.g. grep, egrep)
Inverted file indexing (most web search engines)
Signature files
Vector space model
Each of these approaches is explained in the following sections. Our work focuses on inverted file indexing and the efficient data structures that can be used for it. The types of queries that an index may have to support are Boolean (and, or, not), proximity (adjacent, within), keyword set, and queries in relation to other documents (relevance feedback). The index should also allow for prefix matches (AltaVista does this), wildcards, and edit distance bounds (egrep).
There are some general techniques that are used by all indexing approaches irrespective of the algorithm or data structures. These are:
Case folding: London = london
Stemming: compress = compression = compressed (several off-the-shelf English language stemmers are available)
Ignoring stop words: to, the, it, be, or, ... Problems arise when searching for phrases such as "To be or not to be" or "the month of May".
Thesaurus: fast = rapid (hand-built clustering)
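The first three techniques can be sketched in a few lines. The stop list contains only the examples given above, and `crude_stem` is a toy suffix stripper invented for this illustration; it is not one of the off-the-shelf stemmers and handles only the example words.

```python
STOP_WORDS = {"to", "the", "it", "be", "or"}   # only the examples listed above

def crude_stem(word):
    """Toy suffix stripper, a stand-in for a real English stemmer."""
    for suffix in ("ion", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 3:
            return word[: -len(suffix)]
    return word

def normalize(text):
    """Case-fold, drop stop words, then stem the remaining terms."""
    terms = []
    for token in text.split():
        word = token.strip(".,;:!?\"'").lower()      # case folding
        if word and word not in STOP_WORDS:
            terms.append(crude_stem(word))
    return terms

terms = normalize("London compress compression compressed")
# terms == ["london", "compress", "compress", "compress"]
phrase = normalize("To be or not to be")
# phrase == ["not"]
```

The second call shows the stop word problem mentioned above: after filtering, the well-known phrase is reduced to a single term, so an index built this way cannot match it as a phrase.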
Granularity of Index
The Granularity of the index refers to the resolution to which term locations are recorded within each document. This might be at the document level, at the sentence level or exact locations. For proximity searches, the index must know exact (or near exact) locations.
5.1 Signature files
Signature files are an alternative to inverted file indexing. The main advantage of signature files is that they don't require that a lexicon be kept in memory during query processing. In fact they do not require a lexicon at all. If the vocabulary of the stored documents is rich, then the amount of space occupied by a lexicon may be a substantial fraction of the amount of space filled by the documents themselves.
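The mechanism explained in the rest of this section (hashing each term to a bit vector, OR-ing the vectors into a document descriptor, and testing a query term against the descriptor) can be sketched as follows. The descriptor width, the number of bits per term and the use of SHA-256 as the hash are assumptions chosen for illustration, not parameters taken from this report.

```python
import hashlib

SIGNATURE_BITS = 64      # hypothetical descriptor width
BITS_PER_TERM = 3        # hypothetical number of bits set per term

def term_signature(term):
    """Hash a term to a bit vector (stored as an int) with a few bits set."""
    signature = 0
    for k in range(BITS_PER_TERM):
        digest = hashlib.sha256(f"{k}:{term}".encode()).digest()
        bit = int.from_bytes(digest[:4], "big") % SIGNATURE_BITS
        signature |= 1 << bit
    return signature

def document_descriptor(terms):
    """Bitwise OR of the signatures of all terms in the document."""
    descriptor = 0
    for term in terms:
        descriptor |= term_signature(term)
    return descriptor

def may_contain(descriptor, term):
    """False means definitely absent; True means present or a false match."""
    signature = term_signature(term)
    return (descriptor & signature) == signature

descriptor = document_descriptor(["inverted", "file", "index"])
present = may_contain(descriptor, "file")
```

A term that is in the document always tests True. A term that is absent usually tests False, but it can test True when other terms happen to set all of its bits, which is why a positive answer has to be treated as "maybe".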
Signature files are a probabilistic method for indexing documents. Each term in a document is assigned a random signature, which is a bit vector. These assignments are made by hashing. The descriptor of a document is the bitwise logical OR of the signatures of its terms. As we will see, queries to signature files sometimes respond that a term is present in a document when in fact the term is absent. Such false matches necessitate a three-valued query logic. There are three main issues with respect to signature files: (1) generating signatures, (2) searching on signatures, and (3) query logic on signature files.
5.2 Vector space models
Boolean queries are useful for detecting Boolean combinations of the presence and absence of terms in documents. However, Boolean queries never yield more information than a Yes or No answer. In contrast, vector space models allow search engines to quantify the degree of similarity between a query and a set of documents. The uses of vector space models include:
Ranked keyword searches, in which the search engine generates a list of documents ranked according to their relevance to a query.
Relevance feedback, in which the user specifies a query and the search engine returns a set of documents; the user then tells the search engine which documents in the set are relevant, and the search engine returns a new set of documents. This process continues until the user is satisfied.
Semantic indexing, in which search engines are able to return a set of documents whose "meaning" is similar to the meanings of the terms in a user's query.
In vector space models, documents are treated as vectors in which each term is a separate dimension. Queries are also modeled as vectors, typically 0-1 vectors. Vector space models are often used in conjunction with clustering to accelerate searches.
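A ranked keyword search in this model can be sketched as follows. The report does not name a similarity measure, so the sketch assumes cosine similarity between a 0-1 query vector and raw term-count document vectors; the document texts are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine of the angle between two sparse term vectors (dicts)."""
    dot = sum(weight * v.get(term, 0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_documents(query_terms, documents):
    """Return (ranking, scores): ids by decreasing similarity, and the scores."""
    query = {term: 1 for term in set(query_terms)}        # 0-1 query vector
    vectors = {doc_id: Counter(text.lower().split())      # term counts
               for doc_id, text in documents.items()}
    scores = {doc_id: cosine(query, vec) for doc_id, vec in vectors.items()}
    ranking = sorted(scores, key=lambda d: scores[d], reverse=True)
    return ranking, scores

docs = {
    "d1": "inverted file index",
    "d2": "signature file",
    "d3": "web crawler",
}
ranking, scores = rank_documents(["file", "index"], docs)
# ranking == ["d1", "d2", "d3"]
```

Unlike a Boolean query, the result is graded: d1 matches both query terms and scores highest, d2 matches one and scores 0.5, and d3 matches none and scores 0.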
5.3 Latent semantic indexing (LSI)
All of the methods we have explained so far to search a collection of documents have matched words in users' queries to words in documents. These approaches all have two drawbacks. First, since there are usually many ways to express a given concept, there may be no document that matches the terms in a query even if there is a document that matches the meaning of the query. Second, since a given word may mean many things, a term in a query may retrieve irrelevant documents. In contrast, latent semantic indexing allows users to retrieve information on the basis of the conceptual content or meaning of a document. For example, the query automobile will pick up documents that do not contain automobile, but that do contain car or perhaps driver.
5.4 Inverted File Indexing
Inverted file indices are probably the most common method used for indexing documents. Figure 3 shows the structure of an inverted file index. It consists first of a lexicon with one entry for every term that appears in any document. We will discuss later how the lexicon can be organized. For each item in the lexicon, the inverted file index has an inverted file entry (or posting list) that stores a list of pointers (also called postings) to all occurrences of the term in the main text. Thus, to find the documents containing a given term, we need only look up the term in the lexicon and then fetch its posting list. Boolean queries involving more than one term can be answered by taking the intersection (conjunction) or union (disjunction) of the corresponding posting lists. We will consider the following important issues in implementing inverted file indices:
How to minimize the space taken by the posting lists?
How to access the lexicon efficiently and allow for prefix and wildcard queries?
How to take the union and intersection of posting lists efficiently?
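A minimal sketch of these ideas, assuming document-level granularity, whitespace tokenization and integer document numbers (all simplifications made for illustration): a dictionary plays the role of the lexicon, each entry holds a sorted posting list, and Boolean AND and OR are answered by intersecting or uniting posting lists.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to a sorted posting list of document numbers."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def intersect(a, b):
    """AND: merge two sorted posting lists, keeping common entries."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

def union(a, b):
    """OR: all document numbers that appear in either posting list."""
    return sorted(set(a) | set(b))

docs = {1: "web search engine", 2: "inverted file index", 3: "web index"}
index = build_inverted_index(docs)
both = intersect(index["web"], index["index"])    # [3]
either = union(index["web"], index["index"])      # [1, 2, 3]
```

The merge in `intersect` walks each list once, so its cost is linear in the combined length of the two posting lists; this relies on the lists being kept in ascending order.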
Figure 3: Structure of Inverted Index
5.5 Inverted File Compression
The total size of the posting lists can be as large as the document data itself. In fact, if the granularity of the posting lists is such that each pointer points to the exact location of the term in the document, then we can in effect recreate the original documents from the lexicon and posting lists (i.e., the index contains the same information). By compressing the posting lists we can both reduce the total storage required by the index and potentially reduce access time, since fewer disk accesses are required and/or the compressed lists can fit in faster memory. This has to be balanced against the fact that any compression of the lists requires on-the-fly uncompression, which might increase access times. In this section we discuss compression techniques that are quite cheap to uncompress on the fly. The key to compression is the observation that each posting list is an ascending sequence of integers (assume each document is indexed by an integer). The list can therefore be represented by an initial position followed by a list of gaps or deltas between adjacent locations.
For example:
original posting list: elephant: [3, 5, 20, 21, 23, 76, 77, 78]
posting list with deltas: elephant: [3, 2, 15, 1, 2, 53, 1, 1]
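The conversion between a posting list and its deltas can be sketched as follows. The class and method names (GapCoding, toGaps, fromGaps) are our own illustrative choices, and the sketch only covers the delta step, not the subsequent Huffman or arithmetic coding of the deltas.

```java
import java.util.Arrays;

// Hypothetical helper illustrating the delta (gap) representation of a posting list.
final class GapCoding {

    /** Replaces each document number by its distance from the previous one. */
    static int[] toGaps(int[] postings) {
        int[] gaps = new int[postings.length];
        int previous = 0; // the first entry is kept as the initial position
        for (int i = 0; i < postings.length; i++) {
            gaps[i] = postings[i] - previous;
            previous = postings[i];
        }
        return gaps;
    }

    /** Restores the original posting list by summing the gaps. */
    static int[] fromGaps(int[] gaps) {
        int[] postings = new int[gaps.length];
        int running = 0;
        for (int i = 0; i < gaps.length; i++) {
            running += gaps[i];
            postings[i] = running;
        }
        return postings;
    }

    public static void main(String[] args) {
        int[] elephant = {3, 5, 20, 21, 23, 76, 77, 78};
        // prints [3, 2, 15, 1, 2, 53, 1, 1]
        System.out.println(Arrays.toString(toGaps(elephant)));
    }
}
```

Decoding is a single running sum, which is why this representation is cheap to undo on the fly.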
The advantage of using the deltas is that they can usually be compressed much better than the indices themselves, since their entropy is lower. To implement the compression on the deltas we need some model describing the probabilities of the deltas. Based on these probabilities we can use standard Huffman or arithmetic coding to code the deltas in each posting list. Models for the probabilities can be divided into global or local models (whether the same probabilities are given to all lists or not) and into fixed or dynamic models (whether the probabilities are fixed independent of the data or whether they change based on the data).

5.6 Representing and Accessing Lexicons
There are many ways to store the lexicon. Here we list some of them:
Sorted -- just store the terms one after the other in a sorted array
Tries -- store terms as a trie data structure
Btrees -- well suited for disk storage
Perfect hashing -- assuming the lexicon is fixed, a perfect hash can be calculated
Frontcoding -- stores terms sorted but does not repeat the front part of terms. Requires much less space than a simple sorted array.
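As an illustration of the front coding entry in the list above, the sketch below stores each term of a sorted lexicon as the number of leading characters it shares with the previous term, followed by the remaining suffix. The "count:suffix" text format and the class name FrontCoding are assumptions made for this example only; real implementations typically use a more compact binary layout.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical front coding sketch: each entry is "<shared prefix length>:<suffix>".
final class FrontCoding {

    /** Encodes a lexicon that is already sorted. */
    static List<String> encode(List<String> sortedTerms) {
        List<String> coded = new ArrayList<>();
        String previous = "";
        for (String term : sortedTerms) {
            int limit = Math.min(previous.length(), term.length());
            int shared = 0;
            while (shared < limit && previous.charAt(shared) == term.charAt(shared)) {
                shared++;
            }
            coded.add(shared + ":" + term.substring(shared));
            previous = term;
        }
        return coded;
    }

    /** Rebuilds the terms by reusing the prefix of the previously decoded term. */
    static List<String> decode(List<String> coded) {
        List<String> terms = new ArrayList<>();
        String previous = "";
        for (String entry : coded) {
            int colon = entry.indexOf(':');
            int shared = Integer.parseInt(entry.substring(0, colon));
            String term = previous.substring(0, shared) + entry.substring(colon + 1);
            terms.add(term);
            previous = term;
        }
        return terms;
    }
}
```

For the sorted terms index, indexer, indexing this produces 0:index, 5:er, 5:ing, so the shared front part "index" is stored only once.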
When choosing among the methods one needs to consider both the space taken by the data structure and the access time. Another consideration is whether the structure allows for easy prefix queries (e.g., all terms that start with wux). Of the above methods all except for perfect hashing allow for easy prefix searching since terms with the same prefix will appear adjacently in the structure. Wildcard queries (e.g., w*x) can be handled in two ways. One way is to use ngrams, by which fragments of the terms are indexed (adding a level of indirection). Another way is to use a rotated lexicon.

6 IMPLEMENTATION
The main idea behind this project is to create indexes of keywords from web pages and store them in data structures. We then find out the time complexity and space complexity of storing and searching the keywords in these data structures. This project contains three main modules, the crawler module, parser module and the indexer module as shown in figure 4.
[Figure 4 diagram: Crawler (takes the root URL and outputs the list of URLs to Out.txt) -> HTML Parser (takes each URL and parses the HTML page to remove all HTML tags) -> Text Parser (gets the page contents from the HTML Parser and retrieves the keywords) -> Indexer (indexes the keywords into data structures, recording the documents and occurrences of each keyword, and creates the inverted index) -> Output file -> Retrieve information]
Figure 4: Architecture of implementation of indexer module

The crawler module takes as input a root URL and the number of levels that it needs to go down. In order to test our system with a changing number of keywords, we use the option of crawling to different levels to obtain a variable number of web pages. This module uses a Breadth-First Search approach to collect URLs and visits these URLs to get the web page content. The crawler visits the root URL and looks for links on the web page at that URL. It stores the links on that page in a queue and visits these pages in sequence. The crawler then outputs the visited URLs into a file “out.txt”. In traditional search engines, the web page content is also downloaded, compressed and stored in a repository. Our crawler only collects the URLs and does not store the web page content, as we have very few web pages and we are not implementing a query processor that would need to return the document to the user.

The next module is the parser module. The parser module reads the file output by the crawler, which contains all the URLs visited. It takes each URL and first processes the page to remove the HTML tags. The remaining content is then text parsed to extract all the words in that page. The extracted words are processed further to remove stop words, and are stemmed so that each word is saved as its root. Stop words are words that occur very often in documents and do not help to discriminate one document from another. The Porter stemming algorithm is used to stem the ends of the words. We have not used any algorithm to remove prefixes from the words. These words are then stored in the forward indexing structure. The forward indexing structure stores the URL of the page visited, a list of the keywords processed and the number of times each keyword occurs in that document.
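The breadth-first traversal used by the crawler can be sketched as follows. To keep the example self-contained, the web is replaced by an in-memory map from each URL to the links found on that page; this map, the class name BfsCrawlSketch and the method crawl are assumptions made for illustration and do not reproduce the project's crawler, which fetches real pages.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: 'links' stands in for fetching a page and extracting its links.
final class BfsCrawlSketch {

    /** Returns the URLs in the order a breadth-first crawl visits them, down to maxDepth levels. */
    static List<String> crawl(String rootUrl, int maxDepth, Map<String, List<String>> links) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Map<String, Integer> depth = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();

        queue.add(rootUrl);
        seen.add(rootUrl);
        depth.put(rootUrl, 0);

        while (!queue.isEmpty()) {
            String url = queue.remove();
            visited.add(url);
            int level = depth.get(url);
            if (level == maxDepth) {
                continue; // do not follow links below the requested depth
            }
            for (String next : links.getOrDefault(url, Collections.emptyList())) {
                if (seen.add(next)) { // true only the first time a URL is encountered
                    depth.put(next, level + 1);
                    queue.add(next);
                }
            }
        }
        return visited;
    }
}
```

With a depth of 1 the result contains the root URL followed by the links found on the root page, which matches the way depth is used in our experiments.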
The indexer module takes the forward indexing structure and processes it to create an inverted index structure that stores the keywords along with a list of the documents that each keyword occurs in. Ideally the location of the keyword in each document should also be stored, but we have ignored that aspect as we are not interested in displaying the result to the user. Our main aim is to study the time taken to build this inverted index and the time it takes to search for keywords in this structure. We have used three indexing structures. The first is a simple sorted array that uses the keyword as the key for sorting. We use a binary search technique to find the keywords in the array. The second index structure is a hash table. Here again we use the keyword as the key for hashing into the hash table. The third data structure is the BTree. We have implemented a BTree that stores a minimum of 2 keywords in each of its nodes and a maximum of 4.
7 ANALYSIS AND RESULTS
We compared the efficiency of three different data structures with respect to the inverted file indexing algorithm. As explained in the implementation section 3.1, a crawler is used that can retrieve links contained in web pages. We first use the crawler to retrieve a set of pages that shall be indexed. An HTML parser is given the list of URLs or web pages. The HTML parser then parses the files and retrieves only the strings in each web page. This set of strings from a web page is then passed on to a text parser that performs stemming and uses a stop list to remove words. The text parser then produces a set of words that shall be indexed using the inverted file indexing algorithm. We can set the crawler to visit pages at varying depths, thus allowing us to vary the number of keywords that are indexed. We initiated the crawler with different starting URLs such as http://www.cnn.com, http://www.nbc.com etc. The depth used by the crawler is also varied so as to vary the number of keywords. With a depth of 1, the crawler would generate a set of all URLs available on the home page of the websites. Once a set of keywords is generated, the number of keywords is calculated. We shall compare the performance of the different data structures based on the time that is needed to create the index. A tester program retrieves a set of keywords from a set of URLs and uses three different data structures (sorted array, hash table and BTree) to create indexes using the inverted file indexing algorithm. In the following sections the results on the performance of each of the data structures are presented.
7.1 Inverted File Indexing Using Sorted Array
To create an inverted index using a sorted array, an index of the following format is created first, as shown in figure 5.

URL      Words found on the web page
URL1     Word1, word2, ..., word n
URL2     Word1, word2, ..., word n
URL3
URL n    Word1, word2, ..., word n
Figure 5. Forward index of URLs with list of keywords that appear in them
To create an inverted index from the above structure, we visit each URL in the URL list and then every word contained in that URL. If the word has never been indexed, it is inserted into the sorted array and a pointer is placed to a list of the URLs that contain this word. If the word is already present in the sorted array, the list of URLs that its pointer refers to is updated to add the new URL. Eventually the inverted index has the structure shown in figure 6.
Sorted array of words    Postings list (index into URL list)
aid      4, 8
all      2, 4, 6
back     1, 3, 7
brown    1, 3, 5, 7
come     2, 4, 6, 8
dog      3, 5
fox      3, 5, 7
good     2, 4, 6, 8
jump     3
lazy     1, 3, 5, 7
men      2, 4, 8
now      2, 6, 8
over     1, 3, 5, 7, 8
party    6, 8
quick    1, 3
their    1, 5, 7
time     2, 4, 6

Figure 6: Inverted index using sorted array
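The construction of the sorted array index from the forward index can be sketched as below. This is a minimal illustration, not the code used for our measurements: the class SortedArrayIndex, its fields and the choice of numbering documents from 1 in forward index order are assumptions. It relies on java.util.Collections.binarySearch, which returns a negative value encoding the insertion point when the term is absent.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of an inverted index kept as a sorted array of terms.
final class SortedArrayIndex {
    final List<String> terms = new ArrayList<>();           // kept in sorted order
    final List<List<Integer>> postings = new ArrayList<>(); // postings.get(i) belongs to terms.get(i)

    /** Records that 'term' occurs in document 'docId'. */
    void add(String term, int docId) {
        int pos = Collections.binarySearch(terms, term);
        if (pos < 0) {
            // New keyword: decode the insertion point and create its postings list.
            pos = -(pos + 1);
            terms.add(pos, term); // shifts all later entries one slot to the right
            postings.add(pos, new ArrayList<>());
        }
        List<Integer> list = postings.get(pos);
        if (list.isEmpty() || list.get(list.size() - 1) != docId) {
            list.add(docId); // skip repeats of the same word within one document
        }
    }

    /** Returns the postings list of 'term', or an empty list if it was never indexed. */
    List<Integer> lookup(String term) {
        int pos = Collections.binarySearch(terms, term);
        return pos >= 0 ? postings.get(pos) : Collections.emptyList();
    }

    /** Builds the index from a forward index; document i (counting from 1) holds forwardIndex.get(i - 1). */
    static SortedArrayIndex build(List<List<String>> forwardIndex) {
        SortedArrayIndex index = new SortedArrayIndex();
        for (int doc = 0; doc < forwardIndex.size(); doc++) {
            for (String word : forwardIndex.get(doc)) {
                index.add(word, doc + 1);
            }
        }
        return index;
    }
}
```

Note that inserting a new term into the middle of an array-backed list moves every later entry, in addition to the cost of the binary search itself.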
The plot shown in figure 7 shows the performance of the sorted array structure when used in the inverted file indexing algorithm. The X-axis represents the number of keywords indexed, and the Y-axis represents the time required to create an inverted index based on the sorted array data structure.

[Plot: Inv Indexing using Sorted Array. X-axis: No. of keywords indexed. Y-axis: Time to create index (millisec), scale 0 to 40000.]
Figure 7: Plot of performance in sorted Array
As can be seen from the plot, the curve is close to linear. The time required to create the sorted array index is quite large because every time a keyword needs to be indexed, a binary search decides where the word should be placed. This is quite an expensive operation and must be performed for every keyword. In addition, inserting a new keyword into the middle of a sorted array generally requires shifting the entries that follow it. The other operation that needs to be performed is that of creating and initializing a postings list for a new keyword, or retrieving and updating the postings list for a keyword that has been indexed before.
7.2 Inverted File Indexing Using Hash Table
To create an inverted file index using a hash table, the class library in Java was used. The class “Hashtable” (java.util.Hashtable) implements a hash table, which maps keys to values. Here the keywords are the “keys” and the values are the “lists of URLs” that contain the keyword. Any non-null object can be used as a key or as a value. The hash table is open: in the case of a "hash collision", a single bucket stores multiple entries, which must be searched sequentially. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. To create an inverted index, a structure similar to Table 1.0 is used for each URL. Each URL is visited sequentially and the words contained in each URL are read. Each keyword, which serves as a key, is then hashed using the hashing function. The Java hashing function seemingly satisfies the requirements of a good hashing function. The structure of an inverted file index with a hash table is shown in figure 8.
[Figure 8 diagram: Key (Keyword) -> Hash Code = H(K) -> Insert Element At (Hash Code) = Posting List]
Figure 8: Hash Table Implementation
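The same construction with java.util.Hashtable can be sketched as follows. The class name HashTableIndex, the method build and the numbering of documents from 1 are illustrative assumptions; only Hashtable itself, with its get and put methods, comes from the Java class library.

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;

// Hypothetical sketch of an inverted index backed by java.util.Hashtable.
final class HashTableIndex {

    /** Builds keyword -> postings list; document i (counting from 1) holds forwardIndex.get(i - 1). */
    static Hashtable<String, List<Integer>> build(List<List<String>> forwardIndex) {
        Hashtable<String, List<Integer>> index = new Hashtable<>();
        for (int doc = 0; doc < forwardIndex.size(); doc++) {
            int docId = doc + 1;
            for (String word : forwardIndex.get(doc)) {
                List<Integer> list = index.get(word); // hashes the keyword to find its bucket
                if (list == null) {
                    // New keyword: create and register an empty postings list.
                    list = new ArrayList<>();
                    index.put(word, list);
                }
                if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                    list.add(docId); // skip repeats of the same word within one document
                }
            }
        }
        return index;
    }
}
```

No ordering of the keywords is maintained here, so there is no search for an insertion position and no shifting of entries.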
Key words being indexed    Hash Code generated by hashing    Posting List
aid      0     4, 8
all      1     2, 4, 6
back     2     1, 3, 7
brown    3     1, 3, 5, 7
come     4     2, 4, 6, 8
dog      5     3, 5
fox      6     3, 5, 7
good     7     2, 4, 6, 8
jump     8     3
lazy     9     1, 3, 5, 7
men      10    2, 4, 8
now      11    2, 6, 8
over     12    1, 3, 5, 7, 8
party    13    6, 8
quick    14    1, 3
their    15    1, 5, 7
time     16    2, 4, 6
Figure 9: Inverted index structure of Hash Table

The following plot shows the performance of the hash table when used in the inverted file indexing algorithm. The X-axis represents the number of keywords indexed, and the Y-axis represents the time required to create an inverted index based on the hash table data structure.

[Plot: Inv Indexing using Hash Table. X-axis: No. of keywords indexed. Y-axis: Time to create index (millisec), scale 0 to 300.]
Figure 10 Plot of performance in HashTable
As can be seen from the plot, the curve is sublinear. The time required to create the hash table is small because, as opposed to the sorted array, where a binary search decides where each keyword should be placed, the hash table only requires a simple hash function to compute the hash code, which serves as an index into an array where the posting list is stored. The other operation that needs to be performed is that of creating and initializing a postings list for a new keyword, or retrieving and updating the postings list for a keyword that has been indexed before. Because of the absence of the binary search operation, this is a very efficient data structure for creating the inverted index.

7.3 Inverted File Indexing Using BTrees
A BTree is a balanced tree (all leaf nodes are at the same level) in which all the nodes are sorted based on a key. Each node, except the root, stores a maximum of m keys and a minimum of m/2 keys. If a node has t keys in it, then it has t+1 children to represent the ranges of values that it can store. At each insertion and deletion, the tree is restructured so that it stays height balanced.

There are 2 ways that the tree can be restructured on insertion. In the first, the new keyword is added into the slot that it should be put into. If the node is then over its size limit, the node is split into 2 nodes and the middle keyword is passed up into the parent of this full node. This continues up to the root, so that none of the nodes is over its limit and the end result is a balanced tree. This method makes 2 passes over the tree, as in an AVL tree: the keyword is first added, and if the node is full, it is then split. The second way of inserting a new keyword into the tree is to start at the root and split every node along the way that is at its size limit, i.e. m. This way we traverse the tree only once and create a place for the new keyword. The BTree class we have implemented uses this method. We have not concentrated on deleting keywords from the tree.

In the index structure that we have created using the BTree, the keyword is used as the key to sort the nodes. Each keyword has 2 lists associated with it. One list contains indexes to all the documents that contain that keyword, and the second list is the occurrence list, i.e. the number of times the keyword occurs in each of those documents. The index in the BTree is shown in the diagram. Each node contains a minimum of 2 keywords and a maximum of 4 keywords. We have not been able to show the associated document and occurrence lists for each of the keywords. The structure of the BTree is shown in figure 11.
An example of a node entry would be:
keyword : document
urlList : [doc1, doc3, doc4]
occurList : [1, 2, 1]
This implies that the keyword "document" occurs in doc1 one time, in doc3 two times and in doc4 one time.
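The single-pass insertion described above (splitting every full node met on the way down) can be sketched as follows. This is not the BTree class from the project. It follows the common textbook formulation with a minimum degree T, assumed here to be 3, so that non-root nodes hold between 2 and 5 keys rather than the 2 to 4 of our implementation; the class name BTreeSketch and all method names are hypothetical, and the url and occurrence lists attached to each keyword are left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of single-pass BTree insertion (keys only, no postings).
final class BTreeSketch {
    private static final int T = 3; // assumed minimum degree: T-1..2T-1 keys per non-root node

    private static final class Node {
        final List<String> keys = new ArrayList<>();
        final List<Node> children = new ArrayList<>(); // empty for a leaf
        boolean isLeaf() { return children.isEmpty(); }
        boolean isFull() { return keys.size() == 2 * T - 1; }
    }

    private Node root = new Node();

    /** Inserts a keyword in one downward pass, splitting full nodes before entering them. */
    void insert(String key) {
        if (root.isFull()) { // grow the tree by one level at the top
            Node newRoot = new Node();
            newRoot.children.add(root);
            splitChild(newRoot, 0);
            root = newRoot;
        }
        Node node = root;
        while (true) {
            int i = position(node, key);
            if (i < node.keys.size() && key.equals(node.keys.get(i))) return; // already indexed
            if (node.isLeaf()) { node.keys.add(i, key); return; }
            if (node.children.get(i).isFull()) {
                splitChild(node, i);
                int cmp = key.compareTo(node.keys.get(i)); // compare with the promoted median
                if (cmp == 0) return;
                if (cmp > 0) i++;
            }
            node = node.children.get(i);
        }
    }

    /** Splits the full child at index i of parent and moves its median key up into parent. */
    private static void splitChild(Node parent, int i) {
        Node left = parent.children.get(i);
        Node right = new Node();
        String median = left.keys.get(T - 1);
        right.keys.addAll(left.keys.subList(T, left.keys.size()));
        left.keys.subList(T - 1, left.keys.size()).clear();
        if (!left.isLeaf()) {
            right.children.addAll(left.children.subList(T, left.children.size()));
            left.children.subList(T, left.children.size()).clear();
        }
        parent.keys.add(i, median);
        parent.children.add(i + 1, right);
    }

    /** Index of the first key in node that is not smaller than 'key'. */
    private static int position(Node node, String key) {
        int i = 0;
        while (i < node.keys.size() && key.compareTo(node.keys.get(i)) > 0) i++;
        return i;
    }

    boolean contains(String key) {
        Node node = root;
        while (true) {
            int i = position(node, key);
            if (i < node.keys.size() && key.equals(node.keys.get(i))) return true;
            if (node.isLeaf()) return false;
            node = node.children.get(i);
        }
    }

    /** Returns all keys in sorted order (in-order traversal). */
    List<String> inOrder() {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(Node node, List<String> out) {
        for (int i = 0; i < node.keys.size(); i++) {
            if (!node.isLeaf()) collect(node.children.get(i), out);
            out.add(node.keys.get(i));
        }
        if (!node.isLeaf()) collect(node.children.get(node.keys.size()), out);
    }
}
```

Because a node is split before the search descends into it, the parent always has room for the promoted median, so the insertion never has to walk back up the tree.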
[Figure 11 diagram: BTree of indexed keywords, with between 2 and 4 keywords in each node. The node contents are not recoverable from the extracted text.]
Figure 11 Structure of BTree created
The following plot in figure 12 shows the performance of the BTree when used in the inverted file indexing algorithm. The X-axis represents the number of keywords indexed, and the Y-axis represents the time required to create an inverted index based on the BTree data structure.

[Plot: Inv Indexing using BTrees. X-axis: No. of keywords indexed. Y-axis: Time to create index (millisec), scale 0 to 9000.]
Figure 12: Plot of performance in BTree
As can be seen from the plot, the curve is sublinear. The time required to create the index using the BTree is of the order of O(m log_m n) per inserted keyword (the insertion cost given in section 7.4), where m is the order of the BTree. The depth of the BTree is at most of the order of log_(m/2) n, where n is the total number of keywords that we have indexed, because each node has to have a minimum of m/2 keywords.
7.4 Comparative analysis of all three data structures
The following plot shows the performance of the hash table, sorted array and the BTree when used in the inverted file indexing algorithm. The X-axis represents the number of keywords indexed, and the Y-axis represents the time required to create an inverted index when using each of the three data structures. Three different colors distinguish the lines corresponding to the hash table, sorted array and the BTree from each other.

[Plot: Inverted Index Creation. X-axis: No. of keywords. Y-axis: Time in millisec, scale 0 to 40000. Series: Sorted Array, Hash Table, BTree.]
Figure 13: Comparative plot of all 3 data structures
As is quite evident from the comparison plot of all three data structures, the hash table outperforms the sorted array and the BTree in terms of the time required to create the inverted index. This can be attributed to the fact that a minimal amount of computing time is required to insert a new keyword into the hash table index. Compare this to the binary search that the sorted array requires to find the correct place to insert the keyword. Binary search searches a sorted array by repeatedly dividing the search interval in half. It begins with an interval covering the whole array. If the value of the search key is less than the item in the middle of the interval, it narrows the interval to the lower half; otherwise it narrows it to the upper half. It checks repeatedly until the value is found or the interval is empty. It runs in O(log N) time, where N is the size of the array, in this case the size of the lexicon in the index.
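The binary search procedure just described can be written out as follows. The class name LexiconSearch and the convention of returning -1 when the keyword is absent are choices made for this sketch only.

```java
// Hypothetical sketch of binary search over a sorted lexicon.
final class LexiconSearch {

    /** Returns the position of 'key' in 'sortedTerms', or -1 if it is not present. */
    static int binarySearch(String[] sortedTerms, String key) {
        int low = 0;
        int high = sortedTerms.length - 1; // the interval initially covers the whole array
        while (low <= high) {               // stop when the interval is empty
            int mid = low + (high - low) / 2;
            int cmp = key.compareTo(sortedTerms[mid]);
            if (cmp == 0) {
                return mid;                 // value found
            } else if (cmp < 0) {
                high = mid - 1;             // narrow to the lower half
            } else {
                low = mid + 1;              // narrow to the upper half
            }
        }
        return -1;
    }
}
```

Each iteration halves the interval, so a lexicon of N terms needs at most about log2 N + 1 comparisons of keywords.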
The hash table complexity depends on the hash function and the collision resolution scheme, but in this case it is constant, O(1). Some open addressing schemes may suffer from clustering more than others. So it is evident that if we use a hashing function that minimizes collisions and a good collision resolution strategy, outperforming the sorted array is an easy task.
A BTree is a balanced search tree in which every node has between ceiling(m/2) and m children, where m > 1 is a fixed integer. The root may have as few as 2 children and the leaf nodes do not have any children. Inserting a keyword in a BTree has a complexity of O(m log_m n), where m is the order of the tree and n is the total number of keywords that are being indexed.
7.5 Search and retrieval efficiency of the data structures
Even though the project has focused on comparing the data structures in terms of the time required to create the index, the search and retrieval efficiency and the memory storage requirements of the data structures also warrant discussion. Searching in a hash table is O(1) on average, i.e. constant time, and is extremely efficient.
Searching in a BTree is O(log_(m/2) n), where n is the total number of keywords that are indexed. The advantage of using the BTree is that it is balanced. Hence all leaves stay at the same depth and the height of the tree grows only logarithmically with n. Searching and retrieval on a sorted array requires O(log n) operations, because it needs the binary search algorithm for retrieval.
7.6 BTrees and External memory:
The payoff of the BTree insert and delete rules is that BTrees are always "balanced". Searching an unbalanced tree may require traversing an arbitrary and unpredictable number of nodes and pointers. In a balanced tree all leaves are at the same depth, so there is no runaway pointer overhead. Indeed, even very large BTrees can guarantee that only a small number of nodes must be retrieved to find a given key. For example, a BTree of 10,000,000 keys with 50 keys per node never needs to retrieve more than 4 nodes to find any key. This is a good structure if much of the tree is in slow memory (disk), since the height, and hence the number of accesses, can be kept small, say one or two, by picking a large m.

BTrees are therefore especially useful for search structures stored on disk. Disks have different retrieval characteristics than internal memory (RAM). Obviously, disk access is much, much slower. Furthermore, data is arranged in concentric circles (called tracks) on each side of a disk “platter”. (Most disks these days have a single platter, but some disks are a stack of platters.) A disk is read by read/write heads mounted on an arm that is moved in and out from track to track. Moving that arm takes time, so there is a real timing benefit to grouping data so that it can be read without moving the arm. The amount of data that can be read without moving the arm (from both sides of all platters) is called a cylinder. It is much faster to read an entire cylinder than to read a little, move the arm, read a little more, move the arm, and so on, even if the total amount of data in a cylinder is much more than we need.

BTrees are a good match for on-disk storage and searching because we can choose the node size to match the cylinder size. In doing so, we will store many data members in each node, making the tree flatter, so fewer node-to-node transitions will be needed.
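The 10,000,000-key example quoted above can be checked with a short calculation. The sketch assumes that "50 keys per node" is the minimum number of keys in every non-root node, so that each internal node has at least t = 51 children; under that assumption a tree of height h holds at least 2 * t^h - 1 keys, which bounds the height for a given number of keys. The class and method names are hypothetical.

```java
// Hypothetical check of the worst-case search path length in a BTree.
final class BTreeHeightBound {

    /**
     * Largest number of nodes on a root-to-leaf path for a BTree with n >= 1 keys,
     * assuming every non-root node holds at least minKeys keys.
     */
    static int maxNodesOnSearchPath(long n, int minKeys) {
        long t = minKeys + 1L; // minimum number of children of an internal non-root node
        int height = 0;
        long power = 1;        // holds t^height
        // A tree of height h needs at least 2 * t^h - 1 keys; grow h while n still allows it.
        while (2 * (power * t) - 1 <= n) {
            power *= t;
            height++;
        }
        return height + 1;     // a path of height h visits h + 1 nodes
    }

    public static void main(String[] args) {
        // prints 4
        System.out.println(maxNodesOnSearchPath(10_000_000L, 50));
    }
}
```

With t = 51 the thresholds are 2 * 51^3 - 1 = 265,301 keys for height 3 and 2 * 51^4 - 1 = 13,530,401 keys for height 4, so 10,000,000 keys cannot force a height above 3, i.e. at most 4 nodes per search.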
8 FUTURE RESEARCH DIRECTIONS
This work has analyzed the performance of different data structures when used to build an index for text using the inverted file indexing algorithm. The metric that was used for the comparison was the time required to build an index. The ways in which this work could be expanded are as follows:

Using other metrics to compare the data structures, for example the space footprint of the index and the time required to search for a keyword in the index.
Analyzing the efficiency of data structures like kd-trees and tries within the context of the inverted file indexing algorithm.
Evaluating different text indexing algorithms like signature files, LSI and the vector space model. Different metrics can be used for this analysis.
Analysis of indexing algorithms for image and video retrieval.
Text indexes are compressed to save space. Analysis of compression algorithms like Huffman coding, and of searching compressed indexes, is another interesting research topic.
9 CONCLUSION
We achieved the goals that we had set for this project. We have gained a sound understanding of search engine technology and of information retrieval techniques, particularly text indexing. We have studied in depth the inverted file indexing algorithm and related data structures such as hash tables, BTrees and sorted arrays. From the performance analysis of the inverted file indexing algorithm and data structures we can conclude that efficient algorithms and data structures are the key to efficient search engines. Google’s PageRank algorithm, which revolutionized search engine technology, also bears testament to this fact. This work also enumerates future work based on this project.