Efficient Sorting of Search Results by String Attributes
Nicholas Sherlock
Department of Computer Science, University of Otago, Otago 9054, New Zealand
[email protected]

Andrew Trotman
Department of Computer Science, University of Otago, Otago 9054, New Zealand
[email protected]

Proceedings of the 16th Australasian Document Computing Symposium, Canberra, Australia, 2 December 2011. Copyright for this article remains with the authors.

Abstract

It is sometimes required to order search results using textual document attributes such as titles. This is problematic for performance because of the memory required to store these long text strings at indexing and search time. We create a method for compressing strings which may be used for approximate ordering of search results on textual attributes. We create a metric for analyzing its performance. We then use this metric to show that, for document collections containing tens of millions of documents, we can sort document titles using 64 bits of storage per title to within 100 positions of error per document.

Keywords: Information Retrieval, Web Documents, Digital Libraries

1 Introduction

A document search engine typically takes a search query provided by the user and generates a list of documents which contain some or all of the terms in the query. This list of documents is sorted in a particular order, defined by a “ranking function”. Common ranking functions assign an importance to each term in the search query using some metric, then for each term in the query, apply that importance to the number of occurrences of the term in each document to give each document a score. The list of documents is then sorted using this score to present documents to the user with the highest scores first.

There are several such ranking functions that are of interest in search over online discussion forums. For example, the online forum software “phpBB” offers users the ability to order their results by document attributes such as post time, author name, forum name, topic title, and post title. In the case of text fields like names and titles, this is problematic, as it requires the search engine to have these fields available in text form for comparison sorting at search time. This consumes a large amount of memory. For example, in a collection of 14.8 million forum posts, simply storing the post title for each post requires 500 megabytes of memory. In addition, the search engine must allocate memory to an index structure which allows it to efficiently retrieve those post titles by document index, which, using a simplistic scheme with 4 bytes required per document offset, would require an additional 50 megabytes of storage.

For search terms which occur in many documents, most of the memory allocated to storing text fields like post titles must be examined during result list sorting. As 550 megabytes is vastly larger than the cache memory available inside the CPU, sorting the list of documents by post title requires the CPU to load that data from main memory, which adds substantial latency to query processing and competes for memory bandwidth with other processes running on the same system.

In this paper, we examine several methods which can be used to reduce the memory requirements for storing document attributes in a form suitable for sorting. We examine the tradeoffs which can be made between sorting accuracy and the memory required for storage at search time. We then demonstrate the accuracy of our methods by applying them to a corpus of 14.8 million discussion posts taken from an online discussion forum called “Chicken Smoothie”, and 22 million posts from the online discussion forum at “Ancestry.com”.

2 Ranking

First, let us examine the way that the search engine calculates ranking scores for documents. For each search term in the query, an index over the document collection is consulted. This index maps the term onto a list of document IDs which contain that term, along with extra information about that term’s appearance in each document (for example, a count of the number of times that term appears in the document). This information is passed to a “ranking function”, which uses it to generate a ranking score for each document.

The scores generated by the ranking function for each term in the search query are combined together in a structure called the “accumulator”. The accumulator can be represented as an array of integers of a fixed size, one integer for each document in the collection. Every element in the accumulator is initialised to zero at the beginning of the search. Then, the value returned from the ranking function for each term in the search query is typically added (using simple arithmetic addition) to the value already present in the accumulator for that document, to give a new score for the document.

At the end of the search process, the search engine’s accumulator is sorted in descending order, which brings the documents which the ranking function scored the highest (which are hoped to be the “most relevant” documents to the user’s information need, based on their search query) to the front of the result list.

2.1 Pregenerated ranking

If a ranking function’s value does not depend on the content of the user’s query, then instead of computing it when a document is located at search time, it can be computed when the document index is first created (at “indexing time”). This pregenerated ranking can be stored as a file (a “pregen”) consisting of an array of integers, one for each document in the collection. When the search engine starts up, it can read each of its pregen files into memory. Then at search time, rather than performing any computations to calculate the ranking score for a document, the search engine can simply read the precalculated value stored in the pregen and store that value directly into its accumulator to be sorted. This technique allows the pregenerated ranking to take advantage of the search engine’s existing implementation of accumulator sorting. For example, the search engine may already implement special optimisations like partial sorting of accumulators for the case where only the top-K sorted documents are required[5].

If a ranking function is particularly expensive to compute (compared to reading a precomputed value from memory), speed gains can be made at search time, at the expense of the extra memory required to store the pregens. For example, one ranking function which is expensive to compute, but independent of the user’s query, is Google’s PageRank algorithm[4]. This ranking function essentially assigns a score to web documents based on how many incoming links they have.

If we could compute a compact pregenerated ranking for each of the document attributes that our users want to sort on, we could improve the speed at search time (by sorting the smaller pregen instead of the longer original strings) and reduce the memory required to store those text attributes. For example, if we could fit the 14.8 million post titles in the Chicken Smoothie collection into 64-bit integers, the pregenerated ranking would only require 110 megabytes of memory to store.

[...] indexing time, the text attributes for each document are extracted and stored in temporary memory. When the indexing of the document collection is complete, the document attributes are sorted using a string comparison function. Each document’s position in the list of sorted attributes then directly becomes its pregenerated ranking score. In this way, we generate a maximally compact set of ranking scores for the document collection (having exactly as many distinct values as there are distinct attributes in the collection), which also perfectly encodes the relative ordering of the documents in the sorted list of attribute text. If each integer in the pregen is 32 bits large, approximately 4.3 billion distinct documents can be perfectly ranked using this method ($2^{32}$).

This method has several drawbacks. Because every attribute of the documents which must be sorted needs to be available at the end of indexing time, we must either allocate enough memory to store those values (e.g. 550MB for post titles in the Chicken Smoothie corpus), or avoid allocating memory by storing those values in some sort of disk structure (requiring at least 1 gigabyte of additional disk I/O, as we must both write the values to disk and read them in again).

Another drawback is that each value in the generated ranking is dependent on the attributes of every document in the collection. If a later change to the document collection results in the first post in the ranking being deleted (say, “aardvarks and apples”), every pregenerated ranking function for the collection is now invalid and must be recomputed.

This is also a problem in distributed search, where a document collection may be split into several chunks, with each chunk being indexed by a separate computer. The pregens created by each index will be incompatible, since they are based on different, incomplete fragments of the total document collection. If, during searching, a search engine consults several such distributed indexes, and retrieves a list of documents from each with corresponding scores from their pregens, it will be unable to merge those results using just the values of the pregens to create a list which is sorted by the ordering over the complete collection.

A better method for generating pregens would be able to compute a ranking score for a document attribute immediately upon reading it (which would eliminate the requirement to store those values until the end of indexing), and the computed value would be independent of all other documents in the collection
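
The sort-based pregen described in Section 2.1, together with its sensitivity to deletions, can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names `build_pregen` and `sort_results_by_title` and the sample titles are hypothetical, and Python's default string comparison stands in for the string comparison function, which the text does not specify.

```python
# Hypothetical sketch of a sort-based pregen; all names are illustrative.

def build_pregen(titles):
    """Return one integer per document: the rank of its title.

    Equal titles share a rank, so the pregen holds exactly as many
    distinct values as there are distinct titles in the collection.
    """
    # Assumption: plain code-point ordering as the comparison function.
    distinct_sorted = sorted(set(titles))
    rank_of = {title: rank for rank, title in enumerate(distinct_sorted)}
    return [rank_of[title] for title in titles]


def sort_results_by_title(matching_doc_ids, pregen):
    """Order matching documents by title using only the integer pregen."""
    return sorted(matching_doc_ids, key=lambda doc_id: pregen[doc_id])


# Document IDs are positions in this list.
titles = ["zebra care", "aardvarks and apples", "mango tips", "mango tips"]
pregen = build_pregen(titles)

# Suppose documents 0, 2 and 3 matched the query; no strings are compared here.
ordered = sort_results_by_title([0, 2, 3], pregen)

# Deleting the first-ranked title shifts the rank of every remaining document,
# so the old pregen values can no longer be reused.
pregen_after_delete = build_pregen(["zebra care", "mango tips", "mango tips"])
```

The sketch keeps every title in memory until all of them have been read, which is the storage cost the text identifies as the first drawback of this method.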