Index Compression vs. Retrieval Time of Inverted Files for XML Documents

Norbert Fuhr, Norbert Gövert
University of Dortmund, Germany

Query languages for retrieval of XML documents allow for conditions referring both to the content and to the structure of documents. In order to process these queries efficiently, inverted files must also contain structural information, leading to index sizes that exceed the storage space of the original data. In this paper, we investigate two different approaches for reducing index space. First, we consider methods for compressing index entries. Second, we develop the new XS tree data structure, which contains the structural description of a document in a rather compact form, such that these descriptions can be kept in main memory. We evaluate the efficiency of several variants of these two approaches on two large XML document collections. The results show that very high compression rates can be achieved for indexes; however, any compression increases retrieval time. Thus, retrieval time is minimized when uncompressed indexes are used. On the other hand, highly compressed indexes may be feasible for applications where storage is limited, such as in PDAs or E-book devices.

1 Introduction

In the past, most IR approaches have ignored document formats or structure, partly due to the diversity of formats, partly due to the limited availability of structured documents. With the increased use of the Extensible Markup Language (XML) as a standard format for documents, this will change very soon. XML allows for logical markup of texts both at the macro level (for example chapter, section, paragraph) and at the micro level (e. g. MathML¹ for mathematical formulas, CML² for chemical formulas). In order to exploit this markup in retrieval, an appropriate query language must be used. Unfortunately, the XQuery proposal [Chamberlin et al. 01] by the World Wide Web Consortium's working group on XML query languages offers very little support for retrieval of XML documents; instead, it focuses on the data-centric view of XML, thus offering features similar to query languages for (object-oriented) databases. As a more IR-related approach, the XIRQL language [Fuhr & Großjohann 01] extends the XPath subset of XQuery by IR concepts such as weighting, ranking and relevance-oriented search.

'Classical' text retrieval regards documents as unstructured text units, and queries contain value conditions only (i. e. search for texts containing certain terms). In contrast, all approaches for XML retrieval also support structural conditions, even in combination with values. Predecessors of this type of retrieval can be found in commercial text retrieval systems developed since the 1970s, where documents could be split up into e. g. fields, paragraphs and sentences, and the corresponding query languages supported conditions with respect to (w. r. t.) this (fixed) structure. Today, XML allows for more flexible (even recursive) document structures.

In order to perform efficient retrieval on XML document collections, appropriate index structures must be used. Most of the old commercial text retrieval systems included all the necessary structural information within the inverted lists, thus creating indexes that were often larger than the original texts. As described below, experiments have shown that in the case of XML documents, too, uncompressed indexes require roughly as much space as the documents themselves. However, standard compression schemes for XML documents reduce collection size by 80–90 %; therefore index size is a crucial factor when space matters. Thus, it seems appropriate to apply methods for index compression. Work in this direction also has to take into account the classical space-time tradeoff in computer science: since the compressed index entries have to be decompressed during retrieval, savings in terms of storage space may lead to higher retrieval times. In this paper, we investigate a range of possibilities for index compression with different compression rates and retrieval performance; this way, we offer a system implementor (or administrator) a variety of solutions, from which the most appropriate one can be chosen for a given application.

¹ http://www.w3.org/Math/
² http://www.xml-cml.org/

So far, there is very little literature on index structures for XML document retrieval; moreover, we are not aware of any experimentation using large, realistic XML collections. [Thom et al. 95] describe a method to integrate path information into inverted lists, and propose different compression methods; our PIL approach described below is based on these ideas. [Moffat & Zobel 96] investigate methods for compressing inverted lists in the case of unstructured documents, and they also develop a new retrieval strategy that reduces overall retrieval time in comparison to uncompressed lists. As an alternative to inverted files, [Yoon et al. 01] use bitmap indexing for indexing structured documents. Term occurrences are associated with their respective paths and document identifiers by means of a so-called BitCube: the set of paths, the set of terms, and the set of document identifiers each form one dimension. It is claimed that very efficient retrieval can be achieved with this approach. However, neither element indexes nor indexing weights are represented in the BitCube, and an extension towards inclusion of this data seems to be difficult, to say the least.

Since inverted files [Harman et al. 92] [Witten et al. 99, chapter 3] have shown their advantages in comparison with other access methods proposed in the literature, we base our work on this access structure and seek modifications that reduce its storage requirements. For this purpose, we investigate two different approaches in this paper. In order to reduce the index space of inverted files, we can either compress the index entries, or we can reduce the information stored in the entries (e. g. no structural or positional information). In the latter case, the inverted file yields a superset of the correct answers, and we need additional sources in order to determine the final result. The most obvious solution in this case would be scanning the original documents, i. e. retrieving the candidate documents and testing for the fulfillment of the actual query condition(s); however, this solution yields very high retrieval times. As an alternative, we consider the additional XS tree data structure, which contains the structural description of a document in a rather compact form, such that these descriptions can be kept in main memory. In this case, the structural description in an inverted file entry is replaced by a pointer to a position in the XS tree; thus, the overall storage space of the index is reduced.

In the remainder of this paper, we first specify the functionality of access methods for XML retrieval. Then we describe the PIL method for compressing inverted lists. As an alternative, Section 4 presents our new XS tree data structure, which reduces the structural information in the inverted list entries to a pointer linking into this tree. The efficiency of both approaches is evaluated in Section 5. Finally, we give our conclusions and an outlook on further research.

2 XML querying

Queries that can be formulated for searching in an XML document collection can be classified into the following four types:

1. Value-oriented (e. g. "Find documents about XML retrieval"): This type of query relates only to the content of documents, as in classical information retrieval.

2. Structural (e. g. "Find books having more than 10 chapters and an appendix"): This type of query relates only to the structure of documents. Presumably it will be very rare in document-oriented applications of XML; it may be more common in data-centric applications.

3. Combination of value and structure (e. g. in an arts encyclopedia, pictures of London vs. images painted in London): Here XML markup allows for searches with higher precision, by restricting conditions of the query to specific elements.

4. Relevance-oriented: This type of query is similar to the first type, but here the system is asked to search for relevant document parts, unlike the classical IR approach of treating documents as atomic units. Passage retrieval [Wilkinson 94] [Callan 94] is a means for retrieving portions of a document, but it does not consider document structure. In contrast, relevance-oriented retrieval searches for those elements of an XML document that answer the query best.

For purely structural queries, the DataGuide structure described in [Goldman & Widom 97] is an efficient method for locating all elements in an XML document collection fulfilling structural conditions. Since IR is mainly value-based, its application to XML documents leads to queries of types 3 and 4. In the remainder of this paper, we only consider these two types of queries.

In order to process queries referring to the logical structure of documents, XML query languages must support the following types of conditions:

Element names: For high-precision retrieval, queries must be allowed to specify the element name where certain values should occur (e. g. in the example above, 'London' in the title of an image vs. 'London' in the location of creation).

Element index: When the index has a specific semantics, users may want to specify the value of an index (e. g. first chapter of a book, first author of a publication).

Ancestor / descendant: Since XML documents have a tree structure, the ancestor-descendant relationship describes the logical structure of a document (aggregation of document parts). Queries may refer explicitly to this relationship (e. g. find a chapter that gives a theorem and also contains the corresponding proof). Relevance-oriented retrieval needs this information in order to identify logical parts fulfilling all conditions specified in the query.

Preceding / following: This type of condition refers to the linear sequence within the document. For example, to find a document explaining the vector space model in terms of uncertain inference, we may look for a paper that talks about uncertain inference before mentioning the vector space model.

Given these different types of conditions, we may now think about access structures that contain all the information necessary for testing these conditions for (value-based) IR queries. A content-oriented query always contains values (e. g. terms), and structural conditions are used for increasing the precision of retrieval. Developing access methods for XML retrieval means considering both the values and the structural information. That is, we need additional information describing the within-document location of a term or value. This information is given by a so-called path. A path describes the sequence of nodes from the document root to a specific element; it consists of a sequence of path steps, where each step corresponds to an element. In order to check for the different types of conditions listed above, each element must be described by its element name and element index as well as its sequence index. This way, conditions referring to element names or indexes can be answered directly, the ancestor / descendant relationship can be inferred from the sequence of path steps, and the sequence index allows for identifying preceding or following elements.

[Figure 1 (tree notation): the root element book has the children title, author and two chapter elements; title contains a #PCDATA node, author carries a @NAME attribute; the first chapter contains a section with a #PCDATA node and a list of two item elements, each containing a #PCDATA node; the second chapter contains two sections, each with a #PCDATA node.]

Figure 1: Example XML document structure: the textual representation on the left-hand side can be transformed into the tree notation on the right-hand side. Content can be found in those nodes depicted as rectangles in the tree notation (namely in attribute and #PCDATA / text nodes). These nodes are shown as ellipses in the textual notation.

As an example, consider the path leading to the grey #PCDATA node in Figure 1: /book[1,1]/chapter[1,3]/section[1,1]/list[1,2]/item[2,2]/#PCDATA[1,1]. Here the different steps are separated by a slash. For each step, we first give the element name and then a pair of indexes, namely the element index and the sequence index (the chapter element, for example, is the first chapter in this document, but the third child of the book element).
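To make this notation concrete, the following sketch (our own illustration, not part of the HyREX implementation; all names are hypothetical) parses such a path string and implements the ancestor / descendant and preceding / following checks described above:

    import re
    from typing import List, NamedTuple

    class Step(NamedTuple):
        name: str      # element name
        elem_idx: int  # index among siblings with the same name
        seq_idx: int   # index among all siblings

    STEP = re.compile(r"([^/\[\]]+)\[(\d+),(\d+)\]")

    def parse_path(path: str) -> List[Step]:
        """Parse '/book[1,1]/chapter[1,3]/...' into a list of steps."""
        return [Step(n, int(e), int(s)) for n, e, s in STEP.findall(path)]

    def is_ancestor(p: List[Step], q: List[Step]) -> bool:
        """p is an ancestor of q iff p's steps form a proper prefix of q's."""
        return len(p) < len(q) and q[:len(p)] == p

    def precedes(p: List[Step], q: List[Step]) -> bool:
        """p starts before q in document order: compare sequence indexes at
        the first diverging step (an ancestor starts before its descendants)."""
        for a, b in zip(p, q):
            if a != b:
                return a.seq_idx < b.seq_idx
        return len(p) < len(q)

    grey = parse_path("/book[1,1]/chapter[1,3]/section[1,1]/list[1,2]/item[2,2]/#PCDATA[1,1]")
    assert grey[1] == Step("chapter", 1, 3)  # first chapter, third child of book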

3 Compressed paths in inverted lists

Inverted files have been used as the most efficient access method for text retrieval for many years. In order to process queries with structural conditions, path information has to be encoded as part of the entries in the inverted lists; we call this method the PIL approach (paths in inverted lists). However, a straightforward implementation of this idea requires vast amounts of disk space, i. e. the index will be larger than the original data.

Thus, we are looking at methods for compressing the structural information. Concepts for this step have already been proposed in [Thom et al. 95]. Here we briefly describe the basic ideas and our extensions.

In principle, an inverted list describes the occurrences of a term in a collection. Typically it consists of a sequence of document entries. In the simplest case, a document entry is just a document ID. In order to add structural information, we have to add all the different paths of occurrences of the specific term in the current document. For this purpose, we first store the number of occurrences, and then the occurrence entries themselves. Each of these entries specifies the length of the path first, followed by three lists containing the element names, the element indexes, and the sequence indexes, respectively. Using an XML-like notation for regular expressions, we can describe the abstract data structure as follows (here '?' stands for an optional element, '+' for at least one occurrence and '*' for an arbitrary number of occurrences, including zero):

    invlist    -> docentry+
    docentry   -> docid num_occ occurrence+
    occurrence -> path_length element+ element_index+ sequence_index+ weight?

For most of the issues discussed in the following, we do not consider indexing weights explicitly, since they just add a constant factor in terms of space and time³.

As a simple example, consider the inverted list of a term occurring in documents number 34 and 40, which both share the same structure as displayed in Figure 1. Further assume that the term under consideration occurs three times in document 34, namely within each of the three #PCDATA nodes in dashed rectangles, and twice in document 40, namely within both of the two #PCDATA nodes in dotted rectangles. The corresponding inverted list would look like this (here angular brackets and spacing are used for illustration purposes only):

    < 34 3 < 4 <1 1 1 1> <1 3 1 1> >
           < 6 <1 1 1 1 1 1> <1 3 1 2 1 1> >
           < 6 <1 1 1 1 2 1> <1 3 1 2 2 1> > >
    < 40 2 < 4 <1 2 1 1> <1 4 1 1> >
           < 4 <1 2 2 1> <1 4 2 1> > >

Obviously, due to the fact that the entries are sorted by position, there is some redundancy, which can be exploited for compression. As a first step, common prefixes of subsequent path entries can be eliminated. For this purpose, another counter is added to the entry which gives the length of the common prefix (number of common path steps) with the previous entry, i. e.:

    occurrence -> path_length_diff prefix_length element* element_index* sequence_index*

Instead of the path length, we only encode the difference between prefix length and path length. In order to compress the document numbers, run length encoding is used. For the sequence indexes (which were not considered in [Thom et al. 95]), we only encode the difference to the corresponding element indexes. Thus, we can compress the list of entries from above as follows:

    < 34 3 < 4 0 <1 1 1 1> <0 2 0 0> >
           < 2 4 <1 1> <0 0> >
           < 2 4 <2 1> <0 0> > >
    <  6 2 < 3 1 <2 1 1> <2 0 0> >
           < 2 2 <2 1> <0 0> > >

In order to compress this data, universal codes [Elias 75] are used for the numbers and canonical Huffman codes [Hirschberg & Lelewer 90] for the element names. Details are given in Section 5.2.
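For illustration, here is a minimal sketch (our own; bit-strings are used for readability, whereas a real implementation would pack bits) of the unary and Elias γ codes used for the numeric components:

    def unary(n: int) -> str:
        """Unary code: n - 1 one-bits followed by a single zero-bit (n >= 1)."""
        return "1" * (n - 1) + "0"

    def gamma(n: int) -> str:
        """Elias gamma code: the length of n's binary form in unary,
        followed by that binary form without its leading 1-bit (n >= 1)."""
        b = bin(n)[2:]                # e.g. 5 -> '101'
        return unary(len(b)) + b[1:]  # e.g. 5 -> '110' + '01' = '11001'

    def decode_gamma(bits: str, pos: int = 0):
        """Decode one gamma-coded number starting at bits[pos];
        returns (value, position of the next code)."""
        length = 1
        while bits[pos] == "1":       # unary prefix gives the bit length
            length += 1
            pos += 1
        pos += 1                      # skip the terminating 0-bit
        value = int("1" + bits[pos:pos + length - 1], 2)
        return value, pos + length - 1

    assert gamma(5) == "11001" and decode_gamma("11001") == (5, 5)

Small values get short codes, which matches the skewed distributions of path lengths and index gaps discussed in Section 5.2.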

4 The XS tree

As an alternative to compressed path information in inverted files, we have developed the XML Structure tree (XS tree). The XS tree is an additional data structure that describes the structure of an XML document. It is highly compressed, such that the XS trees of a whole collection can be kept in main memory. Now the occurrence entry in the inverted list is reduced to a path handle, that is, a single number pointing to a position in the corresponding XS tree (identified by the document ID of the entry).

³ Also, some weighting methods may not need any additional space, since all the occurrence information they need is already present in the inverted list.

Given this position, a resolution method yields the corresponding path. In order to achieve a rather compact representation of such a tree, compression methods should be used as much as possible. For creating a linear representation of the XS tree, there are a number of design alternatives:

Top-down (preorder) vs. bottom-up (postorder): The latter organization would allow for restricting the decoding work to the specific path only, where we would work our way from the end of a path up to the root. By choosing a top-down sequence, we can apply context-specific compression methods: given the DTD, there is only a small set of elements that can occur as children of a specific element, so we only need a few bits for coding each of these alternatives.

Parent-child relationship via pointers or level numbers: The standard representation via pointers allows for faster access to specific nodes, but needs more space. Level numbers are more compact, especially when run length encoding is used. As an example of the latter strategy, the structure of the document shown in Figure 1 can be described as follows (using preorder sequence):

    <1 book> <2 title> <3 #PCDATA> <2 author> <3 @name> <2 chapter> <3 section> <4 #PCDATA> <4 list> <5 item> <6 #PCDATA> <5 item> <6 #PCDATA> <2 chapter> <3 section> <4 #PCDATA> <3 section> <4 #PCDATA>

Positions as bit addresses or element numbers: Bit addresses would allow for direct access, but could be used only when no context-dependent compression methods are applied. In contrast, element numbers are more compact, but require a linear scan through the representation.

Element and sequence indexes explicit or implicit: Each element in a path has an element index and a sequence index that denote its relative position among the children of the parent node. These indexes could be encoded explicitly, but would require additional storage space; in contrast, by scanning the representation linearly, the indexes of each element can be computed on the fly.

In order to keep the XS trees in main memory, we have to minimize the storage requirements; thus, we should choose a representation that encodes the tree top-down, where the parent-child relationship is given via level numbers, positions are encoded as element numbers, and the element index is computed on the fly. So the tree is represented as a linear bitstream, without additional navigation pointers. As a consequence, the resolution method has to decompress this bitstream up to the specified position in order to compute the corresponding path. For compressing the tree structure, we use run length encoding; thus only the relative differences between the level numbers are stored. For our running example, this gives us:

    <1 book> <1 title> <1 #PCDATA> <-1 author> <1 @name> <-1 chapter> <1 section> <1 #PCDATA> <0 list> <1 item> <1 #PCDATA> <-1 item> <1 #PCDATA> <-4 chapter> <1 section> <1 #PCDATA> <-1 section> <1 #PCDATA>

In order to use universal codes, we map these integers onto natural numbers.
Since the level gaps are always less than or equal to one (descending in the XML tree structure is done one step at a time), we transform a gap z by means of the function −(z − 2):

    <1 book> <1 title> <1 #PCDATA> <3 author> <1 @name> <3 chapter> <1 section> <1 #PCDATA> <2 list> <1 item> <1 #PCDATA> <3 item> <1 #PCDATA> <6 chapter> <1 section> <1 #PCDATA> <3 section> <1 #PCDATA>

It can be observed that the integers representing the level gaps are very small. Our experiments showed that unary codes give the highest compression compared to the γ and δ universal codes (other coding schemes for integers, like parameterized or local methods [Witten et al. 99, section 3.3], have not been considered within this work). For example, γ codes would require about 15–17 % (depending on the collection) more space than the unary codes. On the other hand, unary codes are quite simple and efficient also with regard to processing time: a number n is encoded as a sequence of n − 1 1-bits, followed by a single 0-bit. Encoding only the run lengths in our example yields a bitstring of length 32. With these few bits, we have encoded the shape of the document tree.

In addition, we need information about element names. For encoding element names in compressed form, we have two design choices:

Huffman vs. arithmetic coding: The former method has the advantage that the codes of the element names can be inserted at the corresponding positions of the bitstring encoding the shape of the document tree. In contrast, for arithmetic codes, we would either have to decompress the complete tree every time in order to access the arithmetic codes, or store an additional pointer to the start of this list; furthermore, the implementation of arithmetic codes is more complex (and thus decoding needs more time), especially when used with local coding schemes [Witten et al. 99, pages 51 ff]. For these reasons, we only implemented Huffman coding.

Global vs. local (context-dependent) coding scheme: With a context-dependent coding scheme, we can exploit context information for achieving higher compression rates. Given the DTD of the documents to be encoded, only a small number of elements may occur within a certain context; thus, a context-specific code must discriminate between these possible elements only. The only disadvantage is the higher complexity of the algorithms for encoding and decoding.
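Returning to the shape encoding developed above, the following sketch (our own; helper names are hypothetical, and unary is repeated to keep the sketch self-contained) maps the preorder level sequence of the Figure 1 document to gaps, applies the −(z − 2) transform, and unary-codes the result, reproducing the 32-bit figure quoted above:

    def unary(n: int) -> str:
        """Unary code: n - 1 one-bits followed by a single zero-bit."""
        return "1" * (n - 1) + "0"

    # Preorder levels of the Figure 1 document (book = 1, title = 2, ...).
    LEVELS = [1, 2, 3, 2, 3, 2, 3, 4, 4, 5, 6, 5, 6, 2, 3, 4, 3, 4]

    def encode_shape(levels):
        """Level gaps, mapped to naturals by z -> -(z - 2), unary-coded."""
        bits, prev = [], 0
        for lvl in levels:
            bits.append(unary(2 - (lvl - prev)))  # gap is at most +1
            prev = lvl
        return "".join(bits)

    def decode_shape(bits):
        """Inverse: recover the level sequence from the bitstring."""
        levels, prev, n = [], 0, 1
        for b in bits:
            if b == "1":
                n += 1                # still inside one unary code
            else:
                prev += 2 - n         # undo the gap transform
                levels.append(prev)
                n = 1
        return levels

    bitstring = encode_shape(LEVELS)
    assert len(bitstring) == 32       # matches the 32 bits stated in the text
    assert decode_shape(bitstring) == LEVELS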

Figure 2: DTD rules for the XML document example in Figure 1

As an example illustrating the benefit of a local coding scheme, consider the DTD listed in Figure 2. Here we have 12 different element / attribute names, which must be distinguished by a global coding scheme. In contrast, with a local coding scheme, we have at most three different elements in a context given by the production of an element. However, we can do even better: by considering the exact document syntax as specified in the DTD, we notice that in many cases no information has to be coded at all, namely when there is no choice and element / attribute names can be predicted. Based on the DTD, we can specify the following cases where the next element is determined uniquely by the context that has been read so far (here we use an abbreviated form of syntax rules, like in the definition of the data structures above):

• productions containing only one element, even if this one is optional or repetitive, as in the following examples:

    a1 -> b
    a2 -> c?
    a3 -> d*
    a4 -> e+

• productions that are pure sequences (without optional or repetitive elements), e. g.:

    a -> b c d

• combinations of the two previous cases, where the last element of a sequence may be optional or repetitive:

    c1 -> d e f?
    c2 -> g h k+
    c3 -> l m n*

• As a further refinement, in a production starting with a pure sequence, only the remainder (starting with the first optional or repetitive element) has to be encoded, e. g.:

    r1 -> o p q? r?
    r2 -> s t u* v w

• For element attributes, we exploit the fact that here order does not matter. In the indexing process, we first group them into mandatory (#REQUIRED) and optional (#IMPLIED) attributes and then impose an implicit order on each of these groups (e. g. sort alphabetically by attribute name). Thus, we arrive at a content model like

    Attributes -> req1 req2 req3 imp1? imp2?

Based on this content model, we can apply the strategies described above.

For our example DTD shown in Figure 2, there are only three contexts where we have to encode the element or attribute names, namely the attributes of author and the child elements of chapter and section. Continuing the above example, we would represent the document structure as follows:

    <1> <1> <1> <-1> <1 @name> <-1> <1 section> <1 #PCDATA> <0 list> <1> <1> <-1> <1> <-4> <1 section> <1 #PCDATA> <-1 section> <1 #PCDATA>

Thus, we had to encode just seven of the 18 element names for our example document. Applying context-specific Huffman codes here would yield code word lengths of one bit per element / attribute name for our example DTD, which means that the structure of the document depicted in Figure 1 can be represented by only 39 bits. To sum up, an access structure for XS trees has the following form:

    xs_access -> xstree+
    xstree    -> node+
    node      -> level element_name?
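The resolution method itself is not spelled out above; the following sketch (our own, operating on an already decoded (level, name) sequence such as the one produced by decode_shape in the previous sketch) shows how a path handle, i. e. a preorder element number, can be resolved into a full path, with element and sequence indexes computed on the fly:

    from collections import Counter

    # Preorder (level, name) pairs for the Figure 1 document, e.g. as
    # obtained by decoding the XS bitstream (names spelled out here).
    NODES = [(1, "book"), (2, "title"), (3, "#PCDATA"), (2, "author"),
             (3, "@name"), (2, "chapter"), (3, "section"), (4, "#PCDATA"),
             (4, "list"), (5, "item"), (6, "#PCDATA"), (5, "item"),
             (6, "#PCDATA"), (2, "chapter"), (3, "section"), (4, "#PCDATA"),
             (3, "section"), (4, "#PCDATA")]

    def resolve(nodes, handle):
        """Resolve a path handle (0-based preorder element number) into a
        path of (name, element_index, sequence_index) steps; the indexes
        are computed on the fly while scanning the node sequence."""
        stack = []              # frames: [name, elem_idx, seq_idx, child name counts]
        root_counts = Counter()
        for pos, (level, name) in enumerate(nodes):
            del stack[level - 1:]   # climb back up to this node's parent
            counts = stack[-1][3] if stack else root_counts
            counts[name] += 1       # sibling counters yield both indexes
            stack.append([name, counts[name], sum(counts.values()), Counter()])
            if pos == handle:
                return [(n, e, s) for n, e, s, _ in stack]
        raise IndexError("handle beyond end of tree")

    # Handle 12 is the grey #PCDATA node of Figure 1:
    assert resolve(NODES, 12) == [("book", 1, 1), ("chapter", 1, 3),
                                  ("section", 1, 1), ("list", 1, 2),
                                  ("item", 2, 2), ("#PCDATA", 1, 1)]

Since the scan is linear, resolving a handle costs time proportional to its position in the tree, which is exactly the behavior measured in Section 5.4.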

5 Evaluation

In order to evaluate the approaches described in the previous sections, an analytical investigation would be most helpful. However, the complexity of the subject under consideration hardly allows for the application of this method. In contrast to methods for atomic documents (e. g. [Moffat & Zobel 96]), where a document can be viewed as a sequence of terms, we have to deal with nested structures of arbitrary depth. In combination with context-dependent compression methods, it is hardly possible to develop an appropriate analytical model that would allow for estimating the space and time requirements of the different approaches.

Thus, we have to restrict ourselves to experimental evaluation of the approaches presented before. The various experiments are described below. For characterizing the efficiency of the different approaches, we consider the following parameters:

Indexing time: For applications with frequent updates, indexing time may be a crucial factor. Since we have not focused on an efficient implementation of the indexing process, our timings only indicate the relative effort for building the different index structures.

Size of index: Since our major topic is index compression, we measure the size of the inverted lists. In addition, the size of the XS tree is crucial, since it has to be kept in main memory.

Retrieval time: Following the standard tradeoff of space vs. time in computer applications, we want to know the effects of index compression on retrieval time.

All experiments described below have been conducted with HyREX⁴, the Hypertext Retrieval Engine for XML, on a Linux PC with a 1.4 GHz AMD processor and one gigabyte of main memory.

5.1 Test collections

                                  Reuters      IEEE-CS
size in MB                          2 369          403
compressed size in MB                 369           44
# documents                       806 791        5 063
∅ bytes / document                  3 079       83 557
∅ nodes / document                     96        1 947
# elements in DTD                      44          172
# attributes in DTD                    44           35
# distinct terms                  310 919       96 358
∅ collection frequency             215.65        34.72
∅ node frequency                   337.92       118.25
# postings in inverted lists  105 064 625   11 394 494
∅ path length                        3.91         5.97

Table 1: Statistics of the two test collections used

For our experiments, we used two fairly large XML collections with quite different characteristics. The statistics of these collections are shown in Table 1.

The first collection is the Reuters Corpus⁵, which contains about 810 000 English-language Reuters news stories. The structure of these documents is similar to that of the documents used in many classical IR evaluations (e. g. the TREC collection). As another extreme, we chose the fulltexts of IEEE Computer Society publications from the years 1995–1997⁶ (IEEE-CS). In terms of the number of documents (5 063), this collection is quite small compared to the Reuters Corpus. But the IEEE-CS documents are typical journal papers, so their average size is much higher than that of the Reuters documents.

⁴ http://ls6-www.cs.uni-dortmund.de/ir/projects/hyrex/
⁵ http://about.reuters.com/researchandstandards/corpus/
⁶ This collection can be ordered as a CD set from http://www.computer.org/cspress/catalog/cs-96.htm.

The parameters describing the DTD, the average path length, and the average number of XML nodes per document show that the structure of the IEEE-CS documents is rather complex and heterogeneous compared to the structure of the Reuters documents.

For compression of the two document collections, we used the bzip2⁷ routine, since [Liefke & Suciu 00] showed that this simple method achieves compression rates that can hardly be surpassed by XML-specific compression methods. For indexing these collections, we eliminated stopwords and performed stemming on the remaining terms. Furthermore, we consider only one term occurrence per leaf element of the XML documents.

5.2 Experiments

In order to measure the efficiency, in terms of compression rate and retrieval time, of the different compression methods described in Sections 3 and 4, we conducted various experiments for both approaches. Table 2 gives an overview of the different experiments.

PILnoc     no compression of the paths in inverted lists
PILuc      universal codes, global Huffman code
PILhc      context Huffman codes, element prediction
PILprefix  common path prefixes

XSnoc      no compression of the XS trees
XSuc       universal codes, global Huffman code
XShc       context Huffman codes, simple element prediction
XSpredict  enhanced element prediction
Scan       scan documents with given path handle
NS         no indexing of structure at all

Table 2: Overview of variants of the paths in inverted lists (PIL) and XS tree approaches

For the paths in inverted lists (PIL) approach, our baseline is the case where paths are not compressed at all (PILnoc). Here the various path parameters are represented by fixed-length integers: path lengths are represented by one byte each, element names and indexes by two bytes each. Within the PILuc experiment, we applied universal codes in order to compress path lengths and indexes, and a global Huffman code for compression of the element names. In addition, for a given sequence index of a path, we only encoded the difference to its respective element index. Regarding the distributions of path lengths, element indexes and sequence index gaps, we found that unary coding gives the best compression rates for the path lengths, while γ encoding gives the best results for the indexes. The PILhc experiment extends PILuc in that the compression of the element names is refined by a local coding scheme: instead of using only one global Huffman code for all element names, we computed one Huffman code for each content model within the respective DTD. In addition, a simple form of element name prediction has been applied: in case a content model allows a single child only, we do not encode its name in the inverted list but predict it at retrieval time from the DTD. Finally, the PILprefix experiment extends PILhc in that common path prefixes are also exploited.

With the XS tree approach, path handles have to be stored within the inverted lists; they point into the external XS tree data structure. For these handles, we always used the same compression technique: the run lengths of the path handles for a given document are encoded with the δ universal code. Depending on the collection, this might not be the optimum solution, but with regard to our test collections, δ codes gave the best compression. As for the PIL approach, we calculated a baseline where compression of the XS trees has been omitted (XSnoc). Within the XSuc experiment, we applied a global Huffman code for compressing the element names and unary coding for the level number gaps (as described in Section 4). The XShc experiment improves the compression of the element names in that the global Huffman code has been replaced by context-specific Huffman codes for each content model of the DTD. In addition, the same element name prediction mode as in the PILhc experiment has been used. The XSpredict experiment is a further refinement; full element name prediction as described in Section 4 has been applied here.

As an alternative to XS trees, we also considered scanning the original documents (again, the inverted lists contain path handles only) in order to obtain the required structural information (Scan). Where applicable, we compare our methods for indexing structure with an index where indexing of structure was omitted (NS).

⁷ http://sources.redhat.com/bzip2/

5.3 Indexing

           Reuters            IEEE-CS
           time    space      time   space
PILnoc     10:11   2573.14    1:11   406.08
PILuc      10:44    474.24    1:12    81.90
PILhc      10:54    338.75    1:17    56.82
PILprefix  11:26    319.71    1:21    45.06
XSnoc       7:50    414.35    0:50    42.26
XSuc        7:55    260.94    0:50    34.36
XShc        8:18    225.85    0:52    25.10
XSpredict   8:25    220.04    0:53    24.74
NS          7:05     81.34    0:45     3.53
Scan        7:34    189.68    0:50    20.04

Table 3: Indexing time and index sizes. Times are given in hh:mm format, sizes are in megabytes.

Table 3 gives an overview of the index sizes resulting from the various experiments and the time needed to create the indexes. The index sizes include administrative overhead, such as a table of document identifiers and the dictionary. Considering the times, one sees that adding a compression technique requires additional indexing time. However, the difference between the variants of the XS tree is not as big as between the various PIL approaches. In general, indexing with the XS tree approach needs less time than indexing with the PIL approach.

Reuters             PILnoc    PILuc     PILhc     PILprefix   XS      NS
Σ size of ILs (MB)  2566.77    467.87    332.38    313.34     183.31   74.97
∅ bytes / IL        8656.45   1577.89   1120.94   1056.74     618.21  252.85
∅ bits / posting     204.93     37.36     26.54     25.01      14.64    9.38

IEEE-CS             PILnoc    PILuc     PILhc     PILprefix   XS      NS
Σ size of ILs (MB)   405.06     80.88     55.80     44.04      19.02    2.51
∅ bytes / IL        4416.17    880.14    607.31    479.25     206.99   27.33
∅ bits / posting     298.77     59.54     41.09     32.42      14.00    6.29

Table 4: Inverted list (IL) statistics

Table 4 displays statistics regarding the lengths of the inverted lists in the different experiments. Since the XS tree approach has no variations w. r. t. the encoding of the inverted lists in the different experiments, we have one column for all XS tree experiments here. The XS inverted lists clearly beat the best PIL variant w. r. t. the disk space needed.

Reuters                  XSnoc    XSuc    XShc    XSpredict
Σ size of XS trees (MB)  224.67   71.26   36.17   30.36
∅ bytes / XS tree        292      92      47      39

IEEE-CS                  XSnoc    XSuc    XShc    XSpredict
Σ size of XS trees (MB)   28.22   14.32    5.06    4.70
∅ bytes / XS tree        5845     2965    1049    969.47

Table 5: XS tree statistics

The sizes of the XS tree representations are shown in Table 5 (including a four-byte pointer per XS tree, which is needed for managing them in main memory). The most elaborate variant (XSpredict) requires little more than 1 % of the space occupied by the original data in order to represent the structure of the documents. With scanning, the XS trees can be omitted completely. However, this is feasible only if documents are stored in uncompressed⁸ form, since decompression takes even longer than scanning (see below). Thus, the total space requirements of scanning are similar to those of PILnoc.

⁸ For all other methods, documents have to be decompressed only when the user wants to view a document.

5.4 Retrieval

For the access structures considered here, retrieval time is the sum of the times needed for the three subsequent phases in retrieval:

1. Reading the inverted list from disk. Here the disk seek time and rotational latency (about 10 ms for fast hard disks) play the crucial role; this adds an almost constant factor to the retrieval time of the different approaches presented below⁹.

2. Decompressing the inverted list entries.

3. For the XS tree, converting path handles into paths; in order to do this, the respective XS trees have to be decoded.

           Reuters                   IEEE-CS
           ∅ ms / IL   ∅ ms / path   ∅ ms / IL   ∅ ms / path
PILnoc     10.03       0.030          3.41       0.029
PILuc      11.82       0.035          4.10       0.035
PILhc      12.29       0.036          5.53       0.047
PILprefix  16.42       0.049          6.52       0.065
XS / Scan   3.19       0.009          0.63       0.005
NS          1.60       0.005          0.27       0.008

Table 6: Time for decompressing an average-sized inverted list (IL, in milliseconds).

In terms of retrieval efficiency, we first measured the time for decompressing an average-sized inverted list. The times depicted in Table 6 reflect the times for arriving at a list of paths corresponding to the postings in the inverted list. The XS / Scan row is an exception; here additional effort is needed to resolve the path handles (resulting from the inverted lists) into paths (see below). The path column shows the average time spent decoding one path (or posting, respectively). As can be seen from Table 6, adding compression methods (for the PIL variants) increases the decompression time by roughly 6–10 %. The retrieval times given here were measured on inverted lists without any weighting information; additional experiments have shown that term weighting information just adds a constant factor (≈ 0.120 ms for the IEEE-CS collection) per inverted list for any of the PIL and XS variants.

          XSnoc   XSuc    XShc    XSpredict   Scan
Reuters   0.440   0.345   0.460    0.500       2.075 (+ 3)
IEEE-CS   9.120   7.222   9.491   10.231      30.509 (+ 80)

Table 7: Times for resolving a path handle (in milliseconds). For the Scan method approximate additional time for decompressing a document is given in parentheses.

The time for resolving path handles with the XS tree variants is depicted in Table 7. There is a noticeably large gap between the resolution times of the XSuc and XShc variants. This results from the fact that XShc uses context-specific Huffman codes; thus the respective XML trees must be constructed at decompression time in order to apply the correct Huffman codes for decompression. The Scan column displays the time needed to resolve path handles by scanning the document; here I/O times (10–15 ms per document) are not yet considered; for Reuters, these would be the dominant factor. Considering also the decompression times (given in parentheses), it becomes apparent that in any case, scanning is slower than resolving path handles with XS trees.

Table 8 accumulates the times of the three retrieval phases denoted above for an average-sized inverted list. Apparently, PIL retrieval is much faster than the XS-based approaches, especially for the IEEE-CS collection; thus, we have the standard space-time tradeoff. However, the difference shrinks significantly when there are several term matches per document. In this case, XS retrieval times increase only marginally, whereas the PIL times grow linearly with the number of matches. For XS, the dominant factor is the decoding of the tree structure. Given a path handle, decoding stops when the corresponding position in the document is reached. For a single handle, we have to decode half of the document on average. For resolving k handles, the fraction to be decoded is k/(k + 1) [Cooper 68]. Measuring the actual decoding times for k = 1 ... 5 matches, we found that there is an additional constant factor t_XTI that is independent of the fraction to be decoded.

⁹ With a disk transfer rate of 20 MB/s, even a large inverted list of 100 KB could be read in 5 ms.

           Reuters                   IEEE-CS
           ∅ ms / IL   ∅ ms / path   ∅ ms / IL   ∅ ms / path
PILnoc      20.03      0.059           13.41      0.386
PILuc       21.82      0.065           14.10      0.406
PILhc       22.29      0.066           15.53      0.447
PILprefix   26.42      0.078           16.52      0.476
XSnoc      108.15      0.320          410.95     11.836
XSuc        87.66      0.260          340.12      9.796
XShc       112.46      0.333          510.60     14.706
XSpredict  121.09      0.359          539.42     15.536
Scan       460.73      1.365         1290.06     37.156

Table 8: Cumulated retrieval times for single-term queries (in milliseconds).

Let t_XTD denote the remaining time for decoding a complete document. In addition, we have the time t_XIL for decoding an entry of the inverted list (see Table 6). Thus, the complete time for decoding a document with k matches is

    t_XSR = k · t_XIL + t_XTI + k/(k + 1) · t_XTD

In contrast, the PIL decoding times are proportional to the number of matches. As an example, for the XSuc variant on Reuters, we have t_XTI ≈ 0.05 ms and t_XTD ≈ 0.60 ms; thus, resolving a document with 5 matches requires 0.56 ms, which is about four times slower than the (fastest) PILnoc variant (0.15 ms); in contrast, for single-term queries, the ratio is about 12:1 (0.35 ms vs. 0.03 ms).
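To make such estimates easy to reproduce, here is the cost model as a small function (our own sketch; the parameter values below are those quoted above and in Table 6, and the printed results are approximate):

    def t_xsr(k, t_xil, t_xti, t_xtd):
        """Time to resolve all k matches of one document, per the formula
        above: k entry decodings, a constant setup cost t_xti, and the
        expected fraction k/(k+1) of the full tree decoding time t_xtd."""
        return k * t_xil + t_xti + k / (k + 1) * t_xtd

    # XSuc on Reuters: t_xil from Table 6, t_xti and t_xtd as quoted above.
    print(t_xsr(1, 0.009, 0.05, 0.60))  # single match: about 0.36 ms
    print(t_xsr(5, 0.009, 0.05, 0.60))  # five matches: about 0.6 ms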

5.5 Discussion of results

The results described above show that we have reached our goal of developing a compact representation of the structure of XML documents; depending on the compression parameters, the XS tree requires about 1–3 % of the space of the original data. The overall size of the (compressed) PIL indexes is 2–3 times higher than that of the combination of the XS tree and the corresponding inverted lists. Thus, when index space matters (as in PDAs or E-book devices), the XS tree offers significant advantages. Comparing the index size with that of compressed XML (see Table 1), we see that only the highly compressed indexes require less space than the document collection itself.

[Two log-log plots with Time on the x-axis and Space on the y-axis, each showing curves for IEEE-CS (PIL), IEEE-CS (XS), Reuters (PIL) and Reuters (XS).]

Figure 3: Space vs. time for indexing (left-hand side) and retrieval (right-hand side)

In Figure 3, we have plotted the space-time tradeoff for both the indexing and the retrieval tasks. (Here space is relative to the size of the compressed document collection, and time is measured relative to the PILnoc variant.) As mentioned before, in indexing, the XS tree approach is a nice exception: both space and time are reduced. For retrieval, we have the classical tradeoff (the outlier XSnoc is omitted from the graphics). For IEEE-CS, the XS retrieval times clearly are too high. The major reason for this behavior is the large size of the documents. If we could split up the documents into smaller parts, retrieval time would be reduced proportionally to the size of these parts (e. g. a split by a factor of 10 would result in retrieval times comparable to the Reuters retrieval times).

The space-time tradeoff is also confirmed by a closer look at the retrieval times, where decoding / decompression time is the dominant factor. For PILnoc, decoding processes about 1 MB/s (and decompression needs even more time per path), whereas I/O rates are about 20 MB/s. Whereas in [Moffat & Zobel 96] decompression and I/O time were of the same order of magnitude, the complex structure of XML documents leads to this imbalance¹⁰. Thus, additional mechanisms (like the skip lists in [Moffat & Zobel 96]) will also not be able to overcome the space-time tradeoff observed here.

The retrieval time differences between the XS and PIL variants are reduced when we have multiple matches per document. In standard IR applications, the documents with multiple matches are the most interesting ones. With linear retrieval functions, the top-ranking documents usually contain several query terms. Thus, it would be possible to implement a retrieval strategy where the candidates for the top ranks are determined before the resolution of the XS path handles starts. Another relevant application area is conjunctive queries, where only the documents containing all the query terms are considered (as in some Web search engines, for example). We ran a small series of experiments with conjunctive queries generated by choosing n random terms from the same random document. It turned out that for Reuters, the average cooccurrence rate for two terms (and thus the fraction of XS path handles that must be resolved) is about 4 % (IEEE-CS: 16 %). In this situation, XS retrieval would be faster than the PIL variants.

¹⁰ These findings are only marginally affected by the fact that most of our system (except the compute-intensive parts) is implemented in Perl; we estimate that implementing it completely in C would reduce retrieval times by not more than 50 %.

6 Conclusions and outlook

With the increasing availability of XML documents, there is a growing need for IR systems supporting appropriate retrieval methods. So far, there has been little literature on index structures for XML documents, and no experimentation with realistic XML collections. In this paper, we have investigated two different XML index structures, and we have performed experiments using two large collections with quite different characteristics.

With the XS tree, we have developed a new access structure for XML retrieval that yields a very compact index. For specific applications where size matters (as in PDAs and E-book devices), the XS tree is the optimum solution. The experimental investigation of the XS tree along with the previously published PIL approach has shown that the complex structure of XML documents leads to high decompression times, which exceed the I/O time for the uncompressed inverted lists. Since the XS tree encodes the complete structure of an XML document, its decompression time is significantly higher than that of the PIL inverted lists. For all but one (XSuc) of the variants investigated, compression of indexes increases retrieval time. Thus, in almost all cases retrieval efficiency can be optimized by using no compression at all. However, this solution leads to a storage overhead which is huge in comparison to compressed XML documents.

Most of the discussion in this paper has focused on single-term queries. For queries with multiple conditions, the difference between the two approaches can be reduced by applying appropriate retrieval strategies where the candidates for the top ranks are determined first, before decompressing the structural information. This strategy can also be used for combining the XS tree with other access structures that compute a superset of the result first (e. g. based on hashing or signatures), in order to determine the correct answers. We will continue our research along these lines.

References

Callan, J. P. (1994). Passage-Level Evidence in Document Retrieval. In [Croft & Rijsbergen 94], pages 302–310.

Chamberlin, D.; Florescu, D.; Robie, J.; Siméon, J.; Stefanescu, M. (2001). XQuery: A Query Language for XML. Technical report, W3C. http://www.w3.org/TR/xquery/.

Cooper, W. (1968). Expected Search Length: A Single Measure of Retrieval Effectiveness Based on Weak Ordering Action of Retrieval Systems. Journal of the American Society for Information Science 19, pages 30–41.


Croft, B. W.; van Rijsbergen, C. J. (eds.) (1994). Proceedings of the Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, London, et al. Springer-Verlag.

Elias, P. (1975). Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory 21(2), pages 194–202.

Fuhr, N.; Großjohann, K. (2001). XIRQL: A Query Language for Information Retrieval in XML Documents. In: Croft, W.; Harper, D.; Kraft, D.; Zobel, J. (eds.): Proceedings of the 24th Annual International Conference on Research and Development in Information Retrieval, pages 172–180. ACM, New York.

Goldman, R.; Widom, J. (1997). DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. In: Proceedings of the 23rd International Conference on VLDB, pages 436–445. http://www-db.stanford.edu/lore/pubs/dataguide_vldb97.pdf.

Harman, D.; Fox, E.; Baeza-Yates, R.; Lee, W. (1992). Inverted Files. In: Information Retrieval. Data Structures & Algorithms, pages 28–43. Prentice Hall, Englewood Cliffs.

Hirschberg, D. S.; Lelewer, D. A. (1990). Efficient decoding of prefix codes. Communications of the ACM 33(4), pages 449–459.

Liefke, H.; Suciu, D. (2000). XMill: an efficient compressor for XML data. In: Chen, W.; Naughton, J.; Bernstein, P. A. (eds.): Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 153–164. ACM, New York. http://www.acm.org/sigmod/sigmod00/eproceedings/.

Moffat, A.; Zobel, J. (1996). Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems 14(4), pages 349–379.

Thom, J. A.; Zobel, J.; Grima, B. (1995). Design of indexes for structured document databases. Technical Report TR-95-8, Collaborative Information Technology Research Institute, Melbourne, Australia.

Wilkinson, R. (1994). Effective Retrieval of Structured Documents. In [Croft & Rijsbergen 94], pages 311–317.

Witten, I. H.; Moffat, A.; Bell, T. C. (1999). Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, 2nd edition.

Yoon, J. P.; Raghavan, V.; Chakilam, V. (2001). Bitmap Indexing-based Clustering and Retrieval of XML Documents. In: Dominich, S. (ed.): ACM SIGIR ’01 Workshop on Mathematical / Formal Methods in IR; Proceedings, pages 60–68. http://www.ucs.louisiana.edu/~vmc0583/sigIR_clustering_submitted.pdf.
