CS490W XML Data and Retrieval XML and Retrieval


CS490W: Semi-Structured Data
XML Data and Retrieval
Luo Si
Department of Computer Science
Purdue University

XML and Retrieval: Outline
- Semi-Structured Data
  - XML, Examples, Application
- XML Search
  - XQuery
  - XIRQL
- Text-Based XML Retrieval
  - Vector-space model
  - INEX

Semi-Structured Data
XML has been used as the standard representation of semi-structured data.
- eXtensible Markup Language is a W3C-recommended general-purpose markup language that supports a wide variety of applications.
- A framework for defining markup languages
- Open vocabulary for tags
- Each set of XML tags corresponds to a different application
- Facilitates the sharing of data across different information systems, particularly systems connected via the Internet
- Examples: RSS, XHTML, MathML

Structure of XML
- XML data is organized by documents, like unstructured data
- There are structures (nodes/tags) within the documents
- Each XML document is an ordered, labeled tree
- Element nodes are labeled with
  - Node name (e.g., chapter)
  - Node attributes and their values (e.g., size=1000; time=01/01/2007)
  - May have child nodes or data
- Data (e.g., text strings) exist within leaf nodes

XML Example
    <book id="ML_Tom">
      <title>Machine Learning</title>
      <author>
        <firstname>Tom</firstname>
        <surname>Mitchell</surname>
      </author>
      ...
      <p>Machine Learning Applications...</p>
      ...
    </book>
Elements, Attributes/Values, Data (Text String)

XML Example (tree view)
The same book document viewed as a tree.
[Figure: tree rooted at book, with nodes title, author (firstname, surname), chapter, chapter, title, para, para, para.]
Elements, Attributes/Values, Data (Text String)
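As a rough illustration of the "ordered, labeled tree" view, the sketch below parses a trimmed copy of the book example with Python's standard `xml.etree.ElementTree` module and lists each element's name, attributes, and leaf text. The helper name `describe` is hypothetical, and the parts the slide elides with "..." are simply left out; this is one possible way to inspect the tree, not something taken from the slides.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the slide's book example (the elided "..." parts are omitted).
BOOK_XML = """<book id="ML_Tom">
  <title>Machine Learning</title>
  <author>
    <firstname>Tom</firstname>
    <surname>Mitchell</surname>
  </author>
  <p>Machine Learning Applications...</p>
</book>"""


def describe(node, depth=0):
    """Return one line per element: name, attributes, and (for leaves) text."""
    line = "  " * depth + node.tag
    if node.attrib:
        line += " " + str(dict(node.attrib))
    text = (node.text or "").strip()
    if text and len(node) == 0:  # only leaf nodes carry data here
        line += ": " + text
    lines = [line]
    for child in node:  # children are visited in document order
        lines.extend(describe(child, depth + 1))
    return lines


root = ET.fromstring(BOOK_XML)
tree_lines = describe(root)
print("\n".join(tree_lines))
```

The indentation in the printed lines mirrors the depth of each node, so elements, attributes/values, and text data can be read off the same way as in the tree figure.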
Elements
- Elements are defined by markup tags
- Element: <TagName attr_a="value"…>text</TagName>
- The ID (name) of the element is TagName
- Attribute: attr_a; value: "value"
- Data/text: "text"
- End tag: </TagName>

Why XML?
- Unlike a relational database, XML data does not require relational schemata, etc., because the data itself contains this information.
- Unlike the widely used Web format HTML, which only ensures the correct presentation of the formatted data, XML also guarantees total usability of data.

XML, HTML, SGML
- 1986: SGML (ISO 8879-1986)
- Nov 1995: HTML 2.0
- Nov 1996: Simplified and stripped-down SGML draft (dubbed XML)
- Jan 1997: HTML 3.2
- Aug 1997: XML working draft
- Dec 1997: XML 1.0 proposed recommendation
- Jan 1998: XML
- Feb 1999: XHTML

XML and HTML
- Both are derived from SGML
- HTML is a markup language mainly for display in browsers
- XML is a framework for markup languages
- HTML defines display
- XML defines the data structure; display is separated from the content
- HTML can be formalized as XML (XHTML)

XML Applications
- CML: Chemical Markup Language
- WML: Wireless Markup Language
- ThML: Theological Markup Language

XML Applications: CML
CML (Chemical Markup Language) is a new approach to managing molecular information using tools such as XML and Java. It was the first domain-specific implementation based strictly on XML. Example opening tag:
    <molecule convention="MDLMol" id="baclofen" title="BACLOFEN">

XML Applications: WML
Wireless Markup Language is a content format for devices that implement the Wireless Application Protocol (WAP) specification, such as mobile phones.
    <?xml version="1.0"?>
    <!DOCTYPE wml PUBLIC "-//PHONE.COM//DTD WML 1.1//EN"
      "http://www.phone.com/dtd/wml11.dtd" >
    <wml>
      <card id="main" title="First Card">
        <p mode="wrap">This is a sample WML page.</p>
      </card>
    </wml>

XML Applications: ThML
Theological Markup Language.
    <ThML>
      <ThML.body>
        <div1>
          <div2 title="Genesis" id="Gen">
            <div3 title="Chapter 1">
              <p>
                <scripture/>
                In the beginning God created the heaven and the earth.
                <scripture/>
                And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.
              </p>
            </div3>
          </div2>
        </div1>
      </ThML.body>
    </ThML>

XML Files
A DTD example (the internal DOCTYPE declaration) followed by the XML document it describes:
    <?xml version="1.0"?>
    <!DOCTYPE note [
      <!ELEMENT note (to,from,heading,body)>
      <!ELEMENT to (#PCDATA)>
      <!ELEMENT from (#PCDATA)>
      <!ELEMENT heading (#PCDATA)>
      <!ELEMENT body (#PCDATA)>
    ]>
    <note>
      <to>Tove</to>
      <from>Jani</from>
      <heading>Reminder</heading>
      <body>Don't forget me this weekend!</body>
    </note>

XML Files: XML Schema
XML Schema is recommended by the W3C as the successor of DTDs. It is more informally referred to by the initialism for XML Schema instances, XSD (XML Schema Definition). XSDs are far more powerful than DTDs in describing XML languages.
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="country" type="Country"/>
      <xs:complexType name="Country">
        <xs:sequence>
          <xs:element name="name" type="xs:string"/>
          <xs:element name="population" type="xs:decimal"/>
        </xs:sequence>
      </xs:complexType>
    </xs:schema>

XML Files: Schema/DTD
- Schema/DTD: syntax definition of an XML language; Document Type Definition (DTD file)
XML provides an application-independent way of sharing data. With a DTD, independent groups of people can agree to use a common DTD for interchanging data. However, this is often NOT the case.
DTD example: the DOCTYPE declaration of the note document shown under "XML Files" above.

XML Search
- Most XML search protocols use a database-based approach
  - Non-text data match
  - Exact keyword (text) match
  - Evaluate XML path expressions
  - No concept of relevance

XML Search
- Traditional XML search from the database-based approach
  - XQuery
  - Searches multiple types of data: value-based (e.g., price of a book); ids (ISBN of a book); keyword match (text)
- XML text search from the information retrieval approach
  - XIRQL
  - Vector-space based
  - Searches text data: estimates the relevance of XML elements with respect to the query
  - Query may contain path expressions

XML Search
- XQuery
  - SQL for XML
  - Used for text-rich documents; data-oriented documents (non-text); mixed documents
  - Considers: path expressions (XPath); XML Schema datatypes
  - It is still a working draft; details are being improved

XML Search
- XQuery considers some principal forms
  - Path expressions
  - Conditional expressions
  - Datatype expressions
  - List expressions
  - etc.
- Programming language: Flowers (FLWOR) expression
- Principal forms can be evaluated with respect to context

Principal Forms
- Path query
    /book//title contains "Information Retrieval"
  The title of the book contains the keywords "Information Retrieval".
- Conditional expressions
    $h/title, IF $h/@type = "Journal" THEN ….
  Applies if the type of an article is journal.

Flowers (FLWR)
- Programming language: Flowers (FLWR) expression
The programming language XQuery defines FLWOR or FLWR (often pronounced as 'flower') as an expression that supports iteration and binding of variables to intermediate results.
  - for and let create a sequence of tuples
  - where filters the tuples on a boolean expression
  - order by sorts the tuples, using any comparable data
  - return gets evaluated once for every tuple

Flowers (FLWR)
    for $d in document("depts.xml")//deptno
    let $e := document("emps.xml")//employee[deptno = $d]
    where count($e) >= 10
    order by avg($e/salary) descending
    return
      <big-dept>
        { $d,
          <headcount>{count($e)}</headcount>,
          <avgsal>{avg($e/salary)}</avgsal> }
      </big-dept>

XML Search
- XQuery considers some principal forms and combines them with Flowers (FLWR). It is quite similar to SQL for relational databases.
- However, it does not have the concept of relevance, which is important for both text data (text-based information retrieval) and non-text data (fuzzy search).
  - Find a book about information retrieval
  - Find a book which is about $30

XML IR Challenges 1: Term Statistics
- There are multiple types of elements (books/titles/abstracts); how do we construct the corpus statistics (idf) for different elements?
- How do we handle the term frequency information?
  Example: for /book//title "information retrieval", do we consider the book abstract?
  Hierarchical smoothing

XML IR Challenges 2: Schemas
- Ideal case
  - There is a universal schema
  - Users can associate data types with the universal schema without ambiguity
  - Too ideal to be true…
- Real world
  - There are many schemas; different spellings; different concepts; different granularities (e.g., "auth" & "authors"; "abstract" & "description"; "abstract" & "keywords")

XML IR Challenges 3: User interface
- How to guide users to find relevant elements
  - Granularity control: Book -> Abstract -> Full Text
- What type of query language
  - Natural language query (IR approach): most usable
  - With structure information: more powerful but less usable
- How to do query expansion
  - How to automatically add structure information
    e.g., find a book written by J. K. Rowling -> find a book written by /../author (J. K. Rowling)
    Open research problem

XIRQL
Prof. Norbert Fuhr, University of Dortmund: open source XML search engine
XIRQL: a query language for information retrieval in XML documents
  - Structured Document Retrieval Principle
  - Users may not know the schema
    Allow users to search even if they do not know the schema of the data

Units
- Only atomic units can be returned
  Traditional IR treats documents as atomic units; XML retrieval takes a tree-like view of documents.
- XIRQL only indexes and returns atom-units
  - Atom-units can be leaf nodes that contain text information
  - Atom-units can be other internal nodes
  - Atom-units can be defined in the DTD
  - TF-IDF values are calculated based on atom-units

XIRQL Atom-Units
[Figure only; no slide text was recovered.]

XIRQL Summary
- Relevance ranking with respect to the structured document retrieval principle
- Recommends datatype-specific operators for different types of data
- Enables semantic roles

Structured Document Retrieval Principle
We should always rank the most specific/probable atom units for answering a query.
Example query: xql
Document:
    <chapter> 0.3 XQL
      <section> 0.5 example </section>
      <section> 0.8 XQL 0.7 syntax </section>
    </chapter>

Text-Based XML Retrieval
- Documents are marked up with XML tags
  Journal articles, conference papers, novels, manuals…
- Queries
  Plain text queries, queries with structures (keywords in the title or abstracts)
- Results
  The system automatically adjusts the granularity of the returned results.
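The Structured Document Retrieval Principle example (query "xql" against a chapter with two sections) can be made concrete with a small scoring sketch. The slides give only the term weights; how weights flow from a section up to its chapter is not stated. The sketch below assumes that a child's score is multiplied by a down-weighting factor (0.6 here, an arbitrary choice) and combined with the parent's own weight as a probabilistic OR. The class `Unit` and the functions `score` and `rank` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    """An atom-unit with its own term weights and nested child units."""
    name: str
    weights: dict
    children: list = field(default_factory=list)


def score(unit, term, augment):
    """Own weight for term, OR-combined with down-weighted child scores (assumed rule)."""
    p = unit.weights.get(term, 0.0)
    for child in unit.children:
        c = augment * score(child, term, augment)
        p = p + c - p * c  # probabilistic OR of two weights
    return p


def rank(root, term, augment=0.6):
    """Score every unit in the tree and return the non-zero ones, best first."""
    results = []
    stack = [root]
    while stack:
        unit = stack.pop()
        s = score(unit, term, augment)
        if s > 0:
            results.append((unit.name, round(s, 3)))
        stack.extend(unit.children)
    return sorted(results, key=lambda r: r[1], reverse=True)


# Weights copied from the slide's example document.
chapter = Unit("chapter", {"xql": 0.3}, [
    Unit("section[1]", {"example": 0.5}),
    Unit("section[2]", {"xql": 0.8, "syntax": 0.7}),
])
ranking = rank(chapter, "xql")
print(ranking)
```

Under these assumptions the second section (0.8) outranks the whole chapter (0.3 + 0.48 - 0.144 = 0.636), which matches the principle of preferring the most specific unit. A factor close to 1 would reverse the order, so the choice of factor is what controls the returned granularity in this sketch.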
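To show what each FLWOR clause contributes, the sketch below mirrors the depts/emps query from the "Flowers (FLWR)" slide in plain Python over two tiny in-memory XML documents. The data is invented for illustration, the function name `big_depts` is hypothetical, and the head-count threshold is a parameter (the slide uses 10) so the toy data can pass the filter. It is an approximation of the query's logic, not an XQuery engine.

```python
import xml.etree.ElementTree as ET
from statistics import mean

# Toy stand-ins for depts.xml and emps.xml (invented data, illustration only).
DEPTS = ET.fromstring(
    "<depts><deptno>D1</deptno><deptno>D2</deptno><deptno>D3</deptno></depts>"
)
EMPS = ET.fromstring(
    "<emps>"
    "<employee><deptno>D1</deptno><salary>100</salary></employee>"
    "<employee><deptno>D1</deptno><salary>200</salary></employee>"
    "<employee><deptno>D2</deptno><salary>300</salary></employee>"
    "<employee><deptno>D2</deptno><salary>500</salary></employee>"
    "<employee><deptno>D3</deptno><salary>900</salary></employee>"
    "</emps>"
)


def big_depts(depts, emps, min_headcount):
    """Return (deptno, headcount, average salary) tuples, highest average first."""
    rows = []
    for d in depts.iter("deptno"):                 # for $d in ...//deptno
        e = [x for x in emps.iter("employee")      # let $e := ...//employee[deptno = $d]
             if x.findtext("deptno") == d.text]
        if len(e) >= min_headcount:                # where count($e) >= N
            avgsal = mean(float(x.findtext("salary")) for x in e)
            rows.append((d.text, len(e), avgsal))
    rows.sort(key=lambda r: r[2], reverse=True)    # order by avg($e/salary) descending
    return rows                                    # return: one result per surviving tuple


result = big_depts(DEPTS, EMPS, min_headcount=2)
print(result)
```

Note that the result is an exact, unranked answer set: a department either passes the filter or it does not. That is the sense in which the slides say XQuery has no concept of relevance.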