CS490W XML Data and Retrieval XML and Retrieval
CS490W: XML Data and Retrieval
Luo Si
Department of Computer Science, Purdue University

XML and Retrieval: Outline

- Semi-Structured Data
  - XML, Examples, Applications
- XML Search
  - XQuery
  - XIRQL
- Text-Based XML Retrieval
  - Vector-space model
  - INEX

Semi-Structured Data

XML has been used as the standard representation of semi-structured data.

- eXtensible Markup Language (XML) is a W3C-recommended general-purpose markup language that supports a wide variety of applications.
- A framework for defining markup languages
- Open vocabulary for tags
- Each XML vocabulary (set of tags) corresponds to a different application
- Facilitates the sharing of data across different information systems, particularly systems connected via the Internet
- Examples: RSS, XHTML, MathML

Semi-Structured Data: Structure of XML

- XML data is organized into documents, like unstructured data
- There are structures (nodes/tags) within the documents
- Each XML document is an ordered, labeled tree
- Element nodes are labeled with
  - Node name (e.g., chapter)
  - Node attributes and their values (e.g., size=1000; time=01/01/2007)
  - May have child nodes or data
- Data (e.g., text strings) exists within leaf nodes

XML Example

    <book id="ML_Tom">
      <title>Machine Learning</title>
      <author>
        <firstname>Tom</firstname>
        <surname>Mitchell</surname>
      </author>
      ...
      <p>Machine Learning Applications...</p>
      ...
    </book>

Elements, Attributes/Values, Data (Text String)

[Figure: tree view of the example document. Node labels: book; title; author; firstname; surname; chapter; chapter; title; para; para; para]

Elements

- Elements are defined by markup tags
- Element: <TagName attr_a="value" ...>text</TagName>
- The ID of the element is its tag name, TagName
- Attribute: attr_a; value: "value"
- Data/text: "text"
- End tag: </TagName>

XML, HTML, SGML

- 1986: SGML (ISO 8879:1986)
- Nov 1995: HTML 2.0
- Nov 1996: Simplified and stripped-down SGML draft (dubbed XML)
- Jan 1997: HTML 3.2
- Aug 1997: XML working draft
- Dec 1997: XML 1.0 proposed recommendation
- Jan 1998: XML
- Feb 1999: XHTML

XML and HTML

- Both are derived from SGML
- HTML is a markup language mainly for display in browsers
- XML is a framework for markup languages
- HTML defines display
- XML defines the data structure; presentation is separated from the content
- HTML can be formalized as XML (XHTML)

Why XML?

- Unlike a relational database, XML data does not require a separate relational schema, because the data itself carries its structural information.
- Unlike HTML, the widely used Web format, which only ensures correct presentation of the formatted data, XML also describes the structure of the data, so the data remains usable by other applications.

XML Applications

- CML: Chemical Markup Language
- WML: Wireless Markup Language
- ThML: Theological Markup Language

XML Applications: CML

CML (Chemical Markup Language) is an approach to managing molecular information using tools such as XML and Java. It was the first domain-specific implementation based strictly on XML.

    <molecule convention="MDLMol" id="baclofen" title="BACLOFEN">

XML Applications: WML

Wireless Markup Language (WML) is a content format for devices that implement the Wireless Application Protocol (WAP) specification, such as mobile phones.

    <?xml version="1.0"?>
    <!DOCTYPE wml PUBLIC "-//PHONE.COM//DTD WML 1.1//EN"
      "http://www.phone.com/dtd/wml11.dtd">
    <wml>
      <card id="main" title="First Card">
        <p mode="wrap">This is a sample WML page.</p>
      </card>
    </wml>

XML Applications: ThML

    <ThML>
      <ThML.body>
        <div1>
          <div2 title="Genesis" id="Gen">
            <div3 title="Chapter 1">
              <p>
                <scripture/>
                In the beginning God created the heaven and the earth.
                <scripture/>
                And the earth was without form, and void; and darkness was
                upon the face of the deep. And the Spirit of God moved upon
                the face of the waters.
              </p>
            </div3>
          </div2>
        </div1>
      </ThML.body>
    </ThML>

XML Files

- Schema/DTD: syntax definition of an XML language; Document Type Definition (DTD file)

XML provides an application-independent way of sharing data. With a DTD, independent groups of people can agree to use a common DTD for interchanging data. However, this is often NOT the case.

DTD Example

    <?xml version="1.0"?>
    <!DOCTYPE note [
      <!ELEMENT note (to,from,heading,body)>
      <!ELEMENT to (#PCDATA)>
      <!ELEMENT from (#PCDATA)>
      <!ELEMENT heading (#PCDATA)>
      <!ELEMENT body (#PCDATA)>
    ]>

XML Document

    <note>
      <to>Tove</to>
      <from>Jani</from>
      <heading>Reminder</heading>
      <body>Don't forget me this weekend!</body>
    </note>

XML Files

- XML Schema: recommended by the W3C as the successor of DTDs; more informally referred to by the initialism for XML Schema instances, XSD (XML Schema Definition). XSDs are far more powerful than DTDs in describing XML languages.

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="country" type="Country"/>
      <xs:complexType name="Country">
        <xs:sequence>
          <xs:element name="name" type="xs:string"/>
          <xs:element name="population" type="xs:decimal"/>
        </xs:sequence>
      </xs:complexType>
    </xs:schema>

XML Search

- Most XML search protocols use a database-based approach
  - Non-text data match
  - Exact keyword (text) match
  - Evaluation of XML path expressions
  - No concept of relevance

XML Search

- Traditional XML search: database-based approach
  - XQuery
  - Searches multiple types of data: values (e.g., price of a book); ids (e.g., ISBN of a book); keyword match (text)
- XML text search: information retrieval approach
  - XIRQL
  - Vector-space based
  - Searches text data: estimates the relevance of XML elements with respect to the query
  - Query may contain path expressions

XML Search: XQuery

- XQuery
  - SQL for XML
  - Used for text-rich documents; data-oriented documents (non-text); mixed documents
  - Considers: path expressions (XPath); XML Schema datatypes
  - It is still a working draft; details are being improved

XML Search: XQuery

- XQuery considers some principal forms
  - Path expressions
  - Conditional expressions
  - Datatype expressions
  - List expressions
  - etc.
- Programming language: Flowers (FLWOR) expression
- Principal forms can be evaluated with respect to a context

Principal Forms

- Path query

      /book//title contains "Information Retrieval"

  The title of the book contains the keywords "Information Retrieval".

- Conditional expression

      $h/title, IF $h/@type = "Journal" THEN ...

  Applies if the type of an article is journal.

Flowers (FLWR)

- Programming language: Flowers (FLWR) expression

XQuery defines FLWOR, or FLWR (often pronounced "flower"), as an expression that supports iteration and binding of variables to intermediate results.
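
As a rough analogy only (this is Python, not XQuery), the idea of iterating over items, binding a variable to an intermediate result, filtering, ordering, and returning one result per binding can be sketched as follows. The employee data, the `DEPTS` list, and the `big_depts` function are all hypothetical names invented for this illustration.

```python
import xml.etree.ElementTree as ET
from statistics import mean

# Hypothetical data; element names are made up for this illustration.
EMPS = ET.fromstring("""
<emps>
  <employee><deptno>10</deptno><salary>50000</salary></employee>
  <employee><deptno>10</deptno><salary>70000</salary></employee>
  <employee><deptno>20</deptno><salary>40000</salary></employee>
</emps>
""")
DEPTS = ["10", "20", "30"]


def big_depts(min_headcount):
    """Return (deptno, headcount, average salary) rows, highest average first."""
    rows = []
    for d in DEPTS:  # "for": iterate, binding d to each department number
        # "let": bind e to an intermediate result (the employees of department d)
        e = [emp for emp in EMPS.iter("employee") if emp.findtext("deptno") == d]
        if len(e) >= min_headcount:  # "where": keep only bindings passing the test
            avg = mean(float(emp.findtext("salary")) for emp in e)
            rows.append((d, len(e), avg))  # "return": one row per kept binding
    # "order by": sort the rows on average salary, descending
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows
```

Calling `big_depts(2)` keeps only department 10, since it is the only one with at least two employees in the sample data. XQuery itself sorts the bindings before the return clause is evaluated; the sketch sorts the finished rows, which gives the same order here.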
FLWOR clauses:

- for and let create a sequence of tuples
- where filters the tuples on a boolean expression
- order by sorts the tuples, using any comparable data
- return is evaluated once for every tuple

Flowers (FLWR)

    for $d in document("depts.xml")//deptno
    let $e := document("emps.xml")//employee[deptno = $d]
    where count($e) >= 10
    order by avg($e/salary) descending
    return
      <big-dept>
        { $d,
          <headcount>{count($e)}</headcount>,
          <avgsal>{avg($e/salary)}</avgsal> }
      </big-dept>

XML Search

- XQuery considers some principal forms and combines them with Flowers (FLWR) expressions. It is quite similar to SQL for relational databases.
- However, it does not have the concept of relevance, which is important for both text data (text-based information retrieval) and non-text data (fuzzy search).
  - Find a book about information retrieval
  - Find a book which is about $30

XML IR Challenges 1: Term Statistics

- There are multiple types of elements (books, titles, abstracts): how should the corpus statistics (idf) be constructed for different elements?
- How do we handle the term frequency information?
  Example: for /book//title "information retrieval", do we also consider the book abstract?
- Hierarchical smoothing

XML IR Challenges 2: Schemas

- Ideal case
  - There is a universal schema
  - Users can associate data types with the universal schema without ambiguity
  - Too ideal to be true...
- Real world
  - There are many schemas, with different spellings, different concepts, and different granularities (e.g., "auth" & "authors"; "abstract" & "description"; "abstract" & "keywords")

XML IR Challenges 3: User Interface

- How to guide the user to find relevant elements
  - Granularity control: Book -> Abstract -> Full Text
- What type of query language
  - Natural language query (IR approach): most usable
  - With structure information: more powerful but less usable
- How to do query expansion
  - How to automatically add structure information, e.g., "find a book written by J. K. Rowling" -> find a book written by /../author (J. K. Rowling): an open research problem

XIRQL

- Prof. Norbert Fuhr, University of Dortmund: open-source XML search engine
- XIRQL: a query language for information retrieval in XML documents
  - Structured document retrieval principle
  - Users may not know the schema: allow users to search even if they do not know the schema of the data

Units

- Only atomic units can be returned. Traditional IR treats documents as atomic units; XML retrieval takes a tree-like view of documents.
- XIRQL only indexes and returns atom-units
  - Atom-units can be leaf nodes that contain text information
  - Atom-units can be other internal nodes
  - Atom-units can be defined in the DTD
  - TF-IDF values are calculated based on atom-units

XIRQL Atom-Units

Structured Document Retrieval
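
The idea of computing TF-IDF over atom-units rather than whole documents can be illustrated with a minimal sketch. This is not XIRQL's actual implementation: the sample document, the `ATOM_TAGS` set, the simple tokenizer, and the plain tf × log(N/df) weighting are all assumptions chosen for illustration.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample document; the tag names are illustrative only.
DOC = """
<book id="ML_Tom">
  <title>Machine Learning</title>
  <chapter>
    <title>Decision tree learning</title>
    <para>Decision trees classify instances by sorting them down a tree.</para>
    <para>Learning a tree is a search over hypotheses.</para>
  </chapter>
  <chapter>
    <title>Neural networks</title>
    <para>Neural networks learn real valued functions from examples.</para>
  </chapter>
</book>
"""

# Assumption: the index designer declares which tags are atom-units
# (the slides note they can be defined in the DTD; here it is a plain set).
ATOM_TAGS = {"chapter"}


def tokenize(text):
    """Lowercase whitespace tokenization with surrounding punctuation stripped."""
    tokens = (word.strip(".,;:!?").lower() for word in text.split())
    return [t for t in tokens if t]


def atom_units(root):
    """Return (label, tokens) for every element whose tag is an atom-unit."""
    matches = (e for e in root.iter() if e.tag in ATOM_TAGS)
    units = []
    for i, elem in enumerate(matches, start=1):
        # itertext() gathers the text of the element and all its descendants.
        units.append((f"{elem.tag}[{i}]", tokenize(" ".join(elem.itertext()))))
    return units


def tfidf_index(units):
    """Compute tf * idf per term, treating each atom-unit as a 'document'."""
    n = len(units)
    df = Counter(term for _, tokens in units for term in set(tokens))
    index = {}
    for label, tokens in units:
        tf = Counter(tokens)
        index[label] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return index


def score(index, query):
    """Rank atom-units by the summed tf-idf weight of the query terms."""
    terms = tokenize(query)
    ranked = [(sum(w.get(t, 0.0) for t in terms), label) for label, w in index.items()]
    return sorted(ranked, reverse=True)


index = tfidf_index(atom_units(ET.fromstring(DOC)))
ranking = score(index, "decision tree")
```

Here the two `chapter` elements are the retrievable units, so document frequency is counted across chapters and a query is answered with chapters rather than with the whole book. Changing `ATOM_TAGS` to `{"para"}` would change both the statistics and the granularity of the results.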