CS490W: XML Data and Retrieval

Luo Si
Department of Computer Science
Purdue University

XML and Retrieval: Outline
• Semi-Structured Data
  - XML, Examples, Applications
• XML Search
  - XQuery
  - XIRQL
• Text-Based XML Retrieval
  - Vector-space model
  - INEX

Semi-Structured Data
XML has been used as the standard representation of semi-structured data.
• eXtensible Markup Language (XML) is a W3C-recommended general-purpose markup language that supports a wide variety of applications.
• A framework for defining markup languages
• Open vocabulary for tags
• Each set of XML tags corresponds to a different application
• Facilitates the sharing of data across different information systems, particularly systems connected via the Internet
• Examples: RSS, XHTML, MathML

Semi-Structured Data: Structure of XML
• XML data is organized into documents, like unstructured data
• There are structures (nodes/tags) within the documents
• Each XML document is an ordered, labeled tree
• Element nodes are labeled with
  - Node name (e.g., chapter)
  - Node attributes and their values (e.g., size=1000; time=01/01/2007)
  - May have child nodes or data
• Data (e.g., text strings) exist within leaf nodes

XML Example
    <book id="ML_Tom">
      <title>Machine Learning</title>
      <author>
        <firstname>Tom</firstname>
        <surname>Mitchell</surname>
      </author>
      ...
      <p>Machine Learning Applications...</p>
      ...
    </book>
Elements, Attributes/Values, Data (Text String)

XML Example
The same <book> document drawn as a tree (figure). Node labels in the figure: book, title, author, firstname, surname, chapter, chapter, title, para, para, para.
Elements, Attributes/Values, Data (Text String)

Elements
• Elements are defined by markup tags
• Elements: <TagName attr_a="value"…>text</TagName>
• The name that identifies the element is TagName
• Attribute: attr_a; Value: "value"
• Data/text: "text"
• End tag: </TagName>

XML, HTML, SGML
1986: SGML (ISO 8879-1986)
Nov 1995: HTML 2.0
Nov 1996: Simplified and stripped-down SGML draft (dubbed XML)
Jan 1997: HTML 3.2
Aug 1997: XML working draft
Dec 1997: XML 1.0 proposed recommendation
Jan 1998: XML
Feb 1999: XHTML

XML and HTML
• Both are derived from SGML
• HTML is a markup language mainly for display in browsers
• XML is a framework for markup languages
• HTML defines display
• XML defines the data structure; display is separated from the content
• HTML can be formalized as XML (XHTML)

Why XML?
• Unlike a relational database, XML data does not require relational schemata, etc., because the data itself contains this information.
• Unlike HTML, the widely used Web format that only ensures the correct presentation of formatted data, XML also describes the data itself, so the data stays usable by other applications.

XML Applications
• CML: Chemical Markup Language
• WML: Wireless Markup Language
• ThML: Theological Markup Language

XML Applications
• CML: Chemical Markup Language
CML is a new approach to managing molecular information using tools such as XML and Java. It was the first domain-specific implementation based strictly on XML.
    <molecule convention="MDLMol" id="baclofen" title="BACLOFEN">

XML Applications
• WML: Wireless Markup Language
WML is a content format for devices that implement the Wireless Application Protocol (WAP) specification, such as mobile phones.
    <?xml version="1.0"?>
    <!DOCTYPE wml PUBLIC "-//PHONE.COM//DTD WML 1.1//EN"
      "http://www.phone.com/dtd/wml11.dtd" >
    <wml>
      <card id="main" title="First Card">
        <p mode="wrap">This is a sample WML page.</p>
      </card>
    </wml>

XML Applications
• ThML: Theological Markup Language
    <ThML>
      <ThML.body>
        <div1>
          <div2 title="Genesis" id="Gen">
            <div3 title="Chapter 1">
              <p>
                <scripture/>
                In the beginning God created the heaven and the earth.
                <scripture/>
                And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.
              </p>
            </div3>
          </div2>
        </div1>
      </ThML.body>
    </ThML>

XML Files
• Schema/DTD: syntax definition of an XML language; Document Type Definition (DTD file)
XML provides an application-independent way of sharing data. With a DTD, independent groups of people can agree to use a common DTD for interchanging data. However, this is often NOT the case.
DTD Example:
    <?xml version="1.0"?>
    <!DOCTYPE note [
      <!ELEMENT note (to,from,heading,body)>
      <!ELEMENT to (#PCDATA)>
      <!ELEMENT from (#PCDATA)>
      <!ELEMENT heading (#PCDATA)>
      <!ELEMENT body (#PCDATA)>
    ]>

XML Files
A single file that holds the DTD followed by the XML document.
DTD Example:
    <?xml version="1.0"?>
    <!DOCTYPE note [
      <!ELEMENT note (to,from,heading,body)>
      <!ELEMENT to (#PCDATA)>
      <!ELEMENT from (#PCDATA)>
      <!ELEMENT heading (#PCDATA)>
      <!ELEMENT body (#PCDATA)>
    ]>
XML Document:
    <note>
      <to>Tove</to>
      <from>Jani</from>
      <heading>Reminder</heading>
      <body>Don't forget me this weekend!</body>
    </note>

XML Files
• XML Schema:
XML Schema is recommended by the W3C as the successor to DTDs. It is often referred to informally as XSD (XML Schema Definition), the initialism for XML Schema instances. XSDs are far more powerful than DTDs in describing XML languages.
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="country" type="Country"/>
      <xs:complexType name="Country">
        <xs:sequence>
          <xs:element name="name" type="xs:string"/>
          <xs:element name="population" type="xs:decimal"/>
        </xs:sequence>
      </xs:complexType>
    </xs:schema>

XML Search
• Most XML search protocols use a database-based approach
  - Non-text data match
  - Exact keyword (text) match
  - Evaluate XML path expressions
  - No concept of relevance

XML Search
• Traditional XML search: the database-based approach
  - XQuery
  - Searches multiple types of data: value-based (e.g., price of a book); ids (ISBN of a book); keyword match (text)
• XML text search: the information retrieval approach
  - XIRQL
  - Vector-space based
  - Searches text data: estimates the relevance of XML elements with respect to the query
  - Query may contain path expressions

XML Search
• XQuery
  - SQL for XML
  - Used for text-rich documents; data-oriented documents (non-text); mixed documents
  - Considers: path expressions (XPath); XML Schema datatypes
  - It is still a working draft; details are being improved

XML Search
• XQuery considers some principal forms
  - Path expressions
  - Conditional expressions
  - Datatype expressions
  - List expressions
  - etc.
• Programming language: Flowers (FLWOR) expression
• Principal forms can be evaluated with respect to a context

Principal Forms
• Path query
    /book//title contains "Information Retrieval"
  The title of the book contains the keywords "Information Retrieval".
• Conditional expressions
    $h/title, IF $h/@type = "Journal" THEN ….
  Applies if the type of an article is journal.

Flowers (FLWR)
• Programming language: Flowers (FLWR) expression
XQuery defines the FLWOR or FLWR expression (often pronounced 'flower'), which supports iteration and binding of variables to intermediate results.
  - for and let create a sequence of tuples
  - where filters the tuples on a boolean expression
  - order by sorts the tuples, using any comparable data
  - return gets evaluated once for every tuple

Flowers (FLWR)
    for $d in document("depts.xml")//deptno
    let $e := document("emps.xml")//employee[deptno = $d]
    where count($e) >= 10
    order by avg($e/salary) descending
    return
      <big-dept>
        { $d,
          <headcount>{count($e)}</headcount>,
          <avgsal>{avg($e/salary)}</avgsal> }
      </big-dept>

XML Search
• XQuery considers some principal forms and combines them with Flowers (FLWR)
  It is quite similar to SQL for relational databases.
• However, it does not have the concept of relevance, which is important for both text data (text-based information retrieval) and non-text data (fuzzy search).
  Find a book about information retrieval.
  Find a book which is about $30.

XML IR Challenges 1: Term Statistics
• There are multiple types of elements: books/titles/abstracts; how do we construct the corpus statistics (idf) for different elements?
• How do we handle the term frequency information?
  Example: /book//title "information retrieval"; do we consider the book abstract? Hierarchical smoothing

XML IR Challenges 2: Schemas
• Ideal case
  - There is a universal schema
  - Users can associate data types with the universal schema without ambiguity
  - Too ideal to be true…
• Real world
  - There are many schemas; different spellings; different concepts; different granularities (e.g., "auth" & "authors"; "abstract" & "description"; "abstract" & "keywords")

XML IR Challenges 3: User interface
• How to guide users to find relevant elements
  - Granularity control: Book -> Abstract -> Full Text
• What type of query language
  - Natural language query (IR approach): most usable
  - With structure information: more powerful but less usable
• How to do query expansion
  - How to automatically add structure information
    e.g., "find a book written by J. K. Rowling" -> find a book written by /../author (J. K. Rowling)
    An open research problem

XIRQL
Prof. Norbert Fuhr, University of Dortmund: open source XML search engine
XIRQL: a query language for information retrieval in XML documents
  - Structured Document Retrieval Principle
  - Users may not know the schema
    Allow users to search even if they do not know the schema of the data

Units
• Only atomic units can be returned
  Traditional IR treats documents as atomic units; XML retrieval works with a tree-like view of documents.
• XIRQL only indexes and returns atom-units
  - Atom-units can be leaf nodes that contain text information
  - Atom-units can be other internal nodes
  - Atom-units can be defined in the DTD
  - TF-IDF values are calculated based on atom-units

XIRQL Atom-Units

Structured Document Retrieval Principle
We should always rank the most specific/probable atom units for answering a query.
Example query: xql
Document (numbers are term weights):
    <chapter> 0.3 XQL
      <section> 0.5 example </section>
      <section> 0.8 XQL 0.7 syntax </section>
    </chapter>

XIRQL Summary
• Relevance ranking that follows the Structured Document Retrieval Principle
• Recommends datatype-specific operators for different types of data
• Enables semantic roles

Text-Based XML Retrieval
• Documents are marked up with XML tags
  Journal articles, conference papers, novels, manuals…
• Queries
  Plain text queries, queries with structure (keywords in the title or abstract)
• Results
  The system automatically adjusts the granularity of the returned results.
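The Structured Document Retrieval Principle example (query xql against a chapter with two sections) can be sketched in Python. This is only an illustration under stated assumptions, not XIRQL's actual algorithm: the `AtomUnit` structure, the `rank_units` helper, the path strings, and the tie-break on depth are all hypothetical names and choices made for this sketch. It scores each atom unit by its own indexing weight for the query term and ignores any propagation of weights between a section and its enclosing chapter.

```python
from dataclasses import dataclass, field


@dataclass
class AtomUnit:
    """One indexable unit (hypothetical structure for this sketch)."""
    path: str    # location of the element, e.g. "/chapter/section[2]"
    depth: int   # nesting depth; a deeper unit is a more specific one
    weights: dict = field(default_factory=dict)  # term -> indexing weight


def rank_units(units, term):
    """Rank atom units for a single-term query.

    Units that do not contain the term are dropped. The rest are sorted by
    weight, highest first; ties go to the deeper (more specific) unit.
    """
    term = term.lower()
    hits = [u for u in units if term in u.weights]
    return sorted(hits, key=lambda u: (-u.weights[term], -u.depth))


# Term weights copied from the example document on the slide.
units = [
    AtomUnit("/chapter", 1, {"xql": 0.3}),
    AtomUnit("/chapter/section[1]", 2, {"example": 0.5}),
    AtomUnit("/chapter/section[2]", 2, {"xql": 0.8, "syntax": 0.7}),
]

ranking = rank_units(units, "XQL")
for unit in ranking:
    print(f"{unit.weights['xql']:.1f}  {unit.path}")
```

For the query xql, the second section (weight 0.8) ranks above the whole chapter (weight 0.3), so the more specific unit is returned first. The first section does not contain the term and is not returned at all.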
