CS490W: XML Data and Retrieval

Luo Si
Department of Computer Science
Purdue University

XML and Retrieval: Outline
- Semi-Structured Data
  - XML, Examples, Applications
- XML Search
  - XQuery
  - XIRQL
- Text-Based XML Retrieval
  - Vector-space model
  - INEX

Semi-Structured Data
XML has been used as the standard representation of semi-structured data.
- eXtensible Markup Language is a W3C-recommended general-purpose markup language that supports a wide variety of applications.
- A framework for defining markup languages
- Open vocabulary for tags
- Each set of XML corresponds to different applications
- Facilitates the sharing of data across different information systems, particularly systems connected via the Internet
- Examples: RSS, XHTML, MathML

Structure of XML
- XML data is organized by documents, like unstructured data
- There are structures (nodes/tags) within the documents
- Each XML document is an ordered, labeled tree
- Element nodes are labeled with
  - Node name (e.g., chapter)
  - Node attributes and their values (e.g., size=1000; time=01/01/2007)
  - May have child nodes or data
- Data exist (e.g., text strings) within leaf nodes

XML Example
    <book id="ML_Tom">
      <title>Machine Learning</title>
      <author>
        <firstname>Tom</firstname>
        <surname>Mitchell</surname>
      </author>
      ...
      <p>Machine Learning Applications...</p>
      ...
    </book>
Elements, Attributes/Values, Data (Text String)

XML Example (tree view)
The same <book id="ML_Tom"> document as above, drawn as a tree.
[Figure: tree of the example document. Node labels: book, title, author, firstname, surname, chapter, chapter, title, para, para, para]
Elements, Attributes/Values, Data (Text String)

Elements
- Elements are defined by markup tags
- Elements: <TagName attr_a="value"…>text</TagName>
- ID of the element is TagName
- Attribute: attr_a; Value: "value"
- Data/text: "text"
- End tag: </TagName>

XML, HTML, SGML
- 1986: SGML (ISO 8879:1986)
- Nov 1995: HTML 2.0
- Nov 1996: Simplified and stripped-down SGML draft (dubbed XML)
- Jan 1997: HTML 3.2
- Aug 1997: XML working draft
- Dec 1997: XML 1.0 proposed recommendation
- Jan 1998: XML
- Feb 1999: XHTML

XML and HTML
- Both are derived from SGML
- HTML is a markup language mainly for display in browsers
- XML is a framework for markup languages
- HTML defines display
- XML defines the data structure; the display factor is separated from the content
- HTML can be formalized as XML (XHTML)

Why XML?
- Unlike relational databases, XML data does not require relational schemata, etc., because the data itself contains this information.
- Unlike the widely used Web format HTML, which only ensures the correct presentation of the formatted data, XML also guarantees total usability of data.

XML Applications
- CML: chemical markup language
- WML: wireless markup language
- ThML: theological markup language

XML Applications
- CML: chemical markup language
CML (Chemical Markup Language) is a new approach to managing molecular information using tools such as XML and Java. It was the first domain-specific implementation based strictly on XML.
    <molecule convention="MDLMol" id="baclofen" title="BACLOFEN">

XML Applications
- WML: wireless markup language
Wireless Markup Language is a content format for devices that implement the Wireless Application Protocol (WAP) specification, such as mobile phones.
    <?xml version="1.0"?>
    <!DOCTYPE wml PUBLIC "-//PHONE.COM//DTD WML 1.1//EN"
      "http://www.phone.com/dtd/wml11.dtd" >
    <wml>
      <card id="main" title="First Card">
        <p mode="wrap">This is a sample WML page.</p>
      </card>
    </wml>

XML Applications
- ThML: theological markup language
    <ThML>
      <ThML.body>
        <div1>
          <div2 title="Genesis" id="Gen">
            <div3 title="Chapter 1">
              <p>
                <scripture/>
                In the beginning God created the heaven and the earth.
                <scripture/>
                And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.
              </p>
            </div3>
          </div2>
        </div1>
      </ThML.body>
    </ThML>

XML Files
- Schema/DTD: syntax definition of an XML language; Document Type Definition (DTD file)
XML provides an application-independent way of sharing data. With a DTD, independent groups of people can agree to use a common DTD for interchanging data. However, this is often NOT the case.
DTD Example:
    <?xml version="1.0"?>
    <!DOCTYPE note [
      <!ELEMENT note (to,from,heading,body)>
      <!ELEMENT to (#PCDATA)>
      <!ELEMENT from (#PCDATA)>
      <!ELEMENT heading (#PCDATA)>
      <!ELEMENT body (#PCDATA)>
    ]>

XML Files
DTD Example:
    <?xml version="1.0"?>
    <!DOCTYPE note [
      <!ELEMENT note (to,from,heading,body)>
      <!ELEMENT to (#PCDATA)>
      <!ELEMENT from (#PCDATA)>
      <!ELEMENT heading (#PCDATA)>
      <!ELEMENT body (#PCDATA)>
    ]>
XML Document:
    <note>
      <to>Tove</to>
      <from>Jani</from>
      <heading>Reminder</heading>
      <body>Don't forget me this weekend!</body>
    </note>

XML Files
- XML Schema:
Recommended by the W3C as the successor of DTDs, more informally referred to by the initialism for XML Schema instances, XSD (XML Schema Definition). XSDs are far more powerful than DTDs in describing XML languages.
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="country" type="Country"/>
      <xs:complexType name="Country">
        <xs:sequence>
          <xs:element name="name" type="xs:string"/>
          <xs:element name="population" type="xs:decimal"/>
        </xs:sequence>
      </xs:complexType>
    </xs:schema>

XML Search
- Most XML search protocols use a database-based approach
  - Non-text data match
  - Exact keyword (text) match
  - Evaluate XML path expressions
  - No concept of relevance

XML Search
- Traditional XML search: the database-based approach
  - XQuery
  - Searches multiple types of data: value-based (e.g., price of a book); ids (ISBN of a book); keyword match (text)
- XML text search: the information retrieval approach
  - XIRQL
  - Vector-space based
  - Searches text data: estimates the relevance of XML elements with respect to the query
  - Query may contain path expressions

XML Search
- XQuery
  - SQL for XML
  - Used for text-rich documents; data-oriented documents (non-text); mixed documents
  - Considers: path expressions (XPath); XML Schema datatypes
  - It is still a working draft; details are being improved

XML Search
- XQuery considers some principal forms
  - Path expressions
  - Conditional expressions
  - Datatype expressions
  - List expressions
  - etc.
- Programming language: Flowers (FLWOR) expression
- Principal forms can be evaluated with respect to context

Principal Forms
- Path query
    /book//title contains "Information Retrieval"
  The title of the book contains the keywords "Information Retrieval".
- Conditional expressions
    $h/title, IF $h/@type = "Journal" THEN ….
  If the type of an article is journal.

Flowers (FLWR)
- Programming language: Flowers (FLWR) expression
The programming language XQuery defines FLWOR or FLWR (often pronounced 'flower') as an expression that supports iteration and binding of variables to intermediate results.
  - for and let create a sequence of tuples
  - where filters the tuples on a boolean expression
  - order by sorts the tuples, using any comparable data
  - return gets evaluated once for every tuple

Flowers (FLWR)
    for $d in document("depts.xml")//deptno
    let $e := document("emps.xml")//employee[deptno = $d]
    where count($e) >= 10
    order by avg($e/salary) descending
    return
      <big-dept>
        { $d,
          <headcount>{count($e)}</headcount>,
          <avgsal>{avg($e/salary)}</avgsal> }
      </big-dept>

XML Search
- XQuery considers some principal forms and combines them with Flowers (FLWR). It is quite similar to SQL for relational databases.
- However, it does not have the concept of relevance, which is important for both text data (text-based information retrieval) and non-text data (fuzzy search).
  - Find a book about information retrieval
  - Find a book which is about $30.

XML IR Challenges 1: Term Statistics
- There are multiple types of elements: books/titles/abstracts; how to construct the corpus statistics (idf) for different elements?
- How do we handle the term frequency information?
  Example: /book//title "information retrieval": do we consider the book abstract? Hierarchical smoothing.

XML IR Challenges 2: Schemas
- Ideal case
  - There is a universal schema
  - Users can associate data types with the universal schema without ambiguity
  - Too ideal to be true…
- Real-world information
  - There are many schemas; different spellings; different concepts; different granularities (e.g., "auth" & "authors"; "abstract" & "description"; "abstract" & "keywords")

XML IR Challenges 3: User interface
- How to guide users to find relevant elements
  - Granularity control: Book -> Abstract -> Full Text
- What type of query language
  - Natural language query (IR approach): most usable
  - With structure information: more powerful but less usable
- How to do query expansion
  - How to automatically add structure information
    e.g., "find a book written by J. K. Rowling" -> find a book written by /../author (J. K. Rowling); open research problem

XIRQL
Prof. Norbert Fuhr, University of Dortmund: open source XML search engine
XIRQL: a query language for information retrieval in XML documents
  - Structured Document Retrieval Principle
  - Users may not know the schema
    Allow users to search even if they do not know the schema of the data

Units
- Only atomic units can be returned
  Traditional IR treats documents as atomic units; XML takes a tree-like view of documents.
- XIRQL only indexes and returns atom-units
  - Atom-units can be leaf nodes that contain text
  - Atom-units can be other internal nodes
  - Atom-units can be defined in the DTD
  - TF-IDF values are calculated based on atom-units

XIRQL Atom-Units
[Figure: XIRQL atom-units]

Structured Document Retrieval Principle
We should always rank the most specific/probable atom units for answering a query.
Example query: xql
Document:
    <chapter> 0.3 XQL
      <section> 0.5 example </section>
      <section> 0.8 XQL 0.7 syntax </section>
    </chapter>

XIRQL Summary
- Relevance ranking with respect to the structured document retrieval principle
- Recommends datatype-specific operators for different types of data
- Enables semantic roles

Text-Based XML Retrieval
- Documents are marked up with XML tags
  journal articles, conference papers, novels, manuals…
- Queries
  plain text queries, queries with structures (keywords in the title or abstracts)
- Results
  The system automatically adjusts the granularity of the returned results.
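
The Structured Document Retrieval Principle example (query "xql" against a chapter with two sections) can be worked through in a few lines of Python. This is only a sketch under stated assumptions: the `AtomUnit` class, the path labels, and the rule of breaking ties in favor of the deeper (more specific) unit are hypothetical choices made for illustration, and the weights are simply the ones shown on the slide. It is not XIRQL's actual weighting or propagation model.

```python
from dataclasses import dataclass, field


@dataclass
class AtomUnit:
    """One indexable unit of the document tree (hypothetical structure)."""
    path: str                      # illustrative location label, not real XIRQL syntax
    depth: int                     # deeper units are treated as more specific
    weights: dict = field(default_factory=dict)  # term -> indexing weight


def rank_units(units, term):
    """Rank units containing `term` by weight; ties go to the deeper unit (assumption)."""
    term = term.lower()
    hits = [(u.weights[term], u.depth, u.path) for u in units if term in u.weights]
    hits.sort(key=lambda h: (-h[0], -h[1]))
    return [(path, weight) for weight, _, path in hits]


# Weights copied from the slide's example document.
units = [
    AtomUnit("/chapter", 1, {"xql": 0.3}),
    AtomUnit("/chapter/section[1]", 2, {"example": 0.5}),
    AtomUnit("/chapter/section[2]", 2, {"xql": 0.8, "syntax": 0.7}),
]

ranking = rank_units(units, "XQL")
print(ranking)
```

Under these assumptions the second section (weight 0.8) is ranked ahead of the enclosing chapter (weight 0.3), which matches the intent of returning the most specific unit that answers the query.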
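
The "ordered, labeled tree" view from the Structure of XML slide, and the exact keyword match behind a path query such as /book//title, can be tried with Python's standard `xml.etree.ElementTree` module. This is a minimal sketch: the `walk` and `titles_containing` helpers are names invented here, the elided "..." parts of the slide's book example are left out rather than filled in, and the keyword test is a plain case-insensitive substring match with no notion of relevance.

```python
import xml.etree.ElementTree as ET

# The book example from the slides, without the elided "..." parts.
BOOK_XML = """
<book id="ML_Tom">
  <title>Machine Learning</title>
  <author>
    <firstname>Tom</firstname>
    <surname>Mitchell</surname>
  </author>
  <p>Machine Learning Applications...</p>
</book>
"""


def walk(node, depth=0):
    """Yield (depth, tag, attributes, text) for each element in document order."""
    text = (node.text or "").strip()
    yield depth, node.tag, dict(node.attrib), text
    for child in node:
        yield from walk(child, depth + 1)


def titles_containing(root, keyword):
    """Return the text of every <title> at or below root that contains keyword."""
    keyword = keyword.lower()
    return [t.text for t in root.iter("title")
            if t.text and keyword in t.text.lower()]


root = ET.fromstring(BOOK_XML)
nodes = list(walk(root))
for depth, tag, attrs, text in nodes:
    # Indentation mirrors the depth of each node in the tree.
    print("  " * depth + tag, attrs, repr(text))

matches = titles_containing(root, "machine learning")
print(matches)
```

Element names, attribute/value pairs, and text strings appear as separate fields, which is the Elements, Attributes/Values, Data split the slides describe. The match is all-or-nothing, which illustrates why the database-style approach has no concept of relevance.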
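
The four FLWR clauses (for/let, where, order by, return) map onto ordinary list processing, which may help when reading the department query on the Flowers (FLWR) slide. The sketch below mirrors that query in Python rather than running XQuery. The `big_depts` function, the in-memory records, and the `min_headcount` parameter are assumptions introduced here; the slide's query uses a threshold of 10, and the sample data uses a smaller one only to stay short.

```python
from statistics import mean


def big_depts(deptnos, employees, min_headcount=10):
    """Mimic the slide's FLWR query over plain Python records.

    Assumes min_headcount >= 1, so every surviving group is non-empty.
    """
    # for $d ... let $e := ... : build one (d, e) tuple per department
    tuples = [(d, [emp for emp in employees if emp["deptno"] == d]) for d in deptnos]
    # where count($e) >= min_headcount : filter the tuples
    tuples = [(d, e) for d, e in tuples if len(e) >= min_headcount]
    # order by avg($e/salary) descending : sort the tuples
    tuples.sort(key=lambda t: mean(emp["salary"] for emp in t[1]), reverse=True)
    # return : evaluated once for every remaining tuple
    return [
        {"deptno": d, "headcount": len(e), "avgsal": mean(emp["salary"] for emp in e)}
        for d, e in tuples
    ]


# Hypothetical sample data, not taken from the slides.
employees = [
    {"deptno": "D1", "salary": 50}, {"deptno": "D1", "salary": 70},
    {"deptno": "D2", "salary": 90}, {"deptno": "D2", "salary": 110},
    {"deptno": "D3", "salary": 40},
]

result = big_depts(["D1", "D2", "D3"], employees, min_headcount=2)
print(result)
```

With this sample data, D3 is dropped by the where step, and D2 is listed before D1 because its average salary is higher.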
