CS490W

XML data and Retrieval

Luo Si Department of Computer Science Purdue University

XML and Retrieval: Outline

Outline:
- Semi-Structured Data
  - XML, Examples, Applications
- XML Search
  - XQuery
  - XIRQL
- Text-Based XML Retrieval
  - Vector-space model
  - INEX

Semi-Structured Data

XML has been used as the standard representation of semi-structured data.
- The eXtensible Markup Language (XML) is a W3C-recommended general-purpose markup language that supports a wide variety of applications.
- It is a framework for defining markup languages.
- It has an open vocabulary of tags.
- Each XML-based language (set of tags) corresponds to a different application.
- It facilitates the sharing of data across different information systems, particularly systems connected via the Internet.
- Examples: RSS, XHTML, MathML

Semi-Structured Data

Structure of XML
- XML data is organized into documents, like unstructured data.
- There are structures (nodes/tags) within the documents.
- Each XML document is an ordered, labeled tree.
- Element nodes are labeled with
  - a node name (e.g., chapter)
  - node attributes and their values (e.g., size=1000; time=01/01/2007)
- Element nodes may have child nodes or data.
- Data (e.g., text strings) exists within leaf nodes.

XML Example

Example document content (element markup not shown): "Machine Learning", Tom Mitchell, ... "Machine Learning Applications...", ...

Elements, Attributes/Values, Data (Text String)

XML Example (tree view)

[Figure: the example document drawn as a tree. The root element book has the children title, author (with children firstname and surname), and two chapter elements; a chapter has a title and para children. Figure labels: Elements, Attributes/Values, Data (Text String).]

Elements

- Elements are defined by markup tags.
- Element: <TagName attr_a="value">text</TagName>
- The ID of the element is TagName.
- Attribute: attr_a; value: "value"
- Data/text: "text"
- End tag: </TagName>
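A minimal sketch of how elements, attributes, and text data appear once a document is parsed into an ordered, labeled tree. The document below is hypothetical: its element names follow the book example in these slides, but the attribute placement and paragraph text are assumptions made only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; attribute placement and text are illustrative only.
BOOK_XML = """
<book size="1000" time="01/01/2007">
  <title>Machine Learning</title>
  <author><firstname>Tom</firstname><surname>Mitchell</surname></author>
  <chapter>
    <title>Machine Learning Applications</title>
    <para>First paragraph.</para>
    <para>Second paragraph.</para>
  </chapter>
</book>
"""

def walk(element, depth=0):
    """Yield (depth, tag, attributes, text) for every element in document order."""
    text = (element.text or "").strip()
    yield depth, element.tag, dict(element.attrib), text
    for child in element:  # children keep their document order
        yield from walk(child, depth + 1)

root = ET.fromstring(BOOK_XML)
nodes = list(walk(root))
for depth, tag, attrs, text in nodes:
    print("  " * depth + tag, attrs, repr(text))
```

Element nodes carry the name and attributes; the text strings sit in the leaf nodes (title, firstname, surname, para).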

XML, HTML, SGML

- 1986: SGML (ISO 8879:1986)
- Nov 1995: HTML 2.0
- Nov 1996: Simplified and stripped-down SGML draft (dubbed XML)
- Jan 1997: HTML 3.2
- Aug 1997: XML working draft
- Dec 1997: XML 1.0 proposed recommendation
- Jan 1998: XML
- Feb 1999: XHTML

XML and HTML

- Both are derived from SGML.
- HTML is a markup language mainly for display in browsers.
- XML is a framework for defining markup languages.
- HTML defines display.
- XML defines the data structure; the presentation is separated from the content.
- HTML can be formalized as XML (XHTML).

Why XML?

- Unlike a relational database, XML data does not require a separate relational schema, because the data itself carries its structural information.
- Unlike HTML, the widely used Web format that only specifies how formatted data is presented, XML also describes what the data is, so the data stays usable by other applications.

XML Applications

- CML: chemical markup language
- WML: wireless markup language
- ThML: theological markup language

XML Applications

- CML: chemical markup language

CML (Chemical Markup Language) is a new approach to managing molecular information using tools such as XML and Java. It was the first domain-specific implementation based strictly on XML.

XML Applications

- WML: wireless markup language

Wireless Markup Language (WML) is a content format for devices that implement the Wireless Application Protocol (WAP) specification, such as mobile phones.

This is a sample WML page.

XML Applications

- ThML: theological markup language

Sample ThML text:
- In the beginning God created the heaven and the earth.
- And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.

XML Files

- Schema/DTD: the syntax definition of an XML language (DTD file)

XML provides an application-independent way of sharing data. With a DTD, independent groups of people can agree to use a common DTD for interchanging data. However, this is often NOT the case.

- DTD example (declarations and markup not shown): an XML document whose text content is "Tove", "Jani", "Reminder", and "Don't forget me this weekend!"
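A hedged sketch of what a document with an internal DTD might look like. The slide's declarations are not available, so the element names (note, to, from, heading, body) and the content model below are assumptions. Python's standard ElementTree parses the DOCTYPE but does not validate against it, so the content model is restated by hand for a simple order check.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction: the declarations and element names are hypothetical.
NOTE_XML = """<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (to, from, heading, body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT from (#PCDATA)>
  <!ELEMENT heading (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>"""

# Hand-written copy of the (assumed) content model of <note>.
NOTE_CONTENT_MODEL = ["to", "from", "heading", "body"]

def follows_note_model(note):
    """Return True if the children of <note> appear exactly in the declared order."""
    return [child.tag for child in note] == NOTE_CONTENT_MODEL

note = ET.fromstring(NOTE_XML)
print(follows_note_model(note))
```

A validating parser would perform this check from the DTD itself; the hand-written list only illustrates what the shared definition buys the two parties exchanging data.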

- XML Schema: recommended by the W3C as the successor of DTDs. Schema documents are informally referred to as XSDs (XML Schema Definitions). XSDs are far more powerful than DTDs in describing XML languages.

XML Search

- Most XML search protocols use a database-based approach
  - Non-text data match
  - Exact keyword (text) match
  - Evaluation of XML path expressions
  - No concept of relevance

XML Search

- Traditional XML search: database-based approach
  - XQuery
  - Searches multiple types of data: values (e.g., price of a book); IDs (ISBN of a book); keyword match (text)
- XML text search: information retrieval approach
  - XIRQL
  - Vector-space based
  - Searches text data: estimates the relevance of XML elements with respect to the query
  - The query may contain path expressions

XML Search

- XQuery
  - SQL for XML
  - Used for text-rich documents, data-oriented (non-text) documents, and mixed documents
  - Considers path expressions (XPath) and XML Schema datatypes
  - It is still a working draft; details are being improved

XML Search

- XQuery considers some principal forms
  - Path expressions
  - Conditional expressions
  - Datatype expressions
  - List expressions
  - etc.
- Programming language: FLWOR ("flower") expressions
- Principal forms can be evaluated with respect to ...

Principal Forms

- Path query
  - /book//title contains "Information Retrieval"
  - Meaning: the title of the book contains the keywords "Information Retrieval"
- Conditional expressions
  - $h/title, IF $h/@type = "Journal" THEN ...
  - Meaning: applies if the type of an article is "Journal"

Flowers (FLWOR)

- Programming language: XQuery defines FLWOR or FLWR (often pronounced "flower") as an expression that supports iteration and the binding of variables to intermediate results.
  - for and let create a sequence of tuples
  - where filters the tuples on a boolean expression
  - order by sorts the tuples, using any comparable data
  - return gets evaluated once for every tuple

Flowers (FLWOR)

    for $d in doc("depts.xml")//deptno
    let $e := doc("emps.xml")//employee[deptno = $d]
    where count($e) >= 10
    order by avg($e/salary) descending
    return
      <big-dept>
        { $d,
          <headcount>{count($e)}</headcount>,
          <avgsal>{avg($e/salary)}</avgsal> }
      </big-dept>
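For readers more familiar with procedural code, the same for / let / where / order by / return pipeline can be sketched in Python. This is an illustration, not an XQuery engine: the layouts of the two files, the output element names (results, big-dept, headcount, avgsal), and the tiny data set are all assumptions.

```python
import xml.etree.ElementTree as ET
from statistics import mean

def big_departments(depts_root, emps_root, min_headcount=10):
    """Emulate the FLWOR query on two parsed XML trees (assumed layouts)."""
    rows = []
    for d in depts_root.iter("deptno"):                    # for $d in ...//deptno
        emps = [e for e in emps_root.iter("employee")      # let $e := ...//employee[deptno = $d]
                if e.findtext("deptno") == d.text]
        if len(emps) >= min_headcount:                     # where count($e) >= ...
            salaries = [float(e.findtext("salary")) for e in emps]
            rows.append((d.text, len(emps), mean(salaries)))
    rows.sort(key=lambda row: row[2], reverse=True)        # order by avg(...) descending
    result = ET.Element("results")
    for deptno, headcount, avgsal in rows:                 # return, once per tuple
        big = ET.SubElement(result, "big-dept")
        ET.SubElement(big, "deptno").text = deptno
        ET.SubElement(big, "headcount").text = str(headcount)
        ET.SubElement(big, "avgsal").text = str(avgsal)
    return result

# Tiny hypothetical data set; the threshold is lowered to 2 to keep it short.
depts = ET.fromstring("<depts><deptno>A</deptno><deptno>B</deptno><deptno>C</deptno></depts>")
emps = ET.fromstring(
    "<emps>"
    "<employee><deptno>A</deptno><salary>50</salary></employee>"
    "<employee><deptno>A</deptno><salary>70</salary></employee>"
    "<employee><deptno>B</deptno><salary>90</salary></employee>"
    "<employee><deptno>B</deptno><salary>110</salary></employee>"
    "<employee><deptno>C</deptno><salary>200</salary></employee>"
    "</emps>"
)
out = big_departments(depts, emps, min_headcount=2)
print(ET.tostring(out, encoding="unicode"))
```

Every step is an exact filter or sort; nothing in the pipeline estimates how relevant a result is.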

- XQuery considers some principal forms and combines them with FLWOR expressions. It is quite similar to SQL for relational databases.
- However, it does not have the concept of relevance, which is important for both text data (text-based information retrieval) and non-text data (fuzzy search). Examples:
  - Find a book about information retrieval
  - Find a book which costs about $30

XML IR Challenges 1: Term Statistics

- There are multiple types of elements (books, titles, abstracts). How do we construct the corpus statistics (idf) for different elements?
- How do we handle the term frequency information? Example: for /book//title "information retrieval", do we also consider the book abstract?
  - Possible approach: hierarchical smoothing

XML IR Challenges 2: Schemas

- Ideal case
  - There is a universal schema
  - Users can associate data types with the universal schema without ambiguity
  - Too ideal to be true...
- Real world
  - There are many schemas, with different spellings, different concepts, and different granularities (e.g., "auth" & "authors"; "abstract" & "description"; "abstract" & "keywords")

XML IR Challenges 3: User interface

- How to guide users to find relevant elements
  - Granularity control: Book -> Abstract -> Full Text
- What type of query language
  - Natural language query (IR approach): most usable
  - With structure information: more powerful but less usable
- How to do query expansion
  - How to automatically add structure information, e.g., "find a book written by J. K. Rowling" -> find a book with /../author (J. K. Rowling)
  - This is an open research problem

XIRQL

- Prof. Norbert Fuhr, University of Dortmund: open source XML search engine
- XIRQL: a query language for information retrieval in XML documents
  - Structured Document Retrieval Principle
  - Users may not know the schema: allow users to search even if they do not know the schema of the data

XIRQL Atom-Units

- Only atomic units can be returned. Traditional IR treats documents as atomic units; XML retrieval takes a tree-like view of documents.
- XIRQL only indexes and returns atom-units
  - Atom-units can be leaf nodes that contain text information
  - Atom-units can be other internal nodes
  - Atom-units can be defined in the DTD
  - TF-IDF values are calculated based on atom-units

Structured Document Retrieval Principle

We should always rank the most specific/probable atom-units for answering a query.

Example query: xql

[Figure: example document with indexing weights attached to nodes at different levels: XQL 0.3 at the chapter level; example 0.5, XQL 0.8, and syntax 0.7 at the section level.]

Return the section, not the chapter.
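A small sketch of the principle, using the weights from the example. How the weights are distributed over two sections, the node names, and the tie-breaking rule (prefer the deeper node) are assumptions for illustration; XIRQL's actual ranking is probabilistic and is not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str      # hypothetical label, e.g. "chapter" or "section 2"
    depth: int     # depth in the document tree (root = 0)
    weights: dict  # indexing weight per term for this node

def rank_elements(nodes, term):
    """Rank nodes containing `term` by weight; deeper (more specific) nodes win ties."""
    hits = [n for n in nodes if n.weights.get(term, 0.0) > 0.0]
    return sorted(hits, key=lambda n: (n.weights[term], n.depth), reverse=True)

nodes = [
    Node("chapter", 0, {"xql": 0.3}),
    Node("section 1", 1, {"example": 0.5}),
    Node("section 2", 1, {"xql": 0.8, "syntax": 0.7}),
]
ranking = rank_elements(nodes, "xql")
print([n.name for n in ranking])
```

For the query xql, the section with weight 0.8 ranks ahead of the chapter with weight 0.3, so the section is returned first.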

- Data types: XIRQL suggests vague predicates for different kinds of data types (e.g., person names, locations, dates). It suggests datatype-specific comparison operators (e.g., 'near', <, >, 'broader', 'narrower', ...).
- Semantic roles: in a search for #persname, XIRQL searches all persons in documents without specifying their role and regardless of their position in the XML document tree.
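To make the idea of a vague predicate concrete, here is a sketch of a numeric 'near' operator that returns a graded score instead of true/false, in the spirit of the earlier "find a book which is about $30" example. The linear shape, the 20% tolerance, and the book data are assumptions; XIRQL does not prescribe this particular function.

```python
def near(value, target, tolerance=0.2):
    """Vague numeric equality: 1.0 at the target, falling linearly to 0.0
    once the relative difference reaches `tolerance` (an assumed shape)."""
    if target == 0:
        raise ValueError("target must be non-zero for a relative comparison")
    relative_gap = abs(value - target) / abs(target)
    return max(0.0, 1.0 - relative_gap / tolerance)

prices = {"Book A": 30.0, "Book B": 33.0, "Book C": 45.0}  # hypothetical data
scores = {title: near(price, 30.0) for title, price in prices.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

An exact predicate (price = 30) would return only Book A; the vague predicate also returns Book B with a lower score, which lets non-text conditions take part in relevance ranking.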

XIRQL Summary

- Relevance ranking with respect to the structured document retrieval principle
- Recommends datatype-specific operators for different types of data
- Enables semantic roles

Text-Based XML Retrieval

- Documents: marked up with XML tags (journal articles, conference papers, novels, manuals, ...)
- Queries: plain text queries, or queries with structure (keywords in the title or abstract)
- Results: the system automatically adjusts the granularity of the returned results (e.g., the most specific section about "the role of the p53 gene in cancer"), considering both coverage and specificity

Vector Space Model and XML

- Vector space model for traditional IR
  - Represent queries and plain documents by vectors in the keyword space.
  - Do not distinguish keywords in different fields (e.g., title or full text).
  - Calculate similarities between vectors.
- Vector space in XML data
  - Need to capture the structure of an XML document in the vector space.

Vector Space Model and XML

- Flexible queries for XML retrieval
  - Content-Only (CO) queries: the information need is expressed as plain text, similar to queries in traditional information retrieval
  - Content-and-Structure (CAS) queries: the information need is expressed as plain text plus structure information, e.g., /book//title "Bill Gates" or /book//author "Bill Gates"
  - The structure information can be strict or flexible (i.e., matches must come from certain elements, or are preferred from certain elements)

Tree Representation of Queries

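[Figure: two query trees. /book "Bill Gates" is drawn as a Book node with the text "Bill Gates" below it; /book//author "Bill Gates" is drawn as a Book node with an Author node below it that contains "Bill Gates".]

Vector Space Model and XML

[Figure: two book documents. The first Book has Title "The plot to get Bill Gates" and Author "Gary Rivlin"; the second Book has Title "Software" and Author "Bill Gates".]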

Vector Space Model and XML

- Vector space model for traditional IR
  - The system treats all keywords in a document equally, so the two occurrences of "Gates" look the same in the two documents.
- Vector space in XML data
  - We must distinguish the two occurrences of "Gates" under the different elements "Title" and "Author".
  - The index must consider both the contents and the locations of keywords (e.g., different elements).

Vector Space Model and XML

- Vector space in XML data
  - The index must consider both the contents and the locations of keywords (e.g., different elements).
  - To accomplish this, we need to consider the partial trees (structural items) within an XML document.
  - Can we build indexes for the structural items (partial trees)?
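The difference between the two views can be sketched by comparing a flat bag of words with a bag of (element, word) pairs. The two book documents are an assumed reading of the example in these slides, and pairing each word with only its immediate element name is the simplest possible structural item, chosen for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def flat_vector(book):
    """Bag of words over all text in the document (traditional IR view)."""
    words = " ".join(book.itertext()).lower().split()
    return Counter(words)

def element_vector(book):
    """Bag of (element name, word) pairs, so location is part of each dimension."""
    pairs = Counter()
    for element in book.iter():
        for word in (element.text or "").lower().split():
            pairs[(element.tag, word)] += 1
    return pairs

# Assumed reading of the two example documents.
book1 = ET.fromstring(
    "<book><title>The plot to get Bill Gates</title><author>Gary Rivlin</author></book>"
)
book2 = ET.fromstring("<book><title>Software</title><author>Bill Gates</author></book>")

print(flat_vector(book1)["gates"], flat_vector(book2)["gates"])
print(element_vector(book1)[("title", "gates")], element_vector(book2)[("title", "gates")])
print(element_vector(book1)[("author", "gates")], element_vector(book2)[("author", "gates")])
```

In the flat view both documents have the same count for "gates"; in the element-aware view the occurrence lands on a title dimension for one book and on an author dimension for the other.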

Vector Space Model and XML

If we do not allow gaps in the tree structures, we obtain structural items (partial trees) such as:

[Figure: partial trees of a Book whose Title is "Software" and whose Author is "Bill Gates", ranging from the bare terms "Software" and "Bill Gates", through Title -> "Software" and Author -> "Bill Gates", up to the whole Book tree with both Title and Author.]
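A sketch of how gap-free structural items could be enumerated. To stay short it only generates path-shaped items (a contiguous chain of element names ending in a word) and ignores branching partial trees; that restriction, the lower-casing, and the example document are assumptions.

```python
import xml.etree.ElementTree as ET

def structural_items(root):
    """Return the set of gap-free path/term items: every contiguous suffix of the
    root-to-leaf element path, paired with each word found under that path."""
    items = set()

    def visit(element, path):
        path = path + (element.tag,)
        for word in (element.text or "").lower().split():
            # start == len(path) yields the bare term without any context
            for start in range(len(path) + 1):
                items.add(path[start:] + (word,))
        for child in element:
            visit(child, path)

    visit(root, ())
    return items

book = ET.fromstring("<book><title>Software</title><author>Bill Gates</author></book>")
items = structural_items(book)
for item in sorted(items, key=lambda it: (len(it), it)):
    print("/".join(item))
print(len(items), "items for", len(list(book.iter())), "elements")
```

Even this restricted enumeration produces several items per word, and the count grows with tree depth and vocabulary size.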

- Problems of indexing with structural items
  - The number of distinct structural items can be huge.
  - It is not practical to build and store a vector space index with so many dimensions.
- Some possible solutions
  - Build a query-time partial vector space
  - Restrict the structural items to a manageable set

Vector Space Model and XML

- Query-time partial vector space
  - Instead of generating all structural items at once, we generate only the partial vector space needed for a specific query (a much smaller set).
  - For a specific query:
    - Find all XML documents that contain any keyword of the query, and build the partial vector space from these documents.
    - Calculate the similarity between the qualifying XML documents and the query within the partial vector space.

Vector Space Model and XML

- Weights of structural items (partial trees)
  - Down-weighting for structural items
  - "Software" should have more influence (weight) on the book element than "Windows", "Platform", ...
  - Calculate the weight of a term for an element k levels up by applying a scaling factor β^k, 0 < β < 1

[Figure: a Book with the children Title ("Software") and Full Text; Full Text has the paragraphs P1 and P2 containing "Windows", "platform", "linux", ...]
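The scaling by β^k can be sketched as follows. Using a raw occurrence count of 1 as the base weight, β = 0.5, and the element names fulltext and p are assumptions; in practice the base weight would be a tf-idf style value.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def propagated_weights(element, beta=0.5):
    """Return {term: weight} for `element`, where a term occurring k levels
    below the element contributes beta**k per occurrence (0 < beta < 1)."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    weights = defaultdict(float)

    def visit(node, k):
        for word in (node.text or "").lower().split():
            weights[word] += beta ** k
        for child in node:
            visit(child, k + 1)

    visit(element, 0)
    return dict(weights)

# Hypothetical document mirroring the figure: title one level down, paragraphs two.
book = ET.fromstring(
    "<book><title>Software</title>"
    "<fulltext><p>Windows platform</p><p>Linux</p></fulltext></book>"
)
weights = propagated_weights(book, beta=0.5)
print(weights)
```

"Software" sits one level below the book element and the paragraph words sit two levels below, so "software" receives the larger weight for the book.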

Vector Space Model and XML

- Weights of structural items (partial trees)
  - Down-weighting for structural items
  - Weights can also be set for different partial trees.
  - The weights can be predefined.
  - Weights can be application-oriented.
  - Weights can be user-specific.
  - Weights can be query-specific.
  - Learning issues...

[Figure: the Book tree with the weights 0.8 and 0.2 on its two children, Title ("Software") and Full Text (paragraphs P1 and P2 containing "Windows", "platform", "linux", ...).]

Vector Space Model and XML

- Other issues of weights of structural items (partial trees)
  - Down-weighting uses the contents of low-level elements for high-level elements (e.g., the contents of "title" and "full text" for "book").
  - Should we also incorporate the contents of high-level (or same-level) elements for low-level elements? This is the smoothing strategy...

[Figure: the Book tree with the children Title ("Software") and Full Text (paragraphs P1 and P2 containing "Windows", "platform", "linux", ...).]

Vector Space Model and XML

- Calculating the similarity
  - Vocabulary mismatch of keywords and structures
  - Keyword mismatch has been studied in traditional information retrieval; we can use techniques such as query expansion, latent semantic indexing, probabilistic latent semantic indexing, ...
  - Structure mismatch

[Figure: three trees that place "Software" in different structures: Book -> "Software"; Book -> Title -> "Software"; Book -> Full Text -> "Software".]
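One way to score a structure mismatch is the context resemblance measure described in Manning et al.'s Introduction to Information Retrieval: CR = (1 + |q|) / (1 + |d|) when the query path q can be turned into the document path d by inserting nodes, and 0 otherwise. The sketch below borrows that measure as an example and is not necessarily the similarity intended in these slides; treating the term as the last path component is a simplification.

```python
def is_subsequence(short, long):
    """True if `short` can be turned into `long` by inserting extra nodes."""
    remaining = iter(long)
    return all(node in remaining for node in short)

def context_resemblance(query_path, doc_path):
    """CR = (1 + |q|) / (1 + |d|) if the query path matches the document path, else 0."""
    if not is_subsequence(query_path, doc_path):
        return 0.0
    return (1 + len(query_path)) / (1 + len(doc_path))

query = ("book", "software")
candidates = [
    ("book", "software"),
    ("book", "title", "software"),
    ("book", "fulltext", "software"),
    ("article", "software"),  # hypothetical non-matching structure
]
for doc in candidates:
    print("/".join(doc), context_resemblance(query, doc))
```

An identical structure scores 1.0, a structure with one extra level scores 0.75, and a structure that cannot be reached by insertions scores 0.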

Vector Space Model and XML

- Calculating the similarity
  - First, find all structural items in the query.
  - Find all similar matches against the vocabulary of structural items. This is not a Boolean match but a similarity match (e.g., a 0.9 similarity score with an item).
  - Retrieve all documents/elements with those structural items and compute the cosine similarity, etc.

Vector Space Model and XML

- Problems with the vector space model
  - What IDF value? We cannot use a corpus-wide IDF value; the IDF value should be element-specific. But do we need to incorporate the IDF factors of higher-level or same-level elements?
  - For heterogeneous XML documents, we do not know the exact mapping between the schemas. Do we need schema mapping? How can we deal with the uncertainty of schema mapping?
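A minimal sketch of an element-specific IDF, assuming the definition idf(tag, term) = log(N_tag / df_tag,term), where N_tag counts the elements with that tag and df counts those that contain the term. The definition and the two sample documents are assumptions; the question of mixing in statistics from higher-level or same-level elements is left open here, as in the slide.

```python
import math
import xml.etree.ElementTree as ET
from collections import defaultdict

def element_idf(documents):
    """Return {(tag, term): idf} with idf = log(N_tag / df_tag_term), where both
    counts are taken over elements with the same tag (an assumed definition)."""
    element_count = defaultdict(int)  # N_tag: number of elements per tag
    element_freq = defaultdict(int)   # df: elements of that tag containing the term
    for doc in documents:
        for element in doc.iter():
            terms = set((element.text or "").lower().split())
            element_count[element.tag] += 1
            for term in terms:
                element_freq[(element.tag, term)] += 1
    return {key: math.log(element_count[key[0]] / df) for key, df in element_freq.items()}

# Hypothetical two-document collection.
docs = [ET.fromstring(x) for x in (
    "<book><title>Information Retrieval</title>"
    "<abstract>retrieval of information from text</abstract></book>",
    "<book><title>Machine Learning</title>"
    "<abstract>learning and retrieval methods</abstract></book>",
)]
idf = element_idf(docs)
print(idf[("title", "retrieval")], idf[("abstract", "retrieval")])
```

The same term gets different IDF values depending on the element type: "retrieval" is discriminative among titles but appears in every abstract.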

INEX: Benchmark for text-based XML Retrieval

- INEX: INitiative for the Evaluation of XML Retrieval
  - The analog of TREC (Text REtrieval Conference) for standard unstructured information retrieval
  - Provides a testbed: a set of XML documents, plain queries (content-only queries) and structured queries (with XML structure), and a set of retrieval tasks
  - INEX 2002-2006: mainly organized by people from Europe. It has attracted many participants from universities and big companies all over the world.

INEX: Benchmark for text-based XML Retrieval

- Ad-hoc XML retrieval task
  - Each system indexes a set of XML documents.
  - For a set of queries (content-only, content-and-structure), the system converts the queries into an internal representation.
  - In response, each system returns not documents but the most relevant elements within documents.
- Evaluation metrics
  - The retrieved elements are evaluated on two measures:
    - Relevance: how relevant is the retrieved element?
    - Coverage: is the retrieved element too specific, too general, or just right?
  - There are scales for both measures, which are then turned into precision/recall measures.

INEX: Benchmark for text-based XML Retrieval

- Ad-hoc XML retrieval task
  - 12,107 articles from IEEE Computer Society publications
  - 494 megabytes
  - Average article: 1,532 XML nodes/elements
  - Average node/element depth: 7

INEX: Benchmark for text-based XML Retrieval

- Relevance
  - Assessed on a scale from Irrelevant (score 0) to Highly Relevant (score 3)
- Coverage
  - No coverage (N), too general (L), too specific (S), exact (E)
- So every element returned by each engine has a rating from {0,1,2,3} × {N,S,L,E}

INEX: Benchmark for text-based XML Retrieval

Define scores:

\[
f_{\text{strict}}(rel, cov) =
\begin{cases}
1 & \text{if } (rel, cov) = 3E \\
0 & \text{otherwise}
\end{cases}
\]

\[
f_{\text{generalized}}(rel, cov) =
\begin{cases}
1.00 & \text{if } (rel, cov) = 3E \\
0.75 & \text{if } (rel, cov) \in \{2E, 3L, 3S\} \\
0.50 & \text{if } (rel, cov) \in \{1E, 2L, 2S\} \\
0.25 & \text{if } (rel, cov) \in \{1S, 1L\} \\
0.00 & \text{if } (rel, cov) = 0N
\end{cases}
\]

INEX: Benchmark for text-based XML Retrieval

- Heterogeneous XML retrieval task
  - The ad hoc track in INEX has dealt with a single DTD for one type of data (computer science journal articles).
  - In "real-world" environments, XML retrieval must deal with different DTDs, different genres of data, and widely varying topical content.
  - Problems: What methods can be used to map structural criteria onto other DTDs? Should mappings focus on element names, or also deal with element content or semantics?
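The strict and generalized score definitions above translate directly into lookup functions. In this sketch, ratings are passed as an integer relevance and a one-letter coverage code; scoring any combination that the definition does not list as 0.0 is an assumption, and the sample assessments are hypothetical.

```python
# Generalized scores keyed by the combined rating, e.g. "3E".
GENERALIZED_SCORES = {
    "3E": 1.00,
    "2E": 0.75, "3L": 0.75, "3S": 0.75,
    "1E": 0.50, "2L": 0.50, "2S": 0.50,
    "1S": 0.25, "1L": 0.25,
    "0N": 0.00,
}

def f_strict(rel, cov):
    """1 only for a highly relevant element with exact coverage (3E)."""
    return 1.0 if (rel, cov) == (3, "E") else 0.0

def f_generalized(rel, cov):
    """Graded score; combinations not listed in the definition default to 0.0
    (an assumption of this sketch)."""
    return GENERALIZED_SCORES.get(f"{rel}{cov}", 0.0)

returned = [(3, "E"), (2, "L"), (1, "S"), (0, "N")]  # hypothetical assessments
for rel, cov in returned:
    print(f"{rel}{cov}", f_strict(rel, cov), f_generalized(rel, cov))
```

The strict function rewards only perfect answers, while the generalized one gives partial credit to elements that are relevant but too general or too specific.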

XML Information Retrieval: Outline

Basic Concepts of Information Retrieval:
- Semi-Structured Data
  - XML, Examples, Applications
- XML Search
  - XQuery
  - XIRQL
- Text-Based XML Retrieval
  - Vector-space model
  - INEX

XML Resources

- www.w3.org/XML: XML resources at W3C
- Jan-Marco Bremer's publications on XML and IR: http://www.db.cs.ucdavis.edu/~bremer
- Norbert Fuhr and Kai Grossjohann. XIRQL. SIGIR 2001
- INEX: http://inex.is.informatik.uni-duisburg.de/
- Chris Manning: Introduction to Information Retrieval

Some contents of the slides are based on the above materials.