CS490W: XML Data and Retrieval

Luo Si
Department of Computer Science
Purdue University

Semi-Structured Data: Structure of XML

• XML data is organized into documents, like unstructured data
• There are structures (nodes/tags) within the documents
• Each XML document is an ordered, labeled tree
• Element nodes are labeled with
  - Node name (e.g., chapter)
  - Node attributes and their values (e.g., size=1000; time=01/01/2007)
  - May have child nodes or data
• Data (e.g., text strings) exist within leaf nodes
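The ordered, labeled tree view can be sketched with Python's standard xml.etree.ElementTree module. This is only an illustration: the sample document and its element and attribute names are made up for the example, and describe() is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element and attribute names are illustrative only.
DOC = """
<book>
  <title>Machine Learning</title>
  <chapter size="1000" time="01/01/2007">
    <title>Introduction</title>
    <para>Learning from data.</para>
  </chapter>
</book>
"""

def describe(node, depth=0):
    """Return one line per element, in document order, indented by tree depth."""
    text = (node.text or "").strip()
    line = "  " * depth + node.tag            # node name (label)
    if node.attrib:
        line += " " + str(dict(node.attrib))  # node attributes and values
    if text:
        line += " -> " + repr(text)           # data held by a leaf node
    lines = [line]
    for child in node:                        # children keep document order
        lines.extend(describe(child, depth + 1))
    return lines

root = ET.fromstring(DOC)
outline = describe(root)
print("\n".join(outline))
```

Each printed line corresponds to one element node; only the leaves carry text data.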

XML and Retrieval: Outline

• Semi-Structured Data
  - XML, Examples, Application
• XML Search
  - XQuery
  - XIRQL
• Text-Based XML Retrieval
  - Vector-space model
  - INEX

XML Example

[XML listing; the markup tags were lost in extraction. Surviving text content: "Machine Learning"; "Tom Mitchell"; ...; "Machine Learning Applications..."; ...]

Components shown: Elements, Attributes/Values, Data (Text String)

Semi-Structured Data

XML has been used as the standard representation of semi-structured data.

• eXtensible Markup Language is a W3C-recommended general-purpose markup language that supports a wide variety of applications.
• A framework for defining markup languages
• Open vocabulary for tags
• Each set of XML tags corresponds to a different application
• Facilitates the sharing of data across different information systems, particularly systems connected via the Internet
• Examples: RSS, XHTML, MathML

XML Example

[XML listing; the markup tags were lost in extraction. Surviving text content: "Machine Learning"; "Tom"; "Michael"; ...; "Machine Learning Applications..."; ...]

[Figure: tree view of the document. book has children title, author, chapter, chapter; author has children firstname, surname; a chapter has children title, para, para, para]

Components shown: Elements, Attributes/Values, Data (Text String)

Elements

• Elements are defined by markup tags
• Element: <TagName attr_a="value">text</TagName>
• ID of the element is TagName
• Attribute: attr_a; Value = "value"
• Data/text: "text"
• End tag: </TagName>

Why XML?

• Unlike relational databases, XML data does not require relational schemata, etc., because the data itself contains this information.
• Unlike the widely used Web format HTML, which only ensures the correct presentation of the formatted data, XML also guarantees total usability of data.

XML, HTML, SGML

1986: SGML (ISO 8879:1986)
Nov 1995: HTML 2.0
Nov 1996: Simplified and stripped-down SGML draft (dubbed XML)
Jan 1997: HTML 3.2
Aug 1997: XML working draft
Dec 1997: XML 1.0 proposed recommendation
Jan 1998: XML
Feb 1999: XHTML

XML Applications

• CML – chemical markup language
• WML – wireless markup language
• ThML – theological markup language

XML and HTML

• Both are derived from SGML
• HTML is a markup language mainly for display in browsers
• XML is a framework for markup languages
• HTML defines display
• XML defines the data structure; the display factor is separated from the content
• HTML can be formalized as XML (XHTML)

XML Applications

• CML – chemical markup language:
  CML (Chemical Markup Language) is a new approach to managing molecular information using tools such as XML and Java. It was the first domain-specific implementation based strictly on XML, [...]

XML Applications

• WML – wireless markup language
  [...] Wireless Application Protocol (WAP) specification, such as mobile phones.

  [WML listing; markup lost in extraction. Surviving fragments: the DTD reference "http://www.phone.com/dtd/wml11.dtd" and the page text "This is a sample WML page."]

XML Files

DTD Example
  [DTD listing; markup lost in extraction.]

XML Document
  [XML listing; markup lost in extraction. Surviving text content: "Tove"; "Jani"; "Reminder"; "Don't forget me this weekend!"]

XML Applications

• ThML – theological markup language

  [ThML listing; markup lost in extraction. Surviving text content:]
  • In the beginning God created the heaven and the earth.
  • And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.

XML Files

• XML Schema:
  Recommended by the W3C as the successor of DTDs; informally referred to as XSD (XML Schema Definition), the initialism for XML Schema instances. XSDs are far more powerful than DTDs in describing XML languages.

  [XML Schema listing; markup lost in extraction. Surviving fragment: xmlns:xs="http://www.w3.org/2001/XMLSchema"]

XML Files

• Schema/DTD: syntax definition of an XML language
  Document Type Definition (DTD file): XML provides an application-independent way of sharing data. With a DTD, independent groups of people can agree to use a common DTD for interchanging data. However, this is often NOT the case.

  DTD Example
  [DTD listing; markup lost in extraction.]

XML Search

• Most XML search protocols use a database-based approach
  - Non-text data match
  - Exact keyword (text) match
  - Evaluate XML path expressions
  - No concept of relevance

XML Search

• Traditional XML search from the database-based approach
  - XQuery
  - Search multiple types of data: value-based (e.g., price of a book); ids (ISBN of a book); keyword match (text)
• XML text search from the information retrieval approach
  - XIRQL
  - Vector-space based
  - Search text data: estimate the relevance of XML elements with respect to the query
  - Query may contain path expressions

Principal Forms

• Path query
  /book//title contains “”
  (the title of the book contains the keywords “Information Retrieval”)
• Conditional expressions
  $h/title, IF $h/@type = "Journal" THEN ….
  (if the type of an article is journal)
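As a rough illustration of the path-query form, the sketch below emulates a query such as /book//title contains "Information Retrieval" in Python. It assumes a tiny made-up collection, a hypothetical helper titles_containing(), and a case-insensitive substring test standing in for "contains". Note that the result is an exact match with no ranking, which is the database-style behaviour described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-book collection, used only to illustrate the path query.
DOCS = [
    "<book><title>Information Retrieval Basics</title></book>",
    "<book><chapter><title>Modern information retrieval</title></chapter>"
    "<title>Databases</title></book>",
    "<book><title>Machine Learning</title></book>",
]

def titles_containing(xml_text, phrase):
    """Emulate /book//title contains "phrase" with a case-insensitive substring test."""
    root = ET.fromstring(xml_text)
    if root.tag != "book":
        return []
    # ".//title" selects title elements at any depth below the root element.
    return [t.text for t in root.iterfind(".//title")
            if t.text and phrase.lower() in t.text.lower()]

hits = [titles_containing(d, "information retrieval") for d in DOCS]
print(hits)
```

A document either matches or it does not; nothing here says which of the matching titles is the better answer.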

XML Search

• XQuery
  - SQL for XML
  - Used for text-rich documents; data-oriented documents (non-text); mixed documents
  - Considers: path expressions (XPath); XML Schema datatypes
  - It is still a working draft; details are being improved

Flowers (FLWR)

• Programming language: Flowers (FLWR) expression
  The programming language XQuery defines FLWOR or FLWR (often pronounced 'flower') as an expression that supports iteration and binding of variables to intermediate results.
  - for and let create a sequence of tuples
  - where filters the tuples on a boolean expression
  - order by sorts the tuples, using any comparable data
  - return gets evaluated once for every tuple
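The clause-by-clause behaviour described above can be mimicked in plain Python. This is an analogue, not XQuery: the department and employee records are hypothetical, and big_departments() is an illustrative name.

```python
from statistics import mean

# Hypothetical data: department numbers and (department, salary) records.
DEPTS = ["D1", "D2", "D3"]
EMPS = [("D1", 50.0)] * 10 + [("D2", 80.0)] * 12 + [("D3", 70.0)] * 3

def big_departments(depts, emps, min_size=10):
    # for / let: bind each department to the salaries of its employees
    tuples = [(d, [s for dept, s in emps if dept == d]) for d in depts]
    # where: keep only the tuples that satisfy the boolean condition
    tuples = [(d, sal) for d, sal in tuples if len(sal) >= min_size]
    # order by: sort on a comparable key (average salary, descending)
    tuples.sort(key=lambda t: mean(t[1]), reverse=True)
    # return: evaluated once for every surviving tuple
    return [{"deptno": d, "headcount": len(sal), "avgsal": mean(sal)}
            for d, sal in tuples]

result = big_departments(DEPTS, EMPS)
print(result)
```

D3 is filtered out by the where step because it has fewer than ten employees; the remaining departments come back ordered by average salary.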

XML Search

• XQuery considers some principal forms
  - Path expression
  - Conditional expressions
  - Datatype expressions
  - List expression
  - etc.
• Programming language: Flowers (FLWOR) expression
• Principal forms can be evaluated with respect to context

Flowers (FLWR)

  for $d in document("depts.xml")//deptno
  let $e := document("emps.xml")//employee[deptno = $d]
  where count($e) >= 10
  order by avg($e/salary) descending
  return { $d, {count($e)}, {avg($e/salary)} }

  [The element constructor tags in the return clause were lost in extraction.]

XML Search

• XQuery considers some principal forms and combines them with Flowers (FLWR)
  It is quite similar to SQL for relational databases
• However, it does not have the concept of relevance, which is important for both text data (text-based information retrieval) and non-text data (fuzzy search).
  Find a book about information retrieval
  Find a book which is about $30.

XML IR Challenges 3: User interface

• How to guide users to find relevant elements
  - Granularity control: Book -> Abstract -> Full Text
• What type of query language
  - Natural language query (IR approach): most usable
  - With structure information: more powerful but less usable
• How to do query expansion
  - How to automatically add structure information
    e.g., find a book written by J. K. Rowling -> find a book written by /../author (J. K. Rowling): an open research problem

XML IR Challenges 1: Term Statistics

• There are multiple types of elements: books/titles/abstracts; how to construct the corpus statistics (idf) for different elements?
• How do we handle the term frequency information?
  Example: for /book//title “information retrieval”, do we consider the book abstract? Hierarchical smoothing

XIRQL

Prof. Norbert Fuhr, University of Dortmund: open source XML search engine
XIRQL: a query language for information retrieval in XML documents
  - Structured Document Retrieval Principle
  - Users may not know the schema
    Allow users to search even if they do not know the schema of the data

XML IR Challenges 2: Schemas

• Ideal case
  - There is a universal schema
  - Users can associate data types with the universal schema without ambiguity
  - Too ideal to be true…
• Real world
  - There are many schemas; different spellings; different concepts; different granularities (e.g., “auth” & “authors”; “abstract” & “description”; “abstract” & “keywords”)

Units

• Only atomic units can be returned
  Traditional IR treats documents as atomic units; XML takes a tree-like view of documents.
• XIRQL only indexes and returns atomic units
  - Atomic units can be leaf nodes that contain text information
  - Atomic units can be other internal nodes
  - Atomic units can be defined in the DTD
  - TF-IDF values are calculated based on atomic units

XIRQL Atom-Units

[Slide content not recovered in extraction.]

XIRQL Summary

• Relevance ranking with respect to the structured document retrieval principle
• Recommends datatype-specific operators for different types of data
• Enables semantic roles

Structured Document Retrieval Principle

We should always rank the most specific/probable atomic units for answering a query.

Example query: xql
[Figure: document tree annotated with term weights. Surviving labels: 0.3 XQL; 0.5 example; 0.8 XQL; 0.7 syntax]
Return section, not chapter

Text-Based XML Retrieval

• Documents are marked up with XML tags
  journal articles, conference papers, novels, manuals…
• Queries
  plain text queries, queries with structures (keywords in the title or abstracts)
• Results
  The system automatically adjusts the granularity of the returned results (e.g., the most specific section about “the role of p53 gene for cancer”).
  Considers both coverage and specificity

Structured Document Retrieval Principle

Data types:
XIRQL suggests vague predicates for different kinds of data types (e.g., person names, locations, dates). It suggests datatype-specific comparison operators (e.g., ‘near’, <, >, ‘broader’, ‘narrower’….).

Semantic roles:
With a search for #persname, XIRQL searches all persons in documents, without specifying their role, regardless of their position in the XML document tree.

Vector Space Model and XML

• Vector space model for traditional IR
  Represent queries and plain documents by vectors in the keyword space. Do not distinguish the keywords in different fields (e.g., title or full text). Calculate similarities between vectors.
• Vector space in XML data
  Need to capture the structure of an XML document in the vector space.

• Flexible queries for XML retrieval
  - Content Only queries (CO)
    information need expressed as plain text queries, similar to those in traditional information retrieval
  - Content and Structure (CAS)
    information need expressed as plain text and structure information
    /book//title “Bill Gates” or /book//author “Bill Gates”
    the structure information can be strict or flexible (i.e., keywords must come from some elements, or preferably come from some elements)

• Vector space model for traditional IR
  The system treats the keywords in a document equally, so the two “Gates” are the same for two documents
• Vector space in XML data
  We must distinguish the two occurrences of “Gates” under different elements “Title” and “Author”
  The index must consider both the contents and the locations of keywords (e.g., different elements)
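The difference between the two views of “Gates” can be made concrete with a small Python sketch. The two documents and the helper names flat_terms() and element_terms() are hypothetical; the element-aware index simply uses (element name, term) pairs as dimensions, which is one simple way to record keyword location.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical documents: "Gates" appears once as an author and once in a title.
DOC_A = "<book><title>Software</title><author>Bill Gates</author></book>"
DOC_B = "<book><title>Bill Gates</title><author>Gary Rivlin</author></book>"

def flat_terms(xml_text):
    """Traditional IR view: one bag of words for the whole document."""
    root = ET.fromstring(xml_text)
    return Counter(w.lower() for w in " ".join(root.itertext()).split())

def element_terms(xml_text):
    """XML view: each dimension is an (element name, term) pair."""
    root = ET.fromstring(xml_text)
    return Counter((el.tag, w.lower())
                   for el in root.iter() if el.text
                   for w in el.text.split())

print(flat_terms(DOC_A)["gates"], flat_terms(DOC_B)["gates"])
print(element_terms(DOC_A)[("author", "gates")],
      element_terms(DOC_B)[("author", "gates")])
```

In the flat vectors the two documents look identical with respect to “gates”; in the element-aware vectors only the first one has the term under author.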

Tree Representation of Queries

[Figure: two query trees. /book “Bill Gates” is drawn as Book -> “Bill Gates”; /book//author “Bill Gates” is drawn as Book -> Author -> “Bill Gates”]

Vector Space Model and XML

• Vector space in XML data
  The index must consider both the contents and the locations of keywords (e.g., different elements)
  To accomplish this, we need to consider the partial trees (structural items) within an XML document.
  Can we build indexes for the structural items (partial trees)?

Vector Space Model and XML

If we do not allow gaps in the tree structures, we can have structural items (partial trees) as:

[Figure: partial trees derived from Book documents with Title and Author children. Surviving labels: Book; Title; Author; “Software”; “Bill Gates”; “The plot to get Bill Gates”; “Gary Rivlin”]
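A simplified enumeration of gap-free structural items can be sketched as follows. This sketch makes a loud assumption: it only enumerates single chains (a contiguous run of ancestors ending at a text leaf), not branching partial trees, and structural_items() is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, modelled on the Book / Title / Author example.
DOC = "<book><title>Software</title><author>Bill Gates</author></book>"

def structural_items(xml_text):
    """Return (path, text) pairs for every gap-free path that ends at a text leaf.

    Simplification (an assumption of this sketch): only single chains are
    enumerated, not branching subtrees.
    """
    items = set()

    def walk(node, ancestors):
        path = ancestors + [node.tag]
        text = (node.text or "").strip()
        if text:
            # Every suffix of the path is a contiguous (gap-free) context;
            # the empty suffix stands for the bare text itself.
            for start in range(len(path) + 1):
                items.add(("/".join(path[start:]), text))
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), [])
    return items

items = structural_items(DOC)
for item in sorted(items):
    print(item)
```

Even this two-leaf document yields six items, which hints at how quickly the number of dimensions grows on real collections.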

• Problems of indexing with structural items
  - The number of distinct structural items can be huge.
  - It is not practical to build and store a vector space index with so many dimensions
• Some possible solutions
  - Build a query-time partial vector space
  - Restrict the structural items to a manageable set

• Weights of structural items (partial trees)
  - Down-weighting for structural items
    [Figure: Book -> Title (“Software”) and Full Text (P1, P2: “Windows platform, linux…”)]
    “Software” should have more influence (weight) for the book element than “Windows”, “Platform”….
    Calculate the weight of a term to an element k levels up by a scaling factor β^k, 0<β<1
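The down-weighting rule (a term contributes β^k to an element k levels above it) can be sketched in Python. The document, the element names fulltext and p, the value β = 0.5, and the helper propagated_weights() are all assumptions made for the illustration; weights are keyed by tag name for readability, so same-named elements share an entry.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical document modelled on the Book / Title / Full Text figure.
DOC = ("<book><title>Software</title>"
       "<fulltext><p>Windows platform</p><p>linux</p></fulltext></book>")

def propagated_weights(xml_text, beta=0.5):
    """Weight of each term for each element: the sum of beta**k over occurrences
    found k levels below that element (k = 0 for the element's own text)."""
    weights = defaultdict(lambda: defaultdict(float))

    def walk(node, ancestors):
        chain = ancestors + [node]
        for term in (node.text or "").lower().split():
            # reversed(chain): containing element (k = 0), its parent (k = 1), ...
            for k, element in enumerate(reversed(chain)):
                weights[element.tag][term] += beta ** k
        for child in node:
            walk(child, chain)

    walk(ET.fromstring(xml_text), [])
    return weights

w = propagated_weights(DOC, beta=0.5)
print(w["book"]["software"], w["book"]["windows"])
```

With these assumptions “software” (one level below book) ends up with a larger weight for the book element than “windows” (two levels below), matching the intent stated on the slide.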

Vector Space Model and XML

• Query-time partial vector space
  - Instead of generating all structural items at one time, we can generate only the necessary partial vector space for a specific query (a much smaller set)
  - For a specific query
    We seek all XML documents with any keyword that satisfies the query, and build a partial vector space from these XML documents
    The similarity of qualified XML documents and the query can be calculated within the partial vector space

• Weights of structural items (partial trees)
  - Down-weighting for structural items
    [Figure: Book with edge weight 0.8 to Title (“Software”) and 0.2 to Full Text (P1, P2: “Windows platform, linux…”)]
    Weights can also be set for different partial trees.
    The weights can be predefined.
    Weights can be application oriented.
    Weights can be user-specific.
    Weights can be query-specific.
    Learning issues…..

Vector Space Model and XML

• Other issues of weights of structural items (partial trees)
  Down-weighting uses the contents of low-level elements for high-level elements (e.g., contents of “title” and “full text” for “book”).
  Should we also incorporate contents of high-level (or same-level) elements for low-level elements?
  The smoothing strategy…
  [Figure: Book -> Title (“Software”) and Full Text (P1, P2: “Windows platform, linux…”)]

• Calculating the similarity
  - Vocabulary mismatch of keywords and structures
  - Keyword mismatch has been studied in traditional information retrieval; we can utilize techniques such as query expansion, latent semantic indexing, probabilistic semantic indexing….
  - Structure mismatch
    [Figure: three Book trees in which “Software” appears directly under Book, under Title, and under Full Text]

INEX: Benchmark for text-based XML Retrieval

• INEX: INitiative for the Evaluation of XML Retrieval
  - The analog of TREC (Text Retrieval Conference) for standard unstructured information retrieval
  - Provides a testbed of
    a set of XML documents, plain queries (content-only queries) and structured queries (with XML structure)
    a set of retrieval tasks
  - INEX 2002-2006: Mainly organized by people from Europe. It has attracted many participants from universities and big companies from all over the world

INEX: Benchmark for text-based XML Retrieval

• Ad-hoc XML Retrieval Task
  - Each system indexes a set of XML documents
  - For a set of queries (content-only, content and structure), the system converts the queries into an internal representation
  - In response, each system returns not documents, but the most relevant elements within documents
• Evaluation metrics
  - The retrieved elements are evaluated on two measures:
    Relevance: how relevant is the retrieved element
    Coverage: is the retrieved element too specific, too general or just right
    There are scales for the measures, which are then turned into precision/recall measures

Vector Space Model and XML

• Calculating the similarity
  - First find all structural items in the query
  - Find all similar matches against the vocabulary of structural items
    It is not a Boolean match, but a similarity match (e.g., 0.9 similarity score with an item)
  - Retrieve all documents/elements with that structural item, compute the cosine similarity, etc.
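The similarity calculation for structural items (graded rather than Boolean matching, followed by a cosine-style score) might look like the sketch below. The scoring rule in context_sim() is an assumption chosen for illustration and is not taken from the slides; the query and document vectors are hypothetical.

```python
import math

def context_sim(query_path, doc_path):
    """Graded (non-Boolean) match between two paths given as tuples of tags.

    Assumed rule for this sketch: if the query path can be turned into the
    document path by inserting tags, score (1 + len(query_path)) /
    (1 + len(doc_path)); otherwise 0.
    """
    it = iter(doc_path)
    is_subsequence = all(tag in it for tag in query_path)
    return (1 + len(query_path)) / (1 + len(doc_path)) if is_subsequence else 0.0

def score(query, doc):
    """query and doc map (path, term) -> weight; returns a cosine-style score."""
    dot = sum(qw * dw * context_sim(qp, dp)
              for (qp, qt), qw in query.items()
              for (dp, dt), dw in doc.items() if qt == dt)
    norm_q = math.sqrt(sum(v * v for v in query.values()))
    norm_d = math.sqrt(sum(v * v for v in doc.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

# Hypothetical structural items for the query /book//author "Gates".
query = {(("book", "author"), "gates"): 1.0}
doc_author = {(("book", "author"), "gates"): 1.0}
doc_nested = {(("book", "authors", "author"), "gates"): 1.0}
doc_title = {(("book", "title"), "gates"): 1.0}

print(score(query, doc_author), score(query, doc_nested), score(query, doc_title))
```

An exact structural match scores highest, a near match with one extra level scores lower but is still retrieved, and a term in the wrong element scores zero.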

Vector Space Model and XML

• Problems with the vector space model
  - What IDF value?
    We cannot use a corpus-wide IDF value. The IDF value should be element-specific. But do we need to incorporate the IDF factor of high-level or same-level elements?
  - For heterogeneous XML documents
    We do not know exactly the mapping between the schemas. Do we need schema mapping? How can we deal with the uncertainty of schema mapping?

INEX: Benchmark for text-based XML Retrieval

• Ad-hoc XML Retrieval Task
  - 12,107 articles from IEEE Computer Society publications
  - 494 megabytes
  - Average article: 1,532 XML nodes/elements
  - Average node/element depth = 7

INEX: Benchmark for text-based XML Retrieval

• Relevance
  - Relevance is assessed on a scale from Irrelevant (scoring 0) to Highly Relevant (scoring 3)
• Coverage
  - No Coverage (N), too general (L), too specific (S), Exact (E)
• So every element returned by each engine has ratings from {0,1,2,3} × {N,S,L,E}

XML Information Retrieval: Outline

Basic Concepts of Information Retrieval:
• Semi-Structured Data
  - XML, Examples, Application
• XML Search
  - XQuery
  - XIRQL
• Text-Based XML Retrieval
  - Vector-space model
  - INEX

INEX: Benchmark for text-based XML Retrieval

Define scores:

\[
f_{\mathrm{strict}}(rel, cov) =
\begin{cases}
1 & \text{if } (rel, cov) = 3E \\
0 & \text{otherwise}
\end{cases}
\]

\[
f_{\mathrm{generalized}}(rel, cov) =
\begin{cases}
1.00 & \text{if } (rel, cov) = 3E \\
0.75 & \text{if } (rel, cov) \in \{2E, 3L, 3S\} \\
0.50 & \text{if } (rel, cov) \in \{1E, 2L, 2S\} \\
0.25 & \text{if } (rel, cov) \in \{1S, 1L\} \\
0.00 & \text{if } (rel, cov) = 0N
\end{cases}
\]
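The two scoring functions translate directly into a lookup. In the sketch below the relevance grade is an integer 0 to 3 and the coverage code is one of "N", "S", "L", "E"; returning 0.0 for any pair not listed in the definition is an assumption of this sketch, since the slide only lists 0N explicitly for the zero case.

```python
def f_strict(rel, cov):
    """1 only for highly relevant elements with exact coverage (3E)."""
    return 1.0 if (rel, cov) == (3, "E") else 0.0

# Score -> set of (relevance, coverage) pairs, as listed in the definition.
GENERALIZED = {
    1.00: {(3, "E")},
    0.75: {(2, "E"), (3, "L"), (3, "S")},
    0.50: {(1, "E"), (2, "L"), (2, "S")},
    0.25: {(1, "S"), (1, "L")},
}

def f_generalized(rel, cov):
    """Graded score; pairs not listed above (including 0N) score 0.0 (assumed)."""
    for value, pairs in GENERALIZED.items():
        if (rel, cov) in pairs:
            return value
    return 0.0

# Example: an element judged fairly relevant (2) but too large (L).
print(f_strict(2, "L"), f_generalized(2, "L"))
```

The strict function rewards only perfect answers, while the generalized one gives partial credit to elements that are relevant but too large or too small.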

INEX: Benchmark for text-based XML Retrieval

• Heterogeneous XML retrieval task:
  - The ad hoc track in INEX has dealt with a single DTD for one type of data (computer science journal articles)
  - In “real-world” environments, XML retrieval must deal with different DTDs, different genres of data and widely varying topical content
  - Problems:
    What methods can be used to map structural criteria onto other DTDs?
    Should mappings focus on element names or also deal with element content or semantics?

XML Resources

• www.w3.org/XML - XML resources at W3C
• Jan-Marco Bremer’s publications on XML and IR: http://www.db.cs.ucdavis.edu/~bremer
• Norbert Fuhr and Kai Grossjohann. XIRQL, SIGIR 2001
• INEX: http://inex.is.informatik.uni-duisburg.de/
• Chris Manning: Introduction to Information Retrieval

Some contents of the slides are based on the above materials.