XML Retrieval: DB/IR in Theory, Web in Practice

Sihem Amer-Yahia (Yahoo! Research) and Mariano Consens (University of Toronto)

In collaboration with: Ricardo Baeza-Yates (Yahoo! Research) and Mounia Lalmas (Queen Mary, University of London)

VLDB 2007, Vienna, 26/09/2007

Preliminaries

- DB focused on languages, expressiveness, and efficient evaluation
- IR focused on scoring and relevance metrics
  - In practice, a limited set of operations and simple ranking go a long way
- Theory is scary (think XQuery)
- Practice is inspiring but looks ad hoc

Notion of Relevance

- Data retrieval:
  - Syntax expresses semantics

- Information retrieval:
  - Ambiguous semantics
  - Relevance depends on user and context
  - There is no "perfect" retrieval system

- User assessments to evaluate system effectiveness

Overview

- Preliminaries
- Web in Practice
  - Search in Web 2.0
  - Microformats and Mashups
- DB/IR in Theory
  - Query Languages
  - Retrieval Semantics
  - Evaluation à la DB (Query Processing)
  - Evaluation à la IR (Relevance Assessments)
- Challenges

Web 2.0 (from Wikipedia)

Rich Set of Buzzwords

(Web) Search is a Basic Necessity

A (grossly inadequate) analogy: toilets and Web 2.0

"Rich societies have developed quite complicated and expensive systems for removing human wastes from houses and cities, usually by dumping them, treated to one degree or another, into subsoils or bodies of water." (Peter Bane, 2006)

Rich Standard Infrastructure

Standard Pipes

XML

Big Infrastructure Sites

Water Treatment Plants

Search Engines, Portals

Community Sites

The Importance of Mobility

The need to carry around technological solutions to basic necessities

Most Commonly Used is …

Squat toilet: the "most popular searches" (2-3 keywords)

There are simple and sophisticated solutions to basic necessities. Need for more sophisticated search.

Overview

- Preliminaries
- Web in Practice
  - Search in Web 2.0
  - Microformats and Mashups
- DB/IR in Theory
  - Query Languages
  - Retrieval Semantics
  - Evaluation à la DB (Query Processing)
  - Evaluation à la IR (Relevance Assessments)
- Challenges

Microformats

- Community data formats
  - Personal data: hCard (vCard)
  - Calendar and events: hCal (iCal)
  - Social networking: XFN
  - Reviews: hReview
  - Licenses: rel-license
  - Tags: rel-tag
- Embedded in XHTML pages and RSS feeds
  - Also RSS extensions (iTunes, Yahoo! Media, Google Base, 20+ more in use)

Example: hCal

Fashion Expo in Paris, France: Oct 20 to 22
- Large and growing list of websites
  - Eventful.com
  - LinkedIn
  - Yedda
  - upcoming.yahoo.com
  - Yahoo! Local, Yahoo! Tech Reviews
- Benefit from shared tools and practices (hCalendar creator, iCal extraction)

Semantic Mashups

- A "semantic" mashup can
  - Contact (hCard)
  - Friends (XFN, FOAF)
  - To attend a recommended event (hCal, hReview)
- Microformats are the lowercase semantic web
- Also machine tags (e.g., flickr:user=me)
  - Tags that use a special syntax to define extra information about a tag
  - Have a namespace, a predicate and a value (sounds familiar?)

Search in Mashup Creation

Mashup Tools

- Microsoft Popfly
- IBM Project Zero
- Yahoo! Pipes
  - Allows developers to mash up web data
  - Drag-and-drop editor that enables users to connect multiple data sources
  - A source is grabbed and searched!
  - Both content and structure are queried

Yahoo! Pipes Demo


Yahoo! Pipes Demo Result

Overview

- Preliminaries
- Web in Practice
  - Search in Web 2.0
  - Microformats and Mashups
- DB/IR in Theory
  - Retrieval Languages and Semantics
  - Evaluation à la DB (Query Processing)
  - Evaluation à la IR (Relevance Assessments)
- Challenges

Take Away

- Search is crucial when accessing Web 2.0 sources
- There is already demand for exploiting additional structure in Web 2.0 search
- Structure (XML) retrieval needs to:
  - be exposed to users/developers
  - support rich, context-dependent semantics
  - address efficiency and effectiveness

Overview

- Preliminaries
- Web in Practice
- DB/IR in Theory
  - Query Languages
  - Retrieval Semantics
  - Evaluation à la DB (Query Processing)
  - Evaluation à la IR (Relevance Assessments)
- Challenges

Languages

- Keyword search
  - "squat"
- Tag + keyword search
  - description: squat
- Path expression + keyword search
  - //image[./title about "squat"]
- XQuery + complex full-text search
  - for $i in //image let score $s := $i ftscore "squat" && "toilet" distance 2

Overview

- Preliminaries
- Web in Practice
- DB/IR in Theory
  - Query Languages
  - Retrieval Semantics
  - Evaluation à la DB (Query Processing)
  - Evaluation à la IR (Relevance Assessments)
- Challenges

Retrieval Semantics

- Structure search incorporates conditions on the underlying structure of a collection
  - Schemas help
  - Schemas prescribe data and help validation
  - Provide limited description of valid instances

- New semantics
  - Lowest common ancestor
  - Query relaxation
  - Overlapping elements

Lowest Common Ancestor

- Retrieve the most relevant fragment
- References:
  - Nearest Concept Queries (Schmidt et al, ICDE 2002)
  - XRank (Guo et al, SIGMOD 2003)
  - Schema-Free XQuery (Li et al, VLDB 2004)
  - XKSearch (Xu & Papakonstantinou, SIGMOD 2005)
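The shared idea can be sketched in a few lines: given the positions at which the query keywords occur, return their lowest common ancestor as the smallest fragment containing all of them. This is a minimal illustration, not any of the cited systems' actual algorithms; the path-as-tuple encoding is an assumption made for the example.

```python
def lca(paths):
    """Lowest common ancestor of a set of nodes, each given as the
    tuple of child indices on its path from the root, e.g. (0, 2, 1)."""
    prefix = paths[0]
    for p in paths[1:]:
        # shrink the running common prefix until it is also a prefix of p
        n = 0
        for a, b in zip(prefix, p):
            if a != b:
                break
            n += 1
        prefix = prefix[:n]
    return prefix

# "squat" occurs under node (0, 1, 0) and "toilet" under (0, 1, 2):
# the smallest fragment containing both is their common parent (0, 1).
assert lca([(0, 1, 0), (0, 1, 2)]) == (0, 1)
```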

VLDB 2007, Vienna, 26/09/2007 27 XRank XML and Information Retrieval: A SIGIR 2000 Workshop David Carmel, Yoelle Maarek, Aya Soffer XQL and Proximal Nodes Ricardo Baeza-Yates Gonzalo Navarro We consider the recently proposed language …

ShiSearching on structure d text i ibis becoming more i mportant wi ihXMLth XML … The XQL language …

(Guo etal, SIGMOD 2003) XRank

XML and Information Retrieval: A SIGIR 2000 Workshop David Carmel,,,y Yoelle Maarek, Aya Soffer XQL and Proximal Nodes Ricardo Baeza-Yates Gonzalo Navarro We consider the recently proposed language …

Searching on structured text is becoming more important with XML … The XQL language …
… XIRQL

XML and Information Retrieval: A SIGIR 2000 Workshop David Carmel , Yoelle Maarek, A ya Soffer XQL and Proximal Nodes Ricardo Baeza-Yates Gonzalo Navarro We consider the recentlyyp pro posed lan gua ge …

index nodes Searching on structured text is becoming more important with XML … The XQL language
… (Fuhr & Großjohann, SIGIR 2001) XML Query Relaxation

- Twig scoring
  - High quality
  - Expensive computation
- Path scoring
- Binary scoring
  - Low quality
  - Fast computation

[Figure: a query twig over image, info, title ("toilet"), and author ("squat"), and its decompositions into paths and into individual binary predicates.]

(Amer-Yahia et al, VLDB 2005)

XML Query Relaxation

- Tree pattern relaxations:
  - Leaf node deletion
  - Edge generalization
  - Subtree promotion

[Figure: the query twig (image, info, title ("toilet"), author ("squat")) matched against data trees that each satisfy one of the relaxations.]

(Amer-Yahia, SIGMOD 2004) (Schlieder, EDBT 2002) (Delobel & Rousset, 2002)

Controlling Overlap

What most approaches are doing:

• Given a ranked list of elements:

1. Select the element with the highest score within a path
2. Discard all its ancestors and descendants
3. Go to step 1 until all elements have been dealt with

• (Also referred to as brute-force filtering)
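The three steps above can be sketched as follows; the path-based overlap test is an assumption made for illustration, not drawn from any specific system.

```python
def is_overlapping(p, q):
    """True if one element's path is an ancestor of (or equal to) the other's."""
    shorter, longer = sorted((p, q), key=len)
    return longer[:len(shorter)] == shorter

def remove_overlap(ranked):
    """ranked: list of (path, score) pairs; returns an overlap-free subset,
    greedily keeping the highest-scoring element on each path."""
    kept = []
    for path, score in sorted(ranked, key=lambda e: -e[1]):
        if not any(is_overlapping(path, k) for k, _ in kept):
            kept.append((path, score))
    return kept

results = [(("article",), 0.4),
           (("article", "sec1"), 0.9),        # best element on its path wins
           (("article", "sec1", "p2"), 0.7),  # descendant: discarded
           (("article", "sec2"), 0.5)]        # disjoint path: survives
assert [p for p, _ in remove_overlap(results)] == \
       [("article", "sec1"), ("article", "sec2")]
```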

Post-Processing Overlap

- Sometimes with some "prior" processing to affect ranking:

  - Use of a utility function that captures the amount of useful information in an element

Element score * Element size * Amount of relevant information

  - Used as a prior probability

  - Then apply "brute-force" overlap removal

(Mihajlovic et al, INEX 2005; Ramirez et al, FQAS 2006)
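As a toy illustration of the multiplicative utility prior above (the numbers are invented for the example; see the cited papers for the actual functions):

```python
def utility(score, size, relevant_fraction):
    """Element score * element size * amount of relevant information."""
    return score * size * relevant_fraction

# A long, mostly relevant section can outrank a short snippet whose
# raw retrieval score is higher:
assert utility(0.6, 500, 0.8) > utility(0.9, 40, 1.0)
```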

Post-Processing Overlap

- Scores of elements containing or contained within higher-ranking components are iteratively adjusted (depends on the amount of overlap "allowed")

1. Select the highest ranking component.

2. Adjust the retrieval status value of the other components.

3. Repeat steps 1 and 2 until the top m components have been selected.

(Clarke, SIGIR 2005)
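A sketch of this iterative adjustment; the down-weighting factor alpha and the overlap representation are assumptions made for illustration, not Clarke's actual formulation.

```python
def select_top_m(scores, overlaps, m, alpha=0.5):
    """scores: element -> retrieval status value.
    overlaps: element -> set of elements that contain it or are contained in it."""
    scores = dict(scores)
    selected = []
    while len(selected) < m and scores:
        best = max(scores, key=scores.get)    # 1. highest-ranking component
        selected.append(best)
        del scores[best]
        for e in overlaps.get(best, ()):      # 2. adjust overlapping components
            if e in scores:
                scores[e] *= alpha
    return selected                           # 3. repeat until top m selected

scores = {"article": 0.7, "sec1": 0.9, "sec1/p2": 0.8, "sec2": 0.5}
overlaps = {"sec1": {"article", "sec1/p2"}}
# After sec1 is picked, its container and its paragraph are down-weighted,
# so the disjoint sec2 comes next:
assert select_top_m(scores, overlaps, 2) == ["sec1", "sec2"]
```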

Post-Processing Overlap

Smart filtering (Mass & Mandelbrod, INEX 2005). Given a ranked list of elements:
- group elements per article
- build a result tree
- apply "score grouping": for each element, consider
  1. score N2 > score N1
  2. concentration of good elements
  3. even distribution of good elements

[Figure: three cases of nested elements N1 and N2 illustrating the grouping criteria.]

Languages

- Keyword search (CO queries)
  - "xml"
- Tag + keyword search
  - book: xml
- Path expression + keyword search (CAS queries)
  - /book[./title about "xml db"]
- XQuery + complex full-text search
  - for $b in /book let score $s := $b ftcontains "xml" && "db" distance 5

Overview

- Preliminaries
- Web in Practice
- DB/IR in Theory
  - Query Languages
  - Retrieval Semantics
  - Evaluation à la DB (Query Processing)
  - Evaluation à la IR (Relevance Assessments)
- Challenges

Encodings, Summaries, Indexes, Access Methods

Stack Algorithms

Region algebra encoding
- Elements: [DocID, Element, Start, End, LevelNum]
- Values: [DocID, Value, Start, LevelNum]
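With this encoding, ancestor/descendant tests reduce to interval containment, which is what stack-based structural join algorithms exploit. A minimal sketch (field names follow the slide; the tuple layout and example values are otherwise assumptions):

```python
from collections import namedtuple

Elem = namedtuple("Elem", "doc_id tag start end level")

def is_ancestor(a, b):
    """a is an ancestor of b iff they are in the same document and
    a's [start, end] region strictly encloses b's."""
    return a.doc_id == b.doc_id and a.start < b.start and b.end < a.end

book  = Elem(1, "book",  1, 100, 1)
title = Elem(1, "title", 2, 5,   2)
other = Elem(2, "title", 2, 5,   2)   # same region, different document

assert is_ancestor(book, title)
assert not is_ancestor(book, other)
assert not is_ancestor(title, book)
```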

Structural Summaries

- XML structural summaries are graphs representing relationships between sets in a partition of XML elements.
- Many proposals
  - Region inclusion graphs (RIGs) [CM94], representative objects (ROs) [NUWC97], DataGuides [GW97], 1-index, 2-index and T-index [MS99], ToXin [RM01], XSKETCH [PG02], APEX [CMS02], A(k)-index [KSBG02], F+B-Index and F&B-Index [KBNK02], D(k)-index [QLO03], M(k)-index [HY04], Skeleton [BCFH+05], XCLUSTER [PG06]
- AxPRE (axis path r.e.) summaries answer:
  - How are all these summaries related?
  - Can they be constructed together?
  - Can they be used [for query evaluation] together?

Query Processing

for $x in document("catalog.xml")//item,
    $y in document("parts.xml")//part,
    $z in document("supplier.xml")//supplier
let score $s := $z/address ftscore "Toronto" && "Ontario"
where $x/part_no = $y/part_no
  and $z/supplier_no = $x/supplier_no
order by $s
return {$x/part_no} {$x/price} {$y/description}

Retrieval models

Score Combination

BM25, DFR, SLM

[Figure: score-combination pipeline. Weighted queries are evaluated against inverted files, and the per-list rankings are merged with aggregation functions (Sum, Max, MinMax, Abs) into an article-level ranking.]

Preliminaries for Top-k Retrieval

- Each object is scored using different criteria
  - Score (or grade) is a value, usually in [0,1]
- A criterion (e.g., a keyword) refers to attributes or keywords specified in the query
- Each criterion has a sorted list R of (object, score) pairs
- The combined score is computed using an aggregation function t(x1, x2, …, xm)
  - Monotone: if xi ≤ x'i for every i, then t(x1, x2, …, xm) ≤ t(x'1, x'2, …, x'm)
  - Examples: average, weighted sum, min, max, etc.
- Goal
  - Merge ranked results to find the best top-k answers

Threshold Algorithm (TA) [FLN'01]

- Sorted access in parallel to each of the m lists
- Random access for every new object seen in every other list, to find the i-th field xi of R
- Use the aggregation function t(R) = t(x1, x2, …, xm) to calculate the grade and store it in set Y only if it belongs to the current top-k objects

- Calculate the threshold value T = t(x1, x2, …, xm) of the aggregation function after every sorted access; stop when k objects have grade at least T
- Return the set Y, which holds the top-k values

- Analysis: TA is optimal over every instance
  - but… big-O constants, and don't forget the assumptions
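The loop structure of TA can be sketched as follows. This is a simplified rendering for clarity: it materializes a random-access table up front and ignores access costs, which the real algorithm accounts for.

```python
def threshold_algorithm(lists, t, k):
    """lists: per-criterion lists of (object, score), each sorted by
    descending score. t: a monotone aggregation function over m scores."""
    m = len(lists)
    index = {}                              # random-access table: obj -> {list: score}
    for i, L in enumerate(lists):
        for obj, s in L:
            index.setdefault(obj, {})[i] = s
    grades = {}                             # the set Y of aggregated grades
    for depth in range(max(len(L) for L in lists)):
        for L in lists:                     # one round of sorted accesses
            if depth < len(L):
                obj = L[depth][0]
                # random access: fetch obj's score under every criterion
                grades[obj] = t([index[obj].get(i, 0.0) for i in range(m)])
        # threshold: aggregate of the last scores seen under sorted access
        T = t([L[min(depth, len(L) - 1)][1] for L in lists])
        top = sorted(grades.values(), reverse=True)[:k]
        if len(top) == k and top[-1] >= T:  # k objects with grade >= T: stop
            break
    return sorted(grades.items(), key=lambda kv: -kv[1])[:k]

l1 = [("a", 0.9), ("b", 0.8), ("c", 0.1)]
l2 = [("b", 0.9), ("a", 0.6), ("c", 0.2)]
avg = lambda xs: sum(xs) / len(xs)
# "b" wins with grade (0.8 + 0.9) / 2, found without reading the lists fully:
assert [o for o, _ in threshold_algorithm([l1, l2], avg, 1)] == ["b"]
```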

Variations of TA

- NRA: when no random access (RA) is possible
  - Example: Web search engines, which typically do not allow you to enter a URL and get its ranking

- TAZ: when no sorted access (SA) is possible for some predicates
  - Example: find good restaurants near location x (sorted and random access for restaurant ratings, random access only for distances from a mapping site)
- CA: when the relative costs of random and sorted accesses matter (TA+NRA)

- TAθ: when only approximate answers are needed
  - Example: Web search, with lots of good-quality answers

- SA/RA scheduling problem, IO-Top-K [BMSTW'06]

VisTopK Demo

Overview

- Preliminaries
- Web in Practice
- DB/IR in Theory
  - Query Languages
  - Retrieval Semantics
  - Evaluation à la DB (Query Processing)
  - Evaluation à la IR (Relevance Assessments)
- Challenges

Evaluation of XML retrieval: INEX

- Evaluating the effectiveness of content-oriented XML retrieval approaches, like TREC
- Collaborative effort ⇒ participants contribute to the development of the collection (IEEE and Wikipedia), queries, relevance assessments, methodology

- Content-only (CO) topics
  - Ignore document structure
- Content-and-structure (CAS) topics
  - Contain conditions referring both to content and structure of the sought elements
  - Conditions may or may not be strict

CAS topics 2003-2004

//article[(./fm//yr = '2000' OR ./fm//yr = '1999') AND about(., '"intelligent transportation system"')]//sec[about(., 'automation +vehicle')]

Automated vehicle applications in articles from 1999 or 2000 about intelligent transportation systems. To be relevant, the target component must be from an article on intelligent transportation systems published in 1999 or 2000 and must include a section which discusses automated vehicle applications, proposed or implemented, in an intelligent transportation system.

Relevance in XML retrieval

- A document is relevant if it "has significant and demonstrable bearing on the matter at hand".
- Common assumptions in laboratory experimentation:
  - Objectivity
  - Topicality
  - Binary nature
  - Independence

[Figure: an example article tree with sections on "XML retrieval" and "evaluation" and subsections ss1, ss2.]

(Borlund, JASIST 2003) (Goevert et al, JIR 2006)

Relevance in XML retrieval: INEX 2003-2004

- Topicality is not enough
- Binary nature is not enough
- Independence is wrong

[Figure: the same example article tree, with nodes labelled "XML retrieval", "evaluation", and "XML evaluation", and subsections ss1, ss2.]

- Relevance = (0,0), (1,1), (1,2), (1,3), (2,1), (2,2), (2,3), (3,1), (3,2), (3,3)

exhaustivity = how much the section discusses the query: 0, 1, 2, 3

specificity = how focused the section is on the query: 0, 1, 2, 3

- If a subsection is relevant, so must be its enclosing section, …

Specificity Dimension 2005

Continuous scale, defined as the ratio (in characters) of the highlighted text to the element size.
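For example (numbers invented for illustration):

```python
def specificity(highlighted_chars, element_chars):
    """INEX 2005 specificity: fraction of an element's characters that
    the assessor highlighted as relevant."""
    return highlighted_chars / element_chars

# 500 of an element's 2000 characters are highlighted:
assert specificity(500, 2000) == 0.25
# a fully highlighted element is perfectly specific:
assert specificity(120, 120) == 1.0
```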

Measuring effectiveness: Metrics

- Inex_eval (also known as inex2002) (Goevert & Kazai, INEX 2002): official INEX metric 2002-2004
- Inex_eval_ng (also known as inex2003) (Goevert et al, JIR 2006)
- ERR (expected ratio of relevant units) (Piwowarski & Gallinari, INEX 2003)
- xCG (XML cumulative gain) (Kazai & Lalmas, TOIS 2006): official INEX metric 2005-
- t2i (tolerance to irrelevance) (de Vries et al, RIAO 2004)

- EPRUM (Expected Precision Recall with User Modelling) (Piwowarski & Dupret, SIGIR 2006)
- HiXEval (Highlighting XML Retrieval Evaluation) (Pehcevski & Thom, INEX 2005): official INEX metric 2007
- Structural Relevance (Ali & Consens & Lalmas, SIGIR Element Retrieval Workshop 2007)

Overview

- Preliminaries
- Web in Practice
- DB/IR in Theory
- Challenges

Challenges

- In practice, user interfaces are key
  - Combine sources of information
  - Provide feedback on retrieval results
- Interaction between traditional DB query optimization and ranking/top-k
- What are the useful extensions to keyword querying that incorporate structural information?

Challenges

- Indexing, searching, ranking
  - Efficient (and effective) algorithms
- INEX-like test collection and effectiveness
  - Too complex?
  - What constitutes a retrieval baseline?
  - What is a good measure?
  - Generalisation of the results to other data sets
- Quality evaluation (Web, XML)
  - Who are the users?
  - What are their information needs?
  - What are the requirements?

Challenges Ahead

- Lots of opportunities
  - To understand the structure of data
  - To exploit structure in searches
  - To measure and improve search quality
- Can search remain a joy to use when users are allowed to
  - Contribute content? (Wikipedia)
  - Share it? (Flickr)
  - Rate it? (YouTube)
