Semistructured Data Search
Krisztian Balog, University of Stavanger

The data landscape

[Figure: a spectrum from unstructured to structured data — flat text and HTML (unstructured), XML and RDF (semistructured), databases (structured)]

Semistructured data
- Lack of a fixed, rigid schema
- No separation between the data and the schema; self-describing structure (tags or other markers)

Promise Winter School 2013 | Bressanone, Italy, February 2013

Motivation
- Supporting users who cannot express their need in structured query languages
  - SQL, SPARQL, Inquery, etc.
- Dealing with heterogeneity
  - Users are unaware of the schema of the data
  - No single schema to the data

Semistructured data
- Advantages
  - The data is not constrained by a fixed schema
  - Flexible (the schema can easily be changed)
  - Portable
  - Possible to view structured data as semistructured
- Disadvantages
  - Queries are less efficient than in a constrained structure

In this talk
- How to exploit the structure available in the data for retrieval purposes?
- Working in a Language Modeling setting
- A number of different tasks
- Retrieving entire documents
  - i.e., no element-level retrieval
  - a textual document representation is readily available
  - no aggregation over multiple documents/sources

Incorporating structure
- Different types of structure
  - Document, query, context

[Figure: (1) document, (2) query, and (3) context around the retrieval process]

Language Modeling

Preliminaries
- Rank documents d according to their likelihood of being relevant given a query q: P(d|q)

P(d|q) = P(q|d) P(d) / P(q) ∝ P(q|d) P(d)

- Query likelihood P(q|d): the probability that query q was "produced" by document d
- Document prior P(d): the probability of the document being relevant to any query

Query likelihood scoring
- Estimate a multinomial probability distribution over the vocabulary of terms from the text: the document language model θ_d
- Smooth the distribution with one estimated from the entire collection

P(q|d) = ∏_{t∈q} P(t|θ_d)^{n(t,q)}

where n(t,q) is the number of times t appears in q.

P(t|θ_d) = (1 − λ) P(t|d) + λ P(t|C)

- Empirical document model (maximum likelihood estimate): P(t|d) = n(t,d) / |d|
- Collection model: P(t|C) = ∑_d n(t,d) / ∑_d |d|
- λ is the smoothing parameter

Example: Empirical document LM

P(t|d) = n(t,d) / |d|

"In the town where I was born, / Lived a man who sailed to sea, / And he told us of his life, / In the land of submarines, / So we sailed on to the sun, / Till we found the sea green, / And we lived beneath the waves, / In our yellow submarine, / We all live in yellow submarine, / yellow submarine, yellow submarine, / We all live in yellow submarine, / yellow submarine, yellow submarine."

[Figure: bar chart of P(t|d) over the vocabulary (beneath, born, found, green, he, his, i, land, life, lived, man, our, sailed, sea, so, submarines, sun, till, told, town, us, waves, where, who, ...); "submarine" and "yellow" dominate the distribution]

Alternatively...

[Figure: the same document language model visualised with term size proportional to probability]

Scoring a query

q = {sea, submarine}

P(q|d) = P("sea"|θ_d) · P("submarine"|θ_d)

With λ = 0.1 and the statistics below:

P("sea"|θ_d)       = (1 − λ) P("sea"|d) + λ P("sea"|C)             = 0.9 · 0.04 + 0.1 · 0.0002 = 0.03602
P("submarine"|θ_d) = (1 − λ) P("submarine"|d) + λ P("submarine"|C) = 0.9 · 0.14 + 0.1 · 0.0001 = 0.12601

P(q|d) = 0.03602 · 0.12601 ≈ 0.00454

t           P(t|d)      t           P(t|C)
submarine   0.14        submarine   0.0001
sea         0.04        sea         0.0002
...         ...         ...         ...
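The scoring above can be sketched in a few lines of Python. This is a minimal illustration, not the deck's implementation: it assumes pre-tokenized term lists, and it multiplies raw probabilities, whereas a real system would sum log-probabilities to avoid underflow. With the toy statistics from this example (λ = 0.1), it reproduces P(q|d) = 0.03602 · 0.12601 ≈ 0.00454.

```python
from collections import Counter

def smoothed_term_prob(term, doc_counts, doc_len, coll_counts, coll_len, lam=0.1):
    # P(t|theta_d) = (1 - lambda) * P(t|d) + lambda * P(t|C)
    p_td = doc_counts[term] / doc_len      # empirical document model, n(t,d)/|d|
    p_tc = coll_counts[term] / coll_len    # collection model
    return (1 - lam) * p_td + lam * p_tc

def query_likelihood(query_terms, doc_terms, coll_counts, coll_len, lam=0.1):
    # P(q|d) = prod over t in q of P(t|theta_d)^n(t,q)
    doc_counts = Counter(doc_terms)
    score = 1.0
    for term, n_tq in Counter(query_terms).items():
        score *= smoothed_term_prob(term, doc_counts, len(doc_terms),
                                    coll_counts, coll_len, lam) ** n_tq
    return score
```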

Part I: Document structure

In this part
- Incorporate document structure into the document language model
- Represented as document fields

P(d|q) ∝ P(q|d) P(d) = P(d) ∏_{t∈q} P(t|θ_d)^{n(t,q)}

where θ_d is the document language model.

Use case: Web document retrieval

Unstructured representation

PROMISE Winter School 2013 Bridging between Information Retrieval and Databases Bressanone, Italy 4 - 8 February 2013 The aim of the PROMISE Winter School 2013 on "Bridging between Information Retrieval and Databases" is to give participants a grounding in the core topics that constitute the multidisciplinary area of information access and retrieval to unstructured, semistructured, and structured information. The school is a week-long event consisting of guest lectures from invited speakers who are recognized experts in the field. The school is intended for PhD students, Masters students or senior researchers such as post-doctoral researchers from the fields of databases, information retrieval, and related fields. [...]

Web document retrieval: HTML source

[Figure: the HTML source of the page, with the <title>, <meta>, heading, and body elements highlighted]

Web document retrieval: Fielded representation based on HTML markup

title:    Winter School 2013
meta:     PROMISE, school, PhD, IR, DB, [...]
headings: PROMISE Winter School 2013
          Bridging between Information Retrieval and Databases
          Bressanone, Italy 4 - 8 February 2013
body:     The aim of the PROMISE Winter School 2013 on "Bridging between Information Retrieval and Databases" is to give participants a grounding in the core topics that constitute the multidisciplinary area of information access and retrieval to unstructured, semistructured, and structured information. The school is a week-long event consisting of guest lectures from invited speakers who are recognized experts in the field. The school is intended for PhD students, Masters students or senior researchers such as post-doctoral researchers from the fields of databases, information retrieval, and related fields. [...]

Fielded Language Models [Ogilvie & Callan, SIGIR'03]
- Build a separate language model for each field
- Take a linear combination of them

P(t|θ_d) = ∑_{j=1}^{m} µ_j P(t|θ_{d_j}),   with field weights ∑_{j=1}^{m} µ_j = 1

- Field language model θ_{d_j}: smoothed with a collection model built from all document representations of the same type in the collection

P(t|θ_{d_j}) = (1 − λ_j) P(t|d_j) + λ_j P(t|C_j)

- Empirical field model (maximum likelihood estimate): P(t|d_j) = n(t,d_j) / |d_j|
- Collection field model: P(t|C_j) = ∑_d n(t,d_j) / ∑_d |d_j|

Parameter estimation
- Smoothing parameter λ_j: Dirichlet smoothing with the average representation length
- Field weights µ_j
  - Heuristically (e.g., proportional to the length of text content in that field)
  - Empirically (using training queries)
    - Computationally intractable for more than a few fields

Example

q = {IR, winter, school}
fields = {title, meta, headings, body}
µ = {0.2, 0.1, 0.2, 0.5}

P(q|θ_d) = P("IR"|θ_d) · P("winter"|θ_d) · P("school"|θ_d)

P("IR"|θ_d) = 0.2 · P("IR"|θ_{d_title})
            + 0.1 · P("IR"|θ_{d_meta})
            + 0.2 · P("IR"|θ_{d_headings})
            + 0.5 · P("IR"|θ_{d_body})
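A minimal sketch of the mixture of field language models, mirroring the example above. The field models here hold already-smoothed P(t|θ_{d_j}) values; all probabilities are made up purely for illustration.

```python
def fielded_term_prob(term, field_models, field_weights):
    # Mixture of field LMs: P(t|theta_d) = sum_j mu_j * P(t|theta_{d_j})
    return sum(mu * field_models[f].get(term, 0.0)
               for f, mu in field_weights.items())

# Toy smoothed field models (illustrative values only)
field_models = {
    "title":    {"winter": 0.30, "school": 0.30},
    "meta":     {"ir": 0.20, "school": 0.10},
    "headings": {"ir": 0.05, "winter": 0.15, "school": 0.15},
    "body":     {"ir": 0.01, "winter": 0.02, "school": 0.03},
}
weights = {"title": 0.2, "meta": 0.1, "headings": 0.2, "body": 0.5}  # sums to 1

p_q = 1.0
for t in ["ir", "winter", "school"]:                 # q = {IR, winter, school}
    p_q *= fielded_term_prob(t, field_models, weights)
print(p_q)
```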

Use case: Entity retrieval in RDF data

dbpedia:Audi_A4

foaf:name                      Audi A4
rdfs:label                     Audi A4
rdfs:comment                   The Audi A4 is a compact executive car produced since late 1994 by the German car manufacturer Audi, a subsidiary of the Volkswagen Group. The A4 has been built [...]
dbpprop:production             1994, 2001, 2005, 2008
rdf:type                       dbpedia-owl:MeanOfTransportation, dbpedia-owl:Automobile
dbpedia-owl:manufacturer       dbpedia:Audi
dbpedia-owl:class              dbpedia:Compact_executive_car
owl:sameAs                     freebase:Audi A4
is dbpedia-owl:predecessor of  dbpedia:Audi_A5
is dbpprop:similar of          dbpedia:Cadillac_BLS

Hierarchical Entity Model: Two-level hierarchy [Neumayer et al., ECIR'12]
- The number of possible fields is huge
  - It is not possible to optimise their weights directly
- Entities are sparse w.r.t. different fields
  - Most entities have only a handful of predicates
- Organise fields into a two-level hierarchy
  - Field types (4) on the top level: Name, Attributes, Out-relations, In-relations
  - Individual fields of that type on the bottom level
- Estimate field weights
  - Using training data for field types
  - Using heuristics for bottom-level fields

[Figure: the Audi A4 predicates grouped by field type — Name: foaf:name, rdfs:label; Attributes: rdfs:comment, dbpprop:production; Out-relations: rdf:type, dbpedia-owl:manufacturer, dbpedia-owl:class, owl:sameAs; In-relations: is dbpedia-owl:predecessor of, is dbpprop:similar of]

Hierarchical Entity Model: Field generation [Neumayer et al., ECIR'12]

Term generation: the importance of a term is jointly determined by the field it occurs in as well as all fields of that type

P(t|θ_d) = ∑_F P(t|F,d) P(F|d)

Field type importance, taken to be the same for all entities: P(F|d) = P(F)

P(t|F,d) = ∑_{d_f ∈ F} P(t|d_f,F) P(d_f|F,d)

Field generation, smoothed with a collection-level model: P(t|d_f,F) = (1 − λ) P(t|d_f) + λ P(t|θ_{d_F})

Estimating field importance P(d_f|F,d):
- Uniform: all fields of the same type are equally important
- Length: proportional to field length (on the entity level)
- Average length: proportional to field length (on the collection level)
- Popularity: number of documents that have the given field
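A sketch of the two-level mixture above. All parameter names are illustrative assumptions: entity_fields maps field type → {field name → smoothed P(t|d_f,F)}, type_priors holds P(F) (e.g., learned from training data), and field_importance holds the heuristic bottom-level weights P(d_f|F,d).

```python
def hierarchical_term_prob(term, entity_fields, type_priors, field_importance):
    # P(t|theta_d) = sum_F P(F) * sum_{d_f in F} P(t|d_f,F) * P(d_f|F,d)
    p = 0.0
    for ftype, fields in entity_fields.items():          # top level: field types
        inner = sum(field_importance[ftype][f] * model.get(term, 0.0)
                    for f, model in fields.items())      # bottom level: fields
        p += type_priors[ftype] * inner
    return p
```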

Comparison of models

[Figure: term generation in the unstructured document model, the fielded document model, and the hierarchical document model, side by side]

Use case: Finding movies in IMDB data

<movie>
  <title>The Transporter</title>
  <year>2002</year>
  <language>English</language>
  <genre>Action</genre> <genre>Crime</genre> <genre>Thriller</genre>
  <country>USA</country>
  <actor>Jason Statham</actor> <actor>Matt Schulze</actor>
  <actor>François Berléand</actor> <actor>Ric Young</actor>
  <actress>Qi Shu</actress>
  <director>Louis Leterrier</director> <director>Corey Yuen</director>
  <writer>Luc Besson</writer> <writer>Robert Mark Kamen</writer>
  <producer>Luc Besson</producer> <cinematographer>Pierre Morel</cinematographer>
</movie>

Probabilistic Retrieval Model for Semistructured data (PRMS) [Kim et al., ECIR'09]
- Find which document field each query term may be associated with
- Extends [Ogilvie & Callan, SIGIR'03]: the fixed field weights µ_j in

P(t|θ_d) = ∑_{j=1}^{m} µ_j P(t|θ_{d_j})

are replaced by a mapping probability estimated for each query term:

P(t|θ_d) = ∑_{j=1}^{m} P(d_j|t) P(t|θ_{d_j})

PRMS: Mapping probability

Term likelihood, the probability of a query term occurring in a given field type, estimated from collection statistics:

P(t|C_j) = ∑_d n(t,d_j) / ∑_d |d_j|

Mapping probability, with prior field probability P(d_j) (the probability of mapping the query term to this field before observing collection statistics):

P(d_j|t) = P(t|d_j) P(d_j) / P(t),   where P(t) = ∑_{d_k} P(t|d_k) P(d_k)

and the term likelihood P(t|d_j) is taken to be the collection-level estimate P(t|C_j).

PRMS: Mapping example

Query: meg ryan war

t = meg              t = ryan             t = war
d_j      P(d_j|t)    d_j      P(d_j|t)    d_j       P(d_j|t)
cast     0.407       cast     0.601       genre     0.927
team     0.382       team     0.381       title     0.070
title    0.187       title    0.017       location  0.002
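A sketch of the mapping step, assuming per-field collection statistics P(t|C_j) are available as dictionaries. The statistics below are invented for illustration (chosen so that "ryan" maps mostly to cast and "war" to genre); they are not the numbers behind the table above.

```python
def mapping_probs(term, coll_field_models, field_priors):
    # P(d_j|t) ∝ P(t|C_j) * P(d_j), normalised over all fields
    scores = {f: coll_field_models[f].get(term, 0.0) * field_priors[f]
              for f in field_priors}
    total = sum(scores.values())
    return {f: s / total for f, s in scores.items()} if total else dict(field_priors)

# Toy collection statistics (illustrative values only)
coll_field_models = {
    "cast":     {"meg": 0.020, "ryan": 0.030},
    "team":     {"meg": 0.019, "ryan": 0.019},
    "title":    {"meg": 0.009, "ryan": 0.001, "war": 0.004},
    "genre":    {"war": 0.050},
    "location": {"war": 0.0001},
}
priors = {f: 1 / len(coll_field_models) for f in coll_field_models}  # uniform P(d_j)
print(mapping_probs("ryan", coll_field_models, priors))  # mass concentrates on cast
print(mapping_probs("war", coll_field_models, priors))   # mass concentrates on genre
```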

Part 2: Query structure

Structured query representations
- A query may have a semistructured representation, i.e., multiple fields
- Examples
  - TREC Genomics track: (1) gene name, (2) set of symbols
  - TREC Enterprise track, document search task: (1) keyword query, (2) example documents
  - INEX Entity track: (1) keyword query, (2) target categories, (3) example entities
  - TREC Entity track: (1) keyword query, (2) input entity, (3) target type

Use case: Enterprise document search (TREC 2007)
- Task: create an overview page on a given topic
- Find documents that discuss the topic in detail

Example topic:

CE-012
query:             cancer risk
narrative:         Focus on genome damage and therefore cancer risk in humans.
example documents: CSIRO145-10349105, CSIRO140-15970492, CSIRO139-07037024, CSIRO138-00801380

Retrieval model
- Maximizing the query log-likelihood provides the same ranking as minimising the KL-divergence KL(θ_q || θ_d)
- Assuming uniform document priors

P(d|q) ∝ P(d) ∏_{t∈q} P(t|θ_d)^{n(t,q)}

log P(d|q) ∝ log P(d) + ∑_{t∈q} P(t|θ_q) log P(t|θ_d)

where θ_q is the query model and θ_d is the document model.

Query modeling
- Aims
  - Expand the original query with additional terms
  - Assign the probability mass non-uniformly
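As a sketch of this scoring rule: the snippet below ranks by ∑_t P(t|θ_q) log P(t|θ_d), which is rank-equivalent to −KL(θ_q || θ_d) under uniform priors. Both models are assumed to be dictionaries of probabilities; the document model is assumed smoothed, so the floor value should rarely be needed.

```python
import math

def kl_score(query_model, doc_model, floor=1e-12):
    # Rank-equivalent to -KL(theta_q || theta_d) under uniform document priors:
    # sum over t of P(t|theta_q) * log P(t|theta_d)
    return sum(p_q * math.log(doc_model.get(t, floor))
               for t, p_q in query_model.items())
```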

Estimating the query model
- Baseline: maximum-likelihood estimate of the original query

  P(t|θ_q) = P(t|q) = n(t,q) / |q|

- Query expansion using relevance models [Lavrenko & Croft, SIGIR'01]

  P(t|θ_q) = (1 − λ) P(t|q̂) + λ P(t|q)

  where the expanded query model is based on term co-occurrence statistics:

  P(t|q̂) ≈ P(t, q_1, ..., q_k) / ∑_{t'} P(t', q_1, ..., q_k)

Sampling from examples [Balog et al., SIGIR'08]
- Build the expanded query model by sampling terms from a set S of sample (example) documents, keeping the top K terms:

  P(t|S) = ∑_{d∈S} P(t|d) P(d|S)          (term importance × document importance)

  P(t|q̂) = P(t|S) / ∑_{t'∈K} P(t'|S)

Importance of a sample document
- Uniform: P(d|S) = 1/|S|
  - All sample documents are equally important
- Query-biased: P(d|S) ∝ P(d|q)
  - Proportional to the document's relevance to the (original) query
- Inverse query-biased: P(d|S) ∝ 1 − P(d|q)
  - Rewards documents that bring in new aspects (not covered by the original query)

Estimating term importance
- Maximum-likelihood estimate: P(t|d) = n(t,d) / |d|
- Smoothed estimate: P(t|d) = P(t|θ_d) = (1 − λ) P(t|d) + λ P(t|C)
- Ranking function by [Ponte, 2000]: P(t|d) = s(t) / ∑_{t'} s(t'), where s(t) = log( P(t|d) / P(t|C) )
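A minimal sketch of the sampling procedure, under the simplest choices above: uniform document importance P(d|S) = 1/|S| and maximum-likelihood term estimates. Documents are assumed pre-tokenized; function and variable names are illustrative.

```python
from collections import Counter

def expanded_query_model(sample_docs, top_k=10):
    # P(t|S) = sum_{d in S} P(t|d) * P(d|S), with uniform P(d|S) = 1/|S|
    p_t_s = Counter()
    for doc in sample_docs:                       # each doc is a list of terms
        counts = Counter(doc)
        for term, n in counts.items():
            p_t_s[term] += (n / len(doc)) / len(sample_docs)
    top = dict(p_t_s.most_common(top_k))          # keep the top-K terms ...
    norm = sum(top.values())
    return {t: p / norm for t, p in top.items()}  # ... and renormalise: P(t|q_hat)

def query_model(orig_query, expansion, lam=0.5):
    # P(t|theta_q) = (1 - lambda) * P(t|q_hat) + lambda * P(t|q)
    ml = Counter(orig_query)
    terms = set(expansion) | set(ml)
    return {t: (1 - lam) * expansion.get(t, 0.0)
               + lam * ml[t] / len(orig_query) for t in terms}
```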

Use case: Entity retrieval in Wikipedia (INEX 2007-09)
- Given a query, return a ranked list of entities
- Entities are represented by their Wikipedia page
- Entity search: the topic definition includes target categories
- List search: the topic definition includes example entities

Example query

title:      Movies with eight or more Academy Awards
categories: best picture oscar, british films, american films

Using categories for retrieval
- As a separate document field
- Filtering
- Similarity between target and entity categories
  - Set similarity metrics
  - Content-based (concatenating category contents)
  - Lexical similarity of category names

Also related to categories
- Category expansion
  - Based on category structure
  - Using lexical similarity of category names
- Ontology-based expansion
  - Generalisation
- Automatic category assignment

Modeling terms and categories [Balog et al., TOIS'11]

P(e|q) ∝ P(q|e) P(e)

P(q|e) = (1 − λ) P(θ_q^T|θ_e^T) + λ P(θ_q^C|θ_e^C)

- Term-based representation: query model p(t|θ_q^T) and entity model p(t|θ_e^T), compared via KL(θ_q^T || θ_e^T)
- Category-based representation: query model p(c|θ_q^C) and entity model p(c|θ_e^C), compared via KL(θ_q^C || θ_e^C)
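A sketch of the two-component scoring. One simplification to flag: the model above mixes the two components as probabilities, while this sketch combines the two KL-based (log-space) components linearly, a common rank-oriented approximation. The models are assumed to be term → probability and category → probability dictionaries.

```python
import math

def kl_component(query_model, entity_model, floor=1e-12):
    # Rank-equivalent to -KL(theta_q || theta_e)
    return sum(p_q * math.log(entity_model.get(x, floor))
               for x, p_q in query_model.items())

def entity_score(q_terms, q_cats, e_terms, e_cats, lam=0.5):
    # (1 - lambda) * term-based component + lambda * category-based component
    return ((1 - lam) * kl_component(q_terms, e_terms)
            + lam * kl_component(q_cats, e_cats))
```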

Part 3: Contextual structures

Other types of structure
- Link structure
- Linguistic structure
- Social structures

Link structure
- Aim: a high number of inlinks from pages relevant to the topic, and not many incoming links from other pages
- Retrieve an initial set of documents, then rerank

P(d|q) ∝ P(q|d) P(d)

Document prior P(d): the probability of the document being relevant to any query

[Figure: hyperlink graph. Source: Wikipedia]

Document link degree [Kamps & Koolen, ECIR'08]

P(d) ∝ 1 + I_local(d) / (1 + I_global(d))

- Local indegree I_local(d): the number of incoming links from within the top-ranked documents retrieved for q
- Global indegree I_global(d): the number of incoming links from the entire collection

Relevance propagation [Tsikrika et al., INEX'06]
- Model a user (random surfer) that, after seeing the initial set of results:
  - Selects one document and reads its description
  - Follows links connecting entities and reads the descriptions of related entities
  - Repeats this N times

P_0(d) = P(q|d)

P_i(d) = P(q|d) P_{i−1}(d) + ∑_{d'→d} (1 − P(q|d')) P(d|d') P_{i−1}(d')

- The probability of staying at a node equals its relevance to the query
- Transition probabilities P(d|d'): outgoing links from d' to d, set to uniform

Relevance propagation [Tsikrika et al., INEX'06]
- Final score: a weighted sum of the probabilities at different steps

P(d) ∝ µ_0 P_0(d) + (1 − µ_0) ∑_{i=1}^{N} µ_i P_i(d)
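A minimal sketch of the propagation and the final combination, assuming p_qd maps document id → P(q|d) for the initial result set, out_links maps document id → list of link targets within that set, and the step weights µ_i are uniform. All names are illustrative.

```python
def relevance_propagation(p_qd, out_links, n_steps=3):
    # P_0(d) = P(q|d)
    # P_i(d) = P(q|d) * P_{i-1}(d)
    #        + sum over d' with d'->d of (1 - P(q|d')) * P(d|d') * P_{i-1}(d')
    p = dict(p_qd)
    steps = [p]
    for _ in range(n_steps):
        nxt = {d: p_qd[d] * p[d] for d in p_qd}      # stay with prob = relevance
        for src, targets in out_links.items():
            for dst in targets:                       # uniform transition P(d|d')
                if dst in nxt:
                    nxt[dst] += (1 - p_qd[src]) * p[src] / len(targets)
        p = nxt
        steps.append(p)
    return steps

def combined_score(steps, mu0=0.5):
    # P(d) ∝ mu_0 * P_0(d) + (1 - mu_0) * sum_i mu_i * P_i(d), uniform mu_i here
    n = len(steps) - 1
    return {d: mu0 * steps[0][d]
               + (1 - mu0) * sum(s[d] for s in steps[1:]) / n
            for d in steps[0]}
```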

[Figure: word relationship graph. Source: visuwords.com]

Translation Model [Berger & Lafferty, SIGIR'99]

P(q|d) = ∏_{t∈q} [ ∑_{w∈V} P(t|w) P(w|θ_d) ]^{n(t,q)}

- Translation model P(t|w): the probability that word w can be "semantically translated" to query word t
- Obtaining the translation model
  - Exploiting WordNet and word co-occurrences [Cao et al., SIGIR'05]
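A sketch of translation-based query likelihood, assuming the translation model is given as nested dictionaries (translations[t][w] = P(t|w)); terms without an entry fall back to self-translation with probability 1. The structure is illustrative, not the paper's implementation.

```python
from collections import Counter

def translation_likelihood(query_terms, doc_model, translations):
    # P(q|d) = prod over t in q of [ sum_w P(t|w) * P(w|theta_d) ]^n(t,q)
    score = 1.0
    for t, n_tq in Counter(query_terms).items():
        # translations[t] maps source words w to P(t|w); default: self-translation
        p_t = sum(p_tw * doc_model.get(w, 0.0)
                  for w, p_tw in translations.get(t, {t: 1.0}).items())
        score *= p_t ** n_tq
    return score
```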

Use case: Expert finding
- Given a keyword query, return a ranked list of people who are experts on the given topic
- Content-based methods can return the most knowledgeable persons
- However, when it comes to contacting an expert, social and physical proximity matter

[Figure: expert profile — 710326, Toine M. Bogers (also Toine Bogers, A. M. Bogers), PhD student, Faculty of Humanities, Department of Communication and Information Sciences, Room D 348, P.O. Box 90153, NL-5000 LE Tilburg, The Netherlands, +31 13 466 245, +31 13 466 289, http://ilk.uvt.nl/~toine, [email protected]; publication: "Design and Implementation of a University-wide Expert Search Engine", R. Liebregts and T. Bogers, 2009, Proceedings of the 31st European Conference on Information Retrieval]

User-oriented model for EF [Smirnova & Balog, ECIR'11]

S(e|u,q) = (1 − λ) K(e|u,q) + λ T(e|u)

- Knowledge gain K(e|u,q): the difference between the knowledge of the expert and that of the user on the query topic
- Contact time T(e|u): the distance between the user and the expert in a social graph

Social structures
- Organisational hierarchy
  [Figure: expert 710326 Toine M. Bogers linked upward through the organisation — Department (Communication and Information Sciences) → Faculty (Humanities) → University]
- Geographical information
  [Figure: expert 710326 Toine M. Bogers located in Room D 348 — Floor → Building → Campus]
- Co-authorship
  [Figure: experts connected through joint publications, e.g., "Design and Implementation of a University-wide Expert Search Engine", R. Liebregts and T. Bogers, 2009, Proceedings of the 31st European Conference on Information Retrieval]
- Combinations
  [Figure: the organisational (Department, Faculty) and geographical (Floor, Building, Campus) hierarchies combined around the same experts]

Summary

- Different types of structure
  - Document structure
  - Query structure
  - Contextual structures
- Probabilistic IR and statistical Language Models yield a principled framework for representing and exploiting these structures

Questions?

Contact | @krisztianbalog | krisztianbalog.com