Semistructured Data Search Preliminaries
Total Page:16
File Type:pdf, Size:1020Kb
The data landscape Semistructured Data Search Flat text HTML XML RDF Databases unstructured data semistructured data structured data Krisztian Balog University of Stavanger - Semistructured data - Lack of fixed, rigid schema - No separation between the data and the schema, self-describing structure (tags or other markers) Promise Winter School 2013 | Bressanone, Italy, February 2013 Motivation Semistructured data - Supporting users who cannot express their - Advantages need in structured query languages - The data is not constrained by a fixed schema - SQL, SPARQL, Inquery, etc. - Flexible (the schema can easily be changed) - Portable - Dealing with heterogeneity - Possible to view structured data as semistructured - Users are unaware of the schema of the data - No single schema to the data - Disadvantages - Queries are less efficient than in a constrained structure In this talk Incorporating structure - How to exploit the structure available in the data for retrieval purposes? 3 context - Different types of structure - Document, query, context - Working in a Language Modeling setting 1 document query 2 - Number of different tasks - Retrieving entire documents - I.e., no element-level retrieval - Textual document representation is readily available - No aggregation over multiple documents/sources Language Modeling - Rank documents d according to their likelihood Preliminaries of being relevant given a query q: P(d|q) P (q d)P (d) Language modeling P (d q)= | P (q d)P (d) | P (q) / | Query likelihood Document prior Probability that query q Probability of the document was “produced” by document d being relevant to any query P (q d)= P (t ✓ )n(t,q) | | d t q Y2 Language Modeling Language Modeling Query likelihood scoring Estimate a multinomial Smooth the distribution Number of times t appears in q probability distribution with one estimated from from the text the entire collection P (q d)= P (t ✓ )n(t,q) | | d t q Y2 Document language model Multinomial probability distribution Smoothing parameter over the vocabulary of terms P (t ✓ )=(1 λ)P (t d)+λP (t C) | d − | | Empirical Maximum Collection document model likelihood model estimates n(t, d) d n(t, d) d d P (t ✓d)=(1 λ)P (t d)+λP (t C) | | P d | | | − | | P Example Empirical document LM n(t, d) P (t d)= In the town where I was born, | d Lived a man who sailed to sea, | | And he told us of his life, 0.15 In the land of submarines, 0.12 So we sailed on to the sun, Till we found the sea green, 0.09 And we lived beneath the waves, In our yellow submarine, 0.06 We all live in yellow submarine, 0.03 yellow submarine, yellow submarine, We all live in yellow 0 submarine yellow we all live lived sailed sea beneath born found green he his i land life man our so submarines sun till told town us waves where who submarine, yellow submarine, yellow submarine. Alternatively... Scoring a query q = sea, submarine { } P (q d)=P (“sea” ✓ ) P (“submarine” ✓ ) | | d · | d Scoring a query Scoring a query q = sea, submarine q = sea, submarine { } { } 0.03602 0.04538 0.03602 0.12601 P (q d)=P (“sea” ✓ ) P (“submarine” ✓ ) P (q d)=P (“sea” ✓ ) P (“submarine” ✓ ) | | d · | d | | d · | d 0.9 0.04 0.1 0.0002 0.9 0.14 0.1 0.0001 (1 λ)P (“sea” d)+λP (“sea” C) (1 λ)P (“submarine” d)+λP (“submarine” C) − | | − | | t P(t|d) t P(t|C) t P(t|d) t P(t|C) submarine 0.14 submarine 0.0001 submarine 0.14 submarine 0.0001 sea 0.04 sea 0.0002 sea 0.04 sea 0.0002 ... ... ... ... In this part - Incorporate document structure into the document language model Part I - Represented as document fields Document structure P (d q) P (q d)P (d)=P (d) P (t ✓ )n(t,q) | / | | d t q Y2 Document language model Use case Web document retrieval Web document retrieval Unstructured representation PROMISE Winter School 2013 Bridging between Information Retrieval and Databases Bressanone, Italy 4 - 8 February 2013 The aim of the PROMISE Winter School 2013 on "Bridging between Information Retrieval and Databases" is to give participants a grounding in the core topics that constitute the multidisciplinary area of information access and retrieval to unstructured, semistructured, and structured information. The school is a week-long event consisting of guest lectures from invited speakers who are recognized experts in the field. The school is intended for PhD students, Masters students or senior researchers such as post-doctoral researchers form the fields of databases, information retrieval, and related fields. [...] Web document retrieval Web document retrieval HTML source Fielded representation based on HTML markup <html> title: Winter School 2013 <head> <title>Winter School 2013</title> meta: PROMISE, school, PhD, IR, DB, [...] <meta name="keywords" content="PROMISE, school, PhD, IR, DB, [...]" /> PROMISE Winter School 2013, [...] <meta name="description" content="PROMISE Winter School 2013, [...]" /> </head> headings: PROMISE Winter School 2013 <body> Bridging between Information Retrieval and Databases <h1>PROMISE Winter School 2013</h1> Bressanone, Italy 4 - 8 February 2013 <h2>Bridging between Information Retrieval and Databases</h2> <h3>Bressanone, Italy 4 - 8 February 2013</h3> body: The aim of the PROMISE Winter School 2013 on "Bridging between <p>The aim of the PROMISE Winter School 2013 on "Bridging between Information Retrieval and Databases" is to give participants a Information Retrieval and Databases" is to give participants a grounding grounding in the core topics that constitute the multidisciplinary in the core topics that constitute the multidisciplinary area of area of information access and retrieval to unstructured, information access and retrieval to unstructured, semistructured, and semistructured, and structured information. The school is a week- structured information. The school is a week-long event consisting of long event consisting of guest lectures from invited speakers who guest lectures from invited speakers who are recognized experts in the are recognized experts in the field. The school is intended for field. The school is intended for PhD students, Masters students or PhD students, Masters students or senior researchers such as post- senior researchers such as post-doctoral researchers form the fields of doctoral researchers form the fields of databases, information databases, information retrieval, and related fields. </p> retrieval, and related fields. [...] </body> </html> Fielded Language Models Field Language Model [Ogilvie & Callan, SIGIR’03] - Build a separate language model for each field Smoothing parameter - Take a linear combination of them P (t ✓ )=(1 λ )P (t d )+λ P (t C ) m | dj − j | j j | j P (t ✓d)= µjP (t ✓dj ) Empirical Maximum Collection | | likelihood j=1 field model field model X estimates Field language model n(t, dj) d n(t, dj) Smoothed with a collection model built d d from all document representations of the j P d j same type in the collection | | | | P Field weights m µj =1 j=1 X Fielded Language Models Example Parameter estimation - Smoothing parameter q = IR, winter, school - Dirichlet smoothing with avg. representation length { } fields = title, meta, headings, body { } - Field weights µ = 0.2, 0.1, 0.2, 0.5 - Heuristically (e.g., proportional to the length of text { } content in that field) P (q ✓ )=P (“IR” ✓ ) P (“winter” ✓ ) P (“school” ✓ ) - Empirically (using training queries) | d | d · | d · | d - Computationally intractable for more than a few fields P (“IR” ✓ )=0.2 P (“IR” ✓ ) | d · | dtitle +0.1 P (“IR” ✓ ) · | dmeta +0.2 P (“IR” ✓ ) · | dheadings +0.2 P (“IR” ✓ ) · | dbody Use case Use case Entity retrieval in RDF data Entity retrieval in RDF data dbpedia:Audi_A4 foaf:name Audi A4 rdfs:label Audi A4 rdfs:comment The Audi A4 is a compact executive car produced since late 1994 by the German car manufacturer Audi, a subsidiary of the Volkswagen Group. The A4 has been built [...] dbpprop:production 1994 2001 2005 2008 rdf:type dbpedia-owl:MeanOfTransportation dbpedia-owl:Automobile dbpedia-owl:manufacturer dbpedia:Audi dbpedia-owl:class dbpedia:Compact_executive_car owl:sameAs freebase:Audi A4 is dbpedia-owl:predecessor of dbpedia:Audi_A5 is dbpprop:similar of dbpedia:Cadillac_BLS Hierarchical Entity Model Two-level hierarchy [Neumayer et al., ECIR’12] - Number of possible fields is huge Name foaf:name Audi A4 rdfs:label Audi A4 - It is not possible to optimise their weights directly rdfs:comment The Audi A4 is a compact executive car produced since late 1994 by the German car manufacturer Audi, a subsidiary of the - Entities are sparse w.r.t. different fields Volkswagen Group. The A4 has been built [...] Attributes dbpprop:production 1994 - Most entities have only a handful of predicates 2001 2005 2008 - Organise fields into a 2-level hierarchy rdf:type dbpedia-owl:MeanOfTransportation dbpedia-owl:Automobile - Field types (4) on the top level Out-relations dbpedia-owl:manufacturer dbpedia:Audi dbpedia-owl:class dbpedia:Compact_executive_car - Individual fields of that type on the bottom level owl:sameAs freebase:Audi A4 In-relations is dbpedia-owl:predecessor of dbpedia:Audi_A5 - Estimate field weights is dbpprop:similar of dbpedia:Cadillac_BLS - Using training data for field types - Using heuristics for bottom-level types Hierarchical Entity Model Field generation [Neumayer et al., ECIR’12] - Uniform P (t ✓ )= P (t F, d)P (F d) | d | | - All fields of the same type are equally important F X Field type importance - Length Term importance Taken to be the same for all entities P (F d)=P (F ) - Proportional to field length (on the entity level) | - Average length P (t F, d)= P (t d ,F)P (d F, d) | | f f | - Proportional to field length (on the collection level) d F Xf 2 Term generation - Popularity Importance of a term is jointly determined by - Number of documents that have the given field the field it occurs as well as all fields of that Field type (smoothed with a coll.