
JOINT WORK WITH

Table Retrieval and Generation

Krisztian Balog and Shuo Zhang | Information Access & Interaction group, University of Stavanger | http://iai.group | @krisztianbalog @imsure318

SIGIR’18 workshop on Data Search (DATA:SEARCH’18) | Ann Arbor, Michigan, USA, July 2018

MOTIVATION: TABLES ARE EVERYWHERE

IN THIS TALK

• Three retrieval tasks, with tables as results:
  • Ad hoc table retrieval
  • Query-by-table
  • On-the-fly table generation

THE ANATOMY OF A RELATIONAL (ENTITY-FOCUSED) TABLE

Table caption: Formula 1 constructors' statistics 2016

Constructor    Engine     Country   Base      …
Ferrari        Ferrari    Italy     Italy
Force India    Mercedes   India     UK
Haas           Ferrari    US        US & UK
Manor          Mercedes   UK        UK
…

THE ANATOMY OF A RELATIONAL (ENTITY-FOCUSED) TABLE

• Table caption: "Formula 1 constructors' statistics 2016"
• Core column (subject column): the Constructor column, listing the entities the table is about
• Heading column labels (table schema): Constructor, Engine, Country, Base, …

We assume that these entities are recognized and disambiguated, i.e., linked to a knowledge base.

TASK

• Ad hoc table retrieval: given a keyword query as input, return a ranked list of tables from a table corpus
• Example: search for "Singapore"

AD HOC TABLE RETRIEVAL

Result 1: Singapore - Wikipedia, Economy Statistics (Recent Years)
https://en.wikipedia.org/wiki/Singapore
[Table with columns Year, GDP Nominal (Billion), GDP Nominal Per Capita, GNI Nominal (Billion), GNI Nominal Per Capita, GDP Real (Billion); rows for 2011-2013; "Show more (5 rows total)"]

Result 2: Singapore - Wikipedia, Language used most frequently at home
https://en.wikipedia.org/wiki/Singapore

Language   Color in Figure   Percent
English    Blue              36.9%
Mandarin   Yellow            34.9%
Malay                        10.7%
Show more (6 rows total)

S. Zhang and K. Balog. Ad Hoc Table Retrieval using Semantic Similarity. In: The Web Conference 2018 (WWW '18)

APPROACHES

• Unsupervised methods
  • Build a document-based representation for each table, then employ conventional document retrieval methods
• Supervised methods
  • Describe query-table pairs using a set of features, then employ supervised machine learning ("learning-to-rank")
• Contribution #1: new state of the art, using a rich set of features
• Contribution #2: new set of semantic matching features

UNSUPERVISED METHODS

• Single-field document representation
  • All table content, no structure
• Multi-field document representation
  • Separate document fields for the embedding document's title, section title, table caption, table body, and table headings

SUPERVISED METHODS: FEATURES

• Three groups of features
  • Query features
    • #query terms, query IDF scores
  • Table features
    • Table properties: #rows, #cols, #empty cells, etc.
    • Embedding document: link structure, number of tables, etc.
  • Query-table features
    • Query terms found in different table elements, LM score, etc.
    • Our novel semantic matching features

SEMANTIC MATCHING

• Main objective: go beyond term-based matching
• Three components:
  1. Content extraction
  2. Semantic representations
  3. Similarity measures

SEMANTIC MATCHING: 1. CONTENT EXTRACTION

• The "raw" content of a query/table is represented as a set of terms (q1…qn for the query, t1…tm for the table), which can be words or entities

SEMANTIC MATCHING: 2. SEMANTIC REPRESENTATIONS

• Each of the raw terms is mapped to a semantic vector representation
• Entity-based content extraction:
  • Query: top-k ranked entities from a knowledge base
  • Table: entities in the core table column; top-k ranked entities using the embedding document/section title as a query
• Bag-of-concepts (sparse discrete vectors)
  • Bag-of-entities: each vector element corresponds to an entity; ~t_i[j] is 1 if there exists a link between entities i and j in the KB
  • Bag-of-categories: each vector element corresponds to a Wikipedia category; ~t_i[j] is 1 if entity i is assigned to Wikipedia category j
• Embeddings (dense continuous vectors)
  • Word embeddings: Word2Vec (300 dimensions, trained on Google News)
  • Graph embeddings: RDF2vec (200 dimensions, trained on DBpedia)
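As a toy illustration (not the authors' code), the bag-of-entities representation can be sketched as follows; the entity vocabulary and KB links below are made-up assumptions:

```python
# Toy sketch of the bag-of-entities representation (illustrative only;
# the entity vocabulary and KB links are made-up assumptions).

# Each vector element corresponds to one entity in the vocabulary.
ENTITIES = ["Ferrari", "Mercedes", "Force_India", "Italy"]

# Hypothetical KB link structure: entity -> set of linked entities.
KB_LINKS = {
    "Ferrari": {"Italy", "Mercedes"},
    "Force_India": {"Mercedes"},
}

def bag_of_entities(entity):
    """Sparse binary vector: element j is 1 iff `entity` links to entity j in the KB."""
    links = KB_LINKS.get(entity, set())
    return [1 if e in links else 0 for e in ENTITIES]

print(bag_of_entities("Ferrari"))  # [0, 1, 0, 1]
```

A bag-of-categories vector would be built the same way, with the vocabulary being Wikipedia categories instead of entities.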

SEMANTIC MATCHING: 3. SIMILARITY MEASURES

• Early fusion: take the centroid of the query semantic vectors and the centroid of the table semantic vectors, and compute their cosine similarity
• Late fusion: compute all pairwise similarities between the query and table semantic vectors, then aggregate those pairwise similarity scores (sum, avg, or max)
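A minimal sketch of the two matching strategies (not the authors' implementation; the 2-d vectors are toy examples):

```python
# Early vs. late fusion over semantic vectors (illustrative toy example).
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def early_fusion(query_vecs, table_vecs):
    # One similarity score: cosine between the two centroids.
    return cosine(centroid(query_vecs), centroid(table_vecs))

def late_fusion(query_vecs, table_vecs, aggr=max):
    # All pairwise similarities, then aggregate (sum, avg, or max).
    sims = [cosine(q, t) for q in query_vecs for t in table_vecs]
    return aggr(sims)

q = [[1.0, 0.0], [0.0, 1.0]]  # two query-term vectors
t = [[1.0, 1.0]]              # one table-term vector
print(early_fusion(q, t))           # ~1.0
print(late_fusion(q, t, aggr=max))  # ~0.707
```

In the experiments, each (representation, strategy, aggregator) combination yields one semantic matching feature.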

EXPERIMENTAL SETUP

• Table corpus
  • WikiTables corpus1: 1.6M tables extracted from Wikipedia
• Knowledge base
  • DBpedia (2015-10): 4.6M entities with an English abstract

• Queries: sampled from two sources2,3
• Rank-based evaluation: NDCG@5, 10, 15, 20

Example queries:

QS-1               QS-2
video games        asian countries currency
us cities          laptops cpu
kings of africa    food calories
economy gdp        guitars manufacturer

1 Bhagavatula et al. TabEL: Entity Linking in Web Tables. In: ISWC '15. 2 Cafarella et al. Data Integration for the Relational Web. Proc. of VLDB Endow. (2009) 3 Venetis et al. Recovering Semantics of Tables on the Web. Proc. of VLDB Endow. (2011)

RELEVANCE ASSESSMENTS

• Collected via crowdsourcing
• Pooling to depth 20; 3120 query-table pairs in total
• Assessors are presented with the following scenario: "Imagine that your task is to create a new table on the query topic"
• A table is …
  • Non-relevant (0): if it is unclear what it is about or it is about a different topic
  • Relevant (1): if some cells or values could be used from it
  • Highly relevant (2): if large blocks or several values could be used from it

RESEARCH QUESTIONS

• RQ1: Can semantic matching improve retrieval performance?
• RQ2: Which of the semantic representations is the most effective?
• RQ3: Which of the similarity measures performs best?

RESULTS: RQ1 AND RQ2

                                NDCG@10   NDCG@20
Single-field document ranking   0.4344    0.5254
Multi-field document ranking    0.4860    0.5473
WebTable1                       0.2992    0.3726
WikiTable2                      0.4766    0.5206
LTR baseline                    0.5456    0.6031
STR (LTR + semantic matching)   0.6293    0.6825

• Can semantic matching improve retrieval performance? Yes. STR achieves substantial and significant improvements over LTR.
• Which of the semantic representations is the most effective? Bag-of-entities.

1 Cafarella et al. WebTables: Exploring the Power of Tables on the Web. Proc. of VLDB Endow. (2008) 2 Bhagavatula et al. Methods for Exploring and Mining Tables on Wikipedia. In: IDEA '13.

RESULTS: RQ3

• Which of the similarity measures performs best?
• Late-sum and Late-avg (but it also depends on the representation)

QUERY-BY-TABLE

ON-THE-FLY TABLE GENERATION

S. Zhang and K. Balog. On-the-fly Table Generation. In: 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '18)

TASK

• On-the-fly table generation: answer a free-text query with a relational table, where
  • the core column lists all relevant entities;
  • columns correspond to attributes of those entities;
  • cells contain the values of the corresponding entity attributes.

Example: search for "Video albums of Taylor Swift"

Title                                Released date   Label            Formats
CMT Crossroads: Taylor Swift and …   Jun 16, 2009    Big Machine      DVD
…                                    Oct 11, 2011    Shout! Factory   Blu-ray, DVD
World Tour-Live                      Nov 21, 2011    Big Machine      CD/Blu-ray, …
Live                                 Dec 20, 2015    Big Machine      Streaming

APPROACH

• Core column entity ranking and schema determination could potentially mutually reinforce each other.
• Three components: core column entity ranking (E), schema determination (S), value lookup (V)

ALGORITHM

• Iterative: the query (q) feeds both core column entity ranking (E) and schema determination (S); in each round, the current entities inform schema determination and the current schema informs entity ranking; finally, value lookup (V) fills in the cell values.

KNOWLEDGE BASE ENTRY

• Entity name
• Entity type
• Description (e_d)
• Property: value pairs (e_p)

CORE COLUMN ENTITY RANKING

In iteration t, entities are scored against the query using the schema from the previous iteration:

score_t(e, q) = Σ_i w_i φ_i(e, q, S^(t-1))

CORE COLUMN ENTITY RANKING: FEATURES

• Entity's relevance to the query, computed using language modeling

• Deep semantic matching (DRRM_TKS): a query-entity matching matrix (n × m) is passed through a dense layer; the top-k entries are fed through hidden layers to an output layer that produces a matching degree. For matching against the schema, s is the concatenation of all schema labels in S (⊕ is the string concatenation operator).
• Entity-schema compatibility, based on the compatibility matrix

  C_ij = 1 if matchKB(e_i, s_j) ∨ matchTC(e_i, s_j), and 0 otherwise

  giving the entity-schema compatibility score

  ESC(S, e_i) = (1/|S|) Σ_j C_ij
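The entity-schema compatibility (ESC) feature can be sketched as follows (illustrative only; the matchKB/matchTC predicates are stubbed with made-up toy data):

```python
# Sketch of the entity-schema compatibility (ESC) feature.
# Hypothetical (entity, schema label) matches in the KB / table corpus (TC).
KB_MATCHES = {("Ferrari", "engine"), ("Ferrari", "country")}
TC_MATCHES = {("Ferrari", "base")}

def match_kb(entity, label):
    return (entity, label) in KB_MATCHES

def match_tc(entity, label):
    return (entity, label) in TC_MATCHES

def esc(schema, entity):
    """ESC(S, e) = (1/|S|) * sum_j C_ij, where C_ij = 1 iff a KB or TC match exists."""
    c = [1 if match_kb(entity, s) or match_tc(entity, s) else 0 for s in schema]
    return sum(c) / len(schema)

print(esc(["engine", "country", "base", "founded"], "Ferrari"))  # 0.75
```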

SCHEMA DETERMINATION

In iteration t, schema labels are scored using the entities from the previous iteration:

score_t(s, q) = Σ_i w_i φ_i(s, q, E^(t-1))

SCHEMA DETERMINATION: FEATURES (QUERY-BASED)

• Table-based label likelihood:

  P(s|q) = Σ_{T ∈ 𝒯} P(s|T) P(T|q)

  where P(T|q) is the table's relevance to the query, and the schema label likelihood is

  P(s|T) = 1 if max_{s' ∈ T_S} dist(s, s') is above a threshold, and 0 otherwise
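A minimal sketch of the query-based label likelihood P(s|q) = Σ_T P(s|T) P(T|q); the retrieved tables and their relevance scores below are made-up assumptions, and P(s|T) is simplified to exact label membership:

```python
# Sketch of schema label scoring over retrieved tables (toy data).
# Hypothetical retrieved tables: name -> (schema labels, relevance P(T|q)).
TABLES = {
    "T1": ({"engine", "country"}, 0.6),
    "T2": ({"country", "base"}, 0.4),
}

def p_label_given_table(label, schema_labels):
    # Simplified binary variant: 1 if the label appears in the table's schema
    # (a soft variant would match labels via a distance threshold instead).
    return 1.0 if label in schema_labels else 0.0

def p_label_given_query(label):
    """P(s|q) = sum over tables of P(s|T) * P(T|q)."""
    return sum(p_label_given_table(label, labels) * rel
               for labels, rel in TABLES.values())

print(p_label_given_query("country"))  # 1.0
print(p_label_given_query("engine"))   # 0.6
```

Labels occurring in many highly relevant tables thus accumulate higher scores.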

SCHEMA DETERMINATION: FEATURES (ENTITY-ASSISTED)

• Entity-assisted label likelihood:

  P(s|q, E) = Σ_T P(s|T) P(T|q, E)

  where the schema label likelihood P(s|T) is as before, and the table's relevance to the query is now conditioned on the entities:

  P(T|q, E) ∝ P(T|E) P(T|q)

• Attribute retrieval:

  AR(s, E) = (1/|E|) Σ_{e ∈ E} [ match(s, e, T) + drel(d, e) + sh(s, e) + kb(s, e) ]

  • match(s, e, T): similarity between entity e and schema label s with respect to T
  • drel(d, e): relevance of the document containing the table to the query
  • sh(s, e): whether the #hits returned by a web search engine for the query "[s] of [e]" is above a threshold
  • kb(s, e): whether s is a property of e in the KB

Kopliku et al. Towards a Framework for Attribute Retrieval. In: CIKM '11.

VALUE LOOKUP

• A catalog of possible entity attribute-value pairs: ⟨entity, schema label, value, provenance⟩ quadruples (e_V), with values from the KB and from the table corpus (TC)
• Finding a cell's value is a lookup in that catalog:

  score(v, e, s, q) = max_{⟨s', v, p⟩ ∈ e_V, match(s, s')} conf(p, q)

  • conf(p, q): matching confidence; values from the KB take priority over values from the TC; for TC values, confidence is based on the corresponding table's relevance to the query
  • match(s, s'): soft string matching, e.g., "birthday" vs. "date of birth", "country" vs. "nationality"
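A minimal sketch of the lookup step (not the authors' implementation; the catalog entries, the soft matcher, and the confidence values are toy assumptions):

```python
# Sketch of value lookup over a catalog of <entity, label, value, provenance>
# quadruples (illustrative; catalog contents and conf() are assumptions).
CATALOG = {
    "Taylor Swift": [
        ("label", "Big Machine", "KB"),
        ("record label", "Big Machine Records", "TC:table_123"),
    ],
}

def soft_match(s, s_prime):
    # Stand-in for soft string matching ("birthday" vs. "date of birth").
    return s == s_prime or s in s_prime or s_prime in s

def conf(provenance, table_relevance=0.5):
    # KB takes priority over the table corpus; for TC values, confidence
    # would come from the corresponding table's relevance to the query.
    return 1.0 if provenance == "KB" else table_relevance

def lookup(entity, label):
    """Return the best-scoring value for the (entity, schema label) cell."""
    candidates = [(conf(p), v) for s, v, p in CATALOG.get(entity, [])
                  if soft_match(label, s)]
    return max(candidates)[1] if candidates else None

print(lookup("Taylor Swift", "label"))  # 'Big Machine'
```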

EXPERIMENTAL SETUP

• Table corpus
  • WikiTables corpus: 1.6M tables extracted from Wikipedia
• Knowledge base
  • DBpedia (2015-10): 4.6M entities with an English abstract
• Two query sets
• Rank-based metrics
  • NDCG for core column entity ranking and schema determination
  • MAP/MRR for value lookup

QUERY SET 1 (QS-1)

• List queries from the DBpedia-Entity v2 collection1 (119)
  • "all cars that are produced in Germany"
  • "permanent members of the UN Security Council"
  • "Airlines that currently use Boeing 747 planes"
• Core column entity ranking: highly relevant entities from the collection
• Schema determination: crowdsourcing, 3-point relevance scale, 7k query-label pairs
• Value lookup: crowdsourcing, sample of 25 queries, 14k cell values

QUERY SET 2 (QS-2)

• Entity-relationship queries from the RELink Query Collection2 (600)
• Queries are answered by entity tuples (pairs or triplets), i.e., each query is answered by a table with 2 or 3 columns (including the core entity column)
• Queries and relevance judgments are obtained automatically from Wikipedia lists that contain relational tables
• Human annotators were asked to formulate the corresponding information need as a natural language query
  • "find peaks above 6000m in the mountains of Peru"
  • "Which countries and cities have accredited armenian ambassadors?"
  • "Which anti-aircraft guns were used in ships during war periods and what country produced them?"

1 Hasibi et al. DBpedia-Entity v2: A Test Collection for Entity Search. In: SIGIR '17. 2 Saleiro et al. RELink: A Research Framework and Test Collection for Entity-Relationship Retrieval. In: SIGIR '17.

CORE COLUMN ENTITY RANKING (QUERY-BASED)

                 QS-1                QS-2
                 NDCG@5   NDCG@10    NDCG@5   NDCG@10
LM               0.2419   0.2591     0.0708   0.0823
DRRM_TKS (e_d)   0.2015   0.2028     0.0501   0.0540
DRRM_TKS (e_p)   0.1780   0.1808     0.1089   0.1083
Combined         0.2821   0.2834     0.0852   0.0920

CORE COLUMN ENTITY RANKING (SCHEMA-ASSISTED)

• Round #0: without schema information (query only)
• Rounds #1-#3: with automatic schema determination (top 10)
• Oracle: with ground truth schema

SCHEMA DETERMINATION (QUERY-BASED)

            QS-1                QS-2
            NDCG@5   NDCG@10    NDCG@5   NDCG@10
CP          0.0561   0.0675     0.1770   0.2092
DRRM_TKS    0.0380   0.0427     0.0920   0.1415
Combined    0.0786   0.0878     0.2310   0.2695

SCHEMA DETERMINATION (ENTITY-ASSISTED)

• Round #0: without entity information (query only)
• Rounds #1-#3: with automatic core column entity ranking (top 10)
• Oracle: with ground truth entities

RESULTS: VALUE LOOKUP

                 QS-1              QS-2
                 MAP      MRR      MAP      MRR
Knowledge base   0.7759   0.7990   0.0745   0.0745
Table corpus     0.1614   0.1746   0.9564   0.9564
Combined         0.9270   0.9427   0.9564   0.9564

EXAMPLE

"Towns in the Republic of Ireland in 2006 Census Records"

ANALYSIS

Round-by-round comparison: number of queries that improved (↑), decreased (↓), or stayed the same (−):

                        ↑    ↓    −      ↑    ↓    −
QS-1  Round #0 vs. #1   43   38   38     52    7   60
      Round #0 vs. #2   50   30   39     61    5   53
      Round #0 vs. #3   49   26   44     59    2   58
QS-2  Round #0 vs. #1  166   82  346    386   56  158
      Round #0 vs. #2  173   74  347    388   86  126
      Round #0 vs. #3  173   72  349    403  103   94

SUMMARY

• Answering queries with relational tables, summarizing entities and their attributes
  • Retrieving existing tables from a table corpus
  • Generating a table on the fly
• Future work
  • Moving from homogeneous Wikipedia tables to other types of tables (scientific tables, Web tables)
  • Value lookup with conflicting values; verifying cell values
  • Result snippets for table search results
  • …

QUESTIONS?

@krisztianbalog | krisztianbalog.com