Table Retrieval and Generation
Krisztian Balog (University of Stavanger, @krisztianbalog), joint work with Shuo Zhang (@imsure318)
Information Access & Interaction research group, http://iai.group
SIGIR'18 workshop on Data Search (DATA:SEARCH'18) | Ann Arbor, Michigan, USA, July 2018

TABLES ARE EVERYWHERE

MOTIVATION

IN THIS TALK
• Three retrieval tasks, with tables as results
  • Ad hoc table retrieval
  • Query-by-table
  • On-the-fly table generation

THE ANATOMY OF A RELATIONAL (ENTITY-FOCUSED) TABLE
• Table caption: "Formula 1 constructors' statistics 2016"
• Core column (subject column): Constructor
• Heading column labels (table schema): Constructor, Engine, Country, Base, …

Constructor   Engine    Country  Base
Ferrari       Ferrari   Italy    Italy
Force India   Mercedes  India    UK
Haas          Ferrari   US       US & UK
Manor         Mercedes  UK       UK
…

We assume that these entities are recognized and disambiguated, i.e., linked to a knowledge base.

AD HOC TABLE RETRIEVAL

TASK
• Ad hoc table retrieval: given a keyword query as input, return a ranked list of tables from a table corpus

Example results for the query "Singapore":

Singapore - Wikipedia, Economy Statistics (Recent Years)
https://en.wikipedia.org/wiki/Singapore
Year  GDP Nominal (Billion)  GDP Nominal Per Capita  GDP Real (Billion)  GNI Nominal (Billion)  GNI Nominal Per Capita
2011  S$346.353              S$66,816                S$342.371           S$338.452              S$65,292
2012  S$362.332              S$68,205                S$354.061           S$351.765              S$66,216
2013  S$378.200              S$70,047                S$324.592           S$366.618              S$67,902
(5 rows in total)

Singapore - Wikipedia, Language used most frequently at home
https://en.wikipedia.org/wiki/Singapore
Language  Color in Figure  Percent
English   Blue             36.9%
Mandarin  Yellow           34.9%
Malay     Red              10.7%
(6 rows in total)

S. Zhang and K. Balog. Ad Hoc Table Retrieval using Semantic Similarity. In: The Web Conference 2018 (WWW '18).

APPROACHES
• Unsupervised methods
• Supervised methods
• Contribution #1: new state-of-the-art, using a rich set of features
• Contribution #2: new set of semantic matching features

UNSUPERVISED METHODS
• Build a document-based representation for each table, then employ conventional document retrieval methods
• Single-field document representation
  • All table content, no structure
• Multi-field document representation
  • Separate document fields for the embedding document's title, section title, table caption, table body, and table headings

SUPERVISED METHODS
• Describe query-table pairs using a set of features, then employ supervised machine learning ("learning-to-rank")

FEATURES
• Three groups of features
  • Query features
    • #query terms, query IDF scores
  • Table features
    • Table properties: #rows, #cols, #empty cells, etc.
    • Embedding document: link structure, number of tables, etc.
  • Query-table features
    • Query terms found in different table elements, LM score, etc.
    • Our novel semantic matching features

SEMANTIC MATCHING
• Main objective: go beyond term-based matching
• The "raw" content of a query/table is represented as a set of terms, which can be words or entities
• Three components:
  1. Content extraction
  2. Semantic representations
  3. Similarity measures
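The three components can be sketched end to end as follows. This is a minimal illustration, not the paper's implementation: the toy embedding table stands in for the real Word2Vec/RDF2vec vectors, and all names (`extract_terms`, `EMBEDDINGS`, etc.) are invented for this sketch.

```python
import math

# 1. Content extraction: raw query/table content as a list of word terms.
def extract_terms(text):
    return text.lower().split()

# 2. Semantic representations: map each raw term to a vector.
#    A toy embedding table stands in for Word2Vec / RDF2vec here.
EMBEDDINGS = {
    "singapore": [1.0, 0.0, 0.2],
    "gdp":       [0.1, 1.0, 0.0],
    "economy":   [0.2, 0.9, 0.1],
}

def represent(terms):
    return [EMBEDDINGS[t] for t in terms if t in EMBEDDINGS]

# 3. Similarity measures.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def early_fusion(q_vecs, t_vecs):
    # Early fusion: cosine similarity between the two centroids.
    return cosine(centroid(q_vecs), centroid(t_vecs))

def late_fusion(q_vecs, t_vecs, aggregate=max):
    # Late fusion: aggregate all pairwise similarities (sum, avg, or max).
    return aggregate(cosine(q, t) for q in q_vecs for t in t_vecs)
```

Early fusion compares one aggregate vector per side; late fusion keeps every pairwise term similarity and only then aggregates, which is what gives rise to the Late-sum / Late-avg / Late-max variants compared later in the talk.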
SEMANTIC MATCHING — 1. CONTENT EXTRACTION
• The "raw" content of a query/table is represented as a set of terms (q1 … qn for the query, t1 … tm for the table), which can be words or entities
• Entity-based content:
  • Query: top-k ranked entities from a knowledge base
  • Table: entities in the core table column; top-k ranked entities using the embedding document/section title as a query

SEMANTIC MATCHING — 2. SEMANTIC REPRESENTATIONS
• Each of the raw terms qi / tj is mapped to a semantic vector representation ~qi / ~tj

SEMANTIC REPRESENTATIONS
• Bag-of-concepts (sparse discrete vectors)
  • Bag-of-entities
    • Each vector element corresponds to an entity
    • ~ti[j] is 1 if there exists a link between entities i and j in the KB
  • Bag-of-categories
    • Each vector element corresponds to a Wikipedia category
    • ~ti[j] is 1 if entity i is assigned to Wikipedia category j
• Embeddings (dense continuous vectors)
  • Word embeddings
    • Word2Vec (300 dimensions, trained on Google News)
  • Graph embeddings
    • RDF2vec (200 dimensions, trained on DBpedia)

SEMANTIC MATCHING — 3. SIMILARITY MEASURES
• Early fusion matching strategy: take the centroid of the query semantic vectors and the centroid of the table semantic vectors, then compute their cosine similarity
• Late fusion matching strategy: compute all pairwise similarities between the query and table semantic vectors, then aggregate those pairwise similarity scores (sum, avg, or max)

EXPERIMENTAL EVALUATION

EXPERIMENTAL SETUP
• Table corpus
  • WikiTables corpus [1]: 1.6M tables extracted from Wikipedia
• Knowledge base
  • DBpedia (2015-10): 4.6M entities with an English abstract
• Queries, sampled from two sources [2,3]
  • QS-1: video games, us cities, kings of africa, economy gdp, …
  • QS-2: asian countries currency, laptops cpu, food calories, guitars manufacturer, …
• Rank-based evaluation
  • nDCG@5, 10, 15, 20
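The nDCG@k metric used above can be sketched as follows. This sketch assumes linear gains over the 0/1/2 relevance labels; some implementations use exponential gains instead, so treat the exact gain function as an assumption.

```python
import math

def dcg_at_k(gains, k):
    # Discounted cumulative gain over the top-k results:
    # each gain is discounted by log2(rank + 1), with ranks starting at 1.
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    # Normalize by the DCG of the ideal (descending-gain) ordering.
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```

For example, `ndcg_at_k([2, 0, 1], k=3)` scores a ranking that puts a highly relevant table first but swaps a non-relevant one ahead of a relevant one.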
[1] Bhagavatula et al. TabEL: Entity Linking in Web Tables. In: ISWC '15.
[2] Cafarella et al. Data Integration for the Relational Web. Proc. of VLDB Endow. (2009)
[3] Venetis et al. Recovering Semantics of Tables on the Web. Proc. of VLDB Endow. (2011)

RELEVANCE ASSESSMENTS
• Collected via crowdsourcing
  • Pooling to depth 20; 3,120 query-table pairs in total
• Assessors are presented with the following scenario
  • "Imagine that your task is to create a new table on the query topic"
• A table is …
  • Non-relevant (0): if it is unclear what it is about or it is about a different topic
  • Relevant (1): if some cells or values could be used from it
  • Highly relevant (2): if large blocks or several values could be used from it

RESEARCH QUESTIONS
• RQ1: Can semantic matching improve retrieval performance?
• RQ2: Which of the semantic representations is the most effective?
• RQ3: Which of the similarity measures performs best?

RESULTS: RQ1
Method                          NDCG@10  NDCG@20
Single-field document ranking   0.4344   0.5254
Multi-field document ranking    0.4860   0.5473
WebTable [4]                    0.2992   0.3726
WikiTable [5]                   0.4766   0.5206
LTR baseline                    0.5456   0.6031
STR (LTR + semantic matching)   0.6293   0.6825

• Can semantic matching improve retrieval performance?
  • Yes. STR achieves substantial and significant improvements over LTR.

[4] Cafarella et al. WebTables: Exploring the Power of Tables on the Web. Proc. of VLDB Endow. (2008)
[5] Bhagavatula et al. Methods for Exploring and Mining Tables on Wikipedia. In: IDEA '13.

RESULTS: RQ2
• Which of the semantic representations is the most effective?
  • Bag-of-entities.

RESULTS: RQ3
• Which of the similarity measures performs best?
  • Late-sum and Late-avg (but it also depends on the representation)

FEATURE ANALYSIS

QUERY-BY-TABLE

ON-THE-FLY TABLE GENERATION
S. Zhang and K. Balog. On-the-fly Table Generation.
In: 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '18)

TASK
• On-the-fly table generation:
  • Answer a free-text query with a relational table, where
    • the core column lists all relevant entities;
    • columns correspond to attributes of those entities;
    • cells contain the values of the corresponding entity attributes.

Example: query "Video albums of Taylor Swift"
Title                               Released date  Label           Formats
CMT Crossroads: Taylor Swift and …  Jun 16, 2009   Big Machine     DVD
Journey to Fearless                 Oct 11, 2011   Shout! Factory  Blu-ray, DVD
Speak Now World Tour-Live           Nov 21, 2011   Big Machine     CD/Blu-ray, …
The 1989 World Tour Live            Dec 20, 2015   Big Machine     Streaming

APPROACH
• Core column entity ranking and schema determination could potentially mutually reinforce each other.
• Given a query q: core column entity ranking produces an entity list E, schema determination produces a schema S, and value lookup then fills in the cell values V.

ALGORITHM
• Iterate: at step t, core column entity ranking uses the schema from step t-1, and schema determination uses the entity ranking from step t-1; value lookup runs on the final E and S.

KNOWLEDGE BASE ENTRY
• Each knowledge base entry consists of an entity name, an entity type, a description, and a list of property-value pairs.

CORE COLUMN ENTITY RANKING
score_t(e, q) = Σ_i w_i φ_i(e, q, S^{t-1})
• The score of entity e for query q at iteration t is a weighted sum of features φ_i, which may depend on the schema S^{t-1} from the previous iteration.

CORE COLUMN ENTITY RANKING FEATURES
• Entity's relevance to the query, computed using language modeling
• A deep semantic matching feature: a matching matrix is built between the query and the entity against s, where s is the concatenation of all schema labels in S (⊕ is the string concatenation operator); the top-k entries of the matching matrix pass through dense and hidden layers to an output layer producing a matching degree
• Entity-schema compatibility
  • Compatibility matrix (n × m):
    C_ij = 1 if matchKB(e_i, s_j) ∨ matchTC(e_i, s_j), and 0 otherwise
  • Entity-schema compatibility score:
    ESC(S, e_i) = (1/|S|) Σ_j C_ij

SCHEMA DETERMINATION
score_t(s, q) = Σ_i w_i φ_i(s, q, E^{t-1})
• The score of schema label s at iteration t is a weighted sum of features φ_i, which may depend on the entity ranking E^{t-1} from the previous iteration.

SCHEMA DETERMINATION FEATURES
• Column population:
  P(s|q) = Σ_{T ∈ 𝒯} P(s|T) P(T|q)
  where P(T|q) is the table's relevance to the query and P(s|T) is the schema label likelihood:
  P(s|T) = 1 if max_{s' ∈ T_S} dist(s, s') ≥ γ, and 0 otherwise
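The entity-schema compatibility score ESC(S, e_i) is just the mean of row i of the binary compatibility matrix. A minimal sketch, with matchKB and matchTC passed in as predicates since their real implementations (knowledge-base and table-corpus lookups) are outside this snippet:

```python
def esc(entity, schema, match_kb, match_tc):
    # C_ij = 1 if matchKB(e_i, s_j) OR matchTC(e_i, s_j), else 0;
    # ESC(S, e_i) is the mean of row i of this compatibility matrix.
    row = [1 if match_kb(entity, label) or match_tc(entity, label) else 0
           for label in schema]
    return sum(row) / len(schema)
```

With toy predicates, an entity matching two of four schema labels scores 0.5:

```python
kb = lambda e, s: (e, s) in {("Ferrari", "Engine")}
tc = lambda e, s: (e, s) in {("Ferrari", "Country")}
esc("Ferrari", ["Engine", "Country", "Base", "Founded"], kb, tc)
```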
SCHEMA DETERMINATION FEATURES
• Entity-aware column population:
  P(s|q, E) = Σ_T P(s|T) P(T|q, E)
  where P(s|T) is the schema label likelihood as before, and the table's relevance now also conditions on the entities:
  P(T|q, E) ∝ P(T|E) P(T|q)
• Attribute retrieval (following Kopliku et al.):
  AR(s, E) = (1/|E|) Σ_{e ∈ E} [ match(s, e, T) + drel(d, e) + sh(s, e) + kb(s, e) ]
  where
  • match(s, e, T): similarity between entity e and schema label s with respect to table T
  • drel(d, e): relevance of the document containing the table
  • sh(s, e): whether the #hits returned by a web search engine for the query "[s] of [e]" is above a threshold
  • kb(s, e): whether s is a property of e in the KB
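The AR(s, E) feature averages a sum of component scores over the entity list. A sketch with the four components passed in as scoring functions, since their real implementations involve KB and web-search lookups not shown here (`ar_score` and the function signatures are illustrative names):

```python
def ar_score(label, entities, components):
    # AR(s, E) = (1/|E|) * sum over e in E of
    #   match(s, e, T) + drel(d, e) + sh(s, e) + kb(s, e).
    # `components` holds those scoring functions, each taking (label, entity).
    totals = [sum(f(label, e) for f in components) for e in entities]
    return sum(totals) / len(entities)
```

Passing the components as functions keeps the averaging logic separate from the expensive lookups, so each component can be stubbed or cached independently.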