Table Retrieval and Generation
Krisztian Balog | University of Stavanger | Information Access & Interaction (IAI) group
http://iai.group | @krisztianbalog
Joint work with Shuo Zhang (@imsure318)
SIGIR'18 workshop on Data Search (DATA:SEARCH'18) | Ann Arbor, Michigan, USA, July 2018

MOTIVATION
• Tables are everywhere

IN THIS TALK
• Three retrieval tasks, with tables as results
  • Ad hoc table retrieval
  • Query-by-table
  • On-the-fly table generation

THE ANATOMY OF A RELATIONAL (ENTITY-FOCUSED) TABLE
• Table caption: "Formula 1 constructors' statistics 2016"
• Heading column labels (table schema): Constructor, Engine, Country, Base, …
• Core column (subject column): the Constructor column

Constructor | Engine   | Country | Base    | …
Ferrari     | Ferrari  | Italy   | Italy   |
Force India | Mercedes | India   | UK      |
Haas        | Ferrari  | US      | US & UK |
Manor       | Mercedes | UK      | UK      |
…

• We assume that these entities are recognized and disambiguated, i.e., linked to a knowledge base

AD HOC TABLE RETRIEVAL
S. Zhang and K. Balog. Ad Hoc Table Retrieval using Semantic Similarity. In: The Web Conference 2018 (WWW '18)

TASK
• Ad hoc table retrieval: given a keyword query as input, return a ranked list of tables from a table corpus
• Example: the query "Singapore" returns results such as the following two tables from https://en.wikipedia.org/wiki/Singapore

Singapore - Wikipedia, Economy Statistics (Recent Years)
Year | GDP Nominal (Billion) | GDP Nominal Per Capita | GDP Real (Billion) | GNI Nominal (Billion) | GNI Nominal Per Capita
2011 | S$346.353 | S$66,816 | S$342.371 | S$338.452 | S$65,292
2012 | S$362.332 | S$68,205 | S$354.061 | S$351.765 | S$66,216
2013 | S$378.200 | S$70,047 | S$324.592 | S$366.618 | S$67,902
(5 rows total)

Singapore - Wikipedia, Language used most frequently at home
Language | Color in Figure | Percent
English  | Blue            | 36.9%
Mandarin | Yellow          | 34.9%
Malay    | Red             | 10.7%
(6 rows total)

APPROACHES
• Unsupervised methods
  • Build a document-based representation for each table, then employ conventional document retrieval methods
• Supervised methods
  • Describe query-table pairs using a set of features, then employ supervised machine learning ("learning-to-rank")
• Contribution #1: new state-of-the-art, using a rich set of features
• Contribution #2: new set of semantic matching features

UNSUPERVISED METHODS
• Single-field document representation
  • All table content, no structure
• Multi-field document representation
  • Separate document fields for the embedding document's title, section title, table caption, table body, and table headings

SUPERVISED METHODS: FEATURES
• Three groups of features
  • Query features
    • #query terms, query IDF scores
  • Table features
    • Table properties: #rows, #cols, #empty cells, etc.
    • Embedding document: link structure, number of tables, etc.
  • Query-table features
    • Query terms found in different table elements, LM score, etc.
• Our novel semantic matching features
• (A minimal code sketch of this feature-based ranking setup follows below.)
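The slides do not tie the supervised ("learning-to-rank") approach to a particular learner, so the following is only a minimal pointwise sketch of the idea: describe each query-table pair by a feature vector and fit a regressor on graded relevance labels. It assumes scikit-learn's RandomForestRegressor; the extract_features helper and the table dictionary layout are illustrative stand-ins for the feature groups listed above, not the paper's exact feature set.

```python
# Minimal pointwise learning-to-rank sketch for query-table pairs (illustrative only).
from sklearn.ensemble import RandomForestRegressor

def extract_features(query, table):
    """Hypothetical helper: one fixed-length feature vector per query-table pair."""
    q_terms = query.lower().split()
    return [
        len(q_terms),                                          # query feature: #query terms
        len(table["rows"]),                                    # table feature: #rows
        len(table["headings"]),                                # table feature: #cols
        sum(t in table["caption"].lower() for t in q_terms),   # query-table feature: caption overlap
    ]

def train_ranker(labeled_pairs):
    """labeled_pairs: iterable of (query, table, graded_relevance) triples."""
    X = [extract_features(q, t) for q, t, _ in labeled_pairs]
    y = [rel for _, _, rel in labeled_pairs]
    return RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

def rank_tables(model, query, tables):
    """Score each candidate table for the query and sort in decreasing order."""
    scores = model.predict([extract_features(query, t) for t in tables])
    return sorted(zip(tables, scores), key=lambda pair: -pair[1])
```

In the STR variant reported later, the semantic matching scores introduced on the next slides are added as extra features on top of such a baseline feature set.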
SEMANTIC MATCHING
• Main objective: go beyond term-based matching
• Three components:
  1. Content extraction
  2. Semantic representations
  3. Similarity measures

SEMANTIC MATCHING: 1. CONTENT EXTRACTION
• The "raw" content of a query/table is represented as a set of terms, which can be words or entities
• Query terms: q1, …, qn; table terms: t1, …, tm
• Entity-based content:
  • For the query: top-k ranked entities from a knowledge base
  • For the table: entities in the core table column; top-k ranked entities using the embedding document/section title as a query

SEMANTIC MATCHING: 2. SEMANTIC REPRESENTATIONS
• Each of the raw terms is mapped to a semantic vector representation
• q1, …, qn → ~q1, …, ~qn and t1, …, tm → ~t1, …, ~tm

SEMANTIC REPRESENTATIONS
• Bag-of-concepts (sparse discrete vectors)
  • Bag-of-entities
    • Each vector element corresponds to an entity
    • ~t_i[j] is 1 if there exists a link between entities i and j in the KB
  • Bag-of-categories
    • Each vector element corresponds to a Wikipedia category
    • ~t_i[j] is 1 if entity i is assigned to Wikipedia category j
• Embeddings (dense continuous vectors)
  • Word embeddings
    • Word2Vec (300 dimensions, trained on Google News)
  • Graph embeddings
    • RDF2vec (200 dimensions, trained on DBpedia)

SEMANTIC MATCHING: 3. SIMILARITY MEASURES
• Semantic matching is performed between the query vectors ~q1, …, ~qn and the table vectors ~t1, …, ~tm

SEMANTIC MATCHING: EARLY FUSION MATCHING STRATEGY
• Early: take the centroid of the query semantic vectors and the centroid of the table semantic vectors, and compute their cosine similarity

SEMANTIC MATCHING: LATE FUSION MATCHING STRATEGY
• Late: compute all pairwise similarities between the query and table semantic vectors, then aggregate those pairwise similarity scores (sum, avg, or max)
• (Both strategies are sketched in code below.)
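A minimal sketch of the two matching strategies using NumPy; the inputs are assumed to be the per-term semantic vectors from the previous slides (any of the bag-of-concepts or embedding variants), and the function names are illustrative rather than the paper's implementation.

```python
# Early vs. late fusion over semantic vectors (illustrative sketch; assumes NumPy).
import numpy as np

def cosine(a, b):
    """Cosine similarity, returning 0.0 for all-zero vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def early_fusion(query_vecs, table_vecs):
    """Early fusion: cosine similarity of the two centroids."""
    return cosine(np.mean(query_vecs, axis=0), np.mean(table_vecs, axis=0))

def late_fusion(query_vecs, table_vecs, aggregate="avg"):
    """Late fusion: aggregate all pairwise similarities (sum, avg, or max)."""
    sims = [cosine(q, t) for q in query_vecs for t in table_vecs]
    return float({"sum": np.sum, "avg": np.mean, "max": np.max}[aggregate](sims))

# Toy example with 3-dimensional vectors:
q_vecs = np.array([[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]])
t_vecs = np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.0]])
print(early_fusion(q_vecs, t_vecs), late_fusion(q_vecs, t_vecs, "max"))
```

Each combination of a semantic representation, a matching strategy, and an aggregator yields one semantic matching feature for the ranker.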
EXPERIMENTAL EVALUATION

EXPERIMENTAL SETUP
• Table corpus
  • WikiTables corpus [1]: 1.6M tables extracted from Wikipedia
• Knowledge base
  • DBpedia (2015-10): 4.6M entities with an English abstract
• Queries
  • Sampled from two sources [2,3]:

QS-1            | QS-2
video games     | asian countries currency
us cities       | laptops cpu
kings of africa | food calories
economy gdp     | guitars manufacturer

• Rank-based evaluation
  • nDCG@5, 10, 15, 20

[1] Bhagavatula et al. TabEL: Entity Linking in Web Tables. In: ISWC '15.
[2] Cafarella et al. Data Integration for the Relational Web. Proc. of VLDB Endow. (2009)
[3] Venetis et al. Recovering Semantics of Tables on the Web. Proc. of VLDB Endow. (2011)

RELEVANCE ASSESSMENTS
• Collected via crowdsourcing
• Pooling to depth 20; 3,120 query-table pairs in total
• Assessors are presented with the following scenario: "Imagine that your task is to create a new table on the query topic"
• A table is …
  • Non-relevant (0): if it is unclear what it is about, or it is about a different topic
  • Relevant (1): if some cells or values could be used from it
  • Highly relevant (2): if large blocks or several values could be used from it

RESEARCH QUESTIONS
• RQ1: Can semantic matching improve retrieval performance?
• RQ2: Which of the semantic representations is the most effective?
• RQ3: Which of the similarity measures performs best?

RESULTS: RQ1 AND RQ2

Method                        | NDCG@10 | NDCG@20
Single-field document ranking | 0.4344  | 0.5254
Multi-field document ranking  | 0.4860  | 0.5473
WebTable [4]                  | 0.2992  | 0.3726
WikiTable [5]                 | 0.4766  | 0.5206
LTR baseline                  | 0.5456  | 0.6031
STR (LTR + semantic matching) | 0.6293  | 0.6825

• RQ1: Can semantic matching improve retrieval performance?
  • Yes. STR achieves substantial and significant improvements over LTR.
• RQ2: Which of the semantic representations is the most effective?
  • Bag-of-entities.

[4] Cafarella et al. WebTables: Exploring the Power of Tables on the Web. Proc. of VLDB Endow. (2008)
[5] Bhagavatula et al. Methods for Exploring and Mining Tables on Wikipedia. In: IDEA '13.

RESULTS: RQ3
• Which of the similarity measures performs best?
  • Late-sum and Late-avg (but it also depends on the representation)

FEATURE ANALYSIS

QUERY-BY-TABLE
(currently under peer review)

ON-THE-FLY TABLE GENERATION
S. Zhang and K. Balog. On-the-fly Table Generation. In: 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '18)

TASK
• On-the-fly table generation: answer a free-text query with a relational table, where
  • the core column lists all relevant entities;
  • columns correspond to attributes of those entities;
  • cells contain the values of the corresponding entity attributes.
• Example: the query "Video albums of Taylor Swift" returns

Title                              | Released date | Label          | Formats
CMT Crossroads: Taylor Swift and … | Jun 16, 2009  | Big Machine    | DVD
Journey to Fearless                | Oct 11, 2011  | Shout! Factory | Blu-ray, DVD
Speak Now World Tour-Live          | Nov 21, 2011  | Big Machine    | CD/Blu-ray, …
The 1989 World Tour Live           | Dec 20, 2015  | Big Machine    | Streaming

APPROACH
• Core column entity ranking and schema determination could potentially mutually reinforce each other.
• Given the query q, core column entity ranking produces the entity ranking E and schema determination produces the schema S; each component feeds its output to the other.
• Value lookup then takes E and S and fills in the table values V.

ALGORITHM
• Iterative: at each step, core column entity ranking uses the schema from the previous iteration, and schema determination uses the entity ranking from the previous iteration; once E and S are fixed, value lookup produces V.
• (A code skeleton of this loop is given at the end of this section.)

KNOWLEDGE BASE ENTRY
• Entity name
• Entity type
• Description (e_d)
• Property: value pairs (e_p)

CORE COLUMN ENTITY RANKING
• At iteration t, candidate entities are scored as

    score_t(e, q) = \sum_i w_i \phi_i(e, q, S^{t-1})

  where S^{t-1} is the schema from the previous iteration and the \phi_i are features with weights w_i.

CORE COLUMN ENTITY RANKING FEATURES
• The entity's relevance to the query, computed using language modeling
• Deep semantic matching: query and entity are fed through a matching network (matching matrix → dense layer → top-k entries → hidden layers → output layer) that produces a matching degree; s denotes the concatenation of all schema labels in S, where ⊕ is the string concatenation operator
• Entity-schema compatibility
  • Compatibility matrix (n × m), with rows e_i and columns s_j:

        C_{ij} = \begin{cases} 1, & \text{if } \mathrm{matchKB}(e_i, s_j) \lor \mathrm{matchTC}(e_i, s_j) \\ 0, & \text{otherwise} \end{cases}

  • Entity-schema compatibility score:

        ESC(S, e_i) = \frac{1}{|S|} \sum_j C_{ij}

SCHEMA DETERMINATION
• At iteration t, schema labels are scored as

    score_t(s, q) = \sum_i w_i \phi_i(s, q, E^{t-1})

  where E^{t-1} is the entity ranking from the previous iteration.

SCHEMA DETERMINATION FEATURES
• Label likelihood given the query:

    P(s|q) = \sum_{T \in \mathcal{T}} P(s|T) P(T|q)

  where P(T|q) is the table's relevance to the query and P(s|T) is the schema label likelihood:

    P(s|T) = \begin{cases} 1, & \text{if } \max_{s' \in T_S} \mathrm{dist}(s, s') \geq \gamma \\ 0, & \text{otherwise} \end{cases}

• Label likelihood given the query and the entities:

    P(s|q, E) = \sum_T P(s|T) P(T|q, E),  with  P(T|q, E) \propto P(T|E) P(T|q)

• Attribute retrieval (Kopliku et al.):

    AR(s, E) = \frac{1}{|E|} \sum_{e \in E} \big( \mathrm{match}(s, e, T) + \mathrm{drel}(d, e) + \mathrm{sh}(s, e) + \mathrm{kb}(s, e) \big)

  where match(s, e, T) is the similarity between entity e and schema label s with respect to table T, drel(d, e) is the relevance of the document containing the table, sh(s, e) reflects the number of hits returned by a web search engine for the query "[s] of [e]" (above a threshold), and kb(s, e) indicates whether s is a property of e in the KB.
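To make the alternating structure concrete, here is a minimal skeleton of the loop, assuming hypothetical score_entity, score_schema, and lookup_value functions that stand in for the feature-weighted scores and the value lookup step described above; it illustrates the iteration order (each component uses the other's output from the previous round), not the paper's actual implementation.

```python
# Skeleton of the iterative table generation loop (illustrative only).
# score_entity, score_schema, and lookup_value are hypothetical stand-ins for the
# feature-based scoring functions and the value lookup step sketched on the slides.

def generate_table(query, candidate_entities, candidate_labels,
                   score_entity, score_schema, lookup_value,
                   k_entities=10, k_labels=5, iterations=3):
    E, S = [], []  # E^0 and S^0: empty entity ranking and schema
    for _ in range(iterations):
        # Core column entity ranking: score_t(e, q) depends on the previous schema S^{t-1}.
        new_E = sorted(candidate_entities,
                       key=lambda e: score_entity(e, query, S), reverse=True)[:k_entities]
        # Schema determination: score_t(s, q) depends on the previous entity ranking E^{t-1}.
        new_S = sorted(candidate_labels,
                       key=lambda s: score_schema(s, query, E), reverse=True)[:k_labels]
        E, S = new_E, new_S
    # Value lookup: fill each table cell for the selected entities and schema labels.
    V = [[lookup_value(e, s) for s in S] for e in E]
    return E, S, V
```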