Table Search, Generation and Completion
Total Page:16
File Type:pdf, Size:1020Kb
Table Search, Generation and Completion by Shuo Zhang A dissertation submitted in fulfilment of the requirements for the degree of PHILOSOPHIAE DOCTOR (PhD) Faculty of Science and Technology Department of Electrical Engineering and Computer Science July 1, 2019 University of Stavanger N-4036 Stavanger NORWAY www.uis.no c Shuo Zhang, 2019 All rights reserved. ISBN 978-82-7644-873-3 ISSN 1890-1387 PhD Thesis UiS no. 478 Acknowledgement Pursuing a Ph.D. was a sudden decision I made by flipping a coin when I was finishing my master studies. To this day, I still have mixed feelings about that decision–to start on a completely new subject. The resulting lack of confidence and experience in computer science contributed to my uncer- tainty of how to handle the ups and downs of doctoral life properly. If I knew then the challenges that lay before me, I would surely have said no to this opportunity. Nevertheless, in the end I have to admit it has been worthwhile. This endeavour has been a groping forward, it has been a spir- itual fast, and it has taught me when to stop struggling and when to push forward. The experience has equipped me with a new creed of life: Wherever you find yourself, accept life as it is: Enjoy the peaks, enjoy the lows, and enjoy stagnation. I was like a blank page at the beginning, without much background in com- puter science. Today you can easily see this page is filled with the words and colors imparted to me by my supervisor Prof. Dr. Krisztian Balog. He taught me without reservation all the skills I needed to conduct research, in- cluding developing ideas, writing papers, delivering presentations, to name a few. I also want to acknowledge Krisztian’s real effect on my life beyond a merely formal description. Sometimes I teased him for being slow and steady as a nineteenth-century steam train, while I must be a runaway horse in his eyes. But I firmly believe that differing voices bring you closer to the truth, and I appreciate from my heart all the debates we had. In the end, “iron shapes iron”. Since last autumn, I was able to take an internship at Bloomberg L.P. and research visit at Carnegie Mellon University in the USA. My four-month internship at Bloomberg was a great experience, and I met many excellent researchers there. I would like to thank Edgar Meij, as well as my other col- leagues there, including Ridho Reinanda, Miles Osborne, Giorgio Stefanoni, Oana, Antonia, Nimesh, and all the rest, for giving me such a warm wel- come and much help. I was supervised by Prof. Dr. Jamie Callan when I visited CMU. I am grateful to Jamie for generously sharing his experience and wisdom, about both papers and academic careers. I would also thank Zhuyun Dai for her good company while I was at CMU. I would also like to extend my thanks to my co-supervisor, Prof. Dr. Heri Ramampiaro from NTNU, for kindly hosting our visits in Trondheim, and the head of our department, Dr. Tom Ryen, for always being supportive. I am grateful and honored to have Prof. Dr. Brian Davison, Prof. Dr. Min Zhang, and Prof. Dr. Hein Meling serving as my committee members. They reviewed my dissertation and provided me with excellent suggestions. i IAI witnesses our growth. I would like to thank the people who were in this team in the past years. Dario Gigliotti, Faegheh Hasibi, Heng Ding, Jan R. Benetka and Trond Linjordet. It is lucky to know someone to share similar views with in this adventure. Thanks to all the local friends in Stavanger. We did so many crazy things to- gether, which will be my long lasting memories. Most importantly, I convey my heartfelt appreciation to my family. Thanks for all the support, always giving me the freedom to follow my heart and be a wayward kid. Lastly, I want to thank myself. You are such a lucky guy to get so much love from the others. Thanks for your tireless running and being kind. If you can, please continue. ii Contents Contents iii 1 Introduction1 1.1 Research Questions......................... 2 1.2 Main Contributions......................... 4 1.3 Organization of the Thesis..................... 5 1.4 Origins................................ 7 2 Background 11 2.1 Table Types and Corpora ..................... 11 2.1.1 The Anatomy of a Table.................. 11 2.1.2 Types of Tables....................... 13 2.1.3 Table Corpora........................ 16 2.2 Table Extraction........................... 18 2.2.1 Web Table Extraction.................... 18 2.2.2 Wikipedia Table Extraction................ 23 2.2.3 Spreadsheet Extraction................... 23 2.3 Table Interpretation......................... 24 2.3.1 Column Type Identification................ 24 2.3.2 Entity Linking........................ 28 2.3.3 Relation Extraction..................... 31 2.3.4 Other Tasks......................... 34 2.4 Table Search............................. 35 2.4.1 Keyword Query Search .................. 35 2.4.2 Search by Table....................... 37 2.5 Question Answering on Tables.................. 39 2.5.1 Using a Single Table.................... 40 2.5.2 Using a Collection of Tables................ 41 2.6 Knowledge Base Augmentation.................. 42 2.6.1 Tables for Knowledge Exploration............ 42 iii Contents 2.6.2 Knowledge Base Augmentation and Construction . 43 2.7 Table Augmentation ........................ 44 2.7.1 Row Extension ....................... 44 2.7.2 Column Extension..................... 46 2.7.3 Data Completion...................... 49 2.8 Summary and Conclusions .................... 50 3 Keyword Table Search 53 3.1 Ad Hoc Table Retrieval....................... 53 3.1.1 Problem Statement..................... 54 3.1.2 Unsupervised Ranking .................. 54 3.1.3 Supervised Ranking.................... 56 3.2 Semantic Matching......................... 57 3.2.1 Content Extraction..................... 58 3.2.2 Semantic Representations................. 60 3.2.3 Similarity Measures .................... 61 3.3 Test Collection............................ 62 3.3.1 Table Corpus ........................ 62 3.3.2 Queries............................ 63 3.3.3 Relevance Assessments .................. 63 3.4 Evaluation.............................. 64 3.4.1 Research Questions..................... 64 3.4.2 Experimental Setup .................... 64 3.4.3 Baselines........................... 65 3.4.4 Experimental Results.................... 66 3.4.5 Analysis........................... 67 3.5 Summary and Conclusions .................... 70 4 Query-by-Table 71 4.1 Task Definition and Baseline Methods.............. 71 4.1.1 Keyword-based Search................... 73 4.1.2 Mannheim Search Join Engine.............. 73 4.1.3 Schema Complement ................... 74 4.1.4 Entity Complement .................... 74 4.1.5 Nguyen et al......................... 75 4.1.6 InfoGather.......................... 75 4.2 Using Hand-crafted Features ................... 76 4.2.1 Retrieval Framework.................... 76 4.2.2 Table Similarity Features ................. 76 4.2.3 Table Features........................ 76 4.3 The CRAB Approach........................ 77 4.3.1 Element-Level Table Matching Framework . 77 4.3.2 CRAB ............................ 80 4.4 Test collection............................ 82 iv Contents 4.4.1 Table Corpus ........................ 82 4.4.2 Query Tables and Relevance Assessments . 83 4.4.3 Evaluation Metrics..................... 83 4.5 Evaluation.............................. 84 4.5.1 Experimental Setup .................... 84 4.5.2 Baselines........................... 84 4.5.3 Results............................ 84 4.5.4 Further Analysis ...................... 88 4.6 Summary and Conclusions .................... 90 5 Table Completion 91 5.1 Problem Statement......................... 92 5.2 Populating Rows .......................... 94 5.2.1 Candidate Selection .................... 94 5.2.2 Ranking Entities ...................... 95 5.2.3 Entity Similarity ...................... 95 5.2.4 Column Labels Likelihood ................ 97 5.2.5 Caption Likelihood..................... 98 5.3 Populating Columns........................ 99 5.3.1 Candidate Selection .................... 99 5.3.2 Ranking Column Labels.................. 99 5.3.3 Label Likelihood...................... 100 5.3.4 Table Relevance Estimation . 100 5.4 Experimental design ........................ 101 5.4.1 Data ............................. 101 5.4.2 Entity-Focused Tables ................... 101 5.4.3 Simulation Process..................... 102 5.4.4 Matching Column Labels . 103 5.4.5 Evaluation Metrics..................... 103 5.5 Evaluation of Row Population................... 104 5.5.1 Candidate Selection .................... 104 5.5.2 Entity Ranking ....................... 105 5.5.3 Analysis........................... 107 5.6 Evaluation of Column Population . 108 5.6.1 Candidate Selection .................... 109 5.6.2 Column Label Ranking .................. 109 5.6.3 Analysis........................... 110 5.7 Summary and Conclusions .................... 111 6 Table Generation 113 6.1 Overview............................... 114 6.1.1 Iterative Table Generation Algorithm . 114 6.1.2 Data Sources ........................ 116 6.2 Core column entity ranking.................... 116 v Contents 6.2.1 Query-based Entity Ranking . 117 6.2.2 Schema-assisted Entity Ranking . 119 6.3 Schema determination....................... 120 6.3.1 Query-based Schema Determination . 120 6.3.2 Entity-assisted Schema Determination . 121 6.4 Value Lookup............................ 123 6.5 Experimental Setup......................... 124 6.5.1