Information Extraction on Para-Relational Data
Total Page:16
File Type:pdf, Size:1020Kb
Information Extraction on Para-Relational Data by Zhe Chen A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Science and Engineering) in The University of Michigan 2016 Doctoral Committee: Assistant Professor Michael J. Cafarella, Chair Associate Professor Eytan Adar Professor Hosagrahar V. Jagadish Associate Professor Qiaozhu Mei Assistant Professor Barzan Mozafari c Zhe Chen 2016 All Rights Reserved To My Parents ii ACKNOWLEDGEMENTS The PhD journey is full of ups and downs, and I would not be able to complete it on my own. I owe many thanks to my advisor, Michael Cafarella. I learned from him to think big and creatively. I am deeply influenced by his research style to build real, innovative systems that have impact, instead of sticking to small papers. I also learned from him to be a good writer and storyteller. English is my second language, and I did not know how to write a paper on my own when I first came to Michigan. Mike was very patient to help me with all kinds of paper re-organizing and re-writing, especially during my first several years. Without Mike's continuous guidance and help throughout my PhD studies, I would not be able to finish this thesis. So Mike | thank you. I would like to express my thanks to all my committee members. Professor Eytan Adar is an amazing person to work with. He is hands-on and knows everything. I have wanted to be a person like Professor H. V. Jagadish. He gives so much care to all the students and is always so nice to everybody. Professor Qiaozhu Mei is also very nice and was willing to give me advice on being an international student in the US. Professor Barzan Mozafari is very helpful and gave many insightful suggestions. I am so grateful that I can work with all these fantastic people at the University of Michigan. Thanks to the database group for their valuable discussions and help: Michael Anderson, Dolan Antenucci, Matthew Burgess, Daniel Fabbri, Lujun Fang, Arnab Nandi, Fei Li, Yunyao Li, Bin Liu, Yongjoo Park, Li Qian, Manish Singh, Jing Zhang, and others. Thanks to all the fellow students in the School of Information: iii Wei Ai, Xin Rong, Jian Tang, Shiyan Yan, Yue Wang, and many others. The people at Tableau are my friends and mentors, and they graciously allowed me to publish the research work and put it in the dissertation. Thanks to Jock Mackinlay, Sasha Dadiomov, Richard Wesley, Daniel Cory, Gang Xiao, and many others for all kinds of support and keen insights. By working with them, I learned to use research to impact the real product. I also want to thank my mentors and friends in my undergraduate studies. The internship in MSRA working with Zaiqing Nie changed my life: I decided to pursue a PhD degree at that time. Zaiqing's research style also influenced my way of critical thinking. My fellow students at Renmin University, Bolin Ding and Min Xie, were willing to help, even before I formally met them. My roommates and friends in Ann Arbor are my family in the United States. I moved to live with Lei Lei during the darkest time of my PhD studies, and I had not met her before. She welcomed me with such a sweet home. I am grateful that I can spend the last year of my PhD living with Jing Zhang and Shuyi Zhang, and I will not forget the times when we were roommates. I will also not forget the coffee time with Yue Liu when we were on fire discussing the interview questions. Thanks to Cheng Li for all the happiness that we shared and Zhe Zhao for all the hot pot parties. Lastly, I would like to express my deepest love to my parents and my family. The PhD study is rewarding, but full of all kinds of obstacles. This happy ending would not be possible without their support and comfort along the way. I have never imagined that I would begin my postgraduate journey with a job in a hedge fund company, but I am sure they will all be proud of me. iv TABLE OF CONTENTS DEDICATION :::::::::::::::::::::::::::::::::: ii ACKNOWLEDGEMENTS :::::::::::::::::::::::::: iii LIST OF FIGURES ::::::::::::::::::::::::::::::: viii LIST OF TABLES :::::::::::::::::::::::::::::::: xi ABSTRACT ::::::::::::::::::::::::::::::::::: xiii CHAPTER I. Introduction ..............................1 1.1 Para-relational Data . .1 1.1.1 Spreadsheets . .3 1.1.2 Dictionaries from Webpages . .5 1.1.3 Diagrams . .6 1.2 Design Criteria and Challenges . .8 1.3 Approaches and Systems . 12 1.3.1 Senbazuru: Extraction on Spreadsheet Structures . 12 1.3.2 Anthias: Extraction on Spreadsheet Properties . 14 1.3.3 Lyretail: Extraction on Dictionaries from Webpages 16 1.3.4 DiagramFlyer: Extraction on Diagrams from PDFs 17 1.4 Scientific Contributions . 18 1.5 Outline of the Dissertation . 19 II. Research Background ......................... 21 2.1 Information Extraction . 21 2.1.1 Domain-dependent Extraction . 22 2.1.2 Domain-independent Extraction . 22 2.2 Machine Learning . 23 2.2.1 Basic Classification Algorithms . 24 v 2.2.2 Joint Inference Algorithms . 25 2.2.3 Active Learning . 26 2.3 Data Management . 27 2.3.1 Relational Data Management . 27 2.3.2 Semi-structured Data Management . 28 III. Senbazuru: Extraction on Spreadsheet Structure ....... 30 3.1 Problem Overview . 30 3.2 Preliminary . 32 3.2.1 Terminology . 33 3.2.2 Data Sources . 34 3.2.3 Web Spreadsheet Statistics . 34 3.3 Spreadsheet Structure Extraction . 37 3.3.1 System Pipeline Overview . 38 3.3.2 Data Frame Extraction . 39 3.3.3 Hierarchy Extraction . 40 3.4 Experiments . 57 3.4.1 Data Frame Extraction . 57 3.4.2 Hierarchy Extraction . 58 3.5 System Demonstration . 67 3.5.1 User Interface . 67 3.5.2 A Walk-through Example . 69 3.6 Related Work . 70 3.7 Conclusion and Future Work . 71 IV. Anthias: Extraction on Spreadsheet Properties ........ 73 4.1 Problem Overview . 73 4.2 Preliminary . 75 4.2.1 Data Sources . 77 4.2.2 Spreadsheet Properties . 77 4.2.3 Property Examples . 79 4.3 Spreadsheet Property Extraction . 81 4.3.1 The Iterative Learning Framework . 81 4.3.2 User Work . 82 4.3.3 Iterative Learning Algorithms . 87 4.4 Experiments . 90 4.4.1 Property Extraction . 91 4.4.2 Large-scale Spreadsheets Study . 96 4.5 Related Work . 99 4.6 Conclusion and Future Work . 100 V. Lyretail: Extraction on Dictionaries from Webpages ..... 102 vi 5.1 Problem Overview . 102 5.2 Preliminary . 104 5.2.1 System Framework . 104 5.2.2 Data Sources . 105 5.3 Dictionary Extraction from Webpages . 107 5.3.1 Two Distinct Sets of Features . 107 5.3.2 Training Data Construction . 109 5.3.3 Implementing Page-specific Extractors . 112 5.3.4 Dictionary Aggregator . 115 5.4 Experiments . 116 5.4.1 Page-specific Extraction . 118 5.4.2 Dictionary Generation . 123 5.4.3 System Configuration . 124 5.5 Related Work . 126 5.6 Conclusions and Future Work . 127 VI. DiagramFlyer: Extraction on Diagrams from PDFs ...... 131 6.1 Problem Overview . 131 6.2 System Framework . 133 6.2.1 Diagram Metadata Extraction . 133 6.2.2 Query Interface and Language . 136 6.2.3 Software Architecture . 138 6.3 Demonstration . 142 6.4 Conclusions and Future Work . 142 VII. Conclusion and Future Work .................... 144 7.1 Contributions . 144 7.1.1 Senbazuru Contributions . 145 7.1.2 Anthias Contributions . 146 7.1.3 Lyretail Contributions . 146 7.1.4 DiagramFlyer Contributions . 147 7.2 Future Work . 147 BIBLIOGRAPHY :::::::::::::::::::::::::::::::: 149 vii LIST OF FIGURES Figure 1.1 An example of XML. .2 1.2 A spreadsheet about smoking rates, from the Statistical Abstract of the United States.............................4 1.3 The three semantic components of a data frame spreadsheet. .5 1.4 A list of camera manufacturers from a webpage (with corresponding HTML source code) can be represented in the relational format. .6 1.5 A diagram contains several characteristic regions of text: Title, x- label, y-label, legend, and so on. The diagram can be represented in a relational format. .7 1.6 Our user interface for repairing mappings. 14 1.7 A spreadsheet about population statistics, from the Statistical Ab- stract of the United States. 15 1.8 The ideal relational tables for the spreadsheet example shown in Fig- ure 1.7. 15 3.1 The three semantic components of a data frame spreadsheet. 31 3.2 A spreadsheet about smoking rates, from the Statistical Abstract of the United States. 32 3.3 The top 10 domains in our web spreadsheet corpus. h-top and h-left are percentages of spreadsheets with a hierarchical top or left region. 35 3.4 The distribution of web spreadsheets. 36 viii 3.5 The system pipeline for Senbazuru to process a single spreadsheet. 37 3.6 An example of the hierarchical top annotations in spreadsheets. 41 3.7 Our user interface for repairing mappings. 43 3.8 A sample of ParentChild variables. 44 3.9 An example of stylistic affinity shown in (a) and metadata affinity shown in (b). 48 3.10 Interaction cycle for interactive repair. 49 3.11 Performance for automatic extractor using different amounts of train- ing data. 61 3.12 Performance for automatic extractor on different domains in WEB. 61 3.13 The normalized repair number for interactive repair on SAUS and WEB test sets. 63 3.14 The normalized repair number for four interactive repair configura- tions. 65 3.15 The normalized repair number required by different configurations of metadata links.