QUERY-DRIVEN TEXT ANALYTICS FOR KNOWLEDGE EXTRACTION, RESOLUTION, AND INFERENCE

By CHRISTAN EARL GRANT

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2015

© 2015 Christan Earl Grant

To Jesus my Savior; Vanisia my wife; my daughter Caliah; my soon-to-be-born son; and my parents and siblings, whom I strive to impress. Also, to all my brothers and sisters battling injustice while I battled bugs and deadlines.

ACKNOWLEDGMENTS

I had the opportunity to see my dad, a software engineer from Jamaica, work extremely hard to earn a master's degree and work as a software engineer. I even had the privilege of sitting in on some of his classes as he taught at a local university. Watching my dad work toward intellectual endeavors made me believe that anything is possible. I am extremely privileged to have someone I could look up to as an example of being a man, father, and scholar.

I had my first taste of research when Dr. Joachim Hammer went out of his way to find a task for me on one of his research projects because I was interested in attending graduate school. After I had worked with the team for a few weeks he was willing to give me increased responsibility: he let me attend the 2006 SIGMOD Conference in Chicago. It was at this conference that my eyes were opened to the world of research.

As an early graduate student, Dr. Joseph Wilson exercised superhuman patience with me as I learned to grasp the fundamentals of paper writing. He helped me manage a rocky first few years. His abundance of wisdom would spill over, revealing jewels of truth that I still hold sacred. Along with Peter Dobbins, he helped me navigate the road to the Ph.D.

I am delighted to have Dr. Daisy Zhe Wang as my dissertation advisor. I followed her work while she was still a graduate student, and I was thrilled to hear she was considering coming to UF. Having the opportunity to watch someone as gifted as Dr. Wang brainstorm and write was an invaluable experience. I also thank my lab mates Clint P. George and Dr. Kun Li, with whom I have worked for many years, as well as Sean Goldberg, Morteza Shahriari Nia, Yang Chen, Yang Peng, and Xiaofeng Zhou, who have also been mentored by Dr. Wang; I appreciate their valuable feedback.

During the last years of my graduate program there has been a large amount of civil unrest. While these issues do not affect me specifically, they are emotionally difficult to handle and can negatively affect my everyday productivity. It was important for me to have people around me who I know are going through similar circumstances emotionally and still pursuing their degrees. That is why I thank Dr. Pierre St. Juste, Dr. Corey Baker, and Jeremy Magruder for discussions about issues that are sacred to one's race and ethnicity.

Finally, I would like to thank all the individuals who regularly attend the ACM Richard Tapia Celebration of Diversity in Computing. I found this group in 2007 because I was purposely searching for community. This is a group of talented intellectuals who continue to spur me towards excellence. Through them I met Dr. Juan Gilbert, who has been an excellent mentor and role model throughout my research career.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER
1 INTRODUCTION
  1.1 Database as the Querying Engine
  1.2 Query-Driven Machine Learning
  1.3 Question Answering
2 IN-DATABASE QUERY-DRIVEN TEXT ANALYTICS
  2.1 MADden Introduction
  2.2 MADden System Description
    2.2.1 MADden System Architecture
    2.2.2 Statistical Text Analysis Functions
    2.2.3 MADden Implementation Details
  2.3 Text Analysis Queries and Demonstration
    2.3.1 Dataset for MADden Example
    2.3.2 MADden Text Analytics Queries
    2.3.3 MADden User Interface
  2.4 GPText Introduction
    2.4.1 GPText Related Work
    2.4.2 Greenplum Text Analytics
      2.4.2.1 In-database document representation
      2.4.2.2 ML-based advanced text analysis
    2.4.3 CRF for IE over MPP Databases
      2.4.3.1 Implementation overview
      2.4.3.2 Feature extraction using SQL
      2.4.3.3 Parallel linear-chain CRF training
      2.4.3.4 Parallel linear-chain CRF inference
    2.4.4 GPText Experiments and Results
    2.4.5 GPText Application
    2.4.6 GPText Summary
3 MAKING ENTITY RESOLUTION QUERY-DRIVEN
  3.1 Query-Driven Entity Resolution Introduction
  3.2 Query-Driven Entity Resolution Preliminaries
    3.2.1 Factor Graphs
    3.2.2 Inference over Factor Graphs
    3.2.3 Cross-Document Entity Resolution
  3.3 Query-Driven Entity Resolution Problem Statement
  3.4 Query-Driven Entity Resolution Algorithms
    3.4.1 Intuition of Query-Driven ER
    3.4.2 Single-Node ER
    3.4.3 Multi-query ER
  3.5 Optimization of Query-Driven ER
    3.5.1 Influence Function: Attract and Repel
    3.5.2 Query-proportional ER
    3.5.3 Hybrid ER
    3.5.4 Implementation Details
    3.5.5 Algorithms Summary Discussion
  3.6 Query-Driven Entity Resolution Experiments
    3.6.1 Experiment Setup
    3.6.2 Realtime Query-Driven ER Over NYT
    3.6.3 Single-query ER
    3.6.4 Multi-query ER
    3.6.5 Context Levels
    3.6.6 Parallel Hybrid ER
  3.7 Query-Driven Entity Resolution Related Work
  3.8 Query-Driven Entity Resolution Summary
4 A PROPOSAL OPTIMIZER FOR SAMPLING-BASED ENTITY RESOLUTION
  4.1 Introduction to the Proposal Optimizer
  4.2 Proposal Optimizer Background
  4.3 Accelerating Entity Resolution
  4.4 Proposal Optimizer Algorithms
  4.5 Optimizer
  4.6 Proposal Optimizer Experiment Implementation
    4.6.1 WikiLink Corpus
    4.6.2 Micro Benchmark
  4.7 Proposal Optimizer Summary
5 QUESTION ANSWERING
  5.1 Morpheus QA Introduction
  5.2 Morpheus QA Related Work
    5.2.1 Question Answering Systems
    5.2.2 Ontology Generators
  5.3 Morpheus QA System Architecture
    5.3.1 Using Ontology and Corpora
    5.3.2 Recording
    5.3.3 Ranking
    5.3.4 Executing New Queries
  5.4 Morpheus QA Results
  5.5 Morpheus QA Summary
6 PATH EXTRACTION IN KNOWLEDGE BASES
  6.1 Preliminaries for Expansion
    6.1.1 Probabilistic Knowledge Base
    6.1.2 Markov Logic Network and Factor Graphs
    6.1.3 Sampling for Marginal Inference
      6.1.3.1 Gibbs sampling
      6.1.3.2 MC-SAT
    6.1.4 Linking Facts in a Knowledge Base
  6.2 Fact Path Expansion Related Work
    6.2.1 SPARQL Query Path Search
    6.2.2 Path Ranking
    6.2.3 Fact Rank
  6.3 Fact Path Expansion Algorithm
  6.4 Joint Inference of Path Probabilities
    6.4.1 Fuzzy Querying
    6.4.2 PostgreSQL Fact Path Expansion Algorithm
    6.4.3 Graph Database Query
    6.4.4 Fact Path Expansion Complexity
  6.5 Fact Path Expansion Experiments
  6.6 Fact Path Expansion Summary
7 CONCLUSIONS
REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Listing of current MADden functions
2-2 List of each MADden function and its NLP task
2-3 Abbreviated NFL dataset schema
3-1 Mention sets M from a corpus
3-2 Example query node q
3-3 Summary of algorithms and their most common methods for proposal jumps
3-4 Features used on the NYT corpus. The first set of features are token-specific features, the middle set are between pairs of mentions, and the bottom set are entity-wide features
3-5 The performance of the hybrid-repel ER algorithm for queries over the NYT corpus for the first 50 samples
4-1 A table of the techniques to improve the sampling process, each classified by how it affects sampling
5-1 Example SSQ model
5-2 The output of the NLP engine
5-3 Term classes and probabilities
5-4 Highest ranked Morpheus QA queries
6-1 The frequency of each term in our cleaned Reverb data set

LIST OF FIGURES

2-1 MADden architecture
2-2 Example MADden UI query template
2-3 The GPText architecture over the Greenplum database
2-4 The MADLib CRF overall system architecture
2-5 Linear-chain CRF training scalability
2-6 Linear-chain CRF inference scalability
2-7 GPText application
3-1 Three-node factor graph. Circles (random variables) with m_i represent mentions and those with e_i represent entities. Clouds are added for visual emphasis of entity clusters
3-2 A possible initialization for entity resolution
3-3 The correct entity resolution for all mentions
3-4 The entity containing q is internally coreferent; the other entities are not correctly resolved
3-5 Hybrid-repel performance for the first 50 samples for three queries. Each result is averaged over 6 runs
3-6 A comparison of single-query algorithms on a query with selectivity of 11
3-7 A comparison of single-query algorithms with a query node of selectivity 46
3-8 A comparison of selection-driven algorithms with a query node of selectivity 130
3-9 The time until an f1q score of 0.95 for five queries of increasing selectivities; averaged over three runs
3-10 The progress of the hybrid algorithm across multiple query nodes using different scheduling algorithms. Each result is averaged over three runs
3-11 The performance of the zuckerberg query with different levels of context. Each result is averaged over 6 runs
3-12 Hybrid-attract algorithm with random queries run over the Wikilinks corpus. Each plot starts after the Vose structures are constructed
4-1 The high-level interaction of the optimizer
4-2 A distribution of entity sizes from the links corpus [87] with an initial start and the truth
4-3 Comparison of baseline versus early stopping methods
4-4 The time for compression for varying entity sizes and cardinalities. This is compared with a line representing the time it takes to make 100K insertions
5-1 Abbreviated vehicular ontology
6-1 An example of the increase of the facts and the number of relations over several timestamps
6-2 A sample of nodes and their changing probabilities over time. The figure is darkened to show the many overlapping lines
6-3 Fact Path Expansion queries over the Titan Graph DB
6-4 Fact Path Expansion queries over PostgreSQL
6-5 PostgreSQL results of Fact Path Expansion queries with reset database cache
6-6 Comparison of the experiments with Titan DB and PostgreSQL without cache

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

QUERY-DRIVEN TEXT ANALYTICS FOR KNOWLEDGE EXTRACTION, RESOLUTION, AND INFERENCE

By Christan Earl Grant

August 2015

Chair: Daisy Zhe Wang
Cochair: Joseph N. Wilson
Major: Computer Engineering

With the precipitous increase in data, performing text analytics using traditional methods has become increasingly difficult. From now until 2020 the world's data is predicted to double every year. Techniques to store and process these large data stores are quickly growing out of date. The increase in data size, combined with improper methods, could mean a large increase in retrieval and processing time. In short, the former techniques do not scale. The complexity of data formats is also increasing: no longer can one assume data will be structured numbers and names. Traditionally, to perform analytics, a data scientist extracts parts of large data sources to local machines and performs analytics using R, Python, or SAS. Extracting this information is becoming a pain point. Additionally, many algorithms run over whole data sets perform extra work when the data scientist is only interested in a particular portion of the data.

In this dissertation, I introduce query-driven text analytics: the use of declarative semantics (a query) to direct, restrict, and alter computation in analytic systems without a major sacrifice in accuracy. I demonstrate this principle in three ways. First, I add text analytics inside of a relational database where the user can use SQL to bind the scope of the algorithm, e.g., using a SELECT statement. In this way, computation takes place in the same location as storage and the user can take advantage of the query processing provided by the database. Second, I alter an entity resolution algorithm so that it uses example queries to drive computation. This demonstrates a method of making a non-trivial algorithm aware of the query. Finally, I describe a method for inferring information from knowledge bases. These techniques perform inference over knowledge bases that model uncertainty for a real scenario, and I show their application within question answering.

CHAPTER 1
INTRODUCTION

From Babylonian-era algorithms for accounting resources [51] to modern day web-scale processing, methods for analyzing data have been central to the progress of successful societies. Data analytics encompasses the algorithms and systems involved in extracting decision-grade information from data. Notably, data analytics spans a series of fields including computer science, economics, marketing, physics, sociology, and engineering.

In a capitalist society the ability to make intelligent business decisions is critical. The globally connected society of the modern day has demanded that competitive organizations find more efficient methods of extracting knowledge. If an organization cannot collect, manage, and process data as efficiently as its competition, then it will have trouble surviving [85].

From now until 2020 the world's data is predicted to double every year [35]. Techniques to store and process these large data stores are quickly growing out of date. The increase in data size with improper methods could mean a large increase in retrieval and processing time. In short, the former techniques do not scale. The complexity of data formats is increasing; no longer can one assume data will be structured numbers and names. Databases are now storing a mix of structured and unstructured data. To support data analytics, queries over disparate data types cannot be an oversight. Additionally, user-generated content such as click streams, tweets, and videos are examples of new data sources with extremely high rates of growth.

In a typical data scientist's text analytics pipeline, data is extracted from a database, analytics are then performed using R, Python, or MATLAB, and the result is added back to the database. With increasing data sizes, the bottleneck of this process is quickly becoming the data transfer time, that is, transferring large amounts of data from and to the database. Oftentimes, large and diverse data sources cannot be extracted from a database, either for security reasons or because of their large size. Nor can a global service be taken off-line for processing and updates. To perform text analytics in these scenarios it is preferable to bring the query to the data instead of bringing the data to the query [23].

Text analytics is a class of methods for processing documents to obtain actionable or exploratory information. Text analytic tasks include linguistic processing, knowledge extraction, and information visualization. Most text analytics techniques are created for processing information across the full supplied data set. That is, to extract answers from a data set, the full set must be processed. With large data set sizes, this approach becomes prohibitive. To use an analogy, if a single clean plate is needed from the kitchen sink one should not run the entire dish washing machine. It is our observation that during the majority of exploration tasks, a data scientist may only interpret a small portion of the data set. For example, when clustering data for evaluation a data scientist may only look at a handful of data clusters. When running exploratory analysis over data streams, providing a template or example of expected results may be useful when sifting through noise.

This dissertation defines the category of query-driven text analytics and presents three scenarios demonstrating the efficacy of query-driven techniques. Query-driven text analytics is the use of declarative semantics to decrease the amount of processing without a sacrifice in accuracy. In this dissertation, we demonstrate this in three ways.

• We add machine learning algorithms inside of a parallel relational DBMS where the user can use SQL and UDFs to choose the scope of their algorithm (Chapter 2);

• We alter a machine learning algorithm so it uses an example query to drive computation (Chapters 3 and 5);

• We investigate the use of knowledge-based inference to assist question answering systems (Chapter 5) and to understand the connection between concepts (Chapter 6).

In the following subsections I briefly introduce each contributed area. In addition, I explicitly state the contribution of each work.

1.1 Database as the Querying Engine

When processing large data, often a bottleneck to computation is data movement. Moving data across geographical locations for processing is expensive. In-database analytics (dblytics) aims to build sophisticated analytic algorithms into data-parallel systems, such as relational databases and massively parallel processing systems. Using a database as the ecosystem for analytics, we get a declarative query interface, query optimization, transactional operations, efficient caching, and fault tolerance. I present two projects demonstrating dblytics: MADden and GPText.

MADden is a demonstration of in-database text analysis algorithms [41]. This demonstration focuses on answering queries for sports journalism, in particular over NFL data sets, using Mad Lib-style queries. The demonstration made the following contributions:

• Processing declarative ad hoc queries involving various statistical text analytic functions.

• Joining and querying over multiple data sources of structured and unstructured text.

• Query-time rendering of visualizations over query results, using word clouds, histograms, and ranked lists of documents.

GPText is a system for large-scale text indexing, search, and ranking [57]. This new system integrates Greenplum DB, the MADlib analytics library, and the Apache Solr enterprise search platform. Combined with our MADlib algorithms, such as conditional random field part-of-speech tagging, GPText is an extremely scalable, large-scale text analytics engine. GPText adds a Solr instance to each parallel Greenplum DB segment, and the database can communicate with the instances over HTTP. Text searches are then parallelized across segments. Using UDFs we can mix sophisticated search predicates, ranking, and database queries. In addition, we created an application that demonstrates the scalability of GPText and MADlib algorithms.

1.2 Query-Driven Machine Learning

In query-driven machine learning, the idea is to use examples of desired results to reduce the amount of time spent processing data. To demonstrate this we take a popular clustering problem, entity resolution, and make it query-driven. Entity resolution (ER) is the process of determining which records (mentions) in a database correspond to the same real-world entity. Leading ER systems solve this problem by resolving every record in the database. For large datasets, however, this is an expensive process. Moreover, such approaches are wasteful because, in practice, users are interested in only one or a small subset of the entities mentioned in the database. In this work, we introduce new classes of SQL queries involving ER operators: single-query ER and multi-query ER. We develop novel variations of the Metropolis-Hastings algorithm and introduce selectivity-based scheduling algorithms to support the two classes of ER queries.

To support single-query ER queries, we develop three new variations of the Metropolis-Hastings-style Markov chain Monte Carlo algorithm for inference over the CRF-based probabilistic model. More specifically, instead of a uniform sampling distribution, we use a query-driven sampling method that is biased by the arrangement of the probabilistic model. In the first, target-fixed, algorithm, we adapt the samples to resolve the query entity. The second, query-proportional, algorithm selects mentions based on their probabilistic similarity to the query entity. The third, hybrid, algorithm combines the two approaches. Following the seminal work of Wick et al. [100], we devise an influence function to model the similarity between the mentions and the query entity as an attract score. This influence function works best when the cluster of mentions is heterogeneous. In the case when the cluster of mentions is homogeneous, for example the result of high-quality canopy generation, we show a different algorithm to compute and apply an influence function that generates a repel score for biased sampling. To support multi-query ER, a naive nested-loop join can be performed using the single-query ER algorithms iteratively to compute the resolution one entity at a time.
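All of these variants keep the standard Metropolis-Hastings machinery and change only the proposal distribution. For reference (this is the textbook acceptance rule, not notation taken from the cited work), a move from configuration $e$ to $e'$ drawn from a proposal $q$ is accepted with probability

\[
\alpha(e \rightarrow e') \;=\; \min\!\left(1,\; \frac{\pi(e')\, q(e \mid e')}{\pi(e)\, q(e' \mid e)}\right),
\]

where $\pi$ is the distribution defined by the underlying factor graph. The target-fixed, query-proportional, and hybrid algorithms bias $q$ toward mentions relevant to the query while leaving this acceptance rule unchanged; provided the biased proposal can still reach every configuration, the sampler converges to the same distribution.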

However, such a join algorithm can lead to unoptimized resource allocation if the same number of samples is generated for each target entity, or to low throughput if one of the entities has a low convergence rate (e.g., a long sampling process). To alleviate this problem, we discuss three multi-query ER algorithms, which schedule the computation (i.e., sample generation) among different target entities in order to achieve the optimum overall convergence rate.

1.3 Question Answering

The next step after extracting information from large data sets and analyzing it is question answering. Question answering bridges the gap between the way a user asks a question and the way an answer is encoded in the background knowledge. Understanding questions and extracting answers requires the full suite of text mining tasks. The process of question answering is inherently query-driven; all possible questions over a data set cannot be enumerated, therefore any question answering system waits for a user query to initiate an answer discovery process. Question answering is the holy grail of text analytics. Many text analytic tasks are required to obtain accurate answers.

In this work, we extract answers from the web corpus and we distinguish between two portions of the web, namely the surface web and the deep web. The surface web is the set of standard web pages accessible from a browser without authenticating or providing any credentials. These pages include blog posts, company web pages, news articles, and more. By contrast, the deep web is the set of pages generated through web forms. That is, accessing deep web pages requires user interaction and supplied parameters. We present the Morpheus QA system, a question answering system that records user interaction with the deep web in order to answer questions. When a user poses a natural language question, that question is compiled and matched to previous interactions on the deep web, and Morpheus QA compiles the set of pages required to answer the question. It then, at query time, interacts with the deep web to extract an answer for the query.

The surface web contains vast amounts of facts that can be extracted using text analytics. These facts are uncertain; that is, there may be conflicting or ill-formed facts. These uncertainties may arise from the extraction process itself, from changes in facts over time, or from the inherent ambiguity of natural language. Hence, these facts are stored along with their probabilities in a probabilistic knowledge base. A probabilistic knowledge base is a database of facts with an inference engine. The probabilistic knowledge base allows queries that compute the probability that a fact is correct given the rest of the facts in the knowledge base. Additionally, knowledge bases can infer missing information. Given the promise of these types of advanced knowledge bases, we can look at the calculation of the probabilities in a knowledge base. We procure paths of connected facts from knowledge bases to show connections between entities. Connected paths between entities can help an information seeker understand how two entities correspond. For a web-scale knowledge base, this task is difficult without using large machines. We give examples of how this method assists information seekers.

Dissertation outline. This dissertation is outlined in six chapters. Each chapter is independent and discusses any necessary background and related work within. Chapter 2 describes how a database can be used as a location for text analytics through two systems, MADden and GPText. Chapter 3 describes query-driven entity resolution; this is the main contribution of this work. Chapter 4 adds depth to the entity resolution work by proposing an optimizer to increase its efficiency. Chapter 5 describes the deep web question answering system known as Morpheus. Chapter 6 presents the final part of the dissertation, namely an algorithm for ranking paths between entities in knowledge bases. In Chapter 7 we summarize the contributions and present some future areas of research.

CHAPTER 2
IN-DATABASE QUERY-DRIVEN TEXT ANALYTICS

In this chapter, I introduce two systems, MADden [41] and GPText [58], that are created to introduce a text analytics paradigm where the data scientist rarely has to leave the comfort of the RDBMS, where their data lies. This chapter shows how these systems allow for non-trivial text analytics, sophisticated text search, and visualization. By empowering the RDBMS to perform these tasks, we can use the declarative query interface to decide over which data source we should perform analytics; that is, we are performing query-driven data analytics.

The second half of this chapter discusses the GPText project. This work overlaps significantly with that of a fellow Ph.D. student, Kun Li; each of us contributed equally, and it was important to include it in the dissertation because it is the second part of the MADden project. This body of work represents early contributions in statistical text analytics.

2.1 MADden Introduction

For many applications, unstructured text and structured data are both important natural resources to fuel data analysis. For example, a sports journalist covering NFL (National Football League1 — thirty-two American football teams with more than 1700 players) games would need a system that can analyze both the structured statistics (e.g., scores, biographic data) of teams and players and the unstructured tweets, blogs, and news about the games. In such applications, analytics are performed over text data from many sources. Text analysis uses statistical machine learning (SML) methods to extract structured information — such as part-of-speech tags, entities, relations, sentiments, and topics — from text. The result of the text analysis can be joined with other structured data sources

1 http://www.nfl.com

for more advanced analysis. For example, a sports journalist may want to correlate fan sentiment from tweets with statistics describing the player and team performance of the Miami Dolphins.2 To answer such queries, a software developer must understand and connect multiple tools, including Lucene for text search, Weka or R for sentiment analysis, and a database to join the structured data with the sentiment results. Using such a complex off-line batch process to answer a single query makes it difficult to ask ad hoc queries over ever-evolving text data. These queries are essential for applications such as computational journalism, e-discovery, and political campaign management, where queries are exploratory in nature and follow-up queries need to be asked based on the results of previous queries.

MADden implements four important text analysis functions, namely part-of-speech tagging, entity extraction, classification (e.g., sentiment analysis), and entity resolution. Text analysis functions are implemented using PostgreSQL (a single-threaded database) and Greenplum (a massively parallel processing (MPP) framework). Two SML models and their inference algorithms are adapted: linear-chain Conditional Random Fields (CRF)3 and Naive Bayes [95]. In-database and parallel SML algorithms are implemented in the Greenplum MPP framework. The MADden text analytic library is integrated into the MADLib open-source project.4 The declarative SQL query interface with MADden text analysis functions provides a higher-level abstraction. Such an abstraction shields users from detailed text analytic algorithms and enables users to focus more on application-specific data explorations.

2 http://www.miamidolphins.com/
3 A CRF is a discriminative probabilistic graphical model used to encode arbitrary relationships for statistical processing.
4 MADLib is an open source project for scalable in-database analytics: http://madlib.net

Figure 2-1. MADden architecture

In this chapter we will show the following points using e-journalism over an NFL corpus (our driving example):

• Processing declarative ad hoc queries involving various statistical text analytic functions;

• Joining and querying over multiple data sources with both structured and unstructured textual information;

• Query-time rendering of visualizations over query results, using word clouds, histograms, and ranked lists of documents.

2.2 MADden System Description

In this section we first discuss the general architecture and the basic techniques used in the implementation of the text analytics algorithms in MADden. We then give an example POS tagging implementation.

2.2.1 MADden System Architecture

MADden is a four-layered system, as can be seen in Figure 2-1. The user interface is where both naive and advanced users can construct queries over text, structured data, and models. From the user interface, queries are then passed to the DBMS, where both the MADLib and MADden libraries sit on top of the query processor to add statistical and text processing functionality. It is important to emphasize that MADlib and MADden perform functions at the same logical layer. To enable text analytics, MADden works alongside statistical functions found in the MADlib library [44]. These queries are processed using PostgreSQL and Greenplum's parallel DB architecture to further optimize storage replication and query parallelism.

Table 2-1. Listing of current MADden functions
Function                 Task
match(object1, object2)  Entity Resolution
sentiment(text)          Sentiment Analysis
entity_find(text)        Detects Named Entities
viterbi                  Part-of-speech tags using CRF

In this section we describe various text analysis algorithms. Many approaches exist for in-database information extraction. We build on our previous work using Conditional Random Fields (CRFs) for query-time information extraction [94]. We perform the extraction and the inference inside of the database. We rely on information provided in the query to make decisions on the type of algorithm used for extraction. Table 2-1 lists the statistical text analysis tasks.

Entity resolution, or co-reference resolution, is the following problem: given any two mentions of a name, cluster them if and only if they refer to the same real-world entity. Certain entities may be misrepresented by the presence of different names, misspellings in the text, or aliases. It is important to resolve these entities appropriately to better understand the data. Increasingly, informal text such as blog posts and tweets requires entity resolution. MADden uses inverted indices within the database to perform text analysis on documents. We can scan the inverted indices of each document, filtering out documents that do not contain instances of the player names. To handle misspellings and nicknames we use trigram indices to perform approximate matches of searches for names

as database queries [46]. This method allows us to use indices to perform queries on only the relevant portions of the data set; thus we do no extra processing.

We implemented functions to perform classification tasks such as POS tagging and sentiment analysis. These functions work at both the document and sentence level. In sentiment analysis we classify text by polarity, where positive sentiment refers to the positive nature of the expressed opinion and negative sentiment to its negative nature. Much work has already been done in this area for document-level and entity-level sentiment [71, 103]. These functions can be joined with other tables and functions within an SQL query, allowing more complex queries to be realized declaratively.
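To make the trigram-based approximate matching above concrete, the following is a minimal sketch using PostgreSQL's pg_trgm extension over the extracted_entities table introduced later in this chapter; the index name, search string, and threshold are made up for the example.

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX entity_trgm_idx
    ON extracted_entities USING gin (entity gin_trgm_ops);

-- similarity() returns a score in [0, 1]; misspelled or abbreviated names
-- such as 'Jacksonvile Jaguars' still match the stored entity strings.
SELECT doc_id, entity, similarity(entity, 'Jacksonvile Jaguars') AS score
FROM   extracted_entities
WHERE  entity % 'Jacksonvile Jaguars'   -- trigram match above the session threshold
ORDER  BY score DESC;

The % operator can use the trigram index, so only candidate rows are fetched, which matches the goal of touching just the relevant portion of the data set.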

Parallelization. With a parallel database architecture such as Greenplum, we can further optimize queries written with MADden by parallelizing them. Each node within the parallel DB can run a query over a subset of the data (data parallelism). This includes the statistical methods in MADLib, which were all built to be data parallel. Greenplum has a parallel shared-nothing architecture. Data is loaded onto segment servers. When a query is issued, a parallel query optimizer creates a global query plan which is pushed to each of the segment servers. Query-driven algorithms can then be executed in parallel over several data servers.

2.2.3 MADden Implementation Details

Core to many natural language processing tasks, part-of-speech (POS) tagging involves the labeling of terms with their parts of speech within a sentence. We implemented POS tagging in PostgreSQL and Greenplum. Our code is part of the MADLib open source system.

MADden uses a first-order chain CRF to model the labeling of a sequence of tokens. The factor graph contains observed nodes for each sentence token with latent label variables attached to each token. Factors are functions that connect two nodes or signify the ends of the chain. We generate the features using a function generatemrtbl. This function produces a table rfactor for single-state features and a table mfactor for two-state features.

Training the CRF model is a one-time task that is performed outside the DBMS.5 We use a Python script to parse and import the trained model into tables in the DBMS. Inference is performed over the stored models in order to find the highest assignment of labels in the model. We calculate the most probable label assignment using the Viterbi dynamic programming algorithm over the label space. We use the PL/Python language to manage the workflow of all the calculations. The computationally expensive Viterbi function is implemented as a database user-defined function in the C language. The feature generation and the execution of inference over a table of sentences are implemented in SQL. When executed in Greenplum the query is performed in parallel. Implementing POS tagging inside the DBMS allows us to perform inference over a subset of tokens in response to a query instead of performing batch tagging over all tokens. We also get the benefit of using the query engine to parallelize our queries without losing the ability to drive the workflow using PL/Python.

Example. Algorithm 1 performs POS tagging for all the sentences that contain the word 'Jaguar'. This query interface allows the user to perform functions on a subset of the data. The segmenttbl table holds a list of tokens and their positions for each document (doc_id). We assume a document is a sequence of tokens.

Algorithm 1 POS tagging on sentences with the word 'Jaguar'
SELECT DISTINCT ON (segtbl.doc_id)
       viterbi(segtbl.seglist, mfactor.score, rfactor.score)
FROM   segmenttbl, mfactor, rfactor, segtbl
WHERE  segtbl.doc_id = segmenttbl.doc_id
  AND  segmenttbl.seg_text = 'Jaguar';

5 We use the IIT Bombay package for training available at http://crf.sourceforge.net

2.3 Text Analysis Queries and Demonstration

In this section we describe various data sources from the NFL domain for computational journalism. We then describe the query-driven user interfaces used for exploratory text analysis applications.

2.3.1 Dataset for MADden Example

Our sample demonstration for MADden involves a variety of NFL-based data sources. The data is represented in Table 2-3 as an abbreviated schema.6 The NFLCorpus table holds semi-structured data: textual data from blogs, news articles, and fan tweets along with document metadata such as timestamp, tags, and type, among others. The tweets were extracted using the Twitter Streaming API7 with a series of NFL-related keywords, and the news articles and blogs were extracted from various sports media websites. These documents vary in size and quality. We have around 25 million tweets from the 2011 NFL season, including plays and recaps from every game in the season.

The statistical data was extracted from the NFL.com player database. Each table contains the player's name, position, number, and a series of statistics of different types (some players show up in multiple tables, others in only one). The Player table holds information about a player in the NFL, including college, birthday, height, and weight, as well as years in the NFL. The Team table holds some basic information about the 32 NFL teams, including location, conference, division, and stadium. TeamStats2011 holds the team rankings and statistics in a variety of categories (Offense, Defense, Special Teams, Points, etc.). Extracted Entities stores the extracted entities found in the NFLCorpus documents.

6 These tables may be extracted to an RDBMS, or defined over an API using a foreign data wrapper.
7 https://dev.twitter.com/docs/streaming-apis

2.3.2 MADden Text Analytics Queries

Based on our example dataset, suppose a sports journalist wants to write an investigative piece on the overall public opinion of all Florida-based NFL teams during the 2011-2012 season. Such a piece would require in-depth analysis of news reports, tweets, and blog postings, among other sources. The standard approach would consist of trawling through the text sources either by hand or with a series of different text processing toolkits and packages, sometimes specialized for a single task. Instead of multiple tools, MADden can streamline this process with its declarative, in-database approach to text analytics. A first step may consist of paring the corpora down to just the documents related to the Florida football teams, namely the Miami Dolphins, Jacksonville Jaguars, and Tampa Bay Buccaneers.

Algorithm 2 An entity resolution query using the MADden framework.
SELECT DISTINCT doc_id
FROM   extracted_entities
WHERE  match('Jaguars', entity) > match_thresh
   OR  match('Dolphins', entity) > match_thresh
   OR  match('Buccaneers', entity) > match_thresh;

The match function used in Algorithm 2 is an entity resolution UDF which calculates a [0, 1]-bounded inverse metric, where terms that are close to our target will have a higher score than those that are less similar. Extracted Entities is a view constructed using entity_find. This function detects entities using one of two accuracy settings (high accuracy with lower recognition, or low accuracy with higher recognition) on textual documents as they are added to the database (in this case, likely news articles and blogs). Table 2-2 shows some of the current text analysis functions implemented in MADden.

A journalist may want to explore fan sentiment for the Jacksonville Jaguars based on tweets collected during the NFL season. Utilizing the first query as a building block (with some changes), we can construct this query as listed in Algorithm 3.

Table 2-2. List of each MADden function and its NLP task
Function                    Task
match(target, against)      Entity Resolution
sentiment(text)             Sentiment Analysis
entity_find(text, boolean)  Detects Named Entities
pos_tag(text)               POS tagging
viterbi                     Part-of-speech tags using CRF
pos_extract(text, type)     POS term extraction

Table 2-3. Abbreviated NFL dataset schema
Table               Attributes
NFLCorpus           doc_id, type, text, tstmp, tags
PlayerStats2011     pid, type-specific stats
Player              pid, fname, lname, college, etc.
TeamStats2011       team, points, pass_yds, various stats
Team                team, city, state, stadium
Extracted Entities  doc_id, entity

In the query of Algorithm 3, one could accommodate nicknames through OR-matching on the extracted entities, an alias table, or other strategies. Notice that going from a single text analytics task to a more complex analysis only required a small change. Whereas a traditional approach would have us either looking for a customized solution or patching together packages, the declarative SQL approach allows the user to simply state what the result should be. And since we are working in SQL, we can combine queries on corpus tables with tables of structured data. For example, if our journalist wants to analyze the media opinion of the state's best receiver, he could consult both the player stats table and the media blogs, as shown in Algorithm 4.

Algorithm 3 Entity resolution and sentiment analysis in MADden.
SELECT DISTINCT E.doc_id, E.entity, sentiment(S.document)
FROM   extracted_entities AS E, NFLCorpus AS S
WHERE  E.doc_id = S.doc_id
  AND  sentiment(S.document) IN ('+', '-')
  AND  match('Jaguars', E.entity) > match_thresh
  AND  S.type = 'tweet';

Algorithm 4 A MADden query over structured and unstructured data.
SELECT BestWR.name, sentiment(A.txt), A.txt
FROM   NFLCorpus A, extracted_entities E,
       (SELECT P.fname || ' ' || P.lname AS name
        FROM   Player P, PlayerStats2011_Rec S
        WHERE  S.pid = P.pid
          AND  (P.team = 'Jaguars' OR P.team = 'Dolphins'
                OR P.team = 'Buccaneers')
        ORDER BY S.rec_yds DESC
        LIMIT 1) AS BestWR
WHERE  E.doc_id = A.doc_id
  AND  (A.type = 'blog' OR A.type = 'news')
  AND  match(BestWR.name, E.entity) > match_thresh;

Algorithm 4 uses the standard structured SQL tables Player and PlayerStats2011_Rec, which represent players and their receiving stats. Our journalist finds the best receiver playing on a Florida NFL team based on receiving yards.8 The query then discovers all the associated news and blog documents and performs the entity resolution function on the extracted entities, returning the sentiment and text associated with that player. This method is not restricted to single-domain analytics. One can run analytics combining different datasets (e.g., state economies and the NFL), utilizing the same declarative in-database methods seen here.

2.3.3 MADden User Interface

We have given an interactive demonstration of MADden's capabilities. The demonstration is based around MADden UI, a web interface that allows users to perform analytic tasks on our dataset. MADden UI has two forms of interaction: raw SQL queries, and a Mad Lib9 style interface with fill-in-the-blank query templates for quick interaction, as shown in Figure 2-2. In the demonstration accompanying [41], the user interface was included to assist users in interpreting the results.

8 Total yards over a season, yards per catch, and touchdowns usually decide who the best receiver was at the end of a season.
9 http://en.wikipedia.org/wiki/Mad_Libs

Figure 2-2. Example MADden UI query template

2.4 GPText Introduction

Many companies keep large amounts of text data in relational databases. Several challenges exist in performing analysis on such datasets using state-of-the-art systems. First, expensive data transfer costs must be paid up-front to move data between databases and analytics systems. Second, many popular text analytics packages do not scale up to production-sized datasets. In this section, we introduce GPText, a Greenplum parallel statistical text analysis framework that addresses the above problems by supporting statistical inference and learning algorithms natively in a massively parallel processing database system. GPText seamlessly integrates the Solr search engine and applies statistical algorithms such as k-means and LDA using MADLib, an open source library for scalable in-database analytics which can be installed on PostgreSQL and Greenplum. In addition, through GPText we developed and contributed a linear-chain conditional random field (CRF) module to MADLib to enable information extraction tasks such as part-of-speech tagging and named entity recognition. We show the performance and scalability of the parallel CRF implementation. Finally, we describe an e-discovery application built on the GPText framework.

Text analytics has gained much attention in the big data research community due to the large amounts of text data generated every day in organizations such as companies, governments, and hospitals in the form of emails, electronic notes, and internal documents. Many companies store this text data in relational databases because they rely on databases for their daily business needs. A good understanding of this unstructured text data is crucial for companies to make business decisions, for doctors to assess their patients, and for lawyers to accelerate document review processes.

Traditional business intelligence pulls content from databases into other massive data warehouses to analyze the data. The typical "data movement process" involves moving information from the database for analysis using external tools and storing the final product back into the database. This movement process is time consuming and prohibitive for interactive analytics. Minimizing the movement of data is a huge incentive for businesses and researchers. One way to achieve this is for the datastore to be in the same location as the analytics engine. While Hadoop has become a popular platform to perform large-scale data analytics, newer parallel processing relational databases can also leverage more nodes and cores to handle large-scale datasets. The Greenplum database, built upon the open source database PostgreSQL, is a parallel database that adopts a shared-nothing massively parallel processing (MPP) architecture. Database researchers and vendors are capitalizing on the increase in database cores and nodes and investing in open-source data analytics ventures such as the MADLib project [23, 45]. MADLib is an open-source library for scalable in-database analytics on Greenplum and PostgreSQL. It provides parallel implementations of many machine learning algorithms.

In this chapter, we motivate in-database text analytics by presenting GPText, a powerful and scalable text analysis framework developed on the Greenplum MPP database. GPText inherits scalable indexing, keyword search, and faceted search functionalities from an effective integration of the Solr search engine [33]. GPText uses and contributes statistical methods to the MADLib open-source library. We show that we can use SQL and user-defined aggregates to implement conditional random field (CRF) methods for information extraction in parallel. The experiments show a sublinear improvement in runtime for both CRF learning and inference with a linear increase in the number of cores. As far as we know, GPText is the first toolkit for statistical text analysis in relational database management systems. Finally, we describe the needs and requirements of e-discovery applications and show that GPText is an effective platform on which to develop such sophisticated text analysis applications.

2.4.1 GPText Related Work

Researchers have created systems for large-scale text analytics including GATE, PurpleSox, and SystemT [15, 26, 60]. While both GATE and SystemT use a rule-based approach for information extraction (IE), PurpleSox uses statistical IE models. However, none of the above systems is natively built for an MPP framework. Our parallel in-database CRF implementation follows the MAD methodology [45]. In a similar vein, researchers have shown that most machine learning algorithms can be expressed through unified RDBMS architectures [31]. Recently, parallel machine learning algorithms have been developed by many research groups [16, 56, 81]. There are several implementations of conditional random fields, but only a few large-scale implementations for NLP tasks. One example is PCRFs [76], which are implemented over massively parallel processing systems supporting the Message Passing Interface (MPI), such as Cray XT3, SGI Altix, and IBM SP. However, this is not implemented over an RDBMS.

2.4.2 Greenplum Text Analytics

GPText runs on the Greenplum database (GP), which is a shared-nothing massively parallel processing database. The Greenplum database adopts the widely used master-slave parallel computing model, where one master node orchestrates multiple slaves to work until the task is completed and there is no direct communication between slaves. As shown in Figure 2-3, it is a collection of PostgreSQL instances including one master instance and multiple slave instances (segments). The master node accepts SQL queries from clients, then divides the workload and sends sub-tasks to the segments. Besides harnessing the power of a collection of computer nodes, each node can also be configured with multiple segments to fully utilize multicore processors. To provide high availability, GP provides the option to deploy a redundant standby master and mirror segments in case of master or primary segment failure. The parallel processing capability of the Greenplum MPP framework lays the cornerstone that enables GPText to process production-sized text data. Beyond the underlying MPP framework, GPText also inherits the features that a traditional database system provides, for example online expansion, data recovery, and performance monitoring, to name a few.

On top of the underlying MPP framework there are two building blocks, MADLib and Solr (illustrated in Figure 2-3), which distinguish GPText from many of the existing text analysis tools. MADLib makes GPText capable of performing sophisticated text data analysis tasks, such as part-of-speech tagging, named-entity recognition, document classification, and topic modeling, with a vast amount of parallelism. Solr is a reliable and scalable text search platform from the Apache Lucene project, and it has been widely deployed in web servers. Its major features include powerful full-text search, faceted search, and near-realtime indexing. As shown in Figure 2-3, GPText uses Solr to create a distributed index. Each primary segment is associated with exactly one Solr instance, where the index of the data in that primary segment is stored, for the purpose of load balancing. GPText has all the features that Solr has, since Solr is integrated into GPText seamlessly. GPText also contributes statistical methods to MADLib for text analysis, such as CRF.

2.4.2.1 In-database document representation

In GPText, a document can be represented as a vector of counts against a token dictionary, which is a vector of the unique tokens in the dataset. For efficient storage and memory use, GPText uses a sparse vector representation for each document instead of a naive vector representation. The following is an example of two different vector representations of a document. The dictionary contains all the unique terms (i.e., 1-grams) that exist in the corpus.

Dictionary: {am, before, being, bothered, corpus, document, in, is, me, never, now, one, really, second, the, third, this, until}
Document: {i, am, second, document, in, the, corpus}
Naive vector representation: {1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1}
Sparse vector representation: {(1, 0, 1, 1, 1, 1, 0, 1, 1) : (1, 3, 1, 1, 1, 1, 6, 1, 1)}

GPText adopts run-length encoding to compress the naive vector representation using a pair of vectors: the first vector holds the distinct values, and the second vector holds the number of contiguous appearances (the run length) of each value in the first vector. Although not apparent in this example, the advantage of the sparse vector representation is dramatic in real-world documents, where most of the elements in the vector are zero.

Figure 2-3. The GPText architecture over the Greenplum database
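MADlib ships a sparse vector type, svec, built on the same run-length idea; the sketch below assumes the MADlib extension is installed in a schema named madlib, and note that the svec literal lists the run lengths first and the values second (the reverse of the ordering shown above).

-- The same document as above: one 1, three 0s, then 1, 1, 1, 1, six 0s, 1, 1.
SELECT '{1,3,1,1,1,1,6,1,1}:{1,0,1,1,1,1,0,1,1}'::madlib.svec;
-- The resulting value can be fed to MADlib's vector operators and ML functions.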

2.4.2.2 ML-based advanced text analysis

GPText relies on multiple machine learning modules in MADLib to perform statistical text analysis. Three of the most commonly used modules are k-means for document clustering, multinomial naive Bayes (multinomialNB) for document classification, and latent Dirichlet allocation (LDA) for topic modeling, used for dimensionality reduction and feature extraction. Performing k-means, multinomialNB, or LDA in GPText follows the same pipeline:

1. Create a Solr index for the documents.
2. Configure and populate the Solr index.
3. Create a terms table for each document.
4. Create a dictionary of the unique terms across documents.
5. Construct a term vector using the term frequency-inverse document frequency (tf-idf) for each document (a SQL sketch of this step follows below).
6. Run the MADLib k-means/multinomialNB/LDA algorithm.

The following section details the implementation of the CRF learning and inference modules that we developed for GPText applications as part of MADLib.
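Before turning to CRFs, step 5 above can be written directly in SQL. The sketch below assumes a hypothetical per-document term table, term_counts(doc_id, term, cnt), as produced by step 3; it is meant only to illustrate the shape of the computation, not GPText's exact implementation.

SELECT t.doc_id,
       t.term,
       (t.cnt::float8 / dl.doc_len) *
       ln(corpus.n_docs::float8 / df.docs_with_term) AS tf_idf
FROM   term_counts t
JOIN   (SELECT doc_id, sum(cnt) AS doc_len
        FROM term_counts GROUP BY doc_id) dl USING (doc_id)
JOIN   (SELECT term, count(DISTINCT doc_id) AS docs_with_term
        FROM term_counts GROUP BY term) df USING (term)
CROSS JOIN (SELECT count(DISTINCT doc_id) AS n_docs
            FROM term_counts) corpus;

Because the statement is plain SQL over set semantics, Greenplum parallelizes it across segments with no extra work.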

2.4.3 CRF for IE over MPP Databases

A conditional random field (CRF) is a type of discriminative undirected probabilistic graphical model. Linear-chain CRFs are special CRFs that assume the next state depends only on the current state. Linear-chain CRFs achieve state-of-the-art accuracy in many real-world natural language processing (NLP) tasks such as part-of-speech tagging and named entity recognition.

2.4.3.1 Implementation overview

Figure 2-4 illustrates the detailed implementation of the CRF module developed for IE tasks in GPText. The top box shows the pipeline of the training phase; the bottom box shows the pipeline of the inference phase. We use declarative SQL statements to extract all features from text. Any feature available in the state-of-the-art packages can be extracted using a single SQL clause; all of the common features described in the literature can be extracted with one SQL statement each. The extracted features are stored in a relation for either single-state or two-state features. After the feature extraction, we use user-defined aggregates (UDAs) to calculate the maximum a posteriori (MAP) configuration and probability for inference. For learning, we use UDFs to implement the gradient and log-likelihood computations in parallel.

Figure 2-4. The MADLib CRF overall system architecture

2.4.3.2 Feature extraction using SQL

Text feature extraction is a step in most statistical text analysis methods. We are able to implement all seven types of features used in POS tagging and NER using exactly seven SQL statements. These features include:

Dictionary: does this token exist in a dictionary?
Regex: does this token match a regular expression?
Edge: is the label of a token correlated with the label of the previous token?
Word: does this token appear in the training data?
Unknown: does this token appear in the training data below a certain threshold?
Start/End: is this token first/last in the token sequence?

There are many advantages to extracting features using SQL. The SQL statements hide a lot of the complexity present in the actual operation. It turns out that each type of feature can be extracted using exactly one SQL statement, making the feature extraction code extremely succinct. Additionally, SQL statements are naturally parallel due to the set semantics supported by relational DBMSs. For example, we compute features for each distinct token and avoid re-computing the features for repeated tokens. In Algorithm 5 and Algorithm 6 we show how to extract edge and regex features, respectively. Algorithm 5 extracts adjacent labels from sentences and stores them in an array. Algorithm 6 shows a query that selects all the sentences that satisfy any of the regular expressions present in the table regextbl.

Algorithm 5 Query for extracting edge features
SELECT doc2.pos, doc2.doc_id, 'E.', ARRAY[doc1.label, doc2.label]
FROM   segmenttbl doc1, segmenttbl doc2
WHERE  doc1.doc_id = doc2.doc_id AND doc1.pos + 1 = doc2.pos

Algorithm 6 Query for extracting regex features
SELECT start_pos, doc_id, 'R ' || r.name, ARRAY[-1, label]
FROM   regextbl r, segmenttbl s
WHERE  s.seg_text ~ r.pattern
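In the same spirit, the remaining feature types each reduce to one statement. The sketch below shows how a dictionary feature might be expressed; the dict table, its token column, and the 'D ' prefix are hypothetical and only mirror the shape of Algorithms 5 and 6.

SELECT s.start_pos, s.doc_id, 'D ' || s.seg_text, ARRAY[-1, s.label]
FROM   segmenttbl s, dict d
WHERE  s.seg_text = d.token;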

2.4.3.3 Parallel linear-chain CRF training

Programming Model. In Algorithm7 we show the parallel CRF training strategy. The algorithm is expressed as a user-defined aggregate. User-defined aggregates are composed of three parts: a transition function (Algorithm8), a merge functionand a finalization function (Algorithm9). Following we describe these functions. In line1 of Algorithm7 the Initialization function creates a state object in the database. This object contains coefficient (w), gradient (∇) and log-likelihood (L) variables. This state is loaded (line3) and saved (line8) between iterations. We compute

Algorithm 7 CRF training(z1:M)

Input: z1:M (the document set), Convergence(), Initialization(), Transition(), Finalization()
Output: Coefficients w ∈ R^N
Initialization/Precondition: iteration = 0
1: Initialization(state)
2: repeat
3:   state ← LoadState()
4:   for all m ∈ 1..M do
5:     state ← Transition(state, zm)
6:   end for
7:   state ← Finalization(state)
8:   WriteState(state)
9: until Convergence(state, iteration)
return state.w

the gradient and log-likelihood of each segment in parallel (line 4), much like a map function. Then line 7 computes the new coefficients, much like a reduce function.
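For illustration only, the following Python sketch mirrors the transition/merge/finalization pattern of the training aggregate; crf_gradient is a hypothetical stand-in for the per-document gradient and log-likelihood computation, and the finalization step uses a placeholder update rather than the actual L-BFGS solver used in GPText.

import numpy as np

def crf_gradient(coef, doc):
    # Toy stand-in: a real implementation runs forward-backward over the
    # document's features; here we simply return zeros.
    return np.zeros_like(coef), 0.0

class TrainingState:
    def __init__(self, num_features):
        self.coef = np.zeros(num_features)   # coefficients w
        self.grad = np.zeros(num_features)   # gradient accumulator
        self.loglik = 0.0                    # log-likelihood accumulator
        self.num_rows = 0

def transition(state, doc):
    """Invoked once per document (segment); runs in parallel on each worker."""
    g, ll = crf_gradient(state.coef, doc)
    state.grad += g
    state.loglik += ll
    state.num_rows += 1
    return state

def merge(state_a, state_b):
    """Combines partial aggregates produced by different workers."""
    state_a.grad += state_b.grad
    state_a.loglik += state_b.loglik
    state_a.num_rows += state_b.num_rows
    return state_a

def finalize(state):
    """Placeholder for the L-BFGS step that consumes the aggregated gradient."""
    state.coef += 0.1 * state.grad           # stand-in for the real solver update
    state.grad[:] = 0.0
    state.loglik = 0.0
    state.num_rows = 0
    return state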

Transition strategies. Algorithm 8 contains the logic for computing the gradient and log-likelihood for each tuple using the forward-backward algorithm. This algorithm is invoked in parallel over many segments, and the results of these invocations are combined using the merge function.

Algorithm 8 transition-lbfgs(state, zm)
Input: state (the transition state), zm (a document), Gradient()
Output: state
1: {state.∇, state.L} ← Gradient(state, zm)
2: state.num_rows ← state.num_rows + 1
return state

Finalization strategy. The finalization function invokes the L-BFGS convex solver to get a new coefficient vector.

Algorithm 9 finalization-lbfgs(state)
Input: state, LBFGS() (convex optimization solver)
Output: state
1: {state.∇, state.L} ← penalty(state.∇, state.L)
2: instance ← LBFGS.init(state)
3: instance.lbfgs() (invoke the L-BFGS solver)
return instance.state

Limited-memory BFGS (L-BFGS), a variation of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, is a leading method for large-scale unconstrained convex optimization. We translate an in-memory Java implementation [68] into a C++ in-database implementation. Before each iteration of L-BFGS optimization, we need to initialize the solver with the current state object. At the end of each iteration, we need to write the updated variables back to the database state for the next iteration.
2.4.3.4 Parallel linear-chain CRF inference

The Viterbi algorithm is used to find the k most likely labelings of a document under a CRF model. We chose to implement an SQL clause to drive the Viterbi inference. The Viterbi inference is implemented sequentially, and each function call finishes labeling one document. However, in Greenplum, Viterbi can be run in parallel over different subsets of the documents on a multi-core machine, so the CRF inference is naturally parallel.
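To make the decoding step concrete, here is a minimal single-document Viterbi sketch in Python/NumPy under simplifying assumptions (log-space scores and a single transition matrix); the GPText module is implemented in the database and driven from SQL, and it produces the top-k labelings rather than only the best one.

import numpy as np

def viterbi(emission, transition):
    """Most likely label sequence for one document under a linear-chain model.

    emission:   (T, L) array of per-token scores in log space
    transition: (L, L) array, transition[i, j] = score of label i -> label j
    """
    T, L = emission.shape
    score = np.zeros((T, L))
    backptr = np.zeros((T, L), dtype=int)
    score[0] = emission[0]
    for t in range(1, T):
        # candidate[i, j] = score of ending at label i previously and moving to j
        cand = score[t - 1][:, None] + transition + emission[t][None, :]
        backptr[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0)
    # follow back-pointers from the best final label
    labels = [int(score[T - 1].argmax())]
    for t in range(T - 1, 0, -1):
        labels.append(int(backptr[t][labels[-1]]))
    return list(reversed(labels))

# Example: 4 tokens, 3 labels, random illustrative scores
emission = np.log(np.random.rand(4, 3))
transition = np.log(np.random.rand(3, 3))
print(viterbi(emission, transition))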

2.4.4 GPText Experiments and Results

In order to evaluate the performance and scalability of linear-chain CRF learning and inference on Greenplum, we conducted experiments on various data sizes on a 32-core machine with a 2TB hard drive and 64GB of memory. We used the CoNLL-2000 dataset, containing 8,936 tagged sentences, for learning. This dataset is labeled with 45 POS tags. To evaluate inference performance, we extracted 1.2 million sentences from the New York Times dataset. In Figure 2-5 and Figure 2-6 we show that the algorithm is sublinear and improves with an increase in the number of segments.

Figure 2-5. Linear-chain CRF training scalability

Figure 2-6. Linear-chain CRF inference scalability

Our POS implementation achieves 0.9715 accuracy, which is consistent with the state of the art [63].
2.4.5 GPText Application

With the support of the Greenplum MPP data processing framework, the efficient Solr indexing/search engine, and parallelized statistical text analysis modules, GPText positions itself as a strong platform for applications that need to apply scalable text analytics of varying sophistication over unstructured data in databases. One such application is e-discovery. E-discovery has become increasingly important in legal processes. Large enterprises keep terabytes of text data such as emails and internal documents in their databases. Traditional civil litigation often involves reviews of large amounts of documents on both the plaintiff's and the defendant's side. E-discovery provides tools to pre-filter documents for review, providing a speedier, less expensive, and more accurate solution.

Figure 2-7. GPText application

Traditionally, simple keyword search is used for such document retrieval tasks in e-discovery. Recently, predictive coding based on more sophisticated statistical text analysis methods has been gaining attention in civil litigation, since it provides higher precision and recall in retrieval. In 2012, judges in several cases approved the use of predictive coding based on the Federal Rules of Civil Procedure [1]. We developed a prototype e-discovery tool in GPText using the Enron dataset. Figure 2-7 is a snapshot of the e-discovery application. In the top pane, the keyword 'illegal' is specified and Solr is used to retrieve all relevant emails that contain the keyword, displayed in the bottom pane. As shown in the middle pane on the right, the tool also supports topic discovery in the email corpus using LDA and clustering using k-means. The k-means step uses LDA and CRF to reduce the dimensionality of the features and speed up the clustering. CRF is also used to extract named entities. The discovered topics are displayed in the Results panel. The tool also provides visualization of aggregate information and faceted search over the email corpus.

2.4.6 GPText Summary

We introduce GPText, a parallel statistical text analysis framework over an MPP database. With its seamless integration with Solr and MADLib, GPText is a framework with a powerful search engine and advanced statistical text analysis capabilities. We implemented a parallel CRF inference and training module for IE. The functionality and scalability provided by GPText position it as a strong platform for sophisticated text analytics applications such as e-discovery.

CHAPTER 3
MAKING ENTITY RESOLUTION QUERY-DRIVEN

In this chapter, I present techniques to make an important problem query-driven. I take a model of entity resolution described by McCallum et al. [67] and describe how to express a query over this model so that it returns only the answers requested by the data scientist. This method reduces the amount of work required to obtain the requested answers.
3.1 Query-Driven Entity Resolution Introduction

Entity resolution (ER) is the process of identifying and linking/grouping different manifestations (e.g., mentions, noun phrases, named entities) of the same real world object. It is a crucial task for many applications including knowledge base construction, information extraction, and question answering. For decades, ER has been studied in both database and natural language processing communities to link database records or to perform entity resolution over extracted mentions (noun phrases) in text. ER is a notoriously difficult and expensive task. Traditionally, entities are resolved using strict pairwise similarity, which usually leads to inconsistencies and low accuracy due to localized, myopic decisions [98]. More recently, collective entity resolution methods have achieved state-of-the-art accuracy because they leverage relational information in the data to determine resolution jointly rather than independently [10]. However, it is expensive to run collective ER based on probabilistic graphical models (GMs), especially for cross-document entity resolution, where ER must be performed over millions of mentions. In many previous approaches, collective ER is performed exhaustively over all the mentions in a data set, returning all entities. Researchers have developed new methods to perform large-scale cross-document entity resolution over parallel frameworks [86, 98]. However, in many ER applications, users are only interested in one or a small subset of entities. This key observation motivates query-driven ER, an alternative approach to solving the scalability problem for ER.

Compared to previous ER models and algorithms, the query-driven techniques in this chapter scale to data sets that are in many cases three orders of magnitude larger. Moreover, the ER model in this chapter is general enough to take both bibliographic records and mentions extracted from unstructured text. Query-driven ER techniques over GMs can also be generalized to perform query-driven inference for other applications. This work follows a line of research on implementing ML models inside of databases [44, 58, 97]. Researchers use factor graphs because this flexible representation works well with other machine learning algorithms. ER is ubiquitous and an important part of many analytic pipelines; a probabilistic database implementation is natural. In this chapter, we first introduce SQL-like queries that involve ER operations. The ER operator is an SQL comparison operator (i.e., ER-based equality) that returns true if two mentions map to the same entity. Factor graphs, a type of GM, are used to model collective entity resolution over mentions extracted from text. Using this ER-based comparison operator, users can pose selection queries to find all mentions that map to a single entity, or pose join queries to find mentions that map to the subset of entities they are interested in resolving. Because exhaustive ER is expensive, it is common to use blocking techniques to partition the data set into approximately similar groups called canopies. Query-driven ER in this chapter differs from blocking in two important ways: 1) deterministic blocks are replaced by a pairwise distance-based metric, and 2) blocks (or canopies) are implicit in the query-driven ER data set and do not have to be created in advance. The latter point, implicit blocking, is realized using a data structure created based on similarity to a query mention. This data structure allows parameters to include or remove mentions from the working data set. This property is similar to the iterative blocking technique [96], which is shown to improve ER accuracy. Such an approach can dramatically amortize the overall ER cost, which is suitable for the pay-as-you-go paradigm in dataspaces [62].

To support ER driven by queries, we develop three sampling algorithms for MCMC inference over graphical models. More specifically, instead of a uniform sampling distribution, we sample from a distribution that is biased toward the query. We develop a query-driven sampling technique that maximizes the resolution of the target query entity (target-fixed) and one that biases the samples based on a pairwise similarity metric between mentions and query nodes (query-proportional). We also introduce a hybrid method that performs query-proportional sampling over a fixed target. We develop two optimizations to the query-proportional and hybrid methods to model the similarity and dissimilarity between the mentions and the query entity, i.e., attract and repel scores. In the first, target-fixed, algorithm, we adapt the samples to resolve the query entity. The second, query-proportional, algorithm selects mentions based on their probabilistic similarity to the query entity. The third, hybrid, algorithm combines the two approaches. A summary of the approaches can be found in Table 3-3. When a user is interested in resolving more than one entity we employ multi-node ER techniques. To implement multi-node ER queries, single-node ER techniques may be naively performed iteratively to resolve one entity at a time. However, such an algorithm can lead to poorly allocated resources if the same number of samples is generated for each target entity, or low throughput if one of the entities has a disproportionately low convergence rate. To alleviate this problem, we present three multi-query ER algorithms that schedule the sample generation among query nodes in order to improve the overall convergence rate. In summary, the contributions of this chapter are the following:

• We define a query-driven ER problem for cross-document, collective ER over text extracted from unstructured data sets;

• We develop three single-node algorithms that perform focused sampling and reduce convergence time by orders-of-magnitude compared to a non-query-driven baseline (Section 3.4). We develop two influence functions that use attract and repel techniques to grow or shrink query entities (Section 3.5.1);

• We develop scheduling algorithms to optimize the overall convergence rate of multi-query ER (Section 3.5.2). The best scheduling algorithm is based on the selectivity of the different target entities (Section 3.5.3).

The results show that query-driven ER algorithms are a promising method for enabling realtime, ad hoc, ER-based queries over large data sets. Single-node queries of different selectivity converge to a high-quality entity within 1-2 minutes over a newswire data set containing 71 million mentions. Experiments also show that such real-time ER query answering allows users to iteratively refine ER queries by adding context to achieve better accuracy (Section 3.6).
3.2 Query-Driven Entity Resolution Preliminaries

In this section we present a foundation for the concepts discussed in this chapter. We start with an introduction to factor graphs, then discuss sampling techniques over this model. Finally, we formally introduce state-of-the-art entity resolution approaches and explain their origins.
3.2.1 Factor Graphs

Graphical models are a formalism for specifying complex probability distributions over many interdependent random variables. Factor graphs are bipartite graphical models that can capture arbitrary relationships between random variables through the use of factors [53]. As depicted in Figure 3-1, links always connect random variables (represented as circles) and factor nodes (represented as black squares). Factors are functions that take as input the current setting of the connected random variables and output a positive real-valued scalar indicating the compatibility of the random variable settings. The probability of a setting of all the random variables is a normalized product of all the factors. Intuitively, the highest-probability settings have variable assignments that yield the highest factor scores. We use factor graphs to represent complex entity resolution relationships. Nodes (random variables) may correspond to mentions of people, places, and organizations in documents. Nodes also represent the random variables that correspond to groups of mentions (entities); these nodes are accompanied by clouds in Figure 3-1.

Figure 3-1. Three node factor graph. Circles (random variables) with mi represent mentions and those with ei represent entities. Clouds are added for visual emphasis of entity clusters

The factors between mentions and entities give us a sound representation for many possible states. The factor graph model also gives us a simple mathematical expression of the relationship.

Formally, a factor graph $G = \langle \mathbf{x}, \psi \rangle$ contains a set of random variables $\mathbf{x} = \{x_i\}_{i=1}^{n}$ and factors $\psi = \{\psi_i\}_{i=1}^{m}$. Each factor $\psi_i$ maps the subset of variables it is associated with to a non-negative compatibility value. The probability of a setting $\omega$ among the set of all possible settings $\Omega$ occurring in the factor graph is given by a probability measure:

$$\pi(\omega) = \frac{1}{Z} \prod_{i=1}^{m} \psi_i(x^i), \qquad Z = \sum_{\omega \in \Omega} \prod_{i=1}^{m} \psi_i(x^i)$$

where $x^i$ is the set of random variables that neighbor the factor $\psi_i(\cdot)$ and $Z$ is the normalizing constant. Querying graphical models produces the most likely setting for the random variables.

A query on a factor graph is defined as a triple $\langle x_q, x_l, x_e \rangle$ where $x_q$ is the set of nodes in question, $x_l$ is a set of latent nodes (entities) that are marginalized, and $x_e$ is a set of evidence nodes (observed mentions). A query task is a sum over all latent variables and a maximization of the query probability. A query over the factor graph is defined as

$$Q(x_q, x_l, x_e, \pi) = \operatorname{argmax}_{x_q} \sum_{v_l \in x_l} \pi(x_q \cup v_l \cup x_e).$$

To obtain the best setting of the nodes in question, inference is required. We refer the reader to our previous work for a detailed discussion of inference over factor graphs and a derivation of the technique [100].
3.2.2 Inference over Factor Graphs

Several methods exist for performing inference over factor graphs. The entity resolution factor graph, being pairwise, is dense and highly connected. This property suggests that the best methods for inference are Markov chain Monte Carlo (MCMC) methods; in particular, we use a Metropolis-Hastings variant [53]. The idea of MCMC-MH is to propose modifications to the current setting and use the model to decide whether to accept or reject the proposed setting as a replacement for the current setting. When the model is scored, only the factors touching nodes with changed values (the Markov blanket) need to be recomputed. We accept or reject changes so that the model can iteratively proceed toward an optimal setting. More formally, consider an MCMC transition function $T: \Omega \times \Omega \rightarrow [0, 1]$ where, given the current setting $\omega$, we can sample a subsequent setting $\omega'$. The probability of accepting a transition given a graphical model distribution $\pi$ is:

$$A(\omega, \omega') = \min\left(1, \frac{\pi(\omega')\, T(\omega, \omega')}{\pi(\omega)\, T(\omega', \omega)}\right). \qquad (3\text{-}1)$$

Additionally, the intractable partition function $Z$ cancels out, making sample generation inexpensive. This property allows us to calculate the probability of accepting the next state by simply computing the difference in score between the next and current state [100].
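As a concrete reading of Equation 3-1, the following small Python fragment computes the acceptance decision for a symmetric proposal, assuming model scores are unnormalized log-probabilities so that only the score difference matters; this is a sketch, not the system's implementation.

import math, random

def accept(current_score, proposed_score):
    """Metropolis-Hastings acceptance for a symmetric proposal.

    Scores are unnormalized log-probabilities, so the partition function Z
    cancels and the acceptance probability depends only on the difference."""
    a = min(1.0, math.exp(proposed_score - current_score))
    return random.random() < a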

We say the algorithm converges when a steady state is reached.¹ Intelligently sampling next states decreases the time to convergence. Convergence in MCMC is difficult to verify [25]; we discuss convergence estimation in Section 3.6.1.
3.2.3 Cross-Document Entity Resolution

Cross-document ER is the problem of clustering mentions that appear across independent sets of documents into groups of mentions that correspond to the same real-world entity. These ER tasks typically assume a set of preprocessed documents and perform linking across documents [7, 86]. The scale of the cross-document ER problem is typically several orders of magnitude larger than that of intra-document ER. There are no document boundaries to limit inference scope, and entity mentions may be distributed arbitrarily across millions of documents.

To model cross-document ER, let $M = \{m_1, \ldots, m_{|M|}\}$ be the set of mentions in a data set. Each mention $m_i$ contains a set of attribute-value data points. Let $E = \{e_1, \ldots, e_{|M|}\}$ represent the set of entities, where each $e_i$ contains zero or more mentions. Note, we assume the maximum number of entities is no more than the number of mentions and no less than 1: each mention may correspond to a unique entity, or all mentions may correspond to a single entity. The baseline method of entity resolution is a straightforward application of the MCMC-MH algorithm. We show pseudocode for the baseline method in Algorithm 10. Algorithm 10 takes as input a set of entities E and samples, which is the number of iterations of the algorithm or a function to estimate convergence. The algorithm samples two entities from the entity set and moves one random² mention into the other entity. After the move, the algorithm checks for an improvement in the overall score of the model.

1 We refer to literature for a more detailed description of convergence [100].

2 Given a set X, the function x ∼u X makes a uniform sample from the set X into a variable x.

Algorithm 10 The baseline entity resolution algorithm using Metropolis-Hastings sampling
Input: A set of unresolved entities E, each with one mention m. A positive integer samples.
Output: A set of resolved entities E.
1: while samples-- > 0 do
2:   ei ∼u E
3:   ej ∼u E
4:   m ∼u ei
5:   E′ ← move(E, m, ej)
6:   if score(E) < score(E′) then
7:     E ← E′
8:   end if
9: end while
return E
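For illustration, a minimal Python sketch of the greedy proposal loop in Algorithm 10 follows; the score function here is a toy stand-in for the factor-graph score, and the actual system operates over factor graphs rather than plain Python lists.

import random

def score(entities):
    # Toy stand-in: a real implementation sums factor weights over the
    # edges inside each entity cluster.
    return sum(len(e) * (len(e) - 1) for e in entities if e)

def baseline_er(mentions, samples):
    """Propose a random mention move; keep it only if the model score improves."""
    entities = [[m] for m in mentions]          # one singleton entity per mention
    for _ in range(samples):
        src, dst = random.sample(range(len(entities)), 2)
        if not entities[src]:
            continue
        m = random.choice(entities[src])
        proposal = [list(e) for e in entities]  # copy the current setting
        proposal[src].remove(m)
        proposal[dst].append(m)
        if score(proposal) > score(entities):   # accept only improving moves
            entities = proposal
    return [e for e in entities if e]

print(baseline_er(["NY Giants", "Yankees", "The Yanks"], samples=100))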

If the model score improves, the changes are kept; otherwise the proposed changes are ignored. The score function sums the weights of all the edges in the given entity to obtain a value for the model. This is equivalent to the probability of the setting $\pi(\cdot)$ as described in Section 3.2.1.

Blocking. Blocking, or canopy generation, is a preprocessing technique to partition large amounts of data into smaller chunks, or blocks, of items that are likely to be matches [64]. Blocking can use simple and fast techniques such as sorting based on attributes, or more advanced techniques that map similar items onto a vector space [28, 84, 96]. In this chapter, we use two methods of blocking. First, we use an approximate string match over all the mentions in the database. To perform the approximate string filter we use a q-grams technique over all the mentions in the database. This method creates an inverted index over the mentions in the database so a query can be performed to look for all words that contain a sufficient number of matching q-grams. This gives us a fast, high-recall filter over many records [42]. The second is an implicit blocking structure created by computing the influence a query node has on the other nodes in the data set (see Section 3.5.1). This method uses an estimate of the distance between the query nodes and the candidate mentions to prioritize samples.
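As a rough illustration of the q-gram filter described above, the following Python sketch builds the inverted index and returns high-recall candidate matches; the value q = 3, the overlap threshold, and the example mentions are illustrative, not the system's actual settings.

from collections import defaultdict

def qgrams(s, q=3):
    s = "#" * (q - 1) + s.lower() + "#" * (q - 1)   # pad so short strings still yield grams
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def build_index(mentions, q=3):
    # Inverted index: q-gram -> ids of mentions containing it
    index = defaultdict(set)
    for mid, text in mentions.items():
        for g in qgrams(text, q):
            index[g].add(mid)
    return index

def candidates(query, index, mentions, q=3, min_overlap=0.4):
    # High-recall filter: keep mentions sharing enough q-grams with the query
    qset = qgrams(query, q)
    counts = defaultdict(int)
    for g in qset:
        for mid in index.get(g, ()):
            counts[mid] += 1
    return [mid for mid, c in counts.items()
            if c / min(len(qset), len(qgrams(mentions[mid], q))) >= min_overlap]

mentions = {1: "New York Yankees", 2: "NY Giants", 3: "Yankees", 4: "Brooklyn Dodgers"}
print(candidates("New York Yankees", build_index(mentions), mentions))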

3.3 Query-Driven Entity Resolution Problem Statement

In this section, we formally define the problem of query-driven ER. We use an SQL-like formalism to model traditional and query-driven entity resolution. In a probabilistic database, let a Mentions table contain all the extracted mentions from a text corpus. Its column entityp represents the probabilistic latent entity labels; they contain a mapping, but that mapping may not represent the current state. The People table holds a watchlist of mentions and relevant contextual information. The context column is an abstract placeholder for text data or richer schemas. This model only assumes there is a master column; the realization of the context column is flexible and implementation dependent.

Mentions(docID, startpos, mention, entityp, context)
People(peopleID, mention, entityp, context)

We also define a user-defined function coref_map that performs maximum a posteriori (MAP) inference on the latent entityp random variables. The function takes two instances of mentions, with at least one being from a probabilistic table such as the Mentions table. When the query is executed, the coref_map function returns true if the referenced mentions are coreferent. In the following, we describe the traditional exhaustive ER task as well as the single- and multi-node query-driven ER queries.

Exhaustive. The goal of traditional entity resolution is to cluster all mentions in a data set. All the mentions clustered inside each entity are coreferent with each other and not coreferent with any mention that is part of a different entity cluster. The process of exhaustive ER can be modeled as a self-join database query where each mention is grouped into coreferent clusters. In Algorithm 11 we create a view displaying the results of a resolved query.

Algorithm 11 Example exhaustive entity resolution query that creates a database view
CREATE VIEW CorefView AS
SELECT m.docID, m.startpos, m.mention, m2.mention
FROM Mentions m, Mentions m2
WHERE coref_map(m.*, m.entityp, m2.mention, m2.context)

To obtain unique entity clusters, we can perform an aggregation query over the CorefView. In Figure 3-3 we see an example of the result of traditional entity resolution.

Single-node Query. In the ER task, we may only be interested in the mentions of one entity. We represent this entity with a template mention, or query node, q. Single-node entity resolution is modeled as a selection query with a where-clause that includes the template mention q and returns only the mentions that are members of the entity cluster containing the template mention. Given a template mention q and its context q.context, in Algorithm 12 we show the single-node query based on an example in Section 3.4.2.

Algorithm 12 Single query-node driven entity resolution query
SELECT m.docID, m.startpos, m.mention
FROM Mentions m
WHERE coref_map(m.*, m.entityp, q, q.context)

Here we add parameters to the coref_map function that contain the specific query and its context. It performs ER over the Mentions table but only returns an affirmative value if the labels for the entity cluster match the query node. For example, if the template mention q were 'Mark Zuckerberg' and the query context were keywords such as 'facebook' and 'ceo', the only returned mentions would be those that represent Mark Zuckerberg the Facebook founder. This is similar to a 'facebook' approximate string search. The emphasis of this chapter is optimizing this function so that while performing ER we perform less work compared to an exhaustive query.

Figure 3-2. A possible initialization for entity resolution

Multi-Query. In many cases, a user may be interested in a watchlist of entities. A watchlist is a subset of the larger mention set. This is common for companies looking for mentions of their products in a data set. In this case, mentions are only clustered with the entities represented in the watchlist. Algorithm 13 is an example of a join query between the Mentions table and the People table.

Algorithm 13 Multi-query between the People watchlist table and the full mention set
SELECT m.docID, m.startpos, m.mention, q
FROM Mentions m, People q
WHERE coref_map(m.*, m.entityp, q, q.context)

This function combines a watchlist of terms and performs ER with respect to the specific examples in the watchlist. The multi-query method uses scheduling to perform inference, or a fuzzy equality check, over each mention. In Section 3.4.3 we propose scheduling algorithms so that multi-query-node ER gracefully manages multi-query workloads.
3.4 Query-Driven Entity Resolution Algorithms

Query-driven ER is an understudied problem; in this section we describe our approach to query-driven ER with one entity (single-query ER) and with multiple entities (multi-query ER). First, we give a graphical intuition of query-driven ER algorithms.

Figure 3-3. The correct entity resolution for all mentions

Figure 3-4. The entity containing q is internally coreferent; the other entities are not correctly resolved

3.4.1 Intuition of Query-Driven ER

In this section, we remind the reader of the query-driven ER task with a formal definition. Each ER task is given a corpus $G$ and a set of entity mentions $M = \{m_1, \ldots, m_{|M|}\}$ extracted from $G$. A user may supply a set of query nodes $Q = \{q_1, \ldots, q_{|Q|}\}$. Each $q_i$, also called a query template, may be a member of $M$ or a manually declared mention that is appended to the set of mentions. For each node $q_i \in Q$, the task of ER is to compute the set of entities $E = \{e_1, \ldots, e_{|Q|}\}$ that only contain mentions that are coreferent with the query node,

$$e_{q_i} = \{m_i \mid m_i \in M,\ \mathrm{QDER}(M, m_i, q_i)\}.$$

In Section 3.4.2, we describe implementations of the QDER algorithm for $|Q| = 1$. In Section 3.4.3, we describe techniques for scheduling the ER task in the general case of $|Q| > 1$. Fundamentally, the ER algorithm generates a graphical model and makes new state proposals (jumps) to reach the best state (see Section 3.2). The query-driven algorithms in this section use a query node to facilitate more sophisticated jumps. By making smart proposals we expect faster convergence to an accurate state. As a note to the reader, a summary of the query-driven algorithms can be found in Table 3-3. Figures 3-2 through 3-4 show an initial configuration and acceptable query-driven entity resolution solutions. An example initial state of the algorithm is shown in Figure 3-2: each mention is initially assigned to a separate entity. Alternatively, the model may be initialized randomly, in an arrangement from a previous entity resolution output, or with all mentions in one entity. Figure 3-3 is the full resolution for the data set; each mention is correctly assigned to its entity cluster. Figure 3-4 is a result that was resolved with query-driven methods and is a partially resolved data set. Because the entity containing the query node is completely resolved, the solution is acceptable.
3.4.2 Single-Node ER

Single-node ER algorithms are the class of algorithms that resolve a single query node, as discussed in Section 3.3. In particular, the target-fixed ER algorithm aims to focus a majority of the proposals on resolving the query entity. The algorithm fixes the query node as the target entity and then randomly selects a source node to merge into the entity of the target query node. This focus on building the query entity in this type of importance sampling means the query entity should be resolved faster than if we sampled each entity uniformly. A query-driven ER algorithm that only selects the query node as the target entity during sampling will create errors because such an algorithm is unable to remove erroneous mentions from the query entity. To prevent these errors, we allow the algorithm

to occasionally back out of poor decisions; that is, it makes non-query-specific samples. Shown in Algorithm 14, target-fixed entity resolution adapts Algorithm 10, but it allows parameters to specify the proportion of time each sampling method is selected. In addition to the input mentions E from Algorithm 10, target-fixed entity resolution takes as input a query node q. The output of the algorithm is a resolved query entity and other partially resolved entities. For each sampling iteration the algorithm can make two decisions. The sampler may propose to merge a random source node that is not already a member of the query entity into the target query entity. Alternatively, the algorithm merges a random node with a random entity.

Algorithm 14 Target-fixed entity resolution algorithm
Input: A query node q. A set of entities E each with one mention m. A positive integer samples.
Output: A set of resolved entities E′.
1: E′ ← E ∪ q
2: while samples-- > 0 do
3:   if random() < τα then
4:     ei ∼u E′
5:     ej ← q.entity
6:     m ∼u ei
7:   else
8:     ej ← {e | ∃e, e ∈ E′, e ≠ q.entity}
9:     ei ← {e | ∃e, e ∈ E′, e ≠ ej}
10:    m ∼u ei
11:  end if
12:  E″ ← move(E′, m, ej)
13:  if score(E′) < score(E″) then
14:    E′ ← E″
15:  end if
16: end while
return E′

On lines 3 to 6 the algorithm takes a uniform sample from the list of entities. If the sampled entity is the same as the query entity it tries again and samples a distinct

entity. A node is drawn from this entity. The probability of this block being entered is τα.

Table 3-1. Mention set M from a corpus
id   Mention            ...
m1   NY Giants          ...
m2   Bronx Bombers      ...
m3   New York Giants    ...
m4   Yankees            ...
m5   Brooklyn Dodgers   ...
m6   The Yanks          ...

Table 3-2. Example query node q
id   Mention            ...
q    New York Yankees   ...

Lines 7 to 10 are entered with probability (1 − τα). This block performs a random entity assignment in the same manner as Algorithm 10. This block offsets the aggressive nature of the target-fixed algorithm by probabilistically backing out of any bad merges. Finally, the block from line 12 to line 15 scores the new arrangement and accepts it if it improves the model score. We discuss parameter settings in Section 3.5.5.

Example. Take the synthetic mention set M shown in Table 3-1 and a query node q, the baseball team 'New York Yankees', in Table 3-2. This is the result of an approximate match of query q over a larger data set (blocking). The mentions of M may be initialized by assigning each mention to its own entity. After a successful run of traditional entity resolution the set of entity clusters is

{⟨q, m2, m4, m6⟩, ⟨m1, m3⟩, ⟨m5⟩}.

For the query-driven scenario the only entity we are interested in is ⟨q, m2, m4, m6⟩. Each mention in this query entity is an alias for the 'New York Yankees' baseball team. The other two entities represent the 'New York Giants' football team and the 'Brooklyn Dodgers' baseball team, respectively. The target-fixed algorithm attempts to merge nodes with the query entity one mention at a time, and a merge is accepted if it improves the score of the overall model.

We can see in the example that merging m1 or m3 into the query entity may improve the overall model score because they share keywords with the query entity, even though they refer to a different (football) team. The target-fixed algorithm can correct this type of error by probabilistically backing out of mistakes, moving mentions from the query entity to a new entity as shown in lines 7 to 10 of Algorithm 14.
3.4.3 Multi-query ER

A user may want to resolve more than one query entity; that is, she may be interested in resolving a watchlist of entities over the data set. To support multiple queries, we first merge the canopies of each query node in the watchlist to obtain a subset of the full graphical model containing only the nodes similar to the query nodes. To resolve the entities we can use query-proportional methods iteratively over each query node. We define two classes of schedules, namely, static and dynamic. Static schedules are formulated before sampling, while dynamic schedules are updated in response to estimated convergence. The two static schedules we develop are random and selectivity-based. In random scheduling each query node from the watchlist is selected in a round-robin style. Selectivity-based scheduling is a method of ordering multi-query samples to schedule proposals in proportion to the selectivity of the query node. Selectivity, in this case, is defined as the number of mentions retrieved using an approximate match over the data set, or the query node's contribution to the total new graphical model. For example, the selectivity of our query node q in Table 3-2 is simply the size of M, shown in Table 3-1. The random scheduling method performs well if all query nodes have similar selectivity. Otherwise, if the selectivities of the query nodes vary, one query node may require more sampling than the others. If one query node needs many samples to converge, it may take the whole process a long time to complete, and cycles may be wasted on other nodes that have already converged. In addition to scheduling samples in proportion to their selectivity, we can schedule samples dynamically, depending on the progress of each query entity. To perform dynamic

scheduling we need to know how each query entity is progressing towards convergence. To estimate the running convergence we do not use standard techniques from the literature because scheduling needs to occur before the model is close to convergence [25]. Instead, we estimate the convergence by measuring the fraction of accepted samples over the last N samples of each query in the watchlist. The two dynamic scheduling algorithms are closest-first and farthest-first. In closest-first we queue up the query node that has the lowest positive average number of accepted proposals over the last N proposals. This scheduling method performs inference for the node that is closest to being resolved so it can move on to other nodes. Alternatively, the farthest-first algorithm schedules the node that has the highest convergence rate. This scheduling algorithm makes each query entity progress evenly.
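A minimal Python sketch of the four schedules follows, assuming the acceptance rate over a recent window as the convergence estimate; the window size, the uniform draw used for the 'random' schedule, and the class and function names are illustrative rather than the system's actual implementation.

import random
from collections import deque

class QueryProgress:
    """Tracks accepted proposals over the last N samples of one query node."""
    def __init__(self, selectivity, window=100):
        self.selectivity = selectivity
        self.recent = deque(maxlen=window)

    def record(self, accepted):
        self.recent.append(1 if accepted else 0)

    def acceptance_rate(self):
        return sum(self.recent) / len(self.recent) if self.recent else 1.0

def next_query(progress, policy):
    """Pick the query node to sample next under one of the four schedules."""
    nodes = list(progress)
    if policy == "random":                  # uniform draw (the thesis uses round-robin)
        return random.choice(nodes)
    if policy == "selectivity":             # proportional to estimated selectivity
        weights = [progress[n].selectivity for n in nodes]
        return random.choices(nodes, weights=weights, k=1)[0]
    if policy == "closest-first":           # lowest acceptance rate: closest to resolved
        return min(nodes, key=lambda n: progress[n].acceptance_rate())
    if policy == "farthest-first":          # highest acceptance rate: still far from resolved
        return max(nodes, key=lambda n: progress[n].acceptance_rate())
    raise ValueError(policy)

progress = {"q1": QueryProgress(130), "q2": QueryProgress(11), "q3": QueryProgress(46)}
print(next_query(progress, "selectivity"))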

3.5 Optimization of Query-Driven ER

The previous ER techniques aggressively attempt to resolve the query entity. However, if the query node is not representative of the queried entity, target-fixed ER can lead to undesirable results. We do not explore this trade-off; we assume users can select representative query nodes. In this section, we introduce optimizations that create approximate query-driven samples based on the query node. We first discuss the influence function that is used to make query-driven proposals. We then discuss the attract and repel versions of the influence function, followed by two new algorithms. We end with implementation details and a summary of our query-driven algorithms.
3.5.1 Influence Function: Attract and Repel

To retrieve nodes from a graphical model that are similar to a query node we employ the notion of influence. Our assumption is that nodes that are similar have a high probability of being coreferent. An influence trail score between two nodes in a graphical model can be computed as the product of factors along their active trail, as defined in the literature [100]. For a node mi ∈ M and the query node q ∈ M, the influence of mi on the

query node is defined as:

$$I(m_i, q) = \sum_{j \in F} w_j\, \psi_j(m_i, q)$$

where $F$ is the set of pairwise features and the feature weight and log-linear function are, respectively, $w_j$ and $\psi_j$. The influence function I is an implementation of this trail score. The influence function takes a set of entities (or the equivalent GM) and a query node q as parameters. The parameters to an influence function can range over the whole database or over a canopy. Over several invocations of the function, I returns mentions from the graphical model with a frequency proportional to their influence on q. If a mention has little or no influence, the influence function acts as a blocking function, infrequently returning the mention. Recall that influence is the active-trail distance to the query node. To implement the influence function we build a data structure based on an algorithm by Vose [93], hereafter referred to as a Vose structure.
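For illustration, a minimal Python sketch of Vose's alias construction and its constant-time sampling follows; the influence scores in the example are made up, and the system builds the structure over a whole canopy rather than four mentions.

import random

class VoseAlias:
    """Vose's alias method: O(n) construction, O(1) sampling from a discrete distribution."""
    def __init__(self, weights):
        n = len(weights)
        total = float(sum(weights))
        scaled = [w * n / total for w in weights]
        self.prob = [0.0] * n
        self.alias = [0] * n
        small = [i for i, p in enumerate(scaled) if p < 1.0]
        large = [i for i, p in enumerate(scaled) if p >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            self.prob[s] = scaled[s]
            self.alias[s] = l
            scaled[l] = (scaled[l] + scaled[s]) - 1.0
            (small if scaled[l] < 1.0 else large).append(l)
        for i in small + large:          # leftovers are numerically ~1
            self.prob[i] = 1.0

    def sample(self):
        i = random.randrange(len(self.prob))
        return i if random.random() < self.prob[i] else self.alias[i]

# Illustrative influence scores of four candidate mentions w.r.t. a query node
influence = [0.7, 0.1, 0.15, 0.05]
table = VoseAlias(influence)
draws = [table.sample() for _ in range(10000)]
print([round(draws.count(i) / len(draws), 3) for i in range(len(influence))])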

The input mentions to the blocking algorithms may result in high- or low-quality canopies. A high-quality canopy means most of the mentions in the canopy are associated with the query node. Low-quality canopies, which are more common, correspond to only a small number of mentions being associated with the query node. When initializing query-driven algorithms the canopy quality is important for determining which algorithm to use. The attract method initializes each mention in the canopy in its own entity, and then mentions are merged until convergence. The target-fixed algorithm discussed in Section 3.4.2 is explained using this method. The attract method works well for low-quality canopies, or canopies that require a small number of items to merge. Conversely, the repel method works well with high-quality canopies, or when most items in a canopy belong to the query entity. The repel method initializes all mentions in the canopy into a single entity. Then proposals are made to remove mentions from the entity so that we are left with only the nodes in the query entity. We discuss this method with the hybrid algorithm in Section 3.5.3. To build an influence function for the repel method we can use the same construction; we only need to normalize and invert the influence scores. We refer to this as co-influence, or Ī.
3.5.2 Query-proportional ER

In the query-proportional sampling algorithm, on every iteration the source mention and target entity are selected in proportion to their distance to the query entity. Instead of focusing solely on the query entity, this algorithm prioritizes samples using a measure that represents the probability of a mention being coreferent with the query entity. That is, each node in the graphical model G is selected in proportion to the influence along the active trail between itself and the query node q. This algorithm merges nodes that are similar to the query node with an increased frequency. Before query-proportional sampling, a data structure for I is created. The I influence structure takes a query node q and the global graphical model E, then returns a sampled mention. As I is called multiple times, the distribution of the nodes returned is proportional to their influence. Algorithm 15 describes the query-proportional algorithm.

Algorithm 15 Query-proportional algorithm
Input: A query node q to drive computation. A set of entities E each with one mention m. A positive integer samples. A function I that samples nodes from the entities according to their influence on a mention.
Output: A set of resolved entities E′.
1: E′ ← E ∪ q
2: while samples-- > 0 do
3:   m1 ← I(E′, q)
4:   m2 ← I(E′, q)
5:   E″ ← move(E′, m1, m2.entity)
6:   if score(E′) < score(E″) then
7:     E′ ← E″
8:   end if
9: end while
return E′

For each iteration, the algorithm selects mentions using the influence function (lines 3 and 4). Then one mention, m1, is moved into the entity of m2. Mentions m1 and m2 have a higher probability of being coreferent, and therefore a higher probability of a merge occurring in the query entity, compared to the random selections of Algorithm 10. As a corollary, the influence-based sampling creates many small intermediate entities that are similar to the query entity; some of the mentions in these intermediate entities will later move to the query entity. This is a big advantage when performing entity-to-entity merges (as opposed to mention-to-entity merges). In this chapter, we do not investigate this extension to the algorithm.
3.5.3 Hybrid ER

The best of both the target-fixed and query-proportional algorithms can be combined to create a hybrid algorithm. Like the target-fixed algorithm, the hybrid method aggressively fixes the target as the query entity. The hybrid method also chooses its source node using the influence function in the same manner as the query-proportional algorithm. Algorithm 16 shows the hybrid algorithm using the repel method. With probability τα the algorithm chooses a mention using the repel method (Ī) and moves it to an entity that is not the query node. This is the opposite of merging a node into the query entity. Pseudocode is listed on lines 3 to 5.
3.5.4 Implementation Details

The previous algorithms describe single-process sampling over the set of mentions. The multi-query methods are modeled as several interwoven sequential single-node ER processes. In this section, we describe our implementation of the hybrid algorithm over a parallel database management system.

Algorithm 16 Hybrid-Repel algorithm
Input: A set of entities E, where one contains all the mentions m and the others are empty. A positive integer samples. A query node q. A function Ī that samples nodes from the entities according to their co-influence on a mention.
Output: A set of resolved entities E′.
1: E′ ← E ∪ q
2: while samples-- > 0 do
3:   if random() < τα then
4:     m ← Ī(E′, q)
5:     ei ← {e | ∃e, e ∈ E′, e ≠ q.entity}
6:   else
7:     ei ∼u E′
8:     ej ← {e | ∃e, e ∈ E′, e ≠ ei}
9:     m ∼u ej
10:  end if
11:  E″ ← move(E′, m, ei)
12:  if score(E′) < score(E″) then
13:    E′ ← E″
14:  end if
15: end while
return E′

An independent Vose structure (I, § 3.5.1) is created for each query node in the query set. The creation of the Vose structures for the query nodes is parallelized. When the number of query nodes increases, the Vose structures demand more memory from the system. Each Vose structure contains arrays of type double precision and unsigned int. The space for the structures is O(|Q| · |M|), where |Q| is the number of query nodes in the query and |M| is the number of mentions in the corpus. The Vose structure is accessed on every sample and needs to be in memory. To increase scalability, one could store full sets of precomputed samples and serialize the Vose structures to disk, but that is not explored here [48]. Sampling over the query nodes for each algorithm can also be performed in parallel. In our method, a thread selects a query node using a random schedule as described in Section 3.4.3. The system will use the Vose structure associated with the query node to set up a proposal move. The system attempts to obtain locks for both entities involved

in the proposal. If the system is unable to obtain a lock on either of the two entities, the system will back out and resample new entities. When the number of query nodes is small, the query-driven algorithms experience a lot of contention at the entities containing the query nodes. In these circumstances, the system will back out and either restart the proposal process or attempt a baseline proposal. This avoids waiting for locked entities and keeps the sampling process active. In Section 3.6.6 we demonstrate the parallel hybrid method over a large data set.
3.5.5 Algorithms Summary Discussion

Algorithms 14, 15 and 16 are modifications of proposal jumps found in the baseline Algorithm 10. Table 3-3 describes the proposal process for each algorithm by its preferred jump method.

Table 3-3. Summary of algorithms and their most common methods for proposal jumps
Algorithm            Source          Target
Baseline             random          random
Target-Fixed         random          fixed
Query-Proportional   proportional    proportional
Hybrid               proportional    fixed

The target-fixed algorithm builds the query entity by aggressively proposing random samples to merge into the query entity. The query-proportional algorithm uses an influence function to ensure its samples are mostly related to the query node. The hybrid algorithm mixes the aggressiveness of target-fixed with the intelligent source-node selection of the query-proportional method. After choosing the correct algorithm, a user needs a well-trained model with several features. An advantage of using query-proportional techniques, because so little sampling is required, is that we can interactively test query accuracy. We can also add context or keywords that were discovered from a previous run of the algorithm. This interactive querying workflow helps improve accuracy, which we experimentally verify in Section 3.6.

Parameter settings. The algorithm takes several parameters that affect performance. While not studied in this chapter, the parameter settings are robust to change, making parameter selection simple. The first is the number of proposals (samples). This number can be a function of the size of the data set. Each query node should have the opportunity to be merged into an entity more than once.

The value τα lies in [0.0, 1.0] and represents how often to perform the main type of sampling. This value should be set to a high value, 0.9, for accept algorithms. With probability 1 − τα the algorithms back off to random samples to improve mixing. This value is lowered to counter some of the aggression, particularly in Algorithm 14. The parallel experiments use τα = 1 and back out when there is contention among the threads. In statistics, a negative binomial function is used to model the number of trials it takes for an event to be a success. We can also use a negative binomial function as a decay function for the output of the influence function. We use this function because we want values that are most similar (lowest score) to be sampled more often. We set the r value, or number of failures for the negative binomial function, to 1. We set the p value, or the probability of each success, to a value close to 0.05. In the multi-query ER algorithms we run inference for K steps before we look to change the query entity. In our experiments we choose a K of 500, and an increasing value from two to 100 thousand in the parallel experiments.
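As a small worked example of the decay described above (a negative binomial with r = 1 is a geometric distribution, here with p = 0.05), the sketch below computes the relative weight assigned to candidates by similarity rank; the number of candidates and the normalization are only for illustration.

def decay_weights(num_candidates, p=0.05):
    """Negative binomial decay with r = 1 (geometric): the candidate at rank k
    gets weight p * (1 - p)**k, so the most similar candidates (rank 0, 1, ...)
    are proposed most often."""
    return [p * (1 - p) ** k for k in range(num_candidates)]

weights = decay_weights(10)
total = sum(weights)
print([round(w / total, 3) for w in weights])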

3.6 Query-Driven Entity Resolution Experiments

In this section, we describe the implementation details, the data sets and our experimental setup. Next, we discuss our hypotheses and four corresponding experiments. We then finish with a discussion of the results.

Implementation. We developed the algorithms described in Section 3.4 in Scala 2.9.1 using the Factorie package. Factorie is a toolkit for building imperatively defined factor graphs [65]. This framework allows a templated definition of the factor graph to avoid fully materializing the structure. The training algorithms are also developed

using Factorie. The algorithms for canopy building and approximate string matching are developed inside PostgreSQL 9.1 and Greenplum 4.1 using SQL, PL/pgSQL and PL/Python. Inference is performed in-memory on an Intel Core i7 processor with 8 cores at 3.2GHz and 12GB of RAM. The approximate string matching on Greenplum is performed on an AMD Opteron 6272 32-core machine with 64GB of RAM. The parallel experiments were developed entirely in a parallel database, DataPath [4]. DataPath is installed on a 48-core machine with 256GB of RAM.

Data sets. The experiments use three data sets. The first is the set of English newswire articles from the Gigaword Corpus, which we refer to as the NYT Corpus [39]. The second is a smaller but fully labeled Rexa data set.³ Because it is fully labeled it allows us to run the more detailed micro benchmarks. The NYT Corpus contains 1,655,279 articles and 29,866,129 paragraphs from the years 1994 to 2006. We extracted a total of 71,433,375 mentions using the Natural Language Toolkit named entity extraction parser [12]. Additionally, we compute general statistics about the corpus, including the term and document frequencies and tf-idf scores for all terms. We manually labeled mentions for each query over the NYT data set. The second data set, Rexa, is citation data from a publication search engine named Rexa. This data set contains 2454 citations and 9399 authors, of which 1972 are labeled. We perform experiments on the Rexa corpus because it is fully labeled, unlike the NYT Corpus. The Rexa corpus is smaller in total size but it has average-sized canopies. The third data set is the Wikilinks Corpus [87], the largest labeled corpus for entity resolution that we could find at the time of development. It contains 40 million mentions and 3 million entities that were extracted from the web and truthed based on web anchor links to Wikipedia pages. We loaded a million mentions onto DataPath to demonstrate the parallel capabilities.

3 http://cs.neiu.edu/~culotta/data/rexa.html

3.6.1 Experiment Setup

Table 3-4 lists the features and the weights for each feature.

Features. Features that look for similarity between mention nodes are called affinity features, and they are given positive weights. Features that look for dissimilarity between mention nodes are called repulsion factors, and they are given negative weights. We implement three classes of features: pairwise token features, pairwise context features and entity-wide features. Pairwise token features directly compare token strings on attributes such as equality or matching substrings. Context features compare the information surrounding the mention. We can look at the surrounding sentence, paragraph, document or user-specified keywords. The query nodes are extracted from text and contain a proper document context. With this context, we use a tf-idf weighted cosine similarity score to compare the context of each mention token. Finally, entity-wide features use all mentions inside an entity cluster to make a decision. An example entity-wide feature counts the matching mention strings between two entities.
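A minimal sketch of the tf-idf weighted cosine similarity used by the context features is shown below; the document frequencies and contexts are illustrative values, not statistics from the NYT Corpus.

import math
from collections import Counter

def tfidf_vector(tokens, df, num_docs):
    """tf-idf weights for one context (bag of tokens); df maps token -> document frequency."""
    tf = Counter(tokens)
    return {t: c * math.log(num_docs / (1 + df.get(t, 0))) for t, c in tf.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Illustrative document frequencies and contexts
df, num_docs = {"facebook": 120, "ceo": 400, "baseball": 300}, 100000
ctx_mention = "zuckerberg facebook ceo announced".split()
ctx_query = "facebook founder ceo".split()
print(cosine(tfidf_vector(ctx_mention, df, num_docs),
             tfidf_vector(ctx_query, df, num_docs)))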

Models. Features on the NYT and Wikilinks data sets were manually tuned, and the features for the Rexa data set were trained using SampleRank [101] with confidence-weighted updates. We manually tune some of the weights for the NYT Corpus to make up for the lack of complete training data. The models can be graphically represented as in Figure 3-4.

Evaluation metrics. Convergence of MCMC algorithms is difficult to measure, as described in a review by Cowles and Carlin [25]. We estimate convergence progress by calculating the f1 score of the query node's entity (f1q). We create this new measure because we are primarily concerned with the query entity. Other measures include B³ for entity resolution and several others for general MCMC models [7, 25].

The query-specific f1 score is the harmonic mean of the query-specific recall Rq and query-specific precision Pq. To accurately determine the Pq and Rq of each query in this experiment we label each correct query node. Query-specific precision is defined as

Table 3-4. Features used on the NYT Corpus. The first set of features are token-specific features, the middle set are between pairs of mentions, and the bottom set are entity-wide features.
Feature name                      Score+   Score-   Feature type
Equal mention strings             +20      -15      Token-specific
Equal first character             +5                Token-specific
Equal second character            +3                Token-specific
Equal second character            0
Unequal mention strings                    -15      Token-specific
Unequal first character           0
Unequal second character          0
Unequal second character          0
Equal substrings                  +30      -150     Token-specific
Unequal substrings                         -150     Token-specific
Equal string lengths              +10               Token-specific
Matching first term               +90      -3       Token-specific
No matching first term                     -3       Token-specific
Similarity ≥ 0.99                 +120              Pairwise
Similarity ≥ 0.90                 +105              Pairwise
Similarity ≥ 0.80                 +80               Pairwise
Similarity ≥ 0.70                 +55               Pairwise
Similarity ≥ 0.60                 +35               Pairwise
Similarity ≥ 0.50                 +15               Pairwise
Similarity ≥ 0.40                          -5       Pairwise
Similarity ≥ 0.30                          -50      Pairwise
Similarity ≥ 0.20                          -80      Pairwise
Similarity < 0.20                          -100     Pairwise
Matching terms                    +20               Pairwise
Token in context                  +1                Pairwise
No matching keyword               +700     -10      Pairwise
Matching keyword                  +700              Pairwise
Keyword in token                  +70               Pairwise
Extra token                                -500     Pairwise
Matching token in context         +10               Pairwise
Similar neighbor                  +100     -5       Entity-wide
No similar neighbor in entity              -5       Entity-wide
Matching document                 +350     -15      Entity-wide
No matching documents in entity            -15      Entity-wide

$P_q = \frac{|\{\mathrm{relevant}(M)\} \cap \{\mathrm{retrieved}(M)\}|}{|\{\mathrm{retrieved}(M)\}|}$ and query-specific recall $R_q = \frac{|\{\mathrm{relevant}(M)\} \cap \{\mathrm{retrieved}(M)\}|}{|\{\mathrm{relevant}(M)\}|}$. The f1 score for the query node's entity q is defined as:

$$f1_q = 2\,\frac{R_q P_q}{R_q + P_q}.$$
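For concreteness, the query-specific f1 can be computed as in the following small Python function; the mention ids in the example are illustrative.

def f1q(relevant, retrieved):
    """Query-specific precision, recall, and f1 for one query entity."""
    overlap = len(relevant & retrieved)
    p = overlap / len(retrieved) if retrieved else 0.0
    r = overlap / len(relevant) if relevant else 0.0
    return 2 * r * p / (r + p) if (r + p) else 0.0

# Ground-truth mentions of the query entity vs. the resolved cluster
print(f1q(relevant={"m2", "m4", "m6"}, retrieved={"m2", "m4", "m5"}))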

The f1q score is a good indicator of entity and answer quality. For multi-query experiments we calculate the average of the f1q scores over the query nodes. Each non-parallel algorithm's result is averaged over 3 to 10 runs.
3.6.2 Realtime Query-Driven ER Over NYT

In this experiment we show that query-driven entity resolution techniques allow us to obtain near realtime4 results on large data sets such as the NYT corpus.

Figure 3-5 shows the f1q score of the hybrid ER algorithm on three single-query ER queries. The graph shows performance over the first 50 proposals. For example, the 'Zuckerberg' query could be expressed as shown in Algorithm 17.

Algorithm 17 Example ER query over the entity 'zuckerberg'
SELECT *
FROM Mentions m
WHERE coref_map(m.*, m.entityp, 'zuckerberg', context)

Recall, a canopy is first generated using an approximate match over the mention set. We use the repel inference function and all the mentions are initialized in one large entity.

The 'Richard Hatch' and 'Carnegie Mellon' queries start at f1q scores of .92 and .97, respectively. The 'Zuckerberg' query starts above .65 and improves to an f1q score over .8. These experiments show the repel method removing mismatches from the query entity. The co-influence function is used to quickly identify the mentions that do not belong in the entity, and they are proposed for removal. When a hybrid move is proposed, a mention is moved from the large entity group to a new, possibly empty, entity.

4 We define realtime as contributing little or no time loss when this process is part of an external execution pipeline such as an information extraction pipeline.

Figure 3-5. Hybrid-repel performance for the first 50 samples for three queries. Each result is averaged over 6 runs

Table 3-5. The performance of the hybrid-repel ER algorithm for queries over the NYT corpus for the first 50 samples. Total time includes the time to build the Ī data structure and result output. The NYT Corpus contains over 71 million mentions, a large amount for entity resolution problems.
Query            Blocking   Mentions   Inference   Total time
Zuckerberg       24.4 s     103        2 s         37 s
Richard Hatch    28.3 s     226        18.5 s      59 s
Carnegie Mellon  25.9 s     1302       68 s        124 s

This method relies on good repulsion features and correct weights. In Table 3-5 we show the performance of the three queries. In addition to the query token we show four columns: blocking time in seconds, canopy size, inference time in seconds and the total compute time. Total time is the complete time taken by each run; this includes building the influence data structure and writing the results. The values in Table 3-5 show the fast performance of query-driven ER over a large database of mentions.
3.6.3 Single-query ER

In this experiment we show a performance comparison between the single-query algorithms summarized in Sections 3.4 and 3.5. We run the query-driven algorithms over queries with different selectivity levels and show the accuracy over time. Each algorithm uses the attract method, so each mention in the canopy starts in its own entity.

Figure 3-6. A comparison of single-query algorithms on a query with selectivity of 11

Figure 3-6 shows the run time of all four algorithms on the Rexa data set with the query 'Nemo Semret', an author with a selectivity of 11. The baseline entity resolution algorithm does not accept a correct proposal until about 500 seconds. The baseline algorithm takes a long time to accept the first proposal because it is randomly trying to insert mentions into an existing entity. Target-fixed immediately begins to make correct proposals. Hybrid and query-proportional have the best performance and resolve the entity almost instantaneously. The hybrid algorithm chooses the most likely nodes to merge into the query entity. As the first couple of proposals are correct merges, hybrid quickly converges to a high accuracy. Due to imperfect features, among the 10 averaged runs a few runs get stuck at a local optimum, causing suboptimal results. Figure 3-7 shows the run time of the four algorithms for the query node 'A. A. Lazar', with selectivity of 46. The baseline algorithm progresses the slowest. The hybrid algorithm quickly reaches a perfect f1q score. The query-proportional algorithm lags slightly behind the hybrid method but still reaches a perfect value. The target-fixed algorithm gradually increases to a perfect f1q score about 60 seconds after hybrid and query-proportional.

Figure 3-7. A comparison of single-query algorithms with a query node of selectivity 46

Figure 3-8. A comparison of selection-driven algorithms with a query node of selectivity 130

Figure 3-8 shows the run time of four algorithms with a query ‘Michael Jordan’ of selectivity 130. The baseline slowly increases over the first 100 seconds. The hybrid algorithm again quickly achieves a perfect f1q score, followed by query-proportional and then target-fixed. The time gap between the algorithms increases with selectivity; hybrid achieves the best performance. We look deeper at how selectivity affects the rate of convergence. In Figure 3-9 we show the time it takes for each algorithm to reach an f1q score of 0.95 over increasing

selectivity. We choose five query nodes of increasing selectivity but with the same canopy sizes.

Figure 3-9. The time until an f1q score of 0.95 for five queries of increasing selectivities; averaged over three runs

The hybrid algorithm runtime increased with selectivity, but only slightly steeper than constant. Target-fixed increased for the first three queries but never took more than 50 seconds. Query-proportional has only a slight increase in time to convergence for the first three queries. The two highest-selectivity queries are expensive for query-proportional and we observe an exponential increase in runtime. These results are consistent with the exponentially large increase in the number of random comparisons needed to find a match for a query entity. The query-proportional algorithm does not focus on the query entity as aggressively as the target-fixed and hybrid algorithms. Recall that the target-fixed and hybrid algorithms focus on moving correct nodes into the query entity. Query-proportional selects candidate nodes using the influence function but does not fix the target entity. With the target entity not fixed, the chance of selecting a correct node for the query entity decreases exponentially. This shows that the selectivity of nodes affects the runtime performance of each algorithm. When performing join-driven ER it is important to take the relative selectivity of nodes into account when choosing the best scheduling algorithm.

Figure 3-10. The progress of the hybrid algorithm across multiple query nodes using different scheduling algorithms. Each result is averaged over three runs

3.6.4 Multi-query ER

In this experiment we study the performance of our different scheduling algorithms for join-driven ER queries. We choose ten query nodes of different selectivity and run the join-query scheduling algorithms described in Section 3.4.3. Consider a table like the People table in Section 3.3 with selectivities {130, 63, 68, 7, 12, 12, 301, 11, 46}. The four algorithms, random, closest-first, farthest-first, and selectivity-based, are shown in Figure 3-10. The selectivity-based method outperforms the other three algorithms in terms of convergence rate. The jumps in accuracy on the graph correspond to the scheduling algorithms choosing new query nodes and accepting new proposals. It has a high jump when it starts sampling the seventh and highest-selectivity node. The farthest-first algorithm rises the slowest out of the scheduling algorithms because it tries to stop sampling the high performing query entity and makes proposals for the slowest growing one. The selectivity-based method performs well early because the high selectivity queries are sampled first. The high selectivity query makes up a large proportion of the total f1q score. The large jump in the random method is when it reaches the node with selectivity

301. Notice that closest-first reaches its peak f1q score the fastest because it tries to get the most out of every query node.
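As a rough illustration of how the four policies differ, the sketch below orders query nodes for the next round of sampling; estimate_selectivity and progress are hypothetical placeholders for the statistics the scheduler would track, and this is only the shape of the idea, not the implementation from Section 3.4.3.

import random

def schedule_queries(query_nodes, policy, estimate_selectivity, progress):
    """Return the order in which query nodes receive sampling effort.

    policy: 'random', 'closest-first', 'farthest-first', or 'selectivity-based'.
    estimate_selectivity(node) and progress(node) are hypothetical placeholders
    for the per-node statistics described in the text.
    """
    nodes = list(query_nodes)
    if policy == 'random':
        random.shuffle(nodes)
    elif policy == 'selectivity-based':
        # Highest selectivity first: these nodes dominate the aggregate f1q score.
        nodes.sort(key=estimate_selectivity, reverse=True)
    elif policy == 'closest-first':
        # Spend effort on the node closest to convergence.
        nodes.sort(key=progress, reverse=True)
    elif policy == 'farthest-first':
        # Spend effort on the slowest-growing node.
        nodes.sort(key=progress)
    return nodes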

Figure 3-11. The performance of the Zuckerberg query with different levels of context. Each result is averaged over 6 runs

3.6.5 Context Levels

In this experiment we aim to discover how different levels of context specified at query time can improve convergence time and overall accuracy. We take the Zuckerberg query and the hybrid-repel algorithm and run ER three times over three levels of context. Each mention in the graph contains a ‘paragraph’ level of context and we only alter the context of the query node. The ‘none’ context only activates token-specific features; any context features involving the query node are zeroed out. The ‘paragraph’ level context is the default context from the NYT corpus and the ‘document’ level context extends context to the entire news article. Additionally, we add specific keywords from Mark Zuckerberg’s DBpedia page to the ‘document’ and ‘paragraph’ context levels. We show the performance using the repel method in Figure 3-11. Adding specific keywords that activate the keyword features is the most effective method for increasing the accuracy of query-driven ER. Query-driven methods allow a user to observe the results and add or remove keywords for specific queries to improve the accuracy. This type of iterative improvement workflow is not feasible with batch methods.

Figure 3-12. Hybrid-attract algorithm with random queries run over the Wikilinks corpus. Each plot starts after the Vose structures are constructed

3.6.6 Parallel Hybrid ER

This experiment has two objectives: first, how does the hybrid algorithm perform with a canopy of 1 million mentions, and second, what is the effect of increasing the number of query nodes. In Figure 3-12 the hybrid algorithm is able to resolve entities in a short amount of time. The creation time of the Vose structure is roughly linear in the number of queries. The trend in the graph is that as the ratio of queries to entities increases, the performance benefit of the hybrid-attract method decreases. With more query nodes the construction time increases, and the benefits of the algorithm decrease until it is no better than the baseline method.

Experiment Summary. Each of the query-driven methods outperforms the baseline methods in terms of runtime while not losing accuracy. Across different data set sizes the hybrid algorithms have the most consistent performance. If a system has a quality blocking function then it is better to use the co-influence entity resolution method. With multiple query nodes, selectivity-based is the most consistently performing algorithm. More accurate estimation of MCMC convergence could allow the dynamic scheduling algorithms closest-first and farthest-first to achieve higher accuracy. Adding more contextual information to query nodes at query time yields higher

accuracy of the entity resolution algorithms. Parallel query-driven sampling is an effective way to get a speedup in an ER data set when the ratio of mentions to entities is low.

3.7 Query-Driven Entity Resolution Related Work

This chapter is related to work in several areas. In this section we describe a selection of the literature that we found most relevant to different parts of the Query-Driven ER task.

Entity Resolution. The state-of-the-art method for entity resolution employs collective classification. Instead of purely pairwise decisions, collective classification methods consider group relationships when making clustering determinations. In a recent tutorial [37], collective classification methods were grouped into three categories: non-probabilistic [10, 29, 49], probabilistic [18, 36, 59, 67, 74, 89] and hybrid approaches [3, 83]. A relevant challenge proposed for entity resolution research by the tutorial is how to efficiently perform entity resolution when a query is involved. This chapter seeks to address this issue. Entity resolution is generally an expensive, offline batch process. Bhattacharya and Getoor proposed a method for query-time entity resolution [11]. This method performs inference by starting with a query node and performing ‘expand and resolve’ to resolve entities through resolution of attributes and expansion of hyper-edges. Unfortunately, hyper-edges between records are not always explicit in data sets. This chapter does not assume the presence of any link in the corpus; each entity and mention is independently defined, which is the case for most applications. A recent paper by Altwaijry, Kalashnikov and Mehrotra [2] has a similar motivation of using SQL queries to drive entity resolution. That work focuses on using predicates in the query to drive computation while this work uses example queries to drive computation. Both techniques are complementary, and combining the two by updating the edge-picking policy described in their paper using our approach makes for an interesting method of optimizing the entity resolution process.

The term query-driven appears in this chapter and has appeared in others across the literature with different meanings [41]. Our definition of a query node is an example item, a mention, from a data set. A query in Altwaijry et al. [2] refers to the predicates in an SQL statement. Query-driven in Grant et al. [41] refers to the SQL queries used to drive analytics. It is becoming increasingly normal to work with data sets of extremely large size; in response, researchers have studied streaming and distributed processing. Rao, McNamee, and Dredze describe an approach for streaming entity resolution [78]. This approach is fast and approximates entries in an LRU queue of clustered entity chains. We apply these techniques to a static data set and do not yet handle streams of data. Singh, Subramanya, Pereira, and McCallum propose a technique for ER where entities are resolved in parallel blocks and then redistributed and resolved again in new blocks [86]. This parallel distribution method makes large-scale entity resolution tractable. In this chapter, we perform analysis on a similarly scaled data set but we show that great performance gains can be achieved when a query is specified.

Query specific sampling. Recently, several researchers have explored the idea of focusing the sampling of graphical models to speed up inference. Below we discuss three approaches that use sampling to speed up ER over graphical models. Query-Aware MCMC [100] found that when performing a query over a graphical model, the cost of not sampling a node is exactly the node's influence on the query node. This enables us to ignore some nodes that have low influence over the query node and incur a small amount of error. This influence score can be calculated as the mutual information between two nodes. The authors compare estimation techniques for the intractable mutual information score, which is called the influence trail score. Because ER has a fixed pairwise model, we can use the theory from this work and specialized data structures to gain performance when performing query-driven sampling. Type-based MCMC is a method of sampling groups of nodes with the same attribute to increase the progress towards convergence [61]. This approach works well

when feature sets can be tractably counted and grouped. If query nodes are introduced, it is not clear how one may focus type-based sampling. Other researchers have explored using belief propagation with queries to approximate marginals of factor graphs [20]. However, the entity resolution graph is cyclic and highly connected. MCMC scales with large real-world models better than loopy belief propagation [100].

3.8 Query-Driven Entity Resolution Summary

In this chapter, I propose new approaches for accelerating large-scale entity resolution in the common case that the user is interested in one entity or a watch list of entities. These techniques can be integrated into existing data processing pipelines or used as a tool for exploratory data analysis. We showed three single-query ER algorithms and three scheduling algorithms for multi-query ER and showed experimentally that their runtime performance is several orders of magnitude better than the baseline.

CHAPTER 4
A PROPOSAL OPTIMIZER FOR SAMPLING-BASED ENTITY RESOLUTION

4.1 Introduction to the Proposal Optimizer

Recently, an increasing number of organizations are tracking information across social media and the web. To this end, the National Institute of Standards and Technology hosted a three-year track to accelerate the extraction of information and construction of knowledge bases from streaming web resources [34]. This international contest highlighted the many difficulties of collecting unstructured data across the web. Across the efforts in this contest, we identify entity resolution as a major barrier to progress. Entity resolution across text corpora is the task of identifying mentions within the documents that correspond to the same real-world entities. To construct knowledge bases or extract accurate information, entity resolution (ER) is a required step. This task is a notoriously computationally difficult problem. Using Markov chain Monte Carlo (MCMC) techniques exchanges raw performance for a flexible representation and guaranteed convergence [66, 86, 99]. Processing streaming textual documents exacerbates two of the core difficulties of ER. The first difficulty is the computation over large entities, and the second is the excessive computation spent resolving unambiguous entities. Over time, the growing size of large entities makes keeping up with the incoming documents untenable. Optimization that touches these critical portions is wholly understudied. In this chapter, we argue that compression and approximation techniques can efficiently decrease the runtime of traditional ER systems, thus making them usable in a streaming environment. In sampling-based entity resolution, entities are represented as clusters of mentions. A proposal is made to move a random mention from a source entity to a random destination entity. The proposed state is scored and, if it improves the global state, the new state is accepted. If the proposal does not improve the global state, the proposal may still be accepted with some small probability. This process is repeated until the state converges.
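A minimal sketch of one such proposal step, assuming entities is a list of mention sets and model_score is a stand-in for the factor-graph scoring function, looks roughly as follows; it illustrates the accept/reject behavior described above rather than any particular system's code.

import math
import random

def mh_step(entities, model_score, temperature=1.0):
    """One sampling-based ER proposal: move a random mention between entities."""
    src, dst = random.sample(range(len(entities)), 2)
    if not entities[src]:
        return False
    mention = random.choice(list(entities[src]))

    before = model_score(entities)
    entities[src].remove(mention)
    entities[dst].add(mention)
    after = model_score(entities)

    # Accept improvements; accept worse states with a small probability.
    if after >= before or random.random() < math.exp((after - before) / temperature):
        return True
    entities[dst].remove(mention)      # revert the proposal
    entities[src].add(mention)
    return False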

Scoring the state of an entity cluster, through pairwise feature computation of the cluster mentions, is O(n²). For entity clusters larger than 1000 mentions, calculating the score for each proposal can become prohibitively expensive. Wick et al. present an entity resolution technique that uses a tree structure to organize related entities to reduce the amount of work performed in each step [99]. During each proposal, this approach avoids the pairwise comparison by restricting model calculation to the top nodes of the hierarchy. This approach can avoid massive amounts of computation by organizing the known sets of mentions. This discriminative tree structure is a type of compression. Singh et al. present a method of efficiently sampling factors to reduce the amount of work performed when computing features [88]. They observe that many factors are redundant and do not need to be computed when calculating the feature score. They use statistical techniques to estimate the feature scores with a user-specified confidence. This approach can be categorized as early stopping for feature computation. There is no one-size-fits-all sampling algorithm [82]; each of these methods, compression and early stopping, has drawbacks. Compression may slow down insertion speed and requires extra bookkeeping to organize the data structure. Early stopping is not always precise, and adding extra conditionals in the Metropolis-Hastings loop structure slows computation. Applying each technique at appropriate times can remove pain points and accelerate the entity resolution process. In this chapter, we discuss initial work towards the design of an optimizer that modifies the sampling-based collective entity resolution process to improve sampling performance. Static parameters for evaluating entity resolution rarely hold for the lifetime of a streaming processing task. The optimizer, in the spirit of the eddy database query optimizer [5], dynamically examines the current state of each proposal and suggests methods for evaluating proposals and structuring entities. We train a classifier to decide when the sampling process should use early stopping. Additionally, we use training data

to decide the best time for a particular entity to be compressed. This is done with negligible bookkeeping.

Figure 4-1. The high-level interaction of the optimizer. As streaming data updates pass to the machine learning model, the optimizer recommends the best algorithms to update the model. Entity resolution is an example of a model that needs to be frequently updated with new data.

We make the following contributions:

• We identify several techniques to speed up sampling past a natural baseline.

• We create rules and techniques for an optimizer to choose parameters and methods at run time.

• We empirically evaluate these methods over a large data set.

We recognize that optimizers can also apply to many different long-running machine learning pipelines. Figure 4-1 depicts the optimizer supervising the machine learning model. The optimizer determines the methods for processing the streaming updates of the model. As future work, we plan to create a full optimizer to study performance improvements on long-running machine learning tasks. The outline of this chapter is as follows. In Section 4.2, we give an introduction to factor graph models and entity resolution. In Section 4.3, we further discuss the statistics that an optimizer for entity resolution can use. In Sections 4.4 and 4.5, we discuss the implementation of the optimizer. Finally, in Section 4.6, we examine the benefits by

testing early stopping and compression over a synthetic and a popular real-world entity resolution data set.

4.2 Proposal Optimizer Background

Factor graphs are a pairwise formalism for expressing arbitrarily complex relationships between random variables [53]. A factor graph F = ⟨x, ψ⟩ contains a set of random variables x = {x_i}_{i=1}^{n} and factors ψ = {ψ_i}_{i=1}^{m}. Random variables are connected to each other through factors. Factors are a mapping between one or more variables and a real-valued score. The probability of a setting ω among the set of all possible settings Ω occurring in a factor graph is given by a probability measure:

\pi(\omega) = \frac{1}{Z} \sum_{x \in \omega} \prod_{i=1}^{m} \psi_i(x^i), \qquad Z = \sum_{\omega \in \Omega} \sum_{x \in \omega} \prod_{i=1}^{m} \psi_i(x^i)

where x^i is the set of random variables that neighbor the factor ψ_i(·) and Z is the normalizing constant. Exact inference over complex factor graphs is computationally expensive because it involves computing the normalizing constant. Therefore, it is popular for researchers to use Markov chain Monte Carlo (MCMC) approximation techniques to estimate the probability of settings. In particular, for large and dense factor graphs MCMC Metropolis-Hastings (MH) has been shown to be a scalable technique for inference calculation [86]. Cross-document entity resolution, resolving entities across document borders, is usually several orders of magnitude larger than within-document entity resolution. In large text corpora, the size of entities follows a power law [87]. For example, Figure 4-2 shows a data set containing 40 million mentions and 3 million entities over 11 million web pages. As documents and mentions are incrementally streamed through, the scale problem becomes a critical issue. The mentions on disk can be represented as a large array of identifiers. Entities are a collection of mentions and can be represented as such. In the worst case there is an equal

number of entities and mentions. This means each mention is its own individual entity. In the other extreme, all the mentions may be part of the same entity. For streaming entity resolution, mentions within documents must be matched to the existing set of entities [78]. In this chapter, we assume the entity set is initialized by grouping the most similar mentions; new mentions are assigned to the closest match.

Figure 4-2. A distribution of entity sizes from the Wikilinks corpus [87] with an initial start and the truth

To compute the score at each step, the number of comparisons is proportional to the number of pairwise factors between mentions. The pairwise factors are weighted functions such as approximate string matches, token overlap, and n-gram matches. There are additional cluster-wide features calculated at each step. Such features include functions to check whether all mentions in a cluster share the same token. For clusters larger than 1000 mentions, calculating scores of the model becomes extremely expensive. Performing sophisticated techniques over smaller clusters also adds extra overhead. In this chapter, we examine the trade-off of selecting methods to accelerate the feature computation process.
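The quadratic cost of this scoring can be seen in a small sketch, where pair_score is a hypothetical stand-in for the weighted string-match, token-overlap, and n-gram factors:

def cluster_score(cluster, pair_score):
    """O(n^2) pairwise scoring of an entity cluster.

    cluster: list of mentions; pair_score(a, b): weighted sum of the
    pairwise factors (string match, token overlap, n-gram match, ...).
    """
    total = 0.0
    for i in range(len(cluster)):
        for j in range(i + 1, len(cluster)):
            total += pair_score(cluster[i], cluster[j])
    return total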

4.3 Accelerating Entity Resolution

In this section, we discuss accelerating MCMC-MH sampling for entity resolution. We then motivate how we believe gains can be achieved using compression, sampling acceleration methods, and optimizers. We use a large real-world corpus as a motivating example.

The two issues we are investigating are as follows: First, given a source entity, destination entity, and mention (e_s, e_d, m), which method can score the proposal in the least amount of time? Second, after the proposal is calculated, should we compress the entity structure? The optimizer will decide when to use each technique. The total size of all entities in the traditional representation is:

sizeof(E) = \sum_i \left( c + sizeof(int) \cdot |e_i| \right)    (4–1)

where sizeof is an abstract function to compute the size of the containing object, c is a class constant, and |e_i| is the number of mentions in the entity. There are many compression techniques, one being to only keep mentions that have a unique representation inside entities. That is, if any mention token is a duplicate, we remove it. This compressed total entity size is:

sizeof(E_{compressed}) = \sum_i \left( c + sizeof(int) \cdot \#e_i \right)    (4–2)

where #e_i is the cardinality of the mention tokens in entity e_i. We note that when #e_i ≪ |e_i|, it may be worth compressing the entity e_i. In Figure 4-2, 45% of entities are smaller than 100 mentions in size. Additionally, 82% of entities contain fewer than 1000 mentions. These numbers suggest that at times we can take advantage of the redundancy within large entities by compressing them. We investigate the Wikilinks corpus further in Section 4.6.1. In addition, Figure 4-2 shows that there is an order of magnitude difference between the sizes of initial entities and the true entity sizes. The entities were initialized by exact string match, a common initialization scheme. This difference gives us some intuition about the trends of the entity resolution process. Additionally, this suggests that there are several distinct representations of entities. During entity resolution the sizes of entities can be expected to grow by an order of magnitude while the total number of smaller entities

will decrease. We can use this property to track the growth and change of entity sizes over time to understand how to process a particular grouping of entities.

4.4 Proposal Optimizer Algorithms

In this section, we describe simple algorithms for entity sampling and simple entity compression. After introducing the compression and approximation techniques we discuss how an optimizer can be designed to improve the overall sampling time. The baseline method performs pairwise comparisons by iterating over the mentions using the order on disk. The mention ids are used to extract the contextual information of each mention from a database. This is the traditional method of computing the pairwise similarity of two clusters. This method results in simple code, so modern compilers are able to perform aggressive optimizations such as loop unrolling. The confidence-based scoring method uniformly samples mentions from the source and destination entity clusters during scoring. This method measures the confidence of the calculated pairwise samples and stops when the confidence of a score exceeds a threshold of 0.95. This is a simplified version of the uniform sampling method described by Singh et al. [88]. The code to collect statistics is shown in Algorithm 18. The add function shows how and what statistics are recorded when each new mention is added. Notice themax and themin are variables in the Stats class that store the current maximum and minimum. The current sum and running mean are also updated with each new value added. The current implementation assumes the values from the pairwise factors follow a Gaussian distribution; the model in Singh et al. makes the same assumption [88]. As entity sizes grow, we can expect to see many repeats of the same or very similar mentions. Reducing the entity size will shrink the effective memory footprint of entities. This is important for long-running collections of entities. Run-length encoding is the simplest method for compressing entities. This method compresses the near-duplicate mentions. A canonical mention is chosen for each set of exact duplicates and a counter map

Algorithm 18 Sample code from the Stats class showing how running statistics are recorded and how the variance can be computed

void Stats::add(long double x) {
    themax = MAX(themax, x);        // running maximum
    themin = MIN(themin, x);        // running minimum
    sum += x;
    ++n;
    // Welford's online update of the running mean and M2 (sum of squared deviations)
    auto delta = x - mean;
    mean += (delta / n);
    M2 = M2 + delta * (x - mean);
}

double Stats::variance(void) const {
    // sample variance from the running statistics; returns 0 for too few samples
    if (n > 2)
        return M2 / (n - 1);
    else
        return 0.0;
}
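A minimal sketch (in Python rather than the system's C++) of how such running statistics could drive early stopping: pairs are sampled uniformly, the mean and variance are updated as in Algorithm 18, and scoring stops once the estimated standard error of the mean falls below a tolerance. The 0.95-confidence rule from the text is approximated here by a simple standard-error threshold, and pair_score is a hypothetical placeholder.

import math
import random

def confidence_based_score(cluster, pair_score, tol=0.05, max_pairs=10000):
    """Estimate the mean pairwise score by sampling until the estimate is stable.

    cluster: list of mentions (at least two); pair_score(a, b): pairwise factor score.
    """
    n, mean, m2 = 0, 0.0, 0.0           # Welford running statistics
    while n < max_pairs:
        a, b = random.sample(cluster, 2)
        x = pair_score(a, b)
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 30:                       # enough samples for a rough normal approximation
            stderr = math.sqrt(m2 / (n - 1)) / math.sqrt(n)
            if stderr < tol:             # stop early once the estimate is stable
                break
    return mean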

Table 4-1. Techniques to improve the sampling process; each is classified by how it affects sampling

  Technique                  Compression   Early Stopping   Overhead
  Baseline                   No            No               None
  Confidence-based [88]      No            Yes              Medium
  Discriminative Tree [99]   Yes           No               Large
  Run-Length Encoding        Yes           No               Small

records the number of duplicates that are represented. The compression rates become large for mention clusters with many duplicates.
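A minimal sketch of this compression, assuming mentions are compared by their surface token:

from collections import Counter

def compress_entity(mentions):
    """Collapse exact-duplicate mention tokens into (canonical token, count) pairs."""
    counts = Counter(mentions)           # token -> number of duplicates
    return list(counts.items())

def decompress_entity(compressed):
    """Expand (token, count) pairs back into the full mention list."""
    return [token for token, count in compressed for _ in range(count)]

For example, compress_entity(['IBM', 'IBM', 'I.B.M.']) yields [('IBM', 2), ('I.B.M.', 1)], and decompress_entity recovers the original multiset.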

4.5 Optimizer

Before calculating the MCMC-MH proposal there are several decisions we can make that will affect the runtime and accuracy of the algorithm. At each step we may: (1) approximate the calculation of the entity states; (2) update an entity structure to a compressed format; (3) skip the calculation of the proposal and directly accept or reject. These decisions can be made by observing several features of a source entity,

destination entity and a source mention. We enumerate a small set of features that can yield information to help us decide how the entity structure should be changed. The decision to compress an entity takes four main points into consideration.

First, the time it takes to compress the entity (C_time). For example, if the time it takes to compress an entity is the same as the time it takes to reach an answer in the uncompressed format, then compression is superfluous. Second, it is important to consider the space saved in memory and the number of additional entities that do not have to be fetched from disk and can now fit in memory (C_space). Third, we need to know how active an entity has been (C_activity). That is, how many additions or subtractions this entity has seen over a long period of time. This information is helpful in understanding the likelihood this entity will be requested for another addition or subtraction. (Modifying entity clusters causes them to block.) Last, we retain the activity of an entity over a recent, short period of time (C_velocity). This information lets us know whether it is wise for this entity to take time out for compression while other mentions may be attempting an insertion or removal. At each proposal step the decision made should maximize the utility. The utility of the decision is a numeric score representing the gain from performing the proposal calculation. The utility value is a real number in the range (−∞, ∞). A formal model for utility is as follows:

U = C_time + C_space + C_activity + C_velocity
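A minimal sketch of how this utility might be evaluated, assuming each component has already been normalized to a comparable scale; the sign conventions and threshold here are illustrative assumptions, not tuned values.

def should_compress(entity_stats, threshold=0.0):
    """Decide whether to compress an entity using the utility U defined above.

    entity_stats is assumed to carry pre-normalized components, where costs
    (compression time, recent churn) are negative and benefits (space saved,
    low long-term activity) are positive. Positive utility favors compressing.
    """
    u = (entity_stats["c_time"] + entity_stats["c_space"]
         + entity_stats["c_activity"] + entity_stats["c_velocity"])
    return u > threshold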

Collecting statistics to measure utility can incur a significant overhead. Not every decision in the optimizer needs to be decided automatically. We can use some simple principles to estimate the utility at each point. In the next section, we examine an entity resolution data set and get some intuition for the development of the optimizer.

4.6 Proposal Optimizer Experiment Implementation

In this section, we first describe the Wikilink data set we use for experiments. Following, we present a micro benchmark to validate our investigation of entity approximation

and compression. We then discuss the implementation of the compression and approximation techniques over a large real-world cross-document entity resolution corpus.

4.6.1 WikiLink Corpus

The Wikilink corpus is the largest fully-labeled cross-document entity resolution data set to date [87]. When downloaded, the data set contains 40 million mentions and almost three million entities; it is 180 GB of compressed data. The Wikilink corpus was created by crawling pages across the web and extracting anchor tags that referenced Wikipedia articles. Each page contains multiple mentions of different types. The Wikipedia articles act as the truth for each mention. Although manually constructed and not without its biases, this is the largest, fully-labeled entity resolution data set over web data that we could find (at the time of preparation).

To increase our intuition about early stopping techniques, we simulated the MCMC proposal process. We hypothesize that a range of values exists where performing the baseline cluster scoring is faster than using early stopping methods. We arrange entity clusters in increasing size and we compute the time (in clock ticks) each proposal takes to compute the arrangement of the clusters. The data in the clusters are distributed uniformly for this experiment and each cluster point was 5-dimensional. For the baseline cluster score computation we used a pairwise calculation of the average cosine distance with and without the mention. To compute early stopping we set a confidence threshold of 0.8 and the early stopping code stopped computation when the predicted error was under 20%. There was no difference in the proposal choices of the baseline method and the early stopping method. The simulations were developed in GNU C++11 and compiled with g++ -O3. The CPU was an 8-core Intel i7 at 3.2 GHz with 12 GB of memory. Each arrangement was run 5 times and the results averaged.

Figure 4-3. Comparison of baseline versus early stopping methods

Early stopping or baseline. We first determine when early stopping approaches for proposal scoring are beneficial. For this result we compare the baseline proposal evaluator with a confidence-based scorer for varying entity sizes. The result of this experiment is summarized in Figure 4-3. The x-axis is the number of mentions in the source and destination cluster for each proposal. The y-axis is the number of clock ticks on a log scale. We observe that for proposals with fewer than 100 and 1000 source and destination mentions, the performance of the baseline proposer is better than or almost equal to that of the more sophisticated early stopping method. For proposals that contain an entity cluster with 10000 mentions the early stopping method performs significantly better than the baseline method. Surprisingly, the baseline proposals for entity clusters containing 100K mentions performed over an order of magnitude better than the early stopping method. The optimizations found in predictable code paths make simple implementations like the baseline method attractive for small cluster sizes and very large cluster sizes. In addition, 82% of the entities in the truthed Wikilinks data set are less than 1000 mentions in size and 45% of the entities contain less than 100 mentions.

Figure 4-4. The time for compression for varying entity sizes and cardinalities. This is compared with a line representing the time it takes to make 100K insertions

The results of the micro benchmark suggest that different proposal estimation techniques are useful at different times. Note that for these techniques a small constant amount of bookkeeping space is required to perform early stopping.

Insertion vs. Compression Time. Compressing an entity is an expensive operation. When compressing an entity, it must be locked to prevent any concurrent access. In order to choose the best times to compress an entity cluster, in this micro benchmark we look at the time to compress entities of different cardinalities and compare it to the time it takes to insert entities. Using a synthetic data set we generated entities of varying sizes and cardinality. This experiment is shown in Figure 4-4. The cardinality number is a ratio of duplicates in the data set. For example, cardinality 0.8 means 8 of 10 items in the data set are duplicates. The graph shows that in the time it takes to compress entities of about 300K mentions, the sampler could make 100K samples. We can conclude from these results that compressing large entities is expensive and should only be done if the cluster is prohibitively large and not popular. Cardinality estimation for millions of entities is a significant overhead. Tracking cardinalities simultaneously for each entity, even using small probabilistic sketches such as HyperLogLog [32], becomes prohibitive for large numbers of entities. By the time the cardinality of an entity needs to be monitored for possible compression, that entity might

as well be compressed. We are continuing to look for lighter weight cardinality estimators for millions of mentions so decisions can be made quickly.

4.7 Proposal Optimizer Summary

In this chapter, we describe an initial approach for optimizing sampling for the entity resolution process. We begin to develop an optimizer that attacks two major limitations, the size of the entities and the redundant computation. This chapter motivated the need for the optimizer and examined the feasibility of its creation. Future work includes the implementation of a full optimizer over a large, streaming corpus with resolved entities. We hope to soon have a fully resolved TREC StreamCorpus1 and examine the performance of the optimizer on that large data set. Additionally, we hope to compare results with enterprise ER systems such as WOO [9].

1 After acceptance, the corpus at http://trec-kba.org/kba-stream-corpus-2014.shtml was linked to Freebase and is now available to researchers [27].

CHAPTER 5
QUESTION ANSWERING

Question answering is the problem of bridging the gap between the way a user asks a question and the way an answer is encoded in the background knowledge. In this work we start with natural language questions and use the deep web as the background knowledge. The deep web, or hidden web, is the set of databases behind web forms on the web. In 2007, these databases were estimated to contain data two orders of magnitude larger than the surface web. Contrary to the surface web, the information in these databases is difficult to obtain. In this chapter, I describe a method for accessing the deep web to answer wh-questions. This system is called the Morpheus QA system [40].

5.1 Morpheus QA Introduction

When traveling through a jungle to a destination, it is easy to get lost. The first person to journey somewhere may make a number of mistakes when trying to find the best path to their destination. Those who come later find it easier to reach the destination if a well-marked trail has been created. Olsen and Malizia describe this idea as exploiting trails [72]. Rather than treating a user’s discovery experience as a unique entity, one can exploit the fact that a similar search may have already been performed. In one study, almost 40 percent of web queries were repetitions of previous queries [92]. Thus, reuse of prior searches is one way to optimize the search process. Morpheus is a question answering system motivated by reuse of prior web search pathways to yield an answer to a user query. Morpheus follows path finders to their destinations and not only marks the trail, but also provides a taxi service to take followers to similar destinations. Morpheus focuses on the deep (or hidden) web to answer questions because of the large stores of quality information provided by the databases that support it [69]. Web forms act as an interface to this information. Morpheus employs user exploration through these web forms to learn the types of data each deep web location provides.

There are two distinct Morpheus user roles. A path finder enters queries in the Morpheus web interface and searches for an answer to the query using an instrumented web browser. This web tracking tool stores the query and necessary information to revisit the pathways to the page where the path finder found the answer. A path follower uses the Morpheus system much like a regular search engine with a natural language interface. The path follower enters a question in a text box and receives a guided path to the answer. The system exploits previously found paths to provide an answer. Morpheus represents a user question as a semi-structured query (SSQ). It assumes the query terms belong to classes of a consistent realm-based ontology, that is, one having a singly rooted heterarchy whose subclass/superclass relations have meaningful semantic interpretations. When a path follower enters a query, Morpheus ranks SSQs in the store based on class similarity. Suppose a path follower asks: A 1997 Toyota Camry V6 needs what size tires? In this query the classes associated with terms, e.g. Manufacturer with Toyota, help us identify similar queries. This chapter discusses related question answering and ontology generation systems in Section 5.2. Section 5.3 explains the Morpheus system and its implementation. In Section 5.4 we describe the current results of our approach. Finally, we conclude with future goals for the system.

5.2.1 Question Answering Systems

The earliest question answering systems such as BASEBALL [43] and Lunar [102] had closed domains and closed corpora, that is, they support a finite number of questions over corpora containing a fixed set of documents. Morpheus uses the web as its dynamic, open corpus and examines deep web sources to answer questions. This process is federated question answering.

Several other QA systems that use the web as a resource have been developed. Example systems include START1 and Swingly.2 These systems use web pages retrieved by web crawlers or search engines to find answers. Morpheus differs in that it seeks out relevant deep web sources, and instead of using a web search engine, it uses only the pages referenced in a previously answered question.

5.2.2 Ontology Generators

The DBpedia3 project is a community of contributors extracting semantic information from Wikipedia and making this information available on the Web. Wikipedia semantics includes disambiguation pages, geo-coordinates, categorization information, images, info-box templates, links to external web pages, and redirects to pages in Wiki markup form [13]. DBpedia does not define any new relations between the Wikipedia categories. YAGO is a semi-automatically constructed ontology obtained from the Wikipedia pages, info-boxes, categories, and the WordNet4 synset heterarchy [91]. YAGO uses the Wikipedia page titles as its ontology individuals and categories as its ontology classes. YAGO uses only the nouns from WordNet and ignores the WordNet verbs and adjectives. YAGO discovers connections between WordNet synsets and Wikipedia categories, parsing the category names and matching the parsed category components with the WordNet synsets. Each Wikipedia category not having a WordNet match is ignored in the YAGO ontology. The ontology's heterarchy is built using the hypernym and hyponym relations of the WordNet synsets. We use YAGO's principles to construct ontologies that provide similarity measures for answering questions within the same domain. Thus far, these ontologies can be used

1 http://start.csail.mit.edu
2 http://swingly.com
3 http://dbpedia.org
4 http://wordnet.princeton.edu

to classify terms; however, their classes do not always appropriately categorize query parameters. It is necessary to provide an appropriate level of class granularity. Section 5.3 discusses our approach for identifying classes and their instances from deep web forms and documents.

5.3 Morpheus QA System Architecture

This section presents the ontology and corpora, query processing, query ranking, and query execution.

Morpheus uses an ontology that contains classes of a particular realm of interest. Each leaf node in the ontology is associated with a corpus of words belonging to a class. For example, we have constructed a vehicular ontology containing classes relevant to the vehicular realm. This ontology provides a structure for reference in the following sections. Morpheus references the DBpedia categories, Wikipedia pages, and the WordNet synset heterarchy to find class-relevant web pages. First a realm is mapped to a DBpedia category [13]. Using the DBpedia ontology properties broader and narrower, a Markov Blanket [75] is created covering all neighboring categories. To build a corpus for each of the leaf nodes in the ontology, we extract terms from the Wikipedia pages associated with the DBpedia categories found in its blanket. From this term corpus, we can find the likelihood of a term belonging to a particular class. This assists in classifying terms in a path follower query. The likelihood is determined by the probability of a class given a term using Bayes Rule (Eq. 5–1), since we can easily obtain the term-class and term-corpus probabilities as relative frequencies.

P(class \mid term) = \frac{P(term \mid class)\, P(class)}{P(term)}    (5–1)
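As a small sketch of Eq. 5–1, with raw term counts standing in for the relative frequencies and class_corpora as an assumed, simplified representation of the per-class corpora:

from collections import Counter

def class_given_term(term, class_corpora):
    """Score P(class | term) for every class using Bayes rule with relative frequencies.

    class_corpora: dict mapping class name -> nonempty list of terms extracted from
    the Wikipedia pages of that class (a simplified stand-in for the real corpora).
    """
    total_terms = sum(len(corpus) for corpus in class_corpora.values())
    term_count = sum(Counter(corpus)[term] for corpus in class_corpora.values())
    if term_count == 0:
        return {}
    p_term = term_count / total_terms
    scores = {}
    for cls, corpus in class_corpora.items():
        p_class = len(corpus) / total_terms                   # prior from corpus size
        p_term_given_class = Counter(corpus)[term] / len(corpus)
        scores[cls] = p_term_given_class * p_class / p_term   # Eq. 5-1
    return scores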

In addition, we employ Latent Dirichlet Allocation (LDA) to identify latent topics of the documents in a corpus [14]. LDA is a Bayesian model that represents a document in the corpus by distributions over topics, and a topic itself is a distribution over all

terms in the corpus. For example, the latent topics reflect the thematic structure of Wikipedia pages. Thus, LDA discovers relevant topic proportions in a document using posterior inference [14]. Given a text document, we tag related documents by matching their similarity over the estimated topic proportions, assisting in ontology and corpora construction. We use LDA as a dimensionality reduction tool. LDA's topic mixtures are represented as feature vectors for each document. We are evaluating support vector machines as a classifier over the document-topic proportions. Due to its fully generative semantics, this usage of LDA could address drawbacks of frequency-based approaches (e.g., TF-IDF, LSI, and pLSI) such as dimensionality and failure to find the discriminative set of words for a document.

Table 5-1. Example SSQ model
  Terms:   1997  Toyota  Camry  V6  size  tires
  Input:   Date  Manufacturer  Model  Engine
  Output:  Measurement  Part

5.3.2 Recording

The Query Resolution Recorder (QRR) is an instrumented web browser that records the interactions of a path finder answering a question. The path finder also uses the tool to identify ontological classes associated with search terms. Morpheus stores the query, its terms, and its classes as an SSQ. Table 5-1 is an example showing the SSQ model of the query: A 1997 Toyota Camry V6 needs what size tires? The SSQ in Table 5-1 is said to be qualified because the classes associated with its terms have been identified. Using the QRR, the path finder is also able to identify where answers can be found within traversed pages. The Query Resolution Method (QRM) is a data structure that models the question answering process. A QRM represents a generalized executable realization of the search process that the path finder followed. The QRM is able to reconstruct the page search path followed by the path finder. Each QRM contains a realm from our ontology, an SSQ,

and information to support the query answering process. For each dynamic page, the QRM contains a list of inputs and reference outputs from the URL. When a path follower submits a query, the Morpheus search process parses and tags queries in order to record important terms. The system assigns the most probable realm given the terms in the query as calculated from realm-specific corpora. Once the realm is assigned, an ontology search is performed to assign classes to the terms. An SSQ is constructed and the system attempts to match this new SSQ to existing QRM SSQs. Rather than matching exact query terms, the system matches input and output classes, because a QRM can potentially answer many similar queries.

5.3.3 Ranking

To answer a user’s query, a candidate SSQ, Morpheus finds similar qualified SSQs that are associated with QRMs in the Morpheus data store. To determine SSQ similarity, we consider the SSQ’s realm, input terms, output terms, and their assigned classes. The class divergence of two classes within the ontology characterizes their dissimilarity. This solution is motivated by the concept of multiple dispatch in CLOS and Dylan programming for generic function type matches [8]. We consider the class match as a type match and we use class divergence to calculate the relevance between the candidate SSQ and a qualified SSQ. Each qualified SSQ will have input terms, output terms, associated classes, and one realm from the QRM. For the candidate SSQ, the relevant classes for terms are determined from the natural language processing engine and corpora. The calculation of a realm for a candidate query is performed using the terms found within the query and any probabilities found with p(realm|term). We match QRMs that belong to the same realm of the candidate SSQ. The relevance of a qualified SSQ to the candidate SSQ is determined by aggregating the divergence measure of input term classes associated with each SSQ. In addition, we order QRMs in the data store by decreasing relevance. The order provides a ranking for the results to the user. The following describes class divergence in detail.

We define class divergence (Eq. 5–2), a quasi-metric, between a source class and a target class using the topological structure of the classes in an ontology. We write S ≺ T for the reflexive transitive closure of the superclass relation. Let d(P, Q) represent the hop distance in the directed ontology inheritance graph from P to Q. The class divergence cd between a source and target class ranges from zero (for identical classes) to one (for type-incompatible classes). Let S be the source class, T be the target class, and C be a least common ancestor class of S and T, i.e., one that minimizes d(S, C) + d(T, C). The class divergence between S and T is defined by:

cd(S, T) =
  0                                            if S.Uri ≡ T.Uri
  d(S, T) / (3h)                               if S ≺ T
  1                                            if T ≺ S
  (d(S, root) + d(S, C) + d(T, C)) / (3h)      otherwise        (5–2)

where h is the longest path in the ontology class heterarchy. Note, if S ≺ T and S ⊀ Q then cd(S, T) < cd(S, Q); that is, the divergence of a source class to a target ancestor class is smaller than the divergence of a source class to any class that is not an ancestor. This is an important property in determining the compatibility of classes for answering queries. If an SSQ answers queries concerning an ancestor class, it is more relevant than an SSQ that answers queries from any non-ancestral class. Suppose we want to find the class divergence between Bus and Sedan from the ontology shown in Figure 5-1. Land Vehicle is their least common ancestor because Sedan is a subclass of Automobile, which is a subclass of Land Vehicle, and Bus is a subclass of Land Vehicle. The longest path from Bus and Sedan to the tree root is four (h = 4). By the formula in Eq. 5–2, cd(Bus, Sedan) = (d(Bus, root) + d(Bus, Land Vehicle) + d(Sedan, Land Vehicle))/(3h) = (3 + 1 + 2)/(3 · 4) = 6/12.
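A minimal sketch of Eq. 5–2 over a simple parent-pointer view of the ontology; the single-parent representation and precomputed h are simplifying assumptions.

def ancestors(cls, parent):
    """Map each ancestor of cls (including itself) to its hop distance."""
    dist, d = {}, 0
    while cls is not None:
        dist[cls] = d
        cls, d = parent.get(cls), d + 1
    return dist

def class_divergence(s, t, parent, root, h):
    """Eq. 5-2: divergence from source class s to target class t.

    parent: dict mapping each class to its single superclass (root has no entry);
    h: length of the longest path in the class heterarchy.
    Assumes a singly rooted heterarchy, as described in Section 5.1.
    """
    if s == t:
        return 0.0
    anc_s, anc_t = ancestors(s, parent), ancestors(t, parent)
    if t in anc_s:                      # s is a descendant of t (S ≺ T)
        return anc_s[t] / (3 * h)
    if s in anc_t:                      # t is a descendant of s (T ≺ S)
        return 1.0
    # Least common ancestor: minimizes d(s, c) + d(t, c).
    common = set(anc_s) & set(anc_t)
    c = min(common, key=lambda a: anc_s[a] + anc_t[a])
    return (anc_s[root] + anc_s[c] + anc_t[c]) / (3 * h)

With hop distances matching the Bus and Sedan example above, the otherwise branch returns (3 + 1 + 2)/(3 · 4) = 0.5.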

Figure 5-1. Abbreviated vehicular ontology

5.3.4 Executing New Queries

Once we have ranked the QRMs for a given user query, we can produce answers by re-visiting the pathways stored in the QRMs. The Morpheus Query Executor (QE) evaluates a script of the query resolving process. It simulates a human clicking buttons to follow links, submit forms, and highlight data, forming a textual answer. The QE assumes that because of the auto-generated nature of deep web pages, the locations of answers are the same irrespective of page changes. It uses the relative XPath location to the answer node on HTML pages as described in Badica et al. [6].

5.4 Morpheus QA Results

First, we built an ontology for the vehicular realm exploiting the Wikipedia pages, DBpedia categories, and WordNet synsets. For each of the classes in the ontology we built corpora from the corresponding Wikipedia pages. Figure 5-1 shows a subsection of this ontology. In Table 5-2 we show the data output by the Morpheus parse of the query. It extracts the wh-term that classifies the sentence as a question, identifies the answer class, and

locates descriptive phrases to produce the answer. Finally, the engine produces n-grams from phrases in the descriptive information sections.

Table 5-2. The output of the NLP engine
  wh-term               what
  descriptive phrases   1997 Toyota Camry V6
  asking for            size tires
  n-grams               1997, 1997 Toyota, 1997 Toyota Camry, Toyota, Toyota Camry, Toyota Camry V6, Camry, Camry V6, V6

Using the data in Table 5-2 we determine relevant classes in non-increasing order of relevance. Table 5-3 shows the eight best term classes and their probabilities for automotive queries.

Table 5-3. Term classes and probabilities
  Term              Class     P(Class|Term)
  1997              Sedans    404132.77e-14
  1997 Toyota       Engines   7.90e-14
  Toyota            Sedans    3486670.15e-14
  Toyota Camry      Sedans    12147.23e-14
  Toyota Camry V6   Coupes    13.80e-14
  Camry             Sedans    312034.20e-14
  Camry V6          Coupes    13.80e-14
  V6                Sedans    4464535.40e-14

We found the best classes for the terms in the candidate SSQ. We calculated the class divergence between these classes and the qualified SSQ classes in the QRM store. QRMs are ranked based upon the relevance score and the class divergence measure. Table 5-4 shows the three highest ranked queries and the answers produced by the QRE, a Python back end. Finally, we execute the best QRMs and display the results to the user.

5.5 Morpheus QA Summary

In this work, we propose a novel question answering system that uses the deep web and previously answered user queries to answer similar questions. The system uses a path finder to annotate answer paths so path followers can discover answers to similar questions. Each (question, answer path) pair is assigned a realm, and new questions are

matched to existing (question, answer path) pairs. The classification of new question terms into classes is based on term frequency distributions in our realm-specific corpora of web documents. These terms are the input to existing answer paths and we re-execute these paths with the new input to produce answers. Our solution is composed of a web front end where users can ask questions. The QRR was developed as a plugin and an associated C# application. Our similarity measures were coded using Java and open source libraries. Answers are produced by the QRE, a Python back end. The data is stored in a PostgreSQL database.

Table 5-4. Highest ranked Morpheus QA queries
  Query                                              Tagged Classes                            Score
  A 1997 Toyota Camry V6 needs what size tires?      Sedan, Automobile, Engine, Manufacturer   0.91
  What is the tire size for a 1998 Sienna XLE Van?   Van, Manufacturer                         0.72
  Where can I buy an engine for a Toyota Camry V6?   Sedan, Automobile Engine, Manufacturers   0.74

Topic modeling provides a promising approach to identifying pages relevant to a class in a more automated manner. We believe these web form entry annotation methods and form label extraction [69] can yield promising results. Combining this with the method of Elmeleegy et al. [30] may remove the user from the answer path generation process. Future investigation in this area should look to merge compatible QRMs to answer compound questions, chaining QRMs using the principles of transform composition [73].

CHAPTER 6
PATH EXTRACTION IN KNOWLEDGE BASES

Knowledge bases are increasingly being augmented using unstructured data to extract actionable information. Typically, KBs are populated with triples of information and then searched with queries to discover a new subset of data. Inference is the task of extracting knowledge that is not explicitly represented. Knowledge bases contain boundless amounts of information that needs to be extracted using new and efficient methods. Extracting sets of information from knowledge bases is an exciting area of current research. In this chapter, we describe a new path traversal process over knowledge bases with uncertainty. I define an algorithm to discover, extract, and rank connected sets of facts in a knowledge base between multiple entities of interest. I empirically show that the path expansion methods described are useful to express relationships between entities.

6.1 Preliminaries for Knowledge Base Expansion

In this section, we discuss the fundamental concepts underlying path expansion. First is a discussion of knowledge bases, and we then formally describe a probabilistic knowledge base. Note that formalisms and previous work in this section are shared with a recent publication.

6.1.1 Probabilistic Knowledge Base

In this section we formally describe a probabilistic knowledge base. This definition is derived from Chen et al. [21]. A probabilistic knowledge base is a 5-tuple Γ = (E, C, R, Π, L) where

1. E = {e_1, . . . , e_|E|} is a set of entities. Each entity e ∈ E refers to a real-world object.

2. C = {C_1, . . . , C_|C|} is a set of classes (or types). Each class C ∈ C is a subset of E: C ⊆ E.

3. R = {R_1, . . . , R_|R|} is a set of relations. Each R ∈ R defines a binary relation on C_i, C_j ∈ C: R ⊆ C_i × C_j. We call C_i, C_j the domain and range of R and use R(C_i, C_j) to denote the relation and its domain and range.

4. Π = {(r_1, ω_1), . . . , (r_|Π|, ω_|Π|)} is a set of weighted facts (or relationships). For each (r, ω) ∈ Π, r is a tuple (R, x, y), where R(C_i, C_j) ∈ R, x ∈ C_i ∈ C, y ∈ C_j ∈ C, and (x, y) ∈ R; ω ∈ ℝ is a weight indicating how likely r is true. We also use R(x, y) to denote the tuple (R, x, y).

5. L = {(F_1, W_1), . . . , (F_|L|, W_|L|)} is a set of weighted clauses (or rules). It defines a Markov logic network. For each (F, W) ∈ L, F is a first-order logic clause, and W ∈ ℝ is a weight indicating how likely formula F holds.

We refer the reader to [21] for a discussion of first-order probabilistic logic.
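A minimal sketch of how the 5-tuple might be held in memory; this is an illustrative simplification, not the storage layout of any particular system.

from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

@dataclass
class ProbabilisticKB:
    """Γ = (E, C, R, Π, L) from the definition above."""
    entities: Set[str] = field(default_factory=set)                      # E
    classes: Dict[str, Set[str]] = field(default_factory=dict)           # C: class -> member entities
    relations: Dict[str, Tuple[str, str]] = field(default_factory=dict)  # R: name -> (domain, range)
    facts: List[Tuple[Tuple[str, str, str], float]] = field(default_factory=list)  # Π: ((R, x, y), ω)
    rules: List[Tuple[str, float]] = field(default_factory=list)         # L: (first-order clause, W)

kb = ProbabilisticKB()
kb.facts.append((("born_in", "Ruth Gruber", "New York"), 0.96))
kb.rules.append(("born_in(x, y) -> live_in(x, y)", 1.40))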

6.1.2 Markov Logic Network and Factor Graphs

Markov logic networks (MLN) [79] combine first-order logic [90] and probabilistic graphical models [53] into a single model. Essentially, an MLN is a set of weighted

first-order formulae (F_i, W_i), the weights W_i indicating how likely the formula F_i is true. A simple example of an MLN is:

1. 0.96  born in(Ruth Gruber, New York)
2. 1.40  ∀ x ∈ Person, ∀ y ∈ CITY: born in(x, y) → live in(x, y)

It states a fact that Ruth Gruber was born in New York City and a rule that if a writer x is born in an area y, then x lives in y. However, neither statement definitely holds. The weights 0.96 and 1.40 specify how strong they are; stronger rules are less likely to be violated. An MLN can be viewed as a template for constructing ground factor graphs. In the ground factor graph, each node represents a fact in the KB, and each factor represents the causal relationship among the connected facts. For instance, for the rule grounded with born in(Ruth Gruber, New York), we have two nodes, one for the head and the other for the body, and a factor connecting them, with values depending on the weight of the rule. The factors together determine a joint probability distribution over the facts in the KB. A

factor graph is a set of factors Φ = {φ_1, . . . , φ_N}, where each factor φ_i is a function over a random vector X_i indicating the causal relationships among the random variables in X_i. These factors together determine a joint probability distribution over the random vector X consisting of all the random variables in the factors. Mathematically, we seek the maximum a posteriori (MAP) configuration. The model defines a probability distribution over its variables X:

P(X = x) = \frac{1}{Z} \prod_i \phi_i(x) = \frac{1}{Z} \exp\left( \sum_i w_i n_i(x) \right)    (6–1)
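A minimal sketch of the unnormalized score in Eq. 6–1, where n_i(x) counts the true groundings of formula F_i in world x; count_true_groundings is a hypothetical placeholder for an actual grounding procedure.

import math

def unnormalized_log_score(world, weighted_formulas, count_true_groundings):
    """Compute sum_i w_i * n_i(x) for a candidate world x.

    weighted_formulas: list of (formula, weight) pairs;
    count_true_groundings(formula, world): n_i(x), the number of satisfied groundings.
    """
    return sum(w * count_true_groundings(f, world) for f, w in weighted_formulas)

def unnormalized_score(world, weighted_formulas, count_true_groundings):
    """exp of the log score; dividing by Z (intractable in general) would give P(X = x)."""
    return math.exp(unnormalized_log_score(world, weighted_formulas, count_true_groundings))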

6.1.3 Sampling for Marginal Inference

Computing the exact Z in Equation 6–1 is intractable due to the large space of possible configurations. Sampling algorithms are typically used to approximate the marginal distribution since direct computation is difficult. The two most popular of these approaches are Gibbs sampling [19] and MC-SAT [77]. These two sampling algorithms are briefly discussed in the following two paragraphs.

6.1.3.1 Gibbs sampling

Gibbs sampling [19] is a special case of the Metropolis-Hastings algorithm [22]. The point of Gibbs sampling is that, given a multivariate distribution, it is simpler to sample from each conditional distribution than to compute the marginal distribution by integrating over the joint distribution. The Gibbs sampling algorithm is described in Algorithm 19:

Algorithm 19 Gibbs Sampling
1: z^(0) := ⟨z_1^(0), . . . , z_k^(0)⟩
2: for t ← 1 to T do
3:   for i ← 1 to k do
4:     z_i^(t) ∼ P(Z_i | z_1^(t), . . . , z_{i-1}^(t), z_{i+1}^(t-1), . . . , z_k^(t-1))
5:   end for
6: end for
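As a concrete illustration of Algorithm 19, the sketch below runs Gibbs sampling over a small set of binary variables whose unnormalized log-density is a weighted sum of features, as in Equation 6-1. The factor definitions, burn-in length, and helper names are illustrative assumptions, not the implementation used later in this chapter.

import math
import random

def gibbs_sample(n_vars, factors, num_samples, burn_in=100, rng=random.Random(0)):
    """Toy Gibbs sampler over n_vars binary variables.
    factors: list of (weight, feature) pairs, where feature(x) returns 0 or 1
    for an assignment x (a list of booleans)."""
    def log_score(x):
        return sum(w * f(x) for w, f in factors)

    x = [rng.random() < 0.5 for _ in range(n_vars)]  # random initial state
    samples = []
    for t in range(burn_in + num_samples):
        for i in range(n_vars):
            # Conditional of x_i given all other variables: compare the two
            # unnormalized scores obtained by setting x_i to True and False.
            x[i] = True
            p_true = math.exp(log_score(x))
            x[i] = False
            p_false = math.exp(log_score(x))
            x[i] = rng.random() < p_true / (p_true + p_false)
        if t >= burn_in:
            samples.append(list(x))
    return samples

# The example MLN from Section 6.1.2, encoded as two weighted features.
factors = [(0.96, lambda x: int(x[0])),
           (1.40, lambda x: int((not x[0]) or x[1]))]
samples = gibbs_sample(2, factors, num_samples=5000)
print(sum(s[0] for s in samples) / float(len(samples)))  # approximate marginal of born_in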

Algorithm 19 begins with some initial value, which can be chosen randomly. Each variable is sampled from the distribution of that variable conditioned on all other variables, making use of the most recent values and updating the variable with its new value. The marginal probability of any variable can be approximated by averaging over all the samples of that variable. Usually, some number of samples at the beginning (the burn-in period) are ignored,

and then the values of the remaining samples are averaged to compute the expectation. Gibbs sampling is implemented in state-of-the-art statistical relational learning and probabilistic logic inference software packages [52, 70].

6.1.3.2 MC-SAT

In real-world datasets, a considerable number of Markov logic rules are deterministic. Deterministic dependencies break a probability distribution into disconnected regions. When deterministic rules are present, Gibbs sampling tends to be trapped in a single region and never converges to the correct answer. MC-SAT [77] solves the problem by wrapping a procedure around the SampleSAT uniform sampler that enables it to sample from highly non-uniform distributions over satisfying assignments.

Algorithm 20 MC-SAT(clauses, weights, num_samples)
1: x^(0) ← Satisfy(hard clauses)
2: for i ← 1 to num_samples do
3:   M ← ∅
4:   for all c_k ∈ clauses satisfied by x^(i-1) do
5:     with probability 1 − e^(−w_k) add c_k to M
6:   end for
7:   sample x^(i) ∼ U_SAT(M)
8: end for
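The following toy sketch of Algorithm 20 is for illustration only: the SampleSAT uniform sampler is replaced by exhaustive enumeration over assignments, which is feasible only for a handful of variables, and the clause encoding and helper names are assumptions rather than the systems used later in this chapter.

import math
import random
from itertools import product

def mc_sat(n_vars, clauses, num_samples, rng=random.Random(0)):
    """Toy MC-SAT over n_vars boolean variables.
    clauses: list of (weight, clause); a clause is a list of (var_index, value)
    literals and is satisfied when any literal matches the assignment.
    Hard clauses use weight float('inf')."""
    def satisfied(clause, x):
        return any(x[i] == val for i, val in clause)

    assignments = list(product([False, True], repeat=n_vars))
    hard = [c for w, c in clauses if math.isinf(w)]
    # x^(0) <- Satisfy(hard clauses): pick any assignment satisfying the hard clauses.
    x = next(a for a in assignments if all(satisfied(c, a) for c in hard))

    samples = []
    for _ in range(num_samples):
        # Keep each currently-satisfied clause with probability 1 - e^(-w).
        m = [c for w, c in clauses
             if satisfied(c, x) and rng.random() < 1.0 - math.exp(-w)]
        # Sample uniformly from the assignments satisfying every clause in M
        # (a brute-force stand-in for the SampleSAT uniform sampler).
        candidates = [a for a in assignments if all(satisfied(c, a) for c in m)]
        x = rng.choice(candidates)
        samples.append(x)
    return samples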

For a more detailed discussion of MC-SAT, we refer the reader to the original publication [77].

6.1.4 Linking Facts in a Knowledge Base

The linking of facts is itself knowledge. To extract knowledge from large sets of facts we can use the linked structure created through probabilistic rules. This connection is only one method that may be used to link facts. We enumerate four methods that link concepts in a knowledge base: (a) rules; (b) records; (c) record linking; (d)

Rd-space. In a knowledge base, any fact may be represented as a single triple or a set of triples. We previously discussed how to perform inference over knowledge bases using rules (Section 6.1.2); this is the first way facts can be combined to extract information. Another well

known method for linking facts together is through SPARQL queries. Large triple stores (recall that each triple may be considered a fact) can be queried using a declarative query language. Data in these triple stores are organized as nodes and edges, where a node may be an entity or a value and edges represent relationships between two nodes. With SPARQL, users declaratively express the information they would like to search for in a triple store, and the language generates a mix of sub-tree templates and operations that produce the requested information. SPARQL is a powerful method for discovering and linking facts. The third method, record linking, describes the connection of the nodes and edges in a triple store to create a path of connected statements. The connection of facts causes multiple entities that were not previously linked to form a relationship. While these connections may be long and possibly non-informative, in bulk they provide information seekers with a summary of the connection between entities. We study an implementation of this method over a large knowledge base, in a relational database management system and also in a graph database.

The fourth method, (d) Rd-space, embeds facts into a vector space; knowledge base inference can then be performed over the vector space itself. There are several methods of encoding facts into vector spaces [50], but performing knowledge base operations over the vector space remains an open problem and an area of current research.

6.2 Fact Path Expansion Related Work

In this section, I describe three works related to fact path expansion. I first discuss the use of SPARQL queries to extract paths. Following that, I discuss two research projects that use paths in knowledge bases and on the web to extract information: path ranking from researchers at Carnegie Mellon University and fact ranking from Yahoo!.

6.2.1 SPARQL Query Path Search

Querying data over triples in the Resource Description Framework (RDF)1 format is performed using an SQL-like language called the SPARQL Protocol and RDF Query Language (SPARQL). To discover paths in RDF stores, SPARQL has defined property paths as part of the standard.2 The property path extension to SPARQL allows the specification of graph patterns of arbitrary length. Property path queries can define variable ends of paths to return all paths between nodes. We are most interested in a function to extract paths of any length called ArbitraryLengthPath; this definition can be found in the standard. In our use case, we do not store the knowledge base in the RDF format. To adapt our methods to an RDF data set we would need to perform approximate matching on RDF elements, and we would still need an external method of ranking the paths.

6.2.2 Path Ranking

Lao et al. investigated the ranking of paths in graphs and knowledge bases [54, 55]. They train a model to learn new instances of relations between two or more entities using paths. The researchers perform cross validation to rank the top returned paths using Mechanical Turk. The graph they describe connects entities using relations in the knowledge graph. Relations, or edges, are directed, and if a path traverses an edge in the opposite direction it is described as an inverse walk. For efficiency, paths are also constrained based upon rules and class types. For example, the researchers extract a path constrained by the Horn clause isa(x, c) ∧ isa(x′, c) ∧ AthletePlaysInLeague(x′, y) → AthletePlaysInLeague(x, y). A random walk is performed to discover common paths

1 http://www.w3.org/RDF/
2 SPARQL 1.1 standard and the property path definition: http://www.w3.org/TR/2010/WD-sparql11-query-20101014/#propertypaths.

and the top paths are ranked and returned. This work is essentially a type of rule learning over knowledge bases; the end product is a list of paths that are candidates for new logical rules. In this dissertation, it is not necessary to summarize the paths to create a new link. Instead, we look at the aggregate of all paths to make a statement about the source and destination entity. Our work primarily creates a graph over the similarity space, although the techniques can extend to links generated over the rule space or the graph space.

6.2.3 Fact Rank

Jain et al. explored a technique called FactRank that used similar entities in a knowledge base (factbase) to find influential and trustworthy facts [47]. Given a set of relations of interest, a graph is created based on all the facts that have the same subjects and objects. Links in the graph are bidirectional and are created when two facts have a matching subject or object. They create a modification of the PageRank algorithm [17] that consistently outperforms the traditional algorithm when discovering the most important facts. Our work also connects facts that have an overlapping subject or object, but we do not restrict paths to a set of relation types. We focus on computation time and the discovery of novel paths of facts.

6.3 Fact Path Expansion Algorithm

The goal of fact path expansion is to collect and rank the most representative paths between a source and a destination entity. Starting with a source entity, the algorithm performs a search through the knowledge base to extract candidate paths. The candidate paths are ranked based on the similarity of the matches along the path. We first formalize the definition of the algorithm and provide pseudocode. We separately discuss how to rank the output to obtain the most representative paths. After an explanation of the algorithm we describe different implementations of the algorithm over a popular knowledge base.

Given a knowledge base G that stores triples g_i = ⟨s_i, p_i, o_i⟩, g_i ∈ G, the algorithm takes a source query node e_s and a target node e_t. Step one of the algorithm is to find all the triples g ∈ G such that s_g ∼ e_s or o_g ∼ e_s. Each of these initial nodes is considered a start entity for a potential path. Only the subject s and object o are linked in the path exploration, and not the predicate p, because we are interested in the entities connecting facts. For the purposes of exploration we are less interested in the meaning of the predicate or the similarity of relations. The most recently discovered nodes are added to the working set W, where W ⊆ G. Next, we recursively expand each triple in the working set to look for triples that are similar to the working-set triple. That is, for each w = ⟨s_w, p_w, o_w⟩, we find all the triples g ∈ G such that s_w ∼ s_g or o_w ∼ o_g. The result of this step becomes the new working set. We also order the items in the new working set based on the sum of the calculated similarity scores of the current match and the previous path. After we sort the paths, we leave it to the end user to draw conclusions from the linked set of facts.

Algorithm 21 Formal algorithmic definition for fact path construction
Input: A knowledge base G. A query start entity e_s. A query target entity e_t.
Output: Linked paths P.
 1: V ← ∅                                      ▷ Visited set.
 2: W ← {g | s_g ∼ e_s or o_g ∼ e_s, g ∈ G}
 3: while W ≠ ∅ do
 4:   W′ ← {g | s_g ∼ s_w or o_g ∼ o_w, w ∈ W, g ∈ G}
 5:   W′ ← W′ \ V                              ▷ Remove all visited nodes.
 6:   V ← V ∪ W′
 7:   P ← P ∪ {g | e_t ∈ {s_g, o_g}, g ∈ W′}   ▷ Emit paths that reach the target.
 8:   W ← W′
 9: end while
10: return Sort(P)                             ▷ Assume all provenance is preserved.
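A minimal in-memory sketch of Algorithm 21 in Python follows. The similar() predicate stands in for the fuzzy match (∼) and defaults to exact equality here, and, following the SQL implementations later in this chapter, matches are checked on every subject/object combination; all names and the tiny example knowledge base are illustrative assumptions.

from collections import namedtuple

Triple = namedtuple("Triple", ["s", "p", "o"])

def expand_paths(kb, source, target, max_depth=4, similar=lambda a, b: a == b):
    """Sketch of Algorithm 21 over a list of Triple(s, p, o) facts."""
    # Initial working set: triples whose subject or object matches the source.
    frontier = [(t, [t]) for t in kb if similar(t.s, source) or similar(t.o, source)]
    visited = {t for t, _ in frontier}
    results = []
    depth = 0
    while frontier and depth < max_depth:
        next_frontier = []
        for w, path in frontier:
            for g in kb:
                if g in visited:
                    continue  # remove all visited nodes
                if similar(w.s, g.s) or similar(w.s, g.o) or \
                   similar(w.o, g.s) or similar(w.o, g.o):
                    visited.add(g)
                    new_path = path + [g]
                    if similar(g.s, target) or similar(g.o, target):
                        results.append(new_path)  # emit paths that reach the target
                    next_frontier.append((g, new_path))
        frontier = next_frontier
        depth += 1
    return results  # ranking (the Sort step) is applied separately, as described below

# Toy usage with hypothetical facts:
kb = [Triple("pot", "is legal in", "Colorado"), Triple("Colorado", "borders", "Kansas")]
print(expand_paths(kb, "pot", "Kansas"))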

Algorithm 21 describes the method for constructing the knowledge base paths. Notice that in each expansion step, only the most recently expanded nodes are added to the working set. The algorithm ends when the working set is empty or upon reaching a

maximum depth (not shown). The variable P represents the linked sets of paths traversed by the algorithm. The final Sort step sorts the paths by their connection similarity.

Figure 6-1. An example of the increase in the number of facts and relations over several timestamps.

Ranking Fact Paths. The set of candidate facts should be ranked by how representative they are of the set of paths connecting the two entities. A quality path is trustworthy, timely, representative, and relevant. It is important that each fact included in a path is true: an untrue fact, or a fact with a low probability of being correct, renders the resulting path arbitrary. Each fact is associated with an extraction time, but the fact may have actually occurred at a much different time; the extraction time of a fact is distinct from the true time range the fact represents. Figure 6-1 shows the number of connections nodes have over time. We see that the number of nodes and the number of edges increase over time, and these nodes, which represent facts, also increase in complexity. With facts changing over time, the meaning and possibly the truthfulness of facts may change. Figure 6-2 shows the change of probabilities over time. As facts are added, probabilities change, so it is important to take the latest probability into consideration.

Figure 6-2. A sample of nodes and their changing probabilities over time. The figure is darkened to show the many overlapping lines.

A representative set of paths is one whose paths are distinctive within the set of all possible paths and also summarize the candidate paths. We compute what is essentially a TF-IDF score over the relations and entities mentioned in the candidate set. Lastly, the path should be relevant, meaning we favor the more popular entities and relations; obscure relations or entities may not be useful. For each path, a pair of global node/edge scores and local node/edge scores are computed. These scores boost the paths in the candidate set that are most representative of the candidate set of paths and minimally representative of the global set. To that end we define GlobalNode and LocalNode as follows:

GlobalNode(n, N) = \sum_{n \in N_{path}} \frac{|n \in N| - \min_f(N)}{\max_f(N) - \min_f(N) + \epsilon}

LocalNode(n, N_{path}) = \sum_{n \in N_{path}} \frac{|n \in N_{path}| - \min_f(N_{path})}{\max_f(N_{path}) - \min_f(N_{path}) + \epsilon}

where Npath is the set of nodes in the candidate paths and N is the global set of nodes. Similarly, the GlobalEdge and LocalEdge are defined as follows:

GlobalEdge(e, E) = \sum_{e \in E_{path}} \frac{|e \in E| - \min_f(E)}{\max_f(E) - \min_f(E) + \epsilon}

LocalEdge(e, E_{path}) = \sum_{e \in E_{path}} \frac{|e \in E_{path}| - \min_f(E_{path})}{\max_f(E_{path}) - \min_f(E_{path}) + \epsilon}

where E_path is the set of edges in the candidate paths, E is the global set of edges, and ε is a small constant that avoids division by zero. For each path, these values are combined with the path length and timeliness to produce a score. Additionally, a truthfulness score of the path is computed; this is the probability that each fact in the path is correct. If the probability of a fact cannot be computed in the knowledge base, it is assumed to be true. Each of these values is also normalized, but for simplicity that is not shown. The equation for SCORE is:

SCORE(path) = \sum_{n \in N_{path}} \big( LocalNode(n, N) + (1 - GlobalNode(n, N)) \big)
            + \sum_{e \in E_{path}} \big( LocalEdge(e, E) + (1 - GlobalEdge(e, E)) \big)
            + pathLength(path) + truthfulness(path) + timeliness(path)

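A sketch of this scoring in Python is shown below. The truthfulness and timeliness components are stubbed with neutral placeholder values and the normalization details are simplified, so the sketch illustrates only the representativeness terms of the SCORE equation above; all names are illustrative.

from collections import Counter

def normalized_freq(item, counts, eps=1e-6):
    """(frequency of item - min frequency) / (max - min + eps), as in the
    GlobalNode/LocalNode definitions above."""
    lo, hi = min(counts.values()), max(counts.values())
    return (counts[item] - lo) / (hi - lo + eps)

def score_path(path, candidate_paths, global_nodes, global_edges):
    """Sketch of SCORE(path). A path is a list of (s, p, o) triples;
    global_nodes and global_edges are Counters over the whole knowledge base;
    candidate_paths is the full candidate set returned for this query."""
    local_nodes = Counter(x for p in candidate_paths for (s, _, o) in p for x in (s, o))
    local_edges = Counter(r for p in candidate_paths for (_, r, _) in p)

    node_term = sum(normalized_freq(x, local_nodes) +
                    (1 - normalized_freq(x, global_nodes))
                    for (s, _, o) in path for x in (s, o))
    edge_term = sum(normalized_freq(r, local_edges) +
                    (1 - normalized_freq(r, global_edges))
                    for (_, r, _) in path)
    truthfulness = 1.0  # product of per-fact probabilities; unknown facts assumed true
    timeliness = 1.0    # placeholder for the recency component
    return node_term + edge_term + len(path) + truthfulness + timeliness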
6.4 Joint Inference of Path Probabilities

Discovering the probability of a grounded atom in a database is a well-studied problem. Tools such as Alchemy and GraphLab provide distributed and scalable implementations of inference for similar models [38]. The grounded atoms have weights associated with them (the Π column in the ProbKB database). To score the likelihood of the path query corresponding to the available facts in the query, we compute the joint probability.

For computational efficiency we assume that each fact in the path is independent. We can then compute the joint probability with the following formula:

P(X = x) = \frac{1}{Z} \exp\Big( \sum_i W_i n_i(x) \Big)

where x is the set of ground atoms, n_i(x) is the number of true groundings of the rule F_i in x, W_i is its weight, and Z is the normalization constant. Computing this value is computationally intractable, so we embed Gibbs sampling techniques in the database to compute the answers within the ProbKB framework.

6.4.1 Fuzzy Querying

In order to create paths between facts we look for techniques to find approximate matches to the set of facts. The possible number of paths is exponential, so much care must be taken when selecting paths. Graph databases can represent facts and materialize connections to make traversals of complex graphs efficient. Because graph databases assume materialized and exact graphs, they do not provide any special benefit for the traversal of fuzzy graphs. Search engines such as Lucene are optimized for the discovery of approximate results over text documents; however, Lucene is not a data storage system, so it would need to be paired with a storage system to perform approximate searches and traversals. PostgreSQL Full-Text Search (FTS) provides full-text search capabilities inside the PostgreSQL database management system. This system combines storage with fuzzy searching capabilities. Queries are logically performed one step at a time, which can make the physical query plans expensive. In this work, we use database techniques to find matches and the ranking utilities of the PostgreSQL FTS to assist in ranking the results.

6.4.2 PostgreSQL Fact Path Expansion Algorithm

Algorithm 22 describes the path ranking algorithm implemented in PostgreSQL. This algorithm issues one query per step. For each step, the algorithm checks for new nodes by looking for string overlap matches in the subject and object columns. Additionally, each step ensures that no cycle exists by ensuring any new vertex is distinct

in the current path (i.e., no loops). The results of each previous node are down-sampled to fewer than ten percent of the result nodes. For this implementation, the s and o columns are stored as PostgreSQL text-search values, that is, stemmed versions of the original strings, with matching query forms in the qs and qo columns. These extra columns make it possible to perform fuzzy matches at each step. The ts_rank function computes the compatibility between two values. For more information on the PostgreSQL full-text search capability we direct the reader to the PostgreSQL documentation.3 Moving forward, to make the implementation a little more tractable for a small number of hops we make three optimizations: (1) we may allow cycles in the intermediate sets, (2) only exact string matches may form connections, and (3) we can sample, or sort and rank, the candidate paths at each hop. Checking for cycles at each step is expensive but it does keep the size of the intermediate path sets low. Performing a fuzzy search in the form of a string token overlap is expensive and also results in noisy output. In practice, instead of performing fuzzy searches we can choose multiple start nodes, one for each canonical representation of the string. Lastly, at the cost of not receiving all possible candidates, we can downsample the intermediate nodes. Downsampling at each step reduces redundancy, especially during the later hops. Sorting and ranking the paths at each step is a recursive step in a greedy method for choosing the top-k paths at any point. Unfortunately, this process can also become expensive. Algorithm 23 describes the PostgreSQL recursive method for discovering paths when facts are described in a triple table. This method first uses the PostgreSQL FTS to find a set of starting nodes. Next, a set of candidate end nodes is also searched for in the triple table. Then, the system recursively looks for paths that are connected to the start nodes by either a matching subject (s) or object (o).

3 http://www.postgresql.org/docs/current/static/textsearch.html

Algorithm 22 PostgreSQL code for fact path expansion over two hops with fuzzy joins

-- triples(docid, s, p, o)

WITH start_nodes AS (
  SELECT t.docid, 0 AS level, t.docid::text AS path, t.docid AS endid,
         s, o, qs, qo, ARRAY[docid] AS apath,
         '<' || subject || '|' || predicate || '|' || object || '>' AS statement,
         ts_rank(s, plainto_tsquery('marijuana'))
           + ts_rank(o, plainto_tsquery('marijuana')) AS rank
    FROM triples t
   WHERE (t.s @@ plainto_tsquery('marijuana') OR t.o @@ plainto_tsquery('marijuana'))
),
onehop AS (
  SELECT p1.docid AS docid, p1.level + 1 AS level,
         p1.path || ',' || t.docid::text AS path, t.docid AS endid,
         t.s, t.o, t.qs, t.qo,
         array_append(p1.apath, t.docid) AS apath,
         statement || '-->' || '<' || subject || '|' || predicate || '|' || object || '>' AS statement,
         p1.rank + ts_rank(t.s, p1.qo) + ts_rank(t.o, p1.qs)
                 + ts_rank(t.o, p1.qo) + ts_rank(t.s, p1.qs) AS rank
    FROM triples AS t,
         (SELECT docid, level, path, endid, s, o, qs, qo, apath, statement, rank
            FROM start_nodes WHERE random() < 0.1) AS p1
   WHERE p1.docid < t.docid
     AND (p1.s = t.s OR p1.s = t.o OR p1.o = t.s OR p1.o = t.o)
     AND NOT p1.apath @> ARRAY[t.docid]
),
twohop AS (
  SELECT p1.docid AS docid, p1.level + 1 AS level,
         p1.path || ',' || t.docid::text AS path, t.docid AS endid,
         t.s, t.o, t.qs, t.qo,
         array_append(p1.apath, t.docid) AS apath,
         statement || '-->' || '<' || subject || '|' || predicate || '|' || object || '>' AS statement,
         p1.rank + ts_rank(t.s, p1.qo) + ts_rank(t.o, p1.qs)
                 + ts_rank(t.o, p1.qo) + ts_rank(t.s, p1.qs) AS rank
    FROM triples AS t,
         (SELECT docid, level, path, endid, s, o, qs, qo, apath, statement, rank
            FROM onehop WHERE random() < 0.1) AS p1
   WHERE p1.docid < t.docid
     AND (p1.s = t.s OR p1.s = t.o OR p1.o = t.s OR p1.o = t.o)
     AND NOT p1.apath @> ARRAY[t.docid]
)
SELECT * FROM twohop;

After a fixed number of iterations, the final paths are joined with the end nodes, and only paths that can join with the end nodes are preserved.
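Algorithm 22 above and Algorithm 23 below assume that the triples table already carries text-search columns: s and o as stemmed tsvector values and qs and qo as the corresponding tsquery values. That preparation is not shown in this chapter; the following sketch is one possible way to do it, where the plain-text subject/predicate/object columns, the connection settings, the 'english' configuration, and the index names are all assumptions.

import psycopg2

conn = psycopg2.connect(dbname="probkb", user="postgres")  # hypothetical settings
cur = conn.cursor()

# Assumed base schema: triples(docid, subject, predicate, object).
cur.execute("""
    ALTER TABLE triples
      ADD COLUMN s  tsvector, ADD COLUMN o  tsvector,
      ADD COLUMN qs tsquery,  ADD COLUMN qo tsquery
""")

# Populate the stemmed search vectors and their matching query forms.
cur.execute("""
    UPDATE triples
       SET s  = to_tsvector('english', subject),
           o  = to_tsvector('english', object),
           qs = plainto_tsquery('english', subject),
           qo = plainto_tsquery('english', object)
""")

# GIN indexes speed up the @@ matches used to find the start and end nodes.
cur.execute("CREATE INDEX triples_s_idx ON triples USING gin(s)")
cur.execute("CREATE INDEX triples_o_idx ON triples USING gin(o)")
conn.commit()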

Algorithm 23 PostgreSQL code for recursive fact path expansion with fuzzy search for a ‘pot’ starting entity and a ‘fox news’ end entity

-- triples(docid, s, p, o)

WITH RECURSIVE start_nodes AS (
  SELECT docid, 0, docid::text
    FROM triple t
   WHERE t.s @@ plainto_tsquery('pot') OR t.o @@ plainto_tsquery('pot')
),
end_nodes (term, level, path) AS (
  SELECT docid, 0, docid::text
    FROM triple t
   WHERE t.s @@ plainto_tsquery('fox news') OR t.o @@ plainto_tsquery('fox news')
),
paths (term, level, path) AS (
  SELECT * FROM start_nodes
  UNION ALL
  SELECT t1.docid, level + 1, p.path::text || ',' || t1.docid::text
    FROM paths AS p, triple AS t1, triple AS t2
   WHERE t1 <> t2
     AND t2.docid = p.term
     AND (t2.qs @@ t1.s OR t2.qs @@ t1.o OR t2.qo @@ t1.s OR t2.qo @@ t1.o)
)
SELECT term, level, path
  FROM paths p
 WHERE level < 4
   AND p.term IN (SELECT term FROM end_nodes);

Alternatively, the triple data can be stored in an adjacency-list format. That is, all nodes are stored in a 'nodes' table and all edges are stored in an 'edges' table. With this node-list representation it is straightforward to express a query as a self-join over the edge table. Algorithm 24 describes a non-recursive version of fact path expansion over the adjacency-list representation; the query walks a path involving seven facts with the start and end facts being specified.
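Algorithm 24 below assumes that the nodes and edges tables already exist. One way they could be derived from a plain-text triples table is sketched here; the column names subject/predicate/object, the serial key, and the connection settings are assumptions.

import psycopg2

conn = psycopg2.connect(dbname="probkb", user="postgres")  # hypothetical settings
cur = conn.cursor()

# Build the node list: one row per distinct subject or object term.
cur.execute("CREATE TABLE nodes (id serial PRIMARY KEY, term text UNIQUE)")
cur.execute("""
    INSERT INTO nodes (term)
    SELECT subject FROM triples
    UNION
    SELECT object FROM triples
""")

# Build the edge list by resolving each triple's endpoints to node ids.
cur.execute("""
    CREATE TABLE edges (src int REFERENCES nodes(id),
                        dst int REFERENCES nodes(id),
                        edge text)
""")
cur.execute("""
    INSERT INTO edges (src, dst, edge)
    SELECT ns.id, nd.id, t.predicate
      FROM triples t
      JOIN nodes ns ON ns.term = t.subject
      JOIN nodes nd ON nd.term = t.object
""")
conn.commit()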

Algorithm 24 PostgreSQL code for fact path expansion using database self joins over an edge table and ‘pot’ as starting entity and a ‘Fox News’ target entity

-- nodes(id, term)
-- edges(src, dst, edge)

CREATE OR REPLACE FUNCTION getFact(src int, dst int) RETURNS TEXT AS $$
  SELECT '<' || nsrc.term || ',' || e.edge || ',' || ndst.term || '>'
    FROM edges e, nodes nsrc, nodes ndst
   WHERE e.src = $1 AND e.dst = $2
     AND nsrc.id = $1 AND ndst.id = $2
$$ LANGUAGE SQL IMMUTABLE;

SELECT getFact(g1.src, g1.dst), getFact(g1.dst, g2.src),
       getFact(g2.src, g2.dst), getFact(g2.dst, g3.src),
       getFact(g3.src, g3.dst), getFact(g3.dst, g4.src),
       getFact(g4.src, g4.dst)
  FROM edges g1, edges g2, edges g3, edges g4
 WHERE g1.src = (SELECT id FROM nodes WHERE term = 'pot')
   AND g1.dst = g2.src
   AND g2.dst = g3.src
   AND g3.dst = g4.src
   AND g4.dst = (SELECT id FROM nodes WHERE term = 'Fox News');

6.4.3 Graph Database Query

With the assumption of a linked graph, a graph database is a new option for searching paths. The Titan Graph Database4 is a distributed graph database that can support billions of edges and vertices across machines. It is also a transactional database, so information can be stored and queried by many users. Titan is packaged with a query language called Gremlin that allows path and subgraph queries to be easily expressed (when compared to SQL). The Titan graph database physically represents graphs in a manner similar to the node-list format described above. Nodes are stored in a list sorted by identifier (id), although another index may be used to obtain a different sort order. Each node links to a sorted set of edges and properties. Keeping the edges sorted requires some maintenance, but there is a good trade-off with query performance. Each edge contains the label identifier and a bit for the direction of the label, followed by the sort key, the adjacent node identifier, the edge identifier, and all other properties. Algorithm 25 shows a k-hop path query in Gremlin. Each path traversal that needs to be performed is expressed in the Gremlin Java or Groovy language API, and an optimizer performs optimizations to improve the traversal. The algorithm first searches for all the nodes in the graph whose term label equals the source term. The algorithm then walks the graph to find all the outgoing edges followed by all the incoming vertices. This process is repeated max_path times. The final candidate paths are filtered to contain only those whose final node term equals the destination term. It is also noted that Cypher is an SQL-like declarative language that would allow paths to be expressed more declaratively over a Titan graph. Cypher can be used over arbitrary triples in much the same way SPARQL is used over RDF graphs. The Gremlin language is simpler to set up and is sufficient to perform all of our path queries.

4 https://github.com/thinkaurelius/titan

Algorithm 25 Gremlin code to find all the paths that start at a vertex named src, end at a vertex named dst, and are shorter than max_path

def khopVertices = g.V('term', src)       // Get the starting nodes
    .outE.inV.random(sample)              // Down-sample the current paths
    .loop(max_path){it.loops < max_path}
    .filter{dst == it.term}
    .simplePath                           // Remove cycles

6.4.4 Fact Path Expansion Complexity

Deciding whether a simple path with at least a given number of edges exists between two nodes is an NP-complete problem [24]. The problem can be verified in polynomial time with a non-deterministic Turing machine, and a known NP-complete problem, the Hamiltonian cycle problem, can be reduced to it. A search for a Hamiltonian cycle asks "Is there a simple cycle in the graph that visits every vertex?" The fact path expansion problem has added complexity in that a search is performed at each hop. All the paths of the graph are not known a priori; this is a weighted simple path problem. The node-list representation has a slightly higher space complexity than the triple method. The space complexity of the triple method is O(E), where E is the number of edges. The space complexity of the node-list method is O(|V| + E), where |V| is the number of nodes and E is the number of edges. The time complexity of fact path expansion is the same as that of a depth-first search algorithm, O(b^k), where b is the branching factor of the graph and k is the number of hops; for instance, with an average branching factor of b = 50 and k = 3 hops, the search may touch on the order of 50^3 = 125,000 paths. The sort step of the algorithm is O(p · log p), where p is the number of paths returned. Prior work by Rubin describes the use of Warshall's theorem to enumerate all the simple paths in a graph represented as a matrix [80]. The proposed method has a complexity of O(N^3), where N is the number of vertices. However, such techniques assume that the graph has no self-loops and no multiple edges; this is an assumption that we are

not able to make. Additionally, the large size of the knowledge graphs makes the all-pairs approach unsuitable.

Table 6-1. The frequency of each term in our cleaned Reverb data set
  Term        Frequency
  Biden          931
  Brutality       41
  Fox News       574
  Marijuana      752
  Reddit          95

6.5 Fact Path Expansion Experiments

In this section, we perform experiments to better understand the trends and performance of fact path expansion in both graph and relational databases. We first describe the data set we use and the queries performed over it. We then discuss the timing experiments over both systems. In Table 6-1 we list the frequencies of terms used in the experiments. Note that Biden has the most mentions in our set while the term Brutality has the fewest. We selected words that may be of interest to people observing conversations about elections.

Runtime of Methods. Using full-text search techniques to link knowledge base elements can effectively explore the space of possible links. Exact matching, while significantly quicker than the full-text search approach, lowers the recall of the exploration. The choice between the two methods depends on how quickly a user would like results.

Figure 6-3. Fact Path Expansion queries over the Titan Graph DB.

Figures 6-3 and 6-4 show the runtimes of the Fact Path Expansion algorithms. The Java Virtual Machine (JVM) is set to allocate a maximum of 8 GB to ensure the full graph fits in memory. PostgreSQL is also given 8 GB of working memory. The performance of the queries in the graph database reflects its time complexity: there is an exponential increase in run time for each hop because of the large branching factor. Titan is not able to recognize or cache paths, so there is no advantage for repeated or incremental path queries.

Figure 6-4. Fact Path Expansion queries over PostgreSQL

Figure 6-4 shows the runtime of the Fact Path Expansion algorithm in the relational database. The effective cache size of the database is 8 GB, so the graph is able to fit in memory. With native SQL queries, PostgreSQL is able to aggressively perform

each join and construct the path. When the same query is subsequently run for a longer path there is a slight decrease in run time because PostgreSQL is able to use the previous query as a partial result. The database results are significantly better for queries of more than two hops. A one-hop query in the graph database is slightly faster than in the relational database.

Figure 6-5. PostgreSQL results of Fact Path Expansion queries with reset database cache

Figure 6-5 shows the same experiment as Figure 6-4 but with the cache reset after each run. The performance decrease at hop 4 is much smaller. We overlay the aggregate performance of the graph database experiment and the relational database without the cache in Figure 6-6. We see that the mature PostgreSQL database is able to aggressively optimize longer-running queries.

Figure 6-6. Comparison of the experiments with Titan DB and PostgreSQL without cache

6.6 Fact Path Expansion Summary

In this chapter, we describe the theory and implementation of a version of the NP-complete minimum path problem and its extension over knowledge bases. We showed an implementation inside the PostgreSQL relational database and another inside the Titan graph database. We also described the parameters used in ranking the resulting paths, and we showed timing results and example paths. This work will be continued with further and deeper analysis, including instrumentation of the Titan database cache and indexes. The behavior of the PostgreSQL DBMS is well understood, but not in comparison to graph databases. We will also perform the same experiments over other graph and relational databases. A particularly interesting direction is query optimization across recursive calls in different database types.

CHAPTER 7
CONCLUSIONS

This dissertation focuses on research demonstrating the query-driven text analytics paradigm. It shows general contributions to the database community through three projects. First, it shows how statistical text analytics can be performed in the database, improving analytic work flows; this group of work was one of the first to perform in-database text analytics. Second, this work changes a popular algorithm, entity resolution, to make it aware of the queries, thereby improving computation time; the results showed orders of magnitude improvement over the baseline. Finally, I present an algorithm for extracting the most representative facts connecting two knowledge base entities. This fact path expansion method ranks knowledge base paths by truthfulness, relevance, timeliness, and representativeness. The query-driven techniques can be applied to more interactive scenarios. For example, query-driven techniques can be used to develop infrastructure to support multi-user problem solving. These next-generation interactive systems shall (1) allow users to examine the progress of the algorithms at any point in the life of the application; (2) allow users to intervene in order to improve or redirect the algorithm using low-latency interactions; and (3) allow multiple users to be added in order to concurrently monitor the progress of algorithms. Each of these requirements advocates for the adoption of query-driven techniques. This dissertation defines query-driven techniques for text analytics and lays a foundation for user-focused, data-centric research with critical user requirements.

125 REFERENCES [1] L. Aguiar and J. A. Friedman. Predictive coding: What it is and what you need to know about it. http://newsandinsight.thomsonreuters.com/Legal/Insight/2013. [2] H. Altwaijry, D. V. Kalashnikov, and S. Mehrotra. Query-driven approach to entity resolution. Proceedings of the VLDB Endowment, 6(14), 2013. [3] A. Arasu, R. Christopher, and D. Suciu. Large-scale deduplication with constraints using dedupalog. In International Conference on Data Engineering, pages 952–963. IEEE, 2009. [4] S. Arumugam, A. Dobra, C. M. Jermaine, N. Pansare, and L. Perez. The datapath system: A data-centric analytic processing engine for large data warehouses. In Proc. of the 2010 ACM SIGMOD, pages 519–530, NY, USA, 2010. ACM. [5] R. Avnur and J. M. Hellerstein. Eddies: Continuously adaptive query processing. In ACM SIGMoD Record, volume 29, pages 261–272. ACM, 2000. [6] C. Badica, A. Badica, and E. Popescu. A new path generalization algorithm for html wrapper induction. In Advances in Web Intelligence and Data Mining, pages 11–20. Springer, 2006. [7] A. Bagga and B. Baldwin. Entity-based cross-document coreferencing using the vector space model. In 17th ACL, pages 79–85. ACL, 1998. [8] K. Barrett, B. Cassels, P. Haahr, D. A. Moon, K. Playford, and P. T. Withington. A monotonic superclass linearization for dylan. In OOPSLA, pages 69–82, 1996. [9] K. Bellare, C. Curino, A. Machanavajihala, P. Mika, M. Rahurkar, and A. Sane. Woo: A scalable and multi-tenant platform for continuous knowledge base synthesis. Proc. VLDB Endow., 6(11):1114–1125, Aug. 2013. [10] I. Bhattacharya and L. Getoor. Collective entity resolution in relational data. ACM Trans. KDD, 1(1), Mar. 2007. [11] I. Bhattacharya, L. Getoor, and L. Licamele. Query-time entity resolution. In Proc.12th ACM SIGKDD, KDD ’06, pages 529–534, NY, USA, 2006. [12] S. Bird, E. Loper, and E. Klein. Natural language processing with python. In Natural Language Processing with Python. O’Reilly Media Inc, 2009. [13] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. Dbpedia - a crystallization point for the web of data. Web Se- mantics: Science, Services and Agents on the WWW, 7(3):154–165, September 2009. [14] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, Mar. 2003.

126 [15] P. Bohannon, S. Merugu, C. Yu, V. Agarwal, P. DeRose, A. Iyer, A. Jain, V. Kakade, M. Muralidharan, R. Ramakrishnan, and W. Shen. Purple sox extraction management system. SIGMOD Rec., 37(4):21–27, Mar. 2009. [16] J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin. Parallel coordinate descent for l1-regularized loss minimization. In L. Getoor and T. Scheffer, editors, ICML, pages 321–328. Omnipress, 2011. [17] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh International Conference on 7, WWW7, pages 107–117, Amsterdam, The Netherlands, The Netherlands, 1998. Elsevier Science Publishers B. V. [18] M. Br¨ocheler, L. Mihalkova, and L. Getoor. Probabilistic similarity logic. In P. Gr¨unwald and P. Spirtes, editors, UAI, pages 73–82. AUAI Press, 2010. [19] G. Casella and E. I. George. Explaining the gibbs sampler. The American Statisti- cian, 46(3):167–174, 1992. [20] A. Chechetka and C. Guestrin. Focused belief propagation for query-specific inference. In AISTATS, May 2010. [21] Y. Chen and D. Z. Wang. Knowledge expansion over probabilistic knowledge bases. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD ’14, pages 649–660, New York, NY, USA, 2014. ACM. [22] S. Chib and E. Greenberg. Understanding the metropolis-hastings algorithm. The American Statistician, 49(4):327–335, 1995. [23] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton. Mad skills: new analysis practices for big data. Proc. VLDB Endow., 2(2):1481–1492, Aug. 2009. [24] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algo- rithms. MIT press, 2009. [25] M. Cowles and B. Carlin. Markov chain monte carlo convergence diagnostics: a comparative review. Journal of AmStat, 91(434):883–904, 1996. [26] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, N. Aswani, I. Roberts, G. Gorrell, A. Funk, A. Roberts, D. Damljanovic, T. Heitz, M. A. Greenwood, H. Saggion, J. Petrak, Y. Li, and W. Peters. Text Processing with GATE (Version 6). 2011. [27] J. Dalton, J. R. Frank, E. Gabrilovich, M. Ringgaard, and A. Subramanya. Fakba1: Freebase annotation of trec kba stream corpus, version 1 (release date 2015-01-26, format version 1, correction level 0), January 2015.

127 [28] A. Das Sarma, A. Jain, A. Machanavajjhala, and P. Bohannon. An automatic blocking mechanism for large-scale de-duplication tasks. In Proceedings of the 21st ACM CIKM, pages 1055–1064. ACM, 2012. [29] X. Dong, A. Halevy, and J. Madhavan. Reference reconciliation in complex information spaces. In Proceedings of the 2005 ACM SIGMOD international conference on Management of data, SIGMOD ’05, pages 85–96, New York, NY, USA, 2005. ACM. [30] H. Elmeleegy, J. Madhavan, and A. Halevy. Harvesting relational tables from lists on the web. Proc. VLDB Endow., 2(1):1078–1089, 2009. [31] X. Feng, A. Kumar, B. Recht, and C. R´e. Towards a unified architecture for in-rdbms analytics. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD ’12, pages 325–336, New York, NY, USA, 2012. ACM. [32] P. Flajolet, E.´ Fusy, O. Gandouet, and F. Meunier. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. DMTCS Proceedings, 0(1), 2008. [33] T. A. S. Foundation. Apache solr. http://lucene.apache.org/solr. [34] J. R. Frank, S. J. Bauer, M. Kleiman-Weiner, D. A. Roberts, N. Tripuraneni, C. Zhang, C. Re, E. Voorhees, and I. Soboroff. Evaluating stream filtering for entity profile updates for trec 2013 (kba track overview). Technical report, DTIC Document, 2013. [35] J. Gantz and D. Reinsel. The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east. IDC iView: IDC Analyze the Future, 2012. [36] I. Getoor. A latent dirichlet model for unsupervised entity resolution. In Proceedings of the 6th SIAM International Conference on Data Mining, volume 124, page 47. Society for Industrial Mathematics, 2006. [37] L. Getoor and A. Machanavajjhala. Entity resolution: Theory, practice & open challenges. In Proceedings of the 38rd VLDB, VLDB ’12. VLDB Endowment, 2012. [38] J. E. Gonzalez, Y. Low, C. Guestrin, and D. O’Hallaron. Distributed parallel inference on large factor graphs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 203–212. AUAI Press, 2009. [39] D. Graff. Ldc2007t07: English gigaword corpus, 2007. [40] C. Grant, C. P. George, J.-d. Gumbs, J. N. Wilson, and P. J. Dobbins. Morpheus: A deep web question answering system. In Proceedings of the 12th International Conference on Information Integration and Web-based Applications & Services, iiWAS ’10, pages 841–844, New York, NY, USA, 2010. ACM.

128 [41] C. E. Grant, J.-d. Gumbs, K. Li, D. Z. Wang, and G. Chitouras. Madden: query-driven statistical text analytics. In Proceedings of the 21st ACM interna- tional conference on Information and , CIKM ’12, pages 2740–2742, New York, NY, USA, 2012. ACM. [42] L. Gravano, P. Ipeirotis, H. Jagadish, N. Koudas, S. Muthukrishnan, L. Pietarinen, and D. Srivastava. Using q-grams in a dbms for approximate string processing. IEEE Data Engineering Bulletin, 24(4):28–34, 2001. [43] B. Green, A. Wolf, C. Chomsky, and K. Laughery. Baseball: an automatic question answerer. In Proc of the Western Joint Computer Conference, volume 19, pages 219–224, San Francisco, CA, USA, 1961. Morgan Kaufmann Publishers Inc. [44] J. M. Hellerstein, C. R´e,F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, and A. Kumar. The madlib analytics library: or mad skills, the sql. Proceedings of the VLDB Endowment, 5(12):1700–1711, Aug. 2012. [45] J. M. Hellerstein, C. R´e,F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, and A. Kumar. The madlib analytics library: or mad skills, the sql. Proc. VLDB Endow., 5(12):1700–1711, Aug. 2012. [46] A. Jain, P. Ipeirotis, and L. Gravano. Building query optimizers for information extraction: the sqout project. SIGMOD Rec., 37:28–34, March 2009. [47] A. Jain and P. Pantel. Factrank: Random walks on a web of facts. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING ’10, pages 501–509, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. [48] R. Jampani, F. Xu, M. Wu, L. L. Perez, C. Jermaine, and P. J. Haas. Mcdb: A monte carlo approach to managing uncertain data. In Proc. of the 2008 ACM SIGMOD, pages 687–700, NY, USA, 2008. ACM. [49] D. V. Kalashnikov and S. Mehrotra. Domain-independent data cleaning via analysis of entity-relationship graph. ACM Trans. Database Syst., 31(2):716–767, June 2006. [50] R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba, R. Urtasun, and S. Fidler. Skip-thought vectors. arXiv preprint arXiv:1506.06726, 2015. [51] D. E. Knuth. Ancient babylonian algorithms. Commun. ACM, 15(7):671–677, July 1972. [52] S. Kok, P. Singla, M. Richardson, P. Domingos, M. Sumner, H. Poon, and D. Lowd. The alchemy system for statistical relational ai. University of Washington, Seattle, 2005. [53] D. Koller and N. Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.

129 [54] N. Lao and W. W. Cohen. Relational retrieval using a combination of path-constrained random walks. Machine learning, 81(1):53–67, 2010. [55] N. Lao, T. Mitchell, and W. W. Cohen. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 529–539. Association for Computational Linguistics, 2011. [56] G. Lee, J. Lin, C. Liu, A. Lorek, and D. Ryaboy. The unified logging infrastructure for data analytics at twitter. Proc. VLDB Endow., 5(12):1771–1780, Aug. 2012. [57] K. Li, C. Grant, D. Z. Wang, S. Khatri, and G. Chitouras. Gptext: Greenplum parallel statistical text analysis framework. In Proceedings of the Second Workshop on Data Analytics in the Cloud, DanaC ’13, pages 31–35, New York, NY, USA, 2013. ACM. [58] K. Li, C. Grant, D. Z. Wang, S. Khatri, and G. Chitouras. Gptext: Greenplum parallel statistical text analysis framework. In Proceedings of the Second Workshop on Data Analytics in the Cloud, pages 31–35. ACM, 2013. [59] X. Li, P. Morie, and D. Roth. Identification and tracing of ambiguous names: Discriminative and generative approaches. In Proceedings of the National Conference on Artificial Intelligence, pages 419–424. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, 2004. [60] Y. Li, F. R. Reiss, and L. Chiticariu. Systemt: a declarative information extraction system. In Proceedings of the 49th Annual Meeting of the Association for Compu- tational Linguistics: Human Language Technologies: Systems Demonstrations, HLT ’11, pages 109–114, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics. [61] P. Liang, M. I. Jordan, and D. Klein. Type-based mcmc. In Human Language Technologies: The 2010 NAACL, HLT ’10, pages 573–581, Stroudsburg, PA, USA, 2010. ACL. [62] J. Madhavan, S. Jeffery, S. Cohen, X. Dong, D. Ko, C. Yu, and A. Halevy. Web-scale : You can only afford to pay as you go. In Proceedings of CIDR, pages 342–350, 2007. [63] C. D. Manning. Part-of-speech tagging from 97% to 100%: is it time for some linguistics? In Proceedings of the 12th international conference on Computational linguistics and intelligent text processing - Volume Part I, CICLing’11, pages 171–189, Berlin, Heidelberg, 2011. Springer-Verlag. [64] A. McCallum, K. Nigam, and L. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Proc. of the 6th SIGKDD, pages 169–178, 2000.

130 [65] A. McCallum, K. Schultz, and S. Singh. FACTORIE: Probabilistic programming via imperatively defined factor graphs. In NIPS, pages 1426–1427, 2009. [66] A. Mccallum and B. Wellner. Toward conditional models of identity uncertainty with application to proper noun coreference. In In NIPS, pages 905–912. MIT Press, 2003. [67] A. Mccallum and B. Wellner. Conditional Models of Identity Uncertainty with Application to Noun Coreference. In NIPS, 2004. [68] J. Morales and J. Nocedal. Automatic preconditioning by limited memory quasi-newton updating. SIAM Journal on Optimization, 10(4):1079–1096, 2000. [69] H. Nguyen, T. Nguyen, and J. Freire. Learning to extract form labels. Proc. VLDB Endow., 1(1):684–694, 2008. [70] F. Niu, C. R´e,A. Doan, and J. Shavlik. Tuffy: Scaling up statistical inference in markov logic networks using an rdbms. Proceedings of the VLDB Endowment, 4(6):373–384, 2011. [71] B. O’Connor, R. Balasubramanyan, B. Routledge, and N. Smith. From tweets to polls: Linking text sentiment to public opinion time series. In Proc. AAAI Conf. on Weblogs and Social Media, pages 122–129, 2010. [72] K. Olsen and A. Malizia. Following virtual trails. Potentials, IEEE, 29(1):24 –28, jan.-feb. 2010. [73] M. Pamuk and M. Stonebraker. Transformscout : finding compositions of transformations for software re-use. Master’s thesis, MIT, 2007. [74] H. Pasula, B. Marthi, B. Milch, S. J. Russell, and I. Shpitser. Identity Uncertainty and Citation Matching. In Neural Information Processing Systems, pages 1401–1408, 2002. [75] J. Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988. [76] H. Phan and M. Le Nguyen. Flexcrfs: Flexible conditional random fields, 2004. [77] H. Poon and P. Domingos. Sound and efficient inference with probabilistic and deterministic dependencies. In AAAI, volume 6, pages 458–463, 2006. [78] D. Rao, P. McNamee, and M. Dredze. Streaming cross document entity coreference resolution. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 1050–1058. Association for Computational Linguistics, 2010. [79] M. Richardson and P. Domingos. Markov logic networks. Machine learning, 62(1-2):107–136, 2006.

131 [80] F. Rubin. Enumerating all simple paths in a graph. Circuits and Systems, IEEE Transactions on, 25(8):641–642, 1978. [81] F. Rusu and A. Dobra. Glade: a scalable framework for efficient analytics. SIGOPS Oper. Syst. Rev., 46(1):12–18, Feb. 2012. [82] D. Sculley and C. E. Brodley. Compression and machine learning: A new perspective on feature space vectors. In Data Compression Conference, 2006. DCC 2006. Proceedings, pages 332–341. IEEE, 2006. [83] W. Shen, X. Li, and A. Doan. Constraint-based entity matching. In Proceedings of the National Conference on Artificial Intelligence, volume 20, page 862. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, 2005. [84] L. Shu, A. Chen, M. Xiong, and W. Meng. Efficient spectral neighborhood blocking for entity resolution. In 2011 IEEE 27th ICDE, pages 1067 –1078, april 2011. [85] P. Simon. Too Big to Ignore: The Business Case for Big Data. Wiley. com, 2013. [86] S. Singh, A. Subramanya, F. Pereira, and A. McCallum. Large-scale cross-document coreference using distributed inference and hierarchical models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 793–803. Association for Computational Linguistics, 2011. [87] S. Singh, A. Subramanya, F. Pereira, and A. McCallum. Wikilinks: A large-scale cross-document coreference corpus labeled via links to Wikipedia. Technical Report UM-CS-2012-015, 2012. [88] S. Singh, M. Wick, and A. McCallum. Monte carlo mcmc: efficient inference by approximate sampling. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1104–1113. Association for Computational Linguistics, 2012. [89] P. Singla and P. Domingos. Entity Resolution with Markov Logic. In IEEE International Conference on Data Mining, pages 572–582. IEEE, 2006. [90] R. M. Smullyan. First-order logic, volume 21968. Springer, 1968. [91] F. M. Suchanek. Automated Construction and Growth of a Large Ontology. PhD thesis, Saarland University, 2009. [92] J. Teevan, E. Adar, R. Jones, and M. A. S. Potts. Information re-retrieval: repeat queries in yahoo’s logs. In SIGIR ’07: Proc of the 30th annual international ACM SIGIR conference on R and D in information retrieval, pages 151–158, New York, NY, USA, 2007. ACM. [93] M. D. Vose. A linear algorithm for generating random numbers with a given distribution. IEEE Trans. Softw. Eng., 17(9):972–975, Sept. 1991.

132 [94] D. Wang, M. Franklin, M. Garofalakis, J. Hellerstein, and M. Wick. Hybrid in-database inference for declarative information extraction. In Proc. SIGMOD, pages 517–528. ACM, 2011. [95] D. Z. Wang, E. Michelakis, M. Garofalakis, and J. M. Hellerstein. Bayesstore: managing large, uncertain data repositories with probabilistic graphical models. Proc. VLDB Endow., 1:340–351, August 2008. [96] S. E. Whang, D. Menestrina, G. Koutrika, M. Theobald, and H. Garcia-Molina. Entity resolution with iterative blocking. In 2009 ACM SIGMOD, pages 219–232. ACM, 2009. [97] M. Wick, A. McCallum, and G. Miklau. Scalable probabilistic databases with factor graphs and mcmc. Proc. VLDB Endow., 3(1-2):794–804, Sept. 2010. [98] M. Wick, S. Singh, and A. McCallum. A discriminative hierarchical model for fast coreference at large scale. In Proceedings of the 50th ACL, ACL ’12, pages 379–388, 2012. [99] M. Wick, S. Singh, and A. McCallum. A discriminative hierarchical model for fast coreference at large scale. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL ’12, pages 379–388, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics. [100] M. L. Wick and A. McCallum. Query-aware mcmc. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, editors, Advances in NIPS 24, pages 2564–2572, 2011. [101] M. L. Wick, K. Rohanimanesh, K. Bellare, A. Culotta, A. McCallum, and A. McCallum. Sample rank: Training factor graphs with atomic gradients. In ICML, pages 777–784, 2011. [102] W. Woods. Progress in natural language understanding - an application to lunar geology. In American Federation of Information Processing Societies (AFIPS) Conference Proc, 42, pages 441–450, 1973. [103] L. Zhang, R. Ghosh, M. Dekhil, M. Hsu, and B. Liu. Combining lexicon-based and learning-based methods for twitter sentiment analysis. 2011.

BIOGRAPHICAL SKETCH
Christan Grant completed his Bachelor of Science, Master of Science, and Ph.D. in computer science at the University of Florida. His research interests involve novel methods for answering difficult questions, ranging from the addition of natural language processing within relational databases to probabilistic knowledge base assisted question answering systems. He has worked on developing a "query-driven" paradigm for text analytics. He is a recipient of the National Science Foundation Graduate Research Fellowship award in the area of "Database Information Retrieval and Web Search". He was also awarded the Florida Georgia LSAMP Bridge to Doctorate Fellowship and a diversity award from the College of Engineering. Christan has served as an external reviewer for the ACM SIGMOD, VLDB, ACM CIKM, and IEEE ICDE conferences. He is also on the program committee for Broadening Participation in Data Mining. He holds several publications and also a patent from an internship at IBM Almaden Research Center.
