QUERY-DRIVEN TEXT ANALYTICS FOR KNOWLEDGE EXTRACTION, RESOLUTION, AND INFERENCE
By CHRISTAN EARL GRANT
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
2015

© 2015 Christan Earl Grant

To Jesus my Savior, Vanisia my wife, my daughter Caliah, my soon-to-be-born son, and my parents and siblings, whom I strive to impress. Also, to all my brothers and sisters battling injustice while I battled bugs and deadlines.

ACKNOWLEDGMENTS

I had the opportunity to see my dad, a software engineer from Jamaica, work extremely hard to earn a master's degree and work as a software engineer. I even had the privilege of sitting in on some of his classes as he taught at a local university. Watching my dad work towards intellectual endeavors made me believe that anything is possible. I am extremely privileged to have someone I could look up to as an example of being a man, father, and scholar.

I had my first taste of research when Dr. Joachim Hammer went out of his way to find a task for me on one of his research projects because I was interested in attending graduate school. After I had worked with the team for a few weeks, he was willing to give me increased responsibility: he let me attend the 2006 SIGMOD Conference in Chicago. It was at this conference that my eyes were opened to the world of database research.

As an early graduate student, I benefited from Dr. Joseph Wilson's superhuman patience as I learned to grasp the fundamentals of paper writing. He helped me manage a rocky first few years. His abundance of wisdom spilled over, revealing jewels of truth that I still hold sacred. He, along with Peter Dobbins, helped me navigate the road to the Ph.D.

I am delighted to have Dr. Daisy Zhe Wang as my dissertation advisor. I followed her work while she was still a graduate student, and I was thrilled to hear she was considering coming to UF. Having the opportunity to watch someone as gifted as Dr. Wang brainstorm and write was an invaluable experience. I also thank my lab mates Clint P. George and Dr. Kun Li, with whom I have worked for many years, as well as Sean Goldberg, Morteza Shahriari Nia, Yang Chen, Yang Peng, and Xiaofeng Zhou, who have also been mentored by Dr. Wang; I appreciate their valuable feedback.

During the last years of my graduate program there has been a large amount of civil unrest. While these issues do not affect me specifically, they are emotionally difficult to handle and can negatively affect my everyday productivity. It was important for me to have people around me who I know are going through similar circumstances emotionally
and still pursuing their degree. That is why I thank Dr. Pierre St. Juste, Dr. Corey Baker, and Jeremy Magruder for discussions about issues that are sacred to one's race and ethnicity. Finally, I would like to thank all the individuals who regularly attended the ACM Richard Tapia Celebration of Diversity in Computing. In 2007, I found this group because I was purposely searching for community. This is a group of talented intellectuals who continue to spur me towards excellence. Through them I met Dr. Juan Gilbert, who has been an excellent mentor and role model throughout my research career.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER
1 INTRODUCTION
    1.1 Database as the Querying Engine
    1.2 Query-Driven Machine Learning
    1.3 Question Answering
2 IN-DATABASE QUERY-DRIVEN TEXT ANALYTICS
    2.1 MADden Introduction
    2.2 MADden System Description
        2.2.1 MADden System Architecture
        2.2.2 Statistical Text Analysis Functions
        2.2.3 MADden Implementation Details
    2.3 Text Analysis Queries and Demonstration
        2.3.1 Dataset for MADden Example
        2.3.2 MADden Text Analytics Queries
        2.3.3 MADden User Interface
    2.4 GPText Introduction
        2.4.1 GPText Related Work
        2.4.2 Greenplum Text Analytics
            2.4.2.1 In-database document representation
            2.4.2.2 ML-based advanced text analysis
        2.4.3 CRF for IE over MPP Databases
            2.4.3.1 Implementation overview
            2.4.3.2 Feature extraction using SQL
            2.4.3.3 Parallel linear-chain CRF training
            2.4.3.4 Parallel linear-chain CRF inference
        2.4.4 GPText Experiments and Results
        2.4.5 GPText Application
        2.4.6 GPText Summary
3 MAKING ENTITY RESOLUTION QUERY-DRIVEN
    3.1 Query-Driven Entity Resolution Introduction
    3.2 Query-Driven Entity Resolution Preliminaries
        3.2.1 Factor Graphs
        3.2.2 Inference over Factor Graphs
        3.2.3 Cross-Document Entity Resolution
    3.3 Query-Driven Entity Resolution Problem Statement
    3.4 Query-Driven Entity Resolution Algorithms
        3.4.1 Intuition of Query-Driven ER
        3.4.2 Single-Node ER
        3.4.3 Multi-query ER
    3.5 Optimization of Query-Driven ER
        3.5.1 Influence Function: Attract and Repel
        3.5.2 Query-proportional ER
        3.5.3 Hybrid ER
        3.5.4 Implementation Details
        3.5.5 Algorithms Summary Discussion
    3.6 Query-Driven Entity Resolution Experiments
        3.6.1 Experiment Setup
        3.6.2 Realtime Query-Driven ER Over NYT
        3.6.3 Single-query ER
        3.6.4 Multi-query ER
        3.6.5 Context Levels
        3.6.6 Parallel Hybrid ER
    3.7 Query-Driven Entity Resolution Related Work
    3.8 Query-Driven Entity Resolution Summary
4 A PROPOSAL OPTIMIZER FOR SAMPLING-BASED ENTITY RESOLUTION
    4.1 Introduction to the Proposal Optimizer
    4.2 Proposal Optimizer Background
    4.3 Accelerating Entity Resolution
    4.4 Proposal Optimizer Algorithms
    4.5 Optimizer
    4.6 Proposal Optimizer Experiment Implementation
        4.6.1 WikiLink Corpus
        4.6.2 Micro Benchmark
    4.7 Proposal Optimizer Summary
5 QUESTION ANSWERING
    5.1 Morpheus QA Introduction
    5.2 Morpheus QA Related Work
        5.2.1 Question Answering Systems
        5.2.2 Ontology Generators
    5.3 Morpheus QA System Architecture
        5.3.1 Using Ontology and Corpora
        5.3.2 Recording
        5.3.3 Ranking
        5.3.4 Executing New Queries
    5.4 Morpheus QA Results
    5.5 Morpheus QA Summary
6 PATH EXTRACTION IN KNOWLEDGE BASES
    6.1 Preliminaries for Knowledge Base Expansion
        6.1.1 Probabilistic Knowledge Base
        6.1.2 Markov Logic Network and Factor Graphs
        6.1.3 Sampling for Marginal Inference
            6.1.3.1 Gibbs sampling
            6.1.3.2 MC-SAT
        6.1.4 Linking Facts in a Knowledge Base
    6.2 Fact Path Expansion Related Work
        6.2.1 SPARQL Query Path Search
        6.2.2 Path Ranking
        6.2.3 Fact Rank
    6.3 Fact Path Expansion Algorithm
    6.4 Joint Inference of Path Probabilities
        6.4.1 Fuzzy Querying
        6.4.2 PostgreSQL Fact Path Expansion Algorithm
        6.4.3 Graph Database Query
        6.4.4 Fact Path Expansion Complexity
    6.5 Fact Path Expansion Experiments
    6.6 Fact Path Expansion Summary
7 CONCLUSIONS
REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES

2-1 Listing of current MADden functions
2-2 List of each MADden function and its NLP task
2-3 Abbreviated NFL dataset schema
3-1 Mention sets M from a corpus
3-2 Example query node q
3-3 Summary of algorithms and their most common methods for proposal jumps
3-4 Features used on the NYT corpus. The first set of features are token-specific, the middle set are between pairs of mentions, and the bottom set are entity-wide
3-5 The performance of the hybrid-repel ER algorithm for queries over the NYT corpus for the first 50 samples
4-1 Techniques to improve the sampling process, each classified by how it affects sampling
5-1 Example SSQ model
5-2 The output of the NLP engine
5-3 Term classes and probabilities
5-4 Highest ranked Morpheus QA queries
6-1 The frequency of each term in our cleaned Reverb data set
LIST OF FIGURES

2-1 MADden architecture
2-2 Example MADden UI query template
2-3 The GPText architecture over the Greenplum database
2-4 The MADlib CRF overall system architecture
2-5 Linear-chain CRF training scalability
2-6 Linear-chain CRF inference scalability
2-7 GPText application
3-1 Three-node factor graph. Circles (random variables) labeled mi represent mentions and those labeled ei represent entities. Clouds are added for visual emphasis of entity clusters
3-2 A possible initialization for entity resolution
3-3 The correct entity resolution for all mentions
3-4 The entity containing q is internally coreferent; the other entities are not correctly resolved
3-5 Hybrid-repel performance for the first 50 samples for three queries. Each result is averaged over 6 runs
3-6 A comparison of single-query algorithms on a query with selectivity of 11
3-7 A comparison of single-query algorithms with a query node of selectivity 46
3-8 A comparison of selection-driven algorithms with a query node of selectivity 130
3-9 The time until an f1q score of 0.95 for five queries of increasing selectivities; averaged over three runs
3-10 The progress of the hybrid algorithm across multiple query nodes using different scheduling algorithms. Each result is averaged over three runs
3-11 The performance of the zuckerberg query with different levels of context. Each result is averaged over 6 runs
3-12 Hybrid-attract algorithm with random queries run over the Wikilinks corpus. Each plot starts after the Vose structures are constructed
4-1 The high-level interaction of the optimizer
4-2 A distribution of entity sizes from the Wikilinks corpus [87] with an initial start and the truth
4-3 Comparison of baseline versus early stopping methods
4-4 The time for compression for varying entity sizes and cardinalities, compared with a line representing the time it takes to make 100K insertions
5-1 Abbreviated vehicular ontology
6-1 An example of the increase of the facts and the number of relations over several timestamps
6-2 A sample of nodes and their changing probabilities over time. The figure is darkened to show the many overlapping lines
6-3 Fact Path Expansion queries over the Titan Graph DB
6-4 Fact Path Expansion queries over PostgreSQL
6-5 PostgreSQL results of Fact Path Expansion queries with reset database cache
6-6 Comparison of the experiments with Titan DB and PostgreSQL without cache
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

QUERY-DRIVEN TEXT ANALYTICS FOR KNOWLEDGE EXTRACTION, RESOLUTION, AND INFERENCE

By Christan Earl Grant
August 2015
Chair: Daisy Zhe Wang
Cochair: Joseph N. Wilson
Major: Computer Engineering

With the precipitous increase in data, performing text analytics using traditional methods has become increasingly difficult. From now until 2020, the world's data is predicted to double every year. Techniques to store and process these large data stores are quickly growing out of date. Growing data sizes combined with improper methods can mean a large increase in retrieval and processing time. In short, the former techniques do not scale. The complexity of data formats is also increasing: one can no longer assume data will be structured numbers and names. Traditionally, to perform analytics, a data scientist extracts parts of large data sources to local machines and performs analytics using R, Python, or SAS. Extracting this information is becoming a pain point. Additionally, many algorithms perform extra work over full data sets when the data scientist may only be interested in a particular portion of the data.

In this dissertation, I introduce query-driven text analytics: the use of declarative semantics (a query) to direct, restrict, and alter computation in analytic systems without a major sacrifice in accuracy. I demonstrate this principle in three ways. First, I add text analytics inside a relational database, where the user can use SQL to bind the scope of the algorithm, e.g., using a SELECT statement. In this way, computation takes place in the same location as storage, and the user can take advantage of the query processing provided by the database. Second, I
alter an entity resolution algorithm so it uses example queries to drive computation. This demonstrates a method of making a non-trivial algorithm aware of the query. Finally, I describe a method for inferring information from knowledge bases. These techniques perform inference over knowledge bases that model uncertainty in a real scenario, with application within question answering.
CHAPTER 1
INTRODUCTION

From Babylonian-era algorithms for accounting resources [51] to modern-day web-scale processing, methods for analyzing data have been central to the progress of successful societies. Data analytics encompasses the algorithms and systems involved in extracting decision-grade information from data. Notably, data analytics spans a series of fields including computer science, economics, marketing, physics, sociology, and engineering. In a capitalist society, the ability to make intelligent business decisions is critical. The globally connected society of the modern day demands that competitive organizations find more efficient methods of extracting knowledge. If an organization cannot collect, manage, and process data as efficiently as its competition, then it will have trouble surviving [85].

From now until 2020, the world's data is predicted to double every year [35]. Techniques to store and process these large data stores are quickly growing out of date. The increase in data size, handled with improper methods, can mean a large increase in retrieval and processing time. In short, the former techniques do not scale. The complexity of data formats is increasing; one can no longer assume data will be structured numbers and names. Databases now store a mix of structured and unstructured data, so to support data analytics, queries over disparate data types cannot be an oversight. Additionally, user-generated content such as click streams, tweets, and videos are examples of new data sources with extremely high rates of growth.

In a typical data scientist's text analytics pipeline, data is extracted from a database, analytics are then performed using R, Python, or MATLAB, and the result is added back to the database. With increasing data sizes, the bottleneck of this process is quickly becoming the data transfer time, that is, transferring large amounts of data from and to the database.
Oftentimes, large and diverse data sources cannot be extracted from a database, either for security reasons or because of their large size. Nor can a global
service be taken off-line for processing and updates. To perform text analytics in these scenarios, it is preferable to bring the query to the data instead of bringing the data to the query [23].

Text analytics is a class of methods for processing documents to obtain actionable or exploratory information. Text analytic tasks include linguistic processing, knowledge extraction, and information visualization. Most text analytics techniques are created to process the full supplied data set. That is, to extract answers from a data set, the full set must be processed. With large data set sizes, this approach becomes prohibitive. To use an analogy, if a single clean plate is needed from the kitchen sink, one should not run the dishwashing machine. It is our observation that during the majority of exploration tasks, a data scientist may interpret only a small portion of the data set. For example, when clustering data for evaluation, a data scientist may look at only a handful of data clusters. When running exploratory analysis over data streams, providing a template or example of expected results may be useful when sifting through noise. This dissertation defines the category of query-driven text analytics and presents three scenarios demonstrating the efficacy of query-driven techniques. Query-driven text analytics is the use of declarative semantics to decrease the amount of processing without a sacrifice in accuracy. We demonstrate this in three ways:
• We add machine learning algorithms inside a parallel relational DBMS, where the user can use SQL and UDFs to choose the scope of their algorithm (Chapter 2);
• We alter a machine learning algorithm so it uses an example query to drive computation (Chapters 3 and 4);
• We investigate the use of knowledge-base inference to assist a question answering system (Chapter 5) and to understand the connections between concepts (Chapter 6).

In the following sections I briefly introduce each contributed area. In addition, I explicitly state the contributions of each work.
1.1 Database as the Querying Engine
When processing large data, a common bottleneck is data movement: moving data across geographical locations for processing is expensive. In-database analytics (dblytics) aims to build sophisticated analytic algorithms into data-parallel systems, such as relational databases and massively parallel processing (MPP) systems. Using a database as the ecosystem for analytics, we get a declarative query interface, query optimization, transactional operations, efficient caching, and fault tolerance. I present two projects demonstrating dblytics: MADden and GPText.

MADden is a demonstration of in-database text analysis algorithms [41]. This demonstration focuses on answering queries for sports journalism, in particular over NFL data sets using Mad Libs-style queries. The demonstration made the following contributions:
• Processing declarative ad hoc queries involving various statistical text analytic functions.
• Joining and querying over multiple data sources of structured and unstructured text.
• Query-time rendering of visualizations over query results, using word clouds, histograms, and ranked lists of documents.

GPText is a system for large-scale text indexing, search, and ranking [57]. It is a new system that integrates Greenplum DB, the MADlib analytics library, and the Apache Solr enterprise search platform. Combined with MADlib algorithms such as conditional random field (CRF) part-of-speech tagging, GPText is an extremely large and scalable text analytics engine. GPText adds a Solr instance to each parallel Greenplum DB segment, and the database communicates with the instances over HTTP. Text searches are then parallelized across segments. Using UDFs, we can mix sophisticated search predicates, ranking, and database queries. In addition, we created an application that demonstrates the scalability of GPText and MADlib algorithms.
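The in-database idea above can be sketched in miniature: register a text-analysis function as a UDF so computation happens where the data lives, and let a SELECT statement bound which rows are analyzed. The example below is a hypothetical sketch, not MADden or GPText code; it uses SQLite for portability (the systems in this chapter use PostgreSQL/Greenplum), and the table, rows, and toy token_count function are invented for illustration.

```python
import sqlite3

def token_count(text):
    """Toy 'statistical text analysis' function: number of whitespace tokens."""
    return len(text.split())

conn = sqlite3.connect(":memory:")
# Register the Python function as a SQL UDF named token_count.
conn.create_function("token_count", 1, token_count)
conn.execute("CREATE TABLE articles (id INTEGER, body TEXT)")
conn.executemany("INSERT INTO articles VALUES (?, ?)",
                 [(1, "Tebow rushed for two touchdowns"),
                  (2, "The quarterback threw a late interception")])
# The WHERE clause binds the scope of the analysis:
# only matching rows ever reach the UDF.
rows = conn.execute(
    "SELECT id, token_count(body) FROM articles "
    "WHERE body LIKE '%touchdown%'"
).fetchall()
print(rows)  # [(1, 5)]
```

The same pattern scales up in an MPP database, where the UDF runs in parallel on each segment's local data instead of shipping documents to a client-side script.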
1.2 Query-Driven Machine Learning
In query-driven machine learning, the idea is to use examples of desired results to reduce the amount of time spent processing data. To demonstrate this, we take a popular clustering problem, entity resolution, and make it query-driven. Entity resolution (ER) is the process of determining which records (mentions) in a database correspond to the same real-world entity. Leading ER systems solve this problem by resolving every record in the database. For large datasets, however, this is an expensive process. Moreover, such approaches are wasteful because, in practice, users are interested in only one or a small subset of the entities mentioned in the database. In this work, we introduce new classes of SQL queries involving ER operators: single-query ER and multi-query ER. We develop novel variations of the Metropolis-Hastings algorithm and introduce selectivity-based scheduling algorithms to support the two classes of ER queries.

To support single-query ER, we develop three new variations of the Metropolis-Hastings-style Markov chain Monte Carlo algorithm for inference over the CRF-based probabilistic model. More specifically, instead of a uniform sampling distribution, we use a query sampling method that is biased by the arrangement of the probabilistic model. The first, target-fixed, algorithm adapts the samples to resolve the query entity. The second, query-proportional, algorithm selects mentions based on their probabilistic similarity to the query entity. The third, hybrid, algorithm combines the two approaches. Following the seminal work of Wick et al. [100], we devise an influence function to model the similarity between the mentions and the query entity, an attract score. This influence function works best when the cluster of mentions is heterogeneous.
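The query-proportional idea can be illustrated with a small sketch. This is not the dissertation's implementation: a toy Jaccard token overlap stands in for the learned influence function, and the mention strings are invented. The point is only that mentions similar to the query are proposed far more often, so the sampler's effort concentrates on resolving the query entity rather than the whole database.

```python
import random

# Invented mentions, each reduced to a set of tokens.
mentions = {
    "m1": {"mark", "zuckerberg", "facebook"},
    "m2": {"zuckerberg", "ceo"},
    "m3": {"tim", "cook", "apple"},
    "m4": {"apple", "iphone"},
}
query = {"mark", "zuckerberg"}

def jaccard(a, b):
    """Toy stand-in for the influence function: token-set overlap."""
    return len(a & b) / len(a | b)

# "Attract" weights: similarity of each mention to the query, with a
# small floor so every mention keeps nonzero proposal probability.
weights = {m: jaccard(toks, query) + 1e-3 for m, toks in mentions.items()}

def propose(rng):
    """Pick the next mention to move, proportional to its attract weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

rng = random.Random(0)
picks = [propose(rng) for _ in range(1000)]
# Mentions overlapping the query dominate the proposal stream.
assert picks.count("m1") + picks.count("m2") > picks.count("m3") + picks.count("m4")
```

In the full algorithm each proposed move is still accepted or rejected by the Metropolis-Hastings criterion against the CRF model score; only the proposal distribution changes, which preserves correctness while focusing computation.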
In the case when the cluster of mentions is homogeneous, for example the result of high-quality canopy generation, we show a different algorithm that computes and applies an influence function to generate a repel score for biased sampling.

To support multi-query ER, a naive nested-loop join can be performed by running the single-query ER algorithms iteratively, computing the resolution one entity at a time.
However, such a join algorithm can lead to unoptimized resource allocation if the same number of samples is generated for each target entity, or to low throughput if one of the entities has a low convergence rate (e.g., a long sampling process). To alleviate this problem, we discuss three multi-query ER algorithms, which schedule the computation (i.e., sample generation) among the different target entities in order to achieve the optimal overall convergence rate.

1.3 Question Answering
After extracting information from large data sets and analyzing it, the next step is question answering. Question answering bridges the gap between the way a user asks a question and the way an answer is encoded in the background knowledge. Understanding questions and extracting answers requires the full suite of text mining tasks. The process of question answering is inherently query-driven: all possible questions over a data set cannot be enumerated, so any question answering system waits for a user query to initiate an answer discovery process. Question answering is the holy grail of text analytics; many text analytic tasks are required to obtain accurate answers. In this work, we extract answers from the web corpus, and we distinguish between two portions of the web, namely the surface web and the deep web. The surface web consists of the standard web pages accessible from a browser without authenticating or providing any credentials. These pages include blog posts, company web pages, news articles, and more. By contrast, the deep web is the set of pages generated
through web