QUERY-DRIVEN TEXT ANALYTICS FOR KNOWLEDGE EXTRACTION, RESOLUTION, AND INFERENCE

By CHRISTAN EARL GRANT

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2015

© 2015 Christan Earl Grant

To Jesus my Savior; Vanisia my wife; my daughter Caliah; my soon-to-be-born son; and my parents and siblings, whom I strive to impress. Also, to all my brothers and sisters battling injustice while I battled bugs and deadlines.

ACKNOWLEDGMENTS

I had the opportunity to see my dad, a software engineer from Jamaica, work extremely hard to earn a master's degree and work as a software engineer. I even had the privilege of sitting in on some of his classes as he taught at a local university. Watching my dad work toward intellectual endeavors made me believe that anything is possible. I am extremely privileged to have someone I could look up to as an example of being a man, father, and scholar.

I had my first taste of research when Dr. Joachim Hammer went out of his way to find a task for me on one of his research projects because I was interested in attending graduate school. After I had worked with the team for a few weeks he was willing to give me increased responsibility: he let me attend the 2006 SIGMOD Conference in Chicago. It was at this conference that my eyes were opened to the world of research.

As an early graduate student, Dr. Joseph Wilson exercised superhuman patience with me as I learned to grasp the fundamentals of paper writing. He helped me manage a rocky first few years. His abundance of wisdom would spill over, revealing jewels of truth that I still hold sacred. Along with Peter Dobbins, he helped me navigate the road to the Ph.D.

I am delighted to have Dr. Daisy Zhe Wang as my dissertation advisor. I followed her work while she was still a graduate student, and I was thrilled to hear she was considering coming to UF. Having the opportunity to watch someone as gifted as Dr. Wang brainstorm and write was an invaluable experience. I also thank my lab mates Clint P. George and Dr. Kun Li, with whom I have worked for many years, as well as Sean Goldberg, Morteza Shahriari Nia, Yang Chen, Yang Peng, and Xiaofeng Zhou, who have also been mentored by Dr. Wang; I appreciate their valuable feedback.

During the last years of my graduate program there has been a large amount of civil unrest. While these issues do not affect me specifically, they are emotionally difficult to handle and can negatively affect my everyday productivity. It was important for me to have people around me who I know are going through similar circumstances emotionally and still pursuing their degrees. That is why I thank Dr. Pierre St. Juste, Dr. Corey Baker, and Jeremy Magruder for discussions about issues that are sacred to one's race and ethnicity.

Finally, I would like to thank all the individuals who regularly attend the ACM Richard Tapia Celebration of Diversity in Computing. I found this group in 2007 because I was purposely searching for community. This is a group of talented intellectuals who continue to spur me towards excellence. Through them I met Dr. Juan Gilbert, who has been an excellent mentor and role model throughout my research career.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER
1 INTRODUCTION
  1.1 Database as the Querying Engine
  1.2 Query-Driven Machine Learning
  1.3 Question Answering
2 IN-DATABASE QUERY-DRIVEN TEXT ANALYTICS
  2.1 MADden Introduction
  2.2 MADden System Description
    2.2.1 MADden System Architecture
    2.2.2 Statistical Text Analysis Functions
    2.2.3 MADden Implementation Details
  2.3 Text Analysis Queries and Demonstration
    2.3.1 Dataset for MADden Example
    2.3.2 MADden Text Analytics Queries
    2.3.3 MADden User Interface
  2.4 GPText Introduction
    2.4.1 GPText Related Work
    2.4.2 Greenplum Text Analytics
      2.4.2.1 In-database document representation
      2.4.2.2 ML-based advanced text analysis
    2.4.3 CRF for IE over MPP Databases
      2.4.3.1 Implementation overview
      2.4.3.2 Feature extraction using SQL
      2.4.3.3 Parallel linear-chain CRF training
      2.4.3.4 Parallel linear-chain CRF inference
    2.4.4 GPText Experiments and Results
    2.4.5 GPText Application
    2.4.6 GPText Summary
3 MAKING ENTITY RESOLUTION QUERY-DRIVEN
  3.1 Query-Driven Entity Resolution Introduction
  3.2 Query-Driven Entity Resolution Preliminaries
    3.2.1 Factor Graphs
    3.2.2 Inference over Factor Graphs
    3.2.3 Cross-Document Entity Resolution
  3.3 Query-Driven Entity Resolution Problem Statement
  3.4 Query-Driven Entity Resolution Algorithms
    3.4.1 Intuition of Query-Driven ER
    3.4.2 Single-Node ER
    3.4.3 Multi-query ER
  3.5 Optimization of Query-Driven ER
    3.5.1 Influence Function: Attract and Repel
    3.5.2 Query-proportional ER
    3.5.3 Hybrid ER
    3.5.4 Implementation Details
    3.5.5 Algorithms Summary Discussion
  3.6 Query-Driven Entity Resolution Experiments
    3.6.1 Experiment Setup
    3.6.2 Realtime Query-Driven ER Over NYT
    3.6.3 Single-query ER
    3.6.4 Multi-query ER
    3.6.5 Context Levels
    3.6.6 Parallel Hybrid ER
  3.7 Query-Driven Entity Resolution Related Work
  3.8 Query-Driven Entity Resolution Summary
4 A PROPOSAL OPTIMIZER FOR SAMPLING-BASED ENTITY RESOLUTION
  4.1 Introduction to the Proposal Optimizer
  4.2 Proposal Optimizer Background
  4.3 Accelerating Entity Resolution
  4.4 Proposal Optimizer Algorithms
  4.5 Optimizer
  4.6 Proposal Optimizer Experiment Implementation
    4.6.1 WikiLink Corpus
    4.6.2 Micro Benchmark
  4.7 Proposal Optimizer Summary
5 QUESTION ANSWERING
  5.1 Morpheus QA Introduction
  5.2 Morpheus QA Related Work
    5.2.1 Question Answering Systems
    5.2.2 Ontology Generators
  5.3 Morpheus QA System Architecture
    5.3.1 Using Ontology and Corpora
    5.3.2 Recording
    5.3.3 Ranking
    5.3.4 Executing New Queries
  5.4 Morpheus QA Results
  5.5 Morpheus QA Summary
6 PATH EXTRACTION IN KNOWLEDGE BASES
  6.1 Preliminaries for Expansion
    6.1.1 Probabilistic Knowledge Base
    6.1.2 Markov Logic Network and Factor Graphs
    6.1.3 Sampling for Marginal Inference
      6.1.3.1 Gibbs sampling
      6.1.3.2 MC-SAT
    6.1.4 Linking Facts in a Knowledge Base
  6.2 Fact Path Expansion Related Work
    6.2.1 SPARQL Query Path Search
    6.2.2 Path Ranking
    6.2.3 Fact Rank
  6.3 Fact Path Expansion Algorithm
  6.4 Joint Inference of Path Probabilities
    6.4.1 Fuzzy Querying
    6.4.2 PostgreSQL Fact Path Expansion Algorithm
    6.4.3 Graph Database Query
    6.4.4 Fact Path Expansion Complexity
  6.5 Fact Path Expansion Experiments
  6.6 Fact Path Expansion Summary
7 CONCLUSIONS
REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Listing of current MADden functions
2-2 List of each MADden function and its NLP task
2-3 Abbreviated NFL dataset schema
3-1 Mention sets M from a corpus
3-2 Example query node q
3-3 Summary of algorithms and their most common methods for proposal jumps
3-4 Features used on the NYT corpus. The first set of features are token-specific features, the middle set are between pairs of mentions, and the bottom set are entity-wide features
3-5 The performance of the hybrid-repel ER algorithm for queries over the NYT corpus for the first 50 samples
4-1 A table of the techniques to improve the sampling process, each classified by how it affects sampling
5-1 Example SSQ model
5-2 The output of the NLP engine
5-3 Term classes and probabilities
5-4 Highest ranked Morpheus QA queries
6-1 The frequency of each term in our cleaned Reverb data set

LIST OF FIGURES

2-1 MADden architecture
2-2 Example MADden UI query template
2-3 The GPText architecture over the Greenplum database
2-4 The MADLib CRF overall system architecture
2-5 Linear-chain CRF training scalability
2-6 Linear-chain CRF inference scalability
2-7 GPText application
3-1 Three-node factor graph. Circles (random variables) with m_i represent mentions and those with e_i represent entities. Clouds are added for visual emphasis of entity clusters
3-2 A possible initialization for entity resolution
3-3 The correct entity resolution for all mentions
3-4 The entity containing q is internally coreferent; the other entities are not correctly resolved
3-5 Hybrid-repel performance for the first 50 samples for three queries. Each result is averaged over 6 runs
3-6 A comparison of single-query algorithms on a query with selectivity of 11
3-7 A comparison of single-query algorithms with a query node of selectivity 46
3-8 A comparison of selection-driven algorithms with a query node of selectivity 130
3-9 The time until an f1q score of 0.95 for five queries of increasing selectivities; averaged over three runs
3-10 The progress of the hybrid algorithm across multiple query nodes using different scheduling algorithms. Each result is averaged over three runs
3-11 The performance of the zuckerberg query with different levels of context. Each result is averaged over 6 runs
3-12 Hybrid-attract algorithm with random queries run over the Wikilinks corpus. Each plot starts after the Vose structures are constructed
4-1 The high-level interaction of the optimizer
4-2 A distribution of entity sizes from the links corpus [87] with an initial start and the truth
4-3 Comparison of baseline versus early stopping methods
4-4 The time for compression for varying entity sizes and cardinalities. This is compared with a line representing the time it takes to make 100K insertions
5-1 Abbreviated vehicular ontology
6-1 An example of the increase of the facts and the number of relations over several timestamps
6-2 A sample of nodes and their changing probabilities over time. The figure is darkened to show the many overlapping lines
6-3 Fact Path Expansion queries over the Titan Graph DB
6-4 Fact Path Expansion queries over PostgreSQL
6-5 PostgreSQL results of Fact Path Expansion queries with reset database cache
6-6 Comparison of the experiments with Titan DB and PostgreSQL without cache

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

QUERY-DRIVEN TEXT ANALYTICS FOR KNOWLEDGE EXTRACTION, RESOLUTION, AND INFERENCE

By Christan Earl Grant

August 2015

Chair: Daisy Zhe Wang
Cochair: Joseph N. Wilson
Major: Computer Engineering

With the precipitous increase in data, performing text analytics using traditional methods has become increasingly difficult. From now until 2020 the world's data is predicted to double every year. Techniques to store and process these large data stores are quickly growing out of date. The increase in data size, combined with improper methods, could mean a large increase in retrieval and processing time. In short, the former techniques do not scale. The complexity of data formats is also increasing: no longer can one assume data will be structured numbers and names. Traditionally, to perform analytics, a data scientist extracts parts of large data sources to local machines and performs analytics using R, Python, or SAS. Extracting this information is becoming a pain point. Additionally, many algorithms run over whole data sets perform extra work when the data scientist is only interested in a particular portion of the data.

In this dissertation, I introduce query-driven text analytics: the use of declarative semantics (a query) to direct, restrict, and alter computation in analytic systems without a major sacrifice in accuracy. I demonstrate this principle in three ways. First, I add text analytics inside of a relational database where the user can use SQL to bind the scope of the algorithm, e.g., using a SELECT statement. In this way, computation takes place in the same location as storage and the user can take advantage of the query processing provided by the database. Second, I alter an entity resolution algorithm so that it uses example queries to drive computation. This demonstrates a method of making a non-trivial algorithm aware of the query. Finally, I describe a method for inferring information from knowledge bases. These techniques perform inference over knowledge bases that model uncertainty for a real scenario, and I show their application within question answering.

CHAPTER 1
INTRODUCTION

From Babylonian-era algorithms for accounting resources [51] to modern day web-scale processing, methods for analyzing data have been central to the progress of successful societies. Data analytics encompasses the algorithms and systems involved in extracting decision-grade information from data. Notably, data analytics spans a series of fields including computer science, economics, marketing, physics, sociology, and engineering.

In a capitalist society the ability to make intelligent business decisions is critical. The globally connected society of the modern day has demanded that competitive organizations find more efficient methods of extracting knowledge. If an organization cannot collect, manage, and process data as efficiently as its competition, then it will have trouble surviving [85].

From now until 2020 the world's data is predicted to double every year [35]. Techniques to store and process these large data stores are quickly growing out of date. The increase in data size with improper methods could mean a large increase in retrieval and processing time. In short, the former techniques do not scale. The complexity of data formats is increasing; no longer can one assume data will be structured numbers and names. Databases are now storing a mix of structured and unstructured data. To support data analytics, queries over disparate data types cannot be an oversight. Additionally, user-generated content such as click streams, tweets, and videos are examples of new data sources with extremely high rates of growth.

In a typical data scientist's text analytics pipeline, data is extracted from a database, analytics are then performed using R, Python, or MATLAB, and the result is added back to the database. With increasing data sizes, the bottleneck of this process is quickly becoming the data transfer time, that is, transferring large amounts of data from and to the database. Oftentimes, large and diverse data sources cannot be extracted from a database, either for security reasons or because of their large size. Nor can a global service be taken off-line for processing and updates. To perform text analytics in these scenarios it is preferable to bring the query to the data instead of bringing the data to the query [23].

Text analytics is a class of methods for processing documents to obtain actionable or exploratory information. Text analytic tasks include linguistic processing, knowledge extraction, and information visualization. Most text analytics techniques are created for processing information across the full supplied data set. That is, to extract answers from a data set, the full set must be processed. With large data set sizes, this approach becomes prohibitive. To use an analogy, if a single clean plate is needed from the kitchen sink one should not run the entire dish washing machine. It is our observation that during the majority of exploration tasks, a data scientist may only interpret a small portion of the data set. For example, when clustering data for evaluation a data scientist may only look at a handful of data clusters. When running exploratory analysis over data streams, providing a template or example of expected results may be useful when sifting through noise.

This dissertation defines the category of query-driven text analytics and presents three scenarios demonstrating the efficacy of query-driven techniques. Query-driven text analytics is the use of declarative semantics to decrease the amount of processing without a sacrifice in accuracy. In this dissertation, we demonstrate this in three ways.

• We add machine learning algorithms inside of a parallel relational DBMS where the user can use SQL and UDFs to choose the scope of their algorithm (Chapter 2);

• We alter a machine learning algorithm so it uses an example query to drive computation (Chapters 3 and 5);

• We investigate the use of knowledge-based inference to assist question answering systems (Chapter 5) and to understand the connection between concepts (Chapter 6).

In the following subsections I briefly introduce each contributed area. In addition, I explicitly state the contribution of each work.

1.1 Database as the Querying Engine

When processing large data, often a bottleneck to computation is data movement. Moving data across geographical locations for processing is expensive. In-database analytics (dblytics) aims to build sophisticated analytic algorithms into data-parallel systems, such as relational databases and massively parallel processing systems. Using a database as the ecosystem for analytics, we get a declarative query interface, query optimization, transactional operations, efficient caching, and fault tolerance. I present two projects demonstrating dblytics: MADden and GPText.

MADden is a demonstration of in-database text analysis algorithms [41]. This demonstration focuses on answering queries for sports journalism, in particular over NFL data sets, using Mad Lib-style queries. The demonstration made the following contributions:

• Processing declarative ad hoc queries involving various statistical text analytic functions.

• Joining and querying over multiple data sources of structured and unstructured text.

• Query-time rendering of visualizations over query results, using word clouds, histograms, and ranked lists of documents.

GPText is a system for large-scale text indexing, search, and ranking [57]. This new system integrates Greenplum DB, the MADlib analytics library, and the Apache Solr enterprise search platform. Combined with our MADlib algorithms, such as conditional random field part-of-speech tagging, GPText is an extremely scalable, large-scale text analytics engine. GPText adds a Solr instance to each parallel Greenplum DB segment, and the database can communicate with the instances over HTTP. Text searches are then parallelized across segments. Using UDFs we can mix sophisticated search predicates, ranking, and database queries. In addition, we created an application that demonstrates the scalability of GPText and MADlib algorithms.

1.2 Query-Driven Machine Learning

In query-driven machine learning, the idea is to use examples of desired results to reduce the amount of time spent processing data. To demonstrate this we take a popular clustering problem, entity resolution, and make it query-driven. Entity resolution (ER) is the process of determining which records (mentions) in a database correspond to the same real-world entity. Leading ER systems solve this problem by resolving every record in the database. For large datasets, however, this is an expensive process. Moreover, such approaches are wasteful because, in practice, users are interested in only one or a small subset of the entities mentioned in the database. In this work, we introduce new classes of SQL queries involving ER operators: single-query ER and multi-query ER. We develop novel variations of the Metropolis-Hastings algorithm and introduce selectivity-based scheduling algorithms to support the two classes of ER queries.

To support single-query ER queries, we develop three new variations of the Metropolis-Hastings-style Markov chain Monte Carlo algorithm for inference over the CRF-based probabilistic model. More specifically, instead of a uniform sampling distribution, we use a query-driven sampling method that is biased by the arrangement of the probabilistic model. In the first, target-fixed, algorithm, we adapt the samples to resolve the query entity. The second, query-proportional, algorithm selects mentions based on their probabilistic similarity to the query entity. The third, hybrid, algorithm combines the two approaches. Following the seminal work of Wick et al. [100], we devise an influence function to model the similarity between the mentions and the query entity as an attract score. This influence function works best when the cluster of mentions is heterogeneous. In the case when the cluster of mentions is homogeneous, for example the result of high-quality canopy generation, we show a different algorithm to compute and apply an influence function that generates a repel score for biased sampling. To support multi-query ER, a naive nested-loop join can be performed using the single-query ER algorithms iteratively to compute the resolution one entity at a time.
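All of these variants keep the standard Metropolis-Hastings machinery and change only the proposal distribution. For reference (this is the textbook acceptance rule, not notation taken from the cited work), a move from configuration $e$ to $e'$ drawn from a proposal $q$ is accepted with probability

\[
\alpha(e \rightarrow e') \;=\; \min\!\left(1,\; \frac{\pi(e')\, q(e \mid e')}{\pi(e)\, q(e' \mid e)}\right),
\]

where $\pi$ is the distribution defined by the underlying factor graph. The target-fixed, query-proportional, and hybrid algorithms bias $q$ toward mentions relevant to the query while leaving this acceptance rule unchanged; provided the biased proposal can still reach every configuration, the sampler converges to the same distribution.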

However, such a join algorithm can lead to unoptimized resource allocation if the same number of samples is generated for each target entity, or to low throughput if one of the entities has a low convergence rate (e.g., a long sampling process). To alleviate this problem, we discuss three multi-query ER algorithms, which schedule the computation (i.e., sample generation) among different target entities in order to achieve the optimum overall convergence rate.

1.3 Question Answering

The next step after extracting information from large data sets and analyzing it is question answering. Question answering bridges the gap between the way a user asks a question and the way an answer is encoded in the background knowledge. Understanding questions and extracting answers requires the full suite of text mining tasks. The process of question answering is inherently query-driven; all possible questions over a data set cannot be enumerated, therefore any question answering system waits for a user query to initiate an answer discovery process. Question answering is the holy grail of text analytics. Many text analytic tasks are required to obtain accurate answers.

In this work, we extract answers from the web corpus and we distinguish between two portions of the web, namely the surface web and the deep web. The surface web is the set of standard web pages accessible from a browser without authenticating or providing any credentials. These pages include blog posts, company web pages, news articles, and more. By contrast, the deep web is the set of pages generated through web forms. That is, accessing deep web pages requires user interaction and supplied parameters. We present the Morpheus QA system, a question answering system that records user interaction with the deep web in order to answer questions. When a user poses a natural language question, that question is compiled and matched to previous interactions on the deep web, and Morpheus QA compiles the set of pages required to answer the question. It then, at query time, interacts with the deep web to extract an answer for the query.

The surface web contains vast amounts of facts that can be extracted using text analytics. These facts are uncertain; that is, there may be conflicting or ill-formed facts. These uncertainties may arise from the extraction process itself, from changes in facts over time, or from the inherent ambiguity of natural language. Hence, these facts are stored along with their probabilities in a probabilistic knowledge base. A probabilistic knowledge base is a database of facts with an inference engine. The probabilistic knowledge base allows queries that compute the probability that a fact is correct given the rest of the facts in the knowledge base. Additionally, knowledge bases can infer missing information. Given the promise of these types of advanced knowledge bases, we can look at the calculation of the probabilities in a knowledge base. We procure paths of connected facts from knowledge bases to show connections between entities. Connected paths between entities can help an information seeker understand how two entities correspond. For a web-scale knowledge base, this task is difficult without using large machines. We give examples of how this method assists information seekers.

Dissertation outline. This dissertation is outlined in six chapters. Each chapter is independent and discusses any necessary background and related work within. Chapter 2 describes how a database can be used as a location for text analytics through two systems, MADden and GPText. Chapter 3 describes query-driven entity resolution; this is the main contribution of this work. Chapter 4 adds depth to the entity resolution work by proposing an optimizer to increase its efficiency. Chapter 5 describes the deep web question answering system known as Morpheus. Chapter 6 presents the final part of the dissertation, namely an algorithm for ranking paths between entities in knowledge bases. In Chapter 7 we summarize the contributions and present some future areas of research.

CHAPTER 2
IN-DATABASE QUERY-DRIVEN TEXT ANALYTICS

In this chapter, I introduce two systems, MADden [41] and GPText [58], that are created to introduce a text analytics paradigm where the data scientist rarely has to leave the comfort of the RDBMS, where their data lies. This chapter shows how these systems allow for non-trivial text analytics, sophisticated text search, and visualization. By empowering the RDBMS to perform these tasks, we can use the declarative query interface to decide over which data source we should perform analytics; that is, we are performing query-driven data analytics.

The second half of this chapter discusses the GPText project. This work overlaps significantly with that of a fellow Ph.D. student, Kun Li; each of us contributed equally, and it was important to include it in the dissertation because it is the second part of the MADden project. This body of work represents early contributions in statistical text analytics.

2.1 MADden Introduction

For many applications, unstructured text and structured data are both important natural resources to fuel data analysis. For example, a sports journalist covering NFL (National Football League1 — thirty-two American football teams with more than 1700 players) games would need a system that can analyze both the structured statistics (e.g., scores, biographic data) of teams and players and the unstructured tweets, blogs, and news about the games. In such applications, analytics are performed over text data from many sources. Text analysis uses statistical machine learning (SML) methods to extract structured information — such as part-of-speech tags, entities, relations, sentiments, and topics — from text. The result of the text analysis can be joined with other structured data sources

1 http://www.nfl.com

for more advanced analysis. For example, a sports journalist may want to correlate fan sentiment from tweets with statistics describing the player and team performance of the Miami Dolphins.2 To answer such queries, a software developer must understand and connect multiple tools, including Lucene for text search, Weka or R for sentiment analysis, and a database to join the structured data with the sentiment results. Using such a complex off-line batch process to answer a single query makes it difficult to ask ad hoc queries over ever-evolving text data. These queries are essential for applications such as computational journalism, e-discovery, and political campaign management, where queries are exploratory in nature and follow-up queries need to be asked based on the results of previous queries.

MADden implements four important text analysis functions, namely part-of-speech tagging, entity extraction, classification (e.g., sentiment analysis), and entity resolution. Text analysis functions are implemented using PostgreSQL (a single-threaded database) and Greenplum (a massively parallel processing (MPP) framework). Two SML models and their inference algorithms are adapted: linear-chain Conditional Random Fields (CRF)3 and Naive Bayes [95]. In-database and parallel SML algorithms are implemented in the Greenplum MPP framework. The MADden text analytic library is integrated into the MADLib open-source project.4 The declarative SQL query interface with MADden text analysis functions provides a higher-level abstraction. Such an abstraction shields users from detailed text analytic algorithms and enables users to focus more on application-specific data explorations.

2 http://www.miamidolphins.com/
3 A CRF is a discriminative probabilistic graphical model used to encode arbitrary relationships for statistical processing.
4 MADLib is an open source project for scalable in-database analytics: http://madlib.net

Figure 2-1. MADden architecture

In this chapter we will show the following points using e-journalism over an NFL corpus (our driving example):

• Processing declarative ad hoc queries involving various statistical text analytic functions;

• Joining and querying over multiple data sources with both structured and unstructured textual information;

• Query-time rendering of visualizations over query results, using word clouds, histograms, and ranked lists of documents.

2.2 MADden System Description

In this section we first discuss the general architecture and the basic techniques used in the implementation of the text analytics algorithms in MADden. We then give an example POS tagging implementation.

2.2.1 MADden System Architecture

MADden is a four-layered system, as can be seen in Figure 2-1. The user interface is where both naive and advanced users can construct queries over text, structured data, and models. From the user interface, queries are then passed to the DBMS, where both the MADLib and MADden libraries sit on top of the query processor to add statistical and text processing functionality. It is important to emphasize that MADlib and MADden perform functions at the same logical layer. To enable text analytics, MADden works alongside statistical functions found in the MADlib library [44]. These queries are processed using PostgreSQL and Greenplum's parallel DB architecture to further optimize storage replication and query parallelism.

Table 2-1. Listing of current MADden functions
Function                 Task
match(object1, object2)  Entity Resolution
sentiment(text)          Sentiment Analysis
entity_find(text)        Detects Named Entities
viterbi                  Part-of-speech tags using CRF

In this section we describe various text analysis algorithms. Many approaches exist for in-database information extraction. We build on our previous work using Conditional Random Fields (CRFs) for query-time information extraction [94]. We perform the extraction and the inference inside of the database. We rely on information provided in the query to make decisions on the type of algorithm used for extraction. Table 2-1 lists the statistical text analysis tasks.

Entity resolution, or co-reference resolution, is the following problem: given any two mentions of a name, cluster them if and only if they refer to the same real-world entity. Certain entities may be misrepresented by the presence of different names, misspellings in the text, or aliases. It is important to resolve these entities appropriately to better understand the data. Increasingly, informal text such as blog posts and tweets requires entity resolution. MADden uses inverted indices within the database to perform text analysis on documents. We can scan the inverted indices of each document, filtering out documents that do not contain instances of the player names. To handle misspellings and nicknames we use trigram indices to perform approximate matches of searches for names

as database queries [46]. This method allows us to use indices to perform queries on only the relevant portions of the data set; thus we do no extra processing.

We implemented functions to perform classification tasks such as POS tagging and sentiment analysis. These functions work at both the document and sentence level. In sentiment analysis we classify text by polarity, where positive sentiment refers to the positive nature of the expressed opinion and negative sentiment to its negative nature. Much work has already been done in this area for document-level and entity-level sentiment [71, 103]. These functions can be joined with other tables and functions within an SQL query, allowing more complex queries to be realized declaratively.
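To make the trigram-based approximate matching above concrete, the following is a minimal sketch using PostgreSQL's pg_trgm extension over the extracted_entities table introduced later in this chapter; the index name, search string, and threshold are made up for the example.

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX entity_trgm_idx
    ON extracted_entities USING gin (entity gin_trgm_ops);

-- similarity() returns a score in [0, 1]; misspelled or abbreviated names
-- such as 'Jacksonvile Jaguars' still match the stored entity strings.
SELECT doc_id, entity, similarity(entity, 'Jacksonvile Jaguars') AS score
FROM   extracted_entities
WHERE  entity % 'Jacksonvile Jaguars'   -- trigram match above the session threshold
ORDER  BY score DESC;

The % operator can use the trigram index, so only candidate rows are fetched, which matches the goal of touching just the relevant portion of the data set.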

Parallelization. With a parallel database architecture such as Greenplum, we can further optimize queries written with MADden by parallelizing them. Each node within the parallel DB can run a query over a subset of the data (data parallelism). This includes the statistical methods in MADLib, which were all built to be data parallel. Greenplum has a parallel shared-nothing architecture. Data is loaded onto segment servers. When a query is issued, a parallel query optimizer creates a global query plan which is pushed to each of the segment servers. Query-driven algorithms can then be executed in parallel over several data servers.

2.2.3 MADden Implementation Details

Core to many natural language processing tasks, part-of-speech (POS) tagging involves the labeling of terms with their parts of speech within a sentence. We implemented POS tagging in PostgreSQL and Greenplum. Our code is part of the MADLib open source system.

MADden uses a first-order chain CRF to model the labeling of a sequence of tokens. The factor graph contains observed nodes for each sentence token with latent label variables attached to each token. Factors are functions that connect two nodes or signify the ends of the chain. We generate the features using a function generatemrtbl. This function produces a table rfactor for single-state features and a table mfactor for two-state features.

Training the CRF model is a one-time task that is performed outside the DBMS.5 We use a Python script to parse and import the trained model into tables in the DBMS. Inference is performed over the stored models in order to find the highest assignment of labels in the model. We calculate the most probable label assignment using the Viterbi dynamic programming algorithm over the label space. We use the PL/Python language to manage the workflow of all the calculations. The computationally expensive Viterbi function is implemented as a database user-defined function in the C language. The feature generation and the execution of inference over a table of sentences are implemented in SQL. When executed in Greenplum the query is performed in parallel. Implementing POS tagging inside the DBMS allows us to perform inference over a subset of tokens in response to a query instead of performing batch tagging over all tokens. We also get the benefit of using the query engine to parallelize our queries without losing the ability to drive the workflow using PL/Python.

Example. Algorithm 1 performs POS tagging for all the sentences that contain the word 'Jaguar'. This query interface allows the user to perform functions on a subset of the data. The segmenttbl table holds a list of tokens and their positions for each document (doc_id). We assume a document is a sequence of tokens.

Algorithm 1 POS tagging on sentences with the word 'Jaguar'
SELECT DISTINCT ON (segtbl.doc_id)
       viterbi(segtbl.seglist, mfactor.score, rfactor.score)
FROM   segmenttbl, mfactor, rfactor, segtbl
WHERE  segtbl.doc_id = segmenttbl.doc_id
  AND  segmenttbl.seg_text = 'Jaguar';

5 We use the IIT Bombay package for training available at http://crf.sourceforge.net

2.3 Text Analysis Queries and Demonstration

In this section we describe various data sources from the NFL domain for computational journalism. We then describe the query-driven user interfaces used for exploratory text analysis applications.

2.3.1 Dataset for MADden Example

Our sample demonstration for MADden involves a variety of NFL-based data sources. The data is represented in Table 2-3 as an abbreviated schema.6 The NFLCorpus table holds semi-structured data: textual data from blogs, news articles, and fan tweets along with document metadata such as timestamp, tags, and type, among others. The tweets were extracted using the Twitter Streaming API7 with a series of NFL-related keywords, and the news articles and blogs were extracted from various sports media websites. These documents vary in size and quality. We have around 25 million tweets from the 2011 NFL season, including plays and recaps from every game in the season.

The statistical data was extracted from the NFL.com player database. Each table contains the player's name, position, number, and a series of statistics of different types (some players show up in multiple tables, others in only one). The Player table holds information about a player in the NFL, including college, birthday, height, and weight, as well as years in the NFL. The Team table holds some basic information about the 32 NFL teams, including location, conference, division, and stadium. TeamStats2011 holds the team rankings and statistics in a variety of categories (Offense, Defense, Special Teams, Points, etc.). Extracted Entities stores the extracted entities found in the NFLCorpus documents.

6 These tables may be extracted to an RDBMS, or defined over an API using a foreign data wrapper.
7 https://dev.twitter.com/docs/streaming-apis

2.3.2 MADden Text Analytics Queries

Based on our example dataset, suppose a sports journalist wants to write an investigative piece on the overall public opinion of all Florida-based NFL teams during the 2011-2012 season. Such a piece would require in-depth analysis of news reports, tweets, and blog postings, among other sources. The standard approach would consist of trawling through the text sources either by hand or with a series of different text processing toolkits and packages, sometimes specialized for a single task. Instead of multiple tools, MADden can streamline this process with its declarative, in-database approach to text analytics. A first step may consist of paring the corpora down to just the documents related to the Florida football teams, namely the Miami Dolphins, Jacksonville Jaguars, and Tampa Bay Buccaneers.

Algorithm 2 An entity resolution query using the MADden framework.
SELECT DISTINCT doc_id
FROM   extracted_entities
WHERE  match('Jaguars', entity) > match_thresh
   OR  match('Dolphins', entity) > match_thresh
   OR  match('Buccaneers', entity) > match_thresh;

The match function used in Algorithm 2 is an entity resolution UDF which calculates a [0, 1]-bounded inverse metric, where terms that are close to our target will have a higher score than those that are less similar. Extracted Entities is a view constructed using entity_find. This function detects entities using one of two accuracy settings (high accuracy with lower recognition, or low accuracy with higher recognition) on textual documents as they are added to the database (in this case, likely news articles and blogs). Table 2-2 shows some of the current text analysis functions implemented in MADden.

A journalist may want to explore fan sentiment for the Jacksonville Jaguars based on tweets collected during the NFL season. Utilizing the first query as a building block (with some changes), we can construct this query as listed in Algorithm 3.

Table 2-2. List of each MADden function and its NLP task
Function                    Task
match(target, against)      Entity Resolution
sentiment(text)             Sentiment Analysis
entity_find(text, boolean)  Detects Named Entities
pos_tag(text)               POS tagging
viterbi                     Part-of-speech tags using CRF
pos_extract(text, type)     POS term extraction

Table 2-3. Abbreviated NFL dataset schema
Table               Attributes
NFLCorpus           doc_id, type, text, tstmp, tags
PlayerStats2011     pid, type-specific stats
Player              pid, fname, lname, college, etc.
TeamStats2011       team, points, pass_yds, various stats
Team                team, city, state, stadium
Extracted Entities  doc_id, entity

In the query of Algorithm 3, one could accommodate nicknames through OR-matching on the extracted entities, an alias table, or other strategies. Notice that going from a single text analytics task to a more complex analysis only required a small change. Whereas a traditional approach would have us either looking for a customized solution or patching together packages, the declarative SQL approach allows the user to simply state what the result should be. And since we are working in SQL, we can combine queries on corpus tables with tables of structured data. For example, if our journalist wants to analyze the media opinion of the state's best receiver, he could consult both the player stats table and the media blogs, as shown in Algorithm 4.

Algorithm 3 Entity resolution and sentiment analysis in MADden.
SELECT DISTINCT E.doc_id, E.entity, sentiment(S.document)
FROM   extracted_entities AS E, NFLCorpus AS S
WHERE  E.doc_id = S.doc_id
  AND  sentiment(S.document) IN ('+', '-')
  AND  match('Jaguars', E.entity) > match_thresh
  AND  S.type = 'tweet';

Algorithm 4 A MADden query over structured and unstructured data.
SELECT BestWR.name, sentiment(A.txt), A.txt
FROM   NFLCorpus A, extracted_entities E,
       (SELECT P.fname || ' ' || P.lname AS name
        FROM   Player P, PlayerStats2011_Rec S
        WHERE  S.pid = P.pid
          AND  (P.team = 'Jaguars' OR P.team = 'Dolphins'
                OR P.team = 'Buccaneers')
        ORDER BY S.rec_yds DESC
        LIMIT 1) AS BestWR
WHERE  E.doc_id = A.doc_id
  AND  (A.type = 'blog' OR A.type = 'news')
  AND  match(BestWR.name, E.entity) > match_thresh;

Algorithm 4 uses the standard structured SQL tables Player and PlayerStats2011_Rec, which represent players and their receiving stats. Our journalist finds the best receiver playing on a Florida NFL team based on receiving yards.8 The query then discovers all the associated news and blog documents and performs the entity resolution function on the extracted entities, returning the sentiment and text associated with that player. This method is not restricted to single-domain analytics. One can run analytics combining different datasets (e.g., state economies and the NFL), utilizing the same declarative in-database methods seen here.

2.3.3 MADden User Interface

We have given an interactive demonstration of MADden's capabilities. The demonstration is based around MADden UI, a web interface that allows users to perform analytic tasks on our dataset. MADden UI has two forms of interaction: raw SQL queries, and a Mad Lib9 style interface with fill-in-the-blank query templates for quick interaction, as shown in Figure 2-2. In the demonstration accompanying [41], the user interface was included to assist users in interpreting the results.

8 Total yards over a season, yards per catch, and touchdowns usually decide who the best receiver was at the end of a season.
9 http://en.wikipedia.org/wiki/Mad_Libs

Figure 2-2. Example MADden UI query template

2.4 GPText Introduction

Many companies keep large amounts of text data in relational databases. Several challenges exist in performing analysis on such datasets using state-of-the-art systems. First, expensive data transfer costs must be paid up-front to move data between databases and analytics systems. Second, many popular text analytics packages do not scale up to production-sized datasets. In this section, we introduce GPText, a Greenplum parallel statistical text analysis framework that addresses the above problems by supporting statistical inference and learning algorithms natively in a massively parallel processing database system. GPText seamlessly integrates the Solr search engine and applies statistical algorithms such as k-means and LDA using MADLib, an open source library for scalable in-database analytics which can be installed on PostgreSQL and Greenplum. In addition, through GPText we developed and contributed a linear-chain conditional random field (CRF) module to MADLib to enable information extraction tasks such as part-of-speech tagging and named entity recognition. We show the performance and scalability of the parallel CRF implementation. Finally, we describe an e-discovery application built on the GPText framework.

Text analytics has gained much attention in the big data research community due to the large amounts of text data generated every day in organizations such as companies, governments, and hospitals in the form of emails, electronic notes, and internal documents. Many companies store this text data in relational databases because they rely on databases for their daily business needs. A good understanding of this unstructured text data is crucial for companies to make business decisions, for doctors to assess their patients, and for lawyers to accelerate document review processes.

Traditional business intelligence pulls content from databases into other massive data warehouses to analyze the data. The typical "data movement process" involves moving information from the database for analysis using external tools and storing the final product back into the database. This movement process is time consuming and prohibitive for interactive analytics. Minimizing the movement of data is a huge incentive for businesses and researchers. One way to achieve this is for the datastore to be in the same location as the analytics engine. While Hadoop has become a popular platform to perform large-scale data analytics, newer parallel processing relational databases can also leverage more nodes and cores to handle large-scale datasets. The Greenplum database, built upon the open source database PostgreSQL, is a parallel database that adopts a shared-nothing massively parallel processing (MPP) architecture. Database researchers and vendors are capitalizing on the increase in database cores and nodes and investing in open-source data analytics ventures such as the MADLib project [23, 45]. MADLib is an open-source library for scalable in-database analytics on Greenplum and PostgreSQL. It provides parallel implementations of many machine learning algorithms.

In this chapter, we motivate in-database text analytics by presenting GPText, a powerful and scalable text analysis framework developed on the Greenplum MPP database. GPText inherits scalable indexing, keyword search, and faceted search functionalities from an effective integration of the Solr search engine [33]. GPText uses and contributes statistical methods to the MADLib open-source library. We show that we can use SQL and user-defined aggregates to implement conditional random field (CRF) methods for information extraction in parallel. The experiments show a sublinear improvement in runtime for both CRF learning and inference with a linear increase in the number of cores. As far as we know, GPText is the first toolkit for statistical text analysis in relational database management systems. Finally, we describe the needs and requirements of e-discovery applications and show that GPText is an effective platform on which to develop such sophisticated text analysis applications.

2.4.1 GPText Related Work

Researchers have created systems for large-scale text analytics including GATE, PurpleSox, and SystemT [15, 26, 60]. While both GATE and SystemT use a rule-based approach for information extraction (IE), PurpleSox uses statistical IE models. However, none of the above systems is natively built for an MPP framework. Our parallel in-database CRF implementation follows the MAD methodology [45]. In a similar vein, researchers have shown that most machine learning algorithms can be expressed through unified RDBMS architectures [31]. Recently, parallel machine learning algorithms have been developed by many research groups [16, 56, 81]. There are several implementations of conditional random fields, but only a few large-scale implementations for NLP tasks. One example is PCRFs [76], which are implemented over massively parallel processing systems supporting the Message Passing Interface (MPI), such as Cray XT3, SGI Altix, and IBM SP. However, this is not implemented over an RDBMS.

2.4.2 Greenplum Text Analytics

GPText runs on the Greenplum database (GP), which is a shared-nothing massively parallel processing database. The Greenplum database adopts the widely used master-slave parallel computing model, where one master node orchestrates multiple slaves to work until the task is completed and there is no direct communication between slaves. As shown in Figure 2-3, it is a collection of PostgreSQL instances including one master instance and multiple slave instances (segments). The master node accepts SQL queries from clients, then divides the workload and sends sub-tasks to the segments. Besides harnessing the power of a collection of computer nodes, each node can also be configured with multiple segments to fully utilize multicore processors. To provide high availability, GP provides the option to deploy a redundant standby master and mirror segments in case of master or primary segment failure. The parallel processing capability of the Greenplum MPP framework lays the cornerstone that enables GPText to process production-sized text data. Beyond the underlying MPP framework, GPText also inherits the features that a traditional database system provides, for example online expansion, data recovery, and performance monitoring, to name a few.

On top of the underlying MPP framework there are two building blocks, MADLib and Solr (illustrated in Figure 2-3), which distinguish GPText from many of the existing text analysis tools. MADLib makes GPText capable of performing sophisticated text data analysis tasks, such as part-of-speech tagging, named-entity recognition, document classification, and topic modeling, with a vast amount of parallelism. Solr is a reliable and scalable text search platform from the Apache Lucene project, and it has been widely deployed in web servers. Its major features include powerful full-text search, faceted search, and near-realtime indexing. As shown in Figure 2-3, GPText uses Solr to create a distributed index. Each primary segment is associated with exactly one Solr instance, where the index of the data in that primary segment is stored, for the purpose of load balancing. GPText has all the features that Solr has, since Solr is integrated into GPText seamlessly. GPText also contributes statistical methods to MADLib for text analysis, such as CRF.

2.4.2.1 In-database document representation

In GPText, a document can be represented as a vector of counts against a token dictionary, which is a vector of the unique tokens in the dataset. For efficient storage and memory use, GPText uses a sparse vector representation for each document instead of a naive vector representation. The following is an example of two different vector representations of a document. The dictionary contains all the unique terms (i.e., 1-grams) that exist in the corpus.

Dictionary: {am, before, being, bothered, corpus, document, in, is, me, never, now, one, really, second, the, third, this, until}
Document: {i, am, second, document, in, the, corpus}
Naive vector representation: {1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1}
Sparse vector representation: {(1, 0, 1, 1, 1, 1, 0, 1, 1) : (1, 3, 1, 1, 1, 1, 6, 1, 1)}

GPText adopts run-length encoding to compress the naive vector representation using a pair of vectors: the first vector holds the distinct values, and the second vector holds the number of contiguous appearances (the run length) of each value in the first vector. Although not apparent in this example, the advantage of the sparse vector representation is dramatic in real-world documents, where most of the elements in the vector are zero.

Figure 2-3. The GPText architecture over the Greenplum database
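MADlib ships a sparse vector type, svec, built on the same run-length idea; the sketch below assumes the MADlib extension is installed in a schema named madlib, and note that the svec literal lists the run lengths first and the values second (the reverse of the ordering shown above).

-- The same document as above: one 1, three 0s, then 1, 1, 1, 1, six 0s, 1, 1.
SELECT '{1,3,1,1,1,1,6,1,1}:{1,0,1,1,1,1,0,1,1}'::madlib.svec;
-- The resulting value can be fed to MADlib's vector operators and ML functions.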

2.4.2.2 ML-based advanced text analysis

GPText relies on multiple machine learning modules in MADLib to perform statistical text analysis. Three of the most commonly used modules are k-means for document clustering, multinomial naive Bayes (multinomialNB) for document classification, and latent Dirichlet allocation (LDA) for topic modeling, used for dimensionality reduction and feature extraction. Performing k-means, multinomialNB, or LDA in GPText follows the same pipeline:

1. Create a Solr index for the documents.
2. Configure and populate the Solr index.
3. Create a terms table for each document.
4. Create a dictionary of the unique terms across documents.
5. Construct a term vector using the term frequency-inverse document frequency (tf-idf) for each document (a SQL sketch of this step follows below).
6. Run the MADLib k-means/multinomialNB/LDA algorithm.

The following section details the implementation of the CRF learning and inference modules that we developed for GPText applications as part of MADLib.
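Before turning to CRFs, step 5 above can be written directly in SQL. The sketch below assumes a hypothetical per-document term table, term_counts(doc_id, term, cnt), as produced by step 3; it is meant only to illustrate the shape of the computation, not GPText's exact implementation.

SELECT t.doc_id,
       t.term,
       (t.cnt::float8 / dl.doc_len) *
       ln(corpus.n_docs::float8 / df.docs_with_term) AS tf_idf
FROM   term_counts t
JOIN   (SELECT doc_id, sum(cnt) AS doc_len
        FROM term_counts GROUP BY doc_id) dl USING (doc_id)
JOIN   (SELECT term, count(DISTINCT doc_id) AS docs_with_term
        FROM term_counts GROUP BY term) df USING (term)
CROSS JOIN (SELECT count(DISTINCT doc_id) AS n_docs
            FROM term_counts) corpus;

Because the statement is plain SQL over set semantics, Greenplum parallelizes it across segments with no extra work.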

2.4.3 CRF for IE over MPP Databases

A conditional random field (CRF) is a type of discriminative undirected probabilistic graphical model. Linear-chain CRFs are special CRFs that assume the next state depends only on the current state. Linear-chain CRFs achieve state-of-the-art accuracy in many real-world natural language processing (NLP) tasks such as part-of-speech tagging and named entity recognition.

2.4.3.1 Implementation overview

Figure 2-4 illustrates the detailed implementation of the CRF module developed for IE tasks in GPText. The top box shows the pipeline of the training phase; the bottom box shows the pipeline of the inference phase. We use declarative SQL statements to extract all features from text. Any feature available in the state-of-the-art packages can be extracted using a single SQL clause; all of the common features described in the literature can be extracted with one SQL statement each. The extracted features are stored in a relation for either single-state or two-state features. After the feature extraction, we use user-defined aggregates (UDAs) to calculate the maximum a posteriori (MAP) configuration and probability for inference. For learning, we use UDFs to implement the gradient and log-likelihood computations in parallel.

Figure 2-4. The MADLib CRF overall system architecture

2.4.3.2 Feature extraction using SQL

Text feature extraction is a step in most statistical text analysis methods. We are able to implement all seven types of features used in POS tagging and NER using exactly seven SQL statements. These features include:

Dictionary: does this token exist in a dictionary?
Regex: does this token match a regular expression?
Edge: is the label of a token correlated with the label of the previous token?
Word: does this token appear in the training data?
Unknown: does this token appear in the training data below a certain threshold?
Start/End: is this token first/last in the token sequence?

There are many advantages to extracting features using SQL. The SQL statements hide a lot of the complexity present in the actual operation. It turns out that each type of feature can be extracted using exactly one SQL statement, making the feature extraction code extremely succinct. Additionally, SQL statements are naturally parallel due to the set semantics supported by relational DBMSs. For example, we compute features for each distinct token and avoid re-computing the features for repeated tokens. In Algorithm 5 and Algorithm 6 we show how to extract edge and regex features, respectively. Algorithm 5 extracts adjacent labels from sentences and stores them in an array. Algorithm 6 shows a query that selects all the sentences that satisfy any of the regular expressions present in the table regextbl.

Algorithm 5 Query for extracting edge features
SELECT doc2.pos, doc2.doc_id, 'E.', ARRAY[doc1.label, doc2.label]
FROM   segmenttbl doc1, segmenttbl doc2
WHERE  doc1.doc_id = doc2.doc_id AND doc1.pos + 1 = doc2.pos

Algorithm 6 Query for extracting regex features
SELECT start_pos, doc_id, 'R ' || r.name, ARRAY[-1, label]
FROM   regextbl r, segmenttbl s
WHERE  s.seg_text ~ r.pattern
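In the same spirit, the remaining feature types each reduce to one statement. The sketch below shows how a dictionary feature might be expressed; the dict table, its token column, and the 'D ' prefix are hypothetical and only mirror the shape of Algorithms 5 and 6.

SELECT s.start_pos, s.doc_id, 'D ' || s.seg_text, ARRAY[-1, s.label]
FROM   segmenttbl s, dict d
WHERE  s.seg_text = d.token;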

2.4.3.3 Parallel linear-chain CRF training

Programming Model. In Algorithm7 we show the parallel CRF training strategy. The algorithm is expressed as a user-defined aggregate. User-defined aggregates are composed of three parts: a transition function (Algorithm8), a merge functionand a finalization function (Algorithm9). Following we describe these functions. In line1 of Algorithm7 the Initialization function creates a state object in the database. This object contains coefficient (w), gradient (∇) and log-likelihood (L) variables. This state is loaded (line3) and saved (line8) between iterations. We compute

Algorithm 7 CRF training(z1:M)

Input: z1:M (the document set), Convergence(), Initialization(), Transition(), Finalization()
Output: Coefficients w ∈ R^N
Initialization/Precondition: iteration = 0
1: Initialization(state)
2: repeat
3:   state ← LoadState()
4:   for all m ∈ 1..M do
5:     state ← Transition(state, zm)
6:   end for
7:   state ← Finalization(state)
8:   WriteState(state)
9: until Convergence(state, iteration)
return state.w

the gradient and log-likelihood of each segment in parallel (line 4), much like a map function. Then line 7 computes the new coefficients, much like a reduce function.
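For illustration only, the following Python sketch mirrors the transition/merge/finalization pattern of the training aggregate; crf_gradient is a hypothetical stand-in for the per-document gradient and log-likelihood computation, and the finalization step uses a placeholder update rather than the actual L-BFGS solver used in GPText.

import numpy as np

def crf_gradient(coef, doc):
    # Toy stand-in: a real implementation runs forward-backward over the
    # document's features; here we simply return zeros.
    return np.zeros_like(coef), 0.0

class TrainingState:
    def __init__(self, num_features):
        self.coef = np.zeros(num_features)   # coefficients w
        self.grad = np.zeros(num_features)   # gradient accumulator
        self.loglik = 0.0                    # log-likelihood accumulator
        self.num_rows = 0

def transition(state, doc):
    """Invoked once per document (segment); runs in parallel on each worker."""
    g, ll = crf_gradient(state.coef, doc)
    state.grad += g
    state.loglik += ll
    state.num_rows += 1
    return state

def merge(state_a, state_b):
    """Combines partial aggregates produced by different workers."""
    state_a.grad += state_b.grad
    state_a.loglik += state_b.loglik
    state_a.num_rows += state_b.num_rows
    return state_a

def finalize(state):
    """Placeholder for the L-BFGS step that consumes the aggregated gradient."""
    state.coef += 0.1 * state.grad           # stand-in for the real solver update
    state.grad[:] = 0.0
    state.loglik = 0.0
    state.num_rows = 0
    return state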

Transition strategies. Algorithm 8 contains the logic for computing the gradient and log-likelihood for each tuple using the forward-backward algorithm. This algorithm is invoked in parallel over many segments, and the results of these invocations are combined using the merge function.

Algorithm 8 transition-lbfgs(state, zm)
Input: state (the transition state), zm (a document), Gradient()
Output: state
1: {state.∇, state.L} ← Gradient(state, zm)
2: state.num_rows ← state.num_rows + 1
return state

Finalization strategy. The finalization function invokes the L-BFGS convex solver to get a new coefficient vector.

Algorithm 9 finalization-lbfgs(state)
Input: state, LBFGS() (convex optimization solver)
Output: state
1: {state.∇, state.L} ← penalty(state.∇, state.L)
2: instance ← LBFGS.init(state)
3: instance.lbfgs() (invoke the L-BFGS solver)
return instance.state

Limited-memory BFGS (L-BFGS), a variation of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, is a leading method for large-scale unconstrained convex optimization. We translate an in-memory Java implementation [68] into a C++ in-database implementation. Before each iteration of L-BFGS optimization, we need to initialize the solver with the current state object. At the end of each iteration, we need to write the updated variables back to the database state for the next iteration.
2.4.3.4 Parallel linear-chain CRF inference

The Viterbi algorithm is used to find the k most likely labelings of a document under a CRF model. We chose to implement an SQL clause to drive the Viterbi inference. The Viterbi inference is implemented sequentially, and each function call finishes labeling one document. However, in Greenplum, Viterbi can be run in parallel over different subsets of the documents on a multi-core machine, so the CRF inference is naturally parallel.
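To make the decoding step concrete, here is a minimal single-document Viterbi sketch in Python/NumPy under simplifying assumptions (log-space scores and a single transition matrix); the GPText module is implemented in the database and driven from SQL, and it produces the top-k labelings rather than only the best one.

import numpy as np

def viterbi(emission, transition):
    """Most likely label sequence for one document under a linear-chain model.

    emission:   (T, L) array of per-token scores in log space
    transition: (L, L) array, transition[i, j] = score of label i -> label j
    """
    T, L = emission.shape
    score = np.zeros((T, L))
    backptr = np.zeros((T, L), dtype=int)
    score[0] = emission[0]
    for t in range(1, T):
        # candidate[i, j] = score of ending at label i previously and moving to j
        cand = score[t - 1][:, None] + transition + emission[t][None, :]
        backptr[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0)
    # follow back-pointers from the best final label
    labels = [int(score[T - 1].argmax())]
    for t in range(T - 1, 0, -1):
        labels.append(int(backptr[t][labels[-1]]))
    return list(reversed(labels))

# Example: 4 tokens, 3 labels, random illustrative scores
emission = np.log(np.random.rand(4, 3))
transition = np.log(np.random.rand(3, 3))
print(viterbi(emission, transition))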

2.4.4 GPText Experiments and Results

In order to evaluate the performance and scalability of linear-chain CRF learning and inference on Greenplum, we conducted experiments on various data sizes on a 32-core machine with a 2TB hard drive and 64GB of memory. We used the CoNLL-2000 dataset, containing 8,936 tagged sentences, for learning. This dataset is labeled with 45 POS tags. To evaluate inference performance, we extracted 1.2 million sentences from the New York Times dataset. In Figure 2-5 and Figure 2-6 we show that the algorithm is sublinear and improves with an increase in the number of segments.

Figure 2-5. Linear-chain CRF training scalability

Figure 2-6. Linear-chain CRF inference scalability

Our POS implementation achieves 0.9715 accuracy, which is consistent with the state of the art [63].
2.4.5 GPText Application

With the support of the Greenplum MPP data processing framework, the efficient Solr indexing/search engine, and parallelized statistical text analysis modules, GPText positions itself as a strong platform for applications that need to apply scalable text analytics of varying sophistication over unstructured data in databases. One such application is e-discovery. E-discovery has become increasingly important in legal processes. Large enterprises keep terabytes of text data such as emails and internal documents in their databases. Traditional civil litigation often involves reviews of large amounts of documents on both the plaintiff's and the defendant's side. E-discovery provides tools to pre-filter documents for review, providing a speedier, less expensive, and more accurate solution.

Figure 2-7. GPText application

Traditionally, simple keyword search is used for such document retrieval tasks in e-discovery. Recently, predictive coding based on more sophisticated statistical text analysis methods has been gaining attention in civil litigation, since it provides higher precision and recall in retrieval. In 2012, judges in several cases approved the use of predictive coding based on the Federal Rules of Civil Procedure [1]. We developed a prototype e-discovery tool in GPText using the Enron dataset. Figure 2-7 is a snapshot of the e-discovery application. In the top pane, the keyword 'illegal' is specified and Solr is used to retrieve all relevant emails that contain the keyword, displayed in the bottom pane. As shown in the middle pane on the right, the tool also supports topic discovery in the email corpus using LDA and clustering using k-means. The k-means step uses LDA and CRF to reduce the dimensionality of the features and speed up the clustering. CRF is also used to extract named entities. The discovered topics are displayed in the Results panel. The tool also provides visualization of aggregate information and faceted search over the email corpus.

2.4.6 GPText Summary

We introduce GPText, a parallel statistical text analysis framework over an MPP database. With its seamless integration with Solr and MADLib, GPText is a framework with a powerful search engine and advanced statistical text analysis capabilities. We implemented a parallel CRF inference and training module for IE. The functionality and scalability provided by GPText position it as a strong platform for sophisticated text analytics applications such as e-discovery.

CHAPTER 3
MAKING ENTITY RESOLUTION QUERY-DRIVEN

In this chapter, I present techniques to make an important problem query-driven. I take a model of entity resolution described by McCallum et al. [67] and describe how to express a query over this model so that it returns only the answers requested by the data scientist. This method reduces the amount of work required to obtain the requested answers.
3.1 Query-Driven Entity Resolution Introduction

Entity resolution (ER) is the process of identifying and linking/grouping different manifestations (e.g., mentions, noun phrases, named entities) of the same real world object. It is a crucial task for many applications including knowledge base construction, information extraction, and question answering. For decades, ER has been studied in both database and natural language processing communities to link database records or to perform entity resolution over extracted mentions (noun phrases) in text. ER is a notoriously difficult and expensive task. Traditionally, entities are resolved using strict pairwise similarity, which usually leads to inconsistencies and low accuracy due to localized, myopic decisions [98]. More recently, collective entity resolution methods have achieved state-of-the-art accuracy because they leverage relational information in the data to determine resolution jointly rather than independently [10]. However, it is expensive to run collective ER based on probabilistic graphical models (GMs), especially for cross-document entity resolution, where ER must be performed over millions of mentions. In many previous approaches, collective ER is performed exhaustively over all the mentions in a data set, returning all entities. Researchers have developed new methods to perform large-scale cross-document entity resolution over parallel frameworks [86, 98]. However, in many ER applications, users are only interested in one or a small subset of entities. This key observation motivates query-driven ER, an alternative approach to solving the scalability problem for ER.

Compared to previous ER models and algorithms, the query-driven techniques in this chapter scale to data sets that are in many cases three orders of magnitude larger. Moreover, the ER model in this chapter is general enough to take both bibliographic records and mentions extracted from unstructured text. Query-driven ER techniques over GMs can also be generalized to perform query-driven inference for other applications. This work follows a line of research on implementing ML models inside of databases [44, 58, 97]. Researchers use factor graphs because this flexible representation works well with other machine learning algorithms. ER is ubiquitous and an important part of many analytic pipelines; a probabilistic database implementation is natural. In this chapter, we first introduce SQL-like queries that involve ER operations. The ER operator is an SQL comparison operator (i.e., ER-based equality) that returns true if two mentions map to the same entity. Factor graphs, a type of GM, are used to model collective entity resolution over mentions extracted from text. Using this ER-based comparison operator, users can pose selection queries to find all mentions that map to a single entity, or pose join queries to find mentions that map to the subset of entities they are interested in resolving. Because exhaustive ER is expensive, it is common to use blocking techniques to partition the data set into approximately similar groups called canopies. Query-driven ER in this chapter differs from blocking in two important ways: 1) deterministic blocks are replaced by a pairwise distance-based metric, and 2) blocks (or canopies) are implicit in the query-driven ER data set and do not have to be created in advance. The latter point, implicit blocking, is realized using a data structure created based on similarity to a query mention. This data structure allows parameters to include or remove mentions from the working data set. This property is similar to the iterative blocking technique [96], which is shown to improve ER accuracy. Such an approach can dramatically amortize the overall ER cost, which is suitable for the pay-as-you-go paradigm in dataspaces [62].

To support ER driven by queries, we develop three sampling algorithms for MCMC inference over graphical models. More specifically, instead of a uniform sampling distribution, we sample from a distribution that is biased toward the query. We develop a query-driven sampling technique that maximizes the resolution of the target query entity (target-fixed) and one that biases the samples based on a pairwise similarity metric between mentions and query nodes (query-proportional). We also introduce a hybrid method that performs query-proportional sampling over a fixed target. We develop two optimizations to the query-proportional and hybrid methods to model the similarity and dissimilarity between the mentions and the query entity, i.e., attract and repel scores. In the first, target-fixed, algorithm, we adapt the samples to resolve the query entity. The second, query-proportional, algorithm selects mentions based on their probabilistic similarity to the query entity. The third, hybrid, algorithm combines the two approaches. A summary of the approaches can be found in Table 3-3. When a user is interested in resolving more than one entity we employ multi-node ER techniques. To implement multi-node ER queries, single-node ER techniques may be naively performed iteratively to resolve one entity at a time. However, such an algorithm can lead to poorly allocated resources if the same number of samples is generated for each target entity, or low throughput if one of the entities has a disproportionately low convergence rate. To alleviate this problem, we present three multi-query ER algorithms that schedule the sample generation among query nodes in order to improve the overall convergence rate. In summary, the contributions of this chapter are the following:

• We define a query-driven ER problem for cross-document, collective ER over text extracted from unstructured data sets;

• We develop three single-node algorithms that perform focused sampling and reduce convergence time by orders-of-magnitude compared to a non-query-driven baseline (Section 3.4). We develop two influence functions that use attract and repel techniques to grow or shrink query entities (Section 3.5.1);

• We develop scheduling algorithms to optimize the overall convergence rate of multi-query ER (Section 3.5.2). The best scheduling algorithm is based on the selectivity of the different target entities (Section 3.5.3).

The results show that query-driven ER algorithms are a promising method for enabling realtime, ad hoc, ER-based queries over large data sets. Single-node queries of different selectivity converge to a high-quality entity within 1-2 minutes over a newswire data set containing 71 million mentions. Experiments also show that such real-time ER query answering allows users to iteratively refine ER queries by adding context to achieve better accuracy (Section 3.6).
3.2 Query-Driven Entity Resolution Preliminaries

In this section we present a foundation for the concepts discussed in this chapter. We start with an introduction to factor graphs, then discuss sampling techniques over this model. Finally, we formally introduce state-of-the-art entity resolution approaches and explain their origins.
3.2.1 Factor Graphs

Graphical models are a formalism for specifying complex probability distributions over many interdependent random variables. Factor graphs are bipartite graphical models that can capture arbitrary relationships between random variables through the use of factors [53]. As depicted in Figure 3-1, links always connect random variables (represented as circles) and factor nodes (represented as black squares). Factors are functions that take as input the current setting of the connected random variables and output a positive real-valued scalar indicating the compatibility of the random variable settings. The probability of a setting of all the random variables is a normalized product of all the factors. Intuitively, the highest-probability settings have variable assignments that yield the highest factor scores. We use factor graphs to represent complex entity resolution relationships. Nodes (random variables) may correspond to mentions of people, places, and organizations in documents. Nodes also represent the random variables that correspond to groups of mentions (entities); these nodes are accompanied by clouds in Figure 3-1.

Figure 3-1. Three node factor graph. Circles (random variables) with mi represent mentions and those with ei represent entities. Clouds are added for visual emphasis of entity clusters

The factors between mentions and entities give us a sound representation for many possible states. The factor graph model also gives us a simple mathematical expression of the relationship.

Formally, a factor graph $G = \langle \mathbf{x}, \psi \rangle$ contains a set of random variables $\mathbf{x} = \{x_i\}_{i=1}^{n}$ and factors $\psi = \{\psi_i\}_{i=1}^{m}$. Each factor $\psi_i$ maps the subset of variables it is associated with to a non-negative compatibility value. The probability of a setting $\omega$ among the set of all possible settings $\Omega$ occurring in the factor graph is given by a probability measure:

$$\pi(\omega) = \frac{1}{Z} \prod_{i=1}^{m} \psi_i(x^i), \qquad Z = \sum_{\omega \in \Omega} \prod_{i=1}^{m} \psi_i(x^i)$$

where $x^i$ is the set of random variables that neighbor the factor $\psi_i(\cdot)$ and $Z$ is the normalizing constant. Querying graphical models produces the most likely setting for the random variables.

A query on a factor graph is defined as a triple $\langle x_q, x_l, x_e \rangle$ where $x_q$ is the set of nodes in question, $x_l$ is a set of latent nodes (entities) that are marginalized, and $x_e$ is a set of evidence nodes (observed mentions). A query task is a sum over all latent variables and a maximization of the query probability. A query over the factor graph is defined as

$$Q(x_q, x_l, x_e, \pi) = \operatorname{argmax}_{x_q} \sum_{v_l \in x_l} \pi(x_q \cup v_l \cup x_e).$$

To obtain the best setting of the nodes in question, inference is required. We refer the reader to our previous work for a detailed discussion of inference over factor graphs and a derivation of the technique [100].
3.2.2 Inference over Factor Graphs

Several methods exist for performing inference over factor graphs. The entity resolution factor graph, being pairwise, is dense and highly connected. This property suggests that the best methods for inference are Markov chain Monte Carlo (MCMC) methods; in particular, we use a Metropolis-Hastings variant [53]. The idea of MCMC-MH is to propose modifications to the current setting and use the model to decide whether to accept or reject the proposed setting as a replacement for the current setting. When the model is scored, only the factors touching nodes with changed values (the Markov blanket) need to be recomputed. We accept or reject changes so that the model can iteratively proceed toward an optimal setting. More formally, consider an MCMC transition function $T: \Omega \times \Omega \rightarrow [0, 1]$ where, given the current setting $\omega$, we can sample a subsequent setting $\omega'$. The probability of accepting a transition given a graphical model distribution $\pi$ is:

$$A(\omega, \omega') = \min\left(1, \frac{\pi(\omega')\, T(\omega, \omega')}{\pi(\omega)\, T(\omega', \omega)}\right). \qquad (3\text{-}1)$$

Additionally, the intractable partition function $Z$ cancels out, making sample generation inexpensive. This property allows us to calculate the probability of accepting the next state by simply computing the difference in score between the next and current state [100].
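As a concrete reading of Equation 3-1, the following small Python fragment computes the acceptance decision for a symmetric proposal, assuming model scores are unnormalized log-probabilities so that only the score difference matters; this is a sketch, not the system's implementation.

import math, random

def accept(current_score, proposed_score):
    """Metropolis-Hastings acceptance for a symmetric proposal.

    Scores are unnormalized log-probabilities, so the partition function Z
    cancels and the acceptance probability depends only on the difference."""
    a = min(1.0, math.exp(proposed_score - current_score))
    return random.random() < a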

We say the algorithm converges when a steady state is reached.¹ Intelligently sampling next states decreases the time to convergence. Convergence in MCMC is difficult to verify [25]; we discuss convergence estimation in Section 3.6.1.
3.2.3 Cross-Document Entity Resolution

Cross-document ER is the problem of clustering mentions that appear across independent sets of documents into groups of mentions that correspond to the same real-world entity. These ER tasks typically assume a set of preprocessed documents and perform linking across documents [7, 86]. The scale of the cross-document ER problem is typically several orders of magnitude larger than that of intra-document ER. There are no document boundaries to limit inference scope, and entity mentions may be distributed arbitrarily across millions of documents.

To model cross-document ER, let $M = \{m_1, \ldots, m_{|M|}\}$ be the set of mentions in a data set. Each mention $m_i$ contains a set of attribute-value data points. Let $E = \{e_1, \ldots, e_{|M|}\}$ represent the set of entities, where each $e_i$ contains zero or more mentions. Note, we assume the maximum number of entities is no more than the number of mentions and no less than 1: each mention may correspond to a unique entity, or all mentions may correspond to a single entity. The baseline method of entity resolution is a straightforward application of the MCMC-MH algorithm. We show pseudocode for the baseline method in Algorithm 10. Algorithm 10 takes as input a set of entities E and samples, which is the number of iterations of the algorithm or a function to estimate convergence. The algorithm samples two entities from the entity set and moves one random² mention into the other entity. After the move, the algorithm checks for an improvement in the overall score of the model.

1 We refer to literature for a more detailed description of convergence [100].

2 Given a set X, the function x ∼u X makes a uniform sample from the set X into a variable x.

Algorithm 10 The baseline entity resolution algorithm using Metropolis-Hastings sampling
Input: A set of unresolved entities E, each with one mention m. A positive integer samples.
Output: A set of resolved entities E.
1: while samples-- > 0 do
2:   ei ∼u E
3:   ej ∼u E
4:   m ∼u ei
5:   E′ ← move(E, m, ej)
6:   if score(E) < score(E′) then
7:     E ← E′
8:   end if
9: end while
return E
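For illustration, a minimal Python sketch of the greedy proposal loop in Algorithm 10 follows; the score function here is a toy stand-in for the factor-graph score, and the actual system operates over factor graphs rather than plain Python lists.

import random

def score(entities):
    # Toy stand-in: a real implementation sums factor weights over the
    # edges inside each entity cluster.
    return sum(len(e) * (len(e) - 1) for e in entities if e)

def baseline_er(mentions, samples):
    """Propose a random mention move; keep it only if the model score improves."""
    entities = [[m] for m in mentions]          # one singleton entity per mention
    for _ in range(samples):
        src, dst = random.sample(range(len(entities)), 2)
        if not entities[src]:
            continue
        m = random.choice(entities[src])
        proposal = [list(e) for e in entities]  # copy the current setting
        proposal[src].remove(m)
        proposal[dst].append(m)
        if score(proposal) > score(entities):   # accept only improving moves
            entities = proposal
    return [e for e in entities if e]

print(baseline_er(["NY Giants", "Yankees", "The Yanks"], samples=100))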

If the model score improves, the changes are kept; otherwise the proposed changes are ignored. The score function sums the weights of all the edges in the given entity to obtain a value for the model. This is equivalent to the probability of the setting $\pi(\cdot)$ as described in Section 3.2.1.

Blocking. Blocking, or canopy generation, is a preprocessing technique to partition large amounts of data into smaller chunks, or blocks, of items that are likely to be matches [64]. Blocking can use simple and fast techniques such as sorting based on attributes, or more advanced techniques that map similar items onto a vector space [28, 84, 96]. In this chapter, we use two methods of blocking. First, we use an approximate string match over all the mentions in the database. To perform the approximate string filter we use a q-grams technique over all the mentions in the database. This method creates an inverted index over the mentions in the database so a query can be performed to look for all words that contain a sufficient number of matching q-grams. This gives us a fast, high-recall filter over many records [42]. The second is an implicit blocking structure created by computing the influence a query node has on the other nodes in the data set (see Section 3.5.1). This method uses an estimate of the distance between the query nodes and the candidate mentions to prioritize samples.
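As a rough illustration of the q-gram filter described above, the following Python sketch builds the inverted index and returns high-recall candidate matches; the value q = 3, the overlap threshold, and the example mentions are illustrative, not the system's actual settings.

from collections import defaultdict

def qgrams(s, q=3):
    s = "#" * (q - 1) + s.lower() + "#" * (q - 1)   # pad so short strings still yield grams
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def build_index(mentions, q=3):
    # Inverted index: q-gram -> ids of mentions containing it
    index = defaultdict(set)
    for mid, text in mentions.items():
        for g in qgrams(text, q):
            index[g].add(mid)
    return index

def candidates(query, index, mentions, q=3, min_overlap=0.4):
    # High-recall filter: keep mentions sharing enough q-grams with the query
    qset = qgrams(query, q)
    counts = defaultdict(int)
    for g in qset:
        for mid in index.get(g, ()):
            counts[mid] += 1
    return [mid for mid, c in counts.items()
            if c / min(len(qset), len(qgrams(mentions[mid], q))) >= min_overlap]

mentions = {1: "New York Yankees", 2: "NY Giants", 3: "Yankees", 4: "Brooklyn Dodgers"}
print(candidates("New York Yankees", build_index(mentions), mentions))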

3.3 Query-Driven Entity Resolution Problem Statement

In this section, we formally define the problem of query-driven ER. We use an SQL-like formalism to model traditional and query-driven entity resolution. In a probabilistic database, let a Mentions table contain all the extracted mentions from a text corpus. Its column entityp represents the probabilistic latent entity labels; they contain a mapping, but that mapping may not represent the current state. The People table holds a watchlist of mentions and relevant contextual information. The context column is an abstract placeholder for text data or richer schemas. This model only assumes there is a master column; the realization of the context column is flexible and implementation dependent.

Mentions(docID, startpos, mention, entityp, context)
People(peopleID, mention, entityp, context)

We also define a user-defined function coref_map that performs maximum a posteriori (MAP) inference on the latent entityp random variables. The function takes two instances of mentions, with at least one being from a probabilistic table such as the Mentions table. When the query is executed, the coref_map function returns true if the referenced mentions are coreferent. In the following, we describe the traditional exhaustive ER task as well as the single- and multi-node query-driven ER queries.

Exhaustive. The goal of traditional entity resolution is to cluster all mentions in a data set. All the mentions clustered inside each entity are coreferent with each other and not coreferent with any mention that is part of a different entity cluster. The process of exhaustive ER can be modeled as a self-join database query where each mention is grouped into coreferent clusters. In Algorithm 11 we create a view displaying the results of a resolved query.

Algorithm 11 Example exhaustive entity resolution query that creates a database view
CREATE VIEW CorefView AS
SELECT m.docID, m.startpos, m.mention, m2.mention
FROM Mentions m, Mentions m2
WHERE coref_map(m.*, m.entityp, m2.mention, m2.context)

To obtain unique entity clusters, we can perform an aggregation query over the CorefView. In Figure 3-3 we see an example of the result of traditional entity resolution.

Single-node Query. In the ER task, we may only be interested in the mentions of one entity. We represent this entity with a template mention, or query node, q. Single-node entity resolution is modeled as a selection query with a where-clause that includes the template mention q and returns only the mentions that are members of the entity cluster containing the template mention. Given a template mention q and its context q.context, in Algorithm 12 we show the single-node query based on an example in Section 3.4.2.

Algorithm 12 Single query-node driven entity resolution query
SELECT m.docID, m.startpos, m.mention
FROM Mentions m
WHERE coref_map(m.*, m.entityp, q, q.context)

Here we add parameters to the coref_map function that contain the specific query and its context. It performs ER over the Mentions table but only returns an affirmative value if the labels for the entity cluster match the query node. For example, if the template mention q were 'Mark Zuckerberg' and the query context were keywords such as 'facebook' and 'ceo', the only returned mentions would be those that represent Mark Zuckerberg the Facebook founder. This is similar to a 'facebook' approximate string search. The emphasis of this chapter is optimizing this function so that while performing ER we perform less work compared to an exhaustive query.

Figure 3-2. A possible initialization for entity resolution

Multi-Query. In many cases, a user may be interested in a watchlist of entities. A watchlist is a subset of the larger mention set. This is common for companies looking for mentions of their products in a data set. In this case, mentions are only clustered with the entities represented in the watchlist. Algorithm 13 is an example of a join query between the Mentions table and the People table.

Algorithm 13 Multi-query between the People watchlist table and the full mention set
SELECT m.docID, m.startpos, m.mention, q
FROM Mentions m, People q
WHERE coref_map(m.*, m.entityp, q, q.context)

This function combines a watchlist of terms and performs ER with respect to the specific examples in the watchlist. The multi-query method uses scheduling to perform inference, or a fuzzy equality check, over each mention. In Section 3.4.3 we propose scheduling algorithms so that multi-query-node ER gracefully manages multi-query workloads.
3.4 Query-Driven Entity Resolution Algorithms

Query-driven ER is an understudied problem; in this section we describe our approach to query-driven ER with one entity (single-query ER) and with multiple entities (multi-query ER). First, we give a graphical intuition of query-driven ER algorithms.

Figure 3-3. The correct entity resolution for all mentions

Figure 3-4. The entity containing q is internally coreferent; the other entities are not correctly resolved

3.4.1 Intuition of Query-Driven ER

In this section, we remind the reader of the query-driven ER task with a formal definition. Each ER task is given a corpus $G$ and a set of entity mentions $M = \{m_1, \ldots, m_{|M|}\}$ extracted from $G$. A user may supply a set of query nodes $Q = \{q_1, \ldots, q_{|Q|}\}$. Each $q_i$, also called a query template, may be a member of $M$ or a manually declared mention that is appended to the set of mentions. For each node $q_i \in Q$, the task of ER is to compute the set of entities $E = \{e_1, \ldots, e_{|Q|}\}$ that only contain mentions that are coreferent with the query node,

$$e_{q_i} = \{m_i \mid m_i \in M,\ \mathrm{QDER}(M, m_i, q_i)\}.$$

In Section 3.4.2, we describe implementations of the QDER algorithm for $|Q| = 1$. In Section 3.4.3, we describe techniques for scheduling the ER task in the general case of $|Q| > 1$. Fundamentally, the ER algorithm generates a graphical model and makes new state proposals (jumps) to reach the best state (see Section 3.2). The query-driven algorithms in this section use a query node to facilitate more sophisticated jumps. By making smart proposals we expect faster convergence to an accurate state. As a note to the reader, a summary of the query-driven algorithms can be found in Table 3-3. Figures 3-2 through 3-4 show an initial configuration and acceptable query-driven entity resolution solutions. An example initial state of the algorithm is shown in Figure 3-2: each mention is initially assigned to a separate entity. Alternatively, the model may be initialized randomly, in an arrangement from a previous entity resolution output, or with all mentions in one entity. Figure 3-3 is the full resolution for the data set; each mention is correctly assigned to its entity cluster. Figure 3-4 is a result that was resolved with query-driven methods and is a partially resolved data set. Because the entity containing the query node is completely resolved, the solution is acceptable.
3.4.2 Single-Node ER

Single-node ER algorithms are the class of algorithms that resolve a single query node, as discussed in Section 3.3. In particular, the target-fixed ER algorithm aims to focus a majority of the proposals on resolving the query entity. The algorithm fixes the query node as the target entity and then randomly selects a source node to merge into the entity of the target query node. This focus on building the query entity in this type of importance sampling means the query entity should be resolved faster than if we sampled each entity uniformly. A query-driven ER algorithm that only selects the query node as the target entity during sampling will create errors because such an algorithm is unable to remove erroneous mentions from the query entity. To prevent these errors, we allow the algorithm

to occasionally back out of poor decisions; that is, it makes non-query-specific samples. Shown in Algorithm 14, target-fixed entity resolution adapts Algorithm 10, but it allows parameters to specify the proportion of time each sampling method is selected. In addition to the input mentions E from Algorithm 10, target-fixed entity resolution takes as input a query node q. The output of the algorithm is a resolved query entity and other partially resolved entities. For each sampling iteration the algorithm can make two decisions. The sampler may propose to merge a random source node that is not already a member of the query entity into the target query entity. Alternatively, the algorithm merges a random node with a random entity.

Algorithm 14 Target-fixed entity resolution algorithm
Input: A query node q. A set of entities E each with one mention m. A positive integer samples.
Output: A set of resolved entities E′.
1: E′ ← E ∪ q
2: while samples-- > 0 do
3:   if random() < τα then
4:     ei ∼u E′
5:     ej ← q.entity
6:     m ∼u ei
7:   else
8:     ej ← {e | ∃e, e ∈ E′, e ≠ q.entity}
9:     ei ← {e | ∃e, e ∈ E′, e ≠ ej}
10:    m ∼u ei
11:  end if
12:  E″ ← move(E′, m, ej)
13:  if score(E′) < score(E″) then
14:    E′ ← E″
15:  end if
16: end while
return E′

On lines 3 to 6 the algorithm takes a uniform sample from the list of entities. If the sampled entity is the same as the query entity it tries again and samples a distinct

entity. A node is drawn from this entity. The probability of this block being entered is τα.

Table 3-1. Mention set M from a corpus
id   Mention            ...
m1   NY Giants          ...
m2   Bronx Bombers      ...
m3   New York Giants    ...
m4   Yankees            ...
m5   Brooklyn Dodgers   ...
m6   The Yanks          ...

Table 3-2. Example query node q
id   Mention            ...
q    New York Yankees   ...

Lines 7 to 10 are entered with probability (1 − τα). This block performs a random entity assignment in the same manner as Algorithm 10. This block offsets the aggressive nature of the target-fixed algorithm by probabilistically backing out of any bad merges. Finally, the block from line 12 to line 15 scores the new arrangement and accepts it if it improves the model score. We discuss parameter settings in Section 3.5.5.

Example. Take the synthetic mention set M shown in Table 3-1 and a query node q, the baseball team 'New York Yankees', in Table 3-2. This is the result of an approximate match of query q over a larger data set (blocking). The mentions of M may be initialized by assigning each mention to its own entity. After a successful run of traditional entity resolution the set of entity clusters is

{⟨q, m2, m4, m6⟩, ⟨m1, m3⟩, ⟨m5⟩}.

For the query-driven scenario the only entity we are interested in is ⟨q, m2, m4, m6⟩. Each mention in this query entity is an alias for the 'New York Yankees' baseball team. The other two entities represent the 'New York Giants' football team and the 'Brooklyn Dodgers' baseball team, respectively. The target-fixed algorithm attempts to merge nodes with the query entity one mention at a time, and a merge is accepted if it improves the score of the overall model.

We can see in the example that merging m1 or m3 into the query entity may improve the overall model score because they share keywords with the query entity, even though they refer to a different (football) team. The target-fixed algorithm can correct this type of error by probabilistically backing out of mistakes, moving mentions from the query entity to a new entity as shown in lines 7 to 10 of Algorithm 14.
3.4.3 Multi-query ER

A user may want to resolve more than one query entity; that is, she may be interested in resolving a watchlist of entities over the data set. To support multiple queries, we first merge the canopies of each query node in the watchlist to obtain a subset of the full graphical model containing only the nodes similar to the query nodes. To resolve the entities we can use query-proportional methods iteratively over each query node. We define two classes of schedules, namely, static and dynamic. Static schedules are formulated before sampling, while dynamic schedules are updated in response to estimated convergence. The two static schedules we develop are random and selectivity-based. In random scheduling each query node from the watchlist is selected in a round-robin style. Selectivity-based scheduling is a method of ordering multi-query samples to schedule proposals in proportion to the selectivity of the query node. Selectivity, in this case, is defined as the number of mentions retrieved using an approximate match over the data set, or the query node's contribution to the total new graphical model. For example, the selectivity of our query node q in Table 3-2 is simply the size of M, shown in Table 3-1. The random scheduling method performs well if all query nodes have similar selectivity. Otherwise, if the selectivities of the query nodes vary, one query node may require more sampling than the others. If one query node needs many samples to converge, it may take the whole process a long time to complete, and cycles may be wasted on other nodes that have already converged. In addition to scheduling samples in proportion to their selectivity, we can schedule samples dynamically, depending on the progress of each query entity. To perform dynamic

scheduling we need to know how each query entity is progressing towards convergence. To estimate the running convergence we do not use standard techniques from the literature because scheduling needs to occur before the model is close to convergence [25]. Instead, we estimate the convergence by measuring the fraction of accepted samples over the last N samples of each query in the watchlist. The two dynamic scheduling algorithms are closest-first and farthest-first. In closest-first we queue up the query node that has the lowest positive average number of accepted proposals over the last N proposals. This scheduling method performs inference for the node that is closest to being resolved so it can move on to other nodes. Alternatively, the farthest-first algorithm schedules the node that has the highest convergence rate. This scheduling algorithm makes each query entity progress evenly.
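A minimal Python sketch of the four schedules follows, assuming the acceptance rate over a recent window as the convergence estimate; the window size, the uniform draw used for the 'random' schedule, and the class and function names are illustrative rather than the system's actual implementation.

import random
from collections import deque

class QueryProgress:
    """Tracks accepted proposals over the last N samples of one query node."""
    def __init__(self, selectivity, window=100):
        self.selectivity = selectivity
        self.recent = deque(maxlen=window)

    def record(self, accepted):
        self.recent.append(1 if accepted else 0)

    def acceptance_rate(self):
        return sum(self.recent) / len(self.recent) if self.recent else 1.0

def next_query(progress, policy):
    """Pick the query node to sample next under one of the four schedules."""
    nodes = list(progress)
    if policy == "random":                  # uniform draw (the thesis uses round-robin)
        return random.choice(nodes)
    if policy == "selectivity":             # proportional to estimated selectivity
        weights = [progress[n].selectivity for n in nodes]
        return random.choices(nodes, weights=weights, k=1)[0]
    if policy == "closest-first":           # lowest acceptance rate: closest to resolved
        return min(nodes, key=lambda n: progress[n].acceptance_rate())
    if policy == "farthest-first":          # highest acceptance rate: still far from resolved
        return max(nodes, key=lambda n: progress[n].acceptance_rate())
    raise ValueError(policy)

progress = {"q1": QueryProgress(130), "q2": QueryProgress(11), "q3": QueryProgress(46)}
print(next_query(progress, "selectivity"))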

3.5 Optimization of Query-Driven ER

The previous ER techniques aggressively attempt to resolve the query entity. However, if the query node is not representative of the queried entity, target-fixed ER can lead to undesirable results. We do not explore this trade-off; we assume users can select representative query nodes. In this section, we introduce optimizations that create approximate query-driven samples based on the query node. We first discuss the influence function that is used to make query-driven proposals. We then discuss the attract and repel versions of the influence function, followed by two new algorithms. We end with implementation details and a summary of our query-driven algorithms.
3.5.1 Influence Function: Attract and Repel

To retrieve nodes from a graphical model that are similar to a query node we employ the notion of influence. Our assumption is that nodes that are similar have a high probability of being coreferent. An influence trail score between two nodes in a graphical model can be computed as the product of factors along their active trail, as defined in the literature [100]. For a node mi ∈ M and the query node q ∈ M, the influence of mi on the

query node is defined as:

$$I(m_i, q) = \sum_{j \in F} w_j\, \psi_j(m_i, q)$$

where $F$ is the set of pairwise features and the feature weight and log-linear function are, respectively, $w_j$ and $\psi_j$. The influence function I is an implementation of this trail score. The influence function takes a set of entities (or the equivalent GM) and a query node q as parameters. The parameters to an influence function can range over the whole database or over a canopy. Over several invocations of the function, I returns mentions from the graphical model with a frequency proportional to their influence on q. If a mention has little or no influence, the influence function acts as a blocking function, infrequently returning the mention. Recall that influence is the active-trail distance to the query node. To implement the influence function we build a data structure based on an algorithm by Vose [93], hereafter referred to as a Vose structure.
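For illustration, a minimal Python sketch of Vose's alias construction and its constant-time sampling follows; the influence scores in the example are made up, and the system builds the structure over a whole canopy rather than four mentions.

import random

class VoseAlias:
    """Vose's alias method: O(n) construction, O(1) sampling from a discrete distribution."""
    def __init__(self, weights):
        n = len(weights)
        total = float(sum(weights))
        scaled = [w * n / total for w in weights]
        self.prob = [0.0] * n
        self.alias = [0] * n
        small = [i for i, p in enumerate(scaled) if p < 1.0]
        large = [i for i, p in enumerate(scaled) if p >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            self.prob[s] = scaled[s]
            self.alias[s] = l
            scaled[l] = (scaled[l] + scaled[s]) - 1.0
            (small if scaled[l] < 1.0 else large).append(l)
        for i in small + large:          # leftovers are numerically ~1
            self.prob[i] = 1.0

    def sample(self):
        i = random.randrange(len(self.prob))
        return i if random.random() < self.prob[i] else self.alias[i]

# Illustrative influence scores of four candidate mentions w.r.t. a query node
influence = [0.7, 0.1, 0.15, 0.05]
table = VoseAlias(influence)
draws = [table.sample() for _ in range(10000)]
print([round(draws.count(i) / len(draws), 3) for i in range(len(influence))])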

The input mentions to the blocking algorithms may result in high- or low-quality canopies. A high-quality canopy means most of the mentions in the canopy are associated with the query node. Low-quality canopies, which are more common, correspond to only a small number of mentions being associated with the query node. When initializing query-driven algorithms the canopy quality is important for determining which algorithm to use. The attract method initializes each mention in the canopy in its own entity, and then mentions are merged until convergence. The target-fixed algorithm discussed in Section 3.4.2 is explained using this method. The attract method works well for low-quality canopies, or canopies that require a small number of items to merge. Conversely, the repel method works well with high-quality canopies, or when most items in a canopy belong to the query entity. The repel method initializes all mentions in the canopy into a single entity. Then proposals are made to remove mentions from the entity so that we are left with only the nodes in the query entity. We discuss this method with the hybrid algorithm in Section 3.5.3. To build an influence function for the repel method we can use the same construction; we only need to normalize and invert the influence scores. We refer to this as co-influence, or Ī.
3.5.2 Query-proportional ER

In the query-proportional sampling algorithm, on every iteration the source mention and target entity are selected in proportion to their distance to the query entity. Instead of focusing solely on the query entity, this algorithm prioritizes samples using a measure that represents the probability of a mention being coreferent with the query entity. That is, each node in the graphical model G is selected in proportion to the influence along the active trail between itself and the query node q. This algorithm merges nodes that are similar to the query node with an increased frequency. Before query-proportional sampling, a data structure for I is created. The I influence structure takes a query node q and the global graphical model E, then returns a sampled mention. As I is called multiple times, the distribution of the nodes returned is proportional to their influence. Algorithm 15 describes the query-proportional algorithm.

Algorithm 15 Query-proportional algorithm
Input: A query node q to drive computation. A set of entities E each with one mention m. A positive integer samples. A function I that samples nodes from the entities according to their influence on a mention.
Output: A set of resolved entities E′.
1: E′ ← E ∪ q
2: while samples-- > 0 do
3:   m1 ← I(E′, q)
4:   m2 ← I(E′, q)
5:   E″ ← move(E′, m1, m2.entity)
6:   if score(E′) < score(E″) then
7:     E′ ← E″
8:   end if
9: end while
return E′

For each iteration, the algorithm selects mentions using the influence function (lines 3 and 4). Then one mention, m1, is moved into the entity of m2. Mentions m1 and m2 have a higher probability of being coreferent, and therefore a higher probability of a merge occurring in the query entity, compared to the random selections of Algorithm 10. As a corollary, the influence-based sampling creates many small intermediate entities that are similar to the query entity; some of the mentions in these intermediate entities will later move to the query entity. This is a big advantage when performing entity-to-entity merges (as opposed to mention-to-entity merges). In this chapter, we do not investigate this extension to the algorithm.
3.5.3 Hybrid ER

The best of both the target-fixed and query-proportional algorithms can be combined to create a hybrid algorithm. Like the target-fixed algorithm, the hybrid method aggressively fixes the target as the query entity. The hybrid method also chooses its source node using the influence function in the same manner as the query-proportional algorithm. Algorithm 16 shows the hybrid algorithm using the repel method. With probability τα the algorithm chooses a mention using the repel method (Ī) and moves it to an entity that is not the query node. This is the opposite of merging a node into the query entity. Pseudocode is listed on lines 3 to 5.
3.5.4 Implementation Details

The previous algorithms describe single-process sampling over the set of mentions. The multi-query methods are modeled as several interwoven sequential single-node ER processes. In this section, we describe our implementation of the hybrid algorithm over a parallel database management system.

Algorithm 16 Hybrid-Repel algorithm
Input: A set of entities E, where one contains all the mentions m and the others are empty. A positive integer samples. A query node q. A function Ī that samples nodes from the entities according to their co-influence on a mention.
Output: A set of resolved entities E′.
1: E′ ← E ∪ q
2: while samples-- > 0 do
3:   if random() < τα then
4:     m ← Ī(E′, q)
5:     ei ← {e | ∃e, e ∈ E′, e ≠ q.entity}
6:   else
7:     ei ∼u E′
8:     ej ← {e | ∃e, e ∈ E′, e ≠ ei}
9:     m ∼u ej
10:  end if
11:  E″ ← move(E′, m, ei)
12:  if score(E′) < score(E″) then
13:    E′ ← E″
14:  end if
15: end while
return E′

An independent Vose structure (I, § 3.5.1) is created for each query node in the query set. The creation of the Vose structures for the query nodes is parallelized. When the number of query nodes increases, the Vose structures demand more memory from the system. Each Vose structure contains arrays of type double precision and unsigned int. The space for the structures is O(|Q| · |M|), where |Q| is the number of query nodes in the query and |M| is the number of mentions in the corpus. The Vose structure is accessed on every sample and needs to be in memory. To increase scalability, one could store full sets of precomputed samples and serialize the Vose structures to disk, but that is not explored here [48]. Sampling over the query nodes for each algorithm can also be performed in parallel. In our method, a thread selects a query node using a random schedule as described in Section 3.4.3. The system will use the Vose structure associated with the query node to set up a proposal move. The system attempts to obtain locks for both entities involved

in the proposal. If the system is unable to obtain a lock on either of the two entities, the system will back out and resample new entities. When the number of query nodes is small, the query-driven algorithms experience a lot of contention at the entities containing the query nodes. In these circumstances, the system will back out and either restart the proposal process or attempt a baseline proposal. This avoids waiting for locked entities and keeps the sampling process active. In Section 3.6.6 we demonstrate the parallel hybrid method over a large data set.
3.5.5 Algorithms Summary Discussion

Algorithms 14, 15 and 16 are modifications of proposal jumps found in the baseline Algorithm 10. Table 3-3 describes the proposal process for each algorithm by its preferred jump method.

Table 3-3. Summary of algorithms and their most common methods for proposal jumps
Algorithm            Source          Target
Baseline             random          random
Target-Fixed         random          fixed
Query-Proportional   proportional    proportional
Hybrid               proportional    fixed

The target-fixed algorithm builds the query entity by aggressively proposing random samples to merge into the query entity. The query-proportional algorithm uses an influence function to ensure its samples are mostly related to the query node. The hybrid algorithm mixes the aggressiveness of target-fixed with the intelligent source-node selection of the query-proportional method. After choosing the correct algorithm, a user needs a well-trained model with several features. An advantage of using query-proportional techniques, because so little sampling is required, is that we can interactively test query accuracy. We can also add context or keywords that were discovered from a previous run of the algorithm. This interactive querying workflow helps improve accuracy, which we experimentally verify in Section 3.6.

Parameter settings. The algorithm takes several parameters that affect performance. While not studied in this chapter, the parameter settings are robust to change, making parameter selection simple. The first is the number of proposals (samples). This number can be a function of the size of the data set. Each query node should have the opportunity to be merged into an entity more than once.

The value τα lies in [0.0, 1.0] and represents how often to perform the main type of sampling. This value should be set to a high value, 0.9, for accept algorithms. With probability 1 − τα the algorithms back off to random samples to improve mixing. This value is lowered to counter some of the aggression, particularly in Algorithm 14. The parallel experiments use τα = 1 and back out when there is contention among the threads. In statistics, a negative binomial function is used to model the number of trials it takes for an event to be a success. We can also use a negative binomial function as a decay function for the output of the influence function. We use this function because we want values that are most similar (lowest score) to be sampled more often. We set the r value, or number of failures for the negative binomial function, to 1. We set the p value, or the probability of each success, to a value close to 0.05. In the multi-query ER algorithms we run inference for K steps before we look to change the query entity. In our experiments we choose a K of 500, and an increasing value from two to 100 thousand in the parallel experiments.
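As a small worked example of the decay described above (a negative binomial with r = 1 is a geometric distribution, here with p = 0.05), the sketch below computes the relative weight assigned to candidates by similarity rank; the number of candidates and the normalization are only for illustration.

def decay_weights(num_candidates, p=0.05):
    """Negative binomial decay with r = 1 (geometric): the candidate at rank k
    gets weight p * (1 - p)**k, so the most similar candidates (rank 0, 1, ...)
    are proposed most often."""
    return [p * (1 - p) ** k for k in range(num_candidates)]

weights = decay_weights(10)
total = sum(weights)
print([round(w / total, 3) for w in weights])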

3.6 Query-Driven Entity Resolution Experiments

In this section, we describe the implementation details, the data sets and our experimental setup. Next, we discuss our hypotheses and four corresponding experiments. We then finish with a discussion of the results.

Implementation. We developed the algorithms described in Section 3.4 in Scala 2.9.1 using the Factorie package. Factorie is a toolkit for building imperatively defined factor graphs [65]. This framework allows a templated definition of the factor graph to avoid fully materializing the structure. The training algorithms are also developed

using Factorie. The algorithms for canopy building and approximate string matching are developed inside PostgreSQL 9.1 and Greenplum 4.1 using SQL, PL/pgSQL and PL/Python. Inference is performed in-memory on an Intel Core i7 processor with 8 cores at 3.2GHz and 12GB of RAM. The approximate string matching on Greenplum is performed on an AMD Opteron 6272 32-core machine with 64GB of RAM. The parallel experiments were developed entirely in a parallel database, DataPath [4]. DataPath is installed on a 48-core machine with 256GB of RAM.

Data sets. The experiments use three data sets. The first is the set of English newswire articles from the Gigaword Corpus, which we refer to as the NYT Corpus [39]. The second is a smaller but fully labeled Rexa data set.³ Because it is fully labeled it allows us to run the more detailed micro benchmarks. The NYT Corpus contains 1,655,279 articles and 29,866,129 paragraphs from the years 1994 to 2006. We extracted a total of 71,433,375 mentions using the Natural Language Toolkit named entity extraction parser [12]. Additionally, we compute general statistics about the corpus, including the term and document frequencies and tf-idf scores for all terms. We manually labeled mentions for each query over the NYT data set. The second data set, Rexa, is citation data from a publication search engine named Rexa. This data set contains 2454 citations and 9399 authors, of which 1972 are labeled. We perform experiments on the Rexa corpus because it is fully labeled, unlike the NYT Corpus. The Rexa corpus is smaller in total size but it has average-sized canopies. The third data set is the Wikilinks Corpus [87], the largest labeled corpus for entity resolution that we could find at the time of development. It contains 40 million mentions and 3 million entities that were extracted from the web and truthed based on web anchor links to Wikipedia pages. We loaded a million mentions onto DataPath to demonstrate the parallel capabilities.

3 http://cs.neiu.edu/~culotta/data/rexa.html

3.6.1 Experiment Setup

Table 3-4 lists the features and the weights for each feature.

Features. Features that look for similarity between mention nodes are called affinity features, and they are given positive weights. Features that look for dissimilarity between mention nodes are called repulsion factors, and they are given negative weights. We implement three classes of features: pairwise token features, pairwise context features and entity-wide features. Pairwise token features directly compare token strings on attributes such as equality or matching substrings. Context features compare the information surrounding the mention. We can look at the surrounding sentence, paragraph, document or user-specified keywords. The query nodes are extracted from text and contain a proper document context. With this context, we use a tf-idf weighted cosine similarity score to compare the context of each mention token. Finally, entity-wide features use all mentions inside an entity cluster to make a decision. An example entity-wide feature counts the matching mention strings between two entities.
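A minimal sketch of the tf-idf weighted cosine similarity used by the context features is shown below; the document frequencies and contexts are illustrative values, not statistics from the NYT Corpus.

import math
from collections import Counter

def tfidf_vector(tokens, df, num_docs):
    """tf-idf weights for one context (bag of tokens); df maps token -> document frequency."""
    tf = Counter(tokens)
    return {t: c * math.log(num_docs / (1 + df.get(t, 0))) for t, c in tf.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Illustrative document frequencies and contexts
df, num_docs = {"facebook": 120, "ceo": 400, "baseball": 300}, 100000
ctx_mention = "zuckerberg facebook ceo announced".split()
ctx_query = "facebook founder ceo".split()
print(cosine(tfidf_vector(ctx_mention, df, num_docs),
             tfidf_vector(ctx_query, df, num_docs)))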

Models. Features on the NYT and Wikilinks data sets were manually tuned, and the features for the Rexa data set were trained using SampleRank [101] with confidence-weighted updates. We manually tune some of the weights for the NYT Corpus to make up for the lack of complete training data. The models can be graphically represented as in Figure 3-4.

Evaluation metrics. Convergence of MCMC algorithms is difficult to measure, as described in a review by Cowles and Carlin [25]. We estimate convergence progress by calculating the f1 score of the query node's entity (f1q). We create this new measure because we are primarily concerned with the query entity. Other measures include B³ for entity resolution and several others for general MCMC models [7, 25].

The query-specific f1 score is the harmonic mean of the query-specific recall Rq and query-specific precision Pq. To accurately determine the Pq and Rq of each query in this experiment we label each correct query node. Query-specific precision is defined as

Table 3-4. Features used on the NYT Corpus. The first set of features are token-specific features, the middle set are between pairs of mentions, and the bottom set are entity-wide features.
Feature name                      Score+   Score-   Feature type
Equal mention strings             +20      -15      Token-specific
Equal first character             +5                Token-specific
Equal second character            +3                Token-specific
Equal second character            0
Unequal mention strings                    -15      Token-specific
Unequal first character           0
Unequal second character          0
Unequal second character          0
Equal substrings                  +30      -150     Token-specific
Unequal substrings                         -150     Token-specific
Equal string lengths              +10               Token-specific
Matching first term               +90      -3       Token-specific
No matching first term                     -3       Token-specific
Similarity ≥ 0.99                 +120              Pairwise
Similarity ≥ 0.90                 +105              Pairwise
Similarity ≥ 0.80                 +80               Pairwise
Similarity ≥ 0.70                 +55               Pairwise
Similarity ≥ 0.60                 +35               Pairwise
Similarity ≥ 0.50                 +15               Pairwise
Similarity ≥ 0.40                          -5       Pairwise
Similarity ≥ 0.30                          -50      Pairwise
Similarity ≥ 0.20                          -80      Pairwise
Similarity < 0.20                          -100     Pairwise
Matching terms                    +20               Pairwise
Token in context                  +1                Pairwise
No matching keyword               +700     -10      Pairwise
Matching keyword                  +700              Pairwise
Keyword in token                  +70               Pairwise
Extra token                                -500     Pairwise
Matching token in context         +10               Pairwise
Similar neighbor                  +100     -5       Entity-wide
No similar neighbor in entity              -5       Entity-wide
Matching document                 +350     -15      Entity-wide
No matching documents in entity            -15      Entity-wide

$P_q = \frac{|\{\mathrm{relevant}(M)\} \cap \{\mathrm{retrieved}(M)\}|}{|\{\mathrm{retrieved}(M)\}|}$ and query-specific recall $R_q = \frac{|\{\mathrm{relevant}(M)\} \cap \{\mathrm{retrieved}(M)\}|}{|\{\mathrm{relevant}(M)\}|}$. The f1 score for the query node's entity q is defined as:

$$f1_q = 2\,\frac{R_q P_q}{R_q + P_q}.$$
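For concreteness, the query-specific f1 can be computed as in the following small Python function; the mention ids in the example are illustrative.

def f1q(relevant, retrieved):
    """Query-specific precision, recall, and f1 for one query entity."""
    overlap = len(relevant & retrieved)
    p = overlap / len(retrieved) if retrieved else 0.0
    r = overlap / len(relevant) if relevant else 0.0
    return 2 * r * p / (r + p) if (r + p) else 0.0

# Ground-truth mentions of the query entity vs. the resolved cluster
print(f1q(relevant={"m2", "m4", "m6"}, retrieved={"m2", "m4", "m5"}))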

The f1q score is a good indicator of entity and answer quality. For multi-query experiments we calculate the average of the f1q scores over the query nodes. Each non-parallel algorithm's result is averaged over 3 to 10 runs.
3.6.2 Realtime Query-Driven ER Over NYT

In this experiment we show that query-driven entity resolution techniques allow us to obtain near realtime4 results on large data sets such as the NYT corpus.

Figure 3-5 shows the f1q score of the hybrid ER algorithm on three single-query ER queries. The graph shows performance over the first 50 proposals. For example, the 'Zuckerberg' query could be expressed as shown in Algorithm 17.

Algorithm 17 Example ER query over the entity 'zuckerberg'
SELECT *
FROM Mentions m
WHERE coref_map(m.*, m.entityp, 'zuckerberg', context)

Recall, a canopy is first generated using an approximate match over the mention set. We use the repel inference function and all the mentions are initialized in one large entity.

The 'Richard Hatch' and 'Carnegie Mellon' queries start at f1q scores of .92 and .97, respectively. The 'Zuckerberg' query starts above .65 and improves to an f1q score over .8. These experiments show the repel method removing mismatches from the query entity. The co-influence function is used to quickly identify the mentions that do not belong in the entity, and they are proposed for removal. When a hybrid move is proposed, a mention is moved from the large entity group to a new, possibly empty, entity.

4 We define realtime as contributing little or no time loss when this process is part of an external execution pipeline such as an information extraction pipeline.

Figure 3-5. Hybrid-repel performance for the first 50 samples for three queries. Each result is averaged over 6 runs

Table 3-5. The performance of the hybrid-repel ER algorithm for queries over the NYT corpus for the first 50 samples. Total time includes the time to build the Ī data structure and result output. The NYT Corpus contains over 71 million mentions, a large amount for entity resolution problems.
Query            Blocking   Mentions   Inference   Total time
Zuckerberg       24.4 s     103        2 s         37 s
Richard Hatch    28.3 s     226        18.5 s      59 s
Carnegie Mellon  25.9 s     1302       68 s        124 s

This method relies on good repulsion features and correct weights. In Table 3-5 we show the performance of the three queries. In addition to the query token we show four columns: blocking time in seconds, canopy size, inference time in seconds and the total compute time. Total time is the complete time taken by each run; this includes building the influence data structure and writing the results. The values in Table 3-5 show the fast performance of query-driven ER over a large database of mentions.
3.6.3 Single-query ER

In this experiment we show a performance comparison between the single-query algorithms summarized in Sections 3.4 and 3.5. We run the query-driven algorithms over queries with different selectivity levels and show the accuracy over time. Each algorithm uses the attract method, so each mention in the canopy starts in its own entity.

Figure 3-6. A comparison of single-query algorithms on a query with selectivity of 11

Figure 3-6 shows the run time of all four algorithms on the Rexa data set with the query 'Nemo Semret', an author with a selectivity of 11. The baseline entity resolution algorithm does not accept a correct proposal until about 500 seconds. The baseline algorithm takes a long time to accept the first proposal because it is randomly trying to insert mentions into an existing entity. Target-fixed immediately begins to make correct proposals. Hybrid and query-proportional have the best performance and resolve the entity almost instantaneously. The hybrid algorithm chooses the most likely nodes to merge into the query entity. As the first couple of proposals are correct merges, hybrid quickly converges to a high accuracy. Due to imperfect features, among the 10 averaged runs a few runs get stuck at a local optimum, causing suboptimal results. Figure 3-7 shows the run time of the four algorithms for the query node 'A. A. Lazar', with selectivity of 46. The baseline algorithm progresses the slowest. The hybrid algorithm quickly reaches a perfect f1q score. The query-proportional algorithm lags slightly behind the hybrid method but still reaches a perfect value. The target-fixed algorithm gradually increases to a perfect f1q score about 60 seconds after hybrid and query-proportional.

Figure 3-7. A comparison of single-query algorithms with a query node of selectivity 46

Figure 3-8. A comparison of selection-driven algorithms with a query node of selectivity 130

Figure 3-8 shows the run time of four algorithms with a query ‘Michael Jordan’ of selectivity 130. The baseline slowly increases over the first 100 seconds. The hybrid algorithm again quickly achieves a perfect f1q score, followed by query-proportional and then target-fixed. The time gap between the algorithms increases with selectivity; hybrid achieves the best performance. We look deeper at how selectivity affects the rate of convergence. In Figure 3-9 we show the time it takes for each algorithm to reach an f1q score of 0.95 over increasing

selectivity. We choose five query nodes of increasing selectivity but with the same canopy sizes.

Figure 3-9. The time until an f1q score of 0.95 for five queries of increasing selectivities; averaged over three runs

The hybrid algorithm runtime increased with selectivity, but only slightly steeper than constant. Target-fixed increased for the first three queries but never took more than 50 seconds. Query-proportional has only a slight increase in time to convergence for the first three queries. The two highest-selectivity queries are expensive for query-proportional and we observe an exponential increase in runtime. These results are consistent with the exponentially large increase in the number of random comparisons needed to find a match for a query entity. The query-proportional algorithm does not focus on the query entity as aggressively as the target-fixed and hybrid algorithms. Recall that the target-fixed and hybrid algorithms focus on moving correct nodes into the query entity. Query-proportional selects candidate nodes using the influence function but does not fix the target entity. With the target entity not fixed, the chance of selecting a correct node for the query entity decreases exponentially. This shows that the selectivity of nodes affects the runtime performance of each algorithm. When performing join-driven ER it is important to take the relative selectivity of nodes into account when choosing the best scheduling algorithm.

Figure 3-10. The progress of the hybrid algorithm across multiple query nodes using different scheduling algorithms. Each result is averaged over three runs

3.6.4 Multi-query ER

In this experiment we study the performance of our different scheduling algorithms for join-driven ER queries. We choose ten query nodes of different selectivity and run the join-query scheduling algorithms described in Section 3.4.3. Consider a table like the People table in Section 3.3 with selectivities {130, 63, 68, 7, 12, 12, 301, 11, 46}. The four algorithms, random, closest-first, farthest-first, and selectivity-based, are shown in Figure 3-10. The selectivity-based method outperforms the other three algorithms in terms of convergence rate. The jumps in accuracy on the graph correspond to the scheduling algorithms choosing new query nodes and accepting new proposals. It has a high jump when it starts sampling the seventh and highest-selectivity node. The farthest-first algorithm rises the slowest out of the scheduling algorithms because it tries to stop sampling the high performing query entity and makes proposals for the slowest growing one. The selectivity-based method performs well early because the high selectivity queries are sampled first. The high selectivity query makes up a large proportion of the total f1q score. The large jump in the random method is when it reaches the node with selectivity

301. Notice that closest-first reaches its peak f1q score the fastest because it tries to get the most out of every query node.
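As a rough illustration of how the four policies differ, the sketch below orders query nodes for the next round of sampling; estimate_selectivity and progress are hypothetical placeholders for the statistics the scheduler would track, and this is only the shape of the idea, not the implementation from Section 3.4.3.

import random

def schedule_queries(query_nodes, policy, estimate_selectivity, progress):
    """Return the order in which query nodes receive sampling effort.

    policy: 'random', 'closest-first', 'farthest-first', or 'selectivity-based'.
    estimate_selectivity(node) and progress(node) are hypothetical placeholders
    for the per-node statistics described in the text.
    """
    nodes = list(query_nodes)
    if policy == 'random':
        random.shuffle(nodes)
    elif policy == 'selectivity-based':
        # Highest selectivity first: these nodes dominate the aggregate f1q score.
        nodes.sort(key=estimate_selectivity, reverse=True)
    elif policy == 'closest-first':
        # Spend effort on the node closest to convergence.
        nodes.sort(key=progress, reverse=True)
    elif policy == 'farthest-first':
        # Spend effort on the slowest-growing node.
        nodes.sort(key=progress)
    return nodes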

Figure 3-11. The performance of the Zuckerberg query with different levels of context. Each result is averaged over 6 runs

3.6.5 Context Levels

In this experiment we aim to discover how different levels of context specified at query time can improve convergence time and overall accuracy. We take the Zuckerberg query and the hybrid-repel algorithm and run ER three times over three levels of context. Each mention in the graph contains a ‘paragraph’ level of context and we only alter the context of the query node. The ‘none’ context only activates token-specific features; any context features involving the query node are zeroed out. The ‘paragraph’ level context is the default context from the NYT corpus and the ‘document’ level context extends context to the entire news article. Additionally, we add specific keywords from Mark Zuckerberg’s DBpedia page to the ‘document’ and ‘paragraph’ context levels. We show the performance using the repel method in Figure 3-11. Adding specific keywords that activate the keyword features is the most effective method for increasing the accuracy of query-driven ER. Query-driven methods allow a user to observe the results and add or remove keywords for specific queries to improve the accuracy. This type of iterative improvement workflow is not feasible with batch methods.

Figure 3-12. Hybrid-attract algorithm with random queries run over the Wikilinks corpus. Each plot starts after the Vose structures are constructed

3.6.6 Parallel Hybrid ER

This experiment has two objectives: first, how does the hybrid algorithm perform with a canopy of 1 million mentions, and second, what is the effect of increasing the number of query nodes. In Figure 3-12 the hybrid algorithm is able to resolve entities in a short amount of time. The creation time of the Vose structure is roughly linear in the number of queries. The trend in the graph is that as the ratio of queries to entities increases, the performance benefit of the hybrid-attract method decreases. With more query nodes the construction time increases, and the benefits of the algorithm decrease until it is no better than the baseline method.

Experiment Summary. Each of the query-driven methods outperforms the baseline methods in terms of runtime while not losing accuracy. Across different data set sizes the hybrid algorithms have the most consistent performance. If a system has a quality blocking function then it is better to use the co-influence entity resolution method. With multiple query nodes, selectivity-based is the most consistently performing algorithm. More accurate estimation of MCMC convergence could allow the dynamic scheduling algorithms closest-first and farthest-first to achieve higher accuracy. Adding more contextual information to query nodes at query time yields higher

accuracy of the entity resolution algorithms. Parallel query-driven sampling is an effective way to get a speedup in an ER data set when the ratio of mentions to entities is low.

3.7 Query-Driven Entity Resolution Related Work

This chapter is related to work in several areas. In this section we describe a selection of the literature that we found most relevant to different parts of the Query-Driven ER task.

Entity Resolution. The state-of-the-art method for entity resolution employs collective classification. Instead of purely pairwise decisions, collective classification methods consider group relationships when making clustering determinations. In a recent tutorial [37], collective classification methods were grouped into three categories: non-probabilistic [10, 29, 49], probabilistic [18, 36, 59, 67, 74, 89] and hybrid approaches [3, 83]. A relevant challenge proposed for entity resolution research by the tutorial is how to efficiently perform entity resolution when a query is involved. This chapter seeks to address this issue. Entity resolution is generally an expensive, offline batch process. Bhattacharya and Getoor proposed a method for query-time entity resolution [11]. This method performs inference by starting with a query node and performing ‘expand and resolve’ to resolve entities through resolution of attributes and expansion of hyper-edges. Unfortunately, hyper-edges between records are not always explicit in data sets. This chapter does not assume the presence of any link in the corpus; each entity and mention is independently defined, which is the case for most applications. A recent paper by Altwaijry, Kalashnikov and Mehrotra [2] has a similar motivation of using SQL queries to drive entity resolution. That work focuses on using predicates in the query to drive computation while this work uses example queries to drive computation. Both techniques are complementary, and combining the two by updating the edge-picking policy described in their paper using our approach makes for an interesting method of optimizing the entity resolution process.

The term query-driven appears in this chapter and has appeared in others across the literature with different meanings [41]. Our definition of a query node is an example item, a mention, from a data set. A query in Altwaijry et al. [2] refers to the predicates in an SQL statement. Query-driven in Grant et al. [41] refers to the SQL queries used to drive analytics. It is becoming increasingly normal to work with data sets of extremely large size; in response, researchers have studied streaming and distributed processing. Rao, McNamee, and Dredze describe an approach for streaming entity resolution [78]. This approach is fast and approximates entries in an LRU queue of clustered entity chains. We apply these techniques to a static data set and do not yet handle streams of data. Singh, Subramanya, Pereira, and McCallum propose a technique for ER where entities are resolved in parallel blocks and then redistributed and resolved again in new blocks [86]. This parallel distribution method makes large-scale entity resolution tractable. In this chapter, we perform analysis on a similarly scaled data set but we show that great performance gains can be achieved when a query is specified.

Query specific sampling. Recently, several researchers have explored the idea of focusing the sampling of graphical models to speed up inference. Below we discuss three approaches that use sampling to speed up ER over graphical models. Query-Aware MCMC [100] found that when performing a query over a graphical model, the cost of not sampling a node is exactly the node's influence on the query node. This enables us to ignore some nodes that have low influence over the query node and incur a small amount of error. This influence score can be calculated as the mutual information between two nodes. The authors compare estimation techniques for the intractable mutual information score, which is called the influence trail score. Because ER has a fixed pairwise model, we can use the theory from this work and specialized data structures to gain performance when performing query-driven sampling. Type-based MCMC is a method of sampling groups of nodes with the same attribute to increase the progress towards convergence [61]. This approach works well

when feature sets can be tractably counted and grouped. If query nodes are introduced, it is not clear how one may focus type-based sampling. Other researchers have explored using belief propagation with queries to approximate marginals of factor graphs [20]. However, the entity resolution graph is cyclic and highly connected. MCMC scales with large real-world models better than loopy belief propagation [100].

3.8 Query-Driven Entity Resolution Summary

In this chapter, I propose new approaches for accelerating large-scale entity resolution in the common case that the user is interested in one entity or a watch list of entities. These techniques can be integrated into existing data processing pipelines or used as a tool for exploratory data analysis. We showed three single-query ER algorithms and three scheduling algorithms for multi-query ER and showed experimentally that their runtime performance is several orders of magnitude better than the baseline.

CHAPTER 4
A PROPOSAL OPTIMIZER FOR SAMPLING-BASED ENTITY RESOLUTION

4.1 Introduction to the Proposal Optimizer

Recently, an increasing number of organizations are tracking information across social media and the web. To this end, the National Institute of Standards and Technology hosted a three-year track to accelerate the extraction of information and construction of knowledge bases from streaming web resources [34]. This international contest highlighted the many difficulties of collecting unstructured data across the web. Across the efforts in this contest, we identify entity resolution as a major barrier to progress. Entity resolution across text corpora is the task of identifying mentions within the documents that correspond to the same real-world entities. To construct knowledge bases or extract accurate information, entity resolution (ER) is a required step. This task is a notoriously computationally difficult problem. Using Markov chain Monte Carlo (MCMC) techniques exchanges raw performance for a flexible representation and guaranteed convergence [66, 86, 99]. Processing streaming textual documents exacerbates two of the core difficulties of ER. The first difficulty is the computation over large entities, and the second is the excessive computation spent resolving unambiguous entities. Over time, the growing size of large entities makes keeping up with the incoming documents untenable. Optimization that touches these critical portions is wholly understudied. In this chapter, we argue that compression and approximation techniques can efficiently decrease the runtime of traditional ER systems, thus making them usable in a streaming environment. In sampling-based entity resolution, entities are represented as clusters of mentions. A proposal is made to move a random mention from a source entity to a random destination entity. The proposed state is scored and, if it improves the global state, the new state is accepted. If the proposal does not improve the global state, the proposal may still be accepted with some small probability. This process is repeated until the state converges.
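A minimal sketch of one such proposal step, assuming entities is a list of mention sets and model_score is a stand-in for the factor-graph scoring function, looks roughly as follows; it illustrates the accept/reject behavior described above rather than any particular system's code.

import math
import random

def mh_step(entities, model_score, temperature=1.0):
    """One sampling-based ER proposal: move a random mention between entities."""
    src, dst = random.sample(range(len(entities)), 2)
    if not entities[src]:
        return False
    mention = random.choice(list(entities[src]))

    before = model_score(entities)
    entities[src].remove(mention)
    entities[dst].add(mention)
    after = model_score(entities)

    # Accept improvements; accept worse states with a small probability.
    if after >= before or random.random() < math.exp((after - before) / temperature):
        return True
    entities[dst].remove(mention)      # revert the proposal
    entities[src].add(mention)
    return False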

Scoring the state of an entity cluster, through pairwise feature computation of the cluster mentions, is O(n²). For entity clusters larger than 1000 mentions, calculating the score for each proposal can become prohibitively expensive. Wick et al. present an entity resolution technique that uses a tree structure to organize related entities to reduce the amount of work performed in each step [99]. During each proposal, this approach avoids the pairwise comparison by restricting model calculation to the top nodes of the hierarchy. This approach can avoid massive amounts of computation by organizing the known sets of mentions. This discriminative tree structure is a type of compression. Singh et al. present a method of efficiently sampling factors to reduce the amount of work performed when computing features [88]. They observe that many factors are redundant and do not need to be computed when calculating the feature score. They use statistical techniques to estimate the feature scores with a user-specified confidence. This approach can be categorized as early stopping for feature computation. There is no one-size-fits-all sampling algorithm [82]; each of these methods, compression and early stopping, has drawbacks. Compression may slow down insertion speed and requires extra bookkeeping to organize the data structure. Early stopping is not always precise, and adding extra conditionals in the Metropolis-Hastings loop structure slows computation. Applying each technique at appropriate times can remove pain points and accelerate the entity resolution process. In this chapter, we discuss initial work towards the design of an optimizer that modifies the sampling-based collective entity resolution process to improve sampling performance. Static parameters for evaluating entity resolution rarely hold for the lifetime of a streaming processing task. The optimizer, in the spirit of the eddy database query optimizer [5], dynamically examines the current state of each proposal and suggests methods for evaluating proposals and structuring entities. We train a classifier to decide when the sampling process should use early stopping. Additionally, we use training data

to decide the best time for a particular entity to be compressed. This is done with negligible bookkeeping.

Figure 4-1. The high-level interaction of the optimizer. As streaming data updates pass to the machine learning model, the optimizer recommends the best algorithms to update the model. Entity resolution is an example of a model that needs to be frequently updated with new data.

We make the following contributions:

• We identify several techniques to speed up sampling past a natural baseline.

• We create rules and techniques for an optimizer to choose parameters and methods at run time.

• We empirically evaluate these methods over a large data set.

We recognize that optimizers can also apply to many different long-running machine learning pipelines. Figure 4-1 depicts the optimizer supervising the machine learning model. The optimizer determines the methods for processing the streaming updates of the model. As future work, we plan to create a full optimizer to study performance improvements on long-running machine learning tasks. The outline of this chapter is as follows. In Section 4.2, we give an introduction to factor graph models and entity resolution. In Section 4.3, we further discuss the statistics that an optimizer for entity resolution can use. In Sections 4.4 and 4.5, we discuss the implementation of the optimizer. Finally, in Section 4.6, we examine the benefits by

testing early stopping and compression over a synthetic and a popular real-world entity resolution data set.

4.2 Proposal Optimizer Background

Factor graphs are a pairwise formalism for expressing arbitrarily complex relationships between random variables [53]. A factor graph F = ⟨x, ψ⟩ contains a set of random variables x = {x_i}_{i=1}^{n} and factors ψ = {ψ_i}_{i=1}^{m}. Random variables are connected to each other through factors. Factors are a mapping between one or more variables and a real-valued score. The probability of a setting ω among the set of all possible settings Ω occurring in a factor graph is given by a probability measure:

\pi(\omega) = \frac{1}{Z} \sum_{x \in \omega} \prod_{i=1}^{m} \psi_i(x^i), \qquad Z = \sum_{\omega \in \Omega} \sum_{x \in \omega} \prod_{i=1}^{m} \psi_i(x^i)

where x^i is the set of random variables that neighbor the factor ψ_i(·) and Z is the normalizing constant. Exact inference over complex factor graphs is computationally expensive because it involves computing the normalizing constant. Therefore, it is popular for researchers to use Markov chain Monte Carlo (MCMC) approximation techniques to estimate the probability of settings. In particular, for large and dense factor graphs MCMC Metropolis-Hastings (MH) has been shown to be a scalable technique for inference calculation [86]. Cross-document entity resolution, resolving entities across document borders, is usually several orders of magnitude larger than within-document entity resolution. In large text corpora, the size of entities follows a power law [87]. For example, Figure 4-2 shows a data set containing 40 million mentions and 3 million entities over 11 million web pages. As documents and mentions are incrementally streamed through, the scale problem becomes a critical issue. The mentions on disk can be represented as a large array of identifiers. Entities are a collection of mentions and can be represented as such. In the worst case there is an equal

number of entities and mentions. This means each mention is its own individual entity. In the other extreme, all the mentions may be part of the same entity. For streaming entity resolution, mentions within documents must be matched to the existing set of entities [78]. In this chapter, we assume the entity set is initialized by grouping the most similar mentions; new mentions are assigned to the closest match.

Figure 4-2. A distribution of entity sizes from the Wikilinks corpus [87] with an initial start and the truth

To compute the score at each step, the number of comparisons is proportional to the number of pairwise factors between mentions. The pairwise factors are weighted functions such as approximate string matches, token overlap, and n-gram matches. There are additional cluster-wide features calculated at each step. Such features include functions to check whether all mentions in a cluster share the same token. For clusters larger than 1000 mentions, calculating scores of the model becomes extremely expensive. Performing sophisticated techniques over smaller clusters also adds extra overhead. In this chapter, we examine the trade-off of selecting methods to accelerate the feature computation process.
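The quadratic cost of this scoring can be seen in a small sketch, where pair_score is a hypothetical stand-in for the weighted string-match, token-overlap, and n-gram factors:

def cluster_score(cluster, pair_score):
    """O(n^2) pairwise scoring of an entity cluster.

    cluster: list of mentions; pair_score(a, b): weighted sum of the
    pairwise factors (string match, token overlap, n-gram match, ...).
    """
    total = 0.0
    for i in range(len(cluster)):
        for j in range(i + 1, len(cluster)):
            total += pair_score(cluster[i], cluster[j])
    return total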

4.3 Accelerating Entity Resolution

In this section, we discuss accelerating MCMC-MH sampling for entity resolution. We then motivate how we believe gains can be achieved using compression, sampling acceleration methods, and optimizers. We use a large real-world corpus as a motivating example.

The two issues we are investigating are as follows: First, given a source entity, destination entity, and mention (e_s, e_d, m), which method can score the proposal in the least amount of time? Second, after the proposal is calculated, should we compress the entity structure? The optimizer will decide when to use each technique. The total size of all entities in the traditional representation is:

sizeof(E) = \sum_i \left( c + sizeof(int) \cdot |e_i| \right)    (4–1)

where sizeof is an abstract function to compute the size of the containing object, c is a class constant, and |e_i| is the number of mentions in the entity. There are many compression techniques, one being to only keep mentions that have a unique representation inside entities. That is, if any mention token is a duplicate, we remove it. This compressed total entity size is:

sizeof(E_{compressed}) = \sum_i \left( c + sizeof(int) \cdot \#e_i \right)    (4–2)

where #e_i is the cardinality of the mention tokens in entity e_i. We note that when #e_i ≪ |e_i|, it may be worth compressing the entity e_i. In Figure 4-2, 45% of entities are smaller than 100 mentions in size. Additionally, 82% of entities contain fewer than 1000 mentions. These numbers suggest that at times we can take advantage of the redundancy within large entities by compressing them. We investigate the Wikilinks corpus further in Section 4.6.1. In addition, Figure 4-2 shows that there is an order of magnitude difference between the sizes of initial entities and the true entity sizes. The entities were initialized by exact string match, a common initialization scheme. This difference gives us some intuition about the trends of the entity resolution process. Additionally, this suggests that there are several distinct representations of entities. During entity resolution the sizes of entities can be expected to grow by an order of magnitude while the total number of smaller entities

will decrease. We can use this property to track the growth and change of entity sizes over time to understand how to process a particular grouping of entities.

4.4 Proposal Optimizer Algorithms

In this section, we describe simple algorithms for entity sampling and simple entity compression. After introducing the compression and approximation techniques we discuss how an optimizer can be designed to improve the overall sampling time. The baseline method performs pairwise comparisons by iterating over the mentions using the order on disk. The mention ids are used to extract the contextual information of each mention from a database. This is the traditional method of computing the pairwise similarity of two clusters. This method results in simple code, so modern compilers are able to perform aggressive optimizations such as loop unrolling. The confidence-based scoring method uniformly samples mentions from the source and destination entity clusters during scoring. This method measures the confidence of the calculated pairwise samples and stops when the confidence of a score exceeds a threshold of 0.95. This is a simplified version of the uniform sampling method described by Singh et al. [88]. The code to collect statistics is shown in Algorithm 18. The add function shows how and what statistics are recorded when each new mention is added. Notice themax and themin are variables in the Stats class that store the current maximum and minimum. The current sum and running mean are also updated with each new value added. The current implementation assumes the values from the pairwise factors follow a Gaussian distribution; the model in Singh et al. makes the same assumption [88]. As entity sizes grow, we can expect to see many repeats of the same or very similar mentions. Reducing the entity size will shrink the effective memory footprint of entities. This is important for long-running collections of entities. Run-length encoding is the simplest method for compressing entities. This method compresses the near-duplicate mentions. A canonical mention is chosen for each set of exact duplicates and a counter map

Algorithm 18 Sample code from the Stats class showing how running statistics are recorded and how the variance can be computed

void Stats::add(long double x) {
    themax = MAX(themax, x);        // running maximum
    themin = MIN(themin, x);        // running minimum
    sum += x;
    ++n;
    // Welford's online update of the running mean and M2 (sum of squared deviations)
    auto delta = x - mean;
    mean += (delta / n);
    M2 = M2 + delta * (x - mean);
}

double Stats::variance(void) const {
    // sample variance from the running statistics; returns 0 for too few samples
    if (n > 2)
        return M2 / (n - 1);
    else
        return 0.0;
}
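A minimal sketch (in Python rather than the system's C++) of how such running statistics could drive early stopping: pairs are sampled uniformly, the mean and variance are updated as in Algorithm 18, and scoring stops once the estimated standard error of the mean falls below a tolerance. The 0.95-confidence rule from the text is approximated here by a simple standard-error threshold, and pair_score is a hypothetical placeholder.

import math
import random

def confidence_based_score(cluster, pair_score, tol=0.05, max_pairs=10000):
    """Estimate the mean pairwise score by sampling until the estimate is stable.

    cluster: list of mentions (at least two); pair_score(a, b): pairwise factor score.
    """
    n, mean, m2 = 0, 0.0, 0.0           # Welford running statistics
    while n < max_pairs:
        a, b = random.sample(cluster, 2)
        x = pair_score(a, b)
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 30:                       # enough samples for a rough normal approximation
            stderr = math.sqrt(m2 / (n - 1)) / math.sqrt(n)
            if stderr < tol:             # stop early once the estimate is stable
                break
    return mean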

Table 4-1. Techniques to improve the sampling process; each is classified by how it affects sampling

  Technique                  Compression   Early Stopping   Overhead
  Baseline                   No            No               None
  Confidence-based [88]      No            Yes              Medium
  Discriminative Tree [99]   Yes           No               Large
  Run-Length Encoding        Yes           No               Small

records the number of duplicates that are represented. The compression rates become large for mention clusters with many duplicates.
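A minimal sketch of this compression, assuming mentions are compared by their surface token:

from collections import Counter

def compress_entity(mentions):
    """Collapse exact-duplicate mention tokens into (canonical token, count) pairs."""
    counts = Counter(mentions)           # token -> number of duplicates
    return list(counts.items())

def decompress_entity(compressed):
    """Expand (token, count) pairs back into the full mention list."""
    return [token for token, count in compressed for _ in range(count)]

For example, compress_entity(['IBM', 'IBM', 'I.B.M.']) yields [('IBM', 2), ('I.B.M.', 1)], and decompress_entity recovers the original multiset.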

4.5 Optimizer

Before calculating the MCMC-MH proposal there are several decisions we can make that will affect the runtime and accuracy of the algorithm. At each step we may: (1) approximate the calculation of the entity states; (2) update an entity structure to a compressed format; (3) skip the calculation of the proposal and directly accept or reject. These decisions can be made by observing several features of a source entity,

destination entity and a source mention. We enumerate a small set of features that can yield information to help us decide how the entity structure should be changed. The decision to compress an entity takes four main points into consideration.

First, the time it takes to compress the entity (C_time). For example, if the time it takes to compress an entity is the same as the time it takes to reach an answer in the uncompressed format, then compression is superfluous. Second, it is important to consider the space saved in memory and the number of additional entities that do not have to be fetched from disk and can now fit in memory (C_space). Third, we need to know how active an entity has been (C_activity). That is, how many additions or subtractions this entity has seen over a long period of time. This information is helpful in understanding the likelihood this entity will be requested for another addition or subtraction. (Modifying entity clusters causes them to block.) Last, we retain the activity of an entity over a recent, short period of time (C_velocity). This information lets us know whether it is wise for this entity to take time out for compression while other mentions may be attempting an insertion or removal. At each proposal step the decision made should maximize the utility. The utility of the decision is a numeric score representing the gain from performing the proposal calculation. The utility value is a real number in the range (−∞, ∞). A formal model for utility is as follows:

U = C_time + C_space + C_activity + C_velocity
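A minimal sketch of how this utility might be evaluated, assuming each component has already been normalized to a comparable scale; the sign conventions and threshold here are illustrative assumptions, not tuned values.

def should_compress(entity_stats, threshold=0.0):
    """Decide whether to compress an entity using the utility U defined above.

    entity_stats is assumed to carry pre-normalized components, where costs
    (compression time, recent churn) are negative and benefits (space saved,
    low long-term activity) are positive. Positive utility favors compressing.
    """
    u = (entity_stats["c_time"] + entity_stats["c_space"]
         + entity_stats["c_activity"] + entity_stats["c_velocity"])
    return u > threshold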

Collecting statistics to measure utility can incur a significant overhead. Not every decision in the optimizer needs to be decided automatically. We can use some simple principles to estimate the utility at each point. In the next section, we examine an entity resolution data set and get some intuition for the development of the optimizer.

4.6 Proposal Optimizer Experiment Implementation

In this section, we first describe the Wikilink data set we use for experiments. Following, we present a micro benchmark to validate our investigation of entity approximation

and compression. We then discuss the implementation of the compression and approximation techniques over a large real-world cross-document entity resolution corpus.

4.6.1 WikiLink Corpus

The Wikilink corpus is the largest fully-labeled cross-document entity resolution data set to date [87]. When downloaded, the data set contains 40 million mentions and almost three million entities; it is 180 GB of compressed data. The Wikilink corpus was created by crawling pages across the web and extracting anchor tags that referenced Wikipedia articles. Each page contains multiple mentions of different types. The Wikipedia articles act as the truth for each mention. Although manually constructed and not without its biases, this is the largest, fully-labeled entity resolution data set over web data that we could find (at the time of preparation).

To increase our intuition about early stopping techniques, we simulated the MCMC proposal process. We hypothesize that a range of values exists where performing the baseline cluster scoring is faster than using early stopping methods. We arrange entity clusters in increasing size and we compute the time (in clock ticks) each proposal takes to compute the arrangement of the clusters. The data in the clusters are distributed uniformly for this experiment and each cluster point was 5-dimensional. For the baseline cluster score computation we used a pairwise calculation of the average cosine distance with and without the mention. To compute early stopping we set a confidence threshold of 0.8 and the early stopping code stopped computation when the predicted error was under 20%. There was no difference in the proposal choices of the baseline method and the early stopping method. The simulations were developed in GNU C++11 and compiled with g++ -O3. The CPU was an 8-core Intel i7 at 3.2 GHz with 12 GB of memory. Each arrangement was run 5 times and the results averaged.

Figure 4-3. Comparison of baseline versus early stopping methods

Early stopping or baseline. We first determine when early stopping approaches for proposal scoring are beneficial. For this result we compare the baseline proposal evaluator with a confidence-based scorer for varying entity sizes. The result of this experiment is summarized in Figure 4-3. The x-axis is the number of mentions in the source and destination cluster for each proposal. The y-axis is the number of clock ticks on a log scale. We observe that for proposals with fewer than 100 and 1000 source and destination mentions, the performance of the baseline proposer is better than or almost equal to that of the more sophisticated early stopping method. For proposals that contain an entity cluster with 10000 mentions the early stopping method performs significantly better than the baseline method. Surprisingly, the baseline proposals for entity clusters containing 100K mentions performed over an order of magnitude better than the early stopping method. The optimizations found in predictable code paths make simple implementations like the baseline method attractive for small cluster sizes and very large cluster sizes. In addition, 82% of the entities in the truthed Wikilinks data set are less than 1000 mentions in size and 45% of the entities contain less than 100 mentions.

Figure 4-4. The time for compression for varying entity sizes and cardinalities. This is compared with a line representing the time it takes to make 100K insertions

The results of the micro benchmark suggest that different proposal estimation techniques are useful at different times. Note that for these techniques a small constant amount of bookkeeping space is required to perform early stopping.

Insertion vs. Compression Time. Compressing an entity is an expensive operation. When compressing an entity, it must be locked to prevent any concurrent access. In order to choose the best times to compress an entity cluster, in this micro benchmark we look at the time to compress entities of different cardinalities and compare it to the time it takes to insert entities. Using a synthetic data set we generated entities of varying sizes and cardinality. This experiment is shown in Figure 4-4. The cardinality number is a ratio of duplicates in the data set. For example, cardinality 0.8 means 8 of 10 items in the data set are duplicates. The graph shows that in the time it takes to compress entities of about 300K mentions, the sampler could make 100K samples. We can conclude from these results that compressing large entities is expensive and should only be done if the cluster is prohibitively large and not popular. Cardinality estimation for millions of entities is a significant overhead. Tracking cardinalities simultaneously for each entity, even using small probabilistic sketches such as HyperLogLog [32], becomes prohibitive for large numbers of entities. By the time the cardinality of an entity needs to be monitored for possible compression, that entity might

as well be compressed. We are continuing to look for lighter weight cardinality estimators for millions of mentions so decisions can be made quickly.

4.7 Proposal Optimizer Summary

In this chapter, we describe an initial approach for optimizing sampling for the entity resolution process. We begin to develop an optimizer that attacks two major limitations, the size of the entities and the redundant computation. This chapter motivated the need for the optimizer and examined the feasibility of its creation. Future work includes the implementation of a full optimizer over a large, streaming corpus with resolved entities. We hope to soon have a fully resolved TREC StreamCorpus1 and examine the performance of the optimizer on that large data set. Additionally, we hope to compare results with enterprise ER systems such as WOO [9].

1 After acceptance, the corpus at http://trec-kba.org/kba-stream-corpus-2014.shtml was linked to Freebase and is now available to researchers [27].

CHAPTER 5
QUESTION ANSWERING

Question answering is the problem of bridging the gap between the way a user asks a question and the way an answer is encoded in the background knowledge. In this work we start with natural language questions and use the deep web as the background knowledge. The deep web, or hidden web, is the set of databases behind web forms on the web. In 2007, these databases were estimated to contain data two orders of magnitude larger than the surface web. Contrary to the surface web, the information in these databases is difficult to obtain. In this chapter, I describe a method for accessing the deep web to answer wh-questions. This system is called the Morpheus QA system [40].

5.1 Morpheus QA Introduction

When traveling through a jungle to a destination, it is easy to get lost. The first person to journey somewhere may make a number of mistakes when trying to find the best path to their destination. Those who come later find it easier to reach the destination if a well-marked trail has been created. Olsen and Malizia describe this idea as exploiting trails [72]. Rather than treating a user’s discovery experience as a unique entity, one can exploit the fact that a similar search may have already been performed. In one study, almost 40 percent of web queries were repetitions of previous queries [92]. Thus, reuse of prior searches is one way to optimize the search process. Morpheus is a question answering system motivated by reuse of prior web search pathways to yield an answer to a user query. Morpheus follows path finders to their destinations and not only marks the trail, but also provides a taxi service to take followers to similar destinations. Morpheus focuses on the deep (or hidden) web to answer questions because of the large stores of quality information provided by the databases that support it [69]. Web forms act as an interface to this information. Morpheus employs user exploration through these web forms to learn the types of data each deep web location provides.

There are two distinct Morpheus user roles. A path finder enters queries in the Morpheus web interface and searches for an answer to the query using an instrumented web browser. This web tracking tool stores the query and necessary information to revisit the pathways to the page where the path finder found the answer. A path follower uses the Morpheus system much like a regular search engine with a natural language interface. The path follower enters a question in a text box and receives a guided path to the answer. The system exploits previously found paths to provide an answer. Morpheus represents a user question as a semi-structured query (SSQ). It assumes the query terms belong to classes of a consistent realm-based ontology, that is, one having a singly rooted heterarchy whose subclass/superclass relations have meaningful semantic interpretations. When a path follower enters a query, Morpheus ranks SSQs in the store based on class similarity. Suppose a path follower asks: A 1997 Toyota Camry V6 needs what size tires? In this query the classes associated with terms, e.g. Manufacturer with Toyota, help us identify similar queries. This chapter discusses related question answering and ontology generation systems in Section 5.2. Section 5.3 explains the Morpheus system and its implementation. In Section 5.4 we describe the current results of our approach. Finally, we conclude with future goals for the system.

5.2.1 Question Answering Systems

The earliest question answering systems such as BASEBALL [43] and Lunar [102] had closed domains and closed corpora, that is, they support a finite number of questions over corpora containing a fixed set of documents. Morpheus uses the web as its dynamic, open corpus and examines deep web sources to answer questions. This process is federated question answering.

Several other QA systems that use the web as a resource have been developed. Example systems include START1 and Swingly.2 These systems use web pages retrieved by web crawlers or search engines to find answers. Morpheus differs in that it seeks out relevant deep web sources, and instead of using a web search engine, it uses only the pages referenced in a previously answered question.

5.2.2 Ontology Generators

The DBpedia3 project is a community of contributors extracting semantic information from Wikipedia and making this information available on the Web. Wikipedia semantics includes disambiguation pages, geo-coordinates, categorization information, images, info-box templates, links to external web pages, and redirects to pages in Wiki markup form [13]. DBpedia does not define any new relations between the Wikipedia categories. YAGO is a semi-automatically constructed ontology obtained from the Wikipedia pages, info-boxes, categories, and the WordNet4 synset heterarchy [91]. YAGO uses the Wikipedia page titles as its ontology individuals and categories as its ontology classes. YAGO uses only the nouns from WordNet and ignores the WordNet verbs and adjectives. YAGO discovers connections between WordNet synsets and Wikipedia categories, parsing the category names and matching the parsed category components with the WordNet synsets. Each Wikipedia category not having a WordNet match is ignored in the YAGO ontology. The ontology's heterarchy is built using the hypernym and hyponym relations of the WordNet synsets. We use YAGO's principles to construct ontologies that provide similarity measures for answering questions within the same domain. Thus far, these ontologies can be used

1 http://start.csail.mit.edu
2 http://swingly.com
3 http://dbpedia.org
4 http://wordnet.princeton.edu

to classify terms; however, their classes do not always appropriately categorize query parameters. It is necessary to provide an appropriate level of class granularity. Section 5.3 discusses our approach for identifying classes and their instances from deep web forms and documents.

5.3 Morpheus QA System Architecture

This section presents the ontology and corpora, query processing, query ranking, and query execution.

Morpheus uses an ontology that contains classes of a particular realm of interest. Each leaf node in the ontology is associated with a corpus of words belonging to a class. For example, we have constructed a vehicular ontology containing classes relevant to the vehicular realm. This ontology provides a structure for reference in the following sections. Morpheus references the DBpedia categories, Wikipedia pages, and the WordNet synset heterarchy to find class-relevant web pages. First a realm is mapped to a DBpedia category [13]. Using the DBpedia ontology properties broader and narrower, a Markov Blanket [75] is created covering all neighboring categories. To build a corpus for each of the leaf nodes in the ontology, we extract terms from the Wikipedia pages associated with the DBpedia categories found in its blanket. From this term corpus, we can find the likelihood of a term belonging to a particular class. This assists in classifying terms in a path follower query. The likelihood is determined by the probability of a class given a term using Bayes Rule (Eq. 5–1), since we can easily obtain the term-class and term-corpus probabilities as relative frequencies.

P(class \mid term) = \frac{P(term \mid class)\, P(class)}{P(term)}    (5–1)
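As a small sketch of Eq. 5–1, with raw term counts standing in for the relative frequencies and class_corpora as an assumed, simplified representation of the per-class corpora:

from collections import Counter

def class_given_term(term, class_corpora):
    """Score P(class | term) for every class using Bayes rule with relative frequencies.

    class_corpora: dict mapping class name -> nonempty list of terms extracted from
    the Wikipedia pages of that class (a simplified stand-in for the real corpora).
    """
    total_terms = sum(len(corpus) for corpus in class_corpora.values())
    term_count = sum(Counter(corpus)[term] for corpus in class_corpora.values())
    if term_count == 0:
        return {}
    p_term = term_count / total_terms
    scores = {}
    for cls, corpus in class_corpora.items():
        p_class = len(corpus) / total_terms                   # prior from corpus size
        p_term_given_class = Counter(corpus)[term] / len(corpus)
        scores[cls] = p_term_given_class * p_class / p_term   # Eq. 5-1
    return scores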

In addition, we employ Latent Dirichlet Allocation (LDA) to identify latent topics of the documents in a corpus [14]. LDA is a Bayesian model that represents a document in the corpus by distributions over topics, and a topic itself is a distribution over all

terms in the corpus. For example, the latent topics reflect the thematic structure of Wikipedia pages. Thus, LDA discovers relevant topic proportions in a document using posterior inference [14]. Given a text document, we tag related documents by matching their similarity over the estimated topic proportions, assisting in ontology and corpora construction. We use LDA as a dimensionality reduction tool. LDA's topic mixtures are represented as feature vectors for each document. We are evaluating support vector machines as a classifier over the document-topic proportions. Due to its fully generative semantics, this usage of LDA could address drawbacks of frequency-based approaches (e.g., TF-IDF, LSI, and pLSI) such as dimensionality and failure to find the discriminative set of words for a document.

Table 5-1. Example SSQ model
  Terms:   1997  Toyota  Camry  V6  size  tires
  Input:   Date  Manufacturer  Model  Engine
  Output:  Measurement  Part

5.3.2 Recording

The Query Resolution Recorder (QRR) is an instrumented web browser that records the interactions of a path finder answering a question. The path finder also uses the tool to identify ontological classes associated with search terms. Morpheus stores the query, its terms, and its classes as an SSQ. Table 5-1 is an example showing the SSQ model of the query: A 1997 Toyota Camry V6 needs what size tires? The SSQ in Table 5-1 is said to be qualified because the classes associated with its terms have been identified. Using the QRR, the path finder is also able to identify where answers can be found within traversed pages. The Query Resolution Method (QRM) is a data structure that models the question answering process. A QRM represents a generalized executable realization of the search process that the path finder followed. The QRM is able to reconstruct the page search path followed by the path finder. Each QRM contains a realm from our ontology, an SSQ,

and information to support the query answering process. For each dynamic page, the QRM contains a list of inputs and reference outputs from the URL. When a path follower submits a query, the Morpheus search process parses and tags queries in order to record important terms. The system assigns the most probable realm given the terms in the query as calculated from realm-specific corpora. Once the realm is assigned, an ontology search is performed to assign classes to the terms. An SSQ is constructed and the system attempts to match this new SSQ to existing QRM SSQs. Rather than matching exact query terms, the system matches input and output classes, because a QRM can potentially answer many similar queries.

5.3.3 Ranking

To answer a user’s query, a candidate SSQ, Morpheus finds similar qualified SSQs that are associated with QRMs in the Morpheus data store. To determine SSQ similarity, we consider the SSQ’s realm, input terms, output terms, and their assigned classes. The class divergence of two classes within the ontology characterizes their dissimilarity. This solution is motivated by the concept of multiple dispatch in CLOS and Dylan programming for generic function type matches [8]. We consider the class match as a type match and we use class divergence to calculate the relevance between the candidate SSQ and a qualified SSQ. Each qualified SSQ will have input terms, output terms, associated classes, and one realm from the QRM. For the candidate SSQ, the relevant classes for terms are determined from the natural language processing engine and corpora. The calculation of a realm for a candidate query is performed using the terms found within the query and any probabilities found with p(realm|term). We match QRMs that belong to the same realm of the candidate SSQ. The relevance of a qualified SSQ to the candidate SSQ is determined by aggregating the divergence measure of input term classes associated with each SSQ. In addition, we order QRMs in the data store by decreasing relevance. The order provides a ranking for the results to the user. The following describes class divergence in detail.

We define class divergence (Eq. 5–2), a quasi-metric, between a source class and a target class using the topological structure of the classes in an ontology. We write S ≺ T for the reflexive transitive closure of the superclass relation. Let d(P, Q) represent the hop distance in the directed ontology inheritance graph from P to Q. The class divergence cd between a source and target class ranges from zero (for identical classes) to one (for type-incompatible classes). Let S be the source class, T be the target class, and C be a least common ancestor class of S and T, i.e., one that minimizes d(S, C) + d(T, C). The class divergence between S and T is defined by:

cd(S, T) =
  0                                            if S.Uri ≡ T.Uri
  d(S, T) / (3h)                               if S ≺ T
  1                                            if T ≺ S
  (d(S, root) + d(S, C) + d(T, C)) / (3h)      otherwise        (5–2)

where h is the longest path in the ontology class heterarchy. Note, if S ≺ T and S ⊀ Q then cd(S, T) < cd(S, Q); that is, the divergence of a source class to a target ancestor class is smaller than the divergence of a source class to any class that is not an ancestor. This is an important property in determining the compatibility of classes for answering queries. If an SSQ answers queries concerning an ancestor class, it is more relevant than an SSQ that answers queries from any non-ancestral class. Suppose we want to find the class divergence between Bus and Sedan from the ontology shown in Figure 5-1. Land Vehicle is their least common ancestor because Sedan is a subclass of Automobile, which is a subclass of Land Vehicle, and Bus is a subclass of Land Vehicle. The longest path from Bus and Sedan to the tree root is four (h = 4). By the formula in Eq. 5–2, cd(Bus, Sedan) = (d(Bus, root) + d(Bus, Land Vehicle) + d(Sedan, Land Vehicle))/(3h) = (3 + 1 + 2)/(3 · 4) = 6/12.
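A minimal sketch of Eq. 5–2 over a simple parent-pointer view of the ontology; the single-parent representation and precomputed h are simplifying assumptions.

def ancestors(cls, parent):
    """Map each ancestor of cls (including itself) to its hop distance."""
    dist, d = {}, 0
    while cls is not None:
        dist[cls] = d
        cls, d = parent.get(cls), d + 1
    return dist

def class_divergence(s, t, parent, root, h):
    """Eq. 5-2: divergence from source class s to target class t.

    parent: dict mapping each class to its single superclass (root has no entry);
    h: length of the longest path in the class heterarchy.
    Assumes a singly rooted heterarchy, as described in Section 5.1.
    """
    if s == t:
        return 0.0
    anc_s, anc_t = ancestors(s, parent), ancestors(t, parent)
    if t in anc_s:                      # s is a descendant of t (S ≺ T)
        return anc_s[t] / (3 * h)
    if s in anc_t:                      # t is a descendant of s (T ≺ S)
        return 1.0
    # Least common ancestor: minimizes d(s, c) + d(t, c).
    common = set(anc_s) & set(anc_t)
    c = min(common, key=lambda a: anc_s[a] + anc_t[a])
    return (anc_s[root] + anc_s[c] + anc_t[c]) / (3 * h)

With hop distances matching the Bus and Sedan example above, the otherwise branch returns (3 + 1 + 2)/(3 · 4) = 0.5.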

Figure 5-1. Abbreviated vehicular ontology

5.3.4 Executing New Queries

Once we have ranked the QRMs for a given user query, we can produce answers by re-visiting the pathways stored in the QRMs. The Morpheus Query Executor (QE) evaluates a script of the query resolving process. It simulates a human clicking buttons to follow links, submit forms, and highlight data, forming a textual answer. The QE assumes that because of the auto-generated nature of deep web pages, the locations of answers are the same irrespective of page changes. It uses the relative XPath location to the answer node on HTML pages as described in Badica et al. [6].

5.4 Morpheus QA Results

First, we built an ontology for the vehicular realm exploiting the Wikipedia pages, DBpedia categories, and WordNet synsets. For each of the classes in the ontology we built corpora from the corresponding Wikipedia pages. Figure 5-1 shows a subsection of this ontology. In Table 5-2 we show the data output by the Morpheus parse of the query. It extracts the wh-term that classifies the sentence as a question, identifies the answer class, and

locates descriptive phrases to produce the answer. Finally, the engine produces n-grams from phrases in the descriptive information sections.

Table 5-2. The output of the NLP engine
  wh-term               what
  descriptive phrases   1997 Toyota Camry V6
  asking for            size tires
  n-grams               1997, 1997 Toyota, 1997 Toyota Camry, Toyota, Toyota Camry, Toyota Camry V6, Camry, Camry V6, V6

Using the data in Table 5-2 we determine relevant classes in non-increasing order of relevance. Table 5-3 shows the eight best term classes and their probabilities for automotive queries.

Table 5-3. Term classes and probabilities
  Term              Class     P(Class|Term)
  1997              Sedans    404132.77e-14
  1997 Toyota       Engines   7.90e-14
  Toyota            Sedans    3486670.15e-14
  Toyota Camry      Sedans    12147.23e-14
  Toyota Camry V6   Coupes    13.80e-14
  Camry             Sedans    312034.20e-14
  Camry V6          Coupes    13.80e-14
  V6                Sedans    4464535.40e-14

We found the best classes for the terms in the candidate SSQ. We calculated the class divergence between these classes and the qualified SSQ classes in the QRM store. QRMs are ranked based upon the relevance score and the class divergence measure. Table 5-4 shows the three highest ranked queries and the answers produced by the QRE, a Python back end. Finally, we execute the best QRMs and display the results to the user.

5.5 Morpheus QA Summary

In this work, we propose a novel question answering system that uses the deep web and previously answered user queries to answer similar questions. The system uses a path finder to annotate answer paths so path followers can discover answers to similar questions. Each (question, answer path) pair is assigned a realm, and new questions are

matched to existing (question, answer path) pairs. The classification of new question terms into classes is based on term frequency distributions in our realm-specific corpora of web documents. These terms are the input to existing answer paths and we re-execute these paths with the new input to produce answers. Our solution is composed of a web front end where users can ask questions. The QRR was developed as a plugin and an associated C# application. Our similarity measures were coded using Java and open source libraries. Answers are produced by the QRE, a Python back end. The data is stored in a PostgreSQL database.

Table 5-4. Highest ranked Morpheus QA queries
  Query                                              Tagged Classes                            Score
  A 1997 Toyota Camry V6 needs what size tires?      Sedan, Automobile, Engine, Manufacturer   0.91
  What is the tire size for a 1998 Sienna XLE Van?   Van, Manufacturer                         0.72
  Where can I buy an engine for a Toyota Camry V6?   Sedan, Automobile Engine, Manufacturers   0.74

Topic modeling provides a promising approach to identifying pages relevant to a class in a more automated manner. We believe these web form entry annotation methods and form label extraction [69] can yield promising results. Combining this with the method of Elmeleegy et al. [30] may remove the user from the answer path generation process. Future investigation in this area should look to merge compatible QRMs to answer compound questions, chaining QRMs using the principles of transform composition [73].

CHAPTER 6
PATH EXTRACTION IN KNOWLEDGE BASES

Knowledge bases are increasingly being augmented using unstructured data to extract actionable information. Typically, KBs are populated with triples of information and then searched with queries to discover a new subset of data. Inference is the task of extracting knowledge that is not explicitly represented. Knowledge bases contain boundless amounts of information that needs to be extracted using new and efficient methods. Extracting sets of information from knowledge bases is an exciting area of current research. In this chapter, we describe a new path traversal process over knowledge bases with uncertainty. I define an algorithm to discover, extract, and rank connected sets of facts in a knowledge base between multiple entities of interest. I empirically show that the path expansion methods described are useful to express relationships between entities.

6.1 Preliminaries for Knowledge Base Expansion

In this section, we discuss the fundamental concepts underlying path expansion. First is a discussion of knowledge bases, and we then formally describe a probabilistic knowledge base. Note that formalisms and previous work in this section are shared with a recent publication.

6.1.1 Probabilistic Knowledge Base

In this section we formally describe a probabilistic knowledge base. This definition is derived from Chen et al. [21]. A probabilistic knowledge base is a 5-tuple Γ = (E, C, R, Π, L) where

1. E = {e_1, . . . , e_|E|} is a set of entities. Each entity e ∈ E refers to a real-world object.

2. C = {C_1, . . . , C_|C|} is a set of classes (or types). Each class C ∈ C is a subset of E: C ⊆ E.

3. R = {R_1, . . . , R_|R|} is a set of relations. Each R ∈ R defines a binary relation on C_i, C_j ∈ C: R ⊆ C_i × C_j. We call C_i, C_j the domain and range of R and use R(C_i, C_j) to denote the relation and its domain and range.

4. Π = {(r_1, ω_1), . . . , (r_|Π|, ω_|Π|)} is a set of weighted facts (or relationships). For each (r, ω) ∈ Π, r is a tuple (R, x, y), where R(C_i, C_j) ∈ R, x ∈ C_i ∈ C, y ∈ C_j ∈ C, and (x, y) ∈ R; ω ∈ ℝ is a weight indicating how likely r is true. We also use R(x, y) to denote the tuple (R, x, y).

5. L = {(F_1, W_1), . . . , (F_|L|, W_|L|)} is a set of weighted clauses (or rules). It defines a Markov logic network. For each (F, W) ∈ L, F is a first-order logic clause, and W ∈ ℝ is a weight indicating how likely formula F holds.

We refer the reader to [21] for a discussion of first-order probabilistic logic.
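A minimal sketch of how the 5-tuple might be held in memory; this is an illustrative simplification, not the storage layout of any particular system.

from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

@dataclass
class ProbabilisticKB:
    """Γ = (E, C, R, Π, L) from the definition above."""
    entities: Set[str] = field(default_factory=set)                      # E
    classes: Dict[str, Set[str]] = field(default_factory=dict)           # C: class -> member entities
    relations: Dict[str, Tuple[str, str]] = field(default_factory=dict)  # R: name -> (domain, range)
    facts: List[Tuple[Tuple[str, str, str], float]] = field(default_factory=list)  # Π: ((R, x, y), ω)
    rules: List[Tuple[str, float]] = field(default_factory=list)         # L: (first-order clause, W)

kb = ProbabilisticKB()
kb.facts.append((("born_in", "Ruth Gruber", "New York"), 0.96))
kb.rules.append(("born_in(x, y) -> live_in(x, y)", 1.40))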

6.1.2 Markov Logic Network and Factor Graphs

Markov logic networks (MLN) [79] combine first-order logic [90] and probabilistic graphical models [53] into a single model. Essentially, an MLN is a set of weighted

first-order formulae (F_i, W_i), the weights W_i indicating how likely the formula F_i is true. A simple example of an MLN is:

1. 0.96  born in(Ruth Gruber, New York)
2. 1.40  ∀ x ∈ Person, ∀ y ∈ CITY: born in(x, y) → live in(x, y)

It states a fact that Ruth Gruber was born in New York City and a rule that if a writer x is born in an area y, then x lives in y. However, neither statement definitely holds. The weights 0.96 and 1.40 specify how strong they are; stronger rules are less likely to be violated. An MLN can be viewed as a template for constructing ground factor graphs. In the ground factor graph, each node represents a fact in the KB, and each factor represents the causal relationship among the connected facts. For instance, for the rule grounded with born in(Ruth Gruber, New York), we have two nodes, one for the head and the other for the body, and a factor connecting them, with values depending on the weight of the rule. The factors together determine a joint probability distribution over the facts in the KB. A

factor graph is a set of factors Φ = {φ_1, . . . , φ_N}, where each factor φ_i is a function over a random vector X_i indicating the causal relationships among the random variables in X_i. These factors together determine a joint probability distribution over the random vector X consisting of all the random variables in the factors. Mathematically, we seek the maximum a posteriori (MAP) configuration. The model defines a probability distribution over its variables X:

P(X = x) = \frac{1}{Z} \prod_i \phi_i(x) = \frac{1}{Z} \exp\left( \sum_i w_i n_i(x) \right)    (6–1)
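A minimal sketch of the unnormalized score in Eq. 6–1, where n_i(x) counts the true groundings of formula F_i in world x; count_true_groundings is a hypothetical placeholder for an actual grounding procedure.

import math

def unnormalized_log_score(world, weighted_formulas, count_true_groundings):
    """Compute sum_i w_i * n_i(x) for a candidate world x.

    weighted_formulas: list of (formula, weight) pairs;
    count_true_groundings(formula, world): n_i(x), the number of satisfied groundings.
    """
    return sum(w * count_true_groundings(f, world) for f, w in weighted_formulas)

def unnormalized_score(world, weighted_formulas, count_true_groundings):
    """exp of the log score; dividing by Z (intractable in general) would give P(X = x)."""
    return math.exp(unnormalized_log_score(world, weighted_formulas, count_true_groundings))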

6.1.3 Sampling for Marginal Inference

Computing the exact Z in Equation 6–1 is intractable due to the large space of possible configurations. Sampling algorithms are typically used to approximate the marginal distribution since direct computation is difficult. The two most popular of these approaches are Gibbs sampling [19] and MC-SAT [77]. These two sampling algorithms are briefly discussed in the following two paragraphs.

6.1.3.1 Gibbs sampling

Gibbs sampling [19] is a special case of the Metropolis-Hastings algorithm [22]. The point of Gibbs sampling is that, given a multivariate distribution, it is simpler to sample from each conditional distribution than to compute the marginal distribution by integrating over the joint distribution. The Gibbs sampling algorithm is described in Algorithm 19:

Algorithm 19 Gibbs Sampling
1: z^(0) := ⟨z_1^(0), . . . , z_k^(0)⟩
2: for t ← 1 to T do
3:   for i ← 1 to k do
4:     z_i^(t) ∼ P(Z_i | z_1^(t), . . . , z_{i-1}^(t), z_{i+1}^(t-1), . . . , z_k^(t-1))
5:   end for
6: end for
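As a concrete illustration of Algorithm 19, the sketch below runs Gibbs sampling over a small set of binary variables whose unnormalized log-density is a weighted sum of features, as in Equation 6-1. The factor definitions, burn-in length, and helper names are illustrative assumptions, not the implementation used later in this chapter.

import math
import random

def gibbs_sample(n_vars, factors, num_samples, burn_in=100, rng=random.Random(0)):
    """Toy Gibbs sampler over n_vars binary variables.
    factors: list of (weight, feature) pairs, where feature(x) returns 0 or 1
    for an assignment x (a list of booleans)."""
    def log_score(x):
        return sum(w * f(x) for w, f in factors)

    x = [rng.random() < 0.5 for _ in range(n_vars)]  # random initial state
    samples = []
    for t in range(burn_in + num_samples):
        for i in range(n_vars):
            # Conditional of x_i given all other variables: compare the two
            # unnormalized scores obtained by setting x_i to True and False.
            x[i] = True
            p_true = math.exp(log_score(x))
            x[i] = False
            p_false = math.exp(log_score(x))
            x[i] = rng.random() < p_true / (p_true + p_false)
        if t >= burn_in:
            samples.append(list(x))
    return samples

# The example MLN from Section 6.1.2, encoded as two weighted features.
factors = [(0.96, lambda x: int(x[0])),
           (1.40, lambda x: int((not x[0]) or x[1]))]
samples = gibbs_sample(2, factors, num_samples=5000)
print(sum(s[0] for s in samples) / float(len(samples)))  # approximate marginal of born_in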

Algorithm 19 begins with some initial value, which can be chosen randomly. Each variable is sampled from the distribution of that variable conditioned on all other variables, making use of the most recent values and updating the variable with its new value. The marginal probability of any variable can be approximated by averaging over all the samples of that variable. Usually, some number of samples at the beginning (the burn-in period) are ignored,

and then the values of the remaining samples are averaged to compute the expectation. Gibbs sampling is implemented in state-of-the-art statistical relational learning and probabilistic logic inference software packages [52, 70].

6.1.3.2 MC-SAT

In real-world datasets, a considerable number of Markov logic rules are deterministic. Deterministic dependencies break a probability distribution into disconnected regions. When deterministic rules are present, Gibbs sampling tends to be trapped in a single region and never converges to the correct answer. MC-SAT [77] solves the problem by wrapping a procedure around the SampleSAT uniform sampler that enables it to sample from highly non-uniform distributions over satisfying assignments.

Algorithm 20 MC-SAT(clauses, weights, num_samples)
1: x^(0) ← Satisfy(hard clauses)
2: for i ← 1 to num_samples do
3:   M ← ∅
4:   for all c_k ∈ clauses satisfied by x^(i-1) do
5:     with probability 1 − e^(−w_k) add c_k to M
6:   end for
7:   sample x^(i) ∼ U_SAT(M)
8: end for
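The following toy sketch of Algorithm 20 is for illustration only: the SampleSAT uniform sampler is replaced by exhaustive enumeration over assignments, which is feasible only for a handful of variables, and the clause encoding and helper names are assumptions rather than the systems used later in this chapter.

import math
import random
from itertools import product

def mc_sat(n_vars, clauses, num_samples, rng=random.Random(0)):
    """Toy MC-SAT over n_vars boolean variables.
    clauses: list of (weight, clause); a clause is a list of (var_index, value)
    literals and is satisfied when any literal matches the assignment.
    Hard clauses use weight float('inf')."""
    def satisfied(clause, x):
        return any(x[i] == val for i, val in clause)

    assignments = list(product([False, True], repeat=n_vars))
    hard = [c for w, c in clauses if math.isinf(w)]
    # x^(0) <- Satisfy(hard clauses): pick any assignment satisfying the hard clauses.
    x = next(a for a in assignments if all(satisfied(c, a) for c in hard))

    samples = []
    for _ in range(num_samples):
        # Keep each currently-satisfied clause with probability 1 - e^(-w).
        m = [c for w, c in clauses
             if satisfied(c, x) and rng.random() < 1.0 - math.exp(-w)]
        # Sample uniformly from the assignments satisfying every clause in M
        # (a brute-force stand-in for the SampleSAT uniform sampler).
        candidates = [a for a in assignments if all(satisfied(c, a) for c in m)]
        x = rng.choice(candidates)
        samples.append(x)
    return samples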

For a more detailed discussion of MC-SAT, we refer the reader to the original publication [77].

6.1.4 Linking Facts in a Knowledge Base

The linking of facts is itself knowledge. To extract knowledge from large sets of facts we can use the linked structure created through probabilistic rules. This connection is only one method that may be used to link facts. We enumerate four methods that link concepts in a knowledge base: (a) rules; (b) records; (c) record linking; (d)

Rd-space. In a knowledge base, any fact may be represented as a single triple or a set of triples. We previously discussed how to perform inference over knowledge bases using rules (Section 6.1.2); this is the first way facts can be combined to extract information. Another well

known method for linking facts together is through SPARQL queries. Large triple stores (recall that each triple may be considered a fact) can be queried using a declarative query language. Data in these triple stores are organized as nodes and edges, where a node may be an entity or a value and edges represent relationships between two nodes. With SPARQL, users declaratively express the information they would like to search for in a triple store, and the language generates a mix of sub-tree templates and operations that produce the requested information. SPARQL is a powerful method for discovering and linking facts. The third method, record linking, describes the connection of the nodes and edges in a triple store to create a path of connected statements. The connection of facts causes multiple entities that were not previously linked to form a relationship. While these connections may be long and possibly non-informative, in bulk they provide information seekers with a summary of the connection between entities. We study an implementation of this method over a large knowledge base, in a relational database management system and also in a graph database.

The fourth method, (d) Rd-space, embeds facts into a vector space; knowledge base inference can then be performed over the vector space itself. There are several methods of encoding facts into vector spaces [50], but performing knowledge base operations over the vector space remains an open problem and an area of current research.

6.2 Fact Path Expansion Related Work

In this section, I describe three works related to fact path expansion. I first discuss the use of SPARQL queries to extract paths. Following that, I discuss two research projects that use paths in knowledge bases and on the web to extract information: path ranking from researchers at Carnegie Mellon University and fact ranking from Yahoo!.

6.2.1 SPARQL Query Path Search

Querying data over triples in the Resource Description Framework (RDF)1 format is performed using an SQL-like language called the SPARQL Protocol and RDF Query Language (SPARQL). To discover paths in RDF stores, SPARQL has defined property paths as part of the standard.2 The property path extension to SPARQL allows the specification of graph patterns of arbitrary length. Property path queries can define variable ends of paths to return all paths between nodes. We are most interested in a function to extract paths of any length called ArbitraryLengthPath; this definition can be found in the standard. In our use case, we do not store the knowledge base in the RDF format. To adapt our methods to an RDF data set we would need to perform approximate matching on RDF elements, and we would still need an external method of ranking the paths.

6.2.2 Path Ranking

Lao et al. investigated the ranking of paths in graphs and knowledge bases [54, 55]. They train a model to learn new instances of relations between two or more entities using paths. The researchers perform cross validation to rank the top returned paths using Mechanical Turk. The graph they describe connects entities using relations in the knowledge graph. Relations, or edges, are directed, and if a path traverses an edge in the opposite direction it is described as an inverse walk. For efficiency, paths are also constrained based upon rules and class types. For example, the researchers extract a path constrained by the Horn clause isa(x, c) ∧ isa(x′, c) ∧ AthletePlaysInLeague(x′, y) → AthletePlaysInLeague(x, y). A random walk is performed to discover common paths

1 http://www.w3.org/RDF/
2 SPARQL 1.1 standard and the property path definition: http://www.w3.org/TR/2010/WD-sparql11-query-20101014/#propertypaths.

and the top paths are ranked and returned. This work is essentially a type of rule learning over knowledge bases; the end product is a list of paths that are candidates for new logical rules. In this dissertation, it is not necessary to summarize the paths to create a new link. Instead, we look at the aggregate of all paths to make a statement about the source and destination entity. Our work primarily creates a graph over the similarity space, although the techniques can extend to links generated over the rule space or the graph space.

6.2.3 Fact Rank

Jain et al. explored a technique called FactRank that used similar entities in a knowledge base (factbase) to find influential and trustworthy facts [47]. Given a set of relations of interest, a graph is created based on all the facts that have the same subjects and objects. Links in the graph are bidirectional and are created when two facts have a matching subject or object. They create a modification of the PageRank algorithm [17] that consistently outperforms the traditional algorithm when discovering the most important facts. Our work also connects facts that have an overlapping subject or object, but we do not restrict paths to a set of relation types. We focus on computation time and the discovery of novel paths of facts.

6.3 Fact Path Expansion Algorithm

The goal of fact path expansion is to collect and rank the most representative paths between a source and a destination entity. Starting with a source entity, the algorithm performs a search through the knowledge base to extract candidate paths. The candidate paths are ranked based on the similarity of the matches along the path. We first formalize the definition of the algorithm and provide pseudocode. We separately discuss how to rank the output to obtain the most representative paths. After an explanation of the algorithm we describe different implementations of the algorithm over a popular knowledge base.

Given a knowledge base G that stores triples g_i = ⟨s_i, p_i, o_i⟩, g_i ∈ G, the algorithm takes a source query node e_s and a target node e_t. Step one of the algorithm is to find all the triples g ∈ G such that s_g ∼ e_s or o_g ∼ e_s. Each of these initial nodes is considered a start entity for a potential path. Only the subject s and object o are linked in the path exploration, and not the predicate p, because we are interested in the entities connecting facts. For the purposes of exploration we are less interested in the meaning of the predicate or the similarity of relations. The most recently discovered nodes are added to the working set W, where W ⊆ G. Next, we recursively expand each triple in the working set to look for triples that are similar to the working-set triple. That is, for each w = ⟨s_w, p_w, o_w⟩, we find all the triples g ∈ G such that s_w ∼ s_g or o_w ∼ o_g. The result of this step becomes the new working set. We also order the items in the new working set based on the sum of the calculated similarity scores of the current match and the previous path. After we sort the paths, we leave it to the end user to draw conclusions from the linked set of facts.

Algorithm 21 Formal algorithmic definition for fact path construction
Input: A knowledge base G. A query start entity e_s. A query target entity e_t.
Output: Linked paths P.
 1: V ← ∅                                      ▷ Visited set.
 2: W ← {g | s_g ∼ e_s or o_g ∼ e_s, g ∈ G}
 3: while W ≠ ∅ do
 4:   W′ ← {g | s_g ∼ s_w or o_g ∼ o_w, w ∈ W, g ∈ G}
 5:   W′ ← W′ \ V                              ▷ Remove all visited nodes.
 6:   V ← V ∪ W′
 7:   P ← P ∪ {g | e_t ∈ {s_g, o_g}, g ∈ W′}   ▷ Emit paths that reach the target.
 8:   W ← W′
 9: end while
10: return Sort(P)                             ▷ Assume all provenance is preserved.
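A minimal in-memory sketch of Algorithm 21 in Python follows. The similar() predicate stands in for the fuzzy match (∼) and defaults to exact equality here, and, following the SQL implementations later in this chapter, matches are checked on every subject/object combination; all names and the tiny example knowledge base are illustrative assumptions.

from collections import namedtuple

Triple = namedtuple("Triple", ["s", "p", "o"])

def expand_paths(kb, source, target, max_depth=4, similar=lambda a, b: a == b):
    """Sketch of Algorithm 21 over a list of Triple(s, p, o) facts."""
    # Initial working set: triples whose subject or object matches the source.
    frontier = [(t, [t]) for t in kb if similar(t.s, source) or similar(t.o, source)]
    visited = {t for t, _ in frontier}
    results = []
    depth = 0
    while frontier and depth < max_depth:
        next_frontier = []
        for w, path in frontier:
            for g in kb:
                if g in visited:
                    continue  # remove all visited nodes
                if similar(w.s, g.s) or similar(w.s, g.o) or \
                   similar(w.o, g.s) or similar(w.o, g.o):
                    visited.add(g)
                    new_path = path + [g]
                    if similar(g.s, target) or similar(g.o, target):
                        results.append(new_path)  # emit paths that reach the target
                    next_frontier.append((g, new_path))
        frontier = next_frontier
        depth += 1
    return results  # ranking (the Sort step) is applied separately, as described below

# Toy usage with hypothetical facts:
kb = [Triple("pot", "is legal in", "Colorado"), Triple("Colorado", "borders", "Kansas")]
print(expand_paths(kb, "pot", "Kansas"))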

Algorithm 21 describes the method for constructing the knowledge base paths. Notice that in each expansion step, only the most recently expanded nodes are added to the working set. The algorithm ends when the working set is empty or upon reaching a

maximum depth (not shown). The variable P represents the linked sets of paths traversed by the algorithm. The final Sort step sorts the paths by their connection similarity.

Figure 6-1. An example of the increase in the number of facts and relations over several timestamps.

Ranking Fact Paths. The set of candidate facts should be ranked by how representative they are of the set of paths connecting the two entities. A quality path is trustworthy, timely, representative, and relevant. It is important that each fact included in a path is true: an untrue fact, or a fact with a low probability of being correct, renders the resulting path arbitrary. Each fact is associated with an extraction time, but the fact may have actually occurred at a much different time; the extraction time of a fact is distinct from the true time range the fact represents. Figure 6-1 shows the number of connections nodes have over time. We see that the number of nodes and the number of edges increase over time, and these nodes, which represent facts, also increase in complexity. With facts changing over time, the meaning and possibly the truthfulness of facts may change. Figure 6-2 shows the change of probabilities over time. As facts are added, probabilities change, so it is important to take the latest probability into consideration.

Figure 6-2. A sample of nodes and their changing probabilities over time. The figure is darkened to show the many overlapping lines.

A representative set of paths is one whose paths are distinctive within the set of all possible paths and also summarize the candidate paths. We compute what is essentially a TF-IDF score over the relations and entities mentioned in the candidate set. Lastly, the path should be relevant, meaning we favor the more popular entities and relations; obscure relations or entities may not be useful. For each path, a pair of global node/edge scores and local node/edge scores are computed. These scores boost the paths in the candidate set that are most representative of the candidate set of paths and minimally representative of the global set. To that end we define GlobalNode and LocalNode as follows:

GlobalNode(n, N) = \sum_{n \in N_{path}} \frac{|n \in N| - \min_f(N)}{\max_f(N) - \min_f(N) + \epsilon}

LocalNode(n, N_{path}) = \sum_{n \in N_{path}} \frac{|n \in N_{path}| - \min_f(N_{path})}{\max_f(N_{path}) - \min_f(N_{path}) + \epsilon}

where Npath is the set of nodes in the candidate paths and N is the global set of nodes. Similarly, the GlobalEdge and LocalEdge are defined as follows:

GlobalEdge(e, E) = \sum_{e \in E_{path}} \frac{|e \in E| - \min_f(E)}{\max_f(E) - \min_f(E) + \epsilon}

LocalEdge(e, E_{path}) = \sum_{e \in E_{path}} \frac{|e \in E_{path}| - \min_f(E_{path})}{\max_f(E_{path}) - \min_f(E_{path}) + \epsilon}

where E_path is the set of edges in the candidate paths, E is the global set of edges, and ε is a small constant that avoids division by zero. For each path, these values are combined with the path length and timeliness to produce a score. Additionally, a truthfulness score of the path is computed; this is the probability that each fact in the path is correct. If the probability of a fact cannot be computed in the knowledge base, it is assumed to be true. Each of these values is also normalized, but for simplicity that is not shown. The equation for SCORE is:

SCORE(path) = \sum_{n \in N_{path}} \big( LocalNode(n, N) + (1 - GlobalNode(n, N)) \big)
            + \sum_{e \in E_{path}} \big( LocalEdge(e, E) + (1 - GlobalEdge(e, E)) \big)
            + pathLength(path) + truthfulness(path) + timeliness(path)

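A sketch of this scoring in Python is shown below. The truthfulness and timeliness components are stubbed with neutral placeholder values and the normalization details are simplified, so the sketch illustrates only the representativeness terms of the SCORE equation above; all names are illustrative.

from collections import Counter

def normalized_freq(item, counts, eps=1e-6):
    """(frequency of item - min frequency) / (max - min + eps), as in the
    GlobalNode/LocalNode definitions above."""
    lo, hi = min(counts.values()), max(counts.values())
    return (counts[item] - lo) / (hi - lo + eps)

def score_path(path, candidate_paths, global_nodes, global_edges):
    """Sketch of SCORE(path). A path is a list of (s, p, o) triples;
    global_nodes and global_edges are Counters over the whole knowledge base;
    candidate_paths is the full candidate set returned for this query."""
    local_nodes = Counter(x for p in candidate_paths for (s, _, o) in p for x in (s, o))
    local_edges = Counter(r for p in candidate_paths for (_, r, _) in p)

    node_term = sum(normalized_freq(x, local_nodes) +
                    (1 - normalized_freq(x, global_nodes))
                    for (s, _, o) in path for x in (s, o))
    edge_term = sum(normalized_freq(r, local_edges) +
                    (1 - normalized_freq(r, global_edges))
                    for (_, r, _) in path)
    truthfulness = 1.0  # product of per-fact probabilities; unknown facts assumed true
    timeliness = 1.0    # placeholder for the recency component
    return node_term + edge_term + len(path) + truthfulness + timeliness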
6.4 Joint Inference of Path Probabilities

Discovering the probability of a grounded atom in a database is a well-studied problem. Tools such as Alchemy and GraphLab provide distributed and scalable implementations of inference for similar models [38]. The grounded atoms have weights associated with them (the Π column in the ProbKB database). To score the likelihood of the path query corresponding to the available facts in the query, we compute the joint probability.

For computational efficiency we assume that each fact in the path is independent. We can then compute the joint probability with the following formula:

P(X = x) = \frac{1}{Z} \exp\Big( \sum_i W_i n_i(x) \Big)

where x is the set of ground atoms, n_i(x) is the number of true groundings of the rule F_i in x, W_i is its weight, and Z is the normalization constant. Computing this value is computationally intractable, so we embed Gibbs sampling techniques in the database to compute the answers within the ProbKB framework.

6.4.1 Fuzzy Querying

In order to create paths between facts we look for techniques to find approximate matches to the set of facts. The possible number of paths is exponential, so much care must be taken when selecting paths. Graph databases can represent facts and materialize connections to make traversals of complex graphs efficient. Because graph databases assume materialized and exact graphs, they do not provide any special benefit for the traversal of fuzzy graphs. Search engines such as Lucene are optimized for the discovery of approximate results over text documents; however, Lucene is not a data storage system, so it would need to be paired with a storage system to perform approximate searches and traversals. PostgreSQL Full-Text Search (FTS) provides full-text search capabilities inside the PostgreSQL database management system. This system combines storage with fuzzy searching capabilities. Queries are logically performed one step at a time, which can make the physical query plans expensive. In this work, we use database techniques to find matches and the ranking utilities of the PostgreSQL FTS to assist in ranking the results.

6.4.2 PostgreSQL Fact Path Expansion Algorithm

Algorithm 22 describes the path ranking algorithm implemented in PostgreSQL. This algorithm issues one query per step. For each step, the algorithm checks for new nodes by looking for string overlap matches in the subject and object columns. Additionally, each step ensures that no cycle exists by ensuring any new vertex is distinct

in the current path (i.e., no loops). The results of each previous node are down-sampled to fewer than ten percent of the result nodes. For this implementation, the s and o columns are stored as PostgreSQL text-search values, that is, stemmed versions of the original strings, with matching query forms in the qs and qo columns. These extra columns make it possible to perform fuzzy matches at each step. The ts_rank function computes the compatibility between two values. For more information on the PostgreSQL full-text search capability we direct the reader to the PostgreSQL documentation.3 Moving forward, to make the implementation a little more tractable for a small number of hops we make three optimizations: (1) we may allow cycles in the intermediate sets, (2) only exact string matches may form connections, and (3) we can sample, or sort and rank, the candidate paths at each hop. Checking for cycles at each step is expensive but it does keep the size of the intermediate path sets low. Performing a fuzzy search in the form of a string token overlap is expensive and also results in noisy output. In practice, instead of performing fuzzy searches we can choose multiple start nodes, one for each canonical representation of the string. Lastly, at the cost of not receiving all possible candidates, we can downsample the intermediate nodes. Downsampling at each step reduces redundancy, especially during the later hops. Sorting and ranking the paths at each step is a recursive step in a greedy method for choosing the top-k paths at any point. Unfortunately, this process can also become expensive. Algorithm 23 describes the PostgreSQL recursive method for discovering paths when facts are described in a triple table. This method first uses the PostgreSQL FTS to find a set of starting nodes. Next, a set of candidate end nodes is also searched for in the triple table. Then, the system recursively looks for paths that are connected to the start nodes by either a matching subject (s) or object (o).

3 http://www.postgresql.org/docs/current/static/textsearch.html

Algorithm 22 PostgreSQL code for fact path expansion over two hops with fuzzy joins

-- triples(docid, s, p, o)

WITH start_nodes AS (
  SELECT t.docid, 0 AS level, t.docid::text AS path, t.docid AS endid,
         s, o, qs, qo, ARRAY[docid] AS apath,
         '<' || subject || '|' || predicate || '|' || object || '>' AS statement,
         ts_rank(s, plainto_tsquery('marijuana'))
           + ts_rank(o, plainto_tsquery('marijuana')) AS rank
    FROM triples t
   WHERE (t.s @@ plainto_tsquery('marijuana') OR t.o @@ plainto_tsquery('marijuana'))
),
onehop AS (
  SELECT p1.docid AS docid, p1.level + 1 AS level,
         p1.path || ',' || t.docid::text AS path, t.docid AS endid,
         t.s, t.o, t.qs, t.qo,
         array_append(p1.apath, t.docid) AS apath,
         statement || '-->' || '<' || subject || '|' || predicate || '|' || object || '>' AS statement,
         p1.rank + ts_rank(t.s, p1.qo) + ts_rank(t.o, p1.qs)
                 + ts_rank(t.o, p1.qo) + ts_rank(t.s, p1.qs) AS rank
    FROM triples AS t,
         (SELECT docid, level, path, endid, s, o, qs, qo, apath, statement, rank
            FROM start_nodes WHERE random() < 0.1) AS p1
   WHERE p1.docid < t.docid
     AND (p1.s = t.s OR p1.s = t.o OR p1.o = t.s OR p1.o = t.o)
     AND NOT p1.apath @> ARRAY[t.docid]
),
twohop AS (
  SELECT p1.docid AS docid, p1.level + 1 AS level,
         p1.path || ',' || t.docid::text AS path, t.docid AS endid,
         t.s, t.o, t.qs, t.qo,
         array_append(p1.apath, t.docid) AS apath,
         statement || '-->' || '<' || subject || '|' || predicate || '|' || object || '>' AS statement,
         p1.rank + ts_rank(t.s, p1.qo) + ts_rank(t.o, p1.qs)
                 + ts_rank(t.o, p1.qo) + ts_rank(t.s, p1.qs) AS rank
    FROM triples AS t,
         (SELECT docid, level, path, endid, s, o, qs, qo, apath, statement, rank
            FROM onehop WHERE random() < 0.1) AS p1
   WHERE p1.docid < t.docid
     AND (p1.s = t.s OR p1.s = t.o OR p1.o = t.s OR p1.o = t.o)
     AND NOT p1.apath @> ARRAY[t.docid]
)
SELECT * FROM twohop;

After a fixed number of iterations, the final paths are joined with the end nodes, and only paths that can join with the end nodes are preserved.
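Algorithm 22 above and Algorithm 23 below assume that the triples table already carries text-search columns: s and o as stemmed tsvector values and qs and qo as the corresponding tsquery values. That preparation is not shown in this chapter; the following sketch is one possible way to do it, where the plain-text subject/predicate/object columns, the connection settings, the 'english' configuration, and the index names are all assumptions.

import psycopg2

conn = psycopg2.connect(dbname="probkb", user="postgres")  # hypothetical settings
cur = conn.cursor()

# Assumed base schema: triples(docid, subject, predicate, object).
cur.execute("""
    ALTER TABLE triples
      ADD COLUMN s  tsvector, ADD COLUMN o  tsvector,
      ADD COLUMN qs tsquery,  ADD COLUMN qo tsquery
""")

# Populate the stemmed search vectors and their matching query forms.
cur.execute("""
    UPDATE triples
       SET s  = to_tsvector('english', subject),
           o  = to_tsvector('english', object),
           qs = plainto_tsquery('english', subject),
           qo = plainto_tsquery('english', object)
""")

# GIN indexes speed up the @@ matches used to find the start and end nodes.
cur.execute("CREATE INDEX triples_s_idx ON triples USING gin(s)")
cur.execute("CREATE INDEX triples_o_idx ON triples USING gin(o)")
conn.commit()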

Algorithm 23 PostgreSQL code for recursive fact path expansion with fuzzy search for a ‘pot’ starting entity and a ‘fox news’ end entity

-- triples(docid, s, p, o)

WITH RECURSIVE start_nodes AS (
  SELECT docid, 0, docid::text
    FROM triple t
   WHERE t.s @@ plainto_tsquery('pot') OR t.o @@ plainto_tsquery('pot')
),
end_nodes (term, level, path) AS (
  SELECT docid, 0, docid::text
    FROM triple t
   WHERE t.s @@ plainto_tsquery('fox news') OR t.o @@ plainto_tsquery('fox news')
),
paths (term, level, path) AS (
  SELECT * FROM start_nodes
  UNION ALL
  SELECT t1.docid, level + 1, p.path::text || ',' || t1.docid::text
    FROM paths AS p, triple AS t1, triple AS t2
   WHERE t1 <> t2
     AND t2.docid = p.term
     AND (t2.qs @@ t1.s OR t2.qs @@ t1.o OR t2.qo @@ t1.s OR t2.qo @@ t1.o)
)
SELECT term, level, path
  FROM paths p
 WHERE level < 4
   AND p.term IN (SELECT term FROM end_nodes);

Alternatively, the triple data can be stored in an adjacency-list format. That is, all nodes are stored in a 'nodes' table and all edges are stored in an 'edges' table. With this node-list representation it is straightforward to express a query as a self-join over the edge table. Algorithm 24 describes a non-recursive version of fact path expansion over the adjacency-list representation; the query walks a path involving seven facts with the start and end facts being specified.
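Algorithm 24 below assumes that the nodes and edges tables already exist. One way they could be derived from a plain-text triples table is sketched here; the column names subject/predicate/object, the serial key, and the connection settings are assumptions.

import psycopg2

conn = psycopg2.connect(dbname="probkb", user="postgres")  # hypothetical settings
cur = conn.cursor()

# Build the node list: one row per distinct subject or object term.
cur.execute("CREATE TABLE nodes (id serial PRIMARY KEY, term text UNIQUE)")
cur.execute("""
    INSERT INTO nodes (term)
    SELECT subject FROM triples
    UNION
    SELECT object FROM triples
""")

# Build the edge list by resolving each triple's endpoints to node ids.
cur.execute("""
    CREATE TABLE edges (src int REFERENCES nodes(id),
                        dst int REFERENCES nodes(id),
                        edge text)
""")
cur.execute("""
    INSERT INTO edges (src, dst, edge)
    SELECT ns.id, nd.id, t.predicate
      FROM triples t
      JOIN nodes ns ON ns.term = t.subject
      JOIN nodes nd ON nd.term = t.object
""")
conn.commit()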

Algorithm 24 PostgreSQL code for fact path expansion using database self joins over an edge table and ‘pot’ as starting entity and a ‘Fox News’ target entity

-- nodes(id, term)
-- edges(src, dst, edge)

CREATE OR REPLACE FUNCTION getFact(src int, dst int) RETURNS TEXT AS $$
  SELECT '<' || nsrc.term || ',' || e.edge || ',' || ndst.term || '>'
    FROM edges e, nodes nsrc, nodes ndst
   WHERE e.src = $1 AND e.dst = $2
     AND nsrc.id = $1 AND ndst.id = $2
$$ LANGUAGE SQL IMMUTABLE;

SELECT getFact(g1.src, g1.dst), getFact(g1.dst, g2.src),
       getFact(g2.src, g2.dst), getFact(g2.dst, g3.src),
       getFact(g3.src, g3.dst), getFact(g3.dst, g4.src),
       getFact(g4.src, g4.dst)
  FROM edges g1, edges g2, edges g3, edges g4
 WHERE g1.src = (SELECT id FROM nodes WHERE term = 'pot')
   AND g1.dst = g2.src
   AND g2.dst = g3.src
   AND g3.dst = g4.src
   AND g4.dst = (SELECT id FROM nodes WHERE term = 'Fox News');

6.4.3 Graph Database Query

With the assumption of a linked graph, a graph database is a new option for searching paths. The Titan Graph Database4 is a distributed graph database that can support billions of edges and vertices across machines. It is also a transactional database, so information can be stored and queried by many users. Titan is packaged with a query language called Gremlin that allows path and subgraph queries to be easily expressed (when compared to SQL). The Titan graph database physically represents graphs in a manner similar to the node-list format described above. Nodes are stored in a list sorted by identifier (id), although another index may be used to obtain a different sort order. Each node links to a sorted set of edges and properties. Keeping the edges sorted requires some maintenance, but there is a good trade-off with query performance. Each edge contains the label identifier and a bit for the direction of the label, followed by the sort key, the adjacent node identifier, the edge identifier, and all other properties. Algorithm 25 shows a k-hop path query in Gremlin. Each path traversal that needs to be performed is expressed in the Gremlin Java or Groovy language API, and an optimizer performs optimizations to improve the traversal. The algorithm first searches for all the nodes in the graph whose term label equals the source term. The algorithm then walks the graph to find all the outgoing edges followed by all the incoming vertices. This process is repeated max_path times. The final candidate paths are filtered to contain only those whose final node term equals the destination term. It is also noted that Cypher is an SQL-like declarative language that would allow paths to be expressed more declaratively over a Titan graph. Cypher can be used over arbitrary triples in much the same way SPARQL is used over RDF graphs. The Gremlin language is simpler to set up and is sufficient to perform all of our path queries.

4 https://github.com/thinkaurelius/titan

Algorithm 25 Gremlin code to find all the paths that start at a vertex named src, end at a vertex named dst, and are shorter than max_path

def khopVertices = g.V('term', src)       // Get the starting nodes
    .outE.inV.random(sample)              // Down-sample the current paths
    .loop(max_path){it.loops < max_path}
    .filter{dst == it.term}
    .simplePath                           // Remove cycles

6.4.4 Fact Path Expansion Complexity

Deciding whether a simple path with at least a given number of edges exists between two nodes is an NP-complete problem [24]. The problem can be verified in polynomial time with a non-deterministic Turing machine, and a known NP-complete problem, the Hamiltonian cycle problem, can be reduced to it. A search for a Hamiltonian cycle asks "Is there a simple cycle in the graph that visits every vertex?" The fact path expansion problem has added complexity in that a search is performed at each hop. All the paths of the graph are not known a priori; this is a weighted simple path problem. The node-list representation has a slightly higher space complexity than the triple method. The space complexity of the triple method is O(E), where E is the number of edges. The space complexity of the node-list method is O(|V| + E), where |V| is the number of nodes and E is the number of edges. The time complexity of fact path expansion is the same as that of a depth-first search algorithm, O(b^k), where b is the branching factor of the graph and k is the number of hops; for instance, with an average branching factor of b = 50 and k = 3 hops, the search may touch on the order of 50^3 = 125,000 paths. The sort step of the algorithm is O(p · log p), where p is the number of paths returned. Prior work by Rubin describes the use of Warshall's theorem to enumerate all the simple paths in a graph represented as a matrix [80]. The proposed method has a complexity of O(N^3), where N is the number of vertices. However, such techniques assume that the graph has no self-loops and no multiple edges; this is an assumption that we are

not able to make. Additionally, the large size of the knowledge graphs makes the all-pairs approach unsuitable.

Table 6-1. The frequency of each term in our cleaned Reverb data set
  Term        Frequency
  Biden          931
  Brutality       41
  Fox News       574
  Marijuana      752
  Reddit          95

6.5 Fact Path Expansion Experiments

In this section, we perform experiments to better understand the trends and performance of fact path expansion in both graph and relational databases. We first describe the data set we use and the queries performed over it. We then discuss the timing experiments over both systems. In Table 6-1 we list the frequencies of terms used in the experiments. Note that Biden has the most mentions in our set while the term Brutality has the fewest. We selected words that may be of interest to people observing conversations about elections.

Runtime of Methods. Using full-text search techniques to link knowledge base elements can effectively explore the space of possible links. Exact matching, while significantly quicker than the full-text search approach, lowers the recall of the exploration. The choice between the two methods depends on how quickly a user would like results.

Figure 6-3. Fact Path Expansion queries over the Titan Graph DB.

Figures 6-3 and 6-4 show the runtimes of the Fact Path Expansion algorithms. The Java Virtual Machine (JVM) is set to allocate a maximum of 8 GB to ensure the full graph fits in memory. PostgreSQL is also given 8 GB of working memory. The performance of the queries in the graph database reflects its time complexity: there is an exponential increase in run time for each hop because of the large branching factor. Titan is not able to recognize or cache paths, so there is no advantage for repeated or incremental path queries.

Figure 6-4. Fact Path Expansion queries over PostgreSQL

Figure 6-4 shows the runtime of the Fact Path Expansion algorithm in the relational database. The effective cache size of the database is 8 GB, so the graph is able to fit in memory. With native SQL queries, PostgreSQL is able to aggressively perform

each join and construct the path. When the same query is subsequently run for a longer path there is a slight decrease in run time because PostgreSQL is able to use the previous query as a partial result. The database results are significantly better for queries of more than two hops. A one-hop query in the graph database is slightly faster than in the relational database.

Figure 6-5. PostgreSQL results of Fact Path Expansion queries with reset database cache

Figure 6-5 shows the same experiment as Figure 6-4 but with the cache reset after each run. The performance decrease at hop 4 is much smaller. We overlay the aggregate performance of the graph database experiment and the relational database without the cache in Figure 6-6. We see that the mature PostgreSQL database is able to aggressively optimize longer-running queries.

Figure 6-6. Comparison of the experiments with Titan DB and PostgreSQL without cache

6.6 Fact Path Expansion Summary

In this chapter, we describe the theory and implementation of a version of the NP-complete minimum path problem and its extension over knowledge bases. We showed an implementation inside the PostgreSQL relational database and another inside the Titan graph database. We also described the parameters used in ranking the resulting paths, and we showed timing results and example paths. This work will be continued with further and deeper analysis, including instrumentation of the Titan database cache and indexes. The behavior of the PostgreSQL DBMS is well understood, but not in comparison to graph databases. We will also perform the same experiments over other graph and relational databases. A particularly interesting direction is query optimization across recursive calls in different database types.

CHAPTER 7
CONCLUSIONS

This dissertation focuses on research demonstrating the query-driven text analytics paradigm. It shows general contributions to the database community through three projects. First, it shows how statistical text analytics can be performed in the database, improving analytic work flows; this group of work was one of the first to perform in-database text analytics. Second, this work changes a popular algorithm, entity resolution, to make it aware of the queries, thereby improving computation time; the results showed orders of magnitude improvement over the baseline. Finally, I present an algorithm for extracting the most representative facts connecting two knowledge base entities. This fact path expansion method ranks knowledge base paths by truthfulness, relevance, timeliness, and representativeness. The query-driven techniques can be applied to more interactive scenarios. For example, query-driven techniques can be used to develop infrastructure to support multi-user problem solving. These next-generation interactive systems shall (1) allow users to examine the progress of the algorithms at any point in the life of the application; (2) allow users to intervene in order to improve or redirect the algorithm using low-latency interactions; and (3) allow multiple users to be added in order to concurrently monitor the progress of algorithms. Each of these requirements advocates for the adoption of query-driven techniques. This dissertation defines query-driven techniques for text analytics and lays a foundation for user-focused, data-centric research with critical user requirements.

125 REFERENCES [1] L. Aguiar and J. A. Friedman. Predictive coding: What it is and what you need to know about it. http://newsandinsight.thomsonreuters.com/Legal/Insight/2013. [2] H. Altwaijry, D. V. Kalashnikov, and S. Mehrotra. Query-driven approach to entity resolution. Proceedings of the VLDB Endowment, 6(14), 2013. [3] A. Arasu, R. Christopher, and D. Suciu. Large-scale deduplication with constraints using dedupalog. In International Conference on Data Engineering, pages 952–963. IEEE, 2009. [4] S. Arumugam, A. Dobra, C. M. Jermaine, N. Pansare, and L. Perez. The datapath system: A data-centric analytic processing engine for large data warehouses. In Proc. of the 2010 ACM SIGMOD, pages 519–530, NY, USA, 2010. ACM. [5] R. Avnur and J. M. Hellerstein. Eddies: Continuously adaptive query processing. In ACM SIGMoD Record, volume 29, pages 261–272. ACM, 2000. [6] C. Badica, A. Badica, and E. Popescu. A new path generalization algorithm for html wrapper induction. In Advances in Web Intelligence and Data Mining, pages 11–20. Springer, 2006. [7] A. Bagga and B. Baldwin. Entity-based cross-document coreferencing using the vector space model. In 17th ACL, pages 79–85. ACL, 1998. [8] K. Barrett, B. Cassels, P. Haahr, D. A. Moon, K. Playford, and P. T. Withington. A monotonic superclass linearization for dylan. In OOPSLA, pages 69–82, 1996. [9] K. Bellare, C. Curino, A. Machanavajihala, P. Mika, M. Rahurkar, and A. Sane. Woo: A scalable and multi-tenant platform for continuous knowledge base synthesis. Proc. VLDB Endow., 6(11):1114–1125, Aug. 2013. [10] I. Bhattacharya and L. Getoor. Collective entity resolution in relational data. ACM Trans. KDD, 1(1), Mar. 2007. [11] I. Bhattacharya, L. Getoor, and L. Licamele. Query-time entity resolution. In Proc.12th ACM SIGKDD, KDD ’06, pages 529–534, NY, USA, 2006. [12] S. Bird, E. Loper, and E. Klein. Natural language processing with python. In Natural Language Processing with Python. O’Reilly Media Inc, 2009. [13] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. Dbpedia - a crystallization point for the web of data. Web Se- mantics: Science, Services and Agents on the WWW, 7(3):154–165, September 2009. [14] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, Mar. 2003.

126 [15] P. Bohannon, S. Merugu, C. Yu, V. Agarwal, P. DeRose, A. Iyer, A. Jain, V. Kakade, M. Muralidharan, R. Ramakrishnan, and W. Shen. Purple sox extraction management system. SIGMOD Rec., 37(4):21–27, Mar. 2009. [16] J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin. Parallel coordinate descent for l1-regularized loss minimization. In L. Getoor and T. Scheffer, editors, ICML, pages 321–328. Omnipress, 2011. [17] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh International Conference on 7, WWW7, pages 107–117, Amsterdam, The Netherlands, The Netherlands, 1998. Elsevier Science Publishers B. V. [18] M. Br¨ocheler, L. Mihalkova, and L. Getoor. Probabilistic similarity logic. In P. Gr¨unwald and P. Spirtes, editors, UAI, pages 73–82. AUAI Press, 2010. [19] G. Casella and E. I. George. Explaining the gibbs sampler. The American Statisti- cian, 46(3):167–174, 1992. [20] A. Chechetka and C. Guestrin. Focused belief propagation for query-specific inference. In AISTATS, May 2010. [21] Y. Chen and D. Z. Wang. Knowledge expansion over probabilistic knowledge bases. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD ’14, pages 649–660, New York, NY, USA, 2014. ACM. [22] S. Chib and E. Greenberg. Understanding the metropolis-hastings algorithm. The American Statistician, 49(4):327–335, 1995. [23] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton. Mad skills: new analysis practices for big data. Proc. VLDB Endow., 2(2):1481–1492, Aug. 2009. [24] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algo- rithms. MIT press, 2009. [25] M. Cowles and B. Carlin. Markov chain monte carlo convergence diagnostics: a comparative review. Journal of AmStat, 91(434):883–904, 1996. [26] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, N. Aswani, I. Roberts, G. Gorrell, A. Funk, A. Roberts, D. Damljanovic, T. Heitz, M. A. Greenwood, H. Saggion, J. Petrak, Y. Li, and W. Peters. Text Processing with GATE (Version 6). 2011. [27] J. Dalton, J. R. Frank, E. Gabrilovich, M. Ringgaard, and A. Subramanya. Fakba1: Freebase annotation of trec kba stream corpus, version 1 (release date 2015-01-26, format version 1, correction level 0), January 2015.

127 [28] A. Das Sarma, A. Jain, A. Machanavajjhala, and P. Bohannon. An automatic blocking mechanism for large-scale de-duplication tasks. In Proceedings of the 21st ACM CIKM, pages 1055–1064. ACM, 2012. [29] X. Dong, A. Halevy, and J. Madhavan. Reference reconciliation in complex information spaces. In Proceedings of the 2005 ACM SIGMOD international conference on Management of data, SIGMOD ’05, pages 85–96, New York, NY, USA, 2005. ACM. [30] H. Elmeleegy, J. Madhavan, and A. Halevy. Harvesting relational tables from lists on the web. Proc. VLDB Endow., 2(1):1078–1089, 2009. [31] X. Feng, A. Kumar, B. Recht, and C. R´e. Towards a unified architecture for in-rdbms analytics. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD ’12, pages 325–336, New York, NY, USA, 2012. ACM. [32] P. Flajolet, E.´ Fusy, O. Gandouet, and F. Meunier. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. DMTCS Proceedings, 0(1), 2008. [33] T. A. S. Foundation. Apache solr. http://lucene.apache.org/solr. [34] J. R. Frank, S. J. Bauer, M. Kleiman-Weiner, D. A. Roberts, N. Tripuraneni, C. Zhang, C. Re, E. Voorhees, and I. Soboroff. Evaluating stream filtering for entity profile updates for trec 2013 (kba track overview). Technical report, DTIC Document, 2013. [35] J. Gantz and D. Reinsel. The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east. IDC iView: IDC Analyze the Future, 2012. [36] I. Getoor. A latent dirichlet model for unsupervised entity resolution. In Proceedings of the 6th SIAM International Conference on Data Mining, volume 124, page 47. Society for Industrial Mathematics, 2006. [37] L. Getoor and A. Machanavajjhala. Entity resolution: Theory, practice & open challenges. In Proceedings of the 38rd VLDB, VLDB ’12. VLDB Endowment, 2012. [38] J. E. Gonzalez, Y. Low, C. Guestrin, and D. O’Hallaron. Distributed parallel inference on large factor graphs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 203–212. AUAI Press, 2009. [39] D. Graff. Ldc2007t07: English gigaword corpus, 2007. [40] C. Grant, C. P. George, J.-d. Gumbs, J. N. Wilson, and P. J. Dobbins. Morpheus: A deep web question answering system. In Proceedings of the 12th International Conference on Information Integration and Web-based Applications & Services, iiWAS ’10, pages 841–844, New York, NY, USA, 2010. ACM.

128 [41] C. E. Grant, J.-d. Gumbs, K. Li, D. Z. Wang, and G. Chitouras. Madden: query-driven statistical text analytics. In Proceedings of the 21st ACM interna- tional conference on Information and , CIKM ’12, pages 2740–2742, New York, NY, USA, 2012. ACM. [42] L. Gravano, P. Ipeirotis, H. Jagadish, N. Koudas, S. Muthukrishnan, L. Pietarinen, and D. Srivastava. Using q-grams in a dbms for approximate string processing. IEEE Data Engineering Bulletin, 24(4):28–34, 2001. [43] B. Green, A. Wolf, C. Chomsky, and K. Laughery. Baseball: an automatic question answerer. In Proc of the Western Joint Computer Conference, volume 19, pages 219–224, San Francisco, CA, USA, 1961. Morgan Kaufmann Publishers Inc. [44] J. M. Hellerstein, C. R´e,F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, and A. Kumar. The madlib analytics library: or mad skills, the sql. Proceedings of the VLDB Endowment, 5(12):1700–1711, Aug. 2012. [45] J. M. Hellerstein, C. R´e,F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, and A. Kumar. The madlib analytics library: or mad skills, the sql. Proc. VLDB Endow., 5(12):1700–1711, Aug. 2012. [46] A. Jain, P. Ipeirotis, and L. Gravano. Building query optimizers for information extraction: the sqout project. SIGMOD Rec., 37:28–34, March 2009. [47] A. Jain and P. Pantel. Factrank: Random walks on a web of facts. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING ’10, pages 501–509, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. [48] R. Jampani, F. Xu, M. Wu, L. L. Perez, C. Jermaine, and P. J. Haas. Mcdb: A monte carlo approach to managing uncertain data. In Proc. of the 2008 ACM SIGMOD, pages 687–700, NY, USA, 2008. ACM. [49] D. V. Kalashnikov and S. Mehrotra. Domain-independent data cleaning via analysis of entity-relationship graph. ACM Trans. Database Syst., 31(2):716–767, June 2006. [50] R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba, R. Urtasun, and S. Fidler. Skip-thought vectors. arXiv preprint arXiv:1506.06726, 2015. [51] D. E. Knuth. Ancient babylonian algorithms. Commun. ACM, 15(7):671–677, July 1972. [52] S. Kok, P. Singla, M. Richardson, P. Domingos, M. Sumner, H. Poon, and D. Lowd. The alchemy system for statistical relational ai. University of Washington, Seattle, 2005. [53] D. Koller and N. Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.

129 [54] N. Lao and W. W. Cohen. Relational retrieval using a combination of path-constrained random walks. Machine learning, 81(1):53–67, 2010. [55] N. Lao, T. Mitchell, and W. W. Cohen. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 529–539. Association for Computational Linguistics, 2011. [56] G. Lee, J. Lin, C. Liu, A. Lorek, and D. Ryaboy. The unified logging infrastructure for data analytics at twitter. Proc. VLDB Endow., 5(12):1771–1780, Aug. 2012. [57] K. Li, C. Grant, D. Z. Wang, S. Khatri, and G. Chitouras. Gptext: Greenplum parallel statistical text analysis framework. In Proceedings of the Second Workshop on Data Analytics in the Cloud, DanaC ’13, pages 31–35, New York, NY, USA, 2013. ACM. [58] K. Li, C. Grant, D. Z. Wang, S. Khatri, and G. Chitouras. Gptext: Greenplum parallel statistical text analysis framework. In Proceedings of the Second Workshop on Data Analytics in the Cloud, pages 31–35. ACM, 2013. [59] X. Li, P. Morie, and D. Roth. Identification and tracing of ambiguous names: Discriminative and generative approaches. In Proceedings of the National Conference on Artificial Intelligence, pages 419–424. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, 2004. [60] Y. Li, F. R. Reiss, and L. Chiticariu. Systemt: a declarative information extraction system. In Proceedings of the 49th Annual Meeting of the Association for Compu- tational Linguistics: Human Language Technologies: Systems Demonstrations, HLT ’11, pages 109–114, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics. [61] P. Liang, M. I. Jordan, and D. Klein. Type-based mcmc. In Human Language Technologies: The 2010 NAACL, HLT ’10, pages 573–581, Stroudsburg, PA, USA, 2010. ACL. [62] J. Madhavan, S. Jeffery, S. Cohen, X. Dong, D. Ko, C. Yu, and A. Halevy. Web-scale : You can only afford to pay as you go. In Proceedings of CIDR, pages 342–350, 2007. [63] C. D. Manning. Part-of-speech tagging from 97% to 100%: is it time for some linguistics? In Proceedings of the 12th international conference on Computational linguistics and intelligent text processing - Volume Part I, CICLing’11, pages 171–189, Berlin, Heidelberg, 2011. Springer-Verlag. [64] A. McCallum, K. Nigam, and L. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Proc. of the 6th SIGKDD, pages 169–178, 2000.

130 [65] A. McCallum, K. Schultz, and S. Singh. FACTORIE: Probabilistic programming via imperatively defined factor graphs. In NIPS, pages 1426–1427, 2009. [66] A. Mccallum and B. Wellner. Toward conditional models of identity uncertainty with application to proper noun coreference. In In NIPS, pages 905–912. MIT Press, 2003. [67] A. Mccallum and B. Wellner. Conditional Models of Identity Uncertainty with Application to Noun Coreference. In NIPS, 2004. [68] J. Morales and J. Nocedal. Automatic preconditioning by limited memory quasi-newton updating. SIAM Journal on Optimization, 10(4):1079–1096, 2000. [69] H. Nguyen, T. Nguyen, and J. Freire. Learning to extract form labels. Proc. VLDB Endow., 1(1):684–694, 2008. [70] F. Niu, C. R´e,A. Doan, and J. Shavlik. Tuffy: Scaling up statistical inference in markov logic networks using an rdbms. Proceedings of the VLDB Endowment, 4(6):373–384, 2011. [71] B. O’Connor, R. Balasubramanyan, B. Routledge, and N. Smith. From tweets to polls: Linking text sentiment to public opinion time series. In Proc. AAAI Conf. on Weblogs and Social Media, pages 122–129, 2010. [72] K. Olsen and A. Malizia. Following virtual trails. Potentials, IEEE, 29(1):24 –28, jan.-feb. 2010. [73] M. Pamuk and M. Stonebraker. Transformscout : finding compositions of transformations for software re-use. Master’s thesis, MIT, 2007. [74] H. Pasula, B. Marthi, B. Milch, S. J. Russell, and I. Shpitser. Identity Uncertainty and Citation Matching. In Neural Information Processing Systems, pages 1401–1408, 2002. [75] J. Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988. [76] H. Phan and M. Le Nguyen. Flexcrfs: Flexible conditional random fields, 2004. [77] H. Poon and P. Domingos. Sound and efficient inference with probabilistic and deterministic dependencies. In AAAI, volume 6, pages 458–463, 2006. [78] D. Rao, P. McNamee, and M. Dredze. Streaming cross document entity coreference resolution. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 1050–1058. Association for Computational Linguistics, 2010. [79] M. Richardson and P. Domingos. Markov logic networks. Machine learning, 62(1-2):107–136, 2006.

131 [80] F. Rubin. Enumerating all simple paths in a graph. Circuits and Systems, IEEE Transactions on, 25(8):641–642, 1978. [81] F. Rusu and A. Dobra. Glade: a scalable framework for efficient analytics. SIGOPS Oper. Syst. Rev., 46(1):12–18, Feb. 2012. [82] D. Sculley and C. E. Brodley. Compression and machine learning: A new perspective on feature space vectors. In Data Compression Conference, 2006. DCC 2006. Proceedings, pages 332–341. IEEE, 2006. [83] W. Shen, X. Li, and A. Doan. Constraint-based entity matching. In Proceedings of the National Conference on Artificial Intelligence, volume 20, page 862. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, 2005. [84] L. Shu, A. Chen, M. Xiong, and W. Meng. Efficient spectral neighborhood blocking for entity resolution. In 2011 IEEE 27th ICDE, pages 1067 –1078, april 2011. [85] P. Simon. Too Big to Ignore: The Business Case for Big Data. Wiley. com, 2013. [86] S. Singh, A. Subramanya, F. Pereira, and A. McCallum. Large-scale cross-document coreference using distributed inference and hierarchical models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 793–803. Association for Computational Linguistics, 2011. [87] S. Singh, A. Subramanya, F. Pereira, and A. McCallum. Wikilinks: A large-scale cross-document coreference corpus labeled via links to Wikipedia. Technical Report UM-CS-2012-015, 2012. [88] S. Singh, M. Wick, and A. McCallum. Monte carlo mcmc: efficient inference by approximate sampling. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1104–1113. Association for Computational Linguistics, 2012. [89] P. Singla and P. Domingos. Entity Resolution with Markov Logic. In IEEE International Conference on Data Mining, pages 572–582. IEEE, 2006. [90] R. M. Smullyan. First-order logic, volume 21968. Springer, 1968. [91] F. M. Suchanek. Automated Construction and Growth of a Large Ontology. PhD thesis, Saarland University, 2009. [92] J. Teevan, E. Adar, R. Jones, and M. A. S. Potts. Information re-retrieval: repeat queries in yahoo’s logs. In SIGIR ’07: Proc of the 30th annual international ACM SIGIR conference on R and D in information retrieval, pages 151–158, New York, NY, USA, 2007. ACM. [93] M. D. Vose. A linear algorithm for generating random numbers with a given distribution. IEEE Trans. Softw. Eng., 17(9):972–975, Sept. 1991.

132 [94] D. Wang, M. Franklin, M. Garofalakis, J. Hellerstein, and M. Wick. Hybrid in-database inference for declarative information extraction. In Proc. SIGMOD, pages 517–528. ACM, 2011. [95] D. Z. Wang, E. Michelakis, M. Garofalakis, and J. M. Hellerstein. Bayesstore: managing large, uncertain data repositories with probabilistic graphical models. Proc. VLDB Endow., 1:340–351, August 2008. [96] S. E. Whang, D. Menestrina, G. Koutrika, M. Theobald, and H. Garcia-Molina. Entity resolution with iterative blocking. In 2009 ACM SIGMOD, pages 219–232. ACM, 2009. [97] M. Wick, A. McCallum, and G. Miklau. Scalable probabilistic databases with factor graphs and mcmc. Proc. VLDB Endow., 3(1-2):794–804, Sept. 2010. [98] M. Wick, S. Singh, and A. McCallum. A discriminative hierarchical model for fast coreference at large scale. In Proceedings of the 50th ACL, ACL ’12, pages 379–388, 2012. [99] M. Wick, S. Singh, and A. McCallum. A discriminative hierarchical model for fast coreference at large scale. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL ’12, pages 379–388, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics. [100] M. L. Wick and A. McCallum. Query-aware mcmc. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, editors, Advances in NIPS 24, pages 2564–2572, 2011. [101] M. L. Wick, K. Rohanimanesh, K. Bellare, A. Culotta, A. McCallum, and A. McCallum. Sample rank: Training factor graphs with atomic gradients. In ICML, pages 777–784, 2011. [102] W. Woods. Progress in natural language understanding - an application to lunar geology. In American Federation of Information Processing Societies (AFIPS) Conference Proc, 42, pages 441–450, 1973. [103] L. Zhang, R. Ghosh, M. Dekhil, M. Hsu, and B. Liu. Combining lexicon-based and learning-based methods for twitter sentiment analysis. 2011.

BIOGRAPHICAL SKETCH
Christan Grant completed his Bachelor of Science, Master of Science, and Ph.D. in computer science at the University of Florida. His research interests involve novel methods for answering difficult questions, ranging from the addition of natural language processing within relational databases to probabilistic knowledge base assisted question answering systems. He has worked on developing a "query-driven" paradigm for text analytics. He is a recipient of the National Science Foundation Graduate Research Fellowship award in the area of "Database Information Retrieval and Web Search". He was also awarded the Florida Georgia LSAMP Bridge to Doctorate Fellowship and a diversity award from the College of Engineering. Christan has served as an external reviewer for the ACM SIGMOD, VLDB, ACM CIKM, and IEEE ICDE conferences. He is also on the program committee for Broadening Participation in Data Mining. He holds several publications and also a patent from an internship at IBM Almaden Research Center.
