Automatic Knowledge Base Construction Using Probabilistic Extraction, Deductive Reasoning, and Human Feedback
Daisy Zhe Wang, Yang Chen, Sean Goldberg, Christan Grant, Kun Li
Department of Computer and Information Science and Engineering, University of Florida
{daisyw,yang,sean,cgrant,kli}@cise.ufl.edu

Abstract

We envision an automatic knowledge base construction system consisting of three interrelated components. MADDEN is a knowledge extraction system applying statistical text analysis methods over database systems (DBMS) and massive parallel processing (MPP) frameworks; PROBKB performs probabilistic reasoning over the extracted knowledge to derive additional facts not existing in the original text corpus; CAMEL leverages human intelligence to reduce the uncertainty resulting from both the information extraction and probabilistic reasoning processes.

1 Introduction

In order to build a better search engine that performs semantic search in addition to keyword matching, a knowledge base containing information about all the entities and relationships on the web and beyond is needed. With recent advances in technologies such as cloud computing and statistical machine learning (SML), automatic knowledge base construction is becoming possible and is attracting increasing interest from researchers. We envision an automatic knowledge base (KB) construction system that includes three components: probabilistic extraction, deductive reasoning, and human feedback.

Much research has been conducted on web-scale text analysis and extraction using SML models and algorithms. We built our parallelized text analysis library MADDEN on top of relational database systems and MPP frameworks to achieve efficiency and scalability.

The automatically extracted information contains errors, uncertainties, and probabilities. We use a probabilistic database to preserve uncertainty in the data representation and to propagate probabilities through query processing.

Further, not all information can be extracted from the Web (Schoenmackers et al., 2008). A probabilistic deductive reasoning system is needed to infer additional facts from the existing facts and rules extracted by MADDEN.

Finally, since SML methods are not perfect, we propose to use human feedback to improve the quality of the machine-generated knowledge base. Crowdsourcing is one way to collect this feedback; though much slower, it is often more accurate than state-of-the-art SML algorithms.

2 System Overview

Our vision of the automatic knowledge base construction process consists of three main components, as shown in Figure 1.

[Figure 1: Architecture for Automatic Knowledge Base Construction. MADDEN, PROBKB, and CAMEL are layered over a probabilistic database on top of relational DBMSs (e.g., PostgreSQL) and MPP frameworks (e.g., Greenplum), supporting tasks such as POS tagging, IE, ER, and sentiment analysis for application domains including computational journalism and e-discovery.]

The first component is a knowledge extraction system called MADDEN that sits on top of a probabilistic database system such as BAYESSTORE or PrDB and treats probabilistic data, statistical models, and algorithms as first-class citizens (Wang et al., 2008; Sen et al., 2009). MADDEN specifically implements SML models and algorithms on database systems (e.g., PostgreSQL) and massive parallel processing (MPP) frameworks (e.g., Greenplum) to extract various types of information from a text corpus, including entities, relations, and rules. Different types of information are extracted by different text analysis tasks. For example, the named entity recognition (NER) task extracts different types of entities, including people, companies, and locations, from text.

The second component is a probabilistic reasoning system called PROBKB. Given a set of entities, relations, and rules extracted from a text corpus (e.g., the WWW), PROBKB enables large-scale inference and reasoning over uncertain entities and relations using probabilistic first-order logic rules. Such inference generates a large number of new facts that did not exist in the original text corpus. The uncertain knowledge base is modeled by Markov logic networks (MLNs) (Domingos et al., 2006). In this model, the probabilistic derivation of new facts from existing ones is equivalent to inference over the MLNs.

The third component is a crowd-based human feedback system called Crowd-Assisted Machine Learning, or CAMEL. Given the set of extracted and derived facts and rules and their uncertainties, CAMEL leverages the human computing power of crowdsourcing services to improve the quality of the knowledge base. Based on the probabilities associated with the extracted and derived information in an uncertain knowledge base, CAMEL effectively selects and formulates questions to push to the crowd.

The resulting knowledge base constructed from the extraction, derivation, and feedback steps can be used in various application domains such as computational journalism and e-discovery.

3 MADDEN: Statistical Text Analysis on MPP Frameworks

The focus of the MADDEN project has been to integrate statistical text analytics into DBMSs and MPP frameworks to achieve scalability and parallelization. Structured and unstructured text are core assets for data analysis. The increasing use of text analysis in enterprise applications has raised customer expectations and expanded the opportunities for processing big data. State-of-the-art text analysis and extraction tools are increasingly based on statistical models and algorithms (Jurafsky et al., 2000; Feldman and Sanger, 2007).

Basic text analysis tasks include part-of-speech (POS) tagging, named entity extraction (NER), and entity resolution (ER) (Feldman and Sanger, 2007). Different statistical models and algorithms are implemented for each of these tasks with different runtime-accuracy trade-offs. An example entity resolution task is to find all mentions in a text corpus that refer to a real-world entity X. Such a task can be performed efficiently using approximate string matching (Navarro, 2001) to find all mentions that approximately match the name of entity X. Approximate string matching is a high-recall, low-precision approach compared to state-of-the-art collective entity resolution algorithms based on statistical models such as Conditional Random Fields (CRFs) (Lafferty et al., 2001).

CRFs are a leading probabilistic model for solving many text analysis tasks, including POS tagging, NER, and ER (Lafferty et al., 2001). To support sophisticated text analysis, we implement four key methods: text feature extraction, inference over a CRF (Viterbi), Markov chain Monte Carlo (MCMC) inference, and approximate string matching.

Text Feature Extraction: To analyze text, features need to be extracted from documents, which can be an expensive operation. To achieve high accuracy, CRF methods often compute hundreds of features over each token in the document, which is costly. Features are determined by functions over sets of tokens. Examples of such features include: (1) dictionary features: does this token exist in a provided dictionary? (2) regex features: does this token match a provided regular expression? (3) edge features: is the label of a token correlated with the label of the previous token? (4) word features: does this token appear in the training data? and (5) position features: is this token the first or last in the token sequence? The optimal combination of features depends on the application.
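As a minimal sketch, such per-token feature functions might look as follows in Python; the dictionary, regular expression, and vocabulary here are illustrative assumptions, not MADDEN's in-database implementation. Note that true edge features pair the current label with the previous label inside the CRF model; the previous-token feature below only captures the surrounding context.

    import re

    # Hypothetical dictionary and regex; MADDEN computes such features in-database.
    DICTIONARY = {"Florida", "Montreal"}
    CAPITALIZED = re.compile(r"^[A-Z][a-z]+$")

    def token_features(tokens, i, training_vocab):
        """Compute example CRF features for tokens[i]."""
        t = tokens[i]
        return {
            "in_dict": t in DICTIONARY,                           # (1) dictionary feature
            "matches_regex": bool(CAPITALIZED.match(t)),          # (2) regex feature
            "prev_token": tokens[i - 1] if i > 0 else "<START>",  # context for (3) edge features
            "in_training_data": t in training_vocab,              # (4) word feature
            "is_first": i == 0,                                   # (5) position features
            "is_last": i == len(tokens) - 1,
        }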
Viterbi Inference: Once we have the features, the next step is to perform inference on the CRF model. The Viterbi dynamic programming algorithm computes the most likely label sequence for a document; we use SQL recursion to drive the recursion in Viterbi. In the Greenplum MPP framework, Viterbi can be run in parallel over different subsets of the document on a multi-core machine.
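As a minimal illustrative sketch (not our SQL implementation), linear-chain Viterbi over log-space scores can be written in Python as follows; the matrix shapes and names are our own assumptions:

    import numpy as np

    def viterbi(emission, transition):
        """Most likely label sequence for a linear-chain CRF.
        emission: (T, L) log-scores of each label at each token;
        transition: (L, L) log-scores of moving from one label to the next."""
        T, L = emission.shape
        score = np.empty((T, L))
        back = np.zeros((T, L), dtype=int)
        score[0] = emission[0]
        for t in range(1, T):
            # candidate score for every (previous label, current label) pair
            cand = score[t - 1][:, None] + transition + emission[t][None, :]
            back[t] = cand.argmax(axis=0)   # best predecessor per current label
            score[t] = cand.max(axis=0)
        # follow back-pointers from the best final label
        labels = [int(score[-1].argmax())]
        for t in range(T - 1, 0, -1):
            labels.append(int(back[t, labels[-1]]))
        return labels[::-1]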
MCMC Inference: MCMC methods are classical sampling algorithms that can be used to estimate probability distributions. We implemented two MCMC methods: Gibbs sampling and Metropolis-Hastings (MCMC-MH). MCMC algorithms involve iterative procedures in which the current values depend on previous iterations. We use SQL window aggregates for macro-coordination in this case, carrying "state" across iterations to perform the Markov chain process. We discussed this implementation at some length in recent work (Wang et al., 2011). We are currently developing MCMC algorithms over the Greenplum DBMS.

Approximate String Matching: A recurring primitive operation in text processing applications is the ability to match strings approximately. The technique we use is based on q-grams (Gravano et al., 2001). We create and index 3-grams over text. Given the string "Tim Tebow", we can create its 3-grams using a sliding window of 3 characters (yielding "Tim", "im ", "m T", " Te", "Teb", "ebo", "bow"). Given two strings, we compare the overlap of their 3-gram sets and compute a similarity score as the approximate matching score.

4 PROBKB: Probabilistic Knowledge Base

The second component of our system is PROBKB, a probabilistic knowledge base designed to derive implicit knowledge from the entities, relations, and rules extracted from a text corpus by knowledge extraction systems like MADDEN. Discovering new knowledge is a crucial step towards knowledge base construction, since many valuable facts are not explicitly stated in web text; they need to be inferred from the extracted facts and rules.
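To make the rule-based derivation concrete: with hypothetical predicates and an illustrative weight (not taken from PROBKB itself), an MLN rule such as

    1.5 : bornIn(x, y) ∧ locatedIn(y, z) ⇒ bornIn(x, z)

would let PROBKB derive the new fact bornIn(Obama, USA) from the extracted facts bornIn(Obama, Hawaii) and locatedIn(Hawaii, USA), with the weight 1.5 controlling how strongly the MLN favors possible worlds in which the derived fact holds.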
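A toy grounding step for such a rule is sketched below in Python, under a naive independence assumption for the derived confidence; PROBKB's MLN semantics would instead compute marginals by probabilistic inference, and the facts and probabilities shown are hypothetical.

    # Extracted facts with their extraction probabilities (hypothetical values).
    born_in = {("Obama", "Hawaii"): 0.90}
    located_in = {("Hawaii", "USA"): 0.95}

    # Ground the rule bornIn(x, y) ∧ locatedIn(y, z) ⇒ bornIn(x, z)
    # by joining the two relations on the shared variable y.
    derived = {
        (x, z): p1 * p2  # naive independence; MLN inference differs
        for (x, y1), p1 in born_in.items()
        for (y2, z), p2 in located_in.items()
        if y1 == y2
    }

    print(derived)  # {('Obama', 'USA'): 0.855}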