
Sparse Information Extraction: Unsupervised Language Models to the Rescue

Doug Downey, Stefan Schoenmackers, and Oren Etzioni
Turing Center, Department of Computer Science and Engineering
University of Washington, Box 352350
Seattle, WA 98195, USA
{ddowney,stef,etzioni}@cs.washington.edu

Abstract

Even in a massive corpus such as the Web, a substantial fraction of extractions appear infrequently. This paper shows how to assess the correctness of sparse extractions by utilizing unsupervised language models. The REALM system, which combines HMM-based and n-gram-based language models, ranks candidate extractions by the likelihood that they are correct. Our experiments show that REALM reduces extraction error by 39%, on average, when compared with previous work.

Because REALM pre-computes language models based on its corpus and does not require any hand-tagged seeds, it is far more scalable than approaches that learn models for each individual relation from hand-tagged data. Thus, REALM is ideally suited for open information extraction, where the relations of interest are not specified in advance and their number is potentially vast.

1 Introduction

Information Extraction (IE) from text is far from infallible. In response, researchers have begun to exploit the redundancy in massive corpora such as the Web in order to assess the veracity of extractions (e.g., (Downey et al., 2005; Etzioni et al., 2005; Feldman et al., 2006)). In essence, such methods utilize extraction patterns to generate candidate extractions (e.g., "Istanbul") and then assess each candidate by computing co-occurrence statistics between the extraction and words or phrases indicative of class membership (e.g., "cities such as").

However, Zipf's Law governs the distribution of extractions. Thus, even the Web has limited redundancy for less prominent instances of relations. Indeed, 50% of the extractions in the data sets employed by (Downey et al., 2005) appeared only once. As a result, Downey et al.'s model, and related methods, had no way of assessing which extraction is more likely to be correct for fully half of the extractions. This problem is particularly acute when moving beyond unary relations. We refer to this challenge as the task of assessing sparse extractions.

This paper introduces the idea that language modeling techniques such as n-gram statistics (Manning and Schütze, 1999) and HMMs (Rabiner, 1989) can be used to effectively assess sparse extractions. The paper introduces the REALM system and highlights its unique properties. Notably, REALM does not require any hand-tagged seeds, which enables it to scale to Open IE—extraction where the relations of interest are not specified in advance, and their number is potentially vast (Banko et al., 2007).

REALM is based on two key hypotheses. The KnowItAll hypothesis is that extractions that occur more frequently in distinct sentences in the corpus are more likely to be correct. For example, the hypothesis suggests that the argument pair (Giuliani, New York) is relatively likely to be appropriate for the Mayor relation, simply because this pair is extracted for the Mayor relation relatively frequently. Second, we employ an instance of the distributional hypothesis (Harris, 1985), which can be phrased as follows: different instances of the same semantic relation tend to appear in similar textual contexts. We assess sparse extractions by comparing the contexts in which they appear to those of more common extractions.
Sparse extractions whose contexts are more similar to those of common extractions are judged more likely to be correct, based on the conjunction of the KnowItAll and the distributional hypotheses.

The contributions of the paper are as follows:

• The paper introduces the insight that the subfield of language modeling provides unsupervised methods that can be leveraged to assess sparse extractions. These methods are more scalable than previous assessment techniques, and require no hand tagging whatsoever.

• The paper introduces an HMM-based technique for checking whether two arguments are of the proper type for a relation.

• The paper introduces a relational n-gram model for the purpose of determining whether a sentence that mentions multiple arguments actually expresses a particular relationship between them.

• The paper introduces a novel language-modeling system called REALM that combines both HMM-based models and relational n-gram models, and shows that REALM reduces error by an average of 39% over previous methods when applied to sparse extraction data.

The remainder of the paper is organized as follows. Section 2 introduces the IE assessment task and describes the REALM system in detail. Section 3 reports on our experimental results, followed by a discussion of related work in Section 4. Finally, we conclude with a discussion of scalability and with directions for future work.

2 IE Assessment

This section formalizes the IE assessment task and describes the REALM system for solving it.

An IE assessor takes as input a list of candidate extractions meant to denote instances of a relation, and outputs a ranking of the extractions with the goal that correct extractions rank higher than incorrect ones. A correct extraction is defined to be a true instance of the relation mentioned in the input text.

More formally, the list of candidate extractions for a relation R is denoted as E_R = {(a_1, b_1), ..., (a_m, b_m)}. An extraction (a_i, b_i) is an ordered pair of strings. The extraction is correct if and only if the relation R holds between the arguments named by a_i and b_i. For example, for R = Headquartered, a pair (a_i, b_i) is correct iff there exists an organization a_i that is in fact headquartered in the location b_i. (For clarity, our discussion focuses on relations between pairs of arguments; however, the methods we propose can be extended to relations of any arity.)

E_R is generated by applying an extraction mechanism, typically a set of extraction "patterns", to each sentence in a corpus, and recording the results. Thus, many elements of E_R are identical extractions derived from different sentences in the corpus.

This task definition is notable for the minimal inputs required—IE assessment does not require knowing the relation name, nor does it require hand-tagged seed examples of the relation. Thus, an IE assessor is applicable to Open IE.

2.1 System Overview

In this section, we describe the REALM system, which utilizes language modeling techniques to perform IE assessment.

REALM takes as input a set of extractions E_R and outputs a ranking of those extractions. The algorithm REALM follows is outlined in Figure 1. REALM begins by automatically selecting from E_R a set of bootstrapped seeds S_R intended to serve as correct examples of the relation R. REALM utilizes the KnowItAll hypothesis, setting S_R equal to the h elements in E_R extracted most frequently from the underlying corpus. This results in a noisy set of seeds, but the methods that use these seeds are noise tolerant.

    REALM(Extractions E_R = {e_1, ..., e_m})
      S_R = the h most frequent extractions in E_R
      U_R = E_R - S_R
      TypeRankings(U_R) <- HMM-T(S_R, U_R)
      RelationRankings(U_R) <- REL-GRAMS(S_R, U_R)
      return a ranking of E_R with the elements of S_R at the top (ranked by
        frequency), followed by the elements of U_R = {u_1, ..., u_{m-h}} ranked
        in ascending order of TypeRanking(u_i) * RelationRanking(u_i)

Figure 1: Pseudocode for REALM at run-time. The language models used by the HMM-T and REL-GRAMS components are constructed in a pre-processing step.
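To make the control flow of Figure 1 concrete, the following is a minimal Python sketch of the run-time procedure, assuming the two components are supplied as opaque ranking functions; the names (realm_rank, type_ranker, relation_ranker) and the data layout are illustrative and are not taken from the system's actual implementation.

    from collections import Counter

    def realm_rank(extractions, type_ranker, relation_ranker, h=10):
        """Rank candidate extractions along the lines of Figure 1 (sketch).

        extractions: list of (arg1, arg2) pairs, one entry per extracting
                     sentence, so repeated pairs encode extraction frequency.
        type_ranker, relation_ranker: callables mapping (seeds, non_seeds) to
                     a dict that assigns each non-seed pair a rank (1 = best).
        """
        freq = Counter(extractions)
        # Bootstrapped seeds S_R: the h most frequently extracted pairs.
        seeds = [pair for pair, _ in freq.most_common(h)]
        non_seeds = [pair for pair in freq if pair not in seeds]

        type_rank = type_ranker(seeds, non_seeds)          # HMM-T's role
        relation_rank = relation_ranker(seeds, non_seeds)  # REL-GRAMS's role

        # Non-seeds are ordered by the product of the two component ranks.
        non_seeds.sort(key=lambda p: type_rank[p] * relation_rank[p])

        # Seeds come first (already in frequency order), then ranked non-seeds.
        return seeds + non_seeds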
REALM then proceeds to rank the remaining (non-seed) extractions by utilizing two language-modeling components. An n-gram language model is a probability distribution P(w_1, ..., w_n) over consecutive word sequences of length n in a corpus. Formally, if we assume a seed (s_1, s_2) is a correct extraction of a relation R, the distributional hypothesis states that the context distribution around the seed extraction, P(w_1, ..., w_n | w_i = s_1, w_j = s_2) for 1 ≤ i, j ≤ n, tends to be "more similar" to P(w_1, ..., w_n | w_i = e_1, w_j = e_2) when the extraction (e_1, e_2) is correct.

Naively comparing context distributions is problematic, however, because the arguments to a relation often appear separated by several intervening words. In our experiments, we found that when relation arguments appear together in a sentence, 75% of the time the arguments are separated by at least three words. This implies that n must be large, and for sparse argument pairs it is not possible to estimate such a large language model accurately, because the number of modeling parameters is proportional to the vocabulary size raised to the nth power. To mitigate sparsity, REALM utilizes smaller language models in its two components as a means of "backing off" from estimating context distributions explicitly, as described below.

First, REALM utilizes an HMM to estimate whether each extraction has arguments of the proper type for the relation. Each relation R has a set of types for its arguments. For example, the relation AuthorOf(a, b) requires that its first argument be an author, and that its second be some kind of written work. Knowing whether extracted arguments are of the proper type for a relation can be quite informative for assessing extractions. The challenge, however, is that this type information is not given to the system, since the relations (and the types of the arguments) are not known in advance. REALM solves this problem by comparing the distributions of the seed arguments and extraction arguments. Type checking mitigates data sparsity by leveraging every occurrence of the individual extraction arguments in the corpus, rather than only those cases in which argument pairs occur near each other.

Although argument type checking is invaluable for extraction assessment, it is not sufficient for extracting relationships between arguments. For example, an IE system using only type information might determine that Intel is a corporation and that Seattle is a city, and therefore erroneously conclude that Headquartered(Intel, Seattle) is correct. Thus, REALM's second step is to employ an n-gram-based language model to assess whether the extracted arguments share the appropriate relation. Again, this information is not given to the system, so REALM compares the context distributions of the extractions to those of the seeds. As described in Section 2.3, REALM employs a relational n-gram language model in order to accurately compare context distributions when extractions are sparse.

REALM executes the type checking and relation assessment components separately; each component takes the seed and non-seed extractions as arguments and returns a ranking of the non-seeds. REALM then combines the two components' assessments into a single ranking. Although several such combinations are possible, REALM simply ranks the extractions in ascending order of the product of the ranks assigned by the two components. The following subsections describe REALM's two components in detail.

We identify the proper nouns in our corpus using the LEX method (Downey et al., 2007). In addition to locating the proper nouns in the corpus, LEX also concatenates each multi-token proper noun (e.g., Los Angeles) into a single token. Both of REALM's components construct language models from this tokenized corpus.
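As a rough illustration of this pre-processing step, the sketch below concatenates multi-token proper nouns into single tokens once their spans have been identified; the span-detection step itself (the LEX method) is not reproduced here, and the function and parameter names are hypothetical.

    def concatenate_proper_nouns(tokens, proper_noun_spans, joiner="+"):
        """Merge each identified multi-token proper noun into one token.

        tokens: list of word tokens for one sentence.
        proper_noun_spans: list of (start, end) index pairs (end exclusive)
            marking proper nouns, e.g. produced by a locator such as LEX.
        """
        starts = {start: end for start, end in proper_noun_spans}
        merged, i = [], 0
        while i < len(tokens):
            if i in starts:
                # e.g. ["Santa", "Clara"] -> "Santa+Clara"
                merged.append(joiner.join(tokens[i:starts[i]]))
                i = starts[i]
            else:
                merged.append(tokens[i])
                i += 1
        return merged

    # Example:
    # concatenate_proper_nouns(
    #     ["Intel", ",", "headquartered", "in", "Santa", "Clara"],
    #     [(0, 1), (4, 6)])
    # -> ["Intel", ",", "headquartered", "in", "Santa+Clara"]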
2.2 Type Checking with HMM-T

In this section, we describe our type-checking component, which takes the form of a Hidden Markov Model and is referred to as HMM-T. HMM-T ranks the set U_R of non-seed extractions, with the goal of ranking those extractions with arguments of proper type for R above extractions containing type errors. Formally, let U_Ri denote the set of the ith arguments of the extractions in U_R. Let S_Ri be defined similarly for the seed set S_R.

Our type checking technique exploits the distributional hypothesis—in this case, the intuition that extraction arguments in U_Ri of the proper type will likely appear in contexts similar to those in which the seed arguments S_Ri appear. In order to identify terms that are distributionally similar, we train a probabilistic generative model (HMM), which treats each token in the corpus as generated by a single hidden state variable. Here, the hidden states take integral values from {1, ..., T}, and each hidden state variable is itself generated by some number k of previous hidden states. (Our implementation makes the simplifying assumption that each sentence in the corpus is generated independently.) Formally, the joint distribution of the corpus, represented as a vector of tokens w, given a corresponding vector of states t is:

    P(w | t) = \prod_i P(w_i | t_i) P(t_i | t_{i-1}, ..., t_{i-k})    (1)

The distributions on the right side of Equation 1 can be learned from a corpus in an unsupervised manner, such that words which are distributed similarly in the corpus tend to be generated by similar hidden states (Rabiner, 1989). The generative model is depicted as a Bayesian network in Figure 2. The figure also illustrates the one way in which our implementation is distinct from a standard HMM, namely that proper nouns are detected a priori and modeled as single tokens (e.g., Santa Clara is generated by a single hidden state). This allows the type checker to compare the state distributions of different proper nouns directly, even when the proper nouns contain differing numbers of words.

Figure 2: The generative model employed by HMM-T, depicted as a Bayesian network over the example token sequence "Intel , headquartered in Santa+Clara". Shown is the case in which k = 2. Corpus pre-processing results in the proper noun Santa Clara being concatenated into a single token.

To generate a ranking of U_R using the learned HMM parameters, we rank the arguments e_i according to how similar their state distributions P(t | e_i) are to those of the seed arguments. (The distribution P(t | e_i) for any e_i can be obtained from the HMM parameters using Bayes Rule.) Specifically, we define a function:

    f(e) = \sum_{e_i \in e} KL( (1 / |S_Ri|) \sum_{w' \in S_Ri} P(t | w'), P(t | e_i) )    (2)

where KL represents KL divergence, and the outer sum is taken over the arguments e_i of the extraction e. We rank the elements of U_R in ascending order of f(e).
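The ranking step of Equation 2 can be sketched in a few lines of Python, assuming the state distributions P(t | w) have already been estimated from the trained HMM and are supplied as a dictionary mapping each token to a list of T probabilities; all function and variable names here are illustrative.

    import math

    def kl_divergence(p, q, eps=1e-12):
        """KL(p || q) between two discrete state distributions, with smoothing."""
        return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

    def hmm_t_score(extraction, seeds, state_dist):
        """Equation 2: for each argument position, KL divergence between the
        averaged seed-argument state distribution and the extraction argument's
        state distribution, summed over positions. Lower is better."""
        num_states = len(next(iter(state_dist.values())))
        score = 0.0
        for i, arg in enumerate(extraction):
            seed_args = [seed[i] for seed in seeds]
            # Average P(t | w') over the seed arguments in position i.
            avg = [sum(state_dist[w][t] for w in seed_args) / len(seed_args)
                   for t in range(num_states)]
            score += kl_divergence(avg, state_dist[arg])
        return score

    def hmm_t_rank(seeds, non_seeds, state_dist):
        """Rank non-seeds in ascending order of f(e); return a rank dictionary."""
        ordered = sorted(non_seeds, key=lambda e: hmm_t_score(e, seeds, state_dist))
        return {e: rank for rank, e in enumerate(ordered, start=1)}

With the state distributions bound in advance (e.g., via functools.partial), hmm_t_rank could serve as the type_ranker callable in the earlier run-time sketch.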
HMM-T has two advantages over a more traditional type checking approach of simply counting the number of times in the corpus that each extraction appears in a context in which a seed also appears (cf. (Ravichandran et al., 2005)). The first advantage of HMM-T is efficiency, as the traditional approach involves a computationally expensive step of retrieving the potentially large set of contexts in which the extractions and seeds appear. In our experiments, using HMM-T instead of a context-based approach results in a 10-50x reduction in the amount of data that is retrieved to perform type checking.

Secondly, on sparse data HMM-T has the potential to improve type checking accuracy. For example, consider comparing Pickerington, a sparse candidate argument of the type City, to the seed argument Chicago, for which the following two phrases appear in the corpus:

(i) "Pickerington, Ohio"
(ii) "Chicago, Illinois"

In these phrases, the textual contexts surrounding Chicago and Pickerington are not identical, so to the traditional approach these contexts offer no evidence that Pickerington and Chicago are of the same type. For a sparse token like Pickerington, this is problematic because the token may never occur in a context that precisely matches that of a seed. In contrast, in the HMM, the non-sparse tokens Ohio and Illinois are likely to have similar state distributions, as they are both names of U.S. states. Thus, in the state space employed by the HMM, the contexts in phrases (i) and (ii) are in fact quite similar, allowing HMM-T to detect that Pickerington and Chicago are likely of the same type. Our experiments quantify the performance improvements that HMM-T offers over the traditional approach for type checking sparse data.

The time required to learn HMM-T's parameters scales in proportion to T^(k+1) times the corpus size. Thus, for tractability, HMM-T uses a relatively small state space of T = 20 states and a limited k value of 3. While these settings are sufficient for type checking (e.g., determining that Santa Clara is a city), they are too coarse-grained to assess relations between arguments (e.g., determining that Santa Clara is the particular city in which Intel is headquartered). We now turn to the REL-GRAMS component, which performs the latter task.

2.3 Relation Assessment with REL-GRAMS

REALM's relation assessment component, called REL-GRAMS, tests whether the extracted arguments have a desired relationship, but given REALM's minimal input it has no a priori information about the relationship. REL-GRAMS relies instead on the distributional hypothesis to test each extraction.

As argued in Section 2.1, it is intractable to build an accurate language model for context distributions surrounding sparse argument pairs. To overcome this problem, we introduce relational n-gram models. Rather than simply modeling the context distribution around a given argument, a relational n-gram model specifies separate context distributions for an argument conditioned on each of the other arguments with which it appears. The relational n-gram model allows us to estimate context distributions for pairs of arguments, even when the arguments do not appear together within a fixed window of n words. Further, by considering only consecutive argument pairs, the number of distinct argument pairs in the model grows at most linearly with the number of sentences in the corpus. Thus, the relational n-gram model can scale.

Formally, for a pair of arguments (e_1, e_2), a relational n-gram model estimates the distributions P(w_1, ..., w_n | w_i = e_1, e_1 ↔ e_2) for each 1 ≤ i ≤ n, where the notation e_1 ↔ e_2 indicates the event that e_2 is the next argument to either the right or the left of e_1 in the corpus.

REL-GRAMS begins by building a relational n-gram model of the arguments in the corpus. For notational convenience, we represent the model's distributions in terms of "context vectors" for each pair of arguments. Formally, for a given sentence containing arguments e_1 and e_2 consecutively, we define a context of the ordered pair (e_1, e_2) to be any window of n tokens around e_1. Let C = {c_1, c_2, ..., c_|C|} be the set of all contexts of all argument pairs found in the corpus. (Pre-computing the set C requires identifying in advance the potential relation arguments in the corpus. We consider the proper nouns identified by the LEX method (see Section 2.1) to be the potential arguments.) For a pair of arguments (e_j, e_k), we model their relationship using a |C|-dimensional context vector v_(e_j, e_k), whose ith dimension corresponds to the number of times context c_i occurred with the pair (e_j, e_k) in the corpus. These context vectors are similar to document vectors from information retrieval (IR), and we leverage IR research to compare them, as described below.

To assess each extraction, we determine how similar its context vector is to a canonical seed vector (created by summing the context vectors of the seeds). While there are many potential methods for determining similarity, in this work we rank extractions by decreasing values of the BM25 distance metric. BM25 is a TF-IDF variant introduced in TREC-3 (Robertson et al., 1992), which outperformed both the standard cosine distance and a smoothed KL divergence on our data.
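The sketch below shows how the context vectors and the canonical seed vector might be assembled and used to rank extractions. The similarity function is left pluggable, with a plain cosine shown as a stand-in, whereas REL-GRAMS itself ranks by a BM25 variant; the data layout and names are illustrative assumptions.

    import math
    from collections import Counter, defaultdict

    def build_context_vectors(sentences, window=3):
        """Map each consecutive argument pair to a Counter over its contexts.

        sentences: list of (tokens, arg_positions), where arg_positions gives
            the indices of the relation arguments in the tokenized sentence.
        A 'context' here is the window of tokens around the first argument,
        with the argument slots masked out (a simplifying choice).
        """
        vectors = defaultdict(Counter)
        for tokens, arg_positions in sentences:
            for p1, p2 in zip(arg_positions, arg_positions[1:]):
                pair = (tokens[p1], tokens[p2])
                lo, hi = max(0, p1 - window), min(len(tokens), p1 + window + 1)
                context = tuple("ARG" if j in (p1, p2) else tokens[j]
                                for j in range(lo, hi))
                vectors[pair][context] += 1
        return vectors

    def cosine(u, v):
        dot = sum(u[c] * v.get(c, 0) for c in u)
        norm = (math.sqrt(sum(x * x for x in u.values())) *
                math.sqrt(sum(x * x for x in v.values())))
        return dot / norm if norm else 0.0

    def rel_grams_rank(seeds, non_seeds, vectors, sim=cosine):
        """Rank non-seeds by decreasing similarity to the canonical seed vector."""
        seed_vector = Counter()
        for s in seeds:
            seed_vector.update(vectors.get(s, {}))
        ordered = sorted(non_seeds,
                         key=lambda e: sim(vectors.get(e, Counter()), seed_vector),
                         reverse=True)
        return {e: rank for rank, e in enumerate(ordered, start=1)}

As with the type checker, binding the pre-computed vectors (e.g., via functools.partial) would let rel_grams_rank play the relation_ranker role in the run-time sketch of Section 2.1.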
3 Experimental Results

This section describes our experiments on IE assessment for sparse data. We start by describing our experimental methodology, and then present our results. The first experiment tests the hypothesis that HMM-T outperforms an n-gram-based method on the task of type checking. The second experiment tests the hypothesis that REALM outperforms multiple approaches from previous work, and also outperforms each of its HMM-T and REL-GRAMS components taken in isolation.

3.1 Experimental Methodology

The corpus used for our experiments consisted of a sample of sentences taken from Web pages. From an initial crawl of nine million Web pages, we selected sentences containing relations between proper nouns. The resulting corpus consisted of about three million sentences, and was tokenized as described in Section 2. For tractability, before and after performing tokenization, we replaced each token occurring fewer than five times in the corpus with one of two "unknown word" markers (one for capitalized words, and one for uncapitalized words). This preprocessing resulted in a corpus containing about sixty-five million total tokens, and 214,787 unique tokens.

We evaluated performance on four relations: Conquered, Founded, Headquartered, and Merged. These four relations were chosen because they typically take proper nouns as arguments, and included a large number of sparse extractions. For each relation R, the candidate extraction list E_R was obtained using TEXTRUNNER (Banko et al., 2007). TEXTRUNNER is an IE system that computes an index of all extracted relationships it recognizes, in the form of (object, predicate, object) triples. For each of our target relations, we executed a single query to the TEXTRUNNER index for extractions whose predicate contained a phrase indicative of the relation (e.g., "founded by", "headquartered in"), and the results formed our extraction list. For each relation, the 10 most frequent extractions served as bootstrapped seeds. All of the non-seed extractions were sparse (no argument pairs were extracted more than twice for a given relation). These test sets contained a total of 361 extractions.
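As a small sketch of the vocabulary-pruning step described at the start of this subsection, the function below replaces rare tokens with one of two capitalization-sensitive "unknown word" markers; the threshold matches the description above, while the marker strings themselves are illustrative.

    from collections import Counter

    def replace_rare_tokens(sentences, min_count=5,
                            unk_cap="<UNK_CAP>", unk_low="<UNK>"):
        """Replace tokens occurring fewer than min_count times in the corpus
        with one of two markers, chosen by capitalization."""
        counts = Counter(tok for sent in sentences for tok in sent)

        def replace(tok):
            if counts[tok] >= min_count:
                return tok
            return unk_cap if tok[:1].isupper() else unk_low

        return [[replace(tok) for tok in sent] for sent in sentences]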
3.2 Type Checking Experiments

As discussed in Section 2.2, on sparse data HMM-T has the potential to outperform type checking methods that rely on textual similarities of context vectors. To evaluate this claim, we tested the HMM-T system against an N-GRAMS type checking method on the task of type-checking the arguments to a relation. The N-GRAMS method compares the context vectors of extractions in the same way as the REL-GRAMS method described in Section 2.3, but is not relational (N-GRAMS considers the distribution of each extraction argument independently, similar to HMM-T). We tagged an extraction as type correct iff both arguments were valid for the relation, ignoring whether the relation held between the arguments.

The results of our type checking experiments are shown in Table 1. For all types, HMM-T outperforms N-GRAMS, and HMM-T reduces error (measured in missing area under the precision/recall curve) by 46%. The performance difference on each relation is statistically significant (p < 0.01, two-sampled t-test), using the methodology for measuring the standard deviation of area under the precision/recall curve given in (Richardson and Domingos, 2006). N-GRAMS, like REL-GRAMS, employs the BM25 metric to measure distributional similarity between extractions and seeds. Replacing BM25 with cosine distance cuts HMM-T's advantage over N-GRAMS, but HMM-T's error rate is still 23% lower on average.

    Type             HMM-T    N-GRAMS
    Conquered        0.917    0.767
    Founded          0.827    0.636
    Headquartered    0.734    0.589
    Merged           0.920    0.854
    Average          0.849    0.712

Table 1: Type Checking Performance. Listed is area under the precision/recall curve. HMM-T outperforms N-GRAMS for all relations, and reduces the error in terms of missing area under the curve by 46% on average.
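For reference, the sketch below computes area under the precision/recall curve for a ranked list of extractions with binary correctness labels, the metric reported in Tables 1 and 2. It uses the standard average-precision formulation of the area and may differ in detail from the exact procedure used in the paper.

    def pr_auc(ranked_labels):
        """Area under the precision/recall curve for a ranked list of 0/1
        labels, computed as average precision (precision at each correct item,
        averaged over the correct items)."""
        total_correct = sum(ranked_labels)
        if total_correct == 0:
            return 0.0
        area, correct_so_far = 0.0, 0
        for rank, label in enumerate(ranked_labels, start=1):
            if label:
                correct_so_far += 1
                area += (correct_so_far / rank) / total_correct
        return area

    # Example: a ranking that places two of three correct extractions first.
    # pr_auc([1, 1, 0, 1, 0]) -> (1.0 + 1.0 + 0.75) / 3 = 0.917 (approximately)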

3.3 Experiments with REALM

The REALM system combines the type checking and relation assessment components to assess extractions. Here, we test the ability of REALM to improve the ranking of a state-of-the-art IE system, TEXTRUNNER. For these experiments, we evaluate REALM against the TEXTRUNNER frequency-based ordering, a pattern-learning approach, and the HMM-T and REL-GRAMS components taken in isolation. The TEXTRUNNER frequency-based ordering ranks extractions in decreasing order of their extraction frequency, and importantly, for our task this ordering is essentially equivalent to that produced by the "Urns" (Downey et al., 2005) and Pointwise Mutual Information (Etzioni et al., 2005) approaches employed in previous work.

The pattern-learning approach, denoted as PL, is modeled after Snowball (Agichtein, 2006). The algorithm and parameter settings for PL were those manually tuned for the Headquartered relation in previous work (Agichtein, 2005). A sensitivity analysis of these parameters indicated that the results are sensitive to the parameter settings. However, we found no parameter settings that performed significantly better, and many settings performed significantly worse. As such, we believe our results reasonably reflect the performance of a pattern learning system on this task. Because PL performs relation assessment, we also attempted combining PL with HMM-T in a hybrid method (PL + HMM-T) analogous to REALM.

The results of these experiments are shown in Table 2. REALM outperforms the TEXTRUNNER and PL baselines for all relations, and reduces the missing area under the curve by an average of 39% relative to the strongest baseline. The performance differences between REALM and TEXTRUNNER are statistically significant for all relations, as are differences between REALM and PL for all relations except Conquered (p < 0.01, two-sampled t-test). The hybrid REALM system also outperforms each of its components in isolation.

                   Conquered     Founded      Headquartered  Merged       Average
    Avg. Prec.     0.698         0.578        0.400          0.742        0.605
    TEXTRUNNER     0.738         0.699        0.710          0.784        0.733
    PL             0.885         0.633        0.651          0.852        0.785
    PL + HMM-T     0.883         0.722        0.727          0.900        0.808
    HMM-T          0.830         0.776        0.678          0.864        0.787
    REL-GRAMS      0.929 (39%)   0.713        0.758          0.886        0.822
    REALM          0.907 (19%)   0.781 (27%)  0.810 (35%)    0.908 (38%)  0.851 (39%)

Table 2: Performance of REALM for assessment of sparse extractions. Listed is area under the precision/recall curve for each method. In parentheses is the percentage reduction in error over the strongest baseline method (TEXTRUNNER or PL) for each relation. "Avg. Prec." denotes the fraction of correct examples in the test set for each relation. REALM outperforms its REL-GRAMS and HMM-T components taken in isolation, as well as the TEXTRUNNER and PL systems from previous work.

4 Related Work

To our knowledge, REALM is the first system to use language modeling techniques for IE assessment.

Redundancy-based approaches to pattern-based IE assessment (Downey et al., 2005; Etzioni et al., 2005) require that extractions appear relatively frequently with a limited set of patterns. In contrast, REALM utilizes all contexts to build a model of extractions, rather than a limited set of patterns. Our experiments demonstrate that REALM outperforms these approaches on sparse data.

Type checking using named-entity taggers has been previously shown to improve the precision of pattern-based IE systems (Agichtein, 2005; Feldman et al., 2006), but the HMM-T type-checking component we develop differs from this work in important ways. Named-entity taggers are limited in that they typically recognize only a small set of types (e.g., ORGANIZATION, LOCATION, PERSON), and they require hand-tagged training data for each type. HMM-T, by contrast, performs type checking for any type, and it does not require hand-tagged training data.

Pattern learning is a common technique for extracting and assessing sparse data (e.g., (Agichtein, 2005; Riloff and Jones, 1999; Paşca et al., 2006)). Our experiments demonstrate that REALM outperforms a pattern learning system closely modeled after (Agichtein, 2005). REALM is inspired by pattern learning techniques (in particular, both use the distributional hypothesis to assess sparse data) but is distinct in important ways. Pattern learning techniques require substantial processing of the corpus after the relations they assess have been specified. Because of this, pattern learning systems are unsuited to Open IE. Unlike these techniques, REALM pre-computes language models which allow it to assess extractions for arbitrary relations at run-time. In essence, pattern-learning methods run in time linear in the number of relations, whereas REALM's run time is constant in the number of relations. Thus, REALM scales readily to large numbers of relations, whereas pattern-learning methods do not.
A second distinction of REALM is that its type checker, unlike the named-entity taggers employed in pattern learning systems (e.g., Snowball), can be used to identify arbitrary types. A final distinction is that the language models REALM employs require fewer parameters and heuristics than pattern learning techniques.

Similar distinctions exist between REALM and a recent system designed to assess sparse extractions by bootstrapping a classifier for each target relation (Feldman et al., 2006). As in pattern learning, constructing the classifiers requires substantial processing after the target relations have been specified, as well as a set of hand-tagged examples per relation, making it unsuitable for Open IE.

5 Conclusions

This paper demonstrated that unsupervised language models, as embodied in the REALM system, are an effective means of assessing sparse extractions.

Another attractive feature of REALM is its scalability. Scalability is a particularly important concern for Open Information Extraction, the task of extracting large numbers of relations that are not specified in advance. Because HMM-T and REL-GRAMS both pre-compute language models, REALM can be queried efficiently to perform IE assessment. Further, the language models are constructed independently of the target relations, allowing REALM to perform IE assessment even when relations are not specified in advance.

In future work, we plan to develop a probabilistic model of the information computed by REALM. We also plan to evaluate the use of non-local context for IE assessment by integrating document-level modeling techniques (e.g., Latent Dirichlet Allocation).

Acknowledgements

This research was supported in part by NSF grants IIS-0535284 and IIS-0312988, DARPA contract NBCHD030010, and ONR grant N00014-05-1-0185, as well as a gift from Google. The first author is supported by an MSR graduate fellowship sponsored by Microsoft Live Labs. We thank Michele Banko, Jeff Bilmes, Katrin Kirchhoff, and Alex Yates for helpful comments.

References

E. Agichtein. 2005. Extracting Relations From Large Text Collections. Ph.D. thesis, Department of Computer Science, Columbia University.

E. Agichtein. 2006. Confidence estimation methods for partially supervised relation extraction. In SDM 2006.

M. Banko, M. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. 2007. Open information extraction from the web. In Procs. of IJCAI 2007.

D. Downey, O. Etzioni, and S. Soderland. 2005. A Probabilistic Model of Redundancy in Information Extraction. In Procs. of IJCAI 2005.

D. Downey, M. Broadhead, and O. Etzioni. 2007. Locating complex named entities in web text. In Procs. of IJCAI 2007.

O. Etzioni, M. Cafarella, D. Downey, S. Kok, A. Popescu, T. Shaked, S. Soderland, D. Weld, and A. Yates. 2005. Unsupervised named-entity extraction from the web: An experimental study. Artificial Intelligence, 165(1):91–134.

R. Feldman, B. Rosenfeld, S. Soderland, and O. Etzioni. 2006. Self-supervised relation extraction from the web. In ISMIS, pages 755–764.

Z. Harris. 1985. Distributional structure. In J. J. Katz, editor, The Philosophy of Linguistics, pages 26–47. New York: Oxford University Press.

C. D. Manning and H. Schütze. 1999. Foundations of Statistical Natural Language Processing.

M. Paşca, D. Lin, J. Bigham, A. Lifchits, and A. Jain. 2006. Names and similarities on the web: Fact extraction in the fast lane. In Procs. of ACL/COLING 2006.

L. R. Rabiner. 1989. A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286.

D. Ravichandran, P. Pantel, and E. H. Hovy. 2005. Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering. In Procs. of ACL 2005.

M. Richardson and P. Domingos. 2006. Markov Logic Networks. Machine Learning, 62(1-2):107–136.
E. Riloff and R. Jones. 1999. Learning Dictionaries for Information Extraction by Multi-level Bootstrapping. In Procs. of AAAI-99, pages 1044–1049.

S. E. Robertson, S. Walker, M. Hancock-Beaulieu, A. Gull, and M. Lau. 1992. Okapi at TREC-3. In Text REtrieval Conference, pages 21–30.