Probabilistic Databases of Universal Schema
Limin Yao    Sebastian Riedel    Andrew McCallum
Department of Computer Science
University of Massachusetts, Amherst
{lmyao,riedel,mccallum}@cs.umass.edu

Abstract

In data integration we transform information from a source into a target schema. A general problem in this task is loss of fidelity and coverage: the source expresses more knowledge than can fit into the target schema, or knowledge that is hard to fit into any schema at all. This problem is taken to an extreme in information extraction (IE), where the source is natural language. To address this issue, one can either automatically learn a latent schema emergent in text (a brittle and ill-defined task), or manually extend schemas. We propose instead to store data in a probabilistic database of universal schema. This schema is simply the union of all source schemas, and the probabilistic database learns how to predict the cells of each source relation in this union. For example, the database could store Freebase relations and relations that correspond to natural language surface patterns. The database would learn to predict which Freebase relations hold true based on which surface patterns appear, and vice versa. We describe an analogy between such databases and collaborative filtering models, and use it to implement our paradigm with probabilistic PCA, a scalable and effective collaborative filtering method.

1 Introduction

Natural language is a highly expressive representation of knowledge. Yet, for many tasks databases are more suitable, as they support more effective decision support, question answering and data mining. But given a fixed schema, any database can only capture so much of the information natural language can express, even if we restrict ourselves to factual knowledge. For example, Freebase (Bollacker et al., 2008) captures the content of Wikipedia to some extent, but has no criticized(Person, Person) relation and hence cannot answer a question like "Who criticized George Bush?", even though partial answers are expressed in Wikipedia. This makes the database schema a major bottleneck in information extraction (IE). From a more general point of view, data integration always suffers from schema mismatch between knowledge source and knowledge target.

To overcome this problem, one could attempt to manually extend the schema whenever needed, but this is a time-consuming and expensive process. Alternatively, in the case of IE, we can automatically induce latent schemas from text, but this is a brittle, ill-defined and error-prone task. This paper proposes a third alternative: sidestep the issue of incomplete schemas altogether by simply combining the relations of all knowledge sources into what we will refer to as a universal schema. In the case of IE this means maintaining a database with one table per natural language surface pattern. For data integration from structured sources it simply means storing the original tables as is. Crucially, the database will not only store what each source table does contain, it will also learn a probabilistic model of which other rows each source table should correctly contain.

Let us illustrate this approach in the context of IE. First we copy tables such as profession from a structured source (say, DBPedia). Next we create one table per surface pattern, such as was-criticized-by and was-attacked-by, and fill these tables with the entity pairs that appear with this pattern in some natural language corpus (say, the NYT Corpus).
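To make this construction concrete, the following Python sketch (not from the paper; the table names, variable names, and example facts are purely illustrative assumptions) shows one way to hold a universal schema as the union of relation tables from a structured source and from surface patterns:

# Minimal sketch of a universal schema: one table (set of entity pairs) per
# relation, whether the relation comes from a structured source or is a
# natural language surface pattern. All names and facts here are illustrative.

structured_tables = {
    "profession": {("George Bush", "politician")},          # e.g. copied from DBPedia
}
pattern_tables = {
    "was-criticized-by": {("George Bush", "John Kerry")},   # pairs seen with this pattern in a corpus
    "was-attacked-by": set(),
}

# The universal schema is simply the union of all source relations.
universal_schema = {**structured_tables, **pattern_tables}

def contains(relation, pair):
    # True if the source tables already assert relation(pair).
    return pair in universal_schema.get(relation, set())

The point of the sketch is only that both kinds of relations live side by side in one collection of tables; the probabilistic model described below operates over all of them uniformly.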
At this point, our database is a simple combination of a structured and an OpenIE (Etzioni et al., 2008) knowledge representation. However, while we insert this knowledge, we can learn a probabilistic model which is able to predict was-criticized-by pairs based on information from the was-attacked-by relation. In addition, it learns that the profession relation in Freebase can help disambiguate between physical attacks in sports and verbal attacks in politics. At the same time, the model learns that the natural language relation was-criticized-by can help predict the profession information in Freebase. Moreover, users of the database will often not need to study a particular schema—they can use their own expressions (say, works-at instead of profession) and still find the right answers.

In the previous scenario we could answer more questions than our structured sources alone, because we learn how to predict new Freebase rows. We could answer more questions than the text corpus and OpenIE alone, because we learn how to predict new rows in surface pattern tables. We could also answer more questions than in Distant Supervision (Mintz et al., 2009), because our schema is not limited to the relations in the structured source. We could even go further and import additional structured sources, such as Yago (Hoffart et al., 2012). In this case the probabilistic database would have integrated, and implicitly aligned, several different data sources, in the sense that each helps predict the rows of the other.

In this paper we present results of our first technical approach to probabilistic databases with universal schema: collaborative filtering, which has been successful in modeling movie recommendations. Here each entity tuple explicitly "rates" source tables as "I appear in it" or "I don't", and the recommender system model predicts how the tuple would "rate" other tables—this amounts to the probability of membership in the corresponding table. Collaborative filtering provides us with a wide range of scalable and effective machine learning techniques. In particular, we are free to choose models that use no latent representations at all (such as a graphical model with one random variable per database cell), or models with latent representations that do not directly correspond to interpretable semantic concepts. In this paper we explore the latter and use a probabilistic generalization of PCA for recommendation.

In our experiments we integrate Freebase data and information from the New York Times Corpus (Sandhaus, 2008). We show that our probabilistic database can answer questions neither of the sources can answer, and that it uses information from one source to improve predictions for the other.

Figure 1: gPCA re-estimates the representations of two relations and a tuple with the arrival of an observation r1(e). This enables the estimation of the probability for the unseen fact r2(e). Notice that both tuple and relation components are re-estimated and can change with the arrival of new observations. [Diagram: the cells x_{e,r1} and x_{e,r2} for r1 = "a-division-of", r2 = "'s-parent", and the tuple e = (Pocket Books, Simon&Schuster), with the latent vectors a_e, v_r1, v_r2 and the prediction σ(a_e⊺ v_r2).]

2 Generalized PCA

In this work we concentrate on a set of binary source relations R, consisting of surface patterns as well as imported tables from other databases, and a set of entity pairs E. We introduce a matrix X where each cell x_{e,r} is a binary random variable indicating whether r(e) is true or not. The upper half of Figure 1 shows two cells of this matrix, based on the relations r1 ("a-division-of") and r2 ("'s-parent") and the tuple e = (Pocket Books, Simon&Schuster). Generally some of the cells will be observed (such as x_{e,r1}) while others will not (such as x_{e,r2}).
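As a minimal sketch of this matrix view (the sparse dictionary representation and the cell values are illustrative assumptions, mirroring the Figure 1 example), the observed cells of X can be kept as a map from (entity pair, relation) to a binary value, with every other cell treated as an unobserved variable to be predicted:

# Sketch: the fact matrix X, indexed by entity pairs (rows) and relations
# (columns). Only observed cells are stored; any absent cell x_{e,r} is
# unobserved rather than false, and must be estimated by the model.

e = ("Pocket Books", "Simon&Schuster")
r1, r2 = "a-division-of", "'s-parent"

observed = {(e, r1): 1}     # x_{e,r1} observed true; x_{e,r2} is absent, i.e. unobserved

def cell(entity_pair, relation):
    # Return 1/0 if x_{e,r} is observed, or None if the model must predict it.
    return observed.get((entity_pair, relation))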
We employ a probabilistic generalization of Principal Component Analysis (gPCA) to estimate the probabilities P(r(e)) for every non-observed fact r(e) (Collins et al., 2001). In gPCA we learn a k-dimensional feature vector representation v_r for each relation (column) r, and a k-dimensional feature vector representation a_e for each entity pair e. Figure 1 shows example vectors for both rows and columns. Notice that these vectors do not have to be positive, nor do they have to sum to one. Given these representations, the probability of r(e) being true is given by the logistic function σ(θ) = 1 / (1 + exp(−θ)) applied to the dot product θ_{r,e} := a_e⊺ v_r. In other words, we represent the matrix of parameters Θ := (θ_{r,e}) using a low-rank approximation AV, where A = (a_e)_{e∈E} and V = (v_r)_{r∈R}.

Given a set of observed cells, gPCA estimates the tuple feature representations A and the relation feature representations V by maximizing the log-likelihood of the observed data. This can be done either in batch mode or in a more incremental fashion. In the latter we observe new facts (such as r1(e) in Figure 1) and then re-estimate A and V. Figure 1 shows what this means in practice: in the upper half we see the currently estimated representation […]

[…] al., 2011) to overcome this problem. Fundamentally, these methods rely on a symmetric notion of synonymy in which certain patterns are assumed to have the same meaning. Our approach rejects this assumption in favor of a model which learns that certain patterns, or combinations thereof, entail others in one direction, but not necessarily the other.

Methods that learn rules between textual patterns in OpenIE aim at a similar goal as our proposed gPCA algorithm (Schoenmackers et al., 2008; Schoenmackers et al., 2010). Such methods learn the structure of a Markov Network, and are ultimately bounded by limits on tree-width and density. In contrast, gPCA learns a latent, although not necessarily interpretable, structure. This latent structure can express models of very high tree-width, and hence very complex rules, without loss in efficiency.
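Returning to the model itself, the following Python sketch illustrates the gPCA scoring function σ(a_e⊺ v_r) and a plain stochastic gradient step on the log-likelihood of observed cells; the latent dimension, learning rate, initialization, and toy facts are assumptions for illustration, and the paper's actual training procedure may differ.

import numpy as np

# Sketch of gPCA as logistic low-rank factorization of the fact matrix.
k = 10                                  # latent dimension (an assumed value)
rng = np.random.default_rng(0)
A = {}                                  # a_e: latent vector per entity pair e
V = {}                                  # v_r: latent vector per relation r

def vec(store, key):
    # Lazily initialize a k-dimensional latent vector.
    if key not in store:
        store[key] = 0.1 * rng.standard_normal(k)
    return store[key]

def prob(e, r):
    # P(r(e) is true) = logistic(a_e . v_r)
    theta = vec(A, e) @ vec(V, r)
    return 1.0 / (1.0 + np.exp(-theta))

def sgd_update(e, r, x, lr=0.05):
    # One gradient step on the Bernoulli log-likelihood of an observed cell x_{e,r} = x.
    # Both the tuple vector a_e and the relation vector v_r are re-estimated,
    # mirroring the incremental setting sketched around Figure 1.
    a, v = vec(A, e), vec(V, r)
    err = x - prob(e, r)
    A[e] = a + lr * err * v
    V[r] = v + lr * err * a

# Toy usage: after observing r1(e), re-estimate and query the unseen fact r2(e).
e = ("Pocket Books", "Simon&Schuster")
for _ in range(50):
    sgd_update(e, "a-division-of", 1.0)
print(prob(e, "'s-parent"))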