Commonsense Knowledge Mining from Pretrained Models

Joshua Feldman∗, Joe Davison∗, Alexander M. Rush
School of Engineering and Applied Sciences, Harvard University
{joshua feldman@g, jddavison@g, srush@seas}.harvard.edu

Abstract

Inferring commonsense knowledge is a key challenge in natural language processing, but due to the sparsity of training data, previous work has shown that supervised methods perform poorly when evaluated on novel data. In this work, we develop a method for generating commonsense knowledge using a large, pre-trained bidirectional language model. By transforming relational triples into masked sentences, we can use this model to rank a triple's validity by the estimated pointwise mutual information between the two entities. Since we do not update the weights of the bidirectional model, our approach is not biased by the coverage of any one commonsense knowledge base. Though this method performs worse on a test set than models explicitly trained on a corresponding training set, it outperforms these methods when mining commonsense knowledge from new sources, suggesting that unsupervised techniques may generalize better than current supervised approaches.

1 Introduction

Commonsense knowledge consists of facts about the world which are assumed to be widely known. For this reason, commonsense knowledge is rarely stated explicitly in natural language, making it challenging to infer this information without an enormous amount of data (Gordon and Van Durme, 2013). Some have even argued that machine learning models cannot learn common sense implicitly (Davis and Marcus, 2015).

One method for mitigating this issue is directly augmenting models with commonsense knowledge bases (Young et al., 2018), which typically contain high-quality information but with low coverage. These knowledge bases are represented as a graph, with nodes consisting of conceptual entities (e.g. dog, running away, excited) and pre-defined edges representing the nature of the relations between concepts (IsA, UsedFor, CapableOf, etc.). Commonsense knowledge base completion (CKBC) is a task motivated by the need to improve the coverage of these resources. In this formulation of the problem, one is supplied with a list of candidate entity-relation-entity triples, and the task is to distinguish which of the triples express valid commonsense knowledge and which are fictitious (Li et al., 2016).

Several approaches have been proposed for training models for commonsense knowledge base completion (Li et al., 2016; Jastrzebski et al., 2018). Each of these approaches uses some form of supervised training on a particular knowledge base, evaluating the model's performance on a held-out test set from the same knowledge base. These works use relations from ConceptNet, a crowd-sourced database of structured commonsense knowledge, to train and validate their models (Liu and Singh, 2004). However, it has been shown that these methods generalize poorly to novel data (Li et al., 2016; Jastrzebski et al., 2018). Jastrzebski et al. (2018) demonstrated that much of the data in the ConceptNet test set were simply rephrased relations from the training set, and that this train-test leakage led to artificially inflated test performance metrics. This problem of train-test leakage is typical of knowledge base completion tasks (Toutanova et al., 2015; Dettmers et al., 2018).

Instead of training a predictive model on any specific database, we attempt to utilize the world knowledge of large language models to identify commonsense facts directly. By constructing a candidate piece of knowledge as a sentence, we can use a language model to approximate the likelihood of this text as a proxy for its truthfulness.

In particular, we use a masked language model to estimate point-wise mutual information between entities in a possible relation, an approach that differs significantly from the fine-tuning approaches used for other language modeling tasks. Since the weights of the model are fixed, our approach is not biased by the coverage of any one dataset. As we might expect, our method underperforms when compared to previous benchmarks on the ConceptNet triples dataset (Li et al., 2016), but demonstrates a superior ability to generalize when mining novel commonsense knowledge from Wikipedia.

Related Work

Schwartz et al. (2017) and Trinh and Le (2018) demonstrate a similar approach to using language models for tasks requiring common sense, such as the Story Cloze Task and the Winograd Schema Challenge, respectively (Mostafazadeh et al., 2016; Levesque et al., 2012). Bosselut et al. (2019) and Trinh and Le (2019) use unidirectional language models for CKBC, but their approach requires a supervised training step. Our approach differs in that we intentionally avoid training on any particular database, relying instead on the language model's general world knowledge. Additionally, we use a bidirectional masked model, which provides a more flexible framework for likelihood estimation and allows us to estimate point-wise mutual information. Although it is beyond the scope of this paper, it would be interesting to adapt the methods presented here for the related task of generating new commonsense knowledge (Saito et al., 2018).

2 Method

Given a commonsense head-relation-tail triple x = (h, r, t), we are interested in determining the validity of that tuple as a representation of a commonsense fact. Specifically, we would like to determine a numeric score y ∈ R reflecting our confidence that a given tuple represents true knowledge. We assume that heads and tails are arbitrary-length sequences of words in a vocabulary V, so that h = {h_1, h_2, ..., h_n} and t = {t_1, t_2, ..., t_m}. We further assume that we have a known set of possible relations R, so that r ∈ R. The goal is to determine a function f that maps relational triples to validity scores. We propose decomposing f(x) = σ(τ(x)) into two sub-components: a sentence generation function τ, which maps a triple to a single sentence, and a scoring model σ, which then determines a validity score y.

Our approach relies on two types of pretrained language models. Standard unidirectional models are typically represented as autoregressive probabilities:

    p(w_1, w_2, ..., w_m) = ∏_{i=1}^{m} p(w_i | w_1, ..., w_{i-1})

Masked bidirectional models such as BERT, proposed by Devlin et al. (2018), instead model in both directions, training word representations conditioned both on future and past words. The masking allows any number of words in the sequence to be hidden. This setup provides an intuitive framework to evaluate the probability of any word in a sequence conditioned on the rest of the sequence,

    p(w′_i | w′_{1:i-1}, w′_{i+1:m})

where w′ ∈ V ∪ {κ} and κ is a special token indicating a masked word.
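As an illustration of this framework (not code from the paper), the sketch below queries an off-the-shelf masked language model for the probability of a single word given the rest of a sentence, using the HuggingFace transformers interface; the example sentence, word, and helper name are our own.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
model = BertForMaskedLM.from_pretrained("bert-large-uncased")
model.eval()

def masked_word_prob(sentence_with_mask: str, word: str) -> float:
    """Estimate p(w_i | rest of the sentence) for the single [MASK] position."""
    inputs = tokenizer(sentence_with_mask, return_tensors="pt")
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    probs = torch.softmax(logits, dim=-1)
    return probs[tokenizer.convert_tokens_to_ids(word)].item()

# e.g. how plausible is "park" in this context?
print(masked_word_prob("you are likely to find a dog in the [MASK]", "park"))
```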
2.1 Generating Sentences from Triples

We first consider methods for turning a triple such as (ferret, AtLocation, pet store) into a sentence such as "the ferret is in the pet store". Our approach is to generate a set of candidate sentences via hand-crafted templates and select the best proposal according to a language model.

For each relation r ∈ R, we hand-craft a set of sentence templates. For example, one template in our experiments for the relation AtLocation is "you are likely to find HEAD in TAIL". For the above example, this would yield the sentence "You are likely to find ferret in pet store".

Because these sentences are not always grammatically correct, as in the above example, we apply a simple set of transformations. These consist of inserting articles before nouns, converting verbs into gerunds, and pluralizing nouns which follow numbers. See the supplementary materials for details and Table 1 for an example. We then enumerate a set of alternative sentences S = {S_1, ..., S_j} resulting from each template and from all combinations of transformations. This yields a set of candidate sentences for each data point. We then select the candidate sentence with the highest log-likelihood according to a pre-trained unidirectional language model P_coh:

    S* = argmax_{S ∈ S} log P_coh(S)

    Candidate Sentence S_i                         log p(S_i)
    "musician can playing musical instrument"        −5.7
    "musician can be play musical instrument"        −4.9
    "musician often play musical instrument"         −5.5
    "a musician can play a musical instrument"       −2.9

Table 1: Example of generating candidate sentences. Several enumerated sentences for the triple (musician, CapableOf, play musical instrument). The sentence with the highest log-likelihood according to a pretrained language model is selected.

We refer to this method of generating a sentence from a triple as Coherency Ranking. Coherency Ranking operates under the assumption that natural, grammatical sentences will have a higher likelihood than ungrammatical or unnatural sentences. See an example subset of sentence candidates and their corresponding scores in Table 1. From a qualitative evaluation of the selected sentences, we find that this approach produces sentences of significantly higher quality than those generated by deterministic rules alone. We also perform an ablation study in our experiments demonstrating the effect of each component on CKBC performance.
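To make the selection step concrete, here is a minimal sketch (our illustration, not the authors' released code) that enumerates a couple of hypothetical template variants for a triple and keeps the candidate with the highest GPT-2 log-likelihood; the template strings and helper names are assumptions for illustration.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Hypothetical templates for one relation; the paper's full template set is
# given in its supplementary materials.
TEMPLATES = {"AtLocation": ["you are likely to find {h} in {t}",
                            "{h} is in {t}"]}

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # the 117M model
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def log_likelihood(sentence: str) -> float:
    """Total log-probability of a sentence under GPT-2."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # loss is the mean cross-entropy over the predicted tokens,
        # so multiply by the number of predicted tokens to get a total.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)

def best_sentence(head: str, relation: str, tail: str) -> str:
    """Enumerate candidates and keep the most fluent one (Coherency Ranking)."""
    candidates = [tpl.format(h=head, t=tail) for tpl in TEMPLATES[relation]]
    # The full method also enumerates article/gerund/pluralization variants here.
    return max(candidates, key=log_likelihood)

print(best_sentence("ferret", "AtLocation", "pet store"))
```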
2.2 Scoring Generated Triples

Assuming we have generated a proper sentence from a relational triple, we now need a way to score its validity with a pretrained model that considers the relationship between the relation entities. We therefore propose using the estimated point-wise mutual information (PMI) of the head h and tail t of a triple conditioned on the relation r, defined as

    PMI(t, h | r) = log p(t | h, r) − log p(t | r)

We can estimate these scores by using a masked bidirectional language model, P_cmp. In the case where the tail is a single word, the model allows us to evaluate the conditional likelihood of a single triple component p(t | h, r) by computing P_cmp(w_i = t | w_{1:i-1}, w_{i+1:m}) for the tail word.

In practice, the tail might be realized as a j-word phrase. To handle this complexity, we use a greedy approximation of its probability. We first mask all of the tail words and compute the probability of each. We then find the word with the highest probability p_k, substitute it back in, and repeat j times. Finally, we calculate the total conditional likelihood of the tail as the product of these terms,

    p(t | h, r) = ∏_{k=1}^{j} p_k

The marginal p(t | r) is computed similarly, but in this case we mask the head throughout. For example, to compute the marginal tail probability for the sentence "You are likely to find a ferret in the pet store", we mask both the head and the tail and then sequentially unmask the tail words only: "You are likely to find a κ_h1 in the κ_t1 κ_t2". If κ_t2 = "store" has a higher probability than κ_t1 = "pet", we unmask "store" and compute "You are likely to find a κ_h1 in the κ_t1 store". The marginal likelihood p(t | r) is then the product of the two probabilities.

The final score combines the marginal and conditional likelihoods by employing a weighted form of the point-wise mutual information,

    PMI_λ(t, h | r) = λ log p(t | h, r) − log p(t | r)

where λ is treated as a hyperparameter. Although exact PMI is symmetrical, the approximate model itself is not. We therefore average PMI_λ(t, h | r) and PMI_λ(h, t | r) to reduce the variance of our estimates, computing the masked head values rather than the tail values in the latter.
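The sketch below shows one way the greedy approximation and the weighted PMI could be computed with a masked language model. It is our reading of the procedure rather than the authors' implementation: it assumes pre-tokenized, single-wordpiece head and tail words, omits the head/tail averaging described above, and uses a toy sentence of our own.

```python
import math
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
model = BertForMaskedLM.from_pretrained("bert-large-uncased")
model.eval()

def greedy_log_prob(tokens, positions, targets):
    """Greedily unmask `targets` at `positions`, most confident word first,
    accumulating log p of each word given the current partially masked context."""
    tokens, remaining, total = list(tokens), dict(zip(positions, targets)), 0.0
    while remaining:
        ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
        with torch.no_grad():
            probs = torch.softmax(model(ids).logits[0], dim=-1)
        scores = {p: probs[p, tokenizer.convert_tokens_to_ids(w)].item()
                  for p, w in remaining.items()}
        p = max(scores, key=scores.get)      # highest-probability masked word
        total += math.log(scores[p])
        tokens[p] = remaining.pop(p)         # substitute it back in and repeat
    return total

def weighted_pmi(tokens, head_pos, tail_pos, lam=1.0):
    """PMI_lambda(t, h | r) = lam * log p(t|h,r) - log p(t|r)."""
    tails = [tokens[p] for p in tail_pos]
    masked = list(tokens)
    for p in tail_pos:
        masked[p] = tokenizer.mask_token
    conditional = greedy_log_prob(masked, tail_pos, tails)
    for p in head_pos:                       # mask the head throughout
        masked[p] = tokenizer.mask_token
    marginal = greedy_log_prob(masked, tail_pos, tails)
    return lam * conditional - marginal

toks = ["[CLS]"] + tokenizer.tokenize("you are likely to find a dog in the park") + ["[SEP]"]
print(weighted_pmi(toks, head_pos=[toks.index("dog")], tail_pos=[toks.index("park")]))
```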
3 Experiments

To evaluate the Coherency Ranking approach, we measure whether it can distinguish between valid and invalid triples. For our masked model, we use BERT-large (Devlin et al., 2018). For sentence ranking, we use the GPT-2 117M LM (Radford et al., 2019). The relation templates and grammar transformation rules which we use can be found in the supplementary materials.

We compare the proposed method to several baselines. Following Trinh and Le (2018), we evaluate a simple Concatenation method for generating sentences, splitting the relation r into separate words and concatenating it with the head and tail. For the triple (ferret, AtLocation, pet store), the Concatenation approach would yield "ferret at location pet store".

We also evaluate CKBC performance when we construct sentences by applying a single hand-crafted template. Since each triple is mapped to a sentence with a single template without any grammatical transformations, we refer to this as the Template method. Using the Template approach, (ferret, AtLocation, pet store) would become "You are likely to find ferret in pet store" using the template "you are likely to find HEAD in TAIL".

Next, we extend the Template method by applying deterministic grammatical transformations, which we refer to as the Template + Grammar approach. Like the full approach, these transformations involve adding articles before nouns, converting verbs into gerunds, and pluralizing nouns following numbers. The Template + Grammar approach differs from Coherency Ranking in that all transformations are applied to every sentence, instead of applying combinations of transformations and templates which are then ranked by a language model. Returning to our example, the Template + Grammar method produces "You are likely to find a ferret in a pet store". While this sentence is grammatical, applying this method to (star, AtLocation, outer space) yields "You are likely to find a star in an outer space", which is incorrect.

We compare our results to the supervised models from the work of Jastrzebski et al. (2018) and the best performing model from Li et al. (2016). Jastrzebski et al. (2018) introduce Factorized and Prototypical models. The Factorized model embeds the head, relation, and tail in a vector space and then produces a score by taking a linear combination of the inner products between each pair of embeddings. The Prototypical model is similar, but does not include the inner product between head and tail. Li et al. (2016) evaluate a deep neural network (DNN) for CKBC. They concatenate embeddings for the head, relation, and tail, which they then feed through a multilayer perceptron with one hidden layer. All three models are trained on 100,000 ConceptNet triples.

Task 1: Commonsense Knowledge Base Completion

Our experimental setup follows Li et al. (2016), evaluating our model with their test set (n = 2400) containing an equal number of valid and invalid triples. The valid triples are from the crowd-sourced Open Mind Common Sense (OMCS) entries in the ConceptNet 5 dataset (Speer and Havasi, 2012). Invalid triples are generated by replacing an element of a valid tuple with another randomly selected element.

We use our scoring method to classify each tuple as valid or invalid. To this end, we use our method to assign a score to each tuple and then group the resulting scores into two clusters. Instances in the cluster with the higher mean PMI are labeled as valid, and the remainder are labeled as invalid. We use expectation-maximization with a mixture of Gaussians to cluster. We also tune the PMI weight via grid search over 90 points from λ ∈ [0.5, 5.0], using the Akaike information criterion of the Gaussian mixture model for evaluation (Akaike, 1974).

Table 2 shows the full results. Our unsupervised approach achieves a test set F1 score of 78.8, comparable to the 79.4 F1 score found by the supervised Prototypical approach. The Factorized and DNN models significantly outperformed our approach, with F1 scores of 89.0 and 89.2, respectively. Our grid search found an optimal λ value of 1.65 for the Concatenation sentence generation model and 1.55 for the Coherency Ranking model. The Template and Template + Grammar methods found λ values of 1.20 and 0.95, respectively.

    Model                   Task 1 (F1)   Task 2 (quality)
    Unsupervised
      Concatenation            68.8         2.95 ± 0.11
      Template                 72.2         2.98 ± 0.11
      Template + Grammar       74.4         2.56 ± 0.13
      Coherency Ranking        78.8         3.00 ± 0.12
    Supervised
      DNN                      89.2         2.50
      Factorized               89.0         2.61
      Prototypical             79.4         2.55

Table 2: Main results for Task 1, commonsense knowledge base completion (test F1 score), and Task 2, Wikipedia mining (quality scores out of 4). Results are included from the sentence generation methods of simple concatenation, hand-crafted templates, templates plus grammatical transformations, and coherency ranking. The DNN, Factorized, and Prototypical models are described in Jastrzebski et al. (2018).
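As a rough sketch of the clustering and λ selection just described (our own reconstruction with placeholder scores, not the paper's code), one can fit a two-component Gaussian mixture to the PMI scores for each candidate λ, keep the λ with the lowest AIC, and label the higher-mean cluster as valid:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def label_triples(score_fn, triples, lambdas=np.linspace(0.5, 5.0, 90)):
    """score_fn(triple, lam) -> weighted PMI; returns validity labels and chosen lambda."""
    best = None
    for lam in lambdas:
        scores = np.array([score_fn(t, lam) for t in triples]).reshape(-1, 1)
        gmm = GaussianMixture(n_components=2, random_state=0).fit(scores)
        aic = gmm.aic(scores)
        if best is None or aic < best[0]:
            best = (aic, lam, gmm, scores)
    _, lam, gmm, scores = best
    valid = int(np.argmax(gmm.means_.ravel()))   # higher-mean cluster = valid
    return gmm.predict(scores) == valid, lam

# Toy usage: random numbers stand in for the PMI scores of 100 candidate triples.
rng = np.random.default_rng(0)
toy_scores = np.concatenate([rng.normal(0, 1, 50), rng.normal(4, 1, 50)])
labels, best_lam = label_triples(lambda t, lam: lam * t, toy_scores)
```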

Task 2: Mining Wikipedia

To assess the model's ability to generalize to unseen data, we evaluate our unsupervised model in comparison to previous supervised methods on the task of mining commonsense knowledge from Wikipedia. In their evaluations, Li et al. (2016) curate a set of 1.7M triples across 10 relations by applying part-of-speech patterns to Wikipedia articles. We sample 300 triples from each relation and apply our method to evaluate these 3,000 triples. Following the approach described by Speer and Havasi (2012), and used by Li et al. (2016) and Jastrzebski et al. (2018), two human annotators manually rate the 100 triples with the highest predicted score on a 0 to 4 scale: 0 (doesn't make sense), 1 (not true), 2 (opinion/don't know), 3 (sometimes true), and 4 (generally true). We tuned λ by measuring the quality of the 100 triples with the highest predicted score across λ ∈ {1, 2, ..., 10}.

The top 100 triples selected by our model were assigned a mean rating of 3.00 (λ = 4) with a standard error of 0.11 under the Coherency Ranking approach, well exceeding the performance of current supervised methods (Table 2). Standard errors were calculated using 1,000 bootstrap samples of the top 100 triples. The ratings assigned by the two human annotators had a 0.50 Pearson correlation and 0.23 kappa inter-annotator agreement. Rater disagreements occur most frequently when triples are ambiguous or difficult to interpret; notably, if we bucket the five scores into just two categories of true and false, this disagreement rate drops by 50%. To give a sense of the types of commonsense knowledge our model struggles to capture, we report the top 100 most confident predictions that receive an average score below 3 in the supplementary material. Notably, some of the top 100 triples our model identified were indeed true, but would not reasonably be considered common sense (e.g. (vector bundle, HasProperty, manifold)). This suggests that our approach may be applicable to mining knowledge beyond common sense.
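The bootstrap standard error over the top-100 ratings is straightforward to reproduce; below is a small sketch with placeholder ratings (the actual annotations are not reproduced here):

```python
import numpy as np

def bootstrap_se(ratings, n_boot=1000, seed=0):
    """Standard error of the mean rating via resampling with replacement."""
    rng = np.random.default_rng(seed)
    ratings = np.asarray(ratings, dtype=float)
    means = [rng.choice(ratings, size=ratings.size, replace=True).mean()
             for _ in range(n_boot)]
    return float(np.std(means))

# Placeholder ratings on the paper's 0-4 scale, not the actual annotations.
fake_ratings = np.random.default_rng(1).integers(2, 5, size=100)
print(bootstrap_se(fake_ratings))
```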
Analysis: Sentence Generation

In order to measure the impact of sentence generation on our model, we select a sample of 100 sentences and group the results by a) whether the sentence contained a grammatical error, and b) whether the sentence misrepresented the meaning of the triple. For example, the triple (golf, HasProperty, good) yields the sentence "golf is a good", which is grammatically correct but conveys the wrong meaning. On both Wikipedia mining and CKBC, we find that misrepresenting meaning has an adverse impact on model performance. In CKBC, we also find that grammar has a high impact on the resulting F1 scores (Table 3). Future work could therefore focus on designing templates that more reliably encode a relation's true meaning.

    Task 1               N (/100)   F1 Score
      Grammatical            75       79.1
      Ungrammatical          25       66.7
      Correct meaning        91       77.6
      Wrong meaning           9       66.7

    Task 2 - Quality     N (/100)   Quality
      Grammatical            83       3.01
      Ungrammatical          17       2.88
      Correct meaning        88       3.22
      Wrong meaning          12       1.18

Table 3: Test results examining the effect of sentence meaning and grammaticality on task performance. Scores are shown for a sample of 100 triples, split by whether the generated sentence is grammatical and whether it conveys the correct meaning of the triple.

4 Conclusion

We introduce a robust unsupervised method for commonsense knowledge base completion using the world knowledge of pre-trained language models. We develop a method for expressing knowledge triples as sentences. Using a bidirectional masked language model on these sentences, we can then estimate the weighted point-wise mutual information of a triple as a proxy for its validity. Though our approach performs worse on a held-out test set developed by Li et al. (2016), it does so without any previous exposure to the ConceptNet database, ensuring that this performance is not biased. In the future, we hope to explore whether this approach can be extended to mining facts that are not common sense and to generating new commonsense knowledge outside of any given database of candidate triples. We also see potential benefit in the development of a more expansive set of evaluation methods for commonsense knowledge mining, which would strengthen the validity of our conclusions.

Acknowledgments

This work was supported by NSF research award 1845664.

References

Hirotugu Akaike. 1974. A new look at the statistical model identification. In Selected Papers of Hirotugu Akaike, pages 215–222. Springer.
Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Çelikyilmaz, and Yejin Choi. 2019. COMET: Commonsense transformers for automatic knowledge graph construction. CoRR, abs/1906.05317.

Ernest Davis and Gary Marcus. 2015. Commonsense reasoning and commonsense knowledge in artificial intelligence. Commun. ACM, 58(9):92–103.

Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. 2018. Convolutional 2D knowledge graph embeddings. In Thirty-Second AAAI Conference on Artificial Intelligence.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jonathan Gordon and Benjamin Van Durme. 2013. Reporting bias and knowledge acquisition. In Proceedings of the 2013 Workshop on Automated Knowledge Base Construction, pages 25–30. ACM.

Stanisław Jastrzebski, Dzmitry Bahdanau, Seyedarian Hosseini, Michael Noukhovitch, Yoshua Bengio, and Jackie Chi Kit Cheung. 2018. Commonsense mining as knowledge base completion? A study on the impact of novelty. arXiv preprint arXiv:1804.09259.

Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd Schema Challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.

Xiang Li, Aynaz Taheri, Lifu Tu, and Kevin Gimpel. 2016. Commonsense knowledge base completion. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1445–1455.

Hugo Liu and Push Singh. 2004. ConceptNet: a practical commonsense reasoning tool-kit. BT Technology Journal, 22(4):211–226.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and evaluation framework for deeper understanding of commonsense stories. arXiv preprint arXiv:1604.01696.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1:8.

Itsumi Saito, Kyosuke Nishida, Hisako Asano, and Junji Tomita. 2018. Commonsense knowledge base completion and generation. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 141–150, Brussels, Belgium. Association for Computational Linguistics.

Roy Schwartz, Maarten Sap, Ioannis Konstas, Leila Zilles, Yejin Choi, and Noah A. Smith. 2017. The effect of different writing tasks on linguistic style: A case study of the ROC story cloze task. CoRR, abs/1702.01841.

Robert Speer and Catherine Havasi. 2012. Representing general relational knowledge in ConceptNet 5. In LREC, pages 3679–3686.

Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. 2015. Representing text for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509.

Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. CoRR, abs/1806.02847.

Trieu H. Trinh and Quoc V. Le. 2019. Do language models have common sense?

Tom Young, Erik Cambria, Iti Chaturvedi, Hao Zhou, Subham Biswas, and Minlie Huang. 2018. Augmenting end-to-end dialogue systems with commonsense knowledge. In Thirty-Second AAAI Conference on Artificial Intelligence.
