Commonsense Knowledge Mining from Pretrained Models

Joshua Feldman∗, Joe Davison∗, Alexander M. Rush
School of Engineering and Applied Sciences, Harvard University
{joshua feldman@g, jddavison@g, srush@seas}.harvard.edu

Abstract

Inferring commonsense knowledge is a key challenge in natural language processing, but due to the sparsity of training data, previous work has shown that supervised methods perform poorly when evaluated on novel data. In this work, we develop a method for generating commonsense knowledge using a large, pre-trained bidirectional language model. By transforming relational triples into masked sentences, we can use this model to rank a triple's validity by the estimated pointwise mutual information between the two entities. Since we do not update the weights of the bidirectional model, our approach is not biased by the coverage of any one commonsense knowledge base. Though this method performs worse on a test set than models explicitly trained on a corresponding training set, it outperforms these methods when mining commonsense knowledge from new sources, suggesting that unsupervised techniques may generalize better than current supervised approaches.

1 Introduction

Commonsense knowledge consists of facts about the world which are assumed to be widely known. For this reason, commonsense knowledge is rarely stated explicitly in natural language, making it challenging to infer this information without an enormous amount of data (Gordon and Van Durme, 2013). Some have even argued that machine learning models cannot learn common sense implicitly (Davis and Marcus, 2015).

One method for mitigating this issue is directly augmenting models with commonsense knowledge bases (Young et al., 2018), which typically contain high-quality information but with low coverage. These knowledge bases are represented as a graph, with nodes consisting of conceptual entities (e.g. dog, running away, excited) and pre-defined edges representing the nature of the relations between concepts (IsA, UsedFor, CapableOf, etc.). Commonsense knowledge base completion (CKBC) is a task motivated by the need to improve the coverage of these resources. In this formulation of the problem, one is supplied with a list of candidate entity-relation-entity triples, and the task is to distinguish which of the triples express valid commonsense knowledge and which are fictitious (Li et al., 2016).

Several approaches have been proposed for training models for commonsense knowledge base completion (Li et al., 2016; Jastrzebski et al., 2018). Each of these approaches uses some form of supervised training on a particular knowledge base, evaluating the model's performance on a held-out test set from the same knowledge base. These works use relations from ConceptNet, a crowd-sourced database of structured commonsense knowledge, to train and validate their models (Liu and Singh, 2004). However, it has been shown that these methods generalize poorly to novel data (Li et al., 2016; Jastrzebski et al., 2018). Jastrzebski et al. (2018) demonstrated that much of the data in the ConceptNet test set were simply rephrased relations from the training set, and that this train-test leakage led to artificially inflated test performance metrics. This problem of train-test leakage is typical of knowledge base completion tasks (Toutanova et al., 2015; Dettmers et al., 2018).

Instead of training a predictive model on any specific database, we attempt to utilize the world knowledge of large language models to identify commonsense facts directly. By constructing a candidate piece of knowledge as a sentence, we can use a language model to approximate the likelihood of this text as a proxy for its truthfulness.

In particular, we use a masked language model to estimate point-wise mutual information between entities in a possible relation, an approach that differs significantly from the fine-tuning approaches used for other language modeling tasks. Since the weights of the model are fixed, our approach is not biased by the coverage of any one dataset. As we might expect, our method underperforms when compared to previous benchmarks on the ConceptNet triples dataset (Li et al., 2016), but demonstrates a superior ability to generalize when mining novel commonsense knowledge from Wikipedia.

Related Work

Schwartz et al. (2017) and Trinh and Le (2018) demonstrate a similar approach to using language models for tasks requiring common sense, such as the Story Cloze Task and the Winograd Schema Challenge, respectively (Mostafazadeh et al., 2016; Levesque et al., 2012). Bosselut et al. (2019) and Trinh and Le (2019) use unidirectional language models for CKBC, but their approach requires a supervised training step. Our approach differs in that we intentionally avoid training on any particular database, relying instead on the language model's general world knowledge. Additionally, we use a bidirectional masked model, which provides a more flexible framework for likelihood estimation and allows us to estimate point-wise mutual information. Although it is beyond the scope of this paper, it would be interesting to adapt the methods presented here for the related task of generating new commonsense knowledge (Saito et al., 2018).

2 Method

Given a commonsense head-relation-tail triple x = (h, r, t), we are interested in determining the validity of that tuple as a representation of a commonsense fact. Specifically, we would like to determine a numeric score y ∈ R reflecting our confidence that a given tuple represents true knowledge. We assume that heads and tails are arbitrary-length sequences of words in a vocabulary V, so that h = {h_1, h_2, ..., h_n} and t = {t_1, t_2, ..., t_m}. We further assume that we have a known set of possible relations R, so that r ∈ R. The goal is to determine a function f that maps relational triples to validity scores. We propose decomposing f(x) = σ(τ(x)) into two sub-components: a sentence generation function τ, which maps a triple to a single sentence, and a scoring model σ, which then determines a validity score y.

Our approach relies on two types of pretrained language models. Standard unidirectional models are typically represented as autoregressive probabilities:

    p(w_1, w_2, ..., w_m) = ∏_{i=1}^{m} p(w_i | w_1, ..., w_{i-1})

Masked bidirectional models such as BERT, proposed by Devlin et al. (2018), instead model in both directions, training word representations conditioned both on future and past words. The masking allows any number of words in the sequence to be hidden. This setup provides an intuitive framework to evaluate the probability of any word in a sequence conditioned on the rest of the sequence,

    p(w′_i | w′_{1:i-1}, w′_{i+1:m})

where w′ ∈ V ∪ {κ} and κ is a special token indicating a masked word.
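As an illustration of this framework (not code from the paper), the sketch below queries an off-the-shelf masked language model for the probability of a single word given the rest of a sentence, using the HuggingFace transformers interface; the example sentence, word, and helper name are our own.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
model = BertForMaskedLM.from_pretrained("bert-large-uncased")
model.eval()

def masked_word_prob(sentence_with_mask: str, word: str) -> float:
    """Estimate p(w_i | rest of the sentence) for the single [MASK] position."""
    inputs = tokenizer(sentence_with_mask, return_tensors="pt")
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    probs = torch.softmax(logits, dim=-1)
    return probs[tokenizer.convert_tokens_to_ids(word)].item()

# e.g. how plausible is "park" in this context?
print(masked_word_prob("you are likely to find a dog in the [MASK]", "park"))
```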
2.1 Generating Sentences from Triples

We first consider methods for turning a triple such as (ferret, AtLocation, pet store) into a sentence such as "the ferret is in the pet store". Our approach is to generate a set of candidate sentences via hand-crafted templates and select the best proposal according to a language model.

For each relation r ∈ R, we hand-craft a set of sentence templates. For example, one template in our experiments for the relation AtLocation is "you are likely to find HEAD in TAIL". For the above example, this would yield the sentence "You are likely to find ferret in pet store".

Because these sentences are not always grammatically correct, as in the above example, we apply a simple set of transformations. These consist of inserting articles before nouns, converting verbs into gerunds, and pluralizing nouns which follow numbers. See the supplementary materials for details and Table 1 for an example. We then enumerate a set of alternative sentences S = {S_1, ..., S_j} resulting from each template and from all combinations of transformations. This yields a set of candidate sentences for each data point. We then select the candidate sentence with the highest log-likelihood according to a pre-trained unidirectional language model P_coh:

    S* = argmax_{S ∈ S} log P_coh(S)

    Candidate Sentence S_i                         log p(S_i)
    "musician can playing musical instrument"        −5.7
    "musician can be play musical instrument"        −4.9
    "musician often play musical instrument"         −5.5
    "a musician can play a musical instrument"       −2.9

Table 1: Example of generating candidate sentences. Several enumerated sentences for the triple (musician, CapableOf, play musical instrument). The sentence with the highest log-likelihood according to a pretrained language model is selected.

We refer to this method of generating a sentence from a triple as Coherency Ranking. Coherency Ranking operates under the assumption that natural, grammatical sentences will have a higher likelihood than ungrammatical or unnatural sentences. See an example subset of sentence candidates and their corresponding scores in Table 1. From a qualitative evaluation of the selected sentences, we find that this approach produces sentences of significantly higher quality than those generated by deterministic rules alone. We also perform an ablation study in our experiments demonstrating the effect of each component on CKBC performance.
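To make the selection step concrete, here is a minimal sketch (our illustration, not the authors' released code) that enumerates a couple of hypothetical template variants for a triple and keeps the candidate with the highest GPT-2 log-likelihood; the template strings and helper names are assumptions for illustration.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Hypothetical templates for one relation; the paper's full template set is
# given in its supplementary materials.
TEMPLATES = {"AtLocation": ["you are likely to find {h} in {t}",
                            "{h} is in {t}"]}

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # the 117M model
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def log_likelihood(sentence: str) -> float:
    """Total log-probability of a sentence under GPT-2."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # loss is the mean cross-entropy over the predicted tokens,
        # so multiply by the number of predicted tokens to get a total.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)

def best_sentence(head: str, relation: str, tail: str) -> str:
    """Enumerate candidates and keep the most fluent one (Coherency Ranking)."""
    candidates = [tpl.format(h=head, t=tail) for tpl in TEMPLATES[relation]]
    # The full method also enumerates article/gerund/pluralization variants here.
    return max(candidates, key=log_likelihood)

print(best_sentence("ferret", "AtLocation", "pet store"))
```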
2.2 Scoring Generated Triples

Assuming we have generated a proper sentence from a relational triple, we now need a way to score its validity with a pretrained model that considers the relationship between the relation entities. We therefore propose using the estimated point-wise mutual information (PMI) of the head h and tail t of a triple conditioned on the relation r, defined as

    PMI(t, h | r) = log p(t | h, r) − log p(t | r)

We can estimate these scores by using a masked bidirectional language model, P_cmp. In the case where the tail is a single word, the model allows us to evaluate the conditional likelihood of a single triple component p(t | h, r) by computing P_cmp(w_i = t | w_{1:i-1}, w_{i+1:m}) for the tail word.

In practice, the tail might be realized as a j-word phrase. To handle this complexity, we use a greedy approximation of its probability. We first mask all of the tail words and compute the probability of each. We then find the word with the highest probability p_k, substitute it back in, and repeat j times. Finally, we calculate the total conditional likelihood of the tail as the product of these terms,

    p(t | h, r) = ∏_{k=1}^{j} p_k

The marginal p(t | r) is computed similarly, but in this case we mask the head throughout. For example, to compute the marginal tail probability for the sentence "You are likely to find a ferret in the pet store", we mask both the head and the tail and then sequentially unmask the tail words only: "You are likely to find a κ_h1 in the κ_t1 κ_t2". If κ_t2 = "store" has a higher probability than κ_t1 = "pet", we unmask "store" and compute "You are likely to find a κ_h1 in the κ_t1 store". The marginal likelihood p(t | r) is then the product of the two probabilities.

The final score combines the marginal and conditional likelihoods by employing a weighted form of the point-wise mutual information,

    PMI_λ(t, h | r) = λ log p(t | h, r) − log p(t | r)

where λ is treated as a hyperparameter. Although exact PMI is symmetrical, the approximate model itself is not. We therefore average PMI_λ(t, h | r) and PMI_λ(h, t | r) to reduce the variance of our estimates, computing the masked head values rather than the tail values in the latter.
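The sketch below shows one way the greedy approximation and the weighted PMI could be computed with a masked language model. It is our reading of the procedure rather than the authors' implementation: it assumes pre-tokenized, single-wordpiece head and tail words, omits the head/tail averaging described above, and uses a toy sentence of our own.

```python
import math
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
model = BertForMaskedLM.from_pretrained("bert-large-uncased")
model.eval()

def greedy_log_prob(tokens, positions, targets):
    """Greedily unmask `targets` at `positions`, most confident word first,
    accumulating log p of each word given the current partially masked context."""
    tokens, remaining, total = list(tokens), dict(zip(positions, targets)), 0.0
    while remaining:
        ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
        with torch.no_grad():
            probs = torch.softmax(model(ids).logits[0], dim=-1)
        scores = {p: probs[p, tokenizer.convert_tokens_to_ids(w)].item()
                  for p, w in remaining.items()}
        p = max(scores, key=scores.get)      # highest-probability masked word
        total += math.log(scores[p])
        tokens[p] = remaining.pop(p)         # substitute it back in and repeat
    return total

def weighted_pmi(tokens, head_pos, tail_pos, lam=1.0):
    """PMI_lambda(t, h | r) = lam * log p(t|h,r) - log p(t|r)."""
    tails = [tokens[p] for p in tail_pos]
    masked = list(tokens)
    for p in tail_pos:
        masked[p] = tokenizer.mask_token
    conditional = greedy_log_prob(masked, tail_pos, tails)
    for p in head_pos:                       # mask the head throughout
        masked[p] = tokenizer.mask_token
    marginal = greedy_log_prob(masked, tail_pos, tails)
    return lam * conditional - marginal

toks = ["[CLS]"] + tokenizer.tokenize("you are likely to find a dog in the park") + ["[SEP]"]
print(weighted_pmi(toks, head_pos=[toks.index("dog")], tail_pos=[toks.index("park")]))
```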
3 Experiments

To evaluate the Coherency Ranking approach, we measure whether it can distinguish between valid and invalid triples. For our masked model, we use BERT-large (Devlin et al., 2018). For sentence ranking, we use the GPT-2 117M LM (Radford et al., 2019). The relation templates and grammar transformation rules which we use can be found in the supplementary materials.

We compare the proposed method to several baselines. Following Trinh and Le (2018), we evaluate a simple Concatenation method for generating sentences, splitting the relation r into separate words and concatenating it with the head and tail. For the triple (ferret, AtLocation, pet store), the Concatenation approach would yield "ferret at location pet store".

We also evaluate CKBC performance when we construct sentences by applying a single hand-crafted template. Since each triple is mapped to a sentence with a single template without any grammatical transformations, we refer to this as the Template method. Using the Template approach, (ferret, AtLocation, pet store) would become "You are likely to find ferret in pet store" using the template "you are likely to find HEAD in TAIL".

Next, we extend the Template method by applying deterministic grammatical transformations, which we refer to as the Template + Grammar approach. Like the full approach, these transformations involve adding articles before nouns, converting verbs into gerunds, and pluralizing nouns following numbers. The Template + Grammar approach differs from Coherency Ranking in that all transformations are applied to every sentence, instead of applying combinations of transformations and templates which are then ranked by a language model. Returning to our example, the Template + Grammar method produces "You are likely to find a ferret in a pet store". While this sentence is grammatical, applying this method to (star, AtLocation, outer space) yields "You are likely to find a star in an outer space", which is incorrect.

We compare our results to the supervised models from the work of Jastrzebski et al. (2018) and the best performing model from Li et al. (2016). Jastrzebski et al. (2018) introduce Factorized and Prototypical models. The Factorized model embeds the head, relation, and tail in a vector space and then produces a score by taking a linear combination of the inner products between each pair of embeddings. The Prototypical model is similar, but does not include the inner product between head and tail. Li et al. (2016) evaluate a deep neural network (DNN) for CKBC. They concatenate embeddings for the head, relation, and tail, which they then feed through a multilayer perceptron with one hidden layer. All three models are trained on 100,000 ConceptNet triples.

Task 1: Commonsense Knowledge Base Completion

Our experimental setup follows Li et al. (2016), evaluating our model with their test set (n = 2400) containing an equal number of valid and invalid triples. The valid triples are from the crowd-sourced Open Mind Common Sense (OMCS) entries in the ConceptNet 5 dataset (Speer and Havasi, 2012). Invalid triples are generated by replacing an element of a valid tuple with another randomly selected element.

We use our scoring method to classify each tuple as valid or invalid. To this end, we use our method to assign a score to each tuple and then group the resulting scores into two clusters. Instances in the cluster with the higher mean PMI are labeled as valid, and the remainder are labeled as invalid. We use expectation-maximization with a mixture of Gaussians to cluster. We also tune the PMI weight via grid search over 90 points from λ ∈ [0.5, 5.0], using the Akaike information criterion of the Gaussian mixture model for evaluation (Akaike, 1974).

Table 2 shows the full results. Our unsupervised approach achieves a test set F1 score of 78.8, comparable to the 79.4 F1 score found by the supervised Prototypical approach. The Factorized and DNN models significantly outperformed our approach, with F1 scores of 89.0 and 89.2, respectively. Our grid search found an optimal λ value of 1.65 for the Concatenation sentence generation model and 1.55 for the Coherency Ranking model. The Template and Template + Grammar methods found λ values of 1.20 and 0.95, respectively.

    Model                   Task 1 (F1)   Task 2 (quality)
    Unsupervised
      Concatenation            68.8         2.95 ± 0.11
      Template                 72.2         2.98 ± 0.11
      Template + Grammar       74.4         2.56 ± 0.13
      Coherency Ranking        78.8         3.00 ± 0.12
    Supervised
      DNN                      89.2         2.50
      Factorized               89.0         2.61
      Prototypical             79.4         2.55

Table 2: Main results for Task 1, commonsense knowledge base completion (test F1 score), and Task 2, Wikipedia mining (quality scores out of 4). Results are included from the sentence generation methods of simple concatenation, hand-crafted templates, templates plus grammatical transformations, and coherency ranking. The DNN, Factorized, and Prototypical models are described in Jastrzebski et al. (2018).
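As a rough sketch of the clustering and λ selection just described (our own reconstruction with placeholder scores, not the paper's code), one can fit a two-component Gaussian mixture to the PMI scores for each candidate λ, keep the λ with the lowest AIC, and label the higher-mean cluster as valid:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def label_triples(score_fn, triples, lambdas=np.linspace(0.5, 5.0, 90)):
    """score_fn(triple, lam) -> weighted PMI; returns validity labels and chosen lambda."""
    best = None
    for lam in lambdas:
        scores = np.array([score_fn(t, lam) for t in triples]).reshape(-1, 1)
        gmm = GaussianMixture(n_components=2, random_state=0).fit(scores)
        aic = gmm.aic(scores)
        if best is None or aic < best[0]:
            best = (aic, lam, gmm, scores)
    _, lam, gmm, scores = best
    valid = int(np.argmax(gmm.means_.ravel()))   # higher-mean cluster = valid
    return gmm.predict(scores) == valid, lam

# Toy usage: random numbers stand in for the PMI scores of 100 candidate triples.
rng = np.random.default_rng(0)
toy_scores = np.concatenate([rng.normal(0, 1, 50), rng.normal(4, 1, 50)])
labels, best_lam = label_triples(lambda t, lam: lam * t, toy_scores)
```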

Task 2: Mining Wikipedia

To assess the model's ability to generalize to unseen data, we evaluate our unsupervised model in comparison to previous supervised methods on the task of mining commonsense knowledge from Wikipedia. In their evaluations, Li et al. (2016) curate a set of 1.7M triples across 10 relations by applying part-of-speech patterns to Wikipedia articles. We sample 300 triples from each relation and apply our method to evaluate these 3,000 triples. Following the approach described by Speer and Havasi (2012), and used by Li et al. (2016) and Jastrzebski et al. (2018), two human annotators manually rate the 100 triples with the highest predicted score on a 0 to 4 scale: 0 (doesn't make sense), 1 (not true), 2 (opinion/don't know), 3 (sometimes true), and 4 (generally true). We tuned λ by measuring the quality of the 100 triples with the highest predicted score across λ ∈ {1, 2, ..., 10}.

The top 100 triples selected by our model were assigned a mean rating of 3.00 (λ = 4) with a standard error of 0.11 under the Coherency Ranking approach, well exceeding the performance of current supervised methods (Table 2). Standard errors were calculated using 1,000 bootstrap samples of the top 100 triples. The ratings assigned by the two human annotators had a 0.50 Pearson correlation and 0.23 kappa inter-annotator agreement. Rater disagreements occur most frequently when triples are ambiguous or difficult to interpret; notably, if we bucket the five scores into just two categories of true and false, this disagreement rate drops by 50%. To give a sense of the types of commonsense knowledge our model struggles to capture, we report the top 100 most confident predictions that receive an average score below 3 in the supplementary material. Notably, some of the top 100 triples our model identified were indeed true, but would not reasonably be considered common sense (e.g. (vector bundle, HasProperty, manifold)). This suggests that our approach may be applicable to mining knowledge beyond common sense.
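The bootstrap standard error over the top-100 ratings is straightforward to reproduce; below is a small sketch with placeholder ratings (the actual annotations are not reproduced here):

```python
import numpy as np

def bootstrap_se(ratings, n_boot=1000, seed=0):
    """Standard error of the mean rating via resampling with replacement."""
    rng = np.random.default_rng(seed)
    ratings = np.asarray(ratings, dtype=float)
    means = [rng.choice(ratings, size=ratings.size, replace=True).mean()
             for _ in range(n_boot)]
    return float(np.std(means))

# Placeholder ratings on the paper's 0-4 scale, not the actual annotations.
fake_ratings = np.random.default_rng(1).integers(2, 5, size=100)
print(bootstrap_se(fake_ratings))
```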
Analysis: Sentence Generation

In order to measure the impact of sentence generation on our model, we select a sample of 100 sentences and group the results by a) whether the sentence contained a grammatical error, and b) whether the sentence misrepresented the meaning of the triple. For example, the triple (golf, HasProperty, good) yields the sentence "golf is a good", which is grammatically correct but conveys the wrong meaning. On both Wikipedia mining and CKBC, we find that misrepresenting meaning has an adverse impact on model performance. In CKBC, we also find that grammar has a high impact on the resulting F1 scores (Table 3). Future work could therefore focus on designing templates that more reliably encode a relation's true meaning.

    Task 1               N (/100)   F1 Score
      Grammatical            75       79.1
      Ungrammatical          25       66.7
      Correct meaning        91       77.6
      Wrong meaning           9       66.7

    Task 2 - Quality     N (/100)   Quality
      Grammatical            83       3.01
      Ungrammatical          17       2.88
      Correct meaning        88       3.22
      Wrong meaning          12       1.18

Table 3: Test results examining the effect of sentence meaning and grammaticality on task performance. Scores are shown for a sample of 100 triples, split by whether the generated sentence is grammatical and whether it conveys the correct meaning of the triple.

4 Conclusion

We introduce a robust unsupervised method for commonsense knowledge base completion using the world knowledge of pre-trained language models. We develop a method for expressing knowledge triples as sentences. Using a bidirectional masked language model on these sentences, we can then estimate the weighted point-wise mutual information of a triple as a proxy for its validity. Though our approach performs worse on a held-out test set developed by Li et al. (2016), it does so without any previous exposure to the ConceptNet database, ensuring that this performance is not biased. In the future, we hope to explore whether this approach can be extended to mining facts that are not common sense and to generating new commonsense knowledge outside of any given database of candidate triples. We also see potential benefit in the development of a more expansive set of evaluation methods for commonsense knowledge mining, which would strengthen the validity of our conclusions.

Acknowledgments

This work was supported by NSF research award 1845664.

References

Hirotugu Akaike. 1974. A new look at the statistical model identification. In Selected Papers of Hirotugu Akaike, pages 215–222. Springer.
Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Çelikyilmaz, and Yejin Choi. 2019. COMET: Commonsense transformers for automatic knowledge graph construction. CoRR, abs/1906.05317.

Ernest Davis and Gary Marcus. 2015. Commonsense reasoning and commonsense knowledge in artificial intelligence. Commun. ACM, 58(9):92–103.

Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. 2018. Convolutional 2D knowledge graph embeddings. In Thirty-Second AAAI Conference on Artificial Intelligence.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jonathan Gordon and Benjamin Van Durme. 2013. Reporting bias and knowledge acquisition. In Proceedings of the 2013 Workshop on Automated Knowledge Base Construction, pages 25–30. ACM.

Stanisław Jastrzebski, Dzmitry Bahdanau, Seyedarian Hosseini, Michael Noukhovitch, Yoshua Bengio, and Jackie Chi Kit Cheung. 2018. Commonsense mining as knowledge base completion? A study on the impact of novelty. arXiv preprint arXiv:1804.09259.

Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd Schema Challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.

Xiang Li, Aynaz Taheri, Lifu Tu, and Kevin Gimpel. 2016. Commonsense knowledge base completion. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1445–1455.

Hugo Liu and Push Singh. 2004. ConceptNet: a practical commonsense reasoning tool-kit. BT Technology Journal, 22(4):211–226.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and evaluation framework for deeper understanding of commonsense stories. arXiv preprint arXiv:1604.01696.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1:8.

Itsumi Saito, Kyosuke Nishida, Hisako Asano, and Junji Tomita. 2018. Commonsense knowledge base completion and generation. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 141–150, Brussels, Belgium. Association for Computational Linguistics.

Roy Schwartz, Maarten Sap, Ioannis Konstas, Leila Zilles, Yejin Choi, and Noah A. Smith. 2017. The effect of different writing tasks on linguistic style: A case study of the ROC story cloze task. CoRR, abs/1702.01841.

Robert Speer and Catherine Havasi. 2012. Representing general relational knowledge in ConceptNet 5. In LREC, pages 3679–3686.

Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. 2015. Representing text for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509.

Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. CoRR, abs/1806.02847.

Trieu H. Trinh and Quoc V. Le. 2019. Do language models have common sense?

Tom Young, Erik Cambria, Iti Chaturvedi, Hao Zhou, Subham Biswas, and Minlie Huang. 2018. Augmenting end-to-end dialogue systems with commonsense knowledge. In Thirty-Second AAAI Conference on Artificial Intelligence.
