Commonsense Knowledge Base Completion

Xiang Li∗‡ Aynaz Taheri† Lifu Tu‡ Kevin Gimpel‡ ∗University of Chicago, Chicago, IL, 60637, USA †University of Illinois at Chicago, Chicago, IL, 60607, USA ‡Toyota Technological Institute at Chicago, Chicago, IL, 60637, USA [email protected], [email protected], {lifu,kgimpel}@ttic.edu

Abstract relation right term conf. MOTIVATEDBYGOAL relax 3.3 USEDFOR relaxation 2.6 We enrich a curated resource of common- MOTIVATEDBYGOAL your muscle be sore 2.3 sense knowledge by formulating the prob- HASPREREQUISITE go to spa 2.0 lem as one of knowledge base comple- CAUSES get pruny skin 1.6 HASPREREQUISITE change into swim suit 1.6 tion (KBC). Most work in KBC focuses on knowledge bases like that re- Table 1: ConceptNet tuples with left term “soak in late entities drawn from a fixed set. How- hotspring”; final column is confidence score. ever, the tuples in ConceptNet (Speer and Havasi, 2012) define relations between an unbounded set of phrases. We develop tain resources, researchers have developed meth- neural network models for scoring tuples ods to automatically increase coverage by infer- on arbitrary phrases and evaluate them by ring missing entries. These methods are com- their ability to distinguish true held-out monly categorized under the heading of knowl- tuples from false ones. We find strong edge base completion (KBC). KBC is widely- performance from a bilinear model using studied for knowledge bases like Freebase (Bol- a simple additive architecture to model lacker et al., 2008) which contain large sets of enti- phrases. We manually evaluate our trained ties and relations among them (Mintz et al., 2009; model’s ability to assign quality scores to Nickel et al., 2011; Riedel et al., 2013; West et novel tuples, finding that it can propose tu- al., 2014), including recent work using neural net- ples at the same quality level as medium- works (Socher et al., 2013; Yang et al., 2014). confidence tuples from ConceptNet. We improve the coverage of commonsense re- sources by formulating the problem as one of 1 Introduction knowledge base completion. We focus on a par- Many ambiguities in natural language process- ticular curated commonsense resource called Con- ing (NLP) can be resolved by using knowledge ceptNet (Speer and Havasi, 2012). ConceptNet of various forms. Our focus is on the type of contains tuples consisting of a left term, a rela- knowledge that is often referred to as “common- tion, and a right term. The relations come from sense” or “background” knowledge. This knowl- a fixed set. While terms in Freebase tuples are en- edge is rarely expressed explicitly in textual cor- tities, ConceptNet terms can be arbitrary phrases. pora (Gordon and Van Durme, 2013). Some re- Some examples are shown in Table 1. An NLP ap- searchers have developed techniques for inferring plication may wish to query ConceptNet for infor- this knowledge from patterns in raw text (Gor- mation about soaking in a hotspring, but may use don, 2014; Angeli and Manning, 2014), while oth- different words from those contained in the Con- ers have developed curated resources of common- ceptNet tuples. Our goal is to do on-the-fly knowl- sense knowledge via manual annotation (Lenat edge base completion so that queries can be an- and Guha, 1989; Speer and Havasi, 2012) or swered robustly without requiring the precise lin- games with a purpose (von Ahn et al., 2006). guistic forms contained in ConceptNet. Curated resources typically have high preci- To do this, we develop neural network mod- sion but suffer from a lack of coverage. For cer- els to embed terms and provide scores to arbi- trary tuples. We train them on ConceptNet tuples ing, though our methods could be applied to the and evaluate them by their ability to distinguish output of these or other extraction systems. true and false held-out tuples. We consider sev- Our goals are similar to those of the Analogy- eral functional architectures, comparing two com- Space method (Speer et al., 2008), which uses ma- position functions for embedding terms and two trix factorization to improve coverage of Concept- functions for converting term embeddings into tu- Net. However, AnalogySpace can only return a ple scores. We find that all architectures are able confidence score for a pair of terms drawn from to outperform several baselines and reach similar the training set. Our models can assign scores to performance on classifying held-out tuples. tuples that contain novel terms (as long as they We also experiment with several training ob- consist of words in our vocabulary). jectives for KBC, finding that a simple cross en- Though we use ConceptNet, similar techniques tropy objective with randomly-generated negative can be applied to other curated resources like examples performs best while also being fastest. WordNet (Miller, 1995) and FrameNet (Baker et We manually evaluate our trained model’s abil- al., 1998). For WordNet, tuples can contain lexi- ity to assign quality scores to novel tuples, find- cal entries that are linked via synset relations (e.g., ing that it can propose tuples at the same qual- “hypernym”). WordNet contains many multi- ity level as medium-confidence tuples from Con- word entries (e.g., “cold sweat”), which can be ceptNet. We release all of our resources, includ- modeled compositionally by our term models; al- ing our ConceptNet KBC task data, large sets of ternatively, entire glosses could be used as terms. randomly-generated tuples scored with our model, To expand frame relationships in FrameNet, tuples training code, and pretrained models with code for can draw relations from the frame relation types calculating the confidence of novel tuples.1 (e.g., “is causative of”) and terms can be frame lexical units or their definitions. 2 Related Work Several researchers have used commonsense knowledge to improve language technologies, in- Our methods are similar to past work on cluding sentiment analysis (Cambria et al., 2012; KBC (Mintz et al., 2009; Nickel et al., 2011; Lao Agarwal et al., 2015), semantic similarity (Caro et et al., 2011; Nickel et al., 2012; Riedel et al., 2013; al., 2015), and speech recognition (Lieberman et Gardner et al., 2014; West et al., 2014), particu- al., 2005). Our hope is that our models can en- larly methods based on distributed representations able many other NLP applications to benefit from and neural networks (Socher et al., 2013; Bordes commonsense knowledge. et al., 2013; Bordes et al., 2014a; Bordes et al., Our work is most similar to that of Angeli and 2014b; Yang et al., 2014; Neelakantan et al., 2015; Manning (2013). They also developed methods Gu et al., 2015; Toutanova et al., 2015). Most prior to assess the plausibility of new facts based on work predicts new relational links between terms a training set of facts, considering commonsense drawn from a fixed set. In a notable exception, data from ConceptNet in one of their settings. Neelakantan and Chang (2015) add new entities Like us, they can handle an unbounded set of terms to KBs using external resources along with prop- by using (simple) composition functions for novel erties of the KB itself. Relatedly, Yao et al. (2013) terms, which is rare among work in KBC. One key induce an unbounded set of entity categories and difference is that their best method requires iterat- associate them with entities in KBs. ing over the KB at test time, which can be com- Several researchers have developed techniques putationally expensive with large KBs. Our mod- for discovering commonsense knowledge from els do not require iterating over the training set. text (Gordon et al., 2010; Gordon and Schu- We compare to several baselines inspired by their bert, 2012; Gordon, 2014; Angeli and Manning, work, and we additionally evaluate our model’s 2014). Open information extraction systems like ability to score novel tuples derived from both REVERB (Fader et al., 2011) and NELL (Carl- ConceptNet and Wikipedia. son et al., 2010) find tuples with arbitrary terms and relations from raw text. In contrast, we start 3 Models with a set of commonsense facts to use for train-

1Available at http://ttic.uchicago.edu/ Our goal is to represent commonsense knowledge ˜kgimpel/commonsense.html. such that it can be used for NLP tasks. We as- sume this knowledge is given in the form of tuples where a is a nonlinear activation function (tuned ht1, R, t2i, where t1 is the left term, t2 is the right among ReLU, tanh, and logistic sigmoid) and term, and R is a (directed) relation that exists be- where we have introduced additional parameters tween the terms. Examples are shown in Table 1.2 W (B) and b(B). This gives us the following model: Given a set of tuples, our goal is to develop a > parametric model that can provide a confidence scorebilinear(t1, R, t2) = u1 MR u2 score for new, unseen tuples. That is, we want to design and train models that define a function When using the LSTM, we tune the decision about how to produce the final term vectors to pass to the score(t1, R, t2) that provides a quality score for bilinear model, including possibly using the final an arbitrary tuple ht1, R, t2i. These models will be evaluated by their ability to distinguish true held- vectors from each direction and the output of max out tuples from false ones. or average pooling. We use the same LSTM pa- We describe two model families for scoring tu- rameters for each term. ples. We assume that we have embeddings for 3.2 Deep Neural Network Models words and define models that use these word em- beddings to score tuples. So our models are lim- Our second family of models is based on deep neu- ited to tuples in which terms consist of words ral networks (DNNs). While bilinear models have in the vocabulary, though future been shown to work well for KBC, their functional work could consider character-based architectures form makes restrictions about how terms can inter- for open-vocabulary modeling (Huang et al., 2013; act. DNNs make no such restrictions. Ling et al., 2015). As above, we define two models, one based on using word averaging for the term model (DNN 3.1 Bilinear Models AVG) and one based on LSTMs (DNN LSTM). We first consider bilinear models, since they have For the DNN AVG model, we obtain the term vec- been found useful for KBC in past work (Nickel tors v1 and v2 by averaging word vectors in the re- et al., 2011; Jenatton et al., 2012; Garc´ıa-Duran´ et spective terms. We then concatenate v1, v2, and a al., 2014; Yang et al., 2014). A bilinear model has relation vector vR to form the input of the DNN, denoted vin . The DNN uses a single hidden layer: the following form for a tuple ht1, R, t2i: (D1) (D1) > u = a(W vin + b ) v1 MR v2 (D2) (D2) scoreDNN(t1, R, t2) = W u + b (1) r where v1 ∈ R is the (column) vector representing r r×r t1, v2 ∈ R is the vector for t2, and MR ∈ R where a is again a (tuned) nonlinear activation is the parameter matrix for relation R. function. The size of the hidden vector u is To convert terms t1 and t2 into term vectors v1 tuned, but the output dimensionality (the numbers (D2) (D2) and v2, we consider two possibilities: word aver- of rows in W and b ) is fixed to 1. We do aging and a bidirectional long short-term memory not use a nonlinear activation for the final layer (LSTM) recurrent neural network (Hochreiter and since our goal is to output a scalar score. Schmidhuber, 1997). This provides us with two For the DNN LSTM model, we first create a models: Bilinear AVG and Bilinear LSTM. single vector for the two terms using an LSTM. One downside of this architecture is that as the That is, we concatenate t1, a delimiter token, and length of the term vectors grows, the size of the re- t2 to create a single word sequence. We use a bidi- lation matrices grows quadratically. This can slow rectional LSTM to convert this word sequence to down training while requiring more data to learn a vector, again possibly using pooling (the deci- the large numbers of parameters in the matrices. sion is tuned; details below). We concatenate the To address this, we include an additional nonlin- output of this bidirectional LSTM with the rela- ear transformation of each term: tion vector vr to create the DNN input vector vin , then use Eq. 1 to obtain a score. We found this (B) (B) ui = a(W vi + b ) to work better than separately using an LSTM on each term. We can not try this for the Bilinear 2These examples are from the Open Mind Common Sense (OMCS) part of ConceptNet version 5 (Speer and Havasi, LSTM model since its functional form separates 2012). In our experiments below, we only use OMCS tuples. the two term vectors. The relation vectors vR are learned in addition loss). For a mini-batch of size β, there are β pos- to the DNN parameters W (D1), W (D2), b(D1), itive examples and 3β negative examples used for b(D2), and the LSTM parameters (in the case of training. The loss is summed over these 4β exam- the DNN LSTM model). Also, word embedding ples yielded by each mini-batch. parameters are updated in all settings. 4.3 Negative Examples 4 Training For the loss functions above, we need ways of au- Given a tuple training set T , we train our models tomatically generating negative examples. For ef- using two different loss functions: hinge loss and ficiency, we consider using the current mini-batch a binary cross entropy function. Both rely on ways only, as our models are trained using optimiza- of generating negative examples (Section 4.3). tion on mini-batches. We consider the follow- Both also use regularization (Section 4.4). ing three strategies to construct negative examples. Each strategy constructs three negative examples 4.1 Hinge Loss for each positive example τ: one by replacing t1, Given a training tuple τ = ht1, R, t2i, the hinge one by replacing R, and one by replacing t2. loss seeks to make the score of τ larger than the Random sampling. We create the three negative score of negative examples by a margin of at least examples for τ by replacing each component with γ. This corresponds to minimizing the following its counterpart in a randomly-chosen tuple in the loss, summed over all examples τ ∈ T : same mini-batch. loss (τ) = hinge Max sampling. We create the three negative ex- max{0, γ − score(τ) + score(τneg(t1))} amples for τ by replacing each component with

+ max{0, γ − score(τ) + score(τneg(R))} its counterpart in some other tuple in the mini- + max{0, γ − score(τ) + score(τ )} batch, choosing the substitution to maximize the neg(t2) score of the resulting negative example. For ex- ample, when swapping out t in τ = ht , R, t i, where τneg(t1) is the negative example obtained by 1 1 2 0 replacing t1 in τ with some other t1, and τneg(R) we choose the substitution t1 as follows: and τneg(t ) are defined analogously for the rela- 2 0 tion and right term. We describe how we generate t1 = argmax score(t, R, t2) t:ht,R0,t0 i∈µ\τ these negative examples in Section 4.3 below. 2

4.2 Binary Cross Entropy where µ is the current mini-batch of tuples. We perform the analogous procedure for R and t2. Though we only have true tuples in our training set, we can create a binary classification problem Mix sampling. This is a mixture of the above, by assigning a label of 1 to training tuples and a using random sampling 50% of the time and max label of 0 to negative examples. Then we can min- sampling the remaining 50% of the time. imize cross entropy (CE) as is common when us- ing neural networks for classification. To generate 4.4 Regularization negative examples, we consider the methods de- We use L2 regularization. For the DNN models, scribed in Section 4.3 below. We also need to con- we add the penalty term λkθk2 to the losses, where vert our models’ scores into probabilities, which λ is the regularization coefficient and θ contains all we do by using a logistic sigmoid σ on score. We other parameters. However, for the bilinear mod- denote the label as `, where the label is 1 if the tu- els we regularize the relation matrices MR toward ple is from the training set and 0 if it is a negative the identity matrix instead of all zeroes, adding the example. Then the loss is defined: following to the loss: lossCE(τ, `) = 2 X 2 λ1kθk + λ2 kMR − Irk2 −` log σ(score(τ)) − (1−`) log(1 − σ(score(τ))) R

When using this loss, we generate three negative where Ir is the r × r identity matrix, the summa- examples for each positive example (one for swap- tion is performed over all relations R, and θ repre- ping each component of the tuple, as in the hinge sents all other parameters. 5 Experimental Setup 5.2 Word Embeddings Our tuple models rely on initial word embeddings. We now evaluate our tuple models. We measure To help our models better capture the common- whether our models can distinguish true and false sense knowledge in ConceptNet, we generated tuples by training a model on a large set of tuples word embedding training data using the OMCS and testing on a held-out set. sentences underlying our training tuples (we ex- cluded the top 2400 tuples which were used for 5.1 Task Design creating DEV1, DEV2, and TEST). We created The tuples are obtained from the Open Mind Com- training data by merging the information in the mon Sense (OMCS) entries in the ConceptNet tuples and their OMCS sentences. Our goal was 5 dataset (Speer and Havasi, 2012). They are to combine the grammatical context of the OMCS sorted by a confidence score. The most confident sentences with the words in the actual terms, so as 1200 tuples were reserved for creating our test set to ensure that we learn embeddings for the words (TEST). The next most confident 600 tuples (i.e., in the terms. We also insert the relations into the those numbered 1201–1800) were used to build a OMCS sentences so that we can learn embeddings development set (DEV1) and the next most confi- for the relations themselves. dent 600 (those numbered 1801–2400) were used We describe the procedure by example and also to build a second development set (DEV2). release our generated data for ease of replication. For each set S (S ∈ {DEV1, DEV2, TEST}), for The tuple hsoak in a hotspring, CAUSES, get pruny each tuple τ ∈ S, we created a negative example skini was automatically extracted/normalized (by and added it to S. So each set doubled in size. To the ConceptNet developers) from the OMCS sen- create a negative example from τ ∈ S, we ran- tence “The effect of [soaking in a hotspring] is domly swapped one of the components of τ with [getting pruny skin]” where brackets surround another tuple τ 0 ∈ S. One third of the time we terms. We replace the bracketed portions with 0 swapped t1 in τ for t1 in τ , one third of the time their corresponding terms and insert the relation we swapped their R’s, and the remaining third of between them: “The effect of soak in a hotspring the time we swapped their t2’s. Thus, distinguish- CAUSES get pruny skin”. We do this for all train- 3 ing positive and negative examples in this task is ing tuples. similar to the objectives optimized during training. We used the word2vec (Mikolov et al., 2013) Each of DEV1 and DEV2 has 1200 tuples (600 toolkit to train skip-gram word embeddings on positive examples and 600 negative examples), this data. We trained for 20 iterations, using a while TEST has 2400 tuples (1200 positive and dimensionality of 200 and a window size of 5. 1200 negative). For training data, we selected We refer to these as “CN-trained” embeddings for 100,000 tuples from the remaining tuples (num- the remainder of this paper. Similar approaches bered 2401 and beyond). have been used to learn embeddings for partic- The task is to separate the true and false tuples ular downstream tasks, e.g., dependency pars- in our test set. That is, the labels are 1 for true ing (Bansal et al., 2014). We use our CN-trained tuples and 0 for false tuples. Given a model for embeddings within baseline methods and also pro- scoring tuples, we select a threshold by maximiz- vide the initial word embeddings of our models. For all of our models, we update the initial word ing accuracy on DEV1 and report accuracies on embeddings during learning. DEV2. This is akin to learning the bias feature weight (using DEV1) of a linear classifier that uses In the baseline methods described below, we our model’s score as its only feature. We tuned compare our CN-trained embeddings to pretrained several choices—including word embeddings, hy- word embeddings. We use the GloVe (Penning- perparameter values, and training objectives—on ton et al., 2014) embeddings trained on 840 bil- DEV2 and report final performance on TEST. One lion tokens of Common Crawl web text and the annotator (a native English speaker) attempted the PARAGRAM-SimLex embeddings of Wieting et al. same classification task on a sample of 100 tu- (2015), which were tuned to have strong perfor- ples from DEV2 and achieved an accuracy of 95%. mance on the SimLex-999 task (Hill et al., 2015). We release these datasets to the community so that 3For reversed relations, indicated by an asterisk in the others can work on this same task. OMCS sentences, we swap t1 and t2 in the tuple. 5.3 Baselines u2 above), the activation a, the hidden layer size g We consider three baselines inspired by those of for the DNN models, the relation vector length d Angeli and Manning (2013): for the DNN models, the LSTM hidden vector size h for models with LSTMs, the mini-batch size β, • Similar Fact Count (Count): For each tuple the regularization parameters λ, λ1, and λ2, and τ = ht1, R, t2i in the evaluation set, we count the AdaGrad learning rate α. the number of similar tuples in the training set. All tuning used early stopping: periodically 0 0 0 0 A training tuple τ = ht1,R , t2i is considered during training, we used the current model to 0 “similar” to τ if R = R , one of the terms find the optimal threshold on DEV1 and evaluated matches exactly, and the other term has the same on DEV2. Due to computational limitations, we 0 0 head word. That is, (R = R ) ∧ (t1 = t1) ∧ were unable to perform thorough grid searches for 0 0 (head(t2) = head(t2)), or (R = R ) ∧ (t2 = all hyperparameters. We combined limited grid 0 0 t2) ∧ (head(t1) = head(t1)). The head word searches with greedy hyperparameter tuning based for a term was obtained by running the Stanford on regions of values that were the most promising. Parser (Klein and Manning, 2003) on the term. For the Bilinear LSTM and DNN LSTM, we This baseline does not use word embeddings. did hyperparameter tuning by training on the full • Argument Similarity (ArgSim): This baseline training set of 100,000 tuples for 20 epochs, com- computes the cosine similarity of the vectors for puting DEV2 accuracy once per epoch. For the av- t1 and t2, ignoring the relation. Vectors for t1 eraging models, we tuned by training on a subset and t2 are obtained by word averaging. of 1000 tuples with β = 200 for 20 epochs; the • Max Similarity (MaxSim): For tuple τ in an averaging models showed more stable results and evaluation set, this baseline outputs the maxi- did not require the full training set for tuning. Be- mum similarity between τ and any tuple in the low are the tuned hyperparameter values: training set. The similarity is computed by con- • Bilinear AVG: for CE: r = 150, a = tanh, catenating the vectors for t , R, and t , then 1 2 β = 200, α = 0.01, λ = λ = 0.001. Hinge computing cosine similarity. As in ArgSim, 1 2 loss: same values as above except α = 0.005. we obtain vectors for terms by averaging their words. We only consider R when using our • Bilinear LSTM: for CE: r = 50, a = ReLU, CN-trained embeddings since they contain em- h = 200, β = 800, α = 0.02, and λ1 = λ2 = beddings for the relations. When using GloVe 0.00001. To obtain vectors from the term bidi- and PARAGRAM embeddings for this baseline, rectional LSTMs, we used max pooling. For we simply use the two term vectors (still con- hinge loss: r = 50, a = tanh, h = 200, structed via averaging the words in each term). β = 400, α = 0.007, λ1 = 0.00001, and λ2 = 0.01. To obtain vectors from the term We chose these baselines because they can all han- bidirectional LSTMs, we used the concatena- dle unbounded term sets but differ in their other tion of average pooling and final hidden vectors requirements. ArgSim and MaxSim use word in each direction. For each sampling method embeddings while Count does not. Count and and loss function, α was tuned by grid search MaxSim require iterating over the training set dur- with the others fixed to the above values. ing inference while ArgSim does not. • DNN AVG: for both losses: a = ReLU, d = For each baseline, we tuned a threshold on 200, g = 1000, β = 600, α = 0.01, λ = 0.001. DEV1 to maximize classification accuracy then • DNN LSTM: for both losses: a = ReLU, d = tested on DEV2 and TEST. 200, bidirectional LSTM hidden layer size h = 5.4 Training and Tuning 200, hidden layer dimension g = 800, β = 400, α = 0.005, and λ = 0.00005. To get vectors We used AdaGrad (Duchi et al., 2011) for opti- from the term LSTMs, we used max pooling. mization, training for 20 epochs through the train- ing tuples. We separately tuned hyperparameters 6 Results for each model and training objective. We tuned the following hyperparameters: the relation ma- Word Embedding Comparison. We first evalu- trix size r for the bilinear models (also the length ate the quality of our word embeddings trained on of the transformed term vectors, denoted u1 and the ConceptNet training tuples. Table 2 compares GloVe PARAGRAM CN-trained DEV2 TEST ArgSim 68 69 73 Count 75.4 79.0 MaxSim 73 70 82 ArgSim 72.9 74.2 MaxSim 81.9 83.5 Table 2: Accuracies (%) on DEV2 of two base- Bilinear AVG 90.3 91.7 lines using three different sets of word embed- Bilinear LSTM 90.8 90.7 DNN AVG 91.3 92.0 dings. Our ConceptNet-trained embeddings out- DNN LSTM 88.1 89.2 perform GloVe and PARAGRAM embeddings. Bilinear AVG + data 91.8 92.5 human ∼95.0 — DNN AVG DNN LSTM CE hinge CE hinge Table 5: Accuracies (%) of baselines and final random 124 230 710 783 model configurations on DEV2 and TEST. “+ data” mix 20755 21045 25928 26380 max 39338 41867 49583 49427 uses enlarged training set of size 300,000, and then doubles this training set by including tuples with Table 4: Loss function runtime comparison (sec- conjugated forms; see text for details. Human per- onds per epoch) of the DNN models. formance on DEV2 was estimated from a sample of size 100. accuracies on DEV2 for the two baselines that use embeddings: ArgSim and MaxSim. We find modeling terms. We suggest two reasons for this. that pretrained GloVe and PARAGRAM embed- The first is that most terms are short, with an aver- dings perform comparably, but both are outper- age term length of 2.3 words in our training tu- formed by our ConceptNet-trained embeddings. ples. An LSTM may not be needed to capture We use the latter for the remaining experiments in long-distance properties. The second reason may this paper. be due to hyperparameter tuning. Recall that we used a greedy search for optimal hyperparameter Training Comparison. Table 3 shows the re- values; we found that models with LSTMs take sults of our models with the two loss functions and more time per epoch, more epochs to converge, three sampling strategies. We find that the binary and exhibit more hyperparameter sensitivity com- cross entropy loss with random sampling performs pared to models based on averaging. This may best across models. We note that our conclusion have contributed to inferior hyperparameter values differs from some prior work that found max or for the LSTM models. mix sampling to be better than random (Wieting et al., 2016). We suspect that this difference may We also trained the Bilinear AVG model on a stem from characteristics of the ConceptNet train- larger training set (row labeled “Bilinear AVG + ing data. It may often be the case that the max- data”). We note that the ConceptNet tuples typ- scoring negative example in the mini-batch is ac- ically contain unconjugated forms; we sought to tually a true fact, due to the generic nature of the use both conjugated and unconjugated words. We facts expressed. began with a larger training set of 300,000 tuples Table 4 shows a runtime comparison of the from ConceptNet, then augmented them to include losses and sampling strategies.4 We find random conjugated word forms as in the following exam- sampling to be orders of magnitude faster than the ple. For the tuple hsoak in a hotspring, CAUSES, others while also performing the best. get pruny skini obtained from the OMCS sentence “The effect of [soaking in a hotspring] is [get- Final Results. Our final results are shown in Ta- ting pruny skin]”, we generated an additional tuple ble 5. We show the DEV2 and TEST accuracies for hsoaking in a hotspring, CAUSES, getting pruny our baselines and for the best configuration (tuned skini. We thus created twice as many training tu- on DEV2) for each model. All models outperform ples. The results with this larger training set im- all baselines handily. Our models perform simi- proved from 91.7 to 92.5 on the test set. We re- larly, with the Bilinear models and the DNN AVG lease this final model to the research community. model all exceeding 90% on both DEV2 and TEST. We note that the strongest baseline, MaxSim, is We note that the AVG models performed a nonparametric model that requires iterating over strongly compared to those that used LSTMs for all training tuples to provide a score to new tuples. 4These experiments were performed using 2 threads on a This is a serious bottleneck for use in NLP appli- 3.40-GHz Intel Core i7-3770 CPU with 8 cores. cations that may need to issue large numbers of Bilinear AVG Bilinear LSTM DNN AVG DNN LSTM CE hinge CE hinge CE hinge CE hinge random 90 84 91 83 91 87 88 57 mix 90 83 90 87 90 78 82 63 max 86 75 65 66 61 52 56 52

Table 3: Accuracies (%) on DEV2 of models trained with two loss functions (cross entropy (CE) and hinge) and three sampling strategies (random, mix, and max). The best accuracy for each model is shown in bold. Cross entropy with random sampling is best across models and is also fastest (see Table 4). queries. Our models are parametric models that t1, R, t2 score bus,ISA, public transportation 0.95 can compress a large training set into a fixed num- bus,ISA, public transit 0.90 ber of parameters. This makes them extremely fast bus,ISA, mass transit 0.79 for answering queries, particularly the AVG mod- bus,ATLOCATION, downtown area 0.98 bus,ATLOCATION, subway station 0.98 els, enabling use in downstream NLP applications. bus,ATLOCATION, city center 0.94 bus,CAPABLEOF, low cost 0.72 7 Generating and Scoring Novel Tuples bus,CAPABLEOF, local service 0.65 bus,CAPABLEOF, train service 0.63 We now measure our model’s ability to score novel tuples generated automatically from ConceptNet Table 6: Top Wikipedia tuples for 3 relations with and Wikipedia. We first describe simple pro- t1 = bus, scored by Bilinear AVG model. cedures to generate candidate tuples from these two datasets. We then score the tuples using our al., 2003) on the terms in our ConceptNet training MaxSim baseline and the trained Bilinear AVG tuples. We enumerate the 50 most frequent term model.5 We evaluate the highly-scoring tuples us- pair tag sequences for each relation. We do lim- ing a small-scale manual evaluation. ited manual filtering of the frequent tag sequences, The DNN AVG and Bilinear AVG models namely removing the sequences “DT NN NN” and reached the highest TEST accuracies in our evalua- “DT JJ NN” for the ISA relation. We do this in or- tion, though in preliminary experiments we found der to reduce noise in the extracted tuples. To fo- that the Bilinear AVG model appeared to perform cus on finding nontrivial tuples, for each relation better when scoring the novel tuples described be- we retain the top 15 POS tag sequences in which low. We suspect this is because the DNN func- t or t has at least two words. tion class has more flexibility than the bilinear one. 1 2 When scoring novel tuples, many of which may be We then run the tagger on sentences from En- highly noisy, it appears that the constrained struc- glish Wikipedia. We extract word sequence pairs ture of the Bilinear AVG model makes it more ro- corresponding to the relation POS tag sequence bust to the noise. pairs, requiring that there be a gap of at least one word between the two terms. We then remove 7.1 Generating Tuples From ConceptNet word sequence pairs in which one term is solely In order to get new tuples, we automatically mod- one of the following words: be, the, always, there, ify existing ConceptNet tuples. We take an exist- has, due, however. We also remove tuples con- ing tuple and randomly change one of the three taining words that are not in the vocabulary of our ConceptNet-trained embeddings. We require that fields (t1, t2, or R), ensuring that the result is not a tuple existing in ConceptNet. We then score these one term does not include the other term. We cre- tuples using MaxSim and the Bilinear AVG model ate tuples consisting of the two terms and all pos- and analyze the results. sible relations that occur with the POS sequences of those two terms. Finally, we remove tuples that 7.2 Generating Tuples from Wikipedia exactly match our ConceptNet training tuples. We also propose a simple method to extract candi- We use our trained Bilinear AVG model to date tuples from raw text. We first run the Stan- score these tuples. We extract term pairs that oc- ford part-of-speech (POS) tagger (Toutanova et cur within the same sentence because we hope that these will have higher precision than if we were to 5For the results in this section, we used the Bilinear AVG model that achieved 91.7 on TEST rather than the one aug- pair together arbitrary pairs. Some example tuples mented with additional data. for t1 = bus are shown in Table 6. 7.3 Manual Analysis of Novel Tuples tuples quality high-confidence CN tuples 3.68 To evaluate our models on newly generated tu- medium-confidence CN tuples 3.14 novel CN tuples, ranked by MaxSim 2.74 ples, we rank them using different models and novel CN tuples, ranked by Bilinear AVG 3.20 manually score the high-ranking tuples for qual- novel Wiki tuples, ranked by Bilinear AVG 2.78 ity. We first randomly sampled 3000 tuples from each set of novel tuples. We do so due to the time Table 7: Average quality scores from manual eval- requirements of the MaxSim baseline, which re- uation of novel tuples. Each row corresponds to a quires iterating through the entire training set for different set of tuples. See text for details. each candidate tuple. We score these sampled tu- After nine years of primary school, students can go to ples using MaxSim and the Bilinear AVG model the high school or to an educational institution. and rank them by their scores. The top 100 tu- t1, R, t2 score ples under each ranking were given to an annota- school,HASPROPERTY, educational 0.89 school,ISA, educational institution 0.80 tor who is a native English speaker. The annota- school,ISA, institution 0.78 tor assigned a quality score to each tuple, using school,HASPROPERTY, high 0.77 the same 0-4 annotation scheme as Speer et al. high school,ISA, institution 0.71 On March 14, 1964, Ruby was convicted of murder with (2010): 0 (“Doesn’t make sense”), 1 (“Not true”), malice, for which he received a death sentence. 2 (“Opinion/Don’t know”), 3 (“Sometimes true”), t1, R, t2 score ∗ and 4 (“Generally true”). We report the average murder,CAUSES, death 1.00 murder,CAUSES, death sentence 0.86 quality score across each set of 100 tuples. murder,HASSUBEVENT, death 0.84 The results are shown in Table 7. To calibrate murder,CAPABLEOF, death 0.51 the scores, we also gave two samples of Concept- Table 8: Top ranked tuples extracted from two Net (CN) tuples to the annotator: a sample of 100 example sentences and scored by Bilinear AVG high-confidence tuples (first row) and a sample model. ∗ = contained in ConceptNet. of 100 medium-confidence tuples (second row). We find the high-confidence tuples to be of high quality, recording an average of 3.68, though the contained in the sentence, rather than necessarily medium-confidence tuples drop to 3.14. indicating what the sentence means. This proce- The next two rows show the quality scores dure could provide relevant commonsense knowl- of the MaxSim baseline and the Bilinear AVG edge for a downstream application that seeks to model. The latter outperforms the baseline and understand the sentence. We leave further investi- matches the quality of the medium-confidence gation of this idea to future work. ConceptNet tuples. Since our novel tuples are not contained in ConceptNet, this result suggests that 8 Conclusion our model can be used to add medium-confidence We proposed methods to augment curated com- tuples to ConceptNet. monsense resources using techniques from knowl- The novel Wikipedia tuples (top 100 tuples edge base completion. By scoring novel tuples, ranked by Bilinear AVG model) had a lower qual- we showed how we can increase the applicability ity score (2.78), but this is to be expected due to of the knowledge contained in ConceptNet. In fu- the difference in domain. Since Wikipedia con- ture work, we will explore how to use our model tains a wide variety of text, we found the novel to improve downstream NLP tasks, and consider tuples to be noisier than those from ConceptNet. applying our methods to other knowledge bases. Still, we are encouraged that on average the tuples We have released all of our resources—code, data, are judged to be close to “sometimes true.” and trained models—to the research community.6 7.4 Text Analysis with Commonsense Tuples Acknowledgments We note that our method of tuple extraction and scoring could be used as an aid in applications We thank the anonymous reviewers, John Wieting, that require sentence understanding. Two example and Luke Zettlemoyer. sentences are shown in Table 8, along with the top tuples extracted and scored using our method. The 6Available at http://ttic.uchicago.edu/ tuples capture general knowledge about phrases ˜kgimpel/commonsense.html. References Alberto Garc´ıa-Duran,´ Antoine Bordes, and Nicolas Usunier. 2014. Effective blending of two and three- Basant Agarwal, Namita Mittal, Pooja Bansal, and way interactions for modeling multi-relational data. Sonal Garg. 2015. Sentiment analysis using In and Knowledge Discovery in common-sense and context information. Comp. Int. . and Neurosc. Matt Gardner, Partha Pratim Talukdar, Jayant Krish- Gabor Angeli and Christopher Manning. 2013. namurthy, and Tom Mitchell. 2014. Incorporat- Philosophers are mortal: Inferring the truth of un- ing vector space similarity in random walk inference seen facts. In Proc. of CoNLL. over knowledge bases. In Proc. of EMNLP.

Gabor Angeli and D. Christopher Manning. 2014. Jonathan Gordon and Lenhart K Schubert. 2012. NaturalLI: Natural logic inference for common Using textual patterns to learn expected event fre- sense reasoning. In Proc. of EMNLP. quencies. In Proc. of Joint Workshop on Auto- matic Knowledge Base Construction and Web-scale Collin F Baker, Charles J Fillmore, and John B Lowe. . 1998. The Berkeley FrameNet project. In Proc. of COLING. Jonathan Gordon and Benjamin Van Durme. 2013. Reporting bias and knowledge acquisition. In Proc. Mohit Bansal, Kevin Gimpel, and Karen Livescu. of Workshop on Automated Knowledge Base Con- 2014. Tailoring continuous word representations for struction. dependency parsing. In Proc. of ACL. Jonathan Gordon, Benjamin Van Durme, and Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Lenhart K Schubert. 2010. Learning from the Sturge, and Jamie Taylor. 2008. Freebase: a col- web: Extracting general world knowledge from laboratively created graph for structuring noisy text. In Collaboratively-Built Knowledge human knowledge. In Proc. of the ACM SIGMOD Sources and AI. International Conference on Management of Data. Jonathan Gordon. 2014. Inferential Commonsense Antoine Bordes, Nicolas Usunier, Alberto Garcia- Knowledge from Text. Ph.D. thesis, University of Duran, Jason Weston, and Oksana Yakhnenko. Rochester. 2013. Translating embeddings for modeling multi- relational data. In Advances in NIPS. Kelvin Gu, John Miller, and Percy Liang. 2015. Traversing knowledge graphs in vector space. In Antoine Bordes, Sumit Chopra, and Jason Weston. Proc. of EMNLP. 2014a. Question answering with subgraph embed- dings. In Proc. of EMNLP. Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with Antoine Bordes, Xavier Glorot, Jason Weston, and (genuine) similarity estimation. Computational Lin- Yoshua Bengio. 2014b. A en- guistics. ergy function for learning with multi-relational data. Machine Learning, 94(2). Sepp Hochreiter and Jurgen¨ Schmidhuber. 1997. Long short-term memory. Neural computation. Erik Cambria, Daniel Olsher, and Kenneth Kwok. 2012. Sentic activation: A two-level affective com- Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, mon sense reasoning framework. In Proc. of AAAI. Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using Andrew Carlson, Justin Betteridge, Bryan Kisiel, clickthrough data. In Proc. of CIKM. Burr Settles, Estevam R Hruschka Jr, and Tom M Mitchell. 2010. Toward an architecture for never- Rodolphe Jenatton, Nicolas L. Roux, Antoine Bordes, ending language learning. In Proc. of AAAI. and Guillaume R Obozinski. 2012. A latent factor model for highly multi-relational data. In Advances Luigi Di Caro, Alice Ruggeri, Loredana Cupi, and in NIPS. Guido Boella. 2015. Common-sense knowledge for natural language understanding: Experiments in Dan Klein and Christopher D. Manning. 2003. Accu- unsupervised and supervised settings. In Proc. of rate unlexicalized parsing. In Proc. of ACL. AI*IA. Ni Lao, Tom Mitchell, and William W Cohen. 2011. John Duchi, Elad Hazan, and Yoram Singer. 2011. Random walk inference and learning in a large scale Adaptive subgradient methods for online learning knowledge base. In Proc. of EMNLP. and stochastic optimization. JMLR. Douglas B Lenat and Ramanathan V Guha. 1989. Anthony Fader, Stephen Soderland, and Oren Etzioni. Building large knowledge-based systems: represen- 2011. Identifying relations for open information ex- tation and inference in the project. Addison- traction. In Proc. of EMNLP. Wesley Longman Publishing Co., Inc. Henry Lieberman, Alexander Faaborg, Waseem Daher, Kristina Toutanova, Dan Klein, Christopher D Man- and Jose´ Espinosa. 2005. How to wreck a nice ning, and Yoram Singer. 2003. Feature-rich part-of- beach you sing calm incense. In Proc. of IUI. speech tagging with a cyclic dependency network. In Proc. of NAACL-HLT. Wang Ling, Chris Dyer, Alan W Black, Isabel Tran- coso, Ramon Fermandez, Silvio Amir, Luis Marujo, Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoi- and Tiago Luis. 2015. Finding function in form: fung Poon, Pallavi Choudhury, and Michael Gamon. Compositional character models for open vocabu- 2015. Representing text for joint embedding of text lary word representation. In Proc. of EMNLP. and knowledge bases. In Proc. of EMNLP.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor- Luis von Ahn, Mihir Kedia, and Manuel Blum. 2006. rado, and Jeff Dean. 2013. Distributed representa- Verbosity: a game for collecting common-sense tions of words and phrases and their compositional- facts. In Proc. of CHI. ity. In Advances in NIPS. Robert West, Evgeniy Gabrilovich, Kevin Murphy, Shaohua Sun, Rahul Gupta, and Dekang Lin. 2014. George A Miller. 1995. WordNet: a lexical database Knowledge base completion via search-based ques- for English. Communications of the ACM, 38(11). tion answering. In Proc. of WWW. Mike Mintz, Steven Bills, Rion Snow, and Daniel Ju- John Wieting, Mohit Bansal, Kevin Gimpel, Karen rafsky. 2009. Distant supervision for relation ex- Livescu, and Dan Roth. 2015. From paraphrase traction without labeled data. In Proc of ACL. database to compositional paraphrase model and back. Transactions of the ACL. Arvind Neelakantan and Ming-Wei Chang. 2015. In- ferring missing entity type instances for knowledge John Wieting, Mohit Bansal, Kevin Gimpel, and Karen base completion: New dataset and methods. In Livescu. 2016. Towards universal paraphrastic sen- Proc. of NAACL. tence embeddings. In Proc. of ICLR.

Arvind Neelakantan, Benjamin Roth, and Andrew Mc- Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Callum. 2015. Compositional vector space models Gao, and Li Deng. 2014. Embedding entities and for knowledge base completion. In Proc. of ACL. relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575. Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective Limin Yao, Sebastian Riedel, and Andrew McCallum. learning on multi-relational data. In Proc. of ICML. 2013. Universal schema for entity type prediction. In Proc. of Workshop on Automated Knowledge Base Maximilian Nickel, Volker Tresp, and Hans-Peter Construction. Kriegel. 2012. Factorizing YAGO: Scalable ma- chine learning for . In Proc. of WWW.

Jeffrey Pennington, Richard Socher, and Christo- pher D. Manning. 2014. GloVe: Global vectors for word representation. In Proc. of EMNLP.

Sebastian Riedel, Limin Yao, Andrew McCallum, and M. Benjamin Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In Proc. of NAACL-HLT.

Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. 2013. Reasoning with neural ten- sor networks for knowledge base completion. In Ad- vances in NIPS.

Robert Speer and Catherine Havasi. 2012. Represent- ing general relational knowledge in ConceptNet 5. In Proc. of LREC.

Robert Speer, Catherine Havasi, and Henry Lieberman. 2008. AnalogySpace: Reducing the dimensionality of common sense knowledge. In Proc. of AAAI.

Robert Speer, Catherine Havasi, and Harshit Surana. 2010. Using verbosity: Common sense data from games with a purpose. In Proc. of FLAIRS.