<<

CO-NNECT: A Framework for Revealing Commonsense Knowledge Paths as Explicitations of Implicit Knowledge in Texts

Maria Becker, Katharina Korfhage, Debjit Paul, Anette Frank Department of Computational Linguistics, Heidelberg University mbecker|korfhage|paul|[email protected]

Abstract paths can explicate implicit information conveyed by the text. Connections can either be direct, e.g. In this work we leverage commonsense knowl- given the sentences The car was too old and The edge in the form of knowledge paths to es- engine broke down, the concepts car and engine tablish connections between sentences, as a can be linked with a direct relation (singlehop path) form of explicitation of implicit knowledge. Such connections can be direct (singlehop car → HASA → engine; or indirect – here inter- paths) or require intermediate concepts (mul- mediate concepts are required to establish the link, tihop paths). To construct such paths we com- as between Berliners produce too much waste and bine two model types in a joint framework we Environmental protection should play a more im- call CO-NNECT: a relation classifier that pre- portant role, where the link between waste and envi- dicts direct connections between concepts; and ronmental protection requires a multihop reasoning a target prediction model that generates tar- path: waste → RECEIVESACTION → recycle → get or intermediate concepts given a source concept and a relation, which we use to con- PARTOF → environmental protection. struct multihop paths. Unlike prior work that We show that two complementary model types relies exclusively on static knowledge sources, can be combined to solve the two subtasks: (i) we leverage language models finetuned on for predicting singlehop paths between concepts, knowledge stored in ConceptNet, to dynam- we propose a relation classification model that is ically generate knowledge paths, as explana- very precise, but restricted to direct connections be- tions of implicit knowledge that connects sen- tween concepts; (ii) for constructing longer paths tences in texts. As a central contribution we design manual and automatic evaluation set- we rely on a target prediction model that can gen- tings for assessing the quality of the generated erate intermediate concepts and is thus able to paths. We conduct evaluations on two argu- generate multihop paths. However, the interme- mentative datasets and show that a combina- diate concepts can be irrelevant or misleading. To tion of the two model types generates mean- our knowledge, prior work has applied either re- ingful, high-quality knowledge paths between lation classification or target prediction models. sentences that reveal implicit knowledge con- We propose CO-NNECT, a framework that estab- veyed in text. lishes COmmonsense knowledge paths for CON- NECTing sentences by combining relation classi- arXiv:2105.03157v1 [cs.CL] 7 May 2021 1 Introduction fication and target prediction models, leveraging Commonsense knowledge covers simple facts their strengths and minimizing their weaknesses. about the world, people and everyday life, e.g., With CO-NNECT, we obtain high-quality knowl- Birds can fly or Cars are used for driving. It is edge paths that explicate implicit knowledge con- increasily used for many NLP tasks, e.g. for ques- veyed by the text. tion answering (Mihaylov et al., 2018), textual en- We focus on commonsense knowledge in Con- tailment (Weissenborn et al., 2018), or classifying ceptNet (Speer et al., 2017), a knowledge graph argumentative functions (Paul et al., 2020). In this (KG) that represents concepts (words or phrases) work, we leverage commonsense knowledge in the as nodes, and relations between them as edges, form of single- and multihop knowledge paths for e.g., hoven,USEDFOR, bakingi. As instances of the establishing connections between concepts from model types we use COREC (Becker et al., 2019), a different sentences in texts, and show that these multi-label relation classifier that predicts relation types and that we enhance with a pretrained lan- of-the-art results. To our knowledge, Becker et al. guage model; and COMET (Bosselut et al., 2019), (2019) is the only work that proposed a relation a pretrained transformer model that learns to gen- classification model specifically for ConceptNet erate target concepts given a source concept and a relations, which we adapt for our work. Besides relation. In contrast to models that retrieve knowl- COMET (Bosselut et al., 2019), the model used edge from static KGs (Mihaylov et al., 2018; Lin in our approach, Saito et al.(2018) perform tar- et al., 2019), both models are fine-tuned on Con- get prediction on ConceptNet using an attentional ceptNet and applied on the fly, to dynamically gen- encoder-decoder model. They improve the KB erate knowledge paths that generalize beyond the completion model of Li et al.(2016) by jointly static knowledge, allowing us to predict unseen scoring triples and predicting target concepts. knowledge paths. We compare our models to a baseline model that solely relies on static KGs. Utilizing commonsense knowledge paths. We evaluate our framework on two English argu- When using commonsense knowledge for question mentative datasets, IKAT (Becker et al., 2020) and answering (Mihaylov et al., 2018), commonsense ARC (Habernal et al., 2018), which offer annota- reasoning (Lin et al., 2019) or NLI (Kapanipathi tions that explain implicit connections between sen- et al., 2020), most approaches rely on paths re- tences. While knowledge paths have been widely trieved from static knowledge resources. In con- used in NLP downstream tasks, a careful evalua- trast, we propose a framework that in addition tion of these paths has not received much attention. makes use of dynamic knowledge provided by lan- As a central contribution of our work, we address guage models. Few other models have used knowl- this shortcoming by designing manual and auto- edge paths dynamically, e.g. Paul et al.(2020), matic settings for path evaluation: we evaluate the who enrich ConceptNet on the fly when classifying relevance and quality of the paths and their ability argumentative functions. to represent implicit knowledge in an annotation Wang et al.(2020) make use of language mod- experiment; and we compare the paths to the anno- els for question answering. They generate multi- tations of implicit knowledge in IKAT and ARC, hop paths by sampling random walks from Con- using automatic similarity metrics. ceptNet and finetune a language model on these Our main contributions are: i) we propose CO- paths to connect question and answers, improving NNECT, a framework that combines two comple- accuracy on two question answering benchmarks. mentary types of knowledge path prediction models Bosselut et al.(2021) generate knowledge paths that have previously only been applied separately;1 using a language model for zero-shot question an- ii) we show that commonsense knowledge paths swering, which they use to select the answer to generated with CO-NNECT effectively represent a question, surpassing performance of pretrained implied knowledge between sentences; iii) we pro- language models on SocialIQA (a multiple-choice pose an evaluation scheme that measures the qual- question answering dataset for probing machine’s ity of the knowledge paths, going beyond many ap- emotional and social intelligence in a variety of proaches that use knowledge paths for downstream everyday situations). Similarly, Chang et al.(2020) applications without analyzing their quality. incorporate knowledge from ConceptNet in pre- trained language models for SocialIQA. They ex- 2 Related Work tract keywords from question and answers, query ConceptNet for relevant triples, and incorporate In this work we combine relation classification them in their language models via attention. Their and target prediction for generating commonsense evaluation shows that their knowledge-enhanced knowledge representations over text. Relation model outperforms knowledge-agnostic baselines. classification covers a range of methods and learn- Finally, Paul and Frank(2020) propose an atten- ing paradigms for representing relations. A vari- tion model that encodes commonsense inference ety of neural architectures such as RNNs (Zhang rules and incorporates them in a transformer based et al., 2018), CNNs (Guo et al., 2019), sequence- reasoning cell, taking advantage of pretrained lan- to-sequence models (Trisedya et al., 2019) or lan- guage models and structured knowledge. Their guage models (Wu and He, 2019) achieved state- evaluation on two reasoning tasks shows that their 1The code for our framework can be found here: https: model improves performance over models that lack //github.com/Heidelberg-NLP/CO-NNECT. external knowledge. Hence, none of these sys- tems directly evaluates the quality of the generated et al., 2019), a transformer encoder-decoder based paths, but measure the effectiveness of common- on GPT-2 (Radford et al., 2019). Input to the model sense knowledge indirectly by evaluation on down- is a source concept cs and a relation ri. Then the stream tasks. We will address this shortcoming in pretrained language model is finetuned using Con- our work by carefully evaluating the quality of the ceptNet as labelled train set for the task of gener- generated paths. ating new concepts. Depending on the beam size, COMET can propose multiple targets per input 3 Enriching Texts with Commonsense instance. Knowledge Paths Datasets. To compare model performances, we evaluate COREC-LM and COMET on the CN- This section describes CO-NNECT, the framework 100k benchmark dataset of Li et al.(2016), which we propose for enriching texts with commonsense is based on the OMCS subpart of ConceptNet. The knowledge, by establishing relations or paths be- dataset comprises 37 relation types such as ISA, tween concepts from different sentences. Towards PARTOF or CAUSES and contains 100k relation this aim, we apply relation classification and target triples in the train set and 1200 in the develop- prediction models in combination. We first charac- ment and the test set, respectively. CN-100k con- terize differences between the two model types and tains a lot of infrequent relations which are hard their instantiations, COREC-LM and COMET, de- to learn and often overspecific (e.g. HASFIRST- scribe how we adapt them to our task and evaluate SUBEVENT), and hence not useful for establishing them on ConceptNet to assess their performance high quality relations and paths between concepts. (§3.1). We then show how we utilize the models We therefore extract a subset that contains all triples to establish connections between concepts in texts of the 13 most frequent relations (CN-13).2 CN-13 (§3.2) and present a baseline model that uses Con- covers 90,600 triples for training, 1080 triples for ceptNet as a static KG to establish commonsense development, and 1080 triples for testing. knowledge paths (§3.3). Since our application task requires that the re- 3.1 Comparing and Evaluating Model Types lation classifier also learns to detect that a given concept pair is not related, we extend the data for Relation classification and target prediction both training and testing COREC-LM with a RANDOM aim at representing relational commonsense knowl- class that contains unrelated concept pairs, which edge, but the respective task settings are funda- we add to CN-100k and CN-13.3 mentally different. We choose two models that PoS Sequence Filtering. We apply a type- have been developed for representing common- based PoS sequence filtering for COREC-LM and sense knowledge in CN: COREC, a relation classi- COMET, where the type is dependent on the pre- fication and COMET, a target prediction model. dicted relation. The relation ISA, for example, Relation Classification with COREC-LM. A is supposed to connect two noun phrases; in con- relation classifier is ideally suited to predict direct trast, HASPREREQUISITE typically relates two relations between concepts, hence we can apply verb phrases. We determine frequent PoS sequence COREC (Becker et al., 2019), an open-world multi- patterns for specific argument types from the Con- label relation classification system, for this task. ceptNet resource and use them to filter relation and Given a pair of concepts c , c from sentences, it s t path predictions. predicts one or several relations r from a set of i Metrics. We evaluate COREC-LM in terms of relation types R that hold between c and c . We CN s t weighted F1-scores, precision and recall, which enhance the original neural model with the pre- is its genuine evaluation setting. For COMET we trained language model DistilBERT (Sanh et al., report precision scores for the first prediction with 2019) to construct a classifier we call COREC-LM. highest confidence score (hits@1); we further re- We finetune this model on ConceptNet by mask- port hits@10 which gives information if the correct ing the relations and use sigmoid as output layer to model the probability of each relation indepen- 2These are: ATLOCATION,CAUSES,CAPABLEOF, dently, accounting for ambiguous relations in CN. ISA,HASPREREQUISITE,HASPROPERTY,HASSUBEVENT, Target Prediction with COMET. To generate USEDFOR, CAUSESDESIRE,DESIRES,HASA,MOTIVATED- BYGOAL and RECEIVESACTION. multihop paths that include (possibly novel) in- 3For details about the construction of the RANDOM class, termediate concepts, we apply COMET (Bosselut cf. Appendix. Figure 1: Our framework CO-NNECT: It finds single- and multihop paths between concepts, as explicitations of implicit knowledge that connects sentences. triple is included in the first ten predictions (which at learning commonsense knowledge representa- will be important since we later use a beamsize tions, but tackle different tasks and have different of 10 for generating paths). In addition, we re- strengths and weaknesses. COREC-LM is very port accuracy using the Bilinear AVG model of precise in its predictions, but is restricted to pre- Li et al.(2016) (COMET’s genuine evaluation set- dicting direct relations between two given concepts. ting), which is trained on CN-100k and produces COMET is more powerful since it can genuinely a probability for a generated relation triple to be generate novel target concepts and thus can gener- correct. Following Bosselut et al.(2019), we apply ate multihop paths. However, it tends to be more a beamsize of 1 and a threshold at 0.5 for judging imprecise, and bears the risk of generating irrele- a triple as correct. vant or noisy concepts. Hence, a combination of Model Performances. COREC-LM achieves models seems beneficial, to predict high-quality high F1-scores on CN-100k (76.5) and CN-13 single- and multihop paths between concepts. (86.0).4 Scores are significantly lower when adding the RANDOM class (-7pp on CN-100k&CN-13), 3.2 Establishing Connections Using Relation indicating that detecting unrelated concept pairs Classification and Target Prediction is not trivial. The results show that a strength In the following we describe how we combine and of COREC-LM is its precision (90.1/CN-100k; apply COREC-LM and COMET in a joint frame- 88.2/CN-13) – which we will leverage when com- work, CO-NNECT, to establish high-quality knowl- bining models. COMET achieves high accuracy edge paths between sentences. An overview is scores (92.3/CN-100k; 96.3/CN-13) according to given in Fig.1. In the first step we extract relevant the bi-score. For the much stricter metric hits@1 concepts from the text. For this we integrate the which judges a triple as correct only if it matches concept extraction tool COCO-EX (Becker et al., the respective triple in the testset, much lower 2021a), which extracts meaningful concepts from scores are achieved (25/CN-100k; 23.5/CN-13), texts and maps them to concept nodes in CN, con- which is evident given the wide range of possi- sidering all surface forms. ble target generations. Higher scores for hits@10 (65.3/CN-100k; 65.9/CN-13) show that the chance Linking Concepts with Direct Relations. We for correct predictions significantly rises with in- construct all possible pairs of concepts extracted S S c ×c creasing beam size. from 1 and 2 by taking the cross product s t, where c is a concept from S , and c a concept In sum, COREC-LM and COMET both aim s 1 t from S2 (Fig.1, Step 2, left). We then apply 4The original version of COREC (Becker et al., 2019) COREC-LM trained on CN-13+RANDOM with achieves F1 of 53.31/CN-100k; 72.33/CN-13. a tuned threshold of 0.9 for predicting which rela- tion ri ∈ RCN13 holds between the concept pairs, preferred over indirect multihop paths, since the or whether no relation holds (RANDOM) (cf. Fig. latter could include irrelevant or misleading inter- 1, Step 3 (left) for examples). mediate nodes. Thus, we discard all multihop paths Linking Concepts with Multihop Paths. for each concept pair for which COREC-LM pre- COMET requires as input a source concept and dicted a direct connection (Fig.1, Step 4, pair 4). and a relation. For each concept pair cs, ct we build If COMET COREC-LM produce a singlehop such inputs by combining cs with each relation ri path, we also prefer COREC-LM’s prediction, re- ∈ RCN13, yielding 13 pairs cs, ri which we input lying on the model’s high precision (pair 1 in Step for target prediction (Step 2, right). To discover 4). We keep the paths generated by COMET for relation chains starting from S2, we apply the same concept pairs for which no direct relation could be ANDOM process to ct, using cs as target concept. We also established (i.e., COREC-LM predicted R include inverse relations, which gives us greater or no prediction above its threshold, pair 3&6), as- flexibility for connecting entities, i.e., paths are al- suming that in such cases intermediate concepts are lowed to contain inverted triplets (e.g. baking ← required to establish a link. If only one of the mod- USEDFOR ← oven → ATLOCATION → kitchen). els establishes a link, we keep this connection (pair To this end, we switch the order of concept pairs 2), and if none of the models finds a link, we as- within a given relation ri, relabel the relation as sume that the concepts are not (closely) connected −1 (pair 5). ri , and add the inverted relation pair to COMET’s training set. Forward Chaining. For all pairs in the cross- 3.3 Static Baseline Model −1 product cs × ct, for each input cs, ri and cs, ri We compare COREC-LM and COMET against the we generate the 10 most confident concepts cti with COMET (beamsize 10) trained on CN-13 in- model of Paul and Frank(2019) that uses Concept- cluding inverse triples. We continue with all paths Net as a static KG. The system extracts paths be- tween pairs of concepts from sentence pairs, hence where the generated concept cti has minimum co- sine distance of 0.7 to the respective target concept conforms well to our setting. Following Paul and Frank(2019), starting from concepts in a sentence ct. We generate the next hop by using each ct as i 0 0 0 a new source concept, combine it with each of the pair (§3.2), we construct a subgraph G = (V ,E ) 0 13 original and inverse relations, generate novel of the ConceptNet graph, where V comprises all target predictions, and again compare to the tar- concepts ci in hS1,S2i. The system then finds all 0 get concept. This similarity comparison guides shortest paths p from ConceptNet that connect any 0 0 the forward chaining process towards the chosen concept pairs in V , and includes them in G . It 0 target concepts and helps detecting contextually then includes, for any concepts in G , all directly relevant paths. We use ConceptNet numberbatch connected concepts from ConceptNet together with embeddings for the encoding of concepts; for mul- their edges. This yields a small sub-graph from tiword concepts we average the embeddings of all ConceptNet that contains concepts and relations non-stopwords. relevant for capturing conceptual links across the sentence pair. To select relevant paths, G0 is filtered Terminating Paths. We terminate path genera- by computing scores for vertices and paths using tion as soon as the similarity between c and c is ti t PageRank and Closeness Centrality score, and we higher than 0.95 – here we expect the two concepts constrain path lengths to 3 hops. to express the same meaning. We restrict the path length to 3 hops and consider only completed paths for evaluation (Step 3, right in Fig.1). 4 Revealing Implicit Knowledge through Combining Approaches. With our framework Knowledge Paths: Experiments and CO-NNECT we leverage the potential of the com- Evaluation plementarity of the two model types by combining COREC-LM and COMET in a straightforward way. In this section we evaluate the paths generated by Our hypothesis is that a system that admits both sin- our proposed models. We first present our datasets gle and multihop connections for establishing links and statistics on established connections (§4.1), between concepts offers the greatest flexibility. We and then evaluate the quality of the paths manu- further hypothesize that direct relations should be ally (§4.2) and automatically (§4.3). Figure 2: Example generations from our three model types, where the first three instances come from IKAT, and the last two from ARC.

4.1 Datasets and Statistics COR COM CONN CN IKAT linked pairs 21,934 3,660 24,063 50,003 The IKAT dataset (Becker et al., 2020) is based avg. hops 1 1.4 1.1 2.1 on the English Microtexts Corpus of short argu- ARC linked pairs 9,844 1,826 10,828 14,940 mentative texts (Peldszus and Stede, 2016). For all avg. hops 1 1.5 1.1 2.1 sentence pairs that are adjacent or argumentatively Table 1: Statistics of paths generated by COREC-LM, related, annotators added the implicit knowledge COMET, their combination (CO-NNECT), and ranking that connects them, using short sentences. IKAT CN-graphs (CN): number of concepts pairs between contains 719 such sentence pairs, from which we which a link was found, and average number of hops extracted 60,294 concept pairs. The ARC dataset per path. (Habernal et al., 2018) contains arguments taken from online discussions in English, consisting of a claim and a premise, and an annotated implicit around 2k concept pairs, while almost 15k pairs warrant that explains why the claim follows from can be connected using ranked CN-graphs. In both the premise. We evaluate our models on the ARC datasets, the ranked CN-graphs contain on average test set that comprises 444 argument pairs, from 2.1 hops (relations) per path, while the paths gen- which we extracted 21,898 concept pairs; and the erated by COMET are shorter (1.4 on IKAT/1.5 on corresponding warrants. ARC). In fact, COMET establishes many direct re- Example generations for both datasets from lations (69% of all paths are single hops), whereas our three model types – COREC-LM, COREC, the ranked CN-graphs are mostly two- (49%) or and ranked CN-graphs – appear in Fig.2, where three-hop paths (37%). the first three sentence pairs come from IKAT, and Replacing Vague Relations in CN-Graphs. the last two from ARC. We find that in contrast to COREC-LM and Number of links and hops. Table1 gives statis- COMET, the ranked CN-graphs are constructed tics of the paths generated between concepts from using mostly the very general relation RELAT- sentence pairs from IKAT and ARC using our dif- EDTO (71%/IKAT; 72%/ARC), followed by the ferent models. We find that COREC-LM finds rela- likewise vague relation HASCONTEXT (8% in both tions between around 22k from 66k concept pairs datasets).5 For determining the impact of vague in IKAT, while COMET only generates paths be- relations on path quality, we replace all RELAT- tween 3,660 pairs. This can be explained by the EDTO and HASCONTEXT relations in the ranked very high similarity threshold we imposed for guid- CN-graphs with relations predicted by COREC- ing the forward chaining process towards the target LM (trained on CN-13, threshold 0.9). For IKAT, concept, since our motivation was not to gener- we replace 43.4% of all RELATEDTO and 46.2% many ate as paths as possible, but paths that are of all HASCONTEXT instances, in ARC we replace meaningful relevant and contextually . When com- 70.7% of all RELATEDTO and 37% of all HAS- bining paths from COMET and COREC-LM, we CONTEXT relations. We use this version when find links for more than 24k concept pairs in IKAT. evaluating paths, in addition to the original ranked The highest number of links is established by rank- CN-graphs. ing CN-subgraphs (50k linked concept pairs). For ARC, which contains 22k concept pairs, COREC- LM finds links between around 10k and COMET 5For details on relation distributions cf. Appendix. 4.2 Manual Evaluation of Path Quality IKAT ARC COR COM CN CN-r COR COM CN CN-r Our statistics showed that most links between con- Predictions 74 64 88 88 78 60 76 76 Relevance +2 70 50 36 40 63 49 30 34 cepts can be revealed using knowledge paths re- +1 19 27 22 24 25 28 28 32 0 8 18 27 21 8 9 29 22 trieved from ConceptNet as a static KG, whereby -1 3 5 10 10 2 6 4 3 -2 0 0 5 5 2 8 9 9 these paths tend to contain multiple hops and a Implicit yes 80 78 57 67 87 81 57 62 Knowledge no 20 22 43 33 13 19 43 38 high amount of vague relations. Fewer links are Best link 65 64 28 34 76 70 7 14 established using the dynamic models COREC-LM and COMET, which produce shorter paths using Table 2: Manual evaluation of paths from COREC-LM, COMET, ranked CN-graphs (CN), and CN-graphs with re- only specific relation types from CN-13. Since our placed vague relations (CN-r); all numbers in %. aim is to construct high-quality, meaningful knowl- edge paths that help to explain implicit information (rather than establishing as many links as possi- established by COREC-LM and 77% of the rela- ble), we now examine the quality and relevance of tions predicted by COMET were annotated as very the knowledge paths. We set up an annotation ex- relevant (+2) or relevant connections (+1), which periment, providing annotators with 100 sentence only applies for 58% of the ranked CN-paths. 15% pairs from each dataset, with marked concepts (one of the ranked CN-paths were annotated as not rele- from S1 and one from S2) and the path gener- vant (-1) or misleading (-2), which can be explained ated between these concepts by (i) COREC-LM, by noisy intermediate nodes; and 27% as vague (0), (ii) COMET, (iii) ranked paths from CN, and (iv) which can be explained by the large amount of un- ranked paths with replaced vague relations (CN-r). specific relations. When replacing RELATEDTO Annotation Setup. For each sentence pair, and HASCONTEXT (CN-r), the amount of paths an- our annotators evaluated if 1) the path is a notated as vague slightly decreases, and the amount meaningful and relevant explanation for the con- of paths labelled as relevant and very relevant in- nection between the two sentences (very rele- creases. vant/relevant/neutral/not relevant/misleading); if Moreover, paths generated by COREC-LM and 2) the path represents implicit information not ex- COMET were found to yield better implicit knowl- plicitly expressed in the sentences (yes/no); and 3) edge representations than ranked CN-paths (line which model generates the path that is most helpful 8-9, Table2), while we find that replacing vague and expressive for understanding the connection relations in the CN-paths improves their ability of between the sentences. 4) To evaluate the combina- representing implicit knowledge. Finally, 65% of tion of COREC-LM and COMET in CO-NNECT, relations predicted by COREC-LM and 64% of we generate a subset for each dataset that includes paths generated by COMET were chosen as ex- all sentence/concept pairs for which COREC-LM plaining the connections between sentences best, predicted a singlehop path and COMET generated which is only the case for 28% of the CN-paths, a multihop path (10 pairs per subset). For these and slightly better for the replaced version of the instances we ask in addition whether the multihop CN-paths (34%). paths include unrelated, unnecessary or uninforma- On ARC, the high amount of CN-paths anno- tive intermediate nodes (yes/no), misleading inter- tated as vague (29%) again indicates uninforma- mediate nodes (yes/no); or intermediate nodes that tive connections and can be reduced when replac- are important for explaining the connection and ing vague with more specific relations. Relations missing in the direct relation predicted by COREC- 6 predicted by COREC-LM were found to be less LM (yes/no). Annotations were performed by relevant for connecting sentences in ARC than in two annotators with a linguistic background. We IKAT, but 87% of them were evaluated as appropri- measure IAA using Cohen’s Kappa and achieve ate expressions of implicit knowledge. 76% of the an agreement of 81%. Remaining conflicts were relations predicted by COREC-LM were evaluated resolved by an expert annotator. as best connections, which applies only for 7% of Results. Table2 shows the results of our an- CN-paths and 14% of CN-paths with replaced re- IKAT notation experiment. On , 89% of the paths lations. For COMET we find overall comparable 6The annotation manual together with example anno- results between IKAT and ARC. tations can be found here: https://github.com/ Heidelberg-NLP/CO-NNECT/blob/main/manual. Regarding the combination of COREC-LM and pdf COMET addressed with question 4, according to our annotators 50% of the multihop paths in the lational knowledge, by extracting all concepts from IKAT subset include misleading nodes and all of each golden implicit knowledge sentence (My dog them include irrelevant or uninformative nodes. has a bone → dog, bone) using the CN-extraction Still, compared against the direct relations pre- tool COCO-EX (Becker et al., 2021a), and pre- dicted by COREC-LM, annotators state for 30% of dict the relations between them using COREC-LM, the multihop paths from COMET that they contain trained on CN-13 (dog, HASA, bone). If a sentence intermediate concepts that are important for ex- contains more than two relations, we concatenate plaining the connection. On the ARC subset, 40% the predicted relation triples. (ii) IKAT provides of the multihop paths include misleading and 60% manual annotations of ConceptNet relations for the include irrelevant nodes, and only 20% contain golden implicit knowledge sentences, which we important intermediate concepts that are missing use as Gold Paths (The tree is in the garden → in the direct relation. For each subset, annotators tree ATLOCATION garden). (iii) Gold-NL: Here preferred the shorter path over the multihop path we use the implicit knowledge (in natural language) in 90% of the given sentence pairs. Comparing as provided in the datasets: IKAT’s implicit knowl- singlehop paths generated by COMET to direct edge sentences and ARC’s implicit warrants. relations predicted by COREC-LM for the same For encoding the generated paths we apply concept pairs, our annotators preferred the relation two settings: (i) we concatenate all concepts and predicted by COREC-LM in 64% of the cases, in relations within the paths; (Generated Paths) and 29% the link was annotated as equally good, and (ii) we translate the relation triples and paths to only in 7% COMET’s generation was preferred. (pseudo) natural language using templates provided To summarize, according to our manual evalu- by ConceptNet (e.g. cs CAUSES ct → The effect of ation, the dynamic models COMET and COREC- cs is ct; Generated Paths-NL). LM are better suited for generating meaningful We apply two automatic similarity metrics, knowledge paths that express implicit knowledge comparing (a) Generated vs. Silver Paths, (b) Gen- between sentences than ranked paths from the static erated Paths-NL vs. Gold-NL, and (c) Generated vs. CN knowledge graph, even though replacing vague Gold Paths (only IKAT). (i) We encode each rep- by more specific relations slightly improves results. resentation as described above using ConceptNet When comparing multihop paths to direct relations numberbatch embeddings (Speer et al., 2017) (for established between the same concept pairs, we multiword concepts we average the embeddings find that longer paths tend to contain irrelevant or of all non-stopwords), and compute cosine simi- even misleading nodes, and that direct relations larity between them, and (ii) we use BERTScore are preferred by human annotators. These findings F1 (Zhang et al., 2020) to compare representations, support our proposed joint framework CO-NNECT, which computes string similarity using contextual- which gives preference to direct relations and uti- ized embeddings. Both metrics lie in [−1, 1]. lizes multihop paths only if no direct connection Results. Table3 shows that the paths gen- between concepts can be revealed. erated by combining COREC-LM and COMET in our framework CO-NNECT achieve the high- 4.3 Automatic Evaluation Against Gold est similarity scores according to Numberbatch- Our goal is to generate meaningful paths that con- Cosim on IKAT in setting (a) and (b), while for vey implicit knowledge between sentences. In our (c) we get the highest Cosim scores for ranked CN- automatic evaluation we compare the set of model- graphs with replaced vague relations. According generated paths between all concept pairs from two to BERTScore, either COREC-LM (setting a) or related sentences to the implicit knowledge anno- COMET (setting b) applied separately, or both ap- tated in IKAT and ARC for these sentences, using plied in combination (setting c) achieve highest similarity metrics. results on IKAT. On ARC, CO-NNECT achieves Since the generated relation and path represen- both highest Cosim and BERTScores in setting (a), tations differ from the annotated natural sentences, while in (b) we get the best scores for CN-r ac- we approximate a common representation as fol- cording to Cosim, and the best scores for COMET lows: We encode the golden annotations of im- according to BERTScore. plicit knowledge – usually short sentences – using Summarizing our insights from automatic eval- three settings: (i) Silver Paths: we encode their re- uations, we find that COREC-LM achieves high COR COM CONN CN CN-r establish knowledge connections between concepts (a) Generated Paths vs. Silver Paths from different sentences, as a form of explicitation IKAT .61/.85 .54/.82 .62/.84 .57/.78 .58/.80 of implicitly conveyed information. We combine ARC .41/.84 .39/.82 .42/.86 .40/.77 .40/.78 existing relation classification and target predic- (b) Generated Paths-NL vs. Gold-NL IKAT .69/.81 .65/.83 .70/.81 .65/.75 .69/.76 tion models in a dynamic knowledge prediction ARC .72/.81 .66/.82 .72/.81 .71/.75 .77/.76 framework, CO-NNECT, utilizing language models (c) Generated Paths vs. Gold Paths finetuned on knowledge relations from Concept- IKAT .57/.78 .49/.78 .58/.79 .66/.73 .67/.74 Net. We compare against a path ranking system that employs static knowledge from ConceptNet as Table 3: Comparing generated paths to implicit knowl- edge annotations on IKAT and ARC, measured by a baseline and evaluate the quality of the obtained Cosim/BERTScore (F1). paths (i) through manual evaluation and (ii) using automatic similarity metrics, by comparing gener- ated paths to annotations of implicit knowledge in scores when applied separately or in combination two argumentative datasets. Our evaluations show with COMET (CO-NNECT). COMET applied that we obtain the highest number of connections in isolation does not yield the highest scores, but from the static ConceptNet graph, however, they helps to boost COREC-LM’s performance in the are often noisy due to unrelated intermediate nodes, joint CO-NNECT framework. Ranked CN-graphs and – even after replacements – still contain many achieve highest Cosim in two settings/datasets unspecific relations. Our framework CO-NNECT, (ARC–b; IKAT–c), but we do not find significant instead, combines relation classification and target improvements when replacing vague relations in prediction, leveraging the high precision of the for- CN-graphs (expect for Cosim in setting b). This mer, and the ability to perform forward chaining can be explained by the fact that even though of the latter, and obtain high-quality, meaningful many RELATEDTO and HASCONTEXT instances and relevant knowledge paths that reveal implicit could be replaced, for both datasets a large amount knowledge conveyed by the text, as shown in a of vague relations still remain (56.6% of RELAT- profound manual evaluation experiment. EDTO/53.2% of HASCONTEXT in IKAT; 29.3% We believe that CO-NNECT is a useful frame- RELATEDTO/63% HASCONTEXT in ARC). There- work which can be applied for different tasks, fore, the vague relation types in the CN-graphs such as enriching texts with commonsense knowl- still remain problematic when representing implicit edge relations and paths, for dynamically enriching knowledge. knowledge bases, or for building knowledge con- When comparing our manual evaluation results straints for language generation. In Becker et al. to the automatic scores, we find that the genera- (2021b) for example we inject single- and multi- tions that were manually evaluated as most relevant hop commonsense knowledge paths predicted by and meaningful explanations of implicit knowledge CO-NNECT as constraints into language models are not always highest-ranked by automatic metrics, and show that this improves the model’s ability of which points to two limitations of our automatic generating sentences that explicate implicit knowl- evaluation: Besides well-known issues regarding edge which connects sentences in texts. We fur- the reliability, interpretability, and biases of auto- thermore believe that the paths established with matic metrics (Callison-Burch et al., 2006), we CO-NNECT, which can provide explicitations of evaluate the generated paths against an annotated implicit knowledge, can be useful to enhance many reference – paths or sentences – which is often only other NLP downstream tasks, such as argument one among several valid options for expressing the classification, stance detection, or commonsense implicit knowledge (cf. Becker et al. 2017). This reasoning. means that a generated path may still be a relevant explicitation of implicit information, even if not similar to the reference. Hence, automatic scores Acknowledgements are to be considered with caution. This work has been funded by the DFG within 5 Conclusion the project ExpLAIN as part of the Priority Pro- gram “Robust Argumentation Machines” (SPP- Our work aims to leverage commonsense knowl- 1999). We thank our annotators for their contri- edge in the form of single and multihop paths, to bution. References Zhijiang Guo, Yan Zhang, and Wei Lu. 2019. Atten- tion guided graph convolutional networks for rela- Maria Becker, Katharina Korfhage, and Anette Frank. tion extraction. In Proceedings of the 57th Annual 2020. Implicit Knowledge in Argumentative Texts: Meeting of the Association for Computational Lin- Proceedings of the Con- An Annotated Corpus. In guistics, pages 241–251, Florence, Italy. Association ference on Language Resources and Evaluation for Computational Linguistics. (LREC), pages 2316–2324, Online. Ivan Habernal, Henning Wachsmuth, Iryna Gurevych, Maria Becker, Katharina Korfhage, and Anette Frank. and Benno Stein. 2018. The argument reasoning 2021a. COCO-EX: A Tool for Linking Concepts comprehension task: Identification and reconstruc- from Texts to ConceptNet. In Proceedings of the tion of implicit warrants. In Proceedings of the 2018 Conference of the European Chapter of the Associ- Conference of the North American Chapter of the ation for Computational Linguistics (EACL), Demo Association for Computational Linguistics: Human Papers, Online. Language Technologies, Volume 1 (Long Papers), Maria Becker, Siting Liang, and Anette Frank. 2021b. pages 1930–1940, New Orleans, Louisiana. Associ- Reconstructing Implicit Knowledge with Language ation for Computational Linguistics. Models. Accepted at: Deep Learning Inside Out Pavan Kapanipathi, Veronika Thost, Siva Patel, (DeeLIO): Workshop on Knowledge Extraction and Spencer Whitehead, Ibrahim Abdelaziz, Avinash Integration for Deep Learning Architectures. Balakrishnan, Maria Chang, Kshitij Fadnis, Chulaka Maria Becker, Michael Staniek, Vivi Nastase, and Gunasekara, Bassem Makni, Nicholas Mattei, Kar- Anette Frank. 2017. Enriching Argumentative Texts tik Talamadupula, and Achille Fokoue. 2020. Infus- with Implicit Knowledge. In Applications of Natu- ing knowledge into the textual entailment task using ral Language to Data Bases (NLDB) - Natural Lan- graph convolutional networks. Proceedings of the guage Processing and Information Systems, Lecture AAAI Conference on Artificial Intelligence, 34:8074– Notes in Computer Science, pages 84–96. Springer. 8081. Maria Becker, Michael Staniek, Vivi Nastase, and Xiang Li, Aynaz Taheri, Lifu Tu, and Kevin Gimpel. Anette Frank. 2019. Assessing the Difficulty of 2016. Commonsense knowledge base completion. Classifying ConceptNet Relations in a Multi-Label In Proceedings of the 54th Annual Meeting of the As- Classification Setting. In Proceedings of RELA- sociation for Computational Linguistics (Volume 1: TIONS - Workshop on Meaning Relations between Long Papers), pages 1445–1455, Berlin, Germany. Phrases and Sentences, pages 1–15, Gothenburg, Association for Computational Linguistics. Sweden. Bill Yuchen Lin, Xinyue Chen, Jamin Chen, and Xi- Antoine Bosselut, Ronan Le Bras, , and Yejin Choi. ang Ren. 2019. KagNet: Knowledge-aware graph 2021. Dynamic neuro-symbolic knowledge graph networks for commonsense reasoning. In Proceed- construction for zero-shot commonsense question ings of the 2019 Conference on Empirical Methods answering. In Proceedings of the 35th AAAI Con- in Natural Language Processing and the 9th Inter- ference on Artificial Intelligence (AAAI). national Joint Conference on Natural Language Pro- cessing (EMNLP-IJCNLP), pages 2829–2839, Hong Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chai- Kong, China. Association for Computational Lin- tanya Malaviya, Asli Celikyilmaz, and Yejin Choi. guistics. 2019. COMET: Commonsense for au- tomatic knowledge graph construction. In Proceed- Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish ings of the 57th Annual Meeting of the Association Sabharwal. 2018. Can a suit of armor conduct elec- for Computational Linguistics, pages 4762–4779, tricity? a new dataset for open book question an- Florence, Italy. Association for Computational Lin- swering. In Proceedings of the 2018 Conference on guistics. Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium. Association Chris Callison-Burch, Miles Osborne, and Philipp for Computational Linguistics. Koehn. 2006. Re-evaluating the role of Bleu in ma- chine translation research. In 11th Conference of Debjit Paul and Anette Frank. 2019. Ranking and se- the European Chapter of the Association for Com- lecting multi-hop knowledge paths to better predict putational Linguistics, Trento, Italy. Association for human needs. In Proceedings of the 2019 Confer- Computational Linguistics. ence of the North American Chapter of the Associ- ation for Computational Linguistics: Human Lan- Ting-Yun Chang, Yang Liu, Karthik Gopalakrishnan, guage Technologies, Volume 1 (Long and Short Pa- Behnam Hedayatnia, Pei Zhou, and Dilek Hakkani- pers), pages 3671–3681, Minneapolis, Minnesota. Tur. 2020. Incorporating commonsense knowledge Association for Computational Linguistics. graph in pretrained models for social commonsense tasks. In Proceedings of Deep Learning Inside Out Debjit Paul and Anette Frank. 2020. Social common- (DeeLIO): The First Workshop on Knowledge Ex- sense reasoning with multi-head knowledge atten- traction and Integration for Deep Learning Architec- tion. In Findings of the Association for Computa- tures, pages 74–79, Online. Association for Compu- tional Linguistics: EMNLP 2020, pages 2969–2980, tational Linguistics. Online. Association for Computational Linguistics. Debjit Paul, Juri Opitz, Maria Becker, Jonathan Kobbe, Shanchan Wu and Yifan He. 2019. Enriching pre- Graeme Hirst, and Anette Frank. 2020. Argu- trained language model with entity information mentative Relation Classification with Background for relation classification. In Proceedings of the Knowledge. In Proceedings of the International 28th ACM International Conference on Informa- Conference on Computational Models of Argument tion and Knowledge Management, CIKM ’19, page (COMMA), pages 319–330, Online. 2361–2364, New York, NY, USA. Association for Computing Machinery. Andreas Peldszus and Manfred Stede. 2016. An anno- tated corpus of argumentative microtexts. In Argu- Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. mentation and Reasoned Action: Proceedings of the Weinberger, and Yoav Artzi. 2020. Bertscore: Eval- 1st European Conference on Argumentation, Lisbon uating text generation with bert. In ICLR. 2015 / Vol. 2, pages 801–815, London. College Pub- lications. Xiaobin Zhang, Fucai Chen, and Ruiyang Huang. 2018. A Combination of RNN and CNN for Attention- Alec Radford, Jeffrey Wu, Rewon Child, David Luan, based Relation Classification. In Procedia Com- Dario Amodei, and Ilya Sutskever. 2019. Language puter Science, volume 131, pages 911 – 917. Models are Unsupervised Multitask Learners. APPENDIX Itsumi Saito, Kyosuke Nishida, Hisako Asano, and Junji Tomita. 2018. Commonsense knowledge base A Constructing the RANDOM Class for completion and generation. In Proceedings of the Training COREC-LM in an Open World 22nd Conference on Computational Natural Lan- Setting guage Learning, pages 141–150, Brussels, Belgium. Association for Computational Linguistics. Our downstream application task – finding connec- tions between concepts – requires that our relation Victor Sanh, Lysandre Debut, Julien Chaumond, and classifier also learns to detect that no direct relation Thomas Wolf. 2019. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. ArXiv, holds between a given pair of concepts. We thus abs/1910.01108. extend the data for training and testing COREC- LM with a RANDOM class which contains concept Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. pairs that are not related which we add to CN-100k Conceptnet 5.5: An open multilingual graph of gen- and CN-13 Instances for this class are generated eral knowledge. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, Febru- similarly to Vylomova et al.(2016): 50% of them ary 4-9, 2017, San Francisco, California, USA, are opposite pairs which we obtain by switching the pages 4444–4451. AAAI Press. order of concept pairs within the same relation, and 50% are corrupt pairs, obtained by replacing one Bayu Distiawan Trisedya, Gerhard Weikum, Jianzhong concept in a pair with a random concept from the Qi, and Rui Zhang. 2019. Neural relation extrac- tion for knowledge base enrichment. In Proceed- same relation. Corrupt pairs ensure that COREC- ings of the 57th Annual Meeting of the Association LM learns to encode relation instances rather than for Computational Linguistics, pages 229–240, Flo- simply learning properties of the word classes. We rence, Italy. Association for Computational Linguis- add these instances (2070 for training and 260 for tics. development and testing, respectively) to CN-100k Ekaterina Vylomova, Laura Rimell, Trevor Cohn, and and CN-13 when training and evaluating in an open Timothy Baldwin. 2016. Take and took, gaggle and world setting. goose, book and read: Evaluating the utility of vec- tor differences for lexical relation learning. In Pro- B Relations Used for Constructing Single- and ceedings of the 54th Annual Meeting of the Associa- Multihop Paths tion for Computational Linguistics (Volume 1: Long Papers), pages 1671–1682, Berlin, Germany. Asso- Table4 lists the three most frequently used relations ciation for Computational Linguistics. when constructing single and multihop knowledge paths using COMET, COREC-LM, their combi- Peifeng Wang, Nanyun Peng, Filip Ilievski, Pedro Szekely, and Xiang Ren. 2020. Connecting the dots: nation, and ranked subgraphs, respectively for the A knowledgeable path generator for commonsense two datasets IKAT and ARC. The top three rela- question answering. EMNLP Findings, pages 4129– tions used by COREC-LM within both datasets 4140. are ATLOCATION, HASPROPERTY, and ISA. In- S AS COMET Dirk Weissenborn, Tomas Kocisky, and Chris Dyer. terestingly, besides I A and H A, fre- 2018. Dynamic Integration of Background Knowl- quently uses the only causal relation in the CN edge in Neural NLU Systems. ICLR 2018. inventory CAUSES. In contrast to COREC-LM and COREC-LM COMET CONNECT CN Subgraphs IKAT ATLOCATION(25%) ISA(19%) ATLOCATION(22%) RELATEDTO(71%) HASPROPERTY(20%) HASA(18%) ISA(17%) HASCONTEXT(8%) ISA(17%) CAUSES(17%) HASPROPERTY(16%) IsA(7%) ARC ATLOCATION(31%) ATLOCATION(22%) ATLOCATION(27%) RELATEDTO(72%) ISA(18%) CAUSES(20%) ISA(15%) HASCONTEXT(8%) HASPROPERTY(14%) HASA(18%) HASPROPERTY(10%) ISA(7%)

Table 4: Most frequently used relations when constructing single and multihop knowledge paths using COMET, COREC-LM, their combination, and ranked subgraphs from CN.

COMET, the ranked CN-graphs are constructed us- ing mostly the very general relation RELATEDTO, followed by the likewise vague relation HASCON- TEXT. When excluding paths that contain RELAT- EDTO, only 2,551 connected concept pairs remain in IKAT and 6,858 in ARC.