arXiv:2005.11787v2 [cs.CL] 11 Oct 2020

Common Sense or World Knowledge? Investigating Adapter-Based Knowledge Injection into Pretrained Transformers

Anne Lauscher♣, Olga Majewska♠, Leonardo F. R. Ribeiro♦, Iryna Gurevych♦, Nikolai Rozanov♠, Goran Glavaš♣

♣ Data and Web Science Group, University of Mannheim, Germany
♦ Ubiquitous Knowledge Processing (UKP) Lab, TU Darmstadt, Germany
♠ Wluper, London, United Kingdom
{anne,goran}@informatik.uni-mannheim.de
{olga,nikolai}@wluper.com
www.ukp.tu-darmstadt.de

Abstract

Following the major success of neural language models (LMs) such as BERT or GPT-2 on a variety of language understanding tasks, recent work focused on injecting (structured) knowledge from external resources into these models. While on the one hand, joint pre-training (i.e., training from scratch, adding objectives based on external knowledge to the primary LM objective) may be prohibitively computationally expensive, post-hoc fine-tuning on external knowledge, on the other hand, may lead to the catastrophic forgetting of distributional knowledge. In this work, we investigate models for complementing the distributional knowledge of BERT with conceptual knowledge from ConceptNet and its corresponding Open Mind Common Sense (OMCS) corpus, respectively, using adapter training. While overall results on the GLUE benchmark paint an inconclusive picture, a deeper analysis reveals that our adapter-based models substantially outperform BERT (up to 15-20 performance points) on inference tasks that require the type of conceptual knowledge explicitly present in ConceptNet and OMCS. We also open source all our experiments and relevant code under: https://github.com/wluper/retrograph

1 Introduction

Self-supervised neural models like ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), GPT (Radford et al., 2018, 2019), XLNet (Yang et al., 2019) or RoBERTa (Liu et al., 2019b) have rendered language modeling a very suitable pretraining task for learning language representations that are useful for a wide range of language understanding tasks (Wang et al., 2018, 2019). Although shown versatile w.r.t. the types of knowledge (Rogers et al., 2020) they encode, much like their predecessors – static word embedding models (Mikolov et al., 2013; Pennington et al., 2014) – neural LMs still only "consume" the distributional information from large corpora. Yet, a number of structured knowledge sources exist – knowledge bases (KBs) (Suchanek et al., 2007; Auer et al., 2007) and lexico-semantic networks (Miller, 1995; Liu and Singh, 2004; Navigli and Ponzetto, 2010) – encoding many types of knowledge that are underrepresented in text corpora.

Starting from this observation, most recent efforts focused on injecting factual (Zhang et al., 2019; Liu et al., 2019a; Peters et al., 2019) and linguistic knowledge (Lauscher et al., 2019; Peters et al., 2019) into pretrained LMs and demonstrated the usefulness of such knowledge in language understanding tasks (Wang et al., 2018, 2019). Joint pretraining models, on the one hand, augment distributional LM objectives with additional objectives based on external resources (Yu and Dredze, 2014; Nguyen et al., 2016; Lauscher et al., 2019) and train the extended model from scratch. For models like BERT, this implies computationally expensive retraining from scratch of the encoding transformer network. Post-hoc fine-tuning models (Zhang et al., 2019; Liu et al., 2019a; Peters et al., 2019), on the other hand, use the objectives based on external resources to fine-tune the encoder's parameters, pretrained via distributional LM objectives. If the amount of fine-tuning data is substantial, however, this approach may lead to catastrophic forgetting of distributional knowledge obtained in pretraining (Goodfellow et al., 2014; Kirkpatrick et al., 2017).

In this work, similar to the concurrent work of Wang et al. (2020), we turn to the recently proposed adapter-based fine-tuning paradigm (Rebuffi et al., 2018; Houlsby et al., 2019), which remedies the shortcomings of both joint pretraining and standard post-hoc fine-tuning. Adapter-based training injects additional parameters into the encoder and only tunes their values: original transformer parameters are kept fixed. Because of this, adapter training preserves the distributional information obtained in LM pretraining, without the need for any distributional (re-)training. While Wang et al. (2020) inject factual knowledge from Wikidata (Vrandečić and Krötzsch, 2014) into BERT, in this work, we investigate two resources that are commonly assumed to contain general-purpose and common sense knowledge:¹ ConceptNet (Liu and Singh, 2004; Speer et al., 2017) and the Open Mind Common Sense (OMCS) corpus (Singh et al., 2002), from which the ConceptNet graph was (semi-)automatically extracted. For our first model, dubbed CN-ADAPT, we first create a synthetic corpus by randomly traversing the ConceptNet graph and then learn adapter parameters with masked language modelling (MLM) training (Devlin et al., 2019) on that synthetic corpus. For our second model, named OM-ADAPT, we learn the adapter parameters via MLM training directly on the OMCS corpus.

¹ Our results in §3.2 scrutinize this assumption.

We evaluate both models on the GLUE benchmark, where we observe limited improvements over BERT on a subset of GLUE tasks. However, a more detailed inspection reveals large improvements over the base BERT model (up to 20 Matthews correlation points) on language inference (NLI) subsets labeled as requiring World Knowledge or knowledge about Named Entities. Investigating further, we relate this result to the fact that ConceptNet and OMCS contain much more of what downstream tasks consider factual world knowledge than of what they judge to be common sense knowledge. Our findings pinpoint the need, within the emerging body of work on injecting knowledge into pretrained transformers, for more detailed analyses of compatibility between (1) the types of knowledge contained in external resources and (2) the types of knowledge that benefit concrete downstream tasks.

2 Knowledge Injection Models

In this work, we primarily set out to investigate whether injecting specific types of knowledge (given in the external resource) benefits downstream inference that clearly requires those exact types of knowledge. Because of this, we use the arguably most straightforward mechanisms for injecting the ConceptNet and OMCS information into BERT, and leave the exploration of potentially more effective knowledge injection objectives for future work. We inject the external information into adapter parameters of the adapter-augmented BERT (Houlsby et al., 2019) via BERT's natural objective – masked language modelling (MLM). OMCS, already a corpus in natural language, can be used for MLM training directly – we only filtered out non-English sentences. To subject ConceptNet to MLM training, we need to transform it into a synthetic corpus.

Unwrapping ConceptNet. Following established previous work (Perozzi et al., 2014; Ristoski and Paulheim, 2016), we induce a synthetic corpus from ConceptNet by randomly traversing its graph. We convert relation strings into NL phrases (e.g., synonyms to "is a synonym of") and duplicate the object node of a triple, using it as the subject for the next sentence. For example, from the path "alcoholism --causes--> stigma --hasContext--> christianity --partOf--> religion" we create the text "alcoholism causes stigma. stigma is used in the context of christianity. christianity is part of religion.". We set the walk lengths to 30 relations and sample the starting and neighboring nodes from uniform distributions. In total, we performed 2,268,485 walks, resulting in a corpus of 34,560,307 synthetic sentences.

Adapter-Based Training. We follow Houlsby et al. (2019) and adopt the adapter-based architecture for which they report solid performance across the board. We inject bottleneck adapters into BERT's transformer layers. In each transformer layer, we insert two bottleneck adapters: one after the multi-head attention sub-layer and another after the feed-forward sub-layer. Let $X \in \mathbb{R}^{T \times H}$ be the sequence of contextualized vectors (of size $H$) for the input of $T$ tokens in some transformer layer, input to a bottleneck adapter. The bottleneck adapter, consisting of two feed-forward layers and a residual connection, yields the following output:

$$\mathrm{Adapter}(X) = X + f(X W_d + b_d)\, W_u + b_u$$

where $W_d$ (with bias $b_d$) and $W_u$ (with bias $b_u$) are the adapter's parameters, that is, the weights of the linear down-projection and up-projection sub-layers, and $f$ is the non-linear activation function. Matrix $W_d \in \mathbb{R}^{H \times m}$ compresses vectors in $X$ to the adapter size $m < H$, and the matrix $W_u \in \mathbb{R}^{m \times H}$ projects the activated down-projections back to the transformer's hidden size $H$. The ratio $H/m$ determines how many times fewer parameters we optimize with adapter-based training compared to standard fine-tuning of all the transformer's parameters.
3 Evaluation

We first briefly describe the downstream tasks and training details, and then proceed with the discussion of results obtained with our adapter models.

3.1 Experimental Setup

Downstream Tasks. We evaluate BERT and our two adapter-based models, CN-ADAPT and OM-ADAPT, with injected knowledge from ConceptNet and OMCS, respectively, on the tasks from the GLUE benchmark (Wang et al., 2018):

CoLA (Warstadt et al., 2018): Binary sentence classification, predicting grammatical acceptability of sentences from linguistic publications;

SST-2 (Socher et al., 2013): Binary sentence classification, predicting binary sentiment (positive or negative) for movie review sentences;

MRPC (Dolan and Brockett, 2005): Binary sentence-pair classification, recognizing sentences which are mutual paraphrases;

STS-B (Cer et al., 2017): Sentence-pair regression task, predicting the degree of semantic similarity for a given pair of sentences;

QQP (Chen et al., 2018): Binary classification task, recognizing question paraphrases;

MNLI (Williams et al., 2018): Ternary natural language inference (NLI) classification of sentence pairs. Two test sets are given: a matched version (MNLI-m) in which the test domains match the domains from training data, and a mismatched version (MNLI-mm) with different test domains;

QNLI: A binary classification version of the Stanford Q&A dataset (Rajpurkar et al., 2016);

RTE (Bentivogli et al., 2009): Another NLI dataset, binary entailment classification for sentence pairs;

Diag (Wang et al., 2018): A manually curated NLI dataset, with examples labeled with specific types of knowledge needed for entailment decisions.

Training Details. We inject our adapters into a BERT Base model (12 transformer layers with 12 attention heads each; H = 768) pretrained on lowercased corpora. Following Houlsby et al. (2019), we set the size of all adapters to m = 64 and use GELU (Hendrycks and Gimpel, 2016) as the adapter activation f. We train the adapter parameters with the Adam algorithm (Kingma and Ba, 2015), with the initial learning rate set to 1e−4, 10,000 warm-up steps, and a weight decay factor of 0.01. In downstream fine-tuning, we train in batches of size 16 and limit the input sequences to T = 128 wordpiece tokens. For each task, we find the optimal hyperparameter configuration from the following grid: learning rate $l \in \{2 \cdot 10^{-5}, 3 \cdot 10^{-5}\}$, number of epochs $n \in \{3, 4\}$.

3.2 Results and Analysis

GLUE Results. Table 1 shows the performance of CN-ADAPT and OM-ADAPT in comparison with BERT Base on GLUE evaluation tasks. We show the results for two snapshots of OM-ADAPT, after 25K and 100K update steps, and for two snapshots of CN-ADAPT, after 50K and 100K steps of adapter training. Overall, none of our adapter-based models with injected external knowledge from ConceptNet or OMCS yields significant improvements over BERT Base on GLUE. However, we observe substantial improvements (of around 3 points) on RTE and on the Diagnostic NLI dataset (Diag), which encompasses inference instances that require a specific type of knowledge.

Since our adapter models draw specifically on the conceptual knowledge encoded in ConceptNet and OMCS, we expect the positive impact of injected external knowledge – assuming effective injection – to be most observable on test instances that target the same types of conceptual knowledge. To investigate this further, we measure the model performance across different categories of the Diagnostic NLI dataset. This allows us to tease apart inference instances which truly test the efficacy of our knowledge injection methods. We show the results obtained on different categories of the Diagnostic NLI dataset in Table 2.

Model             CoLA   SST-2  MRPC   STS-B  QQP    MNLI-m  MNLI-mm  QNLI   RTE    Diag   Avg
                  MCC    Acc    F1     Spear  F1     Acc     Acc      Acc    Acc    MCC    –
BERT Base         52.1   93.5   88.9   85.8   71.2   84.6    83.4     90.5   66.4   34.2   75.1
OM-ADAPT (25K)    49.5   93.5   88.8   85.1   71.4   84.4    83.5     90.9   67.5   35.7   75.0
OM-ADAPT (100K)   53.5   93.4   87.9   85.9   71.1   84.2    83.7     90.6   68.2   34.8   75.3
CN-ADAPT (50K)    49.8   93.9   88.9   85.8   71.6   84.2    83.3     90.6   69.7   37.0   75.5
CN-ADAPT (100K)   48.8   92.8   87.1   85.7   71.5   83.9    83.2     90.8   64.1   37.8   74.6

Table 1: Results on test portions of GLUE benchmark tasks. Numbers in brackets next to adapter-based models (25K, 50K, 100K) indicate the number of update steps of adapter training on the synthetic ConceptNet corpus (for CN-ADAPT) or on the original OMCS corpus (for OM-ADAPT). Bold: the best score in each column.
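Table 1 reports Matthews correlation (MCC) for CoLA and Diag. As a reminder of what that metric measures, here is a minimal sketch of the binary form of MCC, computed from confusion-matrix counts. The function name `binary_mcc` and the toy labels are illustrative and not taken from the paper's code. The sketch covers only the two-class case that applies to CoLA; the GLUE diagnostic set has three labels and is scored with a multi-class generalization that is not shown here. The table values appear to be on a 0-100 scale, whereas this function returns values in [-1, 1].

```python
import math

def binary_mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Toy labels, invented for illustration only.
gold = [1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 0, 1, 1]
score = binary_mcc(gold, pred)  # (2*2 - 1*1) / 9 = 1/3
```

Unlike accuracy, MCC is 0 for a classifier that always predicts the same label, which is one reason it suits datasets with skewed label distributions.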

The improvements of our adapter-based models over BERT Base on the phenomenon-specific subsections of the Diagnostic NLI dataset (Table 2) are generally much more pronounced: e.g., OM-ADAPT (25K) yields a 7-point improvement on inference that requires factual or common sense knowledge (KNO), whereas CN-ADAPT (100K) yields a 6-point boost for inference that depends on lexico-semantic knowledge (LS). These results suggest that (1) ConceptNet and OMCS do contain the specific types of knowledge required for these inference categories and that (2) we managed to inject that knowledge into BERT by training adapters on these resources.

Model             LS     KNO    LOG    PAS    All
BERT Base         38.5   20.2   26.7   39.6   34.2
OM-ADAPT (25K)    39.1   27.1   26.1   39.5   35.7
OM-ADAPT (100K)   37.5   21.2   27.4   41.0   34.8
CN-ADAPT (50K)    40.2   24.3   30.1   42.7   37.0
CN-ADAPT (100K)   44.2   25.2   30.4   41.9   37.8

Table 2: Breakdown of Diagnostic NLI performance (Matthews correlation), according to information type needed for inference (coarse-grained categories): Lexical Semantics (LS), Knowledge (KNO), Logic (LOG), and Predicate Argument Structure (PAS).

Model             CS     World  NE
BERT Base         29.0   10.3   15.1
OM-ADAPT (25K)    28.5   25.3   31.4
OM-ADAPT (100K)   24.5   17.3   22.3
CN-ADAPT (50K)    25.6   21.1   26.0
CN-ADAPT (100K)   24.4   25.6   36.5

Table 3: Results (Matthews correlation) on Common Sense (CS), World Knowledge (World), and Named Entities (NE) categories of the Diagnostic NLI dataset.

Fine-Grained Knowledge Type Analysis. In our final analysis, we "zoom in" on our models' performance on three fine-grained categories of the Diagnostic NLI dataset – inference instances that require Common Sense Knowledge (CS), World Knowledge (World), and knowledge about Named Entities (NE), respectively. The results for these fine-grained categories are given in Table 3. These results show an interesting pattern: our adapter-based knowledge-injection models massively outperform BERT Base (by up to 15 and 21 MCC points, respectively) on NLI instances labeled as requiring World Knowledge or knowledge about Named Entities. In contrast, we see drops in performance on instances labeled as requiring common sense knowledge. This initially came as a surprise, given the common belief that OMCS and ConceptNet contain the so-called common sense knowledge. Manual scrutiny of the diagnostic test instances from both CS and World categories uncovers a noticeable mismatch between the kind of information that is considered common sense in KBs like ConceptNet and what is considered common sense knowledge in downstream tasks. In fact, the majority of information present in ConceptNet and OMCS falls under the World Knowledge definition of the Diagnostic NLI dataset, including factual geographic information (stockholm [partOf] sweden), domain knowledge (roadster [isA] car) and specialized terminology (indigenous [synonymOf] aboriginal).

In contrast, many of the CS inference instances require complex, high-level reasoning, understanding metaphorical and idiomatic meaning, and making far-reaching connections. We display Diagnostic NLI examples from the World Knowledge and Common Sense categories in Table 4. In such cases, explicit conceptual links often do not suffice for a correct inference and much of the required knowledge is not explicitly encoded in the external resources. Consider, e.g., the following CS NLI instance: [premise: My jokes fully reveal my character ; hypothesis: If everyone believed my jokes, they'd know exactly who I was ; entailment]. While ConceptNet and OMCS may associate character with personality or personality with identity, the knowledge that the phrase "who I was" may refer to identity is beyond the explicit knowledge present in these resources. This sheds light on the results in Table 3: when the knowledge required to tackle the inference problem at hand is available in the external resource, our adapter-based knowledge-injected models significantly outperform the baseline transformer; otherwise, the benefits of knowledge injection are negligible or non-existent. The promising results on the world knowledge and named entities portions of the Diagnostic dataset suggest that our methods do successfully inject external information into the pretrained transformer and that the presence of the required knowledge for the task in the external resources is an obvious prerequisite.

Knowledge: World
  Premise:     The sides came to an agreement after their meeting in Stockholm.
  Hypothesis:  The sides came to an agreement after their meeting in Sweden.
  ConceptNet?  stockholm [partOf] sweden

  Premise:     Musk decided to offer up his personal Tesla roadster.
  Hypothesis:  Musk decided to offer up his personal car.
  ConceptNet?  roadster [isA] car

  Premise:     The Sydney area has been inhabited by indigenous Australians for at least 30,000 years.
  Hypothesis:  The Sydney area has been inhabited by Aboriginal people for at least 30,000 years.
  ConceptNet?  indigenous [synonymOf] aboriginal

Knowledge: Common Sense
  Premise:     My jokes fully reveal my character.
  Hypothesis:  If everyone believed my jokes, they'd know exactly who I was.
  ConceptNet?  –

  Premise:     The systems thus produced are incremental: dialogues are processed word-by-word, shown previously to be essential in supporting natural, spontaneous dialogue.
  Hypothesis:  The systems thus produced support the capability to interrupt an interlocutor mid-sentence.
  ConceptNet?  –

  Premise:     He deceitfully proclaimed: "This is all I ever really wanted."
  Hypothesis:  He was satisfied.
  ConceptNet?  –

Table 4: Premise-hypothesis examples from the diagnostic NLI dataset tagged for commonsense and world knowledge, and relevant ConceptNet relations, where available.

4 Conclusion

We presented two simple strategies for injecting external knowledge from ConceptNet and the OMCS corpus, respectively, into BERT via bottleneck adapters. Additional adapter parameters store the external knowledge and allow for the preservation of the rich distributional knowledge acquired in BERT's pretraining in the original transformer parameters. We demonstrated the effectiveness of these models in language understanding settings that require precisely the type of knowledge that one finds in ConceptNet and OMCS, in which our adapter-based models outperform BERT by up to 20 performance points. Our findings stress the importance of having detailed analyses that compare (a) the types of knowledge found in external resources being injected against (b) the types of knowledge that a concrete downstream reasoning task requires. We hope this work motivates further research effort in the direction of fine-grained knowledge typing, both of explicit knowledge in external resources and the implicit knowledge stored in pretrained transformers.

Acknowledgments

Anne Lauscher and Goran Glavaš are supported by the Eliteprogramm of the Baden-Württemberg Stiftung (AGREE grant). Leonardo F. R. Ribeiro has been supported by the German Research Foundation as part of the Research Training Group AIPHES under the grant No. GRK 1994/1. This work has been supported by the German Research Foundation within the project "Open Argument Mining" (GU 798/25-1), associated with the Priority Program "Robust Argumentation Machines (RATIO)" (SPP-1999). The work of Olga Majewska was conducted under the research lab of Wluper Ltd. (UK/ 10195181).

References

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735. Springer.

Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC.

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association for Computational Linguistics.

Zihan Chen, Hongbo Zhang, Xiaoji Zhang, and Leqi Zhao. 2018. Quora question pairs.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. 2014. An empirical investigation of catastrophic forgetting in gradient-based neural networks. In Proceedings of International Conference on Learning Representations (ICLR). Citeseer.

Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELUs).

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799.

Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of ICLR.

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526.

Anne Lauscher, Ivan Vulić, Edoardo Maria Ponti, Anna Korhonen, and Goran Glavaš. 2019. Informing unsupervised pretraining with external linguistic knowledge. arXiv preprint arXiv:1909.02339.

Hugo Liu and Push Singh. 2004. ConceptNet – a practical commonsense reasoning tool-kit. BT Technology Journal, 22(4):211–226.

Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, and Ping Wang. 2019a. K-BERT: Enabling language representation with knowledge graph. arXiv preprint arXiv:1909.07606.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

George A Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

Roberto Navigli and Simone Paolo Ponzetto. 2010. BabelNet: Building a very large multilingual semantic network. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 216–225. Association for Computational Linguistics.

Kim Anh Nguyen, Sabine Schulte im Walde, and Ngoc Thang Vu. 2016. Integrating distributional lexical contrast into word embeddings for antonym-synonym distinction. In Proceedings of ACL, pages 454–459.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, page 701–710, New York, NY, USA. Association for Computing Machinery.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237.

Matthew E. Peters, Mark Neumann, Robert Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge enhanced contextual word representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 43–54.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI Technical Report.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. 2018. Efficient parametrization of multi-domain deep neural networks. In CVPR.

Petar Ristoski and Heiko Paulheim. 2016. RDF2Vec: RDF graph embeddings for data mining. In International Semantic Web Conference, pages 498–514. Springer.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. arXiv preprint arXiv:2002.12327.

Push Singh, Thomas Lin, Erik T Mueller, Grace Lim, Travell Perkins, and Wan Li Zhu. 2002. Open Mind Common Sense: Knowledge acquisition from the general public. In OTM Confederated International Conferences "On the Move to Meaningful Internet Systems", pages 1223–1237. Springer.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Robert Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Thirty-First AAAI Conference on Artificial Intelligence.

Fabian M Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. YAGO: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, pages 697–706. ACM.

Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Communications of the ACM, 57(10):78–85.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, pages 3261–3275.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the BlackboxNLP Workshop, pages 353–355.

Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Cuihong Cao, Daxin Jiang, Ming Zhou, et al. 2020. K-Adapter: Infusing knowledge into pre-trained models with adapters. arXiv preprint arXiv:2002.01808.

Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Mo Yu and Mark Dredze. 2014. Improving lexical embeddings with semantic knowledge. In Proceedings of ACL, pages 545–550.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1441–1451.