arXiv:2005.11787v2 [cs.CL] 11 Oct 2020

Common Sense or World Knowledge? Investigating Adapter-Based Knowledge Injection into Pretrained Transformers

Anne Lauscher♣ Olga Majewska♠ Leonardo F. R. Ribeiro♦ Iryna Gurevych♦ Nikolai Rozanov♠ Goran Glavaš♣

♣Data and Web Science Group, University of Mannheim, Germany
♠Wluper, London, United Kingdom
♦Ubiquitous Knowledge Processing (UKP) Lab, TU Darmstadt, Germany
{anne,goran}@informatik.uni-mannheim.de
{olga,nikolai}@wluper.com
www.ukp.tu-darmstadt.de

Abstract

Following the major success of neural language models (LMs) such as BERT or GPT-2 on a variety of language understanding tasks, recent work focused on injecting (structured) knowledge from external resources into these models. While on the one hand, joint pre-training (i.e., training from scratch, adding objectives based on external knowledge to the primary LM objective) may be prohibitively computationally expensive, post-hoc fine-tuning on external knowledge, on the other hand, may lead to the catastrophic forgetting of distributional knowledge. In this work, we investigate models for complementing the distributional knowledge of BERT with conceptual knowledge from ConceptNet and its corresponding Open Mind Common Sense (OMCS) corpus, respectively, using adapter training. While overall results on the GLUE benchmark paint an inconclusive picture, a deeper analysis reveals that our adapter-based models substantially outperform BERT (up to 15-20 performance points) on inference tasks that require the type of conceptual knowledge explicitly present in ConceptNet and OMCS. We also open source all our experiments and relevant code under: https://github.com/wluper/retrograph.

1 Introduction

Self-supervised neural models like ELMo (Peters et al., 2018), BERT (Devlin et al., 2019; Liu et al., 2019b), GPT (Radford et al., 2018, 2019), or XLNet (Yang et al., 2019) have rendered language modeling a very suitable pretraining task for learning language representations that are useful for a wide range of language understanding tasks (Wang et al., 2018, 2019). Although shown to be versatile w.r.t. the types of knowledge (Rogers et al., 2020) they encode, neural LMs, much like their predecessors, the static word embedding models (Mikolov et al., 2013; Pennington et al., 2014), still only “consume” the distributional information from large corpora. Yet, a number of structured knowledge sources exist, namely knowledge bases (KBs) (Suchanek et al., 2007; Auer et al., 2007) and lexico-semantic networks (Miller, 1995; Liu and Singh, 2004; Navigli and Ponzetto, 2010), which encode many types of knowledge that are underrepresented in text corpora.

Starting from this observation, most recent efforts focused on injecting factual (Zhang et al., 2019; Liu et al., 2019a; Peters et al., 2019) and linguistic knowledge (Lauscher et al., 2019; Peters et al., 2019) into pretrained LMs and demonstrated the usefulness of such knowledge in language understanding tasks (Wang et al., 2018, 2019). Joint pretraining models, on the one hand, augment distributional LM objectives with additional objectives based on external resources (Yu and Dredze, 2014; Nguyen et al., 2016; Lauscher et al., 2019) and train the extended model from scratch. For models like BERT, this implies computationally expensive retraining from scratch of the encoding transformer network. Post-hoc fine-tuning models (Zhang et al., 2019; Liu et al., 2019a; Peters et al., 2019), on the other hand, use the objectives based on external resources to fine-tune the encoder’s parameters, pretrained via distributional LM objectives. If the amount of fine-tuning data is substantial, however, this approach may lead to catastrophic forgetting of distributional knowledge obtained in pretraining (Goodfellow et al., 2014; Kirkpatrick et al., 2017).

In this work, similar to the concurrent work of Wang et al. (2020), we turn to the recently proposed adapter-based fine-tuning paradigm (Rebuffi et al., 2018; Houlsby et al., 2019), which remedies the shortcomings of both joint pretraining and standard post-hoc fine-tuning. Adapter-based training injects additional parameters into the encoder and only tunes their values: original transformer parameters are kept fixed. Because of this, adapter training preserves the distributional information obtained in LM pretraining, without the need for any distributional (re-)training. While Wang et al. (2020) inject factual knowledge from Wikidata (Vrandečić and Krötzsch, 2014) into BERT, in this work, we investigate two resources that are commonly assumed to contain general-purpose and common sense knowledge:¹ ConceptNet (Liu and Singh, 2004; Speer et al., 2017) and the Open Mind Common Sense (OMCS) corpus (Singh et al., 2002), from which the ConceptNet graph was (semi-)automatically extracted. For our first model, dubbed CN-ADAPT, we first create a synthetic corpus by randomly traversing the ConceptNet graph and then learn adapter parameters with masked language modelling (MLM) training (Devlin et al., 2019) on that synthetic corpus. For our second model, named OM-ADAPT, we learn the adapter parameters via MLM training directly on the OMCS corpus.

¹ Our results in §3.2 scrutinize this assumption.

We evaluate both models on the GLUE benchmark, where we observe limited improvements over BERT on a subset of GLUE tasks. However, a more detailed inspection reveals large improvements over the base BERT model (up to 20 Matthews correlation points) on language inference (NLI) subsets labeled as requiring World Knowledge or knowledge about Named Entities. Investigating further, we relate this result to the fact that ConceptNet and OMCS contain much more of what is considered, in downstream tasks, to be factual world knowledge than of what is judged as common sense knowledge. Our findings pinpoint the need for more detailed analyses of compatibility between (1) the types of knowledge contained by external resources; and (2) the types of knowledge that benefit concrete downstream tasks; within the emerging body of work on injecting knowledge into pretrained transformers.

2 Knowledge Injection Models

In this work, we primarily set out to investigate whether injecting specific types of knowledge (given in the external resource) benefits downstream inference that clearly requires those exact types of knowledge. Because of this, we use the arguably most straightforward mechanisms for injecting the ConceptNet and OMCS information into BERT, and leave the exploration of potentially more effective knowledge injection objectives for future work. We inject the external information into adapter parameters of the adapter-augmented BERT (Houlsby et al., 2019) via BERT’s natural objective, masked language modelling (MLM). OMCS, already a corpus in natural language, is directly subjectable to MLM training; we only filtered out non-English sentences. To subject ConceptNet to MLM training, we need to transform it into a synthetic corpus.

Unwrapping ConceptNet. Following established previous work (Perozzi et al., 2014; Ristoski and Paulheim, 2016), we induce a synthetic corpus from ConceptNet by randomly traversing its graph. We convert relation strings into NL phrases (e.g., synonyms to is a synonym of) and duplicate the object node of a triple, using it as the subject for the next sentence. For example, from the path “$\text{alcoholism} \xrightarrow{\text{causes}} \text{stigma} \xrightarrow{\text{hasContext}} \text{christianity} \xrightarrow{\text{partOf}} \text{religion}$” we create the text “alcoholism causes stigma. stigma is used in the context of christianity. christianity is part of religion.”. We set the walk lengths to 30 relations and sample the starting and neighboring nodes from uniform distributions. In total, we performed 2,268,485 walks, resulting in a corpus of 34,560,307 synthetic sentences.

Adapter-Based Training. We follow Houlsby et al. (2019) and adopt the adapter-based architecture for which they report solid performance across the board. We inject bottleneck adapters into BERT’s transformer layers. In each transformer layer, we insert two bottleneck adapters: one after the multi-head attention sub-layer and another after the feed-forward sub-layer. Let $X \in \mathbb{R}^{T \times H}$ be the sequence of contextualized vectors (of size $H$) for the input of $T$ tokens in some transformer layer, input to a bottleneck adapter. The bottleneck adapter, consisting of two feed-forward layers and a residual connection, yields the following output:

$$\mathrm{Adapter}(X) = X + f(X W_d + b_d)\, W_u + b_u$$

where $W_d$ (with bias $b_d$) and $W_u$ (with bias $b_u$) are the adapter’s parameters, that is, the weights of the linear down-projection and up-projection sub-layers, and $f$ is the non-linear activation function. Matrix $W_d \in \mathbb{R}^{H \times m}$ compresses vectors in $X$ to the adapter size $m < H$, and the matrix $W_u \in \mathbb{R}^{m \times H}$ projects the activated down-projections back to the transformer’s hidden size $H$. The ratio $H/m$ determines how many times fewer parameters we optimize with adapter-based training compared to standard fine-tuning of all of the transformer’s parameters.

3 Evaluation

We first briefly describe the downstream tasks and training details, and then proceed with the discussion of results obtained with our adapter models.

3.1 Experimental Setup

Downstream Tasks. We evaluate BERT and our two adapter-based models, CN-ADAPT and OM-ADAPT, with injected knowledge from ConceptNet and OMCS, respectively, on the tasks from the GLUE benchmark.

Training Details. We inject our adapters into a BERT Base model (12 transformer layers with 12 attention heads each; $H = 768$) pretrained on lowercased corpora. Following Houlsby et al. (2019), we set the size of all adapters to $m = 64$ and use GELU (Hendrycks and Gimpel, 2016) as the adapter activation $f$. We train the adapter parameters with the Adam algorithm (Kingma and Ba, 2015) (initial learning rate set to $10^{-4}$, with 10000 warm-up steps and a weight decay factor of 0.01). In downstream fine-tuning, we train in batches of size 16 and limit the input sequences to $T = 128$ wordpiece tokens. For each task, we find the optimal hyperparameter configuration from the following grid: learning rate $l \in \{2 \cdot 10^{-5}, 3 \cdot 10^{-5}\}$, number of epochs $n \in \{3, 4\}$.

3.2 Results and Analysis

GLUE Results.
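To make the "Unwrapping ConceptNet" step of Section 2 more concrete, the following is a minimal sketch of turning random walks over a triple graph into synthetic sentences. It is an illustration, not the authors' implementation: the function names, the `start` argument, the choice to stop a walk at a node without outgoing edges, and sampling start nodes only among nodes that have outgoing edges are all assumptions. Only the four relation phrases shown come from the examples in the text.

```python
import random

# Relation-to-phrase map. Only these four phrases appear in the paper's
# examples; a full ConceptNet mapping would need many more entries.
RELATION_PHRASES = {
    "causes": "causes",
    "hasContext": "is used in the context of",
    "partOf": "is part of",
    "synonyms": "is a synonym of",
}


def build_graph(triples):
    """Index (subject, relation, object) triples by their subject node."""
    graph = {}
    for subj, rel, obj in triples:
        graph.setdefault(subj, []).append((rel, obj))
    return graph


def random_walk(graph, walk_length, rng, start=None):
    """Return the sentences of one walk with at most `walk_length` relations.

    The start node (unless given) and each next edge are sampled uniformly.
    Assumption: the walk ends early at a node with no outgoing edges.
    """
    node = start if start is not None else rng.choice(sorted(graph))
    sentences = []
    for _ in range(walk_length):
        edges = graph.get(node)
        if not edges:
            break
        rel, obj = rng.choice(edges)
        sentences.append(f"{node} {RELATION_PHRASES[rel]} {obj}.")
        node = obj  # the object becomes the subject of the next sentence
    return sentences


if __name__ == "__main__":
    toy_graph = build_graph([
        ("alcoholism", "causes", "stigma"),
        ("stigma", "hasContext", "christianity"),
        ("christianity", "partOf", "religion"),
    ])
    walk = random_walk(toy_graph, 30, random.Random(0), start="alcoholism")
    print(" ".join(walk))
```

On this three-triple toy graph, a walk started at "alcoholism" has only one possible continuation at each step, so it reproduces the example text from Section 2.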
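The bottleneck adapter equation from Section 2, Adapter(X) = X + f(X W_d + b_d) W_u + b_u, can be sketched in a few lines. This is a dependency-free illustration using nested Python lists so that it runs anywhere; a real implementation would use a tensor library. The helper names are hypothetical, and the exact (erf-based) form of GELU is assumed here.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x p) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_bias(a, bias):
    """Add a bias vector to every row of a matrix."""
    return [[x + b for x, b in zip(row, bias)] for row in a]


def adapter(x, w_d, b_d, w_u, b_u, f=gelu):
    """Bottleneck adapter: X + f(X W_d + b_d) W_u + b_u.

    x: T x H, w_d: H x m, b_d: length m, w_u: m x H, b_u: length H.
    """
    down = add_bias(matmul(x, w_d), b_d)               # T x m down-projection
    activated = [[f(v) for v in row] for row in down]  # non-linearity f
    up = add_bias(matmul(activated, w_u), b_u)         # T x H up-projection
    # Residual connection: add the adapter's input back to its output.
    return [[xi + ui for xi, ui in zip(x_row, up_row)]
            for x_row, up_row in zip(x, up)]


if __name__ == "__main__":
    T, H, m = 2, 4, 2
    x = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.0, 2.0]]
    w_d = [[0.1] * m for _ in range(H)]
    w_u = [[0.0] * H for _ in range(m)]  # zero up-projection
    out = adapter(x, w_d, [0.0] * m, w_u, [0.0] * H)
    print(out == x)  # with W_u = 0 and b_u = 0 the adapter is the identity
```

The demo highlights a property of the residual form: when the up-projection weights and bias are zero, the adapter leaves its input unchanged, so the underlying transformer's behaviour is preserved.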
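As a worked example of the sizes involved, the adapter equation from Section 2 can be combined with the settings from the Training Details ($H = 768$, $m = 64$, 12 layers, two adapters per layer). The count below covers only the projection matrices and biases that appear in the adapter equation; any other parameters trained alongside them are not specified in this section and are left out.

```latex
\begin{align*}
\text{parameters per adapter}
  &= \underbrace{Hm + m}_{W_d,\ b_d} + \underbrace{mH + H}_{W_u,\ b_u}
   = 2Hm + m + H \\
  &= 2 \cdot 768 \cdot 64 + 64 + 768 = 99{,}136, \\
\text{adapter parameters in total}
  &= 12 \text{ layers} \times 2 \text{ adapters} \times 99{,}136 = 2{,}379{,}264, \\
\frac{H}{m} &= \frac{768}{64} = 12 .
\end{align*}
```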
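The downstream hyperparameter search in the Training Details is small enough to enumerate explicitly. The sketch below lists the four configurations of the grid and picks the best one by a score. The `evaluate` callback is a hypothetical stand-in for fine-tuning on a task and measuring development-set performance, which this section does not describe in detail.

```python
from itertools import product

# Grid and fixed settings as stated in the Training Details.
LEARNING_RATES = (2e-5, 3e-5)
EPOCHS = (3, 4)
BATCH_SIZE = 16
MAX_SEQ_LENGTH = 128  # wordpiece tokens


def hyperparameter_grid():
    """Enumerate every (learning rate, epochs) configuration of the grid."""
    return [
        {
            "learning_rate": lr,
            "epochs": n,
            "batch_size": BATCH_SIZE,
            "max_seq_length": MAX_SEQ_LENGTH,
        }
        for lr, n in product(LEARNING_RATES, EPOCHS)
    ]


def select_best(configs, evaluate):
    """Return the configuration with the highest score.

    `evaluate` is a hypothetical callback: it takes a configuration and
    returns a score where higher is better.
    """
    return max(configs, key=evaluate)


if __name__ == "__main__":
    for config in hyperparameter_grid():
        print(config)
```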
