arXiv:2104.07094v1 [cs.CL] 14 Apr 2021
Static Embeddings as Efficient Knowledge Bases?

Philipp Dufter∗, Nora Kassner∗, Hinrich Schütze
Center for Information and Language Processing (CIS), LMU Munich, Germany
{philipp,kassner}@cis.lmu.de
∗ Equal contribution - random order.

Abstract

Recent research investigates factual knowledge stored in large pretrained language models (PLMs). Instead of structural knowledge base (KB) queries, masked sentences such as “Paris is the capital of [MASK]” are used as probes. The good performance on this analysis task has been interpreted as PLMs becoming potential repositories of factual knowledge. In experiments across ten linguistically diverse languages, we study knowledge contained in static embeddings. We show that, when restricting the output space to a candidate set, simple nearest neighbor matching using static embeddings performs better than PLMs. E.g., static embeddings perform 1.6 percentage points better than BERT while using just 0.3% of the energy for training. One important factor in their good comparative performance is that static embeddings are standardly learned for a large vocabulary. In contrast, BERT exploits its more sophisticated, but expensive ability to compose meaningful representations from a much smaller subword vocabulary.

Model      Vocabulary Size   p1 LAMA   p1 LAMA-UHN
Oracle     -                 22.0      23.7
BERT       30k               39.6      30.7
mBERT      110k              36.3      27.4
fastText   BERT-30k          26.9      16.8
fastText   mBERT-110k        27.5      17.8
fastText   30k               16.4       5.8
fastText   120k              34.3      25.0
fastText   250k              37.7      29.0
fastText   500k              39.9      31.8
fastText   1000k             41.2      33.4

Table 1: Results for majority oracle, BERT, mBERT and fastText. Static fastText embeddings are competitive and outperform BERT for large vocabularies. BERT and mBERT use their subword vocabularies. For fastText, we use BERT/mBERT’s vocabularies and newly trained wordpiece vocabularies on Wikipedia.

1 Introduction

Pretrained language models (PLMs) (Peters et al., 2018; Howard and Ruder, 2018; Devlin et al., 2019) can be finetuned to a variety of natural language processing (NLP) tasks and then generally yield high performance. Increasingly, these models and their generative variants (e.g., GPT, Brown et al., 2020) are used to solve tasks by simple text generation, without any finetuning. This motivated research on how much knowledge is contained in PLMs: Petroni et al. (2019) used models pretrained with a masked language objective to answer cloze-style templates such as:

(Ex1) Paris is the capital of [MASK].

Using this methodology, Petroni et al. (2019) showed that PLMs capture some knowledge implicitly. This has been interpreted as suggesting that PLMs are promising as repositories of factual knowledge. In this paper, we present evidence that simple static embeddings like fastText perform as well as PLMs in the context of answering knowledge base (KB) queries.

Answering KB queries can be decomposed into two subproblems, typing and ranking. Typing refers to the problem of predicting the correct type of the answer entity; e.g., “country” is the correct type for [MASK] in (Ex1), a task that PLMs seem to be good at. Ranking consists of finding the entity of the correct type that is the best fit (“France” in (Ex1)). By restricting the output space to the correct type we disentangle the two subproblems and only evaluate ranking. We do this for three reasons. (i) Ranking is the knowledge-intensive step and thus the key research question. (ii) Typed querying reduces PLMs’ dependency on the template. (iii) It allows a direct comparison between static word embeddings and PLMs. Prior work has adopted a similar approach (Xiong et al., 2020; Kassner et al., 2021).

For a PLM like BERT, ranking amounts to finding the entity whose embedding is most similar to the output embedding for [MASK].
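The typed ranking step for a PLM can be illustrated with a minimal sketch. It assumes that the output vector at the [MASK] position and the entity embeddings are already available as plain lists of floats. The function name `rank_candidates`, the toy vectors, and the choice of a dot product as the similarity are illustrative assumptions and are not taken from the authors' code.

```python
from typing import Dict, List


def dot(u: List[float], v: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_candidates(mask_output: List[float],
                    entity_embeddings: Dict[str, List[float]],
                    candidates: List[str]) -> List[str]:
    """Sort candidates by similarity to the [MASK] output vector, best first.

    Only entities listed in `candidates` are scored, so the type of the
    answer is taken as given and only ranking is evaluated.
    """
    return sorted(candidates,
                  key=lambda c: dot(mask_output, entity_embeddings[c]),
                  reverse=True)


# Toy, made-up 3-dimensional vectors (not taken from any real model).
entity_embeddings = {
    "France": [0.9, 0.1, 0.0],
    "Germany": [0.2, 0.8, 0.1],
    "banana": [2.0, 2.0, 2.0],  # scores highest overall, but is not a country
}
mask_output = [1.0, 0.2, 0.1]       # hypothetical output vector for [MASK]
countries = ["France", "Germany"]   # candidate set of the correct type

ranking = rank_candidates(mask_output, entity_embeddings, countries)
print(ranking)
```

In this toy setting, "banana" would win an unrestricted search over the whole vocabulary, whereas restricting the output space to the candidate set leaves only the choice among entities of the correct type.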
For static embeddings, we rank entities (e.g., entities of type country) with respect to similarity to the query entity (e.g., “Paris” in (Ex1)). In experiments across ten linguistically diverse languages, we show that this simple nearest neighbor matching with fastText embeddings performs comparably to or even better than BERT. For example, for English, fastText embeddings perform 1.6 percentage points better than BERT (41.2% vs. 39.6%, see Table 1, column “LAMA”). This suggests that BERT’s core mechanism for answering factual queries is not more effective than simple nearest neighbor matching using fastText embeddings.

We believe this means that claims that PLMs are KBs have to be treated with caution. Advantages of BERT are that it composes meaningful representations from a small subword vocabulary and handles typing implicitly (Petroni et al., 2019). In contrast, answering queries without restricting the answer space to a list of candidates is hard to achieve with static word embeddings. On the other hand, static embeddings are cheap to obtain, even for large vocabulary sizes. This has important implications for green NLP. PLMs require tremendous computational resources, whereas static embeddings have only 0.3% of the carbon footprint of BERT (see Table 4). This argues for proponents of resource-hungry deep learning models to try harder to find cheap “green” baselines or to combine the best of both worlds (cf. Poerner et al., 2020).

In summary, our contributions are:

i) We propose an experimental setup that allows a direct comparison between PLMs and static word embeddings. We find that static word embeddings show performance similar to BERT on the modified LAMA analysis task across ten languages.

ii) We provide evidence that there is a trade-off between composing meaningful representations from subwords and increasing the vocabulary size. Storing information through composition in a network seems to be more expensive and challenging than simply increasing the number of atomic representations.

iii) Our findings may point to a general problem: baselines that are simpler and “greener” are not given enough attention in deep learning.

Code and embeddings are available online.¹

¹ https://github.com/pdufter/staticlama

2 Data

We follow the LAMA setup introduced by Petroni et al. (2019). More specifically, we use data from TREx (Elsahar et al., 2018). TREx consists of triples of the form (subject, relation, object). The underlying idea of LAMA is to query knowledge from PLMs using templates without any finetuning: the triple (Paris, capital-of, France) is queried with the template “Paris is the capital of [MASK].” TREx covers 41 relations. Templates for each relation were manually created by Petroni et al. (2019). LAMA has been found to contain many “easy-to-guess” triples; e.g., it is easy to guess that a person with an Italian-sounding name is Italian. LAMA-UHN is a subset of “hard-to-guess” triples created by Poerner et al. (2020).

Beyond English, we run experiments on nine additional languages using mLAMA, a multilingual version of TREx (Kassner et al., 2021). For an overview of languages and language families see Table 2. For training static embeddings, we use Wikipedia dumps from October 2020.

Language   Code   Family          Script
Arabic     AR     Afro-Asiatic    Arabic
German     DE     Indo-European   Latin
English    EN     Indo-European   Latin
Spanish    ES     Indo-European   Latin
Finnish    FI     Uralic          Latin
Hebrew     HE     Afro-Asiatic    Hebrew
Japanese   JA     Japonic         Japanese
Korean     KO     Koreanic        Korean
Turkish    TR     Turkic          Latin
Thai       TH     Tai-Kadai       Thai

Table 2: Overview of the ten languages in our experiments, including language family and script.

3 Methods

We describe our proposed setup, which allows us to compare PLMs with static embeddings.

3.1 PLMs

We use the following two PLMs: (i) BERT for English (BERT-base-cased, Devlin et al. (2019)), (ii) mBERT for all ten languages (the multilingual version BERT-base-multilingual-cased).

Petroni et al. (2019) use templates like “Paris is the capital of [MASK]” and give $\arg\max_{w \in V} p(w \mid t)$ as answer, where $V$ is the vocabulary of the PLM and $p(w \mid t)$ is the probability that word $w$ gets predicted in the template $t$.

We follow the same setup as (Kassner et al., […]

Model      Vocab. Size   AR     DE     ES     FI     HE     JA     KO     TH     TR
Oracle     -             21.9   22.3   21.6   21.3   22.9   21.3   21.7   23.7   23.5
mBERT      110k          17.2   31.5   33.6   20.6   17.5   15.1   18.9   13.5   33.8
fastText   mB-110k       16.4   20.9   24.6   21.4   14.5   12.9   16.1   12.9   26.0
fastText   30k           20.8   16.2   17.1   16.7   21.4   14.6   17.3   21.3   22.1
fastText   120k          27.9   25.2   31.0   24.2   28.3   22.4   28.2   28.0   33.2
fastText   250k          30.1   30.3   34.2   28.8   32.8   24.9   30.5   31.6   35.6
fastText   500k          31.7   32.5   36.6   30.9   33.7   27.0   31.5   31.8   36.1
fastText   1000k         31.3   33.6   36.5   31.8   33.9   27.2   29.8   30.5   36.6

Table 3: p1 for mBERT and fastText on mLAMA. fastText clearly outperforms mBERT for large vocabularies. Numbers across languages are not comparable as the number of triples varies.

Model   Power (W)   h    kWh · PUE   CO2e
BERT    12,041      79   1,507       1,438

[…] cheap for static embeddings. We thus experiment with different vocabulary sizes for static embeddings. To this end, we train new vocabularies for each language on Wikipedia using the wordpiece tokenizer (Schuster and Nakajima, 2012).

3.3 Static Embeddings

Using either newly trained vocabularies or existing BERT vocabularies, we tokenize Wikipedia. We then train fastText embeddings (Bojanowski et al., 2017) with default parameters (http://fasttext.cc). We consider the same candidate set $C$ as for PLMs. Let $c \in C$ be a candidate that gets split into tokens $t_1, \ldots, t_k$ by the wordpiece tokenizer.
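The nearest neighbor matching with static embeddings described in the introduction can be sketched as follows. This is a hedged illustration only: it assumes that a multi-token entity is represented by the mean of its token vectors and that similarity is measured by cosine similarity, neither of which is specified in the text up to this point. The token strings, the 2-dimensional vectors, and the helper names (`embed`, `nearest_candidate`) are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; both vectors are assumed to be non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def embed(tokens: List[str], token_vectors: Dict[str, Vector]) -> Vector:
    """ASSUMPTION: an entity is the mean of its wordpiece token vectors."""
    return mean_vector([token_vectors[t] for t in tokens])


def nearest_candidate(query_tokens: List[str],
                      candidates: Dict[str, List[str]],
                      token_vectors: Dict[str, Vector]) -> str:
    """Return the candidate whose vector is most similar to the query entity."""
    query = embed(query_tokens, token_vectors)
    return max(candidates,
               key=lambda c: cosine(query, embed(candidates[c], token_vectors)))


# Toy, made-up vectors standing in for trained fastText embeddings.
token_vectors = {
    "Paris": [1.0, 0.0],
    "Fran": [0.8, 0.2],   # hypothetical wordpiece split of "France"
    "##ce": [1.0, 0.0],
    "Japan": [0.0, 1.0],
}
# Candidate set C of type "country", each candidate with its token sequence.
candidates = {"France": ["Fran", "##ce"], "Japan": ["Japan"]}

answer = nearest_candidate(["Paris"], candidates, token_vectors)
print(answer)
```

No template is needed in this sketch: the query entity alone is compared with each candidate, which is why the approach depends on the candidate set already being restricted to the correct type.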