SCIBERT: A Pretrained Language Model for Scientific Text

Iz Beltagy, Kyle Lo, Arman Cohan
Allen Institute for Artificial Intelligence, Seattle, WA, USA
{beltagy,kylel,armanc}@allenai.org

Abstract

Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SCIBERT, a pretrained language model based on BERT (Devlin et al., 2019) to address the lack of high-quality, large-scale labeled scientific data. SCIBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks. The code and pretrained models are available at https://github.com/allenai/scibert/.

1 Introduction

The exponential increase in the volume of scientific publications in the past decades has made NLP an essential tool for large-scale knowledge extraction and machine reading of these documents. Recent progress in NLP has been driven by the adoption of deep neural models, but training such models often requires large amounts of labeled data. In general domains, large-scale training data is often possible to obtain through crowdsourcing, but in scientific domains, annotated data is difficult and expensive to collect due to the expertise required for quality annotation.

As shown through ELMo (Peters et al., 2018), GPT (Radford et al., 2018) and BERT (Devlin et al., 2019), unsupervised pretraining of language models on large corpora significantly improves performance on many NLP tasks. These models return contextualized embeddings for each token which can be passed into minimal task-specific neural architectures. Leveraging the success of unsupervised pretraining has become especially important when task-specific annotations are difficult to obtain, as in scientific NLP. Yet while both BERT and ELMo have released pretrained models, they are still trained on general-domain corpora such as news articles and Wikipedia.

In this work, we make the following contributions:

(i) We release SCIBERT, a new resource demonstrated to improve performance on a range of NLP tasks in the scientific domain. SCIBERT is a pretrained language model based on BERT but trained on a large corpus of scientific text.

(ii) We perform extensive experimentation to investigate the performance of finetuning versus task-specific architectures atop frozen embeddings, and the effect of having an in-domain vocabulary.

(iii) We evaluate SCIBERT on a suite of tasks in the scientific domain, and achieve new state-of-the-art (SOTA) results on many of these tasks.

2 Methods

Background  The BERT model architecture (Devlin et al., 2019) is based on a multilayer bidirectional Transformer (Vaswani et al., 2017). Instead of the traditional left-to-right language modeling objective, BERT is trained on two tasks: predicting randomly masked tokens and predicting whether two sentences follow each other. SCIBERT follows the same architecture as BERT but is instead pretrained on scientific text.

Vocabulary  BERT uses WordPiece (Wu et al., 2016) for unsupervised tokenization of the input text. The vocabulary is built such that it contains the most frequently used words or subword units. We refer to the original vocabulary released with BERT as BASEVOCAB.

We construct SCIVOCAB, a new WordPiece vocabulary on our scientific corpus using the SentencePiece library (https://github.com/google/sentencepiece). We produce both cased and uncased vocabularies and set the vocabulary size to 30K to match the size of BASEVOCAB. The resulting token overlap between BASEVOCAB and SCIVOCAB is 42%, illustrating a substantial difference in frequently used words between scientific and general-domain texts.
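As a rough check on this comparison, the overlap between two WordPiece vocabularies can be computed directly from the vocabulary files. The sketch below is illustrative and not part of the released code: the file names are placeholders for the BASEVOCAB and SCIVOCAB vocabulary files (one token per line), and Jaccard overlap is used here because the paper does not state the exact overlap metric.

```python
# Minimal sketch: estimate token overlap between two WordPiece vocabularies.
# Assumes each vocabulary is a plain-text file with one token per line
# (the format of BERT's vocab.txt); the file names below are placeholders.

def load_vocab(path):
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n") for line in f if line.strip()}

base_vocab = load_vocab("bert_base_vocab.txt")  # BASEVOCAB (placeholder path)
sci_vocab = load_vocab("scivocab.txt")          # SCIVOCAB (placeholder path)

shared = base_vocab & sci_vocab
overlap = len(shared) / len(base_vocab | sci_vocab)  # Jaccard overlap (assumed metric)
print(f"shared tokens: {len(shared)}, overlap: {overlap:.1%}")
```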
Corpus  We train SCIBERT on a random sample of 1.14M papers from Semantic Scholar (Ammar et al., 2018). This corpus consists of 18% papers from the computer science domain and 82% from the broad biomedical domain. We use the full text of the papers, not just the abstracts. The average paper length is 154 sentences (2,769 tokens), resulting in a corpus size of 3.17B tokens, similar to the 3.3B tokens on which BERT was trained. We split sentences using ScispaCy (Neumann et al., 2019; https://github.com/allenai/SciSpaCy), which is optimized for scientific text.

3 Experimental Setup

3.1 Tasks

We experiment on the following core NLP tasks:

1. Named Entity Recognition (NER)
2. PICO Extraction (PICO)
3. Text Classification (CLS)
4. Relation Classification (REL)
5. Dependency Parsing (DEP)

PICO, like NER, is a sequence labeling task where the model extracts spans describing the Participants, Interventions, Comparisons, and Outcomes in a clinical trial paper (Kim et al., 2011). REL is a special case of text classification where the model predicts the type of relation expressed between two entities, which are encapsulated in the sentence by inserted special tokens.

3.2 Datasets

For brevity, we only describe the newer datasets here, and refer the reader to the references in Table 1 for the older datasets. EBM-NLP (Nye et al., 2018) annotates PICO spans in clinical trial abstracts. SciERC (Luan et al., 2018) annotates entities and relations from computer science abstracts. ACL-ARC (Jurgens et al., 2018) and SciCite (Cohan et al., 2019) assign intent labels (e.g. Comparison, Extension, etc.) to sentences from scientific papers that cite other papers. The Paper Field dataset is built from the Microsoft Academic Graph (Sinha et al., 2015; https://academic.microsoft.com/) and maps paper titles to one of 7 fields of study. Each field of study (i.e. geography, politics, economics, business, sociology, medicine, and psychology) has approximately 12K training examples.

3.3 Pretrained BERT Variants

BERT-Base  We use the pretrained weights for BERT-Base (Devlin et al., 2019) released with the original BERT code (https://github.com/google-research/bert). The vocabulary is BASEVOCAB. We evaluate both cased and uncased versions of this model.

SCIBERT  We use the original BERT code to train SCIBERT on our corpus with the same configuration and size as BERT-Base. We train 4 different versions of SCIBERT: (i) cased or uncased, and (ii) BASEVOCAB or SCIVOCAB. The two models that use BASEVOCAB are finetuned from the corresponding BERT-Base models. The other two models that use the new SCIVOCAB are trained from scratch.

Pretraining BERT for long sentences can be slow. Following the original BERT code, we set a maximum sentence length of 128 tokens, and train the model until the training loss stops decreasing. We then continue training the model allowing sentence lengths up to 512 tokens.

We use a single TPU v3 with 8 cores. Training the SCIVOCAB models from scratch on our corpus takes 1 week (5 days with max length 128, then 2 days with max length 512); for comparison, BERT's largest model was trained on 16 Cloud TPUs for 4 days, and would be expected to take 40-70 days (Dettmers, 2019) on an 8-GPU machine. The BASEVOCAB models take 2 fewer days of training because they aren't trained from scratch.

All pretrained BERT models are converted to be compatible with PyTorch using the pytorch-transformers library (https://github.com/huggingface/pytorch-transformers). All our models (Sections 3.4 and 3.5) are implemented in PyTorch using AllenNLP (Gardner et al., 2017).
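As an illustration of using the converted checkpoints, the sketch below loads a SCIBERT model and tokenizer in PyTorch. It uses the transformers library (the successor to pytorch-transformers), and the model identifier allenai/scibert_scivocab_uncased is an assumption here (the checkpoint later published on the Hugging Face hub); a local directory of converted weights can be passed instead.

```python
# Minimal sketch: load a converted SCIBERT checkpoint in PyTorch.
# The model identifier below is an assumption (the hub checkpoint), not part of
# the paper; a local path to converted weights works the same way.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentence = "The patients were randomized to receive either aspirin or placebo."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.last_hidden_state holds one contextual vector per WordPiece token.
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, num_tokens, 768])
```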
Casing  We follow Devlin et al. (2019) in using the cased models for NER and the uncased models for all other tasks. We also use the cased models for parsing. Some light experimentation showed that the uncased models perform slightly better (even sometimes on NER) than cased models.

3.4 Finetuning BERT

We mostly follow the same architecture, optimization, and hyperparameter choices used in Devlin et al. (2019). For text classification (i.e. CLS and REL), we feed the final BERT vector for the [CLS] token into a linear classification layer. For sequence labeling (i.e. NER and PICO), we feed the final BERT vector for each token into a linear classification layer with softmax output. We differ slightly in using an additional conditional random field, which made evaluation easier by guaranteeing well-formed entities. For DEP, we use the model from Dozat and Manning (2017) with dependency tag and arc embeddings of size 100 and biaffine matrix attention over BERT vectors instead of static word embeddings.

3.5 Frozen BERT Embeddings

For DEP, we use the full model from Dozat and Manning (2017) with dependency tag and arc embeddings of size 100 and the same BiLSTM setup as other tasks. We did not find changing the depth or size of the BiLSTMs to significantly impact results (Reimers and Gurevych, 2017).

We optimize cross entropy loss using Adam, but holding BERT weights frozen and applying a dropout of 0.5. We train with early stopping on the development set (patience of 10) using a batch size of 32 and a learning rate of 0.001.

We did not perform extensive hyperparameter search, but while optimal hyperparameters are going to be task-dependent, some light experimentation showed these settings work fairly well across most tasks and BERT variants.

4 Results

Table 1 summarizes the experimental results. We observe that SCIBERT outperforms BERT-Base on scientific tasks (+2.11 F1 with finetuning and +2.43 F1 without).
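To make the finetuning heads of Section 3.4 concrete, the following sketch shows the two cases described there: the final [CLS] vector feeding a linear classifier for CLS/REL, and per-token vectors feeding a linear layer for NER/PICO. It is an illustrative reconstruction, not the authors' released code; the additional conditional random field used for sequence labeling is omitted for brevity.

```python
# Illustrative sketch of the finetuning heads in Section 3.4 (not the released code).
import torch.nn as nn
from transformers import AutoModel

class BertTaskHead(nn.Module):
    def __init__(self, model_name, num_labels, token_level=False):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.classifier = nn.Linear(hidden, num_labels)
        self.token_level = token_level  # True for NER/PICO, False for CLS/REL

    def forward(self, input_ids, attention_mask):
        hidden_states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                               # (batch, seq_len, hidden)
        if self.token_level:
            return self.classifier(hidden_states)         # one label per token
        return self.classifier(hidden_states[:, 0])       # final [CLS] vector
```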
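Similarly, the frozen-embedding setup of Section 3.5 can be sketched as follows: BERT weights are held frozen, a BiLSTM runs over the contextual vectors, and only the BiLSTM and output layer are trained with Adam (learning rate 0.001) and dropout 0.5, with batch size 32 and early stopping (patience 10). The BiLSTM size and depth and the number of tags are placeholders, and the data pipeline and training loop are omitted; this is a sketch of the configuration, not the AllenNLP implementation used in the paper.

```python
# Illustrative sketch of the frozen-embedding setup in Section 3.5 (not the paper's
# AllenNLP code). BERT weights are frozen; only the BiLSTM and tag layer are trained.
import torch
import torch.nn as nn
from transformers import AutoModel

class FrozenBertBiLstmTagger(nn.Module):
    def __init__(self, model_name, num_tags, lstm_size=200, lstm_layers=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.encoder.requires_grad_(False)        # hold BERT weights frozen
        hidden = self.encoder.config.hidden_size
        # BiLSTM size/depth are placeholders; the paper found results insensitive to them.
        self.bilstm = nn.LSTM(hidden, lstm_size, num_layers=lstm_layers,
                              batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.5)            # dropout of 0.5, as in Section 3.5
        self.tag_layer = nn.Linear(2 * lstm_size, num_tags)

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():                     # frozen contextual embeddings
            embeddings = self.encoder(
                input_ids=input_ids, attention_mask=attention_mask
            ).last_hidden_state
        lstm_out, _ = self.bilstm(embeddings)
        return self.tag_layer(self.dropout(lstm_out))

# num_tags is a placeholder; training uses cross entropy loss, batch size 32, and
# early stopping on the development set with patience 10.
model = FrozenBertBiLstmTagger("allenai/scibert_scivocab_uncased", num_tags=10)
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
loss_fn = nn.CrossEntropyLoss()
```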