ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing

Mark Neumann, Daniel King, Iz Beltagy, Waleed Ammar
Allen Institute for Artificial Intelligence, Seattle, WA, USA
{markn,daniel,beltagy,waleeda}@allenai.org

Abstract

Despite recent advances in natural language processing, many statistical models for processing text perform extremely poorly under domain shift. Processing biomedical and clinical text is a critically important application area of natural language processing, for which there are few robust, practical, publicly available models. This paper describes scispaCy, a new Python library and models for practical biomedical/scientific text processing, which heavily leverages the spaCy library. We detail the performance of two packages of models released in scispaCy and demonstrate their robustness on several tasks and datasets. Models and code are available at https://allenai.github.io/scispacy/.

Figure 1: Growth of the annual number of cited references from 1650 to 2012 in the medical and health sciences (citing publications from 1980 to 2012). Figure from (Bornmann and Mutz, 2014).

1 Introduction

The publication rate in the medical and biomedical sciences is growing at an exponential rate (Bornmann and Mutz, 2014). The information overload problem is widespread across academia, but is particularly apparent in the biomedical sciences, where individual papers may contain specific discoveries relating to a dizzying variety of genes, drugs, and proteins. In order to cope with the sheer volume of new scientific knowledge, there have been many attempts to automate the process of extracting entities, relations, protein interactions and other structured knowledge from scientific papers (Wei et al., 2016; Ammar et al., 2018; Poon et al., 2014).

Although there exists a wealth of tools for processing biomedical text, many focus primarily on named entity recognition and disambiguation. MetaMap and MetaMapLite (Aronson, 2001; Demner-Fushman et al., 2017), the two most widely used and supported tools for biomedical text processing, support entity linking with negation detection and acronym resolution. However, tools which cover more classical natural language processing (NLP) tasks, such as the GENIA tagger (Tsuruoka et al., 2005; Tsuruoka and Tsujii, 2005) or the phrase structure parsers presented in McClosky and Charniak (2008), typically do not make use of new research innovations such as word representations or neural networks.

In this paper, we introduce scispaCy, a specialized NLP library for processing biomedical texts which builds on the robust spaCy library [1], and document its performance relative to state of the art models for part of speech (POS) tagging, dependency parsing, named entity recognition (NER) and sentence segmentation. Specifically, we:

• Release a reformatted version of the GENIA 1.0 (Kim et al., 2003) corpus converted into Universal Dependencies v1.0 and aligned with the original text from the PubMed abstracts.

[1] spacy.io

Proceedings of the BioNLP 2019 workshop, pages 319-327, Florence, Italy, August 1, 2019. (c) 2019 Association for Computational Linguistics

• Benchmark 9 named entity recognition models for more specific entity extraction applications, demonstrating competitive performance when compared to strong baselines.

• Release and evaluate two fast and convenient pipelines for biomedical text, which include tokenization, part of speech tagging, dependency parsing and named entity recognition.

Model            Vocab Size  Vector Count  Min Word Freq  Min Doc Freq
en_core_sci_sm       58,338             0             50             5
en_core_sci_md      101,678        98,131             20             5

Table 1: Vocabulary statistics for the two core packages in scispaCy.

Software Package          Abstract (ms)  Sentence (ms)
NLP4J (java)                         19              2
Genia Tagger (c++)                   73              3
Biaffine (TF)                       272             29
Biaffine (TF + 12 CPUs)              72              7
jPTDP (Dynet)                       905             97
Dexter v2.1.0                       208             84
MetaMapLite v3.6.2                  293             89
en_core_sci_sm                       32              4
en_core_sci_md                       33              4

Table 2: Wall clock comparison of different publicly available biomedical NLP pipelines. All experiments run on a single machine with 12 Intel(R) Core(TM) i7-6850K CPU @ 3.60GHz and 62GB RAM. For the Biaffine Parser, a pre-compiled Tensorflow binary with support for AVX2 instructions was used in a good faith attempt to optimize the implementation. Dynet does support the Intel MKL, but requires compilation from scratch and as such, does not represent an "off the shelf" system. TF is short for Tensorflow.

2 Overview of (sci)spaCy

In this section, we briefly describe the models used in the spaCy library and describe how we build on them in scispaCy.

spaCy. The Python-based spaCy library (Honnibal and Montani, 2017) [2] provides a variety of practical tools for text processing in multiple languages. Their models have emerged as the de facto standard for practical NLP due to their speed, robustness and close to state of the art performance. As the spaCy models are popular and the spaCy API is widely known to many potential users, we choose to build upon the spaCy library for creating a biomedical text processing pipeline.

scispaCy. Our goal is to develop scispaCy as a robust, efficient and performant NLP library to satisfy the primary text processing needs in the biomedical domain. In this release of scispaCy, we retrain spaCy [3] models for POS tagging, dependency parsing, and NER using datasets relevant to biomedical text, and enhance the tokenization module with additional rules. scispaCy contains two core released packages: en_core_sci_sm and en_core_sci_md. Models in the en_core_sci_md package have a larger vocabulary and include word vectors, while those in en_core_sci_sm have a smaller vocabulary and do not include word vectors, as shown in Table 1.

Processing Speed. To emphasize the efficiency and practical utility of the end-to-end pipeline provided by scispaCy packages, we perform a speed comparison with several other publicly available processing pipelines for biomedical text using 10k randomly selected PubMed abstracts. We report results with and without segmenting the abstracts into sentences, since some of the libraries (e.g., the GENIA tagger) are designed to operate on sentences.

As shown in Table 2, both models released in scispaCy demonstrate competitive speed relative to pipelines written in C++ and Java, languages designed for production settings. Whilst scispaCy is not as fast as pipelines designed for purely production use-cases (e.g., NLP4J), it has the benefit of straightforward integration with the large ecosystem of Python libraries for machine learning and text processing. Although the comparison in Table 2 is not an apples to apples comparison with other frameworks (different tasks, implementation languages, etc.), it is useful for understanding scispaCy's runtime in the context of other pipeline components. Running scispaCy models in addition to standard entity linking software such as MetaMap would result in only a marginal increase in overall runtime.

In the following section, we describe the POS taggers and dependency parsers in scispaCy.

[2] Source code at https://github.com/explosion/spaCy
[3] scispaCy models are based on spaCy version 2.0.18
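The timings in Table 2 are mean wall-clock processing times per document. A measurement harness of the kind behind such numbers can be sketched as follows; `ms_per_doc` and the whitespace-splitting stand-in pipeline are illustrative, not code from the paper:

```python
import time

def ms_per_doc(pipeline, docs):
    """Mean wall-clock milliseconds per document for `pipeline`.

    `pipeline` is any callable taking a string; here a whitespace
    tokenizer stands in for a loaded NLP model.
    """
    start = time.perf_counter()
    for doc in docs:
        pipeline(doc)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(docs)

# Toy stand-in corpus and pipeline.
abstracts = ["Protein kinase C is activated by calcium."] * 100
rate = ms_per_doc(str.split, abstracts)
print(f"{rate:.4f} ms/doc")
```

A real comparison should additionally fix the hardware, warm each pipeline up before timing, and batch documents the way each library expects, as the Table 2 caption describes.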

3 POS Tagging and Dependency Parsing

The joint POS tagging and dependency parsing model in spaCy is an arc-eager transition-based parser trained with a dynamic oracle, similar to Goldberg and Nivre (2012). Features are CNN representations of token features and shared across all pipeline models (Kiperwasser and Goldberg, 2016; Zhang and Weiss, 2016). Next, we describe the data we used to train it in scispaCy.

3.1 Datasets

GENIA 1.0 Dependencies. To train the dependency parser and part of speech tagger in both released models, we convert the treebank of McClosky and Charniak (2008) [4], which is based on the GENIA 1.0 corpus (Kim et al., 2003), to Universal Dependencies v1.0 using the Stanford Dependency Converter (Schuster and Manning, 2016). As this dataset has POS tags annotated, we use it to train the POS tagger jointly with the dependency parser in both released models.

As we believe the Universal Dependencies converted from the original GENIA 1.0 corpus are generally useful, we have released them as a separate contribution of this paper [5]. In this data release, we also align the converted dependency parses to their original text spans in the raw, untokenized abstracts from the original release [6], and include the PubMed metadata for the abstracts which was discarded in the GENIA corpus released by McClosky and Charniak (2008). We hope that this raw format can emerge as a resource for practical evaluation in the biomedical domain of core NLP tasks such as tokenization, sentence segmentation and joint models of syntax.

Finally, we also retrieve from PubMed the original metadata associated with each abstract. This includes relevant named entities linked to their Medical Subject Headings (MeSH terms), chemicals and drugs linked to a variety of ontologies, as well as author metadata, publication dates, citation statistics and journal metadata. We hope that the community can find interesting problems for which such natural supervision can be used.

OntoNotes 5.0. To increase the robustness of the dependency parser and POS tagger to generic text, we make use of the OntoNotes 5.0 corpus [7] when training the dependency parser and part of speech tagger (Weischedel et al., 2011; Hovy et al., 2006). The OntoNotes corpus consists of multiple genres of text, annotated with syntactic and semantic information, but we only use POS and dependency parsing annotations in this work.

3.2 Experiments

We compare our models to the recent survey study of dependency parsing and POS tagging for biomedical data (Nguyen and Verspoor, 2018) in Tables 3 and 4. POS tagging results show that both models released in scispaCy are competitive with state of the art systems, and can be considered of equivalent practical value.

Package/Model           GENIA
MarMoT                  98.61
jPTDP-v1                98.66
NLP4J-POS               98.80
BiLSTM-CRF              98.44
BiLSTM-CRF-charcnn      98.89
BiLSTM-CRF-charlstm     98.85
en_core_sci_sm          98.38
en_core_sci_md          98.51

Table 3: Part of Speech tagging results on the GENIA test set.

Package/Model                     UAS    LAS
Stanford-NNdep                  89.02  87.56
NLP4J-dep                       90.25  88.87
jPTDP-v1                        91.89  90.27
Stanford-Biaffine-v2            92.64  91.23
Stanford-Biaffine-v2 (Gold POS) 92.84  91.92
en_core_sci_sm-SD               90.31  88.65
en_core_sci_md-SD               90.66  88.98
en_core_sci_sm                  89.69  87.67
en_core_sci_md                  90.60  88.79

Table 4: Dependency Parsing results on the GENIA 1.0 corpus converted to dependencies using the Stanford Universal Dependency Converter. We additionally provide evaluations using Stanford Dependencies (SD) for comparison with the results reported in (Nguyen and Verspoor, 2018).

[4] https://nlp.stanford.edu/~mcclosky/biomedical.html
[5] https://github.com/allenai/genia-dependency-trees
[6] Available at http://www.geniaproject.org/
[7] Instructions for download at http://cemantix.org/data/ontonotes.html
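The UAS and LAS columns in Table 4 are the standard attachment scores: the percentage of tokens whose predicted head is correct (unlabeled), and whose head and dependency label are both correct (labeled). A minimal sketch of the metric (illustrative code, not scispaCy's evaluation script):

```python
def uas_las(gold, pred):
    """Unlabeled/labeled attachment scores over parsed sentences.

    Each sentence is a list of (head_index, dependency_label) pairs,
    one per token. Returns percentages, as reported in Table 4.
    """
    total = unlabeled = labeled = 0
    for gold_sent, pred_sent in zip(gold, pred):
        for (g_head, g_dep), (p_head, p_dep) in zip(gold_sent, pred_sent):
            total += 1
            if g_head == p_head:
                unlabeled += 1
                if g_dep == p_dep:
                    labeled += 1
    return 100.0 * unlabeled / total, 100.0 * labeled / total

gold = [[(2, "amod"), (2, "compound"), (0, "root")]]
pred = [[(2, "amod"), (3, "compound"), (0, "root")]]
uas, las = uas_las(gold, pred)
print(f"UAS {uas:.1f}, LAS {las:.1f}")  # one wrong head out of three tokens
```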

In the case of dependency parsing, we find that the Biaffine parser of Dozat and Manning (2016) outperforms the scispaCy models by a margin of 2-3%. However, as demonstrated in Table 2, the scispaCy models are approximately 9x faster due to the speed optimizations in spaCy [8].

[8] We refer the interested reader to Nguyen and Verspoor (2018) for a comprehensive description of the model architectures considered in this evaluation.

Robustness to Web Data. A core principle of the scispaCy models is that they are useful on a wide variety of types of text with a biomedical focus, such as clinical notes, academic papers, clinical trials reports and medical records. In order to make our models robust across a wider range of domains more generally, we experiment with incorporating training data from the OntoNotes 5.0 corpus when training the dependency parser and POS tagger. Figure 2 demonstrates the effectiveness of adding increasing percentages of web data, showing substantially improved performance on OntoNotes at no reduction in performance on biomedical text. Note that mixing in web text during training has been applied to previous systems - the GENIA Tagger (Tsuruoka et al., 2005) also employs this technique.

Figure 2: Unlabeled attachment score (UAS) performance for an en_core_sci_md model trained with increasing amounts of web data incorporated. Figure shows the mean of 3 random seeds.

4 Named Entity Recognition

The NER model in spaCy is a transition-based system based on the chunking model from Lample et al. (2016). Tokens are represented as hashed, embedded representations of the prefix, suffix, shape and lemmatized features of individual words. Next, we describe the data we used to train NER models in scispaCy.

4.1 Datasets

The main NER model in both released packages in scispaCy is trained on the mention spans in the MedMentions dataset (Murty et al., 2018). Since the MedMentions dataset was originally designed for entity linking, this model recognizes a wide variety of entity types, as well as non-standard syntactic phrases such as verbs and modifiers, but the model does not predict the entity type. In order to provide for users with more specific requirements around entity types, we release four additional packages, en_ner_{bc5cdr|craft|jnlpba|bionlp13cg}_md, with finer-grained NER models trained on BC5CDR (for chemicals and diseases; Li et al., 2016), CRAFT (for cell types, chemicals, proteins, genes; Bada et al., 2011), JNLPBA (for cell lines, cell types, DNAs, RNAs, proteins; Collier and Kim, 2004) and BioNLP13CG (for cancer genetics; Pyysalo et al., 2015), respectively.

4.2 Experiments

As NER is a key task for other biomedical text processing tasks, we conduct a thorough evaluation of the suitability of scispaCy to provide baseline performance across a wide variety of datasets. In particular, we retrain the spaCy NER model on each of the four datasets mentioned earlier (BC5CDR, CRAFT, JNLPBA, BioNLP13CG) as well as five more datasets in Crichton et al. (2017): AnatEM, BC2GM, BC4CHEMD, Linnaeus, NCBI-Disease. These datasets cover a wide variety of entity types required by different biomedical domains, including cancer genetics, disease-drug interactions, pathway analysis and trial population extraction. Additionally, they vary considerably in size and number of entities. For example, BC4CHEMD (Krallinger et al., 2015) has 84,310 annotations while Linnaeus (Gerner et al., 2009) only has 4,263. BioNLP13CG (Pyysalo et al., 2015) annotates 16 entity types, while five of the datasets only annotate a single entity type [9].

[9] For a detailed discussion of the datasets and their creation, we refer the reader to https://github.com/cambridgeltl/MTL-Bioinformatics-2016/blob/master/Additional%20file%201.pdf

Table 5 provides a thorough comparison of the scispaCy NER models against a variety of models. In particular, we compare the models to strong baselines which do not consider the use of 1) multi-task learning across multiple datasets and 2) semi-supervised learning via large pretrained language models.
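The NER comparisons in Tables 5-7 use entity-level precision, recall and F1 under exact span matching. A minimal sketch of that metric (illustrative code, not from scispaCy):

```python
def span_prf(gold_spans, pred_spans):
    """Exact-match entity-level precision, recall and F1.

    Spans are (start, end, label) triples; a prediction counts only if
    both boundaries and the label match a gold span exactly.
    """
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [(0, 2, "Chemical"), (5, 7, "Disease")]
pred = [(0, 2, "Chemical"), (5, 6, "Disease")]  # one boundary error
p, r, f = span_prf(gold, pred)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.5 0.5 0.5
```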

Overall, we find that the scispaCy models are competitive baselines for 5 of the 9 datasets.

Additionally, in Table 6 we evaluate the recall of the pipeline mention detector available in both scispaCy models (trained on the MedMentions dataset) against all 9 specialised NER datasets. Overall, we observe a modest drop in average recall when compared directly to the MedMentions results in Table 7, but considering the diverse domains of the 9 specialised NER datasets, achieving this level of recall across datasets is already non-trivial.

Dataset         sci_sm  sci_md
BC5CDR           75.62   78.79
CRAFT            58.28   58.03
JNLPBA           67.33   70.36
BioNLP13CG       58.93   60.25
AnatEM           56.55   57.94
BC2GM            54.87   56.89
BC4CHEMD         60.60   60.75
Linnaeus         67.48   68.61
NCBI-Disease     65.76   65.65
Average          62.81   64.14

Table 6: Recall on the test sets of 9 specialist NER datasets, when the base mention detector is trained on MedMentions. The base mention detector is available in both en_core_sci_sm and en_core_sci_md models.

Model           Precision  Recall     F1
en_core_sci_sm      69.22   67.19  68.19
en_core_sci_md      70.44   67.56  68.97

Table 7: Performance of the base mention detector on the MedMentions corpus.

5 Candidate Generation for Entity Linking

In addition to Named Entity Recognition, scispaCy contains some initial groundwork needed to build an Entity Linking model designed to link to a subset of the Unified Medical Language System (UMLS; Bodenreider, 2004). This reduced subset is comprised of sections 0, 1, 2 and 9 (SNOMED) of the UMLS 2017 AA release, which are publicly distributable. It contains 2.78M unique concepts and covers 99% of the mention concepts present in the MedMentions dataset (Murty et al., 2018).

5.1 Candidate Generation

To generate candidate entities for linking a given mention, we use an approximate nearest neighbours search over our subset of UMLS concepts and concept aliases, and output the entities associated with the nearest K. Concepts and aliases are encoded using the vector of TF-IDF scores of character 3-grams which appear in 10 or more entity names or aliases (i.e., document frequency >= 10). In total, all data associated with the candidate generator, including cached vectors for 2.78M concepts, occupies 1.1GB of space on disk.

Aliases. Canonical concepts in UMLS have aliases - common names of drugs, alternative spellings, and otherwise words or phrases that are often linked to a given concept. Importantly, aliases may be shared across concepts, such as "cancer" for the canonical concepts of both "Lung Cancer" and "Breast Cancer". Since the nearest neighbour search is based on the surface forms, it returns K string values. However, because a given string may be an alias for multiple concepts, the list of K nearest neighbour strings may not translate to a list of K candidate entities. This is the correct implementation in practice, because given a possibly ambiguous alias, it is beneficial to score all plausible concepts, but it does mean that we cannot determine the exact number of candidate entities that will be generated for a given value of K. In practice, the number of retrieved candidates for a given K is much lower than K itself, with the exception of a few long tail aliases which are aliases for a large number of concepts. For example, for K=100, we retrieve 54.26±12.45 candidates, with the max number of candidates for a single mention being 164.

Abbreviations. During development of the candidate generator, we noticed that abbreviated mentions account for a substantial proportion of the failure cases where none of the generated candidates match the correct entity. To partially remedy this, we implement the unsupervised abbreviation detection algorithm of Schwartz and Hearst (2002), substituting mention candidates marked as abbreviations for their long form definitions before searching for their nearest neighbours. Figure 3 demonstrates the improved recall of gold concepts for various values of K nearest neighbours.
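The scheme above (TF-IDF vectors of character 3-grams, with the K nearest alias strings mapped onto their concepts) can be sketched in miniature as follows. This toy version uses exact rather than approximate nearest-neighbour search, a document-frequency cutoff of 1 instead of 10, and made-up concept ids; the class and method names are illustrative, not scispaCy's API:

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class CandidateGenerator:
    """Toy Section 5.1-style generator over alias strings."""

    def __init__(self, alias_to_concepts, min_df=1):
        self.alias_to_concepts = alias_to_concepts
        self.aliases = list(alias_to_concepts)
        df = Counter()
        for alias in self.aliases:
            df.update(set(char_ngrams(alias)))
        # Keep 3-grams above the document-frequency cutoff (>= 10 in the paper).
        self.idf = {g: math.log(len(self.aliases) / c)
                    for g, c in df.items() if c >= min_df}
        self.vectors = [self._vector(a) for a in self.aliases]

    def _vector(self, text):
        tf = Counter(g for g in char_ngrams(text) if g in self.idf)
        vec = {g: c * self.idf[g] for g, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        return {g: w / norm for g, w in vec.items()}

    def candidates(self, mention, k=2):
        q = self._vector(mention)
        scored = sorted(
            ((sum(q.get(g, 0.0) * w for g, w in v.items()), a)
             for a, v in zip(self.aliases, self.vectors)),
            reverse=True)
        concepts = []
        for _, alias in scored[:k]:       # k nearest alias *strings*...
            for cid in self.alias_to_concepts[alias]:
                if cid not in concepts:   # ...may fan out or collapse
                    concepts.append(cid)
        return concepts

gen = CandidateGenerator({
    "lung cancer": ["C0242379"],
    "breast cancer": ["C0006142"],
    "cancer": ["C0242379", "C0006142"],
    "melanoma": ["C0025202"],
})
print(gen.candidates("cancerous", k=2))  # → ['C0242379', 'C0006142']
```

The example shows why K strings need not mean K candidates: the two nearest aliases to "cancerous" are "cancer" and "lung cancer", and the ambiguous alias "cancer" contributes both of its concepts while "lung cancer" contributes a duplicate.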

Dataset                                Baseline    SOTA  + Resources  sci_sm  sci_md
BC5CDR (Li et al., 2016)                  83.87  86.92b      89.69bb   78.83   83.92
CRAFT (Bada et al., 2011)                 79.55       -            -   72.31   76.17
JNLPBA (Collier and Kim, 2004)            68.95  73.48b      75.50bb   71.78   73.21
BioNLP13CG (Pyysalo et al., 2015)         76.74       -            -   72.98   77.60
AnatEM (Pyysalo and Ananiadou, 2014)      88.55  91.61**           -   80.13   84.14
BC2GM (Smith et al., 2008)                84.41  80.51b      81.69bb   75.77   78.30
BC4CHEMD (Krallinger et al., 2015)        82.32  88.75a      89.37aa   82.24   84.55
Linnaeus (Gerner et al., 2009)            79.33  95.68**           -   79.20   81.74
NCBI-Disease (Dogan et al., 2014)         77.82  85.80b      87.34bb   79.50   81.65

b: LSTM model from Sachan et al. (2017)      bb: LM model from Sachan et al. (2017)
a: Single task model from Wang et al. (2018)  aa: Multi-task model from Wang et al. (2018)
**: Evaluations use dictionaries developed without a clear train/test split.

Table 5: Test F1 measure on NER for the small and medium scispaCy models compared to a variety of strong baselines and state of the art models. The Baseline and SOTA (state of the art) columns include only single models which do not use additional resources, such as language models, or additional sources of supervision, such as multi-task learning. + Resources allows any type of supervision or pretraining. All scispaCy results are the mean of 5 random seeds.
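The abbreviation step in Section 5.1 uses the matcher of Schwartz and Hearst (2002): for each "long form (short form)" pattern, scan the candidate long form right to left so that every short-form character is matched in order, with the first character required to start a word. A compact sketch (illustrative; scispaCy's implementation handles more cases than this):

```python
import re

def find_abbreviations(text):
    """Map short forms to long forms for "long form (SF)" patterns."""
    pairs = {}
    for match in re.finditer(r"\(([A-Za-z][\w-]*)\)", text):
        short = match.group(1)
        # Candidate long form: up to min(|SF|+5, |SF|*2) preceding words.
        words = text[:match.start()].rstrip().split()
        window = words[-min(len(short) + 5, len(short) * 2):]
        long_form = _best_long_form(short, " ".join(window))
        if long_form:
            pairs[short] = long_form
    return pairs

def _best_long_form(short, candidate):
    first = 0
    s, c = len(short) - 1, len(candidate) - 1
    while s >= 0:
        ch = short[s].lower()
        if not ch.isalnum():
            s -= 1
            continue
        # Walk left until this short-form char matches; the first char
        # must additionally start a word in the long form.
        while c >= 0 and (candidate[c].lower() != ch or
                          (s == 0 and c > 0 and candidate[c - 1].isalnum())):
            c -= 1
        if c < 0:
            return None
        first = c
        s -= 1
        c -= 1
    return candidate[first:]

text = ("Spinal and bulbar muscular atrophy (SBMA) is an inherited "
        "motor neuron disease caused by the expansion of a polyglutamine "
        "tract within the androgen receptor (AR).")
print(find_abbreviations(text))
```

Run on this example (taken from the Schwartz and Hearst paper), the matcher recovers both SBMA and AR with their long forms, which the candidate generator can then substitute before the nearest-neighbour lookup.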

Figure 3: Gold candidate generation recall for different values of K. Note that K refers to the number of nearest neighbour queries, and not the number of considered candidates. Murty et al. (2018) do not report this distinction, but for a given K the same amount of work is done (retrieving K neighbours from the index), so results are comparable. For all K, the actual number of candidates is considerably lower on average.

Our candidate generator provides a 5% absolute improvement over Murty et al. (2018) despite generating 46% fewer candidates per mention on average.

6 Sentence Segmentation and Citation Handling

Accurate sentence segmentation is required for many practical applications of natural language processing. Biomedical data presents many difficulties for standard sentence segmentation algorithms: abbreviated names and noun compounds containing punctuation are more common, whilst the wide range of citation styles can easily be misidentified as sentence boundaries.

We evaluate sentence segmentation using both sentence and full-abstract accuracy when segmenting PubMed abstracts from the raw, untokenized GENIA development set (the Sent/Abstract columns in Table 8).

Additionally, we examine the ability of the segmentation learned by our model to generalise to the body text of PubMed articles. Body text is typically more complex than abstract text, but in particular, it contains citations, which are considerably less frequent in abstract text. In order to examine the effectiveness of our models in this scenario, we design the following synthetic experiment. Given sentences from Cohan et al. (2019) which were originally designed for citation intent prediction, we run these sentences individually through our models. As we know that these sentences should be single sentences, we can simply count the frequency with which our models segment the individual sentences containing citations into multiple sentences (the Citation column in Table 8).

Model                 Sent   Abstract  Citation
web-small             88.2%     67.5%     74.4%
web-small + ct        86.6%     62.1%     88.6%
web-small + cs        91.9%     77.0%     87.5%
web-small + cs + ct   92.1%     78.3%     94.7%
sci-small + ct        97.2%     81.7%     97.9%
sci-small + cs + ct   97.2%     81.7%     98.0%
sci-med + ct          97.3%     81.7%     98.0%
sci-med + cs + ct     97.4%     81.7%     98.0%

Table 8: Sentence segmentation performance for the core spaCy and scispaCy models. cs = custom rule based sentence segmenter and ct = custom rule based tokenizer, both designed explicitly to handle citations and common patterns in biomedical text.

As demonstrated by Table 8, training the dependency parser on in-domain data (both the scispaCy models) completely obviates the need for rule-based sentence segmentation. This is a positive result - rule based sentence segmentation is a brittle, time consuming process, which we have replaced with a domain specific version of an existing pipeline component.

Both scispaCy models are released with the custom tokeniser, but without a custom sentence segmenter by default.

7 Related Work

Apache cTAKES (Savova et al., 2010) was designed specifically for clinical notes rather than the broader biomedical domain. MetaMap and MetaMapLite (Aronson, 2001; Demner-Fushman et al., 2017) from the National Library of Medicine focus specifically on entity linking using the Unified Medical Language System (UMLS; Bodenreider, 2004) as a knowledge base. Buyko et al. adapt Apache OpenNLP using the GENIA corpus, but their system is not openly available and is less suitable for modern, Python-based workflows. The GENIA Tagger (Tsuruoka et al., 2005) provides the closest comparison to scispaCy due to its multi-stage pipeline, integrated research contributions and production quality runtime. We improve on the GENIA Tagger by adding a full dependency parser rather than just noun chunking, as well as improved results for NER, without compromising significantly on speed.

In more fundamental NLP research, the GENIA corpus (Kim et al., 2003) has been widely used to evaluate transfer learning and domain adaptation. McClosky et al. (2006) demonstrate the effectiveness of self-training and parse re-ranking for domain adaptation. Rimell and Clark (2008) adapt a CCG parser using only POS and lexical categories, while Joshi et al. (2018) extend a neural phrase structure parser trained on web text to the biomedical domain with a small number of partially annotated examples. These papers focus mainly on the problem of domain adaptation itself, rather than the objective of obtaining a robust, high-performance parser using existing resources.

NLP techniques, and in particular distant supervision, have been employed to assist the curation of large, structured biomedical resources. Poon et al. (2015) extract 1.5 million cancer pathway interactions from PubMed abstracts, leading to the development of Literome (Poon et al., 2014), a knowledge base for genic pathway interactions and genotype-phenotype interactions. A fundamental aspect of Valenzuela-Escarcega et al. (2018) and Poon et al. (2014) is the use of hand-written rules and triggers for events based on dependency tree paths; the connection to the application of scispaCy is quite apparent.

8 Conclusion

In this paper we presented several robust model pipelines for a variety of natural language processing tasks focused on biomedical text. The scispaCy models are fast, easy to use, scalable, and achieve close to state of the art performance. We hope that the release of these models enables new applications in biomedical natural language processing, whilst making it easy to leverage high quality syntactic annotation for downstream tasks. Additionally, we released a reformatted GENIA 1.0 corpus augmented with automatically produced Universal Dependency annotations and recovered and aligned original abstract metadata. Future work on scispaCy will include a more fully featured entity linker built from the current candidate generation work, as well as other pipeline components such as negation detection, commonly used in the clinical and biomedical natural language processing communities.

References

Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, Iz Beltagy, Miles Crawford, Doug Downey, Jason Dunkelberger, Ahmed Elgohary, Sergey Feldman, Vu Ha, Rodney Kinney, Sebastian Kohlmeier, Kyle Lo, Tyler Murray, Hsu-Han Ooi, Matthew E. Peters, Joanna Power, Sam Skjonsberg, Lucy Lu Wang, Chris Wilhelm, Zheng Yuan, Madeleine van Zuylen, and Oren Etzioni. 2018. Construction of the literature graph in Semantic Scholar. In NAACL-HLT.

Alan R. Aronson. 2001. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proceedings, AMIA Symposium, pages 17-21.

Michael Bada, Miriam Eckert, Donald Evans, Kristin Garcia, Krista Shipley, Dmitry Sitnikov, William A. Baumgartner, K. Bretonnel Cohen, Karin M. Verspoor, Judith A. Blake, and Lawrence Hunter. 2011. Concept annotation in the CRAFT corpus. In BMC Bioinformatics.

Olivier Bodenreider. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32 Database issue:D267-70.

Lutz Bornmann and Rüdiger Mutz. 2014. Growth rates of modern science: A bibliometric analysis. CoRR, abs/1402.4578.

Ekaterina Buyko, Joachim Wermter, Michael Poprat, and Udo Hahn. Automatically adapting an NLP core engine to the biology domain. In Proceedings of the ISMB 2006 Joint Linking Literature, Information and Knowledge for Biology and the 9th Bio-Ontologies Meeting.

Arman Cohan, Waleed Ammar, Madeleine van Zuylen, and Field Cady. 2019. Structural scaffolds for citation intent classification in scientific publications. CoRR, abs/1904.01608.

Nigel Collier and Jin-Dong Kim. 2004. Introduction to the bio-entity recognition task at JNLPBA. In NLPBA/BioNLP.

Gamal K. O. Crichton, Sampo Pyysalo, Billy Chiu, and Anna Korhonen. 2017. A neural network multi-task learning approach to biomedical named entity recognition. In BMC Bioinformatics.

Dina Demner-Fushman, Willie J. Rogers, and Alan R. Aronson. 2017. MetaMap Lite: an evaluation of a new Java implementation of MetaMap. Journal of the American Medical Informatics Association: JAMIA, 24(4):841-844.

Rezarta Islamaj Dogan, Robert Leaman, and Zhiyong Lu. 2014. NCBI disease corpus: A resource for disease name recognition and concept normalization. Journal of Biomedical Informatics, 47:1-10.

Timothy Dozat and Christopher D. Manning. 2016. Deep biaffine attention for neural dependency parsing. CoRR, abs/1611.01734.

Martin Gerner, Goran Nenadic, and Casey M. Bergman. 2009. LINNAEUS: A species name identification system for biomedical literature. In BMC Bioinformatics.

Yoav Goldberg and Joakim Nivre. 2012. A dynamic oracle for arc-eager dependency parsing. In Coling 2012.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Eduard H. Hovy, Mitchell P. Marcus, Martha Palmer, Lance A. Ramshaw, and Ralph M. Weischedel. 2006. OntoNotes: The 90% solution. In HLT-NAACL.

Vidur Joshi, Matthew Peters, and Mark Hopkins. 2018. Extending a parser to distant domains using a few dozen partially annotated examples. In ACL.

Jin-Dong Kim, Tomoko Ohta, Yuka Tateisi, and Jun'ichi Tsujii. 2003. GENIA corpus - a semantically annotated corpus for bio-textmining. Bioinformatics, 19 Suppl 1:i180-2.

Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics, 4:313-327.

Martin Krallinger, Florian Leitner, Obdulia Rabal, Miguel Vazquez, Julen Oyarzábal, and Alfonso Valencia. 2015. CHEMDNER: The drugs and chemical names extraction challenge. In J. Cheminformatics.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In HLT-NAACL.

Jiao Li, Yueping Sun, Robin J. Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J. Mattingly, Thomas C. Wiegers, and Zhiyong Lu. 2016. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database: The Journal of Biological Databases and Curation, 2016.

David McClosky and Eugene Charniak. 2008. Self-training for biomedical parsing. In ACL.

David McClosky, Eugene Charniak, and Mark Johnson. 2006. Reranking and self-training for parser adaptation. In ACL.

Shikhar Murty, Patrick Verga, Luke Vilnis, Irena Radovanovic, and Andrew McCallum. 2018. Hierarchical losses and new resources for fine-grained entity typing and linking. In ACL.

Dat Quoc Nguyen and Karin Verspoor. 2018. From POS tagging to dependency parsing for biomedical event extraction. arXiv preprint arXiv:1808.03731.

Hoifung Poon, Chris Quirk, Charlie DeZiel, and David Heckerman. 2014. Literome: PubMed-scale genomic knowledge base in the cloud. Bioinformatics, 30(19):2840-2.

Hoifung Poon, Kristina Toutanova, and Chris Quirk. 2015. Distant supervision for cancer pathway extraction from text. Pacific Symposium on Biocomputing, pages 120-31.

Sampo Pyysalo and Sophia Ananiadou. 2014. Anatomical entity mention recognition at literature scale. In Bioinformatics.

Sampo Pyysalo, Tomoko Ohta, Rafal Rak, Andrew Rowley, Hong-Woo Chun, Sung-Jae Jung, Sung-Pil Choi, Jun'ichi Tsujii, and Sophia Ananiadou. 2015. Overview of the cancer genetics and pathway curation tasks of BioNLP shared task 2013. In BMC Bioinformatics.

Laura Rimell and Stephen Clark. 2008. Adapting a lexicalized-grammar parser to contrasting domains. In EMNLP.

Devendra Singh Sachan, Pengtao Xie, Mrinmaya Sachan, and Eric P. Xing. 2017. Effective use of bidirectional language modeling for transfer learning in biomedical named entity recognition. In MLHC.

Guergana K. Savova, James J. Masanz, Philip V. Ogren, Jiaping Zheng, Sunghwan Sohn, Karin Kipper Schuler, and Christopher G. Chute. 2010. Mayo clinical text analysis and knowledge extraction system (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association: JAMIA, 17(5):507-13.

Sebastian Schuster and Christopher D. Manning. 2016. Enhanced English Universal Dependencies: An improved representation for natural language understanding tasks. In LREC.

Ariel S. Schwartz and Marti A. Hearst. 2002. A simple algorithm for identifying abbreviation definitions in biomedical text. Pacific Symposium on Biocomputing, pages 451-62.

Larry Smith, Lorraine K. Tanabe, Rie Johnson née Ando, Cheng-Ju Kuo, I-Fang Chung, Chun-Nan Hsu, Y. Lin, Roman Klinger, Christoph M. Friedrich, Kuzman Ganchev, Manabu Torii, Hongfang Liu, Barry Haddow, Craig A. Struble, Richard J. Povinelli, Andreas Vlachos, William A. Baumgartner, Lawrence E. Hunter, Bob Carpenter, Richard Tzong-Han Tsai, Hong-Jie Dai, Feng Liu, Yifei Chen, Chengjie Sun, Sophia Katrenko, Pieter Adriaans, Christian Blaschke, Rafael Torres, Mariana Neves, Preslav Nakov, Anna Divoli, Manuel Maña-López, Jacinto Mata, and W. John Wilbur. 2008. Overview of BioCreative II gene mention recognition. Genome Biology, 9:S2.

Yoshimasa Tsuruoka, Yuka Tateishi, Jin-Dong Kim, Tomoko Ohta, John McNaught, Sophia Ananiadou, and Jun'ichi Tsujii. 2005. Developing a robust part-of-speech tagger for biomedical text. In Panhellenic Conference on Informatics.

Yoshimasa Tsuruoka and Jun'ichi Tsujii. 2005. Bidirectional inference with the easiest-first strategy for tagging sequence data. In HLT/EMNLP.

Marco Antonio Valenzuela-Escarcega, Ozgun Babur, Gus Hahn-Powell, Dane Bell, Thomas Hicks, Enrique Noriega-Atala, Xia Wang, Mihai Surdeanu, Emek Demir, and Clayton T. Morrison. 2018. Large-scale automated machine reading discovers new cancer-driving mechanisms. In Database.

Xuan Wang, Yu Zhang, Xiang Ren, Yuhao Zhang, Marinka Zitnik, Jingbo Shang, Curtis P. Langlotz, and Jiawei Han. 2018. Cross-type biomedical named entity recognition with deep multi-task learning. Bioinformatics.

Chih-Hsuan Wei, Yifan Peng, Robert Leaman, Allan Peter Davis, Carolyn J. Mattingly, Jiao Li, Thomas C. Wiegers, and Zhiyong Lu. 2016. Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task. Database: The Journal of Biological Databases and Curation, 2016.

Ralph Weischedel, Eduard Hovy, Martha Palmer, Mitch Marcus, Robert Belvin, Sameer Pradhan, Lance Ramshaw, and Nianwen Xue. 2011. OntoNotes: A large training corpus for enhanced processing. In Joseph Olive, Caitlin Christianson, and John McCary, editors, Handbook of Natural Language Processing and Machine Translation. Springer.

Yuan Zhang and David I. Weiss. 2016. Stack-propagation: Improved representation learning for syntax. CoRR, abs/1603.06598.
