arXiv:1907.02679v1 [cs.CL] 5 Jul 2019
Improving Chemical Named Entity Recognition in Patents with Contextualized Word Embeddings

Zenan Zhai1, Dat Quoc Nguyen1, Saber A. Akhondi2, Camilo Thorne2, Christian Druckenbrodt2, Trevor Cohn1, Michelle Gregory2, Karin Verspoor1
1The University of Melbourne, Australia; 2Elsevier
1fzenan.zhai,dqnguyen,trevor.cohn,[email protected] 2fs.akhondi,c.thorne.1,c.druckenbrodt,[email protected]

Abstract

Chemical patents are an important resource for chemical information. However, few chemical Named Entity Recognition (NER) systems have been evaluated on patent documents, due in part to their structural and linguistic complexity. In this paper, we explore the NER performance of a BiLSTM-CRF model utilising pre-trained word embeddings, character-level word representations and contextualized ELMo word representations for chemical patents. We compare word embeddings pre-trained on biomedical and chemical patent corpora. The effect of tokenizers optimized for the chemical domain on NER performance in chemical patents is also explored. The results on two patent corpora show that contextualized word representations generated from ELMo substantially improve chemical NER performance w.r.t. the current state-of-the-art. We also show that domain-specific resources such as word embeddings trained on chemical patents and chemical-specific tokenizers have a positive impact on NER performance.

1 Introduction

Chemical patents are an important starting point for understanding chemical compound purpose, properties, and novelty. New chemical compounds are often initially disclosed in patent documents; however, it may take 1-3 years for these chemicals to be mentioned in chemical literature (Senger et al., 2015), suggesting that patents are a valuable but underutilized resource. As the number of new chemical patent applications is drastically increasing every year (Muresan et al., 2011), it is becoming increasingly important to develop automatic natural language processing (NLP) approaches enabling information extraction from these patents (Akhondi et al., 2019). Chemical Named-Entity Recognition (NER) is a fundamental step for information extraction from chemical-related texts, supporting relation extraction (Wei et al., 2016), reaction prediction (Schwaller et al., 2018) and retro-synthesis (Segler et al., 2018).

However, performing NER in chemical patents can be challenging (Akhondi et al., 2014). As legal documents, patents are written in a very different way compared to scientific literature. When writing scientific papers, authors strive to make their words as clear and straightforward as possible, whereas patent authors often seek to protect their knowledge from being fully disclosed (Valentinuzzi, 2017). In tension with this is the need to claim broad scope for intellectual property reasons, and hence patents typically contain more details and are more exhaustive than scientific papers (Lupu et al., 2011).

There are a number of characteristics of patent texts that create challenges for NLP in this context. Long sentences listing names of compounds are frequently used in chemical patents. The structure of sentences in patent claims is usually complex, and syntactic parsing in patents can be difficult (Hu et al., 2016). A quantitative analysis by Verberne et al. (2010) showed that the average sentence length in a patent corpus is much longer than in general language use. That work also showed that the lexicon used in patents usually includes domain-specific and novel terms that are difficult to understand. Some patent authorities use Optical Character Recognition (OCR) for digitizing patents, which can be problematic when applying automatic NLP approaches, as the OCR errors introduce extra noise to the data (Akhondi et al., 2019).

Most NER systems for the chemical domain were developed, trained and tested on either chemical literature or only the title and abstract of chemical patents (Akhondi et al., 2019). There are substantial linguistic differences between abstracts and the corresponding full text publications (Cohen et al., 2010). The performance of NER approaches on full patent documents has still not been fully explored (Krallinger et al., 2015).

Hence, this paper will focus on presenting the best NER performance achieved to date on full chemical patent corpora. We use a combination of pre-trained word embeddings, a CNN-based character-level word representation and contextualized word representations generated from ELMo, trained on a patent corpus, as input to a BiLSTM-CRF model. The results show that contextualized word representations help improve chemical NER performance substantially. In addition, the impact of the choice of pre-trained word embeddings and tokenizers is assessed.

The results show that word embeddings that are pre-trained on chemical patents outperform embeddings pre-trained on biomedical datasets, and using tokenizers optimized for the chemical domain can improve NER performance in chemical patent corpora.

2 Related work

In this section, we summarize previous methods and empirical studies on NER in chemical patents.

Two existing Conditional Random Field (CRF)-based systems for chemical named entity recognition are tmChem (Leaman et al., 2015) and ChemSpot (Rocktäschel et al., 2012); each makes use of numerous hand-crafted features including word shape, prefix, suffix, part-of-speech and character N-grams in an algorithm based on modelling of tag sequences. A previous detailed empirical study explored the generalization performance of these systems and their ensembles (Habibi et al., 2016). The application of the tmChem model trained on chemical literature corpora of the BioCreative IV CHEMDNER task (Krallinger et al., 2015) and the ChemSpot model trained on a subset of the SCAI corpus (Klinger et al., 2008) resulted in a significant performance drop over chemical patent corpora.

Zhang et al. (2016) compared the performance of CRF- and Support Vector Machine (SVM)-based models on the CHEMDNER-patents corpus (Krallinger et al., 2015). The features constructed in that work included the binarized embedding (Guo et al., 2014), Brown clustering (Brown et al., 1992) and domain-specific features extracted by detecting common prefixes/suffixes in chemical words. The obtained results show that the performance of CRF and SVM models can be significantly improved by incorporating unsupervised features (e.g. word embeddings, word clustering). The study also showed that the SVM model slightly outperformed the CRF model in the chemical NER task.

To perform chemical NER on the CHEMDNER patents corpus, Akhondi et al. (2016) proposed an ensemble approach combining a gazetteer-based method and a modified version of tmChem. Here, the gazetteer-based method utilized a wide range of chemical dictionaries, while additional features such as stems, prefixes/suffixes and chemical elements were added to the original feature set of tmChem. In the ensemble approach, tokens were predicted as chemical mentions if recognized as positive by either tmChem or the gazetteer-based method. The results showed that both gazetteer-based and ensemble approaches were outperformed by the modified tmChem version in terms of overall F1 score, although these two approaches can obtain higher recall.

Huang et al. (2015) proposed a BiLSTM-CRF based on the use of a bidirectional long short-term memory network – BiLSTM (Schuster and Paliwal, 1997) – to extract (latent) features for a CRF classifier. The BiLSTM encodes the input in both forward and backward directions and passes the concatenation of outputs from both directions as input to a linear-chain CRF sequence tagging layer. In this approach, the BiLSTM selectively encodes information and long-distance dependencies observed while processing input sentences in both directions, while the CRF layer globally optimizes the model by using information from neighbor labels.

The morphological structures within words are also important clues for identifying named entities in the biological domain. Such morphological structures are widely used in systematic chemical name formats (e.g. IUPAC names) and hence are particularly informative for chemical NER (Klinger et al., 2008). Character-level word representations have been developed to leverage information from these structures by encoding the character sequences within tokens. Ma and Hovy (2016) use Convolutional Neural Networks (CNNs) to encode character sequences, while Lample et al. (2016) developed an LSTM-based approach for encoding character-level information.

Habibi et al. (2017) presented an empirical study comparing three NER models on a large collection of biomedical corpora including the BioSemantics patent corpus: (1) tmChem, the CRF-based model with hand-crafted features, used as the baseline; (2) a second CRF model based on CRFSuite (Okazaki, 2007) using pre-trained word embeddings; and (3) a BiLSTM-CRF model with additional LSTM-based character-level word embeddings (Lample et al., 2016).

CRF-based models (Section 3.3) with pre-trained word embeddings (Section 3.4), character-level word embeddings (Section 3.5), contextualized word embeddings (Section 3.6) and implementation details (Section 3.7).

3.1 Dataset

We conduct experiments on two patent corpora: the BioSemantics patent corpus (Akhondi et al., 2014) and the Reaxys gold set (Akhondi et al., 2019).

The performance The BioSemantics patent corpus (Akhondi of CRFSuite-