Overview of the Chemical Compound and Drug Name Recognition (CHEMDNER) Task
Total Page:16
File Type:pdf, Size:1020Kb
Overview of the chemical compound and drug name recognition (CHEMDNER) task Martin Krallinger1⋆, Florian Leitner1, Obdulia Rabal2, Miguel Vazquez1, Julen Oyarzabal2, and Alfonso Valencia1 1 Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, Madrid, Spain 2 Small Molecule Discovery Platform, Center for Applied Medical Research (CIMA), University of Navarra {mkrallinger, fleitner, avalencia}@cnio.es http://www.cnio.es Abstract. There is an increasing need to facilitate automated access to information relevant for chemical compounds and drugs described in text, including scientific articles, patents or health agency reports. A number of recent efforts have implemented natural language processing (NLP)andtextminingtechnologiesforthechemicaldomain(ChemNLP orchemicaltextmining).DuetothelackofmanuallylabeledGoldStan- dard datasets together with comprehensive annotation guidelines, both theimplementationaswellasthecomparativeassessmentofChemNLP technologiesBSFopaque.Twokeycomponentsformostchemicaltextmin- ing technologies are the indexing of documents with chemicals (chemi- cal document indexing - CDI) and finding the mentions of chemicals in text (chemical entity mention recognition - CEM). These two tasks formed part of the chemical compound and drug named entity recog- nition (CHEMDNER) task introduced at the fourth BioCreative chal- lenge, a community effort to evaluate biomedical text mining applica- tions.Forthistask,theCHEMDNERtextcorpuswasconstructed,con- sistingof10,000abstractscontainingatotalof84,355mentionsofchem- ical compounds and drugs that have been manually labeled by domain expertsfollowingspecificannotationguidelines.Thiscorpuscoversrep- resentative abstracts from major chemistry-related sub-disciplines such asmedicinalchemistry,biochemistry,organicchemistryandtoxicology. A total of 27 teams – 23 academic and 4 commercial HSPVQT, comprisedof 87 researchers – submitted results for this task. Of these teams, 26 provided submissions for the CEM subtask and 23 for the CDI subtask. Teams were provided with the manual annotations of 7,000 abstracts toimplement and train their systems and then had to return predictions for the 3,000 test set abstracts during a short period of time. When comparing exact matches of the automated results against the manually labeled Gold Standard annotations, the best teams reached an F-score ⋆ Corresponding author Proceedings of the fourth BioCreative challenge evaluation workshop, vol. 2 of 87.39% JO the CEM task and of 88.20% JO the CDI task. This can beregardedasaverycompetitiveresultwhencomparedtotheexpected upper boundary, the agreement between to human annotators, at 91%. In general, the technologies used to detect chemicals and drugs by the teams included machine learning methods (particularly CRFs using a considerablerangeofdifferentfeatures),interactionofchemistry-related lexical resources and manual rules (e.g., to cover abbreviations, chemi- calformulaorchemicalidentifiers).Bypromotingtheavailabilityofthe software of the participating systems as well as through the release of the CHEMDNER corpus to enable implementation of new tools, this work fosters the development of text mining applications like the au- tomatic extraction of biochemical reactions, toxicological properties of compounds,orthedetectionofassociationsbetweengenesormutations BOEdrugsinthecontextpharmacogenomics. Key words: CHEMDNER; ChemNLP; BioCreative; Named Entity Recog- nition; Chemical compounds; Drugs; Text Mining 1 Introduction There is an increasing interest, both on the academic side as well as the in- dustry, to facilitate more efficient access to information on chemical compounds and drugs (chemical entities) described in text repositories, including scientific articles, patents, health agency reports, or the Web. Chemicals and drugs are also among the entity types most frequently searched in the PubMed database. With the rapid accumulation of scientific literature, linking chemical entities to articles through manually curation cannot cope with the amount of newly pub- lished articles. If text mining methods should improve information access for this domain, two fundamental initial tasks need to be addressed first, namely (1) identifying mentions of chemical compounds within the text and (2) indexing the documents with the compounds described in them. The recognition of chem- ical entities is also crucial for other subsequent text processing strategies, such as detection of drug-protein interactions, adverse effects of chemical compounds and their associations to toxicological endpoints, or the extraction of pathway and metabolic reaction relations. Most of the research carried out in biomedical sciences and also bioinformat- ics is centered on a set of entities, mainly genes, proteins, chemicals and drugs. There is a considerable demand in facilitating a more efficient retrieval of docu- ments and sentences that describe these entities. Also a more granular extraction of certain types of descriptions and relationships from the literature on these bio- entities has attracted considerable attention [8]. For finding bio-entity relevant documents by text retrieval (TR) systems as well as for the automatic extraction of particular relationships using information extraction (IE) methods a key step is the correct identification of bio-entity referring expressions in text. Finding the mentions of entities in text is commonly called named entity recognition (NER). Originally NER systems were built for the identification in text of pre- defined categories of proper names from the general domain (newswire texts), BioCreative IV - CHEMDNER track such as organizations, locations or person names [10]. Among the bio-entities that were studied in more detail so far are mentions of genes, proteins, DNA, RNA, cell lines, cell types, drugs, chemical compounds and mutations [9, 6, 15, 8]. Some effort has also been dedicated to find medication and disease names as well as mentions of species and organisms [3] in text. A common characteristic of most named entities is that they generally refer to some sort of real world object or object class. From a linguistic perspective, named entities are often characterized by being referenced in text by their proper names. Most of the previous research related to the recognition of biomedical entities focused mainly on the detection of gene/protein mentions. For the detection of gene/protein mentions in text, previous BioCreative challenges (BioCreative I and II) could attracted a considerable interest and served to determine the performance of various methods under a controlled evaluation setting [13]. Despite its importance, only a rather limited number of publicly accessible chemical compound recognition systems are currently available [15]. Among the main bottlenecks at present encountered to implement and compare the perfor- mance of such systems are: (a) the lack of suitable training/test data, (b) the intrinsic difficulty in defining annotation guidelines of what actually constitutes a chemical compound or drug, (c) the heterogeneity in terms of scope and tex- tual data sources used, as well as (d) the lack of comparative evaluation efforts for this topic. To addresses these issues, the CHEMDNER task was organized, a commu- nity challenge to examine the efficiency of automatic recognition of chemical entities in text. From the point of view of the teams that submitted results for this task, the main motivations for participation provided by them included using the framework offered by the CHEMDNER task as a way to (a) im- prove/demonstrate the performance of their software, (b) push the state of the art for this topic, (c) test their tools on a different corpus, (d) adapt their system to detect (new) types of chemical entities as defined in the CHEMDNER cor- pus, (e) to be able compare their system to other strategies, (f) to improve the scalability of their system and (g) to explore the integration of different NER system for this task. In order to regulate the naming of entities in natural sciences, one strategy is to define some nomenclature and terminological guidelines to constrain how such entities are correctly constructed. For instance the International Union of Pure and Applied Chemistry (IUPAC) has defined a set of rules for the chemical nomenclature. Unfortunately naming standards are not sufficiently followed in practice when examining the scientific literature [14]. Chemical science is characterized by a considerable degree of specialization, resulting in implicit variability across diGGerent sub-disciplines. 5IFSFGPSF an almostarbitrary collection of expressions may be used in the literature to refer to thevery same chemical compound. This variability can be explained by the use ofaliases, e.g. diGGerent synonyms used for the same entity. In addition one can encounter variability through alternative typographical expressions. The problem of variability has an eGGect on the resulting recall of text mining systems, Proceedings of the fourth BioCreative challenge evaluation workshop, vol. 2 requiring mapping of the various synonyms and alternative variants into a single canonical form. On the other hand, a given word can correspond to a chemical entity or to some other concept. This results in ambiguities that can only be resolved by taking into account the context of the mentions (context-based sense disam- biguation). With this respect an important source of ambiguity is the heavy use of acronyms, abbreviations,