BioNLP Shared Task 2013 – An Overview of the Bacteria Biotope Task

Robert Bossy1, Wiktoria Golik1, Zorana Ratkovic1,2, Philippe Bessières1, Claire Nédellec1

1 Unité Mathématique, Informatique et Génome MIG, INRA UR1077, F-78352 Jouy-en-Josas, France
2 LaTTiCe UMR 8094 CNRS, 1 rue Maurice Arnoux, F-92120 Montrouge, France
[email protected]

Abstract

This paper presents the Bacteria Biotope task of the BioNLP Shared Task 2013, which follows BioNLP-ST'11. The Bacteria Biotope task aims to extract the location of bacteria from scientific web pages and to characterize these locations with respect to the OntoBiotope ontology. Bacteria locations are crucial knowledge in biology for phenotype studies. The paper details the corpus specifications and the evaluation metrics, and it summarizes and discusses the participant results.

1 Introduction

The Bacteria Biotope (BB) task extends the molecular biology scope of the BioNLP 2013 Shared Task. It consists of extracting bacteria and their locations from web pages, and categorizing the locations with respect to the OntoBiotope¹ ontology of microbe habitats. The locations denote the places where given species live. The bacteria habitat information is critical for the study of the interaction between the species and their environment, and for a better understanding of the underlying biological mechanisms at a molecular level. The information on bacteria biotopes and their properties is very abundant in scientific literature, in genomic databases and in BRC (Biology Resource Center) catalogues. However, the information is highly diverse and expressed in natural language (Bossy et al., 2012). The two critical missing steps for the population of biology databases and for biotope knowledge modeling are (1) the automatic extraction of organism/location pairs and (2) the normalization of the habitat names with respect to biotope ontologies.

The aim of the previous edition of the BB task (BioNLP-ST'11) was to solve the first, information extraction, step. The results obtained by the participant systems reached 45 percent F-measure. These results showed both the feasibility of the task and a large room for improvement (Bossy et al., 2012).

The 2013 edition of the BB task maintains the primary objective of event extraction, and introduces the second issue of biotope normalization. It is handled through the categorization of the locations into a large set of types defined in the OntoBiotope ontology. Bacteria locations range from hosts, plants and animals, to natural environments (e.g. water, soil), including industrial environments. The BB'11 set of categories contained 7 types. This year, entity categorization has been enriched to better answer the biological needs, as well as to contribute to the general problem of automatic semantic annotation by ontologies.

The BB task is divided into three sub-tasks. Entity detection and event extraction are tackled by two distinct sub-tasks, so that the contribution of each method can be assessed. A third sub-task conjugates the two in order to measure the impact of the method interactions.

2 Context

Biological motivation.
Today, new sequencing methods allow biologists to study complex environments such as microbial communities. Therefore, the sequence annotation process is facing radical changes with respect to the volume of data and the nature of the annotations to be considered. Not only do biochemical functions still need to be assigned to newly identified genes, but biologists have to take into account the conditions and the properties of the ecosystems in which microorganisms are living and are identified, as well as the interactions and relationships developed with their environment and other living organisms (Korbel et al., 2005).

¹ http://bibliome.jouy.inra.fr/MEM-OntoBiotope/OntoBiotope_BioNLP-ST13.obo

Metagenomic studies of ecosystems yield important information on the phylogenetic composition of the microbiota. The availability of bacteria biotope information represented in a formal language would then pave the way for many new environment-aware bioinformatic services. The development of methods able to extract and normalize natural language information at a large scale would allow us to rapidly obtain and summarize the information that bacterial species or genera are associated with in the literature. In turn, this will allow for the formulation of hypotheses regarding properties of the bacteria, the habitats, and the links between them.

The pioneering work on EnvDB (Pignatelli et al., 2009) aimed to link GenBank sequences of microbes to biotope mentions in scientific papers. However, EnvDB was affected by the incompleteness of the GenBank isolation source field, the low number of related bibliographic references, the bag-of-words extraction method and the small size of its habitat classification.

Habitat categories.
The most developed classifications of habitats are EnvO, the Metagenome classification supported by the Genomics Standards Consortium (GSC), and the OntoBiotope ontology developed by our group. EnvO (Environment Ontology project) targets the Minimum Information about a Genome Sequence (MIGS) specification (Field et al., 2008), mainly for Eukaryotes. This ambitious and detailed environment ontology aims to support standard manual annotations of all types of organism environments and biological samples. However, it suffers from some limitations for bacterial biotope descriptions. A large part of EnvO is devoted to environmental biotopes and extreme habitats, whilst it fails to finely account for the main trends in bacteria studies, such as their technological use for food transformation and bioremediation, and their pathogenic or symbiotic properties. Moreover, EnvO terms are often poorly suited for bacteria literature analysis (Ratkovic et al., 2012).

The Metagenome Classification from the JGI of the DOE (Joint Genome Institute, US Department of Energy) is intended to classify metagenome projects and samples according to a mixed typology of habitats (e.g. environmental, host) and their physico-chemical properties (e.g. pH, salinity) (Ivanova et al., 2010). It is a valuable source of vocabulary for the analysis of bacteria literature, but its structure and scope are strongly biased by the indexing of metagenome projects.

The OntoBiotope ontology is appropriate for the categorization of bacteria biotopes in the BB task because its scope and its organization reflect the scientific subject division and the microbial diversity. Its size (1,756 concepts) and its deep hierarchical structure are suitable for a fine-grained normalization of the habitats. Its vocabulary has been selected after a thorough terminological analysis of relevant scientific documents, papers, GOLD (Chen et al., 2010) and GenBank, which was partly automated by term extraction. Related terms are attached to OntoBiotope concept labels (383 synonyms), improving the OntoBiotope coverage of natural language documents. Its structure and a part of its vocabulary have been inspired by EnvO, the Metagenome classification and the small ATCC (American Type Culture Collection) classification for microbial collections (Floyd et al., 2005). Explicit references to 34 EnvO terms are given in the OntoBiotope file. Its main topics are:
- « Artificial » environments (industrial and domestic), Agricultural habitats, Aquaculture habitats, Processed food;
- Medical environments, Living organisms, Parts of living organisms, Bacteria-associated habitats;
- « Natural » environment habitats, Habitats with respect to physico-chemical properties (including extreme ones);
- Experimental media (i.e. experimental biotopes designed for studying bacteria).

The structure, the comprehensiveness and the detail of the habitat classification are critical factors for research in biology. Biological investigations involving the habitats of bacteria are very diverse and still unanticipated. Thus, shallow and light classifications are insufficient to tackle the full extent of the biological questions. Indexing genomic data with a hierarchical fine-grained ontology such as OntoBiotope allows us to obtain aggregated and adjusted information by selecting the right level or axis of abstraction.
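To make the categorization target concrete, the fragment below sketches one way such a hierarchy with synonyms can be represented in memory. It is a hedged illustration in Python: the identifiers, labels and is_a links are invented for the example and do not reproduce actual entries of the released .obo file (see footnote 1).

```python
# Illustrative, hypothetical OntoBiotope-like fragment: concept ids,
# preferred names, synonyms and is_a links. Not actual entries of the
# released .obo file.
ONTOLOGY = {
    "OBT:0001": {"name": "habitat", "synonyms": [], "is_a": []},  # root
    "OBT:0100": {"name": "animal farm", "synonyms": ["livestock farm"],
                 "is_a": ["OBT:0001"]},
    "OBT:0101": {"name": "fish farm", "synonyms": ["aquaculture farm"],
                 "is_a": ["OBT:0100"]},
}

def ancestors(concept_id, ontology=ONTOLOGY):
    """Return every ancestor of a concept (itself included) together with
    its distance in is_a steps, walking up the hierarchy."""
    result = {concept_id: 0}
    frontier = [concept_id]
    while frontier:
        current = frontier.pop()
        for parent in ontology[current]["is_a"]:
            d = result[current] + 1
            if parent not in result or d < result[parent]:
                result[parent] = d
                frontier.append(parent)
    return result
    # e.g. ancestors("OBT:0101") -> {"OBT:0101": 0, "OBT:0100": 1, "OBT:0001": 2}
```

The per-ancestor distances computed here are exactly what the hierarchy-aware evaluation metric of Section 6.2 weighs.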

Bacteria Biotope Task.
The corpus is the same as in BB'11. The documents are scientific web pages intended for a general audience, in the form of encyclopedia notices. They focus on a single organism or a family. The habitat mentions are dense and more diverse than in PubMed abstracts. These features make the task both useful and feasible with a reduced investment in biology. Its linguistic characteristics, the high frequency of anaphora and entities denoted by complex nominal expressions, raise interesting questions for BioNLP that have been treated for a long time in the general and the biomedical domains.

3 Task description

The BB Task is split into two secondary goals:
1. The detection of entities and their categorization(s) (sub-task 1).
2. The extraction of Localization relations given the entities (sub-task 2).

Sub-task 1 involves the prediction of habitat entities and their position in the text. The participant also has to assign each entity to one or more concepts of the OntoBiotope ontology: the categorization task. For instance, in the excerpt Isolated from the water of abalone farm, the entity abalone farm should be assigned the OntoBiotope category fish farm.

Sub-task 2 is a relation extraction task. The schema of this task contains three types of entities:
- The Habitat type is the same as in sub-task 1.
- Geographical entities represent location and organization named entities.
- Bacteria entities are bacterial taxa.

Additionally, there are two types of relations, illustrated by Figure 1:
- Localization relations link Bacteria to the place where they live (either a Habitat or a Geographical).
- PartOf relations relate couples of Habitat entities: a living organism, which is a host (e.g. adult human), and a part of this living organism (e.g. gut).

[Figure 1 shows the sentence "Bifidobacterium longum. This organism is found in adult humans and formula fed infants as a normal component of gut flora.", with Localization relations from the bacterium to adult humans and to formula fed infants, and PartOf relations from gut to each of the two hosts.]
Figure 1. Example of a localization event in the BB Task.

Sub-task 2 participants are provided with document texts and entities, and should predict the relations between the candidate entities.

Sub-task 3 is the combination of these two sub-tasks. It consists of predicting both the entity positions and the relations between entities. Compared to sub-task 1, the systems have to predict not only Habitat entities, but also Geographical and Bacteria entities. It is similar to the BB task of BioNLP-ST'11, except that no categorization of the entities is required.

4 Corpus description

The BB corpus document sources are web pages from bacteria sequencing projects (EBI, NCBI, JGI, Genoscope) and encyclopedia pages from MicrobeWiki. The documents are publicly available. Table 1 gives the distribution of the entities and relations in the corpora per sub-task.

                   Training+Dev   Test 1 & 3   Test 2
Document           78             27           26
Word               25,828         7,670        10,353
Bacteria           1,347          332          541
Geographical       168            38           82
Habitat            1,545          507          623
OntoBiotope cat.   1,575          522          NA
Total entities     3,060          877          1,246
Localization       1,030          269          538
PartOf Host        235            111          129
Total relations    1,265          328          667
Table 1. BB'13 corpus figures.

The categorization of entities by a large ontology (sub-task 1) offers a novel task to the BioNLP-ST community; a close examination of the annotated corpus allowed us to anticipate the challenges for participating teams. A total of 2,052 entities have been manually annotated for sub-task 1 (training, development and test sets together). These entities have 1,036 distinct surface forms, which means that an entity surface form is repeated a little less than twice, on average. However, only a quarter of the surface forms are actually repeated; three quarters are unique in the corpus. Moreover, 60% of habitat entities have a surface form that does not match one of the synonyms of their ontology concept. This configuration suggests that methods that simply propagate surface forms and concept attributions from ontology synonyms and from training entities would be inefficient. We have developed a baseline prediction that projects the ontology synonyms and the training corpus habitat surface forms onto the test set. This prediction scores a high Slot Error Rate of 0.74.
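For reference, such a projection baseline can be approximated by plain dictionary matching. The sketch below is our hedged reconstruction of the idea, not the organizers' implementation: function names such as build_lexicon and project are ours, and the official baseline may differ in tokenization and tie-breaking.

```python
# Hedged sketch of a surface-form projection baseline: project ontology
# synonyms and training-corpus habitat forms onto test documents by
# longest exact (case-insensitive) string matching.
import re

def build_lexicon(ontology_synonyms, training_entities):
    """Map lowercased surface forms to sets of OntoBiotope concept ids.
    Both arguments are iterables of (surface_form, concept_id) pairs."""
    lexicon = {}
    for form, concept in list(ontology_synonyms) + list(training_entities):
        lexicon.setdefault(form.lower(), set()).add(concept)
    return lexicon

def project(text, lexicon):
    """Return (start, end, concepts) for every lexicon entry found in the
    text, preferring longer forms so that e.g. 'abalone farm' wins over
    'abalone' on the same span."""
    predictions = []
    for form in sorted(lexicon, key=len, reverse=True):
        for match in re.finditer(re.escape(form), text.lower()):
            span = (match.start(), match.end())
            # Skip spans fully covered by an already-accepted longer match.
            if not any(s <= span[0] and span[1] <= e for s, e, _ in predictions):
                predictions.append((span[0], span[1], lexicon[form]))
    return predictions
```

The 0.74 SER of this kind of baseline quantifies the claim above: most test entities cannot be recovered by propagating known surface forms alone.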

We also note that there are a few ambiguous forms (112 forms) that are synonyms of several different concepts or that do not always denote a habitat, and a few entities are assigned more than one concept (42 of them). These are difficult cases that require prediction methods capable of word sense disambiguation. The low number of ambiguous occurrences has a low impact on the participant scores, although their presence may motivate more sophisticated methods.

5 Annotation methodology

The methodology for the annotation of entity positions and relations is similar to that of BB Task'11. It involved seven scientists who participated in a double-blind annotation (each document was annotated twice), followed by a conflict resolution phase. They used the AlvisAE annotation editor (Papazian et al., 2012). The guidelines included some improvements that are detailed below.

Boundaries.
Habitat entities may be either nouns or adjectives. In the case of adjectives, the head is included in the entity span if it denotes a location (e.g. intestinal sample) and is excluded otherwise (e.g. hospital epidemic). The entity spans may be discontinuous, which is relevant for overlapping entities like ground water and surface water in ground and surface water. The major change is the inclusion of all modifiers that describe the location in the habitat entity span. This makes the entity more informative and the entity boundaries easier to predict, and less subject to debate. For instance, in the example isolated from the water of an abalone farm, the water entity extends from water to farm. Note that in sub-task 1, all entities have to be predicted, even when not involved in a relation. This led to the annotation of embedded entities as potential habitats for bacteria, such as abalone farm and abalone in the above example.
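The discontinuous-span guideline is easy to picture with explicit character offsets. The encoding below, an entity as a list of fragments, is an assumed illustration, not the official annotation format; the offsets are computed on the literal phrase.

```python
# Representing the two overlapping habitat entities annotated in the
# phrase "ground and surface water" as (possibly discontinuous) lists
# of character fragments. This encoding is an assumption for
# illustration; the official BioNLP-ST files use their own format.
phrase = "ground and surface water"

ground_water  = [(0, 6), (19, 24)]   # "ground" + "water" (discontinuous)
surface_water = [(11, 24)]           # "surface water" (contiguous)

def surface(fragments, text):
    """Recover the surface form of a possibly discontinuous entity."""
    return " ".join(text[start:end] for start, end in fragments)

assert surface(ground_water, phrase) == "ground water"
assert surface(surface_water, phrase) == "surface water"
```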

Equivalent sets of entities.
As in BB'11, there are many equivalent mentions of the same bacteria in the documents that play a similar role with respect to the Localization relation. Selecting only one of them as the gold reference would have been arbitrary. When this is the case, the reference annotation includes equivalent sets of entities that convey the same information (e.g. Borrelia garinii vs. B. garinii, but not Borrelia).

Category assignment.
The assignment of categories to habitat entities has been done in two steps: (i) an automatic pre-annotation by the method of Ratkovic et al. (2012) and (ii) a manual double-blind revision followed by a conflict resolution phase. In the manual annotation phase, the most frequent conflicts between annotators were the same as in the previous edition. They involved the assignment of entities to either the living organism category, organic matter or food. An example is the cane entity in cane cuttings. To handle these cases, the guidelines assert that a dead organism cannot be assigned to a living organism category.

The high quality of the pre-annotation, and its visualization and revision using the AlvisAE annotation editor, notably sped up the annotation process. Table 2 summarizes the figures of the pre-annotation. For sub-task 1, the pre-annotation consisted of assigning OntoBiotope categories to entities for the whole corpus (train+dev+test). The pre-annotation yielded very high results, with an F-measure of almost 90%. The pre-annotation was also useful to assess the relevance of the OntoBiotope ontology for the BB task. For sub-task 2, the pre-annotation consisted of the detection of entities in the test set, where no categorization is needed. The second line in Table 2 shows that the recall of entity detection affects the F-score, but that it still made the prediction helpful for the annotators. Further data analysis revealed that the terminology-based approach of the pre-annotation poorly detected the boundaries of embedded entities, thereby decreasing the recall of the entity recognition.

                    Recall   Precision   F1
Corpus sub-task 1   89.7%    90.1%       89.9%
Test sub-task 2     47.3%    95.7%       63.3%
Table 2. Pre-annotation scores.

6 Evaluation procedure

The evaluation procedure was similar to the previous edition in terms of resources, schedule and metrics, except that an original metric was developed for the new problem of entity categorization in a hierarchy.

6.1 Campaign organization

The training and development corpora with the reference annotations were made available to the participants eleven weeks before the release of the test sets.

Participating teams then had ten days to submit their predictions. As with all BioNLP-ST tasks, each participant submitted a single final prediction for each BB sub-task. The detailed evaluation results were computed, provided to the participants and published on the BioNLP website two days after the submission deadline.

6.2 Evaluation metrics

Sub-task 1.
In this sub-task participants were given only the document texts. They had to predict habitat entities along with their categorization with the OntoBiotope ontology. The evaluation of sub-task 1 takes into account the accuracy of the boundaries of the predicted entities as well as of the ontology category.

Entity pairing.
The evaluation algorithm performs an optimal pairwise matching between the habitat entities in the reference and the predicted entities. We defined a similarity between two entities that takes into account the boundaries and the categorization. Each reference entity is paired with the predicted entity for which the similarity is the highest among non-zero similarities. If the boundaries of a reference entity do not overlap with any predicted entity, then it is a false negative, or a deletion. Conversely, if the boundaries of a predicted entity do not overlap with any reference entity, then it is a false positive, or an insertion. If the similarity between the entities is 1, then it is a perfect match; if the similarity is lower than 1, then it is a substitution.

Entity similarity.
The similarity M between two entities is defined as:

M = J · W

J measures the accuracy of the boundaries between the reference and the predicted entities. It is defined as a Jaccard Index adapted to segments (Bossy et al., 2012). For a pair of entities with the exact same boundaries, J equals 1.

W measures the agreement between the ontology concept assigned to the reference entity and the concept assigned to the predicted entity. We used the semantic similarity proposed by Wang et al. (2007). This similarity compares the set of all ancestors of the concept assigned to the reference entity and the set of all ancestors of the concept assigned to the predicted entity. The similarity is the Jaccard Index between the two sets of ancestors; however, each ancestor is weighted with a factor equal to

w^d

where d is the number of steps between the attributed concept and the ancestor, and w is a constant greater than zero and lower than or equal to 1. If both the reference and predicted entities are assigned the same concept, then the sets of ancestors are equal and W is equal to 1. If the pair of entities has different concept attributions, W is lower than 1 and depends on the relative depth of the lowest common ancestor: the lower the common ancestor, the higher the value of W. The exponentiation by the w constant ensures that the weight of the ancestors decreases non-linearly. This similarity thus favors predictions in the vicinity of the reference concept. Note that since the ontology root is the ancestor of all concepts, W is always strictly greater than zero.

Wang et al. (2007) showed experimentally that a value of 0.8 for the w constant is optimal for clustering purposes. However, we noticed that high w values tend to favor sibling predictions over ancestor/descendant predictions, which are preferable here, whilst low w values do not penalize ontology root predictions enough. We settled on w = 0.65, which ensures that ancestor/descendant predictions always have a greater value than sibling predictions, while root predictions never yield a similarity greater than 0.5.

As specified above, if the similarity M < 1, then the entity pair is a substitution. We define the importance of the substitution S as:

S = 1 − M

Prediction score.
Most IE tasks measure the quality of a prediction with Precision and Recall, eventually merged into an F1. However, the pairing detects not only false positives and false negatives, but also substitutions. In such cases, Recall and Precision factor the substitutions twice, and thus underestimate false negatives and false positives. We therefore used the Slot Error Rate (SER), which has been devised to overcome this shortcoming (Makhoul et al., 1999):

SER = (S + I + D) / N

where:
- S represents the number of substitutions;
- I represents the total number of insertions;
- D represents the total number of deletions;
- N is the number of entities in the reference.

The SER is a measure of errors, so the lower the better. A SER equal to zero means that the prediction is perfect. The SER is unbounded, though a value greater than one means that there are more mistakes in the prediction than entities in the reference. We also computed the Recall, Precision and F1 measures in order to facilitate the interpretation of results:

Recall = M / N
Precision = M / P

where M is the sum of the similarity M over all pairs in the optimal pairing, N is the number of entities in the reference, and P is the number of entities in the prediction.
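The sub-task 1 metric can be condensed into a few functions. The sketch below is our reading of the definitions above rather than the official scorer: parents is an assumed concept-to-parents map, entities are assumed (fragments, concept) pairs, and the optimal pairing is simplified to a greedy best match.

```python
# Hedged sketch of the sub-task 1 metric: M = J * W, then SER and
# Recall/Precision. A simplified reading of the text above, not the
# official scorer. `parents` maps a concept id to its is_a parents.
W_CONST = 0.65  # the w constant chosen by the organizers

def weighted_ancestors(concept, parents, w=W_CONST):
    """All ancestors of `concept` (itself included), weighted by w**d
    where d is the number of is_a steps up to the ancestor."""
    weights, frontier = {concept: 1.0}, [concept]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, []):
            weight = weights[node] * w
            if weight > weights.get(parent, 0.0):
                weights[parent] = weight
                frontier.append(parent)
    return weights

def concept_similarity(ref, pred, parents):
    """W: weighted Jaccard between the two ancestor sets. Equals 1 for
    identical concepts; stays > 0 because the root is always shared."""
    a, b = weighted_ancestors(ref, parents), weighted_ancestors(pred, parents)
    inter = sum(min(a[c], b[c]) for c in a.keys() & b.keys())
    union = sum(max(a.get(c, 0.0), b.get(c, 0.0)) for c in a.keys() | b.keys())
    return inter / union

def boundary_jaccard(ref_frags, pred_frags):
    """J: Jaccard index over character positions, which naturally handles
    discontinuous spans. Equals 1 for identical boundaries."""
    ref = {i for s, e in ref_frags for i in range(s, e)}
    pred = {i for s, e in pred_frags for i in range(s, e)}
    return len(ref & pred) / len(ref | pred)

def score(reference, prediction, parents):
    """Greedy pairing, then SER = (S + I + D) / N, Recall = M / N,
    Precision = M / P. Entities are (fragments, concept) pairs."""
    pairs, used = [], set()
    for r_frags, r_concept in reference:
        best, best_m = None, 0.0
        for j, (p_frags, p_concept) in enumerate(prediction):
            if j in used:
                continue
            m = boundary_jaccard(r_frags, p_frags) * \
                concept_similarity(r_concept, p_concept, parents)
            if m > best_m:  # only non-zero similarities can pair
                best, best_m = j, m
        if best is not None:
            used.add(best)
            pairs.append(best_m)
    n, p = len(reference), len(prediction)
    m_sum = sum(pairs)
    substitutions = sum(1.0 - m for m in pairs)  # S as substitution importances
    deletions = n - len(pairs)                   # unmatched reference entities
    insertions = p - len(pairs)                  # unmatched predicted entities
    return {"SER": (substitutions + insertions + deletions) / n,
            "Recall": m_sum / n, "Precision": m_sum / p}
```

With w = 0.65 this reproduces the intended ordering: predicting an ancestor or descendant of the reference concept scores higher than predicting a sibling, and a root prediction never exceeds 0.5.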

Sub-task 2.
In sub-task 2, the participants had to predict relations between candidate arguments, which are Bacteria, Habitat and Geographical entities. This task can be viewed as a categorization task over all pairs of entities. Thus, we evaluate submissions with Recall, Precision and F1.

Sub-task 3.
Sub-task 3 is similar to sub-task 2, but it includes entity prediction. This is the same setting as the BB task in BioNLP-ST 2011, except for entity categorization. We used the same evaluation metrics based on Recall, Precision and F1 (Bossy et al., 2012). The highlights of this measure are:
- it is based on the pairing between reference and predicted relations that maximizes a similarity;
- the similarity of the boundaries of Habitat and Geographical entities is relaxed and defined as the Jaccard Index (in the same way as in sub-task 1);
- the boundaries of Bacteria entities are strict: the evaluation rejects all relations where the Bacteria has incorrect boundaries.

7 Results

7.1 Participating systems

Five teams submitted ten predictions to the three BB sub-tasks. LIMSI (CNRS, France; Grouin, 2013) is the only team that submitted to the three sub-tasks. LIPN (U. Paris-Nord, France; Bannour et al., 2013) only submitted to sub-task 1. TEES (TUCS, Finland; Björne and Salakoski, 2013) only submitted to sub-task 2. Finally, IRISA (INRIA, France; Claveau, 2013) and Boun (U. Boğaziçi, Turkey; Karadeniz and Özgür, 2013) submitted to sub-tasks 1 and 2. The scores of the submissions according to the official metrics are shown in decreasing rank order in Tables 3 to 6.

Participant   Rank   SER    F1
IRISA         1      0.46   0.57
Boun          2      0.48   0.59
LIPN          3      0.49   0.61
LIMSI         4      0.66   0.44
Table 3. Scores for Sub-task 1 of the BB Task.

              Entity detection    Category assignment
Participant   SER     F1          SER     F1
IRISA         0.43    0.60        0.35    0.67
Boun          0.42    0.65        0.36    0.71
LIPN          0.46    0.64        0.38    0.72
LIMSI         0.45    0.71        0.66    0.50
Table 4. Detailed scores for Sub-task 1 of the BB Task.

The participant systems for sub-task 1 obtained high scores despite the novelty of the task (0.46 SER for the first, IRISA). The results of the first three systems are very close despite the diversity of the methods. The decomposition of the scores into the prediction of entities with correct boundaries and their assignment to the right category is shown in Table 4. The two are quite balanced, with a slightly better rate for category assignment, with the exception of the LIMSI system, which is notably better in entity detection. This table also shows the dependency between the entity detection and categorization steps: errors in the entity boundaries affect the quality of categorization.

Table 5 details the scores for sub-task 2. The prediction of location relations remains a difficult problem, even with the entities being given. There are two reasons for this. First, there is a high diversity of bacteria and locations. The many mentions of different bacteria and locations in the same paragraph make it a challenge to select the right pairing among candidate arguments. This is particularly true for the PartOf relation compared to the Localization relation (columns 5 and 6). All systems obtained a recall much lower than the precision, which may be interpreted as training data overfitting.

Participant   Rec.   Prec.   F1     F1 PartOf   F1 Loc.
TEES 2.1      0.28   0.82    0.42   0.22        0.49
IRISA         0.36   0.46    0.40   0.20        0.45
Boun          0.21   0.38    0.27   0.20        0.29
LIMSI         0.04   0.19    0.06   0.00        0.07
Table 5. Scores of Sub-task 2 for the BB Task.

The second challenge is the high frequency of anaphora, especially with a bacteria antecedent. For BioNLP-ST 2011, we already pointed out that coreference resolution is critical in order to capture all relations that are not expressed inside a sentence.

Participant   Rec.          Prec.         F1
TEES 2.1      0.12 (0.41)   0.18 (0.61)   0.14 (0.49)
LIMSI         0.04 (0.09)   0.12 (0.82)   0.06 (0.15)
Table 6. Scores of Sub-task 3 for the BB Task (the relaxed scores are given in parentheses).

The results of sub-task 3 (Table 6) may appear disappointing compared to the first two sub-tasks and to BB'11. Further analysis shows that the system scores were affected by poor entity boundary detection and by the PartOf relation predictions. In order to demonstrate this, we computed a relaxed score that differs from the primary score by:
- removing PartOf relations from the reference and the prediction;
- accepting Localization relations even if the Bacteria entity boundaries do not match;
- removing the penalty for the incorrect boundaries of Habitat entities.
This relaxed score is equivalent to ignoring PartOf relations and considering the boundaries of predicted entities as perfect. The results are shown in Table 6 between parentheses. The most determinant factor is the relaxation of Bacteria entity boundaries, because these errors are severely penalized. An error analysis of the submitted predictions revealed that more than half of the rejected Localization predictions had a Bacteria argument with incorrect boundaries.
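The relaxed variant amounts to a pre-filter applied to both the reference and the prediction before matching. A minimal sketch, assuming relations encoded as (type, bacteria, location) tuples with abstract argument identifiers, so that boundary mismatches disappear by construction:

```python
# Hedged sketch of the relaxed-score transformation described above:
# drop PartOf relations, then compare Localization relations on their
# arguments only, ignoring boundary mismatches. The relation encoding
# (type, bacteria_id, location_id) is an assumption for illustration.

def relax(relations):
    """Keep only Localization relations, reduced to argument identity so
    that boundary errors no longer reject a relation."""
    return {(bacteria, location)
            for rtype, bacteria, location in relations
            if rtype == "Localization"}

def relaxed_f1(reference, prediction):
    ref, pred = relax(reference), relax(prediction)
    tp = len(ref & pred)
    recall = tp / len(ref) if ref else 0.0
    precision = tp / len(pred) if pred else 0.0
    if recall + precision == 0.0:
        return 0.0
    return 2 * recall * precision / (recall + precision)
```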
7.2 Systems description and result analysis

The participants deployed various assortments of methods ranging from linguistics and machine learning to hand-coded pattern-matching. Sub-task 1 was handled in two successive steps: candidate entity detection and category assignment.

Entity detection.
The approaches combine (1) the use of lexicons (IRISA and LIMSI), (2) text analysis by chunking (IRISA), noun phrase analysis (Boun), term analysis by BioYaTeA (LIPN) and Cocoa entity detection (LIMSI), and (3) additional rules (TextMarker by LIPN) or machine learning (CRF by LIMSI) for the adaptation to the corpus. The LIMSI system, combining Cocoa entity detection (a BioNLP supporting resource) with a CRF, obtained the best result, 11 points over the less linguistics-based approach of IRISA, as shown in Table 4.

Assignment of categories to entities.
It was mainly realized using hand-coded rules (LIMSI, Boun), machine learning with Whisk (LIPN) or a similarity between ontology labels and the text entities (IRISA). It is interesting to note that although the approaches are very different, the three types of methods obtained close results ranging from 0.35 to 0.38 SER, apart from one outlier.

Prediction of relations.
Sub-task 2 was completed by applying hand-coded rules (LIMSI, Boun), which were much less successful than the two machine-learning-based approaches, i.e. kNN by IRISA and multi-step SVM by TEES-2.1. In the case of TEES-2.1, attributes were generated from McCCJ parses, which may explain its success in the prediction of PartOf relations, 20 points over the second method, which did not use any parsing.

Prediction of entities and relations.
Sub-task 3 was completed by LIMSI using the successive application of its methods from sub-tasks 1 and 2. TEES-2.1 applied its multi-step SVM classification of sub-task 2 for relation prediction, completed by additional SVM steps for candidate entity detection.

These experiments allow for the comparison of very different state-of-the-art methods, resources and integration strategies. However, the tight gap between the scores of the different systems prevents us from drawing a definitive conclusion. Additional criteria other than scores may also be taken into account: the simplicity of deployment, the ease of adaptation to new domains, the availability of relevant resources and the potential for improvement.

8 Conclusion

After BioNLP-ST'11, the second edition of the Bacteria Biotope Task provides a wealth of new information on the generalization of entity categorization methods to a large set of categories. The final submissions of the 5 teams show very promising results with a broad variety of methods. The introduction of new metrics appeared appropriate to reveal the quality of the results and to highlight relevant contrasts. The prediction of events still remains challenging in documents where the candidate arguments are very dense, and where most relations involve several sentences. A thorough analysis of the results indicates clear directions for improvement.

Acknowledgments

This work has been partially supported by the Quaero program, funded by OSEO, the French state agency for innovation, and by the INRA OntoBiotope Network.

References

Sondes Bannour, Laurent Audibert, Henry Soldano. 2013. Ontology-based semantic annotation: an automatic hybrid rule-based method. Present volume.

Jari Björne, Tapio Salakoski. 2013. TEES 2.1: Automated Annotation Scheme Learning in the BioNLP 2013 Shared Task. Present volume.

Robert Bossy, Julien Jourde, Alain-Pierre Manine, Philippe Veber, Erick Alphonse, Maarten van de Guchte, Philippe Bessières, Claire Nédellec. 2012. BioNLP Shared Task – The Bacteria Track. BMC Bioinformatics, 13(Suppl 11):S3, June.

Vincent Claveau. 2013. IRISA participation to BioNLP-ST 2013: lazy-learning and information retrieval for information extraction tasks. Present volume.

EnvDB database. http://metagenomics.uv.es/envDB/

EnvO Project. http://environmentontology.org

Dawn Field et al. 2008. Towards a richer description of our complete collection of genomes and metagenomes: the "Minimum Information about a Genome Sequence" (MIGS) specification. Nature Biotechnology, 26:541-547.

Melissa M. Floyd, Jane Tang, Matthew Kane, David Emerson. 2005. Captured Diversity in a Culture Collection: Case Study of the Geographic and Habitat Distributions of Environmental Isolates Held at the American Type Culture Collection. Applied and Environmental Microbiology, 71(6):2813-2823.

GenBank. http://www.ncbi.nlm.nih.gov/

GOLD. http://www.genomesonline.org/cgi-bin/GOLD/bin/gold.cgi

Cyril Grouin. 2013. Building A Contrasting Taxa Extractor for Relation Identification from Assertions: BIOlogical Taxonomy & Ontology Phrase Extraction System. Present volume.

N. Ivanova, S.G. Tringe, K. Liolios, W.T. Liu, N. Morrison, P. Hugenholtz, N.C. Kyrpides. 2010. A call for standardized classification of metagenome projects. Environmental Microbiology, 12(7):1803-1805.

İlknur Karadeniz, Arzucan Özgür. 2013. Bacteria Biotope Detection, Ontology-based Normalization, and Relation Extraction using Syntactic Rules. Present volume.

J.O. Korbel, T. Doerks, L.J. Jensen, C. Perez-Iratxeta, S. Kaczanowski, S.D. Hooper, M.A. Andrade, P. Bork. 2005. Systematic association of genes to phenotypes by genome and literature mining. PLoS Biology, 3(5):e134.

K. Liolios, I.M. Chen, K. Mavromatis, N. Tavernarakis, P. Hugenholtz, V.M. Markowitz, N.C. Kyrpides. 2010. The Genomes On Line Database (GOLD) in 2009: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Research, 38(Database issue):D346-D354.

John Makhoul, Francis Kubala, Richard Schwartz, Ralph Weischedel. 1999. Performance measures for information extraction. In Proceedings of the DARPA Broadcast News Workshop, Herndon, VA, February.

C. von Mering, P. Hugenholtz, J. Raes, S.G. Tringe, T. Doerks, L.J. Jensen, N. Ward, P. Bork. 2007. Quantitative phylogenetic assessment of microbial communities in diverse environments. Science, 315(5815):1126-1130.

Metagenome Classification. /metagenomic_classification_tree.cgi

MicrobeWiki. http://microbewiki.kenyon.edu/index.php/MicrobeWiki

Microbial Genomics Program at JGI. http://genome.jgi-psf.org/programs/bacteria-archaea/index.jsf

Microorganisms sequenced at Genoscope. http://www.genoscope.cns.fr/spip/Microorganisms-sequenced-at.html

Frédéric Papazian, Robert Bossy, Claire Nédellec. 2012. AlvisAE: a collaborative Web text annotation editor for knowledge acquisition. In The 6th Linguistic Annotation Workshop (LAW VI), Jeju, Korea.

Miguel Pignatelli, Andrés Moya, Javier Tamames. 2009. EnvDB, a database for describing the environmental distribution of prokaryotic taxa. Environmental Microbiology Reports, 1:198-207.

Prokaryote Genome Projects at NCBI. http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi

Zorana Ratkovic, Wiktoria Golik, Pierre Warnier. 2012. Event extraction of bacteria biotopes: a knowledge-intensive NLP-based approach. BMC Bioinformatics, 13(Suppl 11):S8, June.

Javier Tamames, Victor de Lorenzo. 2010. EnvMine: A text-mining system for the automatic extraction of contextual information. BMC Bioinformatics, 11:294.

James Z. Wang, Zhidian Du, Rapeeporn Payattakool, Philip S. Yu, Chin-Fu Chen. 2007. A New Method to Measure the Semantic Similarity of GO Terms. Bioinformatics, 23:1274-1281.