ECO, the Evidence & Conclusion Ontology
Total Page:16
File Type:pdf, Size:1020Kb
D1186–D1194 Nucleic Acids Research, 2019, Vol. 47, Database issue Published online 8 November 2018 doi: 10.1093/nar/gky1036 ECO, the Evidence & Conclusion Ontology: community standard for evidence information Michelle Giglio1,*, Rebecca Tauber1, Suvarna Nadendla1, James Munro1, Dustin Olley1, Shoshannah Ball1, Elvira Mitraka1, Lynn M. Schriml 1, Pascale Gaudet 2, Elizabeth T. Hobbs3, Ivan Erill3, Deborah A. Siegele4, James C. Hu5, Chris Mungall6 and Marcus C. Chibucos1 1Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA, 2Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, 3Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD 21250, USA, 4Department of Biology, Texas A&M University, College Station, TX 77840, USA, 5Department of Biochemistry and Biophysics, Texas A&M University, College Station, TX 77840, USA and 6Molecular Ecosystems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA Received September 14, 2018; Revised October 12, 2018; Editorial Decision October 15, 2018; Accepted October 16, 2018 ABSTRACT biomolecular sequence, phenotype, neuroscience, behav- ioral, medical and other data. The process of biocuration The Evidence and Conclusion Ontology (ECO) con- involves making one or more descriptive statements about tains terms (classes) that describe types of evi- a biological/biomedical entity, for example that a protein dence and assertion methods. ECO terms are used performs a particular function, that a specific position in in the process of biocuration to capture the evidence DNA is methylated, that a particular drug causes an adverse that supports biological assertions (e.g. gene prod- effect or that bat wings and human arms are homologous. uct X has function Y as supported by evidence Z). Such statements, or assertions, are represented as annota- Capture of this information allows tracking of an- tions in a database by representing the entities and asser- notation provenance, establishment of quality con- tions using unique identifiers, controlled vocabularies and trol measures and query of evidence. ECO con- other systematic tools. Since knowing the scientific basis for tains over 1500 terms and is in use by many lead- an assertion is of vital importance, a fundamental part of the curation process is to document the evidence (2,3). Ev- ing biological resources including the Gene Ontol- idence comes from many sources including, but not limited ogy, UniProt and several model organism databases. to, field observations, high-throughput sequencing assays, ECO is continually being expanded and revised sequence orthology determination, clinical notes, mutage- based on the needs of the biocuration commu- nesis experiments and enzymatic activity assays. The Ev- nity. The ontology is freely available for download idence and Conclusion Ontology (ECO) provides a con- from GitHub (https://github.com/evidenceontology/) trolled vocabulary for the systematic capture of evidence in- or the project’s website (http://evidenceontology. formation. org/). Users can request new terms or changes to ex- An annotation statement consists of at least three ele- isting terms through the project’s GitHub site. ECO ments: the object being annotated, an aspect of the object is released into the public domain under CC0 1.0 Uni- that is being asserted and the evidence supporting the as- versal. sertion. The Gene Ontology (GO) (4,5) is in common use for capture of three aspects of gene products: molecular function, biological process and location in cellular compo- INTRODUCTION nent. To illustrate an example annotation (Figure 1), imag- ine a paper is published that experimentally characterizes Biocuration is the process whereby information about bi- the function of Protein X. A curator reads the paper and ological entities is collected and stored. Ideally, this is assigns a GO function term to Protein X based on the exper- done electronically, in a standardized format, and kept in imental evidence in the paper. This information is captured a central biological data repository so that it may be eas- with appropriate GO and ECO terms and deposited in a ily accessed by the research community (1). Over the past repository accessible by others. This is an example of biocu- few decades, many such databases have been built to host *To whom correspondence should be addressed. Tel: 410 706 7694; Fax: 410 706 1482; Email: [email protected] Present address: Elvira Mitraka, Computer Science, Dalhousie University, Halifax, NS, Canada. C The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Nucleic Acids Research, 2019, Vol. 47, Database issue D1187 Figure 1. Two annotation examples. In Annotation Event #1, the annotation is made by a curator who has read a published experimental characterization of a protein. In Annotation Event #2, the annotation is made in a transitive manor based on sequence similarity. ration from the literature. Now, further imagine that a cu- is required that the matching protein must itself be experi- rator has primary sequence data for Protein Y, but no pub- mentally characterized. The GO has established such rules lications describing its function. To gain information that (6). might help the curator figure out the function of the pro- Since evidence comes in many flavors and its consistent tein, they search Protein Y against a database of annotated storage is of such importance to effective annotation, a sys- proteins using a sequence alignment tool. The curator finds tematic way to organize types of evidence and apply them that Protein Y has a strong match to Protein X. The curator in annotations is essential. An ontology is the best solution can then transitively assign the function from Protein X to to this challenge. An ontology is a controlled vocabulary of Protein Y based on their sequence similarity, a type of evi- terms (or classes) related to each other in defined ways. The dence. Figure 1 illustrates these two annotation examples. ECO (2) provides terms that are used to represent evidence Biological assertions/annotations are routinely used for information. ECO has become the community standard for hypothesis generation as well as multiple kinds of bioinfor- evidence and is used by multiple major data resources such matic analyses and thus can have far-reaching downstream as UniProt (7)andtheGO(4,5). Here we provide an update effects. The ability to fully query and track all aspects of on the status of ECO, our current areas of focus and future each assertion is important and serves numerous useful pur- plans. poses. For example, queries of annotation databases allow one to know which types and sources of evidence were used for annotations, either individually or in aggregate. In ad- CURRENT STATUS OF THE ONTOLOGY dition, since many annotations are made in a transitive way ECO is structured around two root classes: ‘evidence’ and (i.e. by propagating the annotation from one biological en- ‘assertion method’. Terms describing types of evidence are tity to another biological entity based on some determina- grouped under ‘evidence’ which is defined as ‘a type of in- tion of similarity) evidence documentation allows the track- formation that is used to support an assertion’. All of the ing of provenance of information and chains of annotation. first level subclasses of this term can be seen in Figure 2. Thus, if one link in that chain is later shown to be invalid The second root class, ‘assertion method’, provides a mech- (for example, a paper characterizing the function of a pro- anism for recording if a particular assertion was made by tein is found to be in error) it is then possible to quickly a human (i.e. ‘manual assertion’) or if it was done com- find all assertions that were based on that erroneous data pletely in an automated fashion (i.e. ‘automatic assertion’). and remove or correct them. Further, representing evidence Whereas, the ‘evidence’ branch of ECO contains dozens information in a computationally friendly manner allows of high-level terms with hundreds of children, the ‘asser- the establishment and enforcement of curation rules to pro- tion method’ branch at present contains just two main sub- mote and ensure best practices in annotation. For example, classes. a group could establish that if a function is asserted for a Terms from the ‘evidence’ and ‘assertion method’ nodes protein based on sequence similarity to another protein, it have been combined to create cross-product terms that de- D1188 Nucleic Acids Research, 2019, Vol. 47, Database issue Figure 4. Logical definitions link ECO, OBI and GO terms together. Figure 2. Tree view of ECO terms. This tree view shows all of the first level children of the root node ‘evidence’, showing the variety of types of ev- idence. In addition, the ‘sequence alignment evidence’ node is expanded Of these, 188 have logical definitions that link out to other showing its parentage back to the root and its two assertion method- vocabularies, such as GO and the Ontology for Biomedical specific children. Investigations (OBI) (11), and 425 terms have logical def- initions that link ECO evidence classes to ECO assertion method classes. An example of one of these logical defini- tions with linkages to both GO and OBI is found in Figure 4. DEVELOPMENT OF THE ONTOLOGY Development format and tools Originally, ECO was developed in the Open Biological and Biomedical Ontologies (OBO) format using the OBO-Edit tool (12). However, neither the OBO format nor the OBO- Edit tool were amenable to achieve the level of expressive- ness that is needed for the complex logical definitions and relationships within ECO. Therefore, in 2016 we shifted to the development of ECO directly in the Web Ontology Lan- guage (OWL).