NATURE|Vol 455|4 September 2008 FEATURE

The future of biocuration

To thrive, the field that links biologists and their data urgently needs structure, recognition and support.

Doug Howe, Seung Yon Rhee et al.

The exponential growth in the amount of biological data means that revolutionary measures are needed for data management, analysis and accessibility. Online databases have become important avenues for publishing biological data. Biocuration, the activity of organizing, representing and making biological information accessible to both humans and computers, has become an essential part of biological discovery and biomedical research. But curation increasingly lags behind data generation in funding, development and recognition.

We propose three urgent actions to advance this key field. First, authors, journals and curators should immediately begin to work together to facilitate the exchange of data between journal publications and databases. Second, in the next five years, curators, researchers and university administrations should develop an accepted recognition structure to facilitate community-based curation efforts. Third, curators, researchers, academic institutions and funding agencies should, in the next ten years, increase the visibility and support of scientific curation as a professional career.

Failure to address these three issues will cause the available curated data to lag farther behind current biological knowledge. Researchers will observe an increasing occurrence of obvious gaps in knowledge. As these gaps expand, resources will become less effective for generating and testing hypotheses, and the usefulness of curated data will be seriously compromised.

When all the data produced or published are curated to a high standard and made accessible as soon as they become available, biological research will be conducted in a manner that is quite unlike the way it is done now. Researchers will be able to process massive amounts of complex data much more quickly. They will garner insight about the areas of their interest rapidly with the help of inference programs. Digesting information and generating hypotheses at the computer screen will be so much faster that researchers will get back to the bench quickly for more experiments. Experiments will be designed with more insight; this increased specificity will cause an exponential growth in knowledge, much as we are experiencing exponential growth in data today.

Data avalanche

Biology, like most scientific disciplines, is in an era of accelerated information accrual and scientists increasingly depend on the availability of each other's data. Large-scale sequencing centres, high-throughput analytical facilities and individual laboratories produce vast amounts of data such as nucleotide and protein sequences, protein crystal structures, gene-expression measurements, protein and genetic interactions and phenotype studies. By July 2008, more than 18 million articles had been indexed in PubMed and nucleotide sequences from more than 260,000 organisms had been submitted to GenBank (refs 1, 2). The recently announced project to sequence 1,000 human genomes in three years to reveal DNA polymorphisms (www.1000genomes.org) is a tip of the data iceberg.

Such data, produced at great effort and expense, are only as useful as researchers' ability to locate, integrate and access them. In recent years, this challenge has been met by a growing cadre of biologists, 'biocurators', who manage raw biological data, extract information from published literature, develop structured vocabularies to tag data and make the information available online (ref. 3; Box 1). In the past decade, it has become second nature for biologists to visit websites to obtain data for further analysis or integration with local resources. Our survey of several well-curated databases (nine model-organism databases, UniProt and Protein Data Bank) showed that nearly 750,000 visitors (unique IP addresses) viewed more than 20 million pages in just one month (March 2008, Eva Huala, Peter Rose, Rolf Apweiler, personal communications).

Despite the essential part that it plays in today's research, biocuration has been slow to develop. To provide a forum for the exchange of ideas and methods, and to facilitate collaborations and training, more than 150 biocurators met at two international conferences and created a mailing list and a website (www.biocurator.org). These meetings and discussions have honed in on the three actions, outlined above and elaborated on below, that must now be addressed to ensure scientists' continued access to the high-quality data on which their research depends.

Box 1 | The role of biocurators
● To extract knowledge from published papers
● To connect information from different sources in a coherent and comprehensible way
● To inspect and correct automatically predicted gene structures and protein sequences to provide high-quality proteomes
● To develop and manage structured controlled vocabularies that are crucial for data relations and the logical retrieval of large data sets
● To integrate knowledge bases to represent complex systems such as metabolic pathways and protein-interaction networks
● To correct inconsistencies and errors in data representation
● To help data users to render their research more productive in a timely manner
● To steer the design of web-based resources
● To interact with researchers to facilitate direct data submissions to databases

Come together

Extracting, tagging with controlled vocabularies, and representing data from the literature, are some of the most important and time-consuming tasks in biocuration. Curated information from the literature serves as the gold-standard data set for computational analysis, quality assessment of high-throughput data and benchmarking of data-mining algorithms. Meanwhile, the boundaries of the biological domain that researchers study are widening rapidly, so researchers need faster and more reliable ways to understand unfamiliar domains. This too is facilitated by literature curation.

Typically, biocurators read the full text of articles and transfer the essence into a database. For a paper about the molecular biology of a particular gene, process or pathway, such information might include gene-expression patterns, mutant phenotypes, results of biochemical assays, protein-complex membership and the authors' inferences about the functions and roles of the gene products studied. As each paper uses different experimental and analysis methods, capturing this information in a consistent fashion requires intensive thought and effort. Limited resources and staff mean that most curation groups can't keep up with all the relevant literature.

How information is presented in the literature greatly affects how fast biocurators can identify and curate it. Papers still often report newly cloned genes without providing GenBank IDs or the species from which the genes were cloned. The entities discussed in a paper, including species, genes, proteins, genotypes and phenotypes must be unambiguously identified during curation. For example, using the HUGO Gene Nomenclature Committee resource (www.genenames.org), we find that the human gene CDKN2A has ten literature-based synonyms. One of those, p14, is [...]

[...] discussed; and descriptions of species, strains, cell types and genotypes used. Examples of sources for this information are listed in Table 1. This would accelerate literature curation, uphold information integrity, facilitate the proper linkage of data to other resources and support automated mining of data from papers. Another model is for authors to provide a 'structured digital abstract', a [...]

[...] offers software to assist in preparation and validation of such crystallographic data (ref. 14). An analogous system to help authors identify, tag and validate the crucial basic information in their research reports before publication would accelerate the automated linkage of literature to key records in existing databases and improve the accuracy of the published data.

In short, authors and publishers must use the existing publication infrastructure to facilitate literature curation much more to the benefit of all parties.

Community curation

Curation of large-scale genomics and post-genomics data enjoys no such luxury of 'an existing publication infrastructure' to leverage, although emerging standards of data reporting are promising (refs 4–9). Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation. This transition will require annotation tools, standardized methods, oversight by expert curators and a combination of social infrastructure, tool development, training and feedback. Biocurators are especially important for establishing such an infrastructure and training to maintain consistency and accuracy.

To date, not much of the research community is rolling up its sleeves to annotate. What will be the tipping point? The main limitation in community annotation is the perceived lack of [...]