Using Wormbase, a Genome Biology Resource for Caenorhabditis Elegans and Related Nematodes
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Original Article Text Mining in the Biocuration Workflow: Applications for Literature Curation at Wormbase, Dictybase and TAIR
Database, Vol. 2012, Article ID bas040, doi:10.1093/database/bas040 ............................................................................................................................................................................................................................................................................................. Original article Text mining in the biocuration workflow: applications for literature curation at WormBase, dictyBase and TAIR Kimberly Van Auken1,*, Petra Fey2, Tanya Z. Berardini3, Robert Dodson2, Laurel Cooper4, Donghui Li3, Juancarlos Chan1, Yuling Li1, Siddhartha Basu2, Hans-Michael Muller1, Downloaded from Rex Chisholm2, Eva Huala3, Paul W. Sternberg1,5 and the WormBase Consortium 1Division of Biology, California Institute of Technology, 1200 E. California Boulevard, Pasadena, CA 91125, 2Northwestern University Biomedical Informatics Center and Center for Genetic Medicine, 420 E. Superior Street, Chicago, IL 60611, 3Department of Plant Biology, Carnegie Institution, 260 Panama Street, Stanford, CA 94305, 4Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR 97331 and 5Howard Hughes Medical Institute, California Institute of Technology, 1200 E. California Boulevard, Pasadena, CA 91125, USA http://database.oxfordjournals.org/ *Corresponding author: Tel: +1 609 937 1635; Fax: +1 626 568 8012; Email: [email protected] Submitted 18 June 2012; Revised 30 September 2012; Accepted 2 October 2012 ............................................................................................................................................................................................................................................................................................ -
Unicarbkb: Building a Standardised and Scalable Informatics Platform for Glycosciences Research
Platforms EOI: UniCarbKB: Building a Standardised and Scalable Informatics Platform for Glycosciences Research 20 September 2019 at 13:02 Project title UniCarbKB: Building a Standardised and Scalable Informatics Platform for Glycosciences Research Field of Research code(s) 03 CHEMICAL SCIENCES 06 BIOLOGICAL SCIENCES 08 INFORMATION AND COMPUTING SCIENCES 11 MEDICAL AND HEALTH SCIENCES EOI Lead Name Matthew Campbell EOI lead Research Group Institute for Glycomics EOI lead Organisation Institute for Glycomics, Griffith University EOI lead Email Collaborator details Name Research Group Organisation Malcolm Wolski eResearch Services Griffith University Project description Over the last few years the glycomics knowledge platform UniCarbKB has contributed to the growth of international glycoinformatics resources by providing access to data collections, web services and software applications supported by biocuration activities. However, UniCarbKB has continued to develop as a single monolithic resource. This proposed project will improve the growth and scalability of UniCarbKB by breaking down the platform into smaller and composable pieces. This will be achieved through the adoption of a more decentralised approach and a microservice architecture that will allow components of the platform to scale at different rates. This architecture will allow us is to build an extendable and cross-disciplinary resource by integrating existing data with new analysis tools. The adoption of international resources (e.g. NIH GlyGen), will allow not just mapping of glycan data to genes and proteins but also pathways, gene ontology, publications and a wide variety of data from EBI, NCBI and other sources. Additionally, we propose to integrate glycan array data and develop ontologies that can serve as tools for integration and subsequently analysis of glycan-associated Existing technology Adopt The proposed project will adopt a multi-tier application architecture in which applications will be developed and deployed as microservices. -
The ELIXIR Core Data Resources: Fundamental Infrastructure for The
Supplementary Data: The ELIXIR Core Data Resources: fundamental infrastructure for the life sciences The “Supporting Material” referred to within this Supplementary Data can be found in the Supporting.Material.CDR.infrastructure file, DOI: 10.5281/zenodo.2625247 (https://zenodo.org/record/2625247). Figure 1. Scale of the Core Data Resources Table S1. Data from which Figure 1 is derived: Year 2013 2014 2015 2016 2017 Data entries 765881651 997794559 1726529931 1853429002 2715599247 Monthly user/IP addresses 1700660 2109586 2413724 2502617 2867265 FTEs 270 292.65 295.65 289.7 311.2 Figure 1 includes data from the following Core Data Resources: ArrayExpress, BRENDA, CATH, ChEBI, ChEMBL, EGA, ENA, Ensembl, Ensembl Genomes, EuropePMC, HPA, IntAct /MINT , InterPro, PDBe, PRIDE, SILVA, STRING, UniProt ● Note that Ensembl’s compute infrastructure physically relocated in 2016, so “Users/IP address” data are not available for that year. In this case, the 2015 numbers were rolled forward to 2016. ● Note that STRING makes only minor releases in 2014 and 2016, in that the interactions are re-computed, but the number of “Data entries” remains unchanged. The major releases that change the number of “Data entries” happened in 2013 and 2015. So, for “Data entries” , the number for 2013 was rolled forward to 2014, and the number for 2015 was rolled forward to 2016. The ELIXIR Core Data Resources: fundamental infrastructure for the life sciences 1 Figure 2: Usage of Core Data Resources in research The following steps were taken: 1. API calls were run on open access full text articles in Europe PMC to identify articles that mention Core Data Resource by name or include specific data record accession numbers. -
Hierarchical Classification of Gene Ontology Terms Using the Gostruct
Hierarchical classification of Gene Ontology terms using the GOstruct method Artem Sokolov and Asa Ben-Hur Department of Computer Science, Colorado State University Fort Collins, CO, 80523, USA Abstract Protein function prediction is an active area of research in bioinformatics. And yet, transfer of annotation on the basis of sequence or structural similarity remains widely used as an annotation method. Most of today's machine learning approaches reduce the problem to a collection of binary classification problems: whether a protein performs a particular function, sometimes with a post-processing step to combine the binary outputs. We propose a method that directly predicts a full functional annotation of a protein by modeling the structure of the Gene Ontology hierarchy in the framework of kernel methods for structured-output spaces. Our empirical results show improved performance over a BLAST nearest-neighbor method, and over algorithms that employ a collection of binary classifiers as measured on the Mousefunc benchmark dataset. 1 Introduction Protein function prediction is an active area of research in bioinformatics [25]; and yet, transfer of annotation on the basis of sequence or structural similarity remains the standard way of assigning function to proteins in newly sequenced organisms [20]. The Gene Ontology (GO), which is the current standard for annotating gene products and proteins, provides a large set of terms arranged in a hierarchical fashion that specify a gene-product's molecular function, the biological process it is involved in, and its localization to a cellular component [12]. GO term prediction is therefore a hierarchical classification problem, made more challenging by the thousands of annotation terms available. -
Bioinformatics Study of Lectins: New Classification and Prediction In
Bioinformatics study of lectins : new classification and prediction in genomes François Bonnardel To cite this version: François Bonnardel. Bioinformatics study of lectins : new classification and prediction in genomes. Structural Biology [q-bio.BM]. Université Grenoble Alpes [2020-..]; Université de Genève, 2021. En- glish. NNT : 2021GRALV010. tel-03331649 HAL Id: tel-03331649 https://tel.archives-ouvertes.fr/tel-03331649 Submitted on 2 Sep 2021 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. THÈSE Pour obtenir le grade de DOCTEUR DE L’UNIVERSITE GRENOBLE ALPES préparée dans le cadre d’une cotutelle entre la Communauté Université Grenoble Alpes et l’Université de Genève Spécialités: Chimie Biologie Arrêté ministériel : le 6 janvier 2005 – 25 mai 2016 Présentée par François Bonnardel Thèse dirigée par la Dr. Anne Imberty codirigée par la Dr/Prof. Frédérique Lisacek préparée au sein du laboratoire CERMAV, CNRS et du Computer Science Department, UNIGE et de l’équipe PIG, SIB Dans les Écoles Doctorales EDCSV et UNIGE Etude bioinformatique des lectines: nouvelle classification et prédiction dans les génomes Thèse soutenue publiquement le 8 Février 2021, devant le jury composé de : Dr. Alexandre de Brevern UMR S1134, Inserm, Université Paris Diderot, Paris, France, Rapporteur Dr. -
Gene Ontology Analysis with Cytoscape
Introduction to Systems Biology Exercise 6: Gene Ontology Analysis Overview: Gene Ontology (GO) is a useful resource in bioinformatics and systems biology. GO defines a controlled vocabulary of terms in biological process, molecular function, and cellular location, and relates the terms in a somewhat-organized fashion. This exercise outlines the resources available under Cytoscape to perform analyses for the enrichment of gene annotation (GO terms) in networks or sub-networks. In this exercise you will: • Learn how to navigate the Cytoscape Gene Ontology wizard to apply GO annotations to Cytoscape nodes • Learn how to look for enriched GO categories using the BiNGO plugin. What you will need: • The BiNGO plugin, developed by the Computational Biology Division, Dept. of Plant Systems Biology, Flanders Interuniversitary Institute for Biotechnology (VIB), described in Maere S, et al. Bioinformatics. 2005 Aug 15;21(16):3448-9. Epub 2005 Jun 21. • galFiltered.sif used in the earlier exercises and found under sampleData. Please download any missing files from http://www.cbs.dtu.dk/courses/27041/exercises/Ex3/ Download and install the BiNGO plugin, as follows: 1. Get BiNGO from the course page (see above) or go to the BiNGO page at http://www.psb.ugent.be/cbd/papers/BiNGO/. (This site also provides documentation on BiNGO). 2. Unzip the contents of BiNGO.zip to your Cytoscape plugins directory (make sure the BiNGO.jar file is in the plugins directory and not in a subdirectory). 3. If you are currently running Cytoscape, exit and restart. First, we will load the Gene Ontology data into Cytoscape. 1. Start Cytoscape, under Edit, Preferences, make sure your default species, “defaultSpeciesName”, is set to “Saccharomyces cerevisiae”. -
Cryptic Inoviruses Revealed As Pervasive in Bacteria and Archaea Across Earth’S Biomes
ARTICLES https://doi.org/10.1038/s41564-019-0510-x Corrected: Author Correction Cryptic inoviruses revealed as pervasive in bacteria and archaea across Earth’s biomes Simon Roux 1*, Mart Krupovic 2, Rebecca A. Daly3, Adair L. Borges4, Stephen Nayfach1, Frederik Schulz 1, Allison Sharrar5, Paula B. Matheus Carnevali 5, Jan-Fang Cheng1, Natalia N. Ivanova 1, Joseph Bondy-Denomy4,6, Kelly C. Wrighton3, Tanja Woyke 1, Axel Visel 1, Nikos C. Kyrpides1 and Emiley A. Eloe-Fadrosh 1* Bacteriophages from the Inoviridae family (inoviruses) are characterized by their unique morphology, genome content and infection cycle. One of the most striking features of inoviruses is their ability to establish a chronic infection whereby the viral genome resides within the cell in either an exclusively episomal state or integrated into the host chromosome and virions are continuously released without killing the host. To date, a relatively small number of inovirus isolates have been extensively studied, either for biotechnological applications, such as phage display, or because of their effect on the toxicity of known bacterial pathogens including Vibrio cholerae and Neisseria meningitidis. Here, we show that the current 56 members of the Inoviridae family represent a minute fraction of a highly diverse group of inoviruses. Using a machine learning approach lever- aging a combination of marker gene and genome features, we identified 10,295 inovirus-like sequences from microbial genomes and metagenomes. Collectively, our results call for reclassification of the current Inoviridae family into a viral order including six distinct proposed families associated with nearly all bacterial phyla across virtually every ecosystem. -
To Find Information About Arabidopsis Genes Leonore Reiser1, Shabari
UNIT 1.11 Using The Arabidopsis Information Resource (TAIR) to Find Information About Arabidopsis Genes Leonore Reiser1, Shabari Subramaniam1, Donghui Li1, and Eva Huala1 1Phoenix Bioinformatics, Redwood City, CA USA ABSTRACT The Arabidopsis Information Resource (TAIR; http://arabidopsis.org) is a comprehensive Web resource of Arabidopsis biology for plant scientists. TAIR curates and integrates information about genes, proteins, gene function, orthologs gene expression, mutant phenotypes, biological materials such as clones and seed stocks, genetic markers, genetic and physical maps, genome organization, images of mutant plants, protein sub-cellular localizations, publications, and the research community. The various data types are extensively interconnected and can be accessed through a variety of Web-based search and display tools. This unit primarily focuses on some basic methods for searching, browsing, visualizing, and analyzing information about Arabidopsis genes and genome, Additionally we describe how members of the community can share data using TAIR’s Online Annotation Submission Tool (TOAST), in order to make their published research more accessible and visible. Keywords: Arabidopsis ● databases ● bioinformatics ● data mining ● genomics INTRODUCTION The Arabidopsis Information Resource (TAIR; http://arabidopsis.org) is a comprehensive Web resource for the biology of Arabidopsis thaliana (Huala et al., 2001; Garcia-Hernandez et al., 2002; Rhee et al., 2003; Weems et al., 2004; Swarbreck et al., 2008, Lamesch, et al., 2010, Berardini et al., 2016). The TAIR database contains information about genes, proteins, gene expression, mutant phenotypes, germplasms, clones, genetic markers, genetic and physical maps, genome organization, publications, and the research community. In addition, seed and DNA stocks from the Arabidopsis Biological Resource Center (ABRC; Scholl et al., 2003) are integrated with genomic data, and can be ordered through TAIR. -
Learning Protein Constitutive Motifs from Sequence Data Je´ Roˆ Me Tubiana, Simona Cocco, Re´ Mi Monasson*
TOOLS AND RESOURCES Learning protein constitutive motifs from sequence data Je´ roˆ me Tubiana, Simona Cocco, Re´ mi Monasson* Laboratory of Physics of the Ecole Normale Supe´rieure, CNRS UMR 8023 & PSL Research, Paris, France Abstract Statistical analysis of evolutionary-related protein sequences provides information about their structure, function, and history. We show that Restricted Boltzmann Machines (RBM), designed to learn complex high-dimensional data and their statistical features, can efficiently model protein families from sequence information. We here apply RBM to 20 protein families, and present detailed results for two short protein domains (Kunitz and WW), one long chaperone protein (Hsp70), and synthetic lattice proteins for benchmarking. The features inferred by the RBM are biologically interpretable: they are related to structure (residue-residue tertiary contacts, extended secondary motifs (a-helixes and b-sheets) and intrinsically disordered regions), to function (activity and ligand specificity), or to phylogenetic identity. In addition, we use RBM to design new protein sequences with putative properties by composing and ’turning up’ or ’turning down’ the different modes at will. Our work therefore shows that RBM are versatile and practical tools that can be used to unveil and exploit the genotype–phenotype relationship for protein families. DOI: https://doi.org/10.7554/eLife.39397.001 Introduction In recent years, the sequencing of many organisms’ genomes has led to the collection of a huge number of protein sequences, which are catalogued in databases such as UniProt or PFAM Finn et al., 2014). Sequences that share a common ancestral origin, defining a family (Figure 1A), *For correspondence: are likely to code for proteins with similar functions and structures, providing a unique window into [email protected] the relationship between genotype (sequence content) and phenotype (biological features). -
DECIPHER: Harnessing Local Sequence Context to Improve Protein Multiple Sequence Alignment Erik S
Wright BMC Bioinformatics (2015) 16:322 DOI 10.1186/s12859-015-0749-z RESEARCH ARTICLE Open Access DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment Erik S. Wright1,2 Abstract Background: Alignment of large and diverse sequence sets is a common task in biological investigations, yet there remains considerable room for improvement in alignment quality. Multiple sequence alignment programs tend to reach maximal accuracy when aligning only a few sequences, and then diminish steadily as more sequences are added. This drop in accuracy can be partly attributed to a build-up of error and ambiguity as more sequences are aligned. Most high-throughput sequence alignment algorithms do not use contextual information under the assumption that sites are independent. This study examines the extent to which local sequence context can be exploited to improve the quality of large multiple sequence alignments. Results: Two predictors based on local sequence context were assessed: (i) single sequence secondary structure predictions, and (ii) modulation of gap costs according to the surrounding residues. The results indicate that context-based predictors have appreciable information content that can be utilized to create more accurate alignments. Furthermore, local context becomes more informative as the number of sequences increases, enabling more accurate protein alignments of large empirical benchmarks. These discoveries became the basis for DECIPHER, a new context-aware program for sequence alignment, which outperformed other programs on largesequencesets. Conclusions: Predicting secondary structure based on local sequence context is an efficient means of breaking the independence assumption in alignment. Since secondary structure is more conserved than primary sequence, it can be leveraged to improve the alignment of distantly related proteins. -
UC Davis UC Davis Previously Published Works
UC Davis UC Davis Previously Published Works Title Longer first introns are a general property of eukaryotic gene structure. Permalink https://escholarship.org/uc/item/9j42z8fm Journal PloS one, 3(8) ISSN 1932-6203 Authors Bradnam, Keith R Korf, Ian Publication Date 2008-08-29 DOI 10.1371/journal.pone.0003093 Peer reviewed eScholarship.org Powered by the California Digital Library University of California Longer First Introns Are a General Property of Eukaryotic Gene Structure Keith R. Bradnam*, Ian Korf Genome Center, University of California Davis, Davis, California, United States of America Abstract While many properties of eukaryotic gene structure are well characterized, differences in the form and function of introns that occur at different positions within a transcript are less well understood. In particular, the dynamics of intron length variation with respect to intron position has received relatively little attention. This study analyzes all available data on intron lengths in GenBank and finds a significant trend of increased length in first introns throughout a wide range of species. This trend was found to be even stronger when using high-confidence gene annotation data for three model organisms (Arabidopsis thaliana, Caenorhabditis elegans, and Drosophila melanogaster) which show that the first intron in the 59 UTR is - on average - significantly longer than all downstream introns within a gene. A partial explanation for increased first intron length in A. thaliana is suggested by the increased frequency of certain motifs that are present in first introns. The phenomenon of longer first introns can potentially be used to improve gene prediction software and also to detect errors in existing gene annotations. -
A SARS-Cov-2 Sequence Submission Tool for the European Nucleotide
Databases and ontologies Downloaded from https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab421/6294398 by guest on 25 June 2021 A SARS-CoV-2 sequence submission tool for the European Nucleotide Archive Miguel Roncoroni 1,2,∗, Bert Droesbeke 1,2, Ignacio Eguinoa 1,2, Kim De Ruyck 1,2, Flora D’Anna 1,2, Dilmurat Yusuf 3, Björn Grüning 3, Rolf Backofen 3 and Frederik Coppens 1,2 1Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Ghent, Belgium, 1VIB Center for Plant Systems Biology, 9052 Ghent, Belgium and 2University of Freiburg, Department of Computer Science, Freiburg im Breisgau, Baden-Württemberg, Germany ∗To whom correspondence should be addressed. Associate Editor: XXXXXXX Received on XXXXX; revised on XXXXX; accepted on XXXXX Abstract Summary: Many aspects of the global response to the COVID-19 pandemic are enabled by the fast and open publication of SARS-CoV-2 genetic sequence data. The European Nucleotide Archive (ENA) is the European recommended open repository for genetic sequences. In this work, we present a tool for submitting raw sequencing reads of SARS-CoV-2 to ENA. The tool features a single-step submission process, a graphical user interface, tabular-formatted metadata and the possibility to remove human reads prior to submission. A Galaxy wrap of the tool allows users with little or no bioinformatic knowledge to do bulk sequencing read submissions. The tool is also packed in a Docker container to ease deployment. Availability: CLI ENA upload tool is available at github.com/usegalaxy- eu/ena-upload-cli (DOI 10.5281/zenodo.4537621); Galaxy ENA upload tool at toolshed.g2.bx.psu.edu/view/iuc/ena_upload/382518f24d6d and https://github.com/galaxyproject/tools- iuc/tree/master/tools/ena_upload (development) and; ENA upload Galaxy container at github.com/ELIXIR- Belgium/ena-upload-container (DOI 10.5281/zenodo.4730785) Contact: [email protected] 1 Introduction Nucleotide Archive (ENA).