Browsing Genomes with Ensembl Annotation


• In recent years, large amounts of sequence data have been released
• Raw sequence data are not very useful on their own. They are most valuable when provided with comprehensive, good-quality annotation

Figure: a block of raw, unannotated genomic DNA sequence

The Ensembl project
• The Ensembl project was started in 1999, some years before the draft human genome was completed
• Joint project between the Sanger Institute and the EBI

Goals:
• To provide automated but accurate gene annotation
• Open source
• Integrate this annotation with other available biological data (directly or via the Distributed Annotation System, DAS)
• Both web and programmatic interfaces

Ensembl team
Ensembl: Paul Flicek (EBI), Steve Searle (Sanger Institute)
Software: Glenn Proctor, Andreas Kähäri, Stephen Keenan, Rhoda Kinsella, Eugene Kulesha, Ian Longden, Iliana Toneva
Comparative Genomics: Javier Herrero, Kathryn Beal, Stephen Fitzgerald, Leo Gordon, Matthieu Muffato, Miguel Pignatelli
Functional Genomics: Ian Dunham, Nathan Johnson, Daniel Sobral, Steven Wilder
Variation: Fiona Cunningham, Laurent Gil, Pontus Larsson, Will McLaren, Graham Ritchie
Analysis and Annotation: Jan-Hinnerck Vogel, Bronwen Aken, Susan Fairley, Thibaut Hourlier, Magali Ruffier, Simon White, Amy Tang, Amonida Zadissa
Web Team: Anne Parker, Ridwan Amode, Simon Brent, Maurice Hendrix, Bethan Pritchard, Steve Trevanion (VEGA)
Outreach: Xosé M Fernández, Jeff Almeida-King, Bert Overduin, Michael Schuster (QC), Giulietta Spudich, Jana Vandrovcova
Systems & Support: Guy Coates, James Beal, Gen-Tao Chiang, Peter Clapham, Simon Kelley, Shelley Goddard, Tracy Mumford, Kerry Smith
Research: Benoît Ballester, Petra Catalina Schwalie, André Faure, Markus Fritz, Damian Keefe, Alison Meynert, Dace Ruklisa, Mikhail Spivakov, David Thybert, Sander Timmer, Albert Vilella
Vertebrate Genomics: Chao-Kung Chen, Laura Clarke, Jonathan Hinton, Zam Iqbal, Vasudev Kumanduri, Ilkka Lappalainen, Edoardo Marcora, Pablo Marín, Damian Smedley, Richard Smth, Phil Wilkinson, Holly Zheng-Bradley
Ensembl Genomes: Paul Kersey, Paul Derwent, Matthias Haimel, Alan Horne, Arnaud Kerhornou, Uma Maheswari, Michael Nuhn, Dan Staines, Andy Yates
VectorBase: Dan Lawson, Gautier Koscielny, Karyn Megy
Zebrafish: Kerstin Howe, Kim Brugger, Will Chow, Britt Reimholz, James Torrance
Ensembl Strategy: Ewan Birney, Richard Durbin, Tim Hubbard

Species in Ensembl
• Ensembl focuses on vertebrates, more than 50 species available

Extending the taxonomic space

Data in Ensembl
Core data:
• Genomic sequence
• Gene / transcript / protein models
• External references
• Mapped cDNAs, proteins, microarray probes, BAC clones, cytogenetic bands, repeats, markers etc.
Comparative data:
• Orthologs and paralogs, protein families, whole genome alignments, syntenic regions
Variation data:
• Sequence variants, structural variants, phenotypes, linkage disequilibrium
Regulatory data:
• “Best guess” set of regulatory elements

Gene models
• Ensembl genes: automatic annotation
  • Genome-wide determination using the Ensembl genebuild pipeline, based on biological evidence
  • Highly consistent, relatively frequently updated
  • Protein sequences: UniProt Knowledge Base (UniProtKB)
  • cDNA sequences: INSDC (ENA, GenBank and DDBJ)
• Havana genes: manual curation
  • Reviewed determination, hand-checked on a case-by-case basis
  • Highly accurate, but very labour intensive
  • Protein sequences: UniProt Knowledge Base (UniProtKB)
  • cDNA sequences: INSDC (ENA, GenBank and DDBJ)
  • EST sequences: INSDC (ENA, GenBank and DDBJ)

GENCODE gene set
• Ensembl-Havana merge: the superset of both approaches
• Ensembl transcripts are merged into the Havana transcripts; this requires perfect splice site overlap at each exon boundary
• Both high quality and high accuracy
• Distributed as the GENCODE gene set of the Encyclopedia Of DNA Elements (ENCODE) project (http://www.gencodegenes.org/)
• Participates in the Consensus Coding Sequence Consortium (CCDS)
• The basis of gene trees, orthologue and paralogue predictions

Gene annotation: graphical view
Figure: annotated transcripts for the SMAD2 gene. Labelled elements: chromosomal location; experimental evidence (RNA sequences, protein sequences); assembly (genome sequence built from overlapping contigs); pop-up window with more information and links.

Gene models based on RNA-Seq data
• In-house pipeline development, based on BWA and Exonerate
• Pioneered for gorilla and zebrafish
• Illumina® Human Body Map 2.0 data
• For the moment, only one transcript per gene is created
• Intron features illustrate alternative splicing events
• But further analysis and validation are required
• Future goal: merge with conventional cDNA and protein-based annotation

Figure: Location view of the SLC25A3 gene (8.58 Kb), Ensembl Homo sapiens version 64.37 (GRCh37), Chromosome 12: 98,987,369 - 98,995,946, chromosome band q23.1. Tracks labelled skeletal muscle, lung, liver, brain and blood; CCDS set (CCDS9065.1, CCDS9066.1); Ensembl/Havana ... transcripts SLC25A3-008, -201, -001, -005, -013, -004, -015, -014, -016, -007 and -002 (protein coding) and SLC25A3-006 (nonsense mediated decay).

How to access the data?
www.ensembl.org
uswest.ensembl.org
useast.ensembl.org
archive.ensembl.org
pre.ensembl.org

Location view
Information is displayed in different views:
• Species summary
• Location
• Gene
• Transcript
• Variation
• Regulation
Additional information can be added using view-specific displays and the “Configure this page” button.
Individual “tracks” are shown in each view. Click and drag the mouse to select a region, or use the navigation buttons.

Comparative genomics
Gene trees: homologue prediction
Figure: gene tree with multiple sequence alignment and ideograms. Legend: gene (red), duplication node (red), speciation node (blue), paralogue (blue), collapsed sub-tree.

Location view
Figure: conservation scores, multiple sequence alignment, BLASTZ conservation tracks, tBLAT conservation tracks, multi-species view.

Location view: Synteny
Figure: mouse chromosomes, human chromosome 18, and a list of orthologous genes located within the syntenic region.

Functional genomics
Credits: Darryl Leja (NHGRI), Ian Dunham (EBI)

Regulatory features
Figure: regulatory features in a region, and an example of a promoter-associated region in detail.
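The GENCODE merge rule described under "GENCODE gene set" (an Ensembl transcript is merged into a Havana transcript only with perfect splice site overlap at each exon boundary) can be illustrated with a minimal Python sketch. This is not the Ensembl pipeline code. It assumes transcripts are given as lists of (start, end) exon coordinates on the same chromosome and strand, that only internal exon boundaries must match while outer transcript ends may differ, and that single-exon transcripts merge when they overlap. The function names `intron_chain` and `can_merge` are hypothetical.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive; same chromosome and strand assumed


def intron_chain(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Return internal exon boundaries as (end of exon i, start of exon i+1) pairs."""
    ordered = sorted(exons)
    return [(ordered[i][1], ordered[i + 1][0]) for i in range(len(ordered) - 1)]


def can_merge(ensembl_exons: List[Exon], havana_exons: List[Exon]) -> bool:
    """Hypothetical merge test: every splice site must match exactly.

    Assumption: outer transcript ends (for example UTR extents) may differ;
    only the internal exon boundaries are compared. Single-exon transcripts
    have no splice sites, so they are required to overlap instead.
    """
    if not ensembl_exons or not havana_exons:
        return False
    chain_e = intron_chain(ensembl_exons)
    chain_h = intron_chain(havana_exons)
    if not chain_e and not chain_h:
        (s1, e1), (s2, e2) = ensembl_exons[0], havana_exons[0]
        return s1 <= e2 and s2 <= e1
    return chain_e == chain_h


# Invented coordinates, for illustration only.
havana = [(100, 200), (300, 400), (500, 650)]
ensembl_same_splicing = [(120, 200), (300, 400), (500, 600)]   # only the outer ends differ
ensembl_shifted_site = [(120, 200), (310, 400), (500, 600)]    # one acceptor site shifted

print(can_merge(ensembl_same_splicing, havana))
print(can_merge(ensembl_shifted_site, havana))
```

Comparing intron chains rather than whole exons is one way to express "perfect splice site overlap" while tolerating different transcript ends; a stricter reading of the rule would also compare the outer coordinates.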
Good predictors of promoter regions
• DNase I hypersensitivity sites (marks of accessible chromatin)
• H3K4me3 (histone H3 lysine 4 trimethylation)
• RNA polymerase II recruitment

TFBS annotation
• Position of putative transcription factor (TF) binding sites within the annotated regulatory regions
• Annotated for each TF that has both ChIP-seq data and a publicly available position weight matrix (PWM)
• PWMs are taken from the JASPAR database

Externally curated data
• cisRED: regulatory motifs
• miRanda: miRNA targets
• VISTA: enhancers
• MeDIP and reduced representation bisulphite sequencing (RRBS) from ENCODE: methylation data
• eQTL data

Variation
Types of variation
• Germline
• Somatic
• Large scale (structural)
  • Many show variation in copy number (copy number variants, CNVs)
• Small scale (sequence)
  • Single nucleotide polymorphisms (SNPs)
  • Deletions/insertions (DIPs or indels)

Variation sources
Structural
• Database of Genomic Variants Archive (DGVa) / Database of genomic structural variation (dbVar)
• Several Affymetrix and Illumina arrays
DAS sources:
• DECIPHER (DatabasE of Chromosomal Imbalance and Phenotype in Humans using Ensembl Resources)
• DGV loci (Database of Genomic Variants)
Sequence
• Most sequence variants are imported from the dbSNP repository (imported data: alleles, flanking sequences, frequencies; calculated data: position, synonymous status, amino acid change)
• Several microarrays
• Ensembl variations (from comparison to other individuals/strains/breeds)
• HGMD (Human Gene Mutation Database) Public
• Leiden Open Variation Database
• UniProtKB
• COSMIC (Catalogue of Somatic Mutations in Cancer)

Variation display
Figure: variation display showing phenotype and consequence.
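The TFBS annotation step listed above places putative binding sites using a position weight matrix (PWM). As a rough illustration of how a PWM scores candidate sites, here is a minimal Python sketch. It is not Ensembl's implementation: the count matrix is invented rather than taken from JASPAR, a uniform background of 0.25 and a pseudocount of 1 are assumed, only the forward strand is scanned, and the names `to_log_odds` and `scan` are hypothetical.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical count matrix for a 4-bp motif (one dict per motif position).
# The counts are invented for illustration and are NOT a JASPAR matrix.
COUNTS: List[Dict[str, int]] = [
    {"A": 8, "C": 1, "G": 0, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 1, "C": 0, "G": 9, "T": 0},
    {"A": 0, "C": 1, "G": 0, "T": 9},
]


def to_log_odds(counts: List[Dict[str, int]],
                background: float = 0.25,
                pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Convert per-position base counts into log2 odds against a uniform background."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column[base] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def scan(sequence: str,
         pwm: List[Dict[str, float]],
         threshold: float) -> List[Tuple[int, str, float]]:
    """Slide the PWM along the forward strand; report (0-based start, site, score) hits."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width].upper()
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


pwm = to_log_odds(COUNTS)
for start, site, score in scan("TTACGTGG", pwm, threshold=4.0):
    print(start, site, round(score, 2))
```

In practice the threshold would be chosen per matrix, and the scan would be restricted to the annotated regulatory regions where ChIP-seq supports binding, as the slide describes.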