A Bioinformatics Approach to MicroRNA-Sequencing Analysis

Pratibha Potla a,b,1, Shabana Amanda Ali c,**,1, Mohit Kapoor a,b,d,*

a Schroeder Arthritis Institute, University Health Network, Toronto, Ontario, Canada
b Krembil Research Institute, University Health Network, Toronto, Ontario, Canada
c Bone and Joint Center, Department of Orthopaedic Surgery, Henry Ford Health System, Detroit, MI, USA
d Department of Surgery and Department of Laboratory Medicine and Pathobiology, University of Toronto, Ontario, Canada

* Corresponding author. Schroeder Arthritis Institute, Krembil Research Institute, University Health Network, 60 Leonard Avenue, Toronto, Ontario, M5T 2S8, Canada.
** Corresponding author. Bone and Joint Center, Department of Orthopaedic Surgery, Henry Ford Health System, 6135 Woodward Avenue, Detroit, MI, 48202, USA.
E-mail addresses: [email protected] (S.A. Ali), [email protected] (M. Kapoor).
1 PP and SAA share equal first author contribution.

Experimental Protocol. Osteoarthritis and Cartilage Open 3 (2021) 100131. https://doi.org/10.1016/j.ocarto.2020.100131
Journal homepage: www.elsevier.com/journals/osteoarthritis-and-cartilage-open/2665-9131
Received 12 December 2020; Accepted 14 December 2020
2665-9131/© 2021 Osteoarthritis Research Society International (OARSI). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Also deposited in: Henry Ford Health System Scholarly Commons, Orthopaedics Articles, Orthopaedics / Bone and Joint Center, 3-1-2021 (https://scholarlycommons.henryford.com/orthopaedics_articles)

Keywords: High-throughput nucleotide sequencing; Computational biology; MicroRNAs; Bioinformatics; Osteoarthritis

ABSTRACT

The rapid expansion of Next Generation Sequencing (NGS) data availability has made exploration of appropriate bioinformatics analysis pipelines a timely issue. Since there are multiple tools and combinations thereof to analyze any dataset, there can be uncertainty in how to best perform an analysis in a robust and reproducible manner. This is especially true for newer omics applications, such as miRNomics, or microRNA-sequencing (miRNA-sequencing). As compared to transcriptomics, there have been far fewer miRNA-sequencing studies performed to date, and those that are reported seldom provide detailed description of the bioinformatics analysis, including aspects such as Unique Molecular Identifiers (UMIs). In this article, we attempt to fill the gap and help researchers understand their miRNA-sequencing data and its analysis. This article will specifically discuss a customizable miRNA bioinformatics pipeline that was developed using miRNA-sequencing datasets generated from human osteoarthritis plasma samples. We describe quality assessment of raw sequencing data files, reference-based alignment, counts generation for miRNA expression levels, and novel miRNA discovery. This report is expected to improve clarity and reproducibility of the bioinformatics portion of miRNA-sequencing analysis, applicable across any sample type, to promote sharing of detailed protocols in the NGS field.

1. Introduction

Next Generation Sequencing (NGS) technology has revolutionized the study of human genetic code, enabling a fast, reliable, and cost-effective method for reading the genome. Whereas "first generation" sequencing involved sequencing one molecule at a time, NGS involves sequencing multiple molecules in parallel [1–3]. This advance has reduced the time and cost per base that is sequenced, and has expanded sequencing applications, which now include microRNAs [4–7]. MicroRNAs (miRNAs) are small RNAs of 22–25 base length, regulating gene expression through degradation of mRNA transcripts and inhibition of translation [8]. MiRNAs have emerged as critical regulators of health and disease, and when found in circulation, represent promising biomarkers given their stability, specificity, and ease of detection and quantification [9].

By providing a quantitative readout of all molecules of interest in a sample without relying on endogenous controls or pre-selected probes as do real-time PCR and microarray approaches, NGS has emerged as the gold standard approach for profiling nucleic acid, including miRNAs. Detecting single-nucleotide sequence changes or altogether novel sequences are added advantages of sequencing [10]. As a result, sequencing has the capacity to identify molecules with greater sensitivity, specificity, and predictive ability for detecting disease [11]. For these reasons, sequencing has been applied to biomarker discovery for a variety of diseases, but not without limitations. There are several sources of error that can be introduced during a sequencing experiment. Among these, the patient cohort may be underpowered [12]; sample extraction, library preparation, and sequencing may create bias that leads to over- or under-estimation of the expression level of a molecule or subset of molecules [13]; or a one-size-fits-all approach may be inappropriately applied to data analysis. To harness the potential of NGS to identify miRNAs as biomarkers, including novel miRNAs, a rigorous approach that overcomes existing limitations is needed [14]. This report focuses on the data analysis aspect, where a rigorous methodology for bioinformatics analysis of miRNA-sequencing data has been developed and applied to identify miRNAs in plasma samples from osteoarthritis patients [15].

Here we focus on two major advantages of miRNA-sequencing: the discovery of novel miRNAs, and the use of unique molecular identifiers (UMIs). A novel miRNA is predicted based on secondary structure and lack of homology with miRNAs in other species [16]. Novel miRNAs represent promise in precision medicine approaches given their potential specificity to disease states. Given this potential biological importance, we have developed and tested a method for discovery of novel miRNA sequences that are present in miRNA-sequencing data. In addition to novel miRNA discovery, our pipeline includes analysis of UMIs. During library preparation prior to amplification and sequencing, UMIs are added to each miRNA transcript. Following sequencing, UMI reads are collapsed such that the counts per miRNA remaining are more representative of the original starting sample prior to amplification. This is an internal control for managing library amplification bias, enabling accurate miRNA quantitation. While previous studies reporting miRNA-sequencing analysis may have incorporated UMI analysis, this level of detail is often not reported, nor is the method used to execute UMI analysis. Examples of available software which enable miRNA-sequencing analysis, but not UMI processing, include CAP-miRSeq and miRge [17,18]. Other software, such as TRUmiCount, handles UMI processing, and integrates the same UMI-tools software as we describe in our pipeline [19]. Yet other software, like sRNABench and sRNAtoolbox, provide a similar pipeline but the UMI processing is available only on the web-server mode and not standalone version, which is not secure for analyzing data generated from patient samples [20]. To overcome these limitations in the field, we put forth a detailed protocol for analysis of miRNA-sequencing data, including quality control, alignment, demultiplexing, UMI analysis, and novel miRNA analysis.

It is our aim to establish [...] sufficient expertise in bioinformatics methods to understand the steps that need to be taken when analyzing a miRNA-sequencing dataset. It will also benefit bioinformaticians who have not previously worked with miRNA-sequencing data, given that this approach is relatively new as compared to more established sequencing approaches such as DNA-sequencing and RNA-sequencing. Furthermore, having a standardized protocol will promote integration of research findings from different groups, consistent with the efforts of established guidelines such as 'Minimum Information about a high-throughput Nucleotide SEQuencing Experiment' (MINSEQE, http://fged.org/projects/minseqe/) and 'Encyclopedia of DNA Elements' (ENCODE) pipelines (https://www.encodeproject.org/microrna/microrna-seq/). We leverage only open source software in our pipeline, offering customizable scripts for more advanced users. Having applied this pipeline and identified a unique signature of 11 circulating miRNAs in early knee osteoarthritis, we present the pipeline in sufficient detail to be replicated and widely used by others for the bioinformatics analysis of miRNA-sequencing data [15].

2. Overview of miRNA NGS analysis pipeline

There is more than one way to analyze miRNA-sequencing data, so here we present the approach we determined to be most suitable for bioinformatics analysis of miRNA-sequencing data generated from human plasma samples. Fig. 1 depicts an overview of the pipeline in its entirety, including: Prerequisite sequencing quality checks, Alignment steps, and Novel miRNA analysis. The first section begins with assessing the quality of the raw sequencing data, which is crucial to defining the path of downstream data processing. The second section involves read mapping and populating the UMI-based miRNA expression table for all samples in an experiment. This section represents the core of analysis.
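
To make the idea of UMI collapsing and a UMI-based expression table more concrete, the following is a minimal Python sketch and not the pipeline's actual code. It assumes each aligned read has already been reduced to a (miRNA, UMI) pair, and it treats reads with an identical miRNA and UMI as amplification duplicates of one original molecule. The function names, sample names, and UMI sequences are hypothetical, chosen only for illustration. Dedicated tools such as UMI-tools additionally account for sequencing errors within the UMI, which this exact-match sketch does not.

```python
from collections import defaultdict


def umi_collapsed_counts(aligned_reads):
    """Count distinct UMIs per miRNA.

    aligned_reads: iterable of (mirna_id, umi) tuples, one per aligned read.
    Reads sharing both miRNA and UMI are assumed to be PCR duplicates of a
    single starting molecule, so they contribute one count in total.
    """
    umis_per_mirna = defaultdict(set)
    for mirna_id, umi in aligned_reads:
        umis_per_mirna[mirna_id].add(umi)
    return {mirna: len(umis) for mirna, umis in umis_per_mirna.items()}


def expression_table(samples):
    """Build a miRNA-by-sample table of UMI-collapsed counts.

    samples: dict mapping sample name -> iterable of (mirna_id, umi) tuples.
    Returns a dict of dicts: table[mirna_id][sample_name] -> count,
    with 0 for miRNAs not observed in a sample.
    """
    per_sample = {name: umi_collapsed_counts(reads)
                  for name, reads in samples.items()}
    mirnas = sorted({m for counts in per_sample.values() for m in counts})
    return {m: {s: per_sample[s].get(m, 0) for s in samples} for m in mirnas}


# Toy input (hypothetical sample names and UMI sequences).
samples = {
    "plasma_1": [
        ("hsa-miR-21-5p", "ACGTACGTACGT"),
        ("hsa-miR-21-5p", "ACGTACGTACGT"),  # duplicate of the read above
        ("hsa-miR-21-5p", "TTGACCATGGCA"),
        ("hsa-miR-16-5p", "GGCATTACGGTA"),
    ],
    "plasma_2": [
        ("hsa-miR-21-5p", "ACGTACGTACGT"),
        ("hsa-miR-16-5p", "GGCATTACGGTA"),
        ("hsa-miR-16-5p", "GGCATTACGGTA"),  # duplicate of the read above
        ("hsa-miR-16-5p", "CCATGGTTAACG"),
    ],
}

table = expression_table(samples)
for mirna, row in table.items():
    print(mirna, row)
```

In this toy input, hsa-miR-21-5p has three reads in plasma_1 but only two distinct UMIs, so its collapsed count is 2 rather than 3. This is the sense in which collapsed counts are closer to the pre-amplification sample.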