A Bioinformatics Approach to MicroRNA-Sequencing Analysis

Henry Ford Health System Scholarly Commons, Orthopaedics Articles, Orthopaedics / Bone and Joint Center, 3-1-2021

Osteoarthritis and Cartilage Open 3 (2021) 100131

Experimental Protocol

A bioinformatics approach to microRNA-sequencing analysis

Pratibha Potla a,b,1, Shabana Amanda Ali c,**,1, Mohit Kapoor a,b,d,*

a Schroeder Arthritis Institute, University Health Network, Toronto, Ontario, Canada
b Krembil Research Institute, University Health Network, Toronto, Ontario, Canada
c Bone and Joint Center, Department of Orthopaedic Surgery, Henry Ford Health System, Detroit, MI, USA
d Department of Surgery and Department of Laboratory Medicine and Pathobiology, University of Toronto, Ontario, Canada

* Corresponding author. Schroeder Arthritis Institute, Krembil Research Institute, University Health Network, 60 Leonard Avenue, Toronto, Ontario, M5T 2S8, Canada.
** Corresponding author. Bone and Joint Center, Department of Orthopaedic Surgery, Henry Ford Health System, 6135 Woodward Avenue, Detroit, MI, 48202, USA.
E-mail addresses: [email protected] (S.A. Ali), [email protected] (M. Kapoor).
1 PP and SAA share equal first author contribution.

https://doi.org/10.1016/j.ocarto.2020.100131
Received 12 December 2020; Accepted 14 December 2020
2665-9131/© 2021 Osteoarthritis Research Society International (OARSI). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Keywords: High-throughput nucleotide sequencing; Computational biology; MicroRNAs; Bioinformatics; Osteoarthritis

ABSTRACT

The rapid expansion of Next Generation Sequencing (NGS) data availability has made exploration of appropriate bioinformatics analysis pipelines a timely issue. Since there are multiple tools, and combinations thereof, to analyze any dataset, there can be uncertainty about how best to perform an analysis in a robust and reproducible manner. This is especially true for newer omics applications, such as miRNomics, or microRNA-sequencing (miRNA-sequencing). Compared to transcriptomics, far fewer miRNA-sequencing studies have been performed to date, and those that are reported seldom provide a detailed description of the bioinformatics analysis, including aspects such as Unique Molecular Identifiers (UMIs). In this article, we attempt to fill this gap and help researchers understand their miRNA-sequencing data and its analysis. This article specifically discusses a customizable miRNA bioinformatics pipeline that was developed using miRNA-sequencing datasets generated from human osteoarthritis plasma samples. We describe quality assessment of raw sequencing data files, reference-based alignment, counts generation for miRNA expression levels, and novel miRNA discovery. This report is expected to improve the clarity and reproducibility of the bioinformatics portion of miRNA-sequencing analysis, applicable across any sample type, and to promote sharing of detailed protocols in the NGS field.

1. Introduction

Next Generation Sequencing (NGS) technology has revolutionized the study of the human genetic code, enabling a fast, reliable, and cost-effective method for reading the genome. Whereas "first generation" sequencing involved sequencing one molecule at a time, NGS involves sequencing multiple molecules in parallel [1–3]. This advance has reduced the time and cost per base that is sequenced, and has expanded sequencing applications, which now include microRNAs [4–7]. MicroRNAs (miRNAs) are small RNAs of 22–25 bases in length that regulate gene expression through degradation of mRNA transcripts and inhibition of translation [8]. MiRNAs have emerged as critical regulators of health and disease and, when found in circulation, represent promising biomarkers given their stability, specificity, and ease of detection and quantification [9].

By providing a quantitative readout of all molecules of interest in a sample without relying on endogenous controls or pre-selected probes, as real-time PCR and microarray approaches do, NGS has emerged as the gold standard approach for profiling nucleic acids, including miRNAs. The ability to detect single-nucleotide sequence changes or altogether novel sequences is an added advantage of sequencing [10]. As a result, sequencing has the capacity to identify molecules with greater sensitivity, specificity, and predictive ability for detecting disease [11]. For these reasons, sequencing has been applied to biomarker discovery for a variety of diseases, but not without limitations. There are several sources of error that can be introduced during a sequencing experiment. Among these, the patient cohort may be underpowered [12]; sample extraction, library preparation, and sequencing may create bias that leads to over- or under-estimation of the expression level of a molecule or subset of molecules [13]; or a one-size-fits-all approach may be inappropriately applied to data analysis. To harness the potential of NGS to identify miRNAs as biomarkers, including novel miRNAs, a rigorous approach that overcomes existing limitations is needed [14]. This report focuses on the data analysis aspect, where a rigorous methodology for bioinformatics analysis of miRNA-sequencing data has been developed and applied to identify miRNAs in plasma samples from osteoarthritis patients [15].

Here we focus on two major advantages of miRNA-sequencing: the discovery of novel miRNAs, and the use of unique molecular identifiers (UMIs). A novel miRNA is predicted based on secondary structure and lack of homology with miRNAs in other species [16]. Novel miRNAs hold promise for precision medicine approaches given their potential specificity to disease states. Given this potential biological importance, we have developed and tested a method for discovery of novel miRNA sequences that are present in miRNA-sequencing data. In addition to novel miRNA discovery, our pipeline includes analysis of UMIs. During library preparation, prior to amplification and sequencing, UMIs are added to each miRNA transcript. Following sequencing, UMI reads are collapsed such that the remaining counts per miRNA are more representative of the original starting sample prior to amplification. This is an internal control for managing library amplification bias, enabling accurate miRNA quantitation. While previous studies reporting miRNA-sequencing analysis may have incorporated UMI analysis, this level of detail is often not reported, nor is the method used to execute UMI analysis. Examples of available software that enables miRNA-sequencing analysis, but not UMI processing, include CAP-miRSeq and miRge [17,18]. Other software, such as TRUmiCount, handles UMI processing and integrates the same UMI-tools software that we describe in our pipeline [19]. Yet other software, like sRNABench and sRNAtoolbox, provides a similar pipeline, but the UMI processing is available only in web-server mode and not in the standalone version, which is not secure for analyzing data generated from patient samples [20]. To overcome these limitations in the field, we put forth a detailed protocol for analysis of miRNA-sequencing data, including quality control, alignment, demultiplexing, UMI analysis, and novel miRNA analysis.

It is our aim to establish [...] sufficient expertise in bioinformatics methods to understand the steps that need to be taken when analyzing a miRNA-sequencing dataset. It will also benefit bioinformaticians who have not previously worked with miRNA-sequencing data, given that this approach is relatively new compared to more established sequencing approaches such as DNA-sequencing and RNA-sequencing. Furthermore, having a standardized protocol will promote integration of research findings from different groups, consistent with the efforts of established guidelines such as 'Minimum Information about a high-throughput Nucleotide SEQuencing Experiment' (MINSEQE, http://fged.org/projects/minseqe/) and the 'Encyclopedia of DNA Elements' (ENCODE) pipelines (https://www.encodeproject.org/microrna/microrna-seq/). We leverage only open source software in our pipeline, offering customizable scripts for more advanced users. Having applied this pipeline and identified a unique signature of 11 circulating miRNAs in early knee osteoarthritis, we present the pipeline in sufficient detail to be replicated and widely used by others for the bioinformatics analysis of miRNA-sequencing data [15].

2. Overview of miRNA NGS analysis pipeline

There is more than one way to analyze miRNA-sequencing data, so here we present the approach we determined to be most suitable for bioinformatics analysis of miRNA-sequencing data generated from human plasma samples. Fig. 1 depicts an overview of the pipeline in its entirety, including: Prerequisite sequencing quality checks, Alignment steps, and Novel miRNA analysis. The first section begins with assessing the quality of the raw sequencing data, which is crucial to defining the path of downstream data processing. The second section involves read mapping and populating the UMI-based miRNA expression table for all samples in an experiment. This section represents the core of analysis.
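The idea of collapsing UMI reads so that counts per miRNA reflect the pre-amplification sample can be illustrated with a minimal sketch. It assumes each aligned read has already been reduced to a (miRNA identifier, UMI sequence) pair. The function name `collapse_umis`, the example identifiers, and the UMI sequences are hypothetical and are not taken from the authors' pipeline. The sketch uses exact UMI matching only; dedicated tools such as UMI-tools also account for sequencing errors within the UMI itself, which this example does not attempt.

```python
from collections import defaultdict


def collapse_umis(reads):
    """Count distinct UMIs per miRNA (exact-match collapsing only).

    `reads` is an iterable of (mirna_id, umi) pairs, one per aligned read.
    Reads sharing both the miRNA and the UMI are treated as amplification
    duplicates of a single original molecule and are counted once.
    """
    umis_per_mirna = defaultdict(set)
    for mirna_id, umi in reads:
        umis_per_mirna[mirna_id].add(umi)
    return {mirna_id: len(umis) for mirna_id, umis in umis_per_mirna.items()}


# Hypothetical example input: four aligned reads, two of which are duplicates.
reads = [
    ("hsa-miR-21-5p", "ACGTACGTACGT"),
    ("hsa-miR-21-5p", "ACGTACGTACGT"),  # same miRNA and UMI: amplification duplicate
    ("hsa-miR-21-5p", "TTGCAAGCTTGA"),  # same miRNA, different UMI: distinct molecule
    ("hsa-miR-16-5p", "GGATCCGGATCC"),
]

counts = collapse_umis(reads)
print(counts)  # {'hsa-miR-21-5p': 2, 'hsa-miR-16-5p': 1}
```

In this toy input, three reads map to the first miRNA but only two distinct UMIs are present, so its collapsed count is 2 rather than 3.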
