Smith-Waterman Sequence Alignment for Massively Parallel High-Performance Computing Architectures

A dissertation submitted to Kent State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

by Shannon Irene Steinfadt

May 2010

Dissertation written by Shannon Irene Steinfadt
B.A., Hiram College, 2000
M.A., Kent State University, 2003
Ph.D., Kent State University, 2010

Approved by
Dr. Johnnie W. Baker, Chair, Doctoral Dissertation Committee
Dr. Kenneth Batcher, Members, Doctoral Dissertation Committee
Dr. Paul Farrell
Dr. James Blank

Accepted by
Dr. Robert Walker, Chair, Department of Computer Science
Dr. John Stalvey, Dean, College of Arts and Sciences

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
Copyright
Dedication
Acknowledgements
1 Introduction
2 Sequence Alignment
  2.1 Background
  2.2 Pairwise Sequence Alignment
  2.3 Needleman-Wunsch
  2.4 Smith-Waterman Sequence Alignment
  2.5 Scoring
  2.6 Opportunities for Parallelization
3 Parallel Computing Models
  3.1 Models of Parallel Computation
    3.1.1 Multiple Instruction, Multiple Data (MIMD)
    3.1.2 Single Instruction, Multiple Data (SIMD)
  3.2 Associative Computing Model
    3.2.1 Associative Functions
4 Smith-Waterman Using Associative Massive Parallelism (SWAMP)
  4.1 Overview
  4.2 ASC Emulation
    4.2.1 Data Setup
    4.2.2 SWAMP Algorithm Outline
  4.3 Performance Analysis
    4.3.1 Asymptotic Analysis
    4.3.2 Performance Monitor Result Analysis
    4.3.3 Predicted Performance as S1 and S2 Grow
    4.3.4 Additional Avenues of Discovery
    4.3.5 Comments on Emulation
  4.4 SWAMP with Added Traceback
    4.4.1 SWAMP with Traceback Analysis
5 Extended Smith-Waterman Using Associative Massive Parallelism (SWAMP+)
  5.1 Overview
  5.2 Single-to-Multiple SWAMP+ Algorithm
    5.2.1 Algorithm
  5.3 Multiple-to-Single SWAMP+ Algorithm
  5.4 Multiple-to-Multiple SWAMP+ Algorithm
    5.4.1 Algorithm
    5.4.2 Asymptotic Analysis
  5.5 Future Directions
  5.6 ClearSpeed Implementation
6 Feasible Hardware Survey for the Associative SWAMP Implementation
  6.1 Overview
  6.2 IBM Cell Processor
  6.3 Field-Programmable Gate Arrays - FPGAs
  6.4 Graphics Processing Units - GPGPUs
    6.4.1 Implementing ASC on GPGPUs
  6.5 ClearSpeed SIMD Architecture
7 SWAMP+ Implementation on ClearSpeed Hardware
  7.1 Implementing Associative SWAMP+ on the ClearSpeed CSX
  7.2 ClearSpeed Running Results
    7.2.1 Parallel Matrix Computation
    7.2.2 Sequential Traceback
  7.3 Conclusions
8 Smith-Waterman on a Distributed Memory Cluster System
  8.1 Introduction
  8.2 JumboMem
  8.3 Extreme-Scale Alignments on Clusters
    8.3.1 Experiments
    8.3.2 Results
  8.4 Conclusion
9 Ongoing and Future Work
  9.1 Hierarchical Parallelism for Smith-Waterman Incorporating JumboMem
    9.1.1 Within a Single Core
    9.1.2 Across Cores and Nodes
  9.2 Continuing SWAMP+ Work
10 Conclusions
BIBLIOGRAPHY
Appendices
A ASC Source Code for SWAMP
  A.1 ASC Code for SWAMP
B ClearSpeed Code for SWAMP+

LIST OF FIGURES

1 An example of the sequential Smith-Waterman matrix. The dependencies of cell (3, 2) are shown with arrows. While the calculated C values for the entire matrix are given, the shaded anti-diagonal (where all i + j values are equal) shows one wavefront or logical parallel step, since its cells can be computed concurrently. Affine gap penalties are used in this example as well as in the parallel code that produces the top alignment and other top-scoring alignments.
2 Smith-Waterman matrix with traceback and resulting alignment.
3 A high-level view of the ASC model of parallel computation.
4 Mapping the "shifted" data onto the ASC model. Every S2[$] column stores one full anti-diagonal from the original matrix. Here the number of PEs > m and the unused (idle) PEs are grayed out. When the number of PEs < m, the PEs are virtualized and one PE will process ⌈m / # PEs⌉ worth of work. The PE Interconnection Network is omitted for simplicity.
5 Showing (i + j = 4) step-by-step iteration of the m + n loop to shift S2. This loop stores each anti-diagonal in a single variable of the ASC array S2[$] so that it can be processed in parallel.
6 Reduction in the number of operations through further parallelization of the SWAMP algorithm.
7 Actual and predicted performance measurements using ASC's performance monitor. Predictions were obtained using linear regression and the least squares method and are shown with a dashed line.
8 SWAMP+ variations where k=3 in both a) and b) and k=2 in c).
9 A detail of one streaming multiprocessor (SM) is shown here. On CUDA-enabled NVIDIA hardware, a varied number of SMs exist for massively parallel processing. Each SM contains eight streaming processor (SP) cores, two special function units (SFUs), instruction and constant caches, a multithreaded instruction unit, and a shared memory. One example organization is the NVIDIA Tesla T10 with 30 SMs for a total of 240 SPs.
10 The CSX 620 PCI-X Accelerator Board.
11 ClearSpeed CSX processor organization. Diagram courtesy of ClearSpeed http://www.clearspeed.com/products/csx700/.
12 The average number of calculation cycles over 30 runs. This graph is broken down by subalignment. There were eight outliers in over 4500 runs, each an order of magnitude larger than the cycle counts for the rest of the runs. Those outliers pulled the calculation cycle count averages up, as seen in the graph. The graph does show that the number of parallel computation steps is roughly the same, regardless of sequence size. Lower is better.
13 With the top eight outliers removed, the error bars show the computation cycle counts in the same order of magnitude as the rest of the readings.
14 Cell Updates Per Second (CUPS) for Matrix Computation, where higher is better.
15 The average number of traceback cycles over 30 runs. The longest alignment is the first alignment, as expected. Therefore the first traceback in all runs with 1 to 5 alignments returned has a higher cycle count than any of the subsequent alignments.
16 Comparison of Cycle Counts for Computation and Traceback.
17 Across multiple nodes' main memory, JumboMem allows an entire cluster's memory to look like local memory with no additional hardware, no recompilation, and no root account access.
18 The cell updates per second (CUPS) rate does experience some performance degradation, but not as much as if it had to page to disk.
19 The execution time grows consistently even as JumboMem begins to use other nodes' memory. Note the logarithmic scales, since as input string size doubles, the calculations and memory requirements quadruple.
20 A wavefront of wavefronts approach, merging a hierarchy of parallelism, first within a single core, and then across multiple cores.

LIST OF TABLES

1 PAL Cluster Characteristics

Copyright

This material is copyright © 2010 Shannon Irene Steinfadt.

This is dedicated to my guys, including Jim, Minky, Ike, Tyke, Spike, Thaddeus, Bandy, BB and the rest of the gang. I include my family who made education and learning a top priority. I also dedicate it to all of my friends and family (by blood and by kindred spirit) who have supported me throughout the years of effort.

Shannon Irene Steinfadt
March 18, 2010, Kent, Ohio

Acknowledgements

I acknowledge the help and input from my advisor Dr. Johnnie Baker. In addition, the support from my dissertation committee, the department chair Dr.
Robert Walker and the Department of Computer Science at Kent State helped me bring this dissertation to completion. I also acknowledge ClearSpeed for the use of their equipment necessary for my work. And many thanks to the Performance and Architectures Laboratory (PAL) team at Los Alamos National Laboratory, especially Kevin Barker, Darren Kerbyson, and Scott Pakin for their support, advice and insight. The use of the PAL cluster and JumboMem made some of this work possible. My gratitude goes out to the Angel Fire / TAOS team at Los Alamos National Laboratory as well. They supported me during the last few months of intense effort.

CHAPTER 1
Introduction

The increasing growth and complexity of high-performance computing, as well as the stellar data growth in the bioinformatics field, stand as guideposts for this work. The march is towards increasing processor counts, each processor with an increasing number of compute cores and often associated with accelerator hardware. The twice-yearly Top500 listing of the most powerful computers in the world stands as proof of this. With hundreds of thousands of cores, many using accelerators, massive parallelism is a top-tier fact in high-performance computing.

This research addresses one of the most frequently used tools in bioinformatics: sequence alignment. While my application focus is sequence alignment, this work is applicable to other problems in other fields. The parallel optimizations and techniques presented here for a Smith-Waterman-like sequence alignment can be applied to algorithms that use dynamic programming with a wavefront approach. A primary example is a parallel benchmark called Sweep3D, a neutron transport model.
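
As a rough illustration of the wavefront idea mentioned above, the following Python sketch fills a Smith-Waterman score matrix one anti-diagonal at a time. It is not the SWAMP code from this dissertation: the function name, the scoring values (match=2, mismatch=-1, gap=-1) and the use of a simple linear gap penalty instead of affine gaps are all assumptions made to keep the example short.

```python
def smith_waterman_wavefront(s1, s2, match=2, mismatch=-1, gap=-1):
    """Fill the local-alignment score matrix in anti-diagonal order.

    Hypothetical helper for illustration only; it uses a linear gap
    penalty, which is a simplification of the affine gaps in the text.
    Returns (best score, (i, j) of the best cell, full matrix).
    """
    m, n = len(s1), len(s2)
    # Row 0 and column 0 stay at zero, as in standard Smith-Waterman.
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best_score, best_cell = 0, (0, 0)

    # d = i + j identifies one anti-diagonal (one wavefront step).
    for d in range(2, m + n + 1):
        # Every cell on this anti-diagonal depends only on cells from
        # the two previous anti-diagonals, so the iterations of this
        # inner loop are independent and could run in parallel.
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            H[i][j] = max(
                0,                      # local alignment may restart
                H[i - 1][j - 1] + sub,  # diagonal: match or mismatch
                H[i - 1][j] + gap,      # gap in s2
                H[i][j - 1] + gap,      # gap in s1
            )
            if H[i][j] > best_score:
                best_score, best_cell = H[i][j], (i, j)
    return best_score, best_cell, H


if __name__ == "__main__":
    score, cell, _ = smith_waterman_wavefront("GATTACA", "TTAC")
    print(score, cell)
```

The sequential inner loop stands in for what a SIMD or associative machine would do in a single parallel step, so the number of outer iterations (m + n - 1) is the number of wavefront steps.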