Lecture Notes on Sequence Analysis

Total Page:16

File Type:pdf, Size:1020Kb

Lecture Notes on Sequence Analysis Sequence Analysis Lecture notes Faculty of Technology, Bielefeld University Winter 2015/16 Contents 1 Overview 1 1.1 Application Areas of Sequence Analysis.....................1 1.2 A Small Selection of Problems on Sequences...................2 2 Basic Definitions3 2.1 Sets and Basic Combinatorics...........................3 2.2 Review of Elementary Probability Theory....................4 2.3 Asymptotics.....................................6 2.4 Alphabets and Sequences.............................7 3 Metrics on Sequences 11 3.1 Problem Motivation................................ 11 3.2 Definition of a Metric............................... 12 3.3 Transformation Distances............................. 12 3.4 Metrics on Sequences of the Same Length.................... 14 3.5 Edit Distances for Sequences........................... 16 3.6 An Efficient Algorithm to Compute Edit Distances............... 18 3.7 The q-gram Distance................................ 21 3.8 The Maximal Matches Distance.......................... 26 3.9 Filtering....................................... 28 4 Pairwise Sequence Alignment 29 4.1 Definition of Alignment.............................. 29 4.2 The Alignment Score................................ 31 4.3 The Alignment Graph............................... 32 4.4 A Universal Alignment Algorithm........................ 33 4.5 Alignment Types: Global, Free End Gaps, Local................ 34 4.6 Score and Cost Variations for Alignments.................... 38 i Contents 5 Advanced Topics in Pairwise Alignment 43 5.1 Suboptimal Alignments.............................. 43 5.2 Approximate String Matching........................... 44 5.3 The Forward-Backward Technique........................ 48 5.4 Pairwise Alignment in Linear Space....................... 50 6 Pairwise Alignment in Practice 55 6.1 Alignment Visualization with Dot Plots..................... 55 6.2 Fundamentals of Rapid Database Search Methods............... 56 6.3 BLAST: An On-line Database Search Method.................. 58 7 Suffix Trees 61 7.1 Motivation..................................... 61 7.2 An Informal Introduction to Suffix Trees..................... 62 7.3 A Formal Introduction to Suffix Trees...................... 64 7.4 Suffix Tree Construction: The WOTD Algorithm................ 65 7.5 Suffix Tree Construction: Ukkonen's Linear-Time Online Algorithm..... 67 7.6 Applications of Suffix Trees............................ 67 7.6.1 Exact String Matching........................... 67 7.6.2 The Shortest Unique Substring...................... 69 7.6.3 Maximal Repeats.............................. 70 7.6.4 Maximal Unique Matches......................... 73 8 Suffix Arrays 77 8.1 Motivation..................................... 77 8.2 Basic Definitions.................................. 78 8.3 Suffix Array Construction Algorithms...................... 79 8.3.1 Linear-Time Construction using a Suffix Tree.............. 79 8.3.2 Direct Construction............................ 80 8.3.3 Construction of the rank and lcp Arrays................ 81 8.4 Applications of Suffix Arrays........................... 83 9 Burrows-Wheeler Transformation 85 9.1 Introduction..................................... 85 9.2 Transformation and Retransformation...................... 85 9.3 Exact String Matching............................... 88 9.4 Other Applications................................. 88 9.4.1 Compression with Run-Length Encoding................. 89 10 Multiple Sequence Alignment 91 10.1 Basic Definitions.................................. 91 10.2 Why multiple sequence comparison?....................... 93 10.3 Multiple Alignment Scoring Functions...................... 94 10.4 Multiple Alignment Problems........................... 96 10.5 Digression: NP-completeness........................... 99 10.6 A Guide to Multiple Sequence Alignment Algorithms.............. 102 11 Algorithms for Sum-of-Pairs Multiple Alignment 103 11.1 An Exact Algorithm................................ 103 ii Contents 11.1.1 The Basic Algorithm............................ 103 11.1.2 Variations of the Basic Algorithm.................... 105 11.2 Carrillo and Lipman's Search Space Reduction................. 106 11.3 The Center-Star Approximation.......................... 109 11.4 Divide-and-Conquer Alignment.......................... 111 12 Tree Alignment Algorithms 117 12.1 Sankoff's Algorithm................................ 117 12.2 Generalized Tree Alignment............................ 119 12.2.1 Greedy Three-Way Tree Alignment Construction............ 120 12.2.2 The Deferred Path Heuristic....................... 122 13 Whole Genome Alignment 123 13.1 Filter Algorithms.................................. 124 13.2 Pairwise Genome Alignment (MUMmer 1 & 2)................. 125 13.3 Multiple Genome Alignment (MGA and MUMmer 3).............. 126 13.4 Multiple Genome Alignment with Rearrangements (MAUVE)......... 126 A Distances versus Similarity Measures on Sequences 129 A.1 Biologically Inspired Distances.......................... 129 A.2 From Distance to Similarity............................ 131 A.3 Log-Odds Score Matrices............................. 135 B Pairwise Sequence Alignment (Extended Material) 139 B.1 The Number of Global Alignments........................ 139 C Pairwise Alignment in Practice (Extended Material) 143 C.1 Fast Implementations of the Smith-Waterman Algorithm............ 143 C.2 FASTA: An On-line Database Search Method.................. 143 C.3 Index-based Database Search Methods...................... 146 C.4 Software....................................... 148 D Alignment Statistics 151 D.1 Preliminaries.................................... 151 D.2 Statistics of q-gram Matches and FASTA Scores................. 152 D.3 Statistics of Local Alignments........................... 154 E Basic RNA Secondary Structure Prediction 157 E.1 Introduction..................................... 157 E.2 The Optimization Problem............................ 159 E.3 Context-Free Grammars.............................. 160 E.4 The Nussinov Algorithm.............................. 161 E.5 Further Remarks.................................. 163 F Suffix Tree (Extended Material) 165 F.1 Memory Representations of Suffix Trees..................... 165 G Advanced Topics in Pairwise Alignment (Extended Material) 169 G.1 Length-Normalized Alignment........................... 169 iii Contents G.1.1 A Problem with Alignment Scores.................... 169 G.1.2 A Length-Normalized Scoring Function for Local Alignment...... 171 G.1.3 Finding an Optimal Normalized Alignment............... 171 G.2 Parametric Alignment............................... 176 G.2.1 Introduction to Parametric Alignment.................. 176 G.2.2 Theory of Parametric Alignment..................... 177 G.2.3 Solving the Ray Search Problem..................... 178 G.2.4 Finding the Polygon Containing p .................... 181 G.2.5 Filling the Parameter Space........................ 182 G.2.6 Analysis and Improvements........................ 182 G.2.7 Parametric Alignment in Practice.................... 183 H Multiple Alignment in Practice: Mostly Progressive 185 H.1 Progressive Alignment............................... 185 H.1.1 Aligning Two Alignments......................... 187 H.2 Segment-Based Alignment............................. 188 H.2.1 Segment Identification........................... 189 H.2.2 Segment Selection and Assembly..................... 190 H.3 Software for Progressive and Segment-Based Alignment............ 190 H.3.1 The Program Clustal W.......................... 190 H.3.2 T-COFFEE................................. 191 H.3.3 DIALIGN.................................. 192 H.3.4 MUSCLE.................................. 192 H.3.5 QAlign................................... 194 Bibliography 195 iv Preface At Bielefeld University, elements of sequence analysis are taught in several courses, starting with elementary pattern matching methods in \Algorithms and Data Structures" in the first semester. The present three-hour course \Sequence Analysis" is taught in the third semester and is extended by a practical course in the fourth semester. Prerequisites. It is assumed that the student has had some exposure to algorithms and mathematics, as well as to elementary facts of molecular biology. The following topics are required to understand some of the material here: • exact string matching algorithms (e.g. the naive method, Boyer-Moore, Boyer-Moore- Horspool, Knuth-Morris-Pratt), • comparison-based sorting (e.g. insertion sort, mergesort, heapsort, quicksort), • asymptotic complexity analysis (O notation). These topic are assumed to have been covered before this course and the first two do not appear in these notes. In Section 2.3 the O notation is briefly reviewed, more advanced complexity topics are discussed in Section 10.5. Advanced topics. Because the material is taught in one three-hour course, some advanced sequence analysis techniques are not covered. These include: • fast algorithms for approximate string matching (e.g. bit-shifting), • advanced filtering methods for approximate string matching (e.g. gapped q-grams), • efficient methods for concave gap cost functions in pairwise alignment, v Contents • in-depth discussion of methods used in practice. How to use existing software to solve a particular problem generally requires hands-on experience which can be conveyed in the \Sequence Analysis practical course". Some of the theory of such methods is covered in the Appendix of these notes. Suggested Reading Details about the recommended textbooks
Recommended publications
  • Comparative Analysis of Multiple Sequence Alignment Tools
    I.J. Information Technology and Computer Science, 2018, 8, 24-30 Published Online August 2018 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijitcs.2018.08.04 Comparative Analysis of Multiple Sequence Alignment Tools Eman M. Mohamed Faculty of Computers and Information, Menoufia University, Egypt E-mail: [email protected]. Hamdy M. Mousa, Arabi E. keshk Faculty of Computers and Information, Menoufia University, Egypt E-mail: [email protected], [email protected]. Received: 24 April 2018; Accepted: 07 July 2018; Published: 08 August 2018 Abstract—The perfect alignment between three or more global alignment algorithm built-in dynamic sequences of Protein, RNA or DNA is a very difficult programming technique [1]. This algorithm maximizes task in bioinformatics. There are many techniques for the number of amino acid matches and minimizes the alignment multiple sequences. Many techniques number of required gaps to finds globally optimal maximize speed and do not concern with the accuracy of alignment. Local alignments are more useful for aligning the resulting alignment. Likewise, many techniques sub-regions of the sequences, whereas local alignment maximize accuracy and do not concern with the speed. maximizes sub-regions similarity alignment. One of the Reducing memory and execution time requirements and most known of Local alignment is Smith-Waterman increasing the accuracy of multiple sequence alignment algorithm [2]. on large-scale datasets are the vital goal of any technique. The paper introduces the comparative analysis of the Table 1. Pairwise vs. multiple sequence alignment most well-known programs (CLUSTAL-OMEGA, PSA MSA MAFFT, BROBCONS, KALIGN, RETALIGN, and Compare two biological Compare more than two MUSCLE).
    [Show full text]
  • "Phylogenetic Analysis of Protein Sequence Data Using The
    Phylogenetic Analysis of Protein Sequence UNIT 19.11 Data Using the Randomized Axelerated Maximum Likelihood (RAXML) Program Antonis Rokas1 1Department of Biological Sciences, Vanderbilt University, Nashville, Tennessee ABSTRACT Phylogenetic analysis is the study of evolutionary relationships among molecules, phenotypes, and organisms. In the context of protein sequence data, phylogenetic analysis is one of the cornerstones of comparative sequence analysis and has many applications in the study of protein evolution and function. This unit provides a brief review of the principles of phylogenetic analysis and describes several different standard phylogenetic analyses of protein sequence data using the RAXML (Randomized Axelerated Maximum Likelihood) Program. Curr. Protoc. Mol. Biol. 96:19.11.1-19.11.14. C 2011 by John Wiley & Sons, Inc. Keywords: molecular evolution r bootstrap r multiple sequence alignment r amino acid substitution matrix r evolutionary relationship r systematics INTRODUCTION the baboon-colobus monkey lineage almost Phylogenetic analysis is a standard and es- 25 million years ago, whereas baboons and sential tool in any molecular biologist’s bioin- colobus monkeys diverged less than 15 mil- formatics toolkit that, in the context of pro- lion years ago (Sterner et al., 2006). Clearly, tein sequence analysis, enables us to study degree of sequence similarity does not equate the evolutionary history and change of pro- with degree of evolutionary relationship. teins and their function. Such analysis is es- A typical phylogenetic analysis of protein sential to understanding major evolutionary sequence data involves five distinct steps: (a) questions, such as the origins and history of data collection, (b) inference of homology, (c) macromolecules, developmental mechanisms, sequence alignment, (d) alignment trimming, phenotypes, and life itself.
    [Show full text]
  • Performance Evaluation of Leading Protein Multiple Sequence Alignment Methods
    International Journal of Engineering and Advanced Technology (IJEAT) ISSN: 2249 – 8958, Volume-9 Issue-1, October 2019 Performance Evaluation of Leading Protein Multiple Sequence Alignment Methods Arunima Mishra, B. K. Tripathi, S. S. Soam MSA is a well-known method of alignment of three or more Abstract: Protein Multiple sequence alignment (MSA) is a biological sequences. Multiple sequence alignment is a very process, that helps in alignment of more than two protein intricate problem, therefore, computation of exact MSA is sequences to establish an evolutionary relationship between the only feasible for the very small number of sequences which is sequences. As part of Protein MSA, the biological sequences are not practical in real situations. Dynamic programming as used aligned in a way to identify maximum similarities. Over time the sequencing technologies are becoming more sophisticated and in pairwise sequence method is impractical for a large number hence the volume of biological data generated is increasing at an of sequences while performing MSA and therefore the enormous rate. This increase in volume of data poses a challenge heuristic algorithms with approximate approaches [7] have to the existing methods used to perform effective MSA as with the been proved more successful. Generally, various biological increase in data volume the computational complexities also sequences are organized into a two-dimensional array such increases and the speed to process decreases. The accuracy of that the residues in each column are homologous or having the MSA is another factor critically important as many bioinformatics same functionality. Many MSA methods were developed over inferences are dependent on the output of MSA.
    [Show full text]
  • HMMER User's Guide
    HMMER User's Guide Biological sequence analysis using pro®le hidden Markov models http://hmmer.wustl.edu/ Version 2.1.1; December 1998 Sean Eddy Dept. of Genetics, Washington University School of Medicine 4566 Scott Ave., St. Louis, MO 63110, USA [email protected] With contributions by Ewan Birney ([email protected]) Copyright (C) 1992-1998, Washington University in St. Louis. Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are retained on all copies. The HMMER software package is a copyrighted work that may be freely distributed and modi®ed under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. Some versions of HMMER may have been obtained under specialized commercial licenses from Washington University; for details, see the ®les COPYING and LICENSE that came with your copy of the HMMER software. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Appendix for a copy of the full text of the GNU General Public License. 1 Contents 1 Tutorial 5 1.1 The programs in HMMER . 5 1.2 Files used in the tutorial . 6 1.3 Searching a sequence database with a single pro®le HMM . 6 HMM construction with hmmbuild . 7 HMM calibration with hmmcalibrate . 7 Sequence database search with hmmsearch . 8 Searching major databases like NR or SWISSPROT .
    [Show full text]
  • Contributions to Biostatistics: Categorical Data Analysis, Data Modeling and Statistical Inference Mathieu Emily
    Contributions to biostatistics: categorical data analysis, data modeling and statistical inference Mathieu Emily To cite this version: Mathieu Emily. Contributions to biostatistics: categorical data analysis, data modeling and statistical inference. Mathematics [math]. Université de Rennes 1, 2016. tel-01439264 HAL Id: tel-01439264 https://hal.archives-ouvertes.fr/tel-01439264 Submitted on 18 Jan 2017 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. HABILITATION À DIRIGER DES RECHERCHES Université de Rennes 1 Contributions to biostatistics: categorical data analysis, data modeling and statistical inference Mathieu Emily November 15th, 2016 Jury: Christophe Ambroise Professeur, Université d’Evry Val d’Essonne, France Rapporteur Gérard Biau Professeur, Université Pierre et Marie Curie, France Président David Causeur Professeur, Agrocampus Ouest, France Examinateur Heather Cordell Professor, University of Newcastle upon Tyne, United Kingdom Rapporteur Jean-Michel Marin Professeur, Université de Montpellier, France Rapporteur Valérie Monbet Professeur, Université de Rennes 1, France Examinateur Korbinian Strimmer Professor, Imperial College London, United Kingdom Examinateur Jean-Philippe Vert Directeur de recherche, Mines ParisTech, France Examinateur Pour M.H.A.M. Remerciements En tout premier lieu je tiens à adresser mes remerciements à l’ensemble des membres du jury qui m’ont fait l’honneur d’évaluer mes travaux.
    [Show full text]
  • Sequence Analysis Instructions
    Sequence Analysis Instructions In order to predict your drug metabolizing phenotype from your CYP2D6 gene sequence, you must determine: 1) The assembled sequence from your two opposing sequencing reactions 2) If your PCR product even represents the human CYP2D6 gene, 3) The location of your sequence within the CYP2D6 gene, 4) Whether differences between your alleles and CYP2D6*1 sequence represents sequencing errors or polymorphisms in your sequence, and 5) The effect(s) of any polymorphisms on CYP2D6 protein sequence. As you analyze your gene sequence, copy and paste your analyses and results into a text file for your final lab report. If some factor, like the quality of your sequence, prevents you from carrying out the complete analysis, your grade will not be penalized—just complete as many of the steps below as possible, and include an explanation of why you could not complete the analysis in your final report. Font appearing in bold green italics describes questions to answer and items to include in your final report. Part 1: Assembling your CYP2D6 sequence from both directions The sequencing reaction only produces 700-800 bases of good sequence. To get most of the sequence of the 1.2 Kbp PCR product, sequencing reactions were performed from both directions on the PCR product. These need to be assembled into one sequence using the overlap of the two sequences. In order to assure that it is going in the forward direction, it is best to derive the reverse complement sequence from the reverse sequencing file. Take the text file and paste it into a program like http://www.bioinformatics.org/sms/rev_comp.html.
    [Show full text]
  • Agenda: Gene Prediction by Cross-Species Sequence Comparison Leila Taher1, Oliver Rinner2,3, Saurabh Garg1, Alexander Sczyrba4 and Burkhard Morgenstern5,*
    Nucleic Acids Research, 2004, Vol. 32, Web Server issue W305–W308 DOI: 10.1093/nar/gkh386 AGenDA: gene prediction by cross-species sequence comparison Leila Taher1, Oliver Rinner2,3, Saurabh Garg1, Alexander Sczyrba4 and Burkhard Morgenstern5,* 1International Graduate School for Bioinformatics and Genome Research, University of Bielefeld, Postfach 10 01 31, 33501 Bielefeld, Germany, 2GSF Research Center, MIPS/Institute of Bioinformatics, Ingolsta¨dter Landstrasse 1, 85764 Neuherberg, Germany, 3Brain Research Institute, ETH Zuurich,€ Winterthurerstrasse 190, CH-8057 Zuurich,€ Switzerland, 4Faculty of Technology, Research Group in Practical Computer Science, University of Bielefeld, Postfach 10 01 31, 33501 Bielefeld, Germany and 5Institute of Microbiology and Genetics, University of Go¨ttingen, Goldschmidtstrasse 1, 37077 Go¨ttingen, Germany Received March 5, 2004; Revised and Accepted March 24, 2004 ABSTRACT Thus, with the huge amount of data produced by sequencing projects, the development of high-quality gene-prediction Automatic gene prediction is one of the major tools has become a priority in bioinformatics. For eukaryotic challenges in computational sequence analysis. organisms, gene prediction is particularly challenging, since Traditional approaches to gene finding rely on statis- protein-coding exons are usually separated by introns of vari- tical models derived from previously known genes. By able length. Most traditional methods for gene prediction are contrast, a new class of comparative methods relies intrinsic methods; they use hidden Markov models (HMMs) or on comparing genomic sequences from evolutionary other stochastic approaches describing the statistical composi- related organisms to each other. These methods are tion of introns, exons, etc. Such models are trained with already based on the concept of phylogenetic footprinting: known genes from the same or a closely related species; the they exploit the fact that functionally important most popular of these tools is GenScan (1).
    [Show full text]
  • Syntax Highlighting for Computational Biology Artem Babaian1†, Anicet Ebou2, Alyssa Fegen3, Ho Yin (Jeffrey) Kam4, German E
    bioRxiv preprint doi: https://doi.org/10.1101/235820; this version posted December 20, 2017. The copyright holder has placed this preprint (which was not certified by peer review) in the Public Domain. It is no longer restricted by copyright. Anyone can legally share, reuse, remix, or adapt this material for any purpose without crediting the original authors. bioSyntax: Syntax Highlighting For Computational Biology Artem Babaian1†, Anicet Ebou2, Alyssa Fegen3, Ho Yin (Jeffrey) Kam4, German E. Novakovsky5, and Jasper Wong6. 5 10 15 Affiliations: 1. Terry Fox Laboratory, BC Cancer, Vancouver, BC, Canada. [[email protected]] 2. Departement de Formation et de Recherches Agriculture et Ressources Animales, Institut National Polytechnique Felix Houphouet-Boigny, Yamoussoukro, Côte d’Ivoire. [[email protected]] 20 3. Faculty of Science, University of British Columbia, Vancouver, BC, Canada [[email protected]] 4. Faculty of Mathematics, University of Waterloo, Waterloo, ON, Canada. [[email protected]] 5. Department of Medical Genetics, University of British Columbia, Vancouver, BC, 25 Canada. [[email protected]] 6. Genome Science and Technology, University of British Columbia, Vancouver, BC, Canada. [[email protected]] Correspondence†: 30 Artem Babaian Terry Fox Laboratory BC Cancer Research Centre 675 West 10th Avenue Vancouver, BC, Canada. V5Z 1L3. 35 Email: [[email protected]] bioRxiv preprint doi: https://doi.org/10.1101/235820; this version posted December 20, 2017. The copyright holder has placed this preprint (which was not certified by peer review) in the Public Domain. It is no longer restricted by copyright. Anyone can legally share, reuse, remix, or adapt this material for any purpose without crediting the original authors.
    [Show full text]
  • Software List for Biology, Bioinformatics and Biostatistics CCT
    Software List for biology, bioinformatics and biostatistics v CCT - Delta Software Version Application short read assembler and it works on both small and large (mammalian size) ALLPATHS-LG 52488 genomes provides a fast, flexible C++ API & toolkit for reading, writing, and manipulating BAMtools 2.4.0 BAM files a high level of alignment fidelity and is comparable to other mainstream Barracuda 0.7.107b alignment programs allows one to intersect, merge, count, complement, and shuffle genomic bedtools 2.25.0 intervals from multiple files Bfast 0.7.0a universal DNA sequence aligner tool analysis and comprehension of high-throughput genomic data using the R Bioconductor 3.2 statistical programming BioPython 1.66 tools for biological computation written in Python a fast approach to detecting gene-gene interactions in genome-wide case- Boost 1.54.0 control studies short read aligner geared toward quickly aligning large sets of short DNA Bowtie 1.1.2 sequences to large genomes Bowtie2 2.2.6 Bowtie + fully supports gapped alignment with affine gap penalties BWA 0.7.12 mapping low-divergent sequences against a large reference genome ClustalW 2.1 multiple sequence alignment program to align DNA and protein sequences assembles transcripts, estimates their abundances for differential expression Cufflinks 2.2.1 and regulation in RNA-Seq samples EBSEQ (R) 1.10.0 identifying genes and isoforms differentially expressed EMBOSS 6.5.7 a comprehensive set of sequence analysis programs FASTA 36.3.8b a DNA and protein sequence alignment software package FastQC
    [Show full text]
  • A Comparison of Latent Class and Sequence Analysis
    Classifying life course trajectories: a comparison of latent class and sequence analysis Nicola Barban DONDENA \Carlo F. Dondena" Centre for Research on Social Dynamics, Universit`aBocconi, Milan, Italy [email protected] Francesco C. Billari DONDENA \Carlo F. Dondena" Centre for Research on Social Dynamics, Department of Decision Sciences and IGIER Universit`aBocconi, Milan, Italy Draft version. Abstract In this article we compare two techniques that are widely used in the analysis of life course trajectories, i.e. latent class analysis (LCA) and sequence analysis (SA). In particular, we focus on the use of these techniques as devices to obtain classes of individual life course trajectories. We first compare the consistency of the clas- sification obtained via the two techniques using an actual dataset on the life course trajectories of young adults. Then, we adopt a simulation approach to measure the ability of these two methods to correctly classify groups of life course trajecto- ries when specific forms of \random" variability are introduced within pre-specified classes in an artificial datasets. In order to do so, we introduce simulation opera- tors that have a life course and/or observational meaning. Our results contribute on the one hand to outline the usefulness and robustness of findings based on the classification of life course trajectories through LCA and SA, on the other hand to illuminate on the potential pitfalls of actual applications of these techniques. 1 Introduction In recent years, there has been a significantly growing interest in the holistic study of life course trajectories, i.e. in considering whole trajectories as a unit of analysis, both in a social science setting and in epidemiological and medical studies.
    [Show full text]
  • Computational Protocol for Assembly and Analysis of SARS-Ncov-2 Genomes
    PROTOCOL Computational Protocol for Assembly and Analysis of SARS-nCoV-2 Genomes Mukta Poojary1,2, Anantharaman Shantaraman1, Bani Jolly1,2 and Vinod Scaria1,2 1CSIR Institute of Genomics and Integrative Biology, Mathura Road, Delhi 2Academy of Scientific and Innovative Research (AcSIR) *Corresponding email: MP [email protected]; AS [email protected]; BJ [email protected]; VS [email protected] ABSTRACT SARS-CoV-2, the pathogen responsible for the ongoing Coronavirus Disease 2019 pandemic is a novel human-infecting strain of Betacoronavirus. The outbreak that initially emerged in Wuhan, China, rapidly spread to several countries at an alarming rate leading to severe global socio-economic disruption and thus overloading the healthcare systems. Owing to the high rate of infection of the virus, as well as the absence of vaccines or antivirals, there is a lack of robust mechanisms to control the outbreak and contain its transmission. Rapid advancement and plummeting costs of high throughput sequencing technologies has enabled sequencing of the virus in several affected individuals globally. Deciphering the viral genome has the potential to help understand the epidemiology of the disease as well as aid in the development of robust diagnostics, novel treatments and prevention strategies. Towards this effort, we have compiled a comprehensive protocol for analysis and interpretation of the sequencing data of SARS-CoV-2 using easy-to-use open source utilities. In this protocol, we have incorporated strategies to assemble the genome of SARS-CoV-2 using two approaches: reference-guided and de novo. Strategies to understand the diversity of the local strain as compared to other global strains have also been described in this protocol.
    [Show full text]
  • Bioinformatics: a Practical Guide to the Analysis of Genes and Proteins, Second Edition Andreas D
    BIOINFORMATICS A Practical Guide to the Analysis of Genes and Proteins SECOND EDITION Andreas D. Baxevanis Genome Technology Branch National Human Genome Research Institute National Institutes of Health Bethesda, Maryland USA B. F. Francis Ouellette Centre for Molecular Medicine and Therapeutics Children’s and Women’s Health Centre of British Columbia University of British Columbia Vancouver, British Columbia Canada A JOHN WILEY & SONS, INC., PUBLICATION New York • Chichester • Weinheim • Brisbane • Singapore • Toronto BIOINFORMATICS SECOND EDITION METHODS OF BIOCHEMICAL ANALYSIS Volume 43 BIOINFORMATICS A Practical Guide to the Analysis of Genes and Proteins SECOND EDITION Andreas D. Baxevanis Genome Technology Branch National Human Genome Research Institute National Institutes of Health Bethesda, Maryland USA B. F. Francis Ouellette Centre for Molecular Medicine and Therapeutics Children’s and Women’s Health Centre of British Columbia University of British Columbia Vancouver, British Columbia Canada A JOHN WILEY & SONS, INC., PUBLICATION New York • Chichester • Weinheim • Brisbane • Singapore • Toronto Designations used by companies to distinguish their products are often claimed as trademarks. In all instances where John Wiley & Sons, Inc., is aware of a claim, the product names appear in initial capital or ALL CAPITAL LETTERS. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration. Copyright ᭧ 2001 by John Wiley & Sons, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic or mechanical, including uploading, downloading, printing, decompiling, recording or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher.
    [Show full text]