Sequence Assembly and Alignment Tech Guide

Total Page:16

File Type:pdf, Size:1020Kb

Sequence Assembly and Alignment Tech Guide www.roche-applied-science.com Genome Sequencer FLX System Longer sequencing reads mean more applications. In 2005, the Genome Sequencer 20 System was launched ■ Read length: 100 bases ■ 20 million bases in less than 5 hours In 2007, the Genome Sequencer FLX System was launched ■ Read length: 250 to 300 bases ■ 100 million bases in less than 8 hours Available in 2008, the Genome Sequencer FLX with improved chemistries ■ Read length: >400 bases ■ 1 billion bases in less than 24 hours More applications lead to more publications Sequencing-by-Synthesis: Using an enzymatically Proven performance with an expanding list of applications and coupled reaction, light is generated when individual more than 80 peer-reviewed publications. nucleotides are incorporated. Hundreds of thousands of Visit www.genome-sequencing.com to learn more. individual DNA fragments are sequenced in parallel. Roche Diagnostics Roche Applied Science For life science research only. Not for use in diagnostic procedures. Indianapolis, Indiana 454 and GENOME SEQUENCER are trademarks of 454 Life Sciences Corporation, Branford, CT, USA. © 2007 Roche Diagnostics. All rights reserved. Table of contents Letter from the Editor . .5 Index of Experts . .5 Q1: How do you decide whether to take a global, local, or hybrid approach to alignment? . 6 Q2: What scoring function do you use? Do you allow gaps? . 9 Q3: How do you choose which assembly algorithm to use? . 12 Q4: What approach do you use to deal with very short reads? . 14 Q5: How do you ensure high-quality assemblies? What's your process for detecting errors? . 15 Q6: How do you close the gaps? At what point is a genome considered "finished"? . 17 List of Resources . 21 Evolving? Don’t change jobs without us. E-mail your updated address information to [email protected]. Please include the subscriber number appearing directly above your name on the address label. GenomeWeb Intelligence Network D N A S T A R ArrayStar® shines at straightforward If your studies involve gene expression, ArrayStar is the only microarray microarray software that helps you do so much analysis so easily. Why shuffle between a confusing assortment of complex, analysis expensive, hard-to-learn software when you can quickly and easily perform the analyses you want with a single user-friendly program? And ArrayStar is as straightforward to own as it is to use. When you purchase ArrayStar, the perpetual license makes it yours forever. Your research benefits from these features! • Compatible with Affymetrix and NimbleGen normalized data files • Import wizard supports .txt files • Conducts a wide range of analyses and visualizations to evaluate expression patterns • Identifies genes with similar expression patterns • Clusters images to identify groups of co-regulated genes • Plots gene expression levels to A selection of genes highlighted within simultaneous view relationships with Line Graphs multiple views of ArrayStar and Thumbnails • Displays expression levels of many genes across multiple experiments with a Heat Map Affymetrix is a trademark of Affymetrix, Inc. NimbleGen is a trademark of NimbleGen Systems, Inc. To receive a 30-day fully-functional free trial version visit us at www.dnastar.com/arraystar For more information, or to order ArrayStar today, please call toll-free in the US, 1.866.511.5090 or in the UK, 0.808.234.1643 DNASTAR, Inc • Madison, WI, USA • www.dnastar.com Toll free: 866-511-5090 • 608-258-7420 • [email protected] www.dnastar.com/arraystar Letter from the editor At long last, you’ve got your sequence choices and decisions, Genome Technology has data. But now you have to figure out rounded up experts in the sequence alignment and what it means. To make heads or tails assembly fields. These experienced veterans give their of all that information, you might want advice on how to approach sequence alignment, to align and then assemble your choose an assembly algorithm, and when to declare genome of choice. Gone, though, are yourself "done!" with putting that genome's pieces the days of manually post-processing together. Without further ado, but with much thanks the snippets of sequence. The flurry of activity as more to the following contributors, here are their thoughts and more genomes are sequenced (Dog! Honey bee! on the questions facing researchers in sequence Shark!) has brought more advanced and accurate ways alignment and assembly. of aligning and assembling whatever your favorite — Ciara Curtin genome might be. To help you slog through all the Index of experts Genome Technology would like to thank the following contributors for taking the time to respond to the questions in this tech guide. Stephen Altschul Sudhir Kumar Senior Investigator Professor National Center for Biotechnology Arizona State University Information Elliott Margulies Serafim Batzoglou Investigator Assistant Professor National Human Genome Research Stanford University Institute Gustavo Glusman Darren Platt Senior Research Scientist Informatics Department Head Systems Biology Institute DOE Joint Genome Institute Xiaoqiu Huang Mihai Pop Associate Professor Assistant Professor Iowa State University University of Maryland Genome Technology Sequence Assembly and Alignment 5 How do you decide whether to Questiontake a global, local, or hybrid approach to sequence alignment? If you know beforehand that two or more sequences problems such as alignment of orthologous proteins are globally related, then it is almost always or overlap of reads in DNA sequencing projects. advantageous to use a global alignment algorithm to Local alignment is best used for searching a database compare them. This will cut down on noise from for a sequence of interest. Hybrid approaches are random local similarities, and will force the needed when aligning large genomic regions alignment of related sequence regions that a local between different species. algorithm might leave unaligned. However, if it is not — Serafim Batzoglou known whether the sequences being compared are globally related, or indeed whether they are related Most sequence assembly tools start by identifying at all, then a local alignment algorithm is to be overlapping sequence reads and aligning them. For preferred. Because most database searches are example, Phrap uses cross match as its alignment exploratory in this sense, almost all popular sequence engine. Arachne implements its own fast heuristic to database search programs, identify overlapping reads, by such as FASTA and Blast, indexing subsequences employ local alignment “Almost all popular sequence (words); the optimal align- methods. ment is obtained using A few years ago, Dr. Yi- database search programs, such dynamic programming in a Kuo Yu described what he as FASTA and Blast, employ local process conceptually related called a "hybrid" alignment to the FASTA algorithm. In method, which was a alignment methods.” the context of sequence combination of the Hidden assembly, therefore, the — Stephen Altschul Markov Model approach to choice of alignment algor- local sequence alignment ithm is out of the user's (which considers all possible hands, having already been paths through a path graph) and the Smith- addressed by the assembly tool developer. Waterman algorithm (which seeks only the optimal — Gustavo Glusman local alignment). This hybrid approach, which is a local alignment method, has significant advantages Start with a local alignment program named SIM. If from a statistical perspective, and perhaps the alignment covers most of the sequences, use a advantages from the perspective of sensitivity as global alignment program named GAP. If the well. However, there is so far no widely available sequences contain similar regions separated by software implementing this approach. different regions, use a pair-wise alignment program — Stephen Altschul named GAP3 or a multiple alignment program named MAP2. The global alignment model is appropriate in specific — Xiaoqiu Huang 6 Sequence Assembly and Alignment Genome Technology In global alignments, sequences are aligned The overarching answer is usually: do the right thing. beginning to end, with gaps inserted whenever If you expect the sequence comes from a gene, then needed. This procedure is appropriate for protein it should align in full over the coding area that you (amino) sequences of individual genes, ribosomal want put it in. So that typically means global RNA sequences, and short stretches of the genomic alignment. In assembly, that's what you want. You sequence. For genomic DNA, it is always necessary to want every read to align along the genome. In account for medium and large-scale rearrangements practice, you might put low-quality sequence off the in addition to large sequence insertions and ends and make sure it actually aligns properly. deletions, which necessitates the building of local — Darren Platt alignments. For aligning DNA sequences of genome segments coding for In general I prefer to use a proteins (protein-coding local alignment procedure, genes), one will often need followed by post-processing to take a hybrid approach in in case I need to place which the first step is to “The overarching answer is additional constraints on the align the translated protein usually: do the right thing.” alignment. For example, sequences by taking a global when mapping a sequenc- approach, which is followed — Darren Platt ing read to a reference by the adjustment of exon genome, I try to ensure the DNA sequences to reflect alignment spans the entire the protein alignment, and read (i.e. a global alignment then the use of local from the point of view of procedure for aligning homologous intron the read). Requiring a global alignment for the reads, sequences. however, would miss polymorphisms between the — Sudhir Kumar DNA being sequenced and the available reference. — Mihai Pop I typically align genomic sequences, so a hybrid approach works best for me. A global alignment alone doesn't work when the parts of the genome you're trying to align are in different segments and all shuffled up, so you need to first pre-process the data to figure out which sequences should be aligning in the first place.
Recommended publications
  • Classifying Transport Proteins Using Profile Hidden Markov Models And
    Classifying Transport Proteins Using Profile Hidden Markov Models and Specificity Determining Sites Qing Ye A Thesis in The Department of Computer Science and Software Engineering Presented in Partial Fulfillment of the Requirements for the Degree of Master of Computer Science (MCompSc) at Concordia University Montréal, Québec, Canada April 2019 ⃝c Qing Ye, 2019 CONCORDIA UNIVERSITY School of Graduate Studies This is to certify that the thesis prepared By: Qing Ye Entitled: Classifying Transport Proteins Using Profile Hidden Markov Models and Specificity Determining Sites and submitted in partial fulfillment of the requirements for the degree of Master of Computer Science (MCompSc) complies with the regulations of this University and meets the accepted standards with respect to originality and quality. Signed by the Final Examining Committee: Chair Dr. T.-H. Chen Examiner Dr. T. Glatard Examiner Dr. A. Krzyzak Supervisor Dr. G. Butler Approved by Martin D. Pugh, Chair Department of Computer Science and Software Engineering 2019 Amir Asif, Dean Faculty of Engineering and Computer Science Abstract Classifying Transport Proteins Using Profile Hidden Markov Models and Specificity Determining Sites Qing Ye This thesis develops methods to classifiy the substrates transported across a membrane by a given transmembrane protein. Our methods use tools that predict specificity determining sites (SDS) after computing a multiple sequence alignment (MSA), and then building a profile Hidden Markov Model (HMM) using HMMER. In bioinformatics, HMMER is a set of widely used applications for sequence analysis based on profile HMM. Specificity determining sites (SDS) are the key positions in a protein sequence that play a crucial role in functional variation within the protein family during the course of evolution.
    [Show full text]
  • BIOGRAPHICAL SKETCH NAME: Berger
    BIOGRAPHICAL SKETCH NAME: Berger, Bonnie eRA COMMONS USER NAME (credential, e.g., agency login): BABERGER POSITION TITLE: Simons Professor of Mathematics and Professor of Electrical Engineering and Computer Science EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing, include postdoctoral training and residency training if applicable. Add/delete rows as necessary.) EDUCATION/TRAINING DEGREE Completion (if Date FIELD OF STUDY INSTITUTION AND LOCATION applicable) MM/YYYY Brandeis University, Waltham, MA AB 06/1983 Computer Science Massachusetts Institute of Technology SM 01/1986 Computer Science Massachusetts Institute of Technology Ph.D. 06/1990 Computer Science Massachusetts Institute of Technology Postdoc 06/1992 Applied Mathematics A. Personal Statement Advances in modern biology revolve around automated data collection and sharing of the large resulting datasets. I am considered a pioneer in the area of bringing computer algorithms to the study of biological data, and a founder in this community that I have witnessed grow so profoundly over the last 26 years. I have made major contributions to many areas of computational biology and biomedicine, largely, though not exclusively through algorithmic innovations, as demonstrated by nearly twenty thousand citations to my scientific papers and widely-used software. In recognition of my success, I have just been elected to the National Academy of Sciences and in 2019 received the ISCB Senior Scientist Award, the pinnacle award in computational biology. My research group works on diverse challenges, including Computational Genomics, High-throughput Technology Analysis and Design, Biological Networks, Structural Bioinformatics, Population Genetics and Biomedical Privacy. I spearheaded research on analyzing large and complex biological data sets through topological and machine learning approaches; e.g.
    [Show full text]
  • The Myth of Junk DNA
    The Myth of Junk DNA JoATN h A N W ells s eattle Discovery Institute Press 2011 Description According to a number of leading proponents of Darwin’s theory, “junk DNA”—the non-protein coding portion of DNA—provides decisive evidence for Darwinian evolution and against intelligent design, since an intelligent designer would presumably not have filled our genome with so much garbage. But in this provocative book, biologist Jonathan Wells exposes the claim that most of the genome is little more than junk as an anti-scientific myth that ignores the evidence, impedes research, and is based more on theological speculation than good science. Copyright Notice Copyright © 2011 by Jonathan Wells. All Rights Reserved. Publisher’s Note This book is part of a series published by the Center for Science & Culture at Discovery Institute in Seattle. Previous books include The Deniable Darwin by David Berlinski, In the Beginning and Other Essays on Intelligent Design by Granville Sewell, God and Evolution: Protestants, Catholics, and Jews Explore Darwin’s Challenge to Faith, edited by Jay Richards, and Darwin’s Conservatives: The Misguided Questby John G. West. Library Cataloging Data The Myth of Junk DNA by Jonathan Wells (1942– ) Illustrations by Ray Braun 174 pages, 6 x 9 x 0.4 inches & 0.6 lb, 229 x 152 x 10 mm. & 0.26 kg Library of Congress Control Number: 2011925471 BISAC: SCI029000 SCIENCE / Life Sciences / Genetics & Genomics BISAC: SCI027000 SCIENCE / Life Sciences / Evolution ISBN-13: 978-1-9365990-0-4 (paperback) Publisher Information Discovery Institute Press, 208 Columbia Street, Seattle, WA 98104 Internet: http://www.discoveryinstitutepress.com/ Published in the United States of America on acid-free paper.
    [Show full text]
  • BIOGRAPHICAL SKETCH NAME: Bonnie Berger POSITION TITLE
    BIOGRAPHICAL SKETCH NAME: Bonnie Berger POSITION TITLE: Simons Professor of Mathematics and Professor of Electrical Engineering & Computer Science EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing, include postdoctoral training and residency training if applicable. Add/delete rows as necessary.) DEGREE Completion (if Date FIELD OF STUDY INSTITUTION AND LOCATION applicable) MM/YYYY Brandeis University, Waltham, MA AB 06/1983 Computer Science Massachusetts Institute of Technology SM 01/1986 Computer Science Massachusetts Institute of Technology Ph.D. 06/1990 Computer Science Massachusetts Institute of Technology Postdoc 06/1992 Applied Mathematics A. Personal Statement Many advances in modern biology revolve around automated data collection and the large resulting data sets. I am considered a pioneer in the area of bringing computer algorithms to the study of biological data, and a founder in this community that I have witnessed grow so profoundly over the last 20 years. I have made major contributions to many areas of computational biology and biomedicine, largely, though not exclusively through algorithmic insights, as demonstrated by ten thousand citations to my scientific papers and widely-used software. My research group works on diverse challenges, including and Computational Genomics, Structural Bioinformatics, High-throughput Technology Analysis and Design, Network Inference, and Data Privacy. We collaborate closely with biologists, MDs, and software engineers, implementing these new techniques in order to design experiments to maximally leverage the power of computation for biological exploration. Over the past five years I have been particularly active analyzing large and complex biological data sets; for example, my lab has played integral roles in modENCODE (non-coding RNA analysis), MPEG (biological data compression standard), and the Broad Institute’s sequence analysis efforts.
    [Show full text]
  • Genome Informatics
    Joint Cold Spring Harbor Laboratory/Wellcome Trust Conference GENOME INFORMATICS September 15–September 19, 2010 View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Cold Spring Harbor Laboratory Institutional Repository Joint Cold Spring Harbor Laboratory/Wellcome Trust Conference GENOME INFORMATICS September 15–September 19, 2010 Arranged by Inanc Birol, BC Cancer Agency, Canada Michele Clamp, BioTeam, Inc. James Kent, University of California, Santa Cruz, USA SCHEDULE AT A GLANCE Wednesday 15th September 2010 17.00-17.30 Registration – finger buffet dinner served from 17.30-19.30 19.30-20:50 Session 1: Epigenomics and Gene Regulation 20.50-21.10 Break 21.10-22.30 Session 1, continued Thursday 16th September 2010 07.30-09.00 Breakfast 09.00-10.20 Session 2: Population and Statistical Genomics 10.20-10:40 Morning Coffee 10:40-12:00 Session 2, continued 12.00-14.00 Lunch 14.00-15.20 Session 3: Environmental and Medical Genomics 15.20-15.40 Break 15.40-17.00 Session 3, continued 17.00-19.00 Poster Session I and Drinks Reception 19.00-21.00 Dinner Friday 17th September 2010 07.30-09.00 Breakfast 09.00-10.20 Session 4: Databases, Data Mining, Visualization and Curation 10.20-10.40 Morning Coffee 10.40-12.00 Session 4, continued 12.00-14.00 Lunch 14.00-16.00 Free afternoon 16.00-17.00 Keynote Speaker: Alex Bateman 17.00-19.00 Poster Session II and Drinks Reception 19.00-21.00 Dinner Saturday 18th September 2010 07.30-09.00 Breakfast 09.00-10.20 Session 5: Sequencing Pipelines and Assembly 10.20-10.40
    [Show full text]
  • Lecture Notes in Bioinformatics 5541 Edited by S
    Lecture Notes in Bioinformatics 5541 Edited by S. Istrail, P. Pevzner, and M. Waterman Editorial Board: A. Apostolico S. Brunak M. Gelfand T. Lengauer S. Miyano G. Myers M.-F. Sagot D. Sankoff R. Shamir T. Speed M. Vingron W. Wong Subseries of Lecture Notes in Computer Science Serafim Batzoglou (Ed.) Research in Computational Molecular Biology 13th Annual International Conference, RECOMB 2009 Tucson, AZ, USA, May 18-21, 2009 Proceedings 13 Series Editors Sorin Istrail, Brown University, Providence, RI, USA Pavel Pevzner, University of California, San Diego, CA, USA Michael Waterman, University of Southern California, Los Angeles, CA, USA Volume Editor Serafim Batzoglou Computer Science Department James H. Clark Center, 318 Campus Drive, RM S266 Stanford, CA 94305-5428, USA E-mail: serafi[email protected] Library of Congress Control Number: Applied for CR Subject Classification (1998): J.3, I.3.5, F.2.2, F.2, G.2.1 LNCS Sublibrary: SL 8 – Bioinformatics ISSN 0302-9743 ISBN-10 3-642-02007-0 Springer Berlin Heidelberg New York ISBN-13 978-3-642-02007-0 Springer Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer.
    [Show full text]
  • BIOINFORMATICS Doi:10.1093/Bioinformatics/Btu322
    Vol. 30 ISMB 2014, pages i3–i8 BIOINFORMATICS doi:10.1093/bioinformatics/btu322 ISMB 2014 PROCEEDINGS PAPERS COMMITTEE PROCEEDINGS PAPERS COMMITTEE CHAIRS F. Gene Regulation and Transcriptomics Serafim Batzoglu, Stanford University, United States Alexander Haremink, Duke University, Durham, United States Russell Schwartz, Carnegie Mellon University, Pittsburgh, Zohar Yakhini, Agilent, Haifa, Israel United States G. Mass Spectrometry and Proteomics Olga Vitek, Purdue University, West Lafayette, United States Bill Noble, University of Washington, Seattle, United States PROCEEDINGS PAPERS-AREA CHAIRS A. Applied Bioinformatics H. Metabolic Networks Thomas Lengauer, Max Planck Institute for Informatics, Jason Papin, University of Virginia, Charlottesville, United Saarbrucken, Germany States Lenore Cowen, Tufts University, Medford, United States I. Population Genomics Eran Halperin, Tel-Aviv University, Israel B. Bioimaging and Data Visualization Itsik Pe’er, Columbia University, New York, United States Robert Murphy, Carnegie Mellon University, Pittsburgh, United States J. Protein Interactions and Molecular Networks Mona Singh, Princeton University, United States C. Databases and Ontologies and Text Mining Trey Ideker, UC San Diego, United States Hagit Shatkay, University of Delaware, Newark, United States Alex Bateman, European Bioinformatics Institute (EMBL-EBI), K. Protein Structure and Function Wellcome Trust Genome Campus, Hinxton, United Kingdom Jie Liang, University of Illinois at Chicago, United States Jinbo Xu, Toyota Technical Institute,
    [Show full text]
  • Reports from the Fifth Edition of CAGI: the Critical Assessment of Genome Interpretation
    Received: 16 July 2019 | Accepted: 19 July 2019 DOI: 10.1002/humu.23876 OVERVIEW Reports from the fifth edition of CAGI: The Critical Assessment of Genome Interpretation Gaia Andreoletti1 | Lipika R. Pal2 | John Moult2,3 | Steven E. Brenner1,4 1Department of Plant and Microbial Biology, University of California, Berkeley, California Abstract 2Institute for Bioscience and Biotechnology Interpretation of genomic variation plays an essential role in the analysis of Research, University of Maryland, Rockville, cancer and monogenic disease, and increasingly also in complex trait disease, with Maryland 3Department of Cell Biology and Molecular applications ranging from basic research to clinical decisions. Many computational Genetics, University of Maryland, College impact prediction methods have been developed, yet the field lacks a clear consensus Park, Maryland on their appropriate use and interpretation. The Critical Assessment of Genome 4Center for Computational Biology, University of California, Berkeley, California Interpretation (CAGI, /'kā‐jē/) is a community experiment to objectively assess computational methods for predicting the phenotypic impacts of genomic variation. Correspondence John Moult, Institute for Bioscience and CAGI participants are provided genetic variants and make blind predictions of Biotechnology Research, University of resulting phenotype. Independent assessors evaluate the predictions by comparing Maryland, 9600 Gudelsky Drive, Rockville, MD 20850. with experimental and clinical data. Email: [email protected] CAGI has completed five editions with the goals of establishing the state of art in Steven E. Brenner, Department of Plant and genome interpretation and of encouraging new methodological developments. This Microbial Biology, University of California, special issue (https://onlinelibrary.wiley.com/toc/10981004/2019/40/9) comprises Berkeley, CA 94720.
    [Show full text]
  • Analyses of Deep Mammalian Sequence Alignments and Constraint Predictions for 1% of the Human Genome
    Downloaded from genome.cshlp.org on October 3, 2021 - Published by Cold Spring Harbor Laboratory Press Article Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome Elliott H. Margulies,2,7,8,21 Gregory M. Cooper,2,3,9 George Asimenos,2,10 Daryl J. Thomas,2,11,12 Colin N. Dewey,2,4,13 Adam Siepel,5,12 Ewan Birney,14 Damian Keefe,14 Ariel S. Schwartz,13 Minmei Hou,15 James Taylor,15 Sergey Nikolaev,16 Juan I. Montoya-Burgos,17 Ari Löytynoja,14 Simon Whelan,6,14 Fabio Pardi,14 Tim Massingham,14 James B. Brown,18 Peter Bickel,19 Ian Holmes,20 James C. Mullikin,8,21 Abel Ureta-Vidal,14 Benedict Paten,14 Eric A. Stone,9 Kate R. Rosenbloom,12 W. James Kent,11,12 NISC Comparative Sequencing Program,1,8,21 Baylor College of Medicine Human Genome Sequencing Center,1 Washington University Genome Sequencing Center,1 Broad Institute,1 UCSC Genome Browser Team,1 British Columbia Cancer Agency Genome Sciences Center,1 Stylianos E. Antonarakis,16 Serafim Batzoglou,10 Nick Goldman,14 Ross Hardison,22 David Haussler,11,12,24 Webb Miller,22 Lior Pachter,24 Eric D. Green,8,21 and Arend Sidow9,25 A key component of the ongoing ENCODE project involves rigorous comparative sequence analyses for the initially targeted 1% of the human genome. Here, we present orthologous sequence generation, alignment, and evolutionary constraint analyses of 23 mammalian species for all ENCODE targets. Alignments were generated using four different methods; comparisons of these methods reveal large-scale consistency but substantial differences in terms of small genomic rearrangements, sensitivity (sequence coverage), and specificity (alignment accuracy).
    [Show full text]
  • Structural Variation Discovery: the Easy, the Hard
    STRUCTURAL VARIATION DISCOVERY: THE EASY, THE HARD AND THE UGLY by Fereydoun Hormozdiari B.Sc., Sharif University of Technology, 2004 M.Sc., Simon Fraser University, 2007 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in the School of Computing Science Faculty of Applied Science c Fereydoun Hormozdiari 2011 SIMON FRASER UNIVERSITY Fall 2011 All rights reserved. However, in accordance with the Copyright Act of Canada, this work may be reproduced without authorization under the conditions for Fair Dealing. Therefore, limited reproduction of this work for the purposes of private study, research, criticism, review and news reporting is likely to be in accordance with the law, particularly if cited appropriately. APPROVAL Name: Fereydoun Hormozdiari Degree: Doctor of Philosophy Title of thesis: Structural Variation Discovery: the easy, the hard and the ugly Examining Committee: Dr. Ramesh Krishnamurti Chair Dr. S. Cenk Sahinalp, Senior Supervisor Computing Science, Simon Fraser University Dr. Evan E. Eichler, Supervisor Genome Sciences, University of Washington Dr. Artem Cherkasov, Supervisor Urologic Sciences, University of British Columbia Dr. Inanc Birol, Supervisor Bioinformatics group leader, GSC, BC Cancer Agency Dr. Fiona Brinkman, Internal Examiner Molecular Biology and Chemistry, Simon Fraser Univer- sity Dr. Serafim Batzoglou, External Examiner Computer Science, Stanford University Date Approved: August 22nd, 2011 ii Partial Copyright Licence Abstract Comparison of human genomes shows that along with single nucleotide polymorphisms and small indels, larger structural variants (SVs) are common. Recent studies even suggest that more base pairs are altered as a result of structural variations (including copy number variations) than as a result of single nucleotide variations or small indels.
    [Show full text]
  • ABSTRACT GENOME ASSEMBLY and VARIANT DETECTION USING EMERGING SEQUENCING TECHNOLOGIES and GRAPH BASED METHODS Jay Ghurye Doctor
    ABSTRACT Title of dissertation: GENOME ASSEMBLY AND VARIANT DETECTION USING EMERGING SEQUENCING TECHNOLOGIES AND GRAPH BASED METHODS Jay Ghurye Doctor of Philosophy, 2018 Dissertation directed by: Professor Mihai Pop Department of Computer Science The increased availability of genomic data and the increased ease and lower costs of DNA sequencing have revolutionized biomedical research. One of the critical steps in most bioinformatics analyses is the assembly of the genome sequence of an organ- ism using the data generated from the sequencing machines. Despite the long length of sequences generated by third-generation sequencing technologies (tens of thousands of basepairs), the automated reconstruction of entire genomes continues to be a formidable computational task. Although long read technologies help in resolving highly repetitive regions, the contigs generated from long read assembly do not always span a complete chromosome or even an arm of the chromosome. Recently, new genomic technologies have been developed that can “bridge” across repeats or other genomic regions that are difficult to sequence or assemble and improve genome assemblies by “scaffolding” to- gether large segments of the genome. The problem of scaffolding is vital in the context of both single genome assembly of large eukaryotic genomes and in metagenomics where the goal is to assemble multiple bacterial genomes in a sample simultaneously. First, we describe SALSA2, a method we developed to use interaction frequency between any two loci in the genome obtained using Hi-C technology to scaffold frag- mented eukaryotic genome assemblies into chromosomes. SALSA2 can be used with either short or long read assembly to generate highly contiguous and accurate chromo- some level assemblies.
    [Show full text]
  • ENCODE MSA.Pdf
    Downloaded from www.genome.org on June 14, 2007 Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome Elliott H. Margulies, Gregory M. Cooper, George Asimenos, Daryl J. Thomas, Colin N. Dewey, Adam Siepel, Ewan Birney, Damian Keefe, Ariel S. Schwartz, Minmei Hou, James Taylor, Sergey Nikolaev, Juan I. Montoya-Burgos, Ari Löytynoja, Simon Whelan, Fabio Pardi, Tim Massingham, James B. Brown, Peter Bickel, Ian Holmes, James C. Mullikin, Abel Ureta-Vidal, Benedict Paten, Eric A. Stone, Kate R. Rosenbloom, W. James Kent, Gerard G. Bouffard, Xiaobin Guan, Nancy F. Hansen, Jacquelyn R. Idol, Valerie V.B. Maduro, Baishali Maskeri, Jennifer C. McDowell, Morgan Park, Pamela J. Thomas, Alice C. Young, Robert W. Blakesley, Donna M. Muzny, Erica Sodergren, David A. Wheeler, Kim C. Worley, Huaiyang Jiang, George M. Weinstock, Richard A. Gibbs, Tina Graves, Robert Fulton, Elaine R. Mardis, Richard K. Wilson, Michele Clamp, James Cuff, Sante Gnerre, David B. Jaffe, Jean L. Chang, Kerstin Lindblad-Toh, Eric S. Lander, Angie Hinrichs, Heather Trumbower, Hiram Clawson, Ann Zweig, Robert M. Kuhn, Galt Barber, Rachel Harte, Donna Karolchik, Matthew A. Field, Richard A. Moore, Carrie A. Matthewson, Jacqueline E. Schein, Marco A. Marra, Stylianos E. Antonarakis, Serafim Batzoglou, Nick Goldman, Ross Hardison, David Haussler, Webb Miller, Lior Pachter, Eric D. Green and Arend Sidow Genome Res. 2007 17: 760-774 Access the most recent version at doi:10.1101/gr.6034307 Supplementary "Supplemental Reseach Data" data http://www.genome.org/cgi/content/full/17/6/760/DC1 References This article cites 68 articles, 33 of which can be accessed free at: http://www.genome.org/cgi/content/full/17/6/760#References Article cited in: http://www.genome.org/cgi/content/full/17/6/760#otherarticles Open Access Freely available online through the Genome Research Open Access option.
    [Show full text]