Algorithms and Applications of Next-Generation DNA Sequencing
ChIP-Seq, database of human variations, and analysis of mammary ductal carcinomas

by

Anthony Peter Fejes

Bachelor of Science, Biochemistry (Hons. Co-op), University of Waterloo, 2000
Bachelor of Independent Studies, University of Waterloo, 2001
Master of Science, Microbiology & Immunology, The University of British Columbia, 2004

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

THE FACULTY OF GRADUATE STUDIES (Bioinformatics)

The University Of British Columbia (Vancouver)

April 2012

© Anthony Peter Fejes, 2012

Abstract

Next Generation Sequencing (NGS) technologies enable Deoxyribonucleic Acid (DNA) or Ribonucleic Acid (RNA) sequencing at volumes and speeds several orders of magnitude greater than Sanger (dideoxy termination) based methods, and have enabled the development of novel experiment types that would not have been practical before the advent of NGS-based machines. The dramatically increased throughput of these new protocols requires significant changes to the algorithms used to process and analyze the results. In this thesis, I present novel algorithms for Chromatin Immunoprecipitation and Sequencing (ChIP-Seq), describe the structures required and the challenges faced when working with Single Nucleotide Variations (SNVs) across a large collection of samples, and present the results of an NGS-based analysis of eight mammary ductal carcinoma cell lines and four matched normal cell lines.

Preface

The work described in this thesis is based entirely upon research done at Canada’s Michael Smith Genome Sciences Centre (BCGSC) in Dr. Steve J.M. Jones’ group by Anthony Fejes.
Two exceptions to this statement are the work on the DICER1 gene and the Motif Identification for ChIP-Seq Analysis (MICSA) software package, both of which involved collaborative work for which Anthony Fejes was granted co-authorship on subsequent publications. Contributions for each collaboration are detailed below.

Work on chapter 2 was done by Anthony Fejes, with code contributions by Timothee Cezard, and with the guidance of Drs. Gordon Robertson and Mikhail Bilenky. Code contributions consist of the Lander-Waterman algorithm (implemented by Dr. Bilenky and merged into the FindPeaks code repository by Timothee Cezard), as well as numerous bug fixes contributed by Timothee Cezard. The work in this chapter is, in part, published in an application note, written by Anthony Fejes:

A. P. Fejes et al. “FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology”. In: Bioinformatics 24 (Aug. 2008), pp. 1729–1730

Background information and a literature review on Chromatin Immunoprecipitation and Sequencing (ChIP-Seq) were published in the textbook:

A. P. Fejes and S. J. Jones. “Chip-Seq: Mapping of Protein-DNA Interactions”. In: Next-Generation Genome Sequencing: Towards Personalized Medicine. Ed. by Michal Janitz. John Wiley & Sons, November 2008

Discussion of the MICSA software and extensions to the FindPeaks package in chapter 2 describes collaborative work directed by Valentina Boeva of the Institut Curie. Contributions to this publication included the completed FindPeaks 3.3 software, used as the basis for the MICSA algorithms, as well as support and consultations with the co-authors.

V. Boeva et al. “De novo motif identification improves the accuracy of predicting transcription factor binding sites in ChIP-Seq data analysis”. In: Nucleic Acids Res. 38 (June 2010), e126

Work on chapter 3 was done by Anthony Fejes, with contributions to insertion/deletion processing by Alireza Hadj Khodabakhshi.
An He assisted by automating and importing the data sets into the database. The work in this chapter is, in part, published in an application note, written by Anthony Fejes with contributions by Alireza Hadj Khodabakhshi:

A. P. Fejes et al. “Human variation database: an open-source database template for genomic discovery”. In: Bioinformatics 27 (Apr. 2011), pp. 1155–1156

Details of the exact contributions of developers to the code base discussed in chapter 2 as well as chapter 3 can be obtained through the code repository at http://vancouvershortr.svn.sourceforge.net/viewvc/vancouvershortr.

Work on the dicer 1, ribonuclease type III (DICER1) gene was performed in collaboration with Alireza Moussavi in the Huntsman lab. Contributions to this work included customized searches to assist in the identification and classification of recurrent variations and access to the software and database described in chapter 3. The work described in chapter 3 has been published:

A. Heravi-Moussavi et al. “Recurrent Somatic DICER1 Mutations in Nonepithelial Ovarian Cancers”. In: N Engl J Med (Dec. 2011)

Work on chapter 4 was done by Anthony Fejes, using data generated at the B.C. Genome Sciences Centre, with the assistance of Richard Varhol (Alignment), Nina Theissen (Single Nucleotide Polymorphism (SNP)-calling) and Karen Mungal and Readman Chu (Ribonucleic Acid (RNA) Assembly). This work has not been published.

Work on chapter 5 was done by Anthony Fejes (bioinformatics) and Steven Leach (wet lab work, including Sanger sequencing), with the exception of the screening of the panel of ductal carcinomas, which was organized by Sohrab Shah and Kane Tse, and analyzed by Anthony Fejes. This work has not been published.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
List of Acronyms
Acknowledgements

1 Introduction
  1.1 Research Presented
    1.1.1 Research not Included
    1.1.2 Outline
    1.1.3 Goals of this Research
  1.2 Vancouver Short Read Analysis Package
    1.2.1 Open Source Bioinformatics
    1.2.2 Why do Open Source Bioinformatics?
    1.2.3 Libraries
    1.2.4 Availability
  1.3 ChIP-Seq
    1.3.1 Background
    1.3.2 History of Chromatin Immunoprecipitation
    1.3.3 Medical Applications of ChIP-Seq
    1.3.4 Challenges
    1.3.5 Future Uses of ChIP-Seq
    1.3.6 FindPeaks
  1.4 Variation Database
    1.4.1 Variations
    1.4.2 Pipelines Producing SNP- and SNV-Calls
    1.4.3 Single Nucleotide Polymorphism Databases
  1.5 Relational Databases
    1.5.1 SQL Access
    1.5.2 PostgreSQL
    1.5.3 ODBC
    1.5.4 Keys and Indexing
    1.5.5 Hardware Performance
    1.5.6 Optimization
    1.5.7 Background on the Variation Database
  1.6 Breast Cancer
    1.6.1 Molecular Subtypes
    1.6.2 ATCC Mammary Ductal Carcinoma Cell lines
    1.6.3 Epstein-Barr/B-Cell Derived Matched Normals
    1.6.4 Research Done
    1.6.5 Recurrent Variations
    1.6.6 Purpose
  1.7 Notch Genes, Strawberry Notch and the Epidermal Growth Factor
    1.7.1 Pathways
    1.7.2 Notch Signalling

2 ChIP-Seq
  2.1 FindPeaks
  2.2 Paired End Tag versus Single End Tag ChIP-Seq
  2.3 Read Length Modelling
    2.3.1 Native Lengths - No Extension
    2.3.2 Hard Extension
    2.3.3 Triangle Distribution
    2.3.4 Read Shifting
  2.4 Peak Calling
    2.4.1 Trimming Peaks
    2.4.2 Peak Separation
  2.5 False Discovery Rates and ChIP-Seq Controls
    2.5.1 Sources of Error
    2.5.2 Simulated Control - Monte Carlo
    2.5.3 Simulated Control - Lander-Waterman
    2.5.4 Minimal Biological Control - Null Immunoprecipitation Control
    2.5.5 Biological Control
  2.6 Comparing ChIP-Seq Experiments
    2.6.1 Normalization of ChIP-Seq Results
    2.6.2 Normalization by Equivalent Peaks
    2.6.3 Limitation of Normalization by Equivalent Peaks
    2.6.4 Statistics
    2.6.5 Post-Normalization Processing
  2.7 Analysis of Normalized Samples
    2.7.1 Comparison by Ratio - Method of Perpendicular Lines
    2.7.2 Comparison by Equivalent Areas - “Method of Hyperbolic Sections”
  2.8 Example - Extending FindPeaks
    2.8.1 Method
    2.8.2 EWS-FLI1
  2.9 Summary

3 Variation Database
  3.1 SNVs and INDELs
  3.2 Methods
    3.2.1 Novel Functions
    3.2.2 Data
    3.2.3 Graphic Output
    3.2.4 Input Formats
    3.2.5 Library Information
    3.2.6 Variation Annotations
  3.3 Design
    3.3.1 Design Philosophies
  3.4 Modularity
    3.4.1 Database Application Programming Interface
    3.4.2 File Iterators
    3.4.3 User Interface
  3.5 Common Use-Cases and User Interactions
    3.5.1 Querying
    3.5.2 ExperimentalRecord
    3.5.3 Concordance
    3.5.4 Modifying the Application Programming Interface (API) and the User Interface (UI)
    3.5.5 Ensembl
  3.6 Applications Using the Variation Database
    3.6.1 Filtering Polymorphisms
    3.6.2 Filtering Recurrent Variations
    3.6.3 Filtering to Identify Cancer Drivers
    3.6.4 Variations Only Found in Cancer
    3.6.5 Variations Never Found in Cancer
    3.6.6 RNA Editing
    3.6.7 Transition and Transversion Frequency
    3.6.8 Growth of the Database