Development of a Transcriptome-Based Genome Assembly Tool and Whole Genome Sequencing for Autism Spectrum Disorders

Robert Baldwin
Centre for Biotechnology

Submitted in partial fulfillment of the requirements for the degree of Master of Science

Faculty of Biological Sciences, Brock University
St. Catharines, Ontario

© 2018

Abstract

This thesis consisted of two independent projects. The first involved developing a software tool that uses transcriptome data to improve genome assemblies. The second involved processing and analyzing whole genome sequencing (WGS) data from the ASPIRE autism spectrum disorder (ASD) cohort.

The first project produced the bioinformatics software called RDNA. This free tool was written in Perl and should be valuable for users interested in genome assembly. Comparative assessment between RDNA and the leading transcript-based scaffolding software showed that RDNA can significantly improve genome assemblies while making relatively few scaffolding connection errors. RDNA also makes possible the assembly of scaffolding connections, including gap filling, using BLAST.

The second project was undertaken with collaborators. The ASPIRE ASD cohort consisted of several hundred probands from both simplex and multiplex families. Sequencing was performed for 120 of these individuals, who were selected based on membership in two phenotype clusters (C1 and C2). These individuals had a relatively high rate of intellectual disability (ID) compared to heavily studied ASD cohorts such as the Simons Simplex Collection (SSC), indicating a significant involvement of de novo sequence variants. Analysis of rare single nucleotide variants (SNVs) and insertions/deletions (indels) identified large risk factors for severe neurodevelopmental disorders (NDDs), two of which were previously observed de novo among individuals with severe, undiagnosed NDDs.
On this basis, ABCA1 was found to be a novel candidate risk gene. Gene Ontology (GO) analysis of rare loss-of-function and missense SNVs indicated the importance of lipid metabolic processes and synaptic signalling. Overall, the genetic variation examined by this study pertained to a modest number of cases, consistent with previous findings that ASD is a genetically heterogeneous disorder with a complex genetic architecture.

Table of Contents

Chapter I: Improving de novo genome assemblies using RDNA
Chapter II: Whole genome sequencing and variant discovery in the ASPIRE autism cohort
References
Appendices

Acknowledgments

I’d like to thank my supervisor, Ping Liang, for taking me on as his student and supervising my work, including helping to edit this manuscript. I’d also like to thank Radesh for his work helping to prepare the genome assemblies needed to finish the evaluations of the assembly software, and my committee members, Fiona Hunter and Feng Li.
Tables and Figures

Figure 1.1: Overview of RDNA scaffolding method
Figure 1.2: Scaffolding performance of RDNA and LRNA for human data
Table 1.1: Scaffolding performance of RDNA and LRNA for human transcriptome data
Table 1.2: RDNA and L_RNA_scaffolder BUSCO assessment results for human transcriptome data
Figure 1.3: Error rates for RDNA and LRNA for human data
Figure 1.4: Scaffolding performance of RDNA and LRNA for lavender data
Table 1.3: RDNA and L_RNA_scaffolder genome assembly quality metrics for lavender data
Table 1.4: RDNA and L_RNA_scaffolder BUSCO results for lavender data
Table 2.1: GO enrichment results from the combined C1 and C2 SNV gene set
Table 2.2: Previously validated de novo variants
Table 2.3: Prioritized novel PTVs and missense variants
Appendix, Table 1: RDNA output example
Appendix, Table 2: Variants identified in this study previously prioritized by collaborators
Appendix, Table 3: All missense variants identified in this study not previously prioritized by collaborators
Appendix, Table 4: All PTVs identified in this study not previously prioritized by collaborators
Appendix, Table 5: C1 and C2 missense and PTV gene sets used for GO analysis
Appendix, Table 6: Key variant calling metrics collected with PICARD

Abbreviations

ASD – autism spectrum disorder
BAM – binary alignment and mapping file
BGI – Beijing Genomics Institute
BQSR – base quality score recalibration
CADD – combined annotation dependent depletion
CNV – copy number variant
DDD – Deciphering Developmental Disorders Study
DSM – diagnostic and statistical manual of mental disorders
EST – expressed sequence tags
ExAC – Exome Aggregation Consortium
GATK – genome analysis toolkit
GDD – global developmental delay
gnomAD – Genome Aggregation Database
GO – gene ontology
ID – intellectual disability
LGD – likely gene disrupting variant
MSSNG – MISSING Project
NDD – neurodevelopmental disorder
PE – putative error
PTV – protein truncating variant
pLI – probability of loss of function intolerance (ExAC)
RefSeq – NCBI reference sequence
SFARI – Simons Foundation Autism Research Initiative
SSC – Simons Simplex Collection
SRA – sequence read archive
SNV – single nucleotide variant
VEP – Ensembl variant effect predictor
VCF – variant call format
VQSR – variant quality score recalibration
WES – whole exome sequencing
WGS – whole genome sequencing
1000GP – One thousand genomes project

Chapter I: Improving de novo genome assemblies using transcriptome sequences

Abstract

Genome assembly is a major challenge due to the short reads generated by next generation sequencing (NGS) technology, and it can be assisted by the connection information provided by NGS RNA sequencing (RNA-seq) data. In particular, transcriptomes are commonly assembled as part of genome projects. Yet assembly tools designed to work with these data are lacking: only a single tool, L_RNA_scaffolder, is available. Presented here is a transcript-based genome assembly tool called RDNA. Assessment using human data showed that the connection error rate increased dramatically for both RDNA and L_RNA_scaffolder when using a transcriptome assembled de novo rather than the NCBI reference sequence (RefSeq) transcripts. However, the connection error rate for RDNA was much lower than for L_RNA_scaffolder. The higher error rate for L_RNA_scaffolder was the cost of making a greater number of scaffolding connections. In addition, RDNA offers utilities for gap filling and for joining and collapsing overlapping scaffolds. Overall, RDNA has advantages over L_RNA_scaffolder and should be especially useful for users wishing to minimize assembly errors.

Introduction:

The scaffolding stage of genome assemblies generally relies on paired reads obtained from long-insert fosmid or jumping libraries to order, orient, and connect contigs into larger sequences called scaffolds.
Unfortunately, this method of scaffolding is complex, time consuming, and leads to a high rate of incorrect and missed connections. In the ongoing effort to improve genome assemblies, there is interest in novel scaffolding methods that make use of data generated by NGS RNA-seq approaches. Transcriptome profiling will likely be included as part of many genome projects, and the data can be used to scaffold the transcribed regions of a genome (Xue et al. 2013). These regions correspond to features such as genes, non-coding RNA, and small RNA, which will most likely be the primary subject of subsequent biological research. Because the quality of these annotations depends on the quality of the underlying assembly, scaffolding with RNA-seq data helps ensure that the genomic regions containing these features are well assembled (Denton et al. 2014).

Although mRNAs and expressed sequence tags (ESTs) were used to help assemble the human reference genome, there are very few readily available tools specifically designed to work with RNA-seq data. For transcriptome-based scaffolding there is a single tool, L_RNA_scaffolder (Xue et al. 2013). Furthermore, the performance of this tool has been poorly evaluated. Perhaps for these reasons, some genome projects continue to rely on custom RNA-seq based scaffolding strategies (Warren et al. 2015), or have decided to avoid them entirely.

Presented here is a transcript-based scaffolding tool called RDNA that is capable of outperforming L_RNA_scaffolder based on assembly quality metrics and BUSCO genome completeness analysis (Simao et al. 2015). Comparative assessment showed the connection error rate of RDNA to be less than half that of L_RNA_scaffolder, regardless of which transcript data were used as input. RDNA therefore offers an alternative scaffolding method that provides a much lower connection error rate than L_RNA_scaffolder.
Furthermore, it includes sequence merging, including gap filling, as a unique alternative to scaffolding connections. The Perl code for RDNA is freely available on GitHub (http://github.com/RobertWBaldwin).

Material and Methods:

The main problem and general strategy used to solve it

RDNA was written in Perl (version 5.22.1) and uses a familiar “divide and conquer” strategy, whereby a complex problem is broken down into a set of smaller, easier-to-solve problems. Most of these smaller problems were solved using graph algorithms, or modifications of such algorithms, which are used extensively when working with biological sequence data, including genome assembly problems (Jones and Pevzner 2009). Using this approach, a genomic
