A Thesis Entitled Human Genome and Transcriptome Analysis with Next

A Thesis entitled
Human Genome and Transcriptome Analysis with Next-Generation Sequencing
by Basil Khuder

Submitted to the Graduate Faculty as partial fulfillment of the requirements for the Master of Science in Biomedical Sciences Degree in Bioinformatics, Proteomics, and Genomics

Dr. Alexei Fedorov, Committee Chair
Dr. David Kennedy, Committee Member
Dr. Robert Blumenthal, Committee Member
Dr. Amanda Bryant-Friedrich, Dean, College of Graduate Studies

The University of Toledo
August 2017

Copyright 2017, Basil Khuder

This document is copyrighted material. Under copyright law, no parts of this document may be reproduced without the express permission of the author.

An Abstract of
Human Genome and Transcriptome Analysis with Next-Generation Sequencing
by Basil Khuder

Submitted to the Graduate Faculty as partial fulfillment of the requirements for the Master of Science in Biomedical Sciences Degree in Bioinformatics, Proteomics, and Genomics

The University of Toledo
August 2017

Advancements in Next-Generation Sequencing technologies and steep declines in costs have enabled sequencing to occur at astronomical rates. With this technology, researchers have made great strides in advancing our understanding of the human genome. This surplus of data has also given rise to many popular databases, such as NCBI’s Sequence Read Archive (SRA), that host the enormous volume of files. The onslaught of data has, however, also created predicaments for scientists, as researchers are still trying to find optimal methods for using and processing these data to answer some of their most challenging questions.

Here we present two approaches to normalizing and analyzing high-throughput data that can help answer questions about the human genome and transcriptome, and we demonstrate how constructing sequence analysis methods can produce substantial biological implications. The first approach uses NCBI’s Sequence Read Archive to analyze unnormalized RNA-Seq data. An algorithm was created to manually normalize the data, allowing us to use the SRA database to look for plausible expression levels of long, highly conserved, non-coding sequences within the Fat-Mass and Obesity Gene, FTO. Bioinformatic software was then used to try to confirm both our preliminary results and predicted patterns of expression. The second approach uses pre-existing genome assembly software, namely the Genome Analysis Toolkit by the Broad Institute, to normalize exome-sequencing data from two individuals afflicted with retinitis pigmentosa and to find variants that might have contributed to the disease. These two approaches showcase how researchers can readily analyze their data and gain far better insight into the human genome and transcriptome.

I dedicate this thesis to my loving parents, Maha and Sadik, for their support throughout this process.

Acknowledgements

I would like to extend my deepest gratitude to my adviser, Dr. Alexei Fedorov, for his great mentorship over the past two years. His helpfulness and guidance have allowed me to explore my passion in the fields of Computational Biology and Bioinformatics and have guided me towards a future career. I would also like to thank Dr. David Kennedy and Dr. Robert Blumenthal for assisting me through the process as members of my thesis committee. Lastly, I would like to thank my lab colleagues Patrick Brennan, Sharmistha Chakraborthy, Rajib Dutta, Joseph Mainsah, and Yuriy Yatskiv for their friendship and assistance.
Table of Contents

Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Abbreviations
1 Transcriptome Analysis of Ultra-Conserved Elements in Human Fat-Mass and Obesity Gene (FTO)
    1.1 Abstract
    1.2 Introduction
        1.2.1 FTO GWAS and Regulatory Studies
        1.2.2 FTO Transcripts
    1.3 Methods
        1.3.1 Non-Code Database
        1.3.2 SRA-to-BLAST
        1.3.3 RNA-Seq by Expectation Maximization
    1.4 Results
        1.4.1 SRA-to-BLAST Hits
        1.4.2 SRA-to-BLAST Controls
        1.4.3 RSEM
    1.5 Discussion
    1.6 Conclusion
2 Investigations into Exome Sequencing Data Provide Insights into Rare Retinal Disease
    2.1 Abstract
    2.2 Introduction
        2.2.1 RP1L1
    2.3 Methods
        2.3.1 Genome Alignment
        2.3.2 Variant Calling
        2.3.3 Custom Variant Filtration
    2.4 Results
    2.5 Discussion
    2.6 Conclusion
References
A Experimental Results from SRA-to-BLAST
B Linux Commands for Genomic Alignment and Variant Calling

List of Tables

1.1 FTO Intron and Exon Coordinates
1.2 FTO UCNE Coordinates
1.3 FTO Transcripts
1.4 RSEM Results

List of Figures

1-1 FTO Gene Depiction
1-2 FTO Downstream Model
1-3 UCNE Expression Workflow
1-4 UCNE with Highest Hits
1-5 Average UCNE Hits
1-6 Reads Spots and Average UCNE Hits
1-7 Intron #5 Hits and Average UCNE Hits
1-8 Intron #5 Hits and UCNE #4 Hits
2-1 Workflow for Genome Alignment and Variant Calling
2-2 Representation of Reference Indexing
2-3 Filtration Steps Conducted
2-4 Distribution of SNPs and INDELs across Chromosomes
