A Thesis

entitled

SNPs and Indels Analysis in Human Genome using Computer Simulation and Sequencing Data

by

Sharmistha Chakrabortty

Submitted to the Graduate Faculty as partial fulfillment of the requirements for the Master of Science Degree in Biomedical Sciences: Bioinformatics, Proteomics and Genomics

________________________________________
Dr. Alexei Fedorov, Committee Chair

________________________________________
Dr. Robert Blumenthal, Committee Member

________________________________________
Dr. Sadik Khuder, Committee Member

________________________________________
Dr. Amanda Bryant-Friedrich, Dean
College of Graduate Studies

The University of Toledo
August 2017

Copyright 2017, Sharmistha Chakrabortty

This document is copyrighted material. Under copyright law, no parts of this document may be reproduced without the expressed permission of the author.

An Abstract of

SNPs and Indels Analysis in Human Genome using Computer Simulation and Sequencing Data

by

Sharmistha Chakrabortty

Submitted to the Graduate Faculty as partial fulfillment of the requirements for the Master of Science Degree in Biomedical Sciences: Bioinformatics, Proteomics and Genomics

The University of Toledo
August 2017

Genetic variations are heritable changes in DNA caused by mutation and can be present in both coding and non-coding regions of the DNA. They provide a rich resource for the evolution of an organism in response to environmental and biological changes. Analysis of these variants, such as Single Nucleotide Polymorphisms (SNPs), Indels, and other structural variants like Copy Number Variations (CNVs), thus has a wide range of potential applications. These include identification of causative variants and genes for genetic diseases, personalized genomics, population and evolutionary genetics, and forensic biology. This study presents two such applications of human variant analysis, particularly the analysis of SNPs and Indels.
In the first chapter, SNPs were analyzed to understand the correlation between recombination rate and genetic diversity in the human genome, using a computational modeling program. A simulated human population was used to study the effect of various population-level factors, such as natural selective forces and the type of mutations, on this correlation. In the second chapter, Next Generation Sequencing data (in this case, Whole Exome Sequencing data) and associated computational variant analysis tools and software were used to analyze both SNPs and Indels in human genomes to find a lead candidate genetic variant responsible for Inherited Retinal Dystrophy in a family.

I dedicate this work firstly to Lord Almighty for bestowing his kind blessings on me at every stage of my life, and to my parents, Shri Arun Kumar Chakrabortty and Smt. Sikha Chakrabortty, and my brother, Dr. Sudipto Kumar Chakrabortty, who have supported me all my life and encouraged me to chase my dreams, no matter how far-fetched and difficult they may seem. Finally, I would also like to dedicate this work to all my past teachers and co-workers, who have inspired me and ignited my mind with curiosity and a thirst for knowledge.

Acknowledgements

First and foremost, I would like to acknowledge the immense contribution of my parents, Shri Arun Kumar Chakrabortty and Smt. Sikha Chakrabortty, to the successful completion of my research, for it is they who kept me going despite numerous difficulties and always inspired me to reach my goals. They sacrificed their present to secure my future. Without the unwavering support and continuous encouragement of my brother, Dr. Sudipto Kumar Chakrabortty, this work would never have seen the light of day. I would like to take this opportunity to thank my advisor, Dr. Alexei Fedorov, for his immense patience and constant motivation while I took my baby steps towards the huge ocean of scientific knowledge.
I will be forever indebted to him for his irreplaceable ideas on critical analysis and his vital strategies for dealing with seemingly insurmountable bioinformatics and algorithmic challenges. I am also deeply obliged to my teachers and committee members, Dr. Robert Blumenthal, Dr. Sadik Khuder, and Dr. Robert Trumbly, for their invaluable professional and personal lessons for a successful bioinformatics career; without their constant support, this degree would not have come to fruition. I also owe gratitude to my coworkers Rajib Dutta, Patrick Brennan and Basil Khuder for their inspiring ideas, for their constant assistance while we worked as a team, and for fostering strong bonds of friendship and camaraderie. I would also like to thank Jo Anne Gray and all my colleagues for helping me at different stages of the graduate program.

Table of Contents

Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Abbreviations
List of Symbols

1 Chapter 1. Correlation of recombination rate with genetic diversity in human genome
  1.1 Synopsis
  1.2 Introduction
    1.2.1 Recombination rate an important determinant of genetic diversity
    1.2.2 Recombination increases genetic diversity by reducing the effect of two main selective forces: Genetic Hitchhiking and Background Selection
    1.2.3 Recombination rate is positively correlated with genetic diversity in natural populations
  1.3 Materials and Methods
    1.3.1 GEMA computational modelling program
    1.3.2 Modes of GEMA program
    1.3.3 Fitness calculation
    1.3.4 Parameters used for GEMA modelling
      1.3.4.1 Recombination rate
      1.3.4.2 Modes of gene functionality
      1.3.4.3 Number of offspring
      1.3.4.4 Population size
      1.3.4.5 Gene size and gene length
      1.3.4.6 Mutation rate
      1.3.4.7 Distribution of Selection Coefficient in the population
  1.4 Results
    1.4.1 GEMA program under saturated mode
      1.4.1.1 GEMA in dominant mode of gene functionality
      1.4.1.2 GEMA in codominant mode of gene functionality
      1.4.1.3 GEMA in recessive mode of gene functionality
    1.4.2 GEMA program under unsaturated mode
      1.4.2.1 GEMA in dominant mode of gene functionality
      1.4.2.2 GEMA in codominant mode of gene functionality
      1.4.2.3 GEMA in recessive mode of gene functionality
    1.4.3 GEMA under no selection pressure
  1.5 Summary of conclusion

2 Chapter 2. Identification of rare genetic variant for Retinal Dystrophy in a family
  2.1 Synopsis
  2.2 Introduction
  2.3 Materials and Methods
    2.3.1 Filtering against 1000 Genome Phase 1 and Phase 3
    2.3.2 Filtering based upon Genotype
    2.3.3 Variant Analysis
      2.3.3.1 Variant analysis using IGV
      2.3.3.2 Variant analysis using database and literature survey
    2.3.4 Confirmation of unknown novel variant
  2.4 Results
  2.5 Summary of conclusion

References

A Appendix A