A Thesis
entitled
An Investigation of Personal Ancestry Using Haplotypes
by
Patrick Brennan

Submitted to the Graduate Faculty as partial fulfillment of the requirements for the Master of Science Degree in Biomedical Science: Bioinformatics, Proteomics, and Genomics

Dr. Alexei Fedorov, Committee Chair
Dr. Robert Blumenthal, Committee Member
Dr. Sadik Khuder, Committee Member
Dr. Amanda Bryant-Friedrich, Dean, College of Graduate Studies

The University of Toledo
August 2017

Copyright 2017, Patrick John Brennan
This document is copyrighted material. Under copyright law, no parts of this document may be reproduced without the expressed permission of the author.

An Abstract of
An Investigation of Personal Ancestry Using Haplotypes
by
Patrick Brennan

Submitted to the Graduate Faculty as partial fulfillment of the requirements for the Master of Science Degree in Biomedical Science: Bioinformatics, Proteomics, and Genomics

The University of Toledo
August 2017

Several companies over the past decade have started to offer ancestry analysis, the most notable company being 23andMe. For a relatively low price, 23andMe will sequence select variants in a person’s genome to determine where their ancestors came from. Since 23andMe is a private company, the exact techniques and algorithms it uses to determine ancestry are proprietary. Many customers have wondered about the accuracy of these results, often citing their own genealogical research of recent ancestors. To bridge the gap between 23andMe and the public, we sought to provide a tool that could assess the ancestry results of 23andMe. Using publicly available 23andMe genotype files, we constructed a program pipeline that takes these files and compares them against genomes from the 1000 Genomes Project.
We constructed haplotypes from the 23andMe file by converting 50 adjacent SNPs (single nucleotide polymorphisms) into haplotypes and comparing them against the haplotypes of 2504 individuals in the Phase 3 data from the 1000 Genomes Project. To smooth the data, we bundled together six of our haplotype segments to form an “IBD segment” (Identity-by-descent segment) and used a point scoring system to calculate the highest matching population. Our pipeline determined ancestry results for 57 individuals, with results similar to those from 23andMe. Fifty of our subjects showed European ancestry, while the other seven subjects showed ancestry from East Asia, Africa, America, and South Asia. Of our 5 geographic categories (South Asia, Africa, America, Europe, East Asia), 98% of our subjects showed ancestral representation from 4 of the 5 categories. In addition to ancestry, we also investigated IBD sharing across populations, particularly IBD segments in the Human Leukocyte Antigen (HLA) region on chromosome 6. We hope this tool will help 23andMe customers and those of similar genotyping companies understand the methods used to determine ancestry and verify their results.

I dedicate this thesis to my wife, Madeline. She has been incredibly supportive of me throughout my two years at The University of Toledo. Her encouragement and trust have allowed me to flourish and become a better student and person.

Acknowledgements
I would like to thank Dr. Alexei Fedorov for accepting me into his lab and teaching me how to program. Programming is something I have come to enjoy immensely, and I owe it to him for introducing me to the world of computer programming. I would also like to thank my committee members Dr. Robert Blumenthal and Dr. Sadik Khuder for the guidance they provided as I worked through this thesis. I also give thanks to Jo Anne Gray, who has provided enormous help over the past two years.
Lastly, I would like to thank my lab colleagues Basil Khuder, Sharmistha Chakrabortty, and Rajib Dutta for their friendship and help throughout this project.

Table of Contents
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Abbreviations
1 Chapter 1: An Investigation of Ancestry Using Haplotypes
    1. Introduction
        1.1 History of 23andMe
        1.2 Haplotypes
        1.3 IBD Background
        1.4 Current Tools
    2. Material & Methods
    3. Results
        3.1 23andMe File
        3.2 Extracting Data from Phase 3
        3.3 Haplotype Construction
        3.4 Ancestry Analysis
        3.5 European Ancestry
        3.6 IBD Sharing
    4. Discussion
        4.1 Future Work
References
Appendix A

List of Tables
1 Format of the 23andMe File
2 Table Used for Haplotype Construction
3 Ancestry Results for 57 Subjects using IBD Analysis
A.1 Characterization of Populations in Phase 3 1000 Genomes Data

List of Figures
1 Characterization of 23andMe File with Filtering
2 Haplotype Construction and Characterization
3 Before and After IBD Smoothing Technique
4 Haplotype Segments Created by build_haplo_v2.pl
5 Creation of IBD Segments and Scoring
6 Amount of IBD Sharing by Chromosome
7 IBD Results vs 23andMe Results for Control 1
8 IBD Results vs 23andMe Results for Control 3
9 IBD Results vs 23andMe Results for Control 4

List of Abbreviations
DNA: Deoxyribonucleic Acid
HLA: Human Leukocyte Antigen
IBD: Identical-by-Descent
Mb: Megabase (1 Million Base Pairs)
SNP: Single Nucleotide Polymorphism
VCF: Variant Call Format

Chapter 1
An Investigation of Personal Ancestry Using Haplotypes

1. Introduction:
1.1 History of 23andMe:
23andMe is a genomics company that was cofounded by Linda Avey and Anne Wojcicki in 2006 (Goetz, et al. 2007). 23andMe offers an ancestry and health analysis service based on the customer’s DNA. It sends the customer a kit that includes a tube, a specimen bag, and a prepaid shipping box; the customer is instructed to spit in the tube and return the kit to the company. When 23andMe receives the kit, the DNA is extracted from the saliva and is genotyped at variants that are important for health and ancestry determination. Once the DNA has been genotyped, the customer receives a health report that lets them know whether they are at risk for select common diseases. The ancestry report gives the customer a percentage breakdown of the geographical regions from which their ancestors came (Goetz, et al. 2007).
When it launched in 2007, 23andMe’s service cost $999 (Baertlein, et al. 2007). Now, ten years later, the price has dropped to as low as $99. Given the reduced cost and greater availability, one of Wojcicki’s goals is to bring the field of genetics to the public so that people can make decisions about their health. In an interview, Wojcicki stated, “I want 25 million people. Once you get 25 million people, there's just a huge power of what types of discoveries you can make. Big data is going to make us all healthier. What kind of diet should certain people be on? Are there things people are doing that make them really high-risk for cancer? There's a whole group of people who are 100-plus and have no disease. Why?” (Murphy, et al. 2013).
Genomics has suddenly become accessible to the public, and there is growing curiosity about what our DNA can tell us. While 23andMe remains a leader in the public’s relationship to genomics, one potential issue is that, because 23andMe is a private company, transparency about its methods and data is limited.
Skeptics may wonder, “How did they determine that?”, “What methods do they use?”, and “How might I verify their results?” There are numerous examples of