
A Thesis

entitled

An Investigation of Personal Ancestry Using Haplotypes

by

Patrick Brennan

Submitted to the Graduate Faculty as partial fulfillment of the requirements for the

Master of Science Degree in Biomedical Science:

Bioinformatics, Proteomics, and Genomics

______Dr. Alexei Fedorov, Committee Chair

______Dr. Robert Blumenthal, Committee Member

______Dr. Sadik Khuder, Committee Member

______Dr. Amanda Bryant-Friedrich, Dean College of Graduate Studies

The University of Toledo

August 2017

Copyright 2017, Patrick John Brennan

This document is copyrighted material. Under copyright law, no parts of this document may be reproduced without the expressed permission of the author.

An Abstract of

An Investigation of Personal Ancestry Using Haplotypes

by

Patrick Brennan

Submitted to the Graduate Faculty as partial fulfillment of the requirements for the Master of Science Degree in Biomedical Science: Bioinformatics, Proteomics, and Genomics

The University of Toledo

August 2017

Several companies over the past decade have started to offer ancestry analysis, the most notable being 23andMe. For a relatively low price, 23andMe will sequence select variants in a person’s genome to determine where their ancestors came from. Since 23andMe is a private company, the exact techniques and algorithms it uses to determine ancestry are proprietary. Many customers have wondered about the accuracy of these results, often citing their own genealogical research of recent ancestors. To bridge the gap between 23andMe and the public, we sought to provide a tool that could assess the ancestry results of 23andMe. Using publicly available 23andMe files, we constructed a program pipeline that takes these files and compares them against genomes from the 1000 Genomes Project. We constructed haplotypes from the 23andMe file by converting 50 adjacent SNPs (single nucleotide polymorphisms) into haplotypes and comparing them against the haplotypes of 2504 individuals in the Phase 3 data from the 1000 Genomes Project. To smooth the data, we bundled together six of our segments to form an “IBD segment” (Identity-by-descent segment) and used a point scoring system to calculate the highest matching population. Our pipeline determined ancestry results for 57 individuals with similar results to 23andMe. Fifty of our subjects showed European ancestry, while the other seven subjects showed ancestry from East Asia, Africa, America, and South Asia. Of our 5 geographic categories (South Asia, Africa, America, Europe, East Asia), 98% of our subjects showed ancestral representation from 4 of the 5 categories. In addition to ancestry, we also investigated IBD sharing across populations, particularly IBD segments in the Human Leukocyte Antigen (HLA) region on chromosome 6. We hope this tool will help 23andMe customers and those of similar genotyping companies understand the methods used to determine ancestry and verify their results.


I dedicate this thesis to my , Madeline. She has been incredibly supportive of me throughout my two years at The University of Toledo. Her encouragement and trust have allowed me to flourish and become a better student and person.

Acknowledgements

I would like to thank Dr. Alexei Fedorov for accepting me into his lab and teaching me how to program. Programming is something I have come to enjoy immensely, and I owe it to him for introducing me to the world of computer programming. I would also like to thank my committee members Dr. Robert Blumenthal and Dr. Sadik Khuder for the guidance they provided as I worked through this thesis. I also give thanks to Jo Anne Gray, who has provided enormous help over the past two years.

Lastly, I would like to thank my lab colleagues Basil Khuder, Sharmistha Chakrabortty, and Rajib Dutta for their friendship and help throughout this project.


Table of Contents

Abstract

Acknowledgements

Table of Contents

List of Tables

List of Figures

List of Abbreviations

1 Chapter 1: An Investigation of Ancestry Using Haplotypes

1. Introduction

1.1 History of 23andMe

1.2 Haplotypes

1.3 IBD Background

1.4 Current Tools

2. Materials & Methods

3. Results

3.1 23andMe File

3.2 Extracting Data from Phase 3

3.3 Haplotype Construction

3.4 Ancestry Analysis

3.5 European Ancestry

3.6 IBD Sharing

4. Discussion

4.1 Future Work

References

Appendix A

List of Tables

1 Format of the 23andMe File

2 Table Used for Haplotype Construction

3 Ancestry Results for 57 Subjects Using IBD Analysis

A.1 Characterization of Populations in Phase 3 1000 Genomes Data

List of Figures

1 Characterization of 23andMe File with Filtering

2 Haplotype Construction and Characterization

3 Before and After IBD Smoothing Technique

4 Haplotype Segments Created by build_haplo_v2.pl

5 Creation of IBD Segments and Scoring

6 Amount of IBD Sharing by Chromosome

7 IBD Results vs 23andMe Results for Control 1

8 IBD Results vs 23andMe Results for Control 3

9 IBD Results vs 23andMe Results for Control 4

List of Abbreviations

DNA ...... Deoxyribonucleic Acid

HLA ...... Human Leukocyte Antigen

IBD ...... Identical-by-Descent

Mb ...... Megabase (1 Million Base Pairs)

SNP ...... Single Nucleotide Polymorphism

VCF ...... Variant Call Format

Chapter 1

An Investigation of Personal Ancestry Using Haplotypes

1. Introduction:

1.1 History of 23andMe:

23andMe is a genomics company that was cofounded by Linda Avey and Anne Wojcicki in 2006 (Goetz, et al. 2007). 23andMe offers an ancestry and health analysis service based on the customer’s DNA. It sends the customer a kit that includes a tube, a specimen bag, and a prepaid shipping box; the customer is instructed to spit in the tube and return the kit to the company. When 23andMe receives the kit, the DNA is extracted from the saliva and is genotyped at variants that are important for health and ancestry determination. Once the DNA has been genotyped, the customer receives a health report that lets them know whether they are at risk for select common diseases. The ancestry report gives the customer a percentage breakdown of the geographical regions from which their ancestors came (Goetz, et al. 2007). When it started in 2007, 23andMe’s service cost $999 (Baertlein, et al. 2007). Now, ten years later, the price has dropped to as low as $99. With reduced cost and greater availability, one of Wojcicki’s goals is to bring the field of genomics to the public to allow them to make decisions about their health. In an interview Wojcicki stated, “I want 25 million people. Once you get 25 million people, there's just a huge power of what types of discoveries you can make. Big data is going to make us all healthier. What kind of diet should certain people be on? Are there things people are doing that make them really high-risk for cancer? There's a whole group of people who are 100-plus and have no disease. Why?” (Murphy, et al. 2013). Genomics has suddenly become accessible to the public, and there is more and more curiosity about what our DNA can tell us.

While 23andMe remains a leader in the public’s relationship to genomics, one potential issue is that, since 23andMe is a private company, transparency in their methods and data is limited. Skeptics may wonder, “How did they determine that?”, “What methods do they use?”, and “How might I verify their results?”. There are numerous examples of people asking these questions in online blogs and articles, and most of these people have no way of verifying the results for themselves. For example, one article criticized the limited number of non-Caucasian subjects, and claimed that this could cause bias toward Europe in the analysis (Hong, et al. 2016). Another customer said, “it said I was ~60 percent European, which makes some coherent sense in their non-optimal reference population set, but then stated that my daughter was >90 percent European. Since 23andMe did confirm she was 50% identical by descent with me these results didn’t make sense.” (Bud, et al. 2012). One customer who had documented ancestry from her 64 great-great-great-great-grandparents said the 23andMe results conflicted deeply with her known history. 23andMe stated that she had recent ancestors (after 1690 AD) from Scandinavia, Eastern Europe, and Italy, which she verifiably did not (Jestes, et al. 2017). Besides researching genealogy, most customers do not have the resources to verify their results. As a bioinformatics lab, we sought to develop tools to independently assess interpretations from the 23andMe genotype file that is provided to each customer.

Our lab has much experience with the 1000 Genomes Project, so our first idea was to compare the 23andMe SNP datasets with the publicly available genomes. Prior to this project, our lab had only used the Phase 1 data of the 1000 Genomes Project. We decided to use the Phase 3 data for this project because it covers 26 populations, rather than the 14 in Phase 1, which will be crucial for determining ancestry. The Phase 3 data is also “phased”, which will allow us to construct haplotypes as our method for investigating ancestry. Phasing will be described in more detail in the next section.

1.2 Haplotypes:

Haplotypes are a combination of alleles at different markers along the same chromosome that are inherited as a unit (Crawford, et al. 2005). When constructing haplotypes, one major challenge is phasing. Phasing is the method of assigning alleles to the paternal and maternal chromosomes. Because of limited sequencing technology, raw WGS (whole genome sequencing) data does not offer insight into which sequence came from which parent; in other words, WGS data is unphased. To construct haplotypes, phasing must occur so we know which parent is responsible for which allele at a given locus. To phase genotype data, there are two main approaches: the computational approach and the lab-based experimental approach. The computational approach allows unrelated individuals to be phased by using sets of common haplotypes to explain the genotype data (Browning, et al. 2011). For this reason, computational phasing works best for large datasets because more individuals allow for a better phasing estimation.

There are many different methods and algorithms used for computational phasing. The first attempts at phasing were Clark’s algorithm and the EM algorithm, but these were quickly replaced because they were not efficient and had decreased accuracy with large sets of polymorphisms. Many different techniques then emerged based on hidden Markov models, and these were an improvement over Clark and EM (Browning, et al. 2011). Methods that use hidden Markov models typically use haplotype templates based on preexisting data to predict which alleles belong to which parent. The model becomes more accurate with increased iterations over the haplotype templates. SHAPEIT2 is currently one of the most used algorithms, and it is based on hidden Markov models (O’Connell, et al. 2014). O’Connell recently compared 7 different phasing methods designed to phase unrelated individuals and found that SHAPEIT2 was the most accurate (O’Connell, et al. 2014). SHAPEIT2 was used to phase the 1000 Genomes Project data used in this project.

Once haplotypes have been phased, they can be used to investigate human ancestry, recombination, and human diseases. Haplotypes have been crucial for understanding disease because they reveal information about recombination. Locating disease-causing genes is often done by investigating linkage disequilibrium, which is the statistical association between two SNPs. Haplotypes allow the relationship between SNPs to be understood, so knowledge of a single SNP can predict another SNP if the linkage disequilibrium is high (Crawford, et al. 2005). One clinical application for haplotypes has been checking patient/donor compatibility for organ transplants. The HLA region of the patient and donor is compared using haplotypes to see if their immune systems are compatible (Crawford, et al. 2005). Similar haplotypes can be an indicator of a recent common ancestor, which makes the two individuals more compatible for transplants. For this reason, haplotypes have been used as a tool for determining ancestry. Haplotype patterns begin to emerge when haplotypes from large numbers of unrelated individuals are compared against one another (1000 Genomes Project Consortium, et al. 2015). The haplotype patterns observed can be correlated with a specific population or geographic location, and individuals can be sorted into populations based on their haplotype patterns. In our study, we used haplotype construction as the basis of our ancestry analysis for 57 individuals.

1.3 IBD Background:

Identical-by-Descent (IBD) segments are very similar to haplotypes in that they both address inheritance from a single ancestor. Where IBD segments differ is that they do not focus on variants, but rather on a section of the chromosome that has not undergone recombination. If no recombination has occurred, the subject and his/her ancestor can have an identical segment of DNA. The size of an IBD segment is often dependent on how recently an ancestor lived. For example, IBD segments shared by a parent and child can be millions of base pairs long because recombination has only occurred once. The more recombination that occurs, the shorter the IBD segments will be. There are more factors that influence IBD segments than just ancestry, though. For example, one study states that selection can increase the amount of IBD sharing (Albrechtsen, et al. 2010). The best example of this is the human leukocyte antigen (HLA) region of the genome, which shows high IBD sharing among individuals within and across populations (Albrechtsen, et al. 2010). The HLA region encodes many molecules that play key roles in the immune system, and is the most gene-dense region within the entire genome. Selection is strong in the HLA region because it has the largest degree of polymorphism in the genome, mutations in the HLA region can cause autoimmune diseases, and HLA has the densest linkage disequilibrium (Simmonds, et al. 2007). Another study found similar results and stated that the HLA region is “shared among individuals unrecombined at least 4-fold more than any other region in the genome” (Gusev, et al. 2012). While the HLA region shows the most IBD sharing, other regions on chromosomes 2, 4, and 8 have also been identified as having increased IBD sharing (Gusev, et al. 2012). Specifically, IBD sharing was seen between 11.1 and 13.3 Mb on chromosome 8. In our analysis, we used IBD segments to determine ancestry, but also searched for areas with high IBD sharing across populations.

1.4 Current Tools:

There are some tools currently available that allow 23andMe customers to further analyze their results (Bettinger, et al. 2013). Most tools allow customers to obtain additional information on the SNPs by drawing from available databases. The tool Promethease, for example, draws information from SNPedia based on the 23andMe results, and displays the data in a clean graphical layout. Another tool, GEDmatch, allows 23andMe customers to construct family trees by comparing their results with those of others using the service around the world. Very few of the tools currently available try to recreate the results from scratch. One tool that attempts to do this is called Spatial Ancestry Analysis (SPA). SPA is a downloadable tool that can be used with the 23andMe file to create a visualization of ancestry (Yang, et al. 2012). SPA uniquely plots each SNP into a 3D graph based on the known allele frequencies of the SNP across populations. The result is clusters of SNPs in 3D space, where each cluster represents a population, and the size of the clusters can determine an individual’s overall ancestry (Yang, et al. 2012). Our program pipeline determines a person’s ancestry from scratch and displays the ancestry results in a format similar to that of genotyping companies like 23andMe.


2. Materials & Methods:

The genotype data for this project was taken from the Phase 3 data of the 1000 Genomes Project. We downloaded the data by FTP from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ in the Variant Call Format (VCF). The VCF files from the Phase 3 data span all human chromosomes and include 84.4 million variants among 2504 individuals from 26 different populations around the world. For the scope of this project, only autosomes were used. The Phase 3 data is entirely “phased”, meaning parental haplotypes were determined for each individual, which allowed us to construct haplotypes for our ancestry analysis.
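To illustrate what “phased” means at the file level, the following Python sketch splits the genotype fields of one VCF data line into per-haplotype alleles. It is not part of the thesis pipeline, which was written in Perl; the function name and the two-sample example record are hypothetical. It relies only on the VCF convention that phased genotypes separate the two alleles with “|”, whereas unphased genotypes use “/”.

```python
def split_phased_genotypes(vcf_line):
    """Parse one VCF data line and return its phased alleles.

    Returns (chrom, pos, rsid, ref, alt, alleles), where `alleles` lists
    two integers per sample: first haplotype, then second haplotype.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    # Fixed VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT, then samples
    chrom, pos, rsid, ref, alt = fields[0], int(fields[1]), fields[2], fields[3], fields[4]
    gt_index = fields[8].split(":").index("GT")
    alleles = []
    for sample in fields[9:]:
        gt = sample.split(":")[gt_index]
        if "|" not in gt:
            # "/" marks an unphased genotype; a haplotype cannot be assigned
            raise ValueError(f"genotype {gt!r} is not phased")
        first, second = gt.split("|")
        alleles.extend([int(first), int(second)])
    return chrom, pos, rsid, ref, alt, alleles


# Hypothetical record with two samples (values mirror the first row of Table 2)
example = "1\t838555\trs4970383\tC\tA\t100\tPASS\t.\tGT\t1|0\t1|1"
record = split_phased_genotypes(example)
print(record)
```

In this sketch, 0 denotes the reference allele and 1 the alternative allele, which is the same 0/1 coding described for the Phase 3 data in the Results.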

Our control subjects were four individuals whose ancestors were disclosed. These four individuals acted as controls because we had access to their 23andMe file in conjunction with their 23andMe results or family history. Using publicly available 23andMe files from the Personal Genome Project (https://my.pgp-hms.org/public_genetic_data?utf8=%E2%9C%93&data_type=23andMe&commit=Search) we investigated 53 more individuals. Although most individuals chose to remain anonymous, some individuals elected to provide their name or ethnic origin. In these cases, we checked if our program’s output verified their self-identification.

A Linux workstation was used to write and execute all programs for this study.

The programs used for this study were create_haplo_table.pl, build_haplo_v2.pl, and insert_ind.pl. See the appendix section for more details on these programs and the protocol used to execute them.


3. Results:

3.1 23andMe File:

The genotype files generated by 23andMe differ depending on the type of genotype chip used for the analysis. The current chip, which was used in this project, is version 4 (v4), released in 2013. The v4 chip contains ~610,000 SNPs, while the previous chip, v3, contained over 900,000 SNPs. The Perl programs used in this study are compatible with previous versions of the chip, but our study only used the programs on files from the v4 chip. Our first step in analysis was to investigate the 23andMe file and its format. The 23andMe file comes as a tab-delimited text file that contains ~610,000 SNPs with the subject’s genotype information. The file has four information columns containing the SNP ID, chromosome number, location, and genotype (Table 1). Most of the SNPs are identified with an RS number, but 23andMe also uses its own SNP identifiers that are marked with an “i”, which stands for “internal”. There were 50,041 of these “iSNPs” in the 23andMe file (Figure 1). These identifiers are unique to 23andMe, so there is no publicly available data on these sites. 23andMe also uses genotype designations of “-”, “D”, and “I”, which represent unknown, deletion, and insertion, respectively. Sites using these designations were filtered out for the final analysis. To ensure the genotype file was standardized, we compared the 23andMe files of two different individuals to ensure they contained all the same SNPs and that the SNPs were in the same position in the file. The results confirmed that the order of the SNPs was fixed and each file contained the same variants.
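The file format and filtering step described above can be sketched in Python as follows. This is an illustrative sketch rather than the Perl code used in the study: the function name is hypothetical, and skipping header lines that begin with “#” is an assumption about the raw file. The sample rows are taken from Table 1.

```python
import io

# Genotype symbols that mark unknown ("-"), deletion ("D"), and insertion ("I") calls
NO_CALL_SYMBOLS = set("-DI")


def read_23andme(handle):
    """Read tab-delimited 23andMe rows; return (kept_rows, counts)."""
    kept = []
    counts = {"rs": 0, "internal": 0, "filtered": 0}
    for line in handle:
        # Assumption: header/comment lines start with "#"
        if not line.strip() or line.startswith("#"):
            continue
        snp_id, chrom, pos, genotype = line.rstrip("\n").split("\t")
        # Identifiers starting with "i" are 23andMe-internal; the rest are RS numbers
        counts["internal" if snp_id.startswith("i") else "rs"] += 1
        if any(symbol in NO_CALL_SYMBOLS for symbol in genotype):
            counts["filtered"] += 1
            continue
        kept.append((snp_id, chrom, int(pos), genotype))
    return kept, counts


sample = io.StringIO(
    "# hypothetical header line\n"
    "i6033924\t1\t1249133\tDD\n"
    "rs12142199\t1\t1249187\tAA\n"
    "i6033925\t1\t1254599\tII\n"
    "i6054441\t1\t1254678\tGG\n"
    "i6019336\t1\t1268409\t--\n"
    "rs3855955\t1\t1276077\tAG\n"
)
rows, counts = read_23andme(sample)
print(rows)
print(counts)
```

Of the six sample rows, three survive the filter: one homozygous RS SNP, one homozygous internal SNP, and one heterozygous RS SNP.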


3.2 Extracting Data from Phase 3:

There is a large size difference between the 23andMe file and the Phase 3 VCF files. The VCF files contain 84.4 million SNPs, while the 23andMe file only contains ~610,000 SNPs. The program create_haplo_table.pl takes a 23andMe file as an input file and filters out all the SNPs in Phase 3 that are not present in the 23andMe file. For 23andMe’s “internal” SNPs (iSNPs), we designed our program to match SNPs based on location and chromosome number so SNPs marked with “i” could be matched with an RS SNP from the Phase 3 data. We found that 6,910 of the 50,041 iSNPs matched with a known RS SNP from the Phase 3 data (Figure 1). The output of create_haplo_table.pl was a table that contained all the 23andMe SNPs paired with the Phase 3 data for each SNP (Table 2). The 2504 individuals in the Phase 3 data were not sorted by population, so create_haplo_table.pl was designed to sort the individuals by their 3-letter population code, which can be seen in the output table (Table 2). Approximately 50,000 SNPs in the 23andMe file were not found in the Phase 3 data, so these variants were marked with a ‘NOMATCH’ tag. To improve efficiency, create_haplo_table.pl was designed to open the VCF files only once, so that the program was not too time and resource intensive. This program was executed for each chromosome separately, as opposed to all together, to ensure that SNP location was a unique identifier and to reduce execution time by running programs in parallel. The result of create_haplo_table.pl was 22 output tables (one for each chromosome) that were concatenated together into a single table that was used for haplotype construction (concatenation was done with the Unix “cat” command). The final table used for haplotype construction was 532,532 lines (each line representing a SNP). The table was then filtered further so only sites in which the individual was homozygous remained. Sites were considered homozygous if the genotype was “GG”, “AA”, “CC”, or “TT”. The number of homozygous sites among individuals varied, but usually ~355,000 SNPs remained after this filtering took place.
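The matching and filtering logic described in this section can be sketched in Python as below. This is a minimal illustration, not the create_haplo_table.pl program itself: the function names, the dictionary-based index, and the internal identifier “i0000001” are hypothetical. Matching on (chromosome, location) follows the approach described for iSNPs; the other two sample rows come from Table 2.

```python
HOMOZYGOUS = {"AA", "CC", "GG", "TT"}


def match_to_reference(subject_snps, reference_index):
    """Pair each subject SNP with reference data by (chromosome, location).

    SNPs with no reference entry at that position are tagged "NOMATCH".
    """
    rows = []
    for snp_id, chrom, pos, genotype in subject_snps:
        entry = reference_index.get((chrom, pos), "NOMATCH")
        rows.append((snp_id, chrom, pos, genotype, entry))
    return rows


def keep_homozygous(rows):
    """Drop NOMATCH rows and rows whose genotype is not AA, CC, GG, or TT."""
    return [row for row in rows if row[4] != "NOMATCH" and row[3] in HOMOZYGOUS]


# Reference entries keyed by (chromosome, location): (RSID, REF, ALT)
reference_index = {
    ("1", 838555): ("rs4970383", "C", "A"),
    ("1", 873558): ("rs1110052", "G", "T"),
}
subject = [
    ("rs4970383", "1", 838555, "CC"),   # homozygous, found in reference
    ("rs1110052", "1", 873558, "GT"),   # heterozygous, found in reference
    ("i0000001", "1", 999999, "AA"),    # hypothetical iSNP with no reference match
]
matched = match_to_reference(subject, reference_index)
filtered = keep_homozygous(matched)
print(matched)
print(filtered)
```

Keying the index by position within one chromosome at a time is one way to keep location a unique identifier, which is the reason given above for running each chromosome separately.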

3.3 Haplotype Construction:

Before haplotypes were constructed, all heterozygous and “NOMATCH” sites were removed from the final table. Heterozygous sites were removed because the 23andMe genotype file was unphased, unlike the Phase 3 data. Since the 23andMe data was unphased, we were not able to determine which heterozygous allele belonged to which parent, so a single haplotype was constructed out of the homozygous sites. SNPs were designated as either “0” or “1”, depending on whether the genotype information matched the reference allele or alternative allele listed in the Phase 3 data. If the genotype matched the reference allele, that site was assigned as “0”. If the genotype matched the alternative allele, that site was assigned as “1”. If the genotype was heterozygous, a “0” or “1” could not be assigned, so all heterozygous sites were filtered out using the program build_haplo_v2.pl, leaving ~355,000 SNPs for haplotype construction.

Using the filtered homozygous table, we started building haplotypes. The genotype column was compared against the “REF” and “ALT” columns to convert the genotype information from 23andMe into a haplotype (Figure 2). Since the Phase 3 data is phased, two haplotypes were constructed for each of the 2504 individuals, for a total of 5008 haplotypes; the Phase 3 data genotype information was already in the 0/1 format, so no conversion needed to take place. The length of each haplotype was 50 adjacent SNPs because a shorter segment allowed for too many matches, and more than 50 SNPs generated too few matches. The homozygous table was divided into blocks of 50 rows, and each of these blocks contained the subject’s single haplotype, along with the 5008 haplotypes from Phase 3. These 50-row blocks were named “haplotype segments”. The 50-SNP length generated approximately 7100 haplotype segments (355,000 / 50 = 7100). The totality of the highest matching population for each of the ~7100 haplotype segments provided insight into the ancestry of the individual.
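The conversion and comparison steps can be sketched in Python using the nine rows shown in Figure 2. This is an illustration only, not the build_haplo_v2.pl program: the function and column names are hypothetical, and two behaviors are assumptions, namely that a “match” means identity at every SNP in the segment and that a trailing block shorter than the segment size is kept.

```python
def encode_genotype(genotype, ref, alt):
    """Return 0 if a homozygous genotype matches REF, 1 if it matches ALT, else None."""
    allele = genotype[0]
    if genotype != allele * 2:
        return None  # heterozygous: cannot be assigned without phasing
    if allele == ref:
        return 0
    if allele == alt:
        return 1
    return None


def split_into_segments(values, size=50):
    """Split a list into consecutive blocks of `size` items (last block may be shorter)."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def matching_columns(subject_segment, reference_columns):
    """Names of reference haplotypes identical to the subject over the whole segment."""
    return [name for name, column in reference_columns.items() if column == subject_segment]


# (GENOTYPE, REF, ALT) for the nine rows of Figure 2
figure2_rows = [("CC", "C", "A"), ("GG", "A", "G"), ("AA", "A", "G"),
                ("CC", "C", "A"), ("CC", "C", "T"), ("AA", "A", "G"),
                ("GG", "A", "G"), ("CC", "C", "T"), ("CC", "A", "C")]
subject = [encode_genotype(g, r, a) for g, r, a in figure2_rows]

# The four Phase 3 haplotype columns of Figure 2, read top to bottom
reference_columns = {
    "IND1_hap1": [0, 1, 0, 1, 1, 0, 1, 0, 1],
    "IND1_hap2": [0, 1, 0, 0, 0, 0, 1, 0, 1],
    "IND2_hap1": [0, 1, 0, 1, 0, 0, 0, 0, 1],
    "IND2_hap2": [0, 1, 0, 1, 0, 0, 0, 0, 1],
}
print(subject)
print(matching_columns(subject, reference_columns))
print([len(block) for block in split_into_segments(list(range(120)))])
```

With the Figure 2 values, the subject haplotype is identical to the second haplotype of individual 1, which is consistent with the “MATCH FOUND” label in the figure.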

3.4 Ancestry Analysis:

To determine an individual’s ancestry, the number of haplotype matches for each population was calculated first. For example, the population TSI had 112 individuals in the Phase 3 data. The subject’s haplotype was tested against the haplotypes of these 112 individuals (224 total haplotypes). If 30 of those 224 haplotypes matched the subject’s haplotype, a 13% matching score was assigned to TSI (30/224 = ~13%). The “percent matched” was calculated for each of the 26 populations, and then those populations were ordered from highest matching percentage to lowest matching percentage and printed to an output file (Figure 4). Insight into a subject’s ancestry was gained by counting the highest occurring population for each of the ~7100 haplotype segments. When the results of this analysis were compared to the 23andMe results for Control 1, the ancestry differed dramatically. The results revealed that Control 1 was only 65% European, while the 23andMe results showed Control 1 was 99.6% European (Figure 3). Based on Control 1’s family history, we hypothesized that the European ancestry percentage should be greater than 65%. Looking through the haplotype segments, we found regions where multiple adjacent segments had the same highest occurring population. This suggested that large regions spanning multiple haplotype segments came from the same population. Based on this observation, and in an attempt to smooth the data, adjacent haplotype segments were bundled together to form what we called “IBD segments”. A point system was used to determine the highest occurring population for each IBD segment, and the totality of the IBD segments determined ancestry.
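The “percent matched” calculation can be written as a short Python sketch. The function name and the toy data are hypothetical; the TSI counts (30 matches out of 224 haplotypes) follow the worked example in the text, while the CEU counts are invented purely to give the ranking a second entry.

```python
def percent_matched(subject_segment, haplotypes_by_population):
    """Rank populations by the percentage of their haplotypes matching the subject."""
    scores = {}
    for population, haplotypes in haplotypes_by_population.items():
        matches = sum(1 for hap in haplotypes if hap == subject_segment)
        scores[population] = 100.0 * matches / len(haplotypes)
    # Highest matching percentage first, as in the output file described above
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


subject_segment = [0, 1, 0]   # shortened segment for illustration
other = [1, 1, 1]
haplotypes_by_population = {
    "CEU": [subject_segment] * 10 + [other] * 188,   # hypothetical counts
    "TSI": [subject_segment] * 30 + [other] * 194,   # 30 of 224, as in the text
}
ranking = percent_matched(subject_segment, haplotypes_by_population)
print(ranking)
```

Dividing by the number of haplotypes in each population, rather than comparing raw match counts, keeps populations with more sampled individuals from ranking higher simply because of their size.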

IBD segments were created by grouping 6 of the haplotype segments together and using a point system to calculate the highest scoring population from the 6 haplotype segments (Figure 5). The point system assigned a score of 100 to the highest matching population of the haplotype segment, and then assigned points to other populations proportional to the number of matches compared to the highest number of matches. For example, if TSI was the highest matching with a matching percentage of 4, and the second highest matching percentage was CEU with 3, TSI would be assigned 100 points and CEU would be assigned 75 points. If a population had more than 450 points out of the 600 possible for 6 haplotype segments, the IBD segment was counted for that population (Figure 5). After an IBD segment was found, haplotype segments were added to the IBD segment until the point quota was no longer met. This means many IBD segments contained more than 6 haplotype segments and had scores greater than 600. If a segment had a score >= 650, it was referred to as a “strong” IBD segment. The results from the IBD segments offered a better picture of ancestry that was more comparable with the results of 23andMe and our controls’ family history.
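The point system for a single 6-segment window can be sketched in Python as follows. This is an illustrative sketch with hypothetical function names; it covers only the initial window and does not implement the extension step, whose exact quota rule is not specified here. Returning zero points when no population has any match is an assumption.

```python
def segment_points(percent_by_population):
    """Give 100 points to the top population and proportional points to the rest."""
    top = max(percent_by_population.values())
    if top == 0:
        # Assumption: a segment with no matches contributes no points
        return {population: 0.0 for population in percent_by_population}
    return {population: 100.0 * pct / top
            for population, pct in percent_by_population.items()}


def score_window(window, threshold=450):
    """Sum points over a window of haplotype segments.

    Returns (population, score) if the best population exceeds the
    threshold, otherwise (None, best score).
    """
    totals = {}
    for percent_by_population in window:
        for population, points in segment_points(percent_by_population).items():
            totals[population] = totals.get(population, 0.0) + points
    best = max(totals, key=totals.get)
    if totals[best] > threshold:
        return best, totals[best]
    return None, totals[best]


# Worked example from the text: TSI at 4% and CEU at 3% gives 100 and 75 points
print(segment_points({"TSI": 4, "CEU": 3}))

# Hypothetical window: TSI leads in five segments, CEU leads in one
window = [{"TSI": 4, "CEU": 3}] * 5 + [{"TSI": 3, "CEU": 4}]
print(score_window(window))
```

In the hypothetical window, TSI accumulates 5 × 100 + 75 = 575 points and CEU 5 × 75 + 100 = 475 points, so the window is counted for TSI.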

An IBD segment analysis was performed for 57 individuals in this study. Europe was the predominant continent of ancestry for 50/57 subjects (~88%) (Table 3). The seven subjects with non-European ancestry were Han Lin and Control 3 with East Asian ancestry, Daniel Zuranich and Vincent Yount with American ancestry, Sash Balaskinkam with South Asian ancestry, and Terrence Pinder and Shellise Mayrant with African ancestry (all the names listed were the names individuals provided with their data posted to www.personalgenomes.org). The number of IBD segments per individual ranged from 821 to 2461 (Table 3). An ANOVA test revealed a significant difference (p-value <= 0.01) in the mean number of IBD segments among the geographical groups. All but two subjects (96%) had representation from 4 or more of the 5 geographic regions (Table 3). Han Lin and Control 3 were the only subjects with representation from fewer than 4 of the regions, with representation from East Asia and America. Travis Jupp, who self-identified as “Caucasian/Apache Mix”, was found to have 57% European ancestry and 40.5% American ancestry (Table 3). Control 1, whose family is Russian in origin, had the highest FIN percentage of all subjects (Table 3); Finland (FIN) was geographically closest to Russia out of all the populations in the Phase 3 data. Control 1 showed trace amounts of ancestry from America (1.9%) and South Asia (1.9%), while the 23andMe results showed no ancestry from those regions (Figure 7).

Control 3, who was born in Korea and has Korean heritage, showed 99.3% East Asian ancestry. 23andMe showed similar ancestry overall, but the breakdown of East Asian populations was different between the two analyses (Figure 8). Japan (JPT) had the highest percentage of the East Asian population in our IBD analysis, while 23andMe showed Korea as the highest-ranking population with Japan as the second highest (Figure 8). The Phase 3 data in our analysis does not have a Korean population.

Control 4, who has family history from Britain, showed 95.4% European ancestry, most of which came from Northwest Europe (Figure 9). When compared with the 23andMe data, our analysis shows small amounts of American (3.2%) and South Asian (1.2%) ancestry, while the 23andMe data does not (Figure 9).

3.5 European Ancestry:

For individuals with a majority European ancestry (>50%), the average ancestry from Europe was 92%. On average, 34.2% of this ancestry came from CEU, 24.1% from FIN, 22.6% from GBR, 6.3% from IBS, and 4.8% from TSI. No subject with European ancestry had a TSI or IBS percentage that was greater than the GBR, CEU, or FIN percentage.

3.6 IBD Sharing:

To investigate IBD sharing, IBD segments from populations that contributed less than 5.0% to the overall ancestry of their subject were investigated; these populations were termed “rare populations”. “Strong” IBD segments (>=650 points) among rare populations were identified and their locations recorded. The results showed that chromosome 6 had more than double the number of strong IBD segments of any other chromosome (Figure 6). Not only did chromosome 6 have the most IBD segments, but 75 of the 108 IBD segments (69%) from chromosome 6 were located within the HLA region (25 - 35 Mb) (See Supplemental Table). IBD sharing was also found on chromosome 8, which had the second most IBD segments with 46. The IBD sharing was located in the centromere region (~42 Mb - ~52 Mb), where 18 of the 46 IBD segments (39%) were located. The populations assigned to all 18 IBD segments were from East Asia, in contrast to the 75 HLA IBD segments, among which all geographic regions were represented. More IBD sharing was found on chromosome 11 in the centromere region (45 Mb - 59 Mb) (See Supplemental Table).
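The selection and tallying steps described in this section can be sketched in Python. The record layout (keys `chrom`, `start_mb`, `population`, `score`), the function names, and all sample values are hypothetical; deciding region membership by a segment’s start position alone is also an assumption. The 5.0% and 650-point cutoffs and the 25 - 35 Mb HLA interval come from the text.

```python
from collections import Counter


def strong_rare_segments(segments, ancestry_percent, rare_cutoff=5.0, strong_score=650):
    """Keep strong IBD segments whose population contributes less than the rare cutoff."""
    return [seg for seg in segments
            if seg["score"] >= strong_score
            and ancestry_percent.get(seg["population"], 0.0) < rare_cutoff]


def count_by_chromosome(segments):
    """Number of segments on each chromosome."""
    return Counter(seg["chrom"] for seg in segments)


def fraction_in_region(segments, chrom, start_mb, end_mb):
    """Fraction of a chromosome's segments that start inside [start_mb, end_mb]."""
    on_chrom = [seg for seg in segments if seg["chrom"] == chrom]
    if not on_chrom:
        return 0.0
    inside = [seg for seg in on_chrom if start_mb <= seg["start_mb"] <= end_mb]
    return len(inside) / len(on_chrom)


# Hypothetical subject: overall ancestry percentages and scored IBD segments
ancestry_percent = {"CEU": 60.0, "JPT": 1.2, "YRI": 0.5}
segments = [
    {"chrom": 6, "start_mb": 30.1, "population": "JPT", "score": 700},
    {"chrom": 6, "start_mb": 31.5, "population": "YRI", "score": 650},
    {"chrom": 6, "start_mb": 90.0, "population": "JPT", "score": 660},
    {"chrom": 8, "start_mb": 45.0, "population": "JPT", "score": 640},  # not strong
    {"chrom": 1, "start_mb": 10.0, "population": "CEU", "score": 900},  # not rare
]
kept = strong_rare_segments(segments, ancestry_percent)
counts = count_by_chromosome(kept)
hla_fraction = fraction_in_region(kept, chrom=6, start_mb=25, end_mb=35)
print(counts)
print(hla_fraction)
```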


Table 1

SNP ID        CHR #   Location   Genotype
i6033924      1       1249133    DD
rs12142199    1       1249187    AA
i6033925      1       1254599    II
i6054441      1       1254678    GG
rs112653110   1       1255028    GG
rs200498230   1       1256453    GG
rs140361978   1       1258527    CC
i6033928      1       1262347    II
rs307354      1       1264539    CC
rs35744813    1       1265460    CC
i6019336      1       1268409    --
rs141631187   1       1268505    GG
i6033933      1       1269320    II
i6019337      1       1269331    --
rs307377      1       1269554    CC
rs3855955     1       1276077    AG
i6019338      1       1284412    CC

Table 1 Format of the 23andMe file

This shows the format of the genotype file provided by 23andMe. It contains 4 columns: the first has the SNP ID, the second contains the chromosome number, the third contains the location of the SNP, and the fourth contains the subject’s genotype information. The symbols “-”, “I”, and “D” represent “unknown”, “insertion”, and “deletion”, respectively.


Figure 1

Figure 1 Characterization of 23andMe File with Filtering

Shows the total number of SNPs in the 23andMe file as well as the breakdown between RS SNPs and “internal” SNPs, designated with an “i”. The figure also shows the number of SNPs from these two categories that matched the Phase 3 1000 Genomes data. The filtering process yields the final table, which will be used to build haplotypes.


Table 2

23andMe                             1000 Genomes Phase 3
RSID         CHR  LOC     GENOTYPE  CHR  LOC     RSID         REF  ALT  POP  IND 1  IND 1  IND 2  IND 2
rs4970383    1    838555  CC        1    838555  rs4970383    C    A    CHB  1      0      1      1
rs4475691    1    846808  CC        1    846808  rs4475691    C    T    CHB  1      0      0      0
rs7537756    1    854250  AA        1    854250  rs7537756    A    G    CHB  0      0      0      0
rs13302982   1    861808  GG        1    861808  rs13302982   A    G    CHB  1      1      0      0
rs55678698   1    864490  CC        1    864490  rs55678698   C    T    CHB  0      0      0      0
rs1110052    1    873558  GT        1    873558  rs1110052    G    T    CHB  1      1      1      0
rs2272756    1    882033  AG        1    882033  rs2272756    G    A    CHB  0      0      0      0
rs67274836   1    884767  GG        1    884767  rs67274836   G    A    CHB  0      0      0      0
rs13302945   1    889159  CC        1    889159  rs13302945   A    C    CHB  1      1      1      1
rs13303106   1    891945  AG        1    891945  rs13303106   A    G    CHB  1      1      1      1

Table 2 Table Created by create_haplo_table.pl, which is Used for Haplotype Construction

The first 4 columns are from the original 23andMe file, while the remaining columns are from Phase 3. The three-letter population code in the POP column indicates that the columns that follow are individuals from that population. The table contains 5008 “IND” columns representing 2504 people, and 26 population columns. Since Phase 3 data is phased, each “IND” column represents a collection of SNPs the individual inherited from a specific parent. Therefore, each individual has 2 haplotypes, the first from parent 1 and the second from parent 2.
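The two “IND” columns per person come from splitting each phased genotype field into its two alleles. The following Python sketch mirrors the pattern used for this in the Perl listing in the Protocol (`^([012])\|([012])`); the function name `split_phased` is hypothetical.

```python
import re
from typing import Optional, Tuple

# Same pattern as the Perl listing: allele, a '|' phase separator, allele.
_PHASED = re.compile(r"^([012])\|([012])")


def split_phased(field: str) -> Optional[Tuple[int, int]]:
    """Return (allele on haplotype 1, allele on haplotype 2), or None if the field is not phased."""
    m = _PHASED.match(field)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


people = 2504
ind_columns = people * 2  # one column per haplotype
```

With 2504 people this gives the 5008 “IND” columns described above; a field such as “1|0” contributes 1 to the first column and 0 to the second.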

23AndMe                                  1000 Genomes Phase 3
RSID         CHR  LOC     GENOTYPE       CHR  LOC     RSID         REF  ALT  POP  IND 1  IND 1  IND 2  IND 2
rs148828841  1    760998  CC             1    760998  rs148828841  C    A    CHB  0      0      0      0
rs3131972    1    752721  GG             1    752721  rs3131972    A    G    CHB  1      1      1      1
rs12124819   1    776546  AA             1    776546  rs12124819   A    G    CHB  0      0      0      0
rs4970383    1    838555  CC             1    838555  rs4970383    C    A    CHB  1      0      1      1
rs4475691    1    846808  CC             1    846808  rs4475691    C    T    CHB  1      0      0      0
rs7537756    1    854250  AA             1    854250  rs7537756    A    G    CHB  0      0      0      0
rs13302982   1    861808  GG             1    861808  rs13302982   A    G    CHB  1      1      0      0
rs55678698   1    864490  CC             1    864490  rs55678698   C    T    CHB  0      0      0      0
rs13302945   1    889159  CC             1    889159  rs13302945   A    C    CHB  1      1      1      1

Convert to haplotype by comparing to REF/ALT

rs148828841  1    760998  0              0      0      0      0
rs3131972    1    752721  1              1      1      1      1
rs12124819   1    776546  0              0      0      0      0
rs4970383    1    838555  0              1      0      1      1
rs4475691    1    846808  0              1      0      0      0
rs7537756    1    854250  0              0      0      0      0
rs13302982   1    861808  1              1      1      0      0
rs55678698   1    864490  0              0      0      0      0
rs13302945   1    889159  1              1      1      1      1

Compare Subject's Haplotypes with 2504 Individuals (5008 Haplotypes)

MATCH FOUND

Figure 2 Haplotype Construction and Characterization

Shows the matched table after heterozygous sites have been filtered out, leaving approximately 355,000 SNPs depending on the individual. The figure above demonstrates how this homozygous table was converted into haplotypes. The genotype column was compared with the REF and ALT and assigned a “1” if it matched the ALT, or a “0” if it matched the REF. Matches were counted for each population and divided by the total number of haplotypes to get a “percent matched” score for each population. The above figure is just an example; actual haplotypes were constructed with 50 adjacent SNPs.
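The conversion and comparison described for Figure 2 can be sketched as follows. This is an illustrative Python sketch rather than the Perl used in the study; the function names are hypothetical, and dividing by the number of haplotypes within each population is our reading of “total number of haplotypes”. The data are the nine rows and four CHB haplotype columns shown in Figure 2.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def encode_homozygous(genotype: str, ref: str, alt: str) -> Optional[int]:
    """Return 0 if a homozygous genotype matches REF, 1 if it matches ALT, else None."""
    if len(genotype) != 2 or genotype[0] != genotype[1]:
        return None  # heterozygous or malformed: such sites are filtered out
    if genotype[0] == ref:
        return 0
    if genotype[0] == alt:
        return 1
    return None


def percent_matched(subject: Sequence[int],
                    panel: Dict[str, List[Tuple[int, ...]]]) -> Dict[str, float]:
    """For each population, the percentage of its haplotypes identical to the subject's."""
    result = {}
    for pop, haps in panel.items():
        matches = sum(1 for h in haps if tuple(h) == tuple(subject))
        result[pop] = 100.0 * matches / len(haps)
    return result


# (genotype, REF, ALT) for the nine rows of Figure 2
sites = [("CC", "C", "A"), ("GG", "A", "G"), ("AA", "A", "G"),
         ("CC", "C", "A"), ("CC", "C", "T"), ("AA", "A", "G"),
         ("GG", "A", "G"), ("CC", "C", "T"), ("CC", "A", "C")]
subject = tuple(encode_homozygous(*s) for s in sites)

# The four haplotype columns of Figure 2 (IND 1 and IND 2), read top to bottom
panel = {"CHB": [
    (0, 1, 0, 1, 1, 0, 1, 0, 1),
    (0, 1, 0, 0, 0, 0, 1, 0, 1),
    (0, 1, 0, 1, 0, 0, 0, 0, 1),
    (0, 1, 0, 1, 0, 0, 0, 0, 1),
]}
scores = percent_matched(subject, panel)
```

In this toy example the subject's haplotype equals the second haplotype of IND 1, so one of the four CHB haplotypes matches.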

Figure 3 Before and After IBD Smoothing Technique

Shows the ancestry results for Control 1 before and after smoothing was implemented. Smoothing was accomplished by bundling 6 haplotype segments to create an IBD segment. Using the IBD segments as a smoothing technique made the ancestry results more similar to Control 1’s family history and the results of 23andMe.

Figure 4 Haplotype Segments Created by build_haplo_v2.pl

Each line represents a haplotype segment, followed by the position and length of the segment. The “HETERO” designation is the number of heterozygous SNPs found within the haplotype segment. ‘NA’ means that a subject’s haplotype did not match any of the 2504 individuals in the Phase 3 data. The numbers next to the population codes represent the percent of haplotypes from that population that matched the subject’s haplotype.

Figure 5 Creation of IBD Segments and Scoring

Each IBD segment was composed of 6 haplotype segments. Each haplotype segment was labeled with its chromosome and position, followed by the number of matches for each population sorted in descending order. Points were assigned to each population based upon the number of matches. The population with the highest matching percentage in a haplotype segment was assigned 100 points, and the other populations were assigned points proportional to the highest. For example, looking at the first haplotype segment in the second IBD segment above, we see FIN has the highest matching percentage with 4, followed by PUR, GBR, and CEU with 3. In this case, FIN was assigned 100 points, while PUR, GBR, and CEU were each assigned 75 points. If a population exceeded 450 points combined over 6 consecutive haplotype segments, it was considered an IBD segment for that population. The total number of IBD segments for each population was counted to determine ancestry.
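The point system described for Figure 5 can be sketched in a few lines. The following Python sketch is illustrative and is not the Perl used in the study; `segment_points` and `ibd_calls` are hypothetical names, and the six identical segments in the example are invented for demonstration (only the FIN = 4, PUR = GBR = CEU = 3 values come from the caption).

```python
from typing import Dict, List


def segment_points(matches: Dict[str, float]) -> Dict[str, float]:
    """Scale match values so the top population in a haplotype segment gets 100 points."""
    top = max(matches.values(), default=0)
    if top <= 0:
        return {}
    return {pop: 100.0 * m / top for pop, m in matches.items()}


def ibd_calls(segments: List[Dict[str, float]], threshold: float = 450.0) -> Dict[str, float]:
    """Sum points over the bundled haplotype segments; keep populations exceeding the threshold."""
    totals: Dict[str, float] = {}
    for matches in segments:
        for pop, pts in segment_points(matches).items():
            totals[pop] = totals.get(pop, 0.0) + pts
    return {pop: t for pop, t in totals.items() if t > threshold}


first_segment = {"FIN": 4, "PUR": 3, "GBR": 3, "CEU": 3}
points = segment_points(first_segment)

# Hypothetical bundle: six haplotype segments with the same match counts
calls = ibd_calls([first_segment] * 6)
```

With these inputs FIN accumulates 600 points and is called, while PUR, GBR, and CEU accumulate exactly 450 points each and do not exceed the threshold.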

Figure 6 Amount of IBD Sharing by Chromosome

Shows the distribution of IBD segments from rare populations with scores >= 650 across all chromosomes of 57 individuals. The locations of these segments are an indication of IBD sharing. High IBD sharing was seen in the HLA region on chromosome 6 (25 – 35 Mb) and in the centromere region on chromosome 8 (42 – 52 Mb). More detailed information about these IBD segments is provided in a supplementary file.

Figure 7 IBD Results vs 23andMe Results for Control 1

Shows the comparison between 23andMe results and the results from our IBD analysis for Control 1, who was born in Russia and comes from Russian heritage. The two results are similar in that they place Control 1 as >95% European, but they differ in the European populations to which they assign ancestry. The IBD analysis shows population GBR has 14.8% ancestry, while Control 1 has 3.5% Northwestern European in the 23andMe results. The IBD analysis also shows trace ancestry from America (1.9%) and South Asia (1.9%), while the 23andMe results show no ancestry from those regions.

Figure 8 IBD Results vs 23andMe Results for Control 3

Shows the comparison between 23andMe results and the results from the IBD analysis for Control 3, whose family history is from Korea. The two results are similar in that they place Control 3 as >99% East Asian, but they differ in the East Asian populations to which they assign ancestry. The IBD analysis did not include a Korean population, so its highest-ranking population was from Japan (JPT). In the 23andMe results, which do include a Korean population, Japan is the second highest population.

Figure 9 IBD Results vs 23andMe Results for Control 4

Shows the comparison between the 23andMe results and the results from the IBD analysis for Control 4, who comes from a British/Irish heritage. The two results are similar in that they place Control 4 as >95% European, but they differ in the European populations to which they assign ancestry. The IBD analysis assigned more ancestry to American populations (3.2%), while the 23andMe results showed <0.1% from the Americas.

Table 3

Rows: one per subject (NAME). Columns: percentage of IBD segments per population, grouped by region, followed by Tot_IBD_SEG.

East Asia: CHB, JPT, CHS, CDX, KHV
Europe: CEU, TSI, FIN, GBR, IBS
Africa: YRI, LWK, GWD, MSL, ESN, ASW, ACB
America: MXL, PUR, CLM, PEL
South Asia: GIH, PJL, BEB, STU, ITU

Subjects: 23DN, Control4, Control3, Control2, Control1, huFF6370, hu32506B, hu42CD37, Tim Boyle, Travis Jupp, Jesse Portz, Daiyu Hurst, Grant Kovich, Brian Dennis, Billy Ashcraft, Brett Monson, Lauren Welch, Wesley Marks, Aimee Haynes, Wayne Warthy, Gabriel Szaszko, Daniel Zuranich, Adam Davidson, Stephen Shanks, Amber Kleckner, Sash Balasinkam, Nathaniel Hubel, Emma Borhanian, Nicholas Blasgen

Table 3 Cont.

Subjects: Han Lin, rarobins, hu301476, hu026DEA, Kathy Hull, Cory Tripp, Laura Nono, Patrick Love, Jim Bearden, Terry Turner, Leah Bartley, Daniel Olson, Laura Peddle, Johnny Stuto, Ryan Renslow, Vincent Yount, Hayden Hume, Joan Entwistle, Lindgren Rider, Fiona Bremner, Robbie Gerlach, Michael Widing, Stephen Holton, Kenneth Sutton, Terrence Pinder, Stephen Bradley, Shellise Mayrant, Peter Vandermolen

Table 3 Ancestry Results for 57 Subjects using IBD Analysis

Shows the ancestry results for 57 subjects. The 26 populations, represented by their three-letter codes, are grouped into 5 categories: East Asia, Europe, Africa, America, and South Asia. The last column shows the total number of IBD segments found in the subject. Each of the numbers underneath the populations represents the percentage of the total IBD segments that came from that population. For example, Control 1 had 5.4% of the 1453 total IBD segments come from the population IBS, meaning that ~78 IBD segments (1453 x 0.054 = ~78) scored highest for IBS during the analysis. If no number is listed, no IBD segments were found for that population. All listed names are from the publicly available personalgenomes.org database.
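The conversion from a percentage in Table 3 back to a segment count is simple arithmetic. A small Python sketch follows; the function name is hypothetical, and rounding to the nearest whole segment is our assumption.

```python
def segments_for_population(total_segments: int, percent: float) -> int:
    """Approximate number of IBD segments behind a percentage entry in Table 3."""
    return round(total_segments * percent / 100.0)


# Control 1: 1453 total IBD segments, 5.4% assigned to IBS
ibs_segments = segments_for_population(1453, 5.4)
```

This reproduces the value of roughly 78 segments quoted in the caption.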

4. Discussion:

The results of this analysis supported the hypothesis that ancestry of an individual can be determined using publicly-available genomes and databases. While the majority of the 57 subjects had ancestry from Europe, there were subjects whose results showed ancestry from East Asia, South Asia, Africa, and the Americas. Also, for subjects Han Lin and Sash Balasinkam, the ancestry results matched the area of the world from which those names originated, which increased our confidence that the ancestry results were representative of the subject’s family history. While it would have been helpful to have more background on our publicly-available subjects, a name’s origin can be an indirect indicator of ancestry for that person.

The ancestry results also matched the family history provided to us by our control subjects. Control 1 had the highest FIN percentage of all the subjects at 48.4%. Control 1 was born in Russia, and all traceable ancestors came from Russia as well. Out of all the populations we tested, Finland is the closest geographically to Russia, as they share a border, so it is consistent that the people of Finland would be closely related to a Russian individual. The 23andMe results for Control 1 showed most of the ancestry from Eastern European populations, which are not present in the Phase 3 dataset. Control 2 and Control 4, who had family histories from Europe, both showed greater than 92% European ancestry. Finally, Control 3, who had family history from Korea, showed 99.4% East Asian ancestry. The combination of these four controls and other individuals who self-identified provided support that our program pipeline accurately identified the major geographic areas of our subjects’ ancestry.

One surprising result was the lack of ancestral diversity for the subjects with East Asian ancestry. All other subjects, with the exception of Han Lin and Control 3, had IBD segments from at least 4 of our 5 geographic locations (Africa, America, Europe, South Asia, East Asia). Han Lin and Control 3 only had representation from 2 of our locations, and showed greater than 99% East Asian ancestry with trace amounts of ancestry from America. Han Lin had the lowest among the subjects and had the highest amount of total IBD segments. We expected that a subject from East Asia would have some trace amount of ancestry from South Asia due to proximity, but these data suggest that the East Asian populations may be mostly isolated from the South Asian populations. In support of this, one study found that South Asian populations were more related to European populations than to East Asian populations. When haplotypes from 75 Asian populations were analyzed, it was found that 90% of East Asian haplotypes could be found in South Asia, but that South Asia had only minor contributions to the East Asian populations (HUGO Pan-Asian SNP Consortium, et al. 2009). It was also surprising that Han Lin and Control 3 had a much higher number of IBD segments than the other subjects. It is possible that the homogeneity of their ancestry results (>99% East Asian) allowed for continuous sections of IBD segments where other individuals had more frequent interruptions from rare populations.

The distribution of IBD segments from rare populations was non-random and indicated IBD sharing. We expected to find IBD sharing on chromosome 6 because that is where the HLA region is located and previous studies had cited high IBD sharing at the HLA region, and indeed our strongest IBD sharing occurred in that region. Not only did chromosome 6 have the most IBD segments, but 75 of the 108 IBD segments from chromosome 6 were located within the HLA region (25 – 35 Mb). Of the 75 IBD segments in the HLA region, Europe, America, South Asia, East Asia, and Africa were all represented (see Supplementary Table). Europe, America, and South Asia represented the majority of the 75 IBD segments; Africa and East Asia showed only 3 IBD segments each in the HLA region. We expected to find IBD sharing at 11.1 – 13.3 Mb on chromosome 8, in accordance with the findings of Gusev, et al. 2012. While we did find moderate IBD sharing in that region (8 out of 46 IBD segments), we saw stronger IBD sharing in the centromere region (~42 Mb – ~52 Mb), where 18 of the 46 IBD segments were located. Surprisingly, all 18 segments in the centromere region originated from East Asia, unlike in the HLA region, where great diversity was found. More IBD sharing was found on chromosome 11 in the centromere region (~45 Mb – ~59 Mb). These IBD segments showed population diversity, much like in the HLA region, meaning all 5 continent groups were represented among the IBD segments.

In relation to IBD sharing, Control 1 and Control 4 both showed small amounts of ancestry from America and South Asia where none existed in the 23andMe results (Figures 7 and 9). This disparity could be explained by IBD sharing regions like HLA that showed a lot of ancestry from America and South Asia. These IBD sharing regions could be acting as false positives and indicating ancestry where none exists. The HLA region has the largest degree of polymorphism in the genome (Simmonds, et al. 2007), and during our analysis we had to filter out all heterozygous sites. This large loss of data makes the HLA region unreliable for assessing ancestry. It is also possible that the 23andMe analysis overestimates the ancestry from Europe or disregards data from high IBD sharing regions. An interesting next step in this project would be to normalize for the high IBD sharing regions to see if our results more closely resemble the 23andMe results.

One of the main limitations of our study was the lack of information on our subjects. There are many publicly available files from the Personal Genome Project (http://www.personalgenomes.org/), but very few files came with any ancestry information about the subjects. This was limiting to our project because we did not have the 23andMe results against which to compare our results, and we had no way to check if our results matched the subject’s family history. Some subjects chose to self-identify as “Caucasian” or “Native American/Caucasian mix”, or to provide their name, but an investigation into family history would prove more robust than relying on self-reporting.

4.1 Future Work:

In the future, more smoothing techniques could be used. The length of the IBD segments could be shortened or lengthened depending on the researcher’s needs. Another technique that could improve the analysis is the allowance of mismatches in haplotypes. For this analysis, only haplotypes that matched 100% were counted. If mismatches were permitted, the haplotype length could be extended while still maintaining an adequate number of matches. Longer haplotypes could provide more information about more recent ancestors because IBD segments received from recent ancestors are longer. Conversely, shorter haplotypes could give more information on distant ancestors because more recombination would have occurred, causing the IBD segments to be smaller.
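One way the mismatch allowance could work is to count a haplotype as matching when it differs from the subject's haplotype at no more than a set number of SNPs. The following Python sketch illustrates that idea only; it was not part of this study, the function names and the `max_mismatches` parameter are hypothetical, and the example haplotypes are the toy columns from Figure 2.

```python
from typing import Iterable, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length haplotypes differ."""
    if len(a) != len(b):
        raise ValueError("haplotypes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def count_matches(subject: Sequence[int],
                  haplotypes: Iterable[Sequence[int]],
                  max_mismatches: int = 0) -> int:
    """Count haplotypes within max_mismatches of the subject (0 reproduces exact matching)."""
    return sum(1 for h in haplotypes if hamming(subject, h) <= max_mismatches)


subject = (0, 1, 0, 0, 0, 0, 1, 0, 1)
reference = [
    (0, 1, 0, 1, 1, 0, 1, 0, 1),
    (0, 1, 0, 0, 0, 0, 1, 0, 1),
    (0, 1, 0, 1, 0, 0, 0, 0, 1),
    (0, 1, 0, 1, 0, 0, 0, 0, 1),
]
exact = count_matches(subject, reference)
relaxed = count_matches(subject, reference, max_mismatches=2)
```

With exact matching only one of the four haplotypes counts; allowing two mismatches counts all four, which shows how the tolerance trades specificity for a larger number of matches.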

Another improvement would be to add more controls that could help normalize the data. It appears that among subjects with European ancestry, CEU, FIN, and GBR overwhelm the ancestry of TSI and IBS. It is possible that all our subjects’ ancestries are from northwestern Europe, but the discrepancy could also be a result of bias in the data, meaning that a subject with documented family history from TSI or IBS could still have ancestry that favored FIN, CEU, or GBR. Obtaining controls with verifiable family history from TSI or IBS would allow us to identify whether a data bias is occurring. If this were the case, action could be taken to help normalize the data.

Overall, our hypothesis that ancestry could be determined with publicly available data was confirmed. Our results mimicked the results of 23andMe with minor differences. While we used 23andMe files in this study, the application of our program pipeline could be extended to apply to any file containing SNP genotype information. Our pipeline is also a good reminder that ancestry results are not 100% accurate. Ancestry results from our pipeline and 23andMe should be interpreted as an ancestry “estimate” because, as we saw from the controls, different methods yield different results.

References

1. 1000 Genomes Project Consortium. "A global reference for human genetic variation." 526.7571 (2015): 68-74.

2. Albrechtsen, Anders, Ida Moltke, and Rasmus Nielsen. " and the distribution of identity-by-descent in the ." Genetics 186.1 (2010): 295-308.

3. Baertlein, Lisa (11-20-2007). "-backed 23andMe offers $999 DNA test". USA Today.

4. Bettinger, Blaine (09-22-2013). “What Else Can I Do With My DNA Test Results?”. The Genetic Genealogist. Retrieved from http://thegeneticgenealogist.com/2013/09/22/what-else-can-i-do-with-my--test-results/.

5. Browning, Sharon R., and Brian L. Browning. "Haplotype phasing: existing methods and new developments." Nature Reviews Genetics 12.10 (2011): 703-714.

6. Bud, Michael (12-10-2012). “How accurate is 23andMe?”. Genetic Literacy Project. Retrieved from https://geneticliteracyproject.org/2012/12/10/how-accurate-is-23andme/.

7. Crawford, Dana C., and Deborah A. Nickerson. "Definition and clinical importance of haplotypes." Annu. Rev. Med. 56 (2005): 303-320.

8. Goetz, Thomas. "23andMe will decode your DNA for $1,000: welcome to the age of genomics." Wired Mag 15 (2007).

9. Gusev, Alexander, et al. "The architecture of long-range haplotypes shared within and across populations." Molecular biology and evolution 29.2 (2012): 473-486.

10. Hong, Euny (08-26-2016). “23andMe has a problem when it comes to ancestry reports for people of color”. Quartz. Retrieved from https://qz.com/765879/23andme-has-a-race-problem-when-it-comes-to-ancestry-reports-for-non-whites/.

11. HUGO Pan-Asian SNP Consortium. "Mapping human genetic diversity in Asia." Science 326.5959 (2009): 1541-1545.

12. Jestes, Roberta (01-17-2017). “Calling HOGWASH on 23andMe’s Ancestry Timeline”. DNAeXplained – . Retrieved from https://dna-explained.com/2017/01/17/calling-hogwash-on-23andmes-ancestry-timeline/.

13. Murphy, Elisabeth. "Inside 23andMe founder ’s $99 DNA revolution." Fast Company 180 (2013).

14. O'Connell, Jared, et al. "A general approach for haplotype phasing across the full spectrum of relatedness." PLoS Genet 10.4 (2014): e1004234.

15. Simmonds, M. J., and S. C. L. Gough. "The HLA region and autoimmune disease: associations and mechanisms of action." Current genomics 8.7 (2007): 453-465.

16. Yang, Wen-Yun, et al. "A model-based approach for analysis of spatial structure in genetic data." Nature genetics 44.6 (2012): 725-731.

Appendix A

Reference Table 1 Characterization of Populations in Phase 3 1000 Genomes Data

This table includes all the populations from the Phase 3 dataset that were used in this study. Population codes will be used throughout the paper for simplicity.

Protocol:

The first step of our project was to trim the Phase 3 data (~80 million SNPs) down to the ~610,000 SNPs from the 23andMe file. The program create_haplo_table.pl was designed to open a 23andMe file and save all of the location coordinates into a hash. The program then searched for these locations in the Phase 3 data and, if found, printed the matches out into tables named HAPLO_RESULTS_CHR#. The file “Phase3_individuals_Final” contains two columns: the first contains the 1000 Genomes ID for the individual and the second contains the 3-letter population code associated with that individual. This table was used to determine which columns corresponded to which population. Phase3_individuals_Final is a required input file for the create_haplo_table.pl program and will be provided to you in the supplemental materials. The steps below will guide use of the programs in this study.

Step 1: Download the Phase 3 VCF files from (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) and save into a directory.

Step 2: Save the Phase3_individuals_Final file into a directory.

Step 3: Obtain the 23andMe file of interest.

Step 4: Route “open” statements in create_haplo_table.pl to the location of your files. This will need to be done for all programs.

Step 5: Run create_haplo_table.pl using the Unix tool “nohup” and the chromosome number as an ARGV argument. The command line is shown below:

This command should be run for chromosomes 1 – 22.

create_haplo_table.pl

#!/usr/bin/perl
#
#By Patrick Brennan

# One array per population; each holds the VCF column indexes of that population's individuals.
@TSI=();@IBS=();@MXL=();@CLM=();@GBR=();@YRI=();@LWK=();@CEU=();@CHB=();
@JPT=();@ASW=();@PUR=();@CHS=();@FIN=();@GWD=();@MSL=();@ESN=();@PJL=();
@PEL=();@STU=();@ITU=();@CDX=();@GIH=();@ACB=();@KHV=();@BEB=();
@nationalities=qw(CHB JPT CHS CDX KHV CEU TSI FIN GBR IBS YRI LWK GWD MSL ESN ASW ACB MXL PUR CLM PEL GIH PJL BEB STU ITU);
%known = map { $_ => 1 } @nationalities;

$CHR_NUM = $ARGV[0];
open (IN, "Phase3_individuals_Final") || die "Cannot open file: $!\n";

# VCF sample columns start at index 9, so the first individual gets column 9.
$column=8;
while (<IN>) {
    chomp $_;
    $column++;
    if ($_=~/\t(\w\w\w)/) {
        # Symbolic reference: push onto the array named after the population code (e.g. @TSI).
        if ($known{$1}) {push(@{$1}, $column);}
        else {print "$1\tCould not find population\n";}
    }
}
close IN;

#foreach $x (@nationalities) {
#    $sum=$#{$x};
#    print "$sum\n";
#}
#print "@IBS\n";
#

# Read the 23andMe file and keep the rows for the requested chromosome.
@lines=(); @loc=();
open (IN2, "/PATH_TO_23ANDME_FILE") || die "Cannot open file: $!\n";
while (<IN2>){
    chop $_; chop $_;
    next if $. < 19;    # skip the header lines
    @d = split (/\t/, $_);
    if ($d[1] == $CHR_NUM) {
        push (@loc, $d[2]);
        push (@lines, $_);
    }
}
close IN2;
print "$lines[84]\n";
print "$#lines\n";
print "$#loc\n";
print "$loc[6]\n";

# First pass over the VCF: record every position present in Phase 3.
%hash=();
open (IN3, "zcat /PATH_TO_PHASE3_VCF_FILES/ALL.chr$CHR_NUM.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz |") || die "Cannot open file: $!\n";
while(<IN3>){
    chomp $_;
    @c = split(/\t/, $_);
    #print "$c[1]\n";
    $hash{$c[1]}=1;
}
close IN3;
open (OUT, ">/PATH_TO_OUTPUT_DIRECTORY/HAPLO_RESULTS_$CHR_NUM") || die "cannot write file: $!\n";

#print OUT "CHR\tCoord#\tSNP_ID\t\tAlt\tRef\tCHB\tJPT\tCHS\tCDX\tKHV\tCEU\tTSI\tFIN\tGBR\tIBS\tYRI\tLWK\tGWD\tMSL\tESN\tASW\tACB\tMXL\tPUR\tCLM\tPEL\tGIH\tPJL\tBEB\tSTU\tITU\n";

# Second pass over the VCF: walk the 23andMe positions in step with the VCF rows.
#for $x (1..22) {
open (IN4, "zcat /PATH_TO_FILE/ALL.chr$CHR_NUM.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz |") || die "cannot open VCF file: $!\n";
$c=0; $x=0;
while (<IN4>) {
    chomp $_;
    $c++;
    next if $c < 254;    # skip the VCF header
    @a=split(/\t/, $_);
    #next if $a[4]=~/,/;

    # Flush 23andMe positions that are absent from Phase 3.
    PAT:{
        print "$loc[$x]\n";
        print "$hash{$loc[$x]}\n";
        if ($x > $#loc) {last;}
        if ($hash{$loc[$x]}!=1) {
            print OUT "$lines[$x]\tNO MATCH\n";
            $x++;
            redo PAT;
        }
    }
    if ($hash{$loc[$x]}==1) {
        #print "$x\t$.\n";
        @b = split (/\t/, $_);
        if ($loc[$x] == $b[1]) {
            print OUT "$lines[$x]\t$b[0]\t$b[1]\t$b[2]\t$b[3]\t$b[4]";
            $hash{$loc[$x]}=0;
            $x++;
            # Print each population code followed by the two phased alleles of its individuals.
            for $pop (@nationalities) {
                print OUT "\t$pop";
                for $y (@{$pop}) {
                    if ($a[$y]=~/^([012])\|([012])/) {
                        print OUT "\t$1\t$2";
                    }
                    #else {print OUT "\tX";}
                }
            }
        }
        else {next;}
    }

    print OUT "\n";
}

#}

The output of this step will yield 22 files named HAPLO_RESULTS_CHR# shown below:

Step 6: The next step was to concatenate all of the tables (one for each chromosome) into a single table by using the Unix “cat” command. File names should be separated by a single space, and “>” is used to save the output into a new file.

Step 7: The resulting table of Step 6 needs to be filtered to remove the “NO MATCH” SNPs. matched_filter.pl was created to filter out all the SNPs that were marked with “NO MATCH”, meaning the SNP was not found in the Phase 3 data. The program matched_filter.pl takes the output table from the previous “cat” step as its first ARGV argument and saves the filtered table to the file named by its second ARGV argument.

matched_filter.pl

#!/usr/bin/perl
#
#By Patrick Brennan
#Tests input file looks at line length to determine if match has occurred.

$file = $ARGV[0];
$file2 = $ARGV[1];
open (IN, "$file") || die "Cannot open file: $!\n";
open (OUT, ">$file2") || die "Cannot open file: $!\n";
while (<IN>) {
    chomp $_;
    @a=split(/\t/, $_);
    # Matched rows carry the Phase 3 columns, so they have more than 5 fields.
    if ($#a > 4) {print OUT "$_\n";}
}

Step 8: We named the final table from Step 7 “HAPLO_RESULTS_COMBINED”. Once we have our final table, the haplotype analysis can begin. For this analysis, we used the program build_haplo_v2.pl shown below. This program takes a single ARGV argument: the length of the haplotype you wish to create (# of SNPs per haplotype). For our analysis, we used 50.

build_haplo_v2.pl

#!/usr/bin/perl #By Patrick Brennan %continents = ( CEU => "EUR", TSI => "EUR", FIN => "EUR", GBR => "EUR", IBS => "EUR", CHB => "EAS", JPT => "EAS", CHS => "EAS", CDX => "EAS", KHV => "EAS", LWK => "AFR", GWD => "AFR", MSL => "AFR", ESN => "AFR", ASW => "AFR", ACB => "AFR", YRI => "AFR", MXL => "AMR", PUR => "AMR", CLM => "AMR", PEL => "AMR", GIH => "SAS", PJL => "SAS", BEB => "SAS", STU => "SAS", ITU => "SAS", ASH => "ASH", ); @TSI=();@IBS=();@MXL=();@CLM=();@GBR=();@YRI=();@LWK=();@CEU=();@CHB=() ;@JPT=();@ASW=();@PUR=();@CHS=();@FIN=();@GWD=();@MSL=();@ESN=();@PJL=( );@PEL=();@STU=();@ITU=();@CDX=();@GIH=();@ACB=();@KHV=();@BEB=();@ASH= (); 46

@TSI1=();@IBS1=();@MXL1=();@CLM1=();@GBR1=();@YRI1=();@LWK1=();@CEU1=() ;@CHB1=();@JPT1=();@ASW1=();@PUR1=();@CHS1=();@FIN1=();@GWD1=();@MSL1=( );@ESN1=();@PJL1=();@PEL1=();@STU1=();@ITU1=();@CDX1=();@GIH1=();@ACB1= ();@KHV1=();@BEB1=();@ASH1=(); @nationalities1=qw(CHB1 JPT1 CHS1 CDX1 KHV1 CEU1 TSI1 FIN1 GBR1 IBS1 YRI1 LWK1 GWD1 MSL1 ESN1 ASW1 ACB1 MXL1 PUR1 CLM1 PEL1 GIH1 PJL1 BEB1 STU1 ITU1 ASH1); @nationalities=qw(CHB JPT CHS CDX KHV CEU TSI FIN GBR IBS YRI LWK GWD MSL ESN ASW ACB MXL PUR CLM PEL GIH PJL BEB STU ITU ASH);

open (IN, "HAPLO_RESULTS_COMBINED") || die "Cannot open file:$!\n";
open (OUT, ">HAPLO_RESULTS_COMBINED_FILTERED_v3") || die "Cannot write file: $!\n";
open (OUT5, ">Hetero_AF") || die "Cannot write file: $!\n";
open (OUT6, ">Hetero_HAPLO") || die "Cannot write file: $!\n";
open (OUT7, ">Homo_HAPLO") || die "Cannot write file: $!\n";
$count=0;$lines=1;$heterozygous=0; $tot_lines=0;
@hetero=();%final=();
while (<IN>) {
    $count++;
    print "$count\n";
    # After every 50 accepted homozygous SNPs ($lines reaches 51),
    # summarize the segment that has just been completed.
    if ($lines%51==0) {
        push (@hetero, $heterozygous);
        print OUT6 "Hetero_DATA:\t";
        # Multiply the per-SNP values stored for each population.
        for $x (@nationalities) {
            $product =1;
            for $p (@{$x}) {
                $product *= $p;
            }
            $final{$x} = $product;
            @{$x} = ();
        }
        # Express each product as a percentage of the largest product.
        @prod = sort { $final{$b} <=> $final{$a} } keys(%final);
        $max = $final{$prod[0]};
        for $u (@prod) {
            $final{$u} = sprintf ("%.0f", eval{($final{$u} / $max) *100});
            if ($final{$u} != 0) {print OUT6 "$u=$final{$u}\t";}
        }
        print OUT6 "\n";
        $heterozygous =0;$lines=1;%final=();
        print OUT7 "Homo_DATA:\t";
        for $x (@nationalities1) {
            #print "$x\t@{$x}\n";
            $product =1;
            for $p (@{$x}) {
                $product *= $p;
            }
            #print "$x\t$product\n";
            $final{$x} = $product;
            @{$x} = ();
        }
        @prod = sort { $final{$b} <=> $final{$a} } keys(%final);
        $max = $final{$prod[0]};
        #print "@prod\n";
        for $u (@prod) {
            $unchopped = $u;
            $final{$u} = sprintf ("%.0f", eval{($final{$u} / $max) *100});
            # chop removes the trailing "1" from the population name before printing.
            if ($final{$u} != 0) {chop($u);print OUT7 "$u=$final{$unchopped}\t";}
        }
        print OUT7 "\n";
        %final=();
    }
    chomp $_;
    @a = split(/\t/, $_);
    @b = split (//, $a[3]);
    # Skip rows whose field 8 contains a comma.
    next if $a[8]=~/,/;

    # Homozygous genotype whose allele matches field 7 or field 8.
    if (($a[3] eq 'AA' || $a[3] eq 'TT' || $a[3] eq 'CC' || $a[3] eq 'GG') &&
        ($b[0] eq $a[7] || $b[0] eq $a[8])){
        print OUT "$_\n";
        $lines++;
        $tot_lines++;
        $first_pop = 'CHB1';
        for $t (10..$#a) {
            if ($a[$t]== 1) {$one++;}
            if ($a[$t]== 0) {$zero++;}
            # A three-letter population label marks the end of a population's columns.
            if ($a[$t]=~/(\w\w\w)/) {
                $sum = $one + $zero;
                if ($b[0] eq $a[7]) {$percent = sprintf ("%.10f", $zero/$sum);}
                if ($b[0] eq $a[8]) {$percent = sprintf ("%.10f", $one/$sum);}
                push (@{$first_pop}, $percent);
                $new = $1 . '1';
                $first_pop=$new;$one=0; $zero=0;
            }
        }
        push (@{$first_pop}, $percent);

    } elsif ($a[3] eq 'AG' || $a[3] eq 'GA' || $a[3] eq 'TA' || $a[3] eq 'AT' ||
             $a[3] eq 'AC' || $a[3] eq 'CA' || $a[3] eq 'GT' || $a[3] eq 'TG' ||
             $a[3] eq 'CT' || $a[3] eq 'TC' || $a[3] eq 'GC' || $a[3] eq 'CG') {
        # Heterozygous genotype.
        if ($lines!=1) {$heterozygous++;}
        print OUT5 "$a[0]\t$a[1]\t$a[2]\t$a[3]\t$a[7]\t$a[8]\t";
        $first_pop = 'CHB';
        for $t (10..$#a) {
            if ($a[$t]== 1) {$one++;}
            if ($a[$t]== 0) {$zero++;}
            if ($a[$t]=~/(\w\w\w)/) {
                $sum = $one + $zero;
                # Use the frequency of the less common allele.
                if ($one >=$zero) {$percent = sprintf ("%.10f", $zero/$sum);}
                if ($zero >= $one) {$percent = sprintf ("%.10f", $one/$sum);}
                print OUT5 "$first_pop=$percent\t";
                push (@{$first_pop}, $percent);
                $first_pop=$1;$one=0; $zero=0;
            }
        }
        print OUT5 "$first_pop=$percent\n";
        push (@{$first_pop}, $percent);
    } else {
        next;
    }

}
close IN; close OUT; close OUT5; close OUT6;close OUT7;

# Read the files written above back into memory.
open (OUT6, "Hetero_HAPLO") || die "Cannot open file: $!\n";
@hetero_lines=<OUT6>;
close OUT6;
open (OUT7, "Homo_HAPLO") || die "Cannot open file: $!\n";
@homo_lines=<OUT7>;
close OUT7;
open (IN2, "HAPLO_RESULTS_COMBINED_FILTERED_v3") || die "Cannot open file: $!\n";
@file = <IN2>;
close IN2;

#for $u (9..20){
#    print "$original[9][$u]\n"; #Testing
#}
#$size = @{$original[12]};
#print "$#ind\n$size\n";

$start=0; $end=$ARGV[0] - 1;$hap_count=0;
%code=(); %ancestry=(); %minmax=();
$max =0;$hetero_count=0;
open (OUT2, ">RESULTS_HAPLOTYPES_$ARGV[0]\_v2") || die "Cannot write file: $!\n";
open (OUT3, ">RESULTS_ANCESTRY_$ARGV[0]\_v2") || die "Cannot write file: $!\n";
open (OUT4, ">RESULTS_Countries_sorted_v2") || die "Cannot write file: $!\n";
@location=(); @chr=();
# Process the filtered table in windows of $ARGV[0] SNPs.
while ($tot_lines > $end){
    $string = '';
    $cur_pop = 'CHB';
    @ind = ();
    $counter=0;
    for $i ($start..$end) {
        chomp $file[$i];
        @a = split (/\t/, $file[$i]);
        @b = split (//, $a[3]);

        # Encode the subject's allele as 0 (matches field 7) or 1 (matches field 8).
        if ($b[0] eq $a[7]) {$ind =0;}
        elsif ($b[0] eq $a[8]) {$ind=1;}
        else {print "error!\n";}

        push (@location, $a[2]);
        push (@chr, $a[1]);
        $length = $#a;
        for $x (9..$#a) {
            push (@{$original[$x]}, $a[$x]);# Creates 2D Array
        }

        $string .= $ind;
    }

    $hap_length = $location[$end] - $location[$start];
    print OUT2 "HAPLOTYPE_$hap_count\t$string\t";
    for $k (10..$length) {
        $string2 = '';
        for $i (0..$ARGV[0]-1) {
            if ($original[$k][$i]=~/(\w\w\w)/) {
                if ($i == 1){
                    # Percentage of this population's haplotypes equal to the subject's string.
                    $normalized = sprintf ("%0.2f",($code{$string} / $pop_count)) * 100;
                    $minmax{$cur_pop} = $normalized;
                    print OUT2 "$cur_pop = $normalized\t";
                    if ($normalized > $max) {$max = $normalized; $top_pop= $cur_pop;}
                    $cur_pop = $1;
                    #print "$hap_count\t$pop_count\t$normalized\n";
                    %code=();
                    $pop_count=0;
                }
                else {next;}
            }
            #if ($original[$k][$i] == 'X') {$original[$k][$i] = $ind[$i];}
            $string2 .= $original[$k][$i];
        }
        $code{$string2}++;
        $pop_count++;
    }
    $hap_count++;
    #$normalized = sprintf ("%0.2f",($code{$string} / $pop_count)) * 100;
    #$minmax{$cur_pop} = $normalized;
    print OUT2 "\n";
    #if ($normalized > $max) {$max = $normalized; $top_pop= $cur_pop;}
    if ($max == 0) {$top_pop = 'NA';}
    $ancestry{$top_pop}++;

    # Populations ordered from highest to lowest matching percentage.
    my @keys = sort { $minmax{$b} <=> $minmax{$a} } keys(%minmax);
    print OUT4 "CHR_$chr[$start]\tPOS=$location[$start]\tLEN=$hap_length\tHETERO=$hetero[$hetero_count]\t";
    if ($max > 0) {
        $cycle_num=1;$second = 1;
        for $w (@keys) {
            # Continent of the top-scoring population.
            if ($minmax{$w} > 0 && $cycle_num==1){
                print OUT4 "$continents{$w}\t";
                $cycle_num++;
                $first_pop = $continents{$w};
            }
            # First matching population from a different continent, if any.
            if ($continents{$w} ne $first_pop && $second ==1 && $minmax{$w} > 0) {
                print OUT4 "$continents{$w}";
                $second=0;
            }
        }
        if ($second ==1) {print OUT4 "---";}
        print OUT4 "\t";
        for $t (@keys) {
            if ($minmax{$t} > 0) {print OUT4 "$t=$minmax{$t}\t";}
        }
        print OUT4 "\n";
    }
    else {print OUT4 "NA\n"}
    print OUT4 "$hetero_lines[$line_count]";
    print OUT4 "$homo_lines[$line_count]\n";
    $start+=$ARGV[0];$end+=$ARGV[0];
    $hetero_count++;
    $line_count++;
    %code = (); %minmax=(); $max=0;$pop_count=0;
    @original=();
}

# Write the number of haplotype segments assigned to each top population.
@q = keys (%ancestry);
foreach $x (@q) {
    print OUT3 "$x\t$ancestry{$x}\n";
}
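The matching idea at the heart of build_haplo_v2.pl can be illustrated on a much smaller scale. The sketch below is a simplified stand-in and not the thesis code: the helper names match_percentages and top_population, the four-SNP haplotype strings, and the three example populations are all hypothetical. It assumes the subject's haplotype and each reference haplotype have already been encoded as strings of 0s and 1s.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: for each population, return the percentage of its
# reference haplotype strings that are identical to the subject's string.
sub match_percentages {
    my ($subject, $reference) = @_;
    my %percent;
    for my $pop (keys %$reference) {
        my @haps = @{ $reference->{$pop} };
        next unless @haps;
        my $hits = grep { $_ eq $subject } @haps;   # count of exact matches
        $percent{$pop} = 100 * $hits / scalar(@haps);
    }
    return %percent;
}

# Hypothetical helper: the population with the highest percentage, or 'NA'
# when no population has any matching haplotype. Ties break alphabetically.
sub top_population {
    my (%percent) = @_;
    my ($best) = sort { $percent{$b} <=> $percent{$a} || $a cmp $b } keys %percent;
    return (defined $best && $percent{$best} > 0) ? $best : 'NA';
}

# Made-up reference haplotypes over four SNPs (0 = first allele, 1 = second).
my %reference = (
    CEU => [qw(0101 0101 0111 0000)],
    YRI => [qw(0000 0101 1111 1110)],
    CHB => [qw(1111 1110 1100 1000)],
);

my %percent = match_percentages('0101', \%reference);
printf "%s=%.0f\n", $_, $percent{$_} for sort keys %percent;
print "Top population: ", top_population(%percent), "\n";
```

With this toy data the subject string 0101 matches two of four CEU haplotypes, one of four YRI haplotypes, and no CHB haplotypes, so CEU is reported as the top population. The sketch omits the rounding, the continent lookup, and the heterozygosity bookkeeping that the full program performs.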

Step 9: build_haplo_v2.pl constructs haplotype segments and predicts the subject’s ancestry based on the population with the highest matching percentage in each haplotype segment. The data can be smoothed by using IBD segments or a smoothing technique of your choice. If you want to run this analysis for another subject, the program below will allow you to insert another subject’s genotype information into the existing HAPLO_RESULTS_COMBINED table. This prevents the user from having to create a completely new table using create_haplo_table.pl. Once the new individual’s genotype information has been inserted into the table, the analysis with build_haplo_v2.pl (Step 8) can be run again.

insert_ind.pl

#!/usr/bin/perl
#
#by Patrick Brennan
#
open (IN, "genome_Control1_v4_Full_20161025122535.txt") || die "Cannot open file: $!\n";
%info=();
# Store the new subject's genotype (field 3) under its SNP ID (field 0).
while (<IN>){
    chop $_;
    chop $_;
    @a = split (/\t/, $_);
    $info{$a[0]} = $a[3];
}
close IN;
open (IN2, "HAPLO_RESULTS_COMBINED") || die "Cannot open file: $!\n";
open (OUT, ">HAPLO_RESULTS_COMBINED_Control1") || die "Cannot open file: $!\n";
while (<IN2>) {
    chomp $_;
    @b= split (/\t/, $_);
    if ($info{$b[0]} ne '') {
        # Replace the genotype column with the new subject's genotype.
        $b[3] = $info{$b[0]};
        for $x (0..$#b) {
            if ($x ==$#b) {print OUT "$b[$x]";}
            else{print OUT "$b[$x]\t";}
        }
        print OUT "\n";
        #print "@b\n";
    }
    else {print "Error\n";}
}
close IN2;
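The column replacement performed by insert_ind.pl can be checked on a few in-memory rows before running it on the full table. This is a minimal sketch under stated assumptions: the helper name insert_genotypes, the SNP IDs, and the table rows are hypothetical, and the genotype is assumed to sit in the fourth tab-separated field, as in the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace the genotype field (index 3) of each table row
# with the new subject's genotype, looked up by the SNP ID in field 0.
# Rows whose SNP ID is missing from the new subject are skipped.
sub insert_genotypes {
    my ($genotype_for, $table_rows) = @_;
    my @updated;
    for my $row (@$table_rows) {
        my @fields = split(/\t/, $row, -1);   # -1 keeps trailing empty fields
        my $new = $genotype_for->{ $fields[0] };
        next unless defined $new && $new ne '';
        $fields[3] = $new;
        push @updated, join("\t", @fields);
    }
    return @updated;
}

# Made-up genotypes for the new subject and made-up table rows.
my %genotype_for = (rs0001 => 'AA', rs0002 => 'CT');
my @table = (
    "rs0001\t1\t1000\tGG\textra",
    "rs0002\t1\t2000\tCC\textra",
    "rs0003\t1\t3000\tTT\textra",
);

my @updated = insert_genotypes(\%genotype_for, \@table);
print "$_\n" for @updated;
```

Here the first two rows receive the new genotypes and the third row is dropped because rs0003 is absent from the new subject's data.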
