University of Cincinnati

Date: 2/17/2011

I, Mehdi Keddache, hereby submit this original work as part of the requirements for the degree of Doctor of Philosophy in Biomedical Engineering.

It is entitled: Relationship between triplet repeat polymorphisms and HapMap tagSNPs

Student's name: Mehdi Keddache

This work and its defense approved by:

Committee chair: Bruce Aronow, PhD

Committee member: William S. Ball, MD

Committee member: Jeffrey Johnson, PhD



Relationship between triplet repeat polymorphisms and HapMap tagSNPs

A dissertation submitted to the Graduate School of the University of Cincinnati in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY In the Department of Biomedical Engineering of the College of Engineering

By

Mehdi Keddache

M.A., Genetics & Development, Columbia University, May 1998

Committee Chair: Bruce Aronow, PhD

Abstract

Single Amino Acid Repeat Proteins (SARPs) are a class of peptides that contain extended stretches of the same amino acid. At the DNA level they are represented by a repetition of the same triplet of bases. Depending on the number of repeats, they can influence regulation, transcription and function. Mutations that add or subtract repeat units are both frequent and reversible.

Several human diseases have been identified that are caused by variations in the size of repeated DNA sequences. Animal and plant genetic studies showed that variations in repeat sequences can lead to complex phenotypes and that variations within a normal (i.e. non-pathological) range of repeat number commonly yield small, quantitative functional effects.

Even though triplet repeats present an intriguing disease mechanism for other complex human diseases, methods for a fast, inexpensive and systematic search have never been utilized. We have devised an assay that can be used for a high throughput, genome wide detection of SARPs in those diseases. We have optimized our method for low cost and validated our results through direct sequencing.

We also showed that inferring SARP alleles from SNP data derived from genome wide association studies (GWAS) is often not possible and therefore a SNP based GWAS approach will be underpowered or may even fail if the disease studied is caused by a triplet repeat allele. In addition, we found that current methods of next generation sequencing may miss the detection of triplet repeat variations, probably due to technical limitations of the technology.

Acknowledgements

I want to thank my advisor Dr. Bruce Aronow and my committee, Dr. William Ball and Dr. Jeffrey Johnson, for their encouragement and support during the course of my graduate work.

A very special thank you to Dr. Gregory Grabowski, Jerry Diegmueller and the entire division of Human Genetics at CCHMC for their support and for giving me the opportunity to pursue a doctoral degree.

I am deeply indebted to Katie Lutz and Sarah Srodulski from the CCHMC Genotyping facility for their assistance in the generation of triplet genotypes.

To my friends Dr. Martina Durner and Dr. Larry Altsteil, thank you so much for your support and encouragement during all the years I have been working on my doctorate.

To my colleagues Dr. Auvo Reponen and Patrick Putnam, thank you for your invaluable database management and programming advice.


Table of contents

Abstract
Acknowledgements
Part A: Background and Significance
Part B: Specific Aims
Part C: Specific Aim 1
    Primer design and amplification strategy for triplet repeat analysis
    Incorporation of fluorophore by the fluorescent adaptamer PCR technique
    Polymorphic determination of triplet repeats using DNA of varied ethnic origin
    Correlation of the experimentally determined polymorphism of the selected triplets with the repeat database Satellog
    Verification of the normal meiotic origin of the observed triplet variations
    Validation of assayed alleles through sequencing
    Analysis of sequencing and genotyping results
    Using the technique beyond triplet repeats
    Cost considerations
    Sensitivity and specificity considerations
Part D: Specific Aim 2
    High throughput triplet genotyping protocol
    Data structure to efficiently store millions of biallelic SNP genotypes
    Determination of haplotype block around the SARP location
    Determination of haplotypes with specific SARP alleles
    Determination of the relationship between SARP alleles and haplotypes
    Simulation of data to understand the significance of the Wn statistic
Part E: Discussion
References
Appendix A: List of 207 triplets tested and primer sequences used for amplification
Appendix B: Labeling PCR products with the fluorescent adaptamer method
Appendix C: Output of reconstructed haplotypes by PHASE

A. BACKGROUND AND SIGNIFICANCE

Single Amino Acid Repeat Proteins (SARPs) are a class of peptides that contain extended stretches of the same amino acid. At the DNA level they are represented by a repetition, within an exonic sequence, of the same triplet of bases coding for one amino acid (codon). Because several codons can encode the same amino acid (for example, GCC and GCA both code for alanine), we make a distinction, at the DNA level, between SARPs with homogeneous triplet repeats (the same codon is repeated) and SARPs with imperfect triplet repeats (multiple codons are repeated).
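The distinction between homogeneous and imperfect triplet repeats can be illustrated with a minimal Python sketch. The function name, the return labels and the deliberately tiny codon table (alanine and glutamine only) are assumptions made for this illustration; they are not part of the bioinformatic tools used in this work.

```python
# Minimal codon table: only the amino acids needed for this illustration.
CODON_TO_AA = {
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",  # alanine
    "CAA": "Q", "CAG": "Q",                          # glutamine
}


def classify_repeat(dna: str) -> str:
    """Label an in-frame coding run as 'homogeneous', 'imperfect',
    or 'not a single amino acid run'."""
    dna = dna.upper()
    if len(dna) == 0 or len(dna) % 3 != 0:
        raise ValueError("sequence length must be a non-zero multiple of 3")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    amino_acids = {CODON_TO_AA.get(codon) for codon in codons}
    # More than one amino acid, or a codon missing from the small table.
    if len(amino_acids) != 1 or None in amino_acids:
        return "not a single amino acid run"
    # One amino acid: homogeneous if a single codon is used, imperfect otherwise.
    return "homogeneous" if len(set(codons)) == 1 else "imperfect"


print(classify_repeat("GCCGCCGCCGCC"))  # same codon repeated
print(classify_repeat("GCCGCAGCCGCA"))  # two alanine codons mixed
```

Both example sequences encode four alanines; only the codon usage differs, which is exactly what fragment sizing alone cannot distinguish.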

One of the interesting features of SARPs is their potential repeat number polymorphism. The number of repeats is thought to influence gene regulation, transcription and protein function. Mutations that add or subtract repeat units are both frequent and reversible, thus providing a prolific source of quantitative and qualitative genetic variation producing phenotypes upon which natural selection can occur.

Several human diseases have been identified that are caused by variations in the size of repeated DNA sequences within a specific gene (Pearson et al, 2005). In addition, SARP polymorphisms have been shown to be involved in complex phenotypes in animal species. SARPs are therefore thought to be ideal candidate variations for complex human genetic diseases, which may be caused by several loci, each with only a subtle effect. A mechanism that does not interrupt gene function but merely alters it is a very attractive potential function of SARPs. Approximately 2,100 SARPs with 4 or more repeats have been identified (Subramanian et al, 2003), and triplet repeats are also found in promoter regions of active genes. In particular, two thirds of the SARPs are located in genes with brain-related function.


The preferred current method of screening for genetic association with disease uses SNPs densely spaced throughout the genome. A single nucleotide polymorphism, or SNP (pronounced snip), is a DNA sequence variation occurring when a single nucleotide (A, T, C, or G) in the genome differs between individuals. The first phase of the International HapMap project was completed in 4 major populations with the goal of identifying the key point variations (tagSNPs) in the genome that allow the prediction of most other nearby point variations because they are on the same “haplotype block”. Commercial products have become available to systematically genotype all HapMap tagSNPs and perform genome wide association studies (GWAS). The question arises whether those tagSNPs would also be able to capture the SARP alleles within their haplotypes.

Repeat polymorphisms are expected to be more plastic than SNPs. The main reason is that SNPs most likely occur as the result of uncorrected polymerization errors and chemical or radiation DNA damage, whereas repeat polymorphisms can arise from more frequent events such as polymerase slippage and unequal recombination, which can either shrink or expand the repeat length. The formation of imperfect triplet repeats through point mutations is thought to counteract further growth of the number of repeats and prevent their unlimited expansion (Siwash et al, 2006). This mechanism of genetic evolution is considered very rapid and can generate a lot of variability as well as subtle changes in both the expression and function of the genes that contain the repeats (Caburet et al, 2005).

The interest in SARPs comes from the fact that several human diseases have been shown to be caused by variations in the size of repeated DNA sequences within a specific gene. In the case of triplet repeats, such diseases have been designated by the acronym TREDs (triplet repeat expansion diseases; Rozanska et al, 2007). Fragile X, for example, is caused by an increase in the number of CGG repeats outside of a coding region, while other diseases arise from coding triplet repeat expansion, like the polyglutamine tract in Huntington disease and spinocerebellar ataxia (SCA). Expansion of triplets outside the coding regions is thought to disrupt gene expression through transcriptional silencing because RNA polymerase cannot transcribe through the expanded repeat tract. Coding region triplet expansion can lead to reduced gene expression through inefficient transcription but can also change the protein conformation and/or folding, which has a direct effect on gene activity. In addition, triplets are also the cause of genetic anticipation, in which the severity (or age of onset) of the disorder is related to the number of repeats, as in myotonic dystrophy (Ranum and Day, 2004).

Very interestingly, SARP alleles have also been reported to be involved in the determination of non-clinical phenotypes in animals, such as the morphological determination of the skull in dogs (Fondon and Garner, 2004). This finding led the authors to suggest that SARPs (and other gene associated tandem repeat polymorphisms) provide abundant and modulated variations on phenotype, enabling rapid evolution upon which natural selection can act. This important hypothesis reconciles low SNP mutation rates and their dramatic impact on protein function with the fast pace and continuity of morphological changes in vertebrates shown by the fossil record. It underlines the importance of studying variations in triplet repeat length, since they potentially have the phenotype related characteristics associated with complex human diseases (common alleles, spotty inheritance, multiple loci, subtle effects at each locus, normal individuals being carriers, and environmental influences).


With the sequence of the human genome completed and the localization of the vast majority of the genes known it is possible to search for SARPs in a systematic fashion. Bioinformatic tools have been developed to screen exonic sequences for both homogeneous and imperfect triplet repeats. With the availability of databases with cDNA variant sequences it is also possible to estimate the degree of polymorphism for the identified SARPs by comparing the number of triplet repeats in the translated published sequence to the number of triplet repeats in the cDNA clones (Missirlis et al, 2005). Despite the abundance and potential role in phenotype determination of triplet repeats, to date there are no assays that would allow a rapid high-throughput screening of SARP alleles.

Current methods used to determine a SARP genotype (number of triplet repeats) involve A) direct sequencing of PCR amplified triplet repeat loci, B) radioactive labeling of PCR amplified triplet repeat loci and band size comparison to a known product on polyacrylamide gels or C) fluorescent labeling of PCR amplified triplet repeat loci and fragment size comparison to a known product by capillary electrophoresis. Although direct sequencing has the advantage of providing information on imperfect repeats, it is the most costly methodology on a per genotype basis. It can also be uninterpretable when the genotype involves alleles with very different numbers of repeats, because both alleles are sequenced simultaneously. Radioactive labeling followed by gel electrophoresis is inexpensive but has a very low throughput, since allele sizing is done manually by visual comparison to a standard. Digitizing radiographic images and using software to assist in allele calling can improve throughput. Conventional fluorescent genotyping by capillary electrophoresis is fast, accurate in sizing a wide range of alleles and has the advantage of an all-digital data collection. However, in its current implementation, the method can be expensive when developing new assays, as the synthesis of fluorescently labeled PCR primers is the driving factor of cost.


B. SPECIFIC AIMS

Specific Aim 1

To develop a fluorescent adaptamer PCR labeling technique that can be used to genotype SARP alleles and verify that these genotypes are accurately determined.

We propose to significantly reduce the cost of current triplet repeat genotyping protocols and make them appropriate for the development of genotyping assays for potentially all the SARPs in the human genome. In the PCR based assays, the reverse primer remains the same but the forward primer does not carry an expensive fluorophore at its 5'-end. Instead it is tailed with a generic 18- to 20-nucleotide sequence, such as M13, that is not costly to synthesize. The fluorescent label is incorporated into the PCR product by adding a 3rd primer that has only the sequence of the tail with the fluorophore at its 5'-end (fluorescent adaptamer). This generic labeling primer can be reused in all the assays designed.


Specific Aim 2

To determine the degree of linkage disequilibrium between SARP alleles and neighboring SNP alleles.

We propose to study the relationship between SNP alleles and SARP alleles in order to determine whether the important phenotypes potentially encoded in SARPs can be detected by genome wide association studies. The purpose is to establish the degree of correlation between SARP alleles and HapMap tagSNPs in order to answer the question of whether the triplet alleles need to be assessed separately or if they are captured using the current state of the art whole genome SNP genotyping technologies.


C. SPECIFIC AIM 1

To develop a fluorescent adaptamer PCR labeling technique that can be used to genotype SARP alleles and verify that these genotypes are accurately determined.

While there is evidence in triplet repeats for the existence of polymorphism in the number of repeats and different phenotypes associated with some alleles, including human diseases, the literature remains sparse in this field. We attribute this fact to the difficulty and cost of triplet repeat genotyping by conventional DNA sequencing. We describe a methodology that captures the number of repeats for most triplet repeat loci in an inexpensive and high throughput fashion.

Because SARP alleles are defined by the number of repeats of a triplet sequence, they differ in size at the DNA level depending on how many times the triplet is repeated. The principle behind the genotyping technique is to amplify the DNA by PCR and determine the total size of the PCR product in order to deduce the number of repeats.
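As a minimal sketch of this principle, the repeat number of an unknown allele can be deduced from its fragment size relative to a reference allele whose size and repeat number are both known. The function name and the example sizes are hypothetical, and the sketch assumes fragment sizes have already been rounded to whole base pairs.

```python
def repeats_from_size(fragment_bp: int, ref_fragment_bp: int,
                      ref_repeats: int, unit_bp: int = 3) -> int:
    """Deduce the repeat number of an allele from its PCR fragment size,
    given a reference allele with known size and repeat number."""
    delta = fragment_bp - ref_fragment_bp
    if delta % unit_bp != 0:
        # The size difference is not a whole number of triplets; the allele
        # may carry an insertion/deletion outside the repeat.
        raise ValueError("size difference is not a multiple of the repeat unit")
    return ref_repeats + delta // unit_bp


# Hypothetical locus: a 250 bp amplicon corresponds to 10 repeats.
print(repeats_from_size(256, ref_fragment_bp=250, ref_repeats=10))
print(repeats_from_size(247, ref_fragment_bp=250, ref_repeats=10))
```

Raising an error on a non-triplet size difference is a design choice for the illustration; in practice such alleles would be flagged for sequencing rather than discarded.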

For the development of our assay we chose 207 triplet repeats in or near genes with an emphasis on brain function. Potentially polymorphic triplet repeats had previously been detected bioinformatically from the human genome sequence by Dr. Aronow. The triplet repeats were either coding (SARPs) or located within a known regulatory or promoter element of a gene.


Primer design and amplification strategy for triplet repeat analysis

We used the online application ExonPrimer (http://ihg.gsf.de/ihg/ExonPrimer.html) to design PCR primers that amplify the complete repeat area, and synthesized the reverse primer as suggested by the program. The forward primer was synthesized with the addition of a 5'-tail carrying the M13(-21) sequence (TGT AAA ACG ACG GCC AGT) in preparation for the fluorescent labeling step. The complete list of triplets and primer sequences can be found in Appendix A. PCR reactions were tested using the same protocol for all primer sets. Because the promoter and regulatory region triplets are usually located in GC-rich areas of the genome (figure 1) that are notoriously difficult to amplify by PCR, we used a thermostable DNA polymerase designed for GC-rich templates and its associated buffer (Advantage GC 2 system from Clontech) as part of our standard amplification.

[Figure 1 plot: "Distribution of triplet amplicon %GC". Histogram of the frequency of amplicons (0.00 to 0.25) by %GC content range, in 5% bins from <40 to 85-90.]

Figure 1: Distribution of GC content in triplet repeats and their frequencies contributing to difficulties in amplification
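The primer tailing step and the GC content calculation behind figure 1 can be sketched in a few lines of Python. Only the M13(-21) tail sequence comes from the text; the function names and the locus-specific primer are placeholders invented for this example and do not correspond to any primer in Appendix A.

```python
M13_MINUS_21 = "TGTAAAACGACGGCCAGT"  # tail sequence given in the text


def tail_forward_primer(locus_specific: str, tail: str = M13_MINUS_21) -> str:
    """Return the forward primer with the generic tail added at its 5'-end."""
    return tail + locus_specific.upper()


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 to 1.0)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


# Placeholder locus-specific primer, made up for the illustration.
forward = "GCTGCCGAGGAGCTGTATTC"
tailed = tail_forward_primer(forward)

print(tailed)
print(f"GC content of tailed primer: {gc_fraction(tailed):.0%}")
```

The same `gc_fraction` helper applied to a full amplicon sequence would give the value binned in figure 1.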


The PCR thermocycling parameters for the amplification of triplet repeats are as follows: 94 °C for 3 min; 35 cycles of (94 °C for 30 sec, 60 °C for 30 sec, 68 °C for 1 min); 68 °C for 3 min; then hold at 4 °C. Amplification success and expected band size were evaluated by agarose gel electrophoresis (figure 2).
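For planning purposes, the thermocycling program above can be written down as data and its programmed hold time added up. This is a small sketch only: the variable and function names are assumptions, and the total ignores temperature ramping between steps as well as the final 4 °C hold.

```python
# Each step is (temperature in degrees C, hold time in seconds).
INITIAL_DENATURATION = [(94, 180)]
CYCLE = [(94, 30), (60, 30), (68, 60)]
N_CYCLES = 35
FINAL_EXTENSION = [(68, 180)]


def programmed_seconds(initial, cycle, n_cycles, final):
    """Sum of programmed hold times, excluding ramping and the 4 C hold."""
    one_cycle = sum(seconds for _, seconds in cycle)
    return (sum(seconds for _, seconds in initial)
            + n_cycles * one_cycle
            + sum(seconds for _, seconds in final))


total = programmed_seconds(INITIAL_DENATURATION, CYCLE, N_CYCLES, FINAL_EXTENSION)
print(f"{total} s, i.e. {total / 60:.0f} min of programmed hold time")
```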

Figure 2: Test amplification of 48 triplets visualized by agarose gel electrophoresis. Size standards in side lanes have a band every 100bp. Lanes marked red have failed standard amplification and need to be repeated, optimized or redesigned. Lanes marked yellow have more than one band and need to be optimized.

137 designs (66.2%) were validated after the first round of PCR with both specific amplification (single band) and appropriate fragment size, indicating the usability of the GC-rich chemistry as a standard amplification for triplet repeats.

Incorporation of fluorophore by the fluorescent adaptamer PCR technique

In order to efficiently automate fragment sizing using the 3730xl DNA Analyzer from Applied Biosystems, we needed to reduce the reaction volume to decrease reagent usage and to incorporate a fluorescent dye into the PCR products. We conducted PCR reactions in 5 µl using 384 well microplates and added to the reaction a 3rd primer, a 5'-FAM labeled M13(-21) sequence. We used equimolar amounts of the reverse (untailed) primer and the FAM labeled primer (2.5 µM final) and used 1/10 of that concentration for the forward (tailed) primer (0.25 µM final). The rationale is to generate the M13(-21) tailed specific PCR product during the first few cycles of the amplification using the low molarity forward primer, then quickly drive the amplification with the high molarity labeled primer. This produces the same effect as a 2-step nested PCR (tailed forward + reverse, then FAM tail + reverse) with the added convenience of a single tube setup. The labeling of PCR products by the fluorescent adaptamer method is illustrated in appendix B. Because the generic labeling primer can be reused in all the assays designed, only a pair of unlabeled standard primers needs to be synthesized for each assay, which greatly reduces the costs normally associated with genotyping.
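The primer ratios translate into absolute amounts per reaction as sketched below. The sketch assumes the 5 µl reaction volume and the 2.5 µM final concentration stated above, and takes the one-tenth ratio for the tailed forward primer at face value; the names are illustrative only.

```python
def pmol_in_reaction(final_conc_um: float, volume_ul: float) -> float:
    """Amount of primer in pmol: (micromol/L) x (microlitre) = pmol."""
    return final_conc_um * volume_ul


REACTION_UL = 5.0
LABELED_UM = 2.5               # FAM labeled M13(-21) primer
REVERSE_UM = 2.5               # untailed reverse primer, equimolar
FORWARD_UM = LABELED_UM / 10   # tailed forward primer at one tenth

amounts = {
    "FAM labeled M13(-21)": pmol_in_reaction(LABELED_UM, REACTION_UL),
    "reverse (untailed)": pmol_in_reaction(REVERSE_UM, REACTION_UL),
    "forward (tailed)": pmol_in_reaction(FORWARD_UM, REACTION_UL),
}

for name, pmol in amounts.items():
    print(f"{name}: {pmol:.2f} pmol per {REACTION_UL:.0f} ul reaction")
```

The tenfold excess of labeled primer over tailed primer is what lets the labeled primer take over the amplification once the tailed product has been generated.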

Injection of the undiluted PCR reaction saturated the fluorescence detector of the instrument, and a normal signal was observed only after a 100 fold dilution of the product. The PCR reaction protocol was therefore modified to use 1/10 of the initial primer amount, and a more comfortable 1/25 dilution is now sufficient to bring the fluorescent signal within the dynamic range of the instrument. Thermocycling parameters remained unchanged.

Detectability of the amplified fragments by the 3730xl DNA Analyzer was tested for the 137 amplifiable triplets in 9 unrelated laboratory control DNAs (designated as 9ULC) of varied ethnic origin (German, French, Indian, Chinese, African American, 3 US Caucasians and 1 from Canada).

The standard fragment sizing protocol for microsatellites from Applied Biosystems cannot accommodate a range of sizes beyond 400bp while maintaining a 1bp resolution between peaks. To reach accurate sizing in the 100-700bp range of our triplet repeat amplicons, we had to use a different size standard. We chose the ROX-MapMarker1000 DNA ladder from Molecular Probes which has sizing peaks up to 1kb. However, linearity of the sizing curve in this wide a range (900bp) could not be reached with standard injection recipes on the 3730xl instrument we had available to conduct our experiments.


We modified the 3730xl run parameters by lowering the run current and increasing the run time until we obtained a linear sizing curve containing all the peaks in the previously determined range of triplet amplicon sizes (100-700bp). Size matches for the ROX-MapMarker1000 DNA ladder and the size calling curve, with a linear fit of R² > 0.999, are illustrated in figure 3.

Figure 3: ROX-MapMarker1000 DNA ladder size standard (top panel) and corresponding sizing curve (bottom panel) showing size (Y axis) as a function of electrophoresis time (X axis)
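The linearity criterion for the sizing curve can be checked with an ordinary least squares fit of ladder peak size against migration time. The sketch below is written in plain Python; the peak sizes and times are made-up placeholders, not measured values from the ROX-MapMarker1000 ladder or from our instrument.

```python
def linear_fit(x, y):
    """Ordinary least squares fit y = slope * x + intercept.
    Returns (slope, intercept, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Placeholder ladder data: migration time (arbitrary units) and peak size (bp).
times = [1000, 1510, 1995, 2505, 3000, 3490, 4010]
sizes = [100, 200, 300, 400, 500, 600, 700]

slope, intercept, r2 = linear_fit(times, sizes)
print(f"size = {slope:.4f} * time + {intercept:.1f}, R^2 = {r2:.5f}")
print("linear enough for sizing" if r2 > 0.999 else "adjust run parameters")
```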

Polymorphic determination of triplet repeats using DNA of varied ethnic origin

We first tested the 137 amplifiable triplets for polymorphism in our 9ULC cohort. In order to determine the number of repeats in the alleles of the triplets we analyzed the electropherograms using the GeneMapper 3.5 software from Applied Biosystems.


40 of the 137 triplets analyzed showed more than one allele in the 9ULC DNA dataset (29%). In order to further identify rare alleles that may not be represented in this small sample, we proceeded with a 2nd batch of amplification for triplets which did not exhibit polymorphism in the 9ULC DNAs. We used the 24 sample “Polymorphism Discovery Resource” DNA set (designated as 24PDR) from the Coriell Institute, consisting of US admixed individuals with ancestry from different regions of the world (table 1). After testing an additional 46 triplets, 18 were identified as polymorphic, bringing the total number of identified polymorphic loci to 58, which met the goals set for this project.

Ethnicity            Proportion of samples
European-American    1 %
African-American     17 %
Mexican-American     39 %
Native American      5 %
Asian-American       10 %

Table 1: Ethnic distribution of the Coriell Institute Polymorphism Discovery Resource Samples

Correlation of the experimentally determined polymorphism of the selected triplets with the repeat database Satellog.

The Satellog database (Missirlis et al, 2005) is an online bioinformatic tool that can be used to retrieve repeats in the human genome that fit a user defined motif. It also uses the UniGene database of sequenced cDNA libraries and can be set to detect differences between the reference genome and the electronic transcriptome. Using this tool to output all the triplet repeats in coding regions, together with the alignments with UniGene that predict potential polymorphism of the triplet repeats, we were able to confirm in-silico 5 of the 58 polymorphic triplets we have identified. The output from the matching Satellog database query is presented in table 2.

Chr.  Start hg19    End hg19      Repeat unit  Repeat length  HUGO name  Minimum length  Maximum length
X     65,631,950    65,632,018    GCA          23             AR         15              27
12    101,854,639   101,854,675   GCA          12             ASCL1      10              14
6     161,428,639   161,428,671   CTG          11             MAP3K4     10              11
20    46,965,237    46,965,257    GCA          7              NCOA3      4               7
3     181,973,881   181,973,892   ACC          4              FXR1       3               4

Table 2: Information about the 5 triplet repeats within our 58 triplet list that are found in-silico to be polymorphic using the Satellog online tool. Minimum and maximum repeat length are based on evidence from sequenced mRNAs containing the coding repeat region

Verification of the normal meiotic origin of the observed triplet variations

Our experimental determination of polymorphism is based on DNA derived from cell lines. Repeat variations observed in those cell lines could have resulted from mitotic errors during the numerous passages that the commercial cell lines have potentially gone through. To verify the usability of our cohort, we therefore genotyped 4 polymorphic triplets (AR, MN1, MYST4, ZIC5) in a de-identified collection of DNA extracted from the saliva of 96 Caucasian individuals.

The number of alleles observed in the Caucasian cell line DNAs and the Caucasian saliva DNAs were identical even though the samples from each population are different. The allele frequencies are also very comparable, indicating that the detection of triplet repeat polymorphisms in cell lines is unlikely to be spurious (figure 4).
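The comparison described above amounts to computing allele frequencies separately in each DNA source and checking that the same alleles appear at similar frequencies. A minimal sketch follows; the genotypes are toy values invented for the illustration, not data from the saliva or cell line cohorts.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Allele frequencies from a list of diploid genotypes, each given as a
    pair of repeat numbers, e.g. (20, 21)."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: count / total for allele, count in sorted(counts.items())}


# Toy genotypes (number of repeats per allele), made up for the example.
saliva = [(20, 21), (21, 21), (21, 23), (20, 21)]
cell_line = [(21, 21), (20, 23), (21, 21), (20, 21)]

freq_saliva = allele_frequencies(saliva)
freq_cell = allele_frequencies(cell_line)

print("same alleles observed:", set(freq_saliva) == set(freq_cell))
for allele in sorted(set(freq_saliva) | set(freq_cell)):
    print(allele, freq_saliva.get(allele, 0.0), freq_cell.get(allele, 0.0))
```

With real cohorts of different sizes, a formal test of homogeneity would be preferable to visual comparison; that step is outside the scope of this sketch.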


[Figure 4 plot: "Correlation between Caucasian allele frequencies in cell line DNA and DNA extracted from saliva". Repeat frequency (0 to 1) as a function of the number of repeats, with one saliva series and one cell line series for each of ATXN1 (TGC), AR (GCA), CACNA1A (CTG) and MYST4 (GAG).]

Figure 4: Comparison of repeat frequency in saliva DNA and cell line DNA for 4 different triplet loci.

Validation of assayed alleles through sequencing

The accuracy of allele calling using our fluorescent adaptamer PCR labeling technique was tested by sequencing at least the 2 most common alleles in 24 triplets showing polymorphism. We sequenced more alleles when the sizing data from the 3730xl DNA Analyzer appeared to mix 2 alleles together. This may be due to a degree of polymorphism beyond a simple difference in the number of repeats, for example the presence of an imperfect repeat within the allele. We used the software Mutation Surveyor to resolve the number of repeats in the triplet alleles from the sequence data and compared it with the sizes of the fragments obtained in the genotyping data. The Mutation Surveyor software uses a patented anti-correlation technology to provide highly accurate discovery of DNA variants from Sanger sequencing traces. It rapidly locates SNPs and insertion/deletions between reference traces and sample traces with excellent accuracy and sensitivity.

Results from the comparison of genotyping and sequencing for the determination of triplet alleles are summarized in tables 3 and 4. For each triplet, table 3 shows the motifs repeated at the protein and DNA levels, the allele that appears in the human genome sequence data (build hg19) and whether or not evidence of polymorphism in the repeat area is known in the human SNP database (dbSNP 130). 6 of the 24 triplets tested had no evidence of polymorphism previously reported (25%). Table 4 lists the samples sequenced for each triplet and their genotype determined by the fluorescent adaptamer PCR labeling technique in arbitrary allele numbers, as well as the number of repeats detected by sequencing and the triplet repeat size in base pairs. The methods match when the size and the number of repeats are in a linear relationship with a 3 fold factor (size = 3 * repeats + constant). Samples with letter codes refer to one of the 9ULC DNAs from our laboratory and samples with number codes refer to one of the 24PDR DNAs from the Coriell Institute. We also indicate the nature of repeats observed for each allele: regular (R) when the same codon is repeated, imperfect (I) when multiple codons are found within the repeat or complex (C) in other cases. Difficulties encountered in the sequencing are noted in the comments column. The table also indicates whether the Mutation Surveyor software was able to deconvolute the mixed sequence of the polymorphism into individual alleles. In 3 out of 24 triplets sequenced (13%) the software was not able to deconvolute all the alleles observed. The direction of the sequencing used for the deconvolution is indicated in the column 'Dir'. Because every sample was sequenced bidirectionally, efforts were made to use the clearest read direction (F for forward, R for reverse) in each case.
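The match criterion (size = 3 * repeats + constant) can be expressed as a short check: for all alleles of one locus, the quantity size minus 3 times repeats must be the same constant. In the sketch below the function name is an assumption, the MAP3K4 values are the repeat numbers and sizes listed in table 4, and the discordant locus is hypothetical.

```python
def is_concordant(alleles, unit_bp=3):
    """alleles: list of (repeats, size_bp) pairs for one locus.
    True when size - unit_bp * repeats is the same constant for all alleles."""
    offsets = {size_bp - unit_bp * repeats for repeats, size_bp in alleles}
    return len(offsets) == 1


# MAP3K4 alleles as listed in table 4: (repeats by sequencing, size in bp).
map3k4 = [(9, 27), (10, 30), (11, 33)]

# Hypothetical locus where one allele is off by a single base.
hypothetical = [(10, 30), (11, 32)]

print("MAP3K4 concordant:", is_concordant(map3k4))
print("hypothetical locus concordant:", is_concordant(hypothetical))
```

Interrupted or complex repeats (reported as "15/13" or "n/a" in table 4) do not fit this simple pairing and would have to be handled separately.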


Triplet gene  Protein motif     DNA motif                   Variants reported in dbSNP130  Allele appearing in hg19
MAP3K4        Poly A            CTG                         Yes                            A10
NUMBL         Poly Q            GTC/GTT                     Yes                            Q20
CACNA1A       Poly H            GTG                         No                             H10
E2F4          Poly S            AGC                         Yes                            S13
ATXN2         Poly Q            GTC/GTT                     Yes                            Q23
BIK           Poly LLALLL*      CTG/CTC                     Yes                            LLALLL2
FOXE1         Poly A            GCG/GCT/GCC                 Yes                            A16
IRS2          Poly A            GGC                         Yes                            A8
ARID3B        Poly Q            CAG                         No                             Q11
EMX2(p)       n/a**             Poly CTCTTCTTCCTCCTCCTC*    Yes                            20 rep
ACVR2A(p)     n/a**             GCC                         Yes                            7 rep
ATXN1         PolyQ HQH PolyQ   CAG                         Yes                            Q14 HQH Q12
EMX2(p)       n/a**             ATA                         Yes                            5 rep
MN1           Poly Q            CAG/CAA                     Yes                            Q28
NRXN1         Poly G            GGC                         No                             G11
NCOA3         Poly Q            CAG/CAA                     Yes                            Q29
NAP1L3        del GSGSSSS       AGC/GGC                     No                             2 rep
MYT1L         Poly E            GAA/GAG                     No                             E16
AR            Poly Q            CAG/CAA                     Yes                            Q23
ASCL1         Poly Q            CAG                         Yes                            Q12
DLX6          Poly Q            CAG/CAA                     Yes                            Q6 ***
MYST4         PolyE D PolyE     GAG/GAA                     Yes                            E12 D E14
HOXD13        Poly A            GCG/GCA/GCT/GCC             Yes                            A15
MEF2C(p)**    n/a               del TCTTCCTCCTCCT*          No                             2 rep

Table 3: List of 24 triplet loci for which the sequencing method and genotyping method were compared. The allele appearing in build 19 of the human genome has been translated when relevant to compare to the protein level motif. (*) No sample with other allele available for verification of motif. (**) Promoter triplet, no associated protein motif. (***) Proximity of exon intron boundary makes determination uncertain, imperfect repeats possible.


Triplet    Sample  Genotype       Allelic        Dir  Repeat  Comparison of genotyping and    Comments
gene       tested  (allele size)  deconvolution       type    sequencing (number of repeats)
                                  by Mutation                 allele    repeats   size (bp)
                                  Surveyor
MAP3K4     B       1-1            ref            R    R       1         9         27
MAP3K4     M       1-2            Y              R    R       2         10        30
MAP3K4     14      1-3            Y              R    R       3         11        33
NUMBL      N       1-1            ref            F    I       1         18        54
NUMBL      Z       2-2            Y              F    I       2         20        60
CACNA1A    21      1-1            ref            F    R       1         10        30          non polymorphic
E2F4       M       3-3            ref            F    R       3         13        39
ATXN2      20      3-3            ref            R    I       3         22        66
ATXN2      23      3-4            Y              R    I       4         23        69
BIK        20      2-2            ref            F    C       2         n/a*      n/a
BIK        21      2-3            N              F    C       3         n/a       n/a
FOXE1      J       2-2            ref            F    I       2         14        42
FOXE1      D       2-3            Y              F    I       3         16        48
IRS2       D       2-2            ref            R    R       2         8         24
IRS2       13      1-2            Y              R    R       1         7         21
ARID3B     19      1-1            ref            R    R       1         9         27
EMX2(p)    M       2-2            ref            R    C       2         20        60
EMX2(p)    13      1-2            Y              R    C       1         14        42          looks like 6 rep polymorphism
ACVR2A(p)  8       1-1            ref            F    R       1         7         21
ACVR2A(p)  6       1-2            Y              F    R       2         8         24
ACVR2A(p)  13      1-6            N**            F    R       6         12        36
ACVR2A(p)  24      2-4            Y**            F    R       4         10        30
ACVR2A(p)  K       3-3            Y              F    R       3         9         27
ACVR2A(p)  5       3-5            N**            F    R       5         11        33
ATXN1      J       5-5            ref            F    R       5         15/13     93          2 interrupted repeats, each polymorphic
ATXN1      7       3-5            N              F    R       3         ?         ?           sequencing cannot determine exact repeat pattern
ATXN1      Z       3-9            N              F    R       9         ?         ?           cloning needed to determine exact pattern
ATXN1      S       4-5            N              F    R       4         ?         ?           sequencing confirms only region of expansion
ATXN1      N       3-7            N              F    R       7         ?         ?           deciphering allelic pattern is challenging
ATXN1      17      5-6            Y              F    R       6         15/14     90
ATXN1      5       1-6            N              F    R       1         ?         ?           deciphering allelic pattern is challenging
ATXN1      18      0-5            N              F    R       0         ?         ?           deciphering allelic pattern is challenging
EMX2(p)    B       1-1            ref            R    R       1         4         12
EMX2(p)    21      1-2            Y              R    R       2         5         15
MN1        J       3-3            ref            R    I       3         28        84
MN1        K       3-4            Y              R    I       4         29        87
NRXN1      S       2-2            ref            F    R       1         10        30
NRXN1      B       1-2            Y              F    R       2         11        32
NCOA3      B       6-6            ref            F    I       6         29        87
NCOA3      K       5-6            Y              F    I       5         28        84
NCOA3      S       5-5            Y              F    I
NAP1L3     M       2-2            ref            R    C       2         n/a*      n/a
NAP1L3     20      1-2            Y              R    C       1         n/a       n/a
NAP1L3     7       1-1            Y              R    C
MYT1L      Z       2-2            ref            F    I       2         16        48
MYT1L      18      1-2            Y              F    I       1         14        42
AR         20      1-1            ref            F    I       1         19        57
AR         21      1-6            N              F    I       6         25        72
AR         22      6-6            N              F    I
ASCL1      17      4-4            ref            R    R       1         4         12
ASCL1      18      4-5            Y              R    R       2         5         15
DLX6       20      1-1            ref            F    R       1         6         18          Proximity of exon intron boundary makes determination uncertain, imperfect repeats possible
DLX6       22      1-9            Y              F    R       9         14        42
MYST4      Z       2-2            ref            F    R       1         12/13     78          2 interrupted repeats, each polymorphic
MYST4      B       1-2            Y              F    R       2         12/12     75
MYST4      M       0-2            Y              F    R       0         8/12      63
HOXD13     B       2-2            ref            F    I       2         15        45          The variety of codon usage and imperfections in the repeat minimally affect size determination
HOXD13     18      1-2            Y              F    I       1         12        36
MEF2C(p)   N       2-2            ref            R    C       2         n/a*      n/a         Common poly-C variant near repeat significantly affects size determination
MEF2C(p)   M       1-2            Y              R    C       1         n/a       n/a

Table 4: List of samples for which the sequencing and genotyping methods were compared. (*) Complex motifs cannot be expressed in simple repeat counts and size. (**) Depending on reference chosen, inexact alleles are scored in terms of repeat limits; some alleles can be called accurately after adjusting software parameters.


Analysis of sequencing and genotyping results

In 21 out of the 24 triplet repeats examined (87.5%), the number of repeats determined by genotyping exactly matched the sequencing result in terms of how many times the amino acid is repeated in the gene product. In 2 of the 3 cases where genotyping does not show the whole picture, the sequencing pattern is so complex that manual interpretation remains uncertain even with very high quality reads and the assistance of insertion/deletion deconvolution software (triplets ATXN1 and MYST4). The 3rd case (triplet BIK) is a complex triplet repeat-like motif utilizing several codons, for which the genotyping method can determine the number of repeats but cannot tell whether alternative codons have been used. From the 2 samples sequenced it looks like a regular poly LLALLL repeat, but that is hardly enough data to reach a definitive conclusion. The benefits of genotyping are the lower cost, the faster throughput and the ease of analysis provided by existing fragment sizing software. The advantage of sequencing over genotyping is the ability to determine the pattern of imperfect repeats. The difference in ease of interpretation between sequencing and genotyping data is illustrated in figure 5, using a case where the sequence reads are of optimal quality. The way Mutation Surveyor deconvolutes each allele in the DNA sequencing traces of a heterozygote individual is illustrated in figure 6. While this process is computer assisted, it is hardly suitable for high throughput data generation.

Figure 5: DNA sequencing (top) and genotyping (left) of 3 individuals at the MAP4K4 triplet repeat locus. The middle lane is a homozygote individual, as indicated by the single pattern of peaks in the sequencing trace and a single peak in the genotyping trace. The upper and lower individuals are heterozygotes, as indicated by the doublet pattern of sequencing peaks starting at base 65 and the presence of 2 peaks in the genotyping trace. Because the doublet base pattern starts at the same position in the sequencing traces of both heterozygote individuals, we can tell that they have an allele in common, but it is not straightforward to determine what the alleles are. All we can quickly say is that they have different secondary alleles, since their patterns differ at base 111 for example. On the other hand, the genotyping trace clearly shows that the upper individual has a secondary allele 6 bases larger than the primary allele (2 more repeats) and that the lower individual has a secondary allele 3 bases larger than the primary allele (1 more repeat).
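The conversion from fragment size to repeat number used in the genotyping reading above can be sketched in a few lines. This is only an illustration: the function name and the fragment sizes in the usage example are hypothetical, and the only fact taken from the text is that one triplet repeat unit corresponds to 3 bases.

```python
def repeat_difference(primary_bp: int, secondary_bp: int, motif_len: int = 3) -> int:
    """Return how many repeat units separate two alleles sized in base pairs.

    A size difference that is not a multiple of the motif length cannot be
    explained by whole repeat units alone, so it is reported as an error.
    """
    delta = secondary_bp - primary_bp
    if delta % motif_len != 0:
        raise ValueError(f"size difference of {delta} bp is not a whole number of repeats")
    return delta // motif_len


# Hypothetical fragment sizes: a secondary allele 6 bp larger is 2 repeats longer.
print(repeat_difference(150, 156))  # 2
print(repeat_difference(150, 153))  # 1
```

A difference that is not a multiple of 3, such as the 1 to 2 bp shifts discussed later for MEF2C(p) and HOXD13, would raise an error here and would need to be examined by sequencing.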

Figure 6: Deconvolution by Mutation Surveyor of the sequencing trace of a heterozygote individual into each allele (triplet locus MAP4K4). The Reference panel is the trace of a homozygote reference sample, the Sample panel is the trace of the individual to deconvolute, the Conserved panel is the part of the sample trace identical to the reference trace, the Mutant panel is the graphical subtraction of the Reference trace from the Sample trace and the Shifted panel is the realignment of the Mutant trace to the Reference trace. In this case the Shifted trace realigns to the Reference trace by inserting 2 repeat motifs (CAG) indicating that the individual tested has a reference allele and another allele with 2 additional repeats.


Figure 7 illustrates the fact that while genotyping still determines the number of amino acids repeated in the gene product only the sequencing can identify the imperfect nature of the repeats at the DNA level and how complex motifs can sometimes appear.

Figure 7: Sequencing trace (top) and genotyping trace (left) of a heterozygote individual at the BIK triplet repeat locus. While the genotyping method indicates that the 2 alleles are 18bp apart (6 repeats), the sequencing shows that they are imperfect repeats coding for 2 amino acids (grey underline L codon, no underline A codon). However, the fact that only alleles differing by 18bp have been observed supports the possibility of a complex pattern (LLALLL repeat, yellow underline) in which case the genotyping method would give an accurate representation of the number of repeats.

Further analysis of the 2 cases in which neither the genotyping approach nor the sequencing approach can easily determine the final amino acid composition of the gene product revealed a situation involving pairs of repeat polymorphisms separated by a few nucleotides. The ATXN1 and MYST4 proteins have 2 adjacent chains of single amino acid repeats separated by a different amino acid, and each chain has a variable and independent length. While the genotyping approach will still determine the sum of the sizes of both chains, the number of repeats in each chain is not accessible by this method. It is worth noting, however, that in these cases the sequencing pattern of 2 adjacent size polymorphisms in a single read is so complex that the interpretation is not definite in every case. Only the cloning of the PCR products followed by sequencing of each allele independently can unambiguously establish the exact amino acid composition of the gene product. The cloning approach is highly impractical in any situation where multiple samples have to be analyzed, as in the case of association studies. Figure 8 shows the complexity of the sequencing pattern for the pairs of ATXN1 single amino acid repeats.

Figure 8: Sequencing trace of 8 heterozygote individuals for the ATXN1 triplet repeat locus. The presence of 2 adjacent polymorphic TGC chains makes it impossible to resolve individual alleles accurately.


While they did not affect the performance of the genotyping method, some unexpected patterns arose in the genotyping electropherograms that are worth mentioning. The same number of repeats, as determined by sequencing, gave rise to 2 peaks differing by up to 2bp in length. Analysis of the sequence beyond the triplet repeat area revealed that another polymorphism was present nearby and affected the migration of the main alleles. This was caused either by a poly-C polymorphism (triplet MEF2C(p), figure 9) with a discrete effect on sizing (within 2bp) or by a variable utilization of codons for the same amino acid with a continuous effect on size (within 1bp) depending on codon usage (triplet HOXD13, figure 10).

Figure 9: The presence of a 3-allele polymorphic C stretch in the PCR product for the MEF2C promoter triplet repeat influences the migration of a triplet allele peak. Top panel is a homozygote individual for the triplet repeat but heterozygote for the polyC with 8 and 9 repeats. Bottom panel is homozygote for the triplet repeat with the same allele as the top panel individual but heterozygote with 9 and 10 repeats for the polyC. The same triplet repeat allele can take 3 different migration positions each differing by 1 bp depending on the number of repeats at the polyC locus.


Figure 10: Electropherogram overlay of 12 individuals (24 alleles) homozygote for the 15 repeat allele of the HOXD13 triplet repeat locus. The same number of repeats produces different migration speeds for the genotyping peak, within 1bp. Analysis of the sequencing (inset panel) reveals up to 4 different codons in use (GCG black overlay, GCA green overlay, GCT red overlay and GCC blue overlay), all encoding the same amino acid, alanine. Variable use of the codons within the 15 repeats produces different DNA compositions for the PCR products, slightly influencing their migration during electrophoresis.

Another interesting observation made during the analysis of correlation between genotyping and sequencing is the case of DLX6. The poly glutamine (Q) polymorphism is encoded by a repetition of the imperfect CAG/CAA repeat at the very end of an exon. The next amino acid is His, encoded by the codon CAC, with the 1st and 2nd bases of the codon belonging to the same exon as the repeat but the 3rd base belonging to the next exon (figure 11). This special case makes it difficult to determine exactly where the splicing machinery is going to act, and the actual number of repeated glutamines could potentially be affected by splicing. Alternatively, it could mean that no matter how many repeats are found at the DNA level, the splicing machinery has a preference for a set number of repeats that will be used in the final mRNA product.

Another interesting observation at this locus is the presence of extensive polymorphism. 2 SNPs have been reported in the splice donor site of the triplet exon and the number of repeat alleles we observed in 66 is 10 with a number of repeats varying from 3 to 14. Yet the 1000 genomes project reports no variation in this region in 16 chromosomes (4 Asian, 4 African and 8 European).

Figure 11: Translation and variations known around the DLX6 triplet repeat locus. The exon containing the repeat ends with the CA part of a CAC codon spanning 2 adjacent exons. The database shows a single DLX6 isoform, yet more glutamine-encoding repeats (CAG or CAA) could be translated and still produce a CA at the end of the donor exon (red arrow).

Using the technique beyond triplet repeats

It is interesting to note that the applicability of the fluorescent adaptamer technique does not stop at the study of triplet repeat polymorphisms. The technique could be used to genotype other types of repeat polymorphisms, such as dinucleotide and tetranucleotide repeats, other types of microsatellite loci and variable number of tandem repeat (VNTR) loci. In fact, the method would probably be applicable to any size polymorphism, not necessarily a repeat, as long as PCR primers can be designed spanning the variable region and the product can be separated by capillary electrophoresis. Because the DNA Analyzers used for this technique have a 1 base resolution all the way to about 1 kilobase, a wide range of size polymorphisms could conceivably be genotyped in that manner.

In addition, further cost savings can be obtained by using multiple fluorophores and PCR products that do not overlap in size. One could design a panel of loci that can be simultaneously amplified and separated. This would be done by extending the method to include additional tail sequences, one for each fluorophore, and running multiplexed PCRs. The process would probably require optimization and be restricted to specific panels of loci that are suitable for multiplex amplification; however, the cost savings would be proportional to the number of loci pooled in a single reaction.

Cost considerations

The cost savings of the fluorescent adaptamer method compared to the traditional specifically labeled primer approach lie in the initial investment in oligonucleotide synthesis. The cost of an HPLC purified 6-FAM labeled forward specific primer is around $135, compared to $14 for a tailed unlabeled specific primer. In this project, a conservative approach using the traditional method would have been to synthesize 207 conventional primer pairs and only synthesize labeled primers for the 137 loci that were amplifiable under standard PCR conditions. This would have resulted in over $18,000 of additional expense which represents, considering the $11,000 total reagent costs of the project, almost tripling the budget (2.9x exactly). In conclusion, the low cost genotyping approach is, in addition to being a more straightforward alternative to sequencing, a more economical approach even as a first line approach.

Sensitivity and specificity considerations

While the number of cases examined is too low to give an accurate measure of the fluorescent adaptamer method's specificity and sensitivity in terms of percentage values, we can comment on the value of the method in qualitative terms. If the ability to detect imperfect repeats is not taken into account, since it is a known limitation of the method, the observed sensitivity (in 22 cases out of 24 the method returns a result) is estimated to be high. In fact, in the 2 cases where the genotyping method gives incomplete information, using the sequencing method would not have led to a satisfactory result because the pattern is too complex to interpret. So we can say that the fluorescent adaptamer method is as sensitive as the sequencing method currently considered the gold standard. In terms of specificity, however, sequencing is perfect, whereas the fluorescent adaptamer method, while still highly specific (21 out of the 22 reported cases are exact), will in a small percentage of occurrences not return the expected result and will not hint at the possibility of an imperfect repeat or non triplet repetition, for example when the motif contains several similar codons but is in fact a small segmental duplication.


D. SPECIFIC AIM 2

To determine the degree of linkage disequilibrium between SARP alleles and neighboring SNPs

The search for genes involved in human diseases is usually done by genome wide association studies (GWAS) in which half a million or more SNPs are genotyped in hundreds of affected individuals and compared to an equivalent number of healthy controls. The principle behind GWAS is that the specific SNPs chosen for genotyping are the results of the HapMap project (tagSNPs) and carry information about entire haplotypes inherited within a population as blocks of SNPs within which recombination rarely occurs. Therefore, an association between a tagSNP and disease indicates that a genetic locus related to the disease is contained within the haplotype block tagged by the tagSNP. Because triplet repeats have a much higher mutation rate than SNPs, a specific triplet allele that originates in a haplotype block can change, by expanding or contracting, generation after generation without the haplotype block ever recombining or changing. The triplet allele would then no longer be related to the tagSNPs representing the block in which the original triplet allele occurred.

We propose to study the relationship between SNP alleles and SARP alleles in order to determine whether the important phenotypes potentially encoded in SARPs can be detected by genome wide association studies. The purpose is to establish the degree of correlation between SARP alleles and HapMap tagSNPs in order to answer the question of whether the triplet alleles need to be assessed separately or if they are captured using the current state of the art whole genome SNP genotyping technologies.

Haplotypes are combinations of alleles that are located closely together on the same chromosome and that tend to be inherited together. A haplotype block is defined by islands of lower meiotic recombination between those loci that are close together, and blocks are separated from each other by recombination hotspots. It has been shown that most of the haplotype structure (allele combination) in a particular chromosomal region can be captured by genotyping a smaller number of markers (e.g. 'tagSNPs') than all of those that constitute the haplotypes. Because of their statistical association with these 'tagSNPs', the genotypes of the untyped markers can be inferred from them. Our goal was to investigate whether currently used SNP microarrays with dense genome wide coverage are also capable of capturing SARP genotypes under the same assumption under which they capture SNPs that are not typed.

We obtained DNA from the Coriell Institute from 249 neurologically normal subjects in order to determine triplet repeat genotypes. For each of these samples, genotypes from the Illumina HumanHap550 array had been previously determined and were downloaded from the Coriell website. For each of the 249 subjects, 58 triplet repeat genotypes were determined when possible, since the amount of DNA available for genotyping was limited to only 100 to 150 ng per subject depending on the sample.

The haplotype block structure that encompasses the location of every SARP was first determined. We then used all the SNP genotypes in the haplotype block and included the SARP genotypes to investigate on which haplotype each SARP allele occurs. We then determined the strength of association of each SARP allele with specific haplotypes. Only if there were a strong correlation of a specific SARP allele with one haplotype would it be possible to use standard chip technology to investigate disease associations with SARPs without directly genotyping them.


High throughput triplet genotyping protocol

To reduce costs we wanted to minimize the reaction size as much as possible. Reactions were conducted in 5ul volume on 384 well plates using a Tecan FreedomEvo200 robotic laboratory workstation equipped with a 384 well TeMo multichannel pipettor. We kept the same PCR basic chemistry (Clontech Advantage GC2 kit) for its low cost and high performance within a wide range of GC content templates. Starting from the manufacturer's suggested recipe, we then tried to lower the enzyme amount used in the PCR while still maintaining an acceptable rate of the triplet genotype calling. The high PCR yields observed previously were reduced by the smaller reaction size and lower enzyme concentration so the need for dilution of amplicons before injection was reduced. The final amplification protocol followed the same thermocycling parameters and minimized primer concentrations as described in experiments for aim 1 but involved drying 2-3ng of DNA and adding 5ul of the 1x reaction mix containing 100mM final GC melt solution, 200mM final of each dNTP and half the recommended amount of GC polymerase (0.05ul) in 1x final PCR buffer.

The electropherograms were scored using the GeneMapper 3.5 software from Applied Biosystems. Genotyping was attempted for a total of 58 triplets using the 249 Coriell DNAs before the DNA stocks were exhausted. The triplets, the number of genotypes obtained and their genome locations are listed in table 5. After discarding 16 triplets for absence of polymorphism in the cohort, weak signal or irresolvable peak pattern using the reduced size PCR protocol, a total of 42 triplets were available for haplotype analysis. The number of triplet genotypes available ranges from 146 to 241 with a median of 219. In all cases there were enough alleles to accurately reconstruct the SNP haplotypes.


| # | Triplet name | Genotypes | Notes | hg18 position |
|---|---|---|---|---|
| 1 | ACVR2A | 215 | | chr2:148318459 |
| 2 | MN1 | 168 | | chr22:26524921 |
| 3 | SP8 | 0 | Weak signal | chr7:20791481 |
| 4 | AR | 233 | | chrX:66681935 |
| 5 | NUMBL | 235 | | chr19:45865732 |
| 6 | TBP | 211 | | chr6:170712975 |
| 7 | ASCL1 | 234 | | chr12:101876310 |
| 8 | NRXN1 | 0 | Could not resolve pattern | chr2:50427528 |
| 9 | THAP11 | 146 | | chr16:66434310 |
| 10 | ATXN1 | 226 | | chr6:16435889 |
| 11 | NCOA3 | 161 | | chr20:45713265 |
| 12 | TNRC5 | 226 | | chr6:43005349 |
| 13 | ARID3B | 163 | | chr15:72623362 |
| 14 | PAX7 | 219 | | chr1:18829900 |
| 15 | TNRC6A | 238 | | chr16:24695917 |
| 16 | ATXN2 | 0 | Weak signal | chr12:110521171 |
| 17 | RUNX2 | 0 | Could not resolve pattern | chr6:45498487 |
| 18 | CACNA1A_2 | 240 | | chr19:13179692 |
| 19 | BIK | 236 | | chr22:41855201 |
| 20 | RAI1 | 167 | | chr17:17637839 |
| 21 | DLX6 | 0 | Weak signal | chr7:96473497 |
| 22 | PTPRZ1 | 0 | DNA ran out | chr7:121440613 |
| 23 | CDH4 | 220 | | chr20:59260886 |
| 24 | PPGB | 0 | DNA ran out | chr20:43953657 |
| 25 | CACNA1A | 235* | Non polymorphic | chr19:13180708 |
| 26 | ZIC5 | 0 | DNA ran out | chr13:99420683 |
| 27 | EMX2 | 216 | | chr10:119291891 |
| 28 | SLC12A2 | 218 | | chr5:127447848 |
| 29 | EMX2 | 229 | | chr10:119291132 |
| 30 | SOS1 | 241 | | chr2:39201632 |
| 31 | E2F4 | 240 | | chr16:65787323 |
| 32 | TNRC4 | 200 | | chr1:149945352 |
| 33 | DLX6 | 212 | | chr7:96473329 |
| 34 | YY1 | 221 | | chr14:99775558 |
| 35 | FOXP2 | 186 | | chr7:114058831 |
| 36 | SOCS7 | 214* | Non polymorphic | chr17:33762201 |
| 37 | FOXE1 | 227 | | chr9:99656535 |
| 38 | ATN1 | 198 | | chr12:6916170 |
| 39 | KCNMA1 | 220* | Non polymorphic | chr10:79067372 |
| 40 | ATXN3 | 197 | | chr14:91607128 |
| 41 | IRS2 | 237 | | chr13:109234310 |
| 42 | BCL6B | 219 | | chr17:6868758 |
| 43 | HOXD13 | 219* | Non polymorphic | chr2:176666052 |
| 44 | BTBD2 | 0 | Could not resolve pattern | chr19:1966560 |
| 45 | MEF2C | 189 | | chr5:88214904 |
| 46 | CNKSR2 | 213 | | chrX:21302652 |
| 47 | MAP3K4 | 213 | | chr6:161439360 |
| 48 | DCHS1 | 241 | | chr11:6619334 |
| 49 | LRP8 | 0 | DNA ran out | chr1:53566117 |
| 50 | EOMES | 240 | | chr3:27738426 |
| 51 | NAP1L3 | 170 | | chrX:92814786 |
| 52 | MEF2A | 208 | | chr15:98070249 |
| 53 | MYT1L | 0 | Could not resolve pattern | chr2:1925840 |
| 54 | NKD2 | 231 | | chr5:1091451 |
| 55 | MYST4 | 172 | | chr10:76451872 |
| 56 | POU4F2 | 229 | | chr4:147779925 |
| 57 | MN1 | 234* | Non polymorphic | chr22:26525628 |
| 58 | SNAPC4 | 238 | | chr9:138397847 |

Table 5: List of 58 triplet loci attempted for genotyping on the Coriell Institute set of neurologically normal individuals. (*) Genotypes not informative for haplotype analysis.

Data structure to efficiently store millions of biallelic SNP genotypes

The Illumina HumanHap550 array dataset from the Coriell Institute neurologically normal control cohort contains about 136,950,000 genotypes in 26 files totaling 574MB. A simple and often used data structure for representing diploid genotypes is the 4-tuple (individual, marker, allele 1, allele 2), corresponding to one row per genotype in terms of relational database storage. For large datasets the redundancy of this structure leads to reduced data retrieval performance or high memory usage due to the size of the individual and marker index tables. In the context of biallelic SNP genotyping produced by whole genome genotyping microarrays like the Illumina HumanHap550, it is possible to optimize the data structure for maximal retrieval performance and minimal memory usage. Biallelic SNP genotypes can only take 4 possible values (AA, AB, BB and missing), so each genotype can be represented by a compact 4 bit encoding (strictly, 2 bits are sufficient to distinguish 4 values). Storing the complete series of 4 bit codes corresponding to all the genotypes produced by the genotyping arrays next to each other in a binary vector yields a very memory efficient storage method. The genotypes in the vector are ordered according to the order of the corresponding markers, which is stored as an index in a different structure.
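The packed genotype vector described above can be sketched as follows. This is a minimal illustration and not the Oracle implementation used in the project: the numeric codes assigned to AA, AB, BB and missing, and the function names `pack_genotypes` and `get_genotype`, are assumptions made for the example. Only the 4 bits per genotype and the positional ordering by marker index come from the text.

```python
# Assumed 4-bit codes; the text does not specify the actual values used.
CODES = {"AA": 0, "AB": 1, "BB": 2, None: 3}  # None stands for a missing call
DECODE = {code: call for call, code in CODES.items()}


def pack_genotypes(genotypes):
    """Pack genotype calls into bytes, two 4-bit codes per byte."""
    buf = bytearray((len(genotypes) + 1) // 2)
    for i, call in enumerate(genotypes):
        code = CODES[call]
        if i % 2 == 0:
            buf[i // 2] |= code << 4   # even positions use the high nibble
        else:
            buf[i // 2] |= code        # odd positions use the low nibble
    return bytes(buf)


def get_genotype(buf, i):
    """Return the call stored at marker position i of a packed vector."""
    byte = buf[i // 2]
    code = (byte >> 4) if i % 2 == 0 else (byte & 0x0F)
    return DECODE[code]


calls = ["AA", "AB", "BB", None, "AB"]
packed = pack_genotypes(calls)
print(len(packed))                                       # 3
print([get_genotype(packed, i) for i in range(len(calls))])
```

At 4 bits per genotype, 136,950,000 genotypes occupy 68,475,000 bytes. The number of valid positions has to be tracked separately (here by the marker index), because the unused nibble of the last byte decodes like a real call.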

We have implemented this storage method in the Oracle 11g database using the BLOB data type for the genotype vector. In addition to the genotype order, the marker index table also contains the actual DNA base letter for each encoded allele (A or B) on the top strand of the reference genome, and the chromosome and map position of the SNP. Triplet genotypes were stored conventionally as 4-tuples since their range of values is quite variable.

A web based interface was developed to perform all the data management and data analysis tasks for the project in a GUI driven fashion.

In order to determine the possibility to detect triplet alleles using neighboring SNP haplotypes the following data analysis process was adopted:

1. Upload the triplet genotypes to the database.
2. Determine the haplotype block structure around the triplet position by analyzing the SNP genotypes with the Haploview software.
3. Reconstruct the SNP only haplotypes inside the block containing the triplet with the PHASE software.
4. Reconstruct the SNP and triplet allele haplotypes inside the block containing the triplet, again with PHASE.
5. Compare the PHASE haplotype reconstructions with and without triplet data and interpret the results using the Wn statistic.


Determination of haplotype block around the SARP location

We used HAPLOVIEW 4.2 (Barrett et al, 2005) to determine the haplotype block that encompasses the location of each SARP and the SNPs from the Illumina HumanHap550 array that were necessary to identify the haplotypes that contain the different SARP alleles.

HAPLOVIEW 4.2 is a software package developed at the Broad Institute of MIT and Harvard. It is designed to facilitate the process of haplotype analysis by providing a common interface to several tasks relating to such analyses. In this study we use it to visualize the degree of linkage disequilibrium between SNPs in order to identify haplotype blocks in which recombination is minimal within our study population. Out of the 4 methods supported by Haploview to determine haplotype blocks we chose the "confidence intervals" method (Gabriel et al, 2002). 95% confidence bounds on D' are generated and each comparison is called "strong LD", "inconclusive" or "strong recombination". A block is created if 95% of informative (i.e. non-inconclusive) comparisons are "strong LD". This definition allows for many overlapping blocks to be valid. The default behavior is to sort the list of all possible blocks, start with the largest and keep adding blocks as long as they don't overlap with an already declared block. Empirical testing determined that the confidence intervals method tends to produce the shortest haplotype blocks, in which all SNPs are in strong LD with each other. This behavior suited our analysis because our aim was to determine whether haplotypes could predict triplet alleles, so the most accurate haplotype determination with SNPs in high LD was the most relevant. In cases where the confidence intervals method failed to produce haplotypes including the triplet we chose the "Solid Spine of LD" method (de Bakker et al, 2005). This method searches for a "spine" of strong LD running from one marker to another along the legs of the triangle in the LD chart. This means that the first and last markers in a block are in strong LD with all intermediate markers but that the intermediate markers are not necessarily in LD with each other. Empirical testing determined that the solid spine of LD method tends to produce long haplotype blocks in which not all the SNPs are in strong LD. It had the advantage of giving a basis for the calculation of haplotypes in cases where the triplet region was subject to high recombination rates in our study population.
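The block rule of the confidence intervals method described above can be illustrated with a short sketch. This is not Haploview's code. The 95% threshold on informative comparisons comes from the text, but the numeric cut-offs used to call a pair "strong LD" or "strong recombination" (0.70, 0.98 and 0.90) are assumptions for the example and are not stated in this document; the function names are hypothetical.

```python
def classify_pair(ci_low, ci_high, strong_low=0.70, strong_high=0.98, recomb_high=0.90):
    """Call one SNP pair from the 95% confidence bounds on D'.

    The default cut-offs are illustrative assumptions, not values from the text.
    """
    if ci_low >= strong_low and ci_high >= strong_high:
        return "strong LD"
    if ci_high < recomb_high:
        return "strong recombination"
    return "inconclusive"


def is_block(pair_intervals, min_fraction=0.95):
    """Return True if enough informative pairs in a candidate block are in strong LD."""
    calls = [classify_pair(low, high) for low, high in pair_intervals]
    informative = [c for c in calls if c != "inconclusive"]
    if not informative:
        return False  # nothing informative, so no evidence for a block
    strong = sum(c == "strong LD" for c in informative)
    return strong / len(informative) >= min_fraction


# Hypothetical confidence bounds for the pairwise comparisons of one candidate block.
candidate = [(0.80, 0.99)] * 20 + [(0.30, 0.95)] * 3
print(is_block(candidate))  # True: the 3 inconclusive pairs are ignored
```

Inconclusive pairs are dropped before the fraction is computed, which is why a block can be declared even when some comparisons carry little information.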

In cases when no method would produce a haplotype block containing the triplet, because of its position near a recombination hot spot, we chose to determine the haplotype in a region containing the 2 haplotype blocks flanking the triplet.

The size, SNP composition, LD pattern and relative distances between SNPs in the haplotype blocks containing each studied triplet, as determined by Haploview, are illustrated in appendix C.

Determination of haplotypes with specific SARP alleles

To determine the haplotypes in these previously determined haplotype blocks that contain the SARP alleles, we used PHASE 2.1 (Stephens et al, 2003). PHASE 2.1 is a software package developed at the University of Washington in Seattle. Its advantage over HAPLOVIEW is that it implements a Bayesian statistical method for reconstructing haplotypes from population genotype data. Experiments with the software on both real and simulated data indicate that it can provide an improvement over other existing methods that use expectation maximization (EM) algorithms, like HAPLOVIEW, for reconstructing haplotypes. For example, in simulation experiments in Stephens et al. 2001 the mean error rate using PHASE was about half that obtained by the EM algorithm. However, the highly iterative Bayesian algorithm makes PHASE much more time consuming than other software implementations based on expectation maximization.

The output of reconstructed haplotypes containing the triplet alleles and the derived statistics are shown in appendix C. In each case the frequencies of haplotypes inferred with and without the triplet alleles were very similar, indicating that the presence of a marker with a potentially different mechanism of polymorphism creation did not dramatically change the behavior of the software.

Determination of the relationship between SARP alleles and haplotypes

For each SNP haplotype / triplet allele combination we calculated the correlation between the triplet allele and the haplotype. To make it comparable with commonly used methods in genetics we chose the r² and the D' statistics, which are used to determine the amount of linkage disequilibrium (LD) between markers. These normalized measures of linkage disequilibrium are derived from the absolute amount of linkage disequilibrium (D), which is simply the difference between the observed haplotype frequency and the haplotype frequency expected in the absence of correlation. D' normalizes D by dividing it by its theoretical maximum; therefore the magnitude of D' varies between 0 and 1 for all allele pairs. r² also normalizes D but is adjusted for the loci having different allele frequencies.
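As an illustration of these definitions, the sketch below computes D, D' and r² for one haplotype / allele pair treated as two biallelic markers (haplotype present or absent, triplet allele present or absent). The function name and the frequencies in the usage example are hypothetical. The formulas are the standard textbook ones and the absolute value of D' is reported, in line with the 0 to 1 range mentioned above.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, D', r2) for two biallelic markers.

    p_ab: observed frequency of chromosomes carrying both allele a and allele b
    p_a:  frequency of allele a at the first marker
    p_b:  frequency of allele b at the second marker
    """
    d = p_ab - p_a * p_b  # observed minus expected under independence
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Hypothetical frequencies: allele b is always found on haplotype a.
print(ld_stats(0.5, 0.5, 0.5))   # complete association
print(ld_stats(0.25, 0.5, 0.5))  # independence, all three statistics are 0
```

With unequal allele frequencies D' can reach 1 while r² stays well below 1, which is the adjustment for allele frequencies referred to in the text.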

It is expected that some degree of LD will not guarantee predictability of a triplet allele by a SNP haplotype, but predictability will necessitate LD for each haplotype / allele combination.

After getting an idea of the amount of LD between alleles we assessed the overall significance and importance of our individual correlations by using the Wn statistic (Cramer, 1946). Also known as Cramer's V statistic, Wn is the most popular of the chi-square-based measures for multiallelic LD calculation because it gives good norming from 0 to 1 regardless of the number of alleles at both markers. When each marker has 2 alleles Wn is equal to the correlation coefficient r (figure 12).

Figure 12: Formula for Wn, where Dij is the value of the LD between alleles i and j of the first and second locus respectively, and I and J are the number of alleles at the first and second locus respectively. The second term of the definition of Wn is simplified, with χ²_LD representing the chi-square of the LD measure and N the sum of the number of alleles at both loci.

The interpretation of Wn (Goodman and Kruskal, 1954-1972) can be viewed as the association between two variables as a percentage of their maximum possible variation. Wn=1.0 defines a perfect predictive relationship whereas Wn=0.0 defines statistical independence.
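For readers who want to reproduce the statistic, the sketch below computes Cramer's V from a table of chromosome counts, using the standard chi-square form V = sqrt(χ² / (N (min(I, J) - 1))). It is an illustration only: the function name and the example counts are hypothetical, and N is taken here to be the total number of chromosomes in the table, which is an assumption about how the quantity in figure 12 is counted.

```python
from math import sqrt


def cramers_v(counts):
    """Cramer's V for a table where counts[i][j] is the number of chromosomes
    carrying SNP haplotype i together with triplet allele j."""
    n = sum(sum(row) for row in counts)
    row_tot = [sum(row) for row in counts]
    col_tot = [sum(col) for col in zip(*counts)]
    rows = [i for i, t in enumerate(row_tot) if t > 0]  # ignore unobserved haplotypes
    cols = [j for j, t in enumerate(col_tot) if t > 0]  # ignore unobserved alleles
    k = min(len(rows), len(cols))
    if n == 0 or k < 2:
        return 0.0  # a single allele or haplotype cannot show association
    chi2 = 0.0
    for i in rows:
        for j in cols:
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (counts[i][j] - expected) ** 2 / expected
    return sqrt(chi2 / (n * (k - 1)))


# Hypothetical counts: each of 3 haplotypes carries its own triplet allele.
print(cramers_v([[10, 0, 0], [0, 10, 0], [0, 0, 10]]))  # perfect prediction
print(cramers_v([[25, 25], [25, 25]]))                  # independence
```

The early return for fewer than 2 observed alleles mirrors the situation of triplets with a single common allele, for which no meaningful correlation can be computed.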

For each triplet we calculated Wn in order to get an idea of the correlation between triplet alleles and neighboring SNP haplotypes. We also listed the number of common SNP haplotypes (>5%) and the number of common triplet alleles (>1%) and verified that the amount of data used after excluding rare cases was still relevant. We restricted the triplet alleles to the more frequently occurring ones because they are thought to be more relevant for disease under the 'common disease-common variant' hypothesis. Out of the 42 triplets examined, 7 had only a single allele with frequency >1%. Usually, in order for a marker to be predictive of other markers, a correlation of at least 0.8 is required (International HapMap Consortium, 2003). Out of the 35 triplets remaining, only 2 showed a significant correlation between their alleles and the neighboring SNP haplotypes (Wn>0.8), with the 3rd best case only having Wn=0.67, and the average Wn for all 34 SARPs was 0.41. Summary results are listed in table 6.

| Triplet gene | # of SNP haplotypes | # of triplet alleles | Wn | % of data used |
|---|---|---|---|---|
| ACVR2A | 4*** | 5 | 0.66 | 95.1% |
| MN1 | 4 | 3 | 0.37 | 99.4% |
| TBP | 7 | 6 | 0.40 | 74.9% |
| NUMBL | 4 | 2 | 0.67 | 99.4% |
| AR | 4*** | 13 | 0.41 | 89.3% |
| THAP11 | 4 | 2 | 0.40 | 89.7% |
| ASCL1 | 2*** | 3 | 0.34 | 95.1% |
| TNRC5 | 3 | 1** | 0.00 | 98.9% |
| NCOA3 | 6 | 4 | 0.46 | 96.0% |
| ATXN1 | 3*** | 7 | 0.64 | 96.9% |
| TNRC6A | 7 | 2 | 0.52 | 93.5% |
| ARID3B | 4 | 1** | 0.02 | 89.9% |
| PAX7 | 8 | 4 | 0.19 | 95.7% |
| CACNA1A_2 | 7 | 6 | 0.52 | 82.5% |
| RAI1 | 3*** | 5 | 0.59 | 96.1% |
| BIK | 8 | 1** | 0.02 | 72.0% |
| CDH4 | 3 | 3 | 0.59 | 95.9% |
| NAP1L3 | 2 | 2 | 0.03 | 95.3% |
| SLC12A2 | 6 | 3 | 0.44 | 91.5% |
| EMX2 | 7 | 1** | 0.01 | 85.4% |
| SOS1 | 3 | 3 | 0.10 | 92.9% |
| EMX2 | 6 | 2 | 0.04 | 72.5% |
| TNRC4 | 3 | 2 | 0.05 | 99.8% |
| E2F4 | 3 | 2 | 0.02 | 97.9% |
| YY1 | 3 | 2 | 0.17 | 94.8% |
| DLX6 | 5 | 4 | 0.11 | 71.5% |
| FOXP2 | 3 | 3 | 0.05 | 93.5% |
| ATN1 | 4*** | 11 | 0.53 | 86.9% |
| FOXE1 | 6 | 2 | 0.17 | 90.7% |
| ATXN3 | 5*** | 9 | 0.60 | 97.5% |
| BCL6B | 3 | 2 | 0.62 | 98.4% |
| IRS2 | 5 | 1** | 0.00 | 95.6% |
| CNKSR2 | 3*** | 4 | 0.61 | 92.0% |
| MEF2C | 4 | 1** | 0.01 | 80.4% |
| DCHS1 | 5 | 2 | 0.92* | 80.9% |
| MAP3K4 | 5 | 2 | 0.56 | 93.0% |
| EOMES | 3 | 2 | 0.89* | 84.8% |
| MEF2A | 9 | 3 | 0.34 | 67.5% |
| NKD2 | 3 | 1** | 0.01 | 73.2% |
| POU4F2 | 5 | 5 | 0.35 | 98.5% |
| MYST4 | 3*** | 4 | 0.46 | 98.5% |
| SNAPC4 | 3 | 3 | 0.18 | 100.0% |

Table 6: Results of the analysis of 42 triplet repeat loci for which the alleles were correlated with the SNP haplotypes within a block of high linkage disequilibrium. The data used is the percentage of SNP genotypes used that produce haplotypes with a frequency higher than 5%. (*) Significant result. (**) Triplets with low level of polymorphism and a single common allele resulting in non significant correlations. (***) Number of SNP haplotypes is insufficient to capture the number of triplet alleles; further analysis extending the haplotype block was performed.

We carefully reviewed cases in which the number of SNP haplotypes was lower than the number of triplet alleles. In these cases several triplet alleles would have to be predicted by a single SNP haplotype. In order to assess whether we had been too conservative in the local estimation of LD in these cases, we extended the size of the haplotypes in increments of 20kb on each side of the triplet until the number of identified SNP haplotypes became at least equal to the number of triplet alleles. As expected, because blocks are separated by recombination hotspots (thus reducing correlations), the predictive value of the larger haplotype was lower than when estimated within the haplotype block containing the triplet repeat.
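The extension procedure described above can be sketched as a simple loop. This is only an illustration under stated assumptions: the function name, the `count_common_haplotypes` callback (standing in for whatever haplotype estimation is run on a given window) and the `max_steps` safety limit are hypothetical and are not taken from the original analysis; only the 20kb step on each side and the stopping rule come from the text.

```python
from typing import Callable, Tuple


def extend_until_enough_haplotypes(
    n_triplet_alleles: int,
    block_start: int,
    block_end: int,
    count_common_haplotypes: Callable[[int, int], int],
    step_bp: int = 20_000,
    max_steps: int = 50,
) -> Tuple[int, int, int]:
    """Widen the window by step_bp on each side until the number of common
    SNP haplotypes is at least the number of triplet alleles.

    count_common_haplotypes(start, end) is a hypothetical callback that
    returns the number of common haplotypes found in the window.
    max_steps is an assumed safety limit, not part of the described method.
    """
    start, end = block_start, block_end
    n_hap = count_common_haplotypes(start, end)
    steps = 0
    while n_hap < n_triplet_alleles and steps < max_steps:
        start = max(0, start - step_bp)  # extend upstream, not below position 0
        end += step_bp                   # extend downstream
        n_hap = count_common_haplotypes(start, end)
        steps += 1
    return start, end, n_hap


# Toy stand-in for haplotype estimation: 4 haplotypes in the initial 10 kb
# block, and one more for every additional 40 kb of window (made-up behavior).
def toy_counter(start: int, end: int) -> int:
    return 4 + (end - start - 10_000) // 40_000


result = extend_until_enough_haplotypes(7, 100_000, 110_000, toy_counter)
print(result)
```

In practice the callback would wrap the haplotype estimation step, and the correlation statistic would then be recomputed on the wider window.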

Simulation of data to understand the significance of the Wn statistic

In order to understand the variability of Wn within the context of our experiments we simulated a simple data set. A scenario with a 3-SNP haplotype and an 8-allele repeat was generated in which each of the 8 repeat alleles was paired with one of the 8 possible haplotypes. The haplotype frequencies and corresponding SNP allele frequencies are listed in table 7.

| Repeat | Haplotype | Frequency |
|---|---|---|
| 1 | A1 A2 A3 | .01 |
| 2 | A1 A2 B3 | .09 |
| 3 | A1 B2 A3 | .18 |
| 4 | A1 B2 B3 | .22 |
| 5 | B1 A2 A3 | .27 |
| 6 | B1 A2 B3 | .13 |
| 7 | B1 B2 A3 | .07 |
| 8 | B1 B2 B3 | .03 |

| Allele | Frequency |
|---|---|
| A1 | .50 |
| B1 | .50 |
| A2 | .50 |
| B2 | .50 |
| A3 | .53 |
| B3 | .47 |

Table 7: Unimodal distribution of haplotype/repeat allele frequencies and near equal SNP allele frequencies were used in the simulation.
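The SNP allele frequencies in table 7 follow directly from the haplotype frequencies: each allele frequency is the sum of the frequencies of the haplotypes that carry it. The short sketch below recomputes them; the variable and function names are illustrative only.

```python
# Haplotype frequencies as listed in table 7 (repeat alleles 1 to 8).
haplotype_freq = {
    ("A1", "A2", "A3"): 0.01,
    ("A1", "A2", "B3"): 0.09,
    ("A1", "B2", "A3"): 0.18,
    ("A1", "B2", "B3"): 0.22,
    ("B1", "A2", "A3"): 0.27,
    ("B1", "A2", "B3"): 0.13,
    ("B1", "B2", "A3"): 0.07,
    ("B1", "B2", "B3"): 0.03,
}


def snp_allele_frequencies(hap_freq):
    """Sum haplotype frequencies over every haplotype carrying each allele."""
    freqs = {}
    for haplotype, freq in hap_freq.items():
        for allele in haplotype:
            freqs[allele] = freqs.get(allele, 0.0) + freq
    return freqs


allele_freq = snp_allele_frequencies(haplotype_freq)
for allele in sorted(allele_freq):
    print(allele, round(allele_freq[allele], 2))
```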

The genotypes for 240 individuals (480 chromosomes) were simulated 11 times, each time varying the degree to which the triplet allele is in LD with its corresponding haplotype from table 7. We tested correlations varying from 100% to 1% in 10% decrements. The PHASE software was used to estimate haplotypes and Wn statistics were calculated for each simulated correlation (figure 13). On average the software predicted the haplotype frequencies within a 2% margin of what was simulated.
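A rough sketch of this kind of simulation is given below. Several parts are assumptions rather than the procedure actually used: the way a "percent correlation" is imposed (with that probability the repeat allele is the one paired with the haplotype in table 7, otherwise it is drawn independently), the reading of the 11 levels as 100%, 90%, ..., 10%, 1%, and the use of Cramér's V as a stand-in association measure, since the definition of Wn is not restated in this section. The PHASE haplotype estimation step is skipped because the simulated haplotypes are known directly.

```python
import math
import random
from collections import Counter

# Haplotype (and paired repeat allele) frequencies from table 7.
HAP_FREQ = [0.01, 0.09, 0.18, 0.22, 0.27, 0.13, 0.07, 0.03]
ALLELES = list(range(len(HAP_FREQ)))


def simulate_chromosomes(correlation, n_chrom=480, rng=None):
    """Return (haplotype, repeat allele) pairs for n_chrom chromosomes.

    Assumed model: with probability `correlation` the repeat allele matches
    its paired haplotype, otherwise it is drawn independently.
    """
    rng = rng or random.Random()
    pairs = []
    for _ in range(n_chrom):
        hap = rng.choices(ALLELES, weights=HAP_FREQ)[0]
        if rng.random() < correlation:
            rep = hap
        else:
            rep = rng.choices(ALLELES, weights=HAP_FREQ)[0]
        pairs.append((hap, rep))
    return pairs


def cramers_v(pairs):
    """Cramér's V for a list of (row category, column category) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    row = Counter(h for h, _ in pairs)
    col = Counter(r for _, r in pairs)
    k = min(len(row), len(col)) - 1
    if k < 1:
        return 0.0  # association is undefined with a single category
    chi2 = 0.0
    for h in row:
        for r in col:
            expected = row[h] * col[r] / n
            chi2 += (joint.get((h, r), 0) - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))


rng = random.Random(2011)
levels = [round(1.0 - 0.1 * i, 1) for i in range(10)] + [0.01]
for level in levels:
    v = cramers_v(simulate_chromosomes(level, rng=rng))
    print(f"simulated correlation {level:4.2f} -> association {v:.2f}")
```

Under this assumed model the association measure tracks the simulated correlation closely, with a small upward bias at low correlation due to the finite sample size.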

[Figure 13: Relationship between Wn and the correlation of repeat alleles with SNP haplotypes. For each simulation, the percent correlation between SNP haplotypes and triplet alleles (left axis, 0 to 100) is plotted together with the value of the Wn statistic obtained (right axis, 0.00 to 1.00); the Wn values follow the percent correlation.]

The linear relationship observed between the simulated percent correlation (between SNP haplotypes and repeat alleles) and the resulting Wn illustrates well, in this example, that the Wn statistic can be viewed as a measure of how often the triplet allele can be predicted by its neighboring SNP haplotype.


While there is some amount of correlation between a triplet allele and the surrounding haplotype, we conclude that the current SNP arrays are not well suited as a general method to capture untyped SARP alleles. If a disease is caused by SARPs, typing only SNPs will often miss disease related associations or require unrealistically large sample sizes to restore power.


E. DISCUSSION

Simple variable sequence repeats, once labeled just genetic 'junk', are an important source of genetic variation. Among the sequence repeats, triplet repeats deserve special attention because they lead to the repetition of whole codons, translating into repetitive stretches of amino acids at the protein level. Furthermore, extended triplet repeat length has been shown to be the disease mechanism of sixteen human diseases so far, including Huntington's Disease and Fragile X (Pearson et al, 2005). Findings from animal and plant genetic studies showed that variations in triplet repeats can lead to a complex phenotype and that variations within a normal (i.e. non-pathological) range of repeat number commonly yield small, quantitative functional effects (Kashi and King, 2006). Triplet repeats often serve to modify the genes with which they are associated, and their influence can be on gene regulation, transcription or protein function. Mutations that alter repeat number typically occur at rates orders of magnitude greater than single-nucleotide point mutations and, in addition, they can be reversible.

Triplet repeats therefore offer an intriguing mechanism for more genetically complex diseases like schizophrenia, autism and others. Those diseases are suspected to have a multigenic etiology with several genes contributing to disease, each with a small and subtle effect. Genetic mechanisms that do not lead to catastrophic change of gene function but to slight alterations and modifications are therefore more consistent with current hypotheses. The high mutation rate of SARPs would also be able to explain the high 'sporadic rate' seen in those diseases, i.e. the observation that many patients with those diseases do not necessarily have a positive family history for disease. It is also interesting that the length of repeat polymorphisms has been shown to be correlated with exogenous factors. SARPs could therefore reconcile the assumed genetic basis with the suspected environmental exposures in many of those diseases.

However, methods to systematically search for triplet repeats have never been developed, and the potential contribution of SARPs to these diseases remains an understudied topic. In part, this is because current methods for the detection of triplet repeat variations are expensive and/or time consuming and are not suitable for a genome wide search. We have developed a high throughput method that allows rapid detection of SARPs at a fraction of the current costs. By using a generic labeling primer (a key cost point) that can be reused in all SARP assays we have optimized our method and shown that it is reliable, accurate and precise. The only limitation of our method arises when two triplet repeats are too close together, thus preventing the development of two different amplification products. In this case, only the combined product of both repeats can be amplified. It is important to note that this limitation applies to all PCR based methods for triplet genotyping, including the current gold standard, Sanger sequencing.

SNP arrays with over 1 million SNPs dispersed over the genome have become the method of choice to investigate the genetic contributions to common complex diseases, to date with limited success. These SNP arrays have increasingly been used to also investigate other genetic mechanisms in those diseases, such as copy number variants (CNVs). CNVs are caused by chromosomal rearrangements that result in the loss (deletions) or gain (duplications) of stretches of DNA sequence that range from 1kb to several million bases. Analogously, we were interested to see whether those SNP genotypes, which are generated in increasing quantity for a variety of diseases, can be used as a resource to capture SARP alleles and to detect a potential association between certain SARPs and disease. It has been shown (Gabriel et al, 2002) that the human genome can be divided into haplotype blocks and that only a few 'tagSNPs' per block are sufficient to identify the block, which contains all the alleles that are inherited together. Our goal was therefore to identify the blocks that contain the triplet repeat and investigate whether there is strong linkage disequilibrium between specific repeat alleles and haplotypes from the block. If there were a strong correlation, then already available SNP array genotypes could be used to infer the frequency of SARP alleles in a sample. Our results show that only a few triplet repeats have a strong enough correlation with the surrounding SNP haplotypes for SNP data to be useful for prediction of SARP alleles and genetic association studies with those SARPs. For those repeats that have only a medium correlation it might still be possible to find associations if the association is strong or if the sample size is very large, probably prohibitively so. But we also found SARPs for which it was impossible to infer the alleles from the haplotype block. We attribute this to the high mutation rate of SARPs but also to mutational events that can lead to a reversal of repeat lengths, a model of variation creation far different from the recombination of ancestral alleles, which is the principle behind genetic association. SNPs, in contrast, have a much lower mutation rate, and any new SNP mutation that arises on a specific chromosomal haplotype is preserved on this specific ancestral haplotype throughout the evolutionary history of a population (i.e. founder effect) unless it is disrupted by mutation or recombination in subsequent generations. When the mutation rate is high, though, the chances increase that the same variation arises or changes on different ancestral haplotypes, thus showing correlations not with only one haplotype but with several different ones. In those cases, a possible disease association with a triplet repeat will be missed if one relies only on SNP genotyping.


Interestingly, triplet repeat variations also appear to be missed by the 1000 Genomes Project. The goal of the 1000 Genomes Project is to provide a comprehensive resource on human genetic variation that will support future medical research studies. Through the use of inexpensive next generation sequencing technology, the full genomes of a thousand individuals are planned to be sequenced and analyzed for single point and copy number variants. The full data publicly available, as of this writing, consists of only 8 ethnically varied individuals. Looking at the 10 most polymorphic triplets in our study for which the 1000 Genomes Project has coverage, none of the loci showed a single polymorphism in the repeat region. While this is theoretically possible, it is highly unlikely that all 16 alleles would be identical to the reference sequence at each of 10 confirmed polymorphic loci. Confirmation could be obtained by genotyping the DNA of the 8 individuals using the method we have developed. However, it is much more likely that this reflects the limitations of the technology currently used in next generation sequencing.

There are technical reasons in the way next generation sequencing works that might hinder the detection of triplet repeat polymorphisms. Next generation sequencing of the human genome is actually a re-sequencing method. Genomes are not assembled de novo; instead, reads of 36 to 150 bases in length are aligned to their best matching position on the reference genome, which acts as a scaffold. At the time the data was generated for the 8 individuals in release version 1 of the 1000 Genomes Project, the read lengths used were most likely in the 36-50 base range and not the 100-150 bases that can be generated today. The aligner cannot position a read with a repeat accurately unless the repeat sequence is anchored by 25 to 32 non repetitive bases (Anthony J. Cox, Solexa, unpublished) on one side and several non repetitive bases on the other side indicating the end of the repeat sequence. To accurately identify the number of repeats, each read has to span at least the repeat bases plus sufficient flanking non repetitive bases to align the read accurately. In fact, even with 50 base reads and 25 bases aligning to the non repetitive edge of a triplet repeat, only polymorphisms involving fewer than 5 repeats can be detected. With short read lengths the variant calling software is more likely to attribute the few repeat bases that are anchored on only one side to a sequencing error rather than to the presence of a size variation from the reference. This is accentuated by the fact that next generation sequencing errors compound at the 3' ends of reads, and the variant calling software usually does not make a variant call when the variation is observed only at the end of a read, even though this will be the only possible location of triplet repeat polymorphisms given the current size limitations of the reads. Therefore the inability to identify triplet repeat polymorphisms appears to be a limitation of short read technology. While new data from the 1000 Genomes Project is expected by the end of 2010 and will be gathered from longer reads than the first release, it remains to be seen whether such variants will be detected. We note that all the polymorphic triplets we looked at had 5 repeats or more, and in our entire sample we found SARPs with alleles of up to 28 repeats. Special software could be written to specifically consider the possibility of a variation within triplet repeats in next generation sequencing, but it is not part of the current standard variant determination algorithms.
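The read length constraint can be made concrete with a little arithmetic. The sketch below is only illustrative: the function names are hypothetical, and the size of the far-side flank (the "several" non repetitive bases marking the end of the repeat) is an assumed value of 10 bases, which the text does not specify.

```python
def max_spanned_repeat_units(read_len, anchor_len, far_flank_len, unit_len=3):
    """Largest number of repeat units a single read can fully span.

    anchor_len: non repetitive bases needed on one side to place the read.
    far_flank_len: non repetitive bases needed on the other side
                   (assumed value, not given in the text).
    """
    usable = read_len - anchor_len - far_flank_len
    return max(0, usable // unit_len)


def min_read_length(repeat_units, anchor_len, far_flank_len, unit_len=3):
    """Shortest read that spans the whole repeat plus both flanks."""
    return repeat_units * unit_len + anchor_len + far_flank_len


FAR_FLANK = 10  # assumption for illustration only

for read_len in (36, 50, 100, 150):
    units = max_spanned_repeat_units(read_len, 25, FAR_FLANK)
    print(f"{read_len}-base read, 25-base anchor: up to {units} repeat units")

# Read length needed to span the longest allele observed in this study.
print(min_read_length(28, 25, FAR_FLANK))
```

With these assumed flank sizes a 50 base read spans only a handful of repeat units, while an allele of 28 repeats would need reads of well over 100 bases, consistent with the argument that short reads cannot genotype most polymorphic triplets.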

The triplet repeats we investigated showed a wide variation in their number of alleles, with up to 14 additional motifs from the smallest to the largest allele. However, the majority of the allelic variation we observed consisted of incremental differences in repeat length, within two or three repeat units of the most common allele (appendix C). It is thought that mutations disturbing the purity of the repeat pattern prevent further elongation of the repeat sequence and thus limit the repeat size to a certain maximum (Subramanian et al, 2003). The lack of linkage disequilibrium between triplet repeat alleles and SNP haplotypes within a tight LD block underscores a relatively younger age for the triplet alleles, since they have not reached equilibrium. We also observed that some of the triplet repeats that were identified in the small Coriell human diversity panel as showing some degree of size variation did not exhibit any polymorphism in the larger follow up sample composed of Caucasians only (triplets CACNA1A_1, SOCS7, KCNMA1, HOXD13 and MN1). This might point to some population specific selection of triplet alleles among different ethnic groups.

Because of the instability of repeat elements, these variations in coding sequence repeats have been increasingly viewed as playing a major role in generating the genetic variation underlying adaptive evolution (Fondon & Garner, 2004; Kashi et al, 1997). There are several examples providing evidence that common repeat alleles can offer potential selective advantages, while variations in repetition purity and motif length enable site-specific adjustment of both mutation rate and mutation effect. With this, coding sequence repeats provide a rapid mechanism for regulating, modifying and restructuring genetic information with minimal risk to ongoing adaptation. They have been characterized as 'evolutionary tuning knobs' that express the potential to facilitate the efficient adaptive adjustment of a quantitative trait (King et al, 2006; Trifonov, 2004). The quantitative adjustment and on–off switching provided by site-specific mutation of repeats might be one of the simplest but also one of the most widespread and powerful means of providing genetic variation for evolution.

It is still debated how these evolutionary changes are determined: through genetic drift or through natural selection. Natural selection, a term introduced by Darwin in his book 'On the Origin of Species', is the process by which genetic variations become more or less common in a population (are selected for or against) due to consistent effects upon the survival or reproduction of their bearers in their current environment.

Genetic drift, in contrast, changes the frequencies of genetic variants in a population due to random sampling, i.e. it affects the genetic makeup of the population but, unlike natural selection, through an entirely random process. There are several arguments in favor of genetic drift being the dominant force of evolution. It is very difficult to find clear evidence of selection in humans (the sickle cell allele is a notable exception) and also in other organisms. Most importantly, there is molecular and biochemical evidence that most observed mutations at the level of a gene are neutral and not subject to positive selection. This led to the establishment of the 'Neutral Theory of Molecular Evolution' (initiated by Wright in the 1930s and extended by Kimura in the 1960s and 1970s), which postulates that variations between individuals and species are selectively "neutral." That is, the molecular changes represented by these differences do not influence the fitness of the individual organism. As a result, the theory regards these genomic features as neither subject to, nor explicable by, natural selection. Through drift, these new alleles may become more common within the population. They may subsequently be lost, or in rare cases they may become fixed, meaning that the new allele becomes standard in the population. The effect of genetic drift follows random statistical sampling rules and is therefore larger in small populations and smaller in large populations where the number of matings is higher. However, these 'neutral' variations can accumulate in populations, increasing genetic diversity. This can set the stage for 'rapid' evolution in the future, if some previously neutral traits suddenly become advantageous (because the environment changes, or because they combine with some other "neutral" traits in some advantageous way). Since a large percentage of the population already has the newly advantageous allele, natural selection can favor those individuals, leading to a much faster evolution than if advantageous mutations had to arise at the general mutation rate.
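The dependence of drift on population size can be illustrated with a small simulation. The sketch below uses the Wright-Fisher model, a standard textbook model of random sampling between generations that is introduced here only for illustration and is not part of the analyses in this work; the population sizes, number of generations and number of replicates are arbitrary choices.

```python
import random
import statistics


def wright_fisher(n_individuals, p0=0.5, generations=20, rng=None):
    """Allele frequency after neutral drift in a diploid population.

    Each generation, 2N allele copies are resampled at random from the
    previous generation's allele frequency (no selection, no mutation).
    """
    rng = rng or random.Random()
    n_copies = 2 * n_individuals
    p = p0
    for _ in range(generations):
        count = sum(1 for _ in range(n_copies) if rng.random() < p)
        p = count / n_copies
    return p


def spread_after_drift(n_individuals, replicates=100, seed=1):
    """Standard deviation of the final allele frequency across replicates."""
    rng = random.Random(seed)
    finals = [wright_fisher(n_individuals, rng=rng) for _ in range(replicates)]
    return statistics.pstdev(finals)


for size in (20, 500):
    print(f"N = {size}: spread of allele frequency = {spread_after_drift(size):.3f}")
```

Starting from the same allele frequency, replicate small populations end up far more scattered than replicate large ones, which is the sense in which drift is stronger in small populations.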

These hypotheses for evolution are not mutually exclusive and the debate about the effect of each still continues. Some research describes models taking both into account. A study of within-group versus between-group variation in skull form of hominin fossils reveals that non random differential selection is most likely the cause of the split between the narrow faced Homo genus and the wide faced Australopithecines 2.5-3 million years ago. Yet the differences between H. erectus and H. habilis can be explained simply by random drift (Rogers Ackermann et al, 2004). More recently a similar study looked at within-population versus between-population variation in skull form, also taking into account the geographic distances between the populations and the climate in which each population lives (Betti et al, 2010). After correcting for the fact that nearby populations tend to be phenotypically close as well as living in a similar climate, they saw no correlation between climate and skull form, indicating a random origin of skull form diversity (genetic drift). One exception, however, was in the Inuit and Eskimo, who live in a very cold climate and for whom measurements of cranial breadth, nasal apertures and orbital apertures, all major contributors to thermoregulation, are more narrowly distributed, indicating a non random association (positive selection).

Another recent report analyzing genetic diversity of chromosome Y worldwide strongly suggests drift as the major source of genetic variation in humans since the expansion of modern humans (Chiaroni et al, 2009). One very interesting point is that culturally and technologically evolved populations can negate the role of natural selection in the creation of allelic diversity, since they create their environment rather than being influenced by it as an external pressure.

The ordered drift then selection hypothesis sits well with regard to triplet repeats:

1. Because the effect of increased or decreased repeat length on the phenotype is supposed to be subtle, there should be little pressure initially for natural selection, thus favoring genetic drift as the source of allelic diversity.

2. The high mutation rate of triplet repeats, compared to single nucleotide variations, will ensure a critical mass of potentially favorable variations on which natural selection could act, thereby narrowing the allelic diversity in a population under environmental pressure.

It is interesting to note that among all the SARPs for which the triplet repeat regions were fully sequenced we found only one case in which the repeat region overlapped with a known functional domain of the protein (Table 8). Yet in all the cases in which a genetic defect in the SARP was known to be the cause of a human disease, the expansion of the repeat domain was mentioned as the cause. This implies that the triplet repeat domains do not need to directly alter the functional domains of the SARPs to cause disease.

| Triplet gene | Repeat type | Repeat region | Protein functional domains | Repeat in functional domain | Role of repeat domain |
|---|---|---|---|---|---|
| AR | Gln | 58-78 | Composed of three domains: a modulating N-terminal domain, a DNA-binding domain and a C-terminal ligand-binding domain. In the presence of bound steroid the ligand-binding domain interacts with the N-terminal modulating domain, and thereby activates AR transcription factor activity. Agonist binding is required for dimerization and binding to target DNA. The transcription factor activity of the complex formed by ligand-activated AR and DNA is modulated by interactions with coactivator and corepressor proteins. Interaction with RANBP9 is mediated by both the N-terminal domain and the DNA-binding domain. Interaction with EFCAB6/DJBP is mediated by the DNA-binding domain. | no | Expansion causes Kennedy disease (SMAX1), an X-linked recessive form of spinal muscular atrophy. Shows genetic anticipation |
| ARID3B | Gln | 5-15 | Interaction domain with RB1. Interaction domain with ARID3A | no | not known |
| ASCL1 | Gln | 51-62 | DNA binding domain | no | not known |
| ATXN2 | Gln | 166-187 | Interacts with TARDBP; the interaction is RNA-dependent. Interacts with RBFOX1 | no | Expansion causes spinocerebellar ataxia type 2 (SCA2). Shows genetic anticipation |
| CACNA1A | His; Gln | 2211-2220; 2314-2324 | Each of the four internal repeats contains five hydrophobic transmembrane segments (S1, S2, S3, S5, S6) and one positively charged transmembrane segment (S4). S4 segments probably represent the voltage-sensor and are characterized by a series of positively charged amino acids at every third position. | no | Expansion of Gln repeat causes spinocerebellar ataxia type 6 (SCA6). Shows genetic anticipation |
| DLX6 | Gln | 26-31 | Homeobox | no | not known |
| E2F4 | Ser | 307-327 | Leucine-zipper, Dimerization, Transactivation, Interaction with RBL1, interaction with RBL2, DEF box, HCFC1-binding-motif | no | poly Serine length might be associated with tumorigenesis |
| FOXE1 | Ala | 164-177 | Fork-head DNA binding domain | no | poly Alanine might be associated with familial thyroid dysgenesis |
| HOXD13 | Ala | 57-71 | Homeobox | no | Expansion causes synpolydactyly type 1 (SPD1) and syndactyly type 5 |
| IRS2 | Ala | 371-380 | 7 YXXM motifs, PH domain, IRS-type PTB domain | no | not known |
| MAP3K4 | Ala | 1190-1202 | Protein kinase domain, ATP binding domain | no | not known |
| MN1 | Gln | 523-550 | no information | no | not known |
| MYT1L | Glu | 152-168 | 6 C2HC-type zinc fingers | no | not known |
| NCOA3 | Gln | 1248-1278 | Contains three Leu-Xaa-Xaa-Leu-Leu (LXXLL) motifs. Motifs 1 and 2 are essential for the association with nuclear receptors, and constitute the RID domain (Receptor-interacting domain). PAS domain, DNA binding domain, Acetyltransferase domain, CREBBP interaction domain. | yes | Function of repeat in acetyltransferase domain is not known |
| NRXN1 | n/a | 5'UTR | EGF-like domain, transmembrane domain. | no | not known |
| NUMBL | Gln | 424-446 | PTB domain necessary for the inhibition of MAP3K7IP2-mediated activation of NF-kappa-B | no | not known |

Table 8: Functional domains found in the Single Amino acid Repeat Proteins studied and their overlap with the triplet repeat domain.

If the direct sequence change of a functional domain is not the way triplet repeat polymorphisms modulate the function and/or activity of the genes they are contained in, we should strive to understand how they provide this modulation. This could be done at three possible levels, at least:

1. When the biological activity of the SARP is known it is possible to design experiments at the molecular level that directly measure the protein activity as a function of a triplet length that can be set as needed. Alternatively, molecular level experiments can be designed to measure the effect of triplet length on protein conformation. For example, using synthetic forms of the N terminal domain of the androgen receptor (AR-NTD) it was established that a longer polyglutamine domain shifted the folding of the peptide from a loose structure to a helix structure (Davies et al, 2008). Such findings provide molecular evidence that the length of the triplet repeat domain can modulate the structure and folding of the SARP.

2. When a specific phenotype is known or suspected to be caused by a SARP it is possible to design experiments that establish the degree of correlation between the triplet repeat length and the variation in the phenotype. This kind of experiment depends on the range of naturally occurring triplet alleles and naturally occurring phenotypes and would determine the function of the repeat at the association level. For example, expansion of the triplet repeat of the androgen receptor gene (AR) causes a rare neuromuscular disease that is associated with reduced sperm production and infertility. It was established that within the normal range of glutamine repeats (9-36 residues) individuals with 28 or more repeats were at more than 4-fold increased risk of impaired spermatogenesis (Tut et al, 1997). Such findings provide evidence that a continuous, disease related phenotype, like impaired sperm production leading to infertility, can be modulated simply by the number of residues in the polyglutamine domain. It was later established (Casella et al, 2003) that the biological effect is due to a subtle modulatory effect of the polyglutamine length on the activity of the androgen receptor.

3. Finally, the function of the repeat domains could be studied at the bioinformatic level: first by modeling the folding structure of repeat tracts of each of the 20 amino acids at various lengths (by ab-initio secondary structure prediction, for example), then by using a structural homology search engine like PDBeFold (Krissinel et al, 2004) to derive all known protein domains structurally similar to the amino acid repeat domain. Leads on the possible biological function of such domains could be obtained by varying the degree of structural homology (most likely lowering the threshold for reporting) until a reasonable number of functional domain hits are obtained.


We have presented a high throughput method that allows for the rapid detection of SARP alleles and will enable researchers to perform genome wide searches of the genetic basis of human diseases.

The lack of success of numerous SNP based GWASs can be attributed to the complex model of genetic disease inheritance (multilocus, low penetrance). The results of this study offer a new perspective on these weak findings by showing their lack of power to detect disease associations caused by a simple model of triplet repeat inheritance.


References

Barrett JC, Fry B, Maller J, Daly MJ. (2005) Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics.

Bates GP, Gonitel R. (2006) Mouse models of triplet repeat diseases. Mol Biotechnol.

Betti L, Balloux F, Hanihara T, Manica A. (2010) The relative role of drift and selection in shaping the human skull. Am J Phys Anthropol.

Caburet S, Cocquet J, Vaiman D, Veitia RA. (2005) Coding repeats and evolutionary "agility". Bioessays.

Casella R, Maduro MR, Misfud A, Lipshultz LI, Yong EL, Lamb DJ. (2003) Androgen receptor gene polyglutamine length is associated with testicular histology in infertile patients. J Urol.

Chiaroni J, Underhill PA, Cavalli-Sforza LL. (2009) Y chromosome diversity, human expansion, drift, and cultural evolution. Proc Natl Acad Sci U S A.

Cramer H. (1946) Mathematical Methods of Statistics. Princeton University Press.

Davies P, Watt K, Kelly SM, Clark C, Price NC, McEwan IJ. (2008) Consequences of poly-glutamine repeat length for the conformation and folding of the androgen receptor amino-terminal domain. J Mol Endocrinol.

de Bakker PIW, Yelensky R, Pe'er I, Gabriel SB, Daly MJ, Altshuler D. (2005) Efficiency and power in genetic association studies. Nature Genetics.

Fondon JW 3rd, Garner HR. (2004) Molecular origins of rapid and continuous morphological evolution. PNAS.

Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, Liu-Cordero SN, Rotimi C, Adeyemo A, Cooper R, Ward R, Lander ES, Daly MJ, Altshuler D (2002) The structure of haplotype blocks in the human genome. Science.

Gatchel JR, Zoghbi HY. (2005) Diseases of unstable repeat expansion: Mechanisms and common principles. Nature Reviews Genetics.

Goodman Leo A. and W. H. Kruskal (1954, 1959, 1963, 1972). Measures for association for cross-classification, I, II, III and IV. Journal of the American Statistical Association.

Holmes SE, Wentzell JS, Seixas AI, Callahan C, Silveira I, Ross CA, Margolis RL. (2006) A novel trinucleotide repeat expansion at chromosome 3q26.2 identified by a CAG/CTG repeat expansion detection array. Hum Genet.

International HapMap Consortium. (2003) The International HapMap Project. Nature.

Kashi Y, King DG. (2006) Simple sequence repeats as advantageous mutators in evolution. Trends in Genetics.

King DG, Trifonov EN, Kashi Y. (2006) Tuning knobs in the genome: evolution of simple sequence repeats by indirect selection. In The Implicit Genome (Caporale, L.H., ed.), pp. 77–90, Oxford University Press.

Krissinel E, Henrick K. (2004) Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallogr D Biol Crystallogr.

Missirlis PI, Mead CL, Butland SL, Ouellette BF, Devon RS, Leavitt BR, Holt RA. (2005) Satellog: a database for the identification and prioritization of satellite repeats in disease association studies. BMC Bioinformatics.

Pearson CE, Nichol Edamura K, Cleary JD. (2005) Repeat instability: mechanisms of dynamic mutations. Nat Rev Genet.

Ranum LP, Day JW. (2004) Pathogenic RNA repeats: an expanding role in genetic disease. Trends Genet.

Rozanska M, Sobczak K, Jasinska A, Napierala M, Kaczynska D, Czerny A, Koziel M, Kozlowski P, Olejniczak M, Krzyzosiak WJ. (2007) CAG and CTG repeat polymorphism in exons of human genes shows distinct features at the expandable loci. Hum Mutat.

Ruden DM, Jamison DC, Zeeberg BR, Garfinkel MD, Weinstein JN, Rasouli P, Lu X. (2008) The EDGE hypothesis: Epigenetically Directed Genetic Errors in repeat-containing proteins (RCPs) involved in evolution, neuroendocrine signaling, and cancer. Front Neuroendocrinol.

Siwach P, Pophaly SD, Ganesh S. (2006) Genomic and evolutionary insights into genes encoding proteins with single amino acid repeats. Mol Biol Evol.

Stephens, M. and P. Donnelly (2003). A comparison of bayesian methods for haplotype reconstruction. American Journal of Human Genetics.

Stephens, M., N. J. Smith, and P. Donnelly (2001). A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics.

Subramanian S, Madgula VM, George R, Mishra RK, Pandit MW, Kumar CS, Singh L. (2003) Triplet repeats in human genome: distribution and their association with genes and other genomic regions. Bioinformatics.

Trifonov EN (2004) The tuning function of tandemly repeating sequences: a molecular device for fast adaptation. In Evolutionary Theory and Processes: Modern Horizons (Wasser, S.P., ed.), pp. 115–138, Kluwer Academic Publishers


Tut TG, Ghadessy FJ, Trifiro MA, Pinsky L, Yong EL. (1997) Long polyglutamine tracts in the androgen receptor are associated with reduced trans-activation, impaired sperm production, and male infertility. J Clin Endocrinol Metab.


Appendix A

List of 207 triplets tested and primer sequences used for amplification

No.  Triplet  Forward primer sequence  Reverse primer sequence
1  ACVR2A_GCC_p0  CTTCCCCGACAATCTCCTC  GTCTGGGAACTGACTGGGTC
2  ADAMTS2_GCA_0  GGGGTCCCGGGGAGTAG  GCTCGCTCCTTCCCTCTC
3  ADAMTS8_CAG_0  GACAGGTGGAGCGCGAG  GGGCTTCTTCCCTCTCCTG
4  AFF3_GCT_0  CTGCCCTCACTCTCGCTG  GTTTGCTCACTTGGCTTTCC
5  ALMS1_GGA_0  AGTCAGGGCTCTCCCCTTC  GCCCGTAGTGAGAGTCGGAG
6  AMOT_GCA_0  GCAGCAGCTGGAATCTGAC  GTTGCTCCCATCTCTGTTCC
7  ANK3_TGG_0  TTTCTTCTTCAGAGAGGCGG  ATGTGCCACACAAAAGCAAG
8  AR_GCA_0  GACCTACCGAGGAGCTTTCC  CTCATCCAGGACCAGGTAGC
9  AR_GGC_1  CTCTTCACAGCCGAAGAAGG  GGATAGGGCACTCTGCTCAC
10  ARID3A_GGA_0  TCTCCCCTGACTCCTGCC  ACCCTGGGAACGCCATC
11  ARID3B_CAG_0  ACCTTTTCCTACCCACCCAC  ACTGGGCTTCTCTCATCTGC
12  ARX_GCC_0  GCTGATCTTGAGCGTGTCC  TGGCTGATAGCTCTCCCTTG
13  ASCL1_GCA_0  GTTCCTGCACCCAAGTTCTC  GTTTGCAGCGCATCAGTTC
14  ATN1_CAG_0  ACAGCAATGCCCATCCAG  GCGTAAGGGTGTGCGTG
15  ATRN_GCT_0  ACTGGGACTGGGACGTGAC  CAGCATCTCCCTCTGACCTG
16  ATXN1_TGC_0  GTGTGTGGGATCATCGTCTG  ACAGTGGAACCTATGCCAGC
17  ATXN2_GCT_0  ACGCTAGAAGGCCGCTG  CTCCGCCTCAGACTGTTTTG
18  ATXN3_CTG_0  GAATGGTGAGCAGGCCTTAC  CCAGTGACTACTTTGATTCGTG
19  AUTS2_CCA_0  CGAGAACTTGACATTCACCG  ATAGTGCATGCTGGGGAGTC
20  BCL11A_CTC_0  CGAGCTGTTCTCGTGGTG  ACCAGCGACTTGGTGGG
21  BCL6B_CAG_0  GCTAAACTCTCAGGCCTCCC  AGGCCAGCTCTGACAAATTC
22  BHLHB3_GCC_1  ACAGATACTTCTCCAGGCCG  CGTCACCATCAAGCAGGAG
23  BHLHB3_GCG_0  ACTCTGCTTGAACCTCCGTC  AGAAGTATCTGTACCCGGCG
24  BIK_TGC_0  CCCTCTCCTGAACTCCTTCC  TGTTCCAGCACTATCTCGGG
25  BRD3_CTT_0  AGGAGCCTTCTTCTGCTGAG  GCCCGCTCTTTTCCAGC
26  BRD4_GCT_0  ACCAGGCCTCCAGACTCAC  AGCACGCTGAGAAGGAGAAG
27  BTBD2_GGC_0  CTGCCAGTTGTACGCGG  GCCGCAGCCACAAAATG
28  BZRAP1_CTC_0  CTGACCACTGTCACAGCCC  CGATGAGGAGATCTTGGAGC
29  BZRAP1_CTC_1  ATCCCTACCCCGTTACCTTG  TGCAGAAGGAGGACACAGC
30  C19orf2_TGA_0  TGCAAGTTCAGAACCATTCAG  TGGTATAGAATTATCTCCAACCCC
31  CACNA1A_CTG_1  CCCGGTAGTAGCCATGGTG  ACACGTGTCCTATTCCCCTG
32  CACNA1A_TGG_0  ACCTTTGCCTAGAGTGGCTG  CGACCACCTGTGTGTCTGTC
33  CACNA1G_CAC_0  GAGCTGCTCAAGTACCTGGTG  CGGCATGGTAGAAGCTGTG
34  CACNA1H_ACC_0  CCCAGTGCTGTGCAAGG  CTGGGGAAGGTGGCGAG
35  CACNG8_CGG_0  GGACGACTTTCTTGGGCAG  TTCAGCGACTCCAGTTTGAC
36  CAPNS1_GGC_0  TGGTTAACTCGTTCTTGAAGGG  CCCGCCCCTCTGATAGTC
37  CASQ2_TCA_0  CAGAATTGCTTGCTGCCAC  GATGACGATCTTCCAACTGC
38  CBX4_GTG_0  AGTCAGCCGCTTCTTGCAG  CTGCAGCTCACCACCAAG
39  CDH4_GCG_p0  CTTCACCTGCAGCTCCG  GAGGCGGCAACTTACCC
40  CENPB_CCT_0  ACCGGCCTCCACTACTCC  GACATAGCCGCCTGCTTTC
41  CHD3_CCG_0  TTGTTCTCTTGGGAGGAAAGAG  AGCCATCCTGTGGCTCTG
42  CHD8_TGG_0  ATCCTCCTCAGGTTGCAGTG  TGCATCTTTGCCTTTTATGC
43  CHERP_GCT_0  GTACCAGGCTGGGTGGTG  TACTCCTCAGTGGTCCAGCC
44  CHGA_GAG_0  CTGGTGGACAGAGAGAAGGG  ATTTGTCCACCTGGGAAATG
45  CNGB1_CCT_0  AAAGAGTGCCCAGACCCC  AGGAGGTAGACTCAGCGGG

46  CNKSR2_GAG_0  GCAGAGGCTGCTTCCCTC  ACAGGGAGGTAAACGTGCAG
47  CREBBP_TGC_0  CTTGAGGCTGCTGGAACTG  TTGAACATCATGAACCCAGG
48  CSPG3_CAC_0  TCACCACCTTTTGTCCCTTC  CAGCAGAAATTCCCTTCGTC
49  CYP26B1_GGC_p0  CGGAGCCTAAAAGTCCCTG  CAACTCCAGAGCCGAGAGAC
50  DACH1_CCG_1  GGGGTGCTGGAAGCGAC  CCTCTGGGCCAACTCTGTTC
51  DACH1_GCT_0  CCCTCAGATCCACCATTTTG  AACTGCAACCCCAACCTG
52  DCHS1_CAG_0  CCAATCAGTGTACCCGCTG  TTGTCATGCAGAAGGAGCTG
53  DEK_TCC_0  CCGACTTCACGCCTCTAAAC  ATGCAGGTTCACAGCATGTC
54  DIAPH1_AGG_0  GGTAGCATCCCCAGACAAAG  CTAGTCGTGCTCCTGTTCCC
55  DLX6_CAC_0  CATGACTACGATGGCTGACG  GACTGGAGGTAAGGGCTGTG
56  DLX6_GCA_1  CCAAAGAGCTAAGGTGGCTG  AGGCAGTGCAGAGGGTAGTG
57  DNER_GCT_0  TCTCACCTCCACTTGATCCC  CTCGTTTCAGCTTGTTGCAG
58  DYRK1A_CAC_0  GCAATTTCCTGCTCCTCTTG  ACTGTGGCCAACCTCCATAG
59  E2F4_CAG_0  TCAACGACCTCTTCCTGACC  TGACAGGAGATCTGGGTTCC
60  ECM2_TCC_0  ACAGAGAGCACCCGCTTG  GGGGACAGCAGAGGAGG
61  EMX2_ATA_p1  TAGTTGCACTGGGTTTGCAC  AACTCTCTCCCTCTCTCGCC
62  EMX2_GAG_p0  TTCTGTCCTACCAGCTTGGG  TAAAGTGCGTCAGAGGGAGG
63  EN1_GCC_0  TGCGAGTCAGTTTTGACCAC  AGACTGCCGCAGGTAGAGAC
64  EOMES_GCG_0  GGTACGGGAAGAGTGAGCAG  CATGCTTAGTGACACCGACG
65  EP400_GCA_0  TGAGCAAACCTCTTCATAGCC  CAGCTTTGATCTGTGCTGGG
66  EPHB6_CCT_0  CCTTTGAGGCATGTCATGTG  TTGACGTTCAGTTGCAGTCC
67  FOXA2_GGA_p0  GGCTCCACTCCCTCTCTCTC  TATATCACCAGCCTCCCACG
68  FOXE1_GCC_0  CGACTGCTTCCTCAAGATCC  GGGTAGTAGACTGGAGGCGG
69  FOXF2_CCG_0  AGAGGAGCTGAGGGAGGC  AATTGGAGGAGGACGAGGAC

70  FOXG1B_CCA_0  TGAAAATGATCCCCAAGTCC  CAGCCCGTCCGCTTTAG
71  FOXG1B_CCG_1  ACCACAACAGCCACCACC  ACTTGCCGTTCTTCTTCTCG
72  FOXP2_CAG_1  ACCCACTCAGGCAATGAAAG  ATGGAGATGAGTCCCTGACG
73  FOXP2_GCA_0  TGGAAAATTTCAGAGCTGCC  CTATTCTTGCCGCTCAAAGC
74  GATA6_ACC_0  GCGCTTCCCCTACTCTCC  ACCCTTACCTGCACTGGGAC
75  GDF11_GCG_0  CAGTCCTCCCTCCCCTCC  GTAGGTCCAGGATCTGCTGC
76  GTF3C3_TCC_0  TCATCATTTTCTTGGTTTCACG  CCCCATCAATGTCTCTTTTCTC
77  GTF3C5_GAG_0  ACACCCATGGCCTTGTCTC  CTTGGGCCCTGTCACAC
78  HAND2_GCG_0  CGCTGTTGATGCTCTGAGTC  GGCGAAATGAGTCTGGTAGG
79  HCN1_CGC_0  CGGCTGTTAGACGAAGAGTTG  AGCCTCAGCTTCAGCACC
80  HCN2_GCC_0  CTCAGTTTCCCCTCCTGGG  CACGAGAACGACACCTTGG
81  HD_CAG_0  AGGTTCTGCTTTTACCTGCG  GGCCCAAACTCACGGTC
82  HD_GCC_1  TCGAGTCCCTCAAGTCCTTC  GGTTGCTGGGTCACTCTGTC
83  HEY1_GGC_p0  CACTCAGGGGAAGGAGGAG  CAGGCTCCGATTACAGGTTC
84  HLXB9_GCG_0  AGTAGCCGTAGACCGGGTG  CTTGGTCACGTCGCTCG
85  HOXB3_CCG_0  ACTCCTTCTCCAGCTCCACC  CAGCGAAAGGAGAACAGGTC
86  HOXD11_GCG_0  CCCCTACTCTTCCAACCTGG  AAGCTCGCGTTGCATGG
87  HOXD11_GCG_1  CGACTACGGCCTGGAGC  CGCGCTGTAGAAGTTGGAG
88  HOXD13_GGC_0  TCTTCCTCCTCCTCATCGG  GATGACTTGAGCGCATTCTG
89  IGSF4_TGG_0  AAAGGGAACTTTTCGGCTTC  GTTGTGTTCAGCATTGTGGG
90  IRS2_CGG_0  CTCCATGGACAGCTTGGAAC  CCCTACCCAGAGGACTACGG
91  IRS2_GTT_1  GAGAGCGATCACCCGTTTC  CAAGGGTGGGAGGGAGC
92  KCNA4_CCT_0  GGACTGAACTGTAGCCGCC  CCACTACCGGCAGAGCAG
93  KCNMA1_CCG_1  TCTAGGCTGAGATGGTTCGC  GCTGTTGATGGGTGTTTGG

94  KCNMA1_GAG_0  TCCACAGGTACTTGAGCGTC  TAGCTATGGCAAATGGTGGC
95  KCNQ3_GCC_0  CAAGGTGACTTGCTCCACG  CCCCTTCTTCCCAGAGCAG
96  LMX1B_CCG_p0  AGGAAGTTTCGGGAGGTGAG  GTCACCTGCTTGTCAGTCCC
97  LRP5_GCT_0  AGGCACCTCCAGGGCAG  CCAAGTCGCTTCCGAGAC
98  LRP8_GCA_0  AGCAGAGCCGAGTCAGAGAC  CACCGAACCTGCTTGAAATG
99  LTBP3_CAG_0  CAAAGACCACCTTGAAGCG  GCTGTCCAGTCTGCATCTCC
100  MAF_CCG_0  AGGGTGGCTAGCTGGAATC  GCGCACCTGGAAGACTACTAC
101  MAFA_TGG_0  TACAGGTCCCGCTCTTTGG  AGCCTACGAGGCTTTCCG
102  MAP3K4_CTG_0  CAAGTCCGTTCCCTCTTCTC  TAGAGCATCACTCACACGCC
103  MAP3K9_CCT_0  CAGGGTCAGCTCGTCCTC  AGAGCGCTTCTCGGCTG
104  MAPK8IP2_GAG_0  CCGATTTCCCTTCTCCTAGC  TATGAACCCAGGCTCAGCTC
105  MBD2_CGC_0  CTTCCTCCTTCTTCCATCCG  GACTCCGCCATAGAGCAGG
106  MEF2A_CAG_0  TCCATTAATACCAACCAAAACATC  GAGCTGCTCAGACTGTCCAC
107  MEF2C_GAG_p0  TCCACCTGATTCAAACATGC  TCAAACGCTGTGAGCTGTTC
108  MEOX2_TGG_0  GCAGAGGCTGTGCCGAG  CCCGAGCTCTCTACTTCTTCC
109  MIB1_CGG_p0  CCGTGAGTTATTCTCACGTCC  ACACTACCACCACCTCCTCG
110  MICALCL_CTC_0  AGGAAGGTGCTGCCTGAAG  TGCTTGTCGATGCAATGAAG
111  MLR1_GCG_0  CCCGCGTCTCTCTTACCTAC  CACGTGTCAATCAACCCC
112  MN1_GCT_0  CTAGCTGAGCCAGGTTGGG  TCTGGATAATCACCTCTCCCC
113  MN1_TGC_1  CAGACCCACAGGCATCTTTC  ATTTTGACATGTTTTCGCCC
114  MUC2_CCA_0  TCTGTTCCATTACGACACGC  AGAGGAATGCCACTGTCACC
115  MYST4_GAG_0  TTTACCCTCCCCACTTAGGC  TTTGATTGAACGCGACTCAC
116  MYT1L_TCC_0  AACAGTGGACCCAAATCACG  CAGGAACCTGCTCCTAAACG
117  NAP1L2_TCC_0  TCGCCAGGATCTGAAAGC  TATGTGGATGAGGACGATGG

118  NAP1L3_GCT_0  CCACGAAATTTGTTCCCAAC  AAATGGTCTCGGAACCTGTC
119  NAP1L5_CCT_0  CCCCTTACTTCTTGGCGTC  GATAAGGAATTTCAGGCTCTGG
120  NCOA3_GCA_0  ATGCAACTGGCAGGTTCTTG  TGTTAACTTGGGGAAGGCAC
121  NCOA6_TGC_0  TACCGTGGCTAAGGATGGTC  TACAGCAGCAAAGTCATCCC
122  NF2_AAT_p0  TGACGCATACCTGTAATCCC  CGGCGAATATTCTTCTTTGG
123  NFAT5_CAG_0  TAATCCAATGCCTCAAAGCC  ATCATGTTTGGTGGTGGTTG
124  NISCH_GGA_0  GAACAGGGCGAGGAGGAG  AGCACCACAGCACCTTGAG
125  NKD2_CAC_0  CACAGTGGAGCACGAGGTG  CTGCTCCCTGCCCTGAG
126  NOG_CGC_p0  TTTTGCAGGAGGAGGAGAAG  CATCACCATGGAAACCACTG
127  NPAS3_GCC_p0  GGGAGGAGAAGGCGCAG  TGGAAGGATCCTGCTGAAAG
128  NR4A3_ACC_0  GGGCTACTCGAGCAACTACG  AAGTACATGGAGGTGCTGGG
129  NRG2_TGC_0  AGCATGGAGAAGCCGGG  CTGTTTCCCCTCCCAAGG
130  NRXN1_CGC_0  TTTTGTCGAGCTCCCATTTC  CGCCATGTACCAGAGGATG
131  NUMBL_GCT_0  CCATCTGTGAGGGTGTGATG  CCTCTCCTCTCTCCAGGGAC
132  OLFML2A_CCA_0  ACTCCGCAGAGCCCAAC  GTGTGCTGACTCCTGTGAGG
133  ONECUT1_TGG_0  CCACGTCCTTGTGGTAGGG  ATGCCCACCACCTACACC
134  OTX1_CCA_0  CAACACCTCGTGTATGCAGC  GTAATCCAAGCAGTCGGCAG
135  PAK3_GAA_0  TCAGGTGAGCAAGTCCTGG  GGCATCTTTTGAATCATTTGG
136  PAX7_CCT_p0  TTTTCGTTATTGGTCCTCCG  CTTCTTTCTCCGGACCACAC
137  PAXIP1_CTG_0  TGAATCCTCACCCCACACTC  AACAGTTGCATCCTCCACAG
138  PAXIP1_TGC_1  TGCTGTTGCTGTGAAAATGG  TGCAGTGCTGTTTAGCCAAG
139  PBX1_CCT_p0  CCTTTATCTGGTGGCCAATC  GTTGCAAAAGCCAGATCTCC
140  PCDH15_AGG_0  CAGACGTTGAAACGGAAAGTG  TTTGTATGTTACCTATTGAAACCG
141  PHC1_GCA_0  TCAGCAACAGCAACAGATCC  AGGTGCTACAGGTGGTTTGG

142  PHLDA1_TGC_0  TGCAGTTCCTTGAGCTTGAC  GACGGGTTGTTGCAGCTC
143  PHOX2B_GCC_0  TTTGGAGCGAAGATAGGACG  CAGGTCCCAATCCCAACC
144  PIK3R5_TCC_0  ACAAGGTGGAGTCATGGGAC  CTTGCTTCCCTGCAGACATC
145  POU3F1_CCG_0  GTGGGTAGCCACTGGGG  CACCACCGCGCAGTACC
146  POU3F2_CGG_0  CTGGTGCAGGGCGACTAC  CCCGTGCAGCTCGTCTC
147  POU3F2_GCA_1  CACCAGTGGATCACCGC  GCCGTGGTGGTGCAGAC
148  POU4F1_TGG_0  TGTGCATATGCGGGTGAG  AAGAGCCATCCTTTCAAGCC
149  POU4F2_CAC_1  CTATGAATACCATCCCGTGC  GCATGGGGTTCATGGTG
150  POU4F2_GGC_0  CCAAGTACTCGGCACTGCAC  TGGAAGACAGGCTCTCCG
151  POU6F2_CAG_0  GCTGTCAAGACCCAGCAAG  CTGAGAGGCGGGTGGTG
152  PPGB_CTG_0  GCTGAGGAGCGAGTCAACAG  GTAGTGGAGGTGCTTGGAGC
153  PRG2_CCT_0  GTGGGAGGATGACCCGAG  GCCAAGGAGAACATGGTGAC
154  PRKCSH_GAG_0  CGACCTTCCAGCACCTTC  CAGTGAGTAGCCCTTCCCC
155  PTPRZ1_TGA_0  TCTCCCCACAGAGATGGTTC  TTTTCGTGGGTGTCTGAATC
156  PURA_CGG_0  ATGGCAGACTCCTTCTCACC  CCCTCACCCTACAACCCTG
157  PVRL1_CCT_0  TCAGGGTCGTACTGGTAGCC  CATCCTGCTGGTGTTGATTG
158  RAB21_GCG_0  TCTCGGACTTTCCCTCAGC  ACCTGCAGAGTGGTGATGTG
159  RAI1_CAG_0  TTTCCTCAGCATTCCCAGTC  TAGTACTGCTCTGGGGTCCG
160  RANBP9_GGC_0  ACCAGAGCTGGGGTCGG  CTCCGGAGTCGTCCTGC
161  RBM10_GGA_0  GCTCCGAGACTCAGCGTAG  AGGGCCCTGGAGGAGAC
162  REXO1_GGA_0  GAGCCAACTGGAAACCACTC  CGCTCACTAGACGAGGGC
163  RIS1_CGG_0  ACACCGGCCGAGTCTCC  CAACGGCATGGAACAGC
164  RPS6KA3_CGG_p0  CACAGCCATCTTCTGCCAC  AGGAAGAAAGGGGCGAGAC
165  RUNX2_GGC_1  ACTTGTGGCTGTTGTGATGC  CTCTTACCTTGAAGGCCACG

166  RYR1_GAG_0  GTAGGTGATGGGCATCTTTG  ACTTCATCTGGAGCAGCCC
167  SHOX2_CCT_0  GTCGTTACCCTCCGTCAGTC  TGGAAGAACTTACGGCGTTC
168  SLC12A2_GCG_0  CAGAGCCGTTTCCAGGTG  GTTCTGGAAGCTCACGTTGG
169  SMARCA2_GCA_0  TTATAAAATGCTGGCCCGAG  TGGTTGCGTATTAACCTACCAG
170  SNAPC4_GCT_0  GTCCGGGACCATGTACTGTG  GCCTCTAAGTAGCTCTTTCCTCC
171  SOCS7_AGC_0  TAACAGCTGCTCGGAAGAGG  GACTGAGGCGGATTTTGAAG
172  SOS1_GCC_p0  GTAGTTGGGACTCCGAAACG  GCTCAGTAGCGAGCAGGTTC
173  SOX1_GGC_0  CAAGTGGTTTGTGCATCAGG  CTCCTGCATCATGGCCG
174  SOX11_GAC_0  GACGACTACGTGCTGGGC  AGCTGTAGTAGAGGCGGCTG
175  SOX21_GCG_0  GGTAGCCCAGCGACGAC  GGCCTGGTGCCTGAGTC
176  SP8_GCC_0  ATCCGAGTTGTAGCCTCCG  TGGGTTCCAGTCTCTCAAGC
177  SP8_GGC_1  TGCACCTTGGAGATGAACAC  CAGTCTCTCAAGCTTCGGC
178  SPRED3_CCT_0  AGGTGATGTGCACTGTGGG  ACAAAACCCCTACCCTCCC
179  SPRY2_GGA_p0  AAGAAAACGGCCTTACGGAG  CACATGACGCCATTTACTGC
180  SREBF2_AGC_0  CACAGAGATGCTGCAATTTGTC  GAAGGAAGGTAATGTGACCTGG
181  SRPX_GCA_0  GGGTCTTTGGTGCCCAG  CTCTTCGGTCTCCTGCCG
182  SRRM2_CTC_0  GAGGCTGTTCGAGAGGGAC  AAGACAGCACTCACCTCCG
183  SSBP3_CCG_p0  CCGCCACTTGCAAAATAGG  CGACAGAAAGAGAGGCTCG
184  ST6GALNAC5_GCA_0  TCTGGCAGTGTGTTTAGCG  GGTATCCGTCCAGTGGC
185  TAF7L_TCA_0  CCAGATTCAATAAACTCGGC  TTATATCTGGAGTATGAGTCATTTGC
186  TAOK2_GAG_0  CAGTGCCCAGCATGTCC  CCAAGGGTGAGCTGTGTACC
187  TBP_GCA_0  GATGCCTTATGGCACTGGAC  TGAGTGGAAGAGCTGTGGTG
188  TCF7L1_CGG_0  GCCCTGTCAAACTTTGTTGC  GCTTCCTTACCTCCGAGTCC
189  TERF1_GAG_0  GAAACGACGAGGAGCAGTTC  AGAGGAAATCGAGCATCCAG

190  TFAP2A_AGG_p0  CAAAGCATTTTCATGGATCG  GGCTGTTGGTAAAGAGCTGG
191  TFEB_GCT_0  TACCTTCAACACCTCCCCAG  CGTCACGCATAGGGTTGC
192  TGFBR1_GGC_0  CTCCGAGCAGTTACAAAGGG  GCGCCATGTTTGAGAAAGAG
193  THAP11_GCA_0  CGTCCCCACCATCTTCC  CTACAGTGGCCTGAAGGGTG
194  TNRC15_CAG_0  CCCTTATCTTCTCCCCTCCC  GGGAGATACACAAAGGTATCTGG
195  TNRC4_TGC_0  ACACCCCAGAAGCAAATGTC  CAAGAAGTCAGGGAAGCCC
196  TNRC5_TGC_0  GGTCCTTTAGGGTCCGGG  CTCACCTTCGCATTTGCTG
197  TNRC6A_CAG_0  CCAGTGTAAGCCAGCCTCAG  CCCCTCTTTAGAAGCTGTTTG
198  TNRC6A_CAG_1  GTCCAGTTGCTAGGGCCTC  CGAGCCTCAAGCCCTCC
199  TNS1_GCT_0  CAATGGCTGAGGGCTGG  CTTTTCGGAAGCTGAACCC
200  TRIO_GCG_0  GTCTCTCCCGCTGTCTTGTC  GGGTGGACGTACCTGACATC
201  TWIST1_CCG_0  AATCTTGCTCAGCTTGTCCG  AACAGCGAGGAAGAGCCAG
202  USP34_GTG_0  AAATGTGCCCAATAGAAAACTTG  TTGTATCACACGAACTGGGG
203  VGLL3_CTC_0  CTGACCCAATGTCTCCCTG  TGTTTTGTTTATTCCATCCCAG
204  YY1_CCA_0  GGAGACCATCGAGACCACAG  TTCAATGTAGTCGTCGTCGC
205  ZIC3_CGC_0  CTTTCCGACTACGGCACTTC  CTCTGGCCTGAAGATAGATCG
206  ZIC5_GGC_0  GGACAGACACCCACCTGTATG  TTAAACCTGAACCTGGCGG
207  ZNF312_CCG_0  CAGAAAGAGCTTGGGGTGAG  GGTGCCGTCAAAGACACTG
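
Each entry in the list above pairs a triplet identifier with a forward and a reverse primer. As a minimal sketch, assuming the identifier reads as gene symbol, repeated triplet, and a trailing index (the meaning of the "p" prefix on some indices is not stated here, so it is kept verbatim), an entry can be split into fields and each primer given a rough melting temperature with the Wallace rule (2 °C per A/T, 4 °C per G/C). The names `PrimerPair`, `parse_row` and `wallace_tm` are hypothetical, and the Wallace rule is only a rule of thumb, not the primer design method used in this work.

```python
from typing import NamedTuple


class PrimerPair(NamedTuple):
    number: int
    gene: str
    motif: str
    suffix: str   # trailing index, kept verbatim (e.g. "0", "1", "p0")
    forward: str
    reverse: str


def parse_row(row: str) -> PrimerPair:
    """Split one whitespace-separated Appendix A entry into its fields."""
    number, triplet_id, forward, reverse = row.split()
    # Assumed layout of the identifier: GENE_MOTIF_INDEX
    gene, motif, suffix = triplet_id.rsplit("_", 2)
    return PrimerPair(int(number), gene, motif, suffix, forward, reverse)


def wallace_tm(primer: str) -> int:
    """Rule-of-thumb melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


pair = parse_row("8 AR_GCA_0 GACCTACCGAGGAGCTTTCC CTCATCCAGGACCAGGTAGC")
print(pair.gene, pair.motif, wallace_tm(pair.forward), wallace_tm(pair.reverse))
```

Comparing the two estimates within a pair gives a quick check that the forward and reverse primers should anneal at similar temperatures.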

Appendix B

Labeling PCR products with the fluorescent adaptamer method

Three primers are used in a single PCR: a conventional reverse primer (yellow), a forward primer (red) tailed with the M13 sequence (orange), and a fluorescently labeled M13 primer (orange star).

During the first PCR cycle only the oligos with sequence specificity to the template will anneal.

First-cycle extension produces a conventional product from the reverse primer and an M13-tailed product from the forward primer.

Second-cycle annealing has the same characteristics as first-cycle annealing: only the template-specific primers are able to hybridize.

Second-cycle extension produces the first products with ends complementary to the primers (A and B).

While product A is amplified conventionally in cycle 3, product B can anneal to either the tailed specific primer or the labeled M13 primer.

The significant difference in concentration between the two primers drives the reaction towards annealing with the labeled M13 primer.
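
The three-primer scheme described above can be sketched in a few lines. This is an illustration under stated assumptions, not the protocol used here: the tail is assumed to be the common M13(-21) universal sequence, the concentrations are hypothetical, and annealing is modeled as proportional to primer concentration only, ignoring melting temperature and kinetics. The function names are invented for the example.

```python
# Assumed tail: the widely used M13(-21) universal primer sequence.
# The appendix does not state which M13 sequence was used.
M13_TAIL = "TGTAAAACGACGGCCAGT"


def tailed_forward(specific_forward: str, tail: str = M13_TAIL) -> str:
    """Build the tailed forward primer: M13 tail at the 5' end."""
    return tail + specific_forward


def labeled_fraction(tailed_conc: float, labeled_conc: float) -> float:
    """Share of annealing events on product B taken by the labeled primer,
    assuming annealing is proportional to concentration alone."""
    return labeled_conc / (tailed_conc + labeled_conc)


def labeled_product_length(amplicon_length: int, tail: str = M13_TAIL) -> int:
    """Labeled fragments carry the tail, so they size longer than the
    untailed amplicon by the tail length."""
    return amplicon_length + len(tail)


primer = tailed_forward("GACCTACCGAGGAGCTTTCC")  # AR_GCA_0 forward primer
print(primer)
print(labeled_fraction(tailed_conc=1.0, labeled_conc=4.0))  # hypothetical ratio
print(labeled_product_length(250))  # hypothetical 250 bp amplicon
```

The length offset matters when reading fragment sizes: allele sizes measured on labeled strands include the tail and must be corrected before converting to repeat counts.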

Most of the strands produced in the forward direction are labeled. Unlabeled forward and reverse strands (not shown here) are present but go undetected during fragment sizing.

Appendix C

Output of the haplotypes reconstructed by PHASE, with and without the triplet alleles, together with the Wn calculations and the derived analysis statistics. The SNP composition, LD pattern and relative distances between SNPs in the haplotype block containing each studied triplet are also shown in a figure. A stronger shade of red indicates a higher level of linkage disequilibrium, which is expected within haplotype blocks.
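
For readers unfamiliar with how the linkage disequilibrium shown in such figures is quantified, the sketch below computes the standard pairwise measures |D'| and r² from phased haplotypes. It is a generic illustration, not the Wn statistic reported in this appendix. It assumes two biallelic loci, so a multiallelic triplet would first have to be collapsed (for example, one allele against all others), and the function name `ld_stats` is hypothetical.

```python
def ld_stats(haplotypes):
    """Return (|D'|, r^2) for two biallelic loci.

    haplotypes: list of (allele_at_locus_1, allele_at_locus_2) tuples,
    one per phased chromosome.
    """
    n = len(haplotypes)
    # Use the alleles on the first haplotype as the reference alleles.
    a, b = haplotypes[0]
    p_a = sum(h[0] == a for h in haplotypes) / n
    p_b = sum(h[1] == b for h in haplotypes) / n
    p_ab = sum(h == (a, b) for h in haplotypes) / n

    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d_prime, r2


# Toy data: two loci in complete LD, then two independent loci.
complete = [("A", "G")] * 5 + [("C", "T")] * 5
independent = [("A", "G"), ("A", "T"), ("C", "G"), ("C", "T")]
print(ld_stats(complete))
print(ld_stats(independent))
```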
