Relationship Between Triplet Repeat Polymorphisms and HapMap tagSNPs
University of Cincinnati
Date: 2/17/2011

I, Mehdi Keddache, hereby submit this original work as part of the requirements for the degree of Doctor of Philosophy in Biomedical Engineering.

It is entitled: Relationship between triplet repeat polymorphisms and HapMap tagSNPs

Student's name: Mehdi Keddache

This work and its defense approved by:
Committee chair: Bruce Aronow, PhD
Committee member: William S. Ball, MD
Committee member: Jeffrey Johnson, PhD

Document of Defense Form (last printed 2/24/2011)

Relationship between triplet repeat polymorphisms and HapMap tagSNPs

A dissertation submitted to the Graduate School of the University of Cincinnati in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY in the Department of Biomedical Engineering of the College of Engineering

By Mehdi Keddache
M.A., Genetics & Development, Columbia University, May 1998

Committee Chair: Bruce Aronow, PhD

Abstract

Single Amino Acid Repeat Proteins (SARPs) are a class of peptides that contain extended stretches of the same amino acid. At the DNA level they are represented by a repetition of the same triplet of bases. Depending on the number of repeats, they can influence gene regulation, transcription and protein function. Mutations that add or subtract repeat units are both frequent and reversible. Several human diseases have been identified that are caused by variations in the size of repeated DNA sequences. Animal and plant genetic studies have shown that variations in repeat sequences can lead to complex phenotypes and that variations within a normal (i.e. non-pathological) range of repeat number commonly yield small, quantitative functional effects. Even though triplet repeats present an intriguing disease mechanism for other complex human diseases, methods for a fast, inexpensive and systematic search have never been utilized. We have devised an assay that can be used for high-throughput, genome-wide detection of SARPs in those diseases.
We have optimized our method for low cost and validated our results through direct sequencing. We also showed that inferring SARP alleles from SNP data derived from genome-wide association studies (GWAS) is often not possible, and that a SNP-based GWAS approach will therefore be underpowered, or may even fail, if the disease studied is caused by a triplet repeat allele. In addition, we found that current methods of next-generation sequencing may miss triplet repeat variations, probably because of technical limitations of the technology.

Acknowledgements

I want to thank my advisor, Dr. Bruce Aronow, and my committee members, Dr. William Ball and Dr. Jeffrey Johnson, for their encouragement and support during the course of my graduate work. A very special thank you to Dr. Gregory Grabowski, Jerry Diegmueller and the entire division of Human Genetics at CCHMC for their support and for giving me the opportunity to pursue a doctoral degree. I am deeply indebted to Katie Lutz and Sarah Srodulski from the CCHMC Genotyping facility for their assistance in the generation of triplet genotypes. To my friends Dr. Martina Durner and Dr. Larry Altsteil, thank you so much for your support and encouragement during all the years I have been working on my doctorate. To my colleagues Dr. Auvo Reponen and Patrick Putnam, thank you for your invaluable database management and programming advice.

Table of contents

Abstract
Acknowledgements
Part A: Background and Significance
Part B: Specific Aims
Part C: Specific Aim 1
    Primer design and amplification strategy for triplet repeat analysis
    Incorporation of fluorophore by the fluorescent adaptamer PCR technique
    Polymorphic determination of triplet repeats using DNA of varied ethnic origin
    Correlation of the experimentally determined polymorphism of the selected triplets with the repeat database Satellog
    Verification of the normal meiotic origin of the observed triplet variations
    Validation of assayed alleles through sequencing
    Analysis of sequencing and genotyping results
    Using the technique beyond triplet repeats
    Cost considerations
    Sensitivity and specificity considerations
Part D: Specific Aim 2
    High throughput triplet genotyping protocol
    Data structure to efficiently store millions of biallelic SNP genotypes
    Determination of haplotype block around the SARP location
    Determination of haplotypes with specific SARP alleles
    Determination of the relationship between SARP alleles and haplotypes
    Simulation of data to understand the significance of the Wn statistic
Part E: Discussion
References
Appendix A: List of 207 triplets tested and primer sequences used for amplification
Appendix B: Labeling PCR products with the fluorescent adaptamer method
Appendix C: Output of reconstructed haplotypes by PHASE

A. BACKGROUND AND SIGNIFICANCE

Single Amino Acid Repeat Proteins (SARPs) are a class of peptides that contain extended stretches of the same amino acid. At the DNA level they are represented by a repetition, within an exonic sequence, of the same triplet of bases coding for one amino acid (codon). Because several codons can encode the same amino acid (for example, GCC and GCA both code for alanine), we make a distinction, at the DNA level, between SARPs with homogeneous triplet repeats (the same codon is repeated) and SARPs with imperfect triplet repeats (multiple codons are repeated). One of the interesting features of SARPs is their potential repeat number polymorphism. The number of repeats is thought to influence gene regulation, transcription and protein function. Mutations that add or subtract repeat units are both frequent and reversible, thus providing a prolific source of quantitative and qualitative genetic variation, producing phenotypes upon which natural selection can act. Several human diseases have been identified that are caused by variations in the size of repeated DNA sequences within a specific gene (Pearson et al., 2005). In addition, it has been shown that SARP polymorphisms are also involved in complex phenotypes in animal species.
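The distinction drawn above between homogeneous and imperfect triplet repeats can be illustrated with a short Python sketch. It is not part of the original work: the function name `longest_repeat` and the example sequences are hypothetical, and the sketch assumes that the input is a coding sequence already in the correct reading frame and translated with the standard genetic code.

```python
from itertools import groupby

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def longest_repeat(coding_dna):
    """Return (amino_acid, repeat_count, kind) for the longest in-frame run.

    kind is "homogeneous" when a single codon is repeated and
    "imperfect" when several synonymous codons make up the run.
    """
    seq = coding_dna.upper()
    # Split into complete codons; a trailing partial codon is ignored.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    best = ("", 0, "none")
    # Group consecutive codons that translate to the same amino acid.
    for amino_acid, group in groupby(codons, key=CODON_TABLE.get):
        run = list(group)
        # Skip stop codons and codons with ambiguous bases (key is None).
        if amino_acid is None or amino_acid == "*":
            continue
        if len(run) > best[1]:
            kind = "homogeneous" if len(set(run)) == 1 else "imperfect"
            best = (amino_acid, len(run), kind)
    return best


print(longest_repeat("GCCGCCGCCGCCGCC"))        # ('A', 5, 'homogeneous')
print(longest_repeat("ATGGCCGCAGCCGCAGCCTAA"))  # ('A', 5, 'imperfect')
```

Both example sequences encode a run of five alanines, but only the first repeats a single codon. The second mixes GCC and GCA, which is the imperfect case described in the text.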
SARPs are therefore thought to be ideal candidate variations for involvement in complex human genetic diseases that may be caused by several genes, each with only a subtle effect. A mechanism that does not interrupt gene function but merely alters it is a very attractive potential function of SARPs. Approximately 2,100 SARPs have been identified with 4 or more repeats (Subramanian et al., 2003), and triplet repeats are also found in promoter regions of active genes. In particular, two thirds of the SARPs are located in genes with brain-related function.

The preferred current method of screening for genetic association with disease is with SNPs densely spaced throughout the genome. A single nucleotide polymorphism or SNP (pronounced snip) is a DNA sequence variation occurring when a single nucleotide (A, T, C, or G) in the genome differs between individuals. The first phase of the International HapMap project was completed in 4 major populations with the goal of identifying the key point variations (tagSNPs) in the human genome that allow the prediction of most other nearby point variations because they are on the same “haplotype block”. Commercial products have become available to systematically genotype all HapMap tagSNPs and perform genome-wide association studies (GWAS). The question arises whether those tagSNPs would also be able to capture the SARP alleles within their haplotypes.

Repeat polymorphisms are expected to be more plastic than SNPs. The main reason is that SNPs most likely arise from uncorrected polymerization errors and chemical or radiation DNA damage, whereas repeat polymorphisms can arise from more frequent events, such as polymerase slippage and unequal recombination, that can either shrink or expand the repeat length. The formation of imperfect triplet repeats through point