Triplet Repeat Length Bias and Variation in the Human Transcriptome
Total Page:16
File Type:pdf, Size:1020Kb
Triplet repeat length bias and variation in the human transcriptome Michael Mollaa,1,2, Arthur Delcherb,1, Shamil Sunyaevc, Charles Cantora,d,2, and Simon Kasifa,e aDepartment of Biomedical Engineering and dCenter for Advanced Biotechnology, Boston University, Boston, MA 02215; bCenter for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742; cDepartment of Medicine, Division of Genetics, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA 02115; and eCenter for Advanced Genomic Technology, Boston University, Boston, MA 02215 Contributed by Charles Cantor, July 6, 2009 (sent for review May 4, 2009) Length variation in short tandem repeats (STRs) is an important family including Huntington’s disease (10) and hereditary ataxias (11, 12). of DNA polymorphisms with numerous applications in genetics, All Huntington’s patients exhibit an expanded number of copies in medicine, forensics, and evolutionary analysis. Several major diseases the CAG tandem repeat subsequence in the N terminus of the have been associated with length variation of trinucleotide (triplet) huntingtin gene. Moreover, an increase in the repeat length is repeats including Huntington’s disease, hereditary ataxias and spi- anti-correlated to the onset age of the disease (13). Multiple other nobulbar muscular atrophy. Using the reference human genome, we diseases have also been associated with copy number variation of have catalogued all triplet repeats in genic regions. This data revealed tandem repeats (8, 14). Researchers have hypothesized that inap- a bias in noncoding DNA repeat lengths. It also enabled a survey of propriate repeat variation in coding regions could result in toxicity, repeat-length polymorphisms (RLPs) in human genomes and a com- incorrect folding, or aggregation of a protein. In noncoding regions, parison of the rate of polymorphism in humans versus divergence repeats are associated with increased instability or fragility of the from chimpanzee. For short repeats, this analysis of three human DNA or in altering the binding properties of DNA binding proteins genomes reveals a relatively low RLP rate in exons and, somewhat involved in DNA metabolism or biological regulation. surprisingly, in introns. All short RLPs observed in multiple genomes Tandem repeat variation in genomic DNA has already been shown are biallelic (at least in this small sample). In contrast, long repeats are to be important to morphological evolution (15). The work presented highly polymorphic and some long RLPs are multiallelic. For long here is motivated by the hypothesis that an individual’s repeatome— GENETICS repeats, the chimpanzee sequence frequently differs from all ob- both common and private variation—plays a large role in the genetic served human alleles. This suggests a high expansion/contraction rate basis of susceptibility to a wide range of diseases and other traits. The in all long repeats. Expansions and contractions are not, however, advent of whole-genome sequencing and analysis enables us to make affected by natural selection discernable from our comparison of progress toward supporting this hypothesis. human-chimpanzee divergence with human RLPs. Our catalog of To investigate personal variation in repeat length polymorphisms human triplet repeats and their surrounding flanking regions can be (RLPs) in human populations, we undertook the first whole-genome used to produce a cost-effective whole-genome assay to test indi- triplet-repeat variation analysis for two of the available human ge- viduals. This repeat assay could someday complement SNP arrays for nomes: those of J. Craig Venter and James Watson. Given that producing tests that assess the risk of an individual to develop a noncoding regions have extensive repeats for which there is less disease, or become part of personalized genomic strategy that pro- information about their level of impact on phenotype, we confined our vides therapeutic guidance with respect to drug response. analysis to triplet tandem repeats (3-STRs) in protein coding regions and surveyed the relative frequency of RLPs found in exons vs. introns. computational biology ͉ genomics ͉ genome ͉ polymorphisms ͉ Our analysis of this small sampling of personal genomes provides tandem repeats evidence for several thought-provoking phenomena. We observe a relatively low rate of short RLPs in both exons and introns, all of which enomic variability is key to the adaptation, survival, and are biallelic. We also see that, in the case of these short RLPs, one allele Grelative fitness of species. Nature uses mutations, duplications, almost always agrees with the chimpanzee genome. Conversely, we recombinations, inversions, deletions, and many other genomic observed a relatively high rate of long RLPs, some of which are multiallelic. Also unlike short RLPs, most long RLPs show deviation mechanisms to create diversity in populations. Evolution selects from the chimpanzee genome in all of the observed human alleles. against most deleterious genomic changes. The remainder result in Additionally, we observe a surprising periodic pattern in the the spectacular range of traits observed in living organisms. Current lengths of some repeat types in noncoding sequence. We also research in human genetics has focused on documenting the full propose a design for a cost-effective, genome-wide assay of RLPs vocabulary of genomic polymorphisms and relating them to specific in humans that can be used for more extensive future studies. traits. In particular, medical genetics concentrates on identifying variations that are associated with disease. Most of this work focuses Results on identifying SNPs or other small genetic changes and on docu- In our survey of human 3-STRs, we have: (i) compared the menting both the common variations and disease-associated mu- frequency of these polymorphisms in exons versus introns and, in tations (1, 2). This is due, in part, to the availability of relatively turn, compared this relationship to the analogous one for SNPs, to inexpensive technologies to probe the landscape of common mu- make informal estimates of the relative selective pressures on tations. Recently, copy number variation has become a prominent topic, in part due to the realization that a good fraction of observed genomic variation may result from copy number differences rather Author contributions: C.C. and S.K. designed research; A.D. performed research; M.M., A.D., than SNPs (3–5). S.S., C.C., and S.K. analyzed data; and M.M., A.D., S.S., C.C., and S.K. wrote the paper. Short tandem repeats (STRs) are an important family of genetic The authors declare no conflict of interest. polymorphisms that have either documented or suspected signifi- Freely available online through the PNAS open access option. cance in human genetics, personal medicine, and evolutionary 1M.M. and A.D. contributed equally to this work. analysis (6–8). Several diseases have been associated with tandem- 2To whom correspondence may be addressed. E-mail: [email protected] or ccantor@ repeat length variation (7–9). The most prominent among them is sequenom.com. the polyglutamine (CAG) trinucleotide repeat, which has been This article contains supporting information online at www.pnas.org/cgi/content/full/ implicated in a number of devastating neurodegenerative diseases 0907112106/DCSupplemental. www.pnas.org͞cgi͞doi͞10.1073͞pnas.0907112106 PNAS ͉ October 6, 2009 ͉ vol. 106 ͉ no. 40 ͉ 17095–17100 Downloaded by guest on October 1, 2021 Table 1. Basic counts of 3-STRs and their combined length and Repeats the number of polymorphisms in human introns and exons and exon repeats intron repeats the number of exons and introns, their total length and the exon polymorphisms intron Polymorphisms number of SNPs in each exon polyprophism/repeat ratio intron polymorphism/repeatratio Ratio Exon Intron (exon/intron) No. of Regions 85,741 167,485 Total Length 31,918,308 901,736,782 No. of 3-STRs 201,850 3,629,645 Total Length 1,576,309 27,681,474 of 3-STRs RLPs (Venter) No. 93 4,268 0.0218 Rate 1/3.43ϫ 105bp 1/2.11ϫ 105 bp 62% SNPs (Venter) No. 17,765 899,970 0.0197 Rate 1/1.80ϫ 103bp 1/1.00ϫ 103 bp 56% Nonsynonymous No. 8,452 0.0094 SNPs (Venter) Rate 1/3.78ϫ 103bp 26% Fig. 1. Histogram of the frequency distribution of lengths of 3-STRs in the human genome. Superimposed on this histogram are data points showing the 3-STR RLPs; (ii) compared the genomes of Venter and Watson to fraction of length-3 STRs that are length polymorphic for each length of STR. the reference human genome sequence to estimate the 3-STR RLP The absolute number of such polymorphisms for each repeat length is shown rate in individuals and carefully documented the relative rate of in the histogram as the lighter-colored bottom portion of each bar on the RLPs in short and long repeats and both shared and different alleles graph. For readability we have excluded from this figure the 17 intronic in the human genomes and the chimp genome; and (iii) analyzed the repeats that range in length from 101 bases to 362 bases. length distribution of 3-STRs in the human genome with respect to their composition and uncovered a surprising periodic pattern possibly related to observed unusual DNA structures or other genome, which can be explained by a random sequence model with secondary structural properties (6). We also discuss methods for independent occurrence of nucleotides at frequencies matching assaying the full set of such repeats in individuals to document their genome-wide prevalence. personal RLPs that will lead to cost-effective approaches for Unlike shorter repeats, long repeats are highly polymorphic. The genotyping using RLPs. majority of repeats with length exceeding 40 bp appear polymorphic even in our small sample of individual genomes. Fig. 2 shows the The 3-STR RLPs in Venter and Watson Genomes. We documented all fraction of 3-STRs that are length-polymorphic for each length 3-STRs in the gene regions (exons/introns) in the reference human 3-STR in the reference human genome.