Comparative Genomics of Amino Acid Tandem Repeats
Total Page:16
File Type:pdf, Size:1020Kb
0100010101101101001000000110001101101111011100110111010001100001001000 0001101001011011010110000101100111011010010110111001100001011100100010 1101011101000110010100100000011000010110001001110011011001010110111001 1101000010000001110000011001010111001000100000011100110110010101101101 0111000001110010011001010010111000001010010101000110000101101110011101 0001110011001000000110010001100101001000000111001001100101011000110110 1111011100100110010001110011001000000110010001100101001000000111010001 1101010010000001110011011001010010000001101101001001110110000101100011 0111010101101101011101010110110001100101011011100000101001110001011101 0101100101001000000110111001101001001000000110010001100101011010010111COMPARATIVE GENOMICS 1000011001010110111000100000011001010111001101110000011000010110100100 1000000110000100100000011011000110000100100000011101000111001001101001 0111001101110100011001010111001101100001000010100110100100100000011101OF 1001101001011100110110001100100000011010010110111001110100011001010110 1110011100110110000101101101011001010110111001110100001000000111001101AMINO ACID TANDEM REPEATS 1001010110111001110011011001010010000001110100011001010110111001101001 0111001000101101011101000110010100101110000010100100111001101111001000 0001110110011101010110110001101100001000000111000001100001011100100110 1100011000010111001000101101011101000110010100100000011000010110110101 1000100010000001110110011001010111010100100000011011010110010101101100 0110000101101110011001110110100101101111011100110110000100101100000010 1001101100011000010010000001110100011001010111011001100001001000000110 1101011011110111001001110100001000000110111001101111001000000110010101 1011010010000001100011011100100110010101101101011000010010000001101100 0110010101110011001000000110010101101110011101000111001001100001011011 1001111001011001010111001100101100000010100110111001101001001000000110 1101001001110110000101101110011001110110111101101001011110000110000100 1011000010000001101110011010010010000001100101011011010010000001101100 0110110001100101011101100110000100100000011001010110110000100000011001 1101101111011010010110011100100000011001000110010100100000011101100110 1001011101010111001001100101001110110000101001100101011011010010000001 1001000110111101101100001000000111001101100001011000100110010101110010 0010000001110001011101010110010100100000011011100110111100100000011100 0001101111011001000111001001100101011011010010000001110000011000010111 0010011101000110100101110010001011010110111001101111011100110000101001 1011010110000101101001001000000110110111101001011100110010000001100101 0110110000100000011100000110000100101100001000000110111001101001001000 0001100110011001010111001000101101011011100110111101110011001000000110 0011011011110110110101110000011000010110111001111001011010010110000100 1110110000101001110000011001010111001011110010001000000110010000100111 0110000101110001011101010110010101110011011101000010000001100100011011 1101101100011011110111001000100000011001010110111000100000011101000111PhD Thesis 0010011001010110001100100000011011000110000100100000011001100110111101Loris Mularoni 1100101110011101100001000010100111000001100101011100100010000001100101 0111001101100011011100100110100101110101011100100110010100100000011000Barcelona, June 2008 0101110001011101010110010101110011011101000111001100100000011011010110 1111011101000111001100100000011010010010000001110010011001010110001101 1011110111001001100100011000010111001000101101011101000110010100101110 0000101001001101111010010111001100100000011101000110010101101110011000 0111100111011011010110010101101110011101000010000001110001011101010110 TO MERITXELL, MY WIFE, WHOSE PATIENT LOVE ENABLED ME TO COMPLETE THIS WORK. For the most curious, the background of the cover represents the binary codification of one of my wife’s favorite poem. The original version of this poem by Miquel Martí i Pol is included at the end of this thesis. COMPARATIVE GENOMICS OF AMINO ACID TANDEM REPEATS Memòria presentada per Loris Mularoni per optar al grau de Doctor en Biologia per la Universitat Pompeu Fabra Aquesta Tesi Doctoral ha estat realitzada sota la direcció de la Dra. M.Mar Albà al Departament de Ciències Experimentals i de la Salut de la Universitat Pompeu Fabra M.Mar Albà Loris Mularoni La Directora de Tesi L’Autor Barcelona, Juny de 2008 Tu ne quaesieris, scire nefas, quem mihi, quem tibi finem di dederint, Leuconoe, nec Babylonios temptaris numeros. ut melius, quidquid erit, pati. seu pluris hiemes seu tribuit Iuppiter ultimam, quae nunc oppositis debilitat pumicibus mare Tyrrhenum: sapias, vina liques et spatio brevi spem longam reseces. dum loquimur, fugerit invida aetas: carpe diem quam minimum credula postero. Ask not - we cannot know - what end the gods have set for you, for me; nor attempt the Babylonian reckonings Leuconoe. How much better to endure whatever comes, whether Jupiter grants us additional winters or whether this is our last, which now wears out the Tuscan sea upon the barrier of the cliffs. Be wise, strain the wine; and since life is brief, prune back far-reaching hopes! Even while we speak, envious time has passed: seize the day, putting as little trust as possible in tomorrow! ODE I-XI, QUINTUS HORATIUS FLACCUS (Venosa 65 BC - Roma 8 BC) Contents Contents vii List of Figures ix List of Tables ix List of Abbreviations xi Abstract xiii I INTRODUCTION 1 1. Genome repetitive elements 5 1.1. The repetitive DNA content of genomes 5 1.2. Interspersed repeats 5 1.3. Tandem repeats 7 1.3.1. Satellites 9 1.3.2. Minisatellites 9 1.3.3. Microsatellites 11 2. Tandem amino acid repeats 19 2.1. Distribution of amino acid tandem repeats across genomes 20 2.2. Relationship between GC content and repeat content 22 2.3. Coding repeat and functionality 23 2.4. Diseases associated repeat 26 2.5. Mutational mechanisms of tandem amino acid repeats 29 2.5.1. Replication slippage 29 2.5.2. Unequal crossing-over 30 3. Bioinformatics approaches for the study of repeats 30 3.1. Expressed Sequence Tags (ESTs) 30 3.1. Orthologous genes 31 II OBJECTIVES 33 vii CONTENTS III RESULTS 37 Chapter 1 39 1.1. Mutational patterns of amino acid tandem repeats in the human proteome 41 Chapter 2 51 2.1. Highly constrained proteins contain an unexpectedly large number of amino acid tandem repeats 53 2.2. Comparative Genetic of Trinucleotide Repeats in the Human and Ape Genomes 63 Chapter 3 73 3.1. Patterns of emergence of amino acid tandem repeats in the vertebrate phylogeny 75 IV DISCUSSION 103 1. Thesis overview 105 2. Abundance of repeats 106 3. Amino acid repeats features 107 4. Balance between slippage replication and point mutation 109 5. Evolutionary dynamics of tandem amino acid repeats 110 6. Future directions of research 112 V CONCLUSIONS 115 VI REFERENCES 119 VII APPENDICES 137 Appendix 1 139 1.1. Housekeeping genes tend to show reduced upstream sequence conservation 141 Appendix 2 151 2.1. List of Pubblications 151 Appendix 3 153 3.1. List of Conferences Contributions 153 viii List of Figures 1. Genome sizes and amount of coding DNA 6 2. Components of the human genome 6 3. Interspersed repeats 8 4. Unequal crossing-over 15 5. Replication slippage 16 6. Hypothesised biology of a microsatellite locus 18 7. Tandem amino acid repeats 20 8. The genetic code 21 9. Repeats and functions 23 10. Runx-2 and morphological evolution 25 11. Hox proteins and development 25 12. An overview on how ESTs are generated 31 13. Paralogous genes and orthologous genes 32 List of Tables 1. Frequency of microsatellites in non-translated regions of genes in eukaryotas 13 2. Distribution of amino acid tandem repeats of size >= 5 in different species 22 3. Features of glutamine and alanine unstable repeat expansion disorders 28 ix x List of Abbreviations bp ………………………….. Base pairs cDNA ……………………...Complementary DNA CH ………………………….Codon homogeneity DNA ……………………….. Deoxyribonucleic acid EST ………………………...Expressed sequence tag Ka …………………………..Rate of non-synonymous substitutions Ks …………………………..Rate of synonymous substitutions RNA ……………………….. Ribonucleic acid xi xii Abstract ANDEM amino acid repeats, also known as homopolimeric tract or homopeptides, are very common features of eukaryotic genomes and are present in nearly one-fifth of human encoded proteins. These structures have attracted much interest in the early 1990s when a number of neurological diseases associated with repeat expansion mutations were discovered in humans. Despite their abundance in coding proteins, little is known about their functional consequences. Two scenarios have been proposed. In one, tandem amino acid repeat is considered a neutral structure generated by slippage event and eventually tolerated in protein as long as it does not disrupt the protein function. However, an increasing number of studies proposed that tandem amino acid repeats may be involved in important functional or structural roles. For instance, tandem amino acid repeats had been found to be especially abundant in transcription factors and developmental proteins, where they can potentially modulate protein-protein interaction, exert an effect on gene transcriptional activity, or act as spacer between different protein domains. In addition, several studies have linked changes in repeat size to modification in developmental processes. Despite the advancement made in the last decade, little is known about the selective forces that shape their evolution. The aim of this thesis has been to gain further insight onto the evolutionary dynamics of tandem amino acid repeats by studying the different types of