<<

          

           

   !

        "## $%$&%'%    "#  ()*&($&+&))'&   ,,,, - &$$*(.                  !   "  #$%&%' ''()''*  $ + *   *,$  $-. *#  /01$  2    +$0

 " 30%' '04  1   5   2   5$  603   0           %%0 0789(!(  &%'%0

1$$  6$  :  + $2   + ;< + 05    *$$  $  6+      *+    $ $  *   $    *   *0 1$ *$$ $6 ** 2 $2  2$+  $+ =  $ +    *01$*2   *    +      +    -,15/ !> *$ $  6+ 01$   *,15+   2$ $$   $   $  + 0# ,15+ $        ** $,152  $,15**$   *   * 07 $$   2 +$    +  $  + $+$   01$$ 2$+    ** 2  $ 2  03   > *$+  **   2   $* *$+ $ 2 **   +  0"  *%!  2$$       *$  *   * 07 $* $  2 =  +  $$  6    *    0"+    $*    + -1/0 1$?  *1    +    $  03 *1 *   $$ + 01$     ** *  $ *$ 2  * *+  $ 2$1 *  0 7    $ $       *  +        2 $  $  601$*  +   * *$  * $  + ** ** + $   + 2 $  $  60

 $ $  6+         +  =  +       +   

  ! "# "$!%!  "  "&'()*+) " 

@3 " %' '

79 < <%'< 789(!(  &%'%  )  )))  %!(A-$ )BB 00B C D )  )))  %!(A/

Problems worthy of attack prove their worth by hitting back.

Piet Hein, Danish scientist and philosopher (1905-1996)

Till min familj

List of Papers

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I Comparative genomic analysis of human and indicates a key role for indels in primate evolution. Wetterbom, A., Sevov, M., Cavelier, L., Bergström, TF. J Mol Evol (2006) Nov;63(5):682-90

II Genome-wide analysis of chimpanzee genes with premature termi- nation codons. Wetterbom, A., Gyllensten, U., Cavelier, L., Bergström, TF. BMC Genomics (2009) Jan 29;10:56

III Global comparison of the human and chimpanzee transcriptomes using Affymetrix exon arrays. Wetterbom, A., Ameur, A., Feuk, L., Gyllensten, U., Bergström, TF. Cavelier, L. Manuscript.

IV Deep sequencing of the chimpanzee transcriptome identifies numer- ous transcribed regions in frontal cortex and liver. Wetterbom, A.*, Ameur, A.*, Feuk, L., Gyllensten, U., Cavelier, L. Submitted.

* Both authors contributed equally.

Reprints were made with permission from the respective publishers.

Related papers by the author

Indels in the evolution of the human and primate genomes. Wetterbom, A., Cavelier, L., Bergström, TF. In: ENCYCLOPEDIA OF LIFE SCIENCES. John Wiley & Sons, Ltd: Chichester. (July 2008). http://www.els.net/ [doi:10.1002/9780470015902.a0020851].

Global and unbiased detection of splice junctions from RNA-seq data. Ameur, A., Wetterbom, A., Feuk, L., Gyllensten, U. Submitted.

Contents

Introduction...... 11 The genome...... 14 Genomic variation and nucleotide divergence between human and chimpanzee...... 14 The transcriptome: from DNA to RNA ...... 17 Protein coding genes ...... 17 Other transcription products...... 19 Implications for phenotypic divergence...... 21 Enabling technologies...... 25 Publicly available data: genome assemblies and annotations ...... 25 Gene expression arrays...... 26 Next-generation sequencing...... 27 Analysis of data from high-throughput methods...... 30 Present research ...... 32 Aims ...... 32 Paper I: Comparative genomic analysis of human and chimpanzee indicates a key role for indels in primate evolution ...... 33 Paper II: Genome-wide analysis of chimpanzee genes with premature termination codons ...... 35 Paper III: Global comparison of the human and chimpanzee transcriptomes using Affymetrix exon arrays ...... 36 Paper IV: Deep sequencing of the chimpanzee transcriptome identifies numerous transcribed regions in frontal cortex and liver...... 39 General discussion of the present research...... 40 A note on the ethical implications of working with primate samples ...... 44 Concluding remarks and future perspectives...... 46 Why do we bother? ...... 46 What have we learnt? ...... 46 Where do we go next?...... 47 Sammanfattning på svenska...... 49 Acknowledgements...... 51 References...... 53

Abbreviations

BAC Bacterial artificial chromosome Bp Base pair(s) cDNA Complementary DNA CNS Conserved noncoding sequence CNV Copy number variant DNA Deoxyribonucleic acid ENCODE The encyclopedia of DNA elements Gb Giga base pairs GO Gene Ontology EJC Exon junction complex EPHX1 Epoxide hydrolase 1 EST Expressed sequence tag FDR False discovery rate FOXP2 Forkhead box P2 GSTO2 Glutathione S-transferase omega 2 HPR Human proteome resource Indel Insertion or deletion Kb Kilo base pairs Mb Mega base pairs MYA Million years ago mRNA Messenger RNA miRNA Micro RNA ncRNA Noncoding RNA NGS Next-generation sequencing NMD Nonsense mediated decay RNA Ribonucleic acid PCR Polymerase chain reaction PTC Premature termination codon rRNA Ribosomal RNA SD Segmental duplication siRNA Short interferring RNA snRNA Small nuclear RNA snoRNA Small nucleolar RNA SOLiD Sequencing by oligonucleotide ligation and detection tRNA Transfer RNA TSS Transcription start site UTR Untranslated region

Introduction

One of the most famous scholars of Uppsala University was Carl von Linné (1707-1778), a Swedish botanist, zoologist and physician who is recognised as the founder of modern taxonomy. Linné noticed the morphological simi- larities between humans and the great and classified the species to- gether in his great work, Systema Naturae (Linné 1735). Although the deci- sion to group humans and great apes together was biologically correct, it was controversial since it challenged the prevailing religious view of humans as the crown of creation. It would take more than a century before Linné’s work was supported by Thomas Huxley (1864) and Charles Darwin (1871) who also argued that humans were evolutionarily related to (Figure 1) and other great apes. Today it is widely accepted that chimpan- zees represent humankind’s closest living relative and molecular evidence places the last common ancestor of human and chimpanzee 5-7 million years ago (Figure 2).

Figure 1. Chimpanzee at the Kolmården Zoo. Photograph by Anna Wetterbom.

Since the time of separation both lineages have acquired a number of changes in their genomes and the initial genome comparison (Mikkelsen et al. 2005) revealed a nucleotide divergence of ~ 1.2 % and an additional indel 11 divergence of ~ 3 %. Other types of highly variable regions have also been described, such as segmental duplications (Cheng et al. 2005b) and trans- posable Alu elements (Mikkelsen et al. 2005), further adding to the variation. This implies that the popularly cited ‘estimate’ that humans and chimpan- zees are > 98 % alike is not entirely true. It all depends on which type of variation we consider.

Figure 2. Species tree indicating the relationship between human, chimpanzee (Pan troglodytes), (Pan paniscus) and other closely related primates. Divergence times, in million years ago (MYA), are indicated to the left (Chen and Li 2001; Glazko and Nei 2003; Yu et al. 2003).

Changes in the nucleotide sequence provide the raw material for evolu- tion. If a genetic variant results in a phenotype that confers an advantage to the carrier, the frequency of this genetic variant is likely to increase in the population. This is referred to as positive selection. The opposite is known as purifying selection, which acts to remove genotypes that cause a deleterious phenotype. However, most changes in the nucleotide sequence are selec- tively neutral, or nearly neutral. Meaning that the molecular changes either (i) do not change the phenotype or (ii) the resulting phenotype is neither advantageous nor deleterious. The neutral theory of molecular evolution was proposed by Kimura (1968) and it states that the majority of genetic changes occuring during evolution have little or no functional impact. The theory was initially developed to account for genetic alterations and it has recently been extended by Khaitovich et al (2004b) who suggested that changes in gene expression also follow a model for neutral evolution. Their results imply that similar to changes at the DNA level, many genetically determined expres- sion differences within and between species have little or no biological ef- fect. Phenotypic divergence between humans and chimpanzees includes changes in morphology, behaviour, cognition and diet as well as different susceptibilities to diseases (Olson and Varki 2003). Several hypotheses have been proposed to explain the observed phenotypic variation. Perhaps the 12 most naïve explanation would be that the two species have different genes as a result of gain and loss of genes (Ohno 1970; Olson 1999; Fortna et al. 2004) or sequence divergence in protein coding genes (Bustamante et al. 2005). Another possibility is that the two species share their genetic makeup to a large extent and it is rather a matter of location, timing and the level of gene expression that make a difference. The importance of divergence in regulatory sequences was proposed by King and Wilson (1975) and the re- sulting divergence in gene expression (Preuss et al. 2004; Khaitovich et al. 2005) and alternative splicing (Calarco et al. 2007; Blekhman et al. 2009a) have been thoroughly studied. These hypotheses are not mutually exclusive and phenotypic divergence could be the result of a combination of variation at the coding, expression and splicing levels. It is commonly believed that by comparing the genomes and transcrip- tomes of human and chimpanzee we can increase our understanding of the differences that sets us apart as species. This is the main theme of this thesis. In the following two chapters, I will outline some types of genomic variation that may lead to transcriptional changes and thus eventually result in pheno- typic divergence between the species. Another prevailing theme of this thesis is the rapid development of methods enabling comparisons of genomes and transcriptomes and this will be discussed in a separate chapter. The rapid increase in the amount of available data has to some extent led to a change of paradigm where we have moved from the classical hypothesis-driven re- search presented by Karl Popper (Popper 1934) to data driven-research where we seek to explore and describe the data at hand and then formulate the hypothesis.

13 The genome

The genome1 consists of all DNA present in the nucleus and the mitochon- dria2of a somatic cell. It contains the information needed to assemble the individual cells into tissues and further into organs and finally a complete organism. Thus, genome comparisons provide a powerful method to extract information on the genetic differences between species and the changes that are unique to human. However, for a long time genome comparisons at nu- cleotide resolution were a mere dream and early comparative studies of hu- mans and chimpanzees made use of methods such as DNA-DNA hybridiza- tion techniques (King and Wilson 1975; Sibley and Ahlquist 1984) and chromosome banding (Yunis and Prakash 1982; Nickerson and Nelson 1998).

Genomic variation and nucleotide divergence between human and chimpanzee The chimpanzee karyotype differs from its human counterpart in several ways. The most obvious is the presence of and additional chromosome, i.e. the chromosomes 2 a and 2 b which are fused into the single chromosome 2 in human (Yunis and Prakash 1982). There are also a number of pericentric inversions (Yunis and Prakash 1982; Nickerson and Nelson 1998) that differ between the species. Although gross morphological changes are rare, there is a wealth of other types of events that differs between the genomes and here I will briefly outline them from the largest to the most fine-grained nucleotide substitutions. While karyotypic differences between human and chimpanzee have been known for a long time, the importance of smaller structural variation (re- viewed by Feuk et al (2006) and Sharp et al (2006)) has been uncovered only in recent years. Structural variation between the human and chimpanzee genomes includes both copy number variation (CNVs) and other types of rearrangements, such as inversions (Feuk et al. 2005). In addition, approxi- mately 5 % of the human genome consists of genomic regions, referred to as segmental duplications (Bailey et al. 2002a), which represent sequences that

1 Here I refer to the human, or chimpanzee genome, and not to genomes in general. 2 In the present investigations the mitochondrial genome has not been studied. 14 have been duplicated within the genome. Segmental duplications have also been studied in humans and chimpanzees (Cheng et al. 2005b). A copy number variation is a segment of DNA present in variable num- bers when two genomes are compared, the size may range from a few kb up to several Mb. CNVs provide a mechanism for gene duplication and expan- sion of gene families and are thus interesting from an evolutionary perspec- tive. Perry et al (2006; 2008) have compared the patterns of CNVs between human and chimpanzee and also studied the variation within chimpanzees. The authors find that CNVs often occur in orthologous regions in the two species and that CNVs are strongly associated with segmental duplications (SD). They observe a group of CNVs that differ in number between the spe- cies and furthermore a group of CNVs that vary in number within the chim- panzee population. Segmental duplications (SD) are large blocks of genomic sequence that occur in two or more copies within a genome (low-copy repeats, > 1 kb in length and 90 % sequence similarity (Bailey et al. 2002a)). A survey of segmental duplications in chimpanzee and human (Cheng et al. 2005b) shows that SDs are an important source of variation between the two ge- nomes. Although the species share the majority of base pairs involved in duplications (i.e. they occurred before the divergence of humans and chim- panzees), there is also a large lineage specific proportion including 31.9 Mb specifically duplicated in human and 36.2 Mb duplicated in chimpanzee (Cheng et al. 2005b). By comparing the copy numbers of shared SDs the authors found that due to a rapid expansion of chimpanzee duplications, the chimpanzee genome has increased in size by as much as 1 %. Many anno- tated SDs in both the human and chimpanzee genomes are variable in copy number within the respective species, whereas others have been driven to fixation. As a consequence, SDs (and CNVs) may harbour duplicated genes (reviewed by Bailey and Eichler (2006)). Insertions and deletions (indels) are genomic regions that are present, or absent, in one genome relative to another. They will show up as gaps in an alignment and to determine if a region is inserted in one species, or deleted in the other, a third species (an outgroup) is required for the comparison. Although CNVs and SDs may be viewed as indels, the term is more com- monly used for smaller events. The initial analysis of the chimpanzee ge- nome (Mikkelsen et al. 2005) revealed that the human and chimpanzee ge- nomes contain 40-45 Mb of lineage-specific euchromatic sequence, amount- ing to an estimated indel divergence of 3 %. One important group of indels are transposable Alu elements that are present in higher numbers in the hu- man than in the chimpanzee genome (Mikkelsen et al. 2005; Mills et al. 2006). The presence of species-specific Alu elements will be manifested as indels of ~ 300 bp in an alignment. Although large indels contribute more to the overall indel divergence, the most commonly occurring indels are single nucleotides (Watanabe et al. 2004; Wetterbom et al. 2006; Chen et al. 2007). 15 Shorter indels may result from replication slippage in tandemly repeated regions (e.g. micro satellites) (Levinson and Gutman 1987) or recombination and unequal crossing over (Kvikstad et al. 2007). Substitutions are base pairs changes. This type of divergence has been thoroughly studied (Chen and Li 2001; Ebersberger et al. 2002; Watanabe et al. 2004; Mikkelsen et al. 2005) and with the completion of the chimpanzee genome (Mikkelsen et al. 2005), the nucleotide divergence was confirmed to be ~ 1.2 %. In parallel to indels, substitutions occur less often in coding re- gions than in intronic and intergenic sequences. A wide range of analysis tools has been developed to estimate substitutions rates from DNA samples and to use this information to detect selection of various kinds. The different types of structural variation and nucleotide substitutions are likely to affect transcription and I will discuss this in more detail in follow- ing chapters.

16 The transcriptome: from DNA to RNA

Whereas the genome remains more or less the same in all cells in an organ- ism, the transcriptome is a dynamic entity changing between both different cell types and developmental stages and as a response to external stimuli. The transcriptome is a representation of the transcribed parts of the genome at a given time point in a given cell or tissue type. This raises the question of which time point and which cell or tissue to study. Although single cells may be isolated from a tissue section using laser capture micro dissection (Emmert-Buck et al. 1996), this is tedious work and most often a tissue sam- ple consisting of a mixed cell population is studied.

Protein coding genes According to the classical central dogma of molecular biology (Crick 1970), coding regions of the genome (i.e. genes) are transcribed into mRNAs which are in turn translated to proteins by the ribosomes. The primary transcript contains both exons and introns. Further processing removes the introns leaving the mature mRNA to only include exonic regions. During this proc- ess a 5’ cap (an altered guanine) and a 3’ poly(A)-tail are added to ensure stability of the mRNA and protect it from enzymatic degradation. Subse- quent translation of the mRNA into a protein is regulated by sophisticated interplay between the DNA sequence and regulating proteins (e.g. transcrip- tion factors).

Generation of mRNA diversity When the primary transcript is processed into a mature mRNA the exons may be joined together in different ways, a process know as alternative splicing, which greatly increases the diversity of proteins generated from a single gene. The number of identified alternatively spliced genes is steadily increasing and a recent study (Wang et al. 2008) suggests that almost all multi-exon genes have multiple isoforms. Several types of alternative splic- ing exist including differential use of 5’ and 3’ splice sites (resulting in dif- ferent exon-intron boundaries), alternate 5’ and 3’ exons, exon skipping, mutually exclusive exons, intron retention and alternative polyadenylation (Black 2003) (see Figure 3 for examples). The splicing machinery responsi- 17 ble for excising introns and joining exons is the spliceosome, a complex consisting of five small nuclear ribonucleoproteins (designated U1, U2, U4, U5 and U6) and several accessory proteins which assemble onto the primary transcript (Black 2003). Numerous signals in the sequence guide the spli- ceosome and the intron-exon boundaries are recognised by splice sites. The 5’ end of the intron has a consensus GU splice site and at the 3’ end of the intron there is a branch site (defined as yUnAy3) followed by a polypyrymidine tract and finally a consensus AG splice site (Gao et al. 2008).

Figure 3. Different types of alternative splicing. Figure from Ward and Cooper (2009), reprinted with permission from Wiley & sons Ltd.

After splicing has occurred some proteins remain as a mark of the exon- exon junction (i.e. the exon junction complex (EJC)) and this has important implications if the mRNA contains a premature termination codon (PTC). In this sense, a PTC is defined as a termination codon occurring upstream of the termination codon at the 3’ end of the mRNA. (Maquat 2004; Chang et al. 2007). During the first round of translation the ribosome displaces the EJCs but this does not happen if a PTC occurs > 50 bp upstream of the last exon-exon junction. In this case, the mRNA is instead degraded by non- sense-mediated decay (NMD), a surveillance mechanism ensuring that aber- rant mRNAs are not translated into truncated proteins but rather degraded in the cell. However, if the PTC occurs in the last exon of an mRNA, there is no downstream EJC and thus the transcript will not be degraded by NMD (Maquat 2004; Chang et al. 2007). Paper I and II of this thesis deals with detection of PTC in chimpanzee genes and the possibility that PTCs in combination with NMD alters the

3 ‘y’ denotes a pyrimidine, i.e. the nucleotides C, T (in DNA) and U (in RNA) 18 repertoire of available transcripts in a species. PTCs may result from non- sense mutations or from frame shift indels and several studies have sug- gested that PTCs coupled to NMD may provide a post-translational mecha- nism for regulating gene expression (Lewis et al. 2003; Hillman et al. 2004; Mendell et al. 2004; Saltzman et al. 2008). To descibe the proposed action let us consider a transcript with two mutually exclusive exons, A and B. In normal splicing, either of the exons is included and the resulting transcript may be transported to the ribosome and translated into a protein. On the other hand, if neither or both of the exons are included, it introduces a frame shift causing a PTC and thereby elicit NMD. Altering the splice conditions to produce unproductive transcripts in this way allows the cell to respond rapidly to changing needs and modify the mRNA level after transcription. This is achieved by directing some proportion of already transcribed pre- mRNAs to the nonsense mediated decay pathway (Lewis et al. 2003). At first it may appear a waste of energy and resources for the cell to transcribe a gene and then directly degrade the transcript but the added layer of gene regulation may compensate for this extra cost. The effect of the coupled action of PTCs and NMD as a modulator of transcript levels has been veri- fied experimentally (Mendell et al. 2004; Saltzman et al. 2008).

Other transcription products Not all types of transcription conform to the classical central dogma of mo- lecular biology and in addition to mRNAs that encode proteins, several other types of noncoding RNAs (ncRNAs) are known (of which some are summa- rised in Table 1).

Table 1. Overview of common types of ncRNAs (Mattick and Makunin 2006; Gin- geras 2007). Type of RNA Function Micro (mi)RNAs Regulate gene expression through RNA interference. Small interferring (si)RNAs Regulate gene expression through RNA interference. Small nuclear (sn)RNAs Involved in splicing. Small nucleolar (sno)RNAs Modifies rRNA. Transfer (t)RNAs Involved in mRNA translation. Ribosomal (r)RNAs Involved in mRNA translation.

Noncoding RNAs function as structural, catalytic or regulatory RNAs and affect several processes in the cell including gene expression, epigenetic modifications, transcription, alternative splicing and translation (reviewed by Eddy (2001) and Mattick (2006)). The development of high-throughput methods, such as tiling arrays and next-generation sequencing (NGS), has 19 rapidly expanded the repertoire of ncRNAs even further (reviewed by Jac- quier (2009)), with the discovery of an increasing number of antisense tran- scripts of different lengths.

Figure 4. Example of the transcriptional divergence of a genomic locus. The upper panel displays four hypothetical protein coding genes (with exons as red boxes) on both strands and the bottom panel is a blow up of one of the loci. The protein coding gene, located on the 5’ strand, has two alternative transcription start sites (TSS) (green arrows) and in addition, several ncRNAs are transcribed from intronic re- gions. The 3’ strand harbours two antisense ncRNAs with different TSS (green ar- rows) and some other ncRNAs. Figure from Gingeras (2007) and reprinted with permission from Cold Spring Harbor Laboratory Press.

Less than 2 % of the bases of the human genome are covered by protein coding genes and yet the prevailing view is that the vast majority of bases in the mammalian genome are transcribed (Carninci et al. 2005; Cheng et al. 20 2005a; Kapranov et al. 2007). In the ENCODE pilot study, covering 1 % of the human genome, it was estimated that > 90 % of the bases were tran- scribed at some point, although only ~ 15 % were transcribed in a single tissue. The results suggest that transcription is to some extent tissue-specific. The importance of ncRNAs is emphasized by a study from Cheng et al (2005a) where the amount of non-polyadenylated RNA is more than twice as high as the amount of poly(A)+ RNA. Transcription often occurs in an over- lapping fashion (Birney et al. 2007) and a specific genomic locus may pro- duce several types of transcripts, both coding and noncoding, and from both the sense and the antisense strand (Figure 4). Whereas some regions of the genome are heavily transcribed other regions appear to be ‘transcriptional deserts’ (Carninci et al. 2005; Gerstein et al. 2007). The biological function and importance of the observed pervasive transcription is only beginning to be elucidated and further understanding will benefit from the recent ad- vances in sequencing technology. With the advent of next-generation sequencing techniques it has become possible to survey the transcriptome at unprecedented depth and recent stud- ies in human (Cloonan et al. 2008; Sultan et al. 2008) and mouse (Mortazavi et al. 2008) have confirmed the notion that transcription is a very complex process. As displayed in Figure 4 there is much evidence that the traditional concept of a ‘gene’ may need to be revised and Gerstein et al (2007) have proposed to redefine the term to focus on the final functional product. The classical central dogma of molecular biology with one gene - one transcript - one protein needs to be rephrased to encapsulate the complex transcriptional landscape that has emerged in the last few years.

Implications for phenotypic divergence I have so far described several types of genomic differences between human and chimpanzee and the basics of transcription, but how does this relate to phenotypic divergence and speciation? The vast majority of mutations occur- ring at the DNA level are neutral4 (Kimura 1968) and whether or not they become fixed in a population is a process that depends on the mutation rate and on random drift. Nevertheless, if a DNA mutation leads to an enhanced function of the resulting transcript (for RNA genes) or protein, the carrier of that mutation may have a reproductive advantage. The mutations will then be selected for and possibly increase in frequency in the population, a proc- ess known as positive selection. Although positive selection is relatively rare, it is interesting from an evolutionary point of view since positively selected genes are likely to be candidates for understanding species-specific traits. On the other hand, if the mutation is deleterious enough it will instead

4 Commonly referred to as ‘the neutral theory of molecular evolution’ . 21 be selected against and eventually cleared from the population. In recent years, the neutral theory of molecular evolution has been extended from nucleotide mutations to gene expression (Khaitovich et al. 2004b) and it is possible that the theory may apply to all types of transcriptional divergence. The increasingly comprehensive catalogues of differences between human and chimpanzee largely reflect random changes and the challenge is to find the functionally important differences. In this chapter I will discuss some published examples from the literature of how others have searched for genes under positive selection. Moreover, I will exemplify how structural and nucleotide variation may have impacted the phenotypic divergence between human and chimpanzee.

Protein coding sequences Substitutions represent the most subtle changes in the genome and several attempts have been made to detect positively selected genes by using differ- ent measures of substitution rates (Clark et al. 2003; Bustamante et al. 2005; Nielsen et al. 2005; Bakewell et al. 2007). These studies propose that a number of genes, such as those involved in olfaction, nuclear transport, im- mune defence and some transcription factors, have been targeted by positive selection in humans. An intriguing example of a gene where substitutions have been suggestively associated with a phenotype is the FOXP2 gene, which is involved in the human ability to develop language (Lai et al. 2001). This gene is highly conserved between human and mouse but has acquired two amino acid substitutions on the human lineage during the last ~ 6 mil- lion years. The changes have become fixed and Enard et al (2002b) suggest that FOXP2 has been positively selected for in our species. In addition to altering the amino acid composition, substitutions may also affect regulatory sequences and thereby modify the repertoire of splice variants (reviewed by Ward and Cooper (2009)). Such alterations have been implicated in a num- ber of human diseases (Ward and Cooper 2009) and it is reasonable to as- sume that they also contribute to the divergence between species. Structural variation in the genome may lead to gain or loss of lineage- specific genes (or parts of genes, such as exons) as has been shown by Fortna and colleagues (2004). Using an array-based approach they identified a group of genes with different copy numbers between species. Furthermore, they noted that an increase of the copy number of genes was more common than a decrease, both in humans and in other great apes. The group of genes with a copy number expansion in human included a number of genes impli- cated in the structure and function of the brain (Fortna et al. 2004). Changes in copy number of genes may lead to differences in the amount of resulting gene product. Furthermore, one of the gene copies may also evolve into a pseudogene or evolve new functions when it is no longer under functional constraint. 22 Differential gene expression between human and chimpanzee was first studied with microarrays by Enard et al (2002a), who noted that changes in gene expression in the brain have been accelerated on the human linage. The results were supported by two subsequent array studies (Caceres et al. 2003; Uddin et al. 2004), both of which found that genes that were upregulated in the human brain were often involved in neuronal function. Khaitovich et al have later suggested that differences in gene expression and protein diver- gence have evolved in parallel (Khaitovich et al. 2005) and under neutral expectations. Moreover, they found that both amino acid sequences and gene expression have diverged less for genes expressed in several tissues, as com- pared to genes with tissue-specific expression. This indicates that genes with a broad expression profile are more conserved. A weak association has also been observed between gene expression differences between species and large-scale genomic variation, including both chromosomal rearrangements (Khaitovich et al. 2004a; Blekhman et al. 2008) and segmental duplications (Blekhman et al. 2009b). If structural variation causes a gene to be present in different copy numbers in the human and chimpanzee genomes, we would expect to observe differential gene expression between the species. Although the results may be biologically plausible, they could also reflect a technical bias due to cross-hybridization on the arrays. Blekhman et al (2009b) try to control for this bias and suggest an alternative explanation. When a gene is copied to a different location, this could change the expression in response to the new genomic context with different regulatory regions. Finally, differen- tial gene expression between species could be caused by changes in the use of promoters (Haygood et al. 2007), splice sites or different use of transcrip- tion factors (Nowick et al. 2009). The latter theme has been studied by Gilad et al (Gilad et al. 2006) who noted that transcription factors were over repre- sented among genes with an increased expression in humans, as compared to other primates. Since transcription factors are important regulators of gene expression, the results support the proposal by King and Wilson (1975) that divergence between human and chimpanzee can be explained by alterations of gene regulation rather than the structure of proteins. While differences in steady state levels of gene expression have been studied by several groups, less is known about differences in alternative splicing between human and chimpanzee. In a pioneering study Calarco et al (2007) used a custom made array to profile ~ 5000 (1700 after filtering) exons. The authors found that 6-8 % of the orthologous exons have different splice patterns in human and chimpanzee. Genes with different splicing pat- terns typically do not overlap with genes with differential expression, sug- gesting that the two layers of regulation have evolved to affect different sub- sets genes. Genes with different splice levels were involved in gene expres- sion, signal transduction, cell death and immune defence (Calarco et al. 2007). As an example the gene GSTO2, which has been associated with hu- man age-related diseases, was suggested to be less active in human than in 23 chimpanzees. The general results were confirmed in a subsequent study by Blekham and colleagues (2009a) where they used RNA-seq5 to survey splic- ing patterns in liver samples from human, chimpanzee and macaque. Ap- proximately 7 % of the analysed transcripts had different splicing levels between human and chimpanzee and the number of human-specific exons was lower than the corresponding figure for chimpanzee. Moreover, the study identified a group of genes with conserved sexually dimorphic gene expression. Both studies (Calarco et al. 2007; Blekhman et al. 2009a) sug- gest that alternative splicing may be as important as differential gene expres- sion for the evolution of species-specific traits.

Noncoding sequences Considering that a large proportion of transcripts originate from outside pro- tein coding genes it is reasonable to assume that ncRNAs may also be impor- tant for the evolution of human-specific traits and although not very well studied, there are some examples of this. Pollard et al (2006) scanned the genome for evolutionary conserved regions with an increased mutation rate on the human lineage, since the divergence from the most recent common ancestor with the chimpanzee. The top candidate region (HAR1) was found to be part of two noncoding RNA genes, HAR1F and HAR1R. The HAR1F gene is predicted to form a stable secondary structure, which is slightly dif- ferent between the species due to nucleotide substitutions. Expression profil- ing revealed that HAR1F is expressed in the developing brain while in later developmental stages both HAR1F and HAR1R are found in brain samples leading Pollard et al to suggest that HAR1R regulates HAR1F. Expression of both RNA genes is conserved in primates and it was speculated that the con- formational difference between human and chimpanzee could affect the re- spective expression patterns. In another study, Prabhakar et al (2006) sought to identify conserved noncoding sequences (CNS) with accelerated evolution on the human line- age. They find that CNS located close to genes with neuronal function dis- play the strongest signal of human-specific sequence evolution. Although CNS with accelerated evolution in chimpanzee are also associated with neu- ronal function, this group of CNS is different from the group identified in human. Prabkhar et al to conclude that different noncoding regulatory ele- ments may have contributed to modifications in brain functions and in shap- ing the different neuronal phenotype in the two species.

5 RNA-seq is described in detail in the chapter Enabling technologies. 24 Enabling technologies

Large-scale studies such as comparisons of genomes and transcripts rely heavily on technological advancess, especially in the field of sequencing. In this chapter I will introduce the techniques we have used and discuss some of their advantages and pitfalls.

Publicly available data: genome assemblies and annotations A wealth of biological data is now available at our fingertips and only a mouse-click or two away. Publicly available resources comprise everything from comprehensive genome browsers (e.g. Ensembl (Hubbard et al. 2009), NCBI (Sayers et al. 2010) and UCSC (Rhead et al. 2010)), with large collec- tions of genomes and associated sequences, to more specialized databases. The first two papers in this thesis deal with in silico work based on alignments of human and chimpanzee genomic sequences. In Paper I, we used chimpanzee sequences generated from BAC clones (Watanabe et al. 2004) and aligned to the human genome and in the subsequent study we instead used the genome sequence from both human (Lander et al. 2001; Venter et al. 2001) and chimpanzee (Mikkelsen et al. 2005) as our starting material. A major difference is that the BAC clone sequences of chromo- some 216 have higher quality than the chimpanzee genome in general and thus the genome comparison required more thorough filtering to ensure that the results were not biased by low sequence quality. Genome and transcriptome studies rely on the annotation of known genes and transcripts. Several databases provide annotations at different levels of confidence and we have used Ensembl annotations of human genes for Pa- per I and II. For the exon arrays in Paper III we used the core set of probes (annotations provided by Affymetrix), which target human RefSeq genes (Pruitt et al. 2007). In the final Paper IV, we relied both on RefSeq annota-

6 Note that chromosome 21 was previoulsy refered to as chromosome 22 in chimpanzee. Throughout this thesis I will use the numbering scheme suggested by McConkey (McConkey EH. 2004. Orthologous numbering of great and human chromosomes is essential for comparative genomics. Cytogenetic and genome research 105(1): 157-158.), except in the specific section about Paper I. 25 tions of chimpanzee genes and on additional annotations of human mRNAs and ESTs from the UCSC Genome Browser. A general concern in all our studies is that the gene annotations are human-based (even the chimpanzee RefSeq genes are inferred from human sequences) and this causes at least two types of problems in the analyses. First, we may incorrectly infer se- quence to be coding in the chimpanzee and secondly we may miss coding regions that are chimpanzee-specific. Large-scale sequencing efforts of the chimpanzee transcriptome and de novo gene predictions may remedy this anthropocentric bias. Gene Ontology (GO) (Ashburner et al. 2000) is commonly used to assign a biological function to genes identified using different bioinformatic or experimental methods. The GO project uses a set of defined terms to de- scribe genes and their products and thereby provides standardised annota- tions of biological process, cellular component and molecular function. High-throughput methods often result in lists of potentially interesting genes and a common approach is to assign GO annotations and test if, for example, certain molecular functions are enriched for in a set of genes as compared to a genomic background. A persistent problem with GO analyses is the in- completeness of the gene annotations and a comparison rendering a negative result may be an artefact of insufficient annotations. This problem will de- crease as the onotologies become even more comprehensive.

Gene expression arrays Gene expression arrays were developed in response to an increasing need to simultaneously monitor the expression of large sets of genes (more precisely the mRNA levels). The first expression array targeted 45 genes (Schena et al. 1995) and in the following years two complementary array platforms for gene expression have been developed: (i) spotted arrays with probes consist- ing of cDNA or PCR products and (ii) oligonucleotide arrays with synthe- sized (generally shorter) nucleotide probes. In Paper III in this thesis we used the second type of arrays, namely the Affymetrix exon array (Human Exon 1.0 ST Array) to survey transcription at high resolution. The exon array interrogates over a million exon clusters with different levels of supporting evidence and we used the most reliable core set of probes (i.e. supported by RefSeq mRNAs) which included 17,880 genes. Each exon is generally targeted by one probe set consisting of four probes and core genes have an average of 14 probe sets per gene. The dense probe coverage allows for both gene expression estimates and for a more refined analysis of expression at the exon level. However, the first generation of exon arrays from Affymetrix did not include any junction probes (i.e. probes spanning the junction between two exons) and thus it is

26 not possible to use the expression data to infer which exons are joined to- gether in transcripts. Using arrays designed for one species to measure gene expression in an- other species presents some problems since sequence divergence is expected to introduce mismatches between a number of probes and their target se- quence. Gilad et al (2005b) have addressed this problem by manufacturing multi-species arrays with species-specific probes for several primates. Al- though useful for some applications this approach requires custom made arrays. An alternative approach, sometimes referred to as ’cross-species hy- bridization’, is to hybridize primate samples to commercially available hu- man-centred arrays (Enard et al. 2002a; Caceres et al. 2003). Hsieh et al (2003) have showed that a sequence divergence as low as 1 % has a signifi- cant influence on the measured expression levels and thus some method for masking probes has to be employed. In Paper III we used a simple strategy and only included probes with a single (100 % identity) match in both ge- nomes, thereby circumventing both the mismatch problem and the issue with cross-hybridizing probes (i.e. probes targeting several locations in the ge- nome). Unfortunately, this led to a reduced number of transcripts available for analysis and it is possible that we introduced a new bias by removing diverged genes. Alternatively, Dannemann et al (2009) have proposed to mask probes based on the expression data itself and remove probes that are uncorrelated with other probes within a probeset or transcript. The authors suggest that their approach removes fewer probes and thus increases the power of the study.

Next-generation sequencing Sequencing projects used to rely on cloning the DNA fragment of interest into a bacterial vector, followed by amplification (either by PCR or in bacte- ria), purification of the templates and finally Sanger sequencing (Sanger et al. 1977). Although a tedious and expensive process, it was the basis for sequencing of e.g. the human (Lander et al. 2001; Venter et al. 2001) and many other genomes. In the last five, years a sequencing revolution has oc- curred and with the advent of ‘next-generation sequencing’ (NGS) methods the throughput has increased immensely and the cost per sequenced base has dropped. Presently there are four technologies available for high-throughput sequencing (summarized in Table 2) and several others platforms are lining up (e.g. Polonator (Shendure et al. 2005) and Pacific Biosciences (Eid et al. 2009)). Genome sequencing is also being offered as a service by the com- pany Complete Genomics (Drmanac et al. 2010). Although the NGS plat- forms are based on different technologies, some general features are shared: all platforms use a combination of chemical and enzymatic reactions to se- quence fragments in a highly parallel fashion, using image analysis for the 27 read out. The obtained sequence reads are either mapped to a reference ge- nome or used for de novo assembly.

Table 2. Overview of next-generation sequencing technologies and the amount of data typically obtained7. Number of reads from a Technology Read length single run 454 (Roche) 400 bp 1.2 million (Margulies et al. 2005) SOLiD (Applied Biosystems) 50 bp 800 million (Valouev et al. 2008) Illumina Genome Analyser 200 bp 300 million (Bentley et al. 2008) Helicos 30-35 bp 400 million (Pushkarev et al. 2009),

Since we have used the AB-SOLiD platform in Paper IV I will describe the technology in more detail. In our transcriptome analysis, the RNA popu- lation was enriched for mRNAs by using a poly(T) primer, the RNA was then reverse transcribed into cDNA, amplified by PCR and finally used as the basis for the fragment library. More recently, protocols have been devel- oped that can start from total RNA, followed by rRNA depletion, fractioning of the RNA, ligation of sequencing adapters and finally generation of cDNA in a strand-specific manner. The resulting cDNA is then used for the library preparation. In either case, adaptors are ligated to the fragmented DNA and the tem- plate is clonally amplified on 1-μm beads in an emulsion PCR. After the enrichment of beads with bound DNA the beads are deposited on a glass slide and covalently attached to the surface. The first sequencing primer is hybridized to the adaptor (which is attached to the template) and the subse- quent sequencing relies on two-base encoded probes, outlined in Figure 5. Degenerate probes, where the first two bases are known, interrogate the tem- plate sequence and ligate onto the growing strand (panel A). A fluorophore is attached to the probe and after ligation the fluorophore is detected by im- age analysis and then cleaved. After several cycles of ligation, cleavage and detection the extension product is removed. Panel (B) shows the results from the sequencing round with the first primer, where base pairs 1-2, 6-7, 11-12, 16-17 and 21-22 are read. In the following step a new primer (starting at n-1) is hybridized and the ligation process is repeated. Five sequencing primers with different start sites are used. This way every base in the template is read twice with two different primers, resulting in two colour calls.

7 The numbers were obtained from the websites of the respective companies, January 2010. 28 A

B

Figure 5. Illustration of the SOLiD sequencing chemistry. ‘XX’ denote the known base pairs in the degenerate probes. The letters B, G, R and Y stands for blue, green, red and yellow and the numbers (in panel B) represent the base pairs read by a cer- tain probe. Figures provided by Applied Biosystems, Life Technologies copyright 2010.

Two-base encoding is illustrated in Figure 6. In the upper panel (A) of the example, the target sequence starts with ATT and it is first interrogated by a red probe and subsequently by a blue probe. Colour calls from all sequenc- ing cycles are lined up and aligned to a reference genome (panel B). This step translates the ‘colour space’ into the well known letters of a DNA se- quence. With two-base encoding, a single nucleotide polymorphism will result in two colour changes, in contrast to sequencing errors, and thus in- crease the accuracy. The different NGS platforms have different features and may be used for a great variety of applications: de novo sequencing, genomic re-sequencing, targeted re-sequencing of specific regions or complete genomes of higher organisms, studies of chromatin structure with ChIP-seq (Wold and Myers 2008) or transcriptome profiling by RNA-seq (Nagalakshmi et al. 2008). The latter is a method for transcriptome profiling where RNA molecules are re- verse transcribed into cDNA, ligated to adaptors, amplified and sequenced with some NGS technology (reviewed by Wang et al (2009)). RNA-seq has

29 been successfully applied to poly(A)-selected populations of RNA to esti- mate gene expression in tissues (Mortazavi et al. 2008; Pan et al. 2008; Wang et al. 2008), cell lines (Cloonan et al. 2008; Sultan et al. 2008) and even in a single cell (Tang et al. 2009). Compared to expression arrays, RNA-seq has several advantages: higher throughput, simultaneous detection of gene expression and alternative splicing without prior knowledge of gene boundaries or exon-exon junctions, and the ability to find novel transcribed regions.

A

B

Figure 6. Illustration of the SOLiD two-base encoding (A) and sequence alignment in colour space (B). Figures provided by Applied Biosystems, Life Technologies copyright 2010.

Analysis of data from high-throughput methods High-throughput methods result in very large datasets. For example, in Pa- per IV we acquired over 500 million sequence reads and the most recent version of the SOLiD platform can produce even more sequences. This

30 places new demands on virtually all steps in the data handling, storage and bioinformatic analyses. For interpretation of the results from both expression arrays and NGS, we commonly use several different types of annotations obtained from data- bases. Gene annotations are used to define coding exons and UTRs. To fur- ther understand the genomic neighbourhood, other types of annotations (e.g. repeats, GC-content, nucleotide diversity and structural variation) are help- ful. Biological function of genes is commonly assigned with the help of Gene Ontology and other tools (e.g. Ingenuity Pathway Analysis) are used to identify biological pathways and visualize how genes are regulated by other genes in a network. To test hypotheses on the data it is necessary to use a statistical frame- work for testing. For example, one may be interested in genes that are differ- entially expressed between samples or genes with a certain biological func- tion. This type of analysis usually addresses many thousands of genes at the same time and by doing so we increase the risk of falsely predicting a signal to be significant simply because of the large number of statistical tests. This is known as the multiple testing problem and one way of avoiding this type of error is to control the false discovery rate (FDR) (Benjamini 1995), which is the proportion of tests where we falsely predict the result to be true. As an example, if we test 10,000 genes for differential expression and the FDR is 0.01 we would expect 100 genes to be incorrectly inferred as differentially expressed. A more conservative approach to control for multiple testing is the Bonferroni correction, where the significance threshold is adjusted ac- cording to the number of tests performed.

31 Present research

In this chapter I will cover the four Papers on which the thesis is based but before moving on to the individual Papers I will present the overall aims for the thesis work. The chapter ends with a summarizing discussion of the pre- sented results and some common themes relating to primate evolution and comparative studies of genomes and transcriptomes.

Aims The general aim of all manuscripts included in this thesis was to characterize genomic differences between human and chimpanzee, both in terms of the genome sequences and the resulting transcripts. More specifically we aimed to:

• investigate indel divergence between human and chimpanzee in general and specifically indels causing PTCs in chimpanzee genes (Paper I and Paper II) • investigate a possible correlation between PTCs and other genomic properties in the chimpanzee (Paper II) • investigate the transcriptional divergence between human and chimpan- zee, using high-density exon arrays, with respect to differential gene ex- pression and splicing (Paper III) • perform a thorough profiling, using next-generation sequencing, of the chimpanzee transcriptome in order to identify novel transcribed regions (Paper IV).

32 Paper I: Comparative genomic analysis of human and chimpanzee indicates a key role for indels in primate evolution In the first paper we aligned BAC sequences from chimpanzee chromosome 228 (Watanabe et al. 2004) to the human orthologous chromosome 21 and investigated the extent of sequence divergence. We found that the genomic divergence attributed to indels (~ 5 %) was higher that that caused by nu- cleotide substitutions (~ 1.5 %). Both indels and substitutions occurred fre- quently in repeated regions and when repeats and low-complexity DNA were excluded the total divergence decreased to 2.4 %. Indels ranged from one base to 55.6 kb in length and although single bp indels were by far the most common, longer indels contributed more heavily to the overall indel divergence. We noted that indels caused a net loss of 0.5 Mb in chimpanzee (or a net gain in human). In contrast to a previous study (Watanabe et al. 2004), we observed a cor- relation between the occurrence of indels and substitutions across the length of the chromosome. The discrepancy between the two studies is likely caused by different alignment algorithms since the choice of method and parameters influences the occurrence or substitutions in relation to indels. Our results indicated that some underlying genomic properties affected both types of events (explored in detail in Paper II). This was further emphasized by the fact that some chromosomal regions were more prone to divergence caused by both indels and substitutions. Even though indels were found to occur more frequently in noncoding re- gions than in coding regions we found that 13 % of genes harboured indels within the coding sequence. We focused on indels in protein coding genes (n = 224 from Ensembl (Birney et al. 2004)) since we reasoned that such events were more likely to be functionally important, as compared to indels in noncoding regions. The majority were 3n-indels that altered the amino acid composition but kept the reading frame intact. However, we also ob- served a small number of genes (n = 9) with frame shifting indels that re- sulted in PTCs, and two additional genes where substitutions caused PTCs. Transcripts affected by PTCs may result in truncated proteins but more likely they will be degraded by NMD and be cleared from the cell (Hentze and Kulozik 1999; Maquat 2004). We noted that many of the PTC genes had multiple annotated transcripts in Ensembl and that the PTC did not affect all of them. PTCs may thus have the potential to alter the transcriptional output

8 The naming convention of chimpanzee chromosomes has changed after the publication of Paper I and the chromosome previously known as 22 is now referred to as 21 (McConkey EH. 2004. Orthologous numbering of great ape and human chromosomes is essential for compara- tive genomics. Cytogenetic and genome research 105(1): 157-158. However, when discussing Paper I the old numbering scheme will be used to comply with the published paper. 33 of a gene (see Figure 7 for an example) and this was explored in more detail in Paper II. To ensure that the PTCs were not an artefact of low sequence quality we compared the region to the chimpanzee genome sequence and re-sequenced a region surrounding the event rendering the PTC. In order to address the issue of the ancestral state we sequenced the same region in and also downloaded sequence for macaque and dog from Ensembl. This way we determined that the indel event appeared on the human lineage in three cases and in the remaining eight genes the indel (or substitutions) appeared on the chimpanzee lineage.

Figure 7. The FAM3B gene is an example of a gene where we detected an indel causing a PTC in chimpanzee. Panel A is a schematic overview of the four annotated transcripts with coding exons in light grey and UTR exons in white. We predicted a PTC in the second exon of the bottom transcript and the alignment revealed a ten bp indel. Panel B displays a species tree with human (H), chimpanzee (C) and several outgroups gorilla (G), macaque (M) and dog (D). Our analysis indicated that since the indel was found in comparisons between human and all other related species, the most parsimonious explanation was that an insertion occurred on the human lineage after the human chimapnzee split (indicated by a light grey box in the species tree). The indel was predicted to shift the human reading frame and thereby removed the PTC found in all the other species. This suggested that the transcript has been gained in the human lineage. Figure from Wetterbom et al (2008. ), reprinted with permis- sion from The Encyclopedia of Life Sciences, Wiley & sons Ltd.

34 At the time when Paper I was initiated most work on sequence divergence had focused on nucleotide substitutions (Chen and Li 2001; Ebersberger et al. 2002) and few papers had studied indels as a source of genomic variation in primates (Bailey et al. 2002b; Britten 2002). Since then, other groups have also suggested that indels of different sizes may play an important role in primate evolution (Frazer et al. 2003; Watanabe et al. 2004; Mikkelsen et al. 2005) and our results further support this view.

Paper II: Genome-wide analysis of chimpanzee genes with premature termination codons Having identified a handful of PTC genes on a single chimpanzee chromo- some, we wanted to extend the comparison to the complete genome. Fur- thermore, we wanted to study the genomic location and biological function of PTC genes, as well as the position of PTCs within the affected genes. We started with Ensembl annotations of human genes (and correspond- ing transcripts) and inferred the chimpanzee orthologue based on genomic alignments, employing several filtering steps to ensure that the detected PTCs were not the result of poor sequence quality or erroneous gene predic- tions. The final data set comprised 13,487 genes of which ~ 8 % (n = 1,109) had at least one associated transcript with a PTC in chimpanzee. Seventy percent of the PTCs were caused by frame shift indels and the remainder were derived from nucleotide substitutions that rendered novel termination codons. The proportion of PTC genes varied both between and within chromo- somes, suggesting that the density of PTC genes reflects some underlying genomic properties. To pursue this further, we investigated a possible corre- lation between the density of PTC genes and the following genomic features: GC-content, presence of repeats, CpG-islands and segmental duplications as well as indel divergence, substitutions and the distance to the telomere and centromere. We found the density of PTC genes to be correlated with both indel di- vergence and the distance to the telomere. For the remaining genomic fea- tures studied (i.e. GC-content, repeats, segmental duplications, substitutions and distance to the centromere), there was no correlation with the density of PTC genes, or the correlation was similar to the density of non-PTC genes. Since the observed correlations with indel divergence and telomeric distance were rather weak we speculate that there may be undetected genomic fea- tures influencing the density of PTC genes. We used Gene Ontology (GO) to assess the biological function of genes with PTCs in chimpanzee and the GO classification revealed an enrichment of olfactory receptor genes. This group of genes has been the subject of sev-

35 eral studies and it has been proposed (Gilad et al. 2005a) that humans and chimpanzees have acquired slightly different sets of olfactory receptor genes. Looking more closely at the position of PTCs, we observed that they were often located towards the ends of the transcripts and thereby avoided the functional domains (as detected with Pfam (Finn et al. 2009)) that were pre- dominantly found in central parts of the translated transcripts. Considering the mechanism of NMD, transcripts with PTCs in the last exon will result in truncated proteins whereas other PTC transcripts will more likely be targeted by NMD. Several authors (Lewis et al. 2003; Hillman et al. 2004; Mendell et al. 2004; Saltzman et al. 2008) have proposed that the coupled action of PTCs and NMD provides a mechanism to regulate gene expression at a post- transcriptional level. By directing some proportion of the already transcribed pre-mRNAs to NMD, the cell can rapidly lower the level of mRNA in re- sponse to changing needs. Approximately half of the genes in our dataset have a single annotated transcript (according to Ensembl) and thus they would become pseudo genes if affected by a PTC. However, recent studies using NGS (Wang et al. 2008) have indicated that virtually all genes have > 1 transcript and present data- base annotations represent an underestimate. It is therefore possible that rather than silencing the affected genes, PTCs have the potential of altering the transcriptional output of a gene.

Paper III: Global comparison of the human and chimpanzee transcriptomes using Affymetrix exon arrays In paper III we focused on the transcriptome rather than the genome. Using comprehensive exon arrays from Affymetrix (Human Exon 1.0 ST Array), we compared the human and chimpanzee transcriptomes in three tissues (cerebellum, heart and liver) with respect to differential gene expression and splicing between the two species. To account for the 1.2 % estimated se- quence divergence between human and chimpanzee (Mikkelsen et al. 2005), we removed all probes that did not have a unique and identical (100 %) match in both genomes. This way we ensured that the observed expression differences between the species were biologically relevant and not a result of sequence divergence between the samples and the interrogating probes. We identified expression from a total of 6,281 RefSeq genes9 in our samples and

9 The annotations provided by Affymetrix refer to RefSeq mRNAs and throughout the manu- script we have referred to them as transcript. However, to enhance the readability of the thesis I have chosen the word ‘gene’ here. The RefSeq annotations are very conservative and have low coverage of different splice variants, manifested by the fact that > 99 % of the RefSeq 36 as a general trend, the expression profiles differed more between tissues than between species. The vast majority of RefSeq genes were expressed in both species and only a small subset of genes (1.5 %) were exclusively detected in a single species. GO classification did not reveal any significant differ- ences between this group of genes and the genes expressed in both species. In interpreting this observation, it is important to remember that we have only studied three types of tissues and a gene lacking expression in one of the species may still be expressed in other tissues. Thus, it is perhaps more correct to say that the tissue-specificity of gene expression is different be- tween the two species. Cerebellum had the highest number of expressed genes, closely followed by heart and liver and the vast majority of genes were expressed simultane- ously in several tissues. Approximately 15 % of the genes were differentially expressed between species and in contrast to gene expression in general, differential expression was almost entirely tissue-specific. The results im- plied that genes with different expression levels between species had more specialized tissue-specific functions, in line with previous reports (Khaitovich et al. 2005). Alternative splicing allows a gene to code for several different isoforms. This is a commonly occurring phenomenon and the number of affected hu- man genes is steadily rising, now approaching all multi-exon genes (Wang et al. 2008). In Paper III we used an ANOVA model to detect genes with pre- dicted alternative splicing between species and found that almost half of the genes expressed in liver, and slightly less genes in cerebellum and heart, showed evidence of different splicing patterns between human and chimpan- zee. Calarco et al (2007) have suggested that alternative splicing and differen- tial expression affect different groups of genes. We wanted to investigate this in our data and found that the results depended on how we did the com- parison. Our data suggested that 66-78 % of the differentially expressed genes were also included in the set of genes with predicted alternative splic- ing. On the other hand, if we did the opposite comparison and estimated the number of spliced genes that were also differentially expressed, the propor- tion was small. The reason for this was that the number of spliced genes was much higher than the number of differentially expressed genes. Differences in splicing between human and chimpanzee has also been studied by Blekham et al (2009a) and they suggest that expression of differ- ent isoforms varies not only between lineages but also between sexes. This is an intriguing finding that we have not yet addressed in our data. Alternative splicing, differential expression and absence of expression of a complete gene represent different ways to alter the transcriptional output of

IDs mapped to unique genes. Thus the elective use of ‘transcript’ or ‘gene’ in the text does not affect the described work or the conclusions. 37 a gene. To test if genes with expression or splicing differences were func- tionally different from genes with similar expression levels and splicing pat- terns, we utilized GO annotations. However, we observed no significant enrichment or depletion of GO categories in these groups of genes, as com- pared to a background of all expressed genes. We finally focused on the subtler, but clearly identifiable, events where a single exon was skipped in one of the species (named ‘cassette exon’ in the paper). We identified 23 cassette exons with inclusion only in human and an additional five exons only included in chimpanzee. The use of cassette exons appeared to be tissue specific and 79 % of the identified cassette exons were expressed in both species in another of the surveyed tissues. The data presented in Paper III will be used for at least two further types of analyses. We assume that the tissue-specific inclusion of exons may result from underlying differences in the genome, a theme previously explored by Castle and colleagues (2008). In a study of 48 human tissues and cell lines, they have found that exons with tissue-specific inclusion are often associated with specific sequence motifs. As many as 143 distinct sequences motifs were found to be enriched near the regulated cassette exon and the motifs varied between different tissues. Many motifs mapped to binding sites for regulatory proteins, suggesting a path for regulation of alternative splicing. Our expression data revealed a set of genes with cassette exons between the species and the inclusion, or exclusion, appeared to be largely tissue- specific. Many of the exons that were identified as cassette exons between human and chimpanzee in one tissue, were expressed in both species in an- other of the tissues. We hypothesize that this tissue-specific inclu- sion/exclusion of exons may be partially explained by substitutions or indels altering the regulatory motifs, as suggetsed by Castle et al (2008), and we will pursue this further. Another aspect of our study is the age discrepancy between the human and chimpanzee individuals and to ensure that the results were not biased by age we did a thorough comparison between spliced genes in our study and the previously published work by Calarco et al (2007). The comparison re- vealed that 77 % of the genes found in their study were also predicted to be spliced with our approach. Since the two studies have employed different array platforms and used slightly different tissues we do not expect a total overlap. We conclude that our splicing predictions are robust and not biased by the age difference between subjects. Furthermore, we intend to verify the putative cassette exons in a different expression dataset with chimpanzee liver samples (Blekhman et al. 2009a).

38 Paper IV: Deep sequencing of the chimpanzee transcriptome identifies numerous transcribed regions in frontal cortex and liver In an attempt to undertake an unbiased survey of the chimpanzee transcrip- tome, we have used the SOLiD NGS technology to sequence cDNA libraries from frontal cortex and liver samples from two individuals. We generated over 500 million reads and the data was used for expression analysis of Ref- Seq genes and subsequently for detection of transcribed regions (TRs) with no previous annotations (i.e. novel TRs). When mapping the reads to RefSeq transcripts we observed an uneven coverage and to compensate for the 3’ bias of reads we focused on a 500 bp window at the end of the transcript to determine expression. Furthermore, transcripts were required to be present above a defined statistical threshold and in both biological replicates of a tissue. Using these criteria we identified 7,761 expressed RefSeq transcripts in frontal cortex and 5,639 in liver, a large proportion of which were found in both tissues. Having established firm criteria for expression we next performed a de novo search for TRs in the chimpanzee genome and found more than 60,000 TRs in frontal cortex and 35,000 in liver. These figures may seem high at first, but it is reasonable to assume that with increasing sequencing depth many TRs will collapse into longer consecutive transcribed regions. Ap- proximately 40 % of the TRs overlapped with RefSeq exons and these were more highly expressed than other TRs. This was expected since we have selected for poly(A)+ mRNA, which represent protein coding genes. The remaining TRs that did not map to known exons, were matched to mRNAs and ESTs from databases and if no overlap was detected the TRs were termed ‘novel’ (n = 7,430 in frontal cortex and n = 2,919 in liver). We observed extensive transcription surrounding RefSeq transcripts (i.e. in introns and within 10 kb upstream and downstream) implying a multitude of coding regions and UTRs not previously identified. In addition, novel TRs may represent expressed repeats, different types of noncoding RNAs or unspliced mRNAs. In order to resolve this complex pattern it will be helpful to use improved sequencing protocols with strand-specificity of the reads. For a small number of novel TRs there was no support in the human ge- nome and these suggestively represented chimpanzee-specific transcripts. Finally, we noted that the brain and liver transcriptomes differed in two major ways. The proportion of transcripts expressed solely in brain was higher than the corresponding proportion expressed only in liver. Addition- ally, we observed a group of transcripts with a strong enrichment of intronic novel TRs in brain whereas no parallel pattern was observed in liver. This suggests a higher transcriptional complexity in brain than in liver. Pathway analysis, using the Ingenuity tool, of transcripts enriched for novel TRs in

39 brain revealed that the genes belonged to networks with neuronal function. This was confirmed with Gene Ontology classification. The importance of our findings is twofold. First, our results provide a comprehensive description of the chimpanzee transcriptome. The informa- tion on novel transcribed regions is important for understanding the biology of the chimpanzee itself and for future evolutionary studies of the divergence between primates. Secondly, our sequencing data is valuable for de novo annotations of genes in the chimpanzee genome. Previous annotations have been extrapolated from human gene models and our data can be used to con- firm and extend existing annotations.

General discussion of the present research Having described the research field in general and the research conducted in this thesis I will provide a perspective on my work in relation to the field as a whole.

Genome and transcriptome divergence between human and chimpanzee The first two papers in this thesis consider genomic differences and how they will be reflected in the transcriptome. We speculate that PTCs in chim- panzee genes may alter the repertoire of expressed transcripts and find sup- port for this proposal in the literature (Lewis et al. 2003; Hillman et al. 2004; Mendell et al. 2004; Saltzman et al. 2008). In the third paper we aimed di- rectly at transcriptome divergence and specifically searched for cassette ex- ons (i.e. exons expressed only in one of the species). To produce variable expression and splice patterns there must be some underlying differences in the genome sequence, such as in splice sites or other regulatory sites. An important challenge for the future is to combine large datasets in order to link sequence divergence to transcriptional, and perhaps also proteomic, variation and try to infer the relationships between them. The genome and transcriptome divergence described in Papers I-III is compatible with several of the proposed hypotheses for phenotypic diver- gence (outlined in the Introduction). PTCs may result in divergence of either the protein sequence or the expression levels, including expression of differ- ent isoforms. Cassette exons on the other hand will be manifested as differ- ent splice variants and to produce this pattern it is likely that some regulatory regions have diverged at the nucleotide level. A PubMed search for citations indicates that our work has mainly been important for other researchers with an interest in indels in primate evolution.

40 The final paper in this thesis differs from the others in that it only in- cludes chimpanzee samples and thus it is not strictly a comparative study. In this study we aimed at producing a comprehensive catalogue of transcribed regions in chimpanzee and the use of this was dual. As discussed in the chapter Enabling technologies, gene annotations in chimpanzee have relied on human annotations since there has been no comprehensive collection of chimpanzee cDNAs. Sakate and colleagues (2003; 2007) have previously sequenced the 5’ ends of 226 cDNAs and 87 full-length cDNAs and in a pioneering study, Blekhman et al (2009a) used RNA-seq to survey the liver transcriptome in chimpanzees. Our sequencing efforts further extended on their work by adding deeper coverage of the liver transcriptome and provid- ing new data of the brain transcriptome. We anticipate that the results will be of use for future comparative studies with both human and other mammals. Paper IV is also a complement to previous RNA-seq work on human (Pan et al. 2008; Sultan et al. 2008) and mouse (Mortazavi et al. 2008). Although our results confirmed the view of dense transcription around known protein coding genes, the results also indicated a multitude of novel transcribed re- gions. Technical advances such as methods for sequencing specific strands and single molecule sequencing10 will help to further untangle the complex and still largely unknown patterns of transcription.

Perspectives on the work on PTCs in chimpanzee genes The work in Paper I and II relied on gene annotations and some words of caution need to be said about this. The most accurate and reliable types of gene annotations come from alignment of cDNAs to the genome sequence but unfortunately there are few cDNAs available for chimpanzee and the genes are instead predicted with computational methods or inferred from the human gene catalogue. Since the expected sequence similarity between hu- man and chimpanzee is high, this approach will usually work well for many genes. However, different use of splice sites and 5’ or 3’ exons may be prob- lematic. Sequencing errors are an additional source of error and may intro- duce incorrect PTCs in the chimpanzee. In our work we have started with pairwise alignments of the human and chimpanzee sequence and projected human gene annotations onto that. This approach assumes similar gene structure between the species and has some limitations: we will not detect the use of alternative exons or splice sites. With that in mind, we still have the ability to detect indels and substitutions located within exons. We have used stringent filtering criteria to ensure that the predicted chimpanzee genes have valid start codons and that predicted

10 In single molecule sequencing the RNA molecules are sequenced directly, in contrast to present methods where the RNA is first reverse transcribed into cDNA and amplified before sequencing. 41 PTCs are not caused by low sequence quality. To validate the predicted PTCs, and the indel or substitution causing it, we re-sequenced the regions (Paper I) and found that the events were not artefacts from low sequence quality. The analyses in Papers I and II end with prediction of PTCs, but to evaluate the biological effect it would be useful to proceed with experimen- tal studies of the mRNA levels. The level of produced mRNA is generally used as a proxy for gene expression and real time PCR or northern blot may be used to measure mRNA levels in a subset of genes. To measure mRNA levels in a large dataset other techniques such as expression arrays or NGS are more suitable. The dataset from the exon arrays, presented in Paper III, would be one possible way of validating the predicted PTCs. This has not yet been done but I will show a pilot example of how the array data can be used to support the predictions in Papers I and II. We reasoned that transcripts affected by PTCs could be degraded by NMD and consequently we would detect no expression from the probesets related to that transcript. On the other hand, if the transcript was instead translated into a truncated protein, we would ex- pect to detect expression from probesets upstream of the PTC but not for downstream probesets. Simple in theory, the analysis is complicated by the fact that the majority of genes have several transcripts and thus the expres- sion of a probeset is often a composite signal measuring expression of sev- eral different transcripts. Even so, some general expression signatures may be hypothesized for genes with a PTC in chimpanzee. For such genes we expected a pattern like Figure 8 with both species displaying similar expres- sion levels, except for the final exon where we predicted a PTC. In Figure 8 this is represented by the second last probe set which has a significantly lower expression in chimpanzee. Since PTCs in the last exon do not usually elicit NMD we expected the transcipt to be expressed but not the final probe- set. The observed expression profile was consistent with our predicitions. Although there are no similar studies of PTCs in chimpanzee there is some work on human indels inferred from human-chimpanzee alignments. Using the chimpanzee draft genome as a comparison, Hahn et al (2005; 2006) have identified nine genes each with nonsense mutations and frame shift mutations in human. The low number of identified genes reflects the problems with using the draft version of the chimpanzee genome and Hahn et al have used a third outgroup to confirm the detected PTCs. In another study, Chen et al (2007) have inferred human-specific indels (also using the draft chimpanzee genome) and found that 6.4 % of UCSC genes have indels within the CDS. Approximately two thirds of them are 3n-indels and the remainder are likely to disrupt the reading frame. Furthermore, it was found that indels occurred more often in spliced genes than in unspliced genes. Taken together, these studies and our results imply that indels occur both in human and chimpanzee genes and that PTCs occur in both lineages, al- though the affected genes differ. 42

Figure 8. Screen shot from the Partek Genomics Suite showing expression of the Epoxide hydrolase 1 gene (EPHX1) (combining predictions of PTCs from Paper II and expression data from Paper III). The expression level in human cerebellum is represented by a blue line with boxes at the position of probe sets and chimpanzee expression levels are represented with by red line and triangles. The first (blurred to represent that it was masked due to sequence divergence) and last probe sets were below the detection treshold and were not considered to be expressed in our sam- ples.

Platform comparison We have used two experimental platforms to survey transcriptomes, namely the exon arrays from Affymetrix and NGS using the SOLiD system. Both platforms aim at detecting gene expression and Marioni et al (2008) have showed that the results from arrays (Affymetrix U133 plus 2 array) and NGS (Illumina) are highly correlated. However, there are some important differ- ences. For the Affymetrix exon array, several types of gene annotations have been used to design probes towards known and predicted exons. The dense probe coverage enables robust estimates of gene expression and offers the 43 ability to study differences in exon usage. Despite the high density of probes an inherent problem with arrays is that they can only target predefined re- gions in the genome. Moreover, the array we have used does not include probes for exon-exon junctions and thus there is no way to reliably infer how the expressed exons are joined into transcripts. However, even with junction probes the problem with having to predefine the interrogated region persists and arrays are unable to detect novel splice variants. With the SOLiD platform on the other hand, in theory we should be able to sequence essentially all transcripts present in the sample and map them back to the genome. Since the mapping does not rely on any previous anno- tations, we should be able to detect transcribed regions in an unbiased way. Unfortunately, the dataset presented in Paper IV suffers from an uneven representation of different parts of the transcripts and the gene expression levels were therefore estimated only from the last 500 bp. With an even read coverage we would have been able to investigate not only gene expression but also splicing patterns. Previous studies have shown that reads covering exon-exon junctions may be used to predict novel splice variants (Pan et al. 2008; Sultan et al. 2008; Wang et al. 2008) and we initially intended to do this. As with all emerging technologies NGS suffers from the lack of standard protocols for applications such as RNA preparation and analysis of the re- sults. An important part of our work with Paper IV has thus been to develop strategies for the analyses and define appropriate thresholds for expression. The choice of cut-offs is always somewhat arbitrary and a trade off between sensitivity11 and specificity12. In our analysis, we have taken a very conserva- tive approach, meaning that we have underestimated the number of tran- scripts used for the comparison between tissues and species. Thus we have prioritized reliability of the results rather than detection of novel low-level transcripts.

A note on the ethical implications of working with primate samples Working with primate samples presents some delicate ethical problems. Not only is the chimpanzee regarded as the species most closely related to hu- mans, but it is also en endangered species covered by the CITES act (www.cites.org), an international agreement on protection of endangered species. All primate tissue samples used in this work come from chimpan- zees from the Kolmården zoo (Sweden) where the animals live in a modern facility. If we accept that chimpanzee may be kept in captivity for the pur-

11 Measures the proportion of true positives that are correctly identified. 12 Measures the proportion of true negatives that are correctly identified. 44 pose of education and breeding, the animals at Kolmården seem to lead a good life. In our work we have used tissue samples from two animals, one of the chimpanzees died of natural causes and the other one was euthanized due to injuries from a fight.

45 Concluding remarks and future perspectives

Why do we bother? The question of “What makes us human?” is certainly intriguing and has engaged humans for a long time. It is interesting from a pure philosophical point of view and comparison to our closest relative, the chimpanzee, may be one way to gain deeper understanding of who we are, at least at the mo- lecular level. More pragmatic motivations for comparative studies would be to understand human (i) disease aethiology and (ii) evolution. Chimpanzees have different susceptibility to a number of human diseases, such as malaria, Alzheimer’s disease and the progression of HIV infection to AIDS (Varki 2000). Understanding the molecular mechanisms for these diseases would have great medical value and could in the long term benefit patients. To pin- point genes that have been under positive selection on the lineage leading to humans, the chimpanzee genome represents a valuable comparison. If we widen the homo centric perspective, comparative studies are also useful in gaining a deeper understanding of the chimpanzee and how it has evolved since our last common ancestor. Intersting in itself, a greater under- standing of the common chimpanzee and the bonobo may also help to save the species from extinction. These two species are highly endangered today and an increased public awareness is probably necessary in order to be able to save them in the wild.

What have we learnt? A general lesson from comparative studies of species is that we can rela- tively easy find and catalogue a large number of differences, including ge- nome variation and transcriptional differences. This has been done by many research groups and also by us, as presented in this thesis. However, most changes (both at the nucleotide and transcriptome level) are likely to be se- lectively neutral and to sift through the large catalogues of irrelevant differ- ences to find the functionally important ones is a tedious work. Given that the human (and chimpanzee) genome is ~ 3 Gb and the sequence divergence is ~ 1.2 % this results in 36 million base pairs that differ between the species. Add indels, CNVs, segmental duplications and so on and the sequence di-

46 vergence will increase even further. Thus, the challenge is to find the ge- nomic changes that result in species-specific phenotypes. Finding differences that underlie the divergent phenotypes has not been easy and so far only a few studies have been successful. One such example is the FOXP2 gene which is involved in the human ability of speech. Svante Pääbo’s group has identified some human specific nucleotide changes and suggested that the gene has been positively selected for during human evolu- tion (Enard et al. 2002b). In this case, FOXP2 was chosen as a candidate gene based on its function and association with speech impairment in hu- mans. The success story of FOXP2 has been hard to repeat eventhough we now have a close to complete list of protein coding genes with amino acid changes between human and chimpanzee. Perhaps, this is not as surprising since variation in protein coding genes is only one of the proposed explana- tions of the species divergence. With the added complexity of differences in regulatory regions and other types of genomic variation that have been shown to affect gene expression, it is increasingly hard to provide good can- didate genes for further study.

Where do we go next? The release of the chimpanzee genome in 2005 was a great landmark in pri- mate and evolutionary genomics and an important quest is to find strategies to decipher the genome sequence. A combined analysis of several compre- hensive datasets is a promising path forward. Analysing differences in geno- types and gene expression in parallel has proved to be a powerful approach for identification of genes and gene networks associated with human disease (Chen et al. 2008; Emilsson et al. 2008) and may also prove useful for evo- lutionary studies. A variety of arrays are available to measure gene expres- sion and study genomic variation in CNVs and SNPs. Although intended for human samples this type of arrays can also be used for chimpanzees to detect differential gene expression as well as different frequencies of known ge- nomic variation. As the capacity of NGS increases and the costs decreases is is anticipated that arrays will be replaced by high-throughput sequencing for many applications. In the next few years, re-sequencing of genomes will provide unbiased data on all types of genomic variation, ranging from substi- tutions to CNVs and large segmental duplications. Together with RNA-seq, which provides information of the transcriptional activity of the genome, this data can be mined for genomic alterations that seem to affect the transcrip- tional output. The sequencing revolution holds great promise and an addi- tional advantage with re-sequencing of genomes is that we will soon have polymorphism data on chimpanzee. It is known that the chimpanzee has greater within-species divergence than humans (Mikkelsen et al. 2005) and so far this has not been addressed at the genomic level. Polymorphism data 47 can be used to compare the intra- and interspecies divergence for humans and chimpanzees and to further study the association between genetic changes and alterations in gene expression. In a broad sense, gene expression may be viewed as a phenotype but to move beyond the world of DNA and RNA we may shift our focus to the resulting gene products, i.e. proteins. The multitude of proteins greatly ex- ceed the number of genes, as a result of alternative splicing, RNA-editing and post-translational modifications. High-throughput studies of the pro- teome use mass spectrometry to detect the peptides present in a sample. The technique separates peptide fragments based on their mass-to-charge ration (m/z) and the fragments are matched to a database. Another approach for studying proteome variation has been taken by the Human Proteome Re- source (HPR) Program (Uhlen et al. 2005) which aim at producing antibod- ies towads the majority of human proteins. The antibodies are used together with tissue microarrays to generate an expression atlas that can be used to explore and contrast protein expression across tissues and cell lines. The HPR program has a focus on humans but the antibodies could also be used in combination with chimpanzee specific tissue arrays, bearing in mind then that there is a bias since the antibodies are designed towards human targets. An even better approach would be to design chimpanzee specific antibodies. Although hard to use for quantitative comparisons beteween species the method may provide important clues to where and when gene products are expressed. Moving one step further to functional studies we are limited since research on chimpanzees is not morally acceptable. Cell lines provide a model system suitable to test e.g. the effect of regulatory changes on gene expression. For some research, animal models such as mouse, may also be useful. This approach has been pursued by Pääbo’s group who have devel- oped a transgenic FOXP2 mouse (Enard et al. 2009). One concern with experimental studies involving chimpanzee material is that tissues are not abundant and most studies have to rely on small sample sizes. This is problematic since the within species divergence is predicted to be higher in the chimpanzee population than in the human ditto. In order to really drive the field forward and profit from the emerging opportunities with NGS it would be beneficial with collaborative efforts to gain access to larger collections of primate samples.

48 Sammanfattning på svenska

Människans och schimpansens gemensamma urmoder levde för 5–7 miljo- ner år sedan. Efter att evolutionen skiljde våra vägar åt så har bägge arterna ansamlat mutationer i sina respektive genom. Mutationerna består både av enstaka basparsvariationer och större skillnader, såsom insertioner och dele- tioner (indels). Om både variationen i enkla baspar och indels tas med är sekvensdivergensen mellan arterna minst 5 %. Många forskargrupper har studerat skillnader mellan männsikans och schimpansens genom och de tran- skript som produceras, i syfte att identifiera den genetiska basen för skillna- der i utseende och andra karaktärer mellan arterna. Min avhandling handlar om den typen av molekylära jämförande artstudier. De första två delstudierna handlar om mängden indelvariation mellan människa och schimpans, och specifikt om indels som orsakar prematura stoppkodon i schimpansgener. Resultaten pekar på att prematura stoppko- don är relativt vanligt förekommande (finns i ca 8 % av schimpansgenerna) och vi föreslår att detta kan orsaka skillnader i genuttryck mellan arterna. I den tredje delstudien har vi använt expressionsarrayer för att studera tran- skriptomet i tre vävnader (cerebellum, hjärta och lever). En slutsats är att genuttrycket varierar mer mellan vävnaderna än mellan arterna. Ungefär 15 % av generna visar en uttrycksskillnad mellan arterna i någon vävnad, och omkring hälften av generna visar tecken på alternativ splitsning (d.v.s produktion av alternativa transkript från en och samma gen). Vi har speci- fikt studerat kassett-exon mellan arterna, d.v.s. interna (ej första eller sista) exoner som uttrycks i endast en av de två arterna. I den fjärde delstudien har vi använt storskalig sekvensering för att analysera transkriptomet i frontal cortex och lever hos schimpans. För att validera metoden studerade vi först expression av RefSeq-gener och letade därefter efter nya uttryckta regioner (TR:er) i hela genomet. Vi fann ett stort antal TR:er och en majo- ritet av dessa fanns i närheten av kända gener, vilket tyder på att de kan vara kodande regioner eller UTR:er som inte identifierats tidigare. Andra tänkbara förklaringar är icke-kodande RNA, uttryckta repeterade regioner eller introniska sekvenser från osplicat mRNA. Vi noterade också att tran- skriptomet i hjärna skiljde sig från det i lever: fler gener uttrycktes speci- fikt i frontal cortex än i lever och fler gener hade en stor ansamling av nya uttryckta regioner. Sammanfattningsvis bidrar avhandlingen till en ökad förståelse av skill- naderna mellan människa och schimpans, både på genom- och transkriptom- 49 nivå. Resultaten utgör en grund för fortsatta studier av vilka specifika skill- nader som påverkar den fenotypiska variationen mellan människa och schimpans.

50 Acknowledgements

The work presented in this thesis was performed at the Department of Genet- ics and Pathology at Uppsala University. During this time I have been part of the National Research School in Functional Genomics and Bioinformatics (FGB) which have provided financial support. The following funding agen- cies have also provided financial support: the Swedish Research Councils in Natural Science and Medicine (VR), the Knut and Alice Wallenberg Foun- dation, Center for Metagenomic Sequence Analysis, Uppsala Genome Cen- ter, the Erik Philip-Sörensen Foundation, the Beijer Foundation, the Marcus Borgström Foundation, and the Magnus Bergvalls Foundation.

As we all know, research is not a one woman show and many people have been important to me along the way and I want to take the opportunity to thank them here. My most sincere thanks to:

My supervisor Ulf Gyllensten for taking me under your wings and being an excellent scientific guide. Thanks for letting me work with some very excit- ing projects and giving me a lot of freedom to ask the questions and find the answers in my way. My co-supervisor Lucia Cavelier for teaching me to work in the lab and to keep asking the important biological questions. And even more important, thanks for being a loyal supporter during all these years. My co-supervisor Tomas Bergström for introducing me to the fasci- nating field of primate genomics and evolution. My collaborators on the last two papers. Adam who continued with the SOLiD-project when I went on maternity leave and has remained committed ever since. It has been fun working with you and I have learned a lot. Lars for generously sharing your knowledge on genomic variation. Your ques- tions have made me think hard and I have enjoyed our discussions. I also want to thank Marie for the collaboration on Paper I. Bengt Röken at the Kolmården Zoo for kindly providing the chimpanzee samples. The staff at SVA and pathologists in our department for help with tissue handling. The Uppsala Genome Center and the Uppsala Array Plat- form for generating the data used in Papers III and IV. Everyone in Ulf’s group: Adam, Emma, Ghazal, Inger G, Ivana, Rita, Wilmar, and at the genome centre: Inger J, Jessica, Joanna, Katarina, Otto, Christian, Emelie, our extra room mate Monica and finally all absent friends who have left for new adventures. Thanks for coffee breaks, lunches 51 (thanks to you I know everything about buying houses/apartments, raising children, soap operas and iPhones) and parties during the years. Some of you have also become good friends outside work. Emma who has offered sup- port and advice on many aspects of life. I know I can count on you, even if the rain is pouring down (read both literally and as a metaphore). Åsa, my first friend at Rudbeck and my role model when I started as a PhD student. Malin for being just as talkative as me and for bringing more fashion aware- ness to our room. I took your advice seriously and bought the dress for to- night’s party already at the sale in January. Thanks to all three of you for lenghty discussions on research, wine and music (pre-kids), sleepless nights (post-kids), and other important aspects of life. I also want to thank for proof reading and keeping my Australian accent up to date. And Anna J, my next-door neighbour at Rudbeck (and ten years ago at Ryds allé 21), for be- ing so easily convinced to drink coffe. This was especcially appreciated dur- ing the writing of my thesis. Friends from the PhD student council at IGP for many memorable mo- ments and keeping me up to date with the gossip. I’m proud of our work and especially the PhD student symposia. All other friends at Rudbeck, past and present. You have made it a great place to work and I have many fond memories. My non-Rudbeck friends who remind me that there is a world outside. None mentioned and none forgotten but a special thanks to Åsa for sharing my interest in bioinformatics all those years ago in Linköping. My parents Cecilia and Ingemar for endless support and faith in me. I would say the debate on nature versus nurture is irrelevant because you pro- vided me with the best of both. Thanks also for taking care of my family while I have been working on this thesis. My sister and brother for giving me a lot of practise at arguing my case ;) Thanks also Elin for helping me to try all the fancy cakes when we were on maternity leave and Erik for revealing the secret of GoreTex gloves (biking to work is much more enjoyable now). And finally my own little family. Björn, my love and companion in life. Thanks for keeping everything together during the last hectic months. Tea and Noel, the sunshines of my life. Being with you makes me remember what’s really important (surprisingly enough that’s not research but rather holding the warm hand of a three-year-old or the smell of a sleeping baby). I love the three of you

52 References

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al. 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics 25(1): 25-29. Bailey JA, Eichler EE. 2006. Primate segmental duplications: crucibles of evolution, diversity and disease. Nature reviews 7(7): 552-564. Bailey JA, Gu Z, Clark RA, Reinert K, Samonte RV, Schwartz S, Adams MD, Myers EW, Li PW, Eichler EE. 2002a. Recent segmental du- plications in the human genome. Science (New York, NY 297(5583): 1003-1007. Bailey JA, Yavor AM, Viggiano L, Misceo D, Horvath JE, Archidiacono N, Schwartz S, Rocchi M, Eichler EE. 2002b. Human-specific duplica- tion and mosaic transcripts: the recent paralogous structure of chro- mosome 22. American journal of human genetics 70(1): 83-100. Bakewell MA, Shi P, Zhang J. 2007. More genes underwent positive selec- tion in chimpanzee evolution than in human evolution. Proceedings of the National Academy of Sciences of the United States of America 104(18): 7489-7494. Benjamini Y, Hochberg, Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B(57): 289-300. Bentley DR Balasubramanian S Swerdlow HP Smith GP Milton J Brown CG Hall KP Evers DJ Barnes CL Bignell HR et al. 2008. Accurate whole human genome sequencing using reversible terminator chem- istry. Nature 456(7218): 53-59. Birney E, Andrews D, Bevan P, Caccamo M, Cameron G, Chen Y, Clarke L, Coates G, Cox T, Cuff J et al. 2004. Ensembl 2004. Nucleic acids research 32 Database issue: D468-470. Birney E Stamatoyannopoulos JA Dutta A Guigo R Gingeras TR Margulies EH Weng Z Snyder M Dermitzakis ET Thurman RE et al. 2007. Identification and analysis of functional elements in 1% of the hu- man genome by the ENCODE pilot project. Nature 447(7146): 799- 816. Black DL. 2003. Mechanisms of alternative pre-messenger RNA splicing. Annual review of biochemistry 72: 291-336. Blekhman R, Marioni JC, Zumbo P, Stephens M, Gilad Y. 2009a. Sex- specific and lineage-specific alternative splicing in primates. Ge- nome research. 53 Blekhman R, Oshlack A, Chabot AE, Smyth GK, Gilad Y. 2008. Gene regu- lation in primates evolves under tissue-specific selection pressures. PLoS genetics 4(11): e1000271. Blekhman R, Oshlack A, Gilad Y. 2009b. Segmental duplications contribute to gene expression differences between humans and chimpanzees. Genetics 182(2): 627-630. Britten RJ. 2002. Divergence between samples of chimpanzee and human DNA sequences is 5%, counting indels. Proceedings of the National Academy of Sciences of the United States of America 99(21): 13633- 13635. Bustamante CD, Fledel-Alon A, Williamson S, Nielsen R, Hubisz MT, Glanowski S, Tanenbaum DM, White TJ, Sninsky JJ, Hernandez RD et al. 2005. Natural selection on protein-coding genes in the human genome. Nature 437(7062): 1153-1157. Caceres M, Lachuer J, Zapala MA, Redmond JC, Kudo L, Geschwind DH, Lockhart DJ, Preuss TM, Barlow C. 2003. Elevated gene expression levels distinguish human from non-human primate brains. Proceed- ings of the National Academy of Sciences of the United States of America 100(22): 13030-13035. Calarco JA, Xing Y, Caceres M, Calarco JP, Xiao X, Pan Q, Lee C, Preuss TM, Blencowe BJ. 2007. Global analysis of alternative splicing dif- ferences between humans and chimpanzees. Genes & development 21(22): 2963-2975. Carninci P Kasukawa T Katayama S Gough J Frith MC Maeda N Oyama R Ravasi T Lenhard B Wells C et al. 2005. The transcriptional land- scape of the mammalian genome. Science (New York, NY 309(5740): 1559-1563. Castle JC, Zhang C, Shah JK, Kulkarni AV, Kalsotra A, Cooper TA, John- son JM. 2008. Expression of 24,426 human alternative splicing events and predicted cis regulation in 48 tissues and cell lines. Na- ture genetics 40(12): 1416-1425. Chang YF, Imam JS, Wilkinson MF. 2007. The nonsense-mediated decay RNA surveillance pathway. Annual review of biochemistry 76: 51- 74. Chen FC, Chen CJ, Li WH, Chuang TJ. 2007. Human-specific insertions and deletions inferred from mammalian genome sequences. Genome re- search 17(1): 16-22. Chen FC, Li WH. 2001. Genomic divergences between humans and other hominoids and the effective population size of the common ancestor of humans and chimpanzees. American journal of human genetics 68(2): 444-456. Chen Y, Zhu J, Lum PY, Yang X, Pinto S, MacNeil DJ, Zhang C, Lamb J, Edwards S, Sieberts SK et al. 2008. Variations in DNA elucidate molecular networks that cause disease. Nature 452(7186): 429-435. Cheng J, Kapranov P, Drenkow J, Dike S, Brubaker S, Patel S, Long J, Stern D, Tammana H, Helt G et al. 2005a. Transcriptional maps of 10 hu-

54 man chromosomes at 5-nucleotide resolution. Science (New York, NY 308(5725): 1149-1154. Cheng Z, Ventura M, She X, Khaitovich P, Graves T, Osoegawa K, Church D, DeJong P, Wilson RK, Paabo S et al. 2005b. A genome-wide comparison of recent chimpanzee and human segmental duplica- tions. Nature 437(7055): 88-93. Clark AG, Glanowski S, Nielsen R, Thomas PD, Kejariwal A, Todd MA, Tanenbaum DM, Civello D, Lu F, Murphy B et al. 2003. Inferring nonneutral evolution from human-chimp-mouse orthologous gene trios. Science (New York, NY 302(5652): 1960-1963. Cloonan N, Forrest AR, Kolle G, Gardiner BB, Faulkner GJ, Brown MK, Taylor DF, Steptoe AL, Wani S, Bethel G et al. 2008. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Na- ture methods 5(7): 613-619. Crick F. 1970. Central dogma of molecular biology. Nature 227(5258): 561- 563. Dannemann M, Lorenc A, Hellmann I, Khaitovich P, Lachmann M. 2009. The effects of probe binding affinity differences on gene expression measurements and how to deal with them. Bioinformatics (Oxford, England) 25(21): 2772-2779. Darwin C. 1871. The descent of man and selection in relation to sex Charles Darwin. John Murray, London. Drmanac R, Sparks AB, Callow MJ, Halpern AL, Burns NL, Kermani BG, Carnevali P, Nazarenko I, Nilsen GB, Yeung G et al. 2010. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science (New York, NY 327(5961): 78-81. Ebersberger I, Metzler D, Schwarz C, Paabo S. 2002. Genomewide compari- son of DNA sequences between humans and chimpanzees. American journal of human genetics 70(6): 1490-1497. Eddy SR. 2001. Non-coding RNA genes and the modern RNA world. Na- ture reviews 2(12): 919-929. Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G, Peluso P, Rank D, Baybayan P, Bettman B et al. 2009. Real-time DNA sequencing from single polymerase molecules. Science (New York, NY 323(5910): 133-138. Emilsson V, Thorleifsson G, Zhang B, Leonardson AS, Zink F, Zhu J, Carl- son S, Helgason A, Walters GB, Gunnarsdottir S et al. 2008. Genet- ics of gene expression and its effect on disease. Nature 452(7186): 423-428. Emmert-Buck MR, Bonner RF, Smith PD, Chuaqui RF, Zhuang Z, Gold- stein SR, Weiss RA, Liotta LA. 1996. Laser capture microdissec- tion. Science (New York, NY 274(5289): 998-1001. Enard W, Gehre S, Hammerschmidt K, Holter SM, Blass T, Somel M, Bruckner MK, Schreiweis C, Winter C, Sohr R et al. 2009. A hu- manized version of Foxp2 affects cortico-basal ganglia circuits in mice. Cell 137(5): 961-971. Enard W, Khaitovich P, Klose J, Zollner S, Heissig F, Giavalisco P, Nieselt- Struwe K, Muchmore E, Varki A, Ravid R et al. 2002a. Intra- and

55 interspecific variation in primate gene expression patterns. Science (New York, NY 296(5566): 340-343. Enard W, Przeworski M, Fisher SE, Lai CS, Wiebe V, Kitano T, Monaco AP, Paabo S. 2002b. Molecular evolution of FOXP2, a gene in- volved in speech and language. Nature 418(6900): 869-872. Feuk L, Carson AR, Scherer SW. 2006. Structural variation in the human genome. Nature reviews 7(2): 85-97. Feuk L, MacDonald JR, Tang T, Carson AR, Li M, Rao G, Khaja R, Scherer SW. 2005. Discovery of human inversion polymorphisms by com- parative analysis of human and chimpanzee DNA sequence assem- blies. PLoS genetics 1(4): e56. Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K et al. 2009. The Pfam protein families database. Nucleic acids research 38(Database issue): D211- 222. Fortna A, Kim Y, MacLaren E, Marshall K, Hahn G, Meltesen L, Brenton M, Hink R, Burgers S, Hernandez-Boussard T et al. 2004. Lineage- specific gene duplication and loss in human and great ape evolution. PLoS biology 2(7): E207. Frazer KA, Chen X, Hinds DA, Pant PV, Patil N, Cox DR. 2003. Genomic DNA insertions and deletions occur frequently between humans and nonhuman primates. Genome research 13(3): 341-346. Gao K, Masuda A, Matsuura T, Ohno K. 2008. Human branch point consen- sus sequence is yUnAy. Nucleic acids research 36(7): 2257-2267. Gerstein MB, Bruce C, Rozowsky JS, Zheng D, Du J, Korbel JO, Emanuels- son O, Zhang ZD, Weissman S, Snyder M. 2007. What is a gene, post-ENCODE? History and updated definition. Genome research 17(6): 669-681. Gilad Y, Man O, Glusman G. 2005a. A comparison of the human and chim- panzee olfactory receptor gene repertoires. Genome research 15(2): 224-230. Gilad Y, Oshlack A, Smyth GK, Speed TP, White KP. 2006. Expression profiling in primates reveals a rapid evolution of human transcrip- tion factors. Nature 440(7081): 242-245. Gilad Y, Rifkin SA, Bertone P, Gerstein M, White KP. 2005b. Multi-species microarrays reveal the effect of sequence divergence on gene ex- pression profiles. Genome research 15(5): 674-680. Gingeras TR. 2007. Origin of phenotypes: genes and transcripts. Genome research 17(6): 682-690. Glazko GV, Nei M. 2003. Estimation of divergence times for major lineages of primate species. Molecular biology and evolution 20(3): 424-434. Hahn Y, Lee B. 2005. Identification of nine human-specific frameshift muta- tions by comparative analysis of the human and the chimpanzee ge- nome sequences. Bioinformatics (Oxford, England) 21 Suppl 1: i186-194. Hahn Y, Lee B. 2006. Human-specific nonsense mutations identified by genome sequence comparisons. Human genetics 119(1-2): 169-178. 56 Haygood R, Fedrigo O, Hanson B, Yokoyama KD, Wray GA. 2007. Pro- moter regions of many neural- and nutrition-related genes have ex- perienced positive selection during human evolution. Nature genet- ics 39(9): 1140-1144. Hentze MW, Kulozik AE. 1999. A perfect message: RNA surveillance and nonsense-mediated decay. Cell 96(3): 307-310. Hillman RT, Green RE, Brenner SE. 2004. An unappreciated role for RNA surveillance. Genome biology 5(2): R8. Hsieh WP, Chu TM, Wolfinger RD, Gibson G. 2003. Mixed-model reanaly- sis of primate data suggests tissue and species biases in oligonucleo- tide-based gene expression profiles. Genetics 165(2): 747-757. Hubbard TJ, Aken BL, Ayling S, Ballester B, Beal K, Bragin E, Brent S, Chen Y, Clapham P, Clarke L et al. 2009. Ensembl 2009. Nucleic acids research 37(Database issue): D690-697. Huxley TH. 1864. Evidence as to the man's place in nature, London,. Jacquier A. 2009. The complex eukaryotic transcriptome: unexpected perva- sive transcription and novel small RNAs. Nature reviews 10(12): 833-844. Kapranov P, Cheng J, Dike S, Nix DA, Duttagupta R, Willingham AT, Stadler PF, Hertel J, Hackermuller J, Hofacker IL et al. 2007. RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science (New York, NY 316(5830): 1484-1488. Khaitovich P, Hellmann I, Enard W, Nowick K, Leinweber M, Franz H, Weiss G, Lachmann M, Paabo S. 2005. Parallel patterns of evolution in the genomes and transcriptomes of humans and chimpanzees. Sci- ence (New York, NY 309(5742): 1850-1854. Khaitovich P, Muetzel B, She X, Lachmann M, Hellmann I, Dietzsch J, Steigele S, Do HH, Weiss G, Enard W et al. 2004a. Regional pat- terns of gene expression in human and chimpanzee brains. Genome research 14(8): 1462-1473. Khaitovich P, Weiss G, Lachmann M, Hellmann I, Enard W, Muetzel B, Wirkner U, Ansorge W, Paabo S. 2004b. A neutral model of tran- scriptome evolution. PLoS biology 2(5): E132. Kimura M. 1968. Evolutionary rate at the molecular level. Nature 217(5129): 624-626. King MC, Wilson AC. 1975. Evolution at two levels in humans and chim- panzees. Science (New York, NY 188(4184): 107-116. Kvikstad EM, Tyekucheva S, Chiaromonte F, Makova KD. 2007. A ma- caque's-eye view of human insertions and deletions: differences in mechanisms. PLoS computational biology 3(9): 1772-1782. Lai CS, Fisher SE, Hurst JA, Vargha-Khadem F, Monaco AP. 2001. A fork- head-domain gene is mutated in a severe speech and language disor- der. Nature 413(6855): 519-523. Lander ES Linton LM Birren B Nusbaum C Zody MC Baldwin J Devon K Dewar K Doyle M FitzHugh W et al. 2001. Initial sequencing and analysis of the human genome. Nature 409(6822): 860-921.

57 Levinson G, Gutman GA. 1987. Slipped-strand mispairing: a major mecha- nism for DNA sequence evolution. Molecular biology and evolution 4(3): 203-221. Lewis BP, Green RE, Brenner SE. 2003. Evidence for the widespread cou- pling of alternative splicing and nonsense-mediated mRNA decay in humans. Proceedings of the National Academy of Sciences of the United States of America 100(1): 189-192. Linné Cv. 1735. Systema naturae, sive regna tria naturae systematice pro- posita per classes, ordines, genera & species, Lugduni Batavorum,. Maquat LE. 2004. Nonsense-mediated mRNA decay: splicing, translation and mRNP dynamics. Nat Rev Mol Cell Biol 5(2): 89-99. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. 2008. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome research 18(9): 1509-1517. Mattick JS, Makunin IV. 2006. Non-coding RNA. Human molecular genet- ics 15 Spec No 1: R17-29. McConkey EH. 2004. Orthologous numbering of great ape and human chromosomes is essential for comparative genomics. Cytogenetic and genome research 105(1): 157-158. Mendell JT, Sharifi NA, Meyers JL, Martinez-Murillo F, Dietz HC. 2004. Nonsense surveillance regulates expression of diverse classes of mammalian transcripts and mutes genomic noise. Nature genetics 36(10): 1073-1078. Mikkelsen TS, Hillier LW, Eichler EE, Zody MC, Jaffe DB, Yang S, Enard W, Hellmann I, Lindblad-Toh K, Altheide TK. 2005. Initial se- quence of the chimpanzee genome and comparison with the human genome. Nature 437(7055): 69-87. Mills RE, Bennett EA, Iskow RC, Luttig CT, Tsui C, Pittard WS, Devine SE. 2006. Recently mobilized transposons in the human and chim- panzee genomes. American journal of human genetics 78(4): 671- 679. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. 2008. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods 5(7): 621-628. Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M. 2008. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science (New York, NY 320(5881): 1344-1349. Nickerson E, Nelson DL. 1998. Molecular definition of pericentric inversion breakpoints occurring during the evolution of humans and chimpan- zees. Genomics 50(3): 368-372. Nielsen R, Bustamante C, Clark AG, Glanowski S, Sackton TB, Hubisz MJ, Fledel-Alon A, Tanenbaum DM, Civello D, White TJ et al. 2005. A scan for positively selected genes in the genomes of humans and chimpanzees. PLoS biology 3(6): e170. Nowick K, Gernat T, Almaas E, Stubbs L. 2009. Differences in human and chimpanzee gene expression patterns define an evolving network of

58 transcription factors in brain. Proceedings of the National Academy of Sciences of the United States of America. Ohno S. 1970. Evolution by Gene Duplication. Springer-Verlag. Olson MV. 1999. When less is more: gene loss as an engine of evolutionary change. American journal of human genetics 64(1): 18-23. Olson MV, Varki A. 2003. Sequencing the chimpanzee genome: insights into human evolution and disease. Nature reviews 4(1): 20-28. Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ. 2008. Deep surveying of alternative splicing complexity in the human transcriptome by high- throughput sequencing. Nature genetics. Perry GH, Tchinda J, McGrath SD, Zhang J, Picker SR, Caceres AM, Iafrate AJ, Tyler-Smith C, Scherer SW, Eichler EE et al. 2006. Hotspots for copy number variation in chimpanzees and humans. Proceedings of the National Academy of Sciences of the United States of America 103(21): 8006-8011. Perry GH, Yang F, Marques-Bonet T, Murphy C, Fitzgerald T, Lee AS, Hyland C, Stone AC, Hurles ME, Tyler-Smith C et al. 2008. Copy number variation and evolution in humans and chimpanzees. Ge- nome research 18(11): 1698-1710. Pollard KS, Salama SR, Lambert N, Lambot MA, Coppens S, Pedersen JS, Katzman S, King B, Onodera C, Siepel A et al. 2006. An RNA gene expressed during cortical development evolved rapidly in humans. Nature 443(7108): 167-172. Popper K. 1934. Logik der Forschung. Prabhakar S, Noonan JP, Paabo S, Rubin EM. 2006. Accelerated evolution of conserved noncoding sequences in humans. Science (New York, NY 314(5800): 786. Preuss TM, Caceres M, Oldham MC, Geschwind DH. 2004. Human brain evolution: insights from microarrays. Nature reviews 5(11): 850- 860. Pruitt KD, Tatusova T, Maglott DR. 2007. NCBI reference sequences (Ref- Seq): a curated non-redundant sequence database of genomes, tran- scripts and proteins. Nucleic acids research 35(Database issue): D61-65. Pushkarev D, Neff NF, Quake SR. 2009. Single-molecule sequencing of an individual human genome. Nature biotechnology 27(9): 847-852. Rhead B, Karolchik D, Kuhn RM, Hinrichs AS, Zweig AS, Fujita PA, Diek- hans M, Smith KE, Rosenbloom KR, Raney BJ et al. 2010. The UCSC Genome Browser database: update 2010. Nucleic acids re- search 38(Database issue): D613-619. Sakate R, Osada N, Hida M, Sugano S, Hayasaka I, Shimohira N, Yanagi S, Suto Y, Hashimoto K, Hirai M. 2003. Analysis of 5'-end sequences of chimpanzee cDNAs. Genome research 13(5): 1022-1026. Sakate R, Suto Y, Imanishi T, Tanoue T, Hida M, Hayasaka I, Kusuda J, Gojobori T, Hashimoto K, Hirai M. 2007. Mapping of chimpanzee full-length cDNAs onto the human genome unveils large potential divergence of the transcriptome. Gene 399(1): 1-10.

59 Saltzman AL, Kim YK, Pan Q, Fagnani MM, Maquat LE, Blencowe BJ. 2008. Regulation of multiple core spliceosomal proteins by alterna- tive splicing-coupled nonsense-mediated mRNA decay. Molecular and cellular biology. Sanger F, Nicklen S, Coulson AR. 1977. DNA sequencing with chain- terminating inhibitors. Proceedings of the National Academy of Sci- ences of the United States of America 74(12): 5463-5467. Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chet- vernin V, Church DM, Dicuccio M, Federhen S et al. 2010. Data- base resources of the National Center for Biotechnology Informa- tion. Nucleic acids research 38(Database issue): D5-16. Schena M, Shalon D, Davis RW, Brown PO. 1995. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science (New York, NY 270(5235): 467-470. Sharp AJ, Cheng Z, Eichler EE. 2006. Structural variation of the human genome. Annual review of genomics and human genetics 7: 407-442. Shendure J, Porreca GJ, Reppas NB, Lin X, McCutcheon JP, Rosenbaum AM, Wang MD, Zhang K, Mitra RD, Church GM. 2005. Accurate multiplex polony sequencing of an evolved bacterial genome. Sci- ence (New York, NY 309(5741): 1728-1732. Sibley CG, Ahlquist JE. 1984. The phylogeny of the hominoid primates, as indicated by DNA-DNA hybridization. Journal of molecular evolu- tion 20(1): 2-15. Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D et al. 2008. A global view of gene activity and alternative splicing by deep se- quencing of the human transcriptome. Science (New York, NY 321(5891): 956-960. Tang F, Barbacioru C, Wang Y, Nordman E, Lee C, Xu N, Wang X, Bodeau J, Tuch BB, Siddiqui A et al. 2009. mRNA-Seq whole-transcriptome analysis of a single cell. Nature methods 6(5): 377-382. Uddin M, Wildman DE, Liu G, Xu W, Johnson RM, Hof PR, Kapatos G, Grossman LI, Goodman M. 2004. Sister grouping of chimpanzees and humans as revealed by genome-wide phylogenetic analysis of brain gene expression profiles. Proceedings of the National Acad- emy of Sciences of the United States of America 101(9): 2957-2962. Uhlen M, Bjorling E, Agaton C, Szigyarto CA, Amini B, Andersen E, Andersson AC, Angelidou P, Asplund A, Asplund C et al. 2005. A human protein atlas for normal and cancer tissues based on antibody proteomics. Mol Cell Proteomics 4(12): 1920-1932. Valouev A, Ichikawa J, Tonthat T, Stuart J, Ranade S, Peckham H, Zeng K, Malek JA, Costa G, McKernan K et al. 2008. A high-resolution, nu- cleosome position map of C. elegans reveals a lack of universal se- quence-dictated positioning. Genome research 18(7): 1051-1063. Varki A. 2000. A chimpanzee genome project is a biomedical imperative. Genome research 10(8): 1065-1070.

60 Venter JC Adams MD Myers EW Li PW Mural RJ Sutton GG Smith HO Yandell M Evans CA Holt RA et al. 2001. The sequence of the hu- man genome. Science (New York, NY 291(5507): 1304-1351. Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB. 2008. Alternative isoform regulation in human tissue transcriptomes. Nature. Wang Z, Gerstein M, Snyder M. 2009. RNA-Seq: a revolutionary tool for transcriptomics. Nature reviews 10(1): 57-63. Ward AJ, Cooper TA. 2009. The pathobiology of splicing. The Journal of pathology 220(2): 152-163. Watanabe H, Fujiyama A, Hattori M, Taylor TD, Toyoda A, Kuroki Y, No- guchi H, BenKahla A, Lehrach H, Sudbrak R et al. 2004. DNA se- quence and comparative analysis of chimpanzee chromosome 22. Nature 429(6990): 382-388. Wetterbom A, Cavelier, L., Bergström, TF. (2008). Indels in the Evolution of the Human and Primate Genomes. ENCYCLOPEDIA OF LIFE SCIENCES. Wetterbom A, Sevov M, Cavelier L, Bergstrom TF. 2006. Comparative ge- nomic analysis of human and chimpanzee indicates a key role for indels in primate evolution. Journal of molecular evolution 63(5): 682-690. Wold B, Myers RM. 2008. Sequence census methods for functional genom- ics. Nature methods 5(1): 19-21. Yu N, Jensen-Seaman MI, Chemnick L, Kidd JR, Deinard AS, Ryder O, Kidd KK, Li WH. 2003. Low nucleotide diversity in chimpanzees and . Genetics 164(4): 1511-1518. Yunis JJ, Prakash O. 1982. The origin of man: a chromosomal pictorial leg- acy. Science (New York, NY 215(4539): 1525-1530.

61  /-   / 0              ,  1  2  3 04 2 ! 

  0    2   3 04 2 ! 5 / 0 /- 45   004  4 2  2   6  2  2  0     7   8 #    0 5 0   4 0       004  9    19 0  - #  2 / 0 1   2   3 04 2 ! 6 :;   <  45 ''5      0    0 = - #  2 / 0 1   2   3 04 2 ! >6?

        1 , 0 66     ,,,, - &$$*(.