
Characterization and Evolution of Transmembrane Proteins With

In memory of Stig Pålsson

List of Papers

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I Nordström, K. J. V., Fredriksson, R., Schiöth, H. B. (2008) The amphioxus (Branchiostoma floridae) genome contains a highly diversified set of G protein-coupled receptors. BMC Evolutionary Biology, 8:9
II Nordström, K. J. V., Lagerström, M. C., Wallér, L. M. J., Fredriksson, R., Schiöth, H. B. (2009) The Secretin GPCRs Descended from the Family of Adhesion GPCRs. Molecular Biology and Evolution, 26(1):71-84
III Sällman Almén, M., Nordström, K. J. V., Fredriksson, R., Schiöth, H. B. (2009) Mapping the human membrane proteome: a majority of the human membrane proteins can be classified according to function and evolutionary origin. BMC Biology, 7:50
IV Nordström, K. J. V., Sällman Almén, M., Edstam, M. M., Fredriksson, R., Schiöth, H. B. Independent HHsearch, Needleman-Wunsch-based and motif analyses reveals the overall hierarchy for most of the families of G protein-coupled receptors. Submitted

Reprints were made with permission from the respective publishers.

Additional publications

Articles
• Hill, T., Nordström, K. J. V., Thollesson, M., Säfström, T. M., Vernersson, A. K. E., Fredriksson, R., Schiöth, H. B. (2010) SPRIT: Identifying horizontal gene transfer in rooted phylogenetic trees. BMC Evolutionary Biology, 10:42
• Nordström, K. J. V., Mirza, M. A., Sällman Almén, M., Gloriam, D. E., Fredriksson, R., Schiöth, H. B. (2009) Critical evaluation of the FANTOM3 non-coding RNA transcripts. Genomics, 94(3):169-76
• Fredriksson, R., Nordström, K. J. V., Stephensson, O., Hägglund, M. G. A., Schiöth, H. B. (2008) The solute carrier (SLC) complement of the human genome: Phylogenetic classification reveals four major families. FEBS Letters, 582(27):3811-6
• Schiöth, H. B., Nordström, K. J. V., Fredriksson, R. (2007) Mining the gene repertoire and ESTs for G protein-coupled receptors with evolutionary perspective. Acta Physiologica (Oxf), 190(1):21-31
• Nordström, K. J. V., Mirza, M. A., Larsson, T. P., Gloriam, D. E., Fredriksson, R., Schiöth, H. B. (2006) Comprehensive comparisons of the current human, mouse, and rat RefSeq, Ensembl, EST, and FANTOM3 datasets: identification of new human genes with specific tissue expression profile. Biochemical and Biophysical Research Communications, 348(3):1063-74

Book chapters
• Schiöth, H. B., Nordström, K. J. V., Fredriksson, R. The Adhesion GPCRs; Gene repertoire, Phylogeny, and Evolution. With Editor

Contents

INTRODUCTION
Membrane Proteins
Transporters
Receptors
Enzymes
Miscellaneous
Identification of membrane proteins
G protein-coupled receptors
Species
AIMS
Paper I
Paper II
Paper III
Paper IV
MATERIALS AND METHODS
Genomes
Sequences
Identification of candidate proteins
Domain search
Sequence identity
Splice sites
Phylogenetic analysis
Paper I, phylogeny
Paper II, phylogeny
RESULTS AND DISCUSSION
Paper I
Paper II
Paper III
Paper IV
FUTURE PERSPECTIVES
ACKNOWLEDGEMENT
REFERENCES

Abbreviations

7TM – Seven pass trans-membrane domain
A. gambiae – Anopheles gambiae
B. floridae or amphioxus – Branchiostoma floridae
C. elegans – Caenorhabditis elegans
C. familiaris – Canis familiaris
C. intestinalis – Ciona intestinalis
cAMP – cyclic adenosine monophosphate
CDD – Conserved domain database
css – Conserved Splice Site
D. discoideum – Dictyostelium discoideum
D. melanogaster – Drosophila melanogaster
D. rerio – Danio rerio
G. gallus – Gallus gallus
GPCR – G protein-coupled receptor
GPS – GPCR proteolytic site
H. sapiens – Homo sapiens
HMM – Hidden Markov Model
M. brevicollis – Monosiga brevicollis
M. musculus – Mus musculus
MCMC – Markov chain Monte Carlo
N. vectensis – Nematostella vectensis
nr – non-redundant
PK
R. norvegicus – Rattus norvegicus
S. cerevisiae – Saccharomyces cerevisiae
S. pombe – Schizosaccharomyces pombe
S. purpuratus – Strongylocentrotus purpuratus
T. adhaerens – Trichoplax adhaerens
T. nigroviridis – Tetraodon nigroviridis
T. rubripes – Takifugu rubripes
TM – Trans-membrane
uPA – urokinase-type plasminogen activator
VLGR1 – Very long G protein-coupled receptor
X. tropicalis – Xenopus tropicalis

INTRODUCTION

Membrane Proteins

The cell is separated from the surrounding environment by a cell membrane, which is impassable for most biological substances. The same type of membrane also encloses the interior of the different organelles within the cell. This protection is crucial for life, as many of the chemical reactions that life depends on would be disrupted if the conditions were not tightly regulated. Still, life is very dynamic and depends on controlled transfer of information or particles between the compartments in order to interact with and respond to the environment in a specific manner. The integral membrane proteins play a critical role in this machinery. Transporters, receptors and enzymes can be identified as the three largest functional groups of membrane proteins (see Figure 1).

Figure 1. A schematic figure depicting examples from the three largest functional groups of transmembrane proteins: transporters that transport substances over the membrane, receptors that react to a ligand and signal on the other side of the membrane, and enzymes that catalyze biochemical reactions.

Transporters

Transporters move a substrate across membranes by utilizing electrochemical gradients or energy from chemical reactions. Transporter proteins can be grouped in various classes, of which the most important are ion channels (Yu and Catterall 2004), ABC transporters (Borst and Elferink 2002), water channels (Wang, Feng et al. 2006), pumps such as the sodium potassium pump (Dunbar and Caplan 2001) and the solute carriers (Fredriksson, Nordstrom et al. 2008). Here it is also possible to include the auxiliary transport proteins that modulate the activity of other transporters rather than performing the transport themselves.

Receptors

A receptor is a protein that mediates a cellular response upon binding of a ligand. Most receptor families can be placed in one of four superfamilies: G protein-coupled receptors, Receptor type tyrosine kinases, Receptors of the immunoglobulin superfamily and related, and Scavenger receptors and related. The G protein-coupled receptors (GPCRs) are covered in more detail below.

Enzymes

Enzymes are proteins with the ability to catalyze a chemical reaction. They can be classified based on the EC system, which groups enzymes performing a similar type of reaction into six major classes: Oxidoreductases, Transferases, Hydrolases, Lyases, Isomerases and Ligases (Chang, Scheer et al. 2009). The majority of enzymes are not bound to the cell membrane.

Miscellaneous

The 74 protein families that did not fit into any of the three major functional classes were gathered in a class called Miscellaneous. Many of the proteins in this class are Structural or Adhesion proteins. Examples of these are the cadherin and protocadherin families. Furthermore, there are membrane proteins that act as ligands to other proteins, like the semaphorins, which are involved in axon guidance and vascular growth among other things (Kruger, Aurandt et al. 2005). Finally, there are a number of proteins with unknown function.

Identification of membrane proteins

Recent overviews of membrane-bound proteins discuss important membrane protein groups such as the G protein-coupled receptors (GPCRs), aquaporins, ion channels and ATPases, together with their structure and topology (von Heijne 2006; von Heijne 2007). While several individual protein and gene families have been relatively well described, e.g. the GPCRs (Lagerstrom and Schioth 2008) and voltage-gated ion channels (William A, Catterall KG et al. 2003), a considerable number of genes remain unexplored. One of the most cited papers on the percentage of membrane proteins in proteomes is from 2001, where the membrane topology prediction method TMHMM was applied to a number of proteomes from different species to estimate the membrane protein content, e.g. Caenorhabditis elegans (31%), Escherichia coli (21%) and Drosophila melanogaster (20%) (Krogh, Larsson et al. 2001). However, neither the human proteome nor that of any other vertebrate was included in this study.

A crucial number when calculating the fraction of membrane proteins is the total number of genes in a species. Even after more than a decade of qualified guesses (Antequera and Bird 1993; Fields, Adams et al. 1994; Ewing and Green 2000; Li, Cutler et al. 2003; Larsson, Murray et al. 2005; Clamp, Fry et al. 2007), including serious attempts to estimate the total number of protein-coding sequences in human and mouse (Larsson, Murray et al. 2005; Nordstrom, Mirza et al. 2006; Clamp, Fry et al. 2007), there are large uncertainties regarding the total number of transcribed elements in the mammalian genomes. Another complication is that the exact proportions of coding vs. non-coding genes have been extremely hard to determine (Birney, Stamatoyannopoulos et al. 2007). In mouse, one of the most important efforts to identify all transcribed mRNAs in the genome is the RIKEN cDNA project (Carninci, Waki et al. 2003; Hayashizaki 2003). The FANTOM3 release comprises 102,801 cDNAs, of which 38,129 have been classified as potential non-coding RNA genes (Carninci, Kasukawa et al. 2005). Less than a third of the non-coding RNA genes are conserved in rat and human (Nordstrom, Mirza et al. 2009), adding another source of uncertainty.

Clamp et al. argue that open reading frames that are not conserved in other species should only be trusted when supported by other evidence. This has resulted in a dataset of 20,488 genes based on the union of the human gene catalogues of Ensembl 35 and 48, where all non-conserved entries have been carefully curated (Clamp, Fry et al. 2007; Hubbard, Aken et al. 2009). This is in contrast with the estimated gene count of 31,778 in the original human genome sequence project (Lander, Linton et al. 2001). Initially, Lander et al. suggested that 20% of the total gene count codes for membrane proteins (Lander, Linton et al. 2001). This falls well within the range of more recent predictions of the size of the membrane proteome in various species (Ahram, Litou et al. 2006). Depending on the prediction method, 15-39% of the human proteome was predicted to be membrane proteins, clearly illustrating how difficult it is to estimate the number with automatic approaches.

G protein-coupled receptors

G protein-coupled receptors (GPCRs) are the largest membrane-bound protein family in mammalian genomes, with about 800 members in the human genome (Lagerstrom and Schioth 2008). The GPCRs have a trans-membrane domain, which passes seven times through the membrane (7TM). The 7TM domain is well conserved and flanked by an extra-cellular N-terminus and an intra-cellular C-terminus. GPCRs are located in the membrane and transduce the presence of an extra-cellular signal to an intra-cellular response. The extra-cellular signal varies over a large range, spanning from photons over organic odorants, nucleotides, peptides and lipids to complete proteins (Bockaert and Pin 1999). The conserved motifs present in GPCRs have been carefully mapped (Attwood and Findlay 1994; Attwood 2001). GPCRs can be grouped by the A-F (Kolakowski 1994), the 1-5 (Bockaert and Pin 1999) or the GRAFS system (Fredriksson, Lagerstrom et al. 2003). These systems overlap, and in this thesis the GRAFS system is the one most frequently used. It is based on the human repertoire of GPCRs and a phylogenetic division of these, which results in the five families Glutamate (G), Rhodopsin (R), Adhesion (A), Frizzled/Taste2 (F) and Secretin (S) (Fredriksson, Lagerstrom et al. 2003). The Rhodopsin family of GPCRs, which corresponds to class A or 1, is the largest family, with about 672 members in the human genome including about 388 olfactory receptors. Most of these receptors have short N-termini and bind peptide, amine and lipid-like compounds in a ligand-binding pocket within the trans-membrane (TM) regions of the protein. The Glutamate GPCRs, which correspond to class C or 3, are characterized by the so-called “Venus Flytrap” mechanism, which is found in the N-termini and is crucial for ligand binding.
The Frizzled receptors, which correspond to class F or 5, have long cysteine-rich N-termini that interact with the curly twisted Wnt protein and have a role in cell polarity, while the Taste 2 receptors lack long N- and C-termini and sense bitter-tasting substances. Both the Secretin and the Adhesion families correspond to class B or 2. The Secretin GPCRs all have a hormone binding domain in their N-termini that interacts with peptide hormones (Schioth, Nordstrom et al. 2007). The Adhesion GPCR family, with 33 members, is the second largest GPCR family in humans. Its members are characterized by very long serine and threonine rich N-termini that have multiple domains often found in other types of proteins such as tyrosine kinases. It has been speculated that these long N-termini have a role in cell to cell communication, allowing them to participate in different types of cell guidance (Bjarnadottir, Fredriksson et al. 2007). The insect-specific Methuselah family is also included in class B. Some similarities have been identified between the Secretin, Adhesion and Methuselah GPCRs (Harmar 2001), and many domain databases (7tm_2 in Pfam (Finn, Mistry et al. 2006), GPCR_secretin in Interpro (Zdobnov and Apweiler 2001) and 7tm_2 in the National Center for Biotechnology Information (NCBI) Conserved domain database (CDD) (Marchler-Bauer and Bryant 2004)) use sequences from all three groups to form common models for search tools, despite the fact that the functional characteristics of these families are highly divergent.

Recently, Cardoso et al. studied evolutionary events that shaped the different branches of the Secretin GPCRs and clearly established the relationship between the Secretin receptors in C. elegans and the vertebrate Secretin family sub-branches (Cardoso, Clark et al. 2005). The evolution of the Adhesion family has, however, not been studied in detail. The main reason for this is that the Adhesion family is, by far, the most complex group of GPCR sequences. These GPCRs are very large and have a large number of introns. Alternative splicing and complex processing steps, including the putative intracellular cleavage at the GPCR proteolytic site (GPS), are also contributing factors to their complexity (Bjarnadottir, Fredriksson et al. 2007).

GPCRs similar to those in mammals are not found in bacteria, but mammalian-like GPCRs can be found in almost any eukaryotic organism. This includes plants (Devoto, Piffanelli et al. 1999; Josefsson 1999), fungi (Versele, Lemaire et al. 2001) and the amoeba Dictyostelium discoideum (D. discoideum) (Prabhu and Eichinger 2006). What was thought to be a novel family of insect GPCRs involved in odorant functions (Hill, Fox et al. 2002) has been shown to consist of ligand-gated ion channels (Sato, Pellegrino et al. 2008; Wicher, Schafer et al. 2008). The light sensing 7TM protein found in bacteria, the bacterial rhodopsin, does not signal through G proteins and has very low sequence identity to GPCRs (Okada, Ernst et al. 2001). It is thus presently unclear whether this protein has a common origin with GPCRs in eukaryotic organisms.

The five main families (see above) are present in considerable numbers in most metazoan species including the social amoeba D. discoideum, the yeasts Schizosaccharomyces pombe (S. pombe) and Saccharomyces cerevisiae (S. cerevisiae), the nematode Caenorhabditis elegans (C. elegans), the insects Anopheles gambiae (A. gambiae) and Drosophila melanogaster (D. melanogaster), the sea squirt Ciona intestinalis (C. intestinalis), the fishes Tetraodon nigroviridis (T. nigroviridis), Takifugu rubripes (T. rubripes) and Danio rerio (D. rerio), the frog Xenopus tropicalis (X. tropicalis), the chicken Gallus gallus (G. gallus), the rodents Mus musculus (M. musculus) and Rattus norvegicus (R. norvegicus), the dog Canis familiaris (C. familiaris) and the human Homo sapiens (H. sapiens) (Fredriksson, Lagerstrom et al. 2003; Bjarnadottir, Fredriksson et al. 2004; Eichinger, Pachebat et al. 2005; Fredriksson and Schioth 2005; Metpally and Sowdhamini 2005; Gloriam, Fredriksson et al. 2007; Schioth, Nordstrom et al. 2007; Kamesh, Aradhyam et al. 2008; Haitina, Fredriksson et al. 2009; Ji, Zhang et al. 2009). An excerpt of these species is presented in figure 2. The repertoire in D. discoideum contains distant orthologs to sequences from the Glutamate, Frizzled and Adhesion families together with cyclic adenosine monophosphate (cAMP) receptors belonging to class D. The Rhodopsin family can be divided into four main groups (α, β, γ and δ) with 13 main branches, and previous studies have shown that members of each of the four main groups seem to be found in most bilaterian species, while the representation of each of the main branches is highly variable (Fredriksson and Schioth 2005; Lagerstrom, Hellstrom et al. 2006; Gloriam, Fredriksson et al. 2007; Schioth, Nordstrom et al. 2007).

Figure 2. The tree in the lower part depicts the phylogenetic relationship of common species, with sequenced genomes, that are addressed in this work. The hierarchy is adopted from (Pennisi 2003). In the text, the Latin names are used; Human – Homo sapiens, Chicken – Gallus gallus, Clawed frog – Xenopus tropicalis, Pufferfish – Tetraodon nigroviridis, Vase tunicate – Ciona intestinalis, Florida lancelet or amphioxus – Branchiostoma floridae, Fruit fly – Drosophila melanogaster, Roundworm – Caenorhabditis elegans, Sea anemone – Nematostella vectensis, Trichoplax – Trichoplax adhaerens and Slime mold – Dictyostelium discoideum.

Species

This thesis unravels many crucial events of the evolution of GPCRs among animals, i.e. the Metazoan kingdom. This has been made possible through sequencing of several key genomes (see figure 2). A good starting point is the Slime mold, D. discoideum, which is an amoeba and not considered an animal. It split from the human lineage more than 1000 million years ago. Still, it has a considerable number of GPCRs, with sequences in three main families present in humans: Adhesion, Frizzled and Glutamate (Prabhu and Eichinger 2006). D. discoideum is a single-celled social amoeba that gathers into multicellular structures during starvation (Eichinger, Pachebat et al. 2005). To be a part of the Metazoan kingdom a species must be multicellular and have more than one cell type, which is why D. discoideum is considered to have split from the human lineage before the first Metazoan species occurred. One of the simplest known Metazoan species whose genome is sequenced is Trichoplax adhaerens (T. adhaerens). It has the shape of a flat disc and is found in subtropical and tropical oceans and on the walls of saltwater aquariums (Srivastava, Begovic et al. 2008). An interesting sister-group to T. adhaerens, for which no genome is yet sequenced, is the sponges or porifera. Instead, the next species with a sequenced genome to split from the human lineage is the cnidarian or sea anemone Nematostella vectensis (N. vectensis). The body of N. vectensis also has a radial symmetry, but compared to T. adhaerens, it has evolved a basic nervous system. The genome is sequenced and the information is available on the Joint Genome Institute’s homepage (Putnam, Srivastava et al. 2007). To the best of my knowledge, no focused inquiries regarding the full repertoires of GPCRs in T. adhaerens and N. vectensis were made prior to the works in this thesis.

The next large phylum to branch from the human lineage is the Ecdysozoa, which contains both the nematodes and the arthropods (Aguinaldo, Turbeville et al. 1997). The competing Articulata theory claims that the nematodes split before the arthropods, but the evidence weighs towards the Ecdysozoa theory (Pennisi 2003; Podsiadlowski, Braband et al. 2008). Two model species in the Ecdysozoa phylum are the nematode and roundworm C. elegans and the arthropod and fruit fly D. melanogaster. Both these species have been mined for GPCRs and have in common that their repertoires are divergent from what is known in other species (Fredriksson and Schioth 2005).

The development of the backbone that characterizes the vertebrates is interesting from an evolutionary point of view. Hence, several species that split from the human lineage at the time when the backbone developed have had their genomes sequenced. First the echinoderm and sea urchin Strongylocentrotus purpuratus (S. purpuratus) branched off, followed by the lancelets and tunicates, represented by Branchiostoma floridae (B. floridae or amphioxus) and C. intestinalis, respectively. One of the articles in this thesis focuses on B. floridae, which is a small cephalochordate that spends much of its time buried in the sand. It is one of the closest now living relatives of vertebrates.
Amphioxus shares several features with vertebrates, like a dorsal, hollow nerve cord, notochord, segmental muscles and pharyngeal gill slits. On the other hand, it lacks the pronounced head region of vertebrates, neural crest cells functioning similarly to those in vertebrates, paraxial skeletal tissue and some visceral organs (Holland, Laudet et al. 2004). It has been argued on the basis of molecular phylogenetic data that tunicates, such as C. intestinalis, could be more closely related to vertebrates than cephalochordates are (Delsuc, Brinkmann et al. 2006). GPCRs have been mined from the C. intestinalis genome (Kamesh, Aradhyam et al. 2008), while very little was known about GPCRs in the most basal cephalochordates such as amphioxus.

These three species live in the ocean, and the first vertebrates were fishes. Two of the first phyla to split from the human lineage were those of hagfishes and lampreys. From the latter, the sequencing of the genome of the sea lamprey Petromyzon marinus is in progress. Several fish genomes are sequenced, and in this thesis the genome of the Pufferfish T. nigroviridis, which has been mined for GPCRs (Metpally and Sowdhamini 2005), is used as a reference for the fishes. The frogs constitute one of the earlier phyla to be present above water. The genome of the Western clawed frog X. tropicalis is sequenced and has been mined for GPCRs (Ji, Zhang et al. 2009). Other common genomes in this thesis are those of G. gallus, M. musculus and H. sapiens, all of which have been mined for GPCRs (Attwood and Findlay 1994; Fredriksson, Lagerstrom et al. 2003; Vassilatis, Hohmann et al. 2003; Lagerstrom, Hellstrom et al. 2006).

AIMS

Paper I

The aim of this study was to map the repertoire of GPCRs in B. floridae, a bottom-living, fish-like animal, which is one of the closest now-living, non-vertebrate relatives of the vertebrates.

Paper II

The aim was to study the gene repertoire of the class B families Adhesion, Secretin and Methuselah from an evolutionary perspective.

Paper III

The aim of this study was to generate a curated classification of the human membrane proteins.

Paper IV

The aim of this study was to delineate a tree of the evolution of the GPCR families present in metazoan species.

MATERIALS AND METHODS

Genomes

The most common technique for sequencing a genome today is shotgun sequencing. In short, this means breaking the genome into smaller pieces, sequencing these and using computers to puzzle them together again. To compensate for errors, this is done several times, and many genomes have a coverage of around eight times. From the genome sequence, it is possible to predict protein-coding genes by, for instance, searching for open reading frames. This is usually done for newly sequenced genomes. The quality of the predictions varies, and it has been shown that it can be fruitful to curate them manually (Lagerstrom, Hellstrom et al. 2006). Generally, what is available for download for a new genome is the genomic scaffolds, a protein dataset and the corresponding nucleotide sequences.
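
As an illustration of what searching for open reading frames means in its simplest form, the sketch below scans the three forward reading frames of a DNA string for stretches that start with ATG and end at the first in-frame stop codon. It is not the gene predictor used for any of the genomes in this thesis; the function name `find_orfs` and the `min_codons` parameter are hypothetical, and a real predictor would also have to handle the reverse strand and, in eukaryotes, introns.

```python
# Minimal ORF scan on the forward strand only (illustrative sketch).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return sorted (start, end) index pairs of ORFs on the forward strand.

    An ORF starts at ATG and runs to the first in-frame stop codon.
    `end` is exclusive and includes the stop codon; `min_codons` counts
    the start and stop codons as well.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ATG in this frame
    return sorted(orfs)
```

For example, `find_orfs("CCATGAAATTTTAGGG")` reports a single ORF spanning the substring `ATGAAATTTTAG`.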

Sequences

In this thesis, newly sequenced genomes for B. floridae, Monosiga brevicollis (M. brevicollis), T. adhaerens and N. vectensis were downloaded from the Joint Genome Institute's homepage. Full-length Adhesion sequences for H. sapiens, M. musculus and G. gallus were downloaded according to information from previously published articles (Bjarnadottir, Fredriksson et al. 2004; Lagerstrom, Hellstrom et al. 2006). The genome and the proteome for T. nigroviridis (version 7.46), D. melanogaster (version BDGP4.3.46) and C. elegans (version WB170.46) were downloaded from Ensembl (http://www.ensembl.org). The genome and the proteome for D. discoideum were downloaded from Dictybase (http://www.dictybase.org). For S. purpuratus, we used version 3 of the genome and proteome, downloaded from SpBase (Sodergren, Weinstock et al. 2006). In paper III, we used IPI Human version 3.39, containing 69,731 protein sequences, which was downloaded from EBI (Kersey, Duarte et al. 2004).

Figure 3. A schematic flow chart of GPCR mining: processing a protein source, identifying GPCRs and returning the classified GPCRs, if any. The gray boxes hold different methods and the white fields directly beneath point out which part of the result to consider. The procedure is described in more detail in the text.

Identification of candidate proteins

The first generation of sequence similarity searches compares sequences to sequences. One of the earlier methods, still in use, is the global alignment algorithm by Needleman and Wunsch (Needleman and Wunsch 1970), which, given a set of parameters, returns the optimal alignment. This algorithm is time consuming and was not a viable option when computational capacity first became widely available. Instead, well-recognized approximate methods such as BLAST (Altschul, Gish et al. 1990) and BLAT (Kent 2002) became standard tools. The second generation of sequence similarity searches compares sequences to statistical models of several sequences (e.g. HMMER (Eddy 1998)). Further development of sequence similarity search tools has resulted in third generation tools, such as HHsearch (Soding 2005) and the similar PRC (Madera 2008), which compare two statistical models, each built from a group of sequences.

The method applied in papers I, II and IV to search for GPCRs is outlined in figure 3 and can be separated into three main sections: the search for putative sequences, filtration of the found sequences and, lastly, classification of the sequences. The initial search was conducted by combining searches with BLAST (Altschul, Gish et al. 1990) and Hidden Markov Model (HMM) searches with HMMER (Eddy 1998). BLAST was used to align the new sequences against the human RefSeq data set (Pruitt, Tatusova et al. 2007) extended with GPCRs from other known families (Kolakowski 1994) (nematode chemosensory receptors, the gustatory receptors from insects, the odorant receptors from D. melanogaster, MLO receptors from plants, fungal receptors (STE2 and STE3) from yeast and cAMP receptors from D. discoideum). Considering only hits with an e-value below 0.01, a transcript was selected as a putative GPCR if the number of GPCR hits among the top five hits was above three. The sequences were also searched with HMMER against GPCR-HMMs adopted from (Fredriksson and Schioth 2005). Sequences for which the best-hit model had an e-value above 0.01, or for which the best-hit model disagreed with the BLAST result, were discarded. Several steps of filtration were applied to the sequences in order to remove artifacts.
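
The top-five BLAST rule described above can be sketched in a few lines. This is a minimal illustration and not the in-house code used in the papers: the function name, the `(subject_id, evalue)` input format and the set of GPCR identifiers are assumptions, and "above three" is read here as at least four GPCR hits among the five best significant hits.

```python
def is_putative_gpcr(hits, gpcr_ids, max_evalue=0.01, top_n=5, min_gpcr_hits=4):
    """Apply the top-five rule to the BLAST hits of one query.

    hits: iterable of (subject_id, evalue) pairs, in any order.
    gpcr_ids: set of subject identifiers annotated as GPCRs (assumed input).
    """
    # Keep only significant hits and rank them by e-value, best first.
    significant = sorted((h for h in hits if h[1] < max_evalue),
                         key=lambda h: h[1])
    top = significant[:top_n]
    n_gpcr = sum(1 for subject_id, _ in top if subject_id in gpcr_ids)
    return n_gpcr >= min_gpcr_hits
```

A query with fewer than four significant hits can never pass under this reading, which may or may not match the original implementation.
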
One step, used in all the publications, was the removal of splice variants: all putative GPCRs were aligned to the genome with BLAT (Kent 2002), and only one sequence was kept from each cluster of sequences overlapping on the genome. Other possible filtration steps included the removal of incomplete sequences. One method was based on discarding sequences whose number of TM passages deviated too much from the seven TM passages of a GPCR. One example of a TM-prediction program is Phobius (Kall, Krogh et al. 2004), which uses HMMs to predict the TM passages. The advantage of this program is that it also predicts whether the protein has a signal peptide, which is otherwise easily interpreted as an extra TM passage. In paper III, to improve accuracy (Ahram, Litou et al. 2006), Phobius was complemented with TMHMM (Krogh, Larsson et al. 2001) and SOSUI (Hirokawa, Boon-Chieng et al. 1998). Like Phobius, TMHMM uses Hidden Markov Models (HMMs) to predict membrane topology, but with a different training set. SOSUI evaluates hydrophobicity and amphiphilicity for its predictions and complements the HMM methods, as it is not dependent on training sets. These three programs predict the topology of TM proteins spanning the membrane with α-helices. Phobius was used to predict TM helices and signal peptides for the IPI Human dataset as a first step in paper III. The predicted signal peptides were cut out of the sequences before prediction with SOSUI and TMHMM to avoid false-positive predictions, as suggested by Ahram and colleagues (Ahram, Litou et al. 2006). Candidate membrane proteins were initially selected as those predicted to have TM helices by at least two applications.

Another method is to discard sequences with 7TM domains that are considerably shorter than those of other sequences from the same family. The 7TM domain of a putative GPCR can be identified either by aligning it to the corresponding HMM or by aligning it to other closely related GPCRs for which the 7TM domain is known. As the quality of protein predictions can be poor, a second step of the latter method is to edit the sequences with regard to the genome and correct the splice sites. In this process, sequences with stop codons within the 7TM domain can also be found and then discarded as pseudogenes.

Finally, the putative sequences are classified to a group or family based on the top five hits in BLAST searches, in a similar manner to the initial search described above. This search can be done against a dataset consisting of only GPCRs, against a larger dataset, or as a combination of both. The GPCR dataset should be sufficient in most cases, but there could be problems with sequences belonging to novel expansions. The wider scope of, for instance, NCBI's non-redundant (nr) dataset is more forgiving for these sequences, but comes with the problem of vague annotation.
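
The two helix-count filters described above can be illustrated as follows. The sketch assumes that the number of predicted TM helices per program has already been parsed from the output of Phobius, TMHMM and SOSUI; the function names and the `tolerance` value are hypothetical, since the text does not state how large a deviation from seven helices was accepted.

```python
def candidate_membrane_protein(tm_counts, min_agreeing=2):
    """Return True if at least `min_agreeing` predictors report a TM helix.

    tm_counts: dict mapping predictor name -> number of predicted TM helices.
    """
    n_positive = sum(1 for n in tm_counts.values() if n > 0)
    return n_positive >= min_agreeing


def plausible_7tm(n_helices, expected=7, tolerance=1):
    """Return True if the helix count is within `tolerance` of `expected`.

    The tolerance is an assumed value, not one taken from the thesis.
    """
    return abs(n_helices - expected) <= tolerance
```

A protein with `{"Phobius": 7, "TMHMM": 6, "SOSUI": 0}` would be kept as a candidate membrane protein, since two of the three programs predict TM helices.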

Domain search

The 7TM domain of GPCRs is present in several domain databases, for instance NCBI's CDD (Marchler-Bauer, Anderson et al. 2007) and the Pfam data set (Finn, Mistry et al. 2006). The models in these databases can be aligned to protein sequences with RPS-BLAST and HMMER, respectively.

Sequence identity
Global alignments between all truncated sequences were made using the needle program from the EMBOSS package (Rice, Longden et al. 2000). The results were compiled using an in-house parser written in Python and viewed and manipulated using the spreadsheet program in the OpenOffice.org suite.
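The in-house parser is not described further in the text. As a hedged sketch of the quantity being compiled, the function below computes the percent identity of one pairwise global alignment from its two gapped rows. The function name and the choice of the full alignment length (gap columns included) as denominator are assumptions made for this illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of alignment columns in which both rows hold the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    # Assumption: identity is reported relative to the alignment length,
    # so columns containing a gap count as mismatches.
    return 100.0 * matches / len(aligned_a)


# Made-up aligned rows: six identical columns out of eight.
identity = percent_identity("ACDEFG-H", "ACDQFGKH")
```

Running this over every sequence pair gives the all-against-all identity table that can then be inspected in a spreadsheet.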

Splice sites
The positions of splice sites in the 7TM region were extracted by aligning the edited and truncated sequences to their corresponding genome using BLAT. All sequences were aligned with Kalign (Lassmann and Sonnhammer 2005). Conserved splice sites were identified by viewing the positions of the splice sites and the alignment together in Jalview (Clamp, Cuff et al. 2004). Data were manipulated with OpenOffice.

Phylogenetic analysis
The identified putative GPCRs were analyzed phylogenetically with either the Neighbor-joining method or the Maximum likelihood method. The former is a heuristic method based on a distance matrix, which aims to minimize the total branch length of the tree. It produces trees at a low computational cost. The latter is a statistical method based on a likelihood function. Still, the search for the tree structure that maximizes the likelihood function for the given dataset is heuristic. The likelihood function for a given tree takes the probability of all possible mutations into account. The Maximum likelihood method is held to give more reliable trees, but its computational cost is considerably higher than that of the Neighbor-joining method. A third phylogenetic method is the Maximum parsimony method, which tries to find the tree that requires the minimum number of evolutionary changes (Li 1997). An important component of a phylogenetic analysis is the underlying multiple sequence alignment. If the alignment is incorrect, the error is propagated through the whole analysis. This can be compensated for by sampling the alignment space with bootstrapping, calculating multiple trees and then merging them into a consensus tree. In the second article, the Markov chain Monte Carlo (MCMC) based tool MrBayes (Huelsenbeck and Ronquist 2001) is used in the search for the correct tree structure. The MCMC method performs a biased random walk through tree space and, after a burn-in period, returns a sample that approximates the posterior distribution of trees.
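The bootstrapping step mentioned above resamples the columns of the alignment with replacement, so that each replicate has the same dimensions as the original alignment. The sketch below illustrates that idea only. It is not the code of MEGA or SEQBOOT, and the function name and the toy alignment are assumptions.

```python
import random
from typing import Dict


def bootstrap_alignment(alignment: Dict[str, str], rng: random.Random) -> Dict[str, str]:
    """Build one bootstrap replicate by sampling columns with replacement."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_cols = lengths.pop()
    # Draw as many column indices as the alignment has columns.
    columns = [rng.randrange(n_cols) for _ in range(n_cols)]
    # The same column indices are used for every sequence, so that
    # homologous positions stay together in the replicate.
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }


# Toy alignment, for illustration only.
alignment = {"seq1": "MKTAY", "seq2": "MRTAF", "seq3": "MK-AY"}
replicate = bootstrap_alignment(alignment, random.Random(0))
```

A tree would be calculated from each replicate, and the resulting trees merged into a consensus tree.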

Paper I, phylogeny
We calculated neighbor-joining trees, bootstrapped five hundred times, using MEGA (Kumar, Tamura et al. 2004) with default parameters. The trees were based on an alignment made in MEGA using the ClustalW (Thompson, Higgins et al. 1994) algorithm with the batch ClustalW default parameters, for the sequences with identities above twenty percent to at least one of the reference sequences. Sequences disrupting the tree were moved to the basal group on the basis of low bootstrap values and poor alignments through manual inspection.

Paper II, phylogeny
The protein sequences of GPCRs chosen for phylogenetic analysis were aligned using Kalign (Lassmann and Sonnhammer 2005) with default parameter settings. The alignments were back-translated to nucleotides using the genomic sequence obtained when assembling the GPCR and correcting its splice sites. We used MCMC analysis with MrBayes, but to counter the liberal posterior probabilities of Bayesian analysis (Suzuki, Glazko et al. 2002; Alfaro, Zoller et al. 2003; Douady, Delsuc et al. 2003) we chose to implement the bootstrapped MCMC analysis suggested by Douady et al. Hence, the aligned nucleotide file was bootstrapped 200 times using SEQBOOT from the PHYLIP 3.67 package (Felsenstein 2004), and each alignment was analyzed in MrBayes with the General Time Reversible model with a proportion of invariable sites and a gamma-shaped distribution of rates across sites (nst=6 and rates=invgamma). Each analysis ran for 500,000 generations. Every hundredth tree from the last 100,000 generations was sampled, and a consensus tree was constructed with the majority rule using CONSENSE from the PHYLIP 3.67 package. Maximum likelihood branch lengths were then calculated with DNAML, also from the PHYLIP 3.67 package, and the tree was plotted in the Win32 version of TreeView 1.6.6 (Page 1996). The resulting tree was edited in Inkscape.
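The majority rule used for the consensus tree keeps a grouping of taxa only if it occurs in more than half of the sampled trees. The sketch below illustrates that rule on a simplified representation in which each tree is a set of clades and each clade is a set of taxon names. It is not the CONSENSE implementation, and the function name, the data structure and the toy trees are assumptions.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, Set

Clade = FrozenSet[str]


def majority_rule_clades(
    trees: Iterable[Set[Clade]], threshold: float = 0.5
) -> Dict[Clade, float]:
    """Return the clades found in more than `threshold` of the trees, with their frequency."""
    tree_list = list(trees)
    counts = Counter(clade for tree in tree_list for clade in tree)
    n_trees = len(tree_list)
    return {
        clade: count / n_trees
        for clade, count in counts.items()
        if count / n_trees > threshold
    }


# Three toy trees over the taxa A, B, C and D (illustration only).
sampled_trees = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABD")},
    {frozenset("AB"), frozenset("ABC")},
]
support = majority_rule_clades(sampled_trees)
```

Here the clade {A, B} occurs in all three trees and {A, B, C} in two of three, so both are kept, while {A, B, D} occurs in only one tree and is dropped.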

RESULTS AND DISCUSSION

Paper I
We identified at least 664 unique GPCRs using our highly diversified seeding datasets. Most of these are found in the main families (Fredriksson, Lagerstrom et al. 2003): Glutamate (18), Rhodopsin (570), Adhesion (37), Frizzled (6) and Secretin (16). We did not, however, find any bitter taste (Taste 2) or vomeronasal (VR1) receptors. These types of receptors are abundant in rodents and most likely in all other non-primate mammals, but are rare in fish and chicken (Lagerstrom, Hellstrom et al. 2006). Neither did we find evidence for any members of the many pre-vertebrate lineage-specific expansions such as the nematode chemosensory receptors, the gustatory receptors from insects, the odorant receptors from Drosophila melanogaster, the MLO receptors in plants or the fungal pheromone receptors (STE2 and STE3) from yeast (Devoto, Hartmann et al. 2003; Fredriksson and Schioth 2005). We found, however, two sequences that show some similarity to the cAMP-binding receptors from slime moulds, but the identity is fairly low (25.6%) and further work is needed to determine whether they are true cAMP receptors. We found a very rich repertoire of Adhesion GPCRs, in total 37, in B. floridae. Especially interesting are the Kringle and Somatomedin B domains, which are found in sequences from a previously uncharacterized Adhesion expansion of ten genes. The Kringle domain is a protein-binding domain (Patthy, Trexler et al. 1984) present in urokinase-type plasminogen activator (uPA), while the Somatomedin B domain can be found in vitronectin (Salasznyk, Zappala et al. 2007). These two proteins interact, and the Somatomedin B domain helps in the localization of uPA to focal adhesions in microvessel endothelial cells. Interestingly, all of these domains, not previously identified in Adhesion GPCRs, have a large number of conserved cysteines, a feature shared with many other common domains found in this family.
As in most vertebrates, the Rhodopsin family contains the largest number of GPCRs in amphioxus. These GPCRs are found in all four main groups of Rhodopsin GPCRs. However, only eight of the thirteen subgroups are present in amphioxus; missing are the mammalian type of olfactory receptors and the chemokine, melanin-concentrating hormone (MCH), MAS-related and purine receptor subgroups. Recently, some of the transcripts we identified as Rhodopsin GPCRs in B. floridae were further classified as olfactory GPCRs with distant similarities to vertebrate olfactory GPCRs (Churcher and Taylor 2009; Niimura 2009). The Rhodopsin family in B. floridae contains several larger expansions of GPCRs. The largest of these includes the Neuropeptide FF receptor in the - group (39 sequences), somatostatin, opioid, galanin cluster (29 sequences), two expansions in the melanocortin, endothelin cannabinoid clusters with 21 and 24 sequences each and one in the cluster (19). The - group, which has several peptide receptors, also has a large expansion of twenty predicted genes that are similar to the prokineticin receptors (PK) 1 and 2, which are important for contraction of gastrointestinal smooth muscles (Lin, Bullock et al. 2002). These are, on average, about 42 percent identical to the PK1 and PK2 receptors in the TM regions. This expansion is surprising; however, as prokineticin is a very potent muscle contractor, it is possible that these receptors are involved in the early, specific pre-vertebrate muscles that create effective undulations of the amphioxus body.

Paper II
We collected 250 sequences from the extended class B GPCRs in H. sapiens, M. musculus, G. gallus, T. nigroviridis, D. melanogaster, C. elegans, N. vectensis, M. brevicollis and D. discoideum. The N. vectensis genome had not previously been mined for class B GPCRs, and we conclude here that it contains 38 class B transcripts, of which 37 were classified as Adhesion and one, Nv112360, as a Methuselah GPCR. Four of the N. vectensis Adhesion GPCRs, Nv187681, Nv211490, Nv212781 and Nv215376, showed some similarity to Methuselah GPCRs. We calculated a phylogenetic tree of the Adhesion GPCRs using a bootstrapped nucleotide method with MrBayes and estimated the branch lengths using DNAML from the PHYLIP 3.67 package. According to Cardoso et al. (Cardoso, Clark et al. 2005), the two most ancient groups of Secretin family GPCRs are the CRHRs (group A) and the CALCRs (group E). We constructed a preliminary tree of the Secretin GPCRs with an out-group of Adhesion GPCRs and identified Secretin group A as the group most closely related to the Adhesion GPCRs. Based on this, we included all the pre-vertebrate and group A Secretin GPCRs in the tree of the Adhesion GPCRs. The Adhesion tree has two major nodes for which a bifurcating topology could not be reached. The node closest to the out-group consists of group VI, the Very long G protein-coupled receptors (VLGR1s), a branch with groups III and VIII and GPR128, and a branch with 12 N. vectensis GPCRs. The latter branch, denoted NvX, contains both the gene classified as Methuselah and the three N. vectensis genes with similarities to Methuselah GPCRs. The other major node holds one branch with GPR144, part of group V, and the Secretin GPCRs, one branch with groups I and II, and two branches with groups IV and VII. The GPR133 orthologs from group V are also placed in this node.

A tree containing the Methuselah GPCRs, the new expansion of N. vectensis GPCRs and the Adhesion GPCRs in groups III, VI and VIII was also calculated. Because the domain structure of the N. vectensis expansion is similar to that of an expansion in B. floridae (Nordstrom, Fredriksson et al. 2008), the B. floridae sequences were also included. We excluded GPCRs from G. gallus in groups III and VI. The resulting tree places the Methuselah family closest to the out-group. The expansions in N. vectensis and B. floridae are placed in the same node as groups III, VI and VIII. Bf133112 from B. floridae is placed basal to the N. vectensis expansion with a bootstrap value below 50%.

We analyzed the number and position of splice sites in the 7TM region of the class B GPCRs. We can confirm the results of Cardoso et al. (Cardoso, Clark et al. 2005) concerning the positions of the splice sites in the Secretin family GPCRs. There are seven well-conserved splice sites in the vertebrate group A-D Secretin GPCRs and six in group E Secretin GPCRs, here denoted Conserved Splice Site (css) 1-7. The conserved splice sites are also present in the Adhesion family, although there are distinct differences between the groups. Group V is split between two neighboring nodes in the phylogenetic tree, but both have similar patterns of splice sites. All three vertebrate orthologs of GPR144 have six of the conserved Secretin splice sites, with only css4 missing. The N. vectensis GPCRs Nv201898 and Nv204814 both have the complete set of conserved Secretin splice sites. All the vertebrate orthologs of GPR133 have all seven conserved splice sites found in the Secretin family. Nv20971 has splice sites at css4 to css7. The three N. vectensis GPCRs Nv78835, Nv125574 and Nv217885, placed in the node holding groups I, II, IV, V and VII, also have all seven splice sites, css1-css7.
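The splice site patterns reported above can be compared against the full Secretin setup (css1 to css7) with simple set arithmetic. The sketch below encodes only the patterns stated in the text. The dictionary layout and the function name are assumptions made for this illustration, and the actual positions were determined by inspecting alignments in Jalview rather than by code like this.

```python
from typing import Dict, FrozenSet, List

# The seven conserved Secretin splice sites, css1 to css7.
SECRETIN_CSS: FrozenSet[int] = frozenset(range(1, 8))

# Patterns as stated in the text for group V Adhesion GPCRs and Nv20971.
observed: Dict[str, FrozenSet[int]] = {
    "GPR144 orthologs": SECRETIN_CSS - {4},  # only css4 missing
    "GPR133 orthologs": SECRETIN_CSS,        # all seven sites
    "Nv201898": SECRETIN_CSS,                # complete setup
    "Nv204814": SECRETIN_CSS,                # complete setup
    "Nv20971": frozenset({4, 5, 6, 7}),      # css4 to css7
}


def missing_sites(sites: FrozenSet[int]) -> List[int]:
    """Return the conserved Secretin splice sites absent from a pattern."""
    return sorted(SECRETIN_CSS - sites)


report = {name: missing_sites(sites) for name, sites in observed.items()}
```

An empty list in `report` marks a sequence that shares the complete Secretin splice site setup.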
Our overall phylogenetic tree suggests that group V (which contains GPR133 and GPR144 in humans) is the closest relative of the Secretin family within the Adhesion family. Interestingly, the group V sequences in N. vectensis (Nv201898 and Nv204814) both share the same splice site setup as the Secretin GPCRs, and this splice site setup is not shared by any of the other ancient groups. One of the most conserved motifs in the whole Secretin family is PL(L/F)G, found in TM6. This motif is highly conserved in Secretin groups A-C and E and is also present in group D, but with a lower degree of conservation. If we consider all other class B groups present in N. vectensis, this specific Secretin family motif is found only in group V of the Adhesion family. Taken together, this provides strong evidence that the Secretin family of GPCRs could have originated from group V Adhesion GPCRs. Interestingly, we found a set of unique Adhesion-like GPCRs in N. vectensis. These 13 genes do not have any GPS domain but have a TM domain that can readily be aligned with Adhesion GPCRs (amino acid identities range from 21 to 30%). These sequences have long N-termini containing one Somatomedin_B domain each. This Somatomedin_B domain is not found in any mammalian Adhesion GPCR but is, interestingly, found in a set of Adhesion-like GPCRs in B. floridae (Nordstrom, Fredriksson et al. 2008). The TM regions of these two groups do not group with each other or with any of the other branches of class B sequences. However, the unique N-terminal composition with no GPS domains and the relatively high amino acid identities (21-38%) suggest that these two groups could be related.

Paper III
We provide a non-redundant dataset for the human membrane proteome and a qualitative functional classification for all major groups and families containing 6,718 proteins. Comparison with the most recent and reliable set of genes in the human genome (Clamp, Fry et al. 2007) suggests that the 5,359 validated protein-coding α-helical transmembrane proteins comprise 27% of the entire human proteome. Our figure of 27% falls within the two previously suggested ranges of 15-39% (Ahram, Litou et al. 2006) for membrane proteins in the human proteome and 20-30% (Krogh, Larsson et al. 2001) in any proteome, regardless of species. It is notable that our clustering and classification assigned 3,145 (59%) of the valid proteins in the dataset to 234 families or groups with at least two members, while 41% of the dataset are single genes with no clear identity (significantly lower than 13% (P<0.001)) to any other human membrane protein-coding gene. The largest functional group of membrane proteins is the Receptor class, constituting 1,352 proteins. This is 23% of our membrane proteome dataset and 40% of the classified proteins. This is in stark contrast to the membrane proteome of Escherichia coli, where receptors account for only 5% of the membrane proteome, leaving transporters as the most prominent group with 40%, compared to 15% (817) in the human and 32% in the Saccharomyces cerevisiae membrane proteome (Daley, Rapp et al. 2005; Kim, Melen et al. 2006). The estimated number of membrane receptor proteins has increased from about 35 in E. coli to over 1000 in humans. The GPCRs (7TM) account for the largest expansion, with 67% of the human receptors compared to zero in bacteria and three proteins in Saccharomyces cerevisiae (Fredriksson and Schioth 2005). Furthermore, we classified 533 proteins as enzymes and 697 in miscellaneous groups with other functions.
Many proteins have only a single TM helix (47%). While some of these TM regions seem to have the primary role of simply anchoring the protein to the membrane, i.e. no signal or substrate relies on the TM helices to cross the membrane, several form oligomers that can participate in the signaling process. Non-GPCR receptors are commonly 1TM proteins (Phobius predicts 397 receptors to have this topology) and 60% of the proteins in the enzyme group are also found here, but only 57 transporters. On the other hand, multi-TM proteins are often highly dependent on the arrangement of their TM helices, which form complex structures. Transporters are the most obvious group that in general has high TM numbers: 77% of the transporters have at least a 6TM topology, and 76% of the classified proteins with at least 8 TM helices are transporters. This is also true for a majority of the families classified as transporters. In the TCDB database, which holds transporter families from all organisms, 70% of 2,847 α-helical channels, secondary transporters and active transporters have at least a 6TM topology according to predictions by Phobius. Further, 12TM was the most common topology in TCDB (17%), which is comparable to 16% in human. Thus, a high number of TM helices is a good predictor of transporter function, and the topologies of the human transporters are representative of transporters in general, considering a number of distant species. There are also receptors with a high number of TM helices. Among these are the two Hedgehog receptors, Patched, with a 12TM topology that is otherwise almost exclusively found among transporters.

Our clustering resulted in the identification of new protein groups and novel members of existing families. Discussions about individual members of existing families are found in additional data file 4. Here we want to highlight five clusters with a total of 41 sequences found in the Miscellaneous class where no previous relationship within the families has been reported. These families are simply termed New TM Group (NTMG): NTM1G1, NTM1G2, NTMG1, NTMG2 and NTM5G1. They are more or less uncharacterized, with little annotation and no similarity to any Pfam domain. The NTM5G1 family is the only family with any similarity to known proteins.
It contains three proteins; two are predicted to be 5TM proteins and one 3TM. They show high identity to the C-terminal end of the 11TM protein Unc93B1. Recently, Unc93B1 was reported to be involved in trafficking of Toll-like receptors to endolysosomes and is proposed to be involved in immunodeficiency, but when its homolog was initially characterized in Caenorhabditis elegans it was found to be involved in muscle contraction, which suggests multiple functions for the putative family (Levin and Horvitz 1992; Kim, Brinkmann et al. 2008). The three novel proteins have orthologs in several species, which supports them as valid proteins, and they might represent a novel subfamily of truncated Unc93B1 homologues. Such truncated genes were discussed by Kashuba and colleagues, as they found clones with high similarity to the 3' part of the Unc93B1 gene (Kashuba, Protopopov et al. 2002). Considering the extent of our clustering methods, we find it unlikely that larger groups of closely related proteins are left to discover within the human TM proteome, although it cannot be entirely excluded. However, there are probably still several distant members of existing families and diverse novel families that could be identified in the future using techniques more sensitive than sequence comparisons.

Paper IV
Here, we collected almost 7000 unique GPCR sequences, which we characterized into families and subgroups. This rich dataset allowed us to delineate a high-resolution picture of the evolutionary history of the GPCR families based on HMM-HMM similarities, sequence-sequence similarities and motifs based on alignments.

Figure 4. The table shows the estimated number of genes in different species for the GPCR families present in the GRAFS system. The data for slime mold (D. discoideum) are taken from (Eichinger, Pachebat et al. 2005), Florida lancelet (B. floridae) from paper I (Nordstrom, Fredriksson et al. 2008), vase tunicate (C. intestinalis) from (Kamesh, Aradhyam et al. 2008), pufferfish (T. nigroviridis) from (Metpally and Sowdhamini 2005), clawed frog (X. tropicalis) from (Ji, Zhang et al. 2009) and chicken from (Lagerstrom, Hellstrom et al. 2006). Data for Trichoplax (T. adhaerens) and sea anemone (N. vectensis) are from paper IV in this thesis. The remaining data for roundworm (C. elegans), fruit fly (D. melanogaster) and human (H. sapiens) are from (Fredriksson and Schioth 2005).

This analysis expanded previous groupings of the GPCR superfamily (Josefsson 1999; Graul and Sadee 2001) by adding a higher resolution owing to the recently sequenced genomes of T. pseudonana, T. adhaerens, N. vectensis and S. purpuratus (see figure 4). By identifying Rhodopsin sequences preceding the split of C. elegans, the Rhodopsin family could be identified as a child of the cAMP family. Comparing the relationships of the Adhesion and cAMP families to the Rhodopsin family, there are more conserved motifs between the cAMP and Rhodopsin families; among them, the prominent Rhodopsin motif NPXXY was present in the cAMP consensus sequence as NSXXY. Furthermore, the HHsearch probability was 99.4% between the Rhodopsin and cAMP families. Therefore, we suggest that the Adhesion and Frizzled families originated from the cAMP family in an event close in time to that which gave rise to the Rhodopsin family (see figure 4). Later, between the split of N. vectensis and the split of C. elegans from the vertebrate lineage, the Secretin family evolved from the Adhesion family (Nordstrom, Lagerstrom et al. 2009). Furthermore, the Nematode chemoreceptor families were children of Rhodopsin. Nine families had a strong relationship to Rhodopsin, seven had a distant relationship and the remaining five families could be linked to Rhodopsin through one of the other families. Rhodopsin was also found to be the parent of the Vomeronasal type 1 family with a strong relationship, a probability of 98.1% and eight motifs. Interestingly, we also linked the Taste2 family to Rhodopsin. This relationship is strong, with a probability of 98.7%, and is supported by conserved motifs 1, 2, 4 and 12. This is in contrast to previous reports that have associated it with Frizzled (Fredriksson, Lagerstrom et al. 2003). We also identified a strong relationship between the ITR-like and GPR108-like families. Including the weak, but still considerable, links between the GPR108-like and Frizzled families and between the Glutamate and Adhesion families, all Metazoan GPCR families, except some of the ecdysozoan- and fungi-specific expansions, can be traced back to a common origin with the GPR108-like family as the putative ancestor. The ITR-like and GPR108-like sequences are two small families here shown to be closely related, although they are represented by two separate domains in the Pfam HMM dataset.
The GPR108-like family occurs in plants, fungi and animals (Edgar 2007) and we also found it in all the new genomes we analyzed. To our knowledge, no evolutionary study has been conducted on the ITR-like family. Here we showed that it is present in all twelve species we analyzed. Both families are small compared to many other GPCR families, as the numbers varied between one and four per family and species. The average identity between all identified GPR108-like sequences was 29.1±13.5% and the corresponding number for the ITR-like sequences was 26.1±10.2%. These numbers are high for families present in so many species and comparable with those of the well-conserved Frizzled family (30.1±15.4%). The receptors in the Frizzled family are heavily associated with important developmental functions, such as cell polarity and proliferation. The ITR-like sequences are suggested to be essential in the early stage of intima-media thickening (Tsukada, Iwai et al. 2003). No function has been suggested for the GPR108-like sequences, but considering the high degree of conservation regarding both amino acid identity and family size in different species, it is very possible that they have a fundamental and important function.

FUTURE PERSPECTIVES

Although this thesis unravels several key issues regarding the evolution of G protein-coupled receptors, the evolutionary history of the large Rhodopsin family is still unknown. There are large numbers of uncharacterized GPCRs in pre-vertebrate species such as T. adhaerens, N. vectensis and S. purpuratus, especially in the Rhodopsin family. These sequences hit members from several Rhodopsin subgroups, which raises the question of whether they are local expansions with novel functions or pre-vertebrate versions of vertebrate GPCRs. This could be approached with an unbiased clustering followed by manual inspection, similar to what was done with all human membrane proteins in paper III. The clusters would then be analyzed in a manner similar to that of paper IV to render an overall hierarchy. Each node could then be the subject of a thorough phylogenetic analysis similar to that in paper II.

By combining the classification of membrane proteins in paper III with the numerous interesting genomes available today and the mining methodology from papers I-II and IV, it would be possible to map the evolutionary history of the whole membrane proteome or of subsets of it. These data could then be analyzed with the methodology of integrating several measures of group distance used in paper IV. The large superfamilies, for instance the kinases, channels and solute carriers, would be interesting as they are already suggested to be related, although no hierarchy is available. Moreover, in the miscellaneous group, there are several small groups with four transmembrane passages. Given the results in paper IV indicating a common origin of the majority of the vertebrate GPCR families, it could be speculated that the four-transmembrane proteins also share a common origin.

With the sequencing of novel genomes, more early Metazoan and pre-Metazoan species will be available for analysis. The slime mold, D. discoideum, split from the human lineage before Fungi and choanoflagellates such as Monosiga brevicollis, yet the GPCR repertoire of the slime mold is considerably richer than those of M. brevicollis (unpublished material) and Fungi (Fredriksson and Schioth 2005). Considering this, the large losses of GPCRs in the Nematode and Insect lineages and the suggestion that many of the GPCR families evolved in this time window, the dynamics of GPCR evolution in the early Metazoan species are highly interesting.

ACKNOWLEDGEMENT

I would like to thank some people who have been with me along the way:

Helgi Schiöth, my supervisor, for his patience and encouragement.

Robert Fredriksson, my co-supervisor, for clever suggestions.

Leah, for all the support and for always being there.

Majd, Josefin and Johan, for the company through the BMC years.

Markus, for continuing where I stopped.

My family, for the effort put into understanding what I have been doing.

Tobias and Pär for being a part of my relocations and excursions.

Linn for introducing the “Måndags-fika” and many entertaining moments.

Maria, Tatjana, Pavel and all the other people at the Neuroscience department.

REFERENCES

"US Department of Energy Joint Genome Institute." from http://www.jgi.doe.gov.
Aguinaldo, A. M., J. M. Turbeville, et al. (1997). "Evidence for a clade of nematodes, arthropods and other moulting animals." Nature 387(6632): 489-93.
Ahram, M., Z. I. Litou, et al. (2006). "Estimation of membrane proteins in the human proteome." In Silico Biol 6(5): 379-86.
Alfaro, M. E., S. Zoller, et al. (2003). "Bayes or bootstrap? A simulation study comparing the performance of Bayesian Markov chain Monte Carlo sampling and bootstrapping in assessing phylogenetic confidence." Mol Biol Evol 20(2): 255-66.
Altschul, S. F., W. Gish, et al. (1990). "Basic local alignment search tool." J Mol Biol 215(3): 403-10.
Antequera, F. and A. Bird (1993). "Number of CpG islands and genes in human and mouse." Proc Natl Acad Sci U S A 90(24): 11995-9.
Attwood, T. K. (2001). "A compendium of specific motifs for diagnosing GPCR subtypes." Trends Pharmacol Sci 22(4): 162-5.
Attwood, T. K. and J. B. Findlay (1994). "Fingerprinting G-protein-coupled receptors." Protein Eng 7(2): 195-203.
Birney, E., J. A. Stamatoyannopoulos, et al. (2007). "Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project." Nature 447(7146): 799-816.
Bjarnadottir, T. K., R. Fredriksson, et al. (2004). "The human and mouse repertoire of the adhesion family of G-protein-coupled receptors." Genomics 84(1): 23-33.
Bjarnadottir, T. K., R. Fredriksson, et al. (2007). "The adhesion GPCRs: a unique family of G protein-coupled receptors with important roles in both central and peripheral tissues." Cell Mol Life Sci 64(16): 2104-19.

Bockaert, J. and J. P. Pin (1999). "Molecular tinkering of G protein-coupled receptors: an evolutionary success." Embo J 18(7): 1723-9.
Borst, P. and R. O. Elferink (2002). "Mammalian ABC transporters in health and disease." Annu Rev Biochem 71: 537-92.
Cardoso, J. C., M. S. Clark, et al. (2005). "The secretin G-protein-coupled receptor family: teleost receptors." J Mol Endocrinol 34(3): 753-65.
Carninci, P., T. Kasukawa, et al. (2005). "The transcriptional landscape of the mammalian genome." Science 309(5740): 1559-63.
Carninci, P., K. Waki, et al. (2003). "Targeting a complex transcriptome: the construction of the mouse full-length cDNA encyclopedia." Genome Res 13(6B): 1273-89.
Chang, A., M. Scheer, et al. (2009). "BRENDA, AMENDA and FRENDA the enzyme information system: new content and tools in 2009." Nucleic Acids Res 37(Database issue): D588-92.
Churcher, A. M. and J. S. Taylor (2009). "Amphioxus (Branchiostoma floridae) has orthologs of vertebrate odorant receptors." BMC Evol Biol 9: 242.
Clamp, M., J. Cuff, et al. (2004). "The Jalview Java alignment editor." Bioinformatics 20(3): 426-7.
Clamp, M., B. Fry, et al. (2007). "Distinguishing protein-coding and noncoding genes in the human genome." Proc Natl Acad Sci U S A 104(49): 19428-33.
Daley, D. O., M. Rapp, et al. (2005). "Global topology analysis of the Escherichia coli inner membrane proteome." Science 308(5726): 1321-3.
Delsuc, F., H. Brinkmann, et al. (2006). "Tunicates and not cephalochordates are the closest living relatives of vertebrates." Nature 439(7079): 965-8.
Devoto, A., H. A. Hartmann, et al. (2003). "Molecular phylogeny and evolution of the plant-specific seven-transmembrane MLO family." J Mol Evol 56(1): 77-88.
Devoto, A., P. Piffanelli, et al. (1999). "Topology, subcellular localization, and sequence diversity of the Mlo family in plants." J Biol Chem 274(49): 34993-5004.
Douady, C. J., F. Delsuc, et al. (2003). "Comparison of Bayesian and maximum likelihood bootstrap measures of phylogenetic reliability." Mol Biol Evol 20(2): 248-54.

Dunbar, L. A. and M. J. Caplan (2001). "Ion pumps in polarized cells: sorting and regulation of the Na+, K+- and H+, K+-ATPases." J Biol Chem 276(32): 29617-20.
Eddy, S. R. (1998). "Profile hidden Markov models." Bioinformatics 14(9): 755-63.
Edgar, A. J. (2007). "Human GPR107 and murine Gpr108 are members of the LUSTR family of proteins found in both plants and animals, having similar topology to G-protein coupled receptors." DNA Seq 18(3): 235-41.
Eichinger, L., J. A. Pachebat, et al. (2005). "The genome of the social amoeba Dictyostelium discoideum." Nature 435(7038): 43-57.
Ewing, B. and P. Green (2000). "Analysis of expressed sequence tags indicates 35,000 human genes." Nat Genet 25(2): 232-4.
Felsenstein, J. (2004). PHYLIP (Phylogeny Inference Package). Seattle, Distributed by the author. Department of Genome Sciences, University of Washington.
Fields, C., M. D. Adams, et al. (1994). "How many genes in the human genome?" Nat Genet 7(3): 345-6.
Finn, R. D., J. Mistry, et al. (2006). "Pfam: clans, web tools and services." Nucleic Acids Res 34(Database issue): D247-51.
Fredriksson, R., M. C. Lagerstrom, et al. (2003). "The G-protein-coupled receptors in the human genome form five main families. Phylogenetic analysis, paralogon groups, and fingerprints." Mol Pharmacol 63(6): 1256-72.
Fredriksson, R., K. J. Nordstrom, et al. (2008). "The solute carrier (SLC) complement of the human genome: phylogenetic classification reveals four major families." FEBS Lett 582(27): 3811-6.
Fredriksson, R. and H. B. Schioth (2005). "The repertoire of G-protein-coupled receptors in fully sequenced genomes." Mol Pharmacol 67(5): 1414-25.
Gloriam, D. E., R. Fredriksson, et al. (2007). "The G protein-coupled receptor subset of the rat genome." BMC Genomics 8: 338.
Graul, R. C. and W. Sadee (2001). "Evolutionary relationships among G protein-coupled receptors using a clustered database approach." AAPS PharmSci 3(2): E12.
Haitina, T., R. Fredriksson, et al. (2009). "The G protein-coupled receptor subset of the dog genome is more similar to that in humans than rodents." BMC Genomics 10: 24.

Harmar, A. J. (2001). "Family-B G-protein-coupled receptors." Genome Biol 2(12): reviews3013.1-3013.10.
Hayashizaki, Y. (2003). "RIKEN mouse genome encyclopedia." Mech Ageing Dev 124(1): 93-102.
Hill, C. A., A. N. Fox, et al. (2002). "G protein-coupled receptors in Anopheles gambiae." Science 298(5591): 176-8.
Hirokawa, T., S. Boon-Chieng, et al. (1998). "SOSUI: classification and secondary structure prediction system for membrane proteins." Bioinformatics 14(4): 378-9.
Holland, L. Z., V. Laudet, et al. (2004). "The chordate amphioxus: an emerging model organism for developmental biology." Cell Mol Life Sci 61(18): 2290-308.
Hubbard, T. J., B. L. Aken, et al. (2009). "Ensembl 2009." Nucleic Acids Res 37(Database issue): D690-7.
Huelsenbeck, J. P. and F. Ronquist (2001). "MRBAYES: Bayesian inference of phylogenetic trees." Bioinformatics 17(8): 754-5.
Ji, Y., Z. Zhang, et al. (2009). "The repertoire of G-protein-coupled receptors in Xenopus tropicalis." BMC Genomics 10: 263.
Josefsson, L. G. (1999). "Evidence for kinship between diverse G-protein coupled receptors." Gene 239(2): 333-40.
Kall, L., A. Krogh, et al. (2004). "A combined transmembrane topology and signal peptide prediction method." J Mol Biol 338(5): 1027-36.
Kamesh, N., G. K. Aradhyam, et al. (2008). "The repertoire of G protein-coupled receptors in the sea squirt Ciona intestinalis." BMC Evol Biol 8: 129.
Kashuba, V. I., A. I. Protopopov, et al. (2002). "hUNC93B1: a novel human gene representing a new gene family and encoding an unc-93-like protein." Gene 283(1-2): 209-17.
Kent, W. J. (2002). "BLAT--the BLAST-like alignment tool." Genome Res 12(4): 656-64.
Kersey, P. J., J. Duarte, et al. (2004). "The International Protein Index: an integrated database for proteomics experiments." Proteomics 4(7): 1985-8.
Kim, H., K. Melen, et al. (2006). "A global topology map of the Saccharomyces cerevisiae membrane proteome." Proc Natl Acad Sci U S A 103(30): 11142-7.
Kim, Y. M., M. M. Brinkmann, et al. (2008). "UNC93B1 delivers nucleotide-sensing toll-like receptors to endolysosomes." Nature 452(7184): 234-8.

Kolakowski, L. F., Jr. (1994). "GCRDb: a G-protein-coupled receptor database." Receptors Channels 2(1): 1-7.
Krogh, A., B. Larsson, et al. (2001). "Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes." J Mol Biol 305(3): 567-80.
Kruger, R. P., J. Aurandt, et al. (2005). "Semaphorins command cells to move." Nat Rev Mol Cell Biol 6(10): 789-800.
Kumar, S., K. Tamura, et al. (2004). "MEGA3: Integrated software for Molecular Evolutionary Genetics Analysis and sequence alignment." Brief Bioinform 5(2): 150-63.
Lagerstrom, M. C., A. R. Hellstrom, et al. (2006). "The G protein-coupled receptor subset of the chicken genome." PLoS Comput Biol 2(6): e54.
Lagerstrom, M. C. and H. B. Schioth (2008). "Structural diversity of G protein-coupled receptors and significance for drug discovery." Nat Rev Drug Discov 7(4): 339-57.
Lander, E. S., L. M. Linton, et al. (2001). "Initial sequencing and analysis of the human genome." Nature 409(6822): 860-921.
Larsson, T. P., C. G. Murray, et al. (2005). "Comparison of the current RefSeq, Ensembl and EST databases for counting genes and gene discovery." FEBS Lett 579(3): 690-8.
Lassmann, T. and E. L. Sonnhammer (2005). "Kalign--an accurate and fast multiple sequence alignment algorithm." BMC Bioinformatics 6: 298.
Levin, J. Z. and H. R. Horvitz (1992). "The Caenorhabditis elegans unc-93 gene encodes a putative transmembrane protein that regulates muscle contraction." J Cell Biol 117(1): 143-55.
Li, S., G. Cutler, et al. (2003). "A comparative analysis of HGSC and Celera human genome assemblies and gene sets." Bioinformatics 19(13): 1597-605.
Li, W.-H. (1997). Molecular evolution. Sunderland, Mass., Sinauer Associates.
Lin, D. C., C. M. Bullock, et al. (2002). "Identification and molecular characterization of two closely related G protein-coupled receptors activated by prokineticins/endocrine gland vascular endothelial growth factor." J Biol Chem 277(22): 19276-80.
Madera, M. (2008). "Profile Comparer: a program for scoring and aligning profile hidden Markov models." Bioinformatics 24(22): 2630-1.

Marchler-Bauer, A., J. B. Anderson, et al. (2007). "CDD: a conserved domain database for interactive domain family analysis." Nucleic Acids Res 35(Database issue): D237-40.
Marchler-Bauer, A. and S. H. Bryant (2004). "CD-Search: protein domain annotations on the fly." Nucleic Acids Res 32(Web Server issue): W327-31.
Metpally, R. P. and R. Sowdhamini (2005). "Genome wide survey of G protein-coupled receptors in Tetraodon nigroviridis." BMC Evol Biol 5: 41.
Needleman, S. B. and C. D. Wunsch (1970). "A general method applicable to the search for similarities in the amino acid sequence of two proteins." J Mol Biol 48(3): 443-453.
Niimura, Y. (2009). On the Origin and Evolution of Vertebrate Genes: Comparative Genome Analysis Among 23 Chordate Species. 2009: 34.
Nordstrom, K. J., R. Fredriksson, et al. (2008). "The amphioxus (Branchiostoma floridae) genome contains a highly diversified set of G protein-coupled receptors." BMC Evol Biol 8: 9.
Nordstrom, K. J., M. C. Lagerstrom, et al. (2009). "The Secretin GPCRs descended from the family of Adhesion GPCRs." Mol Biol Evol 26(1): 71-84.
Nordstrom, K. J., M. A. Mirza, et al. (2009). "Critical evaluation of the FANTOM3 non-coding RNA transcripts." Genomics 94(3): 169-76.
Nordstrom, K. J., M. A. Mirza, et al. (2006). "Comprehensive comparisons of the current human, mouse, and rat RefSeq, Ensembl, EST, and FANTOM3 datasets: identification of new human genes with specific tissue expression profile." Biochem Biophys Res Commun 348(3): 1063-74.
Okada, T., O. P. Ernst, et al. (2001). "Activation of rhodopsin: new insights from structural and biochemical studies." Trends Biochem Sci 26(5): 318-24.
Page, R. D. (1996). "TreeView: an application to display phylogenetic trees on personal computers." Comput Appl Biosci 12(4): 357-8.
Patthy, L., M. Trexler, et al. (1984). "Kringles: modules specialized for protein binding. Homology of the gelatin-binding region of fibronectin with the kringle structures of proteases." FEBS Lett 171(1): 131-6.
Pennisi, E. (2003). "Drafting a tree." Science 300(5626): 1694.

Podsiadlowski, L., A. Braband, et al. (2008). "The complete mitochondrial genome of the onychophoran Epiperipatus biolleyi reveals a unique transfer RNA set and provides further support for the ecdysozoa hypothesis." Mol Biol Evol 25(1): 42-51.
Prabhu, Y. and L. Eichinger (2006). "The Dictyostelium repertoire of seven transmembrane domain receptors." Eur J Cell Biol 85(9-10): 937-46.
Pruitt, K. D., T. Tatusova, et al. (2007). "NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins." Nucleic Acids Res 35(Database issue): D61-5.
Putnam, N. H., M. Srivastava, et al. (2007). "Sea anemone genome reveals ancestral eumetazoan gene repertoire and genomic organization." Science 317(5834): 86-94.
Rice, P., I. Longden, et al. (2000). "EMBOSS: the European Molecular Biology Open Software Suite." Trends Genet 16(6): 276-7.
Salasznyk, R. M., M. Zappala, et al. (2007). "The uPA receptor and the somatomedin B region of vitronectin direct the localization of uPA to focal adhesions in microvessel endothelial cells." Matrix Biol.
Sato, K., M. Pellegrino, et al. (2008). "Insect olfactory receptors are heteromeric ligand-gated ion channels." Nature 452(7190): 1002-6.
Schioth, H. B., K. J. Nordstrom, et al. (2007). "Mining the gene repertoire and ESTs for G protein-coupled receptors with evolutionary perspective." Acta Physiol (Oxf) 190(1): 21-31.
Sodergren, E., G. M. Weinstock, et al. (2006). "The genome of the sea urchin Strongylocentrotus purpuratus." Science 314(5801): 941-52.
Soding, J. (2005). "Protein homology detection by HMM-HMM comparison." Bioinformatics 21(7): 951-60.
Srivastava, M., E. Begovic, et al. (2008). "The Trichoplax genome and the nature of placozoans." Nature 454(7207): 955-60.
Suzuki, Y., G. V. Glazko, et al. (2002). "Overcredibility of molecular phylogenies obtained by Bayesian phylogenetics." Proc Natl Acad Sci U S A 99(25): 16138-43.
Thompson, J. D., D. G. Higgins, et al. (1994). "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice." Nucleic Acids Res 22(22): 4673-80.
Tsukada, S., M. Iwai, et al. (2003). "Inhibition of experimental intimal thickening in mice lacking a novel G-protein-coupled receptor." Circulation 107(2): 313-9.
Wang, F., X. C. Feng, et al. (2006). "Aquaporins as potential drug targets." Acta Pharmacol Sin 27(4): 395-401.
Vassilatis, D. K., J. G. Hohmann, et al. (2003). "The G protein-coupled receptor repertoires of human and mouse." Proc Natl Acad Sci U S A 100(8): 4903-8.
Versele, M., K. Lemaire, et al. (2001). "Sex and sugar in yeast: two distinct GPCR systems." EMBO Rep 2(7): 574-9.
Wicher, D., R. Schafer, et al. (2008). "Drosophila odorant receptors are both ligand-gated and cyclic-nucleotide-activated cation channels." Nature 452(7190): 1007-11.
William A, Catterall KG, et al. (2003). "International Union of Pharmacology: Approaches to the Nomenclature of Voltage-Gated Ion Channels." Pharmacol Rev 55(4): 573-574.
von Heijne, G. (2006). "Membrane-protein topology." Nat Rev Mol Cell Biol 7(12): 909-18.
von Heijne, G. (2007). "The membrane protein universe: what's out there and why bother?" J Intern Med 261(6): 543-57.
Yu, F. H. and W. A. Catterall (2004). "The VGL-chanome: a protein superfamily specialized for electrical signaling and ionic homeostasis." Sci STKE 2004(253): re15.
Zdobnov, E. M. and R. Apweiler (2001). "InterProScan--an integration platform for the signature-recognition methods in InterPro." Bioinformatics 17(9): 847-8.

