Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Medicine 388

Identification, Characterization and Evolution of Membrane-bound

PÄR J. HÖGLUND

ACTA UNIVERSITATIS UPSALIENSIS ISSN 1651-6206 UPPSALA ISBN 978-91-554-7312-9 2008 urn:nbn:se:uu:diva-9329


Scientific discovery and scientific knowledge have been achieved only by those who have gone in pursuit of it without any practical purpose whatsoever in view.

Max Planck

To my family

List of publications

I. Fredriksson R, Lagerström MC, Höglund PJ, Schiöth HB. Novel human G protein-coupled receptors with long N-terminals containing GPS domains and Ser/Thr-rich regions. FEBS Lett. 2002 Nov 20;531:407-14.

II. Fredriksson R, Gloriam DE, Höglund PJ, Lagerström MC, Schiöth HB. There exist at least 30 human G-protein-coupled receptors with long Ser/Thr-rich N-termini. Biochem Biophys Res Commun. 2003 Feb 14;301:725-34.

III. Fredriksson R, Höglund PJ, Gloriam DE, Lagerström MC, Schiöth HB. Seven evolutionarily conserved human G protein-coupled receptors lacking close relatives. FEBS Lett. 2003 Nov 20;554:381-8.

IV. Bjarnadóttir TK, Fredriksson R, Höglund PJ, Gloriam DE, Lagerström MC, Schiöth HB. The human and mouse repertoire of the adhesion family of G-protein-coupled receptors. Genomics. 2004 Jul;84:23-33.

V. Höglund PJ, Adzic D, Scicluna SJ, Lindblom J, Fredriksson R. The repertoire of solute carriers of family 6: identification of new human and rodent . Biochem Biophys Res Commun. 2005 Oct 14;336:175-89.

VI. Höglund PJ, Nordström KJ, Schiöth HB, Fredriksson R. The families of Solute Carriers (SLC) have a remarkable long evolutionary history with most families present before a early divergence of eukaryotes. Manuscript.

Articles printed with permission.

Contents

Introduction
From genetics to postgenomics
Genetics
Genomics
Postgenomics
Sequenced genomes
Membrane proteins
G protein-coupled receptors
The
Research aim
Methods
Phylogenetic analysis
Sequence similarity based search strategies
Expressed Sequence Tags
Results
Discussion
The Adhesion Family
The Solute Carrier Family
The Solute Carrier Family 6
Diseases
Future perspectives
Acknowledgements
References

Abbreviations

7TM            Seven transmembrane
Adhesion(s)    GPCR(s) belonging to the Adhesion family
BAC(s)         Bacterial artificial chromosome(s)
BAI            Brain Specific Angiogenesis Inhibitor
BLAST          Basic Local Alignment Search Tool
BLAT           BLAST-Like Alignment Tool
CA             Cadherin Repeats
cDNA           Complementary DNA
CDS            Coding Domain Sequence
CELSR          Cadherin EGF LAG seven-pass G-type receptor
CNS            Central Nervous System
EGF            Epidermal Growth Factor Domain
EGF-LAM        Laminin Type Epidermal Growth Factor Domain
EMR            EGF-module containing mucin-like hormone receptors
EST(s)         Expressed Sequence Tag(s)
E-value        Expectation value
Gbp            Gigabase pairs
GPCR           G Protein-Coupled Receptor
GPS            GPCR-Proteolytic Site
GRAFS          The human GPCR families: Glutamate, Rhodopsin, Adhesion, Frizzled/Taste2 and Secretin
HBD            Hormone-Binding Domain
HE6            Human Epididymal Gene Product 6
HGNC           HUGO Gene Nomenclature Committee
HMM            Hidden Markov Model
HUGO           Human Genome Organisation
LamG           Laminin domain
LEC            Lectomedin receptor
Mbp            Megabase Pairs
ML             Maximum Likelihood
MP             Maximum Parsimony
mRNA           Messenger Ribonucleic Acid
NCBI           (US) National Center for Biotechnology Information
NJ             Neighbor Joining
OLF            Olfactomedin domain
PTX            Pentraxin domain
Rhodopsin(s)   GPCR(s) belonging to the Rhodopsin family
RPS-BLAST      Reverse Position specific BLAST
SEA            Sperm Protein, Enterokinase, and Agrin
SLCs           Genes belonging to the Solute Carrier Family
WGS            Whole Genome Shotgun
YAC(s)         Yeast artificial chromosome(s)

Introduction

From genetics to postgenomics

Genetics

The modern era of genetics is generally considered to have started with Gregor Mendel and Charles Darwin. In 1858, Alfred Russel Wallace published his essay “On the Tendency of Varieties to Depart Indefinitely from the Original Type” in a joint presentation together with previously unpublished writings from Charles Darwin. Darwin published his landmark work “On the Origin of Species by Means of Natural Selection” in 1859. In 1866, the Augustinian monk Gregor Mendel published his findings on the laws of inheritance based on experiments. However, his findings received no attention in the scientific community, and it was not until 1900, with the "rediscovery of Mendel", that the theories were widely accepted. In 1915, Thomas Hunt Morgan published “The Mechanism of Mendelian Heredity”, in which he presented results showing that genes are lined up along chromosomes. In 1953, Francis Crick and James Watson determined the structure of the DNA molecule: a double helix composed of strings of nucleotides (Watson et al. 1953). Crick and Watson shared the Nobel Prize in Physiology or Medicine in 1962 for their discovery. In 1977, Fred Sanger developed the chain termination method for sequencing DNA (Sanger et al. 1977) and Allan Maxam and Walter Gilbert developed new sequencing techniques (Maxam et al. 1977). These sequencing techniques were very powerful and substantially raised the sequencing capacity. In 1983, Kary Mullis and others at Cetus Corporation invented a technique for making many copies of a specific DNA sequence: the polymerase chain reaction (PCR).

Genomics

The Human Genome Project, an international effort to sequence all of the human DNA and map all of the genes in humans, was launched in 1990 (Watson 1990). In 1995 the first full genome sequence of a living organism was completed for the bacterium Haemophilus influenzae (Fleischmann et al. 1995). In 1996, Saccharomyces cerevisiae (baker's yeast) was the first eukaryotic genome to be completely sequenced and annotated (Goffeau et al. 1996). The nematode, Caenorhabditis elegans, was the first multicellular organism to have its genome completely sequenced (CESC 1998). DNA sequencing has developed rapidly, and sequences of much larger genomes were produced: Drosophila melanogaster (Adams et al. 2000), Arabidopsis thaliana (AGI 2000), Anopheles gambiae (Holt et al. 2002) and Mus musculus (Waterston et al. 2002). In 2001 the draft completion of the human genome was announced by Celera Genomics (Venter et al. 2001) and the International Human Genome Sequencing Consortium (IHGSC) (Lander et al. 2001). The so-called final version of the human genome was published in 2004 (IHGSC 2004). The number of sequenced genomes has expanded quickly. As of 2008-09-15, there are 854 published complete genomes in the Genomes OnLine Database, GOLD (http://www.genomesonline.org): 53 archaeal, 706 bacterial and 95 eukaryotic (Liolios et al. 2008). There is also a vast number of ongoing genomic sequencing projects: 97 archaeal, 1934 bacterial and a total of 990 eukaryotic genomes.

According to GOLD, the genome of Homo sapiens is about 3.42 Gigabase pairs (Gbp) long, the smallest known vertebrate genome is that of Tetraodon fluviatilis (Green Puffer fish) with about 0.385 Gbp, and the largest known vertebrate genome, that of the Marbled lungfish, Protopterus aethiopicus, is about 130 Gbp (Liolios et al. 2008).

Sequencing costs have dropped ten-thousand-fold in 19 years: from about $3 per finished base pair in 1990 to about $0.0003 today. Further dramatic drops in costs are projected (see Table 1) (Watson 1990; Williams 2002; Service 2006; Wolinsky 2007).

Year   Cost (USD) per base pair   Cost (USD), human genome
1990   3                          10.3 billion
1993   0.5                        1.7 billion
2005   0.01                       34 million
2008   0.0003                     1 million
2009   0.00003                    100 000
2013   0.0000003                  1000

Table 1. Data table listing the sequencing costs (Watson 1990; Williams 2002; Service 2006; Wolinsky 2007). The second column denotes the cost per base pair in US Dollars (USD). The third column lists the costs in USD for human full genome sequencing. The sums are calculated using the estimated size of the human genome of 3.42 Gigabase pairs.
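The genome-level costs in Table 1 follow directly from multiplying the per-base-pair cost by the assumed genome size. The following is a minimal sketch of that arithmetic; the names `GENOME_SIZE_BP`, `COST_PER_BP` and `genome_cost` are illustrative and not taken from the cited sources.

```python
# Human genome size used for Table 1 (3.42 Gbp, according to GOLD).
GENOME_SIZE_BP = 3.42e9

# Cost per base pair in USD, copied from the second column of Table 1.
COST_PER_BP = {
    1990: 3.0,
    1993: 0.5,
    2005: 0.01,
    2008: 0.0003,
    2009: 0.00003,
    2013: 0.0000003,
}


def genome_cost(cost_per_bp, genome_size_bp=GENOME_SIZE_BP):
    """Return the cost in USD of sequencing one genome of the given size."""
    return cost_per_bp * genome_size_bp


if __name__ == "__main__":
    for year, cost in sorted(COST_PER_BP.items()):
        print(f"{year}: {genome_cost(cost):,.0f} USD")
```

The values in the third column of Table 1 are these products rounded to two or three significant figures.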

Postgenomics

Now that the human genome project is finished, the characterization of the proteins encoded by the sequence remains a challenge. The study of the complete protein expression of the genome, proteomics, is one of the important tasks ahead for researchers. The rapidly growing amount of data has made it nearly impossible to analyze sequences manually, and the huge amount of information has to be handled systematically. Bioinformatics is the science of using computers, statistics and databases to manage and analyze genomic and molecular biological data. Some fields of bioinformatics are sequence alignment, gene finding, protein structures, gene assembly and modeling of evolution. Bioinformatic tools and methodologies are necessary to collect, store and interpret biological sequence information. The postgenomic era promises great possibilities for gaining further insight into organism development, metabolic processes, and diseases. Bioinformatics research will have a significant impact on improving our understanding of such diverse areas as drug development, the regulation of gene expression, protein structures and comparative evolution.

Figure 1. Phylogenetic figure of the species in Papers I-VI. The figure is constructed using the NCBI taxonomy browser (http://www.ncbi.nlm.nih.gov/Taxonomy). The rectangular shaped boxes show the evolutionary lineages.

Sequenced genomes

The International Human Genome Sequencing Consortium (IHGSC) published the NCBI build 35 (hg17), the so-called final version, in 2004. It covered approximately 99% of the euchromatic genome with only 341 gaps (IHGSC 2004). The latest version of the human genome is the NCBI Build 36.1 (hg18) assembly. It has 234 euchromatin gaps. Chromosome 14 (106 Mbp) has no euchromatin gaps, whereas the largest chromosome, Chromosome 1 (247 Mbp), has the highest number with 13 gaps (http://genome.ucsc.edu/goldenPath/stats.html). Recently, an article on the mapping and sequencing of structural variation from eight human genomes was published, and more genetic variation was found than previously expected (Kidd et al. 2008).

The mouse, Mus musculus, genome was published in 2002. The assembly has approximately 7.7-fold sequence coverage of the euchromatic mouse genome. The latest assembly is the Build 37.1 from NCBI and the Mouse Genome Sequencing Consortium. The Build 37.1 assembly includes approximately 2.6 Gb of sequence on chromosomes 1-19, X, Y, M (mitochondrial DNA) and unmapped clone contigs. On chromosome Y in this assembly, only the short arm has reliable mapping data; therefore, most of the contigs on the Y chromosome are unmapped (http://genome.ucsc.edu).

The rat, Rattus norvegicus, genome was published in 2004 by the Rat Genome Sequencing Project Consortium. The sequence represents a high-quality 'draft' covering over 90% of the genome. The genome was assembled using a hybrid approach combining Whole Genome Shotgun (WGS) and Bacterial Artificial Chromosome (BAC) sequencing. This yielded sevenfold sequence coverage, with 60% provided by WGS and 40% from BACs (Gibbs et al. 2004).

The fruitfly, Drosophila melanogaster, genome was sequenced and assembled by a combination of BAC and WGS data, and has been finished to high quality. The WGS data has a sequence coverage of 12.8X (Adams et al. 2000). The latest release (BDGP Release 5) was provided by the Berkeley Drosophila Genome Project (BDGP) and has a total of eight euchromatic gaps (http://www.fruitfly.org).

The pufferfish Fugu, Takifugu rubripes, has over 95% of the genome sequenced and has a 5.7X coverage (Aparicio et al. 2002). The latest release, the Takifugu rubripes v4.0 (Oct. 2004) WGS assembly from the US DOE Joint Genome Institute (JGI), has approximately an 8.5X coverage (http://genome.ucsc.edu).

The mosquito, Anopheles gambiae, was sequenced using a WGS approach. A 10X WGS coverage was obtained. A total of 91% of the genome was organized in 303 scaffolds (Holt et al. 2002).

The chicken, Gallus gallus, genome was generated from approximately 6.6X coverage in whole-genome shotgun reads, a combination of plasmid, fosmid and bacterial artificial chromosome (BAC)-end read pairs (ICGSC 2004). The latest version is the v2.1 draft assembly. In this assembly, 198,000 additional reads covering all contig ends and regions of low quality have been added to the original assembly's 6.6X coverage (http://genome.wustl.edu).

The nematode, Caenorhabditis elegans, sequencing project yielded 97 Mbp of sequences, consisting of 2527 cosmids, 257 YACs, 113 fosmids, and 44 PCR products (CESC 1998). The latest version is the WS197 from WormBase (http://www.wormbase.org).

The sea squirt, Ciona intestinalis, assembly had a total coverage of 8.2X (8X from WGS and 0.2X from BAC and cosmid libraries) when it was first published (Dehal et al. 2002). The latest draft assembly is Version 2.0. Starting with a coverage of 11X, additional data, including BAC and FISH markers, were used to map scaffolds to chromosome arms. The size of this assembly, including unmapped scaffolds, is 173 Mb, with 94 Mb of the sequence mapped to chromosome arms (http://genome.jgi-psf.org).

The fission yeast, Schizosaccharomyces pombe, genome was published in 2002, and the current assembly, from Sanger (ftp://ftp.sanger.ac.uk/), consists of 13.6 Mbp, or 12.5 Mbp excluding repeats and centromeres. S. pombe has three chromosomes and is an important genetic and evolutionary model organism (Wood et al. 2002). The name pombe is derived from the Swahili word for beer.

The budding yeast, Saccharomyces cerevisiae, is one of the most studied eukaryotic model organisms (Goffeau et al. 1996). The current assembly is Version 1.01 from the Saccharomyces Genome Database (SGD).

The zebrafish, Danio rerio, assembly is still preliminary. The latest release is the July 2007 Zv7 assembly with a total sequence length of 1.44 Gbp in 5036 fragments. Highly variable regions within the genome posed assembly difficulties, most likely because the sequences originated from different haplotypes. This also results in assembly dropouts and false duplications (http://www.sanger.ac.uk).

Arabidopsis thaliana, or thale cress, is a small flowering plant. Arabidopsis is popular as a model organism in plant biology and genetics. The genome is one of the smallest plant genomes and was the first sequenced plant genome. The sequenced regions cover 115.4 megabases of the 125-megabase genome. The traditional approach with BAC, TAC, cosmid, P1 clones and YAC was used for sequencing (AGI 2000). The genome annotations for this species are currently almost exclusively automated or semiautomated.

Membrane proteins

Membrane proteins constitute approximately 30% of all genes in the human genome (Lander et al. 2001), and the largest family of phylogenetically related membrane proteins is the G protein-coupled receptors (Fredriksson et al. 2003). The Solute Carriers (SLCs) are one of the largest families of membrane proteins, with 360 proteins currently listed in the SLC database (http://www.bioparadigms.org/slc/). Other important subgroups of membrane proteins are single pass receptors, channels and ABC transporters. Membrane proteins are difficult to crystallize because of their amphiphilicity, with a hydrophilic headgroup and hydrophobic tail. This obstacle has meant that they account for less than 1% of known structures, with about 170 unique structures known (http://blanco.biomol.uci.edu/Membrane_Proteins_xtal.html). The few crystallized GPCRs include the human β2-adrenergic receptor (Cherezov et al. 2007) and opsin in its G-protein-interacting conformation (Scheerer et al. 2008).

G protein-coupled receptors

The GPCRs are a large family of integral membrane proteins. There are over 800 GPCRs in the human genome, corresponding to approximately 3% of all human genes. Often they consist of a single polypeptide chain of 300-500 residues, but in some cases they extend up to several thousand amino acids. The receptors possess a common domain with seven transmembrane (TMI-TMVII) regions. GPCRs recognize a tremendous variety of extracellular messenger molecules (such as hormones and neurotransmitters, as well as growth and developmental factors) and several sensory messages (such as light, odors, and pain). The GPCRs are one of the most studied protein families. The GPCRs in the human genome form five major phylogenetic subfamilies. These families vary greatly in their extracellular N-termini. The five families, the so-called GRAFS, are the Glutamate, Rhodopsin, Adhesion, Frizzled/Taste2 and the Secretin family (see Figure 2). The Rhodopsin family (clan A) is the largest of the five subfamilies of GPCRs (Fredriksson et al. 2003). Approximately 50% of all newly introduced drugs are targeted against GPCRs and around 25% of the 100 top-selling drugs are targeted at members of this protein family (Flower 1999). There are at least 46 GPCRs that have been successfully targeted by drugs. These are found in three of the main families: Rhodopsin (>39), Secretin (4) and Glutamate (3) (Lagerstrom et al. 2008). The GPCRDB (http://www.gpcr.org/7tm/) is a database with GPCR sequences and related information (Horn et al. 2003). All families, even the ones not found in humans, such as the Class D fungal pheromone receptors (STE2 and STE3) and the Class Z archaeal/bacterial/fungal opsins, are listed there.

Figure 2. Schematic illustration of the phylogenetic relationship of the GPCRs (TM1-TM7). Adapted from Molecular Pharmacology (Fredriksson et al. 2003).

The Adhesion family

The Adhesion family is the second largest GPCR family with 33 members in human and 30 in M. musculus and R. norvegicus. This GPCR family is called Adhesion according to a study on phylogenetic GPCR classification (Fredriksson et al. 2003). Through the years this family has been assigned various names, including family B (Kolakowski 1994), B2 (Harmar 2001), or the long N-terminal seven transmembrane receptors related to family B (LNB-TM7) (Stacey et al. 2000). It is not entirely clear if the Adhesion GPCRs are functionally coupled to G proteins in the manner that is known for the Rhodopsin GPCRs (Bjarnadottir et al. 2007).

Adhesion is a complex GPCR gene family with large genomic size, a high number of exons and a multitude of functional domains. The Adhesions have long N-termini with multiple functional domains often found in other proteins such as tyrosine kinases, integrins and cadherins. Alternative splicing and complex processing steps, including the putative intracellular cleavage at the GPCR proteolytic site (GPS), are also contributing factors to their complexity (Bjarnadottir et al. 2007). The first identified member of this family, the epidermal growth factor (EGF) module containing receptor 1 (EMR1), was described in 1995 (Baud et al. 1995). The sequencing of the human genome and the development of systematic analysis with bioinformatic tools led to rapid discovery of Adhesion genes (Fredriksson et al. 2002; Fredriksson et al. 2003), and now the repertoire consists of 33 human (Bjarnadottir et al. 2004) and 30 mouse genes (Haitina et al. 2008).

The family can be divided into eight groups according to phylogenetic analysis (see Figure 5). Interestingly, even though the phylogenetic analysis is based on 7TM regions alone, the clans show good agreement with the functional domains contained in the N-termini. For example, receptors containing EGF domains group together in group II, which contains EMR1–EMR4 and the differentiating antigen receptor (CD97). Group I contains the lectomedin receptors (LEC1–3) and the EGF-TM7-latrophilin-related protein (ETL). The CELSR group (IV) contains the cadherin EGF LAG seven-pass G-type receptors (CELSR1–3) and the BAI group (VII) contains the brain-specific angiogenesis inhibitor receptors (BAI1–3). The four remaining groups contain: GPR123–GPR125 (III), GPR133 and GPR144 (V), GPR110, GPR111, GPR113, GPR115, and GPR116 (VI), with group VIII containing GPR56, GPR97, GPR112, GPR114, GPR126, GPR128 and the human epididymal gene product 6 (HE6). The very long G protein-coupled receptor 1 (VLGR1) is definitely an Adhesion; however, it does not have a firm placement in the phylogeny. It is thus not possible to assign it to a subgroup.

Most of the Adhesions contain a GPCR proteolytic domain (GPS). The majority of Adhesion GPCRs are orphans, i.e. their endogenous ligand is still unknown. CD97 is one of the most studied receptors in this family and is found in several blood cell types. It contains EGF-like domains and is involved in leukocyte activation, similar to other EGF domain-containing receptors such as the EMRs. Receptors of the Adhesion family are expressed in various parts of the human body and many of them have prominent expression in the immune system, central nervous system, and reproductive organs, suggesting that they might take part in a large variety of physiological functions. Apart from the Rhodopsin family, the Adhesion GPCR family has the most members of all the GPCR families in mammals. The complex genomic structure and vast number of exons, and their long introns, make the Adhesion family difficult to study in terms of alternative splicing. One such study for the whole Adhesion family was conducted (Bjarnadottir et al. 2007). Here, the splice variants were categorized as either “functional” or “non-functional”. A variant was only considered functional if it contained the whole 7TM region. 52 splice variants are found among the 32 Adhesion genes. Many of these variants appear to be coding for “functional” proteins (29), while the others are seemingly “non-functional” (23). Several of the functional splice variants lack one or more of the functional domains that are found in the N-termini of these receptors. These functional domains are likely to affect ligand binding or interaction with other proteins.

The Rhodopsin family

The Rhodopsin family (clan A) is the largest of the five subfamilies of GPCRs (Fredriksson et al. 2003). Their natural ligands are highly diverse, comprising biogenic amines (such as adrenaline, dopamine, histamine, and serotonin), peptides (such as angiotensins, bradykinins, somatostatins, and melanocortins), large proteins (such as luteinizing hormone, follicle-stimulating hormone, and thyroid-stimulating hormone), nucleosides and nucleotides (adenosine, ATP, UTP, and ADP), lipids and eicosanoids (such as leukotrienes, prostaglandins, and cannabinoids) (Lagerstrom et al. 2008). Most of the Rhodopsins have a short N-terminus, except for the few receptors that bind complex glycoproteins such as the luteinizing hormone. There are also several hundred olfactory receptors in the genome, although a large number of these are pseudogenes. The 2004 Nobel Prize in Physiology or Medicine was awarded to Richard Axel and Linda B. Buck for their discoveries of odorant receptors and the organization of the olfactory system. The Rhodopsin family of GPCRs has been extensively studied because of the intense pharmaceutical interest in amine binding receptors. The number of drugs for other GPCRs is increasing, in particular for those receptors that bind peptides. The therapeutic potential of most Rhodopsin GPCRs has, however, not yet been exploited. Many of these receptors are still orphans, without any known ligand. The diversity of the known genes that encode GPCRs is so large that it cannot be excluded that more GPCR genes can be found in the human genome.

The Solute Carrier Family

The SLCs are membrane proteins that mediate the flow of various substrates (solutes) such as sugar, amino acids, nucleotides, inorganic ions and drugs over the cell membrane. The SLC gene series includes genes encoding passive transporters, ion transporters and exchangers. The different SLC families are functionally related, as most rely on an ion gradient over the cell membrane as the driving force for the transport. There are currently 47 distinct families of SLCs in mammals and some of these are known to be phylogenetically linked into larger clusters such as SLC32, SLC36 and SLC38 (Sundberg et al. 2008). The SLC25 family is located in the mitochondria; however, the overwhelming majority of the SLC families are located in the outer cell membrane. The HUGO Gene Nomenclature Committee (http://www.genenames.org) provides a list of transporter families of the SLC gene series. Here, a transporter has been assigned to a given SLC family if it has at least 20-25% amino acid sequence identity to any other member of the family ((Hediger et al. 2004) and personal communication, Dr Elspeth Bruford, HGNC), and thus the criterion is not strictly phylogenetic.

Numerous genetic defects of SLC genes have been linked to human disease. One example is SLC12A1, a Na-K-Cl cotransporter that mediates active reabsorption of sodium chloride in the thick ascending limb of the loop of Henle. Mutations in this gene have been shown to be the cause of the antenatal Bartter syndrome type 1. Bartter syndrome leads to fetal tubular hypokalemic alkalosis with marked fetal polyuria that leads to polyhydramnios between 24 and 30 weeks of gestation and increases the risk of premature delivery. One demonstrated mutation is a G-to-A transition in the last base of exon 14, representing the first base of codon 648, changing aspartic acid to asparagine (Simon et al. 1996).
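The 20-25% identity criterion for assigning a transporter to an SLC family can be illustrated with a small calculation on aligned sequences. This is only a sketch under stated assumptions: the functions `percent_identity` and `meets_family_criterion` are hypothetical, the sequences are assumed to be already aligned with "-" as the gap character, and identity is computed over columns without gaps, which is one of several definitions in use. It does not describe the procedure actually followed by HGNC.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only the columns in which both sequences carry a residue.
    columns = [(x, y) for x, y in zip(seq_a, seq_b) if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


def meets_family_criterion(candidate, members, threshold=20.0):
    """True if the candidate reaches the threshold against any family member."""
    return any(percent_identity(candidate, m) >= threshold for m in members)
```

Because a single sufficiently similar member is enough, a family defined this way need not form a single clade, which is why the criterion is not strictly phylogenetic.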

Family  N    Family name
SLC1    7    The high affinity glutamate and neutral amino acid transporter family
SLC2    14   The facilitative GLUT transporter family
SLC3    2    The heavy subunits of the heteromeric amino acid transporters
SLC4    10   The bicarbonate transporter family
SLC5    12   The sodium glucose cotransporter family
SLC6    19   The sodium- and chloride-dependent neurotransmitter transporter family
SLC7    13   The cationic amino acid transporter/glycoprotein-associated family
SLC8    3    The Na+/Ca2+ exchanger family
SLC9    11   The Na+/H+ exchanger family
SLC10   7    The sodium bile salt cotransport family
SLC11   2    The proton coupled metal ion transporter family
SLC12   9    The electroneutral cation-Cl cotransporter family
SLC13   5    The human Na+-sulfate/carboxylate cotransporter family
SLC14   2    The urea transporter family
SLC15   4    The proton oligopeptide cotransporter family
SLC16   14   The monocarboxylate transporter family
SLC17   8    The vesicular glutamate transporter family
SLC18   3    The vesicular amine transporter family
SLC19   3    The folate/thiamine transporter family
SLC20   2    The type III Na+-phosphate cotransporter family
SLC21   10   The organic anion transporting family
SLC22   22   The organic cation/anion/zwitterion transporter family
SLC23   4    The Na+-dependent ascorbic acid transporter family
SLC24   6    The Na+/(Ca2+-K+) exchanger family
SLC25   46   The mitochondrial carrier family
SLC26   11   The multifunctional anion exchanger family
SLC27   6    The fatty acid transport protein family
SLC28   3    The Na+-coupled nucleoside transport family
SLC29   4    The facilitative nucleoside transporter family
SLC30   10   The zinc efflux family
SLC31   2    The copper transporter family
SLC32   1    The vesicular inhibitory amino acid transporter family
SLC33   1    The Acetyl-CoA transporter family
SLC34   3    The type II Na+-phosphate cotransporter family
SLC35   23   The nucleoside-sugar transporter family
SLC36   4    The proton-coupled amino acid transporter family
SLC37   4    The sugar-phosphate/phosphate exchanger family
SLC38   11   The System A & N, sodium-coupled neutral amino acid transporter family
SLC39   14   The metal ion transporter family
SLC40   1    The basolateral iron transporter family
SLC41   3    The MgtE-like magnesium transporter family
SLC42   3    The Rh ammonium transporter family
SLC43   3    Na+-independent, system-L like amino acid transporter family
SLC44   5    Choline-like transporter family
SLC45   4    Putative sugar transporter family
SLC46   3    Heme transporter family
SLC47   2    Multidrug and toxin extrusion
Total   359

Table 2. The SLC families, the number of genes in each family and the names of the families (Supplementary table 1, Paper VI; Fredriksson et al., In Press).

The Solute Carrier Family 6

The Solute Carrier Family 6 is ancient and exists in nematodes and insects as well as in Archaea and Bacteria. Synaptic transmission in neurons involves the packaging of neurotransmitters into vesicles. Thereafter, the transmitter is released into the synaptic cleft, interacts with neurotransmitter receptors and is subsequently removed from the cleft. The SLC6 proteins are important as they act as specific transporters for neurotransmitters, amino acids and osmolytes: betaine, taurine and creatine. Important neurotransmitters whose levels are controlled by SLC6 proteins are small molecules: acetylcholine, dopamine, serotonin and norepinephrine; and amino acids like gamma-aminobutyric acid (GABA), glycine and aspartate. The SLC6 proteins are Na+ cotransporters, with the energy for transport of the solute against its concentration gradient provided by the electrochemical gradient for sodium ions.

Research aim

The overall aim was to investigate the genetic repertoire and expression of G protein-coupled receptors (GPCRs) and Solute Carriers (SLCs). We have identified and characterized new genes and analyzed them from a comparative evolutionary perspective.

Methods

This section describes some theoretical aspects of the methodology used in this thesis. More detailed method descriptions can be found in the material and method sections of Papers I-VI.

Phylogenetic analysis

The three methods commonly used for phylogenetic analysis are the Neighbor-joining (NJ), Maximum Parsimony (MP) and Maximum Likelihood (ML) methods.

The neighbor-joining method starts with an initial star tree. The tree is constructed by linking the pair of nodes with the least distance. When the two nodes are linked, their common ancestral node is added to the tree. Then the distance matrix is modified and the calculations are repeated, treating the two sequences previously joined by a node as one. The next most related pair is then found in the revised matrix, and the calculations continue in this way until all pairs have been joined (see Figure 3a) (Bjarnadóttir 2006). Neighbor-joining is based on the minimum-evolution criterion for phylogenetic trees, i.e. the topology that gives the least total branch length is preferred at each step of the algorithm. Neighbor-joining may get stuck in a local optimum and may not find the true tree topology with the least total branch length. However, the algorithm is extensively tested and usually finds a tree that is close to the optimal tree. Another benefit of the algorithm is its speed and thus its suitability for large datasets (Pachter 2006).
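The joining procedure can be sketched in a few lines of code. The sketch below is illustrative only and is not the software used in Papers I-VI. It follows the standard formulation of neighbor-joining, in which the pair to join is the one minimising the rate-corrected criterion Q(a, b) = (n - 2) d(a, b) - r(a) - r(b), where r is the sum of distances from a node to all others; this is slightly more specific than the simplified "least distance" description above. The function name, the `frozenset` keys and the internal node labels `node0`, `node1`, ... are assumptions made for the example.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Build an unrooted tree from a symmetric distance matrix.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns a list of (node, neighbour, branch_length) edges.
    """
    d = dict(dist)          # working copy of the pairwise distances
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node: sum of its distances to all others.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a) for a in nodes}
        # Choose the pair minimising Q(a, b) = (n - 2) * d(a, b) - r(a) - r(b).
        a, b = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        dab = d[frozenset((a, b))]
        u = f"node{next_id}"  # label of the new ancestral node
        next_id += 1
        # Branch lengths from the joined pair to their ancestral node.
        len_a = dab / 2 + (r[a] - r[b]) / (2 * (n - 2))
        len_b = dab - len_a
        edges += [(a, u, len_a), (b, u, len_b)]
        # Distances from the new node to every remaining node.
        for k in nodes:
            if k not in (a, b):
                d[frozenset((u, k))] = (
                    d[frozenset((a, k))] + d[frozenset((b, k))] - dab
                ) / 2
        nodes = [k for k in nodes if k not in (a, b)] + [u]
    # Connect the final two nodes.
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges
```

Each iteration removes two nodes and adds one, so a dataset of n sequences needs n - 2 joining steps, which is what makes the method fast enough for large datasets.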

Occam’s razor, the law of parsimony, states that simpler hypotheses are preferred over more complicated ones. Maximum parsimony is a method that infers a phylogenetic tree by minimizing the total number of evolutionary steps (i.e. mutations) required to explain a given set of data, or, in other words, by minimizing the total tree length (See Figure 3b) (Bjarnadóttir 2006). The MP algorithm evaluates possible trees for each informative site and the tree with the fewest evolutionary changes is chosen. Maximum parsimony is a very simple approach; however, it is not guaranteed to produce the true tree with the highest probability or the most parsimonious tree.
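To make the counting of evolutionary steps concrete, the sketch below scores candidate trees with the small-parsimony procedure usually attributed to Fitch. It is an illustrative assumption about how such a count can be obtained, not a description of PROTPARS; the function names and the toy alignment are hypothetical.

```python
def fitch_score(tree, states):
    """Minimum number of substitutions for one site on a rooted binary tree.

    tree:   nested 2-tuples of leaf names, e.g. (("A", "B"), ("C", "D")).
    states: dict mapping each leaf name to the residue observed at this site.
    """
    def visit(node):
        if isinstance(node, str):
            return {states[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:
            # The children agree on at least one state: no extra change needed.
            return common, left_cost + right_cost
        # The children disagree: one substitution is required at this node.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def parsimony_length(tree, alignment):
    """Total tree length: the per-site minimum change counts summed over all sites."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_score(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )


# Hypothetical three-column alignment of four sequences.
toy_alignment = {"A": "AAG", "B": "AAG", "C": "CAT", "D": "CAT"}
tree_one = (("A", "B"), ("C", "D"))
tree_two = (("A", "C"), ("B", "D"))
print(parsimony_length(tree_one, toy_alignment))
print(parsimony_length(tree_two, toy_alignment))
```

The tree with the smaller total is the more parsimonious of the two. For larger datasets not every topology can be scored, which is why heuristic searches may miss the most parsimonious tree.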

Maximum Likelihood is a method that evaluates a hypothesis about evolutionary history. It calculates the probability that a proposed evolutionary model and a hypothesized history would give rise to the observed data set. A history with a higher probability is preferred over a history with a lower probability. The tree most likely to explain the given dataset is eventually chosen (See Figure 3c) (Bjarnadóttir 2006). The Maximum Likelihood method has advantages over the traditional parsimony algorithms, which can give misleading results if rates of evolution differ in different lineages (Felsenstein 1981). The disadvantage of the method is the large CPU power needed and thus the long calculation time required.
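As an illustration of how the probability of one tree is obtained, the sketch below sums over all 16 combinations of nucleotides at two internal nodes x and y of a four-sequence tree. It assumes the Jukes-Cantor substitution model and equal branch lengths purely for simplicity; PROML uses protein models, and the function names here are hypothetical.

```python
import math
from itertools import product

BASES = "ACGT"


def jc69(a, b, t):
    """Jukes-Cantor probability of observing base b after time t, starting from base a."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def site_likelihood(leaves, t):
    """Likelihood of one alignment column on the tree ((l1, l2)x, (l3, l4)y).

    leaves: four observed bases, the first two attached to internal node x
            and the last two attached to internal node y.
    t:      branch length, assumed equal for all five branches.
    """
    l1, l2, l3, l4 = leaves
    total = 0.0
    # One term per assignment of bases to x and y: 4 * 4 = 16 terms in total.
    for x, y in product(BASES, repeat=2):
        total += (
            0.25  # equal base frequencies at node x
            * jc69(x, l1, t) * jc69(x, l2, t)
            * jc69(x, y, t)
            * jc69(y, l3, t) * jc69(y, l4, t)
        )
    return total


# The same column evaluated on two topologies: grouping the identical bases
# together versus splitting them across the internal branch.
print(site_likelihood("AACC", 0.1))
print(site_likelihood("ACAC", 0.1))
```

The topology giving the higher value explains the column better. A full analysis multiplies such values over all columns and also optimizes the branch lengths, which accounts for the long calculation times.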


Figure 3. The Figure is used by permission from (Bjarnadóttir 2006). The Figure is based on (http://www.icp.ucl.ac.be/~opperd/private/parsimony.html) and (Hill 2007): a) NJ: In this case B and C are the least different sequences according to a pairwise distance matrix (not shown). They are paired and joined by an ancestral node, whereafter they are treated as one and the distance matrix is modified. Again the least different sequences are paired; in this case A and B/C are joined by an ancestral node, etc. b) MP: In theory, all tree topologies are tested. The tree with the fewest evolutionary substitutions (ES) is then chosen, in this case tree one with ES = 10. c) ML: Likewise, all tree topologies are in theory considered, looking at each amino acid position and calculating the probability of the expected amino acid in an ancestral node. In practice a heuristic approach is used for larger datasets. In this example there are four possible nucleotides for node x and node y (left tree). Thus there are 16 different possible trees: one of these trees, and how the probability is calculated, is shown (right tree). After calculating P1-P16, the probabilities are added to obtain the probability of the tree to the left and then a tree with another topology is evaluated. The tree with the highest probability is considered the most likely tree (Bjarnadóttir 2006).

The protein sequences can be used to construct alignments and phylogenetic trees. To avoid input order bias, the datasets in Paper I-IV were randomized 20 times with regard to sequence input order using a program called Randfasta (http://www.medfarm.neuro.uu.se/schioth.html). These 20 datasets, containing the full set of sequences but in different orders, were all aligned using ClustalW (Thompson et al. 1994) with the default alignment parameters. The 20 alignments were all bootstrapped using SEQBOOT from the Phylip package. The bootstrapped alignments were then analyzed using PROTDIST (protein distances for NJ), PROTPARS (MP) or PROML (ML) from the Phylip package (Felsenstein 1989; Vandamme 2003; Felsenstein 2005). The result file for each method was analyzed using CONSENSE from the Phylip package to obtain a bootstrapped consensus tree. The trees were plotted using TREEVIEW (http://taxonomy.zoology.gla.ac.uk/rod/treeview.html).
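The randomization of input order can be sketched as follows. The internals of Randfasta are not described here, so this is only a hypothetical stand-in showing the idea: the same set of sequences is written out in 20 different orders before alignment. The function names and the toy FASTA text are assumptions.

```python
import random


def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif records:
            records[-1][1] += line
    return [(header, sequence) for header, sequence in records]


def shuffled_datasets(records, n_orders=20, seed=0):
    """Return n_orders copies of the dataset, each in a different random order."""
    rng = random.Random(seed)  # fixed seed so that the orders are reproducible
    datasets = []
    for _ in range(n_orders):
        order = list(records)
        rng.shuffle(order)
        datasets.append(order)
    return datasets


# Hypothetical toy input; real input would be the receptor protein sequences.
toy_fasta = """>seq1
MKTLLV
>seq2
MKSLIV
>seq3
MRTLLA
"""
toy_records = read_fasta(toy_fasta)
toy_orders = shuffled_datasets(toy_records)
print(len(toy_orders))
```

Each of the shuffled datasets would then be aligned separately, so that any effect of the input order is averaged out in the consensus tree.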

Paper   Method        Replicas   Randfasta usage
I       NJ, MP        1000       Yes
II      NJ, MP        1000       Yes
III     NJ, MP        1000       Yes
IV      NJ, MP, ML    1000       Yes
V       NJ, MP, ML    100        No
VI A    NJ, MP, ML    100        No
VI B    Fitch         1          No

Table 3. For each Paper, the table lists the phylogenetic methods used (Neighbor-Joining (NJ), Maximum Parsimony (MP), Maximum Likelihood (ML) and Fitch), the number of replicas and whether Randfasta was used.

In Paper VI we analyzed the SLC family homology by comparing all 46 SLC HMMs against each other with HHsearch using default options and with calibration. HHsearch is a method of protein homology detection by HMM-HMM comparison (Soding 2005). The probability values were used for scoring the significance of a match, and scores equal to or higher than 50 percent were considered significant and subsequently used in our phylogenetic analysis.

The scores were compiled into a matrix which was used to calculate a phylogenetic tree using FITCH of the PHYLIP 3.67 package with default settings except for the global option, which was omitted. The Fitch program uses the Fitch-Margoliash criterion and some related least squares criteria, or the Minimum Evolution distance matrix method, and it does not assume an evolutionary clock (Felsenstein J 2007). The tree was plotted using TreeView (http://taxonomy.zoology.gla.ac.uk/rod/treeview).
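The compilation of pairwise scores into a matrix can be sketched as below. How the probabilities were converted into the values given to FITCH is not specified here, so the conversion used in this sketch (distance = 100 minus the probability, with non-significant pairs set to the maximum distance of 100) is purely an assumption for illustration. The two probabilities for SLC46 are taken from the cluster analysis in Paper VI; the SLC2/SLC22 value and the function names are hypothetical.

```python
def probability_to_distance_matrix(names, prob, threshold=50.0):
    """Build a symmetric distance matrix from pairwise homology probabilities.

    prob: dict mapping (query, target) to a probability in percent.
    Pairs below the threshold are treated as unrelated (distance 100).
    ASSUMPTION: distance = 100 - probability; this conversion is illustrative only.
    """
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            if i == j:
                continue
            # Scores can differ by search direction; keep the better of the two.
            p = max(prob.get((a, b), 0.0), prob.get((b, a), 0.0))
            matrix[i][j] = 100.0 - p if p >= threshold else 100.0
    return matrix


def to_phylip(names, matrix):
    """Format a square distance matrix in the PHYLIP layout (names padded to 10)."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        lines.append(f"{name[:10]:<10}" + " ".join(f"{value:8.4f}" for value in row))
    return "\n".join(lines) + "\n"


family_names = ["SLC46", "SLC2", "SLC22"]
family_prob = {
    ("SLC46", "SLC2"): 94.9,
    ("SLC46", "SLC22"): 91.4,
    ("SLC2", "SLC22"): 30.0,  # hypothetical below-threshold value
}
family_matrix = probability_to_distance_matrix(family_names, family_prob)
print(to_phylip(family_names, family_matrix))
```

A file in this layout is the kind of square matrix that the PHYLIP distance programs read.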

Sequence similarity based search strategies

BLAST of the BLAST package (Altschul et al. 1997) rapidly identifies sequences of high similarity. Searches with Hidden Markov Models (HMMs) (Eddy 1998) have a higher sensitivity and are thus better suited to finding distantly related homologues. These two methods can thus give complementary results. The BLAST programs identify the local alignments sharing the highest number of identical or similar residues, whereas the HMM algorithm makes a statistical model from a multiple alignment and uses it to scan a database for sequences. It has been shown that BLAST detects almost all relationships between proteins whose sequence identity is >30%. For more distantly related proteins it performs much less well; only one-half of the relationships between proteins with 20-30% identity are found (Brenner et al. 1998). Profile-sequence methods such as HMMer, SAM and PSI-BLAST find three times as many homologues as sequence-sequence methods such as BLAST at sequence identities below 30% (Park et al. 1998). Profile-sequence methods are based on multiple alignments of sequences to model evolutionary changes. This improves the homology detection between protein sequences (Lindahl et al. 2000). More recent are the profile-profile methods, which use a profile to search a database of profiles. Evolutionary information in both the query and the target can be utilized and thus more remote relationships can be detected (Reid et al. 2007). Such methods include COMPASS, LAMA, PRC and HHSearch. PRC and HHSearch use HMMs in their searches. HHSearch showed superiority in one study (Soding 2005), but PRC showed superiority in a more recent one (Reid et al. 2007). However, all profile-profile methods are better than profile-sequence methods on remote homologues (Reid et al. 2007). Sequences can also be searched for in a genome using BLAT (Kent 2002). BLAT is quick but has a relatively low sensitivity. It is therefore useful when the evolutionary distance is relatively short.
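The identity ranges quoted above can be related to a pair of aligned sequences with a small helper. This is a simplified sketch: the cut-offs are taken from the ranges cited in the text, but turning them into a hard rule is an assumption made for illustration, and the function names and example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of identical residues over aligned, non-gap positions."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def suggested_strategy(identity):
    """Map an identity level to the class of search method discussed in the text."""
    if identity > 30:
        return "sequence-sequence (e.g. BLAST)"
    if identity >= 20:
        return "profile-sequence (e.g. PSI-BLAST, HMMer)"
    return "profile-profile (e.g. HHSearch, PRC)"


identity = percent_identity("ACDE-G", "ACDFHG")
print(identity, suggested_strategy(identity))
```

In practice the identity between a query and an undiscovered homologue is not known in advance, which is one reason the methods are used in a complementary way.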

Expressed Sequence Tags

The dbEST, the database for Expressed Sequence Tags (ESTs), was established in 1992 (Boguski et al. 1993). Since then, there has been an exponential growth in the generation and accumulation of EST data in public databases for numerous species (Nagaraj et al. 2007). ESTs and mRNA sequences can be used to support predicted intron-exon boundaries, thus giving support for the existence of a predicted nucleotide sequence. ESTs are short (200-800 nucleotide bases in length), unedited, randomly selected single-pass sequence reads derived from cDNA libraries. The idea is to use the ESTs expressed in certain cells, tissues, or organs from different organisms to correct the predicted gene, for example to correct wrongly predicted exon-intron boundaries. Genes are predicted from the raw genomic DNA by computer programs. There are many different programs, such as GENSCAN, FGENEH, GeneID and GRAIL2. GENSCAN has been used as the principal tool for gene prediction in the International Human Genome Project. It identifies exon/intron structures of genes as well as whole genes. GENSCAN identified 75-80% of the exons correctly when tested on standardized sets of human and vertebrate genes (Burge et al. 1997). This means that every predicted coding region must be verified. EST and mRNA sequences are used to correct predicted proteins. Incorrectly included exons can often be removed directly after defining their borders using BLAT, whilst missing exons have to be identified from full-length cDNAs or ESTs. These physically expressed sequences are assembled and the missing exon(s) can be found. Splice variants can also be identified using this method. There are several limitations associated with the EST approach. One is that it is very difficult to isolate mRNA from some tissues and cell types, resulting in an EST tissue expression bias. ESTs are also subject to sampling bias, resulting in under-representation of rare transcripts (Bonaldo et al. 1996). As ESTs are sequenced only once, they are susceptible to errors. Generally, the quality of base reads in individual EST sequences is initially poor for the first 20% or the first 50-100 base pairs. It gradually improves and then diminishes again towards the end. The overall sequence quality is usually significantly better in the middle (Nagaraj et al. 2007). Despite these limitations, ESTs continue to be invaluable in characterizing the human genome, as well as the genomes of other organisms. They have enabled the mapping of many genes to chromosomal sites and have also assisted in the discovery of many new genes (http://www.ncbi.nlm.nih.gov/About/primer/est.html).
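Because read quality is poor at both ends of an EST, a common precaution is to keep only the well-supported middle part of a read. The sketch below illustrates one simple way to do this, assuming that per-base quality scores are available; the threshold of 20, the function name and the toy scores are hypothetical and not taken from the papers.

```python
def best_quality_window(qualities, min_quality=20):
    """Return (start, end) of the longest run of bases with quality >= min_quality.

    The interval is half-open, so read[start:end] is the retained part.
    """
    best = (0, 0)
    start = None
    # A trailing None closes a run that extends to the end of the read.
    for i, q in enumerate(list(qualities) + [None]):
        good = q is not None and q >= min_quality
        if good and start is None:
            start = i
        elif not good and start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best


# Hypothetical scores: poor at the start, better in the middle, poor at the end.
toy_read = "ACGTACGTAC"
toy_quality = [5, 8, 12, 25, 30, 32, 31, 28, 15, 9]
start, end = best_quality_window(toy_quality)
print(toy_read[start:end])
```

Only the retained part would then be used when assembling ESTs to verify or correct a predicted exon.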

Results

Paper I, II and IV - Adhesion

We report 16 novel human Adhesion GPCRs found by searches in the NCBI and Celera databases. In Paper I, we report eight novel human GPCRs found by searches in the human genome databases, termed GPR97, GPR110, GPR111, GPR112, GPR113, GPR114, GPR115 and GPR116. In Paper II, we report six novel members of the superfamily of human GPCRs found by searches in the human genome databases, termed GPR123, GPR124, GPR125, GPR126, GPR127, and GPR128. In Paper IV, we identified two new human Adhesions, termed GPR133 and GPR144. Phylogenetic analysis demonstrates that these are all additional members of the Adhesion GPCR family. EST expression charts for the entire repertoire of Adhesion GPCRs in human and mouse were established, showing widespread distribution in both central and peripheral tissues.

All the receptors in Paper I have a GPS domain in their N-terminus and long Ser/Thr-rich regions forming mucin-like stalks. GPR113 has a hormone binding domain and one EGF domain. GPR112 has over 20 Ser/Thr repeats and a pentraxin domain. GPR116 has two immunoglobulin-like repeats and a SEA box.

All 16 novel Adhesions, with the exception of GPR123, have a GPS domain in their N-terminus. They all have long Ser/Thr rich regions forming mucin-like stalks. GPR124 and GPR125 have a leucine rich repeat (LRR), an immunoglobulin (Ig) domain, and a hormone-binding domain (HBD). GPR127 has one EGF domain, while GPR126 and GPR128 do not contain other readily recognized domains beyond the GPS domain.

In Paper IV, we identified 17 mouse Adhesion orthologs (GPR110, GPR111, GPR112, GPR113, GPR114, GPR115, GPR116, GPR123, GPR124, GPR125, GPR126, GPR128, GPR133, GPR144, LEC1, LEC2, and LEC3). The mouse and human sequences show a clear one-to-one relationship, with the exception of EMR2 and EMR3, which do not seem to exist in mouse.

Name              Accession    Length   Chromosomal     Exons
                  numbers      (aa)     positioning
BAI1              NP_001693    1584     8q24            28
BAI2              NP_001694    1585     1p35            29
BAI3              NP_001695    1522     6q12            30
CD97              NP_510966    835      19p13           18
CELSR1            NP_055061    3014     22q13.3         33
CELSR2            NP_001399    2923     1p21            32
CELSR3            NP_001398    3312     3p24.1-p21.2    34
EMR1              NP_001965    886      19p13.3         19
EMR2              NP_038475    823      19p13.1         18
EMR3              NP_115960    652      19p13.1         13
ETL               NP_071442    690      1p33-p32        14
GPR56             NP_005673    693      16q12.2-q21     13
GPR97             AAN46673     549      16q13           11
GPR110            NP_722582    910      6p12.3          14
GPR111            NP_722581    708      6p12.3          6
GPR112            NP_722576    3080     Xq26.3          21
GPR113            NP_722577    1079     2p23.3          11
GPR114            AAN46670     569      16q13           12
GPR115            AAN46671     752      6p12.3          8
GPR116            NP_056049    1346     6p12.3          20
GPR123            NP_115798    1280     10q26           16
GPR124            NP_116166    1338     8p12            19
GPR125            NP_660333    1321     4p15.31         19
GPR126            NP_940971    1250     6q24.1          24
GPR127p (EMR4p)   Q86SQ3       -        19p13.3         -
GPR128            NP_116176    797      3q12.2          16
GPR133            NP_942122    874      12q24.33        25
GPR144            AAP35064     963      9q33.3          20
HE6               Q8IZP9       1017     Xp22.13         22
LEC1              O95490       1459     1p31.1          20
LEC2              NP_055736    1474     19p13.2         20
LEC3              NP_056051    1469     4q13.1          21
VLGR1             NP_115495    6306     5q13            89

Table IV. Adapted from Paper IV. The list was updated 2008-08-21 against the NCBI Entrez database (http://www.ncbi.nlm.nih.gov/sites/entrez). The table lists 33 human Adhesion GPCRs with accession ID, length in amino acids, chromosomal position and number of exons. The chromosomal location and the number of exons are obtained from the NCBI Entrez database and the UCSC Genome Browser website (http://genome.ucsc.edu/). The 16 novel genes found in Paper I, II and IV are in bold text. Listed here are the longest isoforms. GPR127 has been renamed EMR4p and is considered a pseudogene.

It has been five years since our first research and publication on the Adhesion family, and it was therefore prudent to update the dataset with the longest Adhesion isoforms available. We found 15 longer isoforms compared to Table 1, Paper IV: BAI2, EMR2, ETL, GPR110, GPR112, GPR114, GPR116, GPR124, GPR125, GPR126, GPR133, HE6, LEC1, LEC3 and VLGR1. New RPS searches (http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi) were performed (See Figure 6). Some discrepancies were found compared to Figure 2 (Bjarnadottir et al. 2007). A Herpes_Ul32 (Herpesvirus large structural phosphoprotein UL32) domain was found in GPR112. A CUB (Complement C1r/C1s, Uegf, Bmp1) domain was found in GPR126. In GPR110, a SEA (sperm protein, enterokinase, and agrin) domain was found. Additional Cadherin (CA) domains were found in CELSR1 (2 CA), CELSR2 (3 CA) and CELSR3 (2 CA).

The CUB domain (for complement C1r/C1s, Uegf, Bmp1) is a structural motif of approximately 110 residues found almost exclusively in extracellular and plasma membrane-associated proteins, many of which are developmentally regulated. These proteins are involved in a diverse range of functions, including complement activation, developmental patterning, tissue repair, axon guidance and angiogenesis, cell signalling, fertilisation, hemostasis, inflammation, neurotransmission, receptor-mediated endocytosis, and tumour suppression. The mammalian complement subcomponents C1s and C1r are part of the first component of the classical pathway of the complement system (Perry et al. 2007). The CUB domain was found in GPR126, which also has a PTX (pentraxin) domain. Many proteins with PTX domains are cytokine-inducible acute phase proteins. C reactive protein and serum amyloid P component are short pentraxins, whose concentrations in the blood increase dramatically upon infection or trauma (Goodman et al. 1996). GPR126 has been shown to be induced by LPS and thrombin, suggesting an important function in cell adhesion and a potential link between inflammation and coagulation (Stehlik et al. 2004).

In Paper III, we report seven new human GPCRs found by searches in the human genome databases, termed GPR100, GPR119, GPR120, GPR135, GPR136, GPR141, and GPR142. We also report 16 orthologs of these receptors in mouse, rat, fugu and zebrafish. Phylogenetic analysis shows that these are additional members of the Rhodopsin GPCR family.

Name     Accession    Length   Chromosomal   Exons   Rhod.
         numbers      (aa)     positioning           group
GPR100   NP_871001    374      1q22          1       (Chem)
GPR119   NP_848566    335      Xq25          1       (Amine)
GPR120   NP_859529    377      10q23.33      4       (SOG)
GPR135   NP_072093    494      14q23.1       1       (SOG)
GPR136   NP_859528    354      6p12.3        6       (Opsin)
GPR141   NP_861456    305      7p14.1        1       (Purin)
GPR142   NP_861455    462      17q25.1       4       (Chem)

Table 5. Updated list (accessed 2008-08-08) of the human Rhodopsin GPCRs found in Paper III: accession ID, length in amino acids, chromosomal position, number of exons and their Rhodopsin group (α, β, γ or δ) and subgroup. The subgroups are the Chemokine receptor cluster (Chemo), the Amine Receptor cluster (Amine), the Somatostatin/opioid/galanin Cluster (SOG), the Opsin receptors cluster (Opsin) and the Purin receptor cluster (Purin). The chromosomal location and the number of exons are obtained from the NCBI Entrez database and the UCSC Genome Browser website (http://genome.ucsc.edu/), respectively. Listed here are the longest isoforms.

In Paper V, we present the identification of two new human genes, termed SLC6A17 and SLC6A18, from the Solute Carriers Family 6. We also identified the corresponding orthologs and additional genes from the mouse and rat genomes. Detailed phylogenetic analysis of the entire family of SLC6 proteins in mammals shows that this family can be divided into four subgroups. We used sequence Hidden Markov Models for these subgroups and identified in total 430 unique SLC6 proteins from 10 animal, one plant, two fungal, and 196 bacterial genomes. It is evident that SLC6 proteins are present in both animals and bacteria, and that three of the four subfamilies of mammalian SLC6 proteins are present in Caenorhabditis elegans, showing that these subfamilies are very ancient in an evolutionary sense. Moreover, we performed tissue localization studies on the entire family of SLC6 proteins on a panel of 15 rat tissues. The expression of three of the genes was studied using quantitative real-time PCR, which showed expression in multiple central and peripheral tissues. This paper presents an overview of the gene repertoire of the SLC6 gene family and its expression profile in rats.

In Paper VI, we performed the first evolutionary analysis of the entire SLC family based on whole genome sequences. We systematically mined and analyzed the genomes of eight species, aiming to identify all genes coding for SLCs using Hidden Markov Models (HMMs) and phylogeny. In all, we identified 2403 SLC sequences in these genomes and we delineate the evolutionary history of each of the families. Moreover, we identified 10 new human sequences, not previously classified as SLCs, that most likely belong to the SLC family. We found that 43 of the 46 SLC families found in H. sapiens are also found in C. elegans, while 42 of them are also found in insects. A large number of the families are also found in Eubacteria and Archaea. Several families of SLCs have increased in number, perhaps reflecting their important role in the development of CNS functions. This study shows that the SLC family is very ancient, with multiple branches that were present before the early divergence of eukaryotes. This is the first report that provides a systematic analysis of the evolutionary history of the SLC families in Eukaryotes.

Discussion

The Adhesion Family

New research shows that the GPCR families Glutamate and Frizzled date back as far as the social amoeba Dictyostelium discoideum (Unpublished data, Nordstrom et al.) (See Figure 4). However, the Rhodopsin family is not found until the Placozoa Trichoplax adhaerens (Unpublished data, Nordstrom et al.), an organism lacking organs and a nervous system (Srivastava et al. 2008). The GPCR class B (7tm2) consists of the Adhesion, Secretin and Methuselah families. The Adhesion family is the oldest of the class B groups (Nordstrom et al., In Press). The earliest evidence of adhesion-like genes is found in D. discoideum and A. thaliana. Adhesion-like genes are also found in the Choanoflagellate Monosiga brevicollis; however, it is not until T. adhaerens that a clear development of several Adhesion families is seen (Nordstrom et al., In Press). In T. adhaerens, 42 Adhesion or adhesion-like genes belong to groups IV, V, and VIII. All eight groups of Adhesion are found in the pufferfish Tetraodon nigroviridis. Phylogenetic analysis reveals that the Adhesion group V (GPR133 and GPR144) is the closest relative to the Secretin family in the Adhesion family. Analysis of conserved splice sites confirms this relationship. Interestingly, this suggests that the Secretin family has descended from group V Adhesion (Nordstrom et al., In Press).


Figure 4. Phylogenetic figure of the evolution of the Secretin, Methuselah and Adhesion families (7tm_2). The figure is constructed using the NCBI taxonomy browser (http://www.ncbi.nlm.nih.gov/Taxonomy). The rectangular boxes show the evolutionary lineages. We analyzed 16 Eukaryotic species: Dictyostelium discoideum, Naegleria gruberi, Arabidopsis thaliana, Monosiga brevicollis, Phycomyces blakesleeanus, Trichoplax adhaerens, Nematostella vectensis, Caenorhabditis elegans, Drosophila melanogaster, Strongylocentrotus purpuratus, Branchiostoma floridae, Ciona intestinalis, Tetraodon nigroviridis, Gallus gallus, Mus musculus and Homo sapiens. The data is compiled from several sources: (Kamesh et al. 2008), (Nordstrom et al., In Press) and UniProt.


The Adhesion family is the second largest GPCR subfamily. If the pseudogenes are excluded, there are 32 Adhesion members in human and 30 in mouse and rat. GPR144 is an expressed gene in human, but a pseudogene in mouse and rat (Haitina et al. 2008). The evolutionary history of Adhesion group II is very intriguing. The repertoire of the Adhesion group II differs between humans and rodents. Humans have the members CD97, EMR1, EMR2 and EMR3, whereas the mouse and rat genomes contain CD97, EMR1 and EMR4 (Hamann 2004). EMR4 (GPR127) is almost exclusively expressed in peripheral tissues in mouse and rat (Haitina et al. 2008). In humans, a deletion in exon 8 of EMR4 causes a frame shift. This results in the creation of a truncated 232 amino acid protein lacking the entire 7-TM region. The deletion is not present in non-human primates, including the chimpanzee (Kwakkenbos et al. 2004). Two alternatively spliced human EMR4 transcripts were found; neither is predicted to express the 7-TM region. Thus, if human EMR4 exists, it is likely to be expressed as a soluble secreted molecule (Caminschi et al. 2006). The genome of the dog, Canis familiaris, has recently been sequenced (Lindblad-Toh et al. 2005). For the GPCR family there is a high number of orthologous gene pairs between dog and human. Interestingly, however, there are discrepancies in the Adhesion family. All 33 human Adhesion GPCRs, including EMR4, are present in the dog genome. However, the dog also contains 5 additional full-length genes (EMR2b, EMR2c, EMR2d, EMR4b and EMR4c) and 1 pseudogene (GPR133b). GPR144, EMR2 and EMR3, which are pseudogenes or missing in rodents, appear to be functional (full-length) in dog (Haitina et al.; Submitted). This suggests that there have been species-specific gene duplication events in dogs. The functions of the numerous dog Adhesion genes are not clearly known and further investigation is needed. Additional evolutionary analysis reveals that the Adhesion group II, which in humans contains CD97, EMR1, EMR2, EMR3 and EMR4p, is first clearly seen in T. nigroviridis. The Adhesion group II genes are, however, not found in Gallus gallus (chicken) (Lagerstrom et al. 2006). One suggested hypothesis is that the two duplication events resulted in the creation of EMR2 by concerted evolution (Kwakkenbos et al. 2006). Concerted evolution is the greater-than-expected similarity seen in members of gene families within a species relative to that seen between species.

The expression of Adhesion GPCRs was studied in rat and mouse tissues by our group (Haitina et al. 2008) (See Figure 5). The members of group I (LEC1, LEC2, LEC3 and ETL) were expressed ubiquitously in both rat and mouse, whereas LEC2 and LEC3 had higher levels in the CNS compared to levels in the periphery. The members of group II (CD97, EMR1 and EMR4) were detected mostly in peripheral tissues and at very low levels in the brain. The group III member GPR123 displayed central expression, while the other members, GPR124 and GPR125, showed ubiquitous expression. The CELSR2 and CELSR3 members of group IV were detected mainly in the CNS, while CELSR1 was found both in central and peripheral tissues. Group V member GPR133 was detected in both CNS and periphery. GPR144, which is a pseudogene in rat, was not detected in any of the examined mouse tissues. Group VI members GPR110, GPR111, GPR113 and GPR115 were detected in only a few peripheral tissues. Expression of BAI1 and BAI2 from group VII was detected only in brain regions of rat and mouse. BAI3 transcripts were detected mostly in the brain, with very low levels in some peripheral tissues. GPR56, GPR64 and GPR126 from group VIII were detected in central and peripheral tissues, whereas GPR97, GPR112, GPR114 and GPR128 were detected in peripheral tissues. The overall conclusion is that the phylogenetic clusters have either a predominantly central or a predominantly peripheral expression in rat and mouse tissues (Haitina et al. 2008).

Receptor   Tissue expression
CD97       Adrenal gland, bladder, brain, dendritic cells, heart, intestine, kidney,
           leukocytes (granulocytes, monocytes, T-cells and B cells), liver, lung,
           lymph node, pancreas, placenta, prostate, skeletal muscle, skin, spleen,
           stomach, testis, thymus, tonsils, uterus
EMR1       Hemopoietic cells (monocytes, macrophages, myeloid cells)
EMR2       Bone marrow, leukocytes (monocytes, granulocytes), liver, lung, lymph nodes,
           placenta, skin, spleen, thymus, tonsils
EMR3       Bone marrow, heart, , leukocytes (neutrophils, monocytes), liver, lung,
           lymph node, macrophages, pancreas, placenta, skeletal muscle, spleen,
           thymus, tonsils
EMR4       Kidney, liver, lung, macrophages, spleen, thymus

Table 6. Table adapted from (Bjarnadottir et al. 2007). Columns showing EGF-clan receptor and their tissue expression. Cells of the Immune system are marked in italics.

The Adhesion group II is mainly expressed by the immune system or immune system-related tissues: spleen, bone marrow, gastrointestinal and hemopoietic stem cells for EMR3, and spleen and stem cells for EMR4 (See Table 6). The Adhesion group I (LEC1-3 and ETL) is sometimes grouped together with group II (Bjarnadottir et al. 2007). However, phylogenetic analysis and the discrepancy in expression clearly indicate that they are two separate Adhesion groups (Bjarnadottir et al. 2007; Haitina et al. 2008). The relatively high evolutionary differences between mammals are somewhat surprising. The immune system-related expression pattern suggests that the Adhesion group II is important for the survival of the organism. We speculate that the rapid evolution has been necessary in mammalian evolution to meet the different life situations and immunological demands of each species.

Figure 5. Phylogenetic tree adapted from Fig 1, Article IV and expression data (Haitina et al. 2008). A phylogenetic analysis of the Adhesion GPCRs in human and mouse. The seven transmembrane regions from TM1 to the end of TM7 were used to obtain phylogenetic trees using 100 replicas and maximum likelihood. Each group, I-VIII, has been put in a separate box. The predominant expression for the mouse Adhesions is in the Central Nervous System (CNS) and/or the peripheral tissues. The displayed expression is based on mouse and rat real-time PCR data.


Figure 6. Adapted from Article IV. A schematic presentation of the domains found in the N-termini of the Adhesions. The Figure is based on data from Table IV. We performed new searches on all Adhesion proteins using RPS-BLAST (http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi) with default E-value (0.01) and the CDD (Conserved domain database) version 2.14. CA, cadherin domains; Calx-beta, Calx-beta-domain; CUB, Complement C1r/C1s, Uegf, Bmp1; EGF, epidermal growth factor; EGF_LAM, Laminin-type epidermal growth factor-like domain; GBL, galactose-binding lectin domain; GPS, GPCR proteolytic site; HBD, hormone-binding domain; Herpes_Ul32, Herpesvirus large structural phosphoprotein UL32; Ig, immunoglobulin; LamG, laminin; LRR, leucine rich repeats; OLF, olfactomedin; PTX, pentraxin domain; SEA, sperm protein, enterokinase, and agrin; TSP, thrombospondin.

The Solute Carrier Family

Despite the high diversity of SLC genes in distantly related species, including bacteria, it is clear that there has been a large increase in the number of SLCs within the animal lineage. Moreover, there has also been an increase in the number of SLC genes in the vertebrate animals as compared with the invertebrate animals. The mammals, represented by H. sapiens and M. musculus, have the highest total number of SLC genes (400 and 392, respectively). The Fungi lineage has the fewest members: S. pombe (94 genes) and S. cerevisiae (151). The Bacteria and Archaea SLC genes, however, need further investigation to show whether they are fully functional transporters. There is an increase in the number of SLC genes in human and mouse compared with insects and nematodes. This may suggest that SLCs are of great importance in organisms with more complex nervous systems.

26 of the human SLC families belong to four major groups (See Paper VI, Figure 1) (Fredriksson et al.; In Press). The largest of these is the Major Facilitator superfamily (MFS), containing 18 Pfam models (See Paper VI, Figure 1, light green boxes). The MFS clan contains 13 of the human SLC families, and 9 of those were also shown to group together in the phylogenetic analysis based on Hmmpfam searches against the Pfam database (See Paper VI, Figure 2). In general, some of the individual families of the MFS transporter group have a high degree of similarity, yet as a group, the superfamily has a high degree of sequence diversity.

SLC46 is a relatively newly discovered family with three human members (http://www.genenames.org). The family is related to the SLC2 and SLC22 families, with an HHsearch homology probability in our cluster analysis of 94.9% and 91.4%, respectively. SLC46 has a weak homology to bacterial and mammalian cation co-transporters with 12 transmembrane domains (Kim et al. 2000), but the evolutionary origin of SLC46 has not been resolved. In Paper VI, we show that SLC46 proteins first appear in C. elegans. This sequence has a 94.7% and 93.2% probability of homology to SLC22 and SLC2, respectively.

The members of the Amino acid Polyamine Organocation (APC) transporter clan function as solute/cation and solute/solute antiporters. They vary in length from 350 to 850 residues. Fungal and bacterial APC amino acid transporters show significant sequence similarities, which may reflect a common topology and mechanism of function (Kafasla et al. 2007). The APC clan consists of 14 families, of which four are found in Eukaryota. The human genes in the APC clan are mainly known SLCs, with three exceptions: Transmembrane protein 104 (Accession number Q8NE00) with an Amino acid transporter Pfam domain, Anchor protein (Q5Y190) with an amino acid permease Pfam domain and P41 (Q9BXG1). None of these are defined as SLC proteins. The substrates of several APC superfamily permeases have been carefully studied and it has been revealed that some members have exceptionally broad specificity, whereas others are restricted to just one or a few amino acids or related compounds (Jack et al. 2000). We confirm that the SLC32, SLC36 and SLC38 groups all belong to the AA_permease family and also group phylogenetically (See Paper VI, Figure). Interestingly, the substrates differ; SLC36 and SLC38 are amino acid transporters, whereas SLC32 is a GABA/glycine transporter.

The CPA/AT clan contains transporter proteins that belong to the monovalent cation:proton (CPA) superfamily and the anion transporter (AT) superfamily. This clan contains 10 families, and two of the families, Na_H_Exchanger (SLC9) and SBF (SLC10), have human members. Five further human proteins are members of the CPA/AT clan, but it is yet to be determined whether they are SLCs that may form new SLC families.

There are a total of 364 genes for all species we studied which are most similar to SLCs, but do not fulfil our criteria for subclassification into SLC families. We term these Unclassified SLCs, to clearly indicate that they most likely represent SLCs. 43 of these are found in H. sapiens, 39 are found in D. melanogaster and 67 are found in A. thaliana. Of the 43 Unclassified human sequences, there are 24 that could be classified to specific SLC families based on pairwise sequence alignments. Of the remaining 19, five belong to SLC related groups (SV2 related protein homolog (rat)-like, Synaptic vesicle glycoprotein 2a, Synaptic vesicle glycoprotein 2b, Synaptic vesicle glycoprotein 2c and SV2 related protein) (Jacobsson et al. 2007) and four have not previously been classified as SLCs (spinster homolog 1, spinster homolog 2, P protein (Melanocyte-specific transporter protein) and Transmembrane protein 104). The remaining 10 are novel genes and not yet annotated. We also found four novel mouse genes.

We analyzed all the Unclassified proteins with blastclust, from the NCBI BLAST package, to see whether they were similar to any previously known SLC or other human RefSeq sequence. The search yielded two clusters of three proteins and 18 clusters of two proteins; no other protein clustered with any RefSeq sequence. Of these clusters, 13 consist only of mouse and human sequences, six consist solely of A. thaliana sequences and one only of S. cerevisiae sequences. Remarkably, in the eight species analyzed there are 290 proteins that do not cluster with any known human RefSeq protein, including the SLCs in the present study: 30 in A. gambiae, 70 in A. thaliana, 34 in C. elegans, 41 in D. melanogaster, 13 in H. sapiens, 15 in M. musculus, 47 in S. cerevisiae and 40 in S. pombe. This suggests that there are many proteins similar to SLCs that are species-specific or have evolved relatively rapidly, as neither paralogs nor orthologs of these are found in our dataset. We suggest that there is a high rate of divergent evolution among the SLC families. Further analysis is needed to classify these proteins and establish the relationships among them. Some of them will most likely be classified into new SLC families. One such recently established family is the multidrug and toxin extrusion family, now termed SLC47 (Terada et al. 2008).
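blastclust groups sequences by single-linkage clustering of pairwise BLAST matches. The sketch below illustrates that idea only: the function name, the identifiers and the linked pairs are hypothetical, and the score and coverage thresholds that decide whether two sequences count as linked are left out. The last lines simply re-add the counts quoted above.

```python
from collections import defaultdict


def single_linkage_clusters(ids, linked_pairs):
    """Group ids so that any two linked ids end up in the same cluster."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent pointers to the cluster representative,
        # shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in linked_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for i in ids:
        groups[find(i)].append(i)
    # Largest clusters first; ties broken alphabetically.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


# Hypothetical identifiers and links, for illustration only.
example_ids = ["hs_1", "mm_1", "at_1", "at_2", "sc_1"]
example_links = [("hs_1", "mm_1"), ("at_1", "at_2")]
example_clusters = single_linkage_clusters(example_ids, example_links)

# Counts quoted in the text.
unclustered_per_species = {
    "A. gambiae": 30, "A. thaliana": 70, "C. elegans": 34,
    "D. melanogaster": 41, "H. sapiens": 13, "M. musculus": 15,
    "S. cerevisiae": 47, "S. pombe": 40,
}
cluster_sizes = {3: 2, 2: 18}  # cluster size -> number of clusters
cluster_composition = {
    "mouse and human": 13, "A. thaliana only": 6, "S. cerevisiae only": 1,
}

assert sum(unclustered_per_species.values()) == 290
assert sum(cluster_sizes.values()) == sum(cluster_composition.values()) == 20
```

Single linkage means that one sufficiently good pairwise match is enough to merge two clusters, so a protein that clusters with nothing has no qualifying match to any other sequence in the set.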

In conclusion, we have performed the first evolutionary analysis of the entire SLC family based on whole genome sequences. Using sequence Hidden Markov Models we found that 26 of the 46 SLC families belong to four evolutionary clusters, while the others have no or low sequence similarity to each other. We also show that 14 of the SLC families are present in all eukaryotes tested and are therefore present in all major eukaryotic lineages. These families are most likely evolutionarily old, and some of them most likely predate the appearance of eukaryotes. In addition, we find evidence for the presence of 59 % of the families in prokaryotes and 51 % in Archaea, making the SLC family one of the oldest groups of membrane proteins. We also identify 14 new human sequences, not previously classified as SLCs, which most likely belong to the SLC family.

The Solute Carrier Family 6

Our analysis shows that the SLC6 repertoires in mouse and rat are identical. The human and rodent repertoires are very similar, although they differ in the Orphan group. SLC6A16 exists in humans, but not in rodents. The X transporter protein 3 similar 1 (Xtrp3s1) is a duplication of SLC6A20 that only exists in mouse and rat. In Paper V we present two new genes, SLC6A17 and SLC6A18, that appear to belong to the Orphan subgroup. Since that publication, the Orphan subgroup has been shown to be a family of amino acid transporters and has been renamed Amino Acid Transporters II (Broer 2006). The Amino Acid Transporters II subgroup has six human members: SLC6A15-SLC6A20.

The phylogenetic analysis (see Paper V, Fig. I) divides the SLC6 family into four subgroups, which fit remarkably well with the known substrate preferences of the SLC6 proteins, and we have hence opted to name the phylogenetic branches accordingly. Our phylogenetic analysis reveals that the human SLC6 repertoire consists of 6 GABA, 3 Monoamine, 4 Amino Acid and 6 Orphan/Amino Acid Transporters II proteins, i.e. 19 SLC6 proteins in total. The bootstrap values that support the four groups are high regardless of phylogenetic method, which shows that the phylogenetic groupings are very stable. Interestingly, the Amino Acid subgroup is phylogenetically closest to the Orphan subgroup, and recently the SLC6A19 gene from the Orphan subgroup has been found to transport neutral amino acids (Kleta et al. 2004; Seow et al. 2004). Hence, considering that the phylogenetic clustering of all other groups follows substrate preference, we speculate in Paper V on the possibility that other Orphan SLC6 transporters could have amino acids as substrates. New research corroborates our speculation. SLC6A15, the Neutral Amino Acid Transporter B0AT2, has been shown to transport leucine, isoleucine, valine, methionine and proline (O'Mara et al. 2006). SLC6A17 has recently been shown to be a vesicular transporter for proline, glycine, leucine and alanine (Parra et al. 2008). SLC6A19, the Neutral Amino Acid Transporter B0AT1, has been shown to transport all neutral amino acids with different affinities (O'Mara et al. 2006). SLC6A20 seems to transport proline as well as hydroxy-proline, N-methylaminoisobutyric acid, betaine and pipecolic acid (Broer 2006). The substrates of SLC6A16 and SLC6A18 have not yet been discovered. SLC6A18 and SLC6A19 form a phylogenetic pair, with SLC6A20 placed basally on the same branch (Paper V, Figure I). SLC6A16 forms a branch together with SLC6A15 and SLC6A17.

SLC6A19 in rat is expressed in several peripheral tissues including intestine, and in mouse this gene seems to be highly expressed in intestine and kidney. On the other hand, our human EST panels indicate that this gene has a very low expression level, as not a single EST is found in any tissue. We also find no evidence of expression of the SLC6A16 protein in the colon or the kidney. The fact that the gene for SLC6A16 is absent in both mouse and rat further complicates the use of rodents as a model for this disease. SLC6A18 is highly expressed in brain, and we can also detect SLC6A20 in certain sub-dissected brain regions of rat.
Additionally, SLC6A15 and SLC6A17, as well as the rodent-specific Xtrp3s1, are expressed in the brain of all species investigated. It is highly likely that SLC6A16 and SLC6A18 also transport amino acids. This is intriguing, as they could have important functions in the brain, such as transporting amino acids used in the synthesis of neurotransmitters or acting as specific reuptake proteins for neurotransmitters. In rat, nine of the 19 SLC6 proteins have high expression in the brain, while an additional three have low expression in the brain. Of the known monoamine transporters for dopamine (SLC6A3), serotonin (SLC6A4) and noradrenalin (SLC6A2), only the serotonin transporter cannot be detected in these large brain regions, although it is likely to be detected in more finely dissected regions. The three phylogenetically related GABA transporters SLC6A1, SLC6A8 and SLC6A6 were found in the CNS, while the other three proteins from the GABA group, SLC6A11, SLC6A12 and SLC6A13, seem to be less highly expressed there and more highly expressed in the periphery. Expression in the periphery of rat GABA transporters has previously been shown for SLC6A12 (Burnham et al. 1996) and SLC6A13 (Borden et al. 1992). In a similar manner, we see high expression of all three monoamine transporters in peripheral tissues in rat, and this is also supported by the EST data from human and mouse. The Amino Acid transporters SLC6A5 and SLC6A9 are expressed in the brain and peripheral tissue, while SLC6A7 and SLC6A14 are expressed only in the periphery.

Our analysis shows that SLC6 genes are present in all bilaterian organisms examined, but in neither plants nor fungi. The repertoire of SLC6A proteins, as estimated from GENSCAN-predicted proteins, is very similar in mouse, human, rat and chicken. However, T. rubripes, D. rerio and C. intestinalis all had two to three times as many transporters in the GABA group as H. sapiens. A. gambiae and D. melanogaster both had the largest number of transporters in the Amino Acid group, while C. elegans had the highest number in the Unclassifiable group. The plant (A. thaliana) and yeast (S. cerevisiae and S. pombe) predicted protein datasets did not contain any SLC6 proteins, while bacteria had a substantial number. A total of 16 GABA, 4 Monoamine, 3 Amino Acid, 3 Orphan and 81 Unclassifiable bacterial proteins were found in the different genomes. The Orphan subgroup seems to have appeared before the divergence of the arthropoda line from the lineage leading to vertebrates. Although these lines do not have an Orphan subgroup, the other three groups, GABA, Monoamine and Amino Acids, are clearly present. Notably, there exist a few Unclassifiable genes in the genomes of most species. For example, there are 6 in Anopheles gambiae, 13 in C. elegans and 81 in bacteria. This is very intriguing, and it is likely that one or more phylogenetic subgroups either appeared specifically in these lineages or existed in parallel with the three subgroups common to vertebrates and invertebrates, but were lost before the appearance of vertebrates.

The following paragraph is based on a very crude database analysis of fragments available on the UniProt web page (http://www.uniprot.org/). Our analysis suggests that SLC6 transporters exist as early as Trichoplax adhaerens (see Table 7). T. adhaerens is the earliest-branching living eumetazoan (a clade comprising all major animal groups except sponges) and lacks organs and a nervous system (Srivastava et al. 2008). Compared to the phylogenetic grouping in Paper V, both a transporter from the Amino Acid group (most similar to SLC6A7) and one from the GABA group (most similar to SLC6A11) seem to be present. The Monoamine group is first present in N. vectensis (SLC6A4). The Orphan/Amino Acid group II does not seem to be present in the species investigated, which corroborates our conclusion that this subgroup appeared after the divergence of the arthropoda line from the lineage leading to vertebrates. Further research is needed to investigate the origin of SLC6 and of its different phylogenetic groups.

Species                          N   Most similar to
Dictyostelium discoideum         -
Naegleria gruberi                -
Phycomyces blakesleeanus         -
Monosiga brevicollis             -
Trichoplax adhaerens             2   SLC6A7 (Proline), SLC6A11 (GABA)
Nematostella vectensis           2   SLC6A4 (Serotonin), SLC6A11 (GABA)
Strongylocentrotus purpuratus    3   SLC6A3 (Dopamine), SLC6A12 (GABA), SLC6A13 (GABA)

Table 7. The species examined, the number of transporters in each species (N) and the best human BLAST hits (http://blast.ncbi.nlm.nih.gov/Blast.cgi). The information was gathered through searches of the UniProt Knowledgebase (UniProtKB) and UniRef50 (http://www.uniprot.org). The UniRef databases provide clustered sets of sequences from the UniProt Knowledgebase. UniRef50 is based on sequences with at least 50% sequence identity.
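The reasoning behind Table 7 can be made explicit with a small lookup. The sketch below is a minimal illustration, not the procedure used in the thesis: it maps each best human BLAST hit to its SLC6 subgroup, using the subgroup membership given in the SLC6 section (for example SLC6A7 among the Amino Acid transporters and SLC6A11 among the GABA transporters), and reports the first species in which each subgroup appears. Treating the row order of Table 7 as the order of branching is an assumption, and all function and variable names are hypothetical.

```python
# Subgroup membership of the 19 human SLC6 proteins, as described in the text.
SUBGROUPS = {
    "GABA": {"SLC6A1", "SLC6A6", "SLC6A8", "SLC6A11", "SLC6A12", "SLC6A13"},
    "Monoamine": {"SLC6A2", "SLC6A3", "SLC6A4"},
    "Amino Acid": {"SLC6A5", "SLC6A7", "SLC6A9", "SLC6A14"},
    "Orphan/Amino Acid II": {f"SLC6A{i}" for i in range(15, 21)},
}

# Best human BLAST hits per species, in the row order of Table 7.
# Assumption: this row order runs from earliest to latest branching.
BEST_HITS = {
    "Dictyostelium discoideum": [],
    "Naegleria gruberi": [],
    "Phycomyces blakesleeanus": [],
    "Monosiga brevicollis": [],
    "Trichoplax adhaerens": ["SLC6A7", "SLC6A11"],
    "Nematostella vectensis": ["SLC6A4", "SLC6A11"],
    "Strongylocentrotus purpuratus": ["SLC6A3", "SLC6A12", "SLC6A13"],
}


def subgroup_of(gene):
    """Return the name of the SLC6 subgroup that contains the given gene."""
    for name, members in SUBGROUPS.items():
        if gene in members:
            return name
    raise KeyError(gene)


def first_appearance(best_hits):
    """Map each subgroup to the first species (in dict order) with a hit in it."""
    first = {name: None for name in SUBGROUPS}
    for species, hits in best_hits.items():
        for gene in hits:
            group = subgroup_of(gene)
            if first[group] is None:
                first[group] = species
    return first


if __name__ == "__main__":
    for group, species in first_appearance(BEST_HITS).items():
        print(f"{group}: {species or 'not found'}")
```

A best BLAST hit is only a rough proxy for subgroup membership, so a lookup like this indicates where a phylogenetic analysis should be done rather than replacing it.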

Diseases

Online Mendelian Inheritance in Man (OMIM™) is a catalog of human genes and genetic disorders. OMIM focuses primarily on inherited, or heritable, genetic diseases (McKusick 1998). The OMIM database is continuously updated and searchable (http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim).

The Solute Carrier families transport a wide range of substrates, so genetic defects in them can be expected to affect the phenotype. One example is the SLC2 (facilitated glucose transporter) family, where SLC2A2 is linked to Diabetes Mellitus Type II (Kusari et al. 1991). Another example is SLC10A2: mutations in this ileal sodium-dependent bile acid transporter cause primary bile acid malabsorption. The phenotype is an idiopathic intestinal disorder associated with congenital diarrhea, steatorrhea, interruption of the enterohepatic circulation of bile acids, and reduced plasma cholesterol levels (Oelkers et al. 1997). Numerous other diseases are also associated with the SLC families.

Gene Family   Disease
SLC1          ATAXIA, ALS
SLC2          DIABETES MELLITUS TYPE 2, GLUCOSE TRANSPORT DEFECT
SLC3          CYSTINURIA
SLC4          RENAL TUBULAR ACIDOSIS, HEMOLYTIC ANEMIA, CORNEAL DYSTROPHY
SLC5          RENAL GLUCOSURIA, GLUCOSE/GALACTOSE MALABSORPTION
SLC6          SEE TABLE 9
SLC7          LYSINURIC PROTEIN INTOLERANCE, CYSTINURIA
SLC9          AUTISM AND EPILEPSY, MENTAL RETARDATION
SLC10         BILE ACID MALABSORPTION
SLC11         ANEMIA, HYPOCHROMIC MICROCYTIC
SLC12         PERIPHERAL NEUROPATHY, GITELMAN SYNDROME, BARTTER SYNDROME
SLC16         ERYTHROCYTE LACTATE TRANSPORTER DEFECT, JUVENILE CATARACT
SLC17         INFANTILE SIALIC ACID STORAGE DISORDER, SIALURIA FINNISH TYPE
SLC19         THIAMINE-RESPONSIVE MEGALOBLASTIC ANEMIA SYNDROME
SLC22         BREAST CANCER, LUNG CANCER, INFLAMMATORY BOWEL DISEASE
SLC25         AMISH TYPE MICROCEPHALY, MITOCHONDRIAL PHOSPHATE CARRIER DEFICIENCY
SLC26         DEAFNESS, PENDRED SYNDROME, ACHONDROGENESIS
SLC30         DIABETES MELLITUS TYPE 1&2
SLC34         HYPOPHOSPHATEMIC OSTEOPOROSIS, PULMONARY ALVEOLAR MICROLITHIASIS
SLC35         CONGENITAL DISORDER OF GLYCOSYLATION
SLC39         ACRODERMATITIS ENTEROPATHICA
SLC40         HEMOCHROMATOSIS
SLC45         OCULOCUTANEOUS ALBINISM

Table 8. The Solute Carrier families associated with diseases according to OMIM (Online Mendelian Inheritance in Man http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim). The diseases listed are examples of diseases in each SLC family; the list is not complete.

SLC6 family members are also known to be important in a number of pathological processes and disorders. Mutations in SLC6A2 (the norepinephrine transporter) cause orthostatic intolerance. SLC6A3 (DAT1), which takes released dopamine back up into presynaptic terminals, has been implicated in human disorders such as Parkinsonism, Tourette syndrome, and substance abuse. The protein encoded by SLC6A4 is the target of important classes of antidepressant drugs: the selective serotonin reuptake inhibitors and the serotonin-noradrenaline reuptake inhibitors. The SLC6A14 gene has recently been shown to encode the β-alanine carrier (Anderson et al. 2008). Two studies have shown that polymorphisms in the SLC6A14 gene are associated with obesity (Suviolahti et al. 2003; Durand et al. 2004). The expression of SLC6A14 has also been shown to be upregulated in carcinoma of the cervix (Gupta et al. 2006), colorectal cancer (Gupta et al. 2005) and inflammatory bowel disease (Eriksson et al. 2008; Eriksson et al. 2008). In Paper V, we showed by EST analysis that SLC6A14 is expressed in the gastrointestinal region in human and mouse, and that it is expressed in fat tissue in rat.

Gene      Disease
SLC6A1    ESSENTIAL HYPERTENSION
SLC6A2    ORTHOSTATIC INTOLERANCE
SLC6A3    CIGARETTE SMOKING, ADHD, TOURETTE SYNDROME, BIPOLAR DISORDER
SLC6A4    ANXIETY-RELATED PERSONALITY TRAITS, BIPOLAR AFFECTIVE DISORDER, MAJOR DEPRESSIVE DISORDER, SEASONAL AFFECTIVE DISORDER, SSRI ANTIDEPRESSANT RESPONSE, ALCOHOLISM, MIGRAINE WITH AURA, SUDDEN INFANT DEATH, PULMONARY HYPERTENSION, OBSESSIVE-COMPULSIVE DISORDER
SLC6A5    HYPEREKPLEXIA
SLC6A8    CREATINE DEFICIENCY SYNDROME
SLC6A14   OBESITY, SUSCEPTIBILITY TO, X-LINKED
SLC6A18   HARTNUP DISORDER
SLC6A19   HARTNUP DISORDER

Table 9. The members of Solute Carrier Family 6 and all their associated diseases according to OMIM (Online Mendelian Inheritance in Man http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim).

The Adhesion GPCRs have been linked to some diseases. For instance, bilateral frontoparietal polymicrogyria (BFPP) can be caused by mutations in the GPR56 gene. The 11 GPR56 mutations identified represent a variety of distinct founder mutations in various populations throughout the world. The clinical symptoms are mental retardation of moderate to severe degree; motor developmental delay; seizures, most commonly symptomatic generalized epilepsy; cerebellar signs, consisting of ataxia; and dysconjugate gaze, presenting variably as esotropia, nystagmus, exotropia, or strabismus (Piao et al. 2005).

Gene     Disease
GPR56    BILATERAL FRONTOPARIETAL POLYMICROGYRIA
GPR126   STATURE QUANTITATIVE TRAIT
VLGR1    FEBRILE CONVULSIONS, USHER SYNDROME

Table 10. The members of the Adhesion family and all their associated diseases according to OMIM (Online Mendelian Inheritance in Man http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim).

Finally, some general conclusions can be drawn about genetic diseases of the GPCR and SLC families. With the recent emphasis on single nucleotide polymorphism (SNP) analysis, many new mutations and diseases will certainly be discovered. This will create a great deal of new work for medical doctors with a special interest in clinical genetics.

Future perspectives

This work has provided a number of human sequences as well as orthologous sequences in other species. The rapidly growing genomic information and the completion of genome sequencing projects for an increasing number of species have increased our possibilities to study membrane genes from a comparative genomic perspective. Several new genomes are likely to become available soon, which will enable further studies of the evolution of membrane-bound proteins and allow us to pinpoint the origin of both the families and their subsequent expansions.

Considering the current status of the particular groups investigated here, several key questions remain to be answered, for example the origin of each of the GRAFS families. Recent findings suggest that the Glutamate and Frizzled families exist as far back as the social amoeba Dictyostelium discoideum, while the Rhodopsins are not found until Trichoplax adhaerens (Nordstrom et al.; unpublished results). Further investigations in additional species are warranted to gain insight into the origin of the different GRAFS families. There are also several lineage-specific GPCR families, such as the chemosensory receptors in nematodes (C. elegans) and Methuselah in insects, that need further investigation before their origin can be understood. The long evolution of the Adhesion GPCR family is very interesting, and many questions also remain within the specific subgroups. For example, Adhesion group II seems to have an unusual evolutionary pattern that differs between the mammalian species.

The evolution of the SLC families is much less studied than that of the GPCRs. One important family is the SLC6 transporters, which seem to exist as early as Trichoplax adhaerens, while for example the Orphan/Amino Acid group II seems to first occur in Arthropoda. The availability of additional genomes will provide further insights into the origin of SLC6 and the relatively late development of the Orphan/Amino Acid group II. Most of the families have not been studied in detail, and it is possible that the number of Solute Carrier families may continue to expand.

A large number of the genes studied here have no clear functional role. This applies, for example, to SLC44 (Choline-like transporter family), SLC45 (Putative sugar transporter family), SLC46 (Heme transporter family) and SLC47 (Multidrug and toxin extrusion). These SLC families in particular may be interesting to study, from both an evolutionary and a functional perspective. The function of most Adhesion GPCRs is yet to be discovered, and studies of knockout models in mouse are likely to provide more function-related information for many of these genes. Human genetics is another way to associate functions with these genes. Understanding the variation of DNA sequence in humans may provide vital information about how human diseases develop, how individuals respond to pathogens and how diseases can be diagnosed. It may also help to develop personalized medicine, which could aid the development of tailored treatment for each patient, with individual doses based on genotype and gene expression. More research on cohorts of patients and comparisons of their genotypes and phenotypes will further this cause. The genotype-phenotype correlation is likely to improve with the rapid cost reduction of whole genome sequencing on an individual basis. When the expression, the function and the genotype-phenotype correlations of genes can be studied simultaneously, it will promote the identification of future drug targets. Several of the new genes presented here may well become very interesting as drug targets.

Acknowledgements

I would like to send my appreciation to all the employees and students at the department of neuroscience.

I would especially like to express my sincere gratitude to my supervisor Professor Helgi B. Schiöth for good support, creative solutions and thrilling beach volleyball games (I think I won most of them), and Dr Robert Fredriksson, my co-supervisor, for always having time for discussion and guidance.

Thanks to David Gloriam, for good co-operation and good friendship.

Þóra Bjarnadottir, for inspiration, discussions and a very nice conference in Japan.

Kalle Nordström, for scientific (and mundane) discussions, help with computer-related matters as well as being my office roommate.

Malin Lagerström, for valuable help in general as well as advice on graphic presentation.

Kia Leao, for discussions about Brazil and advice on illustration.

Marcus Sällman Almén, for interesting interoffice conferences on genomics as well as geopolitics.

Christian Murray, for computer-related matters, including EST search programming in Paper V.

Johan Alsiö, for nice postprandial table tennis games.

Maria Hägglund, Josefin Jacobsson, Tatjana Haitina, Chris Pickering, Smitha Sreedharan and Olga Stephansson, for nice coffee breaks and discussions on science-related matters.

The administrative personnel at the Pharmacology department (especially Ulla), for all the help I received during my PhD years.

Anna-Maria Lundins stipendiefond, for very generous travelling grants.

Smålands nation, where I have forged a lot of friendships and have had an enormous amount of fun.

My close friends: Karl, Magnus and Iman, for just hanging out and great times travelling.

My extended family: Mormor, Farmor, Farfar, cousins, uncles, aunts, in-laws et alii.

Jon, Erik, Olov, Johanna, Ulrika and Mattias, for constant “cousin peer review” and not letting all my success get to my head.

Dr Rainer Hillermann, for proofreading my dissertation thesis and nice scientific excursions in Germany, Italy and Australia.

ODG566, for faithful service.

Arthur and Trillian, for valuable input on the mouse phenotype, especially in Paper V-VI.

Kate, for proofreading Paper VI and the thesis.

Nils, for helping me with computer programming and being my best, although only, brother.

Mom and Dad, for continuous love and support.

References

Adams, M. D., S. E. Celniker, R. A. Holt, C. A. Evans, J. D. Gocayne, P. G. Amanatides, et al. (2000). "The genome sequence of Drosophila melanogaster." Science 287(5461): 2185-95.
AGI (2000). "Analysis of the genome sequence of the flowering plant Arabidopsis thaliana." Nature 408(6814): 796-815.
Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, et al. (1997). "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res 25(17): 3389-402.
Anderson, C. M., V. Ganapathy and D. T. Thwaites (2008). "Human solute carrier SLC6A14 is the β-alanine carrier." J Physiol 586(Pt 17): 4061-7.
Aparicio, S., J. Chapman, E. Stupka, N. Putnam, J. M. Chia, P. Dehal, et al. (2002). "Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes." Science 297(5585): 1301-10.
Baud, V., S. L. Chissoe, E. Viegas-Pequignot, S. Diriong, V. C. N'Guyen, B. A. Roe, et al. (1995). "EMR1, an unusual member in the family of hormone receptors with seven transmembrane segments." Genomics 26(2): 334-44.
Bjarnadóttir, Þ. K. (2006). The Gene Repertoire of G protein-coupled Receptors: New Genes, Phylogeny, and Evolution.
Bjarnadottir, T. K., R. Fredriksson, P. J. Hoglund, D. E. Gloriam, M. C. Lagerstrom and H. B. Schioth (2004). "The human and mouse repertoire of the adhesion family of G-protein-coupled receptors." Genomics 84(1): 23-33.
Bjarnadottir, T. K., R. Fredriksson and H. B. Schioth (2007). "The adhesion GPCRs: a unique family of G protein-coupled receptors with important roles in both central and peripheral tissues." Cell Mol Life Sci 64(16): 2104-19.
Bjarnadottir, T. K., K. Geirardsdottir, M. Ingemansson, M. A. Mirza, R. Fredriksson and H. B. Schioth (2007). "Identification of novel splice variants of Adhesion G protein-coupled receptors." Gene 387(1-2): 38-48.

Boguski, M. S., T. M. Lowe and C. M. Tolstoshev (1993). "dbEST--database for "expressed sequence tags"." Nat Genet 4(4): 332-3.
Bonaldo, M. F., G. Lennon and M. B. Soares (1996). "Normalization and subtraction: two approaches to facilitate gene discovery." Genome Res 6(9): 791-806.
Borden, L. A., K. E. Smith, P. R. Hartig, T. A. Branchek and R. L. Weinshank (1992). "Molecular heterogeneity of the gamma-aminobutyric acid (GABA) transport system. Cloning of two novel high affinity GABA transporters from rat brain." J Biol Chem 267(29): 21098-104.
Brenner, S. E., C. Chothia and T. J. Hubbard (1998). "Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships." Proc Natl Acad Sci U S A 95(11): 6073-8.
Broer, S. (2006). "The SLC6 orphans are forming a family of amino acid transporters." Neurochem Int 48(6-7): 559-67.
Burge, C. and S. Karlin (1997). "Prediction of complete gene structures in human genomic DNA." J Mol Biol 268(1): 78-94.
Burnham, C. E., B. Buerk, C. Schmidt and J. C. Bucuvalas (1996). "A liver-specific isoform of the betaine/GABA transporter in the rat: cDNA sequence and organ distribution." Biochim Biophys Acta 1284(1): 4-8.
Caminschi, I., S. Vandenabeele, M. Sofi, A. J. McKnight, N. Ward, T. C. Brodnicki, et al. (2006). "Gene structure and transcript analysis of the human and mouse EGF-TM7 molecule, FIRE." DNA Seq 17(1): 8-14.
CESC (1998). "Genome sequence of the nematode C. elegans: a platform for investigating biology." Science 282(5396): 2012-8.
Cherezov, V., D. M. Rosenbaum, M. A. Hanson, S. G. Rasmussen, F. S. Thian, T. S. Kobilka, et al. (2007). "High-resolution crystal structure of an engineered human beta2-adrenergic G protein-coupled receptor." Science 318(5854): 1258-65.
Dehal, P., Y. Satou, R. K. Campbell, J. Chapman, B. Degnan, A. De Tomaso, et al. (2002). "The draft genome of Ciona intestinalis: insights into chordate and vertebrate origins." Science 298(5601): 2157-67.
Durand, E., P. Boutin, D. Meyre, M. A. Charles, K. Clement, C. Dina, et al. (2004). "Polymorphisms in the amino acid transporter solute carrier family 6 (neurotransmitter transporter) member 14 gene contribute to polygenic obesity in French Caucasians." Diabetes 53(9): 2483-6.
Eddy, S. R. (1998). "Profile hidden Markov models." Bioinformatics 14(9): 755-63.
Eriksson, A., C. F. Flach, A. Lindgren, E. Kvifors and S. Lange (2008). "Five mucosal transcripts of interest in ulcerative colitis identified by quantitative real-time PCR: a prospective study." BMC Gastroenterol 8: 34.
Eriksson, A., E. Jennische, C. F. Flach, A. Jorge and S. Lange (2008). "Real-time PCR quantification analysis of five mucosal transcripts in patients with Crohn's disease." Eur J Gastroenterol Hepatol 20(4): 290-6.
Felsenstein, J. (1981). "Evolutionary trees from DNA sequences: a maximum likelihood approach." J Mol Evol 17(6): 368-76.
Felsenstein, J. (1989). "PHYLIP - Phylogeny Inference Package (Version 3.2)." Cladistics 5: 164-166.
Felsenstein, J. (2005). PHYLIP (Phylogeny Inference Package). Department of Genome Sciences, University of Washington, Seattle, Distributed by the author.
Felsenstein, J. (2007). PHYLIP (Phylogeny Inference Package) version 3.67. Distributed by the author. Department of Genome Sciences, University of Washington, Seattle.
Fleischmann, R. D., M. D. Adams, O. White, R. A. Clayton, E. F. Kirkness, A. R. Kerlavage, et al. (1995). "Whole-genome random sequencing and assembly of Haemophilus influenzae Rd." Science 269(5223): 496-512.
Flower, D. R. (1999). "Modelling G-protein-coupled receptors for drug design." Biochim Biophys Acta 1422(3): 207-34.
Fredriksson, R., D. E. Gloriam, P. J. Hoglund, M. C. Lagerstrom and H. B. Schioth (2003). "There exist at least 30 human G-protein-coupled receptors with long Ser/Thr-rich N-termini." Biochem Biophys Res Commun 301(3): 725-34.
Fredriksson, R., M. C. Lagerstrom, P. J. Hoglund and H. B. Schioth (2002). "Novel human G protein-coupled receptors with long N-terminals containing GPS domains and Ser/Thr-rich regions." FEBS Lett 531(3): 407-14.
Fredriksson, R., M. C. Lagerstrom, L. G. Lundin and H. B. Schioth (2003). "The G-protein-coupled receptors in the human genome form five main families. Phylogenetic analysis, paralogon groups, and fingerprints." Mol Pharmacol 63(6): 1256-72.

Gibbs, R. A., G. M. Weinstock, M. L. Metzker, D. M. Muzny, E. J. Sodergren, S. Scherer, et al. (2004). "Genome sequence of the Brown Norway rat yields insights into mammalian evolution." Nature 428(6982): 493-521.
Goffeau, A., B. G. Barrell, H. Bussey, R. W. Davis, B. Dujon, H. Feldmann, et al. (1996). "Life with 6000 genes." Science 274(5287): 546, 563-7.
Goodman, A. R., T. Cardozo, R. Abagyan, A. Altmeyer, H. G. Wisniewski and J. Vilcek (1996). "Long pentraxins: an emerging group of proteins with diverse functions." Cytokine Growth Factor Rev 7(2): 191-202.
Gupta, N., S. Miyauchi, R. G. Martindale, A. V. Herdman, R. Podolsky, K. Miyake, et al. (2005). "Upregulation of the amino acid transporter ATB0,+ (SLC6A14) in colorectal cancer and metastasis in humans." Biochim Biophys Acta 1741(1-2): 215-23.
Gupta, N., P. D. Prasad, S. Ghamande, P. Moore-Martin, A. V. Herdman, R. G. Martindale, et al. (2006). "Up-regulation of the amino acid transporter ATB(0,+) (SLC6A14) in carcinoma of the cervix." Gynecol Oncol 100(1): 8-13.
Haitina, T., F. Olsson, O. Stephansson, J. Alsio, E. Roman, T. Ebendal, et al. (2008). "Expression profile of the entire family of Adhesion G protein-coupled receptors in mouse and rat." BMC Neurosci 9: 43.
Hamann, J. (2004). "The EGF-TM7 family of the rat." Immunogenetics 56(9): 679-81.
Harmar, A. J. (2001). "Family-B G-protein-coupled receptors." Genome Biol 2(12): REVIEWS3013.
Hediger, M. A., M. F. Romero, J. B. Peng, A. Rolfs, H. Takanaga and E. A. Bruford (2004). "The ABCs of solute carriers: physiological, pathological and therapeutic implications of human membrane transport proteins. Introduction." Pflugers Arch 447(5): 465-8.
Hill, T. (2007). Development of New Methods for Inferring and Evaluating Phylogenetic Trees. Uppsala, Uppsala University. PhD.
Holt, R. A., G. M. Subramanian, A. Halpern, G. G. Sutton, R. Charlab, D. R. Nusskern, et al. (2002). "The genome sequence of the malaria mosquito Anopheles gambiae." Science 298(5591): 129-49.

Horn, F., E. Bettler, L. Oliveira, F. Campagne, F. E. Cohen and G. Vriend (2003). "GPCRDB information system for G protein-coupled receptors." Nucleic Acids Res 31(1): 294-7.
http://www.icp.ucl.ac.be/~opperd/private/parsimony.html.
ICGSC (2004). "Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution." Nature 432(7018): 695-716.
IHGSC (2004). "Finishing the euchromatic sequence of the human genome." Nature 431(7011): 931-45.
Jack, D. L., I. T. Paulsen and M. H. Saier (2000). "The amino acid/polyamine/organocation (APC) superfamily of transporters specific for amino acids, polyamines and organocations." Microbiology 146 (Pt 8): 1797-814.
Jacobsson, J. A., T. Haitina, J. Lindblom and R. Fredriksson (2007). "Identification of six putative human transporters with structural similarity to the drug transporter SLC22 family." Genomics 90(5): 595-609.
Kafasla, P., D. Bouzarelou, S. Frillingos and V. Sophianopoulou (2007). "The proline permease of Aspergillus nidulans: functional replacement of the native cysteine residues and properties of a cysteine-less transporter." Fungal Genet Biol 44(7): 615-26.
Kamesh, N., G. K. Aradhyam and N. Manoj (2008). "The repertoire of G protein-coupled receptors in the sea squirt Ciona intestinalis." BMC Evol Biol 8: 129.
Kent, W. J. (2002). "BLAT--the BLAST-like alignment tool." Genome Res 12(4): 656-64.
Kidd, J. M., G. M. Cooper, W. F. Donahue, H. S. Hayden, N. Sampas, T. Graves, et al. (2008). "Mapping and sequencing of structural variation from eight human genomes." Nature 453(7191): 56-64.
Kim, M. G., F. A. Flomerfelt, K. N. Lee, C. Chen and R. H. Schwartz (2000). "A putative 12 transmembrane domain cotransporter expressed in thymic cortical epithelial cells." J Immunol 164(6): 3185-92.
Kleta, R., E. Romeo, Z. Ristic, T. Ohura, C. Stuart, M. Arcos-Burgos, et al. (2004). "Mutations in SLC6A19, encoding B0AT1, cause Hartnup disorder." Nat Genet 36(9): 999-1002.
Kolakowski, L. F., Jr. (1994). "GCRDb: a G-protein-coupled receptor database." Receptors Channels 2(1): 1-7.

Kusari, J., U. S. Verma, J. B. Buse, R. R. Henry and J. M. Olefsky (1991). "Analysis of the gene sequences of the receptor and the insulin-sensitive glucose transporter (GLUT-4) in patients with common-type non-insulin-dependent diabetes mellitus." J Clin Invest 88(4): 1323-30.
Kwakkenbos, M. J., E. N. Kop, M. Stacey, M. Matmati, S. Gordon, H. H. Lin, et al. (2004). "The EGF-TM7 family: a postgenomic view." Immunogenetics 55(10): 655-66.
Kwakkenbos, M. J., M. Matmati, O. Madsen, W. Pouwels, Y. Wang, R. E. Bontrop, et al. (2006). "An unusual mode of concerted evolution of the EGF-TM7 receptor chimera EMR2." Faseb J 20(14): 2582-4.
Lagerstrom, M. C., A. R. Hellstrom, D. E. Gloriam, T. P. Larsson, H. B. Schioth and R. Fredriksson (2006). "The G protein-coupled receptor subset of the chicken genome." PLoS Comput Biol 2(6): e54.
Lagerstrom, M. C. and H. B. Schioth (2008). "Structural diversity of G protein-coupled receptors and significance for drug discovery." Nat Rev Drug Discov 7(4): 339-57.
Lander, E. S., L. M. Linton, B. Birren, C. Nusbaum, M. C. Zody, J. Baldwin, et al. (2001). "Initial sequencing and analysis of the human genome." Nature 409(6822): 860-921.
Lindahl, E. and A. Elofsson (2000). "Identification of related proteins on family, superfamily and fold level." J Mol Biol 295(3): 613-25.
Lindblad-Toh, K., C. M. Wade, T. S. Mikkelsen, E. K. Karlsson, D. B. Jaffe, M. Kamal, et al. (2005). "Genome sequence, comparative analysis and haplotype structure of the domestic dog." Nature 438(7069): 803-19.
Liolios, K., K. Mavromatis, N. Tavernarakis and N. C. Kyrpides (2008). "The Genomes On Line Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata." Nucleic Acids Res 36(Database issue): D475-9.
Maxam, A. M. and W. Gilbert (1977). "A new method for sequencing DNA." Proc Natl Acad Sci U S A 74(2): 560-4.
McKusick, V. A. (1998). Mendelian Inheritance in Man. A Catalog of Human Genes and Genetic Disorders. Baltimore, Johns Hopkins University Press.

Nagaraj, S. H., R. B. Gasser and S. Ranganathan (2007). "A hitchhiker's guide to expressed sequence tag (EST) analysis." Brief Bioinform 8(1): 6-21.
O'Mara, M., A. Oakley and S. Broer (2006). "Mechanism and putative structure of B(0)-like neutral amino acid transporters." J Membr Biol 213(2): 111-8.
Oelkers, P., L. C. Kirby, J. E. Heubi and P. A. Dawson (1997). "Primary bile acid malabsorption caused by mutations in the ileal sodium-dependent bile acid transporter gene (SLC10A2)." J Clin Invest 99(8): 1880-7.
Pachter, R. M. a. D. L. a. L. (2006). Why neighbor-joining works.
Park, J., K. Karplus, C. Barrett, R. Hughey, D. Haussler, T. Hubbard, et al. (1998). "Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods." J Mol Biol 284(4): 1201-10.
Parra, L. A., T. B. Baust, S. El Mestikawy, M. Quiroz, B. Hoffman, J. M. Haflett, et al. (2008). "The Orphan Transporter Rxt1/NTT4 (SLC6A17) Functions as a Synaptic Vesicle Amino Acid Vesicular Transporter Selective for Proline, Glycine, Leucine, and Alanine." Mol Pharmacol.
Perry, S. E., P. Robinson, A. Melcher, P. Quirke, H. J. Buhring, G. P. Cook, et al. (2007). "Expression of the CUB domain containing protein 1 (CDCP1) gene in colorectal tumour cells." FEBS Lett 581(6): 1137-42.
Piao, X., B. S. Chang, A. Bodell, K. Woods, B. Benzeev, M. Topcu, et al. (2005). "Genotype-phenotype analysis of human frontoparietal polymicrogyria syndromes." Ann Neurol 58(5): 680-7.
Reid, A. J., C. Yeats and C. A. Orengo (2007). "Methods of remote homology detection can be combined to increase coverage by 10% in the midnight zone." Bioinformatics 23(18): 2353-60.
Sanger, F., S. Nicklen and A. R. Coulson (1977). "DNA sequencing with chain-terminating inhibitors." Proc Natl Acad Sci U S A 74(12): 5463-7.
Scheerer, P., J. H. Park, P. W. Hildebrand, Y. J. Kim, N. Krauss, H. W. Choe, et al. (2008). "Crystal structure of opsin in its G-protein-interacting conformation." Nature 455(7212): 497-502.
Seow, H. F., S. Broer, A. Broer, C. G. Bailey, S. J. Potter, J. A. Cavanaugh, et al. (2004). "Hartnup disorder is caused by mutations in the gene encoding the neutral amino acid transporter SLC6A19." Nat Genet 36(9): 1003-7.

Service, R. F. (2006). "Gene sequencing. The race for the $1000 genome." Science 311(5767): 1544-6.
Simon, D. B., F. E. Karet, J. M. Hamdan, A. DiPietro, S. A. Sanjad and R. P. Lifton (1996). "Bartter's syndrome, hypokalaemic alkalosis with hypercalciuria, is caused by mutations in the Na-K-2Cl cotransporter NKCC2." Nat Genet 13(2): 183-8.
Soding, J. (2005). "Protein homology detection by HMM-HMM comparison." Bioinformatics 21(7): 951-60.
Srivastava, M., E. Begovic, J. Chapman, N. H. Putnam, U. Hellsten, T. Kawashima, et al. (2008). "The Trichoplax genome and the nature of placozoans." Nature 454(7207): 955-60.
Stacey, M., H. H. Lin, S. Gordon and A. J. McKnight (2000). "LNB-TM7, a group of seven-transmembrane proteins related to family-B G-protein-coupled receptors." Trends Biochem Sci 25(6): 284-9.
Stehlik, C., R. Kroismayr, A. Dorfleutner, B. R. Binder and J. Lipp (2004). "VIGR--a novel inducible adhesion family G-protein coupled receptor in endothelial cells." FEBS Lett 569(1-3): 149-55.
Sundberg, B. E., E. Waag, J. A. Jacobsson, O. Stephansson, J. Rumaks, S. Svirskis, et al. (2008). "The Evolutionary History and Tissue Mapping of Amino Acid Transporters Belonging to Solute Carrier Families SLC32, SLC36, and SLC38." J Mol Neurosci 35(2): 179-93.
Suviolahti, E., L. J. Oksanen, M. Ohman, R. M. Cantor, M. Ridderstrale, T. Tuomi, et al. (2003). "The SLC6A14 gene shows evidence of association with obesity." J Clin Invest 112(11): 1762-72.
Terada, T. and K. Inui (2008). "Physiological and pharmacokinetic roles of H+/organic cation antiporters (MATE/SLC47A)." Biochem Pharmacol 75(9): 1689-96.
Thompson, J. D., D. G. Higgins and T. J. Gibson (1994). "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice." Nucleic Acids Res 22(22): 4673-80.
Vandamme, A. (2003). Basic concepts of molecular evolution. The phylogenetic handbook. A practical approach to DNA and protein phylogeny. M. Salemi and A. Vandamme. Cambridge, Cambridge University Press.

Waterston, R. H., K. Lindblad-Toh, E. Birney, J. Rogers, J. F. Abril, P. Agarwal, et al. (2002). "Initial sequencing and comparative analysis of the mouse genome." Nature 420(6915): 520-62.
Watson, J. D. (1990). "The human genome project: past, present, and future." Science 248(4951): 44-9.
Watson, J. D. and F. H. Crick (1953). "Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid." Nature 171(4356): 737-8.
Venter, J. C., M. D. Adams, E. W. Myers, P. W. Li, R. J. Mural, G. G. Sutton, et al. (2001). "The sequence of the human genome." Science 291(5507): 1304-51.
Williams, E. D. (2002). "Genetics and bioethics: the current state of affairs." Synth Philos 17(1): 121-33.
Wolinsky, H. (2007). "The thousand-dollar genome. Genetic brinkmanship or personalized medicine?" EMBO Rep 8(10): 900-3.
Wood, V., R. Gwilliam, M. A. Rajandream, M. Lyne, R. Lyne, A. Stewart, et al. (2002). "The genome sequence of Schizosaccharomyces pombe." Nature 415(6874): 871-80.

Acta Universitatis Upsaliensis
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Medicine 388

Editor: The Dean of the Faculty of Medicine

A doctoral dissertation from the Faculty of Medicine, Uppsala University, is usually a summary of a number of papers. A few copies of the complete dissertation are kept at major Swedish research libraries, while the summary alone is distributed internationally through the series Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Medicine. (Prior to January, 2005, the series was published under the title “Comprehensive Summaries of Uppsala Dissertations from the Faculty of Medicine”.)

Distribution: publications.uu.se
urn:nbn:se:uu:diva-9329

ACTA UNIVERSITATIS UPSALIENSIS
UPPSALA 2008