Topology Prediction of Membrane Proteins: Why, How and When?

Karin Melén

Stockholm University

© Karin Melén, Stockholm 2007

ISBN 91-7155-397-5

Printed in Sweden by Universitetsservice AB, Stockholm 2007

Distributor: Stockholm University Library

To Ture, Otto and Henrik

List of publications

Publications included in this thesis

Paper I Melén K, Krogh A, von Heijne G. (2003) Reliability measures for membrane topology prediction algorithms. J Mol Biol. 327(3):735-44.

Paper II Kim H, Melén K, von Heijne G. (2003) Topology models for 37 Saccharomyces cerevisiae membrane proteins based on C-terminal reporter fusions and predictions. J Biol Chem. 278(12):10208-13.

Paper III Rapp M, Drew D, Daley DO, Nilsson J, Carvalho T, Melén K, de Gier JW, von Heijne G. (2004) Experimentally based topology models for E. coli inner membrane proteins. Protein Sci. 13(4):937-45.

Paper IV Daley DO, Rapp M, Granseth E, Melén K, Drew D, von Heijne G. (2005) Global topology analysis of the Escherichia coli inner membrane proteome. Science. 308(5726):1321-3.

Paper V Kim H*, Melén K*, Österberg M*, von Heijne G. (2006) A global topology map of the Saccharomyces cerevisiae membrane proteome. Proc Natl Acad Sci. 103(30):11142-7.

* These authors contributed equally

Reprints were made with permission from the publishers

Other publications

Österberg M, Kim H, Warringer J, Melén K, Blomberg A, von Heijne G. (2006) Phenotypic effects of overexpression in Saccharomyces cerevisiae. Proc Natl Acad Sci. 103(30):11148-53.

Granseth E, Daley DO, Rapp M, Melén K, von Heijne G. (2005) Experimentally constrained topology models for 51,208 bacterial inner membrane proteins. J Mol Biol. 352(3):489-94.

Laudon H, Hansson EM, Melén K, Bergman A, Farmery MR, Winblad B, Lendahl U, von Heijne G, Näslund J. (2005) A nine-transmembrane domain topology for presenilin 1. J Biol Chem. 280(42):35352-60.

Contents

1 Introduction
1.1 Biological membranes
1.2 Membrane proteins
1.2.1 Peripheral membrane proteins
1.2.2 Integral membrane proteins

2 Membrane protein structure
2.1 Overexpression
2.2 Techniques for structure determination
2.2.1 Electron crystallography
2.2.2 X-ray crystallography
2.2.3 NMR spectroscopy
2.3 Structural models by homology
2.4 Structure prediction

3 Membrane protein topology
3.1 Experimental determination of topology
3.1.1 Reporter genes
3.2 Prediction of topology
3.2.1 Topological determinants
3.2.2 Topology prediction algorithms

4 Summary of papers
4.1 Reliability measures for topology predictions and the use of experimental knowledge (Paper I)
4.2 Topology models for a small number of S. cerevisiae membrane proteins based on C-terminal reporter fusions and predictions (Paper II)
4.3 Topology models for a small number of E. coli membrane proteins and optimization analysis of fusion points (Paper III)
4.4 Large-scale topology analysis of the E. coli and S. cerevisiae membrane proteomes (Papers IV and V)

5 Discussion and future perspectives

Acknowledgements

References

Abbreviations

2D      Two-dimensional
3D      Three-dimensional
Endo H  Endoglycosidase H
ER      Endoplasmic reticulum
GFP     Green fluorescent protein
GO      Gene ontology
GPCR    G protein-coupled receptor
HA      Hemagglutinin
His4    Histidinol dehydrogenase
HMM     Hidden Markov model
MP      Membrane protein
NMR     Nuclear magnetic resonance
NN      Neural network
ORF     Open reading frame
PDB     Protein Data Bank
PhoA    Alkaline phosphatase
SG      Structural genomics
SP      Signal peptide
SRP     Signal recognition particle
SUC2    Invertase (carrying N-glycosylation sites)
TM      Transmembrane

Amino acids

Ala  Alanine
Arg  Arginine
Asn  Asparagine
Asp  Aspartic acid
Cys  Cysteine
Gln  Glutamine
Glu  Glutamic acid
Gly  Glycine
His  Histidine
Ile  Isoleucine
Leu  Leucine
Lys  Lysine
Met  Methionine
Phe  Phenylalanine
Pro  Proline
Ser  Serine
Thr  Threonine
Trp  Tryptophan
Tyr  Tyrosine
Val  Valine

1 Introduction

The cell is the essential unit of life and the fundamental building block of all organisms. Cells are surrounded by membranes, usually a double layer of lipids, which separate them from the outside world. The membrane is a physical barrier that protects the cell from foreign molecules while preventing leakage of internal components and substances. However, a cell must be able to communicate with its surroundings, exchange molecules and adapt to sudden environmental changes. Membrane proteins (MPs) are the key players in these communication processes and are responsible for regulating the permeability of the membranes. They are involved in a wide range of functions, such as transport of ions and water, reception of hormones and other signaling molecules, recognition of “self” versus “non-self”, energy transduction and cell-cell interactions, to mention a few.

The diversity of functions is mirrored in the great variability in the three-dimensional (3D) structure of membrane proteins. Some are integral, with certain parts embedded in the membrane, whereas others are peripheral and bound only to the membrane surface. Determination of the structures would facilitate the assignment of functions, but unfortunately it is very difficult to solve the three-dimensional structure of membrane proteins experimentally. Therefore, alternative approaches to obtaining structural information must be pursued in parallel with traditional structure determination efforts. One way is to use methods to predict, from the amino acid sequence only, which parts of the protein interact with the lipid bilayer and which parts reside outside the membrane. Another is to experimentally locate selected parts of the proteins to the cell interior, the cell exterior or the membrane. In the work presented in this thesis, a strategy of combining theoretical predictions and experimental analyses has been pursued in order to increase our knowledge of the membrane protein universe.

1.1 Biological membranes

The membranes surrounding the cells in all three domains of life (bacteria, archaea and eukaryotes) are called plasma membranes. In eukaryotes there are additional internal membranes that define specific compartments, so-called organelles. Some examples are the endoplasmic reticulum (ER), Golgi network, mitochondria, chloroplasts, nucleus, lysosomes and peroxisomes. Common to all kinds of membranes is that they consist of a mixture of lipids assembled into a bilayer structure. There are different classes of lipids that can be neutral, zwitterionic or negatively charged, of which glyco- and phospholipids are the most common. All lipids share the same main characteristic: they are amphipathic molecules with hydrophilic head groups and hydrophobic hydrocarbon tails. In an aqueous milieu they spontaneously arrange themselves into a double layer with the polar head groups facing the aqueous environment and the hydrophobic tails in each layer pointing inward, thereby avoiding contact with water. The driving force for this formation is the inability of the fatty acid hydrocarbon chains to hydrogen bond to water. The shape of the tails allows van der Waals interactions between neighboring tails. The lengths of the fatty acid chains determine the thickness of the hydrophobic core, which is typically about 30 Å. The polar head group region on each side of the core is around 15 Å thick, which in total gives a bilayer thickness of roughly 60 Å (Fig. 1). The result is a hydrophobic barrier that is impermeable to most molecules (except small uncharged solutes). Therefore the membrane proteins associated with the lipid bilayer are necessary for the transmission of matter or information across the various membranes. The proportion of protein depends on the membrane type, but protein generally makes up about 50% of the membrane mass. The membrane is a dynamic structure with a fairly equal mixture of lipids and proteins, all of which can move laterally as depicted by the fluid mosaic model (Singer and Nicolson, 1972).

[Figure 1 labels: polar head groups, interface region (~15 Å); hydrophobic tails, membrane core (~30 Å); interface region (~15 Å).]

Figure 1. Schematic picture of a lipid bilayer with different kinds of lipids and associated membrane proteins; grey spheres represent the lipid head groups, black sticks represent lipid tails, the light grey cylinder symbolizes an integral membrane protein and the dark grey cylinder symbolizes a peripheral membrane protein.

1.2 Membrane proteins

The biological significance of membrane proteins is reflected in their abundance in a cell. It is estimated that 20-30% of all genes in most organisms code for membrane proteins (Krogh et al., 2001; Wallin and von Heijne, 1998), based on prediction of the main category of membrane proteins (the α-helical class, see below). Taking all classes into account, the figure is even higher. Membrane proteins are also of great pharmaceutical importance, since about half of all drug targets are membrane proteins (Hopkins and Groom, 2002; Russell and Eggleston, 2000). Membrane proteins are classified according to their relationship to the membrane. They can be divided into two broad categories, the peripheral and the integral membrane proteins.

1.2.1 Peripheral membrane proteins

Peripheral membrane proteins are only loosely associated with one side of the membrane. They either interact directly with the polar head groups of the lipids or attach to integral membrane proteins, but they rarely penetrate deeply into the hydrophobic core of the membrane. The peripheral membrane proteins dissociate from the membrane upon treatment with solutions of high ionic strength or elevated pH.

1.2.2 Integral membrane proteins

Integral membrane proteins are more tightly attached to the membranes and require a detergent or an organic solvent to be solubilized. They have one or more polypeptide segments buried in the lipid bilayer. The segments traverse the entire membrane, and the protein must thus have both regions that can exist in a lipid environment and regions that can exist in a polar milieu. The solution is a combination of transmembrane (TM) segments, containing residues with hydrophobic side chains that can interact with the hydrophobic membrane core, and loops with a more hydrophilic character that are in contact with the polar lipid head groups and the surrounding aqueous medium. Integral membrane proteins can be further separated into two distinct classes based on how the transmembrane segments are folded, namely the α-helical bundle class and the β-barrel class. The two architectures have in different ways solved the problem of placing the energetically unfavorable amide and carbonyl groups of the peptide bonds in a hydrophobic environment.

1.2.2.1 The α-helical bundle class

The α-helical membrane proteins are the most frequent type of membrane proteins and are found in nearly all cellular membranes. The α-helices traverse the membrane and are tightly packed into bundles (Fig. 2a). They are composed of mainly hydrophobic residues whose side chains can form van der Waals interactions with the fatty acid chains in the membrane core. All polar amide and carbonyl groups in the backbone are hydrogen bonded internally within the helix. This lowers the cost of transferring polar entities into the hydrocarbon interior and makes the conformation energetically stable. Another stabilizing factor is the enrichment of aromatic residues (Tyr and Trp) near the ends of the transmembrane segments (Killian and von Heijne, 2000), something also recognized in the β-barrel membrane proteins (see below). It is thought that their interaction with the membrane-water interface regions helps position the helices relative to the membrane (de Planque et al., 1999; Yau et al., 1998). The helices are oriented more or less perpendicularly to the membrane plane but are usually slightly tilted. In general, the lengths of the helices approximately match the thickness of the membrane. Among the known three-dimensional structures, the average number of residues in a TM α-helix is commonly estimated to be between 23 and 26 (Bowie, 1997; Cuthbertson et al., 2005; Eyre et al., 2004; Ulmschneider et al., 2005). The deviation in the estimated numbers reflects both the difficulty of precisely identifying the membrane core and the differences in defining where the TM helices actually start and end. There is however a large variation in helix length, and observations from 15 up to 43 residues have been made (Granseth et al., 2005a). Factors that affect the lengths are the tilt angle of a helix and the distortion of the lipid bilayer (Engelman et al., 1986). The exact positioning of the helices and the direction of the side chains are also influenced by the snorkeling effect (hydrophobic side chains tend to point towards the hydrophobic membrane center while polar and charged side chains are oriented towards the polar head groups and the membrane-water interface region) (Granseth et al., 2005a; Monné et al., 1998). There are both single-spanning proteins, where the membrane domain acts as an anchor for a water-soluble domain, and multi-spanning (polytopic) proteins, where two or more α-helices usually are more directly involved in the function of the protein.
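As a rough illustration of why the tilt angle affects helix length, the sketch below estimates how many residues a straight α-helix needs to cross the ~30 Å hydrophobic core. It assumes the textbook rise of 1.5 Å per residue for an ideal α-helix, a value that is not stated in the text above, and the function name is hypothetical. It is a geometric back-of-the-envelope calculation, not a method used in this thesis.

```python
import math

# Assumed textbook value: axial rise per residue in an ideal α-helix, in Å.
RISE_PER_RESIDUE = 1.5


def residues_to_span(core_thickness, tilt_deg=0.0):
    """Residues needed for a straight helix to cross a hydrophobic core.

    core_thickness is given in Å; tilt_deg is the angle between the
    helix axis and the membrane normal.
    """
    # A tilted helix travels a longer path through the same slab.
    path_length = core_thickness / math.cos(math.radians(tilt_deg))
    return math.ceil(path_length / RISE_PER_RESIDUE)


for tilt in (0, 20, 40):
    print(tilt, residues_to_span(30.0, tilt))
```

Under these assumptions an untilted helix needs 20 residues to span 30 Å, and the number grows with the tilt. This is at least consistent with the reported averages of 23 to 26 residues for helices that are tilted and extend somewhat beyond the core.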


Figure 2. Ribbon diagrams showing two examples of integral membrane proteins. The red lines denote the approximate membrane borders. (a) Bacteriorhodopsin (PDB code 2BRD) with 7 transmembrane α-helices, (b) A porin (PDB code 1PRN) with 12 transmembrane β-strands.

1.2.2.2 The β-barrel class

The β-barrel proteins are present in the outer membranes of Gram-negative bacteria, mitochondria and chloroplasts. The membrane-spanning parts are composed of an even number of antiparallel β-strands (between 8 and 22 strands in proteins of known three-dimensional structure) that are tilted ~45° relative to the membrane normal (Lomize et al., 2006; Schulz, 2000). The strands form a cylindrical pore where the first and last strands meet to close the barrel (Fig. 2b). This arrangement enables the backbone amide and carbonyl groups to hydrogen bond laterally with the neighboring strands. The residues in the β-strands alternately face the lipid bilayer and the inside of the barrel. Side chains oriented towards the fatty acid chains are typically hydrophobic, whereas side chains oriented towards the barrel interior are more polar on average. The outcome is a polar channel through which water-soluble molecules can cross. The residues flanking the β-strands are often aromatic (Tyr and Trp) and are in contact with the lipid head groups, which is believed to stabilize the structure (Yau et al., 1998). The loops joining the strands are predominantly composed of polar residues and usually form short turns on the periplasmic side and longer loops on the extracellular side (Schulz, 2000).

It is more difficult to predict the fraction of proteins in a genome belonging to the β-barrel class than to the α-helical bundle class, because the former group has less hydrophobic motifs and larger length variations. A few attempts have been made, however, and the proportion of β-barrel encoding genes in the genome of the Gram-negative bacterium Escherichia coli is estimated to be 2-3% (Garrow et al., 2005; Wimley, 2003).


In this thesis I have only focused on the α-helical class of integral membrane proteins and will from now on refer to them as simply “membrane proteins”.

2 Membrane protein structure

The 3D structure of a protein is implicitly related to the function of the protein (Laskowski et al., 2003), but it is not always straightforward to infer function from structure. There are cases where proteins with similar structures have different functions (Bartlett et al., 2003), and if a protein represents a new fold (i.e. resembles no previously solved structure) it might be hard to assign the function (Skolnick et al., 2000). Nevertheless, a good way to start studying the function of a protein is to determine its 3D structure. This is a demanding procedure for any protein but has turned out to be considerably more difficult for membrane proteins than for globular proteins. The issue is illustrated by the fact that only slightly more than 100 structures of membrane proteins have been solved (http://blanco.biomol.uci.edu/Membrane_Proteins_xtal.html), in contrast to the nearly 40 000 solved structures of globular proteins deposited in the Protein Data Bank (PDB) (Berman et al., 2000) (http://www.rcsb.org/pdb). The fraction is hence less than 1%, which should be compared to the observation that between 20 and 30% of all proteins are membrane proteins (Krogh et al., 2001; Wallin and von Heijne, 1998). The huge gap is not expected to be filled in the foreseeable future, even though there has been an exponential growth in the number of membrane protein structures solved in recent years (White, 2004).

Why are membrane proteins so challenging? There are several explanations, but the major reason is the interaction with the membrane lipids that is necessary for correct folding. Without the amphipathic lipid molecules a membrane protein does not fold into its native structure. This has many implications, some of which I will go through briefly in the following sections.

2.1 Overexpression

A prerequisite for structural determination is a reasonable amount of purified protein, typically several milligrams. Most membrane proteins do not reach this level. Either the proteins are degraded by proteases (Wagner et al., 2006) or the overexpression leads to aggregation of proteins and formation of inclusion bodies from which it is difficult to isolate the protein (Rogl et al., 1998). It seems as if the membrane assembly machinery (including the SRP, translocon and chaperones) cannot handle too large quantities, so that the proteins are not inserted correctly or become misfolded (Drew et al., 2003). Moreover, the balance between lipid and protein composition in the membrane is disturbed, which also might be a limiting factor for the protein insertion capacity. As an illustration, it has been shown in E. coli that proliferation of extra intracellular membranes increased the yield of overexpressed membrane proteins (Arechaga et al., 2000). Finally, overexpression would ideally be performed in endogenous hosts, but this is often not possible, especially not for mammalian proteins. A common host for both prokaryotic and eukaryotic membrane proteins is E. coli, but since the components of the assembly machinery vary between species it is likely that the pathways are far from optimal. Another disadvantage of using bacterial expression systems for eukaryotic membrane proteins is the inability of bacteria to perform post-translational modifications that may be crucial for function. The greater the evolutionary distance between host and protein origin, the higher the risk of expression problems. Most mammalian membrane proteins require a eukaryotic host cell for overexpression (Tate, 2001) and in many cases even insect or mammalian host cells are needed (Massotte, 2003).

2.2 Techniques for structure determination

Even if the overexpression of membrane proteins is successful, solubilization and purification of the proteins have to be carried out before further analyses. Since membrane proteins contain both hydrophilic and hydrophobic parts they are not soluble in water and can therefore not easily be released from the lipids in the membrane. Detergents, i.e. amphipathic compounds possessing both polar and nonpolar regions, are necessary for the solubilization. It is however not trivial to determine how much detergent is needed, which detergent to use, or whether a combination of different detergents and lipids is required. The conditions are individual for each membrane protein and have to be optimized by trial and error.

A general strategy for purifying overexpressed proteins is to utilize constructs where the target genes are fused to an affinity tag, either at the amino- or the carboxy-terminus of the protein. The tag enables the proteins to be recovered efficiently (Walian et al., 2004).

There are a number of experimental techniques for three-dimensional structure determination. The classical methods commonly used for globular proteins are X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy. The difficulties in working with membrane proteins have led to the development of alternative methods for obtaining structural data, such as electron crystallography, single-particle cryo-electron microscopy (cryo-EM) and atomic force microscopy (AFM). In some cases the resulting structures are of lower resolution than those achieved by the standard techniques, but they still yield valuable and complementary information (Torres et al., 2003). Below follows a description of the three main techniques used for structure determination of membrane proteins.

2.2.1 Electron crystallography

The pioneering work of solving the very first membrane protein structure was done using electron crystallography. It was performed in 1975 by Henderson and Unwin. They studied bacteriorhodopsin from Halobacterium halobium and managed to get a low-resolution structure (Henderson and Unwin, 1975), from which it was possible to produce a model consisting of 7 transmembrane segments arranged almost perpendicularly to the membrane. The methodology has been continually improved, and there are now examples of structures solved at near-atomic resolution (Henderson et al., 1990; Torres et al., 2003). The main benefits of the method are that membrane proteins can be reconstituted in lipid bilayers resembling their natural environment and that the proteins often organize themselves into well-ordered two-dimensional (2D) crystals.

2.2.2 X-ray crystallography

The rate-limiting obstacle in X-ray crystallography is to obtain diffracting 3D crystals of high quality. Well-ordered 3D crystals are more difficult to grow than 2D crystals, and again, it is the amphipathic nature of the protein that causes the trouble. The protein can get stuck in aggregates on the way from soluble protein-detergent complex to crystal (Caffrey, 2003). But once a high-quality crystal is obtained, standard X-ray diffraction analyses can be applied. To date, the majority (> 80%) of the known 3D structures for both globular and membrane proteins have been solved by X-ray crystallography (http://www.rcsb.org/pdb). During recent years, new methods have been developed for improving 3D crystal production, and it seems as if X-ray crystallography will continue to be the most used method for structural determination, at least in the near future (Caffrey, 2003).

2.2.3 NMR spectroscopy

The main advantage of NMR spectroscopy is that there is no need for crystals. On the other hand, the technique is limited to analyses of small proteins, typically < 35 kDa, although advances in pushing this size limit have been reported (Kainosho et al., 2006; Yu, 1999). There are different types of NMR, of which solution NMR has been widely used for globular proteins. The technique requires that the molecule studied tumbles quickly on the NMR time scale in order to produce sharp resonance bands. The larger the molecule, the more slowly it will tumble and the harder it will be to obtain sharp NMR resonances (Torres et al., 2003). Membrane proteins, which need to be surrounded by an environment that mimics the membrane, such as a detergent micelle, often become too large and tumble too slowly to be studied successfully and thus are even more size-restricted than globular proteins. The problem of slow tumbling can however be bypassed by solid-state NMR, where larger molecules can be analyzed. The membrane protein is reconstituted in lipid bilayers instead of a detergent micelle and the protein is immobilized by the environment (Opella et al., 2002). Although the NMR technique contributes fewer structures than X-ray crystallography, it has been shown that for small proteins the methods are complementary with low redundancy (Yee et al., 2005). Therefore, it is suggested that the most effective way to obtain new structures is to continue to use both methods in parallel.

2.3 Structural models by homology

Although technical progress is made continuously, it is not feasible to experimentally determine structures for every known protein. It would take too much effort in terms of both cost and labor. Fortunately one can take advantage of the evolutionary relationship between sequence and structure. Several studies have concluded that protein sequences sharing at least 30% sequence identity are likely to have similar structures (Bradley et al., 2005; Rost and Sander, 1993; Vitkup et al., 2001). Therefore, a way to obtain a structural model for a protein without solving the structure explicitly is to find a homolog for which the structure is known. The homolog can be used as a template in homology modeling methods, and accordingly one can attain a reasonably correct structure for the target protein. The evolutionary relationship between the modeled protein and the template influences the accuracy. The higher the sequence identity, the better the alignment will be and hence the likelihood of a correct structural model increases.

In view of these facts, a number of so-called structural genomics (SG) initiatives were started in 2000 (Thornton, 2001). The goal of SG is to provide a structural representative for each protein family and to use computational homology modeling and fold recognition to obtain structural models for all related sequences in the protein families (Brenner, 2001; Todd et al., 2005). In order to cover the whole protein family space, the targets for experimental structure determination have to be selected carefully. The selection should in general focus on targets from protein families without members of known structure and on targets from large protein families where one structure might not be enough to capture the diversity of functions (Marsden et al., 2006). Following this strategy it has been predicted that 25,000 new structures are needed to achieve 80% structural coverage of all proteins in the first 1,000 sequenced genomes (Yan and Moult, 2005), something that is expected to be achievable within the next decade.

When it comes to membrane proteins the figures have to be recalculated for a number of reasons: i) the fraction of solved MP structures is much lower than the fraction of MPs present in the proteomes (see chapter 2), ii) the structure determination process is slower for membrane proteins than for globular proteins (Torres et al., 2003) and iii) the lipid bilayer imposes physical and chemical constraints that restrict the structural diversity of the transmembrane domains in the membrane proteins (Bowie, 1997; Ubarretxena-Belandia and Engelman, 2001). Yet it seems as if modeled structures of membrane proteins can reach the same level of accuracy as those of globular proteins if the sequence identity between the target protein and the template is 30% or higher (Forrest et al., 2006). The difference in success rate might instead be due to the scarcity of membrane protein structures, which reduces the probability of finding a homolog of known structure and thereby obtaining a correct alignment between the target and template. According to a recent study (Oberai et al., 2006) the most dominant membrane protein families are already present in sequence databases today, something which seems not to be the case for globular proteins. This suggests that the universe of membrane protein families is much more structurally limited than that of globular protein families. Oberai et al. estimated that if 80% of the MP sequence space should be covered by structural representatives, about 700 families are required, which corresponds to about 300 folds since different protein families can have the same fold (Grant et al., 2004). They further estimated that this could be achieved by the year 2020 if structure determination continues at the pace modeled by White (2004).
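The 30% sequence identity rule of thumb mentioned above can be made concrete with a small sketch. The code below computes percent identity over the aligned, ungapped columns of a pairwise alignment and compares it to a threshold. The function names, the choice of denominator (other conventions exist) and the toy sequences are assumptions made for illustration only; the alignment itself is taken as given rather than computed.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def is_plausible_template(aligned_target, aligned_template, threshold=30.0):
    """Apply the rule-of-thumb identity threshold to a target/template pair."""
    return percent_identity(aligned_target, aligned_template) >= threshold


# Toy alignment (hypothetical sequences): 4 identical of 5 compared columns.
print(percent_identity("ACDE-G", "ACQEFG"))
```

Passing the threshold only indicates that a structure is a candidate template; as noted above, the accuracy of the final model still depends on the quality of the alignment.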

2.4 Structure prediction

Considering the moderate speed of structural determination of proteins it is tempting to try to predict the 3D structure ab initio. Anfinsen stated in 1973 that it is the chemical properties of the amino acid sequence that determine the tertiary structure of a protein (Anfinsen, 1973). This idea is the central postulate on which protein structure prediction is based. Another fundamental principle is that the native state of a protein is assumed to be at the global minimum of the free energy. Ab initio structure prediction has turned out to be a very complex task, mainly due to the search for low-energy states in the conformational space. There is however progress in the field and successful structure prediction for small globular proteins has been achieved (Bradley et al., 2005).

Membrane protein structure prediction could benefit from the structural constraints imposed by the lipid bilayer. The membrane environment reduces the degrees of freedom for an embedded protein, and hence its structure should be easier to predict than the structure of a globular protein (Simon et al., 2001). To a first approximation, the prediction boils down to the identification of the transmembrane helices and the optimization of their packing. And indeed, there are examples where structures have been successfully predicted for parts of membrane proteins (Yarov-Yarovoy et al., 2006). On the other hand, membrane proteins are often larger than the small globular proteins for which ab initio prediction has done well, which makes the computational calculations for membrane proteins more complicated “in the end” (Fleishman and Ben-Tal, 2006). Alternative routes for gaining structural insights have to be taken until 3D prediction is more satisfactory. One way is to reduce the dimensionality and focus on the topology, as described in the next chapter.

3 Membrane protein topology

If the three-dimensional structure of a membrane protein is not obtainable, neither by experimental techniques nor by theoretical predictions, what would be the second-best approach to gain structural insight and functional information? There are many answers to this question, but one commonly agreed on is to determine the topology of the membrane protein (Jones, 2007; von Heijne, 2006). The topology can be seen as a 2D representation of the protein, but it has also been referred to as a “low-resolution structure” (Kernytsky and Rost, 2003). A general definition of the topology is the specification of the number of transmembrane helices and the orientation of the protein relative to the lipid bilayer (Fig. 3). Although the spatial arrangement and interactions of the helices are not taken into account, it provides information on which side of the membrane the protein starts and ends as well as where the loops and helices are located, all of which can be used for both functional and structural classification of the protein (von Heijne, 2006).

[Figure 3 labels: N-terminus (N) in the non-cytoplasm; membrane; C-terminus (C) in the cytoplasm.]

Figure 3. A topology map of an integral membrane protein with 7 transmembrane helices. The N-terminus is located in the extracytoplasmic space (commonly referred to as ‘outside’), the C-terminus is located in the cytoplasm (commonly referred to as ‘inside’) and there are short loops connecting two succeeding helices on the in- and outside respectively.
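The definition of topology given above (the number of transmembrane helices plus the orientation relative to the bilayer) maps naturally onto a small data structure. The following sketch is one possible representation, not one taken from this thesis; the class name, field names and residue coordinates are hypothetical. It derives the side of every loop and of the C-terminus from the N-terminal location alone, using the fact that each helix crosses the membrane exactly once.

```python
from dataclasses import dataclass
from typing import List, Tuple

OTHER_SIDE = {"in": "out", "out": "in"}


@dataclass
class Topology:
    length: int                     # number of residues in the protein
    n_term: str                     # "in" (cytoplasm) or "out" (non-cytoplasm)
    helices: List[Tuple[int, int]]  # 1-based inclusive (start, end) per TM helix

    def loops(self):
        """Return (start, end, side) for each extramembranous region, N to C."""
        side = self.n_term
        position = 1
        regions = []
        for start, end in self.helices:
            if start > position:
                regions.append((position, start - 1, side))
            side = OTHER_SIDE[side]  # each helix crosses the membrane once
            position = end + 1
        if position <= self.length:
            regions.append((position, self.length, side))
        return regions

    def c_term(self):
        """Side of the C-terminus; an odd helix count flips it."""
        if len(self.helices) % 2 == 0:
            return self.n_term
        return OTHER_SIDE[self.n_term]


# Hypothetical 7-helix protein with the N-terminus outside, as in Figure 3.
example = Topology(
    length=250,
    n_term="out",
    helices=[(10, 30), (40, 60), (75, 95), (105, 125),
             (135, 155), (170, 190), (200, 220)],
)
print(example.c_term())
print(example.loops())
```

For this example the C-terminus comes out on the inside, matching the arrangement drawn in Figure 3, and the loops alternate between the two sides.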

3.1 Experimental determination of topology

There are a number of different techniques available for topology determination (van Geest and Lolkema, 2000). Since the membrane-spanning parts can be identified with relative ease due to their hydrophobic character, the topology determination techniques have in general focused on correctly localizing the loops to the cytoplasmic or the extracytoplasmic side of the membrane (Traxler et al., 1993). Some methods use biochemical agents or antibodies that have access to only one side of the membrane, e.g. cysteine labeling (Kimura et al., 1997), glycosylation mapping (Chang et al., 1994), epitope mapping or antibody binding (Canfield and Levenson, 1993). Other methods use a reporter gene that is fused to a hydrophilic domain of the membrane protein of interest and whose product or enzymatic activity can disclose on which side of the membrane it resides (Manoil, 1991). In this thesis, different gene fusion techniques have been employed for topology mapping. They will be described more thoroughly below.

3.1.1 Reporter genes

Reporter genes usually code for proteins with enzymatic activity that becomes manifest only on one but not the other side of a membrane. Activity assays on reporter gene fusions can therefore be used to determine the inside/outside location of different parts of a membrane protein. A reporter gene can be fused to various sites and in order to do a full topology mapping, fusion constructs for each extramembraneous region, including the N and C termini, should be made. The constructs can either be designed to contain truncated versions of the membrane protein fused to the reporter gene in the C-terminal end or designed in a “sandwich” manner where the reporter gene is inserted into a loop region and accordingly the membrane protein is always full-length (Ehrmann et al., 1990; van Geest and Lolkema, 2000). It has been discussed to what extent a large reporter moiety affects the folding and insertion into the membrane, in particular if it is supposed to be translocated to the extracytoplasmic side. It could also be suspected to affect the long-range interactions of different parts of the proteins. But so far, several studies have shown that the influence of the reporter gene is marginal and that membrane proteins mostly retain their native topology (Kim et al., 2005; Traxler et al., 1993; van Geest and Lolkema, 2000).

The choice of reporter gene depends on which organism the proteins are to be expressed in. In E. coli (and bacteria in general) the membrane proteins are inserted into the plasma membrane and in Saccharomyces cerevisiae (and eukaryotes in general) most membrane proteins are inserted into the ER membrane (at least initially). Protein parts facing the “inside” are in both systems located in the cytoplasm while parts facing the “outside” are located in the periplasm and ER lumen, respectively. Thus the reporter proteins are placed in different environments. For an in-depth review of different reporter genes, see (van Geest and Lolkema, 2000).

Observing enzyme activity is usually correctly interpreted as having the reporter protein at the specific side of the membrane where the reaction can take place. But if there is no measurable activity, is that an indication of opposite extramembraneous location or is it due to misfolding of the protein or failure of correct insertion into the membrane? To be able to unambiguously assign a part of the protein to be located in the cytoplasm or non-cytoplasm, the experimental analysis benefits from including two complementary reporter proteins showing activities at opposite sides of the membrane.

Below follows a description of reporters used in this thesis, two expressed in E. coli, alkaline phosphatase (PhoA) and green fluorescent protein (GFP), and two expressed in S. cerevisiae, histidinol dehydrogenase (His4) and an invertase carrying a number of glycosylation sites (SUC2).

3.1.1.1 PhoA

One of the first reporter genes used for topology mapping was alkaline phosphatase (PhoA) (Manoil and Beckwith, 1986). It is normally expressed in bacteria and transported to the periplasmic space where it can catalyse the hydrolysis of phosphate groups from different molecules. The enzymatic activity depends on the formation of an essential cysteine disulfide bridge within the protein, which can only take place in the periplasm. If PhoA stays in the cytoplasm it cannot fold properly and is therefore not active. To detect the location of PhoA, a substrate that changes color upon hydrolysis is added to the bacterial culture medium and if a color change is observed PhoA must be located in the periplasm.

3.1.1.2 GFP

Green fluorescent protein (GFP) from the jellyfish Aequorea victoria can be expressed in bacteria and used as a topology reporter as it is only active in the cytoplasm but incorrectly folded and thus inactive in the periplasm (Feilmeier et al., 2000). The active form is fluorescent after UV illumination, so detection of fluorescence indicates a cytoplasmic location. GFP is a suitable complement to PhoA and the combination of the two reporters has been used in several studies (Drew et al., 2002, paper III and IV).
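The logic of reading out a complementary reporter pair can be sketched in a few lines of Python. This is only an illustration: the function name, the boolean inputs and the returned labels are assumptions made for the example, and in practice activities are quantitative and need thresholds chosen for the assay at hand.

```python
def assign_location(phoa_active: bool, gfp_active: bool) -> str:
    """Interpret a PhoA/GFP fusion pair made at the same site (hypothetical helper).

    PhoA is active only in the periplasm, GFP only in the cytoplasm, so the
    two reporters should give opposite signals for a well-behaved fusion site.
    """
    if phoa_active and not gfp_active:
        return "periplasm"
    if gfp_active and not phoa_active:
        return "cytoplasm"
    # Both active or both inactive: misfolding, failed insertion or mixed
    # topologies cannot be excluded, so no assignment is made.
    return "ambiguous"


print(assign_location(phoa_active=True, gfp_active=False))
print(assign_location(phoa_active=False, gfp_active=False))
```

The "ambiguous" branch is the reason for using two reporters: a single inactive reporter cannot distinguish the opposite location from a misfolded or non-inserted fusion.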

3.1.1.3 His4C

His4C is a truncated version of the yeast enzyme histidinol dehydrogenase (His4). His4C retains the enzymatic activity of the full-length protein and can convert histidinol to the essential amino acid histidine (Deak and Wolf, 2001). Cells containing a mutated nonfunctional his4 gene cannot grow on media lacking histidine but containing histidinol. However, if the reporter His4C is fused to a membrane protein and stays in the cytoplasm, it metabolizes histidinol to histidine and the his4 cells survive. Growth on the selection media is therefore an indicator of cytoplasmic localization. When the His4C part is translocated to the ER lumen, it cannot act on its substrate and the his4 cells cannot grow.

3.1.1.4 SUC2

SUC2 codes for invertase, an enzyme that catalyses the hydrolysis of different sugar molecules. It is however not its enzymatic capacity that carries the reporter potential. Instead it is the presence of several N-linked glycosylation acceptor sites (Taussig and Carlson, 1983) that is utilized (Deak and Wolf, 2001). N-linked glycosylation of proteins in eukaryotes takes place in the ER lumen. Glycan molecules are attached to the asparagine (N) residue in the tripeptide Asn-X-Ser/Thr, where X can be any amino acid except proline. A fusion construct with the Suc2 part facing the ER lumen will be heavily glycosylated. This causes an increase in molecular weight (2-3 kDa/glycan), which can be detected by a shift in size of the protein upon endoglycosidase H (Endo H) treatment. In cases where there is no change in molecular weight of the fusion protein with and without Endo H treatment, the Suc2 part has remained unglycosylated and likely resides in the cytoplasm. Since His4C and Suc2 have opposite reporter profiles they constitute a suitable reporter pair for topology mapping (Deak and Wolf, 2001, paper II and V).
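The acceptor tripeptide Asn-X-Ser/Thr (X not Pro) is simple enough to locate with a regular expression. The sketch below is an illustration only; the function name and the example sequence are made up, and a matching tripeptide is merely a potential site, since actual glycosylation also requires exposure to the ER lumen.

```python
import re

# Asn, then any residue except Pro, then Ser or Thr. The lookahead keeps the
# match one residue wide so that overlapping sites (e.g. "NNTS") are all found.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(seq: str) -> list[int]:
    """Return 1-based positions of Asn residues in potential N-glycosylation sites."""
    return [m.start() + 1 for m in SEQUON.finditer(seq.upper())]


# Hypothetical example sequence: N2 (NGS), N9 (NNT) and N10 (NTS) match,
# whereas N6 is followed by Pro and is skipped.
print(find_sequons("MNGSANPTNNTS"))
```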

3.2 Prediction of topology

Despite the wide variety and high confidence of experimental techniques for topology determination, it is practically demanding to accomplish full topology mapping of membrane proteins. Most proteins are polytopic and contain several transmembrane helices, meaning that a large number of constructs are required to map a detailed topology. Full mapping is therefore not practical if proteome-wide analyses in high-throughput mode are intended. Consequently the contributions to topology information from theoretical and computational calculations are of great importance.

Membrane protein folding is thought to follow a two-stage process (Popot and Engelman, 1990). This model proposes that the individual helices initially are formed and inserted independently as stable domains (the first stage), followed by the interaction and assembly of the helices to produce the final structure (the second stage). The model of folding has later been expanded to a four-stage model (White and Wimley, 1999) to also account for helix partitioning and folding in the water phase and the membrane interface region, but the main stages (insertion and association) are the same as in the two-stage model. Although it is an obvious simplification of the folding process, it is a widely accepted model and seems to hold for most cases (Chamberlain et al., 2003). Since the helix formation and insertion process is assumed to be separate from the helix packing step, topology prediction is an important first step in any full structure prediction method.

Membrane proteins possess certain attributes that make them additionally well-suited for sequence-based topology prediction. As mentioned in chapter 2.4, the lipid bilayer imposes constraints on the protein architecture. The protein zigzags back and forth across the membrane and the amino acid distribution in different regions reflects the very heterogeneous environments that the membrane and extramembraneous compartments present. Topology prediction algorithms try to capture those differences. There are two predominating properties that are fundamental and on which the prediction algorithms rely: the long stretches of hydrophobic amino acids forming the α-helices and the uneven distribution of positively charged amino acids in loops flanking the α-helices. The following sections will describe the major factors contributing to the topology in more detail.

3.2.1 Topological determinants

3.2.1.1 Hydrophobicity

The presence of mainly hydrophobic amino acids in the α-helices is the most characteristic feature of the transmembrane domains. The dominating residues are the hydrophobic amino acids Ala, Ile, Leu and Val that together account for 45% of the TM residues (Ulmschneider et al., 2005). The hydrophobic effect and intrahelical hydrogen bonding make the α-helical conformation stable and compensate for the cost of transferring polar peptide bonds into the membrane. Intuitively, polar and charged amino acids would be expected to be absent from the hydrocarbon core due to energetic concerns. Nevertheless, those residues are found occasionally and it has been concluded that they are required for the structure or function (Ubarretxena-Belandia and Engelman, 2001), for example by binding of prosthetic groups or to mediate proton or electron transport. It has furthermore been shown that polar residues are less frequently mutated in TM domains as compared to in extramembraneous regions or globular proteins (Jones et al., 1994a; Ubarretxena-Belandia and Engelman, 2001) which also indicates structural and functional importance.

In multi-spanning MPs not all residues face the lipids. Helix-helix association creates environments where the side chains are buried within the protein interior and can interact without lipid contact. Buried residues are moreover found to be more conserved than residues facing the lipids (Eyre et al., 2004), which again suggests that they are important for the preservation of structure and function. It is even believed that the interhelical interactions between polar amino acids are one of the driving forces for helix packing and protein folding (Popot and Engelman, 2000). Moreover, amino acids with small side chains like Ala and Gly have a preference for helix-helix interfaces since they allow the helices to pack tightly. Glycine is also the core of the GxxxG motif known to stabilize helix-helix interaction and oligomerization (Senes et al., 2000). In contrast, hydrophobic amino acids with large bulky side chains like Phe, Leu, Ile and Val are more exposed on the helix surfaces (Eilers et al., 2000; Ulmschneider et al., 2005).

As a consequence of the relaxed hydrophobicity requirement for residues involved in helix packing, helices in multi-spanning MPs are on average less hydrophobic than helices in single-spanning MPs (Jones et al., 1994a). Moreover, charged residues in transmembrane segments do not necessarily need to be completely buried; they can still be neutralized by forming intrahelical hydrogen bonds, i.e. interacting with residues on the same helix, and thus be exposed to lipid tails without being energetically unfavorable (Eyre et al., 2004), or the charged side chains can snorkel out to the interface region and thereby escape the lipid environment, see below.

3.2.1.2 Helix length

A transmembrane helix has to be sufficiently long to traverse the lipid bilayer. It should at least span through the 30 Å thick hydrophobic core, something that is attained if it is ~20 residues long since each residue adds ~1.5 Å to the length of the helix. The average helix length is estimated to be around 26 residues (Bowie, 1997; Granseth et al., 2005a; Ulmschneider et al., 2005), meaning that the helices often protrude into the membrane interface region. The helices are usually tilted around 21-24 degrees relative to the membrane normal (Bowie, 1997; Ulmschneider et al., 2005) so that longer helices can fit nicely into the membrane core. There is also flexibility among the lipids which allows a certain degree of bilayer distortion (Engelman et al., 1986) or accumulation of lipids with suitable lengths of the hydrocarbon tails around the TM helices to avoid “hydrophobic mismatch” (Chamberlain et al., 2003).

An additional factor that does not affect the helix length, but rather has influence on the precise helix positioning and directions of the side chains is the snorkeling effect. Residues flanking the helices tend to point toward the membrane core if they are hydrophobic whereas polar and charged residues instead are oriented towards the interface region (Granseth et al., 2005a; Monné et al., 1998).
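The numbers quoted above can be checked with a short calculation. The sketch below uses only the values given in the text (1.5 Å rise per residue, a 30 Å hydrophobic core, 26 residues, tilts of 21-24 degrees); the function names are invented for the example, and the helix is idealized as a straight rod.

```python
import math

RISE_PER_RESIDUE = 1.5   # Å added per residue along the helix axis (from the text)
CORE_THICKNESS = 30.0    # Å, thickness of the hydrophobic core (from the text)


def min_residues(thickness: float = CORE_THICKNESS, tilt_deg: float = 0.0) -> int:
    """Smallest number of residues whose helix spans `thickness` at a given tilt."""
    span_per_residue = RISE_PER_RESIDUE * math.cos(math.radians(tilt_deg))
    return math.ceil(thickness / span_per_residue)


def vertical_span(n_residues: int, tilt_deg: float = 0.0) -> float:
    """Distance covered along the membrane normal by a straight, tilted helix."""
    return n_residues * RISE_PER_RESIDUE * math.cos(math.radians(tilt_deg))


print(min_residues())                     # untilted helix: 20 residues
print(round(vertical_span(26, 24.0), 1))  # average-length helix at 24 degrees tilt
```

Even at a tilt of 24 degrees, a 26-residue helix covers more than the 30 Å core along the membrane normal, which is consistent with helices often protruding into the interface region.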

3.2.1.3 Membrane-water interface region

There is no exact definition of the membrane-water interface region but in general it is located ±15-25 Å away from the membrane center (see Fig. 1). It is here that most TM helices end and the polypeptide chains continue as loops of either irregular secondary structure (70% of the residues) or interfacial α-helices parallel to the membrane (30% of the residues) (Granseth et al., 2005a). The chemically complex interface environment influences the amino acid distribution which differs significantly from both the membrane core and the surrounding aqueous phase. A striking observation is that polar aromatic residues (Tyr and Trp) are enriched in the membrane interface on both sides (Granseth et al., 2005a; Killian and von Heijne, 2000). It is believed that aromatic residues, with their flat rigid ring structures, are excluded from the membrane core (except Phe) because they would perturb the ordering of the hydrocarbon tails too much. At the same time they are well suited to the interface region, probably due to electrostatic interactions, and thereby help to anchor the transmembrane segments in the membrane (Killian and von Heijne, 2000; Yau et al., 1998).

Glycine is the most frequent residue in this region and this is probably due to its small size that enables the formation of short loops of the polypeptide facilitating close packing of the TM helices (Ulmschneider et al., 2005). Proline is also preferred in the interface regions for the same reason, as it allows the backbone to take specific conformations which can be advantageous as the loops connecting two TM helices on average are short, only 9 residues (Granseth et al., 2005a).

As mentioned earlier, charged residues are not common in the hydrocarbon core, but are more frequent in the interface region, as discussed in the next section.

3.2.1.4 Positive-inside rule

The exact positioning of the transmembrane segments is determined by the long stretches of hydrophobic residues together with the anchoring residues in the interface region as discussed above. But what guarantees that the helices are inserted in the correct orientation? The helices themselves contain no directional information. Statistical studies of bacterial inner membrane proteins have shown that positively charged residues (Arg and Lys) are more prevalent in cytoplasmic loops than in non-cytoplasmic loops (von Heijne, 1986). The biased distribution, the so-called positive-inside rule, has later been confirmed to be universal and holds for virtually all organisms, although it is somewhat less pronounced in eukaryotes (Nilsson et al., 2005). No comparable enrichment of negatively charged residues (Asp or Glu) is detected on either side of the membrane (Granseth et al., 2005a; Nilsson et al., 2005). It is the increased presence of Arg and Lys in the vicinity of the helix ends in the cytoplasmic loops that directs the helix orientation.

How this guidance occurs in detail is not fully understood, but there are presumably a number of factors that contribute to establishing the final topology. Interaction with negatively charged phospholipid head groups can prevent translocation of a polypeptide chain with positive charge (van Klompenburg et al., 1997), which thus becomes retained in the cytoplasm. Another aspect that affects which parts are translocated is the direct contact with the Sec-translocon. Charged residues in the translocon complex are believed to attract or repel the positive residues in the MP that is to be inserted. It has been verified that mutations of residues to opposite charges in the yeast Sec61p (the largest subunit in the Sec-translocation complex) affect the MP orientation (Goder et al., 2004). The influence of the positive-inside rule was reduced for a set of test membrane proteins as compared to when wildtype Sec61p was present, with inverted orientations as a result.

Finally, the electrochemical potential across the bacterial inner membrane is anticipated to play a role for the orientation. The basic cytoplasm and the acidic periplasm seem to prevent translocation of positively charged polypeptide segments containing Arg and Lys residues and facilitate translocation of less positively charged segments (Andersson and von Heijne, 1994). This phenomenon contributing to the positive-inside rule cannot be applied to eukaryotic MPs because there is no general potential across the ER membrane. This might be one reason why the distribution of Arg and Lys in cytoplasmic loops compared to non-cytoplasmic loops is less biased in eukaryotes than in prokaryotes.
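As a rough illustration of how the positive-inside rule can be turned into an orientation call, the sketch below counts Arg and Lys in alternating loops for a given set of helix positions. It is a deliberate simplification and an assumption of this example: whole loops are counted, whereas the text stresses residues close to the helix ends, and the function names and the toy sequences are invented.

```python
def loop_segments(seq: str, helices: list[tuple[int, int]]) -> list[str]:
    """Split `seq` into loops, given sorted 1-based inclusive helix ranges."""
    segments, prev_end = [], 0
    for start, end in helices:
        segments.append(seq[prev_end:start - 1])
        prev_end = end
    segments.append(seq[prev_end:])
    return segments


def positive_charges(segment: str) -> int:
    """Number of Arg and Lys residues in a segment."""
    return segment.count("K") + segment.count("R")


def predict_n_terminus(seq: str, helices: list[tuple[int, int]]) -> str:
    """Place the N-terminus on the side whose loops carry more Arg + Lys."""
    loops = loop_segments(seq, helices)
    n_side = sum(positive_charges(s) for s in loops[0::2])      # N-terminal side
    other_side = sum(positive_charges(s) for s in loops[1::2])  # opposite side
    if n_side == other_side:
        return "undetermined"
    return "in" if n_side > other_side else "out"


# Toy two-helix protein: both terminal loops are rich in Lys/Arg.
toy = "MKRK" + "L" * 20 + "DDS" + "L" * 20 + "RRK"
print(predict_n_terminus(toy, [(5, 24), (28, 47)]))
```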

3.2.1.5 Signal peptides

In order to be inserted into the membrane, TM proteins have to be targeted to the translocon present in the ER in eukaryotes (Sec61) and in the plasma membrane in prokaryotes (SecY), reviewed in (Luirink et al., 2005). Targeting is mainly mediated by the signal recognition particle (SRP) that binds to the first hydrophobic segment as the nascent polypeptide chain is translated on the ribosome. The SRP directs the ribosome to the translocon and the N-terminal part of the peptide enters the protein-conducting channel. The hydrophobic segment acts as a signal anchor sequence that most often is transferred laterally into the membrane and stays there as the first TM helix. In some cases it can be cleaved off by the enzyme signal peptidase yielding a TM protein with non-cytoplasmic N-terminal topology. The presence of a signal peptide (SP) in membrane proteins is thus an important topological determinant.

Secretory proteins follow the same pathway through the translocon (although not always co-translationally), but since they are not intended to be kept in the membrane they all have a cleavable signal peptide. The SP is hydrophobic in the middle of the sequence and forms an α-helix, but this is in general shorter than a TM helix and somewhat less hydrophobic. Still, an SP is similar enough to a TM helix to cause confusion among prediction programs trying to discriminate between secreted and membrane proteins (Käll et al., 2004).

3.2.1.6 Re-entrant regions

During the last years it has become clear that it is not only transmembrane helices that enter the membrane in helical MPs. So-called re-entrant regions also penetrate into the membrane, but instead of traversing the membrane, the peptide chain enters and leaves at the same side of the membrane. Usually re-entrant regions are short, on average ~13 residues. Some contain an α-helix that digs into the membrane, makes a turn and goes back, either as a second α-helix or as coil, while others contain only coil structures (Viklund et al., 2006). Aquaporin-1 is a nice example where two re-entrant regions, one from each side, meet in the center of the membrane and thereby can form a stable structure (Murata et al., 2000) (Fig. 4). Viklund et al. estimated that at least 10% of all TM proteins have re-entrant regions and that they are most common in channels and transporters. There is an abundance of residues with small side chains (Ala and Gly) and the overall amino acid distribution is of intermediate hydrophobicity.

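[Figure 4 image: ribbon diagram with the non-cytoplasmic side at the top and the cytoplasmic side at the bottom; helices labeled 1-6 and re-entrant regions labeled B and E.]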

Figure 4. Ribbon diagram of aquaporin-1 (PDB code 2D57) showing six transmembrane helices (numbered 1-6) and two re-entrant regions (B and E) each folded as a half-helix connected to a coil structure. The N-terminal ends of the half-helices face the hydrophilic water-conducting pore and contain the conserved motif Asn-Pro-Ala which holds together the half-helices by hydrogen bonding and van der Waals interactions.

3.2.2 Topology prediction algorithms

Taking all the abovementioned topological elements into account, one would imagine that we have all the information needed for correct topology prediction. But still it has turned out to be challenging to incorporate all the various TM properties into one general model. Three basic criteria have to be fulfilled for a topology prediction algorithm, namely ability to i) discriminate between globular and membrane proteins, ii) determine the number and positions of the transmembrane segments and iii) determine the inside/outside orientation of the protein. Additional criteria for more sophisticated algorithms are ability to iv) discriminate between a signal peptide and a transmembrane helix and v) predict re-entrant regions and interfacial helices.

It is always a bit tricky to compare the performance of different topology prediction methods since many are machine learning algorithms developed using different training sets of proteins with known topology. The lack of structural data restricts the size of training and test sets for membrane proteins. Ideally the sets should be completely disparate but this has not always been the case. Some algorithms might have been trained on sequences present in the test set, which biases the result. Overtraining is an obvious problem as test set proteins have turned out to be easier to predict than whole TM proteomes (Käll and Sonnhammer, 2002, paper I). Several benchmarking studies have been performed (Chen et al., 2002; Ikeda et al., 2002; Möller et al., 2001) and the conclusion is that no method always performs best and that the evaluated performance accuracy usually lies in the range 50-70%. Incorporation of evolutionary information (Jones, 2007; Viklund and Elofsson, 2004), experimental knowledge (paper I) or using consensus predictions (Arai et al., 2004; Nilsson et al., 2000) increases the prediction accuracy.

Topology prediction algorithms can be divided into two broad categories, methods based on hydrophobicity scales and methods based on machine learning approaches. Below I will explain the differences between the categories, describe the particular methods used in my studies, and discuss some of the most recent methods that have had impact on the prediction performance.

3.2.2.1 Hydrophobicity scales

The earliest methods for predicting transmembrane segment locations were based on hydrophobicity indices reflecting each amino acid’s propensity for being embedded in the membrane. The indices were derived from experimental or theoretical studies estimating the free energy of transferring an amino acid from aqueous solution to a nonpolar environment resembling the membrane interior and summarized in different hydrophobicity scales (Engelman et al., 1986; Kyte and Doolittle, 1982; White and Wimley, 1999; Wimley and White, 1996). In a prediction, only information from amino acids anticipated to be embedded in the membrane is considered. A sliding window approach is applied where a window of fixed length is scanned along the sequence and the hydrophobicity indices for each residue within the window are summed. The average hydrophobicity for the center position in the window is plotted to generate a hydrophobicity profile of the whole protein. Segments sufficiently long and above a heuristically determined cut-off are predicted as transmembrane.

One drawback with the early hydrophobicity scales is that they are poor in discriminating between globular and membrane proteins and usually overpredict the number of TM helices (Chen et al., 2002; Möller et al., 2001). Another disadvantage is that they do not predict the sidedness of the protein, meaning that the inside/outside orientation remains unknown.
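The sliding-window procedure described above can be sketched as follows, using the Kyte and Doolittle (1982) hydropathy values. The window length of 19 and the cut-off of 1.6 are illustrative choices for this example rather than parameters taken from this thesis, and the function names are invented; note that the returned ranges are runs of window centres above the cut-off, not refined helix boundaries.

```python
# Kyte-Doolittle hydropathy indices
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq: str, window: int = 19) -> list[float]:
    """Average hydropathy per window; entry k belongs to residue index k + window // 2."""
    half = window // 2
    return [
        sum(KD[aa] for aa in seq[c - half:c + half + 1]) / window
        for c in range(half, len(seq) - half)
    ]


def predict_segments(seq: str, window: int = 19, cutoff: float = 1.6) -> list[tuple[int, int]]:
    """Return 1-based ranges of window centres whose average exceeds `cutoff`."""
    half = window // 2
    segments, start = [], None
    for k, value in enumerate(hydropathy_profile(seq, window)):
        pos = k + half + 1  # 1-based residue number of the window centre
        if value > cutoff and start is None:
            start = pos
        elif value <= cutoff and start is not None:
            segments.append((start, pos - 1))
            start = None
    if start is not None:
        segments.append((start, len(seq) - half))
    return segments


# Toy sequence: a 25-residue leucine stretch flanked by charged residues.
toy = "DK" * 6 + "L" * 25 + "DK" * 6
print(predict_segments(toy))
```

The sketch reports only where hydrophobic stretches lie. As noted above, it says nothing about which side of the membrane the loops face.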

3.2.2.1.1 TopPred

The prediction progress was taken one step further by the development of TopPred (von Heijne, 1992). The method integrated the GES hydrophobicity scale (Engelman et al., 1986) with the positive-inside rule and was actually the first method to predict the complete topology and not only the number of TM segments and their positions. The sliding-window analysis was also somewhat refined by using a trapezoid window to reflect the environmental transitions between membrane interface (triangular shape) and membrane core (rectangular shape). A hydrophobicity plot is constructed and every peak above an upper cut-off is considered as a certain TM helix whereas a peak between a lower and the upper cut-off is considered as putative. All possible topologies are created, always including the certain helices but alternatively including or excluding the putative ones. By applying the positive-inside rule (maximizing the difference in the number of Arg and Lys in potentially inside and outside loops) the most probable topology can be predicted.
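The enumeration step described above can be sketched as follows. This is a minimal illustration and not the TopPred program: it assumes that the certain and putative helix positions are already known, it counts Arg and Lys over whole loops, and all function names and the toy sequence are invented for the example.

```python
from itertools import combinations


def count_kr(segment: str) -> int:
    """Number of Arg and Lys residues in a segment."""
    return segment.count("K") + segment.count("R")


def loop_segments(seq, helices):
    """Split `seq` into loops, given sorted 1-based inclusive helix ranges."""
    segments, prev_end = [], 0
    for start, end in helices:
        segments.append(seq[prev_end:start - 1])
        prev_end = end
    segments.append(seq[prev_end:])
    return segments


def best_topology(seq, certain, putative):
    """Try every subset of putative helices; keep the largest Arg+Lys bias."""
    best = None
    for r in range(len(putative) + 1):
        for extra in combinations(putative, r):
            helices = sorted(list(certain) + list(extra))
            loops = loop_segments(seq, helices)
            n_side = sum(count_kr(s) for s in loops[0::2])
            other_side = sum(count_kr(s) for s in loops[1::2])
            bias = abs(n_side - other_side)
            # The side with more positive charges is taken to be the inside.
            n_term = "in" if n_side >= other_side else "out"
            if best is None or bias > best[0]:
                best = (bias, helices, n_term)
    return best


# Toy protein: one certain helix (5-24) and one putative helix (27-46).
toy = "MKRK" + "L" * 20 + "DS" + "V" * 20 + "KRR"
print(best_topology(toy, certain=[(5, 24)], putative=[(27, 46)]))
```

In the toy case, including the putative helix puts all six positive charges on one side, so the two-helix topology with the N-terminus inside is selected over the one-helix alternative, where the charges are evenly split.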

3.2.2.2 Machine learning approaches

More advanced prediction methods are usually based on some machine learning technique such as neural networks (NNs) or hidden Markov models (HMMs). These methods use collections of known MP structures or experimentally confirmed topologies as training data to statistically estimate the amino acid distributions in topologically distinct regions of a model membrane protein (e.g. TM helices, interfaces, inside and outside loops). A prediction is the outcome of optimizing the matching of the residue distribution in an examined protein with the pre-calculated distributions in the different regions of the model. The whole protein is analyzed at once, with all topological signals taken into account concurrently. Thus, the amino acid compositions in regions other than the membrane-spanning parts have impact on the predicted topology, which is not the case in the simpler hydrophobicity scale-based methods. This has turned out to be a successful strategy and advanced methods almost always perform better than simpler methods in benchmarking studies (Chen et al., 2002; Möller et al., 2001).

3.2.2.2.1 MEMSAT

One of the first machine learning methods to take advantage of a global view of all topological signals was MEMSAT (Jones et al., 1994b). It defines five structural states (inside loop, inside helix end, helix middle, outside helix end, outside loop) and each state is associated with a statistical table of the frequency of the 20 amino acids, represented as log likelihood ratios. The tables were compiled from a set of proteins of known topologies where single-spanning and multi-spanning proteins are treated separately since the TM helices tend to have somewhat different properties (Jones et al., 1994a). All possible topologies are explored, starting from one helix (in both orientations) and successively increasing the number of helices by one up to an upper limit depending on the protein length. Each topology is then scored according to a statistical method (expectation maximization). For a given number of TM helices a dynamic programming algorithm is applied to search for their optimal locations and lengths. A list of the optimized topologies together with their scores is produced. The topology with the highest score is the final prediction. The approach used in MEMSAT can be seen as a forerunner of prediction methods based on hidden Markov models (HMMs), see TMHMM and HMMTOP below.

3.2.2.2.2 PHDhtm

PHDhtm (Rost et al., 1996) is a program belonging to a general tool, PHD (Rost, 1996), for predicting secondary structures of proteins. It is designed to predict the topology of membrane proteins by using evolutionary information in a stepwise procedure. The first step is a BLAST search (Altschul et al., 1990) of the query sequence against the SWISSPROT database (Boeckmann et al., 2003) to identify relevant homologs. The hits are aligned in a multiple sequence alignment which is fed into a neural network. The network estimates the preference for each residue to be or not to be in a transmembrane helix state. In the second step, the region with highest transmembrane preference is compared to a threshold value to decide whether the protein is a membrane or globular protein. If it is classified as MP, the network preferences are used again in the third step as input to a dynamic programming algorithm that finds the optimal number and locations of the TM helices (the highest-scoring model). Lastly, the positive-inside rule is applied to determine the orientation and to generate the final topology of the protein.

Together with the predicted topology, PHDhtm provides two indices, ranging from 0 to 9, for estimating the prediction reliability. One reliability index is defined for the model (i.e. the number and locations of the TM helices), based on the difference in score for the best and second best model. The other reliability index is related to the predicted orientation and is proportional to the positive charge difference between the cytoplasmic and non-cytoplasmic parts of the protein.

3.2.2.2.3 TMHMM

TMHMM (Krogh et al., 2001; Sonnhammer et al., 1998) is based on a hidden Markov model with seven distinct states (helix core, helix caps on either side of the membrane, short loop on the cytoplasmic side, short and long loops on the non-cytoplasmic side and globular domains in the middle of each loop) corresponding to the well-defined regions in membrane proteins. Each type of state has a probability distribution over the 20 amino acids (emission probabilities) that is estimated from a set of proteins with experimentally known topologies. Between the states there are transition probabilities that reflect the likelihood of either staying in the same state or moving to the next state, also estimated from training data. The architecture of TMHMM is cyclic and biologically relevant, as the transitions between the states force the succession of predicted protein regions to be in the correct order, i.e. an inside loop is always followed by a helix, followed by an outside loop and so on (Fig. 5a).

Given the defined model, the algorithm finds the most probable path through the states for a query protein, i.e. it maximizes the correlation between the sequence of state emission probabilities and the observed amino acids. The predicted topology is represented as a labeled sequence of the three classes i (inside or cytoplasmic), h (helix) and o (outside or extracytoplasmic). This is calculated by the N-best algorithm (Krogh, 1997), which maximizes the likelihood that the query sequence is generated by a set of state paths (that all generate the same topology), here denoted as p(best topology), where p stands for probability. There are many possible paths that a protein sequence can take through the model, and their summed probabilities can be calculated with a procedure called the forward algorithm, here denoted as p(all possible topologies). Each residue is labeled i, h or o as mentioned above, but a posterior probability is also given for all three classes, p(i), p(h) and p(o), for each residue, which is interpreted as the probability of the residue being in each class (Fig. 5b). Note that there is not necessarily a one-to-one correspondence between the most probable labeling according to the posterior probability and the final predicted topology. The reason for this is that posterior probabilities are “local” and are not required to obey the restrictions of allowed state paths.

The strength of HMMs is that the optimal path, i.e. the prediction, is found in one step and thus all protein parts are modeled simultaneously. The influence of each topological signal therefore depends on its clarity relative to the other signals rather than only on how well it matches a given state. This is beneficial when signals are weaker than usual, as for example in multi-spanning proteins where some helices are less hydrophobic than the average helix.
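The decoding steps described above can be illustrated with a small sketch. The model below is a toy four-state cyclic HMM and not TMHMM itself: the state names, the two-class emission model (hydrophobic versus other residues) and all probabilities are invented for illustration, and the Viterbi algorithm is used as a simpler stand-in for the N-best algorithm. It computes the per-residue posterior probabilities p(i), p(h) and p(o) with the forward-backward procedure and, separately, the single most probable state path.

```python
# Toy cyclic HMM: i -> h (in-to-out) -> o -> h (out-to-in) -> i.
# All numbers are hypothetical; the real TMHMM model is considerably larger.
STATES = ["i", "h_io", "o", "h_oi"]
LABEL = {"i": "i", "h_io": "h", "o": "o", "h_oi": "h"}
START = {"i": 0.5, "h_io": 0.0, "o": 0.5, "h_oi": 0.0}
TRANS = {
    "i": {"i": 0.8, "h_io": 0.2},
    "h_io": {"h_io": 0.8, "o": 0.2},
    "o": {"o": 0.8, "h_oi": 0.2},
    "h_oi": {"h_oi": 0.8, "i": 0.2},
}
HYDROPHOBIC = set("AILMFVWC")


def emit(state, aa):
    """Two-class emission: helix states favour hydrophobic residues."""
    hydrophobic = aa in HYDROPHOBIC
    if LABEL[state] == "h":
        return 0.9 if hydrophobic else 0.1
    return 0.3 if hydrophobic else 0.7


def forward(seq):
    """Forward variables; the sum of the last column is p(sequence)."""
    f = [{s: START[s] * emit(s, seq[0]) for s in STATES}]
    for aa in seq[1:]:
        prev = f[-1]
        f.append({s: emit(s, aa) * sum(prev[r] * TRANS[r].get(s, 0.0) for r in STATES)
                  for s in STATES})
    return f


def backward(seq):
    """Backward variables, built from the last residue towards the first."""
    b = [{s: 1.0 for s in STATES}]
    for aa in reversed(seq[1:]):
        nxt = b[0]
        b.insert(0, {s: sum(TRANS[s].get(r, 0.0) * emit(r, aa) * nxt[r] for r in STATES)
                     for s in STATES})
    return b


def posterior_labels(seq):
    """Per-residue posterior probabilities for the classes i, h and o."""
    f, b = forward(seq), backward(seq)
    total = sum(f[-1].values())
    post = []
    for ft, bt in zip(f, b):
        p = {"i": 0.0, "h": 0.0, "o": 0.0}
        for s in STATES:
            p[LABEL[s]] += ft[s] * bt[s] / total
        post.append(p)
    return post


def viterbi(seq):
    """Most probable single state path, returned as a string of i/h/o labels."""
    v = [{s: START[s] * emit(s, seq[0]) for s in STATES}]
    back = []
    for aa in seq[1:]:
        prev, col, ptr = v[-1], {}, {}
        for s in STATES:
            r_best = max(STATES, key=lambda r: prev[r] * TRANS[r].get(s, 0.0))
            col[s] = prev[r_best] * TRANS[r_best].get(s, 0.0) * emit(s, aa)
            ptr[s] = r_best
        v.append(col)
        back.append(ptr)
    state = max(STATES, key=lambda s: v[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(LABEL[s] for s in reversed(path))
```

Calling posterior_labels on a sequence returns one dictionary per residue, whereas viterbi returns a labeling that always respects the allowed order of states. Choosing the most probable class residue by residue from the posteriors gives no such guarantee, which is the "local" behaviour noted in the text.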

(a) [Diagram of the TMHMM architecture; see Krogh et al., 2001.]

(b)
Sequence:  M     S     W     N     L     L     F     V
p(i):      0.76  0.71  0.68  0.68  0.14  0.14  0.00  0.00
p(h):      0.00  0.05  0.08  0.11  0.83  0.85  0.99  0.99
p(o):      0.24  0.24  0.24  0.21  0.03  0.01  0.01  0.01
Label:     i     i     i     i     h     h     h     h

Figure 5. (a) The architecture of TMHMM showing the different states (as boxes) and the permitted transitions connecting the states (as arrows). States with the same names have the same amino acid distributions. Figure from (Krogh et al., 2001), reprinted with permission from Elsevier. (b) An example of a TMHMM output for the first eight residues in the membrane protein SWF1 from S. cerevisiae. The posterior probabilities for the three classes i, h and o are listed for each residue and the underlined numbers correspond to the labeling, i.e. the prediction, which yields the highest probability for the whole protein, p(best topology).

3.2.2.2.4 HMMTOP

HMMTOP (Tusnady and Simon, 1998; Tusnady and Simon, 2001) is another HMM-based method with an architecture similar to that of TMHMM. Here, however, there are only five structural states (inside loop, inside helix tail, helix, outside helix tail and outside loop). The differences from TMHMM are that there is no globular state and that short loops are modeled by omitting the loop state and instead connecting the helix tail state to another helix tail state on the same side. The helix tail states thus model segments located in the loops (but close to the membrane) and are therefore not equivalent to the helix cap states of TMHMM, which model segments that are parts of the TM helices.

The major difference between the two methods lies in the description of the driving forces for membrane protein folding. Whereas TMHMM is based on the assumption that the different structural parts are composed of more or less predetermined amino acid distributions that should hold for all membrane proteins, the hypothesis of HMMTOP is that the topology is determined by the difference in amino acid distributions between the various structural parts and thus not solely by the absolute amino acid compositions of the separate parts. Therefore, the method first uses the query protein sequence to optimize its state parameters and then searches for the combination of states that gives the maximum divergence in the amino acid distributions among the predicted segments. The idea is that the large differences in physicochemical properties that different parts of the protein encounter should be reflected in large changes in the amino acid distributions.

There is furthermore an option to include homologous sequences to improve the prediction performance. Those sequences are then used one by one in the state parameter optimization process.

3.2.2.3 Recently developed methods

In recent years, new methods have been developed that are more sophisticated than those described above. Some contain additional structural elements, some also benefit from being trained on a larger data set than earlier methods and others apply new approaches.

It has been known for a long time that the use of evolutionary information, i.e. homologous sequences, increases the prediction accuracy for globular proteins (Rost and Sander, 1993) as well as for membrane proteins (Persson and Argos, 1994; Rost et al., 1996; Tusnady and Simon, 1998). As mentioned above, a multiple sequence alignment is used in PHDhtm whereas in HMMTOP the homologs are used as single sequences to estimate new model parameters. In the recent methods prodiv-TMHMM (Viklund and Elofsson, 2004) and MEMSAT3 (Jones, 2007), information from multiple sequences is used in the form of sequence profiles, which significantly improves the prediction performance.

Phobius (Käll et al., 2004) and TOP-MOD (Viklund et al., 2006) are two other novel methods, both based on HMMs, that incorporate prediction of additional substructures besides the ordinary transmembrane helices and loops. Phobius is able to model signal peptides and TM helices simultaneously and thus reduces the risk of confusing the first TM helix with an SP. TOP-MOD identifies re-entrant regions and also attempts to predict interfacial helices, but the latter has turned out to be more challenging, possibly due to weaker sequence characteristics.

A new approach was taken by Hessa et al. (2005), where the contribution from individual residues to the membrane insertion efficiency of a TM helix was analyzed by designing polypeptide segments and quantifying the degree of insertion. A ‘biological’ hydrophobicity scale was developed as well as a position-dependent free energy matrix. A novelty in this approach compared to the derivation of traditional hydrophobicity scales is the experimental design, where the measurements are made on ER membranes in vitro (dog pancreas microsomes) and not on the partitioning of residues or peptides between aqueous and non-polar solvents. Moreover, the positional dependence of the residues is accounted for and, in agreement with statistical studies, it was shown that charged and polar residues are unfavorable in the middle of the helices, that the polar aromatic residues Trp and Tyr are preferred towards the helix ends, and that the contribution from hydrophobic residues does not vary much with the position within the helices.

A way to bridge the topology knowledge between experimental studies and bioinformatics is to use experimental information as constraints in theoretical predictions. If one or more residues in a membrane protein are constrained to lie on one or the other side of the membrane, the number of possible topologies is reduced and the likelihood of predicting the correct topology increases, as described more thoroughly in chapter 4.1 (paper I). Instead of experimental information, domain assignments can also be used as a priori topological data fed into prediction algorithms (Bernsel and von Heijne, 2005). Extramembraneous soluble domains that are compartment-specific, i.e. always localized in the cytoplasm or the extracytoplasmic space but never found on both sides of the membrane, are estimated to be present in at least 11% of eukaryotic membrane proteomes. When used as constraints, these domain assignments were shown to increase the prediction accuracy in general, particularly for single-spanning MPs.

Finally, reliability scores can serve as a guide for dismissing or approving predictions and help to identify the most dubious topologies that are worth confirming experimentally, see chapter 4.1 (paper I).

4 Summary of papers

4.1 Reliability measures for topology predictions and the use of experimental knowledge (Paper I)

The objective of this study was to find strategies to overcome the relatively moderate performance of the topology prediction methods most widely used at that time. We managed to do this in some respects by deriving reliability scores that make it possible to estimate the trustworthiness of a prediction, and by showing that limited experimental information given a priori to a prediction algorithm considerably increases the accuracy.

We examined the five topology prediction methods TopPred, PHDhtm, MEMSAT, TMHMM and HMMTOP and defined for each a reliability score based on its raw output. A test set of 92 prokaryotic proteins with experimentally determined topologies was used to assess prediction accuracy and its correlation to the constructed reliability scores. As seen in Figure 6, the best correlations were obtained for TMHMM and MEMSAT, whereas the scores did not seem very useful for the other methods. For both TMHMM and MEMSAT, ~50% of the predictions have reliability scores corresponding to a prediction accuracy of ~90%, and ~70% of the proteins have scores corresponding to a prediction accuracy of ~80%. This should be compared to accuracies of 66 and 70% respectively if the whole test set is benchmarked. Thus, by considering the score it is possible to estimate the likelihood that a given prediction is correct.

The TMHMM reliability score was defined as p(best topology)/p(all possible topologies) and takes values between 0 and 1. It gets close to 1 if the suggested topology has high probability and there are few other topologies that can compete with it. The opposite applies to scores close to 0: several other topologies might then be as likely as the suggested one, and such predictions should therefore be considered with caution.

Furthermore, we used the TMHMM score to assess the degree of bias in the test set compared to the predicted membrane proteomes of E. coli, S. cerevisiae and Caenorhabditis elegans. We found a much larger fraction of high-scoring proteins in the test set than in the whole proteomes and consequently estimated the prediction accuracy to be far lower for the full proteomes. This bias is a result of limited experimental data and overtraining, which was also confirmed in another study (Käll and Sonnhammer, 2002).
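The relation between reliability score, coverage and accuracy can be made concrete with a minimal sketch. The function names and the ten example predictions below are hypothetical and are not data from paper I; the sketch only shows how predictions can be ranked by score and how accuracy can be measured on the top-scoring fraction.

```python
def reliability_score(p_best_topology, p_all_topologies):
    """Ratio p(best topology) / p(all possible topologies), between 0 and 1."""
    return p_best_topology / p_all_topologies


def accuracy_at_coverage(predictions, coverage):
    """Accuracy among the top-scoring fraction of predictions.

    predictions: list of (score, is_correct) pairs
    coverage: fraction of the test set to keep, in (0, 1]
    """
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    n = max(1, round(coverage * len(ranked)))
    top = ranked[:n]
    return sum(1 for _, is_correct in top if is_correct) / n


# Hypothetical test set: (reliability score, prediction correct?)
example = [
    (0.95, True), (0.90, True), (0.85, True), (0.80, True), (0.70, False),
    (0.60, True), (0.50, True), (0.40, False), (0.30, False), (0.20, True),
]

for cov in (0.2, 0.5, 1.0):
    print(cov, accuracy_at_coverage(example, cov))
```

In this invented example the accuracy falls as more low-scoring predictions are included, which is the kind of trend a useful reliability score is expected to show.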


Figure 6. A plot showing the relation between the reliability scores and test set cumulative coverage. The overall accuracies measured on the entire test set (found at 100% coverage) lie between 50 and 70%. See paper I for details.

The other way to address the low expected prediction accuracies was to investigate the effect of including experimental information in the predictions. The TMHMM algorithm allows the class assignment for a residue (or region) to be set a priori. In other words, the probability for a certain residue to be located in, for example, an inside loop can be set to 1.0 (p(i)=1.0 and p(o)=p(h)=0.0), which means that the prediction is fixed in that location for that residue and only topologies that are compatible with this constraint will be considered valid. When the number of possible solutions decreases (all topologies that conflict with the fixation are ignored), the likelihood of predicting the correct topology is expected to increase.

We assigned the C-terminal residue of all test set proteins to its experimentally known class and recorded the prediction performance. There was an increase in accuracy from 66% (unfixed predictions) to 77% (fixed predictions). Thus, with very limited prior topological knowledge it was possible to obtain much improved topologies. We further estimated that for the three membrane proteomes studied, the prediction accuracy will increase by at least the same amount, given that the C-terminal location is known.
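The effect of fixing a residue can be sketched by brute force on a very short sequence. The toy model below is hypothetical (invented states and probabilities, a two-class emission model) and is not the TMHMMfix implementation. It enumerates every state path, sums the probabilities of paths that give the same i/h/o labeling, and optionally discards all labelings that conflict with a fixed residue, so that the score p(best topology)/p(all possible topologies) is computed over the remaining labelings only.

```python
from itertools import product

# Toy cyclic model with invented numbers: i -> h -> o -> h -> i.
STATES = ["i", "h_io", "o", "h_oi"]
LABEL = {"i": "i", "h_io": "h", "o": "o", "h_oi": "h"}
START = {"i": 0.5, "o": 0.5}
TRANS = {
    "i": {"i": 0.8, "h_io": 0.2},
    "h_io": {"h_io": 0.8, "o": 0.2},
    "o": {"o": 0.8, "h_oi": 0.2},
    "h_oi": {"h_oi": 0.8, "i": 0.2},
}
HYDROPHOBIC = set("AILMFVWC")


def emit(state, aa):
    """Two-class emission: helix states favour hydrophobic residues."""
    hydrophobic = aa in HYDROPHOBIC
    if LABEL[state] == "h":
        return 0.9 if hydrophobic else 0.1
    return 0.3 if hydrophobic else 0.7


def topology_probs(seq, fixed=None):
    """Summed path probability per labeling.

    fixed maps a 0-based position to the label ('i', 'h' or 'o') it must have;
    labelings that conflict with it are ignored.
    """
    fixed = fixed or {}
    probs = {}
    for path in product(STATES, repeat=len(seq)):
        labels = "".join(LABEL[s] for s in path)
        if any(labels[pos] != lab for pos, lab in fixed.items()):
            continue
        p = START.get(path[0], 0.0) * emit(path[0], seq[0])
        for prev, cur, aa in zip(path, path[1:], seq[1:]):
            p *= TRANS[prev].get(cur, 0.0) * emit(cur, aa)
        if p > 0.0:
            probs[labels] = probs.get(labels, 0.0) + p
    return probs


def best_and_score(probs):
    """Best labeling and its share of the total probability."""
    best = max(probs, key=probs.get)
    return best, probs[best] / sum(probs.values())


if __name__ == "__main__":
    seq = "MSWNLLFV"  # short enough for exhaustive enumeration (4**8 paths)
    free = topology_probs(seq)
    constrained = topology_probs(seq, fixed={len(seq) - 1: "o"})
    print(len(free), best_and_score(free))
    print(len(constrained), best_and_score(constrained))
```

Exhaustive enumeration is only feasible for toy sequences; it is used here because it makes the shrinking set of admissible topologies explicit.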

These extensions to TMHMM, i.e. the derivation of a reliability score and the ability to use experimental information, were implemented in a refined version of TMHMM2.0, namely TMHMMfix, which is publicly available (http://www.sbc.su.se/~melen/TMHMMfix/).

4.2 Topology models for a small number of S. cerevisiae membrane proteins based on C-terminal reporter fusions and predictions (Paper II)

This is a pilot study where we for the first time combined experimental results with TMHMMfix. The motives for working with S. cerevisiae were that it is a common model organism for other eukaryotes and that prediction methods in general perform less well on yeast membrane proteins than on both mammalian and prokaryotic membrane proteins (Nilsson et al., 2002, paper I).

Encouraged by the performance improvement estimated in paper I, our idea was to determine the C-terminal location for a set of yeast membrane proteins and to use that information as constraints to predict reliable topology models.

Target proteins were selected by scanning the yeast genome and choosing the predicted open reading frames (ORFs) for which the five prediction methods (TopPred, MEMSAT, PHDhtm, TMHMM and HMMTOP) all produced the same topology. It had been shown earlier (Nilsson et al., 2000) that consensus predictions have high proportions of correctly predicted topologies, which gave us a good basis for verifying the orientation of the proteins experimentally and also for correcting possibly incorrectly predicted topologies.

Only proteins predicted to have at least two transmembrane segments were included in order to avoid confusion between cleavable signal peptides of secretory proteins and single-spanning membrane proteins. Furthermore, genes containing introns or genes with dubious ORFs were removed, as were proteins not expected to be targeted to the secretory pathway, since the experimental design requires insertion into the ER membrane.

For each protein a construct was made in which the full-length gene (except the stop codon) was fused to the dual reporter Suc2/His4C (Fig. 7a). The constructs were expressed from plasmids transformed into a yeast strain with a mutated nonfunctional his4 gene. Successful fusions that expressed well were finally made for 39 proteins.

The experimental determination of the C-terminal location was carried out in two parallel ways. Cells were streaked on plates depleted in histidine but supplemented with histidinol and checked for growth. Cells were also lysed, and the membrane protein fusions were isolated and treated with Endo H to assess the glycosylation status. If the C terminus is located in the cytoplasm, His4C can convert histidinol to histidine and the cells grow (Fig. 7b). At the same time there will be no change in molecular weight between Endo H-treated and untreated proteins, since glycosylation of the SUC2 moiety can only take place in the ER. The contrary applies to cases where the C terminus resides in the ER lumen. The cells cannot grow due to lack of histidine, but there will be a molecular weight difference as a consequence of glycosylation (Fig. 7c). Since the two reporters are complementary to each other, C-terminal locations can be determined unequivocally for most proteins.
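The complementary read-out of the two reporters amounts to a small decision table, sketched below. The function and parameter names are hypothetical; the logic follows the description above, with any combination in which the two assays fail to give complementary signals left unassigned.

```python
from typing import Optional


def assign_c_terminus(grows_on_histidinol: bool, endo_h_shift: bool) -> Optional[str]:
    """Infer the C-terminal location from the Suc2/His4C dual reporter.

    grows_on_histidinol: growth on -His/+histidinol plates (His4C in the cytoplasm)
    endo_h_shift: molecular weight shift after Endo H treatment (Suc2 glycosylated
                  in the ER lumen)
    Returns "in" (cytoplasm), "out" (ER lumen) or None if the assays do not give
    complementary signals.
    """
    if grows_on_histidinol and not endo_h_shift:
        return "in"
    if endo_h_shift and not grows_on_histidinol:
        return "out"
    return None


print(assign_c_terminus(True, False))   # growth, no glycosylation
print(assign_c_terminus(False, True))   # no growth, glycosylation
print(assign_c_terminus(False, False))  # neither signal: left unassigned
```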

[Figure 7 graphic. Panel labels: (a) MP, HA, SUC2, His4C; (b) growth on -His/+Histidinol, no glycosylation; (c) no growth on -His/+Histidinol, glycosylation.]

Figure 7. C-terminal topology mapping in S. cerevisiae. (a) Schematic picture of a construct where the membrane protein (MP) is fused to the reporters SUC2 (containing glycosylation acceptor sites, indicated by ‘V’) and His4C (with enzymatic capacity). The hemagglutinin (HA) tag allows the fused protein to be identified by Western blotting. (b) The C terminus is located in the cytoplasm which is detected by growth and absence of glycosylation. (c) The C terminus is located in the ER lumen which is detected by glycosylation and no growth.

For 37 out of the 39 proteins we could assign the C termini to either the inside (cytoplasm) or the outside (ER lumen). For the remaining two proteins, the fusions were not glycosylated and the cells did not grow on histidinol, so no assignment could be made. We speculated that they are inserted into the mitochondrial inner membrane with their C termini located in the matrix, and for one of the proteins this was later confirmed.

The inside/outside assignments for the C-terminal ends of the proteins were further used as constraints in TMHMMfix to produce reliable topology models. For 31 of the 37 proteins the fixed predicted topology was the same as the initial consensus prediction. The other six shifted orientation or changed the number of helices by one.

The reliability scores were relatively high compared to the score distribution for yeast calculated in paper I. Notably, proteins with a large number of TM helices had in general lower scores than those with few TM helices, likely due to the larger number of possible topologies that TMHMM can produce for such proteins.

Our conclusion was that the strategy of combining experimental methods and bioinformatics predictions worked well. With a relatively limited amount of experimental effort (compared to a full topology mapping) we could obtain reliable topology models for 37 membrane proteins in an efficient way. Accordingly, this work shows potential to be expanded to a proteome-wide scale in yeast.

4.3 Topology models for a small number of E. coli membrane proteins and optimization analysis of fusion points (Paper III)

Here we applied the same strategy as in paper II but for E. coli, i.e. we determined the C-terminal location for a selection of integral membrane proteins and used the information as constraints in TMHMMfix predictions to generate reliable topology models. Additionally, we investigated which part of a protein is the optimal part to fix, and addressed the question of where a reporter protein should be fused in order to capture the most informative topological data.

Initially we examined how different placements of a topology reporter would influence the topology prediction. Is it correct to assume that fixation of the C terminus increases the TMHMM accuracy the most? Or are there other regions in a protein that provide better topological information, for example the N terminus or a loop region with low posterior probability?

We used a test set comprising 233 membrane proteins with experimentally determined topologies. First we ran unconstrained predictions for all proteins and noted that 69% were correctly predicted. Then each protein was scanned along the sequence and one residue at a time was fixed according to its annotated location prior to prediction. Residues within transmembrane segments were excluded since common topology reporters can only be used for detection of extramembraneous locations. We focused on the initially incorrectly predicted topologies and analyzed which fixed residues in those proteins could convert the topologies to the correct ones.

We concluded that the C terminus is an optimal placement for the reporter protein if only one region is to be experimentally mapped. 81% of all proteins were correctly predicted when their respective C termini were fixed. The corresponding number for always fixing the N terminus was 79%. If a combination of the two was used, only a slight increase in prediction accuracy was obtained (82%), but at a very high experimental cost.

It did not turn out to be a good idea to fix residues in loops of low posterior probabilities. In fact, many such regions were transmembrane segments missed by TMHMM. This situation reflects the uncertainty of a prediction. Fixing a residue there to inside or outside would inevitably produce a wrong topology.
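The scanning procedure can be outlined as follows. This is a minimal sketch under stated assumptions: the predictor is passed in as a function with a hypothetical signature (sequence plus an optional dictionary of fixed positions), correctness is simplified to exact equality of the label strings, and the toy predictor used in the example is a stand-in that only responds to a C-terminal constraint.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical predictor interface: (sequence, fixed positions) -> i/h/o label string.
Predictor = Callable[[str, Optional[Dict[int, str]]], str]


def rescuing_positions(seq: str, annotated: str, predict: Predictor) -> List[int]:
    """Positions whose fixation turns a wrong prediction into the annotated topology.

    Residues annotated as 'h' are skipped, since reporters only detect
    extramembraneous locations. Returns an empty list if the unconstrained
    prediction is already correct.
    """
    if predict(seq, None) == annotated:
        return []
    hits = []
    for pos, label in enumerate(annotated):
        if label == "h":
            continue
        if predict(seq, {pos: label}) == annotated:
            hits.append(pos)
    return hits


def toy_predict(seq: str, fixed: Optional[Dict[int, str]]) -> str:
    """Stand-in predictor: only a constraint on the last residue flips the result."""
    if fixed and (len(seq) - 1) in fixed:
        return "iiihhhooo"
    return "ooohhhiii"


print(rescuing_positions("AAAAAAAAA", "iiihhhooo", toy_predict))
```

A real analysis would compare topologies by helix count and orientation rather than by exact label strings, and would use a constrained predictor in place of the stand-in.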


Based on the positive results for fixing the C terminus, and the fact that C-terminal fusions do not affect the insertion into the membrane significantly and are minimally disruptive of the native topologies of membrane proteins (no truncation is needed nor any fusions into internal loops), we decided to use the same experimental approach as in a pioneering analysis performed in our lab (Drew et al., 2002). That study successfully used a setup with the two topology reporters GFP and PhoA for localization studies in E. coli.

A total of 34 new target proteins were selected by applying the five prediction methods (TopPred, MEMSAT, PHDhtm, TMHMM and HMMTOP) to all ORFs in the E. coli genome and choosing the ones that had a consensus predicted N-terminal location and that were predicted to have the same topology by the five methods. A deviation of one predicted TM segment was however allowed. These selection criteria provided a data set where the C-terminal determination will be conclusive for producing reliable topologies.

For each protein, two constructs on separate plasmids were made, one where the corresponding gene was fused to GFP and another where it was fused to PhoA. The vectors were transformed into E. coli and the expressed fusions were analyzed. The GFP assay was carried out by illuminating the cells with UV light and analyzing the GFP fluorescence emission. GFP folds properly and fluoresces only in the cytoplasm, so detection of fluorescent cells suggests a cytoplasmic location of the protein’s C terminus. In contrast, PhoA requires a periplasmic location in order to be enzymatically active, which can be detected by adding an appropriate substrate.

For 31 of the 34 proteins, the GFP/PhoA results were completely consistent and for those we could assign the C termini to the inside or outside. This information was used as constraints in TMHMMfix, and experimentally based topology models were produced. The reliability scores supported increased prediction quality as compared to unconstrained predictions. 3D structures for two of the examined proteins were known. In one case the predicted topology agreed with the structural data, but for the other protein the prediction missed two helices, although the orientation was correct. However, the reliability score for the failed prediction was very low.

In conclusion, we found that the strategy of expressing full-length proteins fused to the C-terminal reporters GFP and PhoA had a high success rate and that combining the results with constrained predictions yielded trustworthy topologies. The approach thus has potential to be used on a proteome-wide scale in E. coli.

4.4 Large-scale topology analysis of the E. coli and S. cerevisiae membrane proteomes (Papers IV and V)

Having validated the experimental and bioinformatics approaches in E. coli (paper III) and in S. cerevisiae (paper II), the next step was to extend the analyses to global topology mappings of all multi-spanning membrane proteins in the two organisms.

We started with all predicted ORFs in E. coli and S. cerevisiae respectively and defined the whole membrane proteomes by applying TMHMM. Only proteins with two or more predicted transmembrane segments were selected (for the same reason as in papers II and III). The data sets were reduced by eliminating ORFs that were too short, previously analyzed, had dubious gene sequences or were not targeted to the secretory pathway (in yeast). This resulted in 714 putative membrane proteins in E. coli and 629 in S. cerevisiae. Out of those, 665 and 617 were cloned and expressed successfully. C-terminal assignments could finally be made for 502 and 468 proteins respectively.

Bioinformatics was heavily used in the selection process, in the design of optimal restriction enzyme combinations for the E. coli genes and in the control of the expressed gene sequences (comparison of the expected nucleotide sequence with output from sequencing analyses). The last of these steps was necessary to discover mispredicted ORFs or cloning failures, so as not to draw incorrect conclusions from the C-terminal assignments. The experimental setup was somewhat simpler in the yeast study in that cloning was accomplished by homologous recombination and inserted genes were verified by PCR analysis instead of by sequencing. I will mainly describe the results in paper V in the following sections, but comparisons to the results obtained in paper IV will be made.

To be confident that our C-terminal assignment procedure also holds on a large scale and to possibly rule out that the C-terminal reporter proteins affect the membrane insertion (or at least conclude that this is not likely the case here), we did an internal validation for all the yeast proteins with assigned C-terminal locations. We performed an all-against-all BLAST search (Altschul et al., 1990) among the proteins and retained all pairwise hits with an E-value < 10⁻⁵ and for which the BLAST alignment reached within 15 residues of the C termini (illustrated in Fig. 8). These restrictions prevent the appearance of an additional transmembrane segment between the end of the alignment and the C terminus of either sequence, and homologs found in this way can be assumed to have the same C-terminal orientation.
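The two search criteria can be expressed as a simple filter on each pairwise hit. In the sketch below the function name, the parameter names and the example sequence lengths are hypothetical, and "within 15 residues" is interpreted as at most 15 unaligned residues at the C terminus of both the query and the hit.

```python
def passes_c_terminal_filter(evalue, query_len, query_end, hit_len, hit_end,
                             max_evalue=1e-5, max_unaligned=15):
    """Check the E-value cutoff and the C-terminal reach of a pairwise alignment.

    query_end and hit_end are the 1-based positions of the last aligned residue
    in the query and the hit sequence respectively.
    """
    query_tail = query_len - query_end  # unaligned residues at the query C terminus
    hit_tail = hit_len - hit_end        # unaligned residues at the hit C terminus
    return (evalue < max_evalue
            and query_tail <= max_unaligned
            and hit_tail <= max_unaligned)


# Tails of 9 and 4 unaligned residues; the lengths and E-value are invented.
print(passes_c_terminal_filter(1e-20, query_len=300, query_end=291,
                               hit_len=310, hit_end=306))
```

The stricter cutoff used later for the cross-species searches corresponds to calling the same function with max_evalue=1e-6.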

9 unaligned residues

...iiihhhhhhhhhhhh---hhhhhhhhhooooooooooooooooooo ooooooooo
...FVLYAGFALVIGCFW---YFSPISFGMEGPSSNFRYLNWFSTWDIA ......     query
    V Y  ++L  GC +    F+PI  GM G  + +  L W STWDIA            BLAST alignment
...MVKYPIYSLFGGCIYIYNLFAPICQGMHGDKAEYLPLQWLSTWDIA ....       hit
...iiiiihhhhhhhhhhhhhhhhhhhhhhooooooooooooooooooo oooo

4 unaligned residues

Figure 8. An example of the “BLAST approach”. The grey area marks the aligned region between the query and hit protein sequences. The upper topology prediction belongs to the query protein, the lower to the hit protein and both are constrained by their corresponding experimentally determined C-terminal location (i; inside, h; helix, o; outside). Here the unaligned residues in the C termini are 9 and 4 respectively.

All proteins in our data set that fulfilled our search criteria matched homologs with the same C-terminal assignment except two, Ygl263wp and Ynr002cp. Ygl263wp is a member of the large COS (conserved sequence) family and was assigned a lumenal C terminus (Cout) while the other eight COS family members in our data set were assigned a cytoplasmic C terminus (Cin). The Cout orientation of Ygl263wp was further confirmed by three internal N-linked glycosylation sites in loops that all face the ER lumen (and are thus glycosylated) when the C terminus is also located there. The opposite orientation was also supported by the positive-inside rule, as accumulation of positively charged residues was found in different loops for Ygl263wp compared to the other family members. The second protein of contradicting orientation, Ynr002cp, belongs to a family of ATO (ammonia/ammonium transport outward) proteins. It also has a Cout assignment whereas two homologous proteins in our data set have Cin assignments. This family was however not studied further. The presence of families with opposite orientations opens up interesting possibilities for interpreting the evolution of membrane proteins and their topologies. Gene duplication followed by divergent topology evolution, or the adoption of dual topologies by a single protein, can explain the experimental results. These phenomena have earlier been observed in E. coli (Sääf et al., 1999); a few cases were also identified in paper IV and further analyzed by Rapp et al. (2006).

Strengthened by the validation tests (only two proteins showed deviating orientations between homologs whereas the rest were completely consistent) we extended our initial data set by repeating the “BLAST approach”, but this time applied to the unassigned yeast proteins, and transferred the query assignment to the hit if the search criteria were matched. We could thereby assign a C-terminal location to another 41 proteins, increasing the total number of assigned proteins to 546 (also including the 37 from paper II).
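The positive-inside rule invoked above for Ygl263wp can be checked with a simple count of positively charged residues on each side of the membrane. The sketch below is illustrative only: the function names and the example sequence and topology are invented, and only Lys and Arg are counted.

```python
def loop_charges(seq, topology):
    """Count Lys and Arg residues in inside ('i') and outside ('o') loops.

    seq and topology are strings of equal length; topology uses the labels
    i (inside), h (helix) and o (outside).
    """
    if len(seq) != len(topology):
        raise ValueError("sequence and topology must have the same length")
    counts = {"i": 0, "o": 0}
    for aa, label in zip(seq, topology):
        if label in counts and aa in "KR":
            counts[label] += 1
    return counts


def consistent_with_positive_inside(seq, topology):
    """True if the inside loops carry at least as many Lys/Arg as the outside loops."""
    counts = loop_charges(seq, topology)
    return counts["i"] >= counts["o"]


# Invented example: a charged inside loop, one helix, an uncharged outside loop.
example_seq = "MKRKLLLLLLLLSDE"
example_top = "iiiihhhhhhhhooo"
print(loop_charges(example_seq, example_top))
```

Comparing the counts for two alternative topologies of the same protein indicates which orientation the charge distribution favours.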

By applying TMHMM and using the C-terminal locations as constraints we produced topology models for the 546 proteins (Fig. 9). Perhaps the most striking observation is that proteins with a Cin orientation are far more frequent than those with a Cout orientation (82% vs. 18% and similar proportions for the E. coli membrane proteome). Even numbers of transmembrane regions dominate and thus topologies with both N and C termini located in the cytoplasm are the most common. This might suggest that so-called “helical hairpins” (two closely spaced TM helices) are a basic building block in co-translational insertion of MPs into the membrane. The functional categories seen in Figure 9 are taken from the Gene Ontology (GO) terms (Ashburner et al., 2000) and correlate well with the number of transmembrane helices. For example, proteins with 10 or more TM helices are mainly involved in solute transport and proteins with few predicted TM helices are usually of unknown function. One category that deviates from the general Cin trend is the protein modification class (yellow in figure), of which about 50% are Cout, which makes sense since protein modification largely takes place in the ER lumen.

[Figure 9 graphic: histogram of the number of proteins (y axis; Cin upward, Cout downward) against the number of transmembrane helices (x axis, 0 to 20), with an inset pie chart of GO categories: vesicle-mediated transport, unknown, transport, protein modification, lipid metabolism, other function and organelle organisation.]

Figure 9. A histogram over the topology distribution for the 546 membrane proteins with an assigned C terminus. Bars upward correspond to Cin topologies and bars downward correspond to Cout topologies. The pie chart shows the GO annotations for the yeast membrane proteins. The white bars inside the colored bars represent the corresponding distribution for the E. coli proteome.

Comparison to the topology distribution in E. coli shows many similarities but also a few obvious differences, for instance the higher fraction of transport proteins in E. coli with Nin-6TM-Cin topology and the higher fraction of possible G protein-coupled receptors (GPCRs) with a Nout-7TM-Cin topology in S. cerevisiae.
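The topology names used here follow directly from the C-terminal location and the number of TM helices: each helix crosses the membrane once, so an even number of helices places the N terminus on the same side as the C terminus and an odd number places it on the opposite side. A minimal sketch (the function name is hypothetical, and re-entrant regions are not considered):

```python
def topology_name(c_terminus: str, n_tm: int) -> str:
    """Build a name such as 'Nin-6TM-Cin' from the C-terminal side and helix count."""
    if c_terminus not in ("in", "out"):
        raise ValueError("c_terminus must be 'in' or 'out'")
    if n_tm < 0:
        raise ValueError("n_tm must be non-negative")
    opposite = {"in": "out", "out": "in"}
    # Each TM helix flips the side once, so only the parity of n_tm matters.
    n_terminus = c_terminus if n_tm % 2 == 0 else opposite[c_terminus]
    return f"N{n_terminus}-{n_tm}TM-C{c_terminus}"


print(topology_name("in", 6))
print(topology_name("in", 7))
```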


Since it seems rare that homologous proteins have opposite orientations, we believe that homology-based C-terminal mapping is reliable with a low error rate, and we therefore wanted to expand our analysis to other eukaryotic membrane proteomes. First we used the C-terminally assigned yeast proteins (excluding the COS and ATO families) as queries in BLAST searches against a database of predicted membrane proteins from 38 different fully sequenced eukaryotic genomes. We applied the same criteria as in the “BLAST approach” described above, except that we used a stricter E-value cutoff of 10⁻⁶ here. All homologs for which we could infer a C-terminal assignment were used in a second BLAST run against the same database, and altogether 13,281 eukaryotic proteins homologous to S. cerevisiae proteins were found to which a C-terminal orientation could be assigned (Fig. 10).

Subsequently we used 612 E. coli proteins from paper IV that were also C-terminally assigned in an additional two-step BLAST search against the eukaryotic database. This generated 4,051 further homologs for which the C-terminal locations could be assigned (Fig. 10). Out of these, 2,522 overlapped with the S. cerevisiae homologs and in all cases the C-terminal assignments agreed, supporting our assumption that homology-based C-terminal mapping is appropriate. Interestingly, eukaryotic membrane proteins only homologous to E. coli proteins often turned out to be located in the mitochondria or the chloroplasts, which naturally can be related to the prokaryotic origin of these organelles.

Combining the results for S. cerevisiae and E. coli, in total 14,810 eukaryotic membrane proteins were C-terminally assigned. For these proteins we also ran constrained TMHMM predictions to produce topology models.

A similar study for bacterial membrane proteins has also recently been performed (Granseth et al., 2005b).
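The combined total follows from counting the overlapping homologs only once. The short calculation below uses the three numbers given in the text; the variable and function names are chosen for illustration.

```python
def union_size(size_a: int, size_b: int, overlap: int) -> int:
    """Size of the union of two sets, given their sizes and the size of their overlap."""
    return size_a + size_b - overlap


yeast_derived = 13_281   # homologs assigned from S. cerevisiae proteins
ecoli_derived = 4_051    # homologs assigned from E. coli proteins
overlapping = 2_522      # homologs assigned from both

total = union_size(yeast_derived, ecoli_derived, overlapping)
ecoli_only = ecoli_derived - overlapping
print(total, ecoli_only)
```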

[Figure 10 graphic: bar chart of the number of homologs (y axis, 0 to 900) per organism (x axis), with organisms grouped as animals, plants, parasites and fungi. Legend: S. cerevisiae homologs, overlapping homologs, E. coli homologs, unique S. cerevisiae homologs.]

Figure 10. Homologs in 38 eukaryotic genomes with C-terminal orientations assigned from either the 534 S. cerevisiae proteins or the 612 previously analyzed E. coli proteins. Dark red bars represent homologs assigned only by S. cerevisiae proteins in each organism; yellow bars represent homologs assigned only by E. coli proteins; orange bars represent homologs assigned by both S. cerevisiae and E. coli proteins. Dark blue bars show the number of unique S. cerevisiae proteins that have at least one homolog in the other genome. Organism abbreviations as in paper V.

In conclusion, we have been able to generate global topology maps for the two important model organisms E. coli and S. cerevisiae. By homology we have further transferred the C-terminal assignments to a large number of eukaryotic membrane proteins, and from our observations so far this transfer appears to be reliable. The extensive amount of C-terminal location data can be used for benchmarking topology prediction methods, which hopefully will prove valuable in the light of the limited structural and other experimental data on membrane proteins.

5 Discussion and future perspectives

What have we learnt from our studies and what future directions should be taken to advance the exploration of the membrane proteome?

First, the results summarized in papers I-V show that topology prediction indeed benefits from including experimental information and that the location of the C termini of the majority of polytopic membrane proteins in both E. coli and S. cerevisiae can be determined efficiently. By looking at the distribution of membrane protein topologies in the two organisms we can get some clues about the function of as yet unannotated proteins. We have also shown that it is possible to transfer C-terminal assignments to homologs among a large number of eukaryotic membrane proteins, for which it would be difficult to perform the same sort of experiments.

How can the results be used in the future? One obvious way is to apply more recently developed prediction methods with higher accuracy than TMHMM, for example prodiv-TMHMM (Viklund and Elofsson, 2004), and to constrain the predictions with our experimental data to produce even better topologies. The ability to constrain predictions is not limited to the C-terminal region of a membrane protein. Any part for which the location is known (inside, outside or membrane-spanning) can be constrained. The TMHMMfix website (http://www.sbc.su.se/~melen/TMHMMfix/) allows the user to include all known topological information for a membrane protein and thereby increase the chance of retrieving the correct prediction. The reliability score is useful for estimating the likelihood that a given prediction is correct. High scores correlate with more accurate predictions, whereas low scores identify proteins for which the predicted topologies are more uncertain and for which a detailed experimental topology mapping is therefore probably worth the effort.

Topology prediction can furthermore be valuable for target selection in structural genomics, where the goal is to determine a structural representative for every protein family, which makes it important to choose proteins with potentially new folds. The selection can further take advantage of information on which proteins express well (since large amounts of protein are a prerequisite for structure determination), which is estimated for E. coli in paper IV and for S. cerevisiae in an accompanying study to paper V (Österberg et al., 2006).
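As a rough illustration of how known topological information and a reliability score might be used together, the following Python sketch picks, among several candidate topologies, the highest-scoring one that agrees with all experimentally known positions, and flags low-scoring choices for experimental follow-up. It is a simplification under stated assumptions and not the TMHMMfix implementation, which applies the constraints during the prediction itself rather than filtering finished candidates. The function names, the per-residue labels (i, o, M), the toy scores and the 0.5 threshold are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Candidate = Tuple[str, float]  # (per-residue labels, score); both hypothetical


def satisfies(labels: str, constraints: Dict[int, str]) -> bool:
    """True if every constrained 1-based position carries the required label."""
    return all(
        1 <= pos <= len(labels) and labels[pos - 1] == lab
        for pos, lab in constraints.items()
    )


def best_consistent(
    candidates: List[Candidate], constraints: Dict[int, str]
) -> Optional[Candidate]:
    """Highest-scoring candidate that agrees with all constraints, or None."""
    ok = [c for c in candidates if satisfies(c[0], constraints)]
    return max(ok, key=lambda c: c[1]) if ok else None


def needs_experiment(score: float, threshold: float = 0.5) -> bool:
    """Flag low-scoring predictions; the 0.5 threshold is an arbitrary assumption."""
    return score < threshold


# Toy candidates for a 19-residue sequence: i = inside, o = outside, M = membrane.
candidates = [
    ("i" * 3 + "M" * 5 + "o" * 3 + "M" * 5 + "i" * 3, 0.55),
    ("o" * 3 + "M" * 5 + "i" * 3 + "M" * 5 + "o" * 3, 0.40),
    ("i" * 3 + "M" * 5 + "o" * 11, 0.05),
]

# Suppose a C-terminal reporter fusion shows that the last residue is outside.
chosen = best_consistent(candidates, {19: "o"})
print(chosen)
print(needs_experiment(chosen[1]))
```

In this toy case the unconstrained best candidate is rejected by the C-terminal constraint, and the selected alternative scores low enough to be flagged, which mirrors the idea that low-reliability proteins are the ones where detailed experimental mapping pays off.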

Another way to improve topology prediction is to make the methods specific for different organisms, or at least to tune them for prokaryotic and eukaryotic proteins separately. There are obvious differences between organisms: for example, the positive-inside rule is more pronounced in prokaryotes than in eukaryotes, the lipid composition varies among species, and differences in the extramembranous environment might affect the amino acid distributions.

However, topology prediction methods will most likely have an upper accuracy limit considerably below 100%. Polar and charged residues within transmembrane helices will continue to confuse prediction algorithms unless helix packing interactions are explicitly taken into account. The next generation of prediction algorithms will therefore need to model the contacts between the helices. New propensity scales for predicting buried and exposed residues in transmembrane helices have already been used to predict the spatial relationships between the helices (Adamian and Liang, 2006; Park and Helms, 2007). Full three-dimensional prediction trials have also been made (Yarov-Yarovoy et al., 2006), notably using topology prediction methods initially to define the positions of the ends of each helix and then adding helices one by one while modeling their interactions and orientations. Still, there is a long way to go before reliable 3D prediction of membrane protein structure is achieved. Moreover, a complete 3D structure contains not only the membrane-spanning helices but also the extramembranous loop regions, with substructures such as β-strands, interface helices and re-entrant helices, as well as larger globular protein domains.

Finally, until it becomes possible either to quickly determine the 3D structure of a membrane protein experimentally or to predict it accurately, there is motivation for doing topology predictions. Why? Because the structural data are limited and a predicted topology can be used for classification, can suggest possible functions and can guide the design of experiments. How? By applying one or several prediction methods (preferably using the constraining ability if any topological information is available, and preferably also including methods that can detect signal peptides and/or other substructures) and by carefully analyzing and comparing the outputs. When? When the aim is to obtain quick and cheap information about membrane proteins, in particular in the context of large-scale studies.

Acknowledgements

I would like to express my sincere gratitude to all people who have supported and encouraged me and made this work possible. I am especially grateful to the following persons:

Gunnar von Heijne, my supervisor, for sharing your immense scientific knowledge and stimulating research. Thanks for all great assistance over the years and for always having time when it is needed. It has been a pleasure to be a student in your lab.

Hugh Salter, my co-supervisor at AstraZeneca, for giving me the opportunity to be part of your group. Thanks for both scientific inspiration and all social events and your hospitality. Thanks also to Henrik, Ingela and Kerstin for always making me feel so welcome at AstraZeneca.

Bengt Persson, Lena Lewin and Per-Erik Jansson at The PhD Programme in Medical Bioinformatics at the Karolinska Institute for educational guidance and financial support through the Swedish Knowledge Foundation and AstraZeneca.

Stefan Nordlund, for being a cool and relaxed senior researcher at DBB with a contagious interest in science. Thanks also to all past and present people at the secretariat for always being so kind and helpful.

Anders Krogh for our fruitful collaboration and for kindly letting me visit your lab in .

Arne and Erik L (“Piff & Puff”) for creating an inspiring research environment and open atmosphere at CBR.

Joy, Marie Ö, Mikaela, and Dan, my co-authors and gurus at the lab, for good collaboration and for teaching me all about your favorite organisms. Thanks also for enduring my recurrent questions on experimental details.

Erik G, Håkan, Andreas, Johan N and Lukas for interesting and valuable discussions on topology prediction of membrane proteins and for being enthusiastic colleagues.

Sara L, my former roommate, for discussing all that is relevant in life and for being such a good listener. Thanks for all help with linguistic issues. I do miss you a lot!

Anna J, Per L and Diana, my present roommates, for lighting up our office and for taking care of my flowers. Thanks also to Åsa B and Johannes for being nice friends with super positive attitudes, and to all other people at CBR for being such pleasant company.

Erik Sj, my computer-hero, for your patience and willingness to rescue me when I am distressed.

Olivia, Olof, Bob and all other previous colleagues at SBC for making everyday life at work so fun and for our delicious cake club. I have always enjoyed going to work (especially on Fridays…).

Marika, my favorite lunch mate, for all your warm support and your kind-hearted personality. I am also happy that we did a good job on timing our pregnancies perfectly!

Marie, my best friend, for your endless encouragement and inspiration and for always being so caring, and also for being so incredibly fun! Your charming humor is catching!

Helena, Eva, Sara N and Åsa G for the unforgettable years in Uppsala. Thanks for all great fun while studying and living together, and for our annual Christmas baking day. I already long for December!

Anna N, Tove, Anja and Christin in “bokcirkeln” for the best evening per month. Every session is an energy booster and a laugh releaser which I can live on for a long time!

Linda, my twin spirit, for always being close in mind despite the distance to Paris. Without you as a dear friend I would never have been the one I am today and I can’t in words tell how much I appreciate you.

Birgitta and Ingemar, my mother and father, for your never-ending love and belief in me, your stability and unconditional support. You’re my best role models in life and a perfect mix of vitality and contemplation.

Jonas and Erik, my brothers with families, for your generosity and for making life rich in many ways. I love you!

Ture and Otto, my sons and sweethearts, for making the future so exciting and unpredictable. Thanks for bringing so much joy into my life and for giving every day meaning.

Finally I would like to thank the most wonderful person I know, my husband Henrik, for always standing by my side and for always making me happy. You are my everything. I love you with all my heart.

References

Adamian, L. and Liang, J. (2006) Prediction of transmembrane helix orientation in polytopic membrane proteins. BMC Struct Biol, 6, 13.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool. J Mol Biol, 215, 403-410.
Andersson, H. and von Heijne, G. (1994) Membrane protein topology: effects of delta mu H+ on the translocation of charged residues explain the 'positive inside' rule. EMBO J, 13, 2267-2272.
Anfinsen, C.B. (1973) Principles that govern the folding of protein chains. Science, 181, 223-230.
Arai, M., Mitsuke, H., Ikeda, M., Xia, J.X., Kikuchi, T., Satake, M. and Shimizu, T. (2004) ConPred II: a consensus prediction method for obtaining transmembrane topology models with high reliability. Nucleic Acids Res, 32, W390-393.
Arechaga, I., Miroux, B., Karrasch, S., Huijbregts, R., de Kruijff, B., Runswick, M.J. and Walker, J.E. (2000) Characterisation of new intracellular membranes in Escherichia coli accompanying large scale overproduction of the b subunit of F(1)F(o) ATP synthase. FEBS Lett, 482, 215-219.
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M. and Sherlock, G. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25, 25-29.
Bartlett, G.J., Todd, A.E. and Thornton, J.M. (2003) Inferring protein function from structure. Methods Biochem Anal, 44, 387-407.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. and Bourne, P.E. (2000) The Protein Data Bank. Nucleic Acids Res, 28, 235-242.
Bernsel, A. and von Heijne, G. (2005) Improved membrane protein topology prediction by domain assignments. Protein Sci, 14, 1723-1728.
Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., O'Donovan, C., Phan, I., Pilbout, S. and Schneider, M. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res, 31, 365-370.
Bowie, J.U. (1997) Helix packing in membrane proteins. J Mol Biol, 272, 780-789.

Bradley, P., Misura, K.M. and Baker, D. (2005) Toward high-resolution de novo structure prediction for small proteins. Science, 309, 1868-1871.
Brenner, S.E. (2001) A tour of structural genomics. Nat Rev Genet, 2, 801-809.
Caffrey, M. (2003) Membrane protein crystallization. J Struct Biol, 142, 108-132.
Canfield, V.A. and Levenson, R. (1993) Transmembrane organization of the Na,K-ATPase determined by epitope addition. Biochemistry, 32, 13782-13786.
Chamberlain, A.K., Faham, S., Yohannan, S. and Bowie, J.U. (2003) Construction of helix-bundle membrane proteins. Adv Protein Chem, 63, 19-46.
Chang, X.B., Hou, Y.X., Jensen, T.J. and Riordan, J.R. (1994) Mapping of cystic fibrosis transmembrane conductance regulator membrane topology by glycosylation site insertion. J Biol Chem, 269, 18572-18575.
Chen, C.P., Kernytsky, A. and Rost, B. (2002) Transmembrane helix predictions revisited. Protein Sci, 11, 2774-2791.
Cuthbertson, J.M., Doyle, D.A. and Sansom, M.S. (2005) Transmembrane helix prediction: a comparative evaluation and analysis. Protein Eng Des Sel, 18, 295-308.
de Planque, M.R., Kruijtzer, J.A., Liskamp, R.M., Marsh, D., Greathouse, D.V., Koeppe, R.E., 2nd, de Kruijff, B. and Killian, J.A. (1999) Different membrane anchoring positions of tryptophan and lysine in synthetic transmembrane alpha-helical peptides. J Biol Chem, 274, 20839-20846.
Deak, P.M. and Wolf, D.H. (2001) Membrane topology and function of Der3/Hrd1p as a ubiquitin-protein ligase (E3) involved in endoplasmic reticulum degradation. J Biol Chem, 276, 10663-10669.
Drew, D., Froderberg, L., Baars, L. and de Gier, J.W. (2003) Assembly and overexpression of membrane proteins in Escherichia coli. Biochim Biophys Acta, 1610, 3-10.
Drew, D., Sjostrand, D., Nilsson, J., Urbig, T., Chin, C.N., de Gier, J.W. and von Heijne, G. (2002) Rapid topology mapping of Escherichia coli inner-membrane proteins by prediction and PhoA/GFP fusion analysis. Proc Natl Acad Sci U S A, 99, 2690-2695.
Ehrmann, M., Boyd, D. and Beckwith, J. (1990) Genetic analysis of membrane protein topology by a sandwich gene fusion approach. Proc Natl Acad Sci U S A, 87, 7574-7578.
Eilers, M., Shekar, S.C., Shieh, T., Smith, S.O. and Fleming, P.J. (2000) Internal packing of helical membrane proteins. Proc Natl Acad Sci U S A, 97, 5796-5801.
Engelman, D.M., Steitz, T.A. and Goldman, A. (1986) Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins. Annu Rev Biophys Biophys Chem, 15, 321-353.

Eyre, T.A., Partridge, L. and Thornton, J.M. (2004) Computational analysis of alpha-helical membrane protein structure: implications for the prediction of 3D structural models. Protein Eng Des Sel, 17, 613-624.
Feilmeier, B.J., Iseminger, G., Schroeder, D., Webber, H. and Phillips, G.J. (2000) Green fluorescent protein functions as a reporter for protein localization in Escherichia coli. J Bacteriol, 182, 4068-4076.
Fleishman, S.J. and Ben-Tal, N. (2006) Progress in structure prediction of alpha-helical membrane proteins. Curr Opin Struct Biol, 16, 496-504.
Forrest, L.R., Tang, C.L. and Honig, B. (2006) On the accuracy of homology modeling and sequence alignment methods applied to membrane proteins. Biophys J, 91, 508-517.
Garrow, A.G., Agnew, A. and Westhead, D.R. (2005) TMB-Hunt: an amino acid composition based method to screen proteomes for beta-barrel transmembrane proteins. BMC Bioinformatics, 6, 56.
Goder, V., Junne, T. and Spiess, M. (2004) Sec61p contributes to signal sequence orientation according to the positive-inside rule. Mol Biol Cell, 15, 1470-1478.
Granseth, E., Daley, D.O., Rapp, M., Melén, K. and von Heijne, G. (2005b) Experimentally constrained topology models for 51,208 bacterial inner membrane proteins. J Mol Biol, 352, 489-494.
Granseth, E., von Heijne, G. and Elofsson, A. (2005a) A study of the membrane-water interface region of membrane proteins. J Mol Biol, 346, 377-385.
Grant, A., Lee, D. and Orengo, C. (2004) Progress towards mapping the universe of protein folds. Genome Biol, 5, 107.
Henderson, R., Baldwin, J.M., Ceska, T.A., Zemlin, F., Beckmann, E. and Downing, K.H. (1990) Model for the structure of bacteriorhodopsin based on high-resolution electron cryo-microscopy. J Mol Biol, 213, 899-929.
Henderson, R. and Unwin, P.N. (1975) Three-dimensional model of purple membrane obtained by electron microscopy. Nature, 257, 28-32.
Hessa, T., Kim, H., Bihlmaier, K., Lundin, C., Boekel, J., Andersson, H., Nilsson, I., White, S.H. and von Heijne, G. (2005) Recognition of transmembrane helices by the endoplasmic reticulum translocon. Nature, 433, 377-381.
Hopkins, A.L. and Groom, C.R. (2002) The druggable genome. Nat Rev Drug Discov, 1, 727-730.
Ikeda, M., Arai, M., Lao, D.M. and Shimizu, T. (2002) Transmembrane topology prediction methods: a re-assessment and improvement by a consensus method using a dataset of experimentally-characterized transmembrane topologies. In Silico Biol, 2, 19-33.
Jones, D.T. (2007) Improving the accuracy of transmembrane protein topology prediction using evolutionary information. Bioinformatics, 23, 538-544.

Jones, D.T., Taylor, W.R. and Thornton, J.M. (1994a) A mutation data matrix for transmembrane proteins. FEBS Lett, 339, 269-275.
Jones, D.T., Taylor, W.R. and Thornton, J.M. (1994b) A model recognition approach to the prediction of all-helical membrane protein structure and topology. Biochemistry, 33, 3038-3049.
Kainosho, M., Torizawa, T., Iwashita, Y., Terauchi, T., Mei Ono, A. and Guntert, P. (2006) Optimal isotope labelling for NMR protein structure determinations. Nature, 440, 52-57.
Kernytsky, A. and Rost, B. (2003) Static benchmarking of membrane helix predictions. Nucleic Acids Res, 31, 3642-3644.
Killian, J.A. and von Heijne, G. (2000) How proteins adapt to a membrane-water interface. Trends Biochem Sci, 25, 429-434.
Kim, H., von Heijne, G. and Nilsson, I. (2005) Membrane topology of the STT3 subunit of the oligosaccharyl transferase complex. J Biol Chem, 280, 20261-20267.
Kimura, T., Ohnuma, M., Sawai, T. and Yamaguchi, A. (1997) Membrane topology of the transposon 10-encoded metal-tetracycline/H+ antiporter as studied by site-directed chemical labeling. J Biol Chem, 272, 580-585.
Krogh, A. (1997) Two methods for improving performance of an HMM and their application for gene finding. Proc Int Conf Intell Syst Mol Biol, 5, 179-186.
Krogh, A., Larsson, B., von Heijne, G. and Sonnhammer, E.L. (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol, 305, 567-580.
Kyte, J. and Doolittle, R.F. (1982) A simple method for displaying the hydropathic character of a protein. J Mol Biol, 157, 105-132.
Käll, L., Krogh, A. and Sonnhammer, E.L. (2004) A combined transmembrane topology and signal peptide prediction method. J Mol Biol, 338, 1027-1036.
Käll, L. and Sonnhammer, E.L. (2002) Reliability of transmembrane predictions in whole-genome data. FEBS Lett, 532, 415-418.
Laskowski, R.A., Watson, J.D. and Thornton, J.M. (2003) From protein structure to biochemical function? J Struct Funct Genomics, 4, 167-177.
Lomize, M.A., Lomize, A.L., Pogozheva, I.D. and Mosberg, H.I. (2006) OPM: orientations of proteins in membranes database. Bioinformatics, 22, 623-625.
Luirink, J., von Heijne, G., Houben, E. and de Gier, J.W. (2005) Biogenesis of inner membrane proteins in Escherichia coli. Annu Rev Microbiol, 59, 329-355.
Manoil, C. (1991) Analysis of membrane protein topology using alkaline phosphatase and beta-galactosidase gene fusions. Methods Cell Biol, 34, 61-75.
Manoil, C. and Beckwith, J. (1986) A genetic approach to analyzing membrane protein topology. Science, 233, 1403-1408.

Marsden, R.L., Lee, D., Maibaum, M., Yeats, C. and Orengo, C.A. (2006) Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space. Nucleic Acids Res, 34, 1066-1080.
Massotte, D. (2003) G protein-coupled receptor overexpression with the baculovirus-insect cell system: a tool for structural and functional studies. Biochim Biophys Acta, 1610, 77-89.
Monné, M., Nilsson, I., Johansson, M., Elmhed, N. and von Heijne, G. (1998) Positively and negatively charged residues have different effects on the position in the membrane of a model transmembrane helix. J Mol Biol, 284, 1177-1183.
Murata, K., Mitsuoka, K., Hirai, T., Walz, T., Agre, P., Heymann, J.B., Engel, A. and Fujiyoshi, Y. (2000) Structural determinants of water permeation through aquaporin-1. Nature, 407, 599-605.
Möller, S., Croning, M.D. and Apweiler, R. (2001) Evaluation of methods for the prediction of membrane spanning regions. Bioinformatics, 17, 646-653.
Nilsson, J., Persson, B. and von Heijne, G. (2000) Consensus predictions of membrane protein topology. FEBS Lett, 486, 267-269.
Nilsson, J., Persson, B. and von Heijne, G. (2002) Prediction of partial membrane protein topologies using a consensus approach. Protein Sci, 11, 2974-2980.
Nilsson, J., Persson, B. and von Heijne, G. (2005) Comparative analysis of amino acid distributions in integral membrane proteins from 107 genomes. Proteins, 60, 606-616.
Oberai, A., Ihm, Y., Kim, S. and Bowie, J.U. (2006) A limited universe of membrane protein families and folds. Protein Sci, 15, 1723-1734.
Opella, S.J., Nevzorov, A., Mesleb, M.F. and Marassi, F.M. (2002) Structure determination of membrane proteins by NMR spectroscopy. Biochem Cell Biol, 80, 597-604.
Park, Y. and Helms, V. (2007) On the derivation of propensity scales for predicting exposed transmembrane residues of helical membrane proteins. Bioinformatics, 23, 701-708.
Persson, B. and Argos, P. (1994) Prediction of transmembrane segments in proteins utilising multiple sequence alignments. J Mol Biol, 237, 182-192.
Popot, J.L. and Engelman, D.M. (1990) Membrane protein folding and oligomerization: the two-stage model. Biochemistry, 29, 4031-4037.
Popot, J.L. and Engelman, D.M. (2000) Helical membrane protein folding, stability, and evolution. Annu Rev Biochem, 69, 881-922.
Rapp, M., Granseth, E., Seppala, S. and von Heijne, G. (2006) Identification and evolution of dual-topology membrane proteins. Nat Struct Mol Biol, 13, 112-116.
Rogl, H., Kosemund, K., Kuhlbrandt, W. and Collinson, I. (1998) Refolding of Escherichia coli produced membrane protein inclusion bodies immobilised by nickel chelating chromatography. FEBS Lett, 432, 21-26.

Rost, B. (1996) PHD: predicting one-dimensional protein structure by profile-based neural networks. Methods Enzymol, 266, 525-539.
Rost, B., Fariselli, P. and Casadio, R. (1996) Topology prediction for helical transmembrane proteins at 86% accuracy. Protein Sci, 5, 1704-1718.
Rost, B. and Sander, C. (1993) Improved prediction of protein secondary structure by use of sequence profiles and neural networks. Proc Natl Acad Sci U S A, 90, 7558-7562.
Russell, R.B. and Eggleston, D.S. (2000) New roles for structure in biology and drug discovery. Nat Struct Biol, 7 Suppl, 928-930.
Schulz, G.E. (2000) beta-Barrel membrane proteins. Curr Opin Struct Biol, 10, 443-447.
Senes, A., Gerstein, M. and Engelman, D.M. (2000) Statistical analysis of amino acid patterns in transmembrane helices: the GxxxG motif occurs frequently and in association with beta-branched residues at neighboring positions. J Mol Biol, 296, 921-936.
Simon, I., Fiser, A. and Tusnady, G.E. (2001) Predicting protein conformation by statistical methods. Biochim Biophys Acta, 1549, 123-136.
Singer, S.J. and Nicolson, G.L. (1972) The fluid mosaic model of the structure of cell membranes. Science, 175, 720-731.
Skolnick, J., Fetrow, J.S. and Kolinski, A. (2000) Structural genomics and its importance for gene function analysis. Nat Biotechnol, 18, 283-287.
Sonnhammer, E.L., von Heijne, G. and Krogh, A. (1998) A hidden Markov model for predicting transmembrane helices in protein sequences. Proc Int Conf Intell Syst Mol Biol, 6, 175-182.
Sääf, A., Johansson, M., Wallin, E. and von Heijne, G. (1999) Divergent evolution of membrane protein topology: the Escherichia coli RnfA and RnfE homologues. Proc Natl Acad Sci U S A, 96, 8540-8544.
Tate, C.G. (2001) Overexpression of mammalian integral membrane proteins for structural studies. FEBS Lett, 504, 94-98.
Taussig, R. and Carlson, M. (1983) Nucleotide sequence of the yeast SUC2 gene for invertase. Nucleic Acids Res, 11, 1943-1954.
Thornton, J. (2001) Structural genomics takes off. Trends Biochem Sci, 26, 88-89.
Todd, A.E., Marsden, R.L., Thornton, J.M. and Orengo, C.A. (2005) Progress of structural genomics initiatives: an analysis of solved target structures. J Mol Biol, 348, 1235-1260.
Torres, J., Stevens, T.J. and Samso, M. (2003) Membrane proteins: the 'Wild West' of structural biology. Trends Biochem Sci, 28, 137-144.
Traxler, B., Boyd, D. and Beckwith, J. (1993) The topological analysis of integral cytoplasmic membrane proteins. J Membr Biol, 132, 1-11.
Tusnady, G.E. and Simon, I. (1998) Principles governing amino acid composition of integral membrane proteins: application to topology prediction. J Mol Biol, 283, 489-506.
Tusnady, G.E. and Simon, I. (2001) The HMMTOP transmembrane topology prediction server. Bioinformatics, 17, 849-850.

Ubarretxena-Belandia, I. and Engelman, D.M. (2001) Helical membrane proteins: diversity of functions in the context of simple architecture. Curr Opin Struct Biol, 11, 370-376.
Ulmschneider, M.B., Sansom, M.S. and Di Nola, A. (2005) Properties of integral membrane protein structures: derivation of an implicit membrane potential. Proteins, 59, 252-265.
Wagner, S., Bader, M.L., Drew, D. and de Gier, J.W. (2006) Rationalizing membrane protein overexpression. Trends Biotechnol, 24, 364-371.
Walian, P., Cross, T.A. and Jap, B.K. (2004) Structural genomics of membrane proteins. Genome Biol, 5, 215.
Wallin, E. and von Heijne, G. (1998) Genome-wide analysis of integral membrane proteins from eubacterial, archaean, and eukaryotic organisms. Protein Sci, 7, 1029-1038.
van Geest, M. and Lolkema, J.S. (2000) Membrane topology and insertion of membrane proteins: search for topogenic signals. Microbiol Mol Biol Rev, 64, 13-33.
van Klompenburg, W., Nilsson, I., von Heijne, G. and de Kruijff, B. (1997) Anionic phospholipids are determinants of membrane protein topology. EMBO J, 16, 4261-4266.
White, S.H. (2004) The progress of membrane protein structure determination. Protein Sci, 13, 1948-1949.
White, S.H. and Wimley, W.C. (1999) Membrane protein folding and stability: physical principles. Annu Rev Biophys Biomol Struct, 28, 319-365.
Viklund, H. and Elofsson, A. (2004) Best alpha-helical transmembrane protein topology predictions are achieved using hidden Markov models and evolutionary information. Protein Sci, 13, 1908-1917.
Viklund, H., Granseth, E. and Elofsson, A. (2006) Structural classification and prediction of reentrant regions in alpha-helical transmembrane proteins: application to complete genomes. J Mol Biol, 361, 591-603.
Wimley, W.C. (2003) The versatile beta-barrel membrane protein. Curr Opin Struct Biol, 13, 404-411.
Wimley, W.C. and White, S.H. (1996) Experimentally determined hydrophobicity scale for proteins at membrane interfaces. Nat Struct Biol, 3, 842-848.
Vitkup, D., Melamud, E., Moult, J. and Sander, C. (2001) Completeness in structural genomics. Nat Struct Biol, 8, 559-566.
von Heijne, G. (1986) The distribution of positively charged residues in bacterial inner membrane proteins correlates with the trans-membrane topology. EMBO J, 5, 3021-3027.
von Heijne, G. (1992) Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule. J Mol Biol, 225, 487-494.
von Heijne, G. (2006) Membrane-protein topology. Nat Rev Mol Cell Biol, 7, 909-918.
Yan, Y. and Moult, J. (2005) Protein family clustering for structural genomics. J Mol Biol, 353, 744-759.

Yarov-Yarovoy, V., Schonbrun, J. and Baker, D. (2006) Multipass membrane protein structure prediction using Rosetta. Proteins, 62, 1010-1025.
Yau, W.M., Wimley, W.C., Gawrisch, K. and White, S.H. (1998) The preference of tryptophan for membrane interfaces. Biochemistry, 37, 14713-14718.
Yee, A.A., Savchenko, A., Ignachenko, A., Lukin, J., Xu, X., Skarina, T., Evdokimova, E., Liu, C.S., Semesi, A., Guido, V., Edwards, A.M. and Arrowsmith, C.H. (2005) NMR and X-ray crystallography, complementary tools in structural proteomics of small proteins. J Am Chem Soc, 127, 16512-16517.
Yu, H. (1999) Extending the size limit of protein nuclear magnetic resonance. Proc Natl Acad Sci U S A, 96, 332-334.
Österberg, M., Kim, H., Warringer, J., Melén, K., Blomberg, A. and von Heijne, G. (2006) Phenotypic effects of membrane protein overexpression in Saccharomyces cerevisiae. Proc Natl Acad Sci U S A, 103, 11148-11153.
