Mapping the Human Proteome Using Bioinformatic Methods

LINN FAGERBERG

Royal Institute of Technology School of Biotechnology Stockholm 2011

© Linn Fagerberg Stockholm 2011 Royal Institute of Technology School of Biotechnology AlbaNova University Center SE-106 91 Stockholm Sweden

Printed by Universitetsservice US-AB Drottning Kristinas väg 53B SE-100 44 Stockholm Sweden ISBN 978-91-7415-886-1 TRITA-BIO Report 2011:4 ISSN 1654-2312

Cover illustration: Network view of the distribution of subcellular locations in U-2 OS. Each protein is represented by a circle colored by the number of locations it is detected in out of a total of 16 subcellular compartments.

Linn Fagerberg (2011): Mapping the Human Proteome Using Bioinformatic Methods. School of Biotechnology, Royal Institute of Technology (KTH), Sweden

ABSTRACT

The fundamental goal of proteomics is to gain an understanding of the expression and function of the proteome on the level of individual proteins, on the level of defined cell types and on the level of the entire organism. In this thesis, the human proteome is explored using topology prediction methods to define the human membrane proteome and by global protein expression profiling, which relies on a complex study of the location and expression levels of proteins in tissues and cells.

A whole-proteome analysis was performed based on the predicted protein-coding genes of humans using a selection of membrane protein topology prediction methods. The study used a majority decision-based method, which estimated that approximately 26% of the human genes encode a membrane protein. The prediction results are displayed in a visualization tool to facilitate the selection of antigens to be used for antibody generation.

Global protein expression profiles in a large number of cells and tissues in the human body were analyzed for more than 4000 protein targets, based on data from the antibody-based immunohistochemistry and immunofluorescence methods within the framework of the Human Protein Atlas project. The results revealed few cell-type specific proteins and a high fraction of human proteins expressed in most cells, suggesting that cell and tissue specificity is attained by a fine-tuned regulation of protein levels. The expression profiles were also used to analyze the relationship between 45 cell lines by hierarchical clustering and principal component analysis. The global protein expression patterns overall reflected the tumor origin of the cells, and also allowed for identification of proteins of importance for distinguishing different categories of cell lines, as defined by phenotype of progenitor cell. In addition, the protein distribution in 16 subcellular compartments in three of the human cell lines was mapped. A large fraction of proteins were localized in two or more compartments and, in line with previous results, a majority of proteins were detected in all three cell lines.

Finally, mass spectrometry-based protein expression levels were compared to RNA-seq-based transcript expression levels in three cell lines. Highly ubiquitous mRNA expression was found and the changes of expression levels between the cell lines showed high correlations between proteins and transcripts. Large general differences in abundance of proteins from various functional classes were observed. A comparison between categories based on expression levels revealed that, in general, genes with varying expression levels between the cell lines or only expressed in one cell line were highly enriched for cell-surface proteins.

These studies show a path for a systematic analysis to characterize the proteome in human cells, tissues and organs.

Keywords: proteome, transcriptome, bioinformatics, membrane protein prediction, subcellular localization, protein expression level, cell line, immunohistochemistry, immunofluorescence.

LIST OF PUBLICATIONS

This thesis is based upon the following five publications, which are referred to in the text by the corresponding Roman numerals (I-V). The papers are included in the Appendix.

I. Fagerberg L, Jonasson K, von Heijne G, Uhlén M, Berglund L. Prediction of the human membrane proteome. Proteomics. 2010 Mar; 10(6):1141-9.

II. Pontén F, Gry M, Fagerberg L, Lundberg E, Asplund A, Berglund L, Oksvold P, Björling E, Hober S, Kampf C, Navani S, Nilsson P, Ottosson J, Persson A, Wernérus H, Wester K, Uhlén M. A global view of protein expression in human cells, tissues, and organs. Mol Syst Biol. 2009 Dec;5:337.

III. Fagerberg L, Strömberg S, Gry M, El-Obeid A, Nilsson K, Uhlen M, Ponten F, Asplund A. The Global Protein Expression Pattern in Human Cell Lines. Manuscript.

IV. Fagerberg L*, Stadler C*, Skogs M, Hjelmare M, Jonasson K, Wiking M, Åbergh A, Uhlén M, Lundberg E. Mapping the subcellular protein distribution in three human cell lines. Submitted.

V. Lundberg E*, Fagerberg L*, Klevebring D, Matic I, Geiger T, Cox J, Älgenäs C, Lundeberg J, Mann M, Uhlen M. Defining the transcriptome and proteome in three functionally different cell lines. Mol Syst Biol. 2010 Dec 21;6:450.

* Authors contributed equally to this work.

All publications are reproduced with permission of the respective copyright holders.

RELATED PUBLICATIONS

Uhlen M, Oksvold P, Fagerberg L, Lundberg E, Jonasson K, Forsberg M, Zwahlen M, Kampf C, Wester K, Hober S, Wernerus H, Björling L, Ponten F. Towards a knowledge-based Human Protein Atlas. Nat Biotechnol. 2010 Dec;28(12):1248-50.

Klevebring D, Fagerberg L, Lundberg E, Emanuelsson O, Uhlén M, Lundeberg J. Analysis of transcript and protein overlap in a human osteosarcoma cell line. BMC Genomics. 2010 Dec 2;11:684.

Berglund L, Björling E, Oksvold P, Fagerberg L, Asplund A, Szigyarto CA, Persson A, Ottosson J, Wernérus H, Nilsson P, Lundberg E, Sivertsson A, Navani S, Wester K, Kampf C, Hober S, Pontén F, Uhlén M. A genecentric Human Protein Atlas for expression profiles based on antibodies. Mol Cell Proteomics. 2008 Oct;7(10):2019-27.

Berglund L, Björling E, Jonasson K, Rockberg J, Fagerberg L, Al-Khalili Szigyarto C, Sivertsson A, Uhlén M. A whole-genome bioinformatics approach to selection of antigens for systematic antibody generation. Proteomics. 2008 Jul;8(14):2832-9.

Uhlén M, Björling E, Agaton C, Szigyarto CA, Amini B, Andersen E, Andersson AC, Angelidou P, Asplund A, Asplund C, Berglund L, Bergström K, Brumer H, Cerjan D, Ekström M, Elobeid A, Eriksson C, Fagerberg L, Falk R, Fall J, Forsberg M, Björklund MG, Gumbel K, Halimi A, Hallin I, Hamsten C, Hansson M, Hedhammar M, Hercules G, Kampf C, Larsson K, Lindskog M, Lodewyckx W, Lund J, Lundeberg J, Magnusson K, Malm E, Nilsson P, Odling J, Oksvold P, Olsson I, Oster E, Ottosson J, Paavilainen L, Persson A, Rimini R, Rockberg J, Runeson M, Sivertsson A, Sköllermo A, Steen J, Stenvall M, Sterky F, Strömberg S, Sundberg M, Tegel H, Tourle S, Wahlund E, Waldén A, Wan J, Wernérus H, Westberg J, Wester K, Wrethagen U, Xu LL, Hober S, Pontén F. A human protein atlas for normal and cancer tissues based on antibody proteomics. Mol Cell Proteomics. 2005 Dec;4(12):1920-32.

CONTENTS

INTRODUCTION
1. THE DISCOVERY OF THE BUILDING BLOCKS OF LIFE
2. PROTEINS
2.1 CLASSIFICATION OF PROTEINS
2.2 MEMBRANE PROTEINS
2.3 PROTEIN ABUNDANCE
2.4 PROTEOMIC TOOLS FOR ANALYZING PROTEIN EXPRESSION
2.4.1 ANTIBODY-BASED PROTEIN PROFILING
2.4.2 MASS SPECTROMETRY-BASED METHODS
3. THE HUMAN PROTEIN ATLAS PROJECT
3.1 PROTEIN PROFILING USING IMMUNOHISTOCHEMISTRY
3.2 SUBCELLULAR PROFILING USING IMMUNOFLUORESCENCE-BASED CONFOCAL MICROSCOPY
4. BIOINFORMATICS AND DATA ANALYSIS
4.1 BIOLOGICAL DATABASES
4.1.1 GENE AND GENE EXPRESSION RELATED DATABASES
4.1.2 PROTEIN RELATED DATABASES
4.1.3 ONTOLOGIES
4.2 TOOLS FOR THE ANALYSIS AND VISUALIZATION OF LARGE-SCALE BIOLOGICAL DATA
4.2.1 HIERARCHICAL CLUSTERING
4.2.2 PRINCIPAL COMPONENT ANALYSIS
4.2.3 ENRICHMENT ANALYSIS OF GENE ANNOTATIONS
4.2.4 NETWORK ANALYSIS
4.3 TRANSMEMBRANE PROTEIN TOPOLOGY PREDICTION
4.3.1 TRANSMEMBRANE PROTEIN FEATURES
4.3.2 MEMBRANE PROTEIN TOPOLOGY PREDICTION BASED ON AMINO ACID PROPERTIES
4.3.3 MACHINE LEARNING-BASED APPROACHES FOR MEMBRANE PROTEIN TOPOLOGY PREDICTION
4.3.4 SIGNAL PEPTIDE PREDICTION
4.3.5 TOPOLOGY PREDICTION PERFORMANCE

PRESENT INVESTIGATION
5. OBJECTIVE
6. MAPPING THE HUMAN MEMBRANE PROTEOME (I)
7. GLOBAL ANALYSIS OF PROTEIN EXPRESSION IN HUMAN CELLS AND TISSUES (II & III)
8. MAPPING THE SUBCELLULAR LOCATION OF PROTEINS (IV)
9. COMPARISON OF THE TRANSCRIPTOME AND PROTEOME IN THREE HUMAN CELL LINES (V)

CONCLUDING REMARKS AND FUTURE PERSPECTIVES

ABBREVIATIONS

ACKNOWLEDGEMENTS

REFERENCES

INTRODUCTION

1. THE DISCOVERY OF THE BUILDING BLOCKS OF LIFE

For centuries, scientists have strived to understand the basic principles of life. We now know that the key building blocks consist of cells containing proteins and nucleic acids in the form of deoxyribonucleic acid (DNA) and ribonucleic acid (RNA), as well as lipids, metabolites and other compounds. The story of DNA began in the 19th century when Johann Miescher was able to isolate a substance he called “nuclein” from white blood cells (Dahm, 2008) and Gregor Mendel discovered the laws of inheritance by studying the breeding patterns of peas. Subsequent research led to the understanding that DNA was composed of nucleotides containing phosphate, sugar and the four different nitrogen bases adenine, cytosine, guanine and thymine, and that there was another nucleic acid, RNA, with its unique nitrogen base uracil. In 1953, Francis Crick, James Watson and Maurice Wilkins were able to solve the structure of DNA and they also suggested a mechanism for how genetic material could be transferred in their double-helix base-pairing model (Watson and Crick, 1953). Francis Crick was the first to postulate the central dogma of molecular biology (Figure 1) (Crick, 1958, 1970), which describes the relationship between DNA, RNA and proteins, as well as the sequence hypothesis, which states that the sequence in genetic material is a code for the amino acid sequence of a corresponding protein (Crick, 1958) so that three bases of DNA code for one amino acid (Crick et al., 1961). The “missing link” between DNA and proteins was found when Elliot Volkin discovered a new type of “DNA-like-RNA” in 1956 (Volkin and Astrachan, 1956). This enabled other researchers to figure out that the new molecule encoded genetic information from DNA and was transported from the nucleus to the cytoplasm to generate proteins; hence it was named messenger RNA (mRNA) (Jacob and Monod, 1961).

Proteins had already been discovered in the 18th century in the form of albumin from egg whites and wheat gluten. It was known that this type of molecule could change under treatments with acid or heat. The name protein, which in Greek means “of primary importance”, was introduced in 1838 by Jöns Jakob Berzelius and described large organic compounds with closely related empirical formulas (Perrett, 2007). During the same period, Gerardus Johannes Mulder came to the conclusion that the proteins he studied all had very similar chemical compositions and were likely to be composed of one fundamental substance (Mulder, 1838). He also discovered what would later be called amino acids, the small molecules that result from acid hydrolysis of proteins and have both amino and carboxylic properties (Perrett, 2007). Hemoglobin was one of the most studied proteins in the 19th century and had been crystallized in 1840, but it was not until 1935 that all 20 amino acids had been identified (Perrett, 2007). In 1951, the first complete amino acid sequence was determined from the protein insulin (Sanger and Tuppy, 1951a, b) and the first 3D structures, myoglobin and hemoglobin, were solved using X-ray crystallography in 1958 (Kendrew et al., 1960; Perutz et al., 1960).

[Figure 1 diagram: DNA (replication) -> transcription -> mRNA -> translation -> protein]

Figure 1. The central dogma of molecular biology involves the processes of replication, transcription and translation. Images from the RCSB PDB (http://www.rcsb.org/pdb) of PDB ID: 2BNA (Drew et al., 1982), PDB ID: 3MEI (Dibrov et al., 2011) single strand and PDB ID: 2V4M (Moche et al., unpublished). Structures modified by Timothy Nugent.

Today, we know that proteins are the essential building blocks of the cell, accounting for approximately 20% of its weight (Lodish et al., 2000), and that only 20 amino acids are necessary to generate the large diversity of protein functions that exists. A gene, which is a short stretch of DNA stored in chromosomes in the cell nucleus, is the blueprint for how the protein will be constructed in terms of the order of amino acids. Information flows via transcription of DNA into messenger RNA and subsequent translation to a polymer of amino acids, which is folded into a specific protein structure. In a human cell, the total number of protein-coding genes is approximately 20,000 (Clamp et al., 2007; Uhlen et al., 2010) but the number of functionally different proteins is substantially higher than the number of genes, partly due to the many alternative variants that splicing events can give rise to and changes due to post-translational modifications (PTMs) (Uhlen, 2005a). The important protein molecule is the main focus of this thesis and will be discussed in more detail in the following sections.
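The flow of information from DNA via mRNA to protein can be illustrated with a minimal Python sketch. The example sequence and the function names are hypothetical, and the codon table is deliberately partial: it contains only the five codons of the standard genetic code that the example needs.

```python
# Partial standard genetic code: only the codons used in the example below.
CODON_TABLE = {
    "AUG": "M",  # methionine (start codon)
    "GCU": "A",  # alanine
    "AAA": "K",  # lysine
    "UGG": "W",  # tryptophan
    "UAA": "*",  # stop codon
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA sequence for a DNA coding strand (T is replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA three bases at a time until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCTAAATGGTAA"  # hypothetical coding sequence
mrna = transcribe(dna)
print(mrna)             # AUGGCUAAAUGGUAA
print(translate(mrna))  # MAKW
```

Each group of three bases maps to one amino acid, so the 15-base sequence yields a four-residue peptide followed by a stop codon.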

2. PROTEINS

For a protein to be able to carry out its function, the ability to fold into a complex three-dimensional (3D) structure is essential. Protein structures are commonly described using four hierarchical levels: (i) the primary structure is the linear sequence of amino acids; (ii) the secondary structure consists of local substructures, the two main types being alpha-helices and beta-strands, joined by loops or coils; (iii) the tertiary structure refers to the complete 3D structure of a folded single protein molecule; and (iv) the quaternary structure applies to proteins that are larger assemblies of several polypeptide chains and is the arrangement of multiple protein subunits in a complex (Lodish et al., 2000). After the translation of a protein, the folding and its properties can be modified by post-translational modifications (PTMs), which often involve addition of a chemical group to one or more amino acids or proteolytic cleavage. Among the most important PTMs are glycosylation, acylation, phosphorylation, methylation and disulfide bond formation (Mann and Jensen, 2003). The function of a protein is highly dependent on its structure, which in turn relates to the properties of its amino acids. Some proteins undergo structural rearrangements to regulate their function or the activity of other proteins. It is also common for two or more proteins to be joined in functional complexes.

Proteins build up cellular structures, are involved in all sorts of biological processes in the cell and also allow cells to communicate. Differences in the protein set make every human being unique, and although the sequence of the human genome has been known for almost a decade, a large portion of proteins still remains uncharacterized. The total protein complement of the genome of an organism has been termed the “proteome” (Wilkins et al., 1996) and the next sections will discuss different classes of proteins, with special focus on the large group of membrane proteins, as well as methods used to analyze proteins.

2.1 CLASSIFICATION OF PROTEINS

The proteins that have been successfully characterized in one way or another can be classified according to many different properties (Table 1). Three general classes based on structural properties are fibrous, globular and membrane proteins (Figure 2). Fibrous proteins, also called scleroproteins, are usually long, rod-shaped protein filaments and are used to construct macroscopic structures involved in support and protection, such as bone matrix and muscle fibers. Most scleroproteins are practically insoluble and examples include collagen (Figure 2A) and elastin found in connective tissue, and cytoskeletal proteins such as keratin (Lodish et al., 2000). In contrast, globular proteins are “globe”-like and mostly soluble in aqueous solution, with polar residues on their surface and a hydrophobic interior. This group is the largest and is very diverse, containing proteins with varying structures. Some of the most well-studied proteins, including hemoglobin (Figure 2B), immunoglobulin and most enzymes, are globular proteins. The third group contains proteins attached to or spanning a membrane consisting of a lipid bilayer and will be discussed in more detail in section 2.2.

Figure 2. Examples of protein structures: (A) the fibrous protein collagen, PDB ID: 1CAG (Bella et al., 1994), (B) the globular protein hemoglobin, PDB ID: 1GZX (Paoli et al., 1996), (C) the alpha-helical membrane protein potassium channel Kirbac1, PDB ID: 1P7B (Kuo et al., 2003) and (D) the beta barrel protein porin OmpG, PDB ID: 2F1C (Subbarao and van den Berg, 2006). Structures modified by Timothy Nugent.

Proteins can also be classified based on their subcellular location, i.e. depending on which organelle or subcellular structure they are found in. Examples of this kind of classification are nuclear, mitochondrial, cytoplasmic or cytoskeletal proteins and there are also proteins that shuttle between several locations. For a membrane protein, the subcellular location could be any of the membrane-bounded organelles, such as the endoplasmic reticulum, Golgi apparatus and vesicles, or the plasma membrane surrounding the cell. Another important group consists of secreted proteins, which are transported from the cell to the extracellular space. A third type of classification of proteins is based on function, for example enzymes such as kinases, peptidases and ligases catalyzing reactions, receptors receiving and responding to a signal, transcription factors binding to DNA and thereby regulating transcription of DNA or transporter proteins moving small molecules across the membrane.

Table 1. Examples of categories that can be used for classification of proteins.

Category              Examples
Structural            Fibrous, Globular, Membrane
Subcellular location  Nucleus, Plasma membrane, Mitochondria, Golgi apparatus, Lysosome
Function              Enzyme, Ion channel, Transcription factor, Receptor
Biological pathway    MAPK/ERK signaling, Glycolysis, Citric acid cycle, Photosynthesis

Another way of categorizing proteins is based on a certain biological pathway, which is a series of actions involving several molecules in the cell. There are many types of pathways and some examples include signal transduction pathways where signals are received from outside of the cell, metabolic pathways which enable chemical reactions such as converting food to energy, and gene regulation pathways which can turn the transcription of genes on and off. Another type of pathway is the secretory pathway in which secreted and membrane proteins are transported from the endoplasmic reticulum via vesicles to the Golgi apparatus and further on to their final destinations, which for example can be in the endoplasmic reticulum, Golgi apparatus, lysosome, endosome, plasma membrane or the extracellular space (Lodish et al., 2000).

2.2 MEMBRANE PROTEINS

Membrane proteins can be classified as either peripheral or integral. Peripheral membrane proteins are associated with the membrane by being bound to either peripheral regions of the membrane or to integral membrane proteins, but they do not fully span the membrane. Integral membrane proteins contain alpha-helical (Figure 2C) or beta-barrel structures (Figure 2D), which are hydrophobic and therefore can span the entire lipid bilayer and are linked by extra-membranous loop regions. Beta-barrel proteins have been found in the outer membranes of Gram-negative bacteria, cell walls of Gram-positive bacteria, and the outer membranes of mitochondria and chloroplasts (Wimley, 2003). The alpha-helical integral membrane proteins form the major category of membrane proteins and are estimated to be encoded by 20-30% of all protein-coding genes in most genomes (Krogh, 2001). They are found in all types of biological membranes and contain one or more alpha-helical regions mostly consisting of hydrophobic amino acids. The functions of membrane proteins are diverse and include ion channel activity or transport of other molecules across the membrane, enzymatic processes, anchoring of other proteins and receptor signaling. Their key roles as transporters and receptors explain why they represent approximately 60% of all drug targets and hence their immense importance for the pharmaceutical industry (Bakheet and Doig, 2009; Yildirim et al., 2007). G-protein-coupled receptors (GPCRs), which contain seven transmembrane (TM) segments and include approximately 800 of the human protein-coding genes (Fredriksson and Schioth, 2005), comprise the largest group of membrane protein drug targets (Wise et al., 2002). Surprisingly, although membrane proteins are so numerous, they account for only about 1% of all known protein structures (White, 2004) due to the difficulties in isolating and crystallizing them.

2.3 PROTEIN ABUNDANCE

As proteins carry out most of the important processes of the cell, they are its central components and largely define the cellular function and phenotype. To gain a better understanding of the biology of the cell, it is therefore important to quantify protein abundance. The expression levels of proteins span a very wide range, with protein abundances estimated at anywhere between 50 and 10^6 molecules per cell in yeast (Ghaemmaghami et al., 2003) and similar results observed in Escherichia coli (Ishihama et al., 2008). Analysis of proteins in human plasma revealed abundances spanning more than ten orders of magnitude (Anderson and Anderson, 2002). Protein expression levels have also been linked to protein function and properties in the cell. Highly abundant proteins have been associated with processes involved in protein synthesis, energy and binding functions, whereas proteins related to transcription, transport and cellular organization have been found in small copy numbers (Ishihama et al., 2008). Also, the levels of proteins expressed in the same biological pathway have been found to be more correlated compared to those in different pathways (Sigal et al., 2006). For a specific protein, the abundance can vary between different cell types and tissues, and also within the same cell depending on various conditions such as cell cycle. Besides being related to corresponding mRNA levels, protein levels are affected by degradation and translational controls (Gygi et al., 1999; Lu et al., 2007), and it was recently shown that this contribution is at least as important as mRNA transcription and stability (Vogel et al., 2010). The natural variation in protein expression between individual cells has been related to stochastic processes where genes switch between “on” or “off” states of transcription (Cohen et al., 2009). The expression variability has also been found to be higher in disease-associated genes compared to a random set of genes (Mayburd, 2009).
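To put the abundance ranges quoted above on a common scale, the short Python sketch below converts a pair of lower and upper copy numbers into orders of magnitude. The function name is hypothetical; the yeast figures are the ones cited in the text, and the calculation is a base-10 logarithm of their ratio.

```python
import math


def orders_of_magnitude(low: float, high: float) -> float:
    """Return the dynamic range between two abundances as log10(high / low)."""
    return math.log10(high / low)


# Yeast: roughly 50 to 10^6 molecules per cell, as cited in the text.
yeast_range = orders_of_magnitude(50, 1e6)
print(round(yeast_range, 1))  # 4.3
```

The yeast range of about four orders of magnitude is therefore far narrower than the more than ten orders of magnitude reported for human plasma.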

2.4 PROTEOMIC TOOLS FOR ANALYZING PROTEIN EXPRESSION

The word “proteomics” refers to the large-scale study of gene expression at the protein level (Naaby-Hansen et al., 2001) and the ultimate goal is to understand the individual biological functions of all proteins and how they interact. Although more than ten years have passed since the completion of the human genome sequence, a large portion of our proteins still remains uncharacterized. One of the main challenges is the size and complexity of the human proteome, which involves a large number of different proteins resulting from alternatively spliced transcripts and PTMs as well as polymorphisms and somatic rearrangements. Estimates of the number of proteins in the human proteome largely depend on how it is defined, but if all combinatorial variants such as the immunoglobulins and T-cell receptors as well as PTMs are included, the numbers add up to several million protein variants (Uhlen, 2005a). Another challenge lies in the wide range of protein abundances and the consequent difficulties in concurrently detecting proteins present at very low and very high levels. Since proteins cannot be amplified like nucleotide sequences, the sensitivity of the detection method is essential (Cox and Mann, 2007). A selection of experimental methods that allow for large-scale analysis of proteins in terms of location and expression levels will be briefly described in the following sections.

2.4.1 ANTIBODY-BASED PROTEIN PROFILING

Antibodies, which are also known as immunoglobulins, are gamma globulin proteins naturally occurring in the adaptive immune system with the primary purpose of recognizing and binding to foreign particles invading the body (Abbas and Lichtman, 2005). They are produced by B-cells (B lymphocytes), and the molecule that has triggered the production of the antibody in the immune system is referred to as the antigen. The site on the antigen to which the antibody binds is referred to as an epitope. The two main categories of antibodies are polyclonal and monoclonal, both of which are generated from immunization of an animal. Polyclonal antibodies stem from a pool of B-cells, resulting in a large variety of antibodies that all recognize the antigen used for immunization but can bind to different epitopes. Monoclonal antibodies are generated by the fusion of a specific B-cell with a myeloma cancer cell resulting in what is known as a hybridoma, which reproducibly produces antibodies recognizing the same epitope (Kohler and Milstein, 1975). The generation of monoclonal antibodies is much more time consuming and expensive, but the single epitope as well as the renewable source is often an advantage for pharmaceutical applications. A disadvantage of polyclonal antibodies is that they cannot be reproduced; however, they can be produced in large quantities and there are applications where the ability to recognize more than one part of the target is valuable. Antibodies are widely used in clinical settings for diagnostic, prognostic and therapeutic purposes (Beck et al., 2010). They are also among the most popular affinity reagents used in proteomics and have a wide range of use in methods such as flow sorting used to study cells (Hulett et al., 1969), protein arrays for screening of binding events (Haab, 2001) and Western blot to analyze protein size (Renart et al., 1979). Although antibodies can be applied as specific protein probes in various platforms, the focus in the following sections will be on protein profiling using immunohistochemistry and immunofluorescence-based confocal microscopy.

For a molecular characterization of proteins at a cellular level, a popular method is immunohistochemistry (IHC), which joins the three fields of immunology, histology and chemistry. The basic concept is to use antibodies to provide information about the expression level and spatial localization of proteins within cells or tissues (Ramos-Vara, 2005). IHC allows for preservation of the tissue morphology. An “antigen retrieval” process enables the epitopes of the protein to be exposed to the primary antibody, which is detected by a secondary antibody conjugated with an enzyme. The enzyme, often peroxidase, is capable of catalyzing a color-producing reaction after addition of a substrate. With the development of tissue microarrays (TMAs), it has become possible to use IHC as a high-throughput method for simultaneous analysis of protein expression in multiple tissues (Warford et al., 2004). Tissue samples are routinely stored in paraffin blocks and a TMA is constructed by punching and transferring representative cylindrical cores of tissue from multiple paraffin blocks and assembling them in a recipient block. This recipient paraffin block allows for hundreds of consecutive sections to be cut and subsequently used for IHC (Kononen et al., 1998). The method enables a large set of tissues to be stained and analyzed at the same time and under the same conditions.

An alternative to IHC is to use approaches based on immunofluorescence (IF). The key component behind fluorescence is a fluorophore, a small molecule able to absorb and re-emit energy at two different wavelengths. Fluorophores can easily be attached to other molecules, such as antibodies, and detection of the emitted light, or the fluorescence, allows for the analysis of protein localization and cellular structures. Different fluorophores have different spectral characteristics and this enables multiple targets to be detected simultaneously. IF, which utilizes fluorescently labeled antibodies, can be used in combination with a confocal microscope to analyze protein expression on a subcellular level. Confocal microscopy allows for high-resolution imaging with better contrast and less background haze than conventional wide-field microscopy (Semwogerere and Weeks, 2005). The result includes sharp optical sections that can be assembled into a series of cross-section images of different depths, which enables 3D reconstructions of the object. Therefore, it is possible to study the characteristics of proteins in their natural environment with a high resolution, and the method can also be applied to large-scale high-throughput experiments (Pepperkok and Ellenberg, 2006). IF-based confocal microscopy is much more sensitive than IHC and at the same time also a more quantitative method due to the broader dynamic range of fluorescence probes (Rimm, 2006). With both methods, the sample preparation, including fixation and permeabilization of cells or antigen retrieval in tissues as well as the antibody staining and labeling, greatly impacts the outcome. Also, the availability of validated antibodies with high specificity to their protein targets is important to gain successful and accurate results.

2.4.2 MASS SPECTROMETRY-BASED METHODS

Mass spectrometry (MS) characterizes proteins using mass analysis and has been described as the most comprehensive and versatile tool in large-scale proteomics (Yates et al., 2009). MS methods have essentially replaced two-dimensional gel electrophoresis and other previous tools for global analysis of proteins and can be used to analyze the protein composition and dynamics of cells, organelles or protein complexes. The technique allows proteins to be identified based on their amino acid sequence and can also detect protein interactions and PTMs that change the mass of a protein, e.g. phosphorylation (Mann et al., 2001). MS measures the mass-to-charge ratio (m/z) of a molecule and involves electrically charging the molecules and transferring them into the gas phase (Walther and Mann, 2010). Two commonly used ionization methods that allow macromolecules to be studied are matrix-assisted laser desorption/ionization (MALDI) (Karas and Hillenkamp, 1988) and electrospray ionization (Fenn et al., 1989). Since full-length proteins are difficult to study, peptides derived by enzymatic cleavage of a protein are most often used. A tandem MS (MS/MS) spectrum consisting of a list of m/z ratios for the different fragments can be used to determine the amino acid sequence and hence reveal the identity of a protein (Walther and Mann, 2010). For quantification purposes, where it is necessary to track changes in protein levels between different conditions rather than simply detecting a protein, several methods have been developed over the past years. A popular method to quantify the relative abundance of peptides in different samples is metabolic labeling, where “heavy” amino acids containing non-radioactive isotopes, e.g. arginine and lysine modified with 13C or 15N atoms, are incorporated in proteins. This approach is called stable-isotope labeling by amino acids in cell culture (SILAC) (Ong et al., 2002) and it allows the heavy labeled proteins to be distinguished from normal control proteins. Consequently, the exact mass differences between the two samples enable the relative intensities to be calculated, reflecting the relative abundance of the proteins (Walther and Mann, 2010).
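The arithmetic behind the heavy/light comparison can be sketched in a few lines. This is only an illustration and not part of any MS software: the function names `mz_shift` and `heavy_to_light_ratio` are hypothetical, the isotope masses are standard monoisotopic values, and the intensities passed in are assumed to be peak intensities already extracted for the same peptide in its heavy and light form.

```python
# Mass gained per atom when a light isotope is replaced by its heavy
# counterpart (monoisotopic mass differences in daltons).
DELTA_13C = 13.0033548 - 12.0          # 13C versus 12C
DELTA_15N = 15.0001089 - 14.0030740    # 15N versus 14N


def mz_shift(n_13c: int, n_15n: int, charge: int) -> float:
    """m/z offset between the heavy and the light form of a peptide."""
    if charge <= 0:
        raise ValueError("charge must be a positive integer")
    return (n_13c * DELTA_13C + n_15n * DELTA_15N) / charge


def heavy_to_light_ratio(heavy_intensity: float, light_intensity: float) -> float:
    """Relative abundance estimated from the two peak intensities."""
    if light_intensity <= 0:
        raise ValueError("light intensity must be positive")
    return heavy_intensity / light_intensity
```

For example, a doubly charged peptide containing one arginine labeled with six 13C atoms would be expected to appear about 3.01 m/z units above its light counterpart (`mz_shift(6, 0, 2)`).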

3. THE HUMAN PROTEIN ATLAS PROJECT

In the Human Protein Atlas project (Uhlen et al., 2005), antibody-based proteomics is used to systematically explore the human proteome. The project started in 2003 and involves high-throughput generation of protein-specific polyclonal antibodies. The aim is to obtain protein expression profiles across cells, tissues and organs and to map the subcellular location of one representative protein from every human gene as defined by Ensembl (Berglund et al., 2008b; Flicek et al., 2010). Another important objective is to discover potential protein biomarkers, which can be used for diagnosis and prognosis of diseases such as cancer (Bjorling et al., 2008). Information generated from the project is available via a public website (www.proteinatlas.org) and the current version 7.0 (Uhlen et al., 2010), released in November 2010, contains 10,118 genes with protein expression profiles based on 13,154 antibodies, most of which are generated within the project, but also a smaller portion obtained from commercial providers.

The strategy of the project is based on the generation of Protein Epitope Signature Tags (PrESTs) (Agaton et al., 2003), which are fragments of a protein suitable for protein expression and with low sequence identity to other human proteins. The PrESTs are used both as antigens for generation of polyclonal antibodies and as affinity ligands in the purification of antibodies to obtain specificity. In the Human Protein Atlas project pipeline, a suitable PrEST fragment for the protein target is first selected in silico using an interactive visualization tool (Berglund et al., 2008a) displaying various protein features such as sequence similarities to other proteins, signal peptides and transmembrane regions. Oligonucleotide primers are used for amplification of the fragment from human tissue RNA pools and it is subsequently cloned into expression vectors (Agaton et al., 2003). After expression of the protein in Escherichia coli (Tegel et al., 2009), the PrEST is purified and used both for antigen preparation and for production of the affinity columns used for antibody purification. Next, rabbits are immunized with the antigens to generate polyclonal sera, which are purified using the affinity columns (Nilsson et al., 2005). This allows for a high-throughput generation of antigen-purified polyclonal antibodies.

The binding specificity of each antibody is evaluated in two assays (Nilsson et al., 2005). First, the antibody is tested on a protein microarray spotted with 384 PrESTs, including its own antigen, to verify that it binds specifically to the correct target protein with a sufficiently high signal. Second, all generated antibodies as well as those obtained from commercial providers are tested in a Western blot setup with human protein extracts from the two cell lines U-251 MG and RT-4, plasma as well as tonsil and liver tissue. This assay provides information about the molecular weight of the protein(s) that the antibody has bound to and can therefore test its ability to bind to the full-length protein target by comparing the size estimations of bound proteins to the predicted molecular weight of the target protein. However, absence of the protein target in the analyzed cells as well as PTMs, proteolysis or alternative splice variants can result in bands of the wrong size or no bands in the resulting gel (Nilsson et al., 2005).

3.1 PROTEIN PROFILING USING IMMUNOHISTOCHEMISTRY

The antibodies are used for protein expression profiling using IHC-based TMAs. The TMAs include biobank samples from 46 normal tissue types as well as 20 different cancer types (Uhlen et al., 2005), and each normal tissue is represented by triplicate samples. Several cell types are analyzed for each tissue, giving a total of 66 annotated cell types. For cancer tissues, tumor cells are annotated in duplicate samples from, in most cases, 12 patients. An immunostaining protocol is established for each antibody by considering results from the previous quality assurance steps as well as publicly available gene and protein information and literature. After staining the TMA sections, bound antibodies are visualized by brown-colored 3,3'-diaminobenzidine (DAB) staining, whereas the cells and extracellular material are stained blue using hematoxylin (Figure 3A). An automated slide-scanning system is used to generate digital images, which are processed and stored in a database in order to be manually annotated by a pathologist who describes the protein expression patterns. A total of 570 images are manually annotated for each antibody, and the results include an evaluation of the staining intensity on a four-grade scale (negative, weak, moderate or strong) as well as the fraction of stained cells in each image (Uhlen, 2005b). The final output is a protein profile summarizing the expression in all analyzed cells and tissues.

In addition to normal and cancer tissues, a large set of cell lines and primary cell samples are also analyzed. Cell lines are cultured cells which have undergone a change so that they are immortal (Schaeffer, 1990), or in other words capable of an infinite number of cell divisions, as opposed to primary cells, which can only be grown for a limited amount of time before going into senescence and losing the ability to divide. Since cell lines are most often derived from transformed tumor cells, they have been extensively used as in vitro model systems in cancer research and can be used for long-term continuous studies due to the practically limitless amount of cells. A cell microarray (CMA) produced in the Human Protein Atlas project includes a selection of 47 well-characterized human cell lines and twelve primary cell samples, mostly from leukemia/lymphoma patients (Andersson et al., 2006). The CMAs provide a complement to the cancer tissues included in the TMAs, and as TMAs and CMAs are stained simultaneously using the same protocol, IHC results can be compared between cells and tissues. In contrast to the normal and cancer tissue images, all images of the stained cells of the CMA (Figure 3B) are annotated using the automated image analysis software TMAx (Beecher Instruments, Sun Prairie, WI, USA) (Strömberg et al., 2007). This software automatically identifies cells in an image and its output parameters include the number of cells, the fraction of immunostained cells and the areas of weak, moderate and strong staining. With the assumption that staining intensity reflects the amount of protein present, this allows for an estimation of the relative protein levels in the analyzed cells. All images and annotation results from both TMAs and CMAs are displayed on the public Human Protein Atlas website.

Figure 3. Examples of images resulting from immunohistochemical and immunofluorescent stainings. Tissues and cells were stained using the antibody HPA001523 targeting the mitochondrial heat shock 60kDa protein 1 (HSPD1). Immunohistochemical images of (A) placenta tissue and (B) RT-4 cell line, where the brown color indicates antibody bound to the antigen. (C) Immunofluorescently stained U-2 OS cells, where the nucleus marker is shown in blue, the microtubule marker in red and the antibody staining in green.

3.2 SUBCELLULAR PROFILING USING IMMUNOFLUORESCENCE-BASED CONFOCAL MICROSCOPY

Three of the human cell lines are further analyzed by IF and confocal microscopy to determine the protein location on a subcellular level (Barbe et al., 2008). The cell lines, U-251 MG (glioblastoma), U-2 OS (osteosarcoma) and A-431 (epithelial carcinoma), are selected to represent different origins to increase the chance of locating each protein in at least one of the cell lines. For sample preparation, the cells are seeded, fixed and permeabilized before being stained with the internally produced antibodies. In addition, three reference markers are used: the nuclear probe DAPI, or 4',6-diamidino-2-phenylindole, and antibodies targeting calreticulin and alpha-tubulin for visualization of the endoplasmic reticulum and microtubules, respectively (Figure 3C). Fluorescently labeled secondary antibodies are added for detection of all antibodies and the results of the markers are used for quality control and guidance in the annotation procedure. A confocal laser-scanning microscope is used to manually acquire two representative four-channel high-resolution images for each analyzed antibody and cell type. All images are manually annotated and the results for each antibody include staining characteristics, the staining intensity on a four-grade scale (negative, weak, moderate or strong) based on microscope settings, and the subcellular location in one or more of 16 different compartments. All images and annotations are available as a part of the Human Protein Atlas website (Uhlen et al., 2010).

4. BIOINFORMATICS AND DATA ANALYSIS

The field of bioinformatics has made significant progress over the past two decades, mostly thanks to advances in large-scale biological projects, which have generated massive amounts of data, and the success of the Internet which allows easy access to data and exchange of information (Kanehisa and Bork, 2003). Bioinformatics combines the areas of biology, information technology and computer science and involves the development of new algorithms, statistics and tools to analyze, organize, evaluate and distribute the exponentially growing amounts of biological data from technologies such as sequencing, gene expression, systems biology and proteomics. In the following sections, a number of valuable biological databases, a selection of bioinformatic and statistical methods used for the analysis of extensive genomic and proteomic data as well as methods for predicting transmembrane protein topology will be discussed.

4.1 BIOLOGICAL DATABASES

The availability of biological data is of the utmost importance for bioinformatics applications. Luckily, there are countless biological databases that collect data and organize it in such a way that their content is easily accessible by users. Biological databases can be grouped into three categories depending on the type of stored data: (i) primary databases, which contain for example DNA and protein sequences; (ii) secondary databases which derive their information from a primary database, and (iii) composite databases which combine various sources from primary databases. Due to the high number of available biological databases it is unfeasible to cover them all, but the next sections will highlight some of the major databases and those of particular value to the topics covered in this thesis.

4.1.1 GENE AND GENE EXPRESSION RELATED DATABASES

Within the field of genomics, there are three main resources that collect, organize and distribute nucleotide sequences: the DNA Databank of Japan (Kaminuma et al., 2010), GenBank (Benson et al., 2010) and the European Nucleotide Archive (Leinonen et al., 2010a). Together they participate in the International Nucleotide Sequence Databases Collaboration (INSDC) (Cochrane et al., 2010), an effort to synchronize the collection of nucleotide sequences by different international groups and to exchange data between them. INSDC established and operates the primary next-generation sequence data archive, the Sequence Read Archive (SRA) (Leinonen et al., 2010b), which in September 2010 contained more than 500 billion reads, with the most sequenced organism, Homo sapiens, accounting for 65% of the bases.

The Ensembl project (Birney et al., 2004), based at the European Bioinformatics Institute (EBI) and the Wellcome Trust Sanger Institute, uses genomic assemblies as a starting point to automatically annotate genomes in their system. The pipeline of the annotation in Ensembl can be described as protein-coding centric and is focused on defining gene transcripts and the translated amino acid sequences, although a recent modification now includes short non-coding RNA genes (Flicek et al., 2010). To a large extent they also integrate annotations with other biological data such as evolutionary, regulatory, comparative and functional annotations. In version 59 released in August 2010, the collection of genomes supported 56 species, mainly from chordates but also a few other eukaryotes and a selection of model organisms. For the human genome, the current version 59.37 contains 20,734 protein-coding genes with 70,423 corresponding transcripts. Two other organizations that annotate and display genomic data are the National Center for Biotechnology Information (NCBI) and the UCSC Genome Browser (Fujita et al., 2010). EBI, the Wellcome Trust Sanger Institute, NCBI and UCSC are all members of the Collaborative Consensus Coding Sequence (CCDS) project (Pruitt et al., 2009) and coordinate their representations of protein annotations by using stable identifiers on reference human and mouse genomes. The project aims to recognize a common protein-coding gene set and had in the latest build 37.1 identified 23,739 human consensus coding regions and 18,175 Gene IDs.

Another important group of databases are those that function as repositories for gene expression data. The NCBI Gene Expression Omnibus (Barrett et al., 2009) is a repository for high-throughput data from sequencing, gene expression and microarray experiments. The ArrayExpress archive (Parkinson et al., 2010) is another major repository for high-throughput data within functional genomics and includes data from sequencing and microarray experiments. A selection of the data in ArrayExpress is used for the Gene Expression Atlas (Kapushesky et al., 2010), which performs a curation, re-annotation and statistical analysis to provide information about genes which are differentially expressed in particular diseases or developmental states and in certain tissues or cell types.

4.1.2 PROTEIN RELATED DATABASES

Databases that manage protein data are most often focused on either sequence or structural data. The leading resource for protein sequences and functional information is UniProt, the Universal Protein Resource (Uniprot Consortium, 2010), which is maintained by a consortium of groups from the Swiss Institute of Bioinformatics, the EBI and the Protein Information Resource. One of the main components of UniProt is the knowledgebase UniProtKB, which is a database consisting of two sections: the manually curated UniProtKB/Swiss-Prot and UniProtKB/TrEMBL with computer-assisted annotations. The annotation data available in UniProtKB/Swiss-Prot includes protein function, structure, subcellular location, splice isoforms, biologically relevant domains, PTMs, disease-related information, tissue specificity and other expression related data. The current release 2011_02 of UniProtKB/Swiss-Prot contains 20,253 reviewed human entries, out of which 13,382 (66%) have evidence at protein level, indicating that the protein is supported by experimental evidence such as Edman sequencing, MS identification, structural or interaction data, or antibody detection (Uniprot Consortium, 2008).

Other databases deal with the organization of data from proteomics experiments. The Proteomics Identification database (PRIDE) (Vizcaino et al., 2009) is a public repository for protein and peptide identification, while the PeptideAtlas (Deutsch et al., 2008) stores information about peptides specifically identified by MS proteomics experiments. When it comes to the collection and organization of structural data, the RCSB Protein Data Bank (PDB) (Berman et al., 2000) is the major repository for experimentally determined structures, mainly from proteins but also other biological molecules such as nucleic acids and assemblies. In addition, PDB provides various tools for the visualization and analysis of structural data. The SCOP database (Murzin et al., 1995) describes the structural and evolutionary relationship between proteins with known structure. InterPro (Hunter et al., 2009), PFAM (Finn et al., 2010) and PROSITE (Sigrist et al., 2010) are examples of databases that manage information about protein domains, families and functional sites, which are often defined by sequence motifs. IntAct (Aranda et al., 2010) and the Database of Interacting Proteins (Salwinski et al., 2004) are examples of databases that store protein interaction data. Visualizations of biological pathways can be found in the Reactome database (Joshi-Tope et al., 2005) and in the KEGG pathway database (Kanehisa et al., 2004). A summary with examples of protein-related databases is shown in Table 2.

Table 2. A selection of protein-related databases.

Database | Content | URL
CATH | curated classification of protein domain structures | http://www.cathdb.info/
Database of Interacting Proteins | experimentally determined protein interactions | http://dip.doe-mbi.ucla.edu/dip/
IntAct | protein interaction data and analysis tools | http://www.ebi.ac.uk/intact/main.xhtml
InterPro | protein signatures for classification of families, domains and functional sites | http://www.ebi.ac.uk/interpro
KEGG Pathway | pathway maps of molecular interactions, reactions and relations | http://www.genome.jp/kegg/pathway.html
PeptideAtlas | peptides identified by tandem mass spectrometry | http://www.peptideatlas.org/
PFAM | protein families represented by multiple sequence alignments and HMMs | http://pfam.sanger.ac.uk/
PHOSIDA | post-translational modifications | http://www.phosida.com
PRIDE | protein and peptide identifications | http://www.ebi.ac.uk/pride/
PRINTS | protein fingerprints or conserved motifs | http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/
PROSITE | protein families, domains and functional sites and associated patterns and profiles | http://expasy.org/prosite/
Protein Data Bank | experimentally-determined structures of proteins, nucleic acids, and complex assemblies | http://www.rcsb.org/pdb/
Reactome | reactions, pathways and biological processes | http://www.reactome.org/
SCOP | structural classification of proteins | http://scop.mrc-lmb.cam.ac.uk/scop
UniProt | protein sequences and functional information | http://www.uniprot.org

4.1.3 ONTOLOGIES

Constructing an ontology is a way of creating a shared understanding of a concept within a given field (Uschold and Gruninger, 1996). In biology, this usually involves a controlled vocabulary defining basic terms in different biological and medical domains, and is a step towards a more efficient method of retrieving and analyzing the vast amount of complex biological data stored in databases (Soldatova and King, 2005). One of the most widely used biological ontologies is the Gene Ontology (GO) (Ashburner et al., 2000). The Gene Ontology consortium is an effort to standardize the annotations of genes and gene products by using a controlled and structured vocabulary of terms. This consortium started in 1998 as a collaboration between three annotation projects for model organisms and today includes many of the leading genome projects. GO consists of three independent ontologies with terms that represent different properties of genes and gene products. Molecular Function (MF) describes activities at the molecular level for actions most often performed by a single gene product, such as a binding or catalytic function. Biological Process (BP) represents a series of events or molecular functions to which the gene or gene product contributes, for example signal transduction or translation, which have a fixed beginning and end. Cellular Component (CC) describes a part of a eukaryotic cell or the extracellular environment where a gene product is active, which for example can be a large structure, such as the nucleus, or a smaller part made up of several gene products, such as the .

In the annotation process, GO terms are assigned to gene products. Each annotated gene product has a one-to-many relationship to each of the ontologies, due to the fact that a protein can function in several parts of the cell and participate in many alternative processes. GO terms are structured into a network of nodes described as a directed acyclic graph (DAG), which is a hierarchical structure similar to a tree, although differing in that it allows each node to have multiple parents (Khatri and Drăghici, 2005). The DAG structure (Figure 4) allows for an annotation at different levels depending on the extent of available information for a gene product. Each GO annotation includes a gene name, a GO term, a reference containing supporting data and an evidence code stating the kind of data used to make the annotations, for example IDA: Inferred from Direct Assay or IEA: Inferred from Electronic Annotation. The GO website also provides a large set of tools to browse, analyze and search the ontologies provided by the consortium.
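The multiple-parent property of the DAG can be illustrated with a small sketch. The parent links below are a hypothetical toy subset loosely based on the Molecular Function terms named in Figure 4; they are assumed for illustration and are not an excerpt from the actual GO release, and the function name `ancestors` is likewise only illustrative.

```python
# Toy subset of Molecular Function terms. The parent links are assumed
# for illustration and are not taken from the real GO files.
PARENTS = {
    "molecular function": [],
    "binding": ["molecular function"],
    "transcription regulator activity": ["molecular function"],
    "protein binding": ["binding"],
    "transcription factor binding": ["protein binding"],
    # A node with two parents, which a plain tree could not represent.
    "transcription cofactor activity": [
        "transcription factor binding",
        "transcription regulator activity",
    ],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen
```

In this toy graph, a gene product annotated with "transcription cofactor activity" is implicitly annotated with all five of its ancestor terms, reached through two different branches of the graph.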

[Figure 4 diagram. Terms shown: molecular function; binding; transcription regulator activity; protein binding; nucleic acid binding; transcription repressor activity; transcription factor binding; DNA binding; transcription corepressor activity; transcription cofactor activity; DNA bending activity.]

Figure 4. An example section of the Molecular Function DAG structure in GO.

Examples of some of the many other biological ontologies include the Experimental Factor Ontology (Malone et al., 2010), which models the experimental factors in ArrayExpress; the Ontology Lookup Service (Cote et al., 2008), a database that integrates several publicly available biomedical ontologies in a query-friendly format; and the Systems Biology Ontologies project (Le Novere, 2006), which is specifically focused on problems within the systems biology field.

4.2 TOOLS FOR THE ANALYSIS AND VISUALIZATION OF LARGE-SCALE BIOLOGICAL DATA

When dealing with large data sets such as gene expression data from thousands of genes, the human brain cannot process the data in a raw format. To be able to extract the most important features from the data, methods for visualization and analysis of large-scale data are of immense value. The widely used methods of hierarchical clustering and principal component analysis (PCA) aim to identify structure within a collection of data and to extract the most important information. These methods will be discussed in more detail in the next two sections. There are however many other methods dealing with similar problems, e.g. k-means clustering (Hastie et al., 2009), independent component analysis (Hyvarinen and Oja, 2000; Kangas et al., 1990), singular value decomposition (Jolliffe, 2002) and self-organizing maps (Kangas et al., 1990).

Most biological processes that take place in the cell involve signaling cascades and large protein complexes. When trying to build an in-depth map of the proteins within a cell, it is important to analyze both their interaction partners and functions. Enrichment analysis enables a biological interpretation of large gene and protein lists by exploration of the functional categories present in the data. Tools that enable construction of biological networks are used for visualizing relationships such as protein-protein interactions, and are valuable for extracting meaningful information from extensive data sets.

4.2.1 HIERARCHICAL CLUSTERING

Hierarchical clustering is an unsupervised method commonly used to analyze large datasets and has been successful in the genomics field (Eisen et al., 1998). The goal is to group together sets of items into a hierarchical representation of clusters forming a tree, with similar items close together in the same cluster. Since the method is based on the pairwise similarity between the analyzed data, it is necessary to define the term “similar” in a mathematical way (Hastie et al., 2009). Commonly used similarity measures are Euclidean distance, Manhattan distance or different correlation methods such as Pearson correlation (D'haeseleer, 2005). During the clustering process, the attributes of each established cluster are used to determine the successive clusters, and two basic strategies to group objects are often used: divisive (“top-down”) and agglomerative (“bottom-up”).

In divisive methods, the starting point is at the top and for each level one cluster is recursively split into two new clusters, as dissimilar as possible (Hastie et al., 2009). For agglomerative strategies, the bottom is the starting point, and at each level a pair of clusters is merged into a single cluster. The selected pair consists of the two most similar groups and it is therefore necessary to use a “linkage function” to measure the distances between all clusters (D'haeseleer, 2005). Examples of linkage functions are single linkage, complete linkage and average linkage, which use the minimum, maximum and mean of all pairwise distances respectively, as well as centroid linkage, in which the distance between the two “centroids”, or the mean values of each cluster, is used. The result of the hierarchical clustering can be displayed by a rooted, binary tree in which the root node represents the complete data set and the terminal nodes represent each individual item (Hastie et al., 2009). This graphical representation can be referred to as a dendrogram (Figure 5) if the height of each node is proportional to the dissimilarity between its two child nodes and the terminal nodes are plotted at height zero.

[Dendrogram with leaves A-F; vertical axis: Height.]

Figure 5. Example of a simple dendrogram with six nodes A-F.
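The agglomerative strategy with average linkage can be sketched as follows. This is a minimal, unoptimized illustration under stated assumptions: Euclidean distance is used as the dissimilarity measure, items are given as numeric tuples, and the function names (`euclidean`, `average_linkage`, `agglomerate`) are chosen here for illustration only. Established implementations are considerably more efficient than this quadratic search.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equally long numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, points):
    """Mean of all pairwise distances between members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(points):
    """Bottom-up clustering; returns merges as (members_a, members_b, height)."""
    clusters = [(i,) for i in range(len(points))]  # every item starts alone
    merges = []
    while len(clusters) > 1:
        # Pick the two most similar clusters according to the linkage function.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], points),
        )
        height = average_linkage(a, b, points)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, height))
    return merges
```

Each recorded height corresponds to the level at which the two child nodes would be joined in a dendrogram; swapping `average_linkage` for a function taking the minimum or maximum pairwise distance would give single or complete linkage instead.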

4.2.2 PRINCIPAL COMPONENT ANALYSIS

High-dimensional data sets are difficult to visualize since three dimensions is usually the limit for the human mind. The basic idea behind principal component analysis (PCA) is to reduce the dimensionality of a large data set while keeping most of the variation and information (Jolliffe, 2002). Dimensionality reduction is performed by transforming the variables to a new set of variables called principal components (PCs), by identifying the directions along which the data vary the most (Ringner, 2008). The PCs are linear combinations of the original variables, are uncorrelated, and are ordered according to the amount of variation from the original data set that each PC retains, with the constraint that each PC is orthogonal to the previously defined PCs. The first PC thus contains the largest variance whereas the last is supposed to capture only the residual “noise” (Yeung and Ruzzo, 2001). Mathematically speaking, the PC coefficients and variances are the eigenvectors and eigenvalues of the data covariance matrix (Jolliffe, 2002). The first few PCs are supposed to summarize the properties of the original data set and therefore, PCA allows a sample to be represented by a few numbers instead of thousands of variables. A graphical plot of these reduced numbers enables a visualization of similarities and differences between groups of samples and is a good way of extracting information from the PCA. By plotting the values of the first two PCs for each observation, the two-dimensional view that retains the most variance in the data is obtained.
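The link between PCs and the eigenvectors of the covariance matrix can be made concrete with a short sketch. It is a simplified illustration, not a full PCA: only the first PC is extracted, using power iteration as a stand-in for a complete eigendecomposition. All function names are hypothetical, and the sketch assumes that the covariance matrix is non-zero and that the starting vector is not orthogonal to the dominant eigenvector.

```python
import math


def center(rows):
    """Subtract the column mean from every variable."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already centered data."""
    n, p = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def first_pc(cov, iterations=200):
    """Dominant eigenvector and eigenvalue of cov by power iteration."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient: the variance captured along direction v.
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)
    )
    return v, eigenvalue


def scores(centered, v):
    """Project each observation onto the direction v."""
    return [sum(x * c for x, c in zip(row, v)) for row in centered]
```

The values returned by `scores` are the coordinates one would plot along the first PC axis; repeating the procedure on the data with that direction removed would yield the second PC.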

4.2.3 ENRICHMENT ANALYSIS OF GENE ANNOTATIONS

High-throughput strategies in genomics and proteomics generate large amounts of data and often result in long lists of “interesting” genes, which are difficult to interpret (Huang da et al., 2009a). The biological knowledge stored in the vast number of databases described in the first section of this chapter can be exploited to allow for a systematic functional analysis of these lists to summarize the most relevant properties. Over the past decade, bioinformatic tools for enrichment analysis have been successful in adding valuable information to large-scale biological studies (Huang, 2009).

The principal basis behind enrichment analysis is that biological processes and functions within a cell are rarely dependent on a single gene, but most often made up of a group of genes. If a particular process or function is atypical in a biological study, co-functioning genes are likely to be selected together (Huang da et al., 2009a). Once a certain functional term has been associated with a set of genes, a test has to be performed to find out if the enrichment based on the proportion of associated genes is significantly different from what would be expected by chance alone. Therefore, an enrichment analysis of a gene set uses a statistical model such as the binomial or hypergeometric distribution, or a chi-square or Fisher’s exact test for equality of proportions, to calculate overrepresented (or enriched) terms (Draghici et al., 2003), where a p-value indicates the significance of the enrichment. To answer the question whether a functional term is overrepresented in a set of genes or not, a reference (or background) list is needed for comparison and to determine the degree of enrichment (Huang da et al., 2009b). A reference set can for example consist of the entire genome of the species being analyzed, or the complete set of genes with a potential of being in the annotation category in question. When many categories are considered and hence a large number of tests are performed, a multiple-testing correction such as the Bonferroni correction or the Benjamini-Hochberg procedure for controlling the false discovery rate can be useful to correct for false rejections of the null hypothesis (Benjamini and Hochberg, 1995; Draghici et al., 2003).
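A minimal sketch of such a test is given below, assuming the hypergeometric model and a Benjamini-Hochberg adjustment. It does not reproduce the implementation of any of the tools cited in this section; the function names and argument names are illustrative, and the counts passed in are assumed to come from a gene list and a reference list that use the same gene identifiers.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated genes in a list of n,
    drawn from a reference of N genes of which K carry the term."""
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

One p-value would be computed per functional term, after which the whole list of p-values is passed to `benjamini_hochberg`; terms whose adjusted value falls below a chosen threshold would then be reported as enriched.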

The mission to functionally analyze long lists of genes is challenging, but the number of available tools is increasing every year. A study by Khatri and Draghici (Khatri and Drăghici, 2005) compared 14 of the tools available at that time and a later study in 2008 by Huang et al (Huang da et al., 2009a) was able to expand the selection of accessible bioinformatic enrichment tools to 68. Some examples include GoMiner (Zeeberg et al., 2003), Onto-Express (Khatri et al., 2002), GSEA (Subramanian et al., 2005) and DAVID (Dennis et al., 2003).

It is difficult to choose the most suitable statistical methods and provide relevant annotations covering a large set of genes since the functions of many genes still are unknown (Khatri and Drăghici, 2005). GO terms have been the most predominantly used annotation data so far, but lately many of the new or updated tools have started to include a larger assortment of underlying information, such as data on KEGG pathways, OMIM disease associations, protein domains and gene-expression results in their annotation databases (Huang da et al., 2009a). The differences between the many available methods lie mainly in their supported gene identifiers, choice of statistical model, reference data, annotation data, mapping between databases and many other aspects that can have a great impact on the results. Therefore it is highly important to be aware of the strengths and drawbacks of each method when deciding which one to use.

4.2.4 NETWORK ANALYSIS

Most biological processes in the cell are dependent on numerous proteins functioning together in signaling cascades or larger complexes and co-operating inside the organelles. The interaction partners of a protein can contribute to defining its function and it is known that co-expressed genes are more likely to interact and be involved in the same biological pathway than genes that are not expressed at the same time (Bader et al., 2003). Therefore, an important area to study is protein interactions and other relationships between biological molecules. These types of relationships are often visualized by networks, or more formally, two-dimensional graphs consisting of vertices (nodes) connected pairwise by edges (Pavlopoulos et al., 2008b). The connections can express different types of relationships such as proteins known to interact or be co-expressed, sharing a domain, belonging to the same protein family or being evolutionarily related.
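The graph representation described above can be sketched with a simple adjacency mapping. The sketch assumes undirected relationships given as pairs of identifiers; the node labels in the usage note are placeholders rather than real interaction data, and the function names are illustrative.

```python
from collections import defaultdict


def build_graph(edges):
    """Undirected graph as an adjacency mapping: node -> set of neighbours."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def degree(graph):
    """Number of distinct interaction partners for every node."""
    return {node: len(neighbours) for node, neighbours in graph.items()}
```

For instance, `degree(build_graph([("A", "B"), ("A", "C"), ("C", "D")]))` reports two partners each for the placeholder nodes A and C, and one each for B and D. Highly connected nodes in such a summary are often the first candidates for closer inspection in a network view.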

One of the most widely used tools to construct networks in an automatic manner is Cytoscape (Shannon et al., 2003), an open source platform suitable for large-scale network analysis (Figure 6). It provides various types of automated network layout algorithms including hierarchical layout, circular layout and spring-embedded layout which arrange the nodes and edges in different positions. It also comes with many visual styles, allowing the user to change the node and edge attributes. One of the strengths of Cytoscape, besides its core functions, is that it can be extended with more than 100 plug-in modules, which allow for many specialized functions and integration with other databases.

The many available tools for constructing biological networks, e.g. Medusa (Hooper and Bork, 2005), Osprey (Breitkreutz et al., 2003) and ProViz (Iragne et al., 2005), differ in their layout representation of the networks, compatibility with other tools and databases, input format and other functionalities. They also vary in terms of how optimized they are for large-scale analysis, and many have limits on how many nodes and edges they allow. Therefore, the major challenge in this area is the increasing amount of data, with new large-scale experiments often generating hundreds of thousands of data points (Pavlopoulos et al., 2008b). A step towards a better visualization of large networks is to add an extra dimension to allow for a 3D representation. Examples of tools that have implemented this include Biolayoutexpress3D (Theocharidis et al., 2009), MetNetGE (Jia et al., 2010) and Arena3D (Pavlopoulos et al., 2008a).


Figure 6. Cytoscape interaction network for the CD40 ligand (CD40LG). The interaction data were obtained by using the Pathway commons web service client.

4.3 TRANSMEMBRANE PROTEIN TOPOLOGY PREDICTION

The use of computational methods to predict the function of a protein from its sequence is invaluable for the effort to understand biological mechanisms using the ever increasing amount of sequence data. Examples of protein features that can be predicted include post-translational modifications such as glycosylation (Gupta and Brunak, 2002) and phosphorylation sites (Blom et al., 1999), protein domains and motifs (Finn et al., 2010; Hunter et al., 2009) as well as subcellular information such as mitochondrial (Claros and Vincens, 1996) and nuclear (Jans et al., 2000) targeting sequences. Other important features are the signal sequences that target membrane proteins and secreted proteins to the ER. The following sections will focus on sequence-based prediction methods for transmembrane protein topology and the basics behind the machine learning-based approaches in this field.

4.3.1 TRANSMEMBRANE PROTEIN FEATURES

Developing a better understanding of membrane protein structure and function is of immense importance for both biological and pharmacological purposes. Membrane proteins are difficult to crystallize and even though the number of known structures is rising, they are still severely underrepresented in structural databases (Chen and Rost, 2002). In 2009, approximately 180 unique membrane protein high-resolution structures had been obtained (White, 2009). Therefore, the use of computational methods is necessary to obtain more information about this important group of proteins. Although several methods are able to predict beta-barrel structures (Jacoboni et al., 2001; Liu et al., 2003), beta-barrel proteins are restricted to the outer membranes of bacteria, chloroplasts and mitochondria and have a limited abundance in human cells. The focus here will therefore be on the prediction of the abundant alpha-helical membrane proteins that are found in all cellular membranes. The classical “topology” description of a membrane protein includes the number and locations of transmembrane (TM) segments as well as their position relative to the membrane (von Heijne, 2006) (Figure 7). Most of the existing methods use some basic concepts for their predictions: (i) TM segments have a limited length to span the lipid bilayer (on average ~25 residues) (Cuthbertson et al., 2005b); (ii) the positively charged amino acids Arg and Lys are not evenly distributed within the protein but more common on the cytoplasmic side (the positive-inside rule) (Heijne, 1986) and (iii) the amino acid distribution with respect to the positive-inside rule can vary between long and short globular loops (Chen and Rost, 2002). Identifying a well-characterized TM segment consisting of a stretch of hydrophobic residues of a distinct length could seem like an easy task, but it is complicated by other types of regions, such as those in globular proteins and signal peptides, which also contain long hydrophobic stretches.
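The positive-inside rule can be illustrated with a deliberately simplified sketch. It assumes that TM segment positions are already known, counts Arg and Lys in the loops on each side of the membrane, and assigns the side with more positive charges to the cytoplasm. The function name, coordinates and sequences are hypothetical; published methods treat loop length and flanking regions more carefully.

```python
def predict_n_terminus(sequence, tm_segments):
    """Return 'in' (cytoplasmic) or 'out' for the N-terminus.

    tm_segments: sorted list of (start, end) pairs, 0-based, end exclusive.
    Loops alternate sides, so loops 0, 2, 4, ... lie on the N-terminal side.
    """
    loops, previous_end = [], 0
    for start, end in tm_segments:
        loops.append(sequence[previous_end:start])
        previous_end = end
    loops.append(sequence[previous_end:])

    positives = [loop.count("K") + loop.count("R") for loop in loops]
    n_side = sum(positives[0::2])
    other_side = sum(positives[1::2])
    return "in" if n_side >= other_side else "out"

# Hypothetical two-helix protein: both termini carry Lys/Arg, the middle loop none.
two_helix = "MKKRS" + "L" * 20 + "SDE" + "A" * 20 + "RKKRG"
print(predict_n_terminus(two_helix, [(5, 25), (28, 48)]))

# Hypothetical single-spanning protein with the positive charges after the helix.
one_helix = "MSDE" + "L" * 20 + "KRKR"
print(predict_n_terminus(one_helix, [(4, 24)]))
```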

Figure 7. A schematic view of the topology of an alpha-helical membrane protein with four transmembrane segments and extracellular N-terminal. (Labels in the schematic: intracellular, plasma membrane, extracellular, N-terminal, C-terminal.)

In recent years, the increasing number of known membrane protein structures has led to a better understanding of their characteristics and has uncovered a higher complexity than was expected. Recent findings include TM segments that are tilted in the membrane and consequently need to be longer to fully span it (Park and Opella, 2005; Strandberg et al., 2004), TM helices containing disruptive kinks (Yohannan et al., 2004), as well as re-entrant regions which enter and exit the membrane on the same side (Viklund et al., 2006). However, most of the widely used prediction methods were developed before this additional complexity was discovered and therefore do not take it into consideration. Membrane protein topology prediction methods can generally be divided into two different classes of predictors: the first focusing primarily on an amino acid property-based evaluation of the protein sequence and the second using knowledge-based machine-learning methods.

4.3.2 MEMBRANE PROTEIN TOPOLOGY PREDICTION BASED ON AMINO ACID PROPERTIES

The first generation of prediction methods focused mainly on the physicochemical properties of the protein and used hydrophobicity scales such as Kyte and Doolittle (Kyte and Doolittle, 1982) and Eisenberg (Eisenberg et al., 1984) to identify TM regions. TopPred (von Heijne, 1992) implemented a more complex processing of the hydrophobicity scales as well as the positive-inside rule to automatically generate protein topology. Examples of newer methods that evaluate the protein sequences using algorithms based on amino acid properties are Split4, THUMBUP and SCAMPI. Split4 (Juretic et al., 2002) uses basic charge clusters, which consist of motifs of the basic residues Arg and Lys, as well as 15 different scales for amino acid attributes for topology predictions. THUMBUP (Zhou and Zhou, 2003) is based on a simple scale of burial propensity, which is the tendency of an amino acid to be buried by other residues, and uses the fact that transmembrane helices are packed more tightly than non-membrane helices in combination with the positive-inside rule for identifying TM segments. SCAMPI (Bernsel et al., 2008) uses an experimental scale of position-specific amino acid contributions to the free energy of membrane insertion (Hessa et al., 2007) to predict TM positions in the protein sequence.
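A hydrophobicity-scale approach of the first-generation kind can be sketched as a sliding-window average over the Kyte and Doolittle hydropathy values. The window length of 19 residues and the cut-off of 1.6 are assumptions made for this illustration, and the test sequence is hypothetical; this is not the procedure of TopPred or of any other specific tool mentioned above.

```python
# Kyte-Doolittle hydropathy values per amino acid.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence, window=19):
    """Mean hydropathy of every full window along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]

def candidate_tm_windows(sequence, window=19, threshold=1.6):
    """Start indices (0-based) of windows whose mean hydropathy reaches the cut-off."""
    profile = hydropathy_profile(sequence, window)
    return [i for i, value in enumerate(profile) if value >= threshold]

# Hypothetical sequence: polar flanks around a 19-residue hydrophobic stretch.
toy_sequence = "DEKRSDEKRS" + "LIVLALLIVAALLVIALLA" + "DEKRSDEKRS"
profile = hydropathy_profile(toy_sequence)
hits = candidate_tm_windows(toy_sequence)
print(hits)
```

A scan of this kind also flags hydrophobic stretches in signal peptides and globular proteins, which is the limitation that later methods try to address.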

4.3.3 MACHINE LEARNING-BASED APPROACHES FOR MEMBRANE PROTEIN TOPOLOGY PREDICTION

Machine learning algorithms utilized by membrane protein topology prediction methods include hidden Markov models, neural networks and support vector machines. Hidden Markov models (HMMs) describe a series of observations by a stochastic process with a finite number of states, where the probability of a certain state only depends on the value of the previous state (Krogh et al., 1994). The states in an HMM for TM protein topology prediction are connected to each other in a biologically reasonable way; for example, there may be three states representing an inside loop, a TM region and an outside loop. An HMM for TM protein topology prediction can include helix length, hydrophobicity, charge bias (positive-inside rule) and grammatical constraints in one single model (Krogh, 2001). The first methods to implement HMMs in their predictions were HMMTOP (Tusnady and Simon, 1998) and TMHMM (Sonnhammer et al., 1998), and since then many have followed. Other TM protein topology prediction methods have instead used artificial neural networks (NNs), which are inspired by the processing of information by the connected neurons in the nervous system. An NN consists of nodes connected to form a network and is often used to find patterns in data and model complex relationships (Bishop, 1994). Examples of prediction methods where NNs have been trained on protein sequence profiles include PHDhtm (Rost, 1996) and MEMSAT3 (Jones, 2007). Support vector machines (SVMs) are a group of supervised-learning methods related to NNs that have recently been applied to TM protein topology prediction in, for example, SVMtop (Lo et al., 2008) and MEMSAT-SVM (Nugent, 2009). SVMs use a set of mathematical kernel functions to transform the original variables so that they are linearly separable and can be classified into two groups. Some methods, such as Octopus (Viklund and Elofsson, 2008) and ENSEMBLE (Martelli et al., 2003), are based on combinations of several machine learning algorithms. Other features sometimes incorporated in the prediction methods are multiple sequence alignments, which include evolutionary information (Jones, 2007), while some newer methods also predict re-entrant regions in the protein. Table 3 summarizes the features of a selection of machine learning-based methods for membrane protein (MP) topology prediction.
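The idea of states connected in a biologically reasonable way can be illustrated with a toy HMM decoded by the Viterbi algorithm. The sketch below assumes four states (inside loop, outside loop and one helix state per crossing direction, so that loops must alternate sides), a three-letter alphabet (hydrophobic, positive, other) and invented probabilities. It is not the architecture of TMHMM, HMMTOP or any other published method.

```python
import math

HYDROPHOBIC = set("AILMFVW")  # assumed crude residue classes
POSITIVE = set("KR")

# i = inside loop, o = outside loop, Mio = helix going in->out, Moi = out->in.
STATES = ["i", "Mio", "o", "Moi"]
START = {"i": 0.5, "o": 0.5}
TRANS = {
    "i": {"i": 0.9, "Mio": 0.1},
    "Mio": {"Mio": 0.9, "o": 0.1},
    "o": {"o": 0.9, "Moi": 0.1},
    "Moi": {"Moi": 0.9, "i": 0.1},
}
EMIT = {  # inside loops emit more positive residues (positive-inside rule)
    "i": {"h": 0.15, "+": 0.25, "p": 0.60},
    "o": {"h": 0.15, "+": 0.05, "p": 0.80},
    "Mio": {"h": 0.90, "+": 0.02, "p": 0.08},
    "Moi": {"h": 0.90, "+": 0.02, "p": 0.08},
}

def to_symbols(sequence):
    return ["h" if aa in HYDROPHOBIC else "+" if aa in POSITIVE else "p"
            for aa in sequence]

def log(p):
    return math.log(p) if p > 0 else float("-inf")

def viterbi(symbols):
    """Most probable state path for the observed symbols (log space)."""
    score = {s: log(START.get(s, 0)) + log(EMIT[s][symbols[0]]) for s in STATES}
    back = []
    for sym in symbols[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[r] + log(TRANS[r].get(s, 0)))
            new_score[s] = score[prev] + log(TRANS[prev].get(s, 0)) + log(EMIT[s][sym])
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical protein: positively charged N-terminal loop, one helix, neutral C-terminus.
toy_protein = "MKRSKRSDES" + "L" * 20 + "SDESDESDES"
topology = viterbi(to_symbols(toy_protein))
print("".join(state[0] for state in topology))
```

In this toy model the hydrophobicity signal places the helix and the charge bias decides which loop is labeled as inside, both within one probabilistic model.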

4.3.4 SIGNAL PEPTIDE PREDICTION

N-terminal signal sequences that guide a translating protein to the endoplasmic reticulum for subsequent translation and transport via the secretory pathway are often called signal peptides (SPs) and are found in most secreted proteins and some types of membrane proteins. Due to the similarities between SPs and TM segments in their hydrophobic properties and lengths, it is difficult to discriminate between them, and this is a major issue in the TM topology prediction field (Kall et al., 2004). There are several methods that are specialized in predicting SPs, such as SignalP (Bendtsen et al., 2004) and TargetP (Emanuelsson et al., 2000), and these can be used as a prefiltering step before performing the TM protein topology prediction. Some methods, for example Phobius (Kall et al., 2004), SPOCTOPUS (Viklund et al., 2008) and MEMSAT-SVM (Nugent, 2009), have instead incorporated an SP prediction model into their TM topology prediction algorithm. This allows for more reliable results when it comes to distinguishing between the two features and knowing whether an N-terminal hydrophobic stretch is a TM segment or an SP, which will be cleaved off the protein.

Table 3. A selection of machine learning-based methods for MP topology prediction. NN – neural network; HMM – hidden Markov model; SVM – support vector machine; TM – transmembrane region; SP – signal peptide; RE – re-entrant region.

Method                                      Algorithm   Predicted features   URL
ENSEMBLE (Martelli et al., 2003)            NN+HMM      TM                   http://pongo.biocomp.unibo.it/
HMMTOP (Tusnady and Simon, 1998)            HMM         TM                   http://www.enzim.hu/hmmtop/
MEMSAT3 (Jones, 2007)                       NN          TM+SP                http://bioinf.cs.ucl.ac.uk/psipred/
MEMSAT-SVM (Nugent, 2009)                   SVM         TM+SP+RE             http://bioinf.cs.ucl.ac.uk/psipred/
PHDhtm (Rost, 1996)                         NN          TM                   http://www.predictprotein.org/
PHOBIUS (Kall et al., 2004)                 HMM         TM+SP                http://phobius.sbc.su.se
PRODIV-TMHMM (Viklund and Elofsson, 2004)   HMM         TM                   http://topcons.cbr.su.se/index.php?about=proprodiv
SPOCTOPUS (Viklund et al., 2008)            NN+HMM      TM+SP+RE             http://octopus.cbr.su.se/
SVMtop (Lo et al., 2008)                    SVM         TM                   http://bio-cluster.iis.sinica.edu.tw/~bioapp/SVMtop/
TMHMM (Sonnhammer et al., 1998)             HMM         TM                   http://www.cbs.dtu.dk/services/TMHMM
ZPRED (Granseth, 2006)                      NN+HMM      TM                   http://topcons.cbr.su.se/index.php?about=zpred

4.3.5 TOPOLOGY PREDICTION PERFORMANCE

Several studies have attempted to compare the accuracy of different topology prediction methods (Chen, 2002; Chen and Rost, 2002; Cuthbertson et al., 2005a; Melen et al., 2003; Möller et al., 2001), with varying results showing a wide range of accuracies and no single method performing consistently better than the others. The challenges in comparing the methods lie mainly in the limited amount of experimental data, with the consequence that all methods have been developed and trained using basically the same small set of known membrane proteins. As independent cross-validation using a common test set is unachievable, accuracy is hard to estimate.


Also, the evaluations in the comparative studies that have been performed are difficult to compare to one another (Chen and Rost, 2002), as they are based on different measures of prediction accuracy and different data sets. However, a lot of progress has been made in the field of membrane protein topology prediction in recent years, and currently some of the newer methods claim to predict correct topologies for 80-93% of proteins (Bernsel et al., 2008; Nugent, 2009; Viklund and Elofsson, 2008). Another aspect to consider is that the set of well-studied membrane proteins is expected to be biased and not representative of the set of membrane proteins in a complete proteome (Kall and Sonnhammer, 2002). Therefore, the expectations on accuracy must be lowered when a whole-proteome analysis is performed. Also, membrane protein structures show a higher diversity in eukaryotes than in bacteria, which poses a problem for eukaryotic studies since a large portion of the available 3D structures are of prokaryotic origin (Lomize et al., 2006).

PRESENT INVESTIGATION


5. OBJECTIVE

The objective of this thesis is to gain an understanding of the global protein expression patterns in human cells and tissues as well as to define the human membrane proteome. A number of fundamental questions in human biology have been addressed:

• How many of the human genes encode membrane proteins and how do the available prediction methods for membrane protein topology compare? (Paper I)

• To what extent are proteins expressed in cells and tissues and how large a fraction of our proteins is expressed in various cell types? (Paper II)

• Can protein expression profiles be used to categorize cell lines of different tumor origin and what is the difference between them in terms of protein expression levels? (Paper III)

• What is the protein distribution across the subcellular compartments in selected human cell lines and are there common organelle proteomes? (Paper IV)

• How do protein abundance and transcript abundance compare and do the expression levels vary between different cell lines or different functional classes of proteins? (Paper V)


6. MAPPING THE HUMAN MEMBRANE PROTEOME (I)

In this study, we focus on the important and large group of alpha-helical transmembrane proteins in the human proteome. Membrane proteins are crucial for many biological functions and are involved in processes such as signaling, energy-transfer, ion-transport, nerve impulses and cell-cell interactions. They are also very important in the pharmaceutical industry as they represent a majority of all drug targets (Bakheet and Doig, 2009; Yildirim et al., 2007). All of these characteristics make membrane proteins highly interesting proteins for functional studies and targets for antibody-based proteomic studies. However, generation of polyclonal antibodies against alpha-helical membrane proteins requires careful selection of antigen to avoid the inaccessible membrane spanning regions. In the Human Protein Atlas project, the ability to discriminate between soluble proteins and membrane proteins is vital in many stages of the project pipeline, particularly in the antigen selection procedure where information about the location of TM regions is needed. Since the structure of most membrane proteins remains unknown (Chen and Rost, 2002), the use of sequence-based prediction methods is necessary for obtaining more information about this group of proteins. Today, there exist numerous prediction methods for membrane protein topology, most of which are based on either a residue-based evaluation of the protein sequence or machine-learning algorithms that take the whole sequence into consideration. Therefore, the focus of Paper I has been prediction of membrane proteins using these types of methods.

After assessing the field of membrane protein topology predictors, the decision was made to use a combination of several methods to allow for a consensus prediction result based on more than one source. To select the most appropriate methods for this study, the main properties considered were the estimated accuracy, the ability to discriminate between globular and membrane proteins and the availability of a stand-alone version of the algorithm. Unfortunately, some methods are only available as web-based predictors, which are not suitable for large-scale studies such as a whole-proteome analysis involving several thousands of sequences. In addition, methods predicting more features than only TM segments were favored, particularly combinations with signal peptide predictions. The final selection included seven methods fulfilling the criteria and based on different underlying algorithms to be as complementary as possible: MEMSAT3 (Jones, 2007), MEMSAT-SVM (Nugent, 2009), Phobius (Kall et al., 2004), SCAMPI (Bernsel et al., 2008), SPOCTOPUS (Viklund et al., 2008), TMHMM (Sonnhammer et al., 1998) and THUMBUP (Zhou and Zhou, 2003).

The human protein sequences were obtained from Ensembl (Flicek et al., 2010) version 52.36, which included 21,416 protein-coding genes and 41,285 unique protein sequences. The whole-proteome analyses were run on a 980-node compute cluster and the results contained the positions of all predicted TM segments as well as the orientation of the loops relative to the membrane, and for some methods also included positions of predicted re-entrant regions and SPs. Previous studies have shown that prediction performance is related to the number of methods that agree on a prediction, and that a consensus approach can increase the reliability (Arai et al., 2004; Nilsson et al., 2000). Therefore, the combined prediction results were used to construct a majority-decision method (MDM) based on overlapping TM segments. Each amino acid position included in a predicted TM segment by at least four of the seven methods was considered an MDM position. The purposes of the MDM were mainly to discriminate between globular and membrane proteins, to allow for an estimation of the total number of human genes encoding membrane proteins, and to locate the regions of the protein sequence most likely to contain TM segments. It is not a membrane protein topology prediction method per se, as it does not consider the orientation of loop regions, and the resulting MDM regions have no lower or upper length constraint and can therefore not be considered actual TM segments.
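The majority decision described above can be sketched as a per-residue vote. The sketch assumes each method reports TM segments as inclusive, 1-based (start, end) pairs; the function names and all coordinates are hypothetical and only illustrate the rule that a position needs support from at least four of the seven methods.

```python
def mdm_positions(predictions, min_votes=4):
    """Residue positions covered by TM segments from at least `min_votes` methods.

    predictions: {method_name: [(start, end), ...]}, 1-based inclusive.
    """
    votes = {}
    for segments in predictions.values():
        covered = set()
        for start, end in segments:
            covered.update(range(start, end + 1))
        for position in covered:  # each method votes once per position
            votes[position] = votes.get(position, 0) + 1
    return sorted(pos for pos, count in votes.items() if count >= min_votes)

def to_regions(positions):
    """Collapse sorted positions into contiguous (start, end) regions."""
    regions = []
    for position in positions:
        if regions and position == regions[-1][1] + 1:
            regions[-1] = (regions[-1][0], position)
        else:
            regions.append((position, position))
    return regions

# Hypothetical predictions for one protein from the seven selected methods.
predictions = {
    "MEMSAT3": [(10, 30)],
    "MEMSAT-SVM": [(12, 32)],
    "Phobius": [(11, 31)],
    "SCAMPI": [(15, 35)],
    "SPOCTOPUS": [(10, 30), (60, 80)],
    "TMHMM": [],
    "THUMBUP": [(62, 82)],
}
mdm_regions = to_regions(mdm_positions(predictions))
is_membrane_protein = bool(mdm_regions)
print(mdm_regions, is_membrane_protein)
```

As in the text, the resulting regions carry no length constraint and no loop orientation; the second candidate segment is dropped because only two methods support it.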

All prediction results, including the MDM, were displayed in a publicly available web-based visualization tool (Berglund et al., 2008a), which is used for the antigen selection in the Human Protein Atlas project. The numbers of genes predicted to encode a membrane protein by each method were reported (Figure 8) and ranged from 5508 to 7651. The MDM result was used as the estimate for the fraction of membrane protein-coding genes (26%). An analysis of the distribution of the number of predicted TM segments showed that single-spanning membrane proteins compose the largest group, with gradually decreasing numbers for higher TM counts, except for the larger group of 7-TM proteins, which contains the G-protein coupled receptors (GPCRs). Next, the prediction results were used to compare the methods in various ways. Hierarchical clustering using a similarity measure showed that the most deviating method was SPOCTOPUS, which was also the method predicting the highest number of membrane protein-coding genes, and that the methods with the closest relationship were Phobius and MDM. When analyzing the overlap between the predictions, all methods agreed on classifying 21% of the genes as coding for a membrane protein, but there were large variations in the predicted numbers and positions of TM segments. Merely 12% of all proteins had the same number of predicted TM segments by all methods. Correspondingly, only 12% had the same predicted orientation of the first TM segment relative to the membrane.

Figure 8. The number of genes predicted to encode a membrane protein by each of the seven selected methods, as well as the MDM. (Bar chart; y-axis: number of genes, 0 to 9000; x-axis labels: MEMSAT3, MDM, MEMSAT-SVM, SPOCTOPUS, TMHMM2.0, THUMBUP1.0, PHOBIUS1.01 and SCAMPI_multi; bar values shown: 7651, 7300, 7248, 6292, 6190, 5771, 5508 and 5539.)

The outcomes of the study, showing large deviations between methods and differing estimated numbers of predicted membrane proteins, demonstrate the value of considering results from several methods when analyzing membrane proteins. In the antigen selection procedure in the Human Protein Atlas project, each protein is studied in a graphical representation, which allows for a manual evaluation of the most probable TM segment positions. However, in large-scale studies, manual inspection is often not possible, and the MDM can be a useful resource for evaluating the results of the other methods. The results from this study are presented on the Human Protein Atlas website, providing the topology predictions from the seven selected methods as well as the MDM and an additional method specializing in predicting GPCRs (Wistrand et al., 2006). The display also includes predicted signal peptide (SP) positions from Phobius, SPOCTOPUS and MEMSAT-SVM, as well as SignalP (Bendtsen et al., 2004), which is specialized solely on this feature. MEMSAT-SVM, however, only predicts signal peptides in proteins categorized as membrane proteins by its predictions. Lists of functionally classified membrane proteins are also provided, e.g. transporters and ion channels. All results can be combined with an advanced search function to explore the available protein expression data (Uhlen et al., 2010).

An additional study was performed to obtain a rough estimate of the size of the human secretome, defined as proteins with an SP and without additional membrane-spanning regions (Fagerberg and Uhlen, unpublished). Ensembl version 56.37, with a total number of 20,734 protein-coding genes, was used. Results from a whole-proteome scan using a selection of three methods for signal peptide prediction, SignalP, Phobius and SPOCTOPUS, were analyzed. Since SPs are found both in secreted proteins and certain types of membrane proteins, all proteins with a predicted SP in combination with a predicted MDM region were considered membrane proteins and not secreted proteins. The resulting numbers of genes encoding a predicted secreted protein ranged between 3074 and 3726 (Table 4). For the estimation of the size of the secretome, all proteins with an SP predicted by at least two out of three methods were considered potentially secreted proteins. The resulting numbers were 3211 (15%) of all protein-coding genes and 7194 corresponding proteins.

Table 4. The total number of genes encoding proteins with predicted signal peptides and the corresponding unique proteins in parentheses, with and without removing the set of membrane proteins as determined by the MDM.

Method       Number of genes with SP (proteins)   Number of genes with SP and no MDM (proteins)
SignalP3.0   4868 (12,417)                        3074 (6906)
SPOCTOPUS    4000 (12,175)                        3726 (8030)
Phobius      4534 (11,215)                        3356 (7457)

The combined results from the two studies were used to map the distribution of potential membrane proteins and secreted proteins in the human proteome. For each of the 20,734 genes, all corresponding proteins were annotated using three categories of (i) secreted, (ii) membrane and (iii) none (without SP/TM features), which resulted in seven potential combinations for genes encoding multiple proteins (Figure 9). To exemplify this, the genes in the intersection between the three circles of the Venn diagram (Figure 9B) encode at least three different proteins where one does not have any SP/TM features, one is a membrane protein according to the MDM, and one is a potentially secreted protein with at least two overlapping SP predictions.
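The annotation rules described in this section can be summarized in a short sketch: a protein with an MDM region is labeled membrane, a protein without an MDM region but with an SP predicted by at least two of the three SP methods is labeled secreted, and the remaining proteins are labeled none. A gene whose proteins fall into more than one category is labeled multiple. The function names and the example genes are hypothetical.

```python
def classify_protein(sp_calls, has_mdm_region, min_sp_votes=2):
    """Label one protein as 'membrane', 'secreted' or 'none'.

    sp_calls: booleans, one per SP method (e.g. SignalP, Phobius, SPOCTOPUS).
    has_mdm_region: True if the majority-decision method found a TM region.
    """
    if has_mdm_region:  # an SP plus an MDM region still counts as membrane
        return "membrane"
    if sum(sp_calls) >= min_sp_votes:
        return "secreted"
    return "none"

def classify_gene(proteins):
    """Combine the labels of all proteins encoded by one gene."""
    labels = {classify_protein(sp_calls, mdm) for sp_calls, mdm in proteins}
    category = labels.pop() if len(labels) == 1 else "multiple"
    return category

# Hypothetical genes, each with a list of (SP calls, MDM region) per protein.
genes = {
    "gene_1": [((True, True, False), False)],
    "gene_2": [((True, True, True), True), ((False, False, False), True)],
    "gene_3": [((True, True, True), False), ((False, False, False), False)],
    "gene_4": [((True, False, False), False)],
}
categories = {gene: classify_gene(proteins) for gene, proteins in genes.items()}
print(categories)
```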

In summary, the results from these studies suggest that 15% of the human protein-coding genes encode at least one splice variant containing a signal peptide and no transmembrane region (“secretome”), 26% encode at least one splice variant containing a transmembrane region (“membrane proteome”) and 74% encode a splice variant with neither a signal peptide nor a transmembrane region (“intracellular”). The accumulated number is larger than 100% because 14% of the genes encode multiple splice variants with different localization signals.

Figure 9. The distribution of genes encoding proteins without SP/TM features (NONE) as well as those encoding potential membrane proteins and potentially secreted proteins for genes encoding: (A) all human protein coding genes (n=20,734) where each gene belongs to exactly one of the four categories, and (B) multiple proteins with non-redundant sequences (n=13,102) visualized by a Venn diagram. The “multiple” section contains genes encoding proteins representing more than one of the different categories and is visualized by the intersections of the Venn diagram. The percentages in the Venn diagram are the fractions of the total number of genes.

Panel A (all protein coding genes): NONE 12,754 (62%); MEMBRANE 3,535 (17%); SECRETED 1,533 (7%); MULTIPLE 2,912 (14%).
Panel B (genes encoding multiple proteins; Venn diagram of NONE, MEMBRANE and SECRETED): region counts 7,999 (39%), 1,623 (8%), 1,234 (6%), 934 (5%), 568 (3%), 438 (2%) and 306 (1%).


7. GLOBAL ANALYSIS OF PROTEIN EXPRESSION IN HUMAN CELLS AND TISSUES (II & III)

Within the Human Protein Atlas project, affinity-purified polyclonal antibodies are generated and used for protein expression profiling in a multitude of tissues as well as cell lines. High-resolution images are obtained from immunohistochemical stainings of 48 tissues and 45 cell lines. The different cell types in the tissues are manually annotated by pathologists, whereas the cell lines are automatically annotated by image analysis software. In addition, immunofluorescence (IF)-based confocal microscopy images allow for manual annotation of the subcellular location in three of the cell lines. The results do not provide absolute expression levels, but represent the relative expression level of a particular protein across a number of cell types. In the following two studies, systematic analyses of the resulting protein expression profiles have been performed. The antibodies used for the analyses were either generated within the project or obtained from commercial providers.

In Paper II, the relative abundance and spatial distribution of proteins in different cell types in major tissues and organs were analyzed. The study included 5934 antibodies with TMA data from tissues, 5349 antibodies with CMA data from cell lines, as well as 2249 antibodies with IF data. The relative protein intensity levels were annotated and visualized using a discrete four-grade color code: negative (white), weak (yellow), moderate (orange) and strong (red). The relative protein levels were used to perform an unsupervised cluster analysis (Eisen et al., 1998) of 65 normal cell types. The resulting dendrogram displayed relationships between cell types which harmonized well with current histological and embryological knowledge. In brief, most cells were grouped into six major clusters composed of (i) hematopoietic cells, (ii) mesenchymal cells, (iii) endocrine cells, cells from (iv) squamous epithelia, (v) glandular and transitional epithelia and (vi) the central nervous system. The results also suggested the developmental origin of cells to be an important factor for expression-based clustering, as the common denominator for a subset of clusters appeared to be the primary germ cell layer (ectoderm, endoderm and mesoderm).

An analysis of the annotated relative levels of protein expression in each cell type indicated that a large number of proteins are expressed across a majority of cell types. On the tissue level, 20% of all proteins were detected in 60 or more of the 65 analyzed cell types. The fraction of detected proteins out of all analyzed proteins in each of the cell types ranged between 40% and 84%, with an average of 68% (Figure 10A). The analysis of cell lines resulted in even higher numbers, with a large fraction of proteins detected in all 45 cell lines and nearly 80% of all analyzed proteins detected in each of the cell lines (Figure 10B), based on the IHC data. Accordingly, more than 70% of the proteins were detected in each of the three cell lines used for the IF studies. For the analyses based on IHC, the use of different subsets of antibodies, e.g. only those verified by supportive Western blots or paired antibodies targeting the same protein with a high level of validity, showed essentially the same results. An effort to identify tissue-specific proteins, i.e. proteins exclusively expressed in a single cell type, resulted in a short list consisting of 74 proteins, including well-known proteins like insulin and troponin, but also a group of proteins with little annotation and of potential interest. Overall, less than 2% of all proteins were detected in a single or only a few cell types.

Figure 10. The fraction of studied proteins detected in a specific cell type in (A) 65 cell types from normal tissues and (B) 45 cell lines, ordered according to the fraction of detected proteins. For each cell type, the sub-fraction of proteins detected at weak, moderate or strong relative expression levels is shown in yellow, orange and red, respectively. Black represents missing data. (Panels A and B; y-axis: fraction of expressed proteins (%), 0 to 100; x-axis: normal tissue cell type in A and cell line in B.)

Three cell types (hepatocytes from liver, neurons from cerebral cortex and lymphoid cells from lymph nodes) were selected for a study to explore the difference in protein expression between cells with very distinct phenotypes. The results showed that few proteins are detected only in a single cell type and that 90% of all antibodies detected proteins in at least one of the three cell types. However, only 6% of all antibodies detected proteins at the same level in all three cell types. A similar study of three more closely related epithelial cell types resulted in fewer proteins detected in a single cell type and a larger fraction (17%) detected at the same level in all three cell types. A similar comparison of the cell lines used for IF analyses showed that 72% of proteins were detected in all three cell lines, with the majority expressed at different levels.

In Paper III, cell lines are in focus and the protein expression profiles are used to examine the characteristics of, and similarities between, cell lines of various origins. As human cancer cell lines are widely used as model systems, it is important to understand how alterations caused by the in vitro microenvironment affect their properties. Here, automated annotation of IHC images resulted in protein profiles with relative expression levels across 45 cell lines for 4356 proteins, representing ~20% of all human protein-coding genes. In contrast to the previously used four-grade scale for relative expression levels, the automated image analysis results in continuous intensity values and allows for a more fine-tuned evaluation of the extent of protein expression across the cell lines.

The relative intensity levels were used to perform a hierarchical clustering analysis of all cell lines, which represent different types of leukemia and lymphoma as well as major types of solid cancer. The resulting dendrogram revealed clusters which largely reflected the categories of cell lines defined by their tumor of origin or phenotype, as well as a group of four cell lines of various tissue origin but with a common feature of expression of neuroid markers. With the addition of this new group, the defined categories can be summarized as: (i) lymphoma, (ii) multiple myeloma, (iii) myeloid cells, (iv) epithelial carcinoma, (v) glioma/sarcoma/melanoma, (vi) TIME cells and (vii) four cell lines expressing neuroid markers. The fact that the solid cell lines and the hematological cell lines were found in two separate branches of the dendrogram could be a result of the different cultivation methods, growing on a plastic surface or in suspension, respectively. Moreover, the clustering of cell lines derived from solid tumors did not reflect their tissues of origin as well as the clustering of the hematological cell lines did. The hematological cell lines, comprising lymphoma, multiple myeloma and myeloid cells, clustered separately and the sub-groups were comparable to the known hematological lineages of differentiation from which the cell lines are derived (Figure 11). These results were further supported by a dendrogram generated from a subset of antibodies targeting cell differentiation proteins. Additionally, PCA was used as an alternative method of analyzing the spatial distribution of the cell lines, and the results largely reflected the hierarchical clustering with similar groupings of cell types.
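The clustering step described above can be illustrated with a minimal Python sketch. It is not the code used in Paper III: the distance measure (one minus the Pearson correlation), the average-linkage rule and the toy cell line names and intensity values are all assumptions made for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long intensity profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_linkage(profiles):
    """Agglomerative clustering; returns the merges in the order they occur.

    profiles: dict mapping a cell line name to a list of intensity values.
    Each merge is a tuple (members_a, members_b, distance).
    """
    # Assumed distance: 1 - Pearson correlation between two profiles.
    dist = {
        frozenset((a, b)): 1.0 - pearson(profiles[a], profiles[b])
        for a, b in combinations(profiles, 2)
    }
    clusters = [(name,) for name in profiles]
    merges = []

    def cluster_distance(c1, c2):
        # Average linkage: mean distance over all cross-cluster pairs.
        pairs = [dist[frozenset((a, b))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        merges.append((c1, c2, cluster_distance(c1, c2)))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(tuple(sorted(c1 + c2)))
    return merges


# Hypothetical toy data: four cell lines, four proteins each.
toy = {
    "line_A": [0.9, 0.8, 0.1, 0.2],
    "line_B": [0.8, 0.9, 0.2, 0.1],
    "line_C": [0.1, 0.2, 0.9, 0.8],
    "line_D": [0.2, 0.1, 0.8, 0.9],
}
merges = average_linkage(toy)
for left, right, distance in merges:
    print(left, right, round(distance, 3))
```

In this toy example the two pairs with similar profiles are joined first and the two resulting groups are joined last, which is the kind of branch structure a dendrogram displays.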

[Figure 11. Hematopoiesis diagram and dendrogram. Group labels: leukemia / lymphoma, myeloid, multiple myeloma. Cell lines shown: LP-1, Karpas-707, RPMI-8226, U-698, U-266/70, U-266/84, Daudi, HDLM-2, KM3, MOLT-4, HMC-1, HEL, K-562, HL-60, NB-4, THP-1, U-937.]

Figure 11. Schematic figure showing a simplified diagram of hematopoiesis, from the multipotent hematopoietic stem cell to increasingly differentiated phenotypes, as well as a dendrogram derived from hierarchical clustering of 4356 protein expression profiles. The 17 hematological in vitro cultured cell lines included in the analysis are depicted at positions in the diagram according to what is known from previous studies regarding their phenotype and classification based on CD markers. MSC-multipotent hematopoietic stem cell, LyPC-lymphoid progenitor cell, MCPC-mast cell progenitor cell, EPC-erythroid progenitor cell, MPC-myeloid progenitor cell, B-B lymphocyte, T-T lymphocyte, MC-mast cell, E-erythrocyte, Gr-granulocyte, Mo-monocyte, PC-plasma cell.


The differences in expressed proteins between various pairs of cell line categories were analyzed by performing two-group comparisons. The resulting candidate lists of proteins differentially expressed between the two compared groups were examined using GO-based enrichment analyses to find enriched functional terms and biological processes. This strategy allowed for the identification of proteins expressed, for example, solely in solid cell lines, as well as comparisons between the various sub-groups of the hematological cell lines. An up-regulation of proteins involved in immune response in the hematological cell lines and of cell adhesion proteins in cell lines established from solid tumors exemplifies some of the results. Furthermore, the unexpected positions of the myeloid cell lines and the multiple myeloma cell lines in the hematological dendrogram and the PCA plot called for a comparison between the protein profiles of these two groups, and the results indicated a shared dependence on a bone marrow microenvironment as a probable explanation.
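To make the idea of a GO-based enrichment analysis concrete, the sketch below computes a one-sided hypergeometric p-value for a single GO term and adjusts a list of p-values for multiple testing. The choice of the hypergeometric test and of the Benjamini-Hochberg adjustment is an assumption for illustration; this section does not state which test or correction was applied in Paper III, and the numbers are invented.

```python
from math import comb


def hypergeom_enrichment_p(total, annotated, selected, hits):
    """P(X >= hits) when `selected` proteins are drawn from `total`,
    of which `annotated` carry the GO term (one-sided test)."""
    upper = min(annotated, selected)
    favourable = sum(
        comb(annotated, k) * comb(total - annotated, selected - k)
        for k in range(hits, upper + 1)
    )
    return favourable / comb(total, selected)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to keep the adjustment monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented example: 4356 profiled proteins, 200 annotated with a term,
# 100 proteins in the candidate list, 15 of them annotated with the term.
p_term = hypergeom_enrichment_p(4356, 200, 100, 15)
print(p_term)
print(benjamini_hochberg([0.01, 0.04, 0.03]))
```

A term is then reported as enriched when its adjusted p-value falls below a chosen threshold; the threshold itself is a further choice not specified here.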

8. MAPPING THE SUBCELLULAR LOCATION OF PROTEINS (IV)

In Paper IV, the antibodies generated within the Human Protein Atlas project are used for a systematic protein localization study at the subcellular level. Immunofluorescence and confocal microscopy were used to obtain manually annotated subcellular locations, staining intensities and characteristics for 4005 proteins in three human cell lines of different origins: U-251 malignant glioma, U-2 OS osteosarcoma and A-431 epidermoid carcinoma. The annotation describes the antibody staining in one or several out of 16 subcellular compartments or organelles: cytoplasm, nucleus, nuclear membrane, nucleoli, mitochondria, endoplasmic reticulum, Golgi apparatus, vesicles, plasma membrane, focal adhesions, cell junctions, centrosome, aggresome, microtubules, actin filaments and intermediate filaments. All high-resolution images and annotation results are published on the Human Protein Atlas website. In this report, almost 20% of the human protein-coding genes were mapped.

The approach used in this immunolocalization study resulted in detection of 89% of all protein targets in at least one cell line, and the fraction of proteins detected in each cell line ranged from 80% to 84%. This supports the ubiquitous expression patterns previously observed in Paper II, and correspondingly, a majority of proteins were found in all three cell lines and few cell-type specific proteins could be identified. Most of the detected proteins (55%) were found in a single location, but 45% were detected in two to four locations. Proteins localized to several locations provide interesting candidates for further studies, especially regarding protein trafficking within the cell. The protein distribution across all annotated subcellular locations was also analyzed and the results for U-2 OS are shown in Figure 12. To evaluate the protein localization results, Gene Ontology-based enrichment analyses were performed using lists of proteins detected in selected organelles. The experimental approach used in this study was supported by the results, with top hits of GO terms directly related to the organelle in question.


[Figure 12. Bar chart with one bar per subcellular compartment: cytoplasm, nucleus, nuclear membrane, nucleoli, mitochondria, endoplasmic reticulum, Golgi apparatus, vesicles, plasma membrane, focal adhesions, cell junctions, centrosome, aggresome, microtubules, actin filaments and intermediate filaments. Vertical axis: number of proteins, with tick marks from 0 to 300 and from 1400 to 1600.]

Figure 12. The number of proteins detected in each of the subcellular compartments in U-2 OS, with a break between 300 and 1400 due to the large number of proteins annotated in cytoplasm and nucleus.

Hierarchical clustering of subcellular compartments using combined results from all three cell lines revealed that there are common organelle proteomes. Moreover, the dendrogram showed that biochemically similar organelles, for example the Golgi apparatus and vesicles, or the endoplasmic reticulum and nuclear membrane, were also the most closely related with respect to expressed proteins. Protein-compartment networks were constructed to visualize the relationships between the different organelles in each of the three cell lines. An alternative version of the U-2 OS network is shown in Figure 13, where each node represents a particular combination of subcellular compartments in which any of the proteins were detected, instead of representing each protein.

Figure 13. Network of the distribution of protein subcellular locations in U-2 OS. Each circle represents a combination of subcellular compartments and the size is related to the number of proteins detected in a particular combination, indicated by the number in the middle. Circle and edge colors represent single (blue), double (green), triple (purple) and quadruple (red) locations. Subcellular compartments are represented by rounded rectangles in black and a two letter code: AC – actin filaments, AG – aggresome, CE – centrosome, CJ – cell junctions, CY – cytoplasm, ER – endoplasmic reticulum, FA – focal adhesions, IF – intermediate filaments, GO – Golgi apparatus, MI – mitochondria, MT – microtubules, NM – nuclear membrane, NO – nucleoli, NU – nucleus, PM – plasma membrane, VE – vesicles.
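The node sizes in a network of this kind correspond to counts of proteins per combination of compartments. A minimal sketch of such a tally is given below, using the two-letter compartment codes from the caption; the protein identifiers and annotations are hypothetical and the functions are not taken from Paper IV.

```python
from collections import Counter


def combination_counts(annotations):
    """Count proteins per distinct combination of compartments.

    annotations: dict mapping a protein identifier to a list of
    two-letter compartment codes (e.g. "NU", "CY") for one cell line.
    Proteins with an empty list (not detected) are skipped.
    """
    return Counter(frozenset(codes) for codes in annotations.values() if codes)


def multiplicity_fractions(annotations):
    """Fraction of detected proteins found in 1, 2, 3, ... compartments."""
    sizes = Counter(len(set(codes)) for codes in annotations.values() if codes)
    detected = sum(sizes.values())
    return {n: count / detected for n, count in sorted(sizes.items())}


# Hypothetical annotations for five proteins in one cell line.
example = {
    "protein_1": ["NU"],
    "protein_2": ["NU", "CY"],
    "protein_3": ["CY", "NU"],
    "protein_4": ["MI"],
    "protein_5": [],  # not detected
}
counts = combination_counts(example)
fractions = multiplicity_fractions(example)
print(counts[frozenset({"NU", "CY"})], fractions)
```

Using a frozenset as the key makes the combination independent of the order in which the compartments were annotated, so "NU, CY" and "CY, NU" fall into the same node.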


9. COMPARISON OF THE TRANSCRIPTOME AND PROTEOME IN THREE HUMAN CELL LINES (V)

One of the drawbacks with the antibody-based approach used in the Human Protein Atlas project is the lack of quantitative measures of expression levels in the studied cells and tissues. In Paper V, the proteomic data was therefore supplemented with quantitative mass spectrometry (MS) analyses and sequencing-based transcript profiling of the same three cell lines used in Paper IV: U-2 OS, U-251 MG and A-431.

The rapid technology development that has occurred in the sequencing field during the last decade has allowed for large-scale studies such as in-depth analysis of gene expression. The term “RNA-seq” for transcriptome profiling was coined in 2008 for an analysis performed using Illumina sequencing on the yeast transcriptome (Nagalakshmi et al., 2008). Shortly after, the unit of measure reads-per-kilobase-of-exon-model-per-million-mapped-reads (RPKM) was introduced to quantify transcript levels (Mortazavi et al., 2008). The RPKM measure allows for a comparison of transcript levels within and between samples and is based on the number of reads mapping uniquely to a gene, normalized for the total read number and gene length. RNA-seq has been performed on a number of cells and tissues, and a ubiquitous gene expression with a high number of detected genes has been reported (Ramsköld et al., 2009).
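The RPKM normalization follows directly from its name: the read count of a gene is divided by the length of its exon model in kilobases and by the total number of mapped reads in millions. The short sketch below expresses this; the function and parameter names and the example numbers are chosen for illustration only.

```python
def rpkm(gene_reads, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads.

    gene_reads:          reads mapping uniquely to the gene
    exon_length_bp:      length of the exon model in base pairs
    total_mapped_reads:  total number of mapped reads in the sample
    """
    # (reads / (length / 1e3)) / (total / 1e6) == reads * 1e9 / (length * total)
    return gene_reads * 1e9 / (exon_length_bp * total_mapped_reads)


# Invented example: 500 reads on a 2000 bp gene in a sample of 10 million reads.
print(rpkm(500, 2000, 10_000_000))  # 25.0
```

Because both gene length and sequencing depth are divided out, a gene twice as long with twice as many reads, or a sample sequenced twice as deeply, yields the same value.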

In this study, the SOLiD™ sequencing technology (Mardis, 2008) was used to perform a massive analysis of the transcriptome of the three selected cell lines, and RPKM values were used as estimates of expression levels. Results from IF-based confocal microscopy on a subset of the proteome were used to report on proteins detected by antibodies, and a triple-SILAC MS-based method (Ong et al., 2002) was used to perform a deep proteomic analysis of the cell lines. The RNA-seq analysis showed a highly ubiquitous expression with detection of 62-67% of all analyzed genes in the three cell lines. 74% of the detected transcripts were expressed in all cell lines and 13% were found in a single cell line. The MS analysis resulted in detection of 25% of all protein-coding genes, and the main portion of detected proteins was found across all cell lines due to the setup of the triple-SILAC method. A comparison of the overlap between the three platforms, using only the genes analyzed by IF, showed that more than 99% of the proteins detected by MS were also detected by either RNA-seq or IF. The genes detected by all three platforms were generally more abundant and also had more reliable results from Western blot than those only detected by IF, suggesting that MS and RNA-seq results can be useful for antibody validation.

Next, the quantitative RNA-seq results were compared to the MS intensity levels. The dynamic range of the protein expression levels as detected by MS (~10^6) was much larger than the dynamic range of transcript expression levels (~10^4). A comparison between the transcript and protein expression levels revealed that MS did not detect the proteins corresponding to the majority of the low-abundance transcripts, whilst the highly abundant transcripts had a much larger fraction of corresponding proteins detected. This implies that RNA-seq is a more sensitive method and that low-abundance proteins are difficult to detect using MS. A selection of various previously defined protein classes was used to compare the transcript abundance to the protein abundance and revealed very similar profiles (Figure 14), such as highly abundant ribosomal proteins and low-abundance transcription factors. The class of GPCRs showed the lowest abundance on the transcript level and MS only detected seven (1%) of these proteins in U-2 OS, as compared to the ribosomal proteins where 83% were detected.

[Figure 14. Panel A vertical axis: log2 RPKM; panel B vertical axis: log2 MS intensity. Protein classes shown in both panels: All, Transcription factor, GPCR, Kinase, Enzyme, Ribosomal, Membrane, CD marker.]

Figure 14. The distribution of log2 expression levels in U-2 OS across selected protein classes for (A) transcripts detected by RNA-Seq and (B) proteins detected by MS.


The SILAC method provides a better quantitative measure of protein abundance than the MS intensity and generates a ratio of the protein expression level in one cell line compared to another. Therefore, the resulting SILAC ratios were compared to the corresponding RPKM ratios of the transcript level changes between the three pairs of analyzed cell lines. The Spearman correlation coefficients ranged between 0.58 and 0.63, and the positive correlation indicates that a given transcript change between two cell lines is associated with a similar change on the protein level. The ratios were also used to classify proteins and transcripts into three categories: “similar” (maximum two-fold change between all cell lines), “substantially changed” (at least four-fold change between two pairs of cell lines) and “slightly changed” (at least two-fold change between a pair of cell lines). By the addition of two categories, “not detected” and “cell-type specific”, based on transcript expression levels, a pie chart comprising all protein-coding genes was constructed (Figure 15). The five categories and transcript expression data for all genes are displayed on the Human Protein Atlas website. The resulting gene lists from the different categories were also used for GO-based enrichment analyses. The results were congruent for protein and transcript levels, and showed that the genes only expressed in a single cell line or with changed expression levels between the cell lines were enriched for GO terms associated with the surface of the cell, whereas the genes with similar expression levels in all cell lines were enriched for intracellular functions.
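The three ratio-based categories can be sketched as a small classification function. This is one possible reading of the definitions above and not the implementation from Paper V: in particular, it is assumed here that a fold change of exactly two counts as “slightly changed”, and that genes falling into the “not detected” and “cell-type specific” categories have been set aside beforehand so that all three levels are positive.

```python
def classify_gene(levels, changed_fold=2.0, strong_fold=4.0):
    """Classify a gene from its expression levels in three cell lines.

    levels: three positive expression values (e.g. RPKM), one per cell line.
    Returns "similar", "slightly changed" or "substantially changed".
    """
    a, b, c = levels
    # Fold change for each of the three pairs, always expressed as >= 1.
    folds = [max(x, y) / min(x, y) for x, y in ((a, b), (a, c), (b, c))]
    if sum(fold >= strong_fold for fold in folds) >= 2:
        return "substantially changed"  # >= four-fold in two pairs
    if any(fold >= changed_fold for fold in folds):
        return "slightly changed"       # >= two-fold in at least one pair
    return "similar"                    # below two-fold in every pair


# Invented expression levels for four hypothetical genes.
print(classify_gene((10, 12, 15)))  # similar
print(classify_gene((10, 25, 12)))  # slightly changed
print(classify_gene((10, 50, 12)))  # substantially changed
print(classify_gene((10, 45, 20)))  # slightly changed
```

The last example shows a consequence of this reading: a gene with a four-fold difference in only one pair of cell lines is placed in the “slightly changed” category.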

Figure 15. The number of genes present in the five categories as defined by the transcript expression levels in the detected cell lines.

CONCLUDING REMARKS AND FUTURE PERSPECTIVES

This doctoral thesis describes an effort to map the human proteome using prediction methods for the large group of membrane proteins and by analyzing global protein expression profiles in a number of cells and tissues. The primary goal of the Human Protein Atlas project is to cover at least one protein per human gene. Therefore, a gene-centric view of the proteome has been used throughout all papers.

In Paper I, the human membrane proteome was explored using a selection of membrane protein topology prediction methods as well as a new majority-decision method based on the results of the selected methods. A quarter of the human protein-coding genes were estimated to code for membrane proteins, and analyses of the differences between the selected methods showed a large variation in the predicted number of membrane proteins as well as in the numbers and positions of TM segments. As the methods for experimentally determining the structures of membrane proteins are improved and the number of solved structures increases, the topology predictions can be made more accurate, thus leading to improved machine-learning algorithms. The setup of membrane protein topology prediction within the framework of the Human Protein Atlas project allows for easy addition of newer methods to the visualization tool, to keep up to date with future improvements.
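The principle of a majority decision over several prediction methods can be sketched in a few lines. The exact rule used in Paper I is not given in this section, so the strict-majority threshold and the treatment of ties below are assumptions, and the method names are placeholders rather than the methods actually used.

```python
def majority_decision(predictions):
    """Return True if a strict majority of methods predict a membrane protein.

    predictions: dict mapping a method name to True (membrane protein
    predicted) or False. A tie is treated as "not a membrane protein",
    which is an assumption of this sketch.
    """
    votes = sum(1 for is_membrane in predictions.values() if is_membrane)
    return votes > len(predictions) / 2


# Placeholder method names; 4 of 7 hypothetical methods predict a TM protein.
example_votes = {
    "method_1": True, "method_2": True, "method_3": False, "method_4": True,
    "method_5": False, "method_6": True, "method_7": False,
}
print(majority_decision(example_votes))  # True
```

Using an odd number of methods avoids ties altogether, so the tie rule only matters when a method is added or removed.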

More than 4000 global protein expression profiles, obtained from immunohistochemistry and immunofluorescence based methods, were used to analyze the unique characteristics of defined cell types in the human body in Paper II. The main conclusions from this global protein expression study were that a majority of our proteins are expressed in a large number of cell types, and that any given cell type expresses a high fraction of all proteins. This suggests that the characteristic of a cell is not defined by a binary “on/off” expression pattern of proteins, but that a fine-tuned regulation of the protein expression levels determines function and phenotype. Therefore, future studies using quantitative measurements of protein levels will be of great value and can be combined with protein interaction networks to explore the protein universe of the cell.


The relationships between 45 cell lines were analyzed using PCA and hierarchical clustering in Paper III, and the global expression patterns in cell lines were found to largely reflect the tumor from which each cell line had been established. Also, the use of statistical methods allowed for an analysis of differentially expressed proteins between various groups of cell lines. The resulting lists of proteins and enriched GO terms provide tools for a deeper understanding of how protein expression patterns underlie phenotype and function of cell lines and also provide interesting starting points for further studies. For example, as cell lines are so extensively used in cancer research, a more in-depth analysis of how well different types of cell lines constitute model systems for their respective progenitor cell type would be of great value.

To complement the cell and tissue profiles obtained from IHC, the subcellular protein distribution was analyzed in Paper IV using annotations of images resulting from high-resolution confocal microscopy. The location of more than 4000 proteins was mapped in 16 subcellular compartments in three cell lines of different origins. Analysis of the subcellular distribution of proteins provides valuable information for functional studies, since the function and location of a protein are closely linked, as well as for pharmaceutical purposes and for gaining a deeper understanding of the complex machinery of a cell. The ultimate goal of the project is to map the subcellular location of at least one representative protein from all human protein-encoding genes.

In the final study in Paper V, the global analysis of the proteome was complemented by transcriptomic data. A comparative study of protein and transcript expression levels was performed using RNA-seq and SILAC-based MS on three functionally different cell lines. A majority of all transcripts were detected, and the correlation of changes between the cell lines on protein and RNA level was high. The results include an estimation of the fraction of the protein-coding genes that are similarly and differentially expressed between the cell lines, only detected in one, or not detected at all. This study demonstrated a path to explore the relationship between the proteome and the transcriptome. By expanding the study to a larger number of cell lines, future studies can allow for a more comprehensive analysis of the differences between cell types and a deeper understanding of cell-specific functions.


Global protein expression analyses have for a long time been hampered by a lack of specific probes. However, the pipeline of the Human Protein Atlas project allows for the generation of approximately 250 validated antibodies every month. Protein expression profiles are now available for more than 50% of all protein-coding genes, based on IHC and IF on cell lines and an extensive set of human tissues. We are continuously creating a more comprehensive view of the proteome, which will add both value and possibilities for large-scale proteomic studies to follow. The underlying data for all studies performed in this thesis are publicly available and can hopefully contribute to research efforts in various areas and aid in the search for biomarkers.


ABBREVIATIONS

3D  Three-Dimensional
BP  Biological Process
CC  Cellular Component
CCDS  Collaborative Consensus Coding Sequence
CMA  Cell Microarray
DAB  3,3'-diaminobenzidine
DAG  Directed Acyclic Graph
DAPI  4',6-diamidino-2-phenylindole
DNA  Deoxyribonucleic Acid
GO  Gene Ontology
GPCR  G-protein Coupled Receptor
HMM  Hidden Markov model
IDA  Inferred from Direct Assay
IEA  Inferred from Electronic Annotation
IF  Immunofluorescence
IHC  Immunohistochemistry
INSDC  International Nucleotide Sequence Databases Collaboration
MALDI  Matrix Assisted Laser Desorption/Ionization
MDM  Majority-Decision Method
MF  Molecular Function
mRNA  messenger RNA
MS  Mass Spectrometry
NCBI  National Center for Biotechnology Information
NN  Neural Network
PC  Principal Component
PCA  Principal Component Analysis
PDB  Protein Data Bank
PrEST  Protein Epitope Signature Tag
PRIDE  Proteomics Identification database
PTM  Post-Translational Modification
RNA  Ribonucleic Acid
RPKM  Reads-per-kilobase-of-exon-model-per-million-mapped-reads
SILAC  Stable-Isotope Labeling by Amino acids in Cell culture
SP  Signal Peptide
SRA  Sequence Read Archive
SVM  Support Vector Machine
TM  Transmembrane
TMA  Tissue Microarray
UniProt  Universal Protein Knowledgebase Consortium


ACKNOWLEDGEMENTS

I have had the great honor of working together with many different people during my numerous projects here at KTH, which has been very valuable and educational. I would like to thank everyone who in one way or another contributed to my being able to write this thesis and to having so much fun along the way!

My supervisor Mathias – you are fantastically good at inspiring people and at bringing out the best in everyone you work with. Thank you for always believing in me and for letting me be part of such exciting research.

My co-supervisor Gunnar von Heijne – for wise input in our joint projects and in this thesis, and for letting me be proud that Sweden has such a thoroughly likeable and skilled researcher.

Jacob – thank you for the valuable review of this thesis.

Figge – who in many ways has been my extra supervisor. Thank you for welcoming me in Uppsala, for your warm personality, for your constant support and for all the fun projects we have worked on.

Lisa – for always being there, for letting me be inspired, for offering wise words, for helping with everything you could and for the fun work on our very own project.

Emma – your great drive leads to many interesting projects. Thank you for letting me share your group's fine data and for all the exciting meetings.

Anna – after years of hour-long phone calls and meetings and a very fun project, everything is starting to fall into place. Thank you for fighting on and for being the one who held it all together!

Åsa – my closest neighbor and foremost discussion partner about absolutely everything. You and Lisa are together the best, wisest and smartest, and the finest colleagues I could imagine. I want to thank you both for all the lunches, trips, conferences, dinners and parties, big and small, and for sharing the exciting journey that the past 6 years have been in all our lives.

Mattias, Per, Kalle and Martin – for rewarding conversations about everything between heaven and earth, coffee breaks, fun group activities, all the professional data-handling help, the fine LIMS, smart discussions and very pleasant company up in the attic.

Tim – for fun workshops and conferences, for making a great program, for helping me out so much and being such a great proofreader, for nice structures, and for being a good friend.


Marcus – for all the R code, statistics talk, courses and pleasant travel company. Your help has been immensely valuable to me, and I even got to learn a new programming language.

Daniel – for courses, coding, calculations, interesting discussions, fun RNA data, inspiring meetings and two very exciting projects.

Joakim – for letting me be part of your group at DNA clubs and very rewarding workshops, and for our fun joint projects.

Erik – for showing how fun it could be to work in an IT group and letting me take part in the programming, for many fun party memories and lovely boat trips.

Cristina – without you I would never have ended up at KTH. I am so grateful that you believed in me and that, ever since my summer job, you have been a great support and a dear friend.

Per-Åke – for all the rewarding discussions and smart questions, and because you and Amelie have let me be part of the PRONOVA community with its exciting projects.

The other professors and PIs at the department, thank you for together contributing great knowledge, diverse research and a wonderful atmosphere.

Charlotte and the rest of the Cell profiling group, for letting me into your confocal world and for the fun subcellular project. Sara Strömberg, our countless hours on the phone while working on the cell atlas project were incredibly rewarding and fun. Sara Lindström, who so patiently took care of me in the lab and taught me everything about cell culture. Johanna S, for all the tips and a nice template. Eric Björkvall, who has been an outstanding and important support in all software and computer matters. Basia, it has been so nice that there were several of us in our little group of PhD students; thank you for being my link to floor 3 and for mastering the art of being unique in an entirely positive way.

All of Gene-tech for the cozy and welcoming atmosphere at workshops and DNA clubs. Jojje, Mattias, Ronald and Julia, for having such patience with me during our lab assisting. Johan R, for very pleasant office company. Johanna H, for lots of fun with badminton and teaching assisting. Johan N and Sebastian, for being so genuinely pleasant and warm. Patrik S, definitely the best dinner partner. Magnus, for all the good discussions, questions and a fun conference in Gothenburg. Olof, for a fun course to assist on. Jesper, for all the beer-shopping rounds and interesting conversations. Calle, who showed that being a PhD student could be fun. All the PhD student girls for pleasant cheese and casino evenings and fun pre-parties, and all PhD students for a pleasant community. The Lucia gang, for many fun rehearsals and somewhat too early mornings.

Immunotech and Atlas Antibodies for your pleasant company in Hus 10. The other old and new co-workers in the Human Protein Atlas project for all the hard work, an always pleasant atmosphere and fun annuals. Without you I would have had no data to analyze. Everyone on floor 3 for contributing to such a pleasant workplace.

I also want to thank all my friends outside work, who are such a very important part of my life. My dear and oldest friends Maria and Ullis, who have always been there. My cool and wonderful Bergtorp girls. My crazy, driven and very inspiring VRG gang, where Vivi is unfortunately missing far too often. Sandra, because it has been fantastic to have a PhD-student friend outside KTH during all this time, and because you together with Cia make up my little extra family. Cia, my best former flatmate, I miss you and want you here. Katie, my great sister on the opposite side of the Atlantic, I hope all our plans work out. All other dear acquaintances, new and old.

The friends from the VG choir in Uppsala, who will forever have a big place in my heart. My choir friends in Cantare, for letting me get away from the thesis stress and think about something completely different every Monday, and for letting me sing together with you.

My dear grandmother, my uncles, cousins and everyone else in my family whom I so enjoy meeting.

My dear family, mom, dad and Staffan. Thank you for always being there and for always having been my absolute greatest support.

David, thank you for being you. You have supported me wonderfully during this time. You mean the most and you are the very best!

This work has been financed by grants from the Knut and Alice Wallenberg Foundation (Alice & Knut Wallenbergs stiftelse) and the ProNova VINN Excellence Centre for Protein Technology.


REFERENCES Abbas, K.A., and Lichtman, A.H. (2005). Cellular and Molecular Immunology, 5th edn (Philadelphia, Elsevier Saunders). Agaton, C., Galli, J., Hoiden Guthenberg, I., Janzon, L., Hansson, M., Asplund, A., Brundell, E., Lindberg, S., Ruthberg, I., Wester, K., et al. (2003). Affinity proteomics for systematic protein profiling of chromosome 21 gene products in human tissues. Mol Cell Proteomics 2, 405-414. Anderson, N.L., and Anderson, N.G. (2002). The human plasma proteome: history, character, and diagnostic prospects. Mol Cell Proteomics 1, 845-867. Andersson, A.-C., Stromberg, S., Backvall, H., Kampf, C., Uhlen, M., Wester, K., and Ponten, F. (2006). Analysis of Protein Expression in Cell Microarrays: A Tool for Antibody-based Proteomics. Journal of Histochemistry and Cytochemistry 54, 1413-1423. Arai, M., Mitsuke, H., Ikeda, M., Xia, J., Kikuchi, T., Satake, M., and Shimizu, T. (2004). ConPred II: a consensus prediction method for obtaining transmembrane topology models with high reliability. Nucleic Acids Res 32, W390-W393. Aranda, B., Achuthan, P., Alam-Faruque, Y., Armean, I., Bridge, A., Derow, C., Feuermann, M., Ghanbarian, A.T., Kerrien, S., Khadake, J., et al. (2010). The IntAct molecular interaction database in 2010. Nucleic Acids Res 38, D525-531. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25-29. Bader, G.D., Heilbut, A., Andrews, B., Tyers, M., Hughes, T., and Boone, C. (2003). Functional genomics and proteomics: charting a multidimensional map of the yeast cell. Trends Cell Biol 13, 344-356. Bakheet, T.M., and Doig, A.J. (2009). Properties and identification of human protein drug targets. Bioinformatics 25, 451-457. Barbe, L., Lundberg, E., Oksvold, P., Stenius, A., Lewin, E., Bjorling, E., Asplund, A., Ponten, F., Brismar, H., Uhlen, M., et al. 
(2008). Toward a confocal subcellular atlas of the human proteome. Mol Cell Proteomics 7, 499-508. Barrett, T., Troup, D.B., Wilhite, S.E., Ledoux, P., Rudnev, D., Evangelista, C., Kim, I.F., Soboleva, A., Tomashevsky, M., Marshall, K.A., et al. (2009). NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Res 37, D885-890. Beck, A., Wurch, T., Bailly, C., and Corvaia, N. (2010). Strategies and challenges for the next generation of therapeutic antibodies. Nat Rev Immunol 10, 345-352. Bella, J., Eaton, M., Brodsky, B., and Berman, H.M. (1994). Crystal and molecular structure of a collagen-like peptide at 1.9 A resolution. Science 266, 75-81. Bendtsen, J.D., Nielsen, H., von Heijne, G., and Brunak, S. (2004). Improved prediction of signal peptides: SignalP 3.0. J Mol Biol 340, 783-795. Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Stat Soc 57, 289-300. Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., and Sayers, E.W. (2010). GenBank. Nucleic Acids Res. Berglund, L., Björling, E., Jonasson, K., Rockberg, J., Fagerberg, L., Al-Khalili Szigyarto, C., Sivertsson, A., and Uhlén, M. (2008a). A whole-genome bioinformatics approach to selection of antigens for systematic antibody generation. Proteomics 8, 2832-2839. Berglund, L., Bjorling, E., Oksvold, P., Fagerberg, L., Asplund, A., Szigyarto, C.A., Persson, A., Ottosson, J., Wernerus, H., Nilsson, P., et al. (2008b). A genecentric Human Protein Atlas for expression profiles based on antibodies. Mol Cell Proteomics 7, 2019-2027. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. (2000). The Protein Data Bank. Nucleic Acids Res 28, 235-242. Bernsel, A., Viklund, H., Falk, J., Lindahl, E., von Heijne, G., and Elofsson, A. (2008). Prediction of membrane-protein topology from first principles. Proc Natl Acad Sci USA 105, 7177-7181. 
Birney, E., Andrews, T.D., Bevan, P., Caccamo, M., Chen, Y., Clarke, L., Coates, G., Cuff, J., Curwen, V., Cutts, T., et al. (2004). An overview of Ensembl. Genome Res 14, 925-928. Bishop, C.M. (1994). Neural networks and their applications. Review of Scientific Instruments 65, 1803-1831. Bjorling, E., Lindskog, C., Oksvold, P., Linne, J., Kampf, C., Hober, S., Uhlen, M., and Ponten, F. (2008). A web-based tool for in silico biomarker discovery based on tissue-specific protein profiles in normal and cancer tissues. Mol Cell Proteomics 7, 825-844.

Blom, N., Gammeltoft, S., and Brunak, S. (1999). Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. J Mol Biol 294, 1351-1362. Breitkreutz, B.J., Stark, C., and Tyers, M. (2003). Osprey: a network visualization system. Genome Biol 4, R22. Chen, C. (2002). Transmembrane helix predictions revisited. Protein Science 11, 2774-2791. Chen, C.P., and Rost, B. (2002). State-of-the-art in membrane protein prediction. Appl Bioinformatics 1, 21-35. Clamp, M., Fry, B., Kamal, M., Xie, X., Cuff, J., Lin, M.F., Kellis, M., Lindblad-Toh, K., and Lander, E.S. (2007). Distinguishing protein-coding and noncoding genes in the human genome. Proc Natl Acad Sci USA 104, 19428- 19433. Claros, M.G., and Vincens, P. (1996). Computational method to predict mitochondrially imported proteins and their targeting sequences. Eur J Biochem 241, 779-786. Cochrane, G., Karsch-Mizrachi, I., and Nakamura, Y. (2010). The International Nucleotide Sequence Database Collaboration. Nucleic Acids Res. Cohen, A.A., Kalisky, T., Mayo, A., Geva-Zatorsky, N., Danon, T., Issaeva, I., Kopito, R.B., Perzov, N., Milo, R., Sigal, A., et al. (2009). Protein dynamics in individual human cells: experiment and theory. PLoS One 4, e4901. Cote, R.G., Jones, P., Martens, L., Apweiler, R., and Hermjakob, H. (2008). The Ontology Lookup Service: more data and better tools for controlled vocabulary queries. Nucleic Acids Res 36, W372-376. Cox, J., and Mann, M. (2007). Is proteomics the new genomics? Cell 130, 395-398. Crick, F.H., Barnett, L., Brenner, S., and Watts-Tobin, R.J. (1961). General nature of the genetic code for proteins. Nature 192. Crick, F.H.C. (1958). On protein synthesis. The Symposia of the Society for Experimental Biology 12, 138-163. Crick, F.H.C. (1970). Central dogma of molecular biology. Nature 227, 561-563. Cuthbertson, J., Doyle, D., and Sansom, M. (2005a). Transmembrane helix prediction: a comparative evaluation and analysis. 
Protein Engineering Design and Selection 18, 295-308. Cuthbertson, J.M., Doyle, D.A., and Sansom, M.S.P. (2005b). Transmembrane helix prediction: a comparative evaluation and analysis. Protein Engineering Design and Selection 18, 295-308. D'haeseleer, P. (2005). How does gene expression clustering work? Nat Biotechnol 23, 1499-1501. Dahm, R. (2008). Discovering DNA: Friedrich Miescher and the early years of nucleic acid research. Hum Genet 122, 565-581. Dennis, G., Sherman, B.T., Hosack, D.A., Yang, J., Gao, W., Lane, H.C., and Lempicki, R.A. (2003). DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol 4, P3. Deutsch, E.W., Lam, H., and Aebersold, R. (2008). PeptideAtlas: a resource for target selection for emerging targeted proteomics workflows. EMBO Rep 9, 429-434. Dibrov, S., McLean, J., and Hermann, T. (2011). Structure of an RNA dimer of a regulatory element from human thymidylate synthase mRNA. Acta Crystallogr Sect.S, 97-104. Draghici, S., Khatri, P., Martins, R.P., Ostermeier, G.C., and Krawetz, S.A. (2003). Global functional profiling of gene expression. Genomics 81, 98-104. Drew, H.R., Samson, S., and Dickerson, R.E. (1982). Structure of a B-DNA dodecamer at 16 K. Proc Natl Acad Sci USA 79, 4040-4044. Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95, 14863-14868. Eisenberg, D., Schwarz, E., Komaromy, M., and Wall, R. (1984). Analysis of membrane and surface protein sequences with the hydrophobic moment plot. J Mol Biol 179, 125-142. Emanuelsson, O., Nielsen, H., Brunak, S., and von Heijne, G. (2000). Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol 300, 1005-1016. Fenn, J.B., Mann, M., Meng, C.K., Wong, S.F., and Whitehouse, C.M. (1989). Electrospray ionization for mass spectrometry of large biomolecules. Science 246, 64-71. 
Finn, R.D., Mistry, J., Tate, J., Coggill, P., Heger, A., Pollington, J.E., Gavin, O.L., Gunasekaran, P., Ceric, G., Forslund, K., et al. (2010). The Pfam protein families database. Nucleic Acids Res 38, D211-222. Flicek, P., Amode, M.R., Barrell, D., Beal, K., Brent, S., Chen, Y., Clapham, P., Coates, G., Fairley, S., Fitzgerald, S., et al. (2010). Ensembl 2011. Nucleic Acids Res. Fredriksson, R., and Schioth, H.B. (2005). The repertoire of G-protein-coupled receptors in fully sequenced genomes. Mol Pharmacol 67, 1414-1425. Fujita, P.A., Rhead, B., Zweig, A.S., Hinrichs, A.S., Karolchik, D., Cline, M.S., Goldman, M., Barber, G.P., Clawson, H., Coelho, A., et al. (2010). The UCSC Genome Browser database: update 2011. Nucleic Acids Res. Ghaemmaghami, S., Huh, W.-K., Bower, K., Howson, R.W., Belle, A., Dephoure, N., O'Shea, E.K., and Weissman, J.S. (2003). Global analysis of protein expression in yeast. Nature 425, 737-741.


Granseth, E. (2006). ZPRED: Predicting the distance to the membrane center for residues in α-helical membrane proteins. Bioinformatics 22, e191-e196. Gupta, R., and Brunak, S. (2002). Prediction of glycosylation across the human proteome and the correlation to protein function. Pac Symp Biocomput, 310-322. Gygi, S.P., Rochon, Y., Franza, B.R., and Aebersold, R. (1999). Correlation between protein and mRNA abundance in yeast. Mol Cell Biol 19, 1720-1730. Haab, B.B. (2001). Advances in protein microarray technology for protein expression and interaction profiling. Curr Opin Drug Discov Devel 4, 116-123. Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second edn (Springer). Heijne, G. (1986). The distribution of positively charged residues in bacterial inner membrane proteins correlates with the trans-membrane topology. EMBO J 5, 3021-3027. Hessa, T., Meindl-Beinker, N.M., Bernsel, A., Kim, H., Sato, Y., Lerch-Bader, M., Nilsson, I., White, S.H., and von Heijne, G. (2007). Molecular code for transmembrane-helix recognition by the Sec61 translocon. Nature 450, 1026-1030. Hooper, S.D., and Bork, P. (2005). Medusa: a simple tool for interaction graph analysis. Bioinformatics 21, 4432-4433. Huang da, W., Sherman, B.T., and Lempicki, R.A. (2009a). Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 37, 1-13. Huang da, W., Sherman, B.T., and Lempicki, R.A. (2009b). Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4, 44-57. Hulett, H.R., Bonner, W.A., Barrett, J., and Herzenberg, L.A. (1969). Cell sorting: automated separation of mammalian cells as a function of intracellular fluorescence. Science 166, 747-749. Hunter, S., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bork, P., Das, U., Daugherty, L., Duquenne, L., et al. (2009). 
InterPro: the integrative protein signature database. Nucleic Acids Res 37, D211-215. Hyvarinen, A., and Oja, E. (2000). Independent component analysis: algorithms and applications. Neural Netw 13, 411- 430. Iragne, F., Nikolski, M., Mathieu, B., Auber, D., and Sherman, D. (2005). ProViz: protein interaction visualization and exploration. Bioinformatics 21, 272-274. Ishihama, Y., Schmidt, T., Rappsilber, J., Mann, M., Hartl, F.U., Kerner, M.J., and Frishman, D. (2008). Protein abundance profiling of the Escherichia coli cytosol. BMC Genomics 9, 102. Jacob, F., and Monod, J. (1961). Genetic regulatory mechanisms in the synthesis of proteins. J Mol Biol 3, 318-356. Jacoboni, I., Martelli, P.L., Fariselli, P., De Pinto, V., and Casadio, R. (2001). Prediction of the transmembrane regions of beta-barrel membrane proteins with a neural network-based predictor. Protein Sci 10, 779-787. Jans, D.A., Xiao, C.Y., and Lam, M.H. (2000). Nuclear targeting signal recognition: a key control point in nuclear transport? Bioessays 22, 532-544. Jia, M., Choi, S.Y., Reiners, D., Wurtele, E.S., and Dickerson, J.A. (2010). MetNetGE: interactive views of biological networks and ontologies. BMC Bioinformatics 11, 469. Jolliffe, I.T. (2002). Principal Component Analysis, Second edn (Springer). Jones, D.T. (2007). Improving the accuracy of transmembrane protein topology prediction using evolutionary information. Bioinformatics 23, 538-544. Joshi-Tope, G., Gillespie, M., Vastrik, I., D'Eustachio, P., Schmidt, E., de Bono, B., Jassal, B., Gopinath, G.R., Wu, G.R., Matthews, L., et al. (2005). Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 33, D428-432. Juretic, D., Zoranic, L., and Zucic, D. (2002). Basic charge clusters and predictions of membrane protein topology. J Chem Inf Comput Sci 42, 620-632. Kall, L., Krogh, A., and Sonnhammer, E.L. (2004). A combined transmembrane topology and signal peptide prediction method. J Mol Biol 338, 1027-1036. 
Kall, L., and Sonnhammer, E.L. (2002). Reliability of transmembrane predictions in whole-genome data. FEBS Lett 532, 415-418. Kaminuma, E., Kosuge, T., Kodama, Y., Aono, H., Mashima, J., Gojobori, T., Sugawara, H., Ogasawara, O., Takagi, T., Okubo, K., et al. (2010). DDBJ progress report. Nucleic Acids Res. Kanehisa, M., and Bork, P. (2003). Bioinformatics in the post-sequence era. Nat Genet 33 Suppl, 305-310. Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., and Hattori, M. (2004). The KEGG resource for deciphering the genome. Nucleic Acids Res 32, D277-280. Kangas, J.A., Kohonen, T.K., and Laaksonen, J.T. (1990). Variants of self-organizing maps. IEEE Trans Neural Netw 1, 93-99.

Kapushesky, M., Emam, I., Holloway, E., Kurnosov, P., Zorin, A., Malone, J., Rustici, G., Williams, E., Parkinson, H., and Brazma, A. (2010). Gene expression atlas at the European bioinformatics institute. Nucleic Acids Res 38, D690-698. Karas, M., and Hillenkamp, F. (1988). Laser desorption ionization of proteins with molecular masses exceeding 10,000 daltons. Anal chem 60, 2299-2301. Kendrew, J.C., Dickerson, R.E., Strandberg, B.E., Hart, R.G., Davies, D.R., Phillips, D.C., and Shore, V.C. (1960). Structure of myoglobin: A three-dimensional Fourier synthesis at 2 A. resolution. Nature 185, 422-427. Khatri, P., and Drăghici, S. (2005). Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 21, 3587-3595. Khatri, P., Draghici, S., Ostermeier, G.C., and Krawetz, S.A. (2002). Profiling gene expression using onto-express. Genomics 79, 266-270. Kohler, G., and Milstein, C. (1975). Continuous cultures of fused cells secreting antibody of predefined specificity. Nature 256, 495-497. Kononen, J., Bubendorf, L., Kallioniemi, A., Barlund, M., Schraml, P., Leighton, S., Torhorst, J., Mihatsch, M.J., Sauter, G., and Kallioniemi, O.P. (1998). Tissue microarrays for high-throughput molecular profiling of tumor specimens. Nature Med 4, 844-847. Krogh, A. (2001). Predicting transmembrane protein topology with a hidden markov model: application to complete genomes. J Mol Biol 305, 567-580. Krogh, A., Brown, M., Mian, I.S., Sjolander, K., and Haussler, D. (1994). Hidden Markov models in computational biology. Applications to protein modeling. J Mol Biol 235, 1501-1531. Kuo, A., Gulbis, J.M., Antcliff, J.F., Rahman, T., Lowe, E.D., Zimmer, J., Cuthbertson, J., Ashcroft, F.M., Ezaki, T., and Doyle, D.A. (2003). Crystal structure of the potassium channel KirBac1.1 in the closed state. Science 300, 1922- 1926. Kyte, J., and Doolittle, R.F. (1982). A simple method for displaying the hydropathic character of a protein. 
J Mol Biol 157, 105-132. Le Novere, N. (2006). Model storage, exchange and integration. BMC Neurosci 7 Suppl 1, S11. Leinonen, R., Akhtar, R., Birney, E., Bower, L., Cerdeno-Tarraga, A., Cheng, Y., Cleland, I., Faruque, N., Goodgame, N., Gibson, R., et al. (2010a). The European Nucleotide Archive. Nucleic Acids Res. Leinonen, R., Sugawara, H., and Shumway, M. (2010b). The sequence read archive. Nucleic Acids Res. Liu, Q., Zhu, Y.S., Wang, B.H., and Li, Y.X. (2003). A HMM-based method to predict the transmembrane regions of beta-barrel membrane proteins. Comput Biol Chem 27, 69-76. Lo, A., Chiu, H.S., Sung, T.Y., Lyu, P.C., and Hsu, W.L. (2008). Enhanced membrane protein topology prediction using a hierarchical classification method and a new scoring function. J Proteome Res 7, 487-496. Lodish, H., Berk, A., and Zipursky, S.L. (2000). Molecular Cell Biology, 4th edn (New York, W. H. Freeman). Lomize, M.A., Lomize, A.L., Pogozheva, I.D., and Mosberg, H.I. (2006). OPM: orientations of proteins in membranes database. Bioinformatics 22, 623-625. Lu, P., Vogel, C., Wang, R., Yao, X., and Marcotte, E.M. (2007). Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation. Nat Biotechnol 25, 117-124. Malone, J., Holloway, E., Adamusiak, T., Kapushesky, M., Zheng, J., Kolesnikov, N., Zhukova, A., Brazma, A., and Parkinson, H. (2010). Modeling sample variables with an Experimental Factor Ontology. Bioinformatics 26, 1112- 1118. Mann, M., Hendrickson, R.C., and Pandey, A. (2001). Analysis of proteins and proteomes by mass spectrometry. Annu Rev Biochem 70, 437-473. Mann, M., and Jensen, O.N. (2003). Proteomic analysis of post-translational modifications. Nat Biotechnol 21, 255-261. Mardis, E.R. (2008). Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet 9, 387-402. Martelli, P.L., Fariselli, P., and Casadio, R. (2003). 
An ENSEMBLE machine learning approach for the prediction of all-alpha membrane proteins. Bioinformatics 19 Suppl 1, i205-211. Mayburd, A.L. (2009). Expression variation: its relevance to emergence of chronic disease and to therapy. PLoS One 4, e5921. Melen, K., Krogh, A., and von Heijne, G. (2003). Reliability Measures for Membrane Protein Topology Prediction Algorithms. J Mol Biol 327, 735-744. Moche, M., Lehtio, L., Andersson, J., Arrowsmith, C.H., Berglund, H., Collins, R., Dahlgren, L.G., Edwards, A.M., Flodin, S., Flores, A., et al. The Isomerase Domain of Human Gfpt1 in Complex with Fructose 6-Phosphate. To be published. Möller, S., Croning, M.D., and Apweiler, R. (2001). Evaluation of methods for the prediction of membrane spanning regions. Bioinformatics 17, 646-653.


Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L., and Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5, 621-628. Mulder, G.J. (1838). On the composition of some animal substances. Bulletin des Sciences Physiques et Naturelles en Neerlande. Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247, 536-540. Naaby-Hansen, S., Waterfield, M.D., and Cramer, R. (2001). Proteomics--post-genomic cartography to understand gene function. Trends Pharmacol Sci 22, 376-384. Nagalakshmi, U., Wang, Z., Waern, K., Shou, C., Raha, D., Gerstein, M., and Snyder, M. (2008). The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320, 1344-1349. Nilsson, J., Persson, B., and von Heijne, G. (2000). Consensus predictions of membrane protein topology. FEBS Lett 486, 267-269. Nilsson, P., Paavilainen, L., Larsson, K., Ödling, J., Sundberg, M., Andersson, A.-C., Kampf, C., Persson, A., Szigyarto, C.A.-K., Ottosson, J., et al. (2005). Towards a human proteome atlas: High-throughput generation of mono- specific antibodies for tissue profiling. Proteomics 5, 4327-4337. Nugent, T., Jones, D.T. (2009). Transmembrane protein topology prediction using support vector machines. BMC Bioinformatics. Ong, S.E., Blagoev, B., Kratchmarova, I., Kristensen, D.B., Steen, H., Pandey, A., and Mann, M. (2002). Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Mol Cell Proteomics 1, 376-386. Paoli, M., Liddington, R., Tame, J., Wilkinson, A., and Dodson, G. (1996). Crystal structure of T state haemoglobin with oxygen bound at all four haems. J Mol Biol 256, 775-792. Park, S.H., and Opella, S.J. (2005). Tilt angle of a trans-membrane helix is determined by hydrophobic mismatch. J Mol Biol 350, 310-318. 
Parkinson, H., Sarkans, U., Kolesnikov, N., Abeygunawardena, N., Burdett, T., Dylag, M., Emam, I., Farne, A., Hastings, E., Holloway, E., et al. (2010). ArrayExpress update--an archive of microarray and high-throughput sequencing- based functional genomics experiments. Nucleic Acids Res. Pavlopoulos, G.A., O'Donoghue, S.I., Satagopam, V.P., Soldatos, T.G., Pafilis, E., and Schneider, R. (2008a). Arena3D: visualization of biological networks in 3D. BMC Syst Biol 2, 104. Pavlopoulos, G.A., Wegener, A.-L., and Schneider, R. (2008b). A survey of visualization tools for biological network analysis. BioData Min 1, 12. Pepperkok, R., and Ellenberg, J. (2006). High-throughput fluorescence microscopy for systems biology. Nat Rev Mol Cell Biol 7, 690-696. Perrett, D. (2007). From ‘protein’ to the beginnings of clinical proteomics. Proteomics Clin Appl 1, 720-738. Perutz, M.F., Rossmann, M.G., Cullis, A.F., Muirhead, H., Will, G., and North, A.C. (1960). Structure of haemoglobin: a three-dimensional Fourier synthesis at 5.5-A. resolution, obtained by X-ray analysis. Nature 185, 416-422. Pruitt, K.D., Harrow, J., Harte, R.A., Wallin, C., Diekhans, M., Maglott, D.R., Searle, S., Farrell, C.M., Loveland, J.E., Ruef, B.J., et al. (2009). The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes. Genome Res 19, 1316-1323. Ramos-Vara, J.A. (2005). Technical aspects of immunohistochemistry. Vet Pathol 42, 405-426. Ramsköld, D., Wang, E.T., Burge, C.B., and Sandberg, R. (2009). An abundance of ubiquitously expressed genes revealed by tissue transcriptome sequence data. PLoS Comput Biol 5, e1000598. Renart, J., Reiser, J., and Stark, G.R. (1979). Transfer of proteins from gels to diazobenzyloxymethyl-paper and detection with antisera: a method for studying antibody specificity and antigen structure. Proc Natl Acad Sci U S A 76, 3116-3120. Rimm, D.L. (2006). What brown cannot do for you. Nat Biotechnol 24, 914-916. 
Ringner, M. (2008). What is principal component analysis? Nat Biotechnol 26, 303-304. Rost, B. (1996). PHD: predicting one-dimensional protein structure by profile-based neural networks. Methods Enzymol 266, 525-539. Salwinski, L., Miller, C.S., Smith, A.J., Pettit, F.K., Bowie, J.U., and Eisenberg, D. (2004). The Database of Interacting Proteins: 2004 update. Nucleic Acids Res 32, D449-451. Sanger, F., and Tuppy, H. (1951a). The amino-acid sequence in the phenylalanyl chain of insulin. 1. The identification of lower peptides from partial hydrolysates. Biochemical Journal 49, 463-481. Sanger, F., and Tuppy, H. (1951b). The amino-acid sequence in the phenylalanyl chain of insulin. 2. The investigation of peptides from enzymic hydrolysates. Biochemical Journal 49, 481-490. Schaeffer, W.I. (1990). Terminology associated with cell, tissue, and organ culture, molecular biology, and molecular genetics. Tissue Culture Association Terminology Committee. In Vitro Cell Dev Biol 26, 97-101.

Semwogerere, D., and Weeks, E. (2005). Confocal microscopy. In: Encyclopedia of biomaterials and biomedical engineering. (New York, Taylor and Francis). Shannon, P., Markiel, A., Ozier, O., Baliga, N.S., Wang, J.T., Ramage, D., Amin, N., Schwikowski, B., and Ideker, T. (2003). Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13, 2498-2504. Sigal, A., Milo, R., Cohen, A., Geva-Zatorsky, N., Klein, Y., Liron, Y., Rosenfeld, N., Danon, T., Perzov, N., and Alon, U. (2006). Variability and memory of protein levels in human cells. Nature 444, 643-646. Sigrist, C.J., Cerutti, L., de Castro, E., Langendijk-Genevaux, P.S., Bulliard, V., Bairoch, A., and Hulo, N. (2010). PROSITE, a protein domain database for functional characterization and annotation. Nucleic Acids Res 38, D161-166. Soldatova, L.N., and King, R.D. (2005). Are the current ontologies in biology good ontologies? Nat Biotechnol 23, 1095-1098. Sonnhammer, E.L., von Heijne, G., and Krogh, A. (1998). A hidden Markov model for predicting transmembrane helices in protein sequences. Proceedings / International Conference on Intelligent Systems for Molecular Biology ; ISMB 6, 175-182. Strandberg, E., Ozdirekcan, S., Rijkers, D.T., van der Wel, P.C., Koeppe, R.E., 2nd, Liskamp, R.M., and Killian, J.A. (2004). Tilt angles of transmembrane model peptides in oriented and non-oriented lipid bilayers as determined by 2H solid-state NMR. Biophys J 86, 3709-3721. Strömberg, S., Björklund, M.G., Asplund, C., Sköllermo, A., Persson, A., Wester, K., Kampf, C., Nilsson, P., Andersson, A.-C., Uhlen, M., et al. (2007). A high-throughput strategy for protein profiling in cell microarrays using automated image analysis. Proteomics 7, 2142-2150. Subbarao, G.V., and van den Berg, B. (2006). Crystal structure of the monomeric porin OmpG. J Mol Biol 360, 750-759. 
Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S., et al. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 102, 15545-15550. Tegel, H., Steen, J., Konrad, A., Nikdin, H., Pettersson, K., Stenvall, M., Tourle, S., Wrethagen, U., Xu, L., Yderland, L., et al. (2009). High-throughput protein production--lessons from scaling up from 10 to 288 recombinant proteins per week. Biotechnol J 4, 51-57. Theocharidis, A., van Dongen, S., Enright, A.J., and Freeman, T.C. (2009). Network visualization and analysis of gene expression data using BioLayout Express(3D). Nat Protoc 4, 1535-1550. Tusnady, G.E., and Simon, I. (1998). Principles governing amino acid composition of integral membrane proteins: application to topology prediction. J Mol Biol 283, 489-506. Uniprot Consortium. (2008). The universal protein resource (UniProt). Nucleic Acids Res 36, D190-195. Uniprot Consortium. (2010). The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res 38, D142-148. Uhlen, M. (2005a). Antibody-based Proteomics for Human Tissue Profiling. Mol Cell Proteomics 4, 384-393. Uhlen, M. (2005b). A Human Protein Atlas for Normal and Cancer Tissues Based on Antibody Proteomics. Mol Cell Proteomics 4, 1920-1932. Uhlen, M., Bjorling, E., Agaton, C., Szigyarto, C.A., Amini, B., Andersen, E., Andersson, A.C., Angelidou, P., Asplund, A., Asplund, C., et al. (2005). A human protein atlas for normal and cancer tissues based on antibody proteomics. Mol Cell Proteomics 4, 1920-1932. Uhlen, M., Oksvold, P., Fagerberg, L., Lundberg, E., Jonasson, K., Forsberg, M., Zwahlen, M., Kampf, C., Wester, K., Hober, S., et al. (2010). Towards a knowledge-based Human Protein Atlas. Nat Biotechnol 28, 1248-1250. Uschold, M., and Gruninger, M. (1996). Ontologies: principles, methods, and applications. 
Knowledge Engineering Review 11, 93–155. Viklund, H., Bernsel, A., Skwark, M., and Elofsson, A. (2008). SPOCTOPUS: a combined predictor of signal peptides and membrane protein topology. Bioinformatics 24, 2928-2929. Viklund, H., and Elofsson, A. (2004). Best alpha-helical transmembrane protein topology predictions are achieved using hidden Markov models and evolutionary information. Protein Sci 13, 1908-1917. Viklund, H., and Elofsson, A. (2008). OCTOPUS: improving topology prediction by two-track ANN-based preference scores and an extended topological grammar. Bioinformatics 24, 1662-1668. Viklund, H., Granseth, E., and Elofsson, A. (2006). Structural classification and prediction of reentrant regions in alpha- helical transmembrane proteins: application to complete genomes. J Mol Biol 361, 591-603. Vizcaino, J.A., Cote, R., Reisinger, F., Foster, J.M., Mueller, M., Rameseder, J., Hermjakob, H., and Martens, L. (2009). A guide to the Proteomics Identifications Database proteomics data repository. Proteomics 9, 4276-4283. Vogel, C., Abreu Rde, S., Ko, D., Le, S.Y., Shapiro, B.A., Burns, S.C., Sandhu, D., Boutz, D.R., Marcotte, E.M., and Penalva, L.O. (2010). Sequence signatures and mRNA concentration can explain two-thirds of protein abundance variation in a human cell line. Mol Syst Biol 6, 400.


Volkin, E., and Astrachan, L. (1956). Phosphorus incorporation in Escherichia coli ribo-nucleic acid after infection with bacteriophage T2. Virology 2, 149-161. von Heijne, G. (1992). Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule. J Mol Biol 225, 487-494. von Heijne, G. (2006). Membrane-protein topology. Nat Rev Mol Cell Biol 7, 909-918. Walther, T.C., and Mann, M. (2010). Mass spectrometry-based proteomics in cell biology. J Cell Biol 190, 491-500. Warford, A., Howat, W., and McCafferty, J. (2004). Expression profiling by high-throughput immunohistochemistry. J Immunol Methods 290, 81-92. Watson, J.D., and Crick, F.H.C. (1953). Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature 171, 737-738. White, S.H. (2004). The progress of membrane protein structure determination. Protein Sci 13, 1948-1949. White, S.H. (2009). Biophysical dissection of membrane proteins. Nature 459, 344-346. Wilkins, M.R., Pasquali, C., Appel, R.D., Ou, K., Golaz, O., Sanchez, J.C., Yan, J.X., Gooley, A.A., Hughes, G., Humphery-Smith, I., et al. (1996). From proteins to proteomes: large scale protein identification by two-dimensional electrophoresis and amino acid analysis. Biotechnology (NY) 14, 61-65. Wimley, W.C. (2003). The versatile beta-barrel membrane protein. Curr Opin Struct Biol 13, 404-411. Wise, A., Gearing, K., and Rees, S. (2002). Target validation of G-protein coupled receptors. Drug Discov Today 7, 235-246. Wistrand, M., Kall, L., and Sonnhammer, E.L. (2006). A general model of G protein-coupled receptor sequences and its application to detect remote homologs. Protein Sci 15, 509-521. Yates, J.R., Ruse, C.I., and Nakorchevsky, A. (2009). Proteomics by mass spectrometry: approaches, advances, and applications. Annu Rev Biomed Eng 11, 49-79. Yeung, K.Y., and Ruzzo, W.L. (2001). Principal component analysis for clustering gene expression data. Bioinformatics 17, 763-774. 
Yildirim, M.A., Goh, K.I., Cusick, M.E., Barabasi, A.L., and Vidal, M. (2007). Drug-target network. Nat Biotechnol 25, 1119-1126. Yohannan, S., Faham, S., Yang, D., Whitelegge, J.P., and Bowie, J.U. (2004). The evolution of transmembrane helix kinks and the structural diversity of G protein-coupled receptors. Proc Natl Acad Sci U S A 101, 959-963. Zeeberg, B.R., Feng, W., Wang, G., Wang, M.D., Fojo, A.T., Sunshine, M., Narasimhan, S., Kane, D.W., Reinhold, W.C., Lababidi, S., et al. (2003). GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol 4, R28. Zhou, H., and Zhou, Y. (2003). Predicting the topology of transmembrane helical proteins using mean burial propensity and a hidden-Markov-model-based method. Protein Sci 12, 1547-1555.