
UNIVERSITY OF CINCINNATI

Date:______

I, ______, hereby submit this work as part of the requirements for the degree of: in:

It is entitled:

This work and its defense approved by:

Chair: ______

On Applications of Statistical Learning to Biophysics

A dissertation submitted to the Division of Research and Advanced Studies of the University of Cincinnati in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY (Ph.D.) in the Department of Physics of the College of Arts and Sciences

Baoqiang Cao

B.S., 1998 Northwest University of China

M.S., 2000 Nanjing University (Thesis) and Northwest University of China (Certificate)

Abstract

In this dissertation, we develop statistical and machine learning methods for problems in biological systems and processes. In particular, we are interested in two problems: predicting structural properties of membrane proteins and clustering genes based on microarray experiments. For the membrane problem, we introduce a compact representation for amino acids and build a neural network predictor based on it to identify transmembrane domains in membrane proteins. Membrane proteins are divided into two classes based on the secondary structure of the segments spanning the lipid bilayer: alpha-helical and beta-barrel membrane proteins. We further build a support vector regression model to predict the lipid exposure levels of the amino acids within the transmembrane domains of alpha-helical membrane proteins, and we develop methods to predict pore-forming residues in beta-barrel membrane proteins. For the other problem, we apply a context-specific Bayesian clustering model to cluster genes based on their expression levels and cDNA copy numbers.

This dissertation is organized as follows. Chapter 1 introduces the most relevant biology and statistical and machine learning methods. Chapters 2 and 3 focus on prediction of transmembrane domains for the alpha-helix and the beta-barrel, respectively. Chapter 4 discusses the prediction of relative lipid accessibility, a different structural property for membrane proteins. The final chapter addresses the gene clustering approach.

Contents

Preface
List of Figures
List of Tables
Chapter 1. Introduction
1.1 Membrane proteins: structure and function
1.2 Microarray experiments and gene profiling
1.3 Classification and regression in structural bioinformatics
1.4 Unsupervised learning and clustering in genomics
1.5 Summary
Chapter 2. Transmembrane domain prediction in alpha-helical membrane proteins
2.1 Introduction
2.2 Methods and data
2.3 Results and discussion
2.4 Conclusion
Chapter 3. Transmembrane domain prediction in beta-barrel membrane proteins
3.1 Introduction
3.2 Methods and data
3.3 Results and discussion
3.4 Conclusion
Chapter 4. Relative lipid accessibility prediction in membrane proteins
4.1 Introduction
4.2 Materials and methods
4.3 Results and discussion
4.4 Conclusion
Chapter 5. Cluster analysis of array comparative genomic hybridization data
5.1 Introduction
5.2 Methods and data
5.3 Results and discussion
5.4 Future plans
5.5 Summary
Bibliography

Preface

I would like to thank my advisors, Dr. Mark Jarrell and Dr. Jaroslaw Meller, for their outstanding mentoring, inspiration and financial support. Their enthusiasm for and impressive understanding of various branches of computational sciences proved to be truly irresistible and left a permanent mark on my personality. For that, their kindness and patience, I am deeply grateful.

I also had the good fortune to work with Drs. Mario Medvedovic and Michael Wagner during my Ph.D. years. I am particularly indebted to Mario, who taught me to look at biological data in a critical and unbiased way. Michael helped me apply optimization-based techniques to problems in genomics, which proved very important for the success of our collaborative efforts and my interdisciplinary work stemming from these collaborations.

I would also like to thank the other members of my thesis committee, Dr. Rostislav Serota and Dr. Thomas Beck, for their instructive comments and their overall support.

Furthermore, I would like to thank Dr. Aleksey Porollo for his much appreciated technical help, and Han Yong Ng and Eric Moss for their critical reading of the dissertation.

I also want to thank my teachers and mentors from the Departments of Physics, Chemistry, and Biochemistry. I truly admire their knowledge and passion for teaching and research. There is an old saying in Chinese, which can be translated as "whoever is my teacher even for one day becomes my lifelong teacher". I cannot find any better words to express my appreciation and gratitude.

Last but not least, I would like to thank my wife, Ying Shen, and my parents, Haisheng Cao and Guaxian Suo, for their support and encouragement. Without them, I would have been unable to complete this journey.

Three chapters of this dissertation have been published or submitted for publication. In particular, chapter 2 was published in Bioinformatics [11], chapter 3 will be published as a chapter in a volume of the Advances in Computational and Systems Biology series [82], and chapter 4 was submitted to Proteins. The remaining parts of this dissertation have not yet been published or submitted. We are planning to submit one manuscript based on chapter 5 in the near future.

List of Figures

1.1 Functions of membrane proteins.
1.2 Various types of associations of membrane proteins with the bilayer lipid membrane.
1.3 Classes of membrane proteins.
1.4 Schematic experimental procedure to measure gene expression using a microarray.
2.1 The distribution of lengths for TM segments included in the MPtopo.
2.2 Dependence of prediction accuracy on the size of the sliding window, using 10-fold cross-validation on the non-redundant training set of 73 alpha-helical membrane proteins.
2.3 Examples of MINNOU transmembrane helix predictions for an ion channel (PDB code 1OTS) in panel A, and a glutamate transporter protein (PDB code 1XFH) in panel B.
3.1 An example of the TM segments prediction for a β-barrel (PDB code 1P4T).
4.1 The proposed position-specific error (PSE) function for ε-insensitive SVR models for the prediction of the relative lipid accessibility (RLA) in membrane proteins.
4.2 Average cross-validated RLA prediction accuracies (in terms of correlation coefficients) for the training set of 72 non-redundant chains with and without interface residues.
4.3 Example of RLA prediction for sodium/proton antiporter (PDB code 1ZCD, chain A).
4.4 TM domain and RLA prediction for sodium channel protein from human (SCN1A_HUMAN).
5.1 Directed acyclic graph for the Bayesian hierarchical model.
5.2 The ROC curves obtained using independent clustering of mRNA expression and cDNA copy number alteration patterns.
5.3 The ROC curves for the joint clustering of genes based on mRNA expression levels and cDNA copy number alteration.
5.4 Classification of samples using averaging of cDNA copy number alteration (right panel) and the corresponding mRNA expression levels (left panel).

List of Tables

2.1 Per-residue classification accuracy (Q3) of secondary structure predictions for TM proteins obtained using the SABLE and PSIPRED servers.
2.2 Accuracy of transmembrane segment (domain) prediction using alternative representations.
2.3 Accuracy of TM segment prediction using structural profiles, including KD and WW hydropathy profiles and predicted RSA and SS, estimated using cross-validation on non-redundant sets of alpha-helical and beta-barrel proteins.
2.4 Improvements due to 1st-layer, 2nd-layer and filter (F) based predictors according to the TMH Benchmark evaluation.
2.5 Assessment of TM helix prediction methods using the TMH Benchmark server.
2.6 Accuracy of different methods as measured by per-residue accuracy and Matthews correlation coefficients on a set of five helical membrane proteins not included in the training set.
2.7 Performance of MINNOU on the four control sets.
3.1 The non-redundant dataset of 13 β-barrel proteins (S13 set).
3.2 Performance of alternative predictors estimated using 10-fold cross-validation on different training sets.
3.3 Assessment of false positive rates for alternative TM β-strand segment predictors.
4.1 The effect of different representations and regression models on the RLA prediction accuracies.
4.2 Performance of the final (consensus) SVR model on the control set of 16 non-redundant membrane proteins.

Chapter 1. Introduction

Massive amounts of genomic and other data pertaining to biological systems and processes are being generated, as illustrated by the Human Genome Project and related efforts [1]. The availability of these data creates both an opportunity and a challenge for computational scientists: to develop algorithms and tools that improve our understanding of molecular processes, with the ultimate goal of facilitating progress in medicine [2]. One successful approach is to use statistical learning and data mining techniques to extract general rules and patterns from massive data sets [3]. In particular, the availability of efficient computational implementations of statistical and machine learning methods makes them suitable for high-throughput data mining [3].

The role of statistical learning methods in genomics is best highlighted by the growing number of studies that apply such methods to molecular systems [3]. In general, machine learning approaches can be divided into two classes, the so-called supervised and unsupervised learning, both of which are used in this dissertation. The difference between these two approaches stems from the availability (or lack thereof) of labeled data for which we know the outcomes, e.g., structural attributes of a protein that we are trying to predict or a clinical phenotype related to a given gene expression profile [3]. Thus, in supervised learning the goal is to adjust model parameters so that the outputs correspond more closely to the outcomes imposed in training [3]. In unsupervised learning, on the other hand, the goal is to discover inherent structures and patterns in the data [3].

In this dissertation, our goals are to develop new methods to extract complex patterns from genomic data and subsequently to use them for prediction. We capitalize on machine and statistical learning approaches to study two classes of biological systems and problems inspired by them. The first problem deals with the prediction of a number of structural properties of membrane proteins, including the location of transmembrane (TM) domains and the lipid exposure levels of residues inside the TM domains. The other problem involves the identification of genes with similar expression patterns in biological samples. From a statistical learning perspective, these two problems have essentially common roots; in other words, similar statistical learning techniques can be applied to both. This will be demonstrated in the following chapters of this dissertation.

1.1 Membrane Proteins: Structure and Function

In order to provide the context for applications of statistical learning approaches to the problem of membrane protein prediction, we introduce the basic concepts pertaining to membrane proteins and the motivation for our approach. The impermeable, oily lipid bilayer encloses cells and maintains the essential difference between the cytosol and the extracellular environment. Membrane proteins, which reside in the lipid membrane and control chemical processes across the membrane, are critical to cell survival and function [4]. Membrane proteins are involved in many diverse cellular events, such as cell-cell signaling, ion and small-molecule transport, and recognition [5].

Figure 1.1 illustrates some important functions of membrane proteins. For example, ion pumps are gated pores that transport ions through the membrane against electrochemical gradients by using the energy of adenosine triphosphate (ATP) hydrolysis [6]. As opposed to ion pumps, ion channels transport ions down their electrochemical gradients [6]. Ion transporters combine features of pumps and channels: one ion species moving along its electrochemical gradient powers the movement of another ion species against its electrochemical gradient [6]. Membrane protein receptors are involved in cellular communication and signal transduction; by binding to a signaling molecule, or a pair of them on each side of the membrane, they initiate a response on the other side [4].

Some enzymes are also integral membrane proteins; for example, the well-known ATPases are a class of enzymes catalyzing the decomposition of ATP into adenosine diphosphate and a phosphate ion [4]. The energy released by this process is used as fuel by all forms of life [4].

Understanding the functions of membrane proteins is of interest not only in academia but also in pharmacology [7]. For example, it has been estimated that more than 50% of drugs under development target G protein-coupled receptors, a large superfamily of membrane proteins [7].

Figure 1.1 Functions of membrane proteins: ion pumps, ion channels, ion transporters, receptors, and enzymes.

Membrane proteins function in the context of bilayer lipid membranes. They are associated with the bilayer lipid membrane in various ways, as shown in Figure 1.2 [4]. In cases 1 and 2, the protein has mass on both sides of the cell membrane; the only difference between the two cases is that the protein in case 1 has a single TM domain while the protein in case 2 has multiple TM domains. The vast majority of membrane proteins whose structures have been experimentally resolved belong to these two cases; that is, they have at least one TM domain with mass on both sides of the membrane [4,7]. In this dissertation, the term membrane proteins refers to proteins of the type shown in case 1 or 2. In cases 3 and 4, the protein extends one of its termini into the membrane via a segment of fatty lipid; the protein is thus attached to the membrane with all of its mass on one side. Such proteins are called anchored membrane proteins [4]. In the last two cases, the highlighted proteins associate with TM proteins by binding to them from either the outside or the inside of the cell membrane, and are called peripheral membrane proteins [4].

Figure 1.2 Various ways in which membrane proteins associate with the bilayer lipid membrane. This figure is taken from [4].

From a structural perspective, membrane proteins are divided into two classes: alpha-helical and beta-barrel membrane proteins. The difference lies in the secondary structure of the segments spanning the bilayer membrane, which are exclusively alpha helices in the former and beta barrels in the latter [5, 8]. For example, Figure 1.3 shows two membrane proteins with experimentally resolved 3-D structures together with their lipid membranes: an alpha-helical membrane protein in the left panel and a beta-barrel membrane protein in the right panel. Because of this difference in the membrane-spanning segments, it is advantageous to develop separate methods to predict TM domains for the two types of membrane proteins. We will discuss this in detail in chapters 2 and 3.

Figure 1.3 Classes of membrane proteins. In the left panel, the structure of an alpha-helical membrane protein is shown; a beta-barrel membrane protein is shown on the right. Gray planes indicate the position of the cell membrane as determined by the Protein Data Bank of Transmembrane Proteins (PDBTM) [9]. This figure is taken from the PDBTM website (http://pdbtm.enzim.hu/).

A high-resolution 3-D atomic structure is often the key to understanding membrane protein function, yet solving the 3-D structures of membrane proteins remains a challenge in experimental biology. Although the number of experimentally determined atomic structures for membrane proteins continues to increase, as of May 2006 only 1.6% of the structures [9] in the Protein Data Bank (PDB) [10] were membrane protein structures. To be precise, there were only 88 unique membrane protein structures deposited in the PDB [11]; of these, 73 were alpha-helical and 15 were beta-barrel proteins [http://blanco.biomol.uci.edu/mpex/, 11]. On the other hand, membrane proteins are estimated to account for about 20% to 30% of all proteins in the sequenced genomes [11, 12]. Therefore, computational methods have become significant tools for facilitating the recognition and prediction of membrane proteins and their properties [11].

Our first goal is to develop a method to predict TM domains given only the sequence information of membrane proteins. The details regarding the prediction of TM domains in alpha-helical and beta-barrel membrane proteins will be discussed in chapters 2 and 3, respectively. Our second goal is to develop a method to predict relative lipid accessibility in alpha-helical membrane proteins, again based only on sequence information. We will discuss the details of predicting relative lipid accessibility in chapter 4.

1.2 Microarray Experiments and Gene Expression Profiling

We introduce a simplified description of the most relevant aspects of genes, gene expression, and the experiments used to measure expression. A gene is a fragment of DNA that encodes the information needed to synthesize one or more proteins [4]. This process starts with the transcription of DNA into messenger RNA (mRNA) [4]. When the DNA double helix opens at certain segments in response to a command to express a gene, a precise copy of the information is transcribed into mRNA [13]. This process of mRNA production is called transcription. The mRNA is then deciphered into amino acids and the protein is synthesized from this sequence at the ribosome [4]. This process is called translation [4].

Almost every cell contains the entire set of genes that it may ever need [4]. Nonetheless, cells differ because of their different functions and the variety of proteins used to perform those functions [13]. A cell may need a larger amount of some proteins and a smaller amount of others; for example, the concentration of certain proteins in red blood cells is much higher than in hair cells [4]. Therefore, the genes in each cell may be expressed at very different levels [13]. A cell's expression profile consists of a set of numbers indicating the expression levels of all genes within the cell [13]. Genes with similar expression levels are generally assumed to be associated with common functions [1,4].

A microarray experiment is a high-throughput measurement of the cellular concentration of mRNA. In this experiment, mRNA molecules are extracted from tissues of interest and then reverse-transcribed into complementary DNA (cDNA) [1]. The resulting cDNA is labeled with fluorescent dyes of different colors [1]. In general, cDNA from a sample tissue of interest is fluorescently labeled with Cy5 (red) dye while cDNA from a reference sample is labeled with Cy3 (green) dye. Oligonucleotides, cDNA, or small fragments of the mRNA are spotted as probes on a glass slide [1]. A mixture of cDNAs with different fluorescent labels hybridizes with the probes in each spot. The glass slide or chip, which consists of several hundred thousand probes, is illuminated by a laser and the light intensity from each spot is measured. This intensity is proportional to the concentration of the cDNA (or, more precisely, mRNA) fragments in the investigated samples [1]. The ratio of Cy5 to Cy3 intensity is measured and then expressed on a pseudocolor scale based on the base-2 logarithm of the ratio. A table of log2 concentration ratios between samples and references is then reported [1]. A typical figure illustrating microarray experiments is shown below.

Figure 1.4 Schematic experimental procedure to measure gene expression using a microarray. In a microarray experiment, DNA or messenger RNA is isolated from biological samples of interest. The cDNA is then fluorescently labeled with different colors for the sample of interest and the reference sample. The differently labeled DNA is hybridized to the microarray and the microarray is scanned under laser light. cDNA: complementary DNA; mRNA: messenger RNA. This figure is taken from Wikipedia (http://en.wikipedia.org/).
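The log-ratio transformation described above is straightforward to compute; the following is a minimal sketch in Python, with hypothetical spot intensities:

```python
import math

def log2_ratio(cy5_intensity, cy3_intensity):
    """Log2 ratio of sample (Cy5) to reference (Cy3) intensity for one spot."""
    return math.log2(cy5_intensity / cy3_intensity)

# Hypothetical spot intensities: a gene expressed twice as highly in the
# sample as in the reference yields a log2 ratio of +1; two-fold
# under-expression yields -1.
print(log2_ratio(2000.0, 1000.0))  # 1.0
print(log2_ratio(500.0, 1000.0))   # -1.0
```

The symmetry of the log scale (equal fold-changes up and down map to +1 and -1) is the reason the log2 ratio, rather than the raw ratio, is reported.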

The array comparative genomic hybridization (CGH) experiment is a high-throughput measurement of DNA copy number alterations for thousands of genes simultaneously [14]. CGH can be performed on the same microarrays used for profiling gene expression. When performed in parallel, gene expression profiling and microarray-based CGH experiments allow for the identification of gene expression patterns together with the structural DNA changes that cause them. This experimental technique has been applied to various cancer studies, such as colon cancer and breast cancer [14, 15]. The additional information from DNA copy number alterations can be used to enhance the identification of gene expression patterns, which will be discussed in detail in chapter 5.

1.3 Classification and Regression in Structural Bioinformatics

In this section, we briefly introduce classification and regression in supervised learning. Suppose an $n$-dimensional input vector $\vec{x}_i$ represents the $i$-th instance of the data, and $X = (\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_T)$ denotes all $T$ data points. On the other hand, $y_i$ denotes the property or target value of interest for data point $i$, and the outputs for all data points are denoted by $Y = (y_1, y_2, \ldots, y_T)$. The learning process deals with finding a mapping function [3], $f: \mathbb{R}^n \rightarrow \mathbb{R}$, such that

$f(\vec{x}_i) = y_i, \quad \forall i \in [1, T]. \quad (1)$

The outputs vary in general among the examples [3]. They can be quantitative measurements, which can be compared in relative terms [3]. The outputs can also be categorical or discrete, in which case they are labels rather than numbers [3]. The difference between classification and regression is that the outputs are discrete in classification but continuous in regression [3]. To some extent, the distinction is only a naming convention for the prediction task: regression for predicting quantitative outputs and classification for predicting discrete outputs [3]. We will see that they have many commonalities, in particular in the implementation of the learning in chapters 2, 3, and 4.

Due to the complexity of the problem, it is in general difficult to find the exact mapping function by solving equation (1). A variety of approximations are used to solve the equation by allowing for errors [3]. Let $\hat{y}_i = \hat{f}(\vec{x}_i)$ denote the approximation, with the predicted outputs for all data points denoted by $\hat{Y} = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_T)$. Two typical errors are used as our measure of fit: the mean absolute error (MAE) and the mean squared error (MSE). They are commonly defined as [3]:

$\mathrm{MAE}(\hat{f}) = \frac{1}{T} \sum_{i=1}^{T} |\hat{y}_i - y_i| \quad (2)$

and

$\mathrm{MSE}(\hat{f}) = \frac{1}{T} \sum_{i=1}^{T} (\hat{y}_i - y_i)^2. \quad (3)$

Then, the approximate solution may be obtained by minimizing the errors defined by equation (2) or (3). Many methods exist that use different approximations for the function $\hat{f}$ and utilize different cost functions and optimization approaches. Here, we use support vector machines [16], artificial neural networks [17], linear discriminant analysis [18], and support vector regression [19], as described in detail in the following chapters of this dissertation. The adjustable parameters in these methods are optimized by minimizing a cost function, which is either one of the above error functions or one that additionally accounts for generalization [3]. The details of the tested methods will be discussed in the relevant chapters.
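Equations (2) and (3) translate directly into code; the following is a minimal sketch in Python (the toy predictions and targets are hypothetical):

```python
def mae(y_pred, y_true):
    """Mean absolute error, equation (2): average of |y_hat_i - y_i|."""
    T = len(y_true)
    return sum(abs(yp - yt) for yp, yt in zip(y_pred, y_true)) / T

def mse(y_pred, y_true):
    """Mean squared error, equation (3): average of (y_hat_i - y_i)^2."""
    T = len(y_true)
    return sum((yp - yt) ** 2 for yp, yt in zip(y_pred, y_true)) / T

y_true = [0.1, 0.5, 0.9]   # hypothetical target values
y_pred = [0.2, 0.4, 0.6]   # hypothetical model outputs
print(mae(y_pred, y_true))  # approximately 0.167
print(mse(y_pred, y_true))  # approximately 0.037
```

Note that MSE penalizes large deviations more heavily than MAE, which is one reason the choice of error function affects the fitted model.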

In chapters 2 and 3, we use classification to improve the prediction of protein structural properties from amino acid sequences. For example, in the prediction of TM domains, the outputs take binary values (0 or 1) representing two classes of residues, those outside or inside of the cell membrane; hence, the problem is cast as a classification problem. On the other hand, we use regression to predict relative lipid accessibility in chapter 4. Since accessibility is a real value, that problem can be cast in terms of regression. We will discuss the details of the strategy for implementing the classification and regression approaches in subsequent chapters.
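To make the classification setting concrete, the sketch below casts per-residue TM prediction as binary classification by thresholding the mean hydropathy in a sliding window. The window size, threshold, and toy sequence are illustrative only and do not reflect the actual predictors developed in chapters 2 and 3; the scale values are the standard Kyte-Doolittle hydropathy indices.

```python
# Kyte-Doolittle hydropathy indices for the 20 standard amino acids.
KD = {'I': 4.5, 'V': 4.2, 'L': 3.8, 'F': 2.8, 'C': 2.5, 'M': 1.9, 'A': 1.8,
      'G': -0.4, 'T': -0.7, 'S': -0.8, 'W': -0.9, 'Y': -1.3, 'P': -1.6,
      'H': -3.2, 'E': -3.5, 'Q': -3.5, 'D': -3.5, 'N': -3.5, 'K': -3.9,
      'R': -4.5}

def predict_tm(seq, window=7, threshold=1.5):
    """Label each residue 1 (membrane) or 0 (non-membrane) by thresholding
    the mean hydropathy of a window centered on the residue."""
    half = window // 2
    labels = []
    for i in range(len(seq)):
        ctx = seq[max(0, i - half): i + half + 1]  # truncated at the ends
        mean_h = sum(KD[r] for r in ctx) / len(ctx)
        labels.append(1 if mean_h > threshold else 0)
    return labels

# Hypothetical sequence: charged flanks around a hydrophobic stretch.
seq = "KKDE" + "LLIVALVIL" + "RKDE"
labels = predict_tm(seq)
# Residues inside the hydrophobic stretch are labeled 1, the flanks 0.
```

A real predictor replaces the fixed threshold with a trained model (e.g., a neural network) over a richer per-residue representation, but the input/output structure, a window of residue features mapped to a binary label, is the same.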

1.4 Unsupervised Learning and Clustering in Genomics

As opposed to supervised learning, in unsupervised learning class membership (or patterns) is not specified, and even the number of different classes generally needs to be established. From a statistical point of view, learning amounts to estimating the probability distribution of patterns given the observed data. That is, we model $P(\vec{\theta} \mid X)$, the probability distribution of a pattern $\vec{\theta}$ given the observed data $X$. In genomic expression experiments, it is of interest to biologists to cluster genes with similar expression [20]. A biological motivation for finding highly correlated genes is that they are good candidates, such as oncogenes, for reflecting biological processes of interest. Therefore, an unsupervised learning approach, in particular clustering, is suitable for such a problem.

Clustering can be implemented with various approaches, such as k-means, hierarchical clustering, or approaches formulated within a Bayesian inference framework [3, 20]. Here, we adopt the Bayesian inference approach because of its advantages over the alternatives. In chapter 5, we will compare and discuss the clustering results obtained with our approach and others, and conclude by listing the advantages of our approach.

Here, we briefly formulate the Bayesian inference approach. Suppose the inputs $X = (\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_T)$ represent all $T$ data points and the model parameter $\vec{\theta}_i$ represents a cluster or pattern in the data. Clustering can then be formulated as finding the conditional probability distribution of $\vec{\theta}_i$ given the data $X$. From Bayes' theorem one has:

$P(\vec{\theta}_i \mid X) = \frac{P(X \mid \vec{\theta}_i) P(\vec{\theta}_i)}{P(X)} \quad (4)$

and

$P(X) = \int P(X \mid \vec{\theta}_i) P(\vec{\theta}_i) \, d\vec{\theta}_i. \quad (5)$

In equation (4), $P(\vec{\theta}_i \mid X)$ is the posterior probability distribution of the model parameter $\vec{\theta}_i$ given the data $X$, $P(\vec{\theta}_i)$ is the prior probability distribution of $\vec{\theta}_i$, and $P(X \mid \vec{\theta}_i)$ is the likelihood of $X$ given $\vec{\theta}_i$ [3].

The model parameter $\vec{\theta}_i$ can comprise a variety of random parameters, and in more complex cases these random parameters can themselves depend on other random parameters, in which case the approach is defined as a hierarchical Bayesian model [21]. For example, in gene clustering, $\vec{\theta}_i$ may be equal to $(\vec{C}, \alpha, \beta, \ldots)$, where $\vec{C} = (c_1, c_2, \ldots, c_T)$, $c_i$ is the cluster to which gene $i$ belongs, and $\alpha, \beta, \ldots$ are parameters drawn from possibly different priors. It should be stressed that the number of clusters can take any value between two extremes (1 through $T$). The minimum number of clusters is 1, which means all genes fall within the same cluster; in the other extreme, every gene forms a cluster containing only itself, and the number of clusters is then $T$ [22]. Therefore, the purpose of clustering is to find the conditional probability distribution of $\vec{C}$, $P(\vec{C} \mid X, \alpha, \beta, \ldots)$.

Unlike other approaches, in particular K-means, which is widely and successfully applied to many diverse problems, our approach requires no predefined number of clusters in the data [2,3,22]. The number and inner structure of the clusters are estimated by Bayesian inference given the data. This flexibility makes the approach superior to those in which the total number of possible clusters must be chosen arbitrarily in advance [22]. The details of building the model and applying this approach to clustering gene expression will be discussed in chapter 5.
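As a toy illustration of Bayesian cluster assignment via equation (4), the sketch below computes the posterior probability that a single one-dimensional expression value belongs to each of two hypothetical Gaussian clusters. This is only a sketch of the Bayes-theorem mechanics, not the context-specific hierarchical model used in chapter 5:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std dev sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def cluster_posterior(x, clusters):
    """Posterior P(c | x) ∝ P(x | c) P(c) for each cluster c (Bayes' theorem)."""
    joint = {c: p["prior"] * gaussian_pdf(x, p["mu"], p["sigma"])
             for c, p in clusters.items()}
    evidence = sum(joint.values())  # P(x), the normalizing constant
    return {c: j / evidence for c, j in joint.items()}

# Two hypothetical clusters of log-expression values.
clusters = {"up":   {"mu": 1.0,  "sigma": 0.5, "prior": 0.5},
            "down": {"mu": -1.0, "sigma": 0.5, "prior": 0.5}}
post = cluster_posterior(0.9, clusters)
# An observation near mu = 1.0 is assigned to the "up" cluster
# with probability close to 1.
```

In a full Bayesian clustering model, the cluster parameters and the number of clusters are themselves random quantities integrated over via the priors, rather than being fixed as in this toy example.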

1.5 Summary

In the post-Human Genome Project era, biologists are often confronted with data that are complex not only in terms of the number of entities but also in the high dimensionality of each entity. Computational approaches, especially statistical and machine learning, have already been used to solve a variety of interesting biological problems.

After decades of effort, experimentalists have been able to resolve some membrane protein structures at high resolution. These structures are critical to understanding the causes of diseases, if we are to cure the diseases caused by alterations of these proteins. However, the structures of membrane proteins are difficult to resolve experimentally: so far, fewer than 100 unique membrane protein structures have been deposited in the PDB, compared to more than 30,000 experimentally resolved soluble protein structures. In this dissertation, we will focus on developing supervised learning methods to predict structural properties of membrane proteins from sequence alone, including transmembrane domains and relative lipid accessibility.

As microarray technology has matured, its use, and consequently the resulting data, have grown dramatically. For example, microarrays have been used to find oncogenes and tumor suppressor genes in cancers; hopefully, the identification of those genes will lead to the prevention and eventually the cure of cancers. We will develop unsupervised learning methods to cluster genes that respond to the experimental conditions.

The details will be discussed in the following four chapters. Chapters 2 and 3 will focus on the prediction of transmembrane domains for alpha-helical and beta-barrel membrane proteins, respectively. Chapter 4 deals with the prediction of relative lipid accessibility, a different structural property of membrane proteins. The last chapter will address the gene clustering approach.

Chapter 2. Transmembrane domain prediction in alpha-helical membrane proteins

Transmembrane domain prediction methods have been revisited by several groups. Based on these studies, it appears that the accuracy of existing methods has been significantly overestimated. Here, we propose an alternative approach to predicting TM domains in membrane proteins. In this approach, we introduce a compact representation of an amino acid and its local environment, which consists of the predicted accessibility and secondary structure of the amino acid residue of interest. Alternative representations were examined using a variety of machine learning techniques, including linear discriminant analysis and neural networks. In 10-fold cross-validated training on a non-redundant experimental dataset, our approach, based on the compact representation, outperformed other published methods that use multiple sequence alignment information. According to an external evaluation by the Transmembrane Helix Benchmark server, our final prediction for helical membrane proteins is competitive with state-of-the-art methods, achieving about 89% per-residue accuracy and 80% per-segment accuracy on a set of high-resolution structures from the Benchmark server. More importantly, the observed confusion rates between membrane proteins, signal peptides, and globular proteins are the lowest among the tested methods. By contrast, most of the publicly available methods share a well-known weakness: they falsely predict signal peptides and globular proteins as membrane proteins at high rates. These low confusion rates make our method a preferred candidate for high-throughput searches for membrane proteins in whole-genome data. We have developed a web server named MINNOU to predict transmembrane domains in membrane proteins. It is available on-line at http://minnou.cchmc.org.

2.1 Introduction

While genes coding for membrane proteins (MPs) constitute up to an estimated 30% of sequenced genomes, experimentally solving MP structures has remained a challenge, and fewer than 100 unique structures have been determined at high resolution. Therefore, computational prediction of membrane proteins has become an important tool for the annotation of sequenced genomes and for studies of membrane proteins.

Many methods have been developed for predicting transmembrane domains [7]. The majority of these methods rely primarily on hydropathy, observed preferences of amino acids for membrane proteins, and other physico-chemical properties of amino acids in order to identify sufficiently long stretches of mostly hydrophobic residues that may coincide with TM helices. Successful examples of such methods include SOSUI [23], TopPred [24], and TMpred [25]. Another very successful approach to membrane protein prediction is based on hidden Markov models (HMMs), which allow one to model the global topology of membrane domains rather than simply identifying local membrane segments. Highly successful examples of such methods include TMHMM [26], Phobius [27], and HMMTOP [28, 29]. Either individual sequences or evolutionary profiles of protein families, as encoded by their multiple alignments (MAs), may be used in this approach (see e.g., [30]). An alternative group of methods uses advanced machine learning techniques, such as neural networks (NNs), in order to map the single sequence-based or evolutionary information about an amino acid residue and its context into a classification (e.g., membrane vs. non-membrane). The NN-based PHDhtm method [31], which incorporates evolutionary profiles, is a particularly successful example of such an approach.
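The hydropathy-analysis idea underlying methods such as SOSUI, TopPred, and TMpred can be sketched as follows. This is an illustrative toy version only; the window length and cutoff below are common textbook choices, not parameters taken from any of the cited methods:

```python
# Average the Kyte-Doolittle (KD) hydropathy score over a sliding window and
# flag windows above a threshold as candidate TM segments. Real methods add
# topology rules and amino acid preference statistics on top of this idea.

KD = {  # standard Kyte-Doolittle hydropathy scale
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
    'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
    'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
    'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2,
}

def hydropathy_profile(seq, window=19):
    """Mean KD score for each full window along the sequence."""
    scores = [KD[aa] for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(seq) - window + 1)]

def candidate_tm_windows(seq, window=19, threshold=1.6):
    """Start indices of windows whose mean hydropathy exceeds the threshold."""
    return [i for i, h in enumerate(hydropathy_profile(seq, window))
            if h > threshold]
```

A stretch of 19 leucines (mean KD score 3.8) is flagged as a candidate TM window, while a lysine-rich stretch is not.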

An independent assessment of state-of-the-art TM helix prediction methods, referred to as the TMH Benchmark, was recently performed in the Rost Lab [32]. The top performing methods (both in terms of sensitivity and specificity), such as PHDhtm or TMHMM, achieved per-residue accuracies of up to 80% and per-segment accuracies of up to 84%, with estimated rates of false positive matches of 1% for globular proteins and 20-30% for signal peptides. Other recent studies also pointed out the overall low accuracy of TM domain prediction (see e.g. [7,33,34]). At the same time, published estimates and benchmark results [7,30,32] suggest that methods that employ evolutionary information appear to be more accurate than methods based on information derived from a single sequence.

While similar concepts apply to the prediction of beta-barrel membrane proteins as well, this problem appears to be more difficult due to the weakly hydrophobic nature of membrane-spanning beta-strands. Another fundamental limitation comes from the very limited number of known prototypes available for training. Nevertheless, a number of methods for the prediction of beta-membrane proteins, based on both single sequences and evolutionary profiles, have been proposed and are being used to annotate newly sequenced genomes (see e.g., [35-37]).

The prediction methods considered here represent an amino acid residue and its environment using a sliding window of a certain length. Typically, in MA-based prediction methods, each residue in a sliding window is represented by a column of 20 substitution scores from a position-specific scoring matrix (PSSM) obtained using Psi-BLAST [38]. The PSSMs generated in this way effectively represent the frequencies of different amino acid substitutions at a given position in a protein family. Consequently, an MA-based representation may require several hundred attributes to describe a residue (e.g., 420 numerical values when using a sliding window of length 21). In conjunction with certain machine learning techniques, such as NNs, this may lead to thousands of free parameters to be optimized. The higher complexity of the model may, in turn, hinder our ability to train a successful prediction protocol with good generalization properties, especially if only a limited number of training examples is available. The latter is particularly relevant in the case of membrane protein prediction.
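For concreteness, the input sizes implied by the two representations can be checked directly (the 420-value example is from the text; the compact profile introduced later uses at most 5 numbers per residue):

```python
# Number of input attributes for a residue-centered sliding window:
# window length times features per residue.

def input_size(window_length, features_per_residue):
    return window_length * features_per_residue

assert input_size(21, 20) == 420   # MA-based: one 20-number PSSM column/residue
assert input_size(31, 5) == 155    # compact profile, longest window tested here
```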

We considered an alternative strategy for membrane protein prediction, based on a compact representation of an amino acid and its environment, which may be applied to improve the recognition of both helical and beta transmembrane domains. Instead of using evolutionary profiles directly, we propose to use prediction-based “structural profiles” consisting of the predicted relative solvent accessibility (RSA) and secondary structure (SS) of each residue. RSA measures the degree to which an amino acid is exposed to its solvent environment. From a simple secondary structural point of view, each residue can be in a helix, a beta-strand, or a coil. This initial prediction step may be viewed as an effective projection of the information encoded by the MA into a reduced representation defined by the predicted RSA/SS profiles.

Here, we used our recently introduced, accurate, real-valued RSA prediction method, which is available through the SABLE (Solvent AccessiBiLitiEs predictor) server [39]. The SABLE server, which was rigorously evaluated by the EVA meta-server for the evaluation of SS prediction servers [40], is also used to generate state-of-the-art SS predictions [41]. Each amino acid residue is thus represented by up to five numbers: the predicted real-valued RSA, the confidence of the RSA prediction, and the predicted probabilities of each of the three secondary structure classes (i.e., helix, beta-strand, and coil or other).

This compact representation stemmed from the observation that a method trained only on soluble proteins [39], in order to effectively identify residues buried in the hydrophobic core of globular proteins, may in fact also be used to indicate residues that are “buried” in the membrane. In addition, secondary structure prediction methods trained on soluble proteins achieve, perhaps surprisingly, a relatively high accuracy on membrane proteins (comparable to that for soluble proteins). The latter may provide interesting hints as to how membrane proteins fold and how stable their secondary structures are in an aqueous environment.

The accuracy of the new approach is estimated using both cross-validated training and the TMH Benchmark evaluation server [32]. Alternative representations and classification models are assessed using several different machine learning techniques, including Linear Discriminant Analysis (LDA) and Neural Networks. In particular, the MA-based representation is directly compared with the reduced representation proposed here.

2.2 Methods and Data

2.2.1 Data

In order to derive a non-redundant and representative set of membrane proteins with known experimentally resolved structures, we explored the MPtopo database [42], an excellent up-to-date collection of membrane proteins with experimentally verified topologies. From the June 2004 version of MPtopo, we obtained 167 sequence chains, consisting of 101 3D_helix chains, 38 1D_helix chains, and 28 3D_other chains. The TM domain length distribution among the three datasets is shown in Figure 2.1.

[Figure: histogram of the number of TM segments (0-70) versus TM segment length (0-50 residues) for the 3dhelix, 1dhelix, and 3dother sets.]

Figure 2.1 The distribution of lengths for TM segments included in the MPtopo [42] database of membrane proteins. 3D_helix and 1D_helix refer to helical membrane proteins with or without resolved 3D structure, respectively.

In MPtopo terminology, 3D_helix and 1D_helix contain helix-bundle proteins segregated according to the existence or absence of experimentally resolved 3-D structures. 3D_other includes β-barrel and monotopic membrane proteins with experimentally resolved 3-D structures. Thus, in the case of the 1D_helix set, only low resolution information derived from various experimental studies is available. Consequently, the delineation of TM segments is laden with additional uncertainty. It was also suggested that prediction methods may have been used to derive the actual boundaries of TM segments for some of the 1D_helix proteins [7]. As can be seen from the figure, the distribution of lengths for low resolution structures is indeed qualitatively different from that for the high resolution (3D_helix) set. In particular, a sharp peak is observed around the length of an alpha-helical TM segment (20-22 residues) for low resolution structures. This is in stark contrast to the much wider distribution for high resolution structures. Because of these differences and the higher uncertainty in the annotation of low resolution structures, we used only the 3D_helix set in training MINNOU. The 3D_other set included beta-barrel membrane proteins with known 3D structures and was characterized by much shorter membrane spanning segments.

As discussed in a previous section, the nature of the difference between alpha-helical and beta-barrel membrane proteins made it appropriate to maintain two separate datasets to develop and evaluate models of TM domain prediction for the two classes. Both datasets were refined as follows. We excluded the 1D_helix chains from the alpha-helical set because they lack 3D structures, leaving 101 sequence chains in the alpha-helical dataset. We removed monotopic membrane proteins from the beta-barrel set because they are not β-barrels, leaving 28 sequence chains in the beta-barrel dataset.

The non-redundant sequence dataset was obtained by the procedure discussed below. We used pairwise sequence alignment to identify homologous sequence chains. The BLAST program was used to remove sequences with an E-value smaller than 10^-10 [38]. We used the default settings for BLAST; specifically, the substitution matrix was BLOSUM62 [38]. Consequently, 88 membrane protein sequence chains remain in our non-redundant datasets: 73 sequence chains with 15,598 residues in the alpha-helical membrane protein set, and 15 sequence chains with 4,623 residues in the beta-barrel membrane protein set. These two training sets were used to design and evaluate the features discussed in the next section.

We also evaluated the model on 4 independent control sets, which were created for evaluating the SABLE predictor [41]. The 4 control sets were named S163, S135, S147, and S156, indicating their numbers of sequence chains; for example, S163 contains 163 sequence chains. The dominant majority of the proteins in the 4 control sets are globular: there are only 2 membrane protein sequences in each of the S135 and S147 sets, 1 in S156, and 5 in S163. The advantage of the 4 control sets is that they were not used for training models in SABLE, which makes them good candidates for evaluating our method's performance on globular proteins. In fact, fragments of the control sets that were falsely classified by models initially trained only on the two membrane protein training sets were later included in an extended training set, so as to reinforce our models' recognition of those patterns.

In addition, we also considered an extended dataset including falsely classified fragments of globular proteins and signal peptides. At first, we trained our models only on the training sets. Then, when we applied the trained models to the 4 control sets, we chose the 10 most severely mispredicted fragments of varying lengths and added them to the training set; these fragments contributed 1,539 residues. Finally, we explored the database used for the development of PrediSi [43] and collected 15 signal peptides, each 100 residues long, randomly selecting 5 signal peptides from each of three kingdoms: eukaryotes, Gram-positive bacteria, and Gram-negative bacteria. The resulting extended training set, with 18,627 residues in total, consisted of membrane protein chains, globular protein fragments, and signal peptides.

2.2.2 10-fold cross-validation

We used a 10-fold cross-validation procedure to train and evaluate our models. First, the dataset was randomly split into 10 subsets. One subset was used as an evaluation set, and the others were used as the training set. The models were designed and trained on the training set and then evaluated on the evaluation set. We then repeated the process, choosing a different subset as the evaluation set, until each of the 10 subsets had been used as an evaluation set once. In the end, the averaged performance of the trained models was reported.
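The procedure can be sketched as follows. This is a generic sketch; `train_fn` and `eval_fn` are hypothetical placeholders for the actual model training and evaluation routines, and the split is done at the level of whole protein chains, as in the protein-based setup described later:

```python
# Generic 10-fold cross-validation over a list of protein chains.
import random

def ten_fold_cv(chains, train_fn, eval_fn, k=10, seed=0):
    """Return one evaluation score per fold."""
    chains = chains[:]
    random.Random(seed).shuffle(chains)
    folds = [chains[i::k] for i in range(k)]      # k roughly equal subsets
    scores = []
    for i in range(k):
        held_out = folds[i]                        # evaluation set
        training = [c for j, f in enumerate(folds) if j != i for c in f]
        model = train_fn(training)
        scores.append(eval_fn(model, held_out))
    return scores
```

With 100 chains, each fold trains on 90 chains and evaluates on the remaining 10.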

In order to compare the results of the linear and non-linear classifiers used in this study (both LDA and NN approaches), the LDA-based classifiers were trained using the Tooldiag package [18]. The results of LDA are briefly discussed in the next section, as they are used primarily as a reference to assess the non-linear models. Here, we focus on the setup of the NNs that were used to derive our final prediction system in a cross-validation study assessing the relative contributions of the multiple alignment, hydropathy, and the novel prediction-based “structural profiles”.

The architecture of all NNs used here is similar: a feed-forward topology with three layers (the input layer, one hidden layer, and the output layer) is used. The adjacent layers are fully connected, and the logistic activation function is used for the nodes in the hidden and output layers. The number of features used to represent each amino acid residue varies between one and six in our tests for models that do not use evolutionary profiles, and equals 20 for MA-based methods (see also Tables 2.2 and 2.3). For example, when five features per amino acid residue are used, the input consists of up to 155 numbers, which represent the amino acid residues in a sliding window of length up to 31 (the longest sliding window tested here). The two output nodes represent the classes of “membrane residues” and “non-membrane residues”, respectively. Each residue (input vector) is assigned to the class with the larger excitation of its output node.
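A minimal sketch of one such network's forward pass follows. The weights here are random placeholders rather than trained parameters (the actual networks were trained with Rprop in SNNS, as described below); the sketch only illustrates the topology and the larger-excitation classification rule:

```python
# One-hidden-layer feed-forward network with logistic activations and two
# output nodes ("membrane" vs "non-membrane"). The input concatenates the
# per-residue features over the sliding window (31 residues x 5 features).
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    hidden = logistic(x @ w_hidden + b_hidden)
    return logistic(hidden @ w_out + b_out)   # excitations of the 2 outputs

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 155, 12, 2
w1, b1 = rng.normal(size=(n_in, n_hidden)), np.zeros(n_hidden)
w2, b2 = rng.normal(size=(n_hidden, n_out)), np.zeros(n_out)

x = rng.normal(size=n_in)                     # one residue's window features
out = forward(x, w1, b1, w2, b2)
label = "membrane" if out[0] > out[1] else "non-membrane"
```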

All the networks were trained using the Rprop [44] algorithm, as implemented in the SNNS package [45]. The order of training examples was random, and the number of training iterations (epochs) was set to 500 since no significant improvement in terms of the sum of squares error function was observed in a longer training for the networks considered here. For each of the representations and sliding windows, we trained a number of networks with a different number of nodes in the hidden layer that was varied between 8 and 18 (with an increment of 2). In the cross-validation study, which aims at assessing relative merits of different representations discussed here, multiple networks with a different number of nodes in the hidden layer were trained and evaluated.

Alternative representations imply different sizes of the input vectors and may require a different number of nodes in the hidden layer in order to achieve the best performance. Therefore, for a fair comparison between different representations, the results for networks with the number of hidden nodes that yielded the best generalization in each case are included in Table 2.2.

For helical membrane proteins, the cross-validation involves splitting the training set into ten subsets, each consisting of several protein chains with approximately equal numbers of residues. As in standard 10-fold cross-validation, the training is then performed 10 times, using different unions of nine subsets for training with the remaining subset as a validation set. The results on the control sets are averaged and are referred to as 10-fold cross-validation results (even though this is not strictly correct).

A protein-based (rather than residue-based) definition of the training and control sets makes it more likely that specific patterns resulting from distinct membrane domains are observed in a control set only. Thus, this approach is expected to yield a more realistic assessment of generalization to novel membrane proteins that have not yet been resolved structurally.

For beta-barrel membrane proteins, we used a leave-one-out (protein) procedure to estimate the accuracy of alternative representations.

In addition to the simple NN-based classifier developed and assessed in the cross-validation study, we also developed a multistage protocol for the enhanced prediction of transmembrane helices. (A similar system for beta-barrel membrane proteins is the subject of a future study.) For the final predictor, we do not consider the MA-based representation, which was shown in cross-validation to yield a lower accuracy than our compact prediction-based “structural profiles.” Because the inclusion of hydropathy led in our tests to high levels of confusion with globular proteins and signal peptides, we also excluded hydropathy from the representation of an amino acid residue. Consequently, each residue is initially represented by five numbers: the predicted real-valued RSA, the confidence of the RSA prediction, and the probabilities of each of the three secondary structures (as predicted by SABLE, http://sable.cchmc.org ). SABLE predictions are derived from the multiple alignment, hydropathy, and other attributes that are commonly used by other state-of-the-art methods; therefore, other accurate methods for RSA and SS prediction are expected to be useful in that regard as well. We also considered Psipred [46], another secondary structure predictor, and applied its predictions to TM domain prediction. As shown in Table 2.1, SABLE outperforms Psipred in secondary structure prediction for alpha-helical membrane proteins, but not for beta-barrel membrane proteins. Accordingly, when Psipred predictions were used to predict TM domains in alpha-helical membrane proteins, SABLE-based predictions proved more accurate (results not shown). Psipred remains, however, a candidate for the prediction of TM domains in beta-barrels; this investigation is left for future research.

Test set                          SABLE    PSIPRED
Alpha-helical: all                78.2%    76.6%
Alpha-helical: TM segments only   83.3%    80.6%
Beta-barrel: all                  68.7%    75.7%
Beta-barrel: TM segments only     64.9%    77.6%

Table 2.1 Per-residue classification accuracy (Q3) of secondary structure predictions for TM proteins obtained using the SABLE [41] and PSIPRED [46] servers. The results on non-redundant sets of 73 alpha-helical and 15 beta-barrel membrane proteins, respectively, are shown.

Following in the footsteps of other studies [47], we use a two-stage prediction system, with the second layer (structure-to-structure) NNs allowing one to “average” and smooth over the initial classification obtained using the first (sequence-to-structure) layer predictor. The architecture of the first and second layer NNs is similar to that used for the cross-validation study. Namely, a simple feed-forward topology with one hidden layer is employed.

In order to reduce the danger of overfitting and to achieve regularization as well as an improvement in accuracy, a consensus of twenty different networks was used to generate predictions at each stage. These different networks were trained on the different subsets of the training set that were also used for the cross-validation study. Multiple NNs were trained on each subset of the data, with different numbers of nodes in the hidden layer and different sizes of the sliding window. The number of nodes in the hidden layer was again varied between 8 and 18, and the size of the sliding window was varied between 11 and 31. The stopping criteria were defined in terms of improvement (or lack thereof) on the corresponding validation set. From the multiple networks trained on each subset of the data, the one providing the highest per-residue accuracy on the corresponding validation set was selected for the final consensus predictor. In that sense, the whole training set of 73 protein chains is used here to optimize the parameters of each of the networks.

However, using different subsets of the data provides a better sampling of local minima.

The first ten networks were trained on different subsets of the data using the representation with five attributes per amino acid residue, as described above. The other ten networks included in the consensus were trained with only four attributes per residue: the two nodes related to the relative solvent accessibility prediction were replaced by a single node representing a discretized RSA assignment. In other words, this binary node indicates whether a given residue is predicted to be buried or exposed (with a threshold of 25% RSA used to project the real-valued SABLE prediction into two classes).

Adding these additional networks to the consensus was found to somewhat improve the per-segment classification accuracy. The consensus predictor is based on simple majority voting. The source code for the MINNOU package, with the definitions of all the NNs used for the consensus prediction (including a detailed description of their topology and parameters), can be downloaded from ftp://ftp.chmcc.org/pdi/jmeller/minnou/ .
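The majority-voting step can be sketched as follows. The tie-breaking rule shown here (toward non-membrane) is an assumption for illustration; MINNOU's actual rule may differ:

```python
# Per-residue majority vote over the consensus networks.
# "M" = membrane, "-" = non-membrane.

def consensus(predictions):
    """predictions: list of equal-length prediction strings, one per network."""
    result = []
    for votes in zip(*predictions):
        m = votes.count("M")
        result.append("M" if m > len(votes) - m else "-")
    return "".join(result)
```

For example, `consensus(["MM-", "M--", "MM-"])` returns `"MM-"`.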

To further improve the performance of the first layer classifier, we introduced a second layer prediction system. The output of the first layer is used at this stage as input; in addition, the SABLE predictions used in the 1st layer are included in the input as well. After some experimentation, we found that the best results were obtained when the output of the first layer system was combined in the input of the 2nd layer as follows. For each residue, two additional nodes are used to represent the averaged (over the networks included in the consensus) excitations of the two output nodes corresponding to the two classes. Another pair of nodes is used to represent the maximum (again over the different networks) excitation for each class. Thus, the number of nodes per residue in the 2nd layer is either nine or eight, depending on the number of attributes used in the first layer. The choice of optimal topology, the other settings for the neural networks, and the training procedures follow those for the 1st layer (see also the MINNOU package for further details).
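Assembling the 2nd-layer input for a single residue might then look as follows (a sketch under the five-attribute first-layer variant; the function name is hypothetical):

```python
# Per-residue 2nd-layer features: 5 SABLE-derived numbers plus the averaged
# and maximum excitations of the two output classes over the first-layer
# consensus networks (5 + 2 + 2 = 9 nodes per residue).
import numpy as np

def second_layer_features(sable_features, first_layer_outputs):
    """
    sable_features: array of shape (5,) for this residue.
    first_layer_outputs: array of shape (n_networks, 2), the output-node
    excitations produced by each first-layer network.
    """
    avg = first_layer_outputs.mean(axis=0)   # 2 averaged excitations
    mx = first_layer_outputs.max(axis=0)     # 2 maximum excitations
    return np.concatenate([sable_features, avg, mx])

feats = second_layer_features(np.zeros(5), np.array([[0.9, 0.1], [0.7, 0.3]]))
assert feats.shape == (9,)
```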

While the second layer NNs led to significant smoothing of the prediction and improved the overall accuracy in terms of both sensitivity and specificity, some overly long or short helices were still occasionally predicted. We estimated the probability density distribution for the length of TM helices and used it as a guideline in the design of a filter, applied to the second layer prediction in order to avoid such unphysical predictions. Similar filters have been used before by other groups [47]. Basically, the final filter is applied to either split predicted TM helices that are too long or delete ones that are too short, lengths which are observed with very low frequency in the known sample of helical TM domains. The presence of relatively short (e.g., horizontally oriented) membrane-embedded helices, as well as relatively long (“skewed”) helices that occur in some ion channels, influenced the choice of the length thresholds applied here.

Specifically, if only a single membrane segment is predicted in a protein, then it is deleted if its length is shorter than 14 residues; if more than one membrane segment is predicted, then a segment is deleted if it is shorter than 8 residues. On the other hand, if there is a continuous membrane segment of length greater than 44, it is split into two segments in the middle by introducing an artificial loop of length one. If the predicted segment is longer than 66 residues (the latter happened seven times in the set of 2,247 segments predicted for the TMH Benchmark test), then it is split into three segments. Even though this final post-processing step does not significantly affect the per-residue accuracy, it does help to “smooth” the predictions and to improve the per-segment accuracy, Q_OK, as defined in [7] and [32]. It also helps to further reduce the observed level of confusion with signal peptides and globular proteins by filtering out unphysically short TM helices (see Table 2.4 and the next section for further discussion). However, this final prediction stage is likely to be improved further by explicitly considering the overall topology of membrane proteins and the characteristics of loops connecting TM segments. This is a subject of future work.
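The filtering rules above can be sketched directly. This is a hypothetical implementation of the stated thresholds; the exact split positions used by MINNOU may differ:

```python
# Length filter for predicted TM segments, given as (start, end) index
# pairs with end inclusive. Thresholds follow the text: delete a lone
# segment shorter than 14 (or any segment shorter than 8 when several are
# predicted); split segments longer than 44 into two, and longer than 66
# into three, separated by artificial 1-residue loops.

def split_segment(start, end, parts):
    """Split into `parts` pieces separated by 1-residue artificial loops."""
    n = end - start + 1
    size = (n - (parts - 1)) // parts
    pieces, s = [], start
    for i in range(parts):
        e = end if i == parts - 1 else s + size - 1
        pieces.append((s, e))
        s = e + 2              # skip one residue: the artificial loop
    return pieces

def length_filter(segments):
    min_len = 14 if len(segments) == 1 else 8
    kept = [(s, e) for s, e in segments if e - s + 1 >= min_len]
    out = []
    for s, e in kept:
        n = e - s + 1
        parts = 3 if n > 66 else 2 if n > 44 else 1
        out.extend(split_segment(s, e, parts) if parts > 1 else [(s, e)])
    return out
```

For example, a lone 10-residue segment is deleted, while a 50-residue segment in a multi-segment prediction is split in two around a 1-residue loop.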


2.3 Results and Discussion

2.3.1 Cross-validation

In order to train and evaluate both LDA-based and NN-based classification systems, we used the non-redundant training sets of alpha-helical and beta-barrel membrane proteins described in the previous section, with 10-fold (or leave-one-out) cross-validation. In Table 2.2 we compare the results obtained using NNs and different sliding windows for the two alternative representations considered here. The novel compact representation is estimated to achieve per-residue classification accuracies of 88% and 78% and correlation coefficients of 0.73 and 0.53 for the prediction of transmembrane helices and beta membrane proteins, respectively. For comparison, the MA-based prediction achieves in cross-validation a per-residue classification accuracy of up to 87% and a correlation coefficient of 0.70 for alpha-helical proteins, and 74% and 0.42 for beta proteins, respectively.

                  Alpha-helical            Beta-barrel
Features          Q2 %       MCC           Q2 %       MCC
RSA+SS (11)       87.9±0.8   0.74±0.02     77.9±3.3   0.50±0.08
RSA+SS (21)       88.0±0.6   0.73±0.02     78.7±3.3   0.53±0.08
RSA+SS (31)       87.4±0.7   0.73±0.02     77.9±3.6   0.53±0.08
MA (11)           85.0±1.3   0.67±0.03     71.6±2.9   0.37±0.07
MA (21)           86.0±1.4   0.70±0.03     73.3±3.4   0.41±0.08
MA (31)           86.5±1.4   0.70±0.03     73.6±3.6   0.42±0.09

Table 2.2 Accuracy of transmembrane segment (domain) prediction using alternative representations, consisting of predicted RSA/SS profiles and MA-based evolutionary profiles, compared using cross-validation on non-redundant sets of alpha-helical and beta-barrel proteins for three different sizes of the sliding window (11, 21, and 31 residues). Averaged two-class per-residue classification accuracies (Q2), Matthews Correlation Coefficients (MCC) [48], and standard deviations are included.
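The Q2 and MCC figures of merit reported in the table are computed from per-residue confusion counts, e.g.:

```python
# Two-class per-residue accuracy (Q2) and Matthews correlation coefficient
# (MCC) from counts of true/false positives and negatives.
import math

def q2(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

A perfect classifier gives Q2 = 1 and MCC = 1, while uninformative predictions give MCC = 0 even when Q2 is inflated by class imbalance, which is why both are reported.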

It is interesting to note that the differences in accuracy between the two alternative representations, as well as the error bars (as measured by standard deviations from cross-validated training), are significantly higher for beta-barrel proteins, reflecting the very limited number of training examples in this case and highlighting the problems with reliable parameter estimation. While the difference in accuracy is not as large for alpha-helical proteins, for which the number of training examples is significantly higher, the prediction-based structural profiles still yield improved accuracies and lower error bars in cross-validation, despite (or perhaps thanks to) the much simpler representation.

The error bars observed in cross-validation and the drop in accuracy between training and validation sets may also be used to assess the level of overfitting for both representations. In general, a higher variability on the control sets (e.g. 1.4% for MA- based vs. 0.7% for RSA-based representation in case of alpha-helical proteins) and higher accuracy in the training with respect to control sets indicate a higher level of overfitting.

In that regard, the classification accuracy in the training (as measured by the average accuracy on ten training sets used in cross-validation) is only about 3% higher than on control sets in case of the novel RSA-based representation, as opposed to about 5% difference in case of the more complex MA-based models. Given that some of the proteins included in our limited training and control sets are likely to share common characteristics (despite the lack of sequence homology), the above estimates of the level of overfitting are expected to be overly optimistic. Nevertheless, the relative differences between the two representations are clear and reinforce our proposition that the novel compact representation enables more reliable parameter estimation for prediction of transmembrane domains.

Note also that the differences in accuracy between the shorter and longer sliding windows considered in Table 2.2 are not statistically significant, even though the error bars tend to be somewhat higher for longer sliding windows. In principle, longer sliding windows imply more parameters to be estimated from the limited data and are more prone to overfitting. However, for a fair comparison between the two alternative representations, we report here the best results achieved for each size of the sliding window by selecting the optimal number of nodes in the hidden layer (see Section 2.2). In contrast, if the topology of the network is fixed, implying a monotone increase of the number of parameters with the size of the sliding window, then a decrease in accuracy is observed in cross-validation for windows longer than 20 residues. A significant drop in accuracy is also observed for very short windows. In that context, it is also important to realize that the distribution of lengths of TM segments is quite wide, with many “non-canonical” TM helices providing difficult-to-classify prototypes. Therefore, we decided to incorporate into the final consensus-based predictor networks that use different sliding windows, for further smoothing of the results.

Further results are summarized in Table 2.3. Several compact representations are compared, including the simple hydropathy scale based methods (with one attribute per amino acid in the sliding window), the RSA and SS predictions alone (with two and three attributes per residue, respectively), and the combined profiles. A sliding window of length 25 is used for this comparison since it was found to be optimal for the compact representations introduced here. For both alpha-helical and beta-barrel membrane proteins, the RSA and SS based structural profiles perform significantly better than a simple hydropathy-based method (with the KD scale [49] working somewhat better than the “biological” scale proposed recently [50]). Combining RSA and SS predictions with hydropathy profiles does not result in a further increase in accuracy. On the other hand, the RSA and SS predictions alone are sufficient to achieve performance close to that of the best combination of features.

Alpha-helical Beta-barrel Features Q2 % MCC Q2 % MCC HKD 84.2±1.5 0.67±0.02 67.8±2.0 0.32±0.04 HWW 82.6±1.5 0.63±0.02 67.3±2.6 0.31±0.04 SS 81.2±5.0 0.66±0.06 77.0±3.9 0.52±0.08 RSA 86.8±0.8 0.71±0.02 76.0±2.2 0.42±0.08 SS+RSA 88.2±0.6 0.74±0.02 77.4±3.7 0.53±0.08

SS+RSA+H KD 87.8±0.7 0.74±0.02 77.7±3.1 0.53±0.07 SS+RSA+H WW 88.0±0.8 0.74±0.01 77.4±3.3 0.53±0.07 Table 2.3 Accuracy of TM segment prediction using structural profiles including KD and WW hydropathy profiles [49,50], predicted RSA and SS, estimated using cross-validation on non-redundant sets of alpha- helical and beta-barrel proteins.

Even though we do not present a detailed analysis of the LDA results here, it is interesting to note that the same trends are observed in that case as well (with the overall level of classification accuracy lower by about 4%). The fact that cross-validated accuracy of the LDA based classification is significantly lower suggests that the more general, non-linear characteristics of NN-based classifiers play an important role in this case. On the other hand, however, the risk of overestimating the accuracy increases when using more complex models. From that point of view, the accuracy of the simple LDA model with its 125 free parameters (when using five attributes per residue and a sliding window of length 25) provides an additional support for the hypothesis that the compact representation proposed here, consisting of prediction-based structural profiles, is likely to contribute to a further increase in accuracy of membrane domain prediction methods.

We also hypothesize that the new representation may prove advantageous over explicit use of evolutionary profiles not only in the context of machine learning-based methods, as directly tested here, but also in the context of grammar-based methods. This is the subject of future work.

[Figure 2.2: plot of MCC (y-axis, 0.55 to 0.75) versus sliding window size (x-axis, 0 to 30).]

Figure 2.2 Dependence of prediction accuracy on the size of the sliding window using 10-fold cross- validation on the non-redundant training set of 73 alpha-helical membrane proteins.

The results of a simple NN with one hidden layer consisting of 10 nodes, trained for 1000 cycles with the learning parameter set to 0.1, are compared in terms of Matthews Correlation Coefficients (MCC). The fixed topology and meta-parameters of the networks allow one to compare the results directly and assess the extent of overfitting due to increased window size. Each amino acid residue in the sliding window is represented here by five numbers using the prediction-based structural profiles described in the text.

Thus, increasing the size of the sliding window from 11 to 21 increases the number of weights to be optimized from 550 to 1050 (neglecting the weights for edges between the hidden layer and the two nodes in the output layer, as well as the parameters for the activation functions of the nodes in the hidden and output layers). As can be seen from the figure, the cross-validation estimates of the accuracy increase quickly at first, reach a plateau, and then gradually decrease with the growing size of the sliding window, pointing to overfitting for more complex models. It should be noted that by changing the topology of the network we were able to obtain somewhat better results (at the level of MCC=0.74, as reported in Tables 2.1 and 2.2 for different sliding windows). We would also like to comment that while very short sliding windows seem capable of capturing the essential information about an amino acid residue and its environment relatively well (since few residues predicted to be fully buried and in alpha helices are found in the loop domains of the TM proteins included in the training set), the overall quality of predictions based on such short windows is low, due to a high level of confusion with globular proteins and low per-segment accuracy.

The evaluations were based on 10-fold cross-validation. The training sets for alpha-helical and beta-barrel proteins consist of MPs only. The alpha-helical training set comprises 15598 residues in total, 6469 of which are in the 255 transmembrane segments. The beta-barrel training set comprises 4723 residues, 1922 of which are in the 161 transmembrane segments.

                     High resolution    Low resolution     cSP    cGP
                     Qok %    Q2 %      Qok %    Q2 %      %      %
1st layer (0)        66       88        32       83        78     27
1st layer (1)        61       88        40       85        54     12
1st layer (2)        66       89        24       82        64     23
2nd layer            66       88        39       84        27     8
2nd layer + filter   80       89        55       85        8      1

Table 2.4 Improvements due to the 1st layer, 2nd layer, and filter based predictors according to the TMH Benchmark evaluation. Per-segment (Qok) and per-residue (Q2) classification accuracies and confusion levels with signal peptides (cSP) and globular proteins (cGP) are given in per cent.

In Table 2.4, five different prediction systems are compared. The first two involve the first-layer predictions: training with the standard training set containing membrane proteins only, and training with the augmented training set that includes signal peptides and some false positives from the first iteration. The next two include the corresponding results for the second-layer predictors, which use the results of the first-layer classifiers as part of their input. The last row contains the results of the second-layer prediction with filtering of short, unphysical segments, which improves the prediction in some respects.

2.3.2 TMH Benchmark Assessment

In this section, we present the evaluation of our final two-stage NN-based prediction system for transmembrane helices. The new method will be referred to as "Membrane protein IdeNtificatioN withOUt explicit use of hydropathy profiles and alignments" (MINNOU). Using the TMH Benchmark server, we assess both the sensitivity and specificity of the new method (especially in terms of confusion with globular proteins and signal peptides that may be incorrectly predicted as having TM segments). At the same time, the performance of MINNOU is compared with that of other state-of-the-art prediction methods. Table 2.5 summarizes the results on a set of high resolution structures with well defined boundaries of transmembrane helices. The levels of confusion with globular proteins and signal peptides are also shown in Table 2.5, whereas Table 2.4 illustrates the contributions of the different steps in our multistage protocol.

As can be seen from Table 2.5, MINNOU achieves the highest per-residue accuracy (89%) among the methods included in the TMH Benchmark evaluation. At the same time, its per-segment accuracy (80%) is worse than that of PHDhtm and HMMTOP2. It should be noted, however, that per-segment accuracy is very sensitive to falsely predicted short TM helices and to short non-membrane segments within true TM helices. This can also be seen from the effect of including a filter that removes some of these incorrectly predicted short segments without significantly affecting per-residue accuracy (see Section 2.4). Therefore, further improvement in that regard is likely to be achieved by optimizing this step.

We would also like to comment that several methods (including the two mentioned above) achieve a higher per-residue accuracy (89-90%, as opposed to 85% for MINNOU) on the set of low resolution structures included in the TMH Benchmark.

These structures were not included in our original training set because of the uncertain assignment of their membrane segments. However, for comparison we performed cross-validated training using both high and low resolution structures and observed a decreased accuracy on such a joint set (by about 2% in terms of classification accuracy and 0.03 in terms of correlation coefficient). Thus, the two sets of TM segments appear to have distinct characteristics. This is further highlighted by the much narrower (relative to high resolution structures) distribution of lengths of the TM segments derived from the low resolution structures. The fact that some of the prediction methods actually achieve higher accuracy on low resolution than on high resolution structures could indicate that prediction methods may have played a role in delineating TM segments in these low resolution structures, as suggested before in [7].

It should also be noted that, due to the very small number of structurally resolved membrane proteins, any benchmark is likely to use evaluation proteins which are homologous to those used by most of the evaluated methods (including MINNOU in the case of high resolution structures) for training. Therefore, the observed levels of accuracy are, in fact, based on variations of the training set and are unlikely to hold in the future.

Nevertheless, the TMH Benchmark is a very useful resource for independent (static) evaluation of the results and comparison between different methods. Moreover, the TMH Benchmark evaluation revealed significant levels of confusion with globular proteins and even higher levels of confusion with signal peptides. It is encouraging that the new method appears to be significantly better in that regard than any other method evaluated in [32], which is partly attributable to the use of an augmented training set. In the latter aspect, MINNOU is similar to the recently published Phobius method [27].

Method     Q2    Qok   cGP   cSP
MINNOU     89    80    1     8
PHDhtm     80    84    2     23
HMMTOP2    80    83    6     48
TMHMM1     80    71    1     34
DAS        72    79    16    97
TopPred2   77    75    10    82
SOSUI      75    71    1     61

Table 2.5 Assessment of TM helix prediction methods using the TMH Benchmark server. Per-segment (Qok) and per-residue (Q2) classification accuracies [7] and confusion levels with signal peptides (cSP) and globular proteins (cGP) are given in per cent.

Another interesting observation is the relatively higher accuracy of MINNOU predictions for ion channel proteins, which are characterized by the occurrence of both very long and relatively short membrane helices. For the seven ion channels included in the set of high resolution structures used for training, MINNOU achieved an average per-residue accuracy of 92.0% and a correlation coefficient of 0.81, as opposed to 86.4% and 0.68 for HMMTOP2 or 85.4% and 0.68 for DAS. For a newly solved ion channel structure, 1ots, which is homologous to one of the proteins included in the training set, the accuracy of the MINNOU prediction is also significantly higher than that of any other method (a correlation coefficient of 0.67, as opposed to 0.44 for the second best method). While these differences are not statistically significant and are clouded by the lack of truly independent test sets, we find these trends, and the ability of MINNOU to predict membrane segments in ion channels largely correctly without losing the ability to make relatively accurate predictions for other types of helical TM proteins, rather encouraging.

Method     1tn0_A   1vfp_A   1umx_L   1u7c_A   1xfh_A
HMMTOP2    75.6     88.3     81.9     79.0     70.0
           0.51     0.62     0.64     0.59     0.38
SOSUI      82.8     91.8     88.3     78.0     66.7
           0.64     0.74     0.76     0.56     0.34
TopPred2   80.0     88.5     91.8     85.2     63.1
           0.59     0.64     0.83     0.71     0.27
DAS        80.4     91.0     84.0     82.1     68.5
           0.62     0.70     0.66     0.66     0.33
MINNOU     84.8     91.4     77.6     75.6     65.5
           0.68     0.75     0.63     0.52     0.34

Table 2.6 Accuracy of different methods as measured by per-residue accuracy (first line in each row) and Matthews Correlation Coefficients (second line in each row) on a set of five helical membrane proteins not included in the training set (including two, 1u7c and 1xfh, that are not homologous to proteins included in the training).

Finally, in order to illustrate the performance of several top ranking methods on individual proteins, we used a set of five recently solved membrane proteins. The results are shown in Table 2.6. Three of these five proteins (including a bacterial rhodopsin structure, 1tn0, and a photosynthetic reaction center protein, 1umx) exhibit homology to those included in the training set and are merely used to show the variation in accuracy observed for all the methods on different proteins. It is interesting to note that none of these methods appears to be clearly better than the others, and all of them failed quite badly for one of the non-redundant new proteins, the glutamate transporter 1xfh, for which the highest correlation coefficient is only 0.38 (the MINNOU prediction for 1xfh is at the level of the other methods). One should note, however, that the assignment of TM segments for these newly solved proteins is based on a theoretical analysis of the structures using the method of [9] and is thus laden with additional uncertainty. Nevertheless, we believe that the limited accuracy of the top ranking methods included in Table 2.6 further underscores the need for continued development of improved methods for membrane domain prediction.

2.3.3 The Independent Control Sets

The four control sets were obtained from SABLE's documentation. Each control set consists of helical MPs and non-helical MPs, the latter being either globular proteins or beta-barrel MPs. All the helical membrane proteins are distinguished with little confusion with either globular proteins or beta-barrels. In Table 2.7, "Helical MP predicted" and "Non helical MP predicted" stand for the numbers of predicted helical MPs and non-helical MPs in each control set, respectively; "Helical MP observed" and "Non helical MP observed" are the numbers of actual helical MPs and non-helical MPs, respectively.

We would like to point out that we also performed an independent analysis of the specificity of our predictions. Using non-redundant sets of signal peptides and globular proteins, we found good agreement with the low estimates of confusion reported in Table 2.4. For example, on a set of 314 non-redundant soluble proteins with no homology to proteins used in the training of our SABLE prediction method (a subset of proteins used before in [41] to evaluate SABLE), only three proteins were falsely predicted to have membrane segments.

        Helical MP   Helical MP   Non helical MP   Non helical MP
        observed     predicted    observed         predicted
S156    1            1            155              155
S135    2            5            133              130
S163    4            6            159              157
S147    2            5            145              142

Table 2.7 Performance of MINNOU on the four control sets.

2.3.4 An Example

In Figure 2.3, we show two examples of predictions from MINNOU. The amino acid sequence is included in the first row, with the actual and predicted membrane segments highlighted using bold and yellow boxes, respectively. The secondary structures and relative solvent accessibilities predicted using the SABLE server are shown in the second and third rows. The alpha-helices are represented by red ribbons, the beta-strands by green arrows, and coils by blue lines. The level of predicted aqueous solvent exposure is represented by shaded boxes, with fully "buried" and fully exposed residues represented by black and white boxes, respectively. The secondary structures observed in the experimentally resolved structures are shown in the last row for comparison.


Figure 2.3 Examples of MINNOU transmembrane helix predictions for an ion channel (PDB code 1OTS) in panel A and a glutamate transporter protein (PDB code 1XFH) in panel B.

2.4 Conclusion

We proposed a novel representation of an amino acid residue and its environment for membrane protein prediction. The new approach does not explicitly use evolutionary profiles or hydropathy. Instead, it relies on prediction-based structural profiles, consisting of the predicted relative solvent accessibility and secondary structure of amino acid residues. In particular, the predicted level of aqueous solvent exposure, obtained from an accurate RSA prediction method trained on soluble proteins only [39], is used to identify segments of residues that are "buried" in the membrane in order to "avoid" contact with water.

In cross-validation with a simple, one-layer NN-based classifier, the new representation is estimated to yield an accuracy of 0.74 for TM helix and 0.53 for beta membrane prediction, as measured by the correlation coefficient between the predicted and observed classes. For comparison, the MA-based prediction is estimated to achieve a lower accuracy, with correlation coefficients of 0.70 for alpha-helical and 0.42 for beta proteins, respectively. The final protocol for TM helix prediction, based on a two-step NN-based classifier, is estimated by the TMH Benchmark server to achieve a per-residue accuracy of 89% (significantly higher than any of the methods evaluated in [32]) and a per-segment accuracy of 80%, with the lowest rates of confusion with globular proteins and signal peptides among the methods tested (and similar to the recently published Phobius method [27]).

Thus, using the new representation we were able to achieve accuracy competitive with other state-of-the-art methods for alpha-helical TM domains, as assessed by the TMH Benchmark server. High sensitivity of TM domain prediction is achieved with very low levels of confusion with globular proteins and signal peptides. Moreover, in our internal cross-validation tests, the new representation outperformed multiple alignment-based approaches for both alpha-helical and beta-barrel membrane proteins. Therefore, we conclude that applying predicted RSA and SS is likely to further contribute to the development of accurate methods for the prediction of protein membrane domains.

Chapter 3 Prediction of transmembrane domains in beta-barrel membrane proteins

β-barrel membrane proteins play an important physiological role: their specific topology enables pore formation and transport through cell membranes. Identification and characterization of β-barrel proteins from sequence is challenging, since only a limited number of structurally resolved membrane proteins of this type are available to train and validate prediction methods. We developed a novel method for the prediction of transmembrane domains and pore-facing residues in β-barrel membrane proteins. Our approach is based on a compact representation of an amino acid and its environment, which consists of the predicted solvent accessibility and secondary structure of each residue in a sliding window. Using cross-validated training on a set of 13 β-barrel membrane proteins, augmented to include fragments of globular proteins as negative examples, we demonstrate that this novel feature space enables accuracy competitive with that of state-of-the-art methods. Our neural network-based transmembrane domain predictor is estimated to yield a per-residue classification accuracy of 84.1% and a Matthews Correlation Coefficient (MCC) of 0.69. We also extended our effort to identify the pore-forming residues within the transmembrane segments. The new method is estimated to predict the pore-facing residues along the transmembrane strands with a per-residue accuracy of 89.5% and an MCC of 0.75. The new method is available at http://minnou.cchmc.org .

3.1 Introduction

Many methods have been developed in the past for the prediction (from amino acid sequence) of TM segments and the topology of helical membrane proteins. For a comprehensive overview of the state of the art in membrane protein prediction and the concepts underlying different approaches, the reader is referred to [7]. In contrast, very few methods have been developed to predict transmembrane segments and topology for β-barrel membrane proteins. Several alternative machine and statistical learning techniques have been employed to develop such methods, including Neural Networks (NNs) [51,52] and Hidden Markov Models (HMMs) [37,53]. While the reported accuracies and successful genome-wide applications [37] are encouraging, these methods are overall much less reliable than those for the prediction of alpha-helical transmembrane segments [7,11].

In this chapter, we introduce a novel approach to predict transmembrane (TM) domains and pore-facing residues in β-barrels. Our approach is based on a new compact representation of an amino acid and its environment, defined here by a sliding window of a certain length centered at the position of interest. Instead of commonly used physico-chemical properties of amino acid residues, such as hydrophobicity, or multiple alignment (MA)-based representations [7], each residue in a sliding window is described in terms of its predicted (from sequence) relative solvent accessibility (RSA) and secondary structure (SS). These prediction-based "structural profiles" were introduced recently and shown to improve upon state-of-the-art MA-based representations for helical TM proteins, also helping to reduce the number of false positives, i.e., the number of soluble proteins predicted as containing TM segments [11].

Here, we extend this strategy to β-barrel proteins, for which the problems of limited data from which to learn and of substantial levels of false positives are of even greater importance. We used several machine learning techniques, including Support Vector Machines (SVMs) and NNs, to carefully evaluate both linear and non-linear classifiers on the available data. We concluded that the new compact representation, in conjunction with NN-based classifiers and consensus predictors, enables reliable prediction of TM domains in β-barrel proteins that appears to be competitive with previously published and carefully evaluated methods, such as PROFtmb [37]. The new method also yielded very low confusion (false positive) rates between globular proteins and β-barrel membrane proteins. The chapter is organized as follows. In the next section, we discuss the training sets, learning approaches, and validation procedures used. In Section 3.3, we present the results of the new methods and a discussion, followed by conclusions.

3.2 Methods and Data

3.2.1 Data

In order to develop and test the prediction methods considered here, we used several different datasets. The first dataset consists of eight non-redundant β-barrel proteins (Protein Data Bank codes: 1fep, 1qjp, 1qj9, 1qd5, 1a0s, 1af6, and 1bt9), which were used before to develop the PROFtmb method [37]. This set was used to directly compare the new method with PROFtmb. In an attempt to further compare our method directly with those from the literature, we explored data sets used by [54] and [55] to develop their respective TM β-barrel protein predictors. Unfortunately, we found some inconsistencies in the definition of these sets. In particular, in the training set developed by [55], we found some protein chains that appear to be highly redundant (e.g., 1af6 and 2mpr), whereas the set developed by [54] appears to contain non-membrane proteins (e.g., 1nqw). In light of the very limited number of resolved structures, these inconsistencies may result in overly optimistic estimates of accuracy and invalidate a direct comparison with published results.

PDB ID   Protein name   Number of β-strands   Residues in TM seg./Total
1kmo     FecA           8                     247 / 741
1mal     Maltoporin     18                    222 / 421
1uun     Porin          2                     37 / 184
1p4t     NspA           8                     104 / 155
1qjp     OmpA           8                     89 / 325
1opf     OmpF           16                    191 / 340
1i78     OmpT           10                    120 / 297
1qj9     OmpX           8                     94 / 148
1k24     OpcA           10                    114 / 253
1qd5     OmpLA          12                    148 / 269
1prn     Porin          16                    176 / 289
1ek9     TolC           4                     48 / 471
1mm4     CrcA           8                     90 / 186

Table 3.1 The non-redundant dataset of 13 β-barrel proteins (S13 set). In addition to PDB codes and (shortened) protein names, the number of β-strands, the number of residues in TM segments, and the total number of residues in each protein, as given in MPtopo, are reported in the third and fourth columns.

In an effort to identify as many non-redundant TM β-barrel proteins as possible, we further explored the MPtopo database [42], which provides a comprehensive collection of membrane proteins with experimentally verified topologies. We used sequence alignment to exclude redundant protein chains. Specifically, we used the BLASTP program with default settings [38] to detect homologous sequences among the TM β-barrel proteins initially derived from MPtopo. In addition, several structures with non-typical topologies or uncertain annotations were excluded. As a result, we obtained a set of 13 non-redundant β-barrel proteins, with the BLASTP E-value between any pair of sequences in this set being larger than 10^-3. A detailed description of the proteins in this set (referred to as S13 throughout the text) is included in Table 3.1.
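The redundancy-removal step can be sketched as a greedy filter over precomputed pairwise E-values. The helper below is illustrative only: the `evalue` map and the toy E-values are hypothetical stand-ins for actual BLASTP output, not data from this work.

```python
def nonredundant(chains, evalue, cutoff=1e-3):
    """Greedy non-redundant selection: keep a chain only if its BLASTP
    E-value against every chain already kept exceeds the cutoff.

    `evalue` maps unordered PDB-code pairs (as frozensets) to E-values;
    in practice these would be precomputed with BLASTP.
    """
    kept = []
    for chain in chains:
        if all(evalue[frozenset((chain, k))] > cutoff for k in kept):
            kept.append(chain)
    return kept

# Toy E-values: 1af6 and 2mpr are treated as a redundant pair.
ev = {frozenset(p): e for p, e in [
    (("1af6", "2mpr"), 1e-40),
    (("1af6", "1kmo"), 5.0),
    (("2mpr", "1kmo"), 3.0),
]}
print(nonredundant(["1af6", "2mpr", "1kmo"], ev))  # ['1af6', '1kmo']
```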

For further evaluation of the method in terms of confusion with soluble (globular) proteins (which may be falsely predicted as containing TM β-strands and thus incorrectly identified as membrane proteins), we used several non-redundant sets (both internally and between them) of soluble proteins that were used to evaluate the SABLE method for RSA and SS prediction [39,41,56]. Two of these sets, denoted S149 and S156 in [39], were used to identify a number of negative examples (false positives obtained after the initial training on TM β-barrels only) with which to augment the S13 training set and retrain the method. This augmented set (referred to as S13g) includes 18 fragments (of varied length, ranging from 99 to 330 amino acid residues) of soluble proteins that were initially incorrectly predicted to have TM β-strands. Finally, we used two other non-redundant sets of proteins used for the evaluation of SABLE (denoted S135 and S163), which together include 298 soluble proteins, for the assessment of the final predictor in terms of false positive rates and confusion with globular proteins.

3.2.2 Amino Acid Representation

Prediction methods considered here use an amino acid sequence as input and assign to each amino acid residue a class (binary output): membrane vs. non-membrane residue in the case of TM segment prediction, or pore-facing vs. lipid-facing residue in the case of pore prediction. Since machine learning approaches typically require a vector representation of fixed dimensionality (defined by a number of features or attributes), a sliding window consisting of some number of residues around the residue of interest is commonly used to represent an amino acid's environment. We used this approach, varying the number of residues in the sliding window from 5 to 31 and assessing the results on validation sets.

Each amino acid residue in a sliding window is encoded by five features representing the predicted relative solvent accessibility, the confidence level of the RSA prediction, and the probabilities of the residue being in each of three secondary structure classes (helix, β-strand, and coil or other). These intermediate attributes are predicted from sequence using the SABLE method for RSA and SS prediction [39,41], which was shown to provide an accurate compact representation for the subsequent prediction of TM segments in helical membrane proteins [11]. In addition, for the prediction of pore-facing residues, we use another (sequence-independent) attribute of an amino acid, namely its hydrophobicity as defined by the KD hydropathy scale [49]. Thus, the total number of attributes used is five (or six) times the size of the sliding window.
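As a sketch of this encoding, the function below flattens a window of per-residue structural profiles into a single feature vector. The zero-padding at chain termini is our assumption; the text does not specify how window positions beyond the sequence ends are handled.

```python
def window_features(profile, center, window=11):
    """Flatten predicted structural profiles in a sliding window into one
    feature vector of length 5 * window.

    `profile` holds one 5-tuple per residue:
    (RSA, RSA confidence, P(helix), P(strand), P(coil)),
    values that would come from a predictor such as SABLE.
    Positions beyond the termini are zero-padded (an assumption).
    """
    half = window // 2
    feats = []
    for i in range(center - half, center + half + 1):
        if 0 <= i < len(profile):
            feats.extend(profile[i])
        else:
            feats.extend([0.0] * 5)
    return feats

# A 3-residue toy profile; the resulting vector has 5 * 11 entries.
toy = [(0.1, 0.9, 0.7, 0.1, 0.2)] * 3
v = window_features(toy, center=1, window=11)
print(len(v))  # 55
```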

3.2.3 Multistage Protocol for the Prediction of TM Domains

In this section we describe the learning algorithms, training protocols, and predictors that are evaluated in the Results section. Following established prediction methods (e.g., [47]) and guided by the assessment of intermediate predictors, we developed a two-level NN-based model to predict TM domains in β-barrel membrane proteins. The input for the first-level neural network is based on the compact representation described in the previous section. The second-level neural network uses the output of the first-level networks as additional features in order to "smooth over" the predictions of the first layer and improve the overall performance. Both first- and second-layer networks are trained and evaluated using cross-validation on both the S13 dataset of TM β-barrels only and the extended S13g set, which includes fragments from globular proteins as well. The networks trained on the latter (augmented) set proved to yield significantly lower rates of false positives and are therefore used for the final predictor.

The architecture of all neural networks used here is similar. A simple feed-forward three-layer network is used, consisting of an input layer, one hidden layer, and an output layer with two nodes. Since the size of the sliding window is varied from 5 to 31, the number of features, and thus the number of nodes in the input layer, varies from 25 to 155 for the compact representation with five features per residue used here. Several alternative networks, with the number of hidden nodes varying from 2 to 18, were trained and evaluated. The two nodes in the output layer correspond to the structural classes of the residue of interest, e.g., non-TM vs. TM domain. Binary encoding is imposed in training for the output nodes, e.g., [1,0] represents a residue in a TM domain, whereas [0,1] corresponds to a non-membrane residue. After some experimentation, we found that a threshold of 0.7 (as opposed to 0.5) on the normalized TM-class output to trigger the prediction of a membrane residue provides more balanced sensitivity and specificity. All the networks were trained using the Rprop algorithm [44] as implemented in the SNNS package [45]. Unless specified otherwise, training was stopped if there was no improvement on the corresponding validation set within 500 epochs. The learning parameter was fixed to 0.1 for all the networks.
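The thresholded decision rule can be sketched as follows. Normalizing the two output activations to sum to one is our assumption about what "normalized TM class output" denotes; the text does not spell out the normalization.

```python
def classify(tm_out, non_tm_out, threshold=0.7):
    """Decision rule sketched in the text: the two output-node
    activations are normalized to sum to one, and a residue is called
    transmembrane only when the normalized TM output exceeds the
    threshold (0.7 rather than 0.5, trading sensitivity for specificity).
    """
    total = tm_out + non_tm_out
    p_tm = tm_out / total if total > 0 else 0.0
    return "TM" if p_tm > threshold else "non-TM"

print(classify(0.9, 0.2))  # TM (0.9 / 1.1 ~ 0.82 > 0.7)
print(classify(0.6, 0.4))  # non-TM (0.6 > 0.5 but not > 0.7)
```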

The accuracy of alternative predictors is assessed using cross-validated training. Due to the limited number of data points from which to learn, leave-one-protein-out or an approximate ten-fold cross-validation is used. In the latter case, training and control subsets are still defined in terms of individual chains and contain approximately equal numbers of residues. In order to provide better generalization, the final consensus predictor consists of a simple majority vote of 10 second-layer networks trained on different subsets of the data in 10-fold cross-validation. Furthermore, an additional post-processing step is applied to filter out predicted TM β-strands that are shorter than five residues (and thus would most likely be false positives, as around ten residues are necessary for a TM-spanning strand). Also, since the smallest structural unit in β-barrel proteins is the β-hairpin [57,58], we ignore TM domains predicted as the only TM strand in a given protein. Similar filters were used previously to improve the prediction of TM segments in alpha-helical membrane proteins [11,47]. The effects of the filter and other arbitrary choices are illustrated in detail in the next section.

Since non-linear NN-based classifiers increase the risk of overfitting, even in the case of the relatively compact representations considered here, the contributions of different steps in the protocol were also assessed using simple linear Support Vector Machine (SVM)-based classifiers trained using the different training sets and representations described above. While the SVM approach, which we used before to evaluate the performance of linear predictors for RSA prediction [56], yielded accuracies somewhat lower than the NN-based predictors, we were able to confirm the overall trends and additionally validate the more involved non-linear classifiers. Three standard measures of accuracy are used to evaluate the different methods: per-residue classification accuracy (defined as the per cent of correctly predicted residues), the Matthews Correlation Coefficient (MCC) between predicted and true class labels [48], and the false positive rate, which reflects the level of confusion with soluble proteins incorrectly predicted as having TM segments.
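The MCC used throughout can be computed directly from the confusion-matrix counts; a minimal sketch for binary labels (1 = TM, 0 = non-TM):

```python
import math

def mcc(pred, true):
    """Matthews Correlation Coefficient between binary predicted and
    observed class labels. Returns 0.0 when any marginal is empty."""
    tp = sum(p == 1 and t == 1 for p, t in zip(pred, true))
    tn = sum(p == 0 and t == 0 for p, t in zip(pred, true))
    fp = sum(p == 1 and t == 0 for p, t in zip(pred, true))
    fn = sum(p == 0 and t == 1 for p, t in zip(pred, true))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(mcc([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0 (perfect prediction)
```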

3.2.4 Prediction of Pore-facing Residues

As an extension of the TM segment prediction, we developed a new method to identify putative pore-facing residues in β-barrel membrane proteins. Following [59], we define the pore-facing residues using geometric considerations. Specifically, pore residues are identified as those whose Cα atom is closer than the Cβ atom to the central axis of the pore. For glycine residues, the Hα2 atom was used in place of the Cβ atom. The following β-barrel structures were used to derive a set of TM residues for cross-validated training: 1mm4, 1kmo, 1prn, 1qd6, 1k24, 1qj9, 1i78, 1opf, 1qjp, 1p4t, 1mal, and 7ahl. The central axis of each pore was found and visually inspected using the VMD program [60]. The center of mass of each protein was identified, helping to define a line perpendicular to the membrane plane and located inside the pore. The distances between Cα and Cβ and the central axis were also measured using VMD [60]. As mentioned before, for the pore prediction we used the same compact representation, extended by hydropathy profiles [49].
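The geometric rule can be sketched as follows; in practice the measurements were made with VMD, so this is only an illustration of the criterion (the axis point, axis direction, and atom coordinates are hypothetical inputs).

```python
import math

def dist_to_axis(point, axis_point, axis_dir):
    """Perpendicular distance from `point` to the pore axis, given a
    point on the axis and a unit direction vector along it."""
    v = [p - a for p, a in zip(point, axis_point)]
    t = sum(vi * di for vi, di in zip(v, axis_dir))
    perp = [vi - t * di for vi, di in zip(v, axis_dir)]
    return math.sqrt(sum(c * c for c in perp))

def is_pore_facing(ca, cb, axis_point, axis_dir):
    """Geometric rule from the text: a residue faces the pore when its
    C-alpha atom is closer to the central axis than its C-beta atom
    (with H-alpha2 substituting for C-beta in glycine)."""
    return dist_to_axis(ca, axis_point, axis_dir) < dist_to_axis(cb, axis_point, axis_dir)

# Axis along z through the origin; C-alpha inside, C-beta pointing outward.
axis_p, axis_d = (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
print(is_pore_facing((1.0, 0.0, 5.0), (2.0, 0.0, 5.0), axis_p, axis_d))  # True
```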

3.3. Results and Discussion

3.3.1. Comparison with PROFtmb

In order to compare our approach with that of Bigelow and colleagues [37], we used the set of 8 non-redundant β-barrel TM proteins used originally in [37] to develop

59 the PROFtmb method. A simple NN-based predictor (without the second stage predictor used below to further improve the performance) was trained on this set for direct comparison of the results. NNs with a fixed architecture, consisting of 55 input nodes corresponding to a sliding window of 11 residues, 3 nodes in a single hidden layer and binary output layer, were trained on eight different subsets of the data as described in the previous section (using however a constant number of 1000 training epochs). The novel compact representation yields in the leave-one-out validation average per-residue classification accuracy of 86.0±1% and the MCC of 0.73±0.02, as compared with the classification accuracy of 83% and the MCC of 0.70 reported by [37]. We would like to stress that the number of data points that can be sued for training and validation is very limited, suggesting that the current estimates of the accuracy may not hold in the future.

However, we find that in relative terms the new approach is competitive with state-of-the-art methods for TM segment prediction in β-barrel proteins as represented here by PROFtmb.

3.3.2. Cross-validation for TM Domain Prediction

In this section, we discuss the results of a multistage NN-based predictor on our primary training set S13 and its augmented version S13g (see Section 3.4). The first stage, second stage, and second stage predictor with an additional filter are evaluated using (approximate) 10-fold cross-validation on both sets. The results are summarized in Table 3.2. As can be seen from the table, the overall performance of NN-based predictors on the S13 set is estimated to be similar to that observed on the set of eight proteins used in the previous section. At the same time, the performance of both first and second stage predictors is estimated to be significantly better on the S13 set of TM proteins only (MCC of 0.73 for the first stage predictor) compared to the S13g set (MCC of 0.62 for the first stage predictor), which is augmented with soluble proteins that are likely to be incorrectly predicted as having TM β strands.

                 S13                    S13g
                 Q2 [%]     MCC         Q2 [%]    MCC
    1st          86.7±2.0   0.73±0.04   80.0±3    0.62±0.03
    2nd          88.2±1.2   0.76±0.04   84.1±3    0.69±0.05
    2nd+Filter   87.9±2.0   0.75±0.05   82.6±3    0.63±0.06

Table 3.2 Performance of alternative predictors estimated using 10-fold cross-validation on different training sets (S13 consists of only β-barrel membrane proteins, while S13g is an extended set augmented with fragments of globular proteins initially predicted as containing TM β strands). Average per-residue classification accuracy (Q2) and MCC, as well as error bars (standard deviations), are given for the 1st level, 2nd level, and 2nd level with a filter predictors.

While the second stage predictors improve results significantly due to smoothing of the initial prediction, many relatively short putative TM segments are still predicted in soluble proteins (including soluble domains of TM proteins included in S13). Therefore, as discussed in Section 3.4, an additional filter was applied to remove unphysically short TM segments, resulting in significantly lower false positive rates. On the other hand, the use of a filter decreases the accuracy of the second stage predictor on TM proteins (MCC of 0.63) to the level of the first stage predictor without a filter. Since large-scale applications to annotate newly sequenced genomes require low rates of false positives (i.e., high specificity of predictions), this loss in sensitivity is a necessary price to pay at present. It remains to be seen if a better trade-off between specificity and sensitivity can be achieved when more training data is available, enabling a more reliable assessment of alternative models and the arbitrary choices involved. Finally, we would like to comment that some of the problems discussed here (e.g., many falsely predicted short TM segments) may be less pronounced when using an alternative to a sliding window-based approach that relies on a "grammatical" structure of TM protein sequences, as utilized in HMM-based methods that model the entire sequence and impose explicitly expected length distributions for TM segments [37]. Investigating the advantages of the new compact representation considered here in the context of methods with grammar is the subject of future work.
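The post-processing filter described above can be sketched as follows. The minimum segment length is a tunable parameter (the exact cutoff used in the dissertation is not specified here), and the two-letter label alphabet is illustrative:

```python
import re

def filter_short_segments(labels, min_len=6):
    """Remove predicted TM segments shorter than min_len residues.
    `labels` is a per-residue string over {'M' (TM), '-' (non-TM)};
    min_len is a tunable cutoff, not the dissertation's exact value."""
    out = list(labels)
    for m in re.finditer(r'M+', labels):
        if m.end() - m.start() < min_len:
            # relabel the whole spurious segment as non-TM
            out[m.start():m.end()] = '-' * (m.end() - m.start())
    return ''.join(out)
```

A two-residue "strand" is wiped out while a physically plausible seven-residue strand survives, which is exactly the specificity/sensitivity trade-off discussed in the text.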

3.3.3. Pore Prediction

The pore-facing residue prediction was assessed using leave-one-out cross-validated training on a set of TM segments derived from the set of 12 β-barrel membrane proteins described in the System and Methods section. A simple one-stage NN-based predictor, similar to those used for TM segment prediction, in conjunction with the compact representation of an amino acid in terms of predicted RSA and SS, as well as average hydrophobicity, yields an average per-residue classification accuracy of 89.5±2% and an MCC of 0.75±0.05 in leave-one-out cross-validation.

Figure 3.1 An example of TM segment prediction for a β-barrel membrane protein (PDB code 1P4T). Starting from the top row: the amino acid sequence, the actual secondary structures (with β-strands represented as green arrows) and relative solvent accessibility (with fully buried residues represented as black boxes), and the SABLE-predicted SS and RSA are shown in the subsequent rows, respectively. The actual TM domains are highlighted in yellow, whereas the predicted TM residues are shown in bold.

We would like to stress again that only TM residues are classified here as either pore-facing or lipid-facing. Thus, an additional uncertainty due to errors in prediction of the boundaries of TM segments needs to be considered for sequence-based annotations.

Figure 3.1 provides an example of TM segment prediction and illustrates possible errors, such as two TM β-strands being predicted as one long strand. Other types of errors include TM strands not being predicted as such and falsely identified TM segments in soluble proteins, a problem discussed in the next section.

3.3.4. Confusion with Soluble Proteins

In order to test the confusion with globular proteins (i.e., the rate of false positive predictions that lead to misclassification of soluble proteins as TM β-barrels), we used two control sets that were used before to evaluate SABLE RSA and SS predictions. Thus, these representative sets (consisting of a total of 298 proteins without significant sequence homology [39]) are non-redundant with the SABLE (used here to derive amino acid attributes from sequence for TM segment prediction) and S13 (as well as S13g) training sets. The results of two second stage consensus predictors (see System and Methods section) developed using the S13 and S13g sets are shown in Table 3.3.

    Method                         S135   S163
    Trained on S13    No filter     117    137
                      Filter         85     89
    Trained on S13g   No filter      46     54
                      Filter          1      4

Table 3.3 Assessment of false positive rates for alternative TM β-strand segment predictors. Methods trained on the S13 and S13g sets (without and with a filter that removes spuriously short predicted TM segments) are compared on two sets of non-redundant soluble proteins denoted as S135 and S163 (see text for details). The number of false positive matches (soluble proteins with at least one falsely predicted TM segment) is given in each case.

As can be seen from Table 3.3, the first predictor, which was trained on the original S13 set, predicts (without a filter) at least one TM segment in about 85% of soluble proteins. While the use of a filter that removes spuriously short TM β strands improves the results somewhat, false positive rates are still very high. Training an alternative predictor on the S13g set, which was augmented with examples of false positives observed (in other sets of soluble proteins without homology to sequences included in S135 and S163) with an initially trained predictor, improves the results significantly. However, only combining the improved predictor with an additional filter brings the false positive rate to a low level (five false positives in 298 proteins, i.e., a false positive rate of about 2%), which makes the new method applicable to genome-wide annotations.

3.4. Conclusion

We developed new methods for the prediction of TM domains and pore-facing residues in β-barrel membrane proteins. These methods are based on a compact representation of an amino acid and its environment, which involves prediction-based structural profiles consisting of predicted secondary structures and relative solvent accessibilities of amino acid residues. These intermediate attributes are predicted (from sequence) using the SABLE method, which was demonstrated to achieve state-of-the-art performance [39,41,56] and was shown to provide a basis for subsequent accurate prediction of TM segments in alpha-helical TM proteins [11]. Careful evaluation using cross-validated training on currently available structurally resolved membrane β-barrels suggests that the new representation and predictors are competitive with other state-of-the-art methods.

Our final NN-based predictor is estimated to yield a per-residue accuracy of about 83% and an MCC of about 0.63. At the same time, a relatively low false positive rate (of about 2%), which we were able to achieve due to the inclusion of negative examples in the training and further post-processing of predictions to remove short putative TM segments, makes the new method a candidate for large-scale applications in genomics.

The prediction of pore-facing residues is estimated to achieve a per-residue accuracy of about 90% and an MCC of about 0.75. While these estimates are likely to be revised as more data becomes available for training and validation, we conclude that the new method can already become a useful tool for structural and functional studies of membrane proteins. We also developed a website for predicting transmembrane domains and pore-facing residues in β-barrel membrane proteins, which is available at http://minnou.cchmc.org.

Chapter 4 Relative lipid accessibility prediction in membrane proteins

The level of surface exposure of an amino acid residue in a soluble protein can be reliably predicted from sequence and subsequently applied to facilitate overall structure predictions and functional annotations. We propose to extend these efforts to membrane proteins. In particular, we developed novel methods to predict the relative lipid accessibility (RLA) of an amino acid residue in a membrane domain, which represents the lipid exposed surface area of that residue in relative terms. In analogy to relative solvent accessibility prediction in soluble proteins, the problem of predicting RLA from the amino acid sequence can be cast as a regression problem and solved using machine learning techniques. The critical difference between soluble and membrane proteins, which makes the latter significantly more challenging, is the relatively small number of high resolution structures from which to learn. It is thus essential to carefully design and evaluate compact representations and low-complexity models for RLA prediction. In this work, we developed linear Support Vector Regression (SVR) approaches that are suitable for training on the limited number of structurally resolved membrane proteins. Moreover, we evaluate flexible SVR-based models to represent the uncertainty of RLA assignments for residues at the membrane-water interfaces. Using cross-validation on a non-redundant set of alpha-helical membrane domains as well as an independent control set, we estimate that our methods yield correlation coefficients between the observed and predicted RLAs of about 0.5 and mean absolute errors of about 17% RLA. When the real-valued predictions are discretized with a threshold of 15% RLA to define lipid exposed and buried residues, the two-state classification accuracy is about 68% (with a baseline of 53%). We conclude that the proposed RLA prediction methods show promise towards further applications to structure prediction and identification of membrane domain interactions.

4.1 Introduction

One characteristic of membrane proteins that can be computationally predicted from sequence with high accuracy is the location of transmembrane (TM) segments, especially in alpha helical domains. While some early estimates of the accuracy of TM helix prediction methods have been revised recently, top performing methods achieve per residue classification accuracies of about 80%, as measured by the TMH Benchmark server [31]. Moreover, recent studies suggest that further progress can be achieved by better incorporating evolutionary information [7] or by carefully designing and validating low-complexity models [11]. Given that TM segments can be reliably predicted, the next challenge lies in providing more detailed characterizations of structural and functional attributes of individual residues in the membrane.

In the present study, we propose novel methods for the prediction of an important structural attribute of amino acid residues in membrane proteins: relative lipid accessibility (RLA). We define the RLA of amino acid residue i as the ratio of its lipid exposed surface area observed in a given structure (LA_i) and the maximum achievable surface area for this amino acid (MSA_i):

    RLA_i = (LA_i / MSA_i) × 100 [%].

RLA_i can hence adopt values between 0% and 100%, with 0% corresponding to a fully buried and 100% to a fully lipid accessible residue.
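As a worked example of this definition, a residue whose lipid exposed area is exactly half of its maximum achievable area has RLA = 50%. A minimal helper (the function name and the clipping to the physical 0-100% range are our additions):

```python
def rla(lipid_exposed_area, max_area):
    """Relative lipid accessibility: RLA_i = (LA_i / MSA_i) * 100 [%],
    clipped to the physically meaningful 0-100% range."""
    return max(0.0, min(100.0, 100.0 * lipid_exposed_area / max_area))
```

In practice LA_i comes from a surface parameterization of the structure and MSA_i from tabulated maximum areas, as described in Section 4.2.2.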

Conceptually, RLA for membrane proteins is a natural extension of the commonly used notion of relative solvent accessibility (RSA) for soluble proteins [39,61]. However, the problem of RLA prediction is much more challenging than that of RSA. Most importantly, there is a limited number of high resolution membrane protein structures that provide detailed information about the lipid exposed surface area of individual residues. As a result, careful strategies are required when applying statistical and machine learning methods to this problem. Secondly, given a set of 3D coordinates and a protein surface parameterization, the exposed surface area of an amino acid residue is classified as either lipid exposed or water exposed depending on the location of that residue (in either membrane or soluble domains). This may lead to difficulties with RLA assignments for residues at the lipid-water interface that are transiently exposed to both environments. Other confounding factors include the presence of pores and channels, as well as the relative abundance of multimeric forms in membrane proteins [62-64].

On the other hand, there are many potential applications of RLA predictions. For example, in analogy to the RSA case, sufficiently accurate RLA estimates can facilitate folding simulations and overall structure prediction for membrane proteins. In particular, predicted RLAs can be used to bias the search in conformational space towards those models that are consistent with the predicted patterns of lipid accessibilities. RLA prediction can also be used in fold recognition, enabling the identification of the most compatible structural template. Additionally, it is likely that RLA prediction can be used to identify potential interactions between membrane proteins. (RSA predictions were found to yield fingerprints of protein-protein interaction sites for soluble proteins [65].)

To the best of our knowledge, a direct RLA prediction for individual residues in membrane proteins has not been considered before in the literature. Recently, new lipophilicity scales, such as TMLIP [66], were derived using statistical analysis of lipid propensities of amino acid residues in TM domains. These average lipophilicities were then used to classify entire TM helices as either buried or lipid exposed [67]. We expect that detailed predictions of the level of lipid exposure for individual amino acid residues would be more informative than average properties and thus provide more accurate characterizations of multispan alpha helical membrane proteins.

Here, we use several machine learning-based approaches in order to develop and test alternative RLA predictors for alpha helical membrane proteins. In particular, following our previous work on RSA prediction [39,56], we advocate the use of regression approaches (as opposed to classification models) to obtain real-valued estimates of RLA. However, in contrast to RSA prediction, the comparative scarcity of data from high-resolution structures of membrane proteins imposes limits on the complexity of prediction models. With this in mind, we specifically compare linear Support Vector Regression models with more involved non-linear predictors. The latter are indeed found to suffer from overfitting, as demonstrated in the Results section.

4.2 Materials and Methods

4.2.1 Training and Control Sets

We explored the MPtopo [42] and PDB_TM [9] membrane protein databases in order to derive non-redundant and representative sets of membrane proteins for training and validation of the proposed RLA predictors. While MPtopo provides a set of carefully annotated membrane proteins with experimentally verified topologies, the annotation process in PDB_TM is fully automated with the goal of providing a comprehensive and continuously updated collection of membrane proteins. It should be noted, however, that even for structurally resolved membrane proteins the identification of the membrane segment boundaries is laden with some uncertainty, and, as a result, the two databases are not fully consistent with each other [9].

We initially obtained 101 high resolution structures of helical membrane proteins from MPtopo. We subsequently used pairwise sequence alignments to identify homologous chains and removed redundant entries from the training set. Specifically, the BLAST program [38] was used to remove sequences resulting in matches with E-values smaller than 10^-10. As a result, a set of 72 non-redundant alpha-helical protein chains was obtained with a total of 6,307 residues located in membrane segments. The same set of chains (minus one protein, which was excluded due to missing side chain information) had previously been used to develop MINNOU, an accurate TM domain prediction method that utilizes a compact RSA-based representation of amino acid residues in TM segments [11].

We next removed individual residues for which no structural information about side chains was available. Finally, proteins that may form pores or channels needed to be handled properly. The HOLE program [68], followed by manual inspection, was used to detect channels and to identify residues that form the pore or channel inside the protein. These residues were subsequently removed, as they are considered water exposed rather than lipid exposed. This is a deliberate choice to remove potentially confounding cases from the training set, resulting finally in a set of 6,156 residues for cross-validated training. In addition, we also considered a further reduced subset of 4,635 residues obtained by removing from training those residues that were located at interaction interfaces between membrane domains (see the next subsection for details).

Figure 4.1 The proposed position specific error (PSE) function for ε-insensitive SVR models (represented in the figure by the red curve as a function of the position in the membrane), for the prediction of the relative lipid exposure (RLA) in membrane proteins. By allowing larger errors for residues at the lipid-water interface that are more difficult to predict, improved prediction accuracies can be achieved for the core residues that occupy the center of the membrane (where the error function is constant).

In addition, we created an independent control set using the PDB_TM database [9]. As of May 2006, there were 1,762 sequence chains in PDB_TM that were derived from the 3D structures deposited in PDB using an automated algorithm to determine TM domains. This initial set was reduced to a non-redundant control set as follows. First, entries with incomplete sequence information were excluded. Next, sequences with homology to those in the training set (defined using a more rigorous E-value threshold of 10^-3 for BLAST matches in order to exclude even relatively distant homologs) were removed. Thirdly, this set was further reduced by checking internal redundancy (with the same threshold) and also removing modeled 3D structures. Finally, by removing structures either with missing side chains or consisting of only short fragments (e.g., individual TM helices), which would not be sufficient for obtaining reliable multiple alignments, we arrived at a non-redundant set of 16 chains with known 3D structures and with a total of 1,482 residues in membrane segments (241 of them located at known interaction sites).

4.2.2 Feature Space and RLA Computation

Given the limited size of available training data, it is essential to carefully consider the complexity of the proposed prediction models for RLA. A direct consequence is that the representation that we choose to encode amino acid residues and their environment should involve a limited number of descriptors, especially if more complex prediction models such as those based on NNs are used. As part of this study, we specifically addressed the issue of the overall complexity of prediction models and the dimensionality of the feature space used to capture propensities to lipid exposure vs. burial inside the core of the protein. In particular, we compared multiple sequence alignment (MSA)-based representations with compact representations derived from our RSA predictions and hydropathy profiles in the context of the comparison between SVR and NN-based regression methods.

In the case of MSA-based representations, each amino acid residue is represented by a vector of 20 numbers (a column of the Position Specific Scoring Matrix (PSSM) at that position). In order to generate family profiles encoded in the form of PSSMs, we used the Psi-BLAST program [38] (version 2.6 with default options) and the NR database [69] (version as of May 2006, with 2,869,298 sequences). Three iterations of Psi-BLAST were performed to generate family profiles; no masking of low complexity regions or membrane domains was used. The local structural environment and evolutionary context of each residue is then characterized by a sliding window of n amino acids, with the residue of interest at the central position, implying 20 × n features for this representation. Different window sizes were tested. We note that sliding windows for residues at the edges of the alignment are created using short artificial N- and C-terminal peptide extensions that were added to the original query sequence, following our strategy for RSA prediction [39].
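The window construction above can be sketched as follows. For simplicity this sketch pads beyond the termini with zero columns, whereas the dissertation instead appends short artificial peptide extensions; the function name is ours.

```python
import numpy as np

def window_features(pssm, i, w=15):
    """Build the 20*w feature vector for residue i from an (L, 20) PSSM.
    Positions beyond the termini are padded with zero columns (a
    simplification of the artificial-extension strategy in the text)."""
    pssm = np.asarray(pssm, float)
    half = w // 2
    cols = []
    for j in range(i - half, i + half + 1):
        if 0 <= j < len(pssm):
            cols.append(pssm[j])
        else:
            cols.append(np.zeros(pssm.shape[1]))
    # concatenate the w columns into one flat 20*w vector
    return np.concatenate(cols)
```

With w = 15 this yields the 300-dimensional MSA representation referred to in Table 4.1.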

As an alternative, we also considered a more compact representation with each amino acid residue being represented by the predicted RSA and its associated confidence factor, as obtained using the SABLE server [39]. This representation was motivated by observed periodic patterns in the SABLE-predicted confidence factors that qualitatively seemed to follow the periodicity of TM helices. We put this hypothesis to the test by developing and assessing an RLA prediction method based on this simple representation. In addition, we also considered adding hydropathy and lipophilicity profiles, as derived from the TMLIP2H [66] and WW [50] scales.

As mentioned in Section 4.1, the concept of lipid surface exposure in membrane proteins is essentially analogous to that of water solvent accessibility for soluble proteins. For this study, and similarly to RSA computation, we computed actual RLAs using the parameterization of the protein surface implemented in the DSSP program [70]. Only residues from membrane domains were taken into account, as mentioned in Section 4.2.1. The exposed surface area of the amino acid residue was then normalized by the maximum obtainable value of the surface exposed area (as defined in the commonly referenced study by Chothia [71]) in order to arrive at a relative measure of lipid accessibility (RLA).

Many of the structures included in the training set are, in fact, protein complexes with interacting membrane domains. This poses a dilemma as to how to assign RLA to interfacial residues. It is a priori unclear if these residues are lipid exposed without the presence of the interacting partners. Therefore, we consider two alternatives: we first compute their RLAs derived from complexes; and secondly, we derive RLAs from individual structures, disregarding interacting chains. As we will see, global accuracy estimates suggest that using data derived from complexes results in somewhat better performance. Moreover, removing interacting residues altogether from the training set results in further improvements, indicating that these are indeed difficult cases.

4.2.3 Support Vector Regression Models

For what follows, let each amino acid in the training set be encoded by a vector a_i as described in Section 4.2.2, and let the corresponding true RLA value be denoted by y_i ∈ [0,1]. Support Vector Regression models can be seen as generalizations of Least Squares (LS) models in which an ε-insensitive penalty function is minimized instead of the sum of squared errors. The interested reader is referred, e.g., to the excellent monograph [3] for details. For the purposes of this chapter, we restrict ourselves to stating the overall optimization problem that is solved by SVRs:

    min   ||w||_p + C ||ξ||_1
    s.t.  |a_i^T w + β − y_i| − ξ_i ≤ ε   for all i.

Here C is an a priori penalty parameter, which balances the regression error term ||ξ||_1 ≡ Σ_i ξ_i and the normalization term ||w||_p ≡ (Σ_i w_i^p)^(1/p) (which corresponds to the margin maximization term for SVMs). Given a solution (ŵ, β̂), the predicted RLA for a new amino acid characterized by a_k is then given by ŷ_k = a_k^T ŵ + β̂. To compare and contrast with LS, note that the term |a_i^T w + β − y_i| in the constraints corresponds to the objective function for LS. SVRs penalize this deviation via the slack variable ξ_i only if it exceeds ε, an error insensitivity parameter that must be set by the user a priori. In addition, SVRs generally use the 1-norm ||ξ||_1 to penalize the regression error and are thus less outlier-sensitive, which again is especially advantageous in our particular application context with noisy data [56]. The norm parameter p in the objective function is typically chosen to be equal to 2, which implies that the SVR problem is a quadratic optimization problem. However, for the sake of computational efficiency, we opted for p = 1, since, in this case, the problem can be reformulated as a linear programming (LP) problem and solved with efficient large-scale linear optimization codes. We use a tailored implementation of the interior-point linear programming solver PCx [72] to solve our training problems on a standard Linux server. The implementation is similar to that described for a different context [19].
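The p = 1 reformulation can be illustrated with a small LP sketch. This is our illustration using SciPy's generic solver in place of the tailored PCx implementation; the variable splitting w = u − v is the standard trick that makes the 1-norm objective linear.

```python
import numpy as np
from scipy.optimize import linprog

def train_svr_lp(A, y, C=0.03, eps=0.1):
    """1-norm epsilon-insensitive linear SVR cast as a linear program:
    min ||w||_1 + C*sum(xi)  s.t.  |a_i.w + beta - y_i| <= eps + xi_i, xi >= 0.
    Variables are split into nonnegative parts so linprog's default
    x >= 0 bounds apply (a sketch, not the dissertation's PCx code)."""
    A = np.asarray(A, float)
    y = np.asarray(y, float)
    n, d = A.shape
    # variable layout: [u (d), v (d), beta+, beta-, xi (n)], all >= 0; w = u - v
    c = np.concatenate([np.ones(2 * d), np.zeros(2), C * np.ones(n)])
    I = np.eye(n)
    one = np.ones((n, 1))
    upper = np.hstack([A, -A, one, -one, -I])    #  (a.w + b - y) - xi <= eps
    lower = np.hstack([-A, A, -one, one, -I])    # -(a.w + b - y) - xi <= eps
    res = linprog(c, A_ub=np.vstack([upper, lower]),
                  b_ub=np.concatenate([eps + y, eps - y]), method="highs")
    x = res.x
    w = x[:d] - x[d:2 * d]
    beta = x[2 * d] - x[2 * d + 1]
    return w, beta

def predict(w, beta, A):
    """Predicted RLA for rows of A: y_hat = a.w + beta."""
    return np.asarray(A, float) @ w + beta
```

On noiseless linear toy data the LP recovers the smallest slope that keeps all residuals inside the ε tube, illustrating both the sparsity-inducing 1-norm and the ε-insensitive loss.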

While C and ε_i are typically chosen as fixed parameters, there is no real reason not to allow them to be functions of i. In our previous work, we were able to improve the results of RSA prediction by choosing higher error insensitivities for more exposed residues to reflect their higher flexibility [56]. Here, we modeled the dual nature and higher uncertainty of lipid exposure for residues at the lipid-water interface by choosing a larger ε_i compared to residues at the center of the membrane (see Figure 4.1). Such models were also expected to be less sensitive to errors in the recognition of the exact boundaries of TM segments. However, these additional degrees of freedom in the model also present the modeler with the additional problem of finding an "optimal" (in some sense) choice of meta-parameters ε_i and C_i, depending on the particular objective. We performed cross-validation experiments and used primarily several global accuracy measures, such as correlation coefficients between predicted and observed RLAs (see Section 4.3), to find appropriate values for the parameters C_i and ε_i (or ε).

We evaluated two particular SVR implementations to illustrate how modeling errors with physical insights may be attractive for the RLA prediction problem. For the first SVR predictor we chose constant error tolerances ε_i = 0.1, which will be referred to as CE-SVR for constant error SVR. For the second run we chose to emphasize accuracy in the central region of TM helices by increasing ε_i for residues closer to the interface with water. We experimented (albeit not exhaustively) with different position specific error insensitivities (referred to as PSE-SVR). The best results in our tests were obtained for the following choice of error function, which is consistent with the pictorial representation included in Figure 4.1: ε_i = 0.4 − 0.05 × k, where k ≤ 5 is the distance from the membrane boundary, and ε = 0.1 for all remaining (central) residues. The penalty parameter was set to C = 0.03, which proved to be optimal in cross-validation. We noted that the results were relatively insensitive to the particular choice of C, as long as it was above a certain threshold. We stress that these are somewhat arbitrary choices that follow physical intuition, and it remains to be seen if further improvement can be achieved once the parameters are fine-tuned for a particular optimization criterion. Finally, we would like to comment that for the final predictor (evaluated in Section 4.3.2) we used a simple consensus model, consisting of an arithmetic average of ten individual SVRs trained in cross-validation on different subsets of the data. This additional step provides some smoothing and a slight improvement in the accuracy on the independent control set.

    Representation [Num. of features]   NN          SVR
    RSA [30]                            0.34±0.02   0.36±0.04
    MSA [300]                           0.32±0.02   0.45±0.02
    MSA+WW [315]                        0.33±0.02   0.45±0.02
    MSA+TMLIP2H [315]                   0.32±0.02   0.46±0.02
    MSA+RSA [330]                       0.35±0.02   0.47±0.02
    MSA+RSA+WW [345]                    0.33±0.03   0.47±0.02
    MSA+RSA+TMLIP2H [345]               0.36±0.02   0.47±0.02

Table 4.1 The effect of different representations and regression models on the RLA prediction accuracies (the number of features in each model is given for a sliding window of length 15 used here). Cross-validation accuracies are given as CCs. Note that NNs clearly suffer from the small training set size and overfitting.
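The position-specific error schedule amounts to a one-line helper (assuming the constants read as ε_i = 0.4 − 0.05k near the boundary and ε = 0.1 in the core; the function name and the integer-distance convention are ours):

```python
def pse_epsilon(k, eps_core=0.1):
    """Position-specific error insensitivity for PSE-SVR:
    eps_i = 0.4 - 0.05*k for residues within 5 positions of the
    membrane boundary (k = distance from the boundary), eps_core
    (0.1, matching CE-SVR) for all remaining central residues."""
    return 0.4 - 0.05 * k if k <= 5 else eps_core
```

At the boundary (k = 0) the tolerance is largest (0.4) and it decreases linearly until it joins the constant core value, mirroring the red curve in Figure 4.1.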

4.2.4 Review of Neural Network Based Regression Models

In our previous work, we demonstrated that accurate RSA prediction models can be developed using nonlinear NN-based regression [39]. However, the number of available examples from which a predictor may learn is much larger in the case of soluble proteins. Consequently, reliable parameter estimation is possible even for more complex models, such as those based on NNs. On the other hand, it is not clear a priori if the same trends will hold for membrane proteins. In order to investigate the usefulness of NNs for RLA prediction, we developed a simple NN-based predictor. We used a multilayer perceptron (MLP) architecture with a single hidden layer, consisting of either five or ten nodes with a logistic activation function, and a single logistic output node that approximates real-valued RLAs. Hence, if the PSSM-derived representation with a sliding window of size 15 is used in conjunction with ten nodes in the hidden layer, then the overall number of edges and, consequently, weights to be optimized equals about 3,000, which is of the same order of magnitude as the number of residues in our training set. This suggests the possibility of overfitting, as discussed below. All NNs considered here were trained using the SNNS package [45]. The Rprop learning algorithm was used with the default parameters, and the training was stopped after 500 epochs, since no significant improvement was observed in terms of the standard sum of squares error function.
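The weight-count arithmetic can be checked directly: 300 PSSM inputs (a window of 15 × 20 values) feeding ten hidden nodes, plus the hidden-to-output edges, give roughly 3,000 weights. A small sketch (the bias option is our addition; the text's count excludes biases):

```python
def mlp_weight_count(n_inputs, n_hidden, n_outputs=1, biases=False):
    """Number of edge weights in a single-hidden-layer MLP."""
    w = n_inputs * n_hidden + n_hidden * n_outputs
    if biases:
        w += n_hidden + n_outputs
    return w
```

With 6,156 training residues, fitting on the order of 3,000 free parameters leaves roughly two examples per parameter, which is why overfitting is a concern here.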

4.3 Results and Discussion

4.3.1 Cross-validation Study

In this section, we report the results of our cross-validation study, using the training sets defined in Section 4.2.1 and comparing both different representations and alternative machine learning approaches. We used standard error measures for regression-based methods, including the root mean squared error (RMSE), the mean absolute error (MAE), and the correlation coefficient (CC) between the predicted and observed values of RLA, as defined before [56]. Average accuracies in 10-fold cross-validation (with different subsets of the data defined in terms of protein chains and consisting of approximately equal numbers of individual residues) are reported below. We also measure errors on subsets of amino acids depending on their position in the membrane and their location at protein-protein interaction sites in order to assess the prediction quality for different environments in the membrane.
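The three error measures can be computed as follows (a generic numpy sketch; the function name is ours):

```python
import numpy as np

def regression_errors(pred, obs):
    """RMSE, MAE, and Pearson correlation coefficient (CC) between
    predicted and observed RLA values."""
    pred = np.asarray(pred, float)
    obs = np.asarray(obs, float)
    rmse = float(np.sqrt(np.mean((pred - obs) ** 2)))
    mae = float(np.mean(np.abs(pred - obs)))
    cc = float(np.corrcoef(pred, obs)[0, 1])
    return rmse, mae, cc
```

RMSE penalizes large deviations more heavily than MAE, while CC is insensitive to a constant offset or rescaling of the predictions, so the three measures capture complementary aspects of prediction quality.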

In Figure 4.2, two different representations are compared: one based on MSA alone and one extended by inclusion of the SABLE-predicted RSAs and TMLIP2H lipophilicities for each amino acid in a sliding window. As can be seen from the figure, the combined MSA+RSA+TMLIP2H representation yields consistently higher accuracies than evolutionary profiles alone, even though the differences are small. In addition, a sliding window of length 15 yields the best results in cross-validation for both representations. This optimal window size is therefore used for further analysis and for the final predictor.

The two solid curves in the figure correspond to results obtained for the two different representations using the full training set of 6,307 residues, with the actual RLAs derived from intact complexes, i.e., bound structures, whenever applicable. In order to assess the effect of residues located at interaction interfaces between membrane domains of multimeric structures, we also computed cross-validated accuracies using a subset of 4,635 residues obtained by removing known interaction sites (see also section 4.2.2). Significantly higher accuracies are obtained in the latter case for both representations (CCs of 0.53 vs. 0.47 for the extended MSA+RSA+TMLIP2H representation), indicating that residues located at interaction interfaces are indeed difficult to classify. We also note that improvements due to the extended representation become more pronounced in the case of the reduced training set, as illustrated by the distance between the pairs of solid and dashed curves, respectively.

[Figure 4.2: correlation coefficient (y-axis, 0.37-0.53) vs. sliding window size (x-axis, 3-19) for four curves: MSA with interaction sites, MSA without interaction sites, MSA+RSA+TMLIP2H with interaction sites, and MSA+RSA+TMLIP2H without interaction sites.]

Figure 4.2 Average cross-validated RLA prediction accuracies (in terms of correlation coefficients) for the training set of 72 non-redundant chains with (solid lines) and without (dotted lines) interface residues.

Significant differences between accuracies with and without interface residues indicate that the latter are more difficult to predict. Two different representations are also compared: one based on multiple sequence alignment (MSA) alone and an extended representation combining MSA, predicted RSAs, and TMLIP2H lipophilicities.

We further contrasted alternative representations and prediction models in Table 4.1. In particular, the results for SVR-based and NN-based regression models were compared for several representations. As can be seen from the table, the MSA-based representation yields CCs of about 0.45 for SVR and 0.32 for NN. These accuracies were further improved when a combination of different features, including the SABLE-predicted RSAs and hydrophobicity (WW) or lipophilicity (TMLIP2H) scales, was included. Note, however, that these average amino acid properties result in rather small (statistically insignificant) improvements when combined with an MSA-based representation. In particular, we observed very small differences between the new TMLIP2H lipophilicity scale (and its other variants [66]; data not shown) and older hydrophobicity scales, such as WW.

The final representation (MSA+RSA+TMLIP2H) yielded CCs of 0.47 for SVR and 0.36 for NN. Thus, NNs clearly suffer from overfitting and poor generalization, a likely consequence of the small training set. On the other hand, SVR-based models achieved consistently higher accuracy for all representations considered here. It is interesting to note, however, that both SVRs and NNs achieved quite comparable accuracies in the case of the most compact representation, i.e., the one based on SABLE-predicted RSAs and confidence factors. In this case, the implied number of parameters was the smallest, reducing the level of overfitting.

The SVR results discussed so far were obtained using the constant error (CE-SVR) model. As indicated in Section 4.2.3, one advantage of SVRs is that they are very flexible in terms of how prediction errors are penalized: as previously mentioned, both the insensitivity parameter ε and the penalization factor C can, in principle, be chosen as functions of the position in the membrane. We performed a limited search to find the best combination of meta-parameters for the position-specific error (PSE) model, as described in Section 4.3. While we observed a slight improvement for PSE-SVR (by about 0.01-0.02 in correlation coefficient) for the central residues, where the allowed errors are constant (ε = 1.0, which corresponds to an MAE of 10%), this gain was offset by somewhat lower accuracy for the residues at the lipid-water interface. However, as we will see in the next section, PSE-SVR models do perform better than constant error models on our independent control set, indicating that flexible SVRs might help achieve better generalization.
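Scikit-learn's SVR does not expose a per-residue insensitivity ε, but position-specific penalization (the factor C) can be emulated with per-sample weights, which scale C sample by sample. The weighting scheme and data below are hypothetical, not those used in this work.

```python
# Sketch of a position-specific error (PSE) variant: per-residue
# penalization via sample weights on a linear epsilon-SVR.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 30))          # toy residue feature vectors
y = rng.uniform(0.0, 1.0, size=200)     # toy RLA targets
# Position in the membrane: 0 = center ... 1 = lipid-water interface
pos = rng.uniform(0.0, 1.0, size=200)

# Hypothetical scheme: penalize errors more heavily near the interface
weights = 1.0 + pos

model = SVR(kernel="linear", C=1.0, epsilon=0.1)
model.fit(X, y, sample_weight=weights)  # per-sample weight scales C
print(model.predict(X[:3]).shape)
```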

4.3.2 Independent Control Set

In order to further test the generalization of the SVR models, we evaluated them on an independent control set consisting of 74 TM segments in 16 non-redundant chains and 1,482 TM residues. Global accuracy measures obtained on the control set are consistent with the estimates obtained using cross-validated training. In particular, the combined MSA+RSA+TMLIP2H representation and the PSE-SVR model, which proved to perform best on the control set, yielded a CC of 0.49. The slight increase in accuracy compared to cross-validation results (CC of 0.47; see Table 4.1) can be attributed to the use of a simple consensus, as defined in Section 4.2.3, for the final predictor. The consensus predictor yields CCs 0.01-0.02 higher than individual SVR models. In addition, the fraction of residues located at protein-protein interaction sites, which are difficult to predict (see Figure 4.2), was somewhat lower in the control set than in the training set (see section 4.2.2). Since the errors are similar to those observed in cross-validation, we conclude that the method is robust and avoids overfitting.

PDB Chain ID   CC          RMSE [%]    MAE [%]
1xfh_C         0.50        21.5        16.4
1vry_A         0.35        34.4        31.1
1xqf_A         0.62        15.6        12.7
1yq3_D         0.57        17.5        13.8
1s5l_z         0.53        18.5        14.7
1s5l_x         0.40        14.4        10.9
1w5c_F         0.56        23.7        20.6
2axt_h         0.56        19.7        17.6
1yew_K         0.59        16.7        14.4
1yew_J         0.33        25.6        22.1
1yew_A         0.80        19.2        17.1
1q90_M         0.60        20.8        16.3
2bbj_E         0.51        13.9        12.3
1zcd_A         0.64        16.3        13.6
1c17_M         0.40        24.2        18.8
2a65_A         0.51        18.9        16.0
Average        0.53±0.03   19.9±1.3    16.6±1.2

Table 4.2 Performance of the final (consensus) SVR model on the control set of 16 non-redundant membrane proteins.

In order to illustrate the performance of the new predictor on specific proteins, the results for all individual chains in the control set are listed in Table 4.2. The per-protein average accuracies and error bars (standard deviations) for CC, RMSE, and MAE are included in the last row. As can be seen from the table, a per-protein average CC of 0.53 was achieved, which is higher than the per-residue estimate of 0.49. At the same time, the per-protein average RMSE and MAE errors were somewhat higher than their per-residue counterparts, which were estimated to be 19.4% and 15.7%, respectively. We also note that the constant error CE-SVR (consensus) model achieves an average (per-protein) CC of 0.51 on the control set. Thus, although systematic, the differences between PSE-SVR and CE-SVR were not statistically significant.

Our regression models provide real-valued approximations of the actual RLA, without resorting to cruder classification approaches that use arbitrary thresholds to define lipid-exposed and buried residues. However, real-valued predictions can, in analogy to the RSA case [39,56], clearly be projected into discrete classes. When our RLA predictions were discretized with a threshold of 15% RLA, the two-state average (per-protein) classification accuracy on the control set was about 68% (with a baseline of 53%). The classification accuracy for the threshold of 25% RLA was about 71% (with a baseline of 63%). The corresponding standard deviations were 3.5% and 3.3% RLA, respectively.
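The two-state projection described above (discretize at a threshold, compare against the majority-class baseline) can be sketched as follows; the observed and predicted RLA values are illustrative.

```python
# Project real-valued RLA predictions onto a two-state (buried/exposed)
# classification at a given threshold, with the majority-class baseline.
import numpy as np

def two_state_accuracy(rla_obs, rla_pred, threshold):
    obs = np.asarray(rla_obs) < threshold    # True = buried
    pred = np.asarray(rla_pred) < threshold
    accuracy = float(np.mean(obs == pred))
    # Baseline: always predicting the more frequent observed class
    baseline = float(max(np.mean(obs), 1.0 - np.mean(obs)))
    return accuracy, baseline

obs = [0.05, 0.10, 0.30, 0.50, 0.20, 0.08]   # toy observed RLAs
pred = [0.10, 0.12, 0.40, 0.45, 0.10, 0.05]  # toy predicted RLAs
acc, base = two_state_accuracy(obs, pred, threshold=0.15)
print(acc, base)
```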

Figure 4.3 Example of RLA prediction for the sodium/proton antiporter (PDB code 1zcd, chain A), which was included in our independent control set. The amino acid sequence is given in the first row (labeled A); the predicted RLA is shown in the second row (labeled B); the SABLE-predicted secondary structures and RSA (indicating clearly the location of TM segments that are not accessible to water) are shown in the third and fourth rows (labeled C and D, respectively); whereas, in the last two rows (labeled E and F, respectively), secondary structures and RSA determined from the complex structure are shown for comparison. Solvent (lipid or water) inaccessible residues are indicated by black boxes, whereas (partially) exposed residues are indicated by (shaded) white boxes. The segments highlighted in yellow correspond to TM segments according to PDB_TM annotations.

The results on the control set discussed so far were obtained using RLA assignments derived from intact complexes. When the actual RLAs were derived from individual chains (thus disregarding interacting chains in multimeric structures), the accuracy of our RLA predictor, which was trained on data derived from complexes, decreased significantly, yielding an average (per-protein) CC of 0.41, RMSE of 24.5%, and MAE of 21.0%. This significant drop in accuracy reflects the presence of about 16% interacting residues, for which predicted RLAs are more consistent with bound (complex) states (hence the lower accuracy when the data from unbound structures were used to define the actual RLAs). We suggest that such biases in RLA predictions can be used to identify potential protein-protein interaction sites in membrane proteins, in analogy with similar trends for soluble proteins [65].

4.3.3 Identification of Buried Helices

We would like to briefly comment on one possible application of RLA prediction for individual residues in membrane domains, namely, using predicted RLAs to estimate which of the TM segments are fully or largely buried (and thus not exposed to lipid) in multispan alpha-helical membrane proteins. This kind of prediction for whole TM alpha-helices has been considered recently, based on average amino acid properties (lipophilicities) and sequence conservation patterns [67]. Here, we evaluate a simple projection of RLA predictions onto whole TM helices in order to obtain a predictive fingerprint of fully (or largely) buried TM segments. For assessment of the two-state prediction, we classified TM helices by averaging the actual RLA over all residues within a helix (using experimentally derived TM segments). Namely, a helix is classified as buried if the average RLA is less than 15% (consistent with the most informative threshold for two-state classification of individual residues). For prediction, the average predicted RLA is computed for each TM helix, and a threshold of 25% RLA is used to classify helices as buried or exposed (reflecting systematic biases in RLA predictions, which are shifted with respect to observed values).
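The helix-level projection can be sketched as follows; the 15%/25% thresholds follow the text, while the helices themselves are illustrative.

```python
# Classify whole TM helices as buried or exposed: observed label uses a
# 15% mean-RLA threshold, predicted label uses the shifted 25% threshold
# described in the text (compensating for systematic prediction bias).
import numpy as np

def classify_helices(helices, obs_thresh=0.15, pred_thresh=0.25):
    """helices: list of (observed_rla_array, predicted_rla_array), one per TM helix."""
    results = []
    for obs, pred in helices:
        actual_buried = float(np.mean(obs)) < obs_thresh
        predicted_buried = float(np.mean(pred)) < pred_thresh
        results.append((actual_buried, predicted_buried))
    return results

helices = [
    (np.array([0.05, 0.10, 0.12]), np.array([0.15, 0.20, 0.22])),  # buried helix
    (np.array([0.40, 0.35, 0.50]), np.array([0.30, 0.28, 0.35])),  # exposed helix
]
print(classify_helices(helices))  # [(True, True), (False, False)]
```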

We performed a preliminary evaluation of this simple approach using three proteins that were not used for training the new RLA predictor (and that were also included in the evaluation of RANTS, which was estimated to achieve a classification accuracy of 78% [67]), namely 2a65, chain A; 1q90, chain B; and 1ocr, chain A, comprising 28 TM helices (18 of them classified as buried). An overall classification accuracy of about 82% (with a baseline of 64%) is achieved on this small set, with 1 true positive (TP) and 2 false positives (FP) predicted for 1q90 (with just one helix out of four in this protein classified as buried), 6 TP and 1 FP predicted for 2a65 (with six helices out of twelve classified as buried), and 10 TP, 1 FP, and 1 false negative predicted for 1ocr (with eleven helices out of twelve classified as buried). Since RLA predictions for individual residues provide a more detailed representation of the state of exposure of entire helices, further improvements are expected when the number of training examples increases, enabling a more precise mapping of RLAs into classifications of TM helices.

4.3.4 Prediction of Sodium Channel Protein

Figure 4.4 Predictions of TM domains and RLA for the sodium channel protein from human (SCN1A_HUMAN). The sequence is given in the first row, and the predicted transmembrane residues are highlighted. The second row shows the predicted RLA, indicated by boxes: the darker the box, the lower the lipid accessibility of the residue; a black box means the residue is not accessible to lipid at all.

Sodium channel proteins are intensively studied in the literature [73]. However, solving the 3-D structures of sodium channels remains a challenge [73], and structural properties, such as secondary and partial tertiary structures, can only be deduced from the primary sequence [73]. Here, we apply MINNOU to predict the transmembrane domains and RLA for a human sodium channel protein. The results are shown in Figure 4.4. We predicted 24 TM domains in the channel and, in addition, predicted the RLA for all 24 TM domains. The black boxes in the RLA line indicate several residues that interact closely with other residues of the protein. The evaluation of our prediction is subject to the availability of further experiments.

4.4 Conclusion

The bottleneck in the field of computational structure prediction for membrane proteins is the relatively small number of high-resolution structures, limiting one's ability to develop reliable prediction methods that learn from known examples. In this chapter, we proposed novel methods for the prediction of relative lipid accessibility in membrane proteins. We emphasized several aspects of the prediction problem that are, in our view, essential for a realistic and careful machine learning study: a careful assembly of an appropriate representative training set, an appropriate and deliberate choice of representation for the amino acid residues, and careful selection and validation of models. After these considerations, we concluded that RLA prediction is feasible, can be achieved with good accuracy, and has great potential for applications in prediction protocols.

In particular, we developed several regression models using a linear Support Vector Regression approach. One appealing feature of SVRs is that, in analogy to the Support Vector Machine (SVM) approach for classification, the trade-off between the (in this case ε-insensitive) regression error and the generalization ("margin") of the solution can be explicitly controlled [3]. Additionally, the SVR approach allows one to incorporate the expected (or permissible) error levels, e.g., for buried and exposed residues in the case of RSA prediction [56], or for residues at the lipid-water interface vs. those located in the center of the membrane.

Accurate predictions of intermediate attributes of amino acid residues in a protein, such as secondary structure or solvent accessibility, greatly facilitate the search for the correct structural template in fold recognition, as well as the search for the native conformation in de novo simulations [45]. Our current results indicate that RLA can be predicted at the level of a 0.5 correlation coefficient (which is still lower than the estimated 0.6-0.7 for state-of-the-art real-valued RSA prediction methods for soluble proteins [61]). We also observed that the predicted RLAs are qualitatively consistent with the periodic pattern of surface exposure in TM helices (see Figure 4.3 for an example).

We propose that the RLA prediction methods developed here can already be used to improve structural studies of membrane proteins. Furthermore, we hypothesize that RLA prediction accuracy will further increase when more high-resolution structural data for membrane proteins become available. The novel predictor has been incorporated into the on-line MINNOU server and can be accessed via the URL http://minnou.cchmc.org.

Chapter 5 Cluster analysis of array comparative genomic hybridization data

In conventional use of microarray technology, only mRNA expression levels are measured. Array comparative genomic hybridization (CGH) can be used to investigate DNA copy number alterations using the microarrays developed for assessing mRNA expression. Because DNA copy number is associated with gene expression, this additional copy number alteration information creates an opportunity to cluster gene expression profiles with extra knowledge from copy number profiles. We propose a novel approach to cluster genes by borrowing strength across expression and copy number alteration, implemented with an unsupervised machine learning method, the context-specific Bayesian infinite mixture model. According to an independent evaluation based on KEGG pathways, our approach outperforms other simple methods.

5.1 Introduction

In normal cells, every gene has two DNA copies, one from each parent. DNA copy number alteration, which typically manifests as amplification or deletion in tumors, is strongly associated with cancer prognosis [74]. Specifically, DNA copy number amplification is associated with overexpression of genes promoting cell growth (oncogenes), and the deletion of DNA regions is associated with underexpression of genes suppressing cell growth (tumor suppressor genes) [74]. The DNA copy number alterations detected by array comparative genomic hybridization (array CGH) can be directly related to the expression of genes within these DNA regions, enhancing the identification of oncogenes that cause the disease phenotype and improving therapeutic approaches [74].

Array comparative genomic hybridization has recently been used to simultaneously investigate DNA copy number alterations and mRNA expression levels in breast and colorectal cancers. For example, in the breast cancer study, Pollack et al. measured the DNA copy number changes between samples from tumors and cell lines and a sample of normal female leukocyte DNA from a single donor [14]. In parallel, mRNA expression was measured for the samples using the same cDNA array. Their analysis of the simultaneous measurements of both DNA copy number alterations and mRNA expression levels led to the identification of a significant impact of widespread DNA copy number alteration on mRNA expression levels [14]. In fact, 62% of highly amplified genes showed either moderately or highly elevated expression [14]. The study also concluded that elevated expression of an amplified gene cannot alone be considered strong independent evidence of a candidate oncogene. Therefore, sophisticated approaches are needed to seek candidate oncogenes.

The analysis methods in the literature focus on seeking chromosomal amplification and deletion regions. Engler et al. proposed a pseudolikelihood approach to detect genetic alterations [75]. Rouveirol et al. developed two algorithms for computing minimal and minimal constrained regions of gain and loss from discretized CGH profiles [76]. These methods all focus on precisely predicting chromosomal gains and losses, or the boundaries of losses, while ignoring expression levels.

In this chapter, we propose an alternative approach of seeking correlated genes by integrating information on both DNA copy number alterations and mRNA expression levels. A context-specific Bayesian infinite mixture model was applied to cluster genes based on their expression levels within contexts defined by their DNA copy number alterations. According to an independent test based on KEGG pathways, our approach is able to detect the expression patterns and clusters, and outperforms other simple methods.

5.2 Methods and Data

5.2.1 Data

We used the data set of Pollack et al. [14] exclusively, because it was the only available data set containing both DNA copy number alterations and mRNA expression levels. They measured DNA copy number alterations and mRNA expression levels in breast cancer for 6,095 genes in 41 samples: 4 breast cancer tumor cell lines and 37 tumors [14]. In general, microarray data contain some missing values due to experimental limits or human errors. The missing values in this data set were imputed using the "impute" package [77] in R with default settings [78]. In addition, the physiological status of those tumors, such as tumor size, tumor grade, and histology, was also recorded. These phenotypes will be used to investigate phenotype-genotype associations in a follow-up study.
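The R "impute" package performs k-nearest-neighbour imputation; an analogous step can be sketched in Python with scikit-learn's KNNImputer. The toy expression matrix below is illustrative, not the Pollack data.

```python
# k-nearest-neighbour imputation of missing microarray values,
# analogous to R's "impute" package with default settings.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],   # one missing expression value
              [1.1, 2.1, 0.5],
              [0.9, 1.9, 0.4]])
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(np.isnan(X_filled).any())  # False: no missing values remain
```

The missing entry is replaced by the (uniformly weighted) mean of the corresponding values in the two nearest rows.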

5.2.2 Hierarchical Bayesian Model

As we discussed in Chapter 1, the Bayesian inference approach to unsupervised learning for pattern recognition is to calculate the posterior probability distribution of patterns given the data and other parameters. As introduced in [22] and [79], a directed acyclic graph (DAG) is used to define the dependences of all parameters in the Bayesian hierarchical model. In a DAG, the probability distribution of each node is determined by all its parent nodes, or priors. Thus, the joint probability of the model parameters and the data is the product of the probability distributions of all nodes given their parents, as shown in the following equation [79]:

p(X, C, M, S, α, λ, τ, β, φ) = p(X | C, M, S) p(C | α) p(M | λ, τ) p(S | β, φ) p(α) p(λ) p(τ) p(β) p(φ),    (5.1)

where M = (μ_1, μ_2, ..., μ_Q) is the set of all means associated with the Q global patterns for the data X, S = (Σ_1, Σ_2, ..., Σ_Q) is the corresponding set of variances, and C = (c_1, c_2, ..., c_T) is the corresponding set of cluster assignments. The number of global patterns, Q, is not chosen in advance and is determined through learning from the data. The parameters α; λ and τ; and β and φ are hyperparameters for the priors on C, M, and S, respectively.

Figure 5.1 is a simplified version of the diagram shown in [79] for the hierarchical Bayesian model. The priors α, λ, τ, β, and φ determine the distribution of the model parameters C, M, and S. In turn, the profile vectors X depend on their parents, C, M, and S. The details of setting the distributions for α, λ, τ, β, and φ were discussed in [22, 79].

[Figure 5.1: three-level DAG with the hyperparameters (λ, τ), α, and (β, φ) at the top; the model parameters M, C, and S in the middle; and the data X at the bottom.]

Figure 5.1 Directed acyclic graph for the Bayesian hierarchical model.

Clustering proceeds by approximating the joint posterior distribution of the classification given the data, p(c | x_1, x_2, ..., x_T), where x_i is the profile for gene i among all T genes and c = (c_1, c_2, ..., c_T). The classification variables, given the other model parameters and the data, are specified by the following equations [79]:

p(c_i = j | c_-i, x_i, μ_j, σ_j²) ∝ [n_-i,j / (T − 1 + α)] f_N(x_i | μ_j, σ_j² I),

p(c_i ≠ c_j for all j ≠ i | c_-i, x_i, λ, τ) ∝ [α / (T − 1 + α)] ∫ f_N(x_i | μ_j, σ_j² I) p(μ_j, σ_j² | λ, τ) dμ_j dσ_j².

Here f_N(x_i | μ_j, σ_j² I) is the normal density of x_i given the mean μ_j and covariance σ_j² I of cluster j, n_-i,j is the number of expression profiles assigned to cluster j without counting the i-th profile, and c_-i = (c_1, c_2, ..., c_{i−1}, c_{i+1}, ..., c_T).

A Gibbs sampler is then used to sample from the resulting multivariate posterior distribution [22]. The procedure starts from the assumption that all profiles belong to a single cluster. After the k-th step, the (k+1)-th set of cluster assignments is obtained by drawing c^{k+1} according to the posterior probability distributions described in the equations above. As the process iterates, clusters are created and removed: a new cluster is created when a profile is assigned a label c_i ≠ c_j for all j ≠ i, and a cluster is removed when no profile remains in it. The whole procedure is implemented in the GIMM package developed by Medvedovic et al., who made the package available for this study [22].
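A heavily simplified sketch of these Gibbs updates: one-dimensional profiles, a fixed within-cluster variance σ², and an N(0, τ²) prior on cluster means, so that both the existing-cluster and new-cluster weights are available in closed form. GIMM's actual model is multivariate and fully hierarchical; every quantity here is illustrative.

```python
# One collapsed-Gibbs sweep for a 1-D infinite (Dirichlet process)
# mixture of normals: existing clusters weighted by n_{-i,j}/(T-1+alpha),
# a new cluster by alpha/(T-1+alpha), as in the equations above.
import math, random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gibbs_sweep(x, c, rng, alpha=1.0, sigma2=0.5, tau2=4.0):
    T = len(x)
    for i in range(T):
        c[i] = None  # remove profile i from its current cluster
        labels = sorted({cj for cj in c if cj is not None})
        weights = []
        for k in labels:
            members = [x[j] for j in range(T) if c[j] == k]
            n = len(members)
            # posterior predictive of x_i under existing cluster k
            prec = 1.0 / tau2 + n / sigma2
            mu_post = (sum(members) / sigma2) / prec
            weights.append(n / (T - 1 + alpha) * normal_pdf(x[i], mu_post, sigma2 + 1.0 / prec))
        # weight for opening a new cluster (mean integrated over its prior)
        weights.append(alpha / (T - 1 + alpha) * normal_pdf(x[i], 0.0, sigma2 + tau2))
        candidates = labels + [max(labels, default=-1) + 1]
        r = rng.random() * sum(weights)
        acc, chosen = 0.0, candidates[-1]
        for k, w in zip(candidates, weights):
            acc += w
            if r <= acc:
                chosen = k
                break
        c[i] = chosen

x = [-3.1, -2.9, -3.0, 3.0, 3.2, 2.8]   # two well-separated toy "profiles"
c = [0] * len(x)                         # start with all profiles in one cluster
rng = random.Random(0)
for _ in range(20):
    gibbs_sweep(x, c, rng)
print(sorted(set(c)))                    # cluster labels after 20 sweeps
```

Clusters with no remaining members simply drop out of `labels` on the next sweep, mirroring the create/remove behavior described in the text.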

The resulting posterior probability distribution is then used to cluster genes: the probability that a pair of genes belongs to the same cluster defines the distance between them. If P_ij is the probability of genes i and j being within the same cluster, the distance between them is defined as D_ij = 1 − P_ij. For example, in the extreme case P_ij = 1, meaning that genes i and j are always within the same cluster under any conditions, the distance between them is 0. This distance measure is then used to cluster genes so that the distance between genes from the same cluster is smaller than between genes from different clusters.
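Turning the pairwise posterior probabilities into distances D_ij = 1 − P_ij and clustering on them can be sketched with SciPy; the small probability matrix below is illustrative.

```python
# Convert pairwise co-clustering probabilities into distances and feed
# them to average-linkage hierarchical clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

P = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]])   # posterior P(gene i, gene j co-clustered)
D = 1.0 - P                       # D_ij = 1 - P_ij; D_ii = 0
Z = linkage(squareform(D, checks=False), method="average")
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)  # genes 0 and 1 cluster together; gene 2 stays separate
```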

5.2.3 The Independent Test

We used information from the KEGG pathway database [80], which is independent of gene expression, to evaluate our approach. KEGG is a well-maintained database collecting massive amounts of information on biological pathways. Under the assumption that genes sharing the same pathway belong to the same cluster, we assessed the performance of our approach by constructing Receiver Operating Characteristic (ROC) curves [81]. An ROC curve plots the true positive rate vs. the false positive rate for a binary classifier as the discrimination threshold varies [81]. The true positive rate is the proportion of correct cluster assignments, while the false positive rate is the proportion of incorrect cluster assignments. In the ideal case, if every cluster assignment is correct (i.e., clustered and non-clustered genes are perfectly distinguished), the ROC curve consists of only two straight lines: one along the y-axis, where the false positive rate is always 0, and one parallel to the x-axis, where the true positive rate is always 1. In reality, the closer the ROC curve is to this ideal, the better the performance of the classifier.
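The ROC construction described above can be sketched for the pairwise evaluation, scoring each gene pair by 1 − D_ij and labeling it positive when the two genes share a KEGG pathway; the scores and labels below are illustrative.

```python
# Trace an ROC curve by sweeping the threshold over sorted pair scores:
# each positive pair raises TPR, each negative pair raises FPR.
import numpy as np

def roc_curve_points(scores, labels):
    order = np.argsort(-np.asarray(scores))       # descending by score
    labels = np.asarray(labels)[order]
    P = int(labels.sum())
    N = len(labels) - P
    tpr, fpr = [0.0], [0.0]
    tp = fp = 0
    for lab in labels:
        if lab:
            tp += 1
        else:
            fp += 1
        tpr.append(tp / P)
        fpr.append(fp / N)
    return fpr, tpr

scores = [0.95, 0.90, 0.60, 0.40, 0.20]   # 1 - D_ij for five gene pairs
labels = [1, 1, 0, 1, 0]                  # 1 = same KEGG pathway
fpr, tpr = roc_curve_points(scores, labels)
print(list(zip(fpr, tpr)))
```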

5.3 Results and Discussion

We applied a number of clustering algorithms to find highly correlated genes based on mRNA expression levels alone. As shown in Figure 5.2, in terms of the ROCs, none of the tested methods performed better than random clustering, because of the noisiness of the data.


Figure 5.2 The ROC curves obtained using independent clustering of mRNA expression and cDNA copy number alteration patterns. The dotted line indicates random clustering.

In order to exploit the DNA copy number alteration information to assist the clustering of expression levels, we applied a newly developed context-specific Bayesian infinite mixture model to cluster genes using both DNA copy number alteration and mRNA expression level information. At first, genes were clustered based on expression levels only, without any context information. We then selected clusters in which any pair of genes had a probability of at least 90% of being correlated, yielding 103 clusters of 694 genes for the 41 tissue samples. The structure of these clusters was then duplicated exactly to profile average DNA copy numbers: in the new profile, each sample is represented by a 103-dimensional vector whose components are the averages of the DNA copy number alterations over all member genes of each cluster. We then clustered samples based on the new average DNA alteration profiles with both hierarchical and k-means clustering methods, with the number of sample clusters varying from 2 to 18. The resulting clusters of samples were then used as the defined contexts for gene expression levels, and we applied GIMM, a context-specific Bayesian infinite mixture model, to cluster genes with the defined contexts. We also iterated the whole procedure in order to optimally use all the information in the clustering. The results are discussed in the next section.
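The context-building step (averaging the copy number profile within each gene cluster, then clustering samples on the averaged profiles) can be sketched as follows; the dimensions (3 gene clusters, 4 samples) are illustrative stand-ins for the 103 clusters and 41 samples of the study, and the k-means step stands in for either of the two methods used.

```python
# Build per-sample context profiles by averaging copy-number log-ratios
# within each gene cluster, then cluster the samples on those profiles.
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(2)
cnv = rng.normal(size=(10, 4))                     # copy-number ratios: genes x samples
gene_cluster = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])

# average profile of each gene cluster -> each sample becomes a 3-dim vector
profiles = np.vstack([cnv[gene_cluster == k].mean(axis=0) for k in range(3)]).T
_, sample_context = kmeans2(profiles, 2, seed=0)   # sample contexts for GIMM
print(profiles.shape, sample_context.shape)
```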

As discussed in [22] and [79], the context-specific Bayesian mixture model improves gene clustering. In Figure 5.3, we compare, in terms of ROCs, the performance of the contexts obtained by iteratively integrating DNA copy number alteration and mRNA expression levels. At first, we applied GIMM to the mRNA expression levels alone, without any context; the result is close to random clustering. The clustering was improved by integrating the DNA copy number alterations and defining contexts of samples. With further iterations, however, we found a drop in accuracy based on the ROC, as shown in Figure 5.3: the ROCs from the second iteration are closer to the random line than those from the first iteration. We concluded that the optimum was achieved in the first iteration, and the resulting optimal clustering was used for further analysis.


Figure 5.3 The ROC curves for the iterative clustering of genes based on mRNA expression levels and cDNA copy number alteration. Genes were clustered based on mRNA expression levels without any context (curve labeled No_context). The curve labeled Loop1_hc shows the clustering results from the first iteration, in which samples were classified into 5 classes using average DNA copy number alteration profiles. The curves labeled Loop2_hc_2, Loop2_hc_5, and Loop2_hc_9 show the results from the second iteration, in which samples were classified into 2, 5, and 9 classes, respectively, using the DNA copy number alteration profiles. The straight line (Random Line) indicates random clustering.

As shown in Figure 5.3, this clustering is not optimal, and a further consideration of the contexts of all the samples improves it. At first, we applied the Bayesian infinite mixture model to mRNA expression levels without any context information from the DNA copy numbers. The resulting posterior probability was the probability of every pair of genes being co-expressed, given the measurements of all other genes across all samples. We then selected clusters of genes in which any pair of genes had a probability of at least 90% of being co-expressed, leading to 103 clusters of 694 genes for the 41 tissue samples. The structure of these clusters was then duplicated exactly to profile average DNA copy numbers: in the new profile, each sample was represented by a 103-dimensional vector whose components are the averages of the DNA copy number alterations over all member genes of each cluster. We then clustered samples based on the new average DNA alteration profiles with hierarchical clustering. The contexts used in the first iteration are shown in Figure 5.4, with the averaged mRNA profiles in the left panel and the averaged DNA profiles in the right panel.


Figure 5.4 Classification of samples using averages of cDNA copy number alterations (right panel) and the corresponding mRNA expression levels (left panel).

5.4 Future Plans

In order to understand the underlying reasons for the improvements gained by separating samples into contexts based on the CGH analysis, we would like to further investigate the genes responsible for the creation of the contexts. With these further investigations, more practical suggestions for clinical therapy can be expected. Further examination of the results of our analysis is likely to yield additional biological insights into the mechanisms of carcinogenesis.

5.5 Summary

We proposed an alternative approach of seeking correlated genes and their association with breast cancer tumor phenotypes by integrating information on both DNA copy number alterations and mRNA expression levels measured from array CGH. A context-specific Bayesian infinite mixture model was applied to cluster genes based on their expression levels within contexts defined by their DNA copy number alterations. Further investigations regarding the exact contributions of the contexts and the clustering of samples are needed.

Bibliography

[1]. Gibson, G. and Muse, S., A Primer of Genome Science, 2nd edition, Sinauer Associates, Inc., 2004.

[2]. Jiang, T., Xu, Y., and Zhang, M., Current Topics in Computational Biology, MIT Press, 2002.

[3]. Hastie, T., Tibshirani, R., and Friedman, J., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics, 2001.

[4]. Alberts, B., Bray, D., Lewis, J., Raff, M., Roberts, K., and Watson, J. M., Molecular Biology of the Cell, 3rd edition, Garland Publishing, Inc., 1994.

[5]. White, S. and von Heijne, G., The machinery of membrane protein assembly, Curr. Opin. Struct. Biol., 2004; 14, 397-404.

[6]. Ashcroft, F., From molecule to malady, Nature, 2006; 440, 440-447.

[7]. Chen, C. and Rost, B., State-of-the-art in membrane protein prediction, Applied Bioinformatics, 2002; 1, 21-35.

[8]. Schulz, G., β-barrel membrane proteins, Curr. Opin. Struct. Biol., 10, 443-447.

[9]. Tusnady, G., Dosztanyi, Z., and Simon, I., Transmembrane proteins in the Protein Data Bank: identification and classification, Bioinformatics, 2004; 20, 2964-2972; Tusnady, G., Dosztanyi, Z., and Simon, I., PDB_TM: selection and membrane localization of transmembrane proteins in the Protein Data Bank, Nucleic Acids Res., 2005; 33, Database issue, D275-278.

[10]. Berman, H., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T., Weissig, H., Shindyalov, I., and Bourne, P., The Protein Data Bank, Nucleic Acids Res., 2000; 28, 235-242.

[11]. Cao, B., Porollo, A., Adamczak, R., Jarrell, M., and Meller, J., Prediction-based structural profiles enhance recognition of transmembrane proteins, Bioinformatics, 2006; 22, 303-309.

[12]. Wallin, E. and von Heijne, G., Genome-wide analysis of integral membrane proteins from eubacterial, archaean, and eukaryotic organisms, Protein Sci., 1998; 7, 1029-1038.

[13]. Domany, E., Cluster analysis of gene expression data, J. Stat. Phys., 2003; 110, 1117-1139.

[14]. Pollack, J., Sorlie, T., Perou, C., Rees, C., Jeffery, S., Lonning, P. E., Tibshirani, R., Botstein, D., Borresen-Dale, A.-L., and Brown, P., Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors, Proc. Natl. Acad. Sci. USA, 2002; 99, 12963-12968.

[15]. Douglas, E., Fiegler, H., Rowan, A., Halford, S., Bicknell, D., Bodmer, W., Tomlinson, I., and

Carter, N., Array comparative genomic hybridization analysis of colorectal cancer cell lines and

primary carcinomas, Cancer Research , 2004; 64, 4817-4825.

[16]. Schölkopf, B., Burges, C., and Smola, A. (eds.), Advances in Kernel Methods - Support Vector

Learning , MIT-Press, 1998.

[17]. Lingireddy, S. and Brion, G. M., Artificial neural networks in water supply engineering, The

American society of civil engineers , 2005.

[18]. Rauber, T., Barata, M., and Steiger-Garcao, A., A Toolbox for Analysis and Visualization of

Sensor Data in Supervision. Proceedings of the International Conference on Fault Diagnosis ,

Toulouse, France. 1993.

[19]. Wagner M., Meller J. and Elber R. Large-scale linear programming techniques for the design of

potentials. Math. Program. 2004; 101, 301-318.

[20]. Rowe, G., Theoretical models in biology: the origin of life, the immune system, and the brain ,

Clarendon Press, Oxford, 1994.

[21]. Gelman, A., Carlin, J., Stern, H., and Rubin, D., Bayesian data analysis 2nd , Chapman &

Hall/CRC, 2003.

[22]. Medvedovic, M. and Sivaganesan, S., Bayesian infinite mixture model based clustering of gene

expression profiles. Bioinformatics , 2002; 18, 1194-1206.

[23]. Hirokawa, T., Seah, B.C. and Mitaku. S., SOSUI: Classification and Secondary Structure

Prediction System for Membrane Proteins. Bioinformatics , 1998; 14, 378-379.

[24]. von Heijne, G., Membrane Protein Structure Prediction, Hydrophobicity Analysis and the

Positive-inside Rule, J. Mol. Biol. 1992; 225, 487-494.

[25]. Hofman, K. and Stoffel, W., TMbase--a database of membrane spanning protein segments, Bio.

Chem.., Hoppe-Seyler , 1993; 374, 166.

105 [26]. Krogh, A., Larsson, B., von Heijne, G., and Sonnhammer, E., Predicting transmembrane protein

topology with a hidden Markov model: Application to complete genomes. J. Mol. Biol. , 2001;

305, 567-580.

[27]. Kall L., Krogh, A., and Sonnhammer, E., A combined transmembrane topology and signal peptide

prediction method, J. Mol. Biol. , 2004; 338, 1027-36.

[28]. Tusnády, G. and Simon I., Principles governing amino acid composition of integral membrane

ptoeins: application to topology prediction. J. Mol. Biol. , 1998; 283, 489-506.

[29]. Tusnády, G. and Simon, I., The HMMTOP transmembrane topology prediction server.

Bioinformatics , 2001; 17, 849-850.

[30]. Viklund H, and Elofsson A., Best alpha-helical transmembrane predictions are

achieved using HMMs and evolutionary information. Protein Sci. , 2004; 13, 1908-17.

[31]. Rost, B., PHD: predicting one dimensional protein structure by profile based neural networks.

Meth Enzymol , 1996; 266,525-539.

[32]. Kernytsky, A. and Rost, B., Static benchmarking of membrane helix predictions. Nucleic Acids

Res. , 2003; 31, 3642-3644.

[33]. Chen, C., Kernytsky, A., and Rost, B., Transmembrane helix predictions revisited, Protein Sci. ,

2002; 11, 2774-91.

[34]. Moller, S., Croning, M., and Apweiler, R., Evaluation of methods for the prediction of membrane

spanning regions. Bioinformatics , 2001; 17, 646–653.

[35]. Wimely, W., Toward genomic identification of ß-barrel membrane proteins: Composition and

architecture of known structures. Protein Sci. , 2002; 11, 301-312.

[36]. Casadio, R., Fariselli, P., Finocchiaro, G., and Martelli P., Fishing new proteins in the twilight

zone of genomes: The test case of outer membrane proteins in Escherichia coli K12, Escherichia

coli O157:H7, and other Gram-negative bacteria . Protein Sci. , 2003; 12, 1158 - 1168.

[37]. Bigelow H.R., Petrey, D.S., Liu, J., Przybylski, D., and Rost, B., Predicting transmembrane beta-

barrels in , Nucleic Acids Res. , 2004; 32, DOI: 10.1093/nar/gkh580

106 [38]. Altschul, S., Madden, T., Schäffer, A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D., Gapped

BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids

Res. , 1997; 25, 3389-3402.

[39]. Adamczak, R., Porollo, A., and Meller, J., Accurate prediction of solvent accessibility using

neural networks-based regression. Proteins , 2004; 56, 753-67.

[40]. Eyrich, V., Marti-Renom, M., Madhusudhan, M., Fiser, A., Pazos, F., Valencia, A., Sali, A., and

Rost, B., EVA: continuous automatic evaluation of protein structure prediction servers.

Bioinformatics , 2001; 17, 1242-1243.

[41]. Adamczak, R., Porollo, A., and Meller, J. Combining Prediction of Secondary Structures and

Solvent Accessibility in Proteins, Proteins , 2005; 59, 467-75.

[42]. Jayasinghe, S., Hristova, K., and White, S. H., MPtopo: A database of membrane protein

topology. Protein Sci. , 2001; 10, 455-8.

[43]. Hiller, K., Grote, A., Scheer, M., Münch, R., and Jahn, D., PrediSi: prediction of signal peptides

and their cleavage positions. Nucleic Acids Res. , 2004; 32, W375-W379.

[44]. Riedmiller M, Braun H., A direct adaptive method for faster backpropagation learning: the

RPROP algorithm. Proc IEEE Int Conf Neural Networks , 1993; pp. 123–134.

[45]. Zell A., Mamier G., Vogt, M., et al., The SNNS users manual version 4.1 Available online at

http://www-ra.informatik.uni-tuebingen.de/snns .

[46]. Jones, D., Protein secondary structure prediction based on position-specific scoring matrices. J.

Mol. Biol. , 1999; 292, 195-202.

[47]. Rost, B., Casadio, R., Fariselli, P., and Sander, C., Prediction of helical transmembrane segments

at 95% accuracy. Protein Sci. , 1995; 4, 521-533.

[48]. Matthews, B., Comparison of predicted and observed secondary structure of T4 ohage lysozyme.

Biochim Biophys Acta , 1975; 405, 442-451.

[49]. Kyte, J. and Doolittle, R., A simple method for displaying the hydropathic character of a protein.

J. Mol. Biol., 1982; 157, 105-132.

[50]. White S., and Wimley W., Membrane protein folding and stability: physical principles. Annu Rev

Biophys Biomol Struct. , 1998; 28, 319-65.

107 [51]. Diederichs, K., Freigang, J., Umhau, S., Zeth, K., and Breed, J., Prediction by a neural network of

outer membrane beta-strand protein topology, Protein Sci. , 1998; 7:2413-2420.

[52]. Jacoboni, I., Martelli, P., Fariselli, P., Pinto, V., and Casadio, R., Prediction of the transmembrane

regions of beta-barrel membrane proteins with a neural network-based predictor, Protein Sci. ,

2001; 10:779-787.

[53]. Martelli, P., Farselli, P., Krogh, A., and Casadio, R., A sequence-profile-based HMM for

predicting and discriminating beta barrel membrane proteins, Bioinformatics , 2002; 18:S46-S53

[54]. Gromiha, M. and Suwa, M., A simple statistical method for discriminating outer membrane

proteins with better accuracy, Bioinformatics , 2004; 21: 961-968.

[55]. Natt, N., Kaur, H. and Raghava, G., Prediction of Transmembrane regions of beta-barrel proteins

using ANN and SVM based method. Proteins: Structure, Function, and Bioinformatics , 2004;

56:11-8.

[56]. Wagner M., Adamczak R., Porollo A., and Meller J. Linear Regression Models for Solvent

Accessibility Prediction in Proteins. J. Comp. Bio. , 2005; 12, 355-369.

[57]. Wimley, W., The versatile beta-barrel membrane protein, Current Opinion in Structural Biology ,

2003; 13: 404-411.

[58]. Tamm, L., Arora, A. and Kleinschmidt, J., Structure and assembly of beta-barrel membrane

proteins, J. Bio. Chem. , 2001; 276:32399-32402.

[59]. Tamm, L., Arora, A., and Kleinschmidt, J., Structure and assembly of beta-barrel membrane

proteins, J. Bio. Chem. , 2001; 276:32399-32402.

[60]. Humphrey, W., Dalke, A., and Schulten, K., VMD - Visual Molecular Dynamics, J. Molec.

Graphics, 1996; 14, 33-38

[61]. Rost B., and Sander C., Conservation and prediction of solvent accessibility in protein families.

Proteins. 1994; 20, 216-226.

[62]. Popot J. and Engelman D., Helical membrane protein folding, stability, and evolution. Annu. Rev.

Biochem . 2000; 69, 881-922.

[63]. Bowie J., Solving the membrane protein folding problem. Nature 2005; 438, 581-589.

108 [64]. Eyre T., Partridge L., and Thornton J., Computational analysis of α-helical membrane protein

structure: implications for the prediction of 3D structural models. Protein Engineering, Design &

Selection 2004; 17, 613-624.

[65]. Porollo A., and Meller J. Prediction-based Fingerprints of Protein Interactions, Proteins , to appear

2006.

[66]. Adamian L., Nanda V., DeGrado W., and Liang J. Empirical lipid propensities of amino acid

residues in multispan alpha helical membrane proteins, Proteins , 2005; 59, 496-509.

[67]. Adamian L. and Liang J., Prediction of buried helices in multispan alpha helical membrane

proteins. Proteins: Structure, Function, and Bioinformatics , 2006; 63, 1-5.

[68]. Smart O., Goodfellow J., and Wallace B., The Pore Dimensions of Gramicidin A. Biophysical J. ,

1993; 65, 2455-2460.

[69]. Benson D., Karsch-Mizrachi I., Lipman D., Ostell J., and Wheeler D. GenBank. Nucleic Acids

Res., 2003; 31, 23-7.

[70]. Kabsch W., and Sander C. Dictionary of protein secondary structure: pattern recognition of

hydrogen-bonded and geometrical features. Biopolymers. 1983; 22, 2577-2637.

[71]. Chothia C. The nature of accessible and buried surfaces in proteins. J. Mol. Biol. , 1976; 105, 1-14.

[72]. Czyzyk J., Mehrotra S., Wagner M., and Wright S. PCx: An interior-point code for linear

programming. Optim. Method. Softw. , 1999; 12, 397-430.

[73]. Hille, B., Ion channels of excitable membranes 3 rd , Sinauer Associates, Inc., 2001.

[74]. Le Beau, M. Molecular biology of cancer: cytogenetics. In: V. DeVita, S. Hellman and S.

Rosenberg (eds.), Cancer principles and practice of oncology , pp. 103-119. Philadelphia:

Lippincott-Raven, 1997.

[75]. Engler, D. A., Mohapatra, G., Louis, D. N., and Betensky, R. A., A pseudolikelihood approach for

simultaneous analysis of array comparative genomic hybridizations. Biostatistics , 2006; 7, 3, 399-

421.

[76]. Rouveirol, C., Stransky, N., Hupe, Ph., La Rosa, Ph., Viara, E., Barillot, E., and Radvanyi, F.

Computation of recurrent minimal genomic alterations from array-CGH data. Bioinformatics,

2006; 22, 849-856.

109 [77]. Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani,

David Botstein and Russ B. Altman, Missing value estimation methods for DNA microarrays.

Bioinformatics , 2001; 17, 520-525.

[78]. R Development Core Team. R: A language and environment for statistical computing. R

Foundation for Statistical Computing , 2005, Vienna, Austria. ISBN 3-900051-07-0, URL

http://www.R-project.org .

[79]. Liu, X., Sivaganesan, S., Yeung, K.Y., Guo, J., Bumgarner, R.E. and Medvedovic, M., Context-

specific infinite mixtures for clustering gene expression profiles across diverse microarray dataset.

Bioinformatics , to appear 2006.

[80]. Kanehisa, M., et. al., The KEGG resource for deciphering the genome. Nucleic Acids Res. , 2004;

32 Database issue, D277-D280.

[81]. Fawcett, T., ROC convex hull, tutorial, program and paper,

http://home.comcast.net/~tom.fawcett/public_html/ROCCH/index.html

[82]. Cao, B., Medvedovic, M., and Meller, J., Prediction of Transmembrane Domains and Pore-facing

Residues in Beta-barrel Membrane Proteins, in Applications of Statistical and Machine Learning

in Bioinformatics: a volume in the series Advances in Computational and Systems Biology , eds.

Meller, J. and Nowak, W., Peter Lang Gmbh, to appear 2006.

110