Technische Universität München Lehrstuhl für Bioinformatik

Prediction of functional effects for sequence variants

Maximilian Natale Friedrich Hecht

Vollständiger Abdruck der von der Fakultät für Informatik der Technischen Universität München zur Erlangung des akademischen Grades eines Doktors der Naturwissenschaften genehmigten Dissertation.

Vorsitzender: Univ.-Prof. Dr. Daniel Cremers Prüfer der Dissertation: 1. Univ.-Prof. Dr. Burkhard Rost 2. Univ.-Prof. Dr. Dimitri Frischmann

Die Dissertation wurde am 24.06.2015 bei der Technischen Universität München eingereicht und durch die Fakultät für Informatik am 13.10.2015 angenommen.

Contents

Abstract 5

1Introduction 6 1.1 Forms of genetic variation ...... 6 1.2 Investigating human genetic variation ...... 10 1.3 Resources for variants and their molecular effects ...... 13 1.4 Prediction of variant effects ...... 15 1.5 Thesismotivationandgoals ...... 19

2Methods 20 2.1 Data...... 20 2.2 Features for variant effect prediction ...... 22 2.3 Predictionmethod ...... 24 2.3.1 Clustering and cross-validation ...... 25 2.3.2 Feature selection and parameter optimization . . . . . 26 2.3.3 Alignment-free prediction ...... 27 2.4 Performance measures ...... 28 2.5 Result visualization ...... 30

3Resultsanddiscussion 31 3.1 Better prediction of functional effects ...... 31 3.2 Difficult cases and method combinations ...... 35 3.3 Neutral variant dilemma ...... 37 3.4 Interpretation and reliability of variant prediction ...... 39 3.5 Variant prediction in orphan proteins ...... 41 3.6 The protein mutability landscape ...... 44

4Conclusion 48

Acknowledgements 49

3 References 50

Appendix 64

4 Abstract

Technological advancements have enabled us to find and record genetic dif- ferences between humans. Yet, the vast amount of data gathered through next generation sequencing has long since overwhelmed our ability to study every discovered difference experimentally. Thus, computational approaches have been developed to cope with the deluge of data and guide experimen- tal efforts towards prioritizing the most promising candidates. This focuses on the computational advancements made towards predicting the effects of genetic variation. It discusses the current state of research and presents a newly developed method for the prediction of functional effects of sequence variants. Our new method SNAP2 predicts variants with over 83% accuracy and is also able to predict effects for variants of orphan proteins. Both constitute significant and important improvements over other methods. Furthermore, it not only predicts single variants but offers a comprehensive view of all possible substitution effects in a protein. This opens up a novel view on the landscape of protein mutability. Possible applications are pre- sented that show how this novel view on functional effect predictions can translate into novel hypotheses and thus aid the identification of targets for drug development.

5 1Introduction

Understanding human genetic variation is one of the major scientific chal- lenges of the 21st century. Nearly 15 years after we first sequenced the human genome, we still understand little of the mechanisms that link differences on the genetic level to the differences observed on the phenotypic level. Genetic variations influence phenotypes: from the most obvious, such as the visi- ble differences between individuals with different ethnic background, to the least obvious such as the differential response to drug treatment. Therefore, the genotype-phenotype link is crucial for understanding development and progression of diseases: it may be the key towards developing personalized treatment.

1.1 Forms of genetic variation

Differing inheritable traits can either favor or hinder survival and reproduc- tion. This process is called natural selection. The interplay between genetic variation and selection pressure is what constantly adapts organisms to best fit their specific niche in the environment. In genetic variation hap- pens randomly through changes on the molecular level which are caused by various factors, such as mutations and random mating between organisms (Krishnamurthy, 2003). There are different forms of genetic variation, which can be categorized according to their size and type: (i) numerical variation, (ii) large-scale structural variation and (iii) small-scale sequence variation.

Numerical Variation

Numerical variation can be defined as changes in the number of chromosomes, referred to as polyploidy (numerical change of the whole set of chromosomes) or aneuploidy (number of individual chromosomes is altered). A prominent example of polyploidy is wheat, which through years of hybridization and human modification has different species that range from diploid (two sets

6 of chromosomes) to hexaploid (six sets of chromosomes; today’s common bread wheat) (Martinez-Perez et al., 2003). In humans polyploidy plays not only a role in the normal development of certain cell types but also in the development of cancer (Davoli and de Lange, 2011). For humans, aneuploidy of most chromosomes results in miscarriage but there are exceptions that result in live births (Driscoll and Gross, 2009). These children often suffer from genetic disorders, the most common example being trisomy 21 (known as Down syndrome), where a third (possibly partial) copy of chromosome 21 is present (Patterson, 2009). Aneuploidy can also be observed in cancer cells (Sen, 2000).

Large-scale structural variation

The class of large-scale structural variation comprises variations of long stretches/regions of DNA, typically ranging in size from kilo- to megabases. These structural variations include rearrangements such as translocations (exchange of region between chromosomes) or inversions (region is reversed in the chromosome) but also so-called Copy-Number Variations (CNV; region is duplicated or deleted). While balanced rearrangements (reciprocal translo- cations or inversions) do not result in gain or loss of genetic material they may still affect gene expression through gene fusion or by dissociating genes from their long-range regulatory elements (Harewood et al., 2010). CNVs, on the other hand, represent a significant alteration of the genetic material re- sulting from duplication or deletion of long stretches of chromosomes. These large-scale variations were found to be widespread and common among hu- mans (Iafrate et al., 2004; Sebat et al., 2004) accounting for roughly 13% of the human genome (Stankiewicz and Lupski, 2010). Yet, with respect to CNVs only 0.4% of the genome significantly differed in a study comparing eight individual genomes to the reference assembly (Kidd et al., 2008). This suggests that most of the CNVs are common and shared by members of the same population. While CNVs have been associated with

7 diseases, for instance autism and schizophrenia (Cook and Scherer, 2008), it appears that gene gains are more common than gene losses and that gene amplification is favored through positive selection (Zhang et al., 2009). For example, the salivary amylase gene (AMY1) copy number was found to be varying significantly in different populations. This was considered an adap- tion of agricultural societies towards high-starch diet, as higher copy numbers correlated with higher salivary starch-digesting enzyme levels (Perry et al., 2007).

Small-scale sequence variation

The last class, the small-scale sequence variation, is the most prevalent form of genetic variation among humans (ENCODE Project Consortium, 2012). It comprises short or single nucleotide insertions/deletions (typically abbre- viated to ‘indels’) and single nucleotide substitutions (so-called point muta- tions). Indels (that are not a multiple of 3 bases) in genetic regions that encode proteins or functional RNA change the reading frame. This is prob- lematic because it affects how the codon triplets are read during transcription into mRNA (or other RNA). These so-called frameshift mutations have been shown to be involved in a number of diseases: For instance, the Crohn’s disease has been associated with an insertion in the NOD2 gene. The inser- tion of cytosine at position 3020 of the NOD2 gene was shown to produce a truncated protein. The resulting protein no longer responded to bacterial lipopolysaccharides, which might lead to increased susceptibility to Crohn’s disease (Ogura et al., 2001).

Single nucleotide variation

The most frequent human genetic variations are single nucleotide substitu- tions (called ‘single nucleotide polymorphisms’ – SNPs, or ‘single nucleotide variants’ – SNVs). The international SNP Map Working Group estimated that any two haploid genomes differed on average by around 1 nucleotide

8 every 1300 basepairs (Sachidanandam et al., 2001), a number which likely varies between ethnic groups. Estimates suggest that the human genome contains over 11 million SNPs of which roughly 7 million are common (oc- curring at a minor allele frequency [MAF] greater than 5%) in the human population while the remaining 4 million are uncommon (1% < MAF 5%)  (Kruglyak and Nickerson, 2001). These SNPs are estimated to constitute 90% of the genetic variation in humans while the remaining 10% consist of a vast array of variants that are each rare within the population (Interna- tional HapMap Consortium, 2003). Researchers suspect to find a tremendous amount of these rare variants. It is likely that almost every ‘life-compatible’ variant can be observed in at least one of the roughly 7 billion people on earth (Frazer et al., 2009). The vast majority of SNPs is located in non- protein-coding regions of the genome, since only around 1.5% of the genome encode proteins (Lander et al., 2001). Although these SNPs do not directly alter gene products they may for instance affect gene regulation by chang- ing transcription factor binding or gene splicing. A more direct phenotypic impact can be expected for coding SNPs (cSNPs). Due to the degeneracy of the genetic code, SNPs in coding regions can either be synonymous (sSNPs) or non-synonymous (nsSNPs). The former change the codon triplets in such awaythattheencodedaminoacidisthesameasbeforeandthusdonot alter the gene product. The latter, however, change the codon triplet in two ways. Either it becomes a stop codon (so-called nonsense mutations) or the codon encodes a different amino acid (so-called missense mutations). While nonsense mutations cause truncated and thus mostly nonfunctional proteins (depending on where the new stop codon is located), missense mutations can have a variety of effects on the resulting protein: They may affect folding and structure of the protein and thus affect its function. They may also change residues that are important for binding and thus affect interactions with substrates or other proteins (e.g. complex formation). These protein-altering point mutations are of particular importance for

9 medical research because they can directly affect phenotypes such as causing diseases, increasing disease susceptibility or altering drug response (Thus- berg and Vihinen, 2009). Although these appear to be a small fraction of variants, there is an estimated average of 6 coding SNPs per gene in the hu- man population (Collins et al., 1998). Recent studies suggest that any two unrelated individuals on average differ by one cSNP per gene, half of which are non-synonymous (1000 Genomes Project Consortium et al., 2010). Thus, approximately every other protein differs between any two individuals from the same population and many of these differences are likely instrumental in defining human diversity.

1.2 Investigating human genetic variation

While genetic variation works on genotypes, natural selection works on phe- notypes. Selection is likely to favor (or to not penalize) genetic changes that do not lead to disadvantageous phenotypes. However, not all genetic changes lead to phenotypic changes. In fact, the majority of genetic variation is hy- pothesized to be neutral (Kimura, 1968) with respect to any phenotypes, i.e. they are assumed not to contribute to any phenotype. Yet, the ex- act ratio of neutral, ‘near-neutral’ (Ohta, 2002) and non-neutral variation is unknown. Nevertheless, near-neutral variants may contribute (although with little impact each) to complex traits and even non-synonymous neutral variants might be important for human individuality (Bromberg et al., 2013). As mentioned before, the vast majority of human genetic variation is due to common variants. In other words, the majority of variants in any given individual are variants that are common within the whole population. Moreover, when two individual genomes are compared, the vast majority of differences can be found at positions that are commonly known to be variable within the population (Frazer et al., 2009). Therefore a great effort has been made towards cataloging and studying common variation. One observation towards this end has been particularly important: SNPs in proximity of

10 one another on the chromosome are likely to be co-inherited, thus leading to a strong correlation between SNPs in the same genomic interval. This complex correlation structure is called linkage disequilibrium (LD) and varies between populations (Slatkin, 2008). The International HapMap Project determined that 80% of common SNPs (MAF > 5%) could be categorized into roughly 550,000 LD groups for individuals of European or Asian ancestry and roughly 1,100,000 LD groups for African ancestry (International HapMap Consortium, 2003). This means that information for over 80% of common SNPs can be gained only by genotyping individual DNA with ’tag’ SNPs from each LD group (Barrett and Cardon, 2006; Eberle et al., 2007; Pe’er et al., 2006; Frazer et al., 2009). This remarkable finding significantly reduced the cost and time required to genotype thousands of individuals and thus enabled large-scale genotype-phenotype association studies. In genome-wide association studies (GWAS) large amounts of participants are genotyped and assigned to either a case or a control group depending on the phenotype under investigation. However, the power of GWAS is limited due to several reasons (Frazer et al., 2009): (i) Study cases need to be rep- resentative and sufficient in number, often requiring over 10,000 samples for detection (Kiezun et al., 2012). Moreover, participants are typically drawn from clinical sources, which often do not contain silent, mild or lethal cases because they do not come to clinical attention. (ii) GWAS are limited to common variants, as rare variants are generally not tagged. (iii) 20% of the common variants are not or only partially tagged. Nevertheless, these studies have significantly furthered our understanding of variants and diseases. Most common variants have been tested for associations with common traits and diseases thereby linking more than 1,100 loci to complex diseases (Lander, 2011). Yet, due to their limitations, GWAS cannot find significant asso- ciations for rare variants and moreover miss known associations for many common SNPs (Kiezun et al., 2012). Another approach for detection and investigation of variants and their ef-

11 fects are so-called Trio Studies. These typically focus on high-coverage whole genome sequencing of mother-father-child trios with the aim of discovering very rare de novo variants (i.e mutations that cannot be found in either parent). Using this study design, the 1000 Genomes Consortium estimated 8 the de novo germ-line mutation rate to be 10 substitutions per base per generation (1000 Genomes Project Consortium et al., 2010) - an approxi- mate average of 30 new mutations (and thus a theoretical average of roughly 30 0.015 = 0.45 protein coding variants) in each newborn that are not ob- ⇤ servable in either parent. A different study, however, found 1.2-1.7 coding variants to be novel with respect to both parents (Rauch et al., 2012). Acomplementaryapproachisprovidedthroughexomesequencingstud- ies. These studies are aimed at comprehensively assessing both common and rare coding variants by targeted sequencing of the protein coding regions of the genome rather than sequencing the whole genome or relying on LD pat- terns as in GWAS. Exome-sequencing profits from the accuracy and coverage of Next Generation Sequencing combined with efficient DNA capturing while focussing on a critical region of the genome (Teer and Mullikin, 2010; Kiezun et al., 2012). This allows for significantly larger sample sizes than currently feasible for whole genome sequencing but is limited to the detection of coding SNPs. This strategy has also been employed in the 1000 Genomes Project (1000 Genomes Project Consortium et al., 2010, 2012) in order to record detailed information on the variants present in 1,092 healthy individuals from 14 populations. Among their astonishing results were many important findings: Each individual carries 80-100 variants causing pre-mature stop codons, 40- 50 splice-site disrupting variants, and 220-250 frameshift mutations affecting roughly 250-300 genes. In other words, everyone carries up to 400 variants that are likely to completely disrupt protein function (i.e. putative loss-of- function variants). Moreover, the 1000 Genomes Project revealed that every individual also carries an average of 50-100 variants that had previously been

12 associated with inherited disorders. This clearly shows that the genotype- phenotype relationship is very complex in most cases and that we need to learn more about the underlying molecular mechanisms that link genetic vari- ation to complex traits and diseases. Large-scale exome sequencing studies give valuable insights on the statistical involvement of coding variants in a certain phenotype (i.e. typically a certain disease). However, they do not answer how these particular variants affect protein functions, their pathways or the interactions that finally lead to the observable phenotype.

1.3 Resources for variants and their molecular effects

Improved sequencing and variant calling methods have led to a flood of ge- netic variation being discovered over the last decade. The need of storing these data and making them publicly available gave rise to a variety of databases that collect, store, and present this variation depending on the database’s focus. The largest database is dbSNP (Sherry et al., 2001). It collects not only SNPs but also small-scale indels, retroposable element in- sertions and short tandem repeats along with the corresponding sequence context, its frequency, and additional submission data. It does, however, not provide information on the molecular effect of variants but links to additional sources if phenotypic information is available. The current build (142, Oct. 2014) comprises over 112 million human variants (so-called RefSNP Clusters) of which almost 54 million are located in coding regions. The Online Mendelian Inheritance in Men (OMIM) database focusses on diseases with a known genetic component (Hamosh et al., 2005). It provides literature-derived information on mendelian phenotypes and also lists associ- ated SNPs for over 3,400 entries (Feb. 2015). A similar approach is provided by the Human Gene Mutation Database (HGMD; Stenson et al., 2003). This database collects data on germ-line mutations in genes associated with hu- man diseases and currently covers over 64,000 publicly available missense and nonsense variants (Feb. 2015). Another disease-association based repos-

13 itory is provided through GWAS Central (Beck et al., 2014). This database contains summary level information from genome-wide association studies, providing almost 68 million association p-values for almost 3 million unique dbSNP markers (release 11, Sep. 2013). In addition to these, the Univer- sal Protein Resource (UniProt; UniProt Consortium, 2015) provides an index (called HUMSAVAR) for all variant entries with disease association, covering roughly 70,000 coding variants (release 2015_2 of Feb. 2015). All these re- sources (OMIM, HGMD, GWAS, HUMSAVAR) offer information on variants with probable disease association based on literature reports but generally do not provide information on molecular variant effects. It is however likely that many of the disease-associated coding variants have functional effects on the molecular level. This will be discussed later in section 2.1. Studying variant effects experimentally on the molecular level is costly and requires significant amounts of time. Effects are mostly investigated for non-synonymous coding variants because these variants change the protein product and are therefore most likely to have phenotypic effects. The most common way of testing molecular effects of coding variants is site-directed mutagenesis in combination with a certain assay: the effect of a certain substitution is measured with respect to the specified assay. In most cases only a few possible variants are tested experimentally. Exceptions are the comprehensive mutagenesis study of the E. coli LacI repressor (Markiewicz et al., 1994), in which over 4,000 variants of the lac repressor were tested (Fig. 1) or the complete mutagenesis of the HIV-I protease (Loeb et al., 1989). Studies of structural (impact on the native 3D protein structure) and functional effects were collected into a literature-derived database termed the Protein Mutant Database (PMD; Kawabata et al., 1999), whose latest build (March 2007) contains over 200,000 variant entries for roughly 45,000 proteins. Yet, mutagenesis studies are often not aimed at investigating the effect of mutations but rather aimed at studying the native protein function

14 Figure 1: Mutagenesis of the E. coli lacI repressor. At each position between residue 2 and 329, 12-13 amino acid substitutions are displayed as a bar. The height of a bar depicts the relative percentage of substitutions that alter the repressor function as determined (a) experimentally (Markiewicz et al., 1994) or (b) by computational prediction using SNAP2. This figure was adapted from Hecht et al. (2013). and identifying the residues that are involved. For instance, alanine-scans are often used to probe the protein for functional hot spots by mutating every na- tive amino acid into alanine and testing its effect on protein binding (Bogan and Thorn, 1998). Binding hot spots can be revealed by measuring the dif- ference in binding free energy upon mutation to alanine (Clackson and Wells, 1995). Thus, such experiments also generate information on the molecular effects of to-alanine substitutions in these proteins which is collected in the Alanine Scanning Energetics database (ASEdb; Thorn and Bogan, 2001). The presented databases constitute only a fraction of the available data repositories and have been presented because of their particular importance to this thesis. Yet, there are many other databases that provide invaluable sources of information for applications.

1.4 Prediction of variant effects

While next-generation sequencing technology becomes increasingly cheaper and faster, experimental verification of variants remains a bottleneck. Hence, fast and accurate computational prediction of variant effects becomes more

15 and more important in order to guide and prioritize experimental verification of the ever-increasing deluge of sequencing data. In the following, the term variant is used as a synonym for ’coding SNP causing a single amino acid substitution’, while non-coding SNPs will be mostly neglected for simplicity. Many methods for the prediction of variant effects have been developed in the past 15 years. Some methods (e.g. DISCERN: Sankararaman et al., 2010, INTERPID: Sankararaman and Sjölander, 2008) are aimed at finding active sites like the ones that significantly alter binding energy when mutated to alanine as annotated in ASEdb. By looking for evolutionary conserved residues these methods also capture a significant fraction of residues that, if mutated, cause disorders or other distinct phenotypes. While not being specifically optimized towards variant prediction, these methods can be used to predict that most variants of these (often highly conserved) active site residues will have a strong impact on the protein function. Methods specifically trained on variants cover the prediction of a vari- ety of effect aspects, including the explicit prediction of changes in protein- protein binding affinity upon mutation (e.g. BeAtMuSiC: Dehouck et al., 2013). Some predict the pathogenicity of coding SNPs (e.g. CADD: Kircher et al., 2014, SNPs&GO: Calabrese et al., 2009, Mutation Taster: Schwarz et al., 2010, Mutation Assessor: Reva et al., 2011, PolyPhen-2: Adzhubei et al., 2010) and non-coding SNPs (CADD, Mutation Taster), in the sense that they output a score reflecting the likelihood of a specific SNP being dele- terious. Others focus on predicting effects of variants on protein structure (Schaefer and Rost, 2012) and stability (e.g. i-Mutant-3: Capriotti et al., 2008, PoPMuSiC: Dehouck et al., 2009). Others yet put their focus on predicting whether or not a variant changes the native protein function (e.g. SNAP: Bromberg and Rost, 2007, SIFT: Kumar et al., 2009, PolyPhen-2: Adzhubei et al., 2010). Obviously, there can be significant overlap between these methods’ predictions since the effect

16 aspects are related. For instance, a variant that causes a structural change in the protein may reduce its function and thus cause a certain disease. However, a pairwise comparison of methods suggests that predictions vary considerably between methods, even within the same category, which will be discussed in section 3.

Prediction features

The focus of each method is essentially determined by the features used for prediction and the data used for training. The single most important feature for variant effect prediction is typically evolutionary conservation - the extent to which the native amino acid is conserved between homologous sequences. If the native amino acid is identical in most (especially also distantly) related species, it is likely that this is due to functional importance as the amino acid was retained through purifying selection (i.e. variants causing a disadvan- tageous phenotype are selected against). This can sometimes be misleading if the alignment does not contain enough distantly related sequences, as the apparent conservation may also result from a lack of time (i.e. closely related species did not have enough time to diverge). This feature therefore requires alargeandsufficiently diverse alignment of homologous sequences, which is sometimes not available because the required sequences are unknown. An- other common feature are biophysical properties of native and variant amino acid. Information such as size, hydrophobicity and charge of the variant is compared with the native amino acid to estimate substitution compatibility. For instance, substituting a small, hydrophobic amino acid in the protein core by a bulky, hydrophilic one is likely to cause a structural change of the protein. Thus, the extent to which biophysical properties differ can be used as an indicator of variant effect. Moreover, many other features of protein and amino-acid features, both experimental and predicted, have been used for variant effect predictions. These include structural information such as experimental structures or predicted secondary structures, residue annota-

17 tion such as known or predicted functional sites, patterns of protein domains and substitution probabilities estimated from known sequences. Features for variant effect prediction will be further discussed in section 2.2.

Classes of predictors

Predictors can usually be classified into two categories: constraint-based methods and trained classifiers. Constraint-based methods use a custom definition for their predicted effect aspect, which is based on one or more biological properties. For instance, a simple definition for “disease-variant” could be ’over n% residue conservation’ and ’m% different biophysical amino- acid properties’. The exact definition parameters (here, ’n’ and ’m’) are typ- ically optimized using available experimental data. The prediction is then performed based on how similar the specific case is to the selected effect definition. Trained classifiers also derive rules or patterns for their prediction task. However, this happens as part of the training of the corresponding classifier and is only influenced by the presented data samples. These devices use supervised learning (i.e. data samples are labeled; for instance as ’deleterious’ or ’neutral’) to generalize rules or patterns from the feature values present in the data. They require both positive and negative samples and operate under the assumption that the training data is representative of the prediction task. Popular examples of machine learning devices are support vector machines (SVM), multilayer perceptrons (MLP; also called artificial neural networks, ANN), decision trees (DT), random forests (RF), and rule-based learners. Major differences between these lie in the types of input features (e.g. numeric, ordinal, nominal) that can be handled and the way the final prediction is calculated. In terms of result interpretability we distinguish between black-box and white-box predictors. Black-box predic- tors (e.g. SVM, MLP, RF) offer no further information on how the result was generated, in the sense that the classification reasons are hidden in the

18 model and thus not interpretable by humans. White-Box predictors (DT, Rule-based learners) on the other hand can be interpreted by looking into the model and following the decisions or rules. The data for training can be retrieved from databases such as the ones mentioned in section 1.3. Pathogenicity predictors are typically trained or optimized on variants with probable disease association as their task is to distinguish between natural variation and disease-causing mutations. This implies (i) disregarding variants that have molecular effects as long as these are not associated with diseases and (ii) a focus on human variants, as these are the ones for which disease annotations are predominantly available and for which predictions are most wanted. A ’neutral’ prediction from a pathogenic- ity predictor does not indicate that the variant has no effect but rather that any possible effects are likely not involved in pathogenic phenotypes. Functional effect predictors are typically trained/optimized on variants with experimentally known molecular effects, which are predominantly obtained from non-human sequence experiments, as these are easiest to conduct. A positive prediction from a functional effect predictor does neither indicate nor exclude a possible human disease-association. It is thus difficult to compare methods with a different focus.

1.5 Thesis motivation and goals

Currently, there are significantly more disease variant predictors than effect variant predictors, which is attributable to the fact that medical research pru- dently focusses on revealing and targeting variants with disease association. However, there is also a need for accurate prediction of molecular variant effects. For instance also plant geneticists make use of computational meth- ods to aid their research. Another question that can hardly be answered by disease prediction is how the human genome evolved functionally. In other words: what are the functional genetic differences that distinguish humans and apes? Understanding how variation affects phenotypes like disease onset

19 and progression requires us to understand how variants affect the underlying molecular processes. In this work we focussed on improving the prediction of functional effects both in terms of accuracy and in terms of throughput over existing methods. We developed SNAP2, a classifier that outperformed current state-of-the-art methods and allows to predict every possible amino acid substitution in a protein in a time-efficient manner. Furthermore we visualized these predic- tions in heatmaps. This provides a simple but comprehensive representation that allows for intuitive interpretation of results and easy generation of hy- potheses.

2Methods

2.1 Data

Similar to an earlier publication (Bromberg and Rost, 2007), we used a mix- ture of experimentally determined neutral and effect variants from the Pro- tein Mutant Database (PMD) and a set of putative neutral variants derived using the Enzyme Commission number (EC; Webb, 1992). We also added several putative effect variants from the disease-association databases OMIM and HUMSAVAR. In the following, the individual data components and their extraction will be briefly described: The PMD provides functional variant annotations retrieved from litera- ture reports and categorizes effects in seven classes with respect to the native protein function. We collected all variants and assigned them to either the ’neutral’ or the ’effect’ class in our data set depending on the reported effect. If the variant protein function was reported to be slightly (’-’), moderately (’- -’), or substantially (’- - -’) decreased, or if it was reported to be slightly (’+’), moderately (’++’), or substantially (’+++’) increased, we labeled the variant as ’effect’ variant. Only if the variant was annotated to cause ’no

20 change’ (’=’) in protein function, we assigned it to the ’neutral’ class. In case of conflicting annotations (e.g. variant is reported as ’no change’ with respect to one assay but as ’slightly increased’ to another) the variant was assigned to the ’effect’ class. This procedure yielded 38,179 effect variants and 13,638 neutral variants (i.e. atotalof51,817variants)in4,061distinct proteins. From the Online Mendelian Inheritance In Men (OMIM) and the HUM- SAVAR we extracted variants associated with heritable diseases. Although, in many cases, there is no experimental evidence of their molecular functional effects, we assume these variants to have an effect on protein function. This is clearly an over-estimate, as these variants may just be in linkage disequilib- rium with the causative variant or simply affect regulatory elements or splice sites. Thus, some of these variants may not have any effects on protein func- tion. However, it is likely that this set is highly enriched in functional effect variants as they have been shown to exhibit a much stronger functional effect signal on average than the experimentally verified PMD variants (Schaefer et al., 2012). We thus collected 22,858 human variants in 3,537 proteins and assigned these to the ’effect’ class. From the above extraction steps we collected only 13,638 neutral variants as compared to 61,037 effect variants. This imbalance of available experimen- tal ’neutral’ and ’effect’ variants suggests there is a significant selection bias towards experimental verification of effect variants. Researchers prudently focus on variants for which the strongest effects are expected, often with the goal to investigate certain diseases. Moreover, a detected effect variant is much easier to publish than a variant for which no functional effects could be measured. However, machine learning typically performs best when trained on a data set that is representative of the prediction task. While the true ratio of neutral and effect mutations in nature remains unknown, we chose to increase the fraction of neutral samples to obtain a close to balanced ratio. We thus extracted putative neutral variants based on the following as-

21 sumption: If two independent experiments reveal that two similar, related proteins have the same enzymatic function we assume that most differences between these two are neutral with respect to this enzymatic function. While this approach explicitly neglects combinatorial effects such as compensatory mutations or effects on other possible functions, it is likely that this approach yields a set that is highly enriched in ’neutral’ variants. We thus extracted all enzymes with experimental EC numbers from Swiss-Prot (UniProt Con- sortium, 2015) and did a pairwise alignment (using PSI-BLAST; Altschul et al., 1997) of all enzymes with the same EC number. Two more restrictions were imposed before considering any differences as neutral substitutions: (i) The sequences had to be > 40% identical and (ii) have HSSP-values > 0 (Sander and Schneider, 1991; Rost, 1999). Through this approach we ex- tracted 26,840 variants in 2,146 proteins and assigned them to the ’neutral’ class. These three data sets were combined into our comprehensive training set. Thus the final data set consisted of 101,515 variants (40,478 neutral and 61,037 effect) in 9,744 distinct proteins (Hecht et al., 2015). Additionally, we used the 4,041 variants from the E. coli LacI repressor (Markiewicz et al., 1994) and the 336 variants from the HIV-1 protease (Loeb et al., 1989) as independent testing sets.

2.2 Features for variant effect prediction

The ability to predict variant effects through machine learning depends on the available information. For the task at hand, these features describe certain properties of proteins and amino acids. As mentioned before (Section 1.4), evolutionary information is a very important and thus commonly used fea- ture. It provides information on the extent to which certain amino acids are observed in other species and other related sequences and thus can be used to estimate how likely an amino acid substitution is tolerated. Also the afore- mentioned biophysical amino acid properties are a commonly used feature as

22 these may give insights into the structural impact of a substitution. The fea- ture calculation step involves extracting such information and transforming them into an appropriate machine-readable format. For instance, informa- tion on amino acid size may be given as a numeric value. If this is to be used, it has to be normalized in order to be used as an input value, because oth- erwise features with inherently large values would have proportionally larger impact on the prediction than features with inherently small values. Alterna- tively, that same size value can be transformed into binary or nominal inputs by defining thresholds (e.g. small < n, n 5 medium < m, large = m). The exact representation of a feature depends on the machine learning device and the available information. We extracted information from a variety of both experimental and predicted sources and calculated normalized numeric features where possible and multiple binary features otherwise. As the ex- act calculation is described in Hecht et al., 2015, the features used in the development of this method will only be briefly presented here. We extracted evolutionary information from the PSI-BLAST position spe- cific scoring matrix (PSSM: Altschul et al., 1997), from the position-specific independent counts profile (PSIC: Sunyaev et al., 1999) and by estimating co-evolving residues (Fodor and Aldrich, 2004; Kowarsch et al., 2010). Amino acid properties were retrieved from the AAindex database (Kawashima and Kanehisa, 2000). We considered biophysical properties (e.g. mass, volume, hydrophobicity, charge), structural propensities (e.g. c-beta branching, helix- breaker) and statistical properties (e.g. relative mutability, average flexibil- ity, distance-dependent contact potentials). Structural variant information was not included through the use of known 3D structures as these are not available for most proteins. Instead we used predicted structural informa- tion such as secondary structure and solvent accessibility (Rost and Sander, 1993, 1994), as well as predicted residue flexibility (Schlessinger et al., 2006). Information on functional importance of residues was included in multiple ways. We used predicted protein-protein and protein-DNA interaction sites

23 (Ofran and Rost, 2007; Ofran et al., 2007), experimental annotation from Swiss-Prot if available, and whether a residue and its surroundings match any Pfam (Punta et al., 2012) or PROSITE (Sigrist et al., 2010) pattern. Moreover, we included a set of global sequence properties such as the amino acid composition, the secondary structure and solvent accessibility composi- tion, the protein length, and low-complexity regions. Where applicable we also used deltas of these features. These delta- features were calculated by comparing feature values for wildtype and mu- tant amino acid and estimating the strength and direction of the change attributable to the substitution. We also considered the immediate sequence environment for each substitution by including not only the variant residue position but also surrounding positions. Thus, we extracted each feature in a window between 3 and 21 residues (i.e. the central variant position and each 1-10 residues up- and downstream).

2.3 Prediction method

Machine learning offers a variety of methods for learning patterns from la- beled data and using these to predict labels for novel samples. In order to identify the most suitable tool for our problem we initially trained and tested several tools from the WEKA suite (Frank et al., 2004) on our data. We used support vector machines, neural networks, decision trees, and random forests with default parameters and compared their performance. On our data neu- ral networks and SVMs performed slightly better than decision trees and random forests. The difference between SVMs and neural networks was not significant. We proceeded with standard feed-forward neural networks be- cause of better runtime and memory efficiency when applied through the Fast Artificial Neural Network (FANN: Nissen, 2003) library. The networks were designed to have two output nodes, one for ’neutral’ and one for ’effect’. The following sections describe the method development and previously es- tablished training procedures (Hecht, 2011).

24 2.3.1 Clustering and cross-validation

In order to avoid over-fitting of the model and over-optimistic performance estimates we created 10 sub-sets from our data in such a way that there was no significant sequence similarity between any two proteins in different sub-sets. We first used PSI-BLAST for each sequence against our entire set 3 and collected all significant hits (E value < 10 )foreachsequence.We then built an undirected graph where each protein was assigned to a vertex and connected through edges with all vertices for which we had recorded a significant hit. We then clustered our data through single-linkage clustering and thus collected all proteins into the same set that were reachable through a path in a graph. This approach yielded 1,241 clusters with the number of members ranging from 1 to 1,941. From this clustering we created ten sub- sets in such a way that we assigned all proteins and all their variants from the same cluster to one of ten sub-sets while balancing (as far as possible without separating cluster members) the overall number of variants per set as well as the ratio of ’neutral’ and ’effect’ variants within each set. These ten sub-sets were used in the cross-validation of our method by using eight sets for training, one set for optimization and one set for final testing. We cycled through these set combinations such that each set was used exactly once for optimizing and exactly once for testing in different com- binations. This ensured that no variant from the same or any similar protein was ever used simultaneously for training and testing. During development of the method, we always used eight sets for training and one optimization set for testing. The respective tenth final testing set was only used after development to reliably estimate the final network performance as reported in section 3. This approach effectively yielded ten different networks, each optimized and tested on a different sub-set of our data. In order to avoid the risk of over-fitting we did not select the best of the ten networks for the final method, but instead decided to use all ten networks. The final prediction

25 is calculated as a jury-decision by using the average ’neutral’ output and the average ’effect’ output over all ten networks. From these two scores we calculated a single output score as the difference between the average ’effect’ output and the average ’neutral’ output which resulted in a score ranging from -1 (all networks predict 100% neutral probability) to +1 (all networks predict 100% effect probability). For simplicity, we re-normalized this score to integers between -100 and +100 in the final method output.

2.3.2 Feature selection and parameter optimization

Determining the optimal feature combination for a prediction task is highly non-trivial for several reasons: First of all, we do not know for certain to which extent protein and variant properties are relevant for the task, or even if we know all the relevant factors. Moreover, the corresponding importance of each feature varies between different samples (e.g. variant prediction in membrane proteins has other priorities than in globular proteins). As the ’optimal’ combination depends on the prediction task, it is subject to how representative the training data is for this task. We developed a general functional effect predictor by generalizing patterns from different kinds of proteins, because we lack the data to develop a specialized method for every problem. This meant sifting through a large space of potentially relevant features in order to identify those that perform best on this general task. Moreover, finding the optimal combination is also difficult because it would involve testing every possible combination for all features, which is possible but extremely CPU time-consuming. Instead, we used heuristic methods that are likely to find at least a good combination, while cutting CPU requirements to a fraction of the exhaustive search. As mentioned above (Section 2.3.1) we separately trained and optimized ten networks. For each of these, we individually selected the best feature combination by standard greedy forward selection. The following steps were applied for all ten networks separately: First, we trained each network on

26 all features individually and estimated their individual performance on the cross-training set by calculating the area under the receiver operating char- acteristic curve (Section 2.4). We then selected the best-performing feature, paired it with all others and selected the best-performing duo. This procedure was repeated, always adding the best-performing feature of each round to the previous best-performing combination until no additional feature yielded an improvement. In the end we tested the best feature combination on the testing set (the tenth set that had not been used during feature selection) to assure that the network was not overfitted. A combined feature set was obtained by collecting all features that improved performance on any of the ten networks into one feature set. We trained all networks using this feature set and then performed a backward elimination selection by which we re- moved all features that could safely be removed without lowering the overall performance. The remaining features constituted our final feature set. The final feature set was used to find the optimal network architecture. For each of the ten networks we applied heuristically selected parameter combinations: learning rate between 0.005-0.1, learning momentum 0.01- 0.3, and hidden units 10-100. We again tested each parameter combination separately for each network using the cross-training set and selected the combination that performed best. In the end, the performance on the cross- training sets was compared to the performance of the corresponding testing sets to assure that there was no significant overfitting.

2.3.3 Alignment-free prediction

Of all features, alignments of related protein sequences carry the most infor- mation on whether or not a variable is acceptable. As many (approximately 10%-20%) sequences in today’s databases continue to not map to any known sequence, there is no evolutionary information available for these so-called orphan sequences. In fact, even for human there are currently (Feb 2015) over 600 sequences for which less than 5 homologs can be found. We thus

27 specifically trained a method to perform well when no evolutionary informa- tion is available. Towards this end, we created a substitution score matrix from the predictions made by our regular method and used this matrix as an independent feature along with all other features that did not require alignments. Networks were trained and optimized as above. The resulting prediction method constitutes a fall-back mode for the main method. Al- though by definition weaker than the main method, this enabled predictions for cases that would otherwise not be predictable.

2.4 Performance measures

We employed several different performance measures to evaluate different as- pects of method performance. The following standard definitions were used: True positives (TPs) were correctly predicted variants with experimentally annotated effect. True negatives (TNs) were correctly classified variants for which experiments confirmed no effect (i.e. neutrals). And correspondingly, false positives (FPs) and false negatives (TNs) were incorrectly classified samples with or without experimentally annotated effect, respectively. We considered three levels of performance: the inner-class performance, the com- bined class performance and the overall model performance. The inner-class performance was estimated using the standard formulas for precision and recall (i.e. accuracy and coverage, respectively) for both the ’effect’ and the ’neutral’ class separately:

TP P recisioneffect = Accuracyeffect = , (1) TP + FP

TP Recalleffect = Coverageeffect = TPR = , (2) TP + FN

TN P recisionneutral = Accuracyneutral = , (3) TN + FN

28 TN Recallneutral = Coverageneutral = . (4) TN + FP

The recall value of the effect class also represents the true positive rate (TPR; Eqn. 2). Combined class performances were measured through the F1 measure in order to comprehensively estimate the performance for neutral and effect variants individually:

P recisioneffect Recalleffect F 1effect = ⇤ , (5) P recisioneffect + Recalleffect

P recisionneutral Recallneutral F 1neutral = ⇤ . (6) P recisionneutral + Recallneutral

The overall performance was used to compare different models and meth- ods. We therefor calculated the Matthews correlation coefficient (MCC) and the overall two-state accuracy (Q2):

TP TN FP FN MCC = ⇤ ⇤ , (7) (TP + FP)(TP + FN)(TN + FP)(TN + FN) p TP + TN Q2= . (8) TP + FP + TN + FN

The false positive rate (FPR) was calculated as

FP FPR = . (9) FP + TN

Moreover the area under the curve (AUC) of the receiver operating char- acteristic (ROC) was used for model comparisons. This curve was obtained through plotting the true positive rate (Eqn. 2) against the false positive rate (Eqn. 9) at all possible decision thresholds (i.e. every score from -100 to +100).

29 We estimated standard error and standard deviation by bootstrapping. The 1000 bootstrap sets were created by randomly selecting 50% of all vari- ants without replacement. Although this is typically done with replacement, we experienced that bootstrapping without replacement yields more accu- rate estimates due to overrepresentation of certain protein families. With these n=1000 sets we calculated the standard deviation (StdDev; Eqn. 10) as the average performance difference of each set (xi)fromtheoverallper- formance average (). The standard error (StdErr; Eqn. 11) was calculated by dividing the standard deviation by the square root of sets:

2 StdDev = (xi ) , (10) qX StdDev StdErr = . (11) pn 1 2.5 Result visualization

One important achievement of this method is its ability to efficiently pre- dict large amounts of mutations in a protein. This allowed us to expand the standard resolution from predicting single mutations to predicting the entire mutability landscape of a protein (Hecht et al., 2013). For an average-sized protein (i.e. roughly 300 residues) we can predict all non-native substitu- tions at every position in less than one hour of runtime, yielding a total of roughly 5,700 predictions per protein. However, this much information cannot be reasonably represented in text form and thus requires appropriate visualization. We therefore implemented a JAVA script component with the ability to represent this data as a heat map (Yachdav et al., 2014). This allowed showing all 19 substitutions (y-axis) for every residue of the protein (x-axis) along with their predicted effect by color-coding the numerical pre- diction in a color cascade (Fig. 7b,c) ranging from green (Score -100, most reliable neutral prediction) over white (Score 0, inconclusive prediction) to

30 red (Score +100, most reliable effect prediction). This component was in- cluded in a web-server (available at http://rostlab.org/services/snap2web) in order to enable users to view and download both the numerical and the visual output of our method.

3Resultsanddiscussion

During the course of this thesis we developed SNAP2, a neural network based classifier for the prediction of functional effects of single amino acid substitutions in proteins. It is able to accurately predict variants even in orphan proteins and presents prediction results for all possible substitutions in comprehensive heat maps. The following sections present comparisons to other state-of-the-art methods and describe the improvements made in our novel method. We show the heat map representation along with a possible use-case and discuss difficulties that users may face when interpreting results from variant effect predictors.

3.1 Better prediction of functional effects

For the final method we estimated an overall performance of over 83% two- state accuracy. These estimates were obtained from our cross-validation test- ing sets, that had no part in method development. We first compared our new method SNAP2 against it’s predecessor SNAP, as both methods shared asignificantamountoftrainingdata.Towardsthisend,weassessedtherel- ative benefit of the newly implemented features by training our method on exactly the same data that SNAP was trained on. Through this approach we observed that the additional features accounted for approximately 1.2% per- formance increase over the original SNAP (79.8 0.4% up to 81 0.5% Q2; ± ± Eqn. 8). This performance increase was significant, although the original SNAP had an advantage in this comparison as it had been trained on the

31 same data, while the SNAP2 estimates were taken from the cross-validation (and had thus not been used in training). Next we assessed the benefit of adding disease-associated variant data to our training data. We trained our method on the extended training set but tested against the original SNAP data set as before. Here we observed an even higher performance increase: SNAP2 (82.4 0.4% Q2) outperformed its predecessor SNAP (79.8 0.4% Q2) ± ± even more, which suggested that the additional data had indeed improved performance. The improvement is also visible in figure 2. Next, we compared our methods against two other state-of-the-art functional effect predictors: SIFT (Kumar et al., 2009) and PolyPhen-2 (Adzhubei et al., 2010). While there are many methods for variant effect prediction (Section 1.4), it would have been inappropriate to test disease-association predictors on our data. We therefore explicitly only included these two methods because they had been optimized for functional effect prediction according to the authors. Fig- ure 2 shows that SNAP2 (again, predictions were taken from cross-validation sets that were not used for training) compares favorably to other methods. PolyPhen-2 has been explicitly optimized on human variants. Although the authors claimed that their method should work equally well on other eukaryotes, we felt that the above comparison put PolyPhen-2 at a disad- vantage as only 25% of our data consisted of human variants. We therefore re-calculated performance values separately for human and non-human PMD variants (Sections 1.3 and 2.1). As can be seen in Table 1, PolyPhen-2 does indeed perform significantly better on human variants on which it performs on par with SNAP2, although PolyPhen-2 values may be overestimated due to substantial overlap with its training data. Both methods exhibit better performance than SNAP and SIFT. On non-human variants however, SNAP2 significantly outperformed all other methods in terms of F 1effect(Eqn. 5), MCC (Eqn. 7) and Q2 (Eqn. 8).

32 Figure 2: Receiver Operating Characteristic (ROC) comparison. Shown are the receiver operating characteristics for our new method SNAP2 (dark blue, AUC=0.905), it’s predecessor SNAP (light blue, AUC=0.880) and the two widely used functional effect predictors SIFT (green, AUC=0.838) and PolyPhen-2 (orange, AUC=0.853). Curves are sig- 4 nificantly different (P<10 )accordingtothemethodbyDeLongetal., 1988. (Figure adapted from Hecht et al., 2015)

33 Method F 1effect F 1neutral MCC Q2 SNAP2 78.0% ± 0.6 46.3% ± 1.3 0.24 ± 0.01 68.8% ± 0.7 PolyPhen-2 78.4% ± 0.4* 45.1% ± 1.1* 0.23 ± 0.01* 68.9% ± 0.5* SNAP 74.9% ± 0.5* 46.7% ± 1.1* 0.22 ± 0.01* 65.8% ± 0.6* human SIFT 72.2% ± 0.6 49.0% ± 1.0 0.23 ± 0.01 63.6% ± 0.6

SNAP2 79.9% ± 0.3 45.8% ± 0.8 0.26 ± 0.01 70.7% ± 0.4 PolyPhen-2 77.1% ± 0.4 44.7% ± 0.8 0.22 ± 0.01 67.6% ± 0.5 SNAP 77.2% ± 0.3* 45.5% ± 0.9* 0.23 ± 0.01* 67.9% ± 0.5* SIFT 77.0% ± 0.3 45.8% ± 0.8 0.23 ± 0.01 67.7% ± 0.4 non-human Table 1: Method performances on PMD data. For each method, SNAP2, PolyPhen-2, SNAP and SIFT, the corresponding performance is shown separately for human and non-human proteins. Performance values were calculated for F 1effect (Eqn. 5), F 1neutral (Eqn.6), MCC (Eqn. 7) and Q2 (Eqn. 8) measures. The data consisted of 9,657 human variants in 678 proteins and 42,160 variants in 3,383 non-human proteins. Significantly best results for each measure are highlighted in bold. Marked values (*) indi- cate potentially over-estimated performance due to substantial overlap with training data. SNAP2 values were taken from cross-validation sets that had not been used for training.

34 3.2 Difficult cases and method combinations

When comparing different prediction methods, one will notice that overall method performance is quite similar. However, predictions can differ sub- stantially for individual cases. For instance, Liu et al., 2011 found that the pairwise agreement between any two methods lies between 61% and 77%. To turn this observation into a concept we grouped the variants that could be predicted by all four predictors (83,671 variants) into three classes. We found 53,976 ’easy’ cases (every method gets them right), 23,630 ’difficult’ cases (at least one but not all of the tested methods gave a correct prediction), and 6,066 ’unsolvable’ cases (no method gave a correct prediction). As can be seen from these numbers, the bulk of method performance is gained from easy cases. It appears that a well-trained method will achieve a minimum of 68% accuracy by simply using the most informative features, that almost every current implementation employs. It is likely that most of these variants can be predicted by simply looking at the alignment of related proteins. For unsolvable cases, on the other hand, it might be that alignment information is either not sufficiently available or misleading. We thus compared performances on those difficult cases (Figure 3) and found that our new method significantly outperformed the other methods with 67.2% accuracy. SNAP and PolyPhen-2 performed similarly with 55.2% and 57.8% accuracy respectively, although PolyPhen-2 was at a disadvantage (as it is specifically optimized for human variants). SIFT (45.5%) performed within the standard deviation of the random prediction model (44.5%), which assumed a background of 60% effect and 40% neutral like the overall data set (i.e. it randomly predicted effect variants slightly more often than neutral variants). We investigated the properties of these cases by looking into the human cases that SNAP2 could correctly predict while the others could not. We found that these were located at positions at which the variant amino acid was observed in the alignment (and in some cases even more frequently than

35 Figure 3: Comparison on difficult cases. The average two-state accuracy (Q2; Eqn. 8) on all difficult cases is shown for SNAP2 (dark blue), SNAP (light blue), SIFT (green) and PolyPhen-2 (orange). For reference the ran- dom prediction model with a 60:40 effect:neutral background is shown in pink.

36 the human native amino acid). As mentioned above, the presence of an amino acid in the alignment is a strong indicator that such a variant can be tolerated, thus possibly misleading other methods to predict these as neutral. This suspicion was further supported by the fact that our alignment- free version of SNAP2 predicted 75% of the effect variants with over 90% accuracy - a value that is significantly higher than the average performance of alignment-free SNAP2. This indicated that for these cases SNAP2 greatly profited from the variety of additional features rather than over-relying on alignment information. Can the reliability of prediction be increased by using multiple methods? We investigated the benefit of combining prediction methods for the human variants of our data by employing a combination of SIFT and PolyPhen-2 that only considered predictions if both methods agreed in their prediction. This naïve combination was compared to our new method SNAP2 and to each of the individual methods (SIFT and PolyPhen-2). Figure 4 shows that the method combination performed slightly better than SIFT on the neutral cases but did not improve over any of the individual methods on the effect variants. Moreover, the combination did not perform any better than SNAP2 throughout the curves. This suggests that users should not blindly combine methods and trust the results if methods agree but rather choose a method specifically for their problem or employ specifically optimized method combinations such as PredictSNP (Bendl et al., 2014) or Condel (González- Pérez and López-Bigas, 2011).

3.3 Neutral variant dilemma

As can be seen from Table 1 and Figure 4, neutral variants from our PMD set were predicted much worse than effect variants by all methods. In accordance with the findings reported by Bromberg et al., 2013, this can be attributed to a bias in both variant selection and experimental verification. The vari- ant selection bias results from the fact that variants are not investigated

37 Figure 4: Naïve method combination is not better than individual methods. This figure shows accuracy versus coverage separately for neutral (Panel a) and effect (Panel b) variants. The accuracy curve is the percentage of correctly predicted neutral (Eqn. 3, panel a) or effect (Eqn. 1, panel b) variants out of all predictions at the given threshold. Correspondingly the coverage is the percentage of correctly predicted neutral (Eqn. 4, panel a) or effect (Eqn. 2, panel b) variants out of all observations at the given threshold. The default thresholds of each method are marked by arrows for SNAP2 (dark blue), SIFT (green), PolyPhen-2 (orange) and the naïve combination of SIFT and PolyPhen-2 (brown). (Figure adapted from Hecht et al., 2015)

38 randomly. Researchers focus on variants for which the most extreme effects are expected or on variants that are hypothesized to be involved in diseases, which explains the imbalance in numbers between neutral and effect variants in today’s databases. This poses a significant obstacle for computational approaches, which require sufficient amounts of data in all classes in order to generalize. The lack of experimentally verified neutral variants hampers the ability to extract relevant patterns for neutral variants and lowers the predictive performance. The incomplete verification bias, on the other hand, results from the triviality that ’not observed’ does not necessarily imply ’not existing’. Negative experiments are much harder to carry out as they would require to assay every possible effect. Variants are thus typically evaluated on the basis of one or a few assays. If no effect is observed the variant is reported as neutral with respect to that assay but it may well have an effect on a different phenotype/assay. As today’s data is too scarce to develop specific methods for each phenotype, we have to accept that our general ap- proach is skewed with respect to neutral variants. This may have effects on the extraction of patterns for neutral variants as well as on the evaluation of performance.

3.4 Interpretation and reliability of variant prediction

Amajorpitfallintheapplicationofvarianteffect prediction for biologists and geneticists is the choice of method and the interpretation of results. There are significant differences between the seemingly similar approaches of functional effect and disease-association predictors. Disease association methods predict the likelihood of variants to be involved in diseases by dis- tinguishing disease variants from the background of natural genetic variation. For many users ’disease-associated’ implies that a variant must have strong functional effects. This is however only true for the simplest cases. For in- stance, in some monogenic or Mendelian diseases, the molecular effect can be quantified and directly linked to the phenotype (Hamosh et al., 2005).

39 Yet, most disorders appear to be complex in the sense that they involve multiple genes, variations and/or environmental conditions, as was shown by GWAS (Wellcome Trust Case Control Consortium, 2007; McCarthy et al., 2008). It has also been shown (1000 Genomes Project Consortium et al., 2010) that disease-associated variants can be found in healthy individuals. Moreover, even in seemingly clear cases the definition of a disease variant can be difficult. For example, certain variants of the hemoglobin B-chain cause a condition called sickle-cell anemia, which is associated with chronic health issues. On the other hand, that same condition grants immunity to some types of Malaria. In other words: the definition of a disease variant can depend on the environment and genotype of the individual, making the interpretation of ’disease’-predictions difficult. Adifferent set of problems applies for functional effect prediction. In con- trast to disease prediction, functional effect prediction focusses on the native molecular protein function. These predicted effects are independent of the individual and often also independent of the environment but offer no simple interpretation of their biological relevance. For instance, a weakly predicted effect in p53 may have a massive biological impact on the individual whereas astronglypredictedeffect in another protein may have little biological signifi- cance. Moreover, functional effect predictors cannot distinguish the direction of predicted effects. That is, they cannot discern between gain and loss of function, but simply predict that a variant has an effect on the native protein function. This means that predictions have to be interpreted with respect to the protein under investigation. Today’s methods can distinguish between sets of variants that are highly enriched in effects and sets that are not (Schaefer et al., 2012), but pinpoint- ing the one mutation that causes a certain phenotype within an individual genome is often beyond our reach. In order to aid the prioritization of exper- imental variant testing, users need to be able to focus on the most promising candidates. Towards this end, we provided a reliability index (RI) that allows

40 users to zoom in on the most reliable predictions. We first fixed the thresh- old for the binary categorization of our prediction by selecting a threshold on our balanced data set (Fig. 5a) that provided the highest overall perfor- mance for both classes. We then calculated reliability bins from the output score by projecting the score onto integers between 0 and 9. From these bins we estimated the performance of each reliability index for our training data, thus providing users with an estimate of accuracy for each index. Figure 5b depicts the cumulative accuracy and coverage that can be expected above each reliability threshold and to how many samples of our data this applied. For example, the grey arrows mark predictions made at reliability index of 7 or above, which applied to over 58% of our data. For this most strongly pre- dicted half of our data, we estimated over 90% accuracy for neutral samples and almost 95% for effect variants. Focussing on these variants with RI 7 or better can thus be considered a reasonable approach towards prioritizing promising candidates for experimental verification.

3.5 Variant prediction in orphan proteins

Orphan proteins are sequences that find no (or, by extension, only very few) related sequences in today’s databases. In other words, there is no meaningful alignment available for the prediction of variant effects. Most methods today rely heavily on the evolutionary information encoded in alignments and thus perform poorly at best for these cases. Ongoing sequencing efforts continue to bring in novel sequences and reveal novel protein families. In fact, the number of orphan families keeps increasing. In October 2012 the UniRef50 consisted of roughly 5.5 million sequence clusters of which over 3.5 million contained only one sequence, meaning that approximately 64% of all known protein clusters (clustered at 50% sequences identity) were represented only by a single orphan protein. Over the last 2 years, this number has more than doubled. The current release of UniRef50 (Feb 2015) lists over 13.2 million sequence clusters with roughly 8.2 million consisting of a single sequence. For

41 Figure 5: Threshold and reliability index. Panel (a) shows accuracy (solid lines, Eqn. 1 and 3) and coverage (dashed lines, Eqn. 2 and 4) for both effect (red) and neutral (green) variants over the entire spectrum of pos- sible thresholds. A black arrow marks the threshold selected for the binary prediction. Panel (b) depicts the same measures above certain reliability in- dices (RI). The leftmost point (RI=9) corresponds to predictions with the highest reliability, while the rightmost point (RI 0) includes all predictions. Shown are the cumulative accuracy (solid lines) and cumulative coverage (dashed lines) above the corresponding reliability value (ranging from 0, low- est reliability to 9, highest reliability) separately for effect (red) and neutral (green) variants. Grey arrows mark the reliability index that applies to the most strongly predicted half of the tested data. (Figure adapted from Hecht et al., 2015)

42 these 8.2 million proteins/families no other sequences can be found that are more than 50% identical. While sequences can be quite similar in structure and function with less than 50% sequence identity, this still suggests that many proteins in today’s databases are orphans or will have very sparse and thus possibly uninformative alignments. We specifically aimed at predicting variants for these proteins by training a method that did not require alignments (Section 2.3.3). SNAP2noali was first tested on our entire data set through cross-validation. With an overall accuracy of Q2=68% it performed significantly worse than other methods on our entire data set, which was to be expected and highlighted the importance of alignment information (Fig. 6).

Figure 6: Method performances on the entire training data. Shown are the accuracy versus coverage curves for SNAP2 (dark blue), SNAP (light blue), SIFT (green), SNAP2noali (black) and the random (pink) prediction model with 60:40 effect:neutral background for neutral (panel a) and effect (panel b) variants.

To test the performance of our method on orphan proteins we sorted our data by the amount of sequences in the corresponding alignments. We found no real orphan (i.e. no protein for which no significant hit was found) in

43 our data. However, by raising the threshold to less than five hits, we identi- fied a small number of 8 ’orphan’ proteins with a total of 248 variants. On these, our alignment-free method SNAP2noali achieved an accuracy of 62%, while our regular method SNAP2 performed at 61% accuracy. PolyPhen-2 predicted only three of the eight proteins (103 variants) at 60% and SIFT gave no predictions, which corresponds to random. Although this was a very small test set, the results indicate that the alignment-free method consti- tutes an important alternative for predicting variants in proteins with small alignments and is likely the best solution for orphan proteins.

3.6 The protein mutability landscape

The high computational power of today’s processors and the improved effi- ciency of our algorithm allowed us to shift the predictive scope of our method. Instead of only predicting one mutation at a time, we expanded the view by sketching the entire mutability landscape of a protein. This landscape can be defined as the predicted effects of each non-native substitution at each position in the protein (Fig. 7b). This comprehensive view of functional effects in proteins brings about opportunities but also a number of challenges. The human beta-2-adrenergic receptor (UniProt ID: ADRB2_HUMAN, UniProt AC: P07550) is an inte- gral membrane protein with seven transmembrane helices crossing the lipid bilayer. These seven transmembrane helices are clearly visible from the pre- dicted effects shown in Figure 7b and align with the DSSP-assigned (Kabsch and Sander, 1983) secondary structure elements (Fig. 7a) based on the high resolution structure 3PDS (Rosenbaum et al., 2011) in the Protein Data Bank (PDB; Bernstein et al., 1977). This could be expected, as transmem- brane regions are often well conserved due to the specific biophysical and structural restrictions imposed upon them. Still, there is a remarkably high number of effect predictions even for amino acids that fulfill the transmem- brane requirements (i.e. that are likely structurally acceptable/neutral) as

44 Figure 7: Mutability landscape of the human beta-2-adrenergic re- ceptor. Figure caption on the next page.

45 Figure 7: Mutability landscape of the human beta-2-adrenergic re- ceptor. Panel (a) shows secondary structure assignments from DSSP based on the protein structure PDB: 3PDS. The predicted mutability landscape is shown in panel (b) with the x-axis representing the entire sequence. Predicted effects (ranging from green/neutral over white/inconclusive to red/effect) are shown on the y-axis for all 19 non-native substitutions. Panel (c) zooms in on the region spanning from residue 91 to residue 210. It shows the two bind- ing sites D113 and T118 along with the two strongly predicted and highly correlated residues W99 and Y199. Panel (d) shows the 57 predicted high effect residues on the structure colored according to their predicted effects. Grey represents no or little effect (SNAP score 20), yellow depicts low  effects (20 < score 40) , medium effects are marked in orange (40 < score  60), and high effects are shown in red (score > 60). Panel (e) shows  the structure (PDB: 3PDS) of human beta-2-adrenergic receptor (UniProt ID: ADRB2_HUMAN; UniProt AC: P07550) with an irreversibly bound agonist (cyan sticks), the two known binding site residues D113 and T118 (blue spheres) and the two predicted high effect residues W99 and Y199 (red spheres) that exhibit strong residue couplings with each other and the binding site. (Figure adapted from Hecht et al., 2013)

46 well as for non-transmembrane variants. We found 57 residues that were predicted to cause strong effects upon mutation (average substitution effect score of over 60). To visualize these we used the available structure (PDB: 3PDS; Fig. 7d) and colored all residues according to their average predicted substitution effect. Figure 7d shows no or very small effects (SNAP score  20) in grey, strong effects (SNAP score > 60) in red, medium effects (40 < SNAP score 60) in orange, and weak effects (20 < SNAP score 40) in   yellow. Notably, the vast majority of effect predictions are facing the two known binding site residues (blue spheres) or are located in their proxim- ity. Only few of these have been experimentally tested for functional effects, which makes an assessment difficult: among these 57 positions, we could only find 11 experimental effect annotations. What about the remaining 46 residues without experimental evidence? Some of these may simply be false positives or predicted to have an effect because of structural importance but it appears likely that some of these may also be directly relevant for function and their involvement is yet unknown. A similar observation had previously been made by Bromberg et al. (2009) in an in-silico mutagenesis study of the human melanocortin-4 receptor. Following their example, we filtered our results by only considering variants that could be considered neutral given their biophysical properties. Towards this end we used the PHAT matrix (Ng et al., 2000) for transmembrane regions and the BLOSUM (Henikoff and Henikoff,1992)forallotherregionsandfilteredoutallpredictionsfor variants with a negative substitution score in these matrices. This filtering left us with twelve predicted high-effect positions that are likely to directly affect function upon mutation. These twelve can already be considered rea- sonable candidates for experimental testing in this protein but other proteins may exhibit significantly more high-impact sites. In order to further narrow down candidates, we studied these variants in the light of correlated muta- tions by applying EVfold (Marks et al., 2012; Hopf et al., 2012). This tool was designed to predict inter-residue contacts from correlated sequence variation

47 and thus possibly predict the 3D structure. We looked at the correlation pat- terns for our twelve candidates and found that only two of these had residue couplings among the top 5%. W99 and Y199 (Fig. 7e, red spheres) were strongly correlated with each other and the two known binding site residues (Fig. 7e, dark blue spheres). A visual inspection of the protein structure (PDB: 3PDS) with irreversibly bound agonist (Fig. 7e, cyan sticks) also suggested functional importance. This example shows how the protein mu- tability landscape of functional effects can be used to generate hypotheses for drug development and variant prioritization.

4Conclusion

This thesis presents a newly developed method for the prediction of func- tional effects of amino acid substitutions in proteins. The method is shown to perform better (on non-human variants) or on par (on human variants) with current state-of-the-art methods, while significantly outperforming all competitors on the difficult cases. The novel features, that no other method uses, allowed to make accurate predictions where other methods struggled and even allowed to make non-random predictions for orphan proteins. The method presented in this thesis is the most reasonable choice for all proteins with little or no evolutionary information and capable of making predictions where other methods cannot. Better CPUs and an improved algorithm allowed us to shift the predic- tive focus of our method from single variants to entire mutability landscapes. These landscapes have been shown to be promising and may enable new insights into protein engineering, drug development and variant prioritiza- tion. They may be the key for studying the genotype-phenotype link on amolecularlevelifwelearnhowtocorrectlyinterpretthem.Whilethe experimental counterparts remain constrained by the substantial amount of resources required for mutagenesis studies, computational prediction is cur-

48 rently constrained by the amount of available experimental data. Compre- hensive testing studies such as performed for the HIV protease or the LacI repressor are invaluable sources of information as they help overcome the neu- trality dilemma and broaden our understanding of functional variant effects. Still, computational prediction of variant effects is crucial for coping with the ever-increasing amount of sequencing data and will likely play a major role in understanding the molecular mechanisms that link genetic variation and diseases.

Acknowledgements

I would like to thank Prof. Dr. Burkhard Rost for supervising my thesis, for giving me advice and criticism, and providing the high-end working environ- ment to carry out my studies. Thanks to my colleagues at the Rostlab for interesting discussions and good times both in and outside of the lab. Spe- cial thanks to Timothy Karl and Laszlo Kajan for their patience in providing technical assistance and stable hard- and software. Thanks also to Marlena Drabik for all the little things that often (appear to) go unnoticed and her tireless efforts in dealing with administrative issues. Thanks also to Edda Kloppmann and Peter Hönigschmid for proof-reading the manuscript. Last, not least, I would like to thank my friends, my family and especially my girlfriend Sarah who continuously supported me in any imaginable way.

49 References

1000 Genomes Project Consortium, Abecasis, G. R., Altshuler, D., Auton, A., Brooks, L. D., Durbin, R. M., Gibbs, R. A., Hurles, M. E., and McVean, G. A. (2010). A map of human genome variation from population-scale sequencing. Nature,467(7319):1061–73.

1000 Genomes Project Consortium, Abecasis, G. R., Auton, A., Brooks, L. D., DePristo, M. A., Durbin, R. M., Handsaker, R. E., Kang, H. M., Marth, G. T., and McVean, G. A. (2012). An integrated map of genetic variation from 1,092 human genomes. Nature,491(7422):56–65.

Adzhubei, I. A., Schmidt, S., Peshkin, L., Ramensky, V. E., Gerasimova, A., Bork, P., Kondrashov, A. S., and Sunyaev, S. R. (2010). A method and server for predicting damaging missense mutations. Nat Methods, 7(4):248–9.

Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J. (1997). Gapped blast and psi-blast: a new genera- tion of protein database search programs. Nucleic Acids Res,25(17):3389– 402.

Barrett, J. C. and Cardon, L. R. (2006). Evaluating coverage of genome-wide association studies. Nat Genet,38(6):659–62.

Beck, T., Hastings, R. K., Gollapudi, S., Free, R. C., and Brookes, A. J. (2014). Gwas central: a comprehensive resource for the comparison and interrogation of genome-wide association studies. Eur J Hum Genet, 22(7):949–52.

Bendl, J., Stourac, J., Salanda, O., Pavelka, A., Wieben, E. D., Zendulka, J., Brezovsky, J., and Damborsky, J. (2014). Predictsnp: robust and accu- rate consensus classifier for prediction of disease-related mutations. PLoS Comput Biol,10(1):e1003440.

50 Bernstein, F. C., Koetzle, T. F., Williams, G. J., Meyer, Jr, E. F., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T., and Tasumi, M. (1977). The protein data bank: a computer-based archival file for macro- molecular structures. JMolBiol,112(3):535–42.

Bogan, A. A. and Thorn, K. S. (1998). Anatomy of hot spots in protein interfaces. JMolBiol,280(1):1–9.

Bromberg, Y., Kahn, P. C., and Rost, B. (2013). Neutral and weakly non- neutral sequence variants may define individuality. Proc Natl Acad Sci U SA,110(35):14255–60.

Bromberg, Y., Overton, J., Vaisse, C., Leibel, R. L., and Rost, B. (2009). In silico mutagenesis: a case study of the melanocortin 4 receptor. FASEB J, 23(9):3059–69.

Bromberg, Y. and Rost, B. (2007). Snap: predict effect of non-synonymous polymorphisms on function. Nucleic Acids Res,35(11):3823–35.

Calabrese, R., Capriotti, E., Fariselli, P., Martelli, P. L., and Casadio, R. (2009). Functional annotations improve the predictive score of human disease-related mutations in proteins. Hum Mutat,30(8):1237–44.

Capriotti, E., Fariselli, P., Rossi, I., and Casadio, R. (2008). A three-state prediction of single point mutations on protein stability changes. BMC Bioinformatics,9Suppl2:S6.

Clackson, T. and Wells, J. A. (1995). A hot spot of binding energy in a hormone-receptor interface. ,267(5196):383–6.

Collins, F. S., Brooks, L. D., and Chakravarti, A. (1998). A dna polymor- phism discovery resource for research on human genetic variation. Genome Res,8(12):1229–31.

51 Cook, Jr, E. H. and Scherer, S. W. (2008). Copy-number variations associated with neuropsychiatric conditions. Nature,455(7215):919–23.

Davoli, T. and de Lange, T. (2011). The causes and consequences of poly- ploidy in normal development and cancer. Annu Rev Cell Dev Biol,27:585– 610.

Dehouck, Y., Grosfils, A., Folch, B., Gilis, D., Bogaerts, P., and Rooman, M. (2009). Fast and accurate predictions of protein stability changes upon mutations using statistical potentials and neural networks: Popmusic-2.0. Bioinformatics,25(19):2537–43.

Dehouck, Y., Kwasigroch, J. M., Rooman, M., and Gilis, D. (2013). Beatmu- sic: Prediction of changes in protein-protein binding affinity on mutations. Nucleic Acids Res,41(WebServerissue):W333–9.

DeLong, E. R., DeLong, D. M., and Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics,44(3):837–45.

Driscoll, D. A. and Gross, S. (2009). Clinical practice. prenatal screening for aneuploidy. NEnglJMed,360(24):2556–62.

Eberle, M. A., Ng, P. C., Kuhn, K., Zhou, L., Peiffer, D. A., Galver, L., Viaud-Martinez, K. A., Lawley, C. T., Gunderson, K. L., Shen, R., and Murray, S. S. (2007). Power to detect risk alleles using genome-wide tag snp panels. PLoS Genet,3(10):1827–37.

ENCODE Project Consortium (2012). An integrated encyclopedia of dna elements in the human genome. Nature,489(7414):57–74.

Fodor, A. A. and Aldrich, R. W. (2004). Influence of conservation on calcula- tions of amino acid covariance in multiple sequence alignments. Proteins, 56(2):211–21.

52 Frank, E., Hall, M., Trigg, L., Holmes, G., and Witten, I. H. (2004). Data mining in bioinformatics using weka. Bioinformatics,20(15):2479–81.

Frazer, K. A., Murray, S. S., Schork, N. J., and Topol, E. J. (2009). Human genetic variation and its contribution to complex traits. Nat Rev Genet, 10(4):241–51.

González-Pérez, A. and López-Bigas, N. (2011). Improving the assessment of the outcome of nonsynonymous snvs with a consensus deleteriousness score, condel. Am J Hum Genet,88(4):440–9.

Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A., and McKusick, V. A. (2005). Online mendelian inheritance in man (omim), a knowledge- base of human genes and genetic disorders. Nucleic Acids Res,33(Database issue):D514–7.

Harewood, L., Schütz, F., Boyle, S., Perry, P., Delorenzi, M., Bickmore, W. A., and Reymond, A. (2010). The effect of translocation-induced nu- clear reorganization on gene expression. Genome Res,20(5):554–64.

Hecht, M. (2011). Improve predictions of functional effect of non-synonymous snps. Master’s thesis, Technische Universität München.

Hecht, M., Bromberg, Y., and Rost, B. (2013). News from the protein mu- tability landscape. JMolBiol,425(21):3937–48.

Hecht, M., Bromberg, Y., and Rost, B. (2015). Better prediction of functional effects for sequence variants. BMC Genomics,16(Suppl8):S1.

Henikoff,S.andHenikoff,J.G.(1992).Aminoacidsubstitutionmatrices from protein blocks. Proc Natl Acad Sci U S A,89(22):10915–9.

Hopf, T. A., Colwell, L. J., Sheridan, R., Rost, B., Sander, C., and Marks, D. S. (2012). Three-dimensional structures of membrane proteins from genomic sequencing. Cell,149(7):1607–21.

53 Iafrate, A. J., Feuk, L., Rivera, M. N., Listewnik, M. L., Donahoe, P. K., Qi, Y., Scherer, S. W., and Lee, C. (2004). Detection of large-scale variation in the human genome. Nat Genet,36(9):949–51.

International HapMap Consortium (2003). The international hapmap project. Nature,426(6968):789–96.

Kabsch, W. and Sander, C. (1983). Dictionary of protein secondary struc- ture: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers,22(12):2577–637.

Kawabata, T., Ota, M., and Nishikawa, K. (1999). The protein mutant database. Nucleic Acids Res,27(1):355–7.

Kawashima, S. and Kanehisa, M. (2000). Aaindex: amino acid index database. Nucleic Acids Res,28(1):374.

Kidd, J. M., Cooper, G. M., Donahue, W. F., Hayden, H. S., Sampas, N., Graves, T., Hansen, N., Teague, B., Alkan, C., Antonacci, F., Haugen, E., Zerr, T., Yamada, N. A., Tsang, P., Newman, T. L., Tüzün, E., Cheng, Z., Ebling, H. M., Tusneem, N., David, R., Gillett, W., Phelps, K. A., Weaver, M., Saranga, D., Brand, A., Tao, W., Gustafson, E., McKernan, K., Chen, L., Malig, M., Smith, J. D., Korn, J. M., McCarroll, S. A., Altshuler, D. A., Peiffer, D. A., Dorschner, M., Stamatoyannopoulos, J., Schwartz, D., Nickerson, D. A., Mullikin, J. C., Wilson, R. K., Bruhn, L., Olson, M. V., Kaul, R., Smith, D. R., and Eichler, E. E. (2008). Mapping and sequencing of structural variation from eight human genomes. Nature, 453(7191):56–64.

Kiezun, A., Garimella, K., Do, R., Stitziel, N. O., Neale, B. M., McLaren, P. J., Gupta, N., Sklar, P., Sullivan, P. F., Moran, J. L., Hultman, C. M., Lichtenstein, P., Magnusson, P., Lehner, T., Shugart, Y. Y., Price, A. L., de Bakker, P. I. W., Purcell, S. M., and Sunyaev, S. R. (2012). Exome

54 sequencing and the genetic basis of complex traits. Nat Genet,44(6):623– 30.

Kimura, M. (1968). Evolutionary rate at the molecular level. Nature, 217(5129):624–6.

Kircher, M., Witten, D. M., Jain, P., O’Roak, B. J., Cooper, G. M., and Shendure, J. (2014). A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet,46(3):310–5.

Kowarsch, A., Fuchs, A., Frishman, D., and Pagel, P. (2010). Correlated mu- tations: a hallmark of phenotypic amino acid substitutions. PLoS Comput Biol,6(9).

Krishnamurthy, K. (2003). Textbook of Biodiversity.Taylor&Francis.

Kruglyak, L. and Nickerson, D. A. (2001). Variation is the spice of life. Nat Genet,27(3):234–6.

Kumar, P., Henikoff, S., and Ng, P. C. (2009). Predicting the effects of coding non-synonymous variants on protein function using the sift algorithm. Nat Protoc,4(7):1073–81.

Lander, E. S. (2011). Initial impact of the sequencing of the human genome. Nature,470(7333):187–97.

Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Bald- win, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., Funke, R., Gage, D., Harris, K., Heaford, A., Howland, J., Kann, L., Lehoczky, J., LeVine, R., McEwan, P., McKernan, K., Meldrim, J., Mesirov, J. P., Miranda, C., Morris, W., Naylor, J., Raymond, C., Rosetti, M., Santos, R., Sheridan, A., Sougnez, C., Stange-Thomann, N., Stojanovic, N., Subramanian, A., Wyman, D., Rogers, J., Sulston, J., Ainscough, R., Beck, S., Bentley, D., Burton, J., Clee, C., Carter, N., Coulson, A., Deadman, R., Deloukas, P.,

55 Dunham, A., Dunham, I., Durbin, R., French, L., Grafham, D., Gregory, S., Hubbard, T., Humphray, S., Hunt, A., Jones, M., Lloyd, C., McMur- ray, A., Matthews, L., Mercer, S., Milne, S., Mullikin, J. C., Mungall, A., Plumb, R., Ross, M., Shownkeen, R., Sims, S., Waterston, R. H., Wilson, R. K., Hillier, L. W., McPherson, J. D., Marra, M. A., Mardis, E. R., Fulton, L. A., Chinwalla, A. T., Pepin, K. H., Gish, W. R., Chissoe, S. L., Wendl, M. C., Delehaunty, K. D., Miner, T. L., Delehaunty, A., Kramer, J. B., Cook, L. L., Fulton, R. S., Johnson, D. L., Minx, P. J., Clifton, S. W., Hawkins, T., Branscomb, E., Predki, P., Richardson, P., Wenning, S., Slezak, T., Doggett, N., Cheng, J. F., Olsen, A., Lucas, S., Elkin, C., Uberbacher, E., Frazier, M., Gibbs, R. A., Muzny, D. M., Scherer, S. E., Bouck, J. B., Sodergren, E. J., Worley, K. C., Rives, C. M., Gorrell, J. H., Metzker, M. L., Naylor, S. L., Kucherlapati, R. S., Nelson, D. L., Wein- stock, G. M., Sakaki, Y., Fujiyama, A., Hattori, M., Yada, T., Toyoda, A., Itoh, T., Kawagoe, C., Watanabe, H., Totoki, Y., Taylor, T., Weis- senbach, J., Heilig, R., Saurin, W., Artiguenave, F., Brottier, P., Bruls, T., Pelletier, E., Robert, C., Wincker, P., Smith, D. R., Doucette-Stamm, L., Rubenfield, M., Weinstock, K., Lee, H. M., Dubois, J., Rosenthal, A., Platzer, M., Nyakatura, G., Taudien, S., Rump, A., Yang, H., Yu, J., Wang, J., Huang, G., Gu, J., Hood, L., Rowen, L., Madan, A., Qin, S., Davis, R. W., Federspiel, N. A., Abola, A. P., Proctor, M. J., My- ers, R. M., Schmutz, J., Dickson, M., Grimwood, J., Cox, D. R., Olson, M. V., Kaul, R., Raymond, C., Shimizu, N., Kawasaki, K., Minoshima, S., Evans, G. A., Athanasiou, M., Schultz, R., Roe, B. A., Chen, F., Pan, H., Ramser, J., Lehrach, H., Reinhardt, R., McCombie, W. R., de la Bastide, M., Dedhia, N., Blöcker, H., Hornischer, K., Nordsiek, G., Agarwala, R., Aravind, L., Bailey, J. A., Bateman, A., Batzoglou, S., Birney, E., Bork, P., Brown, D. G., Burge, C. B., Cerutti, L., Chen, H. C., Church, D., Clamp, M., Copley, R. R., Doerks, T., Eddy, S. R., Eichler, E. E., Furey, T. S., Galagan, J., Gilbert, J. G., Harmon, C., Hayashizaki, Y., Haussler,

56 D., Hermjakob, H., Hokamp, K., Jang, W., Johnson, L. S., Jones, T. A., Kasif, S., Kaspryzk, A., Kennedy, S., Kent, W. J., Kitts, P., Koonin, E. V., Korf, I., Kulp, D., Lancet, D., Lowe, T. M., McLysaght, A., Mikkelsen, T., Moran, J. V., Mulder, N., Pollara, V. J., Ponting, C. P., Schuler, G., Schultz, J., Slater, G., Smit, A. F., Stupka, E., Szustakowski, J., Thierry- Mieg, D., Thierry-Mieg, J., Wagner, L., Wallis, J., Wheeler, R., Williams, A., Wolf, Y. I., Wolfe, K. H., Yang, S. P., Yeh, R. F., Collins, F., Guyer, M. S., Peterson, J., Felsenfeld, A., Wetterstrand, K. A., Patrinos, A., Mor- gan, M. J., de Jong, P., Catanese, J. J., Osoegawa, K., Shizuya, H., Choi, S., Chen, Y. J., Szustakowki, J., and International Human Genome Se- quencing Consortium (2001). Initial sequencing and analysis of the human genome. Nature,409(6822):860–921.

Liu, X., Jian, X., and Boerwinkle, E. (2011). dbnsfp: a lightweight database of human nonsynonymous snps and their functional predictions. Hum Mutat,32(8):894–9.

Loeb, D. D., Swanstrom, R., Everitt, L., Manchester, M., Stamper, S. E., and Hutchison, 3rd, C. A. (1989). Complete mutagenesis of the hiv-1 protease. Nature,340(6232):397–400.

Markiewicz, P., Kleina, L. G., Cruz, C., Ehret, S., and Miller, J. H. (1994). Genetic studies of the lac repressor. xiv. analysis of 4000 altered escherichia coli lac repressors reveals essential and non-essential residues, as well as "spacers" which do not require a specific sequence. JMolBiol,240(5):421– 33.

Marks, D. S., Hopf, T. A., and Sander, C. (2012). Protein structure prediction from sequence variation. Nat Biotechnol,30(11):1072–80.

Martinez-Perez, E., Shaw, P., Aragon-Alcaide, L., and Moore, G. (2003). Chromosomes form into seven groups in hexaploid and tetraploid wheat as a prelude to meiosis. Plant J,36(1):21–9.

57 McCarthy, M. I., Abecasis, G. R., Cardon, L. R., Goldstein, D. B., Little, J., Ioannidis, J. P. A., and Hirschhorn, J. N. (2008). Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet,9(5):356–69.

Ng, P. C., Henikoff,J.G.,andHenikoff, S. (2000). Phat: a transmembrane- specific substitution matrix. predicted hydrophobic and transmembrane. Bioinformatics,16(9):760–6.

Nissen, S. (2003). Implementation of a fast artificial neural network library (fann). Report, Department of Computer Science University of Copenhagen (DIKU),31.

Ofran, Y., Mysore, V., and Rost, B. (2007). Prediction of dna-binding residues from sequence. Bioinformatics,23(13):i347–53.

Ofran, Y. and Rost, B. (2007). Isis: interaction sites identified from sequence. Bioinformatics,23(2):e13–6.

Ogura, Y., Bonen, D. K., Inohara, N., Nicolae, D. L., Chen, F. F., Ramos, R., Britton, H., Moran, T., Karaliuskas, R., Duerr, R. H., Achkar, J. P., Brant, S. R., Bayless, T. M., Kirschner, B. S., Hanauer, S. B., Nuñez, G., and Cho, J. H. (2001). A frameshift mutation in nod2 associated with susceptibility to crohn’s disease. Nature,411(6837):603–6.

Ohta, T. (2002). Near-neutrality in of genes and gene regulation. Proc Natl Acad Sci U S A,99(25):16134–7.

Patterson, D. (2009). Molecular genetic analysis of down syndrome. Hum Genet,126(1):195–214.

Pe’er, I., de Bakker, P. I. W., Maller, J., Yelensky, R., Altshuler, D., and Daly, M. J. (2006). Evaluating and improving power in whole-genome association studies using fixed marker sets. Nat Genet,38(6):663–7.

58 Perry, G. H., Dominy, N. J., Claw, K. G., Lee, A. S., Fiegler, H., Redon, R., Werner, J., Villanea, F. A., Mountain, J. L., Misra, R., Carter, N. P., Lee, C., and Stone, A. C. (2007). Diet and the evolution of human amylase gene copy number variation. Nat Genet,39(10):1256–60.

Punta, M., Coggill, P. C., Eberhardt, R. Y., Mistry, J., Tate, J., Boursnell, C., Pang, N., Forslund, K., Ceric, G., Clements, J., Heger, A., Holm, L., Sonnhammer, E. L. L., Eddy, S. R., Bateman, A., and Finn, R. D. (2012). The pfam protein families database. Nucleic Acids Res,40(Database issue):D290–301.

Rauch, A., Wieczorek, D., Graf, E., Wieland, T., Endele, S., Schwarzmayr, T., Albrecht, B., Bartholdi, D., Beygo, J., Di Donato, N., Dufke, A., Cre- mer, K., Hempel, M., Horn, D., Hoyer, J., Joset, P., Röpke, A., Moog, U., Riess, A., Thiel, C. T., Tzschach, A., Wiesener, A., Wohlleber, E., Zweier, C., Ekici, A. B., Zink, A. M., Rump, A., Meisinger, C., Grallert, H., Sticht, H., Schenck, A., Engels, H., Rappold, G., Schröck, E., Wieacker, P., Riess, O., Meitinger, T., Reis, A., and Strom, T. M. (2012). Range of genetic mutations associated with severe non-syndromic sporadic intellectual dis- ability: an exome sequencing study. Lancet,380(9854):1674–82.

Reva, B., Antipin, Y., and Sander, C. (2011). Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res,39(17):e118.

Rosenbaum, D. M., Zhang, C., Lyons, J. A., Holl, R., Aragao, D., Arlow, D. H., Rasmussen, S. G. F., Choi, H.-J., Devree, B. T., Sunahara, R. K., Chae, P. S., Gellman, S. H., Dror, R. O., Shaw, D. E., Weis, W. I., Caffrey, M., Gmeiner, P., and Kobilka, B. K. (2011). Structure and function of an irreversible agonist-beta(2) adrenoceptor complex. Nature,469(7329):236– 40.

59 Rost, B. (1999). Twilight zone of protein sequence alignments. Protein Eng, 12(2):85–94.

Rost, B. and Sander, C. (1993). Prediction of protein secondary structure at better than 70% accuracy. JMolBiol,232(2):584–99.

Rost, B. and Sander, C. (1994). Conservation and prediction of solvent accessibility in protein families. Proteins,20(3):216–26.

Sachidanandam, R., Weissman, D., Schmidt, S. C., Kakol, J. M., Stein, L. D., Marth, G., Sherry, S., Mullikin, J. C., Mortimore, B. J., Willey, D. L., Hunt, S. E., Cole, C. G., Coggill, P. C., Rice, C. M., Ning, Z., Rogers, J., Bentley, D. R., Kwok, P. Y., Mardis, E. R., Yeh, R. T., Schultz, B., Cook, L., Davenport, R., Dante, M., Fulton, L., Hillier, L., Waterston, R. H., McPherson, J. D., Gilman, B., Schaffner, S., Van Etten, W. J., Reich, D., Higgins, J., Daly, M. J., Blumenstiel, B., Baldwin, J., Stange-Thomann, N., Zody, M. C., Linton, L., Lander, E. S., Altshuler, D., and International SNP Map Working Group (2001). A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature, 409(6822):928–33.

Sander, C. and Schneider, R. (1991). Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 9(1):56–68.

Sankararaman, S., Sha, F., Kirsch, J. F., Jordan, M. I., and Sjölander, K. (2010). Active site prediction using evolutionary and structural informa- tion. Bioinformatics,26(5):617–24.

Sankararaman, S. and Sjölander, K. (2008). Intrepid–information-theoretic tree traversal for protein functional site identification. Bioinformatics, 24(21):2445–52.

60 Schaefer, C., Bromberg, Y., Achten, D., and Rost, B. (2012). Disease-related mutations predicted to impact protein function. BMC Genomics,13Suppl 4:S11.

Schaefer, C. and Rost, B. (2012). Predict impact of single amino acid change upon protein structure. BMC Genomics,13Suppl4:S4.

Schlessinger, A., Yachdav, G., and Rost, B. (2006). Profbval: predict flexible and rigid residues in proteins. Bioinformatics,22(7):891–3.

Schwarz, J. M., Rödelsperger, C., Schuelke, M., and Seelow, D. (2010). Muta- tiontaster evaluates disease-causing potential of sequence alterations. Nat Methods,7(8):575–6.

Sebat, J., Lakshmi, B., Troge, J., Alexander, J., Young, J., Lundin, P., Månér, S., Massa, H., Walker, M., Chi, M., Navin, N., Lucito, R., Healy, J., Hicks, J., Ye, K., Reiner, A., Gilliam, T. C., Trask, B., Patterson, N., Zetterberg, A., and Wigler, M. (2004). Large-scale copy number polymor- phism in the human genome. Science,305(5683):525–8.

Sen, S. (2000). Aneuploidy and cancer. Curr Opin Oncol,12(1):82–8.

Sherry, S. T., Ward, M. H., Kholodov, M., Baker, J., Phan, L., Smigiel- ski, E. M., and Sirotkin, K. (2001). dbsnp: the ncbi database of genetic variation. Nucleic Acids Res,29(1):308–11.

Sigrist, C. J. A., Cerutti, L., de Castro, E., Langendijk-Genevaux, P. S., Bulliard, V., Bairoch, A., and Hulo, N. (2010). Prosite, a protein domain database for functional characterization and annotation. Nucleic Acids Res,38(Databaseissue):D161–6.

Slatkin, M. (2008). Linkage disequilibrium–understanding the evolutionary past and mapping the medical future. Nat Rev Genet,9(6):477–85.

61 Stankiewicz, P. and Lupski, J. R. (2010). Structural variation in the human genome and its role in disease. Annu Rev Med,61:437–55.

Stenson, P. D., Ball, E. V., Mort, M., Phillips, A. D., Shiel, J. A., Thomas, N. S. T., Abeysinghe, S., Krawczak, M., and Cooper, D. N. (2003). Human gene mutation database (hgmd): 2003 update. Hum Mutat,21(6):577–81.

Sunyaev, S. R., Eisenhaber, F., Rodchenkov, I. V., Eisenhaber, B., Tu- manyan, V. G., and Kuznetsov, E. N. (1999). Psic: profile extraction from sequence alignments with position-specific counts of independent ob- servations. Protein Eng,12(5):387–94.

Teer, J. K. and Mullikin, J. C. (2010). Exome sequencing: the sweet spot before whole genomes. Hum Mol Genet,19(R2):R145–51.

Thorn, K. S. and Bogan, A. A. (2001). Asedb: a database of alanine muta- tions and their effects on the free energy of binding in protein interactions. Bioinformatics,17(3):284–5.

Thusberg, J. and Vihinen, M. (2009). Pathogenic or not? and if so, then how? studying the effects of missense mutations using bioinformatics methods. Hum Mutat,30(5):703–14.

UniProt Consortium (2015). Uniprot: a hub for protein information. Nucleic Acids Res,43(Databaseissue):D204–12.

Webb, E. C. (1992). Enzyme nomenclature 1992. Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the Nomenclature and Classification of Enzymes. Number Ed. 6. Academic Press, New York.

Wellcome Trust Case Control Consortium (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature,447(7145):661–78.

62 Yachdav, G., Hecht, M., Pasmanik-Chor, M., Yeheskel, A., and Rost, B. (2014). Heatmapviewer: interactive display of 2d data in biology. F1000Res,3:48.

Zhang, F., Gu, W., Hurles, M. E., and Lupski, J. R. (2009). Copy number variation in human health, disease, and evolution. Annu Rev Genomics Hum Genet,10:451–81.

63 Appendix

This work constitutes a cumulative dissertation and the methodologies and results presented in this thesis have been published in peer-reviewed journals. The corresponding articles are appended to this dissertation and will be briefly summarized in the following:

• Maximilian Hecht,YanaBromberg,andBurkhardRost(2013). News from the protein mutability landscape. Journal of Molecular Bi- ology, 425(21):3937-48.

• Maximilian Hecht,YanaBromberg,andBurkhardRost(2015). Bet- ter prediction of functional effects for sequence variants. BMC Genomics, 16(Suppl 8):S1.

64 News from the protein mutability landscape

In this publication we present a novel approach towards variant effect pre- diction by expanding the view from single variants to complete in-silico mu- tagenesis predictions. We widely review the field of variant effect prediction by presenting methods and discussing the current state-of-the-art. We show that methods are able to computationally predict the effects of most vari- ants and that effect predictions tend to cluster around important functional regions. Although most methods perform comparably there is a significant difference between individual methods’ predictions, which are attributable to the features and data used in method development. In this article we more- over present functional effect predictions for a complete in-silico mutagenesis experiment in the human beta-2-adrenergic receptor. We show the generated predictions as a heat map and how these correlate with secondary structure elements in the protein. To further visualize our results, we used the known experimental 3D structure and highlighted it according to our predictions thus making high-impact regions in the protein immediately visible in the structure. This example shows how computational predictions of the muta- bility landscape can translate into novel hypotheses on protein function and possibly aid drug development.

Author contributions: Maximilian Hecht, Yana Bromberg and Burkhard Rost conceived the study design and methodologies. Maximilian Hecht col- lected the necessary data and carried out the experiments. Results were analyzed by Maximilian Hecht, Yana Bromberg and Burkhard Rost. The manuscript was written, revised and approved by Maximilian Hecht, Yana Bromberg and Burkhard Rost.

65 Better prediction of functional effects for sequence vari- ants

This article describes the development and evaluation of our novel method SNAP2, a machine learning-based classifier for the prediction of functional effects of sequence variants. We show that SNAP2 compares favorably with other state-of-the-art methods and that it is capable of producing non-random predictions for orphan proteins. Our analysis shows that certain variants cannot be predicted equally well by different methods and that our novel method performs remarkably well on variants where multiple methods dis- agree. Moreover, we show that a naïve method combination, that is often employed by biologists and geneticists, performs worse than the individual methods on our data. We present the improvements made to our method and discuss how these affect the results in different testing scenarios. We also discuss how data bias affects computational predictions of variant effects and what users can expect from using these methods.

Author contributions: Maximilian Hecht, Yana Bromberg and Burkhard Rost conceived this work and designed the experiments. Maximilian Hecht wrote the software and carried out the experiments. Maximilian Hecht and Yana Bromberg collected the data and analyzed the results together with Burkhard Rost. The manuscript was written, revised and approved by Max- imilian Hecht, Yana Bromberg and Burkhard Rost.

66 Perspecve

News from the Protein Mutability Landscape

Maximilian Hecht 1, Yana Bromberg 2 and Burkhard Rost 1,3,4

1 - Department of Bioinformatics and Computational Biology I12, Technische Universität München, Boltzmannstrasse 3, 85748 Garching, Germany 2 - Department of Biochemistry and Microbiology, Rutgers University, 76 Lipman Drive, New Brunswick, NJ 08901, USA 3 - Institute of Advanced Study, Technische Universität München, Boltzmannstrasse 3, 85748 Garching, Germany 4 - Institute for Food and Plant Sciences, Life Science Center Weihenstephan, Alte Akademie 8, 85354 Freising, Germany

Correspondence to Maximilian Hecht: [email protected] http://dx.doi.org/10.1016/j.jmb.2013.07.028 Edited by E. Alexov

Abstract

Some mutations of protein residues matter more than others, and these are often conserved evolutionarily. The explosion of deep sequencing and genotyping increasingly requires the distinction between effect and neutral variants. The simplest approach predicts all mutations of conserved residues to have an effect; however, this works poorly, at best. Many computational tools that are optimized to predict the impact of point mutations provide more detail. Here, we expand the perspective from the view of single variants to the level of sketching the entire mutability landscape. This landscape is defined by the impact of substituting every residue at each position in a protein by each of the 19 non-native amino acids. We review some of the powerful conclusions about protein function, stability and their robustness to mutation that can be drawn from such an analysis. Large-scale experimental and computational mutagenesis experiments are increasingly furthering our understanding of protein function and of the genotype–phenotype associations. We also discuss how these can be used to improve predictions of protein function and pathogenicity of missense variants. © 2013 The Authors. Published by Elsevier Ltd. Open access under CC BY license.

Introduction Most likely, the precise ratios heavily depend on the particular protein under investigation (e.g., the human immunodeficiency virus gp120 is likely to be much more robust against mutation than p53 simply Lewis Carroll—Through the Looking Glass: because many of the p53 residues are involved in “Why, sometimes I've believed as many as six binding and therefore “vulnerable” to mutation). A key impossible things before breakfast.” aspect in the development of strategies for diagnosis and treatment of genetic diseases is to further our understanding of the underlying mechanisms that link Understanding Genetic Diversity—A genotypes and phenotypes. Central Challenge for Deep Sequencing Sequence variants such as single nucleotide poly- morphisms (SNPs) are the most prevalent form of Elucidating which human genetic variations have human genetic variation [3]. It has been estimated that which phenotypic effect and how the variation more than 11 million SNPs will be observed among impacts disease is one of the major scientific people; 7 million of these are frequent (common challenges in the 21st century. While the vast majority variants), that is, occur with a minor allele frequency of genetic variants are hypothesized to be neutral [1], above 5%, while the remaining (minor allele frequen- that is, are assumed not to contribute to any cy, b5%) are considered as rare [4]. Many of both rare phenotype, the relative percentage of neutral, near- and common variants may be instrumental in defining neutral [2] and non-neutral variants remains unclear. individual's differences [5–7]. Increasingly, however,

0022-2836 © 2013 The Authors. Published by Elsevier Ltd. Open access under CC BY license. J. Mol. Biol. (2013) 425, 3937–3948 3938 Perspective: The Protein Mutability Landscape researchers begin to suspect that every possible point example, by altering protein function directly or mutation might ultimately be observed. indirectly by impacting structure and/or binding. For medical biology, non-synonymous SNPs Such changes can lead to pathogenic phenotypes (nsSNPs) or missense variants that change the [8].Recentstudiessuggestthateverypairof amino acid sequence of the protein are particularly individuals differs by almost one amino acid variant interesting. These variants are more likely to affect in each protein while individuals have about 1.2–1.7 function than synonymous SNPs. Single-amino-acid variants (nsSNPs) that are novel with respect to both variants can change the resulting phenotype, for parents, that is, not observed in either parent [7].

Fig. 1 (legend on next page) Perspective: The Protein Mutability Landscape 3939

Knowing how these changes affect function can changed and complete mutagenesis in which each give, for instance, insight into a child's disease position is replaced by every non-native amino acid. predisposition. However, we largely discard effects of varying GWAS (genome-wide association studies) has multiple residues at the same time. Obviously, evolved as the most widely used approach relating even such a reduced version of the protein mutability human genetic variation to phenotypic diversity [9]. landscape already carries very important information Results of these studies greatly increased our about protein function. We attempt to sketch how this understanding of molecular pathways underlying landscape brings about new challenges and new specific human diseases. Most common SNPs possibilities. In fact, we have to learn to understand have been assessed for statistical associations with what we see in this new looking glass. many complex traits and common diseases. Howev- er, for the vast majority of complex trait associations, the underlying mechanisms remain unknown, and for Most Variant Effects Predicted Correctly many of the known common SNPs related to In Silico complex disease phenotypes, GWAS misses the known associations [10]. Rare variants are missed Several computational methods predict the effect due to limited numbers [4,11]. The hope is that deep of variants (SAAS or nsSNPs). Some predict the sequencing will address some of these issues by effect on protein function {e.g., sorting intolerant from revealing variations between the full sequences. tolerant (SIFT) [22,23] or screening for non-accep- table polymorphisms (SNAP) [24,25]}, others predict the effect with respect to their pathogenicity (e.g., Mutability Landscapes May Be THE Key MutPred [26], SNPs&GO [27], Mutation Assesor [28] to Understanding Diversity or MutationTaster [29]) and others yet predict the effect on protein structure directly [30] or cannot be Meanwhile, a different approach toward under- easily fit into these categories (e.g., PolyPhen-2 [31] standing the genotype–phenotype association is to or PON-P [32]). These methods use a diverse study functionally important regions, robustness and spectrum of input features, typically combining evolvability of proteins by investigating the mutability evolutionary information with biophysical features landscape. The mutability landscape of, for example, and experimental information about protein structure protein function, can be defined as the effect of all and function where available. There are several possible point mutations/variants upon protein func- outstanding reviews on the prediction of functional tion, that is, of substituting the native amino acid at effects [33–35], and the community puts great effort each residue position against all 19 non-native into assessing such predictions. For instance, CAGI amino acids, one at a time (Fig. 1 [20,21]). Studying (Critical Assessment of Genome Interpretation) such a landscape may help us a lot in understanding aspires to assessing method performance in pre- protein function and evolution. dicting phenotypic impacts of genomic variation [36]. In this review, we exclusively focus on the effects More formal studies assess predictors specifically of varying the protein sequence by single-amino-a- with respect to their performance in identifying cid substitutions (SAASs). In order to avoid obscur- pathogenic variants [37]. The results of these studies ing acronyms, we will simply use the term variant as suggest that each method has strengths and synonym for SAAS. We review comprehensive weaknesses, possibly resulting from the data used mutagenesis in which each position in a protein is for development and the types of information

Fig. 1. Mutability landscape of a protein. The top line (a) sketches the sequence and secondary structure (transmembrane helices) of the adrenergic receptor (ADRB2_HUMAN, ID: P07550 [12]; assignment of secondary from the high-resolution structure PDB ID: 3PDS [13] using DSSP [14]). For each of the 413 residues (x-axis) of the receptor, (b) shows the predictions for the effects of all 19 non-native variants (y-axis; the stronger the predicted effect, the redder; the stronger the predicted neutrality, the greener). (c) Zoom of the fragment spanning from residue 91 to residue 210 and the relative positions of binding sites (D113 and T118) and proposed target residues (W99 and Y199). (d) The predicted functional effect of variants for the 3D structure (PDB ID: 2RH1 [15]); both known binding sites (positions 113 and 118) are shown as blue spheres. Shown are the average scores [SNAP score ranges from the most neutral (−100) to the strongest effect (+100)] over amino acids that would be considered as “neutral” given the biophysical amino acid features as captured in the PHAT substitution matrix [16] for transmembrane regions and in the BLOSUM62 matrix [17] for all other residues. Red depicts high average scores (score N 60), orange depicts intermediate scores (40 b score b 60), yellow depicts low scores (20 b score b 40) and gray marks sites with SNAP scores b20 (predicted as neutral or with little effect). (e) The 3D structure (PDB ID: 3PDS [13]) with a bound agonist and the two residues (W99 and Y199) that exhibit a high overall predicted effect and are under strong evolutionary constraint (predicted by EVfold [18,19]) with each other and the two binding sites. 3940 Perspective: The Protein Mutability Landscape included in prediction. Good in silico methods role within an interaction pathway [49]. Typically, correctly predict the experimentally observed effects only a few residues in a protein interaction interface for most variants. contribute most of the binding affinity. These can be Typically, only 5–10% of all residues relate directly identified by the change in binding free energy upon to function [38–40]. Some of these are revealed in mutation to alanine and are often referred to as the substitution profiles of protein families. A whole binding hot spots [50,51]. One definition for a hot generation of methods targets the prediction of such spot is that the binding free energy is altered by functional sites through analyzing evolutionary infor- ≥1 kcal/mol upon mutation [52]. While the precise mation (e.g., ET [41], INTREPID [42] or DISCERN definition might be subject to debate, hot spots are [43]). Since these methods predict functional sites, it “real” in the sense that they can be predicted is not surprising that they also capture some of the accurately by methods that do not even assume signal that variants impact function (V. Link and K. that hot spots exist [53].Wemoreovermight Sjölander, unpublished results). Thus, the ability to anticipate the observation that the residues or predict the functional effect of variants is clearly positions contributing most to the energy of binding related to predicting protein function. might also be the residues used more frequently Predictions of variant effects have helped us when choosing sites that bind to many binding prioritize mutations for large-scale reverse genetics partners. Indeed, hot spots have been observed to projects, where mutations are randomly introduced have a high propensity for interaction with multiple into the genome. An example for such strategies is partners [54]. TILLING (targeting induced local lesions in ge- Substituting the native amino acid by alanine is nomes), a method that combines chemical muta- typically experimentally easiest and expected to be genesis with a sensitive DNA screening technique in most revealing. Thus, alanine scans are most order to allow direct identification of mutations in a common, but increasingly, glycine, proline and specific gene. TILLING uses the functional effect cysteine scans are also carried out [55–57]. In predictions from SIFT [22,23] to prioritize the these scans, all native residues of a protein are post-processing of variants [44]. Another important individually substituted by one of the above amino application is to the assessment of disease-related acids and the effect upon a given functional assay is human variants [20,45,46]. For instance, mutations measured. ASEdb, the Alanine Scanning Energetics that directly cause a disease, such as those found in database, provides a central repository for such data OMIM [47], are clearly identified by methods that [58]. Residues that significantly change protein predict the functional effect of variants [48]. Existing function are usually considered important. What in silico methods can even be good enough to reveal constitutes a significant change depends on the problems with experimental data: today's assess- type of function. In silico predictions suggest that ment of functional neutrality of variants seems when looking at the effect of all 19-non-native particularly problematic [100] mutations, alanine substitutions are most represen- tative (correlate most with the average over all mutations) [21]. Although this observation is based Peeking Experimentally into the Protein on one single protein (HXK4) and may therefore not Mutability Landscape be representative, the fact that in silico methods accurately reproduce such expert knowledge should be appreciated as an independent evidence of their Alanine scans reveal function and interaction success in predicting essential aspects of the hot spots mutability landscape.

The experimental study on how site-directed Mutability landscape constrained by correlation mutagenesis affects phenotypes may be THE most networks? essential experimental tool for determining protein function. By substituting residues that are assumed Comprehensive experimental mutagenesis stud- to be important and measuring substitution effect, ies confirmed that the effect of point mutations researchers identify the residues that are important (SAAS) upon function depends crucially on their for the hypothesized protein function. Over the last positions in the protein sequence [59–61]. Even decade, the power of experimental and computa- within a unit as familiar as the DNA binding domain of tional mutagenesis has grown considerably: a the Escherichia coli LacI [62,63] repressor, almost decade ago, many publications reported on single any variant can be tolerated at some positions while, point mutations; today, 50 times more may no longer at others, all variants affect function (Fig. 2a). Simple satisfy reviewers. structural constraints might suffice to explain this The ability of proteins to interact with substrates or variability: to accommodate the negatively charged other proteins is essential. The most important DNA, binding regions of the repressor contain function of a protein can therefore be defined as its positively charged residues. Furthermore, binding Perspective: The Protein Mutability Landscape 3941

Fig. 2. Mutagenesis of E. coli LacI repressor. At each position between residue 2 and 329, 12–13 amino acid substitutions are displayed as a bar. The height of a bar depicts the relative percentage of substitutions that alter the repressor function as determined (a) experimentally [59] or (b) by computational prediction using SNAP2. With a correlation of 0.76 over all residues and an accuracy of 78.2% over all variants, this constitutes a below-average prediction of SNAP2 (~82% estimated overall accuracy). requires helix formation. These two simple biophys- perfectly highlight the importance of analyzing the ical realities constrain the mutability landscape mutability landscape. significantly in a specific, identity-revealing manner Other recent studies carry the theme of coevol- like a fingerprint. The differential sensitivity to ving positions even further. Patterns of correlated mutation might just be a complex overlay of many mutations in the WW domain nearly suffice to such simple biophysical constraints. synthetize artificial WW domains with native-like The same constraints are written into the profile folding and function [79,80]. Applying statistical of evolutionary conservation of changes observed coupling analysis to the S1A protein family, the within families of related proteins [64–66]. These Ranganathan laboratory introduced the “sector evolutionary imprints are strong enough to aid the hypothesis” [81] that proteins are organized into prediction of protein structure [67–70] and function distinct subunits or networks (sectors) of coevolving [39]. One particular idea that uses the constraints residues that are essential to structure and function. imposed by the mutability landscape is that of Such sectors involve only about every fifth residue; compensating/correlated mutations [71–73].To they are built around active sites, and they connect simplify this, imagine a salt bridge, that is, the to other functional sites distant in sequence and interaction between a positively charged residue structure through “networks” of contiguous residue and a negatively charged residue. If the negative interactions in the protein core [82]. These networks one is mutated into a positively charged amino acid, of coevolving residues may have resulted from the the affected protein may malfunction. A compensa- need for rapid adaptive variation arising from tory mutation that also flips the charge of the fluctuating selection pressure and that the organi- positive position will again allow salt-bridge forma- zation into networks of cooperatively acting resi- tion. If we could identify correlated mutations, we dues may provide such rapid adaptive potential could use them to predict inter-residue contacts through only a few mutations [82]. If so, structure within [72] and between proteins [74,75]. After and function may mostly be affected by mutations at many years of development [18,76], this concept sector positions while non-sector positions may has finally brought about de novo predictions for tolerate variation. This hypothesis was tested three-dimensional (3D) structure of globular [77] through a complete single mutagenesis (individually and large membrane proteins [18,19], several of substituting each residue by all 19 non-native amino those are not similar to any protein structure we acids) in one representative member of the PDZ know today. Such predictions may even lead the family (PSD95pdz3) [82]. The study showed that the way beyond structure [19]: many residues that are statistical correlation between mutations with signif- evolutionarily coupled and not close in space may icant functional effect and sector positions was very be relevant for protein function. In fact, in a study of strong; it was, in fact, stronger than that between over 14,000 variants related to disease from over mutations in the protein core (buried positions) and 1000 human proteins, correlated positions positions with ligand contacts. Moreover, a combi- appeared significantly more likely to harbor disease nation of two mutations at sector positions was mutations than average positions [78]. Compensa- sufficient to change the binding specificity of tory mutations involve coupled variants and might PSD95pdz3 for a class-switching ligand. This adap- rashly be considered to go beyond the focus of this tion is exclusively initiated through mutations in the review. We show that the correlated mutations sector. While awaiting large-scale confirmation, 3942 Perspective: The Protein Mutability Landscape these findings already highlight the importance of each method (Fig. 3; note that not all of those annotating correlated mutational behavior for the variants can be observed since not all amino acids prediction of pathogenicity/functional effects of can be transformed into all others through a SNP). missense variants. Although SNAP predicts more effect substitutions than SIFT, trends appear to be largely similar. For instance, both methods predict substitutions from Penetrating the Protein Mutability and to tryptophan as highly damaging on average. Landscape In Silico This is plausible due to its structural importance to proteins. Similarly, both methods predict substitu- tions of phenylalanine by any other residue, except Predicting the mutability landscape of the for leucine and tyrosine, as rather damaging. human exome Phenylalanine is preferentially exchanged with tyrosine, which differs only in that it contains a Ultimately, we want to study the entire protein hydroxyl group in place of the ortho hydrogen on the mutability landscapes for at least some hundreds of benzene ring. The preference for leucine seems representative proteins by assaying changes in plausible due to its hydrophobic character. Exam- protein function and their impact upon the organism. ples of SNAP and SIFT differences are in the Despite tremendous breakthroughs in high-through- predictions for substitutions of arginine and by put experimentation, this analysis falls more into the proline. SNAP might be closer to the truth for these world of Lewis Carroll than into that of a scientific two because they may be difficult to treat via a purely grant proposal. However, such a landscape can be evolution based method. “To proline” mutations are easily predicted, for example, for all human proteins. likely to be rare due to their disruptions. For arginine, The downside is that we do not yet fully understand the explanation seems less clear. How can we cast how to interpret the results. Nevertheless, in the such predictions into new methods that, for example, context of understanding the deep sequencing data, predict active sites? How can we use them to guide such views are needed. protein design? Finding the causal variants for a particular disease continues to be a challenging endeavor despite the Outcome of alanine scans predicted continued decrease of the cost in sequencing entire genomes and entire exomes [83].Accordingly, Methods that predict functional effects have rarely researches prioritize zooming in onto candidate been assessed in large-scale mutagenesis experi- variants in these studies by including computational ments. One reason is obviously the shortage of such effect predictions [84]. It has been suggested to experiments. Another might be the perception that combine several prediction methods (e.g., through computational methods typically predict neither the majority vote) in order to overcome individual severity nor the direction of the effect (increase or weaknesses and obtain most reliable predictions decrease of function/affinity). It is true that today's [85]. The dbNSFP [86] database is built to simplify prediction methods cannot directly distinguish be- this endeavor by providing effect predictions and tween variants that increase and those that de- scores from various methods for every potential crease binding. Instead, both tend to be predicted as variant in the human genome (approximately 76 effects. Nevertheless, prediction scores (i.e., the million variants). Differences between methods signal strength) on average correlate with the become most apparent when comparing predictions severity of the effect [24]. The concept of “impor- on this large scale. The pairwise agreement between tance for function” never entered the data set choice the four methods in dbNSFP ranges from 61% to or development phase when creating SNAP. Still, 77%. The fraction of all potential substitutions when applied for residues in ASEdb that the method predicted to be deleterious by individual methods had never “seen” before, it correctly identified over ranges from 40% to 56%, suggesting that methods 70% of the functionally important sites and correctly disagree strongly. Overall, methods accurately predicted many to-alanine variants (up to 84%, predict Mendelian disease-causing variants to depending on cutoff) [21]. strongly effect function. Unfortunately, this does not imply that the same methods can find a single Comprehensive in silico mutagenesis helps disease-causing variant among the thousands of studying disease-related proteins variants observed between any pair of individuals from the same population. A detailed study of the human melanocortin 4 To visualize the predictive behavior of the two receptor (hMC4R) demonstrated the value of study- widely used methods SNAP [24,25] and SIFT ing the mutability landscape in silico [20]. hMC4R is [22,23], we compiled a pairwise amino acid substi- related to diabetes and to weight regulation. Muta- tution matrix over all theoretically possible variants in tions in hMC4R have been shown to account for the human proteome based on the predictions of approximately 3% of all severe obesity cases (body Perspective: The Protein Mutability Landscape 3943

Fig. 3. Effect of pairwise amino acid substitutions in the human exome. Shown is the fraction of substitutions predicted to have an effect for every substitution of every amino acid (y-axis) by any other (x-axis) in the entire human exome. Results were obtained by locally calculating the predictions for (a) SNAP and (b) SIFT for every possible SAAS in every reviewed protein in the Swiss-Prot database [12] with human origin. Cells are colored according to the fraction of deleterious predictions with high values in red and low values in white. For every prediction for substitutions of amino acid “m” (y-axis) by “n” (x-axis), we applied the default threshold for each method (SNAP, 0; SIFT, 0.05). mass index, N40), and consequently, they are the seven transmembrane helices. SNAP assessed the most frequent cause of monogenic obesity in functional essentiality for each of the 332 residues in humans [87,88]. MC4R, a member of the G-pro- hMC4R and the functional impact of all possible tein-coupled receptor (GPCR) family, is an integral variants; predictions were compared to all available membrane protein that crosses the lipid bilayer with experimental data. The predictions of variants with 3944 Perspective: The Protein Mutability Landscape functional effect and predictions of important regions were shown to increase pindolol-stimulated cAMP in hMC4R largely agreed with experimental evi- accumulation [94,95]. Although located in the cyto- dence [20]. Toward this end, we down-weighted plasmic region, strong signals also highlight (1) mutations expected to be neutral for structure (e.g., Y141, for which a substitution by F is known to hydrophobic to hydrophobic in membrane regions). abolish insulin-induced tyrosine phosphorylation and Despite this scoring, the computational mutagen- insulin-induced receptor super sensitization [96], esis predicted as many as 118 residues to be and (2) C341, for which a mutation to G was functionally important. This seems a substantial shown to alter binding (uncoupling of receptor) [97]. over-prediction. Indeed, so far, we have experimen- Thus, 46 sites (57 predicted, 11 experimentally tal evidence for only 18 residues to be important for observed) remain with strong effect predictions for function; 15 of these 18 were in the set of 118 which no variants have been tested experimentally. residues predicted to have strong impact [20], which Again we observe a rather large discrepancy is not an impressive performance but much higher between observed and predicted “sensitive to than the random 6 in 18. The nsSNP database of mutation” positions. Some of these predictions will effects (SNPdbe [89]) provides experimental links to likely just be false positives. However, due to being obesity for 27 residues, 17 of those were in the 118. located in the protein core, others may in fact affect Only one single residue is found in both sets. Thus, function by structural alterations/misfolding. Applica- in silico mutagenesis correctly predicted 31 of the tion of an even more stringent threshold (score N 80 known 44 positions reported to influence function at N95% expected accuracy) weeds out 38 of these, if mutated. leaving 12 residues with very strong effect pre- What about the 74 residues with predictions but dictions and without current variant annotations. We without observation (118 predicted, 44 so far studied these also in light of EVfold [18,19] (predic- experimentally known)? At this point, 74 mutations tion of inter-residue contacts through correlated constitute a relatively large number of high-effect mutations). Only 2 of the 12 had residue couplings predictions, which cannot be verified due to lack of in the realm of the top 5%, namely, W99 and Y199 data. Re-evaluating the predictions, we might (Fig. 1e). We could not find any experimental apply a more stringent threshold to consider an annotation about these two. However, a visual effect important. For instance, at a threshold with inspection (Fig. 1e) of a 3D structure with irreversibly an expected accuracy N95%, 22 residues are bound agonist (PDB ID: 3PDS [13]) appears to predicted to impact function; 10 of those corre- suggest the two as reasonable targets for experi- spond to experimentally known sites, 1 corre- mental verification. sponds to a site implicated with obesity and 11 This detailed view of the beta-2-adrenergic recep- remain without experimental annotations. These tor provides another example for how useful it might might constitute ideal starting points for designing be to analyze the mutability landscape through a new experiments [20]. complete in silico mutagenesis; it highlights func- tionally important regions and may help in experi- Detailed analysis of mutability landscape for a ment design to probe function locally or test entire GPCR, the beta-2-adrenergic receptor regions for docking and drug development. The example also suggests that variant effect prediction To visualize the results of such an in silico might benefit from including inter-residue contact/ mutagenesis, we applied SNAP2 (M.H., unpublished evolutionary coupling predictions. results) to another GPCR, the beta-2 adrenergic receptor for which experimental high-resolution 3D structures are available in the Protein Data Bank Perspective (PDB) [90] (PDB ID: 2RH1 [15], Fig. 1d; PDB ID: 3PDS [13], Fig. 1e). The predicted high-effect Comprehensive mutagenesis experiments have regions cluster around the binding sites and are furthered our understanding of protein function and significantly more abundant on the inside (facing the continue to provide insight into the mechanisms of binding sites) than elsewhere. Strong effects pathogenicity and adaption. Novel methodologies (SNAP N 60; note: score ranges [−100,+100]) are and technical advancements reduce the cost of predicted for 57 residues (Fig. 1d, red highlighting) experimental mutagenesis and enable research that including the two Swiss-Prot annotated [12,91] was previously impossible. Still, studying the coop- binding sites D113 and T118. Nine more sites with erative behavior of amino acids and the combined functional effect annotation in SNPdbe [89] were effect of mutations will remain a laborious and costly found. Among these, we find (1) D79, for which a task. This is where computational methods are mutation to N was shown to affect binding of useful to predict the effects of variants upon protein catecholamines and to produce an uncoupling function, structure and pathogenicity. These between the receptor and stimulatory G-proteins methods have grown in accuracy both in predicting [92,93], and (2) D130, for which mutations to A or N functional effect (Fig. 2b) and disease-causing Perspective: The Protein Mutability Landscape 3945 mutations. Can they reach the next level? Can they Abbreviations used: be used to study the mutability landscape of a 3D, three-dimensional; hMC4R, human melanocortin protein, that is, to unravel the effects of all possible 4receptor;GPCR,G-protein-coupledreceptor;nsSNP, variants? Here, we argue that the study of such a non-synonymous SNP; PDB, Protein Data Bank; SAAS, mutability landscape provides immensely important single-amino-acid substitution; SIFT, sorting intolerant value and that currently neither experimental nor from tolerant; SNAP, screening for non-acceptable computational methods completely mine the poten- polymorphisms; SNP, single nucleotide polymorphism. tial of studying this landscape. Experimental methods remain constrained by the substantial amount of resources such studies would consume. Computational methods remain constrained by the degree to which we can interpret their results. At this point, lack of comprehensive experimental data seems a crucial problem for the development of better computational tools. However, in silico ana- lyses of mutability landscape already help to design experiments and are crucial for the intelligent References interpretation of deep sequencing/next generation sequencing data. [1] Kimura M. Evolutionary rate at the molecular level. Nature 1968;217:624–6. [2] Ohta T. Near-neutrality in evolution of genes and gene regulation. Proc Natl Acad Sci USA 2002;99:16134–7. [3] Dunham I, Kundaje A, Aldred SF, Collins PJ, Davis CA, Doyle F, et al. An integrated encyclopedia of DNA elements Acknowledgements in the human genome. Nature 2012;489:57–74. [4] Frazer KA, Murray SS, Schork NJ, Topol EJ. Human Thanks to Tim Karl, Guy Yachdav and Laszlo genetic variation and its contribution to complex traits. Nat Kajan (Technische Universität München) for invalu- Rev Genet 2009;10:241–51. able help with hardware and software; to Marlena [5] Tennessen JA, Bigham AW, O'Connor TD, Fu W, Kenny Drabik (Technische Universität München) for admin- EE, Gravel S, et al. Evolution and functional impact of rare istrative support; and to Thomas Hopf and Laszlo coding variation from deep sequencing of human exomes. Kajan (both Technische Universität München) for Science 2012;337:64–9. helpful discussions and help with the manuscript. [6] O'Roak BJ, Vives L, Girirajan S, Karakoc E, Krumm N, Coe BP, et al. Sporadic autism exomes reveal a highly interconnected Thanks to the developers of PyMOL [98] (Fig. 1d and protein network of de novo mutations. Nature 2012;485: e) and Matrix2png [99] (Fig. 1b and c) for providing 246–50. great tools. This work was supported by a grant from [7] Rauch A, Wieczorek D, Graf E, Wieland T, Endele S, the Alexander von Humboldt Foundation through the Schwarzmayr T, et al. Range of genetic mutations German Ministry for Research and Education associated with severe non-syndromic sporadic intellectual (Bundesministerium fuer Bildung und Forschung). disability: an exome sequencing study. Lancet 2012;380: Last, not the least, thanks to all those who deposit 1674–82. their experimental data in public databases and to [8] Thusberg J, Vihinen M. Pathogenic or not? And if so, then those who maintain these databases. how? Studying the effects of missense mutations using bioinformatics methods. Hum Mutat 2009;30:703–14. [9] McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JP, et al. Genome-wide association studies for Supplementary Data complex traits: consensus, uncertainty and challenges. Nat Rev Genet 2008;9:356–69. Supplementary data to this article can be found [10] Kiezun A, Garimella K, Do R, Stitziel NO, Neale BM, online at http://dx.doi.org/10.1016/j.jmb.2013.07.028 McLaren PJ, et al. Exome sequencing and the genetic basis of complex traits. Nat Genet 2012;44:623–30. Received 30 April 2013; [11] Samani NJ, Erdmann J, Hall AS, Hengstenberg C, Mangino M, Received in revised form 8 July 2013; Mayer B, et al. Genomewide association analysis of coronary Accepted 19 July 2013 artery disease. N Engl J Med 2007;357:443–53. Available online 26 July 2013 [12] Bairoch A, Apweiler R. The SWISS-PROT protein se- quence database and its supplement TrEMBL in 2000. Nucleic Acids Res 2000;28:45–8. Keywords: [13] Rosenbaum DM, Zhang C, Lyons JA, Holl R, Aragao D, Arlow complete single mutagenesis; DH, et al. Structure and function of an irreversible agonist- alanine scanning; beta(2) adrenoceptor complex. Nature 2011;469:236–40. in silico mutagenesis; [14] Kabsch W, Sander C. Dictionary of protein secondary exome-wide mutagenesis; structure: pattern recognition of hydrogen bonded and SNP effects geometrical features. Biopolymers 1983;22:2577–637. 3946 Perspective: The Protein Mutability Landscape

[15] Cherezov V, Rosenbaum DM, Hanson MA, Rasmussen [35] Mah JT, Low ES, Lee E. In silico SNP analysis and SG, Thian FS, Kobilka TS, et al. High-resolution crystal bioinformatics tools: a review of the state of the art to aid structure of an engineered human beta2-adrenergic G drug discovery. Drug Discovery Today 2011;16:800–9. protein-coupled receptor. Science 2007;318:1258–65. [36] Oetting WS. Exploring the functional consequences of [16] Ng PC, Henikoff JG, Henikoff S. PHAT: a transmembrane- genomic variation: the 2010 Human Genome Variation specific substitution matrix. Predicted hydrophobic and Society Scientific Meeting. Hum Mutat 2011;32:486–90. transmembrane. Bioinformatics 2000;16:760–6. [37] Thusberg J, Olatubosun A, Vihinen M. Performance of [17] Henikoff S, Henikoff JG. Amino acid substitution matrices mutation pathogenicity prediction methods on missense from protein blocks. Proc Natl Acad Sci USA 1992;89: variants. Hum Mutat 2011;32:358–68. 10915–9. [38] Lesk AM, Levitt M, Chothia C. Alignment of the amino acid [18] Marks DS, Hopf TA, Sander C. Protein structure prediction sequences of distantly related proteins using variable gap from sequence variation. Nat Biotechnol 2012;30: penalties. Protein Eng 1986;1:77–8. 1072–80. [39] Rost B, Liu J, Nair R, Wrzeszczynski KO, Ofran Y. Automatic [19] Hopf TA, Colwell LJ, Sheridan R, Rost B, Sander C, Marks prediction of protein function. Cell Mol Life Sci 2003;60: DS. Three-dimensional structures of membrane proteins 2637–50. from genomic sequencing. Cell 2012;149:1607–21. [40] Rost B, O'Donoghue S, Sander C (1998). Midnight zone of [20] Bromberg Y, Overton J, Vaisse C, Leibel RL, Rost B. In protein structure evolution. EMBL Heidelberg. silico mutagenesis: a case study of the melanocortin 4 [41] Lichtarge O, Bourne HR, Cohen FE. Evolutionarily conserved receptor. FASEB J 2009;23:3059–69. Galphabetagamma binding surfaces support a model of the G [21] Bromberg Y, Rost B. Comprehensive in silico mutagenesis protein-receptor complex. Proc Natl Acad Sci 1996;93: highlights functionally important residues in proteins. 7507–11. Bioinformatics 2008;24:i207–12. [42] Sankararaman S, Sjolander K. INTREPID—INforma- [22] Kumar P, Henikoff S, Ng PC. Predicting the effects of coding tion-theoretic TREe traversal for Protein functional site non-synonymous variants on protein function using the IDentification. Bioinformatics 2008;24:2445–52. SIFT algorithm. Nat Protoc 2009;4:1073–81. [43] Sankararaman S, Sha F, Kirsch JF, Jordan MI, Sjolander K. [23] Ng PC, Henikoff S. SIFT: predicting amino acid changes Active site prediction using evolutionary and structural that affect protein function. Nucleic Acids Res 2003;31: information. Bioinformatics 2010;26:617–24. 3812–4. [44] Henikoff S, Comai L. Single-nucleotide mutations for plant [24] Bromberg Y, Rost B. SNAP: predict effect of non-synon- functional genomics. Annu Rev Plant Biol 2003;54: ymous polymorphisms on function. Nucleic Acids Res 375–401. 2007;35:3823–35. [45] Zimprich A. Genetics of Parkinson's disease and essential [25] Bromberg Y, Yachdav G, Rost B. SNAP predicts effect of tremor. Curr Opin Neurol 2011;24:318–23. mutations on protein function. Bioinformatics 2008;24: [46] Zimprich A, Benet-Pages A, Struhal W, Graf E, Eck SH, 2397–8. Offman MN, et al. A mutation in VPS35, encoding a subunit [26] Li B, Krishnan VG, Mort ME, Xin F, Kamati KK, Cooper DN, of the retromer complex, causes late-onset Parkinson et al. Automated inference of molecular mechanisms of disease. Am J Hum Genet 2011;89:168–75. disease from amino acid substitutions. Bioinformatics [47] Amberger J, Bocchini CA, Scott AF, Hamosh A. McKusick's 2009;25:2744–50. Online Mendelian Inheritance in Man (OMIM). Nucleic Acids [27] Calabrese R, Capriotti E, Fariselli P, Martelli PL, Casadio R. Res 2009;37:D793–6. Functional annotations improve the predictive score of [48] Schaefer C, Bromberg Y, Achten D, Rost B. Disease-re- human disease-related mutations in proteins. Hum Mutat lated mutations predicted to impact protein function. BMC 2009;30:1237–44. Genomics 2012;13:S11. [28] Reva B, Antipin Y, Sander C. Predicting the functional [49] Eisenberg D, Marcotte EM, Xenarios I, Yeates TO. Protein impact of protein mutations: application to cancer genomics. function in the post-genomic era. Nature 2000;405:823–6. Nucleic Acids Res 2011;39:e118. [50] Bogan AA, Thorn KS. Anatomy of hot spots in protein [29] Schwarz JM, Rödelsperger C, Schuelke M, Seelow D. interfaces. J Mol Biol 1998;280:1–9. MutationTaster evaluates disease-causing potential of [51] Clackson T, Wells JA. A hot spot of binding energy in a sequence alterations. Nat Methods 2010;7:575–6. hormone-receptor interface. Science 1995;267:383–6. [30] Schaefer C, Rost B. Predict impact of single amino acid [52] Kortemme T, Baker D. A simple physical model for binding change upon protein structure. BMC Genomics 2012;13: energy hot spots in protein–protein complexes. Proc Natl S4. Acad Sci USA 2002;99:14116–21. [31] Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, [53] Ofran Y, Rost B. Protein–protein interaction hot spots Gerasimova A, Bork P, et al. A method and server for carved into sequences. PLoS Comput Biol 2007;3:e119. predicting damaging missense mutations. Nat Methods [54] DeLano WL, Ultsch MH, de Vos AM, Wells JA. Convergent 2010;7:248–9. solutions to binding at a protein–protein interface. Science [32] Olatubosun A, Valiaho J, Harkonen J, Thusberg J, Vihinen 2000;287:1279–83. M. PON-P: integrated predictor for pathogenicity of mis- [55] Konishi S, Iwaki S, Kimura-Someya T, Yamaguchi A. sense variants. Hum Mutat 2012;33:1166–74. Cysteine-scanning mutagenesis around transmembrane [33] Stitziel NO, Kiezun A, Sunyaev S. Computational and segment VI of Tn10-encoded metal-tetracycline/H(+) anti- statistical approaches to analyzing variants identified by porter. FEBS Lett 1999;461:315–8. exome sequencing. Genome Biol 2011;12:227. [56] Qin L, Cai S, Zhu Y, Inouye M. Cysteine-scanning [34] Cline MS, Karchin R. Using bioinformatics to predict the analysis of the dimerization domain of EnvZ, an functional impact of SNVs. Bioinformatics 2011;27: osmosensing histidine kinase. J Bacteriol 2003;185: 441–8. 3429–35. Perspective: The Protein Mutability Landscape 3947

[57] Gardsvoll H, Gilquin B, Le Du MH, Menez A, Jorgensen TJ, [76] de Juan D, Pazos F, Valencia A. Emerging methods in Ploug M. Characterization of the functional epitope on the protein co-evolution. Nat Rev Genet 2013;14:249–61. urokinase receptor: complete alanine scanning mutagene- [77] Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, sis supplemented by chemical cross-linking. J Biol Chem Zecchina R, et al. Protein 3D structure computed from 2006;281:19260–72. evolutionary sequence variation. PLoS One 2011;6: [58] Thorn KS, Bogan AA. ASEdb: a database of alanine e28766. mutations and their effects on the free energy of binding in [78] Kowarsch A, Fuchs A, Frishman D, Pagel P. Correlated protein interactions. Bioinformatics 2001;17:284–5. mutations: a hallmark of phenotypic amino acid substitu- [59] Markiewicz P, Kleina LG, Cruz C, Ehret S, Miller JH. tions. PLoS Comput Biol 2010;6:e1000923. Genetic studies of the lac repressor. XIV. Analysis of 4000 [79] Socolich M, Lockless SW, Russ WP, Lee H, Gardner KH, altered Escherichia coli lac repressors reveals essential and Ranganathan R. Evolutionary information for specifying a non-essential residues, as well as “spacers” which do not protein fold. Nature 2005;437:512–8. require a specific sequence. J Mol Biol 1994;240:421–33. [80] Russ WP, Lowery DM, Mishra P, Yaffe MB, Ranganathan [60] Loeb DD, Swanstrom R, Everitt L, Manchester M, Stamper R. Natural-like function in artificial WW domains. Nature SE, Hutchison CA. Complete mutagenesis of the HIV-1 2005;437:579–83. protease. Nature 1989;340:397–400. [81] Halabi N, Rivoire O, Leibler S, Ranganathan R. Protein [61] Rennell D, Bouvier SE, Hardy LW, Poteete AR. Systematic sectors: evolutionary units of three-dimensional structure. mutation of bacteriophage T4 lysozyme. J Mol Biol Cell 2009;138:774–86. 1991;222:67–88. [82] McLaughlin RN, Poelwijk FJ, Raman A, Gosal WS, [62] Gottesman ME, Yarmolinsky MB. Integration-negative Ranganathan R. The spatial architecture of protein function mutants of bacteriophage lambda. J Mol Biol 1968;31: and adaptation. Nature 2012;491:138–42. 487–505. [83] Maxmen A. Exome sequencing deciphers rare diseases. [63] Gottesman S, Gottesman ME.Elementsinvolvedin Cell 2011;144:635–7. site-specific recombination in bacteriophage lambda. J [84] Li MX, Gui HS, Kwan JS, Bao SY, Sham PC. A comprehen- Mol Biol 1975;91:489–99. sive framework for prioritizing variants in exome sequencing [64] Epstein CJ. Role of the amino acid “code” and of selection studies of Mendelian diseases. Nucleic Acids Res 2012;40: for conformation in the evolution of proteins. Nature e53. 1966;210:25–8. [85] Chun S, Fay JC. Identification of deleterious mutations [65] Woese CR, Dugre DH, Dugre SA, Kondo M, Saxinger W. within three human genomes. Genome Res 2009;19: On the fundamental nature and evolution of the genetic 1553–61. code. Cold Spring Harbor Symp Quant Biol 1966;31: [86] Liu X, Jian X, Boerwinkle E. dbNSFP: a lightweight 723–36. database of human nonsynonymous SNPs and their [66] Dayhoff MO, Barker WC, Hunt LT. Establishing homolo- functional predictions. Hum Mutat 2011;32:894–9. gies in protein sequences. Method Enzymol 1983;91: [87] Farooqi IS, Keogh JM, Yeo GS, Lank EJ, Cheetham T, 524–45. O'Rahilly S. Clinical spectrum of obesity and mutations in the [67] Zvelebil MJ, Barton GJ, Taylor WR, Sternberg MJE. melanocortin 4 receptor gene. N Engl J Med 2003;348: Prediction of protein secondary structure and active sites 1085–95. using alignment of homologous sequences. J Mol Biol [88] Lubrano-Berthelier C, Le Stunff C, Bougneres P, Vaisse C. 1987;195:957–61. A homozygous null mutation delineates the role of the [68] Rost B, Sander C. Prediction of protein secondary structure melanocortin-4 receptor in humans. J Clin Endocrinol Metab at better than 70% accuracy. J Mol Biol 1993;232:584–99. 2004;89:2028–32. [69] Rost B. PHD: predicting one-dimensional protein structure [89] Schaefer C, Meier A, Rost B, Bromberg Y. SNPdbe: by profile based neural networks. Methods Enzymol constructing an nsSNP functional impacts database. 1996;266:525–39. Bioinformatics 2012;28:601–2. [70] Rost B, Sander C. Bridging the protein sequence-structure [90] Bernstein FC, Koetzle TF, Williams GJ, Meyer EF, Brice gap by structure predictions. Annu Rev Biophys Biomol MD, Rodgers JR, et al. The Protein Data Bank: a Struct 1996;25:113–36. computer-based archival file for macromolecular structures. [71] Altschuh D, Lesk AM, Bloomer AC, Klug A. Correlation of J Mol Biol 1977;112:535–42. co-ordinated amino acid substitutions with function in viruses [91] Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, related to tobacco mosaic virus. J Mol Biol 1987;193: Boeckmann B, et al. The Universal Protein Resource 693–707. (UniProt): an expanding universe of protein information. [72] Goebel U, Sander C, Schneider R, Valencia A. Correlated Nucleic Acids Res 2006;34:D187–91. mutations and residue contacts in proteins. Proteins Struct [92] Chung FZ, Wang CD, Potter PC, Venter JC, Fraser Funct Genet 1994;18:309–17. CM. Site-directed mutagenesis and continuous expres- [73] Altschuh D, Vernet T, Berti P, Moras D, Nagai K. sion of human beta-adrenergic receptors. Identification Coordinated amino acid changes in homologous protein of a conserved aspartate residue involved in agonist families. Protein Eng 1988;2:193–9. binding and receptor activation. J Biol Chem 1988;263: [74] Pazos F, Ranea JA, Juan D, Sternberg MJ. Assessing 4052–5. protein co-evolution in the context of the tree of life assists in [93] Moffett S, Rousseau G, Lagace M, Bouvier M. The the prediction of the interactome. J Mol Biol 2005;352: palmitoylation state of the beta(2)-adrenergic receptor 1002–15. regulates the synergistic action of cyclic AMP-dependent [75] Pazos F, Valencia A. In silico two-hybrid systemfor the protein kinase and beta-adrenergic receptor kinase in- selection of physically interacting protein pairs. Proteins volved in its phosphorylation and desensitization. J Neu- Struct Funct Genet 2002;47:219–27. rochem 2001;76:269–79. 3948 Perspective: The Protein Mutability Landscape

[94] Ballesteros JA, Jensen AD, Liapakis G, Rasmussen SG, lation and increased responsiveness of the human beta Shi L, Gether U, et al. Activation of the beta 2-adrenergic 2-adrenergic receptor. EMBO J 1995;14:5542–9. receptor involves disruption of an ionic lock between the [97] O'Dowd BF, Hnatowich M, Caron MG, Lefkowitz RJ, cytoplasmic ends of transmembrane segments 3 and 6. J Bouvier M. Palmitoylation of the human beta 2-adrenergic Biol Chem 2001;276:29171–7. receptor. Mutation of Cys341 in the carboxyl tail leads to an [95] Rasmussen SG, Jensen AD, Liapakis G, Ghanouni P, uncoupled nonpalmitoylated form of the receptor. J Biol Javitch JA, Gether U. Mutation of a highly conserved Chem 1989;264:7564–9. aspartic acid in the beta2 adrenergic receptor: constitutive [98] DeLano WL. The PyMOL molecular graphics system; 2002 . activation, structural instability, and conformational rearran- [99] Pavlidis P, Noble WS. Matrix2png: a utility for visualizing gement of transmembrane segment 6. Mol Pharmacol matrix data. Bioinformatics 2003;19:295–6. 1999;56:175–84. [100] Bromberg Y, Kahn PC, Rost B. Neutral and weakly [96] Valiquette M, Parent S, Loisel TP, Bouvier M. Mutation of nonneutral sequence variants may define individuality. tyrosine-141 inhibits insulin-promoted tyrosine phosphory- Proceedings of the National Academy of Sciences; 2003. Hecht et al. BMC Genomics 2015, 16(Suppl 8):S1 http://www.biomedcentral.com/1471-2164/16/S8/S1

RESEARCH Open Access Better prediction of functional effects for sequence variants Maximilian Hecht1*, Yana Bromberg2,3,4, Burkhard Rost1,4 From VarI-SIG 2014: Identification and annotation of genetic variants in the context of structure, function and disease Boston, MA, USA. 12 July 2014

Abstract Elucidating the effects of naturally occurring genetic variation is one of the major challenges for personalized health and personalized medicine. Here, we introduce SNAP2, a novel neural network based classifier that improves over the state-of-the-art in distinguishing between effect and neutral variants. Our method’s improved performance results from screening many potentially relevant protein features and from refining our development data sets. Cross-validated on >100k experimentally annotated variants, SNAP2 significantly outperformed other methods, attaining a two-state accuracy (effect/neutral) of 83%. SNAP2 also outperformed combinations of other methods. Performance increased for human variants but much more so for other organisms. Our method’s carefully calibrated reliability index informs selection of variants for experimental follow up, with the most strongly predicted half of all effect variants predicted at over 96% accuracy. As expected, the evolutionary information from automatically generated multiple sequence alignments gave the strongest signal for the prediction. However, we also optimized our new method to perform surprisingly well even without alignments. This feature reduces prediction runtime by over two orders of magnitude, enables cross-genome comparisons, and renders our new method as the best solution for the 10-20% of sequence orphans. SNAP2 is available at: https://rostlab.org/services/snap2web Definitions used: Delta, input feature that results from computing the difference feature scores for native amino acid and feature scores for variant amino acid; nsSNP, non-synoymous SNP; PMD, Protein Mutant Database; SNAP, Screening for non-acceptable polymorphisms; SNP, single nucleotide polymorphism; variant, any amino acid changing sequence variant.

Introduction polymorphisms, nsSNPs, or single amino acid substitu- Some sequence variations matter, changing native pro- tions, SAAS) on aspects such as protein structure [5], tein function or disease-causing potential, while others stability [6-8], binding affinity [9], and function [10,11]. do not [1]. The distinction between the variants that Some methods focus exclusively on the human genome change protein function and those that are neutral is one [12,13] and some aspire to identify disease-causing key to making sense of the deluge Next Generation variants [14-16]. Applications to personalized health are Sequencing (NGS) or Deep Sequencing data. Many obviously important considerations for the developers of methods have been developed that address this challenge, such tools. Generally, today’s methods are able to distin- spanning a wide range of goals and applications. Some guish between a set with 100 disease-causing and another tools are focused on non-coding regions [2-4]; others with 100 less impacting variants [17,18]. However, identi- focus on coding regions and predict the effects of single fying one or several variants in an individual responsible amino acid variants (non-synonymous single-nucleotide for a certain disease is often beyond our reach. Methods have improved significantly by using more protein and * Correspondence: [email protected] variant annotations, as demonstrated in particular in the 1 Department of Bioinformatics & Computational Biology, Technische advance from PolyPhen [12] to PolyPhen-2 [13]. Despite Universität München, Boltzmannstr. 3, 85748 Garching/Munich, Germany Full list of author information is available at the end of the article many advances, good data remains missing, in particular © 2015 Hecht et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http:// creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Hecht et al. BMC Genomics 2015, 16(Suppl 8):S1 Page 2 of 12 http://www.biomedcentral.com/1471-2164/16/S8/S1

careful annotations of variant neutrality, partially because ‘+++’)ordecrease(‘-’, ‘--’, ‘---’)infunctionwere it is to difficult to carry out “negative experiments” assigned to the ‘effect’ class. Variants with conflicting (absence of change [19]). functional effect annotations were also classified as The best variant effect prediction methods typically use ‘effect’. This approach identified 51,817 variants (neutral: evolutionary information, and a wide variety of features 13,638, effect: 38,179) in 4,061 proteins. descriptive of protein function and structure. Performance EC. 74% of the PMD data were ‘effect’ annotations. We decreases substantially for proteins without informative balanced this with evidence for neutral variants from multiple alignments. Today few human proteins do not enzyme alignments. Assume independent experiments map to well-studied sequence families. However, most fully reveal two enzymes to have the same function, i.e. the sequenced organisms, predominantly prokaryotic, contri- same EC number (Enzyme Commission number [26]). bute a substantial fraction of “orphans” (10-20%) [20]. If these two proteins are very sequence similar, most var- Today’sstate-of-the-artpredictionmethodsfocuson iants between them are likely ‘neutral’ with respect to the discerning disease-causing variants from the background EC number. While not always correct, the procedure cre- variation. They, e.g. differentiate between human cancer- ates a set heavily enriched in truly ‘neutral’ variants. To causing mutations and common variation. This implicitly turn this concept into data, we aligned all enzymes with disregards many variants with functional effects that are experimentally assigned EC numbers in SWISS-PROT not associated with disease. In contrast, the current [22] using pairwise BLAST [27]. We retrieved all enzyme version of our SNAP (Screening for Non-Acceptable Poly- pairs with pairwise sequence identity >40% and HSSP- morphisms) method, SNAP2, does not predict the variant values>0 [28-30]. This yielded 26,840 ‘neutral’ variants in effect as “disease or not” but rather as “change of molecu- 2,146 proteins [11]. lar function or not”. Similar to most experimental assays, Disease. We extracted 22,858 human disease-associated SNAP2 does not directly connect “molecular change” to variants in 3,537 proteins from OMIM [24] and HumVar “impact on organism"; i.e. the goal is not to support state- [25]. All disease-associated variants were classified as ments of the type “this single variant improves survival ‘effect’.Formanyofthesevariantsthechangeinprotein rate”.Alsosimilartomanyexperimental methods, we function has not explicitly been demonstrated. These avoid distinguishing gain-of-function from loss-of-function variants may be not causative but, possibly, in linkage variants, as these outcomes are often subjective. For disequilibrium with the actual disease-causing variants. instance, gaining in the ∆∆G of binding does NOT imply Alternatively, they may be affecting splice-sites and/or a “better molecular function” and even the gain of “mole- regulatory elements in the DNA, finally showing up as cular function” might decrease survival. Here, we intro- amino acid substitutions. Hence, by compiling these into duced several concepts each of which importantly the effect class we may be over-estimating functional improved over our previous method, SNAP [11]. SNAP2 changes. However, we previously established that rela- outperforms its predecessor in three major aspects: better tionships to disease provide much stronger evidence for performance, better predictions without alignments, and functional effect of variants than any other experimental many orders of magnitude lower runtime. evidence [17]. Thus, disease variants are clearly strongly enriched in functional significance. Methods Protein specific studies.Wealsoincludeddatafrom Data sets comprehensive studies of particular proteins, namely The training set for SNAP2 resembled that used for devel- LacI repressor from Escherichia coli [31] (4,041 variants) opment of the original SNAP [11]. In particular, we used and the HIV-1 protease [32] (336 variants). Variants the following mixture: variants from PMD (the Protein functionally equivalent to wild-type were considered Mutant Database [21]), residues differing between ‘neutral’; all others were deemed ‘effect’.Thesevariants enzymes with the same experimentally annotated function were not included in training, overlaps (same variant in according to the enzyme classification commission (EC), one of the sets above and these) were removed. retrieved from SWISS-PROT [22,23], variants associated Evaluation sets.Wecreatedthreesubsetsofourdata with disease as annotated in OMIM (Online Mendelian for evaluation/development of SNAP2. First, PMD + EC Inheritance in Men [24]), and HumVar [25]. +Diseasewere compiled into one comprehensive set PMD.Weextractedallaminoacidchangingvariants termed ALL with 101,515 variants (40,478 neutral, from the Protein Mutant Database [21] (PMD) and 61,037 effect) in 9,744 proteins. We also split the PMD mapped these to their corresponding sequences. PMD data into two subsets: one containing only human muta- annotations with ‘no change’ (‘=’) qualification (function tions (PMD_HUMAN; 9,657 variants in 678 human equivalent to wild-type) were assigned to the ‘neutral’ sequences) and one consisting of all others (PMD_NON- class, while variants with any level of increase (‘+’, ‘++’, HUMAN; 42,160 variants in 3,383 sequences). Hecht et al. BMC Genomics 2015, 16(Suppl 8):S1 Page 3 of 12 http://www.biomedcentral.com/1471-2164/16/S8/S1

Cross-validation impact of variants. Not knowing connections between We clustered our data such that the sets used for training residues (our method does not require the knowledge of (optimizing neural network connections), cross-training 3D structures), we scanned sliding windows of up to 21 (picking best method) [33,34], and testing (results consecutive residues around the central variant position. reported) were not significantly sequence similar. Toward We compiled the original SNAP features: biophysical this end, we all-against-all PSI-BLASTed all proteins in amino acid properties, explicit sequence, PSIC profiles our data sets and recorded all hits with E-values<10-3. [37], secondary structure and solvent accessibility Starting with these, we builtanundirectedgraph,where [38-40], residue flexibility [41], and SWISS-PROT anno- vertices are proteins and edges link vertices to the corre- tations. Additionally, we introduced new features for sponding BLAST hits. We then clustered all proteins SNAP2: amino acid properties as provided by the AAin- using single linkage clustering; i.e. all connected vertices dex database [42], predicted binding residues [43], pre- were assigned to the same cluster. This yielded 1,241 clus- dicted disordered regions [44], proximity to N- and ters of related protein sequences with 1 to 1,941 members. C-terminus, statistical contact potentials [45], co-evolving We randomly grouped the clusters into ten subsets of positions, residue annotations from Pfam [20] and PRO- roughly similar size. This approach ascertained that no SITE [46], low-complexity regions, and other global fea- two proteins between any sets were significantly sequence tures such as secondary structure and solvent accessibility similarity. Due to extremely varied cluster sizes one of composition (Additional File 1, Input feature calculation). these subsets was nearly three times larger than the others. This imbalance was acceptable since the cross-validation Feature selection procedure ensured sufficiently more training data than In order to determine the optimal feature combination, testing data in each rotation. In tenfold cross-validation, we systematically sieved through our feature space using we rotated through the subsets using eight for training, greedy bottom-up feature selection. For the following one for cross-training and the tenth for testing, such that procedure one of the ten training folds (specific to each each subset (and therefore each protein) was used for test- network) was kept out so that it had no part in feature ing exactly once. As a result no variant, protein sequence, selection and parameter optimization at any point. We or even close homologue, was ever used simultaneously trained ten networks, using 9 of the 10 data subsets: 8 for for training and testing. All performance estimates that we training and 1 for cross-testing as described above, using reported were solely based on the testing set. each feature and selecting the highest scoring feature separately for each network (highest AUC, Area Under Prediction method ROC Curve, in cross-training). In the next round, the We applied the different machine learning tools in the selected feature was combined with each of the remain- WEKA suite [35] to our data with default parameters. ing features to train another round of ten networks and Support Vector Machines (SVMs) and Neural Networks the best performing combination of features was selected performed similarly and slightly better than Decision -again,foreachnetworkseparately.Werepeateduntil Trees and Random Forests. Due to runtime efficiency, no additional feature improved performance. We consid- we decided to proceed with standard neural networks. ered different sequence window sizes for each feature As in similar applications [11,36], we used two output independently; i.e.eachfeaturecouldbeselectedina units: one for ‘neutral’, the other for ‘effect’. All free net- window of w = 1,5,9,13,17, or 21 consecutive residues work parameters were optimized on the training (opti- around the observed variant at the center of the window. mizing connection weights) and cross-training We tried to avoid local maxima in training via the fol- (optimizing number of hidden units, learning rate, and lowing steps: S1: Train with balanced data sets [38,40]. momentum; stop training before over-fitting) sets. Ten- S2: Determine the AUC on the cross-training set after fold cross-validation implies training ten networks: each repetition. Record the step with maximal AUC. S3: which one to use for future applications? Taking the Train and determine AUC for the cross-training set at “best” of the ten risks over-training. We avoid this by least another ten repetitions from the highest-scoring using all ten networks to predict for new proteins, com- step. Repeat S2-S3 until no additional improvement is piling separate averages for ‘neutral’ and ‘effect’ over all recorded. ten networks. The final prediction is the difference We collected all features that improved performance on between these averages that ranges from -100 (strongly any of the individual networks into a single combined fea- predicted ‘neutral’) to +100 (strongly predicted ‘effect’). ture set and trained all networks on this set. In a subse- quent backward elimination, we removed all features the Input features removal of which did not alter the average overall predic- Biophysical amino acid features and predicted aspects of tion accuracy. After determining the final feature space, protein function and structure help to predict the we optimized the number of hidden nodes, learning rate, Hecht et al. BMC Genomics 2015, 16(Suppl 8):S1 Page 4 of 12 http://www.biomedcentral.com/1471-2164/16/S8/S1

and learning momentum to obtain the best-performing We calculated accuracy (precision) and coverage (recall) network architecture. As an exhaustive screening of the separately for ‘effect’ (Eqn. 1) and ‘neutral’ (Eqn. 2) predic- entire parameter space was not intended, we heuristically tions: selected parameter combinations for optimization: learn- TP ing rate 0.005-0.1, learning momentum 0.01-0.3, and hid- Accuracyeffect = Precisioneffect = Positive predictive value = TP + FP (1) den nodes 10-100. The best-performing architecture for TP Coverageeffect = Recalleffect = Sensitivity = each network, as determined by its performance on the TP + FN corresponding cross-training set, was chosen for the final TN method. Accuracy = Precision = Negative predictive value = neutral neutral TN + FN Finally, we tested the resulting trained networks (of TN (2) Coverageneutral = Recallneutral = Specificity = specific feature space and the network architecture TN + FP each) against the test sets that were initially kept out of We used the F-measure (F1-Score; Eqn. 3) to asses feature selection and parameter optimization. Since the ‘neutral’ and ‘effect’ variants individually. Combined per- performance on these test sets did not differ signifi- formance was measured by the overall two-state accu- cantly from that estimated during the optimization pro- racy (Q2; Eqn. 4) and the Matthews Correlation cedure, we concluded that we had not over-fitted the Coefficient (MCC; Eqn. 5). networks to the data. precisioneffect recalleffect Feffect =2 · Predicting effects without alignments · precisioneffect + recalleffect (3) We repeated the above feature selection restricted to precisionneutral recallneutral Fneutral =2 · global features (features based on the entire protein, · precisionneutral + recallneutral such as amino acid and secondary structure composi- tions), amino acid indices, alignment-free secondary TP + TN Q = Accuracy = (4) structure predictions, and the biophysical amino acid 2 (TP + FP + TN + FN) properties. We explicitly left out evolutionary informa- tion. We wanted to add a generic average for ‘potential TP TN FP FN effect’.Towardthisend,weusedthecompleteversion MCC = · − · (5) (TP + FP)(TP + FN)(TN + FP)(TN + FN) of SNAP2 to predict effects for all possible variants at each residue position in our entire ALL set. From these Standard! deviation and error for all measures were results, we generated a novel amino acid substitution estimated over n = 1000 bootstrap sets; for each set we matrix of effect probabilities[47]whichweincludedas randomly selected 50% of all variants from the original an additional feature in the feature selection. This pro- test set without replacement. Note that due to over- cedure was aimed at developing a method that can be representation of certain protein families, in our experi- applied without alignments. The resulting method ence, bootstrapping without replacement typically yields (SNAP2noali) predicts functional effects using only single error estimates that are moreaccuratethanthosewith sequences. Note that our SNAP2 implementation selects replacement. Standard deviation was calculated as the the best method given the available information, SNAP2 difference of each test set (xi)fromtheoverallperfor- by default and SNAP2noali for orphans. In the latter mance 〈x〉 (Eqn. 6). Standard error was calculated by case, users are notified about the possibly reduced accu- dividing s by the square root of sample size (Eqn. 7). racy of predictions. (x x )2 Performance measures Standard deviation (SD) = i − ⟨ ⟩ (6) " n We evaluated performance via a variety of measures. For # simplicity, we used the following standard annotations: True positives (TP) were correctly predicted experimental SD ‘ ’ Standard error (SE) = (7) effect variants, while false positives (FP) were experimen- (n 1) tally ‘neutral’ substitutions incorrectly predicted to have an − effect. True negatives (TN) were correctly predicted The reliability index (RI;! Eqn. 8) for each prediction neutrals and false negatives (FN) were effect variants was computed by normalizing the difference between incorrectly predicted to be neutral. Here, like everywhere the two output nodes (one for ‘neutral’,theotherfor else in computational biology, we accept incorrect esti- ‘effect’)intointegersbetween0(lowreliability)and10 mates originating from the triviality that “not observed” (high reliability): does not always imply “not existing”, i.e.someoftheFP RI =10 int(Output Output ) (8) might have an effect that was not experimentally tested. ·| effect − neutral | Hecht et al. BMC Genomics 2015, 16(Suppl 8):S1 Page 5 of 12 http://www.biomedcentral.com/1471-2164/16/S8/S1

Results original SNAP was trained on PMD, suggesting a perfor- SNAP2 significantly improves predictions mance overestimate. Secondly, SIFT scores were nor- First, we assessed the performance of SNAP2 via cross- malized and optimized for simple defaults. This is validation on the original SNAP data. Here, we observed implicitly ignored by showing ROC-curves that provide a performance increase over our original SNAP, originat- values for a wide set of thresholds that had been ing from novel features used in SNAP2. However, by deemed non-optimal by the developers. Thirdly, Poly- adding in more and better variant data, we found a Phen-2 is optimized on human variants that account for further (and significantly higher) improvement in perfor- only 25% of our ALL data. For these, we over-estimate mance over SNAP. Many computational methods predict PolyPhen-2’sperformance.Althoughtheauthors variant effects. As most of these methods focus on pre- assumed that PolyPhen-2 would perform similarly for dicting disease-associated variants, assessing their perfor- other eukaryotes, it might not. To address these compli- mance on our data is inappropriate. Therefore, we cations we compared the methods using additional data explicitly compared SNAP2 only to widely used methods sets. that explicitly aim at the prediction of functional effects: SIFT [10] and PolyPhen-2 [13]. All estimates for the per- Performance differed between the human and non- formance of SNAP2 given in this work are based on full human PMD data cross-validation testing, i.e.ondataneverusedforany The F-measure for predicting effect (Feffect,Eqn.3),the step in the development. Note that this is not true for two state-accuracy (Q2, Eqn. 4), and the Matthew’s cor- other methods in our comparisons. relation coefficient (MCC, Eqn. 5) were slightly higher On the ALL data set (Methods), SNAP2 outperformed for SNAP2 when tested on the non-human than on the its predecessor SNAP [11], as well as both PolyPhen-2 human set (Table 1). For the human PMD data, Poly- and SIFT (Figure 1). However, the direct comparison is Phen-2 performed on par with SNAP2, while SIFT was complicated due to a variety of issues. Firstly, the best for predicting neutrals. For the non-human data, SNAP2 was either on par (Fneutral,Eqn.3)oroutper- formed (Feffect,Q2,MCC)allothermethods(Table1). Again, this comparison is not entirely fair to SNAP2 and SIFT since the human PMD variants overlapped substantially with the PolyPhen-2 training set, i.e. Table 1 likely over-estimates PolyPhen-2.

Blind method combinations might be worse than a good single method If in doubt which method is best, users often mix sev- eral methods. One strategy is to exclusively consider predictions for which several methods agree. We assessed the benefit of this strategy by applying SNAP2, SIFT and PolyPhen-2 on the PMD_HUMAN data set. All methods performed significantly worse for neutral than for effect variants. This can largely be attributed to the difference in the number of variants. The combina- tion of SIFT and PolyPhen-2 improved slightly over SIFT alone for neutral variants (green curve vs. brown arrow/triangle in Figure 2A) and, in terms of accuracy Figure 1 SNAP2 performs best for the ALL data set.Thisfigure (Eqn. 2) over PolyPhen-2 alone (orange curve vs. brown shows performance estimates for the ALL data set. Our new arrow/triangle in Figure 2A). However, for effect var- method SNAP2 (dark blue, AUC = 0.905) outperforms its iants combining PolyPhen-2 and SIFT did not improve predecessor SNAP (light blue, AUC = 0.880), PolyPhen-2 (orange, over the individual methods at all. Moreover, through- AUC = 0.853) and SIFT (green, AUC = 0.838) over the entire out the curves (Figure 2) of both neutral and effect var- spectrum of the Receiver Operating Characteristic (ROC) curve. Curves are significantly different from each other at a significance iants, the combined method did not improve over using level of P < 10-4 as measured by the DeLong method [59]. All SNAP2 alone. Methods such as PredictSNP [48], Condel SNAP2 results were computed on the test sets not used in training [49], and MetaSNP [50] have been explicitly optimized after a rigorous split into training, cross-training and testing. Results to combine different methods, mostly to annotate dis- for PolyPhen-2 and our original SNAP included some of those ease-variant relationships (as opposed to functional proteins in their training, suggesting over-estimated performance. changes). Such meta-methods often tend to improve Hecht et al. BMC Genomics 2015, 16(Suppl 8):S1 Page 6 of 12 http://www.biomedcentral.com/1471-2164/16/S8/S1

Table 1. Method performance on PMD *

Method Feffect (Eqn. 3) Fneutral (Eqn. 3) Q2 (Eqn. 4) MCC (Eqn. 5) human SNAP2 78.0% ± 0.6 46.3% ± 1.3 68.8% ± 0.7 0.24 ± 0.01 PolyPhen-2 78.4% ± 0.4 ** 45.1% ± 1.1 ** 68.9% ± 0.5 ** 0.23 ± 0.01 ** SNAP 74.9% ± 0.5 46.7% ± 1.1 65.8% ± 0.6 0.22 ± 0.01 SIFT 72.2% ± 0.6 49.0% ± 1.0 63.6% ± 0.6 0.23 ± 0.01 non-human SNAP2 79.9% ± 0.3 45.8% ± 0.8 70.7% ± 0.4 0.26 ± 0.01 PolyPhen-2 77.1% ± 0.4 44.7% ± 0.8 67.6% ± 0.5 0.22 ± 0.01 SNAP 77.2% ± 0.3 45.5% ± 0.9 67.9% ± 0.5 0.23 ± 0.01 SIFT 77.0% ± 0.3 45.8% ± 0.8 67.7% ± 0.4 0.23 ± 0.01 * Data set consisting of 9,657 variants (2,788 neutral, 6,869 effect) from 678 human proteins in the top rows and 42,160 variants (10,850 neutral, 31,310 effect) from 3,383 non-human proteins in the bottom rows. For each measure and species group, significantly best results are highlighted in bold. Measures with no bold highlighting indicate absence of a statistically significant best performer. ** Values might over-estimate performance for PolyPhen-2 due to overlap between data set used here and one used for training PolyPhen-2. over the simple combinationsindividuallyattemptedby with 53% and SIFT with 41% compared to 53±1% for many users and tested here. random. We repeated the same analysis for the PMD_HUMAN SNAP2 is clearly best for difficult cases subset (Figure 3). For the 3,963 human variants (1,374 Although overall performance levels were similar for all neutral and 2,589 effect) for which any two of the methods methods tested on the ALL data set, the actual predic- disagreed, SNAP2 and PolyPhen-2 were correct in ~58% tions for a single variant differed substantially between of the cases compared to 50% for SNAP, 46% for SIFT methods. Variants for which methods agree could be and 44±1% for random predictions. Again, the PolyPhen-2 considered “easy” (every method right) or “unsolvable” training set overlapped with these data, suggesting a per- (no method right). In contrast, variants for which meth- formance over-estimate. ods disagree could be considered “difficult”. This classifi- In this set of 3,963 human variants, 305 (45 neutral cation yielded 67,912 easy (~68% of the total; 27,370 and 260 effect) were only correctly predicted by SNAP2. neutral and 40,542 effect), 9,624 unsolvable (~10% of the We investigated these cases in detail, and found that the total; 4,750 neutral and 4,874 effect), and 22,625 difficult effect variants in this set often localized to positions at variants (~22% of the total; 7,504 neutral and 15,121 which the variant residue had been observed in another effect). SNAP2 outperformed others on the difficult protein in the alignment. For most methods, this implies cases, correctly predicting 69%, as compared to SNAP “neutral” prediction. Indeed, SNAP2noali, the version of

Figure 2 Naïve combination is not better than individual methods for PMD_HUMAN data. This figure shows accuracy-coverage curves for the PMD_HUMAN data. The x-axes indicate coverage (also referred to as ‘recall’; Eqn. 1.2), i.e. the percentage of observed neutral (a) and of observed effect (b) variants that are correctly predicted at the given threshold. The y-axes indicate accuracy (also referred to as ‘precision’; Eqn. 1.2), i.e. the percentage of neutral (a) and effect (b) variants among all variants predicted in either class at the given threshold. Arrows mark the performance at the default thresholds for our new method SNAP2 (dark blue), for SIFT (green), and for PolyPhen-2 (orange). A brown triangle/ arrow marks the performance of a (non-optimized) method that combines PolyPhen-2 and SIFT. This combination did not perform better than SNAP2 alone (brown triangle vs. blue SNAP2 curves). Hecht et al. BMC Genomics 2015, 16(Suppl 8):S1 Page 7 of 12 http://www.biomedcentral.com/1471-2164/16/S8/S1

strong correlations with other positions in the protein, suggesting that this particular feature also made a difference.

Evolutionary information most important, other features vary The input features related to evolutionary information were consistently most informative for SNAP2 (Addi- tional File 1, Fig. SOM_1: SNAP2 vs. SNAP2noali). Which other input features best distinguished neutral from effect depended on the data set. This dependency might originate from annotation inconsistencies and/or set size differences or it might genuinely reflect the data. By selecting the best features separately for subsets of related proteins, we tried to differentiate between these alternatives. The majority of our subsets considered structural features (secondary structure and solvent Figure 3 SNAP2 and PolyPhen-2 are best for difficult human accessibility) informative, followed by biophysical amino variants. Bars mark the two-state accuracy (Q2; Eqn. 4) at the acid properties (more precisely: charge and hydrophobi- default thresholds for SNAP2 (dark blue), SNAP (light blue), SIFT city). However, the optimal window sizes (number of (green), and PolyPhen-2 (orange). Random prediction performance consecutive residues used as input) for these features assuming 60:40 effect:neutral background are given in pink. Analysis differed. For instance, residueflexibilitywasconsidered is based on 3,963 ‘difficult’ cases (2,589 effect; 1,374 neutral) from PMD_HUMAN set. Difficult cases were defined as variants where any informative by most subsets, but the optimal window of the above method’s predictions disagreed; i.e. cases where not all size for this feature varied between three and nine resi- methods, excluding random, gave the same prediction. dues around the variant. The final SNAP2 network included the following features: global features (amino acid composition, sec- our method that does not use alignments, predicted 75% ondary structure and solventaccessibilitycomposition, of these effect variants at over 90% accuracy, i.e. reached and protein length), PSI-BLAST [27] profiles and deltas, aperformancesubstantiallyaboveitsaverageforthese PSIC [12] profiles and deltas (differences between mutant cases. Thus, one important source of SNAP2 improve- and wild-type residue annotations; see Methods for ment for difficult cases originates from its use on var- details), residue flexibility, sequence and variant profiles, ious pieces of information, not just alignments. One disorder, secondary structure and relative solvent accessi- example of this improvement is the R109Q variant in bility and their deltas, physicochemical properties the IL4 sequence (interleukin-4 isoform 1 precursor; (charge, hydrophobicity, volume, and their deltas), con- NCBI reference sequence: NP_000580.1), a pleiotropic tact potential profiles and deltas, correlated positions and cytokine produced by activated T-cells and involved in low complexity regions. In addition to these, SWISS- B-cell activation as well asco-stimulationofDNA PROT [22] annotations and SIFT [10] predictions were synthesis [51]. Variations in this gene were shown to be included in SNAP2, if available. For the sequence-only associated with susceptibility to ischemic stroke [52] and network (SNAP2noali)thefollowingfeatureswhere knee osteoarthritis [53]. While our R109Q was not included: amino acid composition, protein length, explicitly found to increase disease susceptibility, there sequence and variant profiles, contact potential profiles is evidence [54] that it reduces T-cell proliferation and and delta, volume and hydrophobicity along with the receptor binding activity. In this case, the variant gluta- corresponding delta features as well as several amino mine is more conserved in the protein alignment than acid indices from the AAindex [42] (Additional File 1, the human native arginine (11% Q vs. 8% R), making Table SOM_1). predictions difficult for methods that over-rely on alignments. SNAP2noali important for many proteins Another potential source of improvement, although one For eight proteins in the ALL data set we found fewer for which we could not find explicit and experimentally than five PSI-BLAST hits in UniProt when we first verified examples in our data, lies in the usage of informa- checked in Oct. 2012. On this tiny set SNAP2noali tion about co-evolving residues (Additional File 1, Input appeared better than SNAP2 (Eqn. 4: Q2SNAP2noali = Feature Calculation). Specifically, some of the variant posi- 61% vs. Q2SNAP2 =60%;Eqn.5:MCCSNAP2noali =0.19 tions in this set exhibited (computationally-determined) vs. MCCSNAP2 = 0.17). PolyPhen-2 made predictions for Hecht et al. BMC Genomics 2015, 16(Suppl 8):S1 Page 8 of 12 http://www.biomedcentral.com/1471-2164/16/S8/S1

only three of these eight proteins (103 variants, Q2Poly- when predicting effect/neutral for all variants, SNAP2 Phen-2 =60%)andSIFTgavenopredictions.Recently is correct in about 75% of its neutral predictions and repeating the analysis, we found homologues for all in 86% of its effect predictions (Figure 4B rightmost eight. SNAP2, SIFT and PolyPhen-2 now outperformed points). If users focus on the 50% strongest predictions SNAP2noali.Our“outdated” analysis was important. On (Figure 4B; x-axis at 0.5), they could expect the ~92% the one hand, over 600 human proteins (~3% of all of the neutral predictions and ~96% of the effect pre- human) still find less than 5 homologues today. On the dictions to be correct (RI≥8, Figure 4B). Note that for other hand, for most organisms for which we know the the purposes of simplified visualization, to display sequences, the corresponding value is much closer to SNAP2 reliability with one digit per residue (e.g.to 10-20%, i.e. millions of the proteins we know today can view along with multiple sequence alignments), we only be handled well by SNAP2noali. projected the actual RI onto integers from 0 (low relia- For our entire training data, SNAP2noali reached Q2 = bility - worst prediction) to 9 (high reliability - best 68%, i.e. seven percentage points more than for the sub- prediction, Figure 4B). set of proteins with small/no families (68% on ALL vs. 61% on NOALI eight protein set). About 10-20% of all Discussion proteins in newly sequenced organisms continue not to Performance related to experimentally biased balance of map anywhere else in today’sdatabases[33,34,55];for neutral vs. effect variants those 10-20% of proteins, SNAP2noali appears to be the Machine learning tends to work best when testing and best method available to predict the effect of mutations. training data are sampled from the same distribution. What are the true data that we want to assess our Performance confirmed for additional data sets method upon? One proxy for this type of truth might be We avoided over-optimistic performance estimates by the next “one million variants” experiment: test 1,000 removing sequence similarity between proteins used for randomly selected naturally occurring variants in 1,000 method development (training/cross-training) and test- representative proteins. One question is: how many var- ing. In addition, we also tested our final method on two iants will be identified as being neutral with respect to data sets of variants from the Escherichia coli LacI protein function? The answer remains importantly vague. repressor and from the HIV-1 protease (Additional File Several seemingly contradictory findings are the follow- 1, Table SOM_2). Given the small size and lack of ing. On the one hand, for almost every sequence position diversity, these results are likely to be more error-prone (residue) there is a non-native variant that has very little than our cross-validation estimates. However, they pro- effect on one particular experimental assay [56]. Loosely vide independent evidence to estimate the performance put: “sequence can change without effect”.Ontheother of SNAP2: Q2 = 78% for 4,041 LacI variants and Q2 = hand, for almost every residue there is a variant that 72% for 336 HIV-1 variants. None of these variants was affects function somehow [56]. Loosely put: “every resi- used during method development. Moreover, our train- due in a protein matters and its variation can change ing data did not contain variants from any homologs of function”.Thereisevidencethatindividualityofpeople these proteins. is partially caused by many slightly non-neutral variants [19]. However, this does not help in estimating the “true” Reliability index allows zooming into best predictions ratio neutral/effect for the next one million. Clearly, The difference between the raw output units reason- today’sdatasetsarestronglybiasedtowardeffectvar- ably estimates prediction confidence [11,36]. We used iants, simply because it is simpler to measure and easier this difference to define a reliability index (RI, Eqn. 6) to publish an effect than a neutral variation. Unfortu- and demonstrated its excellent correlation to predic- nately, most of our performance estimates crucially tion strength, i.e.thereliabilityindexandperformance depend on the true ratio neutral/effect. Thus, our esti- (Figure 4). The final binary predictions (neutral/effect) mates remain almost as incomplete as the experimental of SNAP2 are calculated from the network outputs data. based on the user-defined decision threshold (default: -0.05). By moving the threshold, users can vary the What to expect from variant prediction? accuracy-coverage balance. Higher thresholds result in Methods that identify variants related to disease try to more accurate predictions at the cost of covering fewer pick up changes that are strong enough to cause pheno- variants; lower thresholds cover more variants while typic effects that can be classified as disease. This is dif- reducing accuracy. By dialing through the entire ficult for two reasons. Firstly, the causality between threshold spectrum for our non-disease data (PMD/EC variant and disease is only clear for the simplest cases data), we estimated and fixed the default decision such as monogenic or Mendelian diseases. Most diseases threshold (Figure 4A). To put this into perspective: appear to be complex, in the sense that they are onset Hecht et al. BMC Genomics 2015, 16(Suppl 8):S1 Page 9 of 12 http://www.biomedcentral.com/1471-2164/16/S8/S1

Figure 4 SNAP2 threshold and reliability. The reliability index provides a means of focusing on the most accurate predictions. Panel (a) shows SNAP2 performance on the balanced PMD/EC data set over the entire spectrum of accuracy (solid lines) and coverage (dotted lines) for both effect (red) and neutral (green) variants depending on the chosen threshold (x-axis). The default threshold was set to -0.05, where neutral and effect predictions performed alike (black arrow). By moving the decision threshold users can optimize predictive behavior towards their research needs: predictions at higher absolute scores (e.g. TP>0.5 or TN<-0.5) are much more likely correct but they are not available for all variants. Panel (b) directly relates the reliability index (RI) to the performance on our data. Shown is the cumulative percentage of predictions (x-axis) against accuracy (solid lines) and coverage (dotted lines) above a given reliability index (RI; Methods). Accuracy and coverage are shown separately for neutral (green) and effect (red) predictions. Each marker depicts a reliability threshold ranging from 0 (right most marker, low reliability) to 9 (left most marker, high reliability). Labels for RI >= 2, 4 and, 6 are skipped for simplicity. For instance, 58% of all predictions in our cross-validation were made at reliability levels of 7 or higher (gray arrows). At this reliability, 95% of all effect predictions and 90% of all neutral predictions were correct.

only in the presence of several variants and proper envir- SNAP2 not limited to human variants onmental conditions. GWAS have shown that variants Functional effects of sequence variations are not limited associated with disease are found in healthy individuals, to pathogenicity in humans. As most experimental data and vice versa. Loosely put: the definition of a disease var- are human-centric, and as the disease variants are gener- iant may depend on other variants present in the particu- ally most consistent with functional effect [17], SNAP2 lar genotype of the phenotype carrier. Secondly, even for performed best for those. This might also explain why for seemingly clear-cut cases, the classification of “disease” these SNAP2 performed similar to PolyPhen-2 that has might be misleading. Consider the example of the sickle- been optimized to human data. On non-human variants, cell anemia variants of the hemoglobin B-chain, which can however, SNAP2 predictions were most accurate and result in a number of chronic health problems on the one reliable as compared to other methods. This suggests hand but grant immunity to some malaria types on the SNAP2 as a valuable tool for the preliminary analysis of other. In other words, the definition of a disease variant variants in any organism. Specifically, SNAP2 might be may depend on the environment of the individual. the ideal starting point for the comparison of variants In contrast to disease, the prediction of the effect of a between species, e.g. human vs. chimp vs. mouse. variant upon molecular function focuses only on the native function of one particular protein. For many Neutral variants predicted worse examples, such effects are independent of the individual All methods performed significantly better for effect and, often (although not always), of the environment. than for neutral variants. This in agreement with find- However, such a focus bears another set of problems: (1) ings reported in Bromberg et al [19] and can be Today’scomputationalmethodscannotreliablydistin- explained in two ways. guish between gain and loss of function. They simply (1) The imbalance might originate from incomplete predict whether or not the mutation affects native func- experimental evidence. The effect of variants is typically tion at all. (2) It is often difficult to relate the strength of evaluated on the basis of one or a few phenotypes/ afunctionaleffecttoitsbiologicalrelevance.For assays. If these produce no visible difference as com- instance, a “bit” of change in p53 functionality may cause pared to wild-type control the variant is reported as severe phenotypes, whereas a “large” functional effect on neutral. However, it might still have an effect on other other proteins may have little biological impact. In other assays that are not performed. words, predicted effects have to be put into perspective (2) The variants for experimental analysis are usually of the protein in question. not selected at random. Instead, researchers prudently Hecht et al. BMC Genomics 2015, 16(Suppl 8):S1 Page 10 of 12 http://www.biomedcentral.com/1471-2164/16/S8/S1

focus on the most important changes; often those unfair advantage because the data set used had partially changes are related to diseases. Such a prioritized selec- been used to train PolyPhen-2. tion samples the feature space incompletely. This may hamper computational detection of relevant patterns for More and better data needed to advance further? neutral variants. The incomplete sampling may also SNAP2 and PolyPhen-2 reached similar levels of perfor- skew performance estimates: the variants most trivially mance with rather different approaches, but we made so expected to be neutral might be predicted by the meth- many so important changes to SNAP that we were sur- ods but might not be tested experimentally because they prised not to improve more. Was this because predic- are simple to guess. For this reason, comprehensive test- tion performance has reached a plateau, i.e.havewe ing as performed for the E. Coli LacI repressor or the reached the limits for a method using only sequence HIV-1 protease is an invaluable source of information information as input? Many observations suggest that for computational prediction of variant effects. Such our data sets remain importantly incomplete. For data will likely be crucial in overcoming the neutrality instance, we observed that our EC data was inconsistent dilemma and will significantly further our understanding but that we fared worse by leaving it out. We improved of the underlying molecular mechanisms of variant a little through the addition of the OMIM data, but pos- effects. sibly only so much so because the data had implicitly already been predicted correctly [17]. In other words:

SNAP2noali succeeded where others failed OMIM samples exhibit, on average, extreme signals that We specifically trained a classifier to predict functional are somewhat ‘easy’ to predict. Thus, adding samples effects without using evolutionary information. This from the top end of the effect distribution did not help unique novel resource might become increasingly useful improve our prediction of difficult cases where we often as ongoing sequencing efforts bring in more data. The find unclear/contradicting signals. Another indication of current release of the UniRef50 (March 2014) contains incompleteness of experimental data was the result that ~9.5 million sequence clusters of which over 6.5 million we needed to use all available data to achieve peak perfor- (~68%) contain only one protein, i.e.areproteinssofar mance, i.e. smaller subsets reduced performance (data not unique to one organism. For those over 6.5 million, very shown). Still, are we close to a saturation of performance, little evolutionary information is available to guide other or can we expect another leap? The lessons learned from variant effect predictions and the fraction of orphan advancing secondary structure prediction through the clusters appears to be increasing; i.e.inOctober2012, combination of machine learning and evolutionary infor- the UniRef50 contained ~64% orphan clusters - a 4% mation suggest that there is yet no way to tell. increase over 1.5 years. This difference might originate from the decreasing quality of increasing sequencing Conclusions data. However, a similar trend had been observed We significantly improved over our seven-year-old 12 years ago with arguably more accurate sequencing method SNAP for the prediction of functional effects data [57]. Except for SNAP2noali,allmethodsperform from single point variants or mutations in the amino significantly worse for orphans and, in some cases, at acid sequence. SNAP2, the new method improved the level of throwing a coin. Often they produce no through more and better data and through more input results, which also is at the random level. By including a features. SNAP2 annotates functional effects of variants variety of specific features, we developed a classifier that with little preference to particular species and/or parti- still achieves a two-state accuracy Q2 around 68% from cular types of effects. This allows users to perform bias- sequence alone even for these 6.5 million orphan free cross-species comparisons, such as looking at families. This unique type of predicted information sequence positions that differ between human and might become very relevant for uncharacterized protein mouse. We believe that this might be helpful for under- families. standing and predicting disease-causing variation, as well as for facilitating drug development. A measure of Best prediction of difficult cases prediction reliability (Reliability Index; RI) allows users By comparing predictions for variants for which com- to focus on the most promising candidates. Additionally, monly applied methods disagreed, we extracted variants abigachievementofthisworkisthedevelopmentof that were difficult to classify. For these difficult cases, our SNAP2noali -amodelthatpredictseffectsofvariants new method SNAP2 significantly outperformed SNAP without using evolutionary information. Ongoing deep- (set ALL-difficult: Q2(snap2) = 69%, Q2(snap) = 53%) sequencing efforts bring in novel sequences and novel and SIFT (Q2(sift) = 41%). For the difficult variants from variants alike. Many of these variants occur in sequences human PMD, SNAP2 performed just as well as Poly- without families. Possibly for millions of proteins Phen-2, although this comparison gave PolyPhen-2 an SNAP2noali provides a reliable prediction of variant Hecht et al. BMC Genomics 2015, 16(Suppl 8):S1 Page 11 of 12 http://www.biomedcentral.com/1471-2164/16/S8/S1

effects and allows for a quick assessment of functionally 7. Capriotti E, Fariselli P, Calabrese R, Casadio R: Predicting protein stability relevant positions in novel proteins. Both versions of changes from sequences using support vector machines. Bioinformatics 2005, , 21 Suppl 2: ii54-58. SNAP2 have been optimized towards runtime efficiency 8. Capriotti E, Fariselli P, Casadio R: I-Mutant2.0: predicting stability changes to enable large-scale in silico mutagenesis studies that upon mutation from the protein sequence or structure. Nucleic Acids Res probe the landscape of protein mutability [56,58] to 2005, 33(Web Server):W306-310. 9. Dehouck Y, Kwasigroch JM, Rooman M, Gilis D: BeAtMuSiC: Prediction of learn important news about protein structure and changes in protein-protein binding affinity on mutations. Nucleic Acids function. Res 2013, 41(Web Server):W333-339. 10. Ng PC, Henikoff S: SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res 2003, 31(13):3812-3814. 11. Bromberg Y, Rost B: SNAP: predict effect of non-synonymous Competing interests polymorphisms on function. Nucleic Acids Res 2007, 35(11):3823-3835. The authors declare that they have no competing interests. 12. Sunyaev SR, Eisenhaber F, Rodchenkov IV, Eisenhaber B, Tumanyan VG, Kuznetsov EN: PSIC: profile extraction from sequence alignments with Authors’ contributions position-specific counts of independent observations. Protein Eng 1999, MH, YB, and BR conceived this work and designed the experiments. MH 12(5):387-394. wrote the software and carried out the experiments. MH and YB collected 13. Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, the data and analyzed the results. MH, YB, and BR wrote, revised, and Kondrashov AS, Sunyaev SR: A method and server for predicting approved the manuscript. damaging missense mutations. Nat Methods 2010, 7(4):248-249. 14. Li B, Krishnan VG, Mort ME, Xin F, Kamati KK, Cooper DN, Mooney SD, Acknowledgements Radivojac P: Automated inference of molecular mechanisms of disease Thanks to Tim Karl, Guy Yachdav, and Laszlo Kajan (TUM) for invaluable help from amino acid substitutions. Bioinformatics 2009, 25(21):2744-2750. with hardware and software; to Marlena Drabik (TUM) for administrative 15. Calabrese R, Capriotti E, Fariselli P, Martelli PL, Casadio R: Functional support; to Peter Hoenigschmid (WZW) and Christian Schaefer (TUM) for annotations improve the predictive score of human disease-related helpful discussions; to Shaila C. Roessle (LRZ Munich), Veit Hoehn (TUM), mutations in proteins. Human mutation 2009, 30(8):1237-1244. Mark Ofman (TUM), Manfred Roos (TUM), Wiktor Jurkowski (Univ. 16. Reva B, Antipin Y, Sander C: Predicting the functional impact of protein Luxembourg) and Reinhard Schneider (Univ. Luxembourg) for extensive beta mutations: application to cancer genomics. Nucleic Acids Res 2011, 39(17): testing of SNAP2. Last, not least, thanks to all those who deposit their e118. experimental data in public databases, and to those who maintain these 17. Schaefer C, Bromberg Y, Achten D, Rost B: Disease-related mutations databases. This work and its publication was supported by a grant from the predicted to impact protein function. BMC Genomics 2012, 13(Suppl 4):S11. Alexander von Humboldt foundation through the German Ministry for 18. Cline MS, Karchin R: Using bioinformatics to predict the functional impact Research and Education (BMBF: Bundesministerium fuer Bildung und of SNVs. Bioinformatics 2011, 27(4):441-448. Forschung). This work was also supported by the German Research 19. Bromberg Y, Kahn PC, Rost B: Neutral and weakly nonneutral sequence Foundation (DFG) and the Technische Universität München within the variants may define individuality. Proceedings of the National Academy of funding programme Open Access Publishing. Sciences of the United States of America 2013, 110(35):14255-14260. This article has been published as part of BMC Genomics Volume 16 20. Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Supplement 8, 2015: VarI-SIG 2014: Identification and annotation of genetic Forslund K, Ceric G, Clements J, et al: The Pfam protein families database. variants in the context of structure, function and disease. The full contents Nucleic Acids Res 2012, 40(Database):D290-301. of the supplement are available online at http://www.biomedcentral.com/ 21. Kawabata T, Ota M, Nishikawa K: The Protein Mutant Database. Nucleic bmcgenomics/supplements/16/S8. Acids Res 1999, 27(1):355-357. 22. Bairoch A, Apweiler R: The SWISS-PROT protein sequence database and Authors’ details its supplement TrEMBL in 2000. Nucleic Acids Res 2000, 28(1):45-48. 1Department of Bioinformatics & Computational Biology, Technische 23. Dimmer EC, Huntley RP, Alam-Faruque Y, Sawford T, O’Donovan C, Universität München, Boltzmannstr. 3, 85748 Garching/Munich, Germany. Martin MJ, Bely B, Browne P, Mun Chan W, Eberhardt R, et al: The UniProt- 2Department of Biochemistry and Microbiology, Rutgers University, 76 GO Annotation database in 2011. Nucleic Acids Res 2011. Lipman Dr, New Brunswick, NJ 08901, USA. 3Department of Genetics, Rutgers 24. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA: Online University, 145 Bevier Road, Piscataway, NJ 08854-8082, USA. 4Institute of Mendelian Inheritance in Man (OMIM), a knowledgebase of human Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748 Garching/Munich, genes and genetic disorders. Nucleic Acids Res 2005, 33(Database): Germany & WZW - Weihenstephan, Alte Akademie 8, Freising, Germany. D514-517. 25. Capriotti E, Calabrese R, Casadio R: Predicting the insurgence of human Published: 18 June 2015 genetic diseases associated to single point protein mutations with support vector machines and evolutionary information. Bioinformatics References 2006, 22(22):2729-2734. 1. Zuckerkandl E, Pauling L: Molecules as documents of evolutionary history. 26. Webb EC: Enzyme Nomenclature 1992. Recommendations of the Journal of Theoretical Biology 1965, 8:357-366. Nomenclature committee of the International Union of Biochemistry and 2. Schwarz JM, Rodelsperger C, Schuelke M, Seelow D: MutationTaster Molecular Biology. 1992 edition. New York: Academic Press;1992. evaluates disease-causing potential of sequence alterations. Nat Methods 27. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, 2010, 7(8):575-576. Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein 3. Cingolani P, Platts A, Wang le L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, database search programs. Nucleic Acids Res 1997, 25(17):3389-3402. Ruden DM: A program for annotating and predicting the effects of 28. Sander C, Schneider R: Database of homology-derived protein structures single nucleotide polymorphisms, SnpEff: SNPs in the genome of and the structural meaning of sequence alignment. Proteins 1991, Drosophila melanogaster strain w1118; iso-2; iso-3. Fly 2012, 6(2):80-92. 9(1):56-68. 4. McLaren W, Pritchard B, Rios D, Chen Y, Flicek P, Cunningham F: Deriving 29. Rost B: Twilight zone of protein sequence alignments. Protein Eng 1999, the consequences of genomic variants with the Ensembl API and SNP 12(2):85-94. Effect Predictor. Bioinformatics 2010, 26(16):2069-2070. 30. Mika S, Rost B: UniqueProt: creating representative protein sequence 5. Schaefer C, Rost B: Predict impact of single amino acid change upon sets. Nucleic Acids Res 2003, 31(13):3789-3791. protein structure. BMC Genomics 2012, 13(Suppl 4):S4. 31. Markiewicz P, Kleina LG, Cruz C, Ehret S, Miller JH: Genetic studies of the 6. Dehouck Y, Grosfils A, Folch B, Gilis D, Bogaerts P, Rooman M: Fast and lac repressor. XIV. Analysis of 4000 altered Escherichia coli lac repressors accurate predictions of protein stability changes upon mutations using reveals essential and non-essential residues, as well as “spacers” which statistical potentials and neural networks: PoPMuSiC-2.0. Bioinformatics do not require a specific sequence. J Mol Biol 1994, 240(5):421-433. 2009, 25(19):2537-2543. Hecht et al. BMC Genomics 2015, 16(Suppl 8):S1 Page 12 of 12 http://www.biomedcentral.com/1471-2164/16/S8/S1

32. Loeb DD, Swanstrom R, Everitt L, Manchester M, Stamper SE, Hutchison CA 55. Liu J, Rost B: Comparing function and structure between entire III: Complete mutagenesis of the HIV-1 protease. Nature 1989, proteomes. Protein Science 2001, 10(10):1970-1979. 340(6232):397-400. 56. Hecht M, Bromberg Y, Rost B: News from the protein mutability 33. Mistry J, Kloppmann E, Rost B, Punta M: An estimated 5% of new protein landscape. J Mol Biol 2013, 425(21):3937-3948. structures solved today represent a new Pfam family. Acta 57. Liu J, Rost B: Comparing function and structure between entire crystallographica Section D, Biological crystallography 2013, proteomes. Protein science : a publication of the Protein Society 2001, 69(Pt 11):2186-2193. 10(10):1970-1979. 34. Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, 58. Bromberg Y, Rost B: Comprehensive in silico mutagenesis highlights Forslund K, Ceric G, Clements J, et al: The Pfam protein families database. functionally important residues in proteins. Bioinformatics 2008, 24(ECCB Nucleic Acids Res 2012, 40(Database):D290-301. Proceedings):i207-i212. 35. Frank E, Hall M, Trigg L, Holmes G, Witten IH: Data mining in 59. DeLong ER, DeLong DM, Clarke-Pearson DL: Comparing the areas under bioinformatics using Weka. Bioinformatics 2004, 20(15):2479-2481. two or more correlated receiver operating characteristic curves: a 36. Rost B, Sander C: Prediction of protein secondary structure at better than nonparametric approach. Biometrics 1988, 44(3):837-845. 70% accuracy. J Mol Biol 1993, 232:584-599. 37. Sunyaev SR, Eisenhaber F, Rodchenkov IV, Eisenhaber B, Tumanyan VG, doi:10.1186/1471-2164-16-S8-S1 Kuznetsov EN: PSIC: profile extraction from sequence alignments with Cite this article as: Hecht et al.: Better prediction of functional effects position-specific counts of independent observations. Protein Engineering for sequence variants. BMC Genomics 2015 16(Suppl 8):S1. 1999, 12(5):387-394. 38. Rost B: PHD: predicting one-dimensional protein structure by profile based neural networks. Methods in Enzymology 1996, 266:525-539. 39. Rost B, Sander C: Conservation and prediction of solvent accessibility in protein families. Proteins 1994, 20(3):216-226. 40. Rost B, Sander C: Prediction of protein secondary structure at better than 70% accuracy. J Mol Biol 1993, 232(2):584-599. 41. Schlessinger A, Yachdav G, Rost B: PROFbval: predict flexible and rigid residues in proteins. Bioinformatics 2006, 22(7):891-893. 42. Kawashima S, Kanehisa M: AAindex: amino acid index database. Nucleic Acids Res 2000, 28(1):374. 43. Ofran Y, Rost B: ISIS: interaction sites identified from sequence. Bioinformatics 2007, 23(2):e13-16. 44. Schlessinger A, Punta M, Yachdav G, Kajan L, Rost B: Improved disorder prediction by combination of orthogonal approaches. PLoS One 2009, 4(2):e4433. 45. Simons KT, Ruczinski I, Kooperberg C, Fox BA, Bystroff C, Baker D: Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins. Proteins 1999, 34(1):82-95. 46. Sigrist CJ, Cerutti L, de Castro E, Langendijk-Genevaux PS, Bulliard V, Bairoch A, Hulo N: PROSITE, a protein domain database for functional characterization and annotation. Nucleic Acids Res 2010, 38(Database): D161-166. 47. Hoehn V: In-depth comparison of predicted high-and low-impact SNPs from the 1,000 Genomes Project. Master Thesis Technische Universität München; 2012. 48. Bendl J, Stourac J, Salanda O, Pavelka A, Wieben ED, Zendulka J, Brezovsky J, Damborsky J: PredictSNP: robust and accurate consensus classifier for prediction of disease-related mutations. PLoS Comput Biol 2014, 10(1):e1003440. 49. Gonzalez-Perez A, Lopez-Bigas N: Improving the assessment of the outcome of nonsynonymous SNVs with a consensus deleteriousness score, Condel. American journal of human genetics 2011, 88(4):440-449. 50. Capriotti E, Altman RB, Bromberg Y: Collective judgment predicts disease- associated single nucleotide variants. BMC Genomics 2013, 14(Suppl 3):S2. 51. Yokota T, Otsuka T, Mosmann T, Banchereau J, DeFrance T, Blanchard D, De Vries JE, Lee F, Arai K: Isolation and characterization of a human interleukin cDNA clone, homologous to mouse B-cell stimulatory factor 1, that expresses B-cell-and T-cell-stimulating activities. Proceedings of the National Academy of Sciences of the United States of America 1986, 83(16):5894-5898. Submit your next manuscript to BioMed Central 52. Zee RY, Cook NR, Cheng S, Reynolds R, Erlich HA, Lindpaintner K, Ridker PM: and take full advantage of: Polymorphism in the P-selectin and interleukin-4 genes as determinants of stroke: a population-based, prospective genetic analysis. Human molecular genetics 2004, 13(4):389-396. • Convenient online submission 53. Yigit S, Inanir A, Tekcan A, Tural E, Ozturk GT, Kismali G, Karakus N: • Thorough Significant association of interleukin-4 gene intron 3 VNTR • No space constraints or color figure charges polymorphism with susceptibility to knee osteoarthritis. Gene 2014, 537(1):6-9. • Immediate publication on acceptance 54. Ramanathan L, Ingram R, Sullivan L, Greenberg R, Reim R, Trotta PP, Le HV: • Inclusion in PubMed, CAS, and Immunochemical mapping of domains in human interleukin 4 • Research which is freely available for redistribution recognized by neutralizing monoclonal antibodies. Biochemistry 1993, 32(14):3549-3556. Submit your manuscript at www.biomedcentral.com/submit Hecht, Bromberg & Rost Supporting online material

Supporting online material for: Better prediction of functional effects for sequence variants Maximilian Hecht, Yana Bromberg & Burkhard Rost

Table of Contents for Supporting Online Material

1. Input feature calculation 2. Table SOM_1: Input features selected from AAindex 3. Table SOM_2: Performance on independent data sets 4. Table SOM_3: Performance estimates on ALL data set 5. Figure SOM_1: Accuracy-Coverage cruves on ALL data set 6. Figure SOM_2: Score distribution for SNAP2 on ALL data set

Short description of Supporting Online Material

This SOM contains a detailed description of features and their extraction for use in the neural network predictor. We also included three tables (1) listing the cluster representatives from the AAindex database, that were selected as helpful features in SNAP2noali, (2) a performance comparison on independent protein-specific data, namely the HIV-1 protease and the Escherichia Coli LacI repressor and (3) a table showing performance values on our comprehensive ALL (main manuscript, methods section) data set. Moreover, this SOM includes a figure (Fig. SOM_1) showing the performance of SNAP2 and SNAP2noali in comparison to SIFT and random predictions.

Appendix p. 1 Hecht, Bromberg & Rost Supporting online material

Material

Input feature calculation. In order to use amino acid and protein properties in neural networks these have to be presented as normalized numerical values. The following section describes the exact calculation or extraction of these values. Delta features. Where applicable, we calculated delta features that describe the change in certain features between the native amino acid and its variant. All delta features are encoded by two nodes per residue: one for the “severity” (absolute difference between wildtype and mutant value) the other for the “direction” (‘1’ if positive and ‘0’ if negative) of change. Biophysical properties. In addition to mass, volume, charge, hydrophobicity and the presence of C-beta branching amino acids (as already present in SNAP) we collected one representative for each cluster of correlated amino acid indices from the AAindex database 1. These indices are matrices containing values for each amino acid (or pair of amino acids) that cover a variety of amino acid properties and features derived from these (Table SOM_1). We extracted the corresponding (already normalized) value for each residue in the window, resulting in w input values. Then we calculated the two-node delta feature. The first node was the absolute difference between the wildtype and the mutant value. Binding residues. We used ISIS 2 to predict the protein-protein binding sites and DISIS 3 to predict the protein-DNA binding sites. We extracted both the binary prediction (binding/non-binding) and the raw prediction score for each residue in the window (21 * 2 = 42 input nodes). Disordered regions. We used the META-Disorder predictor tool (MD; 4) tool to calculate a three-node disorder feature for all residues in the window: We extracted the binary per-residue prediction (disordered/not-disordered) and the prediction reliability. Proximity to N- and C-terminus. We calculated the proximity of the variant position to each terminus individually as the normalized number of residues between terminus and the position of interest (2*1 = 2 input nodes). Contact potentials. We extracted normalized distance-dependent statistical potentials (for contacts within 5 Ångstrøms=0.5nm) 5. For both native amino acid and variant, we extracted the potential as a 20-node feature. Additionally, we calculated the delta values for this feature (difference between native and variant) for their eight (four residues before and after) sequence neighbors (20*2 + 8*2 = 56 input nodes). Co-evolving positions. We estimated the co-evolution of positions in a multiple sequence alignment following the approach from 6. For each position in the multiple alignment we used the OMES 7 algorithm to calculate the correlation

Appendix p. 2 Hecht, Bromberg & Rost Supporting online material with any other position. The OMES method compares the observed co-occurrence of amino acid X at position i and amino acid Y at position j to the expected co- occurrence at positions i and j. This pairwise comparison yielded a ranking of all positions based on their pairwise correlation to any other position. From these, we extracted a six-node feature indicating the rank and the score (i.e. the deviation from the expectation value) for the three positions most correlated with the mutation position (2*3 = 6 input nodes). Residue annotation. In addition to SWISS-PROT annotations and SIFT predictions as already used in SNAP we considered residue annotation from Pfam 8 and PROSITE 9 to describe native and variant amino acids: (i) We determined whether the position was part of a PfamA domain. If so, we collected metrics of domain conservation and the posterior probability of native and variant belonging to that domain (4 input nodes). (ii) From PROSITE we extracted a binary single-node feature for all residues in the window indicating whether the specific residue is part of a PROSITE pattern (21 input nodes). Low-complexity regions. We used the SEG 10 algorithm to mask protein regions with low-complexity. From this masking, we extracted a feature of 21 binary input nodes indicating whether a mutation is in or close to a low-complexity region. Global features. We added global sequence information by calculating four features: The amino acid composition as the relative frequency of each amino acid (20 amino acids + 1 unknown = 21 input nodes); the sequence length feature encoding the protein length in 6 bins (0-60, 61-120, 121-180,181-240, 241-300, >300; 6 input nodes); the secondary structure composition and the solvent accessibility composition, each as a twelve-node binary feature using four bins (0- 25%, 26%-50%, 51%-75%, 76%-100%) for each state: helix-strand-other or buried- intermediate-exposed (2 * 12 = 24 input nodes).

Appendix p. 3 Hecht, Bromberg & Rost Supporting online material

Table SOM_1: Input features selected from AAindex *

AAindex 1 Description accession VINM940103 Normalized flexibility parameters (B-values) for each residue surrounded by one rigid neighbour 11 BLAM930101 Alpha helix propensity 12 DAYM780201 Relative mutability 13 QIAN880123 Weights for beta-sheet 14 KLEP840101 Prediction of protein function from sequence properties; Discriminant analysis of a data base: Net charge 15 SNEP660101 Relations between chemical structure and biological activity in peptides: Principal component I 16 RICJ880113 Relative preference values of amino acids at C2 17 SIMK990101 Distance-dependent statistical potential (contacts within 0-5 Angstroms) 5

* We listed the best-performing input features, i.e. amino acid indices that were selected by the feature selection procedure. Other indices from the corresponding clusters performed similarly. For each of these features both window-based and delta features were included into the final sequence-only network SNAP2noali.

Table SOM_2: Performance on independent data sets *

Method LacI repressor HIV-1 protease SIFT 72.2% ± 1.0 79.5% ± 3.2 SNAP 72.0% ± 1.0 78.3% ± 3.0 SNAP2 78.3% ± 0.9 74.1% ± 3.2 • Shown is the overall two-state accuracy (Q2 value; Method section) on 4041 LacI mutants and 336 HIV-1 protease mutants for SIFT, SNAP and SNAP2.

Appendix p. 4 Hecht, Bromberg & Rost Supporting online material

Table SOM_3: Performance estimates on ALL data set.

Q2 F1 (neutral) F1 (effect) MCC ROC AUC SNAP2 83.5% 0.79 0.87 0.65 0.91 SNAP 80.1% 0.76 0.83 0.59 0.88 SIFT 77.4% 0.74 0.83 0.54 0.84 PolyPhen-2 80.8% 0.75 0.84 0.60 0.85

* Performance estimates were obtained from cross-validation for SNAP2. For all methods the default thresholds were applied. Estimates are based on all variants from our ALL data set (see methods section; Data).

Appendix p. 5 Hecht, Bromberg & Rost Supporting online material

Figure SOM_1: Accuracy-Coverage curves for ALL data. These figures show performance on the ALL data set. Our new method SNAP2 (dark blue) outperforms its predecessor (SNAP, light blue), and SIFT (green) for both the variants that do not affect function (neutral, a) and for those that affect function (b). The x-axes indicate coverage/recall (Eqn. 1,2), i.e. the percentage of observed neutral (a) and effect (b) variants that are correctly predicted at the given threshold. The y-axes indicate accuracy/precision (Eqn. 1,2), i.e. the percentage of neutral (a) and effect (b) variants among all variants predicted in either class at the given threshold. The dark line (SNAP2noali) marks the performance of a SNAP2 version that does not use any information from sequence alignment. All results are computed on the test sets not used in training. A pink line marks the performance of a random predictor.

Appendix p. 6 Hecht, Bromberg & Rost Supporting online material

Figure SOM_2: Score distribution for SNAP2 on ALL data. Shown is the number of instance (y-axis) for each score (x-axis). Effect variants (red) mostly have predicted scores > 0 while neutral variants (green) are predominantly predicted at scores < 0.

References for Supporting Online Material

1. Kawashima, S. & Kanehisa, M. (2000). AAindex: amino acid index database. Nucleic acids research 28, 374. 2. Ofran, Y. & Rost, B. (2007). ISIS: interaction sites identified from sequence. Bioinformatics 23, e13-6.

Appendix p. 7 Hecht, Bromberg & Rost Supporting online material

3. Ofran, Y., Mysore, V. & Rost, B. (2007). Prediction of DNA-binding residues from sequence. Bioinformatics 23, i347-53. 4. Schlessinger, A., Punta, M., Yachdav, G., Kajan, L. & Rost, B. (2009). Improved disorder prediction by combination of orthogonal approaches. PloS one 4, e4433. 5. Simons, K. T., Ruczinski, I., Kooperberg, C., Fox, B. A., Bystroff, C. & Baker, D. (1999). Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins. Proteins 34, 82-95. 6. Kowarsch, A., Fuchs, A., Frishman, D. & Pagel, P. (2010). Correlated mutations: a hallmark of phenotypic amino acid substitutions. PLoS computational biology 6. 7. Fodor, A. A. & Aldrich, R. W. (2004). Influence of conservation on calculations of amino acid covariance in multiple sequence alignments. Proteins 56, 211-21. 8. Punta, M., Coggill, P. C., Eberhardt, R. Y., Mistry, J., Tate, J., Boursnell, C., Pang, N., Forslund, K., Ceric, G., Clements, J., Heger, A., Holm, L., Sonnhammer, E. L., Eddy, S. R., Bateman, A. & Finn, R. D. (2012). The Pfam protein families database. Nucleic acids research 40, D290-301. 9. Sigrist, C. J., Cerutti, L., de Castro, E., Langendijk-Genevaux, P. S., Bulliard, V., Bairoch, A. & Hulo, N. (2010). PROSITE, a protein domain database for functional characterization and annotation. Nucleic acids research 38, D161-6. 10. Wootton, J. C. & Federhen, S. (1993). Statistics of local complexity in amino acid sequences and sequence databases. Computers & chemistry 17, 149-163. 11. Vihinen, M., Torkkila, E. & Riikonen, P. (1994). Accuracy of protein flexibility predictions. Proteins 19, 141-9. 12. Blaber, M., Zhang, X. J. & Matthews, B. W. (1993). Structural basis of amino acid alpha helix propensity. Science 260, 1637-40. 13. Dayhoff, M. O., Schwartz, R. M. & Orcutt, B. C. (1978). A model of evolutionary change in proteins. In Atlas of protein sequence and structure (Dayhoff, M. O., ed.), Vol. 5. National Biomedical Research Foundation, Washington, DC. 14. Qian, N. & Sejnowski, T. J. (1988). Predicting the secondary structure of globular proteins using neural network models. Journal of molecular biology 202, 865-884. 15. Klein, P., Kanehisa, M. & DeLisi, C. (1984). Prediction of protein function from sequence properties: Discriminant analysis of a data base. Biochimica et Biophysica Acta (BBA)-Protein Structure and Molecular Enzymology 787, 221-226. 16. Sneath, P. (1966). Relations between chemical structure and biological activity in peptides. Journal of theoretical biology 12, 157-195. 17. Richardson, J. S. & Richardson, D. C. (1988). Amino acid preferences for specific locations at the ends of alpha helices. Science (New York, NY) 240, 1648.

Appendix p. 8