Technische Universität München Prediction of Functional Effects For

Technische Universität München Lehrstuhl für Bioinformatik Prediction of functional effects for sequence variants Maximilian Natale Friedrich Hecht Vollständiger Abdruck der von der Fakultät für Informatik der Technischen Universität München zur Erlangung des akademischen Grades eines Doktors der Naturwissenschaften genehmigten Dissertation. Vorsitzender: Univ.-Prof. Dr. Daniel Cremers Prüfer der Dissertation: 1. Univ.-Prof. Dr. Burkhard Rost 2. Univ.-Prof. Dr. Dimitri Frischmann Die Dissertation wurde am 24.06.2015 bei der Technischen Universität München eingereicht und durch die Fakultät für Informatik am 13.10.2015 angenommen. Contents Abstract 5 1Introduction 6 1.1 Forms of genetic variation . 6 1.2 Investigating human genetic variation . 10 1.3 Resources for variants and their molecular effects . 13 1.4 Prediction of variant effects . 15 1.5 Thesismotivationandgoals . 19 2Methods 20 2.1 Data................................ 20 2.2 Features for variant effect prediction . 22 2.3 Predictionmethod ........................ 24 2.3.1 Clustering and cross-validation . 25 2.3.2 Feature selection and parameter optimization . 26 2.3.3 Alignment-free prediction . 27 2.4 Performance measures . 28 2.5 Result visualization . 30 3Resultsanddiscussion 31 3.1 Better prediction of functional effects . 31 3.2 Difficult cases and method combinations . 35 3.3 Neutral variant dilemma . 37 3.4 Interpretation and reliability of variant prediction . 39 3.5 Variant prediction in orphan proteins . 41 3.6 The protein mutability landscape . 44 4Conclusion 48 Acknowledgements 49 3 References 50 Appendix 64 4 Abstract Technological advancements have enabled us to find and record genetic differences between humans. Yet, the vast amount of data gathered through next generation sequencing has long since overwhelmed our ability to study every discovered difference experimentally. Thus, computational approaches have been developed to cope with the deluge of data and guide experimen- tal efforts towards prioritizing the most promising candidates. This thesis focuses on the computational advancements made towards predicting the effects of genetic variation. It discusses the current state of research and presents a newly developed method for the prediction of functional effects of sequence variants. Our new method SNAP2 predicts variants with over 83% accuracy and is also able to predict effects for variants of orphan proteins. Both constitute significant and important improvements over other methods. Furthermore, it not only predicts single variants but offers a comprehensive view of all possible substitution effects in a protein. This opens up a novel view on the landscape of protein mutability. Possible applications are pre- sented that show how this novel view on functional effect predictions can translate into novel hypotheses and thus aid the identification of targets for drug development. 5 1Introduction Understanding human genetic variation is one of the major scientific chal- lenges of the 21st century. Nearly 15 years after we first sequenced the human genome, we still understand little of the mechanisms that link differences on the genetic level to the differences observed on the phenotypic level. Genetic variations influence phenotypes: from the most obvious, such as the visi- ble differences between individuals with different ethnic background, to the least obvious such as the differential response to drug treatment. Therefore, the genotype-phenotype link is crucial for understanding development and progression of diseases: it may be the key towards developing personalized treatment. 1.1 Forms of genetic variation Differing inheritable traits can either favor or hinder survival and reproduc- tion. This process is called natural selection. The interplay between genetic variation and selection pressure is what constantly adapts organisms to best fit their specific niche in the environment. In nature genetic variation hap- pens randomly through changes on the molecular level which are caused by various factors, such as mutations and random mating between organisms (Krishnamurthy, 2003). There are different forms of genetic variation, which can be categorized according to their size and type: (i) numerical variation, (ii) large-scale structural variation and (iii) small-scale sequence variation. Numerical Variation Numerical variation can be defined as changes in the number of chromosomes, referred to as polyploidy (numerical change of the whole set of chromosomes) or aneuploidy (number of individual chromosomes is altered). A prominent example of polyploidy is wheat, which through years of hybridization and human modification has different species that range from diploid (two sets 6 of chromosomes) to hexaploid (six sets of chromosomes; today’s common bread wheat) (Martinez-Perez et al., 2003). In humans polyploidy plays not only a role in the normal development of certain cell types but also in the development of cancer (Davoli and de Lange, 2011). For humans, aneuploidy of most chromosomes results in miscarriage but there are exceptions that result in live births (Driscoll and Gross, 2009). These children often suffer from genetic disorders, the most common example being trisomy 21 (known as Down syndrome), where a third (possibly partial) copy of chromosome 21 is present (Patterson, 2009). Aneuploidy can also be observed in cancer cells (Sen, 2000). Large-scale structural variation The class of large-scale structural variation comprises variations of long stretches/regions of DNA, typically ranging in size from kilo- to megabases. These structural variations include rearrangements such as translocations (exchange of region between chromosomes) or inversions (region is reversed in the chromosome) but also so-called Copy-Number Variations (CNV; region is duplicated or deleted). While balanced rearrangements (reciprocal translocations or inversions) do not result in gain or loss of genetic material they may still affect gene expression through gene fusion or by dissociating genes from their long-range regulatory elements (Harewood et al., 2010). CNVs, on the other hand, represent a significant alteration of the genetic material resulting from duplication or deletion of long stretches of chromosomes. These large-scale variations were found to be widespread and common among humans (Iafrate et al., 2004; Sebat et al., 2004) accounting for roughly 13% of the human genome (Stankiewicz and Lupski, 2010). Yet, with respect to CNVs only 0.4% of the genome significantly differed in a study comparing eight individual genomes to the reference assembly (Kidd et al., 2008). This suggests that most of the CNVs are common and shared by members of the same population. While CNVs have been associated with 7 diseases, for instance autism and schizophrenia (Cook and Scherer, 2008), it appears that gene gains are more common than gene losses and that gene amplification is favored through positive selection (Zhang et al., 2009). For example, the salivary amylase gene (AMY1) copy number was found to be varying significantly in different populations. This was considered an adap- tion of agricultural societies towards high-starch diet, as higher copy numbers correlated with higher salivary starch-digesting enzyme levels (Perry et al., 2007). Small-scale sequence variation The last class, the small-scale sequence variation, is the most prevalent form of genetic variation among humans (ENCODE Project Consortium, 2012). It comprises short or single nucleotide insertions/deletions (typically abbre- viated to ‘indels’) and single nucleotide substitutions (so-called point mutations). Indels (that are not a multiple of 3 bases) in genetic regions that encode proteins or functional RNA change the reading frame. This is prob- lematic because it affects how the codon triplets are read during transcription into mRNA (or other RNA). These so-called frameshift mutations have been shown to be involved in a number of diseases: For instance, the Crohn’s disease has been associated with an insertion in the NOD2 gene. The insertion of cytosine at position 3020 of the NOD2 gene was shown to produce a truncated protein. The resulting protein no longer responded to bacterial lipopolysaccharides, which might lead to increased susceptibility to Crohn’s disease (Ogura et al., 2001). Single nucleotide variation The most frequent human genetic variations are single nucleotide substitutions (called ‘single nucleotide polymorphisms’ – SNPs, or ‘single nucleotide variants’ – SNVs). The international SNP Map Working Group estimated that any two haploid genomes differed on average by around 1 nucleotide 8 every 1300 basepairs (Sachidanandam et al., 2001), a number which likely varies between ethnic groups. Estimates suggest that the human genome contains over 11 million SNPs of which roughly 7 million are common (oc- curring at a minor allele frequency [MAF] greater than 5%) in the human population while the remaining 4 million are uncommon (1% < MAF 5%) (Kruglyak and Nickerson, 2001). These SNPs are estimated to constitute 90% of the genetic variation in humans while the remaining 10% consist of a vast array of variants that are each rare within the population (Interna- tional HapMap Consortium, 2003). Researchers suspect to find a tremendous amount of these rare variants. It is likely that almost every ‘life-compatible’ variant can be observed in at least one of the roughly 7 billion people on earth (Frazer et al., 2009). The vast majority of SNPs is located in non- protein-coding regions of the genome, since only around

Technische Universität München Prediction of Functional Effects For

AI and Bioinformatics

Are You an Invited Speaker? a Bibliometric Analysis of Elite Groups for Scholarly Events in Bioinformatics

Dear Delegates,History of Productive Scientiﬁc Discussions of New Challenging Ideas and Participants Contributing from a Wide Range of Interdisciplinary ﬁelds

Improved Prediction of Protein Secondary Structure by Use Of

ISMB/ECCB 2007: the Premier Conference on Computational Biology Thomas Lengauer, B

A Novel Approach for Predicting Protein Functions by Transferring Annotation Via Alignment Networks Warith Djeddi, Sadok Ben Yahia, Engelbert Mephu Nguifo

Basel Computational Biology Conference. from Information to Simulation

Janga-Phd-Thesis.Pdf (PDF, 9Mb)

In Systems Physiology

CV Burkhard Rost

Fast and Free Software for Protein Contact Prediction from Residue Co-Evolution

Ensemble of Bidirectional Recurrent Networks and Random Forests for Protein Secondary Structure Prediction