Assessing Functional Impacts of Human Coding Variants

by

Fan Yang

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Department of Molecular Genetics University of Toronto

© Copyright by Fan Yang, 2017

Abstract

Advances in sequencing technology have made it routine to determine all coding variation in an individual. A pressing challenge in the post-genomic era is to functionally characterize these variants, particularly those within disease-associated genes. In cancer genome research, a critical remaining problem is how to separate ‘driver’ from ‘passenger’ mutations and to understand the functional mechanisms and consequences of driver mutations.

I analyzed missense somatic tumor mutations from 71 whole-genome or whole-exome sequencing studies across 21 cancers. I identified cancer-type-specific mutated domains and mutational hotspots. In some cases, I identified shifts in the positions of mutations and mutated domains between cancer types (but within a given gene product). I also provided clues to the functional effects of these mutations. In addition, I identified different domain-centric mutational distribution patterns between oncoproteins and tumor suppressor proteins. The systematic correlation of mutations and cancer types at the domain level has the potential to guide more precise cancer treatments.

Predictive models were also developed to quantify the impact of antigenicity on the spectrum of tumor missense somatic mutations. I found that somatic mutations are significantly depleted in peptides that are predicted to be displayed by MHC class I proteins, and characterized the dependence of this depletion on expression level. My results indicate that HLA class I alleles are, in general, incompletely dominant. I developed a model that produces an ‘antigenicity score’ for any input somatic coding mutation. These antigenicity scores could guide immunotherapy or aid in developing personalized cancer vaccines.

In another collaborative effort to characterize human variation, I developed yeast-based functional assays to assess the functionality of the disease-associated coding variants. I evaluated the ability of wild-type human disease-associated genes to rescue homologous yeast mutants. Complementation between homologous human and yeast genes could often be found in the absence of annotated orthology, and these complementation relationships were of similar value as orthologous relationships for detecting human disease-causing variation. Finally, I found that the ability to detect pathogenic variation from complementation assays was not limited to variants which occur within the aligned region of human and yeast homology.

Acknowledgments

I would first like to express my deepest appreciation to my supervisor, Dr. Frederick Roth, who has supported and inspired me throughout my time in graduate school. In addition to teaching me the subject of high-throughput biology, he has also indirectly taught me how to be a generous and kind person by setting such an excellent example. It has been a great privilege and pleasure to work under his supervision. I thank him for his creativity and his insight over the past four and a half years – this time working under his supervision has been a life-changing and irreplaceable experience.

I would like to thank my two supervisory committee members, Drs. Frank Sicheri and Lincoln Stein, for providing a constant source of high caliber ideas as well as offering a critical eye. They have both contributed a great deal including helpful guidance and continuous support, which has resulted in the successful completion of my project.

I offer my sincerest gratitude to my collaborators – without them, this work would not have been possible. Their technical expertise and never-ending support have been invaluable to me. As a non-exhaustive list, I would like to thank Guihong Tan, Nidhi Sahni, Song Yi, David E. Hill, Marc Vidal, and Charlie Boone. A special thanks to Drs. Hidewaki Nakagawa and Seiya Imoto, and their respective teams – their hard work and dedication contributed to the success of my analysis on the immunogenicity of cancer mutations.

To all of the members of Roth lab – it has been a pleasure working with you. I would like to thank each and every one of them for constantly providing me with excellent advice, for their senses of humor, for pulling me up when I’m down, keeping me excited about research, and for being the best possible teammates I could ask for. I’ve never felt so much like I was a part of something important, as I’ve felt here with all of them.

Of special note, thank you to Song Sun for being a patient teacher while I ‘learned the ropes’ of human-yeast complementation. Thank you to Evangelia Petsalakis for her continuous support with everything, especially data analysis, as well as her friendship, which has made the hardest times much easier. Thank you to Dae-Kyum Kim and Yingzhou Wu for their patience in answering all of my questions. Thank you to Kristina Ognjanovic for her great effort in proofreading this thesis. A huge thank you to Rong Huang and Shijie Zhou for being my home away from home, making every day fun and filled with laughter. I have spent the most productive hours of my last four and a half years with them, and they have no idea how much I appreciate their presence in my life.

Finally, I would like to thank my friends and family. I have been lucky enough to be surrounded by more beautiful, supportive people than I could possibly list here – I would like to thank them for being my safety net, my biggest fans, and my best friends. Yang Zhao and Huan Lian, my sister-friends, for understanding exactly how much this means to me. To Meng Zhang, Liang Chen and Bo Bao, I am thankful for the drinks, games, and late nights – they helped me survive. I thank Qiuyue Qu for being a constant source of hugs – I just hope I’m not too boring once grad school is over. To my parents, I am grateful for the myriad ways that they have helped me on this journey, and for backing me up no matter what. They have kept me focused on the light at the end of the tunnel, and given me perspective, something that I often overlooked in the final years of my work. To Jiantao Xie, I am grateful for him holding me up during the best and worst of times, being my voice of reason in the middle of the night, and making sure I always take good care of myself. He is everything I could have asked for as a boyfriend, and I appreciate his love, understanding and support more than he knows.

Table of Contents

Acknowledgments

Table of Contents

List of Tables

List of Figures

List of Appendices

List of Abbreviations

Chapter 1 Introduction
1.1 Addressing the genotype-phenotype question
1.2 Assessing the functional impact of human disease mutations using Saccharomyces cerevisiae as a model system
1.2.1 Yeast temperature sensitive strains
1.2.2 Human-yeast functional complementation assay
1.3 Cancer
1.3.1 Complexity and development of cancer
1.3.2 Primary and metastatic tumor
1.3.3 Oncogenes and tumor suppressors
1.4 Introduction to current cancer genomics research
1.4.1 TCGA, ICGC and PCAWG Projects
1.5 The function and dysfunction of the immune system in cancer
1.5.1 Cancer immune-editing
1.5.2 Antigen presentation process
1.5.3 MHC binding peptides prediction
1.6 Protein domains
1.7 Summary and Rationale

Chapter 2 Complementation of yeast genes with human genes as an experimental platform for functional testing of human genetic variants
2.1 Introduction
2.2 Methods
2.2.1 Defining the test space of complementation assays
2.2.2 Humanized yeast plasmid (wild-type and mutated) construction
2.2.3 Functional complementation assay
2.2.4 FC score and computational analysis
2.3 Functional complementation assay results in orthologous pairs
2.3.1 Human to yeast complementation in orthologous pairs
2.3.2 Assessing functional impact of disease associate genetic variants
2.3.3 Comparing functional complementation assay result with computational prediction result
2.4 Functional complementation assay results for paralogous pairs
2.4.1 Many complementation relationships exist for human-yeast paralogs
2.4.2 Some essential yeast genes are complemented by multiple human paralogs sharing only a single domain
2.4.3 Paralog complementation is only weakly predicted by sequence similarity
2.4.4 Assessing the pathogenicity of missense variants
2.5 Discussion

Chapter 3 Protein domain-level landscape of cancer-type-specific somatic mutations
3.1 Introduction
3.2 Methods
3.2.1 Creating the cancer missense somatic mutation dataset
3.2.2 Creating the dataset of potentially damaging missense somatic mutations
3.2.3 Cancer-type-specific significantly-mutated domain instance analyses
3.2.4 Cancer-type-specific significantly-mutated position based mutational hotspot analyses
3.2.5 Structural properties for position based mutational hotspots analyses
3.3 Results
3.3.1 Mutation distribution at the domain level analysis
3.3.2 Cancer-type-specific positioning of mutations within a given gene
3.3.3 Comparison between oncogenes and tumor suppressors
3.4 Discussion

Chapter 4 Predictive models of immunogenicity for somatic mutations
4.1 Introduction
4.2 Methods
4.2.1 Prediction of MHC binding peptides
4.2.2 Calculate the depletion of mutations within HLA class I binding peptides
4.2.3 The expression level of each gene
4.2.4 Relating predicted HLA binding peptides to personal HLA types
4.3 Results
4.3.1 Depletion of mutations within predicted HLA binding peptides
4.3.2 Depletion of mutations within expressed predicted MHC binding peptides
4.3.3 Depletion of mutations within predicted patient-displayed HLA binding peptides
4.4 Discussion

Chapter 5 Thesis summary and future directions
5.1 Thesis summary
5.2 Future directions
5.2.1 Identifying genetic interactions affected by human disease-associated mutations
5.2.2 Testing impacts of human disease-associated variants in human cell lines
5.2.3 Antigenicity of mutations associated with more HLA protein types

References

Appendices
Appendix A. Detailed protocol for human-yeast complementation assay
Appendix B. Media
Appendix C. Supplementary table for Chapter Three

Copyright Acknowledgements

List of Tables

Table 1-1 The 34 cancer types with counts of samples with RNASeq samples in TCGA dataset as of 05/2016

Table 2-1 139 orthologous pairs and complementation results.

Table 2-2 List of variants tested by functional complementation assay

Table 2-3 Human-yeast complementing paralogous pairs

Table 2-4 Single Point Performance Estimates

Table 2-5 Numbers of Human Disease-associated Genes with Orthologs and Paralogs in Five Model Species

Table 3-1 Prevalence of predicted damaging mutations in domain instances among cancer types

Table 3-2 Patient tissue samples from selected cancer genome studies across 21 cancer types.

Table 3-3 Cancer-type-specific significantly mutated domain instances and corresponding genes in different cancers. For each cancer type this table lists the significantly mutated domain instances (SMDs), corresponding gene symbols, and number of mutations in each domain instance.

Table 3-4 Number of significantly mutated domain instances and corresponding genes in each cancer type.

Table 3-5 Number of cancer types in which each domain instance was significantly mutated.

Table 3-6 Genes that encode cancer-type-specific significantly mutated domain instance and overlap with the Sleeping Beauty dataset

Table 3-7 Genes that encode more than one cancer-type-specific significantly mutated domain instance.

Table 3-8 Significantly mutated domain instances in oncogenes and tumor suppressors.

Table 3-9 Mutational hotspots observed at functional sites.

Table 3-10 Domain position-based mutational hotspots shared by at least three cancers with functional annotations

Table 3-11 List of pairs of significantly mutated domain instances that corresponded to directly-interacting proteins (Rolland et al., unpublished results).

Table 3-12 Ten pairs of domain instances that are inferred to mediate protein interaction with each other, for which both domains in the pair were found to be significantly mutated in the same cancer type.

List of Figures

Figure 1-1 Complementation functional assay between human and yeast homologues as a platform to assess functional impacts of more disease-associated mutations.

Figure 1-2 Cartoon representation of the T cell receptor’s interaction with peptide-loaded MHC-I (PDB id: 1G6R). The MHC-I heavy chain, loaded with the epitope SIINFEKL, is shown in green, while beta2-microglobulin is depicted in purple, and a T cell receptor in magenta.

Figure 1-3 MHC class I protein crystal structure (PDB id: 1VAC) with a cartoon representation of the MHC class I heavy chain (purple), beta-2 microglobulin (green) in complex with the co-crystallized ovalbumin-derived epitope SIINFEKL and the MHC class I peptide binding cavity constrained by two alpha-helices and a beta sheet.

Figure 1-4 MHC class II protein crystal structure (PDB id: 2ICW) with a co-crystallized super antigen of length 13 depicted in green. MHC class II α and β chains are colored in purpleblue and green, respectively.

Figure 2-1 Comparison of the fraction of complementation and non-complementation pairs among all the tested human-yeast orthologous pairs in two different cases. P-values were calculated by Fisher’s Exact test.

Figure 2-2 Selection of human and yeast paralogous pairs outline.

Figure 2-3 Outline for plasmid construction, Gateway cloning and spotting assay

Figure 2-4 Pipeline to assess the functional impact of human genetic variants by mutagenesis, functional complementation assay.

Figure 2-5 Pipeline to identify human and yeast orthologous complementing pairs using functional complementation assay.

Figure 2-6 Outline for assessing the functional impacts of human genetic variants based on the complementation relationship between human and yeast orthologous pairs

Figure 2-7 Distribution of FC score in disease-associated and non-disease-associated variants, and sequence identity in complementing and non-complementing pairs.

Figure 2-8 A. Precision and recall curve for FC scores, Polyphen2 scores; B. Precision and recall curve for Polyphen2 scores, SIFT scores and PROVEAN scores.

Figure 2-9 Identify human and yeast paralogous complementing pairs.

Figure 2-10 Comparison of protein domain structures among yeast KIN28 and its human paralogs tested. The complementing human genes are labeled in blue, non-complementing human genes are labeled in black. Yeast kin28 is labeled in red. Protein domains are shown as boxes in different colors.

Figure 2-11 Comparison of protein domain structures among yeast CDK1 and its human paralogs tested. Pkinase domains are shown as blue boxes

Figure 2-12 A. The average percent identity (PID) score distribution for yeast-human pairs for which a yeast protein had multiple human paralogs tested. B. The average PID score distribution for human-yeast pairs for which a human protein had multiple yeast paralogs tested.

Figure 2-13 A. The precision-recall (PRC) curve of FC score and PolyPhen2 score. B. The distribution of FC score of disease-associated and non-disease-associated variants.

Figure 2-14 Precision vs. recall analysis for variants that either do (‘aligned’) or do not (‘non-aligned’) fall within sequence region that can be aligned between human and yeast homologs.

Figure 2-15 The phylogenetic tree of yeast kin28 and its paralogs tested.

Figure 2-16 Protein kinase tree showing all the tested human and yeast protein kinases. All the tested non-complementing human proteins are colored in cyan, complementing human proteins are colored in magenta.

Figure 3-1 Mapping Mutations Detected from Different Cancers to Domain Instances. Rectangles represent protein domain instances in a given gene. Colored dots represent mutations detected in different cancer types.

Figure 3-2 Mutation Densities for Domain Instances across Cancers. Box plots display mutation densities for mutated domain instances in different cancers. Outliers are shown as dots. Only predicted-damaging mutations predicted by IntOGen were used for this analysis.

Figure 3-3 Clustering of Significantly-mutated Domain Instances across 21 Cancer Types. The heatmap reflects the significance of cancer-type-specific mutation density of each domain instance in different cancers. Side bars in the same color indicate domain instances encoded by the same gene, and domain instances belonging to the same domain type.

Figure 3-4 Mutations in EGFR across 5 Different Cancers with Protein Structure Context. (A) The histogram displays the proportions of mutation counts detected at each residue to the total number of mutations that fall in the four different domains encoded by the gene EGFR, in five different cancers. The x-axis indicates the position of mutant residues. Mutations in different domains are shown in different colors. (B) shows the structure of the EGFR protein with epidermal growth factors colored in orange. The arrows point to enlargements of portions of the protein. The tails of the kinase domain are not shown in this structure. The structure visualization was based on Protein Data Bank structure models 1nql, 1ivo, 2jwa, 1m17 and 2gs6. Significantly-mutated domain instances (SMDs) are shown as thicker boxes

Figure 3-5 Cancer-type-specific Mutational Hotspots and Mutational Hotspots Shared by Several Cancer Types. A. shows the distribution of mutational hotspots for different cancer types within a given domain instance. B. shows mutational hotspot distribution patterns of different domain instances (encoded by different genes) that each correspond to the same protein domain type. Mutational hotspots are shown as balls and sticks, domain instances are shown as boxes. Mutational hotspots in different colors represent mutations in different cancer types.

Figure 3-6 Distribution of Mutated Residues within a Single Gene. (A) compares the prevalence with which mutations from a specific cancer type fall within significantly mutated domain instances (SMDs) to the prevalence of mutations in other domain instances. Genes with at least one SMD are represented on x-axis in descending order by the number of mutated residues. The length of each blue bar shows the number of the mutated residues falling in SMDs for each cancer type, the length of red bars shows the number of mutated residues falling in other domain instances within the same gene. (B) compares the fraction of mutated residues in SMDs that are hotspots in oncogenes (yellow) and tumor suppressors (green).

Figure 3-7 Distribution of Mutated Residues in FGFR. Sequence positions and frequencies of mutated residues in the FGFR protein are shown. Mutational hotspots for each cancer type are displayed as red dots. SMDs are shown as thicker boxes.

Figure 3-8 Structural Context of p53 Protein (PDB 3q05) Mutational Hotspots. Mutational hotspots shared by eight cancers are displayed as blue sticks. Liver-cancer-specific mutational hotspots are displayed as magenta sticks. The p53 protein structure is colored according to amino acid chain.

Figure 3-9 Mutation Distributions of Different Ras Domain Instances and The Structure of Ras Domain. (A) bar graph shows Ras domains encoded by different genes have different mutation rates across cancer types. (B) heat map shows fraction of mutations observed at each residue of a given gene in a given cancer. (C) the structure of the Ras domain encoded by the KRAS gene (PDB structural model 4lpk). GTP/GDP binding sites are displayed as magenta sticks, GDP binding sites are colored in cyan.

Figure 4-1 Outline for antigenicity score of cancer mutations

Figure 4-2 Predicting the MHC binding peptides and calculate mutation density

Figure 4-3 Three types of MHC binding peptides based on patient HLA types

Figure 4-4 The percentage of all mutations of all types that are missense somatic mutations, calculated both within MHC binding peptides and out of MHC binding peptide

Figure 4-5 The average mutation density for genes with different expression percentiles. For each tumor, expression levels were from cancer samples matched by cancer type.

Figure 4-6 Mutation density in expressed genes within MHC binding peptides and out of MHC binding peptides. High expression level genes are those genes whose expression level is higher than the median expression level. Low expression level genes are those genes whose expression level is lower than the median expression level. In each case, expression levels were from cancer samples matched by cancer type.

Figure 4-7 Dependence of mutation depletion on expression level in patient-displayed MHC binding peptides

Figure 4-8 Average mutation density in peptides predicted to be displayed by at least one of the 12 common HLA-A or HLA-B allele types (“in MHC binding peptides”) or not displayed by any of the 12 common HLA-A or HLA-B allele types (“out of MHC binding peptides”), for different numbers of predicted-displaying HLA alleles. Where a patient has allele types from the common set of 12 alleles at both the HLA-A and HLA-B loci, but none of these alleles are predicted to display, we say “No alleles”. Where a patient has an allele type outside of the common set of 12 alleles at either the HLA-A or HLA-B loci, we say that the number of displaying alleles is “Unknown”.

Figure 4-9 Ratio of mutation density in MHC binding peptides to that of “non-MHC binding peptides” displayed by one HLA allele (R1, colored in red), by two alleles (R2, colored in blue)

Figure 4-10 Mutation density for samples with different HLA types. Samples on the diagonal of the heat map are homozygous at HLA loci.

Figure 4-11 The distribution of depletion score of each MHC binding peptide type at different expression level.

List of Appendices

Appendix A. Detailed protocol for human-yeast complementation assay

Appendix B. Supplementary table for Chapter 3

List of Abbreviations

Individual amino acids are abbreviated in the text as three letters.

Units of volume, length, and concentration use the symbols approved by the International System of Units.

Δ – deletion (of a gene)

℃ – degree Celsius

ATP – adenosine triphosphate

bp – base pairs

CEN – centromere

clonNAT – nourseothricin

CRISPR – clustered regularly interspaced short palindromic repeats

CTL – cytotoxic T-lymphocytes

COSMIC – Catalogue of Somatic Mutations in Cancer database

DC – dendritic cells

DMSO – dimethyl sulfoxide

E. coli – Escherichia coli

ER – endoplasmic reticulum

FC score – Failure-to-Complement score

G418 – geneticin

GAL – galactose

GFP – green fluorescent protein

HLA – human leukocyte antigen

ICGC – International Cancer Genome Consortium

kanMX – kanamycin resistance cassette

LB – Lysogeny broth

MHC – major histocompatibility complex

mRNA – messenger RNA

natMX – nourseothricin resistance cassette

OD – optical density

ORF – open reading frame

PCAWG – the Pan-Cancer Analysis of Whole Genomes

PCR – polymerase chain reaction

S. cerevisiae – Saccharomyces cerevisiae

SC media – Synthetic complete dropout media

SGA – Synthetic Genetic Array

SMDs – significantly mutated domain instances

SNP – single nucleotide polymorphism

ssDNA – single stranded DNA

TAP – transporter associated with antigen presentation

Tet – tetracycline-regulatable promoter-replacement

TS – temperature-sensitive

TCGA – The Cancer Genome Atlas

WT – wild-type

YPD – yeast extract peptone dextrose

Chapter 1 Introduction

Introduction

In recent years, geneticists have made rapid progress in understanding the genetic basis of phenotypes (Steinmetz and Davis 2004). The 1000 Genomes Project has sequenced the genomes of 2504 individuals from 26 populations (Clarke, Zheng-Bradley et al. 2012, Sudmant, Rausch et al. 2015), while whole-genome and whole-exome sequences for ~30000 individuals are set to become available in the next few years (2010). However, one limitation of most large-scale genomic studies aimed at revealing mutational consequences is that, although thousands of genetic loci have been associated with human diseases, only a few of the variants at those loci actually contribute to disease (Frazer, Murray et al. 2009). Even for many well-known disease-causing genes, it is unclear which variants contribute to a given human disease. Moreover, even for mutations identified as truly causal, little is known about how and to what extent they affect protein function and the associated protein-protein interactions. Most whole-gene-based or whole-protein-based large-scale studies have not distinguished the diverse functional consequences of mutations in different protein regions, a distinction that is important for understanding how and why mutations in a single gene can cause different types of diseases. Therefore, there is a pressing need to identify and functionally characterize human disease-associated variants in a systematic manner.

To achieve this goal, during my doctoral studies I conducted three projects that aimed to systematically understand the functional impacts of human disease-associated variants using both experimental and computational methods. First, to identify pathogenic human variants efficiently and accurately, I collaborated with others to develop a set of complementation assays, which represent a surrogate genetics platform for identifying pathogenic human variants. I systematically tested the complementation relationships between 96 human and yeast orthologous pairs and assessed the functional impact of 179 variants (101 disease-associated and 78 non-disease-associated variants) from 22 human disease genes. I then expanded the testing dataset to 1060 human-yeast homologous pairs. This work demonstrated that orthology was not generally required for complementation and that assays for detecting pathogenic variants could therefore be developed for a wider range of human disease genes.

Second, to understand how tumor mutation distribution patterns are associated with different mechanisms of cancer development, I analyzed the distribution of missense somatic mutations in cancer at the protein domain level and proposed a new way to predict oncogenes and tumor suppressors.

Finally, to illustrate how the distribution of human genetic variation is associated with the immune selection pressure on genetically unstable cancer cells, I built models of the immunogenicity of cancer mutations and developed a model that produces an ‘antigenicity score’ for any input somatic coding mutation. Although these three projects are not directly related to each other, they all aim to understand the interplay between human genetic variants and human disease mechanisms.

In this chapter, I will set the stage for this work by discussing genotype-phenotype associations. Then, I will describe some relevant features of the yeast Saccharomyces cerevisiae, the model system used herein to experimentally assess the functional impacts of genetic variants. I will then summarize some key discoveries in cancer genomics, specifically regarding the immunogenicity of cancer mutations. Lastly, I will detail how protein domains can play important roles in understanding the association between multi-functional proteins and different disease mechanisms.

1.1 Addressing the genotype-phenotype question

The rapid development of biotechnology has provided many kinds of ‘omic’ data, including whole-genome sequencing, transcriptomic, methylomic and metabolomic data. Nevertheless, too little is currently known about the nature of the genetic variation underlying complex diseases in humans. Over the past decades, omics data have been expected to help us understand the mechanisms of disease development and to provide strategies for disease prediction and precise treatment (Hood and Flores 2012). Despite our growing access to omics data, our ability to explain the molecular basis of complex human diseases, such as cancer, remains limited. Meanwhile, although genetic linkage and association studies have connected tens of thousands of genetic loci with diverse human diseases, the disease-causing variants have been identified for only a small fraction of these loci (Stranger, Stahl et al. 2011). The challenges in dealing with large-scale omics data stem not only from the complexity of biological processes but also from the noisy nature of experimental data and the limitations of statistical analyses, such as false-positive associations (Scholz, Lo et al. 2012). Therefore, an important goal and unmet challenge of omics data analysis is to develop effective models that can predict phenotypic traits and outcomes, elucidate important biomarkers and generate important insights into the genetic basis of complex traits (Ritchie, Holzinger et al. 2015).

The vast wealth of omics data laid the foundation for a new kind of genetic analysis – the field of molecular biology known as functional genomics, which seeks to interpret the data generated by large-scale system-level analyses (Suter, Auerbach et al. 2006). Various functional genomics methods have been proposed to describe gene functions and interactions, by synthesizing genomic knowledge into a global understanding of the dynamic and highly interconnected processes that regulate a cell. At a fundamental level, this means that functional genomics aims to address the functions and interactions that mediate the relationship between genotype and phenotype – helping to understand how the genetic makeup of an organism determines its overall functioning and fitness.

Although many different kinds of genetic variants can contribute to human diseases, my research projects mainly focus on understanding the functional impact of single nucleotide polymorphisms (SNPs). SNPs can be non-synonymous variants, which change the encoded amino acid, or silent synonymous variants, or they may occur in noncoding regions. They may influence transcript expression level, messenger RNA (mRNA) splicing and conformation, post-transcriptional regulation and protein stability. Any of these effects may ultimately lead to disease. Therefore, my studies aim to identify human coding variants and analyze their functional impacts, which will give us a better understanding of their effects on genes, proteins and disease development (Shastry 2009).
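As a minimal illustration of the distinction between synonymous and non-synonymous coding variants (a sketch for orientation only, not code used in this thesis), the example below classifies a single-nucleotide substitution within one codon using the standard genetic code. The function name `classify_snp` and its arguments are hypothetical, and one-letter amino acid codes are used purely for compactness.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), with codons enumerated
# in T, C, A, G order at each position; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(ref_codon, position, alt_base):
    """Classify a single-base substitution within a codon.

    ref_codon: reference codon, e.g. "GGT" (DNA alphabet, coding strand)
    position:  0-based index (0, 1 or 2) of the substituted base
    alt_base:  the variant base at that position
    Returns "synonymous", "missense", "nonsense" or "stop-lost".
    """
    ref_codon = ref_codon.upper()
    alt_base = alt_base.upper()
    if ref_codon not in CODON_TABLE or alt_base not in BASES:
        raise ValueError("codon and base must use the letters T, C, A, G")
    if position not in (0, 1, 2):
        raise ValueError("position must be 0, 1 or 2")
    if ref_codon[position] == alt_base:
        raise ValueError("variant base is identical to the reference base")

    alt_codon = ref_codon[:position] + alt_base + ref_codon[position + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]

    if alt_aa == ref_aa:
        return "synonymous"   # encoded amino acid unchanged (silent)
    if alt_aa == "*":
        return "nonsense"     # amino acid codon becomes a stop codon
    if ref_aa == "*":
        return "stop-lost"    # stop codon becomes an amino acid codon
    return "missense"         # one amino acid replaced by another


if __name__ == "__main__":
    print(classify_snp("GGT", 1, "A"))  # GGT (Gly) -> GAT (Asp)
    print(classify_snp("GGT", 2, "C"))  # GGT (Gly) -> GGC (Gly)
    print(classify_snp("TGG", 2, "A"))  # TGG (Trp) -> TGA (stop)
```

This sketch considers only the protein-coding consequence of a substitution; effects on splicing, expression or mRNA stability, which are also mentioned above, would require additional information beyond the codon itself.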

Diverse computational and experimental methods exist to infer the pathogenicity of rare human coding variants. Computational methods are beginning to find acceptance as clinical diagnostic tools (Richards, Aziz et al. 2015), but still have limited predictive power (Mathe, Olivier et al. 2006, Chan, Duraisamy et al. 2007, Cline and Karchin 2011, Thusberg, Olatubosun et al. 2011, Castellana and Mazza 2013, Frousios, Iliopoulos et al. 2013, Gnad, Baucom et al. 2013). Experimental assessment of variant function in human cells is hampered by inefficient allele replacement technology and by the presence of paralogs with overlapping function, making complementation testing in ‘humanized’ model organisms an attractive alternative (Marini, Gin et al. 2008, Trevisson, Burlina et al. 2009, Mayfield, Davies et al. 2012, Davis, Frangakis et al. 2014, Laurent, Young et al. 2015). Complementation of mutant versions of model organism genes by cognate human genes is a classic method to identify human gene function (Lee and Nurse 1987, Levitan, Doyle et al. 1996, Osborn and Miller 2007, Keegan, McGurk et al. 2011). Such complementation relationships can then be exploited to assess the impact of amino acid changes in human proteins (Kruger and Cox 1994, Kruger and Cox 1995, Marini, Gin et al. 2008, Trevisson, Burlina et al. 2009, Wei, Wang et al. 2010, Mayfield, Davies et al. 2012, Dimster-Denk, Tripp et al. 2013, Davis, Frangakis et al. 2014). Like computational methods, complementation assays in experimentally tractable model organisms allow functional assessment of human variation at a scale that is commensurate with the size of the human population.

1.2 Assessing the functional impact of human disease mutations using Saccharomyces cerevisiae as a model system

The yeast Saccharomyces cerevisiae has long been an effective eukaryotic model organism and continues to contribute greatly to our understanding of human gene function. Compared with other eukaryotic model organisms, this single-cell fungus has many advantages: 1) it is easy and inexpensive to maintain and grow; 2) it is stable in both haploid and diploid states (Goffeau, Barrell et al. 1996, Goffeau 2000); 3) its genome sequence (the first reported among eukaryotes) is perhaps the best understood, with 5152 out of 6604 yeast open reading frames (ORFs) having an annotated function (Engel, Dietrich et al. 2014); and 4) it has a high rate of recombination between homologous DNA sequences, which is important for the precise insertion of DNA sequences at specific yeast genome locations (Wach, Brachat et al. 1994).

1.2.1 Yeast temperature sensitive strains

In general, essential genes and non-essential genes are genetic descriptions referring to the functional consequence of deleting a gene with respect to its effect on an organism’s viability. If a gene is required for survival, it is considered to be essential (Gerdes, Scholle et al. 2003). Previous studies have estimated that at least 60% of yeast genes have a human homolog, and that 87% of yeast protein domains can be found in human genes (Peterson, Doughty et al. 2013, Peterson, Park et al. 2013). Essential yeast genes tend to be highly conserved in humans: 38% of essential yeast genes have clear counterparts in humans (Hughes 2002). The high conservation of gene function between human and yeast suggests that yeast can serve as a model organism for interpreting essential gene functions in human.

Studies have shown that 97% of all ~6,000 yeast genes can display a fitness defect under some growth condition (Hillenmeyer, Fung et al. 2008). The functional analysis of essential genes in yeast has traditionally relied on conditional mutants. Conditional alleles enable a functional version of the gene under a permissive condition and a compromised version under semi-permissive or non-permissive conditions. In yeast, the most common conditional alleles include temperature-sensitive (TS) (Hartwell 1967), temperature-sensitive inteins (Skubitz, Christiansen et al. 1989), cold-sensitive (Graham, Davis et al. 1975), temperature-inducible degron (Sanchez-Diaz, Kanemaki et al. 2004), and tetracycline-regulatable promoter-replacement (Yen, Gitsham et al. 2003).

Among those conditional alleles, the most common one is the TS allele. A temperature sensitive mutant can retain its normal function at a low (permissive) temperature, but may completely or partially lose its function when the temperature increases. The Boone and Andrews labs have collected and created yeast temperature sensitive (TS) mutant strains for almost all essential genes, which is a valuable resource for assessing functions of both yeast and corresponding human genes (Morakote and Justus 1988). To date, they have constructed ~1000 TS allele strains representing ~900 essential genes, accounting for ~90% of all essential genes in yeast (Guihong Tan, personal communication).

1.2.2 Human-yeast functional complementation assay

As mentioned above, the release of the yeast genome sequence, coupled with the ease of genetic manipulation in S. cerevisiae, enabled the generation of a powerful set of unique tools ranging from individual assays to high-throughput genome-wide scale experiments (Botstein and Fink 1988, Botstein and Fink 2011) for genome-scale functional studies.

There are several approaches for the assessment of human genetic variants using yeast. One method is to study the functional impact of a given human variant by introducing a homologous mutation in the corresponding yeast ortholog based on sequence conservation. However, this gene-sequence-based approach limits the number of genetic variants that can be studied. Its accuracy may also be limited, since the function of a gene, or of a given genomic region, can be maintained even when the gene sequence has diverged substantially during evolution (Shmerling 1972). A more desirable way is to study the functional impact of human genetic variants using a protein-sequence-based approach by cross-species complementation. Yeast-based functional complementation assays have been successfully employed to study human protein functions, as well as the functional impact of amino acid changes in proteins (Beach, Durkacz et al. 1982, Dolinski and Botstein 2007, Mayfield, Davies et al. 2012). The first example is the assessment of mutational consequences in the human cystathionine β-synthase gene (Kruger and Cox 1995). More recently, several studies have adopted the functional complementation assay to measure the pathogenicity of disease-associated missense mutations, e.g. the test of 12 mutations in human argininosuccinate lyase (ASL) (Trevisson, Burlina et al. 2009, Marini, Thomas et al. 2010). To date, hundreds of human-yeast complementation pairs have been reported (Dreyer 1985).

Several measurable phenotypes have been used in cross-species complementation assays. The most straightforward phenotype to assess is the ability of a human gene to rescue the growth defects caused by the yeast mutant. As summarized by Philip Hieter and colleagues (Hamza, Tammpere et al. 2015), “Complementation of essential yeast genes by candidate human gene homologs can be assessed by the ability of a human gene to rescue lethality caused by 1) a null allele (deletion in a haploid strain), 2) a conditional allele under restrictive conditions (e.g., temperature-sensitive strain), or 3) down regulation by a repressible promoter (e.g., Tet system) (Kachroo, Laurent et al. 2015). The complementation of a nonessential yeast gene mutants by a candidate human gene can also be assayed when they present a phenotype (e.g., drug sensitivity), or when the nonessential yeast gene can be converted to an essential gene by disrupting a synthetic lethal partner (Greene, Snipe et al. 1999)”.

In the project introduced in Chapter Two, I use temperature sensitive strains to conduct a human-yeast complementation assay. Wild-type human genes are expressed from a plasmid in yeast strains with TS mutations at different temperatures, ranging from the permissive temperature (room temperature) to non-permissive temperatures. Complementation is identified when expression of a wild-type human disease gene increases the fitness, i.e. the growth rate, of a TS yeast strain at semi-permissive or non-permissive temperatures. Here, increased fitness is measured relative to the corresponding TS yeast strain carrying a green fluorescent protein (GFP) reporter plasmid (Figure 1-1).

Figure 1-1 Complementation functional assay between human and yeast homologues as a platform to assess functional impacts of more disease-associated mutations.
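
The comparison described above can be illustrated with a minimal Python sketch, which scores complementation as the ratio of the mean growth rate of a TS strain expressing the human gene to that of the same strain carrying the GFP control plasmid. The function names, the example growth rates and the 1.5-fold threshold are assumptions chosen purely for illustration; they are not the scoring procedure or cutoffs used in Chapter Two.

```python
from statistics import mean


def relative_fitness(human_rates, gfp_rates):
    """Ratio of mean growth rate (human gene plasmid) to mean growth rate (GFP control).

    Both arguments are replicate growth-rate measurements for the same TS strain
    at the same temperature.
    """
    control = mean(gfp_rates)
    if control <= 0:
        raise ValueError("control growth rate must be positive")
    return mean(human_rates) / control


def complements(human_rates, gfp_rates, min_ratio=1.5):
    """Call complementation when relative fitness reaches a threshold.

    min_ratio is a hypothetical cutoff, not a value taken from the thesis.
    """
    return relative_fitness(human_rates, gfp_rates) >= min_ratio


if __name__ == "__main__":
    # Made-up replicate growth rates at a non-permissive temperature.
    with_human_gene = [0.30, 0.34, 0.32]
    with_gfp_control = [0.10, 0.12, 0.11]
    print("relative fitness:", relative_fitness(with_human_gene, with_gfp_control))
    print("complements:", complements(with_human_gene, with_gfp_control))
```

In practice a call of this kind would presumably also account for replicate variance, for example with a statistical test, rather than relying on a fixed fold-change cutoff.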

1.3 Cancer

Cancer is a complex, diverse and dynamic disease that is still not fully understood even after decades of study. According to the National Institutes of Health, cancer is defined as “a term for diseases in which abnormal cells divide without control and are able to invade other tissues” (http://www.cancer.gov/cancertopics/cancerlibrary/what-is-cancer). Based on the tissue or cell type of cancer initiation, cancer can be separated into more than 200 types. One main difference between cancer and other well-studied genetic diseases is that cancer is a complex disease: multiple factors, both genetic and environmental, usually contribute to a given cancer, making it harder to understand the mechanism of cancer initiation and development.

1.3.1 Complexity and development of cancer

Complex diseases are by definition caused by a combination of genetic and environmental factors (which may include lifestyle factors or pathogens). It is therefore difficult to determine their underlying mechanisms. Cancer is made more complicated to study because the majority of cancer-causing mutations are somatic (not present in the germline) (Greaves and Maley 2012, Bedard, Hansen et al. 2013).

Although environmental and other non-genetic factors have roles in many stages of tumorigenesis, the disease itself is presumed to be induced through the development of somatic mutations in cancer-associated genes. These mutations lead to cancer via different mechanisms, such as 1) evading growth suppressors; 2) sustaining proliferative signaling; 3) resisting cell death and inducing angiogenesis; 4) enabling replicative immortality; 5) activating invasion and metastasis (Greaves and Maley 2012, Bedard, Hansen et al. 2013). All tumors exhibit some of these hallmarks, and the specific needs of each tumor are determined by its microenvironment.

Currently, the Cancer Gene Census lists 595 genes (Futreal, Coin et al. 2004, Santarius, Shipley et al. 2010). The Catalogue of Somatic Mutations in Cancer (COSMIC) lists human genetic variants in more than 20000 protein coding regions (Forbes, Bindal et al. 2011), although most of these are likely to be non-causal ‘passenger’ mutations. Many annotated genes in the human genome are linked to mutations in cancer, with most of those genes being mutated at a low rate in cancer. In a previous study of 12 cancer types, all but acute myeloid leukemia had more than one mutation per Mb on average (Kandoth, McLellan et al. 2013). The exact number of mutations required for tumor initiation and development is unknown, but the current model requires multiple mutations contributing to the disruption of multiple cellular functions (Futreal, Coin et al. 2004, Santarius, Shipley et al. 2010, Bedard, Hansen et al. 2013). Moreover, some mutations caused by carcinogens, such as thymidine dimerization due to UV exposure, occur randomly and do not tend to target specific genes (Brash, Rudolph et al. 1991). Mutations that confer a selective advantage are preferentially expanded, while cells that receive deleterious mutations will either undergo cell-cycle arrest or apoptosis. The selected cells will continue to divide and receive additional mutations. Eventually, those mutations will be sufficient to produce a tumor.

Cancer cells can metastasize, that is, disseminate and invade other tissues or body regions; hence, cancer types are defined primarily by their origins. This migration of cells also makes cancer more difficult to treat, with most cancers having a worse prognosis when metastases are present. Additionally, metastasis results in increased genetic heterogeneity in cancer, as cells are under selective pressures resulting from their microenvironments (Nguyen, Bos et al. 2009).

After initiation, the cancer environment changes, which results in new selective pressures. Cancer cells need to evolve or alter current mechanisms to adjust to these new pressures. The mutations that enable this are considered progression mutations and occur independently in a cancer type. While some of the cancer cells inevitably die off due to lack of progression mutations, different subpopulations can respond to the new selective pressures with independent mutations, resulting in tumor heterogeneity. Genetic inter-patient and intra-tumor heterogeneity can pose even more challenges for cancer genome analyses (Bedard, Hansen et al. 2013). In the studies outlined in Chapter Three and Chapter Four, I analyzed cancer-type-specific and patient-specific cancer genomic data to help understand the different impacts of mutations in different cancers and different patients.

1.3.2 Primary and metastatic tumor

Cancer types are often named according to the tissue or cell type in which they first appear (the “primary site” of the cancer; e.g., cancer that arises in the lung is called “lung cancer”). Cancer cells may later spread to other parts of the body of a patient through the blood or lymphatic system. If cancer spreads from the primary site to another place in the body, it is called “metastatic” or “secondary” cancer. Metastatic cancer is referred to with the same name as the original, or primary, cancer. For example, if a lung cancer spreads to the liver and forms a metastatic tumor, it is called metastatic lung cancer, not liver cancer (Fidler and Kripke 2015).

Tumor metastasis is a major contributor to the deaths of cancer patients. Therefore, it is important to prevent cancers from metastasizing. A predictive understanding of how cancer-associated variants evolve during cancer development (Steeg 2016) could enable identification and more aggressive treatment of primary tumors that are more likely to metastasize.

1.3.3 Oncogenes and tumor suppressors

Cancer genes can be separated into two categories: oncogenes and tumor suppressors. Oncogenes typically have a “dominant” effect and can thus contribute to tumorigenesis if one allele is mutated, amplified or inappropriately expressed. Tumor suppressors frequently have a “recessive” effect, requiring mutations in both alleles of the gene to contribute to tumorigenesis. Oncogenes often promote processes like cell proliferation (e.g. transcription factors that alter the cell cycle), while tumor suppressors often restrain proliferation (e.g. through activation of apoptosis). A combination of mutations in both classes can result in dysregulation of normal cellular pathways, leading to abnormal cell growth (Chin, Andersen et al. 2011).

Either the activation of oncogenes or the inactivation of tumor suppressors can contribute to tumor initiation and development. Normal genes are called proto-oncogenes if they have the potential to become oncogenic. These genes help cells grow under normal conditions, but if a proto-oncogene is mutated or is present in too many copies, it can become permanently ‘turned on’ or activated and ultimately contribute to cancer. These genes are commonly altered in two different ways: 1) a mutation in a regulatory region resulting in overexpression of the normal gene, or 2) a coding mutation resulting in hyperactivation of the protein. Oncogenes often encode growth factors, kinases, GTPases, and transcription factors involved in pathways including cell proliferation, growth, survival, and differentiation.

Tumor suppressor genes typically behave in a recessive manner: normal expression of the protein limits tumor formation, so both copies of the gene must be disrupted for tumor growth. Since mutations are gained randomly, this usually requires two independent mutations in the same tumor suppressor gene. While two independent mutations are often required, some mutations are “dominant negative”, such that the product of the mutated allele is able to reduce the activity of the gene product of the normal allele. Tumor suppressor genes prevent tumor growth via several mechanisms: reduced cell division, increased DNA damage repair, or induction of apoptosis of damaged cells (Fodde and Smits 2002, Hohenstein 2004).

Most oncogenic and tumor suppressor mutations are acquired somatically rather than inherited, but there are many important exceptions where a cancer-contributing allele is inherited via the germline (Guilford, Hopkins et al. 1998, Richards, McKee et al. 1999, Shinmura, Kohno et al. 1999, Stone, Bevan et al. 1999, Yoon, Ku et al. 1999, Kusano, Kakiuchi et al. 2001, Graziano, Ruzzo et al. 2003, Zhang, Liu et al. 2006, Corso, Marrelli et al. 2012). Also, some genes, such as the gene FGFR, can behave as either an oncogene or a tumor suppressor in different cancers (Lafitte, Moranvillier et al. 2013).

1.4 Introduction to current cancer genomics research

Identifying genetic variants in each cancer’s genome and understanding how such changes lead to cancer initiation and development can help to improve cancer prevention, early detection, and treatment. Large-scale whole-genome or whole-exome sequencing projects are conducted to identify genetic variants that drive tumor initiation and development (Stratton 2011). The Cancer Genome Atlas (TCGA) pilot project has provided a flood of human somatic mutation data mainly associated with about 30 major cancer types. However, distinguishing driver mutations, which are expected to cause tumor growth, from passenger mutations, which play no role in cancer development, is still challenging. Moreover, the lack of a detailed understanding of the mechanisms of tumor development has led to ineffective treatments (Dunn, Old et al. 2004).

Recent whole-genome or whole-exome sequencing studies have provided the first comprehensive insights into the proportions of cancer mutations (Stratton 2011, Simonds, Khoury et al. 2013). The vast majority of genetic variants detected in cancer samples appear to be passenger mutations, but there may be more driver mutations than have currently been identified (Greenman, Stephens et al. 2007). Therefore, although many mutations contribute little to cancer development, a substantial number of driver mutations remain to be identified.

To identify driver mutations associated with different types of cancers, many studies have been designed and conducted. Most studies take a gene-based approach to identify driver mutations, on the assumption that highly mutated genes are more likely to contain driver mutations. However, recent work has demonstrated that passenger mutations are not always randomly distributed throughout the genome. Therefore, to accurately identify cancer genes, additional strategies became necessary (Bignell, Greenman et al. 2010). Recently, protein structure information has been taken into consideration in several of these studies. Several novel methods for mapping mutations to distinct proteins or protein domains have been used (Nehrt, Peterson et al. 2012, Studer, Dessailly et al. 2013). Detailed functional context and network characterization of these candidates are needed to understand how they contribute to the tumor phenotype, since even in a well-established causal cancer gene, not all mutations will be functionally equivalent. Many studies were designed to integrate human oncogenomics, transcriptomics, and cancer interactome networks to understand phenotypes as manifestations of network properties, and not merely as the result of individual genomic variations. Also, systematic tests of how DNA tumor virus proteins interact with host proteins and perturb the transcriptome can highlight pathways that go awry in cancer. To better understand the mechanisms of cancer, it will be important to map candidate somatic mutations to protein structures and to combine this with interactome network and pathway information.
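
To make the idea of mapping mutations to protein domains concrete, the following minimal Python sketch counts missense mutations falling within each annotated domain of a single protein and normalizes the counts by domain length. The domain names, coordinates and mutation positions are hypothetical, and the functions are an illustration of the general approach rather than the method of any of the studies cited above.

```python
def count_mutations_per_domain(domains, positions):
    """Count mutated residues falling inside each domain.

    domains:   dict mapping domain name -> (start, end), 1-based inclusive
               residue coordinates (hypothetical annotation).
    positions: iterable of mutated residue positions, one entry per
               observed mutation (repeats allowed).
    """
    counts = {name: 0 for name in domains}
    counts["outside_domains"] = 0
    for pos in positions:
        hit = False
        for name, (start, end) in domains.items():
            if start <= pos <= end:
                counts[name] += 1
                hit = True
        if not hit:
            counts["outside_domains"] += 1
    return counts


def mutation_density(domains, counts):
    """Mutations per residue for each domain, so domains of different length can be compared."""
    return {
        name: counts[name] / (end - start + 1)
        for name, (start, end) in domains.items()
    }


if __name__ == "__main__":
    # Hypothetical protein with two annotated domains.
    domains = {"kinase": (100, 350), "SH2": (400, 480)}
    positions = [120, 120, 300, 410, 50, 600]
    counts = count_mutations_per_domain(domains, positions)
    print(counts)
    print(mutation_density(domains, counts))
```

Comparing such per-domain densities between cancer types, with an appropriate statistical test and background model, is one way in which domain-level mutation patterns might be assessed.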

1.4.1 TCGA, ICGC and PCAWG Projects

In recent years, many cancer genome studies have generated extensive genomic and/or clinical data for different cancer types (Chin, Andersen et al. 2011, Chin, Hahn et al. 2011, 2015). The Cancer Genome Atlas (TCGA) (Cancer Genome Atlas Research, Weinstein et al. 2013, Tomczak, Czerwinska et al. 2015) is one of the most important datasets used for cancer analysis and genomic studies, and represents a significant effort towards studying human cancers at the genome scale. The TCGA Data Portal collected hundreds of tumors from 34 cancer types (Table 1-1). There are four major types of data available in TCGA: 1) clinical information about participants in the program; 2) metadata about the samples (for example, biospecimen data); 3) histopathology slide images from tumor samples; and 4) molecular information derived from the samples (e.g. mRNA/miRNA expression, protein expression, copy number, etc.). Not only tumor samples but also appropriate matched normal samples are often provided by TCGA. Each of these data types is available at three levels of analysis. Level 1 provides the raw data, which contains patient identifiers and is limited to approved users. Levels 2 and 3 provide summarized and de-identified results and are available for general download. Besides TCGA, the International Cancer Genome Consortium (ICGC) has also generated comprehensive catalogs of genomic abnormalities, including somatic mutations, abnormal expression of genes and epigenetic modifications, in tumors from 50 different cancer types.

Table 1-1 The 34 cancer types with counts of samples with RNASeq samples in TCGA dataset as of 05/2016

Available Cancer Types                                                   # Cases w/ RNASeq*
Acute Myeloid Leukemia [LAML]                                            200
Adrenocortical carcinoma [ACC]                                           80
Bladder Urothelial Carcinoma [BLCA]                                      412
Brain Lower Grade Glioma [LGG]                                           516
Breast invasive carcinoma [BRCA]                                         1100
Cervical squamous cell carcinoma and endocervical adenocarcinoma [CESC]  308
Cholangiocarcinoma [CHOL]                                                36
Colon adenocarcinoma [COAD]                                              461
Esophageal carcinoma [ESCA]                                              185
FFPE Pilot Phase II [FPPP]                                               38
Glioblastoma multiforme [GBM]                                            529
Head and Neck squamous cell carcinoma [HNSC]                             528
Kidney Chromophobe [KICH]                                                66
Kidney renal clear cell carcinoma [KIRC]                                 536
Kidney renal papillary cell carcinoma [KIRP]                             291
Liver hepatocellular carcinoma [LIHC]                                    377
Lung adenocarcinoma [LUAD]                                               521
Lung squamous cell carcinoma [LUSC]                                      510
Lymphoid Neoplasm Diffuse Large B-cell Lymphoma [DLBC]                   48
Mesothelioma [MESO]                                                      87
Ovarian serous cystadenocarcinoma [OV]                                   586
Pancreatic adenocarcinoma [PAAD]                                         185
Pheochromocytoma and Paraganglioma [PCPG]                                179
Prostate adenocarcinoma [PRAD]                                           498
Rectum adenocarcinoma [READ]                                             172
Sarcoma [SARC]                                                           261
Skin Cutaneous Melanoma [SKCM]                                           470
Stomach adenocarcinoma [STAD]                                            445
Testicular Germ Cell Tumors [TGCT]                                       150
Thymoma [THYM]                                                           124
Thyroid carcinoma [THCA]                                                 507
Uterine Carcinosarcoma [UCS]                                             57
Uterine Corpus Endometrial Carcinoma [UCEC]                              548
Uveal Melanoma [UVM]                                                     80

The Pancancer Analysis of Whole Genomes (PCAWG) of the ICGC and TCGA is an international collaboration seeking to identify common patterns of mutation in more than 2,800 cancer whole genomes. According to the PCAWG project, key areas of their study include: “1) Discovery of driver mutations outside of the protein-coding regions of the genome; 2) Integrating mutational signatures across tumor types and mutation categories; 3) Characterizing sub-clonal structures and patterns of genome evolution across cancers; 4) Investigating relationships between germline and somatic mutations; 5) Investigating biological pathways targeted by driver mutations”(Cancer Genome Atlas Research, Weinstein et al. 2013).

All of the tumor somatic missense mutation data used in my study were determined via whole-genome or whole-exome sequencing by TCGA or ICGC studies. The expression levels and HLA type of each patient were obtained from the PCAWG project.

1.5 The function and dysfunction of the immune system in cancer

As early as 1909, Paul Ehrlich noted that “The immune system has an amazing ability to seek out and destroy that which is deemed foreign, and generally leaves ‘self’ alone. Yet, tumor cells, thanks to accumulated mutations and altered patterns of , differ from their normal counterparts. Could the same killing power that eradicates infection be harnessed to destroy cancer cells -- cells that are nevertheless self?” It is now well known that the immune system plays many crucial roles in the prevention of cancer. However, the interaction between cancer cells and the immune system is complex and incompletely understood. Our attempts to target tumors by immunotherapy have been less successful than one might have predicted from the Ehrlich hypothesis (Ehrlich 1908).

It is now clear that the immune system can act as either a driving force or a protective force in cancer. The immune system not only detects and kills tumor cells before they progress to a malignant state (Vesely, Kershaw et al. 2011), but can also act preventatively (Schreiber, Old et al. 2011). This dual role of the immune system in cancer progression has been called “cancer immunoediting” (Dunn, Bruce et al. 2002, Arum, Anderssen et al. 2010).

Recently, new therapies have been developed that aim to induce an immune response to cancer. Cancer immunotherapy currently focuses on at least three main areas to enhance an individual’s anti-tumor immune response: 1) promoting the antigen presentation abilities of dendritic cells (DCs), 2) enhancing protective T-cell responses and 3) overcoming mechanisms of tumor immunosuppression (Rosenberg 2005, Mellman, Coukos et al. 2011, Topalian, Weiner et al. 2011, Brower 2015).

1.5.1 Cancer immune-editing

Cancer immunoediting consists of three phases: 1) the elimination phase (also referred to as immunosurveillance), 2) the equilibrium phase, and 3) the escape phase (Dunn, Old et al. 2004).

The first stage of cancer immunoediting is called the elimination phase. During this phase, the innate and adaptive immune system work together to eliminate transformed cells and prevent tumor outgrowth through various mechanisms.

The elimination phase is characterized by competent adaptive and innate immune responses destroying transformed cells before the tumors become clearly visible (Haughton and Amos 1968, Topalian, Weiner et al. 2011). Transformed cells can be detected by CD8+ effector cells through the presentation of tumor antigens on major histocompatibility complex (MHC) class I molecules or by NK cells through their NKG2D ligands (Obeid, Tesniere et al. 2007). Tumor cells may also express stress-induced molecules, or they can release endogenous damage-associated molecular patterns to enhance the immune system’s recognition of cancer cells (Ghiringhelli, Apetoh et al. 2007).

During the elimination phase, tumor-specific antigens play several important roles. Tumors that have been less exposed to the immune system tend to be more immunogenic once exposed. Tumor cells that possess strong antigens will be rejected through T-cell mediated immunoselection, but tumors lacking strong antigens will fail to be rejected by the immune system (Ghiringhelli, Apetoh et al. 2007).

Although the immune system uses different mechanisms to eliminate tumor cells, it sometimes fails to eliminate all transformed cells. If a cancer variant is not destroyed by the immune system during the first phase, it enters the second phase of the immunoediting process, the so-called equilibrium phase, which is also the longest phase of cancer immunoediting. During this phase, there is a balance between pressures exerted by the tumor and by the adaptive immune system: the adaptive immune system can control the outgrowth of tumor cells but is unable to eradicate the tumor. Unfortunately, it is during this phase that the tumor acquires its most immunoevasive mutations.

Finally, through a combination of constant selection pressure on tumor cells and their genetic instability, tumor cells may gradually acquire survival advantages through mutations, enabling the tumors to progress into clinically detectable malignancies and enter the escape phase. This is the final phase of immunoediting. In this phase, tumor cells evade immune pressure and progress to clinically evident disease (Dunn, Koebel et al. 2006). The mechanisms involved in evading the immune system are complex and vary between tumors. Mechanisms that have been shown to be involved in immune evasion include suppression of pro-inflammatory danger signals to prevent dendritic cell maturation, and elimination of ligands for the NK cell effector molecules (Wang, Niu et al. 2004). Moreover, because many immunogenic tumor variants are likely to have been eliminated from the tumor population through immunoediting early in cancer progression, the remaining cells may be less immunogenic.

Tumors can also emerge that are no longer recognized by the immune system, owing to antigen-loss variants or to defects in their antigen processing and presentation pathways. The human immune system can present tumor antigens to the receptors of cytotoxic T-lymphocytes (CTLs), which leads to the immune clearance of tumor cells. However, cancer cells can escape this immune clearance through somatic clonal evolution: by applying constant selective pressure, the immune system causes the tumor population to evolve and become less immunogenic. For example, some tumors acquire altered expression of major histocompatibility complex (MHC) molecules. This can impact tumor development because MHC molecules play an important role in antigen presentation and in the regulation of natural killer (NK) cell function (Khong, Wang et al. 2004). Another example is that tumor cells can become insensitive to IFNγ (Dunn, Koebel et al. 2006), which both gives the tumor a survival advantage and promotes tumor cell proliferation. Therefore, in order to combat these effects, immunotherapies with multiple targets and combination therapy approaches are a promising path to reduce the probability of tumor relapse.

1.5.2 Antigen presentation process

According to the Immunobiology book by Charles Janeway, “The major histocompatibility complex (MHC) is a set of cell surface proteins which are essential for the acquired immune system to recognize foreign molecules in vertebrates” (Janeway, Travers et al. 1997). Tumor antigens are presented by MHC class I proteins via several steps. First, the proteasome can degrade antigens. Then, the resulting antigen peptides are translocated via transporters associated with antigen presentation (TAP), and loaded onto MHC class I molecules in the endoplasmic reticulum (ER) lumen. Peptide–MHC class I complexes are released from the ER and transported to the membrane. Those antigens can be presented to T cells.

The MHC protein family consists of three subgroups: class I, class II, and class III. MHC class I molecules contain a β2-microglobulin subunit and are recognized by CD8 co-receptors, which are expressed on cytotoxic (killer) T-cells (Figure 1-2). MHC class II molecules do not contain β2-microglobulin and are recognized by CD4 co-receptors, which are expressed on helper and regulatory T-cells.

Figure 1-2 Cartoon representation of the T cell receptor’s interaction with peptide-loaded MHC-I (PDB id: 1G6R). The MHC-I heavy chain, loaded with the epitope SIINFEKL, is shown in green, while beta2-microglobulin is depicted in purple, and a T cell receptor in magenta.

In humans, the human leukocyte antigen (HLA) system is a gene complex that encodes the MHC proteins. HLA genes are highly polymorphic. Different HLA classes have different functions (Charles A Janeway 2001): 1) HLAs corresponding to MHC class I (A, B, and C) display peptides derived from inside the cell to T-cells; 2) HLAs corresponding to MHC class II display antigens derived from outside of the cell to T-cells. MHC class I and MHC class II proteins have different kinds of binding clefts. The clefts of MHC class I molecules are closed at both ends; by contrast, the ends of MHC class II clefts are open. Therefore, the length of MHC class I binding peptides is tightly constrained, usually ranging from 8 to 10 amino acids. The whole peptide lies within the binding groove, and the MHC class I protein makes contact with residues at the terminal regions of the peptide (Figure 1-3). MHC class II binding peptides can be longer, usually 13 to 17 amino acids. The peptide can extend beyond both ends of the cleft, and MHC class II proteins make contacts along the peptide backbone (Figure 1-4).

Figure 1-3 MHC class I protein crystal structure (PDB id: 1VAC), with a cartoon representation of the MHC class I heavy chain (purple) and beta-2 microglobulin (green) in complex with the co-crystallized ovalbumin-derived epitope SIINFEKL. The MHC class I peptide-binding cavity is bounded by two alpha-helices and a beta sheet.

1.5.3 MHC binding peptides prediction

MHC binding peptides may be useful in cancer diagnosis and treatment (Brown, Warren et al. 2014). Many computational methods have been developed to predict MHC binding peptides (Schueler-Furman, Altuvia et al. 2000, Udaka, Wiesmuller et al. 2000, Donnes and Elofsson 2002, Reche, Glutting et al. 2002, Udaka, Mamitsuka et al. 2002, Soam, Khan et al. 2009, Gok and Ozcerit 2012, Meydan, Otu et al. 2013).

Figure 1-4 MHC class II protein crystal structure (PDB id: 2ICW) with a co-crystallized superantigen of length 13 depicted in green. MHC class II α and β chains are colored in purple-blue and green, respectively.

Two general approaches have been used to predict MHC binding peptides: sequence-based methods and structure-based methods.

Sequence-based methods study the frequencies of amino acids at different positions of identified MHC-binding peptides to develop allele-specific sequence motifs. For example, based on collections of bound peptides, previous studies have reported that peptides which can bind to HLA-A*0201 frequently have two anchor residues, leucine in position 2 and valine in position 9 (Rammensee, Friede et al. 1995, Sette, Chesnut et al. 2001). Motifs generated by sequence-based methods can be used to score each amino acid in each position (Gribskov, McLachlan et al. 1987). Two frequently used profile-based prediction methods are SYFPEITHI (Rammensee, Bachmann et al. 1999) and HLA_BIN (Gulukota, Sidney et al. 1997). One common problem with profile-based methods is that they treat positions independently, i.e., they do not consider the correlations between frequencies in different positions. Another common problem is that they fail to take into account information from peptides that do not bind. To address these problems, machine learning approaches such as artificial neural networks (Honeyman, Brusic et al. 1998) and hidden Markov models (Mamitsuka 1998) have been used to predict MHC binding peptides.

One well developed and frequently used HLA binding peptide prediction method using artificial neural networks is NetMHC (Andreatta and Nielsen 2016).
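The profile-based scoring idea described above can be illustrated with a minimal sketch. It is not the implementation of SYFPEITHI, HLA_BIN or NetMHC; the peptides are made up, the uniform background frequency and the pseudocount are simplifying assumptions, and the function names `build_pssm` and `score_peptide` are hypothetical. It only shows how per-position frequencies become a log-odds matrix and how, as noted above, each position contributes independently to the score.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(binders, pseudocount=1.0):
    """Build a log-odds position-specific scoring matrix from equal-length peptides.

    Assumption: background frequencies are uniform (1/20), which is a simplification.
    """
    length = len(binders[0])
    if any(len(p) != length for p in binders):
        raise ValueError("all binders must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(binders) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(length):
        counts = Counter(p[pos] for p in binders)
        column = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(column)
    return pssm


def score_peptide(pssm, peptide):
    """Sum per-position log-odds scores; positions are treated independently."""
    if len(peptide) != len(pssm):
        raise ValueError("peptide length does not match the matrix")
    return sum(column[aa] for column, aa in zip(pssm, peptide))


# Made-up 9-mers that share the same residues at positions 2 and 9 (not real data).
toy_binders = ["ALAKAAAAV", "GLGGGGGGV", "SLSSSDSSV", "KLTTTTTEV"]
pssm = build_pssm(toy_binders)
anchored = score_peptide(pssm, "ALAAAAAAV")    # keeps both conserved residues
unanchored = score_peptide(pssm, "AKAAAAAAD")  # changes positions 2 and 9
```

Because the score is a plain sum over positions, this kind of matrix cannot capture correlations between positions and uses no information from non-binding peptides, which are the two limitations noted above.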

The structure-based MHC binding peptides prediction method usually evaluates how well a peptide fits in the binding groove of an MHC molecule. The energy is estimated based on the interactions between the binding peptides and the HLA molecule in the binding pocket (Schueler-Furman, Altuvia et al. 2000). The structure-based prediction method is limited by the availability of a known structure of a given MHC protein. Certainly, once a structure is available, it might be enough to develop a prediction model.

1.6 Protein domains

Protein domains are distinct functional and structural units in proteins. Mutations within the same domain are more likely to affect similar structural and functional properties (Ponting and Russell 2002, Vogel, Bashton et al. 2004). Protein domains can be informative about a protein’s function because they are the units that often accomplish individual functions within the protein (Vogelstein, Papadopoulos et al. 2013). Thus, domain-based mutational studies within a given protein could be useful in elucidating the functional impact of mutations (Vogel, Bashton et al. 2004, Bashton and Chothia 2007). Some recent cancer genome studies in multi-domain cancer-causing genes, e.g. NOTCH1 and PIK3CA, have suggested that mutations in different domains can be pathologically distinct (Nehrt, Peterson et al. 2012, Studer, Dessailly et al. 2013).

Conservation of domain content can be informative about the functional equivalence of homologs. Thus, using the protein domain as a framework to transfer functional information from yeast to human has successfully associated phenotypic changes in yeast with diseases in human (Ponting and Russell 2002, Vogel, Bashton et al. 2004).

To understand how mutations in different protein domains differentially affect protein function, as well as to add a more detailed functional annotation to those potential disease-associated variants, I took a first step towards revealing the domain-centric mutational landscape of missense somatic tumor mutations across 21 cancer types. Also, I used protein domain annotations as a criterion for selecting human-yeast paralogous pairs.

1.7 Summary and Rationale

The availability of thousands of human genome sequences presents us with a more pressing need to functionally characterize genetic variants. The aim of my project is to functionally characterize potential disease causal variants by evaluating the impacts of these mutations using both computational methods and experimental methods.

An opportunity for testing many human variants experimentally is offered by functional complementation assays in yeast: Readily available yeast conditional mutant collections (Hillenmeyer, Fung et al. 2008), including a large temperature-sensitive mutant collection (G. Tan, C. Boone and B. Andrews, personal communication) and human clone collections (Rual, Hirozane-Kishikawa et al. 2004, Team, Temple et al. 2009) have made it practical to systematically test human-yeast complementation. Yeast genetic experiments are relatively easy to scale up. My systematic complementation assay has largely confirmed known complementation relationships between human and yeast orthologs. I also found many novel complementation relationships between human and yeast genes that share homologous protein domains. My study showed that the human-yeast functional complementation assay can serve as a platform to test the consequences of thousands of mutations introduced by mutagenic PCR in human disease genes. It is also possible to look exhaustively for genetic modifiers in yeast using the Synthetic Genetic Array (SGA) method if further in-depth experiments are needed (Tong and Boone 2006, Scarcelli, Viggiano et al. 2008, Sassi, Bastajian et al. 2009, Baryshnikova, Costanzo et al. 2010). For example, I can further test how the known genetic interactions of a yeast gene with a TS mutant allele change in the presence of a mutated vs a wild-type human homolog. Thus, the platform I helped establish could be used for 15% or more of human disease genes (Sun, Yang et al. 2016) to assess the functional impact of human-disease associated mutations at a large scale, and points to the value more generally of using model organisms and tractable human cell models for functional studies.

Furthermore, unlike simple Mendelian genetic diseases, cancer is complex and usually caused by the accumulation of somatic mutations. Identifying driver mutations and their functional consequences is critical to our understanding of cancer. Because domains are the functional units of proteins, in this project I explored the protein domain-level landscape of cancer-type-specific somatic mutations. Specifically, I systematically examined tumor genomes from 21 cancer types to identify domains with high mutational density in specific tissues, the positions of mutational hotspots within these domains, and the functional and structural context where possible. This work helped to prioritize cancer mutations, which may provide the foundation for future functional studies aimed at identifying more effective cancer treatments.

Finally, I worked to develop predictive models of the immunogenicity of cancer mutations. Immunotherapy approaches have been hampered by the fact that every patient’s tumor possesses a unique set of mutations, so patient-specific immunogenic mutations must first be identified (Kreiter, Vormehr et al. 2015). Individual patients can also differ dramatically in their immune systems, based on HLA type and other allelic variation in immune genes, as well as a unique repertoire of mature immune cells. Thus, personalized immunotherapy could be helpful in cancer treatment. To work towards this goal, I first tested whether the rate of somatic mutations falling within peptides that normally bind MHC is lower than that of mutations falling outside such peptides. The rationale is that missense mutations falling within peptides presented via the HLA class I system may be subject to strong purifying selection due to clearance by the immune system. I then assessed the likelihood of single nucleotide mutations being displayed within an MHC binding peptide based on individual HLA types. Finally, I developed a model that produces an ‘antigenicity score’ for any observed somatic coding mutation. This project could ultimately use somatic mutations to explain phenotypic differences between individual tumors, as well as the impact and therapeutic potential for particular cancer types.
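As a rough illustration of what it means for a missense mutation to fall within a candidate MHC class I peptide, the sketch below enumerates every 8- to 10-residue window that contains a given residue, for both the wild-type and the mutated sequence. It is a sketch under stated assumptions and not the model developed in this thesis: the protein sequence is made up, the function names `peptides_covering` and `mutate` are hypothetical, and the binding prediction step (which would score each window for a given HLA allele) is deliberately left out.

```python
def peptides_covering(protein, position, lengths=(8, 9, 10)):
    """Return all windows of the given lengths that include a 1-based position."""
    if not 1 <= position <= len(protein):
        raise ValueError("position is outside the protein")
    idx = position - 1
    peptides = []
    for k in lengths:
        first = max(0, idx - k + 1)        # leftmost window start that still covers idx
        last = min(idx, len(protein) - k)  # rightmost start that fits inside the protein
        for start in range(first, last + 1):
            peptides.append(protein[start:start + k])
    return peptides


def mutate(protein, position, ref, alt):
    """Apply a missense change such as L10F, checking the reference residue first."""
    if protein[position - 1] != ref:
        raise ValueError("reference residue does not match the sequence")
    return protein[:position - 1] + alt + protein[position:]


# Made-up 20-residue sequence and a hypothetical L10F substitution.
toy_protein = "ACDEFGHIKLMNPQRSTVWY"
wild_type_peptides = peptides_covering(toy_protein, 10)
mutant_peptides = peptides_covering(mutate(toy_protein, 10, "L", "F"), 10)
```

Each enumerated window could then be passed to a binding predictor of the reader's choice for each HLA allele carried by a patient; how those per-window predictions are combined into a single score is a modelling decision that this sketch does not address.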

Chapter 2 Complementation of yeast genes with human genes as an experimental platform for functional testing of human genetic variants

The work outlined in this chapter can be found in the following papers: “An extended set of yeast-based functional assays accurately identifies human disease mutations”, published in Genome Research 26(5):670-80 (2016) and “Accurate assessment of human variant functionality by yeast complementation does not require orthology” (submitted).

I performed all the experimental tests and computational analyses introduced in this chapter, with the exception of the humanized yeast plasmids and the complementation assays among human-yeast orthologous pairs, which were designed by Song Sun. All yeast temperature-sensitive strains were provided by Guihong Tan, Charlie Boone, and Brenda Andrews. Human ORF clones were obtained from David E. Hill and Marc Vidal. Annotation of orthology and curation of the literature for complementation were performed by Kara Dolinski, Rose Oughtred, Jodi Hirschman and Chandra L. Theesfeld.


2.1 Introduction

Previous studies have reported that an individual human can carry 100 to 400 rare variants, each with a potentially major impact on health and disease (Pritchard 2001, Nelson, Wegmann et al. 2012). It is important to identify deleterious variants and to estimate how those variants contribute to different human diseases. Experimental assessment of variant function by human-cell-based phenotyping is hampered by inefficient allele replacement technology and by paralogs with overlapping function, making complementation testing in ‘humanized’ model organisms an attractive alternative approach (Lee and Nurse 1987). Yeast-based functional complementation assays have been successfully employed to study human protein functions, as well as the functional impact of amino acid changes in proteins (Dolinski and Botstein 2007, Mayfield, Davies et al. 2012). Song Sun, a postdoctoral fellow in the Roth lab, and I worked together to build a platform capable of identifying pathogenic human variants at a large scale based on a human-yeast functional complementation assay.

The functional complementation assay introduced in this chapter was separated into two phases. During the first phase, I tested the ability of wild-type human disease genes to rescue yeast mutants for 139 human-yeast orthologous pairs using a yeast-based functional complementation assay. The initial focus was on human disease genes with essential yeast orthologs for which a TS allele is available. Of the eleven previously reported complementation relationships tested, we were able to recapitulate seven pairs in our complementation assay. Meanwhile, 19 novel complementation pairs were found, which approximately doubled the number of such relationships known within the tested space. After finding human-yeast complementation pairs, I then tested how mutations in human disease genes affect complementation relationships between human and yeast orthologous pairs. In total, 179 genetic variants were tested.

The protein sequences of the 139 tested human-yeast orthologous pairs were mapped to Pfam domains (E-value ≤ 0.001). Here, human-yeast orthologous pairs were separated into two categories based on whether or not all domains in the yeast protein are found in the corresponding human protein. I found that complementation is more likely to occur when homologs of all domains in the yeast protein are found in the human protein (Fisher’s Exact test, with a null model that the probability of complementation is equal between the two classes of human-yeast orthologous pairs; P-value = 0.026). A similar tendency was observed by mapping sequences of all curated literature-reported human and yeast complementation pairs to Pfam domains (Figure 2-1, Fisher’s Exact test, P-value = 3.94e-22). This result indicates that orthology may not be generally required for complementation, and that conserved protein domains might be used as the selection criterion for human and yeast pairs to be tested. Therefore, pathogenic variant detection assays could potentially be extended to a wider range of human disease genes.
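For readers unfamiliar with the test, the following is a minimal, standard-library sketch of a two-sided Fisher’s Exact test on a 2x2 table of the kind described above (rows: all yeast domains found in the human protein or not; columns: complementing or non-complementing). The counts in the usage example are hypothetical placeholders and are not the counts behind the P-values reported in this chapter; the function name `fisher_exact_two_sided` is likewise only illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's Exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more probable than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Hypothetical counts, for illustration only:
# [complementing, non-complementing] for pairs with and without full domain coverage.
p_value = fisher_exact_two_sided(20, 40, 5, 60)
```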

During the second phase of this project, I examined 1060 human-yeast gene pairs, which are annotated as homologs but not as orthologs. I identified 34 complementing pairs, of which 33 (97%) were novel. My study further showed that non-orthologous complementation can be used to identify pathogenic human variants. Within this search space, non-orthologous complementation more than doubled the number of human disease genes with a complementation-based assay for human disease variation.

In this chapter, I will first evaluate the performance of yeast-based functional complementation assays on both orthologous and paralogous human-yeast pairs. Then I will demonstrate that ortholog-based complementation and paralog-based complementation can both outperform current computational methods for identifying pathogenic human variation. Furthermore, I show that complementation assays can assess the functional impact of variants that fall outside of the aligned homology region. I believe that this study provides an important platform for understanding the role that deleterious variants play in human disease, and that studies of this kind will have a beneficial impact on developing personalized treatment.


Figure 2-1 Comparison of the fraction of complementing and non-complementing pairs among all tested human-yeast orthologous pairs in two cases: all domains in the yeast protein are found in the human protein, or not. P-values were calculated by Fisher’s Exact test.


2.2 Methods

2.2.1 Defining the test space of complementation assays

To date, our collaborators from the Center for Cancer Systems Biology (CCSB) of Dana-Farber Cancer Institute (PI: Marc Vidal) have generated a library of cloned human ORFs, representing nearly all human genes (hORFeome 8.1 collection, (Yang, Boehm et al. 2011)). To systematically test the ability of wild-type human disease-associated genes to rescue mutations in homologous yeast genes, my study only focused on human genes for which HGMD (Stenson, Ball et al. 2003, Cooper, Stenson et al. 2006, Stenson, Ball et al. 2012) has annotated one or more alleles as being ‘DM’ (disease-causing) and for which a clone was available in ORFeome version 8.1 (Team, Temple et al. 2009).

During the first phase of this project, all human and yeast orthologous pairs were extracted from the InParanoid database (http://inparanoid.sbc.su.se/cgi-bin/index.cgi). In total, 139 ‘wild type’ human disease genes, corresponding to 125 orthologous yeast essential genes, were selected for the first-phase functional assay (Table 2-1).

Table 2-1 139 orthologous pairs and complementation results.

Gene Symbol  Entrez ID  hORF Position  Yeast ORF  TS Strain ID      Detected Complementation  Reported Complementation
EMG1         10436      81040@F04      YLR186W    TSA1056, TSA1141  yes                       yes
YARS         8565       81096@H06      YGR185C    TSA290            yes                       yes
PIN1         5300       81024@E09      YJR017C    TSA698            yes                       yes
DHDDS        79947      81063@D07      YBR002C    NA                yes                       yes
NCS1         23413      81028@H08      YDR373W    TSA399            yes                       yes
NHP2         55651      81023@F06      YDL208W    TSA1176           yes                       yes
UBE2I        7329       81013@B03      YDL064W    TSA370            yes                       yes
PKLR         5313       81099@D02      YAL038W    TSA34             yes                       no
UROS         7390       81043@E07      YOR278W    NA                yes                       no
PGK1         5230       81077@D12      YCR012W    NA                yes                       no
GLUL         2752       81067@G02      YPR035W    TSQ2428           yes                       no
NSDHL        50814      81072@E05      YGL001C    TSA742            yes                       no
DPAGT1       1798       81076@E09      YBR243C    NA                yes                       no
GDI1         2664       81084@H06      YER136W    TSA64             yes                       no
FDFT1        2222       81077@A01      YHR190W    TSA1128           yes                       no
CMPK1        51727      81037@E04      YKL024C    TSA1034           yes                       no
GFPT2        9945       81110@D04      YKL104C    TSA989            yes                       no
RPIA         22934      81038@G02      YOR095C    TSQ2310           yes                       no
CALM1        801        81012@B09      YBR109C    TSA154            yes                       no
DHFR         1719       81028@C01      YOR236W    TSA884            yes                       no
TBP          6908       81059@D06      YER148W    TSA476            yes                       no
SUMO1        7341       81006@D07      YDR510W    TSA928            yes                       no
TPK1         27010      81046@D11      YOR143C    TSA1173           yes                       no
TECR         9524       81053@A05      YDL015C    TSA1113           yes                       no
CALM3        808        81012@H09      YBR109C    TSA154            yes                       no
PSMD7        5713       81061@E08      YOR261C    NA                yes                       no
RFT1         91869      81009@B09      YBL020W    NA                no                        yes
GFER         2671       81032@E11      YGR029W    NA                no                        yes
NDOR1        27158      81090@A03      YPR048W    TSQ2435           no                        yes
PRKCH        5583       81001@E07      YBL105C    TSA535            no                        yes
PIGV         55650      81088@G07      YBR004C    TSA1243           no                        yes
PYCR1        5831       81060@H11      YER023W    TSA1082           no                        yes
EIF4E        1977       81035@D09      YOL139C    TSA306            no                        yes
LARS         51520      81131@D09      YPL160W    TSA922            no                        yes
ADSL         158        81136@B02      YLR359W    NA                no                        no
HMBS         3145       81067@A01      YDL205C    NA                no                        no
MVK          4598       81075@C06      YMR208W    TSQ2204           no                        no
MYH9         4627       81011@F09      YHR023W    NA                no                        no
TIMM44       10469      81085@B04      YIL022W    TSA956            no                        no
ACTC1        70         81072@A04      YFL039C    TSA72             no                        no
ALAS2        212        81104@A08      YDR232W    TSQ1764           no                        no
CSNK1D       1453       81077@E03      YPL204W    TSA1196           no                        no
ACAT1        38         81024@B07      YPL028W    TSA531            no                        no
ACTA2        59         81072@B04      YFL039C    TSA72             no                        no
CPOX         1371       81121@E03      YDR044W    TSA1076           no                        no
GARS         2617       81111@A04      YBR121C    TSA1071           no                        no
PGAM2        5224       81041@H08      YKL152C    NA                no                        no
SPTLC1       10558      81136@A07      YMR296C    TSA600            no                        no
TPI1         7167       81040@D03      YDR050C    TSA988            no                        no
EIF2B2       8892       81065@B12      YLR291C    TSA1149           no                        no
PANK2        80025      81011@A05      YDR531W    TSA1225           no                        no
VCP          7415       81100@C08      YDL126C    TSA179            no                        no
HMGCS2       3158       81125@H10      YML126C    TSA1104           no                        no
MPI          4351       81080@H03      YER003C    TSA1081           no                        no
SBDS         51119      81046@A12      YLR022C    TSA1026           no                        no
TUBA1A       7846       81084@F06      YML085C    TSQ2162           no                        no
UROD         7389       81071@A03      YDR047W    NA                no                        no
BPGM         669        81042@H04      YKL152C    NA                no                        no
EIF2B4       8890       81089@C02      YGR083C    TSQ1936           no                        no
FECH         2235       81123@E08      YOR176W    TSQ2339           no                        no
HLCS         3141       81111@F07      YDL141W    TSQ1739           no                        no
SDHC         6391       81025@B05      YKL141W    NA                no                        no
TSEN2        80746      81086@D05      YLR105C    TSA676            no                        no
STXBP1       6812       81106@E12      YDR164C    TSA36             no                        no
PRPF3        9129       81014@E10      YDR473C    TSA58             no                        no
HSPD1        3329       81105@D07      YLR259C    TSQ2106           no                        no
PRPF31       26121      81094@D10      YGR091W    TSA400, TSA472    no                        no
SMARCA2      6595       81102@F09      YIL126W    TSA507            no                        no
DPM1         8813       81042@H06      YPR183W    TSA330            no                        no
ELAC2        60528      81017@E06      YKR079C    NA                no                        no
GLE1         2733       81111@E09      YDL207W    TSA919            no                        no
POLD1        5424       81138@F02      YDL102W    TSA17             no                        no
SEC63        11231      81111@A11      YOR254C    TSA231            no                        no
TAF7L        54457      81124@B04      YMR227C    TSA638            no                        no
PSMB9        5698       81091@B07      YJL001W    TSA292            no                        no
ACTB         60         81072@D10      YFL039C    TSA72             no                        no
TUBB2B       347733     81079@D01      YFL037W    TSQ1492           no                        no
PYCRL        65263      81048@G05      YER023W    TSA1082           no                        no
ALG2         85365      81040@D07      YGL065C    TSA461            no                        no
CSNK1E       1454       81077@H06      YPL204W    TSA1196           no                        no
RBM28        55131      81016@E11      YPL043W    TSA660            no                        no
PSMB8        5696       81118@F02      YPR103W    TSA916            no                        no
CCT5         22948      81097@B09      YJR064W    TSA1254           no                        no
PIP4K2A      5305       81076@G09      YDR208W    TSA129/TSA903     no                        no
RPL5         6125       81006@F04      YPL131W    TSQ2383           no                        no
RPA4         29935      81042@F12      YNL312W    TSA1180           no                        no
CIRH1A       84916      81110@D03      YDR324C    TSQ1773           no                        no
TUBB1        81027      81084@E03      YFL037W    TSQ1492           no                        no
PIGM         93183      81082@E06      YJR013W    NA                no                        no
RARS2        57038      81070@H09      YDR341C    TSA1118           no                        no
SLC7A1       6541       81108@A07      YGR191W    TSQ1955           no                        no
AURKC        6795       81117@A11      YPL209C    TSA165            no                        no
COG4         25839      81097@G08      YPR105C    TSQ2442           no                        no
HARS2        23438      81094@F08      YPR033C    TSA945            no                        no
HSPA9        3313       81110@D07      YJR045C    TSA328            no                        no
KARS         3735       81106@A04      YDR037W    TSQ1746           no                        no
MCM5         4174       81015@E03      YLR274W    TSA403            no                        no
ORC4         5000       81040@B03      YPR162C    TSA1247           no                        no
PIGN         23556      81019@G07      YKL165C    NA                no                        no
POLR1C       9533       81064@A02      YPR110C    TSA642            no                        no
PRPF6        24148      81127@E10      YBR055C    TSA367, TSA455    no                        no
RPS27A       6233       81023@G08      YLR167W    NA                no                        no
SEC23B       10483      81016@D08      YPR181C    TSA900            no                        no
RPL35A       6165       81044@A01      YPL143W    NA                no                        no
PIGA         5277       81088@H05      YPL175W    TSA474            no                        no
PLCG2        5336       81020@C07      YPL268W    NA                no                        no
DDOST        1650       81085@B06      YEL002C    TSA581            no                        no
TUBB3        10381      81084@H10      YFL037W    TSQ1492           no                        no
SPTLC2       9517       81099@H05      YDR062W    TSA575, TSA585    no                        no
RAD21        5885       81108@A02      YDL003W    TSA70, TSA1229    no                        no
NOP56        10528      81026@B08      YLR197W    TSA1140           no                        no
CDK7         1022       81064@F10      YDL108W    TSA109            no                        no
CSTF2T       23283      81107@D10      YGL044C    TSA652            no                        no
DHX37        57647      81001@D08      YMR128W    TSQ2182           no                        no
EEF1B2       1933       81036@H10      YAL003W    TSA987            no                        no
EPC2         26122      81130@E02      YFL024C    NA                no                        no
EXOC8        149371     81014@F10      YBR102C    TSA24             no                        no
FBXW7        55294      81014@A04      YIL046W    TSA947            no                        no
GSPT1        2935       81137@B12      YDR172W    TSA28             no                        no
IQGAP3       128239     81021@B02      YPL242C    NA                no                        no
KPNA1        3836       81097@C10      YNL189W    TSA1182           no                        no
MAGEE2       139599     81126@D02      YDR288W    TSA994            no                        no
NXF3         56000      81097@F04      YPL169C    TSA816            no                        no
NXF5         55998      81071@F11      YPL169C    TSA816            no                        no
PIGO         84720      81085@D07      YLL031C    TSA959            no                        no
PLCD4        84812      81016@F11      YPL268W    NA                no                        no
POLE2        5427       81096@A08      YPR175W    NA                no                        no
POLR2A       5430       81099@H09      YDL140C    TSA804            no                        no
POLR2E       5434       81030@D10      YBR154C    TSA1110           no                        no
PRKCB        5579       81110@A09      YBL105C    TSA535            no                        no
PRPF39       55015      81108@E08      YDR235W    TSA1085           no                        no
RNF113A      7737       81064@A07      YLR323C    TSQ2114           no                        no
RPA2         6118       81043@C07      YNL312W    TSA1180           no                        no
RPS3         6188       81039@G09      YNL178W    NA                no                        no
SLC7A2       6542       81109@C11      YGR191W    TSQ1955           no                        no
SMC1B        27127      81020@F08      YFL008W    TSA68, TSA321     no                        no
STXBP3       6814       81100@A03      YDR164C    TSA36             no                        no
UFD1L        7353       81053@D05      YGR048W    TSA677            no                        no
ZC3H14       79882      81053@H05      YGL122C    NA                no                        no

During the second phase of this functional complementation assay, I expanded the test space to human-yeast homologous pairs that are not annotated as orthologous. For convenience, we define the term “paralog” inclusively to indicate a homolog that is not annotated as an ortholog. Protein domains are distinct functional and structural units in a protein, so mutations within a particular domain have a heightened chance of affecting the structural and functional properties of the proteins in which they appear (Ponting and Russell 2002, Vogel, Bashton et al. 2004, Yang, Petsalaki et al. 2015). Moreover, domain-based mutational studies have proven useful in elucidating the functional and disease effects of mutations (Vogel, Bashton et al. 2004, Bashton and Chothia 2007, Starita, Young et al. 2015). I therefore chose to also use protein domain annotations as a criterion for selecting human-yeast paralogous pairs. I searched both yeast and human genes against the domain types in the Pfam protein domain family database (version 27) (Finn, Mistry et al. 2010), using an E-value cutoff of 0.001 (Finn, Mistry et al. 2006), and identified cases where all protein domains encoded by a yeast gene were fully ‘covered’ by a human gene. Considering only human and yeast pairs where the yeast gene was essential and had an available temperature-sensitive mutation, where the human gene had an available expression clone, and where all protein domains in the yeast gene were covered in the corresponding human gene, I selected 1060 human-yeast paralog pairs corresponding to 314 human genes and 162 yeast genes. The selection pipeline is shown in Figure 2-2.
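The selection criteria listed above can be summarized as a simple filter. The sketch below is only an illustration of that logic under stated assumptions: gene and domain identifiers are made up, the function name `select_paralog_pairs` is hypothetical, domain assignments are assumed to have already been obtained from a domain search at the chosen E-value cutoff, and a yeast gene with no annotated domain is excluded (an assumption, since the text does not say how such cases were handled).

```python
def select_paralog_pairs(candidate_pairs, domains, orthologs,
                         essential_yeast, ts_available, clone_available):
    """Keep (human, yeast) homolog pairs that satisfy every selection criterion."""
    selected = []
    for human, yeast in candidate_pairs:
        if (human, yeast) in orthologs:
            continue  # "paralog" here means a homolog not annotated as an ortholog
        if yeast not in essential_yeast or yeast not in ts_available:
            continue  # yeast gene must be essential and have a TS allele
        if human not in clone_available:
            continue  # human gene must have an expression clone
        yeast_domains = domains.get(yeast, set())
        human_domains = domains.get(human, set())
        # Assumption: pairs whose yeast gene has no annotated domain are skipped.
        if yeast_domains and yeast_domains <= human_domains:
            selected.append((human, yeast))
    return selected


# Made-up identifiers, for illustration only.
toy_domains = {
    "HsGeneA": {"PF_kinase", "PF_sh2"},
    "HsGeneB": {"PF_kinase"},
    "ScGene1": {"PF_kinase"},
    "ScGene2": {"PF_kinase", "PF_wd40"},
}
candidates = [("HsGeneA", "ScGene1"), ("HsGeneB", "ScGene1"), ("HsGeneA", "ScGene2")]
selected = select_paralog_pairs(
    candidates,
    toy_domains,
    orthologs={("HsGeneB", "ScGene1")},
    essential_yeast={"ScGene1", "ScGene2"},
    ts_available={"ScGene1", "ScGene2"},
    clone_available={"HsGeneA", "HsGeneB"},
)
```

In this toy example only the pair (HsGeneA, ScGene1) passes: the second pair is annotated as orthologous, and the third fails because one yeast domain is absent from the human gene.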

Having established a reference set of human and yeast complementation relationships, I then used these cross-species complementation relationships to predict human variant pathogenicity. In this chapter, “disease-associated” will be used to indicate that there is some evidence (strong or weak) to associate a variant with disease, and “non-disease-associated” will be used to indicate variants that have not been annotated as being associated with any disease. In total, within the 26 human disease-associated genes with complementation relationships identified during the first phase of this study, 101 disease-associated and 78 non-disease-associated variants across 22 genes were selected. Also, among the 33 human disease-associated genes for which complementation relationships were identified during the second phase of this study, 19 disease-associated and 16 non-disease-associated variants across 7 genes were selected (Table 2-2).

Figure 2-2 Outline of the selection of human and yeast paralogous pairs.

Table 2-2 List of variants tested by functional complementation assay

Gene Symbol  AA_Position  Type  FC_Score  PPH2_Prob  Provean_Score
CASK         T573I        SNP   0.6       0.021      -2.35
CASK         D471N        SNP   0.4       0.005      -1.48
CASK         M438L        SNP   0.4       0          -1.24
CASK         R430C        SNP   0.4       0.035      -2.51
CASK         R28L         Mut   0.6       1          -3.59
CYP19A1      M21T         SNP   0.8       0.01       -0.65
CYP19A1      M85R         Mut   0.8       0.128      -2.77
CYP19A1      W39R         SNP   0.4       0.343      -5.16
CYP19A1      M127R        Mut   0.8       1          -4.87
CYP19A1      Y81C         Mut   0.8       1          -6.87
DHDDS        K42E         Mut   0         0.786      -3.65
EMG1         D86G         Mut   0.5       1          -6.99
IFT122       G51A         SNP   0.2       0.016      -4.11
IFT122       T91I         SNP   0.2       0.953      -3.99
IFT122       S373F        Mut   0.5       0.951      -5.038
IFT122       L99W         SNP   0         0.861      -0.178
IFT122       R328W        SNP   0.2       0.994      -6.168
RAB33B       P219S        Mut   1         0.005      0.6
RAB33B       K46Q         Mut   1         1          -3.55
RAB33B       P142L        SNP   0.6       1          -9.99
RAB33B       T177M        SNP   0.7       1          -5.21
VCP          A232G        Mut   0.6       0.005      -1.87
VCP          I151V        Mut   0.4       0          -0.51
VCP          I27V         SNP   0.3       0          -0.43
VCP          Q191R        SNP   0.4       0          0.61
VCP          S171N        SNP   0         0.004      -1.18
VCP          T436I        SNP   0.3       0.236      -3.76
VCP          I206F        Mut   0.5       0.983      -3.7
VCP          L198W        Mut   0.6       1          -4.71
VCP          R155C        Mut   0.6       1          -6.56
VCP          R159C        Mut   0.9       1          -6.31
VCP          R159H        Mut   0.9       0.517      -2.97
VCP          R191G        Mut   0.6       0.999      -6.49
VCP          P137L        Mut   0.3       1          -9.31
VCP          R159G        Mut   0.3       0.998      -5.18
UBE2I        V25M         Mut   0         0.849      -1.663
CALM1        N98S         Mut   0.8       0.118      -4.378
CALM1        I53N         Mut   0.6       0.956      -5.846
CALM1        F90L         Mut   0.6       0.886      -5.477
CALM1        D130G        Mut   0         0.983      -4.312
CALM1        F142L        Mut   0.6       0.892      -3.28
DHDDS        K42E         Mut   0         0.566      -3.649
DHFR         D153V        Mut   1         0.605      -6.485
DHFR         L80F         Mut   0         0.962      -3.636
DPAGT1       Y170C        Mut   0         0.997      -8.866
DPAGT1       I69N         Mut   0         0.454      -2.811
DPAGT1       M108I        Mut   0         0.326      -3.629
DPAGT1       I29F         Mut   0.8       0.224      -3.255
DPAGT1       I297F        Mut   0.6       0.646      -3.781
DPAGT1       R301H        Mut   0.6       0.992      -4.893
DPAGT1       L385R        Mut   0         0.197      -1.941
DPAGT1       A114G        Mut   0         0.143      -3.674
DPAGT1       G160S        Mut   0         0.838      -5.283
EMG1         D86G         Mut   1         1          -6.986
GDI1         L92P         Mut   0.8       0.883      -4.423
GDI1         R423P        Mut   1         0.979      -6.067
GLUL         R324C        Mut   0         0.422      -7.368
GLUL         R341C        Mut   1         0.557      -7.369
NCS1         R102Q        Mut   0         0.003      0.011
NHP2         V126M        Mut   0.6       0.98       0.689
NHP2         Y139H        Mut   0.6       0.985      -4.342
NSDHL        A105V        Mut   1         1          -3.978
NSDHL        G205S        Mut   1         0.995      -5.808
NSDHL        C340R        Mut   1         0.949      -7.508
NSDHL        S147R        Mut   1         0.585      -4.812
NSDHL        Y349H        Mut   1         0.994      -4.288
NSDHL        A182P        Mut   1         0.364      -1.959
NSDHL        Y349C        Mut   1         0.994      -7.645
NSDHL        C132W        Mut   1         1          -10.674
NSDHL        G124S        Mut   1         1          -5.968
PGK1         D268N        Mut   0         0.005      -0.923
PGK1         C316R        Mut   0.8       0.926      -5.643
PGK1         L89P         Mut   0.6       0.966      -6.664
PGK1         R206P        Mut   0         0.168      -5.3
PGK1         S320N        Mut   1         0.521      -2.415
PGK1         A354P        Mut   0.6       0.16       -3.091
PGK1         I371K        Mut   0.8       0.998      -6.473
PGK1         I47N         Mut   0.6       0.994      -6.615
PGK1         I253T        Mut   0         0.852      -4.741
PGK1         T378P        Mut   0.6       0.618      -5.454
PGK1         D285V        Mut   1         0.98       -8.612
PGK1         D164V        Mut   1         1          -8.767
PGK1         D315N        Mut   0.8       1          -4.787
PGK1         G158V        Mut   0.8       0.999      -8.287
PIN1         S32C         Mut   1         0.99       -4.816
PKLR         E407G        Mut   1         1          -6.36
PKLR         R486L        Mut   0.6       0.826      -4.99
PKLR         M107T        Mut   1         0.66       -5.286
PKLR         R163C        Mut   1         0.999      -6.972
PKLR         G406R        Mut   1         0.996      -7.437
PKLR         K410E        Mut   1         0.74       -3.738
PKLR         R479C        Mut   0.8       0.923      -6.246
PKLR         R426W        Mut   0.8       0.98       -5.546
PKLR         R479H        Mut   1         0.783      -3.205
PKLR         R163L        Mut   1         0.985      -6.086
PKLR         R337Q        Mut   1         0.996      -3.814
PKLR         G275R        Mut   1         1          -6.949
PKLR         D331E        Mut   0.8       0.995      -3.742
PKLR         Q421K        Mut   1         0.953      -2.152
PKLR         R359C        Mut   0.8       0.997      -7.062
PKLR         A468G        Mut   1         0.671      -3.25
PKLR         D281N        Mut   1         0.631      -4.595
PKLR         G341D        Mut   1         1          -6.526
PKLR         C360Y        Mut   1         0.854      -10.649
PKLR         V460M        Mut   0.6       0.988      -2.492
RPIA         A135V        Mut   1         0.998      -3.853
TECR         P182L        Mut   1         1          -8.818
TPK1         N219S        Mut   0.8       0.991      -4.554
TPK1         N50H         Mut   0.8       0.998      -3.416
TPK1         L40P         Mut   0.8       0.546      -2.59
UROS         A69T         Mut   1         0.993      -3.615
UROS         G188R        Mut   1         1          -3.479
UROS         L237P        Mut   1         0.995      -4.103
UROS         P53L         Mut   1         1          -7.863
UROS         V3F          Mut   1         0.984      -3.799
UROS         A104V        Mut   1         0.801      -2.502
UROS         H173Y        Mut   1         0.994      -4.062
UROS         G188W        Mut   1         1          -4.466
UROS         T228M        Mut   1         1          -5.38
UROS         S212P        Mut   0.8       0.175      -1.277
UROS         I219S        Mut   1         0.998      -3.033
UROS         Q187P        Mut   1         0.942      -2.764
UROS         G225S        Mut   0.8       1          -5.387
UROS         S47P         Mut   0.8       0.907      -2.217
UROS         A66V         Mut   0.8       0.652      -3.087
UROS         G236V        Mut   1         0.999      -7.21
UROS         G58R         Mut   1         1          -7.77
UROS         V99A         Mut   1         1          -3.552
UROS         T62A         Mut   1         0.941      -4.504
UROS         L4F          Mut   1         0.993      -2.952
YARS         G41R         Mut   1         1          -7.061
YARS         E196K        Mut   1         0.994      -3.417
TPK1         G223R        Mut   0.8       0.804      -3.415
GFPT2        I471V        Mut   0         0.009      -0.787
CMPK1        N83S         Mut   0         0.001      -0.768
FDFT1        K45R         Mut   0         0          -0.199
CALM1        Q42L         SNP   0         0.065      -5.883
CALM1        Q50R         SNP   0         0.016      -3.201
CMPK1        E58K         SNP   0         0.005      -0.45
DHFR         E63Q         SNP   0.6       0.285      -2.383
DHFR         M140L        SNP   0         0          -1.15
DPAGT1       I393V        SNP   0         0          -0.357
DPAGT1       L137F        SNP   0         0.998      -3.874
DPAGT1       I37L         SNP   0         0.002      -0.991
DPAGT1       L371F        SNP   0         0          -1.012
DPAGT1       L401F        SNP   0         0.179      -2.4
DPAGT1       E376D        SNP   0         0.995      -2.91
DPAGT1       F332V        SNP   0         0.002      -0.21
DPAGT1       L70F         SNP   0.8       0.66       -2.28
DPAGT1       D116E        SNP   0         0.999      -3.87
EMG1         E214G        SNP   0         0.955      -6.387
FDFT1        S356C        SNP   0         0.772      -0.905
GDI1         P63S         SNP   0         0.054      -2.804
GDI1         D256Y        SNP   0.6       0.885      -5.688
GFPT2        T430I        SNP   0         0.995      -5.383
GLUL         V80M         SNP   0         0.047      0.17
GLUL         T116I        SNP   0         0.187      -3.12
NCS1         R61Q         SNP   0         0.009      -0.28
NHP2         A118T        SNP   0         0.01       -0.987
NHP2         R101Q        SNP   0         0.005      -1.05
NSDHL        M9L          SNP   0         0          -0.32
NSDHL        L95V         SNP   0         0.199      -1.431
NSDHL        W298L        SNP   0         0.001      -1.274
NSDHL        R281C        SNP   0         0.947      -2.298
NSDHL        A14V         SNP   0         0.011      0.09
NSDHL        Q253K        SNP   0         0.003      0.35
NSDHL        Q89E         SNP   0         0.005      0.03
PGK1         L119P        SNP   0         0.994      -6.804
PGK1         N163S        SNP   0         0.179      -4.619
PGK1         R350W        SNP   0         0.003      -2.507
PGK1         A398T        SNP   0         0.86       -3.473
PGK1         S57L         SNP   0         0.946      -4.515
PGK1         S305C        SNP   0         0.575      -3.99
PIN1         R142Q        SNP   0         0.007      -0.629
PKLR         R569Q        SNP   0         0.428      -3.434
PKLR         E277K        SNP   0         0.775      -3.105
PKLR         V460A        SNP   0         0.953      -3.058
PKLR         R504L        SNP   0.6       0.679      -6.55
PKLR         G549R        SNP   0         0.989      -6.31
PKLR         I219T        SNP   0         0.706      -3.54
PKLR         V213A        SNP   0         0.246      -2.06
PKLR         R479C        SNP   0         0.923      -6.25
PKLR         V506I        SNP   0         0.002      -0.34
PKLR         H306Q        SNP   0.6       0.001      1.47
PKLR         R449C        SNP   0         0.003      -1.48
PKLR         R209Q        SNP   0.6       0.002      -1.19
PKLR         E538D        SNP   0         0.001      -0.71
PKLR         P303A        SNP   0         0.001      -1.31
PKLR         R518C        SNP   0         0.44       -2.98
PKLR         D260N        SNP   0         0.963      -4.38
PKLR         R273H        SNP   0         0.015      -2
PKLR         R99C         SNP   0         0.151      -5.52
PKLR         R490W        SNP   0         1          -6.82
RPIA         H266Y        SNP   0.6       0.004      NA
TECR         R79W         SNP   0         0.724      -5.128
TECR         L50I         SNP   0         0.018      -1.07
TPK1         E57K         SNP   0         0.045      -1.496
UBE2I        F58C         SNP   1         0.941      -5.691
UROS         G57R         SNP   0.6       1          -3.348
UROS         T103A        SNP   1         0.951      -3.903
UROS         V171G        SNP   0         0.018      -2.597
UROS         A222V        SNP   0         0.011      0.414
UROS         I165V        SNP   0         0.007      -0.488
UROS         D113V        SNP   0         0.001      -1.349
UROS         P261A        SNP   0         0.001      -0.912
UROS         K206E        SNP   0         0.001      0.465
UROS         A234T        SNP   0         0.043      -0.38
UROS         T247M        SNP   0         0.944      -3.36
UROS         G263S        SNP   0         0.001      0.46
UROS         C264F        SNP   0         0.146      -1.3
UROS         K124R        SNP   0         0.486      -1.1
UROS         L36V         SNP   0         0.809      -2.63
YARS         M431L        SNP   0         0.003      -1.569
YARS         A339T        SNP   0         0          -0.49

2.2.2 Humanized yeast plasmid (wild-type and mutated) construction

Wild-type human disease-associated ORFs were selected from the human ORFeome version 8.1 (Team, Temple et al. 2009). For the 26 human genes selected in the first phase of the study, Gateway cloning destination vectors were constructed from pHiDest-DB (CEN/ARS-based, ADH1 promoter, LEU2 marker). A two-step modification of the original pHiDest-DB yielded two Gateway-compatible destination vectors with different selection markers. First, the entire GAL4 DNA-binding domain was deleted from pHiDest-DB, resulting in pHYCDest-LEU2 (for use in strains in which the TS allele is linked to natMX). Next, pHYCDest-natMX (for use in strains in which the TS allele is linked to kanMX) was constructed by replacing LEU2 with natMX in pHYCDest-LEU2. These modifications were achieved by separate PCR amplification of the pHiDest-DB backbone and the natMX4 cassette, followed by homologous recombination in yeast.

Human ORFs with disease-associated mutations were constructed by site-directed mutagenesis using the Thermo Scientific Phusion® Site-Directed Mutagenesis Kit (Appendix A). The Gateway Donor plasmid was amplified using phosphorylated primers that introduce the desired changes followed by a 5-minute, room-temperature ligation reaction. The resulting plasmid was then transformed into NEB5α competent E. coli cells (New England Biolabs).

All expressed ORFs used in the second phase of this study (wild-type human disease-associated ORFs, human ORFs with constructed alleles, and the GFP control) were transferred into the destination vector pCM188-URA (2-micron-based, URA marker) (Gari, Piedrafita et al. 1997) by Gateway LR reactions using the Gateway® LR Clonase® kit from Life Technologies.

Plasmids generated by Gateway LR cloning were transformed into NEB5α competent E. coli cells (New England Biolabs) and selected on LB agar plates with 100 µg/mL ampicillin. All plasmid DNA samples were isolated and purified using the NucleoSpin® 96 Plasmid toolkit (Ref: 740625.24) and confirmed by Sanger sequencing. Plasmids carrying expressed ORFs were then transformed into the corresponding yeast temperature-sensitive strains (Appendix A). These humanized yeast TS strains were then used for spotting assays (Figure 2-3).

2.2.3 Functional complementation assay

Yeast temperature-sensitive strains carrying human ORFs or the GFP control were spotted in a 10-fold dilution series and grown on selection plates at a range of temperatures (room temperature of ~24 °C, and 28, 30, 32, 33, 34, 35, 36 and 38 °C; Figure 2-3). Results were interpreted by comparing growth between the yeast strains expressing human genes and the corresponding control strain expressing the GFP gene. Each test was initially performed twice, and pairs found to complement in at least one replicate were considered complementation candidates. Then, I further considered only those candidates passing a third functional complementation


Figure 2-3 Outline for plasmid construction, Gateway cloning and spotting assay


2.2.4 FC score and computational analysis

To predict the functional effect of each missense genetic variant, I assessed complementation with the yeast spotting assays described above and assigned a semi-quantitative Failure-to-Complement (FC) score, corresponding to the FCS score of Sun et al. (Sun, Yang et al. 2016). Each variant received one of four scores: 0 (wild-type-like complementation), 0.6 (reduced complementation), 0.8 (severely reduced complementation) or 1 (complete loss of complementation) (Figure 2-4). Predicted functional impact scores for disease-associated variants were generated by Polymorphism Phenotyping v2 (PolyPhen2) (Sokic and Dukanovic 1971, Adzhubei, Schmidt et al. 2010) and PROVEAN (Choi, Sims et al. 2012).

The area under the precision-recall curve (AUPRC) was calculated using the R package “PRROC”. When comparing the performance of functional complementation assays in predicting disease-associated mutations in aligned versus non-aligned regions, it was necessary to account for the effect that a change in the prior probability of disease mutations can have on precision estimates. Performance was therefore estimated using the ratio of AUPRC to the prior probability (designated AUPRC_norm) instead of AUPRC itself.
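The normalization can be illustrated with a minimal Python sketch. It is not the PRROC implementation used for the thesis analysis: it uses a simple step-wise area estimate rather than PRROC's interpolation, and the function names and toy data are hypothetical. It only shows how dividing AUPRC by the prior (the fraction of variants that are disease-associated) gives a value that can be compared across variant sets with different class balance.

```python
from typing import List, Tuple


def pr_points(scores: List[float], labels: List[int]) -> List[Tuple[float, float]]:
    """Return (recall, precision) pairs, one per distinct score threshold.

    A variant is called positive when its score is >= the threshold.
    Thresholds are visited from the highest score down; ties are handled together.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive (disease) label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = []
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((tp / n_pos, tp / (tp + fp)))
    return points


def auprc(scores: List[float], labels: List[int]) -> float:
    """Step-wise area under the precision-recall curve (an assumption, not PRROC)."""
    area, prev_recall = 0.0, 0.0
    for recall, precision in pr_points(scores, labels):
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


def auprc_norm(scores: List[float], labels: List[int]) -> float:
    """AUPRC divided by the prior probability of a disease variant."""
    prior = sum(labels) / len(labels)
    return auprc(scores, labels) / prior


# Toy FC-like scores and labels (1 = disease, 0 = non-disease); not thesis data.
toy_scores = [1.0, 0.8, 0.6, 0.0, 0.0, 0.0]
toy_labels = [1, 1, 0, 1, 0, 0]
print(auprc(toy_scores, toy_labels), auprc_norm(toy_scores, toy_labels))
```

Under this normalization, a predictor that performs no better than chance is expected to score close to 1 whatever the prior, which is what makes the aligned and non-aligned comparisons commensurable.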


Figure 2-4 Pipeline to assess the functional impact of human genetic variants by mutagenesis and functional complementation assay.


2.3 Functional complementation assay results in orthologous pairs

2.3.1 Human to yeast complementation in orthologous pairs

To identify human and yeast complementing orthologous pairs, I first systematically tested the ability of ‘wild type’ human disease genes to rescue temperature-sensitive mutations in orthologous yeast genes. For the selected 139 human and yeast orthologous pairs, the ‘wild type’ human disease genes were transferred into expression vectors by Gateway cloning technology and transformed into corresponding yeast temperature-sensitive mutant strains. The growth of yeast strains expressing either a given human gene or a Green Fluorescent Protein (GFP) negative control gene was assessed under permissive, semi-permissive and non-permissive temperatures. A complementation relationship was defined by a humanized yeast strain growing significantly faster than the GFP control strain at a semi-permissive or non-permissive temperature. Among the 139 orthologous pairs tested, 26 complementation relationships were identified (Figure 2-5).


Figure 2-5 Pipeline to identify human and yeast orthologous complementing pairs using functional complementation assay.

2.3.2 Assessing functional impact of disease-associated genetic variants

Having established functional complementation relationships between yeast and human orthologs, I then assessed the pathogenicity of human genetic variants within the 26 human genes for which complementation relationships were identified. Among the 26 human genes, 22 have missense mutations annotated in the HGMD database. For these 22 genes, I selected disease- and non-disease-associated variants. Construction of each missense variant was attempted by site-directed mutagenesis; each variant was then transferred to a yeast expression plasmid by Gateway technology and individually transformed into the corresponding yeast strain (described in Chapter 2.2.2).

In total, 101 disease-associated variants from HGMD Professional were selected and successfully constructed within expression clones. As a control, another 92 non-disease-associated genetic variants were selected from the dbSNP database (Wheeler, Barrett et al. 2008, Sayers, Barrett et al. 2009, Coordinators 2014), attempting to match the number of disease mutations in each gene (matching was not possible for some genes, given the limited number of validated non-disease variants). Of these 92, I generated expression clones for 78 non-disease mutations. Only 2 of the 78 non-disease mutations constructed were common (minor allele frequency ≥ 1%), as compared with 2 of the 101 successfully constructed disease-associated variants.

To assess the functional impact of these 179 variants, I carried out the functional complementation assay on each of them. Each allele was assayed alongside the corresponding wild-type allele and the GFP gene, which served respectively as negative and positive controls for loss of complementation (Figure 2-6). I then gave each variant a semi-quantitative Failure-to-Complement (FC) score based on the spotting assay results; this is described as the FCS score in Sun et al. (Sun, Yang et al. 2016).


Figure 2-6 Outline for assessing the functional impacts of human genetic variants based on the complementation relationship between human and yeast orthologous pairs

2.3.3 Comparing functional complementation assay result with computational prediction result

I also compared the ability of the functional complementation assay to predict the functional impact of missense variants with that of computational methods. Here, I obtained the functional impact scores predicted by three widely used computational methods: PolyPhen2 (Adzhubei, Schmidt et al. 2010), SIFT (Ng and Henikoff 2001), and PROVEAN (Choi, Sims et al. 2012).

To compare the relative ability of the functional complementation assay and computational methods to identify disease mutations, I used the 119 variants scored by functional complementation assays (81 disease and 38 non-disease variants) as a ‘gold standard’. The distributions of FC scores for the 81 disease-associated mutations and the 38 non-disease-associated mutations (Figure 2-7) show that our functional complementation assays were able to largely separate disease-associated variants from non-disease-associated variants.


Figure 2-7 Distribution of FC score in disease-associated and non-disease associated variants, and sequence identity in complementing and non-complementing pairs.

Performance was plotted in terms of precision (the fraction of disease predictions that were correct according to the gold standard) and recall (the fraction of all gold standard disease variants that were successfully predicted to be disease variants) for both the FC scores and PolyPhen2 scores at various thresholds (Figure 2-8 A). This precision-recall curve shows that yeast-based functional complementation assays outperform PolyPhen2 in terms of pathogenicity prediction. Similar analyses of two other computational methods, SIFT (Ng and Henikoff 2001) and PROVEAN, yielded a similar conclusion (Figure 2-8 B).

Figure 2-8 A. Precision-recall curves for FC scores and PolyPhen2 scores; B. Precision-recall curves for PolyPhen2 scores, SIFT scores and PROVEAN scores.


Based on the precision-recall curves of Figure 2-8, a cut-off value of 0.6 was set as the threshold for the FC score: genetic variants with an FC score more than 0.6 are regarded as deleterious. At this threshold, the functional complementation assay achieved a precision of 94% and a recall of 78%. Genetic variants with a PolyPhen2 score larger than 0.5 are usually regarded as deleterious. At these thresholds, the precision of FC scoring (94%) significantly exceeded that of PolyPhen2 (precision 83%; Fisher’s Exact test, P-value = 0.02). Analysis of the overlap between experimental and computational predictions shows that they largely predict the same disease mutations to be deleterious. There are 66 genetic variants predicted to be deleterious by PolyPhen2 and 63 genetic variants predicted to be deleterious by our functional complementation assay; 59 of these variants are predicted to be deleterious by both methods. Thus, performance differences largely stemmed from differences in the calls for non-disease mutations: of the 38 non-disease-associated mutations, only four (11%) were classified as deleterious by the FC score, while PolyPhen2 predicted fourteen (37%) of the non-disease mutations to be deleterious (Fisher’s Exact test, P-value = 0.006).
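The reported precision and recall values can be checked from the counts in this section. The short Python sketch below assumes that the 63 (FC) and 66 (PolyPhen2) deleterious calls refer to the 81 gold-standard disease variants; that reading is an assumption, adopted because it reproduces the percentages quoted in the text. The helper name is hypothetical.

```python
def precision_recall(true_pos: int, false_pos: int, total_pos: int):
    """Precision = TP / (TP + FP); recall = TP / (all gold-standard positives)."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / total_pos
    return precision, recall


N_DISEASE = 81  # gold-standard disease variants

# FC score: 63 disease variants (assumed) and 4 non-disease variants called deleterious.
fc_precision, fc_recall = precision_recall(63, 4, N_DISEASE)

# PolyPhen2: 66 disease variants (assumed) and 14 non-disease variants called deleterious.
pp2_precision, pp2_recall = precision_recall(66, 14, N_DISEASE)

print(f"FC:        precision={fc_precision:.3f} recall={fc_recall:.3f}")
print(f"PolyPhen2: precision={pp2_precision:.3f} recall={pp2_recall:.3f}")
```

With these counts the two methods recover a similar number of disease variants, so the gap in precision comes almost entirely from the false-positive term (4 versus 14 non-disease variants).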

2.4 Functional complementation assay results for paralogous pairs

2.4.1 Many complementation relationships exist for human-yeast paralogs

To expand the set of human disease genes with a functional complementation assay, during the second phase of this project I performed functional complementation assays for a test set of 1060 human-yeast paralog pairs, and also for eight known-complementing control pairs (seven orthologs and one paralog). These 1060 human-yeast pairs correspond to 341 human disease-associated genes. All the human and yeast paralogous pairs share at least one homologous domain. For each of the 341 human genes in this test space, I obtained an open reading frame (ORF) from the hORFeome 8.1 collection (Rual, Venkatesan et al. 2005, Team, Temple et al. 2009), and derived a ‘humanizing’ yeast expression plasmid via Gateway cloning as described in Chapter 2.2.2.

To assess complementation for each human-yeast pair, the human protein was expressed in yeast strains bearing temperature-sensitive mutations in the corresponding yeast gene (Figure 2-9), and growth was assessed at temperatures ranging from permissive to non-permissive, including room temperature (permissive temperature), 28 °C, 32 °C, 33 °C, 34 °C, 35 °C, 36 °C, 37 °C and 38 °C.

Figure 2-9 Identify human and yeast paralogous complementing pairs.

Each test, including yeast transformation and spotting assay, was performed twice, and 42 pairs yielded complementation in at least one replicate. These complementing pairs included 34 of the 1060 paralog pairs in the test space and all 8 positive controls. I then attempted to confirm complementation for all 34 pairs from the test space with a third round of spotting assays. As a result of this systematic complementation testing, I identified 34 reproducible complementing pairs among human-yeast paralogs (Table 2-3). For 33 (97%) of these human-yeast paralog pairs, the complementation relationships I discovered were novel. Of the 314 human disease-associated genes tested, 33 (10.4%) yielded a complementation relationship with at least one yeast paralog.

According to the YeastMine database (Balakrishnan, Park et al. 2012), there are 773 additional human disease-associated genes with yeast paralogs, suggesting that a functional assay could potentially be developed for ~70 additional human disease-associated genes through further examination of paralog complementation assays.


Table 2-3 Human-yeast complementing paralogous pairs

Complementing Human Genes | Entrez ID | Yeast ORF | Associated Disease | Age of Onset | Pattern of Inheritance
ACVR1C | 130399 | YDL108W | Glaucoma, primary congenital | NA | autosomal recessive
ACVR2B | 93 | YDL108W | Left-right axis malformation | NA | autosomal recessive
AKT2 | 208 | YDL108W | Severe insulin resistance and diabetes mellitus, Hypoglycaemia | NA | autosomal dominant
BBS10 | 79738 | YJR064W | Bardet-Biedl syndrome | fetuses between ages 21 and 26 weeks' gestation | NA
BMPR1B | 658 | YDL108W | Brachydactyly type A2, Breast cancer | NA | autosomal recessive
CASK | 8573 | YPL204W | Mental retardation | NA | autosomal dominant
CYP19A1 | 1588 | YHR007C | Prostate cancer, Alzheimer's | 8-18 year old | incomplete dominant
EIF4E | 1977 | YOL139C | Autism | |
ELP2 | 55250 | YNR026C | Intellectual disability | NA | autosomal recessive
GNB1L | 54584 | YNR026C | Autism spectrum disorder | NA | NA
H2BFWT | 158983 | YKL049C | Oligospermia, Male infertility | NA | NA
HRAS | 3265 | YOR101W | Costello syndrome, Bladder cancer | NA | NA
IFT122 | 55764 | YNR026C | Cranioectodermal dysplasia | 27 months | autosomal recessive
PAFAH1B1 | 5048 | YPL126W | Subcortical band heterotopia, Lissencephaly | less than 4 | autosomal recessive
RAB33B | 83452 | YFL005W | Smith-McCort dysplasia, Dyggve-Melchior-Clausen syndrome | 24 years old | autosomal recessive
RFWD2 | 64326 | YPL126W | Autism spectrum disorder | | autosomal recessive
SAT1 | 6303 | YFL017C | Suicide, Keratosis follicularis spinulosa decalvans | 11 months | X-linked
SPAST | 6683 | YDR394W | Upper motor neuron syndrome | | autosomal dominant
TBK1 | 29110 | YFL029C | Herpes simplex virus encephalitis | in early adulthood | autosomal dominant
TIMM44 | 10469 | YIL022W | Oncocytic thyroid carcinoma | NA | NA
UBE2B | 7320 | YDR054C | Congenital heart disease, Azoospermia | NA | NA
UFD1L | 7353 | YGR048W | Cleft palate, Catch 22 syndrome, Cardiac and craniofacial defect | NA | complex disease
VCP | 7415 | YGL048C | IBMPFD, Myopathy, rimmed vacuolar, Amyotrophic lateral sclerosis, Alzheimer disease | 47-60 | autosomal dominant
ADRBK2 | 157 | YDL028C | Bipolar disorder, association with | NA | NA
CDK7 | 1022 | YFL029C | Intellectual disability, Cystinosis, nephropathic juvenile | NA |
CDKL3 | 51265 | YDL108W | Potential protein deficiency | NA | NA
GADD45B | 4616 | YDL208W | Inter-ventricular septum hypertrophy | NA | NA
GRK4 | 2868 | YDL108W | Altered activity, Reduced transcriptional activity, Increased transcriptional activity, Potential protein deficiency, Essential hypertension | NA | complex disease
PPP2R1A | 5518 | YMR308C | Altered promoter activity | NA | NA
RAC1 | 5879 | YPR165W | Azathioprine haematotoxicity in TPMT carriers, Reduced promoter activity, Ulcerative colitis, Radial ray defect | NA | NA
RAGE | 5891 | YPR161C | Renal carcinoma | NA | NA
RPS6KB1 | 6198 | YAR019C | Potential protein deficiency | NA | NA
RPS6KL1 | 83694 | YDL108W | Potential protein deficiency | NA | NA

2.4.2 Some essential yeast genes are complemented by multiple human paralogs sharing only a single domain

Among the 33 novel human-yeast paralog complementation pairs found in this study, there were four yeast genes that could each be complemented by multiple human genes. For each of these yeast genes, the corresponding set of complementing human genes shared a common protein domain. For example, the function of yeast serine/threonine protein kinase Kin28 (ORF ID: YDL108W) could be complemented by expression of seven different human proteins (Figure 2-10). These proteins are Ribosomal Protein S6 Kinase-Like 1 (RPS6KL1), G Protein-Coupled Receptor Kinase 4 (GRK4), Cyclin-Dependent Kinase-Like 3 (CDKL3), Bone Morphogenetic Protein Receptor, type IB (BMPR1B), V-Akt Murine Thymoma Viral Oncogene Homolog 2 (AKT2), Activin Receptor Type-2B (ACVR2B) and Activin A Receptor, Type 1C (ACVR1C). Each of these shares the Pkinase protein domain found within yeast Kin28. However, each of these seven human proteins contains one or more additional protein domains and has different functions in different pathways. Indeed, the only apparent common thread among Kin28-complementing human proteins is the Pkinase protein domain. This domain is a structurally conserved protein domain containing the catalytic function of protein kinases. Previous studies have reported that its function has been evolutionarily conserved from E. coli to Homo sapiens. Protein kinases play key roles in various cellular processes, including division, proliferation, apoptosis, and differentiation (Manning, Plowman et al. 2002).

Figure 2-10 Comparison of protein domain structures among yeast KIN28 and its human paralogs tested. The complementing human genes are labeled in blue; non-complementing human genes are labeled in black. Yeast KIN28 is labeled in red. Protein domains are shown as boxes in different colors.

Other examples of multiple human genes complementing a given yeast gene involve the yeast proteins cyclin-dependent kinase-activating kinase Cdk1 (ORF ID: YFL029C), guanine nucleotide exchange factor Sec12 (ORF ID: YNR026C), and Net1-Associated Nuclear protein Nan1 (ORF ID: YPL126W). Two human genes can complement loss of yeast Cdk1: the genes encoding Serine/threonine-Protein Kinase (TBK1) and Cyclin-Dependent Kinase 7 (CDK7) (Figure 2-11). Loss of yeast Sec12 can be complemented by human genes IFT122, ELP2, and GNB1L. Loss of yeast Nan1 can be rescued by human genes PAFAH1B1 and RFWD2, each of which contains the same protein domain, the WD40 repeat domain (PF00400). The WD40 repeat is a short structural motif of approximately 40 amino acids, often terminating in a tryptophan-aspartic acid (W-D) dipeptide. This protein domain is implicated in a variety of functions, such as signal transduction, transcription regulation, autophagy and apoptosis. The underlying common function of all WD40-repeat proteins is coordinating multi-protein complex assemblies, where the repeating units serve as a rigid scaffold for protein interactions (Stirnimann, Petsalaki et al. 2010). My functional complementation results indicate that structural domains are conserved among proteins with different functions.

Thus, protein domain function, even for otherwise highly-diverged paralogs, can be sufficiently conserved to allow functional rescue of a yeast protein by multiple human paralogs and to thus provide a potential assay for the functionality of human variants.

Figure 2-11 Comparison of protein domain structures among yeast CDK1 and its human paralogs tested. Pkinase domains are shown as blue boxes

2.4.3 Paralog complementation is only weakly predicted by sequence similarity

To understand the rules that may contribute to the functional complementation relationship between human and yeast paralogs, I examined the extent of sequence identity between human disease-associated genes and their yeast paralogs. For each human and yeast gene pair, I calculated the pairwise percent sequence identity (PID) score (the percentage of aligned positions with identical residues). For each yeast gene with multiple human paralogs tested, I examined the average PID scores for complementing and non-complementing human-yeast paralog pairs. As expected, complementing pairs had higher PID scores than non-complementing pairs (Figure 2-12A, P-value = 0.0072, Wilcoxon test). Similarly, for human genes that had multiple yeast paralogs tested, complementing pairs had relatively higher average PID scores (Figure 2-12B, P-value = 0.0034, Wilcoxon test).

My results show that sequence similarity is correlated with complementation for human-yeast paralogs, as might be expected given a similar result for human-yeast orthologs (Kachroo, Laurent et al. 2015). However, as was also previously shown for orthologous human-yeast gene pairs (Kachroo, Laurent et al. 2015), sequence similarity is only weakly predictive of complementation: Although most (60%) of the complementing pairs showed PID > 30, 30% of non-complementing pairs also showed a PID score above this threshold. Thus, systematic experimental testing will continue to be required for discovery of complementing paralog pairs.

Figure 2-12 A. The average percent identity (PID) score distribution for yeast-human pairs for which a yeast protein had multiple human paralogs tested. B. The average PID score distribution for human-yeast pairs for which a human protein had multiple yeast paralogs tested.

A similar analysis was performed for three additional sequence-identity calculation methods. The PIDai score was calculated as the ratio of identical positions to the sum of aligned positions and internal gap positions. The PIDsl score is the percentage of identical positions relative to the length of the shorter sequence, whereas PIDal uses the average length of the human and yeast sequences. Even when these different sequence-identity calculation methods were used, for those yeast genes that had more than one human paralog tested, the complementing paralogous pairs showed higher sequence identity (P-value = 0.0089 for PIDai, P-value = 0.0037 for PIDsl, P-value = 0.0046 for PIDal). For human genes that have multiple yeast paralogs, the complementing paralogous pairs show higher sequence identity except when considering both the aligned positions and internal gaps (P-value = 0.1995 for PIDai, P-value = 0.0090 for PIDsl, P-value = 0.02 for PIDal). The different results based on the PIDai and PID scores suggest that human genes tend to be much longer than their yeast counterparts. The protein domains and motifs added over the course of evolution may have given new functions to human genes.
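The four identity measures differ only in their denominator, which the following Python sketch makes explicit. It is an illustration under stated assumptions rather than the procedure used in this study: it takes an already computed pairwise alignment as two equal-length gapped strings, treats gaps lying between the first and last aligned columns as internal, and uses a hypothetical function name and toy sequences.

```python
from typing import Dict

GAP = "-"


def percent_identities(aln_a: str, aln_b: str) -> Dict[str, float]:
    """Compute PID, PIDai, PIDsl and PIDal from one gapped pairwise alignment."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    columns = list(zip(aln_a, aln_b))
    # Aligned positions: columns where both sequences have a residue.
    aligned_idx = [i for i, (x, y) in enumerate(columns) if x != GAP and y != GAP]
    if not aligned_idx:
        raise ValueError("alignment has no aligned positions")
    identical = sum(1 for i in aligned_idx if columns[i][0] == columns[i][1])
    # Internal gaps: gapped columns between the first and last aligned column.
    first, last = aligned_idx[0], aligned_idx[-1]
    internal_gaps = sum(1 for i in range(first, last + 1) if GAP in columns[i])
    len_a = len(aln_a.replace(GAP, ""))
    len_b = len(aln_b.replace(GAP, ""))
    return {
        "PID": 100.0 * identical / len(aligned_idx),
        "PIDai": 100.0 * identical / (len(aligned_idx) + internal_gaps),
        "PIDsl": 100.0 * identical / min(len_a, len_b),
        "PIDal": 100.0 * identical / ((len_a + len_b) / 2),
    }


# Toy alignment (not real protein sequences): 6 identical, 7 aligned, 1 internal gap.
result = percent_identities("MKTLLV-AGR", "-KTLIVSAG-")
for name, value in result.items():
    print(f"{name}: {value:.1f}")
```

Because PIDai, PIDsl and PIDal all enlarge the denominator relative to PID, a long human protein aligned to a short yeast paralog over a single domain can score well on PID while scoring poorly on the length- or gap-aware measures.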

2.4.4 Assessing the pathogenicity of missense variants

Having established functional complementation relationships between human-yeast paralogs, I then wondered whether these relationships could be exploited to assess the pathogenicity of human genetic variants. Among the 33 disease-associated genes for which I could identify novel complementation relationships, pathogenic missense variants (here defined by HGMD DM annotation) are known for 17 human genes. To assess the ability of human/yeast paralog complementation assays to identify pathogenic variants, I therefore selected a subset of seven human disease-associated genes with multiple annotated disease-causing missense variants (Stenson, Ball et al. 2003, Cooper, Stenson et al. 2006, Stenson, Ball et al. 2012) (Table 2-2). Non-disease-annotated missense variants were randomly selected from the dbSNP database (Wheeler, Barrett et al. 2008, Sayers, Barrett et al. 2009, Coordinators 2014) for five of these seven genes. For the other two human genes, no missense mutations were reported in the dbSNP database. In total, I tested 19 disease-causing missense variants, each having both the most stringent disease-causal “DM” annotation in HGMD and the most stringent “pathogenic” annotation in ClinVar (Landrum, Lee et al. 2014). I also tested 16 non-disease-associated variants from dbSNP, selecting lower allele frequency variants where possible to better control for the generally low allele frequency of disease-causing variants.


For each of these 35 human variants, I generated an expression clone by site-directed mutagenesis and recombinational cloning, transformed it into the appropriate temperature-sensitive (TS) yeast strain, and assessed functional complementation (Figure 2-3). For each genetic variant, this yielded a semi-quantitative Failure-to-Complement (FC) score. FC scores were calibrated so that the positive (complementing) control wild-type human plasmid achieves an FC score of 0, and a Green Fluorescent Protein (GFP) negative (non-complementing) control achieves an FC score of 1. Adhering to the previous convention, only variants with a score greater than 0.5 were considered deleterious (Sokic and Dukanovic 1971, Sun, Yang et al. 2016).

Functional complementation assays predicted 15 (79%) of 19 disease variants and only 4 (25%) of the 16 non-disease-associated variants tested to be deleterious. Thus, paralog-based functional complementation assays achieved 79% precision (fraction of predicted-deleterious variants that are annotated as pathogenic) at 79% recall (fraction of pathogenic variants predicted to be deleterious). Different performance tradeoffs could be achieved at different thresholds. For example, I achieved 100% precision at a recall value of 19.2% (Figure 2-13 A). Interestingly, our results for non-disease-associated variants suggest that genetic variants not annotated as disease-associated may in fact often impact gene function. Nevertheless, functional complementation assays help distinguish disease-associated and non-disease-associated genetic variants: for the five genes that have both disease-associated and non-disease-associated variants, disease-associated variants had higher FC scores (P-value = 0.007, Wilcoxon test; Figure 2-13B; Table 2-2).
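As a worked check of these figures, take TP = 15 (disease variants called deleterious), FP = 4 (non-disease-associated variants called deleterious) and FN = 19 - 15 = 4 (disease variants not called deleterious). The symbols TP, FP and FN are introduced here for illustration only:

```latex
\begin{align*}
\text{precision} &= \frac{TP}{TP + FP} = \frac{15}{15 + 4} \approx 0.79, \\
\text{recall}    &= \frac{TP}{TP + FN} = \frac{15}{15 + 4} \approx 0.79 .
\end{align*}
```

Precision and recall coincide at this threshold only because the number of false positives happens to equal the number of false negatives.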


Figure 2-13 A. The precision-recall (PRC) curve of FC score and PolyPhen2 score. B. The distribution of FC score of disease-associated and non-disease-associated variants.

I next compared FC with PolyPhen2 (Sokic and Dukanovic 1971) and Protein Variation Effect Analyzer (PROVEAN) (Choi, Sims et al. 2012), two widely used computational methods for predicting pathogenic variants. The area under the precision-recall curve (AUPRC) is 0.83 for FC scores, 0.76 for PolyPhen2 and 0.70 for PROVEAN (Figure 2-13 A). Thus, precision vs. recall analysis shows that paralog-based functional complementation assays, as previously shown for ortholog-based assays (Sun, Yang et al. 2016), outperformed PolyPhen2 and PROVEAN in predicting pathogenicity. I considered several alternative summary performance measures of pathogenicity prediction, including recall at 90% precision, area under the precision-recall curve, area under the receiver operating characteristic curve, and Matthews Correlation Coefficient. By every measure, paralog-based functional complementation assays are at least on par with computational methods in predicting pathogenicity (Table 2-4).

Table 2-4 Single Point Performance Estimates

Method | MCC | AUPRC | AUROC | REC90
FC | 0.59 | 0.83 | 0.55 | 0.78
PolyPhen2 | 0.48 | 0.76 | 0.55 | 0.74
PROVEAN | 0.37 | 0.7 | 0.52 | 0.71

(MCC) Matthews correlation coefficient; (AUPRC) area under the precision-recall curve; (AUROC) area under the receiver-operating characteristic curve; (REC90) recall at 90% precision.

To more generally assess the performance of complementation-based pathogenicity assays against computational tests, we combined the paralog-based complementation pathogenicity tests with previous ortholog-based tests (Sun, Yang et al. 2016). At score thresholds where the FC score and PolyPhen2 achieve a recall of 0.9, the FC precision is 0.81 whereas the PolyPhen2 precision is 0.72. Using the previously described performance threshold value of 0.5 for the FC score (Sun, Yang et al. 2016) achieves a recall of 0.78 and precision of 0.89 for the FC score. The PolyPhen2 score threshold (0.56) that matches this recall performance achieves a lower precision of 0.73 (Fisher’s exact test P-value = 0.003). A similar comparison using only ortholog-based assays yielded the same conclusion, albeit with a less significant P-value of 0.008 (Sun, Yang et al. 2016). Thus, inclusion of paralog-based complementation strengthens previous conclusions that complementation-based identification of functional variation outperforms current computational approaches.

I wondered whether complementation assays are capable of detecting pathogenic variants when these variants fall outside of the aligned homology region. It is possible that variants will affect additional human gene functions that are not needed for complementation, so that such pathogenic variants will be missed. However, variants that alter the ability of a protein to fold in human cells may often also alter the ability of that protein to fold in a yeast cell.

Figure 2-14 Precision vs. recall analysis for variants that either do (‘aligned’) or do not (‘non-aligned’) fall within the sequence region that can be aligned between human and yeast homologs.

Interestingly, failure of a variant to complement did not depend on whether or not it falls within the aligned region of homology between yeast and human genes. More specifically, the distributions of FC scores for variants inside and outside of the aligned region were statistically indistinguishable (P-value = 0.37, Wilcoxon test). I specifically evaluated the performance of complementation assays in detecting disease variants outside of the aligned region, and found that the likelihood of detecting a disease variant was comparable: 0.76 and 0.87 respectively for variants inside and outside of the aligned region (Figure 2-14). I extended this analysis to include previous data on orthologous complementation from the first phase of this study (Sun, Yang et al. 2016). Taking orthologous and paralogous yeast/human complementation assay data together, the likelihood of detecting a pathogenic variant was 0.92 and 0.88 respectively for variants inside and outside of the aligned region. Thus, yeast-based functional assays are frequently capable of accurately detecting pathogenic variants, even outside of the aligned homology region.


2.5 Discussion

In this study, the effort on identifying complementation relationships was focused on yeast genes with an essential function by using temperature-sensitive alleles of yeast genes with a human disease-associated homolog. We used yeast temperature-sensitive strains because the conditional ‘no growth’ phenotype simplifies rescue experiments and the subsequent evaluation of the functional effect of human genetic variants. Similar assays could be performed for yeast nonessential genes, but this is complicated by the fact that the majority of nonessential gene deletion strains had either small or no fitness defect in a single experimental condition (Giaever, Chu et al. 2002, Deutschbauer, Jaramillo et al. 2005, Breslow, Cameron et al. 2008). However, a more severe phenotype for any given nonessential gene can be revealed by either negative genetic interaction (i.e., by using a sensitized strain background) or using a sensitizing growth environment (Hillenmeyer, Fung et al. 2008, Costanzo, Baryshnikova et al. 2010). For example, it is efficient to look for genetic modifiers in yeast using the synthetic genetic array (Baryshnikova, Costanzo et al. 2010). We can test how the known genetic interactions of a yeast gene with a TS mutant allele change in the presence of a mutated human homolog compared to the presence of the wild-type human gene. By doing so, it may be possible for us to figure out how mutations in different genes or gene regions interact with different gene sets and therefore suggest different functional consequences of these mutations.

Given the rate at which we observed complementation (26 out of 139 or 19%) and our screen’s apparent false negative rate (8 out of 15 or 53%), a yeast complementation assay might be expected for ~40% of those human disease-associated genes with a yeast ortholog. Applying this ratio to the 1047 human disease-associated genes that have a yeast ortholog, approximately 400 disease-associated genes are expected to be able to functionally replace their yeast ortholog counterparts, which accounts for 8% of the total number of disease-associated genes (disease-associated variants have been identified and deposited in OMIM or HGMD for ~5000 human genes).
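The extrapolation above can be retraced with a few lines of arithmetic. The Python sketch below is illustrative only: it assumes that the ~40% figure comes from dividing the observed complementation rate by one minus the false negative rate, which is one plausible reading of the text rather than a documented formula, and all variable names are hypothetical.

```python
# Counts quoted in the text.
complementing_pairs = 26      # gene pairs that showed complementation
tested_pairs = 139            # gene pairs tested
missed_known_pairs = 8        # known complementing pairs the screen missed
known_pairs = 15              # known complementing pairs that were tested
genes_with_ortholog = 1047    # disease-associated genes with a yeast ortholog
total_disease_genes = 5000    # approximate number of disease-associated genes

observed_rate = complementing_pairs / tested_pairs
false_negative_rate = missed_known_pairs / known_pairs

# Assumption: the observed rate is only the fraction of true complementing
# pairs that the screen detects, so it is divided by the sensitivity.
corrected_rate = observed_rate / (1 - false_negative_rate)

expected_assays = corrected_rate * genes_with_ortholog
fraction_of_disease_genes = expected_assays / total_disease_genes

print(f"observed rate:             {observed_rate:.2f}")
print(f"false negative rate:       {false_negative_rate:.2f}")
print(f"corrected rate:            {corrected_rate:.2f}")
print(f"expected assays:           {expected_assays:.0f}")
print(f"fraction of disease genes: {fraction_of_disease_genes:.2f}")
```

Under this assumption the corrected rate comes out close to 0.40 and the expected number of assays a little above 400, in line with the rounded figures quoted above.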

Functional complementation assays for human disease genes enable the simple and accurate identification of pathogenic disease variants (Lee and Nurse 1987, Osborn and Miller 2007, Sun, Yang et al. 2016). Until now, trans-species functional complementation assays have been almost exclusively based on orthology relationships. During the second phase of this study, I considered gene pairs without annotated orthology (so-called paralogs in this study), carrying out a systematic search that yielded novel functional complementation assays for 33 human disease genes. I further showed that these assays are significantly more accurate in predicting pathogenic variants than the computational approaches that are the only currently practicable alternative. Given new genetic and genomic tools becoming available for model organisms and human cell models, together with the fact that orthology is not required for functional complementation and that complementation assays may be extended using sensitized strains/environments, functional assays could quickly become available for the majority of human disease genes. These functional assays could be employed as a platform to evaluate human genetic variations and study disease mechanisms.

In addition to yielding a direct benefit in the form of novel functional assays, my systematic search for non-orthologous complementation enabled some general observations about paralogous complementation relationships.

First, as with orthologs, sequence similarity is only a very weak predictor of complementation relationships, necessitating that searches for complementation relationships be experimental. Second, despite the idea that paralogs often have divergent functions, I found that multiple human genes (having in common a single protein domain) can sometimes complement the same yeast gene. For example, the seven human disease-associated genes that can complement the yeast kin28 all encode a protein kinase domain, corresponding to 3 major kinase groups, including TKL kinases, CMGC kinases and AGC kinases. However, there were an additional 31 human disease-associated genes that encode the same protein domain that did not complement yeast kin28. In the phylogenetic tree of tested human protein homologs of yeast Kin28, built using the multiple sequence alignment tool Clustal (Sievers, Wilm et al. 2011) (Figure 2-15), there was no evident clustering of yeast Kin28 and its complementing human homologs. I also mapped all the tested human kinases into the kinase family tree (Figure 2-16, adapted from a phylogenetic analysis by Manning et al., 2002). Complementing human homologs could be found in five distinct sub-families. These results highlight the idea that closer evolutionary relationships do not guarantee complementation, and that many apparently paralogous complementing pairs are truly paralogous as opposed to being cryptic orthologs. In keeping with a previous study that found specific functional categories to be more amenable to complementation (Kachroo et al. 2015), my results showed that protein kinases are more likely to complement. From this study, we conclude that complementation relationships between human and yeast homologous pairs cannot be precisely predicted. However, in the future, we could try other methods of prediction, e.g., using different gene features, such as membership in specific pathways or functional categories.

Figure 2-15 The phylogenetic tree of yeast kin28 and its paralogs tested.

Interestingly, my study results show that functional complementation assays can be used to accurately identify pathogenic variants even when those variants fall outside of the aligned region. This is consistent with the idea that many deleterious variants affect protein folding or stability and disrupt the function of the entire protein. Thus, even where only a single domain is required for a human protein to complement its yeast paralog, that relationship can be exploited to detect a substantial subset of functional variation throughout the length of the human protein.


Considerable effort has been made to understand how genetic changes give rise to the molecular effects that cause diseases (Steward, MacArthur et al. 2003, Mooney 2005, Ng and Henikoff 2006). There are many databases and tools for prioritizing candidate single nucleotide polymorphisms (SNPs) or hypothesizing the molecular causes of genetic disease. Functional complementation assays have the potential to identify genetic variants whose molecular features are likely to affect function.


Figure 2-16 Protein kinase tree showing all the tested human and yeast protein kinases. Tested non-complementing human proteins are colored in cyan; complementing human proteins are colored in magenta.

One interesting example is the gene CASK, which encodes calcium/calmodulin-dependent serine protein kinase. CASK encodes a 921-amino acid polypeptide with an N-terminal calcium/calmodulin-dependent protein kinase-like domain, PDZ and SH3 domains, a potential protein-binding motif, and a domain homologous to guanylate kinase (Cohen, Woods et al. 1998). Sequence variants in CASK cause intellectual disability (Atasoy, Schoch et al. 2007). For the only annotated disease variant I tested in CASK, kinase domain variant R28L causing FG Syndrome (Piluso, D'Amico et al. 2009), I observed loss of complementation. I also tested several non-disease-associated CASK variants (D471N, M438L, R430C, and T573I). Although three of the four non-disease variants tested were found, as expected, to retain the ability to complement, variant T573I (rs141840001), despite not being annotated as associated with Mendelian disease (Hamosh, Scott et al. 2005, Cooper, Stenson et al. 2006, Landrum, Lee et al. 2014) or via any GWA study (Welter, MacArthur et al. 2014), showed reduced complementation. This variant was originally identified in a clinical genetics laboratory (Emory Genetics Laboratory, ClinVar accession RCV000175306.1), but no information on the patient phenotype that might have motivated clinical sequencing was provided.

Another example is the gene RAB33B, which encodes a small GTP-binding protein of the RAB family and is associated with Smith-McCort Dysplasia. I observed failure to complement for the two disease-associated variants, P219S and K46Q (Alshammari, Al-Otaibi et al. 2012, Dupuis, Lebon et al. 2013). Interestingly, both non-disease-annotated variants, P142L (rs369719131) and T177M (rs140381459), also showed loss of complementation. My findings agreed with Polyphen2 and PROVEAN, each of which also predicted them to be deleterious. All four variants tested are within the Ras domain. Thus, even though variants P142L and T177M are not known to be associated with disease, they appear to affect protein function.

Here, during the second phase of this study, I tested 314 of the 3535 human disease-associated genes with a yeast paralog. Of the 314 we examined, 33 (10%) can complement a yeast paralog. According to HGMD, about 3019 human disease-associated genes have paralogs in either S. cerevisiae or Schizosaccharomyces pombe. Simple extrapolation suggests that a more exhaustive search for complementation relationships in these two yeast species could yield complementation assays for assessing functional variation in 300 human disease genes. Considering multicellular model organisms, the number of potential complementation assays increases further (see Table 2-5 for a summary of human disease-associated genes with either an ortholog or paralog in five model animal systems). Taking orthologs and paralogs together, functional complementation assays could potentially be developed for more than 15% of human disease genes in one of these two yeast species.


Table 2-5 Numbers of Human Disease-associated Genes with Orthologs and Paralogs in Five Model Species

                           Human disease-associated genes
Organism                   Orthologs    Paralogs
Mus musculus               5547         256
Rattus norvegicus          5492         265
Danio rerio                4619         231
Drosophila melanogaster    3021         384*
Caenorhabditis elegans     2665         169

*This figure is conservative, in that the HGMD source for this information used a more stringent criterion for paralogy (elsewhere in this study homologs without annotated orthology are referred to as paralogs).

Finally, I revisit our permissive working definition of paralogy (homology without annotated orthology). Paralogs under this definition may well be previously unrecognized orthologs, and gene pairs with complementation relationships may be enriched in such cases. However, for the practical purpose of identifying pathogenic variants, the distinction between paralogy and cryptic orthology is largely irrelevant. Here, all the human-yeast orthologous pairs were collected from the InParanoid database. The InParanoid program uses pairwise similarity scores to construct orthology groups, initially finding two-way best hits between two proteomes as seed orthologs. More sequences (called “inparalogs”) are added to the group if there are sequences in the two proteomes that are closer to the corresponding seed ortholog than to any sequence in the other proteome. It is therefore possible that some human-yeast paralogous pairs in my study are orthologs that have not yet been annotated as such. In either case, complementation relationships between human genes and their homologs in other species beyond S. cerevisiae provide substantial further opportunities to study the functional properties of human disease-associated variants.


Chapter 3 Protein domain-level landscape of cancer-type-specific somatic mutations

The work outlined in this chapter appears in the paper “Domain-level analysis of type-specific somatic mutations in human cancers”, published in PLoS Computational Biology.

I performed all the analyses introduced in this chapter.


Protein domain-level landscape of cancer-type-specific somatic mutations

Extensive tumor genome sequencing has provided raw material to understand mutational processes and identify cancer-associated somatic variants. However, to understand how genomic variants contribute to cancer initiation and progression, several fundamental problems remain to be solved: i) separating ‘driver’ mutations (which give a selective advantage to a cancer cell) from ‘passenger’ mutations (which give no fitness advantage to a cancer cell), ii) further understanding the functional mechanisms and consequences of driver mutations, and iii) identifying the cancer types in which each driver mutation is relevant.

To address these questions, I analyzed whole-genome and whole-exome tumor sequencing data from the perspective of protein domains, the basic structural and functional units of proteins. Exploring the cancer-type-specific landscape of domain mutations across 21 cancer types, I identified both cancer-type-specific mutated domains and mutational hotspots.

Frequently-mutated domains were identified for oncoproteins for which the ‘mutational hotspot’ phenomenon owing to the relative rarity of gain-of-function mutations is well known, and also for tumor suppressor proteins, for which more uniformly distributed loss-of-function driver mutations are expected. A given gene product may be perturbed differently in different cancer types. Indeed, I observed systematic shifts between cancer types of the positions at which mutations occur within a given protein. Both known and novel candidate driver mutations were retrieved. Novel cancer gene candidates significantly overlapped with orthogonal systematic cancer screen hits, supporting the power of this approach to identify cancer genes.


3.1 Introduction

Cancer is caused in large part by the accumulation of mutations in oncogenes and tumor suppressor genes. Previous analyses of well-studied cancers, such as colorectal cancer and retinoblastoma, have suggested that as few as three mutations are sufficient for cancer initiation (Knudson 1971, Luebeck and Moolgavkar 2002, Beerenwinkel, Antal et al. 2007). Thousands of cancer genomes have now been sequenced (Greenman, Stephens et al. 2007, Wood, Parsons et al. 2007), including efforts from The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) (International Cancer Genome, Hudson et al. 2010, Mitra, Carvunis et al. 2013). In recent years, the genetic landscape of mutations has been revealed in several well-studied cancers (Wood, Parsons et al. 2007, Stephens, McBride et al. 2009, Parsons, Li et al. 2011, Stransky, Egloff et al. 2011, Grasso, Wu et al. 2012, Hodis, Watson et al. 2012, Stephens, Tarpey et al. 2012, Vogelstein, Papadopoulos et al. 2013, Watson, Takahashi et al. 2013). However, the process of extracting useful knowledge from this vast sequence resource has only begun.

The complexity of cancer genomes represents a challenge to our basic understanding of the disease and therefore also to therapy. Individual cancers can contain thousands of somatic mutations (Beroukhim, Mermel et al. 2010, Pleasance, Cheetham et al. 2010), only a small fraction of which are likely to be driver mutations contributing to tumor initiation or progression (Haeno, Iwasa et al. 2007, Boyko, Williamson et al. 2008, Beroukhim, Mermel et al. 2010, Pleasance, Stephens et al. 2010, McFarland, Korolev et al. 2013). Even genes that are well known to cause cancer contain many effectively-neutral passenger mutations (Stratton 2011). For example, it was reported that 80% of non-synonymous single-base substitutions observed in genes encoding protein kinases are passenger mutations (Greenman, Stephens et al. 2007). Most candidate driver gene identification has been done on the basis of observing mutations in a large fraction of tumor samples. However, the list of putative driver mutations generated includes many that are implausible, with a false positive rate that increases with the number of sequenced tumor samples (Bignell, Greenman et al. 2010, Lawrence, Stojanov et al. 2013).

Determining the effect of mutations on the structure and function of proteins remains challenging (Sjoblom, Jones et al. 2006). Previous gene-based studies have generally focused on the whole gene or whole protein, but mutations in different protein domains, structural units that often have distinct functions, may have different functional consequences. Thus, gene-level analysis can identify genes that contribute to multiple cancers, but does not map mutations to structural elements. Recently, computational structural studies have explored mutational effects on specific regions of a protein (e.g., the binding site) (Wan, Garnett et al. 2004, Joerger and Fersht 2007, Dixit, Yi et al. 2009, Dixit and Verkhivker 2014). For example, Joerger and Fersht showed that certain mutations in the p53 protein can determine folding state and affinity of p53 for specific target DNA elements. Also, different p53 mutations affect different protein–protein interaction interfaces dictating either tetramerization of p53 or its interaction with a multitude of other regulatory proteins (Joerger and Fersht 2007). Similarly, the effects of mutations in different protein kinase sub-domains have been shown to have different functional impacts (Dixit and Verkhivker 2014). Thus, within a multi-functional gene, different mutations can affect different functions. The structural details of individual mutants can provide the basis for the design of cancer therapeutics. Indeed, a given gene can have different functional roles in different cancers, reflected in shifts in the mutational distribution of different cancers. Recently, Nehrt et al. examined 100 colon cancer and 522 breast cancer samples to identify specific domain types with heightened mutation rates, succeeding even within genes that have generally lower mutation rates in colon or breast cancer (Nehrt, Peterson et al. 2012, Studer, Dessailly et al. 2013). Mutations occurring within a particular domain are more likely to share structural and functional effects (Vogelstein, Papadopoulos et al. 2013). Two mutations within a given gene may be associated with different human diseases, e.g., potentially by disrupting different protein interactions (Zhong, Simonis et al. 2009, Sahni, Yi et al. 2013). Thus, studies that consider mutational positions (e.g., relative to known domains) could be beneficial in elucidating functional effects of mutations.

In this study, a “domain instance” refers to a particular protein domain encoded within a particular gene and a “domain type” refers to a Pfam domain ‘pattern’ that may correspond to different domain instances encoded by different genes. In other words, a domain instance refers to a specific amino acid subsequence within a given protein that matches a given domain type.

To better distinguish this study from previous related studies, such as the domain landscape in colon and breast cancer by Nehrt et al., I note that I systematically analyzed multiple (twenty-one) cancer types. Like Nehrt et al., I analyzed each Pfam domain type. In addition, however, I specifically analyzed each Pfam domain instance. Rather than simply seeking domains with a mutational density that is enriched relative to other genomic regions, I further required that this enrichment be greater in one cancer type than in all other cancer types. This has the advantage of pointing us to interesting differences between cancer types, while also implicitly controlling for region-specific differences in background mutation rate. Thus, I analyzed the ‘domain-centric mutational landscape’ by examining the domain-level distribution of missense somatic mutations across multiple cancer types.

By mapping missense somatic mutations to domain instances for 21 cancer types (Table 3-1), I detected 100 cancer-type-specific significantly-mutated domain instances (SMDs) among different cancers. Further examination of these 100 domain instances showed that the proportion of within-domain mutations corresponding to hotspot positions can distinguish oncoproteins from tumor suppressor proteins. I also found that the vast majority of within-domain mutational hotspots shared by multiple cancer types occurred at functional sites. Thus, domain mutational landscape information can be used to prioritize candidate cancer-causing mutations and to elucidate their cancer-type-dependent functional effects.

Table 3-1 Prevalence of predicted damaging mutations in domain instances among cancer types

Cancer Type                            Mutation Counts  Gene Counts  Domain Families
Neuroblastoma                          166              154          132
Chondrosarcoma                         68               49           47
Breast cancer                          4026             2568         1293
Glioblastoma and medulloblastoma       1423             911          559
Cervical cancer                        707              570          410
Endometrial carcinoma                  12183            6105         2451
Lymphoma and leukemia                  1443             692          468
Renal cell carcinoma                   4774             3241         1591
Colorectal cancer                      27132            9706         3426
Liver cancer                           485              309          235
Lung cancer                            11654            5899         2401
Meningioma                             89               66           62
Esophageal adenocarcinoma              935              568          366
Ovarian cancer                         4182             3035         1395
Pancreatic cancer                      703              534          362
Prostate cancer                        1917             1409         785
Adenoid cystic carcinomas              165              126          107
Melanoma                               1824             1352         758
Striated muscle                        16               13           12
Head and neck squamous cell carcinoma  582              445          307
Bladder cancer                         2996             1289         804


3.2 Methods

To perform the study, I first assembled a dataset of somatic missense tumor mutations. Then, from this dataset I derived a dataset of potentially damaging missense somatic mutations. I analyzed cancer-type-specific SMDs and cancer-type-specific significantly-mutated position-based mutational hotspots. Finally, I analyzed the structural properties of those mutational hotspots.

3.2.1 Creating the cancer missense somatic mutation dataset

I assembled a total of 237,716 missense somatic mutations in 21 cancer types (Table 3-1, Table 3-2) from 71 whole-genome (WGS) or whole-exome sequencing (WES) studies (Greenman, Stephens et al. 2007, Cancer Genome Atlas Research 2008, Jones, Wang et al. 2010, Agrawal, Frederick et al. 2011, Bass, Lawrence et al. 2011, Berger, Lawrence et al. 2011, Cancer Genome Atlas Research 2011, Durinck, Ho et al. 2011, Galante, Parmigiani et al. 2011, Jiao, Shi et al. 2011, Li, Zhao et al. 2011, Morin, Mendez-Lago et al. 2011, Totoki, Tatsuno et al. 2011, Yan, Xu et al. 2011, Agrawal, Jiao et al. 2012, Barbieri, Baca et al. 2012, Cancer Genome Atlas 2012, Ding, Ley et al. 2012, Duns, Hofstra et al. 2012, Gerlinger, Rowan et al. 2012, Grasso, Wu et al. 2012, Guichard, Amaddeo et al. 2012, Guo, Gui et al. 2012, Hodis, Watson et al. 2012, Huang, Deng et al. 2012, Imielinski, Berger et al. 2012, Iyer, Hanrahan et al. 2012, Jones, Jager et al. 2012, Kannan, Inagaki et al. 2012, Krauthammer, Kong et al. 2012, Le Gallo, O'Hara et al. 2012, Lee, Stewart et al. 2012, Liu, Lee et al. 2012, Molenaar, Koster et al. 2012, Nik-Zainal, Alexandrov et al. 2012, Pena-Llopis, Vega-Rubin-de-Celis et al. 2012, Pugh, Weeraratne et al. 2012, Robinson, Parker et al. 2012, Rudin, Durinck et al. 2012, Seo, Ju et al. 2012, Seshagiri, Stawiski et al. 2012, Shah, Roth et al. 2012, Stephens, Tarpey et al. 2012, Wang, Tsutsumi et al. 2012, Alexandrov, Nik-Zainal et al. 2013, Bettegowda, Agrawal et al. 2013, Cancer Genome Atlas Research, Kandoth et al. 2013, Ciriello, Miller et al. 2013, Clark, Erson-Omay et al. 2013, Dulak, Stojanov et al. 2013, Green, Gentles et al. 2013, Ho, Kannan et al. 2013, Kim, Jung et al. 2013, Leich, Weissbach et al. 2013, Lindberg, Mills et al. 2013, Roberts, Lawrence et al. 2013, Sausen, Leary et al. 2013, Tarpey, Behjati et al. 2013, Watson, Takahashi et al. 2013, Yost, Pastorino et al. 2013, Zhang, Grubor et al. 2013, Zhou, Yang et al. 
2013) included in the COSMIC (Catalogue of Somatic Mutations in Cancer) database (version 67) (Bamford, Dawson et al. 2004, Forbes, Bhamra et al. 2008, Forbes, Bindal et al. 2011). Most of those studies were conducted by either the ICGC (Hudson, Anderson et al. 2010) or TCGA project (Chin, Hahn et al. 2011). The mutations fell within a total of 18,682 genes, corresponding to 22,367 different protein isoforms. Amino acid sequences corresponding to the mutated protein isoforms were also available from the COSMIC database.

I used all the protein sequences corresponding to those cancer-associated genes to search against Pfam domain types from the Pfam protein domain family database (version 27) (Finn, Mistry et al. 2010), using an E-value cutoff of 0.001 (Finn, Mistry et al. 2006). A total of 11,633 unique Pfam domain types, encoded by 18,682 mutated genes, were obtained from the Pfam database, considering all transcripts of these genes. Then I mapped the missense somatic mutations to protein domain positions after multiple sequence alignments using HMMER (v3.1b1) (Eddy 1998). Where a given mutation could be assigned to multiple overlapping domain instances, I mapped the mutation to all of them (Figure 3-1). Significance of enrichment was calculated separately for each domain instance, so that the results for any given domain instance did not depend on the presence of other overlapping domain instances. I note that the vast majority of all mutations mapped only to a single domain instance (only 6 mutations could be mapped to more than one domain instance). Finally, among the 11,633 protein domain types, I found 6950 unique Pfam domain types that have at least one missense somatic mutation detected in the studies, all of which had an E-value < 0.0001 (10-fold more stringent than the Pfam-recommended threshold). These 6950 Pfam domain types corresponded to 29,302 unique domain instances.

Figure 3-1 Mapping Mutations Detected from Different Cancers to Domain Instances. Rectangles represent protein domain instances in a given gene. Colored dots represent mutations detected in different cancer types.
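The mapping rule described above, in which a mutation is assigned to every domain instance that overlaps its position, can be sketched in Python as follows. This is a minimal illustration and not the pipeline used in this study: the `DomainInstance` record, the gene name, the domain labels and the coordinates are all hypothetical, and the real analysis derived domain boundaries from HMMER alignments against Pfam.

```python
from typing import List, NamedTuple


class DomainInstance(NamedTuple):
    """A domain type located at specific residues of one gene product."""
    gene: str
    domain_type: str
    start: int  # first residue of the instance (1-based, inclusive)
    end: int    # last residue of the instance (1-based, inclusive)


def map_mutation(gene: str, position: int,
                 instances: List[DomainInstance]) -> List[DomainInstance]:
    """Return every domain instance of `gene` that covers `position`.

    Overlapping instances are all returned, so one mutation may be
    counted once for each instance it falls in.
    """
    return [d for d in instances
            if d.gene == gene and d.start <= position <= d.end]


# Hypothetical domain annotation for an illustrative gene.
instances = [
    DomainInstance("GENE_A", "DomainX", 10, 120),
    DomainInstance("GENE_A", "DomainY", 100, 250),
    DomainInstance("GENE_B", "DomainX", 5, 90),
]

for position in (50, 110, 300):
    hits = map_mutation("GENE_A", position, instances)
    print(position, [d.domain_type for d in hits])
```

Residue 110 lies in the overlap of the two hypothetical GENE_A instances and is therefore assigned to both, whereas residue 300 lies outside any instance and is assigned to none.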


Table 3-2 Patient tissue samples from selected cancer genome studies across 21 cancer types.

Primary Site                Cancer Type                            Sample Counts  References
Autonomic ganglia           Neuroblastoma                          134            (Molenaar, Koster et al. 2012, Sausen, Leary et al. 2013)
Bone                        Chondrosarcoma                         66             (Tarpey, Behjati et al. 2013)
Breast                      Breast cancer                          978            (Galante, Parmigiani et al. 2011, Nik-Zainal, Alexandrov et al. 2012, Shah, Roth et al. 2012, Stephens, Tarpey et al. 2012)
Central nervous system      Glioblastoma and medulloblastoma       525            (Cancer Genome Atlas Research 2008, Jones, Jager et al. 2012, Kannan, Inagaki et al. 2012, Lee, Stewart et al. 2012, Pugh, Weeraratne et al. 2012, Robinson, Parker et al. 2012, Bettegowda, Agrawal et al. 2013, Yost, Pastorino et al. 2013)
Cervix                      Cervical cancer                        14             (Roberts, Lawrence et al. 2013)
Endometrium                 Endometrial carcinoma                  261            (Le Gallo, O'Hara et al. 2012, Cancer Genome Atlas Research, Kandoth et al. 2013)
Hematopoietic and lymph     Lymphoma and leukemia                  415            (Morin, Mendez-Lago et al. 2011, Yan, Xu et al. 2011, Ding, Ley et al. 2012, Green, Gentles et al. 2013, Leich, Weissbach et al. 2013, Zhang, Grubor et al. 2013)
Kidney                      Renal cell carcinoma                   594            (Duns, Hofstra et al. 2012, Gerlinger, Rowan et al. 2012, Guo, Gui et al. 2012, Pena-Llopis, Vega-Rubin-de-Celis et al. 2012, Dulak, Stojanov et al. 2013)
Colon                       Colorectal cancer                      762            (Bass, Lawrence et al. 2011, Cancer Genome Atlas 2012, Seshagiri, Stawiski et al. 2012, Zhou, Yang et al. 2013)
Liver                       Liver cancer                           531            (Li, Zhao et al. 2011, Totoki, Tatsuno et al. 2011, Guichard, Amaddeo et al. 2012, Huang, Deng et al. 2012)
Lung                        Lung cancer                            825            (Imielinski, Berger et al. 2012, Liu, Lee et al. 2012, Rudin, Durinck et al. 2012, Seo, Ju et al. 2012, Alexandrov, Nik-Zainal et al. 2013, Kim, Jung et al. 2013)
Meninges                    Meningioma                             39             (Clark, Erson-Omay et al. 2013)
Esophagus                   Esophageal adenocarcinoma              242            (Agrawal, Jiao et al. 2012, Dulak, Stojanov et al. 2013)
Ovary                       Ovarian cancer                         637            (Jones, Wang et al. 2010, Cancer Genome Atlas Research 2011)
Pancreas                    Pancreatic cancer                      202            (Jiao, Shi et al. 2011, Wang, Tsutsumi et al. 2012)
Prostate                    Prostate cancer                        423            (Berger, Lawrence et al. 2011, Barbieri, Baca et al. 2012, Grasso, Wu et al. 2012, Lindberg, Mills et al. 2013)
Salivary gland              Adenoid cystic carcinomas              60             (Ho, Kannan et al. 2013)
Skin                        Melanoma                               133            (Hodis, Watson et al. 2012, Krauthammer, Kong et al. 2012)
Soft tissue                 Striated muscle                        13             (Lee, Stewart et al. 2012)
Upper aero-digestive tract  Head and neck squamous cell carcinoma  203            (Agrawal, Frederick et al. 2011, Durinck, Ho et al. 2011)
Urinary tract               Bladder cancer                         203            (Iyer, Hanrahan et al. 2012)

3.2.2 Creating the dataset of potentially damaging missense somatic mutations

To predict potentially damaging mutations, I used the IntOGen–mutation platform (Gundem, Perez-Llamas et al. 2010, Gonzalez-Perez, Perez-Llamas et al. 2013), which classified the 237,716 missense somatic mutations into five categories: high, medium, low, unknown and none. I excluded mutations predicted by IntOGen to have no or unknown functional effects from further analyses. This left 76,158 mutations as potential driver mutations. Those mutations were distributed in 4,509 unique domain types, corresponding to 14,083 genes (Table 3-1).

3.2.3 Cancer-type-specific significantly-mutated domain instance analyses

To avoid the possible bias caused by different domain instance lengths and imbalanced sequencing frequency across cancer types, I calculated the cancer-type-specific mutation density as the total number of somatic missense mutations falling in the domain-encoding region of each gene, normalized by the corresponding cumulative domain instance length. I used Fisher's Exact test to determine whether a certain domain instance is significantly mutated in a given cancer, using the “stats” package in R (http://www.r-project.org). The mutation counts for the R function corresponded to a 2 × 2 contingency table based on whether or not the mutations detected from each cancer type fell within a given domain instance. Here, the null hypothesis is that the probability that a mutation detected in a given cancer falls within a given domain instance is equal to the probability that it falls within other domains. I chose a P-value threshold (α = 10^-7) yielding a false discovery rate (FDR) of less than 0.05. I made a heat map representation of the hierarchical clustering of SMDs in different cancers using the “heatmap.2” function (from the “gplots” R package) based on the –log10 (P-value) of each cancer-type-specific domain instance. I analyzed the tendency of SMDs to co-occur in the same patient sample using Fisher’s Exact test (“stats” package in R). Also, genes containing one or more SMDs were regarded as candidate cancer genes in this study. Overlap between my candidate gene set and the Cancer Census and Sleeping Beauty gene sets was also analyzed using Fisher’s Exact test (“stats” package in R).
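To make the test concrete, the following Python sketch computes a length-normalized mutation density and a two-sided Fisher's exact P-value for one 2 × 2 table. It is a standard-library illustration, not the R code used in this study. The table layout (this cancer versus all other cancers, inside versus outside the domain instance) is one plausible reading of the description above, the two-sided alternative is assumed because it is the default of R's `fisher.test`, and every count is hypothetical.

```python
from math import exp, lgamma


def log_choose(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact P-value for the 2 x 2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def log_p(x: int) -> float:
        return (log_choose(row1, x) + log_choose(row2, col1 - x)
                - log_choose(total, col1))

    observed = log_p(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_value = sum(exp(log_p(x)) for x in range(lowest, highest + 1)
                  if log_p(x) <= observed + 1e-7)  # tolerance for ties
    return min(1.0, p_value)


def mutation_density(mutation_count: int, domain_length: int) -> float:
    """Mutations per residue of a domain instance."""
    return mutation_count / domain_length


# Hypothetical counts for one domain instance and one cancer type.
in_domain_this_cancer = 40
elsewhere_this_cancer = 3960
in_domain_other_cancers = 10
elsewhere_other_cancers = 71990

p_value = fisher_exact_two_sided(in_domain_this_cancer, elsewhere_this_cancer,
                                 in_domain_other_cancers, elsewhere_other_cancers)
density = mutation_density(in_domain_this_cancer, 209)  # hypothetical length
print(f"density = {density:.3f} mutations per residue")
print(f"P-value = {p_value:.3g}; significantly mutated: {p_value < 1e-7}")
```

Log-gamma arithmetic is used so that the binomial coefficients stay in floating-point range for tables with tens of thousands of mutations.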

3.2.4 Cancer-type-specific significantly-mutated position-based mutational hotspot analyses

I identified mutational hotspots within each domain instance encoded by a single gene using Fisher’s Exact test with a P-value cutoff of 0.01 (FDR < 0.05). False discovery rate analysis was performed using the Benjamini–Hochberg procedure (Benjamini and Hochberg 1995). I used the Mann–Whitney U-test to evaluate the significance of the difference in distribution patterns of mutated residues between oncoproteins and tumor suppressor proteins. All of these analyses were conducted using the “stats” package in R.
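For readers unfamiliar with the Benjamini and Hochberg adjustment, the short Python sketch below shows the step-up calculation on a handful of P-values. It is a generic illustration of the published procedure rather than the R code used here, and the input P-values are hypothetical.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P-values in the input order.

    Each P-value is scaled by (number of tests / rank), and a running
    minimum taken from the largest P-value downward keeps the adjusted
    values monotone.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical per-position P-values within one domain instance.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
hotspots = [i for i, q in enumerate(adjusted) if q < 0.05]
print(adjusted)
print(hotspots)
```

A position would be reported as a hotspot candidate when its adjusted value falls below the chosen FDR level, here 0.05.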

3.2.5 Structural properties for position based mutational hotspots analyses

I downloaded known protein-structure files from the Protein Data Bank (Berman, Henrick et al. 2003). For proteins that had more than one structure file, I chose one, favoring those with larger sequence length and higher crystallographic resolution. For domain-domain interface analysis, mutational hotspots were first mapped onto the available structures using the PyMOL software (http://www.pymol.org; (Schrodinger 2010)). The interfacial residues of different domains in different chains were analyzed using the Mechismo (http://mechismo.russelllab.org/), ProtInDB (PROTein-protein INterface residues Data Base; Rafael A Jordan, Feihong Wu, Drena Dobbs and Vasant Honavar, unpublished results), and PDBePISA (Proteins, Interfaces, Surfaces and Assemblies; (Krissinel and Henrick 2007)) servers. I retrieved the functional-site information for those mutational hotspots from the Catalytic Site Atlas (Porter, Bartlett et al. 2004) and the PhosphoSite (Hornbeck, Chabra et al. 2004) databases. I used the odds ratio and Fisher’s Exact test to calculate the tendency of mutational hotspots in oncoproteins to occur at ATP/GTP binding sites or enzyme-active sites, as compared with mutational hotspots in tumor suppressor proteins.


3.3 Results

3.3.1 Mutation distribution at the domain level analysis

Protein domains are generally regarded as the conserved structural and functional units of proteins. I therefore focused on the 237,716 missense somatic mutations, across 21 different human tissues, that fell within protein domain instances. I further focused on the subset of 76,158 mutations that were predicted to compromise the function of the harboring protein, using the IntOGen–mutation platform (Gonzalez-Perez, Perez-Llamas et al. 2013) (Table 3-1, Table 3-3). To avoid observational biases, the above-mentioned mutations were derived only from genome-scale (either whole-genome or exome) sequencing studies.

The prevalence of missense somatic mutations can vary from cancer to cancer at the domain level (Jackson and Loeb 1998, Dolle, Snyder et al. 2000). I found that most domain instances had a mutational density of only one or two missense somatic mutations per megabase in the corresponding DNA sequence (Figure 3-2). For example, a mutated domain of length 209 residues (the average domain instance length) contains an average of one single-amino-acid-changing mutation for every 71 patients. Although domains with high mutation rates can be seen for many cancers (Figure 3-2), these mutation rates can be misleading. Given the heterogeneity of mutation rates across the genome and the differences in overall mutation rate between cancers, the domain instances with the highest mutation density in a given cancer may not be the true drivers of cancer progression (Lawrence, Stojanov et al. 2013).

To control for both positional and cancer-type-specific differences in mutation rate, I sought domain instances that were highly mutated relative both to the same domain instance in other cancer types and to other domain instances within the same cancer type. I identified ~100 cancer-type-specific significantly mutated domain instances (SMDs) in 21 cancer types (Table 3-3, P-value = 10⁻⁷, Fisher’s Exact test, False Discovery Rate (FDR) < 0.05, see Chapter 3.2.3). The number of cancer-type-specific SMDs in each of the 21 cancer types is listed in Table 3-4, and in Table 3-5 in greater detail. With only two exceptions, the smallest number of mutations observed for a domain instance that was declared to be significantly mutated was 6. The exceptions were the Collagen domain instance (with only 2 mutations) within the COLEC11 gene product in soft tissue cancers, for which only 14 samples were available; and the CCDC14 domain instance (with 3 mutations) encoded by CCDC14 in cancers of the salivary gland, for which only 60 samples were available (Appendix B). As more samples of these two cancer types are sequenced in the future, the mutation density in these two protein domain instances may increase.
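The two-way comparison described above can be illustrated with a minimal sketch of a one-sided Fisher’s Exact test on a 2×2 table. The table layout (focal cancer type versus all other cancer types, focal domain instance versus all other domain instances), the function name and the counts are assumptions made for illustration only; the procedure actually used is the one defined in Chapter 3.2.3.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's Exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with fixed margins,
    i.e. the chance of seeing at least `a` counts in the top-left cell.
    """
    row1 = a + b          # e.g. mutations observed in the focal cancer type
    col1 = a + c          # e.g. mutations falling in the focal domain instance
    total = a + b + c + d
    upper = min(row1, col1)
    tail = sum(
        comb(col1, k) * comb(total - col1, row1 - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(total, row1)


# Hypothetical counts, for illustration only (not taken from the thesis data):
#                       focal domain   other domains
# focal cancer type          12             988
# other cancer types          8            8992
p_value = fisher_exact_greater(12, 988, 8, 8992)
print(f"one-sided P-value = {p_value:.3g}")
```

In a screen of this kind, one such test would be run per domain instance and cancer type, and the resulting P-values would then be corrected for multiple testing before applying the FDR threshold.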

Table 3-3 Cancer-type-specific significantly mutated domain instances and corresponding genes in different cancers. For each cancer type this table lists the significantly mutated domain instances (SMDs), corresponding gene symbols, and number of mutations in each domain instance.

Primary Site | Gene Symbol | Domain Name | # of Mutations Falling in SMDIs
Autonomic ganglia | ALK | Pkinase_Tyr | 7
Bone | DGKZ | DAGK_acc | 13
Bone | ERBB3 | GF_recep_IV | 13
Bone | NTNG1 | Laminin_N | 13
Bone | PRKACB | Pkinase | 13
Bone | BLNK | SH2 | 13
Breast | MT-CYB | Cytochrom_B_C | 368
Breast | TP53 | P53 | 368
Breast | AKT1 | PH | 368
Breast | MLL3 | zf-HC5HC2H | 368
Central nervous system | PCDH11X | Cadherin | 228
Central nervous system | PCDH11Y | Cadherin | 228
Central nervous system | DDX3X | DEAD | 228
Central nervous system | PSTPIP2 | FCH | 228
Central nervous system | FKBP9 | FKBP_C | 228
Central nervous system | EGFR | Furin-like | 228
Central nervous system | EGFR | GF_recep_IV | 228
Central nervous system | DDX3X | Helicase_C | 228
Central nervous system | TP53 | P53 | 228
Central nervous system | FGFR1 | Pkinase_Tyr | 228
Central nervous system | SMARCA4 | SNF2_N | 228
Cervix | CAGE1 | CAGE1 | 15
Cervix | SLC16A5 | MFS_1 | 15
Cervix | MT1G | Metallothio | 15
Cervix | MAST1 | PDZ | 15
Endometrium | ACSM4 | AMP-binding | 313
Endometrium | POLE | DNA_pol_B_exo1 | 313
Endometrium | PTEN | DSPc | 313
Endometrium | CYFIP2 | FragX_IP | 313
Endometrium | PPP2R1A | HEAT_2 | 313
Endometrium | KIF26B | Kinesin | 313
Endometrium | MBD1 | MBD | 313
Endometrium | MT-ND1 | NADHdh | 313
Endometrium | PIK3CA | PI3K_p85B | 313
Endometrium | PIK3CA | PI3Ka | 313
Endometrium | FGFR2 | Pkinase_Tyr | 313
Hematopoietic & lymph | P2RY8 | 7tm_1 | 373
Hematopoietic & lymph | BCL2 | BH4 | 373
Hematopoietic & lymph | BTG1 | BTG | 373
Hematopoietic & lymph | BCL2 | Bcl-2 | 373
Hematopoietic & lymph | DMBT1 | CUB | 373
Hematopoietic & lymph | EPHA6 | EphA2_TM | 373
Hematopoietic & lymph | KRT6A | Filament | 373
Hematopoietic & lymph | FOXO1 | Fork_head | 373
Hematopoietic & lymph | GNA13 | G-alpha | 373
Hematopoietic & lymph | IRF4 | IRF | 373
Hematopoietic & lymph | CD79B | ITAM | 373
Hematopoietic & lymph | CNTN4 | Ig_2 | 373
Hematopoietic & lymph | ITPR3 | Ins145_P3_rec | 373
Hematopoietic & lymph | CREBBP | KAT11 | 373
Hematopoietic & lymph | HIST1H1E | Linker_histone | 373
Hematopoietic & lymph | CD74 | MHC2-interact | 373
Hematopoietic & lymph | MYC | Myc_N | 373
Hematopoietic & lymph | NPIPL2 | NPIP | 373
Hematopoietic & lymph | TP53 | P53 | 373
Hematopoietic & lymph | MAGI1 | PDZ | 373
Hematopoietic & lymph | CDK11A | Pkinase | 373
Hematopoietic & lymph | PIM1 | Pkinase | 373
Hematopoietic & lymph | PRKCB | Pkinase | 373
Hematopoietic & lymph | HCK | Pkinase_Tyr | 373
Hematopoietic & lymph | MYD88 | TIR | 373
Hematopoietic & lymph | ANTXR1 | VWA | 373
Hematopoietic & lymph | CTNNA3 | Vinculin | 373
Kidney | PBRM1 | Bromodomain | 99
Kidney | MT-ND6 | Oxidored_q3 | 99
Kidney | SETD2 | SET | 99
Kidney | VHL | VHL | 99
Kidney | PTPRQ | fn3 | 99
Colon | FAT4 | Cadherin | 351
Colon | SMAD4 | MH2 | 351
Colon | FBXW7 | WD40 | 351
Liver | TP53 | P53 | 108
Liver | TP53 | P53_tetramer | 108
Liver | USP25 | UCH | 108
Lung | KEAP1 | Kelch_1 | 921
Lung | TP53 | P53 | 921
Lung | EGFR | Pkinase_Tyr | 921
Lung | KRAS | Ras | 921
Lung | CD6 | SRCR | 921
Lung | CFHR4 | Sushi | 921
Lung | SNPH | Syntaphilin | 921
Lung | PRPF19 | WD40 | 921
Lung | U2AF1 | zf-CCCH | 921
Meninges | RBP1 | Lipocalin | 11
Meninges | TRAF7 | WD40 | 11
Meninges | KLF4 | zf-H2C2_2 | 11
Oesophagus | SMAD4 | MH2 | 187
Oesophagus | TP53 | P53 | 187
Oesophagus | SMARCA4 | SNF2_N | 187
Oesophagus | ODZ2 | Ten_N | 187
Ovary | TP53 | P53 | 225
Pancreas | KRAS | Ras | 32
Prostate | CCNF | Cyclin_C | 57
Prostate | PLCH2 | EF-hand_7 | 57
Prostate | LPHN3 | GPS | 57
Prostate | NKX3-1 | Homeobox | 57
Prostate | SPOP | MATH | 57
Prostate | CNOT3 | Not3 | 57
Prostate | PTPRD | Y_phosphatase | 57
Prostate | MLL3 | zf-HC5HC2H | 57
Salivary gland | CCDC14 | CCDC14 | 3
Skin | CDKN2A | Ank_2 | 125
Skin | BRAF | Pkinase_Tyr | 125
Skin | NRAS | Ras | 125
Skin | RAC1 | Ras | 125
Skin | PRG4 | Somatomedin_B | 125
Soft tissue | COLEC11 | Collagen | 2
Upper aero-digestive tract | NOTCH1 | EGF_CA | 88
Upper aero-digestive tract | TP53 | P53 | 88
Upper aero-digestive tract | HRAS | Ras | 88
Urinary tract | SCYL3 | Pkinase | 6

SMDIs: significantly mutant domain instances; SMDs: significantly mutant domains

Figure 3-2 Mutation Densities for Domain Instances across Cancers. Box plots display mutation densities for mutated domain instances in different cancers. Outliers are shown as dots. Only mutations predicted by IntOGen to be damaging were used for this analysis.


Table 3-4 Number of significantly mutated domain instances and corresponding genes in each cancer type.

Cancer Type | Significantly Mutated Domain Instance Counts | Related Gene Counts
Neuroblastoma | 1 | 1
Chondrosarcoma | 5 | 5
Breast cancer | 4 | 4
Glioblastoma and medulloblastoma | 11 | 9
Cervical cancer | 4 | 4
Colorectal cancer | 3 | 3
Endometrial carcinoma | 11 | 10
Lymphoma and leukemia | 27 | 26
Renal cell carcinoma | 5 | 5
Liver cancer | 3 | 2
Lung cancer | 9 | 9
Meningioma | 3 | 3
Esophageal adenocarcinoma | 4 | 4
Ovarian cancer | 1 | 1
Pancreatic cancer | 1 | 1
Prostate cancer | 8 | 8
Adenoid cystic carcinomas | 1 | 1
Melanoma | 5 | 5
Striated muscle | 1 | 1
Head and neck squamous cell carcinoma | 3 | 3
Bladder cancer | 1 | 1

Table 3-5 Number of cancer types in which each domain instance was significantly mutated.

Gene Symbol | Pfam ID | Number of Significantly Mutated Domain Instances per Gene | Domain Instances per Gene | Primary Site
ACSM4 | PF00501.23 | 1 | 2 | Endometrium
AKT1 | PF00169.24 | 1 | 3 | Breast
ALK | PF07714.12 | 1 | 3 | Autonomic ganglia
ANTXR1 | PF00092.23 | 1 | 3 | Hematopoietic & lymph
BCL2 | PF00452.14 | 2 | 2 | Hematopoietic & lymph
BCL2 | PF02180.12 | 2 | 2 | Hematopoietic & lymph
BLNK | PF00017.19 | 1 | 1 | Bone
BRAF | PF07714.12 | 1 | 3 | Skin
BTG1 | PF07742.7 | 1 | 1 | Hematopoietic & lymph
CAGE1 | PF15066.1 | 1 | 1 | Cervix
CCDC14 | PF15254.1 | 1 | 1 | Salivary gland
CCNF | PF02984.14 | 1 | 3 | Prostate
CD6 | PF00530.13 | 1 | 1 | Lung
CD74 | PF09307.5 | 1 | 2 | Hematopoietic & lymph
CD79B | PF02189.10 | 1 | 1 | Hematopoietic & lymph
CDK11A | PF00069.20 | 1 | 1 | Hematopoietic & lymph
CDKN2A | PF12796.2 | 1 | 2 | Skin
CFHR4 | PF00084.15 | 1 | 1 | Lung
CNOT3 | PF04065.10 | 1 | 2 | Prostate
CNTN4 | PF13895.1 | 1 | 3 | Hematopoietic & lymph
COLEC11 | PF01391.13 | 1 | 2 | Soft tissue
CREBBP | PF08214.6 | 1 | 6 | Hematopoietic & lymph
CTNNA3 | PF01044.14 | 1 | 1 | Hematopoietic & lymph
CYFIP2 | PF05994.6 | 1 | 1 | Endometrium
DDX3X | PF00270.24 | 2 | 2 | Central nervous system
DDX3X | PF00271.26 | 2 | 2 | Central nervous system
DGKZ | PF00609.14 | 1 | 4 | Bone
DMBT1 | PF00431.15 | 1 | 3 | Hematopoietic & lymph
EGFR | PF00757.15 | 3 | 4 | Central nervous system
EGFR | PF07714.12 | 3 | 4 | Lung
EGFR | PF14843.1 | 3 | 4 | Central nervous system
EPHA6 | PF14575.1 | 1 | 4 | Hematopoietic & lymph
ERBB3 | PF14843.1 | 1 | 4 | Bone
FAT4 | PF00028.12 | 1 | 3 | Colon
FBXW7 | PF00400.27 | 1 | 2 | Colon
FGFR1 | PF07714.12 | 1 | 2 | Central nervous system
FGFR2 | PF07714.12 | 1 | 3 | Endometrium
FKBP9 | PF00254.23 | 1 | 2 | Central nervous system
FOXO1 | PF00250.13 | 1 | 1 | Hematopoietic & lymph
GNA13 | PF00503.15 | 1 | 1 | Hematopoietic & lymph
HCK | PF07714.12 | 1 | 3 | Hematopoietic & lymph
HIST1H1E | PF00538.14 | 1 | 1 | Hematopoietic & lymph
HRAS | PF00071.17 | 1 | 1 | Upper aero-digestive tract
IRF4 | PF00605.12 | 1 | 2 | Hematopoietic & lymph
ITPR3 | PF08709.6 | 1 | 5 | Hematopoietic & lymph
KEAP1 | PF01344.20 | 1 | 3 | Lung
KIF26B | PF00225.18 | 1 | 1 | Endometrium
KLF4 | PF13465.1 | 1 | 1 | Meninges
KRAS | PF00071.17 | 1 | 1 | Lung
KRAS | PF00071.17 | 1 | 1 | Pancreas
KRT6A | PF00038.16 | 1 | 1 | Hematopoietic & lymph
LPHN3 | PF01825.16 | 1 | 6 | Prostate
MAGI1 | PF00595.19 | 1 | 1 | Hematopoietic & lymph
MAST1 | PF00595.19 | 1 | 3 | Cervix
MBD1 | PF01429.14 | 1 | 2 | Endometrium
MLL3 | PF13771.1 | 1 | 4 | Breast
MLL3 | PF13771.1 | 1 | 4 | Prostate
MT-CYB | PF00032.12 | 1 | 1 | Breast
MT-ND1 | PF00146.16 | 1 | 1 | Endometrium
MT-ND6 | PF00499.15 | 1 | 1 | Kidney
MT1G | PF00131.15 | 1 | 1 | Cervix
MYC | PF01056.13 | 1 | 2 | Hematopoietic & lymph
MYD88 | PF01582.15 | 1 | 1 | Hematopoietic & lymph
NKX3-1 | PF00046.24 | 1 | 1 | Prostate
NOTCH1 | PF07645.10 | 1 | 4 | Upper aero-digestive tract
NPIPL2 | PF06409.6 | 1 | 1 | Hematopoietic & lymph
NRAS | PF00071.17 | 1 | 1 | Skin
NTNG1 | PF00055.12 | 1 | 2 | Bone
ODZ2 | PF06484.7 | 1 | 2 | Oesophagus
P2RY8 | PF00001.16 | 1 | 1 | Hematopoietic & lymph
PBRM1 | PF00439.20 | 1 | 3 | Kidney
PCDH11X | PF00028.12 | 1 | 3 | Central nervous system
PCDH11Y | PF00028.12 | 1 | 2 | Central nervous system
PIK3CA | PF00613.15 | 2 | 4 | Endometrium
PIK3CA | PF02192.11 | 2 | 4 | Endometrium
PIM1 | PF00069.20 | 1 | 1 | Hematopoietic & lymph
PLCH2 | PF13499.1 | 1 | 1 | Prostate
POLE | PF03104.14 | 1 | 3 | Endometrium
PPP2R1A | PF13646.1 | 1 | 1 | Endometrium
PRG4 | PF01033.12 | 1 | 1 | Skin
PRKACB | PF00069.20 | 1 | 1 | Bone
PRKCB | PF00069.20 | 1 | 3 | Hematopoietic & lymph
PRPF19 | PF00400.27 | 1 | 2 | Lung
PSTPIP2 | PF00611.18 | 1 | 1 | Central nervous system
PTEN | PF00782.15 | 1 | 2 | Endometrium
PTPRD | PF00102.22 | 1 | 3 | Prostate
PTPRQ | PF00041.16 | 1 | 2 | Kidney
RAC1 | PF00071.17 | 1 | 1 | Skin
RBP1 | PF00061.18 | 1 | 1 | Meninges
SCYL3 | PF00069.20 | 1 | 1 | Urinary tract
SETD2 | PF00856.23 | 1 | 2 | Kidney
SLC16A5 | PF07690.11 | 1 | 1 | Cervix
SMAD4 | PF03166.9 | 1 | 2 | Colon
SMAD4 | PF03166.9 | 1 | 2 | Oesophagus
SMARCA4 | PF00176.18 | 1 | 5 | Central nervous system
SMARCA4 | PF00176.18 | 1 | 5 | Oesophagus
SNPH | PF15290.1 | 1 | 1 | Lung
SPOP | PF00917.21 | 1 | 2 | Prostate
TP53 | PF00870.13 | 2 | 3 | Breast
TP53 | PF00870.13 | 2 | 3 | Central nervous system
TP53 | PF00870.13 | 2 | 3 | Hematopoietic & lymph
TP53 | PF00870.13 | 2 | 3 | Liver
TP53 | PF00870.13 | 2 | 3 | Lung
TP53 | PF00870.13 | 2 | 3 | Oesophagus
TP53 | PF00870.13 | 2 | 3 | Ovary
TP53 | PF00870.13 | 2 | 3 | Upper aero-digestive tract
TP53 | PF07710.6 | 2 | 3 | Liver
TRAF7 | PF00400.27 | 1 | 1 | Meninges
U2AF1 | PF00642.19 | 1 | 2 | Lung
USP25 | PF00443.24 | 1 | 1 | Liver
VHL | PF01847.11 | 1 | 1 | Kidney

I found between 3 and 7 SMDs for each cancer type, except for endometrial cancer (with 11 SMDs) as well as hematopoietic and lymphatic cancer (with 27 SMDs). Of the 94 genes encoding at least one SMD, 40 (42%) had already been implicated in cancer according to the Sanger Cancer Gene Census (‘Cancer Census’) (Futreal, Coin et al. 2004, Santarius, Shipley et al. 2010), including well-established cancer-causing genes such as KRAS, EGFR and TP53. Enrichment for Cancer Census genes was both strong and significant (~12-fold enrichment; P-value = 5×10⁻³⁴, Fisher’s Exact test), and suggests that the remaining 54 genes that are not already known to be cancer drivers represent good candidates. For example, the Syntaphilin protein, encoded by SNPH, harbors the syntaphilin domain instance, which was significantly mutated in lung cancer. Syntaphilin-domain proteins are a family of eukaryotic proteins that can bind to syntaxin-1, thereby inhibiting SNARE complex formation by absorbing free syntaxin-1. Despite reports that it is brain-specific (Pruitt, Tatusova et al. 2009), SNPH is expressed in lung according to microarray (Su, Wiltshire et al. 2004, Wu, Macleod et al. 2013) and RNA-seq studies (Hishiki, Kawamoto et al. 2000).

I compared the resulting novel cancer gene candidates with cancer gene candidates emerging from a large-scale in vivo (mouse) screen via mutagenesis with Sleeping Beauty transposons (Collier and Largaespada 2007). Of the 94 genes encoding cancer-type-specific SMDs, 24 were found in the Sleeping Beauty dataset (~3-fold enrichment; P-value = 7×10⁻⁶, Fisher’s Exact test). Of the subset of 54 candidate genes not already known to be cancer genes, 10 were found in the Sleeping Beauty dataset (~2-fold enrichment; P-value = 5×10⁻³, Fisher’s Exact test, Table 3-6).
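The fold enrichments quoted above compare an observed overlap with the overlap expected by chance. A minimal sketch of that calculation follows; the function name and all four numbers in the usage line are hypothetical placeholders, since the reference-list and background sizes actually used are not restated in this section.

```python
def fold_enrichment(overlap: int, n_candidates: int,
                    n_reference: int, n_background: int) -> float:
    """Observed overlap divided by the overlap expected by chance.

    If `n_candidates` genes were drawn at random from `n_background` genes,
    of which `n_reference` belong to the reference list, the expected
    overlap is n_candidates * n_reference / n_background.
    """
    expected = n_candidates * n_reference / n_background
    return overlap / expected


# Hypothetical numbers, for illustration only (not values from the thesis):
# 10 of 100 candidate genes overlap a 500-gene reference list drawn from a
# background of 10,000 genes.
example = fold_enrichment(10, 100, 500, 10_000)
print(example)
```

The significance of such an overlap would then be assessed separately, for instance with a one-sided Fisher’s Exact test on the corresponding 2×2 table of candidate versus non-candidate and reference versus non-reference genes.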

Table 3-6 Genes that encode cancer-type-specific significantly mutated domain instances and overlap with the Sleeping Beauty dataset

Gene Symbol | Significantly Mutated Domain | Cancer Type | Predicted Category | Reported Category
CNTN4 | PF13895.1 | Lymphoma and leukemia | Tumor suppressor | Not determined
CTNNA3 | PF01044.14 | Lymphoma and leukemia | Tumor suppressor | Not determined
EPHA6 | PF14575.1 | Lymphoma and leukemia | Tumor suppressor | Not determined
FOXO1 | PF00250.13 | Lymphoma and leukemia | Tumor suppressor | Tumor suppressor
GNA13 | PF00503.15 | Lymphoma and leukemia | Oncogene | Not determined
MAGI1 | PF00503.15 | Lymphoma and leukemia | Tumor suppressor | Tumor suppressor
PCDH11X | PF08266.7 | Lymphoma and leukemia | Oncogene | Not determined
PCDH11X | PF00028.12 | Glioblastoma, medulloblastoma | Tumor suppressor | Not determined
PTPRD | PF00102.22 | Prostate cancer | Tumor suppressor | Not determined
SMAD4 | PF03166.9 | Colorectal cancer | Tumor suppressor | Tumor suppressor
SMAD4 | PF03166.9 | Esophageal adenocarcinoma | Tumor suppressor | Tumor suppressor
USP25 | PF00443.24 | Liver cancer | Oncogene | Not determined

The distribution of cancer-type-specific SMDs varies across cancer types. Among cancer-type-specific SMDs, most (95%) were only significantly mutated in a single cancer type (Figure 3-3). Five domain instances were found to be significantly mutated in more than one cancer (Table 3-7): a Ras domain instance of KRAS, mutated in lung and pancreatic cancer; a PHD finger domain instance (zf-HC5HC2H) of MLL3, mutated in breast and prostate cancer; a MAD homology 2 (MH2) domain instance of SMAD4, mutated in colon and esophageal cancer; a SNF2 family N-terminal (SNF2_N) domain instance of SMARCA4, mutated in esophageal cancer and cancer of the central nervous system; and the P53 DNA binding domain instance of TP53, mutated in 8 cancer types. With the exception of KRAS, these genes are usually regarded as tumor suppressors. This suggests that, while tumor suppressors may cause different cancers via a common loss-of-function mechanism, the gain-of-function mechanism of oncogenes is more likely to be tissue-specific.
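Domain instances shared across cancer types can be tallied by grouping the rows of Table 3-3 by gene and domain. The sketch below uses a handful of rows copied from that table; the function name and the data layout are illustrative assumptions, not the code used for the analysis.

```python
from collections import defaultdict

# (primary site, gene symbol, domain name): a small subset of Table 3-3 rows.
rows = [
    ("Lung", "KRAS", "Ras"),
    ("Pancreas", "KRAS", "Ras"),
    ("Breast", "MLL3", "zf-HC5HC2H"),
    ("Prostate", "MLL3", "zf-HC5HC2H"),
    ("Skin", "NRAS", "Ras"),
    ("Lung", "EGFR", "Pkinase_Tyr"),
    ("Central nervous system", "EGFR", "Furin-like"),
]


def shared_domain_instances(table_rows):
    """Map each (gene, domain) pair to its primary sites, keeping only
    pairs that are significantly mutated in more than one primary site."""
    sites = defaultdict(set)
    for site, gene, domain in table_rows:
        sites[(gene, domain)].add(site)
    return {key: sorted(found) for key, found in sites.items() if len(found) > 1}


for (gene, domain), found in shared_domain_instances(rows).items():
    print(gene, domain, found)
```

Keying on the (gene, domain) pair rather than on the domain name alone matters here: the Ras domain type appears in KRAS and NRAS, but only the KRAS instance recurs across two primary sites in this subset.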


Figure 3-3 Clustering of Significantly-mutated Domain Instances across 21 Cancer Types. The heatmap reflects the significance of cancer-type-specific mutation density of each domain instance in different cancers. Side bars in the same color indicate domain instances encoded by the same gene, and domain instances belonging to the same domain type.


Table 3-7 Genes that encode more than one cancer-type-specific significantly mutated domain instance.

Gene Symbol | Domain Instances per Gene | Significantly Mutated Domain Instance | Primary Site
TP53 | 2 | P53 | Liver
TP53 | 2 | P53_tetramer | Liver
TP53 | 1 | P53 | Breast
TP53 | 1 | P53 | Central nervous system
TP53 | 1 | P53 | Hematopoietic and lymph
TP53 | 1 | P53 | Lung
TP53 | 1 | P53 | Esophagus
TP53 | 1 | P53 | Ovary
TP53 | 1 | P53 | Upper aero-digestive tract
EGFR | 2 | Furin-like | Central nervous system
EGFR | 2 | GF_recep_IV | Central nervous system
EGFR | 1 | Pkinase_Tyr | Lung
KRAS | 1 | Ras | Lung
KRAS | 1 | Ras | Pancreas
MLL3 | 1 | zf-HC5HC2H | Breast
MLL3 | 1 | zf-HC5HC2H | Prostate
SMAD4 | 1 | MH2 | Colon
SMAD4 | 1 | MH2 | Esophagus
SMARCA4 | 1 | SNF2_N | Central nervous system
SMARCA4 | 1 | SNF2_N | Esophagus
BCL2 | 2 | BH4 | Hematopoietic and lymph
BCL2 | 2 | Bcl-2 | Hematopoietic and lymph
DDX3X | 2 | DEAD | Central nervous system
DDX3X | 2 | Helicase_C | Central nervous system
PIK3CA | 2 | PI3Ka | Endometrium
PIK3CA | 2 | PI3K_p85B | Endometrium

3.3.2 Cancer-type-specific positioning of mutations within a given gene

Domain instances mutated in a specific cancer type can point to functions that are specifically disrupted in that cancer type. Furthermore, the observation that a given gene product has different domain instances mutated in different cancer types may elucidate how a single gene can play different roles in different cancers. To identify candidate genes with this behavior, I first selected all of the multi-domain genes that contained at least one cancer-type-specific SMD, and examined the cancer-type-specificity of these domains.

Among the 94 genes identified above to contain cancer-type-specific SMDs, 52 genes had multiple domain instances with differing cancer-type-specificity (see Table 3-3). These 52 genes were enriched for evidence of involvement in cancer, with 16 being Cancer Census genes (enrichment factor ~11.9; P-value = 6.7×10⁻¹³, Fisher’s Exact test), and 15 being candidate cancer genes according to the Sleeping Beauty screen (enrichment factor ~4.5; P-value = 1.9×10⁻⁶, Fisher’s Exact test).

To illustrate this analysis, I show the distribution of domain mutations in the EGF receptor, encoded by EGFR, across five cancers (Figure 3-4A). The EGF receptor is a flexible protein with four distinct domains, including extracellular and transmembrane regions, the intracellular kinase domains, and a long flexible tail (Figure 3-4B). My analysis recapitulated domain mutation patterns seen in previous findings. The extracellular region consists of the furin-like (Furin-Like) domain, the growth factor receptor domain IV (GF_recep_IV) and the L domain (Recep_L_Domain). The Furin-Like and the GF_recep_IV domains were both found to be significantly mutated in cancers of the central nervous system. Mutations in the extracellular region of the EGF receptor have been associated with ligand-independent dimerization in cancers of the central nervous system (Lee, Vivanco et al. 2006), and mutations in the intracellular region of the EGF receptor are associated with sensitivity to kinase inhibitors (Lee, Vivanco et al. 2006).


Figure 3-4 Mutations in EGFR across 5 Different Cancers with Protein Structure Context. (A) The histogram displays, for five different cancers, the proportion of mutations detected at each residue relative to the total number of mutations that fall in the four different domains encoded by the gene EGFR. The x-axis indicates the position of mutant residues. Mutations in different domains are shown in different colors. (B) The structure of the EGFR protein, with epidermal growth factors colored in orange. The arrows point to enlargements of portions of the protein. The tails of the kinase domain are not shown in this structure. The structure visualization was based on Protein Data Bank structure models 1nql, 1ivo, 2jwa, 1m17 and 2gs6. Significantly mutated domain instances (SMDs) are shown as thicker boxes.


In lung cancers, mutations were significantly enriched in the tyrosine kinase domain (Pkinase_Tyr). This is a well-known location for oncogenic mutations that hyper-activate downstream pro-survival signaling pathways in lung cancer by causing the auto-phosphorylation of C-terminal residues (Sordella, Bell et al. 2004).

The detailed functional consequences of EGFR mutations in prostate and colon cancer are still unclear. Differences in the positions of mutations between the extracellular (glioblastoma and prostate cancer) and intracellular regions (colon, lung and ovarian cancer) of the EGF receptor in different cancer types suggest different oncogenic mechanisms and possibly different therapeutic avenues.

Other interesting examples included that of the histone-lysine N-methyltransferase MLL3 protein, for which the PHD finger domain is mutated in breast cancer and prostate cancer, and for which the SET domain is mutated in glioblastoma and medulloblastoma. MLL3 is reported to possess histone methylation activity and is also involved in transcriptional co-activation. Knockdown or deletion of MLL3 using RNAi or CRISPR is reported to cause acute myeloid leukemia in a mouse model (Chen, Liu et al.).

Domain-associated mutational biases have been reported in several studies focusing on single well-known cancer genes such as the PIK3CA gene in colon and breast cancer (Nehrt, Peterson et al. 2012), and the NOTCH1 gene in leukemia, breast and ovarian cancer (Lobry, Oh et al. 2011). Here, I analyzed the distribution of somatic missense mutations for 14,083 genes across 21 cancer types and identified 52 genes (36 of which are not yet known to be cancer genes) for which different domain instances may contribute to different cancer types.

3.3.3 Comparison between oncogenes and tumor suppressors

I further analyzed genes with at least one cancer-type-specific SMD. More specifically, I identified a collection of 337 cancer-type-specific mutation hotspots in 68 genes, including some hotspots that appeared in several different types of cancer (Figure 3-5, Appendix B). For example, in the EGFR protein, residue p.A289 is a mutational hotspot in central nervous system cancer, and p.C231 is a mutational hotspot in prostate cancer (Figure 3-4). Both residues fall in the Furin-like domain of the extracellular part of EGFR, but at different domain-domain interaction interfaces.


Figure 3-5 Cancer-type-specific Mutational Hotspots and Mutational Hotspots Shared by Several Cancer Types. A. shows the distribution of mutational hotspots for different cancer types within a given domain instance. B. shows mutational hotspot distribution patterns of different domain instances (encoded by different genes) that each correspond to the same protein domain type. Mutational hotspots are shown as balls and sticks, domain instances are shown as boxes. Mutational hotspots in different colors represent mutations in different cancer types.

Not surprisingly, the vast majority (98%) of the 337 mutational hotspots I identified were found within SMDs. I also noticed that ~5% of the identified significantly mutated protein domains were likely only identified because of a cancer-type-specific hotspot (Table 3-3). The seven exceptions not found in SMDs were: 4 of the 11 hotspots in phosphatidylinositol-4,5-bisphosphate 3-kinase (encoded by PIK3CA) in cancers of the colon, breast, lung and upper aero-digestive tract; 1 of 2 hotspots in methyl-CpG binding domain protein (encoded by MBD1) in endometrial cancer; and 2 of 4 hotspots in Latrophilin-3 (encoded by LPHN3) in prostate cancer.

It has been proposed that oncoproteins tend to be recurrently mutated at the same amino acid residues, while tumor suppressor proteins tend to be mutated throughout their length (Vogelstein, Papadopoulos et al. 2013). Therefore, I systematically compared the mutation pattern between tumor suppressor proteins and oncoproteins. In both tumor suppressor proteins and oncoproteins, mutations were enriched in the SMDs, as expected (Figure 3-6A; Table 3-8). For each cancer-type-specific SMD, to assess whether mutations were recurrent at a few locations as opposed to being evenly spread, I compared the ratio of mutational hotspots to the total number of mutated residues within domains. I found this ratio to be significantly higher for oncoproteins than for tumor suppressor proteins (Figure 3-6B; P = 0.00026, Mann-Whitney U-test). This is consistent with the known tendency of tumor suppressor proteins to carry loss-of-function mutations that can occur in many places, and that of oncoproteins to harbor more specific gain-of-function mutations (Vogelstein, Papadopoulos et al. 2013).
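The comparison above can be sketched in two steps: compute, for each SMD, the fraction of mutated residues that are hotspots, then compare the two groups with a rank-based test. The per-domain counts below are invented for illustration, and the exact permutation P-value is a stand-in chosen to keep the example dependency-free; it is not necessarily the Mann-Whitney implementation, or the sidedness, used for Figure 3-6B.

```python
from itertools import combinations


def hotspot_fraction(n_hotspots: int, n_mutated_residues: int) -> float:
    """Fraction of mutated residues in a domain instance that are hotspots."""
    return n_hotspots / n_mutated_residues


def u_statistic(x, y) -> float:
    """Mann-Whitney U for x: pairs with x_i > y_j count 1, ties count 0.5."""
    return sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0
               for xi in x for yj in y)


def exact_one_sided_p(x, y) -> float:
    """Exact permutation P-value for the alternative 'x tends to exceed y'.

    Enumerates every way of relabelling len(x) of the pooled values as
    group x, so it is only practical for small samples.
    """
    pooled = list(x) + list(y)
    observed = u_statistic(x, y)
    n = len(pooled)
    hits = total = 0
    for idx in combinations(range(n), len(x)):
        chosen = set(idx)
        gx = [pooled[i] for i in idx]
        gy = [pooled[i] for i in range(n) if i not in chosen]
        total += 1
        if u_statistic(gx, gy) >= observed:
            hits += 1
    return hits / total


# Hypothetical (hotspots, mutated residues) counts per SMD, illustration only.
onc = [hotspot_fraction(h, m) for h, m in [(3, 6), (2, 5), (4, 10), (1, 3)]]
ts = [hotspot_fraction(h, m) for h, m in [(1, 20), (0, 15), (2, 30), (1, 12)]]

print(u_statistic(onc, ts), exact_one_sided_p(onc, ts))
```

For samples of the size analyzed in this chapter, full enumeration would be impractical, and a normal approximation to the U distribution, as provided by standard statistics packages, would be the usual choice.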

Figure 3-6 Distribution of Mutated Residues within a Single Gene. (A) compares the prevalence with which mutations from a specific cancer type fall within significantly mutated domain instances (SMDs) to the prevalence of mutations in other domain instances. Genes with at least one SMD are represented on the x-axis in descending order by the number of mutated residues. The length of each blue bar shows the number of mutated residues falling in SMDs for each cancer type; the length of each red bar shows the number of mutated residues falling in other domain instances within the same gene. (B) compares the fraction of mutated residues in SMDs that are hotspots in oncogenes (yellow) and tumor suppressors (green).

Table 3-8 Significantly mutated domain instances in oncogenes and tumor suppressors.

Gene Symbol | Suppressor & Oncogene | Pfam ID | Primary Site
ACSM4 | Unknown | PF00501.23 | Endometrium
CAGE1 | Unknown | PF15066.1 | Cervix
CCDC14 | Unknown | PF15254.1 | Salivary gland
CCNF | Unknown | PF02984.14 | Prostate
CD6 | Unknown | PF00530.13 | Lung
CD74 | Unknown | PF09307.5 | Hematopoietic & lymph
CD79B | Unknown | PF02189.10 | Hematopoietic & lymph
CFHR4 | Unknown | PF00084.15 | Lung
CNOT3 | Unknown | PF04065.10 | Prostate
COLEC11 | Unknown | PF01391.13 | Soft tissue
CYFIP2 | Unknown | PF05994.6 | Endometrium
FKBP9 | Unknown | PF00254.23 | Central nervous system
GNA13 | Unknown | PF00503.15 | Hematopoietic & lymph
HIST1H1E | Unknown | PF00538.14 | Hematopoietic & lymph
ITPR3 | Unknown | PF08709.6 | Hematopoietic & lymph
KEAP1 | Unknown | PF01344.20 | Lung
KIF26B | Unknown | PF00225.18 | Endometrium
KLF4 | Unknown | PF13465.1 | Meninges
KRT6A | Unknown | PF00038.16 | Hematopoietic & lymph
LPHN3 | Unknown | PF01825.16 | Prostate
MAGI1 | Unknown | PF00595.19 | Hematopoietic & lymph
MAST1 | Unknown | PF00595.19 | Cervix
MBD1 | Unknown | PF01429.14 | Endometrium
MT-CYB | Unknown | PF00032.12 | Breast
MT-ND1 | Unknown | PF00146.16 | Endometrium
MT-ND6 | Unknown | PF00499.15 | Kidney
MT1G | Unknown | PF00131.15 | Cervix
MYD88 | Unknown | PF01582.15 | Hematopoietic & lymph
NPIPL2 | Unknown | PF06409.6 | Hematopoietic & lymph
NTNG1 | Unknown | PF00055.12 | Bone
ODZ2 | Unknown | PF06484.7 | Oesophagus
P2RY8 | Unknown | PF00001.16 | Hematopoietic & lymph
PCDH11X | Unknown | PF00028.12 | Central nervous system
PCDH11Y | Unknown | PF00028.12 | Central nervous system
PIM1 | Unknown | PF00069.20 | Hematopoietic & lymph
PLCH2 | Unknown | PF13499.1 | Prostate
POLE | Unknown | PF03104.14 | Endometrium
PRG4 | Unknown | PF01033.12 | Skin
PRKACB | Unknown | PF00069.20 | Bone
PSTPIP2 | Unknown | PF00611.18 | Central nervous system
SCYL3 | Unknown | PF00069.20 | Urinary tract
SLC16A5 | Unknown | PF07690.11 | Cervix
SNPH | Unknown | PF15290.1 | Lung
TRAF7 | Unknown | PF00400.27 | Meninges
U2AF1 | Unknown | PF00642.19 | Lung
EPHA6 | Tumor suppressor | PF14575.1 | Hematopoietic & lymph
FAT4 | Tumor suppressor | PF00028.12 | Colon
FBXW7 | Tumor suppressor | PF00400.27 | Colon
FGFR2 | Tumor suppressor functions (Reported tumor suppressor functions in endometrium, breast cancer) | PF07714 | Endometrium
BLNK | Tumor suppressor | PF00017.19 | Bone
BTG1 | Tumor suppressor | PF07742.7 | Hematopoietic & lymph
CDK11A | Tumor suppressor | PF00069.20 | Hematopoietic & lymph
CDKN2A | Tumor suppressor | PF12796.2 | Skin
CNTN4 | Tumor suppressor | PF13895.1 | Hematopoietic & lymph
CREBBP | Tumor suppressor | PF08214.6 | Hematopoietic & lymph
DDX3X | Tumor suppressor | PF00271.26 | Central nervous system
DDX3X | Tumor suppressor | PF00270.24 | Central nervous system
DGKZ | Tumor suppressor | PF00609.14 | Bone
DMBT1 | Tumor suppressor | PF00431.15 | Hematopoietic & lymph
HCK | Tumor suppressor | PF07714.12 | Hematopoietic & lymph
MLL3 | Tumor suppressor | PF13771.1 | Prostate
MLL3 | Tumor suppressor | PF13771.1 | Breast
NKX3-1 | Tumor suppressor | PF00046.24 | Prostate
PBRM1 | Tumor suppressor | PF00439.20 | Kidney
PPP2R1A | Tumor suppressor | PF13646.1 | Endometrium
PRPF19 | Tumor suppressor | PF00400.27 | Lung
PTEN | Tumor suppressor | PF00782.15 | Endometrium
PTPRD | Tumor suppressor | PF00102.22 | Prostate
PTPRQ | Tumor suppressor | PF00041.16 | Kidney
RBP1 | Tumor suppressor | PF00061.18 | Meninges
SETD2 | Tumor suppressor | PF00856.23 | Kidney
SMAD4 | Tumor suppressor | PF03166.9 | Oesophagus
SMAD4 | Tumor suppressor | PF03166.9 | Colon
SMARCA4 | Tumor suppressor | PF00176.18 | Oesophagus
SMARCA4 | Tumor suppressor | PF00176.18 | Central nervous system
SPOP | Tumor suppressor | PF00917.21 | Prostate
TP53 | Tumor suppressor | PF07710.6 | Liver
TP53 | Tumor suppressor | PF00870.13 | Hematopoietic & lymph
TP53 | Tumor suppressor | PF00870.13 | Upper aero-digestive tract
TP53 | Tumor suppressor | PF00870.13 | Oesophagus
TP53 | Tumor suppressor | PF00870.13 | Liver
TP53 | Tumor suppressor | PF00870.13 | Central nervous system
TP53 | Tumor suppressor | PF00870.13 | Breast
TP53 | Tumor suppressor | PF00870.13 | Lung
TP53 | Tumor suppressor | PF00870.13 | Ovary
VHL | Tumor suppressor | PF01847.11 | Kidney
BRAF | Oncogene | PF07714.12 | Skin
BCL2 | Oncogene | PF00452.14 | Hematopoietic & lymph
BCL2 | Oncogene | PF02180.12 | Hematopoietic & lymph
AKT1 | Oncogene | PF00169.24 | Breast
ALK | Oncogene | PF07714.12 | Autonomic ganglia
ANTXR1 | Oncogene | PF00092.23 | Hematopoietic & lymph
CTNNA3 | Oncogene | PF01044.14 | Hematopoietic & lymph
EGFR | Oncogene | PF14843.1 | Central nervous system
EGFR | Oncogene | PF00757.15 | Central nervous system
EGFR | Oncogene | PF07714.12 | Lung
ERBB3 | Oncogene | PF14843.1 | Bone
FOXO1 | Oncogene | PF00250.13 | Hematopoietic & lymph
HRAS | Oncogene | PF00071.17 | Upper aero-digestive tract
IRF4 | Oncogene | PF00605.12 | Hematopoietic & lymph
KRAS | Oncogene | PF00071.17 | Pancreas
KRAS | Oncogene | PF00071.17 | Lung
NRAS | Oncogene | PF00071.17 | Skin
PIK3CA | Oncogene | PF00613.15 | Endometrium
PIK3CA | Oncogene | PF02192.11 | Endometrium
PRKCB | Oncogene | PF00069.20 | Hematopoietic & lymph
RAC1 | Oncogene | PF00071.17 | Skin
USP25 | Oncogene | PF00443.24 | Liver
FGFR1 | Oncogene (Reported tumor suppressor functions in breast cancer) | PF07714.12 | Central nervous system
MYC | Both (tumor suppressor in leukemia) | PF01056.13 | Hematopoietic & lymph
NOTCH1 | Both (tumor suppressor in SCC) | PF07645.10 | Upper aero-digestive tract

For example, the fibroblast growth factor receptor 2 (FGFR2) is generally regarded as an oncoprotein in breast cancer (Vogelstein, Papadopoulos et al. 2013). Consistent with this view, I found a single hotspot (p.N549) for FGFR2 in breast cancer in the kinase domain, which had not been reported as a hotspot for breast cancer. A previous study of endometrial cancer (Gatius, Velasco et al. 2011) suggested FGFR2 to be a tumor suppressor protein. Supporting this view, I observed nine evenly-distributed mutated residues in the kinase domain in endometrial cancer, although I also confirm the previous observation (Gatius, Velasco et al. 2011) of the p.N549 hotspot, which is more suggestive of an oncoprotein. Four mutational hotspots in the Immunoglobulin I-set domain of FGFR2 were observed in colon cancer, which hints at a tumor suppression role for FGFR2 in colon cancer (Figure 3-7). The general phenomenon indicated here, in which a gene acts as an oncogene in one cancer type but a tumor suppressor in another, has not been noted previously for FGFR2. However, this intriguing phenomenon, which must arise from cancer-type-specific differences in cancer initiation mechanisms, has been noted for other genes. For example, some previous studies have revealed that the gene NOTCH1 is an oncogene in “liquid tumors” such as lymphomas and leukemias, in which mutations are enriched in the PEST or HD domain. The first cleavage in the HD domain can free the intracellular portion of Notch to enter the nucleus and activate the transcription of target genes. In squamous cell carcinomas, NOTCH1 is a tumor suppressor with mutations enriched in the EGF-like domain. NOTCH1 inactivation in the epidermis results in derepressed beta-catenin signaling in cells that should normally undergo differentiation (Tosello and Ferrando 2013).

Figure 3-7 Distribution of Mutated Residues in FGFR2. Sequence positions and frequencies of mutated residues in the FGFR2 protein are shown. Mutational hotspots for each cancer type are displayed as red dots. SMDs are shown as thicker boxes.

I also analyzed the functional properties of the mutational hotspots I observed. Out of the 68 proteins that have at least one mutational hotspot in at least one cancer, I selected 13 proteins for which structures and functional site annotations were available. Of these 13 proteins, seven are encoded by oncogenes (AKT1, BRAF, EGFR, HRAS, KRAS, NRAS, and PIK3CA) and six are encoded by tumor suppressors (CDKN2A, FBXW7, PTEN, SMAD4, TP53, and VHL). I mapped all observed mutations to protein structures. For tumor suppressor proteins, I found that most mutational hotspots fell at the interface of domain-domain interactions. I also found that, of 47 mutational hotspots, only 3 (6%) fell at functional sites (residues associated with post-translational modification, catalytic activation, ligand binding, or protein-DNA or protein-RNA interactions) of tumor suppressor proteins (Table 3-9). For oncoproteins, of 40 mutational hotspots, 15 (38%) fell at functional sites, including GTP/ATP binding sites and other active sites. Functional sites were significantly overrepresented among oncogenic mutational hotspots (odds ratio = 10.0, P = 0.0006, Fisher’s Exact test).

Table 3-9 Mutational hotspots observed at functional sites.

Gene Symbol  Category          Mutational Hotspot  Functions
AKT1         Oncogene          p.E17               PH–KD interaction
BRAF         Oncogene          p.G469              ATP binding
BRAF         Oncogene          p.G466              ATP binding
BRAF         Oncogene          p.N581              Enzyme-active
EGFR         Oncogene          p.G719              ATP binding
EGFR         Oncogene          p.G724              ATP binding
HRAS         Oncogene          p.G13               GTP binding
HRAS         Oncogene          p.Q61               Enzyme-active
KRAS         Oncogene          p.G12               GTP binding
KRAS         Oncogene          p.G13               GTP binding
NRAS         Oncogene          p.G12               GTP binding
NRAS         Oncogene          p.Q61               Enzyme-active
PIK3CA       Oncogene          p.E545              Intra-molecular binding
PIK3CA       Oncogene          p.E542              Intra-molecular binding
PIK3CA       Oncogene          p.Q546              Intra-molecular binding
TP53         Tumor suppressor  p.R248              DNA binding
TP53         Tumor suppressor  p.R273              DNA binding
TP53         Tumor suppressor  p.R337              Post-translational modification

The three mutational hotspots detected at known functional sites of a tumor suppressor protein all fell within the p53 protein. The p.R248 and p.R273 hotspots were within the DNA binding site, and each has been reported as a site of potentially oncogenic mutations in many cancer types, including breast cancer (Sigal and Rotter 2000). The hotspot p.R337, found in liver cancer, fell within p53’s tetramerization domain, at a site of post-translational modification targeted by Protein Arginine N-Methyl Transferase 5 (PRMT5). Methylation of this residue affects the target protein specificity of p53 (Jansson, Durant et al. 2008, Scoumanne and Chen 2008). As shown in Figure 3-8, the contact between p.R337 and p.L348, a residue in the P53-Tetramerization domain of another chain, may be necessary for tetramerization of the whole protein. Tetramerization is reported to be essential for the activity of p53 (Chene 2001). Disruption of tetramerization could have a dominant-negative loss-of-function effect, or a gain- or change-of-function effect if the un-tetramerized subunits have additional activities. Thus, my analysis points to mutations at residue p.R337 as novel driver mutations in liver cancer.

Figure 3-8 Structural Context of p53 Protein (PDB 3q05) Mutational Hotspots. Mutational hotspots shared by eight cancers are displayed as blue sticks. Liver-cancer-specific mutational hotspots are displayed as magenta sticks. The p53 protein structure is colored according to amino acid chain.

The different mutational hotspot distribution patterns between oncoproteins and tumor suppressor proteins were generally consistent with the expected gain- and loss-of-function mechanisms of oncogenes and tumor suppressors, respectively (Vogelstein, Papadopoulos et al. 2013). Mutations at functional sites may increase the activation of oncoproteins, while mutations at the inter-chain interfaces may destabilize the protein and lead to loss of function in a tumor suppressor. These distinct mutation patterns can help classify newly identified cancer-associated genes for which oncogene or tumor suppressor roles are unknown. I categorized the ten novel cancer candidate genes that overlap with the Sleeping Beauty dataset based on similarity to the hotspot distribution patterns that are characteristic of oncogenes and tumor suppressors (Table 3-6). Among the ten genes, seven (CNTN4, CTNNA3, EPHA6, FOXO1, MAGI1, PTPRD and SMAD4) were predicted to be tumor suppressors. Using transposon insertion positions, the Sleeping Beauty study (Dupuy, Jenkins et al. 2006) had annotated three of these seven genes as loss of function (while not suggesting an annotation for the remaining four). I also identified two potential oncogenes, USP25 and GNA13. Finally, I identified one gene, PCDH11X, for which the domain mutation patterns suggest an oncogenic role in lymphoma and leukemia but a tumor suppressive role in glioblastoma and medulloblastoma.

At the domain level, I noticed that 10 out of 13 oncogenic mutational hotspots shared by at least three cancer types occurred at functional sites (Table 3-10). This is true not only for domains corresponding to a single gene but also for domain types corresponding to different genes. For example, the Ras domain type (for which instances may be found in multiple genes) was significantly mutated in different cancers (Figure 3-9A). Enrichment of somatic mutations within Ras domains has been reported for different individual genes (Fernandez-Medarde and Santos 2011, Prior, Lewis et al. 2012). Here, I collectively analyzed the domain position-based hotspots for K-RAS, H-RAS, and N-RAS, finding that at least one of the GTP binding site residues p.G12 or p.G13, or the active site residue p.Q61, shows a relatively high mutation rate in at least five cancer types (Figure 3-9 B and C). While each of these three hotspots was known previously for individual genes in individual cancer types, this analysis suggests that an increase in statistical power can be gained in the future by grouping protein domain instances of the same domain type.
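
The grouping idea in the preceding paragraph can be sketched as a simple pooling step. The example below is illustrative only: the mutation records are invented toy data, the function name is hypothetical, and it assumes that residue numbering is shared across the three Ras paralogs so that a residue number can serve as the domain-type position. It does not reproduce the hotspot significance test used in this chapter.

```python
from collections import Counter

# Toy records (gene, cancer type, residue position); invented for illustration.
toy_mutations = [
    ("KRAS", "lung", 12), ("KRAS", "lung", 12), ("KRAS", "colon", 13),
    ("NRAS", "skin", 61), ("NRAS", "skin", 61), ("HRAS", "bladder", 61),
    ("HRAS", "bladder", 12), ("KRAS", "colon", 146),
]

# Assumption: these genes share one domain type with aligned residue numbers.
RAS_DOMAIN_GENES = {"KRAS", "NRAS", "HRAS"}


def pool_by_domain_position(mutations, genes):
    """Count mutations per (gene, position) and pooled per domain position.

    Also records the set of cancer types in which each pooled position
    was observed to be mutated.
    """
    per_gene = Counter()
    pooled = Counter()
    cancers_by_position = {}
    for gene, cancer, position in mutations:
        if gene not in genes:
            continue
        per_gene[(gene, position)] += 1
        pooled[position] += 1
        cancers_by_position.setdefault(position, set()).add(cancer)
    return per_gene, pooled, cancers_by_position


per_gene, pooled, cancers_by_position = pool_by_domain_position(
    toy_mutations, RAS_DOMAIN_GENES
)
for position in sorted(pooled):
    print(position, pooled[position], sorted(cancers_by_position[position]))
```

Pooling gives each domain position a count at least as large as that of any single gene, which is the source of the potential gain in statistical power.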

Table 3-10 Domain position-based mutational hotspots shared by at least three cancers with functional annotations

Domain Type                       Corresponding Genes  Cancer Types  Mutational Hotspot  Function
Ras domain                        KRAS, NRAS, HRAS     8             p.G12               GTP binding
Ras domain                        KRAS, NRAS, HRAS     5             p.G13               GTP binding
Ras domain                        KRAS, NRAS, HRAS     5             p.Q61               Enzyme-active
PH domain                         AKT1                 5             p.E17               Intra-molecular binding
Phosphoinositide 3-kinase family  PIK3CA               7             p.E545              Intra-molecular binding
Phosphoinositide 3-kinase family  PIK3CA               5             p.E542              Intra-molecular binding
P53 DNA binding domain            TP53                 14            p.R248              Contact with DNA
P53 DNA binding domain            TP53                 9             p.G245              Contact with DNA
P53 DNA binding domain            TP53                 6             p.R273              Contact with DNA
P53 DNA binding domain            TP53                 6             p.R282              Contact with DNA

Figure 3-9 Mutation Distributions of Different Ras Domain Instances and The Structure of Ras Domain. (A) bar graph shows Ras domains encoded by different genes have different mutation rates across cancer types. (B) heat map shows fraction of mutations observed at each residue of a given gene in a given cancer. (C) the structure of the Ras domain encoded by the KRAS gene (PDB structural model 4lpk). GTP/GDP binding sites are displayed as magenta sticks, GDP binding sites are colored in cyan.

Of the 13 oncogenic mutational hotspots shared by at least three cancer types, the three that did not occur at annotated functional sites were V600 in serine/threonine-protein kinase B-Raf (encoded by BRAF), and R88 and C420 in the phosphatidylinositol-4,5-bisphosphate 3-kinase encoded by PIK3CA. I found both C420 and R88 to be positions of mutational hotspots in endometrium, colon and breast cancer. Although the two residues fall within different domains (C420 in the C2 domain, and R88 in the p85α domain), they both play important roles in maintaining the p110α/p85α-iSH2 structure (Hon, Berndt et al. 2012), and both are at the binding interface of the C2 and p85α domains. Although each of these mutations has been previously studied as a potential driver mutation in each of these three cancer types, this analysis objectively confirms the ‘hotspot’ status of these mutations.

3.4 Discussion

Major bottlenecks in the systematic study of cancer genomes exist following identification of somatic tumor mutations, including the identification of driver mutations and their functional impacts. By taking advantage of large-scale whole-genome or whole-exome sequencing data and accumulated information about protein structures, I was able to derive and compare the mutational landscapes for 21 cancer types at the domain level. I used a significance test that not only required a domain to have enriched mutational density in a given tumor type relative to other regions in that tumor type, but further required that this enrichment be significantly greater than that observed for all other cancer types taken together. Because region-dependence of mutation rates is similar across tumor types (Lawrence, Stojanov et al. 2013), this approach not only identifies cancer-type-specific mutational positioning but also implicitly controls for regional differences in mutation rate across the genome. Lawrence et al. have also reported heterogeneity in the mutational spectrum across cancers (Lawrence, Stojanov et al. 2013). Theoretically, differences in mutational spectrum across cancers could lead to mutational enrichment in some domains in different cancers. For example, for lung cancers with a greater frequency of C > A mutations, protein domains encoded by DNA regions with high GC content could be more likely to be identified as significantly mutated domains. However, the observation that genes with SMDs are 2 to 3 times more likely to have been identified as cancer genes by other means suggests that this phenomenon does not preclude meaningful discovery of SMDs. Moreover, 10% of SMDs occur in multiple cancer types, and those types often have different mutational spectra.

I analyzed domain types that are significantly mutated in different cancers, such as the Ras domain type and the Pkinase domain type. I found hotspots that were shared between different domain instances in the same domain type, and which appeared in multiple cancer types. By combining this information with protein structure information, I found that all (10 out of 10) such identified hotspots, where they fell within known oncoproteins, are ‘functional hotspots’ in the sense that all fell within ligand-binding or active sites. I also found that, in a given cancer type, a functional hotspot corresponding to a given domain type was never mutated in more than one of the domain instances corresponding to that domain type in the same tumor sample. This suggests that functional hotspots falling within different genes corresponding to a given domain type may contribute to cancer development by a similar and parallel mechanism, and further suggests that only one mutated functional site might be able to increase the activity of those proto-oncogenes and ultimately contribute to cancer initiation. Functional hotspots included oncogenic mutations within proteins that are generally considered to be tumor suppressors, for example p.R248 and p.R273 in TP53. Except for the DNA binding sites p.R248 and p.R273 in the p53 DNA binding domain, I did not find mutational hotspots in known tumor suppressors that appeared in more than five cancer types. Providing greater nuance to previous reports that mutations tend to span the entire tumor suppressor gene (Vogelstein, Papadopoulos et al. 2013), I found that tumor suppressor mutations detected in a given cancer type tended to be distributed throughout the entirety of a significantly mutated domain instance, and many mutations occurred within core regions important for the stabilization of the protein complex. Mutations detected in different cancers tended to be focused within domain instances, but were distributed across different domain instances of the same gene product.

Mutational positioning information could assist drug design aimed at precisely targeting the region of the protein involved in a particular cancer. In contrast with gene-level studies of mutational frequency, the domain-level view points to particular functional regions, and identifies tendencies of a gene to be mutated in different regions in different cancers. For most genes, only one domain was found to be significantly mutated in a given cancer type. However, I found five genes that each contain two interaction-mediating domain instances that were significantly mutated in the same cancer type. These five interaction-mediating domain pairs are the Bcl-2 and BH4 domains encoded by BCL2, which play important roles in regulating cell death and survival (Kelekar and Thompson 1998); the DEAD and Helicase_C domains encoded by DDX3X, which play important roles in metabolic processes involving RNAs (Owsianka and Patel 1999); the PI3Ka and PI3K_p85B domains encoded by PIK3CA, which interact with each other to initiate a vast array of signaling events (Miled, Yan et al. 2007); the Furin-like and GF_recep_IV domains encoded by EGFR, which are both extracellular domains of receptor tyrosine protein kinases and which interact with each other to regulate the binding of ligands to the receptor (Cho and Leahy 2002); and finally the DNA binding and P53_tetramer (tetramerization) domains encoded by TP53. I also identified 117 domain instance pairs that corresponded to interacting proteins (Rolland, Ta An et al. 2014), for which at least one member of the pair was an SMD. For most interaction-mediating pairs, only one domain instance was significantly mutated (Table 3-11). There are only ten cases where both domains of a predicted interaction-mediating domain pair were significantly mutated in the same cancer type (Table 3-12). This result raises the possibility that mutations in those domain instances act by disrupting domain-domain interactions. Distinctive mutation landscapes in different cancers could indicate that tumor development mechanisms across different cancer types are dissimilar, although it is also possible that differences in the mutational spectrum between different cancer types alter the probability of mutation in one domain relative to another.

Table 3-11 List of pairs of significantly mutated domain instances that corresponded to directly interacting proteins (Rolland et al., unpublished results).

Domain Instance  Interacting Partner  Primary Site            P-value
PF12012_ZMYM4    PF13465_CTCF         Bone                    0.00000459
PF12012_ZMYM4    PF00051_ROR2         Bone                    0.00000459
PF12012_ZMYM4    PF00412_FHL2         Bone                    0.00000459
PF12012_ZMYM4    PF01392_ROR2         Bone                    0.00000459
PF12012_ZMYM4    PF01753_RUNX1T1      Bone                    0.00000459
PF12012_ZMYM4    PF07531_RUNX1T1      Bone                    0.00000459
PF12012_ZMYM4    PF07714_ROR2         Bone                    0.00000459
PF00169_AKT1     PF01840_TCL1A        Breast                  1.09E-17
PF00069_IRAK4    PF01582_MYD88        Central nervous system  0.00000762
PF00010_MAX      PF11701_UNC45A       Endometrium             0.00000521
PF00225_KIF26B   PF00917_TRAF2        Endometrium             0.000000388
PF13646_PPP2R1A  PF00313_CSDC2        Endometrium             0.000000179
PF00017_STAT3    PF00017_SYK          Hematopoietic & lymph   0.00000654
PF00017_STAT3    PF07714_SYK          Hematopoietic & lymph   0.00000654
PF00017_STAT3    PF00017_BMX          Hematopoietic & lymph   0.00000654
PF00017_STAT3    PF00018_BLK          Hematopoietic & lymph   0.00000654
PF00017_STAT3    PF00169_BMX          Hematopoietic & lymph   0.00000654
PF00017_STAT3    PF00412_LMO2         Hematopoietic & lymph   0.00000654
PF00017_STAT3    PF07714_BMX          Hematopoietic & lymph   0.00000654
PF00038_KRT6A    PF00038_KRT13        Hematopoietic & lymph   0.00000027
PF00038_KRT6A    PF00038_KRT15        Hematopoietic & lymph   0.00000027
PF00038_KRT6A    PF00038_KRT31        Hematopoietic & lymph   0.00000027
PF00038_KRT6A    PF00038_KRT38        Hematopoietic & lymph   0.00000027
PF00038_KRT6A    PF00038_KRT40        Hematopoietic & lymph   0.00000027
PF00038_KRT6A    PF00790_HGS          Hematopoietic & lymph   0.00000027
PF00038_KRT6A    PF01363_HGS          Hematopoietic & lymph   0.00000027
PF00038_KRT6A    PF07842_TFIP11       Hematopoietic & lymph   0.00000027
PF00038_KRT6A    PF12457_TFIP11       Hematopoietic & lymph   0.00000027
PF00038_KRT6A    PF13923_TRIM54       Hematopoietic & lymph   0.00000027
PF00038_KRT6A    PF00025_TRIM23       Hematopoietic & lymph   0.00000027
PF00038_KRT6A    PF00225_KIFC3        Hematopoietic & lymph   0.00000027
PF00069_PIM1     PF00651_ZBTB1        Hematopoietic & lymph   1.19E-30
PF00069_PIM1     PF05641_FXR2         Hematopoietic & lymph   1.19E-30
PF00069_PIM1     PF12235_FXR2         Hematopoietic & lymph   1.19E-30
PF00069_PIM1     PF13465_ZBTB1        Hematopoietic & lymph   1.19E-30
PF00069_PIM1     PF00010_NHLH1        Hematopoietic & lymph   1.19E-30
PF00069_PIM1     PF00206_FH           Hematopoietic & lymph   1.19E-30
PF00069_PIM1     PF10415_FH           Hematopoietic & lymph   1.19E-30
PF00503_GNAI2    PF00008_NOTCH2NL     Hematopoietic & lymph   0.00000409
PF00503_GNAI2    PF00038_KRT31        Hematopoietic & lymph   0.00000409
PF00503_GNAI2    PF00038_KRT40        Hematopoietic & lymph   0.00000409
PF00503_GNAI2    PF00412_TRIP6        Hematopoietic & lymph   0.00000409
PF00503_GNAI2    PF00615_RGS20        Hematopoietic & lymph   0.00000409
PF00503_GNAI2    PF07645_NOTCH2NL     Hematopoietic & lymph   0.00000409
PF00503_GNAI2    PF13885_KRTAP10-8    Hematopoietic & lymph   0.00000409
PF01582_MYD88    PF00069_IRAK4        Hematopoietic & lymph   1.48E-24
PF07714_HCK      PF00008_NOTCH2NL     Hematopoietic & lymph   0.000000112
PF07714_HCK      PF00038_KRT40        Hematopoietic & lymph   0.000000112
PF07714_HCK      PF00595_GOPC         Hematopoietic & lymph   0.000000112
PF07714_HCK      PF07645_NOTCH2NL     Hematopoietic & lymph   0.000000112
PF07742_BTG1     PF04857_CNOT8        Hematopoietic & lymph   1.35E-08
PF13465_BCL6     PF00225_KIFC3        Hematopoietic & lymph   0.00000654
PF13465_BCL6     PF00651_ZBTB7B       Hematopoietic & lymph   0.00000654
PF13465_BCL6     PF00917_TRAF1        Hematopoietic & lymph   0.00000654
PF13465_BCL6     PF03145_SIAH1        Hematopoietic & lymph   0.00000654
PF13465_BCL6     PF03932_CUTC         Hematopoietic & lymph   0.00000654
PF13465_BCL6     PF13465_ZBTB7B       Hematopoietic & lymph   0.00000654
PF00443_USP25    PF13465_ZNF426       Liver                   9.93E-11
PF00443_USP25    PF00017_SYK          Liver                   9.93E-11
PF00443_USP25    PF00240_RAD23A       Liver                   9.93E-11
PF00443_USP25    PF07714_SYK          Liver                   9.93E-11
PF00443_USP25    PF11976_SUMO2        Liver                   9.93E-11
PF00071_KRAS     PF00076_RBPMS        Lung                    1.92E-64
PF00642_U2AF1    PF00076_U2AF2        Lung                    1.08E-15
PF00642_U2AF1    PF00069_SRPK2        Lung                    1.08E-15
PF00642_U2AF1    PF09066_AP2B1        Lung                    1.08E-15
PF01344_KEAP1    PF00293_NUDT4        Lung                    0.000000218
PF01344_KEAP1    PF00400_WDR83        Lung                    0.000000218
PF01344_KEAP1    PF00651_KLHL3        Lung                    0.000000218
PF01344_KEAP1    PF01344_KLHL3        Lung                    0.000000218
PF01344_KEAP1    PF01423_LSM3         Lung                    0.000000218
PF01344_KEAP1    PF03571_DPP3         Lung                    0.000000218
PF01344_KEAP1    PF07534_OXR1         Lung                    0.000000218
PF01344_KEAP1    PF07707_KLHL3        Lung                    0.000000218
PF01344_KEAP1    PF13465_ZNF434       Lung                    0.000000218
PF01344_KEAP1    PF00571_PRKAG1       Lung                    0.000000218
PF01344_KEAP1    PF02301_MAD2L1       Lung                    0.000000218
PF01344_KEAP1    PF03463_ETF1         Lung                    0.000000218
PF01344_KEAP1    PF03464_ETF1         Lung                    0.000000218
PF01344_KEAP1    PF03465_ETF1         Lung                    0.000000218
PF01344_KEAP1    PF04106_ATG5         Lung                    0.000000218
PF01454_MAGEB2   PF11618_RPGRIP1      Lung                    0.00000917
PF00638_NUP50    PF00514_KPNA6        Salivary gland          0.00000455
PF00638_NUP50    PF01749_KPNA6        Salivary gland          0.00000455
PF00638_NUP50    PF00514_KPNA1        Salivary gland          0.00000455
PF00638_NUP50    PF00514_KPNA2        Salivary gland          0.00000455
PF00638_NUP50    PF00514_KPNA3        Salivary gland          0.00000455
PF00638_NUP50    PF00514_KPNA4        Salivary gland          0.00000455
PF00638_NUP50    PF01749_KPNA1        Salivary gland          0.00000455
PF00638_NUP50    PF01749_KPNA2        Salivary gland          0.00000455
PF00638_NUP50    PF01749_KPNA4        Salivary gland          0.00000455
PF00638_NUP50    PF01749_KPNA5        Salivary gland          0.00000455
PF03133_TTLL10   PF00038_KRT40        Salivary gland          0.00000455
PF03133_TTLL10   PF00076_RBPMS        Salivary gland          0.00000455
PF03133_TTLL10   PF00412_FHL3         Salivary gland          0.00000455
PF03133_TTLL10   PF00412_TRIP6        Salivary gland          0.00000455
PF03133_TTLL10   PF05485_THAP1        Salivary gland          0.00000455
PF03133_TTLL10   PF06292_CADPS        Salivary gland          0.00000455
PF00071_RAC1     PF00038_KRT40        Skin                    2E-32
PF00071_RAC1     PF00515_CDC23        Skin                    2E-32
PF00071_RAC1     PF00595_USH1C        Skin                    2E-32
PF00071_RAC1     PF04049_CDC23        Skin                    2E-32
PF00071_RAC1     PF06818_LZTS2        Skin                    2E-32
PF00069_SCYL3    PF00046_LHX8         UT_PF00069_SCYL3        0.000000347
PF00069_SCYL3    PF00412_LHX8         UT_PF00069_SCYL3        0.000000347
PF00069_SCYL3    PF00085_TXN2         UT_PF00069_SCYL3        0.000000347
PF00069_SCYL3    PF02991_GABARAPL1    UT_PF00069_SCYL3        0.000000347

Table 3-12 Ten pairs of domain instances that are inferred to mediate protein interaction with each other, for which both domains in the pair were found to be significantly mutated in the same cancer type.

Domain Instance 1  Domain Instance 2  Primary Site
PF01582_MYD88      PF01582_MYD88      Hematopoietic & lymph
PF13465_BCL6       PF13465_BCL6       Hematopoietic & lymph
PF00638_NUP50      PF00514_KPNA5      Salivary gland
PF00010_MAX        PF00010_MXI1       Endometrium
PF03166_SMAD4      PF00788_RASSF5     Colon
PF01344_KEAP1      PF03247_PTMA       Lung
PF01344_KEAP1      PF00564_SQSTM1     Lung
PF01344_KEAP1      PF07648_RECK       Lung
PF00642_U2AF1      PF00076_RNPS1      Lung
PF00071_KRAS       PF00227_PSMA3      Lung
PF00642_U2AF1      PF01602_AP2B1      Lung
PF03166_SMAD4      PF00412_LMO4       Colon

This domain-level study identified known and novel candidate driver mutations and provided clues to the functional effects of tumor-associated somatic mutations. In total, 41 out of the 100 SMDs I identified are encoded by Cancer Census genes. Among the remaining 59 novel candidate driver genes, many domain instances belong to well-known cancer-associated domain types, such as the Pkinase domain type and the WD40 domain type, supporting the idea that this set contains many cancer driver genes that are not yet annotated as such. By comparing the domain-level mutational landscapes of different cancers generated by my study to previously reported gene-level mutation landscapes in small cell lung cancer, melanoma, colon cancer, and breast cancer (Cancer Genome Atlas 2012, Hodis, Watson et al. 2012, Imielinski, Berger et al. 2012, Stephens, Tarpey et al. 2012, Kim, Jung et al. 2013), I noticed at least ten cancer-type-specific SMDs that do not correspond to any previously reported highly mutated cancer-associated genes. For each cancer type, I found at least one new potential cancer-associated domain instance, for example, the diacylglycerol kinase domain encoded by DGKZ in chondrosarcoma. The DGKZ protein (using the diacylglycerol kinase domain) usually acts as a sentinel and can control p53 function both during normal homeostasis and during stress response (Tanaka, Okada et al. 2013). Other examples include the two cadherin domain instances encoded by PCDH11X and PCDH11Y in glioblastoma. These domain instances are thought to play important roles in cell-cell communication and are essential for a normally functioning central nervous system (Yoshida and Sugano 1999). Moreover, all eight tumor samples that contained mutations in PCDH11Y (on the Y chromosome) were also mutated at a corresponding position in the X-chromosome homolog PCDH11X. Another four (female) samples had mutations detected on both alleles of PCDH11X (Lemaire, Fremeaux-Bacchi et al. 2013). These alleles each contained one of the novel hotspot mutations p.T486 or p.G442 in the cadherin domain, suggesting a potential role for these hotspot mutations as important recessive driver mutations in glioblastoma.

Because Nehrt et al. (Nehrt, Peterson et al. 2012) had previously identified significantly mutated domain types for breast and colon cancer, I wished to assess the novelty of the SMDs I found for these cancer types. Of the 23 SMDs I identified for colon cancer, 20 were novel relative to domain types previously identified by Nehrt et al. (I confirmed three domain types: the PI3K_p85B domain encoded by PIK3CA, the MH2 domain encoded by SMAD4 and the P53 DNA binding domain encoded by TP53). Of the 12 SMDs I identified for breast cancer, only three correspond to highly mutated domain types reported in the study by Nehrt et al. (the PI3K_p85B domain and PI3Ka domain encoded by PIK3CA, and the P53 DNA binding domain encoded by TP53). I note that, even where an SMD corresponds to a domain type previously found to be significantly mutated, my analysis in this case identifies individual domain instances as significantly mutated, as opposed to domain types for which the mutations may be spread across multiple genes. In summary, 20 out of 23 (87%) of the colon-cancer-associated SMDs, and 9 out of 12 (75%) of the breast-cancer-associated SMDs found here are novel relative to Nehrt et al.

My study also differs from Nehrt et al. in that I only reported domain instances for which enrichment relative to other regions was significantly greater in one cancer type than in all other cancer types. This procedure controlled both for mutation rates within each cancer type, and for different rates of mutation across cancer types in each domain relative to others. Although a previous study (Lawrence, Stojanov et al. 2013) has pointed to the dangers of candidate driver gene identification through mutation frequency analysis, I note that none of the SMDs I identified fell within the 18 genes for which mutational enrichment was reported to be spurious (Lawrence, Stojanov et al. 2013). In addition to correspondence of the discovered SMDs to known cancer-relevant domain families, my set of novel driver gene candidates overlapped significantly with a large-scale screen for cancer genes based on transposon mutagenesis in mouse. Together, these results indicate that we may be far from having a complete catalogue of cancer-associated genes and that domain-level mutation landscape analysis offers an opportunity to identify new driver genes.

I note that the cancer missense somatic mutation data I mined came from 71 unbiased studies, and that data from unbiased studies tend to contain a higher proportion of passenger mutations compared to data from targeted studies (Tamborero, Gonzalez-Perez et al. 2013). I therefore chose a relatively conservative significance threshold, necessarily causing me to overlook many candidate driver genes, which might be recovered in the future through larger data sets and use of prior information about cancer relatedness. Note that protein domains can be found to be significantly mutated in more than one tumor type, because for each domain I compared mutational density in one cancer type versus all others. However, the approach I took could miss protein domains that are ubiquitously mutated (at a similar mutational density) across many cancer types.

I also note that this study focused only on missense mutations across 21 cancer types, and did not consider other mutation types, such as copy number variations (CNVs). Higher copy number makes mutation detection harder, and somatic mutations in tumor suppressors may preferentially appear in the background of a deletion of the other allele. In the future, I could integrate CNV data into this analysis.

Instead of identifying significantly mutated domains, one could look for significant clusters of mutations in protein sequence. The main problem with such approaches is that the boundaries of clusters cannot be defined ahead of time. Therefore, a search for cancer-type-specific clusters of mutations would (either explicitly or implicitly) introduce many more hypotheses than this study, which used predefined domain boundaries. Although it would extend the range of detectable phenomena beyond protein domains, a more exhaustive search for clusters could severely limit the power to detect cancer-type-specific effects at domains after multiple-test correction. Such a multiple-test correction would also be challenging given the strong dependence between hypotheses for overlapping sequence regions. Recently, several studies have been conducted to find mutation clusters, such as the 3D clusters introduced by Kamburov et al. (Kamburov, Lawrence et al. 2015). This is also a good direction for future study.

Chapter 4 Predictive models of immunogenicity for somatic mutations

The work described in this chapter can also be found in a manuscript entitled “Predictive models of immunogenicity for somatic mutations”, which is in preparation for submission.

I performed all the analysis except for identifying the HLA type of patients. This was performed by our collaborators, Drs. Hidewaki Nakagawa and Seiya Imoto from Tokyo University.

4.1 Introduction

Cancer cells often differ from normal cells in two respects. First, tumor cells may accumulate at least tens of ‘driver’ mutations that lead to cellular transformation, in addition to a generally greater number of non-causal ‘passenger’ mutations. Second, tumor cells often express many proteins at higher levels due to genetic and epigenetic alterations (Gubin, Artyomov et al. 2015). These changes provide cancer cells with an altered repertoire of MHC class I-displayed peptides and MHC class II-displayed peptides (either displayed by the tumor cell itself, if it expresses MHC class II proteins, or displayed by nearby antigen-presenting cells that have acquired antigen from tumor cells). MHC-displayed antigens that distinguish tumor from normal cells can promote recognition by T cells and thus destruction by the immune system (Vesely and Schreiber 2013).

Tumor antigens can be classified into two categories: tumor-associated self-antigens (which may be displayed by other normal cell types even if not displayed by the normal cell type from which the tumor was derived) and antigens derived from tumor-specific mutant proteins. Tumor-specific mutations are ideal targets for cancer immunotherapy, because any ‘neo-antigens’ they produce are less likely to be present in healthy cells or tissues and can potentially be recognized by the mature T-cell repertoire (Kelderman and Kvistborg 2016). Also, it has been reported that mutant neo-antigens are likely to be more immunogenic, presumably due to the T-cell maturation process in which T-cells capable of high-avidity recognition of self-antigens are eliminated (Yadav, Jhunjhunwala et al. 2014). Immuno-therapy approaches, however, have been hampered by the fact that every tumor possesses a unique set of mutations that must first be identified (Heemskerk, Kvistborg et al. 2013). Moreover, individual patients can differ dramatically in their immune systems, based on HLA type and other allelic variation in immune genes, as well as their unique repertoire of mature immune cells. Thus, personalized immuno-therapy could benefit the patient during cancer treatment (Haughton and Amos 1968, Prehn 1969, Hutchinson 2012). After recognition, the process of tumor cell killing by T-cells may release more tumor neo-antigens, in a potentially therapeutic cycle.

In principle, any coding genetic variant has the potential to generate mutated peptides that are presented by MHC class I molecules, leading to subsequent recognition and clearance by cytotoxic T cells. However, to bring this personalized treatment approach to tumor patients, one crucial challenge is determining the MHC binding potential of non-self-peptides that arise from somatic tumor mutations, and determining which among them are potential neo-antigens in patients with different MHC types in different cancers.

To improve our understanding of neo-antigenicity in cancer, I conducted several analyses related to the distribution pattern of somatic mutations in genes encoding MHC-binding peptides across different cancer types. Here, I quantified the impact of antigenicity on a spectrum of tumor missense somatic mutations. My analysis quantified an expected qualitative behaviour: counter-selection of nascent tumor cells could lead to a depletion of mutations in ‘mature’ tumor cells that are capable of generating MHC class I-presented non-self-peptides.

The project consisted of five main parts: First, I quantified the extent to which somatic mutations are significantly depleted in peptides that are predicted to be displayed by MHC class I proteins. Second, I improved this analysis with a refined prediction of peptide display that is based on individual HLA alleles rather than display expected of an ‘average’ HLA repertoire. Third, I characterized the dependence of this depletion on expression level. Fourth, I further improved this analysis by relating the mutation density to the number of peptide-displaying HLA alleles. Fifth and finally, I developed a system to score each somatic mutation according to its propensity for immunogenicity, using the extent of under-representation of somatic mutant classes as a proxy measure of immunogenicity.

The ‘antigenicity scores’ resulting from this project predict the a priori immunogenicity of any given somatic mutation (Figure 4-1). Of course, if a somatic mutation has been observed, it has obviously not yet been cleared. However, these models could predict the presence of ‘cryptic immunogenicity’ that could be ‘revealed’, e.g. by de-silencing HLA loci within cancer cells, or by relieving tumour-derived suppression of immune cells. These antigenicity scores could therefore guide immunotherapy or aid in developing personalized cancer vaccines.

Figure 4-1 Outline for antigenicity score of cancer mutations

4.2 Methods

4.2.1 Prediction of MHC binding peptides

During the first phase of this study, I collected all the non-synonymous and synonymous mutations generated by 178 whole genome/exome sequencing studies from the COSMIC database. I introduced each mutation into the corresponding protein sequence. I specifically focused on the 1,518,399 non-synonymous and 394,862 synonymous mutations distributed across ~27,186 annotated protein-coding genes. I uploaded all the mutated protein sequences to the NetMHC server (Nielsen, Lundegaard et al. 2003, Melhem, Muhanna et al. 2006). Using the NetMHC server, I predicted the MHC binding peptides associated with 12 common HLA class I alleles: HLA-A*01, HLA-A*02, HLA-A*03, HLA-A*24, HLA-A*26, HLA-B*07, HLA-B*08, HLA-B*15, HLA-B*27, HLA-B*39, HLA-B*40, and HLA-B*58.

During the second phase of this study, I collected the 121258 non-synonymous somatic cancer mutations in 10745 genes detected from 2834 patient samples provided by the PCAWG project (Kreiter, Vormehr et al. 2015, Kreiter, Vormehr et al. 2015). Similarly, I introduced each mutation into the corresponding protein sequence, and predicted HLA class I binding peptides for the above-listed 12 common HLA alleles using the NetMHC server.


In this study, NetMHC scores were obtained for MHC binding peptides of length nine. (Although it is possible for peptides with 10 or 11 residues to bind, this is less common and such cases are more difficult to predict.) Also, only strong MHC class I binding peptides with NetMHC affinity score of 50 or less were selected (where smaller NetMHC scores correspond to higher affinity).
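The peptide selection described above can be illustrated with a minimal Python sketch. It only covers the bookkeeping around the predictor: introducing a missense change, listing the nine-residue windows that contain the mutated position, and keeping peptides whose score is 50 or less. The function names, the example sequence and the score dictionary are hypothetical; the scores themselves would come from NetMHC, which is not called here.

```python
from typing import Dict, List, Tuple

PEPTIDE_LENGTH = 9         # peptide length used in this chapter
STRONG_BINDER_MAX = 50.0   # "NetMHC affinity score of 50 or less"


def apply_missense(protein: str, position: int, ref: str, alt: str) -> str:
    """Return the protein with residue `position` (1-based) changed from ref to alt."""
    if protein[position - 1] != ref:
        raise ValueError("reference residue does not match the protein sequence")
    return protein[: position - 1] + alt + protein[position:]


def windows_covering(protein: str, position: int,
                     k: int = PEPTIDE_LENGTH) -> List[Tuple[int, str]]:
    """List (1-based start, peptide) for every k-mer that contains `position`."""
    first = max(1, position - k + 1)
    last = min(position, len(protein) - k + 1)
    return [(s, protein[s - 1: s - 1 + k]) for s in range(first, last + 1)]


def strong_binders(scores: Dict[str, float],
                   cutoff: float = STRONG_BINDER_MAX) -> List[str]:
    """Keep peptides whose predicted score is at or below the cutoff.

    `scores` is a hypothetical mapping from peptide to predicted score.
    """
    return sorted(p for p, s in scores.items() if s <= cutoff)
```

A mutation in the interior of a protein is covered by up to nine overlapping 9-mers, so a single missense change can create several candidate peptides; near either end of the protein fewer windows exist.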

4.2.2 Calculate the depletion of mutations within HLA class I binding peptides

Based on the whole genome or whole exome sequencing data from the COSMIC database, I first calculated the total number of non-synonymous and synonymous mutations falling within predicted MHC binding peptide regions for each protein. Correspondingly, I calculated the total number of non-synonymous and synonymous mutations falling outside of the predicted MHC binding peptide regions. A ratio Rin was calculated for each protein considering only predicted MHC binding peptides, defined as the number of non-synonymous alterations divided by the sum of synonymous and non-synonymous alterations. The non-synonymous mutation density within MHC binding peptides of each protein was calculated as the total number of non-synonymous mutations normalized by the corresponding cumulative length of MHC binding peptides. Similarly, a ratio Rout was calculated for each protein considering only non-MHC binding peptides, defined as the number of non-synonymous alterations divided by the sum of synonymous and non-synonymous alterations (Figure 4-2). I used the Mann–Whitney U-test to evaluate the significance of the difference in mutation counts between regions within and outside MHC binding peptides of the same gene. All of these analyses were conducted using the “stats” package in R.
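The per-protein quantities defined above can be written out as a short Python sketch (the thesis analyses themselves were run in R). The `RegionCounts` container and the function names are hypothetical, and the counts in the usage note are invented for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RegionCounts:
    nonsyn: int   # non-synonymous mutations falling in the region
    syn: int      # synonymous mutations falling in the region
    length: int   # cumulative length of the region


def nonsyn_fraction(c: RegionCounts) -> Optional[float]:
    """Rin or Rout: non-synonymous / (synonymous + non-synonymous)."""
    total = c.nonsyn + c.syn
    return c.nonsyn / total if total else None


def nonsyn_density(c: RegionCounts) -> Optional[float]:
    """Non-synonymous mutations normalized by cumulative region length."""
    return c.nonsyn / c.length if c.length else None
```

For a protein with, say, `RegionCounts(nonsyn=4, syn=4, length=90)` inside predicted MHC binding peptides and `RegionCounts(nonsyn=18, syn=2, length=410)` outside, `nonsyn_fraction` gives Rin = 0.5 and Rout = 0.9. Returning `None` for regions with no mutations (or zero length) keeps undefined ratios out of the downstream paired comparison.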


Figure 4-2 Predicting MHC binding peptides and calculating mutation density

For all the non-synonymous mutations from the 2834 PCAWG patient samples, I performed similar calculations. For each protein, I calculated the mutation density either within MHC binding peptides or out of MHC binding peptides.

4.2.3 The expression level of each gene

The median expression level of each gene in a given cancer type was calculated based on the RNAseq data from the TCGA database (listed in Table 1-1); this is called the cancer type-specific expression level of each gene. For each gene, I also calculated the median expression level across the 34 cancers; this is called the general expression level of that gene. Then, within the expressed genes and non-expressed genes, I further tested for a tendency toward mutation depletion in HLA binding peptides in the same way as described in Chapter 4.2.2.
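The two expression summaries can be sketched in Python as follows. The record layout `(gene, cancer_type, expression)` is hypothetical, and the sketch assumes that the general expression level is the median of the cancer type-specific medians; taking the median over all pooled samples instead would be an equally plausible reading of the text.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, float]  # (gene, cancer_type, expression), hypothetical layout


def cancer_type_medians(records: Iterable[Record]) -> Dict[Tuple[str, str], float]:
    """Median expression of each gene within each cancer type."""
    grouped = defaultdict(list)
    for gene, cancer_type, value in records:
        grouped[(gene, cancer_type)].append(value)
    return {key: median(values) for key, values in grouped.items()}


def general_medians(type_medians: Dict[Tuple[str, str], float]) -> Dict[str, float]:
    """Median across cancer types of the per-type medians (an assumption)."""
    per_gene = defaultdict(list)
    for (gene, _cancer_type), value in type_medians.items():
        per_gene[gene].append(value)
    return {gene: median(values) for gene, values in per_gene.items()}
```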

4.2.4 Relating predicted HLA binding peptides to personal HLA types

The six-digit HLA type of the 2834 patients was provided by my collaborators, Dr. Seiya Imoto and Dr. Hidewaki Nakagawa from Tokyo University. For each of the peptides predicted to bind MHC proteins encoded by one of the above-listed 12 common HLA types, I matched the HLA presenting allele predicted by NetMHC with the corresponding patient HLA type.

In order to conduct this study, the predicted MHC binding peptides of each mutated gene were separated into three types (Figure 4-3): “D2”, where only one HLA class I allele type is predicted to display the peptide, and the patient is homozygous at this locus and thus carries two copies of this displaying allele; “D1”, where only one HLA class I allele type can display the peptide, and the patient is heterozygous at this locus and thus has only one allele that can display the peptide; and “D0”, where the patient has zero HLA class I alleles that are predicted to display the peptide. For each gene, I calculated the mutation density within predicted MHC binding peptides for each type (DD2, DD1, and DD0). Then, I further calculated the ratio of DD2 and DD1 to DD0 (these ratios are referred to as R2 and R1, respectively).
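The classification into D0, D1 and D2, and the ratios R2 and R1, can be sketched in Python as below. The allele strings and function names are hypothetical; the sketch assumes that the patient's genotype at the relevant locus is given as a pair of allele names written at the same resolution as the predicted displaying allele.

```python
from typing import Optional, Sequence


def display_class(displaying_allele: str, patient_alleles: Sequence[str]) -> str:
    """Return "D0", "D1" or "D2" by counting copies of the displaying allele.

    `patient_alleles` holds the two alleles carried at the relevant HLA locus.
    """
    if len(patient_alleles) != 2:
        raise ValueError("expected exactly two alleles at the locus")
    copies = sum(1 for allele in patient_alleles if allele == displaying_allele)
    return f"D{copies}"


def density_ratio(density_displayed: float, density_d0: float) -> Optional[float]:
    """R2 or R1: density in D2 (or D1) peptides divided by density in D0 peptides."""
    return density_displayed / density_d0 if density_d0 else None
```

For example, a peptide predicted to be displayed by HLA-A*02 falls in D2 for a patient typed `("HLA-A*02", "HLA-A*02")`, in D1 for `("HLA-A*02", "HLA-A*24")`, and in D0 for `("HLA-A*01", "HLA-A*24")`.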

Figure 4-3 Three types of MHC binding peptides based on patient HLA types

4.3 Results

4.3.1 Depletion of mutations within predicted HLA binding peptides

To assess the likelihood of a single nucleotide mutation being displayed within an MHC binding peptide, the non-synonymous and synonymous mutation datasets based on 178 whole genome/exome sequencing studies were collected from the COSMIC database. The potential MHC class I binding peptides were predicted via NetMHC (Nielsen, Lundegaard et al. 2003, Melhem, Muhanna et al. 2006). The non-synonymous and synonymous mutations were separated into two groups: those within predicted MHC binding peptides and those outside predicted MHC binding peptides. Then, the mutational distribution patterns for MHC binding and non-MHC binding peptides were calculated. Of the 710,365 alterations within MHC binding peptides, 29% were synonymous. By contrast, 16% of the 1,202,896 alterations within non-MHC binding peptides were synonymous. Thus, somatic missense mutations are 15% less likely to be detected within the MHC binding peptides (Chi-Squared = 4342.829, P-value = 5.0 × 10⁻⁴). For each given cancer-associated protein, I calculated: 1) the ratio (Rin) of non-synonymous alteration counts within MHC binding peptides to the total alteration counts within MHC binding peptides (for the same protein); 2) the same ratio for regions outside of MHC binding peptides (Rout). As shown in Figure 4-4, most Rin values range from 0.4 to 0.7, whereas Rout values are higher (from 0.7 to 0.9; paired Wilcoxon test, P-value < 2.2e-16). Here, we see three bins with increased density, which likely correspond to the peaks expected at 0.33, 0.5 and 0.66 when ratios are taken of small integer mutation counts within small regions. Our analyses are consistent with expected behavior in that missense mutations within predicted MHC binding peptides should be expected to have a greater tendency to be antigenic and thus counter-selected due to clearance by the immune system. Although this phenomenon was to be expected, this is to our knowledge the first quantification of the impact of immune counter-selection on somatic mutations.
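A comparison of synonymous and non-synonymous counts inside and outside MHC binding peptides amounts to a test on a 2×2 table. The following Python sketch shows one way such a test can be computed; it uses the Pearson chi-square statistic without continuity correction, which is an assumption, since the exact settings used for the value reported above are not stated. The function name is hypothetical and the counts in the usage note are toy numbers, not the thesis data.

```python
import math
from typing import Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square test (1 degree of freedom) for the table [[a, b], [c, d]].

    Rows could be "within" and "outside" MHC binding peptides, and columns
    "non-synonymous" and "synonymous". No continuity correction is applied.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must contain at least one count")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value
```

With the toy table `chi_square_2x2(10, 20, 20, 10)` the statistic is 20/3 (about 6.67) and the P-value is just under 0.01.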


Figure 4-4 The percentage of all mutations of all types that are missense somatic mutations, calculated both within MHC binding peptides and out of MHC binding peptides

4.3.2 Depletion of mutations within expressed predicted MHC binding peptides

Although mutations within predicted MHC binding peptides were predicted to be counter-selected, I still found many “likely-displayed” mutations within MHC binding peptides. Because a peptide must be expressed within the tumor to be displayed, I further characterized the dependence of this depletion on estimated gene expression levels. I analyzed the relationship between the non-synonymous mutation density within MHC binding peptides and the expression level of the corresponding protein using RNASeq data from 29 cancer types (see Methods). The normalized expression level of each gene in each cancer type was calculated.


Then, both for mutations within MHC binding peptides and out of MHC binding peptides, I calculated the average mutation density based on the gene expression level. The relationship between mutation density and expression level of each gene was assessed. My results indicated that mutation density and expression level are negatively correlated as expected, and the average mutation density within MHC binding peptides is lower than that out of MHC binding peptides at different expression levels (Figure 4-5).
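Averaging mutation density by expression level can be sketched in Python as follows. The sketch assumes that genes are ranked by expression and split into equal-sized rank bins (for example deciles); the exact binning used for Figure 4-5 is not specified in the text, and the function name and input layout are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def mean_density_by_expression_bin(
    genes: Sequence[Tuple[float, float]], n_bins: int = 10
) -> List[Optional[float]]:
    """Mean mutation density per expression bin, lowest-expressed bin first.

    `genes` holds (expression level, mutation density) pairs, one per gene.
    Bins with no genes are reported as None.
    """
    ranked = sorted(genes, key=lambda gene: gene[0])
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for rank, (_expression, density) in enumerate(ranked):
        bin_index = rank * n_bins // len(ranked)  # always in 0 .. n_bins - 1
        sums[bin_index] += density
        counts[bin_index] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]
```

Running the function once on densities within MHC binding peptides and once on densities outside them gives the two curves that are compared across expression levels.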

Figure 4-5 The average mutation density for genes with different expression percentiles. For each tumor, expression levels were from cancer samples matched by cancer type.

Moreover, within expressed genes, I compared the mutation density between MHC binding and non-MHC binding peptides, and my results again showed that mutations are more likely to be observed within non-MHC binding peptides (Wilcoxon test, P-value < 2.2e-16). As a control, I further compared the mutation density within and out of MHC binding peptides in 2987 non-expressed genes. My results indicated that there was no significant depletion of missense somatic mutations in MHC binding peptides (Figure 4-6). The mutation density in MHC binding peptides (0.0058 per 1Kbp) was even higher than that out of MHC binding peptides (0.0041 per 1Kbp).


Figure 4-6 Mutation density in expressed genes within MHC binding peptides and out of MHC binding peptides. High expression level genes are those genes whose expression level is higher than the median expression level. Low expression level genes are those genes whose expression level is lower than the median expression level. In each case, expression levels were from cancer samples matched by cancer type.

Thus, our analysis confirmed the expected phenomenon that somatic mutations within MHC binding peptides are depleted, and that this phenomenon is restricted to expressed peptides. That the phenomenon was not observed for non-expressed peptides provides a crucial negative control that rules out many potential sources of artifact for the original observation of depletion. This negative control also supports the relevance of the expression datasets we used.

Finally, we note that although the phenomenon of depletion for somatic mutations corresponding to displayed non-self-peptides was expected, and it was expected that this phenomenon would depend on expression, this is to our knowledge the first quantification of this phenomenon.

4.3.3 Depletion of mutations within predicted patient-displayed HLA binding peptides

The analysis above used predictions of MHC display that were based on a scenario in which the HLA type of an individual patient is unknown. Specifically, display predictions were based on a hypothetical (and unrealistic) patient that bears all 12 of the common HLA types. Individual patients can differ dramatically in their immune systems, based on allelic variation in HLA type and other immune genes, and their unique repertoires of mature immune cells. Therefore, I further assessed the likelihood of missense mutations being displayed within an MHC binding peptide based on individual HLA types.

Because predicted HLA type was available for the 2834 patient samples provided by the PCAWG project (Kreiter, Vormehr et al. 2015, Kreiter, Vormehr et al. 2015), we examined the 121258 corresponding non-synonymous somatic cancer mutations from PCAWG detected in 10745 genes. Potential MHC binding peptides of the 10745 genes were predicted using the NetMHC server, using the above-listed 12 common HLA class I alleles. There were 3853 genes in which the variant was predicted to be neo-antigenic, i.e., presented by the MHC class I protein of the patient carrying this mutated gene. For these 3853 genes, I confirmed the tendency for depletion of mutations within MHC binding peptides relative to non-MHC binding peptides (Wilcoxon test, P-value = 4.6e-16).

My analysis showed that missense mutations tend to be counter-selected within MHC binding peptides, both in an idealized patient with unknown HLA type, and when accounting for HLA type in each specific patient sample. In each case, the phenomenon depended on expression level of the gene encoding that peptide (as estimated from expression levels of a cancer of matched type, Figure 4-7). In all subsequent analyses in this section, we considered only peptides expressed in a matched cancer type.


Figure 4-7 Dependence of mutation depletion on expression level in patient-displayed MHC binding peptides

In the above analysis, I only considered for each peptide whether or not the patient carried an HLA allele predicted to display that peptide, but did not consider how many copies of the displaying allele were present in that patient. I hypothesized that peptides for which two copies of the displaying HLA allele were present would be more efficiently displayed. (This could be due either to increased expression of the displaying allele due to increased gene dosage, or to a decreased chance that the displaying allele would be silenced where the phenomenon of monoallelic expression occurs (Gimelbrant, Hutchinson et al. 2007).) To assess this hypothesis, I further tested, for patient samples where ‘likely-displayed’ mutations were found, whether the number of alleles that can display the HLA binding peptides was associated with the extent of mutation depletion.

As described in Chapter 4.2.4, the MHC binding peptides were separated into three categories, D0, D1 and D2 (Figure 4-3). I calculated the mutation density within predicted MHC binding peptides for each type (DD2, DD1, and DD0). I found that in both D2 and D1, the mutation density within patient-displayed MHC binding peptides is lower than that out of MHC binding peptides of the same protein. In predicted MHC binding peptides that cannot be displayed by the patient's HLA molecules, the mutation density is higher than that out of the predicted MHC binding peptides of the same protein (Figure 4-8). Here, we noticed that the mutation density out of the MHC binding peptides in “No alleles” is lower than that in “Two alleles” or “One allele”. This may be understood from the combination of two ideas: 1) a patient has a finite number of HLA alleles (generally 12); 2) if a patient has HLA alleles for which NetMHC predictions are possible, but the mutation is not predicted to be displayed by them, then this reduces the number of HLA alleles that could display the peptide. Thus, where the patient has no alleles subject to NetMHC prediction, they have 12 ‘chances’ to display the peptide, whereas if the patient has two alleles which can be confidently predicted NOT to display the peptide, they only have 10 ‘chances’ and thus a higher mutational density.

Figure 4-8 Average mutation density in peptides predicted to be displayed by at least one of the 12 common HLA-A or HLA-B allele types (“in MHC binding peptides”) or not displayed by any of the 12 common HLA-A or HLA-B allele types (“out of MHC binding peptides”), for different numbers of predicted-displaying HLA alleles. Where a patient has allele types from the common set of 12 alleles at both the HLA-A and HLA-B loci, but none of these alleles are predicted to display, we say “No alleles”. Where a patient has an allele type outside of the common set of 12 alleles at either the HLA-A or HLA-B loci, we say that the number of displaying alleles is “Unknown”.

The ratio of mutations within and out of patient-displayed HLA binding peptides showed that R2 (DD2/DD0) and R1 (DD1/DD0) generally range from 0 to 1, but R2 is lower than R1 (Wilcoxon test, P-value < 2.2e-16; Figure 4-9). These results therefore suggest that HLA class I alleles that can display a particular peptide are generally incompletely dominant, i.e., that having two copies of the display-enabling allele is more effective in displaying that peptide than having just one copy.

Figure 4-9 Ratio of mutation density in MHC binding peptides to that of “non-MHC binding peptides” displayed by one HLA allele (R1, colored in red), by two alleles (R2, colored in blue)

I note that the terms “in MHC binding peptides” and “out of MHC binding peptides” were applied based on whether or not peptides were predicted to be displayed by at least one of the 12 common HLA-A or HLA-B allele types. Similar to the phenomenon discussed above, I expect that the phenomenon of depletion of somatic mutations “out of MHC binding peptides” is observed where patients do not have a common displaying allele type because such peptides are more likely to be displayed by one of the other HLA alleles for which NetMHC predictions are not available.

For patients with different specific HLA type combinations, I compared the mutation density within predicted-displayed MHC binding peptides. My analysis showed that the overall mutation density within peptides predicted to be displayed given the patient’s HLA type was lower when the patient was homozygous at an HLA locus than when heterozygous (Figure 4-10). Previous studies have associated heterozygosity at HLA loci with a favorable disease outcome (Doherty and Zinkernagel 1975, Zinkernagel and Doherty 1976, Lipsitch, Bergstrom et al. 2003). However, this does not conflict with my finding, given that a more diverse repertoire of displayed peptides may provide a greater advantage than increasing the efficiency of display for a more limited peptide repertoire.

Figure 4-10 Mutation density for samples with different HLA types. Samples on the diagonal of the heat map are homozygous at HLA loci.


Based on these analyses, I developed a model that produces an ‘antigenicity score’ for any input somatic coding mutation. Figure 4-11 shows the distribution of the antigenicity score (depletion score) for different MHC binding peptide types at different expression levels. Expressed MHC binding peptides that can be displayed by two patient HLA alleles are more immunogenic.
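The idea of scoring a mutation by the depletion of its class can be illustrated with a small Python sketch. The formula below is purely an assumption made for illustration and is not the model's actual definition, which is not spelled out in this section: each class, defined by display type (D0, D1 or D2) and an expression bin, is scored as one minus its mutation density relative to a reference density, floored at zero. All names and numbers are hypothetical.

```python
from typing import Dict, Tuple

ClassKey = Tuple[str, str]  # (display type, expression bin), e.g. ("D2", "high")


def depletion_score(density_in_class: float, density_reference: float) -> float:
    """Assumed score: 1 - relative density, floored at 0 for non-depleted classes."""
    if density_reference <= 0:
        raise ValueError("reference density must be positive")
    return max(0.0, 1.0 - density_in_class / density_reference)


def build_score_table(
    class_density: Dict[ClassKey, float], reference: Dict[str, float]
) -> Dict[ClassKey, float]:
    """Score every class against the reference density of its expression bin."""
    return {
        key: depletion_score(density, reference[key[1]])
        for key, density in class_density.items()
    }
```

A new mutation would then be scored by looking up the class it falls into. Under this assumed formula, a class whose density is half the reference receives a score of 0.5, and a class with density at or above the reference receives 0.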

4.4 Discussion

Figure 4-11 The distribution of depletion score of each MHC binding peptide type at different expression levels.

In this chapter, I examined signatures of immune selection pressure on the distribution of somatic mutations. I quantified the extent to which somatic mutations are significantly depleted in peptides that are predicted to be displayed by MHC class I proteins, and characterized the dependence of this depletion on expression level. I also examined whether immune selection pressure on somatic mutations changes depending on whether there are either one or two HLA alleles that can display the peptide. My results indicate that HLA class I alleles are, in general, incompletely dominant, i.e., that having two copies of the display-enabling allele is more effective in displaying that peptide than having just one copy. I note that all the MHC binding peptides used in my analysis were predicted using the NetMHC server. The overall prediction accuracy reported for NetMHC was evaluated using the Pearson correlation coefficient (PCC) and the area under the receiver operating characteristic curve (AUC). For 9-mer MHC binding peptides, the PCC value is 0.71 and the AUC value is 0.86 using a 500nM classification threshold (Lundegaard, Lamberth et al. 2008).

As I mentioned in Chapter 4.3.3, having two copies of the display-enabling allele is more effective in displaying that peptide than having just one copy. This could simply be a gene-dosage effect, but there are alternative explanations: Monoallelic expression (MAE), in which only one allele of a given gene is expressed, is a frequent genomic event in normal tissues. MAE-derived silencing of one or more HLA-encoded alleles could potentially cause failure to express MHC-binding-peptide-encoding genes, which may, in turn, alter the immunogenicity of somatic mutations. A previous study showed that the genome-wide rate of MAE was higher in tumor cells than in normal tissues, and the MAE rate was increased with specific tumor grade. Oncogenes exhibited significantly higher MAE in high-grade compared with low-grade tumors (Olivier 2004, Pastinen, Sladek et al. 2004, Gimelbrant, Hutchinson et al. 2007). The role of MAE in immunogenicity of cancerous cells is entirely unclear. HLA alleles are known to be subject to MAE (Gimelbrant, Hutchinson et al. 2007). Therefore, a future research objective would be to assess the impact of MAE by comparing the mutation rates between homozygous and heterozygous samples at HLA class I loci A and B, respectively, using allele-specific expression data. One example of a potential therapy that might emerge from this study is that de-silencing (either global or targeted) could lead to the display of otherwise-cryptic neo-antigens and therefore to immune clearance of cancerous cells, especially when used in combination with current immunotherapy strategies. If we can better understand the interplay between individual immune systems and the likelihood that cancer cells bearing specific somatic mutations are cleared, we will gain insight into the therapeutic potential of MAE modulation. For example, if MAE can indeed limit peptide display efficiency, then therapies reducing MAE could potentially increase the efficiency of immune clearance of tumor cells.

In the future, we hope to apply the scoring scheme we developed to mutations detected in individual cancer patients. Here, based on whether the mutation is in or out of the patient’s MHC binding peptides, whether it is displayed by one allele or two alleles, whether the corresponding class of peptides shows depletion of mutations or not, and based on expression level of the corresponding gene, we might predict antigenicity of the mutation. (Further advantages might come from using expression data from a specific patient tumor sample, but this is left for future studies.) Predictions of antigenic mutations could help in developing personalized cancer vaccines. Collective predictions of antigenicity for all the mutations in a tumor could help identify tumors for which immunotherapy has greater potential for success. For example, those tumors which are capable of immunosuppression but which carry many antigenic mutations may be more rapidly cleared if the ability to immunosuppress can be blocked via therapy.

Another interesting future study would be to analyze the immunogenic potential of the mutational hotspots and the mutations within the significantly mutated domains listed in Chapter Three.


Chapter 5 Thesis summary and future directions


5.1 Thesis summary

As introduced in Chapter One, this thesis describes my efforts to develop methods and models to assess the functional impacts of human genetic variants, which includes large-scale analysis of somatic tumor mutations, and functional assays to enable large-scale assessment of the pathogenicity of human disease-causing mutations.

In Chapter Two, I described the development of our human-yeast functional complementation assay as a platform to assess pathogenicity of human variants, and as an analysis pipeline for the identification of mutants that affect the functions of human disease-associated genes. I systematically assessed potential complementation relationships between 96 human-yeast ortholog pairs and assessed the functional impact of 179 variants (101 disease- and 78 non-disease-associated variants) from 22 human disease genes. The results showed that functional complementation assays outperform computational methods such as PolyPhen-2, SIFT, and PROVEAN, despite one billion years of divergence between human and yeast. Next, I further expanded the testing dataset to 1060 pairs of human-yeast paralogs. To assess the ability of paralog-based complementation assays to detect disease mutations, I examined 35 variants across seven genes, including both disease- and non-disease-associated variants. I found that human-yeast complementation assays can accurately detect disease mutations, even in the absence of orthology. My study further showed that complementation assays can accurately assess the functional impact of variants that fall outside of the aligned homology region. Finally, I conclude that a systematic search for paralog-based complementation in yeast could more than double the number of human disease genes for which functional complementation assays are available.

In Chapter Three, I described large-scale analysis of the landscape of cancer-type-specific somatic mutations at the protein domain-level. Extensive tumor genome sequencing has provided the raw material to understand mutational processes and to identify somatic variations associated with cancer. The goal of this study was to help separate ‘driver’ from ‘passenger’ mutations and to further understand the functional mechanisms and consequences of driver mutations. I analyzed the distribution pattern of cancer missense somatic mutations at the protein domain level, and gained clues into potential mechanisms whereby mutations in different domains of a multi-domain protein could lead to different cancers. I also provided clues to functional effects of somatic mutations by combining protein structure and protein-protein interaction information. I objectively identified both known and new endometrial cancer hotspots in the tyrosine kinase domain (for example, in the FGFR2 protein mentioned in Section 3.3). I found that one hotspot known for endometrial cancer is also a hotspot in breast cancer, and found new candidate hotspots in the Immunoglobulin I-set domain in colon cancer. Furthermore, I identified different domain-centric mutational distribution patterns between oncoproteins and tumor suppressor proteins. Compared with previous studies that focused on the whole gene or whole protein, the systematic correlation of mutations and cancer types at the domain level has the potential to guide more precise cancer treatments.

In Chapter Four, I quantitatively examined different classes of somatic mutations for evidence of counter-selection by the immune system, and exploited these quantitative signatures of counter-selection to assess the antigenicity of tumor mutations. More specifically, I quantified the extent to which somatic mutations are depleted in peptides that are predicted to be displayed by MHC class I proteins and characterized the dependence of this depletion on expression level. I also examined whether immune selection pressure on somatic mutations changes depending on whether there are either one or two HLA alleles that can display the peptide. My results showed that having two copies of a display-enabling allele is more immunogenic than having just one copy. Combining these results, I developed a model that produces an ‘antigenicity score’ for any input somatic coding mutation.

5.2 Future directions

5.2.1 Identifying genetic interactions affected by human disease-associated mutations

As mentioned in Chapter Two, one experiment that could follow development of a functional complementation assay for a human disease-associated gene is to test how different mutations in these human genes are related to different genetic interaction sub-networks in yeast using SGA.

Using Gateway Cloning, one could insert the human ORF with a disease-associated mutation or the wild-type human ORF into a CEN-based plasmid containing URA3 as the selectable marker. One could then construct a collection of MATα yeast TS strains carrying empty plasmids, plasmids containing wild-type human genes or corresponding human disease-associated mutant alleles, respectively. Those strains could then be used as query strains and crossed to an array of yeast strains each carrying a variable gene deletion, which are known genetic interacting partners with the corresponding TS alleles. Heterozygous diploids will be grown up on selective media and then transferred to media with a poor carbon source and low nitrogen to induce the formation of haploid meiotic spore progeny. Selected TS haploid strains with double mutants and the transformed plasmid could then be grown at semi-permissive temperatures. The affected genetic interactions could be identified when some genetic interactions are observed in double mutant TS strains with mutated human ORFs, but not in the corresponding ones with wild-type human ORFs at semi-permissive temperature. One might then analyze the result to investigate whether mutations in different domains of human genes affect different genetic interaction sub-networks, and predict functional impacts of those mutations based on relevant genetic backgrounds.

5.2.2 Testing impacts of human disease-associated variants in human cell lines

Many human disease genes will not have complementation relationships with their yeast homologs. Starting with a subset of these genes, another potential next step would be to establish a platform for complementation testing in human cell lines using the RNA-guided CRISPR-associated nuclease Cas9 system (Cong, Ran et al. 2013, Mali, Yang et al. 2013, Liu, Shen et al. 2014). Although some examples of this already exist (Torres-Ruiz and Rodriguez-Perales 2016), there have been no published studies aimed at systematically developing complementation assays for the full spectrum of human disease genes.

Systematic studies in ~200 cell lines (Schlabach, Luo et al. 2008, Silva, Marran et al. 2008, Cheung, Cowley et al. 2011, Marcotte, Brown et al. 2012) have defined sets of essential genes under certain growth conditions, providing a phenotype through which allelic functions can be assessed. To develop complementation assays, one might first create a “landing pad” in each relevant cell line, which can provide a chromosomal substrate for Cre recombination-mediated cassette exchange. One might then also create Cre-delivering lentiviral vectors with a wild-type or a mutated human gene without introns (Yang, Boehm et al. 2011) that can be integrated into the human genome using the Cre/loxP system (Du, Hu et al. 2009, Minorikawa and Nakayama 2011). After integrating the exogenous wild-type/mutated essential gene into the chromosome, the corresponding endogenous wild-type essential gene would then be knocked out via the CRISPR technology (Jinek, Chylinski et al. 2012, Mali, Yang et al. 2013). The functional consequence of a mutation in the disease gene can then be assessed by comparing the fitness of cells carrying the wild-type allele to the fitness of cells carrying the mutant allele.

I anticipate that large-scale complementation analysis in yeast, in human cell lines, and in other model systems will contribute to the characterization of the functional consequences of thousands of human disease-associated variants. Meanwhile, by assigning those mutations to specific protein domains and functional sites, this work will enhance our understanding of different disease mechanisms.

5.2.3 Antigenicity of mutations associated with more HLA protein types

In Chapter Four, where I developed a model to predict the antigenicity score of cancer mutations, I focused only on 12 common HLA class I alleles. Besides these 12 most common HLA class I alleles, the MHC binding peptides that can be displayed by other HLA alleles can be predicted using the NetMHCpan server (Nielsen and Andreatta 2016). Also, in the study introduced in Chapter Four, I focused only on 9-mer MHC binding peptides. Although peptides of other lengths bind more rarely and are predicted less accurately, the NetMHC or NetMHCpan server can predict binding peptides from 8 to 13 residues in length. Therefore, another future project could be to expand the dataset by predicting more MHC binding peptides. This would allow us to have a more comprehensive antigenicity score for more cancer mutations. Another potential improvement of this analysis is to use cancer-specific expression data for each gene instead of the cancer type-specific expression data. Moreover, in this study I used RNASeq data, while in future studies we might calculate the dependence of mutation depletion within MHC binding peptides on protein-level expression data.

Previous work by Brown and colleagues showed that tumor missense mutations that are predicted to be immunogenic are associated with patient survival (Brown, Warren et al. 2014). Because their study focused only on HLA-A alleles, we can extend it by asking whether mutations that we predict to be immunogenic with respect to other HLA alleles show a similar correlation. In addition, we can examine signatures of counter-selection in genes that are expressed in normal cells but not in ‘mature’ tumors, to better understand the stages at which most counter-selection occurs.

Collectively, this thesis described several investigations into the functional impact of germline and somatic mutations in human genes. Although much remains to be done, it is only through a predictive understanding of the impact of mutations that we can hope to realize the full benefit of ‘precision’ personalized medicine.

References

(2010). "Human genome: Genomes by the thousand." Nature 467(7319): 1026-1027.

(2015). "The future of cancer genomics." Nat Med 21(2): 99.

Adzhubei, I. A., S. Schmidt, L. Peshkin, V. E. Ramensky, A. Gerasimova, P. Bork, A. S. Kondrashov and S. R. Sunyaev (2010). "A method and server for predicting damaging missense mutations." Nat Methods 7(4): 248-249.

Agrawal, N., M. J. Frederick, C. R. Pickering, C. Bettegowda, K. Chang, R. J. Li, C. Fakhry, T. X. Xie, J. Zhang, J. Wang, N. Zhang, A. K. El-Naggar, S. A. Jasser, J. N. Weinstein, L. Trevino, J. A. Drummond, D. M. Muzny, Y. Wu, L. D. Wood, R. H. Hruban, W. H. Westra, W. M. Koch, J. A. Califano, R. A. Gibbs, D. Sidransky, B. Vogelstein, V. E. Velculescu, N. Papadopoulos, D. A. Wheeler, K. W. Kinzler and J. N. Myers (2011). "Exome sequencing of head and neck squamous cell carcinoma reveals inactivating mutations in NOTCH1." Science 333(6046): 1154-1157.

Agrawal, N., Y. Jiao, C. Bettegowda, S. M. Hutfless, Y. Wang, S. David, Y. Cheng, W. S. Twaddell, N. L. Latt, E. J. Shin, L. D. Wang, L. Wang, W. Yang, V. E. Velculescu, B. Vogelstein, N. Papadopoulos, K. W. Kinzler and S. J. Meltzer (2012). "Comparative genomic analysis of esophageal adenocarcinoma and squamous cell carcinoma." Cancer Discov 2(10): 899-905.

Alexandrov, L. B., S. Nik-Zainal, D. C. Wedge, S. A. Aparicio, S. Behjati, A. V. Biankin, G. R. Bignell, N. Bolli, A. Borg, A. L. Borresen-Dale, S. Boyault, B. Burkhardt, A. P. Butler, C. Caldas, H. R. Davies, C. Desmedt, R. Eils, J. E. Eyfjord, J. A. Foekens, M. Greaves, F. Hosoda, B. Hutter, T. Ilicic, S. Imbeaud, M. Imielinski, N. Jager, D. T. Jones, D. Jones, S. Knappskog, M. Kool, S. R. Lakhani, C. Lopez-Otin, S. Martin, N. C. Munshi, H. Nakamura, P. A. Northcott, M. Pajic, E. Papaemmanuil, A. Paradiso, J. V. Pearson, X. S. Puente, K. Raine, M. Ramakrishna, A. L. Richardson, J. Richter, P. Rosenstiel, M. Schlesner, T. N. Schumacher, P. N. Span, J. W. Teague, Y. Totoki, A. N. Tutt, R. Valdes-Mas, M. M. van Buuren, L. van 't Veer, A. Vincent-Salomon, N. Waddell, L. R. Yates, I. Australian Pancreatic Cancer Genome, I. B. C. Consortium, I. M.-S. Consortium, I. PedBrain, J. Zucman-Rossi, P. A. Futreal, U. McDermott, P. Lichter, M. Meyerson, S. M. Grimmond, R. Siebert, E. Campo, T. Shibata, S. M. Pfister, P. J. Campbell and M. R. Stratton (2013). "Signatures of mutational processes in human cancer." Nature 500(7463): 415-421.

Alshammari, M. J., L. Al-Otaibi and F. S. Alkuraya (2012). "Mutation in RAB33B, which encodes a regulator of retrograde Golgi transport, defines a second Dyggve-Melchior-Clausen locus." J Med Genet 49(7): 455-461.

Andreatta, M. and M. Nielsen (2016). "Gapped sequence alignment using artificial neural networks: application to the MHC class I system." Bioinformatics 32(4): 511-517.

Arum, C. J., E. Anderssen, T. Viset, Y. Kodama, S. Lundgren, D. Chen and C. M. Zhao (2010). "Cancer immunoediting from immunosurveillance to tumor escape in microvillus-formed niche: a study of syngeneic orthotopic rat bladder cancer model in comparison with human bladder cancer." Neoplasia 12(6): 434-442.

Atasoy, D., S. Schoch, A. Ho, K. A. Nadasy, X. Liu, W. Zhang, K. Mukherjee, E. D. Nosyreva, R. Fernandez-Chacon, M. Missler, E. T. Kavalali and T. C. Sudhof (2007). "Deletion of CASK in mice is lethal and impairs synaptic function." Proc Natl Acad Sci U S A 104(7): 2525-2530.

Balakrishnan, R., J. Park, K. Karra, B. C. Hitz, G. Binkley, E. L. Hong, J. Sullivan, G. Micklem and J. M. Cherry (2012). "YeastMine--an integrated data warehouse for Saccharomyces cerevisiae data as a multipurpose tool-kit." Database (Oxford) 2012: bar062.

Bamford, S., E. Dawson, S. Forbes, J. Clements, R. Pettett, A. Dogan, A. Flanagan, J. Teague, P. A. Futreal, M. R. Stratton and R. Wooster (2004). "The COSMIC (Catalogue of Somatic Mutations in Cancer) database and website." Br J Cancer 91(2): 355-358.

Barbieri, C. E., S. C. Baca, M. S. Lawrence, F. Demichelis, M. Blattner, J. P. Theurillat, T. A. White, P. Stojanov, E. Van Allen, N. Stransky, E. Nickerson, S. S. Chae, G. Boysen, D. Auclair, R. C. Onofrio, K. Park, N. Kitabayashi, T. Y. MacDonald, K. Sheikh, T. Vuong, C. Guiducci, K. Cibulskis, A. Sivachenko, S. L. Carter, G. Saksena, D. Voet, W. M. Hussain, A. H. Ramos, W. Winckler, M. C. Redman, K. Ardlie, A. K. Tewari, J. M. Mosquera, N. Rupp, P. J. Wild, H. Moch, C. Morrissey, P. S. Nelson, P. W. Kantoff, S. B. Gabriel, T. R. Golub, M. Meyerson, E. S. Lander, G. Getz, M. A. Rubin and L. A. Garraway (2012). "Exome sequencing identifies recurrent SPOP, FOXA1 and MED12 mutations in prostate cancer." Nat Genet 44(6): 685-689.

Baryshnikova, A., M. Costanzo, S. Dixon, F. J. Vizeacoumar, C. L. Myers, B. Andrews and C. Boone (2010). "Synthetic genetic array (SGA) analysis in Saccharomyces cerevisiae and Schizosaccharomyces pombe." Methods Enzymol 470: 145-179.

Baryshnikova, A., M. Costanzo, Y. Kim, H. Ding, J. Koh, K. Toufighi, J. Y. Youn, J. Ou, B. J. San Luis, S. Bandyopadhyay, M. Hibbs, D. Hess, A. C. Gingras, G. D. Bader, O. G. Troyanskaya, G. W. Brown, B. Andrews, C. Boone and C. L. Myers (2010). "Quantitative analysis of fitness and genetic interactions in yeast on a genome scale." Nat Methods 7(12): 1017-1024.

Bashton, M. and C. Chothia (2007). "The generation of new protein functions by the combination of domains." Structure 15(1): 85-99.

Bass, A. J., M. S. Lawrence, L. E. Brace, A. H. Ramos, Y. Drier, K. Cibulskis, C. Sougnez, D. Voet, G. Saksena, A. Sivachenko, R. Jing, M. Parkin, T. Pugh, R. G. Verhaak, N. Stransky, A. T. Boutin, J. Barretina, D. B. Solit, E. Vakiani, W. Shao, Y. Mishina, M. Warmuth, J. Jimenez, D. Y. Chiang, S. Signoretti, W. G. Kaelin, N. Spardy, W. C. Hahn, Y. Hoshida, S. Ogino, R. A. Depinho, L. Chin, L. A. Garraway, C. S. Fuchs, J. Baselga, J. Tabernero, S. Gabriel, E. S. Lander, G. Getz and M. Meyerson (2011). "Genomic sequencing of colorectal adenocarcinomas identifies a recurrent VTI1A-TCF7L2 fusion." Nat Genet 43(10): 964-968.

Beach, D., B. Durkacz and P. Nurse (1982). "Functionally homologous cell cycle control genes in budding and fission yeast." Nature 300(5894): 706-709.

Bedard, P. L., A. R. Hansen, M. J. Ratain and L. L. Siu (2013). "Tumour heterogeneity in the clinic." Nature 501(7467): 355-364.

Beerenwinkel, N., T. Antal, D. Dingli, A. Traulsen, K. W. Kinzler, V. E. Velculescu, B. Vogelstein and M. A. Nowak (2007). "Genetic progression and the waiting time to cancer." PLoS Comput Biol 3(11): e225.

Benjamini, Y. and Y. Hochberg (1995). "Controlling the false discovery rate: a practical and powerful approach to multiple testing." Journal of the Royal Statistical Society. Series B (Methodological): 289-300.

Berger, M. F., M. S. Lawrence, F. Demichelis, Y. Drier, K. Cibulskis, A. Y. Sivachenko, A. Sboner, R. Esgueva, D. Pflueger, C. Sougnez, R. Onofrio, S. L. Carter, K. Park, L. Habegger, L. Ambrogio, T. Fennell, M. Parkin, G. Saksena, D. Voet, A. H. Ramos, T. J. Pugh, J. Wilkinson, S. Fisher, W. Winckler, S. Mahan, K. Ardlie, J. Baldwin, J. W. Simons, N. Kitabayashi, T. Y. MacDonald, P. W. Kantoff, L. Chin, S. B. Gabriel, M. B. Gerstein, T. R. Golub, M. Meyerson, A. Tewari, E. S. Lander, G. Getz, M. A. Rubin and L. A. Garraway (2011). "The genomic complexity of primary human prostate cancer." Nature 470(7333): 214-220.

Berman, H., K. Henrick and H. Nakamura (2003). "Announcing the worldwide Protein Data Bank." Nat Struct Biol 10(12): 980.

Beroukhim, R., C. H. Mermel, D. Porter, G. Wei, S. Raychaudhuri, J. Donovan, J. Barretina, J. S. Boehm, J. Dobson, M. Urashima, K. T. Mc Henry, R. M. Pinchback, A. H. Ligon, Y. J. Cho, L. Haery, H. Greulich, M. Reich, W. Winckler, M. S. Lawrence, B. A. Weir, K. E. Tanaka, D. Y. Chiang, A. J. Bass, A. Loo, C. Hoffman, J. Prensner, T. Liefeld, Q. Gao, D. Yecies, S. Signoretti, E. Maher, F. J. Kaye, H. Sasaki, J. E. Tepper, J. A. Fletcher, J. Tabernero, J. Baselga, M. S. Tsao, F. Demichelis, M. A. Rubin, P. A. Janne, M. J. Daly, C. Nucera, R. L. Levine, B. L. Ebert, S. Gabriel, A. K. Rustgi, C. R. Antonescu, M. Ladanyi, A. Letai, L. A. Garraway, M. Loda, D. G. Beer, L. D. True, A. Okamoto, S. L. Pomeroy, S. Singer, T. R. Golub, E. S. Lander, G. Getz, W. R. Sellers and M. Meyerson (2010). "The landscape of somatic copy-number alteration across human cancers." Nature 463(7283): 899-905.

Bettegowda, C., N. Agrawal, Y. Jiao, Y. Wang, L. D. Wood, F. J. Rodriguez, R. H. Hruban, G. L. Gallia, Z. A. Binder, C. J. Riggins, V. Salmasi, G. J. Riggins, Z. J. Reitman, A. Rasheed, S. Keir, S. Shinjo, S. Marie, R. McLendon, G. Jallo, B. Vogelstein, D. Bigner, H. Yan, K. W. Kinzler and N. Papadopoulos (2013). "Exomic sequencing of four rare central nervous system tumor types." Oncotarget 4(4): 572-583.

Bignell, G. R., C. D. Greenman, H. Davies, A. P. Butler, S. Edkins, J. M. Andrews, G. Buck, L. Chen, D. Beare, C. Latimer, S. Widaa, J. Hinton, C. Fahey, B. Fu, S. Swamy, G. L. Dalgliesh, B. T. Teh, P. Deloukas, F. Yang, P. J. Campbell, P. A. Futreal and M. R. Stratton (2010). "Signatures of mutation and selection in the cancer genome." Nature 463(7283): 893-898.

Botstein, D. and G. R. Fink (1988). "Yeast: an experimental organism for modern biology." Science 240(4858): 1439-1443.

Botstein, D. and G. R. Fink (2011). "Yeast: an experimental organism for 21st Century biology." Genetics 189(3): 695-704.

Boyko, A. R., S. H. Williamson, A. R. Indap, J. D. Degenhardt, R. D. Hernandez, K. E. Lohmueller, M. D. Adams, S. Schmidt, J. J. Sninsky, S. R. Sunyaev, T. J. White, R. Nielsen, A. G. Clark and C. D. Bustamante (2008). "Assessing the evolutionary impact of amino acid mutations in the human genome." PLoS Genet 4(5): e1000083.

Brash, D. E., J. A. Rudolph, J. A. Simon, A. Lin, G. J. McKenna, H. P. Baden, A. J. Halperin and J. Ponten (1991). "A role for sunlight in skin cancer: UV-induced p53 mutations in squamous cell carcinoma." Proc Natl Acad Sci U S A 88(22): 10124-10128.

Breslow, D. K., D. M. Cameron, S. R. Collins, M. Schuldiner, J. Stewart-Ornstein, H. W. Newman, S. Braun, H. D. Madhani, N. J. Krogan and J. S. Weissman (2008). "A comprehensive strategy enabling high-resolution functional analysis of the yeast genome." Nat Methods 5(8): 711-718.

Brower, V. (2015). "Checkpoint blockade immunotherapy for cancer comes of age." J Natl Cancer Inst 107(3).

Brown, S. D., R. L. Warren, E. A. Gibb, S. D. Martin, J. J. Spinelli, B. H. Nelson and R. A. Holt (2014). "Neo-antigens predicted by tumor genome meta-analysis correlate with increased patient survival." Genome Res 24(5): 743-750.

Cancer Genome Atlas, N. (2012). "Comprehensive molecular characterization of human colon and rectal cancer." Nature 487(7407): 330-337.

Cancer Genome Atlas Research, N. (2008). "Comprehensive genomic characterization defines human glioblastoma genes and core pathways." Nature 455(7216): 1061-1068.

Cancer Genome Atlas Research, N. (2011). "Integrated genomic analyses of ovarian carcinoma." Nature 474(7353): 609-615.

Cancer Genome Atlas Research, N., C. Kandoth, N. Schultz, A. D. Cherniack, R. Akbani, Y. Liu, H. Shen, A. G. Robertson, I. Pashtan, R. Shen, C. C. Benz, C. Yau, P. W. Laird, L. Ding, W. Zhang, G. B. Mills, R. Kucherlapati, E. R. Mardis and D. A. Levine (2013). "Integrated genomic characterization of endometrial carcinoma." Nature 497(7447): 67-73.

Cancer Genome Atlas Research, N., J. N. Weinstein, E. A. Collisson, G. B. Mills, K. R. Shaw, B. A. Ozenberger, K. Ellrott, I. Shmulevich, C. Sander and J. M. Stuart (2013). "The Cancer Genome Atlas Pan-Cancer analysis project." Nat Genet 45(10): 1113-1120.

Castellana, S. and T. Mazza (2013). "Congruency in the prediction of pathogenic missense mutations: state-of-the-art web-based tools." Brief Bioinform 14(4): 448-459.

Chan, P. A., S. Duraisamy, P. J. Miller, J. A. Newell, C. McBride, J. P. Bond, T. Raevaara, S. Ollila, M. Nystrom, A. J. Grimm, J. Christodoulou, W. S. Oetting and M. S. Greenblatt (2007). "Interpreting missense variants: comparing computational methods in human disease genes CDKN2A, MLH1, MSH2, MECP2, and tyrosinase (TYR)." Hum Mutat 28(7): 683-693.

Janeway, C. A., Jr., P. Travers, M. Walport and M. J. Shlomchik (2001). Immunobiology, 5th edition. New York, Garland Science.

Chen, C., Y. Liu, A. R. Rappaport, T. Kitzing, N. Schultz, Z. Zhao, A. S. Shroff, R. A. Dickins, C. R. Vakoc, J. E. Bradner, W. Stock, M. M. LeBeau, K. M. Shannon, S. Kogan, J. Zuber and S. W. Lowe (2014). "MLL3 Is a Haploinsufficient 7q Tumor Suppressor in Acute Myeloid Leukemia." Cancer Cell 25(5): 652-665.

Chene, P. (2001). "The role of tetramerization in p53 function." Oncogene 20(21): 2611-2617.

Cheung, H. W., G. S. Cowley, B. A. Weir, J. S. Boehm, S. Rusin, J. A. Scott, A. East, L. D. Ali, P. H. Lizotte, T. C. Wong, G. Jiang, J. Hsiao, C. H. Mermel, G. Getz, J. Barretina, S. Gopal, P. Tamayo, J. Gould, A. Tsherniak, N. Stransky, B. Luo, Y. Ren, R. Drapkin, S. N. Bhatia, J. P. Mesirov, L. A. Garraway, M. Meyerson, E. S. Lander, D. E. Root and W. C. Hahn (2011). "Systematic investigation of genetic vulnerabilities across cancer cell lines reveals lineage-specific dependencies in ovarian cancer." Proc Natl Acad Sci U S A 108(30): 12372-12377.

Chin, L., J. N. Andersen and P. A. Futreal (2011). "Cancer genomics: from discovery science to personalized medicine." Nat Med 17(3): 297-303.

Chin, L., W. C. Hahn, G. Getz and M. Meyerson (2011). "Making sense of cancer genomic data." Genes Dev 25(6): 534-555.

Cho, H. S. and D. J. Leahy (2002). "Structure of the extracellular region of HER3 reveals an interdomain tether." Science 297(5585): 1330-1333.

Choi, Y., G. E. Sims, S. Murphy, J. R. Miller and A. P. Chan (2012). "Predicting the functional effect of amino acid substitutions and indels." PLoS One 7(10): e46688.

Ciriello, G., M. L. Miller, B. A. Aksoy, Y. Senbabaoglu, N. Schultz and C. Sander (2013). "Emerging landscape of oncogenic signatures across human cancers." Nat Genet 45(10): 1127-1133.

Clark, V. E., E. Z. Erson-Omay, A. Serin, J. Yin, J. Cotney, K. Ozduman, T. Avsar, J. Li, P. B. Murray, O. Henegariu, S. Yilmaz, J. M. Gunel, G. Carrion-Grant, B. Yilmaz, C. Grady, B. Tanrikulu, M. Bakircioglu, H. Kaymakcalan, A. O. Caglayan, L. Sencar, E. Ceyhun, A. F. Atik, Y. Bayri, H. Bai, L. E. Kolb, R. M. Hebert, S. B. Omay, K. Mishra-Gorur, M. Choi, J. D. Overton, E. C. Holland, S. Mane, M. W. State, K. Bilguvar, J. M. Baehring, P. H. Gutin, J. M. Piepmeier, A. Vortmeyer, C. W. Brennan, M. N. Pamir, T. Kilic, R. P. Lifton, J. P. Noonan, K. Yasuno and M. Gunel (2013). "Genomic analysis of non-NF2 meningiomas reveals mutations in TRAF7, KLF4, AKT1, and SMO." Science 339(6123): 1077-1080.

Clarke, L., X. Zheng-Bradley, R. Smith, E. Kulesha, C. Xiao, I. Toneva, B. Vaughan, D. Preuss, R. Leinonen, M. Shumway, S. Sherry, P. Flicek and C. Genomes Project (2012). "The 1000 Genomes Project: data management and community access." Nat Methods 9(5): 459-462.

Cline, M. S. and R. Karchin (2011). "Using bioinformatics to predict the functional impact of SNVs." Bioinformatics 27(4): 441-448.

Cohen, A. R., D. F. Woods, S. M. Marfatia, Z. Walther, A. H. Chishti and J. M. Anderson (1998). "Human CASK/LIN-2 binds syndecan-2 and protein 4.1 and localizes to the basolateral membrane of epithelial cells." J Cell Biol 142(1): 129-138.

Collier, L. S. and D. A. Largaespada (2007). "Transposons for cancer gene discovery: Sleeping Beauty and beyond." Genome Biol 8 Suppl 1: S15.

Cong, L., F. A. Ran, D. Cox, S. Lin, R. Barretto, N. Habib, P. D. Hsu, X. Wu, W. Jiang and L. A. Marraffini (2013). "Multiplex genome engineering using CRISPR/Cas systems." Science 339(6121): 819-823.

Cooper, D. N., P. D. Stenson and N. A. Chuzhanova (2006). "The Human Gene Mutation Database (HGMD) and its exploitation in the study of mutational mechanisms." Curr Protoc Bioinformatics Chapter 1: Unit 1 13.

Coordinators, N. R. (2014). "Database resources of the National Center for Biotechnology Information." Nucleic Acids Res 42(Database issue): D7-17.

Corso, G., D. Marrelli and F. Roviello (2012). "Familial gastric cancer and germline mutations of E-cadherin." Ann Ital Chir 83(3): 177-182.

Costanzo, M., A. Baryshnikova, J. Bellay, Y. Kim, E. D. Spear, C. S. Sevier, H. Ding, J. L. Koh, K. Toufighi, S. Mostafavi, J. Prinz, R. P. St Onge, B. VanderSluis, T. Makhnevych, F. J. Vizeacoumar, S. Alizadeh, S. Bahr, R. L. Brost, Y. Chen, M. Cokol, R. Deshpande, Z. Li, Z. Y. Lin, W. Liang, M. Marback, J. Paw, B. J. San Luis, E. Shuteriqi, A. H. Tong, N. van Dyk, I. M. Wallace, J. A. Whitney, M. T. Weirauch, G. Zhong, H. Zhu, W. A. Houry, M. Brudno, S. Ragibizadeh, B. Papp, C. Pal, F. P. Roth, G. Giaever, C. Nislow, O. G. Troyanskaya, H. Bussey, G. D. Bader, A. C. Gingras, Q. D. Morris, P. M. Kim, C. A. Kaiser, C. L. Myers, B. J. Andrews and C. Boone (2010). "The genetic landscape of a cell." Science 327(5964): 425-431.

Davis, E. E., S. Frangakis and N. Katsanis (2014). "Interpreting human genetic variation with in vivo zebrafish assays." Biochim Biophys Acta 1842(10): 1960-1970.

Deutschbauer, A. M., D. F. Jaramillo, M. Proctor, J. Kumm, M. E. Hillenmeyer, R. W. Davis, C. Nislow and G. Giaever (2005). "Mechanisms of haploinsufficiency revealed by genome-wide profiling in yeast." Genetics 169(4): 1915-1925.

Dimster-Denk, D., K. W. Tripp, N. J. Marini, S. Marqusee and J. Rine (2013). "Mono and dual cofactor dependence of human cystathionine beta-synthase enzyme variants in vivo and in vitro." G3 (Bethesda) 3(10): 1619-1628.

Ding, L., T. J. Ley, D. E. Larson, C. A. Miller, D. C. Koboldt, J. S. Welch, J. K. Ritchey, M. A. Young, T. Lamprecht, M. D. McLellan, J. F. McMichael, J. W. Wallis, C. Lu, D. Shen, C. C. Harris, D. J. Dooling, R. S. Fulton, L. L. Fulton, K. Chen, H. Schmidt, J. Kalicki-Veizer, V. J. Magrini, L. Cook, S. D. McGrath, T. L. Vickery, M. C. Wendl, S. Heath, M. A. Watson, D. C. Link, M. H. Tomasson, W. D. Shannon, J. E. Payton, S. Kulkarni, P. Westervelt, M. J. Walter, T. A. Graubert, E. R. Mardis, R. K. Wilson and J. F. DiPersio (2012). "Clonal evolution in relapsed acute myeloid leukaemia revealed by whole-genome sequencing." Nature 481(7382): 506-510.

Dixit, A. and G. M. Verkhivker (2014). "Structure-functional prediction and analysis of cancer mutation effects in protein kinases." Comput Math Methods Med 2014: 653487.

Dixit, A., L. Yi, R. Gowthaman, A. Torkamani, N. J. Schork and G. M. Verkhivker (2009). "Sequence and structure signatures of cancer mutation hotspots in protein kinases." PLoS One 4(10): e7485.

Doherty, P. C. and R. M. Zinkernagel (1975). "A biological role for the major histocompatibility antigens." Lancet 1(7922): 1406-1409.

Dolinski, K. and D. Botstein (2007). "Orthology and functional conservation in eukaryotes." Annu Rev Genet 41: 465-507.

Dolle, M. E., W. K. Snyder, J. A. Gossen, P. H. Lohman and J. Vijg (2000). "Distinct spectra of somatic mutations accumulated with age in mouse heart and small intestine." Proc Natl Acad Sci U S A 97(15): 8403-8408.

Donnes, P. and A. Elofsson (2002). "Prediction of MHC class I binding peptides, using SVMHC." BMC Bioinformatics 3: 25.

Dreyer, R. (1985). "How to make successful sales calls." Dent Lab Rev 60(7): 8-10.

Du, Z. W., B. Y. Hu, M. Ayala, B. Sauer and S. C. Zhang (2009). "Cre recombination-mediated cassette exchange for building versatile transgenic human embryonic stem cells lines." Stem Cells 27(5): 1032-1041.

Dulak, A. M., P. Stojanov, S. Peng, M. S. Lawrence, C. Fox, C. Stewart, S. Bandla, Y. Imamura, S. E. Schumacher, E. Shefler, A. McKenna, S. L. Carter, K. Cibulskis, A. Sivachenko, G. Saksena, D. Voet, A. H. Ramos, D. Auclair, K. Thompson, C. Sougnez, R. C. Onofrio, C. Guiducci, R. Beroukhim, Z. Zhou, L. Lin, J. Lin, R. Reddy, A. Chang, R. Landrenau, A. Pennathur, S. Ogino, J. D. Luketich, T. R. Golub, S. B. Gabriel, E. S. Lander, D. G. Beer, T. E. Godfrey, G. Getz and A. J. Bass (2013). "Exome and whole-genome sequencing of esophageal adenocarcinoma identifies recurrent driver events and mutational complexity." Nat Genet 45(5): 478-486.

Dunn, G. P., A. T. Bruce, H. Ikeda, L. J. Old and R. D. Schreiber (2002). "Cancer immunoediting: from immunosurveillance to tumor escape." Nat Immunol 3(11): 991-998.

Dunn, G. P., C. M. Koebel and R. D. Schreiber (2006). "Interferons, immunity and cancer immunoediting." Nat Rev Immunol 6(11): 836-848.

Dunn, G. P., L. J. Old and R. D. Schreiber (2004). "The immunobiology of cancer immunosurveillance and immunoediting." Immunity 21(2): 137-148.

Duns, G., R. M. Hofstra, J. G. Sietzema, H. Hollema, I. van Duivenbode, A. Kuik, C. Giezen, O. Jan, J. J. Bergsma, H. Bijnen, P. van der Vlies, E. van den Berg and K. Kok (2012). "Targeted exome sequencing in clear cell renal cell carcinoma tumors suggests aberrant chromatin regulation as a crucial step in ccRCC development." Hum Mutat 33(7): 1059-1062.

Dupuis, N., S. Lebon, M. Kumar, S. Drunat, L. M. Graul-Neumann, P. Gressens and V. El Ghouzzi (2013). "A novel RAB33B mutation in Smith-McCort dysplasia." Hum Mutat 34(2): 283-286.

Dupuy, A. J., N. A. Jenkins and N. G. Copeland (2006). "Sleeping beauty: a novel cancer gene discovery tool." Hum Mol Genet 15 Spec No 1: R75-79.

Durinck, S., C. Ho, N. J. Wang, W. Liao, L. R. Jakkula, E. A. Collisson, J. Pons, S. W. Chan, E. T. Lam, C. Chu, K. Park, S. W. Hong, J. S. Hur, N. Huh, I. M. Neuhaus, S. S. Yu, R. C. Grekin, T. M. Mauro, J. E. Cleaver, P. Y. Kwok, P. E. LeBoit, G. Getz, K. Cibulskis, J. C. Aster, H. Huang, E. Purdom, J. Li, L. Bolund, S. T. Arron, J. W. Gray, P. T. Spellman and R. J. Cho (2011). "Temporal dissection of tumorigenesis in primary cancers." Cancer Discov 1(2): 137-143.

Eddy, S. R. (1998). "Profile hidden Markov models." Bioinformatics 14(9): 755-763.

Ehrlich, P. (1908). Ueber den jetzigen Stand der Karzinomforschung.

Engel, S. R., F. S. Dietrich, D. G. Fisk, G. Binkley, R. Balakrishnan, M. C. Costanzo, S. S. Dwight, B. C. Hitz, K. Karra, R. S. Nash, S. Weng, E. D. Wong, P. Lloyd, M. S. Skrzypek, S. R. Miyasato, M. Simison and J. M. Cherry (2014). "The reference genome sequence of Saccharomyces cerevisiae: then and now." G3 (Bethesda) 4(3): 389-398.

Fernandez-Medarde, A. and E. Santos (2011). "Ras in cancer and developmental diseases." Genes Cancer 2(3): 344-358.

Fidler, I. J. and M. L. Kripke (2015). "The challenge of targeting metastasis." Cancer Metastasis Rev 34(4): 635-641.

Finn, R. D., J. Mistry, B. Schuster-Bockler, S. Griffiths-Jones, V. Hollich, T. Lassmann, S. Moxon, M. Marshall, A. Khanna, R. Durbin, S. R. Eddy, E. L. Sonnhammer and A. Bateman (2006). "Pfam: clans, web tools and services." Nucleic Acids Res 34(Database issue): D247-251.

Finn, R. D., J. Mistry, J. Tate, P. Coggill, A. Heger, J. E. Pollington, O. L. Gavin, P. Gunasekaran, G. Ceric, K. Forslund, L. Holm, E. L. Sonnhammer, S. R. Eddy and A. Bateman (2010). "The Pfam protein families database." Nucleic Acids Res 38(Database issue): D211-222.

Fodde, R. and R. Smits (2002). "Cancer biology. A matter of dosage." Science 298(5594): 761-763.

Forbes, S. A., G. Bhamra, S. Bamford, E. Dawson, C. Kok, J. Clements, A. Menzies, J. W. Teague, P. A. Futreal and M. R. Stratton (2008). "The Catalogue of Somatic Mutations in Cancer (COSMIC)." Curr Protoc Hum Genet Chapter 10: Unit 10 11.

Forbes, S. A., N. Bindal, S. Bamford, C. Cole, C. Y. Kok, D. Beare, M. Jia, R. Shepherd, K. Leung, A. Menzies, J. W. Teague, P. J. Campbell, M. R. Stratton and P. A. Futreal (2011). "COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer." Nucleic Acids Res 39(Database issue): D945-950.

Frazer, K. A., S. S. Murray, N. J. Schork and E. J. Topol (2009). "Human genetic variation and its contribution to complex traits." Nat Rev Genet 10(4): 241-251.

Frousios, K., C. S. Iliopoulos, T. Schlitt and M. A. Simpson (2013). "Predicting the functional consequences of non-synonymous DNA sequence variants--evaluation of bioinformatics tools and development of a consensus strategy." Genomics 102(4): 223-228.

Futreal, P. A., L. Coin, M. Marshall, T. Down, T. Hubbard, R. Wooster, N. Rahman and M. R. Stratton (2004). "A census of human cancer genes." Nat Rev Cancer 4(3): 177-183.

Galante, P. A., R. B. Parmigiani, Q. Zhao, O. L. Caballero, J. E. de Souza, F. C. Navarro, A. L. Gerber, M. F. Nicolas, A. C. Salim, A. P. Silva, L. Edsall, S. Devalle, L. G. Almeida, Z. Ye, S. Kuan, D. G. Pinheiro, I. Tojal, R. G. Pedigoni, R. G. de Sousa, T. Y. Oliveira, M. G. de Paula, L. Ohno-Machado, E. F. Kirkness, S. Levy, W. A. da Silva, Jr., A. T. Vasconcelos, B. Ren, M. A. Zago, R. L. Strausberg, A. J. Simpson, S. J. de Souza and A. A. Camargo (2011). "Distinct patterns of somatic alterations in a lymphoblastoid and a tumor genome derived from the same individual." Nucleic Acids Res 39(14): 6056-6068.

Gari, E., L. Piedrafita, M. Aldea and E. Herrero (1997). "A set of vectors with a tetracycline-regulatable promoter system for modulated gene expression in Saccharomyces cerevisiae." Yeast 13(9): 837-848.

Gatius, S., A. Velasco, A. Azueta, M. Santacana, J. Pallares, J. Valls, X. Dolcet, J. Prat and X. Matias-Guiu (2011). "FGFR2 alterations in endometrial carcinoma." Mod Pathol 24(11): 1500-1510.

Gerdes, S. Y., M. D. Scholle, J. W. Campbell, G. Balazsi, E. Ravasz, M. D. Daugherty, A. L. Somera, N. C. Kyrpides, I. Anderson, M. S. Gelfand, A. Bhattacharya, V. Kapatral, M. D'Souza, M. V. Baev, Y. Grechkin, F. Mseeh, M. Y. Fonstein, R. Overbeek, A. L. Barabasi, Z. N. Oltvai and A. L. Osterman (2003). "Experimental determination and system level analysis of essential genes in Escherichia coli MG1655." J Bacteriol 185(19): 5673-5684.

Gerlinger, M., A. J. Rowan, S. Horswell, J. Larkin, D. Endesfelder, E. Gronroos, P. Martinez, N. Matthews, A. Stewart, P. Tarpey, I. Varela, B. Phillimore, S. Begum, N. Q. McDonald, A. Butler, D. Jones, K. Raine, C. Latimer, C. R. Santos, M. Nohadani, A. C. Eklund, B. Spencer-Dene, G. Clark, L. Pickering, G. Stamp, M. Gore, Z. Szallasi, J. Downward, P. A. Futreal and C. Swanton (2012). "Intratumor heterogeneity and branched evolution revealed by multiregion sequencing." N Engl J Med 366(10): 883-892.

Ghiringhelli, F., L. Apetoh, F. Housseau, G. Kroemer and L. Zitvogel (2007). "Links between innate and cognate tumor immunity." Curr Opin Immunol 19(2): 224-231.

Giaever, G., A. M. Chu, L. Ni, C. Connelly, L. Riles, S. Veronneau, S. Dow, A. Lucau-Danila, K. Anderson, B. Andre, A. P. Arkin, A. Astromoff, M. El-Bakkoury, R. Bangham, R. Benito, S. Brachat, S. Campanaro, M. Curtiss, K. Davis, A. Deutschbauer, K. D. Entian, P. Flaherty, F. Foury, D. J. Garfinkel, M. Gerstein, D. Gotte, U. Guldener, J. H. Hegemann, S. Hempel, Z. Herman, D. F. Jaramillo, D. E. Kelly, S. L. Kelly, P. Kotter, D. LaBonte, D. C. Lamb, N. Lan, H. Liang, H. Liao, L. Liu, C. Luo, M. Lussier, R. Mao, P. Menard, S. L. Ooi, J. L. Revuelta, C. J. Roberts, M. Rose, P. Ross-Macdonald, B. Scherens, G. Schimmack, B. Shafer, D. D. Shoemaker, S. Sookhai-Mahadeo, R. K. Storms, J. N. Strathern, G. Valle, M. Voet, G. Volckaert, C. Y. Wang, T. R. Ward, J. Wilhelmy, E. A. Winzeler, Y. Yang, G. Yen, E. Youngman, K. Yu, H. Bussey, J. D. Boeke, M. Snyder, P. Philippsen, R. W. Davis and M. Johnston (2002). "Functional profiling of the Saccharomyces cerevisiae genome." Nature 418(6896): 387-391.

Gimelbrant, A., J. N. Hutchinson, B. R. Thompson and A. Chess (2007). "Widespread monoallelic expression on human autosomes." Science 318(5853): 1136-1140.

Gnad, F., A. Baucom, K. Mukhyala, G. Manning and Z. Zhang (2013). "Assessment of computational methods for predicting the effects of missense mutations in human cancers." BMC Genomics 14 Suppl 3: S7.

Goffeau, A. (2000). "Four years of post-genomic life with 6,000 yeast genes." FEBS Lett 480(1): 37-41.

Goffeau, A., B. G. Barrell, H. Bussey, R. W. Davis, B. Dujon, H. Feldmann, F. Galibert, J. D. Hoheisel, C. Jacq, M. Johnston, E. J. Louis, H. W. Mewes, Y. Murakami, P. Philippsen, H. Tettelin and S. G. Oliver (1996). "Life with 6000 genes." Science 274(5287): 546, 563-547.

Gok, M. and A. T. Ozcerit (2012). "Prediction of MHC class I binding peptides with a new feature encoding technique." Cell Immunol 275(1-2): 1-4.

Gonzalez-Perez, A., C. Perez-Llamas, J. Deu-Pons, D. Tamborero, M. P. Schroeder, A. Jene-Sanz, A. Santos and N. Lopez-Bigas (2013). "IntOGen-mutations identifies cancer drivers across tumor types." Nat Methods 10(11): 1081-1082.

Graham, S. L., K. J. Davis, W. H. Hansen and C. H. Graham (1975). "Effects of prolonged ethylene thiourea ingestion on the thyroid of the rat." Food Cosmet Toxicol 13(5): 493-499.

Grasso, C. S., Y. M. Wu, D. R. Robinson, X. Cao, S. M. Dhanasekaran, A. P. Khan, M. J. Quist, X. Jing, R. J. Lonigro, J. C. Brenner, I. A. Asangani, B. Ateeq, S. Y. Chun, J. Siddiqui, L. Sam, M. Anstett, R. Mehra, J. R. Prensner, N. Palanisamy, G. A. Ryslik, F. Vandin, B. J. Raphael, L. P. Kunju, D. R. Rhodes, K. J. Pienta, A. M. Chinnaiyan and S. A. Tomlins (2012). "The mutational landscape of lethal castration-resistant prostate cancer." Nature 487(7406): 239-243.

Graziano, F., A. M. Ruzzo, I. Bearzi, E. Testa, V. Lai and M. Magnani (2003). "Screening E-cadherin germline mutations in Italian patients with familial diffuse gastric cancer: an analysis in the District of Urbino, Region Marche, Central Italy." Tumori 89(3): 255-258.

Greaves, M. and C. C. Maley (2012). "Clonal evolution in cancer." Nature 481(7381): 306-313.

Green, M. R., A. J. Gentles, R. V. Nair, J. M. Irish, S. Kihira, C. L. Liu, I. Kela, E. S. Hopmans, J. H. Myklebust, H. Ji, S. K. Plevritis, R. Levy and A. A. Alizadeh (2013). "Hierarchy in somatic mutations arising during genomic evolution and progression of follicular lymphoma." Blood 121(9): 1604-1611.

Greene, A. L., J. R. Snipe, D. A. Gordenin and M. A. Resnick (1999). "Functional analysis of human FEN1 in Saccharomyces cerevisiae and its role in genome stability." Hum Mol Genet 8(12): 2263-2273.

Greenman, C., P. Stephens, R. Smith, G. L. Dalgliesh, C. Hunter, G. Bignell, H. Davies, J. Teague, A. Butler, C. Stevens, S. Edkins, S. O'Meara, I. Vastrik, E. E. Schmidt, T. Avis, S. Barthorpe, G. Bhamra, G. Buck, B. Choudhury, J. Clements, J. Cole, E. Dicks, S. Forbes, K. Gray, K. Halliday, R. Harrison, K. Hills, J. Hinton, A. Jenkinson, D. Jones, A. Menzies, T. Mironenko, J. Perry, K. Raine, D. Richardson, R. Shepherd, A. Small, C. Tofts, J. Varian, T. Webb, S. West, S. Widaa, A. Yates, D. P. Cahill, D. N. Louis, P. Goldstraw, A. G. Nicholson, F. Brasseur, L. Looijenga, B. L. Weber, Y. E. Chiew, A. DeFazio, M. F. Greaves, A. R. Green, P. Campbell, E. Birney, D. F. Easton, G. Chenevix-Trench, M. H. Tan, S. K. Khoo, B. T. Teh, S. T. Yuen, S. Y. Leung, R. Wooster, P. A. Futreal and M. R. Stratton (2007). "Patterns of somatic mutation in human cancer genomes." Nature 446(7132): 153-158.

Gribskov, M., A. D. McLachlan and D. Eisenberg (1987). "Profile analysis: detection of distantly related proteins." Proc Natl Acad Sci U S A 84(13): 4355-4358.

Gubin, M. M., M. N. Artyomov, E. R. Mardis and R. D. Schreiber (2015). "Tumor neoantigens: building a framework for personalized cancer immunotherapy." J Clin Invest 125(9): 3413-3421.

Guichard, C., G. Amaddeo, S. Imbeaud, Y. Ladeiro, L. Pelletier, I. B. Maad, J. Calderaro, P. Bioulac-Sage, M. Letexier, F. Degos, B. Clement, C. Balabaud, E. Chevet, A. Laurent, G. Couchy, E. Letouze, F. Calvo and J. Zucman-Rossi (2012). "Integrated analysis of somatic mutations and focal copy-number changes identifies key genes and pathways in hepatocellular carcinoma." Nat Genet 44(6): 694-698.

Guilford, P., J. Hopkins, J. Harraway, M. McLeod, N. McLeod, P. Harawira, H. Taite, R. Scoular, A. Miller and A. E. Reeve (1998). "E-cadherin germline mutations in familial gastric cancer." Nature 392(6674): 402-405.

Gulukota, K., J. Sidney, A. Sette and C. DeLisi (1997). "Two complementary methods for predicting peptides binding major histocompatibility complex molecules." J Mol Biol 267(5): 1258-1267.

Gundem, G., C. Perez-Llamas, A. Jene-Sanz, A. Kedzierska, A. Islam, J. Deu-Pons, S. J. Furney and N. Lopez-Bigas (2010). "IntOGen: integration and data mining of multidimensional oncogenomic data." Nat Methods 7(2): 92-93.

Guo, G., Y. Gui, S. Gao, A. Tang, X. Hu, Y. Huang, W. Jia, Z. Li, M. He, L. Sun, P. Song, X. Sun, X. Zhao, S. Yang, C. Liang, S. Wan, F. Zhou, C. Chen, J. Zhu, X. Li, M. Jian, L. Zhou, R. Ye, P. Huang, J. Chen, T. Jiang, X. Liu, Y. Wang, J. Zou, Z. Jiang, R. Wu, S. Wu, F. Fan, Z. Zhang, L. Liu, R. Yang, X. Liu, H. Wu, W. Yin, X. Zhao, Y. Liu, H. Peng, B. Jiang, Q. Feng, C. Li, J. Xie, J. Lu, K. Kristiansen, Y. Li, X. Zhang, S. Li, J. Wang, H. Yang, Z. Cai and J. Wang (2012). "Frequent mutations of genes encoding ubiquitin-mediated proteolysis pathway components in clear cell renal cell carcinoma." Nat Genet 44(1): 17-19.

Haeno, H., Y. Iwasa and F. Michor (2007). "The evolution of two mutations during clonal expansion." Genetics 177(4): 2209-2221.

Hamosh, A., A. F. Scott, J. S. Amberger, C. A. Bocchini and V. A. McKusick (2005). "Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders." Nucleic Acids Res 33(Database issue): D514-517.

Hamza, A., E. Tammpere, M. Kofoed, C. Keong, J. Chiang, G. Giaever, C. Nislow and P. Hieter (2015). "Complementation of Yeast Genes with Human Genes as an Experimental Platform for Functional Testing of Human Genetic Variants." Genetics 201(3): 1263-1274.

Hartwell, L. H. (1967). "Macromolecule synthesis in temperature-sensitive mutants of yeast." J Bacteriol 93(5): 1662-1670.

Haughton, G. and D. B. Amos (1968). "Immunology of carcinogenesis." Cancer Res 28(9): 1839-1840.

Heemskerk, B., P. Kvistborg and T. N. Schumacher (2013). "The cancer antigenome." EMBO J 32(2): 194-203.

Hillenmeyer, M. E., E. Fung, J. Wildenhain, S. E. Pierce, S. Hoon, W. Lee, M. Proctor, R. P. St Onge, M. Tyers, D. Koller, R. B. Altman, R. W. Davis, C. Nislow and G. Giaever (2008). "The chemical genomic portrait of yeast: uncovering a phenotype for all genes." Science 320(5874): 362-365.

Hishiki, T., S. Kawamoto, S. Morishita and K. Okubo (2000). "BodyMap: a human and mouse gene expression database." Nucleic Acids Res 28(1): 136-138.

Ho, A. S., K. Kannan, D. M. Roy, L. G. Morris, I. Ganly, N. Katabi, D. Ramaswami, L. A. Walsh, S. Eng, J. T. Huse, J. Zhang, I. Dolgalev, K. Huberman, A. Heguy, A. Viale, M. Drobnjak, M. A. Leversha, C. E. Rice, B. Singh, N. G. Iyer, C. R. Leemans, E. Bloemena, R. L. Ferris, R. R. Seethala, B. E. Gross, Y. Liang, R. Sinha, L. Peng, B. J. Raphael, S. Turcan, Y. Gong, N. Schultz, S. Kim, S. Chiosea, J. P. Shah, C. Sander, W. Lee and T. A. Chan (2013). "The mutational landscape of adenoid cystic carcinoma." Nat Genet 45(7): 791-798.

Hodis, E., I. R. Watson, G. V. Kryukov, S. T. Arold, M. Imielinski, J. P. Theurillat, E. Nickerson, D. Auclair, L. Li, C. Place, D. Dicara, A. H. Ramos, M. S. Lawrence, K. Cibulskis, A. Sivachenko, D. Voet, G. Saksena, N. Stransky, R. C. Onofrio, W. Winckler, K. Ardlie, N. Wagle, J. Wargo, K. Chong, D. L. Morton, K. Stemke-Hale, G. Chen, M. Noble, M. Meyerson, J. E. Ladbury, M. A. Davies, J. E. Gershenwald, S. N. Wagner, D. S. Hoon, D. Schadendorf, E. S. Lander, S. B. Gabriel, G. Getz, L. A. Garraway and L. Chin (2012). "A landscape of driver mutations in melanoma." Cell 150(2): 251-263.

Hohenstein, P. (2004). "Tumour suppressor genes--one hit can be enough." PLoS Biol 2(2): E40.

Hon, W. C., A. Berndt and R. L. Williams (2012). "Regulation of lipid binding underlies the activation mechanism of class IA PI3-kinases." Oncogene 31(32): 3655-3666.

Honeyman, M. C., V. Brusic, N. L. Stone and L. C. Harrison (1998). "Neural network-based prediction of candidate T-cell epitopes." Nat Biotechnol 16(10): 966-969.

Hood, L. and M. Flores (2012). "A personal view on systems medicine and the emergence of proactive P4 medicine: predictive, preventive, personalized and participatory." N Biotechnol 29(6): 613-624.

Hornbeck, P. V., I. Chabra, J. M. Kornhauser, E. Skrzypek and B. Zhang (2004). "PhosphoSite: A bioinformatics resource dedicated to physiological protein phosphorylation." Proteomics 4(6): 1551-1561.

Huang, J., Q. Deng, Q. Wang, K. Y. Li, J. H. Dai, N. Li, Z. D. Zhu, B. Zhou, X. Y. Liu, R. F. Liu, Q. L. Fei, H. Chen, B. Cai, B. Zhou, H. S. Xiao, L. X. Qin and Z. G. Han (2012). "Exome sequencing of hepatitis B virus-associated hepatocellular carcinoma." Nat Genet 44(10): 1117-1121.

Hudson, T. J., W. Anderson, A. Artez, A. D. Barker, C. Bell, R. R. Bernabe, M. K. Bhan, F. Calvo, I. Eerola, D. S. Gerhard, A. Guttmacher, M. Guyer, F. M. Hemsley, J. L. Jennings, D. Kerr, P. Klatt, P. Kolar, J. Kusada, D. P. Lane, F. Laplace, L. Youyong, G. Nettekoven, B. Ozenberger, J. Peterson, T. S. Rao, J. Remacle, A. J. Schafer, T. Shibata, M. R. Stratton, J. G. Vockley, K. Watanabe, H. Yang, M. M. Yuen, B. M. Knoppers, M. Bobrow, A. Cambon- Thomsen, L. G. Dressler, S. O. Dyke, Y. Joly, K. Kato, K. L. Kennedy, P. Nicolas, M. J. Parker, E. Rial-Sebbag, C. M. Romeo-Casabona, K. M. Shaw, S. Wallace, G. L. Wiesner, N. Zeps, P. Lichter, A. V. Biankin, C. Chabannon, L. Chin, B. Clement, E. de Alava, F. Degos, M. L. Ferguson, P. Geary, D. N. Hayes, T. J. Hudson, A. L. Johns, A. Kasprzyk, H. Nakagawa, R. Penny, M. A. Piris, R. Sarin, A. Scarpa, T. Shibata, M. van de Vijver, P. A. Futreal, H. Aburatani, M. Bayes, D. D. Botwell, P. J. Campbell, X. Estivill, D. S. Gerhard, S. M. Grimmond, I. Gut, M. Hirst, C. Lopez-Otin, P. Majumder, M. Marra, J. D. McPherson, H. Nakagawa, Z. Ning, X. S. Puente, Y. Ruan, T. Shibata, M. R. Stratton, H. G. Stunnenberg, H. Swerdlow, V. E. Velculescu, R. K. Wilson, H. H. Xue, L. Yang, P. T. Spellman, G. D. Bader, P. C. Boutros, P. J. Campbell, P. Flicek, G. Getz, R. Guigo, G. Guo, D. Haussler, S. Heath, T. J. Hubbard, T. Jiang, S. M. Jones, Q. Li, N. Lopez-Bigas, R. Luo, L. Muthuswamy, B. F. Ouellette, J. V. Pearson, X. S. Puente, V. Quesada, B. J. Raphael, C. Sander, T. Shibata, T. P. Speed, L. D. Stein, J. M. Stuart, J. W. Teague, Y. Totoki, T. Tsunoda, A. Valencia, D. A. Wheeler, H. Wu, S. Zhao, G. Zhou, L. D. Stein, R. Guigo, T. J. Hubbard, Y. Joly, S. M. Jones, A. Kasprzyk, M. Lathrop, N. Lopez-Bigas, B. F. Ouellette, P. T. Spellman, J. W. Teague, G. Thomas, A. Valencia, T. Yoshida, K. L. Kennedy, M. Axton, S. O. Dyke, P. A. Futreal, D. S. Gerhard, C. Gunter, M. Guyer, T. J. Hudson, J. D. McPherson, L. J. Miller, B. Ozenberger, K. M. 
Shaw, A. Kasprzyk, L. D. Stein, J. Zhang, S. A. Haider, J. Wang, C. K. Yung, A. Cros, Y. Liang, S. Gnaneshan, J. Guberman, J. Hsu, M. Bobrow, D. R. Chalmers, K. W. Hasel, Y. Joly, T. S. Kaan, K. L. Kennedy, B. M. Knoppers, W. W. Lowrance, T. Masui, P. Nicolas, E. Rial-Sebbag, L. L. Rodriguez, C. Vergely, T. Yoshida, S. M. Grimmond, A. V. Biankin, D. D. Bowtell, N. Cloonan, A. deFazio, J. R. Eshleman, D. Etemadmoghadam, B. B. Gardiner, J. G. Kench, A. Scarpa, R. L. Sutherland, M. A. Tempero, N. J. Waddell, P. J. Wilson, J. D. McPherson, S. Gallinger, M. S. Tsao, P. A. Shaw, G. M. Petersen, D. Mukhopadhyay, L. Chin, R. A. DePinho, S. Thayer, L. Muthuswamy, K. Shazand, T. Beck, M. Sam, L. Timms, V. Ballin, Y. Lu, J. Ji, X. Zhang, F. Chen, X. Hu, G. Zhou, Q. Yang, G. Tian, L. Zhang, X. Xing, X. Li, Z. Zhu, Y. Yu, J. Yu, H. Yang, M. Lathrop, J. Tost, P. Brennan, I. Holcatova, D. Zaridze, A. Brazma, L. Egevard, E. Prokhortchouk, R. E. Banks, M. Uhlen, A. Cambon-Thomsen, J. Viksna, F. Ponten, K. Skryabin, M. R. Stratton, P. A. Futreal, E. Birney, A. Borg, A. L. Borresen-Dale, C. Caldas, J. A. Foekens, S. Martin, J. S. Reis-Filho, A. L. Richardson, C. Sotiriou, H. G. Stunnenberg, G. Thoms, M. van
de Vijver, L. van't Veer, F. Calvo, D. Birnbaum, H. Blanche, P. Boucher, S. Boyault, C. Chabannon, I. Gut, J. D. Masson-Jacquemier, M. Lathrop, I. Pauporte, X. Pivot, A. Vincent-Salomon, E. Tabone, C. Theillet, G. Thomas, J. Tost, I. Treilleux, F. Calvo, P. Bioulac-Sage, B. Clement, T. Decaens, F. Degos, D. Franco, I. Gut, M. Gut, S. Heath, M. Lathrop, D. Samuel, G. Thomas, J. Zucman-Rossi, P. Lichter, R. Eils, B. Brors, J. O. Korbel, A. Korshunov, P. Landgraf, H. Lehrach, S. Pfister, B. Radlwimmer, G. Reifenberger, M. D. Taylor, C. von Kalle, P. P. Majumder, R. Sarin, T. S. Rao, M. K. Bhan, A. Scarpa, P. Pederzoli, R. A. Lawlor, M. Delledonne, A. Bardelli, A. V. Biankin, S. M. Grimmond, T. Gress, D. Klimstra, G. Zamboni, T. Shibata, Y. Nakamura, H. Nakagawa, J. Kusada, T. Tsunoda, S. Miyano, H. Aburatani, K. Kato, A. Fujimoto, T. Yoshida, E. Campo, C. Lopez-Otin, X. Estivill, R. Guigo, S. de Sanjose, M. A. Piris, E. Montserrat, M. Gonzalez-Diaz, X. S. Puente, P. Jares, A. Valencia, H. Himmelbauer, V. Quesada, S. Bea, M. R. Stratton, P. A. Futreal, P. J. Campbell, A. Vincent-Salomon, A. L. Richardson, J. S. Reis-Filho, M. van de Vijver, G. Thomas, J. D. Masson-Jacquemier, S. Aparicio, A. Borg, A. L. Borresen-Dale, C. Caldas, J. A. Foekens, H. G. Stunnenberg, L. van't Veer, D. F. Easton, P. T. Spellman, S. Martin, A. D. Barker, L. Chin, F. S. Collins, C. C. Compton, M. L. Ferguson, D. S. Gerhard, G. Getz, C. Gunter, A. Guttmacher, M. Guyer, D. N. Hayes, E. S. Lander, B. Ozenberger, R. Penny, J. Peterson, C. Sander, K. M. Shaw, T. P. Speed, P. T. Spellman, J. G. Vockley, D. A. Wheeler, R. K. Wilson, T. J. Hudson, L. Chin, B. M. Knoppers, E. S. Lander, P. Lichter, L. D. Stein, M. R. Stratton, W. Anderson, A. D. Barker, C. Bell, M. Bobrow, W. Burke, F. S. Collins, C. C. Compton, R. A. DePinho, D. F. Easton, P. A. Futreal, D. S. Gerhard, A. R. Green, M. Guyer, S. R. Hamilton, T. J. Hubbard, O. P. Kallioniemi, K. L. Kennedy, T. J. Ley, E. T. Liu, Y. Lu, P.
Majumder, M. Marra, B. Ozenberger, J. Peterson, A. J. Schafer, P. T. Spellman, H. G. Stunnenberg, B. J. Wainwright, R. K. Wilson and H. Yang (2010). "International network of cancer genome projects." Nature 464(7291): 993-998.

Hughes, T. R. (2002). "Yeast and drug discovery." Funct Integr Genomics 2(4-5): 199-211.

Hutchinson, E. (2012). "Tumour immunology: Differing roles for MYD88 in carcinogenesis." Nat Rev Immunol 12(10): 681.

Imielinski, M., A. H. Berger, P. S. Hammerman, B. Hernandez, T. J. Pugh, E. Hodis, J. Cho, J. Suh, M. Capelletti, A. Sivachenko, C. Sougnez, D. Auclair, M. S. Lawrence, P. Stojanov, K. Cibulskis, K. Choi, L. de Waal, T. Sharifnia, A. Brooks, H. Greulich, S. Banerji, T. Zander, D. Seidel, F. Leenders, S. Ansen, C. Ludwig, W. Engel-Riedel, E. Stoelben, J. Wolf, C. Goparju, K. Thompson, W. Winckler, D. Kwiatkowski, B. E. Johnson, P. A. Janne, V. A. Miller, W. Pao, W. D. Travis, H. I. Pass, S. B. Gabriel, E. S. Lander, R. K. Thomas, L. A. Garraway, G. Getz and M. Meyerson (2012). "Mapping the hallmarks of lung adenocarcinoma with massively parallel sequencing." Cell 150(6): 1107-1120.

International Cancer Genome Consortium, T. J. Hudson, W. Anderson, A. Artez, A. D. Barker, C. Bell, R. R. Bernabe, M. K. Bhan, F. Calvo, I. Eerola, D. S. Gerhard, A. Guttmacher, M. Guyer, F. M. Hemsley, J. L. Jennings, D. Kerr, P. Klatt, P. Kolar, J. Kusada, D. P. Lane, F. Laplace, L. Youyong, G. Nettekoven, B. Ozenberger, J. Peterson, T. S. Rao, J. Remacle, A. J. Schafer, T. Shibata, M. R. Stratton, J. G. Vockley, K. Watanabe, H. Yang, M. M. Yuen, B. M. Knoppers, M. Bobrow, A. Cambon-Thomsen, L. G. Dressler, S. O. Dyke, Y. Joly, K. Kato, K. L. Kennedy, P. Nicolas, M. J. Parker, E. Rial-Sebbag, C. M. Romeo-Casabona, K. M. Shaw, S. Wallace, G. L. Wiesner, N. Zeps, P. Lichter, A. V. Biankin, C. Chabannon, L. Chin, B. Clement, E. de Alava, F. Degos, M. L. Ferguson, P. Geary, D. N. Hayes, T. J. Hudson, A. L. Johns, A. Kasprzyk, H.
Nakagawa, R. Penny, M. A. Piris, R. Sarin, A. Scarpa, T. Shibata, M. van de Vijver, P. A. Futreal, H. Aburatani, M. Bayes, D. D. Botwell, P. J. Campbell, X. Estivill, D. S. Gerhard, S. M. Grimmond, I. Gut, M. Hirst, C. Lopez-Otin, P. Majumder, M. Marra, J. D. McPherson, H. Nakagawa, Z. Ning, X. S. Puente, Y. Ruan, T. Shibata, M. R. Stratton, H. G. Stunnenberg, H. Swerdlow, V. E. Velculescu, R. K. Wilson, H. H. Xue, L. Yang, P. T. Spellman, G. D. Bader, P. C. Boutros, P. J. Campbell, P. Flicek, G. Getz, R. Guigo, G. Guo, D. Haussler, S. Heath, T. J. Hubbard, T. Jiang, S. M. Jones, Q. Li, N. Lopez-Bigas, R. Luo, L. Muthuswamy, B. F. Ouellette, J. V. Pearson, X. S. Puente, V. Quesada, B. J. Raphael, C. Sander, T. Shibata, T. P. Speed, L. D. Stein, J. M. Stuart, J. W. Teague, Y. Totoki, T. Tsunoda, A. Valencia, D. A. Wheeler, H. Wu, S. Zhao, G. Zhou, L. D. Stein, R. Guigo, T. J. Hubbard, Y. Joly, S. M. Jones, A. Kasprzyk, M. Lathrop, N. Lopez-Bigas, B. F. Ouellette, P. T. Spellman, J. W. Teague, G. Thomas, A. Valencia, T. Yoshida, K. L. Kennedy, M. Axton, S. O. Dyke, P. A. Futreal, D. S. Gerhard, C. Gunter, M. Guyer, T. J. Hudson, J. D. McPherson, L. J. Miller, B. Ozenberger, K. M. Shaw, A. Kasprzyk, L. D. Stein, J. Zhang, S. A. Haider, J. Wang, C. K. Yung, A. Cros, Y. Liang, S. Gnaneshan, J. Guberman, J. Hsu, M. Bobrow, D. R. Chalmers, K. W. Hasel, Y. Joly, T. S. Kaan, K. L. Kennedy, B. M. Knoppers, W. W. Lowrance, T. Masui, P. Nicolas, E. Rial-Sebbag, L. L. Rodriguez, C. Vergely, T. Yoshida, S. M. Grimmond, A. V. Biankin, D. D. Bowtell, N. Cloonan, A. deFazio, J. R. Eshleman, D. Etemadmoghadam, B. B. Gardiner, J. G. Kench, A. Scarpa, R. L. Sutherland, M. A. Tempero, N. J. Waddell, P. J. Wilson, J. D. McPherson, S. Gallinger, M. S. Tsao, P. A. Shaw, G. M. Petersen, D. Mukhopadhyay, L. Chin, R. A. DePinho, S. Thayer, L. Muthuswamy, K. Shazand, T. Beck, M. Sam, L. Timms, V. Ballin, Y. Lu, J. Ji, X. Zhang, F. Chen, X. Hu, G. Zhou, Q. Yang, G. Tian, L. Zhang, X. Xing, X. 
Li, Z. Zhu, Y. Yu, J. Yu, H. Yang, M. Lathrop, J. Tost, P. Brennan, I. Holcatova, D. Zaridze, A. Brazma, L. Egevard, E. Prokhortchouk, R. E. Banks, M. Uhlen, A. Cambon-Thomsen, J. Viksna, F. Ponten, K. Skryabin, M. R. Stratton, P. A. Futreal, E. Birney, A. Borg, A. L. Borresen-Dale, C. Caldas, J. A. Foekens, S. Martin, J. S. Reis-Filho, A. L. Richardson, C. Sotiriou, H. G. Stunnenberg, G. Thoms, M. van de Vijver, L. van't Veer, F. Calvo, D. Birnbaum, H. Blanche, P. Boucher, S. Boyault, C. Chabannon, I. Gut, J. D. Masson-Jacquemier, M. Lathrop, I. Pauporte, X. Pivot, A. Vincent- Salomon, E. Tabone, C. Theillet, G. Thomas, J. Tost, I. Treilleux, F. Calvo, P. Bioulac-Sage, B. Clement, T. Decaens, F. Degos, D. Franco, I. Gut, M. Gut, S. Heath, M. Lathrop, D. Samuel, G. Thomas, J. Zucman-Rossi, P. Lichter, R. Eils, B. Brors, J. O. Korbel, A. Korshunov, P. Landgraf, H. Lehrach, S. Pfister, B. Radlwimmer, G. Reifenberger, M. D. Taylor, C. von Kalle, P. P. Majumder, R. Sarin, T. S. Rao, M. K. Bhan, A. Scarpa, P. Pederzoli, R. A. Lawlor, M. Delledonne, A. Bardelli, A. V. Biankin, S. M. Grimmond, T. Gress, D. Klimstra, G. Zamboni, T. Shibata, Y. Nakamura, H. Nakagawa, J. Kusada, T. Tsunoda, S. Miyano, H. Aburatani, K. Kato, A. Fujimoto, T. Yoshida, E. Campo, C. Lopez-Otin, X. Estivill, R. Guigo, S. de Sanjose, M. A. Piris, E. Montserrat, M. Gonzalez-Diaz, X. S. Puente, P. Jares, A. Valencia, H. Himmelbauer, V. Quesada, S. Bea, M. R. Stratton, P. A. Futreal, P. J. Campbell, A. Vincent-Salomon, A. L. Richardson, J. S. Reis-Filho, M. van de Vijver, G. Thomas, J. D. Masson-Jacquemier, S. Aparicio, A. Borg, A. L. Borresen-Dale, C. Caldas, J. A. Foekens, H. G. Stunnenberg, L. van't Veer, D. F. Easton, P. T. Spellman, S. Martin, A. D. Barker, L. Chin, F. S. Collins, C. C. Compton, M. L. Ferguson, D. S. Gerhard, G. Getz, C. Gunter, A. Guttmacher, M. Guyer, D. N. Hayes, E. S. Lander, B. Ozenberger, R. Penny, J. Peterson, C. Sander, K. M. Shaw, T. P. Speed, P. T. Spellman, J. G. 
Vockley, D. A. Wheeler, R. K. Wilson, T. J. Hudson, L. Chin, B. M. Knoppers, E. S. Lander, P. Lichter, L. D. Stein, M. R. Stratton, W. Anderson, A. D. Barker, C. Bell, M. Bobrow, W. Burke, F. S. Collins, C. C. Compton, R. A. DePinho, D. F. Easton, P. A. Futreal, D. S. Gerhard, A. R. Green, M. Guyer, S. R. Hamilton, T. J. Hubbard, O. P. Kallioniemi,
K. L. Kennedy, T. J. Ley, E. T. Liu, Y. Lu, P. Majumder, M. Marra, B. Ozenberger, J. Peterson, A. J. Schafer, P. T. Spellman, H. G. Stunnenberg, B. J. Wainwright, R. K. Wilson and H. Yang (2010). "International network of cancer genome projects." Nature 464(7291): 993-998.

Iyer, G., A. J. Hanrahan, M. I. Milowsky, H. Al-Ahmadie, S. N. Scott, M. Janakiraman, M. Pirun, C. Sander, N. D. Socci, I. Ostrovnaya, A. Viale, A. Heguy, L. Peng, T. A. Chan, B. Bochner, D. F. Bajorin, M. F. Berger, B. S. Taylor and D. B. Solit (2012). "Genome sequencing identifies a basis for everolimus sensitivity." Science 338(6104): 221.

Jackson, A. L. and L. A. Loeb (1998). "The mutation rate and cancer." Genetics 148(4): 1483-1490.

Janeway, C. A., P. Travers, M. Walport and M. J. Shlomchik (1997). Immunobiology: the immune system in health and disease, Current Biology.

Jansson, M., S. T. Durant, E. C. Cho, S. Sheahan, M. Edelmann, B. Kessler and N. B. La Thangue (2008). "Arginine methylation regulates the p53 response." Nat Cell Biol 10(12): 1431-1439.

Jiao, Y., C. Shi, B. H. Edil, R. F. de Wilde, D. S. Klimstra, A. Maitra, R. D. Schulick, L. H. Tang, C. L. Wolfgang, M. A. Choti, V. E. Velculescu, L. A. Diaz, Jr., B. Vogelstein, K. W. Kinzler, R. H. Hruban and N. Papadopoulos (2011). "DAXX/ATRX, MEN1, and mTOR pathway genes are frequently altered in pancreatic neuroendocrine tumors." Science 331(6021): 1199-1203.

Jinek, M., K. Chylinski, I. Fonfara, M. Hauer, J. A. Doudna and E. Charpentier (2012). "A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity." Science 337(6096): 816-821.

Joerger, A. C. and A. R. Fersht (2007). "Structure-function-rescue: the diverse nature of common p53 cancer mutants." Oncogene 26(15): 2226-2242.

Jones, D. T., N. Jager, M. Kool, T. Zichner, B. Hutter, M. Sultan, Y. J. Cho, T. J. Pugh, V. Hovestadt, A. M. Stutz, T. Rausch, H. J. Warnatz, M. Ryzhova, S. Bender, D. Sturm, S. Pleier, H. Cin, E. Pfaff, L. Sieber, A. Wittmann, M. Remke, H. Witt, S. Hutter, T. Tzaridis, J. Weischenfeldt, B. Raeder, M. Avci, V. Amstislavskiy, M. Zapatka, U. D. Weber, Q. Wang, B. Lasitschka, C. C. Bartholomae, M. Schmidt, C. von Kalle, V. Ast, C. Lawerenz, J. Eils, R. Kabbe, V. Benes, P. van Sluis, J. Koster, R. Volckmann, D. Shih, M. J. Betts, R. B. Russell, S. Coco, G. P. Tonini, U. Schuller, V. Hans, N. Graf, Y. J. Kim, C. Monoranu, W. Roggendorf, A. Unterberg, C. Herold-Mende, T. Milde, A. E. Kulozik, A. von Deimling, O. Witt, E. Maass, J. Rossler, M. Ebinger, M. U. Schuhmann, M. C. Fruhwald, M. Hasselblatt, N. Jabado, S. Rutkowski, A. O. von Bueren, D. Williamson, S. C. Clifford, M. G. McCabe, V. P. Collins, S. Wolf, S. Wiemann, H. Lehrach, B. Brors, W. Scheurlen, J. Felsberg, G. Reifenberger, P. A. Northcott, M. D. Taylor, M. Meyerson, S. L. Pomeroy, M. L. Yaspo, J. O. Korbel, A. Korshunov, R. Eils, S. M. Pfister and P. Lichter (2012). "Dissecting the genomic complexity underlying medulloblastoma." Nature 488(7409): 100-105.

Jones, S., T. L. Wang, I. M. Shih, T. L. Mao, K. Nakayama, R. Roden, R. Glas, D. Slamon, L. A. Diaz, Jr., B. Vogelstein, K. W. Kinzler, V. E. Velculescu and N. Papadopoulos (2010).
"Frequent mutations of chromatin remodeling gene ARID1A in ovarian clear cell carcinoma." Science 330(6001): 228-231.

Kachroo, A. H., J. M. Laurent, C. M. Yellman, A. G. Meyer, C. O. Wilke and E. M. Marcotte (2015). "Systematic humanization of yeast genes reveals conserved functions and genetic modularity." Science 348(6237): 921-925.

Kamburov, A., M. S. Lawrence, P. Polak, I. Leshchiner, K. Lage, T. R. Golub, E. S. Lander and G. Getz (2015). "Comprehensive assessment of cancer missense mutation clustering in protein structures." Proc Natl Acad Sci U S A 112(40): E5486-5495.

Kandoth, C., M. D. McLellan, F. Vandin, K. Ye, B. Niu, C. Lu, M. Xie, Q. Zhang, J. F. McMichael, M. A. Wyczalkowski, M. D. Leiserson, C. A. Miller, J. S. Welch, M. J. Walter, M. C. Wendl, T. J. Ley, R. K. Wilson, B. J. Raphael and L. Ding (2013). "Mutational landscape and significance across 12 major cancer types." Nature 502(7471): 333-339.

Kannan, K., A. Inagaki, J. Silber, D. Gorovets, J. Zhang, E. R. Kastenhuber, A. Heguy, J. H. Petrini, T. A. Chan and J. T. Huse (2012). "Whole-exome sequencing identifies ATRX mutation as a key molecular determinant in lower-grade glioma." Oncotarget 3(10): 1194-1203.

Keegan, L. P., L. McGurk, J. P. Palavicini, J. Brindle, S. Paro, X. Li, J. J. Rosenthal and M. A. O'Connell (2011). "Functional conservation in human and Drosophila of Metazoan ADAR2 involved in RNA editing: loss of ADAR1 in insects." Nucleic Acids Res 39(16): 7249-7262.

Kelderman, S. and P. Kvistborg (2016). "Tumor antigens in human cancer control." Biochim Biophys Acta 1865(1): 83-89.

Kelekar, A. and C. B. Thompson (1998). "Bcl-2-family proteins: the role of the BH3 domain in apoptosis." Trends Cell Biol 8(8): 324-330.

Khong, H. T., Q. J. Wang and S. A. Rosenberg (2004). "Identification of multiple antigens recognized by tumor-infiltrating lymphocytes from a single patient: tumor escape by antigen loss and loss of MHC expression." J Immunother 27(3): 184-190.

Kim, S. C., Y. Jung, J. Park, S. Cho, C. Seo, J. Kim, P. Kim, J. Park, J. Seo, J. Kim, S. Park, I. Jang, N. Kim, J. O. Yang, B. Lee, K. Rho, Y. Jung, J. Keum, J. Lee, J. Han, S. Kang, S. Bae, S. J. Choi, S. Kim, J. E. Lee, W. Kim, J. Kim and S. Lee (2013). "A high-dimensional, deep-sequencing study of lung adenocarcinoma in female never-smokers." PLoS One 8(2): e55596.

Knudson, A. G., Jr. (1971). "Mutation and cancer: statistical study of retinoblastoma." Proc Natl Acad Sci U S A 68(4): 820-823.

Krauthammer, M., Y. Kong, B. H. Ha, P. Evans, A. Bacchiocchi, J. P. McCusker, E. Cheng, M. J. Davis, G. Goh, M. Choi, S. Ariyan, D. Narayan, K. Dutton-Regester, A. Capatana, E. C. Holman, M. Bosenberg, M. Sznol, H. M. Kluger, D. E. Brash, D. F. Stern, M. A. Materin, R. S. Lo, S. Mane, S. Ma, K. K. Kidd, N. K. Hayward, R. P. Lifton, J. Schlessinger, T. J. Boggon and R. Halaban (2012). "Exome sequencing identifies recurrent somatic RAC1 mutations in melanoma." Nat Genet 44(9): 1006-1014.

Kreiter, S., M. Vormehr, N. van de Roemer, M. Diken, M. Lower, J. Diekmann, S. Boegel, B. Schrors, F. Vascotto, J. C. Castle, A. D. Tadmor, S. P. Schoenberger, C. Huber, O. Tureci and U. Sahin (2015). "Erratum: Mutant MHC class II epitopes drive therapeutic immune responses to cancer." Nature 523(7560): 370.

Kreiter, S., M. Vormehr, N. van de Roemer, M. Diken, M. Lower, J. Diekmann, S. Boegel, B. Schrors, F. Vascotto, J. C. Castle, A. D. Tadmor, S. P. Schoenberger, C. Huber, O. Tureci and U. Sahin (2015). "Mutant MHC class II epitopes drive therapeutic immune responses to cancer." Nature 520(7549): 692-696.

Krissinel, E. and K. Henrick (2007). "Inference of macromolecular assemblies from crystalline state." J Mol Biol 372(3): 774-797.

Kruger, W. D. and D. R. Cox (1994). "A yeast system for expression of human cystathionine beta-synthase: structural and functional conservation of the human and yeast genes." Proc Natl Acad Sci U S A 91(14): 6614-6618.

Kruger, W. D. and D. R. Cox (1995). "A yeast assay for functional detection of mutations in the human cystathionine beta-synthase gene." Hum Mol Genet 4(7): 1155-1161.

Kusano, M., H. Kakiuchi, M. Mihara, F. Itoh, Y. Adachi, M. Ohara, M. Hosokawa and K. Imai (2001). "Absence of microsatellite instability and germline mutations of E-cadherin, APC and p53 genes in Japanese familial gastric cancer." Tumour Biol 22(4): 262-268.

Lafitte, M., I. Moranvillier, S. Garcia, E. Peuchant, J. Iovanna, B. Rousseau, P. Dubus, V. Guyonnet-Duperat, G. Belleannee, J. Ramos, A. Bedel, H. de Verneuil, F. Moreau-Gaudry and S. Dabernat (2013). "FGFR3 has tumor suppressor properties in cells with epithelial phenotype." Mol Cancer 12: 83.

Landrum, M. J., J. M. Lee, G. R. Riley, W. Jang, W. S. Rubinstein, D. M. Church and D. R. Maglott (2014). "ClinVar: public archive of relationships among sequence variation and human phenotype." Nucleic Acids Res 42(Database issue): D980-985.

Laurent, J. M., J. H. Young, A. H. Kachroo and E. M. Marcotte (2015). "Efforts to make and apply humanized yeast." Brief Funct Genomics.

Lawrence, M. S., P. Stojanov, P. Polak, G. V. Kryukov, K. Cibulskis, A. Sivachenko, S. L. Carter, C. Stewart, C. H. Mermel, S. A. Roberts, A. Kiezun, P. S. Hammerman, A. McKenna, Y. Drier, L. Zou, A. H. Ramos, T. J. Pugh, N. Stransky, E. Helman, J. Kim, C. Sougnez, L. Ambrogio, E. Nickerson, E. Shefler, M. L. Cortes, D. Auclair, G. Saksena, D. Voet, M. Noble, D. DiCara, P. Lin, L. Lichtenstein, D. I. Heiman, T. Fennell, M. Imielinski, B. Hernandez, E. Hodis, S. Baca, A. M. Dulak, J. Lohr, D. A. Landau, C. J. Wu, J. Melendez-Zajgla, A. Hidalgo-Miranda, A. Koren, S. A. McCarroll, J. Mora, R. S. Lee, B. Crompton, R. Onofrio, M. Parkin, W. Winckler, K. Ardlie, S. B. Gabriel, C. W. Roberts, J. A. Biegel, K. Stegmaier, A. J. Bass, L. A. Garraway, M. Meyerson, T. R. Golub, D. A. Gordenin, S. Sunyaev, E. S. Lander and G. Getz (2013). "Mutational heterogeneity in cancer and the search for new cancer-associated genes." Nature 499(7457): 214-218.

Le Gallo, M., A. J. O'Hara, M. L. Rudd, M. E. Urick, N. F. Hansen, N. J. O'Neil, J. C. Price, S. Zhang, B. M. England, A. K. Godwin, D. C. Sgroi, NIH Intramural Sequencing Center (NISC) Comparative Sequencing Program, P. Hieter, J. C. Mullikin, M. J. Merino and D. W. Bell (2012). "Exome sequencing of serous endometrial tumors identifies recurrent somatic mutations in chromatin-remodeling and ubiquitin ligase complex genes." Nat Genet 44(12): 1310-1315.

Lee, J. C., I. Vivanco, R. Beroukhim, J. H. Huang, W. L. Feng, R. M. DeBiasi, K. Yoshimoto, J. C. King, P. Nghiemphu and Y. Yuza (2006). "Epidermal growth factor receptor activation in glioblastoma through novel missense mutations in the extracellular domain." PLoS Med 3(12): e485.

Lee, M. G. and P. Nurse (1987). "Complementation used to clone a human homologue of the fission yeast cell cycle control gene cdc2." Nature 327(6117): 31-35.

Lee, R. S., C. Stewart, S. L. Carter, L. Ambrogio, K. Cibulskis, C. Sougnez, M. S. Lawrence, D. Auclair, J. Mora, T. R. Golub, J. A. Biegel, G. Getz and C. W. Roberts (2012). "A remarkably simple genome underlies highly malignant pediatric rhabdoid cancers." J Clin Invest 122(8): 2983-2988.

Leich, E., S. Weissbach, H. U. Klein, T. Grieb, J. Pischimarov, T. Stuhmer, M. Chatterjee, T. Steinbrunn, C. Langer, M. Eilers, S. Knop, H. Einsele, R. Bargou and A. Rosenwald (2013). "Multiple myeloma is affected by multiple and heterogeneous somatic mutations in adhesion- and receptor tyrosine kinase signaling molecules." Blood Cancer J 3: e102.

Lemaire, M., V. Fremeaux-Bacchi, F. Schaefer, M. Choi, W. H. Tang, M. Le Quintrec, F. Fakhouri, S. Taque, F. Nobili, F. Martinez, W. Ji, J. D. Overton, S. M. Mane, G. Nurnberg, J. Altmuller, H. Thiele, D. Morin, G. Deschenes, V. Baudouin, B. Llanas, L. Collard, M. A. Majid, E. Simkova, P. Nurnberg, N. Rioux-Leclerc, G. W. Moeckel, M. C. Gubler, J. Hwa, C. Loirat and R. P. Lifton (2013). "Recessive mutations in DGKE cause atypical hemolytic-uremic syndrome." Nat Genet 45(5): 531-536.

Levitan, D., T. G. Doyle, D. Brousseau, M. K. Lee, G. Thinakaran, H. H. Slunt, S. S. Sisodia and I. Greenwald (1996). "Assessment of normal and mutant human presenilin function in Caenorhabditis elegans." Proc Natl Acad Sci U S A 93(25): 14940-14944.

Li, M., H. Zhao, X. Zhang, L. D. Wood, R. A. Anders, M. A. Choti, T. M. Pawlik, H. D. Daniel, R. Kannangai, G. J. Offerhaus, V. E. Velculescu, L. Wang, S. Zhou, B. Vogelstein, R. H. Hruban, N. Papadopoulos, J. Cai, M. S. Torbenson and K. W. Kinzler (2011). "Inactivating mutations of the chromatin remodeling gene ARID2 in hepatocellular carcinoma." Nat Genet 43(9): 828-829.

Lindberg, J., I. G. Mills, D. Klevebring, W. Liu, M. Neiman, J. Xu, P. Wikstrom, P. Wiklund, F. Wiklund, L. Egevad and H. Gronberg (2013). "The mitochondrial and autosomal mutation landscapes of prostate cancer." Eur Urol 63(4): 702-708.

Lipsitch, M., C. T. Bergstrom and R. Antia (2003). "Effect of human leukocyte antigen heterozygosity on infectious disease outcome: the need for allele-specific measures." BMC Med Genet 4: 2.

Liu, C. J., W. G. Shen, S. Y. Peng, H. W. Cheng, S. Y. Kao, S. C. Lin and K. W. Chang (2014). "miR-134 induces oncogenicity and metastasis in head and neck carcinoma through targeting WWOX gene." Int J Cancer 134(4): 811-821.

Liu, J., W. Lee, Z. Jiang, Z. Chen, S. Jhunjhunwala, P. M. Haverty, F. Gnad, Y. Guan, H. N. Gilbert, J. Stinson, C. Klijn, J. Guillory, D. Bhatt, S. Vartanian, K. Walter, J. Chan, T. Holcomb, P. Dijkgraaf, S. Johnson, J. Koeman, J. D. Minna, A. F. Gazdar, H. M. Stern, K. P. Hoeflich, T. D. Wu, J. Settleman, F. J. de Sauvage, R. C. Gentleman, R. M. Neve, D. Stokoe, Z. Modrusan, S. Seshagiri, D. S. Shames and Z. Zhang (2012). "Genome and transcriptome sequencing of lung cancers reveal diverse mutational and splicing events." Genome Res 22(12): 2315-2327.

Lobry, C., P. Oh and I. Aifantis (2011). "Oncogenic and tumor suppressor functions of Notch in cancer: it's NOTCH what you think." J Exp Med 208(10): 1931-1935.

Luebeck, E. G. and S. H. Moolgavkar (2002). "Multistage carcinogenesis and the incidence of colorectal cancer." Proc Natl Acad Sci U S A 99(23): 15095-15100.

Lundegaard, C., K. Lamberth, M. Harndahl, S. Buus, O. Lund and M. Nielsen (2008). "NetMHC-3.0: accurate web accessible predictions of human, mouse and monkey MHC class I affinities for peptides of length 8-11." Nucleic Acids Res 36(Web Server issue): W509-512.

Mali, P., L. Yang, K. M. Esvelt, J. Aach, M. Guell, J. E. DiCarlo, J. E. Norville and G. M. Church (2013). "RNA-guided human genome engineering via Cas9." Science 339(6121): 823-826.

Mamitsuka, H. (1998). "Predicting peptides that bind to MHC molecules using supervised learning of hidden Markov models." Proteins 33(4): 460-474.

Manning, G., G. D. Plowman, T. Hunter and S. Sudarsanam (2002). "Evolution of protein kinase signaling from yeast to man." Trends Biochem Sci 27(10): 514-520.

Marcotte, R., K. R. Brown, F. Suarez, A. Sayad, K. Karamboulas, P. M. Krzyzanowski, F. Sircoulomb, M. Medrano, Y. Fedyshyn, J. L. Koh, D. van Dyk, B. Fedyshyn, M. Luhova, G. C. Brito, F. J. Vizeacoumar, F. S. Vizeacoumar, A. Datti, D. Kasimer, A. Buzina, P. Mero, C. Misquitta, J. Normand, M. Haider, T. Ketela, J. L. Wrana, R. Rottapel, B. G. Neel and J. Moffat (2012). "Essential gene profiles in breast, pancreatic, and ovarian cancer cells." Cancer Discov 2(2): 172-189.

Marini, N. J., J. Gin, J. Ziegle, K. H. Keho, D. Ginzinger, D. A. Gilbert and J. Rine (2008). "The prevalence of folate-remedial MTHFR enzyme variants in humans." Proc Natl Acad Sci U S A 105(23): 8055-8060.

Marini, N. J., P. D. Thomas and J. Rine (2010). "The use of orthologous sequences to predict the impact of amino acid substitutions on protein function." PLoS Genet 6(5): e1000968.

Mathe, E., M. Olivier, S. Kato, C. Ishioka, P. Hainaut and S. V. Tavtigian (2006). "Computational approaches for predicting the biological effect of p53 missense mutations: a comparison of three sequence analysis based methods." Nucleic Acids Res 34(5): 1317-1325.

Mayfield, J. A., M. W. Davies, D. Dimster-Denk, N. Pleskac, S. McCarthy, E. A. Boydston, L. Fink, X. X. Lin, A. S. Narain, M. Meighan and J. Rine (2012). "Surrogate genetics and metabolic profiling for characterization of human disease alleles." Genetics 190(4): 1309-1323.

McFarland, C. D., K. S. Korolev, G. V. Kryukov, S. R. Sunyaev and L. A. Mirny (2013). "Impact of deleterious passenger mutations on cancer progression." Proc Natl Acad Sci U S A 110(8): 2910-2915.

Melhem, A., N. Muhanna, A. Bishara, C. E. Alvarez, Y. Ilan, T. Bishara, A. Horani, M. Nassar, S. L. Friedman and R. Safadi (2006). "Anti-fibrotic activity of NK cells in experimental liver injury through killing of activated HSC." J Hepatol 45(1): 60-71.

Mellman, I., G. Coukos and G. Dranoff (2011). "Cancer immunotherapy comes of age." Nature 480(7378): 480-489.

Meydan, C., H. H. Otu and O. U. Sezerman (2013). "Prediction of peptides binding to MHC class I and II alleles by temporal motif mining." BMC Bioinformatics 14 Suppl 2: S13.

Miled, N., Y. Yan, W. C. Hon, O. Perisic, M. Zvelebil, Y. Inbar, D. Schneidman-Duhovny, H. J. Wolfson, J. M. Backer and R. L. Williams (2007). "Mechanism of two classes of cancer mutations in the phosphoinositide 3-kinase catalytic subunit." Science 317(5835): 239-242.

Minorikawa, S. and M. Nakayama (2011). "Recombinase-mediated cassette exchange (RMCE) and BAC engineering via VCre/VloxP and SCre/SloxP systems." Biotechniques 50(4): 235-246.

Mitra, K., A.-R. Carvunis, S. K. Ramesh and T. Ideker (2013). "Integrative approaches for finding modular structure in biological networks." Nat Rev Genet 14(10): 719-732.

Molenaar, J. J., J. Koster, D. A. Zwijnenburg, P. van Sluis, L. J. Valentijn, I. van der Ploeg, M. Hamdi, J. van Nes, B. A. Westerman, J. van Arkel, M. E. Ebus, F. Haneveld, A. Lakeman, L. Schild, P. Molenaar, P. Stroeken, M. M. van Noesel, I. Ora, E. E. Santo, H. N. Caron, E. M. Westerhout and R. Versteeg (2012). "Sequencing of neuroblastoma identifies chromothripsis and defects in neuritogenesis genes." Nature 483(7391): 589-593.

Mooney, S. (2005). "Bioinformatics approaches and resources for single nucleotide polymorphism functional analysis." Brief Bioinform 6(1): 44-56.

Morakote, N. and D. E. Justus (1988). "Immunosuppression in malaria: effect of hemozoin produced by Plasmodium berghei and Plasmodium falciparum." Int Arch Allergy Appl Immunol 86(1): 28-34.

Morin, R. D., M. Mendez-Lago, A. J. Mungall, R. Goya, K. L. Mungall, R. D. Corbett, N. A. Johnson, T. M. Severson, R. Chiu, M. Field, S. Jackman, M. Krzywinski, D. W. Scott, D. L. Trinh, J. Tamura-Wells, S. Li, M. R. Firme, S. Rogic, M. Griffith, S. Chan, O. Yakovenko, I. M. Meyer, E. Y. Zhao, D. Smailus, M. Moksa, S. Chittaranjan, L. Rimsza, A. Brooks-Wilson, J. J. Spinelli, S. Ben-Neriah, B. Meissner, B. Woolcock, M. Boyle, H. McDonald, A. Tam, Y. Zhao, A. Delaney, T. Zeng, K. Tse, Y. Butterfield, I. Birol, R. Holt, J. Schein, D. E. Horsman, R. Moore, S. J. Jones, J. M. Connors, M. Hirst, R. D. Gascoyne and M. A. Marra (2011). "Frequent mutation of histone-modifying genes in non-Hodgkin lymphoma." Nature 476(7360): 298-303.

Nehrt, N. L., T. A. Peterson, D. Park and M. G. Kann (2012). "Domain landscapes of somatic mutations in cancer." BMC Genomics 13 Suppl 4: S9.

Nelson, M. R., D. Wegmann, M. G. Ehm, D. Kessner, P. St Jean, C. Verzilli, J. Shen, Z. Tang, S. A. Bacanu, D. Fraser, L. Warren, J. Aponte, M. Zawistowski, X. Liu, H. Zhang, Y. Zhang, J. Li, Y. Li, L. Li, P. Woollard, S. Topp, M. D. Hall, K. Nangle, J. Wang, G. Abecasis, L. R. Cardon, S. Zollner, J. C. Whittaker, S. L. Chissoe, J. Novembre and V. Mooser (2012). "An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people." Science 337(6090): 100-104.

Ng, P. C. and S. Henikoff (2001). "Predicting deleterious amino acid substitutions." Genome Res 11(5): 863-874.

Ng, P. C. and S. Henikoff (2006). "Predicting the effects of amino acid substitutions on protein function." Annu Rev Genomics Hum Genet 7: 61-80.

Nguyen, D. X., P. D. Bos and J. Massague (2009). "Metastasis: from dissemination to organ-specific colonization." Nat Rev Cancer 9(4): 274-284.

Nielsen, M. and M. Andreatta (2016). "NetMHCpan-3.0; improved prediction of binding to MHC class I molecules integrating information from multiple receptor and peptide length datasets." Genome Med 8(1): 33.

Nielsen, M., C. Lundegaard, P. Worning, S. L. Lauemoller, K. Lamberth, S. Buus, S. Brunak and O. Lund (2003). "Reliable prediction of T-cell epitopes using neural networks with novel sequence representations." Protein Sci 12(5): 1007-1017.

Nik-Zainal, S., L. B. Alexandrov, D. C. Wedge, P. Van Loo, C. D. Greenman, K. Raine, D. Jones, J. Hinton, J. Marshall, L. A. Stebbings, A. Menzies, S. Martin, K. Leung, L. Chen, C. Leroy, M. Ramakrishna, R. Rance, K. W. Lau, L. J. Mudie, I. Varela, D. J. McBride, G. R. Bignell, S. L. Cooke, A. Shlien, J. Gamble, I. Whitmore, M. Maddison, P. S. Tarpey, H. R. Davies, E. Papaemmanuil, P. J. Stephens, S. McLaren, A. P. Butler, J. W. Teague, G. Jonsson, J. E. Garber, D. Silver, P. Miron, A. Fatima, S. Boyault, A. Langerod, A. Tutt, J. W. Martens, S. A. Aparicio, A. Borg, A. V. Salomon, G. Thomas, A. L. Borresen-Dale, A. L. Richardson, M. S. Neuberger, P. A. Futreal, P. J. Campbell, M. R. Stratton and the Breast Cancer Working Group of the International Cancer Genome Consortium (2012). "Mutational processes molding the genomes of 21 breast cancers." Cell 149(5): 979-993.

Obeid, M., A. Tesniere, F. Ghiringhelli, G. M. Fimia, L. Apetoh, J. L. Perfettini, M. Castedo, G. Mignot, T. Panaretakis, N. Casares, D. Metivier, N. Larochette, P. van Endert, F. Ciccosanti, M. Piacentini, L. Zitvogel and G. Kroemer (2007). "Calreticulin exposure dictates the immunogenicity of cancer cell death." Nat Med 13(1): 54-61.

Olivier, M. (2004). "From SNPs to function: the effect of sequence variation on gene expression. Focus on 'a survey of genetic and epigenetic variation affecting human gene expression'." Physiol Genomics 16(2): 182-183.

Osborn, M. J. and J. R. Miller (2007). "Rescuing yeast mutants with human genes." Brief Funct Genomic Proteomic 6(2): 104-111.

Owsianka, A. M. and A. H. Patel (1999). "Hepatitis C virus core protein interacts with a human DEAD box protein DDX3." Virology 257(2): 330-340.

Parsons, D. W., M. Li, X. Zhang, S. Jones, R. J. Leary, J. C. Lin, S. M. Boca, H. Carter, J. Samayoa, C. Bettegowda, G. L. Gallia, G. I. Jallo, Z. A. Binder, Y. Nikolsky, J. Hartigan, D. R. Smith, D. S. Gerhard, D. W. Fults, S. VandenBerg, M. S. Berger, S. K. Marie, S. M. Shinjo, C. Clara, P. C. Phillips, J. E. Minturn, J. A. Biegel, A. R. Judkins, A. C. Resnick, P. B. Storm, T. Curran, Y. He, B. A. Rasheed, H. S. Friedman, S. T. Keir, R. McLendon, P. A. Northcott, M. D. Taylor, P. C. Burger, G. J. Riggins, R. Karchin, G. Parmigiani, D. D. Bigner, H. Yan, N. Papadopoulos, B. Vogelstein, K. W. Kinzler and V. E. Velculescu (2011). "The genetic landscape of the childhood cancer medulloblastoma." Science 331(6016): 435-439.

Pastinen, T., R. Sladek, S. Gurd, A. Sammak, B. Ge, P. Lepage, K. Lavergne, A. Villeneuve, T. Gaudin, H. Brandstrom, A. Beck, A. Verner, J. Kingsley, E. Harmsen, D. Labuda, K. Morgan, M. C. Vohl, A. K. Naumova, D. Sinnett and T. J. Hudson (2004). "A survey of genetic and epigenetic variation affecting human gene expression." Physiol Genomics 16(2): 184-193.

Pena-Llopis, S., S. Vega-Rubin-de-Celis, A. Liao, N. Leng, A. Pavia-Jimenez, S. Wang, T. Yamasaki, L. Zhrebker, S. Sivanand, P. Spence, L. Kinch, T. Hambuch, S. Jain, Y. Lotan, V. Margulis, A. I. Sagalowsky, P. B. Summerour, W. Kabbani, S. W. Wong, N. Grishin, M. Laurent, X. J. Xie, C. D. Haudenschild, M. T. Ross, D. R. Bentley, P. Kapur and J. Brugarolas (2012). "BAP1 loss defines a new class of renal cell carcinoma." Nat Genet 44(7): 751-759.

Peterson, T. A., E. Doughty and M. G. Kann (2013). "Towards precision medicine: advances in computational approaches for the analysis of human variants." J Mol Biol 425(21): 4047-4063.

Peterson, T. A., D. Park and M. G. Kann (2013). "A protein domain-centric approach for the comparative analysis of human and yeast phenotypically relevant mutations." BMC Genomics 14 Suppl 3: S5.

Piluso, G., F. D'Amico, V. Saccone, E. Bismuto, I. L. Rotundo, M. Di Domenico, S. Aurino, C. E. Schwartz, G. Neri and V. Nigro (2009). "A missense mutation in CASK causes FG syndrome in an Italian family." Am J Hum Genet 84(2): 162-177.

Pleasance, E. D., R. K. Cheetham, P. J. Stephens, D. J. McBride, S. J. Humphray, C. D. Greenman, I. Varela, M. L. Lin, G. R. Ordonez, G. R. Bignell, K. Ye, J. Alipaz, M. J. Bauer, D. Beare, A. Butler, R. J. Carter, L. Chen, A. J. Cox, S. Edkins, P. I. Kokko-Gonzales, N. A. Gormley, R. J. Grocock, C. D. Haudenschild, M. M. Hims, T. James, M. Jia, Z. Kingsbury, C. Leroy, J. Marshall, A. Menzies, L. J. Mudie, Z. Ning, T. Royce, O. B. Schulz-Trieglaff, A. Spiridou, L. A. Stebbings, L. Szajkowski, J. Teague, D. Williamson, L. Chin, M. T. Ross, P. J. Campbell, D. R. Bentley, P. A. Futreal and M. R. Stratton (2010). "A comprehensive catalogue of somatic mutations from a human cancer genome." Nature 463(7278): 191-196.

Pleasance, E. D., P. J. Stephens, S. O'Meara, D. J. McBride, A. Meynert, D. Jones, M. L. Lin, D. Beare, K. W. Lau, C. Greenman, I. Varela, S. Nik-Zainal, H. R. Davies, G. R. Ordonez, L. J. Mudie, C. Latimer, S. Edkins, L. Stebbings, L. Chen, M. Jia, C. Leroy, J. Marshall, A. Menzies, A. Butler, J. W. Teague, J. Mangion, Y. A. Sun, S. F. McLaughlin, H. E. Peckham, E. F. Tsung, G. L. Costa, C. C. Lee, J. D. Minna, A. Gazdar, E. Birney, M. D. Rhodes, K. J. McKernan, M. R. Stratton, P. A. Futreal and P. J. Campbell (2010). "A small-cell lung cancer genome with complex signatures of tobacco exposure." Nature 463(7278): 184-190.

Ponting, C. P. and R. R. Russell (2002). "The natural history of protein domains." Annu Rev Biophys Biomol Struct 31: 45-71.

Porter, C. T., G. J. Bartlett and J. M. Thornton (2004). "The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data." Nucleic Acids Res 32(suppl 1): D129-133.

Prehn, R. T. (1969). "The relationship of immunology to carcinogenesis." Ann N Y Acad Sci 164(2): 449-457.

Prior, I. A., P. D. Lewis and C. Mattos (2012). "A comprehensive survey of Ras mutations in cancer." Cancer Res 72(10): 2457-2467.

Pritchard, J. K. (2001). "Are rare variants responsible for susceptibility to complex diseases?" Am J Hum Genet 69(1): 124-137.

Pruitt, K. D., T. Tatusova, W. Klimke and D. R. Maglott (2009). "NCBI Reference Sequences: current status, policy and new initiatives." Nucleic Acids Res 37(Database issue): D32-36.

Pugh, T. J., S. D. Weeraratne, T. C. Archer, D. A. Pomeranz Krummel, D. Auclair, J. Bochicchio, M. O. Carneiro, S. L. Carter, K. Cibulskis, R. L. Erlich, H. Greulich, M. S. Lawrence, N. J. Lennon, A. McKenna, J. Meldrim, A. H. Ramos, M. G. Ross, C. Russ, E. Shefler, A. Sivachenko, B. Sogoloff, P. Stojanov, P. Tamayo, J. P. Mesirov, V. Amani, N. Teider, S. Sengupta, J. P. Francois, P. A. Northcott, M. D. Taylor, F. Yu, G. R. Crabtree, A. G. Kautzman, S. B. Gabriel, G. Getz, N. Jager, D. T. Jones, P. Lichter, S. M. Pfister, T. M. Roberts, M. Meyerson, S. L. Pomeroy and Y. J. Cho (2012). "Medulloblastoma exome sequencing uncovers subtype-specific somatic mutations." Nature 488(7409): 106-110.

Rammensee, H., J. Bachmann, N. P. Emmerich, O. A. Bachor and S. Stevanovic (1999). "SYFPEITHI: database for MHC ligands and peptide motifs." Immunogenetics 50(3-4): 213-219.

Rammensee, H. G., T. Friede and S. Stevanovic (1995). "MHC ligands and peptide motifs: first listing." Immunogenetics 41(4): 178-228.

Reche, P. A., J. P. Glutting and E. L. Reinherz (2002). "Prediction of MHC class I binding peptides using profile motifs." Hum Immunol 63(9): 701-709.

Richards, F. M., S. A. McKee, M. H. Rajpar, T. R. Cole, D. G. Evans, J. A. Jankowski, C. McKeown, D. S. Sanders and E. R. Maher (1999). "Germline E-cadherin gene (CDH1) mutations predispose to familial gastric cancer and colorectal cancer." Hum Mol Genet 8(4): 607-610.

Richards, S., N. Aziz, S. Bale, D. Bick, S. Das, J. Gastier-Foster, W. W. Grody, M. Hegde, E. Lyon, E. Spector, K. Voelkerding and H. L. Rehm (2015). "Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology." Genet Med 17(5): 405-423.

Ritchie, M. D., E. R. Holzinger, R. Li, S. A. Pendergrass and D. Kim (2015). "Methods of integrating data to uncover genotype-phenotype interactions." Nat Rev Genet 16(2): 85-97.

Roberts, S. A., M. S. Lawrence, L. J. Klimczak, S. A. Grimm, D. Fargo, P. Stojanov, A. Kiezun, G. V. Kryukov, S. L. Carter, G. Saksena, S. Harris, R. R. Shah, M. A. Resnick, G. Getz and D. A. Gordenin (2013). "An APOBEC cytidine deaminase mutagenesis pattern is widespread in human cancers." Nat Genet 45(9): 970-976.

Robinson, G., M. Parker, T. A. Kranenburg, C. Lu, X. Chen, L. Ding, T. N. Phoenix, E. Hedlund, L. Wei, X. Zhu, N. Chalhoub, S. J. Baker, R. Huether, R. Kriwacki, N. Curley, R. Thiruvenkatam, J. Wang, G. Wu, M. Rusch, X. Hong, J. Becksfort, P. Gupta, J. Ma, J. Easton, B. Vadodaria, A. Onar-Thomas, T. Lin, S. Li, S. Pounds, S. Paugh, D. Zhao, D. Kawauchi, M. F. Roussel, D. Finkelstein, D. W. Ellison, C. C. Lau, E. Bouffet, T. Hassall, S. Gururangan, R. Cohn, R. S. Fulton, L. L. Fulton, D. J. Dooling, K. Ochoa, A. Gajjar, E. R. Mardis, R. K. Wilson, J. R. Downing, J. Zhang and R. J. Gilbertson (2012). "Novel mutations target distinct subgroups of medulloblastoma." Nature 488(7409): 43-48.

Rolland, T., M. Ta An, B. Charloteaux, S. J. Pevzner, Q. Zhong, N. Sahni, S. Yi, I. Lemmens, C. Fontanillo, R. Mosca, A. Kamburov, S. D. Ghiassian, X. Yang, L. Ghamsari, D. Balcha, B. E. Begg, P. Braun, M. Brehme, M. P. Broly, A. R. Carvunis, D. Convery-Zupan, R. Corominas, J. Coulombe-Huntington, E. Dann, M. Dreze, A. Dricot, C. Fan, E. Franzosa, F. Gebreab, B. J. Gutierrez, M. F. Hardy, M. Jin, S. Kang, R. Kiros, G. N. Lin, K. Luck, A. MacWilliams, J. Menche, R. R. Murray, A. Palagi, M. M. Poulin, X. Rambout, J. Rasla, P. Reichert, V. Romero, E. Ruyssinck, J. M. Sahalie, A. Scholz, A. A. Shah, A. Sharma, Y. Shen, K. Spirohn, S. Tam, A. O. Tejeda, S. A. Trigg, J. C. Twizere, K. Vega, J. Walsh, M. E. Cusick, Y. Xia, A. L. Barabasi, L. M. Iakoucheva, P. Aloy, J. De Las Rivas, J. Tavernier, M. A. Calderwood, D. E. Hill, T. Hao, F. P. Roth and M. Vidal (2014). "A proteome-scale map of the human interactome network." Cell 159(5): 1212-1226.

Rosenberg, S. A. (2005). "Cancer immunotherapy comes of age." Nat Clin Pract Oncol 2(3): 115.

Rual, J. F., T. Hirozane-Kishikawa, T. Hao, N. Bertin, S. Li, A. Dricot, N. Li, J. Rosenberg, P. Lamesch, P. O. Vidalain, T. R. Clingingsmith, J. L. Hartley, D. Esposito, D. Cheo, T. Moore, B. Simmons, R. Sequerra, S. Bosak, L. Doucette-Stamm, C. Le Peuch, J. Vandenhaute, M. E. Cusick, J. S. Albala, D. E. Hill and M. Vidal (2004). "Human ORFeome version 1.1: a platform for reverse proteomics." Genome Res 14(10B): 2128-2135.

Rual, J. F., K. Venkatesan, T. Hao, T. Hirozane-Kishikawa, A. Dricot, N. Li, G. F. Berriz, F. D. Gibbons, M. Dreze, N. Ayivi-Guedehoussou, N. Klitgord, C. Simon, M. Boxem, S. Milstein, J. Rosenberg, D. S. Goldberg, L. V. Zhang, S. L. Wong, G. Franklin, S. Li, J. S. Albala, J. Lim, C. Fraughton, E. Llamosas, S. Cevik, C. Bex, P. Lamesch, R. S. Sikorski, J. Vandenhaute, H. Y. Zoghbi, A. Smolyar, S. Bosak, R. Sequerra, L. Doucette-Stamm, M. E. Cusick, D. E. Hill, F. P. Roth and M. Vidal (2005). "Towards a proteome-scale map of the human protein-protein interaction network." Nature 437(7062): 1173-1178.

Rudin, C. M., S. Durinck, E. W. Stawiski, J. T. Poirier, Z. Modrusan, D. S. Shames, E. A. Bergbower, Y. Guan, J. Shin, J. Guillory, C. S. Rivers, C. K. Foo, D. Bhatt, J. Stinson, F. Gnad, P. M. Haverty, R. Gentleman, S. Chaudhuri, V. Janakiraman, B. S. Jaiswal, C. Parikh, W. Yuan, Z. Zhang, H. Koeppen, T. D. Wu, H. M. Stern, R. L. Yauch, K. E. Huffman, D. D. Paskulin, P. B. Illei, M. Varella-Garcia, A. F. Gazdar, F. J. de Sauvage, R. Bourgon, J. D. Minna, M. V. Brock and S. Seshagiri (2012). "Comprehensive genomic analysis identifies SOX2 as a frequently amplified gene in small-cell lung cancer." Nat Genet 44(10): 1111-1116.

Sahni, N., S. Yi, Q. Zhong, N. Jailkhani, B. Charloteaux, M. E. Cusick and M. Vidal (2013). "Edgotype: a fundamental link between genotype and phenotype." Curr Opin Genet Dev 23(6): 649-657.

Sanchez-Diaz, A., M. Kanemaki, V. Marchesi and K. Labib (2004). "Rapid depletion of budding yeast proteins by fusion to a heat-inducible degron." Sci STKE 2004(223): PL8.

Santarius, T., J. Shipley, D. Brewer, M. R. Stratton and C. S. Cooper (2010). "A census of amplified and overexpressed human cancer genes." Nat Rev Cancer 10(1): 59-64.

Sassi, H. E., N. Bastajian, P. Kainth and B. J. Andrews (2009). "Reporter-based synthetic genetic array analysis: a functional genomics approach for investigating the cell cycle in Saccharomyces cerevisiae." Methods Mol Biol 548: 55-73.

Sausen, M., R. J. Leary, S. Jones, J. Wu, C. P. Reynolds, X. Liu, A. Blackford, G. Parmigiani, L. A. Diaz, Jr., N. Papadopoulos, B. Vogelstein, K. W. Kinzler, V. E. Velculescu and M. D. Hogarty (2013). "Integrated genomic analyses identify ARID1A and ARID1B alterations in the childhood cancer neuroblastoma." Nat Genet 45(1): 12-17.

Sayers, E. W., T. Barrett, D. A. Benson, S. H. Bryant, K. Canese, V. Chetvernin, D. M. Church, M. DiCuccio, R. Edgar, S. Federhen, M. Feolo, L. Y. Geer, W. Helmberg, Y. Kapustin, D. Landsman, D. J. Lipman, T. L. Madden, D. R. Maglott, V. Miller, I. Mizrachi, J. Ostell, K. D. Pruitt, G. D. Schuler, E. Sequeira, S. T. Sherry, M. Shumway, K. Sirotkin, A. Souvorov, G. Starchenko, T. A. Tatusova, L. Wagner, E. Yaschenko and J. Ye (2009). "Database resources of the National Center for Biotechnology Information." Nucleic Acids Res 37(Database issue): D5-15.

Scarcelli, J. J., S. Viggiano, C. A. Hodge, C. V. Heath, D. C. Amberg and C. N. Cole (2008). "Synthetic genetic array analysis in Saccharomyces cerevisiae provides evidence for an interaction between RAT8/DBP5 and genes encoding P-body components." Genetics 179(4): 1945-1955.

Schlabach, M. R., J. Luo, N. L. Solimini, G. Hu, Q. Xu, M. Z. Li, Z. Zhao, A. Smogorzewska, M. E. Sowa, X. L. Ang, T. F. Westbrook, A. C. Liang, K. Chang, J. A. Hackett, J. W. Harper, G. J. Hannon and S. J. Elledge (2008). "Cancer proliferation gene discovery through functional genomics." Science 319(5863): 620-624.

Scholz, M. B., C. C. Lo and P. S. Chain (2012). "Next generation sequencing and bioinformatic bottlenecks: the current state of metagenomic data analysis." Curr Opin Biotechnol 23(1): 9-15.

Schreiber, R. D., L. J. Old and M. J. Smyth (2011). "Cancer immunoediting: integrating immunity's roles in cancer suppression and promotion." Science 331(6024): 1565-1570.

Schrodinger, LLC (2010). The PyMOL Molecular Graphics System, Version 1.3r1.

Schueler-Furman, O., Y. Altuvia, A. Sette and H. Margalit (2000). "Structure-based prediction of binding peptides to MHC class I molecules: application to a broad range of MHC alleles." Protein Sci 9(9): 1838-1846.

Scoumanne, A. and X. Chen (2008). "Protein methylation: a new mechanism of p53 tumor suppressor regulation." Histol Histopathol 23(9): 1143-1149.

Seo, J. S., Y. S. Ju, W. C. Lee, J. Y. Shin, J. K. Lee, T. Bleazard, J. Lee, Y. J. Jung, J. O. Kim, J. Y. Shin, S. B. Yu, J. Kim, E. R. Lee, C. H. Kang, I. K. Park, H. Rhee, S. H. Lee, J. I. Kim, J. H. Kang and Y. T. Kim (2012). "The transcriptional landscape and mutational profile of lung adenocarcinoma." Genome Res 22(11): 2109-2119.

Seshagiri, S., E. W. Stawiski, S. Durinck, Z. Modrusan, E. E. Storm, C. B. Conboy, S. Chaudhuri, Y. Guan, V. Janakiraman, B. S. Jaiswal, J. Guillory, C. Ha, G. J. Dijkgraaf, J. Stinson, F. Gnad, M. A. Huntley, J. D. Degenhardt, P. M. Haverty, R. Bourgon, W. Wang, H. Koeppen, R. Gentleman, T. K. Starr, Z. Zhang, D. A. Largaespada, T. D. Wu and F. J. de Sauvage (2012). "Recurrent R-spondin fusions in colon cancer." Nature 488(7413): 660-664.

Sette, A., R. Chesnut and J. Fikes (2001). "HLA expression in cancer: implications for T cell-based immunotherapy." Immunogenetics 53(4): 255-263.

Shah, S. P., A. Roth, R. Goya, A. Oloumi, G. Ha, Y. Zhao, G. Turashvili, J. Ding, K. Tse, G. Haffari, A. Bashashati, L. M. Prentice, J. Khattra, A. Burleigh, D. Yap, V. Bernard, A. McPherson, K. Shumansky, A. Crisan, R. Giuliany, A. Heravi-Moussavi, J. Rosner, D. Lai, I. Birol, R. Varhol, A. Tam, N. Dhalla, T. Zeng, K. Ma, S. K. Chan, M. Griffith, A. Moradian, S. W. Cheng, G. B. Morin, P. Watson, K. Gelmon, S. Chia, S. F. Chin, C. Curtis, O. M. Rueda, P. D. Pharoah, S. Damaraju, J. Mackey, K. Hoon, T. Harkins, V. Tadigotla, M. Sigaroudinia, P. Gascard, T. Tlsty, J. F. Costello, I. M. Meyer, C. J. Eaves, W. W. Wasserman, S. Jones, D. Huntsman, M. Hirst, C. Caldas, M. A. Marra and S. Aparicio (2012). "The clonal and mutational evolution spectrum of primary triple-negative breast cancers." Nature 486(7403): 395-399.

Shastry, B. S. (2009). "SNPs: impact on gene function and phenotype." Methods Mol Biol 578: 3-22.

Shinmura, K., T. Kohno, M. Takahashi, A. Sasaki, A. Ochiai, P. Guilford, A. Hunter, A. E. Reeve, H. Sugimura, N. Yamaguchi and J. Yokota (1999). "Familial gastric cancer: clinicopathological characteristics, RER phenotype and germline p53 and E-cadherin mutations." Carcinogenesis 20(6): 1127-1131.

Shmerling, D. H. (1972). "[Diagnosis and treatment of exocrine pancreatic insufficiency in mucoviscidosis]." Z Allgemeinmed 48(24): 1080-1084.

Sievers, F., A. Wilm, D. Dineen, T. J. Gibson, K. Karplus, W. Li, R. Lopez, H. McWilliam, M. Remmert, J. Soding, J. D. Thompson and D. G. Higgins (2011). "Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega." Mol Syst Biol 7: 539.

Sigal, A. and V. Rotter (2000). "Oncogenic mutations of the p53 tumor suppressor: the demons of the guardian of the genome." Cancer Res 60(24): 6788-6793.

Silva, J. M., K. Marran, J. S. Parker, J. Silva, M. Golding, M. R. Schlabach, S. J. Elledge, G. J. Hannon and K. Chang (2008). "Profiling essential genes in human mammary cells by multiplex RNAi screening." Science 319(5863): 617-620.

Simonds, N. I., M. J. Khoury, S. D. Schully, K. Armstrong, W. F. Cohn, D. A. Fenstermacher, G. S. Ginsburg, K. A. Goddard, W. A. Knaus, G. H. Lyman, S. D. Ramsey, J. Xu and A. N. Freedman (2013). "Comparative effectiveness research in cancer genomics and precision medicine: current landscape and future prospects." J Natl Cancer Inst 105(13): 929-936.

Sjoblom, T., S. Jones, L. D. Wood, D. W. Parsons, J. Lin, T. D. Barber, D. Mandelker, R. J. Leary, J. Ptak, N. Silliman, S. Szabo, P. Buckhaults, C. Farrell, P. Meeh, S. D. Markowitz, J. Willis, D. Dawson, J. K. Willson, A. F. Gazdar, J. Hartigan, L. Wu, C. Liu, G. Parmigiani, B. H. Park, K. E. Bachman, N. Papadopoulos, B. Vogelstein, K. W. Kinzler and V. E. Velculescu (2006). "The consensus coding sequences of human breast and colorectal cancers." Science 314(5797): 268-274.

Skubitz, K. M., N. P. Christiansen and J. R. Mendiola (1989). "Preparation and characterization of monoclonal antibodies to human neutrophil cathepsin G, lactoferrin, eosinophil peroxidase, and eosinophil major basic protein." J Leukoc Biol 46(2): 109-118.

Soam, S. S., F. Khan, B. Bhasker and B. N. Mishra (2009). "Prediction of MHC class I binding peptides using probability distribution functions." Bioinformation 3(9): 403-408.

Sokic, G. and D. Dukanovic (1971). "[Polyphen in the treatment of some diseases]." Stomatol Glas Srb 18(3): 159-162.

Sordella, R., D. W. Bell, D. A. Haber and J. Settleman (2004). "Gefitinib-sensitizing EGFR mutations in lung cancer activate anti-apoptotic pathways." Science 305(5687): 1163-1167.

Starita, L. M., D. L. Young, M. Islam, J. O. Kitzman, J. Gullingsrud, R. J. Hause, D. M. Fowler, J. D. Parvin, J. Shendure and S. Fields (2015). "Massively Parallel Functional Analysis of BRCA1 RING Domain Variants." Genetics 200(2): 413-422.

Steeg, P. S. (2016). "Targeting metastasis." Nat Rev Cancer 16(4): 201-218.

Steinmetz, L. M. and R. W. Davis (2004). "Maximizing the potential of functional genomics." Nat Rev Genet 5(3): 190-201.

Stenson, P. D., E. V. Ball, M. Mort, A. D. Phillips, K. Shaw and D. N. Cooper (2012). "The Human Gene Mutation Database (HGMD) and its exploitation in the fields of personalized genomics and molecular evolution." Curr Protoc Bioinformatics Chapter 1: Unit 1.13.

Stenson, P. D., E. V. Ball, M. Mort, A. D. Phillips, J. A. Shiel, N. S. Thomas, S. Abeysinghe, M. Krawczak and D. N. Cooper (2003). "Human Gene Mutation Database (HGMD): 2003 update." Hum Mutat 21(6): 577-581.

Stephens, P. J., D. J. McBride, M.-L. Lin, I. Varela, E. D. Pleasance, J. T. Simpson, L. A. Stebbings, C. Leroy, S. Edkins and L. J. Mudie (2009). "Complex landscapes of somatic rearrangement in human breast cancer genomes." Nature 462(7276): 1005-1010.

Stephens, P. J., P. S. Tarpey, H. Davies, P. Van Loo, C. Greenman, D. C. Wedge, S. Nik-Zainal, S. Martin, I. Varela, G. R. Bignell, L. R. Yates, E. Papaemmanuil, D. Beare, A. Butler, A. Cheverton, J. Gamble, J. Hinton, M. Jia, A. Jayakumar, D. Jones, C. Latimer, K. W. Lau, S. McLaren, D. J. McBride, A. Menzies, L. Mudie, K. Raine, R. Rad, M. S. Chapman, J. Teague, D. Easton, A. Langerod, M. T. Lee, C. Y. Shen, B. T. Tee, B. W. Huimin, A. Broeks, A. C. Vargas, G. Turashvili, J. Martens, A. Fatima, P. Miron, S. F. Chin, G. Thomas, S. Boyault, O. Mariani, S. R. Lakhani, M. van de Vijver, L. van 't Veer, J. Foekens, C. Desmedt, C. Sotiriou, A. Tutt, C. Caldas, J. S. Reis-Filho, S. A. Aparicio, A. V. Salomon, A. L. Borresen-Dale, A. L. Richardson, P. J. Campbell, P. A. Futreal and M. R. Stratton (2012). "The landscape of cancer genes and mutational processes in breast cancer." Nature 486(7403): 400-404.

Steward, R. E., M. W. MacArthur, R. A. Laskowski and J. M. Thornton (2003). "Molecular basis of inherited diseases: a structural perspective." Trends Genet 19(9): 505-513.

Stirnimann, C. U., E. Petsalaki, R. B. Russell and C. W. Muller (2010). "WD40 proteins propel cellular networks." Trends Biochem Sci 35(10): 565-574.

Stone, J., S. Bevan, D. Cunningham, A. Hill, N. Rahman, J. Peto, A. Marossy and R. S. Houlston (1999). "Low frequency of germline E-cadherin mutations in familial and nonfamilial gastric cancer." Br J Cancer 79(11-12): 1935-1937.

Stranger, B. E., E. A. Stahl and T. Raj (2011). "Progress and promise of genome-wide association studies for human complex trait genetics." Genetics 187(2): 367-383.

Stransky, N., A. M. Egloff, A. D. Tward, A. D. Kostic, K. Cibulskis, A. Sivachenko, G. V. Kryukov, M. S. Lawrence, C. Sougnez, A. McKenna, E. Shefler, A. H. Ramos, P. Stojanov, S. L. Carter, D. Voet, M. L. Cortes, D. Auclair, M. F. Berger, G. Saksena, C. Guiducci, R. C. Onofrio, M. Parkin, M. Romkes, J. L. Weissfeld, R. R. Seethala, L. Wang, C. Rangel-Escareno, J. C. Fernandez-Lopez, A. Hidalgo-Miranda, J. Melendez-Zajgla, W. Winckler, K. Ardlie, S. B. Gabriel, M. Meyerson, E. S. Lander, G. Getz, T. R. Golub, L. A. Garraway and J. R. Grandis (2011). "The mutational landscape of head and neck squamous cell carcinoma." Science 333(6046): 1157-1160.

Stratton, M. R. (2011). "Exploring the genomes of cancer cells: progress and promise." Science 331(6024): 1553-1558.

Studer, R. A., B. H. Dessailly and C. A. Orengo (2013). "Residue mutations and their impact on protein structure and function: detecting beneficial and pathogenic changes." Biochem J 449(3): 581-594.

Su, A. I., T. Wiltshire, S. Batalov, H. Lapp, K. A. Ching, D. Block, J. Zhang, R. Soden, M. Hayakawa, G. Kreiman, M. P. Cooke, J. R. Walker and J. B. Hogenesch (2004). "A gene atlas of the mouse and human protein-encoding transcriptomes." Proc Natl Acad Sci U S A 101(16): 6062-6067.

Sudmant, P. H., T. Rausch, E. J. Gardner, R. E. Handsaker, A. Abyzov, J. Huddleston, Y. Zhang, K. Ye, G. Jun, M. Hsi-Yang Fritz, M. K. Konkel, A. Malhotra, A. M. Stutz, X. Shi, F. Paolo Casale, J. Chen, F. Hormozdiari, G. Dayama, K. Chen, M. Malig, M. J. Chaisson, K. Walter, S. Meiers, S. Kashin, E. Garrison, A. Auton, H. Y. Lam, X. Jasmine Mu, C. Alkan, D. Antaki, T. Bae, E. Cerveira, P. Chines, Z. Chong, L. Clarke, E. Dal, L. Ding, S. Emery, X. Fan, M. Gujral, F. Kahveci, J. M. Kidd, Y. Kong, E. W. Lameijer, S. McCarthy, P. Flicek, R. A. Gibbs, G. Marth, C. E. Mason, A. Menelaou, D. M. Muzny, B. J. Nelson, A. Noor, N. F. Parrish, M. Pendleton, A. Quitadamo, B. Raeder, E. E. Schadt, M. Romanovitch, A. Schlattl, R. Sebra, A. A. Shabalin, A. Untergasser, J. A. Walker, M. Wang, F. Yu, C. Zhang, J. Zhang, X. Zheng-Bradley, W. Zhou, T. Zichner, J. Sebat, M. A. Batzer, S. A. McCarroll, 1000 Genomes Project Consortium, R. E. Mills, M. B. Gerstein, A. Bashir, O. Stegle, S. E. Devine, C. Lee, E. E. Eichler and J. O. Korbel (2015). "An integrated map of structural variation in 2,504 human genomes." Nature 526(7571): 75-81.

Sun, S., F. Yang, G. Tan, M. Costanzo, R. Oughtred, J. Hirschman, C. L. Theesfeld, P. Bansal, N. Sahni, S. Yi, A. Yu, T. Tyagi, C. Tie, D. E. Hill, M. Vidal, B. J. Andrews, C. Boone, K. Dolinski and F. P. Roth (2016). "An extended set of yeast-based functional assays accurately identifies human disease mutations." Genome Res 26(5): 670-680.

Suter, B., D. Auerbach and I. Stagljar (2006). "Yeast-based functional genomics and proteomics technologies: the first 15 years and beyond." Biotechniques 40(5): 625-644.

Tamborero, D., A. Gonzalez-Perez, C. Perez-Llamas, J. Deu-Pons, C. Kandoth, J. Reimand, M. S. Lawrence, G. Getz, G. D. Bader, L. Ding and N. Lopez-Bigas (2013). "Comprehensive identification of mutational cancer driver genes across 12 tumor types." Sci Rep 3: 2650.

Tanaka, T., M. Okada, Y. Hozumi, K. Tachibana, C. Kitanaka, Y. Hamamoto, A. M. Martelli, M. K. Topham, M. Iino and K. Goto (2013). "Cytoplasmic localization of DGKzeta exerts a protective effect against p53-mediated cytotoxicity." J Cell Sci 126(Pt 13): 2785-2797.

Tarpey, P. S., S. Behjati, S. L. Cooke, P. Van Loo, D. C. Wedge, N. Pillay, J. Marshall, S. O'Meara, H. Davies, S. Nik-Zainal, D. Beare, A. Butler, J. Gamble, C. Hardy, J. Hinton, M. M. Jia, A. Jayakumar, D. Jones, C. Latimer, M. Maddison, S. Martin, S. McLaren, A. Menzies, L. Mudie, K. Raine, J. W. Teague, J. M. Tubio, D. Halai, R. Tirabosco, F. Amary, P. J. Campbell, M. R. Stratton, A. M. Flanagan and P. A. Futreal (2013). "Frequent mutation of the major cartilage collagen gene COL2A1 in chondrosarcoma." Nat Genet 45(8): 923-926.

Team, M. G. C. P., G. Temple, D. S. Gerhard, R. Rasooly, E. A. Feingold, P. J. Good, C. Robinson, A. Mandich, J. G. Derge, J. Lewis, D. Shoaf, F. S. Collins, W. Jang, L. Wagner, C. M. Shenmen, L. Misquitta, C. F. Schaefer, K. H. Buetow, T. I. Bonner, L. Yankie, M. Ward, L. Phan, A. Astashyn, G. Brown, C. Farrell, J. Hart, M. Landrum, B. L. Maidak, M. Murphy, T. Murphy, B. Rajput, L. Riddick, D. Webb, J. Weber, W. Wu, K. D. Pruitt, D. Maglott, A. Siepel, B. Brejova, M. Diekhans, R. Harte, R. Baertsch, J. Kent, D. Haussler, M. Brent, L. Langton, C. L. Comstock, M. Stevens, C. Wei, M. J. van Baren, K. Salehi-Ashtiani, R. R. Murray, L. Ghamsari, E. Mello, C. Lin, C. Pennacchio, K. Schreiber, N. Shapiro, A. Marsh, E. Pardes, T. Moore, A. Lebeau, M. Muratet, B. Simmons, D. Kloske, S. Sieja, J. Hudson, P. Sethupathy, M. Brownstein, N. Bhat, J. Lazar, H. Jacob, C. E. Gruber, M. R. Smith, J. McPherson, A. M. Garcia, P. H. Gunaratne, J. Wu, D. Muzny, R. A. Gibbs, A. C. Young, G. G. Bouffard, R. W. Blakesley, J. Mullikin, E. D. Green, M. C. Dickson, A. C. Rodriguez, J. Grimwood, J. Schmutz, R. M. Myers, M. Hirst, T. Zeng, K. Tse, M. Moksa, M. Deng, K. Ma, D. Mah, J. Pang, G. Taylor, E. Chuah, A. Deng, K. Fichter, A. Go, S. Lee, J. Wang, M. Griffith, R. Morin, R. A. Moore, M. Mayo, S. Munro, S. Wagner, S. J. Jones, R. A. Holt, M. A. Marra, S. Lu, S. Yang, J. Hartigan, M. Graf, R. Wagner, S. Letovksy, J. C. Pulido, K. Robison, D. Esposito, J. Hartley, V. E. Wall, R. F. Hopkins, O. Ohara and S. Wiemann (2009). "The completion of the Mammalian Gene Collection (MGC)." Genome Res 19(12): 2324-2333.

Thusberg, J., A. Olatubosun and M. Vihinen (2011). "Performance of mutation pathogenicity prediction methods on missense variants." Hum Mutat 32(4): 358-368.

Tomczak, K., P. Czerwinska and M. Wiznerowicz (2015). "The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge." Contemp Oncol (Pozn) 19(1A): A68-77.

Tong, A. H. and C. Boone (2006). "Synthetic genetic array analysis in Saccharomyces cerevisiae." Methods Mol Biol 313: 171-192.

Topalian, S. L., G. J. Weiner and D. M. Pardoll (2011). "Cancer immunotherapy comes of age." J Clin Oncol 29(36): 4828-4836.

Torres-Ruiz, R. and S. Rodriguez-Perales (2016). "CRISPR-Cas9 technology: applications and human disease modelling." Brief Funct Genomics.

Tosello, V. and A. A. Ferrando (2013). "The NOTCH signaling pathway: role in the pathogenesis of T-cell acute lymphoblastic leukemia and implication for therapy." Ther Adv Hematol 4(3): 199-210.

Totoki, Y., K. Tatsuno, S. Yamamoto, Y. Arai, F. Hosoda, S. Ishikawa, S. Tsutsumi, K. Sonoda, H. Totsuka, T. Shirakihara, H. Sakamoto, L. Wang, H. Ojima, K. Shimada, T. Kosuge, T. Okusaka, K. Kato, J. Kusuda, T. Yoshida, H. Aburatani and T. Shibata (2011). "High-resolution characterization of a hepatocellular carcinoma genome." Nat Genet 43(5): 464-469.

Trevisson, E., A. Burlina, M. Doimo, V. Pertegato, A. Casarin, L. Cesaro, P. Navas, G. Basso, G. Sartori and L. Salviati (2009). "Functional complementation in yeast allows molecular characterization of missense argininosuccinate lyase mutations." J Biol Chem 284(42): 28926-28934.

Udaka, K., H. Mamitsuka, Y. Nakaseko and N. Abe (2002). "Prediction of MHC class I binding peptides by a query learning algorithm based on hidden Markov models." J Biol Phys 28(2): 183-194.

Udaka, K., K. H. Wiesmuller, S. Kienle, G. Jung, H. Tamamura, H. Yamagishi, K. Okumura, P. Walden, T. Suto and T. Kawasaki (2000). "An automated prediction of MHC class I-binding peptides based on positional scanning with peptide libraries." Immunogenetics 51(10): 816-828.

Vesely, M. D., M. H. Kershaw, R. D. Schreiber and M. J. Smyth (2011). "Natural innate and adaptive immunity to cancer." Annu Rev Immunol 29: 235-271.

Vesely, M. D. and R. D. Schreiber (2013). "Cancer immunoediting: antigens, mechanisms, and implications to cancer immunotherapy." Ann N Y Acad Sci 1284: 1-5.

Vogel, C., M. Bashton, N. D. Kerrison, C. Chothia and S. A. Teichmann (2004). "Structure, function and evolution of multidomain proteins." Curr Opin Struct Biol 14(2): 208-216.

Vogelstein, B., N. Papadopoulos, V. E. Velculescu, S. Zhou, L. A. Diaz, Jr. and K. W. Kinzler (2013). "Cancer genome landscapes." Science 339(6127): 1546-1558.

Wach, A., A. Brachat, R. Pohlmann and P. Philippsen (1994). "New heterologous modules for classical or PCR-based gene disruptions in Saccharomyces cerevisiae." Yeast 10(13): 1793-1808.

Wan, P. T., M. J. Garnett, S. M. Roe, S. Lee, D. Niculescu-Duvaz, V. M. Good, C. M. Jones, C. J. Marshall, C. J. Springer, D. Barford, R. Marais and P. Cancer Genome (2004). "Mechanism of activation of the RAF-ERK signaling pathway by oncogenic mutations of B-RAF." Cell 116(6): 855-867.

Wang, L., S. Tsutsumi, T. Kawaguchi, K. Nagasaki, K. Tatsuno, S. Yamamoto, F. Sang, K. Sonoda, M. Sugawara, A. Saiura, S. Hirono, H. Yamaue, Y. Miki, M. Isomura, Y. Totoki, G. Nagae, T. Isagawa, H. Ueda, S. Murayama-Hosokawa, T. Shibata, H. Sakamoto, Y. Kanai, A. Kaneda, T. Noda and H. Aburatani (2012). "Whole-exome sequencing of human pancreatic cancers and characterization of genomic instability caused by MLH1 haploinsufficiency and complete deficiency." Genome Res 22(2): 208-219.

Wang, T., G. Niu, M. Kortylewski, L. Burdelya, K. Shain, S. Zhang, R. Bhattacharya, D. Gabrilovich, R. Heller, D. Coppola, W. Dalton, R. Jove, D. Pardoll and H. Yu (2004). "Regulation of the innate and adaptive immune responses by Stat-3 signaling in tumor cells." Nat Med 10(1): 48-54.

Watson, I. R., K. Takahashi, P. A. Futreal and L. Chin (2013). "Emerging patterns of somatic mutations in cancer." Nat Rev Genet 14(10): 703-718.

Wei, Q., L. Wang, Q. Wang, W. D. Kruger and R. L. Dunbrack, Jr. (2010). "Testing computational prediction of missense mutation phenotypes: functional characterization of 204 mutations of human cystathionine beta synthase." Proteins 78(9): 2058-2074.

Welter, D., J. MacArthur, J. Morales, T. Burdett, P. Hall, H. Junkins, A. Klemm, P. Flicek, T. Manolio, L. Hindorff and H. Parkinson (2014). "The NHGRI GWAS Catalog, a curated resource of SNP-trait associations." Nucleic Acids Res 42(Database issue): D1001-1006.

Wheeler, D. L., T. Barrett, D. A. Benson, S. H. Bryant, K. Canese, V. Chetvernin, D. M. Church, M. Dicuccio, R. Edgar, S. Federhen, M. Feolo, L. Y. Geer, W. Helmberg, Y. Kapustin, O. Khovayko, D. Landsman, D. J. Lipman, T. L. Madden, D. R. Maglott, V. Miller, J. Ostell, K. D. Pruitt, G. D. Schuler, M. Shumway, E. Sequeira, S. T. Sherry, K. Sirotkin, A. Souvorov, G. Starchenko, R. L. Tatusov, T. A. Tatusova, L. Wagner and E. Yaschenko (2008). "Database resources of the National Center for Biotechnology Information." Nucleic Acids Res 36(Database issue): D13-21.

Wood, L. D., D. W. Parsons, S. Jones, J. Lin, T. Sjoblom, R. J. Leary, D. Shen, S. M. Boca, T. Barber, J. Ptak, N. Silliman, S. Szabo, Z. Dezso, V. Ustyanksky, T. Nikolskaya, Y. Nikolsky, R. Karchin, P. A. Wilson, J. S. Kaminker, Z. Zhang, R. Croshaw, J. Willis, D. Dawson, M. Shipitsin, J. K. Willson, S. Sukumar, K. Polyak, B. H. Park, C. L. Pethiyagoda, P. V. Pant, D. G. Ballinger, A. B. Sparks, J. Hartigan, D. R. Smith, E. Suh, N. Papadopoulos, P. Buckhaults, S. D. Markowitz, G. Parmigiani, K. W. Kinzler, V. E. Velculescu and B. Vogelstein (2007). "The genomic landscapes of human breast and colorectal cancers." Science 318(5853): 1108-1113.

Wu, C., I. Macleod and A. I. Su (2013). "BioGPS and MyGene.info: organizing online, gene-centric information." Nucleic Acids Res 41(Database issue): D561-565.

Yadav, M., S. Jhunjhunwala, Q. T. Phung, P. Lupardus, J. Tanguay, S. Bumbaca, C. Franci, T. K. Cheung, J. Fritsche, T. Weinschenk, Z. Modrusan, I. Mellman, J. R. Lill and L. Delamarre (2014). "Predicting immunogenic tumour mutations by combining mass spectrometry and exome sequencing." Nature 515(7528): 572-576.

Yan, X. J., J. Xu, Z. H. Gu, C. M. Pan, G. Lu, Y. Shen, J. Y. Shi, Y. M. Zhu, L. Tang, X. W. Zhang, W. X. Liang, J. Q. Mi, H. D. Song, K. Q. Li, Z. Chen and S. J. Chen (2011). "Exome sequencing identifies somatic mutations of DNA methyltransferase gene DNMT3A in acute monocytic leukemia." Nat Genet 43(4): 309-315.

Yang, F., E. Petsalaki, T. Rolland, D. E. Hill, M. Vidal and F. P. Roth (2015). "Protein domain-level landscape of cancer-type-specific somatic mutations." PLoS Comput Biol 11(3): e1004147.

Yang, X., J. S. Boehm, X. Yang, K. Salehi-Ashtiani, T. Hao, Y. Shen, R. Lubonja, S. R. Thomas, O. Alkan, T. Bhimdi, T. M. Green, C. M. Johannessen, S. J. Silver, C. Nguyen, R. R. Murray, H. Hieronymus, D. Balcha, C. Fan, C. Lin, L. Ghamsari, M. Vidal, W. C. Hahn, D. E. Hill and D. E. Root (2011). "A public genome-scale lentiviral expression library of human ORFs." Nat Methods 8(8): 659-661.

Yen, K., P. Gitsham, J. Wishart, S. G. Oliver and N. Zhang (2003). "An improved tetO promoter replacement system for regulating the expression of yeast genes." Yeast 20(15): 1255-1262.

Yoon, K. A., J. L. Ku, H. K. Yang, W. H. Kim, S. Y. Park and J. G. Park (1999). "Germline mutations of E-cadherin gene in Korean familial gastric cancer patients." J Hum Genet 44(3): 177-180.

Yoshida, K. and S. Sugano (1999). "Identification of a novel protocadherin gene (PCDH11) on the human XY homology region in Xq21.3." Genomics 62(3): 540-543.

Yost, S. E., S. Pastorino, S. Rozenzhak, E. N. Smith, Y. S. Chao, P. Jiang, S. Kesari, K. A. Frazer and O. Harismendy (2013). "High-resolution mutational profiling suggests the genetic validity of glioblastoma patient-derived pre-clinical models." PLoS One 8(2): e56185.

Zhang, J., V. Grubor, C. L. Love, A. Banerjee, K. L. Richards, P. A. Mieczkowski, C. Dunphy, W. Choi, W. Y. Au, G. Srivastava, P. L. Lugar, D. A. Rizzieri, A. S. Lagoo, L. Bernal-Mizrachi, K. P. Mann, C. Flowers, K. Naresh, A. Evens, L. I. Gordon, M. Czader, J. I. Gill, E. D. Hsi, Q. Liu, A. Fan, K. Walsh, D. Jima, L. L. Smith, A. J. Johnson, J. C. Byrd, M. A. Luftig, T. Ni, J. Zhu, A. Chadburn, S. Levy, D. Dunson and S. S. Dave (2013). "Genetic heterogeneity of diffuse large B-cell lymphoma." Proc Natl Acad Sci U S A 110(4): 1398-1403.

Zhang, Y., X. Liu, Y. Fan, J. Ding, A. Xu, X. Zhou, X. Hu, M. Zhu, X. Zhang, S. Li, J. Wu, H. Cao, J. Li and Y. Wang (2006). "Germline mutations and polymorphic variants in MMR, E-cadherin and MYH genes associated with familial gastric cancer in Jiangsu of China." Int J Cancer 119(11): 2592-2596.

Zhong, Q., N. Simonis, Q. R. Li, B. Charloteaux, F. Heuze, N. Klitgord, S. Tam, H. Yu, K. Venkatesan, D. Mou, V. Swearingen, M. A. Yildirim, H. Yan, A. Dricot, D. Szeto, C. Lin, T. Hao, C. Fan, S. Milstein, D. Dupuy, R. Brasseur, D. E. Hill, M. E. Cusick and M. Vidal (2009). "Edgetic perturbation models of human inherited disorders." Mol Syst Biol 5: 321.

Zhou, D., L. Yang, L. Zheng, W. Ge, D. Li, Y. Zhang, X. Hu, Z. Gao, J. Xu, Y. Huang, H. Hu, H. Zhang, H. Zhang, M. Liu, H. Yang, L. Zheng and S. Zheng (2013). "Exome capture sequencing of adenoma reveals genetic alterations in multiple cellular pathways at the early stage of colorectal tumorigenesis." PLoS One 8(1): e53310.

Zinkernagel, R. M. and P. C. Doherty (1976). "Virus-immune cytotoxic T cells are sensitized by virus specifically altered structures coded for in H-2K or H-2D: a biological role for major histocompatibility antigens." Adv Exp Med Biol 66: 387-389.

Appendices

Appendix A. Detailed protocol for human-yeast complementation assay

Mutagenesis PCR

Prepare reactions:
• 5 x Phusion HF buffer 5 µl
• 10 mM dNTPs 0.5 µl
• Forward primer 1.25 µl
• Reverse primer 1.25 µl
• Template plasmid (from file Mutagenesis_Plasmid.xlsx) 1 µl
• Phusion HotStar polymerase 0.25 µl
• PCR water 15.75 µl
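The per-reaction volumes above sum to 25 µl. When many mutagenesis reactions are set up at once, the shared components can be pooled into a master mix. The following is a minimal Python sketch and not part of the original protocol: the 10% pipetting overage, the decision to add primers and template to each well individually (they presumably differ between mutations), and the function name `master_mix` are all assumptions made for illustration.

```python
# Per-reaction volumes (µl) copied from the recipe above.
PER_REACTION_UL = {
    "5x Phusion HF buffer": 5.0,
    "10 mM dNTPs": 0.5,
    "Forward primer": 1.25,
    "Reverse primer": 1.25,
    "Template plasmid": 1.0,
    "Phusion HotStar polymerase": 0.25,
    "PCR water": 15.75,
}

# Assumption: primers and template differ per reaction, so they are
# pipetted into each well individually rather than pooled.
PER_WELL = {"Forward primer", "Reverse primer", "Template plasmid"}


def master_mix(n_reactions, overage=0.10):
    """Return volumes (µl) of the shared components for n_reactions plus overage."""
    scale = n_reactions * (1.0 + overage)
    return {
        name: round(volume * scale, 2)
        for name, volume in PER_REACTION_UL.items()
        if name not in PER_WELL
    }


total_per_reaction = sum(PER_REACTION_UL.values())  # 25 µl per reaction
mix_96 = master_mix(96)  # shared components for one 96-well plate

for name, volume in mix_96.items():
    print(f"{name}: {volume} µl")
```

Each well would then receive 21.5 µl of master mix (25 µl minus the 3.5 µl of primers and template), under the same assumptions.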

PCR program:

1. 98 °C 30 s
2. 98 °C 10 s
3. 68 °C 30 s
4. 72 °C 3 min (go to step 2 and repeat for another 24 cycles)
5. 72 °C 7 min
6. 4 °C hold
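To estimate how long the thermocycler will be occupied, the programmed hold times can be added up. The sketch below is illustrative only: it reads "repeat for another 24 cycles" as 25 cycles in total, labels the three cycled steps as denaturation, annealing and extension (the usual interpretation, not stated in the protocol), and ignores ramping between temperatures, which depends on the instrument.

```python
# Each step: (temperature in °C, duration in seconds), as listed in the program above.
INITIAL_DENATURATION = (98, 30)
CYCLE = [(98, 10), (68, 30), (72, 180)]  # assumed: denature, anneal, extend
FINAL_EXTENSION = (72, 420)
N_CYCLES = 25  # assumption: first pass plus "another 24 cycles"


def programmed_seconds():
    """Sum the programmed hold times; ramp time between steps is not included."""
    per_cycle = sum(duration for _, duration in CYCLE)
    return INITIAL_DENATURATION[1] + N_CYCLES * per_cycle + FINAL_EXTENSION[1]


total = programmed_seconds()
print(f"{total} s of programmed holds, roughly {total / 60:.0f} min before ramping")
```

Under these assumptions the holds add up to 5950 s, so the run takes a little over an and a half hours once ramping is included.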

LR Reaction

Add the following components to a PCR tube at room temperature and mix:
• Entry clone (40 ng) 1 µl
• Destination vector (100 ng) 1 µl
• 5X LR Clonase Reaction Buffer 1 µl
• TE buffer 1 µl
• LR ClonaseI enzyme mix 1 µl

Remove LR ClonaseI enzyme mix from -80 °C and thaw on ice for about 2 minutes. Vortex the LR Clonase™ enzyme mix briefly twice.


Mix well by vortexing briefly twice. Microcentrifuge briefly.

Return LR ClonaseI enzyme mix to -80°C storage immediately after use.

Incubate reactions at room temperature for 20 hrs

E. coli Transformation

Transform 1 ul ligation reaction to 10 ul NEB competent cells by following the transformation protocol

Flip 5-6 times

Add 10 µl NEB 5alpha competent cells into each well in PCR plate

Add 1 µl ligated plasmid and one empty control

Mix, pipette, keep on ice for 30 min

Heat shock at 42 °C (use PCR machine) for 45 sec

Return to ice immediately after heat shock and keep on ice for 5 min

Add 100 µl SOC medium and transfer to Eppendorf tubes

Incubate at 37 °C with shaking at 200 rpm for 1 hour

Spin down, then mix by pipetting

Spot 10 µl on LA + Amp plates and incubate at 37 °C

Yeast Transformation

Make Transformation buffer (1 reaction):
• 50% PEG4000 80 µl
• 10 x TE buffer 11 µl
• 1M LiAc 10 µl
• 100% DMSO 10 µl

Prepare 50% PEG3350:
• mix 500 mg PEG3350 with 600 µl ddH2O
• keep at 60 °C for 5 minutes, vortexing several times; the final volume will be 1 ml
• filter-sterilize (~400 µl left after filtering)

Prepare 1M LiAc: 10.2 g LiAc into 100 ml ddH2O
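The 1M LiAc recipe can be checked with mass = molarity × volume × molar mass. The sketch below assumes the reagent is lithium acetate dihydrate (molar mass about 102.02 g/mol); the protocol does not state which form is used, and the anhydrous salt (about 65.99 g/mol) is shown only for comparison. The function name is hypothetical.

```python
# Assumed molar masses (g/mol); the protocol does not say which form of LiAc is used.
MW_LIAC_DIHYDRATE = 102.02
MW_LIAC_ANHYDROUS = 65.99


def grams_needed(molarity_mol_per_l, volume_ml, molar_mass_g_per_mol):
    """Mass of solute (g) for a solution of the given molarity and volume."""
    return molarity_mol_per_l * (volume_ml / 1000.0) * molar_mass_g_per_mol


dihydrate_g = grams_needed(1.0, 100, MW_LIAC_DIHYDRATE)  # about 10.2 g
anhydrous_g = grams_needed(1.0, 100, MW_LIAC_ANHYDROUS)  # about 6.6 g

print(f"dihydrate: {dihydrate_g:.1f} g, anhydrous: {anhydrous_g:.1f} g per 100 ml")
```

The 10.2 g in the recipe matches the dihydrate figure, which is why that form is assumed here.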

Prepare ssDNA (10 mg/ml):
• dissolve ssDNA in ddH2O to a final concentration of 10 mg/ml
• boil 5 minutes and put back on ice
• cool down
• boil 5 minutes and put back on ice before use

Mix transformation buffer with ssDNA: 2 µl ssDNA (10 mg/ml) per 100 µl transformation buffer
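As a worked example of this dilution, the short Python sketch below computes the final ssDNA concentration in the mix and scales the 100:2 ratio to a full plate. It is illustrative only: the 50 µl of mix per well is taken from the transformation procedure, while the 10% overage and the function names are assumptions.

```python
def final_concentration(stock_mg_per_ml, stock_ul, diluent_ul):
    """Concentration (mg/ml) after adding stock_ul of stock to diluent_ul of diluent."""
    return stock_mg_per_ml * stock_ul / (stock_ul + diluent_ul)


def volumes_for_wells(n_wells, mix_per_well_ul=50.0, overage=0.10):
    """Buffer and ssDNA volumes (µl) for n_wells, keeping the 100:2 buffer:ssDNA ratio."""
    total_ul = n_wells * mix_per_well_ul * (1.0 + overage)
    ssdna_ul = total_ul * 2.0 / 102.0
    return total_ul - ssdna_ul, ssdna_ul


ssdna_mg_per_ml = final_concentration(10.0, 2.0, 100.0)  # about 0.196 mg/ml
buffer_ul, ssdna_ul = volumes_for_wells(96)

print(f"final ssDNA: {ssdna_mg_per_ml:.3f} mg/ml")
print(f"96 wells: {buffer_ul:.0f} µl buffer + {ssdna_ul:.0f} µl ssDNA")
```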

Transformation procedure:

Mix 50 µl of the transformation mix (100 µl transformation buffer with 2 µl ssDNA) and 5 µl plasmid (human ORF) in a 96-well PCR plate

Inoculate fresh colonies and mix by pipetting up and down (no inoculum concentration is specified; a wide range is reported to work)

Inoculate yeast strains with human ORF, with GFP and one empty control

Room temperature for 2 hours

4C overnight

Next Day:

Put on ice before heat shock

Heat shock at 42C in a PCR machine for 30 minutes, then ice for 10 minutes

Plate and incubate at room temperature

Appendix B. Media

SC-URA (1L)

Ingredients              Amount
Yeast nitrogen base      6.74 g
SC DO mix (–U)           2 g

Add the above agents to 100 ml water in a 250 ml bottle.
Add 20 g (10 g) bacto agar to 850 ml water in a 2 L bottle.
Autoclave separately and combine the autoclaved solutions.

40% Glucose              50 mL

Drop-out Mix Synthetic minus Arginine, Histidine, Leucine, Lysine, Methionine, Tryptophan, Adenine, Uracil: 1.167 g (Arginine, Histidine, Lysine, Methionine, Tryptophan: 0.0724g, Leucine: 0.3623g, Adenine: 0.1081g)
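The amounts in this recipe can be checked against the 2 g of SC DO mix (–U) used per litre: the base mix lacking all eight components, plus every add-back except uracil, should total about 2 g. The Python check below reads 0.0724 g as the amount of each of the five listed amino acids, an interpretation that the total supports but the recipe does not state outright. It also estimates the final glucose concentration, assuming the three volumes (100 ml, 850 ml, 50 ml) simply add up to 1 L.

```python
BASE_MIX_G = 1.167  # drop-out mix lacking Arg, His, Leu, Lys, Met, Trp, Ade, Ura

# Components added back for SC-URA (everything except uracil), in grams.
# Assumption: 0.0724 g applies to each of the first five amino acids.
ADD_BACK_G = {
    "Arginine": 0.0724,
    "Histidine": 0.0724,
    "Lysine": 0.0724,
    "Methionine": 0.0724,
    "Tryptophan": 0.0724,
    "Leucine": 0.3623,
    "Adenine": 0.1081,
}

total_g = BASE_MIX_G + sum(ADD_BACK_G.values())

# Final glucose concentration, assuming the volumes are additive.
glucose_percent = 40.0 * 50 / (100 + 850 + 50)

print(f"SC DO mix (-U): {total_g:.4f} g")
print(f"Final glucose: {glucose_percent:.1f}%")
```

The total comes to 1.9994 g, consistent with the 2 g listed, and the glucose works out to 2% under the stated assumption.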

LB

Ingredients              Amount
Tryptone                 10 g
Bacto-yeast extract      5 g
NaCl                     10 g
Agar                     20 g

Combine reagents and shake until the solutes have dissolved. Adjust pH to 7 with 5 N NaOH (~0.2 ml).

Appendix C. Supplementary table for Chapter Three

Appendix B, table 1

Primary Site                Gene Symbol  Pfam ID     Residue  Pvalue       Mutant times  Sample Counts
Endometrium                 ACSM4        PF00501.23  A278     1.26E-03     2             249
Endometrium                 ACSM4        PF00501.23  F450     1.26E-03     2             249
Endometrium                 ACSM4        PF00501.23  T362     1.26E-03     2             249
Endometrium                 ACSM4        PF00501.23  V132     1.26E-03     2             249
Breast                      AKT1         PF00169.24  E17      0.00E+00     20            526
Meninges                    AKT1         PF00169.24  E17      0.00E+00     3             28
Lung                        AKT1         PF00169.24  E17      0.00E+00     3             748
Colon                       AKT1         PF00169.24  E17      0.00E+00     3             562
Endometrium                 AKT1         PF00169.24  E17      1.24E-09     5             249
Autonomic ganglia           ALK          PF07714.12  R1275    0.00E+00     6             53
Hematopoietic & lymph       ANTXR1       PF00092.23  V144     0.00E+00     6             187
Hematopoietic & lymph       BCL2         PF00452.14  A131     1.27E-09     9             187
Skin                        BRAF         PF07714.12  V600     0.00E+00     55            104
Colon                       BRAF         PF07714.12  V600     0.00E+00     20            562
Central nervous system      BRAF         PF07714.12  V600     0.00E+00     8             305
Lung                        BRAF         PF07714.12  G469     2.31E-10     7             748
Lung                        BRAF         PF07714.12  V600     2.31E-10     7             748
Lung                        BRAF         PF07714.12  G466     2.29E-08     6             748
Colon                       BRAF         PF07714.12  D594     4.15E-04     4             562
Lung                        BRAF         PF07714.12  N581     7.00E-03     3             748
Hematopoietic & lymph       BTG1         PF07742.7   Q36      9.63E-03     2             187
Cervix                      CAGE1        PF15066.1   E490     9.09E-06     2             14
Salivary gland              CCDC14       PF15254.1   E679     6.57E-03     1             36
Salivary gland              CCDC14       PF15254.1   E798     6.57E-03     1             36
Salivary gland              CCDC14       PF15254.1   E839     6.57E-03     1             36
Prostate                    CCNF         PF02984.14  Q507     0.00E+00     6             240
Lung                        CD6          PF00530.13  S52      0.00E+00     12            748
Hematopoietic & lymph       CD74         PF09307.5   D18      4.41E-03     2             187
Hematopoietic & lymph       CD74         PF09307.5   L33      4.41E-03     2             187
Hematopoietic & lymph       CD79B        PF02189.10  Y196     0.00E+00     8             187
Hematopoietic & lymph       CDK11A       PF00069.20  K463     3.54E-07     3             187
Bone                        CDKN2A       PF12796.2   D108     0.00E+00     3             5
Skin                        CDKN2A       PF12796.2   P114     3.59E-09     6             104
Lung                        CDKN2A       PF12796.2   D108     2.40E-05     5             748
Lung                        CDKN2A       PF12796.2   D84      2.40E-05     5             748
Prostate                    CNOT3        PF04065.10  E20      2.75E-06     3             240
Prostate                    CNOT3        PF04065.10  E70      1.30E-03     2             240
Hematopoietic & lymph       CREBBP       PF08214.6   R1446    4.71E-06     4             187
Hematopoietic & lymph       CREBBP       PF08214.6   Y1482    5.21E-04     3             187
Hematopoietic & lymph       CREBBP       PF08214.6   Y1503    5.21E-04     3             187
Endometrium                 CTNNA3       PF01044.14  T392     1.13E-11     4             249
Hematopoietic & lymph       CTNNA3       PF01044.14  L342     1.01E-04     2             187
Hematopoietic & lymph       CTNNA3       PF01044.14  N816     1.01E-04     2             187
Hematopoietic & lymph       CTNNA3       PF01044.14  R378     1.01E-04     2             187
Hematopoietic & lymph       CTNNA3       PF01044.14  T454     1.01E-04     2             187
Breast                      CYFIP2       PF05994.6   R492     0.00E+00     3             526
Endometrium                 CYFIP2       PF05994.6   A858     2.41E-03     2             249
Endometrium                 CYFIP2       PF05994.6   G1215    2.41E-03     2             249
Endometrium                 CYFIP2       PF05994.6   G599     2.41E-03     2             249
Endometrium                 CYFIP2       PF05994.6   R1022    2.41E-03     2             249
Endometrium                 CYFIP2       PF05994.6   R1066    2.41E-03     2             249
Endometrium                 CYFIP2       PF05994.6   T793     2.41E-03     2             249
Central nervous system      DDX3X        PF00270.24  R326     9.43E-05     4             305
Central nervous system      DDX3X        PF00270.24  R376     5.75E-03     3             305
Cervix                      DMBT1        PF00530.13  D527     0.00E+00     3             14
Prostate                    DMBT1        PF00530.13  D1019    0.00E+00     3             240
Urinary tract               DMBT1        PF00530.13  R1061    1.62E-05     3             85
Hematopoietic & lymph       DMBT1        PF00431.15  S1809    7.39E-05     4             187
Endometrium                 DMBT1        PF00530.13  G590     5.30E-03     3             249
Endometrium                 DMBT1        PF00530.13  V258     5.30E-03     3             249
Lung                        EGFR         PF07714.12  L858     0.00E+00     32            748
Lung                        EGFR         PF01030.19  L62      0.00E+00     4             748
Central nervous system      EGFR         PF00757.15  A289     7.71E-11     7             305
Central nervous system      EGFR         PF14843.1   G598     7.94E-11     6             305
Lung                        EGFR         PF07714.12  L861     8.25E-08     7             748
Lung                        EGFR         PF07714.12  G719     1.34E-04     5             748
Colon                       EGFR         PF07714.12  G724     1.08E-03     3             562
Central nervous system      EGFR         PF00757.15  C326     5.67E-03     3             305
Bone                        ERBB3        PF14843.1   P554     0.00E+00     2             5
Urinary tract               ERBB3        PF01030.19  V104     0.00E+00     3             85
Colon                       ERBB3        PF01030.19  V104     2.58E-10     6             562
Colon                       FBXW7        PF00400.27  R465     0.00E+00     32            562
Urinary tract               FBXW7        PF00400.27  R505     1.07E-06     6             85
Lung                        FBXW7        PF00400.27  R505     1.61E-05     8             748
Endometrium                 FBXW7        PF00400.27  R465     1.48E-04     12            249
Skin                        FBXW7        PF00400.27  R505     3.83E-03     4             104
Central nervous system      FGFR1        PF07714.12  K656     8.29E-09     4             305
Central nervous system      FGFR1        PF07714.12  N546     1.89E-03     2             305
Breast                      FGFR2        PF07714.12  N549     0.00E+00     5             526
Endometrium                 FGFR2        PF07714.12  K657     3.71518E-05  2             249
Endometrium                 FGFR2        PF07714.12  K659     3.71518E-05  2             249
Endometrium                 FGFR2        PF07714.12  K660     3.71518E-05  2             249
Endometrium                 FGFR2        PF07714.12  N549     3.69E-10     6             249
Colon                       FGFR2        PF07679.11  R203     3.37E-05     4             562
Lung                        FGFR2        PF07679.11  D247     1.25E-04     3             748
Colon                       FGFR2        PF07679.11  G305     3.04E-03     3             562
Central nervous system      FKBP9        PF00254.23  R107     0.00E+00     16            305
Hematopoietic & lymph       FOXO1        PF00250.13  S193     7.37E-03     2             187
Hematopoietic & lymph       FOXO1        PF00250.13  S203     7.37E-03     2             187
Hematopoietic & lymph       GNA13        PF00503.15  L197     3.31E-03     2             187
Lung                        HRAS         PF00071.17  G13      4.45E-12     6             748
Upper aero-digestive tract  HRAS         PF00071.17  G13      7.97E-08     4             88
Upper aero-digestive tract  HRAS         PF00071.17  Q61      7.49E-03     2             88
Hematopoietic & lymph       ITPR3        PF08709.6   A96      2.11E-08     4             187
Hematopoietic & lymph       ITPR3        PF08709.6   H95      3.32E-03     2             187
Endometrium                 KIF26B       PF00225.18  A517     2.43E-03     2             249
Endometrium                 KIF26B       PF00225.18  A650     2.43E-03     2             249
Endometrium                 KIF26B       PF00225.18  P504     2.43E-03     2             249
Endometrium                 KIF26B       PF00225.18  S714     2.43E-03     2             249
Meninges                    KLF4         PF13465.1   K409     0.00E+00     3             28
Lung                        KRAS         PF00071.17  G12      0.00E+00     236           748
Lung                        KRAS         PF00071.17  G13      0.00E+00     23            748
Pancreas                    KRAS         PF00071.17  G12      0.00E+00     30            90
Colon                       KRAS         PF00071.17  G12      0.00E+00     123           562
Endometrium                 KRAS         PF00071.17  G12      0.00E+00     58            249
Colon                       KRAS         PF00071.17  G13      0.00E+00     25            562
Endometrium                 KRAS         PF00071.17  G13      0.00E+00     14            249
Breast                      KRAS         PF00071.17  G12      0.00E+00     10            526
Ovary                       KRAS         PF00071.17  G12      0.00E+00     6             385
Oesophagus                  KRAS         PF00071.17  G12      1.76E-11     6             150
Hematopoietic & lymph       KRAS         PF00071.17  A146     8.47E-08     4             187
Colon                       KRAS         PF00071.17  A146     2.31E-07     12            562
Hematopoietic & lymph       KRT6A        PF00038.16  A335     5.39E-04     2             187
Colon                       LPHN3        PF02354.11  R1183    0.00E+00     8             562
Prostate                    LPHN3        PF02191.11  R234     0.00E+00     3             240
Kidney                      LPHN3        PF01825.16  M828     0.00E+00     3             305
Prostate                    LPHN3        PF12003.3   A760     0.00E+00     3             240
Kidney                      LPHN3        PF12003.3   A650     0.00E+00     3             305
Colon                       LPHN3        PF00002.19  K914     4.74E-07     3             562
Colon                       LPHN3        PF02191.11  D264     2.11E-06     3             562
Endometrium                 LPHN3        PF00002.19  I1052    6.97E-06     3             249
Endometrium                 LPHN3        PF00002.19  L947     6.97E-06     3             249
Endometrium                 LPHN3        PF12003.3   R629     1.07E-05     3             249
Endometrium                 LPHN3        PF12003.3   T783     1.07E-05     3             249
Breast                      LPHN3        PF12003.3   K626     1.09E-05     3             526
Breast                      LPHN3        PF12003.3   T706     1.09E-05     3             526
Colon                       LPHN3        PF12003.3   L651     9.05E-05     3             562
Colon                       LPHN3        PF12003.3   R629     9.05E-05     3             562
Prostate                    LPHN3        PF01825.16  R826     6.29E-03     3             240
Prostate                    LPHN3        PF01825.16  W832     6.29E-03     3             240
Cervix                      MAST1        PF00595.19  A1048    9.96E-03     2             14
Cervix                      MAST1        PF00595.19  E1035    9.96E-03     2             14
Endometrium                 MBD1         PF01429.14  R22      0.00E+00     11            249
Kidney                      MBD1         PF01429.14  D58      0.00E+00     4             305
Endometrium                 MBD1         PF02008.15  R229     1.51E-03     3             249
Breast                      MLL3         PF13771.1   R284     2.53E-08     6             526
Prostate                    MLL3         PF13771.1   G315     6.53E-05     4             240
Prostate                    MLL3         PF13771.1   R284     6.53E-05     4             240
Breast                      MLL3         PF13771.1   G315     2.06E-04     4             526
Kidney                      MT-ND6       PF00499.15  S84      1.07E-04     4             305
Hematopoietic & lymph       MYD88        PF01582.15  L265     0.00E+00     13            187
Hematopoietic & lymph       NPIPL2       PF06409.6   P143     1.75E-03     2             187
Hematopoietic & lymph       NPIPL2       PF06409.6   P204     1.75E-03     2             187
Skin                        NRAS         PF00071.17  Q61      0.00E+00     20            104
Colon                       NRAS         PF00071.17  G12      0.00E+00     10            562
Colon                       NRAS         PF00071.17  Q61      0.00E+00     10            562
Lung                        NRAS         PF00071.17  Q61      0.00E+00     6             748
Endometrium                 NRAS         PF00071.17  Q61      8.16E-05     3             249
Bone                        NTNG1        PF00055.12  R156     0.00E+00     3             5
Oesophagus                  ODZ2         PF06484.7   R183     6.42E-04     2             150
Oesophagus                  ODZ2         PF06484.7   R229     6.42E-04     2             150
Oesophagus                  ODZ2         PF06484.7   R350     6.42E-04     2             150
Hematopoietic & lymph       PCDH11X      PF08266.7   N48      0.00E+00     3             187
Central nervous system      PCDH11X      PF00028.12  G442     1.96E-08     6             305
Central nervous system      PCDH11X      PF00028.12  T486     1.96E-08     6             305
Central nervous system      PCDH11Y      PF00028.12  G474     1.27E-05     4             305
Central nervous system      PCDH11Y      PF00028.12  T523     1.27E-05     4             305
Endometrium                 PIK3CA       PF02192.11  R88      0.00E+00     14            249
Endometrium                 PIK3CA       PF00613.15  E545     0.00E+00     30            249
Endometrium                 PIK3CA       PF00613.15  E542     0.00E+00     24            249
Endometrium                 PIK3CA       PF00613.15  Q546     0.00E+00     18            249
Lung                        PIK3CA       PF00613.15  E545     0.00E+00     42            748
Colon                       PIK3CA       PF00613.15  E545     0.00E+00     36            562
Urinary tract               PIK3CA       PF00613.15  E545     0.00E+00     16            85
Lung                        PIK3CA       PF00613.15  E542     0.00E+00     16            748
Breast                      PIK3CA       PF00613.15  E545     0.00E+00     15            526
Colon                       PIK3CA       PF00613.15  Q546     0.00E+00     13            562
Oesophagus                  PIK3CA       PF00613.15  E542     0.00E+00     8             150
Breast                      PIK3CA       PF00792.19  C420     1.30E-12     6             526
Endometrium                 PIK3CA       PF02192.11  R93      9.53E-11     12            249
Oesophagus                  PIK3CA       PF00613.15  E545     6.50E-10     6             150
Colon                       PIK3CA       PF00454.22  F909     9.39E-09     6             562
Colon                       PIK3CA       PF02192.11  R88      2.50E-08     8             562
Upper aero-digestive tract  PIK3CA       PF00613.15  E545     4.10E-08     4             88
Colon                       PIK3CA       PF00613.15  E542     6.10E-08     8             562
Endometrium                 PIK3CA       PF02192.11  R38      6.24E-08     10            249
Lung                        PIK3CA       PF00792.19  E453     1.64E-07     4             748
Endometrium                 PIK3CA       PF00792.19  C378     3.93E-07     6             249
Endometrium                 PIK3CA       PF00792.19  C420     3.93E-07     6             249
Endometrium                 PIK3CA       PF00454.22  C901     5.54E-07     4             249
Colon                       PIK3CA       PF02192.11  R38      3.60E-05     6             562
Colon                       PIK3CA       PF00454.22  E970     7.23E-05     4             562
Colon                       PIK3CA       PF00454.22  G1007    7.23E-05     4             562
Colon                       PIK3CA       PF00454.22  S1008    7.23E-05     4             562
Colon                       PIK3CA       PF00792.19  C420     7.78E-05     4             562
Colon                       PIK3CA       PF00792.19  P471     7.78E-05     4             562
Breast                      PIK3CA       PF00613.15  E542     2.35E-04     4             526
Breast                      PIK3CA       PF02192.11  R88      2.46E-04     4             526
Endometrium                 PIK3CA       PF00792.19  E453     1.02E-03     4             249
Endometrium                 PIK3CA       PF02192.11  R108     6.26E-03     6             249
Hematopoietic & lymph       PIM1         PF00069.20  S97      1.85E-07     5             187
Hematopoietic & lymph       PIM1         PF00069.20  L164     1.63E-03     3             187
Endometrium                 POLE         PF03104.14  P286     0.00E+00     18            249
Endometrium                 POLE         PF03104.14  V411     0.00E+00     10            249
Colon                       POLE         PF03104.14  P286     1.15E-10     5             562
Endometrium                 PPP2R1A      PF13646.1   P179     3.82E-10     8             249
Colon                       PPP2R1A      PF13646.1   R183     6.41E-06     4             562
Endometrium                 PPP2R1A      PF13646.1   S256     8.12E-05     5             249
Ovary                       PPP2R1A      PF13646.1   R183     1.36E-04     3             385
Lung                        PRKCB        PF00069.20  V522     7.69E-08     4             748
Colon                       PRKCB        PF00069.20  A509     1.80E-06     3             562
Hematopoietic & lymph       PRKCB        PF00069.20  D376     4.39E-05     3             187
Hematopoietic & lymph       PRKCB        PF00069.20  I432     4.39E-05     3             187
Hematopoietic & lymph       PRKCB        PF00069.20  S352     4.39E-05     3             187
Lung                        PRPF19       PF00400.27  T287     0.00E+00     9             748
Central nervous system      PSTPIP2      PF00611.18  P64      0.00E+00     4             305
Endometrium                 PTEN         PF00782.15  R130     0.00E+00     39            249
Colon                       PTEN         PF10409.4   K344     1.55E-04     4             562
Colon                       PTEN         PF00782.15  D92      1.84E-04     5             562
Colon                       PTEN         PF00782.15  E150     1.84E-04     5             562
Central nervous system      PTEN         PF00782.15  D92      3.58E-04     3             305
Breast                      PTEN         PF00782.15  R130     3.61E-04     3             526
Colon                       PTEN         PF00782.15  C136     5.70E-03     4             562
Colon                       PTEN         PF00782.15  R130     5.70E-03     4             562
Breast                      PTPRD        PF07679.11  R216     0.00E+00     4             526
Endometrium                 PTPRD        PF07679.11  C257     0.00E+00     4             249
Kidney                      PTPRD        PF00041.16  R452     0.00E+00     3             305
Skin                        RAC1         PF00071.17  P29      0.00E+00     26            104
Meninges                    RBP1         PF00061.18  I113     5.01E-04     2             28
Colon                       SCYL3        PF00069.20  R61      0.00E+00     6             562
Urinary tract               SCYL3        PF00069.20  D76      1.82E-03     2             85
Urinary tract               SCYL3        PF00069.20  E177     1.82E-03     2             85
Urinary tract               SCYL3        PF00069.20  R61      1.82E-03     2             85
Cervix                      SLC16A5      PF07690.11  F252     1.22E-04     2             14
Cervix                      SLC16A5      PF07690.11  I279     1.22E-04     2             14
Colon                       SMAD4        PF03166.9   R361     0.00E+00     16            562
Oesophagus                  SMAD4        PF03166.9   G386     9.68E-07     4             150
Oesophagus                  SMAD4        PF03166.9   R361     1.80E-04     3             150
Colon                       SMAD4        PF03166.9   D351     9.24E-04     5             562
Hematopoietic & lymph       SMARCA4      PF00176.18  R973     0.00E+00     4             187
Central nervous system      SMARCA4      PF00176.18  T910     1.85E-10     5             305
Oesophagus                  SMARCA4      PF00176.18  R966     2.12E-03     2             150
Oesophagus                  SMARCA4      PF00176.18  R978     2.12E-03     2             150
Oesophagus                  SMARCA4      PF00176.18  T910     2.12E-03     2             150
Central nervous system      SMARCA4      PF00176.18  E821     5.12E-03     2             305
Prostate                    SPOP         PF00917.21  F133     4.89E-07     5             240
Endometrium                 SPOP         PF00917.21  E50      5.54E-04     3             249
Oesophagus                  TP53         PF00870.13  R248     0.00E+00     33            150
Breast                      TP53         PF00870.13  H193     0.00E+00     29            526
Breast                      TP53         PF00870.13  Y220     0.00E+00     25            526
Breast                      TP53         PF00870.13  H179     0.00E+00     21            526
Liver                       TP53         PF00870.13  Y163     0.00E+00     15            123
Central nervous system      TP53         PF00870.13  R273     0.00E+00     18            305
Lung                        TP53         PF00870.13  R248     0.00E+00     45            748
Lung                        TP53         PF00870.13  G245     0.00E+00     41            748
Lung                        TP53         PF00870.13  R273     0.00E+00     39            748
Lung                        TP53         PF00870.13  R158     0.00E+00     30            748
Lung                        TP53         PF00870.13  R249     0.00E+00     27            748
Hematopoietic & lymph       TP53         PF00870.13  C176     0.00E+00     15            187
Hematopoietic & lymph       TP53         PF00870.13  R248     0.00E+00     15            187
Ovary                       TP53         PF00870.13  Y220     0.00E+00     21            385
Ovary                       TP53         PF00870.13  R248     0.00E+00     18            385
Colon                       TP53         PF00870.13  R248     0.00E+00     69            562
Urinary tract               TP53         PF00870.13  R248     0.00E+00     28            85
Colon                       TP53         PF00870.13  G245     0.00E+00     24            562
Endometrium                 TP53         PF00870.13  R248     0.00E+00     21            249
Lung                        TP53         PF07710.6   R337     0.00E+00     12            748
Hematopoietic & lymph       TP53         PF08563.6   T18      0.00E+00     4             187
Central nervous system      TP53         PF07710.6   A347     0.00E+00     4             305
Bone                        TP53         PF00870.13  R282     0.00E+00     3             5
Breast                      TP53         PF00870.13  R248     1.85E-12     19            526
Liver                       TP53         PF07710.6   R337     2.97E-12     8             123
Oesophagus                  TP53         PF00870.13  R273     1.28E-11     14            150
Skin                        TP53         PF00870.13  S127     1.65E-11     9             104
Hematopoietic & lymph       TP53         PF00870.13  Y234     4.82E-11     12            187
Colon                       TP53         PF00870.13  R273     4.30E-10     18            562
Prostate                    TP53         PF00870.13  C135     6.96E-10     9             240
Lung                        TP53         PF00870.13  C176     1.46E-09     20            748
Oesophagus                  TP53         PF00870.13  G245     5.34E-09     12            150
Upper aero-digestive tract  TP53         PF00870.13  R248     8.56E-09     9             88
Upper aero-digestive tract  TP53         PF00870.13  V173     8.56E-09     9             88
Breast                      TP53         PF00870.13  C141     3.87E-08     15            526
Breast                      TP53         PF00870.13  G245     3.87E-08     15            526
Breast                      TP53         PF00870.13  I195     3.87E-08     15            526
Breast                      TP53         PF00870.13  Y163     3.87E-08     15            526
Pancreas                    TP53         PF00870.13  H193     7.20E-08     6             90
Pancreas                    TP53         PF00870.13  R248     7.20E-08     6             90
Lung                        TP53         PF00870.13  C242     9.68E-08     18            748
Lung                        TP53         PF00870.13  V157     9.68E-08     18            748
Ovary                       TP53         PF00870.13  I195     3.50E-07     12            385
Hematopoietic & lymph       TP53         PF00870.13  G245     6.74E-07     9             187
Hematopoietic & lymph       TP53         PF00870.13  Y107     6.74E-07     9             187
Lung                        TP53         PF00870.13  Y220     7.28E-07     17            748
Central nervous system      TP53         PF00870.13  R158     1.32E-06     9             305
Skin                        TP53         PF00870.13  H179     2.85E-06     6             104
Skin                        TP53         PF00870.13  R248     2.85E-06     6             104
Breast                      TP53         PF00870.13  K132     3.83E-06     13            526
Ovary                       TP53         PF00870.13  C176     4.31E-06     11            385
Endometrium                 TP53         PF00870.13  R273     5.13E-06     8             249
Lung                        TP53         PF00870.13  Y163     5.18E-06     16            748
Colon                       TP53         PF07710.6   R337     1.26E-05     4             562
Oesophagus                  TP53         PF00870.13  C135     2.34E-05     9             150
Oesophagus                  TP53         PF00870.13  C176     2.34E-05     9             150
Oesophagus                  TP53         PF00870.13  H193     2.34E-05     9             150
Central nervous system      TP53         PF00870.13  R282     2.38E-05     8             305
Colon                       TP53         PF00870.13  G244     2.92E-05     13            562
Prostate                    TP53         PF00870.13  G245     3.32E-05     6             240
Prostate                    TP53         PF00870.13  R248     3.32E-05     6             240
Breast                      TP53         PF00870.13  V173     3.42E-05     12            526
Lung                        TP53         PF00870.13  P278     3.47E-05     15            748
Lung                        TP53         PF00870.13  R26      3.47E-05     15            748
Ovary                       TP53         PF00870.13  R273     4.88E-05     10            385
Endometrium                 TP53         PF00870.13  Y205     9.94E-05     7             249
Urinary tract               TP53         PF00870.13  G245     1.19E-04     6             85
Upper aero-digestive tract  TP53         PF00870.13  H193     1.94E-04     6             88
Upper aero-digestive tract  TP53         PF00870.13  R282     1.94E-04     6             88
Upper aero-digestive tract  TP53         PF00870.13  Y220     1.94E-04     6             88
Pancreas                    TP53         PF00870.13  G266     3.12E-04     4             90
Oesophagus                  TP53         PF00870.13  R282     3.13E-04     8             150
Ovary                       TP53         PF00870.13  G245     5.03E-04     9             385
Ovary                       TP53         PF00870.13  H179     5.03E-04     9             385
Ovary                       TP53         PF00870.13  R282     5.03E-04     9             385
Liver                       TP53         PF00870.13  C176     8.34E-04     6             123
Liver                       TP53         PF00870.13  P151     8.34E-04     6             123
Liver                       TP53         PF00870.13  R249     8.34E-04     6             123
Endometrium                 TP53         PF00870.13  R158     1.69E-03     6             249
Hematopoietic & lymph       TP53         PF00870.13  S215     3.49E-03     6             187
Hematopoietic & lymph       TP53         PF00870.13  Y126     3.49E-03     6             187
Kidney                      TP53         PF00870.13  H193     5.04E-03     3             305
Kidney                      TP53         PF00870.13  N239     5.04E-03     3             305
Kidney                      TP53         PF00870.13  R213     5.04E-03     3             305
Kidney                      TP53         PF00870.13  R248     5.04E-03     3             305
Kidney                      TP53         PF00870.13  Y234     5.04E-03     3             305
Central nervous system      TP53         PF00870.13  G245     5.39E-03     6             305
Central nervous system      TP53         PF00870.13  R248     5.39E-03     6             305
Central nervous system      TP53         PF00870.13  R249     5.39E-03     6             305
Central nervous system      TP53         PF00870.13  V216     5.39E-03     6             305
Central nervous system      TP53         PF00870.13  Y220     5.39E-03     6             305
Lung                        TP53         PF00870.13  G244     7.06E-03     12            748
Colon                       TP53         PF00870.13  C238     9.97E-03     10            562
Colon                       TP53         PF00870.13  R282     9.97E-03     10            562
Lung                        U2AF1        PF00642.19  S34      0.00E+00     24            748
Liver                       USP25        PF00443.24  E600     4.53E-06     3             123
Liver                       USP25        PF00443.24  T427     4.53E-06     3             123
Colon                       VHL          PF01847.11  F119     6.24E-06     6             562
Kidney                      VHL          PF01847.11  L158     2.46E-03     5             305
Colon                       VHL          PF01847.11  L128     6.53E-03     4             562

Copyright Acknowledgements