
Imperial College London Department of Life Sciences

The Relationships Between Genetic Diseases and the Structures of Protein and their Interactions

Sirawit Ittisoponpisan

Submitted in part fulfilment of the requirements for the degree of Doctor of Philosophy of Imperial College London and the Diploma of Imperial College London, September 2018

Declaration of Originality

I, Sirawit Ittisoponpisan, declare that this thesis titled, “The Relationships Between Genetic Diseases and the Structures of Protein and their Interactions” and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a research degree at this University.

• Where I have consulted the published work of others, this is always clearly attributed.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Sirawit Ittisoponpisan

Copyright Declaration

The copyright of this thesis rests with the author and is made available under a Creative Commons Attribution Non-Commercial No Derivatives licence. Researchers are free to copy, distribute or transmit the thesis on the condition that they attribute it, that they do not use it for commercial purposes and that they do not alter, transform or build upon it. For any reuse or redistribution, researchers must make clear to others the licence terms of this work.

Abstract

The rapid growth in the number of human genome sequences is leading to the discovery of numerous single amino acid variants (SAVs), some of which are associated with genetic diseases. Understanding these variants at both the sequence level and the structural level is of great benefit, particularly for clinical researchers.

The first part of this thesis focuses on characterising pleiotropic proteins (proteins associated with multiple disease phenotypes). This study shows that pleiotropic proteins have unique properties: 1) they are enriched in disease-associated SAVs and rare neutral SAVs, 2) they tend to contain disordered regions, 3) they tend to have a higher number of protein-protein interactions, and 4) they are more specific to diseases of certain classes such as neoplasm, nervous system, and congenital malformation. This new understanding of pleiotropic proteins could have important implications for evolutionary studies, genetic studies, and any other related field.

At the structural level, the effects of SAVs on protein structures were analysed by remodelling high-quality PDB coordinates of proteins to obtain their corresponding mutant structures. By considering 16 structural features affecting protein stability, 40% of disease-associated SAVs and 11.5% of neutral SAVs were shown to cause at least one of the 16 structural effects. The limited number of available experimental PDB structures poses a major challenge in structural analysis. However, incorporating homology models allows more SAVs to be analysed, as the residue coverage increases from 17% to approximately 50%. The same analysis was repeated on homology models with sequence identity ranging from 30% to 95%. Similar percentages of disease-associated and neutral SAVs causing structural effects were observed even on models with low sequence identity (<40%). Subsequently, a web server, Missense3D, was developed to help analyse the structural effect of SAVs on protein structures. A better understanding of genetic variants could help researchers unveil the molecular mechanisms underlying human disease, which could support further research such as in drug design and personalised medicine.

Acknowledgements

First of all, I would like to express my sincere gratitude to all the people I worked with at the Structural Bioinformatics Group, Imperial College London. My PhD career would not have been successful without these people. I thank Professor Michael J. E. Sternberg for his brilliant supervision, Dr Alessia David for all her support and valuable advice, Dr Suhail Islam and Dr Lawrence Kelly for their support on computational resources and helpful information, and all the MSc and PhD students for the valuable time we spent together in our research projects.

Importantly, I would like to thank my family and friends, particularly my fellow Ballroom dancers at Imperial College London, for all their emotional support and their general advice on college life, which kept me motivated throughout my study.

Lastly, I would like to thank the Royal Thai Government for giving me an opportunity to study in the UK and for sponsoring me throughout my Master’s and Doctorate Degree study at Imperial College London.

List of Publications

Accepted articles

David, A., Ittisoponpisan, S. and Sternberg, M. J. E. (2018). Protein structure analysis aids in the interpretation of genetic variants of uncertain clinical significance identified in the . Supplements, 34:e2-e3.

Ittisoponpisan, S. and David, A. (2018). Structural biology helps interpret variants of uncertain significance in genes causing endocrine and metabolic disorders. Journal of the Endocrine Society, 2(8), 842-854.

Ittisoponpisan, S., Alhuzimi, E., Sternberg, M. J. E. and David, A. (2017). Landscape of pleiotropic proteins causing human disease: structural and system biology insights. Human , 38(3), 289-296.

Under submission

Ittisoponpisan, S., Islam, S. A., Alhuzimi, E., David, A. and Sternberg, M. J. E. (2018). Missense3D: can predicted protein 3D-structures provide reliable insights into whether missense variants are disease-associated?.

Contents

Abstract

Acknowledgements

List of Publications

1 Introduction

1.1 Aim of the thesis

1.2 SNPs and SAVs

1.3 Relationships between gene and traits: introducing pleiotropy

1.4 Protein structure and folding

1.4.1 Intrinsically disordered regions (IDRs)

1.4.2 Tools for determining and predicting disordered regions

1.4.3 Protein structure modelling

1.5 Prediction of the effect of SAVs on protein structure

1.6 Structural features used in structure-based prediction methods

1.7 Statistical methods

1.7.1 P-value

1.7.2 Mann-Whitney Wilcoxon test

1.7.3 Odds ratio (OR) and ratio of odds ratios (ROR)

1.7.4 Hypergeometric test

1.7.5 Z test of independent proportions

1.7.6 Benjamini-Hochberg procedure

1.7.7 True positive rate (TPR) and false positive rate (FPR)

2 Pleiotropy in Human Disease

2.1 Background

2.2 Materials & methods

2.2.1 Definition of pleiotropic proteins

2.2.2 Disease classification

2.2.3 Dataset

2.2.4 Pfam domains

2.2.5 Disordered regions

2.2.6 Essential proteins

2.2.7 Protein-protein interaction

2.2.8 Minor allele frequency

2.2.9 Statistical methods

2.3 Results

2.3.1 Global characteristics of pleiotropic and non-pleiotropic proteins

2.3.2 Protein essentiality

2.3.3 Number of interactions

2.3.4 Differences between pleiotropic and non-pleiotropic proteins in the distribution of SAVs in disordered regions

2.3.5 ICD-10 disease classes in pleiotropic and non-pleiotropic proteins

2.3.6 The association between two disease classes in pleiotropic proteins

2.3.7 The clustering of SAVs in Pfam domains in pleiotropic proteins

2.3.8 Public web-based database

2.4 Discussion

2.5 Conclusion

3 Structural Effects Of On Human Proteins

3.1 Background

3.2 Materials & methods

3.2.1 Dataset of high-quality experimental structures

3.2.2 Dataset of SAV variants

3.2.3 Alignment of FASTA sequences to PDB structures

3.2.4 Side chain repacking method

3.2.5 Calculation of damaging structural features

3.2.6 Statistical evaluation of performance

3.3 Results

3.3.1 Performance of each structural feature

3.3.2 Insights into variants affecting secondary structure

3.3.3 Overlapping between SIFT prediction and Missense3D analysis

3.3.4 Analysis of rare, common, and unknown neutral variants

3.4 Discussion

3.5 Conclusion

4 Structural Effects Of Mutations On Homology Models

4.1 Background

4.2 Materials & methods

4.2.1 Analysis pipeline

4.2.2 Datasets of predicted structures and variants

4.2.3 Model assessment

4.3 Results

4.3.1 Quality of the predicted models

4.3.2 Performance of Missense3D on homology models at different sequence identities

4.3.3 The performance of each structural feature at different sequence identities

4.3.4 Case studies

4.4 Conclusion

5 Discussion

5.1 Stringent cut-off versus relaxed cut-off for chemical bonds in Missense3D

5.2 The prediction performance is generally well preserved in homology models

5.3 Future work for Missense3D

6 Conclusion

Bibliography

A Pleiotropy in Human Disease

B Mutation Prediction on Experimental Structures

C Mutation Prediction on Homology Models

List of Tables

2.1 List of major classes in ICD-10

2.2 Observed and expected numbers of SAVs in disordered regions

2.3 Observed and expected numbers of SAVs in disordered regions for DPh and SPh pleiotropic proteins

2.4 Distribution of SAVs on Pfam domains

2.5 Contingency matrices for observed and expected numbers of disease-associated SAVs

3.1 Performance of Missense3D on experimental structures

3.2 Variants leading to secondary structure altered

4.1 The 16 structural features evaluated by Missense3D

4.2 Performance of Missense3D on predicted models and their equivalent experimental structures in each sequence identity range

5.1 Structural features involving bond length

A.1 The number of proteins in each dataset

A.2 Odds ratios comparing the preferences of SAVs between pleiotropic and non-pleiotropic proteins

A.3 Numbers of proteins with disordered regions and percentages of disordered residues (pleiotropic DPh vs. non-pleiotropic)

A.4 Numbers of proteins with disordered regions and percentages of disordered residues (pleiotropic vs. non-pleiotropic)

A.5 46 selected mouse phenotypes for the gene essentiality analysis

A.6 Percentages of essential proteins

A.7 Interaction statistics using data from BioGrid and IntAct

A.8 Interaction statistics by protein essentiality using data from BioGrid

A.9 Observed and expected numbers of neutral SAVs in disordered regions

A.10 Observed and expected numbers of SAVs in IDR using IUPred and DISOPRED3

A.11 Observed and expected numbers of SAVs in Pfam domains of full pleiotropic set vs. non-pleiotropic set

A.12 Disease SAVs clustering in Pfam domains in the full pleiotropic set

B.1 Maximum solvent accessibility by Rost and Sander

C.1 Number of disease and neutral variants triggering a structural alert

List of Figures

1.1 Relationship between genotypes and phenotypes

1.2 Structured protein and disordered region

1.3 Example result from disordered-region prediction by IUPred

1.4 Example disordered-region prediction result by DISOPRED3 on PSIPRED Workbench web server

1.5 Outline for homology modelling

1.6 Example results from Phyre2 normal mode

1.7 Example results from SAAPdap

1.8 The backbone dihedral angles (phi, psi, and omega angles)

1.9 Ramachandran plots from data by Lovell et al. (2003)

1.10 Disulphide bonds in alpha-galactosidase A (GLA)

1.11 Method for cavity volume calculation

1.12 Surface cavities and internal cavities detected in phenylalanine hydroxylase (PAH)

2.1 The general concept of pleiotropic and non-pleiotropic proteins

2.2 Overview of the method

2.3 Stacked bar chart showing numbers of disease-associated and neutral SAVs and their proportions in each protein dataset

2.4 Odds ratios comparing the distribution preferences of disease-associated SAVs and neutral SAVs (from Humsavar and ExAC) in pleiotropic DPh vs. non-pleiotropic proteins

2.5 Percentages of proteins with disordered regions

2.6 Percentages of essential proteins in pleiotropic and non-pleiotropic protein sets

2.7 Mean number of interactions

2.8 Mean number of interactions by protein essentiality

2.9 Pleiotropic residue and pleiotropic SAV

2.10 The assignment of ICD-10 classes to pleiotropic and non-pleiotropic proteins

2.11 The frequency of different disease classes found in pleiotropic and non-pleiotropic proteins

2.12 The approach for constructing the heat map

2.13 Heat map showing the clustering of different disease classes found in pleiotropic DPh proteins

2.14 Overview of the SAVs clustering test

2.15 PleiotropyDB website interface

3.1 The pipeline to analyse the structural impact of missense variants in experimental structures

3.2 Side chain repacking method

3.3 Performance of each structural feature

3.4 Cartoon representation of four neutral variants that triggered the secondary structure alert

3.5 Overlap between SIFT and structural analysis in the identification of disease-associated SAVs with the corresponding numbers of TPs and FPs

3.6 Positive rates of common, rare, and unknown SAVs

3.7 Effects of mutations on proteins

3.8 Transthyretin (TTR)

3.9 Comparison between wild-type and mutant structures of TTR

3.10 Phenylalanine hydroxylase (PAH)

3.11 Variants in phenylalanine hydroxylase (PAH)

4.1 The RMSD distribution of predicted structures

4.2 Missense3D performance at different sequence identities

4.3 Histogram of the relative frequencies of RMSD of predicted models grouped according to whether the variant was a TP or FP

4.4 Performance of each structural feature (part 1)

4.5 Performance of each structural feature (part 2)

4.6 Performance of each structural feature (part 3)

4.7 The distribution of MolProbity clash scores of predicted wild-type structures

4.8 Box plot showing total cavity volume calculated from predicted structures vs. experimental structures

4.9 Missense3D analysis pipeline

4.10 Missense3D home page

4.11 Example result on experimental protein structure PDB 3HG3 (B), variant Cys52Arg (part one)

4.12 Example result on experimental protein structure PDB 3HG3 (B), variant Cys52Arg (part two)

4.13 Example result on experimental protein structure PDB 3HG3 (B), variant Cys52Arg (part three)

4.14 Carbonic anhydrase II

4.15 Analysis of the disease variant p.His107Tyr in the experimental and predicted structures of carbonic anhydrase II

4.16 Alpha-galactosidase A

4.17 Analysis of the variant p.Cys52Arg in the experimental and predicted structures of alpha-galactosidase

List of Abbreviations

ASA Accessible Surface Area

BLAST Basic Local Alignment Search Tool

CI Confidence Interval

Condel Consensus Deleteriousness Score of Missense Mutation

DPh Different Phenotypes

DSSP Define Secondary Structure of Proteins

ExAC Exome Aggregation Consortium

FATHMM Functional Analysis Through Hidden Markov Model

FDR False Discovery Rate

FN False Negative

FP False Positive

FPR False Positive Rate

GO Gene Ontology

GWAS Genome-Wide Association Study

HGMD Human Gene Mutation Database

HMM Hidden Markov Model

HOPE Have yOur Protein Explained

ICD-10 International Classification of Diseases 10th revision

IDP Intrinsically Disordered Protein

IDR Intrinsically Disordered Region

MAF Minor Allele Frequency

MGI Mouse Genome Informatics

MT Mutant

NHGRI National Human Genome Research Institute

NMR Nuclear Magnetic Resonance

OGEE Online Gene Essentiality Database

OMIM Online Mendelian Inheritance in Man

OR Odds Ratio

PDB Protein Data Bank

Phyre2 Protein Homology/analogY Recognition Engine V 2.0

PolyPhen-2 Polymorphism Phenotyping v2

PopMusic Prediction of Protein Mutant Stability Changes

PPI Protein-Protein Interaction

PROVEAN Protein Variation Effect Analyzer

R3 Residue-Rotamer-Reduction

RMSD Root-Mean-Square Deviation

ROR Ratio of Odds Ratios

RSA Relative accessible Surface Area or Relative Solvent Accessibility

SAAPdap Single Amino Acid Polymorphism data analysis pipeline

SAV Single Amino Acid Variant

SDM Site Directed Mutator

SIFT Sorting Intolerance From Tolerance

SLiM Short Linear Motif

SPh Similar Phenotypes

STRUM Structure-based sTability change pRediction Upon single-point Mutation

SuSPect Disease-Susceptibility-based SAV Phenotype Prediction

SVM Support Vector Machine

TN True Negative

TP True Positive

TPR True Positive Rate

VUS Variant of Uncertain Significance

WAS Weighted Average of the Normalised Score

WHO World Health Organisation

WT Wild Type

WTR Repacked Wild Type

YASARA Yet Another Scientific Artificial Reality Application

Chapter 1

Introduction

Studies aiming at exome or whole-genome sequencing of large cohorts of patients are underway, including the 100,000 Genomes Project (Peplow, 2016). These studies will generate an unprecedented amount of novel genetic variations that need prioritisation and characterisation in order to derive biological information that can lead the way to personalised medicine. A better understanding of the genotype-phenotype relationship is, therefore, of paramount importance.

1.1 Aim of the thesis

This thesis aims to explore the relationships between diseases and their predisposing genes and the protein structures of these genes. The study is divided into two parts.

The first part, reported in Chapter 2, aims to explore the molecular mechanisms of pleiotropy in disease phenotypes. The study reports the characteristics of pleiotropic proteins (proteins associated with multiple disease phenotypes) and an analysis of disease-associated single amino acid variants (SAVs) in relation to their locations on protein sequences. Pleiotropic proteins were compared to non-pleiotropic proteins regarding mutation frequencies, protein-protein interactions (PPIs), the types of associated disease, and protein essentiality.

The second part reports the development of a novel tool to analyse SAVs using a structure-based approach. This has been split into two chapters. Chapter 3 reports a structural analysis of disease and neutral missense variants and their effects on high-quality protein structures. In Chapter 4, the analysis was performed on homology-built structures to examine how effective structural analysis is on predicted models. In addition, Chapter 4 reports the development of Missense3D, a publicly available web server to help analyse the effect of SAVs on protein structures. Discussion and future improvements of Missense3D are reported in Chapter 5.

Background material related to these studies is reviewed below.

1.2 SNPs and SAVs

The term single nucleotide polymorphism, or SNP, refers to a substitution of a nucleotide base in a DNA sequence. When a SNP occurs in the coding region of a gene, the possible outcomes are as follows.

Synonymous SNP A SNP may leave the encoded amino acid unchanged, because most amino acids are encoded by more than one codon.

Non-synonymous SNP A change in a nucleotide base in a codon may result in an amino acid change, also known as a missense variant, or single amino acid variant (SAV). This type of genetic change accounts for the most numerous class of protein structure-altering mutations. In addition, multiple changes in a codon can lead to a SAV in some cases.

Another outcome of a mutation that affects the amino acid sequence is the creation of a stop codon (TAG, TAA, or TGA). This is known as a nonsense or stop-gain mutation; it truncates the protein sequence, which may lead to incomplete folding and loss of protein function.

Because this thesis focuses on protein sequences and their structures, only SAVs are covered. Other types of mutation, such as synonymous mutations and nonsense mutations, are not considered.
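The three outcomes described above can be illustrated with a minimal sketch that applies a single-base substitution to a codon and compares the translated amino acids under the standard genetic code. The function name `classify_snp` and the 0-based `position` argument are assumptions made for illustration only; they are not part of any tool used in this thesis.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(codon, position, new_base):
    """Classify a single-base substitution within one codon.

    codon    -- three-letter DNA codon on the coding strand, e.g. "GAA"
    position -- 0-based index of the substituted base (hypothetical convention)
    new_base -- the replacement nucleotide
    """
    codon = codon.upper()
    mutant = codon[:position] + new_base.upper() + codon[position + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutant]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        # Loss of a stop codon; not discussed in the text, listed for completeness.
        return "stop-loss"
    return "missense"  # a single amino acid variant (SAV)


examples = [("GAA", 2, "G"), ("GAA", 1, "T"), ("TGG", 2, "A")]
for codon, position, base in examples:
    print(codon, position, base, classify_snp(codon, position, base))
```

Only substitutions classified as missense by such a scheme correspond to the SAVs analysed in the following chapters.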

1.3 Relationships between gene and traits: introducing pleiotropy

There are arguably millions of phenotypic traits in humans; however, there are only about 20,000 genes in the human genome. Inevitably, the expression of a single gene (or a single mutation) can influence multiple traits or processes which appear to be unrelated. This phenomenon is known as ‘pleiotropy’ (from the ancient Greek ‘pleion’, meaning more, and ‘tropos’, meaning way), and was conceptualised in 1910 by the German developmental geneticist Ludwig Plate (Stearns, 2010). Conversely, a single trait can also be regulated by multiple genes (Figure 1.1). A complex phenotype is a trait influenced by multiple genes, either individually or through interactions with each other or the environment (Mitchell, 2012).

Figure 1.1: Relationship between genotypes and phenotypes. In this example, G1 and G2 are pleiotropic as they influence multiple phenotypes.

Despite the definition given by Plate, there had been no study of the mechanisms of pleiotropy until Hans Gruneberg (1938) conducted the first experiment in rats, with a particular focus on genetic skeletal abnormality. At first, two types of pleiotropy were proposed by Gruneberg: 1) ‘Genuine pleiotropy’ when one locus makes more than one distinct product, and 2) ‘Spurious pleiotropy’ when the primary product of the locus initiates a cascade of events resulting in different phenotypic consequences, or when the primary products are utilised in different ways. However, his study of a mutation in rats showed only a spurious pleiotropic effect; there was at that time no evidence to support his concept of genuine pleiotropy. With the growing support of the ‘one gene-one enzyme’ concept by Beadle and Tatum (1941), researchers believed that genuine pleiotropy probably did not exist. Later, the study of pleiotropy focused mainly on a single gene product leading to multiple traits. The term genuine pleiotropy slowly disappeared as Gruneberg’s ‘spurious pleiotropy’ came to be regarded simply as ‘pleiotropy’.

Pleiotropy was later redefined into two categories, ‘mosaic’ and ‘relational’ pleiotropy, by Hadorn and Mittwoch (1961). Mosaic pleiotropy refers to an event when a single locus directly affects two phenotypic traits, whereas relational pleiotropy is when a single locus initiates a cascade of events resulting in multiple traits. In parallel with Hadorn and Mittwoch’s definition, Wright and others introduced the term ‘universal pleiotropy’. This concept of pleiotropy means that a mutation at any locus can affect all traits through either a direct or an indirect influence. Universal pleiotropy came from the idea that genes are linked together in the same complex network architecture of the cell, which also includes polymerase, ribosomes, dNTPs, and many other molecules; therefore, mutations that alter protein activity or sequence are also likely to alter the activity of all other genes in this complex network. However, a criticism of the universal pleiotropy concept was that it assumes the presence of only one single gene network in an organism (Stearns, 2010).

In the late 1970s, the development of DNA sequencing technology led to a major discovery that a single locus can produce multiple alternative transcripts via alternative splicing or alternative start/stop codons (Pyeritz, 1989), thus supporting Gruneberg’s concept of ‘genuine pleiotropy’ (one locus creates multiple distinct products).

Two decades later, pleiotropy was re-categorised by Jonathan Hodgkin (1998) into the following seven types.

1. Artefactual pleiotropy when one mutation in a chromosome affects the process of two adjacent genes in the genome.

2. Secondary pleiotropy when a simple biochemical disorder leads to complex chains of phenotypic consequences. This is similar to Hadorn’s ‘relational pleiotropy’ concept. Complex organisms are likely to exhibit this type of pleiotropy.

3. Adoptive pleiotropy when one gene product is used in many different processes in different tissues.

4. Parsimonious pleiotropy when one gene product (enzyme) catalyses the same chemical reaction but in different pathways. Therefore, inactivation of this enzyme could result in complex consequences.

5. Opportunistic pleiotropy when one gene product has an additional secondary role in a different tissue, apart from its major role.

6. Combinatorial pleiotropy when a single gene product can be employed in several ways, possessing several properties, as it can interact with different partners in different cell types. A large number of pleiotropic cases were found to be in this category, particularly in the nervous system. A mutation affecting this type of gene is thought to have diverse effects across different tissues and systems.

7. Unifying pleiotropy when one gene or a group of genes encode multiple activities that support a common biological function. An example is when a single polyprotein with multiple enzymatic activities is encoded by two adjacent genes that work under common regulatory controls.

Despite proposing several types of pleiotropy, Hodgkin noted that more refined categorisations of pleiotropy could be proposed and also suggested that the types of pleiotropy that are of particular importance are combinatorial pleiotropy (6) and unifying pleiotropy (7), as they are more connected to the evolution and development of complex organisms.

Over the past century, pleiotropy studies have been conducted across different organisms, ranging from analyses of specific sets of genes to large-scale analyses. A study in yeast by He and Zhang (2006) showed a positive correlation between the degree of pleiotropy and the number of biological processes in cellular components. Another study in C. elegans by Zou et al. (2008) showed that pleiotropy could be observed extensively in genes expressed in the early stage of embryogenesis. The degree of pleiotropy, or the number of traits a single gene can affect, is usually higher in complex organisms. For instance, genes from mice show higher degrees of pleiotropy than genes from yeasts (Wagner et al., 2008; Wagner and Zhang, 2011).
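The degree of pleiotropy defined above (the number of traits a single gene can affect) can be computed directly from a list of gene-trait associations. The following is a minimal sketch; the gene and phenotype labels are hypothetical and loosely follow the G1/G2 notation of Figure 1.1, and the function name `pleiotropy_degree` is an assumption for illustration.

```python
from collections import defaultdict


def pleiotropy_degree(gene_trait_pairs):
    """Return {gene: number of distinct traits associated with that gene}."""
    traits_by_gene = defaultdict(set)
    for gene, trait in gene_trait_pairs:
        traits_by_gene[gene].add(trait)
    return {gene: len(traits) for gene, traits in traits_by_gene.items()}


# Hypothetical associations: G1 and G2 each influence two phenotypes,
# whereas G3 influences only one.
pairs = [
    ("G1", "P1"), ("G1", "P2"),
    ("G2", "P2"), ("G2", "P3"),
    ("G3", "P3"),
]
degrees = pleiotropy_degree(pairs)
pleiotropic_genes = sorted(g for g, d in degrees.items() if d > 1)
print(degrees)
print(pleiotropic_genes)
```

In this toy example, a gene is treated as pleiotropic when its degree exceeds one; the definition of pleiotropic proteins actually used in this thesis is given in Chapter 2.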

Pleiotropy in complex organisms also comes with a high cost of mutation, as a mutation in one gene for the evolution of one trait can result in unwanted effects on other traits (Wang et al., 2010). Thus, complex organisms with higher degrees of pleiotropy (e.g. mice) are less adaptable and would find it more difficult to evolve compared to organisms with less complicated gene-trait relationships (e.g. yeast).

In humans, pleiotropy has been shown to be often present in genes involved in complex diseases (deleterious phenotypes influenced by a combination of multiple genetic, environmental, and lifestyle factors). A study by Sivakumaran et al. showed that 16.9% of human genes and 4.6% of SNPs reported in the GWAS catalogue exhibit pleiotropic effects; from this, it was concluded that pleiotropy is a common property of genes and SNPs associated with complex diseases (Sivakumaran et al., 2011).

Despite some evidence suggesting the abundance of pleiotropy in complex organisms, most previous studies were undertaken on specific genes or phenotypes. Moreover, the mechanism of pleiotropy, particularly in a large-scale evaluation of pleiotropy in human genes, is yet to be explored in detail. Recently, an increasing number of human DNA sequences and genetic variations have been deposited in several genome databases such as the Human Gene Mutation Database (HGMD) (Stenson et al., 2017), the Exome Aggregation Consortium (ExAC) (Lek et al., 2016), ClinVar (Landrum et al., 2016, 2018), and the Universal Protein Resource (UniProt) (Bateman et al., 2017), making a large-scale pleiotropy study on the human genome feasible. The research challenges in the study of pleiotropy and human genes are presented in Chapter 2.

Figure 1.2: Structured protein and disordered region. (a) Globular structured protein, (b) Structured protein with one disordered region (shown as a string attached to the structured part), (c) Intrinsically disordered protein with no structured region. Partially structured proteins may also be regarded as intrinsically disordered proteins.

1.4 Protein structure and folding

1.4.1 Intrinsically disordered regions (IDRs)

Intrinsically disordered regions (IDRs), or disordered regions, are parts of protein sequences that lack a fixed three-dimensional structure (Figure 1.2). Proteins with disordered regions are also known as intrinsically disordered proteins (IDPs), although some researchers use the term IDPs for fully unstructured proteins, and “partially disordered proteins” for proteins with disordered regions. Disordered regions are normally categorised into short (≤ 30 residues) and long (> 30 residues) types (Radivojac et al., 2004; Peng et al., 2006; He et al., 2009). Differences in amino acid enrichment and function have been observed between the two. For example, short disordered regions are enriched in glycine and aspartic acid but depleted in leucine, isoleucine, and valine, whereas long disordered regions are enriched in lysine, glutamic acid, and proline, but depleted in glycine and asparagine (Romero et al., 1997; Radivojac et al., 2004). Short disordered regions can serve as loops or turns connecting helices and strands within the domain. Miniature disordered parts covering two to five amino acid residues, although they do not provide much flexibility to the overall protein structure, can serve as short linear motifs (SLiMs) (Das et al., 2012; Davey et al., 2012). Unlike short disordered regions, long disordered regions allow the protein to adopt many different conformations, enabling the protein to interact with multiple partners (Van Der Lee et al., 2014).
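The short/long categorisation above can be sketched in code, assuming that a per-residue disorder score between 0 and 1 is available from some predictor. The function name `disordered_regions` and the default cut-off of 0.5 are illustrative assumptions; only the 30-residue boundary between short and long regions is taken from the text.

```python
def disordered_regions(scores, cutoff=0.5, long_threshold=30):
    """Group consecutive residues scoring above `cutoff` into regions.

    Returns a list of (start, end, kind) tuples with 1-based, inclusive
    residue positions. A region is "short" if it spans at most
    `long_threshold` residues and "long" otherwise.
    """
    regions = []
    start = None
    for i, score in enumerate(scores):
        if score > cutoff:
            if start is None:
                start = i  # a new disordered stretch begins here
        elif start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        # The sequence ends inside a disordered stretch.
        regions.append((start, len(scores) - 1))

    return [
        (s + 1, e + 1, "long" if (e - s + 1) > long_threshold else "short")
        for s, e in regions
    ]


# Hypothetical score profile: a 10-residue and a 31-residue disordered stretch.
scores = [0.1] * 5 + [0.9] * 10 + [0.2] * 5 + [0.8] * 31
print(disordered_regions(scores))
```

Keeping the cut-off and the length threshold as parameters makes it straightforward to repeat an analysis under a different definition of disorder.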

Disordered regions were shown to be more common in complex living organisms, such as eukaryotes, than in bacteria. In a study by Dunker et al. (2001) covering whole-genome protein sequences across 34 species, the estimated percentages of protein sequences possessing long disordered regions (≥ 50 residues) were 2% to 21% in bacterial sequences, 4% to 24% in archaeal sequences, and 25% to 41% in eukaryotic sequences. In broad agreement with the study of Dunker et al., Ward et al. (2004) showed that long disordered regions were found in 2.0% of archaeal, 4.2% of eubacterial and 33.0% of eukaryotic protein sequences.

Iakoucheva et al. (2002) showed that proteins containing disordered regions are involved in many essential biological activities such as cell cycle regulation, transcription, and signal transduction. Additionally, disordered proteins, regarded as flexible proteins by Midic et al. (2009), were shown to be more central in the protein interactome.

As suggested later by Dunker et al. (2005), disordered regions may contribute to the protein network by: 1) serving as the structural basis for hub proteins; 2) binding to structured hub proteins; 3) providing flexible linkers between protein domains.

The structural flexibility provided by disordered regions allows proteins to interact with multiple partners of different shapes and forms. Disordered regions also support post-translational modifications, thus enabling conformational changes required by proteins to switch from an inactive (non-functional) to an active (functional) state and vice versa (Dunker et al., 2005). More findings on the association between disordered regions and diseases are discussed in Chapter 2.

1.4.2 Tools for determining and predicting disordered regions

Despite having no fixed three-dimensional structures, disordered regions can be experimentally detected as missing residues when using X-ray crystallography. Moreover, several programs have been developed to predict disordered regions from a given protein amino acid sequence. The following are among the most widely used disordered region prediction programs.

IUPred. Developed by Dosztányi et al. (2005a), IUPred predicts which regions of a protein are disordered by using the concept of the pairwise energy of a protein in its native conformation. Pairwise energy is calculated by summing the interaction energies of local amino acid pairs along the given protein sequence. Disordered regions typically have a higher, and therefore less favourable, pairwise energy than ordered regions. After the pairwise energy profile along the sequence has been calculated, the energy value for each residue is converted into a probabilistic score ranging from 0 (indicating a completely ordered residue) to 1 (indicating a completely disordered residue). A residue with a score above 0.5 is regarded as disordered. An example output is shown in Figure 1.3. However, users can download the raw prediction scores and manually identify disordered regions using their desired cut-off. IUPred was shown to outperform other methods and to achieve both high sensitivity and high specificity in the prediction: 76.33% True Positive Rate (TPR) and 5.33% False Positive Rate (FPR) (Dosztányi et al., 2005b). In 2018, the updated version IUPred2A was released along with new benchmark datasets (Mészáros et al., 2018). The authors reported a slight improvement of the program for disordered region prediction.
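The manual cut-off step mentioned above can be sketched in a few lines of Python. This is a minimal illustration and not IUPred's own code: it assumes the per-residue scores have already been downloaded and parsed into a list of floats, and the function names `disordered_regions` and `region_type` are hypothetical. The default cut-off of 0.5 and the 30-residue boundary between short and long regions are the values quoted in this chapter.

```python
from typing import List, Tuple


def disordered_regions(scores: List[float], cutoff: float = 0.5) -> List[Tuple[int, int]]:
    """Return 1-based, inclusive (start, end) runs of residues scoring above the cut-off."""
    regions = []
    start = None  # start position of the run currently being extended, if any
    for position, score in enumerate(scores, start=1):
        if score > cutoff:
            if start is None:
                start = position
        elif start is not None:
            regions.append((start, position - 1))
            start = None
    if start is not None:  # a run that extends to the C-terminus
        regions.append((start, len(scores)))
    return regions


def region_type(start: int, end: int, long_threshold: int = 30) -> str:
    """Label a region as 'short' (<= 30 residues) or 'long' (> 30 residues)."""
    length = end - start + 1
    return "long" if length > long_threshold else "short"


# Hypothetical scores for a six-residue toy sequence.
example_scores = [0.1, 0.6, 0.7, 0.5, 0.9, 0.2]
example_regions = disordered_regions(example_scores)
```

Raising or lowering `cutoff` trades sensitivity against specificity, which is why access to the raw scores is useful.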

Figure 1.3: Example result from disordered-region prediction by IUPred. The 1,134-residue query sequence is vinculin (UniProt ID: P18206). The red line indicates the probability score for each residue along the entire sequence. The cut-off threshold of 0.5 is presented as a horizontal line. By default, any residue with a value above this line is treated as disordered.

DISOPRED3. Unlike IUPred, DISOPRED3 (Jones and Cozzetto, 2015) predicts disordered regions by training on missing coordinates in high-resolution structures of homologous sequences. The regions of protein sequences not covered by PDB coordinates are assumed to be disordered, although this may not always be true as the missing coordinates could arise from an artifact of the crystallisation process. To achieve the prediction, DISOPRED3 incorporates data from unique high-quality X-ray structures (sequence identity < 90% and resolution ≤ 2.2 Å) derived from the PISCES database (Wang and Dunbrack, 2003), and NMR structures from the DisProt v5.0 database (Sickmeier et al., 2007) as a training set. A support vector machine (SVM) was applied. SVM is a machine learning algorithm for two-group classification problems (Cortes and Vapnik, 1995). Given data points which belong to one of two classes (e.g. disordered or ordered), SVM constructs a hyperplane (or a set of hyperplanes) in a high-dimensional space, which can then be used for disordered region classification.

Whilst the sensitivity (also known as true positive rate) of DISOPRED3 varies between 40% and 60%, depending on the dataset used, the specificity was shown to be very high (99%, or 1% false positive rate). DISOPRED3 is part of the PSIPRED Protein Sequence Analysis Workbench (Buchan et al., 2013) developed by David Jones’s research group at University College London. Further information such as secondary structure prediction and disordered binding site prediction is also reported when the web-server version is employed. An example output is shown in Figure 1.4.
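Sensitivity and specificity are quoted for most of the predictors reviewed in this chapter, so it may help to make the definitions concrete. The sketch below is illustrative only: the function names are hypothetical, and it assumes per-residue predictions and reference annotations are available as equal-length sequences of booleans in which True means "disordered".

```python
from typing import Sequence, Tuple


def confusion_counts(predicted: Sequence[bool], observed: Sequence[bool]) -> Tuple[int, int, int, int]:
    """Count (TP, FP, TN, FN), treating True as the positive (disordered) class."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    tp = fp = tn = fn = 0
    for p, o in zip(predicted, observed):
        if p and o:
            tp += 1
        elif p and not o:
            fp += 1
        elif not p and not o:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: fraction of truly disordered residues that were predicted as such."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negative rate; the false positive rate is 1 - specificity."""
    return tn / (tn + fp)


# Toy example with five residues.
pred = [True, True, False, False, False]
obs = [True, False, True, False, False]
tp, fp, tn, fn = confusion_counts(pred, obs)
```

A predictor with 99% specificity, as reported for DISOPRED3, mislabels about one ordered residue in a hundred, whereas a sensitivity of 40% to 60% means that roughly half of the truly disordered residues are missed.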

Figure 1.4: Example disordered-region prediction result by DISOPRED3. The query sequence is vinculin (UniProt ID: P18206). Extra information such as secondary structure is reported (i.e. helices marked in pink, sheets in yellow).

1.4.3 Protein structure modelling

Experimental structures

Experimental structures, or crystal structures, are structures obtained through laboratory measurement of a purified protein sample. The data are typically obtained by X-ray crystallography, nuclear magnetic resonance spectroscopy of proteins (protein NMR), or cryo-electron microscopy, with X-ray crystallography being the most widely used of the three.

Atomic coordinate data of biological molecules such as proteins and nucleic acids have been experimentally obtained by different research groups and deposited in the Protein Data Bank (PDB) database (Berman, 2000). As of December 2017, approximately 140,000 structures had been deposited in the PDB, of which 120,000 were X-ray structures. The resolution of X-ray structures is given in Angstroms (Å): a lower number indicates a higher resolution. According to the PDB database, X-ray structures deposited in the database have a median resolution of 2.0 Å. Whilst there is no standard to indicate a good X-ray structure, a structure with a resolution of 2.0 Å or better is generally treated as a high-quality structure, whereas one with a resolution of over 4 Å is regarded as a poor-quality structure.

Protein structure prediction

Due to challenges in obtaining high-quality experimental structures, such as the difficulty in preparing protein crystals, image quality control, and the cost of experiments (Huda and Brad Abrahams, 2015), comparatively few protein structures have been solved to date. Accordingly, protein structure prediction has become an approach to address the shortage of experimental structures. Commonly used methods to predict protein structures are homology modelling and ab initio modelling. In Chapter 4, an analysis of whether SAVs could disrupt structure in homology models is presented. Thus, the approach for homology modelling will be reviewed in this section.

Homology modelling. Also known as comparative protein modelling, homology modelling is a widely used method to predict a structure using one or more known experimental structures that share similar protein sequences. Protein structures are generally conserved among homologues although the sequence similarity may vary. However, the structures of two proteins are likely to be extensively different when the sequence identity is less than 20% (Chothia and Lesk, 1986). One later study found that approximately 90% of the sequence alignment pairs that have over 30% sequence identity are homologous, but only 10% of the alignment pairs are homologous when the sequence identity is below 30% (Rost, 1999). The range of sequence identity around 20% to 30% is regarded as a twilight zone because the chance of two proteins sharing similar structures starts to drop rapidly when the sequence identity reaches this range, making it very difficult to model a protein structure accurately. The outline for building a homology model is shown in Figure 1.5.
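As a small illustration of the sequence-identity thresholds discussed above, the following sketch computes percentage identity from a pairwise alignment and reports where it falls relative to the twilight zone. It rests on stated assumptions: the two sequences are already aligned and of equal length, gaps are written as "-", and identity is taken over the columns in which neither sequence has a gap (other denominators are also in use). The function names are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over alignment columns without gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def identity_zone(identity: float) -> str:
    """Place a percentage identity relative to the 20% to 30% twilight zone."""
    if identity < 20.0:
        return "below twilight zone"
    if identity <= 30.0:
        return "twilight zone"
    return "above twilight zone"


# Toy alignment: three of the four aligned residues are identical.
toy_identity = percent_identity("ACDE", "ACDF")
```

A template falling in or below the twilight zone would normally call for profile-based methods rather than simple pairwise identity when judging homology.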

Figure 1.5: Outline for homology modelling

Homology models can be constructed as follows.

1. Identify the homologous sequences, typically using a multiple sequence alignment algorithm, and select those whose structures are available and can be used as a template for modelling.

2. Align the query sequence onto the sequence of the template. This can be performed with the aid of sequence alignment tools such as BLAST (Altschul et al., 1990), PSI-BLAST (Altschul et al., 1997), or more recently HMMER (Finn et al., 2011) and HHblits (Remmert et al., 2012).

3. Construct the backbone of the query sequence based on the atomic coordinates obtained from the template.

4. Modify the backbone by adding or removing residues at the positions where insertions or deletions occur.

5. Add side-chains to the backbone. Two widely used programs for side-chain repacking are R3 (Xie and Sahinidis, 2006) and SCWRL4 (Krivov et al., 2009).

6. Refine the stereochemistry and evaluate the quality of the model. Some of the tools that can perform structural validation are PROCHECK (Laskowski et al., 1993), WHAT_CHECK (Hooft et al., 1996), and MolProbity (Chen et al., 2010).

Several tools using comparative protein modelling methods are available, such as FoldX (Schymkowitz et al., 2005), MODELLER (Webb and Sali, 2016), Phyre2 (Kelley and Sternberg, 2009; Kelley et al., 2015), SWISS-MODEL (Guex and Peitsch, 1997; Biasini et al., 2014), and YASARA (Krieger et al., 2002). Phyre2 was used in the study of homology models in this thesis and is reviewed below.

Phyre2. Developed by Kelley and Sternberg (2009) and Kelley et al. (2015) from the Structural Bioinformatics Group at Imperial College London, Phyre2 (Protein Homology/AnalogY Recognition Engine; pronounced as ‘fire’) uses advanced remote homology detection methods to construct 3D models starting from the amino acid sequence of the query protein (the protein to be modelled). It incorporates a recently developed algorithm, HHblits (Remmert et al., 2012), for searching homologous sequences, and PSIPRED (McGuffin et al., 2000) for predicting secondary structures. The homologous sequences, together with secondary structure information, are then used by HHsearch (Söding, 2005) to search for homologous structures, which make up the backbones of the predicted structures. The gaps between backbone fragments are filled by the ab initio loop-modelling algorithm Poing (Jefferys et al., 2010). Finally, the side chains are reintroduced onto the structures using R3 (Xie and Sahinidis, 2006).

Figure 1.6: Example results from Phyre2 normal mode. Shown here are the first 5 results from 120 alignments for alpha-galactosidase A (GLA, UniProt ID: P06280). In addition to predicting models for the top 20 hits, Phyre2 also reports structural coverage, sequence identity, confidence (probability that the used template and the query sequence are homologous) and template information for each alignment.

Phyre2 is available in normal mode and intensive mode. Whilst the main algorithm for constructing models is the same in both modes, the normal mode uses only one PDB template per predicted structure and may therefore cover only a part of the given query sequence. The intensive mode aims to maximise the coverage of the sequence: it incorporates multiple templates and ab initio techniques to model the entire protein, and consequently takes longer to process. One key feature of Phyre2 is that a large number of possible homologous sequence alignments are reported along with their template information. An example result page from Phyre2 using normal mode on the alpha-galactosidase A (GLA, UniProt ID: P06280) sequence is presented in Figure 1.6. Homology models are predicted for the top 20 results, which are ranked according to confidence scores, i.e. the probability that the particular template and the query sequence are homologous. The scores are calculated by taking into account sequence and secondary structure similarity, and the number of insertions and deletions.

Phyre2 is also linked to two other in-house programs: 3DLigandSite (Wass et al., 2010), a tool for predicting ligand binding sites, and SuSPect (Yates et al., 2014), for analysing the effect of SAVs on protein sequences. Because it is an in-house Imperial College London web server for homology modelling and supports batch processing, Phyre2 was chosen as the tool for predicting homology models in the study reported in Chapter 4.

1.5 Prediction of the effect of SAVs on protein structure

Several tools have been made available to examine the effect of SAVs on the structure and function of proteins. Here the widely-used ones are described.

Many prediction programs take into account whether a SAV is in a conserved region, i.e. a region where the amino acids are similar (show less variation) or identical among homologous proteins. Conserved regions can be determined from a multiple sequence alignment of closely related sequences. Conserved regions are often crucial to the function of proteins, and it has been shown that genetic variants occurring in such regions tend to have deleterious effects. Although conservation-based prediction tools are widely used, they give little insight into the mechanism by which a variant has its impact at the structural level.

SIFT (Sorting Intolerance From Tolerance) utilises the concept of sequence conservation in the prediction of disease variants. Prediction results are given as scores ranging from 0 (deleterious) to 1 (tolerated): 0.0 to 0.05 means the variant is deleterious; 0.05 to 1.0 means the variant is tolerated (neutral) (Ng and Henikoff, 2003). SIFT, however, considers mainly sequence conservation, which is a critical disadvantage when it is employed for predicting the effects of variants that occur in less conserved regions. Although SIFT has been available for over a decade, a recent study by Mort et al. (2010) showed that SIFT performs poorly in identifying disease-associated variants when they are located in disordered regions. Despite this, being regarded as a pioneer in disease variant prediction, SIFT is often used as a benchmark in comparison with novel prediction methods. SIFT was reported to have a sensitivity of 70-85% and a specificity of 70-81% when tested on the human variation (HumVar) dataset (Sim et al., 2012).

A more recent method, FATHMM (Functional Analysis Through Hidden Markov Model), employs a Hidden Markov Model algorithm to represent the alignment of homologous sequences and conserved domains. It also examines protein domain families, which can capture evolutionary constraints when aligning homologous sequences. Unlike SIFT, FATHMM scores do not have a minimum or maximum value. A negative score indicates that the SAV is less favourable, and is less likely to be observed than the wild-type residue, whereas a positive score indicates that the SAV is more favourable and is more likely to be observed than the wild-type residue. The specificity and sensitivity were shown to be maximised when thresholds of 3.0 and 1.5, respectively, are used. Whilst users are allowed to set their own cut-off values, -0.75 is the threshold recommended by the developers to yield a balanced 86% sensitivity and 86% specificity (on the VariBench benchmarking dataset) (Shihab et al., 2013).

Another widely used method, PolyPhen-2 (Polymorphism Phenotyping v2), incorporates structural features, such as the change in accessible surface area and propensity for buried residues, in addition to sequence features, such as homology and sequence conservation. This method also compares the properties of a wild-type sequence to the properties of its corresponding mutant sequence in order to determine the effect of the variant. PolyPhen-2 scores are in the same range as SIFT’s (0.0 to 1.0), but their interpretation is different. A variant with a PolyPhen-2 score from 0.0 to 0.15 is predicted benign; 0.15 to 0.85, possibly damaging (less confident prediction); 0.85 to 1.0, probably damaging (more confident prediction). PolyPhen-2 achieves a sensitivity of 73-92% on different test datasets while maintaining a specificity of 80% (Adzhubei et al., 2010, 2013).
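Because SIFT and PolyPhen-2 share the 0 to 1 score range but run in opposite directions, the two interpretations are easy to confuse. The sketch below encodes the cut-offs quoted in the text as two small functions. The function names are hypothetical, and the treatment of scores lying exactly on a boundary (0.05, 0.15, 0.85) is an assumption made for this illustration, since the ranges quoted above do not specify it.

```python
def sift_label(score: float) -> str:
    """Interpret a SIFT score: low scores indicate a deleterious variant."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("SIFT scores lie between 0 and 1")
    # Assumption: a score of exactly 0.05 is counted as deleterious.
    return "deleterious" if score <= 0.05 else "tolerated"


def polyphen2_label(score: float) -> str:
    """Interpret a PolyPhen-2 score: high scores indicate a damaging variant."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("PolyPhen-2 scores lie between 0 and 1")
    # Assumption: boundary values fall into the lower (milder) category.
    if score <= 0.15:
        return "benign"
    if score <= 0.85:
        return "possibly damaging"
    return "probably damaging"


# The same numerical score leads to opposite conclusions in the two tools.
same_score = 0.03
labels = (sift_label(same_score), polyphen2_label(same_score))
```

Here a score of 0.03 is read as deleterious by the SIFT convention but as benign by the PolyPhen-2 convention, which is why scores from the two tools should never be compared directly.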

PROVEAN (Protein Variation Effect Analyzer) is another method that uses scores from sequence alignment to determine disease-causing variants. Furthermore, this method is not limited to analysing only single amino acid variants. It is also capable of analysing in-frame insertions/deletions (indels), and multiple amino acid substitutions, making it a versatile tool for variant assessment. PROVEAN prediction scores range from negative to positive. A lower score corresponds to a more confident prediction that the variant is deleterious. Its recommended (optimal) threshold is -2.5 for 80% sensitivity and 79% specificity. The sensitivity can reach 90% when the threshold is set to -4.1. However, at this threshold, the specificity is drastically reduced to 58%. When the threshold is set to -1.3, it can yield 91% specificity but 62% sensitivity (Choi et al., 2012).

A comparative evaluation of disease variant prediction methods performed by Hassan et al. (2018) shows that FATHMM achieves the highest performance among the eight methods evaluated: 1) FATHMM, 2) SIFT, 3) PROVEAN, 4) iFish (Wang and Wei, 2016), 5) Mutation Assessor (Reva et al., 2007, 2011), 6) PANTHER (Mi et al., 2016), 7) SNAP2 (Zhang et al., 2014), and 8) PON-P2 (Niroula et al., 2015). It achieves 84% and 78% specificity, and 94% and 75% sensitivity, on the whole and random sample datasets, respectively. The tested variants were extracted from the VariBench database. However, another recent study by Mahmood et al. (2017) shows that the performance of FATHMM (measured using Area Under Curve, AUC = 0.79) is poorer than that of SIFT (AUC = 0.85) and PolyPhen-2 (AUC = 0.83) when using variants from the ClinVar database.

Later methods investigate many other features in addition to sequence properties.

SuSPect (Disease-Susceptibility-based SAV Phenotype Prediction) is another disease SAV prediction tool that examines many different protein features, including sequence conservation and predicted residue solvent accessibility. Additional features are derived from the network properties of the affected protein in a protein-protein interaction network. SuSPect utilises a support vector machine (SVM), a state-of-the-art machine-learning algorithm, to help distinguish disease variants from neutral variants. During the development of SuSPect, the authors also considered some observed structural features such as torsion angles (phi/psi angles), solvent accessibility, and surface pockets. However, structural features did not contribute to a significant improvement in the performance of the program. By contrast, the accuracy of the prediction was reduced when network properties were not included. SuSPect scores range from 0 (neutral) to 100 (disease-associated), with 50 being the recommended cut-off for SAV classification (with 75% sensitivity). In the test set, the majority of disease-causing SAVs were found to have scores above 70, and the scores of most neutral SAVs were below 30 (Yates et al., 2014).

Some methods focus on the structural stability change resulting from a SAV. For example, PopMusic (Prediction of Protein Mutant Stability Changes) calculates the change in thermodynamic stability of protein structures, as represented by energy functions (∆∆G), and uses machine learning to yield a final prediction (Dehouck et al., 2009). ∆∆G for a SAV is calculated by subtracting ∆G (Gibbs free energy) of the wild-type structure from ∆G of the mutant structure, with ∆G defined as the change in energy between the unfolded and folded state. PopMusic was trained on a specially curated set of variants extracted from the ProTherm database (Bava, 2004) to consider only SAVs in globular proteins for which experimental structures (X-ray or NMR) are available. Another similar prediction tool from the same research group is HoTMuSiC, released later in 2016, which similarly takes into account the concept of thermal stability, but with a different approach, calculating the change in melting temperature (∆Tm) upon point mutations (Pucci et al., 2016).
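The ∆∆G calculation described above can be written out explicitly. The expressions below are a sketch under stated assumptions: ∆G is taken as the folding free energy (folded minus unfolded), and ∆∆G as mutant minus wild type. Sign conventions differ between tools and publications, so the direction should always be checked against the documentation of the method being used.

```latex
\Delta G = G_{\text{folded}} - G_{\text{unfolded}}, \qquad
\Delta\Delta G = \Delta G_{\text{mutant}} - \Delta G_{\text{wild-type}}
```

Under this convention a more negative ∆G corresponds to a more stable fold, so a positive ∆∆G indicates that the variant is predicted to destabilise the protein and a negative ∆∆G that it is predicted to stabilise it.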

STRUM (Structure-based stability change prediction upon single-point mutation) (Quan et al., 2016), developed by the Zhang Laboratory at the University of Michigan, and INPS3D (Impact of Non-synonymous mutations on Protein Stability) (Savojardo et al., 2016), from the Bologna Biocomputing Group, utilise a similar concept of change of free energy ∆∆G upon introducing a SAV to determine the pathogenicity of the given variant. Another similar tool that uses the concept of structural stability is SDM (Site Directed Mutator) (Pandurangan et al., 2017). SDM allows users to upload their PDB files and provide a list of mutations to be analysed.

Although several structure-based methods rely on the calculation of an energy function ∆∆G, they differ in the parameters used in the calculation, the training sets, the machine learning algorithms, and the scoring schemes. A major disadvantage of these free energy calculation methods is that they are normally time-consuming and are neither available nor suitable for large-scale analyses. In addition, the developers showed that INPS3D did not perform well on variants on the protein surface. Using the concept of free energy change alone might not fully explain the nature of deleterious variants, and multiple pieces of information should be taken into consideration to best interpret the effect of SAVs.

Interpretation of the variant effect is problematic if different methods yield contradictory results. Condel (Consensus Deleteriousness Score of Missense Mutation) (González-Pérez and López-Bigas, 2011) overcomes this problem by generating an overall prediction score based on prediction results from five prediction tools: 1) LogR.E-value (Clifford et al., 2004), a sequence-based predictor which uses HMMER (Finn et al., 2011) together with data from Pfam (Finn et al., 2014) to predict whether a SAV is on a conserved protein domain; 2) MAPP (Multivariate Analysis of Protein Polymorphism) (Stone and Sidow, 2005), which takes into account physicochemical characteristics in each position of the protein, based on observed evolutionary variation; 3) Mutation Assessor, which evaluates the functional impact of a point mutation using evolutionary data obtained from multiple sequence alignment (MSA); 4) PolyPhen-2; and 5) SIFT. A new technique, developed by the authors, called “weighted average of the normalised scores of the individual methods” (WAS) was employed to yield a final prediction score for each variant, based on the scores from the five predictors. With this technique, the authors showed that Condel significantly outperformed each of the component methods.
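The general idea of a weighted average of normalised scores can be sketched as follows. This is not Condel's implementation: how Condel normalises each tool's score and derives its weights is not reproduced here, so the function below simply assumes that every score has already been normalised to the range 0 to 1 (with higher meaning more deleterious) and that a weight per method is supplied by the caller. The function name and the example numbers are hypothetical.

```python
from typing import Dict


def weighted_average_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine normalised per-method scores into one consensus score.

    scores  -- normalised score per method, 0 (neutral) to 1 (deleterious)
    weights -- non-negative weight per method (hypothetical values supplied by the caller)
    """
    total_weight = sum(weights[method] for method in scores)
    if total_weight <= 0:
        raise ValueError("the weights of the supplied methods must sum to a positive value")
    weighted_sum = sum(weights[method] * scores[method] for method in scores)
    return weighted_sum / total_weight


# Two methods disagree; the more heavily weighted one dominates the consensus.
example_scores = {"SIFT": 0.8, "PolyPhen-2": 0.4}
example_weights = {"SIFT": 1.0, "PolyPhen-2": 3.0}
consensus = weighted_average_score(example_scores, example_weights)
```

With equal weights the function reduces to a plain mean, so the weighting scheme is what allows a consensus method to favour the more reliable predictor for a given variant.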

Despite the abundance of programs for predicting deleterious SAVs, most of them only provide simple Boolean prediction outcomes (i.e. neutral or deleterious), which is insufficient for qualitative assessment of SAVs. To gain more information, a tool that can provide detailed insights into the mechanism by which SAVs affect protein structures is of paramount importance. Unfortunately, only a few such tools are available. Here, two methods are reported.

HOPE (Have yOur Protein Explained), developed by Venselaar et al. (2010) at the Centre for Molecular and Biomolecular Informatics at Radboud University, the Netherlands, incorporates several third-party programs and data from external servers. It uses BLAST for finding the best-matched sequence in the UniProt and PDB databases and uses YASARA (Yet Another Scientific Artificial Reality Application) (Krieger et al., 2002) for homology modelling. Upon modelling, further information is gathered from external resources such as UniProt for sequence annotations, InterPro (Hunter et al., 2009) for protein domain and family classifications, HSSP (Dodge et al., 1998) for conservation scores, and the DAS server (Prlić et al., 2007) for sequence-based predictions of membrane region, solvent accessibility, secondary structure, and phosphorylation sites. All collected information is then processed using a decision tree.

HOPE, however, can neither analyse user-supplied PDB coordinates nor handle batch processing. It only supports as input a UniProt ID or a protein sequence in FASTA format. Another major concern is that HOPE was tested on only 115 mutations and was evaluated by comparing the prediction results to those from SIFT, PolyPhen-2, and the manual analysis performed by the developers. Furthermore, although HOPE can generate homology models from protein sequences provided by users, there has been no benchmark of its prediction performance on homology models. Most importantly, HOPE, which was released in 2010, has not been well maintained and often fails to deliver predictions to users.

Developed by Al-Numair at Andrew C.R. Martin’s Bioinformatics Group at University College London, the web-based prediction program SAAPdap (Single Amino Acid Polymorphism data analysis pipeline) provides a detailed analysis of how a mutation affects protein structure (Al-Numair and Martin, 2013) (see Figure 1.7). There are 15 features considered by SAAPdap, and they can be classified into five main categories: 1) protein interface, 2) ligand binding site, 3) folding, 4) instability and 5) sequence conservation. As demonstrated by the authors, SAAPdap outperforms the conventional predictors SIFT and PolyPhen-2 when sequence conservation is taken into consideration together with structural analysis.

Figure 1.7: Example results from SAAPdap. 8 PDB structures representing the same protein myosin-7 (UniProt ID: P12883) were analysed for the effect of a mutation p.Ser242Glu (the example input suggested by SAAPdap). Different structural problems (denoted by ‘X’ marks) can be observed among different PDBs analysed. Features that could not be assessed are marked ‘?’. In all cases, the pathogenicity is flagged by sequence conservation alerts (shown in ‘Impact’ column).

Taking one UniProt ID as input together with a residue position and an amino acid change, SAAPdap examines all related experimental structures found in the Protein Data Bank (PDB). The analysis takes about three to five minutes per PDB structure, so the total time required depends on the number of PDB structures processed and can reach 24 hours for a single variant if the protein has hundreds of structures available. In addition, different PDB codes for the same UniProt ID can yield different results. SAAPdap neither allows users to upload their own PDB coordinates nor supports batch processing.

The effects of SAVs on protein structures will be further discussed in Chapter 3 (on experimental structures) and Chapter 4 (on predicted structures).

1.6 Structural features used in structure-based prediction methods

Steric clash

A steric clash, sometimes referred to as a clash, occurs when atoms in a protein structure are too close together. Steric clashes are prevalent in poorly modelled PDB structures (e.g. structures with low resolution), or predicted structures (Ramachandran et al., 2011), showing that those atoms are not in the correct positions. The unfavourable energetics from steric clashes would prevent the folding of the native conformation. Thus, if a SAV leads to a drastic increase in steric clashes, it is very likely that it will affect the protein stability or disrupt the native conformation and could potentially lead to loss of function and hence disease.

The criterion for harmful steric clashes is not agreed upon. Hurst et al. (2009) defined a clash as two atoms having a van der Waals overlap and a distance between the centres of the two atoms of less than 2.5 Å. Hurst et al. suggested that a mutation introducing three clashes is sufficient to be regarded as deleterious. These criteria were used in SAAPdap (Al-Numair and Martin, 2013). Word et al. (1999), from the MolProbity developer team, suggested that a structure with more than 30 clashes per 1,000 atoms should be considered a poor structure. They defined a clash as an overlap of van der Waals radii of 0.4 Å or more, regardless of the distance between the centres of the two atoms. The van der Waals radii used in their calculation are 1.17 Å for hydrogen atoms, 1.75 Å for carbon atoms, 1.55 Å for nitrogen atoms, and 1.4 Å for oxygen atoms. The values of van der Waals radii, however, vary slightly between research groups (Batsanov, 2001).
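The two clash definitions above can be compared directly in a short sketch. It uses the van der Waals radii quoted in the text and represents an atom as a hypothetical `(element, (x, y, z))` tuple with coordinates in Å. It is a simplification and not the code of SAAPdap or MolProbity: in particular, real implementations exclude covalently bonded atom pairs and treat hydrogen-bonded pairs specially, which this example does not attempt.

```python
import math
from typing import Tuple

# Van der Waals radii in Angstroms, as quoted in the text above.
VDW_RADII = {"H": 1.17, "C": 1.75, "N": 1.55, "O": 1.4}

Atom = Tuple[str, Tuple[float, float, float]]  # (element symbol, xyz coordinates)


def vdw_overlap(a: Atom, b: Atom) -> Tuple[float, float]:
    """Return (overlap, distance): the overlap of the two vdW spheres and the centre distance."""
    (element_a, xyz_a), (element_b, xyz_b) = a, b
    distance = math.dist(xyz_a, xyz_b)
    overlap = VDW_RADII[element_a] + VDW_RADII[element_b] - distance
    return overlap, distance


def is_clash_hurst(a: Atom, b: Atom) -> bool:
    """Clash as described for Hurst et al.: any vdW overlap and centres closer than 2.5 A."""
    overlap, distance = vdw_overlap(a, b)
    return overlap > 0.0 and distance < 2.5


def is_clash_word(a: Atom, b: Atom, min_overlap: float = 0.4) -> bool:
    """Clash as described for Word et al.: vdW overlap of at least 0.4 A, at any distance."""
    overlap, _ = vdw_overlap(a, b)
    return overlap >= min_overlap


carbon = ("C", (0.0, 0.0, 0.0))
near_oxygen = ("O", (2.4, 0.0, 0.0))   # overlap 0.75 A, distance 2.4 A
far_carbon = ("C", (3.0, 0.0, 0.0))    # overlap 0.5 A, distance 3.0 A
```

The second pair shows how the definitions diverge: two carbons 3.0 Å apart overlap by 0.5 Å and so count as a clash under the overlap-only criterion, but not under the criterion that also requires a centre distance below 2.5 Å.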

Figure 1.8: The backbone dihedral angles (phi, psi, and omega angles)

Backbone torsion angles

Backbone torsion angles (also called backbone dihedral angles) were studied by Ramachandran et al. (1963). Three types of backbone torsion angle are defined in a protein chain: phi (φ), psi (ψ), and omega (ω) (Figure 1.8). The omega angle is largely fixed to either 180° (indicating trans conformation) or 0° (indicating cis conformation). The cis conformation was found to be very rare for non-proline residues (Jabs et al., 1999). The other two backbone torsion angles, phi and psi, are flexible. However, the ranges for phi/psi angles vary for different amino acids.
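A torsion angle is defined by four consecutive atoms, and the standard calculation can be sketched in plain Python. For residue i, phi uses C(i-1), N(i), CA(i), C(i); psi uses N(i), CA(i), C(i), N(i+1); and omega uses CA(i), C(i), N(i+1), CA(i+1). The function names below are hypothetical, coordinates are assumed to be (x, y, z) tuples in Å, and the 90° boundary used to separate cis from trans is a simplifying assumption for this example.

```python
import math
from typing import Tuple

Vec = Tuple[float, float, float]


def _sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a: Vec, b: Vec) -> Vec:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0: Vec, p1: Vec, p2: Vec, p3: Vec) -> float:
    """Signed torsion angle in degrees (-180 to 180) about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    d0, d2 = _dot(b0, b1), _dot(b2, b1)
    v = (b0[0] - d0 * b1[0], b0[1] - d0 * b1[1], b0[2] - d0 * b1[2])
    w = (b2[0] - d2 * b1[0], b2[1] - d2 * b1[1], b2[2] - d2 * b1[2])
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


def omega_conformation(omega: float) -> str:
    """Label an omega angle as cis (near 0 degrees) or trans (near 180 degrees)."""
    return "cis" if abs(omega) < 90.0 else "trans"


# Idealised toy geometries, not real protein coordinates.
right_angle = dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0))
trans_angle = dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (-1.0, 0.0, 1.0))
cis_angle = dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 1.0))
```

Applying `dihedral` to the phi and psi atom quadruples of every residue yields the coordinate pairs that are plotted on a Ramachandran plot.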

The most widely used approach to characterise the range of naturally found phi/psi angles for each amino acid type is to measure the phi/psi angles of each residue in a large protein dataset and plot them on a simple Cartesian plane with phi on the horizontal axis and psi on the vertical axis. This type of plot is known as a Ramachandran plot. A study by Lovell et al. (2003) on 500 proteins generalised single amino acid Ramachandran plots to create plots for four categories of amino acid: general, glycine, proline, and pre-proline. From each plot the favoured regions, covering 98% of the residues, and the allowed regions, covering 99.95% of the residues, were determined (Figure 1.9).

Glycine and proline were shown to have unique Ramachandran plots. Having no side chain, glycine is the amino acid that has the broadest range of favoured phi/psi angles. Proline, in contrast, has the most restricted range of phi/psi angles because of its distinctive cyclic side-chain structure that locks the phi angle at approximately −60°. The differences in the Ramachandran plots of other amino acids are minor. Therefore, the plots can be combined into one general case.

Substituting an amino acid while keeping the protein backbone fixed forces the mutant residue to adopt the torsion angles of the wild-type residue. When the adopted phi/psi angles are not in the allowed region of the mutant residue, it could potentially result in the protein being incorrectly folded, leading to loss of function and disease. However, a study has found that 0.3% of the residues in high-quality crystal structures were observed to have their phi/psi angles in the disallowed regions (Gunasekaran et al., 1996). These residues are mostly found in loops. The disallowed phi/psi angles observed in crystal structures are generally compensated for by distortion in bond lengths and electrostatic interactions (e.g. hydrogen bonds).

Substitution of cis-conformation amino acids Cis-conformation peptide bonds are rarely observed in crystal structures. A study on PDB structures covering 32,539 peptide bonds, of which 31,005 are amide bonds and 1,534 are imide bonds (X-Pro), revealed that only 0.05% of the amide bonds were in cis conformation. In contrast, the percentage was much higher (6.5%) for imide bonds (X-Pro). Cis peptides usually play important roles in protein stability and function. They are usually found in the regions of high curvature, and the correlation is very high when considering the imide bond involving prolines. This suggests that cis-conformation proline could serve specific roles in maintaining the protein structure (Stewart et al., 1990). One example is a study of a cis-proline at position 30 in loop 1 of alpha-haemoglobin stabilising protein (AHSP). Substitution of Pro30 was shown to result in a reduced ability to convert alpha-haemoglobin (Gell et al., 2009).

Figure 1.9: Ramachandran plots from data by Lovell et al. (2003) (a) General case, (b) Glycine, (c) Proline, (d) Pre-proline. The inner regions (red lines) indicate favoured regions where 98% of the residues can be found. The outer boundaries (orange lines) indicate allowed regions which cover 99.95% of the residues in the plots. Images adapted from the original versions by Dcrjsr (a-c) and Vinchenzo8 (d) and used under Creative Commons licences.

Another study focusing on non-proline residues that are in cis-conformation showed that most of them exhibit side-chain/side-chain interactions. Most of these were found to be in the areas that are functionally important such as adjacent to the protein active site. Many genes containing non-proline cis peptide bonds were also found to be related to carbohydrate binding or processing (Jabs et al., 1999).

Secondary structure

The repetitive secondary structures are alpha helices and beta sheets. According to the study by Abrusan and Marsh (2016), alpha helices can tolerate point mutations better than beta strands. The authors suggested that mutations that cause changes in the repetitive secondary structure are more likely to be pathogenic than those that do not affect repetitive secondary structure. Although alpha helix was shown to be more robust than beta strand, proline has been found to be a potent alpha-helix breaker

(Jacob et al., 1999). In general, proline is unfavourable in secondary structure as it has many structural constraints. In addition, amphiphilic residues were also found to be a novel type of secondary structure breaker (Imai and Mitaku, 2005). Residues with amphiphilic side-chains containing both a polar group and a flexible hydrocarbon

(arginine, lysine, histidine, glutamic acid, and glutamine) were more likely to be found in disordered regions, and correlation was found between the clustering of these residues and the chance of breaking repetitive secondary structure.

Solvent accessibility

Water molecules can make physical contact with residues on the surface of a protein.

The commonly-used term accessible surface area (ASA) indicates the extent to which a residue is accessible by water molecules. ASA is generally given in square Angstroms (Å²). To measure the maximum accessible surface area (MaxASA) of each amino acid, the most widely-used approach is to construct a tripeptide molecule Gly-X-Gly, where X is any amino acid of interest, and measure the accessible surface area of the residue X. The relative accessible surface area (RSA) is calculated by RSA = ASA/MaxASA.

A residue with RSA below a cutoff in the range 5%-9% is commonly regarded as buried.
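The RSA calculation and the burial criterion can be sketched as follows. This is a minimal example: the function names are hypothetical, the default cutoff of 9% is simply the upper end of the range quoted above, and the ASA and MaxASA values must be supplied by the caller (for instance from one of the published MaxASA scales).

```python
def relative_accessibility(asa, max_asa):
    """RSA = ASA / MaxASA, with both areas in square Angstroms."""
    if max_asa <= 0:
        raise ValueError("max_asa must be positive")
    return asa / max_asa

def is_buried(asa, max_asa, cutoff=0.09):
    """True when RSA is below the cutoff (a fraction, so 0.09 means 9%)."""
    return relative_accessibility(asa, max_asa) < cutoff
```

For example, with illustrative numbers that are not taken from any published scale, a residue with an ASA of 10 Å² and a MaxASA of 200 Å² has an RSA of 5% and would be classified as buried under the 9% cutoff.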

During the past three decades, the MaxASA values have been studied by many research groups. The widely-accepted ones are from Rose et al. (1985), Miller et al. (1987), and Rost and Sander (1994).

Solvent accessibility was shown to give reliable prediction results when used in tools for predicting disease-associated variants (Saunders and Baker, 2002). In particular, researchers have been interested in evaluating whether SAVs are located in protein cores (i.e. buried). It has been demonstrated by David et al. (2012) that disease-associated SAVs tend to be either buried in protein cores or on the interface sites rather than randomly distributed on the protein. In addition, the authors showed that disease-associated SAVs are twice as likely to occur in protein cores as neutral SAVs.

Buried glycine Glycine is the smallest amino acid and is often found in loop regions because of its ability to adopt a wide range of phi/psi angles (Aurora et al., 1994). Having no side chain, it is also often found in protein cores with limited space.

Because of this, substitution of glycine in a protein core usually results in a steric clash, particularly when it is replaced with a much larger amino acid such as arginine, aspartic acid, and glutamine (David and Sternberg, 2015). In a large-scale analysis of more than 69,000 variants performed by Petukh et al. (2015), the substitution from glycine to arginine was found to be the most common among disease-associated SAVs.

Thus, any SAV replacing a buried glycine residue could be a potential candidate for a disease-associated variant.

Buried proline Proline is another amino acid that is often favourable in some specific locations. Unlike glycine, proline has a very restricted range of conformations and is not favourable in the protein core. However, its restricted conformation makes it useful for forming sharp turns in loop regions (Parker and Hefford, 1997). The leucine-to-proline substitution was found to be the second most common disease-associated SAV, after the glycine-to-arginine substitution (Petukh et al., 2015). Proline is a potent helix breaker because, after the start of the helix, a regular hydrogen bonding pattern cannot be generated (Williams and Deber, 1991; Jacob et al., 1999). Unlike surface residues, buried residues have major constraints regarding space and conformations. A variant replacing a buried residue must be able to fit in the limited space and adopt the wild-type phi/psi angles in order to maintain the original protein conformation.

Because proline has a restricted range of allowed phi/psi angles, introducing it into a protein core is highly unfavourable and can increase the risk of breaking alpha or beta secondary structures.

Solvent accessibility (buried/exposed) can be used in conjunction with other amino acid properties, such as hydrophobicity and electrostatic charge. The hydrophobic effect is a major factor that can influence the structure and the stability of the folded protein (Dill, 1990). Hydrophobic residues are likely to cluster in the protein core whereas hydrophilic residues are usually on the protein surface making physical contact with surrounding solvent (Munson et al., 1996). Substituting a buried hydrophobic residue with a hydrophilic one could destabilise the protein structure. Additionally, uncompensated charged residues are inherently incompatible with hydrophobic environments. However, when they are in the hydrophobic protein core, residues with opposite charges usually cluster, sometimes playing essential roles in biological energy transduction. Conserved hydrophobic residues are also usually found nearby in the protein interior (Torshin and Harrison, 2001; Isom et al., 2010). Therefore, substituting a buried charged residue with an uncharged residue or a residue with the opposite charge could destabilise the protein structure.

Stereochemical bonds

Hydrogen bonds, or H bonds, are commonly found in protein structures. They are created by electrostatic attraction between two polar groups of molecules, one being a donor of a hydrogen atom and the other being an acceptor. Hydrogen bonds can occur within a protein chain (intramolecular) or between two chains (intermolecular).

The strength of the hydrogen bond typically depends on the distance between the donor and the acceptor atoms. In general, those with a length of 2.2 Å - 2.5 Å are regarded as strong hydrogen bonds, while lengths of 2.5 Å - 3.2 Å are regarded as moderate bonds. Lastly, a length of 3.2 Å - 4.0 Å is regarded as a weak bond. Main-chain/main-chain hydrogen bonding is essential for establishing secondary structures (i.e. alpha helices and beta sheets). Additionally, side-chain/side-chain hydrogen bonds and side-chain/main-chain hydrogen bonds can provide extra stability to the tertiary structures.
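The distance-based categories above can be expressed as a small lookup. This is only a sketch with a hypothetical function name; because the quoted ranges share their end points, the example assumes that a distance of exactly 2.5 Å or 3.2 Å is assigned to the weaker category.

```python
def hbond_strength(distance):
    """Classify a donor-acceptor distance (in Angstroms).

    Returns "strong", "moderate", "weak", or None when the distance lies
    outside the 2.2-4.0 Angstrom range. Shared boundaries (2.5 and 3.2)
    are assigned to the weaker class, which is an assumption of this sketch.
    """
    if 2.2 <= distance < 2.5:
        return "strong"
    if 2.5 <= distance < 3.2:
        return "moderate"
    if 3.2 <= distance <= 4.0:
        return "weak"
    return None
```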

Although hydrogen bonds are generally weaker than other types of stereochemical bonds, a collective disruption of the bonds can severely affect protein stability, especially when those bonds are in the protein core (Hubbard and Kamran Haider, 2010).

Side-chain/side-chain hydrogen bonds created by buried residues have been shown to be highly conserved through evolution (Worth and Blundell, 2010). Thus, a SAV disrupting side-chain hydrogen bonds in the protein interior can potentially destabilise the protein structure. A previous study on p53 by Cuff et al. (2006) shows that 202 SAVs (referred to as distinct mutations by the authors) in this protein result in the disruption of side-chain hydrogen bonds. If the affected residue is exposed, the loss of side-chain hydrogen bonds could be compensated for by the formation of new hydrogen bonds between the affected residue and water molecules (Cuff et al., 2006).

A salt bridge is another type of bond that can occur between two ionised residues, through the formation of an electrostatic interaction (Bosshard et al., 2004). It requires one residue to be positively charged (i.e. arginine, histidine, or lysine) and another residue to be negatively charged (i.e. aspartic acid or glutamic acid). The bond length is typically ≤ 4 Å. Although this is a non-covalent bond, which is weaker than a covalent bond, disruption of salt bridges can result in reduced protein stability and unexpected loss or gain of protein-protein interactions. For example, the disruption of salt bridges in the p53 tetramerisation domain was shown to increase the propensity to form amyloid fibrils (Galea et al., 2005).
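The two conditions for a salt bridge described above (oppositely charged residue types and a separation of at most 4 Å) can be checked as in the sketch below. The function name and the use of three-letter residue codes are illustrative choices, and the distance between the charged groups is assumed to have been measured beforehand by the caller.

```python
POSITIVE = {"ARG", "HIS", "LYS"}   # positively charged residues
NEGATIVE = {"ASP", "GLU"}          # negatively charged residues

def is_salt_bridge(res_a, res_b, distance, cutoff=4.0):
    """True when the residues carry opposite charges and lie within the cutoff (Angstroms)."""
    a, b = res_a.upper(), res_b.upper()
    opposite = (a in POSITIVE and b in NEGATIVE) or (a in NEGATIVE and b in POSITIVE)
    return opposite and distance <= cutoff
```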

Disulphide bonds, also called cysteine bonds or SS bonds, are strong interactions between two sulphur atoms from a cysteine pair. A disulphide bond is a covalent bond with approximately 2.05 Å S-S length, about 0.5 Å longer than a C-C covalent bond.

As it is a strong bond, it has a great influence on protein structural stability, especially in globular proteins (Betz, 1993). The effect of disulphide bond breakage on protein structures was studied by Levin et al. (2013), who suggested that the disruption of disulphide bonds in the structure of the β3 subunit of the platelet integrin αIIbβ3 led to inherited blood clotting diseases. Another study on a disulphide-rich protein alpha-galactosidase A (GLA) (Figure 1.10) shows that substitutions of disulphide-bond-forming cysteine residues led to loss of protein functions (Qiu et al., 2015). In addition, some studies suggest that disulphide bond disruption is harmful to the regulation of secretory proteins (Gorr et al., 1999; Schmidt et al., 1999). Thus, disruption of these bonds can result in the protein being disease-associated.

Cavity

Small cavities can be detected in the cores of most protein structures (Rashin et al.,

1986). Internal cavities are energetically unfavourable because they reduce van der

Waals interactions in the hydrophobic core where residues are densely packed (Pakula and Sauer, 1989). Low resolution PDB structures are usually rich in cavities due to poor assignment of atomic coordinates.

One algorithm to identify cavities is KVFinder (Oliveira et al., 2014). Cavities can be measured by rolling a small ‘Probe In’ (also called void probe) of 0.1 Å radius and a large ‘Probe Out’ (or solvent probe) of 1.4 Å radius (the radius of a water molecule) around the protein surface. The space reachable by the Probe In but not the Probe Out is regarded as a cavity (Figure 1.11). An example cavity calculation of phenylalanine hydroxylase (PAH) is shown in Figure 1.12.

Figure 1.10: Disulphide bonds in alpha-galactosidase A (GLA). This protein is represented by PDB 3GH3, chain B. The model is at 1.9 Å resolution. Disulphide bonds were found in five cysteine pairs (residue number 52-94, 56-63, 142-172, 202-223, and 378-382). Orange spheres represent the sulphur atoms of each cysteine pair. The protein surface is shown in translucent blue to aid visualisation of the globular structure. Figure created with PyMOL (DeLano, 2002).

The sizes of the probes in practice are adjustable for different purposes. For example, to find ligand binding pockets, sizes of 1.4 Å (water molecule size) for Probe In and 2 Å - 6 Å for Probe Out are recommended. Different probe sizes can yield different results. If the Probe In radius is too small, numerous cavities can be detected. Conversely, if the radius is too large, very few cavities would be found. In addition, if the Probe Out size is set to 1.4 Å, the detected cavities will not be accessible by water molecules and are regarded as voids. In other cavity detecting tools such as McVol (Till and Ullmann, 2010) and ProteinVolume (Chen and Makhatadze, 2015), different methods are employed. McVol uses only one probe together with protein volume information to derive the protein cavities. ProteinVolume calculates the molecular surface of a protein and uses a flood-fill algorithm to calculate intramolecular void volumes.

Figure 1.11: Method for cavity volume calculation. (a) Small ‘Probe In’ (purple) is rolled over the protein surface (green). (b) Region accessible by Probe In is shown in light purple. (c) The process is repeated using a large ‘Probe Out’ (blue). (d) Region accessible by Probe Out is highlighted in light blue. (e) Cavity is defined as the difference between the regions accessible by Probe In and Probe Out. Figure adapted from Oliveira et al. (2014).

Replacing a large buried amino acid with a smaller one can result in an internal cavity or void. It has been found that void creation inevitably affects the stability of the protein, particularly its thermostability (Hubbard et al., 1994). A study by Cuff and Martin (2004) on cavity volume in a dataset of 8,925 protein structures, using a void probe of 0.5 Å and a solvent probe of 1.4 Å, showed that the largest cavity was smaller than 725 Å³ for 99% of proteins and smaller than 275 Å³ for 80% of proteins, suggesting that voids around these sizes are well tolerated. The authors also suggested that a substitution that can generate a void of > 275 Å³ is likely to destabilise the protein. Cavity volumes studied by Hubbard et al. (1994) on 121 high-quality X-ray and NMR structures, however, were much smaller than those found by Cuff and Martin: nearly all internal cavities analysed by Hubbard et al. were smaller than 100 Å³.

A study by Eriksson et al. (1992) on six cavity-creating SAVs on T4 lysozyme showed a broad scale of volume change ranging from 24 Å³ created by Leu46Ala to 150 Å³ by Leu99Ala. Consistent with Eriksson et al.’s study, a later study by Xu et al. (1998) on T4 lysozyme but with a larger set of 25 SAVs showed that many large-to-small substitutions led to internal cavities. Another study by Lee et al. (2000) on the same

T4 lysozyme protein, with a particular focus on the effect in the hydrophobic core site, confirmed that the destabilisation is due to the loss of van der Waals contacts in the hydrophobic core when a cavity was created. However, in some cases, the replacement of a bulky side chain, such as leucine or phenylalanine, with alanine in the hydrophobic core can result in a slight adjustment of the protein structure in order to maintain the structural stability.

In contrast to cavity-expanding SAVs, cavity-filling SAVs have been shown to slightly stabilise the hydrophobic core by lowering the free energy of the native conformation (Saito et al., 2000), although this positive effect on the thermostability is minimal (Eijsink et al., 1992). However, reducing the size of a surface cavity is more likely to prevent protein-protein interaction or ligand binding. Several studies have shown that mutations occurring in ligand binding sites can potentially disrupt protein activities, leading to diseases (Li et al., 2003; Herga et al., 2006; Argentaro et al., 2007).

Figure 1.12: Surface cavities and internal cavities detected in phenylalanine hydroxylase (PAH). The protein structure (PDB code: 1J8U, chain A) at 1.5 Å resolution is shown in green ribbon together with its surface in translucent green. Cavity volumes calculated by KVFinder are shown as light blue dots. The protein’s catalytic site can be seen as a large open pocket. PAH uses tetrahydrobiopterin (shown in orange) and a non-heme iron (shown in red) for catalysis to convert phenylalanine to tyrosine.

1.7 Statistical methods

The statistical procedures and tests used in this thesis are defined here.

1.7.1 P-value

The p-value is the probability of obtaining the observed results, or more extreme results, when the null hypothesis (H0) is true. The threshold for rejecting the null hypothesis based on a p-value is arbitrary, but the widely-accepted ones are 0.05 and 0.01.

1.7.2 Mann-Whitney Wilcoxon test

The Mann-Whitney Wilcoxon test, also called Mann-Whitney U test or Wilcoxon rank-sum test, is used for determining whether two samples are likely to derive from the same population (i.e. that the two populations have the same distribution shape)

(Mann and Whitney, 1947). Mann-Whitney Wilcoxon test is a non-parametric test of the null hypothesis (H0) that it is equally likely that a randomly selected value from one sample will be less or greater than a randomly selected value from a second sample.

This test does not require the assumption of a normal distribution.

The Mann-Whitney Wilcoxon test was used in the study presented in Chapter 2 to compare the numbers of interactions between the sets of pleiotropic proteins and non-pleiotropic proteins.
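To make the null hypothesis concrete, the sketch below computes the U statistic by direct pairwise comparison. It is an illustrative implementation with a hypothetical function name, not the code used in Chapter 2; in practice a library routine such as `scipy.stats.mannwhitneyu` would normally be used to obtain both the statistic and the p-value.

```python
def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a.

    Every pair (a, b) contributes 1 when a > b and 0.5 when a == b, so
    U / (len(sample_a) * len(sample_b)) estimates the probability that a
    randomly selected value from sample_a exceeds one from sample_b.
    """
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u
```

Under the null hypothesis, U is expected to be close to half the number of pairs; values near zero or near the total number of pairs indicate that one sample tends to contain larger values than the other.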

1.7.3 Odds ratio (OR) and ratio of odds ratios (ROR)

Odds ratios are used to compare the relative odds of the occurrence of the outcome of interest (e.g. disease), given exposure to the variable of interest (e.g. medical treatment) (Szumilas, 2010). In Chapter 2, odds ratios were used to compare the occurrence of SAVs (outcome) on protein sequences in the presence of disordered regions or in Pfam domains (exposure), with the occurrence of SAVs in the absence of exposure

(i.e. ordered regions or non-Pfam domains). The interpretation of odds ratios is as follows.

• OR = 1, Exposure does not affect odds of the outcome

• OR > 1, Exposure associated with higher odds of the outcome

• OR < 1, Exposure associated with lower odds of the outcome

The ratio of odds ratios (ROR) was used for comparing the differences between two odds ratios.

The p-value for the difference between two ORs (OR1 and OR2) was calculated from the z score defined as

\[ z = \frac{\delta}{SE(\delta)} \]

where δ is the difference between log(OR):

\[ \delta = \log(OR_1) - \log(OR_2) \]

and SE(δ) is the standard error:

\[ SE(\delta) = \sqrt{SE_1^2 + SE_2^2} \]
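The comparison of two odds ratios can be sketched as follows. Several points are assumptions of this example rather than details stated in the text: SE1 and SE2 are taken to be the standard errors of log(OR1) and log(OR2), each estimated from its 2x2 table with the usual formula sqrt(1/a + 1/b + 1/c + 1/d); the p-value is two-sided and taken from the standard normal distribution; and the function names are hypothetical.

```python
import math

def log_odds_ratio(a, b, c, d):
    """log(OR) and its standard error for a 2x2 table.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    All four counts must be non-zero.
    """
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, se

def compare_odds_ratios(table1, table2):
    """Return (ROR, z, two-sided p-value) for two 2x2 tables."""
    log_or1, se1 = log_odds_ratio(*table1)
    log_or2, se2 = log_odds_ratio(*table2)
    delta = log_or1 - log_or2
    z = delta / math.sqrt(se1 ** 2 + se2 ** 2)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return math.exp(delta), z, p_two_sided
```

Because δ is a difference of logarithms, exp(δ) equals OR1/OR2, which is the ratio of odds ratios.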

1.7.4 Hypergeometric test

The hypergeometric test is often used for calculating enrichment of Gene Ontology

(GO) terms within a class of genes (Rivals et al., 2007). In the study presented in Chapter 2, a similar use of the hypergeometric test was applied to calculate the enrichment of particular disease classes between pleiotropic and non-pleiotropic proteins.
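A one-sided enrichment p-value from the hypergeometric distribution can be sketched as below. The function and parameter names are hypothetical, and the mapping of the counts onto the proteins and disease classes analysed in Chapter 2 is not specified here; the example only shows the upper-tail probability P(X ≥ k) for sampling without replacement.

```python
from math import comb

def hypergeom_enrichment_p(population, annotated, drawn, observed):
    """P(X >= observed) when `drawn` items are sampled without replacement
    from `population` items, of which `annotated` carry the label of interest."""
    upper = min(annotated, drawn)
    total = comb(population, drawn)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(observed, upper + 1)
    )
    return tail / total
```

For instance, `hypergeom_enrichment_p(10, 5, 5, 5)` gives the probability that all five sampled items carry the label when half of a population of ten is labelled, which is 1/252.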

1.7.5 Z test of independent proportions

In Chapter 3 and Chapter 4, a Z test was used for comparing the proportion of SAVs (disease-associated vs. neutral) detected as causing structural impact under the null hypothesis that the proportion of disease SAVs causing structural impact is the same as the proportion of neutral SAVs:

\[ H_0 : p_1 = p_2 \]

against the alternative hypothesis that the proportion of disease SAVs causing structural impact is greater than the proportion of neutral SAVs:

\[ H_1 : p_1 > p_2 \]

The test statistic is given by

\[ Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \]

where \(\hat{p}_1\) and \(\hat{p}_2\) are the proportions of disease-associated SAVs and neutral SAVs causing structural problems, \(p_1 - p_2 = 0\), \(n_1\) and \(n_2\) are the total numbers of disease-associated SAVs and neutral SAVs respectively, and \(\hat{p}\) is the proportion of SAVs causing structural problems in the two samples combined.
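The test statistic can be sketched in a few lines. The function name and argument names are hypothetical; x1 and x2 denote the numbers of disease-associated and neutral SAVs flagged as causing structural problems, and the one-sided p-value P(Z ≥ z) is taken from the standard normal distribution, matching the alternative hypothesis p1 > p2.

```python
import math

def one_sided_two_proportion_z(x1, n1, x2, n2):
    """Z statistic and one-sided p-value for H1: p1 > p2 (pooled variance)."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)          # proportion in the combined sample
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se              # p1 - p2 = 0 under H0
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value
```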

1.7.6 Benjamini-Hochberg procedure

When multiple comparisons (or multiple tests) were used, the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995) was employed for controlling the false discovery rate (FDR).

The corrected p-value is given as below.

\[ P_{new} = P \times \frac{m}{i} \]

where m is the number of tests in the multiple comparisons, i is the rank of the original p-value (the smallest p-value has a rank of i = 1 and the largest p-value has a rank of i = m), and P is the original p-value.
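The correction can be sketched directly from the formula. The function name is hypothetical and the example applies only the scaling P × m / i; widely-used implementations (for example `p.adjust` with method "BH" in R) additionally cap the corrected values at 1 and enforce that they do not decrease with rank, which this sketch deliberately leaves out.

```python
def benjamini_hochberg(p_values):
    """Return P * m / i for every p-value, in the original input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest (rank 1) to largest (rank m).
    order = sorted(range(m), key=lambda idx: p_values[idx])
    adjusted = [0.0] * m
    for rank, idx in enumerate(order, start=1):
        adjusted[idx] = p_values[idx] * m / rank
    return adjusted
```

With the illustrative input [0.01, 0.04, 0.03], the ranks are 1, 3 and 2, giving corrected values of approximately 0.03, 0.04 and 0.045.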

1.7.7 True positive rate (TPR) and false positive rate (FPR)

In Chapter 3 and Chapter 4, a SAV found to cause at least one structural problem is regarded as a positive hit. A disease-associated SAV marked as causing at least one structural problem is regarded as a true positive (TP), or otherwise as a false negative (FN). A neutral SAV marked as causing at least one structural problem is regarded as a false positive (FP), or otherwise as a true negative (TN).

The performance of each structural feature was evaluated by considering the true positive rate (TPR) and the false positive rate (FPR), which are defined as follows.

\[ \mathrm{TPR} = \frac{TP}{FN + TP} = \frac{TP}{NP} \]

which is the number of disease-associated SAVs identified as causing structural problems divided by the number of all disease-associated SAVs (NP). This is also equivalent to the commonly-used term sensitivity.

\[ \mathrm{FPR} = \frac{FP}{FP + TN} = \frac{FP}{NN} \]

which is the number of neutral SAVs identified as causing structural problems divided by the number of all neutral SAVs (NN). This is also equivalent to 1 - specificity.
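The two rates can be computed from the four counts as in the sketch below, which also attaches a normal-approximation 95% interval of the form rate ± 1.96 × sqrt(rate × (1 − rate) / N), the form used for TPR and FPR in this section. The function names are hypothetical and the counts in the usage note are purely illustrative.

```python
import math

def rate_with_ci(hits, total, z=1.96):
    """Return (rate, lower, upper) using the normal approximation."""
    rate = hits / total
    half_width = z * math.sqrt(rate * (1 - rate) / total)
    return rate, rate - half_width, rate + half_width

def tpr_fpr(tp, fn, fp, tn):
    """TPR = TP / (TP + FN) and FPR = FP / (FP + TN), each with its interval."""
    return rate_with_ci(tp, tp + fn), rate_with_ci(fp, fp + tn)
```

For example, with TP = 50, FN = 50, FP = 10 and TN = 90, the TPR is 0.5 with an interval of roughly 0.402 to 0.598, and the FPR is 0.1.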

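95% confidence intervals (CI) were assigned to values for TPR and FPR as follows:

\[ \mathrm{TPR} \pm 1.96 \times \sqrt{\frac{\mathrm{TPR}(1 - \mathrm{TPR})}{NP}} \]

and

\[ \mathrm{FPR} \pm 1.96 \times \sqrt{\frac{\mathrm{FPR}(1 - \mathrm{FPR})}{NN}} \]

Chapter 2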

Pleiotropy in Human Disease

Despite several types of gene-trait relationships discovered from specific studies in mice, yeasts, and other organisms, the mechanism that gives pleiotropic properties to a protein is unclear, and in-depth molecular analysis has yet to be performed. This chapter presents a study to explain how SAVs in a human gene can give rise to several unrelated Mendelian diseases. Information from various human proteome and variant databases at the residue, protein, and system levels was analysed.

This work has been published in Human Mutation as a research article titled “Landscape of Pleiotropic Proteins Causing Human Disease: Structural and System Biology Insights” (Ittisoponpisan et al., 2017). The study was directed by Dr Alessia David and Professor Michael Sternberg. The main experiment, including data curation and code development in Python and R, was performed by me. Result interpretations and analyses were jointly performed with Dr Alessia David and Professor Michael Sternberg. Eman Alhuzimi performed an additional study on the enrichment of GO terms in this paper (not presented in this thesis).


2.1 Background

Pleiotropy was suggested to be a common phenomenon, particularly in complex organisms. The degree of pleiotropy, or the number of traits affected by one gene, was found to increase in relation to the complexity of organisms (Wagner et al., 2008; Wagner and Zhang, 2011). Pleiotropic genes have also been shown to be associated with complex diseases. Sivakumaran et al. (2011) reported that 16.9% of human genes and 4.6% of SNPs reported in the GWAS catalogue exhibited pleiotropic effects, thus suggesting that pleiotropy is a common property of genes and SNPs associated with complex diseases. This study was, however, performed on a small set of genes and SNPs (1,380 genes, 1,687 SNPs), and a study on a larger dataset has yet to be conducted in order to confirm their findings.

Genes with complex phenotypic profiles tend to be more central in protein-protein networks (Zou et al., 2008). Chavali et al. (2010) showed that pleiotropic genes in humans tend to be more central in protein networks than non-pleiotropic genes. In addition, genes that are related to phenotypically divergent diseases (referred to as ‘phenodiv’ genes by Chavali et al.) were also found to be more central in the network than genes related to phenotypically similar diseases (‘phenosim’ genes). These findings emphasise the importance of protein-protein interactions (PPIs) in pleiotropic genes.

Intrinsically disordered regions (IDRs), or disordered regions (regions that lack a defined or ordered three-dimensional structure) are often involved in protein-protein interaction. Moreover, the lack of a fixed structure confers flexibility to the protein, thus allowing the interaction with several different interactors. Intrinsically disordered proteins (IDPs) (proteins which are fully or partially disordered) are involved in a large number of biological activities (Iakoucheva et al., 2002), and are more central in the protein interactome (Midic et al., 2009). Since disordered regions are important for proteins to act as hubs (Dunker et al., 2005, 2008), variants occurring in these regions can result in deleterious outcomes. Uversky et al. (2009) found that proteins associated with diseases are usually enriched with disordered regions. Moreover, both Uversky et al. (2008) and Midic et al. (2009) reported the association between variants occurring in disordered regions and many complex diseases, such as cardiovascular and neurodegenerative diseases, diabetes, and cancer. Uversky et al. suggested that often these complex disorders are the result of proteins failing to interact with multiple partners.

The mechanism by which variants in disordered regions can cause disease was elucidated in studies by Vacic et al. (2012): disease-causing variants alter the structure of the proteins, resulting in original functions being removed or novel functions being introduced. They demonstrated that 22% of all disease-causing variants were in disordered regions. A single variant in a disordered region can cause disorder-to-order transition, which has many possible outcomes: it can generate a new globular region within a long disordered sequence, shorten the disordered region, or completely eliminate the whole disordered region. Such structural effects can potentially prevent proteins from undergoing conformational changes or from establishing protein-protein interaction, thus leading to loss of protein function and hence disease (Uversky et al., 2009; Vacic and Iakoucheva, 2012; Vacic et al., 2012). Notably, the disorder-to-order transition was responsible for 20% of disease-associated variants in disordered regions (Vacic and Iakoucheva, 2012). This number is almost double the percentage of disorder-to-order transitions caused by neutral variants in disordered regions (11.5%). The five most common amino acid changes, which account for 44% of disorder-to-order-causing variants, were arginine to tryptophan, arginine to cysteine, arginine to histidine, arginine to glutamine and glutamic acid to lysine. Disordered regions, however, have yet to be further studied to clarify their relation to pleiotropy.

The association between pleiotropy and essential genes, or genes that are required for embryonic survival, is not fully understood. A study by Dickerson et al. (2011) suggested that essential genes have more interactions and are associated with a larger range of diseases, compared to non-essential genes. Moreover, when human-mouse orthologues were analysed, the authors found that gene products of essential orthologues showed high connectivity in protein networks and were highly correlated to cancer.

The conclusion from this study was that essential genes are key candidates for causing multiple unrelated human diseases.

The study presented in this chapter covers the systematic analysis of pleiotropic proteins with regard to several features, such as essentiality, protein network, classes of diseases they are associated with and the distribution of SAVs in disordered regions and Pfam domains. The aim is to gain a better understanding of the relationship between pleiotropic genes and human disease.

2.2 Materials & methods

2.2.1 Definition of pleiotropic proteins

Pleiotropy is a broad concept which expresses the relationship between genes and traits

(see Section 1.3 in Chapter 1). Since the main focus of this study is the relationship between pleiotropic proteins and Mendelian disorders, pleiotropic proteins are defined as proteins that are associated with more than one disease. Moreover, since this relationship is explored at the structural level, only SAVs occurring in protein coding regions were analysed. Non-pleiotropic proteins are defined as proteins with a SAV that has been associated with only one genetic disease (Figure 2.1). The identifiers for human

Mendelian diseases (MIM IDs) were extracted from the Online Mendelian Inheritance in Man (OMIM) database (Hamosh et al., 2005). Proteins not reported to be associated with any genetic disease were excluded from this study.

2.2.2 Disease classification

The International Classification of Diseases 10th revision (ICD-10) (World Health Organization, 2004), which is a well-established hierarchical classification of diseases defined by the World Health Organisation (WHO), was used to further classify pleiotropic proteins according to the similarity of their associated diseases. The mapping from MIM IDs to ICD-10 codes was performed using reference data from Orphadata, which provides free-access drug and disease data (available at: http://www.orphadata.org/). In this classification system, each disease is assigned a 3-character code as its identifier, e.g. E10 for insulin-dependent, or type I, diabetes mellitus. There are 22 main classes of diseases according to the affected system, e.g. class IV for metabolic diseases and class VI for neurological diseases. The full list of disease classes is presented in Table 2.1. Although the majority of the classes are well-defined, heterogeneity may be present in a few classes (e.g. classes 3 and 18). To overcome this problem, the analysis was also performed at the subclass level. Each main class is further categorised into subclasses according to the similarity of the disease phenotypes. At the lowest hierarchical level, diseases with highly similar phenotypes were grouped together and assigned the same 3-character identifier. Diseases with the same ICD-10 code are usually regarded as the same disease but with different severity, condition, or affected organs, despite having different MIM identifiers. To distinguish those highly similar diseases sharing the same ICD identifier, further digits following the main 3-character identifier are assigned. For example, type I diabetes mellitus (class IV, code E10) is further labelled E10.0 (with coma) and E10.1 (with ketoacidosis). This detailed information regarding the severity of disease is, however, not fully reported in other databases such as UniProt Humsavar and Orphadata. Therefore, to make cross-referencing between different databases feasible, only the first 3 characters of the ICD-10 identifier codes were considered.

Figure 2.1: The general concept of pleiotropic and non-pleiotropic proteins. Differences in shapes and colours between the pleiotropic protein and the non-pleiotropic protein are for illustrative purposes.

Different MIM identifiers can map onto the same 3-letter ICD-10 code when they describe phenotypically similar diseases. Because of this, pleiotropic proteins were further separated into two sub-groups according to the ICD-10 classification of their associated diseases: 1) pleiotropic proteins with different phenotypes (pleiotropic DPh), consisting of proteins associated with at least two disorders affecting different physiological systems (i.e. their associated diseases are assigned different ICD-10 codes), and 2) pleiotropic proteins with similar phenotypes (pleiotropic SPh), consisting of proteins whose associated diseases belong to the same physiological system (i.e. share the same ICD-10 3-letter code). With the aim of emphasising the differences between pleiotropic proteins and non-pleiotropic proteins, the pleiotropic DPh set was chosen as the main set to be examined against the non-pleiotropic set. In addition, analyses of the full pleiotropic set versus the non-pleiotropic set are provided in the Appendix section.

Table 2.1: List of major classes in ICD-10

Class   Name
1       Certain infectious and parasitic diseases
2       Neoplasms
3       Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism
4       Endocrine, nutritional and metabolic diseases
5       Mental and behavioural disorders
6       Diseases of the nervous system
7       Diseases of the eye and adnexa
8       Diseases of the ear and mastoid process
9       Diseases of the circulatory system
10      Diseases of the respiratory system
11      Diseases of the digestive system
12      Diseases of the skin and subcutaneous tissue
13      Diseases of the musculoskeletal system and connective tissue
14      Diseases of the genitourinary system
15      Pregnancy, childbirth and the puerperium
16      Certain conditions originating in the perinatal period
17      Congenital malformations, deformations and chromosomal abnormalities
18      Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified
19      Injury, poisoning and certain other consequences of external causes
20      External causes of morbidity and mortality
21      Factors influencing health status and contact with health services
22      Codes for special purposes

Figure 2.2: Overview of the method. SAV data were obtained from Humsavar, and proteins were screened and separated into 2 sets, pleiotropic and non-pleiotropic, according to the number of associated diseases.

2.2.3 Dataset

The approach for the study of pleiotropy in human disease is shown in Figure 2.2.

68,607 SAVs (in 12,543 proteins) from UniProt (Humsavar, October 2014 release) were originally classified into 3 types: disease (24,070 SAVs), polymorphism (37,947 SAVs), and unclassified (6,590 SAVs). Polymorphism, according to Humsavar, denotes variants that are not reported to be implicated in diseases; in other words, these variants are regarded as neutral. Variants classified as disease are those reported to be disease-causing.

The diseases reported in Humsavar are Mendelian disorders, which are generally rare.

As polymorphisms can also be referred to as SNPs, Humsavar’s polymorphisms are referred to as ‘neutral SAVs’ in this thesis to avoid ambiguity.

From the Humsavar dataset, only disease and neutral SAVs were considered. Unclassified SAVs were excluded from the analysis due to the lack of supporting evidence to indicate whether they are disease-associated or neutral. The proteins Titin and p53 were removed from the dataset because their exceptional length (Titin is 34,350 amino acids long) and number of SAVs (p53 has more than 700 missense SAVs) could introduce a bias in the results, particularly in the statistical analysis. The remaining proteins were classified into pleiotropic (513 proteins) and non-pleiotropic (1,610 proteins) according to the method described previously in section 2.2.1.

The pleiotropic protein set was further screened according to the criteria in section 2.2.2. A pleiotropic protein was required to have at least two MIM IDs that could be mapped to ICD-10 codes. However, since not all MIM IDs can be automatically mapped onto ICD-10 codes, some pleiotropic proteins showed only one MIM-to-ICD relation. Those proteins were subsequently excluded from the dataset. The final pleiotropic dataset consisted of a total of 412 proteins: 257 were pleiotropic DPh and 155 were pleiotropic SPh.

It should be considered that a protein can be associated with other diseases not reported in Humsavar through other types of mutation, such as nonsense mutations, splice-site mutations, and insertions or deletions of nucleotides. Since Humsavar only reports missense SAVs, the non-pleiotropic set was therefore further screened by checking against the OMIM database (Morbidmap file). Any protein that was found to be associated with any other disease not reported in UniProt was removed. The removed non-pleiotropic proteins were, however, not reintroduced into the pleiotropic set because they were associated with only one disease through SAVs. Consequently, the final non-pleiotropic dataset comprised 1,345 proteins.
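The screening steps of Sections 2.2.2 and 2.2.3 can be illustrated with a minimal sketch. It assumes that a protein counts as pleiotropic when it is linked to at least two distinct MIM IDs (the definition given in Section 2.2.1 is not reproduced here), and it compares ICD-10 identifiers on their first three characters only. The function name, the protein and MIM labels, and the toy mapping are hypothetical and serve only as an illustration, not as the pipeline actually used.

```python
from collections import defaultdict


def classify_proteins(protein_to_mims, mim_to_icd):
    """Group proteins as non-pleiotropic, pleiotropic DPh, pleiotropic SPh or excluded."""
    groups = defaultdict(list)
    for protein, mims in protein_to_mims.items():
        unique_mims = set(mims)
        # Assumption: fewer than two distinct MIM IDs means non-pleiotropic.
        if len(unique_mims) < 2:
            groups["non-pleiotropic"].append(protein)
            continue
        # Keep only the first three characters of each mapped ICD-10 code.
        codes = [mim_to_icd[m][:3] for m in unique_mims if m in mim_to_icd]
        if len(codes) < 2:
            # Fewer than two MIM-to-ICD relations: dropped from the dataset.
            groups["excluded"].append(protein)
        elif len(set(codes)) > 1:
            groups["pleiotropic DPh"].append(protein)
        else:
            groups["pleiotropic SPh"].append(protein)
    return dict(groups)


# Hypothetical toy input: placeholder protein and MIM labels.
example_groups = classify_proteins(
    {"P1": ["M1", "M2"], "P2": ["M3", "M4"], "P3": ["M5"], "P4": ["M6", "M7"]},
    {"M1": "E10.1", "M2": "G40", "M3": "E10.0", "M4": "E10.1", "M5": "C50", "M6": "I42"},
)
print(example_groups)
```

In this toy input, P4 is excluded because only one of its two MIM IDs can be mapped to an ICD-10 code, mirroring the exclusion rule described above.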

2.2.4 Pfam domains

The Pfam database (version 27.0) (Finn et al., 2014) was used to identify protein domains. The Pfam E-value threshold was set to 0.0001 so that only high-confidence matches were retained for the analysis.

2.2.5 Disordered regions

IUPred (structure mode) (Dosztányi et al., 2005a,b), which relies on the pairwise energy of proteins in their native conformation, was selected for the prediction of disordered regions. As the main focus is on the protein flexibility influenced by long disordered regions, all the prediction results from IUPred were post-filtered to keep only long disordered regions. Long disordered regions were required to cover at least 50 consecutive residues. Any predicted disordered region shorter than 50 residues was regarded as part of its adjacent ordered regions.
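The post-filtering step described above can be sketched as follows. The sketch assumes that the predictor output is available as one score per residue and that a residue is called disordered when its score is at least 0.5; that cut-off is a common convention and is an assumption here, not a value stated in this chapter. The function name and the toy scores are hypothetical.

```python
def long_disordered_regions(scores, min_length=50, cutoff=0.5):
    """Return (start, end) index pairs, 0-based and end-exclusive, of long disordered runs."""
    regions = []
    start = None
    for i, score in enumerate(scores):
        if score >= cutoff:          # assumed per-residue disorder call
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_length:
                regions.append((start, i))
            start = None             # shorter runs are treated as ordered
    if start is not None and len(scores) - start >= min_length:
        regions.append((start, len(scores)))
    return regions


# Toy profile: a 60-residue disordered run, then a 30-residue disordered run.
toy_scores = [0.9] * 60 + [0.1] * 10 + [0.8] * 30 + [0.2] * 5
regions_50 = long_disordered_regions(toy_scores, min_length=50)
regions_30 = long_disordered_regions(toy_scores, min_length=30)
print(regions_50, regions_30)
```

With the 50-residue threshold only the first run is kept, whereas lowering `min_length` to 30 also keeps the second run.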

To confirm the result, DISOPRED3 (Jones and Cozzetto, 2015) was chosen for additional analysis. DISOPRED3 has a different approach to identifying disordered regions, as it was trained using a support vector machine (SVM) on missing coordinates in high-resolution X-ray structures.

The study was also repeated using a less-stringent cut-off of 30 residues to identify long disordered regions. In summary, four datasets of disordered regions were used: IUPred with thresholds of 50 and 30 residues, and DISOPRED3 with the same thresholds.

The main analyses using IUPred as a disordered region predictor with the 50-residue threshold are reported in this chapter.

2.2.6 Essential proteins

Essential proteins are defined as proteins required for embryonic survival. This study used two different methods to identify essential human proteins. The first method is to identify essential genes in mouse, then determine their corresponding human orthologues. Starting with a full dataset of mouse genes and phenotypes taken from Mouse Genome Informatics (MGI) (Eppig et al., 2015) (available at ftp://ftp.informatics.jax.org/pub/reports/MGI_PhenoGenoMP.rpt), 4,382 mouse genes related to 46 lethal phenotypes were extracted. Those lethal phenotypes were pre-determined by Georgi et al. (2013), and the full list is provided in Supplementary Table A.5. Out of 4,382 lethal mouse genes, 3,268 genes could be mapped (one-to-one) to human orthologues via the Ensembl web server.

The second method for identifying essential proteins uses data from the Online Gene Essentiality Database (OGEE) (Chen et al., 2012). This database contains 1,528 essential human genes analysed by Silva et al. (2008). The identification of essential proteins in their study was undertaken using a multiplex RNAi screening method. Since essential proteins in this database were already given UniProt IDs as their main protein identifiers, no further mapping was required.

2.2.7 Protein-protein interaction

Data for protein-protein interactions in human were obtained from BioGrid version 3.3 (Stark, 2006; Chatr-Aryamontri et al., 2013). BioGrid reports interactions between genes, with gene symbols used as the main identifiers. Therefore, a one-to-one mapping from gene symbols to human UniProt IDs was performed via the UniProt web server. Gene symbols that could not be mapped to UniProt IDs were removed from the interaction network. As a final result, a non-redundant set of protein-protein interactions was obtained. To confirm the results, the same process was repeated using another interactome database, IntAct (Kerrien et al., 2012).
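As an illustration of how a non-redundant interaction set and per-protein interaction counts could be derived from gene-symbol pairs, a minimal sketch is given below. It assumes that interactions are undirected, so that A-B and B-A count once, and that self-interactions are ignored; neither choice is stated in the text. The function name and identifiers are hypothetical placeholders.

```python
from collections import defaultdict


def interaction_counts(pairs, symbol_to_uniprot):
    """Count distinct interaction partners per UniProt ID from gene-symbol pairs."""
    edges = set()
    for gene_a, gene_b in pairs:
        if gene_a not in symbol_to_uniprot or gene_b not in symbol_to_uniprot:
            continue  # unmapped gene symbols are removed from the network
        # A frozenset makes A-B and B-A collapse into a single edge.
        edges.add(frozenset((symbol_to_uniprot[gene_a], symbol_to_uniprot[gene_b])))
    partners = defaultdict(set)
    for edge in edges:
        if len(edge) == 2:  # assumption: self-interactions are skipped
            a, b = tuple(edge)
            partners[a].add(b)
            partners[b].add(a)
    return {protein: len(found) for protein, found in partners.items()}


# Hypothetical toy network: "GX" has no UniProt mapping, G3-G3 is a self-interaction.
toy_pairs = [("G1", "G2"), ("G2", "G1"), ("G1", "G3"), ("G1", "GX"), ("G3", "G3")]
toy_mapping = {"G1": "U1", "G2": "U2", "G3": "U3"}
toy_counts = interaction_counts(toy_pairs, toy_mapping)
print(toy_counts)
```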

2.2.8 Minor allele frequency

Neutral SAVs were further classified into ‘common’ and ‘rare’ according to their minor allele frequencies (MAF). The standard MAF threshold of 0.01 was used to classify SAVs as rare (MAF < 0.01) or common (MAF ≥ 0.01). Since UniProt does not report MAF data, these were obtained from two external databases: dbSNP and ExAC.

All variants from dbSNP (Smigielski, 2000; Sherry, 2001) are given identifiers starting with ‘rs’ as a prefix. Since dbSNP identifiers were included in UniProt, the MAF data of each SAV could be retrieved automatically from the dbSNP database.

ExAC (Lek et al., 2016) contains genetic variations identified in 60,000 unrelated individuals. Moreover, ExAC reports variants in canonical genes and their isoforms. To be in agreement with the main dataset taken from UniProt, only data obtained from canonical entries were considered. As the ExAC database covers many additional variants not found in Humsavar, all common SAVs reported in ExAC were also extracted for some analyses. These SAVs, however, were only analysed as a separate set and were not combined with the common neutral SAVs originally taken from Humsavar.
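The frequency rule of this section amounts to a single threshold comparison, sketched below. The handling of SAVs without any reported frequency as a separate ‘no MAF’ group follows the way such variants are treated in the Results; the function name and the use of `None` for a missing frequency are assumptions made for illustration.

```python
def classify_neutral_sav(maf, threshold=0.01):
    """Label a neutral SAV as 'rare', 'common' or 'no MAF' from its minor allele frequency."""
    if maf is None:
        return "no MAF"  # no frequency retrieved from either external database
    return "rare" if maf < threshold else "common"


labels = [classify_neutral_sav(m) for m in (0.25, 0.01, 0.0099, None)]
print(labels)
```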

2.2.9 Statistical methods

The chi-square goodness of fit test was used for the comparison of observed and expected numbers for categorical variables, such as the number of mutations found in disordered regions or Pfam domains and the number of essential proteins. Preferences of SAVs to be located in particular regions of proteins, e.g. in disordered regions or Pfam domains, were determined using odds ratios (OR). The ratio of odds ratios (ROR) was also used when comparing the differences between two odds ratios across pleiotropic and non-pleiotropic groups (see Section 1.7.3 in Chapter 1). The Mann-Whitney-Wilcoxon test was used for comparing the distribution of values between two groups (e.g. numbers of interactions between pleiotropic and non-pleiotropic proteins). The hypergeometric test was used for the multiple comparisons of enrichment in disease classes in pleiotropic and non-pleiotropic proteins. Lastly, the Benjamini-Hochberg procedure was used to correct the p-values from multiple comparisons. All test results were considered significant if their p-values were < 0.05.
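For readers unfamiliar with the Benjamini-Hochberg procedure, a minimal sketch of the standard adjustment is shown below. It is a generic textbook implementation, not the software used for the analyses in this chapter, and the raw p-values are invented for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values from five comparisons.
raw_p = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted_p = benjamini_hochberg(raw_p)
significant = [p < 0.05 for p in adjusted_p]
print(adjusted_p, significant)
```

In this invented example, the third and fourth comparisons have raw p-values below 0.05 but are no longer significant after correction.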

2.3 Results

2.3.1 Global characteristics of pleiotropic and non-pleiotropic proteins

Length of protein sequences and number of SAVs

Pleiotropic proteins had a median length of 593 residues (range 79 - 5,202) whereas the median for non-pleiotropic proteins was 509 (range 51 - 6,885). The median for pleiotropic proteins with different phenotypes (DPh) was 570 (range 110 - 4,967) and that for pleiotropic proteins with similar phenotypes (SPh) was 644 (range 79 - 5,202).

Overall, amino acid sequences of pleiotropic proteins (DPh) were slightly longer than the sequences of non-pleiotropic proteins (p < 0.01, Mann-Whitney two-tailed test).

Despite a substantial difference in the number of proteins in the pleiotropic set (412) and the non-pleiotropic set (1,345), these two sets shared similar numbers of SAVs.

12,209 SAVs were found in the pleiotropic set whereas 12,743 SAVs were found in the non-pleiotropic set. However, when those SAVs were divided into disease-associated and neutral SAVs, the proportion of disease-associated SAVs was much higher in the pleiotropic sets (83% for all pleiotropic, 83% for pleiotropic DPh, and 84% for pleiotropic SPh), than in the non-pleiotropic set (72%) (Figure 2.3).

Distribution of SAVs across pleiotropic & non-pleiotropic proteins

Pleiotropic DPh and non-pleiotropic sets were chosen for the comparison of SAV distributions. The dataset contained 15,222 disease-associated SAVs (6,100 were in pleiotropic DPh and 9,122 were in non-pleiotropic proteins) and 4,913 neutral SAVs (1,292 were in pleiotropic DPh and 3,621 were in non-pleiotropic proteins). The pleiotropic DPh set had a total length of 225,681 residues, and the non-pleiotropic set had a total length of 924,002 residues. From these, odds ratios and chi-square tests were used to compare the distributions of SAVs across the pleiotropic DPh and non-pleiotropic sets. The results are presented in Figure 2.4.

Figure 2.3: Stacked bar chart showing numbers of disease-associated and neutral SAVs and their proportions in each protein dataset. The numbers of disease-associated SAVs are presented on the left side of the bar chart while those of neutral SAVs are shown on the right. The proportions of SAVs shown in light blue represent disease-associated SAVs, and in light grey, neutral SAVs.

Disease SAVs were more likely to be found in pleiotropic proteins than in non-pleiotropic proteins (OR = 2.74, p < 0.01). Interestingly, the pleiotropic group was also enriched in neutral SAVs overall (OR = 1.46, p < 0.01). However, when the neutral SAVs were separated according to their frequencies, common neutral SAVs (MAF ≥ 0.01; 356 in pleiotropic and 1,650 in non-pleiotropic) were less likely to be found in pleiotropic (OR = 0.88, p = 0.03) than in non-pleiotropic proteins. The preference of common neutral SAVs towards non-pleiotropic proteins was confirmed even when using variant information from ExAC, containing 499 common neutral SAVs in pleiotropic and 2,811 in non-pleiotropic proteins (OR = 0.73, p < 0.01).

In contrast to the results from common neutral SAVs (MAF ≥ 0.01), rare neutral SAVs (MAF < 0.01) (320 in pleiotropic and 917 in non-pleiotropic) and neutral SAVs with unknown MAF (616 in pleiotropic and 1,054 in non-pleiotropic) were more likely to accumulate in pleiotropic proteins (OR = 1.43, p < 0.01, for rare SAVs; OR = 2.41, p < 0.01, for no MAF SAVs). These results suggest that pleiotropic proteins are not only enriched in disease-causing variants but also in rare neutral SAVs. Additionally, these analyses were repeated using the full pleiotropic set vs. the non-pleiotropic set and similar results were observed (Supplementary Table A.2).

Number of proteins with disordered regions

The percentage of proteins with at least one disordered region was higher in the pleiotropic DPh set (51.4%) than in the non-pleiotropic set (38.4%) (p < 0.01, chi-square test) (Figure 2.5). This result remained significant when comparing the full set of pleiotropic proteins (DPh and SPh) with non-pleiotropic proteins (p < 0.01, chi-square test). Interestingly, there was no significant difference in the number of proteins with disordered regions between the DPh set and the SPh set (p = 0.78, chi-square test). The overall results were still preserved when the cut-off for determining long disordered regions was changed from 50 residues to 30 residues, and when another disordered region predictor, DISOPRED3, was used. Full results from using all cut-off values with IUPred and DISOPRED3 are provided in Supplementary Tables A.3-A.4.

Figure 2.4: Odds ratios comparing the distribution preferences of disease-associated SAVs and neutral SAVs (from Humsavar and ExAC) in pleiotropic DPh vs. non-pleiotropic proteins. The whiskers represent 95% confidence intervals.

Figure 2.5: Percentages of proteins with disordered regions.

2.3.2 Protein essentiality

Two methods were chosen to study the difference between pleiotropic and non-pleiotropic proteins in terms of the number of essential proteins. The results from each individual method are shown in Figure 2.6.

Identification of essential proteins using human-mouse orthologues

A total of 4,382 mouse genes associated with 46 lethal phenotypes (Georgi et al., 2013) were extracted from the MGI database and classified as essential. These MGI phenotypes are listed in Supplementary Table A.5. Of these, 3,268 mouse genes could be mapped onto human orthologues via the Ensembl web server. 59.1% of the proteins from the pleiotropic DPh set and 35.9% from the non-pleiotropic set were identified as essential (p < 0.01). Similar results were observed when considering the full pleiotropic set (55.3%) (p < 0.01). When the two subsets of pleiotropic proteins were considered separately, the pleiotropic DPh set showed a higher percentage of essential proteins than the pleiotropic SPh set (59.1% vs. 49.0%, p = 0.05). All statistical comparisons were performed using chi-square tests.

Figure 2.6: Percentages of essential proteins in pleiotropic and non-pleiotropic protein sets. On the left panel is the comparison using data from mouse orthologues whereas on the right is the comparison using data from OGEE.

Identification of essential proteins using OGEE database

The OGEE database classifies 1,528 human genes as essential. 95.4% of pleiotropic proteins (393 proteins) and 95.8% of non-pleiotropic proteins (1,289 proteins) were reported in the OGEE database and data on their essentiality could be analysed. A significant difference (p < 0.01) was observed in the percentage of essential proteins between pleiotropic DPh (20.2%) and non-pleiotropic sets (10.5%). A similar result (p < 0.01) was also observed when considering the full pleiotropic set (16.8%). In addition, a significant difference (p = 0.01) was found between the pleiotropic DPh (20.2%) and pleiotropic SPh (11.0%) sets. However, as the percentage of essential proteins in the pleiotropic SPh group was low, the difference between this set and the non-pleiotropic set was not significant (p = 0.30). Details are provided in Supplementary Table A.6.

The results from both the human-mouse orthologue method and the OGEE database analysis indicate that pleiotropic proteins are overall more likely to be essential than non-pleiotropic proteins. Moreover, pleiotropic proteins causing multiple diseases with different phenotypes (DPh) are more likely to be essential than pleiotropic proteins causing diseases with similar phenotypes (SPh).

2.3.3 Number of interactions

Protein-protein interaction data were extracted from the BioGrid database. Figure 2.7 shows the mean numbers of interactions for each protein set. In order to compare the number of interactions between any two sets, the Mann-Whitney-Wilcoxon two-tailed test was performed. Pleiotropic DPh proteins had a significantly higher number of interactions (mean 40.2, median 11) than non-pleiotropic proteins (mean 18.9, median 6) (p < 0.01). This significant difference was observed even when the full pleiotropic set was analysed (p < 0.01). When the subsets of pleiotropic proteins were compared, pleiotropic DPh proteins showed a significantly higher number of interactions than pleiotropic SPh proteins (mean 20.0, median 7) (p < 0.01). There was no significant difference between pleiotropic SPh and non-pleiotropic proteins (p = 0.47). These results were similar when using protein-protein interaction information from IntAct (see Supplementary Table A.7).

Figure 2.7: Mean number of interactions. The error bars depict standard errors of the mean values for each set.

Number of interactions by protein essentiality

A study by Dickerson et al. (2011) showed that essential genes exhibited more interactions than non-essential genes. This raises the question of whether the higher number of interactions observed in the pleiotropic set in the previous section was caused by the presence of more essential proteins in the pleiotropic set than in the non-pleiotropic set. In order to address this, the analysis was repeated on the dataset containing only human non-essential proteins (as indicated in the OGEE database). Pleiotropic DPh and non-pleiotropic sets were compared (Figure 2.8).

After essential proteins were removed, the remaining non-essential pleiotropic proteins still had more interactions (mean 27.3, median 8) than non-essential non-pleiotropic proteins (mean 16.6, median 6) (p < 0.01). The overall result was still in broad agreement when only essential proteins were considered in comparing pleiotropic proteins (mean 90.7, median 32) to non-pleiotropic proteins (mean 43.5, median 16) (p = 0.02).

Figure 2.8: Mean number of interactions by protein essentiality. The error bars depict standard errors of the mean values of each set.

The large standard error in the essential pleiotropic DPh set was due to the small number of proteins analysed. The major difference was that the number of interactions increased drastically when only essential proteins were taken into consideration. The result also confirmed the finding by Dickerson et al. (2011) that essential proteins had more interactions than non-essential proteins.

When essential proteins were classified as such via the mouse orthologue method, essential pleiotropic proteins still had more interactions (mean 55.9, median 17) than essential non-pleiotropic proteins (mean 26.2, median 8) (p < 0.01). However, no difference was present between non-essential pleiotropic and non-essential non-pleiotropic proteins (p = 0.38). Details are provided in Supplementary Table A.8.

2.3.4 Differences between pleiotropic and non-pleiotropic proteins in the distribution of SAVs in disordered regions

This study was performed on a dataset containing 257 pleiotropic DPh proteins (6,100 disease-associated SAVs and 1,292 neutral SAVs) and 1,345 non-pleiotropic proteins (9,122 disease-associated SAVs and 3,621 neutral SAVs). Odds ratios and chi-square tests were used for determining whether SAVs occur more frequently in disordered regions than expected (equally distributed between ordered and disordered regions). The results are shown in Table 2.2.
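The comparison of observed and expected counts can be illustrated with a two-category chi-square goodness of fit test (SAVs inside versus outside disordered regions, one degree of freedom). Whether the analyses used exactly this two-category form is an assumption, and the expected count below is the rounded value reported in Table 2.2, so the resulting statistic is only approximate. The function name is hypothetical.

```python
import math


def chi_square_gof_two_bins(observed_in, expected_in, total):
    """Chi-square goodness of fit for counts inside vs. outside a region (1 d.f.)."""
    observed = (observed_in, total - observed_in)
    expected = (expected_in, total - expected_in)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With one degree of freedom, the chi-square survival function
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Disease-associated SAVs in pleiotropic DPh proteins (Table 2.2):
# 750 observed and 1,170 expected in disordered regions, out of 6,100 SAVs.
stat_disease, p_disease = chi_square_gof_two_bins(750, 1170, 6100)
print(round(stat_disease, 1), p_disease < 0.01)
```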

Disease-associated SAVs were less likely to occur in disordered regions in both pleiotropic DPh (OR = 0.58, p < 0.01) and non-pleiotropic proteins (OR = 0.29, p < 0.01). Nevertheless, the ratio of the odds ratios (ROR) for the disease-associated SAVs suggested that they occur about twice as often in the disordered regions of pleiotropic proteins as in those of non-pleiotropic proteins (ROR = 2.05, P(ROR) < 0.01).

For all the neutral SAVs, however, no significant difference in the distribution was found. The values for P(ROR) were 0.67 for common neutral SAVs, 0.27 for rare neutral SAVs, and 0.28 for neutral SAVs with unknown MAF.

When considering common neutral SAVs from ExAC, the preference of SAVs to be found in disordered regions as opposed to ordered regions became significant (OR = 1.55 for pleiotropic DPh and 1.32 for non-pleiotropic, p < 0.01 for both cases). Despite the apparent enrichment of common neutral SAVs in disordered regions, the difference in the enrichment between the two sets was not statistically significant (ROR = 1.18, p = 0.10). The test was repeated using the full pleiotropic set and with another disorder predictor, DISOPRED3; similar results were observed (Supplementary Tables A.9-A.10).

Table 2.2: Observed and expected numbers of SAVs in disordered regions.

SAV type / protein set     O      E      O/E    P-value   OR     ROR    P(ROR)
Disease-associated
  Pleiotropic DPh          750    1170   0.64   < 0.01    0.58   2.05   < 0.01
  Non-pleiotropic          477    1468   0.32   < 0.01    0.29
Neutral
  Pleiotropic DPh          258    248    1.04   0.47      1.05   0.91   0.30
  Non-pleiotropic          652    583    1.11   < 0.01    1.15
Common neutral
  Pleiotropic DPh          80     68     1.14   0.11      1.22   0.94   0.67
  Non-pleiotropic          329    266    1.24   < 0.01    1.30
Rare neutral
  Pleiotropic DPh          73     61     1.18   0.10      1.24   1.19   0.27
  Non-pleiotropic          153    148    1.04   0.63      1.04
No MAF neutral
  Pleiotropic DPh          105    118    0.89   0.18      0.87   0.86   0.28
  Non-pleiotropic          170    170    1.00   0.97      1.00
ExAC common neutral
  Pleiotropic DPh          136    96     1.40   < 0.01    1.55   1.18   0.10
  Non-pleiotropic          566    453    1.25   < 0.01    1.32

Note: 1) O: observed value; 2) E: expected value; 3) ROR: ratio of odds ratios (ORs); 4) P(ROR): p-value of ROR.

In some pleiotropic proteins, residues associated with multiple disease phenotypes (defined as multiple MIM IDs) can be observed. These residues were defined in this study as pleiotropic residues (Figure 2.9). The same analysis on the distribution of SAVs in disordered regions was repeated but with a particular focus on pleiotropic residues and pleiotropic SAVs (i.e. SAVs associated with multiple MIM IDs). Results are shown in Table 2.3.

Figure 2.9: Pleiotropic residue and pleiotropic SAV. (a) a pleiotropic residue is a residue associated with multiple diseases which could arise from one or more SAVs, (b) a pleiotropic SAV is a SAV associated with multiple diseases. Note that a residue that contains a pleiotropic SAV is also regarded as a pleiotropic residue.

The chi-square test showed that pleiotropic residues were less likely to occur in disordered regions (OR = 0.54, p < 0.01). This is similar to the previous observation that disease-associated SAVs were less likely to be found in disordered regions than in ordered regions. Pleiotropic SAVs also showed a similar distribution to that of pleiotropic residues (O/E = 0.53, p < 0.01). Although the odds ratios of both pleiotropic residues and pleiotropic SAVs were slightly higher in pleiotropic DPh compared to pleiotropic SPh, the P(ROR) values showed that these slight differences were not significant: P(ROR) = 0.60 for pleiotropic residues and 0.96 for pleiotropic SAVs.

Table 2.3: Observed and expected numbers of SAVs in disordered regions for DPh and SPh pleiotropic proteins.

SAV type / protein set     O      E      O/E    P-value   OR     ROR    P(ROR)
Pleiotropic residues
  Pleiotropic (full)       73     124    0.59   < 0.01    0.54
  Pleiotropic DPh          46     74     0.62   < 0.01    0.57   1.15   0.60
  Pleiotropic SPh          27     50     0.54   < 0.01    0.50
Pleiotropic SAVs
  Pleiotropic (full)       53     100    0.53   < 0.01    0.48
  Pleiotropic DPh          32     59     0.54   < 0.01    0.49   1.02   0.96
  Pleiotropic SPh          21     40     0.53   < 0.01    0.48

Note: 1) O: observed value; 2) E: expected value; 3) ROR: ratio of odds ratios (ORs), here comparing pleiotropic DPh with pleiotropic SPh; 4) P(ROR): p-value of ROR.

2.3.5 ICD-10 disease classes in pleiotropic and non-pleiotropic proteins

This section reports the analysis of the abundance of different disease classes found in pleiotropic DPh and non-pleiotropic datasets. In order to find the frequency of each disease class, ICD-10 classes (see Table 2.1) were assigned to each protein via MIM ID to ICD-10 relations. A pleiotropic protein can be assigned several ICD-10 classes due to its multiple MIM ID associations, while a non-pleiotropic protein can be assigned only one ICD-10 class (see Figure 2.10). MIM IDs that could not be assigned ICD-10 classes were removed. As some proteins may share the same class, multiple entries in each pool (pleiotropic and non-pleiotropic) were allowed. In total, there were 661 MIM-ICD relations in the pleiotropic pool and 1,205 in the non-pleiotropic pool.

Figure 2.10: The assignment of ICD-10 classes to pleiotropic and non-pleiotropic proteins.

From the initial analysis, some disease classes were present at very low percentages (i.e. less than 2.5% in both the pleiotropic and non-pleiotropic pools). These classes (numbers 1, 5, 8, 10, 11, 12, 13, 14, 15, and 16) were therefore combined and defined as ‘other’. Apart from a single MIM ID in class 19 from a pleiotropic protein, no MIM IDs were found in class numbers 18-22; hence, these classes were removed from the statistical analysis. The relative frequencies of each disease class were then calculated. The final histogram containing data from 8 classes (2, 3, 4, 6, 7, 9, 17, and ‘other’) is shown in Figure 2.11.

The hypergeometric test was employed to compare the enrichment of each class in the pleiotropic and the non-pleiotropic sets, together with the Benjamini-Hochberg procedure to correct the p-values. Pleiotropic proteins were enriched in neoplasm (class 2, p < 0.01), diseases of the nervous system (class 6, p < 0.01), diseases of the circulatory system (class 9, p < 0.01), and congenital malformation (class 17, p < 0.01), whereas non-pleiotropic proteins were enriched in blood conditions (class 3, p < 0.01) and metabolic diseases (class 4, p < 0.01). There were no significant differences observed in class 7 (diseases of the eye and adnexa) and the ‘other’ group.
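A minimal sketch of a one-sided hypergeometric enrichment test is given below. It treats the 661 + 1,205 = 1,866 MIM-ICD relations as the population and the 661 relations of the pleiotropic pool as the draw; this framing is one possible reading, since the exact set-up is not spelled out in the text. The class counts in the example are hypothetical, as the per-class counts are not listed here.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n items are drawn without replacement from N items, K of them in the class."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


N_TOTAL = 661 + 1205   # all MIM-ICD relations (both pools)
N_PLEIO = 661          # relations in the pleiotropic pool

# Hypothetical class: 200 relations overall, 100 of them in the pleiotropic pool.
p_enriched = hypergeom_upper_tail(100, 200, N_PLEIO, N_TOTAL)
print(p_enriched < 0.01)
```

In practice, one such p-value would be computed per disease class and the resulting set corrected for multiple testing.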

2.3.6 The association between two disease classes in pleiotropic proteins

Figure 2.11: The frequency of different disease classes found in pleiotropic and non-pleiotropic proteins. The study was performed using the hypergeometric test with the Benjamini-Hochberg procedure. Each double asterisk (**) denotes a significant difference in the frequencies between pleiotropic and non-pleiotropic sets (p < 0.01).

As pleiotropic proteins were found to be specific to some disease classes, this raises the question of whether specific disease classes are more likely to be co-present in the same protein. A heat map showing which two disease classes were frequently found together in the same protein was generated (Figure 2.12). For each protein in the pleiotropic DPh set, all unique MIM IDs were extracted, and their corresponding ICD-10 classes were assigned. A zero square matrix of size 19 was generated for each protein and then filled with ‘1’ in each cell [i, j], where i and j are class numbers associated with that protein. For example, if a protein is associated with 5 MIM IDs belonging to class numbers 2, 2, 2, 4, and 6, respectively, the considered pairs would be [2, 2], [2, 4], [2, 6], [4, 2], [4, 6], [6, 2], and [6, 4]. Note that the pair [2, 2] is only counted once, regardless of the redundancy in ICD-10 for class number 2, and the matrix is symmetric. The matrices were then summed over all the proteins.
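The matrix construction described above can be sketched as follows. The rule for the diagonal is inferred from the worked example, in which [2, 2] is listed but [4, 4] and [6, 6] are not: a diagonal cell is set only when at least two MIM IDs of the protein fall into the same class. This inference, the function names and the toy inputs are assumptions made for illustration.

```python
from collections import Counter

N_CLASSES = 19  # matrix size stated in the text


def protein_matrix(classes):
    """Binary class co-occurrence matrix for one protein (one class per unique MIM ID)."""
    counts = Counter(classes)
    matrix = [[0] * N_CLASSES for _ in range(N_CLASSES)]
    for i in counts:
        for j in counts:
            # Off-diagonal: two different classes occur in the same protein.
            # Diagonal: at least two MIM IDs share the same class (inferred rule).
            if i != j or counts[i] >= 2:
                matrix[i - 1][j - 1] = 1
    return matrix


def summed_matrix(proteins):
    """Sum the per-protein matrices to obtain the values plotted in the heat map."""
    total = [[0] * N_CLASSES for _ in range(N_CLASSES)]
    for classes in proteins:
        single = protein_matrix(classes)
        for r in range(N_CLASSES):
            for c in range(N_CLASSES):
                total[r][c] += single[r][c]
    return total


# Worked example from the text: five MIM IDs in classes 2, 2, 2, 4 and 6.
example = protein_matrix([2, 2, 2, 4, 6])
example_pairs = sorted(
    (r + 1, c + 1) for r in range(N_CLASSES) for c in range(N_CLASSES) if example[r][c]
)
print(example_pairs)

# Two hypothetical proteins summed together.
total = summed_matrix([[2, 2, 2, 4, 6], [2, 17]])
```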

Although there were 22 classes defined by ICD-10, almost all the data were spread across the first 17 classes; one MIM ID was found to be associated with class number 19 and none with class numbers 18, 20, 21, and 22. Pleiotropic SPh proteins were excluded from this study as they were characterised by having different diseases sharing the same 3-letter ICD-10 codes (their associated diseases affect the same biological systems), which was not informative for heat map analysis.

The final heat map representing the overall profile of pleiotropic DPh proteins is shown in Figure 2.13 (left). Due to the large number of diseases in class number 17 (congenital malformation), the linear scale used in the heat map limited visual interpretation. To facilitate visualisation of the heat map, log values were calculated using the formula below.

\[ \mathrm{value}_{\mathrm{new}} = \log_{10}(\mathrm{value} + 1) \]

The new heat map is shown on the right of Figure 2.13. The adjusted heat map shows bright spots found mainly along the diagonal line, suggesting that many different diseases associated with the same proteins were classified into the same ICD-10 class. However, some clusters involving different classes were also observed. Class number 17 (congenital malformation) was found to be co-present with many other disease classes, particularly numbers 2, 4, 6, 8, and 13. In addition, other noticeable class pairs were: 1) class number 4 (metabolic disease) and class number 6 (neurological disease) and 2) class number 6 (neurological disease) and class number 9 (circulatory disease).

2.3.7 The clustering of SAVs in Pfam domains in pleiotropic proteins

In this section, the enrichment of SAVs in Pfam domains is examined. The aim is to explore whether SAVs clustering in the same domain are likely to be associated with the same disease.

Figure 2.12: The approach for constructing the heat map. Unique MIM IDs from each protein were extracted and assigned ICD-10 codes. Then a symmetric matrix for each protein was constructed. All the matrices were then summed to create the final heat map representing the disease profile of pleiotropic proteins.
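The construction summarised in Figure 2.12 can be sketched as follows. This is a minimal illustration, not the code used in the thesis: it assumes that every pair of distinct MIM IDs in a protein adds one count to the cell for their two ICD-10 classes, and the MIM IDs, function names, and example proteins are hypothetical.

```python
from itertools import combinations

N_CLASSES = 22  # number of ICD-10 classes


def protein_matrix(mim_to_class):
    """Symmetric class-by-class count matrix for one protein.

    mim_to_class maps each unique MIM ID to its ICD-10 class number (1-22).
    Assumption: each pair of distinct MIM IDs contributes one count.
    """
    matrix = [[0] * N_CLASSES for _ in range(N_CLASSES)]
    for (_, a), (_, b) in combinations(sorted(mim_to_class.items()), 2):
        i, j = a - 1, b - 1
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # mirror the count to keep the matrix symmetric
    return matrix


def sum_matrices(matrices):
    """Element-wise sum of the per-protein matrices."""
    total = [[0] * N_CLASSES for _ in range(N_CLASSES)]
    for matrix in matrices:
        for i in range(N_CLASSES):
            for j in range(N_CLASSES):
                total[i][j] += matrix[i][j]
    return total


# Hypothetical proteins with made-up MIM IDs.
protein_a = {"100001": 17, "100002": 6, "100003": 17}
protein_b = {"200001": 4, "200002": 6}
heat = sum_matrices([protein_matrix(protein_a), protein_matrix(protein_b)])
print(heat[16][5], heat[16][16], heat[3][5])
```

Diagonal cells count pairs of diseases from the same class, which is why bright diagonal spots indicate diseases classified together.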

Figure 2.13: Heat map showing the clustering of different disease classes found in pleiotropic DPh proteins. To facilitate visualisation, log values were used for generating this heat map. The minimum value (0) is shown in red, while the maximum value (1.82, original value 65) is in white. Figure generated by Heatmapper (Babicki et al., 2016).

Pfam domains covered 119,214 out of 367,076 residues in pleiotropic DPh proteins and 455,244 out of 924,002 residues in non-pleiotropic proteins. The distributions of SAVs in Pfam domains in both the pleiotropic DPh and the non-pleiotropic proteins were analysed to see if SAVs were more enriched in Pfam domains than expected (compared with a random distribution). The results are shown in Table 2.4.

According to the chi-square test, disease-associated SAVs were more likely to be found in Pfam domains in both pleiotropic (OR = 2.43, p < 0.01) and non-pleiotropic proteins (OR = 3.13, p < 0.01) than in parts of the sequences not annotated as a Pfam domain. However, the ratio of odds ratios (ROR) showed that pleiotropic proteins were less enriched in disease SAVs in Pfam domains when compared with non-pleiotropic proteins (ROR = 0.78, p < 0.001).

Table 2.4: Distribution of SAVs on Pfam domains.

| Disease-associated | O | E | O/E | P-value | OR | ROR | P(ROR) |
|---|---|---|---|---|---|---|---|
| Pleiotropic DPh | 4433 | 3222 | 1.38 | < 0.01 | 2.43 | 0.78 | < 0.01 |
| Non-pleiotropic | 6846 | 4494 | 1.52 | < 0.01 | 3.13 | | |

| Neutral | O | E | O/E | P-value | OR | ROR | P(ROR) |
|---|---|---|---|---|---|---|---|
| Pleiotropic DPh | 655 | 682 | 0.96 | 0.12 | 0.92 | 0.99 | 0.85 |
| Non-pleiotropic | 1718 | 1784 | 0.96 | 0.03 | 0.93 | | |

Note: 1) O: Observed value 2) E: Expected value 3) ROR: Ratio of odds ratios (ORs) 4) P(ROR): P-value of ROR

In contrast, neutral SAVs, in both the pleiotropic and non-pleiotropic sets, were less likely to be found in Pfam domains and the distributions were similar to random (OR = 0.92, p = 0.12 for pleiotropic proteins; OR = 0.93, p = 0.03 for non-pleiotropic proteins). Although the odds ratios were similar, the result was not significant for the pleiotropic case, possibly because fewer SAVs were considered. Furthermore, the pleiotropic and the non-pleiotropic sets did not show a significant difference in the distribution of neutral SAVs in Pfam domains (ROR = 0.99, p = 0.85). When the test was repeated using the full pleiotropic set, similar results were observed, and all the test statistics became significant (see Supplementary Table A.11).

As a protein can be made up of multiple different Pfam domains, a second analysis was performed in order to establish whether disease SAVs were more likely to cluster in the same domain and whether they shared the same disease when they were in the same domain. This analysis was performed on the pleiotropic DPh set.

The clustering test

The pleiotropic DPh set was screened for proteins that contain at least two Pfam domains and have at least two disease SAVs in these domains. Pleiotropic SAVs were removed from the analysis as they complicate the comparison between different SAVs and their associated diseases. 155 (60.3%) pleiotropic proteins met these requirements. In each protein, all possible SAV pairs located inside Pfam domains were examined to determine which of the following relationships each pair exhibited: 1) same domain & same disease, 2) same domain & different disease, 3) different domain & same disease, and 4) different domain & different disease (Figure 2.14a). The numbers of all observed cases were then compared with the numbers from the random model (assuming that SAVs are spread evenly throughout all the domains) using a chi-square test.

In the random model, SAVs are allowed to occur at any position in Pfam domains together with their diseases (Figure 2.14b). Hence, the probability that two SAVs are in the same domain is given by

$$P_{\text{same dom}} = \left(\frac{L_1}{\sum_{i=1}^{n} L_i}\right)^2 + \left(\frac{L_2}{\sum_{i=1}^{n} L_i}\right)^2 + \cdots + \left(\frac{L_n}{\sum_{i=1}^{n} L_i}\right)^2$$

where $n$ is the number of domains and $L_1, L_2, \ldots, L_n$ are the lengths of each domain. The probability that two SAVs are in different domains is

$$P_{\text{diff dom}} = 1 - P_{\text{same dom}}$$

Note that two SAVs are allowed to occur at the same position in a Pfam domain.

The probability of two SAVs occurring in the same domain and the probability of two SAVs sharing the same disease are independent; therefore, the probability that they share both the same domain and the same disease can be calculated as below.

Figure 2.14: Overview of the SAV clustering test. (a) All four possible relationships between two SAVs: same domain same disease, same domain different disease, different domain same disease, and different domain different disease. (b) The random model allows SAVs to occur at random in Pfam domains, relocating anywhere within Pfam domains whilst carrying their original disease profiles. (c) The two ways to interpret the term ‘same domain’. S1 and S2 depict two SAVs, while D1 and D2 represent two different diseases.

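$$P_{\text{same dom \& same dis}} = P_{\text{same dom}} \times P_{\text{same dis}}$$

By the same principle, the probabilities of the other three relationships can be calculated by multiplying the probabilities of each independent event.

$$P_{\text{same dom \& diff dis}} = P_{\text{same dom}} \times P_{\text{diff dis}}$$

$$P_{\text{diff dom \& same dis}} = P_{\text{diff dom}} \times P_{\text{same dis}}$$

$$P_{\text{diff dom \& diff dis}} = P_{\text{diff dom}} \times P_{\text{diff dis}}$$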

Unlike the probability for sharing the same domain which depends on the length of each Pfam domain, the probability for sharing the same disease (Psame dis) is either 1 (when both SAVs are associated with the same MIM ID) or 0 (otherwise).

The same calculation was repeated on all the proteins, and all the probability scores were summed up to generate the final numbers for the expected cases.
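The random model above can be sketched for a single protein as follows. This is an illustrative sketch under the stated model, not the original analysis code: the function names, the dictionary keys, and the example domain lengths and disease labels are all hypothetical.

```python
from itertools import combinations


def p_same_domain(domain_lengths):
    """Probability that two randomly placed SAVs fall in the same domain."""
    total = sum(domain_lengths)
    return sum((length / total) ** 2 for length in domain_lengths)


def expected_counts(domain_lengths, sav_mims):
    """Expected number of SAV pairs in each of the four categories.

    sav_mims holds one MIM ID per disease SAV located in a Pfam domain.
    P(same disease) is 1 if the two MIM IDs match and 0 otherwise.
    """
    p_same = p_same_domain(domain_lengths)
    p_diff = 1.0 - p_same
    counts = {
        "same_dom_same_dis": 0.0,
        "same_dom_diff_dis": 0.0,
        "diff_dom_same_dis": 0.0,
        "diff_dom_diff_dis": 0.0,
    }
    for mim_a, mim_b in combinations(sav_mims, 2):
        disease = "same_dis" if mim_a == mim_b else "diff_dis"
        counts["same_dom_" + disease] += p_same
        counts["diff_dom_" + disease] += p_diff
    return counts


# Hypothetical protein: two domains of 100 and 300 residues, three SAVs.
example = expected_counts([100, 300], ["MIM_A", "MIM_A", "MIM_B"])
print(example)
```

Summing these dictionaries over all proteins would give the expected side of the contingency matrix; the four values for one protein always add up to its number of SAV pairs.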

Results

The term ‘same domain’ can be interpreted in two different ways: 1) two SAVs must be located in exactly the same Pfam domain, or 2) two SAVs can be located in any two separate domains provided that they share the same Pfam identifier (Figure 2.14c). The contingency matrices containing observed and expected numbers of SAV pairs that satisfied each domain-disease condition are shown in Table 2.5.

When the first condition was considered, the result showed that SAVs were more likely to cluster in the same domains (O/E same domain = 1.19) rather than being distributed across different domains. In addition, when SAVs were in the same domain, they were also more likely to share the same disease (O/E same domain & same disease = 1.26) rather than different diseases (O/E same domain & different disease = 1.02). Furthermore, when SAVs were not in the same domain, the chance that they were associated with different diseases was higher than the chance that they were associated with the same disease (O/E different domain & different disease = 0.99; O/E different domain & same disease = 0.91). All results were statistically significant (p < 0.01, chi-square test).

Table 2.5: Contingency matrices for observed and expected numbers of disease-associated SAVs.

Same Pfam domain

| Pleiotropic DPh | Observed: Same MIM | Observed: Different MIM | Observed: Total | Expected: Same MIM | Expected: Different MIM | Expected: Total |
|---|---|---|---|---|---|---|
| Same Pfam | 21718 | 7993 | 29711 | 17158 | 7809 | 24967 |
| Different Pfam | 44233 | 17030 | 61263 | 48793 | 17214 | 66007 |
| Total | 65951 | 25023 | 90974 | 65951 | 25023 | 90974 |

Same Pfam identifier

| Pleiotropic DPh | Observed: Same MIM | Observed: Different MIM | Observed: Total | Expected: Same MIM | Expected: Different MIM | Expected: Total |
|---|---|---|---|---|---|---|
| Same Pfam ID | 45775 | 14546 | 60321 | 34644 | 13557 | 48201 |
| Different Pfam ID | 20176 | 10477 | 30653 | 31307 | 11466 | 42773 |
| Total | 65951 | 25023 | 90974 | 65951 | 25023 | 90974 |
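The observed-to-expected ratios quoted above can be recomputed from the cell counts in Table 2.5 (same Pfam domain interpretation) with a few lines of Python. The Pearson statistic in the sketch is included only as an illustration of a chi-square comparison of observed against expected counts; the exact test configuration used in this study is not restated here, so it should be treated as an assumption. The cell labels are illustrative.

```python
# Cell counts from Table 2.5, same Pfam domain interpretation.
CELLS = [
    "same_dom_same_mim",
    "same_dom_diff_mim",
    "diff_dom_same_mim",
    "diff_dom_diff_mim",
]
observed = dict(zip(CELLS, [21718, 7993, 44233, 17030]))
expected = dict(zip(CELLS, [17158, 7809, 48793, 17214]))

# Observed/expected ratio for each domain-disease relationship.
ratios = {cell: observed[cell] / expected[cell] for cell in CELLS}

# Pearson statistic: sum of (O - E)^2 / E over the four cells (illustrative).
chi_square = sum(
    (observed[cell] - expected[cell]) ** 2 / expected[cell] for cell in CELLS
)

for cell in CELLS:
    print(f"{cell}: O/E = {ratios[cell]:.2f}")
print(f"Pearson chi-square statistic = {chi_square:.1f}")
```

A ratio above 1 means the relationship occurs more often than the random model predicts, and a ratio below 1 means it occurs less often.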

Similar results were also observed when SAVs located in separate domains with the same Pfam identifier were regarded as sharing the same domain. It can be seen from the results that disease SAVs were more likely to cluster in the same domain (O/E same domain = 1.25). When they were in the same domain, they were also more likely to share the same diseases (O/E same domain & same disease = 1.32) rather than different diseases (O/E same domain & different disease = 1.07).

In general, SAVs were more likely to cluster in the same domains. When two SAVs were found to be located in different domains, they were more likely to cause different diseases (O/E different domain & different disease = 0.91) than the same disease (O/E different domain & same disease = 0.64). All the test results were significant (p < 0.01, chi-square test).

The test was also performed on the full set of 412 pleiotropic proteins, and similar significant results were observed (see Supplementary Table A.12).

2.3.8 Public web-based database

To make the dataset used in this study publicly accessible, a web database, PleiotropyDB, was developed (Figure 2.15). This web database, available at http://www.sbg.bio.ic.ac.uk/pleiotropydb, was designed to be easy to use and was created using Django, a Python-based web platform that supports dynamic programming and data handling (Django Community, 2018). The website consists of two main browsing pages: by protein and by variants. Two datasets are available: the main dataset consisting of 257 pleiotropic DPh and 1,345 non-pleiotropic proteins, and the extended dataset consisting of 412 pleiotropic proteins and 1,345 non-pleiotropic proteins. Users can customise the search by providing extra information, such as a disease name, a UniProt ID, or a variant type, to refine the search results and display only proteins or variants that match the given keywords. Additionally, users can download the filtered data as a tab-delimited text file for further study.

2.4 Discussion

This study shows that pleiotropic proteins are more likely to be disordered, essential, and have more interactions than non-pleiotropic proteins. Pleiotropic proteins are also enriched in rare neutral SAVs and disease-associated SAVs specific to certain types of disease such as cancers, neurological diseases, cardiovascular diseases, and congenital disorders. In contrast, non-pleiotropic proteins are enriched in common neutral SAVs and are associated mostly with metabolic disorders.

Figure 2.15: PleiotropyDB website interface. This figure shows the page reporting the list of variants studied in this chapter. Advanced search tools are provided for users to filter information based on supplied UniProt IDs or disease names. External links are provided for each protein, variant, and disease identifier.

Disordered regions are a common feature of pleiotropic proteins

Disordered regions enable proteins to perform essential activities and interact with multiple partners. This analysis shows that more pleiotropic proteins (51%) than non-pleiotropic proteins (38%) contain at least one long disordered region (> 50 residues). With the widely used and more relaxed cut-off of 30 residues to identify a disordered region, 57.6% of pleiotropic proteins and 44.2% of non-pleiotropic proteins were predicted by IUPred to possess at least one disordered region. These results are in broad agreement with the finding by Oates et al. (2013) that on average 44% of human protein-coding genes contain long disordered regions (the cut-off used in their study was > 30 residues). The higher proportion of disordered proteins indicates that protein-coding genes associated with a variety of diseases are more likely to contain disordered regions than genes that do not show pleiotropic effects. This study also demonstrated that proteins with higher disease diversity (pleiotropic DPh) show more interactions and are more likely to be essential than proteins with lower disease diversity (pleiotropic SPh). However, no clear correlation was found between the disease diversity and the number of disordered residues.

The presence of disease mutations in disordered regions is one of the characteristics that distinguish pleiotropic proteins from non-pleiotropic proteins. Deleterious mutations are rarely found in disordered regions. This observation can be explained by a study by Uversky et al. (2009), which stated that disordered regions are less conserved and are much more tolerant of mutations. Nevertheless, pleiotropic proteins show a notably larger number of deleterious mutations in disordered regions than non-pleiotropic proteins, suggesting that disordered regions in pleiotropic proteins are less tolerant of mutations. This could be because disordered regions play many important biological roles, especially in multi-partner protein-protein interactions which require the protein to be extra flexible. For this reason, minor changes in the flexibility of pleiotropic proteins are more likely to be deleterious than similar changes in non-pleiotropic proteins.

Unlike disordered regions, Pfam domains in both protein sets are enriched in disease SAVs compared with regions outside Pfam domains. Moreover, in pleiotropic proteins, disease mutations sharing the same disease tend to cluster in the same domain. In contrast, when they cause different diseases, they are more likely to be in different domains.

Many essential proteins are pleiotropic

Pleiotropic proteins also differ from non-pleiotropic proteins in terms of essentiality. As confirmed by both the human-mouse orthologue and the genome-wide analysis approaches, pleiotropic proteins are more likely to be encoded by essential genes. A previous study on pleiotropy in C. elegans by Zou et al. (2008) demonstrated that pleiotropic proteins tend to have more roles in embryonic development and are more central in the protein network. The enrichment of essential proteins found in the pleiotropic dataset analysed here suggests that pleiotropic proteins could also play a similar role in embryonic development in humans. Although the two approaches for determining essential proteins gave the same conclusion that essential proteins are enriched in the pleiotropic dataset, the numbers of essential proteins detected by the two methods are notably different: a greater number of proteins were identified as essential when using the mouse orthologue method. However, a protein found to be essential in mice may not be essential in humans. Consequently, the number of essential proteins in humans could be overestimated using the mouse orthologue approach.

Pleiotropic proteins contribute to different types of diseases

Several studies have suggested that diseases caused by pleiotropic proteins are mostly related to the failure of cell signalling and communication (Iakoucheva et al., 2002; Uversky et al., 2008; Raychaudhuri et al., 2009), whereas non-pleiotropic proteins are more likely to be related to diseases involving the disruption of globular proteins (i.e. proteins with no disordered regions) such as metabolic diseases (Xie et al., 2007). This analysis shows that pleiotropic proteins are enriched in diseases related to cancers, neurological disorders, cardiovascular diseases, and congenital disorders. There is clinical evidence, found by Dr Alessia David, that supports this finding of the link between pleiotropy and cancers, including the report by the PAGE, GECCO, and CCFR consortia about the genetic association between colorectal cancer and other types of cancers (Cheng et al., 2014) and the report by the PAGE and TRICL consortia showing the association between lung and other types of cancers (Park et al., 2014). Several studies demonstrated that cancer is highly related to unstructured proteins (Iakoucheva et al., 2002) and hub proteins (Uversky et al., 2008). All these properties are commonly observed features of pleiotropic proteins.

The enrichment of congenital malformation disease (class number 17) found in pleiotropic proteins agrees with the study in C. elegans by Zou et al. (2008). The authors showed that pleiotropy could be observed extensively among genes involved in the early stage of embryogenesis. Some of these genes encode pleiotropic proteins that affect different systems and tend to be at the centre of protein-protein networks.

Conversely, non-pleiotropic proteins were more enriched in metabolic diseases, which are generally diseases involving disruption of globular structures. The enrichment of metabolic disease in non-pleiotropic proteins and of neoplasm in pleiotropic proteins is in agreement with the human disease network developed by Goh et al. (2007). That study shows that the degree of shared genetic background is high for neoplasm and low for endocrine-related disorders. Additionally, a study by Xie et al. (2007) reported that some protein functions, such as catalysis, biosynthesis, metabolism, and transport (e.g. electron transport, sugar transport), require proteins to be ordered. Proteins with catalytic, biosynthetic, and metabolic functions are mostly enzymes, which tend to have ordered structures.

The co-occurrence between metabolic disease and neurological disease in the results is in agreement with previous studies. In general, many neurodegenerative diseases were shown to result from disorders in metabolic processes, which can be caused by a point mutation in a single gene (Swanson, 1995). Among all the proteins that were associated with both metabolic and neurological diseases in the dataset analysed, 45% were related to disorders in lipid storage and metabolism. As fatty molecules are a crucial component of a , failing to metabolise fatty molecules could lead to neurodegenerative disorders such as spasticity and mental retardation (Hageman et al., 1995; Müller vom Hagen et al., 2014). Another overlap was between neurological disease and cardiovascular disease. Proteins associated with both types of diseases were typically proteins causing myopathy (i.e. muscle diseases) and cardiomyopathy (i.e. diseases in heart muscle tissues). According to the ICD-10, myopathy was classified in class number 6 (neurological disease) whereas cardiomyopathy was classified in class number 9 (heart disease) due to the affected system. Although the link between cardiovascular disease and other diseases was not so notable in the results reported here, some studies suggest that cardiovascular disease is usually the outcome of other diseases such as obesity, diabetes, and hyperlipidaemia, which are diseases of the endocrine and metabolic systems (Kannel and McGee, 1979; Hubert et al., 1983; Fox et al., 2005).

Some other cross-associations between different disease types were reported in many studies. For instance, there is evidence of an association between neurological disorders and cancers in the elderly (Driver, 2014), and evidence for a relationship between congenital malformations and an increased risk of cancer, particularly in young children (Bjørge et al., 2008; Sun et al., 2014).

The differences between pleiotropic and non-pleiotropic proteins can be amplified by the degree of pleiotropy

Although differences can be seen between pleiotropic and non-pleiotropic proteins in both structural (disordered vs. ordered and distribution of SAVs) and biological (protein-protein interactions, variation of diseases, and protein essentiality) aspects, in some test results pleiotropic proteins causing diseases of similar phenotypes (pleiotropic SPh) could not be distinguished from non-pleiotropic proteins: the two groups had similar numbers of interactions and similar numbers of essential proteins when using data from OGEE. The difference between pleiotropic and non-pleiotropic proteins becomes clearer when the associated diseases in pleiotropic proteins are more diverse (i.e. when comparing pleiotropic DPh to non-pleiotropic). This suggests that pleiotropy is not a discrete two-state property but a continuous property that is related to the variety in the associated phenotypes.

The number of pleiotropic proteins has been increasing over the past decade

Part of this study was published in Human Mutation as “Landscape of Pleiotropic Proteins Causing Human Disease: Structural and System Biology Insights” (authors: Sirawit Ittisoponpisan, Eman Alhuzimi, Alessia David, and Michael J.E. Sternberg) in January 2017 (available online since December 2016) (Ittisoponpisan et al., 2017). Results from 257 pleiotropic DPh proteins were reported to emphasise the differences between pleiotropic and non-pleiotropic proteins. However, the results from the full set of 412 pleiotropic proteins were also provided in a supplementary document. This paper was mentioned in the editorial “One Gene, Several Diseases: The Characteristics of Pleiotropic Proteins” in Human Mutation (Vihinen, 2017).

A recent study by Chesmore et al. (2018) explored pleiotropy of human disease in the 2017 version of the GWAS database. Based on their criteria for pleiotropy, 44% of disease-associated genes were identified as pleiotropic, which contradicts the earlier study by Sivakumaran et al. (2011), which identified 16.9%. However, as pointed out by Chesmore et al., their analyses were repeated on the data released in each year from 2005 to 2017. They observed a growing proportion of pleiotropic genes in the GWAS database, from no pleiotropic genes being identified in 2005 and 2006 to 44% in 2017. It was estimated that this number would continue to increase as more data are collected and added to the database.

The study reported here used data available in 2014 and identified 12% of disease-associated proteins in coding regions as pleiotropic. The criteria for defining pleiotropic genes are more stringent than those used by other research groups: instead of directly regarding a gene associated with different phenotypes as pleiotropic, the ICD-10 classification by WHO was used to add the constraint that the different phenotypes must come from different disease classes. When the full pleiotropic set that includes genes associated with similar disease phenotypes (pleiotropic SPh) was considered, the coverage of pleiotropic genes increased to 19.3%, which is in agreement with the results from the study by Chesmore et al. using GWAS data from the same year (20%).

2.5 Conclusion

The results from this pleiotropy study provide a better understanding of the characteristics of pleiotropic and non-pleiotropic proteins. In particular, the study demonstrated that pleiotropic proteins are enriched in deleterious mutations and rare polymorphisms, and are associated with cancer, cardiovascular disease, neurological disease, and congenital malformation. The results emphasise the importance of pleiotropy in human pathology and could help researchers select candidate genes for further analysis.

Chapter 3

Structural Effects Of Mutations On Human Proteins

This chapter presents the development of a structure-based protocol, ‘Missense3D’, to analyse the impact of SAVs on protein structures. In this chapter, the structural effects of SAVs on high-quality X-ray structures are evaluated and the analysis pipeline is benchmarked for later study on homology-built structures presented in Chapter 4. The same protocol was also used in a study with Dr Alessia David to analyse variants of uncertain significance (VUS), and the study was published in Journal of the Endocrine Society (title: “Structural Biology Helps Interpret Variants of Uncertain Significance in Genes Causing Endocrine and Metabolic Disorders”) (Ittisoponpisan and David, 2018). The paper emphasises how crucial an in-depth understanding of the molecular mechanisms of the disease is, particularly in gene prioritisation.


3.1 Background

Many novel variants in the human population have been discovered through whole-genome sequencing projects such as the 100,000 Genomes Project (Peplow, 2016), which aims to enable new scientific discoveries and medical insights for the treatment of cancer and rare diseases. Whole-genome sequencing has recently become much more affordable, thus allowing sequencing of large numbers of individuals. The overwhelming number of newly discovered variations poses a major challenge for researchers analysing their effects on protein structures. Performing clinical in vitro analysis on all the variants is impractical due to the high cost. Therefore, in silico analysis is an alternative approach to identify potentially deleterious mutations prior to further laboratory study.

Despite the abundance of in silico programs to help analyse the pathogenicity of genetic variants (e.g. SIFT, FATHMM, and PolyPhen-2 for sequence-based prediction, and PopMusic, STRUM, INPS3D, SDM, SAAPdap, and HOPE for structure-based prediction), many of them have limitations. First, most of them return a binary outcome (i.e. neutral or deleterious) or provide little explanation of the effect of the variants. Because of that, the real mechanism still has to be further elucidated. Second, many tools tend to overpredict disease-causing mutations, resulting in high false positive rates (FPRs). A recent study by Miosge et al. (2015) on 33 de novo variants in mice found that many widely used sequence-based prediction methods, including PolyPhen-2, SIFT, MutationAssessor, Panther, CADD, and Condel, tend to overpredict the deleterious effects of variants. About a third of de novo variants were predicted as deleterious by PolyPhen-2 and other predictors. Nevertheless, only about 20% of them were experimentally confirmed to result in a deleterious phenotypic outcome. In addition, many prediction tools utilise machine learning algorithms, which rely on the training sets used. These tools are likely to flag de novo variants as deleterious. In many applications, including clinical ones (Richards et al., 2015; Ellard et al., 2017), a low FPR (or high specificity) is of major benefit. Therefore, conducting structural analysis to confirm the pathogenicity of the variant and to give a structural explanation of the phenotypic effect is paramount for further study.

To address the above issues, a new structural analysis tool, ‘Missense3D’, was developed. This chapter reports the evaluation of the structural features used by Missense3D on high-quality experimental structures. The same analysis pipeline was repeated on predicted structures and is reported in Chapter 4.

3.2 Materials & methods

The outline for data curation and structural analysis used in Missense3D is shown in Figure 3.1.

3.2.1 Dataset of high-quality experimental structures

High-quality X-ray structures were obtained from the Top8000 database (Hintze et al., 2016). These non-redundant and independent structures were pre-selected by the Richardson Laboratory using the following criteria: resolution < 2.0 Å, MolProbity score (calculated from a combination of the clash score, percentage of disallowed phi/psi and percentage of bad side-chain rotamers) < 2.0 (Chen et al., 2010), ≤ 5% of residues with bond length outliers (> 4σ), ≤ 5% of residues with bond angle outliers (> 4σ), and ≤ 5% of residues with Cβ deviation outliers (> 0.25 Å) (Lovell et al., 2003). Human protein structures were extracted and mapped onto UniProt IDs. To reduce complexity, only those PDBs that could be mapped onto only one UniProt ID were considered. As a result, 999 structures contributed to the experimental structure dataset.

Figure 3.1: The pipeline to analyse the structural impact of missense variants in experimental structures

3.2.2 Dataset of SAV variants

SAVs were curated by Eman Alhuzimi from UniProt Humsavar (release 4 February 2015), ClinVar (release 7 January 2015), and ExAC (version 0.3, released 13 January 2015) (Alhuzimi et al., 2018). The initial data consist of 26,884 disease-associated and 563,099 neutral variants. 10,117 SAVs (3,888 disease-associated and 6,229 neutral) were mapped to 796 out of the 999 UniProt IDs previously filtered from the Top8000 database.

These SAVs were then mapped onto PDB structures. SAVs in which the wild-type amino acid did not match the amino acid in the PDB structure were excluded from the analysis. Additionally, SAVs reported as both neutral and disease-associated by different data sources were removed. 1,965 disease-associated and 2,134 neutral SAVs (4,099 total) could be correctly mapped onto 606 PDB structures. For later analysis, neutral SAVs were further classified, according to their minor allele frequency (MAF), into 1,550 rare variants (MAF < 0.01), 273 common variants (MAF ≥ 0.01), and 311 variants of unknown frequency (no MAF reported).
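The MAF-based grouping of neutral SAVs described above amounts to a simple threshold rule, sketched below. The function name and the use of `None` for a missing MAF are illustrative assumptions rather than details of the original pipeline.

```python
from typing import Optional


def classify_neutral_sav(maf: Optional[float]) -> str:
    """Group a neutral SAV by its minor allele frequency (MAF)."""
    if maf is None:
        return "unknown"  # no MAF reported
    # Rare: MAF < 0.01; common: MAF >= 0.01.
    return "rare" if maf < 0.01 else "common"


for maf in (0.0005, 0.01, 0.2, None):
    print(maf, classify_neutral_sav(maf))
```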

3.2.3 Alignment of FASTA sequences to PDB structures

As SAV information is based on UniProt FASTA sequences, which may not have the same residue numbering as their corresponding PDB structures, UniProt sequence positions had to be cross-referenced with PDB positions. The alignment was performed using the script ‘alignfa2pdb’, developed by Dr Suhail Islam. This script utilises the CLUSTALW algorithm (Larkin et al., 2007).
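To illustrate the cross-referencing step, the sketch below turns an already-computed pairwise alignment into a position map. It is not the ‘alignfa2pdb’ script: the function name, its arguments, and the toy sequences are hypothetical, and the alignment itself is assumed to have been produced beforehand (for example by CLUSTALW).

```python
def map_uniprot_to_pdb(aligned_uniprot, aligned_pdb, pdb_residue_numbers):
    """Map UniProt positions (1-based) to PDB residue numbers.

    aligned_uniprot, aligned_pdb: equal-length aligned strings, '-' for gaps.
    pdb_residue_numbers: residue numbers of the PDB chain in sequence order.
    Returns {uniprot_position: (pdb_residue_number, residues_match)}.
    """
    if len(aligned_uniprot) != len(aligned_pdb):
        raise ValueError("aligned sequences must have the same length")
    mapping = {}
    uniprot_pos = 0   # last UniProt position seen
    pdb_index = -1    # index of the last PDB residue seen
    for u, p in zip(aligned_uniprot, aligned_pdb):
        if u != "-":
            uniprot_pos += 1
        if p != "-":
            pdb_index += 1
        if u != "-" and p != "-":
            mapping[uniprot_pos] = (pdb_residue_numbers[pdb_index], u == p)
    return mapping


# Toy example: the PDB chain lacks the first two residues and one internal residue.
mapping = map_uniprot_to_pdb("MKTAYIAK", "--TAYI-K", [10, 11, 12, 13, 15])
print(mapping)
```

The boolean in each entry flags whether the two residues agree, which mirrors the exclusion of SAVs whose wild-type amino acid does not match the residue in the PDB structure.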

3.2.4 Side chain repacking method

The following procedure was used to model a mutant (MT) structure to analyse the structural effect of a SAV in a PDB coordinate. First, the distances from each of the neighbouring residues to the target residue (where the substitution occurs) were measured. Any neighbouring residue that was < 5 Å from the target residue, measured as the closest atom-to-atom distance, had its side chain removed. The target residue was then replaced with the mutant amino acid. The removed side chains were then reintroduced using SCWRL4 (Krivov et al., 2009) with an instruction to repack only the affected side chains and keep all other side chains fixed. Side chain repacking by SCWRL4 does not affect backbone atoms.
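The neighbour selection in the first step can be sketched as follows. This is a simplified illustration that takes atom coordinates as plain tuples; the data layout, function names, and toy coordinates are assumptions, and the sketch does not call SCWRL4 or parse PDB files.

```python
import math


def min_atom_distance(atoms_a, atoms_b):
    """Closest atom-to-atom distance between two residues (in Å)."""
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)


def residues_to_repack(residues, target_id, cutoff=5.0):
    """Return IDs of neighbours whose closest atom is < cutoff Å from the target.

    residues maps a residue ID to a list of (x, y, z) atom coordinates.
    """
    target_atoms = residues[target_id]
    return sorted(
        res_id
        for res_id, atoms in residues.items()
        if res_id != target_id
        and min_atom_distance(atoms, target_atoms) < cutoff
    )


# Toy coordinates along the x axis; residue 10 is the target.
toy = {
    10: [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    11: [(4.0, 0.0, 0.0)],
    12: [(6.4, 0.0, 0.0)],
    13: [(10.0, 0.0, 0.0)],
}
print(residues_to_repack(toy, 10))
```

Using the closest atom pair rather than a single reference atom means that large side chains pointing towards the target are selected even when their Cα atoms are further away.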

Figure 3.2: Side chain repacking method. In this example glutamic acid (Glu) is to be replaced with phenylalanine (Phe). Thick green lines represent the protein backbone. (a) Any neighbouring residue that is within 5 Å of the query residue has its side chain removed. (b) Glu is then replaced with Phe and the repacking is performed using SCWRL4. Side chains of the other residues are not altered.

SCWRL4 was shown by its developers to be fast and accurate, with 86% accuracy in χ1 and 75% accuracy in χ1 and χ2 angle prediction, and was recommended for homology modelling. Most side-chain prediction methods are based on side-chain rotamer libraries, which were built by assessing the side chains of high-quality X-ray structures. SCWRL4 makes use of a novel high-quality rotamer library curated from 3,985 protein chains with a resolution of ≤ 1.8 Å, an R factor cut-off of 0.22, and a mutual sequence identity of ≤ 50% (Shapovalov and Dunbrack, 2011). SCWRL4 considers several factors including hydrogen bonding, van der Waals interactions, and steric clashes. SCWRL4 also aims to minimise the pairwise energy and was shown to outperform both its previous version SCWRL3 (Canutescu et al., 2003) and the fast side-chain prediction tool R3 (residue-rotamer-reduction) (Xie and Sahinidis, 2006).

One disadvantage of using SCWRL4 is that the program can slightly re-adjust all side chains using its pre-calculated side-chain library even if the residues are labelled as fixed. To avoid the potential error resulting from side-chain readjustment by SCWRL4 and to ensure that the detected structural problem is caused by the amino acid substitution and not by the noise generated by SCWRL4, for each variant a wild-type (WT) coordinate was generated in which each side chain of the original PDB was specified as fixed but still subjected to minor adjustment by SCWRL4. The WT and the original PDB structures would, in theory, be identical if the SCWRL4 repacking algorithm was perfect. Additionally, a repacked wild-type (WTR) coordinate was also generated. The generation of this structure is similar to that of the MT structure except that the amino acid type at the target residue is unchanged. The WTR structure was used as a control structure when assessing structural features based on bond distance: hydrogen bonds, salt bridges, and disulphide bonds. If any of these were different between the WT and WTR structures, that feature would not be assessed.

3.2.5 Calculation of damaging structural features

A residue was identified as buried if its relative solvent accessibility (RSA) was < 9%. DSSP (Kabsch and Sander, 1983) was used for calculating the absolute solvent accessibility (ASA) of each residue. RSA is derived by dividing the ASA by the maximum ASA for each amino-acid type. The values for maximum ASA used in this study were measured by Rost and Sander (1994) and are shown in Supplementary Table B.1.
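The RSA calculation and the 9% burial cut-off can be sketched as follows. The maximum ASA values below are those commonly tabulated from Rost and Sander (1994); this is an assumption for illustration, and Supplementary Table B.1 remains the authoritative source for the values used in this study. The function names are hypothetical.

```python
# Maximum ASA per residue type (Å^2), as commonly tabulated from
# Rost and Sander (1994). Assumption: see Supplementary Table B.1
# for the values actually used in the thesis.
MAX_ASA = {
    "A": 106.0, "R": 248.0, "N": 157.0, "D": 163.0, "C": 135.0,
    "Q": 198.0, "E": 194.0, "G": 84.0, "H": 184.0, "I": 169.0,
    "L": 164.0, "K": 205.0, "M": 188.0, "F": 197.0, "P": 136.0,
    "S": 130.0, "T": 142.0, "W": 227.0, "Y": 222.0, "V": 142.0,
}
BURIED_RSA_CUTOFF = 9.0  # percent, as stated in the text


def relative_solvent_accessibility(residue: str, asa: float) -> float:
    """RSA in percent: DSSP ASA divided by the residue's maximum ASA."""
    return 100.0 * asa / MAX_ASA[residue]


def is_buried(residue: str, asa: float) -> bool:
    """A residue is buried when its RSA is below 9%."""
    return relative_solvent_accessibility(residue, asa) < BURIED_RSA_CUTOFF
```

For example, a glycine with an ASA of 5 Å² has an RSA of about 6% and would be treated as buried.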

Hydrophilic residues are: D, E, H, K, N, Q, and R; hydrophobic residues are: A, C, F, I, L, M, V, and W; the remaining residues are regarded as neutral (G, P, S, T, and Y). D and E are treated as negatively charged, and H, K, and R as positively charged. However, it should be noted that there are several alternative definitions of hydrophobic residues, which are classified based on different hydrophobicity scales (?).

Based on previous studies on the structural effects of disease-associated SAVs and well-established principles of protein conformation (Yue et al., 2005, 2006; Al-Numair and Martin, 2013; Gao et al., 2015; Kucukkal et al., 2015; Bhattacharya et al., 2017), 17 structural features were considered and are described below.

Clash

A substitution is regarded as damaging if the MolProbity clash score is ≥ 30 and the increase in the clash score between the WT and the MT structures is ≥ 18. The clash score is measured locally by considering only atoms within 20 Å from the Cα of the target residue.

In order to calculate the clash score for each structure, the script ‘clashlistcluster’ (Word et al., 1999) from the Richardson Laboratory was employed. A clash was defined as a van der Waals overlap ≥ 0.4 Å and the clash score was defined as the average number of clashes per 1,000 atoms. Since a SAV is not likely to drastically increase the clash score of the entire protein, only local atoms within a 20 Å radius surrounding the Cα of the target residue were evaluated. A structure with a clash score of ≥ 30 is regarded as a poor structure by MolProbity (Hintze et al., 2016).
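The clash rule can be expressed as a short Python sketch. It assumes that the clash and atom counts for the local 20 Å sphere have already been obtained, and that the ≥ 30 threshold applies to the mutant (MT) structure; the function names are hypothetical.

```python
def clash_score(n_clashes: int, n_atoms: int) -> float:
    """Clashes (van der Waals overlap >= 0.4 Å) per 1,000 local atoms."""
    if n_atoms <= 0:
        raise ValueError("n_atoms must be positive")
    return 1000.0 * n_clashes / n_atoms


def clash_alert(wt_score: float, mt_score: float,
                min_score: float = 30.0, min_increase: float = 18.0) -> bool:
    """Flag the variant when the MT score is poor AND has risen sharply.

    Assumption: the >= 30 threshold is applied to the MT structure.
    """
    return mt_score >= min_score and (mt_score - wt_score) >= min_increase
```

Requiring both conditions means that a structure that was already poor before the substitution does not raise the alert unless the substitution makes it markedly worse.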

Disallowed phi/psi

A substitution is considered damaging if the phi/psi angles are observed to be in a disallowed region for the mutant amino acid. However, if the phi/psi angles in the wild-type amino acid are shown to be in a disallowed region, the SAV will not be regarded as damaging.

Disallowed phi/psi angles were determined by Lovell et al. (2003) using amino acid profiles from 500 high-quality and unique PDB structures (Top500 database). A script ‘ramachandran’, developed by Richardson’s research group, measures the phi and psi angles of each residue and reports if these torsion angles are in a favoured, allowed, or outlier (disallowed) region. Favoured regions cover 98%, whereas allowed regions include 99.95% of all residues in the Top500 database.

Secondary structure altered

A substitution is considered damaging if the MT residue is reported to have a different secondary structure from that in the WT residue. Although the backbone atoms are fixed, a substitution to proline can result in a loss of hydrogen bonds linking the backbone atoms, hence the breakage of the original secondary structure.

The secondary structure details including helices, strands, and turns for both the WT and MT structures were calculated using DSSP (Kabsch and Sander, 1983). DSSP assigns a secondary structure to each residue based on the hydrogen bonding between backbone atoms. The secondary structure is reported by DSSP as one of eight letters: H - alpha helix, B - beta bridge, E - strand, G - helix-3, I - helix-5, T - turn, S - bend, and blank.

Buried/exposed switch

A substitution is considered damaging if it results in a switch between the buried and exposed states of the target residue and the difference in RSA is at least 5%.

Buried hydrophilic introduced

A substitution is considered damaging when the wild-type residue is buried and hydrophobic while the mutant residue is hydrophilic.

Exposed hydrophobic introduced

A substitution is considered damaging when the wild-type residue is exposed and hydrophilic while the mutant residue is hydrophobic.

Buried charge introduced

A substitution is regarded as damaging when the wild-type residue is not a charged residue and is buried, and the mutant residue is charged.

Buried charge replaced

A substitution is considered damaging if the original amino acid is charged and buried, and the substitution introduces an uncharged residue.

Buried charge switch

A substitution is considered damaging if the wild-type residue is charged (+/-) and buried, and the substitution introduces a residue with the opposite charge.
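The hydrophobicity and charge rules described above can be sketched together in Python using the residue classes defined in Section 3.2.5. The function names and the returned labels are hypothetical, and treating every non-buried residue as exposed is an assumption of this sketch.

```python
HYDROPHILIC = set("DEHKNQR")
HYDROPHOBIC = set("ACFILMVW")
NEGATIVE = set("DE")
POSITIVE = set("HKR")


def charge(residue: str) -> int:
    """Return -1, +1 or 0 according to the charge classes in the text."""
    if residue in NEGATIVE:
        return -1
    if residue in POSITIVE:
        return 1
    return 0


def polarity_alerts(wt: str, mt: str, wt_buried: bool) -> list:
    """List the hydrophobicity/charge alerts raised by a WT -> MT substitution."""
    alerts = []
    if wt_buried:
        if wt in HYDROPHOBIC and mt in HYDROPHILIC:
            alerts.append("buried hydrophilic introduced")
        if charge(wt) == 0 and charge(mt) != 0:
            alerts.append("buried charge introduced")
        if charge(wt) != 0 and charge(mt) == 0:
            alerts.append("buried charge replaced")
        if charge(wt) * charge(mt) == -1:
            alerts.append("buried charge switch")
    elif wt in HYDROPHILIC and mt in HYDROPHOBIC:
        # Assumption: any residue that is not buried counts as exposed.
        alerts.append("exposed hydrophobic introduced")
    return alerts
```

A single substitution can raise more than one alert: a buried Leu to Asp change, for instance, is both a hydrophilic and a charge introduction.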

Buried salt bridge breakage

A substitution is regarded as damaging if the wild-type residue is buried and it forms a salt bridge with another residue and the substitution disrupts this bond.

To find salt bridges, SALT version 0.9, developed by Dr Suhail Islam, was used to analyse the WT, WTR, and MT structures. The program reports bonding between an oxygen atom from the side chain of an acidic residue (Asp and Glu) and a nitrogen atom from the side chain of a basic residue (Arg, His, and Lys). The length of a salt bridge is typically 4 Å. However, as slight shifts in the side chains can be expected from SCWRL4 repacking (and the structure assessed can be poorer in quality especially in homology models), a less stringent threshold of 5 Å was considered.

Buried H bond breakage

A substitution is considered damaging if it disrupts all side-chain/side-chain hydrogen bonds and side-chain/main-chain hydrogen bonds originally formed by the buried wild-type residue. As the backbone atoms are not altered, the hydrogen bonds connecting main chain/main chain are not considered here.

Hydrogen bonds are calculated using ‘Hbplus’ (McDonald and Thornton, 1994). The default setting in the program for the bond length is 2.9 Å (optimised for strong to moderate bond strength). To allow for more relaxation, the threshold of 3.9 Å was used.

Disulphide bond breakage

A substitution is considered damaging if the wild-type cysteine residue forms a disulphide bond with another cysteine residue.

To detect disulphide bonds, a binary script ‘disulphide’, developed by Dr Suhail Islam, was employed. A disulphide bond is defined as a bond between two sulphur atoms of cysteines that are less than 2.3 Å apart (program’s default setting). However, a less stringent threshold of 3.3 Å was used.
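The distance criterion can be illustrated with a minimal Python sketch. It is not the ‘disulphide’ program itself: the function name is hypothetical, and the coordinates of the two cysteine SG atoms are assumed to have been read from the PDB file beforehand.

```python
import math
from typing import Sequence


def has_disulphide(sg_a: Sequence[float], sg_b: Sequence[float],
                   max_distance: float = 3.3) -> bool:
    """True if two cysteine SG atoms are less than max_distance Å apart.

    The 3.3 Å default is the relaxed threshold used in this study;
    pass max_distance=2.3 for the stricter default quoted in the text.
    """
    return math.dist(sg_a, sg_b) < max_distance
```

With the relaxed threshold, a pair of sulphur atoms 3.0 Å apart is still counted as bonded, which tolerates the small side-chain shifts introduced by repacking.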

Cis Pro replaced

A substitution from proline whose backbone was originally in a cis-conformation is considered as damaging.

The omega angles were calculated using a binary script ‘torsions’ (version 1997), from Martin’s Bioinformatics group. Although an omega angle is theoretically either 0° or 180°, this number may vary slightly when measured from a PDB file. Thus, an omega angle between -45° and 45° was considered to be in the cis-conformation (-45° < omega < 45°).
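The cis test reduces to a simple range check once the omega angle is known. The sketch below assumes the angle is supplied in degrees for the peptide bond associated with the wild-type proline; the function names and the angle normalisation step are illustrative additions.

```python
def normalise_angle(angle_deg: float) -> float:
    """Map any angle in degrees onto the interval [-180, 180)."""
    return (angle_deg + 180.0) % 360.0 - 180.0


def is_cis(omega_deg: float, tolerance: float = 45.0) -> bool:
    """cis-conformation when -45° < omega < 45°."""
    return abs(normalise_angle(omega_deg)) < tolerance


def cis_pro_replaced(wt_residue: str, mt_residue: str, omega_deg: float) -> bool:
    """Flag a substitution away from a proline whose peptide bond is cis."""
    return wt_residue == "P" and mt_residue != "P" and is_cis(omega_deg)
```

Normalising first means an angle reported as 350° is handled the same way as -10°.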

Gly in a bend

A substitution is considered damaging if the wild-type residue is glycine and its secondary structure assigned by DSSP is ‘S’ for bend or sharp turn.

Glycine has only one hydrogen atom as its side chain. Since it can adopt a far larger region in the phi/psi backbone dihedral angle space than other amino acids, it is often found in a loop or a region where a polypeptide chain needs to make a sharp turn. Accordingly, glycine is often conserved.

Buried Gly replaced

Buried glycine is normally conserved. A substitution is considered damaging if the wild-type residue was a buried glycine.

Buried Pro introduced

Introducing a proline into a structure often poses structural constraints due to proline’s limited range of conformations (Choi and Mayo, 2006). Accordingly, a substitution was regarded as damaging if it replaced a buried wild-type residue with a proline.

Cavity altered

A substitution is considered damaging if it results in an increase or decrease of a cavity volume of at least 70 Å³. This is consistent with an upper limit of most observed cavities in proteins in a previous study by Hubbard et al. (1994).

KVFinder was used to calculate cavity volumes in both wild-type and mutant structures (Oliveira et al., 2014). The sizes of the probes were set to their default values: 1.4 Å for Probe In and 4.0 Å for Probe Out.

3.2.6 Statistical evaluation of performance

A disease-associated SAV identified as having at least one structural problem was regarded as a True Positive (TP); otherwise, it was regarded as a False Negative (FN). A neutral SAV identified as having at least one structural problem was regarded as a False Positive (FP), or as a True Negative (TN) if no structural problem was detected.

The performance of each structural feature was evaluated by considering True Positive Rate (TPR) and False Positive Rate (FPR), where TPR = TP/(TP+FN) and FPR = FP/(FP+TN).

A one-tailed z-test of proportions was used to assess whether the TPR for a particular structural feature was significantly greater than the FPR. The Benjamini-Hochberg procedure was used to correct for multiple testing (Benjamini and Hochberg, 1995).
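The evaluation can be sketched in plain Python. The thesis does not state which variant of the z-test was used, so the pooled two-proportion form below is an assumption, as are all function names. The Benjamini-Hochberg step returns adjusted p-values in the original order.

```python
import math


def rates(tp: int, fn: int, fp: int, tn: int):
    """Return (TPR, FPR) from the four confusion-matrix counts."""
    return tp / (tp + fn), fp / (fp + tn)


def one_tailed_z_test(tp: int, n_disease: int, fp: int, n_neutral: int) -> float:
    """P-value for H1: TPR > FPR (pooled two-proportion z-test, assumed)."""
    p1, p2 = tp / n_disease, fp / n_neutral
    pooled = (tp + fp) / (n_disease + n_neutral)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_disease + 1 / n_neutral))
    z = (p1 - p2) / se
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A feature would then be called significant when its adjusted p-value falls below the chosen level (0.01 in this chapter).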

3.3 Results

3.3.1 Performance of each structural feature

Figure 3.3: Performance of each structural feature. The TPR on disease-associated SAVs and the FPR on neutral SAVs are plotted as bars. The TPR/FPR ratios are given (axis on the right). For ease of viewing these are connected by a line. Significance at p < 0.01 (denoted by **) is evaluated in a one-tailed z-test of the difference of two proportions.

Figure 3.3 shows the performance of the 17 structural features sorted by TPR/FPR ratio. Overall, disease-associated SAVs could be well distinguished from neutral SAVs. 16 structural features had TPR/FPR ratios > 1.0. 15 of the TPR/FPR ratios were statistically significant (p < 0.01, z-test). These differences remained significant (p < 0.01) after the Benjamini-Hochberg procedure was applied for 17 multiple tests. Of particular mention are the top five features, which can strongly discriminate disease-associated SAVs from neutral SAVs: disulphide bond breakage (TPR/FPR 25.3), buried Pro introduced (TPR/FPR 8.7), clash (TPR/FPR 8.6), buried hydrophilic introduced (TPR/FPR 8.0), and buried charge introduced (TPR/FPR 7.1).

The disruption of disulphide bonds was the most discriminating feature, achieving a TPR/FPR ratio of 25.3. There were only three false positive cases, from C-type lectin domain family 4 member (CLC4A) (Variant Cys381Arg in PDB Code: 1XPH (A), UniProt ID: Q9H2X3, p.Cys381Arg), beta-2 glycoprotein 1 (APOH) (Cys306Gly in 3OP8 (A), P16581, p.Cys325Gly), and E-selectin (SELE) (Cys109Trp in 1G1T (A), P16581, p.Cys130Trp). One possibility is that the assignment of these SAVs as neutral is incorrect. Through manual inspection, a disulphide bond in each of the wild-type structures was confirmed. In particular, the breakage of disulphide bonds in beta-2 glycoprotein I due to Cys306 mutation was shown to lead to a loss of protein function of phospholipid binding if present in homozygosity (Sanghera et al., 1997). However, this SAV is reported on UniProt as neutral. Cys381Arg in CLC4A and Cys109Trp in SELE are extremely rare variants (MAF 0.00007 and unknown, respectively). Moreover, none of these proteins is essential according to the OGEE database (Chen et al., 2012). The first two proteins are classified as non-essential, and the third is ‘conditional’, which means the assignment is uncertain.

The next most discriminating feature was substituting a buried residue with a proline. As the allowed phi/psi angles are substantially reduced for proline compared to any other residue, introducing a proline can disrupt the structure (Lovell et al., 2003). The space limitation in the protein interior makes it much harder for the backbone to rearrange in order to accommodate the introduction of a proline.

‘Clash’ was the third most discriminating structural feature. ‘Buried Gly replaced’ was listed as a separate feature, although one might expect that substituting a glycine located in the protein core is highly likely to cause severe clashes. In this analysis, 115 disease-associated SAVs (5.85%) and 33 neutral SAVs (1.55%) were found to replace a buried glycine. However, 87 disease-associated SAVs (75.7%) and 30 neutral SAVs (90.9%) did not trigger the clash alert. This was, in part, a result of requiring a large number of clashes to generate this alert to minimise false positives in modelling.

The next four most discriminating features relate to the introduction of a hydrophilic residue or alteration of a charged residue in the protein core. These are consistent with the principle that an unpaired charged or polar group in the core is very likely to destabilise the protein (Lee and Vasmatzis, 1997).

The introduction of an exposed hydrophobic residue, in contrast, could not well distinguish disease-associated SAVs from neutral SAVs (TPR/FPR 0.63). The analysis on the crystal structure dataset showed that 103 disease-causing mutations (5.24%) and 177 neutral SAVs (8.29%) introduced a hydrophobic residue onto the protein surface (p < 0.01). This suggests that the introduction of a hydrophobic residue on the surface can still be well-tolerated, which is in broad agreement with the SAAPdap analysis on a different high-quality PDB dataset (Al-Numair and Martin, 2013). In SAAPdap, 10.43% of pathogenic mutations and 13.58% of neutral mutations were marked as creating hydrophobic surfaces. Nevertheless, the buried hydrophilic feature did not perform well in the SAAPdap dataset: 6.43% and 6.62% of disease SAVs and neutral SAVs were identified as having this structural problem. As the ‘exposed hydrophobic’ feature did not contribute to strong discrimination between disease-associated and neutral SAVs, it was removed from the final feature list.

Substitution from cis proline is the only feature that was not significant, despite a TPR/FPR ratio of 1.90 (p = 0.30). As cis-conformation prolines are rare, only a few variants were detected: 7 disease-causing SAVs (0.36%) and 4 neutral SAVs (0.19%).

Overall, 40.1% (788) of the disease-associated SAVs and 11.4% (244) of the neutral SAVs triggered at least one structural alert (TPR/FPR 3.51, p < 0.01). In evaluation of the result, one should consider that the FPR could be overestimated since a structurally damaged protein may not result in a disease if the corresponding gene is non-essential. Moreover, some SAVs reported as neutral may actually be disease-associated but this is yet to be identified. Full details of each structural feature including TPR and FPR with 95% confidence intervals are provided in Table 3.1.

Table 3.1: Performance of Missense3D on experimental structures

Feature assessed                                           TP    FP    TPR ± CI (%)    FPR ± CI (%)    P-value
Disulphide bond breakage                                   70     3    3.56 ± 0.82     0.14 ± 0.16     < 0.0001
Buried Pro introduced                                      64     8    3.26 ± 0.78     0.37 ± 0.26     < 0.0001
Clash                                                     103    13    5.24 ± 0.99     0.61 ± 0.33     < 0.0001
Buried hydrophilic introduced                              81    11    4.12 ± 0.88     0.52 ± 0.30     < 0.0001
Buried charge introduced                                  170    26    8.65 ± 1.24     1.22 ± 0.47     < 0.0001
Secondary structure altered                                21     4    1.07 ± 0.45     0.19 ± 0.18     0.0001
Buried charge switch                                       19     4    0.97 ± 0.43     0.19 ± 0.18     0.0004
Disallowed phi/psi                                        115    29    5.85 ± 1.04     1.36 ± 0.49     < 0.0001
Buried charge replaced                                     99    25    5.04 ± 0.97     1.17 ± 0.46     < 0.0001
Buried Gly replaced                                       115    33    5.85 ± 1.04     1.55 ± 0.52     < 0.0001
Buried H-bond breakage                                    148    53    7.53 ± 1.17     2.48 ± 0.66     < 0.0001
Buried salt bridge breakage                                62    23    3.16 ± 0.77     1.08 ± 0.44     < 0.0001
Cavity altered                                            105    41    5.34 ± 0.99     1.92 ± 0.58     < 0.0001
Buried/exposed switch                                      83    38    4.22 ± 0.89     1.78 ± 0.56     < 0.0001
Cis Pro replaced                                            7     4    0.36 ± 0.26     0.19 ± 0.18     0.1483
Gly in a bend                                              43    25    2.19 ± 0.65     1.17 ± 0.46     0.0054
Exposed hydrophobic introduced (evaluated but not used)   103   177    5.24 ± 0.99     8.29 ± 1.17     0.9999
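The confidence intervals in Table 3.1 appear consistent with a normal-approximation (Wald) interval for a proportion, although the thesis does not name the method, so this is an assumption. The sketch below also assumes denominators of 1,966 disease-associated SAVs (inferred from 788 variants being 40.1%) and 2,134 neutral SAVs (the sum of the common, rare, and unknown groups); the function name is hypothetical.

```python
import math


def wald_ci_halfwidth(successes: int, total: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation 95% CI for a proportion."""
    p = successes / total
    return z * math.sqrt(p * (1 - p) / total)


# Disulphide bond breakage row: TP = 70 of an assumed 1,966 disease SAVs.
tpr_halfwidth_percent = 100 * wald_ci_halfwidth(70, 1966)
# Same row: FP = 3 of 2,134 neutral SAVs.
fpr_halfwidth_percent = 100 * wald_ci_halfwidth(3, 2134)
```

Under these assumptions the two half-widths round to the 0.82 and 0.16 percentage points shown in the first row of the table.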

3.3.2 Insights into variants affecting secondary structure

Given that the backbone coordinates are fixed, only a variant involving proline can result in a disruption of hydrogen bonds linking backbone atoms at the secondary structure level. 25 cases, comprising 21 disease-associated SAVs and 4 neutral SAVs (1.07% TPR and 0.19% FPR respectively), were found to affect the secondary structure assigned by DSSP. The list of variants is provided in Table 3.2.

As expected, all variants involved a proline. Two of them had proline as the wild-type amino acid and were reported as neutral. The rest were substitutions to proline. The three most common substitutions were Ser to Pro (30%), Leu to Pro (17%), and Arg to Pro (17%). Many of these proline substitutions had been well-studied and shown to be disease-associated. For instance, p.Ser155Pro on sulfatase-modifying factor 1 (Gene: SUMF1, UniProt ID: Q8NBK3, position 1155 in PDB 1Z70 chain X) results in the breakage of the beta strand, which spans from residue 154 to residue 158. The breakage of the beta strand was shown to result in loss of the activity of the protein (Cosma et al., 2003, 2004).

Introducing a proline could result in disallowed phi/psi angles in addition to the secondary structure being altered. After inspection, about 30% of secondary-structure-breaking proline substitutions did not trigger a disallowed phi/psi angle alert. Therefore, the inclusion of secondary structure disruption caused by introducing a proline in the structural feature assessment list would somewhat increase the overall TPR.

Structural analysis of all four neutral SAVs that resulted in the alteration of secondary structures found that most SAVs are located either at the end of alpha helices or near the start of beta strands (see Figure 3.4). Moreover, all these SAVs are on the protein surface. Pro202 in dehydrogenase/reductase SDR family member 4 (Gene: DHRS4, UniProt ID: Q9BTZ2, PDB: 3O4R chain B) is at the end of a long alpha helix which spans from residue 180 to 203. The substitution to Ser shortens the helix. Pro328 (PDB position 329) in dihydroorotate dehydrogenase (DHODH, Q02127, 1D3G chain A) is a disordered residue adjacent to the start of a beta strand (PDB position 330-333). The substitution to Leu makes this residue become part of the strand. Gln83 (position 43 in PDB) in the low affinity immunoglobulin gamma Fc region receptor II-b (FCGR2B, P31994, 2FCB chain A) is located at the beginning of a short beta strand and the substitution to proline truncated the strand by one residue. This suggests that alteration at the margin of an alpha helix or beta sheet could be tolerated. Lastly, Gln226 in N-glycosylase/DNA lyase (OGG1, O15527, 2XHI chain A) is near the end of an alpha helix spanning from residue 222 to 229. However, the change in the DSSP-assigned secondary structure was from alpha-helix to helix-3. There was not enough evidence to suggest the pathogenicity of any of these mutations.

Figure 3.4: Cartoon representation of four neutral variants that triggered the secondary structure alert. The wild-type structures are shown on the left in light green. The mutant structures are on the right in pink. The wild-type residues at the position of the variants are highlighted in green and the mutant residues are highlighted in red.

Table 3.2: Variants leading to secondary structure altered

UniProt   Variant       PDB        Type      Phi/psi      Consequence
Q8NBK3    p.Ser155Pro   1Z70 (X)   Disease   Disallowed   Strand breakage
P02766    p.Leu75Pro    3CN4 (B)   Disease   Disallowed   Strand breakage
P13569    p.His620Pro   2PZE (B)   Disease   Disallowed   Strand to Bend
Q14896    p.His257Pro   3CX2 (A)   Disease   Allowed      Strand breakage
P05543    p.Leu247Pro   2XN6 (A)   Disease   Disallowed   Strand breakage
O75369    p.Ser188Pro   2WA7 (A)   Disease   Allowed      Helix-3 to Turn
P00480    p.Arg141Pro   1OTH (A)   Disease   Allowed      Strand breakage
P35219    p.Ser100Pro   2W2J (A)   Disease   Disallowed   Strand breakage
P02545    p.Arg541Pro   3GEF (A)   Disease   Disallowed   Strand breakage
P19438    p.Arg121Pro   1EXT (A)   Disease   Disallowed   Beta bridge breakage
P19438    p.His134Pro   1EXT (A)   Disease   Disallowed   Strand breakage
P22033    p.Gln293Pro   2XIJ (A)   Disease   Allowed      Alpha helix to Turn
P00439    p.Thr238Pro   1J8U (A)   Disease   Disallowed   Alpha helix to Turn
P00439    p.Arg413Pro   1J8U (A)   Disease   Disallowed   Strand breakage
P35555    p.Ser827Pro   2W86 (A)   Disease   Disallowed   Strand to Bend
P49748    p.Ala213Pro   2UXW (A)   Disease   Disallowed   Strand breakage
P00813    p.Leu107Pro   3IAR (A)   Disease   Allowed      Helix-3 to Turn
P30566    p.Ser438Pro   2J91 (A)   Disease   Allowed      Turn to Helix-3
P53673    p.Leu69Pro    3LWK (A)   Disease   Disallowed   Strand breakage
P08235    p.Ser805Pro   2AAX (A)   Disease   Allowed      Alpha helix to Turn
Q9UBC3    p.Ser270Pro   3FLG (A)   Disease   Disallowed   Strand breakage
Q9BTZ2    p.Pro202Ser   3O4R (B)   Neutral   Allowed      Helix-3 to Turn
Q02127    p.Pro328Leu   1D3G (A)   Neutral   Allowed      Disordered to Strand
P31994    p.Gln83Pro    2FCB (A)   Neutral   Disallowed   Strand breakage
O15527    p.Gln226Pro   2XHI (A)   Neutral   Allowed      Alpha helix to Helix-3

3.3.3 Overlapping between SIFT prediction and Missense3D analysis

SIFT is based on sequence conservation. Similar to many other sequence-based predictors, SIFT was shown to provide a high sensitivity but at the expense of a low specificity (Miosge et al., 2015). When SIFT was employed to predict the pathogenicity of SAVs used in this study, it could correctly identify the majority of the disease-associated SAVs (TPR = 86.8%). However, the false positive rate of SIFT was very high (46.2%), resulting in the TPR/FPR ratio of 1.88 (p < 0.01, z-test). Although Missense3D did not yield a TPR (40.1%) as high as that of SIFT, it had a far lower FPR (11.4%).

Figure 3.5: Overlap between SIFT and structural analysis in the identification of disease-associated SAVs with the corresponding numbers of TPs and FPs.

When comparing variants predicted to be deleterious by Missense3D and SIFT, a substantial overlap can be observed as shown in the Venn diagram (Figure 3.5). SIFT covers nearly all disease SAVs identified via structural analysis (94.4%). When the SIFT prediction was supported by structural explanation, the number of false positive hits was greatly reduced. The new TPR became 37.9%, and the new FPR became 8.3%, achieving the maximum TPR/FPR ratio of 4.56. In many settings, particularly in clinical research, a low FPR is essential (Richards et al., 2015; Ellard et al., 2017). Thus, structural confirmation of sequence-based predictions can help prioritize disease-associated SAVs.
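Requiring both methods to agree amounts to taking the intersection of their positive calls. A minimal Python sketch follows, assuming each method's output is available as a list of booleans in the same variant order; the function names are hypothetical and any call lists passed in are illustrative rather than the study data.

```python
def rate(calls) -> float:
    """Fraction of variants flagged; calls is a non-empty list of booleans."""
    return sum(calls) / len(calls)


def consensus(sift_calls, structure_calls):
    """Flag a variant only when both SIFT and structural analysis flag it."""
    return [a and b for a, b in zip(sift_calls, structure_calls)]


def tpr_fpr_ratio(disease_calls, neutral_calls) -> float:
    """TPR/FPR ratio from calls on disease-associated and neutral variants."""
    return rate(disease_calls) / rate(neutral_calls)
```

Because the intersection can only remove positive calls, the consensus lowers both the TPR and the FPR; the ratio improves when false positives fall proportionally faster, as reported above.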

3.3.4 Analysis of rare, common, and unknown neutral variants

Figure 3.6: Positive rates of common, rare, and unknown SAVs. 95% confidence intervals on the positive rates are shown as lines.

11.4% of neutral variants were identified as causing at least one structural problem. Neutral variants were then further separated into three groups based on their minor allele frequency (MAF): 273 common (MAF ≥ 0.01), 1,550 rare (MAF < 0.01), and 311 unknown (no MAF available).
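The grouping by allele frequency can be sketched as follows. Representing a missing MAF as `None` and the function names are assumptions made for this example.

```python
from typing import Dict, Iterable, Optional, Tuple


def maf_group(maf: Optional[float], threshold: float = 0.01) -> str:
    """Classify a variant as common (MAF >= 0.01), rare, or unknown."""
    if maf is None:
        return "unknown"
    return "common" if maf >= threshold else "rare"


def positive_rate_by_group(
    variants: Iterable[Tuple[Optional[float], bool]]
) -> Dict[str, float]:
    """variants: (maf, flagged) pairs -> fraction flagged in each MAF group."""
    counts: Dict[str, Tuple[int, int]] = {}
    for maf, flagged in variants:
        group = maf_group(maf)
        total, hits = counts.get(group, (0, 0))
        counts[group] = (total + 1, hits + int(flagged))
    return {group: hits / total for group, (total, hits) in counts.items()}
```

Applied to the neutral dataset, the per-group positive rate is the group-specific FPR.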

Common neutral variants had the lowest FPR among all three: only 5.9% were identified. When only rare neutral variants were considered, their FPR became 11.6%. Neutral variants of unknown frequency (presumably extremely rare) had the highest FPR of 15.8%. A similar trend was observed when considering SIFT predictions. From SIFT, the FPRs for common, rare, and unknown variants were 29.3%, 48.2%, and 51.1%, respectively.

These results suggest that rare variants and variants of unknown frequency are more likely to be predicted as disease-causing, even when using a structural analysis method. This finding is also consistent with the results from the previous study on pleiotropy (reported in Chapter 2), which showed that rare variants and variants with no MAF have characteristics similar to those of disease-associated variants. Therefore, several rare and unknown SAVs that raise structural alerts should be subject to further investigation.

3.4 Discussion

The percentage of SAVs that affect structural stability differs across research groups

Structural instability is one of the possible outcomes caused by single amino acid variants. A variant can be well-tolerated in terms of protein folding but have a deleterious effect on protein-protein interactions or ligand binding. It is unclear what percentage of disease-associated variants cause structural instabilities, as different percentages were identified by different research groups.

A pioneering study by Wang and Moult (2001) reported that 83% of disease-causing missense mutations affect protein stability. However, this analysis was performed on a disease set of only 23 randomly selected unique proteins (262 variants) and a neutral set of only 22 proteins (42 variants). The study was performed using in silico analysis. According to their results, a clash was the most common structural impact (22%), followed by loss of hydrogen bond (21%), backbone constraints (disallowed phi/psi angles and cis proline, 19%), and buried charged residues (14%). However, a large number of variants may have been identified to affect protein stability because of very sensitive criteria. For example, their definition of clash is when a substitution shortens atomic contacts (< 2.5 Å for buried residues and < 2.0 Å for exposed residues), whereas Missense3D requires a drastic increase in the clash score. Their hydrogen bond and salt bridge criteria also include variants on the protein surface, whereas Missense3D only considers core residues. For the 42 neutral variants, Wang and Moult showed that 30% of variants affected protein stability. The relatively high percentage of neutral SAVs causing structural instability could be explained by the sensitive criteria used.

In a recent study published by Bhattacharya et al. (2017) on 374 SAVs in 334 unique PDB structures, 44.5% of disease variants and 3.4% of neutral variants were found to affect protein stability. The dataset contains 71 variants confirmed as disease-associated, 39 variants that are likely to be disease-associated, and 264 confirmed as neutral. All of the variants analysed have their PDB structures experimentally solved. The effects of variants were not predicted in silico. Instead, they were manually curated and verified from published literature. 58 (16%) out of 374 variants were found to affect protein stability (Figure 3.7). Among these, only nine were neutral SAVs. The majority (43 variants) were confirmed as disease-associated, whereas six were marked as posing a risk for the disease. The rest of the non-neutral variants were found to equally affect other properties such as protein activity, aggregation, binding, assembly, and rearrangement. This means that approximately 44.5% (49 out of 110) of non-neutral variants can be expected to hinder the folding and stability of the protein. When only the 71 variants with strong evidence of disease were considered, 60.6% were found to cause structural instability.

SAAPdap, developed by Al-Numair and Martin (2013), analyses variants using a range of features, including structural instability, protein folding, interface, and sequence conservation. In the SAAPdap benchmark dataset, many neutral SAVs were identified as causing structural problems, which contradicts the finding by Bhattacharya et al. SAAPdap reported that 63.2% of disease-associated and 30.0% of neutral SAVs affected protein stability, and 28.4% of disease-associated and 10.3% of neutral SAVs affected protein folding. When interface, binding sites, and sequence conservation were incorporated, this analysis tool could identify 85.1% of disease-associated SAVs. Nevertheless, it also identified a large number of neutral SAVs (64.2%). One can see that SAAPdap tends to overpredict disease variants. 15.3% of disease-associated and 6.2% of neutral SAVs were found to cause severe clashes. 26.7% of disease and 16.2% of neutral SAVs disrupted hydrogen bonds. A large proportion of disease-associated (34.5%) and neutral (17.5%) SAVs resulted in cavity problems.

Figure 3.7: Effects of mutations on proteins. Figure adapted from the results by Bhattacharya et al. (2017)

In contrast to Missense3D, SAAPdap performs very poorly on the hydrophilic core feature. 6.4% of disease versus 6.2% of neutral SAVs were identified (TPR/FPR 1.03), whereas in Missense3D, 4.1% of disease but only 0.5% of neutral SAVs were detected (TPR/FPR 8.0).

Based on Al-Numair and Martin’s analysis pipeline, the prediction can yield a very high false positive rate if a simple boolean logic gate is used. Accordingly, a machine learning algorithm is employed to yield the final prediction. SAAPdap was developed using a random forest algorithm and a 10-fold cross-validation. The authors showed that their prediction method had an accuracy of 84.6%. Accuracy is another widely-used term for performance evaluation. It measures the proportion of True Positives and True Negatives among all cases examined: Accuracy = (TP+TN)/(TP+TN+FP+FN). SAAPdap was reported to outperform SIFT and PolyPhen-2. The performance of SAAPdap, however, cannot be compared to Missense3D since the latter only investigates structural impacts, which contribute to 45-60% of the disease-associated SAVs according to Bhattacharya et al.’s finding (Bhattacharya et al., 2017).

Missense3D can identify structural impacts in 40.1% of disease-associated SAVs and the false positive rate was 11.4%, which is very close to the finding by Bhattacharya et al. Note that a variant that does not destabilise the tertiary structure can still cause disease, for example, by increasing or decreasing protein activity, causing aggregation, and preventing ligand or protein binding.

Rare neutral variants are more likely to be misinterpreted as disease-causing

Most rare missense variants were shown to be deleterious. A study on de novo rare missense variants curated from various sources including HGMD, NIEHS-EGP, SeattleSNPs, and JSNP by Kryukov et al. (2007) shows that 20% of them were severely damaging, 53% were mildly deleterious, and the rest (27%) were benign. Their finding was also in agreement with studies by other research groups (Yampolsky et al., 2005; Eyre-Walker et al., 2006).

Most widely-used tools to predict the effect of a SAV do not distinguish rare neutral variants from disease-associated variants. Rare neutral variants are more likely to be falsely predicted as disease-associated than common neutral variants because those prediction tools were mostly trained on imbalanced datasets, which are often dominated by common neutral variants (Li et al., 2013). Rare variants are considered to have similar structural and functional properties to disease variants and are subject to strong purifying selection.

Ioannidis et al. (2016) constructed a novel method for variant prediction by using the prediction scores from 18 widely-used predictors including SIFT, PolyPhen-2, PROVEAN, and MutationAssessor. A random forest machine learning technique was employed to help classify the variants based on the 18 prediction scores. This technique of using accumulated scores was shown to give more accurate results when predicting the pathogenicity of rare variants. Although random forest can show which features contribute to the classification of disease-associated SAVs, in-depth structural analysis of variants remains necessary.

In the study presented in this chapter, when the structural analysis was performed on the neutral variant dataset, rare variants and variants of unknown frequency (pre- sumably very rare) were found to cause more structural defects compared to common variants. Only 5% of common neutral variants were found to cause structural prob- lems. This was half the proportion found in rare neutral variants. The percentage was highest for neutral variants of unknown frequency - about three times of the percentage in common variants.

In Chapter 2, rare and unknown variants are shown to have similar characteristics to disease-associated variants. Accordingly, it is possible that the rare and unknown variants analysed in the present chapter may actually cause structural problems, but the effects caused by these variants are minor or can be tolerated. Alternatively, they might be deleterious, but there has not been sufficient clinical evidence to confirm their pathogenicity. Therefore, variants that are predicted in silico to have a structural impact should be prioritised for further laboratory study.

Structural analysis is crucial for studying variants of uncertain significance

The analysis pipeline and features used in this chapter (except buried hydrogen bond) were applied to study an additional set of 641 genes causing endocrine and metabolic disorders. These genes harboured 12,266 unique missense variants, of which 4,123 (33.7%) were variants of uncertain significance (VUSs) and 6,183 were disease-causing. This study was jointly conducted with Dr Alessia David, who performed data curation and all statistical analysis. The study was published in Journal of the Endocrine Society (title: “Structural Biology Helps Interpret Variants of Uncertain Significance in Genes Causing Endocrine and Metabolic Disorders”) (Ittisoponpisan and David, 2018). The study confirmed the low specificity (high FPR) of sequence-based prediction by SIFT, PolyPhen-2, MutationAssessor, and Condel. Only 55.8% of the neutral variants were correctly predicted as benign by all the predictors. When VUSs were taken into consideration, 38.6% of them were predicted to be damaging by the majority of the in silico predictors. However, only 30.9% of these VUSs could be confirmed as structurally damaging. Conversely, of the VUSs predicted to be benign by the majority of predictors, 90.3% were confirmed as not causing structural damage. These results highlight that although the use of multiple in silico predictors may be useful in prioritising potentially damaging variants, in-depth understanding of the molecular mechanisms of the disease remains crucial. Protein structural analysis is a compelling tool for prioritising genetic variants and should be used more extensively, especially for assessing rare variants and VUSs.

Different structural problems may be identified when using a fixed backbone strategy: a case study on transthyretin (TTR)

Transthyretin (TTR) (UniProt ID: P02766) is a protein responsible for transporting the thyroid hormone thyroxine (T4) and retinol-binding protein bound to retinol. Its name is derived from transports thyroxine and retinol. Transthyretin functions as a homotetramer unit consisting of two dimers. Each dimer, consisting of two identical 147-amino-acid monomers, is capable of binding one thyroxine. Thus, there are two thyroxine binding sites per tetramer (Figure 3.8).

Transthyretin has been linked to several diseases including senile systemic amyloidosis (SSA) (Westermark et al., 1990), amyloid diseases (Zeldenrust and Benson, 2010), familial amyloid cardiomyopathy (FAC) (Jacobson et al., 1997), and familial amyloid polyneuropathy (FAP) (Coelho, 1996). In many cases, the diseases are due to amino acid substitutions that lead to amyloid fibril formation (Figure 3.8d).

Figure 3.8: Transthyretin (TTR). (a) monomer, (b) dimer, (c) tetramer (native conformation), (d) a fibril consisting of 8 monomer subunits. The structures of (a)-(c) were from PDB 3CN4. The fibril structure consisting of 8 monomer subunits was experimentally determined by X-ray crystallography for the Leu55Pro missense variant (PDB code: 5TTR, resolution 2.7 Å). Figures generated by QuteMol (Tarini et al., 2006).

Among several variants that have been widely studied, the variant Leu55Pro (position 75 in UniProt FASTA sequence, also known as p.Leu75Pro) was reported to cause early-onset familial amyloidotic polyneuropathy (Jacobson et al., 1997). The structural analysis performed by Missense3D showed that replacing leucine with proline introduces a backbone constraint, thus causing an impact on the protein structure. The limited range of conformation of proline leads to disallowed phi/psi angles. However, this effect cannot be visualised as the predicted mutant structure would have an identical backbone to the wild-type structure. Nevertheless, one can assume misfolding of the protein structure as the consequence of having torsion angles that fall in the Ramachandran outlier region. When the mutant crystal structure (PDB code: 5TTR, chain A) was superposed onto the wild-type crystal structure (PDB code: 3CN4, chain A), it can be seen that Leu55Pro leads to the twist of the backbone atoms on strand D from residue 54 to 56 (Figure 3.9). In its native conformation, Leu55, located in strand D, forms a backbone-to-backbone hydrogen bond with Val14 located in strand A. The substitution to proline at position 55 leads to part of strand D being pushed away from strand A, disrupting the strong hydrogen bond and resulting in 15% less dimer interface contact area (Sebastião et al., 1996, 1998). The impairment of dimer-dimer contacts has been shown to reduce the stability of the tetramer and promote amyloidogenicity (McCutchen et al., 1993).

Figure 3.9: Comparison between wild-type and mutant structures of TTR. (a) wild-type crystal structure (PDB code: 3CN4, resolution 1.4 Å). The wild-type residue Leu55 (position 75 in UniProt FASTA sequence) is highlighted in orange. (b) mutant crystal structure (PDB code: 5TTR). The mutant Pro55 is highlighted in red. (c) superposition of wild-type (orange) and mutant (light blue) experimental structures. The small box shows hydrogen bonds originally formed between Val14 and Leu55 in the wild-type structure.

Multiple structural problems can be caused by a single variant: a case study on phenylalanine hydroxylase

Figure 3.10: Phenylalanine hydroxylase (PAH). This protein consists of three domains: regulatory domain, catalytic domain, and tetramerization domain. The globular protein image represents the catalytic domain of the protein PAH from PDB 1J8U chain A (resolution 1.5 Å).

Phenylalanine hydroxylase (PAH, UniProt ID: P00439) is a protein responsible for converting the essential amino acid L-phenylalanine (L-Phe) into L-tyrosine using the co-factors tetrahydrobiopterin (BH4), iron, and oxygen. It consists of three domains: the regulatory N-terminal domain (residue 1 to 142), the catalytic domain (residue 143 to 411) and the tetramerisation domain (residue 412 to 452). The pocket-like active site is located in the catalytic domain (Figure 3.10). When PAH loses its enzymatic activity or expression, phenylalanine builds up in the body. If left untreated, the harmful levels of phenylalanine can cause phenylketonuria (PKU, MIM: 261600), a complex condition with intellectual disability and several other severe health conditions (Erlandsen et al., 2003). PAH is known to be enriched in disease variants. Here, two disease-associated variants, Arg158Trp and Arg243Gln, are discussed.

Figure 3.11: Variants in phenylalanine hydroxylase (PAH). (a) substitution from arginine to tryptophan at position 158 results in the loss of a salt bridge which connected this residue with glutamic acid at position 280 in the wild-type structure. (b) arginine at position 243 forms a salt bridge to aspartic acid at position 129 in the wild-type structure. The substitution replacing arginine with glutamine results in the loss of salt bridge. Both mutations disrupt the salt bridges and were shown to reduce enzymatic activity - causing phenylketonuria.

Arg158 is a buried residue (RSA 4.8% as detected by Missense3D on PDB 1J8U chain A) that originally forms a salt bridge with an exposed residue Glu280 (distance 3.0 Å) in the wild-type structure. As both residues are near the active site of the protein, this electrostatic interaction is considered to be crucial for maintaining the shape of the active site (Erlandsen et al., 2003). Missense3D shows that the substitution from arginine to tryptophan disrupts the buried salt bridge that holds the two residues together, thus suggesting destabilisation of protein structure (Figure 3.11). Moreover, Missense3D identified additional structural problems caused by this substitution, such as loss of the hydrogen bond formed between Arg158 and Tyr154, a switch between buried and exposed state of residue 158 (new RSA 16.2%), and reduction of the volume of the surface cavity near the active site by 91.6 Å³. These multiple structural problems strongly suggest that this variant is deleterious. Previous in vitro studies found that this amino acid substitution destabilises the conformation of the active site, thus resulting in an almost absent enzymatic activity compared to the wild-type protein (Martinez et al., 1995; Erlandsen et al., 2003). This variant is reported in UniProt as causing phenylketonuria.

Similarly, another residue, Arg243 in the Cβ1 strand, forms a buried salt bridge to Asp129 in the Cα1 helix. Missense3D confirmed that the wild-type residue was buried (RSA 4.8%) and it formed a salt bridge with Asp129 (distance 4.0 Å). The replacement of arginine with glutamine was shown to not only disrupt the electrostatic interaction but also introduce an uncharged amino acid. This damaging effect reported by Missense3D is in agreement with in vitro results, which showed < 1% residual activity when compared to the wild-type protein (Okano et al., 1991).

3.5 Conclusion

In this chapter, the analysis pipeline of Missense3D to study the effect of SAVs on protein structures is reported. Missense3D shows that approximately 40% of disease-associated variants are structurally damaging. Many traditional prediction methods have been shown to overpredict disease-associated variants. Therefore, a structural analysis should be undertaken to complement sequence-based predictions in order to screen out false positives, thus reducing the cost of laboratory experiments.

Chapter 4

Structural Effects Of Mutations On Homology Models

This chapter explores the use of Missense3D with predicted structures of the same protein set used in the previous chapter; predicted structures at various ranges of sequence identity were used to see whether they could provide reliable structural analysis results in the absence of experimental 3D coordinates.

A manuscript titled “Missense3D: Can Predicted Protein 3D-structures provide Reliable Insights into whether Missense Variants are Disease-associated?” - authored by Ittisoponpisan S (myself), Islam SA, Alhuzimi E, David A and Sternberg MJE - was submitted in September 2018 for publication. To accompany the manuscript, a webserver version of Missense3D has been made available at http://www.sbg.bio.ic.ac.uk/~missense3d for all users.

4.1 Background

One of the great benefits of performing qualitative structural analysis on genetic variants is that the impact of a genetic variant can be addressed through observable changes in protein structure. Nevertheless, a major limitation in structural analysis lies in the availability of PDB coordinates. According to Mizianty et al. (2014), the structural coverage of the human proteome obtained by X-ray crystallography (as of 2014) was estimated to be around 14% of residues. Although this number had increased to around 17% in early 2018, the effect of the vast majority of human variants still cannot be interpreted at the structural level.

To overcome this limitation, one can make use of predicted structures, which can be obtained by either comparative protein modelling (i.e. homology-built structures) or ab initio modelling. Homology modelling is based on the assumption that two homologous proteins are likely to share similar protein folds (Chothia and Lesk, 1986), whereas ab initio modelling is template-free, less reliable, and often used to construct a small part of a protein. Structural coverage of the human proteome (including homology models) has increased over the past decade. Khafizov et al. (2014) showed that this coverage went from approximately 17% in 2000 to 35% in 2011. A later estimate in December 2017 by Somody et al. (2017) suggested that the coverage was approximately 50% if homology models of at least 30% sequence identity were considered. Therefore, incorporating homology structures in structural analysis is of great benefit as it allows many more variants to be analysed.

Although the qualitative structure-based analysis method SAAPdap (Al-Numair and Martin, 2013) can identify structural problems that arise from a SAV, it is not capable of analysing predicted structures. In fact, it does not allow users to upload and analyse their own protein PDB coordinates, as the program relies entirely on the experimental structures available in the PDB database. Because of this, analyses using SAAPdap are inevitably limited to a small number of variants.

Another structural analysis tool, HOPE (Venselaar et al., 2010), gives more flexibility in the input by allowing users to provide a protein sequence or a UniProt ID. It predicts a homology structure using YASARA (Krieger et al., 2002). HOPE, however, relies on information from many external servers and often fails to deliver results. Most importantly, no benchmark was performed using predicted structures in their structural analysis pipeline. Hence, it is still questionable whether using predicted structures can yield results as accurate as using experimental structures.

In this chapter, the analysis by Missense3D presented in Chapter 3 is repeated on a dataset of predicted structures. The performance of each of Missense3D's sixteen structural features is evaluated and discussed.

4.2 Materials & methods

4.2.1 Analysis pipeline

Mutant structure prediction and structural analysis methods are as described in Chapter 3. Instead of the 606 high-quality X-ray structures from the Top8000 database, homology-built structures at various sequence identities to the template queries were analysed. A summary of sixteen structural features used by Missense3D and their criteria is presented in Table 4.1.

4.2.2 Datasets of predicted structures and variants

The 606 FASTA sequences of the proteins used in Chapter 3 (from the Top8000 database) were extracted and the structures of those proteins were predicted using Imperial College London’s in-house homology-based prediction tool Phyre2 (Kelley and Sternberg, 2009; Kelley et al., 2015). Phyre2 is available in normal mode (using one template per one protein modelling) and intensive mode (using multiple templates) (see Section 1.4.3 in Chapter 1 for more information). In this study, the normal mode was used as the original experimental structures are all monomers with one domain.

Table 4.1: The 16 structural features evaluated by Missense3D

Feature | Description
Disulphide bond breakage | The substitution breaks a disulphide bond in the wild type. The maximum S-S bond length is 3.3 Å.
Buried Pro introduced | The substitution introduces a proline to a buried environment.
Clash | The mutant structure has a MolProbity clash score ≥ 30 and the increase in the clash score is > 18 compared to the wild-type structure.
Buried hydrophilic introduced | The substitution replaces a buried hydrophobic residue with a hydrophilic residue.
Buried charge introduced | The substitution replaces a buried uncharged residue with a charged residue.
Secondary structure altered | The substitution results in a change in the secondary structure assigned by DSSP.
Buried charge switch | The substitution switches the charge (+/-) of the buried residue.
Disallowed phi/psi | The mutant residue is in the outlier region while the wild-type residue is in the favoured or allowed regions.
Buried charge replaced | The substitution replaces a buried charged residue with an uncharged residue.
Buried Gly replaced | The substitution replaces a buried glycine.
Buried H-bond breakage | The substitution breaks all side-chain/side-chain H-bond(s) and/or side-chain/main-chain bond(s) formed by the wild type which was buried. The maximum permitted H-bond N-O length is 3.9 Å.
Buried salt bridge breakage | The substitution breaks a salt bridge formed by a buried wild-type residue. The maximum N-O bond length is 5.0 Å.
Cavity altered | The substitution leads to an expansion or contraction of the cavity volume of ≥ 70 Å³.
Buried / exposed switch | The substitution results in a change between the buried and exposed states of the variant residue. In addition, the difference between RSA has to be at least 5%.
Cis Pro replaced | A cis proline in the wild type is replaced in the mutant.
Gly in a bend | The wild-type residue is glycine and is located in a bend curvature (reported as ‘S’ in DSSP).

Phyre2 can generate up to 20 models from different templates and at different sequence identities to each query sequence. In total, 3,544 monomeric homology models for 494 proteins were generated. The resulting models were further filtered to obtain models that meet the following criteria: 1) the sequence identity must be > 30% but ≤ 95% (the upper limit was set to prevent the case when the same protein, through the same or different PDB codes, was used as a template for modelling), 2) the template PDB code must not match the PDB code of the experimental structure, 3) confidence ≥ 90%, 4) residue coverage ≥ 80% of the original experimental structure. Note that these four conditions can be reordered without affecting the final results.
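The four screening criteria can be expressed as a single predicate. The following is a minimal sketch only, not the script used in this work: the `Model` record and its field names are hypothetical and simply mirror the quantities named in the criteria above.

```python
from dataclasses import dataclass


@dataclass
class Model:
    # Hypothetical record; field names are illustrative, not Phyre2 output fields.
    query_pdb: str       # PDB code of the experimental structure being modelled
    template_pdb: str    # PDB code of the template used for the model
    seq_identity: int    # sequence identity to the template, in percent
    confidence: float    # model confidence, in percent
    coverage: float      # residues covered, as a percentage of the experimental structure


def passes_screening(m: Model) -> bool:
    """Return True if the model meets all four criteria stated in the text."""
    return (
        30 < m.seq_identity <= 95                           # 1) identity > 30% and <= 95%
        and m.template_pdb.lower() != m.query_pdb.lower()   # 2) template is not the target itself
        and m.confidence >= 90                              # 3) confidence >= 90%
        and m.coverage >= 80                                # 4) coverage >= 80%
    )
```

Because the conditions are combined with a logical AND, their order does not change which models are retained, which is the point made in the note above.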

The models that met the screening criteria were separated into seven bins according to their sequence identity (as sequence identities reported in Phyre2 are integers, the used ranges were 30%-39%, 40%-49%, 50%-59%, 60%-69%, 70%-79%, 80%-89%, and 90%-95%). To avoid multiple models representing the same protein occurring in one bin, representative models for each protein in each bin were selected based on the resolution of the template PDBs. The preference was high-resolution X-ray > low-resolution X-ray > NMR > unknown resolution X-ray. The same set of variations used in the previous chapter (1,965 disease-associated and 2,134 neutral) were then mapped onto those representative models.
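The binning and the choice of one representative per protein per bin can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually implemented: the `Candidate` record is hypothetical, and ordering X-ray templates by their numerical resolution (lower value preferred) is one way to realise "high-resolution > low-resolution" without assuming a particular cut-off between the two.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    # Hypothetical record describing one screened model and its template.
    protein: str
    seq_identity: int            # integer percent
    method: str                  # "xray" or "nmr"
    resolution: Optional[float]  # template resolution in angstroms; None if not reported


# The seven sequence identity bins given in the text (inclusive bounds).
BINS = [(30, 39), (40, 49), (50, 59), (60, 69), (70, 79), (80, 89), (90, 95)]


def bin_of(identity: int):
    """Return the (low, high) bin containing the identity, or None if outside all bins."""
    for low, high in BINS:
        if low <= identity <= high:
            return (low, high)
    return None


def preference_key(c: Candidate):
    """Smaller keys are preferred: X-ray with known resolution (best first), then NMR,
    then X-ray of unknown resolution."""
    if c.method == "xray" and c.resolution is not None:
        return (0, c.resolution)
    if c.method == "nmr":
        return (1, 0.0)
    return (2, 0.0)


def pick_representatives(candidates):
    """Keep the most preferred candidate for each (protein, bin) pair."""
    best = {}
    for c in candidates:
        b = bin_of(c.seq_identity)
        if b is None:
            continue
        key = (c.protein, b)
        if key not in best or preference_key(c) < preference_key(best[key]):
            best[key] = c
    return best
```

How ties between templates of equal preference were broken is not stated in the text; in this sketch the first candidate encountered is kept.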

4.2.3 Model assessment

Each model was assessed by measuring the RMSD (root-mean-square deviation) of its Cα atoms compared to those from its corresponding experimental structure. RMSD quantifies the average distance between the atoms of two given structures. Given two structures v and w to be compared, the RMSD is defined as follows:

$$
\mathrm{RMSD}(v, w) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \lVert v_i - w_i \rVert^{2}}
= \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( (v_{ix} - w_{ix})^{2} + (v_{iy} - w_{iy})^{2} + (v_{iz} - w_{iz})^{2} \right)}
$$

where v_i and w_i are the coordinates of the equivalent Cα atoms at index i, and n is the number of atoms being compared. A lower value of the RMSD indicates higher similarity between the two structures.

The RMSDs between predicted structures and their respective experimental structures were calculated using a script ‘capri-fit’ developed by Dr Suhail Islam.
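For illustration, the RMSD definition above can be written directly in Python. This sketch is not the ‘capri-fit’ script; it assumes that the two coordinate lists are already superposed and that atoms are paired by index.

```python
import math


def rmsd(v, w):
    """RMSD between two equally long lists of (x, y, z) coordinates.

    Assumes the structures are already superposed and that v[i] and w[i]
    are equivalent atoms.
    """
    if len(v) != len(w) or not v:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (a - b) ** 2
        for p, q in zip(v, w)   # equivalent atoms
        for a, b in zip(p, q)   # x, y and z components
    )
    return math.sqrt(squared / len(v))
```

Any superposition step (for example, a least-squares fit of the Cα atoms) would have to be carried out before calling this function.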

4.3 Results

4.3.1 Quality of the predicted models

A total of 1,052 Phyre2-predicted homology models for 404 experimental PDB structures remained after applying the screening criteria and selecting representative models.

The distribution of RMSDs of equivalent Cα atoms between the predicted and corresponding experimental structures is shown in Figure 4.1. As expected, models at 90%-95% sequence identity showed the lowest median RMSD (0.86 Å). Additionally, the median RMSD increases with decreasing sequence identity percentage. The median RMSD was 2.79 Å at the lowest sequence identity range (30%-39%).

4.3.2 Performance of Missense3D on homology models at different sequence identities

For most proteins, homology models were only available in some sequence identity ranges, so here only the mappable variants in each bin were assessed. Figure 4.2 shows the results of applying Missense3D to the representative Phyre2-predicted homology models. TPR and FPR values with 95% confidence intervals are provided in Table 4.2. In addition to the positive rates from the predicted structures, the rates from the experimental structures of the same set of variants, both disease-associated and neutral, are also provided in the bar chart for comparison.

Figure 4.1: The distribution of RMSDs of predicted structures. The RMSD of each model was measured by comparing all its Cα atoms to those of its corresponding experimental structure. Each box represents quartile 1 (Q1), median (middle line), and quartile 3 (Q3). The whiskers extend up to 1.5 times the interquartile range (Q3-Q1) away from the box, but not exceeding the minimum or maximum values of each bin.

In general, similar TPRs and FPRs were observed when using predicted models and experimental structures. In most bins, the TPRs were between 35% and 38%. When 95% confidence intervals were taken into consideration, for six out of the seven bins a large overlap in the TPRs as well as FPRs between the results for homology models and the experimental structures can be observed. The proportion of disease-associated variants detected as causing structural impact was lower in bins 70%-79% and 80%-89%, both in the predicted structures and the experimental structures. This is due to the lack of templates in these sequence identity ranges, which led to far larger confidence intervals and hindered comparison. Despite some fluctuation in the TPRs in the 70%-89% sequence identity range, the FPRs in all bins remained broadly stable between 11% and 13%. These results show that the overall performance of Missense3D does not deteriorate substantially even when the structures analysed are from low sequence identity templates. The successful extension of these results from experimental to predicted structures is, in part, the result of using broad cut-offs (i.e. more relaxed thresholds for bond length calculation) in the identification of the structural alerts.

Figure 4.2: Missense3D performance at different sequence identities. For each sequence identity bin, the TPRs and FPRs are shown for both the predicted and their corresponding experimental structures. 95% confidence intervals on the positive rates are shown as lines. The numbers of variants analysed in each bin are shown in the table under the graph.
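As a worked illustration of how a positive rate and its 95% confidence interval follow from the confusion-matrix counts, the sketch below uses the normal-approximation (Wald) interval for a proportion. The type of interval is an assumption on my part, as it is not named in the text; it does, however, reproduce the values reported in Table 4.2 for the MODEL row of the 30%-39% bin (TP = 442, FN = 830, FP = 140, TN = 1104).

```python
import math


def rate_with_ci(hits: int, misses: int, z: float = 1.96):
    """Return (rate, half-width of the 95% CI), both in percent.

    Uses the normal approximation p +/- z * sqrt(p * (1 - p) / n);
    the choice of this interval is an assumption, not taken from the text.
    """
    n = hits + misses
    p = hits / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return 100 * p, 100 * half_width


# TPR = TP / (TP + FN) and FPR = FP / (FP + TN) for the 30%-39% MODEL row.
tpr, tpr_ci = rate_with_ci(442, 830)
fpr, fpr_ci = rate_with_ci(140, 1104)
print(f"TPR = {tpr:.2f} ± {tpr_ci:.2f}")
print(f"FPR = {fpr:.2f} ± {fpr_ci:.2f}")
```

The width of the interval scales with 1/sqrt(n), which is why the bins with few variants (70%-79% and 80%-89%) show much wider intervals.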

Table 4.2: Performance of Missense3D on predicted models and their equivalent experimental structures in each sequence identity range with 95% confidence interval

Sequence identity (%) | Structure | TP | FN | FP | TN | TPR (±CI) | FPR (±CI)
30-39 | MODEL | 442 | 830 | 140 | 1104 | 34.75 ± 2.62 | 11.25 ± 1.76
30-39 | EXP | 459 | 813 | 144 | 1100 | 36.08 ± 2.64 | 11.58 ± 1.78
40-49 | MODEL | 350 | 622 | 113 | 789 | 36.01 ± 3.02 | 12.53 ± 2.16
40-49 | EXP | 390 | 582 | 112 | 790 | 40.12 ± 3.08 | 12.42 ± 2.15
50-59 | MODEL | 334 | 597 | 80 | 553 | 35.88 ± 3.08 | 12.64 ± 2.59
50-59 | EXP | 331 | 600 | 75 | 558 | 35.55 ± 3.07 | 11.85 ± 2.52
60-69 | MODEL | 205 | 346 | 46 | 306 | 37.21 ± 4.04 | 13.07 ± 3.52
60-69 | EXP | 197 | 354 | 38 | 314 | 35.75 ± 4.00 | 10.80 ± 3.24
70-79 | MODEL | 27 | 82 | 21 | 164 | 24.77 ± 8.10 | 11.35 ± 4.57
70-79 | EXP | 22 | 87 | 22 | 163 | 20.18 ± 7.54 | 11.89 ± 4.66
80-89 | MODEL | 70 | 193 | 21 | 156 | 26.62 ± 5.34 | 11.86 ± 4.76
80-89 | EXP | 77 | 186 | 22 | 155 | 29.28 ± 5.50 | 12.43 ± 4.86
90-95 | MODEL | 96 | 163 | 20 | 148 | 37.07 ± 5.88 | 11.90 ± 4.90
90-95 | EXP | 86 | 173 | 19 | 149 | 33.20 ± 5.74 | 11.31 ± 4.79

Figure 4.3 shows the distributions of the relative frequencies of the RMSDs of the predicted models that contributed to TP and FP assignments. Note that a model can be assigned TP or FP more than once if it contains multiple variants. For example, a model with RMSD = 2.1 Å consisting of ten variants may lead to seven TPs and three FPs. The overall distribution suggests that TP cases typically had lower RMSDs. The mean RMSDs were 1.83 Å for TP cases (with many models in 2.0 Å - 2.5 Å range) and 2.11 Å for FP cases (many are in 2.5 Å - 3.0 Å).

A one-tailed Mann-Whitney U-test confirmed that the RMSDs for the TP predictions are significantly lower (p < 0.0001) than those for the FP predictions. However, as there is a large overlap in the RMSDs, this suggests that model quality cannot provide a reliable guide as to whether Missense3D will yield a correct identification of a damaging effect.
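To make the test concrete, a one-tailed Mann-Whitney U-test can be sketched with the standard library alone. The software used for the test in this work is not stated here, so the function below is an illustrative implementation only: it uses the normal approximation with a tie correction and no continuity correction, and it therefore may differ slightly from the p-values given by statistical packages.

```python
import math


def mann_whitney_u_less(x, y):
    """One-tailed Mann-Whitney U-test of whether values in x tend to be lower than in y.

    Returns (U statistic of x, approximate p-value) using the normal
    approximation with tie correction and no continuity correction.
    """
    n1, n2 = len(x), len(y)
    pooled = sorted([(value, 0) for value in x] + [(value, 1) for value in y])
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2      # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = average_rank
        t = j - i
        tie_term += t ** 3 - t              # contribution of this tie group
        i = j
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_value = 0.5 * math.erfc(-z / math.sqrt(2))   # P(Z <= z)
    return u_x, p_value
```

A small U for the first sample, and hence a small p-value, indicates that its values tend to be lower. With the RMSDs of TP cases passed as `x` and those of FP cases as `y`, this corresponds to the direction of the test described above.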

Figure 4.3: Histogram of the relative frequencies of RMSD of predicted models grouped according to whether the variant was a TP or FP. The bin labels show the upper bound (e.g. 0.5 denotes the range 0.0 Å ≤ RMSD < 0.5 Å).

4.3.3 The performance of each structural feature at different sequence identities

Although overall the TPRs and FPRs remained broadly stable even at very low sequence identity (30%-40%), one might question whether the performance of each individual feature varies with the sequence identity between the query sequence and the template. Here, each feature is assessed by comparing its TPRs and FPRs to those yielded by using experimental structures. Figures 4.4-4.6 present the TPRs and FPRs for each of the 16 features. Full details including confidence interval ranges are provided in Supplementary table C.1.

By inspection, the 16 features can be separated into two groups: 1) the features that maintain good performance (ratio of TPR/FPR) even when the sequence identity is low, and 2) the features whose performance drops when analysing predicted models.

Features that tend to maintain good performance even at low sequence identity rely on the buried or exposed state of the wild-type or mutant residue, on the backbone conformation (torsion angles / secondary structures), and on changes involving certain amino acids such as glycine or proline. This is because homologous structures tend to preserve the backbone conformation more than the side-chain positions (Flores et al., 1993) and residues tend to maintain their buried or exposed status (Rost and Sander, 1994) even with low sequence identity.

In contrast, features requiring the measurement of a distance - such as steric clashes and the disruption of hydrogen bonds, salt bridges and disulphide bonds - tend to perform worse when used on homology models, particularly at low sequence identity, due to poor modelling of the side-chain positions. The details of each individual feature are discussed below.

Disruption of chemical bonds (disulphide, H bond, and salt bridge)

The performance of the features concerning the disruption of chemical bonds tends to drop when the sequence identity is low. These features require measurement of a distance between two atoms (maximum allowed distances are 3.3 Å for a disulphide bond, 3.9 Å for a strong to moderate strength H bond, and 5 Å for a salt bridge). It can be observed, when the results from the experimental structures were compared to those from the predicted structures, that at low sequence identity (i.e. less than 50%), fewer TPs were identified. The false positive rates for all the types of chemical bonds remained relatively low in nearly all bins. Some fluctuation can be observed at high sequence identity, but this is most likely due to the lack of models and variants analysed.
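The distance-based logic behind these features can be sketched as follows. This is a deliberately simplified illustration rather than the Missense3D implementation: it considers a single pair of atoms, ignores the burial and atom-type requirements listed in Table 4.1, and uses hypothetical function names. Only the three cut-offs are taken from the text.

```python
import math

# Maximum allowed distances in angstroms, as stated in the text.
MAX_DISTANCE = {"disulphide": 3.3, "hbond": 3.9, "salt_bridge": 5.0}


def bond_present(atom_a, atom_b, kind):
    """True if two atoms (x, y, z) lie within the cut-off for the given bond type."""
    return math.dist(atom_a, atom_b) <= MAX_DISTANCE[kind]


def bond_broken(wt_pair, mut_pair, kind):
    """True if a bond detected in the wild type is absent in the mutant.

    Each pair is a tuple of two coordinates, or None when the partner atoms
    do not exist in that structure (e.g. the side chain lacks the atom).
    """
    if wt_pair is None or not bond_present(wt_pair[0], wt_pair[1], kind):
        return False  # no bond detected in the wild type, so nothing can be broken
    return mut_pair is None or not bond_present(mut_pair[0], mut_pair[1], kind)
```

The first branch of `bond_broken` makes explicit that an alert can only be raised when the bond is detected in the wild-type structure in the first place.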

The lower percentages of positive hits are not likely to be because the substitutions did not break the bonds, but because the bonds were not detected in the wild-type structures. At low sequence identity, extensive backbone modifications can be anticipated. This could lead to less accuracy in the repacking of the side chains, which eventually results in the chemical bonds not being detected in the predicted wild-type structures.

Figure 4.4: Performance of each structural feature (part 1)

Figure 4.5: Performance of each structural feature (part 2)

Figure 4.6: Performance of each structural feature (part 3)

Changes involving glycine and proline

This section includes three features: ‘buried Pro introduced’, ‘buried Gly replaced’, and ‘Gly in a bend’. These features consider either secondary structure or solvent accessibility, both of which are well-conserved properties among homologous structures, even at low sequence identity. The performance of these features was, therefore, not substantially deteriorated by the low sequence identity.

Clash

The discriminating power of the ‘clash’ feature is considerably weakened at low sequence identity. In the 30%-39% sequence identity range, the TPR yielded by predicted structures was only half the rate yielded by experimental structures of the same proteins: the expected TPR was 4.2%, but only 2.1% was detected. This does not mean, however, that there were fewer clashes in the predicted models at low sequence identity. As expected, these models had more clashes than the X-ray structures. In particular, the clash scores were much lower for high sequence identity models, and much higher for low sequence identity models (Figure 4.7). Nevertheless, the criterion for a clash requires a large difference in the MolProbity clash scores between the wild-type and the mutant structures (>18). Since the clash scores were already high in the predicted wild-type structures, introducing variants into those structures is not likely to drastically increase the clash scores.
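The ‘clash’ criterion from Table 4.1 combines an absolute and a relative threshold, which the following sketch makes explicit. The function name is hypothetical and the example scores in the usage note are invented for illustration; only the two thresholds (mutant score ≥ 30 and increase > 18) come from the text.

```python
def clash_alert(wt_score: float, mut_score: float,
                min_score: float = 30.0, min_increase: float = 18.0) -> bool:
    """True if the mutant clash score is at least min_score and has risen by
    more than min_increase relative to the wild-type clash score."""
    return mut_score >= min_score and (mut_score - wt_score) > min_increase
```

For example, a substitution that raises the clash score from 5 to 35 satisfies both conditions, whereas a substitution in a model whose wild-type clash score is already 45 would need to push the score above 63 before an alert is raised. This is consistent with the explanation above for the lower TPR of this feature in low sequence identity models.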

There are two explanations for the high clash scores observed in the wild-type homology models. First, a low sequence identity between the query sequence and the template sequence suggests that more backbone modifications (i.e. insertion/deletion) are required in order to build a homology model. As this step is performed without considering side chains, adding more residues may lead to less three-dimensional space for side chains to be repacked. The second reason lies in the fact that Phyre2 uses the R3 algorithm for packing side chains (Kelley et al., 2015). Although its accuracy was shown to be similar to that of SCWRL3 (about 80% in terms of correct dihedral angles chi1 and chi2), R3 does not prioritise minimising clashes. Instead, it prioritises minimising the calculation time and was shown to be about 2 to 78 times faster than SCWRL3 (Xie and Sahinidis, 2006). Hence, high clash scores can be expected.

Figure 4.7: The distribution of MolProbity clash scores of predicted wild-type structures. Each box represents Q1, median, and Q3. The whiskers extend up to 1.5 times the interquartile range (Q3-Q1) away from the box, but not exceeding the minimum or maximum values of each bin.

Substitution involving buried residues

Several features that involve solvent accessibility calculation, such as ‘buried/exposed switch’, ‘buried hydrophilic introduced’, and ‘buried charge introduced/switched’, maintained an overall good performance even at the lowest available sequence identity range (30%-39%). Moreover, the FPRs of these features were broadly stable. These results underline the importance of the hydrophobic environment in the protein interior which is conserved among homologous proteins. Similar to other features, the drop in performance observed between 60% and 80% sequence identity ranges is likely to be because of the lack of models and variants analysed.

Secondary structure

According to the results on experimental structures, all changes in secondary structures were caused by introducing or removing a proline in a structure, which leads to disruption of backbone hydrogen bonds. As expected, the discriminating power of this feature is maintained at low sequence identity, as secondary structures are normally well-conserved among homologous structures. However, because the number of cases observed to have this structural problem was low in the experimental structures (21 disease-associated and 4 neutral SAVs, which contributed to 1.07% TPR and 0.19% FPR respectively), a slight difference in the number of observed cases can result in a considerable change in the appearance of the graph, which is accompanied by large confidence intervals. At 90% sequence identity, only 3 out of 259 (1.16%) disease-associated SAVs and 2 out of 168 (1.19%) neutral SAVs were detected to cause secondary structure disruption. However, at 30% sequence identity, 11 out of 1,272 disease-associated SAVs (0.86%) but still only 2 out of 1,244 neutral SAVs (0.24%) were detected to cause this problem.
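The sensitivity of these rates to a handful of cases can be illustrated with a short calculation. The sketch below recomputes the positive rates quoted above for the 90% sequence identity bin and attaches a 95% confidence interval to each. The use of the Wilson score interval and the function names are assumptions made for illustration only; the interval method used for the graphs is not restated in this section.

```python
import math


def positive_rate(hits: int, total: int) -> float:
    """Percentage of variants that triggered the structural alert."""
    return 100.0 * hits / total


def wilson_interval(hits: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval (in percent) for a proportion.

    Assumption: the Wilson interval is chosen here purely as an example.
    """
    p = hits / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return 100.0 * (centre - half), 100.0 * (centre + half)


# Counts quoted in the text for the 90% sequence identity bin.
for label, hits, total in [("disease-associated", 3, 259), ("neutral", 2, 168)]:
    low, high = wilson_interval(hits, total)
    rate = positive_rate(hits, total)
    print(f"{label}: {rate:.2f}% (95% CI {low:.2f}%-{high:.2f}%)")
```

With so few positive cases, the interval is wider than the point estimate itself, which is consistent with the large confidence intervals noted above.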

Disallowed phi/psi

Similar to secondary structure and solvent accessibility, phi/psi torsion angles tend to be conserved among homologous structures. The percentages of disease-associated SAVs identified as causing disallowed phi/psi angles varied between 4% and 6% due to different numbers of variants analysed in each bin. The overlap in the confidence intervals suggests that the positive rates between predicted and experimental structures are, however, not significantly different.

In the backbone construction process by Phyre2, the insertion of missing backbone is handled by using a library of fragments of known protein structures with lengths of 2-15 amino acids. Similarly, when a deletion occurs, the fragment representing the sequence encompassing a window on either side of the deletion is used. The fragments are then further adjusted to fit the crude model. This process minimally alters the torsion angles in the fragments (Canutescu and Dunbrack, 2003) and could ensure favourable phi/psi torsion angles in the backbone. As Missense3D disregards wild-type residues whose torsion angles are in disallowed regions, the abundance of residues with favourable phi/psi angles in the predicted wild-type structures enables a structural alert to be raised by Missense3D.

Cis Pro replaced

In the analysis performed on experimental structures, only 7 (0.36%) disease-associated SAVs and 4 (0.19%) neutral SAVs were identified as substitutions from cis proline. Even at the 30% sequence identity range, which has the largest numbers of models and variants, only 3 out of 1,272 (0.24%) disease-associated SAVs and only 1 out of 1,244 (0.08%) neutral SAVs were identified as causing this structural problem. In addition, no SAV was found at 60%-80% sequence identity. Due to the low occurrence of such variants, which also resulted in wide confidence intervals, the performance of this feature could not be conclusively assessed. Nevertheless, as cis conformation is determined from the omega angle, one of the backbone torsion angles (phi, psi, and omega) that tend to be conserved, one may hypothesise that this feature would maintain its performance regardless of sequence identity.
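As a minimal sketch of how a cis peptide bond can be recognised from the omega angle, the code below computes the dihedral defined by the atoms Cα(i-1), C(i-1), N(i), and Cα(i), and labels the bond cis when omega is close to 0° rather than 180°. The 30° tolerance and the function names are assumptions for illustration and are not taken from the Missense3D implementation.

```python
import math

Point = tuple[float, float, float]


def _sub(a: Point, b: Point) -> Point:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Point, b: Point) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a: Point, b: Point) -> Point:
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0: Point, p1: Point, p2: Point, p3: Point) -> float:
    """Signed dihedral angle in degrees defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


def is_cis_peptide(omega_deg: float, tolerance: float = 30.0) -> bool:
    """Assumed convention: cis if omega lies within `tolerance` degrees of 0."""
    return abs(omega_deg) <= tolerance


# Idealised planar arrangements (hypothetical coordinates, not from a PDB file).
cis_omega = dihedral((0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0))
trans_omega = dihedral((0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, -1.0, 0.0))
print(is_cis_peptide(cis_omega), is_cis_peptide(trans_omega))
```

Because omega depends only on backbone atoms, this check does not rely on side-chain placement, which is in line with the hypothesis stated above.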

Cavity altered

Figure 4.8: Box plot showing total cavity volume calculated from predicted structures (green) vs. experimental structures (blue). The plots were separated into different bins according to sequence identity. Each box represents Q1, median, and Q3 of the data in each bin. The top whiskers extend up to 1.5 times the interquartile range (Q3-Q1) away from the box. The bottom of every box is adjacent to the x-axis because in many cases no cavity was detected in the structures.

Although the TPR/FPR ratios did not drastically change across different sequence identity bins, considerably more cavity problems can be detected at low sequence identity than at high sequence identity. In the experimental dataset of 606 proteins, Missense3D identified 5.3% of disease-associated SAVs and 1.9% of neutral SAVs as causing cavity problems. However, at 30% sequence identity, the percentages in homology models were 6.8% and 3.1% respectively, while the expected values were 4.3% and 2.2%. Predicted structures, particularly at low sequence identity, were more enriched in cavities than their corresponding experimental structures. The predicted wild-type structures showed wider ranges of total cavity volume when compared to the wild-type experimental structures of the same proteins (Figure 4.8). In particular, the difference in total cavity volume between predicted and experimental structures is noticeable when the sequence identity is low. The wide ranges of total cavity volumes in low sequence identity homology models are probably due to extensive backbone modifications and side-chain replacements. Because of less accurate repacking of side chains, a great number of cavities throughout the predicted models can be expected. Introducing a SAV onto a structure abundant in cavities is, therefore, more likely to trigger the alert for cavity volume change - particularly when a small amino acid is replaced with a larger one. Alternatively, in a structure with very small or no cavities, replacing a small amino acid with a large amino acid would instead trigger the clash alert, rather than the cavity volume alert.

Development of Missense3D, a web server to help analyse the structural effects of missense variants

Missense3D was made available as an offline version, which supports batch analysis, to be used internally within the Structural Bioinformatics Group at Imperial College London, and as an online web-server version, available at http://www.sbg.bio.ic.ac.uk/~missense3d/, which is freely available for all users.

The web server was developed with assistance from Dr Tochukwu Ofoegbu (for PDB structure visualisation), Dr Alessia David (documentation), and Dr Suhail Islam (user-server interaction). Missense3D incorporates recent web technologies, such as HTML5, jQuery, and Bootstrap 4.0, to deliver a clean web interface and great user experience.

The web-server pipeline is shown in Figure 4.9. Users are asked to either provide a PDB code or upload a 3D coordinate file for the protein they would like to analyse, along with the variant information (position, original amino acid, and mutant amino acid) (Figure 4.10).

Figure 4.9: Missense3D analysis pipeline

After reading all user-provided input parameters, the web server executes the main Missense3D analysis pipeline, which was written in Python for side chain repacking and structural analysis, as described in Chapter 3. The result is typically returned in about 3 minutes per variant. The main report is in a TSV (tab-separated values) file which can be opened with MS Excel or any text editing program. A user-friendly version of the report is also generated as an HTML file. The HTML report consists of two sections. The first section (Figure 4.11) shows the summary of the variant analysed and the structural problems detected by Missense3D. The mutant structure is also displayed in superposition with the wild-type structure using JSMol (Hanson et al., 2013). Here, users can customise the display of each structure (e.g. colour and shape) by using the buttons provided next to the viewer. Users can download all the results to their personal computers for offline use. The second section (Figure 4.12 and Figure 4.13) provides detailed information for the 16 features that were analysed.

Figure 4.10: Missense3D home page

Figure 4.11: Example result on experimental protein structure PDB 3HG3 (B), variant Cys52Arg (part one). This variant triggers the disulphide breakage alert. The wild-type structure is shown in grey. The mutant structure is in purple. However, as both structures share the same backbone coordinates, in this superposition view the backbone and most of the side chains can be seen in the wild-type colour. Residue 52 is highlighted in green for the wild-type amino acid (Cys) and red for the mutant amino acid (Arg).

The original Missense3D script was designed to support batch analysis. This feature is currently only available in the offline version. To perform batch processing, users need to provide all variant information in a text file. In addition to providing a PDB code, users can provide variant information based on positions in FASTA sequences. In this case, users are required to provide a UniProt ID and a FASTA sequence. Missense3D will then align the sequence onto the PDB structure and perform the analysis.
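The variant information described above (position, original amino acid, and mutant amino acid) can be read from text with a small parser. The sketch below is illustrative only: the input notation (`Cys52Arg` or `p.His107Tyr`, one variant per line, with `#` comments) and all names are hypothetical, as the actual batch file format used by Missense3D is not specified here.

```python
import re
from typing import NamedTuple

# The 20 standard amino acids in three-letter code.
AMINO_ACIDS = {
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
}

# Hypothetical notation: optional "p." prefix, e.g. "Cys52Arg" or "p.His107Tyr".
_VARIANT_RE = re.compile(r"^(?:p\.)?([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")


class Variant(NamedTuple):
    wild_type: str
    position: int
    mutant: str


def parse_variant(text: str) -> Variant:
    """Parse one single amino acid variant written in three-letter notation."""
    match = _VARIANT_RE.match(text.strip())
    if match is None:
        raise ValueError(f"Unrecognised variant notation: {text!r}")
    wild_type, position, mutant = match.groups()
    if wild_type not in AMINO_ACIDS or mutant not in AMINO_ACIDS:
        raise ValueError(f"Unknown amino acid in variant: {text!r}")
    return Variant(wild_type, int(position), mutant)


def parse_batch(lines: list[str]) -> list[Variant]:
    """Parse a list of lines, skipping blank lines and '#' comments."""
    return [
        parse_variant(line)
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    ]


variants = parse_batch(["# example input", "Cys52Arg", "", "p.His107Tyr"])
print(variants)
```

Validating the notation before running the analysis means that a malformed line is reported immediately rather than after side-chain repacking has started.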

Figure 4.12: Example result on experimental protein structure PDB 3HG3 (B), variant Cys52Arg (part two). Here disulphide breakage is detected and is highlighted in red.

Figure 4.13: Example result on experimental protein structure PDB 3HG3 (B), variant Cys52Arg (part three).

4.3.4 Case studies

Two example case studies are presented to illustrate the use of Missense3D for analysing the structural consequences of SAVs in predicted models, at various sequence identities, with comparison of the results to those yielded by using experimental structures.

Carbonic anhydrase II: a salt bridge disruption destabilises the protein

Carbonic anhydrase II (UniProt ID: P00918, gene name CA2) is one of the 14 forms of human alpha carbonic anhydrases involved in catalysing the reversible hydration of carbon dioxide (Figure 4.14). Defects in this protein were shown to be associated with osteopetrosis with renal tubular acidosis (MIM: 259730) (Almstedt et al., 2008), a rare condition causing severe impairment both mentally and physically. His107 is a completely buried residue (RSA 0%) located in the core of CA2. Variants replacing His107 usually destabilise the protein. However, the bulkiness of the Tyr side chain places the pathogenic SAV p.His107Tyr among the variants that have the most severe destabilising effect on the folding of the protein (Almstedt et al., 2008).

In the high-quality 1.1 Å wild-type experimental structure (PDB code: 2FOS, chain A), two salt bridges are found between His107 and Glu117. The two N-O distances are 2.6 Å and 3.8 Å (Figure 4.15a). The variant p.His107Tyr results in the abolition of these salt bridges (Figure 4.15b). One of the structures predicted by Phyre2 had a sequence identity of 36% (template PDB code 1JD0) and resulted in a model with RMSD to the experimental structure (2FOS) of 2.6 Å. In the predicted structure, the His-Glu salt bridge remains intact with the N-O distances being 2.8 Å and 3.9 Å (Figure 4.15c). Despite the predicted model having a low sequence identity, the predicted structure remained sufficiently accurate in this region so that salt bridges could be identified in the predicted wild-type structure. This enabled Missense3D to identify the breakage of the bonds in the predicted mutant structure (Figure 4.15d).

Figure 4.14: Carbonic anhydrase II. This protein is represented by PDB 2FOS (chain A, resolution 1.1 Å) with a secondary structure representation. The two buried residues His107 and Glu117 are highlighted in red and orange, respectively.

Alpha-galactosidase A: disulphide disruption results in loss of activity

Alpha-galactosidase (UniProt ID: P06280, gene name GLA) is an enzyme responsible for hydrolysing terminal, non-reducing alpha-D-galactose residues in alpha-D-galactosides, including galactolipids, galactose oligosaccharides, and galactomannans (Calhoun et al., 1985; Ge et al., 2017). GLA harbours several disease-causing variants, which cause a rare genetic disorder, Fabry disease (Lukas et al., 2016).

The GLA protein, comprising 429 amino acids, is enriched in disulphide bonds which provide structural stability. Examination of the experimental structure of alpha-galactosidase shows disulphide bonds connecting cysteine residues 52-94, 56-63, 142-172, 202-223, and 378-382 (Figure 4.16) (Qiu et al., 2015). There are, however, two cysteine residues that do not form a disulphide bond: Cys90 and Cys174. Variants at these two residues only slightly alter the level of enzymatic activity. Variants affecting other cysteine residues have been shown to cause loss of GLA expression and activity, as a result of disulphide bond disruption, which severely affects GLA folding and stability (Qiu et al., 2015).

Figure 4.15: Analysis of the disease variant p.His107Tyr in the experimental and predicted structures of carbonic anhydrase II. (a) His107 in the wild type (PDB: 2FOS), (b) Tyr107 in the predicted mutant (modelled by Missense3D), (c) Phyre2-predicted structure of the wild type with His107, (d) predicted mutant structure with Tyr107 based on the predicted wild-type structure. In all four panels, the Cα traces of the structures analysed by Missense3D are shown in grey. In boxes (c) and (d), the Cα traces and side-chain positions that are shown in (a) and (b) are in pink. The side chains of His107, Tyr107 and Glu117 are shown by chemical type (green for non-polar, blue for positive and red for negative).

Figure 4.16: Alpha-galactosidase A. (a) Overall structure of the protein in a dimer conformation. Cys52 is a surface residue highlighted in red. (b) Cartoon representation of the monomer. Disulphide bonds are shown as orange spheres. (c) Location on the protein sequence of each cysteine pair that forms a disulphide bond. The structure is from PDB 3HG3 (chain B, resolution 1.9 Å).

In the structural analysis by Missense3D based on the 1.9 Å resolution PDB structure 3HG3, the residue Cys52 is originally located on the surface of the protein (RSA 32%) (Figure 4.16a-b). This residue forms a disulphide bond with Cys94 with a 2.08 Å bond length (Figure 4.17a). Substituting Cys52 with Arg or any other amino acid would disrupt this bond (Figure 4.17b). Missense3D can also successfully identify the disulphide breakage at positions 56, 94, 142, 172, 202, 223, and 378. Additionally, variants at these residues have been reported as disease-causing in UniProt Humsavar (except two residues, 63 and 382, which have no variant record).

Figure 4.17: Analysis of the variant p.Cys52Arg in the crystal and predicted structures of alpha-galactosidase. (a) Cys52 in the wild-type structure (PDB 3HG3), (b) Arg52 in the mutant modelled from the wild-type structure, (c) Phyre2-predicted wild-type structure of the Cys52 using a template with 54% sequence identity (RMSD to the experimental structure of 1.4 Å), (d) Phyre2-predicted wild-type structure of the Cys52 using a template with 35% sequence identity (RMSD to the experimental structure of 2.8 Å). Colour scheme for the four panels as in Figure 4.15. Additionally, the SG atoms are shown in yellow.

A more-accurately predicted wild-type structure has 54% sequence identity (template PDB: 1KTB) and has an RMSD with the experimental structure of 1.4 Å. The wild-type disulphide bond remains identifiable with an S-S distance of 2.00 Å (Figure 4.17c). Therefore, the Cys52Arg substitution is correctly identified as structurally damaging in the mutant structure. In contrast, the less-accurately-predicted wild-type structure that is based on a model with 35% identity (template PDB: 4NZJ) has an RMSD of 2.86 Å to the experimental structure. The wild-type disulphide bond cannot be identified as the S-S distance is 4.0 Å (Figure 4.17d) and so the variant is not considered as introducing a damaging change into the structure.
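The outcome of this case study can be reproduced with a simple distance test. The sketch below applies a single S-S distance threshold to the three wild-type structures discussed above; the 3.3 Å default corresponds to the relaxed cut-off discussed in Section 5.1, while the function names and the simplified rule (a bond detected in the wild type and no cysteine in the mutant) are assumptions for illustration rather than the actual Missense3D code.

```python
RELAXED_SS_CUTOFF = 3.3  # Å; relaxed disulphide threshold discussed in Section 5.1


def disulphide_detected(ss_distance: float, cutoff: float = RELAXED_SS_CUTOFF) -> bool:
    """A disulphide bond is assumed present if the S-S distance is within the cut-off."""
    return ss_distance <= cutoff


def disulphide_breakage_alert(
    wild_type_ss_distance: float,
    mutant_residue: str,
    cutoff: float = RELAXED_SS_CUTOFF,
) -> bool:
    """Simplified rule: the alert needs a wild-type bond and a non-Cys mutant residue."""
    return disulphide_detected(wild_type_ss_distance, cutoff) and mutant_residue != "Cys"


# S-S distances (Å) between Cys52 and Cys94 quoted in the text.
wild_type_structures = {
    "experimental (3HG3)": 2.08,
    "model, 54% identity (template 1KTB)": 2.00,
    "model, 35% identity (template 4NZJ)": 4.0,
}

alerts = {
    name: disulphide_breakage_alert(distance, "Arg")
    for name, distance in wild_type_structures.items()
}
for name, alert in alerts.items():
    print(f"{name}: disulphide breakage alert = {alert}")
```

The 35% identity model fails the test at the first step: because no bond is detected in the wild type, there is nothing for the substitution to break, so no alert is raised.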

4.4 Conclusion

This chapter reported the performance of our analysis pipeline when tested on homology models at various sequence identity ranges; it was observed that the performance of the prediction on homology models is nearly as good as on crystal structures, although there are minor changes in the accuracy of some features at low sequence identity. This suggests that homology models can be used as a promising alternative for structural analysis when experimental structures of the proteins are not available.

Without the use of any machine learning algorithm, Missense3D can provide good coverage for variants that damage protein structures while keeping the false positive rate low. Missense3D is a powerful descriptive tool and, if used in conjunction with sequence-based prediction tools, can provide more accurate interpretation of the effect of genetic variants.

Chapter 5

Discussion

5.1 Stringent cut-off versus relaxed cut-off for chemical bonds in Missense3D

In the development of Missense3D, relaxed cut-offs were used in order to perform the structural analysis on 3D coordinates from high-quality and low-quality X-ray structures, NMR structures, and homology-built structures. The widely-accepted lengths for disulphide bonds, salt bridges, and hydrogen bonds are 2.1 Å, 4.0 Å, and up to 2.9 Å (for moderate-strength H bonds), respectively. However, as SCWRL4 may slightly move the side chains, each bond length was allowed 1 Å relaxation. The differences in the performance of each feature using normal cut-offs versus relaxed cut-offs are discussed below.

With the default threshold of 2.3 Å (from the ‘disulphide’ script) for disulphide bonds, 65 disease-associated SAVs (3.31%) and three neutral SAVs (0.14%) were detected as disrupting a disulphide bond. By increasing the threshold to 3.3 Å, five more disease-associated SAVs were identified (see Table 5.1).

A drastic increase in the number of positive hits can be observed when considering the relaxed cut-off for buried salt bridge breakage. With the default 4.0 Å threshold, 13 disease-associated SAVs (0.66%) and five neutral SAVs (0.23%) were identified. By increasing the threshold to 5 Å, nearly five times more salt-bridge-breaking SAVs were detected: 62 (3.16%) disease-associated and 23 (1.08%) neutral SAVs (TPR/FPR 2.93). The performance of the salt bridge feature, however, dropped drastically when non-buried salt bridges were included. This suggests that a salt bridge breakage that happens on the surface can be tolerated, as the affected residue may form a new bond with water in compensation.

In contrast with the above two bonds, increasing the threshold for hydrogen bonds results in fewer structural alerts. At the 2.9 Å threshold, 197 (10.03%) disease-associated and 81 (3.80%) neutral SAVs were flagged as disrupting side-chain hydrogen bonds in the protein core. However, at 3.9 Å, the numbers were reduced to 148 (7.53%) and 53 (2.48%), respectively. Unlike salt bridges and disulphide bonds, a single residue can establish multiple hydrogen bonds with its neighbouring residues. The criterion used in this study required all side-chain hydrogen bonds to be disrupted in order to trigger the alert. Extending the cut-off to 3.9 Å enables more hydrogen bonds to be detected (i.e. weak hydrogen bonds). This makes it much harder to completely disrupt all side-chain hydrogen bonds by introducing a SAV, as the mutant residue may still be able to form hydrogen bonds with its neighbouring residues. Nonetheless, using the relaxed cut-off can effectively filter out false positive hits and increase the TPR/FPR ratio to 3.03. Similar to the salt bridge disruption, the performance of this feature dropped when exposed residues were included. Buried hydrogen bonds were shown to have more impact on protein folding (Worth and Blundell, 2010).
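The reason a longer cut-off can produce fewer hydrogen bond alerts follows directly from the "all bonds disrupted" criterion. The sketch below is a simplified illustration of that rule; the donor-acceptor distances are hypothetical values chosen for the example, and the function name is an assumption rather than part of Missense3D.

```python
def hbond_breakage_alert(
    wild_type_distances: list[float],
    mutant_distances: list[float],
    cutoff: float,
) -> bool:
    """Raise the alert only if the wild-type side chain forms at least one
    hydrogen bond within the cut-off and the mutant side chain forms none."""
    wild_type_bonded = any(d <= cutoff for d in wild_type_distances)
    mutant_bonded = any(d <= cutoff for d in mutant_distances)
    return wild_type_bonded and not mutant_bonded


# Hypothetical donor-acceptor distances (Å) for one buried residue.
wild_type = [2.8, 3.1]  # two side-chain hydrogen bonds in the wild type
mutant = [3.5]          # one longer, weaker contact remains in the mutant

strict_alert = hbond_breakage_alert(wild_type, mutant, cutoff=2.9)
relaxed_alert = hbond_breakage_alert(wild_type, mutant, cutoff=3.9)
print(strict_alert, relaxed_alert)
```

At 2.9 Å the remaining 3.5 Å contact in the mutant is not counted, so all wild-type bonds appear lost and the alert is raised. At 3.9 Å the same contact counts as a weak hydrogen bond, so the alert is suppressed.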

5.2 The prediction performance is generally well preserved in homology models

By incorporating homology models, the coverage of the human proteome can be increased to approximately 50% (Somody et al., 2017). Prior to the work in this thesis, it was unclear whether using homology models to evaluate the structural impact of SAVs can be reliable. This question was addressed in Chapter 4 and the study showed that the results obtained using homology models are overall similar to those obtained using experimental structures. The 16 structural features analysed by Missense3D can be divided into two main groups: 1) features that retain a good performance regardless of the template sequence identity - which are mostly features concerning solvent accessibility and backbone structures and 2) features that lose their accuracy particularly when the sequence identity of the template is low - which are mostly features concerning bond lengths and atom distance. Despite the lower accuracy observed in some features, using high-quality predicted models (i.e. models with high sequence identity) can ensure more reliability of Missense3D results.

Table 5.1: Structural features involving bond length. The features used in Missense3D are in bold.

Feature                              TP    FP    TPR (%)   FPR (%)   TPR/FPR   P-value
Disulphide breakage 2.3 Å            65     3      3.31      0.14      23.53   < 0.0001
Disulphide breakage 3.3 Å            70     3      3.56      0.14      25.34   < 0.0001
Salt bridge breakage 4.0 Å           60    53      3.05      2.48       1.23   0.1328
Salt bridge breakage 5.0 Å          182   168      9.26      7.87       1.18   0.0559
Buried salt bridge breakage 4.0 Å    13     5      0.66      0.23       2.82   0.0194
Buried salt bridge breakage 5.0 Å    62    23      3.16      1.08       2.93   < 0.0001
H bond breakage 2.9 Å               440   340     22.39     15.93       1.41   < 0.0001
H bond breakage 3.9 Å               392   369     19.95     17.29       1.15   0.0144
Buried H bond breakage 2.9 Å        197    81     10.03      3.80       2.64   < 0.0001
Buried H bond breakage 3.9 Å        148    53      7.53      2.48       3.03   < 0.0001
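The TPR, FPR, and TPR/FPR columns of Table 5.1 can be recomputed from the TP and FP counts, as sketched below. The group sizes of 1,965 disease-associated and 2,134 neutral SAVs are assumptions back-calculated from the percentages in the table, not values stated in this section, and the function name is illustrative.

```python
# Assumed group sizes, back-calculated from the percentages in Table 5.1.
N_DISEASE = 1965
N_NEUTRAL = 2134

# (feature, TP, FP) as listed in Table 5.1.
ROWS = [
    ("Disulphide breakage 2.3 Å", 65, 3),
    ("Disulphide breakage 3.3 Å", 70, 3),
    ("Salt bridge breakage 4.0 Å", 60, 53),
    ("Salt bridge breakage 5.0 Å", 182, 168),
    ("Buried salt bridge breakage 4.0 Å", 13, 5),
    ("Buried salt bridge breakage 5.0 Å", 62, 23),
    ("H bond breakage 2.9 Å", 440, 340),
    ("H bond breakage 3.9 Å", 392, 369),
    ("Buried H bond breakage 2.9 Å", 197, 81),
    ("Buried H bond breakage 3.9 Å", 148, 53),
]


def rates(tp: int, fp: int) -> tuple[float, float, float]:
    """Return (TPR %, FPR %, TPR/FPR) for one structural feature."""
    tpr = 100.0 * tp / N_DISEASE
    fpr = 100.0 * fp / N_NEUTRAL
    return tpr, fpr, tpr / fpr


for feature, tp, fp in ROWS:
    tpr, fpr, ratio = rates(tp, fp)
    print(f"{feature:<36}{tpr:6.2f}{fpr:7.2f}{ratio:7.2f}")
```

Note that the ratio is computed from the unrounded rates, so it can differ slightly from dividing the rounded percentages shown in the table.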

5.3 Future work for Missense3D

Improved features and criteria of Missense3D

More elaborate criteria in the current features

The criteria of some structural features could be refined to yield more positive hits. One possible improvement is the criterion for the disruption of buried hydrogen bonds and salt bridges. Under the current criterion, the wild-type residue is required to be buried in order for the bonds to be considered. However, buried bonds could still be formed by an exposed residue that has its backbone lying on the protein surface and its side chain directed towards the protein interior, forming a buried chemical bond with another residue. An example is Arg121Cys in cystathionine beta-synthase (UniProt ID: P35520), which is reported in UniProt as disease-associated. Arg121 is located near the ligand binding site and is considered exposed. However, its side chain points towards the protein core, forming several hydrogen bonds with neighbouring residues Thr87, Glu110, Thr236, and Glu239. A structural analysis on PDB code 1M54 (chain F) confirmed that p.Arg121Cys is likely to destabilise the conformation of the binding site by disrupting multiple hydrogen bonds. Although the side-chain replacement algorithm used in Missense3D correctly predicted loss of hydrogen bonds in the mutant structure, the structural alert was not raised because the wild-type residue was exposed. Therefore, considering buried hydrogen bonds formed by exposed residues could improve the sensitivity of the program. One possible way to do this is to consider the solvent accessibility of only the side chains instead of the entire residues.

Considering SAVs that make the structure too rigid

Currently, the main goal of Missense3D is to provide insights into the impact of SAVs on protein structures. Most of the criteria consider disruption of features that can greatly destabilise proteins. However, SAVs that make the structures too rigid could also lead to loss or gain of protein functions. For example, SAVs in loop regions may reduce the flexibility in protein fold, or a mutation may form extra hydrogen bonds, making the protein structure more rigid which is unfavourable for some flexible proteins that interact with multiple partners.

Considering biological units for a more comprehensive analysis

Better insights could be obtained by analysing quaternary structures (i.e. protein complexes), as SAVs could be located near interface residues and obstruct the formation of protein complexes. Residues which were shown as exposed on tertiary structures may, in fact, be buried at the quaternary level as they serve as interface residues. Hydrogen bonds may be formed by these interface residues between each subunit. Similarly, hydrophilic residues and charged residues, which are highly unfavourable when introduced into the protein core, can also be taken into account as they could have a strong impact on interface residues.

Allowing flexible backbones for a better visualisation of mutant structures

Although using the fixed backbone technique is powerful for analysing the structural impact, the repacked mutant structure can only be used as a guide to show if a SAV introduces any unfavourable conditions to the structure. However, this may not represent the final conformation of the mutant structure. Of particular concern is when a SAV results in more steric clashes, disallowed phi/psi angles, buried hydrophilic introduced, or buried charge introduced, as the mutant structure with a fixed backbone would be in a thermodynamically unfavourable conformation. Thus, the backbone has to be adjusted to accommodate the structural constraints. However, such an analysis would require more computational resources and more advanced techniques such as incorporating .

Allowing batch processing in the web-server version of Missense3D

As of the current version (1.0), Missense3D only allows batch processing when used locally. The storage needed for the output by Missense3D is approximately 3 megabytes per variant. However, this does not include the extra storage needed for a zip file of the results, which include all modified structures, HTML pages, and the downloadable structure viewer, JSMol (43 megabytes per job). JSMol has to be included in the download package so that users can visualise the structure without an internet connection. Thus, a challenge for making the batch mode available in the online version of Missense3D lies in the storage required for each job.
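A rough estimate of the storage involved can be obtained from the two figures quoted above. The sketch below assumes one reading of those figures, namely that each job needs about 3 megabytes per variant plus a fixed 43 megabytes for the download package; the function name and the example job sizes are hypothetical.

```python
MB_PER_VARIANT = 3       # analysis output per variant, as quoted above
MB_PER_JOB_PACKAGE = 43  # per-job download package figure quoted above (assumed fixed)


def estimated_storage_mb(variants_per_job: int, jobs: int = 1) -> int:
    """Estimated storage in megabytes, assuming the two costs simply add up."""
    return jobs * (variants_per_job * MB_PER_VARIANT + MB_PER_JOB_PACKAGE)


# Hypothetical workloads for illustration.
single_variant_job = estimated_storage_mb(1)
batch_job = estimated_storage_mb(100)
many_batch_jobs = estimated_storage_mb(100, jobs=1000)
print(single_variant_job, batch_job, many_batch_jobs)
```

Under these assumptions the fixed package cost dominates single-variant jobs, whereas the per-variant output dominates large batch jobs.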

Incorporating machine learning algorithms to predict the pathogenicity of SAVs

The current version of Missense3D does not utilise any machine learning algorithm. However, as it was designed to identify major structural problems which could be lethal, it could effectively screen out neutral SAVs and can provide an in-depth qualitative analysis of a mutation. In the future, incorporating machine learning algorithms together with additional structural and sequence features could potentially increase the TPR and make Missense3D ideal for analysing SAVs in both descriptive and predictive aspects.

To make the best use of machine learning, a large dataset of structures and variants is required for the training set. One common problem when using machine learning algorithms is that the predictions are optimised for the training set. Therefore, it is important that the dataset is curated and harbours a variety of structures and variants. In addition, more non-structural features, such as network properties (e.g. centrality and betweenness), sequence conservation, and prediction scores from several in silico predictors could be incorporated to obtain a more accurate prediction.

Chapter 6

Conclusion

This thesis presents studies that aim to explain the relationship between SAVs and their associated genetic diseases by analysing sequence and protein structure features. Moreover, it shows the importance of pleiotropy in Mendelian diseases and elucidates the differences between pleiotropic and non-pleiotropic proteins at a system level. The results of this thesis also demonstrate that structural analysis performed on experimental structures as well as homology models can be extremely helpful for exploring the mechanisms by which disease-associated SAVs impact protein structures. A thorough understanding of the effect of genetic variants at sequence, structure, and protein network level is essential for elucidating the basis of genetic diseases and for guiding future in vitro studies.

Bibliography

Abrusán, G. and Marsh, J. A. (2016). Alpha helices are more robust to mutations than beta strands. PLoS Computational Biology, 12(12):e1005242.

Adzhubei, I., Jordan, D. M., and Sunyaev, S. R. (2013). Predicting functional effect of human missense mutations using PolyPhen-2. Current Protocols in Human Genetics, 76(SUPPL.76):1–7.

Adzhubei, I. A., Schmidt, S., Peshkin, L., Ramensky, V. E., Gerasimova, A., Bork, P., Kondrashov, A. S., and Sunyaev, S. R. (2010). A method and server for predicting damaging missense mutations. Nature Methods, 7(4):248–249.

Al-Numair, N. S. and Martin, A. C. (2013). The SAAP pipeline and database: tools to analyze the impact and predict the pathogenicity of mutations. BMC Genomics, 14 Suppl 3(Suppl 3):S4.

Alhuzimi, E., Leal, L. G., Sternberg, M. J., and David, A. (2018). Properties of human genes guided by their enrichment in rare and common variants. Human Mutation, 39(3):365–370.

Almstedt, K., Mårtensson, L. G., Carlsson, U., and Hammarström, P. (2008). Thermodynamic interrogation of a folding disease. Mutant mapping of position 107 in human carbonic anhydrase II linked to marble brain disease. Biochemistry, 47(5):1288–1298.

Altschul, S., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389–3402.

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410.

Argentaro, A., Yang, J.-C., Chapman, L., Kowalczyk, M. S., Gibbons, R. J., Higgs, D. R., Neuhaus, D., and Rhodes, D. (2007). Structural consequences of disease-causing mutations in the ATRX-DNMT3-DNMT3L (ADD) domain of the chromatin-associated protein ATRX. Proceedings of the National Academy of Sciences, 104(29):11939–11944.

Aurora, R., Srinivasan, R., and Rose, G. D. (1994). Rules for α-helix termination by glycine. Science, 264(5162):1126–1130.

Babicki, S., Arndt, D., Marcu, A., Liang, Y., Grant, J. R., Maciejewski, A., and Wishart, D. S. (2016). Heatmapper: web-enabled heat mapping for all. Nucleic Acids Research, 44(W1):W147–W153.

Bateman, A., Martin, M. J., O’Donovan, C., Magrane, M., Alpi, E., Antunes, R., Bely, B., Bingley, M., Bonilla, C., Britto, R., Bursteinas, B., Bye-A-Jee, H., Cowley, A., Da Silva, A., De Giorgi, M., Dogan, T., Fazzini, F., Castro, L. G., Figueira, L., Garmiri, P., Georghiou, G., Gonzalez, D., Hatton-Ellis, E., Li, W., Liu, W., Lopez, R., Luo, J., Lussi, Y., MacDougall, A., Nightingale, A., Palka, B., Pichler, K., Poggioli, D., Pundir, S., Pureza, L., Qi, G., Rosanoff, S., Saidi, R., Sawford, T., Shypitsyna, A., Speretta, E., Turner, E., Tyagi, N., Volynkin, V., Wardell, T., Warner, K., Watkins, X., Zaru, R., Zellner, H., Xenarios, I., Bougueleret, L., Bridge, A., Poux, S., Redaschi, N., Aimo, L., Argoud-Puy, G., Auchincloss, A., Axelsen, K., Bansal, P., Baratin, D., Blatter, M. C., Boeckmann, B., Bolleman, J., Boutet, E., Breuza, L., Casal-Casas, C., De Castro, E., Coudert, E., Cuche, B., Doche, M., Dornevil, D., Duvaud, S., Estreicher, A., Famiglietti, L., Feuermann, M., Gasteiger, E., Gehant, S., Gerritsen, V., Gos, A., Gruaz-Gumowski, N., Hinz, U., Hulo, C., Jungo, F., Keller, G., Lara, V., Lemercier, P., Lieberherr, D., Lombardot, T., Martin, X., Masson, P., Morgat, A., Neto, T., Nouspikel, N., Paesano, S., Pedruzzi, I., Pilbout, S., Pozzato, M., Pruess, M., Rivoire, C., Roechert, B., Schneider, M., Sigrist, C., Sonesson, K., Staehli, S., Stutz, A., Sundaram, S., Tognolli, M., Verbregue, L., Veuthey, A. L., Wu, C. H., Arighi, C. N., Arminski, L., Chen, C., Chen, Y., Garavelli, J. S., Huang, H., Laiho, K., McGarvey, P., Natale, D. A., Ross, K., Vinayaka, C. R., Wang, Q., Wang, Y., Yeh, L. S., and Zhang, J. (2017). UniProt: The universal protein knowledgebase. Nucleic Acids Research, 45(D1):D158–D169.

Batsanov, S. S. (2001). Van der Waals radii of elements. Inorganic Materials, 37(9):871–

885.

Bava, K. A. (2004). ProTherm, version 4.0: thermodynamic database for proteins and

mutants. Nucleic Acids Research, 32(90001):120D–121.

Beadle, G. W. and Tatum, E. L. (1941). Genetic control of biochemical reactions in

Neurospora. Proceedings of the National Academy of Sciences of the United States

of America, 27(11):499–506.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical

and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B,

57(1):289–300.

Berman, H. M. (2000). The protein data bank. Nucleic Acids Research, 28(1):235–242.

Betz, S. F. (1993). Disulfide bonds and the stability of globular proteins. Protein

Science, 2(10):1551–1558.

Bhattacharya, R., Rose, P. W., Burley, S. K., and Prlić, A. (2017). Impact of genetic

variation on three dimensional structure and function of proteins. PLoS ONE,

12(3):e0171355.

Biasini, M., Bienert, S., Waterhouse, A., Arnold, K., Studer, G., Schmidt, T., Kiefer,

F., Cassarino, T. G., Bertoni, M., Bordoli, L., and Schwede, T. (2014). SWISS-

MODEL: Modelling protein tertiary and quaternary structure using evolutionary

information. Nucleic Acids Research, 42(W1).

Bjørge, T., Cnattingius, S., Lie, R. T., Tretli, S., and Engeland, A. (2008). Cancer risk

in children with birth defects and in their families: A population based cohort study

of 5.2 million children from Norway and Sweden. Cancer Epidemiology Biomarkers

and Prevention, 17(3):500–506.

Bosshard, H. R., Marti, D. N., and Jelesarov, I. (2004). Protein stabilization by salt

bridges: Concepts, experimental approaches and clarification of some misunderstand-

ings. Journal of Molecular Recognition, 17(1):1–16.

Buchan, D. W., Minneci, F., Nugent, T. C., Bryson, K., and Jones, D. T. (2013).

Scalable web services for the PSIPRED Protein Analysis Workbench. Nucleic Acids

Research, 41(Web Server issue):W349–W357.

Calhoun, D. H., Bishop, D. F., Bernstein, H. S., Quinn, M., Hantzopoulos, P., and

Desnick, R. J. (1985). Fabry disease: isolation of a cDNA clone encoding human

alpha-galactosidase A. Proceedings of the National Academy of Sciences of the United

States of America, 82(21):7364–7368.

Canutescu, A. A. and Dunbrack, R. L. (2003). Cyclic coordinate descent: A robotics

algorithm for protein loop closure. Protein Science, 12(5):963–972.

Canutescu, A. A., Shelenkov, A. A., and Dunbrack, R. L. (2003). A graph-theory

algorithm for rapid protein side-chain prediction. Protein Science, 12(9):2001–2014.

Chatr-Aryamontri, A., Breitkreutz, B. J., Heinicke, S., Boucher, L., Winter, A., Stark,

C., Nixon, J., Ramage, L., Kolas, N., O’Donnell, L., Reguly, T., Breitkreutz, A.,

Sellam, A., Chen, D., Chang, C., Rust, J., Livstone, M., Oughtred, R., Dolinski, K.,

and Tyers, M. (2013). The BioGRID interaction database: 2013 update. Nucleic

Acids Research, 41(D1):D816–D823.

Chavali, S., Barrenas, F., Kanduri, K., and Benson, M. (2010). Network properties of

human disease genes with pleiotropic effects. BMC Systems Biology, 4:78.

Chen, C. R. and Makhatadze, G. I. (2015). ProteinVolume: Calculating molecular van

der Waals and void volumes in proteins. BMC Bioinformatics, 16(1):101.

Chen, V. B., Arendall, W. B., Headd, J. J., Keedy, D. A., Immormino, R. M., Kapral,

G. J., Murray, L. W., Richardson, J. S., and Richardson, D. C. (2010). MolProbity:

All-atom structure validation for macromolecular crystallography. Acta Crystallo-

graphica Section D: Biological Crystallography, 66(1):12–21.

Chen, W. H., Minguez, P., Lercher, M. J., and Bork, P. (2012). OGEE: An online gene

essentiality database. Nucleic Acids Research, 40(D1):D901–D906.

Cheng, I., Kocarnik, J. M., Dumitrescu, L., Lindor, N. M., Chang-Claude, J., Avery,

C. L., Caberto, C. P., Love, S. A., Slattery, M. L., Chan, A. T., Baron, J. A.,

Hindorff, L. A., Park, S. L., Schumacher, F. R., Hoffmeister, M., Kraft, P., Butler,

A. M., Duggan, D. J., Hou, L., Carlson, C. S., Monroe, K. R., Lin, Y., Carty, C. L.,

Mann, S., Ma, J., Giovannucci, E. L., Fuchs, C. S., Newcomb, P. A., Jenkins, M. A.,

Hopper, J. L., Haile, R. W., Conti, D. V., Campbell, P. T., Potter, J. D., Caan,

B. J., Schoen, R. E., Hayes, R. B., Chanock, S. J., Berndt, S. I., Küry, S., Bézieau,

S., Ambite, J. L., Kumaraguruparan, G., Richardson, D. M., Goodloe, R. J., Dilks,

H. H., Baker, P., Zanke, B. W., Lemire, M., Gallinger, S., Hsu, L., Jiao, S., Harrison,

T. A., Seminara, D., Haiman, C. A., Kooperberg, C., Wilkens, L. R., Hutter, C. M.,

White, E., Crawford, D. C., Heiss, G., Hudson, T. J., Brenner, H., Bush, W. S.,

Casey, G., Le Marchand, L., and Peters, U. (2014). Pleiotropic effects of genetic

risk variants for other cancers on colorectal cancer risk: PAGE, GECCO and CCFR

consortia. Gut, 63(5):800–807.

Chesmore, K., Bartlett, J., and Williams, S. M. (2018). The ubiquity of pleiotropy in

human disease. Human Genetics, 137(1):39–44.

Choi, E. J. and Mayo, S. L. (2006). Generation and analysis of proline mutants in

protein G. Protein Engineering, Design and Selection, 19(6):285–289.

Choi, Y., Sims, G. E., Murphy, S., Miller, J. R., and Chan, A. P. (2012). Predicting the

functional effect of amino acid substitutions and indels. PLoS ONE, 7(10):e46688.

Chothia, C. and Lesk, A. M. (1986). The relation between the divergence of sequence

and structure in proteins. The EMBO Journal, 5(4):823–826.

Clifford, R. J., Edmonson, M. N., Nguyen, C., and Buetow, K. H. (2004). Large-

scale analysis of non-synonymous coding region single nucleotide polymorphisms.

Bioinformatics, 20(7):1006–1014.

Coelho, T. (1996). Familial amyloid polyneuropathy: New developments in genetics

and treatment. Current Opinion in Neurology, 9(5):355–359.

Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning,

20(3):273–297.

Cosma, M. P., Pepe, S., Annunziata, I., Newbold, R. F., Grompe, M., Parenti, G., and

Ballabio, A. (2003). The multiple sulfatase deficiency gene encodes an essential and

limiting factor for the activity of sulfatases. Cell, 113(4):445–456.

Cosma, M. P., Pepe, S., Parenti, G., Settembre, C., Annunziata, I., Wade-Martins,

R., Di Domenico, C., Di Natale, P., Mankad, A., Cox, B., Uziel, G., Mancini, G.,

Zammarchi, E., Donati, M. A., Kleijer, W. J., Filocamo, M., Carrozzo, R., Carella,

M., and Ballabio, A. (2004). Molecular and functional analysis of SUMF1 mutations

in multiple sulfatase deficiency. Human Mutation, 23(6):576–581.

Cuff, A. L., Janes, R. W., and Martin, A. C. (2006). Analysing the ability to retain

sidechain hydrogen-bonds in mutant proteins. Bioinformatics, 22(12):1464–1470.

Cuff, A. L. and Martin, A. C. R. (2004). Analysis of void volumes in proteins and

application to stability of the p53 tumour suppressor protein. Journal of Molecular

Biology, 344(5):1199–1209.

Das, R. K., Mao, A. H., and Pappu, R. V. (2012). Unmasking functional motifs within

disordered regions of proteins. Science Signaling, 5(220).

Davey, N. E., Van Roey, K., Weatheritt, R. J., Toedt, G., Uyar, B., Altenberg, B.,

Budd, A., Diella, F., Dinkel, H., and Gibson, T. J. (2012). Attributes of short linear

motifs. Molecular BioSystems, 8(1):268–281.

David, A., Razali, R., Wass, M. N., and Sternberg, M. J. (2012). Protein-protein

interaction sites are hot spots for disease-associated nonsynonymous SNPs. Human

Mutation, 33(2):359–363.

David, A. and Sternberg, M. J. (2015). The contribution of missense mutations in core

and rim residues of protein-protein interfaces to human disease. Journal of Molecular

Biology, 427(17):2886–2898.

Dehouck, Y., Grosfils, A., Folch, B., Gilis, D., Bogaerts, P., and Rooman, M. (2009).

Prediction of protein stability changes upon mutations using statistical potentials

and neural networks: PoPMuSiC 2.0. Bioinformatics, 25(19):2537–2543.

DeLano, W. L. (2002). The PyMOL System.

http://www.pymol.org.

Dickerson, J. E., Zhu, A., Robertson, D. L., and Hentges, K. E. (2011). Defining the

role of essential genes in human disease. PLoS ONE, 6(11):e27368.

Dill, K. A. (1990). Dominant Forces in Protein Folding. Biochemistry, 29(31):7133–

7155.

Django Community (2018). Django: The web framework for perfectionists with deadlines.

Dodge, C., Schneider, R., and Sander, C. (1998). The HSSP database of pro-

tein structure-sequence alignments and family profiles. Nucleic Acids Research,

26(1):313–315.

Dosztányi, Z., Csizmók, V., Tompa, P., and Simon, I. (2005a). IUPred: Web server for

the prediction of intrinsically unstructured regions of proteins based on estimated

energy content. Bioinformatics, 21(16):3433–3434.

Dosztányi, Z., Csizmók, V., Tompa, P., and Simon, I. (2005b). The pairwise energy

content estimated from amino acid composition discriminates between folded and

intrinsically unstructured proteins. Journal of Molecular Biology, 347(4):827–839.

Driver, J. A. (2014). Inverse association between cancer and neurodegenerative disease:

review of the epidemiologic and biological evidence. Biogerontology, 15(6):547–557.

Dunker, A. K., Cortese, M. S., Romero, P., Iakoucheva, L. M., and Uversky, V. N.

(2005). Flexible nets. The roles of intrinsic disorder in protein interaction networks.

FEBS Journal, 272(20):5129–5148.

Dunker, A. K., Lawson, J. D., Brown, C. J., Williams, R. M., Romero, P., Oh, J. S.,

Oldfield, C. J., Campen, A. M., Ratliff, C. M., Hipps, K. W., Ausio, J., Nissen,

M. S., Reeves, R., Kang, C., Kissinger, C. R., Bailey, R. W., Griswold, M. D.,

Chiu, W., Garner, E. C., and Obradovic, Z. (2001). Intrinsically disordered protein.

Journal of Molecular Graphics and Modelling, 19(1):26–59.

Dunker, A. K., Silman, I., Uversky, V. N., and Sussman, J. L. (2008). Function and

structure of inherently disordered proteins. Current Opinion in Structural Biology, 18(6):756–764.

Eijsink, V. G., Dijkstra, B. W., Vriend, G., Van Der Zee, J. R., Veltman, O. R., Van

Der Vinne, B., Van Den Burg, B., Kempe, S., and Venema, G. (1992). The effect of

cavity-filling mutations on the thermostability of Bacillus stearothermophilus neutral

protease. Protein Engineering, Design and Selection, 5(5):421–426.

Ellard, S., Baple, E. L., Owens, M., Eccles, D. M., Abbs, S., and Zandra, C. (2017).

ACGS best practice guidelines for variant classification 2017. Association for Clinical

Genetic Science.

Eppig, J. T., Blake, J. A., Bult, C. J., Kadin, J. A., Richardson, J. E., Anagnostopoulos,

A., Babiuk, R. P., Baldarelli, R. M., Beal, J. S., Bello, S. M., Berghout, J., Blodgett,

O., Butler, N. E., Corbani, L. E., Cousins, S. L., Dene, H., Drabkin, H. J., Forthofer,

K. L., Hale, P., Hutchins, L., Knowlton, M., Law, M., Lewis, J. R., McAndrews, M.,

Miers, D. S., Montenko, H., Ni, L., Onda, H., Pittman, W., Recla, J. M., Reed, D. J.,

Richards-Smith, B., Sitnikov, D., Smith, C. L., Tomczuk, M., Washburn, L. L., and

Zhu, Y. (2015). The Mouse Genome Database (MGD): Facilitating mouse as a model

for human biology and disease. Nucleic Acids Research, 43(D1):D726–D736.

Eriksson, A. E., Baase, W. A., Zhang, X. J., Heinz, D. W., Blaber, M., Baldwin, E. P.,

and Matthews, B. W. (1992). Response of a protein structure to cavity-creating

mutations and its relation to the hydrophobic effect. Science, 255(5041):178–183.

Erlandsen, H., Patch, M. G., Gamez, A., Straub, M., and Stevens, R. C. (2003). Struc-

tural studies on phenylalanine hydroxylase and implications toward understanding

and treating phenylketonuria. Pediatrics, 112(6 Pt 2):1557–1565.

Eyre-Walker, A., Woolfit, M., and Phelps, T. (2006). The distribution of fitness effects

of new deleterious amino acid mutations in humans. Genetics, 173(2):891–900.

Finn, R. D., Bateman, A., Clements, J., Coggill, P., Eberhardt, R. Y., Eddy, S. R.,

Heger, A., Hetherington, K., Holm, L., Mistry, J., Sonnhammer, E. L. L., Tate, J.,

and Punta, M. (2014). Pfam: the protein families database. Nucleic Acids Research,

42(D1):D222–D230.

Finn, R. D., Clements, J., and Eddy, S. R. (2011). HMMER web server: interactive

sequence similarity searching. Nucleic Acids Research, 39(suppl):W29–W37.

Flores, T. P., Orengo, C. A., Moss, D. S., and Thornton, J. M. (1993). Comparison of

conformational characteristics in structurally similar protein pairs. Protein Science,

2(11):1811–1826.

Fox, C. S., Larson, M. G., Leip, E. P., Meigs, J. B., Wilson, P. W., and Levy, D. (2005).

Glycemic status and development of kidney disease: the Framingham Heart Study.

Diabetes Care, 28(10):2436–2440.

Galea, C., Bowman, P., and Kriwacki, R. W. (2005). Disruption of an intermonomer

salt bridge in the p53 tetramerization domain results in an increased propensity to

form amyloid fibrils. Protein Science, 14(12):2993–3003.

Gao, M., Zhou, H., and Skolnick, J. (2015). Insights into disease-associated mutations

in the human proteome through protein structural analysis. Structure, 23(7):1362–

1369.

Ge, W., Wei, B., Zhu, H., Miao, Z., Zhang, W., Leng, C., Li, J., Zhang, D., Sun, M.,

and Xu, X. (2017). A novel mutation of α-galactosidase A gene causes Fabry disease

mimicking primary erythromelalgia in a Chinese family. International Journal of

Neuroscience, 127(5):448–453.

Gell, D. A., Feng, L., Zhou, S., Jeffrey, P. D., Bendak, K., Gow, A., Weiss, M. J., Shi,

Y., and Mackay, J. P. (2009). A cis-proline in α-hemoglobin stabilizing protein di-

rects the structural reorganization of α-hemoglobin. Journal of Biological Chemistry,

284(43):29462–29469.

Georgi, B., Voight, B. F., and Bućan, M. (2013). From mouse to human: evolutionary

genomics analysis of human orthologs of essential genes. PLoS Genetics,

9(5):e1003484.

Goh, K.-I., Cusick, M. E., Valle, D., Childs, B., Vidal, M., and Barabási, A.-L. (2007).

The human disease network. Proceedings of the National Academy of Sciences,

104(21):8685–8690.

González-Pérez, A. and López-Bigas, N. (2011). Improving the assessment of the

outcome of nonsynonymous SNVs with a consensus deleteriousness score, Condel.

American Journal of Human Genetics, 88(4):440–449.

Gorr, S. U., Huang, X. F., Cowley, D. J., Kuliawat, R., and Arvan, P. (1999). Disruption

of disulfide bonds exhibits differential effects on trafficking of regulated secretory

proteins. The American Journal of Physiology, 277(1 Pt 1):C121–C131.

Grüneberg, H. (1938). An analysis of the “pleiotropic” effects of a new lethal mutation

in the rat (Mus norvegicus). Proceedings of the Royal Society B: Biological Sciences,

125(838):123–144.

Guex, N. and Peitsch, M. C. (1997). SWISS-MODEL and the Swiss-PdbViewer: An

environment for comparative protein modeling. Electrophoresis, 18(15):2714–2723.

Gunasekaran, K., Ramakrishnan, C., and Balaram, P. (1996). Disallowed Ramachan-

dran conformations of amino acid residues in protein structures. Journal of Molecular

Biology, 264(1):191–198.

Hadorn, E. and Mittwoch, U. (1961). Developmental genetics and lethal factors. The

American Journal of the Medical Sciences, 242(4):522.

Hageman, A. T., Gabreëls, F. J., Jong, J. G., Gabreëls-Festen, A. A., Van Den Berg,

C. J., Van Oost, B. A., and Wevers, R. A. (1995). Clinical Symptoms of Adult

Metachromatic Leukodystrophy and Arylsulfatase A Pseudodeficiency. Archives of

Neurology, 52(4):408–413.

Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A., and McKusick, V. A.

(2005). Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human

genes and genetic disorders. Nucleic Acids Research, 33(Database issue):D514–D517.

Hanson, R. M., Prilusky, J., Renjian, Z., Nakane, T., and Sussman, J. L. (2013).

JSmol and the next-generation web-based representation of 3D molecular structure

as applied to Proteopedia. Israel Journal of Chemistry, 53(3-4):207–216.

Hassan, M. S., Shaalan, A., Dessouky, M., Abdelnaiem, A. E., and ElHefnawi, M.

(2018). Evaluation of computational techniques for predicting non-synonymous single

nucleotide variants pathogenicity. Genomics.

He, B., Wang, K., Liu, Y., Xue, B., Uversky, V. N., and Dunker, A. K. (2009). Pre-

dicting intrinsic disorder in proteins: An overview. Cell Research, 19(8):929–949.

He, X. and Zhang, J. (2006). Toward a molecular understanding of pleiotropy. Genetics.

Herga, S., Berrin, J. G., Perrier, J., Puigserver, A., and Giardina, T. (2006). Identifi-

cation of the zinc binding ligands and the catalytic residue in human aspartoacylase,

an enzyme involved in Canavan disease. FEBS Letters, 580(25):5899–5904.

Hintze, B. J., Lewis, S. M., Richardson, J. S., and Richardson, D. C. (2016). Molpro-

bity’s ultimate rotamer-library distributions for model validation. Proteins: Struc-

ture, Function and Bioinformatics, 84(9):1177–1189.

Hodgkin, J. (1998). Seven types of pleiotropy. International Journal of Developmental

Biology, 42(3):501–505.

Hooft, R. W., Vriend, G., Sander, C., and Abola, E. E. (1996). Errors in protein

structures. Nature, 381(6580):272.

Hubbard, R. E. and Haider, M. K. (2010). Hydrogen bonds in proteins: role and

strength. In Encyclopedia of Life Sciences. John Wiley & Sons, Ltd, Chichester, UK.

Hubbard, S. J., Gross, K. H., and Argos, P. (1994). Intramolecular cavities in globular

proteins. Protein Engineering, Design and Selection, 7(5):613–626.

Hubert, H., Feinleib, M., McNamara, P., and Castelli, W. (1983). Obesity as an

independent risk factor for cardiovascular disease: a 26-year follow-up of participants

in the Framingham Heart Study. Circulation, 67(5):968–977.

Huda, W. and Abrahams, R. B. (2015). X-ray-based medical imaging and resolution.

American Journal of Roentgenology, 204(4):W393–W397.

Hunter, S., Apweiler, R., Attwood, T. K., Bairoch, A., Bateman, A., Binns, D., Bork,

P., Das, U., Daugherty, L., Duquenne, L., Finn, R. D., Gough, J., Haft, D., Hulo,

N., Kahn, D., Kelly, E., Laugraud, A., Letunic, I., Lonsdale, D., Lopez, R., Madera,

M., Maslen, J., McAnulla, C., McDowall, J., Mistry, J., Mitchell, A., Mulder, N.,

Natale, D., Orengo, C., Quinn, A. F., Selengut, J. D., Sigrist, C. J., Thimma, M.,

Thomas, P. D., Valentin, F., Wilson, D., Wu, C. H., and Yeats, C. (2009). InterPro:

The integrative protein signature database. Nucleic Acids Research, 37(Suppl.

1):D211–D215.

Hurst, J. M., McMillan, L. E., Porter, C. T., Allen, J., Fakorede, A., and Martin,

A. C. (2009). The SAAPdb web resource: A large-scale structural analysis of mutant

proteins. Human Mutation, 30(4):616–624.

Iakoucheva, L. M., Brown, C. J., Lawson, J. D., Obradović, Z., and Dunker, A. K.

(2002). Intrinsic disorder in cell-signaling and cancer-associated proteins. Journal of

Molecular Biology, 323(3):573–584.

Imai, K. and Mitaku, S. (2005). Mechanisms of secondary structure breakers in soluble

proteins. Biophysics, 1:55–65.

Ioannidis, N. M., Rothstein, J. H., Pejaver, V., Middha, S., McDonnell, S. K., Baheti,

S., Musolf, A., Li, Q., Holzinger, E., Karyadi, D., Cannon-Albright, L. A., Teerlink,

C. C., Stanford, J. L., Isaacs, W. B., Xu, J., Cooney, K. A., Lange, E. M., Schleutker,

J., Carpten, J. D., Powell, I. J., Cussenot, O., Cancel-Tassin, G., Giles, G. G.,

MacInnis, R. J., Maier, C., Hsieh, C. L., Wiklund, F., Catalona, W. J., Foulkes,

W. D., Mandal, D., Eeles, R. A., Kote-Jarai, Z., Bustamante, C. D., Schaid, D. J.,

Hastie, T., Ostrander, E. A., Bailey-Wilson, J. E., Radivojac, P., Thibodeau, S. N.,

Whittemore, A. S., and Sieh, W. (2016). REVEL: An ensemble method for predicting

the pathogenicity of rare missense variants. American Journal of Human Genetics,

99(4):877–885.

Isom, D. G., Castaneda, C. A., Cannon, B. R., Velu, P. D., and Garcia-Moreno E., B.

(2010). Charges in the hydrophobic interior of proteins. Proceedings of the National

Academy of Sciences, 107(37):16096–16100.

Ittisoponpisan, S., Alhuzimi, E., Sternberg, M. J., and David, A. (2017). Landscape of

pleiotropic proteins causing human disease: structural and system biology insights.

Human Mutation, 38(3):289–296.

Ittisoponpisan, S. and David, A. (2018). Structural biology helps interpret variants of

uncertain significance in genes causing endocrine and metabolic disorders. Journal

of the Endocrine Society, 2(8):842–854.

Jabs, A., Weiss, M. S., and Hilgenfeld, R. (1999). Non-proline cis peptide bonds in

proteins. Journal of Molecular Biology, 286(1):291–304.

Jacob, J., Duclohier, H., and Cafiso, D. S. (1999). The role of proline and glycine

in determining the backbone flexibility of a channel-forming peptide. Biophysical

Journal, 76(3):1367–1376.

Jacobson, D. R., Pastore, R. D., Yaghoubian, R., Kane, I., Gallo, G., Buck, F. S.,

and Buxbaum, J. N. (1997). Variant-sequence transthyretin (isoleucine 122) in late-

onset cardiac amyloidosis in black Americans. New England Journal of Medicine,

336(7):466–473.

Jefferys, B. R., Kelley, L. A., and Sternberg, M. J. E. (2010). Protein folding requires

crowd control in a simulated cell. Journal of Molecular Biology, 397(5):1329–1338.

Jones, D. T. and Cozzetto, D. (2015). DISOPRED3: Precise disordered region predic-

tions with annotated protein-binding activity. Bioinformatics, 31(6):857–863.

Kabsch, W. and Sander, C. (1983). Dictionary of protein secondary structure: Pattern

recognition of hydrogen-bonded and geometrical features. Biopolymers, 22(12):2577–

2637.

Kannel, W. B. and McGee, D. L. (1979). Diabetes and glucose tolerance as risk factors

for cardiovascular disease: the Framingham study. Diabetes Care, 2(2):120–126.

Kelley, L. A., Mezulis, S., Yates, C. M., Wass, M. N., and Sternberg, M. J. (2015). The

Phyre2 web portal for protein modeling, prediction and analysis. Nature Protocols,

10(6):845–858.

Kelley, L. A. and Sternberg, M. J. E. (2009). Protein structure prediction on the Web:

a case study using the Phyre server. Nature Protocols, 4(3):363–373.

Kerrien, S., Aranda, B., Breuza, L., Bridge, A., Broackes-Carter, F., Chen, C., Dues-

bury, M., Dumousseau, M., Feuermann, M., Hinz, U., Jandrasits, C., Jimenez, R. C.,

Khadake, J., Mahadevan, U., Masson, P., Pedruzzi, I., Pfeiffenberger, E., Porras, P.,

Raghunath, A., Roechert, B., Orchard, S., and Hermjakob, H. (2012). The IntAct

molecular interaction database in 2012. Nucleic Acids Research, 40(D1):D841–D846.

Khafizov, K., Madrid-Aliste, C., Almo, S. C., and Fiser, A. (2014). Trends in structural

coverage of the protein universe and the impact of the Protein Structure Initiative.

Proceedings of the National Academy of Sciences, 111(10):3733–3738.

Krieger, E., Koraimann, G., and Vriend, G. (2002). Increasing the precision of com-

parative models with YASARA NOVA - A self-parameterizing force field. Proteins:

Structure, Function and Genetics, 47(3):393–402.

Krivov, G. G., Shapovalov, M. V., and Dunbrack, R. L. (2009). Improved prediction of

protein side-chain conformations with SCWRL4. Proteins: Structure, Function and

Bioinformatics, 77(4):778–795.

Kryukov, G. V., Pennacchio, L. A., and Sunyaev, S. R. (2007). Most rare missense

alleles are deleterious in humans: implications for complex disease and association

studies. The American Journal of Human Genetics, 80(4):727–739.

Kucukkal, T. G., Petukh, M., Li, L., and Alexov, E. (2015). Structural and physico-

chemical effects of disease and non-disease nsSNPs on proteins. Current Opinion in

Structural Biology, 32:18–24.

Landrum, M. J., Lee, J. M., Benson, M., Brown, G., Chao, C., Chitipiralla, S., Gu, B.,

Hart, J., Hoffman, D., Hoover, J., Jang, W., Katz, K., Ovetsky, M., Riley, G., Sethi,

A., Tully, R., Villamarin-Salomon, R., Rubinstein, W., and Maglott, D. R. (2016).

ClinVar: Public archive of interpretations of clinically relevant variants. Nucleic

Acids Research, 44(D1):D862–D868.

Landrum, M. J., Lee, J. M., Benson, M., Brown, G. R., Chao, C., Chitipiralla, S., Gu,

B., Hart, J., Hoffman, D., Jang, W., Karapetyan, K., Katz, K., Liu, C., Maddipatla,

Z., Malheiro, A., McDaniel, K., Ovetsky, M., Riley, G., Zhou, G., Holmes, J. B.,

Kattman, B. L., and Maglott, D. R. (2018). ClinVar: Improving access to variant

interpretations and supporting evidence. Nucleic Acids Research, 46(D1):D1062–

D1067.

Larkin, M. A., Blackshields, G., Brown, N. P., Chenna, R., McGettigan, P. A.,

McWilliam, H., Valentin, F., Wallace, I. M., Wilm, A., Lopez, R., Thompson, J. D.,

Gibson, T. J., and Higgins, D. G. (2007). Clustal W and Clustal X version

2.0. Bioinformatics, 23(21):2947–2948.

Laskowski, R. A., MacArthur, M. W., Moss, D. S., and Thornton, J. M. (1993).

PROCHECK: a program to check the stereochemical quality of protein structures.

Journal of Applied Crystallography, 26(2):283–291.

Lee, B. and Vasmatzis, G. (1997). Stabilization of protein structures. Current Opinion

in Biotechnology, 8(4):423–428.

Lee, J., Lee, K., and Shin, S. (2000). Theoretical studies of the response of a protein

structure to cavity-creating mutations. Biophysical Journal, 78(4):1665–1671.

Lek, M., Karczewski, K. J., Minikel, E. V., Samocha, K. E., Banks, E., Fennell, T.,

O’Donnell-Luria, A. H., Ware, J. S., Hill, A. J., Cummings, B. B., Tukiainen, T.,

Birnbaum, D. P., Kosmicki, J. A., Duncan, L. E., Estrada, K., Zhao, F., Zou, J.,

Pierce-Hoffman, E., Berghout, J., Cooper, D. N., Deflaux, N., DePristo, M., Do,

R., Flannick, J., Fromer, M., Gauthier, L., Goldstein, J., Gupta, N., Howrigan, D.,

Kiezun, A., Kurki, M. I., Moonshine, A. L., Natarajan, P., Orozco, L., Peloso, G. M.,

Poplin, R., Rivas, M. A., Ruano-Rubio, V., Rose, S. A., Ruderfer, D. M., Shakir, K.,

Stenson, P. D., Stevens, C., Thomas, B. P., Tiao, G., Tusie-Luna, M. T., Weisburd,

B., Won, H. H., Yu, D., Altshuler, D. M., Ardissino, D., Boehnke, M., Danesh, J.,

Donnelly, S., Elosua, R., Florez, J. C., Gabriel, S. B., Getz, G., Glatt, S. J., Hultman,

C. M., Kathiresan, S., Laakso, M., McCarroll, S., McCarthy, M. I., McGovern, D.,

McPherson, R., Neale, B. M., Palotie, A., Purcell, S. M., Saleheen, D., Scharf, J. M.,

Sklar, P., Sullivan, P. F., Tuomilehto, J., Tsuang, M. T., Watkins, H. C., Wilson,

J. G., Daly, M. J., and MacArthur, D. G. (2016). Analysis of protein-coding genetic

variation in 60,706 humans. Nature, 536(7616):285–291.

Levin, L., Zelzion, E., Nachliel, E., Gutman, M., Tsfadia, Y., and Einav, Y. (2013). A

single disulfide bond disruption in the β3 integrin subunit promotes thiol/disulfide

exchange, a molecular dynamics study. PLoS ONE, 8(3):1–10.

Li, C., Iosef, C., Jia, C. Y. H., Gkourasas, T., Han, V. K. M., and Li, S. S. C. (2003).

Disease-causing SAP mutants are defective in ligand binding and protein folding.

Biochemistry, 42(50):14885–14892.

Li, M. X., Kwan, J. S. H., Bao, S. Y., Yang, W., Ho, S. L., Song, Y. Q., and Sham,

P. C. (2013). Predicting mendelian disease-causing non-synonymous single nucleotide

variants in exome sequencing studies. PLoS Genetics, 9(1):e1003143.

Lovell, S. C., Davis, I. W., Arendall, W. B., De Bakker, P. I., Word, J. M., Prisant,

M. G., Richardson, J. S., and Richardson, D. C. (2003). Structure validation by

Cα geometry: φ,ψ and Cβ deviation. Proteins: Structure, Function and Genetics,

50(3):437–450.

Lukas, J., Scalia, S., Eichler, S., Pockrandt, A. M., Dehn, N., Cozma, C., Giese, A. K.,

and Rolfs, A. (2016). Functional and clinical consequences of novel α-galactosidase

A mutations in Fabry disease. Human Mutation, 37(1):43–51.

Mahmood, K., Jung, C.-h., Philip, G., Georgeson, P., Chung, J., Pope, B. J., and Park,

D. J. (2017). Variant effect prediction tools assessed using independent, functional

assay-based datasets: implications for discovery and diagnostics. Human Genomics,

11(1):10.

Mann, H. B. and Whitney, D. R. (1947). On a test of whether one of two random vari-

ables is stochastically larger than the other. The Annals of Mathematical Statistics,

18(1):50–60.

Martinez, A., Knappskog, P. M., Olafsdottir, S., Døskeland, A. P., Eiken, H. G., Svebak,

R. M., Bozzini, M., Apold, J., and Flatmark, T. (1995). Expression of recombinant

human phenylalanine hydroxylase as fusion protein in Escherichia coli circumvents

proteolytic degradation by host cell proteases. Isolation and characterization of the

wild-type enzyme. The Biochemical Journal, 306(Pt 2):589–597.

McCutchen, S. L., Colon, W., and Kelly, J. W. (1993). Transthyretin mutation Leu-55-

Pro significantly alters tetramer stability and increases amyloidogenicity. Biochem-

istry, 32(45):12119–12127.

McDonald, I. K. and Thornton, J. M. (1994). Satisfying hydrogen bonding potential

in proteins. Journal of Molecular Biology, 238(5):777–793.

McGuffin, L. J., Bryson, K., and Jones, D. T. (2000). The PSIPRED protein structure

prediction server. Bioinformatics, 16(4):404–405.

Mészáros, B., Erdős, G., and Dosztányi, Z. (2018). IUPred2A: Context-dependent

prediction of protein disorder as a function of redox state and protein binding. Nucleic

Acids Research, 46(W1):W329–W337.

Mi, H., Poudel, S., Muruganujan, A., Casagrande, J. T., and Thomas, P. D. (2016).

PANTHER version 10: expanded protein families and functions, and analysis tools.

Nucleic Acids Research, 44(D1):D336–D342.

Midic, U., Oldfield, C. J., Dunker, A. K., Obradovic, Z., and Uversky, V. N. (2009).

Protein disorder in the human diseasome: Unfoldomics of human genetic diseases.

BMC Genomics, 10(Suppl. 1):S12.

Miller, S., Janin, J., Lesk, A. M., and Chothia, C. (1987). Interior and surface of

monomeric proteins. Journal of Molecular Biology, 196(3):641–656.

Miosge, L. A., Field, M. A., Sontani, Y., Cho, V., Johnson, S., Palkova, A., Balakish-

nan, B., Liang, R., Zhang, Y., Lyon, S., Beutler, B., Whittle, B., Bertram, E. M.,

Enders, A., Goodnow, C. C., and Andrews, T. D. (2015). Comparison of predicted

and actual consequences of missense mutations. Proceedings of the National Academy

of Sciences, 112(37):E5189–E5198.

Mitchell, K. J. (2012). What is complex about complex disorders?

Mizianty, M. J., Fan, X., Yan, J., Chalmers, E., Woloschuk, C., Joachimiak, A., and

Kurgan, L. (2014). Covering complete proteomes with X-ray structures: A current

snapshot. Acta Crystallographica Section D: Biological Crystallography, 70(11):2781–

2793.

Mort, M., Evani, U. S., Krishnan, V. G., Kamati, K. K., Baenziger, P. H., Bagchi, A.,

Peters, B. J., Sathyesh, R., Li, B., Sun, Y., Xue, B., Shah, N. H., Kann, M. G., Cooper,

D. N., Radivojac, P., and Mooney, S. D. (2010). In silico functional profiling of human

disease-associated and polymorphic amino acid substitutions. Human Mutation, 31(3):335–346.

Müller vom Hagen, J., Karle, K. N., Schüle, R., Krägeloh-Mann, I., and Schöls, L.

(2014). Leukodystrophies underlying cryptic spastic paraparesis: Frequency and

phenotype in 76 patients. European Journal of Neurology, 21(7):983–988.

Munson, M., Balasubramanian, S., Fleming, K. G., Nagi, A. D., O’Brien, R., Sturte-

vant, J. M., and Regan, L. (1996). What makes a protein a protein? Hydropho-

bic core designs that specify stability and structural properties. Protein Science,

5(8):1584–1593.

Ng, P. C. and Henikoff, S. (2003). SIFT: predicting amino acid changes that affect

protein function. Nucleic Acids Research, 31(13):3812–3814.

Niroula, A., Urolagin, S., and Vihinen, M. (2015). PON-P2: Prediction Method for

Fast and Reliable Identification of Harmful Variants. PLoS ONE, 10(2):e0117380.

Oates, M. E., Romero, P., Ishida, T., Ghalwash, M., Mizianty, M. J., Xue, B.,

Dosztányi, Z., Uversky, V. N., Obradovic, Z., Kurgan, L., Dunker, A. K., and Gough,

J. (2013). D2P2: Database of disordered protein predictions. Nucleic Acids Research,

41(D1).

Okano, Y., Eisensmith, R. C., Güttler, F., Lichter-Konecki, U., Konecki, D. S., Trefz,

F. K., Dasovich, M., Wang, T., Henriksen, K., Lou, H., and Woo, S. L. (1991). Molec-

ular basis of phenotypic heterogeneity in phenylketonuria. New England Journal of

Medicine, 324(18):1232–1238.

Oliveira, S. H. P., Ferraz, F. A. N., Honorato, R. V., Xavier-Neto, J., Sobreira, T. J. P.,

and de Oliveira, P. S. L. (2014). KVFinder: Steered identification of protein cavities

as a PyMOL plugin. BMC Bioinformatics, 15(1):197.

Pakula, A. A. and Sauer, R. T. (1989). Genetic Analysis of Protein Stability and

Function. Annual Review of Genetics, 23(1):289–310.

Pandurangan, A. P., Ochoa-Montaño, B., Ascher, D. B., and Blundell, T. L. (2017).

SDM: A server for predicting effects of mutations on protein stability. Nucleic Acids

Research, 45(W1):W229–W235.

Park, S. L., Fesinmeyer, M. D., Timofeeva, M., Caberto, C. P., Kocarnik, J. M., Han,

Y., Love, S. A., Young, A., Dumitrescu, L., Lin, Y., Goodloe, R., Wilkens, L. R.,

Hindorff, L., Fowke, J. H., Carty, C., Buyske, S., Schumacher, F. R., Butler, A.,

Dilks, H., Deelman, E., Cote, M. L., Chen, W., Pande, M., Christiani, D. C., Field,

J. K., Bickeböller, H., Risch, A., Heinrich, J., Brennan, P., Wang, Y., Eisen, T., Houlston,

R. S., Thun, M., Albanes, D., Caporaso, N., Peters, U., North, K. E., Heiss, G.,

Crawford, D. C., Bush, W. S., Haiman, C. A., Landi, M. T., Hung, R. J., Kooper-

berg, C., Amos, C. I., Le Marchand, L., and Cheng, I. (2014). Pleiotropic associations

of risk variants identified for other cancers with lung cancer risk: The PAGE and

TRICL consortia. Journal of the National Cancer Institute, 106(4):dju061.

Parker, M. H. and Hefford, M. A. (1997). A consensus residue analysis of loop and

helix-capping residues in four-alpha-helical-bundle proteins. Protein Engineering

Design and Selection, 10(5):487–496.

Peng, K., Radivojac, P., Vucetic, S., Dunker, A. K., and Obradovic, Z. (2006).

Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics, 7(1):208.

Peplow, M. (2016). The 100 000 Genomes Project. BMJ, 353:i1757.

Petukh, M., Kucukkal, T. G., and Alexov, E. (2015). On human disease-causing amino

acid variants: Statistical study of sequence and structural patterns. Human Mutation,

36(5):524–534.

Prlić, A., Down, T. A., Kulesha, E., Finn, R. D., Kähäri, A., and Hubbard, T. J. (2007).

Integrating sequence and structural biology with DAS. BMC Bioinformatics, 8:333.

Pucci, F., Bourgeas, R., and Rooman, M. (2016). Predicting protein thermal stability

changes upon point mutations using statistical potentials: Introducing HoTMuSiC.

Scientific Reports, 6(1):23257.

Pyeritz, R. E. (1989). Pleiotropy revisited: Molecular explanations of a classic concept.

American Journal of Medical Genetics, 34(1):124–134.

Qiu, H., Honey, D. M., Kingsbury, J. S., Park, A., Boudanova, E., Wei, R. R., Pan,

C. Q., and Edmunds, T. (2015). Impact of cysteine variants on the structure, activity,

and stability of recombinant human α-galactosidase A. Protein Science, 24(9):1401–

1411.

Quan, L., Lv, Q., and Zhang, Y. (2016). STRUM: Structure-based prediction of protein

stability changes upon single-point mutation. Bioinformatics, 32(19):2936–2946.

Radivojac, P., Obradovic, Z., Smith, D. K., Zhu, G., Vucetic, S., Brown, C. J., Lawson,

J. D., and Dunker, A. K. (2004). Protein flexibility and intrinsic disorder. Protein

Sci, 13(1):71–80.

Ramachandran, G. N., Ramakrishnan, C., and Sasisekharan, V. (1963). Stereochemistry

of polypeptide chain configurations. Journal of Molecular Biology, 7(1):95–99.

Ramachandran, S., Kota, P., Ding, F., and Dokholyan, N. V. (2011). Automated

minimization of steric clashes in protein structures. Proteins: Structure, Function

and Bioinformatics, 79(1):261–270.

Rashin, A. A., Iofin, M., and Honig, B. (1986). Internal cavities and buried waters in

globular proteins. Biochemistry, 25(12):3619–3625.

Raychaudhuri, S., Dey, S., Bhattacharyya, N. P., and Mukhopadhyay, D. (2009). The

role of intrinsically unstructured proteins in neurodegenerative diseases. PLoS ONE,

4(5):e5566.

Remmert, M., Biegert, A., Hauser, A., and Söding, J. (2012). HHblits: Lightning-fast

iterative protein sequence searching by HMM-HMM alignment. Nature Methods,

9(2):173–175.

Reva, B., Antipin, Y., and Sander, C. (2007). Determinants of protein function revealed

by combinatorial entropy optimization. Genome Biology, 8(11):R232.

Reva, B., Antipin, Y., and Sander, C. (2011). Predicting the functional impact of pro-

tein mutations: Application to cancer genomics. Nucleic Acids Research, 39(17):e118.

Richards, S., Aziz, N., Bale, S., Bick, D., Das, S., Gastier-Foster, J., Grody, W. W.,

Hegde, M., Lyon, E., Spector, E., Voelkerding, K., and Rehm, H. L. (2015). Stan-

dards and guidelines for the interpretation of sequence variants: A joint consensus

recommendation of the American College of Medical Genetics and Genomics and the

Association for Molecular Pathology. Genetics in Medicine, 17(5):405–424.

Rivals, I., Personnaz, L., Taing, L., and Potier, M. C. (2007). Enrichment or depletion

of a GO category within a class of genes: Which test? Bioinformatics, 23(4):401–407.

Romero, P., Obradovic, Z., Kissinger, C., Villafranca, J., and Dunker, A. (1997). Iden-

tifying disordered regions in proteins from amino acid sequence. In Proceedings of

International Conference on Neural Networks (ICNN’97), volume 1, pages 90–95.

IEEE.

Rose, G. D., Geselowitz, A. R., Lesser, G. J., Lee, R. H., and Zehfus, M. H. (1985).

Hydrophobicity of amino acid residues in globular proteins. Science, 229(4716):834–

838.

Rost, B. (1999). Twilight zone of protein sequence alignments. Protein Engineering,

Design and Selection, 12(2):85–94.

Rost, B. and Sander, C. (1994). Conservation and prediction of solvent accessibility in

protein families. Proteins: Structure, Function, and Bioinformatics, 20(3):216–226.

Saito, M., Kono, H., Morii, H., Uedaira, H., Tahirov, T. H., Ogata, K., and Sarai, A.

(2000). Cavity-filling mutations enhance protein stability by lowering the free energy

of native state. The Journal of Physical Chemistry B, 104(15):3705–3711.

Sanghera, D. K., Wagenknecht, D. R., McIntyre, J. A., and Kamboh, M. I. (1997).

Identification of structural mutations in the fifth domain of apolipoprotein H (β2-

glycoprotein I) which affect phospholipid binding. Human Molecular Genetics,

6(2):311–316.

Saunders, C. T. and Baker, D. (2002). Evaluation of structural and evolutionary contri-

butions to deleterious mutation prediction. Journal of Molecular Biology, 322(4):891–

901.

Savojardo, C., Fariselli, P., Martelli, P. L., and Casadio, R. (2016). INPS-MD: A web

server to predict stability of protein variants from sequence and structure. Bioinfor-

matics, 32(16):2542–2544.

Schmidt, B. Z., Fowler, N. L., Hidvegi, T., Perlmutter, D. H., and Colten, H. R.

(1999). Disruption of disulfide bonds is responsible for impaired secretion in human

complement factor H deficiency. Journal of Biological Chemistry, 274(17):11782–

11788.

Schymkowitz, J., Borg, J., Stricher, F., Nys, R., Rousseau, F., and Serrano, L. (2005).

The FoldX web server: An online force field. Nucleic Acids Research, 33(SUPPL. 2).

Sebastião, M. P., Saraiva, M. J., and Damas, A. M. (1998). The crystal structure

of amyloidogenic Leu55 Pro transthyretin variant reveals a possible pathway for

transthyretin polymerization into amyloid fibrils. Journal of Biological Chemistry,

273(38):24715–24722.

Sebastião, P., Dauter, Z., Saraiva, M. J., and Damas, A. M. (1996). Crystallization

and preliminary X-ray diffraction studies of Leu55Pro variant transthyretin. Acta

Crystallographica Section D: Biological Crystallography, 52(3):566–568.

Shapovalov, M. V. and Dunbrack, R. L. (2011). A smoothed backbone-dependent

rotamer library for proteins derived from adaptive kernel density estimates and re-

gressions. Structure, 19(6):844–858.

Sherry, S. T. (2001). dbSNP: the NCBI database of genetic variation. Nucleic Acids

Research, 29(1):308–311.

Shihab, H. A., Gough, J., Cooper, D. N., Stenson, P. D., Barker, G. L., Edwards,

K. J., Day, I. N., and Gaunt, T. R. (2013). Predicting the functional, molecular, and

phenotypic consequences of amino acid substitutions using hidden Markov models.

Hum Mutat, 34(1):57–65.

Sickmeier, M., Hamilton, J. A., LeGall, T., Vacic, V., Cortese, M. S., Tantos, A., Szabo,

B., Tompa, P., Chen, J., Uversky, V. N., Obradovic, Z., and Dunker, A. K. (2007).

DisProt: The database of disordered proteins. Nucleic Acids Research, 35(SUPPL.

1):786–93.

Silva, J. M., Marran, K., Parker, J. S., Silva, J., Golding, M., Schlabach, M. R.,

Elledge, S. J., Hannon, G. J., and Chang, K. (2008). Profiling essential genes in

human mammary cells by multiplex RNAi screening. Science, 319(5863):617–620.

Sim, N.-L., Kumar, P., Hu, J., Henikoff, S., Schneider, G., and Ng, P. C. (2012). SIFT

web server: predicting effects of amino acid substitutions on proteins. Nucleic Acids

Research, 40(Web Server issue):452–7.

Sivakumaran, S., Agakov, F., Theodoratou, E., Prendergast, J. G., Zgaga, L., Mano-

lio, T., Rudan, I., McKeigue, P., Wilson, J. F., and Campbell, H. (2011). Abundant

pleiotropy in human complex diseases and traits. American Journal of Human Ge-

netics, 89(5):607–618.

Smigielski, E. M. (2000). dbSNP: a database of single nucleotide polymorphisms.

Nucleic Acids Research, 28(1):352–355.

Söding, J. (2005). Protein homology detection by HMM-HMM comparison.

Bioinformatics, 21(7):951–960.

Somody, J. C., MacKinnon, S. S., and Windemuth, A. (2017). Structural coverage of

the proteome for pharmaceutical applications. Drug Discovery Today, 22(12):1792–

1799.

Stark, C. (2006). BioGRID: a general repository for interaction datasets. Nucleic Acids

Research, 34(90001):D535–D539.

Stearns, F. W. (2010). One hundred years of pleiotropy: A retrospective. Genetics,

186(3):767–773.

Stenson, P. D., Mort, M., Ball, E. V., Evans, K., Hayden, M., Heywood, S., Hus-

sain, M., Phillips, A. D., and Cooper, D. N. (2017). The Human Gene Mutation

Database: towards a comprehensive repository of inherited mutation data for medical

research, genetic diagnosis and next-generation sequencing studies. Human Genetics,

136(6):665–677.

Stewart, D. E., Sarkar, A., and Wampler, J. E. (1990). Occurrence and role of cis

peptide bonds in protein structures. Journal of Molecular Biology, 214(1):253–260.

Stone, E. A. and Sidow, A. (2005). Physicochemical constraint violation by missense

substitutions mediates impairment of protein function and disease severity.

Genome Research, 15(7):978–986.

Sun, Y., Overvad, K., and Olsen, J. (2014). Cancer risks in children with congenital

malformations in the nervous and circulatory system-A population based cohort

study. Cancer Epidemiology, 38(4):393–400.

Swanson, P. D. (1995). Diagnosis of inherited metabolic disorders affecting the nervous

system. Journal of Neurology, Neurosurgery and Psychiatry, 59(5):460–470.

Szumilas, M. (2010). Explaining odds ratios. Journal of the Canadian Academy of

Child and Adolescent Psychiatry, 19(3):227–229.

Tarini, M., Cignoni, P., and Montani, C. (2006). Ambient occlusion and edge cueing

to enhance real time molecular visualization. IEEE Transactions on Visualization

and Computer Graphics, 12(5):1237–1244.

Till, M. S. and Ullmann, G. M. (2010). McVol - A program for calculating protein

volumes and identifying cavities by a Monte Carlo algorithm. Journal of Molecular

Modeling, 16(3):419–429.

Torshin, I. Y. and Harrison, R. W. (2001). Charge centers and formation of the protein

folding core. Proteins: Structure, Function and Genetics, 43(4):353–364.

Uversky, V. N., Oldfield, C. J., and Dunker, A. K. (2008). Intrinsically disordered pro-

teins in human diseases: introducing the D2 concept. Annual Review of Biophysics,

37(1):215–246.

Uversky, V. N., Oldfield, C. J., Midic, U., Xie, H., Xue, B., Vucetic, S., Iakoucheva,

L. M., Obradovic, Z., and Dunker, A. K. (2009). Unfoldomics of human diseases:

Linking protein intrinsic disorder with diseases. BMC Genomics, 10(SUPPL. 1):S7.

Vacic, V. and Iakoucheva, L. M. (2012). Disease mutations in disordered regions -

Exception to the rule? Molecular BioSystems, 8(1):27–32.

Vacic, V., Markwick, P. R., Oldfield, C. J., Zhao, X., Haynes, C., Uversky, V. N.,

and Iakoucheva, L. M. (2012). Disease-associated mutations disrupt function-

ally important regions of intrinsic protein disorder. PLoS Computational Biology,

8(10):e1002709.

Van Der Lee, R., Buljan, M., Lang, B., Weatheritt, R. J., Daughdrill, G. W., Dunker,

A. K., Fuxreiter, M., Gough, J., Gsponer, J., Jones, D. T., Kim, P. M., Kriwacki,

R. W., Oldfield, C. J., Pappu, R. V., Tompa, P., Uversky, V. N., Wright, P. E., and

Babu, M. M. (2014). Classification of intrinsically disordered regions and proteins.

Chemical Reviews, 114(13):6589–6631.

Venselaar, H., te Beek, T. A., Kuipers, R. K., Hekkelman, M. L., and Vriend, G.

(2010). Protein structure analysis of mutations causing inheritable diseases. An e-

Science approach with life scientist friendly interfaces. BMC Bioinformatics, 11:548.

Vihinen, M. (2017). One gene, several diseases: the characteristics of pleiotropic pro-

teins. Human Mutation, 38(3):241–241.

Wagner, G. P., Kenney-Hunt, J. P., Pavlicev, M., Peck, J. R., Waxman, D., and

Cheverud, J. M. (2008). Pleiotropic scaling of gene effects and the 'cost of

complexity'. Nature, 452(7186):470–472.

Wagner, G. P. and Zhang, J. (2011). The pleiotropic structure of the genotype-

phenotype map: the evolvability of complex organisms. Nature Reviews Genetics,

12(3):204–213.

Wang, G. and Dunbrack, R. L. (2003). PISCES: A protein sequence culling server.

Bioinformatics, 19(12):1589–1591.

Wang, M. and Wei, L. (2016). iFish: predicting the pathogenicity of human nonsynonymous

variants using gene-specific/family-specific attributes and classifiers. Scientific

Reports, 6:31321.

Wang, Z., Liao, B.-Y., and Zhang, J. (2010). Genomic patterns of pleiotropy and

the evolution of complexity. Proceedings of the National Academy of Sciences,

107(42):18034–18039.

Wang, Z. and Moult, J. (2001). SNPs, protein structure, and disease. Human Mutation,

17(4):263–270.

Ward, J. J., Sodhi, J. S., McGuffin, L. J., Buxton, B. F., and Jones, D. T. (2004). Pre-

diction and functional analysis of native disorder in proteins from the three kingdoms

of life. Journal of Molecular Biology, 337(3):635–645.

Wass, M. N., Kelley, L. A., and Sternberg, M. J. E. (2010). 3DLigandSite: pre-

dicting ligand-binding sites using similar structures. Nucleic Acids Research,

38(suppl 2):W469–W473.

Webb, B. and Sali, A. (2016). Comparative protein structure modeling using MOD-

ELLER. Current Protocols in Bioinformatics, 2016:1–5.

Westermark, P., Sletten, K., Johansson, B., and Cornwell, G. G. (1990). Fibril in

senile systemic amyloidosis is derived from normal transthyretin. Proceedings of the

National Academy of Sciences, 87(7):2843–2845.

Williams, K. A. and Deber, C. M. (1991). Proline residues in transmembrane helices:

structural or dynamic role? Biochemistry, 30(37):8919–8923.

Word, J. M., Lovell, S. C., Labean, T. H., Taylor, H. C., Zalis, M. E., Presley, B. K.,

Richardson, J. S., and Richardson, D. C. (1999). Visualizing and quantifying molec-

ular goodness-of-fit: Small-probe contact dots with explicit hydrogen atoms. Journal

of Molecular Biology, 285(4):1711–1733.

World Health Organization. (2004). International statistical classification of diseases

and related health problems. World Health Organization.

Worth, C. L. and Blundell, T. L. (2010). On the evolutionary conservation of hydrogen

bonds made by buried polar amino acids: The hidden joists, braces and trusses of

protein architecture. BMC Evolutionary Biology, 10(1):161.

Xie, H., Vucetic, S., Iakoucheva, L. M., Oldfield, C. J., Dunker, A. K., Uversky, V. N.,

and Obradovic, Z. (2007). Functional anthology of intrinsic disorder. 1. Biological

processes and functions of proteins with long disordered regions. Journal of Proteome

Research, 6(5):1882–1898.

Xie, W. and Sahinidis, N. V. (2006). Residue-rotamer-reduction algorithm for the

protein side-chain conformation problem. Bioinformatics, 22(2):188–194.

Xu, J., Baase, W. A., Baldwin, E., and Matthews, B. W. (1998). The response of

T4 lysozyme to large-to-small substitutions within the core and its relation to the

hydrophobic effect. Protein Science, 7(1):158–177.

Yampolsky, L. Y., Kondrashov, F. A., and Kondrashov, A. S. (2005). Distribution of

the strength of selection against amino acid replacements in human proteins. Human

Molecular Genetics, 14(21):3191–3201.

Yates, C. M., Filippis, I., Kelley, L. A., and Sternberg, M. J. (2014). SuSPect: Enhanced

prediction of single amino acid variant (SAV) phenotype using network features.

Journal of Molecular Biology, 426(14):2692–2701.

Yue, P., Li, Z., and Moult, J. (2005). Loss of protein structure stability as a major

causative factor in monogenic disease. Journal of Molecular Biology, 353(2):459–473.

Yue, P., Melamud, E., and Moult, J. (2006). SNPs3D: Candidate gene and SNP

selection for association studies. BMC Bioinformatics, 7(1):166.

Zeldenrust, S. R. and Benson, M. D. (2010). Familial and senile amyloidosis caused

by transthyretin. In Protein Misfolding Diseases: Current and Emerging Principles

and Therapies, pages 795–815.

Zhang, J., Liu, J., Sun, J., Chen, C., Foltz, G., and Lin, B. (2014). Identifying driver

mutations from sequencing data of heterogeneous tumors in the era of personalized

genome sequencing. Briefings in Bioinformatics, 15(2):244–255.

Zou, L., Sriswasdi, S., Ross, B., Missiuro, P. V., Liu, J., and Ge, H. (2008). System-

atic analysis of pleiotropy in C. elegans early embryogenesis. PLoS Comput. Biol.,

4(2):e1000003.

Appendix A

Pleiotropy in Human Disease

Table A.1: The number of proteins in each dataset. The IUPred and DISOPRED rows show the number of proteins predicted to contain long disordered regions using cut-offs of 50 and 30 residues.

Pleiotropic           Full         DPh          SPh
Total                 412          257          155
IUPred (50 / 30)      209 / 231    132 / 148    77 / 83
DISOPRED (50 / 30)    227 / 287    142 / 182    85 / 105

Non-pleiotropic       Full
Total                 1345
IUPred (50 / 30)      516 / 594
DISOPRED (50 / 30)    555 / 843

Table A.2: Odds ratios comparing the preferences of SAVs between pleiotropic and non-pleiotropic proteins. The test was performed on the full pleiotropic set (412 proteins) against the non-pleiotropic set (1,345 proteins).

Pleiotropic DPh vs. Non-pleiotropic     OR      95% CI         P-value
Disease-causing SAVs                    2.85    2.77 - 2.94    < 0.001
Polymorphisms (ExAC - common)           0.86    0.80 - 0.93    < 0.001
Polymorphisms (Humsavar - all)          1.43    1.36 - 1.51    < 0.001
- Polymorphisms (Humsavar - common)     0.97    0.88 - 1.06    0.4751
- Polymorphisms (Humsavar - rare)       1.46    1.31 - 1.62    < 0.001
- Polymorphisms (Humsavar no MAF)       2.13    2.17 - 2.65    < 0.001
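As an illustration of how an odds ratio and its 95% confidence interval of the kind reported in Table A.2 can be obtained from a 2x2 contingency table, the following is a minimal sketch. It assumes the standard log-based (Wald) interval; the function name and the example counts are hypothetical and are not taken from the thesis data.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a, b: counts with / without the feature in group 1
    c, d: counts with / without the feature in group 2
    z:    normal quantile (1.96 for a 95% interval)
    """
    or_value = (a * d) / (b * c)
    # Standard error of ln(OR)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_value) - z * se)
    upper = math.exp(math.log(or_value) + z * se)
    return or_value, lower, upper


# Hypothetical counts, for illustration only
or_value, lower, upper = odds_ratio_ci(20, 80, 10, 90)
print(f"OR = {or_value:.2f}, 95% CI = {lower:.2f} - {upper:.2f}")
```

An interval that contains 1 indicates that the difference between the two groups is not significant at the chosen level.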


Table A.3: Numbers of proteins with disordered regions and percentages of disordered residues (pleiotropic DPh vs. non-pleiotropic)

              Proteins with disorders                   Disordered residues
Tool          Pl. DPh        Non Pl.        P           Pl. DPh            Non Pl.            P
IUPred 30     148 (57.6%)    594 (44.2%)    <0.001      20.8 (1.1-100)     26.3 (1.2-100)     0.090
IUPred 50     132 (51.4%)    516 (38.4%)    <0.001      23.6 (1.1-100)     29.7 (1.2-100)     0.027
DISOPRED 30   182 (70.8%)    843 (62.7%)    0.013       18.7 (1.1-99.5)    18.7 (1.5-98.5)    0.227
DISOPRED 50   142 (55.3%)    555 (41.3%)    0.039       20.2 (1.7-99.5)    25.1 (1.0-98.5)    0.225

*P: P-Value, Wilcoxon Test

Table A.4: Numbers of proteins with disordered regions and percentages of disordered residues (pleiotropic vs. non-pleiotropic)

              Proteins with disorders                   Disordered residues
Tool          Pl.            Non Pl.        P           Pl.                Non Pl.            P
IUPred 30     231 (56.1%)    594 (44.2%)    <0.001      22.4 (1.1-100)     26.3 (1.2-100)     0.165
IUPred 50     209 (50.7%)    516 (38.4%)    <0.001      24.2 (1.1-100)     29.7 (1.2-100)     0.028
DISOPRED 30   287 (69.7%)    843 (62.7%)    0.010       20.8 (1.1-99.5)    18.7 (1.5-98.5)    0.284
DISOPRED 50   227 (55.1%)    555 (41.3%)    <0.001      22.9 (1.3-99.5)    25.1 (1.0-98.5)    0.084

*P: P-Value, Wilcoxon Test

Table A.5: 46 selected mouse phenotypes for the gene essentiality analysis. Of the 53 phenotypes associated with the keyword ‘lethality’, 46 were selected following the method of Georgi et al. (2013). The other 7 phenotypes (marked with *) were excluded.

MGI Phenotype   Description
MP:0002058      neonatal lethality
MP:0002080      prenatal lethality
MP:0002081      perinatal lethality
MP:0002082      postnatal lethality
MP:0002083*     premature death
MP:0006204      embryonic lethality before implantation
MP:0006205      embryonic lethality before somite formation
MP:0006206      embryonic lethality before turning of embryo
MP:0006207      embryonic lethality during organogenesis
MP:0006208      lethality throughout fetal growth and development
MP:0008527      embryonic lethality at implantation
MP:0008569      lethality at weaning
MP:0008762      embryonic lethality
MP:0009850      embryonic lethality between implantation and placentation
MP:0010768*     mortality/aging
MP:0010769*     abnormal survival
MP:0010770      preweaning lethality
MP:0010831*     partial lethality
MP:0010832      lethality during fetal growth through weaning
MP:0011083      complete lethality at weaning
MP:0011084      partial lethality at weaning
MP:0011085      complete postnatal lethality
MP:0011086      partial postnatal lethality
MP:0011087      complete neonatal lethality
MP:0011088      partial neonatal lethality
MP:0011089      complete perinatal lethality
MP:0011090      partial perinatal lethality
MP:0011091      complete prenatal lethality
MP:0011092      complete embryonic lethality
MP:0011093      complete embryonic lethality at implantation
MP:0011094      complete embryonic lethality before implantation
MP:0011095      complete embryonic lethality between implantation and placentation
MP:0011096      complete embryonic lethality before somite formation
MP:0011097      complete embryonic lethality before turning of embryo
MP:0011098      complete embryonic lethality during organogenesis
MP:0011099      complete lethality throughout fetal growth and development
MP:0011100      complete preweaning lethality
MP:0011101      partial prenatal lethality
MP:0011102      partial embryonic lethality
MP:0011103      partial embryonic lethality at implantation
MP:0011104      partial embryonic lethality before implantation
MP:0011105      partial embryonic lethality between implantation and placentation
MP:0011106      partial embryonic lethality before somite formation
MP:0011107      partial embryonic lethality before turning of embryo
MP:0011108      partial embryonic lethality during organogenesis
MP:0011109      partial lethality throughout fetal growth and development
MP:0011110      partial preweaning lethality
MP:0011111      complete lethality during fetal growth through weaning
MP:0011112      partial lethality during fetal growth through weaning
MP:0011400      complete lethality
MP:0013292*     embryonic lethality prior to organogenesis
MP:0013293*     embryonic lethality prior to tooth bud stage
MP:0013294*     prenatal lethality prior to heart atrial septation

Table A.6: Percentages of essential proteins

Human-Mouse Orthologue   Pleiotropic (412)   DPh (257)      SPh (155)     Non-Pl. (1345)
Essential                228 (55.34%)        152 (59.14%)   76 (49.03%)   483 (35.91%)
P-value (vs. Non-Pl.)    < 0.001             < 0.001        0.001         -

OGEE                     Pleiotropic (412)   DPh (257)      SPh (155)     Non-Pl. (1345)
Essential                69 (16.75%)         52 (20.23%)    17 (10.97%)   141 (10.48%)
P-value (vs. Non-Pl.)    < 0.001             < 0.001        0.30          -

Table A.7: Interaction statistics using data from BioGrid and IntAct

BioGrid   Pleiotropic (412)   DPh (257)   SPh (155)   Non-Pl. (1345)
Min       0                   0           0           0
Max       1980                1980        563         1187
Mean      32.56               40.16       19.97       18.88
Median    9                   11          7           6
STD       112.75              136.47      52.09       48.98

IntAct    Pleiotropic (412)   DPh (257)   SPh (155)   Non-Pl. (1345)
Min       0                   0           0           0
Max       171                 171         144         526
Mean      9.21                9.64        8.76        6.52
Median    2                   2           2           2
STD       20.76               20.84       20.61       18.60

Table A.8: Interaction statistics by protein essentiality using data from BioGrid

Essential protein by mouse orthologue set
                          min   max    mean   median   STD     P (Mann-Whitney)
Pleiotropic DPh (152)     0     1980   55.9   17       173     < 0.001
Non-Pleiotropic (483)     0     1187   26.2   8        72.3

Non-essential protein by mouse orthologue set
                          min   max    mean   median   STD     P (Mann-Whitney)
Pleiotropic DPh (105)     0     289    17.3   4        34.5    0.38
Non-Pleiotropic (862)     0     383    14.8   5        27.6

Essential protein by OGEE set
                          min   max    mean   median   STD     P (Mann-Whitney)
Pleiotropic DPh (52)      0     1980   90.7   32       272.5   0.02
Non-Pleiotropic (141)     0     1187   43.5   16       110.5

Non-essential protein by OGEE set
                          min   max    mean   median   STD     P (Mann-Whitney)
Pleiotropic DPh (205)     0     538    27.3   8        60      < 0.001
Non-Pleiotropic (1204)    0     639    16     6        34.1

Table A.9: Observed and expected number of neutral SAVs in disordered regions. The dataset includes 412 pleiotropic (10,154 disease-associated and 2,055 neutral SAVs in 367,076 residues) and 1,345 non-pleiotropic proteins (9,122 disease-associated and 3,621 neutral SAVs in 924,002 residues). The disordered regions were predicted by IUPred with a cut-off of 50 amino acids.

                    Observed   Expected   O/E    P-value    OR     ROR    P(ROR)
Disease-causing
  Pleiotropic       1101       1861       0.59   < 0.001    0.53   2.05   < 0.001
  Non-pleiotropic   477        1468       0.32   < 0.001    0.29
Polymorphism
  Pleiotropic       400        377        1.06   0.18       1.08   0.94   0.39
  Non-pleiotropic   652        583        1.12   0.002      1.15
Common
  Pleiotropic       139        116        1.17   0.02       1.25   0.96   0.75
  Non-pleiotropic   329        266        1.24   < 0.001    1.30
Rare
  Pleiotropic       109        97         1.12   0.19       1.15   1.10   0.48
  Non-pleiotropic   153        148        1.04   0.63       1.04
No MAF
  Pleiotropic       152        163        0.93   0.34       0.92   0.92   0.47
  Non-pleiotropic   170        170        1.00   0.97       1.00
ExAC Common
  Pleiotropic       244        176        1.38   < 0.001    1.52   1.15   0.34
  Non-pleiotropic   566        453        1.25   < 0.001    1.32

Note 1) ROR: Ratio of Odds Ratios (ORs) 2) P(ROR): P-value of ROR

Table A.10: Observed and expected numbers of SAVs in IDR using IUPred and DISOPRED3. The dataset includes 412 pleiotropic (10,154 disease and 2,055 neutral SAVs in 367,076 residues) and 1,345 non-pleiotropic proteins (9,122 disease and 3,621 neutral SAVs in 924,002 residues).

                             Observed   Expected   O/E    P-value    OR     ROR    P(ROR)
Disease-causing, IUPred50
  Pleiotropic                1101       1861       0.59   < 0.001    0.53   2.05   < 0.001
  Non-pleiotropic            477        1468       0.32   < 0.001    0.29
Disease-causing, IUPred30
  Pleiotropic                1129       1934       0.58   < 0.001    0.52   1.73   < 0.001
  Non-pleiotropic            528        1530       0.35   < 0.001    0.30
Disease-causing, DISOPRED50
  Pleiotropic                962        2091       0.46   < 0.001    0.40   1.90   < 0.001
  Non-pleiotropic            384        1576       0.24   < 0.001    0.21
Disease-causing, DISOPRED30
  Pleiotropic                1086       2424       0.45   < 0.001    0.37   1.81   < 0.001
  Non-pleiotropic            466        1871       0.25   < 0.001    0.21
Polymorphism, IUPred50
  Pleiotropic                400        377        1.06   0.18       1.08   0.94   0.39
  Non-pleiotropic            652        583        1.12   0.002      1.15
Polymorphism, IUPred30
  Pleiotropic                413        391        1.06   0.22       1.07   0.93   0.30
  Non-pleiotropic            681        607        1.12   0.001      1.15
Polymorphism, DISOPRED50
  Pleiotropic                476        423        1.12   0.004      1.16   1.03   0.66
  Non-pleiotropic            691        626        1.10   0.004      1.13
Polymorphism, DISOPRED30
  Pleiotropic                550        490        1.12   0.002      1.17   0.96   0.54
  Non-pleiotropic            863        743        1.16   < 0.001    1.21

Note 1) ROR: Ratio of Odds Ratios (ORs) 2) P(ROR): P-value of ROR

Table A.11: Observed and expected numbers of SAVs in Pfam domains of full pleiotropic set vs. non-pleiotropic set.

                    Observed   Expected   O/E    P-value    OR     ROR    P(ROR)
Disease-causing
  Pleiotropic       7272       5291       1.37   < 0.001    2.37   0.76   < 0.001
  Non-pleiotropic   6846       4494       1.52   < 0.001    3.13
Polymorphism
  Pleiotropic       1012       1071       0.95   0.01       0.89   0.96   0.45
  Non-pleiotropic   1718       1784       0.96   0.03       0.93

Note 1) ROR: Ratio of Odds Ratios (ORs) 2) P(ROR): P-value of ROR
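The O/E and ROR columns of Table A.11 can be reproduced with simple arithmetic. The sketch below uses the disease-causing rows of that table and assumes that ROR is the pleiotropic OR divided by the non-pleiotropic OR; the function names are hypothetical. Because the ORs are taken from the table already rounded to two decimals, the result is only accurate to that precision.

```python
def observed_expected_ratio(observed, expected):
    """Ratio of the observed to the expected number of SAVs."""
    return observed / expected


def ratio_of_odds_ratios(or_pleiotropic, or_non_pleiotropic):
    """Assumed definition: pleiotropic OR over non-pleiotropic OR."""
    return or_pleiotropic / or_non_pleiotropic


# Disease-causing SAVs in Pfam domains (values from Table A.11)
oe_pleiotropic = observed_expected_ratio(7272, 5291)
oe_non_pleiotropic = observed_expected_ratio(6846, 4494)
ror = ratio_of_odds_ratios(2.37, 3.13)

print(f"O/E pleiotropic     = {oe_pleiotropic:.2f}")
print(f"O/E non-pleiotropic = {oe_non_pleiotropic:.2f}")
print(f"ROR                 = {ror:.2f}")
```

An O/E above 1 indicates enrichment of SAVs in the region of interest, and an ROR below 1 indicates that the enrichment is weaker in the pleiotropic set than in the non-pleiotropic set.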

Table A.12: Disease SAVs clustering in Pfam domains in full pleiotropic set (412 proteins). The two panels show the test results for the two interpretations of ‘same domain’ (same Pfam domain and same Pfam identifier). All results were significant (p < 0.001).

Same Pfam domain     Observed                                 Expected
Pleiotropic          Same MIM   Different MIM   Total         Same MIM    Different MIM   Total
Same Pfam            36755      12405           49160         30033.48    10909.45        40942.93
Different Pfam       60124      22447           82571         66845.52    23942.55        90788.07
Total                96879      34852           131731        96879       34852           131731

Same Pfam identifier   Observed                               Expected
Pleiotropic            Same MIM   Different MIM   Total       Same MIM    Different MIM   Total
Same Pfam ID           62853      20514           83367       49606.66    18002.03        67608.69
Different Pfam ID      34026      14338           48364       47272.34    16849.97        64122.31
Total                  96879      34852           131731      96879       34852           131731

Appendix B

Mutation Prediction on Experimental Structures


Table B.1: Maximum solvent accessibility by Rost and Sander

Code   Amino acid name                MaxASA
A      Alanine                        106
B      Aspartic Acid or Asparagine    160
C      Cysteine                       135
D      Aspartic Acid                  163
E      Glutamic Acid                  194
F      Phenylalanine                  197
G      Glycine                        84
H      Histidine                      184
I      Isoleucine                     169
K      Lysine                         205
L      Leucine                        164
M      Methionine                     188
N      Asparagine                     157
P      Proline                        136
Q      Glutamine                      198
R      Arginine                       248
S      Serine                         130
T      Threonine                      142
V      Valine                         142
W      Tryptophan                     227
X      Undetermined                   180
Y      Tyrosine                       222
Z      Glutamic Acid or Glutamine     196

Appendix C

Mutation Prediction on Homology Models

Table C.1: Number of disease and neutral variants triggering a structural alert

Disulphide bond breakage
Positive rate ± 95% confidence interval
Sequence Identity (%)   MODEL TPR      EXP TPR        MODEL FPR      EXP FPR
30-39                   1.42 ± 0.65    2.04 ± 0.78    0.08 ± 0.16    0.08 ± 0.16
40-49                   1.54 ± 0.77    2.06 ± 0.89    0.11 ± 0.22    0.11 ± 0.22
50-59                   1.93 ± 0.88    1.93 ± 0.88    0.00 ± 0.00    0.00 ± 0.00
60-69                   0.54 ± 0.61    0.73 ± 0.71    0.28 ± 0.56    0.28 ± 0.56
70-79                   0.92 ± 1.79    0.92 ± 1.79    0.00 ± 0.00    0.00 ± 0.00
80-89                   0.76 ± 1.05    0.76 ± 1.05    0.56 ± 1.10    0.56 ± 1.10
90-95                   0.00 ± 0.00    0.00 ± 0.00    0.00 ± 0.00    0.60 ± 1.16


Buried Pro introduced
Positive rate ± 95% confidence interval
Sequence Identity (%)   MODEL TPR      EXP TPR        MODEL FPR      EXP FPR
30-39                   3.07 ± 0.95    3.14 ± 0.96    0.40 ± 0.35    0.32 ± 0.31
40-49                   2.88 ± 1.05    3.60 ± 1.17    0.33 ± 0.38    0.55 ± 0.48
50-59                   3.44 ± 1.17    3.22 ± 1.13    0.00 ± 0.00    0.32 ± 0.44
60-69                   2.36 ± 1.27    2.90 ± 1.40    0.00 ± 0.00    0.28 ± 0.56
70-79                   4.59 ± 3.93    3.67 ± 3.53    0.00 ± 0.00    0.00 ± 0.00
80-89                   1.90 ± 1.65    1.90 ± 1.65    0.00 ± 0.00    0.56 ± 1.10
90-95                   5.02 ± 2.66    4.25 ± 2.46    0.60 ± 1.16    0.60 ± 1.16

Clash
Positive rate ± 95% confidence interval
Sequence Identity (%)   MODEL TPR      EXP TPR        MODEL FPR      EXP FPR
30-39                   2.12 ± 0.79    4.25 ± 1.11    0.88 ± 0.52    0.48 ± 0.39
40-49                   3.60 ± 1.17    4.73 ± 1.33    1.44 ± 0.78    0.67 ± 0.53
50-59                   2.04 ± 0.91    4.73 ± 1.36    1.11 ± 0.81    0.16 ± 0.31
60-69                   2.90 ± 1.40    3.45 ± 1.52    0.28 ± 0.56    0.28 ± 0.56
70-79                   3.67 ± 3.53    2.75 ± 3.07    1.08 ± 1.49    0.54 ± 1.06
80-89                   1.52 ± 1.48    1.52 ± 1.48    0.56 ± 1.10    0.56 ± 1.10
90-95                   3.09 ± 2.11    3.09 ± 2.11    0.00 ± 0.00    0.60 ± 1.16

Buried hydrophilic introduced
Positive rate ± 95% confidence interval
Sequence Identity (%)   MODEL TPR      EXP TPR        MODEL FPR      EXP FPR
30-39                   3.85 ± 1.06    3.93 ± 1.07    0.24 ± 0.27    0.56 ± 0.42
40-49                   4.22 ± 1.26    4.53 ± 1.31    0.55 ± 0.48    0.44 ± 0.43
50-59                   4.19 ± 1.29    3.97 ± 1.25    0.16 ± 0.31    0.47 ± 0.54
60-69                   3.81 ± 1.60    3.99 ± 1.63    0.00 ± 0.00    0.28 ± 0.56
70-79                   0.92 ± 1.79    0.00 ± 0.00    0.00 ± 0.00    0.00 ± 0.00
80-89                   3.42 ± 2.20    3.42 ± 2.20    0.56 ± 1.10    0.56 ± 1.10
90-95                   4.25 ± 2.46    3.86 ± 2.35    0.60 ± 1.16    0.60 ± 1.16


Buried charge introduced
Positive rate ± 95% confidence interval
Sequence Identity (%)   MODEL TPR      EXP TPR        MODEL FPR      EXP FPR
30-39                   6.76 ± 1.38    7.23 ± 1.42    1.13 ± 0.59    1.37 ± 0.65
40-49                   7.82 ± 1.69    8.13 ± 1.72    1.00 ± 0.65    1.11 ± 0.68
50-59                   7.95 ± 1.74    7.63 ± 1.70    0.63 ± 0.62    0.95 ± 0.75
60-69                   8.17 ± 2.29    8.17 ± 2.29    0.85 ± 0.96    0.85 ± 0.96
70-79                   3.67 ± 3.53    2.75 ± 3.07    0.54 ± 1.06    0.54 ± 1.06
80-89                   6.08 ± 2.89    6.08 ± 2.89    1.13 ± 1.56    1.13 ± 1.56
90-95                   7.34 ± 3.18    7.34 ± 3.18    1.19 ± 1.64    1.19 ± 1.64

Secondary structure altered
                         Positive rate ± 95% confidence interval
Sequence Identity (%)    MODEL TPR       EXP TPR         MODEL FPR       EXP FPR
30-39                    0.86 ± 0.51     0.79 ± 0.49     0.16 ± 0.22     0.24 ± 0.27
40-49                    0.72 ± 0.53     0.62 ± 0.49     0.33 ± 0.38     0.11 ± 0.22
50-59                    0.97 ± 0.63     0.64 ± 0.51     0.16 ± 0.31     0.16 ± 0.31
60-69                    0.54 ± 0.61     0.54 ± 0.61     0.00 ± 0.00     0.00 ± 0.00
70-79                    0.92 ± 1.79     1.83 ± 2.52     0.00 ± 0.00     0.00 ± 0.00
80-89                    1.14 ± 1.28     1.14 ± 1.28     0.00 ± 0.00     1.13 ± 1.56
90-95                    1.16 ± 1.30     1.54 ± 1.50     1.19 ± 1.64     0.60 ± 1.16

Buried charge switch
                         Positive rate ± 95% confidence interval
Sequence Identity (%)    MODEL TPR       EXP TPR         MODEL FPR       EXP FPR
30-39                    1.10 ± 0.57     1.18 ± 0.59     0.40 ± 0.35     0.24 ± 0.27
40-49                    1.23 ± 0.69     1.23 ± 0.69     0.22 ± 0.31     0.33 ± 0.38
50-59                    0.86 ± 0.59     0.86 ± 0.59     0.16 ± 0.31     0.32 ± 0.44
60-69                    0.54 ± 0.61     0.00 ± 0.00     0.00 ± 0.00     0.00 ± 0.00
70-79                    0.00 ± 0.00     0.92 ± 1.79     0.00 ± 0.00     0.00 ± 0.00
80-89                    0.38 ± 0.74     0.38 ± 0.74     0.00 ± 0.00     0.00 ± 0.00
90-95                    0.39 ± 0.76     0.39 ± 0.76     0.00 ± 0.00     0.00 ± 0.00

Disallowed phi/psi
                         Positive rate ± 95% confidence interval
Sequence Identity (%)    MODEL TPR       EXP TPR         MODEL FPR       EXP FPR
30-39                    3.85 ± 1.06     4.48 ± 1.14     0.88 ± 0.52     0.96 ± 0.54
40-49                    5.25 ± 1.40     5.86 ± 1.48     0.89 ± 0.61     1.33 ± 0.75
50-59                    5.37 ± 1.45     5.91 ± 1.51     2.05 ± 1.10     2.05 ± 1.10
60-69                    5.44 ± 1.89     6.53 ± 2.06     1.99 ± 1.46     1.99 ± 1.46
70-79                    4.59 ± 3.93     2.75 ± 3.07     1.62 ± 1.82     1.62 ± 1.82
80-89                    4.18 ± 2.42     3.80 ± 2.31     1.13 ± 1.56     0.00 ± 0.00
90-95                    6.18 ± 2.93     5.02 ± 2.66     2.38 ± 2.31     2.98 ± 2.57

Buried charge replaced
                         Positive rate ± 95% confidence interval
Sequence Identity (%)    MODEL TPR       EXP TPR         MODEL FPR       EXP FPR
30-39                    4.95 ± 1.19     5.42 ± 1.24     1.05 ± 0.57     1.21 ± 0.61
40-49                    4.73 ± 1.33     5.45 ± 1.43     2.00 ± 0.91     1.33 ± 0.75
50-59                    3.54 ± 1.19     3.87 ± 1.24     1.26 ± 0.87     1.58 ± 0.97
60-69                    4.17 ± 1.67     3.81 ± 1.60     1.42 ± 1.24     1.42 ± 1.24
70-79                    4.59 ± 3.93     3.67 ± 3.53     1.08 ± 1.49     1.08 ± 1.49
80-89                    4.94 ± 2.62     4.94 ± 2.62     1.69 ± 1.90     2.26 ± 2.19
90-95                    6.56 ± 3.02     5.79 ± 2.84     1.79 ± 2.00     1.19 ± 1.64

Buried Gly replaced
                         Positive rate ± 95% confidence interval
Sequence Identity (%)    MODEL TPR       EXP TPR         MODEL FPR       EXP FPR
30-39                    5.11 ± 1.21     5.58 ± 1.26     0.80 ± 0.50     1.29 ± 0.63
40-49                    5.56 ± 1.44     7.00 ± 1.60     1.22 ± 0.72     1.55 ± 0.81
50-59                    4.73 ± 1.36     5.69 ± 1.49     0.95 ± 0.75     1.26 ± 0.87
60-69                    5.81 ± 1.95     5.44 ± 1.89     1.14 ± 1.11     1.14 ± 1.11
70-79                    0.92 ± 1.79     0.92 ± 1.79     1.62 ± 1.82     1.62 ± 1.82
80-89                    6.46 ± 2.97     6.46 ± 2.97     1.69 ± 1.90     1.69 ± 1.90
90-95                    3.47 ± 2.23     3.86 ± 2.35     1.79 ± 2.00     1.79 ± 2.00

Buried H-bond breakage
                         Positive rate ± 95% confidence interval
Sequence Identity (%)    MODEL TPR       EXP TPR         MODEL FPR       EXP FPR
30-39                    5.27 ± 1.23     7.08 ± 1.41     2.97 ± 0.94     2.49 ± 0.87
40-49                    5.14 ± 1.39     7.61 ± 1.67     3.22 ± 1.15     2.99 ± 1.11
50-59                    5.69 ± 1.49     7.09 ± 1.65     3.00 ± 1.33     3.00 ± 1.33
60-69                    6.35 ± 2.04     5.44 ± 1.89     2.56 ± 1.65     2.84 ± 1.74
70-79                    5.50 ± 4.28     5.50 ± 4.28     1.62 ± 1.82     2.70 ± 2.34
80-89                    3.80 ± 2.31     6.46 ± 2.97     4.52 ± 3.06     5.08 ± 3.24
90-95                    10.04 ± 3.66    7.34 ± 3.18     2.98 ± 2.57     2.38 ± 2.31

Buried salt bridge breakage
                         Positive rate ± 95% confidence interval
Sequence Identity (%)    MODEL TPR       EXP TPR         MODEL FPR       EXP FPR
30-39                    2.20 ± 0.81     3.54 ± 1.02     0.88 ± 0.52     1.05 ± 0.57
40-49                    2.16 ± 0.91     3.40 ± 1.14     1.22 ± 0.72     1.44 ± 0.78
50-59                    1.93 ± 0.88     2.47 ± 1.00     0.32 ± 0.44     1.11 ± 0.81
60-69                    2.90 ± 1.40     2.00 ± 1.17     0.57 ± 0.79     0.57 ± 0.79
70-79                    1.83 ± 2.52     1.83 ± 2.52     0.00 ± 0.00     1.62 ± 1.82
80-89                    3.80 ± 2.31     1.90 ± 1.65     0.56 ± 1.10     1.13 ± 1.56
90-95                    1.16 ± 1.30     4.25 ± 2.46     1.79 ± 2.00     1.19 ± 1.64

Cavity altered
                         Positive rate ± 95% confidence interval
Sequence Identity (%)    MODEL TPR       EXP TPR         MODEL FPR       EXP FPR
30-39                    6.76 ± 1.38     4.25 ± 1.11     3.14 ± 0.97     2.17 ± 0.81
40-49                    6.28 ± 1.52     4.63 ± 1.32     3.22 ± 1.15     2.55 ± 1.03
50-59                    6.77 ± 1.61     4.73 ± 1.36     3.79 ± 1.49     1.90 ± 1.06
60-69                    5.99 ± 1.98     7.26 ± 2.17     4.26 ± 2.11     1.99 ± 1.46
70-79                    0.00 ± 0.00     1.83 ± 2.52     4.32 ± 2.93     2.16 ± 2.10
80-89                    1.14 ± 1.28     2.66 ± 1.95     1.69 ± 1.90     1.69 ± 1.90
90-95                    4.25 ± 2.46     4.63 ± 2.56     1.79 ± 2.00     0.60 ± 1.16

Buried / exposed switch
                         Positive rate ± 95% confidence interval
Sequence Identity (%)    MODEL TPR       EXP TPR         MODEL FPR       EXP FPR
30-39                    3.85 ± 1.06     3.93 ± 1.07     0.88 ± 0.52     2.09 ± 0.79
40-49                    2.88 ± 1.05     5.04 ± 1.38     1.22 ± 0.72     2.00 ± 0.91
50-59                    2.36 ± 0.98     4.94 ± 1.39     1.58 ± 0.97     1.58 ± 0.97
60-69                    4.72 ± 1.77     4.36 ± 1.70     0.85 ± 0.96     0.28 ± 0.56
70-79                    2.75 ± 3.07     0.92 ± 1.79     0.00 ± 0.00     1.08 ± 1.49
80-89                    5.32 ± 2.71     4.94 ± 2.62     0.56 ± 1.10     1.13 ± 1.56
90-95                    4.63 ± 2.56     3.47 ± 2.23     2.38 ± 2.31     1.79 ± 2.00

Cis Pro replaced
                         Positive rate ± 95% confidence interval
Sequence Identity (%)    MODEL TPR       EXP TPR         MODEL FPR       EXP FPR
30-39                    0.24 ± 0.27     0.31 ± 0.31     0.08 ± 0.16     0.24 ± 0.27
40-49                    0.21 ± 0.28     0.41 ± 0.40     0.11 ± 0.22     0.22 ± 0.31
50-59                    0.11 ± 0.21     0.21 ± 0.30     0.16 ± 0.31     0.32 ± 0.44
60-69                    0.00 ± 0.00     0.18 ± 0.36     0.00 ± 0.00     0.00 ± 0.00
70-79                    0.00 ± 0.00     0.00 ± 0.00     0.00 ± 0.00     0.00 ± 0.00
80-89                    0.00 ± 0.00     0.00 ± 0.00     0.00 ± 0.00     0.00 ± 0.00
90-95                    0.00 ± 0.00     0.00 ± 0.00     0.00 ± 0.00     0.00 ± 0.00

Gly in a bend
                         Positive rate ± 95% confidence interval
Sequence Identity (%)    MODEL TPR       EXP TPR         MODEL FPR       EXP FPR
30-39                    1.97 ± 0.76     2.67 ± 0.89     0.80 ± 0.50     0.88 ± 0.52
40-49                    1.44 ± 0.75     3.19 ± 1.10     0.78 ± 0.57     1.00 ± 0.65
50-59                    3.11 ± 1.12     3.01 ± 1.10     1.58 ± 0.97     1.90 ± 1.06
60-69                    2.54 ± 1.31     2.18 ± 1.22     1.14 ± 1.11     1.14 ± 1.11
70-79                    0.92 ± 1.79     0.92 ± 1.79     2.70 ± 2.34     3.24 ± 2.55
80-89                    3.80 ± 2.31     3.80 ± 2.31     0.56 ± 1.10     0.56 ± 1.10
90-95                    2.32 ± 1.83     1.54 ± 1.50     1.19 ± 1.64     0.00 ± 0.00
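
As a reading aid for the "rate ± 95% confidence interval" entries above, the following minimal Python sketch shows one common way to attach such an interval to a positive rate. This section does not state which interval method was used, so the normal-approximation (Wald) formula here is an assumption, and the function name `wald_interval_pct` and the counts in the usage line are hypothetical and are not taken from the thesis data.

```python
import math

Z_95 = 1.96  # two-sided 95% critical value of the standard normal distribution


def wald_interval_pct(positives, total):
    """Return (rate, half_width), both in percent, for `positives` out of `total`.

    Assumption: a normal-approximation (Wald) interval for a proportion.
    The method actually used for the table is not stated in this section.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= positives <= total:
        raise ValueError("positives must lie between 0 and total")
    p = positives / total
    half_width = Z_95 * math.sqrt(p * (1.0 - p) / total)
    return 100.0 * p, 100.0 * half_width


# Hypothetical counts, chosen only for illustration.
rate, half_width = wald_interval_pct(39, 1271)
print(f"{rate:.2f} ± {half_width:.2f}")
```

Under this assumed formula the half-width shrinks to zero whenever the observed rate is zero, so a row with no positives would be reported as 0.00 ± 0.00; an interval such as Wilson's would behave differently at that boundary.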