PDF hosted at the Radboud Repository of the Radboud University Nijmegen

The following full text is a publisher's version.

For additional information about this publication click this link. http://hdl.handle.net/2066/100866

Please be advised that this information was generated on 2017-12-06 and may be subject to change.

Project HOPE: Providing the last piece of the puzzle

Proefschrift

ter verkrijging van de graad van doctor

aan de Radboud Universiteit Nijmegen

op gezag van de rector magnificus prof. mr. S.C.J.J. Kortmann,

volgens besluit van het college van decanen in het openbaar te verdedigen op donderdag 29 november 2012

om 12:30 uur precies

door

Hanka Venselaar

Geboren op 27 oktober 1982

te Tilburg

Promotor

Prof. dr. G. Vriend

Manuscriptcommissie

Prof. dr. H.G. Brunner

Prof. dr. J.A.M. Smeitink

Prof. dr. M.G. Netea

Project HOPE: Providing the last piece of the puzzle

Hanka Venselaar

ISBN: 978-90-9026835-4

Contents

Chapter 1 5-20 Introduction

Chapter 2 21-38 Homology Modeling

Chapter 3 39-57 Homology Modeling and Spectroscopy, a never-ending lovestory

Chapter 4 59-72 Protein structure analysis of mutations causing inheritable diseases. An e-Science approach with life scientist friendly interfaces

Chapter 5 73-138 Applications of Homology Modeling

Chapter 6 139-149 Automated protein structure annotation of human disease variants

Chapter 6a 150-154 Appendix

Chapter 7 155-162 Summary / Samenvatting

Dankwoord 165-166

Curriculum Vitae 167

Publications 168-171

3

“We sift over our fingers the first grains of this great outpouring of information and say to ourselves that the world be helped by it.“

(Margaret Oakley Dayhoff in 1968 about the first Atlas of Protein Sequence and Structure)

4

Introduction 1

Previous page:

Model of human Dectin – A receptor that specifically recognizes cell wall constituents from pathogenic bacteria and fungi.

6

Introduction 1

Introduction Human beings are remarkable machines. Zoom in on cellular processes and you will be stunned by the amount of activities performed to keep a cell in good shape: energy metabolism, production of new proteins, transfer of signals, defense against pathogens, growing and dividing, etc. Nearly all these processes, from cell-growth to cell-death, are carried out by proteins, the work horses in your body. The fact that proteins are important for almost every biological process is the reason that lots of research is focused on proteins, their functions, their pathways, and their interactions. Different fields of research are looking at different features of these proteins. Research in the field of cell biology or physiology, for example, is often focused on the behavior of a complete protein(complex) during cellular processes, whereas research in the field of organic chemistry tries to manipulate the protein at the atomic level in order to stimulate a specific reaction. Structural bioinformaticians usually study the protein’s structure in detail at the atomic level; trying to understand its mechanism, the binding of ligands or the effect of mutations. They use computers and specialized software to visualize, study, and manipulate the protein’s structure, function, and dynamics in silico. In this thesis we show that the study of a protein’s 3D-structure can contribute to lab-research by providing basic understanding of the proteins, answers to biomedical questions, and ideas for new experiments. In the recent years, the use of Next Generation Sequencing (NGS) has rapidly changed the ways research was done in many areas of the life-sciences field. In contrast to studies that look at the behavior of a mutated protein in context of the whole cell, NGS makes it possible to perform a very detailed study of these mutations on the DNA-level. This requires the use of specialized tools and software that are often not easily available to every life-scientist, making it necessary for them to collaborate with bioinformaticians in even the smallest project. The work described in this thesis aims to make more well-known, more accessible, and more understandable. Our approach culminated in Project HOPE, a system that makes the analyses of molecular phenotypic effects of mutations easily available and understandable for everyone working in the field of (bio)medical science. The results of our collaborations with (bio)medical scientists are described in chapter 5 and provide nice examples of translational science. During the numerous projects carried out in collaboration with medical scientists we found that we sometimes literally had to make translations in the communication between the lab and bioinformatics. Therefore, we set up a bioinformatics-wiki as an online dictionary that explains difficult bioinformatics keywords using as much as possible the jargon of life- sciences. We hope that our work helps bioinformatics to become a standard tool in everyday lab- research and that together we can push forward the limits of our research. We truly believe that in translational science 1+1=3.

Figure 1.1a: The basic structure of amino acids consists of a central carbon atom connected to a proton (H), an amino-group (red), a carboxyl-group (cyan) and a side chain (purple). It is the side chain that gives each its unique properties. In the case of , the side chain connects to the central carbon atom instead of the proton resulting in a rigid ring-structure. Figure 1.1b: The amino acids in a protein are connected to each other via a peptide bond. This bond connects the carboxyl-group to a residue with the amino group of the next residue.

7

Introduction 1

Proteins & amino acids: building blocks of life Over 50% of the mutations known in the Human Gene Mutation Database involve a mutation that affects the protein structure1. Therefore, the central subject of human genetics bioinformatics is proteins, or more precise, the 3D-structure of proteins. Proteins consist of smaller building blocks called amino acids or residues. In general, the proteins in the human body consist of 20 different types of amino acids. Nine of these amino acids are essential for a normal healthy adult, meaning that our body cannot synthesize these residues and therefore we need to include them in our diet. The other amino acids (and their derivatives) normally can be produced in our body. The amino acids are all built up according to similar structural layout. The central carbon atom (Cα) is connected to four different chemical groups: an amino (NH2)-group, a carboxyl (COOH)– group, a hydrogen (H)-atom and one variable group known as the side chain or R-group. This is illustrated in figure 1.1a. All amino acids have this same basic structure, but their side chains differ. In a protein, amino acids are connected to each other through a peptide bond. The carboxyl-group of one amino acid is connected to the amino-group of the next amino acid. As a result, a protein contains a ‘backbone’ in which we can find the amino-group, the central carbon atom, and the carboxyl-group (carbonyl-group now) of a residue. As illustrated in figure 1.1b, the side chains of the amino acids are pointing away from this backbone. The side chain gives each amino acid its own unique properties like hydrophobicity (tendency to hide from water), aromaticity (structure that contains a flat ring with special chemical properties), size, and charge. In the case of the sulfur atom in the side chain can participate in a disulfide bond with another cysteine. This property of cysteine is often related to the stability of the protein. The side chain of consists of just one hydrogen-atom, making glycine the most flexible residue of all because there is no side chain that sterically limits the conformations of the backbone. Glycine is often found at positions in the protein structure where a certain amount of flexibility is required or where unusual torsion angles are needed. Figure 1.2 shows the structure of all 20 amino acids. It is these differences between the amino acids that form the basis of Richard Grantham’s amino acid . Grantham scored each amino acid on its composition (c), volume (v) and polarity (p)2. He reasoned that two amino acids can be substituted for another when they have similar c, p, and v scores. The Grantham score is still used a lot nowadays to score the severity of newly found point mutations.

During protein production, the amino acids are all linked together in a long chain that is not functional yet. This chain needs to fold into the correct conformation to become a functional protein. Protein folding is dictated by interactions between the amino acids; hydrophobic side chains want to sit in the core of the protein to avoid contact with water, charged residues will attract an opposite charge, and proline and glycine will facilitate sharp turns where required. Directed by these forces, and helped by chaperones when needed, a protein will eventually fold into a conformation in which it can fulfill its function.

8

Introduction 1

Figure 1.2: The 20 different amino acids. The backbone is coloured red and shown here as fully protonated. The side chains are coloured black and shown in their neutral form. The residues are labeled with their 1-letter code, 3-letter code and full names.

9

Introduction 1

The structure of a protein can be described at four different levels:  Primary Structure: This is simply the order of amino acids in the protein, also known as the protein sequence. For instance, the protein in figure 1.3 has the sequence “AEMLHDILQ”. The primary structure is often used in databases and software using the 1-letter code and can be represented as simple text. The order of amino acids is encoded in our DNA.  Secondary Structure: When a protein starts folding, it will first form small local structures. We can recognize two common regular structures: the α-helix (a corkscrew-like structure) and the β-strand (the fully extended polypeptide chain that can interact via hydrogen bonds with other β-strands thereby forming a β-sheet). These structures are linked with residues in turns and irregular structures, often indicated as random coils or, more correctly, as loops. Both the α-helix and the β-sheet are shown in figure 1.3.  Tertiary Structure: The local structures interact with each other to form the actual protein conformation. Nature uses a limited number of folds for the tertiary structures of proteins. Databases such as CATH3 or SCOP4 classify proteins based on their structure. The combination of a six parallel β-sheet alternated by two pairs of α-helices, for example, is known as the Rossman fold5, and 8 β-sheets surrounded by 8 α-helicest is known as a TIM- barrel6.  Quaternary Structure: Proteins often need to interact with other proteins or co-factors to fulfill their task. Such a complex of interacting proteins is called the quaternary structure. Haemoglobine, for instance, is only functional as a complex of 2 α-chains and 2 β-chains, each containing a haem-molecule and a series of buried water molecules.

In this thesis we will mainly use the tertiary and quaternary structure of a protein to study the effects of mutations on the function of the protein. We know that the structure and function of a protein are inseparable and that the protein must be properly folded to perform its function. It is the structure that enables the potassium channel to allow passage of potassium-ions only. It is the structure that enables topoisomerases to open and close around a DNA-strand. And in numerous proteins it is the structure of the substrate binding site that makes sure only the correct ligand binds to the protein. None of these proteins could perform their function would they have a different structure. Of course, when looking at protein structures we will have to keep in mind that a protein structure is dynamic. Movement is required for a protein to fulfill its function. For instance, ion- channels should be able to adopt an open or closed formation when needed, and oxygen-binding in one haemoglobin-subunit triggers a conformation-change in the subunits making it easier for the other subunits to bind oxygen. When studying a mutation in a protein we need to consider these movements.

Figure 1.3: The four different levels to describe a protein structure. Primary = the 1-letter code of the protein sequence, secondary = local structures like the α-helix and β-sheet, tertiary = completely folded chain, quaternary = completely functional protein complex with all chains and co-factors.

10

Introduction 1

Every active protein has a 3D-structure, and this structure is important for the function. However, the 3D-structure is known for only 20% of all the 23.000 known human protein sequences. Why only 20%? Determining the protein conformation is difficult and hard lab work. The primary structure of proteins can be determined by a method called ‘sequencing’. In the past this has been a laborious but rather straightforward process. Nowadays, the always ongoing improvement in technology enables us to sequence the DNA of a whole person quickly using next-generation sequencers. It has become rather simple to determine the sequence of a protein. However, defining the 3D-structure of that same protein still requires a more laborious approach for which two main methods exist: X- ray crystallography and Nuclear Magnetic Resonance (NMR).

 X-ray: Structure determination by X-ray requires that crystals are grown of the molecule of interest. These crystals are subjected to X-rays (röntgen radiation). The regular nature of the crystal causes the diffracted X-rays to form distinct patterns on a röntgen sensitive film or nowadays normally a specialized röntgen radiation detector. From these patterns the coordinates of the atoms can be calculated. Unfortunately, not every protein will easily form crystals. Especially, transmembrane domains are notoriously problematic in forming crystals and therefore many membrane proteins are often solved without their transmembrane domain. Nevertheless, X-ray is still the first choice when determining a protein structure.  NMR: Every atom has a nucleus, the core of the atom, and every atomic nucleus has a so- called spin. Using strong magnetic fields and radio pulses, certain spins can be manipulated. The spins of nuclei that are close to each other in space influence each other. Some of these interactions can be measured, and from that data it is possible to get a rough estimate of interatomic distances, torsion angles, etc. This information can be used to calculate atomic coordinates. The result of an NMR experiment is a collection of structures, in contrast to the single structure that is produced by an X-ray experiment. No crystal is needed for an NMR- experiment.

Both methods result in a structure that is stored in a so-called PDB-file in the Protein Data Bank (PDB)7. PDB-files contain, among other information, the X, Y, and Z coordinates of every atom in the protein. PDB-viewers can read these files and visualize the protein on your computer screen. Smart PDB-viewers allow you to manipulate this protein in silico and to use it to study the protein and its mutations. The process from NMR/X-ray experiment to 3D-visualisation is illustrated in figure 1.4. It sounds simple enough, but in contrast to sequencing, both NMR and X-ray require lots of work and time and the resulting structures are not without errors. For instance, the tight crystal packing needed for X-ray crystallization can force residues and loops at the surface of the protein in a conformation that differs from the biological one. It also occurs that a protein is solved in a crystallization-multimer that differs from the biologically relevant one. NMR is in general less accurate and we often see undetermined regions in NMR structures. It is important to keep all these problems in mind when analyzing a protein structure. Solving a protein structure is not easy and therefore many more proteins sequences are known than structures. In the future, the gap between known sequences and their corresponding solved protein structure will only grow. The latest technological developments resulted in machines that can sequence the complete human genome in only a fraction of the time that was needed before. So, whereas the sequencing process is speeding up, the process of solving a 3D-structure is still slow and time-consuming.

11

Introduction 1

Figure 1.4: Process of protein structure determination and visualization. The experimental NMR and X-ray techniques result in coordinates of the atoms in a structure. These coordinates are stored in a PDB-file in the Protein Data Bank from where they can be downloaded and used for visualization using a PDB-viewer. Shown here, is a screenshot of YASARA, our choice in protein-visualization software.

12

Introduction 1

Homology modeling In case the 3D-structure of the protein of interest is not known yet, we can try to predict it. This process is known as ‘homology modeling’. Proteins that are homologs (i.e. have a common evolutionary ancestor) often have similar features such as the topology, the location of the active site, residue types in the active site, etc. Structural principles are often conserved and re-used in nature. It is therefore possible to use one protein structure as a template to predict the structure of a homologous protein. To determine whether two proteins are homologous, we compare the sequences of these proteins in a process called ‘aligning’. While making an alignment, we assume that the two sequences have a common ancestor and that all differences between the two sequences are caused by mutations, insertions and deletions in this common ancestor sequence. The two sequences are arranged in such a way that the minimum number of substitutions is required to change one sequence into the other. This means that we place residues originating from the same ancestor underneath each other; identical residues are aligned, the other residues are aligned with residues that have similar properties as much as possible. The two sequences can have different lengths caused by the introduction or deletion of residues. These insertions/deletions are shown as gaps in the alignment, often indicated with ‘-‘. We can use the alignment of two sequences to determine whether the proteins are homologous. Sander & Schneider8 defined a formula that indicates homology depending on the number of aligned residues and the length of the alignment. A graph that results from this formula is shown in figure 1.5. Any two aligned sequences for which the percentage identical residues fall above the threshold are thought to adopt a similar structure. In the other case the proteins might still be homologous, but that cannot be identified by a simple alignment.

Figure 1.5: This plot indicates the safe homology modeling threshold. When the values for the length (horizontal axis) and percentage identity (vertical axis) fall above the line we can assume the two sequences to be homologous. For example: 25% of sequence identity is required for two aligned sequences of 80 amino acids length to be considered homologous. (Figure after Sander and Schneider8.)

13

Introduction 1

Homology modeling is based on two important assumptions:

 The function of a protein is facilitated by the structure. Nature is very good in re-using ideas that work and, therefore, we can find similar structural motifs (sometimes with slight adaptations) in large protein-families. For instance, the active site of a protease consists of serine, , and and is conserved throughout the family of serine proteases, and even beyond this family. The structure and function of a protein are closely associated.  During evolution structures are more conserved than their sequences. Because some amino acids have the same or similar properties it is possible to change one amino acid into the other without changing the structure or function of that protein, as long as essential active site residues remain unchanged. However, the specificity for substrates or the regulation of the protein often changes during evolution. As soon as computers became widely available people started using software to build models in silico9. The first programs were only able to change the amino acid side chains one by one without further adaptions in rotamers or the backbone. Since then, homology modeling has come a long way. Nowadays, everyone can use online web servers that cannot only change the residue side chains, but can also deal with insertions/deletions, model loops, do rotamer predictions, and perform energy minimizations. Many servers exist that produce fairly good models (Robetta10, ITasser11, etc). A bi-annual protein prediction contest investigates the results of these servers and provides the user with a nice overview of the best web servers for protein modeling. This so-called CASP12 competition not only shows us the progress in protein structure and function prediction, but also indicates the problems and possible points for improvement. Chapter 3 is an extensive investigation of the method and current possibilities of homology modeling.

Bioinformatics According to the Wikipedia we can describe the term bioinformatics as “The application of statistics and computer science to the field of molecular biology.“ The fact that we can nowadays use a system like Wikipedia indicates how much we rely on computers and the World Wide Web for our information. We cannot imagine our lives without a computer or without internet. The same holds for science in general, and for bioinformatics in particular. The term bioinformatics was coined by Paulien Hogeweg in 1978 for the study of informatic processes in biotic systems13. In the last few decades the development of the bioinformatics field has gone hand-in-hand with technological developments. Bigger and faster computers made it possible to store and search through the enormous amount of available data that has been generated, for example, by the human genome project. In a similar way it became much easier to store atomic coordinates from protein 3D-structure nicely as one file instead of a few boxes with punched-cards, as illustrated in figure 1.6. In fact, protein 3D-structures are so easy to use nowadays that a wide variety of research projects is now depending on in silico structure analyses. Take, for example, the development of new medicines. Instead of simply trying every new drug compound in the lab, we can perform in silico drug docking first. This will limit the number of lab-experiments and this will in turn save time and money. On the other hand, we can use the computer to compare already known drugs with new target proteins. This so-called ‘drug repositioning’ can result in a much more time and cost-efficient drug developing and marketing process because the drug is already approved for use in human patients. This makes drugs repositioning the most promising path towards medication for the 5000-7000 orphan diseases14, such as the many genetic disorders described in this thesis. Until recently, these diseases have been neglected by the big pharmaceutical companies because worldwide only a few people are affected by it so that the enormous investments involved in designing a medicine ab initio would never pay off. However, using techniques such as in silico experiments and drug repositioning can finally make it cost-efficient to develop medicines for orphan diseases.

14

Introduction 1

Figure 1.6: Improvement in computer data storage. In the past, storing coordinates of one protein could cost many punched cards. Nowadays, we can store the complete PDB (>70.000 proteins) on one USB-stick.

We can say that bioinformatics has become one of the basic needs for every life science researcher, but bioinformatics would not be possible without the lab-experiments. Data that result from these experiments need to be analyzed and the latest developments in life science technology demand a variety of bioinformatics skills. It is, for instance, impossible to manually analyze the output from a next generation sequencer, or to make sense of the data generated in a micro-array experiment. That is why bioinformatics and lab-work need each other. The ever growing heap of data must be collected in databases; smart algorithms like BLAST15 have been invented to search these databases and protocols like SOAP16 and DAS17 enable us to extract from local and remote sources exactly that piece of information that is useful for us. Our in-house developed search and retrieval system MRS18 is a typical example of an application that became both possible and necessary by the enormous increase in available biological data. This software provides simple and rapid access to a large series of biological data banks and presents this information in a human readable and cross-linked fashion. Currently, lots of biological data are stored in databanks and databases located on servers throughout the world and smart interfaces provide easy access to these information sources. Nowadays, every researcher needs bioinformatics as much as he/she needs lab work and it’s a good thing that (at least at the Radboud University) bioinformatics is part of the compulsory curriculum for molecular life sciences. We are interested in the effect of mutations on protein 3D-structures. This side of bioinformatics is called structural bioinformatics and focuses on the use, validation, and analysis of the 3D-structures of proteins. As mentioned before, almost every protein has a 3D-structure that facilitates the function. For example, the shape of the active site defines what substrate can fit, just like a key fits in a lock. Even a small change in this site can affect binding of the ligand and can thereby disturb the function of the protein. We call such a change a mutation. This thesis focuses on the effects of mutations on protein 3D-structure and function.

15

Introduction 1

We recognize different types of mutations:

 Nonsense: a mutation of an amino acid into a stop codon, this means that the protein is truncated.  Missense: one residue-type is mutated into another residue-type. This means in most cases (except when proline is involved) that only the side chain is changed while the backbone stays principally the same. For instance, a mutation from to means that the small, hydrophobic, and uncharged side chain is changed into a large, hydrophilic, and positively charged side chain.  Insertions/deletions: one or more residues are inserted in or deleted from the protein. This means that the length of the protein sequence changes. Normally, this occurs in irregular loops on the surface and does not have a severe effect, however, an insertion/deletion in secondary structure elements or in functional loops can be detrimental to the protein.  Frame-shifts: residues from a certain point are changed into another residue type, often leading to a premature stop and truncated protein. This is caused by a deletion or insertion at the DNA-level of a number of bases that is not a multiple of three.

We are especially interested in the missense mutations, the mutation from one residue-type into another. Other mutation types are either very rare (insertions/deletions) or more often fatal for the protein (frameshifts/truncations). To study the effect of point mutations we need to visualize the proteins. We mainly use the program YASARA19 (Yet Another Scientific Artificial Reality Application) developed by Elmar Krieger. This, and other programs, can read the atom-coordinates in a PDB-file and show the protein in 3D (as shown in figure 1.4). The structure can be manipulated (zoom, rotate, etc) and this enables us to visualize the effects of mutations. YASARA is available in 4 stages of increasing complexity. YASARA View is the first stage and contains all options described above that enable the user to look at a 3D-structure. The program is so easy and intuitive that it was even possible to incorporate it in a project called `the travelling DNA-labs´, a project in which students visit high schools to teach the pupils the first principles of protein structure analysis. The most complex YASARA stage has all WHAT IF20 functionalities built in and contains everything you need for complex in silico experiments, such as building energy minimized homology models, validation of macromolecular structures, ligand docking, etc. In 2008, YASARA was one of the top-contenders in the aforementioned CASP competition and even ranked first overall in the ‘high-resolution server accuracy'-category. It is not surprising that we choose YASARA for our homology modeling experiments.

Translational science: From bench-to-bedside (and back again) The term ”translational science” is defined as scientific research that is motivated by the need for practical applications that help people21. In other words, translational science is science that tries to develop methods, tools, and products that are directly useful. In medical science, for example, we can see the development of new drugs that will improve human health, and, therefore, this science is also described as ‘from bench to bedside’. The project described in this thesis not only runs in the ‘bench to bedside’ direction, but it also goes from ‘bedside to bench’. On one hand, our method of mutational analysis can provide a valuable contribution to the fundamental understanding of protein structures, which can lead to new ideas and experiments and, eventually, new medicines. On the other hand, our project connects the world of lab-experiments to find disease causing mutations with the world of in silico protein structure analyses that explain their effects. It translates questions that rise in the lab to a problem that can be approached bioinformatically. The answers of the analyses will be translated (using our bioinformatics-wiki) into bioinformatics-jargon free language in such a way that every (bio)medical scientist can understand the answer and use it to design new experiments or to publish the results in

16

Introduction 1 an article. All together, we believe that this thesis describes examples of translational science, that sometimes are very literal.

Several research departments in the academic hospital in Nijmegen are studying hereditary diseases. An extensive series of experiments can result in the knowledge that a certain mutation in a certain gene is causing a disease. Such results are publishable and contribute to the fundamental knowledge about diseases. However, the molecular basis of the disease is then still unknown. We don’t know why, at a molecular level, that particular mutation causes the disease. Is the mutation disturbing the structure of the protein, or is it located at the surface where it can disturb dimerisation? We will not know until we closely study the 3D-structure of the protein in the light of all other knowledge about the molecule. Only when we understand the underlying molecular basis of the disease can we think of a reliable diagnostic test or a possible cure for the patient. The example in the next paragraph illustrates a protein study in which the structural analysis contributed significantly to the understanding of the disease and provided new ideas for experiments that eventually might lead to a cure for patients.

Example: The department of human genetics studied a family of hereditary disorders causing various combinations of limb malformations in the split hand-split foot spectrum, orofacial clefting, and ectodermal dysplasia, including ‘Ectrodactylyl, Ectodermal dysplasia and Cleft lip syndrome’ (also known as EEC syndrome). Lab experiments resulted in the identification of mutations in the gene for p63, a transcription factor22. The mutations seemed to be scattered all over the protein. The identification of these mutations in itself was an important discovery, but the molecular mechanism of the diseases was not yet understood. After homology modeling of the protein structure it became clear that the disease-causing mutations were located in either the DNA-binding region, contacted a metal-ion, or were located at the dimerisation interface. Strikingly, the location of these mutations could be directly correlated to the different variants of the disease23. Researchers at the Swedish Karolinska institute took up the challenge to develop a medicine for this disease. They developed a small molecule named PRIMA-1 that can covalently bind to p53-family members (p63, p53, and p73) and stabilize the dimer. This partly restores the protein function because binding of PRIMA-1 stabilizes the dimer and thereby simply makes sure that there is more active form of the protein available24-26. Hopefully, PRIMA-1 can be used as a lead compound to develop new therapies. Figure 1.7 illustrates this process.

Figure 1.7: Overview of the translational science process in the case of p63 starting with the identification of the syndrome, followed by the discovery of the mutations in the p63 gene, structural analysis of the homology model, and eventually the development of a therapeutic compound.

17

Introduction 1

This example shows how protein 3D-structures can contribute to the understanding of processes in the human body in general, and hereditary diseases in particular, and can be used to develop cures or at least to design new experiments towards this goal. A part of this thesis consists of projects carried out in collaboration with life-science departments. In chapter 5 you will find a summary of the articles that were the results of these collaborations most of which have been published already. These articles investigated one or more point mutations that were found to cause a disease. Other projects have also been carried out, but they fall beyond the scope of this thesis because they do not include point mutations. The summaries show how protein structures can provide a serious contribution to the understanding of a disease mechanism and, in a way, provide the last piece of the molecular puzzle.

HOPE: Have (y)Our Protein Explained Even though a structural analysis of a point mutation seems to be very valuable for further research, not every biomedical scientist has access to a bioinformatics department and not every scientist has the background to study proteins structures in depth. The aim of this PhD project was to bridge the gap between (bio)medical sciences and structural bioinformatics. Therefore, we developed Project HOPE: an online system that can analyze the structural effects of point mutations and explain them in a (bio)medically friendly way. HOPE gathers information about a protein from databases and servers located on computers in a wide range of countries. The name HOPE stands for: Have yOur Protein Explained, but also represents the fact that you can always hope that something more is known about your protein. With HOPE we hope to open the world of protein structures to everyone.

Scope of this thesis This thesis focuses on protein 3D-structures and the effect of mutations on these structures. As mentioned previously, homology modeling is the only way to obtain a 3D-structure of your protein of interest when experimental techniques fail or simply have not yet been tried. A homology modeling experiment is a multi-step process. In chapter 2 we extensively describe each step and its underlying principles. In chapter 3 we describe homology modeling in the light of a series of recent developments, and an overview will be given of the problems and opportunities encountered in this field. The major topic, the accuracy and precision of homology models, will be discussed extensively. In chapter 4 we introduce Project HOPE, a fully automatic program that analyzes the structural and functional effects of point mutations. HOPE collects information from a wide range of information sources including the protein 3D-structure or homology model. These homology models are built according to the method described in chapter 2. Chapter 5 illustrates the use of protein 3D-structures in the field of life-sciences. In the recent years, we collaborated in many projects. We were often able to analyze the protein structure in order to provide insights in the effects of that mutation and to suggest new experiments. Many of these projects resulted in a publication which will be summarized in this chapter. We validated Project HOPE by analyzing a large series of mutations described in journals such as Nature Genetics and the American Journal of Human Genetics. We compared HOPE's predictions with the predictions described by the authors of these articles, with our own manual analyses, and with predictions by SIFT and PolyPhen, two other, commonly used packages for mutant structure prediction. We also compared mutations that we analyzed previously (chapter 5) with HOPE’s predictions. The results of these analyses are described in chapter 6 and 6a. Chapter 7 summarizes this thesis and provides insight in improvements and future perspectives.

18

Introduction 1

References 1. Stenson, P.D., Mort, M., Ball, E.V., Howells, K., Phillips, A.D., Thomas, N.S., and Cooper, D.N. (2009). The Human Gene Mutation Database: 2008 update. Genome Med 1, 13. 2. Grantham, R. (1974). Amino acid difference formula to help explain protein evolution. Science 185, 862-864. 3. Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., and Thornton, J.M. (1997). CATH--a hierarchic classification of protein domain structures. Structure 5, 1093-1108. 4. Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247, 536-540. 5. Rao, S.T., and Rossmann, M.G. (1973). Comparison of super-secondary structures in proteins. J Mol Biol 76, 241-256. 6. Kishan, R., Zeelen, J.P., Noble, M.E., Borchert, T.V., Mainfroid, V., Goraj, K., Martial, J.A., and Wierenga, R.K. (1994). Modular mutagenesis of a TIM-barrel enzyme: the crystal structure of a chimeric E. coli TIM having the eighth beta alpha-unit replaced by the equivalent unit of chicken TIM. Protein Eng 7, 945- 951. 7. Berman, H., Henrick, K., and Nakamura, H. (2003). Announcing the worldwide Protein Data Bank. Nat Struct Biol 10, 980. 8. Sander, C., and Schneider, R. (1991). Database of homology-derived protein structures and the structural meaning of . Proteins 9, 56-68. 9. Greer, J. (1980). Model for haptoglobin heavy chain based upon structural homology. Proc Natl Acad Sci U S A 77, 3393-3397. 10. Kim, D.E., Chivian, D., and Baker, D. (2004). Protein structure prediction and analysis using the Robetta server. Nucleic Acids Res 32, W526-531. 11. Zhang, Y. (2008). I-TASSER server for protein 3D structure prediction. BMC Bioinformatics 9, 40. 12. Moult, J., Pedersen, J.T., Judson, R., and Fidelis, K. (1995). A large-scale experiment to assess protein structure prediction methods. Proteins 23, ii-v. 13. Hogeweg, P., and Hesper, B. (1978). Interactive instruction on population interactions. Comput Biol Med 8, 319-327. 14. Stolk, P., Willemen, M.J., and Leufkens, H.G. (2006). Rare essentials: drugs for rare diseases as essential medicines. Bull World Health Organ 84, 745-751. 15. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990). Basic local alignment search tool. J Mol Biol 215, 403-410. 16. Gudgin, M., Hadley, M., Mendelsohn, N., Moreau, J., Nielsen, H.F., Karmarkar, A., and Lafon, Y. (2007). SOAP Version 1.2 Part 1: Messaging Framework (Second Edition). In. ( 17. Prlic, A., Down, T.A., Kulesha, E., Finn, R.D., Kahari, A., and Hubbard, T.J. (2007). Integrating sequence and structural biology with DAS. BMC Bioinformatics 8, 333. 18. Hekkelman, M.L., and Vriend, G. (2005). MRS: a fast and compact retrieval system for biological data. Nucleic Acids Res 33, W766-769. 19. Krieger, E., Koraimann, G., and Vriend, G. (2002). Increasing the precision of comparative models with YASARA NOVA--a self-parameterizing force field. Proteins 47, 393-402. 20. Vriend, G. (1990). WHAT IF: a molecular modeling and drug design program. J Mol Graph 8, 52-56, 29. 21. Feldman, A.M. (2008). CTS: a new discipline to catalyze the transfer of information. Clin Transl Sci 1, 1-2. 22. Celli, J., Duijf, P., Hamel, B.C., Bamshad, M., Kramer, B., Smits, A.P., Newbury-Ecob, R., Hennekam, R.C., Van Buggenhout, G., van Haeringen, A., et al. (1999). Heterozygous germline mutations in the p53 homolog p63 are the cause of EEC syndrome. Cell 99, 143-153. 23. Brunner, H.G., Hamel, B.C., and Van Bokhoven, H. (2002). The p63 gene in EEC and other syndromes. J Med Genet 39, 377-381. 24. Lambert, J.M., Gorzov, P., Veprintsev, D.B., Soderqvist, M., Segerback, D., Bergman, J., Fersht, A.R., Hainaut, P., Wiman, K.G., and Bykov, V.J. (2009). PRIMA-1 reactivates mutant p53 by covalent binding to the core domain. Cancer Cell 15, 376-388. 25. Bykov, V.J., Issaeva, N., Shilov, A., Hultcrantz, M., Pugacheva, E., Chumakov, P., Bergman, J., Wiman, K.G., and Selivanova, G. (2002). Restoration of the tumor suppressor function to mutant p53 by a low- molecular-weight compound. Nat Med 8, 282-288. 26. Rokaeus, N., Shen, J., Eckhardt, I., Bykov, V.J., Wiman, K.G., and Wilhelm, M.T. PRIMA-1(MET)/APR-246 targets mutant forms of p53 family members p63 and p73. Oncogene 29, 6442-6451.

19

Introduction 1

20

The Method: Homology Modelling 2

Previous page:

A Toll-Like Receptor (TLR) – A class of receptor proteins that recognize a wide variety of structurally conserved molecules in microbes and therefore play an important role in the innate immune system. (PDB file 2z63)

22

The Method: Homology Modelling 2

Homology Modeling The goal of protein modeling is to predict a structure from its sequence with an accuracy that is comparable to the best results achieved experimentally. This would allow users to safely use in silico generated protein models in scientific fields where today only experimental structures provide a solid basis: structure-based drug design, analysis of protein function, interactions, antigenic behavior, or rational design of proteins with increased stability or novel functions. Protein modeling is the only way to obtain structural information when experimental techniques fail. Many proteins are simply too large for Nuclear Magnetic Resonance (NMR) analysis and cannot be crystallized for X-ray diffraction. Among the major approaches to 3D structure prediction homology modeling is the ‘easiest’ approach based on two major observations:

 The structure of a protein is uniquely determined by its amino acid sequence1, and therefore the sequence should, in theory, contain sufficient information to obtain the structure.  During evolution, structural changes are observed to be modified at a much slower rate than sequences. Similar sequences have been found to adopt practically identical structures while distantly related sequences can still fold into similar structures. This relationship was first identified by Chothia & Lesk2 and later quantified by Sander & Schneider3 as summarized in figure 2.1. Since the initial establishment of this relationship, Rost et al.4 was able to derive a more precise limit for this rule with accumulated data in the Protein Data Bank (PDB). Two protein sequences are highly likely to adopt a similar structure provided that the percentage identity between these proteins for a given length is above the threshold shown in the figure.

Figure 2.1: The two zones of sequence alignments defining the likelihood of adopting similar structures. Two sequences are highly likely to fold into the same structure if their length and percentage sequence identity fall into the region above the threshold, indicated with the smiling icon (the ‘safe’ zone). The region below the threshold indicates the zone where inference of structural similarity cannot be made, thus making it difficult to determine if model building will be possible. (Figure after Sander and Schneider3)

23

The Method: Homology Modelling 2

The identity between the sequences can be obtained by doing a simple BLAST run with the sequence of interest, the model-sequence, against the PDB, see figure 2.2. The sequence that aligns with the model-sequence is called the ‘template’. When the percentage identity in the aligned region of the structure template and the sequences to be modeled falls in the safe modeling zone of figure 2.1, a model can be built.

Figure 2.2: Typical blast output of a model-sequence run against the PDB sequences, where (Q) represents the model-sequence with unknown structure and (S) is the sequence for the PDB structure template.

In figure 2.1 it is shown that the threshold for safe homology modeling can be as low as 25%, especially for longer sequences. However, the quality of the model is dependent on the sequence identity and should be considered since sequences with 25% sequence identity tend to yield poor structural models with high uncertainty in side chain positioning. The impact of identities between the model-sequence and the template on the process of protein homology modeling can be broken down as follows. At sequence identities greater than 75% protein homology modeling can easily be achieved requiring minor manual intervention. Consequently, the time needed to finalize such a model is limited only by the interpretation of the model to answer the biological question at hand, see figure 2.3. The level of accuracy achieved with these models is comparable to structures that were solved by NMR. For sequence identities between 50 - 75% more time is needed to fine-tune the details of the model and correct the alignment if needed. Between 25-50% identity obtaining the best possible alignment becomes the major concern in the process of homology modeling. Finally, sequence identities lower than 25% often mean that no template structure can be detected with a simple BLAST search thus requiring the use of more sensitive alignment techniques to find a potential template structure.

Figure 2.3: The limiting steps in homology modeling as function of percentage sequence identity between the model-sequence and the sequence of the template. (Figure based on Rodriguez and Vriend5)

24

The Method: Homology Modelling 2

In practice, homology modeling is a multi-step process which can be summarized as follows:

1: Template recognition and initial alignment 2: Alignment correction 3: Backbone generation 4: Loop modeling 5: Side chain modeling 6: Model optimization 7: Model validation 8: Iteration

These steps are all illustrated in figure 2.4 and will be discussed in detail in the rest of this chapter.

Decisions need to be made by the modeler at almost every step of this model building process. The choices are not always obvious, thus subjecting model building to a serious thought about how to gamble between multiple seemingly similar choices. To reduce possible errors introduced by a subjective decision making process, algorithms have been developed to automate the model building process (see table 2.1).

Table 2.1: A few examples of the online available homology modeling servers. Server name URL Automatic Homology Modeling Servers

3D-Jigsaw http://www.bmm.icnet.uk/servers/3djigsaw/ CPHModels http://www.cbs.dtu.dk/services/CPHmodels/ EsyPred3D http://www.fundp.ac.be/urbm/bioinfo/esypred/ Robetta http://robetta.bakerlab.org/ SwissModel http://swissmodel.expasy.org/ TASSER-lite http://cssb.biology.gatech.edu/skolnick/webservice/tasserlite/index.html

Semi-Automatically Homology Modeling Servers (provide your own alignment)

HOMER http://protein.cribi.unipd.it/homer/help.html WHAT IF http://swift.cmbi.kun.nl/WIWWWI/

Current techniques allow modelers to construct models for about 25-65% of the amino acids coded in the genome, thereby supplementing the efforts of structural genomics projects6. This value differs significantly between individual genomes, and increases steadily with the continuous growth of the PDB. The remainder of these genomes have no identified template that can be used for homology modeling and therefore modelers will need to resort to fold recognition, ab initio structure prediction, or simply the traditional NMR or X-ray experiment to obtain structural data. While automated model building provides a high throughput-solution the evaluation of these automated methods during CASP indicated that human expertise is still helpful, especially if the sequence identity of the alignment is close to the zone of uncertainty regarding the feasibility of building a proper model (25%, see figure 2.1)7. The 8 steps of homology modeling will be discussed in more detail on the following pages.

Next page: Figure 2.4: The process of building a ‘model’ by homology to a ‘template’. The numbers in the plot correspond to the step-numbers in the subsequent section.

25

The Method: Homology Modelling 2

26

The Method: Homology Modelling 2

Step 1 - Template Recognition and Initial Alignment

Sequences in the safe homology modeling zone (figure 2.1) share a high percentage identity to a possible template and therefore can easily be paired with simple sequence alignment programs such as BLAST8 or FASTA9. To identify the template, the program compares the query sequence to all the sequences of known structures in the PDB using mainly two matrices:

 A residue exchange matrix (figure 2.5). This matrix defines the likelihood that any two of the 20 amino acids ought to be aligned. Exchanges between different residues with similar physico-chemical properties (for example Phe->Tyr) get a better score than exchanges between residues that widely differ in their properties. Conserved residues generally obtain the highest score.  An alignment matrix (figure 2.6). The axes of this matrix correspond to the two sequences to be aligned, and the matrix elements are simply the values from the residue exchange matrix (figure 2.5) for a given pair of residues. During the alignment process, one tries to find the best path through this matrix, starting from a point near the top left, and going down to the bottom right. To make sure that no residue is used twice, one must always take at least one step to the right and one step down. A typical alignment path is shown in figure 2.6. At first sight, the dashed path in the bottom right corner would have led to a higher score. However, it requires the opening of an additional gap in sequence A (Gly of sequence B is skipped). By comparing thousands of sequences and sequence families, it became clear that the opening of gaps is about as unlikely as at least a couple of non-identical residues in a row. The jump roughly in the middle of the matrix on the other hand is justified, because after the jump we earn lots of points (5,6,5) which otherwise would only have been (1,0,0). The alignment algorithm therefore subtracts an ‘opening penalty’ for every new gap and a much smaller ‘gap extension penalty’ for every residue that is additionally skipped once the gap has already been made. The gap extension penalty is much smaller than the gap open penalty because one gap of three residues is much more likely than three gaps of one residue each.

In practice, template structures can be easily retrieved by submitting the query sequence to one of the countless BLAST servers on the web, using the PDB as the database to search. Usually, the template structure with the highest sequence identity will be the first option, see figure 2.2, but other considerations should be made. For example, the conformational state (i.e. active or inactive), present co-factors, other molecules or multimeric complexes will have an impact on model building. Nowadays, the increasing amount of CPU power makes it possible to choose multiple templates and build multiple models giving the investigator the opportunity to select the best model for further study. It has also become possible to combine multiple templates into one structure that is used for modeling. The online Swiss-Model10 and the Robetta servers11, for example, use this approach.

27

The Method: Homology Modelling 2

Figure 2.5: A typical residue exchange or scoring matrix used by alignment algorithms. Because the score for aligning residues A and B is normally the same as for B and A, this matrix is symmetric.

Figure 2.6: The alignment matrix for the sequences VATTPDKSWLTV and ASTPERASWLGTA, using the scores from figure 2.5. The optimum path corresponding to the alignment on the right side is shown in gray. Residues with similar properties are marked with a star '*'. The dashed line marks an alternative alignment that scores more points but requires to open a second gap.

Step 2 - Alignment Correction Having identified one or more possible modeling templates using the initial screen described above, more sophisticated methods are needed to arrive at a better alignment. Sometimes it may be difficult to align two sequences in a region where the percentage sequence identity is very low. A potential strategy is to use other sequences from other homologous proteins to find a solution. An example of using this strategy to address a challenging alignment is shown in figure 2.7. Suppose the sequence LTLTLTLT needs to be aligned with YAYAYAYAY. There are two equally poor possibilities, and the use of third sequence, TYTYTYTYT, that aligns easily to the two sequences can resolve the issue.

28

The Method: Homology Modelling 2

Figure 2.7: Addressing a challenging alignment problem with homologous sequences. Sequences A and B are impossible to align, unless one considers a third sequence C from a homologous protein.

The example above introduced a very powerful concept called ‘multiple sequence alignment’. Many programs are available to align a number of related sequences, for example CLUSTALW12, and the resulting alignment contains a lot of additional information. For example, consider an Ala->Glu mutation. Relying on the matrix in figure 2.5, this exchange always gets a score of 1. In the three dimensional structure of the protein, it is however very unlikely to see such an Ala->Glu exchange in the hydrophobic core, but on the surface this mutation is perfectly normal. The multiple sequence alignment implicitly contains information about this structural context. If at a certain position only exchanges between hydrophobic residues are observed, it is highly likely that this residue is buried. To consider this knowledge during the alignment, one uses the multiple sequence alignment to derive position specific scoring matrices, also called ‘profiles’13; 14. During the last years, new programs like MUSCLE15 and T-Coffee16 have been developed that use these profiles to generate and refine the multiple sequence alignments. Structure based alignment programs, like 3DM, also include structural information in combination with profiles to generate multiple sequence alignments17. The use of 3DM on a specific class of proteins can result in entropy vs. variability plots. The location of a residue in this plot is directly related to function in the protein. This information can in turn be added to the profile and used to correct the alignment or to optimize position specific gap penalties.

When building a homology model, we are in the fortunate situation of having an almost perfect profile: the known structure of the template. We simply know that a certain alanine sits in the protein core and must therefore not be aligned with a glutamate. Multiple sequence alignments are nevertheless useful in homology modeling, for example to place deletions or insertions only in areas where the sequences are strongly divergent. A typical example for correcting an alignment with the help of the template is shown in figures 2.8 and 2.9. Although a sequence alignment gives the highest score for alignment 1 in figure 2.8, a simple look at the structure of the template reveals that alignment 2 is actually a better alignment, because it leads to a small gap, compared to a huge hole associated with alignment 1.

Template I C R L P G S A E A 1: Model (bad) V C R M P - - - E A 2: Model (good) V C R - - - M P E A

Figure 2.8: Example of a sequence alignment where a three-residue deletion must be modeled. While alignment 1, red, appears better when considering just the sequences (a matching proline at position 5), a look at the structure of the template leads to a different conclusion (figure 2.9).

29

The Method: Homology Modelling 2

Figure 2.9: Correcting an alignment based on the structure of the modeling template (C-trace shown in black). While the alignment with the highest score (red, also in figure 2.8) leads to a big gap in the structure, the second option (blue) creates only a tiny hole. This can easily be accommodated by small backbone shifts.

Step 3 - Backbone Generation When the alignment is ready, the actual model building can start. Creating the backbone is trivial for most of the model: one simply transfers the coordinates of those template residues that show up in the alignment with the model-sequence (see figure 2.2). If two aligned residues differ, the backbone coordinates for N, C, C and O and often also the Cβ can be copied. Conserved residues can be copied completely to provide an initial guess. Experimentally determined protein structures are not perfect (but still better than models in most cases). There are countless sources of errors, ranging from poor electron density in the X-ray diffraction map to simple human errors when preparing the PDB file for submission. A lot of work has been spent on writing software to detect these errors (correcting them is even harder), and the current count is at more than 25,000,000 errors in the approximately 50,000 structures deposited in the PDB by the end of 2007. The current PDB-redo18 and RECOORD19 projects have respectively shown that re-refinement of X-ray and NMR structures normally improves the quality of the structure, suggesting that re-refinement before modeling seems to be a wise option. A straightforward way to build a good model is to choose the template with the fewest errors (the PDBREPORT database20 at www.cmbi.ru.nl/gv/pdbreport can be very helpful). But what if two templates are available, and each has a poorly determined region, but these regions are not the same? One should clearly combine the good parts of both templates in one model - an approach known as multiple template modeling. (The same applies if the alignments between the model sequence and possible templates show good matches in different regions). Although this is a simple principle (and used by automated modeling servers like Swiss-Model10) it is hard in practice to achieve results that are closer to the true structure than all the templates. Nevertheless the feasibility of this strategy has already been demonstrated by Andrej Šali's group in CASP4. One extreme example of a program that combines multiple templates is the Robetta server (see table 2.1). This server uses several different algorithms to predict domains in the sequence. The regions of the model-sequence that contain a homologous domain in the PDB are modeled while those parts without homology are predicted de novo by using the Rosetta method. This method compares small fragments of the sequence with the PDB and inserts them with the same local conformation into the model. The Robetta server can generate models of complete sequences even

30

The Method: Homology Modelling 2 without a known template11. Skolnick’s TASSER follows a same strategy but combines larger fragments, found by threading, and folds them into a complete structure21.

Step 4 - Loop Modeling For the majority of homology model building cases, the alignment between model and template sequence contains gaps. Either in the model sequence (deletions as shown in figures 2.8 and 2.9), or in the template sequence (insertions). Gaps in the model-sequence are addressed by simply omitting residues from the template, thus creating a hole in the model that must be closed. Gaps in the template sequences are treated by inserting the missing residues into the continuous backbone. Both cases imply a conformational change of the backbone. The good news is that conformational changes often do not occur within regular secondary structure elements. Therefore it is often safe to shift all insertions or deletions in the alignment out of helices and strands and place them in structural elements that can accommodate such changes in the alignment such as loops and turns. The bad news is that changes in loop conformation are notoriously hard to predict (a big unsolved problem in homology modeling). To make things worse, even without insertions or deletions a number of different loop conformations can be observed between the template and model. There are three main reasons for this difficulty5:  Surface loops tend to be involved in crystal contacts and therefore a significant conformational change is expected between the template in crystal form and the final structure modeled in the absence of crystallization constraints.  The exchange of small to bulky side chains underneath the loop induces a structural change by pushing it away from the protein.  The mutation of a loop residue to proline or from glycine to any other residue will result in a decrease in conformational flexibility. In both cases, the new residue must fit into a more restricted area in the Ramachandran plot, which normally requires a conformational change of the loop.

There are three main approaches to model loops:  Knowledge based: a search is made through the PDB for known loops containing endpoints that match the residues between which the loop is to be inserted. The identified coordinates of the loop are then transfered. All major molecular modeling programs and servers support this approach (e.g. 3D-Jigsaw22, Insight23, Modeller24, Swiss-Model10 or WHAT IF25).  In between or hybrid: the loop is divided in small fragments that are all separately compared to the PDB. This strategy has been described earlier as implemented for the Rosetta method11. The local conformation of all small fragments results in an ab initio modeled loop but is still based on known protein structures. This method is reminiscent of the very old ECEPP software by the Scheraga group26.  Energy based: as in true ab initio fold prediction, an energy function is used to judge the quality of a loop. This is followed by a minimization of the structure, using Monte Carlo27 or molecular dynamics techniques28 to arrive at the best loop conformation. Often the energy function is modified (e.g. smoothed) to facilitate the search29.

For short loops (up to 5-8 residues), the various methods have a reasonable chance of predicting a loop conformation close to the true structure. As mentioned above, surface loops tend to change their conformation due to crystal contacts. So if the prediction is made for an isolated protein and then found to differ from the crystal structure, it might still be correct.

31

The Method: Homology Modelling 2

Step 5 - Side Chain Modeling When we compare the side chain conformations (‘rotamers’) of residues that are conserved in structurally similar proteins, we find that they often have similar 1-angles (i.e. the torsion angle about the C-C bond). It has been shown by Summers et al. that in homologous proteins (over 40% identity) at least 75% of the Cγ occupy the same orientation30. It is therefore possible to simply copy conserved residues entirely from the template to the model (see also Step 3) and achieve a very good starting point for structure optimisation. In practice, this rule of thumb holds only at high levels of sequence identity, when the conserved residues form networks of contacts. When they get isolated (<35% sequence identity), the rotamers of conserved residues may differ in up to 45% of the cases31. In practice, all successful approaches to side chain placement are at least partly knowledge based. Libraries of common rotamers extracted from high resolution X-ray structures are often used to position side chains. The various rotamers are successively explored and scored with a variety of energy functions. Intuitively, one might expect rotamer prediction to be computationally demanding due to the combinatorial explosion of the search space - the choice of a certain rotamer automatically affects the rotamers of all neighboring residues, which in turn affect their neighbors and the effect propagates continuously. For a sequence of 100 residues and an average ~5 rotamers per residue, the rotamer space would yield 5100 different possible conformations to score. Significant research efforts have been invested to develop algorithms to address this issue and make the search through the rotamer conformation space more tractable32; 33. Aside from directly extracting conserved rotamers from the template, the key to handling the combinatorial explosion of conformational possibilities lies in the protein backbone. Instead of using a ‘fixed’ library that stores all possible rotamers for all residue types, an alternative ‘position specific’ library can be used. These libraries use information contained in the backbone to select the correct rotamer. A simple form of a position specific library classifies the backbone based on secondary structure since residues found in helices often favor a rotamer conformation that is not observed in strands or turns. More sophisticated position specific libraries can be built by classifying the backbone according to its Phi/Psi angles or by analyzing high resolution structures and collecting all stretches of 5 to 9 residues (depending on the method) with a reference amino acid at the center of the stretch. These collected examples in the template are superposed on the corresponding backbone to be modeled. The possible side chain conformations are selected from the best backbone matches34. Since certain backbone conformations will be strongly favored for certain rotamers to be adopted (allowing for example a hydrogen bond between side chain and backbone) this strategy greatly reduces the search space. For a given backbone conformation, there may be a residue that strongly populates a specific rotamer conformation and therefore can be modeled immediately. This residue would then serve as an anchor point to model surrounding side chains that may be more flexible and adopt a number of other conformations. An example of a backbone conformation that favors two different rotamers is shown in figure 2.10.

Figure 2.10: Example of a backbone-dependent rotamer library. The current backbone conformation favors two different rotamers for Tyrosine (shown as sticks) which appear about equally often in the database.

32

The Method: Homology Modelling 2

Position-specific rotamer libraries are widely used today for drug docking purposes to visualize all possible shapes of the active site35-37. The study by Chinea et al.34 shows that the search space is even considerably smaller than assumed by Desmet et al.32. Although search space for rotamer prediction initially presented a combinatorial problem to be explored exhaustively, it is far smaller than originally assumed as evidenced in a study conducted in 2001 by Xiang and Honig who first removed a single side chain from known structures and re- predicted it38. In a second step, they removed all the side chains and added them again using the same method. Surprisingly, it turned out that the accuracy was only marginally higher in the much easier first case. Rotamer prediction accuracy is usually quite high for residues in the hydrophobic core where more than 90% of all 1-angles fall within 20° from the experimental values, but much lower for residues on the surface where the percentage is often even below 50%. There are three reasons for this:  Experimental reasons: flexible side chains on the surface tend to adopt multiple conformations, which are additionally influenced by crystal contacts. So even experiments cannot provide a single correct answer.  Theoretical reasons: the energy functions used to score rotamers can easily handle the hydrophobic packing in the core (mainly Van der Waals interactions). The calculation of electrostatic interactions on the surface, including hydrogen bonds with water molecules and associated entropic effects, is more complicated. Nowadays, these calculations are being included in more force fields that are used to optimize the models39.  Biological reasons: loops on the surface often adopt different conformations as part of the biological function, for example to let the substrate enter the protein. It is important to note that the rotamer prediction accuracies given in most publications cannot be reached in real-life applications. This is simply due to the fact that the methods are evaluated by taking a known structure, removing the side chains and re-predicting them. The algorithms thus rely on the correct backbone, which is not available in homology modeling since the backbone of the template often differs significantly from the model structure. The rotamers must thus be predicted based on a ‘wrong’ backbone and reported prediction accuracies tend to be higher than what would be expected for the modeled backbone.

Step 6 - Model Optimization The problem of rotamer prediction mentioned above leads to a classical ‘chicken and egg’ situation. To predict the side chain rotamers with high accuracy, we need the correct backbone, which in turn depends on the rotamers and their packing. The common approach to address this problem is to iteratively model the rotamers and backbone structure. First, we predict the rotamers, then remodel the backbone to accommodate rotamers, followed by a round of refitting the rotamers to the new backbone. This process is repeated until the solution converges. This boils down to a series of rotamer prediction and energy minimization steps. Energy minimization procedures used for loop modeling are applied to the entire protein structure, not just an isolated loop. This requires an enormous accuracy in the energy functions used, because there are many more paths leading away from the answer (the model structure) than towards it. This is why energy minimization must be used carefully. At every minimization step, a few big errors (like bumps, i.e. too short atomic distances) are removed while at the same time many small errors are introduced. When the big errors are gone, the small ones start accumulating and the model moves away from its correct structure, see figure 2.11. As a rule of thumb, today's modeling programs therefore either restrain the atom positions and/or apply only a few hundred steps of energy minimization. In short, model optimization does not work until energy functions (force fields) become more accurate. Two ways to achieve better energetic models for energy minimization are currently being pursued:

33

The Method: Homology Modelling 2

 Quantum force fields: Protein force fields must be fast to handle these large molecules efficiently; energies are therefore normally expressed as a function of the positions of the atomic nuclei only. The continuous increase of computer power has now finally made it possible to apply methods of quantum chemistry to entire proteins, arriving at more accurate descriptions of the charge distribution40. It is however still difficult to overcome the inherent approximations of today's quantum chemical calculations. Attractive Van der Waals forces are for example so hard to treat, that they must often be completely omitted. While providing more accurate electrostatics, the overall accuracy achieved is still about the same as in the classical force fields.  Self-parameterizing force fields: The accuracy of a force field depends to a large extent on its parameters (e.g. Van der Waals radii, atomic charges). These parameters are usually obtained from quantum chemical calculations on small molecules and fitting to experimental data, following elaborate rules41. By applying the force field to proteins, one implicitly assumes that a peptide chain is just the sum of its individual small molecule building blocks - the amino acids. Alternatively, one can just state a goal - e.g. improve the models during an energy minimization - and then let the force field parameterize itself while trying to optimally fulfill this goal42; 43. This leads to a computationally rather expensive procedure. Take initial parameters (for example from an existing force field), change a parameter randomly, energy minimize models, see if the result improved, keep the new force field if yes, otherwise go back to the previous force field. With this procedure, the force field accuracy increases enough to go in the right direction during an energy minimization (figure 2.11). The YASARA software uses this technology for homology model optimization. The CASP 2008 experiment beautifully illustrates the enormous potential of this method. The most straightforward approach to model optimization is to simply run a molecular dynamics simulation of the model. Such a simulation follows the motions of the protein on a femtosecond (10- 15 s) timescale and mimics the true folding process. One thus hopes that the model will complete its folding and converges to the true structure during the simulation. The advantage is that a molecular dynamics simulation implicitly contains entropic effects that are otherwise hard to treat; the disadvantage is that the force fields are again not accurate enough to make it work.

Figure 2.11: The average RMSD between models and the real structures during an extensive energy minimization of 14 homology models with two different force fields. Both force fields improve the models during the first ~500 energy minimization steps but then the small errors sum up in the classic force field and guide the minimization in the wrong direction, away from the real structure while the self-parameterizing force field goes in the right direction. To reach experimental accuracy, the minimization would have to proceed all the way down to ~0.5 Å which is the uncertainty in experimentally determined coordinates.

34

The Method: Homology Modelling 2

Different distributed computing projects (folding@home, Rosetta@home, Models@home), have been developed to use many personal computers in a network to run molecular dynamics simulations and to mimic protein folding. Model optimization becomes more and more important. Even the focus of the CASP competition is changing to a comparison of the optimization of initial models provided by these online servers instead of building the initial models from scratch.

Step 7 - Model Validation Every protein structure contains errors, and homology models are no exception. The number of errors (for a given method) mainly depends on two values:  The percentage sequence identity between template and model-sequence. If the identity is greater than 90%, the accuracy of the model can be compared to crystallographically determined structures, except for a few individual side chains2; 44. From 50% to 90% identity, the root mean square error in the modeled coordinates can be as large as 1.5 Å, with considerably larger local errors. If the sequence identity drops to 25%, the alignment turns out to be the main bottleneck for homology modeling, often leading to very large errors rendering it a meaningless effort to model these regions, see also figure 2.3.  The number of errors in the template. Errors in the template structure can be reduced by an additional re-refinement of the template structure as mentioned earlier. Errors in a model become less of a problem if they can be localized. For example, a loop located distantly from a functionally important region such as the active site of an enzyme can tolerate some inaccuracies. Nevertheless, it is in the best interest to model all regions as accurately as possible since seemingly unimportant residues may be important for protein- protein interactions or a yet unassigned function.

There are two principally different ways to estimate errors in a structure:

 Calculating the model's energy based on a force field: This checks if the bond lengths and bond angles are within normal ranges, and if there are lots of clashing side chains in the model (corresponding to a high Van der Waals energy). Essential questions like "Is the model folded correctly?" cannot yet be answered this way, because completely misfolded but well minimized models often reach the same force field energy as the correct structure45. This is mainly due to the fact that molecular dynamics force fields are lacking several contributing terms to the energy function, most notably those related to solvation effects and entropy.  Determination of normality indices: This describes how well a given characteristic of the model resembles the same characteristic in real structures. Many features of protein structures are well suited for normality analysis. Most of them are directly or indirectly based on the analysis of interatomic distances and contacts. Some published examples are: - General checks for the normality of bond lengths, bond- and torsion angles46,47 are good checks for the quality of experimentally determined structures, but are less suitable for the evaluation of models because the better model building programs simply do not make this kind of errors. - Inside/outside distributions of polar and apolar residues can be used to detect completely misfolded models48. - The radial distribution function for a given type of atom (i.e. the probability to find certain other atoms at a given distance) can be extracted from the library of known structures and converted into an energy-like quantity, called a ‘potential of mean force’49. Such a potential can easily distinguish good contacts (e.g. between a C of and a C of ) from bad ones (e.g. between the same C of valine and the positively charged amino group of ).

35

The Method: Homology Modelling 2

- The direction of atomic contacts can also be accounted for in addition to interatomic distances. The result is a three dimensional distribution of functions, that can also easily identify misfolded proteins and are good indicators of local model building problems50.

Most methods used for the verification of models can also be applied to experimental structures (and hence to the templates used for model building). A detailed verification is essential when trying to derive new information from the model, either to interpret or predict experimental results or plan new experiments.

Step 8 - Iteration When errors in the model are recognized and located, they can be corrected by iterating portions of the homology modeling process. Small errors that are introduced during the optimization step can be removed by running a shorter molecular dynamics simulation. An error in a loop can be corrected by choosing another loop conformation in the loop modeling step. Large mistakes in the backbone conformation sometimes require the complete process to be repeated with another alignment or even with a different template.

In summary, it is safe to say that homology modeling is unfortunately not as easy as stated in the beginning. Ideally, homology modeling also uses threading to improve the alignment, ab initio folding to predict the loops and molecular dynamics simulations with a perfect energy function to converge into true structure. Doing all this correctly will keep researchers busy for a long time, leaving lots of fascinating discoveries.

Acknowledgements We thank Rolando Rodriguez, Chris Spronk, Sander Nabuurs, Robbie Joosten, Maarten Hekkelman, David Jones and Rob Hooft for stimulating discussions, practical help and critically reading the document. We apologize to the numerous crystallographers who made all this work possible by depositing structures in the PDB for not referring to each of the 50.000 very important articles describing these structures.

36

The Method: Homology Modelling 2

References 1. Epstain, C.J.G., R.F. Anfinsen, C.B. (1963). Cold Spring Harbor Symp. Quant Biol 28, 439. 2. Chothia, C., and Lesk, A.M. (1986). The relation between the divergence of sequence and structure in proteins. EMBO J 5, 823-826. 3. Sander, C., and Schneider, R. (1991). Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 9, 56-68. 4. Rost, B. (1999). Twilight zone of protein sequence alignments. Protein Eng 12, 85-94. 5. Rodriguez, R.V., G. (1997). Professional gambling. Proceedings of the NATO Advanced Study Institute on Biomolecular Structure and Dynamics: Recent Experimental and Theoretical Advances. 6. Xiang, Z. (2006). Advances in homology protein structure modeling. Curr Protein Pept Sci 7, 217-227. 7. Fischer, D., Barret, C., Bryson, K., Elofsson, A., Godzik, A., Jones, D., Karplus, K.J., Kelley, L.A., MacCallum, R.M., Pawowski, K., et al. (1999). CAFASP-1: critical assessment of fully automated structure prediction methods. Proteins Suppl 3, 209-217. 8. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990). Basic local alignment search tool. J Mol Biol 215, 403-410. 9. Pearson, W.R. (1990). Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol 183, 63-98. 10. Peitsch, M.C., Schwede, T., and Guex, N. (2000). Automated protein modelling--the proteome in 3D. Pharmacogenomics 1, 257-266. 11. Kim, D.E., Chivian, D., and Baker, D. (2004). Protein structure prediction and analysis using the Robetta server. Nucleic Acids Res 32, W526-531. 12. Thompson, J.D., Higgins, D.G., and Gibson, T.J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22, 4673-4680. 13. Taylor, W.R. (1986). Identification of protein sequence homology by consensus template alignment. J Mol Biol 188, 233-258. 14. Dodge, C., Schneider, R., and Sander, C. (1998). The HSSP database of protein structure-sequence alignments and family profiles. Nucleic Acids Res 26, 313-315. 15. Edgar, R.C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32, 1792-1797. 16. Notredame, C., Higgins, D.G., and Heringa, J. (2000). T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 302, 205-217. 17. Folkertsma, S., van Noort, P., Van Durme, J., Joosten, H.J., Bettler, E., Fleuren, W., Oliveira, L., Horn, F., de Vlieg, J., and Vriend, G. (2004). A family-based approach reveals the function of residues in the nuclear receptor ligand-binding domain. J Mol Biol 341, 321-335. 18. Joosten, R.P., and Vriend, G. (2007). PDB improvement starts with data deposition. Science 317, 195-196. 19. Nederveen, A.J., Doreleijers, J.F., Vranken, W., Miller, Z., Spronk, C.A., Nabuurs, S.B., Guntert, P., Livny, M., Markley, J.L., Nilges, M., et al. (2005). RECOORD: a recalculated coordinate database of 500+ proteins from the PDB using restraints from the BioMagResBank. Proteins 59, 662-672. 20. Hooft, R.W., Vriend, G., Sander, C., and Abola, E.E. (1996). Errors in protein structures. Nature 381, 272. 21. Zhang, Y., and Skolnick, J. (2004). Tertiary structure predictions on a comprehensive benchmark of medium to large size proteins. Biophys J 87, 2647-2655. 22. Bates, P.A., and Sternberg, M.J. (1999). Model building by comparison at CASP3: using expert knowledge and computer automation. Proteins Suppl 3, 47-54. 23. Dayringer, H.E.T., A. Fletterick, R.J. . (1986). Interactive program for visualization and modelling of proteins, nucleic acids and small molecules. J Mol Graph 4, 82-87. . 24. Sali, A., and Blundell, T.L. (1993). Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234, 779-815. 25. Vriend, G. (1990). WHAT IF: a molecular modeling and drug design program. J Mol Graph 8, 52-56, 29. 26. Zimmerman, S.S., Pottle, M.S., Nemethy, G., and Scheraga, H.A. (1977). Conformational analysis of the 20 naturally occurring amino acid residues using ECEPP. Macromolecules 10, 1-9. 27. Simons, K.T., Bonneau, R., Ruczinski, I., and Baker, D. (1999). Ab initio protein structure prediction of CASP III targets using ROSETTA. Proteins Suppl 3, 171-176. 28. Fiser, A., Do, R.K., and Sali, A. (2000). Modeling of loops in protein structures. Protein Sci 9, 1753-1773. 29. Tappura, K. (2001). Influence of rotational energy barriers to the conformational search of protein loops in molecular dynamics and ranking the conformations. Proteins 44, 167-179.

37

The Method: Homology Modelling 2

30. Summers, N.L., Carlson, W.D., and Karplus, M. (1987). Analysis of side-chain orientations in homologous proteins. J Mol Biol 196, 175-198. 31. Sanchez, R., and Sali, A. (1997). Evaluation of comparative protein structure modeling by MODELLER-3. Proteins Suppl 1, 50-58. 32. Desmet, J., De Maeyer, M. Hazes, B. Lasters, I. . (1992). The dead-end elimination theorem and its use in protein side-chain positioning. . Nature 356, 539-542. . 33. Canutescu, A.A., Shelenkov, A.A., and Dunbrack, R.L., Jr. (2003). A graph-theory algorithm for rapid protein side-chain prediction. Protein Sci 12, 2001-2014. 34. Chinea, G., Padron, G., Hooft, R.W., Sander, C., and Vriend, G. (1995). The use of position-specific rotamers in model building by homology. Proteins 23, 415-421. 35. De Filippis, V., Sander, C., and Vriend, G. (1994). Predicting local structural changes that result from point mutations. Protein Eng 7, 1203-1208. 36. Stites, W.E., Meeker, A.K., and Shortle, D. (1994). Evidence for strained interactions between side-chains and the polypeptide backbone. J Mol Biol 235, 27-32. 37. Dunbrack, R.L.J.R.K., M. (1992). Conformational analysis of the backbone dependent rotamer preferences of protein side chains. Nat Struct Biol 5 334-340. . 38. Xiang, Z., and Honig, B. (2001). Extending the accuracy limits of prediction for side-chain conformations. J Mol Biol 311, 421-430. 39. Vizcarra, C.L., and Mayo, S.L. (2005). Electrostatics in computational protein design. Curr Opin Chem Biol 9, 622-626. 40. Liu, H., Elstner, M., Kaxiras, E., Frauenheim, T., Hermans, J., and Yang, W. (2001). Quantum mechanics simulation of protein dynamics on long timescale. Proteins 44, 484-489. 41. Wang, J.C., P. Kollman, P.A. . (2000). How well does a restrained electrostatic potential (RESP) model perform in calculating conformational energies of organic and biological molecules? J Comp Chem 21, 1049-1074. . 42. Krieger, E., Darden, T., Nabuurs, S.B., Finkelstein, A., and Vriend, G. (2004). Making optimal use of empirical energy functions: force-field parameterization in crystal space. Proteins 57, 678-683. 43. Krieger, E., Koraimann, G., and Vriend, G. (2002). Increasing the precision of comparative models with YASARA NOVA--a self-parameterizing force field. Proteins 47, 393-402. 44. Sippl, M.J. (1993). Recognition of errors in three-dimensional structures of proteins. Proteins 17, 355-362. 45. Novotny, J., Rashin, A.A., and Bruccoleri, R.E. (1988). Criteria that discriminate between native proteins and incorrectly folded models. Proteins 4, 19-30. 46. Morris, A.L., MacArthur, M.W., Hutchinson, E.G., and Thornton, J.M. (1992). Stereochemical quality of protein structure coordinates. Proteins 12, 345-364. 47. Czaplewski, C., Rodziewicz-Motowidlo, S., Liwo, A., Ripoll, D.R., Wawak, R.J., and Scheraga, H.A. (2000). Molecular simulation study of cooperativity in hydrophobic association. Protein Sci 9, 1235-1245. 48. Baumann, G., Frommel, C., and Sander, C. (1989). Polarity as a criterion in protein design. Protein Eng 2, 329-334. 49. Sippl, M.J. (1990). Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. J Mol Biol 213, 859-883. 50. Vriend, G.S., C. . (1993). Quality control of protein models: Directional atomic contact analysis. J Appl Cryst 26, 47-60. .

38

Homology modelling and spectroscopy 3

Previous page:

A Potassium Channel – The most widely distributed type of ion channel. This channel forms a selective pore in the cell membrane and regulates a wide variety of cellular functions. (PDB file 1s5h)

40

Homology modelling and spectroscopy 3

Introduction Knowledge of the 3D structure of proteins is a prerequisite for much research in fields as diverse as protein engineering, human genetics, and drug design. Only two spectroscopic techniques, NMR and X-ray, can produce high resolution three dimensional coordinates of macromolecules. Most other spectroscopic techniques either add information to such three dimensional coordinates, or require these coordinates for the detailed interpretation of their results. NMR and X-ray are very elaborate techniques, and worldwide only about 30 protein structures are solved per day. In the time needed to read the text above, on the other hand, about 50 sequences were determined (worldwide) and deposited in international, freely and easily accessible sequence databases. Consequently, the necessity for homology modeling is only increasing. In its most elementary form, homology modeling involves calculating the structure of a protein for which only the sequence is known using its alignment with a homologous protein for which the structure is known. The first homology modeling articles were published already in the late 1970's1, and ever since we have kept using, and improving the same concepts as described in those ground-breaking articles. The process starts with the detection of a suitable template; an alignment is produced; insertions, deletions, and residue substitutions are performed; the model is optimized; and since the late 1990's we all agree that structure validation is needed to detect any, unavoidably made, errors in the final model. In the early 1990's many homology models were built (and unfortunately also published; also in EBJ) just for the sake of modeling, but since the mid 1990's homology models are considered tools that can aid with the design of experiments and with the interpretation of their results, albeit that occasionally things still can go very wrong in the literature. G protein-coupled receptors (GPCRs) are by far the most important target for the pharmaceutical industry, and due to the scarcity of GPCR structure data GPCRs are the most frequently modeled molecules ever. Until August 2000 GPCR models were either built ab initio, or using the non-homologous bacteriorhodopsin as a homology modeling template. In August 2000 the structure of bovine opsin became available2, and since that moment GPCR homology modeling is a serious possibility. GPCR models built before that historical moment are better forgotten3. Unfortunately, people kept building, using, and publishing ab initio models even after the bovine opsin structure was solved4. Today, four GPCR structures are available that can be used as template; figure 3.1 shows the principal differences between the bovine opsin structure and the structure of the beta-2 adrenergic receptor5, one of the recently resolved GPCR crystal structures. Long before August 2000 the importance of GPCR structure models for drug design triggered a very large number of spectroscopic experiments focussed on a variety of aspects. The following series of short paragraphs illustrate the mutual relation between spectroscopy and homology modeling. Most examples are drawn from the GPCR research field. These examples are just illustrations and neither imply a judgment nor pretend completeness.

Exposed residue labeling In two studies Davidson and Findlay identified residues that were exposed to the membrane environment or the retinal binding site respectively, by labeling opsin with photoactivated l-azido-4- [125]iodobenzene6; 7. This study was carried out long before the first crystal structure of any GPCR became available, and determining which residues were allowed Davison and Findlay to get a more complete picture of the three dimensional organization of the opsin molecule.

41

Homology modelling and spectroscopy 3

Figure 3.1: The most striking difference between the crystal structures of rhodopsin (PDBid 1f882) and the beta-2 adrenergic receptor (PDBid 2rh15) concerns the structure and location of the extracellular loop between helix IV and V. The loop IV-V in rhodopsin forms a β-sheet that folds into the binding pocket (yellow), whereas loop IV-V in the beta-2 adrenergic receptor forms an α-helix and extends towards the extracellular environment (purple)

Site-directed spin labeling The group of Khorana used site-directed spin labeling to analyze structure and light-dependent changes of part of GPCRs8. The electron paramagnetic resonance (EPR) spectrum of each spin labeled mutant was analyzed in terms of residue accessibility and mobility. In the article where they describe the region extending from helix VII to the palmitoylation sites in the rhodopsin molecule they concluded that this region had extensive tertiary interactions9. After Oliveira et al.10-correctly- modeled this region as a helix that runs parallel to the membrane, the interpretation of the EPR data could be extended significantly, indicating how modeling can help interpret spectroscopic measurements.

Fluorescence resonance energy transfer Turcatti et al.11 studied ligand-receptor interactions in the neurokinin-2 receptor (NK2) using fluorescence resonance energy transfer (FRET). A fluorescent unnatural amino acid was introduced at known sites into NK2. Inter-molecular distances were determined by measuring the fluorescence resonance energy transfer between the fluorescent unnatural amino acids and a fluorescently labeled NK2 heptapeptide antagonist. A similar approach was used to measure distances between the cholecystokinin receptor and a natural agonist12 and between the secretin receptor and secretin analogues13. Distances obtained were used as constraints to improve models for ligand-receptor interactions. The NK2 results were interpreted in terms of an obviously very poor bacteriorhodopsin structure. Looking at their results more than 10 years later, and with four GPCR structures at hand, we can see that they located the NK2 ligand largely at the correct place. This tells us how to trust or distrust the secretin FRET results, and it illustrates how spectroscopy can help improve the modeling.

42

Homology modelling and spectroscopy 3

Fourier transform infrared spectroscopy Molecular models of rhodopsin based upon electron density projection maps that were constructed before the first crystal structure became available proposed a specific interaction between transmembrane (TM) helices III and V, which appeared to be mediated by amino acid residues Glu122 and His211 on TM helices III and V, respectively. Beck et al.14 used a combination of site- directed mutagenesis and Fourier transform infrared spectroscopy (FTIR) to validate this hypothesis.

Small angle scattering Small angle scattering has occasionally been used to assist with the homology modeling of water- soluble proteins. Mascarenhas et al.15 studied crotoxin. The sequence identity with a template structure was generally high enough to build a good homology model, but the structure of one large loop remained highly ambiguous. Small angle neutron scattering data corresponded much better with one of the two models made, thus solving this problem. Comoletti et al.16 studied the structure of the neuroligins and their complex with neurexin using small angle neutron scattering and small angle X-ray scattering. A high-resolution structure of the neurexin was available but no structure was available of a neuroligin. However, the neuroligins have significant sequence similarity with acetylcholine esterase making it possible to build a homology model. Scattering from neuroligin constructs was similar to that previously obtained from acetylcholine esterase structures17 indicating that the homology model was valid.

Fluorescent ligands Turcatti et al.18 studied the NK2 receptor using a number of fluorescent ligands, differing only in the length of the spacer between the fluorescent probe and the peptide ligand. By analyzing the different levels of fluorescence related to the spacer lengths they found that the binding pocket of the NK2 receptor was buried at a depth of 5-10 Å.

Modeling and spectroscopy of the M13 protein The final example to illustrate the never ending love story of modeling and spectroscopy was recently reviewed by the Hemminga group19. In a beautiful review article titled "From ‘I’ to ‘L’ and back again: the odyssey of membrane-bound M13 protein" they illustrate how a large series of spectroscopic techniques have been employed world-wide over a period of more than 20 years to continuously update the structure model of the M13 protein in its membrane-bound form. The (mainly spectroscopic) techniques used during this whole odyssey include NMR, site specific and other solid state NMR, X-ray fibre diffraction, cryo-electron microscopy, site directed spin-labeling and site directed introduction of fluorescence probes similar as described above for the GPCR studies, fluorescence energy transfer, site specific infrared dichroism, etc. Throughout this odyssey of Hemminga and others, homology modeling and spectroscopy were both applied. This example nicely illustrates the importance of homology modeling for spectroscopy, and vice versa. In some cases the spectroscopic results triggered the need to analyze the model, while in other cases the model suggested the spectroscopic experiments, and in some cases both went hand-in-hand. Throughout the odyssey the model improved and the spectroscopy got more sophisticated. The M13 odyssey also shows that spectroscopic techniques developed over the years, allowing for all the time more precise and more accurate measurements. In this same period homology modeling improved as well. In the next paragraphs we will discuss the latest developments in homology modeling and its remaining unsolved problems. We will also illustrate the usefulness of homology models, sometimes in combination with spectroscopic techniques, with a few real-world examples.

43

Homology modelling and spectroscopy 3

Method overview: the eight steps of homology modeling Homology modeling is usually described as a multi-step process in which the number of steps typically varies from X to Y. Here we use an eight step plan (figure 3.2). Over the years each of these eight steps has undergone extensive scrutiny and has been the topic of much research. Consequently, models built today with a fully automatic web server are considerably more accurate than the first model-approach four decades ago by Browne et al. using wire and plastic models of bonds and atoms20. Here, we will discuss the latest innovations and developments in each of the eight homology modeling steps.

Step 1: Template recognition and initial alignment Traditionally, the modeling template is found by performing a BLAST21 search against the Protein Data Bank (PDB)22. This approach is often successful when the query is highly identical to structures in the database. In contrast, templates that are close to the possible homology modeling threshold are harder to find or may even remain undetected23. The development of PSI-BLAST24 and fold- recognition methods25 improved the detection of these difficult-to-find templates because these methods use a profile instead of a single sequence to search the database. Furthermore, the growing number of structures collected in the PDB makes it every year easier to find a homologous one.

Figure 3.2: The eight steps of homology modeling. The first step involves finding a suitable, homologous protein of which the structure can be used as modeling template, and generating the initial alignment between that template and the model sequence. In step 2 this alignment is refined using, for example, knowledge obtained from the template structure. In step 3 the backbone is generated and deletions are performed so that, temporarily forgetting insertions, the backbone of the template looks like that of the model as much as possible. In step 4 gaps in the model are closed, and optionally loops are ab initio constructed. In step 5, the side chains are added using rotamer libraries to find the best rotamer for that local backbone conformation. Step 6 consists of a molecular dynamics simulation on the complete model in order to remove (the majority of) the introduced errors. In step 7 the model is checked for remaining errors using validation software. Depending on the outcome of the validation step we either approve the model or iterate the modeling process (step 8), starting from step 1-6.

44

Homology modelling and spectroscopy 3

Finding the best possible template is not limited to searching the PDB with (PSI-)BLAST. There may be several candidate templates with similar sequence identities to the query. In that case, the optimal template must be selected based on other criteria. The X-ray resolution is, although frequently used as such, only a limited measure of structure quality because it says something about the experimental data, not about the quality of the structure model. I.e. with 1.8Å data one can typically make a better structure model than with 2.2Å data, but whether or not this better structure model is actually made, depends on the crystallographer and the software he/she uses. The crystallographic residual, the so-called R-factor, tells us something about the correlation of the structure model with the experimental data and therefore seems more indicative than the X- ray resolution alone. Unfortunately, acceptable and even seemingly nice R-factors can be attained by adding more parameters to the structure models, effectively over-fitting/over-refining the model26. This problem was solved by the introduction of the free R-factor27, which is much more robust against over-fitting because the value is calculated only with the fraction of the X-ray data that was not used to build the structure model. Therefore, R-free can be seen as a description of how well the structure model predicts an ‘independent’ measurement. While very robust, the free R-factor is a global indicator: it describes the structure model as a whole, but not a particular part of the protein that may interest us. A local measure of fit with the experimental data is needed, for instance the real-space R-factor28. For PDB entries these values can be obtained from the Electron Density Server (EDS)29. So, with the X-ray resolution, the (free) R-factor, and the real-space R-factor one can select a proper template from the (PSI-)BLAST results. That is, one can select a template that corresponds well with the X-ray experiment. A more in depth analysis of the template structure is needed to see whether it also corresponds with our current knowledge of protein structures. Structure validation scores like Ramachandran Z-score30, the fraction of Ramachandran plot outliers31, side chain rotamer normality scores32, residue packing scores33, hydrogen bond network quality34, and many others are used to get insight in the geometric quality of the template structure. Most of these validation scores, both global and local scores, can be obtained via the PDB and the linked databanks. Another possible step in the template selection is the optimisation of the template before the actual modeling. We have recently shown that validation scores such as the Ramachandran Z- score and the number of atomic clashes (bumps) can be improved by a fully automated re- refinement of the PDB entry with its original experimental data35. In addition, the crystallographic R- factor, or rather the free R-factor is also improved by this re-refinement. This optimisation is particularly useful for templates that will be used for drug docking studies because their success often depends critically on the quality of the atomic model. The benefit of re-refinement is tightly correlated with sequence identity between the template and the model sequence. That is, any improvement of the atomic coordinates of a residue is lost when this residue (or just its side chain) has to be rebuilt. Fortunately, even with low sequence identity, there may be regions of the template that are not changed in the modeling process and thus can be improved by re-refinement. Of course, when sufficient CPU time is available to the modeller, it may be beneficial to use a number of (re-refined) PDB entries as templates, instead of a single one.

Step 2: Alignment correction Having identified one or more possible modeling templates using the initial screen described above, more sophisticated methods are needed to arrive at a better alignment. Molecular class-specific information systems (MCSISs) can be a great asset in the homology modeling process. MCSISs contain a large number of heterogeneous data on one particular class of proteins. The GPCRDB at www.gpcr.org/7tm/36 is a good example of such a system. It contains sequences, structures, mutation data, ligand binding data, and much more. All of this information is used (directly or in an indirect way) for creating sequence profiles for GPCR (sub)families. Profile-based alignments37 are used to generate alignments that are of a significantly higher quality than alignments generated by automatic methods based solely on sequence data.

45

Homology modelling and spectroscopy 3

Currently, MCSISs are available for only a small number of protein families and therefore in most cases other sequence alignment tools are needed. Many programs are available to align a number of related sequences, for example CLUSTALW38. The resulting multiple sequence alignment implicitly contains a lot of additional information. For example, if at a certain position only exchanges between hydrophobic residues are observed, it is highly likely that this residue is buried. Multiple sequence alignments are also useful to place deletions or insertions in areas where the sequences are strongly divergent. To consider this knowledge during the alignment, one uses the multiple sequence alignment to derive position specific scoring matrices, also called "profiles"39; 40. In recent years, new programs like MUSCLE41 and T-Coffee42 have been developed that use these profiles to generate and refine the multiple sequence alignments. When building a homology model, we are in the fortunate situation of having an almost perfect profile: the known structure of the template. We simply know that a certain alanine sits in the protein core and must therefore not be aligned with a glutamate. Alignment techniques like SSALN make use of this structural knowledge found in the template43.

Step 3: Backbone generation Aligned residues occupy the same position in template and model. Coordinates can thus simply be copied to create the initial model backbone. In practice, there are many ways to improve this crude recipe. First, the template is likely to be present more than once in the PDB (e.g. a bundle of NMR structures, multiple copies in the crystal, or solved multiple times under different conditions). Here, one can use structure validation tools31,32, like the PDBREPORT (www.cmbi.ru.nl/gv/pdbreport/) databank to pick the best one, correcting errors where possible. Second, one can combine multiple templates, because residues missing in one template can sometimes be found in the other, or because the alignment covers more than one template, which is common for multi-domain targets. The well-known SwissModel server (http://swissmodel.expasy.org/44) selects fragments from different PDB files that locally are most similar to the corresponding model fragment. The Zhang (http://zhang.bioinformatics.ku.edu/I-TASSER/45) and Robetta (http://robetta.bakerlab.org/46) modeling servers have successfully extended on this concept and are today among the best homology modeling servers available on-line. Some methods do not even create a single backbone at all, instead they use the alignment to derive restraints (hydrogen bonds, backbone torsion angles, etc.), and only later build the model, while trying to satisfy the restraints47.

Step 4: Loop modeling Any insertion or deletion in the alignment implies a structural change of the backbone, and can thus not be modeled in the previous step. Since these changes usually take place outside regular secondary structure elements, their prediction is referred to as 'loop modeling'. There are two major approaches to the problem: first knowledge-based methods48, which search the PDB for known loops with high sequence similarity to the target and endpoints that match the anchor residues between which the loop has to be inserted. And second energy-based methods, which sample random loop conformations while minimizing an energy function49. Since loops never fit the anchor points exactly, they have to be closed, using for example an algorithm borrowed from robotics50. In practice, a combination of both methods is common.

Step 5: Side chain modeling The most successful approaches to side-chain prediction are knowledge-based. They use libraries of common side-chain rotamers extracted from high resolution X-ray structures51-53. An essential feature of these libraries is backbone-dependence, hence they store the distribution of the side- chain dihedral angles (χ1, χ2, etc.) as a function of the backbone dihedrals φ and ψ. This does not only increase the accuracy, but also helps to shrink the search space (i.e. the possible combinations of interacting side-chain rotamers) to a size that can be handled, for example using dead-end

46

Homology modelling and spectroscopy 3 elimination54. The prediction accuracy is usually highest for residues in the hydrophobic core, where more than 90% of all predicted χ1-angles fall within ± 20° from the experimental values, but significantly lower for residues on the surface where the percentage drops to 70%, and further down 54 to 50% for the combined χ1/χ2 accuracy . This is mainly caused by the electrostatic and hydrogen bonding interactions, which are partly solvent mediated and much more difficult to get right than the simple repulsive Van der Waals interactions in the core. A second reason is the fact that flexible side chains on the surface tend to adopt multiple conformations, which are additionally influenced by crystal contacts, so there simply may not be a single correct conformation at all. Nevertheless, the surface residues are among the most important ones to get right; they mediate all the interactions, and applications like drug design or protein docking thus critically depend on them.

Step 6: Model optimization Once all these steps are completed, one obtains the initial homology model, which hopefully looks broadly similar to the (usually unknown) target structure. The little details however, like the precise backbone conformation, hydrogen bonding networks or certain side-chain rotamers, are often wrong. While this deficiency keeps scientists working on experimental structure determination busy, predictors strive to bridge the gap between model and target (the 'last mile of the protein folding problem') using various optimization and refinement techniques, the most prominent ones being molecular dynamic55 and Monte Carlo simulations56. For a given model, there are unfortunately many more paths leading away from the target than towards it, and combined with the limited accuracy of empirical force fields, this makes it trivial to reduce the model accuracy during the refinement. Consequently, the best optimization was often no optimization. We did well in the early CASP homology modeling competitions simply by not performing MD simulations on the models (except for 25 energy minimization steps with CNS57 to introduce the same local geometric features that CNS put in the real structure our prediction would be compared against). Nevertheless, steady progress over the past years has changed this rule of thumb (see Results section).

Step 7: Model validation Protein structures were error free till the landmark article in 1993 on Procheck by the Thornton group31. This article can be seen as the beginning of the realisation that crystallographers and NMR spectroscopists actually use experimental techniques to determine their coordinates. With the release of the first WHAT_CHECK32 structure validation became a common household technique for most scientists and, although not at the speed we hoped, errors in protein structures are becoming less frequent. The two main bottlenecks are the introduction of improved technologies in all structure solution software used all over the world, and the fact that the detection of an error does not implicitly mean that the error can be removed.

Step 8: Iteration If the model is not good enough, (part of) the modeling process has to be repeated. For instance, wrong side chain conformations can be improved by iterating the process from step 5 onwards. Sometimes, this iteration-step means that one has to start the modeling process all over again using another template or alignment. Alternatively, one can start several modeling processes using different templates. The resulting models can be combined in the end to produce a hybrid model that consists of the strongest points of each separate model.

47

Homology modelling and spectroscopy 3

Unsolved problems and future directions in homology modeling While many scientific disciplines face huge difficulties when trying to experimentally validate theoretical predictions, protein modeling is in a fortunate situation: since 1994, the biennial CASP ('Critical Assessment of Structure Prediction') contests58 provides an ideal opportunity to evaluate the accuracy of today's many protein structure prediction methods. During each CASP season (lasting about four months once every two years), about 200 research groups try to predict the structures of ~100 proteins, the CASP targets. The target sequences are provided to CASP by structural biology labs just before the corresponding structures are solved. The predictions are thus true blind predictions, allowing to measure performance in realistic test cases, locating areas of progress as well as yet unsolved problems. CASP regularly shows that the eight homology modeling steps summarized here allow in many cases building reliable models, from which a lot of structural and functional insights can be derived. However, these eight steps are unfortunately not sufficient to actually solve the protein structure prediction problem via homology modeling as soon as enough templates become available. Figure 3.3 shows CASP8 targets T0498 and T0499: both proteins are 56 amino acids long, 53 of which are conserved (95% sequence identity). Still, the two structures are entirely different; just three point mutations completely change the fold. While this is an extreme example of human protein engineering art59, also naturally occurring proteins with similar sequences often show surprising structural diversity60, letting classic homology modeling fail miserably. The prion protein61 and other amyloid-forming proteins provide an even more dramatic case; here 100% identical sequences can exist in two totally different structures. Obviously, the homology modeling problem is tightly intertwined with the more general protein folding problem itself. Even if a close template is available, there can always be structurally diverging regions, which are either expected from the poor local alignment, or unexpectedly caused by critical point mutations, or widely differing crystal packing contacts.

Figure 3.3: Comparison of CASP8 targets T0498 and T0499. The sequences of both proteins are 95% (53 of 56) identical (only residues 20, 30 and 45 differ), yet the structures are totally different. Classic homology modeling predicted T0499 correctly (which looks like the related homology modeling templates in the PDB), but failed completely for T0498. Since the structures of T0498 and T0499 have not been released yet, this figure is based on a closely related pair with PDB IDs 2jws and 2jwu from the same authors, who showed by NMR spectroscopy that these two structures look essentially the same as T0498 and T049959.

48

Homology modelling and spectroscopy 3

The only way to handle these difficult cases is to apply more general ab initio folding algorithms, which do not depend on template structures, but try to simulate the complete folding process from the stretched-out conformation. As it turns out, this 'one-algorithm-for-everything' approach is the currently most successful one at CASP45; 46: if available, it uses known templates (or fragments thereof) only to guide the search, but does not depend on them. As a side-effect, this allows to build hybrid-models, combining the best parts from multiple templates. Despite these encouraging developments, the protein folding problem is far from solved. The best models are still built by those who got the alignment right in the first place, which unfortunately implies that structural diversity is often missed: one cannot yet ignore the difficult-to- align regions and simply predict them with ab initio folding instead. The sequence alignment problem thus remains an active research field for years to come.

Noteworthy progress has been made with model optimization to bridge the structural gap between initial model and target. While in the early days of CASP, predictors were well advised to keep the backbone of their model fixed (the 'frozen core approach'), simply because the danger of messing up the model was just too large, the situation is quite different today: force field accuracy55 and sampling efficiency56 have improved to a level that allows well performing methods like Modeller- CSA62, Rosetta46, WHAT IF63, and YASARA64 to free all atoms during the refinement, often moving models considerably closer to the target. While homology modeling currently focuses on the protein in a model, other entities, i.e. carbohydrates, small molecules, and ions, also make up important parts of certain proteins and protein complexes. For instance, zinc atoms in so-called zinc fingers are important for the stability of the protein, and a common protein like haemoglobin would be useless without its haem groups and the iron atoms therein. Carbohydrates in glycoproteins perform numerous functions, ranging from providing stability to signalling and labeling for intracellular transport65. The many roles of non- protein entities make it obvious that homology modeling should look beyond the protein. A complete model should thus be more than a 3D representation of an amino acid sequence. One major challenge for homology modeling is recognising binding sites for non-protein entities. Drug docking software66; 67 can be used to detect the binding sites of compounds such as heme groups or coenzymes. However, relevant biological information is needed to select compounds that may be bound to the protein. Copying the binding site from the template structure is the simplest method, but does not work for ab initio folding models. For such models, spectroscopic analysis of the protein can provide insight on which compounds are bound. This approach is not limited to homology modeling; X-ray crystallography can also benefit from spectroscopic analysis of a protein to identify a bound compound68. Incorporating ions can be an additional step of the modeling process. Nayal and Di Cera69 have suggested a method to detect sodium binding sites in protein structures which can be extended to detect various other ion binding sites. Of course, any additional experimental data can guide this ion site detection process. Especially tightly bound functional ions that co-purify with the protein can be detected by means of spectroscopic analysis. A significant number of PDB files have + + bound ions or water molecules that were erroneously assigned. We have observed H20, and Na , K , 2+ 2+ + Mg , Ca , and NH4 ions that should actually be one of the others in this list. This is the result of X- + + 2+ ray spectroscopy having difficulties distinguishing between H2O, NH4 , Na and Mg because these entities have equally many electrons, and so do K+ and Ca2+. The power of force-field based model optimisation methods can be reduced significantly when such problems include a difference in the ionic charge. It is therefore very important to (experimentally) validate the ions in template structures when these are important for the final homology model.

49

Homology modelling and spectroscopy 3

Carbohydrates can be modeled at the final stage of the homology modeling process, but this does not always reflect the protein folding process. Carbohydrates are not only added in post- translational modification (that is, when the protein is ‘done’), but also during the protein expression by the ribosome. They are important in the protein folding process and the detection of misfolded proteins70. It may therefore prove interesting to add the necessary carbohydrates to the unfolded protein before ab initio folding. Apart from their role in protein folding, carbohydrates are sometimes important in oligomerisation of proteins. For instance, the neuraminidase protein from influenza shows different glycosylation states in its monomeric, dimeric, and tetrameric states. The carbohydrates in tetrameric state provide extra stability (see figure 3.4) and, in the case of the Spanish flu influenza virus, resistance to trypsin digestion leading to increased virulence71. This shows the vital (and sometimes lethal) importance of considering carbohydrates in homology models.

Figure 3.4: Tetrameric form of whale influenza neuraminidase (PDBid 2b8h72), coloured by monomer. Protein chains are displayed in ribbon representation, carbohydrate atoms in ball representation. The carbohydrates of one monomer interact with the adjacent monomer thus stabilising the tetramer.

Results and discussion: other roles for homology modeling In the 1990's most articles that included homology modeling described only how the model of one protein was constructed, and ended with the ominous sentence "...this model will help us perform our research on this intriguing protein", after which the group would start working on something else. Here, we will illustrate the importance of homology modeling for the study of inheritable diseases, but the value of models has been amply illustrated for fields ranging from drug design to laundry powder enzyme engineering, from validating experimental structures to the design of humanized antibodies, and from mutation analysis to intelligent experimental design in many spectroscopic research projects.

In the following examples building the homology model was not the ultimate goal but one of the tools used to gain more information about a mutation, a disease, or a process in the human body. These examples prove that homology models can be of great use in the (bio)medical field.

50

Homology modelling and spectroscopy 3

Modeling of the LRTOMT-COMT domain In a study for non-syndromic deafness four pathogenic mutations were found in a so-far uncharacterized gene which codes for two different proteins, called LRTOMT1 and LRTOMT273. Three of these mutations were missense mutations located in the catechol-O-methyltransferase (COMT, EC 2.1.1.6) domain of LRTOMT2, one introduced a stop codon causing the loss of a large fraction of the protein. The COMT domain catalyzes the transfer of a methylgroup from S-adenosyl- L- (Ado-Met) to a hydroxyl group of catechol. No structure for LRTOMT2 was known, so we needed a homology model to study the missense mutations in more detail. Using the crystal structure of rat COMT (39% identity to LRTOMT2 over 212 amino acids, (PDBid 1h1d74) we were able to model the COMT domain of human LRTOMT2. The three mutated residues are located in helix 1 (Arg81Gln), helix 2 (Trp105Arg) and the loop that follows helix 2 (Glu110Lys), and thus not in the hypothetical substrate-binding pockets. However, the loop is predicted to be important for the groove that binds the putative methyl acceptor. The Arg81 and Glu110 residues are predicted to form a hydrogen bonded salt bridge between helix 1 and the loop, and Trp105 is predicted to make hydrophobic interactions in the core between these helices (see fgure 3.5). These residues may therefore be important for local protein stability and can affect the substrate binding region.

Figure 3.5: A) Molecular model of the catechol-O-methyltransferase domain of LRTOMT2, residues 79–290. The affected residues are depicted in blue. The predicted ligands are coloured yellow, and the tyrosine residue (Tyr111) that lines the hydrophobic groove of the ligand binding site is shown in cyan. The boxed region containing residues affected by missense mutations of LRTOMT2 is enlarged in panels B–D. The helices 1 (H1) and 2 (H2) are shown with wild-type residues Arg81, Trp105 and Glu110 depicted in blue and mutated residues in green. Hydrogen bonds are represented by yellow dotted lines. B) The Arg81 and Glu110 residues form a salt bridge between helix 1 and the loop following helix 2. The Gln81 residue cannot form this salt bridge as it is not positively charged. Also, the formation of hydrogen bonds is impaired because of the smaller size of as compared to arginine. C) The Trp105 residue is predicted to make hydrophobic interactions as a result of its big side chain. Most of these interactions would be lost by the Trp105Arg substitution. D) Substitution Glu110Lys is predicted to lead to the loss of hydrogen bonds and a salt bridge. There would likely be repulsion between the side chains Lys110 and Arg81 as both are positively charged.

51

Homology modelling and spectroscopy 3

Modeling of the ligand binding domain of ESRRB Sequence analysis revealed four missense mutations of the ESRRB gene leading to autosomal- recessive non-syndromic hearing impairment75. Experimental results indicated that ESRRB is essential for inner-ear development and function. ESRRB encodes the estrogen-related receptor protein beta, a member of the nuclear hormone receptor (NHR) family. In general, members of this family share a zinc finger C4 DNA-binding domain at their N-terminus and a ligand-binding domain that is located near the C-terminus. The ligand-binding domain of nuclear hormone receptors is a well conserved and highly organized structure containing 12 alpha-helices. Three mutations (Leu320Pro, Val342Leu, and Leu347Pro) were located within this ligand-binding domain. The fourth mutation was located in the DNA binding domain and will not be discussed here. To study the three mutations in the ligand binding domain in more detail we built a homology model of this domain using the structure of the ESRRG receptor (PDBid 1kv676) as a template. It had 79% sequence identity over 229 amino acids. The molecular model showed that the three missense mutations in the ligand binding domain are likely to affect the structure and stability (see figure 3.6). Two of the mutations involved a to proline mutation, Leu320Pro in helix 7 and Leu347Pro in helix 8. In general, the introduction of proline residues within helices reduces the stability of the helix, and therefore, these mutations will disturb the structure of the helices and probably the complete ligand-binding domain. In addition, the loss of the leucine side chain abrogates a number of hydrophobic interactions. The other mutation in this domain (Val342Leu) substitutes a leucine for a valine residue resulting in the occurrence of a somewhat larger side chain that bumps into the molecular surface of helix one. This substitution is predicted to reduce the strength of the interaction between helix 1 and 8.

Figure 3.6: A) Molecular modeling of the ligand-binding domain of the human ESRRB protein. The structure was deduced from the known ESRRB structure. The various helices are presented by cylinders. The three amino acids that are affected by the missense mutations are indicated in green. Detailed view of the three mutations are shown in panels B, C and D. The wildtype residue is depicted in green, whereas the side chain of mutant residue is presented in red. B) Leu320 makes hydrophobic contacts in the core between two helices, the mutation into Pro causes loss of many hydrophobic interactions. Additionally, the Pro will disturb the structure of the helix. C) Val342 is tightly packed in the core between two helices. This core will be disturbed by the Val342Leu mutation because Leu has a larger side chain. D) Leu347Pro will cause loss of hydrophobic interactions between the helices and the introduction of Pro will disturb the structure of the helix.

52

Homology modelling and spectroscopy 3

In summary, the molecular modeling data predicts that mutations of ESRRB will result in conformational changes near the substituted amino acids or decreased helix stability and are therefore likely to affect the stability and function of the complete ligand-binding domain.

Functional states of alcohol oxidase Van der Klei and co-workers studied four different conformational states of the flavoenzyme alcohol oxidase (AO) from the methylotrophic yeasts Hansenula polymorpha and Pichia pastoris, including assembly intermediates77. These proteins had to be homology modeled from the enzyme glucose oxidase that shows only 25% sequence identity with the AOs. With so little similarity homology modeling is very difficult and large modeling errors are to be expected. An additional problem is that glucose oxidase is a dimer while AO is an octamer. The low quality of the model certainly precluded any form of protein-protein docking to construct an octamer from the dimer. The model fortunately revealed a series of hydrophobic surface patches, some of which have residues at the surface. As there were also observed in the model near the FAD group, the suggestion to apply spectroscopic techniques to extend the modeling study came shouting from figure 3.7. A series of spectroscopic techniques, including time resolved fluorescence, fluorescence anisotropy decay, steady-state fluorescence, and visible and near-UV circular dichroism, was used to characterize native AO several putative assembly intermediates. A good working hypothesis for the AO folding pathway could be derived. The study also triggered the search for chaperones that seemed needed to allow FAD to bind to AO in vivo.

Figure 3.7: Model of one monomer of AO shown as ribbon; helix in blue, strand in red, loop and turn in shades of green. All tryptophan side chains are shown in yellow, the FAD group is shown in purple.

Conclusion Homology modeling will always be needed because it is impossible to solve the three dimensional structure for each determined sequence. An increasing number of scientists are now using protein structures for mutational analysis and experimental design. It is a good thing that the process of homology modeling has improved dramatically over the years, but still there are many problems to solve. It is a reassuring fact for homology modellers that modeling problems often can be solved using spectroscopic techniques. Spectroscopists, on the other hand normally need homology models to fully harvest the results from their spectroscopic studies. And that brings us back to the title of this article, homology modeling and spectroscopy indeed are a never ending love story.

53

Homology modelling and spectroscopy 3

Acknowledgements This work was part of the BioRange programme of the Netherlands Bioinformatics Centre (NBIC), which is supported by a BSIK grant through the Netherlands Genomics Initiative (NGI). TIPharma is thanked for support. This work was funded in part by the EU project EMBRACE Grid which is funded by the European Commission within its FP6 Programme, under the thematic area "Life sciences, genomics and biotechnology for health", contract number LUNG-CT-2004-512092. This work was also supported by the Biosapiens Network of Excellence project. The BioSapiens project is funded by the European Commission within its FP6 Programme, under the thematic area ‘Life sciences, genomics and biotechnology for health’, contract number LSHG-CT-2003-503265. GV wishes to thank MA Hemminga for being a warm-hearted friend and a great PhD supervisor.

References

1. Greer, J. (1980). Model for haptoglobin heavy chain based upon structural homology. Proc Natl Acad Sci U S A 77, 3393-3397. 2. Palczewski, K.K., T. Hori, T. Hehnke, C.A. Motoshima, H. Fox, B.A. Trong, I.L. Teller, D.C. Okada, T. Stenkamp, R.E. Yamamoto, M. Miyano, M. (2000). Crystal structure of rhodopsin: a G protein-coupled receptor. Science, 739-745. 3. Oliveira, L., Hulsen, T., Lutje Hulsik, D., Paiva, A.C., and Vriend, G. (2004). Heavier-than-air flying machines are impossible. FEBS Lett 564, 269-273. 4. Orry, A.J., and Wallace, B.A. (2000). Modeling and docking the endothelin G-protein-coupled receptor. Biophys J 79, 3083-3094. 5. Cherezov, V., Rosenbaum, D.M., Hanson, M.A., Rasmussen, S.G., Thian, F.S., Kobilka, T.S., Choi, H.J., Kuhn, P., Weis, W.I., Kobilka, B.K., et al. (2007). High-resolution crystal structure of an engineered human beta2- adrenergic G protein-coupled receptor. Science 318, 1258-1265. 6. Davison, M.D., and Findlay, J.B. (1986). Modification of ovine opsin with the photosensitive hydrophobic probe 1-azido-4-[125I]iodobenzene. Labelling of the chromophore-attachment domain. Biochem J 234, 413-420. 7. Davison, M.D., and Findlay, J.B. (1986). Identification of the sites in opsin modified by photoactivated azido[125I]iodobenzene. Biochem J 236, 389-395. 8. Altenbach, C., Cai, K., Khorana, H.G., and Hubbell, W.L. (1999). Structural features and light-dependent changes in the sequence 306-322 extending from helix VII to the palmitoylation sites in rhodopsin: a site-directed spin-labeling study. Biochemistry 38, 7931-7937. 9. Cai, K., Klein-Seetharaman, J., Farrens, D., Zhang, C., Altenbach, C., Hubbell, W.L., and Khorana, H.G. (1999). Single-cysteine substitution mutants at amino acid positions 306-321 in rhodopsin, the sequence between the cytoplasmic end of helix VII and the palmitoylation sites: sulfhydryl reactivity and transducin activation reveal a tertiary structure. Biochemistry 38, 7925-7930. 10. Oliveira, L., Paiva, A.C., and Vriend, G. (1999). A low resolution model for the interaction of G proteins with G protein-coupled receptors. Protein Eng 12, 1087-1095. 11. Turcatti, G., Nemeth, K., Edgerton, M.D., Meseth, U., Talabot, F., Peitsch, M., Knowles, J., Vogel, H., and Chollet, A. (1996). Probing the structure and function of the tachykinin neurokinin-2 receptor through biosynthetic incorporation of fluorescent amino acids at specific sites. J Biol Chem 271, 19991-19998. 12. Harikumar, K.G., Pinon, D.I., Wessels, W.S., Prendergast, F.G., and Miller, L.J. (2002). Environment and mobility of a series of fluorescent reporters at the amino terminus of structurally related peptide agonists and antagonists bound to the cholecystokinin receptor. J Biol Chem 277, 18552-18560. 13. Harikumar, K.G., Lam, P.C., Dong, M., Sexton, P.M., Abagyan, R., and Miller, L.J. (2007). Fluorescence resonance energy transfer analysis of secretin docking to its receptor: mapping distances between residues distributed throughout the ligand pharmacophore and distinct receptor residues. J Biol Chem 282, 32834-32843. 14. Beck, M., Sakmar, T.P., and Siebert, F. (1998). Spectroscopic evidence for interaction between transmembrane helices 3 and 5 in rhodopsin. Biochemistry 37, 7630-7639. 15. Mascarenhas, Y.P., Stouten, P.F., Beltran, J.R., Laure, C.J., and Vriend, G. (1992). Structure-function relationship for the highly toxic crotoxin from Crotalus durissus terrificus. Eur Biophys J 21, 199-205.

54

Homology modelling and spectroscopy 3

16. Comoletti, D., Grishaev, A., Whitten, A.E., Tsigelny, I., Taylor, P., and Trewhella, J. (2007). Synaptic arrangement of the neuroligin/beta-neurexin complex revealed by X-ray and neutron scattering. Structure 15, 693-705. 17. Marchot, P., Ravelli, R.B., Raves, M.L., Bourne, Y., Vellom, D.C., Kanter, J., Camp, S., Sussman, J.L., and Taylor, P. (1996). Soluble monomeric acetylcholinesterase from mouse: expression, purification, and crystallization in complex with fasciculin. Protein Sci 5, 672-679. 18. Turcatti, G., Vogel, H., and Chollet, A. (1995). Probing the binding domain of the NK2 receptor with fluorescent ligands: evidence that heptapeptide agonists and antagonists bind differently. Biochemistry 34, 3972-3980. 19. Vos, W.L., Nazarov, P.V., Koehorst, R.B., Spruijt, R.B., and Hemminga, M.A. (2009). From 'I' to 'L' and back again: the odyssey of membrane-bound M13 protein. Trends Biochem Sci 34, 249-255. 20. Browne, W.J., North, A.C., Phillips, D.C., Brew, K., Vanaman, T.C., and Hill, R.L. (1969). A possible three- dimensional structure of bovine alpha-lactalbumin based on that of hen's egg-white lysozyme. J Mol Biol 42, 65-86. 21. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990). Basic local alignment search tool. J Mol Biol 215, 403-410. 22. Berman, H., Henrick, K., and Nakamura, H. (2003). Announcing the worldwide Protein Data Bank. Nat Struct Biol 10, 980. 23. Sander, C., and Schneider, R. (1991). Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 9, 56-68. 24. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389-3402. 25. Jones, D.T. (1999). GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J Mol Biol 287, 797-815. 26. Brändén, C.-I.J., T.A. (1990). Between objectivity and subjectivity. Nature, 687-689. 27. Brunger, A.T. (1992). Free R value: a novel statistical quantity for assessing the accuracy of crystal structures. Nature 355, 472-475. 28. Jones, T.A., Zou, J.Y., Cowan, S.W., and Kjeldgaard, M. (1991). Improved methods for building protein models in electron density maps and the location of errors in these models. Acta Crystallogr A 47 ( Pt 2), 110-119. 29. Kleywegt, G.J., Harris, M.R., Zou, J.Y., Taylor, T.C., Wahlby, A., and Jones, T.A. (2004). The Uppsala Electron- Density Server. Acta Crystallogr D Biol Crystallogr 60, 2240-2249. 30. Hooft, R.W., Sander, C., and Vriend, G. (1997). Objectively judging the quality of a protein structure from a Ramachandran plot. Comput Appl Biosci 13, 425-430. 31. Laskowski, R.A.M., M.W. Moss, D.S. Thornton, J. (1993). PROCHECK: a program to check the stereochemical quality of protein structure. J Appl Cryst 26, 283-291. 32. Hooft, R.W., Vriend, G., Sander, C., and Abola, E.E. (1996). Errors in protein structures. Nature 381, 272. 33. Vriend, G.S., C. (1993). Quality control of protein models: directional atomic contact analysis. J Appl Cryst 26, 47-60. 34. Hooft, R.W., Sander, C., and Vriend, G. (1996). Positioning hydrogen atoms by optimizing hydrogen-bond networks in protein structures. Proteins 26, 363-376. 35. Joosten, R.P., Womack, T., Vriend, G., and Bricogne, G. (2009). Re-refinement from deposited X-ray data can deliver improved models for most PDB entries. Acta Crystallogr D Biol Crystallogr 65, 176-185. 36. Horn, F., Bettler, E., Oliveira, L., Campagne, F., Cohen, F.E., and Vriend, G. (2003). GPCRDB information system for G protein-coupled receptors. Nucleic Acids Res 31, 294-297. 37. Oliveira, L.P., A.C. Vriend, G. (1993). A common motif in G protein-coupled seven transmembrane helix receptors. J Comp Aided Mol Des 7, 649–658. 38. Thompson, J.D., Higgins, D.G., and Gibson, T.J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22, 4673-4680. 39. Taylor, W.R. (1986). Identification of protein sequence homology by consensus template alignment. J Mol Biol 188, 233-258. 40. Dodge, C., Schneider, R., and Sander, C. (1998). The HSSP database of protein structure-sequence alignments and family profiles. Nucleic Acids Res 26, 313-315. 41. Edgar, R.C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32, 1792-1797.

55

Homology modelling and spectroscopy 3

42. Notredame, C., Higgins, D.G., and Heringa, J. (2000). T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 302, 205-217. 43. Qiu, J., and Elber, R. (2006). SSALN: an alignment algorithm using structure-dependent substitution matrices and gap penalties learned from structurally aligned protein pairs. Proteins 62, 881-891. 44. Arnold, K., Bordoli, L., Kopp, J., and Schwede, T. (2006). The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling. Bioinformatics 22, 195-201. 45. Pandit, S.B., Zhang, Y., and Skolnick, J. (2006). TASSER-Lite: an automated tool for protein comparative modeling. Biophys J 91, 4180-4190. 46. Chivian, D., Kim, D.E., Malmstrom, L., Bradley, P., Robertson, T., Murphy, P., Strauss, C.E., Bonneau, R., Rohl, C.A., and Baker, D. (2003). Automated prediction of CASP-5 structures using the Robetta server. Proteins 53 Suppl 6, 524-533. 47. Sali, A., and Blundell, T.L. (1993). Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234, 779-815. 48. Michalsky, E., Goede, A., and Preissner, R. (2003). Loops In Proteins (LIP)--a comprehensive loop database for homology modelling. Protein Eng 16, 979-985. 49. Xiang, Z., Soto, C.S., and Honig, B. (2002). Evaluating conformational free energies: the colony energy and its application to the problem of loop prediction. Proc Natl Acad Sci U S A 99, 7432-7437. 50. Canutescu, A.A., and Dunbrack, R.L., Jr. (2003). Cyclic coordinate descent: A robotics algorithm for protein loop closure. Protein Sci 12, 963-972. 51. Dunbrack, R.L., Jr., and Karplus, M. (1993). Backbone-dependent rotamer library for proteins. Application to side-chain prediction. J Mol Biol 230, 543-574. 52. Chinea, G., Padron, G., Hooft, R.W., Sander, C., and Vriend, G. (1995). The use of position-specific rotamers in model building by homology. Proteins 23, 415-421. 53. Lovell, S.C., Word, J.M., Richardson, J.S., and Richardson, D.C. (2000). The penultimate rotamer library. Proteins 40, 389-408. 54. Canutescu, A.A., Shelenkov, A.A., and Dunbrack, R.L., Jr. (2003). A graph-theory algorithm for rapid protein side-chain prediction. Protein Sci 12, 2001-2014. 55. Krieger, E., Darden, T., Nabuurs, S.B., Finkelstein, A., and Vriend, G. (2004). Making optimal use of empirical energy functions: force-field parameterization in crystal space. Proteins 57, 678-683. 56. Misura, K.M., Chivian, D., Rohl, C.A., Kim, D.E., and Baker, D. (2006). Physically realistic homology models built with ROSETTA can be more accurate than their templates. Proc Natl Acad Sci U S A 103, 5361- 5366. 57. Brunger, A.T., Adams, P.D., Clore, G.M., DeLano, W.L., Gros, P., Grosse-Kunstleve, R.W., Jiang, J.S., Kuszewski, J., Nilges, M., Pannu, N.S., et al. (1998). Crystallography & NMR system: A new software suite for macromolecular structure determination. Acta Crystallogr D Biol Crystallogr 54, 905-921. 58. Moult, J., Fidelis, K., Kryshtafovych, A., Rost, B., Hubbard, T., and Tramontano, A. (2007). Critical assessment of methods of protein structure prediction-Round VII. Proteins 69 Suppl 8, 3-9. 59. He, Y., Chen, Y., Alexander, P., Bryan, P.N., and Orban, J. (2008). NMR structures of two designed proteins with high sequence identity but different fold and function. Proc Natl Acad Sci U S A 105, 14412- 14417. 60. Kosloff, M., and Kolodny, R. (2008). Sequence-similar, structure-dissimilar protein pairs in the PDB. Proteins 71, 891-902. 61. Prusiner, S.B. (1998). Prions. Proc Natl Acad Sci U S A 95, 13363-13383. 62. Joo, K., Lee, J., Seo, J.H., Lee, K., and Kim, B.G. (2009). All-atom chain-building by optimizing MODELLER energy function using conformational space annealing. Proteins 75, 1010-1023. 63. Vriend, G. (1990). WHAT IF: a molecular modeling and drug design program. J Mol Graph 8, 52-56, 29. 64. Krieger, E., Joo, K., Lee, J., Raman, S., Thompson, J., Tyka, M., Baker, D., and Karplus, K. (2009). Improving physical realism, stereochemistry, and side-chain accuracy in homology modeling: Four approaches that performed well in CASP8. Proteins 77 Suppl 9, 114-122. 65. Lutteke, T. (2009). Analysis and validation of carbohydrate three-dimensional structures. Acta Crystallogr D Biol Crystallogr 65, 156-168. 66. Rarey, M., Kramer, B., Lengauer, T., and Klebe, G. (1996). A fast flexible docking method using an incremental construction algorithm. J Mol Biol 261, 470-489. 67. Nabuurs, S.B., Wagener, M., and de Vlieg, J. (2007). A flexible approach to induced fit docking. J Med Chem 50, 6507-6518. 68. Chen, X., Schauder, S., Potier, N., Van Dorsselaer, A., Pelczer, I., Bassler, B.L., and Hughson, F.M. (2002). Structural identification of a bacterial quorum-sensing signal containing boron. Nature 415, 545-549.

56

Homology modelling and spectroscopy 3

69. Nayal, M., and Di Cera, E. (1996). Valence screening of water in protein crystals reveals potential Na+ binding sites. J Mol Biol 256, 228-234. 70. Parodi, A.J. (2000). Role of N-oligosaccharide endoplasmic reticulum processing reactions in glycoprotein folding and degradation. Biochem J 348 Pt 1, 1-13. 71. Wu, Z.L., Ethen, C., Hickey, G.E., and Jiang, W. (2009). Active 1918 pandemic flu viral neuraminidase has distinct N-glycan profile and is resistant to trypsin digestion. Biochem Biophys Res Commun 379, 749- 753. 72. Smith, B.J., Huyton, T., Joosten, R.P., McKimm-Breschkin, J.L., Zhang, J.G., Luo, C.S., Lou, M.Z., Labrou, N.E., and Garrett, T.P. (2006). Structure of a calcium-deficient form of influenza virus neuraminidase: implications for substrate binding. Acta Crystallogr D Biol Crystallogr 62, 947-952. 73. Ahmed, Z.M., Masmoudi, S., Kalay, E., Belyantseva, I.A., Mosrati, M.A., Collin, R.W., Riazuddin, S., Hmani- Aifa, M., Venselaar, H., Kawar, M.N., et al. (2008). Mutations of LRTOMT, a fusion gene with alternative reading frames, cause nonsyndromic deafness in humans. Nat Genet 40, 1335-1340. 74. Bonifacio, M.J., Archer, M., Rodrigues, M.L., Matias, P.M., Learmonth, D.A., Carrondo, M.A., and Soares-Da- Silva, P. (2002). Kinetics and crystal structure of catechol-o-methyltransferase complex with co- substrate and a novel inhibitor with potential therapeutic application. Mol Pharmacol 62, 795-805. 75. Collin, R.W., Kalay, E., Tariq, M., Peters, T., van der Zwaag, B., Venselaar, H., Oostrik, J., Lee, K., Ahmed, Z.M., Caylan, R., et al. (2008). Mutations of ESRRB encoding estrogen-related receptor beta cause autosomal-recessive nonsyndromic hearing impairment DFNB35. Am J Hum Genet 82, 125-138. 76. Greschik, H., Wurtz, J.M., Sanglier, S., Bourguet, W., van Dorsselaer, A., Moras, D., and Renaud, J.P. (2002). Structural and functional evidence for ligand-independent transcriptional activation by the estrogen- related receptor 3. Mol Cell 9, 303-313. 77. Boteva, R., Visser, A.J., Filippi, B., Vriend, G., Veenhuis, M., and van der Klei, I.J. (1999). Conformational transitions accompanying oligomerization of yeast alcohol oxidase, a peroxisomal flavoenzyme. Biochemistry 38, 5034-5044.

57

Homology modelling and spectroscopy 3

58

HOPE: An e-Science approach with life scientist friendly interfaces 4

Previous page:

Green Fluorescent Protein (GFP) –This small protein exhibits green fluorescence when exposed to blue light. Ever since it was first isolated from jellyfish it’s been used in life science experiments as a reporter of expression. (PDB file 1q4a)

60

HOPE: An e-Science approach with life scientist friendly interfaces 4

Background The omics-revolution has led to a rapid increase in detected disease-related human mutations. A considerable fraction of these mutations is located in protein-coding regions of the genome and thus can affect the structure and function of that protein, thereby causing a phenotypic effect. Knowledge of these structural and functional effects can aid the design of further experiments and can eventually lead to the development of better disease diagnostics or even medicines to help cure patients. The analysis of mutations that cause the EEC syndrome, for example, revealed that some patients carry a mutation that disturbs dimerisation of the affected P63 protein1. This information has triggered a search for drugs2. In another case, the study of a mutation in the human hemochromatosis protein (HFE), which causes hereditary hemochromatosis, resulted in new insights that are now being used to develop novel diagnostic methods3. These and numerous other examples have highlighted the importance of using heterogeneous data, especially structure information, in the study of human disease-linked protein variants. The data that can aid our understanding of the underlying mechanism of disease related mutations can range from the protein's 3D structure to its role in biological pathways, or from information generated by mutagenesis experiments to predicted functional motifs. Collecting all available information related to the protein of interest can be challenging and time-consuming. It is a difficult task to extract exactly those pieces of information that can lead to a conclusion about the effects of a mutation. Several online Web servers exist that offer help to the (bio)medical researcher in predicting these effects. These servers use information from a wide range of sources to reach conclusions about the pathogenicity of a mutation. The PolyPhen server, for example, is widely used by researchers to predict the possible impact of an amino acid substitution on the structure and function of human proteins4. PolyPhen combines a subset of the UniProt sequence features, structural information (when available), and multiple sequence alignments in order to draw conclusions about the impact of a mutation4. SIFT, on the other hand, bases its mutation analysis purely on a multiple sequence alignment5. This server gives probability scores for each amino acid type at the position of interest to separate the harmless mutations from disease-causing ones. The ALAMUT software (http://www.interactive-biosoftware.com) is widely used in human genetics research groups. It focuses on making many forms of software and databases available to their users. The ALAMUT system also automatically calls the PolyPhen Web server as part of its decision process. ALAMUT is not available as a Web server. PolyPhen, SIFT and ALAMUT all have an excellent track-record make existing data accessible for (bio)medical scientist to aid them with the interpretation of mutational effects. We built on their strengths to produce the HOPE software that was written to optimally use the advantages of the novel tools of the e-Science era. The recent increase in data types and data volumes has gone hand-in-hand with large efforts in bioinformatics that have led to numerous new databases and computational methods, and in this era of e-Science, Web services provide on-demand access to these facilities6-8. The development of Web services facilitates the usage of external databases and methods in in-house developed software and eases software maintenance and development by out-sourcing logic to Web services. Web services have a series of advantages for the software developers: • They save time by reusing program code; • They tend to always be up-to-date; • They are executed remotely, which gives access to large amounts of (free) CPU time, thus not overloading the local machine; • No need to maintain in-house data and software collections.

Web services also have disadvantages: • Source code of Web services often is not available; • Web services are not guaranteed to always be available.

61

HOPE: An e-Science approach with life scientist friendly interfaces 4

HOPE (Have (y)Our Protein Explained) is a next-generation web application for automatic mutant analysis. We have designed HOPE to explain the molecular origin of a disease related phenotype caused by mutations in human proteins. In this aspect HOPE resembles the aforementioned systems (PolyPhen, SIFT, ALAMUT). With HOPE we have taken the logical next step in the e-Science era in that the data gathering is done using Web services and DAS servers. Additionally, in HOPE we have taken a protein 3D-structure centred approach. HOPE collects information from data sources such as the protein's 3D-structure and the UniProt database of well-annotated protein sequences. For each protein this data is stored in a PostgreSQL-based information system. A decision scheme is used to process these data and to predict the effects of the mutation on the 3D-structure and the function of the protein. A life-scientist friendly report is produced that explains and illustrates the effects of the mutation. This report is presented using an interface that is designed specifically for the intended user community of human genetics researchers. The report is enriched with figures that illustrate the effects of the mutation, while any residual bioinformatics jargon is linked to our in-house, online dictionary. The conclusions drawn in the report can be used to design follow-up experiments and eventually can lead to the development of better diagnostics or medicines. Figure 4.1 illustrates the major steps of HOPE. We have tested HOPE on a series of mutations that we have previously analyzed manually. In all straightforward cases HOPE performed equally well as a trained protein structure bioinformatician.

Figure 4.1: Overview of HOPE's process flow. The user submits a sequence and a mutation. HOPE will first collect information from a wide range of information sources. These sources include: WHAT IF for structural calculations on either the PDB file or a homology model that was build by YASARA, HSSP for conservation scores, DAS-servers for sequence-based predictions and Uniprot for sequence annotations. The data is stored in HOPE's information system. The data is combined with the known properties of the amino acids in a decision schedule. The result is a report shown on the HOPE website that will focus on the effect of the submitted mutation on the 3D-structure of the protein. The text and figures can be used in articles and publications. Availability: The HOPE Web server is freely available on http://www.cmbi.ru.nl/hope/

62

HOPE: An e-Science approach with life scientist friendly interfaces 4

Results and discussion Input The intended users of HOPE are life scientists who neither routinely use protein structures nor bioinformatics in their research. Therefore, both HOPE's input and its results are designed to be intuitive and simple, and all software used will run with default settings so that the user neither needs to set parameters nor needs to read documentation. Actually, the user will not even know which software runs in the background. The interface of HOPE is a website that enables the user to submit a sequence and a mutation. The user can indicate the mutated residue and the new residue type by simple mouse-clicks. Figure 4.2 shows the input screen, filled with an example protein sequence and a mutation.

Figure 4.2: HOPE's input screen. The user can submit a sequence of interest and indicate the mutated residue with two simple mouse-clicks. In this example HOPE will analyze a Leu->Pro mutation on position 25 of the protein Crambin.

63

HOPE: An e-Science approach with life scientist friendly interfaces 4

Information retrieval HOPE uses the submitted sequence as query for BLAST9 searches against both the UniProt database10 and the Protein Data Bank (PDB)11. The search against the UniProt database identifies the protein's UniProt entry and the accession code of the protein, a unique identifier that is used later in the process to obtain DAS-predictions. Alternatively, it is possible to submit this accession code directly. The BLAST search against the PDB is required to find the protein's structure or a possible template for homology modeling. HOPE uses the actual PDB-file when it contains the residue that is to be mutated and when it is 100% identical with the submitted sequence.

HOPE identifies among multiple 100% hits the best structure for analysis based on resolution, experimental method, and length of the protein covered in the PDB (a full protein is preferred over a fragment). Nowadays, 20% of the human sequences available from SwissProt have a (partly) known structure and for another 30% a homology model can be built. To be able to build a homology model, the BLAST results should contain the equivalent location of the mutation and the percentage sequence identity should fall above the Sander and Schneider curve shown in figure 4.3. Homology modeling is performed using the Twinset version of YASARA which contains an automatic homology modeling script that requires only a sequence as input12. The script fully automatically performs the modeling process including sequence alignment, loop building, side chain modeling, and energy minimization. This script was the top contestant in the CASP8 modeling competition in terms of model detail accuracy13.

Figure 4.3: The two zones of sequence alignment identity that indicate the likelihood of adopting similar structures. Two aligned sequences are highly likely to have similar folds if their length and percentage sequence identity fall in the region above the threshold (black line). HOPE will build a homology model when the identity between the template and submitted sequence falls in this zone. In case the sequence identity is less than 5% above this threshold (grey line) HOPE will build a model but will also warn that the model is based on a template with low identity. The region below the threshold (indicated with a cross) indicates the zone where inference of structural similarity cannot be made, thus making it difficult to determine if model building will be possible. (Figure modified after Sander and Schneider14).

64

HOPE: An e-Science approach with life scientist friendly interfaces 4

The structure of the protein of interest, either a PDB-file or a homology model, is analyzed using WHAT IF Web services7. These services can calculate a wide range of structural features (e.g. accessibility, hydrogen bonds, salt bridges, ligand or ion interactions, mutability, variability, etc). When neither a 3D-structure nor a possible modeling template is available, HOPE cannot use structural information and will instead base its conclusions only on the sequence related data, and published mutation and variation results.

The UniProt database (http://www.uniprot.org/) is used for the retrieval of features that can be mapped on the sequence15. This information includes the location of active sites, transmembrane domains, secondary structure, domains, motifs, experimental information, and sequence variants. The UniProt accession code is used to retrieve data from a series of DAS-servers for sequence based predictions such as possible phosphorylation sites. The DAS-servers form a widely used system for biological sequence annotation16. The conservation score of the mutated residue is calculated from a HSSP multiple sequence alignment17.

Data storage in HOPE Information obtained from the protein structure or model, the UniProt record, and the DAS- predictions are stored in a protein-specific information system based on the PostgreSQL database system. One new information system is produced for each submitted protein. Differences in the protein sequence might exist between data sources, for example sequences from UniProt often contain the signal peptide while the sequences stored in the PDB tend to lack these residues. Therefore, sequences obtained from different sources are aligned using ClustalW. This enables us to transfer information to the residue of interest without the need to deal with the residue numbering problem that results from these sequence differences. Protein features are stored in the information system on a per-residue basis, and can have one of the following four data-types: •Contacts: Interaction of the residue with another entity; for example DNA, a metal-ion, a ligand, hydrogen bond, disulfide bond, salt bridge, etc; •Variable features: Type with a value: for example, accessibility or torsion angle; •Fixed features: Labels a residue (or stretch of residues) with a feature without a value. This indicates that the residue is located in a domain or motif (for example a residue can be part of the active site or in a transmembrane region); •Variants: Mutations or other variations in sequence known at this position; for example splice variants, mutagenesis sites, SNPs. After a user request has triggered the generation of an information system for the protein of interest, the system for this protein is kept on disk for one month just in case the same user (or another user for that matter) requests information about other mutations in the same molecule. After one month every system is thrown away to ensure that conclusions are never based on outdated information. So, there does not really exist a HOPE database as all HOPE's data is, in total agreement with e-Science paradigms, scattered over the internet, and is each time combined upon request.

Decision scheme The decision scheme in HOPE uses all collected information combined with known properties of the wild-type and mutated amino acid, such as size, charge, and hydrophobicity, to predict the effect of the mutation on the protein's structure and function. The scheme consists of six parts that each correspond to a paragraph in the output. Each part analyzes the effect of the mutation on one of the following aspects of the residue: •Contacts: Any interaction with other molecules or atoms, like DNA, ligands, metals, etc, but also hydrogen bonds, disulfide bridges, ionic interactions, etc; •Structural domain: Any part of the protein with a specific name (and often function), such as domains, motifs, regions, transmembrane domains, repeats, zinc fingers, etc;

65

HOPE: An e-Science approach with life scientist friendly interfaces 4

•Modifications: Features that do not directly influence the structure of the protein but might influence post-translational processes like phosphorylation. •Variants: Known polymorphisms, mutagenesis sites, splice variants, etc; •Conservation: The relative frequency of an amino acid type at each position taken from a multiple sequence alignment. •Amino acid properties: The differences in the known properties of the wild-type and mutant residue (size, charge, hydrophobicity).

HOPE will produce its conclusions for each of these six aspects separately. For example, a residue can be located in a transmembrane domain and also be important for ligand interaction. HOPE will in this case produce a paragraph about the effect of the mutation on the contacts and a separate paragraph describing the effect of the mutation on the structural location, in this example the transmembrane domain.

Some types of information can be obtained from multiple sources, which are not equally reliable. Experimentally determined features and calculations performed on the 3D coordinates are more likely to be correct than any prediction. For example, transmembrane domains can be predicted by a DAS-server which normally will produce less reliable results than the annotations in UniProt. Therefore, HOPE ranks the information and uses the most accurate source available for its conclusions. WHAT IF calculations are preferred, followed by UniProt annotations, and DAS predictions are used only when neither WHAT IF nor UniProt data are available. In case no information about the mutated residue is found, HOPE will show a conclusion based only on biophysical characteristics between the wild type and mutant amino acid type. The conservation score is obtained either from the HSSP database that holds multiple sequence alignments for all proteins in the PDB, or through the HSSP Web services if a PDB file is not available17.

Output The report focuses on the effect of the mutation on the 3D-structure, and is aimed at a specific audience in the field of (bio)medical science. It shows the methods used and the sources of the combined information. This can either be an analysis of the real structure or homology model, or a prediction based on the sequence. The results of the mutation analyses are illustrated with figures of the amino acids and, if available, figures and animations of the mutation in the structure. The HOPE output is rather extensive and way too large to put in print. In figure 4.4 we just show a small part of one mutation report. A series of examples of HOPE output is available at the ‘about’ section of the HOPE pages. A HOPE result consists of one HTML page that contains all results. This makes it easy for users to print the results, or to make their own Web-page with HOPE results for long-term storage.

Test cases HOPE was validated in a series of collaborations with scientists from different fields of life sciences. Experiences from these real-world examples where used to design and adjust the decision scheme. So far, most mutation studies involved non-sense and missense mutations. Descriptions of these projects can be found at the HOPE website. The resulting reports often contain a molecular explanation of the observed phenotype that can suggest further experiments. The majority of these projects included the building of a homology model as in most cases no 3D-structure of the protein of interest was available.

We also validated HOPE's conclusions by comparing them with the output of PolyPhen and SIFT. Even though it is very difficult to compare the results from PolyPhen, SIFT, and HOPE, we can still draw a few general conclusions, that will be elaborated in the following paragraphs.

66

HOPE: An e-Science approach with life scientist friendly interfaces 4

Figure 4.4: Example of HOPE's output. A simplified example of HOPE's output. A) Explanation of the used method (structure, modeling or predictions) and links to the relevant databases. B) Text and pictures that explain the differences between the wild-type and mutant residue. (Text is left out of this figure for clarity.) C) Paragraph of the report explaining the effect of the mutation on contacts made by the residue, a disulfid bond in this case. It contains a link to the wiki-entry ‘cysteine’ and ‘disulfid bond’. D) Images/animations that show the effect of the mutation on the structure.

Structure adds value The use of a protein's 3D-structure or homology model increases the prediction quality in terms of reliability and detail. The possibility offered by the YASARA software to fully automatically build high quality homology models increases the number of sequences for which HOPE can use structure data. The protein structure, either a PDB-file or a homology model, can reveal information that currently cannot be predicted accurately from sequence alone, such as ionic interactions, ligand-contacts, etc. The value of the extra information that HOPE can extract from a protein's structure or model is illustrated, for example, by the Leu320Pro and Leu347Pro mutations in ESRBB (see the ‘about’ section of the HOPE website). All Web servers correctly predict the effect of these mutations as damaging for the protein. However, HOPE completes the story by an extensive explanation of the disturbing effect of on alpha-helices. In cases for which no 3D-structure data is available, the

67

HOPE: An e-Science approach with life scientist friendly interfaces 4 three Web servers seem to perform similarly albeit that PolyPhen's output often tends to be scarce and a bit cryptic and SIFT's output is limited to conservation scores.

Biomedicist understandable results HOPE's interface was designed especially for users that work in the (bio)medical sciences. Instead of displaying data in the form of detailed tables and numerical values, HOPE writes human readable reports that explain the structural and functional effects of the mutation, and illustrates this with figures and animations. When other Web servers list the effects of a mutation as "Hydrophobicity change at buried site; normed accessibility: 0.00, hydrophobicity change: -2.7". HOPE will instead report that "the mutation introduces a less hydrophobic residue in the core of the protein which can destabilize the structure". Many more examples of HOPE's readable output can be found at the ‘about’ section of the HOPE website. HOPE's comprehensibility is improved by the Help-function that links difficult bioinformatics keywords to our own in-house dictionary based on Wikipedia's software. In this dictionary the user can find text, illustrations, and sometimes a short video-clip that explains the keyword.

Conclusions Upon running 24 test cases, listed on the website, we can conclude that the present version of HOPE is useful and reliable in analysing point mutations. The next generation of HOPE will, however, need to reach a higher level of data integration to address more complicated cases. Some answer might be found only by combining the calculations with literature data and general knowledge of the protein's structure function relations. For example, PolyPhen predicts the Asn255Asp mutation in Kv1.1 (discussed in18) as being benign, while SIFT shows that this residue is 100% conserved. Combination of the conservation information with the fact that this residue is located in the voltage sensor of the channel can result in the hypothesis that the mutation disturbs the channel's voltage sensing mechanism. Such conclusions are still beyond the capabilities of today's Web servers, but the software design of HOPE will one day allow us to introduce the features needed to deal which these more complicated cases. HOPE is an example of the new way of doing data- and software-intensive research in the era of e-Science. Nowadays, the ongoing developments in experimental techniques like high- throughput sequencing will continue to produce large amounts of data and will therefore demand new, further automated approaches towards the analysis of these data. The e-Science approach used will allow us to easily extend HOPE with more Web services, data sources, and DAS predictions when these become available. In the years to come HOPE can be extended with the possibility to analyze double-mutants, to quantitatively score the structural effects of the mutation and thereby provide the possibility to automatically rank candidate mutations that are the result of a sequence project, or to further improve the already user-friendly HOPE user interface.

Methods The HOPE system is schematically shown in figure 4.5. The individual elements of this scheme are described in the remainder of this section. The HOPE website is implemented using the Wicket web framework (http://wicket.apache.org/), which allows us to provide a fluent and responsive user experience. The web application is deployed on the GlassFish web (https://glassfish.dev.java.net/) application container. HOPE obtains information from different sources beyond our control. Therefore, the data gathering is set up as fail-safe as possible to handle service unavailability. Data is cached to speed up the process, to reduce dependencies and to put less strain on external resources. The data-retention time is 30 days, after which time the data is renewed at the moment someone runs an analysis on the same sequence. The database scheme (available at the ‘about’ section of the HOPE pages) is the result of an iterative design process using both Java and Hibernate to manage all data and to create the database tables. The database engine is PostgreSQL version 8.4. The MRS BLAST version 4 Web service (http://mrs.cmbi.ru.nl/mrsws/blast/wsdl) is used for most

68

HOPE: An e-Science approach with life scientist friendly interfaces 4 database searches with an e-value cut-off of 1e-5 and the low-complexity filter switched off19. This Web service is backed by an in-house implementation of the standard BLAST algorithm. ClustalW version 2.0.10 is used for sequence alignments20. ClustalW is also offered as a Web service through MRS (http://mrs.cmbi.ru.nl/mrsws/clustal/wsdl). WHAT IF Web services are used to calculate secondary structure (http://wiws.cmbi.ru.nl/wsdl/) (using DSSP21), accessibility values, structural fits of mutations, contacts with ligands or ions, salt bridges, disulfide bridges, and hydrogen bonds22. These calculations are performed either on the deposited PDB structure, or a homology model. Homology modeling is performed fully automatically using a locally installed WHAT IF & YASARA Twinset12; 13. This installation runs on a separate server, and is controlled through a Perl CGI script.

Sequence annotations are obtained from the UniProt database15 XML records (http://www.uniprot.org/). The obtained information includes sequence features such as active site, motifs, domains, variants and binding sites. Conservation scores are obtained from HSSP using a Web service (http://mrs.cmbi.ru.nl/hsspsoap/wsd). When a PDB deposited structure is available, the pre-calculated HSSP scores maintained at the CMBI are used. In case a homology model is available a DSSP file is generated for the homology model, which in turn is used to create a HSSP file. In case no structure or model is available, a HSSP file is generated using only the user sequence. Distributed Annotation (DAS) servers16; 23 are used to obtain predictions regarding transmembrane regions by Phobius23, accessibilities by PHDacc24, secondary structure by PHDsec24, and phosphorylation sites by NetPhos25.

Figure 4.5: Detailed overview of HOPE's components. HOPE's input consists of the sequence and the mutation. The sequence is used for a BLAST search against the databases. Using the accession code (and PDB-file if available) HOPE can collect information from a series of information sources: WHAT IF calculations on the PDB- file or homology model built by YASARA, annotations in the Uniprot database, HSSP conservation scores and sequence-based predictions by DAS-servers. The information is combined in a decision scheme and a report is generated. This report is illustrated with pictures and animations and difficult keywords are linked to our own online dictionary.

69

HOPE: An e-Science approach with life scientist friendly interfaces 4

The decision scheme is implemented in Groovy, a dynamic language that runs on the Java Virtual Machine. The simple Groovy language enables other users to design their own decision schemes and run a specific version of HOPE for their own purposes. The decision scheme is divided into separate branches targeted towards certain aspects of the mutant analysis, each producing a paragraph or sub-report. The decision scheme logic is separated from the phrases used to compose the report, for a cleaner separation in code and to allow for internationalization.

The HOPE report is presented on a self-contained webpage, allowing the user to save the page without breaking links and images. The user can bookmark the URL to perform the same mutant analysis at a later point in time, incorporating any newly available data. The output web pages are intended to be free from bioinformatics jargon. An online dictionary based on MediaWiki's software (http://www.mediawiki.org/) is used to explain bioinformatics-specific terms. JavaScript is used to link keywords on the webpage to articles present in the local MediaWiki instance. This functionality is available at any time via the omni-present blue help-button. Images and movies in the report are generated using the YASARA & WHAT IF Twinset.

Availability and Requirements The full description of the design and implementation of the HOPE server is available from the "about" section of the HOPE pages. HOPE can be used freely and no licenses are required. The source code has been made open source and can be freely obtained from the HOPE website. HOPE uses Java, Groovy, and PostgreSQL; it has been implemented on a Linux system while care has been taken to avoid system dependencies.

Authors' contributions HV was responsible for the biological implementation and validation of decision tree, and for the Wiki and the GUIs. TB designed and implemented the HOPE pipeline and most major elements; RK implemented the decision tree. MH and GV provided the WHAT IF Web services. HV and GV supervised the project. All authors read and approved the final manuscript.

Acknowledgements The authors thank Elmar Krieger for his continuous support with the invaluable YASARA software. Barbara van Kampen en Wilmar Teunissen provided technical support. RK thanks NBIC for financial support. This work was part of the BioRange programme of the Netherlands Bioinformatics Centre (NBIC), which is supported by a BSIK grant through the Netherlands Genomics Initiative (NGI). GV acknowledges the EMBRACE project that is funded by the European Commission within its FP6 Programme, under the thematic area "Life sciences, genomics and biotechnology for health", contract number LHSG-CT-2004-512092.

70

HOPE: An e-Science approach with life scientist friendly interfaces 4

References

1. Celli, J., Duijf, P., Hamel, B.C., Bamshad, M., Kramer, B., Smits, A.P., Newbury-Ecob, R., Hennekam, R.C., Van Buggenhout, G., van Haeringen, A., et al. (1999). Heterozygous germline mutations in the p53 homolog p63 are the cause of EEC syndrome. Cell 99, 143-153. 2. Bykov, V.J., Issaeva, N., Shilov, A., Hultcrantz, M., Pugacheva, E., Chumakov, P., Bergman, J., Wiman, K.G., and Selivanova, G. (2002). Restoration of the tumor suppressor function to mutant p53 by a low- molecular-weight compound. Nat Med 8, 282-288. 3. Swinkels, D.W., Venselaar, H., Wiegerinck, E.T., Bakker, E., Joosten, I., Jaspers, C.A., Vasmel, W.L., and Breuning, M.H. (2008). A novel (Leu183Pro-)mutation in the HFE-gene co-inherited with the Cys282Tyr mutation in two unrelated Dutch hemochromatosis patients. Blood Cells Mol Dis 40, 334- 338. 4. Ramensky, V., Bork, P., and Sunyaev, S. (2002). Human non-synonymous SNPs: server and survey. Nucleic Acids Res 30, 3894-3900. 5. Ng, P.C., and Henikoff, S. (2003). SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res 31, 3812-3814. 6. Pettifer, S., Thorne, D., McDermott, P., Attwood, T., Baran, J., Bryne, J.C., Hupponen, T., Mowbray, D., and Vriend, G. (2009). An active registry for bioinformatics web services. Bioinformatics 25, 2090-2091. 7. Hekkelman, M.L., Te Beek, T.A., Pettifer, S.R., Thorne, D., Attwood, T.K., and Vriend, G. WIWS: a protein structure bioinformatics Web service collection. Nucleic Acids Res. 8. Bhagat, J., Tanoh, F., Nzuobontane, E., Laurent, T., Orlowski, J., Roos, M., Wolstencroft, K., Aleksejevs, S., Stevens, R., Pettifer, S., et al. BioCatalogue: a universal catalogue of web services for the life sciences. Nucleic Acids Res 38 Suppl, W689-694. 9. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990). Basic local alignment search tool. J Mol Biol 215, 403-410. 10. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res 38, D142-148. 11. Berman, H., Henrick, K., and Nakamura, H. (2003). Announcing the worldwide Protein Data Bank. Nat Struct Biol 10, 980. 12. Krieger, E., Koraimann, G., and Vriend, G. (2002). Increasing the precision of comparative models with YASARA NOVA--a self-parameterizing force field. Proteins 47, 393-402. 13. Krieger, E., Joo, K., Lee, J., Raman, S., Thompson, J., Tyka, M., Baker, D., and Karplus, K. (2009). Improving physical realism, stereochemistry, and side-chain accuracy in homology modeling: Four approaches that performed well in CASP8. Proteins 77 Suppl 9, 114-122. 14. Sander, C., and Schneider, R. (1991). Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 9, 56-68. 15. Jain, E., Bairoch, A., Duvaud, S., Phan, I., Redaschi, N., Suzek, B.E., Martin, M.J., McGarvey, P., and Gasteiger, E. (2009). Infrastructure for the life sciences: design and implementation of the UniProt website. BMC Bioinformatics 10, 136. 16. Prlic, A., Down, T.A., Kulesha, E., Finn, R.D., Kahari, A., and Hubbard, T.J. (2007). Integrating sequence and structural biology with DAS. BMC Bioinformatics 8, 333. 17. Dodge, C., Schneider, R., and Sander, C. (1998). The HSSP database of protein structure-sequence alignments and family profiles. Nucleic Acids Res 26, 313-315. 18. van der Wijst, J., Glaudemans, B., Venselaar, H., Nair, A.V., Forst, A.L., Hoenderop, J.G., and Bindels, R.J. Functional analysis of the Kv1.1 N255D mutation associated with autosomal dominant hypomagnesemia. J Biol Chem 285, 171-178. 19. Hekkelman, M.L., and Vriend, G. (2005). MRS: a fast and compact retrieval system for biological data. Nucleic Acids Res 33, W766-769. 20. Chenna, R., Sugawara, H., Koike, T., Lopez, R., Gibson, T.J., Higgins, D.G., and Thompson, J.D. (2003). Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res 31, 3497-3500. 21. Kabsch, W., and Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577-2637. 22. Vriend, G. (1990). WHAT IF: a molecular modeling and drug design program. J Mol Graph 8, 52-56, 29. 23. Kall, L., Krogh, A., and Sonnhammer, E.L. (2004). A combined transmembrane topology and signal peptide prediction method. J Mol Biol 338, 1027-1036. 24. Rost, B. (1996). PHD: predicting one-dimensional protein structure by profile-based neural networks. Methods Enzymol 266, 525-539.

71

HOPE: An e-Science approach with life scientist friendly interfaces 4

25. Blom, N., Gammeltoft, S., and Brunak, S. (1999). Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. J Mol Biol 294, 1351-1362.

72

Applications of Homology Modelling 5

Previous page:

Model of a WD 40 repeat – These propeller-like domains are found in proteins in all eukaryotes and are implicated in functions varying from signal transduction to apoptosis. (PDB-file 1p22)

74

Applications of Homology Modelling 5

Introduction In the previous chapters, we described a method and the recent improvements to build homology models. We also described Project HOPE, a web server that is able to build and use these models or, if available, experimentally determined protein structures for the analysis of the effects caused by point mutations. This chapter contains short summaries of those projects that contributed to HOPE’s development.

During the last years we have collaborated in a series of projects with researchers from departments ranging from human genetics to organic chemistry, and from physiology to internal medicine. A few projects were carried out together with researchers from laboratories in other cities (Utrecht, Groningen), and even with colleagues from institutes located in other countries (USA, Korea). Some of our collaborators we have never even met in person yet, which illustrates that it is not only computers that meet in the internet. In all these projects, our colleagues spent numerous hours working in the lab collecting data for their project. Wet-lab experiments are the only way to identify that one particular mutation in one particular gene causes the phenotype of interest. Therefore, experiments in the laboratory undoubtedly form the starting point of interesting discoveries and publications. The bioinformatics analysis was often an important issue in the last phase of the project. In all cases presented here, analysis of the protein structure provided a better understanding of the effect of the mutation on the protein´s 3D-structure and its function. In some cases, our analyses resulted in suggestions for new experiments but in all cases bioinformatics provided new and useful insights in the molecular origin of the disease. Therefore, our findings can often be regarded as the last piece of the molecular puzzle, and at the same time, as first part of the new puzzle. The collaboration between bioinformatics and the (bio)medical field works both ways. Through bioinformatics the (bio)-medical researcher can access large amounts of data, knowledge, and tools, which can all be beneficial for his/her project. On the other side, bioinformaticians can use the questions, ideas, and information to work on their own projects. Project HOPE is an excellent example of this process. While collecting information, building models, and answering the questions from researchers we worked on the development of Project HOPE. Collaborating in many different projects has not only taught us what questions should be answered but also how we should present our answers in such a way that they are understandable and useful for everyone who is not a bioinformatician. Every article summarized in this chapter has contributed to the development of Project HOPE.

The articles summarized on the following pages discuss the effect of at least one point mutation on the protein’s 3D-structure. We used all our knowledge and bioinformatics tools to visualize and explain these effects. We followed the same protocol for each project albeit that some of these aspects were not yet available in older projects. First, we tried to find the 3D-structure of the protein of interest in the PDB. Sometimes, experimentally determined structures were available in the PDB. As the structures deposited in the PDB usually still contain errors, we used a re-refined version downloaded from the PDB-redo database for our analyses. In a major part of the cases without known 3D-coordinates we were able to perform homology modeling (see chapter 2 for methods). Models for earlier projects were built using the WHAT IF modeling server, often in combination with loop modeling and energy minimization scripts in YASARA. Models for more recent projects were built using YASARA’s automatic modeling script. The latter method has 2 main advantages:

75

Applications of Homology Modelling 5

 It can perform the complete modeling process automatically; all it needs is the sequence of interest. This enables us to incorporate the program in a pipeline that requires homology models.  It builds very good models. The automatic YASARA modeling method was ranked first overall in the category 'high-resolution server accuracy’ during CASP8.

Therefore, automatic modeling by YASARA is nowadays our method of choice for modeling in Project HOPE and in all other projects (http://www.cmbi.ru/nl/~hvensela/HOPEresults). Both the downloaded PDB-files and the models produced by YASARA were visualized and analyzed using the WHAT IF & YASARA Twinset1; 2. All figures shown in this chapter were made using this software package in combination with POV-Ray (http://www.povray.org/). In cases for which no structural information was available at all, we used servers that can provide predictions based on the sequence, such as PsiPred3 for secondary structure predictions, Phobius for transmembrane domain predictions4, and NetPhos for predictions of phosphorylation sites5. In a few cases the authors also used other methods, such as EsyPred3D, to build a homology model.

The summaries on the following pages will all first describe the disease or syndrome that was found in the patient, followed by information about the affected protein(s). The text will contain only a minimal description of the experiments done in the lab because this goes beyond the scope of this thesis. The short introduction is followed by a paragraph that describes the method that was used to obtain structural information about the protein. In case of homology modeling, the method, the template, and the percentage sequence identity will be mentioned. In case no structure or template was found, the other methods used to collect more information about the protein of interest will be described. The modeling method itself will not be discussed in detail because this can be found in chapter 2. The effect of each point mutation on the protein 3D-structure will be described. We will only study missense mutations, even though other mutations such as nonsense mutation, deletions and insertions were sometimes also studied in the original article. These other mutations can be found on the corresponding project website (www.cmbi.ru.nl/~hvensela). At this site more information, figures, animations, and explanations are shown. This site was used to communicate with collaborators in the projects and was often mentioned in the articles as supplementary data.

The last paragraph of each project will show HOPE’s main conclusions about the effect of the mutations on the 3D-structure. The text about the method and amino acids is summarized to save space. We will only highlight the most striking results, such as a feature that still needs to be improved or an important difference between the manual and automatic analysis. Due to the ever ongoing work on HOPE some of the answers provided by the current version of the server might slightly differ from the ones shown here. We also used these, and many other, mutations in a large validation-study of Project HOPE, which will be described in the next chapter.

76

Applications of Homology Modelling 5

Project: HFE Title: A novel (Leu183Pro-)mutation in the HFE-gene co-inherited with the Cys282Tyr mutation in two unrelated Dutch hemochromatosis patients Authors: D. W. Swinkels – H. Venselaar – E. T. Wiegerinck – E. Bakker – I. Joosten – C. A. J. J. Jaspers – W. L. Vasmel – M. H. Breuning Published in: Blood Cells, Molecules, and Diseases 2008 (40(3): 334–338)

We studied the heterozygous mutation Leu183Pro in the HFE-gene found in two unrelated Dutch men who both presented a classical form of hereditary haemochromatosis (HH). Individuals suffering from HH have, among other symptoms, a disturbed iron metabolism resulting in iron overload in the joints. Many of the European HH patients are homozygous for the Cys282Tyr mutation while another 4–7% are Cys282Tyr/His63Asp compound heterozygotes6. In Caucasians, around 25% of the HH patients carry at least one chromosome without any of the known HFE- mutations (Cys282Tyr, His63Asp, or Ser65Cys)6. In addition to these common mutations, a series of rare HFE mutations that are usually found in only one family exists6. The HFE protein is an atypical major histocompatibility complex class I molecule that interacts with the transferrin receptor 1 (TfR1) that is a type II transmembrane glycoprotein and the primary effector of cellular iron uptake7; 8. It has been proposed that the most important role for HFE is iron-sensing in the liver by conveying the whole-body iron status. This status is expressed by transfer of Fe2-TF from the HFE–TfR1 complex to TfR2, resulting in potential signaling of downstream events9. HFE-mutations are supposed to exert their effect by disrupting either the interaction with β2-microglobulin (β2M) and/or TfR1 (figure 5.1A). The Cys282Tyr mutation in HFE-related haemochromatosis disrupts the disulfide bridge in the HFE α3-domain, thereby impairing its β2M binding and its HFE cell surface expression10; 11. The His63Asp mutation in a loop of the α1-domain is also a frequent, but much less severe HFE variant. It probably induces a minor rearrangement that slightly influences the structure of the TfR1 interacting domain, but this has been an area of controversy8; 11; 12. Here, we assess the functional consequences of the newly discovered Leu183Pro mutation by studying its effect on the interaction with TfR1 in the structure of the HFE–β2M-TfR1 complex.

Manual study of mutation Leu183Pro in HFE The HFE–β2M-transferrin dimer complex has been solved by X-ray diffraction and the atomic coordinates can be found in PDB-file 1de48. HFE contains a MHC-like structure comprising a β-sheet topped by two α-helices. These helices form a cleft that upon signaling interacts with the helix α1 from TfR1 (see figure 5.1A). The Leu183Pro mutation is situated in the α2-helix of the HFE-protein but the mutated leucine residue does not directly interact with the receptor. The Leu183Pro mutation results in destabilization of the protein structure for two reasons. First, the proline side chain is attached to its own backbone twice, and thus does not have a backbone proton to form the hydrogen bond required for stabilization of the helix. Second, the smaller size of proline might result in instability in the hydrophobic core of the protein. This HFE-mutation clearly affects the stability of the α2-helix of the HFE-protein, thereby disrupting the interaction with the TfR1 protein that is essential to exert the iron sensing function of the HFE-protein (see figure 5.1B and C). The idea that a stable α1/2 domain is essential for the correct function of HFE is coroborated by the fact that other HFE missense mutations in this area have been shown to result in the classic hemochromatosis phenotype13,14 (reviewed in6). A closer study of these mutations (including the newly found Leu163Pro) showed that the effect of these mutations on the 3D-structure was directly correlated with the severity of the phenotype found in the patient. These findings may have implications for HFE-testing of iron overloaded heterozygous Cys282Tyr-patients of Northern European origin and their relatives.

77

Applications of Homology Modelling 5

Figure 5.1: A) Overview of the crystal structure of HFE (green)–β2M (blue)–TfR1 (yellow)-complex, indicating the location of the Cys282Tyr, His63Asp and Leu183Pro mutations found in different patients. B) Close-up of Leu183 in the HFE-protein. C) Changes in the α-helix caused by the substitution of a proline for a leucine at position 183 of the HFE-protein. The irregular area in the helix is light green.

78

Applications of Homology Modelling 5

Project HOPE’s conclusion about the HFE mutation

Method Analysis on existing structure PDB 1a6z Mutation Leu183Pro

Structure The mutation is located within a stretch of residues identified as a special region: "Alpha- 2". The differences in amino acid properties can disturb this region and disturb its function. In the 3D structure can be seen that the wild-type residue is located in an α-helix. Proline disrupts an α-helix when not located in the first 3 residues. In this case the helix will be disturbed and this can have severe effects on the structure of the protein. Amino acid properties The wild-type and new mutant amino acids differ in size. The new mutant residue is smaller than the wild-type residue. The mutation will cause an empty space in the core of the protein (complex).

Conclusion HOPE correctly predicts the disturbing effect of the proline on the α2-helix, and the size difference between proline and leucine. HOPE does not draw the conclusion that disturbing this helix will also disturb interaction with the transferrin receptor. This information requires another PDB-file for analysis and/or knowledge from literature. Currently, HOPE identifies PDB-file 1a6z as the best structure to use for the analysis. However, this file contains a dimer of the HFE-β2M complex while the HFE-β2M-TfR1 (1de4) complex was used for the manual analysis. This explains the differences between the manual and automatic analyses. In the future HOPE will be able to use complex information and thus will be able to identify the lost interaction between HFE and transferrin.

Manual analysis can be found online: http://www.cmbi.ru.nl/~hvensela/HFE/

79

Applications of Homology Modelling 5

Project: ESRRB Title: Mutations of ESRRB Encoding Estrogen-Related Receptor Beta Cause Autosomal-Recessive Nonsyndromic Hearing Impairment DFNB35 Authors: R.W. Collin – K. Kalay – M. Tariq – B. van der Zwaag – H. Venselaar - J. Oostrik – K. Lee – Z.M. Ahmed – R. Caylan – Y. Li – H. A. Spierenburg - E. Eyupoglu – A Heister – S. Riazuddin – E. Bahat – M. Ansar – S. Arslan – B. Wollnik – H. G. Brunner – C. W. Cremers – A. Karaguzel – W. Ahmad – F. P. Cremers – G. Vriend – T. B. Friedman – S. Riazuddin - S. M. Leal – H. Kremer Published in: American Journal of Human Genetics 2008 (824(1): 125-138)

Autosomal-recessive non-syndromic hearing impairment (ARNSHI) is a genetically heterogeneous disorder. Sixty-seven loci for ARNSHI, referred to as DFNB loci, have been mapped, and 24 of the causative genes have been identified15. Genome-wide homozygosity mapping revealed a locus for nonsyndromic hearing impairment on location 14q24.3–q34.12 that partly overlaps with the DFNB35 locus16. Among the candidate genes in this region was the gene for Estrogen Related Receptor Beta (ESRRB). Sequence analysis of this gene revealed a frameshift leading to a premature stop codon in the affected proband family members. Subsequent mutation analysis in the proband familiy and three other families of Pakistani origin revealed four missense mutations in ESRRB. The ESRRB protein is a member of the nuclear hormone receptor transcription factor family. All family members share the same fold but for only two of the domains the function is known: the DNA-binding domain and the ligand binding domain (see figure 5.2A). Structures of these 2 domains have been solved for many family members. The ligand binding domain generally consists of twelve α-helices, although the second α-helix is very irregular in some family members including ESRRB. Together with a two-stranded β-sheet the α-helices fold into a three-layered helical sandwich (see figure 5.2B). Activation of a receptor depends on the binding of an agonist and one or more co- factors. The ligand binding pocket is formed by the β-sheet and α-helices 3, 5, 6, and 11. Upon ligand binding, helix 12 is repositioned into a groove between helices 3 and 11, thereby forming a new groove for co-activator binding. In contrast, antagonist binding prevents the repositioning of helix 12 thereby allowing co-repressors to occupy the available space in the groove between helices 3 and 11 and preventing activation of the receptor17.

Manual study of the mutations in the ESRRB ligand-binding domain The missense mutations discussed here (Leu320Pro, Val342Leu, and Leu347Pro) affect amino acids in the ligand-binding domain of ESRRB. The sequence of the ESRRB ligand-binding domain shows high sequence similarity (79%) with the corresponding domain of Estrogen Related Receptor Gamma (ESRRG), found in PDB-file 1kv618. The crystal structure of ESRRG was used as a template to build a model of the human ESRRB ligand-binding domain. Although the missense variants found in the ligand-binding domain do not affect the α- helices directly involved in ligand binding, the structural changes that occur near the substituted amino acids are likely to influence the structure and the stability of the domain. Leucine to proline mutations in an α-helix tend to disrupt the helical conformation. In the case of Leu320Pro and Leu347Pro the absence of the leucine side chain also results in the loss of hydrophobic interactions between helices (figure 5.2C-E). The Val342Leu substitution in helix 8 is predicted to disrupt the interaction between helix 8 and helix 1 because the introduction of the bigger leucine will result in steric clashes between the leucine side chain and residues in helix 1 and 8 (see figure 5.2E). Our hypothesis that this might impair the function of the ligand binding domain is supported by mutations found in either α-helix 8 of the estrogen receptor or α-helix 1 of the androgen receptor19,20,21. These mutations are located far away from the binding site but still affected the binding affinity of the ligand. In addition, hydrophobic interactions between α-helices 1 and 8 have been reported to be required for maintenance of the conformation of the receptor PPARγ, and it is has been suggested that these results can be extrapolated to other members of the nuclear receptor family, including the estrogen-related receptors19.

80

Applications of Homology Modelling 5

The present study describes the identification of ESRRB as the gene involved in autosomal recessive hearing impairment DFNB35. Characterization of the exact molecular mechanisms by which genetic defects in ESRRB cause nonsyndromic hearing loss will contribute to the understanding of the processes that are essential for normal hearing and might therefore be important for therapeutic intervention.

9

8 10 4 5

7 3 12

11 6

Figure 5.2: A) Schematic representation of the domain structure of human ESRRB. The zinc finger C4 DNA- binding domain (DBD) and the ligand binding domain (LBD) are depicted in red and blue, respectively. B) Structure of the ligand binding domain of the human ESRRB protein. Helices are shown as cylinders and numbered. The positions of the three amino acids that are affected by the missense mutations, are indicated in green. D-E) Close-ups of Leu320Pro, Val342Leu and Leu347Pro. Side chains of the wild type and mutant residues are shown in green and red respectively. In panel D a semitransparent surface is shown (gray).

81

Applications of Homology Modelling 5

Project HOPE’s conclusion about the ESRRB mutations

Method Model based on template PDB 2e2r Mutation Leu320Pro

Contacts In the 3D-structure van be seen that the wild-type residue makes interaction with a ligand annotated as 2OH. The differences in properties between wild-type and mutation can easily cause loss of interactions with the ligand. Because ligand binding is often important for the protein's function, this function might be disturbed by this mutation. Structure In the 3D-structure can be seen that the wild-type residue is located in an α-helix. Proline disrupts an α- helix when not located at one of the first 3 positions. In case of the mutation at hand, the helix will be disturbed and this can have severe effects on the structure of the protein. Amino acid properties The wild-type and mutant amino acids differ in size. The mutant residue is smaller than the wild-type residue. The mutation will cause an empty space in the core of the protein (or protein-complex).

Conclusion HOPE correctly predicts the disturbing effect of a proline in an α-helix, that the mutation will cause an empty space in the core, and that this residue can be important for ligand binding.

Mutation Val342Leu

Amino acid properties The wild-type and mutant amino acids differ in size. The mutant residue is bigger than the wild-type residue. The wild-type residue was buried in the core of the protein. The mutant residue is bigger and probably will not fit.

Conclusion HOPE correctly predicts that the size difference can be responsible for the effect. Although the report might seem very short, this conclusion agrees completely with the manual analysis.

82

Applications of Homology Modelling 5

Mutation Leu347Pro

Structure In the 3D-structure can be seen that the wild-type residue is located in an α-helix. Proline disrupts an α-helix when not located in the first two residues. In this case the helix will be disturbed and this can have severe effects on the structure of the protein. Amino acid properties The wild-type and mutant amino acids differ in size. The mutant residue is smaller than the wild-type residue. The mutation will cause an empty space in the core of the protein (or protein-complex).

Conclusion HOPE correctly predicts the disturbing effect of a proline in an α-helix, similar to mutation Leu320Pro.

Manual analysis can be found online: http://www.cmbi.ru.nl/~hvensela/estrogen/

83

Applications of Homology Modelling 5

Project: PCDH15 Title: Gene structure and mutant alleles of PCDH15: nonsyndromic deafness DFNB23 and type 1 Usher syndrome Authors: Z. M. Ahmed – S. Riazuddin – S. Aye – R. A. Ali – H. Venselaar – S. Anwar – P. P. Belyantseva –M. Qasim – S. Riazuddin – T. B. Friedman Published in: Human Genetics 2008 (124(3): 215-223)

We analyzed mutations in PCDH15 causing either combined hearing and vision impairment (type 1 Usher syndrome; USH1F) or nonsyndromic deafness (DFNB23). In 13 families selected on the basis of prelingual hearing loss we observed co-segregation of deafness with homozygosity for USH1F/DFNB23-linked STR marker genotypes. Sequence analyses identified homozygous mutant alleles of PCDH15 in affected individuals from 8 of these 13 families. These variants co-segregated with deafness and were not found in 100 ethnically matched hearing control individuals. Human PCDH15 encodes protocadherin 15. The protein exists in a variety of isoforms with 3–11 ectodomains (ECs), a transmembrane domain, and a carboxy-terminal cytoplasmic domain (CD). Human PCDH15 encodes three alternative, evolutionarily conserved unique cytoplasmic domains (CD1, CD2, or CD3)22. The ectodomains are connected by a short stretch of residues that can bind 3 Ca2+-ions. These Ca2+-ions are thought to stabilize a conformation in which the ectodomains are available for protein-protein interactions23. Molecular modeling provided mechanistic insight into the phenotypic variation in phenotype severity of three PCDH15 missense mutations.

Manual study of the mutations in PCDH15 ectodomain 1-3 Molecular modeling was performed using the C-cadherin crystal structure, PDB-file 1q55, as a template24. Sequence identity between C-cadherin and protocadherin EC repeats 2-3 was 26% over 252 amino acids. Modeling was done using the WHAT IF server for model building and YASARA for subsequent energy minimization using standard parameter settings2. Mutation Arg134Gly is located in the first ectodomain (EC1). Unfortunately, we were not able to model the complete EC1 (see figure 5.3). However, we can assume that EC1 adopts a fold similar to EC2 and 3. A ClustalW sequence alignment of the eleven EC domains reveals that Arg134 is conserved and that this mutation does not disrupt a calcium-binding motif. Arg134 is located in the middle of EC1 with its side chain pointing outwards. This domain is known to be important for interaction with a partnering cadherin or protocadherin monomer25. Mutation of the hydrophilic and positively charged arginine into a glycine might disturb these interactions. Indeed, Arg134Gly has been shown in vitro to result in the loss of interaction between protocadherin 15 and cadherin 2326. The homozygous mutation Asp178Gly is located in the calcium-binding motif in the loops between EC2 and EC3. The multiple sequence alignment shows that this mutation disrupts a conserved DXD calcium-binding motif. Glycine does not have a side chain and therefore cannot interact with the calcium atoms. As a result, the Asp178Gly mutation is predicted to cause decreased calcium affinity. The previously reported27 Gly262Asp mutation is located close to the calcium-binding residues in the loops between EC2 and EC3, but is not in contact with the calcium itself. Glycine does not have a side chain and is therefore very flexible and able to form unusual backbone angles that are likely to maintain the correct shape of the calcium binding domain. The introduction of a less flexible and larger aspartic acid will disturb the local structure needed to bind calcium. The three missense mutations Arg134Gly, Asp178Gly, Gly262Asp of PCDH15 are located in the extracellular N-terminus of protocadherin 15. Arg134Gly and Gly262Asp are associated with nonsyndromic deafness DFNB23 while mutation Asp178Gly is associated with USH1F. By comparison, all truncating mutant alleles of PCDH15 are associated with type 1 USH28,29,30. A better understanding as to how the retina is spared in deaf individuals homozygous for presumably hypomorphic mutations of PCDH15, CDH23, MYO7A or USH1C should be helpful in developing

84

Applications of Homology Modelling 5 interventions that delay the age of onset and/or slow the progression of the retinitis pigmentosa component of Usher syndrome.

Figure 5.3: Structure of the homology model of EC domains 2 and 3 and a small part of EC domain 1 in PCDH15. Side chains of the mutated residues are shown as sticks and coloured red. Yellow balls represent calcium atoms.

Project HOPE’s conclusion about the PCDH15 mutations

Method Model based on template PDB 1ncj Mutation Arg134Gly

Contacts The size difference between wild-type and mutant residue makes that the new residue is not in the correct position to make the same hydrogenbond as the original wild-type residue did. The difference in hydrophobicity will affect hydrogenbond formation. The difference in charge will disturb the ionic interaction made by the original wild-type residue. Structure The mutation is located within a domain, annotated as: "Cadherin 1". The mutation introduces an amino acid with different properties here which can disturb this domain and abolish its function. The mutation introduces a glycine at this position. are very flexible and can disturb the required rigidity of the protein on this position. Amino acid properties There is a difference in charge between the wild-type and mutant amino acid. The charge of wild-type residue is lost by this mutation. This can cause loss of interactions with other molecules. The wild-type and mutant amino acids differ in size. The mutant residue is smaller than the wild-type residue. This will cause a possible loss of external interactions.

Conclusion HOPE correctly mentions all the facts it can find (including lost hydrogenbond, lost ionic interaction, and possible loss of external interactions). The residue is located in a loose end, which makes it difficult to confirm the result. The fact that other mutations in this area disturb interactions with cadherin can be found in literature only and is not (yet) available to HOPE.

85

Applications of Homology Modelling 5

Mutation Asp178Gly

Contacts In the 3D-structure can be seen that the wild-type was involved in a metal-ion contact. The mutant residue is smaller than the wild-type residue. The size differences between the wild-type and mutant residue disturb the interaction with the metal-ion: "CA". The wild-type residue was negatively charged, the mutant residue is neutral. Because the negative charge is lost the interaction with the metal "CA" will be less stable, this can disturb the domain. The difference in charge will disturb the ionic interaction made by the original, wild-type residue. Structure The mutation is located within a domain, annotated as: "Cadherin 2". The mutation introduces an amino acid with different properties, which can disturb this domain and abolish its function. The mutation introduces a glycine at this position. Glycines are very flexible and can disturb the required rigidity of the protein at this position. Amino acid properties There is a difference in charge between the wild-type and new mutant amino acid. The charge of the buried wild-type residue is lost by this mutation. The wild-type and mutant amino acids differ in size. The mutant residue is smaller than the wild-type residue. The mutation will cause an empty space in the core of the protein (or protein-complex). The hydrophobicity of the wild-type and mutant residue differs. The mutation can cause loss of hydrogenbonds in the core of the protein and as a result disturb correct folding. Conclusion (left picture) HOPE correctly predicts that the metal-contact is lost here and that a smaller and flexible residue is introduced which will disturb the structure that facilitated Calcium binding.

Mutation Gly262Asp

Structure The mutation is located within a domain, annotated as: "Cadherin 2". The mutation introduces an amino acid with different properties here which can disturb this domain and abolish its function. The wild-type residue was a glycine, the most flexible residue of all. This flexibility might be necessary for the protein's function. Mutation of this glycine can abolish this function. Amino acid properties There is a difference in charge between the wild-type and mutant amino acid. The mutant residue introduces a charge in a buried residue which can lead to protein folding problems. The wild-type and mutant amino acids differ in size. The mutant residue is bigger than the wild-type residue. The wild-type residue was buried in the core of the protein (or protein-complex). The mutant residue is bigger and probably will not fit. The hydrophobicity of the wild-type and mutant residue differs. The mutation will cause loss of hydrophobic interactions in the core of the protein (or protein-complex). The torsion angles for this residue are unusual. Only glycine is flexible enough to make these torsion angles, mutation into another residue will force the local backbone into an incorrect conformation and will disturb the local structure. Conclusion (right picture) HOPE correctly mentions all differences between Gly and Asp, including the difference in flexibility. It correctly predicts that the bigger side chain will not fit here.

Manual analysis can be found online: http://www.cmbi.ru.nl/~hvensela/PCDH15

86

Applications of Homology Modelling 5

Project: LRTOMT Title: Mutations of LRTOMT, a fusion gene with alternative reading frames, cause nonsyndromic deafness in humans Authors: Z. M. Ahmed – S. Masmoudi – E. Kalay – I. A. Belyantseva – M. A. Mosrati – R. W .J. Collin – S. Riazuddin – M. Hmani-Aifa – H. Venselaar – M. N. Kawar – A. Tlili – B. van der Zwaag – S. Y Khan- L. Ayadi - S. A. Riazuddin – R. J. Morell – A. J. Griffith – I. Charfedine – R. C. Aylan – J. Oostrik – A. Karaguzel – A. Ghorbel – S. Riazuddin – T. B. Friedman – H. Ayadi – H. Kremer Published in: Nature Genetics 2008 (40(11): 1335-1340)

Sequence analysis of Pakistani, Turkish, and Tunisian families revealed four pathogenic mutations in a previously uncharacterized gene, LRRC51 renamed LRTOMT. All four mutations co-segregate with deafness; carriers have normal hearing; and none of the four mutations was detected in ancestry- matched, normal-hearing subjects. The LRTOMT gene consists of ten exons that can produce five alternatively spliced transcripts that are widely expressed in the human body. Translation beginning in exon 5 produces isoform LRTOMT2, which is predicted to have a catechol-O-methyltransferase (COMT) domain. Catechol-O-methyltransferase catalyzes the transfer of a methyl group from S-adenosyl-L- methionine (Ado-Met) to a catechol hydroxyl group of catechols31. The homozygous mutation 358+4A>C in hearing-impaired individuals alters the splice donor site of exon 8 of LRTOMT. This results in the absence of exon 8, a reading frameshift, and a premature downstream translation stop codon A29SfsX54 in the mRNA encoding LRTOMT2. The other three mutations are all predicted to affect the structure of the catechol-O-methyltransferase domain of LRTOMT2 and will be discussed here.

Manual study of mutations in the COMT-domain The crystal structure of rat COMT can be found in PDB-file 1h1d31 and shares 39% sequence identity with the human protein. This structure was used to model the COMT domain of human LRTOMT2. The WHAT IF server was used for modeling. Energy minimization and analysis were done with YASARA2. The three mutated residues are located in helix 1 (Arg81Gln), helix 2 (Trp105Arg), and the loop that follows helix 2 (Glu110Lys) respectively, and thus not in the substrate-binding pockets (figure 5.4A). However, the loop is predicted to be important for the groove that binds the methyl acceptor31. Arg81 and Glu110 form a salt bridge between helix 1 and the loop following helix 2 (figure 5.4B-D). A glutamine at position 81 cannot form this salt bridge as it is not positively charged. The formation of hydrogen bonds is also impaired because of the smaller size of glutamine compared to arginine. The Trp105 residue makes hydrophobic interactions with its big side chain (figure 5.4C) Most of these interactions would be lost by the Trp105Arg substitution. Mutation Glu110Lys is predicted to lead to the loss of hydrogen bonds and a salt bridge. There will be repulsion between the side chains of Lys110 and Arg81 as both are positively charged. The mutations may cause protein instability and could indirectly affect the substrate binding region31. Deafness may be due to the predicted destabilizing effects of these mutations on the catechol-O-methyltransferase domain of LRTOMT2.

87

Applications of Homology Modelling 5

Figure 5.4: Molecular model and predicted effects of missense mutations. A) Molecular model of the COMT- domain (residues 79–290) of LRTOMT2. The affected residues are depicted in blue. The predicted ligands are coloured yellow, and tyrosine 111 that lines the hydrophobic groove of the ligand binding site is shown in cyan. The boxed region is enlarged in panels B-D. B/C/D) Residues affected by missense mutations of LRTOMT2. The region of helices 1 and 2 and part of the flanking loops are enlarged. Wild-type residues Arg81, Trp105, and Glu110 are depicted in blue, mutated residues in pink. Hydrogen bonds are represented by yellow dotted lines. H1 = helix 1; H2 = helix 2; L = loop.

88

Applications of Homology Modelling 5

Project HOPE’s conclusion about the LRTOMT mutations

Method Model based on template PDB 1h1d Mutation Arg81Gln

Contacts The size difference between wild-type and mutant residue makes that the new residue is not in the correct position to make the same hydrogenbond as the original wild-type residue did. The difference in charge will disturb the ionic interaction made by the original wild-type residue. Amino acid properties There is a difference in charge between the wild-type and mutant amino acid. The charge of wild-type residue is lost by this mutation. This can cause loss of interactions with other molecules. The wild-type and mutant amino acids differ in size. The mutant residue is smaller than the wild-type residue. This will cause a possible loss of external interactions.

Conclusion HOPE correctly predicts that the hydrogenbond and saltbridge will be lost. HOPE also predicts an effect on the external interactions, which can be an indirect effect.

Mutation Trp105Arg

Amino acid properties There is a difference in charge between the wild-type and mutant amino acid. The mutant residue introduces a charge in a buried residue, this can lead to protein folding problems. The wild-type and mutant amino acids differ in size. The mutant residue is smaller than the wild-type residue. The mutation will cause an empty space in the core of the protein. The hydrophobicity of the wild-type and mutant residue differs. The mutation will cause loss of hydrophobic interactions in the core (or protein-complex).

Conclusion HOPE predicts correctly the loss of hydrophobic interactions in the core of the protein.

89

Applications of Homology Modelling 5

Mutation Glu110Lys

Contacts In the 3D-structure can be seen that the wild-type residue makes interaction with a ligand annotated as NHE. The differences in properties between wild-type and mutation can easily cause loss of interactions with the ligand. Because ligand binding is often important for the protein's function, this function might be disturbed by this mutation. The size difference between wild-type and new mutant residue makes that the new residue is not in the correct position to make the same hydrogen bond as the original wild-type residue did. The difference in charge will disturb the ionic interaction made by the original wild-type residue. Amino acid properties There is a difference in charge between the wild-type and mutant amino acid. The mutation introduces the opposite charge at this position. This possibly disrupts contacts with other molecules. The wild-type and mutant amino acids differ in size. The mutant residue is bigger than the wild-type residue. The residue is located on the surface of the protein, mutation of this residue can disturb interactions with other molecules or other parts of the protein.

Conclusion HOPE correctly predicts the loss of a hydrogenbond and saltbridge. It also sees a contact with a putative ligand that might be biologically irrelevant.

General conclusion about LRTOMT mutations The effects of all three mutations are predicted correctly by HOPE. However, the longer distance relation between the domain and the putative ligand binding site can only be found by doing a manual study of the structure and literature.

Manual analysis can be found online: http://www.cmbi.ru.nl/~hvensela/catechol/

90

Applications of Homology Modelling 5

Project: EHMT1 Title: Further clinical and molecular delineation of the 9q Subtelomeric Deletion Syndrome supports a major contribution of EHMT1 haploinsufficiency to the core phenotype Authors: T. Kleefstra – W. A. van Zelst-Stams – W. M. Nillesen – V. Cormier-Daire – G. Houge – N. Foulds – M. van Dooren – M. H. Willemsen – R. Pfundt – A. Turner – M. Wilson – J. McGaughran – A. Rauch – M. Zenker – M. P. Adam – M. Innes – C. Davies – A. G. López - R. Casalone – A. – L. A. Brueton – A. D. Navarro – M.P. Bralo – H. Venselaar – S.P.A. Stegmann – H.G. Yntema – H. van Bokhoven – H.G. Brunner Published in: American Journal of Medical Genetics 2009 (46(1): 598-606)

We studied the clinical and molecular characteristics of a cohort of newly identified cases with 9q subtelomeric deletion syndrome (9qSTDS), one of the first clinically recognizable telomeric syndromes. It was demonstrated earlier that the syndrome is caused by haplo-insufficiency of Eu- chromatin Histon Methyl Transferase 1 (EHMT1), a gene involved in histone methylation32-37.. In 6 of 24 patients (25%) selected for exhibiting the 9qSTDS core phenotype, a pathogenic EHMT1 mutation was found. Five out of six mutations affect the open reading frame of the EHMT1 gene and are predicted to create premature termination codons which may lead to nonsense mediated decay. One de novo missense change (Cys1042Tyr) was found. This mutation affects a highly conserved cysteine residue located in the pre-SET domain.

Manual study of mutation Cys1042Tyr in EHMT1 We used the structure of the pre-SET and SET-domains, PDB-file 2rfi38, of EHMT1 to study the effect of the missense mutations on the 3D-structure of the protein. The WHAT IF & YASARA Twinset was used to analyze the effect of the amino acid substitution on the protein structure. Cysteine 1042 is located in a cysteine cluster surrounding three zinc-ions (see figure 5.5). This cysteine/zinc-cluster is important for the local structure of this particular region of the pre-SET domain. The make strong interactions with the zinc-ions thereby placing the intervening loops in the correct position for dimer interactions. Mutation of cysteine 1042 into a tyrosine will abolish the strong cysteine/zinc interaction, causing a less stable domain. Moreover, introduction of the much bigger tyrosine will completely disturb the structure of the zinc/cysteine-cluster and displace the residues in the loops that interact with the other monomer. The mutation will destroy the conformation of this local area that is needed for dimer-contacts. Previously reported missense changes all turned out to be polymorphisms33. However, the de novo occurrence and the severe effect on the protein conformation indicate that the Cys1042Tyr missense change studied here is most likely to be causal for the phenotype.

Figure 5.5: Location of the Cys1042Tyr mutant. All cysteines in the cluster are coloured yellow, including cysteine 1042. This residue is mutated into a tyrosine (red). The side chain of tyrosine is shown to indicate the difference in size compared to the cysteine side chain. The cysteines surround three zinc-ions (green). These interactions are indicated as grey cilinders. The cluster is important for the correct position of the loops that interact with the other monomer represented by the light blue surface.

91

Applications of Homology Modelling 5

Project HOPE’s conclusion about the EHMT1 mutations

Method Model based on template PDB 3hna Mutation Cys1042Tyr

Contacts The cysteine residue that was originally located on this position was involved in a metal contact with ZN. The mutant residue is bigger than the wild-type residue. The size differences between the wild-type and mutant residue disturb the interaction with the metal-ion: "ZN". Structure The mutation is located within a domain, annotated as: "Pre-SET". The mutation introduces an amino acid with different properties here which can disturb this domain and abolish its function. Amino acid properties The wild-type and new mutant amino acids differ in size. The mutant residue is bigger than the wild-type residue. The wild-type residue was buried in the core of the protein. The mutant residue is bigger and probably will not fit. The hydrophobicity of the wild-type and mutant residue differs. The mutation will cause loss of hydrophobic interactions in the core of the protein (or protein-complex).

Conclusion HOPE correctly predicts the lost contact between the cysteine and the zinc-ion. It also predicts that the large tryptophan side chain will not fit in the core of the protein. However, HOPE does not (yet) mention that the mutation can also disturb dimerisation as this requires complex information.

Manual analysis can be found online: http://www.cmbi.ru.nl/~hvensela/EHMT1/

92

Applications of Homology Modelling 5

Project: C3ORF60 Title: Mutations in NDUFAF3 (C3ORF60), encoding an NDUFAF4 (C6ORF66)-interacting complex I assembly protein, cause fatal neonatal mitochondrial disease Authors: A. Saada – R. O. Vogel – S. J. Hoefs – M. A. van den Brand –H. J. Wessels – P. H. Willems – H. Venselaar – A. Shaag – F. Barghut – O. Reish – M. Shohat – M. A. Huynen – J. A.M. Smeitink – L. P. van den Heuvel – L. G. Nijtmans Published in: American Journal of Human Genetics 2009 (84(6): 718-727)

In this study we analyze 3 mutations in NDUFAF3 causing complex 1 deficiency, a disorder of the oxidative phosphorylation (OXPHOS) system. OXPHOS-disorders frequently result in severe multisystem disease with a consequence of early childhood death39-41. The most prevalent and least understood OXPHOS disorder is isolated complex I deficiency (MIM 252010). This disease is frequently caused by mutations that disturb complex I assembly, an intricate 45-component puzzle42. At present, we know that complex I assembly involves the formation of multiple assembly intermediates, presumably starting with several highly conserved subunits (NDUFS2, NDUFS3, and NDUFS7)42-47. In addition, aided by recent developments in bioinformatics and genetics, the number of putative complex I assembly proteins has grown considerably. One of the candidate genes with a putative role in complex 1 assembly picked up in this study is C3ORF60, hereafter referred to as NDUFAF3. Homozygosity mapping and gene sequencing identified NDUFAF3 mutations in five complex I-deficient patients from three different families. One of these mutations (M1T) affected the first methionine located in the mitochondrial target signal, the other 2 mutations were located in a domain that could be modeled and will be discussed here.

Manual study of mutations in NDUFAF3 Homology modeling was performed using the NMR structure of Rpa2829 protein from Rhodopseudomonas palustris as a template (PDB-file 2fvt) (30% sequence identity). The mitochondrial target signal at the N-terminus and the 11 C-terminal residues could not be modeled. A model was built using the WHAT IF web server for modeling and the WHAT IF & YASARA Twinset for model optimization and analysis48. The resulting model is shown in figure 5.6.

Figure 5.6: Model of NDUFAF3 and the two mutations found in the patients. Middle panel: NDUFAF3 is shown in ribbon visualization and coloured by secondary structure element (red β-strand, blue α-helix, green turn, cyan coil). The positions of the two mutations found in patients are indicated by the arrows and coloured yellow. Side panels: the R122P and G77R mutations shown in more detail. The protein is gray. The side chains of both the wild-type residue (green) and the mutant residue (red) are shown. Both mutations result in modifications at the same side of the protein.

93

Applications of Homology Modelling 5

The Arg122Pro mutation is located at the C-terminal side of the central β-sheet and is predicted to change the local structure of the loop, which might influence either the interactions or the stability of the protein. Gly77 is conserved between the 2fvt template and the NDUFAF3 model and has backbone-torsion angles that are very unfavorable for other residue types. The Gly77Arg mutation is therefore energetically unfavorable and likely to disrupt the structure. The model of the NDUFAF3 structure exposes a large hydrophobic domain of about 600 squared Ăngstrøm, which is commonly observed for protein-protein interactions49; 50.

In conclusion, we found that NDUFAF3 is a genuine mitochondrial complex I assembly protein that interacts with complex I subunits. We show that mutations in NDUFAF3 cause complex I deficiency.

Project HOPE’s conclusion about the C3ORF60 mutations

Method Model based on template PDB 2fvt Mutation Gly77Arg

Structure The wild-type residue is a glycine, the most flexible of all residues. This flexibility might be necessary for the protein's function. Mutation of this glycine can abolish this function. Amino acid properties There is a difference in charge between the wild-type and new mutant amino acid. The new mutant residue introduces a charge in a buried residue, this can lead to protein folding problems. The wild-type and new mutant amino acids differ in size. The new mutant residue is bigger than the wild-type residue. The wild-type residue was buried in the core of the protein. The new mutant residue is bigger and probably will not fit. The hydrophobicity of the wild-type and new mutant residue differs. The mutation will cause loss of hydrophobic interactions in the core of the protein (or protein-complex). The torsion angles for this residue are unusual. Only glycine is flexible enough to make these torsion angles, mutation into another residue will force the local backbone into an incorrect conformation and will disturb the local structure.

Conclusion HOPE correctly predicts the structural impact of the glycine to arginine mutation and that glycine is needed for its flexibility.

94

Applications of Homology Modelling 5

Mutation Arg122Pre

Contacts The size difference between wild-type and mutant residue makes that the new residue is not in the correct position to make the same hydrogenbond as the original wild-type residue did. The difference in hydrophobicity will affect hydrogenbond formation. The difference in charge will disturb the ionic interaction made by the original wild-type residue. Structure The proline that is introduced by this mutation is a very rigid residue. The mutation might abolish the required flexibility of the protein at this position. Amino acid properties There is a difference in charge between the wild-type and mutant amino acid. The charge of wild-type residue is lost by this mutation. This can cause loss of interactions with other molecules. The wild-type and mutant amino acids differ in size. The mutant residue is smaller than the wild-type residue. This will cause a possible loss of external interactions.

Conclusion HOPE correctly predicts that a hydrogenbond and saltbridge are lost by this mutation, and also mentions the structural impact when a rigid proline is introduced at this position. However, the fact that this mutation might influence the active site can only be found in the literature as this is not yet annotated in SwissProt.

Manual analysis can be found online: http://www.cmbi.ru.nl/~hvensela/C3ORF60

95

Applications of Homology Modelling 5

Project: POU3F4 Title: Clinical and molecular characterizations of novel POU3F4 mutations reveal that DFN3 is due to null function of POU3F4 protein Authors: H. K. Lee – M. H. Song – M. Kang – J. T. Lee – K. Kong –S.J. Choi – K. Y. Lee – H. Venselaar – G. Vriend – W. S. Lee – H. J. Park – T. K. Kwon – J. Bok- U. K.Kim Published in: Physiological Genomics 2009 (39(3): 195-201)

X-linked deafness type 3 (DFN3) is the most prevalent X-linked form of hereditary deafness and accounts for ~50% of all families carrying X-linked nonsyndromic hearing loss51. Clinical characteristics of DFN3 in affected males include temporal bone abnormalities, stapes fixation, and, in most cases, a mixed type of hearing loss, which is often progressive52-54. Female carriers of a mutation in the DFN3 locus typically show little or no hearing loss51. DFN3 is associated with mutations in the POU3F4 locus that encodes Pou3f4, a member of the POU family of transcription factors53. Members of this family are characterized by a conserved bipartite DNA binding domain55,56. This C-terminal domain consists of a POU-specific domain and a POU homeodomain, both of which are helix-turn-helix structural motifs that are responsible for DNA binding and specificity (see figure 5.7). The Pou3f4 protein binds to DNA on the POU-specific binding motif and regulates downstream target genes57. Numerous genetic analyses of DFN3 families have identified mutations in the POU3F4 gene including intragenic mutations, partial or complete deletions of the gene, as well as deletions, inversions, and duplications of the POU3F4 genomic region not encompassing the coding sequence54. Nevertheless, little is known about how such mutations that lead to the characteristic inner ear defects and deafness affect the normal function of the Pou3f4 protein. Using linkage and sequence analyses we found three novel mutations in the POU3F4 gene in three Korean families carrying X-linked hereditary hearing loss. The three mutations included frameshift Ala116fs, deletion delSer310, and missense mutation Arg329Pro. The latter will be discussed here.

Manual study of mutations in POU3F4 The crystal structure of Pou2f1 Oct-1 domain, PDB-file 1oct,58 was used as a template to build a model of wild-type and mutant Pou3f4 proteins. Modeling was performed with the WHAT IF software1 with standard parameter settings and protocols59,60. Optimalisation of the model was performed using YASARA2. Sequence identity between the template and model was 65%. The Arg329Pro mutation converts an arginine to a proline at the amino acid position 329, which is one of the most conserved residues in the POU homeo-domain of the protein. Arginine 329 is located in the third α-helix of the domain, which is crucial for protein-DNA interactions (see figure 5.7). Its side chain is located in the core between the helices and thus not in contact with the DNA. The substitution of proline for the wild-type arginine at this position is expected to be detrimental to the α-helix structure due to destabilization of two important salt-bridges with glutamates as well as loss of a series of hydrophobic contacts. In addition, the side chain structure of proline typically destabilizes the α-helical structures which will reduce the proper protein-DNA interactions. The results demonstrate that the mutations identified in the DFN3 patients abolish the transcriptional activity of the protein.

96

Applications of Homology Modelling 5

Figure 5.7: A) Overview of the Pou3f4 homology model, POU-specific domain (right) and POU-homeo domain (left). The model is shown as a grey ribbon. The DNA is indicated in yellow, the phosphate backbone of the bases is not shown for clarity. The side chain of the mutated Arg329 is shown green. The box indicates the domain shown in the close-ups. B) Close-up of residue Arg329. The hydrogen bonds between arginine and the glutamic acids (blue) are visible as yellow dots. C) Close-up of the mutant residue Pro329. The side chain is coloured red, the hydrogen bonds are lost.

97

Applications of Homology Modelling 5

Project HOPE’s conclusion about the POU3F4 mutations

Method Model based on template PDB-file for modeling: 1oct Mutation Arg329Pro

Contacts The residue makes DNA interactions or is located in a DNA binding region. The differences in properties between wild-type and new mutant residue can easily cause loss of these interactions or disturb the domain which will affect the function of the protein. The wild-type residue was annotated in the Uniprot database to be involved in DNA interaction type: "Homeobox". The size difference between wild-type and new mutant residue makes that the new residue is not in the correct position to make the same hydrogenbond as the original wild-type residue did. The difference in hydrophobicity will affect hydrogenbond formation. The difference in charge will disturb the ionic interaction made by the original wild-type residue. Structure In the 3D structure can be seen that the wild-type residue is located in an α-helix. Proline disrupts an α-helix when not located in the first two residues. In this case the helix will be disturbed and this can have severe effects on the structure of the protein. Amino acid properties There is a difference in charge between the wild-type and new mutant amino acid. The charge of wild-type residue is lost by this mutation. This can cause loss of interactions with other molecules. The wild-type and new mutant amino acids differ in size. The new mutant residue is smaller than the wild-type residue. This will cause a possible loss of external interactions.

Conclusion HOPE finds that proline will disturb the helix and that introduction of the proline will cause loss of a hdyrogenbond, ionic interactions and hydrophobic interactions. Currently, the information about the lost interaction with DNA is obtained from SwissProt, in the future HOPE will be able to use complex information.

Manual analysis can be found online: http://www.cmbi.ru.nl/~hvensela/pou3f4/

98

Applications of Homology Modelling 5

Project: Kv1.1 Title: Functional analysis of the Kv1.1 N255D mutation associated with autosomal dominant hypomagnesemia Authors: J. van der Wijst – B. Glaudemans – H. Venselaar – A. Nair – A. Forst – J. G.J. Hoenderop – R. J. M. Bindels Published in: Journal of Biological Chemistry 2009 (285(1): 171-178)

Voltage-gated K+ channels (Kv) are a diverse family of membrane proteins61-63. The members of the major Shaker-related sub-family (Kv1) play an important role in excitable cells by setting the resting membrane potential, shaping the action potentials, thus controlling the neuronal excitability64. Mutations in Kv1.1 in humans are the cause of periodic episodic ataxia type 1 (EA1, severe discoördination) and/or myokymia (continuous muscle movement)65-70. Kv-channels comprise four subunits that surround a central ion-conduction pore. Kv1.1 subunits consist of six transmembrane segments (S1-S6, see figure 5.8A). Both the N- and the C- terminal tails of the protein are located at the intracellular side71,72. The S1-S4 segments form the voltage sensing domain, whereas S5 and S6 along with the intervening reentrant P-loop form the pore domain73; 74. Kv-channels are known to switch between a closed and an open conformation upon cell (de)polarization74; 75. S4 contains a long array of positive charges that have been shown to sense differences in voltage and start the transition from the closed to open conformation by forming stabilizing hydrogen bonds with the external and internal negative clusters in the voltage- sensing domain74,76,77 . Recently, a missense mutation in Kv1.1 was found in patients with isolated autosomal dominant hypomagnesemia, converting the highly conserved aspargine at position 255 into an aspartic acid78. To examine the importance of Asn255 in Kv1.1 channel function, we substituted the by six different amino acids with distinct chemical and physical properties (Asn255Glu, Asn255Gln, Asn255Ala, Asn255Thr, Asn255Val, and Asn255His). The mutant Kv1.1 channels were analyzed electrophysiological, biochemical, and in silico.

Manual study of Asn255Asp and other mutations in Kv1.1 The structural model of Kv1.1 was built based on the 3D-structure of a chimeric Kv1.2-Kv2.1 channel (75% identity, PDB-file 2r9r79). In order to obtain optimal modeling results, we used a re-refined version of this template from the PDB-REDO database80. The WHAT IF server was used for model building, YASARA was used for loop building2, energy minimization and manual mutation analysis. The tetrameric model was obtained by superposing four models on the biological subunit of PDB file 2r9r, followed by an energy minimization with YASARA. The Asn255Asp mutation is located in the third transmembrane segment (S3) close to the intracellular compartment and the S4 voltage-sensor element (figure 5.8). The mutated channel was non-functional with a dominant negative effect on wild-type channel activity. Mutation of 255 into different amino acids did affect neither channel trafficking nor expression of the channels at the plasma membrane. However, the amino acid substitutions had a clear effect on channel activity, as Asn255Glu and Asn255Gln were non-functional. This indicates that next to the addition of a negative charge at position 255 also an additional CH2-group in the side-chain influences channel function, likely via affecting conformational rearrangements. All mutations stabilized the open state of Kv1.1, measured as negative shifts in the voltage dependence of channel activation. There was no obvious correlation between charge and the magnitude of this shift as the effect was not significantly changed for Asn255His. The hydrophilic asparagine could be important for structural rearrangements within the channel in response to voltage changes, which can be affected by conversion into hydrophobic amino acids like alanine and valine. Interestingly, the change in activation kinetics was most evident with these two residues as Asn255Ala and Asn255Val activated 2 to 3-fold faster than wild-type channels. The magnitude of

99

Applications of Homology Modelling 5 shift in V1/2 was highest for Asn255Thr, which suggests that channel gating is independent of hydrogen bonding with the residue at position 255. The voltage-dependence of inactivation was significantly affected in all mutant Kv1.1 channels compared to wild-type Kv1.1 except for Asn255His. We suggest that the additional negative charge of Asp at position 255 near S3 could keep the channel in the inactivated state, which may explain the non-functionality of the mutation found in patients with hypomagnesemia (Asn255Asp). We have previously described hypomagnesemia as a new phenotypic variability associated with mutation of the highly conserved Asn255 in Kv1.178. This study provided more information about the structural arrangement of Asn255 and its involvement in channel activity. Asn255 is essential for normal channel function, since substitution by other amino acids significantly altered channel activity, voltage-dependence and kinetics of Kv1.1 channels.

A B

Figure 5.8: A) Overview of a tetramer of the Kv1.1-model. The four monomers are shown as ribbon and coloured grey, blue, cyan and green, respectively. The mutated residue is coloured red in the grey monomer. B) Close-up of the wildtype and mutant residues. The mutated residue and surrounding side chains are shown and coloured by element; oxygen = red, nitrogen = blue, and carbon = grey.

Project HOPE’s conclusion about the Kv1.1 mutation

Method Model based on template PDB-file 2a79 Mutation Asp255Asn

Structure The residue is located in a region annotated in the Uniprot database as a transmembrane domain. The wild-type residue was neutral, the mutant residue is negatively charged. This can disturb the (ionic) interactions with the other transmembrane helices. Amino acid properties There is a difference in charge between the wild-type and new mutant amino acid. The new mutant residue introduces a charge in a buried residue, this can lead to protein folding problems. Conclusion HOPE correctly mentions that the only difference between the two residues is the charge and that this charge can affect interactions with other transmembrane helices. There is no information source that identifies the residues in his domain as important for voltage-sensing. This information can only be found in literature.

Manual analysis can be found online: http://www.cmbi.ru.nl/~hvensela/Kv1.1

100

Applications of Homology Modelling 5

Project: PRKCSH & SEC63 Title: Secondary and tertiary structure modeling reveals effects of novel mutations in polycystic liver disease genes PRKCSH and SEC63 Authors: E. Waanders – H. Venselaar – R. H.M. te Morsche – D. B. de Koning – P. S. Kamath – V. E.Torres – S. Somlo – J. P.H. Drenth Published in: Clinical Genetics 2010 (78(1):47-56)

Polycystic liver disease is a rare but potentially disabling disease characterized by overgrowth of fluid-filled cysts in the liver. The cysts originate from the intralobular bile ducts and are lined with a single layer of biliary epithelial cells81; 82. The cysts may cause severe hepatomegaly (enlarged liver) with concomitant symptoms such as severe dyspnea (shortness of breath) and wasting83; 84. PCLD is a congenital disease and is inherited in an autosomal dominant fashion85; 86. Mutations in two genes, PRKCSH and SEC63, have been found as the molecular culprit causing PCLD87; 88. PRKCSH encodes hepatocystin, the non-catalytic β-subunit of glucosidase II which is apparently unrelated to known pathways of cystogenesis. Glucosidase II is located in the endoplasmic reticulum (ER) and is involved in carbohydrate processing and quality control of glycoproteins82; 89. SEC63 encodes Sec63p, which, like the protein product of PRKCSH, is an ER resident protein82. Sec63p functions in the protein translocation into and out of the ER90; 91. We identified a total of 26 novel mutations in PRKCSH (n = 14) and SEC63 (n = 12), including four splice site mutations, eight insertions/ deletions, six nonsense mutations, and eight missense mutations. Where possible we estimated the effect of the missense mutations using homology models, for the remaining mutations we used other bioinformatics tools.

Manual study of mutations in PRKCSH and SEC63 We analyzed the effects of the mutations using a combination of splice site recognition, evolutionary conservation, secondary and tertiary structure predictions, and pathogenicity prediction. We examined the conservation of the proteins using multiple sequence alignments constructed with the ENSEMBL genome browser, Clustal-W, and Jalview92. Mutations in the helical domains of PRKCSH were analyzed using a helical wheel predictor. (http://cti.itc.virginia.edu/∼cmg/Demo/wheel/ wheelApp.html) We were able to identify homology modeling templates for two domains of Sec63p: the luminal DnaJ domain and SEC63-2 domain (amino acids 626–719). The model of the human J-domain was based on the J-domain from mouse DnaJ subfamily C (PDB-file 2ctw, 45% identity). The model for the SEC63-2 domain was based on PDB-file 2q0z (38% identity). Homology modeling and subsequent analysis were performed using YASARA’s automatic homology modeling script2. To assess the effect of the mutations in the unmodeled part of the proteins, we used a secondary structure prediction generated by the PsiPred-server3; 93. The luminal DnaJ domain of Sec63 mainly consists of α-helices and contains two mutations (see figure 5.9A). The side chain of Ile120 makes important stabilizing hydrophobic interactions between these helices. Mutation of this isoleucine into the smaller and less hydrophobic might result in loss of these interactions in the core thus cause loss of stability. A second mutation in this domain, Asp168His, was found in an earlier study94. The side chain of histidine does not fit in the space between the helices and is likely to disturb the local conformation (see figure 5.9A). The model of a second Sec63p domain showed that the large side chain of Trp651 is located in the core between two β-sheets (figure 5.9B). A mutation into a glycine (with no side chain) results in an empty space in the middle of the domain. This will cause stability loss and may alter the structure profoundly thereby affecting the protein function. The other mutations in Sec63 and PRKCSH are all located outside the modeled domains. Mutation Arg217Cys is predicted to be located in an α-helix in the cytoplasmic domain of Sec63p just downstream of the DnaJ domain. Arginine is positively charged and can make hydrogen bonds. It is likely that a mutation to a hydrophobic, neutral and smaller cysteine will cause loss of interactions

101

Applications of Homology Modelling 5 with other residues. Gln375Pro is predicted to be located in an α-helix. This mutation will have severe impact on the structure of this helix as proline causes loss of one hydrogenbond, which will severely destabilize the helix. Arg267Ser is located in a region with no clear secondary structure preference. We were not able to explain the effects of the mutation other than the fact that arginine is a big and positively charged residue whereas serine is small and neutral. The difference between arginine and serine will abolish any structural function or interaction made by the arginine.

A B

Figure 5.9: Homology models of two Sec63p domains. The models are shown in gray with the side chain of the residues exposed. The side chains of the wild-type residues are shown in green and mutation side chain in red. A) Model of the DnaJ domain with the Asp168His and Ile120T mutations. B) Model of the C-terminal 626–719 amino acids of Sec63p. The side chain of wildtype tryptophan 651 is visible in green while no glycine side chain is visible as this is only one H-atom.

Unfortunately, neither a structure nor a modeling template was found for PRKCSH which complicated tertiary structure analysis. Structural analyses on PRKCSH mutations were therefore based on secondary structure predictions. The Glu381Lys mutation was located in a predicted α-helix. Using a helical wheel predictor, we showed that this helix is amphipathic, containing hydrophobic and hydrophilic sides (see figure 5.10). The hydrophobic side can interact with other domains of the protein, leaving the hydrophilic side free for complex formation with other proteins or maybe another part of the protein. A mutation from the negatively charged into a positively charged lysine might disturb these interactions. The PRKCSH Thr261Ser mutation is located in the loop between two helices in the EF-hand domain. The small difference between threonine and serine can affect the conformation of residues between two α-helices. This conformation is reminiscent of a calcium-binding domain. The Met175Val mutation is predicted to be located in a helix. Helices often interact with each other by hydrophobic side chains and therefore this residue might be important for the packing between two helices. Both methionine and valine are hydrophobic but a mutation from a large side chain to a smaller β-branched side chain might lead to fewer hydrophobic interactions and thus destabilization of the domain. In conclusion, we identified 26 novel mutations associated with PCLD and we provide in silico analyses to delineate the role of these mutations. Most mutations were located in highly conserved regions and homology modeling for two domains of Sec63p illustrated the severity of the substitutions. As it is unclear how mutations in hepatocystin and Sec63p lead to cystogenesis in PCLD, the identification of important protein domains by mutational analysis might reveal the mechanism of molecular functions, protein–protein interactions, or protein–glycan interactions. Furthermore, mutation analysis of both the PRKCSH and the SEC63 genes may become increasingly important for establishing a diagnosis of PCLD in affected patients.

102

Applications of Homology Modelling 5

Figure 5.10: Helical wheel prediction of the residues surrounding the Glu381Lys mutation in PRKCSH (Glu10 in the wheel). The helical plot was constructed using amino acids 372–395 of the PRKCSH protein sequence.

Project HOPE’s conclusion about PRKCSH mutations

Method No structure or modeling template, report based on predictions Mutation Met175Val

Structure The wild-type residue is predicted to be located in an α-helix. The mutation converts the wild-type residue in a residue that does not prefer α-helices as secondary structure. Amino acid properties The wild-type and new mutant amino acids differ in size. The mutant residue is smaller than the wild-type residue. The mutation will cause an empty space in the core of the protein (or protein-complex). Conclusion HOPE gives a possible reason why this mutation is damaging for the protein.

Mutation Thr261Ser

Structure The mutation is located within a domain, annotated as: "EF-hand 2". The mutation introduces an amino acid with different properties here which can disturb this domain and abolish its function. Amino acid properties The wild-type and new mutant amino acids differ in size. The mutant residue is smaller than the wild-type residue. The mutation will cause an empty space in the core of the protein (or protein-complex). Conclusion HOPE’s answer is based on the differences in amino acid properties. Information about the function of EF-hands can be found in literature but is not annotated in a database.

Mutation Glu381Lys

Amino acid properties There is a difference in charge between the wild-type and mutant amino acid. The charge of the buried wild-type residue is reversed by this mutation, this can cause repulsion between residues in the protein core. The wild-type and new mutant amino acids differ in size. The mutant residue is bigger than the wild-type residue. The wild-type residue was buried in the core of the protein. The mutant residue is bigger and probably will not fit. Conclusion HOPE can only use the differences in amino acid properties.

Manual analysis can be found online: http://www.cmbi.ru.nl/~hvensela/PRKCSH/

103

Applications of Homology Modelling 5

Project HOPE’s conclusion about SEC63 mutations

Method Model based on template PDB 2ctw (Ile120Thr/Asp168His), 2q0z (Arg257Ser/ Gln375Pro/ Trp651Gly/ Asp675Glu), no structure or modeling template for Arg217Cys Mutation Ile120Thr

Structure The mutation is located within a domain, annotated as: "J". The mutation introduces an amino acid with different properties here which can disturb this domain and abolish its function. The mutation is located within a topological domain (domain with certain spatial position) annotated in the Uniprot database as "Lumenal". The mutation might disrupt this topological domain. Amino acid properties The wild-type and new mutant amino acids differ in size. The mutant residue is smaller than the wild-type residue. The mutation will cause an empty space in the core of the protein (or protein-complex). The hydrophobicity of the wild-type and mutant residue differs. The mutation will cause loss of hydrophobic interactions in the core.

Conclusion HOPE correctly predicts the lost hydrophobic contacts in the core which will destabilize the J-domain.

Mutation Asp168His

Contacts The size difference between wild-type and mutant residue makes that the new residue is not in the correct position to make the same hydrogenbond as the original wild-type residue did. Structure The mutation is located within a topological domain (domain with certain spatial position) annotated in the Uniprot database as "Lumenal". The mutation might disrupt this topological domain. Amino acid properties There is a difference in charge between the wild-type and mutant amino acid. The charge of wild-type residue is lost by this mutation. This can cause loss of interactions with other molecules. The wild-type and mutant amino acids differ in size. The mutant residue is bigger than the wild-type residue. The residue is located on the surface of the protein, mutation of this residue can disturb interactions with other molecules or other parts of the protein.

Conclusion The model used by HOPE slightly differs from the one used in our manual analysis and therefore the accessibility differs. As the mutation is located in a terminal end, it is impossible to say which one of the predictions is correct. However, HOPE correctly predicts that a possible hydrogen bond might be lost.

104

Applications of Homology Modelling 5

Mutation Arg217Cys

Structure The mutation is located within a domain, annotated as: "SEC63 1". The mutation introduces an amino acid with different properties here which can disturb this domain and abolish its function. Amino acid properties There is a difference in charge between the wild-type and new mutant amino acid. The charge of the buried wild-type residue is lost by this mutation. The wild-type and mutant amino acids differ in size. The mutant residue is smaller than the wild-type residue. The mutation will cause an empty space in the core of the protein (or protein-complex). The hydrophobicity of the wild-type and mutant residue differs. The mutation can cause loss of hydrogenbonds in the core of the protein and as a result disturb correct folding. Conclusion HOPE can be correct about the loss of interactions in the core of the protein.

Mutation Arg267Ser

Contacts The size difference between wild-type and mutant residue makes that the new residue is not in the correct position to make the same hydrogenbond as the original wild-type residue did. The difference in hydrophobicity will affect hydrogenbond formation. The difference in charge will disturb the ionic interaction made by the original wild-type residue. Structure The mutation is located within a domain, annotated as: "SEC63 1". The mutation introduces an amino acid with different properties here which can disturb this domain and abolish its function. Amino acid properties There is a difference in charge between the wild-type and mutant amino acid. The charge of wild-type residue is lost by this mutation. This can cause loss of interactions with other molecules. The wild-type and mutant amino acids differ in size. The mutant residue is smaller than the wild-type residue. This will cause a possible loss of external interactions.

Conclusion HOPE is probably correct about the lost hydrogenbonds and ionic interactions. This analysis is even more complete than our manual analysis.

Mutation Glu375Pro

Structure The mutation is located within a domain, annotated as: "SEC63 1". The mutation introduces an amino acid with different properties here which can disturb this domain and abolish its function. According to the 3D-structure, the wild-type residue is located in an α-helix. Proline disrupts an α-helix when not located at one of the first 3 positions. In case of the mutation at hand, the helix will be disturbed and this can have severe effects on the structure of the protein. Amino acid properties The wild-type and mutant amino acids differ in size. The mutant residue is smaller than the wild-type residue. This will cause a possible loss of external interactions.

Conclusion HOPE predicts the damaging effect of a proline in a helix.

105

Applications of Homology Modelling 5

Mutation 6 Trp651Gly

Contacts The size difference between wild-type and mutant residue makes that the new residue is not in the correct position to make the same hydrogenbond as the original wild-type residue did. The difference in hydrophobicity will affect hydrogenbond formation. Structure The mutation is located within a domain, annotated as: "SEC63 2". The mutation introduces an amino acid with different properties here which can disturb this domain and abolish its function. The mutation introduces a glycine at this position. Glycines are very flexible and can disturb the required rigidity of the protein on this position. Amino acid properties The wild-type and new mutant amino acids differ in size. The mutant residue is smaller than the wild-type residue. The mutation will cause an empty space in the core of the protein (or protein-complex). The hydrophobicity of the wild-type and mutant residue differs. The mutation will cause loss of hydrophobic interactions in the core of the protein (or protein-complex).

Conclusion HOPE correctly finds that this mutation will cause an empty space in the core of the protein. HOPE also finds a rare hydrogen bond formed by the nitrogen in the Trp-side chain that will be lost.

Mutation Asp675Glu

Structure The mutation is located within a domain, annotated as: "SEC63 2". The mutation introduces an amino acid with different properties here which can disturb this domain and abolish its function. Amino acid properties The wild-type and new mutant amino acids differ in size. The mutant residue is bigger than the wild-type residue. The residue is located on the surface of the protein, mutation of this residue can disturb interactions with other molecules or other parts of the protein.

Conclusion Our manual analysis did not find an answer, but HOPE’s analysis underlines our hypothesis that this mutation affects interaction with other proteins.

Manual analysis can be found online: http://www.cmbi.ru.nl/~hvensela/PRKCSH/

106

Applications of Homology Modelling 5

Project: FZD4, LRP5 and NDP Title: Identification of 21 novel variants in FZD4, LRP5, NDP and an overview of the mutation spectrum in familial exudative vitreoretinopathy and Norrie disease. Authors: K. Nikopoulos – H. Venselaar – R. W. J. Collin – R. Riveiro-Alvarez -F. N. Boonstra – J. M. M. Hooymans – A. Mukhopadhyay – D. Shears – M. van Bers – I. J. de Wijs – R. H. Sijmons – M. A.D. Tilanus - C. E. van Nouhuys - C. Ayuso – L. H. Hoefsloot – F. P.M. Cremers Published in: Human mutation 2010 (31(6): 656-666)

We report 21 novel variants for FZD4, LRP5 and NDP that cause familial exudative vitreoretinopathy (FEVR) and Norrie disease. FEVR and Norrie disease share similar phenotypic characteristics including abnormal vascularisation of the peripheral retina with the formation of retinal folds, retinal detachment, and in many cases the formation of fibrovascular masses in the eye that can lead to blindness95-97. The LRP5, FZD4, and NDP genes encode proteins involved in the evolutionary highly conserved Wnt-signalling network which (among other roles) regulates eye organogenesis and angiogenesis. LRP5 and FZD4 act as co-receptors for Wnt ligands, such as NDP. Mutations in FZD4 and LRP5 are associated with autosomal dominant FEVR whereas X-linked recessive FEVR is caused by mutations in the Norrie disease gene (NDP)98-102. FZD4 is a member of the Frizzled gene family and encodes a protein of 537 amino acids. Similar to the other members of the Frizzled family it is composed of a seven transmembrane helix domain and an extracellular Frizzled (FZ) domain with unknown function. Mutations that have been reported for FZD4 include premature stop codons, missense changes, and one in-frame deletion but no splice mutations have been observed. Mutations are not clustering in a specific ‘hotspot’. All patients with FZD4 mutations are heterozygous carriers. LRP5 is a 1615 AA cell surface receptor that belongs to the low-density lipoprotein receptor (LDLR) superfamily. Its function is associated with receptor-mediated endocytosis of a wide spectrum of ligands. Upon ligand binding, LRP5 can act synergistically with FZD4 or other members of the Frizzled family by activating the canonical Wnt and/or Norrin/β-catenin pathway and induce the transcription of target genes103-105. LRP5 consists of four extracellular domains, each of which is composed of six segments. Those segments form a β-propeller structure that is followed by an EGF- like domain. Five of those segments are YWTD LDL-class B repeats, whereas the sixth does not contain the required YWTD motif to be recognized as a LDL-class B repeat. The first two propeller domains have been suggested to be important for interaction with the Wnt/Frizzled complex106. The four propeller domains are followed by three LDL-class A repeats (also known as LDL-receptor like ligand binding domains), one TM domain, and a cytoplasmic region with one or multiple short signals for receptor internalization through coated pits. The NDP gene encodes Norrin, a small (133 AA) and cysteine rich protein with low sequence similarity to the transforming growth factor (TGF)-β family of ligands100; 102; 107; 108. Norrin consists of two major parts: a signal peptide at the amino-terminus of the protein that directs its localization and a region containing a typical motif of six cysteines forming a cysteine-knot that provides the structural conformation required for receptor binding and subsequent signal transduction108.

Manual study of mutations in LRP5, FZD4 and NDP The 21 novel mutations reported here are scattered throughout the LRP5, FZD4, and NDP genes. We used YASARA2 with standard parameter settings to build a model of the β-propeller domain of LRP5. As a template we used the re-refined80 version of PDB-file 1npe which shares 41% sequence identity with the Nodigen protein (PDB-file 1npe109). A second model was made of the EGF-like domain of LRP5 based on PDB-file 1aut110. The sequence identity between this domain and the template was very low, but the presence of the six cysteines indicated that the EGF-like domain and EGF itself adopt similar structures. We were not able to build a model of FZD4 that contained the locations of the mutations. No modeling template with sufficient identity was found for NDP. However, this protein is known to contain a cysteine-knot in its structure and we used the structure of cysteine-

107

Applications of Homology Modelling 5 knot containing Sclerosterin as a template (PDB-file 2kd3111). Our model of NDP agrees with structures previously described112 and contains two antiparallel β-sheets and one α-helical structure. Analysis of mutations that were not covered by the models was done using sequence based predictions. Three novel missense mutations were found in FZD4: Glu40Gln, Cys204Tyr, and Gly525Arg. Glutamic acid 40 is located at the N-terminus, directly following the putative signal peptide at the FZ domain. It is predicted to be located in a coiled domain on the surface of the protein where interactions might occur. The function of the FZ domain is not fully elucidated, although it has been suggested that it may be important for interaction with other proteins such as Wnts or NDP113. Glutamic acid is a negatively charged amino acid and is often involved in ionic interactions; mutation into a glutamine will cause loss of the negative charge and as a result also cause loss of ionic interactions of the FZ domain with other protein domains. The Cys204Tyr variant is located in a stretch of residues that connects the FZ domain with the first TM domain of FZD4. Predictions indicate that the residue is placed in a large coiled region, buried in the core of the protein preceding the first TM helix. Mutation of a cysteine residue, possibly employed to form a disulfide bond, into a tyrosine, a hydrophilic amino acid with a large bulky side chain, is expected to lead to destabilization or even unfolding of the domain. The Gly252Arg mutation affects a buried amino acid located in the intracellular C-terminal tail of FZD4. Glycines are known for their flexibility and no other residue can make the same backbone angles. It is possible that this residue is required for a specific backbone formation. This is underlined by the fact that the glycine residue at position 525 is preceded by an asparagine (a residue often present in structural turns) and by another flexible glycine. The Gly525Arg variant changes the flexible glycine into a less flexible arginine, thereby disturbing the putative backbone structure. This could potentially affect the normal function of the conserved KTXXXW motif and PDZ binding in the Wnt/β-catenin pathway, thereby impeding the action of FZD4 receptor complex. Two new missense mutations were found in LRP5: Glu441Lys and Cys1253Phe. The Glu441Lys variant is located in the second ‘β-propeller’ domain of the protein (see figure 5.11B), at a highly conserved position. The negatively charged and hydrophilic glutamic acid creates ionic interactions with two positively charged arginine residues in its close proximity, thereby stabilizing the local structure of the domain. Introduction of an additional positive residue such as lysine is likely to disturb the ionic interactions, thereby destabilizing the domain structure and thus impeding its normal function. This would certainly have an effect on its potential interaction with NDP. The second variant Cys1253Phe, is located in the EGF-like domain following the third ‘β-propeller’ module of LRP5, and also affects a well-conserved residue. The third ‘β-propeller’ structure and the three LDL-class A repeats play a role in binding of ligands such as Dkk-1. The EGF-like domains are generally stabilized by three disulfide bridges. The introduction of a at the position of cysteine residue is predicted to disturb one of the existing disulfide bridges. Moreover, phenylalanine has a much bulkier side chain than compared with cysteine and might exert steric hindrance effects (see figure 5.11C). Six missense mutations were found in NDP. The missense change Cys55Arg influences a cysteine residue located on the ‘top’ of one of the anti-parallel β-sheets which forms a disulphide bridge with a cysteine on the top of the other β-sheet (see figure 5.12B). The exact function of this bond is unclear, but it could be speculated that it might be necessary to stabilize the conformation of the two β-sheet domains that contain hydrophobic residues that can interact with another NDP monomer. Without the disulfide bridge the two sheets will have more freedom to move, thereby making it difficult to form the correct structure for dimerization. Moreover, this disulfide was shown to be evolutionary conserved108. The glycine residue at position 67 is highly conserved among many NDP orthologues and the Gly67Glu and Gly67Arg variants change a small and flexible glycine into either a negatively charged residue (glutamic acid) or into a positively charged residue (arginine). Glycine is required at this position in order for the backbone to adopt the correct formation for the cysteine knot (figure 5.12C). No other residue can make the correct backbone angles, and therefore,

108

Applications of Homology Modelling 5 mutation of the glycine will disturb cysteine knot formation and thereby the structure of the protein. The missense mutation Phe89Leu is located in a highly conserved domain that adopts an α-helical structure. This amino acid is predicted to participate in the dimerization of NDP108. The hydrophobic phenylalanine interacts with other hydrophobic residues on the other monomer. Leucine is also hydrophobic, but has a smaller sidechain (figure 5.12D). It seems possible that this leucine could make the same type of interactions, but because of altered side chain conformation the dimer will be less strong and stable. The missense change Ser92Pro was identified in a simplex Norrie disease patient. This mutation affects a serine that precedes a cysteine in the cysteine knot motif, and. It is critical that the cysteines are placed correctly in NDP, as they are crucial for its structure. The missense change introduces proline (figure 5.12E), which is known to be a very rigid residue that often forces the backbone to bend and thereby might disturb the formation of the disulfide bridge and thus the conformation of NDP. Pro98Leu is located close to the last cysteine in the knot motif and to the unbound cysteine that is needed to form dimers (figure 5.12F). The local backbone structure that is created by proline, might be necessary for the correct positioning of cysteines. A leucine will cause more flexibility in the backbone that might result in difficulties during dimerization and/or formation of the cysteine knot. Another possibility is that the hydrophobic side chain of proline might be needed for interactions with another monomer.

Figure 5.11: Schematic representation of LRP5 and 3D modeling of LRP5 missense changes. A) Schematic overview of LRP5 and its domains. B) Overview of the second β-propeller domain of LRP5 and a close-up of the Glu441Lys change in the second β-propeller domain. The side chain of the wild-type residue (glutamic acid) in position 441 is shown in green, whereas the mutant lysine is shown in red. Two neighboring are coloured blue. C) Overview of one of the EGF-like domains and the Cys1253Phe change. Cysteines forming disulfide bridges are shown in yellow. One of these cysteines is mutated into a phenylalanine, depicted in red.

109

Applications of Homology Modelling 5

We have provided a comprehensive summary of novel mutations in FZD4, LRP5, and NDP that have been reported to affect these genes, with a focus on FEVR and Norrie disease. Our emerging understanding of Wnt-signaling offers hope that pharmaceutical agents that selectively target this pathway will be developed to treat some phenotypic characteristics of these diseases.

A B

Figure 5.12: A) Ribbon model of C D NDP. The protein is coloured in gray, the cysteine side chains that participate in forming disulfide bonds and creating the critical knot motif are shown in yellow. B-F) Close-ups of the NDP missense mutations. The wild- type and mutant residues are depicted in green and red, E F respectively.

Project HOPE’s conclusion about LRP5 mutations

Method Model based on template PDB-file 1npe for E441K and 2fcw for C1253 Mutation Glu441Lys

Contacts The size difference between wild-type and new mutant residue makes that the new residue is not in the correct position to make the same hydrogenbond as the original wild-type residue did. The difference in charge will disturb the ionic interaction made by the original wild-type residue. Structure The mutation is located within a stretch of residues identified as a special region: "Beta- propeller 2". The differences in amino acid properties can disturb this region and disturb its function. The mutation is located within a stretch of residues that is repeated in the protein, this repeat is named "LDL-receptor class B 7". The mutation into another residue might disturb this repeat and consequently any function this repeat might have. Amino acid properties There is a difference in charge between the wild-type and new mutant amino acid. The charge of the buried wild-type residue is reversed by this mutation, this can cause repulsion between residues in the protein core. The wild-type and new mutant amino acids differ in size. The new mutant residue is bigger than the wild-type residue. The wild-type residue was buried in the core of the protein (or protein-complex). The new mutant residue is bigger and probably will not fit.

110

Applications of Homology Modelling 5

Conclusion HOPE mentions correctly that the hydrogen bond and ionic interaction will be disturbed and that this will affect the LDL-repeat in the beta-propeller domain.

Mutation Cys1253Phe

Contacts In the 3D-structure can be seen that this residue is involved in a cysteine bridge, which is important for stability of the protein. Only cysteine can make these type of interactions, the mutation causes loss of this interaction and will have a severe effect on the 3D-structure of the protein. Together with loss of the cysteine bond, the differences between the old and new residue can cause destabilization of the structure. Structure The mutation is located within a domain, annotated as: "EGF-like 4". The mutation introduces an amino acid with different properties here which can disturb this domain and abolish its function. Amino acid properties The wild-type and new mutant amino acids differ in size. The new mutant residue is bigger than the wild-type residue. The wild-type residue was buried in the core of the protein. The new mutant residue is bigger and probably will not fit.

Conclusion Even though residues with Phenylalanine-like properties have been found on this position, it is very likely that HOPE is correct in predicting the lost cysteine-bond and introduction of a big side chain that will disturb the structure.

Project HOPE’s conclusion about FZD4 mutations

Method No structure or modeling template, reports based on predictions Mutation Glu40Gln

Structure The mutation is located within a domain, annotated as: "FZ". The mutation introduces an amino acid with different properties here which can disturb this domain and abolish its function. Amino acid properties There is a difference in charge between the wild-type and new mutant amino acid. The charge of wild-type residue is lost by this mutation. This can cause loss of interactions with other molecules. Conclusion HOPE’s conclusion is based on the amino acid properties and predictions only. Still, it gives a plausible reason for the phenotypic effect of this mutation.

111

Applications of Homology Modelling 5

Mutation Cys204Tyr

Amino acid properties The wild-type and new mutant amino acids differ in size. The new mutant residue is bigger than the wild-type residue. The wild-type residue was buried in the core of the protein. The new mutant residue is bigger and will probably not fit. The hydrophobicity of the wild- type and new mutant residue differs. The mutation will cause loss of hydrophobic interactions in the core (or protein-complex). Conclusion HOPE’s conclusion is based on the amino acid properties and predictions only. Still, it gives a plausible reason for the phenotypic effect of this mutation.

Mutation Gly525Arg

Structure The wild-type residue was a glycine, the most flexible residue of all. This flexibility might be necessary for the protein's function. Mutation of this glycine can abolish this function. Amino acid properties There is a difference in charge between the wild-type and new mutant amino acid. The new mutant residue introduces a charge in a buried residue, this can lead to protein folding problems. The wild-type and new mutant amino acids differ in size. The new mutant residue is bigger than the wild-type residue. The wild-type residue was buried in the core of the protein. The new mutant residue is bigger and probably will not fit. The hydrophobicity of the wild-type and new mutant residue differs. The mutation will cause loss of hydrophobic interactions in the core of the protein (or protein-complex). The glycine residue is the most flexible residue. It is possible that the residue is needed at this position to make a special backbone conformation or to facilitate movement of the protein. The mutation introduces a less flexible residue thereby disturbing this conformation or movement. Conclusion HOPE’s conclusion is based on the amino acid properties and predictions only. Using the fact that glycine is required for flexibility, it gives a plausible reason for the phenotypic effect of this mutation.

Project HOPE’s conclusion about NDP mutations

Method No structure or modeling template, reports based on predictions Mutation Cys55Arg

Contacts The wild-type residue is annotated to be involved in a cysteine bridge, which is important for stability of the protein. Only cysteines can make these type of bonds, the mutation causes loss of this interaction and will have a severe effect on the 3D-structure of the protein. Together with loss of the disulfid bridge, the differences between the old and new residue can cause destabilization of the structure Structure The mutation is located within a domain, annotated as: "CTCK". The mutation introduces an amino acid with different properties here which can disturb this domain and abolish its function. Conservation Only this residue type was found at this position. Mutation of a 100% conserved residue is usually damaging for the protein. The mutant and wild-type residue are not very similar. Based on this conservation information this mutation is probably damaging to the protein. Amino acid properties There is a difference in charge between the wild-type and new mutant amino acid. The new mutant residue introduces a charge in a buried residue, this can lead to protein folding problems. The wild-type and new mutant amino acids differ in size. The new mutant residue is bigger than the wild-type residue. The wild-type residue was buried in the core of the protein. The new mutant residue is bigger and probably will not fit. The hydrophobicity of the wild-type and new mutant residue differs. The mutation will cause loss of hydrophobic interactions in the core. Conclusion Correct prediction of a lost disulfide-bridge. The accessibility prediction might be incorrect.

112

Applications of Homology Modelling 5

Mutation Gly67Gln

Structure The mutation is located within a domain, annotated as: "CTCK". The mutation introduces an amino acid with different properties here which can disturb this domain and abolish its function. The wild-type residue was a glycine, the most flexible residue of all. This flexibility might be necessary for the protein's function. Mutation of this glycine can abolish this function. Conservation Only this residue type was found at this position. Mutation of a 100% conserved residue is usually damaging for the protein. The mutant and wild-type residue are not very similar. Based on this conservation information this mutation is probably damaging to the protein. Amino acid properties There is a difference in charge between the wild-type and new mutant amino acid. The mutation introduces a charge at this position, this can cause repulsion between the new mutant residue and neighboring residues. The wild-type and new mutant amino acids differ in size. The new mutant residue is bigger than the wild-type residue. The residue is located on the surface of the protein, mutation of this residue can disturb interactions with other molecules or other parts of the protein. The hydrophobicity of the wild-type and new mutant residue differs. The mutation might cause loss of hydrophobic interactions with other molecules on the surface of the protein. The glycine residue is the most flexible residue. It is possible that the residue is needed at this position to make a special backbone conformation or to facilitate movement of the protein. The mutation introduces a less flexible residue thereby disturbing this conformation or movement. Conclusion HOPE correctly uses all the differences in properties between Glycine and Glutamic acid. The idea that this residue enables a special conformation for neighboring cysteines cannot be predicted by HOPE yet as there is no structure to identify Cysteine bonds close to the mutated residue.

Mutation Gly67Arg

Structure The mutation is located within a domain, annotated as: "CTCK". The mutation introduces an amino acid with different properties here which can disturb this domain and abolish its function. The wild-type residue was a glycine, the most flexible residue of all. This flexibility might be necessary for the protein's function. Mutation of this glycine can abolish this function. Conservation Only this residue type was found at this position. Mutation of a 100% conserved residue is usually damaging for the protein. The mutant and wild-type residue are not very similar. Based on this conservation information this mutation is probably damaging to the protein. Amino acid properties There is a difference in charge between the wild-type and new mutant amino acid. The mutation introduces a charge at this position, this can cause repulsion between the new mutant residue and neighboring residues. The wild-type and new mutant amino acids differ in size. The new mutant residue is bigger than the wild-type residue. The residue is located on the surface of the protein, mutation of this residue can disturb interactions with other molecules or other parts of the protein. The hydrophobicity of the wild-type and new mutant residue differs. The mutation might cause loss of hydrophobic interactions with other molecules on the surface of the protein. The glycine residue is the most flexible residue. It is possible that the residue is needed at this position to make a special backbone conformation or to facilitate movement of the protein. The mutation introduces a less flexible residue thereby disturbing this conformation or movement. Conclusion HOPE correctly uses all the differences in properties between Glycine and Arginine. The fact that this residue enables a special conformation for neighbouring Cysteines cannot be predicted by HOPE yet as there is no structure to identify Cysteine bonds close to the mutated residue.

113

Applications of Homology Modelling 5

Mutation Phe89Leu

Structure The mutation is located within a domain, annotated as: "CTCK". The mutation introduces an amino acid with different properties here which can disturb this domain and abolish its function. Conservation Only this residue type was found at this position. Mutation of a 100% conserved residue is usually damaging for the protein. The mutant and wild-type residue are not very similar. Based on this conservation information this mutation is probably damaging to the protein. Amino acid properties The wild-type and new mutant amino acids differ in size. The new mutant residue is smaller than the wild-type residue. The mutation will cause an empty space in the core of the protein (or protein-complex). Conclusion HOPE can only use the differences in amino acid properties. The fact that this residue might be involved in dimer-interactions can be found in literature only. The accessibility prediction is probably incorrect.

Mutation Ser92Pro

Structure The mutation is located within a domain, annotated as: "CTCK". The mutation introduces an amino acid with different properties here which can disturb this domain and abolish its function. The proline that is introduced by this mutation is a very rigid residue. The mutation might abolish the required flexibility of the protein at this position. Amino acid properties The wild-type and new mutant amino acids differ in size. The new mutant residue is bigger than the wild-type residue. The wild-type residue was buried in the core of the protein. The new mutant residue is bigger and probably will not fit. The hydrophobicity of the wild- type and new mutant residue differs. The mutation can cause loss of hydrogenbonds in the core of the protein and as a result disturb correct folding. Conclusion HOPE correctly predicts the differences between Serine and Proline. However, it does not use the fact that the preceding residue is one of the important Cysteines.

Mutation Pro89Leu

Structure The mutation is located within a domain, annotated as: "CTCK". The mutation introduces an amino acid with different properties here which can disturb this domain and abolish its function. The wild-type residue was a proline. Prolines are known to be very rigid and therefore induce a special backbone conformation which might be required at this position. The mutation can disturb this special conformation. Amino acid properties The wild-type and new mutant amino acids differ in size. The new mutant residue is bigger than the wild-type residue. The wild-type residue was buried in the core of the protein. The new mutant residue is bigger and probably will not fit. Prolines are known to have a very rigid structure, sometimes forcing the the backbone in a specific conformation. Possibly, your mutation changes a proline with such a function into another residue, thereby disturbing the local structure. Conclusion HOPE correctly uses the differences in amino acid properties but could benefit from an analysis of neighboring residues.

Manual analysis can be found online: http://www.cmbi.ru.nl/~hvensela/PRKCSH/

114

Applications of Homology Modelling 5

Project: ACAD9 Title: Acyl-CoA dehydrogenase 9 is required for the biogenesis of oxidative phosphorylation complex. Authors: J. Nouws – L. Nijtmans – S. M. Houten – M. van den Brand –M. Huynen –H. Venselaar – S. Hoefs – J. Gloerich – J. Kronick – T. Hutchin – P. Willems – R. Rodenburg – R. Wanders – L. van den Heuvel – J. Smeitink – R. O. Vogel Published in: Cell Metabolism 2010 (12(3): 283-294)

We analysed 2 point mutations in Acyl-CoA dehydrogenase 9 (ACAD9) that were found in patients diagnosed with isolated complex 1 deficiency. This disease causes exercise intolerance, hypertrophic cardiomyopathy, lactic acidosis, and failure to thrive. The affected ACAD9 protein is a recently identified member of the acyl-CoA dehydrogenase family which closely resembles very long chain acyl-CoA dehydrogenases (VLCAD). Therefore, ACAD9 has been proposed to be involved in mitochondrial β-oxidation of long-chain fatty acids114. However, none of the patients showed symptoms that would point to a defect in β-oxidation of fatty acids. Results show that ACAD9 is required for mitochondrial complex 1 assembly which seemingly conflicts with its previously described role. Our finding is underlined by the fact that ACAD9 associates with the well-known mitochondrial assembly factors NDUFAF1 and Ecsit. Complex I (NADH: ubiquinone oxidoreductase) of the OXPHOS system plays a central role in mitochondrial metabolism as it is the entry point for NADH in the oxidative phosphorylation system. The biogenesis of complex I involves the combination of 45 subunits encoded by the mitochondrial and nuclear genomes, a process that requires the aid of specific chaperones such as AIF, Ecsit and Ind142, 115,116. The mechanisms by which these proteins aid and regulate the assembly of complex I remain to be elucidated. We found that ACAD9 mutations result in complex I deficiency and not in disturbed long- chain fatty acid oxidation. This strongly contrasts with its evolutionary ancestor VLCAD which we show is not required for complex I assembly and clearly plays a role in fatty acid oxidation.

Manual study of mutations in ACAD9 Sequencing the ACAD9 gene of patients with isolated complex I deficiency uncovered two missense mutations in two unrelated patients. One patient was homozygous for Arg518His; the other patient was compound heterozygous for a stop codon after 62 amino acids, and Glu413Lys. To analyze these mutations, a homology model of ACAD9 was built using the structure of human very long-chain Acyl- CoA dehydrogenase (VLCAD) as a template (PDB-file 3B96, 47% identity117). The model was built using YASARA’s homology modeling script with standard parameters48. A multiple sequence alignment of a series of ACAD9 and VLCAD proteins was made using ClustalX92. This alignment showed that both missense mutations affect highly conserved amino acids. The Arg518His mutation is located at the end of an α-helix with its side chain forming charged hydrogen bonds with glutamic acid 592. Substituting arginine by histidine will disturb the dimer formation due to the loss of stabilizing hydrogen bonds within this area (see figure 5.13A). Glu413 forms a hydrogenbond and ionic interaction with Arg417 at the dimerization surface. This interaction will be lost when the glutamic acid is replaced by charged lysine. Lysine has a positively charged side chain which will even repulse arginine (figure 5.13B). Among other results, our data also have clinical implications, as isolated complex I deficient patients should now also be screened for mutations in ACAD9.

115

Applications of Homology Modelling 5

Figure 5.13: Close-up of the two mutations found in the patients. The ACAD9 monomers are shown in green and blue (surface view) the wild type side chains are shown and coloured by element; carbon is cyan, nitrogen is blue, oxygen is red, and yellow dots indicate hydrogen bonds. The mutant residues are shown in grey. The locations of the mutations are indicated with arrows.

Project HOPE’s conclusion about ACAD9-mutations

Method Model based on template PDB 2uxw Mutation Glu413Lys

Contacts The size difference between wild-type and new mutant residue makes that the new residue is not in the correct position to make the same hydrogenbond as the original wild-type residue did. The difference in charge will disturb the ionic interaction made by the original wild-type residue. The residue is involved in lots of multimer contacts. The mutation introduces a bigger residue at this position, this can disturb the multimeric interactions.

Amino acid properties There is a difference in charge between the wild-type and new mutant amino acid. The charge of the buried wild-type residue is reversed by this mutation, this can cause repulsion between residues in the protein core. The wild-type and new mutant amino acids differ in size. The new mutant residue is bigger than the wild-type residue. The wild-type residue was buried in the core of the protein (or protein complex). The new mutant residue is bigger and probably will not fit.

Conclusion HOPE exactly mentions all the facts that were also seen in the manual analysis: hydrogenbond, ionic interaction, dimer surface, loss of charge and possible repulsion and size difference.

116

Applications of Homology Modelling 5

Mutation Arg518His

Contacts The size difference between wild-type and new mutant residue makes that the new residue is not in the correct position to make the same hydrogenbond as the original wild-type residue did. The difference in charge will disturb the ionic interaction made by the original wild-type residue. Amino acid properties There is a difference in charge between the wild-type and new mutant amino acid. The charge of wild-type residue is lost by this mutation. This can cause loss of interactions with other molecules. The wild-type and new mutant amino acids differ in size. The new mutant residue is smaller than the wild-type residue. This will cause a possible loss of external interactions.

Conclusion HOPE correctly predicts the loss of both the hydrogenbond and the ionic interaction and that this affects interaction with other molecules (or in this case, other parts of the same molecule). However, in our manual analysis we can see that disturbing the interaction with the other helix will also disturb dimerization. These long range effects are not yet included in HOPE.

Manual analysis can be found online: http://www.cmbi.ru.nl/~hvensela/ACAD9/

117

Applications of Homology Modelling 5

Project: EFG1 Title: Mutation in subdomain G’ of mitochondrial elongation factor G1 is associated with combined OXPHOS deficiency in fibroblasts but not in muscle Authors: P. Smits – H. Antonicka – P. M. van Hasselt – W. Weraarpachai – W. Haller – M. Schreurs – H. Venselaar – R.J. Rodenburg – J.A. Smeitink – L.P. van den Heuvel Published in: European Journal of Human Genetics 2010 (19(3):275-279)

We studied a mutation in mitochondrial elongation factor G1 (EFG1) that was found in a patient suffering from oxidative phosphorylation (OXPHOS) deficiency in fibroblast cells. EFG1 is a GTPase that catalyzes the translocation step of mitochondrial protein synthesis. During this step, the deacylated tRNA moves from the ribosomal peptidyl (P) to the exit (E) site, the peptidyl-tRNA moves from the acceptor (A) to the P site and the mRNA is advanced by one codon, preparing for a new elongation cycle118. EFG1 consists of five domains of which the GTP-binding domain I (or G-domain) contains a large insert, subdomain G’, that is indirectly required for translocation. The OXPHOS system consists of 5 large protein complexes located in the inner mitochondrial membrane that together form the major energy-generating process in our cells. Mitochondrial translation of a large number of the OXPHOS subunits is controlled by various nuclear encoded proteins. We studied patients with combined OXPHOS deficiencies, in whom complex II (the only complex that is entirely encoded by nuclear DNA) showed normal activities, and mutations in the mitochondrial genome as well as polymerase gamma were excluded. A screen of all mitochondrial translation factors revealed a mutation in the gene for EFG1 causing severe and rapidly progressing mitochondrial encephalopathy.

Manual study of the mutation in EFG1 The modeled structures of the wild-type and mutant EFG1 proteins were based on the crystal structure of Thermus thermophilus EFG (PDB-file 2bm0, 45% identity)119. Modeling, energy minimization and analyses were performed using the WHAT IF & YASARA Twinset2. The first 43 EFG1 residues could not be modeled. The mutation found in the patient converts an arginine at position 250 into a tryptophan. Sequence alignment of EFG1 orthologs in 8 different species shows that the arginine is highly conserved from humans to bacteria. Figure 5.14 depicts how the residue is located at the periphery of the protein and forms a saltbridge and hydrogenbond with glutamic acid 153; these interactions are lost when the tryptophan is introduced. Furthermore, a positive charge which might be important for interactions between helix A and other proteins, is lost because of the mutation. In addition, the large and aromatic tryptophan side chain will cause steric clashes with surrounding residues. Arginine 250 is located in helix A of the G’-domain. Several experiments have demonstrated that the bacterial G’ domain interacts with a ribosomal protein that promotes recruitment to the ribosome, greatly stimulates its GTP hydrolysis activity, controls the release of inorganic phosphate, and exerts indirect effects on translocation120-122. Mutation experiments in E. Coli resulted in severe defects in GTP hydrolysis in the presence of the ribosome and small defects in ribosomal translocation. Hence, the Arg250Trp mutation in EFG1 reported in the current study is presumed to hamper ribosome-dependent GTP hydrolysis, disturbing mitochondrial translation and oxidative phosphorylation and consequently causes fatal mitochondrial encephalopathy.

118

Applications of Homology Modelling 5

Figure 5.14: Homology modeling based on the crystal structure of the T. thermophilus EFG protein. The complete EFG structure (except for the first 43 amino acids) is shown at the left side. A close-up of the outlined region containing the mutation is show at the right side. The side chains of wild-type Arg250 and mutant Trp are depicted in green and red respectively. GDP is depicted in yellow. The close-up shows that the Trp residue fits in the structure; however, the hydrogen bond (yellow dots) between Arg and Glu 153 is lost.

Project HOPE’s conclusion about the EFG1 mutation

Method Model based on template PDB 2rdo Mutation Arg250Trp

Contacts The size difference between wild-type and new mutant residue makes that the new residue is not in the correct position to make the same hydrogenbond as the original wild-type residue did. The difference in hydrophobicity will affect hydrogenbond formation. The difference in charge will disturb the ionic interaction made by the original wild-type residue. Conservation The wild-type residue is very conserved, but a few other residue types have been observed at this position too. Your mutant residue or any other residue type with similar properties was not observed at this position in other homologous sequences. Based on conservation scores this mutation is probably damaging to the protein. Amino acid properties There is a difference in charge between the wild-type and new mutant amino acid. The charge of wild-type residue is lost by this mutation. This can cause loss of interactions with other molecules. The wild-type and new mutant amino acids differ in size. The new mutant residue is bigger than the wild-type residue. The residue is located on the surface of the protein, mutation of this residue can disturb interactions with other molecules or other parts of the protein. Conclusion HOPE correctly predicts the loss of the hydrogen bond and the problems caused by size and charge differences. HOPE suggests that this might affect interaction with other proteins. The model and information in literature underline that this can indeed be the case.

119

Applications of Homology Modelling 5

Project: SMAD3 Title: Mutations in SMAD3 cause a syndromic form of aortic aneurysms and dissections with early- onset osteoarthritis. Authors: I. M. van de Laar – R. A. Oldenburg – G. Pals – J. W. Roos-Hesselink – B. M. de Graaf - J. M. Verhagen – Y. M. Hoedemaekers – R. Willemsen – L. A. Severijnen – H. Venselaar – G. Vriend – P. M. Pattynama – M. Collée – D. Majoor-Krakauer – D. Poldermans – I. M. Frohn-Mulder – D. Micha – J. Timmermans – Y. Hilhorst-Hofstee – S. M. Bierma-Zeinstra- P. J. Willems – J. M. Kros – E. H. Oei – B. A. Oostra – M. W. Wessels A. M. Bertoli-Avella Published in: Nature Genetics 2011 (43(2): 121-126)

We studied a new syndrome caused by mutations in the SMAD3 gene. This syndrome is characterized with aneurysms, dissections, and tortuosity throughout the arterial tree in association with mild craniofacial features and skeletal and cutaneous anomalies. In contrast with other aneurysm syndromes, most of the affected individuals presented with early-onset osteoarthritis. SMAD3 encodes a member of the TGF-β pathway that is essential for TGF-β signal transmission. The protein forms a hetero-trimeric complex consisting of two SMAD3, and one SMAD4 molecule. Trimer-formation is required for translocation to the nucleus where the complex acts as a regulatory unit. The SMAD3 mutations are located in the MH2 domain of the protein; a region that is extremely well conserved among other species and other SMAD proteins123-126. The mutations in SMAD3 were identified by a genome-wide linkage analysis using 250K SNP arrays. Two candidate genes (SMAD3 and SMAD6) in the locus related to the disease were selected for further sequence analyses based on their roles in TGFβ-signaling. Two point-mutations in SMAD3 were found to segregate with the disease phenotype. Both Arg287Trp and Thr261Ile affect evolutionary highly conserved amino acids within the MH2 domain of SMAD3 and other SMAD-like proteins.

Manual study of mutations in SMAD3 The experimentally solved structure of the human SMAD3/4 complex (PDB-file 1u7f126) was used to predict the effect of the mutations on the structure and function of the protein. The re-refined structure of this complex was downloaded from the PDB-redo website80 and contains a SMAD3/SMAD3/SMAD4 trimer complex, of which each monomer contains the linker and the MH2 domain. The missense mutations substitute amino acids essential for trimer formation which in turn is crucial for the translocation of the SMAD complex to the nucleus and subsequent gene regulation. Both altered amino acids are located at or very close to the SMAD/SMAD interfaces. They make interactions required for the correct conformation of the loops that interact with the other monomers (see figure 5.15). Mutation of the threonine at position 261 into the bigger isoleucine will lead to steric clashes, conformational changes and disturbed dimer interactions. In a similar way, the replacement of a positively charged arginine at position 287 by the neutral and bigger tryptophan will cause loss of important hydrogenbond and ionic interactions, and will cause rearrangement of the residues and loops that interact with other SMAD3 and SMAD4 monomers (see figure 5.15). It has been shown that SMAD3 molecules in which the polar Arg287 is replaced by a non-polar residue were unable to form dimers with SMAD3 or SMAD4127. TGF-β signals cannot be propagated via SMAD3, if SMAD3 heterotrimer formation is affected. These findings implicate the TGF-β signaling pathway as the primary pharmacological target for the development of new treatment strategies for arterial wall anomalies and osteoarthritis. However, our limited understanding of the physiological functioning of this pathway and the paradoxical increased TGF-β signal propagation caused by mutations in key molecules within the pathway (SMAD3 and the TGF-β receptors) warrant further studies before clinical trials with TGF-β antagonists can commence.

120

Applications of Homology Modelling 5

Figure 5.15: Overview of the SMAD 3/4 trimeric complex of the MH2 domains and close-ups of the mutations. The trimeric complex is shown in the middle. Both SMAD3 molecules are coloured green and blue, the SMAD4 is coloured yellow. The mutated Ile261 and Arg287 are shown as magenta balls. Ile261 is located on the SMAD3/SMAD3 interface (blue/green) and Arg287 is located on the SMAD3/SMAD4 interface(green/yellow). Both close-ups show the difference between wild-type side chain (green sticks) and mutant side chain (red sticks).

Project HOPE’s conclusion about SMAD3 mutations

Method Analysis on existing structure PDB1mjs Mutation Thr261Ile

Contacts The size difference between wild-type and new mutant residue makes that the new residue is not in the correct position to make the same hydrogenbond as the original wild-type residue did. The difference in hydrophobicity will affect hydrogenbond formation. Structure The mutation is located within a domain, annotated as: "MH2". The mutation introduces an amino acid with different properties here which can disturb this domain and abolish its function. Amino acid properties The wild-type and new mutant amino acids differ in size. The new mutant residue is bigger than the wild-type residue. The residue is located on the surface of the protein, mutation of this residue can disturb interactions with other molecules or other parts of the protein.

Conclusion HOPE uses the monomeric structure of SMAD3 for its prediction. All facts are correct, but HOPE misses the interactions between this residue and the other monomer in the complex. In the future this will be solved by using complex information.

121

Applications of Homology Modelling 5

Mutation Arg287Trp

Contacts The size difference between wild-type and new mutant residue makes that the new residue is not in the correct position to make the same hydrogenbond as the original wild-type residue did. The difference in hydrophobicity will affect hydrogenbond formation. The difference in charge will disturb the ionic interaction made by the original wild-type residue. Structure The mutation is located within a domain, annotated as: "MH2". The mutation introduces an amino acid with different properties here which can disturb this domain and abolish its function. The mutation is located within a stretch of residues identified as a special region: "Sufficient for interaction with XPO4". The differences in amino acid properties can disturb this region and disturb its function. Amino acid properties There is a difference in charge between the wild-type and new mutant amino acid. The charge of wild-type residue is lost by this mutation. This can cause loss of interactions with other molecules. The wild-type and new mutant amino acids differ in size. The new mutant residue is bigger than the wild-type residue. The residue is located on the surface of the protein, mutation of this residue can disturb interactions with other molecules or other parts of the protein.

Conclusion HOPE calculates that the arginine residue is accessible and therefore does not mention the detrimental effects in the core of the protein. However, it does mention the eventual effect on interactions with other proteins.

Manual analysis can be found online: http://www.cmbi.ru.nl/~hvensela/smad3/

122

Applications of Homology Modelling 5

Project: IMPAD1 Title: Chondrodysplasia and Abnormal Joint Development Associated with Mutations in IMPAD1, Encoding the Golgi-Resident Nucleotide Phosphatase, gPAPP Authors: L.E.L.M. Vissers – E. Lausch – S. Unger – A. B. Campos-Xavier – C. Gilissen – A. Rossi – M. Del Rosario – H. Venselaar – U. Knoll – S. Nampoothiri – M. Nair – J. Spranger – H. G. Brunner – L. Bonafé – J. A. Veltman – B. Zabel – A. Superti-Furga Published in: The American Journal of Human Genetics 2011 (13;88(5):608-615)

We studied three patients from two unrelated consanguineous families suffering from congenital joint dislocations, chondrodysplasia with short stature, micrognathia, and cleft palate, and a distinctive face. Whole exome sequencing was used to identify the cause of this disease under the assumption of an autosomal recessive mutation in one gene. After prioritizing and filtering the results a single gene was found to contain mutations in all three patients. Patient 1 was found to be homozygous for the Asp177Asn mutation in IMPAD1, whereas patients 2 and 3 showed homozygosity for Thr183Pro. Mutation analysis of IMPAD1 in a fourth patient identified a premature termination codon at position Arg178. IMPAD1 encodes a Golgi-resident PAP phosphatase (gPAPP) that is responsible for the hydrolysis of phosphoadenosine phosphate (PAP), a byproduct of sulfotransferase reactions, to AMP128. This process prevents product inhibition of the sulfotransferase reactions and possibly allows for reverse transport of AMP from the Golgi to the cytoplasm128. Independent groups associated inactivation of IMPAD1 with skeletal dysplasia and abnormal joint formation in mice, possibly related to undersulfation of proteoglycans128-130. We report on the human phenotype associated with IMPAD1 mutations and confirm the predictions made by Frederick et al.128 on the basis of the mouse knockouts and biochemical studies.

Manual study of mutations in IMPAD1 To study the structural effects of the mutations we built a homology model with the EsyPred3D website, using PDB-file 2wef as a template (26% sequence identity). Subsequent analysis was performed using the SPDBV software version 4.0.1131. We also used HOPE to build a model (using the same 2wef as a template) for subsequent analysis. The manual interpretation of the model made by EsyPred3D and the automatic HOPE analysis are in agreement. Mutation Asp177Asn causes the loss of a negative charge needed to bind an essential magnesium ion in the active site, thereby directly affecting the phosphatase activity of the protein (see figure 5.16). Both approaches were able to identify the detrimental effect on an α-helix caused by mutation Thr183Pro. Prolines lack the hydrogen on their backbone nitrogen and can therefore not form the hydrogenbonds required for the correct formation of an α-helix132. Thr183 is located at the C-terminal side of a short α-helix that also contains substrate binding residues at its N-terminal side. Mutation of this residue into a proline will disturb the helix and can indirectly affect substrate binding and consequently affect the function of the protein (see figure 5.16).

We conclude that mutations found in this study will impair the enzymatic function of IMPAD1. Analysis of this gene should be included in the diagnostic procedure for infants presenting with congenital dislocations, chondrodysplasia, and short stature.

123

Applications of Homology Modelling 5

Figure 5.16: Overview of the protein in ribbon presentation. Middle: the protein is coloured gray, and the side chains of the mutated residue are coloured magenta and shown as small balls. Left and right: Close-up of the Thr183Pro and Asp177Asn mutations. Side chains of both the wild-type and the mutant residue are shown and are coloured green and red, respectively.

Project HOPE’s conclusion about IMPAD1 mutations

Method Model based on template PDB-file 2wef Mutation D177N

Contacts In the 3D-structure can be seen that the wild-type was involved in a metal-ion contact. The wild-type residue was negatively charged, the mutant residue is neutral. Because the negative charge is lost the interaction with the metal "MG" will be less stable, this can disturb the domain. In the 3D-structure can be seen that the wild-type residue makes interaction with a ligand annotated as AMP . The differences in properties between wild-type and mutation can easily cause loss of interactions with the ligand. Because ligand binding is often important for the protein's function, this function might be disturbed by this mutation. Structure The mutation is located within a stretch of residues annotated in Uniprot as a special region: "Substrate binding". The differences in amino acid properties can disturb this region and disturb its function. Amino acid properties There is a difference in charge between the wild-type and mutant amino acid. The charge of the buried wild-type residue (either in the protein or in the protein-complex) is lost by this mutation.

Conclusion HOPE finds correctly that D177 makes interactions with both the ligand and a magnesium-ion and that this mutation can therefore disturb the function of the protein.

124

Applications of Homology Modelling 5

Mutation T183P

Contacts The difference in hydrophobicity will affect hydrogenbond formation. Structure In the 3D-structure can be seen that the wild-type residue is located in an α-helix. Proline disrupts an α-helix when not located at one of the first two positions. In case of the mutation at hand, the helix will be disturbed and this can have severe effects on the structure of the protein. Amino acid properties The hydrophobicity of the wild-type and new mutant residue differs. The mutation can cause loss of hydrogenbonds in the core of the protein (or protein-complex) and as a result disturb correct folding.

Conclusion HOPE correctly finds that a hydrogenbond will be lost and that this mutation will disturb an α-helix. HOPE cannot yet see that the other side of this helix is involved in ligand binding.

125

Applications of Homology Modelling 5

Project: TMPRSS3 Title: Genotype-Phenotype Correlation in DFNB8/10 Families with TMPRSS3 Mutations Authors: Nicole J.D. Weegerink – M. Schraders – Jaap Oostrik – Patrick L.M. Huygen – Tim M. Strom – Susanne Granneman – Ronald J.E. Pennings – Hanka Venselaar – Lies H. Hoefsloot – Mariet Elting Published in: Journal of the Association for Otolaryngology 2011 (12(6):753-766)

Autosomal recessive nonsyndromic hearing impairment (arNSHI) is the most common type of inherited hearing impairment, accounting for approximately 80% of inherited prelingual hearing disorders. The phenotype associated with arNSHI is usually prelingual, non-progressive, and severe to profound133. So far, 40 causative genes have been identified to be involved in arNSHI, indicating enormous genetic heterogeneity134. We studied and characterized the genotype–phenotype correlations in eight Dutch DFNB8/10 families with compound heterozygous mutations in the transmembrane protease serine 3 (TMPRSS3). Mutations in TMPRSS3 are responsible for arNSHI with both prelingual onset (DFNB10135) and postlingual onset (DFNB8136). Four different missense mutations and two frameshift mutations were detected, including one novel variant (Val199Met), four previously described pathogenic variants (Ala306Thr, Thr70fs, Ala138Glu, Cys107fs), and Ala426Thr, which had previously been reported as a possible polymorphism137. We compared phenotypes of the families that had at least one mutation in TMPRSS3 in common. This enabled us to address the effect of the second mutation on the phenotype in these families.

Manual study of mutations in TMPRSS3 The effects of the amino acid substitutions on the TMPRSS3 structure were analyzed both manually and by using Project HOPE138. The exact 3D structure of TMPRSS3 is unknown; therefore, we built a model based on the homologous structure in PDB-file 1z8g139. The model was built using an automatic script in the WHAT IF & YASARA Twinset1; 2. The clinical data suggest that mutations Val199Met and Ala306Thr have a more severe effect than Ala138Glu and Ala426Thr. Ala306Thr and Ala426Thr occur in the serine protease domain and result in the mutation of a small, hydrophobic, conserved residue into a bigger and more hydrophilic residue. This is predicted to destabilize the domain. The difference in severity of the effect of the two mutations might well be explained by the location of the substituted residue. The Ala306Thr is located close to one of the active residues, Asp304, and therefore this substitution can be predicted to directly disturb the function of the serine protease domain. In contrast to Ala306, which is completely buried in the protein, Ala426 is only semi-buried. As the side chain of threonine is only slightly bigger than the alanine side chain, there could be enough space for the side chain of the threonine at this position (see figure 5.17). This makes Ala426Thr a less detrimental mutation than Ala306Thr. Ala138Glu and Val199Met occur within the scavenger receptor cysteine-rich (SRCR) domain, which is thought to be involved in interactions with extracellular molecules. These two mutations affect conserved residues and substitute the wild-type residue for a larger one. This is predicted to cause steric clashes with other residues and hence to destabilize the protein (see figure 5.17). Analyses of the 3D-model could not directly explain the difference in the effect of Ala138Glu and Val199Met mutations since information about the exact function and location of binding partners of the SRCR domain cannot be obtained from the model. The genotype–phenotype analysis suggests that not only the protein truncating mutation Thr70fs has a severe effect but also the amino acid substitutions Ala306Thr and Val199Met. A combination of two of these three mutations causes prelingual profound hearing impairment (DFNB10) whereas the combination with Ala426Thr or Ala138Glu leads to a milder phenotype with postlingual onset of the hearing impairment (DFNB8). Further studies are needed to distinguish possible phenotypic differences between different TMPRSS3 mutations.

126

Applications of Homology Modelling 5

Figure 5.17: Molecular modeling for TMPRSS3 missense mutations. Graphic representation of the effect of the Ala306Thr and Ala426Thr mutations in the serine protease domain (A) and of the Ala138Glu and Val199Met in the SRCR domain (B). The wild-type residues are depicted in green, while the mutant residues are shown in red. The yellow structure represents a substrate for the serine protease.

Project HOPE’s conclusion about TMPRSS3 mutations

Method Model based on template PDB-file 1z8g Mutation Ala138Glu

Structure The mutation is located within a domain, annotated as: "SRCR". The mutation introduces an amino acid with different properties here which can disturb this domain and abolish its function. Amino acid properties There is a difference in charge between the wild-type and new mutant amino acid. The new mutant residue introduces a charge in a buried residue, this can lead to protein folding problems. The wild-type and new mutant amino acids differ in size. The new mutant residue is bigger than the wild-type residue. The wild-type residue was buried in the core of the protein. The new mutant residue is bigger and probably will not fit. The hydrophobicity of the wild-type and new mutant residue differs. The mutation will cause loss of hydrophobic interactions in the core.

Conclusion HOPE is correct. The mutation introduces a bigger residue in the core of the protein which will cause steric clashes with the surrounding residues. The introduced residue is also more hydrophobic than the wild type residue, this can disturb the folding process because the new residue tends to make unwanted hydrogen bonds. The SRCR domain will be slightly destabilized. The report could be improved if we knew the exact function of this domain.

127

Applications of Homology Modelling 5

Mutation Val199Met

Structure The mutation is located within a domain, annotated as: "SRCR". The mutation introduces an amino acid with different properties here which can disturb this domain and abolish its function. Amino acid properties The wild-type and new mutant amino acids differ in size. The new mutant residue is bigger than the wild-type residue. The wild-type residue was buried in the core of the protein. The new mutant residue is bigger and probably will not fit.

Conclusion This is a straightforward example of a small residue that is mutated into something bigger. This does not fit and will cause steric clashed with other residues, and destabilize the domain. Again, it would be interesting to know the function of this SRCR domain.

Mutation Ala306Thr

Structure The mutation is located within a domain, annotated as: "Peptidase S1". The mutation introduces an amino acid with different properties here which can disturb this domain and abolish its function. Amino acid properties The wild-type and new mutant amino acids differ in size. The new mutant residue is bigger than the wild-type residue. The wild-type residue was buried in the core of the protein. The new mutant residue is bigger and probably will not fit. The hydrophobicity of the wild- type and new mutant residue differs. The mutation will cause loss of hydrophobic interactions in the core.

Conclusion HOPE is right about the size difference between wildtype alanine and the new residue threonine. The threonine will not fit here because it is too big. What HOPE cannot (yet) see is that this occurs close to the active site residues. The mutation will surely affect this domain and affect ligand binding.

128

Applications of Homology Modelling 5

Mutation Ala426Thr

Contacts The mutated residue is located very close to a residue that makes a cysteine bond. This cysteine bond itself is not mutated but could be affected by the mutation located in its vicinity. The mutated residue is not in direct contact with a ligand, however, the mutation could affect the local stability which in turn could affect the ligand-contacts made by one of the neighboring residues. Structure The mutation is located within a domain, annotated as: "Peptidase S1". The mutation introduces an amino acid with different properties here which can disturb this domain and abolish its function. Amino acid properties The wild-type and new mutant amino acids differ in size. The new mutant residue is bigger than the wild-type residue. The wild-type residue was buried in the core of the protein. The new mutant residue is bigger and probably will not fit. The hydrophobicity of the wild- type and new mutant residue differs. The mutation will cause loss of hydrophobic interactions in the core.

Conclusion The small residue is changed in a bigger one that does not fit. The introduction of a hydrophilic side chain will cause problems during folding. Ala 426 is located close to a cysteine bond (purple) that is important for the conformation surrounding the substraat. Mutation could affect substrate binding and the peptidase function of this domain. However, there seems to be more space here for the Threonine side chain than at position 306.

Manual analysis can be found online: http://www.cmbi.ru.nl/~hvensela/tmprss3/

129

Applications of Homology Modelling 5

Project: IER3IP1 Title: Microcephaly with simplified gyration, epilepsy, and infantile diabetes linked to inappropriate apoptosis of neural progenitors. Authors: C.J. Poulton – R. Schot – S.K. Kia – M. Jones – F.W. Verheijen – H. Venselaar – M.C. de Wit – E. de Graaff – A.M. Bertoli-Avela – G. M. Mancini Published in: The American Journal of Human Genetics 2011 (89(2): 265-276)

We describe a syndrome characterized with primary microcephaly (small brain size) with simplified gyral pattern in combination with severe infantile epileptic encephalopathy and early-onset permanent diabetes. Primary microcephaly is considered to be caused by a brain developmental defect140. So far, mutations in at least seven genes have been found that are causative for congenital primary microcephaly with normal or simplified gyration (MSG)141, whereas mutations in at least four genes have been identified in patients that suffer from MSG with additional features such as epilepsy141-143. Animal studies have shown that abnormal control of apoptosis is a possible cause for human MSG142. A disturbed apoptosis mechanism is also found in infantile-onset diabetes. In this study we ascertained 3 patients from two consanguineous families suffering from a combination of MSG, infantile encephalopathy, and infantile diabetes. Linkage analysis revealed an area with two homozygous missense mutations (Val21Gly and Leu78Pro) in immediate early response 3 interacting protein 1 (IER3IP1) in patients from both families. IER3IP1 is highly expressed in the fetal brain cortex and fetal pancreas and is thought to be involved in endoplasmic reticulum stress response. IER3IP1 encodes a small polypeptide of 82 amino acids. The protein is predicted to contain two hydrophobic transmembrane domains of which the first one possibly contains a signal peptide144, and a hydrophilic, glycine-rich domain (G-patch) that putatively interacts with RNA.

Manual study of mutations in IER3IP1 No protein structure or possible modeling template was known for IER3IP1. Therefore, an extensive in silico analysis was performed using a series of different prediction servers including Project HOPE138. The Val21Gly change introduces a glycine just before the G-patch. Signal P 3.0 suggests that this mutation is very close to a putative peptide cleavage site, possibly interfering with the cleavage of the signal sequence144. Project HOPE predicts that this mutation will affect a residue in a transmembrane domain, and will introduce a more flexible residue at this position. The Leu78Pro change is located in the second transmembrane domain. Project HOPE predicts that the mutation disturbs an α-helix which will have severe effects on the structure of the protein138. In silico analysis of both mutations using several other prediction programs, such as SIFT145, PMUT146 and PolyPhen147, predicted a pathogenic/non-neutral effect.

The linkage and IER3IP1 sequence data, together with the expression pattern during brain development and its involvement in regulation of apoptosis in vitro indicate that the mutations observed are causally related to their disease. This directly implicates IER3IP1 in the regulation of cell survival. Identification of IER3IP1 mutations sheds light on the mechanisms of brain development and on the pathogenesis of infantile epilepsy and early-onset permanent diabetes.

130

Applications of Homology Modelling 5

Project HOPE’s conclusion about IER3IP1 mutations

Method No structure or modeling template, report based on predictions Mutation Val21Gly

Structure The residue is located in a region annotated in the Uniprot database as a transmembrane domain. The new mutant residue is smaller than the wild-type residue. This can disturb either the contacts with the other transmembrane domains or with the lipid-membrane. The wild-type residue was more hydrophobic than the new mutant residue. This can affect the hydrophobic interactions within the core of the protein or with the membrane lipids. The mutation introduces a glycine at this position. Glycines are very flexible and can disturb the required rigidity of the protein on this position. Amino acid properties The wild-type and new mutant amino acids differ in size. The new mutant residue is smaller than the wild-type residue. The mutation will cause an empty space in the core of the protein (or protein-complex). The hydrophobicity of the wild-type and new mutant residue differs. The mutation will cause loss of hydrophobic interactions in the core of the protein (or protein-complex). Conclusion There is hardly any information known for this protein but HOPE does a good job using prediction servers.

Mutation Leu78Pro

Structure The residue is located in a region annotated in the Uniprot database as a transmembrane domain. The new mutant residue is smaller than the wild-type residue. This can disturb either the contacts with the other transmembrane domains or with the lipid-membrane. The wild-type residue is predicted (by the PHDacc DAS-server) to be located in an α-helix. Proline disrupts an α-helix when not located at one of the first two positions. In case of the mutation at hand, the helix will be disturbed and this can have severe effects on the structure of the protein. Amino acid properties The wild-type and new mutant amino acids differ in size. The new mutant residue is smaller than the wild-type residue. The mutation will cause an empty space in the core of the protein (or protein-complex). Conclusion Even though there is hardly any information at all, HOPE gives a very plausible conclusion about the effect of this mutation.

Manual analyses can be found online: http://www.cmbi.ru.nl/~hvensela/IER3IP1/

131

Applications of Homology Modelling 5

References 1. Vriend, G. (1990). WHAT IF: a molecular modeling and drug design program. J Mol Graph 8, 52-56, 29. 2. Krieger, E., Koraimann, G., and Vriend, G. (2002). Increasing the precision of comparative models with YASARA NOVA--a self-parameterizing force field. Proteins 47, 393-402. 3. McGuffin, L.J., Bryson, K., and Jones, D.T. (2000). The PSIPRED protein structure prediction server. Bioinformatics 16, 404-405. 4. Kall, L., Krogh, A., and Sonnhammer, E.L. (2004). A combined transmembrane topology and signal peptide prediction method. J Mol Biol 338, 1027-1036. 5. Blom, N., Gammeltoft, S., and Brunak, S. (1999). Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. J Mol Biol 294, 1351-1362. 6. Le Gac, G., and Ferec, C. (2005). The molecular genetics of haemochromatosis. Eur J Hum Genet 13, 1172- 1185. 7. Feder, J.N., Penny, D.M., Irrinki, A., Lee, V.K., Lebron, J.A., Watson, N., Tsuchihashi, Z., Sigal, E., Bjorkman, P.J., and Schatzman, R.C. (1998). The hemochromatosis gene product complexes with the transferrin receptor and lowers its affinity for ligand binding. Proc Natl Acad Sci U S A 95, 1472-1477. 8. Bennett, M.J., Lebron, J.A., and Bjorkman, P.J. (2000). Crystal structure of the hereditary haemochromatosis protein HFE complexed with transferrin receptor. Nature 403, 46-53. 9. Goswami, T., and Andrews, N.C. (2006). Hereditary hemochromatosis protein, HFE, interaction with transferrin receptor 2 suggests a molecular mechanism for mammalian iron sensing. J Biol Chem 281, 28494-28498. 10. Feder, J.N., Gnirke, A., Thomas, W., Tsuchihashi, Z., Ruddy, D.A., Basava, A., Dormishian, F., Domingo, R., Jr., Ellis, M.C., Fullan, A., et al. (1996). A novel MHC class I-like gene is mutated in patients with hereditary haemochromatosis. Nat Genet 13, 399-408. 11. Waheed, A., Parkkila, S., Zhou, X.Y., Tomatsu, S., Tsuchihashi, Z., Feder, J.N., Schatzman, R.C., Britton, R.S., Bacon, B.R., and Sly, W.S. (1997). Hereditary hemochromatosis: effects of C282Y and H63D mutations on association with beta2-microglobulin, intracellular processing, and cell surface expression of the HFE protein in COS-7 cells. Proc Natl Acad Sci U S A 94, 12384-12389. 12. Lebron, J.A., Bennett, M.J., Vaughn, D.E., Chirino, A.J., Snow, P.M., Mintier, G.A., Feder, J.N., and Bjorkman, P.J. (1998). Crystal structure of the hemochromatosis protein HFE and characterization of its interaction with transferrin receptor. Cell 93, 111-123. 13. Barton, J.C., Sawada-Hirai, R., Rothenberg, B.E., and Acton, R.T. (1999). Two novel missense mutations of the HFE gene (I105T and G93R) and identification of the S65C mutation in Alabama hemochromatosis probands. Blood Cells Mol Dis 25, 147-155. 14. Piperno, A., Arosio, C., Fossati, L., Vigano, M., Trombini, P., Vergani, A., and Mancia, G. (2000). Two novel nonsense mutations of HFE gene in five unrelated italian patients with hemochromatosis. Gastroenterology 119, 441-445. 15. Morton, C.C., and Nance, W.E. (2006). Newborn hearing screening--a silent revolution. N Engl J Med 354, 2151-2164. 16. Ansar, M., Din, M.A., Arshad, M., Sohail, M., Faiyaz-Ul-Haque, M., Haque, S., Ahmad, W., and Leal, S.M. (2003). A novel autosomal recessive non-syndromic deafness locus (DFNB35) maps to 14q24.1- 14q24.3 in large consanguineous kindred from Pakistan. Eur J Hum Genet 11, 77-80. 17. Shiau, A.K., Barstad, D., Loria, P.M., Cheng, L., Kushner, P.J., Agard, D.A., and Greene, G.L. (1998). The structural basis of estrogen receptor/coactivator recognition and the antagonism of this interaction by tamoxifen. Cell 95, 927-937. 18. Greschik, H., Wurtz, J.M., Sanglier, S., Bourguet, W., van Dorsselaer, A., Moras, D., and Renaud, J.P. (2002). Structural and functional evidence for ligand-independent transcriptional activation by the estrogen- related receptor 3. Mol Cell 9, 303-313. 19. Holt, J.A., Consler, T.G., Williams, S.P., Ayscue, A.H., Leesnitzer, L.M., Wisely, G.B., and Billin, A.N. (2003). Helix 1/8 interactions influence the activity of nuclear receptor ligand-binding domains. Mol Endocrinol 17, 1704-1714. 20. Belsham, D.D., Pereira, F., Greenberg, C.R., Liao, S., and Wrogemann, K. (1995). Leu-676-Pro mutation of the androgen receptor causes complete androgen insensitivity syndrome in a large Hutterite kindred. Hum Mutat 5, 28-33. 21. Eng, F.C., Lee, H.S., Ferrara, J., Willson, T.M., and White, J.H. (1997). Probing the structure and function of the estrogen receptor ligand binding domain by analysis of mutants with altered transactivation characteristics. Mol Cell Biol 17, 4644-4653.

132

Applications of Homology Modelling 5

22. Ahmed, Z.M., Goodyear, R., Riazuddin, S., Lagziel, A., Legan, P.K., Behra, M., Burgess, S.M., Lilley, K.S., Wilcox, E.R., Griffith, A.J., et al. (2006). The tip-link antigen, a protein associated with the transduction complex of sensory hair cells, is protocadherin-15. J Neurosci 26, 7022-7034. 23. Alattia, J.R., Ames, J.B., Porumb, T., Tong, K.I., Heng, Y.M., Ottensmeyer, P., Kay, C.M., and Ikura, M. (1997). Lateral self-assembly of E-cadherin directed by cooperative calcium binding. FEBS Lett 417, 405-408. 24. He, W., Cowin, P., and Stokes, D.L. (2003). Untangling desmosomal knots with electron tomography. Science 302, 109-113. 25. Prakasam, A.K., Maruthamuthu, V., and Leckband, D.E. (2006). Similarities between heterophilic and homophilic cadherin adhesion. Proc Natl Acad Sci U S A 103, 15434-15439. 26. Kazmierczak, P., Sakaguchi, H., Tokita, J., Wilson-Kubalek, E.M., Milligan, R.A., Muller, U., and Kachar, B. (2007). Cadherin 23 and protocadherin 15 interact to form tip-link filaments in sensory hair cells. Nature 449, 87-91. 27. Ahmed, Z.M., Riazuddin, S., Ahmad, J., Bernstein, S.L., Guo, Y., Sabar, M.F., Sieving, P., Griffith, A.J., Friedman, T.B., Belyantseva, I.A., et al. (2003). PCDH15 is expressed in the neurosensory epithelium of the eye and ear and mutant alleles are responsible for both USH1F and DFNB23. Hum Mol Genet 12, 3215-3223. 28. Ahmed, Z.M., Smith, T.N., Riazuddin, S., Makishima, T., Ghosh, M., Bokhari, S., Menon, P.S., Deshmukh, D., Griffith, A.J., Friedman, T.B., et al. (2002). Nonsyndromic recessive deafness DFNB18 and Usher syndrome type IC are allelic mutations of USHIC. Hum Genet 110, 527-531. 29. Bork, J.M., Peters, L.M., Riazuddin, S., Bernstein, S.L., Ahmed, Z.M., Ness, S.L., Polomeno, R., Ramesh, A., Schloss, M., Srisailpathy, C.R., et al. (2001). Usher syndrome 1D and nonsyndromic autosomal recessive deafness DFNB12 are caused by allelic mutations of the novel cadherin-like gene CDH23. Am J Hum Genet 68, 26-37. 30. Riazuddin, S., Nazli, S., Ahmed, Z.M., Yang, Y., Zulfiqar, F., Shaikh, R.S., Zafar, A.U., Khan, S.N., Sabar, F., Javid, F.T., et al. (2008). Mutation spectrum of MYO7A and evaluation of a novel nonsyndromic deafness DFNB2 allele with residual function. Hum Mutat 29, 502-511. 31. Bonifacio, M.J., Archer, M., Rodrigues, M.L., Matias, P.M., Learmonth, D.A., Carrondo, M.A., and Soares-Da- Silva, P. (2002). Kinetics and crystal structure of catechol-o-methyltransferase complex with co- substrate and a novel inhibitor with potential therapeutic application. Mol Pharmacol 62, 795-805. 32. Kleefstra, T., Brunner, H.G., Amiel, J., Oudakker, A.R., Nillesen, W.M., Magee, A., Genevieve, D., Cormier- Daire, V., van Esch, H., Fryns, J.P., et al. (2006). Loss-of-function mutations in euchromatin histone methyl transferase 1 (EHMT1) cause the 9q34 subtelomeric deletion syndrome. Am J Hum Genet 79, 370-377. 33. Kleefstra, T., Koolen, D.A., Nillesen, W.M., de Leeuw, N., Hamel, B.C., Veltman, J.A., Sistermans, E.A., van Bokhoven, H., van Ravenswaay, C., and de Vries, B.B. (2006). Interstitial 2.2 Mb deletion at 9q34 in a patient with mental retardation but without classical features of the 9q subtelomeric deletion syndrome. Am J Med Genet A 140, 618-623. 34. Kleefstra, T., Smidt, M., Banning, M.J., Oudakker, A.R., Van Esch, H., de Brouwer, A.P., Nillesen, W., Sistermans, E.A., Hamel, B.C., de Bruijn, D., et al. (2005). Disruption of the gene Euchromatin Histone Methyl Transferase1 (Eu-HMTase1) is associated with the 9q34 subtelomeric deletion syndrome. J Med Genet 42, 299-306. 35. Martin, C., and Zhang, Y. (2005). The diverse functions of histone lysine methylation. Nat Rev Mol Cell Biol 6, 838-849. 36. Tachibana, M., Ueda, J., Fukuda, M., Takeda, N., Ohta, T., Iwanari, H., Sakihama, T., Kodama, T., Hamakubo, T., and Shinkai, Y. (2005). Histone methyltransferases G9a and GLP form heteromeric complexes and are both crucial for methylation of euchromatin at H3-K9. Genes Dev 19, 815-826. 37. Ogawa, H., Ishiguro, K., Gaubatz, S., Livingston, D.M., and Nakatani, Y. (2002). A complex with chromatin modifiers that occupies E2F- and Myc-responsive genes in G0 cells. Science 296, 1132-1136. 38. Wu, H., Min, J., Lunin, V.V., Antoshenko, T., Dombrovski, L., Zeng, H., Allali-Hassani, A., Campagna-Slater, V., Vedadi, M., Arrowsmith, C.H., et al. Structural biology of human H3K9 methyltransferases. PLoS One 5, e8570. 39. Smeitink, J., van den Heuvel, L., and DiMauro, S. (2001). The genetics and pathology of oxidative phosphorylation. Nat Rev Genet 2, 342-352. 40. Smeitink, J.A., Zeviani, M., Turnbull, D.M., and Jacobs, H.T. (2006). Mitochondrial medicine: a metabolic perspective on the pathology of oxidative phosphorylation disorders. Cell Metab 3, 9-13. 41. Thorburn, D.R. (2004). Mitochondrial disorders: prevalence, myths and advances. J Inherit Metab Dis 27, 349-362.

133

Applications of Homology Modelling 5

42. Vogel, R.O., Smeitink, J.A., and Nijtmans, L.G. (2007). Human mitochondrial complex I assembly: a dynamic and versatile process. Biochim Biophys Acta 1767, 1215-1227. 43. Vogel, R.O., Dieteren, C.E., van den Heuvel, L.P., Willems, P.H., Smeitink, J.A., Koopman, W.J., and Nijtmans, L.G. (2007). Identification of mitochondrial complex I assembly intermediates by tracing tagged NDUFS3 demonstrates the entry point of mitochondrial subunits. J Biol Chem 282, 7582-7590. 44. Ugalde, C., Vogel, R., Huijbens, R., Van Den Heuvel, B., Smeitink, J., and Nijtmans, L. (2004). Human mitochondrial complex I assembles through the combination of evolutionary conserved modules: a framework to interpret complex I deficiencies. Hum Mol Genet 13, 2461-2472. 45. Antonicka, H., Ogilvie, I., Taivassalo, T., Anitori, R.P., Haller, R.G., Vissing, J., Kennaway, N.G., and Shoubridge, E.A. (2003). Identification and characterization of a common set of complex I assembly intermediates in mitochondria from patients with complex I deficiency. J Biol Chem 278, 43081- 43088. 46. Lazarou, M., McKenzie, M., Ohtake, A., Thorburn, D.R., and Ryan, M.T. (2007). Analysis of the assembly profiles for mitochondrial- and nuclear-DNA-encoded subunits into complex I. Mol Cell Biol 27, 4228- 4237. 47. Lazarou, M., Thorburn, D.R., Ryan, M.T., and McKenzie, M. (2009). Assembly of mitochondrial complex I and defects in disease. Biochim Biophys Acta 1793, 78-88. 48. Krieger, E., Darden, T., Nabuurs, S.B., Finkelstein, A., and Vriend, G. (2004). Making optimal use of empirical energy functions: force-field parameterization in crystal space. Proteins 57, 678-683. 49. Jones, S., and Thornton, J.M. (1996). Principles of protein-protein interactions. Proc Natl Acad Sci U S A 93, 13-20. 50. Chakrabarti, P., and Janin, J. (2002). Dissecting protein-protein recognition sites. Proteins 47, 334-343. 51. Petersen, M.B., Wang, Q., and Willems, P.J. (2008). Sex-linked deafness. Clin Genet 73, 14-23. 52. Cremers, C.W., Snik, A.F., Huygen, P.L., Joosten, F.B., and Cremers, F.P. (2002). X-linked mixed deafness syndrome with congenital fixation of the stapedial footplate and perilymphatic gusher (DFN3). Adv Otorhinolaryngol 61, 161-167. 53. de Kok, Y.J., Merkx, G.F., van der Maarel, S.M., Huber, I., Malcolm, S., Ropers, H.H., and Cremers, F.P. (1995). A duplication/paracentric inversion associated with familial X-linked deafness (DFN3) suggests the presence of a regulatory element more than 400 kb upstream of the POU3F4 gene. Hum Mol Genet 4, 2145-2150. 54. Cremers, C.W.C., F.R. Kremer, H. (2008). POU3F4 and mixed deafness with temporal defect (DFN3). In Inborn Errors of Development, C.J.E. Epstein, R.P. Wynshaw-Boris, ed. (New York, University Press), pp 1042-1047. 55. Hara, Y., Rovescalli, A.C., Kim, Y., and Nirenberg, M. (1992). Structure and evolution of four POU domain genes expressed in mouse brain. Proc Natl Acad Sci U S A 89, 3280-3284. 56. Le Moine, C., and Young, W.S., 3rd. (1992). RHS2, a POU domain-containing gene, and its expression in developing and adult rat. Proc Natl Acad Sci U S A 89, 3285-3289. 57. Mathis, J.M., Simmons, D.M., He, X., Swanson, L.W., and Rosenfeld, M.G. (1992). Brain 4: a novel mammalian POU domain transcription factor exhibiting restricted brain-specific expression. EMBO J 11, 2551-2561. 58. Klemm, J.D., Rould, M.A., Aurora, R., Herr, W., and Pabo, C.O. (1994). Crystal structure of the Oct-1 POU domain bound to an octamer site: DNA recognition with tethered DNA-binding modules. Cell 77, 21- 32. 59. Chinea, G., Padron, G., Hooft, R.W., Sander, C., and Vriend, G. (1995). The use of position-specific rotamers in model building by homology. Proteins 23, 415-421. 60. De Filippis, V., Sander, C., and Vriend, G. (1994). Predicting local structural changes that result from point mutations. Protein Eng 7, 1203-1208. 61. Christie, M.J. (1995). Molecular and functional diversity of K+ channels. Clin Exp Pharmacol Physiol 22, 944- 951. 62. Dolly, J.O., and Parcej, D.N. (1996). Molecular properties of voltage-gated K+ channels. J Bioenerg Biomembr 28, 231-253. 63. Armstrong, C.M. (2003). Voltage-gated K channels. Sci STKE 2003, re10. 64. Hille, B. (2001). Ion Channels of Excitable Membranes.(Sunderland, MA: Sinauer Associates Inc.). 65. Browne, D.L., Gancher, S.T., Nutt, J.G., Brunt, E.R., Smith, E.A., Kramer, P., and Litt, M. (1994). Episodic ataxia/myokymia syndrome is associated with point mutations in the human potassium channel gene, KCNA1. Nat Genet 8, 136-140.

134

Applications of Homology Modelling 5

66. Eunson, L.H., Rea, R., Zuberi, S.M., Youroukos, S., Panayiotopoulos, C.P., Liguori, R., Avoni, P., McWilliam, R.C., Stephenson, J.B., Hanna, M.G., et al. (2000). Clinical, genetic, and expression studies of mutations in the potassium channel gene KCNA1 reveal new phenotypic variability. Ann Neurol 48, 647-656. 67. Zuberi, S.M., Eunson, L.H., Spauschus, A., De Silva, R., Tolmie, J., Wood, N.W., McWilliam, R.C., Stephenson, J.B., Kullmann, D.M., and Hanna, M.G. (1999). A novel mutation in the human voltage-gated potassium channel gene (Kv1.1) associates with episodic ataxia type 1 and sometimes with partial epilepsy. Brain 122 ( Pt 5), 817-825. 68. Klein, A., Boltshauser, E., Jen, J., and Baloh, R.W. (2004). Episodic ataxia type 1 with distal weakness: a novel manifestation of a potassium channelopathy. Neuropediatrics 35, 147-149. 69. Shook, S.J., Mamsa, H., Jen, J.C., Baloh, R.W., and Zhou, L. (2008). Novel mutation in KCNA1 causes episodic ataxia with paroxysmal dyspnea. Muscle Nerve 37, 399-402. 70. Lee, H., Wang, H., Jen, J.C., Sabatti, C., Baloh, R.W., and Nelson, S.F. (2004). A novel mutation in KCNA1 causes episodic ataxia without myokymia. Hum Mutat 24, 536. 71. Long, S.B., Campbell, E.B., and Mackinnon, R. (2005). Voltage sensor of Kv1.2: structural basis of electromechanical coupling. Science 309, 903-908. 72. Grottesi, A., Sands, Z.A., and Sansom, M.S. (2005). Potassium channels: complete and undistorted. Curr Biol 15, R771-774. 73. Aggarwal, S.K., and MacKinnon, R. (1996). Contribution of the S4 segment to gating charge in the Shaker K+ channel. Neuron 16, 1169-1177. 74. Sands, Z., Grottesi, A., and Sansom, M.S. (2005). Voltage-gated ion channels. Curr Biol 15, R44-47. 75. Swartz, K.J. (2004). Towards a structural view of gating in potassium channels. Nat Rev Neurosci 5, 905- 916. 76. Liman, E.R., Hess, P., Weaver, F., and Koren, G. (1991). Voltage-sensing residues in the S4 region of a mammalian K+ channel. Nature 353, 752-756. 77. Pathak, M.M., Yarov-Yarovoy, V., Agarwal, G., Roux, B., Barth, P., Kohout, S., Tombola, F., and Isacoff, E.Y. (2007). Closing in on the resting state of the Shaker K(+) channel. Neuron 56, 124-140. 78. Glaudemans, B., van der Wijst, J., Scola, R.H., Lorenzoni, P.J., Heister, A., van der Kemp, A.W., Knoers, N.V., Hoenderop, J.G., and Bindels, R.J. (2009). A missense mutation in the Kv1.1 voltage-gated potassium channel-encoding gene KCNA1 is linked to human autosomal dominant hypomagnesemia. J Clin Invest 119, 936-942. 79. Long, S.B., Tao, X., Campbell, E.B., and MacKinnon, R. (2007). Atomic structure of a voltage-dependent K+ channel in a lipid membrane-like environment. Nature 450, 376-382. 80. Joosten, R.P., Womack, T., Vriend, G., and Bricogne, G. (2009). Re-refinement from deposited X-ray data can deliver improved models for most PDB entries. Acta Crystallogr D Biol Crystallogr 65, 176-185. 81. Desmet, V.J. (1992). Congenital diseases of intrahepatic bile ducts: variations on the theme "ductal plate malformation". Hepatology 16, 1069-1083. 82. Waanders, E., Croes, H.J., Maass, C.N., te Morsche, R.H., van Geffen, H.J., van Krieken, J.H., Fransen, J.A., and Drenth, J.P. (2008). Cysts of PRKCSH mutated polycystic liver disease patients lack hepatocystin but express Sec63p. Histochem Cell Biol 129, 301-310. 83. Tahvanainen, P., Tahvanainen, E., Reijonen, H., Halme, L., Kaariainen, H., and Hockerstedt, K. (2003). Polycystic liver disease is genetically heterogeneous: clinical and linkage studies in eight Finnish families. J Hepatol 38, 39-43. 84. Qian, Q., Li, A., King, B.F., Kamath, P.S., Lager, D.J., Huston, J., 3rd, Shub, C., Davila, S., Somlo, S., and Torres, V.E. (2003). Clinical profile of autosomal dominant polycystic liver disease. Hepatology 37, 164-171. 85. Drenth, J.P., Martina, J.A., van de Kerkhof, R., Bonifacino, J.S., and Jansen, J.B. (2005). Polycystic liver disease is a disorder of cotranslational protein processing. Trends Mol Med 11, 37-42. 86. Pirson, Y., Lannoy, N., Peters, D., Geubel, A., Gigot, J.F., Breuning, M., and Verellen-Dumoulin, C. (1996). Isolated polycystic liver disease as a distinct genetic disease, unlinked to polycystic kidney disease 1 and polycystic kidney disease 2. Hepatology 23, 249-252. 87. Drenth, J.P., te Morsche, R.H., Smink, R., Bonifacino, J.S., and Jansen, J.B. (2003). Germline mutations in PRKCSH are associated with autosomal dominant polycystic liver disease. Nat Genet 33, 345-347. 88. Li, A., Davila, S., Furu, L., Qian, Q., Tian, X., Kamath, P.S., King, B.F., Torres, V.E., and Somlo, S. (2003). Mutations in PRKCSH cause isolated autosomal dominant polycystic liver disease. Am J Hum Genet 72, 691-703. 89. Trombetta, E.S., Fleming, K.G., and Helenius, A. (2001). Quaternary and domain structure of glycoprotein processing glucosidase II. Biochemistry 40, 10717-10722.

135

Applications of Homology Modelling 5

90. Meyer, H.A., Grau, H., Kraft, R., Kostka, S., Prehn, S., Kalies, K.U., and Hartmann, E. (2000). Mammalian Sec61 is associated with Sec62 and Sec63. J Biol Chem 275, 14550-14557. 91. Ponting, C.P. (2000). Proteins of the endoplasmic-reticulum-associated degradation pathway: domain detection and function prediction. Biochem J 351 Pt 2, 527-535. 92. Larkin, M.A., Blackshields, G., Brown, N.P., Chenna, R., McGettigan, P.A., McWilliam, H., Valentin, F., Wallace, I.M., Wilm, A., Lopez, R., et al. (2007). Clustal W and Clustal X version 2.0. Bioinformatics 23, 2947-2948. 93. Jones, D.T. (1999). Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 292, 195-202. 94. Waanders, E., te Morsche, R.H., de Man, R.A., Jansen, J.B., and Drenth, J.P. (2006). Extensive mutational analysis of PRKCSH and SEC63 broadens the spectrum of polycystic liver disease. Hum Mutat 27, 830. 95. Laqua, H. (1980). Familial exudative vitreoretinopathy. Albrecht Von Graefes Arch Klin Exp Ophthalmol 213, 121-133. 96. Miyakubo, H., Hashimoto, K., and Miyakubo, S. (1984). Retinal vascular pattern in familial exudative vitreoretinopathy. Ophthalmology 91, 1524-1530. 97. Shukla, D., Singh, J., Sudheer, G., Soman, M., John, R.K., Ramasamy, K., and Perumalsamy, N. (2003). Familial exudative vitreoretinopathy (FEVR). Clinical profile and management. Indian J Ophthalmol 51, 323-328. 98. Robitaille, J., MacDonald, M.L., Kaykas, A., Sheldahl, L.C., Zeisler, J., Dube, M.P., Zhang, L.H., Singaraja, R.R., Guernsey, D.L., Zheng, B., et al. (2002). Mutant frizzled-4 disrupts retinal angiogenesis in familial exudative vitreoretinopathy. Nat Genet 32, 326-330. 99. Toomes, C., Bottomley, H.M., Jackson, R.M., Towns, K.V., Scott, S., Mackey, D.A., Craig, J.E., Jiang, L., Yang, Z., Trembath, R., et al. (2004). Mutations in LRP5 or FZD4 underlie the common familial exudative vitreoretinopathy locus on chromosome 11q. Am J Hum Genet 74, 721-730. 100. Berger, W., Meindl, A., van de Pol, T.J., Cremers, F.P., Ropers, H.H., Doerner, C., Monaco, A., Bergen, A.A., Lebo, R., Warburgh, M., et al. (1992). Isolation of a candidate gene for Norrie disease by positional cloning. Nat Genet 2, 84. 101. Berger, W., van de Pol, D., Warburg, M., Gal, A., Bleeker-Wagemakers, L., de Silva, H., Meindl, A., Meitinger, T., Cremers, F., and Ropers, H.H. (1992). Mutations in the candidate gene for Norrie disease. Hum Mol Genet 1, 461-465. 102. Chen, Z.Y., Hendriks, R.W., Jobling, M.A., Powell, J.F., Breakefield, X.O., Sims, K.B., and Craig, I.W. (1992). Isolation and characterization of a candidate gene for Norrie disease. Nat Genet 1, 204-208. 103. He, X., Semenov, M., Tamai, K., and Zeng, X. (2004). LDL receptor-related proteins 5 and 6 in Wnt/beta- catenin signaling: arrows point the way. Development 131, 1663-1677. 104. Pinson, K.I., Brennan, J., Monkley, S., Avery, B.J., and Skarnes, W.C. (2000). An LDL-receptor-related protein mediates Wnt signalling in mice. Nature 407, 535-538. 105. Wehrli, M., Dougan, S.T., Caldwell, K., O'Keefe, L., Schwartz, S., Vaizel-Ohayon, D., Schejter, E., Tomlinson, A., and DiNardo, S. (2000). arrow encodes an LDL-receptor-related protein essential for Wingless signalling. Nature 407, 527-530. 106. Mao, B., Wu, W., Li, Y., Hoppe, D., Stannek, P., Glinka, A., and Niehrs, C. (2001). LDL-receptor-related protein 6 is a receptor for Dickkopf proteins. Nature 411, 321-325. 107. Meindl, A., Berger, W., Meitinger, T., van de Pol, D., Achatz, H., Dorner, C., Haasemann, M., Hellebrand, H., Gal, A., Cremers, F., et al. (1992). Norrie disease is caused by mutations in an extracellular protein resembling C-terminal globular domain of mucins. Nat Genet 2, 139-143. 108. Meitinger, T., Meindl, A., Bork, P., Rost, B., Sander, C., Haasemann, M., and Murken, J. (1993). Molecular modelling of the Norrie disease protein predicts a cystine knot growth factor tertiary structure. Nat Genet 5, 376-380. 109. Takagi, J., Yang, Y., Liu, J.H., Wang, J.H., and Springer, T.A. (2003). Complex between nidogen and laminin fragments reveals a paradigmatic beta-propeller interface. Nature 424, 969-974. 110. Mather, T., Oganessyan, V., Hof, P., Huber, R., Foundling, S., Esmon, C., and Bode, W. (1996). The 2.8 A crystal structure of Gla-domainless activated protein C. EMBO J 15, 6822-6831. 111. Weidauer, S.E., Schmieder, P., Beerbaum, M., Schmitz, W., Oschkinat, H., and Mueller, T.D. (2009). NMR structure of the Wnt modulator protein Sclerostin. Biochem Biophys Res Commun 380, 160-165. 112. Vitt, U.A., Hsu, S.Y., and Hsueh, A.J. (2001). Evolution and classification of cystine knot-containing hormones and related extracellular signaling molecules. Mol Endocrinol 15, 681-694. 113. Dann, C.E., Hsieh, J.C., Rattner, A., Sharma, D., Nathans, J., and Leahy, D.J. (2001). Insights into Wnt binding and signalling from the structures of two Frizzled cysteine-rich domains. Nature 412, 86-90.

136

Applications of Homology Modelling 5

114. Zhang, J., Zhang, W., Zou, D., Chen, G., Wan, T., Zhang, M., and Cao, X. (2002). Cloning and functional characterization of ACAD-9, a novel member of human acyl-CoA dehydrogenase family. Biochem Biophys Res Commun 297, 1033-1042. 115. Sheftel, A.D., Stehling, O., Pierik, A.J., Netz, D.J., Kerscher, S., Elsasser, H.P., Wittig, I., Balk, J., Brandt, U., and Lill, R. (2009). Human ind1, an iron-sulfur cluster assembly factor for respiratory complex I. Mol Cell Biol 29, 6059-6073. 116. Vahsen, N., Cande, C., Briere, J.J., Benit, P., Joza, N., Larochette, N., Mastroberardino, P.G., Pequignot, M.O., Casares, N., Lazar, V., et al. (2004). AIF deficiency compromises oxidative phosphorylation. EMBO J 23, 4679-4689. 117. McAndrew, R.P., Wang, Y., Mohsen, A.W., He, M., Vockley, J., and Kim, J.J. (2008). Structural basis for substrate fatty acyl chain specificity: crystal structure of human very-long-chain acyl-CoA dehydrogenase. J Biol Chem 283, 9435-9443. 118. Wintermeyer, W., Peske, F., Beringer, M., Gromadski, K.B., Savelsbergh, A., and Rodnina, M.V. (2004). Mechanisms of elongation on the ribosome: dynamics of a macromolecular machine. Biochem Soc Trans 32, 733-737. 119. Hansson, S., Singh, R., Gudkov, A.T., Liljas, A., and Logan, D.T. (2005). Structural insights into fusidic acid resistance and sensitivity in EF-G. J Mol Biol 348, 939-949. 120. Diaconu, M., Kothe, U., Schlunzen, F., Fischer, N., Harms, J.M., Tonevitsky, A.G., Stark, H., Rodnina, M.V., and Wahl, M.C. (2005). Structural basis for the function of the ribosomal L7/12 stalk in factor binding and GTPase activation. Cell 121, 991-1004. 121. Savelsbergh, A., Mohr, D., Kothe, U., Wintermeyer, W., and Rodnina, M.V. (2005). Control of phosphate release from elongation factor G by ribosomal protein L7/12. EMBO J 24, 4316-4323. 122. Nechifor, R., Murataliev, M., and Wilson, K.S. (2007). Functional interactions between the G' subdomain of bacterial translation factor EF-G and ribosomal protein L7/L12. J Biol Chem 282, 36998-37005. 123. Zhang, Y., Feng, X., We, R., and Derynck, R. (1996). Receptor-associated Mad homologues synergize as effectors of the TGF-beta response. Nature 383, 168-172. 124. Massague, J., Seoane, J., and Wotton, D. (2005). Smad transcription factors. Genes Dev 19, 2783-2810. 125. Datto, M.B., Frederick, J.P., Pan, L., Borton, A.J., Zhuang, Y., and Wang, X.F. (1999). Targeted disruption of Smad3 reveals an essential role in transforming growth factor beta-mediated signal transduction. Mol Cell Biol 19, 2495-2504. 126. Chacko, B.M., Qin, B.Y., Tiwari, A., Shi, G., Lam, S., Hayward, L.J., De Caestecker, M., and Lin, K. (2004). Structural basis of heteromeric smad protein assembly in TGF-beta signaling. Mol Cell 15, 813-823. 127. Prokova, V., Mavridou, S., Papakosta, P., Petratos, K., and Kardassis, D. (2007). Novel mutations in Smad proteins that inhibit signaling by the transforming growth factor beta in mammalian cells. Biochemistry 46, 13775-13786. 128. Frederick, J.P., Tafari, A.T., Wu, S.M., Megosh, L.C., Chiou, S.T., Irving, R.P., and York, J.D. (2008). A role for a lithium-inhibited Golgi nucleotidase in skeletal development and sulfation. Proc Natl Acad Sci U S A 105, 11605-11612. 129. Mitchell, K.J., Pinson, K.I., Kelly, O.G., Brennan, J., Zupicich, J., Scherz, P., Leighton, P.A., Goodrich, L.V., Lu, X., Avery, B.J., et al. (2001). Functional analysis of secreted and transmembrane proteins critical to mouse development. Nat Genet 28, 241-249. 130. Sohaskey, M.L., Yu, J., Diaz, M.A., Plaas, A.H., and Harland, R.M. (2008). JAWS coordinates chondrogenesis and synovial joint positioning. Development 135, 2215-2220. 131. Guex, N., and Peitsch, M.C. (1997). SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling. Electrophoresis 18, 2714-2723. 132. Chakrabartty, A., Kortemme, T., and Baldwin, R.L. (1994). Helix propensities of the amino acids measured in alanine-based peptides without helix-stabilizing side-chain interactions. Protein Sci 3, 843-852. 133. Petersen, M.B., and Willems, P.J. (2006). Non-syndromic, autosomal-recessive deafness. Clin Genet 69, 371-392. 134. Hilgert, N., Smith, R.J., and Van Camp, G. (2009). Forty-six genes causing nonsyndromic hearing impairment: which ones should be analyzed in DNA diagnostics? Mutat Res 681, 189-196. 135. Bonne-Tamir, B., DeStefano, A.L., Briggs, C.E., Adair, R., Franklyn, B., Weiss, S., Korostishevsky, M., Frydman, M., Baldwin, C.T., and Farrer, L.A. (1996). Linkage of congenital recessive deafness (gene DFNB10) to chromosome 21q22.3. Am J Hum Genet 58, 1254-1259. 136. Veske, A., Oehlmann, R., Younus, F., Mohyuddin, A., Muller-Myhsok, B., Mehdi, S.Q., and Gal, A. (1996). Autosomal recessive non-syndromic deafness locus (DFNB8) maps on chromosome 21q22 in a large consanguineous kindred from Pakistan. Hum Mol Genet 5, 165-168.

137

Applications of Homology Modelling 5

137. Cremers, C., van Rijn, P., and ter Haar, B. (1987). Autosomal recessive progressive high-frequency sensorineural deafness in childhood. Arch Otolaryngol Head Neck Surg 113, 1319-1324. 138. Venselaar, H., Te Beek, T.A., Kuipers, R.K., Hekkelman, M.L., and Vriend, G. Protein structure analysis of mutations causing inheritable diseases. An e-Science approach with life scientist friendly interfaces. BMC Bioinformatics 11, 548. 139. Herter, S., Piper, D.E., Aaron, W., Gabriele, T., Cutler, G., Cao, P., Bhatt, A.S., Choe, Y., Craik, C.S., Walker, N., et al. (2005). Hepatocyte growth factor is a preferred in vitro substrate for human hepsin, a membrane-anchored serine protease implicated in prostate and ovarian cancers. Biochem J 390, 125- 136. 140. Woods, C.G., Bond, J., and Enard, W. (2005). Autosomal recessive primary microcephaly (MCPH): a review of clinical, molecular, and evolutionary findings. Am J Hum Genet 76, 717-728. 141. Kaindl, A.M., Passemard, S., Kumar, P., Kraemer, N., Issa, L., Zwirner, A., Gerard, B., Verloes, A., Mani, S., and Gressens, P. Many roads lead to primary autosomal recessive microcephaly. Prog Neurobiol 90, 363-383. 142. Barkovich, A.J., Kuzniecky, R.I., Jackson, G.D., Guerrini, R., and Dobyns, W.B. (2005). A developmental and genetic classification for malformations of cortical development. Neurology 65, 1873-1887. 143. Guerrini, R., Dobyns, W.B., and Barkovich, A.J. (2008). Abnormal development of the human cerebral cortex: genetics, functional consequences and treatment options. Trends Neurosci 31, 154-162. 144. Bendtsen, J.D., Nielsen, H., von Heijne, G., and Brunak, S. (2004). Improved prediction of signal peptides: SignalP 3.0. J Mol Biol 340, 783-795. 145. Ng, P.C., and Henikoff, S. (2001). Predicting deleterious amino acid substitutions. Genome Res 11, 863- 874. 146. Ferrer-Costa, C., Gelpi, J.L., Zamakola, L., Parraga, I., de la Cruz, X., and Orozco, M. (2005). PMUT: a web- based tool for the annotation of pathological mutations on proteins. Bioinformatics 21, 3176-3178. 147. Adzhubei, I.A., Schmidt, S., Peshkin, L., Ramensky, V.E., Gerasimova, A., Bork, P., Kondrashov, A.S., and Sunyaev, S.R. A method and server for predicting damaging missense mutations. Nat Methods 7, 248- 249.

138

Automated protein structure annotation of human disease variants 6

Previous page:

Close-up of a heme-molecule in hemoglobin – These globular tetrameric complexes form the oxygen-transporting units in our blood. (PDB-file 1hho)

140

Automated protein structure annotation of human disease variants 6

Introduction The increasing application of large scale sequencing to patients with genetic diseases has made the accurate interpretation of genetic variants a challenge that is of paramount importance to both genetic diagnostic services and to disease gene discovery. This problem is not new, but it is clearly scaling up due to these new technologies. This requires that we rethink the criteria by which we declare pathogenicity, as well as the tools that we use for generating such calls. One way in which we may possibly improve the accuracy of predictions about pathogenicity of individual mutations is to include information on protein structure for their annotation. We previously developed HOPE1 (Have yOur Protein Explained), a program that combines a number of features from existing software packages with extensive use of protein structure information. HOPE can automatically build homology models which enables the use of structural information for about 40% of all human proteins. HOPE provides for each mutation a detailed description in order to aid scientists with interpretation of the data. We have systematically compared the output from HOPE, with that of three widely-used methods for automatic variant annotation. As expected, we find that when structure information is available, the two methods that can use it (PolyPhen and HOPE) outperform both methods that do not incorporate information from protein structural modeling (Grantham scores and SIFT). Additionally, the HOPE report provides an extensive explanation of the mutational effect which leads to more insight in the protein and the mutation.

Recent developments in Next Generation Sequencing (NGS) techniques have greatly increased the rate at which variation in the human genome is identified. It has been estimated that SNPs occur about every 200-300 base-pairs leading to ~15 million SNPs2 in an individual genome. Initiatives such as dbSNP3, HGMD4, and the 1000 genome project5 collect and analyse these variants in order to create an reliable overview of variation in the complete human genome. Currently, dbSNP contains more than 52 million SNPs (build 135, October 2011). One interesting observation is that in any human exome sequenced today, the number of previously unidentified variants that affect the coding sequence and would lead to a predicted change in the amino acid composition of the protein, is 150-2006. This is true, even of individuals from populations for which a large number of exomes have already been sequenced. NGS leads to easier detection of variants, but assessing a causal relation to disease remains difficult, and incomplete. This can be illustrated by the fact that for BRCA1 and BRCA2, two of the most studied human disease genes, 15% of the mutations remain 'unclassified'7. A number of bioinformatics methods have been developed to tackle the challenge of annotating point-mutations in proteins. The Grantham-score8 was the first method that was based on differences in amino acid properties. Grantham himself already mentioned in his seminal 1974 paper8 that his method has its limitations because it does not take structural features into account. A Ser->Thr mutation, for example, tends to be harmless unless the serine is an active site residue. SIFT9 (Sorting Intolerant From Tolerant) is based on multiple sequence alignments which make its predictions generally more accurate than Grantham scores. The PolyPhen10 web server was the first of a new generation of servers that collect and combine information from a wide range of sources to infer conclusions about the effect of a mutation. Rapid proliferation of such computer programs has occurred in recent years. Table 6.1 provides a list of freely available web servers that can analyse point mutations in proteins and predict their effect.

141

Automated protein structure annotation of human disease variants 6

Table 6.1: Internet based web servers that can aid with the prediction of the effects of point mutations on a protein’s structure and/or function. Left: name of facility, reference, and URL. Right: very short description of the main features. MSA=Multiple Sequence Alignment; SVM=Support Vector Machine; PSIC=Position Specific Independent Counts; GO=Gene Ontology.

Servers that work with AA sequence (sometimes Uniprot ID) and mutation and that can predict the functional effect of the mutation PolyPhen10 Classifies a mutation as probably damaging, genetics.bwh.harvard.edu/pph2/ possibly damaging, or benign, uses annotations, info in PDB-file, MSA, UniProt and PSIC 11scores. SNAP12 Gives neutral or non-neutral. Uses sequence rostlab.org/services/snap/ annotations, predictions on sequence, and MSA. PMut13 Uses sequence based information and mmb2.pcb.ub.es:8080/PMut/ predictions but no structures. A set of pre- calculated mutations on a reference PDB set is also available SNPS&GO14 Mainly uses GO-terms to indicate disease snps-and-go.biocomp.unibo.it/snps-and-go/ versus neutral. nsSNPAnalyzer15 Uses environment scores and SIFT scores to http://snpanalyzer.uthsc.edu/ classify a mutation as pathogenic or not (works for sequences with PDB only!) MutationAssesor16 Lists information from Uniprot, variation http://mutationassessor.org/ scores, known regions/domains and SNPs for the whole gene. Gives a functional impact score (high, medium, low or neutral) SNPeffect17 Predicts effects of mutations using aggregation, http://snpeffect.vib.be/ stability, etc scores Servers that only use conservation scores Panther18 Gives a score for deleterious or neutral, based www.pantherdb.org/tools/csnpScoreForm.jsp on MSA only. PhD-SNP19 Classifies a mutation as neutral or disease gpcr.biocomp.unibo.it/cgi/predictors/PhD- causing based on MSA only using SVM. SNP/PhD-SNP.cgi SIFT9 Classifies a mutation as being tolerated or not sift.jcvi.org/ tolerated, based on MSA only Servers that calculate stability/energy changes (using PDB structure) MuPro20 Predict protein stability changes for single http://mupro.proteomics.ics.uci.edu/ amino acid mutations AutoMute21 Indicates whether the stability and function of http://proteins.gmu.edu/automute/ structure is affected Eris22 Calculates stability change causes by a http://troll.med.unc.edu/eris mutation PopMuSiC23 Stability calculation only using, among others, http://babylone.ulb.ac.be/popmusic atom/torsion angle potentials CUPSAT24 Prediction of the protein stability change upon http://cupsat.tu-bs.de/ a mutation using aa-atom potentials and torsion angle distribution (calculations for all aa) I-mutant 2.025 Can predict the protein’s stability change based http://folding.uib.es/i-mutant/i-mutant2.0.html on either the structure or the sequence

142

Automated protein structure annotation of human disease variants 6

SDM26 Structure based prediction of stability and http://mordred.bioc.cam.ac.uk/sdm/sdm.php disease association of the mutation (only one that gives disease association too) Servers that collect info about known SNPS (no option for de novo SNPs) FastSNP27 Analyzes all SNPs but uses PolyPhen for the http://fastsnp.ibms.sinica.edu.tw non-synonymous ones stSNP28 Lists the aa-differences per SNP and the KEGG http://ilyinlab.org/StSNP/ pathway Omidios29 Classifies SNPS in disease/polymorphism http://sgu.bioinfo.cipf.es/services/Omidios/ F-SNP30 Meta-server, collects classifications from others http://compbio.cs.queensu.ca/F-SNP/ and uses multi-vote LS_SNP31 Gives 3D-effect/stabilization of existing SNPS http://modbase.compbio.ucsf.edu/LS-SNP/ SNAP32 Loads of information about a variant http://snap.humgen.au.dk SNPs3D33 Lists molecular functional effects of non- www.snps3d.org/ synonymous SNPs based on structure and sequence analysis. (only existing SNPS, no option for your own input) MutDB 34 Collects human variants from Swiss-Prot and http://www.mutdb.org dbSNP organized by gene. Shows SIFT scores for the nsSNPS.

HOPE We have designed HOPE1 (Have yOur Protein Explained) a program that combines the various features of existing software packages with the novel feature of extensive protein structure (or homology model) information. HOPE combines the results of dozens of computer programs and produces output in the form of a report that is intended to be understandable for trained molecular biologists, and life science experts. The use of structure information is a potentially important addition, because it allows the investigation of effects on features such as solvent accessibility, secondary structure, hydrogen bonding, ionic interactions, and location of the active site. However, the percentage of human proteins for which the 3D-structure was solved is only ~20%. HOPE can automatically build homology models when experimental structure information is available for a homolog of the protein of interest, thereby increasing the percentage of human proteins that can be analysed in detail to more than 40%. A decision scheme is applied to identify the effects of a mutation on the protein’s 3D-structure and function. This decision scheme ensures that the most reliable source of information is always used for the report; conclusions drawn from 3D-structures prevail over Swiss-Prot annotations that in turn prevail over sequence based predictions. In case no structure data is available and no model can be built, HOPE functions much like other methods and uses annotations obtained from a series of databases, and sequence based predictions obtained from Swiss-Prot35 and DAS-servers36. Conservation scores obtained from HSSP37 are always added to the report. The report is illustrated with pictures and animations that are ready for use in presentations and articles. HOPE’s technical details have been described in chapter 4. HOPE’s workflow is illustrated in figure 6.1.

143

Automated protein structure annotation of human disease variants 6

Figure 6.1: HOPE’s workflow. 1) The user submits the protein sequence and mutation of interest. 2) HOPE collects all information about the protein from different information sources (calculations on the 3D-structure or the homology model, annotations from SwissProt sequence features, sequence-based predictions by DAS- servers, and conservation scores from HSSP) The data is stored for use in a database 3) The data is combined in a decision tree. HOPE will use information from the most reliable source first (the structure itself or a homology model), followed by annotations, and predictions. Conservation scores originate from HSSP. 4) The information is combined into a life-science-friendly written report which will be shown on the website.

Since its first day online HOPE has analysed the predicted effect of hundreds of mutations for users from all over the world. Subjectively, HOPE provides very accurate detailed descriptions of the impact of a mutation on the protein. We compared HOPE’s performance with that of three commonly used methods for a set of manually analysed disease causing mutations. A test dataset of 37 mutations was extracted from all 15 articles published in 2010 in the American Journal of Human Genetics that describe the effect of a disease causing mutation on the 3D-structure and/or function of the affected protein. As a negative control we used the 21 SNPs in the same set of proteins which have been annotated as non-damaging ‘polymorphisms’ at the UniProtKB/Swiss-Prot variant pages and which could be analysed using either the experimentally solved structure or a homology model. We analysed all 58 variants using HOPE and the three afore-mentioned methods (for PolyPhen the upgraded PolyPhen-2 server was used). The results are summarized in Table 6.2 and show that current methods agree on the interpretation of a newly found mutation in only a small majority of cases. A more extensive analysis of another 144 mutations described in previous projects and randomly selected articles showed results similar to those described here. Results of this extensive analysis can be found in Table 6.3 (page 151) and will be discussed in Chapter 6a.

144

Automated protein structure annotation of human disease variants 6

Table 6.2: Analysis of 37 disease-causing mutations and 21 polymorphisms. The threshold for Grantham scores (benign < 62 < damaging) was obtained from Bell et al38. SIFT predicts either tolerated (does not cause a disease) or not-tolerated (causes a disease). PolyPhen-2 classifies mutation as ‘probably damaging’, ‘possibly damaging’, or 'benign'; the best correlation with experiment is obtained by considering the first two categories both as damaging. HOPE's decision tree produces rather elaborate results that have for this study been compressed into damaging and benign. Note that for only 59% of the disease-causing variants Grantham, SIFT, and PolyPhen-2 agree that the mutation is pathogenic, and that in 5% of the cases these 3 methods unanimously agree that the mutation is benign. Pathogenic mutations Polymorphisms Number of variants 37 21 Grantham (<62 is benign) 59% > 62 52% < 62 SIFT 81% not tolerated 62% tolerated PolyPhen-2 89% damaging 52% benign 76% probably damaging Grantham/SIFT/PolyPhen-2 59% in agreement of which 48% in agreement of which are in agreement 5% incorrect (all predict a 19% incorrect (all predict a benign mutation) damaging mutation) HOPE 92% damaging of which 67% mutant is not severe 76% obviously damaging

Table 6.2 shows that Grantham-scores correctly predict pathogenic mutations in only 59% of the disease-causing mutations and neutral for 52% of the neutral SNPs. SIFT performs better than Grantham as 81% of the pathogenic mutations are indeed predicted as not-tolerated and 62% of the polymorphisms are predicted to be tolerated. If we count both 'probably damaging' and 'possibly damaging' as ‘damaging’, PolyPhen-2 correctly identifies 89% of the pathogenic mutations and 52% of the SNPs. The analysis further shows that the SIFT, HOPE, and Polyphen-2 methods that use multiple sequence alignments clearly outperform Grantham scores. The use of structure information further increases the number of correctly predicted pathogenic mutations. Grantham, SIFT, and PolyPhen-2 are often used in combination. From our results, we question this approach as it does not seem to increase accuracy. Our analyses show that these three algorithms agree for only 59% and 48% of the pathogenic mutations and neutral SNPs, respectively. We also found a few cases for which the combined analysis is unanimously incorrect; 5% of the pathogenic mutations are called benign and 19% of the neutral SNPs were predicted to be disease-causing by all three methods. In terms of classification and accuracy, HOPE and Polyphen-2 appear to yield similar results. However, we note that there are a number of differences in their respective output formats which may be relevant to prospective users. Notably, HOPE was not primarily designed to serve as a mutation classifier. Its strength lies in the detailed yet easy to understand description of the effects of a mutation on various aspects of the protein structure and function. To make sure every user is able to read and understand the reports we have implemented a help-function that links all the difficult keywords from the fields of bioinformatics, biophysics, and protein structures to a dedicated online dictionary. Thus, HOPE produces reports that do not contain just a single pathogenicity score. Instead, HOPE reports describe the effect that the mutation might have on relevant features, such as contacts made by the residue, the structural domain in which the residue is located, post- translational modifications. Additionally, the report provides information about variations found at the same position and conservation scores. A recent editorial in Nature Genetics39 stresses the importance of “mechanistic investigation and further value”. The HOPE reports can provide a good start towards providing such information. To compare HOPE reports with the outcome of the other three methods the collected information was consistently interpreted. For instance, a report that mentions the loss of a salt- bridge at a conserved position clearly describes a damaging mutation which we therefore scored ‘pathogenic’ in this study. A HOPE report that mentioned a non-conserved residue at the surface was

145

Automated protein structure annotation of human disease variants 6 scored as ’benign’. HOPE scores comparable to PolyPhen-2 with 92% of the pathogenic mutants correctly predicted to affect the protein’s structure and/or function and 67% of the polymorphisms predicted to have no severe structural effect. Our test set consists of mutations in proteins for which 3D-structure information is available. HOPE will perform better than PolyPhen-2 in those cases for which no 3D-structure is available but for which a homology model can be build. The added value of HOPE over the other methods lies in the detailed description of the effects of the mutation on the protein structure and/or function. Whereas the other methods classify a mutation as damaging, HOPE provides the rationale why that mutation is damaging. This may lead to a better understanding of the effects of the mutation on the protein and further suggest new experiments. In 13 of the 15 cases in which the use of Grantham, PolypPhen-2, and SIFT did not unanimously result in a pathogenic prediction HOPE was able to provide the information required for a correct classification. For example, the pathogenic mutation Val2303Met in Aggrecan40 receives a very low Grantham score (21, corresponding to 'benign') based on the hydrophobic similarity of valine and methionine. SIFT predicts that this mutation is not tolerated, whereas PolyPhen-2 cannot classify the mutation due to lack of information. The HOPE report notes that the mutation will disturb the structural integrity of the core of the ‘C-type lectin’ domain and thus gives a clear reason for the pathogenic effect. This answer agrees with the conclusion drawn by the authors of the corresponding article40. The pathogenic mutation Glu325Lys in KLF1 has a benign Grantham score of 56 because glutamic acid and lysine both are hydrophilic and charged. SIFT and PolyPhen-2 predict 'not-tolerated’ and ’probably damaging’ respectively. The HOPE report describes that this mutation disturbs DNA binding which will affect the protein’s function. In this case, SIFT, PolyPhen-2, and HOPE are correct, while HOPE additionally agrees with the detailed description by the authors41. The neutral SNP Phe90Leu in DHB4 has a benign Grantham score of 22 but was predicted to be pathogenic by both SIFT and PolyPhen. The HOPE report additionally shows that the residue is located at the surface and that the mutant residue was also found at this location in other, homologous proteins so that the mutation might indeed be benign. Similarly, in 6 out of 10 neutral SNPs for which the three other methods disagreed about the classification, HOPE reported the variant to be benign.

Figure 6.2: Mutation Trp177Arg in Opsin. The picture that is shown in the HOPE report for mutation Trp177Arg in opsin shows that Trp177 is located at the surface of a helix. This helix is a transmembrane helix and therefore the Trp177 residue is in contact with the membrane-lipids. The HOPE report correctly states that this mutation will affect the contacts with the membrane-lipid because the arginine residue differs from the wild- type tryptophan in its size, charge and hydrophobicity.

146

Automated protein structure annotation of human disease variants 6

In five cases HOPE provided more, or more correct information than the description of the mutation in the article, and in all five cases manual inspection shows HOPE to be correct. For example, the authors state that residue Val359 is located at the surface of SPTC242 whereas HOPE analyses the homology model and finds that this residue is buried. In another example, the authors stated that mutation Trp177Arg in opsin43 will cause a major conformational change whereas HOPE finds that this mutation is more likely to disturb interactions with the membrane, a conclusion supported by dozens of studies listed in the GPCRDB44. Figure 6.2 shows the picture that was included in the HOPE report for this mutant. These examples further illustrate how HOPE can be used to gain more insight in the structural effects of mutations. The 8% cases in which HOPE failed to identify a damaging effect of a causative variant was due to the misclassification of 3 mutations in the PRPS1 protein. In all three cases the mutated residue was not conserved and the mutant residue was also found at that particular position in the multiple sequence alignment. Glu43Asp and Asp65Asn both occur at the surface of the protein where a mutation might be less detrimental. SIFT, PolyPhen-2 and Grantham also disagree with the authors and predict a benign mutation. Mutation Leu129Ile is predicted by HOPE as benign because leucine and isoleucine have the same properties and isoleucine was observed at this position in the HSSP multiple sequence alignment. According to the authors of the original article this residue is located close to an allosteric binding site, information that is not available as annotation in any database and therefore not known to HOPE. Over the past year, HOPE has analysed hundreds of mutations for users from all over the world. Subjectively, HOPE provides very accurate descriptions of the expected effects of mutations. These results will be discussed in Chapter 6a.

Conclusion In conclusion, we compare HOPE’s performance with that of other popular methods using a dataset of 37 manually analysed disease causing mutations from 15 articles published in 2010 in the American Journal of Human Genetics each of which includes a description of the effects on the 3D- structure and/or function of the affected protein. We find that HOPE performs well compared to existing programs such as SIFT and Polyphen-2, but more importantly, HOPE provides more detailed mechanistic information that allows scientists to gain deeper insight into the reasons for a particular score.

Acknowledgements The authors would like to thank Maarten Hekkelman, Tim te Beek, Remko Kuipers, Bas Vroling, Robbie Joosten, Barbara van Kampen, Elmar Krieger, and Wilmar Theunissen for their support and expertise. This work was part of the BioRange programme of the Netherlands Bioinformatics Centre (NBIC), which is supported by a BSIK grant through the Netherlands Genomics Initiative (NGI). The authors acknowledge support from the EU project NewProt (289350).

147

Automated protein structure annotation of human disease variants 6

References 1. Venselaar, H., Te Beek, T.A., Kuipers, R.K., Hekkelman, M.L., and Vriend, G. Protein structure analysis of mutations causing inheritable diseases. An e-Science approach with life scientist friendly interfaces. BMC Bioinformatics 11, 548. 2. Gong, S., Worth, C.L., Cheng, T.M., and Blundell, T.L. Meet me halfway: when genomics meets structural bioinformatics. J Cardiovasc Transl Res 4, 281-303. 3. Sherry, S.T., Ward, M.H., Kholodov, M., Baker, J., Phan, L., Smigielski, E.M., and Sirotkin, K. (2001). dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29, 308-311. 4. Cooper, D.N., and Krawczak, M. (1996). Human Gene Mutation Database. Hum Genet 98, 629. 5. The 1.000 genome consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061-1073. 6. Gilissen, C., Hoischen, A., Brunner, H.G., and Veltman, J.A. (2011). Unlocking Mendelian disease using exome sequencing. Genome Biol 12, 228. 7. Akbari, M.R., Zhang, S., Fan, I., Royer, R., Li, S., Risch, H., McLaughlin, J., Rosen, B., Sun, P., and Narod, S.A. Clinical impact of unclassified variants of the BRCA1 and BRCA2 genes. J Med Genet 48, 783-786. 8. Grantham, R. (1974). Amino acid difference formula to help explain protein evolution. Science 185, 862-864. 9. Ng, P.C., and Henikoff, S. (2001). Predicting deleterious amino acid substitutions. Genome Res 11, 863-874. 10. Adzhubei, I.A., Schmidt, S., Peshkin, L., Ramensky, V.E., Gerasimova, A., Bork, P., Kondrashov, A.S., and Sunyaev, S.R. A method and server for predicting damaging missense mutations. Nat Methods 7, 248- 249. 11. Sunyaev, S.R., Eisenhaber, F., Rodchenkov, I.V., Eisenhaber, B., Tumanyan, V.G., and Kuznetsov, E.N. (1999). PSIC: profile extraction from sequence alignments with position-specific counts of independent observations. Protein Eng 12, 387-394. 12. Bromberg, Y., and Rost, B. (2007). SNAP: predict effect of non-synonymous polymorphisms on function. Nucleic Acids Res 35, 3823-3835. 13. Ferrer-Costa, C., Gelpi, J.L., Zamakola, L., Parraga, I., de la Cruz, X., and Orozco, M. (2005). PMUT: a web- based tool for the annotation of pathological mutations on proteins. Bioinformatics 21, 3176-3178. 14. Calabrese, R., Capriotti, E., Fariselli, P., Martelli, P.L., and Casadio, R. (2009). Functional annotations improve the predictive score of human disease-related mutations in proteins. Hum Mutat 30, 1237- 1244. 15. Bao, L., Zhou, M., and Cui, Y. (2005). nsSNPAnalyzer: identifying disease-associated nonsynonymous single nucleotide polymorphisms. Nucleic Acids Res 33, W480-482. 16. Reva, B., Antipin, Y., and Sander, C. Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res 39, e118. 17. Reumers, J., Schymkowitz, J., Ferkinghoff-Borg, J., Stricher, F., Serrano, L., and Rousseau, F. (2005). SNPeffect: a database mapping molecular phenotypic effects of human non-synonymous coding SNPs. Nucleic Acids Res 33, D527-532. 18. Thomas, P.D., Campbell, M.J., Kejariwal, A., Mi, H., Karlak, B., Daverman, R., Diemer, K., Muruganujan, A., and Narechania, A. (2003). PANTHER: a library of protein families and subfamilies indexed by function. Genome Res 13, 2129-2141. 19. Capriotti, E., Calabrese, R., and Casadio, R. (2006). Predicting the insurgence of human genetic diseases associated to single point protein mutations with support vector machines and evolutionary information. Bioinformatics 22, 2729-2734. 20. Cheng, J., Randall, A., and Baldi, P. (2006). Prediction of protein stability changes for single-site mutations using support vector machines. Proteins 62, 1125-1132. 21. Masso, M., and Vaisman, II. AUTO-MUTE: web-based tools for predicting stability changes in proteins due to single amino acid replacements. Protein Eng Des Sel 23, 683-687. 22. Yin, S., Ding, F., and Dokholyan, N.V. (2007). Eris: an automated estimator of protein stability. Nat Methods 4, 466-467. 23. Dehouck, Y., Grosfils, A., Folch, B., Gilis, D., Bogaerts, P., and Rooman, M. (2009). Fast and accurate predictions of protein stability changes upon mutations using statistical potentials and neural networks: PoPMuSiC-2.0. Bioinformatics 25, 2537-2543. 24. Parthiban, V., Gromiha, M.M., and Schomburg, D. (2006). CUPSAT: prediction of protein stability upon point mutations. Nucleic Acids Res 34, W239-242. 25. Capriotti, E., Fariselli, P., and Casadio, R. (2005). I-Mutant2.0: predicting stability changes upon mutation from the protein sequence or structure. Nucleic Acids Res 33, W306-310.

148

Automated protein structure annotation of human disease variants 6

26. Worth, C.L., Bickerton, G.R., Schreyer, A., Forman, J.R., Cheng, T.M., Lee, S., Gong, S., Burke, D.F., and Blundell, T.L. (2007). A structural bioinformatics approach to the analysis of nonsynonymous single nucleotide polymorphisms (nsSNPs) and their relation to disease. J Bioinform Comput Biol 5, 1297- 1318. 27. Yuan, H.Y., Chiou, J.J., Tseng, W.H., Liu, C.H., Liu, C.K., Lin, Y.J., Wang, H.H., Yao, A., Chen, Y.T., and Hsu, C.N. (2006). FASTSNP: an always up-to-date and extendable service for SNP function analysis and prioritization. Nucleic Acids Res 34, W635-641. 28. Uzun, A., Leslin, C.M., Abyzov, A., and Ilyin, V. (2007). Structure SNP (StSNP): a web server for mapping and modeling nsSNPs on protein structures with linkage to metabolic pathways. Nucleic Acids Res 35, W384-392. 29. Capriotti, E., Arbiza, L., Casadio, R., Dopazo, J., Dopazo, H., and Marti-Renom, M.A. (2008). Use of estimated evolutionary strength at the codon level improves the prediction of disease-related protein mutations in humans. Hum Mutat 29, 198-204. 30. Lee, P.H., and Shatkay, H. (2008). F-SNP: computationally predicted functional SNPs for disease association studies. Nucleic Acids Res 36, D820-824. 31. Karchin, R., Diekhans, M., Kelly, L., Thomas, D.J., Pieper, U., Eswar, N., Haussler, D., and Sali, A. (2005). LS- SNP: large-scale annotation of coding non-synonymous SNPs based on multiple information sources. Bioinformatics 21, 2814-2820. 32. Li, S., Ma, L., Li, H., Vang, S., Hu, Y., Bolund, L., and Wang, J. (2007). Snap: an integrated SNP annotation platform. Nucleic Acids Res 35, D707-710. 33. Yue, P., Melamud, E., and Moult, J. (2006). SNPs3D: candidate gene and SNP selection for association studies. BMC Bioinformatics 7, 166. 34. Mooney, S.D., and Altman, R.B. (2003). MutDB: annotating human variation with functionally relevant data. Bioinformatics 19, 1858-1860. 35. Uniprot Consortium. (2010). Ongoing and future developments at the Universal Protein Resource. Nucleic Acids Res 39, D214-219. 36. Prlic, A., Down, T.A., Kulesha, E., Finn, R.D., Kahari, A., and Hubbard, T.J. (2007). Integrating sequence and structural biology with DAS. BMC Bioinformatics 8, 333. 37. Sander, C., and Schneider, R. (1991). Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 9, 56-68. 38. Bell, J., Bodmer, D., Sistermans, E., and Simon C., R. (2008). Practice guidelines for the interpretation and reporting of unclassified varients (UVs) in Clinical Molecular Genetics. 39. (2012). Full spectrum genetics. Nat Genet 44, 1. 40. Stattin, E.L., Wiklund, F., Lindblom, K., Onnerfjord, P., Jonsson, B.A., Tegner, Y., Sasaki, T., Struglics, A., Lohmander, S., Dahl, N., et al. A missense mutation in the aggrecan C-type lectin domain disrupts extracellular matrix interactions and causes dominant familial osteochondritis dissecans. Am J Hum Genet 86, 126-137. 41. Arnaud, L., Saison, C., Helias, V., Lucien, N., Steschenko, D., Giarratana, M.C., Prehu, C., Foliguet, B., Montout, L., de Brevern, A.G., et al. A dominant mutation in the gene encoding the erythroid transcription factor KLF1 causes a congenital dyserythropoietic anemia. Am J Hum Genet 87, 721-727. 42. Rotthier, A., Auer-Grumbach, M., Janssens, K., Baets, J., Penno, A., Almeida-Souza, L., Van Hoof, K., Jacobs, A., De Vriendt, E., Schlotter-Weigel, B., et al. Mutations in the SPTLC2 subunit of serine palmitoyltransferase cause hereditary sensory and autonomic neuropathy type I. Am J Hum Genet 87, 513-522. 43. Gardner, J.C., Webb, T.R., Kanuga, N., Robson, A.G., Holder, G.E., Stockman, A., Ripamonti, C., Ebenezer, N.D., Ogun, O., Devery, S., et al. X-linked cone dystrophy caused by mutation of the red and green cone opsins. Am J Hum Genet 87, 26-39. 44. Vroling, B., Sanders, M., Baakman, C., Borrmann, A., Verhoeven, S., Klomp, J., Oliveira, L., de Vlieg, J., and Vriend, G. GPCRDB: information system for G protein-coupled receptors. Nucleic Acids Res 39, D309- 319.

149

Automated protein structure annotation of human disease variants - Appendix 6

Extension of the HOPE validation experiment with 2 more datasets We extended the validation of HOPE with two more datasets. Besides the mutations obtained from AJHG2010, we also studied 78 mutations described in randomly selected articles in other human genetics journals such as Human Genetics and Clinical Genetics. The selection criteria were similar to those used for the first dataset (description of at least one point mutation that is linked to a disease, and the availability of a 3D-structure or homology model). Another dataset consisted of 66 mutations that we studied previously in collaborative projects. Again, we used Grantham, SIFT, PolyPhen-2, and HOPE to predict the effect of these mutations. The results for these two datasets, and the dataset described in chapter 6, are summarized in table 6.3.

Table 6.3: Analysis of pathogenic mutations in 3 different datasets. Dataset 1 corresponds to the dataset described in the article in Chapter 6. 15 publications were selected from the 2010 volume of the American Journal of Human Genetics. Together, these 15 articles described the effects of 37 mutations using either the experimentally solved structure or a homology model. Dataset 2 consists of 78 mutations that were described in randomly selected publications in journals such as Human Genetics, Human Mutation, and the New England Journal of Medicine. Dataset 3 contains mutations that we studied previously in collaborations with biomedical colleagues. Most of these studies have also been published in various journals and many are mentioned in chapter 5. The threshold for Grantham scores (benign < 62 < damaging) was obtained from Bell et al1. SIFT predicts either tolerated (does not cause a disease) or not-tolerated (causes a disease). PolyPhen-2 classifies mutation as ‘probably damaging’, ‘possibly damaging’, or 'benign'; the best correlation with experiment is obtained by considering the first two categories both as damaging. HOPE's decision tree produces rather elaborate results that have for this study been compressed onto ‘damaging’ and ‘benign’.

Mutants Grantham SIFT PolyPhen-2 GH/SIFT/PF HOPE 2 agree

Dataset 1 37 59% >62 81% not 89% 59% (5% is 92% indicates (AJHG2010) tolerated damaging completely damaging (76% is wrong) mutant (76% “probably obviously damaging”) damaging) Dataset 2 78 71% >62 90% not 95% 67% (1% is 94% indicates (random tolerated damaging completely damaging articles) (94% is wrong) mutant (76% “probably obviously damaging”) damaging) Dataset 3 66 70% >62 89% not 97% 62% 97% indicates (collaborative tolerated damaging damaging projects) (95% is mutant “probably mutant (82% damaging”) obviously damaging) Total 181 68% 88% 95% 64% 95% indicates damaging mutant

151

Automated protein structure annotation of human disease variants - Appendix 6

The extended analysis serves two purposes. First, we wanted to see whether HOPE can indeed predict a pathogenic effect for the mutations that were identified as related to disease. Indeed, the results in Table 6.3 support the results that were described previously in Chapter 6. HOPE performs as good as the best methods with an average of 95% of the disease-causing mutations described as damaging. These results once more underline the value of structural information for the correct classification of mutations. Second, we wanted to compare the reports generated by HOPE with our own manual analyses of the same mutations, and with the analyses provided by the original authors. In the ideal situation HOPE should provide the same answer as a well-trained structural bioinformatician. A detailed comparison of the automatically generated reports with our manual analysis will results in a list of possible future improvements for HOPE. The reports and detailed results of this study are collected on the website http://www.cmbi.ru.nl/~hvensela/HOPEresults. The results showed that about 2/3 of the reports are correct, easily understandable for anyone with a limited background in bioinformatics, and provide insight in the effect of the mutation on the structure of the protein. The remaining reports are also correct and useful but sometimes incomplete and can, in our opinion, be improved in the near future. None of the HOPE reports contained errors. We found the following points for improvement:  Use the biological protein assembly for the analyses  Take multiple conformations of a protein into account  Use long-distance residue interactions  Include more information such as protein annotations and literature information  Analyse more post-mutation interactions These points will be discussed in the following paragraphs.

Use the biological assembly In the optimal case we want to study a protein in its biological complex, when it is interacting with its biological ligand, etc. Unfortunately, due to the nature of the experiments it normally is impossible to mimic the exact biological situation in a protein crystal or NMR solution. As a result the PDB is filled with protein structures that are, for instance, complexed with unnatural ligands, or solved as a monomer while they function as a multimer in vivo. Currently, HOPE decides the structure that will be used on the basis of sequence identity, sequence length, experimental technique, and resolution. In some cases this does not result in the best choice. For example, almost half of the analyzed mutations in the PRPS1 protein occur at the surface of the protein. This protein functions as a hexamer, but the PDB file only contains a dimer. HOPE does indeed report the lost interactions on the dimer-surface but misses the lost interactions on the other surfaces that will disturb the hexameric complex. Instead, it mentions that interactions with other molecules might be lost. Technically spoken this answer is correct but it can be improved using the biological assembly of this protein. To solve this type of problem, we are planning to use macromolecular assemblies stored in the PISA database for our analyses. PISA contains the biological assemblies for each PDB- file and will as such provide information about the location of interface surfaces on the proteins.

Take multiple conformations of a protein into account Another point to consider during template selection is the fact that a protein can change its conformation when it fulfils its function. The KRAS protein, for example, adopts a different conformation depending on the binding of GTP or GDP. Both complexes are solved as separate PDB- files. Residues might make essential contacts in one conformation and not in the other. Currently, HOPE cannot distinguish between these two forms and will use the one that fits its requirements (length, identity, etc) best. We will need to think of an analysis option that includes both conformations for these proteins. The heterogeneity of database annotations regarding function- related conformational changes will make this a difficult problem to solve.

152

Automated protein structure annotation of human disease variants - Appendix 6

Use long-distance-interactions In a few cases the mutation does not seem to have a severe effect on the function of the mutated residue itself. However, some of these residues are in close contact with other very important residues. For instance, HOPE does not (yet) know that the G382 in SPTC2 is one of the residues that supports ligand binding pocket residues and therefore HOPE does not use this important piece of information in the report. One small check for important functions in residues that surround the mutated one could improve the report in this case, but this would come at the risk of producing a large number of false-positive warnings. Currently, HOPE can already identify the effects of a mutation on a position next to a residue that makes a cysteine-bond, interaction with a metal-ion, ligand, or an active site. We are planning to extend this neighbour-analysis to residues that are in contact with the mutated residue via hydrogen bonds or ionic interactions. So, when a mutation affects a residue that makes, for instance, a salt bridge with an active site residue, this will be listed in the report.

Include more information such as protein annotations, and literature information Some reports could be improved significantly if HOPE could use information that is currently not part of the Swiss-Prot annotation. Useful information includes facts like the location of the active site in SPTC2 and YARS2, or the function of the ‘Ig-like’ domains in PVRL4. HOPE cannot use information that neither is annotated somewhere in any way nor accessible through a protocol such as DAS. Adding new information to the sequence features section of the Swiss-Prot record of that particular protein will allow HOPE to access and use the data, thereby improving the predictions. Another option would be to use other servers that specialize in a very small but sometimes useful part of protein analysis. At this moment we are looking into the possibility to use more servers such as CBS' TargetP2 to obtain the location of signal peptides, ProP3 to obtain the location of cleavage sites, or TANGO to obtain aggregation properties. We are also planning to include domain-information from the InterPro-database4. Knowing the location and possible function of a domain will improve the reports for the mutations that destroy either the domain itself or the contacts with that domain. For instance, almost half of the analyzed PTPN11 mutations affected the interaction with another part of the protein5. As soon as HOPE knows that the other domain is an SH2 domain that has a function in regulation, we can add this information to the report. Analyzing the results for almost 200 cases made us realize that an enormous amount of little facts is stored in the heads of experts all over the world or otherwise is carefully hidden in the literature. We realize that making all this knowledge accessible in databases will take a while, or is, in some cases, simply still impossible.

Analyse more post-mutation interactions Currently, HOPE reports the effect of the mutations based on the interactions made in the wild-type structure. This means that any interaction that is lost will be mentioned in the report. However, during our research we found a few examples of mutations that introduced new interactions. These new interactions are difficult to predict, but can nonetheless affect the protein function, by locking the protein in a certain state. In many proteins, freedom of movement is required to fulfil their function and adding a new stabilizing interaction could render the protein less active. In the future, we will also study these new interactions that are introduced in the protein structure.

153

Automated protein structure annotation of human disease variants - Appendix 6

Conclusion This study resulted in a list of possible improvements for HOPE. Some, of these improvements require that scientists change their way of thinking about information and share their data using databases, such as Swiss-Prot. Some improvements will require a drastic change in the way data is stored in the HOPE database and will therefore take a while before they can be implemented, e.g. the addition of post-mutational information. Other improvements, like the use of PISA-assemblies, will be implemented soon and will thereby improve the reports. Additionally, we are planning to enable HOPE to quantify mutations. In the future, it should be possible to submit a list of candidate mutations that were found in a NGS-experiment. HOPE will then perform its analysis for all these mutants and sort them by the probability that the mutation at hand is causing the disease. This will become a useful option for the field of clinical genetics. It may be clear that HOPE is an on-going project that will never be finished, but at this moment HOPE can already provide that last piece that is needed to complete the molecular puzzle.

References 1. Bell, J., Bodmer, D., Sistermans, E., and Simon C., R. (2008). Practice guidelines for the interpretation and reporting of unclassified varients (UVs) in Clinical Molecular Genetics. 2. Emanuelsson, O., Brunak, S., von Heijne, G., and Nielsen, H. (2007). Locating proteins in the cell using TargetP, SignalP and related tools. Nat Protoc 2, 953-971. 3. Duckert, P., Brunak, S., and Blom, N. (2004). Prediction of proprotein convertase cleavage sites. Protein Eng Des Sel 17, 107-112. 4. Hunter, S., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bork, P., Das, U., Daugherty, L., Duquenne, L., et al. (2009). InterPro: the integrative protein signature database. Nucleic Acids Res 37, D211-215. 5. Tartaglia, M., Martinelli, S., Stella, L., Bocchinfuso, G., Flex, E., Cordeddu, V., Zampino, G., Burgt, I., Palleschi, A., Petrucci, T.C., et al. (2006). Diversity and functional consequences of germline and somatic PTPN11 mutations in human disease. Am J Hum Genet 78, 279-290.

154

Summary/Samenvatting 7

Previous page:

Ebola Virus Matrix Protein – Eight proteins form a disc-like structure that binds RNA with its inner pore surface. (PDB file 1h2c)

156

Summary/Samenvatting 7

Summary The majority of all processes in our body are carried out by proteins; from the conversion of food into energy to the protection of cells against pathogens, and from growth-regulation to the transduction of signals across membranes. We can therefore call proteins the work-horses of our body. The function of a protein is enabled by its conformation, or structure, which in turn depends on the order of the amino acids, also known as the sequence. The amino acids are linked in a long chain which can fold into the final shape of the protein. The order of the amino acids is encoded in our DNA. Chapter 1 contains more detailed information about the amino acids and protein structures. During recent years, lots of research has been aimed at improving and accelerating the techniques that can be used to determine DNA sequences. As a result, it is now fairly easy to sequence (determine the order of) an individual’s DNA. This DNA sequence can be translated into the protein sequences and this is how recently many new protein sequences have been determined and stored in databases. Determining the 3D-structure of a protein, for instance by NMR or X-ray crystallography, is a more difficult and more time-consuming task. Currently, there is a wide gap between the number of known protein sequences and known protein structures, and this gap continues to grow. Fortunately, we can predict many of these (yet) unknown protein structures by using a technique called ‘homology modeling’. This technique uses the fact that proteins with approximately the same sequence will adopt approximately the same 3D-conformation. This means that we can predict the structure of one protein by using the structure of another protein that has a similar sequence. We describe this homology modeling technique as an 8-step process that is extensively discussed in chapter 2. Over the years, this technique has been used a lot, often in combination with other methods that can shed light on 3D-structures. More information about the developments and improvements in homology modeling and its combination with other spectroscopic techniques is provided in chapter 3. The use of 3D-structures/homology models can be of great value in (bio)medical research. As mentioned before, the technique for DNA-sequencing has greatly improved in recent years. Hospitals and research-departments are now routinely using the so-called ‘next generation sequencing’ technique to analyze the DNA-sequence of a patient suffering from a hitherto not understood disease or syndrome. Diseases caused by an abnormality in the DNA can be discovered by comparing the patient’s DNA with reference-DNA obtained from one or more healthy individuals. This is how we can discover many of these abnormalities, also known as mutations. Sixty percent of the abnormalities are found in the protein coding parts of the DNA, the genes. However, knowing the position of a mutation in the sequence is not enough to understand the effect of that mutation on the structure and function of the protein. The position in the sequence alone does not tell us whether the mutation is disturbing, for instance, the active site, a DNA-binding site, or dimerisation. In order to truly understand the effect of a mutation on the function of a protein it is necessary to study all available information about the protein in the context of its 3D-structure. Unfortunately, not every (bio)medical researcher has the possibility to use hardware and software required for an extensive structural analysis. Collaboration between geneticists and bioinformaticians often does not only result in a better understanding of the effect of a mutation on the 3D-structure and function of the protein, but it can also generate new ideas for future experiments. Eventually, this research could form the basis for the development of new medicines to treat the disease. During the last 5 years we performed many structural analyses in a series of collaborative projects. The development of the ‘next-generation-sequencing’-technique has caused an increase in the number of detected disease-causing mutations, and has thereby also increased the demand for structural analyses. We have developed an automatic approach for mutational analysis using all our knowledge and experiences obtained in these collaborations. The result of this work is HOPE, a web- server for the automatic analysis of point mutations in proteins. HOPE collects information from several information sources, including the 3D-structure or homology model, and combines this into a report that describes the effect of the mutation on the structure and function of the protein. The

157

Summary/Samenvatting 7 most important source of information will always be the 3D-structure. In case this structure is unavailable, HOPE will try to build a homology model using YASARA. The YASARA software was developed at the CMBI by my predecessor, Elmar Krieger, and can be used to visualize, analyze and predict protein structures. HOPE’s analysis of a model, or existing structure, is done using WHAT IF Web services. These analyses result in information about, for instance, the accessibility of the residue, the secondary structure, hydrogen bonds, etc. Besides that, HOPE will also collect information from other sources. The SwissProt database contains information that cannot directly be deducted from the protein structure, like names of protein domains or regions, the location of the active site, or known variants in the sequence. It is also interesting to know whether a residue is conserved or not and therefore HOPE obtains the conservation-scores from the HSSP database. In some cases no protein structure for a protein sequence is known yet and no template for homology modeling can be found. In these cases the structural analysis by the WHAT IF Web services cannot be performed. Fortunately, several servers exists that can predict properties of the protein based on the sequence alone. HOPE uses these servers to predict for instance the transmembrane domains, the secondary structure, and the location of phosphorylation sites. On top of that, HOPE will always compare the wild-type and mutant residue based on their amino acid properties (size, hydrophobicity, and charge). This is how HOPE can always produce a report that describes the effect of the mutant on the protein structure. HOPE’s interface was designed to be easy to use and understandable for everyone. The user only needs to submit the amino acid sequence of the protein of interest. In a next step the user can simply click on the mutated residue and indicate the new residue at that position. The complete process of choosing the structure for analysis or building the model if possible, collecting the information, and generating the report is done without intervention by the user. The report was also designed to be understandable for everyone in the (bio)medical field, even if they have no bioinformatics background. Bioinformatics and biophysics keywords are linked to our in-house bioinformatics dictionary for bio-medical researchers. The HOPE report is illustrated with pictures, including an overview of the complete protein as well as close-ups of the mutation. The report also contains animations that can be used in presentations or on websites. A more extensive description of the technical background of HOPE can be found in chapter 4. So far, many (bio)medical researchers have used the knowledge and technology provided by our bioinformatics department. Chapter 5 consists of a series of summarized articles that have been published about these projects. Our contribution in these collaborations consisted of the analysis of the structural effects of one or more mutations. In a majority of the cases we had to build a homology model first. In a few cases homology modeling was not possible and we had to rely on methods that only use the protein sequence, such as predicting the transmembrane domains. In almost all cases our analysis provided the last little bit of insight that was required to understand the effect of the mutation completely. We used the knowledge and experience that we gained by participating in these projects to further develop HOPE. Chapter 6 describes the validation of HOPE. A large dataset of mutations was selected from the 2010 volume of the American Journal of Human Genetics. Both the mutations that were linked to a disease and the benign polymorphisms in the same proteins were analysed by HOPE and by three other widely used methods; Grantham, SIFT and PolyPhen-2. The results showed that Project HOPE can correctly identify 92% of the disease-linked mutations as ‘damaging’, and 67% of the polymorphisms as ‘benign’. Comparison of these results with the other 3 methods showed that the methods that can use structural information (PolyPhen-2 and HOPE) predict the effect of the mutation significantly better than the other methods. The added value of HOPE over PolyPhen-2 lies in the fact that the HOPE report not only explains that a mutation is damaging, but also why that mutation is damaging. The examples in chapter 6 show that this can be very helpful in cases where other methods disagree or in cases where more information is required for follow-up experiments. This analysis was extended with mutations from 2 other datasets. One dataset consisted of mutations that were collected from randomly selected articles published in journals such as Nature Genetics and Human Mutation. The third dataset consisted of mutations that were studied earlier in

158

Summary/Samenvatting 7 collaborative projects, most of which are described in the summaries in chapter 5. The HOPE-reports generated for these mutations and the predictions provided by Grantham, SIFT, and PolyPhen-2 were compared with our own manual analyses and with the predictions by the authors of the articles. The results underline our previous findings that Project HOPE performs as good as PolyPhen-2. An extensive analysis of all reports revealed several possible points for improvement of HOPE. It is impossible to improve each of these points on a short term, for instance, because the required database or server simply does not yet exist. We also found that a large part of the information known in literature is not yet stored in databases that are accessible to programs such as HOPE. In the short term, HOPE will be able to use information about the biological protein-assemblies as stored in the PISA database. This is one step in the larger process of improving the choice for templates/structures that will be further analyzed in HOPE, and will be the first project of my successor. For a significant number of proteins, the structure was solved multiple times, for example in different conformations, or in interactions with different substrates. We would like to improve the choice for structures/templates so that HOPE can also use the information that is stored in all these different structures. Besides that, HOPE will soon be able to think about long-distance effects of mutations. A mutation from a small residue into a bigger one in the core of a protein, for instance, might affect another residue in its vicinity. When this particular residue has an important task, such as substrate binding, that task might be disturbed by this mutation. The current HOPE version only takes into account the function of the mutated residue itself and the ones that precede and follow the mutated residue in the sequence. The analysis will be extended to those residues that are close to the mutated one in the 3D space. Other results and ideas for HOPE’s improvement are discussed further in chapter 6 and 6a. HOPE could be used to determine the severity of mutations and to rank them based on their effect on the protein structure. In the future we would like to adapt HOPE in such a way that each effect of a mutation receives a score, and that HOPE can analyse multiple mutations at one go. This means that HOPE will be able to score and prioritize candidate mutations. Eventually, HOPE can become an essential component in the pipeline for the analysis of next generation sequencing data.

159

Summary/Samenvatting 7

Samenvatting Verreweg de meeste processen ons lichaam worden uitgevoerd door eiwitten; van het omzetten van voedsel in energie tot het beschermen van cellen tegen indringers, en van het reguleren van groei tot het doorgeven van prikkels van buitenaf. We kunnen eiwitten daarom de werkpaarden van ons lichaam noemen. De functie van een eiwit wordt mogelijk gemaakt door de vorm, ofwel de structuur, van dat eiwit en deze structuur wordt weer bepaald door de volgorde van de aminozuren, ofwel de sequentie. De aminozuren zijn aan elkaar gekoppeld in een lange keten die opvouwt tot de uiteindelijke vorm van het eiwit. De volgorde van deze aminozuren ligt gecodeerd in ons DNA. In hoofdstuk 1 wordt ingegaan op de achtergrond van de aminozuren en eiwitstructuren. De afgelopen jaren is er veel onderzoek gedaan naar het verbeteren en versnellen van technieken om de DNA volgorde te bepalen. Het resultaat is dat het tegenwoordig relatief gemakkelijk is om DNA te sequencen (volgorde te bepalen). Wanneer de DNA volgorde bekend is kan deze vervolgens vertaald worden naar de aminozuur volgorde van een eiwit. Op deze manier zijn de laatste jaren vele nieuwe sequenties van eiwitten bepaald en opgeslagen in databases. Het bepalen van de 3D-structuur van deze eiwitten met bijvoorbeeld NMR of X-ray crystallografie is echter veel lastiger en duurt daarom ook veel langer. Er zijn op dit moment veel meer eiwit sequenties dan structuren bekend en dit verschil wordt alleen maar groter. Gelukkig kunnen we een deel van de (nog) onbekende eiwitstructuren bepalen door gebruik te maken van de techniek ‘homologie modeleren’. Deze techniek is gebaseerd op het feit dat eiwitten met ongeveer dezelfde sequentie ook ongeveer dezelfde structuur hebben. We kunnen dus een eiwitstructuur voorspellen op basis van een eiwit waarvan de sequentie enigszins lijkt op de sequentie van het eiwit van interesse. We hebben de homologie modeleer techniek onderverdeeld in 8 stappen die uitvoerig beschreven zijn in hoofdstuk 2. Door de jaren heen is deze techniek vaak gebruikt, soms in combinatie met andere methoden die iets kunnen zeggen over de 3D-structuur. Hoofdstuk 3 gaat dieper in op de ontwikkelingen, verbeteringen en het gebruik van homologie modeleren in combinatie met andere experimentele methoden. Het gebruik van deze 3D-eiwitstructuren en/of homologie modellen kan van grote waarde kan zijn voor het (bio)medisch onderzoek. Zoals al eerder vermeld is de techniek om DNA te sequencen de laatste jaren sterk verbeterd. Ziekenhuizen en onderzoekscentra gebruiken nu deze zogenaamde ‘next generation sequencing’ techniek om snel en gemakkelijk de DNA volgorde van een patiënt met een onbegrepen ziekte of syndroom te bepalen. Ziektes die veroorzaakt worden door een afwijking in het DNA kunnen ontdekt worden door de DNA volgorde van de patiënt te vergelijken met het referentie-DNA van een of meerdere gezonde personen. Op deze manier worden vele afwijkingen, ofwel mutaties, ontdekt die een ziekte kunnen veroorzaken. Zestig procent van de afwijkingen wordt gevonden in de delen van het DNA die coderen voor eiwitten; de genen. De kennis van de locatie van een mutatie in een gen zegt echter nog weinig over het effect hiervan op de structuur en dus op de functie van het bijbehorende eiwit. Aan alleen de mutatie in de eiwit sequentie is niet te zien of bijvoorbeeld de actieve site, de DNA binding site, of dimerisatie verstoord is. Om het werkelijke effect van een mutatie op de functie van het eiwit te begrijpen is het nodig om alle beschikbare informatie over het eiwit te bestuderen in het kader van de 3D-structuur. Helaas heeft niet elke (bio)medische onderzoeker de beschikking over de hardware en software die nodig is voor een uitgebreide structuur analyse. Niet alleen kan samenwerking tussen genetici en bioinformatici resulteren in begrip van het effect van een mutatie op de 3D-structuur en functie van het eiwit het kan ook nieuwe ideeen voor verder onderzoek opleveren. Uiteindelijk zou al dit onderzoek zelfs kunnen resulteren in de ontwikkeling van een medicijn voor de ziekte. Gedurende de afgelopen jaren hebben wij de structurele analyse verzorgd in diverse samenwerkingsprojecten. De ontwikkeling van de ‘next generation sequencing’ techniek heeft de ontdekking van ziekte-veroorzakende mutanten en daarmee de vraag naar structurele analyses doen toenemen. De afgelopen jaren hebben wij daarom gewerkt aan de mogelijkheid om deze mutanten automatisch te analyseren. Daarbij konden we gebruik maken van de kennis en ervaringen die deze samenwerkingsprojecten opleverde. Dit werk heeft geleid tot de ontwikkeling van Project HOPE, een

160

Summary/Samenvatting 7 webserver voor de automatische analyse van puntmutaties in eiwitten. HOPE raadpleegt meerdere informatie-bronnen waaronder de 3D-structuur of een homologie model, en combineert de informatie tot een rapport waarin het effect van de mutant op de structuur en functie beschreven staat. De belangrijkste bron van informatie voor HOPE is altijd de 3D-structuur. Als deze niet beschikbaar is dan zal HOPE proberen een homologie model te laten bouwen door YASARA. YASARA is de software die door mijn voorganger, Elmar Krieger, aan het CMBI ontwikkeld is voor de visualisatie, analyse en voorspelling van eiwit-structuren. Voor HOPE wordt de analyse van een model, of van een bestaande structuur gedaan door WHAT IF Web services. De analyse resulteert in informatie over bijvoorbeeld de toegankelijkheid van het residue, de secundaire structuur, waterstofbruggen, enz. Daarnaast raadpleegt HOPE ook informatie uit andere bronnen. Zo bevat de SwissProt database informatie die niet rechtstreeks afgelezen kan worden uit een structuur, zoals de namen van eiwit-domeinen of regionen, de locatie van de active site of bekende variaties in de sequentie. Daarnaast is het interessant om te weten hoe geconserveerd een residue is en daarom gebruikt HOPE de conserverings-scores uit de HSSP database. Het komt natuurlijk voor dat de structuur bij een bepaalde sequentie nog niet is opgehelderd en dat er ook geen enkele homologe structuur bekend is. De structurele analyse door de WHAT IF Web services kan dan niet worden uitgevoerd. Er bestaan echter wel servers die op basis van alleen een sequentie bepaalde eigenschappen kunnen voorspellen. HOPE gebruikt deze servers voor het voorspellen van o.a. de transmembraan domeinen, de secundaire structuur en de locatie van phosphorylatie plaatsen. Bovendien zal HOPE altijd het oude en het nieuwe aminozuur, dus het wildtype en de mutant, vergelijken op basis van hun eigenschappen (grootte, hydrofobiciteit en lading). Op deze manier kan HOPE altijd een rapport produceren dat een beschrijving geeft van het effect van de mutatie op de structuur van het eiwit. Het interface van HOPE is ontworpen om zo gemakkelijk en begrijpelijk mogelijk te zijn voor iedereen. De gebruiker hoeft alleen maar de aminozuur sequentie van het eiwit in te vullen. Daarna kan hij/zij op de website het gemuteerde aminozuur aanklikken en het nieuwe aminozuur op die locatie aangeven. Het totale proces van het kiezen van de structuur of het eventuele bouwen van het model, het verzamelen van informatie en het produceren van het rapport gebeurt zonder tussenkomst van de gebruiker. Het HOPE-rapport is zo opgesteld dat het te begrijpen is voor alle onderzoekers uit het (bio)medische veld, ook voor hen die maar weinig kennis van structurele bioinformatica hebben. Eventuele lastige woorden uit het bioinformatica/biofysica-veld kunnen via een help-knop worden doorgelinkt naar ons eigen bioinformatica woordenboek. Het HOPE-rapport is geïllustreerd met afbeeldingen van zowel een overzicht van het gehele eiwit als close-ups van de mutatie. Daarnaast bevat het rapport ook bewegende animaties die te gebruiken zijn in bijvoorbeeld presentaties of op websites. We hopen dat Project HOPE hiermee zo gemakkelijk en gebruiksvriendelijk mogelijk is voor iedereen in het (bio)medische onderzoek. Een uitgebreide beschrijving van de technische achtergrond van Project HOPE is te vinden in hoofdstuk 4. Vele (bio)medische onderzoekers hebben intussen gebruik gemaakt van de kennis en technologie die onze bioinformatica afdeling te bieden heeft. Hoofdstuk 5 bestaat uit samenvattingen van de hierover gepubliceerde artikelen. Ons aandeel in deze samenwerkings- projecten bestond uit het analyseren van het effect van een mutatie op de eiwit structuur. In het grootste gedeelte van deze gevallen was het nodig om eerst een homologie model te bouwen. In een klein aantal projecten was het zelfs niet mogelijk om een homologie model te maken en moesten we teruggrijpen op technieken die alleen maar gebruik maken van de sequentie, bijvoorbeeld het voorspellen van transmembraan-domeinen. In vrijwel alle gevallen leverde de analyse net het laatste stukje begrip dat nodig was om het effect van de mutatie te begrijpen. De kennis en ervaringen die wij verkregen uit deze projecten konden worden gebruikt om HOPE te optimaliseren. Hoofdstuk 6 beschrijft de validatie van HOPE. Een grote dataset werd samengesteld uit mutaties beschreven in de 2010 jaargang van het American Journal of Human Genetics. Zowel de ziekte-gerelateerde mutaties als de goedaardige polymorfismes in dezelfde eiwitten werden geanalyseerd door HOPE en door 3 andere veelgebruikte methodes; Grantham, SIFT en PolyPhen-2. Uit de resultaten bleek dat HOPE 92% van de ziekte-gerelateerde mutaties inderdaad voorspeld als

161

Summary/Samenvatting 7

‘negatief effect op het eiwit’ en 67% van de polymorphismes als ‘geen negatief effect op het eiwit’. Vergelijking van deze resultaten met de andere 3 methodes wees uit dat methodes die gebruik maken van structurele informatie (PolyPhen-2 en HOPE) een significant betere voorspelling opleveren dan de andere methoden. De toegevoegde waarde van HOPE ten opzichte van PolyPhen-2 ligt in het feit dat het HOPE report niet alleen uitlegt dat een mutatie mogelijk een negatief effect heeft, maar ook waarom dit zo is. De voorbeelden in hoofdstuk 6 laten zien dat deze kennis erg waardevol kan zijn in die gevallen waarbij de andere methodes geen eenduidige antwoorden geven. Ook voor bijvoorbeeld het opzetten van vervolg-experimenten is inzicht in het structurele effect van een mutatie zeer waardevol. De analyse werd uitgebreid met 2 andere datasets van mutaties. Een van deze datasets bestond uit mutaties die verzameld werden uit andere willekeurig geselecteerde artikelen gepubliceerd in bijvoorbeeld Nature Genetics en Human Mutation. De derde dataset bestond uit mutaties die we al eerder bestudeerden in samenwerkingsprojecten. Een deel van deze projecten wordt ook beschreven in de samenvattingen in hoofdstuk 5. De HOPE rapporten behorende bij deze mutaties en de voorspellingen van Grantham, SIFT en PolyPhen-2 werden vergeleken met onze eigen handmatige analyse en met de analyses gedaan door de auteurs van de artikelen. De resultaten zijn in overeenstemming met de eerdere bevindingen dat HOPE’s resultaten vergelijkbaar zijn met die van PolyPhen-2. Een uitgebreide analyse van alle rapporten onthulde een aantal mogelijke verbeterpunten voor HOPE. Het is onmogelijke om op korte termijn aan al deze punten te werken, bijvoorbeeld omdat de vereiste database of server nog niet bestaat. Ook bleek dat een groot deel van de in literatuur bekende informatie nog niet is opgenomen in databases die toegankelijk zijn voor programma’s zoals HOPE. Op korte termijn zal HOPE wel in staat zijn om beter gebruik te maken van de informatie over biologische eiwit-complexen zoals deze is opgeslagen in de PISA database. Dit is een onderdeel van een grote verbeterstap in het kiezen van de juiste structuur voor verdere analyse en zal het eerste project worden van mijn opvolger. Voor een aantal eiwitten is de structuur meerdere malen, in verschillende vormen of met verschillende substraten, bepaald. We willen de keuze voor de juiste structuur verbeteren zodat HOPE zoveel mogelijk gebruik kan maken van de informatie opgeslagen in al deze structuren. Daarnaast zal HOPE binnenkort kunnen kijken naar de lange-afstands effecten van mutaties. Het is bijvoorbeeld mogelijk dat een mutatie van een klein naar een groot residue in de kern van het eiwit effect heeft op een residue in de buurt. Wanneer dit buur-residue een belangrijke taak heeft, zoals bijvoorbeeld het binden van een substraat, zal deze functie worden verstoord. Op dit moment kijkt HOPE alleen nog maar naar de functie van het gemuteerde residue zelf en naar het voorgaande/volgende residue in de sequentie. Deze analyse zal uitgebreid worden met de residuen die in de 3D-omgeving van het gemuteerde residue liggen Uitgebreide resultaten van HOPE’s validatie en de daaruit voortkomende ideeën voor verbeteringen zijn beschreven in hoofdstuk 6 en 6a. HOPE zou gebruikt kunnen worden om de ernst van de structuurverstoring te bepalen voor een aantal mutaties en deze vervolgens te rangschikken. In de toekomst willen we dan ook HOPE zodanig aanpassen dat elk effect van de mutatie een aparte score krijgt en dat meerdere mutanten tegelijk geanalyseerd kunnen worden. HOPE zal dan dus kandidaat mutaties kunnen scoren en rangschikken. Op deze manier kan HOPE uiteindelijk een onderdeel worden van de pijplijn voor het analyseren van next generation sequencing data.

162

Dankwoord - CV - Publicaties 8

Previous page:

Crambin – This small plant protein was used frequently to test our software systems. (PDB file 1crn)

164

Dankwoord - CV - Publicaties 8

Dankwoord

Mijn proefschrift had nooit tot stand kunnen komen zonder de hulp en support van een aantal mensen die ik hier graag even wil vermelden.

Ten eerste natuurlijk Prof. dr. G. Vriend, ofwel Gert. Al in het tweede jaar van mijn studie heb je me enthousiast gemaakt voor de bioinformatica, al duurde het nog een paar jaar (plus een niet zo'n succesvolle poging op het lab) voor ik werkelijk voor het computerwerk durfde te kiezen. Dankzij jouw visie op de bioinformatica vs. onderzoekswereld kon dit project tot stand komen. Jij durfde het aan om een niet-typische bioinformaticus aan te stellen op een niet-typisch onderwerp, ook al geloofden een aantal collega's in eerste instantie niet in het slagen van dit project. Ik ben je dankbaar voor alle kansen die ik gekregen heb om mij te kunnen ontwikkelen tot wie ik ben, zowel tijdens als buiten het werk. Ik blijf met plezier nog even hangen...

Prof. dr. H. Brunner, beste Han. Officieel was ik bij jou in dienst, al heb ik nooit helemaal begrepen hoe dit boekhoudkundig in elkaar zat. In de praktijk bleek de samenwerking tussen antropogenetica en het CMBI heel erg vruchtbaar en heeft verschillende mooie publicaties opgeleverd. Ik hoop dat we in de komende jaren nog meer voor elkaar kunnen betekenen.

Tim en Remko, soms omschrijf ik jullie als 'mijn technische mannen'. Ik besef heel goed dat er weinig HOPE was geweest zonder jullie technische kennis. Ik hoop dat jullie dit proefschrift dan ook een beetje als jullie eigen verdienste kunnen zien. Mijn 'roomies' voor soms korte, soms lange tijd, Maarten, Robbie, Bas en Jules. Naast het feit dat ik jullie altijd lastig mocht vallen met (soms stomme) vragen, heb ik jullie ook als zeer prettige collega's leren kennen. Ik ben blij dat er op onze kamer naast gesprekken over computers ook aandacht was voor allerlei andere politieke, maatschappelijke, culturele, etc, onderwerpen. Robbie/Bas, het deed mij plezier om jullie blijdschap te zien toen de gezinsuitbreiding zich aandiende. Veel geluk met jullie gezinnetjes. Jules, wij gaan gewoon nog even gezellig verder aan HOPE. En Maarten, jouw hulp en programma's houden voor een groot deel HOPE (en nog veel meer) draaiende, dank daarvoor! Prof. Dr. Roland Dunbrack. I've only spent a month in your lab in Philly, but it was an experience of a life-time. You gave me the chance to see and feel another lab, another country, and another culture, thank you so much for this opportunity! Lisa, Annika, Franscesca, Shima en Marlou, om de 1 of andere reden heb ik alleen maar vrouwelijke studenten gehad om te begeleiden. Girl-power dus! Jullie hebben allemaal je eigen bijdrage geleverd aan HOPE. Vele mutaties en artikelen hebben jullie bestudeerd en deze kennis is gebruikt om HOPE nog beter en duidelijker te maken. Ik hoop dat jullie met plezier terugkijken op je stage, veel succes in jullie verdere carrière.

Het CMBI is natuurlijk veel groter dan kamer 0.19. Ik wil al mijn collega's bedanken voor de zeer prettige sfeer op onze afdeling, maar een paar mensen verdienen toch echt een aparte vermelding. Barbara, stiekem het organisatorische middelpunt van het CMBI, er is niks wat jij niet weet of op kan lossen. Bedankt voor al het regelwerk (en het openen van de deur als ik weer eens mijn pasje vergat). Wilmar, jij houdt de afdeling technisch gezien draaiende. Bedankt voor al je hulp als mijn pc of de 3D- beamer weer eens gek deed. Elmar (CMBI-outstation in Austria), where would we be without YASARA?? There is no other program that has been so important during my Ph.D, and I cannot imagine what the first-year-bioinformatics course would look like nowadays without it. I'd also like to thank you for answering my numerous questions about obscure YASARA options. I wish you and your

165

Dankwoord - CV - Publicaties 8 family all the best. Even een aparte vermelding voor NBIC. Ik ben jullie samen met alle bioinformatici erg dankbaar voor het organiseren en samenbrengen van bioinformatica in Nederland. Samen met mijn mede RSG-leden heb ik van jullie steun mogen genieten.

Geen enkel proefschrift wordt door een enkel persoon geschreven. Mijn proefschrift is echter een extreem geval van vele samenwerkings-projecten. Het gevolg is dat ik vele bladzijdes nodig zou hebben om al mijn collega's op andere afdelingen te bedanken. Ik hoop echter dat zij zelf beseffen dat HOPE niet alleen voor, maar ook door hen tot stand is gekomen. Ik ben blij dat ik zoveel mensen van dienst heb kunnen zijn en dat mijn analyses zo vaak net een laatste beetje extra toe hebben kunnen voegen aan hun werk.

En dan komen we buiten het CMBI, maar niet helemaal buiten de universiteit. Ik wil mijn ex- studiegenoten bedanken voor de fantastische studententijd die ik heb gehad. Met sommigen van jullie heb ik later nog samen mogen werken, anderen zie ik helaas niet zoveel meer. Miriam, Fieke en Jordy, vrienden-sinds-de-intro, door drukke schema's zie ik jullie niet zo vaak als ik zou willen, maar ik ben dankbaar voor elke minuut dat we samen leuke dingen kunnen doen. Buiten de universiteit heb ik door de jaren heen ook vele mensen leren kennen die allemaal om een andere reden speciaal voor me zijn. Mijn vrienden in de Poleworld. Met jullie leef ik soms bijna in een parallel universum, wat een mooi contrast vormt met de academische wereld. Bedankt voor alle leuke uitjes/shows/ evenementen/activiteiten die ik met jullie heb mogen meemaken! Eva, met jou kan ik af en toe een laagje dieper, je hebt me een aantal bijzondere dingen geleerd, bedankt voor je steun in voor en tegenspoed! En dan natuurlijk Michel: soms zoek je 10 jaar naar iets, en dan ligt het recht voor je neus ;-) Ik ben blij dat onze wegen zich toch samengevoegd hebben. Er komen nog vele avonturen die we samen aan kunnen gaan!

En dan als laatste, maar natuurlijk niet het minste: mijn familie.... Je familie is waar het mee begint, de mensen die je vormen in je eerste levensjaren en ervoor zorgen dat je de persoon wordt die je bent. Matthijs, mijn kleine broertje dat nu ruim een kop groter is dan ik. Ik kan nog steeds niet geloven dat dat kleine opdondertje nu studeert, autorijdt en werkt! Ik wens je al het beste in wat er nog komen gaat. Marjolein, mijn kleine zusje. Voor mensen is het soms moeilijk te geloven dat we zussen zijn, zo verschillend lijken we soms. Toch denken en doen we ook vaak hetzelfde...kijk alleen maar naar het feit dat jij nu ook al dr. bent :-p Ik ben er trots op dat jij mijn zusje bent. Rens, sinds september 2011 deel van onze familie, en wat ben ik daar blij mee. Je bent een geval apart, super-eigenwijs soms, maar dat is ook wat ik in jou waardeer. Zolang het maar geen mainstream is, is het goed! Bedankt dat je zo vaak voor me klaar staat. Pa en ma, de allerbelangrijkste mensen in mijn leven... nu pas zie ik in hoe jullie mij gevormd hebben en hoe dankbaar ik mag zijn voor alle zorg en steun die ik altijd van jullie heb gehad. Zonder jullie was ik nooit geweest wie ik nu ben. Ik zou nog pagina’s vol kunnen schrijven, maar dat zou nog niet genoeg zijn om duidelijk te maken wat een fijne, lieve, maar vooral bijzondere familie ik heb ….dank voor alles...

Liefs, Hanka

166

Dankwoord - CV - Publicaties 8

Curriculum Vitae

Hanka Venselaar werd geboren op 27 oktober Hanka Venselaar was born on October 27th in 1982 in Tilburg en groeide op in Berkel-Enschot. Tilburg and grew up in Berkel-Enschot. She In 2001 behaalde zij haar VWO-diploma aan het attented highschool at the Cobbenhagen Tilburgse Cobbenhagen College en behoorde College in Tilburg where she received her VWO- daarmee tot de eerste lichting 2de-fase diploma in 2001. She was one of the first so- scholieren. Hierna ging zij Moleculaire called "2nd phase students". She then started Levenswetenschappen studeren aan de Radboud her studies, Molecular Life Sciences, at the Universiteit Nijmegen. Zij deed haar eerste stage Radboud University Nijmegen. She did her first aan het Centrum voor Moleculaire en internship at the Centre for Molecular and Biomoleculaire Informatica (CMBI) onder leiding Biomolecular Informatics (CMBI) under van Prof. Dr. Gert Vriend en dr. Sander Nabuurs. supervision of Prof. Dr. G. Vriend en dr. Sander Een tweede stage deed zij op de afdeling Nabuurs. She did a second internship at the Antropogenetica onder leiding van dr. Joris department of Human Genetics under Veltman en dr. Lisenka Vissers waarna zij in in supervision of dr. Joris Veltman and dr. Lisenka oktober 2006 haar doctoraalexamen behaalde. Vissers. She obtained here Master's degree in Hierna keerde zij terug naar het CMBI om te October 2006 after which she returned to the starten als junior onderzoeker op het onderwerp CMBI. As a PhD student she studied the "Toepassingen van Homologie Modeleren". "Applications of Homology Modelling". During Tijdens deze periode bracht zij een maand door this period she visited the lab of Prof. dr. Roland op het lab van Prof. dr. Roland Dunbrack in Dunbrack in for a month. These Philadelphia. Het werk dat verricht is gedurende years of research have lead to this thesis and an deze jaren heeft geleid tot dit proefschrift en een online system for automatic mutant analysis online systeem voor automatische mutant called "HOPE". During her time as researcher at analyse genaamd "HOPE". Gedurende haar the CMBI Hanka was also a board member of onderzoeksperiode was Hanka ook board- the Regional Student Group. Together with her member van de Regional Student Group (RSG) fellow board-members she organized activities waarmee zij activiteiten voor jonge for young bioinformatics-researchers in the bioinformatica-onderzoekers organiseerde. Naast Netherlands. Besides her work as PhD student haar werk als onderzoeker is Hanka ook docent Hanka is also an (aerial) acrobat and instructor. en artieste (lucht)acrobatiek en sinds januari 2011 Since Januari 2011 she is the proud owner of eigenaar van haar eigen bedrijf "Studio Ad Astra". her own company "Studio Ad Astra".

167

Dankwoord - CV - Publicaties 8

2012 Gerrits L, Venselaar H, Wieringa B, Wansink DG, Hendriks WJ.

Identification of recurrent and novel mutations Membrane topology and intracellular in TULP1 in Pakistani families with early-onset processing of Cyclin M2 (CNNM2). retinitis pigmentosa. Journal of Biological Chemistry 2012; 287(17): Molecular Vision 2012; 18: 1226-1237 13644-13655 Ajmal M, Khan MI, Michael S, Ahmed W, Shah A, De Baaij JH, Stuiver M, Meij IC, Lainez S, Kopplin Venselaar H, Bokhari H, Azam, A, Waheed NK, K, Venselaar H, Mueller D, Bindels RJ, Collin RW, den Hollander AI, Qamar R, Cremers Hoenderop J. FP.

A catalytic defect in mitochondrial respiratory A novel COCH mutation associated with chain complex I due to a mutation in NDUFS2 in autosomal dominant nonsyndromic hearing a patient with Leigh syndrome. loss disrupts the structural stability of the Biochimica et biophysica acta 2012; 1822(2): vWFA2 domain. 168-175 Journal of Molecular Medicine 2012 Ngu LH, Nijtmans LG, Distelmaier F, Venselaar H, Cho HJ, Park HJ, Trexler M, Venselaar H, Lee KY, van Emst-de Vries SE, van den Brand MA, Robertson NG, Baek JI, Kang BS, Morton CC, Stoltenborg BJ, Willems PH, van den Heuvel LP, Vriend G, Patthy L, Kim UK. Smeitink JA, Rodenburg RJ.

Dominant missense mutations in ABCC9 cause Cantú syndrome. 2011 Nature Genetics 2012, 44(7); 793-796 Harakalova M, van Harsell JJ, Terhal PA, van Variation in Genes of β-glucan Recognition Lieshout S, Duran K, Renkens I, Amor DJ, Wilson Pathway and Susceptibility to Opportunistic LC, Kirk EP, Turner CL, Shears D, Garcia-Minaur Infections in HIV-Positive Patients. S, Lees MM, Ross A, Venselaar H, Vriend G, Immunological Investigations 2011; 40(7-8):735- Takanari H, Rook MB, van der Heyden MA, 750 Asselbergs FW, Breur HM, Swinkels ME, Scurr IJ, Rosentul DC, Plantinga TS, Papadopoulos A, Smithson SF, Knoers NV, van der Smagt JJ, Joosten LA, Antoniadou A, Venselaar H, Kullberg Nijman IJ, Kloosterman WP, van Haelst MM, van BJ, van der Meer JW, Giamarellos-Bourboulis EJ, Haaften G, Cuppen E. Netea MG

NPHP4 Variants Are Associated With The structure-function relationship of the Pleiotropic Heart Malformations. Aspergillus fumigatuscyp51A L98H conversion Circulation Research 2012; 110(12): 1564-1574 by site-directed mutagenesis: The mechanism French VM, van de Laar IM, Wessels MW, Rohe of L98H azole resistance. C, Roos-Hesselink JW, Wang G, Frohn-Mulder Fungal Genetics and Biology 2011; 48(11):1062- IM, Severijnen LA, de Graaf BM, Schot R, 1070 Breedveld G, Mientjes E, van Tienhoven M, Snelders E, Karawajczyk A, Verhoeven RJ, Jadot E, Jiang Z, Verkerk A, Swagemakers S, Venselaar H, Schaftenaar G, Verweij PE, Venselaar H, Rahimi Z, Najmabadi H, Mijers- Melchers WJ Heiboer H, de Graaf E, Helbing WA, Willemsen R, Devriendt K, Belmont JW, Oostra BA, Amack Microcephaly with simplified gyration, JD, Bertoli-Aveda AM. epilepsy, and infantile diabetes linked to inappropriate apoptosis of neural progenitors. Phosphorylation target site specificity for AGC American Journal of Human Genetics 2011; kinases DMPK E and lats2. 89(2):265-276 Journal of Cellular Biochemsitry 2012; 113(6): Poulton CJ, Schot R, Kia SK, Jones M, Verheijen 2126-2135 FW, Venselaar H, de Wit MC, de Graaff E, Bertoli-Avella AM, Mancini GM

168

Dankwoord - CV - Publicaties 8

Genotype-Phenotype Correlation in DFNB8/10 Smits P, Antonicka H, van Hasselt PM, Families with TMPRSS3 Mutations. Weraarpachai W, Haller W, Schreurs M, Journal of the Association for Research in Venselaar H, Rodenburg RJ, Smeitink JA, van den Otolaryngology 2011: 12(6): 753-766 Heuvel LP Weegerink NJ, Schraders M, Oostrik J, Huygen PL, Strom TM, Granneman S, Pennings RJ, 2010 Venselaar H, Hoefsloot LH, Elting M, Cremers CW, Admiraal RJ, Kremer H, Kunst HP Protein structure analysis of mutations causing

inheritable diseases. An e-Science approach Chondrodysplasia and abnormal joint with life scientist friendly interfaces. development associated with mutations in BMC Bioinformatics 2010; 11(1):548 IMPAD1, encoding the Golgi-resident Venselaar H, Te Beek TA, Kuipers RK, Hekkelman nucleotide phosphatase, gPAPP. ML, Vriend G American Journal Human Genetics 2011;

88(5):608-615 Acyl-CoA Dehydrogenase 9 Is Required for the Vissers LE, Lausch E, Unger S, Campos-Xavier AB, Biogenesis of Oxidative Phosphorylation Gilissen C, Rossi A, Del Rosario M, Venselaar H, Complex I. Knoll U, Nampoothiri S, Nair M, Spranger J, Cell Metabolism 2010; 12(3):283-294 Brunner HG, Bonafé L, Veltman JA, Zabel B, Nouws J, Nijtmans L, Houten SM, van den Brand Superti-Furga A M, Huynen M, Venselaar H, Hoefs S, Gloerich J,

Kronick J, Hutchin T, Willems P, Rodenburg R, Mass spectrometry analysis of hepcidin Wanders R, van den Heuvel L, Smeitink J, Vogel peptides in experimental mouse models. RO PLoS One 2011; 6(3):e16762

Tjalsma H, Laarakkers CM, van Swelm RP, Theurl Terminal Osseous Dysplasia Is Caused by a M, Theurl I, Kemna EH, van der Burgt YE, Single Recurrent Mutation in the FLNA Gene. Venselaar H, Dutilh BE, Russel FG, Weiss G, American Journal of Human Genetics 2010; Masereeuw R, Fleming RE, Swinkels DW 87(1):146-153

Sun Y, Almomani R, Aten E, Celli J, van der Mutations in SMAD3 cause a syndromic form of Heijden J, Venselaar H, Robertson SP, Baroncini aortic aneurysms and dissections with early- A, Franco B, Basel-Vanagaite L, Horii E, Drut R, onset osteoarthritis. Ariyurek Y, den Dunnen JT, Breuning MH Nature Genetics 2011; 43(2):121-126 van de Laar IM, Oldenburg RA, Pals G, Roos- The moonlighting function of pyruvate Hesselink JW, de Graaf BM, Verhagen JM, carboxylase resides in the non-catalytic end of Hoedemaekers YM, Willemsen R, Severijnen LA, the TIM barrel. Venselaar H, Vriend G, Pattynama PM, Collée M, Biochimica et Biophyisica Acta 2010; Majoor-Krakauer D, Poldermans D, Frohn- 1803(9):1038-1042 Mulder IM, Micha D, Timmermans J, Hilhorst- Huberts DH, Venselaar H, Vriend G, Veenhuis M, Hofstee Y, Bierma-Zeinstra SM, Willems PJ, Kros van der Klei IJ JM, Oei EH, Oostra BA, Wessels MW, Bertoli- Avella AM Overview of the mutation spectrum in familial Mutation in subdomain G' of mitochondrial exudative vitreoretinopathy and Norrie disease elongation factor G1 is associated with with identificiation of 21 novel variants in combined OXPHOS deficiency in fibroblasts but FZD4, LRP5 and NDP. not in muscle. Human Mutation 2010; 31(6):656-666 European Journal of Human Genetics 2011; Nikopoulos K, Venselaar H, Collin RW, Riveiro- 19(3):275-259 Alvarez R, Boonstra FN, Hooymans JM, Mukhopadhyay A, Shears D, van Bers M, de Wijs IJ, van Essen AJ, Sijmons RH, Tilanus MA, van Nouhuys CE, Ayuso C, Hoefsloot LH, Cremers FP.

169

Dankwoord - CV - Publicaties 8

Secondary and tertiary structure modeling Clinical and molecular characterizations of reveals effects of novel mutations in polycystic novel POU3F4 mutations reveal that DFN3 is liver disease genes PRKCSH and SEC63. due to null function of POU3F4 protein. Clinical Genetics 2010; 78(1):47-56 Physiological Genomics 2009; 39(3):195-201 Waanders E, Venselaar H, te Morsche RH, de Lee HK, Song MH, Kang M, Lee JT, Kong KA, Choi Koning DB, Kamath PS, Torres VE, Somlo S, SJ, Lee KY, Venselaar H, Vriend G, Lee WS, Park Drenth JP HJ, Kwon TK, Bok J, Kim UK

The alpha-kinase family: an exceptional branch Mutations in NDUFAF3 (C3ORF60), Encoding an on the protein kinase tree. NDUFAF4 (C6ORF66)-Interacting Complex I Cellular andl Molecular Life Sciences 2010; Assembly Protein, Cause Fatal Neonatal 67(6):875-890 Mitochondrial Disease. Middelbeek J, Clark K, Venselaar H, Huynen MA, American Journal of Human Genetics 2009; 84 van Leeuwen FN (6):718-727 Saada A, Vogel RO, Hoefs SJ, van den Brand MA, Functional analysis of the KV1.1 N255D Wessels HJ, Willems PH, Venselaar H, Shaag A, mutation associated with autosomal dominant Barghuti F, Reish O, Shohat M, Huynen MA, hypomagnesemia. Smeitink JA, van den Heuvel LP, Nijtmans LG Journal of Biological Chemistry 2010; 285(1):171-178 Role of the C-terminal linear region of EGF-like van der Wijst J, Glaudemans B, Venselaar H, Nair growth factors in ErbB specificity. AV, Forst AL, Hoenderop JG, Bindels RJ Growth Factors 2009; 27(3):163-172 van der Woning SP, Venselaar H, van Rotterdam Homology modelling and spectroscopy, a W, Jacobs-Oomen S, van Leeuwen JE, van Zoelen never-ending love story. EJ European Biophysics Journal 2010;39(4):551-563 Venselaar H, Joosten RP, Vroling B, Baakman CA, Further clinical and molecular delineation of Hekkelman ML, Krieger E, Vriend G the 9q Subtelomeric Deletion Syndrome supports a major contribution of EHMT1 haploinsufficiency to the core phenotype. 2009 Journal of Medical Genetics 2009; 46(9):598-606 Kleefstra T, van Zelst-Stams WA, Nillesen WM, Homology modeling nd Cormier-Daire V, Houge G, Foulds N, van Dooren Structural bioinformatics 2 Edition 2009, M, Willemsen MH, Pfundt R, Turner A, Wilson chapter 30: 715-732 M, McGaughran J, Rauch A, Zenker M, Adam Venselaar H, Krieger E, Vriend G MP, Innes M, Davies C, López AG, Casalone R, Weber A, Brueton LA, Navarro AD, Bralo MP, Human dectin-1 deficiency and mucocutaneous Venselaar H, Stegmann SP, Yntema HG, van fungal infections. Bokhoven H, Brunner HG New England Journal of Medicine 2009; 361(18):1760-1767 A novel homozygous nonsense mutation in Ferwerda B, Ferwerda G, Plantinga TS, Willment CABP4 causes congenital cone-rod synaptic JA, van Spriel AB, Venselaar H, Elbers CC, disorder. Johnson MD, Cambi A, Huysamen C, Jacobs L, Investigations Ophthalmology and Visual Science Jansen T, Verheijen K, Masthoff L, Morré SA, 2009; 50(5):2344-2350 Vriend G, Williams DL, Perfect JR, Joosten LA, Littink KW, van Genderen MM, Collin RW, Wijmenga C, van der Meer JW, Adema GJ, Roosing S, de Brouwer AP, Riemslag FC, Kullberg BJ, Brown GD, Netea MG Venselaar H, Thiadens AA, Hoyng CB, Rohrschneider K, den Hollander AI, Cremers FP, van den Born LI

170

Dankwoord - CV - Publicaties 8

2008 2006

Mutations of LRTOMT, a fusion gene with Negative constraints underlie the ErbB alternative reading frames, cause specificity of epidermal growth factor-like nonsyndromic deafness in humans. ligands. Nature Genetics 2008; 40(11):1335-1340 Journal of Biological Chemistry 2006; Ahmed ZM, Masmoudi S, Kalay E, Belyantseva 281(52):40033-40040 IA, Mosrati MA, Collin RW, Riazuddin S, Hmani- van der Woning SP, van Rotterdam W, Nabuurs Aifa M, Venselaar H, Kawar MN, Tlili A, van der SB, Venselaar H, Jacobs-Oomen S, Wingens M, Zwaag B, Khan SY, Ayadi L, Riazuddin SA, Morell Vriend G, Stortelers C, van Zoelen EJ RJ, Griffith AJ, Charfedine I, Caylan R, Oostrik J, Karaguzel A, Ghorbel A, Riazuddin S, Friedman TB, Ayadi H, Kremer H

Gene structure and mutant alleles of PCDH15: nonsyndromic deafness DFNB23 and type 1 Usher syndrome. Human Genetics 2008; 124(3):215-223 Ahmed ZM, Riazuddin S, Aye S, Ali RA, Venselaar H, Anwar S, Belyantseva PP, Qasim M, Riazuddin S, Friedman TB

Role of the alpha-kinase domain in TRPM6 channel and regulation by intracellular ATP. Journal of Biological Chemistry 2008; 283(29):19999-20007 Thébault S, Cao G, Venselaar H, Xi Q, Bindels RJ, Hoenderop JG

Mutations of ESRRB Encoding Estrogen-Related Receptor Beta Cause Autosomal-Recessive Nonsyndromic Hearing Impairment DFNB35. American Journal of Human Genetics 2008; 82(1):125-138 Collin RW, Kalay E, Tariq M, Peters T, van der Zwaag B, Venselaar H, Oostrik J, Lee K, Ahmed ZM, Caylan R, Li Y, Spierenburg HA, Eyupoglu E, Heister A, Riazuddin S, Bahat E, Ansar M, Arslan S, Wollnik B, Brunner HG, Cremers CW, Karaguzel A, Ahmad W, Cremers FP, Vriend G, Friedman TB, Riazuddin S, Leal SM, Kremer H

A novel (Leu183Pro-)mutation in the HFE-gene co-inherited with the Cys282Tyr mutation in two unrelated Dutch hemochromatosis patients. Blood Cells Molecules & Diseases 2008; 40(3):334-338 Swinkels DW, Venselaar H, Wiegerinck ET, Bakker E, Joosten I, Jaspers CA, Vasmel WL, Breuning MH

171