Protein Structure Prediction

Bioinformatics Algorithms
Protein structure prediction
David Hoksza
http://siret.ms.mff.cuni.cz/hoksza

Motivation
• Sequence → structure → function
• The number of available (protein) sequences grows much faster than the number of available 3D structures
• Given a protein sequence we want to determine its structure
• Inverse problem to protein structure design where, given a structure, we want to find a sequence which codes for it

Structure → function
• Inferring function from structure
  • Detection of local structural motifs with functional roles
  • Analysis of surface clefts → catalytic sites
  • Conservation analysis
  • Quaternary structure (beware of false positives due to crystallization)
  • Buried and solvent exposed residues
• Issues
  • Moonlighting proteins
    • Multiple functions carried out by a single domain
  • Conformational change of shape upon binding
    • Ligand-bound state (holo structure) vs unbound state (apo structure)
  • Intrinsically disordered proteins (IDP)
    • Natively unfolded proteins

Sequence → structure
[Figure: Size of common cores as a function of protein homology. If two proteins of length $n_1$ and $n_2$ have $c$ residues in the common core, the fractions of each sequence in the common core are $c/n_1$ and $c/n_2$. We plot these values, connected by a bar, against the residue identity of the core.]
[Figure: The relation of residue identity and the r.m.s. deviation of the backbone atoms of the common cores of 32 pairs of homologous proteins.]
source: Chothia, Cyrus, and Arthur M. Lesk. "The relation between the divergence of sequence and structure in proteins." The EMBO journal 5.4 (1986): 823.
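
The two quantities plotted in the Chothia and Lesk figures above, the common-core fractions and the r.m.s. deviation of backbone atoms, can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the r.m.s.d. helper assumes the two coordinate sets are already optimally superposed and listed in matching order, which the original analysis would have had to establish first.

```python
import math


def core_fractions(c, n1, n2):
    """Fractions of each sequence that fall into the common core (c/n1, c/n2)."""
    if n1 <= 0 or n2 <= 0:
        raise ValueError("sequence lengths must be positive")
    return c / n1, c / n2


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z).

    Assumption: the structures are already superposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

For example, `core_fractions(50, 100, 200)` gives the pair of values that the figure connects by a bar for one protein pair.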
Protein structure prediction tasks
• Secondary structure prediction
  • Assign each amino acid one of three (or more) states (helix, sheet, loop)
• Tertiary structure prediction
  • Assign each amino acid/atom its position in 3D space
• Interaction sites prediction
  • Tertiary structure (intra-molecular) contacts
  • Protein-protein/DNA/RNA sites prediction (inter-molecular quaternary structure contacts)
  • Protein-ligand (active sites/pockets) prediction

Protein structure determination
• X-ray crystallography (89%)
• NMR spectroscopy (8%)
• 3D (cryo) electron microscopy (EM) (2%)

X-ray crystallography
• A crystallized protein is subjected to X-ray beams; the electrons scatter the beam and the scattered waves interfere with each other, forming a diffraction pattern, which is observed
• Electron density of the crystal is determined by the positions of electrons (atoms) ↔ magnitudes and phases of the X-ray diffraction waves = diffraction pattern of the crystal
• Fourier transformation is used to estimate the electron density for each position
• Works only for proteins which form a crystal → suitable for rigid proteins but unsuitable for flexible proteins
source: https://www.nature.com/news/cryo-electron-microscopy-wins-chemistry-nobel-1.22738

X-ray crystallography – quality measures
• Resolution
  • 3 Å → secondary structure
  • 2.5 Å → side chains
  • <1 Å → hydrogen atoms
• R-factor
  • After structure reconstruction, a theoretical diffraction pattern can be computed → difference between the real and the theoretical pattern, expressed as a percentage (how well the model back-predicts the data)
  • Rule of thumb: a good structure should have an R-factor lower than resolution/10 (≤ 0.3 for 3 Å resolution)
• R(free)-factor
  • The same measure, computed when set-aside data is used for the real pattern
• B-factor (temperature factor)
  • Thermal motion is present even in a crystal → extent to which the electron density is spread out for each atom
  • $B = 8\pi^2 U^2$
[Figure: Electron density map at 3.7 Å, 2.4 Å, 1.5 Å and 0.8 Å resolution]
source: Finding the best data for your needs in the PDB archive (EBI webinar - youtube)

NMR spectroscopy
• Purified protein in solution is placed in a strong magnetic field and probed with radio waves; the observed resonances (each atom has a characteristic resonance in the magnetic field based on its surroundings) are analyzed to build a model of atomic nuclei and bonded atoms
• Resonances indicate which atoms are close to each other → list of restraints to build the model
• An NMR structure commonly includes an ensemble of structures which fit the restraints → diverse regions correspond to flexible parts
• Proteins in solution → works also for flexible proteins which can't be locked in a crystal
• Works for small to medium-sized proteins
[Figures: PDB ID: 6F0Y, PDB ID: 5MN3, PDB ID: 6BNH]

NMR spectroscopy – quality measures
• Completeness of resonance assignments
  • Percentage of atoms for which the resonances were measured
• Statistically unusual resonances
• Random coil index
  • How well do the resonances fit usual protein conformations such as secondary structure
source: Finding the best data for your needs in the PDB archive (EBI webinar - youtube)

3D cryo-EM
• A beam of electrons and a system of electron lenses are used to image the biomolecule directly
• Cryo-EM
  • Vitrification: the protein solution is cooled so rapidly that water molecules do not have time to crystallize → thin layer of non-crystalline ice
  • Thousands of 2D projection images → 3D density map → fitting an atomic model to the map
• Chemistry Nobel prize in 2017: Jacques Dubochet, Joachim Frank and Richard Henderson
• Ability to analyze large, complex and flexible structures
• Works for proteins in native state
• Often breaking the 3 Å resolution barrier
[Figure: PDB ID: 3j3q]

source: "Protein Data Bank: the single global archive for 3D macromolecular structure data." Nucleic Acids Research 47, no. D1 (2018): D520-D528.
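
The R-factor from the X-ray quality measures above compares the observed diffraction data with the pattern back-computed from the model. A minimal sketch follows, assuming the conventional definition $R = \sum \big| |F_{obs}| - |F_{calc}| \big| / \sum |F_{obs}|$ over structure-factor amplitudes; the slides give only the verbal description, so this formula, the function names and the sample numbers are assumptions made for illustration.

```python
def r_factor(f_obs, f_calc):
    """R = sum(| |F_obs| - |F_calc| |) / sum(|F_obs|), returned as a fraction."""
    if len(f_obs) != len(f_calc) or not f_obs:
        raise ValueError("amplitude lists must be non-empty and of equal length")
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    if denominator == 0:
        raise ValueError("observed amplitudes must not all be zero")
    return numerator / denominator


def passes_rule_of_thumb(r, resolution_angstrom):
    """Rule of thumb from the slides: R should not exceed resolution / 10."""
    return r <= resolution_angstrom / 10.0


# Made-up amplitudes, purely illustrative.
observed = [10.0, 20.0, 30.0]
calculated = [9.0, 22.0, 30.0]
r = r_factor(observed, calculated)
print(f"R = {r:.3f}, acceptable at 3 A: {passes_rule_of_thumb(r, 3.0)}")
```

R(free) would use the same function, applied only to the reflections that were set aside and never used during refinement.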

Protein folding
• Folding is the process through which a protein obtains its three-dimensional structure
• The protein tends to fold into the thermodynamically most favorable state, i.e. the state with the lowest free energy
• Folding is (mostly) driven by the protein's amino acid sequence through a thermodynamic process
  • Anfinsen's dogma

Anfinsen's dogma
• All information needed to fold the native structure of a protein is contained in its amino acid sequence
• Experiment with ribonuclease A (RNaseA), a 124-residue extracellular enzyme with 4 disulfide bonds
• Observation
  1. SS bonds reduced using mercaptoethanol → denaturation with 8M urea → inactive protein, flexible random polymer
  2. Removal of urea → oxidation of –SH groups back to SS bonds → regain of 90% of activity
• Control (proving that the protein was unfolded)
  • Change of the order of steps in the second phase → 1-2% of activity and random assortment of SS bonds

Levinthal's paradox
• Reaching the native folded state of a protein by a random search among all possible configurations can take an enormously long time
  • An unfolded polypeptide chain has many degrees of freedom
  • Even a small number of allowed $\phi$ and $\psi$ combinations leads to an astronomically large number of structures
• Proteins fold in seconds at most, which is a paradox → there must be a pathway or a set of pathways leading to the energetically favorable conformation
• Biased search
  • When some conformations are considered stabilizing and preferred (energy bias), the folding time becomes reasonable [Zwanzig et al. "Levinthal's paradox." PNAS 89.1 (1992): 20-22]

Structure prediction

Template existence dependency of tertiary structure prediction approaches
[Figure: sequence identity axis with the twilight zone at 20% – 30% separating "night" from "day"]
• Night: combinatorial exploration of the folding space with the goal to find the state with the lowest energy
• Day: utilization of existing structures
source: Krieger, E., Nabuurs, S. B. and Vriend, G. (2003) Homology Modeling, in Structural Bioinformatics

Model scoring

Energy/scoring functions
• The native structure is the lowest free energy conformation → need for a function capable of assessing the energy/quality of a proposed structure
• Approaches
  • Potential energy
    • Atom-level resolution
    • Based on energy terms
  • Knowledge-based scoring functions
    • Residue-level resolution
    • Recognizing good folds from existing knowledge (PDB)

Potential energy function
• A potential energy function defines the potential energy of a system as a function of the positions of all its atoms
• The behavior of a molecule can be described by the Schrödinger equation, which in general describes the behavior of a dynamic system
  • We need to consider not only the atoms of the molecule, but also the surrounding water molecules
  • Were we able to compute the energy of the system, we could use it to score our predictions
  • To solve the equation we need to consider all nuclei and electrons of the system and their interactions → impossible to solve for systems of more than a few atoms → potential energy function / force field
• Molecular mechanics force field
  • Consists of the energetic contributions of covalent (bonded) and electrostatic (non-bonded) interactions
  • Each contribution consists of a functional part and its parametrization
  • Atoms are represented by their centers only, but that depends on the type of energy function

Potential energy function – covalent interactions
• Bond-length potential
  • Treating the bond as a spring and describing its energy by Hooke's law
  • Bonds between chemically similar atoms have similar lengths, thus we can assume the observed equilibrium is the one with minimum potential energy
  • $E_{bond} = K_r (r - r_{eq})^2$, where $K_r$ is the spring constant, $r$ the bond length and $r_{eq}$ the equilibrium bond length
• Bond-angle potential
  • Same as bonds
  • $E_{angle} = K_\theta (\theta - \theta_{eq})^2$
• Dihedral angle potential
  • Dihedral angles do not have a single energy minimum
  • Not sufficient to represent the energy of a dihedral angle on its own and often combined with the electrostatic energy between the first and the last of the atoms involved in the dihedral angle
  • $E_{dihedral} = \frac{V_n}{2} \left[ 1 + \cos(n\phi - \gamma) \right]$, where $V_n$ is the barrier height, $n$ the number of energy minima and $\gamma$ the angular offset
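
The three covalent terms above translate directly into code. The sketch below is a minimal illustration rather than a real force field: the function names are hypothetical and the parameter values in the usage lines are made up, whereas an actual force field would take $K_r$, $r_{eq}$, $K_\theta$, $\theta_{eq}$, $V_n$, $n$ and $\gamma$ from its parametrization for the atom types involved.

```python
import math


def bond_energy(r, r_eq, k_r):
    """Harmonic bond-length term: E_bond = K_r * (r - r_eq)^2."""
    return k_r * (r - r_eq) ** 2


def angle_energy(theta, theta_eq, k_theta):
    """Harmonic bond-angle term: E_angle = K_theta * (theta - theta_eq)^2 (radians)."""
    return k_theta * (theta - theta_eq) ** 2


def dihedral_energy(phi, v_n, n, gamma):
    """Periodic dihedral term: E_dihedral = V_n / 2 * [1 + cos(n * phi - gamma)]."""
    return 0.5 * v_n * (1.0 + math.cos(n * phi - gamma))


# Illustrative (made-up) parameters; units are whatever the parametrization uses.
e_bond = bond_energy(r=1.60, r_eq=1.50, k_r=300.0)
e_angle = angle_energy(theta=math.radians(112.0), theta_eq=math.radians(109.5), k_theta=50.0)
e_dihedral = dihedral_energy(phi=math.radians(180.0), v_n=2.0, n=3, gamma=0.0)
print(e_bond + e_angle + e_dihedral)
```

With $n = 3$ and $\gamma = 0$ the dihedral term has three minima over a full rotation, at $\phi = 60°$, $180°$ and $300°$, which illustrates why a single harmonic well is not used for dihedral angles.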
