Structure Prediction and Variant Interpretation of Membrane Proteins Aided by Machine Learning Algorithms

STRUCTURE PREDICTION AND VARIANT INTERPRETATION OF MEMBRANE PROTEINS AIDED BY MACHINE LEARNING ALGORITHMS

Bian Li

Dissertation

Submitted to the Faculty of the

Graduate School of Vanderbilt University

in partial fulfillment of the requirements

for the degree of

DOCTOR OF PHILOSOPHY in

Chemistry May 11, 2018

Nashville, Tennessee

Approved:

Jens Meiler, Ph.D.

Terry Lybrand, Ph.D.

Clare McCabe, Ph.D.

Terunaga Nakagawa, Ph.D.

To my infinitely supportive parents Dingou Li and Yinglian Li

and

To my beloved wife Xuan Zhang

ACKNOWLEDGEMENTS

First and foremost, I would like to express my deepest appreciation to my advisor and committee chair, Dr. Jens Meiler, who has the attitude and expertise of a great scientist and the charisma of an inspiring mentor. I feel extremely fortunate that my application to the Graduate Program in Chemistry at Vanderbilt five years ago was among the few selected by Dr. Meiler for further consideration. He interviewed me and made my application case a strong one to the program. His endorsement was undeniably one of the decisive reasons that convinced the program to give me an offer. I have also had the fortune of having him as my advisor. His ingenious insight, invaluable scientific guidance, and incessant encouragement are so critical that without them nothing in this dissertation would have been possible.

I’m also extremely grateful to Dr. Terry Lybrand, Dr. Clare McCabe, and Dr. Terunaga Nakagawa, who are my committee members. It’s difficult to find a better committee for one who works on problems in computational structural biology. They have continuously made suggestions to improve my dissertation research and have never hesitated to give me career advice and to write strong letters of reference for me. Dr. Lybrand also gave me invaluable advices on things that need particular attention when looking for postdoctoral positions and making the most of a postdoctoral training.

I’m indebted to all of those whom I have had the privilege of working with in Dr. Meiler’s research group. Particularly, I would like to thank Jeffrey Mendenhall, who is an extremely talented programmer and scientist in the group. He is my officemate and sits next to me. The skills, both computational and life, that I learned from him has proven to be indispensable for me to survive as a scientist and a sojourner in the U.S. I would also like to express my thanks to Dr. Axel Fischer, who was my mentor when I was a rotation student. He helped me tremendously to learn C++.

Graduate school is a huge take both for me personally and for my family. It is a tunnel that is so long and so stochastic that the chance of reaching the other end and seeing the light is nothing but an abyss without the infinite love and support from my family. In the past five years, even though I’ve only seen my parents once, I know they are always there whenever I need them. Finally, I would like to give my special thanks to my beloved wife for her love, support, patience, sacrifice, and everything else. iii

SUMMARY

Protein folding is a process of molecular self-assembly during which a disordered polypeptide chain collapses to form a compact and well-defined three-dimensional (3D) tertiary structure. A grand challenge in biochemistry has been to understand the process by which proteins fold into their functional tertiary structure (folding mechanism) and to predict this tertiary structure from amino acid sequence (structure prediction), two tasks that are collectively known as “the protein folding problem”. Solving this problem is of far-reaching impact as it will not only reveal the missing link between sequence and structure but also provide molecular biologists with a theoretical framework and practical tools for applications such as drug design and protein engineering. Chapter I of this dissertation gives a comprehensive review of the computational techniques developed in the past half century or so for studying the protein folding problem.

Helical membrane proteins (HMPs) play essential roles in various biological processes, including signal transduction, ionic and molecular transportation across the membrane, and energy generation. It was estimated that HMPs constitute about 20% to 30% of the human genome. Frequently, these transmembrane proteins do not function as monomers but undergo concerted interactions to form either homo-oligomers or interacting with other transmembrane proteins to form hetero-oligomers. Despite their prevalence in the genome, a very small portion of structures in the Protein Data Bank are HMPs due to the experimental difficulties in determining structures of HMPs and their complexes. Therefore, accurate and efficient computational methods would be valuable tools to complement existing experimental techniques. Chapters II, III, and IV describe a novel computational approach developed in this work for improving the tertiary structure prediction of HMPs and the quaternary structure prediction of HMP complexes.

In chapter II, the concept of residue weighted contact number (WCN) is introduced and a method is developed, using state-of-the-art machine learning techniques, for predicting WCNs from amino acid sequence alone. The WCN of an amino acid residue is defined as the number of neighboring residues weighted by their proximity to the focal residue. It measures the local packing degree of residues within the protein tertiary structure. In helical membrane proteins, every transmembrane helix has a characteristic profile of WCNs and this profile is strongly coupled with native contacts between helices. This implies that WCNs can be incorporated as restraints in the prediction of helix-helix packing. In chapter III, it is demonstrated that residues’ WCNs predicted

iv by the method developed in chapter II are effective restraints for improving the fraction of native contacts in predicted tertiary structure models of HMPs. Chapter IV concerns with the characterization of interfaces between HMPs and the prediction of quaternary structures of HMP complexes via protein-protein docking. First, the physicochemical characteristics and evolutionary conservation of interface residues are compared with residues on the rest of the surface, a machine learning-based method is then developed for predicting the WCNs of interface and surface residues. Finally, it is showed that predicted interface residues and their WCNs can be used to derive a powerful score for selecting native-like docking candidates of HMP complexes

Proteins mutate in response to change in environment or errors in gene replication. A lot of diseases are caused by dysfunctional variants of HMPs. Mapping the relationship between variants and their functional impact is an essential step toward precision medicine. Ideally, except for certain well-established disease-causing cases, variants should be evaluated by physiologically relevant experimental functional assays, but experimental characterization remains labor-intensive and costly to scale. Variant interpretation is bound to present an increasingly daunting challenge in the era of next-generation sequencing. Under such constraints, computational methods, which are usually machine learning-based, represent a common predictive approach.

Dysfunctional variants of the KCNQ1 potassium channel are associated with the congenital long QT syndrome. Chapter V describes a machine learning-based, protein-specific method developed in this work, that is capable of accurately classifying the functional impact of nonsynonymous variants of KCNQ1. This method was trained on a manually curated, functionally validated dataset to classify molecular functional impact. It showed superior performance when compared with eight previous methods tested in parallel.

Chapter VI concludes with a summary of the key contributions this work made to the relevant fields and some considerations on a few major limitations needed to be addressed in future work. It also points out some questions that are of significant interests for future work.

TABLE OF CONTENTS

DEDICATION ………………………………………………………………………………………………………………. ii ACKNOWLEDGEMENTS ...... iii SUMMARY ...... iv LIST OF TABLES...... x LIST OF FIGURES ...... xi I. INTRODUCTION ...... 13 I-1 The protein folding problem ...... 13 I-1.1 Thermodynamics of protein folding ...... 14 I-1.2 Simulation of protein folding, and tertiary structure prediction are very different subproblems ...... 16 I-1.3 Chemical kinetics of protein folding: mechanisms and pathways ...... 16 I-1.4 Prediction of protein folding rates ...... 19 I-2 Conformational sampling is a bottleneck ...... 19 I-2.1 Unbiased molecular dynamics simulations ...... 20 I-2.2 Enhanced sampling techniques in MD ...... 23 I-2.3 Monte Carlo simulation ...... 27 I-2.4 Genetic algorithms ...... 28 I-3 Energy functions are evolving objects ...... 29 I-3.1 Physics-based force fields...... 30 I-3.2 Knowledge-based potentials ...... 32 I-4 Improving sampling and scoring with restraints ...... 35 I-4.1 Sparse experimental data as restraints ...... 36 I-4.2 Predicted contacts as restraints ...... 39 I-5 Examples of methods for de novo tertiary structure prediction ...... 41 I-5.1 FRAGFOLD ...... 43 I-5.2 Rosetta ...... 43 I-5.3 I-TASSER...... 45 I-5.4 QUARK ...... 45 I-5.5 BCL::Fold ...... 46 I-6 Outlook ...... 47 II. ACCURATE PREDICTION OF CONTACT NUMBERS FOR MULTI-SPANNING HELICAL MEMBRANE PROTEINS ...... 49 II-1 Introduction ...... 49 II-2 Methods ...... 50 II-2.1 Generation of data set ...... 50 II-2.2 Computation of WCN ...... 51 II-2.3 Computation of relative solvent accessibility ...... 52 vi

II-2.4 Computation of feature vectors ...... 53 II-2.5 Training of dropout neural networks with back-propagation of errors ...... 54 II-2.6 Jackknife cross-validation ...... 55 II-2.7 Performance measures ...... 56 II-3 Results and Discussion ...... 57 II-3.1 Statistics of the data set ...... 57 II-3.2 Relevance of input features ...... 58 II-3.3 Choosing the optimal window size ...... 60 II-3.4 Dropout prevents overfitting and improves performance ...... 61 II-3.5 Performances of the networks on polytopic HMPs ...... 62 II-3.6 WCNs for bitopic HMPs are difficult to predict ...... 63 II-3.7 WCNs for highly exposed or buried TMHs are difficult to predict ...... 64 II-3.8 WCNs of extremely exposed or buried residues are difficult to predict ...... 65 II-3.9 Amino acid bias in prediction error ...... 66 II-3.10 Predicted WCNs reveal exposure pattern ...... 67 II-3.11 Predicting membrane protein-membrane protein interface ...... 68 II-3.12 Comparison with other WCN predictors ...... 70 II-4 Limitations and future directions ...... 71 II-5 Conclusion ...... 72 II-6 Supporting Information ...... 72 II-7 Notes ...... 72 II-8 Abbreviations ...... 72 III. IMPROVING PREDICTION OF HELIX‒HELIX PACKING IN MEMBRANE PROTEINS USING PREDICTED CONTACT NUMBERS AS RESTRAINTS ...... 73 III-1 Introduction ...... 73 III-2 Materials and Methods ...... 76 III-2.1 Benchmark set ...... 76 III-2.2 Computation of experimental and predicted WCNs ...... 77 III-2.3 Incorporating WCNs as restraints in de novo structure prediction ...... 78 III-2.4 Metrics for measuring of model quality ...... 80 III-2.5 Computation of enrichment ...... 81 III-3 Results and Discussion ...... 82 III-3.1 Predicting WCNs for HMPs in the benchmark set ...... 82 III-3.2 Incorporation of WCNs significantly improved CR ...... 83 III-3.3 Accurate prediction of WCNs is not sufficient for improving prediction of TMH potations ...... 85 III-3.4 RMSD100 is improved by using WCNs as restraints ...... 86 III-3.5 Helix rotation accuracy is improved by using predicted WCN as restraints ...... 88

vii

III-3.6 Increased ability of the scoring function at selecting accurate models ...... 88 III-4 Limitations and Future Directions ...... 89 III-5 Conclusions ...... 91 III-6 Software Availability ...... 91 IV. INTERFACES ACROSS ALPHA-HELICAL TRANSMEMBRANE PROTEINS: CHARACTERIZATION, PREDICTION, AND IMPACT FOR DOCKING ...... 92 IV-1 Introduction ...... 92 IV-2 Materials and Methods ...... 93 IV-2.1 Data set ...... 93 IV-2.2 Defining interface residues ...... 94 IV-2.3 Site-specific rate of evolution ...... 94 IV-2.4 Mutual information ...... 94 IV-2.5 Training a neural network for predicting WCN ...... 95 IV-2.6 Predicting interface residues ...... 96 IV-2.7 Membrane protein docking ...... 96 IV-2.8 Computation of enrichment ...... 97 IV-3 Results ...... 98 IV-3.1 Amino acid composition and interface propensities ...... 99 IV-3.2 Hydrophobicity ...... 101 IV-3.3 The interface is more conserved than the rest of the surface in obligate oligomers ...... 102 IV-3.4 Contacting interface residue pairs show stronger correlation than non-contacting pairs ...... 103 IV-3.5 Predicting interface residues in the membrane ...... 104 IV-3.6 Docking membrane proteins using predicted WCNs as restraints ...... 106 IV-4 Discussion ...... 111 V. PREDICTING THE FUNCTIONAL IMPACT OF KCNQ1 VARIANTS OF UNKNOWN SIGNIFICANCE .. 114 V-1 Introduction ...... 114 V-2 Materials and Methods ...... 115 V-2.1 Dataset and criteria for annotating functional impact ...... 115 V-2.2 Neural network architecture and training ...... 116 V-2.3 Predictive features ...... 117 V-2.4 Performance metrics ...... 117 V-2.5 Estimating generalization ability ...... 118 V-3 Results ...... 119 V-3.1 Functional studies do not always agree with clinical testing...... 119 V-3.2 Position-specific rate of evolution reflects functionally-critical subdomains ...... 119 V-3.3 Dysfunctional variants are enriched in selected subdomains ...... 121 V-3.4 Q1VarPred: a KCNQ1-specific predictor ...... 123

viii

V-3.5 Comparing Q1VarPred with other methods ...... 124 V-4 Discussion ...... 125 V-4.1 From functional impact to clinical disease diagnosis ...... 125 V-4.2 Unexpected conserved subdomains in the C-terminal domain ...... 126 V-4.3 The machine learning model ...... 127 V-4.4 Factors contributing to the improved performance of Q1VarPred ...... 128 V-5 Limitations and future direction ...... 129 V-6 Data and Software Availability...... 130 VI. CONCLUSIONS AND FUTURE DIRECTIONS ...... 131 VI-1 Contributions ...... 131 VI-2 Limitations and future directions ...... 133 APPENDICES ...... 136 Accurate prediction of contact numbers for multi-spanning helical membrane proteins ...... 136 Interfaces across alpha-helical transmembrane proteins: characterization, prediction, and impact for docking... 140 Predicting the functional impact of KCNQ1 variants of unknown significance ...... 142 Computation of predictive features ...... 142 Tested genome-wide tools ...... 143 Calculation of enrichment of dysfunctional variants ...... 144 A structural model for the glutamate A2 (GluA2) receptor and its cornichon 3 (CNIH3) auxiliary subunit ...... 153 REFERENCES ...... 156

LIST OF TABLES

Table Page

II-1 Summary of the TMH-Expo data set ...... 58 II-2 Summary of performance measures for WCN prediction ...... 63 II-3 Performance of interface residue identification of TMH-Expo on 4al0A ...... 69 III-1 Summary of the benchmark set ...... 76 III-2 Summary of contact recovery ...... 84 III-3 Summary of RMSD100 ...... 87 III-4 Enrichment achieved with and without WCN restraints ...... 89 IV-1 Summary of the benchmark set of transmembrane protein complexes ...... 98 IV-2 Summary of the global docking of transmembrane proteins using predicted interface residue WCN as restraints ...... 108 V-1 Comparison of Q1VarPred with other methods...... 125 A-1 HMP chains in the TMH-Expo data set ...... 138 A-2 Summary of 12 poorly predicted protein chains ...... 139 A-3 Performance of TMH-Expo on Identifying Interface Residues ...... 139 A-4 Alpha-helical transmembrane protein chains that form the oligomers