José Guilherme Coelho Peres de Almeida

Computational methods for the understanding of protein-based interactions

Dissertação de mestrado em Biologia Celular e Molecular orientada por Irina S. Moreira e Ana Luísa Carvalho e apresentada para o Departamento de Ciências da Vida da Universidade de Coimbra

Setembro/2017


Index

Index
Figure Index
Table Index
Annex Index
Acknowledgements
Abstract
Resumo
A. Introduction
   1) Protein structure fundamentals
   2) Membrane Proteins
      i) G-protein coupled receptors
   3) Machine-learning – general aspects
   4) Computational methods for structure
      i) Sequence importance for prediction
      ii) Computational methods for soluble protein structure
      iii) Computational methods for membrane protein structure
      iv) Computational methods for protein-protein interactions and complexes
      v) Computational methods for membrane protein-protein partner interactions
      vi) Computational methods for GPCRs
B. Methodology
   1) SpotOn – prediction of Hot-Spots from complex structural information
      i) Dataset construction
      ii) Structural/sequence features
      iii) Machine-learning techniques
   2) SpotOnDB – understanding protein-protein interfaces through big data approaches
      i) Protein-protein Dataset
      ii) Hot-spot classification
      iii) Protein-protein interfacial characterization
      iv) Statistics, data visualization and webserver implementation
   3) Interface Prediction – using deep-learning to identify interfacial patches in protein surface
      i) Dataset construction and features
      ii) Model hyperparametrization, parametrization and training
      iii) Interfacial patch prediction
   4) Beyond the interface – how complex structure dynamics and conformation affects the binding interface
      i) Homology modelling
      ii) Structure refinement with HADDOCK
      iii) Interhelical distance
      iv) Comparative normal mode analysis
      v) Data visualization and webserver implementation
C. Results and Discussion
   1) SpotOn – from high dimensionality to a successful method
      i) ML Algorithms Clustering
      ii) ML algorithms Cluster Performance
   2) SpotOnDB – a high-throughput method unravels new details on protein-protein interactions
      i) Protein-protein interface characterization
      ii) Hot-spots Composition
      iii) Hot-regions
      iv) B-factors
      v) Evolutionary conservation
      vi) Solvent-accessible surface area of interfacial residues
      vii) Structural interactions at protein-protein interfaces
   3) Interface prediction – a tentative application of DL
      i) Hyperparametrization and model training
      ii) Interfacial patch prediction
   4) Structural dynamics as influential in understanding GPCR-partner differential binding
      i) Normal mode analysis of complex structure
      ii) Interhelical Distance
D. Conclusions
E. References
F. List of Abbreviations
G. Annex
   Background
   Scope of review
   Major conclusions
   General significance
H. Annex References


Figure Index

Figure 1 - The 3D structure for opsin (PDBID: 4J4Q 86). The main structural features are easily observable and labelled. Legend: TM1 – TransMembrane Helix 1; ICL1 – IntraCellular Loop 1; TM2 – TransMembrane Helix 2; ECL1 – ExtraCellular Loop 1; TM3 – TransMembrane Helix 3; ICL2 – IntraCellular Loop 2; TM4 – TransMembrane Helix 4; ECL2 – ExtraCellular Loop 2; TM5 – TransMembrane Helix 5; ICL3 – IntraCellular Loop 3; TM6 – TransMembrane Helix 6; ECL3 – ExtraCellular Loop 3; TM7 – TransMembrane Helix 7; HX8 – Perimembranar Helix.
Figure 2 - The typical machine-learning workflow. The dataset is first split into training, validation and testing. Then, the algorithm is trained using the training set and after each iteration it is validated using the validation set.
Figure 3 - Typical classification of machine-learning techniques: classification and regression (supervised), and clustering and dimensionality reduction (unsupervised).
Figure 4 - Hierarchical clustering of the tested machine-learning algorithms.
Figure 5 - Residue normalized propensity for the core, non-interface surface and interface protein regions.
Figure 6 - Performance of the SpotOn algorithm by amino-acid type. Sensitivity, specificity and accuracy are reported, as well as the number of each amino-acid residue type in small boxes at the top. All values were calculated using the prediction outcomes from SpotOnDB – true positive (TP, correctly predicted HS), true negative (TN, correctly predicted NS), false positive (FP, NS predicted as HS) and false negative (FN, HS predicted as NS). Sensitivity was calculated by dividing TP cases by the total number of actual positives (Sensitivity = TP/(TP + FN)), specificity by dividing TN cases by the total number of actual negatives (Specificity = TN/(TN + FP)) and accuracy by dividing the total number of correct predictions by the total number of predictions (Accuracy = (TP + TN)/(TP + TN + FP + FN)).
Figure 7 - HS and residues in the protein interface. A. Interface and HS composition by amino-acid residue type. B. HS enrichment by amino-acid residue type.
Figure 8 - Hot-regions by residue type. The number of HS is plotted against the residue type, with average HS, NS and global (all interfacial residues) values represented by dashed lines. Normalization was done by dividing the average number of HS neighbours by the number of neighbours.
Figure 9 - Average conservation by residue type. The conservation score from Consurf is plotted against the residue type, with average HS, NS and global (all interfacial residues) values represented by dashed lines.
Figure 10 - Relative solvent-accessible surface area values. A. relSASAi average values by interfacial residue. Dashed lines represent the average values for HS, NS and global (all interfacial residues).
Figure 11 - HS/NS ratio for H-bonds, hydrophobic interactions and salt bridges. Dashed lines represent the average values for H-bonds, hydrophobic interactions and salt bridges.
Figure 12 - Interhelical distances plotted according to TM3-TM5 distance and TM5-TM6 distance, both measured in Å.


Table Index

Table 1 - Tested hyperparameters for the random discrete grid search for the deep-learning model.
Table 2 - Residues utilized for the interhelical distance calculation in TM3, TM5 and TM6 for D1R, D2R, D3R, D4R and D5R.
Table 3 - Mean values of the statistical metrics attained for each cluster for all pre-processing conditions, for both the training set (Train) and the testing set (Test).
Table 4 - Statistical metrics for the best algorithm of each method cluster and their combined regression model, both the “Full Regression” and the stepwise-optimized regression model (rf + svmPoly + pda), for both the training and testing sets.
Table 5 - Comparison of the performance of SpotOn with other common methods used for HS prediction for the full dataset.
Table 6 - Interface and HS composition as well as HS enrichment for each amino-acid residue type. Interface (%) represents the percentage of each residue in the interface, obtained by dividing the number of each interfacial residue by the total number of interfacial residues; HS (%) represents the percentage of each residue acting as HS when compared to the total interface, obtained by dividing the number of each HS residue by the total number of interfacial residues; HS distribution (%) represents the percentage of each residue acting as HS when compared to the total number of HS in the interface, obtained by dividing the number of each HS residue by the total number of residues classified as HS; HS Enrichment represents how each residue is represented as a HS when compared to the average HS distribution, as described in the Methods section; Residue-wise HS (%) represents the fraction of each interfacial residue classified as a HS when compared to the presence of that residue in the interface.
Table 7 - Average number of neighbouring residues (Mean) and standard deviation (sd) for intramonomer and intermonomer HS and NS.
Table 8 - Optimal hyperparameters according to the random discrete grid search for the deep-learning network.
Table 9 - Statistical metrics attained for all pre-processing conditions, for both the training set (Train) and the testing set (Test).


Annex Index

Annex 1 - This table lists the machine-learning algorithms mentioned throughout this thesis work, with a short explanation of each.
Annex 2 - This table explains all columns in the dataset used to construct the SpotOnDB.
Annex 3 - Papers published during this thesis work.


Acknowledgements

Os costumes ditam que tenho de agradecer a toda a gente que me ajudou a chegar até aqui. Prometo que, por questões meramente logísticas, não o farei da forma que devia - não há tempo nem papel que chegue.

Começo por agradecer à minha orientadora, Irina Moreira, pela constante ajuda e pela ética de trabalho focada não só nos resultados, mas também na aprendizagem.

À minha família, que foi o meu suporte e presença inabalável em todo e qualquer momento, através do encorajamento incansável e lugar seguro onde sinto que posso estar sem que nada me aconteça. Dizem que os amigos são a família que escolhemos; para mim, a família são os amigos que escolheríamos de qualquer maneira.

Aos meus amigos, que me sinto obrigado a rever da forma mais resumida e compreensiva:

• Ao Tó-Zé pelas incontáveis discussões e trocas de ideias e impressões, e pelos (contáveis, espero eu) copos e cafés - não é um termo que faça questão de usar, mas foste o meu melhor amigo;
• Aos "bioquímicos" (a todos, mas com mais cuidado a alguns: Ana Rita, Areal, Bernardo, Chicória, Valério) pelas saídas, pela AJA48 e por tudo o que me ajudou a esquecer que estava a tirar um curso;
• À RUC (nos tempos de mais e menos raios de sol) por ter sido a relação de amor-ódio que mais gostei;
• À JEST por me ter proporcionado tantas experiências e complicações novas;
• Ao resto da malta de Coimbra () pela cor que dão à cidade-a-preto-e-branco e pelo movimento que dão às paisagens estagnadas;
• A toda a malta da música (em especial ao Tiago Lopes por ter sido a presença constante) - Villa, Vilharigues Ensemble, DEL e por aí adiante. Foram o maior escape que eu tive e ensinam-me ainda hoje que nem só de pão vive o homem - a arte constrói-se com pessoas como vocês;
• Às velhas amizades de Viseu que raramente encontro por sermos todos pessoas extremamente ocupadas.

A todos vocês (e a quem me esqueci) - obrigado por tudo.


Abstract

Understanding protein-protein interfaces is at the foundation of studying molecular interactions in living organisms and of determining their relevance in higher-order protein networks. However, the most detailed information on protein complexes – the three-dimensional structure – is often unavailable, mainly due to difficulties in its experimental determination. As such, computational methods rise as increasingly valid alternatives by using the information already retrieved from previous experiments as a key driving factor in the study of protein structure. This thesis work focuses on protein-protein interfaces and on how computational tools can be used to aid in their understanding. To that end, four main tasks were conducted. The first was the development of a computational pipeline for the prediction of important residues in protein-protein interfaces (Hot-spots (HS)) using an ensemble of machine-learning algorithms. The final model, SpotOn, had an Accuracy of 0.95 and a Sensitivity of 0.98 for an independent test set and is available online at http://milou.science.uu.nl/cgi/services/SPOTON/spoton/. The second task involved a global assessment of protein-protein interfaces and HS using a non-redundant database of protein complexes and an adapted version of the computational pipeline developed in the previous task. This analysis showed that structural features, the number of intermonomer neighbouring residues and the number of neighbouring HS are essential in defining HS. The third task was the development of an algorithm capable of predicting interfacial residues from monomer structure information using deep-learning. The final model was not able to predict interfacial residues, probably due to the absence of relevant features. The fourth and final task aimed at understanding how dynamics affect protein-protein interfaces, using normal mode analysis (NMA) and interhelical distances in a case study based on G-protein coupled receptor (GPCR)-partner binding. While GPCR fluctuation values from NMA were able to distinguish different G-proteins and different arrestins, they were not as useful in distinguishing G-proteins from arrestins. Interhelical distances showed the opposite – G-proteins were easily distinguished from arrestins, but different G-proteins and different arrestins were indistinguishable. Results from the second and fourth tasks are available online at 45.32.153.74/spotondb and 45.32.153.74/gpcr, respectively.

Keywords: protein-protein interfaces, machine-learning, bioinformatics, structural biology


Resumo

Compreender as interfaces proteína-proteína é a base do estudo das interações moleculares em organismos e da determinação da sua relevância em redes proteicas complexas. Contudo, a informação mais detalhada sobre complexos proteicos – estruturas tridimensionais – é muitas vezes inexistente, principalmente devido a dificuldades na sua determinação experimental. Como tal, métodos computacionais surgem como cada vez mais válidos e capazes de atingir o mesmo desempenho demonstrado por métodos experimentais ao utilizarem a informação disponível de experiências anteriores como um fator chave no estudo da estrutura proteica. Assim, esta tese focou-se nas interfaces proteína-proteína e na maneira como ferramentas computacionais podem ser utilizadas para ajudar na sua compreensão. Para atingir isto, quatro tarefas foram realizadas. A primeira consistiu no desenvolvimento de um método computacional para prever resíduos importantes em interfaces proteína-proteína (Hot-spots (HS)) através de uma combinação de algoritmos de machine-learning. O modelo final, SpotOn, tem uma precisão de 0.95 e uma sensibilidade de 0.98 para um conjunto de dados independente e está disponível em http://milou.science.uu.nl/cgi/services/SPOTON/spoton/. A segunda tarefa envolveu uma aferição global de interfaces proteína-proteína e HS com uma base de dados não redundante de complexos proteicos e uma versão adaptada do método computacional desenvolvido na tarefa anterior. Características estruturais, o número de resíduos próximos intermonoméricos e de HS próximos foram determinantes na caracterização de HS. A terceira tarefa consistiu no desenvolvimento de um algoritmo capaz de prever resíduos interfaciais a partir de informação estrutural de monómeros com ferramentas de deep-learning. O modelo final foi incapaz de prever resíduos interfaciais, provavelmente devido à ausência de características relevantes para o problema. A quarta tarefa pretendia perceber como é que a dinâmica proteica afeta as interfaces proteína-proteína usando análise de modos normais (AMN) e distâncias inter-hélice num caso de estudo baseado na ligação de recetores acoplados a proteínas-G (RAPG) a parceiros intracelulares. Enquanto que valores de flutuação de RAPGs obtidos através da AMN eram capazes de distinguir diferentes proteínas-G e diferentes arrestinas, não eram úteis na distinção entre proteínas-G e arrestinas. Nas distâncias inter-hélice acontecia o oposto – as proteínas-G eram facilmente distinguidas das arrestinas, mas diferentes proteínas-G e diferentes arrestinas eram indistinguíveis. Os resultados da segunda e da quarta tarefas estão disponíveis online em 45.32.153.74/spotondb e 45.32.153.74/gpcr, respetivamente.

Palavras-chave: interfaces proteína-proteína, machine-learning, bioinformática, biologia estrutural


A. Introduction

1) Protein structure fundamentals

Proteins are the fundamental units of cells and higher-order organisms, displaying a high diversity of functions that can be closely associated with their structural and sequence composition 1. A key aspect of protein function is their interaction with small molecules, peptides or other proteins. For example, substrate binding is an essential step for catalysis in enzymes, peptide binding is necessary to trigger signalling cascades in some receptors, and protein-protein interactions are fundamental in the activation of some proteins 2. Structural elucidation of proteins is the first approach to understand how they interact with different systems. However, experimental determination of protein structure, especially of complexes, can be challenging, leading to the emergence of high-throughput but low-information methods 3.

Experimental protein structure determination is typically done through nuclear magnetic resonance (NMR), X-ray crystallography and cryo-electron microscopy (Cryo-EM). However, these techniques have distinct problems that make them much more complicated, expensive and time-consuming than high-throughput methods. The main advantage of NMR is the ability to determine not only protein structure but also protein dynamics 4. However, when it comes to large protein structures, spectral crowding becomes a problem due to insufficient sensitivity – while relevant motions might still be captured, structure determination becomes more challenging 5,6. Nonetheless, some methods such as more refined data analysis 7 and solid-state NMR 8 have been developed to tackle this problem. X-ray crystallography is fairly well developed concerning sensitivity – several structures have been resolved with subatomic resolution (< 1 Å) 9. Its major drawback is how time-consuming and expensive it can become: the structural determination of the aspartate protease, which required 160,000 different conditions in order to achieve good, analyzable crystals 10; the determination of the high-resolution crystal structure of an engineered human β2-adrenergic GPCR, which took 15 years 11; and the 13-year-long structure determination of the membrane-integral diacylglycerol kinase 12 are examples noted by Leman et al. in their 2015 review paper 13. Cryo-EM is characterized by imaging radiation-sensitive entities – cells, viruses and macromolecules – under cryogenic conditions using a transmission electron microscope 14. While Cryo-EM does not require protein crystal synthesis, it has relatively low resolution for membrane proteins (around 3 Å) when compared with X-ray crystallography. However, some recent cases show that Cryo-EM can provide highly informative structures of membrane proteins – the transient receptor potential channel 1 at 3.4 Å 15 and the chloride-conducting (CLC) ion channel at 3.7 Å 16 are examples of the potential underlying this technique. EMDataBank 17 – publicly available at http://emdatabank.org/index.html – is a database of protein structures solved through Cryo-EM.

Despite recent advances in protein structure determination, Membrane Proteins (MPs), a rather large protein group with particular characteristics, stand out as one of the biggest challenges in the field 18,19. This difficulty is largely derived from the influence exerted by the membrane on the proteins and vice-versa 20, which might be a consequence of the membrane’s cholesterol content 21,22 and/or of the thickness of the lipid bilayer’s hydrophobic region 20,23-25. This leads to a relatively low abundance of structural information on MPs in protein databases when compared with soluble proteins, even though MPs represent a considerable proportion of the human proteome 26. Furthermore, given the limited amount of structural information available, deriving information on how MPs interact is considerably harder.

Fortunately, computational methods – the alternative to experimental approaches – constitute a particularly advanced field in terms of both variety and efficacy 27-41. They rely on several different approaches, such as homology modelling, Molecular Dynamics (MD) 42 and de novo protein structure determination 43,44. As such, considering the available computational tools, methods for the prediction of protein structure and interaction should be developed and utilized to better understand the complex structure-function relationships in MPs.

Considering what has been described above, this thesis will focus on:

i. Describing MPs, the biggest challenge in the determination of protein structure and protein-protein interactions;
ii. Reviewing the state-of-the-art on protein structure and interaction prediction;
iii. Describing a new method for the prediction of Hot-Spots (HS) in protein complexes;
iv. Understanding the patterns underlying protein-protein interactions through a big-data high-throughput analysis of protein interfaces;
v. Attempting to develop a new method to predict protein interfacial patches using Deep-Learning;
vi. Understanding how the interface might be affected by receptor dynamics using a GPCR-partner case study.


2) Membrane Proteins

MPs are key representatives of the challenges associated with protein structure determination and mechanistic understanding; here, we also focus on a GPCR-partner case study. MPs are proteins associated with lipid domains, involved in communication, regulation and structural coherence. In fact, proteins that entirely or partially span the membrane (intrinsic/trans-membrane (TM) proteins), as well as proteins that are peripherally membrane-bound (peripheral MPs – PMPs), can carry out these functions. Considering that most computational methods for MPs are focused on TM proteins, the literature review presented in this thesis work will focus heavily on TM proteins, and the terms MP and TM protein will be used interchangeably. If the reader is interested in PMPs, specialized reviews covering this class of membrane proteins 45, their interaction with the membrane 46, and experimental and computational methods for their study 47 can be consulted.

Understanding protein structure-function relationships is essential to understand common pathologies at a molecular level and to develop improved pharmacological approaches 48-50. One of the most functionally relevant MP types are membrane receptors 51,52. Membrane receptors – comprising GPCRs (which will be considered and reviewed further ahead), olfactory receptors and nuclear receptors 53 – play many roles in biochemical and signaling pathways, and in triggering environmental, immune, hormonal and neurological responses, making them highly interesting targets for therapeutic investigation. They often share common structural traits among different functional protein groups, allowing for their classification into protein families or superfamilies.

MPs typically consist of several domains: extracellular (typically involved in cell-cell signaling and/or interactions), intracellular (performing a wide range of functions, such as activating signaling pathways and anchoring cytoskeletal proteins) and intramembrane (such as pores and channels) 54. TM proteins generally have different electronegativity and hydrophobicity profiles along their structure – they are amphipathic – allowing them to be in contact with both water (hydrophilic) and the membrane (hydrophobic). The structure and function of many proteins – including TM proteins – depend on post-translational modifications (PTM) such as phosphorylation and glycosylation. The two major recurrent protein structure motifs in MPs are TM α-helices 55, which repeatedly cross the membrane in α-helical bundles, and β-strands arranged into super-secondary structures known as β-barrels 56.


Despite their functional importance, only 4,012 MP structures can be found among the 131,205 protein structures deposited in the Protein Data Bank (PDB) 7 (statistics from June 27th, 2017) – around 3%, a figure that still includes multiple submissions of the same protein under a variety of experimental conditions, as well as relatively small domains of MPs. In contrast to the number of available MP structures, there are 199,322 MP sequence clusters according to UniProt’s UniRef (June 27th, 2017).

Two major factors can explain this discrepancy: i) difficulties in both expression – done in several organisms 57 but mostly in Escherichia coli 58 – and purification processes and ii) challenges associated with the 3D structure determination of purified MPs. Concerning the first factor, overexpression of MPs usually leads to cytoplasmic aggregates and changes in cell metabolism. A few methods have been devised to avoid the associated cytotoxicity, such as using and tuning E. coli strains that are not as affected by protein overexpression – a well-known example being the “Walker strains” 59. Protein extraction and purification are also challenging, as different protein extraction conditions provide different outcomes when it comes to protein stability, state and viability for structure determination 60 (these conditions may come down to something as apparently simple as choosing the right detergent for MP isolation 61).

MP structures solved by X-ray crystallography are often the result of a large amount of time invested in fine-tuning the best possible experimental conditions. After the optimal initial conditions have been determined, further optimization is required 62 – detergent addition, utilization of different 3D continuous lipid phases (allowing the protein to flow freely) 63 or the addition of antibody fragments for the stabilization of the protein structure 64 are examples of what might be needed to determine a MP 3D structure. Even data collection per se can be problematic, as the variability of crystals and their conditions (e.g. hydrophobic protein regions camouflaged by hydrophobic solvent, making it difficult to assess the transmembrane MP structure) prevents automated and stable data acquisition and processing 62.

NMR spectroscopy has progressed steadily, but some major drawbacks can still be identified: low sensitivity, a relatively small protein size cap and the intrinsic motions of the system under investigation. Considering MPs, challenges such as sample preparation and spectral crowding arise besides those already mentioned 65. Despite this, NMR has been somewhat efficient when it comes to studying the dynamics (e.g. relative population and conformation of different states, and exchange rates) of MPs with intrinsic conformational changes, such as channels, transporters and receptors 66. Recently, solid-state NMR has provided much better results when compared to liquid-phase NMR by avoiding the molecular weight cap. Unfortunately, spectral crowding is still a problem in solid-state NMR. These techniques are crucial as they enable the determination of MP structure in an actual membrane and not in a “detergent simulation” of a membrane as in X-ray crystallography 67-69. Recently, MPs have been studied by solid-state NMR in their native cellular environment 70.

i) G-protein coupled receptors

GPCRs are involved in a myriad of important functions, such as vision, taste and mood regulation 71,72, and their ligands can range from a single photon to a protein 73. While several different extracellular ligands can be identified, they share a restricted number of intracellular ligands – G-proteins, arrestins and GPCR-interacting proteins (membrane-inserted GPCR-binding proteins) 74. G-proteins are composed of three different subunits (α, β and γ), and the α subunit acts by activating (Gαs) 75 or inhibiting (Gαi) 76 the cyclic adenosine monophosphate (cAMP) pathway, stimulating the membrane-bound phospholipase C-beta (Gαq/11) 77 or modulating Rho family GTPase signaling (Gα12/13) 78. The β and γ subunits constitute the βγ complex, whose function remained unknown for several years. However, some roles have been identified for this complex, such as modulating the function of phospholipase C-beta 79 and activating G-protein coupled inwardly rectifying potassium channels 80. A key aspect of G-proteins is that binding is preferential – some complexes are much more frequent than others and, while the binding interfaces might be similar, some key differences are likely to give rise to this differential binding. Arrestins are mostly responsible for stopping G-protein signaling through direct competition for the binding site, receptor internalization and resensitization 81. They are, however, involved in other roles, such as stress responses 82, and have recently gained some interest as drug targets 83,84. The lipid environment also has an active role in modulating GPCR structure and function. For example, interaction with cholesterol significantly changes GPCR conformational flexibility 85 and modulates GPCR interactions. As such, it has been suggested that, rather than “binding sites”, GPCRs often have “high-occupancy sites” associated with these cholesterol “hot-spots” in the membrane.

Structurally, GPCRs share a typical pattern consisting of seven TM helices (TMH – TMH1-7) and a perimembranar intracellular helix (HX8), as well as similar intracellular binding partners. Intracellular loops (ICL) and extracellular loops (ECL), which interact with different intracellular and extracellular partners and ligands, respectively, bridge the TMHs in the following order: TMH1-ICL1-TMH2-ECL1-TMH3-ICL2-TMH4-ECL2-TMH5-ICL3-TMH6-ECL3-TMH7-HX8. Three High Variability Regions (HVR) have been identified in ICL3 and at the N- and C-terminal regions 8,86. All ICLs and HX8 have been involved in different GPCR-associated roles, such as ICL1 in receptor export from the endoplasmic reticulum 87,88 to the cellular membrane, ICL2 in modulating dimerization and partner interaction 89, ICL3 – the most disordered ICL – in G-protein interaction 90, and HX8 in chemokine interaction 91 and in PDZ-domain interaction 92. As such, these regions should be held in high regard when studying GPCR-intracellular partner interactions, as they are likely responsible for propagating the signal generated by GPCRs. Figure 1 displays the 3D structure of a GPCR with the aforementioned structural features labelled.

Figure 1 - The 3D structure for opsin (PDBID: 4J4Q 86). The main structural features are easily observable and labelled. Legend: TM1 – TransMembrane Helix 1; ICL1 – IntraCellular Loop 1; TM2 – TransMembrane Helix 2; ECL1 – ExtraCellular Loop 1; TM3 – TransMembrane Helix 3; ICL2 – IntraCellular Loop 2; TM4 – TransMembrane Helix 4; ECL2 – ExtraCellular Loop 2; TM5 – TransMembrane Helix 5; ICL3 – IntraCellular Loop 3; TM6 – TransMembrane Helix 6; ECL3 – ExtraCellular Loop 3; TM7 – TransMembrane Helix 7; HX8 – Perimembranar Helix.


Considering their structural conservation, high representation and ubiquity in the human organism, in both physiological functions and disease, it is no surprise that GPCRs have prompted a myriad of studies to understand their structure, especially at a computational level. Parkinson’s disease (PD), for example, has attracted comprehensive interest for the development of new drugs, and computational methods became key in doing so. GPCR-targeted agents represent approximately 30-40% of currently marketed drugs for human therapeutics, and these receptors have been subjected to a substantial number of computational studies 93, namely as PD targets. Drug discovery efforts targeting GPCRs have focused on the development of orthosteric agonists/antagonists to modulate receptor activity. However, the high structural conservation of orthosteric binding sites across subtypes of GPCR subfamilies stands as the limiting factor for the design of orthosteric therapeutic agents with high receptor subtype selectivity. Additionally, orthosteric ligands interacting with some GPCRs, namely peptide or protein receptors, have physicochemical and pharmacokinetic properties that make the discovery of small-molecule ligands unsuitable. This sparked interest in novel therapeutic agents acting as allosteric modulators of GPCRs, providing an alternative approach to subtype selectivity in treating disorders such as PD. Allosteric modulators interact with binding sites – allosteric sites – that are topographically distinct from the orthosteric sites of the endogenous ligands, increasing (positive allosteric modulators) or reducing (negative allosteric modulators) receptor responsiveness to ligands or the activity of the receptor. Some allosteric sites do not present high structural conservation, enabling higher subtype selectivity. Overall, exploring allosteric sites of GPCRs for drug design is of utmost importance in medicinal chemistry, enabling the targeting of selective GPCR-signaling pathways without modulating others that may lead to adverse effects, and allowing the search for a considerable diversity of chemical scaffolds for the optimization of the pharmacological profile of likely drug candidates 94,95. Considering the importance of this topic, my group published a highly comprehensive review on in silico methods for GPCR therapy in PD 96.

Dopamine receptors, which are also involved in PD, are an important GPCR family and are divided into two major subclasses – D1-like receptors (D1R and D5R) and D2-like receptors (D2R, D3R and D4R) – based on their intracellular ligands, anatomical distribution and physiological effects. They are involved in a wide array of neurological functions, such as voluntary movement, sleep, attention and learning, and other roles such as mediating hormone and immune system regulation 97-99. Furthermore, their deregulation has been implicated in several diseases, namely Parkinson’s disease 100,101 – by far the best studied disease involving dopamine receptors –, Huntington’s disease and schizophrenia 102. As such, studying these GPCRs and, more importantly, their interaction with intracellular partners, as for many other MPs, is crucial for the understanding of the molecular mechanisms underlying cellular functions and disease.

Considering that structure is of crucial importance to understand mechanisms from a molecular perspective and that experimental methods are not capable of handling such a high demand, fast and accurate computational methods should be developed to make sense of the large amounts of data being generated about protein complexes.

3) Machine-learning – general aspects

Most of the work I developed during my thesis was based on Machine-Learning (ML) techniques. ML, fundamentally, gives the computer, through the use of algorithms, the ability to learn a pattern or a stream of actions without being explicitly programmed to do so 103. Canonically, it has become an iterative optimization process – by repeating the same step several times (epochs), it can approach the best possible solution. However, some simpler models, such as linear regressions, can also be considered ML techniques and require only one simple mathematical function with no apparent repetition. A simple scientist/ML parallel can be established: to solve a problem using ML methods, a model first faces the problem without the necessary knowledge on how to solve it. It then gains that knowledge through repeated experience, adapting itself – training the algorithm using a training set – and scoring its own performance iteratively (using a validation set) until the best possible predictions are reached. The validation set is used to assess whether the training of the model is progressing in the right direction. After this step, the model is tested using an independent test set. All three mentioned sets – training, validation and testing – are extracted from the original dataset. Figure 2 summarizes ML as a simple, understandable workflow.
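As an illustration of this workflow (not code from the thesis), the sketch below splits a synthetic dataset into training, validation and test sets with scikit-learn and monitors validation accuracy across epochs; the dataset sizes, the choice of SGDClassifier and all parameter values are assumptions made purely for the example.

```python
# Minimal sketch of the split / train / validate / test cycle described above.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Split into training (60%), validation (20%) and independent test (20%) sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

model = SGDClassifier(random_state=0)
best_val, best_epoch = 0.0, 0
for epoch in range(50):                                  # iterative training ("epochs")
    model.partial_fit(X_train, y_train, classes=[0, 1])  # one pass over the training set
    val_acc = accuracy_score(y_val, model.predict(X_val))
    if val_acc > best_val:                               # monitor progress on the validation set
        best_val, best_epoch = val_acc, epoch

print(f"best validation accuracy {best_val:.2f} at epoch {best_epoch}")
print("independent test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```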


Figure 2 - The typical machine-learning workflow. The dataset is first split into training, validation and testing. Then, the algorithm is trained using the training set and after each iteration it is validated using the validation set.

ML can use either regression or classification algorithms, and both are able to handle categorical and continuous data 104. Regression algorithms – such as linear regressions – try to find a relation between one (univariate) or more (multivariate) independent variables and a dependent variable. As such, input and output correlate with each other in a continuous fashion. While a univariate linear regression can be easily calculated using the least-squares method, for example, more complex methods might involve random forests (RFs, which can also be utilized for classification), which typically use several data subsets drawn randomly with replacement to train several decision trees, combining them into a model that provides an optimal output 105. This process of using random data subsets with replacement to train the model is known as bootstrap aggregating, or bagging. Furthermore, techniques such as tree pruning (eliminating or greatly reducing the impact of some variables) render RFs a very powerful machine-learning algorithm. Classification algorithms use data to provide categorical outputs. This includes binary outputs, such as classifying a surface residue as interfacial or non-interfacial based on several of its own and the protein’s characteristics, as will be demonstrated further ahead, or multiclass outputs, such as identifying the intracellular localization of a protein from several protein characteristics 106. To do so, algorithms such as Support Vector Machines (SVMs), which separate classes using an artificial hyperplane, or the k-nearest neighbours algorithm, which determines the class of a new object by identifying the classes of its nearest k neighbours, can be used 107.
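The sketch below, again on synthetic data and not taken from the thesis pipeline, shows two of the classifiers just mentioned – a random forest built by bagging decision trees and a k-nearest neighbours model – using scikit-learn; every parameter value is an arbitrary illustrative choice.

```python
# Illustrative sketch: bagged decision trees (random forest) and k-nearest neighbours.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# 500 bootstrap samples of the training data, one decision tree per sample (bagging).
rf = RandomForestClassifier(n_estimators=500, bootstrap=True, random_state=1)
rf.fit(X_train, y_train)

# Class of a new object = majority class among its k nearest training neighbours.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print("RF test accuracy :", rf.score(X_test, y_test))
print("kNN test accuracy:", knn.score(X_test, y_test))
```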

The previously described methods – both for regression and classification – are considered supervised learning. Supervised learning is a type of ML in which the output is known and the task is to predict that output. Unsupervised learning, on the other hand, is a bit more esoteric – without any output, the algorithm creates an artificial division, sorting the data into different groups known as clusters based on the similarity of the different samples (clustering), or attempts to identify underlying patterns to simplify the dataset (dimensionality reduction). For clustering, several methods can be utilized. The most popular approach is likely the k-means algorithm, which assigns the data to one of k user-determined clusters. It iteratively determines the best mean for each cluster, typically defined as the centroid (the mean position) of the data points belonging to that cluster. New observations are then assigned to the nearest cluster (centroid) 108. As for dimensionality reduction, the most commonly used approach is Principal Component Analysis (PCA), which will be described in the Methods section. Figure 3 is a representation that aims to simplify these ML algorithm classifications. Annex 1 contains all ML algorithms mentioned along this thesis work, as well as a short explanation of each.
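For completeness, a minimal sketch of the two unsupervised tasks named above, using scikit-learn's KMeans and PCA on synthetic data; the number of clusters and of principal components are arbitrary assumptions.

```python
# Minimal sketch: clustering (k-means) and dimensionality reduction (PCA).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=300, n_features=10, centers=4, random_state=2)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=2).fit(X)   # centroid-based clustering
pca = PCA(n_components=2).fit(X)                                  # linear dimensionality reduction

print("cluster sizes:", [list(kmeans.labels_).count(c) for c in range(4)])
print("variance explained by two components:", pca.explained_variance_ratio_.sum())
```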


Figure 3 - Typical classification of machine-learning techniques: classification and regression (supervised), and clustering and dimensionality reduction (unsupervised).

4) Computational methods for structure

To understand protein structure and interaction using computational tools, it was first necessary to know and understand the available methods. As such, this Introduction section will highlight important computational methods for the study of protein structure. It will address the relevance of sequence information for protein structure prediction and will cover bioinformatics tools to study soluble proteins, membrane proteins, protein-protein interactions, MP-partner interactions and GPCRs.

i) Sequence importance for protein structure prediction

Two key concepts are necessary to understand how protein sequence can determine protein structure – Position-Specific Scoring Matrices (PSSMs) and Multiple Sequence Alignments (MSAs). PSSMs provide an easy way of determining how likely an amino acid is to be represented at a position, or to have a functional property, by building on MSAs, which can, ideally, match residue pairs or vectors and reveal common sequence patterns underlying protein function and structure. To do so, they use three different classifications for each aligned residue pair – match (when the residues are the same), mismatch (when the residues are different) and gap (when there is no corresponding residue). Given their central role, it is important to consider the general aspects underlying MSAs and PSSMs.
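A toy sketch of the idea behind a position-specific profile is given below: per-column amino-acid frequencies computed from a tiny, already aligned set of invented sequences. Real PSSMs additionally use log-odds scores and pseudocounts, which are discussed further below.

```python
# Toy position-specific profile: per-column amino-acid frequencies ('-' marks a gap).
from collections import Counter

msa = ["MKV-LT",
       "MRVALT",
       "MKVALS",
       "MKI-LT"]

profile = []
for j in range(len(msa[0])):
    column = [seq[j] for seq in msa if seq[j] != "-"]   # ignore gapped positions
    counts = Counter(column)
    total = sum(counts.values())
    profile.append({aa: n / total for aa, n in counts.items()})

for j, freqs in enumerate(profile):
    print(j, freqs)
```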

MSAs started off as techniques performing global alignment 109, which tries to match sequences along their full length. This leads to some problems, since some sequences might share homology only in some regions and, even if there are several highly homologous regions, these can be shuffled, distant or repeated 110,111. To address this problem, local alignment techniques were developed 112, which do not require the full length of the protein. Instead, they focus on finding only the common subsequences across different proteins. Methods capable of finding subsequences with common residue pairs 113 or using only “exclusive” – nonintersecting – residue pairs 114 were developed. Even though the theory underlying local alignment makes it seem a more capable approach for finding common subsequences in different protein sequences, these methods are often incapable of dealing with highly gapped common subsequences 115, making them a poor alternative in such cases.
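The difference between the two alignment modes can be sketched with Biopython's PairwiseAligner (assuming Biopython ≥1.78 is available); the match/mismatch/gap scores below are arbitrary choices used only for illustration.

```python
# Hedged sketch: global (full-length) vs local (best subsequence) pairwise alignment.
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.match_score = 1
aligner.mismatch_score = -1
aligner.open_gap_score = -2
aligner.extend_gap_score = -0.5

s1, s2 = "MKVLTTAGLK", "AGLKMKV"

aligner.mode = "global"                       # Needleman-Wunsch-style alignment
print("global score:", aligner.score(s1, s2))

aligner.mode = "local"                        # Smith-Waterman-style alignment
best_local = next(iter(aligner.align(s1, s2)))
print(best_local)
```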

Simossis et al. 116 consider three essential steps when performing MSAs: i) selecting sequences (building a database of sequences to be aligned and compared), ii) selecting an adequate scoring function that allows the comparison of sequences or subsequences and iii) iteratively applying this scoring function to build and optimize the alignment. When comparing already known proteins, sequence selection is typically not needed, unless some sequences are detrimental to the result of the final MSA due to clear differences. However, when using MSAs to calculate PSSMs, for example, databases comprising thousands or millions 117 of sequences can be used to search for protein sequences. This search is usually done considering homology, using methods such as the Basic Local Alignment Search Tool (BLAST) 118. Selecting the appropriate scoring function is key in constructing the optimal MSA. These functions typically work column-wise (analysing each column of aligned residues at a time) and are usually the summation of all pairwise scores.

Several scoring functions are available to iteratively evaluate the MSA – to do so, one can use scoring matrices, which either quantify the likelihood of a residue appearing at a given position or use pre-calculated likelihoods, both of which are used to assess the global score of the MSA. Possibly, the best-known pre-calculated scoring matrices are substitution matrices, which are based on the observed substitution frequencies in sequence alignments. For example, mutations occurring between residues of identical nature can be
considered: hydrophobic-hydrophobic mutations (leucine and isoleucine, for example) are more likely to occur than those not keeping the position’s nature. One of the earliest pre-calculated scoring matrices is the Point Accepted Mutation (PAM) matrix 119, which generates each residue pair score considering the probability of one residue mutating to a different one over the course of some mutations (alanine can mutate directly to arginine, or it can first mutate to isoleucine and only then to arginine) – this led to the creation of several PAM matrices, such as PAM250, which considers 250 mutations per 100 residues in the sequence, or PAM1, which considers only 1 mutation for the same sequence length. The BLOcks of amino acid Substitution Matrix (BLOSUM) 120 is a different pre-calculated substitution matrix, used in BLAST. Instead of global alignments, local alignments are used over several different sequence databases clustered at different identity percentages – generating different BLOSUMs, such as BLOSUM80, built from a database of sequence blocks clustered at 80% identity, and BLOSUM62, built from blocks clustered at 62% identity. To calculate likelihoods from the MSA itself and combine them with the information from the pre-calculated scoring matrices, the frequency or count of a given residue in a column can also be calculated and combined with BLOSUM scores. The combination of both calculated and pre-calculated counts or likelihoods renders what is known as pseudocounts or pseudolikelihoods, respectively. This enables the combination of context information (from the MSA itself) and previously obtained knowledge. Additionally, it prevents scores from being 0 when the MSA-derived counts and likelihoods are 0 (for example, if at a given position in an MSA no leucine residues are observed, its count and likelihood are 0, but its pseudocount and pseudolikelihood are never 0). This is important because it would be extremely unlikely for a residue to never be represented at a given position of an MSA if enough sequences were presented, and it prevents mathematical complications (such as accidentally dividing by 0). Hidden Markov Models (HMMs) are increasingly popular algorithms in bioinformatics that can also be used to derive MSA profiles. They offer a great advantage – theoretically, HMMs can work with both aligned and unaligned data, and they provide a solid statistical basis for sequence alignment. To generate profiles, HMMs are trained with a set of sequences to determine how likely a transition (passing from one residue to the next) is in an MSA, considering its current state and next state. The available states are deletion – a position is skipped in the MSA for a single sequence or a minority of the available sequences –, insertion – a position is skipped in the MSA for most available sequences – and match – all sequences have a residue in that position. By deriving these probabilities for each position and for each possible residue at that position, an HMM can be built to score a sequence and build its profile. Furthermore, if the HMM is good enough, it can also be used to actually build the alignment for new sequences, considering the transition probabilities for each state 121. After scoring all residues, techniques such as the sum-of-pairs
score 120, which sums every residue pair score, can be used to obtain a global MSA score that effectively evaluates the fitness of an MSA.
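As a concrete illustration of a column-wise, sum-of-pairs scoring function (a simplified sketch, not the exact formulation used by any specific MSA program), the snippet below scores a toy alignment with BLOSUM62 loaded from Biopython and an assumed fixed gap penalty.

```python
# Sum-of-pairs MSA score: every residue pair in every column is scored with BLOSUM62.
from itertools import combinations
from Bio.Align import substitution_matrices

blosum62 = substitution_matrices.load("BLOSUM62")
GAP_PENALTY = -4.0                      # assumed flat penalty for any pair involving a gap

msa = ["MKV-LT",
       "MRVALT",
       "MKVALS"]

def column_score(column):
    score = 0.0
    for a, b in combinations(column, 2):
        score += GAP_PENALTY if "-" in (a, b) else blosum62[a, b]
    return score

total = sum(column_score([seq[j] for seq in msa]) for j in range(len(msa[0])))
print("sum-of-pairs MSA score:", total)
```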

Upon selecting or constructing an appropriate scoring function, the MSA algorithm will then iteratively construct and improve an MSA. To build an order according to which sequences are aligned, phylogenetic trees can be used, placing more homologous sequences closer to each other and sequentially aligning each sequence according to the previous 116. Besides doing this, the algorithm can focus on performing local alignments and considering those subsequences as starting points in global alignments. After having constructed this preliminary MSA, other techniques can be used, most of which function iteratively. To better illustrate what has been discussed, a practical example will be described – the Tree-based Consistency Objective Function for alignment Evaluation (T-Coffee) algorithm. First, T-Coffee retrieves pairwise global and local alignments from ClustalW 122 and Lalign 123, respectively, and alignments are weighted according to pairwise sequence identity. To combine both libraries, the scores for two identical residue pairs from ClustalW and Lalign for the same position are summed and considered as a single entry, while unique residue pairs for the same position are considered as separate entries. This will create a series of constraints, which will provide better MSAs overall. Then, it performs what the authors refer to as library extension, a heuristic process which calculates the likelihood of a pair based on triplets of matched residues – if two sequences share the same residue at a given position and other sequences have the same residue in that position, the weight for this residue pair will be as high as the number of triplets supporting the initial residue pair. Residue pairs which do not occur are given a weight of zero. By using a tree to calculate sequence similarity, the two most similar sequences are selected and the weights calculated during library extension are used to maximize the MSA score. Then, sequence pairs are added and residues are shifted until the final MSA is constructed. During this process, no gaps are removed after being added to the MSA.
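The library-extension step can be illustrated with a deliberately simplified toy (invented sequence labels, positions and weights; the real T-Coffee library is larger and treats both directions of each pair): the weight of a matched residue pair is reinforced by the minimum of the two weights supporting it through a third sequence.

```python
# Toy T-Coffee-style library extension over invented residue pairs and weights.
library = {
    ("A", 0, "B", 0): 80.0,   # (seq1, pos1, seq2, pos2): pairwise identity weight
    ("A", 0, "C", 1): 70.0,
    ("C", 1, "B", 0): 60.0,
}

def extended_weight(s1, p1, s2, p2, lib):
    w = lib.get((s1, p1, s2, p2), 0.0)
    # reinforce the pair through any third sequence aligning to both residues
    for (a, i, b, j), w1 in lib.items():
        if (a, i) == (s1, p1):
            w2 = lib.get((b, j, s2, p2), 0.0)
            if w2:
                w += min(w1, w2)
    return w

print(extended_weight("A", 0, "B", 0, library))   # 80 + min(70, 60) = 140
```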

Upon performing an MSA, a convenient and comprehensive way of representing it is created – typically called a “profile”. Numerically, this can be done using PSSMs, which construct vectors with 20 elements for each single residue in a specific position of the MSA, resulting in a 20x20 matrix per position. This comprehends a heavy amount of information – 20x20xN, with N as the length of the MSA (for an MSA with 200 residues, 80000 likelihoods are determined). Visually, platforms such as Consurf 124 use colour to represent the conservation of a residue at a specific position. Although one can argue that there is some correlation between both, PSSMs are not the same as conservation – while a PSSM, for a specific residue at a specific position, comprehends 20 different values, one for each residue, a conservation score – such as Rate4Site 125 – is a single value for each residue. Given its utilization during my thesis project, the Rate4Site algorithm for functional conservation calculation will be briefly described to provide a clearer understanding of how computational calculation of residue conservation can be done. Starting with an MSA, Rate4Site calculates phylogenetic trees using the Neighbour-Joining (NJ) algorithm. This algorithm sequentially joins the sequences that are closest, and therefore most similar, to each other by creating a new node in each iteration. This node always connects to the tree constructed up to that point in the algorithm. Assuming that a single position in the MSA has the same evolutionary rate across sequences, Rate4Site determines the rate that maximizes the conditional probability of the data given that rate. All rates are then normalized and standardized, resulting in a rate distribution with mean = 0 and standard deviation (sd) = 1. It becomes easy to understand why PSSMs and conservation scores are used extensively in different ML methods and how MSAs became key players in bioinformatics and computational biology.
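A minimal sketch of the final normalization step described for Rate4Site – standardizing per-position rates to mean 0 and standard deviation 1 – is shown below with made-up rate values (the use of the population standard deviation is an assumption for the example).

```python
# Standardize per-position evolutionary rates to z-scores (mean 0, sd 1).
import statistics

rates = [0.2, 0.5, 1.3, 2.1, 0.9, 0.4]          # invented per-position rates
mean = statistics.mean(rates)
sd = statistics.pstdev(rates)                    # population standard deviation
z_scores = [(r - mean) / sd for r in rates]
print([round(z, 2) for z in z_scores])
```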

ii) Computational methods for soluble protein structure

Computational methods for the study of soluble protein structure are considerably more developed than those for MP structure. This arises due to the much higher availability of data for both result validation and method development. For this subsection, I focused on soluble protein structure prediction techniques, which can be spread across four classes: (i) knowledge-independent ab initio methods, (ii) knowledge-dependent ab initio methods (machine-learning methods), (iii) fold recognition via threading and (iv) comparative modelling methods and sequence alignment techniques – homology modelling 126.

Theoretically, Molecular Dynamics (MD) simulations are the best possible solution when considering solely accuracy. MD is a knowledge-independent ab initio approach that simulates a physical atomic/molecular system, rendering the best possible structure prediction of a protein from its sequence. However, even though MD simulations are widely used and have proven effective in understanding protein motion 127,128 and other molecular mechanisms 129-133, they should be disregarded when predicting most protein structures, as they require immense computational power to correctly predict the structure of a protein based solely on its sequence 134. The first MD software to be developed, Chemistry at HARvard Macromolecular Mechanics (CHARMM) 135, is regularly updated, making it a highly capable program in terms of both MD analysis and model-building tools. CHARMM has several energy functions, comprising quantum-mechanical force fields and all-atom classical potential energy functions with explicit solvent, among others
135. Furthermore, the conditions in which molecular systems are simulated are highly customizable, enabling both soluble and membrane-bound systems with different types of biomolecules. Assisted Model Building with Energy Refinement (AMBER) 136, another example of an MD simulation package, is widely used by other packages as a force field. GROningen MAchine for Chemical Simulations (GROMACS) 137 is one of the packages that use force fields provided by AMBER and other MD software programs. It is currently capable of functioning with limited computational power, through the use of several algorithmic optimizations and other methods, such as the united-atom model, reducing the complexity of the representation of the molecular system and removing some degrees of freedom 126,138.
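Since none of the packages above lends itself to a few lines of illustration, the sketch below uses OpenMM (a Python-scriptable MD engine not discussed in the text, assuming version ≥7.5) with an AMBER force field, purely to show the basic setup-minimize-simulate workflow; the input file is a hypothetical, already protonated structure, and the vacuum/no-cutoff settings are chosen only to keep the example short.

```python
# Illustrative minimal MD setup with OpenMM (not one of the engines discussed above).
from openmm import app, unit, LangevinMiddleIntegrator

pdb = app.PDBFile("protein.pdb")                 # hypothetical, complete, protonated structure
forcefield = app.ForceField("amber14-all.xml")   # AMBER ff14SB parameters shipped with OpenMM
system = forcefield.createSystem(pdb.topology,
                                 nonbondedMethod=app.NoCutoff,
                                 constraints=app.HBonds)
integrator = LangevinMiddleIntegrator(300*unit.kelvin, 1/unit.picosecond,
                                      0.002*unit.picoseconds)
simulation = app.Simulation(pdb.topology, system, integrator)
simulation.context.setPositions(pdb.positions)
simulation.minimizeEnergy()                      # relax the starting structure
simulation.step(5000)                            # 10 ps of dynamics
```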

Knowledge-dependent ab initio methods use data retrieved from experimental studies to predict 3D protein structures. To do so, they rely heavily on the assumption that structural domains, subdomains or motifs are conserved across proteins and are highly related to protein sequence information 139. A 2014 review by Dorn et al. on computational strategies for three-dimensional protein structure prediction 126 summarizes knowledge-dependent ab initio prediction in five essential steps: i) dividing the target sequence into fragments, ii) searching for sequences similar to each fragment in a database of known structures, iii) scoring the different fragments, iv) assembling the three-dimensional structure from the fragments, and v) refining the final structure. The major drawback of this process is step iv) – combining different protein fragments has to satisfy several physicochemical constraints and can be a heavy computational burden. ROSETTA 140 is a software package initially created for the determination of some protein structures as part of the third edition of the Critical Assessment of protein Structure Prediction (CASP3), a worldwide effort to determine new methods for protein structure determination, as well as new protein structures 141. A Monte Carlo simulated annealing strategy is utilized to assemble short protein fragments from known proteins, with typical accuracies of 3-6 Å root mean square deviation from the experimentally determined structures. Assembling all fragments is made possible by the influence local sequence preference has on the local structure of a protein 140. Several interactions are considered, such as solvation, electrostatic interactions, disulphide bonds, hydrogen bonds (HB), arrangement of sheets and packing of helices. The final score for the prediction is then obtained by combining all functions into a single value. The scoring function – the potential energy surface – separates the total energy into several Bayesian components describing the likelihood of a particular structure independently of the sequence 140,142. ROSETTA has been utilized for other tasks as well, such as protein-protein docking 142 and loop modelling 143, among others.
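The Monte Carlo simulated annealing idea used by ROSETTA-like fragment assembly can be sketched generically as below; the "conformation", the scoring function and the annealing schedule are entirely artificial stand-ins, not ROSETTA's actual energy terms.

```python
# Toy Monte Carlo simulated annealing: propose a change, accept if better,
# or accept with Boltzmann probability otherwise (Metropolis criterion).
import math
import random

random.seed(0)
state = [random.uniform(-180, 180) for _ in range(10)]   # a crude torsion-angle "model"

def score(conformation):
    # stand-in scoring function; a real one combines solvation, H-bonds, packing...
    return sum(abs(angle) for angle in conformation)

current = score(state)
temperature = 100.0
for step in range(5000):
    proposal = state[:]
    proposal[random.randrange(len(proposal))] = random.uniform(-180, 180)  # "insert a fragment"
    new = score(proposal)
    if new < current or random.random() < math.exp((current - new) / temperature):
        state, current = proposal, new
    temperature *= 0.999                                  # annealing schedule

print("final toy score:", round(current, 1))
```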


FRAGFOLD 144 is another knowledge-dependent ab initio method for protein structure prediction, assessed in the fourth edition of the CASP program, CASP4 145. It is based on the assembly of supersecondary structural fragments taken from highly resolved proteins. First, there is a folding simulation to identify favourable supersecondary structure fragments. PSIPRED 146 is used to determine secondary structures, which are then needed to predict supersecondary structures. Then, several random conformations are calculated for the fragments until all atoms satisfy a minimum distance between different amino acid residue atoms. These conformations are then sampled by randomly selecting different fragment conformations. The component energy terms and their standard deviations are then calculated and used as weights. The optimization process then follows a simulated annealing approach, which aims to minimize the weighted energy terms. To keep the structure from getting stuck in local optima, 50% of all simulations are performed on a free (non-optimum) conformation.

Fold recognition via threading is a technique relying on the existence of conserved and sequence-independent protein regions and motifs 147,148. As such, a limited number of possible protein regions and motifs have been discovered 149. This method is particularly useful when not enough sequence information is available, thus complementing sequence-dependent methods. According to an assessment of methods done by Abual-Rub and Abdullah in 2008 150, four steps characterize the method of fold recognition via threading: i) construction of a protein structure template library, ii) constructing a scoring function that measures the fitness between the target sequences and the templates, iii) devising an efficient, computational cost- effective algorithm to search for the best possible template(s) and iv) minimizing the scoring function in order to get the best fit possible for the selected template. Steps iii) and iv) might have to be repeated several times in order to achieve the best possible solution, making this an iterative process. GenTHREADER is a fold recognition via threading platform which used only Class, Architecture, Topology, and Homologous superfamily (CATH) 151 as its scoring function, and alignment score, length information and energy potentials as its features 152. It comprehended a simple artificial neural network (ANN) trained to combine these three factors into a single score. As of 2003 153, it started to include Position-Specific Iterative Basic Local Alignment Search Tool (PSI-BLAST) 154 scores – a protein sequence feature – with the help of Families of Structuraly Similar Proteins (FSSP) 155 – a database of known and aligned protein folds – and PSIPRED 146. The sequence alignments are utilized to calculate pair wise potentials and solvation potentials as inputs to a multi-layer feed-forward ANN with the previous score. Two novel components were introduced in 2013 156 – pGenTHREADER and pDomTHREADER. The first is recommended for fold recognition and is used to identify distant homologues, while the

second is useful in discriminating superfamilies. The latter uses both sequence and structural data, making it a much more sensitive and selective method.

Homology modelling is a method which predicts the structure of a protein sequence using a high-similarity template of known structure. The main and most utilized databank for this is the PDB 157. A well-known and widely used example is MODELLER 158, which performs comparative protein structure modelling by satisfying spatial restraints, enabling several of the processes currently employed in homology modelling, such as sequence searches in databases and optimization of protein structures using an objective function. One of the main aspects to consider in homology methods and other computational methods is that amino acids have physical properties which can be divided into two categories: (i) properties that favour sequentially localized interaction clusters and (ii) properties that favour globally distributed interactions 159. This means that, for protein structure prediction tasks, it is not enough to consider the amino acid residue per se; its neighbouring residues and interactions with remotely positioned amino acids must also be considered.

Even though several methods have already been developed for the determination of protein structures, most of these cannot be directly applied to membrane proteins. Knowledge-independent ab initio methods require the modelling of the membrane, which is completely different from the cytosolic environment: while the former is an amphiphilic layer in which the protein is inserted, the latter is generally a strictly hydrophilic environment. As for all knowledge-dependent methods, membrane protein-specific databases have to be constructed due to the highly different structures occurring in membrane proteins (for example, while most cytosolic proteins have a hydrophobic core and hydrophilic residues on their surface, hydrophobic amino acid residues are likely to be found on the membrane protein surface). This elicits the need to create different tools to determine the structure of membrane proteins, which will be further addressed in the next chapter.

iii) Computational methods for membrane protein structure

To understand how to adapt the methods used for cytosolic proteins to MPs, a short summary of some method categories is presented:

i. Knowledge-independent de novo methods should consider both the cytosolic and the membrane environment, which makes MD simulations more computationally expensive. Since MD was not used in the vast majority of this thesis project, the reader can access a review covering some uses of MD in MP structure prediction and contact interaction studies 160;

ii. Knowledge-dependent de novo methods can be developed considering information on secondary structures and trans-membrane segments, for example, and how to assemble each region into a MP structure. The number of MP structures available to construct a database is, however, a limiting factor for these methods;

iii. Threading methods and homology modelling are the best techniques if homologues can be found, since they provide good results without requiring much time. MP homologues are, however, much less abundant than those of cytosolic proteins, which complicates these methods.

As mentioned above, methods for the prediction of membrane protein structure should consider most of the steps mentioned for cytosolic protein structure, with some key adaptations, such as the prediction of the MP's transmembrane segments and residue hydrophobicity. For example, the already mentioned PSIPRED 146 is an online platform using PSSMs and ANNs to predict secondary protein structure and orientation, also known as protein topology. Such general-purpose topology prediction methods are not ideal for membrane proteins. As such, the laboratory that developed PSIPRED also developed MEMSAT-SVM 161, an MP-specific topology prediction tool based on SVMs that is able to discriminate between cytosolic and membrane proteins, resulting in much better predictions overall. Combining hydrophobicity scales with the prediction of secondary structures is also recommended. The most well-known scale is the White hydrophobicity scale 162, but several others have been developed more recently, such as the Unified Hydrophobicity Scale 163.

Further addressing topology prediction, OCTOPUS 164, one of the best transmembrane α-helical segment predictors, combines four different ANNs, each focused on predicting membrane, interface, loop and globular residues, through an HMM. Concerning β-barrels, an important supersecondary structure in MPs, BOCTOPUS 165, a highly popular method by the same group that developed OCTOPUS, produces local predictions through SVMs and then combines them through an HMM. SPOCTOPUS 166, yet another method by the same group, can distinguish signal peptides from membrane proteins and predict their topology. These three methods for MP topology prediction, OCTOPUS, BOCTOPUS and SPOCTOPUS, are ideal to highlight the rich variability of approaches and algorithms used to deal with specific problems. The methods mentioned in this paragraph are knowledge-dependent and, as such,

must have access to consistent and robust databases. ExTopoDB 167 is a comprehensive database with topology information on 2143 MPs across 158 organisms (accessed on July 28th).

Several methods combine topology prediction with hydrophobicity: Light Interfaces of High Polarity (LIPS) 168 uses hydrophobicity scales to predict helices and amino acid residue orientation; PRIMSIPLR 169 is a knowledge-dependent method which uses PSSMs, hydrophobicity scales, flexibility scales and evolutionary conservation to train an SVM to identify central pore residues; and MemBrain 170 is an online server capable of predicting transmembrane α-helices and how they interact by predicting their contacts.

While methods to predict α-helices and β-sheets and their association in MPs are available, some secondary structures with lower occurrence are harder to predict. According to Leman et al. 13, these are re-entrant helices (sometimes referred to as "P-loops"), half-helices (α-helices that do not span the entire membrane), amphipathic helices (α-helices that lie on the surface of the membrane), trans-membrane helix kinks and β-barrels composed of more than one protein chain. The methods presented for topology prediction typically perform poorly in these situations. There are, however, some methods which address some of the mentioned problems, such as TMkink 171 for the determination of helix kinks. The main problem with these sorts of methods is that they depend on the available information to make accurate predictions, which might not be enough to develop a reliable tool.

To calculate MP tertiary structure, the most popular methods are without doubt homology modelling and ML methods (knowledge-dependent de novo methods). While the previous paragraphs focused more on the latter, the remainder of this chapter will review homology modelling tools for MPs. The availability (or rather scarcity) of homologues is particularly relevant for homology modelling in MPs, since the number of unique MP 3D structures is significantly lower than that of cytosolic proteins. Some methods have been developed specifically for membrane protein modelling, namely MEMOIR (Membrane protein modelling pipeline) 172, which can model the 3D structure of a protein of known sequence provided there are homologous MPs with determined 3D structures available, and MEDELLER 173, which has provided interesting results thanks to its tailor-made MP structure prediction: a sequential prediction of protein core and loops. MEDELLER will not generate 3D coordinates for regions for which the prediction is uncertain. This has the advantage of rendering more accurate models; on the other hand, these models are also slightly more incomplete. Structural homology modelling (threading) can overcome the lack of homologues for given sequences; however, as already mentioned, the small number of experimentally available MP structures

can lead to insufficient sampling. An example of a pipeline using threading is TMFoldWeb 174, a web implementation of TMFoldRec 175. Upon topology prediction, systematic sequence-to-structure alignment is performed, resulting in the selection of several templates which are ordered according to energy and reliability. Rosetta has also been widely applied to MP prediction 176. The main improvement over soluble protein prediction was the implementation of a new membrane-specific version of the original Rosetta energy function, which considers the membrane environment as an additional variable alongside amino acid identity, inter-residue distances and density 176. Rosetta has been used to reveal important structural details in voltage sensor MPs, namely the K(v)1.2 and KvAP potassium channels 177, and to gain insight into voltage-dependent gating 178. Recently, RosettaMP was developed as a general framework for membrane protein modelling, featuring modelling tools developed in the past few years 179.

iv) Computational methods for protein-protein interactions and complexes

To understand how proteins interact and what drives the formation of a protein-protein complex from a computational perspective, it is first necessary to know what methods are available to study interface-related features in protein complexes. As such, this Introduction subsection focuses on: i) interface prediction, ii) prediction of interface-related properties and iii) docking.

Concerning interface prediction, an aspect widely regarded as critical for the identification of interfacial residues is evolutionary conservation, since interfacial residues are typically more conserved than non-interfacial surface residues 180. To understand evolutionary conservation, one has to consider how evolution shapes protein sequence and structure: while residues which favour the protein structure/function are present in several proteins at specific motifs/positions, those that are detrimental are usually more variable 181. As such, tools that calculate evolutionary conservation, such as Consurf 124 and Scorecons (available at https://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/valdar/scorecons_server.pl), are extremely helpful when seeking to predict interfacial residues. Both these tools are based heavily on MSA 118,182-184. Consurf provides a structural perspective for this problem, as it models the provided sequence using homology modelling and further determines residue conservation. A method which combines MSA with interface prediction is EVComplex 185. Using sequences from two interacting proteins, EVComplex applies EVCouplings 186 to predict intramonomer and intermonomer contacts and, from this information, the structure of the complex using only the sequences of the two interacting proteins.

A key aspect of EVComplex and EVCouplings is that they treat co-occurring mutations as accurate predictors of intramonomer and intermonomer contacts. These happen when mutations in an apparently conserved region are accompanied by a mutation in another protein region or in a different protein of the complex: since evolution drives mutation permanence, if two residues tend to mutate together, it is likely that they interact with each other. PS-HomPPI and NPS-HomPPI 187 also use only sequence, predicting interfacial residues based on the interfacial residues of homologous proteins for which interface data are available. This method can be partner-specific (PS-HomPPI) or non-partner-specific (NPS-HomPPI), but the data suggest that partner specificity increases the accuracy of the results. In fact, an important note made by Xue et al. in their 2015 review 188 is that partner information, which is often overlooked, is very valuable for protein interface prediction. A comparison of the results obtained with PPIPP and PAIRpred (which use partner information) against those from PSIVER 189 (sequence-based) and SPPIDER 190 (structure-based), which will be discussed briefly, showed that partner information greatly improves the predictions made. As for methods using the monomer structure, SPPIDER 190 is a machine-learning approach that predicts interfacial residues based on predicted relative solvent accessibility (using the unbound monomer solvent accessibility and other structural features), WHat Information Does Surface Conservation Yield? (WHISCY) 191 uses structure to define surface residues and to smooth the prediction, calculating conservation for all surface residues, and Consensus Prediction Of interface Residues in Transient complexes (CPORT) 192 combines several other interfacial residue predictors to make its own predictions, making it a meta-server. Evolutionary features, regardless of their popularity, have a considerable disadvantage: they are only successful when a high number of homologues is available 193,194. As such, methods which are able to abstract from evolutionary conservation are bound to be more robust across all sorts of protein sequences. One example is the method developed by Wang et al. 194 for intramonomer contact prediction, which processes sequences using an ultra-deep ANN and convolutional neural networks (CNN) to accurately predict protein contacts without using evolutionary conservation.

Interface-related characteristics are plentiful and necessary to describe the interface: these range from H-bonds, salt bridges, hydrophobic interactions, solvent-accessible surface area (SASA), number of nearby atoms, total number of interface atoms, and polar and apolar energy at the interface, to hot-spots (HS), hot-regions and so forth. To predict H-bonds, salt bridges and hydrophobic interactions, one can use software programs such as Visual Molecular Dynamics (VMD) 195 and PyMol 196, which provide several built-in tools to do so and are easily scriptable, making these characteristics easy to determine at a high-throughput level. SASA can also be calculated using VMD, but some servers are able to provide more complete results.


bioCOmplexes COntact Maps (COCOMAPS) 197, for example, calculates several interface features, such as total SASA, per-residue SASA, total interface polar and apolar area, as well as H-bonds and the number of nearby atoms, making it a relatively comprehensive and quick method for interface characterization. HS, defined as residues which, upon alanine mutation, cause a change of at least 2 kcal/mol in the binding free energy difference (ΔΔG ≥ 2 kcal/mol), represent residues which are highly important for the protein-protein interface 198, while null-spots (NS) are all interfacial residues which are not HS. As such, to understand which protein-protein interface regions or residues are important without resorting to experimental methods, computational tools can be used. SpotOn 199, a tool developed during this thesis project which will be described shortly and which currently represents the highest sensitivity and accuracy for HS prediction, makes use of several features, such as those previously described for the characterization of the protein-protein interface, to predict which residues should be considered HS or NS 200. As for hot-regions, characterized as HS clusters (HS are not randomly distributed across the protein-protein interface but rather clustered 201), HotRegion 202 can be used, which relies on HotPoint 203 to predict HS across all PDB entries.

Docking is the process of finding the best possible conformation for two interacting proteins. Typically, it starts by finding the best possible orientation through rigid docking, in which both monomers remain unaltered as several orientations are sampled. The structure is then refined by semi-flexible or flexible docking, which allows some coordinate fluctuations in one or both proteins, respectively 204. The key aspects that typically differentiate docking algorithms are how they perform the rigid docking (search phase) and how they rank each of the complex structures (scoring phase). As for other structural tasks, several computational methods have been developed. For example, ZDOCK 205 is a docking algorithm with a search phase based on the fast Fourier transform and a scoring phase based on shape complementarity, electrostatics and statistical potential terms, while High Ambiguity Driven protein-protein DOCKing (HADDOCK) 206 is a docking algorithm whose main feature is the integration of information on interfacial residues, greatly reducing the search phase and improving the scoring phase, rendering better, more realistic complexes.

While some of the methods described (especially those that use prior information on the interface, such as interface characterization methods and HADDOCK 206) work on MPs without the need for any major adaptations, interface prediction and standard docking are much more challenging endeavours in MP study. As such, the following subsection will focus on methods developed specifically for this purpose.


v) Computational methods for membrane protein-protein partner interactions

As with MP structure prediction, MP interface prediction typically requires tools different from those for cytosolic proteins. Methods for MP-MP complexes and MP-cytosolic protein complexes will be considered in the following paragraphs.

Concerning MP-MP complexes, some algorithms have been developed for this purpose, such as SVMs 207 and RFs 208 that take an MP structure as input, use residue type distribution and evolutionary conservation as features, and apply residue averaging (areas with several residues classified as interfacial are more likely to be predicted as interfacial). The number of contacts can also be decisive in identifying protein-protein interfaces (PPI): by knowing how many contacts exist in a protein-protein interface, excluding regions incorrectly classified as interfacial becomes much easier. TMH-Expo 209 is a method developed for intramonomer contact prediction in multi-spanning helical MPs using ANNs. Even though this method is used for intramonomer contacts, an adaptation of this algorithm for MP interfaces could be of great use in MP complex prediction.

Concerning MP-cytosolic protein interactions, ProMate 210 is an interesting example of a structure-based method, which uses several features such as secondary structure, length of non-secondary-structure protein regions and pairwise amino acid residue distributions to calculate an interface propensity value for each residue. Part of the development of ProMate involved the elimination of redundant or highly correlated features, which reduces computation and search space. Protein-Protein Interaction Prediction Platform (PPIPP) 211 is a good example of a sequence-based method, using propensity scores based on the presence of a given residue compared to any other residue at the interface. To overcome the lack of partner information, the model was trained by comparing residues in intermolecular protein-protein interfaces with intra-protein contacts. PPIPP is built on 24 ANNs and returns the average score as the final score, using PSSMs as one of its main features. PAIRpred 212 is a hybrid approach, using both sequence- and structure-based features: the structure-based features consist of relative SASA, residue depth, half-sphere amino acid composition and a protrusion index, while the sequence-based features are based on PSSMs and predicted relative accessible surface area. All these are combined through an SVM to predict protein-protein interactions.

Docking in MPs has also been improved in recent years. Memdock 213 (specific for α-helical MPs) and docking tools in RosettaMP 179 take into consideration the lipid bilayer environment for both search and scoring phases. The key aspect of these algorithms is how they reduce the search space by considering that MP complexes are structurally confined and do not have the same

conformational flexibility as cytosolic protein complexes. These assumptions are key when dealing with MP complex structure prediction, as they allow researchers to overcome the lack of experimental knowledge on the subject by using trustworthy assumptions that eliminate several incorrect conformations. Scoring functions are also adapted to these structural restrictions: while exposed hydrophobic surface residues/regions might be favoured in membrane-inserted regions, they are considered detrimental in solvent-facing protein regions, and hydrophilic residues/regions in membrane-inserted regions are more likely to be considered part of interfaces in MP complexes.

This subsection should alert the reader to a particularity: computational methods become scarcer the more specific the MP problem becomes. However, some MP groups are highly studied due to their relevance, and specific methods for these groups have become widely available. One such example is GPCRs, for which several computational methods have been developed. Some of these methods are reviewed in the following subsection, highlighting GPCRdb 214, one of the most ambitious bioinformatics efforts to date.

vi) Computational methods for GPCRs

GPCRs have been extensively considered from a computational perspective, leading to one of the most well-known online interfaces for computational biology, GPCRdb 214. Some of the most relevant and tailor-made tools are reviewed in this subsection.

Alignments are one of the key tools in studying GPCRs, particularly due to the Ballesteros-Weinstein (BW) nomenclature 215, a residue identification system based on the most conserved residue of each TMH. This makes comparisons across several different GPCRs much easier, as common and conserved residues are much more easily identified when analysing large amounts of data. GPCRdb provides a platform for automatic residue numbering, including the BW nomenclature. Another key aspect when studying GPCRs through computational methods is, in the absence of a resolved 3D structure, to find a template sufficiently adequate to accurately model the receptor. GPCRdb provides tools for high-homology template selection, as well as a database of known homology models to aid in this task. In addition, tools such as GPCR-ModSim 216 have been developed. GPCR-ModSim models GPCRs in active/partially active/inactive states using homology modelling and includes an MD refinement step with the membrane-inserted GPCR.

Concerning GPCR partners, GPCRdb provides several tools to study GPCR-G-protein coupling and G-protein alignments.

Phylogenetic maps are also provided to help the user understand how evolution drove this sort of interaction. Another interesting tool available on the website is the Interface Mapping tool, which uses the resolved structure of the β2-adrenergic receptor in complex with a Gs-protein (PDBID: 3SN6 217) to infer the GPCR-Gs-protein interaction site for all G-proteins. This approach, however, is not reliable: using a single template to derive interacting residues across several different G-proteins is not robust, as key differences will probably arise and explain how differently the Gs-protein interacts with different GPCRs. Furthermore, it leaves other GPCR-G-protein interactions unexplained.

While computational tools for cytosolic protein and MP structure prediction have received wide attention, studying protein interfaces is still challenging. As such, this project focused on some key aspects of the computational study of protein interfaces, namely: the development of SpotOn 199, a method for the prediction of HS, which are thought to be essential for interfacial characterization; the construction of SpotOnDB, a comprehensive overview of PPIs concerning structural and evolutionary features using the PPI4DOCK non-redundant complex dataset 218; the use of ML to predict interfacial residues in monomer structures; and the use of dopamine receptor-partner interactions as a case study to understand how receptor dynamics affects partner differential binding. By doing so, this project covers method development, the use of high-throughput methods to make sense of large amounts of data, and the focused analysis of a set of complexes between similar proteins to understand how differential binding is affected by apparently small nuances.


B. Methodology

1) SpotOn – prediction of Hot-Spots from complex structural information

During this thesis I participated in the development of SpotOn 199, a method for HS prediction from the 3D structure of a complex. It relied on a collaboration between researchers from the Center for Neuroscience and Cell Biology of the University of Coimbra, the Bijvoet Center for Biomolecular Research of Utrecht University, the Centro de Ciências e Tecnologias Nucleares of the University of Lisbon, the Department of Genetics and Genomics and the Icahn Institute for Genomics and Multiscale Biology of the Icahn School of Medicine at Mount Sinai, and the Centro de Matemática da Universidade do Porto.

As described previously, all ML-based methods comprise some essential steps (a minimal data-splitting sketch is shown after the list below):

i. Compiling all known cases into a comprehensive dataset – in this case, all known complexes with experimentally determined structures and experimental information on HS were gathered in a non-redundant database;

ii. Training the prediction model – this step should use a fraction of the dataset to determine the best parameters of an ML algorithm, enabling it to perform at its best when predicting novel cases;

iii. Testing the prediction model – after training, an independent test set should be used to confirm the performance of the model with its best parameters.
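As an illustration of the splitting step, the sketch below uses caret's createDataPartition to produce a stratified 70/30 train/test split, mirroring the proportions used later in this chapter. The data frame and column names (hs_data, label) are hypothetical.

```r
library(caret)

# Hypothetical data frame: one row per mutated residue, 'label' is a factor
# with levels "HS" and "NS"; the remaining columns are features.
# hs_data <- read.csv("spoton_features.csv")

set.seed(42)
train_idx <- createDataPartition(hs_data$label, p = 0.70, list = FALSE)
train_set <- hs_data[train_idx, ]   # 70% of observations, class-stratified
test_set  <- hs_data[-train_idx, ]  # held-out 30% for final assessment
```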

i) Dataset construction

By combining the Alanine Scanning Energetics database (ASEdb) 219, the Binding Interface Database (BID) 220, the Protein-protein Interactions Thermodynamic database (PINT) 221 and the Structural database of Kinetics and Energetics of Mutant Protein Interactions (SKEMPI) 222 into a non-redundant dataset of mutations, 534 mutations were compiled across 53 different non-redundant complexes. These databases gather information on ΔΔG values, and residues were considered HS if, upon alanine mutation, ΔΔG ≥ 2.0 kcal/mol. 3D structures were retrieved from the PDB 223. To ensure maximum variability across all complex interfaces, all sequences were filtered to ensure at most 35% sequence homology for each interface. Hydrogen atoms were added by an in-house VMD 195 script and only protein atoms were considered.


ii) Structural/sequence features

Concerning structural features, ten SASA-related features were calculated from the unbound, bound and standard SASA values (monSASAi, compSASAi and resSASAi, respectively), as described in Equations 1-10; monSASAi and compSASAi were also used directly as features. Twenty features corresponding to interfacial residue counts were also added, one per amino acid residue type. Intermolecular atomic contacts within 2.5 Å and 4.0 Å, and the number of intermolecular hydrophobic interactions, were also used as features. These were calculated using in-house VMD 195 scripts, which are incorporated in the SpotOn pipeline (a short computational sketch follows Equations 1-10).

ΔSASAi = |compSASAi − monSASAi|   (1)

relSASAi = ΔSASAi / monSASAi   (2)

comp/resSASAi = compSASAi / resSASAr   (3)

mon/resSASAi = monSASAi / resSASAr   (4)

Δ/resSASAi = ΔSASAi / resSASAr   (5)

rel/resSASAi = relSASAi / resSASAr   (6)

comp/aveSASAi = compSASAi / aveSASAr   (7)

mon/aveSASAi = monSASAi / aveSASAr   (8)

Δ/aveSASAi = ΔSASAi / aveSASAr   (9)

rel/aveSASAi = relSASAi / aveSASAr   (10)
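A minimal base-R sketch of these ratios is given below, assuming per-residue vectors of monomer, complex, standard and average SASA values are already available (aveSASAr is taken here to be the average SASA of the residue type, which is an assumption, since the text does not define it explicitly).

```r
# Sketch: derive the ten SASA ratio features of Equations 1-10 from raw values.
# monSASA, compSASA: per-residue SASA in the unbound monomer and in the complex;
# resSASA, aveSASA: standard and (assumed) average SASA of the residue type.
sasa_features <- function(monSASA, compSASA, resSASA, aveSASA) {
  dSASA   <- abs(compSASA - monSASA)   # Eq. 1
  relSASA <- dSASA / monSASA           # Eq. 2
  data.frame(
    dSASA, relSASA,
    comp_res = compSASA / resSASA,     # Eq. 3
    mon_res  = monSASA  / resSASA,     # Eq. 4
    d_res    = dSASA    / resSASA,     # Eq. 5
    rel_res  = relSASA  / resSASA,     # Eq. 6
    comp_ave = compSASA / aveSASA,     # Eq. 7
    mon_ave  = monSASA  / aveSASA,     # Eq. 8
    d_ave    = dSASA    / aveSASA,     # Eq. 9
    rel_ave  = relSASA  / aveSASA      # Eq. 10
  )
}
```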


As for sequence features, PSSMs and the corresponding weighted observed percentages were computed using BLAST 118,224, providing forty additional features. Sequence-related features were extended to include the 805 features extracted with the protr 225 package for R:

i. Amino Acid Composition (AAC) of the protein (fraction of each amino acid type within the protein);

ii. Pseudo Amino Acid Composition (PAAC) 226 (extends the standard 20 amino acid description with information on residue motifs);

iii. Amphiphilic PAAC (the twenty original amino acid descriptors plus descriptors regarding the hydrophobicity/hydrophilicity of the sequences, which have often displayed positive effects in protein-protein interaction prediction algorithms);

iv. BLOcks SUbstitution Matrix (BLOSUM) (evolutionary features in the form of a scoring matrix of sequence alignments, taking into account amino acid substitutions at a 62% level of similarity);

v. Protein Fingerprinting (the identification and differentiation of proteins by unique characteristics using the amino acid index and PCA, which will be described ahead);

vi. ProteoChemometric Modelling (PCM) 227 (PCA of 2D and 3D descriptors to describe protein dynamics and ligand interaction).

PAAC is highly informative, as it does not include residue composition alone, but also long-range correlations of the physicochemical properties between residues, with valuable results in protein classification tasks 228-232. A final number of 881 features was therefore calculated for all 534 observations, composed of 127 HS and 407 NS. Fifty-five features are based on the amino acid residue itself, while the remaining are based on the whole protein.
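As an illustration, the sketch below computes a few of these protein-level descriptors with protr for a single sequence. It covers only AAC, PAAC and Amphiphilic PAAC (via what I take to be the extractAAC, extractPAAC and extractAPAAC functions), not the full 805-feature set, and the example sequence is hypothetical.

```r
library(protr)

# Hypothetical example sequence; protr expects a plain one-letter string.
seq <- "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ"

aac   <- extractAAC(seq)     # 20 amino acid composition features
paac  <- extractPAAC(seq)    # pseudo amino acid composition
apaac <- extractAPAAC(seq)   # amphiphilic pseudo amino acid composition

protein_features <- c(aac, paac, apaac)
length(protein_features)     # a subset of the 805 protr-derived features
```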

As can be observed, this dataset has more features than observations and more negative cases (NS) than positive cases (HS): it is an unbalanced dataset, since there is more information on one class than on the other. As such, overfitting becomes a problem. Overfitting happens when the trained model is only able to explain the data used to train it. With too many features characterizing each observation, or too many observations of one class, overfitting becomes much more likely. The following subsection explains how these problems were dealt with, along with some general considerations on ML.

iii) Machine-learning techniques

The R programming language 233 was used to perform ML and most statistical analysis, together with the Classification And Regression Training (caret) 234 package, which provides an elegant

and high-throughput way of doing machine-learning. The dataset was randomly split into training (70% (374) of all observations) and testing (30% (160) of all observations) sets. An equal proportion of positive/negative cases was maintained across both subsets. 51 different algorithms from the caret package were tested: Boruta, C5.0, C5.0Rules, C5.0Tree, LogitBoost, ORFlog, ORFpls, ORFridge, ORFsvm, RRF, RRFglobal, ada, adaboost, amdai, avNNet, bagEarth, bagEarthGCV, bagFDA, bagFDAGCV, ctree, ctree2, dwdPoly, dwdRadial, evtree, fda, gamboost, glm, glmboost, hdda, knn, lda, lda2, loclda, multinom, nb, pda, plr, qda, ranger, rda, rf, stepLDA, stepQDA, svmLinear, svmLinear2, svmPoly, svmRadial, svmRadialCost, svmRadialSigma, svmRadialWeights and wsrf.

Overfitting can be mitigated using several techniques. Down- and up-sampling were both used: in down-sampling, a random subset of each class in the training set is generated so that each class size matches that of the least prevalent class, while in up-sampling, random sampling of the minor class (HS) with replacement is performed so that its size matches that of the major class (NS). Both down- and up-sampling help to deal with overfitting by equalizing the number of positive and negative cases. To further prevent overfitting, 10-fold cross-validation repeated 10 times was used. This technique repeatedly splits the training subset into training (90%) and validation (10%) folds; the latter is used to assess how the model is performing in each iteration. By using several different validation subsets, the model is prevented from fitting any particular subset of the data.
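A minimal caret sketch of this resampling and up-sampling scheme is shown below; the choice of rf as the algorithm is purely illustrative, and train_set/label are the hypothetical objects from the splitting sketch above.

```r
library(caret)

# Repeated 10-fold cross-validation with up-sampling of the minority class (HS),
# optimizing the area under the ROC curve; centering/scaling mirrors Equation 11.
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
                     sampling = "up", classProbs = TRUE,
                     summaryFunction = twoClassSummary)

set.seed(42)
fit_rf <- train(label ~ ., data = train_set,
                method = "rf", metric = "ROC",
                preProcess = c("center", "scale"),
                trControl = ctrl)
```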

A dimensionality reduction technique, PCA, was also used to prevent overfitting. PCA determines the first principal component (PC) as the data projection with the largest variance. It then determines the following PCs by considering, among the projections orthogonal to the already calculated PCs, those with the greatest variance. By doing so, it is able to describe the data using a greatly reduced number of dimensions 235. Apart from PCA, all data were scaled according to Equation 11, in which N is the normalized vector (feature), V is the original vector, and m and σ are the vector's mean and standard deviation (sd), respectively. This allows each feature to be treated as having a similar distribution, with mean = 0 and sd = 1.

N = (V − m) / σ   (11)

Considering the previously described data treatments, six pre-processing conditions were tested to explore the best possible results:

a. Scaled – normal scaling of the variables;


b. ScaledDown – normal scaling of the variables with down-sampling of the negative class;

c. ScaledUp – normal scaling of the variables with up-sampling of the positive class;

d. PCA – normal scaling of the variables and PCA of the dataset;

e. PCADown – normal scaling of the variables and PCA of the dataset with down-sampling of the negative class;

f. PCAUp – normal scaling of the variables and PCA of the dataset with up-sampling of the positive class.

The best pre-processing condition, according to the average Area Under the Receiver Operating characteristic Curve (AUROC) and sensitivity (explained shortly), was chosen to further develop the method. Validity and performance in machine-learning were assessed using several metrics, namely the AUROC, Accuracy (Equation 12), Sensitivity, Specificity, Positive Predictive Value (PPV), Negative Predictive Value (NPV), False Discovery Rate (FDR), False Negative Rate (FNR), F1-score and Matthews Correlation Coefficient (MCC). Except for the AUROC, all values are calculated from the model's predictions, considering four distinct cases: true positives (TP, the model's prediction for the positive class is correct), true negatives (TN, the model's prediction for the negative class is correct), false positives (FP, the model predicts an observation as positive when it is in fact negative) and false negatives (FN, the model predicts an observation as negative when it is in fact positive). These metrics are described in Equations 12-20 and, just as the AUROC, were calculated using R. The AUROC uses a plot with 1 − Specificity on the x-axis and Sensitivity on the y-axis. Three points are considered, (0,0), (1 − Specificity, Sensitivity) and (1,1), and the area underneath this "curve" is calculated, thus rendering a metric that considers both Sensitivity and Specificity. A short sketch computing these metrics follows Equations 12-20.

Accuracy = (TP + TN) / (TP + FP + FN + TN)   (12)

Sensitivity = TP / (TP + FN)   (13)

Specificity = TN / (TN + FP)   (14)

PPV = TP / (TP + FP)   (15)

NPV = TN / (TN + FN)   (16)

FDR = FP / (FP + TP) = 1 − PPV   (17)

FNR = FN / (FN + TP) = 1 − Sensitivity   (18)

F1-score = 2TP / (2TP + FP + FN)   (19)

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))   (20)
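The sketch below computes these metrics in base R from vectors of predicted and observed classes; it is a straightforward transcription of Equations 12-20, with hypothetical input vectors.

```r
# Compute the classification metrics of Equations 12-20 from two factors
# (predicted and observed), with "HS" treated as the positive class.
classification_metrics <- function(pred, obs, positive = "HS") {
  tp <- sum(pred == positive & obs == positive)
  tn <- sum(pred != positive & obs != positive)
  fp <- sum(pred == positive & obs != positive)
  fn <- sum(pred != positive & obs == positive)
  c(accuracy    = (tp + tn) / (tp + fp + fn + tn),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    ppv         = tp / (tp + fp),
    npv         = tn / (tn + fn),
    fdr         = fp / (fp + tp),
    fnr         = fn / (fn + tp),
    f1          = 2 * tp / (2 * tp + fp + fn),
    mcc         = (tp * tn - fp * fn) /
                  sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
}
```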

The final algorithm used an ensemble of the 5 best-performing methods to further reduce overfitting and improve performance, combining the predictions of several different techniques 236 through a logistic regression function. To do so, the algorithms were clustered into 5 groups according to caret's tags 234; a detailed explanation can be found in Moreira et al. 199. A logistic regression is handy when it comes to probabilities: its output approaches 0 or 1 as its input approaches negative or positive infinity, respectively, with a smooth transition between 0 and 1. The method is publicly available at http://milou.science.uu.nl/cgi/services/SPOTON/spoton/ 199.
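A minimal sketch of such a stacking step is shown below: the class probabilities produced by the individual models are used as predictors in a logistic regression. The data frame stack_df and its columns (rf, svmPoly, pda, label) are hypothetical placeholders, not the exact objects used in SpotOn.

```r
# Stack individual model outputs with a logistic regression (hypothetical data).
# stack_df holds, for each training observation, the HS probability predicted
# by each base model plus the true label (factor with levels "NS" and "HS").
# stack_df <- data.frame(rf = ..., svmPoly = ..., pda = ..., label = ...)

ensemble <- glm(label ~ rf + svmPoly + pda, family = binomial, data = stack_df)

# Predicted HS probability for new observations; 0.5 used as an example cut-off.
p_hs <- predict(ensemble, newdata = stack_df, type = "response")
pred <- ifelse(p_hs >= 0.5, "HS", "NS")
```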

2) SpotOnDB – understanding protein-protein interfaces through big data approaches

SpotOnDB made extensive use of the SpotOn pipeline, with some additional features being calculated with bioinformatics and programming tools. It aimed to further characterize and understand PPIs using high-throughput data analysis of the PPI4DOCK 218 non-redundant dataset.

i) Protein-protein Dataset

Our dataset was composed of 1403 protein-protein complexes retrieved from the PPI4DOCK database, totalling 66,710 interfacial residues 218. Even though 1417 PDB entries are present in the original database, the remaining 14 complexes do not have sufficient homologs to

calculate PSSM values, an important feature of the SpotOn classifier. The complexes are non-redundant (at the 70% sequence identity level) heterodimeric complexes with resolutions lower than 3.5 Å, with a biological assembly containing no more than 20 chains, classified as a biological entity, and with an interface size greater than 300 Å2.

ii) Hot-spot classification

SpotOn 199 is based on the previously described machine-learning ensemble, which uses the 3D structure of a complex as input and determines several features for interfacial residues with the ultimate goal of classifying them as HS or NS. Apart from the features previously described for the SpotOn classification algorithm, some extra features were calculated using this pipeline, namely hydrogen bonds, salt bridges 237 and hydrophobic interactions 238,239.

iii) Protein-protein interfacial characterization

All calculated features were compared between residues classified as HS and NS. A few additional evolutionary- and structure-based features were calculated for a better characterization of PPIs. These were:

Enrichment factors: Enrichment factors or values (퐸. 퐹푎푐푡표푟) are an easy way of comparing the frequency of residues as HS using the average HS number on the interface. This feature (Equation 22) was first described by Bogan et al. 198:

E.Factor = (pHS/res) / (pres/inter)   (22)

E.Factor is defined as the ratio between the proportion or percentage of a particular residue type among HS (pHS/res) and the proportion or percentage of that residue type at the interface (pres/inter). Residue-wise HS values were also calculated according to Equation 23:

Residue-wise HS = (nHS/res) / (ninter/res)   (23)

nHS/res is the number of predicted HS for a given residue type and ninter/res is the number of interfacial occurrences of that same residue type. The HS distribution was defined as the percentage of residues acting as HS out of the total number of HS.
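The sketch below computes enrichment factors following Equation 22 from a hypothetical table of interfacial residues with their type and HS/NS classification.

```r
# Enrichment factor per residue type (Equation 22), from a hypothetical data
# frame 'interface_df' with columns 'res_type' and 'class' ("HS" or "NS").
enrichment_factors <- function(interface_df) {
  p_hs_res    <- prop.table(table(interface_df$res_type[interface_df$class == "HS"]))
  p_res_inter <- prop.table(table(interface_df$res_type))
  p_hs_res / p_res_inter[names(p_hs_res)]   # E.Factor for each residue type
}
```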

Propensities: Protein surface residues were considered as residues with SASA larger than 1 Å2 as calculated by the VMD software 195, core residues were considered as all non-surface residues

and interface residues were retrieved from the individual PDB files. Protein core, non-interface surface and interface residue propensities and normalized residue propensities were calculated as described in Yan et al. 240.

B-factors: Values for individual atoms were retrieved from all PDB files using a Python 2.7 script. Values were normalized for each complex following the protocol defined by Liu et al. 241, in which normalization is performed by considering individual PDB files instead of the entire dataset, as shown in Equation 24:

B^i_norm = (B^i − B̄) / (δ_B × 1.645)   (24)

where B^i is the B-factor value of atom i, and B̄ and δ_B are, respectively, the mean and sd of the B-factor values for the complex, rendering B^i_norm, the normalized B-factor value. Mean normalized B-factor values were also calculated for each interfacial residue.
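A base-R sketch of this per-complex normalization is given below; atom_df is a hypothetical data frame with one row per atom, a complex identifier and the raw B-factor.

```r
# Per-complex B-factor normalization (Equation 24); 'atom_df' is hypothetical,
# with columns 'complex_id' and 'bfactor'.
normalize_bfactors <- function(atom_df) {
  atom_df$b_norm <- ave(atom_df$bfactor, atom_df$complex_id, FUN = function(b) {
    (b - mean(b)) / (sd(b) * 1.645)   # normalize within each complex only
  })
  atom_df
}
```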

HS regions: These regions were defined by Keskin et al. 201,242 as clusters of tightly packed HS that come in contact with each other. VMD 195 tailor-made scripts were utilized to calculate neighboring residues within 8 Å and 10 Å of all interfacial residues. A posterior distinction of neighboring residues as intramolecular (neighboring residues within the same monomer) or intermolecular (neighboring residues present in the partner protein) Hot-regions was also made.

Evolutionary conservation: Residue sequence conservation was retrieved for all complexes using Consurf 124, which is based on the Rate4Site algorithm 125 for calculating position-specific conservation scores for each amino acid residue. Only interfacial residues were considered. The program selected for multiple sequence alignment was MAFFT 183,184, using BLAST 118 on the UNIREF90 database 243. For each protein sequence a maximum of 150 sequences was retrieved, with between 35% and 95% homology.

iv) Statistics, data visualization and webserver implementation

Statistics presented in this work, namely mean and sd values, as well as data visualization plots (ggplot2-based 244), were calculated and displayed using the R statistical package 233. Plotly 245 was used for the dynamic visualizations on the website, which was constructed using the shiny application package 246. It is freely available at http://milou.science.uu.nl/services/SPOTONDB/ or http://45.32.153.74/spotondb/. The complete dataset with all calculated features and residue classifications will be publicly available, with a summary description of the columns composing the dataset in Annex II.


3) Interface Prediction – using deep-learning to identify interfacial patches in protein surface

Using a pipeline similar to the one utilized for SpotOn (described above), predicting the interface revealed itself to be a more challenging endeavour. As such, more ambitious techniques and models were employed, namely Deep Learning (DL) through the h2o package, an R implementation of the h2o deep-learning framework 247, and the bio3d package 248 for the R programming language. The main advantage of using h2o is that it performs computations on its own cluster, rather than in R memory, leading to much faster model training. bio3d was used to calculate distances between residues in PDB files to identify interfacial patches.

i) Dataset construction and features

The same dataset used for SpotOn was utilized for this part of my thesis work, with some changes: i) concerning structural interactions, only H-bonds and salt bridges were calculated, and the total number of structural interactions at the interface and per residue were used as features in the construction of the final dataset, ii) only monomer-related SASA features were utilized

(monSASAi, mon/resSASAi and mon/aveSASAi), iii) evolutionary conservation, as calculated by Consurf 124, was used, and iv) all interfacial residues were considered. This resulted in a dataset with 15,508 residues and 907 features, each residue classified as interfacial or non-interfacial. Features were filtered for near-zero variability, resulting in no removed features; feature covariance was then calculated across all features and a threshold of 75% was chosen to remove highly correlated features, which led to a final count of 655 features.
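A caret-based sketch of this filtering step is shown below, assuming a numeric feature matrix X (a hypothetical name); nearZeroVar and findCorrelation implement, respectively, the near-zero-variance and correlation (0.75 cut-off) filters.

```r
library(caret)

# X: hypothetical numeric matrix/data frame of features (rows = residues).
nzv <- nearZeroVar(X)                                 # near-zero-variance columns
if (length(nzv) > 0) X <- X[, -nzv]

high_cor <- findCorrelation(cor(X), cutoff = 0.75)    # highly correlated columns
if (length(high_cor) > 0) X <- X[, -high_cor]

ncol(X)   # number of features kept (655 in the thesis dataset)
```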

ii) Model hyperparametrization, parametrization and training

Given that h2o provides a highly rich environment for DL models in terms of hyperparameters, a random discrete grid search was performed to determine the best possible hyperparameters.

Table 1 describes the hyperparameters tested in the random discrete grid search, which was performed with a training set (50% of the full dataset, Train) further divided into a training subset (60% of the training set) and a validation set (20% of the training set), trained for 5 epochs. The AUROC was used to select the best hyperparameters. The tested hyperparameters are

different activation functions (how each neuron determines its output from its inputs), different hidden layer compositions (the number of layers, i.e. the number of elements in each array, and the number of hidden neurons in each layer), different input dropout ratios (how often an input neuron should be dropped out), and different values for L1 and L2 regularization. Activation functions are an important discussion topic and each should be described:

• Rectifier functions are equal to 0 for any x < 0 and linear for any x > 0 249;

• Tanh functions are hyperbolic tangent functions, composed of two plateaus at -1 and 1 as x approaches negative and positive infinity, respectively, with a smooth gradient from -1 to 1 250;

• Maxout functions are composed of several linear models, and the output is the maximal value over all linear models for any given input 251;

• RectifierWithDropout, TanhWithDropout and MaxoutWithDropout functions are the same as the Rectifier, Tanh and Maxout functions, respectively, but neurons in the hidden layers are randomly eliminated. This prevents overfitting by stochastically removing neurons that might be fitting too closely to the data 249.

L1 and L2 regularization are used to reduce overfitting and are described according to Equations 24 and 25:

L1 = ||w||₁ = Σᵢ |wᵢ|   (24)

L2 = ||w||₂² = Σᵢ wᵢ²   (25)

where L1 and L2 are the L1 and L2 regularization terms and wᵢ is the weight of a single neuron. This restraint ensures that the weights will not grow to excessively large values, which can be a cause of overfitting 249. Furthermore, a stopping tolerance of 5×10^-2 was used, considering 2 stopping rounds (if the model being trained as part of the grid search did not show an improvement greater than 5×10^-2 in AUROC for 2 rounds, its training stops and the next one is trained), and the maximum squared sum of incoming weights per unit was set to 10. Concerning the search criteria, a maximum of 100 models was generated, with a maximum runtime of 360 seconds and a stopping tolerance of 10^-2 for 5 stopping rounds. A sketch of such a grid search is shown after Table 1.


Table 1 - Tested hyperparameters for the random discrete grid search for the deep-learning model.

Hyperparameter            Tested conditions
Activation functions      Rectifier, Tanh, Maxout, RectifierWithDropout, TanhWithDropout, MaxoutWithDropout
Hidden layer composition  [100,100], [200,200], [100,100,100], [200,200,100], [500,200,100], [200,200,200], [500,200,100], [500,500,100], [500,500,200,100]
Input dropout ratio       Values ranging from 0.01 to 0.1 with a step of 0.01
L1 regularization         Values ranging from 0 to 10^-3 with a step of 10^-6
L2 regularization         Values ranging from 0 to 10^-3 with a step of 10^-6
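The sketch below illustrates how such a random discrete grid search could be set up with the h2o R package; frame names, the predictor list and the grid id are hypothetical, and the hyperparameter values shown are only a small subset of Table 1.

```r
library(h2o)
h2o.init()

# Hypothetical frames already uploaded to the h2o cluster; 'label' is the class
# column and 'x' the predictor columns.
# train_h2o <- as.h2o(train_df); valid_h2o <- as.h2o(valid_df)
x <- setdiff(colnames(train_h2o), "label")

grid <- h2o.grid(
  algorithm = "deeplearning", grid_id = "dl_grid",
  x = x, y = "label",
  training_frame = train_h2o, validation_frame = valid_h2o,
  epochs = 5,
  hyper_params = list(
    activation = c("Rectifier", "Tanh", "MaxoutWithDropout"),
    hidden = list(c(100, 100), c(200, 200, 100), c(500, 500, 200, 100)),
    input_dropout_ratio = c(0.01, 0.05, 0.1),
    l1 = c(0, 1e-5, 1e-4), l2 = c(0, 1e-5, 1e-4)
  ),
  search_criteria = list(strategy = "RandomDiscrete", max_models = 100,
                         max_runtime_secs = 360,
                         stopping_metric = "AUC", stopping_rounds = 5,
                         stopping_tolerance = 1e-2)
)

# Models sorted by validation AUC; the top entry supplies the hyperparameters
# used for the final training runs.
h2o.getGrid("dl_grid", sort_by = "auc", decreasing = TRUE)
```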

Feature pre-processing consisted only of down- and up-sampling of the negative and positive classes, respectively, and of using the Synthetic Minority Over-sampling TEchnique (SMOTE), resulting in four datasets: Full, DownSample, UpSample and SMOTEd. SMOTE is a technique that generates synthetic entries by taking each underrepresented-class sample and introducing examples along the lines joining its k nearest underrepresented-class neighbours 252. SMOTE must be used with some caution in this case, as high dimensionality might be a problem for it 253. The data were scaled as part of the h2o DL function, which automatically scales the data upon user request. Deep-learning models were trained with the same data subset used for the random discrete grid search, for all four datasets, for 20 epochs, using 10-fold cross-validation with the optimal hyperparameters obtained from the random discrete grid search and considering the AUROC as the metric to evaluate performance across epochs. Testing was done using 25% of the full dataset (Test). The same metrics used for SpotOn were utilized to assess the models' performance.

iii) Interfacial patch prediction

The number of α-carbons belonging to neighbouring predicted interfacial residues was counted at different cut-off distances (5 Å, 8 Å, 10 Å, 15 Å, 20 Å and 25 Å) to test whether this feature is relevant for interface prediction. The rationale behind this process is that interfaces are usually composed of more than 13 residues, as the results from the SpotOnDB assessment showed. As such, the number of neighbouring predicted interfacial residues was used in an attempt to enhance interface prediction. To do so, distance matrices were calculated for all residues' α-carbons in all PDB files using the bio3d package 248 for the R programming language, and the results were filtered for surface/interfacial residues only. The total number of predicted interfacial residues at the 5 Å, 8 Å, 10 Å, 15 Å, 20 Å and 25 Å cut-offs and the interface prediction were then used as input to a logistic regression. The performance of the final pipeline (interfacial residue prediction and interfacial

patch prediction) was assessed using the same metrics as those used for the SpotOn method and the remaining 25% of the full dataset (Test2).
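A sketch of the α-carbon distance and neighbour-count computation is shown below, using bio3d to read a structure and base R to build the distance matrix; 3SN6 is used only as an example structure, and the cut-off handling is illustrative.

```r
library(bio3d)

# Read an example structure and keep only α-carbons.
pdb <- read.pdb("3sn6")
ca  <- atom.select(pdb, "calpha")
xyz <- matrix(pdb$xyz[ca$xyz], ncol = 3, byrow = TRUE)

# Pairwise Cα-Cα distance matrix, then neighbour counts per residue at several
# cut-offs (the diagonal, i.e. the residue itself, is excluded).
d <- as.matrix(dist(xyz))
cutoffs <- c(5, 8, 10, 15, 20, 25)
neighbour_counts <- sapply(cutoffs, function(r) rowSums(d <= r) - 1)
colnames(neighbour_counts) <- paste0("n_within_", cutoffs, "A")
```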

4) Beyond the interface – how complex structure dynamics and conformation affects the binding interface

Understanding GPCR-partner interactions is of major relevance for molecular biology. As such, dopamine receptors were used as a case study to better understand how differential binding happens in GPCRs. To do so, homology models were built for several dopamine receptors, G-proteins and arrestins, and several bioinformatics tools were used to characterize GPCR-partner complexes.

i) Homology modelling

Sequences for all proteins to be modelled were retrieved from the Universal Protein Resource (UniProt) database 254 and saved in the FASTA format. The modelled proteins were dopamine receptor 1 (D1R – P21728), dopamine receptor 2 (D2R – P14416), dopamine receptor 3 (D3R – P35462), dopamine receptor 4 (D4R – P21917) and dopamine receptor 5 (D5R – P21918), modelled from the β2-adrenergic receptor in PDBID: 3SN6 217, the G-proteins Gq (P50148), Gz (P19086), Gt2 (P50149), Gi1 (P63096), Gi2 (P04899), Gi3 (P08754),

Gs(sh) (P63092), Go (P04971), Gs(lo) (GI:20147687) and GoB (GI:20147683), modelled from the Gs in PDBID: 3SN6 217, and arrestins 2 (P49407) and 3 (P32121), modelled from the visual arrestin in PDBID: 4ZWJ 255. All modelling was performed using MODELLER 158. For every protein, 100 homology models were generated and the best one was selected considering MODELLER's molpdf and DOPE scores and, for the dopamine receptors, the distance between ICL3 and the intramembrane domain of the receptor.

ii) Structure refinement with HADDOCK

Dopamine receptors and their G-protein and arrestin partners were oriented using 3SN6 217 and 4ZWJ 255, respectively, using the alignment function in PyMol 196. The structures were then submitted to the Refinement Interface of the HADDOCK webserver 206. This interface refines the structure of a complex submitted by the user, with a focus on the interface. It simulates a thin water layer surrounding the protein, with a cut-off distance preventing water molecules from infiltrating the complex. To do so, it follows a three-step process: i) a heating phase, during which the structure has no positional restraints (the atoms are allowed to move freely), ii) a high-temperature phase, during which bond and improper torsion angle energy constants are reduced by a constant value

to ensure structural flexibility, and iii) a cooling phase, during which the constants are progressively returned to their original values. During these steps, MD simulations are responsible for structural alterations in non-bonded interactions, such as van der Waals, electrostatic and Lennard-Jones interactions.

iii) Interhelical distance

TM3-TM5 and TM5-TM6 interhelical distances were calculated with VMD 195, using the α-carbons of the residues listed in Table 2 (a minimal bio3d-based sketch of this calculation follows the table). These residues were chosen based on Kruse et al. 256 and correspond to positions 3.54, 5.62 and 6.37 in the Ballesteros-Weinstein nomenclature 215.

Table 2 - Residues utilized for the interhelical distance calculation in TM3, TM5 and TM6 for D1R, D2R, D3R, D4R and D5R.

        TM3      TM5      TM6
D1R     108ILE   201TYR   241LEU
D2R     108VAL   185TYR   225LEU
D3R     108VAL   188TYR   228VAL
D4R     108PRO   186PHE   226LEU
D5R     108ILE   215TYR   255LEU
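The sketch below shows how one such distance could be computed in R with bio3d, here for the D1R TM3-TM5 pair using the residue numbers of Table 2; the structure file name is a hypothetical placeholder for a refined receptor model.

```r
library(bio3d)

# Hypothetical refined D1R model written out by the HADDOCK refinement step.
pdb <- read.pdb("d1r_refined.pdb")

# Cα coordinates of the TM3 and TM5 reference residues (Table 2: 108 and 201).
ca_tm3 <- atom.select(pdb, "calpha", resno = 108)
ca_tm5 <- atom.select(pdb, "calpha", resno = 201)
xyz3 <- pdb$xyz[ca_tm3$xyz]
xyz5 <- pdb$xyz[ca_tm5$xyz]

# Euclidean TM3-TM5 distance in Å.
sqrt(sum((xyz3 - xyz5)^2))
```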

iv) Comparative normal mode analysis

Normal mode analysis (NMA) is a technique which aims to trace a protein's most relevant movements by treating the protein as a system of harmonic oscillators. This enables the calculation of the normal (orthogonal) modes of vibration, in which all the harmonic oscillators vibrate at the same frequency. Considering that the most relevant modes are usually the ones comprising the largest movements, the modes with the lowest vibrational frequencies are usually considered the most relevant 257.

Comparative NMA was performed with the bio3d package 248 for the R programming language. To do so, protein structures and sequences are both aligned using Multiple Sequence Comparison by Log-Expectation (MUSCLE), a program used to align protein sequences. Normal mode analysis is then carried out for all proteins and fluctuation values are registered for all residues. Fluctuation values correspond to the vibrational amplitude of each residue. Thanks to the protein sequence alignment, the output is highly informative, since it allows the user to compare how residue movements change for G-proteins, dopamine receptors and arrestins upon binding to different partners.
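A minimal bio3d sketch of such an ensemble (comparative) NMA is shown below; it assumes the MUSCLE executable is available on the system, and the listed PDB files are hypothetical refined complex structures.

```r
library(bio3d)

# Hypothetical refined structures of the same receptor bound to different partners.
files <- c("d2r_gi1.pdb", "d2r_go.pdb", "d2r_arrb2.pdb")

# Align the structures by sequence (requires MUSCLE), then run normal mode
# analysis on the whole aligned ensemble.
pdbs  <- pdbaln(files)
modes <- nma.pdbs(pdbs, fit = TRUE)

# Per-residue fluctuation profiles, directly comparable across the alignment.
head(modes$fluctuations)
plot(modes, pdbs = pdbs)
```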


v) Data visualization and webserver implementation

Data visualization plots made with ggplot2 244 were produced using the R programming language 233 and are presented at http://45.32.153.74/gpcr/, which was constructed using the shiny application package 246.


C. Results and Discussion

This section is organized according to the Methodology section. As such, it comprises results regarding the SpotOn prediction method, the global assessment of PPIs in SpotOnDB, the method for interface prediction using DL, and the computational case study on protein structural dynamics using dopamine receptors.

1) SpotOn – from high dimensionality to a successful method

i) ML Algorithms Clustering

The various trained machine-learning algorithms were subjected to hierarchical clustering using the Jaccard similarity coefficient as a metric, as presented in Moreira et al. 199 (a minimal clustering sketch is shown after Figure 4). The dendrogram depicted in Figure 4 allows us to distinguish 5 main algorithm clusters:

I) Cluster I (mostly RF-based models): avNNet, Boruta, ranger, rf, RRF, RRFglobal and wsrf;

II) Cluster II (mostly adaptive algorithms, bagging algorithms and decision trees/RFs): ada, adaboost, bagEarth, bagEarthGCV, bagFDA, bagFDAGCV, C5.0, C5.0Rules, C5.0Tree, ctree, ctree2, evtree, fda, gamboost, LogitBoost, ORFlog, ORFpls, ORFridge and ORFsvm;

III) Cluster III (mostly regression models): glmboost, multinom, glm and plr.

IV) Cluster IV (mostly SVMs and distance weighted algorithms): dwdPoly, dwdRadial, svmLinear, svmLinear2, svmPoly, svmRadial, svmRadialCost, svmRadialSigma and svmRadialWeights;

V) Cluster V (mostly discriminant analysis algorithms): amdai, hdda, knn, lda, lda2, loclda, nb, pda, qda, rda, stepLDA and stepQDA.


Figure 4 - Hierarchical clustering of the test machine-learning algorithms.
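The sketch below illustrates this kind of clustering in base R: binary HS/NS predictions from each algorithm are compared with a Jaccard-type (asymmetric binary) distance and clustered hierarchically. The prediction matrix pred_mat is a hypothetical placeholder (rows = algorithms, columns = residues, entries 1 for HS and 0 for NS).

```r
# pred_mat: hypothetical 0/1 matrix of predictions, one row per ML algorithm.
# dist(..., method = "binary") gives 1 minus the Jaccard similarity coefficient.
d  <- dist(pred_mat, method = "binary")
hc <- hclust(d, method = "complete")

plot(hc, main = "Hierarchical clustering of ML algorithms")
clusters <- cutree(hc, k = 5)   # five clusters, as in Figure 4
table(clusters)
```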

ii) ML algorithms Cluster Performance

The performance of the machine-learning algorithm clusters was assessed using Multivariate Analysis of Variance (MANOVA), as described in Moreira et al. 199. Table 3 summarizes the performance on the independent test set by presenting the mean values of each metric for the best classifier of each cluster under the different pre-processing conditions. From the various pre-processed datasets described above, ScaledUp (the dataset generated upon centering and scaling of the variables and up-sampling of the minor class) was subsequently used, since it yielded the best performance metrics, specifically the best mean AUROC and TPR (Sensitivity) in the training set.


Table 3 - Statistical metrics mean values attained for each cluster for all pre-processing conditions for both training set (Train) and testing set (Test).

                 PCA              Scaled
              Train    Test     Train    Test
AUROC         0.79     0.67     0.80     0.77
Accuracy      0.89     0.78     0.90     0.81
Sensitivity   0.60     0.31     0.67     0.40
Specificity   0.98     0.92     0.97     0.94
PPV           0.87     0.53     0.88     0.67
NPV           0.89     0.81     0.91     0.83
F1-score      0.67     0.38     0.75     0.49
MCC           0.68     0.29     0.71     0.42

                 PCAUp            ScaledUp
              Train    Test     Train    Test
AUROC         0.93     0.80     0.94     0.83
Accuracy      0.93     0.79     0.97     0.79
Sensitivity   0.95     0.55     0.98     0.48
Specificity   0.93     0.86     0.96     0.88
PPV           0.93     0.57     0.96     0.57
NPV           0.94     0.87     0.98     0.85
F1-score      0.94     0.55     0.97     0.52
MCC           0.83     0.41     0.91     0.38

                 PCADown          ScaledDown
              Train    Test     Train    Test
AUROC         0.79     0.70     0.81     0.74
Accuracy      0.91     0.75     0.90     0.76
Sensitivity   0.90     0.78     0.87     0.66
Specificity   0.92     0.74     0.93     0.80
PPV           0.92     0.48     0.92     0.51
NPV           0.91     0.92     0.89     0.88
F1-score      0.91     0.59     0.89     0.57
MCC           0.78     0.46     0.78     0.42

Ensembles of ML algorithms have been shown to be quite valuable in improving classification when constructing ML models 236. The best algorithms of each cluster for the ScaledUp pre-processing condition (ORFsvm, pda, rf, svmPoly and plr) were used as input to a logistic regression model. ORFsvm is an oblique RF (an RF composed of oblique decision trees, which differ from regular trees by taking linear combinations of features as input instead of a single feature); pda is a penalized discriminant analysis (a form of discriminant analysis adapted to high-dimensionality datasets); rf is a standard RF; svmPoly is a polynomial-kernel SVM (a form of SVM that represents the space with polynomials of the original variables instead of the variables themselves); and plr is a penalized logistic regression (a regular logistic regression with L1 and L2 regularization). A stepwise selection of relevant variables (algorithms) was performed, leading to the selection of rf, svmPoly and pda as the most relevant classifications for the logistic regression model. Training and testing metrics are provided in Table 4. The logistic regression leads to improved results in all metrics, for both the full (5-variable) and the rf + svmPoly + pda regression models. Even though both share practically identical metrics, the latter was chosen as the final model, since it offers the best possible predictions in the least time and in the simplest way when compared

43 with the Full Regression model. The logistic regression was trained using the full dataset as it comprehends the largest amount of nonredundant protein structures and interfaces with HS information.
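A minimal sketch of this kind of stacked model is shown below: out-of-fold class probabilities from a few base classifiers become the inputs of a logistic regression meta-model. The scikit-learn base learners are generic stand-ins for the models named above, and all data handling is hypothetical; the sketch illustrates the idea rather than reproducing the exact SpotOn pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def stack_with_logistic_regression(X, y, seed=42):
    """Combine base-classifier probabilities with a logistic regression meta-model."""
    base_models = {
        "rf": RandomForestClassifier(n_estimators=500, random_state=seed),
        "svmPoly": SVC(kernel="poly", degree=3, probability=True, random_state=seed),
    }
    # Out-of-fold probabilities avoid leaking training labels into the meta-model.
    meta_features = np.column_stack([
        cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
        for model in base_models.values()
    ])
    meta_model = LogisticRegression().fit(meta_features, y)
    fitted_bases = {name: model.fit(X, y) for name, model in base_models.items()}
    return fitted_bases, meta_model
```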

Table 4 - Statistical metrics for the best algorithm of each cluster and for their combined regression models, both the "Full Regression" and the stepwise-optimized model (rf + svmPoly + pda), for the training and testing sets.

Model                 Set     AUROC   Accuracy   Sensitivity   Specificity   PPV    NPV    FPR    FNR    F1
C5.0                  Train   0.88    0.88       0.78          0.98          0.98   0.81   0.22   0.02   0.86
C5.0                  Test    0.83    0.91       0.68          0.98          0.90   0.91   0.32   0.02   0.78
pda                   Train   0.85    0.85       0.86          0.84          0.84   0.85   0.14   0.16   0.85
pda                   Test    0.84    0.88       0.76          0.91          0.73   0.93   0.24   0.09   0.74
plr                   Train   0.83    0.83       0.82          0.85          0.84   0.82   0.18   0.15   0.83
plr                   Test    0.85    0.85       0.84          0.85          0.64   0.95   0.16   0.15   0.73
rf                    Train   0.93    0.93       0.87          0.98          0.98   0.89   0.13   0.02   0.92
rf                    Test    0.83    0.90       0.71          0.96          0.84   0.91   0.29   0.04   0.77
svmPoly               Train   0.89    0.89       0.80          0.98          0.97   0.83   0.20   0.02   0.88
svmPoly               Test    0.83    0.90       0.68          0.97          0.87   0.91   0.32   0.03   0.76
Full Regression       Train   0.91    0.94       0.98          0.84          0.95   0.91   0.02   0.16   0.96
Full Regression       Test    0.91    0.95       0.98          0.85          0.95   0.94   0.02   0.15   0.97
rf + svmPoly + pda    Train   0.91    0.94       0.98          0.84          0.95   0.91   0.02   0.16   0.96
rf + svmPoly + pda    Test    0.91    0.95       0.98          0.85          0.95   0.94   0.02   0.15   0.97

In order to further assess the quality of the method, SpotOn was compared with other methods commonly used to perform HS prediction, namely SBHD2 (SASA-Based Hot-spot Detection) 258 (a previous version of the algorithm considering only SASA-related features), Robetta 259, K-FADE and K-CON models (KFC2-A and KFC2-B) 260, and CPORT (Consensus Prediction Of interface Residues in Transient complexes) 192, even though the latter is not a proper HS predictor but rather an interface predictor. All predictions were collected by using the respective web-servers. The performance of all tested methods is summarized in Table 5. The full dataset was used for the comparison since it is the richest nonredundant database of proteins with resolved structure and information on HS. SpotOn clearly outperforms all other methods, with a strong performance in identifying both HS and NS.

Table 5 - Comparison of the performance of SpotOn with other common methods used for HS prediction for the full dataset.

Metric        SpotOn   SBHD2 261   Robetta 259   KFC2-A 260   KFC2-B   CPORT 192
AUROC         0.91     0.69        0.62          0.66         0.67     0.54
Sensitivity   0.98     0.70        0.29          0.53         0.28     0.54
Specificity   0.84     0.71        0.88          0.81         0.96     0.47
F1-score      0.96     0.62        0.39          0.56         0.42     0.42


2) SpotOnDB – a high-throughput method unravels new details on protein-protein interactions

SpotOnDB proved quite successful in furthering the understanding of PPIs, both by confirming previous high-throughput data and by refining information on HS that had been gathered in studies with small amounts of data, using SpotOn, the method and pipeline developed and described in the previous section. All data is available on a webserver, SpotOnDB, at 45.32.153.74/spotondb.

i) Protein-protein interface characterization

The PPIs in the dataset have between 13 and 156 residues, with SASA between 458 and 7229 Å². On average, the number of residues in the interface is 47.5 (sd = 24). The amino-acid residue propensities at PPIs were calculated from the relative frequency of the different residue types in PPIs. The HS and interfacial composition of the studied complexes are presented in the webserver under the tab "HS and Interface", as well as in Figure 7A. The PPI frequencies are in close agreement with previously published studies on heterodimeric 262,263 and antibody-antigen interfaces (138 in this dataset), as well as with larger studies of over 6,000 complexes 240, which are usually less hydrophobic than homodimers. The normalized propensities of individual residues in Figure 5 show that the residues most likely to be represented in the protein interface, when compared to the remainder of the protein, are valine, lysine, leucine and tyrosine. Concerning the actual presence of residues in the interface, Figure 7A shows that interfaces are enriched in glycine, leucine, serine, threonine, tyrosine and charged residues (particularly arginine), and depleted in tryptophan, cysteine and histidine.


Figure 5 - Residue normalized propensity for the core, non-interface surface and interface protein regions.

ii) Hot-spots Composition

To better understand whether the ML approach implemented in SpotOn correctly classifies interfacial residues as HS and NS, the algorithm's specificity, sensitivity and accuracy by amino-acid residue type were calculated and plotted in Figure 6. The performance of the method across all residue types is very high, with accuracy values ranging from 88% to 100%.

Figure 6 - Performance of the SpotOn algorithm by amino-acid type. Sensitivity, specificity and accuracy are reported, as well as the number of each amino-acid residue type in small boxes at the top. All values were calculated using the prediction outcomes from SpotOnDB: true positive (TP, correctly predicted HS), true negative (TN, correctly predicted NS), false positive (FP, NS predicted as HS) and false negative (FN, HS predicted as NS). The sensitivity was calculated by dividing TP cases by the total number of actual positives (Sensitivity = TP/(TP + FN)), the specificity by dividing TN cases by the total number of actual negatives (Specificity = TN/(TN + FP)), and the accuracy by dividing the total number of correct predictions by the total number of predictions (Accuracy = (TP + TN)/(TP + TN + FP + FN)).
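The per-residue-type values in Figure 6 follow directly from these counts. A minimal sketch of the computation, assuming a hypothetical table with one row per interfacial residue holding its type, the experimental label and the SpotOn prediction (column names are illustrative):

```python
import pandas as pd

def per_residue_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Sensitivity, specificity and accuracy per residue type.

    Expects the (hypothetical) columns 'residue_type', 'is_hs' (experimental
    label) and 'pred_hs' (SpotOn prediction), all encoded as 0/1.
    """
    rows = []
    for res, group in df.groupby("residue_type"):
        tp = ((group.is_hs == 1) & (group.pred_hs == 1)).sum()
        tn = ((group.is_hs == 0) & (group.pred_hs == 0)).sum()
        fp = ((group.is_hs == 0) & (group.pred_hs == 1)).sum()
        fn = ((group.is_hs == 1) & (group.pred_hs == 0)).sum()
        rows.append({
            "residue_type": res,
            "n": len(group),
            "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
            "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
            "accuracy": (tp + tn) / len(group),
        })
    return pd.DataFrame(rows)
```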

The average number of HS detected per complex was 2.6 (5.4% of all interfacial residues), versus 45.0 NS (93.6% of interfacial residues). Bogan et al. also reported 5.4% HS among the 2325 interfacial residues they analysed 198. However, their average number of HS per complex was 5.7 (125 HS in 22 complexes), almost double the value attained in this study. Supported by large-scale data (1403 complexes), HS thus appear to be less abundant in protein-protein interfaces than initially thought. Analysing the E.Factor values (Figure 7B and Table 6), tryptophan comes across as the residue most commonly acting as HS (E.Factor = 2.07), followed by tyrosine (E.Factor = 2.00). Previous studies on HS have reported both of these amino-acids among the most represented residues (tryptophan E.Factor calculated as 3.91 and tyrosine as 2.29 in a 22-complex study), which further validates these conclusions 198,264. However, those studies reported tryptophan as having a much larger E.Factor than found here (3.91 vs. 2.07, respectively). This can be due to the small amount of data on tryptophan presented by Bogan et al., with only 19 instances of alanine-scanning mutation results for tryptophan, 4 of which were HS (21.05%), as compared with, for example, 218 instances for arginine, of which 29 were HS (13.30%), and 122 for tyrosine, of which 15 were HS (12.30%) 198. Given the large discrepancy in the number of instances for each amino-acid residue, tryptophan may be overrepresented as a HS simply because of the limited experimental information on its role as HS. Underrepresented residues include cysteine and methionine, widely regarded as rarely occurring as HS 198,264. Authors tend to group residues with similar chemical properties to improve statistics, given the small number of experimentally characterized HS/NS residues, but this can lead to error: for example, HS are enriched in arginine but not in lysine (E.Factor = 1.38 vs. 1.00, respectively). These results are also listed under the "HS and interface composition" tab.


Figure 7 - HS and residues in the protein interface. A. Interface and HS composition by amino-acid residue type. B. HS enrichment by amino-acid residue type.


Table 6 - Interface and HS composition as well as HS enrichment for each amino-acid residue type. Interface (%) represents the percentage of each residue in the interface by dividing the number of each interfacial residue by the total number of interfacial residues; HS (%) represents the percentage of each residue acting as HS when compared to the total interface by dividing the number of each HS residue by the total number of interfacial residues; HS distribution (%) represents the percentage of each residue acting as HS when compared to the total number of HS in the interface by dividing the number of each HS residue by the total number of residues classified as HS; HS Enrichment represents how each residue is represented as a HS when compared to the average HS distribution as described in the Methods section; Residue-wise HS (%) represents the amount of each interfacial residue classified as a HS when compared to the presence of that residue in the interface.

Residue   Interface (%)   HS (%)   HS distribution (%)   HS Enrichment   Residue-wise HS (%)
Ala       4.81            -        -                     -               -
Arg       7.10            0.53     9.80                  1.38            10.67
Asn       5.18            0.26     4.82                  0.93            5.25
Asp       6.25            0.33     6.03                  0.97            6.57
Cys       1.74            0.04     0.74                  0.43            0.81
Gln       6.98            0.29     5.37                  1.15            5.85
Glu       4.66            0.33     6.14                  0.88            6.69
Gly       6.08            0.26     4.71                  0.77            5.13
His       2.76            0.18     3.36                  1.22            3.66
Ile       4.46            0.18     3.28                  0.73            3.57
Leu       7.47            0.39     6.94                  0.96            7.79
Lys       5.91            0.32     7.16                  1.00            6.45
Met       1.90            0.12     2.12                  1.12            2.31
Phe       4.15            0.31     5.64                  1.36            6.15
Pro       4.38            0.22     3.96                  0.91            4.32
Ser       7.09            0.29     5.26                  0.74            5.73
Thr       5.57            0.23     4.24                  0.76            4.62
Trp       2.20            0.25     4.57                  2.07            4.98
Tyr       6.04            0.66     12.06                 2.00            13.13
Val       5.28            0.26     4.84                  0.92            5.28
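The enrichment values in Table 6 are consistent with the ratio between each residue's share of the HS pool and its share of the interface (e.g. Arg: 9.80/7.10 ≈ 1.38; Trp: 4.57/2.20 ≈ 2.07). A minimal sketch under that assumption, starting from hypothetical per-residue counts:

```python
import pandas as pd

def hs_enrichment(counts: pd.DataFrame) -> pd.Series:
    """Enrichment of each residue type among hot-spots.

    'counts' is a hypothetical table indexed by residue type with columns
    'n_interface' (interfacial residues of that type) and 'n_hs' (those
    classified as HS).
    """
    interface_fraction = counts["n_interface"] / counts["n_interface"].sum()
    hs_fraction = counts["n_hs"] / counts["n_hs"].sum()
    return hs_fraction / interface_fraction  # > 1: overrepresented among HS
```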

iii) Hot-regions

Li et al. 265 evaluated, for 18 complexes, the composition of complemented pockets (not necessarily hot-regions) and concluded that tryptophan, glycine, proline, cysteine, tyrosine and glutamate were more likely to be conserved at these regions. Intramolecular and intermolecular packing around HS and NS was analysed using two different Cα-Cα distance cut-offs: 8 and 10 Å. With more than 60,000 interfacial residues classified as HS/NS in the database, there is no meaningful difference in intramolecular packing between the two classification groups: on average, 8.10 neighbouring residues are found for HS (sd = 0.53) versus 8.83 for NS (sd = 0.59), using an 8 Å cut-off distance (Table 7). However, the situation changes completely when analysing intermolecular packing. On average, 2.34 (sd = 0.58) and 1.46 (sd = 0.36) residues surround HS and NS, respectively. Table 7 also shows rather high relative sd values for the intermolecular neighbour counts (around 25%), illustrating the high variability that can be found in the size of hot-regions, which had not been reported by other authors in previous studies 201,266.
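A minimal sketch of this kind of neighbour counting, assuming Biopython is available; intramolecular and intermolecular neighbours are separated by whether they belong to the same chain as the query residue (file, chain and residue identifiers are placeholders):

```python
from Bio.PDB import PDBParser, NeighborSearch

def ca_neighbours(pdb_path, chain_id, res_id, cutoff=8.0):
    """Count intra- and inter-monomer Cα neighbours of a residue within 'cutoff' Å."""
    model = PDBParser(QUIET=True).get_structure("complex", pdb_path)[0]
    ca_atoms = [a for a in model.get_atoms() if a.get_id() == "CA"]
    query_ca = model[chain_id][res_id]["CA"]
    neighbours = [a for a in NeighborSearch(ca_atoms).search(query_ca.coord, cutoff)
                  if a is not query_ca]
    intra = sum(1 for a in neighbours
                if a.get_parent().get_parent().id == chain_id)
    return intra, len(neighbours) - intra  # (intramonomer, intermonomer)
```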

Further observation of these results is available in SpotOnDB under the "HS regions" tab. Cysteine, glycine, serine, threonine and proline were the residues with the highest number of neighbours when acting as HS. Excluding threonine, these are mostly small residues with short average radii (cysteine and glycine have two of the smallest average radii 267), which should explain their above-average number of neighbouring residues both as HS and NS.

Table 7 - Average number of neighbouring residues (Mean) and standard deviation (sd) for intramonomer and intermonomer HS and NS.

Neighbouring residues    Mean    sd
Intramonomer HS          8.10    0.53
Intramonomer NS          8.79    0.59
Intermonomer HS          2.34    0.58
Intermonomer NS          1.46    0.36

On average, when compared to NS, HS have nearly three times as many HS as neighbours (0.97 neighbouring HS for HS versus 0.33 for NS; sd = 1.19 and 0.69, respectively), as shown in Figure 8. Combined with the low prevalence of HS in the protein-protein interface, these findings are key in confirming the existence of the previously reported cooperative hot-regions 201.



Figure 8 - Hot-regions by residue type. The number of HS is plotted against the residue type, with average HS, NS and global (all interfacial residues) values represented by dashed lines. Normalization was done by dividing the average number of HS neighbours by the total number of neighbours.

iv) B-factors

The reported data on the mobility of HS vs NS comes from a study on conserved vs non-conserved residues by Erdemli et al. 268, based on MD simulations of 17 protein-protein complexes. These authors concluded, by analysing the Root-Mean-Square Deviation (RMSD) of the two groups of residues, that the conserved ones show lower mobility. While this result cannot be directly compared with ours, as HS classification is not directly linked to conservation, it can be used as a guideline for the current knowledge in the field. Here, normalized B-factors for all atoms were gathered from the crystal structures, and their averages over all atoms, over the backbone only or over the side-chains of the individual residues were calculated. The illustrative plots can be found on the SpotOnDB webserver, under the "B-factors" tab. No significant difference between HS and NS backbone B-factor values was found. There is, however, a slight decrease in HS side-chain B-factor values, within the sd. These results point to a similar mobility between HS and any other interfacial residue.
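The exact normalization scheme is not reproduced here; a common convention is a per-structure z-score of the raw crystallographic B-factors, averaged over all atoms, the backbone or the side-chain of each residue. A hedged Biopython sketch of that convention (all names are illustrative):

```python
import numpy as np
from Bio.PDB import PDBParser

BACKBONE = {"N", "CA", "C", "O"}

def normalized_bfactors(pdb_path):
    """Per-residue averages of z-score-normalized B-factors (all / backbone / side-chain)."""
    atoms = list(PDBParser(QUIET=True).get_structure("s", pdb_path).get_atoms())
    b = np.array([a.get_bfactor() for a in atoms])
    z = (b - b.mean()) / b.std()                      # per-structure z-score
    per_residue = {}
    for atom, value in zip(atoms, z):
        res = atom.get_parent()
        key = (res.get_parent().id, res.id[1])        # (chain, residue number)
        entry = per_residue.setdefault(key, {"all": [], "backbone": [], "sidechain": []})
        entry["all"].append(value)
        entry["backbone" if atom.get_id() in BACKBONE else "sidechain"].append(value)
    return {k: {part: float(np.mean(v)) for part, v in d.items() if v}
            for k, d in per_residue.items()}
```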

v) Evolutionary conservation

Evolutionary conservation has been considered an important feature when studying protein structure, as it explains the prevalence of some residues in key functional sites 269, namely in PPIs 268,270. Normalized conservation scores (Cons) were calculated for all interfacial residues. A positive value means that the residue is conserved at the PPI, whereas a negative value means it is not. A total of 135 monomers (6.5%) did not yield any conservation scores, due to the low availability of sequence homologues. Figure 9 shows that HS are more conserved than other residues (average ConsHS = 0.17, sd = 1.29; average ConsNS = -0.01, sd = 1.09). This suggests that HS occurrence in PPIs is evolution-driven, as previously suggested 264,265,271.

The most conserved residues in the interface are tyrosine, tryptophan and asparagine, while cysteine appears to be, by far, the least conserved residue as HS (ConsTyr = 0.33, sd = 1.42; ConsTrp = 0.12, sd = 1.54; ConsAsn = 0.10, sd = 1.20; ConsCys = -0.76, sd = 0.66). The roles of tryptophan and tyrosine have already been reported by other authors 201,271. Phenylalanine and arginine, on the other hand, have been reported as conserved by the same studies 201,271, whereas the data presented here show low conservation scores for both. Arginine conservation has already been challenged, as it was shown to be the second least common residue in key positions of ligand binding 272. Leucine seems to be particularly conserved among HS residues. These results are available in the SpotOnDB server under the "Evolutionary Conservation" tab.

Figure 9 - Average conservation by residue type. The conservation score from Consurf is plotted against the residue type with average HS, NS and global (all interfacial residues) values represented by dashed lines.

vi) Solvent-accessible surface area of interfacial residues

Solvent occlusion had already been demonstrated as a key aspect of HS by the O-ring theory established by Bogan et al. 268. Work by Moreira et al. 261,273-275 on a large number of complexes, also taking dynamics into account, supports this theory: SASA is considerably diminished upon complex formation for HS when compared with other interfacial residues.

According to these results, average comSASAi values for HS are nearly 7 times lower than those for NS.

Furthermore, relSASAi, which provides a way of directly comparing bound and unbound SASA values and of differentiating situations that lead to the same ΔSASAi but in which the residue is fully occluded upon complexation (see Martins et al. 261), is nearly 1.6 times larger in HS than in NS (Figure 10). Further statistics on SASA values can be visualized in the SpotOnDB webserver, under the "SASA" tab.
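The exact definitions of comSASAi and relSASAi follow Martins et al. 261 and are not reproduced here; one plausible reading, assumed in the sketch below, is that ΔSASAi is the SASA of the residue in the isolated monomer minus its SASA in the complex, with relSASAi being ΔSASAi relative to the monomer value. The sketch also assumes the freesasa Python bindings and placeholder file, chain and residue identifiers.

```python
import freesasa

def sasa_descriptors(complex_pdb, monomer_pdb, chain, resnum):
    """Per-residue SASA descriptors under the assumptions stated above."""
    def residue_sasa(pdb_path):
        result = freesasa.calc(freesasa.Structure(pdb_path))
        return result.residueAreas()[chain][str(resnum)].total

    com_sasa = residue_sasa(complex_pdb)    # bound form
    mon_sasa = residue_sasa(monomer_pdb)    # unbound, isolated monomer
    delta_sasa = mon_sasa - com_sasa        # area buried upon complexation
    rel_sasa = delta_sasa / mon_sasa if mon_sasa else 0.0
    return {"comSASA": com_sasa, "deltaSASA": delta_sasa, "relSASA": rel_sasa}
```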

Figure 10 - Relative solvent-accessible surface area values. A. relSASAi average values by interfacial residue. Dashed lines represent the average values for HS, NS and global (all interfacial residues).

vii) Structural interactions at protein-protein interfaces

The number of atoms within short distances (2.5 or 4 Å cut-offs) of HS appears to be higher than for NS, especially for positively charged and aromatic residues. Further information on atom abundance within both cut-offs can be found on the SpotOnDB web platform, under the "Structural Information" tab. Hydrophobic interactions also appear to be key elements in HS, which present on average 2.19 times more hydrophobic interactions than NS, confirming the findings of Liu et al. 241. While one study reports H-bonds and salt bridges as irrelevant in defining HS 201, the majority of previous studies reported polar residues as more important, either as HS or as highly conserved and important interfacial residues 271,276-279. Here, salt bridges in charged residues and H-bonds across all residues are, respectively, 1.29 and 1.84 times more common in HS.


These results contrast with the conclusion of Keskin et al., obtained with a less accurate classification algorithm, that electrostatics, although relevant for PPIs, is not as relevant for HS formation as H-bonds and hydrophobic interactions 201. To facilitate visualization and comparison of the data presented on H-bonds, salt bridges and hydrophobic interactions, Figure 11 shows the ratio of the average number of interactions between HS and NS.

Concerning the distribution of the number of structural interactions per interface area, these results fall close to previously determined values, with 0.63 H-bonds per 100 Å², which roughly matches the approximately 1 H-bond per 100-200 Å² reported by Jones & Thornton 277,278 and Janin 270. Additionally, a high correlation between interface area and the numbers of H-bonds and hydrophobic interactions (r = 0.65 and r = 0.66, respectively) was observed, in agreement with previous studies 237, in contrast to the low correlation between interface area and salt bridges or HS (r = 0.45 and r = 0.20, respectively). Furthermore, even though H-bonds and salt bridges are more common around HS, hydrophobic interactions appear to be 4.52 times more common when considering HS. For more information on this topic, please refer to the "Area distribution" tab in the SpotOnDB webserver.
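The interface-area correlations quoted above are plain Pearson coefficients; a minimal sketch, assuming per-complex arrays of interface areas and interaction counts (all names and numbers are illustrative):

```python
import numpy as np
from scipy.stats import pearsonr

def area_correlations(interface_area, counts_by_type):
    """Pearson r between interface area and each per-complex interaction count."""
    area = np.asarray(interface_area, dtype=float)
    return {name: pearsonr(area, np.asarray(values, dtype=float))[0]
            for name, values in counts_by_type.items()}

# Toy usage (illustrative numbers only):
# area_correlations([900, 1500, 2100],
#                   {"h_bonds": [5, 11, 14], "hydrophobic": [8, 15, 22]})
```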

Figure 11 - HS/NS ratio for H-bonds, hydrophobic interactions and salt bridges. Dashed lines represent the average values for H-bonds, hydrophobic interactions and salt bridges.

To further understand the contributions of the various structural interactions, van der Waals and electrostatic energies were retrieved from all HADDOCK 206 experiments run to date (roughly 27,000). Interestingly, it was found that minimizing the electrostatic energy is far more important than minimizing other types of energy in docking processes using HADDOCK (on average, the electrostatic energy was 3.91 times lower than the van der Waals energy and 16.79 times lower than the desolvation energy). Nonetheless, some complementarity might be useful in docking tools: using electrostatic interactions over other types of interactions to determine the best binding pose in rigid docking and then using sequence-based methods for the detection of important residues and hot-spots to drive flexible docking and interface refinement, in order to obtain a more realistic structure from computational methods.

3) Interface prediction – a tentative application of DL

i) Hyperparametrization and model training

The optimal hyperparameters obtained from the random discrete grid search are presented in Table 8. It is hard to determine which activation function is responsible for the best results. A 2013 study using back-propagation neural networks concluded that, while different activation functions rendered different results, the variations in error were small when comparing all of them 280. As such, it becomes difficult to state the reasons behind the better performance of a given activation function. The number of hidden layers and the number of neurons in each layer is a complicated matter as well: while some guidelines have been traced (the number of neurons in each hidden layer should be no more than that of the input layer (number of features) and no less than that of the output layer (number of classes), and the number of hidden layers should be determined according to the complexity of the problem 281), and there have been some efforts to determine the best number of hidden neurons for tasks such as image classification 282, no actual rule of thumb can be found. Even web forums such as Quora (https://www.quora.com), ResearchGate (https://www.researchgate.net/home) and stackoverflow (https://stackoverflow.com), which host strong communities of ML researchers, provide no definitive answer. This lack of coherent guidelines was one of the motivations to perform a grid search to determine the best possible hyperparameters. Input dropout is a fairly delicate matter: while it prevents the network from overfitting, excessive dropout might cause underfitting by removing too many features 283. As such, it is understandable that no low values were selected for the Input Dropout Ratio. Similarly, L1 and L2 regularization are also key in preventing overfitting, but overestimating them might lead to underfitting. Hence, it is understandable that fairly stable values are observed across all pre-processing conditions for these hyperparameters, with few approaching the stipulated minima or maxima of the random discrete grid search (the exceptions being the Input Dropout Ratio for DownSample and the L2 regularization for UpSample).
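The hyperparameter names in Table 8 match those of H2O's deep-learning implementation, so the sketch below assumes that framework; the file names, column names, candidate values and AUC-based ranking are illustrative choices and not necessarily the exact search space used here.

```python
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init()
train = h2o.import_file("train.csv")                 # hypothetical files
valid = h2o.import_file("valid.csv")
label = "is_interface"                               # hypothetical column name
features = [c for c in train.columns if c != label]
train[label] = train[label].asfactor()
valid[label] = valid[label].asfactor()

hyper_params = {
    "activation": ["Rectifier", "RectifierWithDropout", "Tanh",
                   "TanhWithDropout", "Maxout", "MaxoutWithDropout"],
    "hidden": [[100, 100], [200, 200, 100], [500, 500, 200, 100]],
    "input_dropout_ratio": [0.0, 0.02, 0.05, 0.08, 0.1],
    "l1": [0.0, 1e-5, 1e-4, 1e-3],
    "l2": [0.0, 1e-5, 1e-4, 1e-3],
}
search_criteria = {"strategy": "RandomDiscrete", "max_models": 50, "seed": 42}

grid = H2OGridSearch(model=H2ODeepLearningEstimator(epochs=50),
                     hyper_params=hyper_params,
                     search_criteria=search_criteria)
grid.train(x=features, y=label, training_frame=train, validation_frame=valid)
best_model = grid.get_grid(sort_by="auc", decreasing=True).models[0]
```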


Table 8 - Optimal hyperparameters according to the random discrete grid search for the deep-learning network.

                             Pre-processing conditions
Hyperparameter               Standard        DownSample         UpSample               SMOTEd
Activation function          Maxout          TanhWithDropout    RectifierWithDropout   Rectifier
Hidden layer composition     [200,200,100]   [500,500,200,100]  [500,500,200,100]      [100,100]
Input dropout ratio          0.05            0.09               0.06                   0.08
L1 regularization            7.40×10⁻⁴       7.23×10⁻⁴          3.40×10⁻⁵              1.71×10⁻⁴
L2 regularization            1.44×10⁻⁴       8.20×10⁻⁴          9.40×10⁻⁴              1.01×10⁻⁴

The network did not achieve the expected results, as can be seen in Table 9. While in some cases the accuracy is good, since the network correctly predicts a high number of interfacial and non-interfacial residues, its PPV is far too low, as many non-interfacial residues are predicted as interfacial. As such, the network is apparently overfitting in the lower-PPV cases (DownSample, UpSample and SMOTEd) and underfitting (unable to predict positive cases, i.e. interfacial residues) in the low-sensitivity case (Standard). It should also be considered that interfacial residues were determined from the known interface: the database has no knowledge of other possible interfaces in the protein. As such, FP (residues predicted as interfacial but represented as non-interfacial in the database) might in fact be TP, because the database does not contain all the information on interfacial residues. Nonetheless, even considering this aspect, there is no practical way of assessing how far this correction might go and, as such, while it can be considered a valid explanation for a few incorrectly predicted interfacial residues (assuming the method is reliable), this factor remains unaddressed here.


Table 9 - Statistical metrics values attained for all pre-processing conditions for both training set (Train) and testing set (Test).

Condition     Set     AUROC   Accuracy   Sensitivity   Specificity   PPV    NPV    F1-score   MCC
Standard      Train   0.60    0.93       0.53          0.94          0.22   0.98   0.31       0.31
Standard      Test    0.56    0.92       0.30          0.94          0.14   0.98   0.19       0.17
DownSample    Train   0.55    0.74       0.91          0.73          0.10   1.00   0.19       0.25
DownSample    Test    0.54    0.73       0.78          0.73          0.09   0.99   0.16       0.20
UpSample      Train   0.56    0.78       0.98          0.78          0.13   1.00   0.23       0.31
UpSample      Test    0.55    0.78       0.84          0.78          0.12   0.99   0.20       0.26
SMOTEd        Train   0.58    0.83       0.94          0.83          0.16   1.00   0.27       0.35
SMOTEd        Test    0.55    0.83       0.68          0.83          0.12   0.99   0.21       0.24

ii) Interfacial patch prediction

The rationale behind this methodology was to: i) predict interfacial residues and ii) confirm the predicted residues by assessing their neighbouring interfacial residues with a logistic regression. However, given that the first step failed to provide good results, using its predictions in the second and final step proved ineffective. In fact, the predictions from the logistic regression were worse than those coming from the DL network. As such, this approach, while supported by theory, has no practical value in this particular setting.

4) Structural dynamics as influential in understanding GPCR-partner differential binding

The method for interfacial analysis was developed by O. Sensoy, J. Shabbir, José G. Almeida (myself), Irina S. Moreira and G. Morra and has been submitted for publication 284. The calculations on the structural and evolutionary characterization of the GPCR-partner interface and the resulting data treatment were performed by my colleague António José Preto, while I focused mostly on complex structure dynamics, using NMA and interhelical distances.

i) Normal mode analysis of complex structure

Considering the GPCR-arrestin interaction, interface residues appear consistently at sites belonging to ICL1-3 (information retrieved by my colleague António José Preto). Information on this topic can be visualized on the server developed for this project (http://45.32.153.74/gpcr/), in the "Comparative NMA" section. Concerning the first two ICLs (ICL1 and ICL2), no major differences in residue fluctuation are observable. However, for ICL3, the binding of arrestin affects each dopamine receptor differently. Concerning differences in arrestin residue fluctuation when binding different dopamine receptors, no major fluctuations can be observed. Regarding most G-proteins, GPCRs do not present a high number of interface residues at ICL1, but do so at ICL2. However, when considering Gs, ICL1 features a higher number of interacting residues. Across all G-protein-GPCR interactions, ICL3 appears to be the region most affected in terms of protein motion. Regarding G-proteins, no major differences in interfacial residue fluctuation are observed across the different GPCR binding interactions. Considering dopamine receptors, only Gs displays relevant fluctuations at residues around position 50, which roughly corresponds to ICL1.

NMA results for DR-Arr complexes show similar motion for ICL1 and ICL2, as suggested by the structural images, while demonstrating higher fluctuation levels for ICL3, indicating an important role for this protein region, as demonstrated for other GPCRs 285,286. Concerning DR-G-protein interactions, ICL1 does not play a major role in the interface for most DR-G-protein complexes, excluding DR-Gs interactions, which agrees with the images in the "Comparative NMA" section of the webserver, where ICL1 is the loop furthest from the G-proteins. This might point towards an important role for this region in DR-Gs interactions. Furthermore, no major fluctuation differences are observed in G-proteins, while large ones are observed in the ICL3 region of all GPCRs. This might point towards a predominant role of the GPCRs (namely the ICL3 region) over the G-proteins in determining binding affinity. D4R, which was shown to interact more strongly with arrestins, shows lower fluctuations in ICL3 when compared to other dopamine receptors. Since lower fluctuations point towards reduced motion of a given protein region, these interactions might be stabilizing the DR-arrestin complexes. Following a similar logic, the opposite can be concluded for receptors with a highly mobile ICL3, for which this region confers lower stability when complexed with Arrs.
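The comparative NMA itself is not reproduced here; as a hedged illustration of how per-residue fluctuation profiles of this kind can be obtained, the sketch below uses ProDy's anisotropic network model on the Cα atoms of a complex (the file name and number of modes are arbitrary choices, not those used in this work).

```python
from prody import parsePDB, ANM, calcSqFlucts

def residue_fluctuations(pdb_path, n_modes=20):
    """Square fluctuations per residue from a coarse-grained normal mode analysis."""
    calphas = parsePDB(pdb_path).select("calpha")
    anm = ANM("complex")
    anm.buildHessian(calphas)         # elastic-network Hessian on Cα atoms
    anm.calcModes(n_modes=n_modes)    # lowest-frequency non-trivial modes
    return calcSqFlucts(anm)          # one value per Cα, i.e. per residue
```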

ii) Interhelical Distance

Interhelical distance is a useful tool to distinguish GPCR-G-protein binding from GPCR-arrestin binding: at least one of the measured interhelical distances is much larger in arrestin-bound GPCRs than in G-protein-bound ones. In particular, all dopamine receptors except D2R show larger TM3-TM5 distances when bound to arrestin, while all dopamine receptors except D5R show larger TM5-TM6 distances when bound to arrestin. Furthermore, arrestin- and G-protein-bound structures tend to cluster separately for each dopamine receptor, pointing towards common conformational changes when bound to different partners and implying that these distances and the associated conformational alterations might be evolutionarily conserved.
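A minimal sketch of one way to measure such interhelical distances, assuming MDAnalysis is available and that the TM residue ranges are supplied by the user: each distance is taken between the centres of geometry of the Cα atoms of the two helices. The chain identifier and residue ranges below are placeholders, not the receptor numbering used in this work.

```python
import MDAnalysis as mda
from numpy.linalg import norm

def interhelical_distance(pdb_path, helix_a, helix_b, chain="A"):
    """Distance (Å) between the Cα centres of geometry of two helices.

    helix_a and helix_b are (first_residue, last_residue) tuples; 'chain'
    selects the receptor chain. All identifiers here are placeholders.
    """
    u = mda.Universe(pdb_path)
    sel = "chainID {} and name CA and resid {}:{}"
    ca_a = u.select_atoms(sel.format(chain, *helix_a))
    ca_b = u.select_atoms(sel.format(chain, *helix_b))
    return norm(ca_a.center_of_geometry() - ca_b.center_of_geometry())

# e.g. interhelical_distance("receptor.pdb", (104, 140), (187, 220))  # TM3 vs TM5, placeholder ranges
```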

Figure 12 - Interhelical distances plotted according to TM3-TM5 distance and TM5-TM6 distance, both measured in Å.

Even though the results from the interhelical distance measurements indicate that clear and distinct structural changes in TM movement occur across all dopamine receptors, these are not as readily distinguishable when considering the fluctuation values from NMA. Nonetheless, it can be considered that fairly different TM movements might be giving rise to distinct structural patterns in important dopamine receptor regions that are highly relevant for GPCR-partner interaction: ICL3 has been described as having several important physiological roles, ranging from receptor activation to the regulation of endocytosis 285-288. As such, it is no surprise that it may be involved in both arrestin and G-protein binding and that key conformational changes in the TMs are leading to this sort of selectivity.


D. Conclusions

During my thesis work I participated in the development of a method to predict HS in PPIs (SpotOn) 199, performed a high-throughput assessment of protein-protein interfaces and their HS, attempted to implement interfacial residue prediction using DL, and performed studies to understand how global protein dynamics affect PPIs. As a result of this work we have published a review paper on the computational study of membrane proteins 160 and a research paper on the SpotOn method 199. We have also produced a book chapter describing a computational pipeline for the study of GPCR-partner interactions 284 and a review article on in silico methods for drug development targeting PD-relevant GPCRs 96. Another paper, describing SpotOnDB, is being finalized 289, as well as a paper concerning the differential binding of dopamine receptors to their different partners 290. Annex III contains a table with all papers and their abstracts, as well as the first page of each paper.

The results for the SpotOn webserver are relevant, as this method outperforms all other computational knowledge-based methods for HS detection. A key aspect of the method is its use of a non-redundant database (all entries can be considered valuable when training the algorithm) in which all information is derived from experimental studies with structural information; as such, when testing the algorithm, it is possible to say with a high degree of confidence that the performance metrics are valid. This performance was achieved by combining a high number of features, an ensemble of ML algorithms and methods to handle overfitting, such as cross-validation and intrinsic algorithm feature selection. ML upon PCA performed poorly when compared to a simple scaling procedure. This is likely because PCA uses feature covariance to reduce dimensionality and does not consider feature importance (a feature might have lower variance but higher importance, which is detrimental in a PCA). The pipeline created here will allow the method to be easily improved as additional experimental information becomes available. Furthermore, cold-spots can eventually be implemented as well, if their existence is proven and sufficient data become available 291.

SpotOnDB was also successful, yielding new information on PPIs by combining a rather large database of non-redundant protein complexes 218 with the SpotOn pipeline. The most relevant results include the inter-monomer clustering of residues around HS and the high number of HS neighbours of residues classified as HS, which supports the existence of hot-regions 201, and the establishment of an order of importance for structural interactions: hydrophobic interactions were more frequent in HS than H-bonds, which were in turn more frequent than salt bridges. All these interactions have been considered important for interface residues 162,237,238,279,292, but their relative importance had never been assessed in HS, which makes this finding highly relevant for the understanding of PPIs and HS. Generally, residues that are more common as HS typically display a higher number of structural features.

Interface prediction was not as successful, as it did not perform as expected. The features retrieved are likely not as important as initially considered for distinguishing protein interfaces from non-interfacial surface residues. The results from the different pre-processing conditions are highly relevant in reaching this conclusion. DownSample has a lower number of negative-class (non-interfacial surface) residues, which addresses the unbalanced dataset at the expense of losing the majority of the non-interfacial surface residues. Poor results from this pre-processing condition could initially be attributed to the loss of negative-class entries. However, considering the results for UpSample, which increases the number of positive-class entries by random sampling with replacement, and SMOTEd, which generates synthetic entries through a k-nearest-neighbours-based approach, the results do not improve as would be expected. In both cases, if the features are not appropriate for the problem, up-sampling is not expected to solve it: up-sampling simply repeats underrepresented class entries, and SMOTE generates new entries based on the feature space (if the features are not adequate, generating new entries based on them is likely to have no relevant effect) and is inappropriate for high-dimensional datasets 253. As such, a better search for features should be performed. Future work using DL for interface prediction should include convolutional networks for sequence feature extraction, which has been done for intra-protein contact prediction using only the protein sequence 293 and using both the protein sequence and evolutionary coupling scores 194, and also for structural feature extraction, which has recently been used to assess the affinity of protein-ligand interactions, a construction that would be very useful for assessing the reliability of the prediction, or for docking problems, if adapted to protein-protein interactions 294. Furthermore, implementing residue propensities such as those in Figure 5 could also be useful as a feature for interface prediction or as a way to assess whether a prediction is reliable.

Using protein dynamics to understand how interfaces change was also successful, as the results from this study correspond to those obtained by our group when considering structural interactions. However, to better understand protein motion and dynamics it is essential to perform actual MD simulations, as NMA has its limitations 295. Nonetheless, NMA represents a good exploratory analysis and helps in understanding protein dynamics when no computational resources are available to perform more demanding techniques, such as MD, for a large number of structures such as those tested in this work.


I believe that throughout this thesis work I was able not only to learn many important aspects of bioinformatics (both at a theoretical and a practical level) and structural biology, but also to contribute to the scientific community with relevant knowledge and methodology for PPIs. Additional effort in the future to complement computational work with experimental work is also highly important, as it will be crucial for understanding how molecular and structural changes affect cells and organisms.


E. References

1 Nelson, D. L. & Cox, M. M. Lehninger Principles of Biochemistry. (W. H. Freeman, 2012). 2 Lamond, A. I. Molecular biology of the cell, 4th edition. Nature 417, 383-383, doi:Doi 10.1038/417383a (2002). 3 Williamson, M. P. & Sutcliffe, M. J. Protein-protein interactions. Biochemical Society transactions 38, 875-878, doi:10.1042/BST0380875 (2010). 4 Barrett, P. J. et al. The Quiet Renaissance of Protein NMR. Biochemistry 52, 1303-1320, doi:10.1021/bi4000436 (2013). 5 Wider, G. & Wuthrich, K. NMR spectroscopy of large molecules and multimolecular assemblies in solution. Curr Opin Struct Biol 9, 594-601 (1999). 6 Foster, M. P., McElroy, C. A. & Amero, C. D. Solution NMR of large molecules and assemblies. Biochemistry 46, 331-340, doi:10.1021/bi0621314 (2007). 7 Frueh, D. P. Practical aspects of NMR signal assignment in larger and challenging proteins. Progress in nuclear magnetic resonance spectroscopy 78, 47-75, doi:10.1016/j.pnmrs.2013.12.001 (2014). 8 Ding, X., Zhao, X. & Watts, A. G-protein-coupled receptor structure, ligand binding and activation as studied by solid-state NMR spectroscopy. The Biochemical journal 450, 443-457, doi:10.1042/BJ20121644 (2013). 9 Blakeley, M. P., Hasnain, S. S. & Antonyuk, S. V. Sub-atomic resolution X-ray crystallography and neutron crystallography: promise, challenges and potential. IUCrJ 2, 464-474, doi:10.1107/S2052252515011239 (2015). 10 Li, X. et al. Structure of a presenilin family intramembrane aspartate protease. Nature 493, 56-61 (2013). 11 Cherezov, V. et al. High Resolution Crystal Structure of an Engineered Human β(2)- Adrenergic G protein-Coupled Receptor. Science (New York, N.Y.) 318, 1258-1265, doi:10.1126/science.1150577 (2007). 12 Van Horn, W. D. et al. Solution NMR Structure of Membrane-Integral Diacylglycerol Kinase. Science (New York, N.Y.) 324, 1726-1729, doi:10.1126/science.1171716 (2009). 13 Koehler Leman, J., Ulmschneider, M. B. & Gray, J. J. Computational modeling of membrane proteins. Proteins 83, 1-24, doi:10.1002/prot.24703 (2015). 14 Milne, J. L. et al. Cryo-electron microscopy--a primer for the non-microscopist. The FEBS journal 280, 28-45, doi:10.1111/febs.12078 (2013). 15 Liao, M., Cao, E., Julius, D. & Cheng, Y. Structure of the TRPV1 ion channel determined by electron cryo-microscopy. Nature 504, 107-112, doi:10.1038/nature12822 (2013). 16 Park, E., Campbell, E. B. & MacKinnon, R. Structure of a CLC chloride ion channel by cryo- electron microscopy. Nature 541, 500-505, doi:10.1038/nature20812 (2017). 17 Lawson, C. L. et al. EMDataBank unified data resource for 3DEM. Nucleic Acids Res 44, D396-403, doi:10.1093/nar/gkv1126 (2016). 18 Doerr, A. Membrane protein structures. Nat Meth 6, 35-35, doi:10.1038/nmeth.f.240 (2009). 19 Moraes, I., Evans, G., Sanchez-Weatherby, J., Newstead, S. & Stewart, P. D. S. Membrane protein structure determination — The next generation. Biochimica et Biophysica Acta (BBA) - Biomembranes 1838, 78-87, doi:http://dx.doi.org/10.1016/j.bbamem.2013.07.010 (2014). 20 Lee, A. G. How lipids affect the activities of integral membrane proteins. Biochimica et Biophysica Acta (BBA) - Biomembranes 1666, 62-87, doi:http://dx.doi.org/10.1016/j.bbamem.2004.05.012 (2004). 21 Grouleff, J., Irudayam, S. J., Skeby, K. K. & Schiøtt, B. The influence of cholesterol on membrane protein structure, function, and dynamics studied by molecular dynamics


simulations. Biochimica et Biophysica Acta (BBA) - Biomembranes 1848, 1783-1795, doi:http://dx.doi.org/10.1016/j.bbamem.2015.03.029 (2015). 22 Grouleff, J., Irudayam, S. J., Skeby, K. K. & Schiott, B. The influence of cholesterol on membrane protein structure, function, and dynamics studied by molecular dynamics simulations. Biochimica et biophysica acta 1848, 1783-1795, doi:10.1016/j.bbamem.2015.03.029 (2015). 23 Alonso, M. A. & Millán, J. The role of lipid rafts in signalling and membrane trafficking in T lymphocytes. Journal of Cell Science 114, 3957-3965 (2001). 24 Brown, D. A. & London, E. Functions of lipid rafts in biological membranes. Annual review of cell and developmental biology 14, 111-136, doi:10.1146/annurev.cellbio.14.1.111 (1998). 25 Escribá, P. V. et al. Role of lipid polymorphism in G protein-membrane interactions: nonlamellar-prone phospholipids and peripheral protein binding to membranes. Proceedings of the National Academy of Sciences 94, 11375-11380 (1997). 26 UniProt: a hub for protein information. Nucleic acids research 43, D204-212, doi:10.1093/nar/gku989 (2015). 27 Whisstock, J. C. & Lesk, A. M. Prediction of protein function from protein sequence and structure. Q Rev Biophys 36, 307-340 (2003). 28 Petersen, T. N. et al. Prediction of protein secondary structure at 80% accuracy. Proteins 41, 17-20 (2000). 29 Rost, B. & Sander, C. Prediction of protein secondary structure at better than 70% accuracy. J Mol Biol 232, 584-599, doi:10.1006/jmbi.1993.1413 (1993). 30 Kretsinger, R. H., Ison, R. E. & Hovmoller, S. Prediction of protein structure. Methods Enzymol 383, 1-27, doi:10.1016/S0076-6879(04)83001-5 (2004). 31 Shortle, D. Prediction of protein structure. Current biology : CB 10, R49-51 (2000). 32 Argos, P. & Rao, J. K. Prediction of protein structure. Methods Enzymol 130, 185-207 (1986). 33 Edwards, Y. J. & Cottage, A. Prediction of protein structure and function by using bioinformatics. Methods Mol Biol 175, 341-375, doi:10.1385/1-59259-235-X:341 (2001). 34 Nanni, L., Brahnam, S. & Lumini, A. Prediction of protein structure classes by incorporating different protein descriptors into general Chou's pseudo amino acid composition. Journal of theoretical biology 360, 109-116, doi:10.1016/j.jtbi.2014.07.003 (2014). 35 Hartlmuller, C., Gobl, C. & Madl, T. Prediction of Protein Structure Using Surface Accessibility Data. Angewandte Chemie, doi:10.1002/anie.201604788 (2016). 36 Al-Lazikani, B., Hill, E. E. & Morea, V. Protein structure prediction. Methods Mol Biol 453, 33-85, doi:10.1007/978-1-60327-429-6_2 (2008). 37 Westhead, D. R. & Thornton, J. M. Protein structure prediction. Current opinion in biotechnology 9, 383-389 (1998). 38 Benner, S. A., Geroff, D. L. & Rozzell, J. D. Protein structure prediction. Science 274, 1448b-1449b, doi:10.1126/science.274.5292.1448b (1996). 39 Barton, G. J. & Russell, R. B. Protein structure prediction. Nature 361, 505-506, doi:10.1038/361505b0 (1993). 40 Robson, B. & Garnier, J. Protein structure prediction. Nature 361, 506, doi:10.1038/361506a0 (1993). 41 Garnier, J. Protein structure prediction. Biochimie 72, 513-524 (1990). 42 Hui, W. Q., Cheng, Q., Liu, T. Y. & Ouyang, Q. Homology modeling, docking, and molecular dynamics simulation of the receptor GALR2 and its interactions with galanin and a positive allosteric modulator. Journal of molecular modeling 22, 90, doi:10.1007/s00894-016-2944-x (2016). 43 Nugent, T. De novo membrane protein structure prediction. Methods Mol Biol 1215, 331-350, doi:10.1007/978-1-4939-1465-4_15 (2015).


44 Nugent, T. & Jones, D. T. Accurate de novo structure prediction of large transmembrane protein domains using fragment-assembly and correlated mutation analysis. Proc Natl Acad Sci U S A 109, E1540-1547, doi:10.1073/pnas.1120036109 (2012). 45 Seaton, B. A. & Roberts, M. F. in Biological Membranes: A Molecular Perspective from Computation and Experiment (eds Kenneth M. Merz & Benoît Roux) 355-403 (Birkhäuser Boston, 1996). 46 Whited, A. M. & Johs, A. The interactions of peripheral membrane proteins with biological membranes. Chemistry and physics of lipids 192, 51-59, doi:10.1016/j.chemphyslip.2015.07.015 (2015). 47 Monje-Galvan, V. & Klauda, J. B. Peripheral membrane proteins: Tying the knot between experiment and computation. Biochim Biophys Acta 1858, 1584-1593, doi:10.1016/j.bbamem.2016.02.018 (2016). 48 London, S., Gurdal, O. & Gall, C. Automatic Export of PubMed Citations to EndNote. Medical reference services quarterly 29, 146-153, doi:10.1080/02763861003723317 (2010). 49 Arinaminpathy, Y., Khurana, E., Engelman, D. M. & Gerstein, M. B. Computational analysis of membrane proteins: the largest class of drug targets. Drug discovery today 14, 1130-1135, doi:10.1016/j.drudis.2009.08.006 (2009). 50 Alford, R. F. et al. An Integrated Framework Advancing Membrane Protein Modeling and Design. PLoS Computational Biology 11, e1004398, doi:10.1371/journal.pcbi.1004398 (2015). 51 Gromiha, M. M. & Ou, Y. Y. Bioinformatics approaches for functional annotation of membrane proteins. Brief Bioinform 15, 155-168, doi:10.1093/bib/bbt015 (2014). 52 Peyronnet, R., Tran, D., Girault, T. & Frachisse, J. M. Mechanosensitive channels: feeling tension in a world under pressure. Frontiers in Plant Science 5, doi:10.3389/fpls.2014.00558 (2014). 53 Tice, C. M. & Zheng, Y. J. Non-canonical modulators of nuclear receptors. Bioorganic & medicinal chemistry letters 26, 4157-4164, doi:10.1016/j.bmcl.2016.07.067 (2016). 54 Lodish, H. et al. Section 3.4-Section 3.4 (W. H. Freeman, 2000). 55 White, S. H. & Wimley, W. C. Membrane protein folding and stability: physical principles. Annual review of biophysics and 28, 319-365, doi:10.1146/annurev.biophys.28.1.319 (1999). 56 Schulz, G. E. Transmembrane beta-barrel proteins. Advances in protein chemistry 63, 47- 70 (2003). 57 Bernaudat, F. et al. (2011). 58 Zoonens, M. & Miroux, B. Expression of membrane proteins at the Escherichia coli membrane for structural studies. Methods in Molecular Biology 601, 49-66 (2010). 59 Wagner, S. et al. Tuning Escherichia coli for membrane protein overexpression. Proceedings of the National Academy of Sciences of the United States of America 105, 14371-14376, doi:10.1073/pnas.0804090105 (2008). 60 Rosenbusch, J. P., Lustig, A., Grabo, M., Zulauf, M. & Regenass, M. Approaches to determining membrane protein structures to high resolution: do selections of subpopulations occur? Micron 32, 75-90, doi:10.1016/S0968-4328(00)00021-4 (2001). 61 Privé, G. G. Detergents for the stabilization and crystallization of membrane proteins. Methods 41, 388-397, doi:10.1016/j.ymeth.2007.01.007 (2007). 62 Carpenter, E. P., Beis, K., Cameron, A. D. & Iwata, S. Overcoming the challenges of membrane protein crystallography. Current Opinion in Structural Biology 18, 581-586, doi:10.1016/j.sbi.2008.07.001 (2008). 63 Cherezov, V., Peddi, A., Muthusubramaniam, L., Zheng, Y. F. & Caffrey, M. A robotic system for crystallizing membrane and soluble proteins in lipidic mesophases. Acta Crystallographica Section D 60, 1795-1807 (2004).


64 Hunte, C., Koepke, J., Lange, C., Roßmanith, T. & Michel, H. Structure at 2.3 Å resolution of the cytochrome bc1 complex from the yeast Saccharomyces cerevisiae co-crystallized with an antibody Fv fragment. Structure 8, 669-684, doi:10.1016/S0969-2126(00)00152- 0 (2000). 65 Liang, B. & Tamm, L. NMR as a tool to investigate the structure, dynamics and function of membrane proteins. Nat Struct Mol Biol 23, 468-474, doi:doi: 10.1038/nsmb.3226 (2016). 66 Oxenoid, K. & Chou, J. J. A functional NMR for membrane proteins: dynamics, ligand binding, and allosteric modulation. Protein science : a publication of the Protein Society 25, 959-973, doi:10.1002/pro.2910 (2016). 67 Murray, D. T., Das, N. & Cross, T. A. Solid State NMR Strategy for Characterizing Native Membrane Protein Structures. Accounts of Chemical Research 46, 2172-2181, doi:10.1021/ar3003442 (2013). 68 Shahid, S. A. et al. Membrane-protein structure determination by solid-state NMR spectroscopy of microcrystals. Nat Meth 9, 1212-1217 (2012). 69 Watts, A. et al. (ed A. Kristina Downing) 403-473 (Humana Press, 2004). 70 Kaplan, M., Pinto, C., Houben, K. & Baldus, M. Nuclear magnetic resonance (NMR) applied to membrane-protein complexes. Q Rev Biophys 49, e15, doi:10.1017/S003358351600010X (2016). 71 Fredriksson, R., Lagerstrom, M. C., Lundin, L. G. & Schioth, H. B. The G-protein-coupled receptors in the human genome form five main families. Phylogenetic analysis, paralogon groups, and fingerprints. Molecular pharmacology 63, 1256-1272, doi:10.1124/mol.63.6.1256 (2003). 72 Heng, B. C., Aubel, D. & Fussenegger, M. An overview of the diverse roles of G-protein coupled receptors (GPCRs) in the pathophysiology of various human diseases. Biotechnology advances 31, 1676-1694, doi:10.1016/j.biotechadv.2013.08.017 (2013). 73 Kobilka, B. K. G protein coupled receptor structure and activation. Biochim Biophys Acta 1768, 794-807, doi:10.1016/j.bbamem.2006.10.021 (2007). 74 Maurice, P. et al. GPCR-interacting proteins, major players of GPCR function. Advances in pharmacology (San Diego, Calif.) 62, 349-380, doi:10.1016/b978-0-12-385952- 5.00001-4 (2011). 75 Weinstein, L. S., Chen, M. & Liu, J. Gs(alpha) mutations and imprinting defects in human disease. Annals of the New York Academy of Sciences 968, 173-197 (2002). 76 Taussig, R., Iniguez-Lluhi, J. A. & Gilman, A. G. Inhibition of adenylyl cyclase by Gi alpha. Science 261, 218-221 (1993). 77 Berstein, G. et al. Phospholipase C-beta 1 is a GTPase-activating protein for Gq/11, its physiologic regulator. Cell 70, 411-418 (1992). 78 Suzuki, N., Hajicek, N. & Kozasa, T. Regulation and physiological functions of G12/13- mediated signaling pathways. Neuro-Signals 17, 55-70, doi:10.1159/000186690 (2009). 79 Akgoz, M., Azpiazu, I., Kalyanaraman, V. & Gautam, N. Role of the G protein gamma subunit in beta gamma complex modulation of phospholipase Cbeta function. J Biol Chem 277, 19573-19578, doi:10.1074/jbc.M201546200 (2002). 80 Reuveny, E. Structural biology: Ion channel twists to open. Nature 498, 182-183, doi:10.1038/nature12255 (2013). 81 Oakley, R. H., Laporte, S. A., Holt, J. A., Barak, L. S. & Caron, M. G. Association of beta- arrestin with G protein-coupled receptors during clathrin-mediated endocytosis dictates the profile of receptor resensitization. J Biol Chem 274, 32248-32257 (1999). 82 Hara, M. R. et al. A stress response pathway regulates DNA damage through beta2- adrenoreceptors and beta-arrestin-1. Nature 477, 349-353, doi:10.1038/nature10368 (2011). 83 Boerrigter, G. et al. 
Cardiorenal actions of TRV120027, a novel ss-arrestin-biased ligand at the angiotensin II type I receptor, in healthy and heart failure canines: a novel


therapeutic strategy for acute heart failure. Circulation. Heart failure 4, 770-778, doi:10.1161/CIRCHEARTFAILURE.111.962571 (2011). 84 Whalen, E. J., Rajagopal, S. & Lefkowitz, R. J. Therapeutic potential of beta-arrestin- and G protein-biased agonists. Trends in molecular medicine 17, 126-139, doi:10.1016/j.molmed.2010.11.004 (2011). 85 Sengupta, D. & Chattopadhyay, A. Molecular dynamics simulations of GPCR–cholesterol interaction: An emerging paradigm. Biochimica et Biophysica Acta (BBA) - Biomembranes 1848, 1775-1782, doi:http://dx.doi.org/10.1016/j.bbamem.2015.03.018 (2015). 86 Park, J. H. et al. Opsin, a structural model for olfactory receptors? Angewandte Chemie 52, 11021-11024, doi:10.1002/anie.201302374 (2013). 87 Wu, G., Davis, J. E. & Zhang, M. Regulation of alpha2B-Adrenerigc Receptor Export Trafficking by Specific Motifs. Progress in molecular biology and translational science 132, 227-244, doi:10.1016/bs.pmbts.2015.03.004 (2015). 88 Duvernay, M. T. et al. A single conserved leucine residue on the first intracellular loop regulates ER export of G protein-coupled receptors. Traffic 10, 552-566, doi:10.1111/j.1600-0854.2009.00890.x (2009). 89 Perez-Aguilar, J. M., Shan, J., LeVine, M. V., Khelashvili, G. & Weinstein, H. A functional selectivity mechanism at the serotonin-2A GPCR involves ligand-dependent conformations of intracellular loop 2. J Am Chem Soc 136, 16044-16054, doi:10.1021/ja508394x (2014). 90 Katritch, V., Cherezov, V. & Stevens, R. C. Diversity and modularity of G protein-coupled receptor structures. Trends in pharmacological sciences 33, 17-27, doi:10.1016/j.tips.2011.09.003 (2012). 91 Verzijl, D. et al. Helix 8 of the viral chemokine receptor ORF74 directs chemokine binding. J Biol Chem 281, 35327-35335, doi:10.1074/jbc.M606877200 (2006). 92 Sensoy, O. & Weinstein, H. A mechanistic role of Helix 8 in GPCRs: Computational modeling of the dopamine D2 receptor interaction with the GIPC1-PDZ-domain. Biochim Biophys Acta 1848, 976-983, doi:10.1016/j.bbamem.2014.12.002 (2015). 93 Moreira, I. S. Structural features of the G-protein/GPCR interactions. Biochimica et Biophysica Acta (BBA) - General Subjects 1840, 16-33, doi:http://dx.doi.org/10.1016/j.bbagen.2013.08.027 (2014). 94 Christopoulos, A. Allosteric binding sites on cell-surface receptors: novel targets for drug discovery. Nat Rev Drug Discov 1, 198-210 (2002). 95 Conn, P. J., Christopoulos, A. & Lindsley, C. W. Allosteric modulators of GPCRs: a novel approach for the treatment of CNS disorders. Nature reviews. Drug discovery 8, 41-54, doi:10.1038/nrd2760 (2009). 96 Lemos, A. et al. In silico studies targeting G-protein coupled receptors for drug research against Parkinson’s disease. Current Neuropharmacology (Submitted). 97 Snyder, S. H., Taylor, K. M., Coyle, J. T. & Meyerhoff, J. L. The role of brain dopamine in behavioral regulation and the actions of psychotropic drugs. The American journal of psychiatry 127, 199-207, doi:10.1176/ajp.127.2.199 (1970). 98 Missale, C., Nash, S. R., Robinson, S. W., Jaber, M. & Caron, M. G. Dopamine receptors: from structure to function. Physiological reviews 78, 189-225 (1998). 99 Iversen, S. D. & Iversen, L. L. Dopamine: 50 years in perspective. Trends in neurosciences 30, 188-193, doi:10.1016/j.tins.2007.03.002 (2007). 100 Scatton, B., Javoy-Agid, F., Rouquier, L., Dubois, B. & Agid, Y. Reduction of cortical dopamine, noradrenaline, serotonin and their metabolites in Parkinson's disease. Brain research 275, 321-328 (1983). 101 Seeman, P. et al. 
Human brain D1 and D2 dopamine receptors in schizophrenia, Alzheimer's, Parkinson's, and Huntington's diseases. Neuropsychopharmacology : official publication of the American College of Neuropsychopharmacology 1, 5-15 (1987).


102 Seeman, P., Niznik, H. B., Guan, H. C., Booth, G. & Ulpian, C. Link between D1 and D2 dopamine receptors is reduced in schizophrenia and Huntington diseased brain. Proc Natl Acad Sci U S A 86, 10156-10160 (1989). 103 Arthur L, S. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development 3, 210-229 (1959). 104 Drew Conway, J. M. W. Machine Learning for Hackers. (1989). 105 Dasgupta, A., Sun, Y. V., Konig, I. R., Bailey-Wilson, J. E. & Malley, J. D. Brief review of regression-based and machine learning methods in genetic epidemiology: the Genetic Analysis Workshop 17 experience. Genetic epidemiology 35 Suppl 1, S5-11, doi:10.1002/gepi.20642 (2011). 106 Lin, H. N., Chen, C. T., Sung, T. Y., Ho, S. Y. & Hsu, W. L. Protein subcellular localization prediction of eukaryotes using a knowledge-based approach. BMC Bioinformatics 10 Suppl 15, S8, doi:10.1186/1471-2105-10-S15-S8 (2009). 107 Peterson, L. E. K-nearest neighbor. Scholarpedia 4, 1883, doi:doi::10.4249/scholarpedia.1883 (2009). 108 Krishna, K. & Murty, M. N. Genetic K-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 29, 433-439, doi:10.1109/3477.764879 (1999). 109 Needleman, S. B. & Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48, 443-453 (1970). 110 Heringa, J. & Taylor, W. R. Three-dimensional domain duplication, swapping and stealing. Curr Opin Struct Biol 7, 416-421 (1997). 111 Katti, M. V., Sami-Subbu, R., Ranjekar, P. K. & Gupta, V. S. Amino acid repeat patterns in protein sequences: their diversity and structural-functional implications. Protein Sci 9, 1203-1209, doi:10.1110/ps.9.6.1203 (2000). 112 Smith, T. F. & Waterman, M. S. Identification of common molecular subsequences. J Mol Biol 147, 195-197 (1981). 113 Waterman, M. S. & Eggert, M. A new algorithm for best subsequence alignments with application to tRNA-rRNA comparisons. J Mol Biol 197, 723-728 (1987). 114 Bailey, T. L. & Elkan, C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings / ... International Conference on Intelligent Systems for Molecular Biology ; ISMB. International Conference on Intelligent Systems for Molecular Biology 2, 28-36 (1994). 115 Notredame, C., Higgins, D. G. & Heringa, J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 302, 205-217, doi:10.1006/jmbi.2000.4042 (2000). 116 Simossis, V., Kleinjung, J. & Heringa, J. An overview of multiple sequence alignment. Current protocols in bioinformatics / editoral board, Andreas D. Baxevanis ... [et al.] Chapter 3, Unit 3 7, doi:10.1002/0471250953.bi0307s03 (2003). 117 Apweiler, R. et al. UniProt: the Universal Protein knowledgebase. Nucleic Acids Res 32, D115-119, doi:10.1093/nar/gkh131 (2004). 118 Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J Mol Biol 215, 403-410, doi:10.1016/S0022-2836(05)80360-2 (1990). 119 Dayhoff, M. O., Schwartz, R. M. & Orcutt, B. C. A model of evolutionary change in proteins. Atlas of protein sequence and structure 5, 345-351, doi:citeulike-article- id:4442167 (1978). 120 Henikoff, S. & Henikoff, J. G. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 89, 10915-10919 (1992). 121 Eddy, S. R. Hidden Markov models. Curr Opin Struct Biol 6, 361-365 (1996). 122 Larkin, M. A. et al. Clustal W and Clustal X version 2.0. 
Bioinformatics (Oxford, England) 23, 2947-2948, doi:10.1093/bioinformatics/btm404 (2007).


123 Huang, X. & Miller, W. A time-efficient, linear-space local similarity algorithm. Advances in Applied Mathematics 12, 337-357, doi:http://dx.doi.org/10.1016/0196- 8858(91)90017-D (1991). 124 Ashkenazy, H. et al. ConSurf 2016: an improved methodology to estimate and visualize evolutionary conservation in macromolecules. Nucleic Acids Res 44, W344-350, doi:10.1093/nar/gkw408 (2016). 125 Pupko, T., Bell, R. E., Mayrose, I., Glaser, F. & Ben-Tal, N. Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics (Oxford, England) 18 Suppl 1, S71-77 (2002). 126 Dorn, M., E Silva, M. B., Buriol, L. S. & Lamb, L. C. Three-dimensional protein structure prediction: Methods and computational strategies. Computational Biology and Chemistry 53, 251-276, doi:10.1016/j.compbiolchem.2014.10.001 (2014). 127 Ichiye, T. & Karplus, M. Collective motions in proteins: a covariance analysis of atomic fluctuations in molecular dynamics and normal mode simulations. Proteins 11, 205-217, doi:10.1002/prot.340110305 (1991). 128 Amadei, A., Linssen, A. B. & Berendsen, H. J. Essential dynamics of proteins. Proteins 17, 412-425, doi:10.1002/prot.340170408 (1993). 129 Fenwick, R. B., Esteban-Martin, S. & Salvatella, X. Understanding biomolecular motion, recognition, and allostery by use of conformational ensembles. European biophysics journal : EBJ 40, 1339-1355, doi:10.1007/s00249-011-0754-8 (2011). 130 Kim, J. I., Na, S. & Eom, K. Domain decomposition-based structural condensation of large protein structures for understanding their conformational dynamics. J Comput Chem 32, 161-169, doi:10.1002/jcc.21613 (2011). 131 de Oliveira, C. A., Grant, B. J., Zhou, M. & McCammon, J. A. Large-scale conformational changes of Trypanosoma cruzi proline racemase predicted by accelerated molecular dynamics simulation. PLoS Comput Biol 7, e1002178, doi:10.1371/journal.pcbi.1002178 (2011). 132 Wen, P. C. & Tajkhorshid, E. Conformational coupling of the nucleotide-binding and the transmembrane domains in ABC transporters. Biophys J 101, 680-690, doi:10.1016/j.bpj.2011.06.031 (2011). 133 Stolzenberg, S., Khelashvili, G. & Weinstein, H. Structural intermediates in a model of the substrate translocation path of the bacterial glutamate transporter homologue GltPh. The journal of physical chemistry. B 116, 5372-5383, doi:10.1021/jp301726s (2012). 134 van Gunsteren, W. F. & Berendsen, H. J. C. Computer Simulation of Molecular Dynamics: Methodology, Applications, and Perspectives in Chemistry. Angewandte Chemie International Edition in English 29, 992-1023, doi:10.1002/anie.199009921 (1990). 135 Brooks, B. R. et al. CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. Journal of Computational Chemistry 4, 187-217, doi:10.1002/jcc.540040211 (1983). 136 Pearlman, D. A. et al. AMBER, a package of computer programs for applying molecular mechanics, normal mode analysis, molecular dynamics and free energy calculations to simulate the structural and energetic properties of molecules. Computer Physics Communications 91, 1-41, doi:10.1016/0010-4655(95)00041-D (1995). 137 Berendsen, H. J. C., van der Spoel, D. & van Drunen, R. GROMACS: A message-passing parallel molecular dynamics implementation. Computer Physics Communications 91, 43- 56, doi:10.1016/0010-4655(95)00042-E (1995). 138 Hess, B., Kutzner, C., van der Spoel, D. & Lindahl, E. 
GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation. Journal of Chemical Theory and Computation 4, 435-447, doi:10.1021/ct700301q (2008).


139 Büssow, K. Anna Tramontano: Protein structure prediction. Concepts and applications. Analytical and Bioanalytical Chemistry 386, 1579-1580, doi:10.1007/s00216-006-0812- 8 (2006). 140 Rohl, C. A., Strauss, C. E. M., Misura, K. M. S. & Baker, D. Protein Structure Prediction Using Rosetta. Methods in Enzymology 383, 66-93, doi:10.1016/S0076-6879(04)83004- 0 (2004). 141 Jones, T. A. & Kleywegt, G. J. CASP3 comparative modeling evaluation. Proteins: Structure, Function, and Bioinformatics 37, 30-46, doi:10.1002/(SICI)1097- 0134(1999)37:3+<30::AID-PROT6>3.0.CO;2-S (1999). 142 Gray, J. J. et al. Vol. 331 281-299 (2003). 143 Rohl, C. A., Strauss, C. E. M., Chivian, D. & Baker, D. Modeling structurally variable regions in homologous proteins with Rosetta. Proteins 55, 656-677, doi:10.1002/prot.10629 (2004). 144 Jones, D. T. Predicting novel protein folds by using FRAGFOLD. Proteins: Structure, Function and Genetics 45, 127-132, doi:10.1002/prot.1171 (2001). 145 Zemla, A., Venclovas, Č., Moult, J. & Fidelis, K. Processing and evaluation of predictions in CASP4. Proteins: Structure, Function and Genetics 45, 13-21, doi:10.1002/prot.10052 (2001). 146 McGuffin, L. J., Bryson, K. & Jones, D. T. The PSIPRED protein structure prediction server. Bioinformatics (Oxford, England) 16, 404-405 (2000). 147 Finkelstein, A. V. & Ptitsyn, O. B. Why do globular proteins fit the limited set of folding patterns? Progress in biophysics and molecular biology 50, 171-190 (1987). 148 Levitt, M. & Chothia, C. Structural patterns in globular proteins. Nature 261, 552-558 (1976). 149 Wang, Z. X. A re-estimation for the total numbers of protein folds and superfamilies. Protein engineering 11, 621-626 (1998). 150 Abual-Rub, M. S. & Abdullah, R. A Survey of protein fold recognition algorithms. Journal of Computer Science 4, 768-776, doi:10.3844/jcssp.2008.768.776 (2008). 151 Sillitoe, I. et al. CATH: comprehensive structural and functional annotations for genome sequences. Nucleic Acids Research 43, D376-D381, doi:10.1093/nar/gku947 (2015). 152 Jones, D. T. GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. Journal of molecular biology 287, 797-815, doi:10.1006/jmbi.1999.2583 (1999). 153 McGuffin, L. J. & Jones, D. T. Improvement of the GenTHREADER method for genomic fold recognition. Bioinformatics (Oxford, England) 19, 874-881 (2003). 154 Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25 (1997). 155 Holm, L. & Sander, C. The FSSP database of structurally aligned protein fold families. Nucleic Acids Research 22, 3600-3609 (1994). 156 Lobley, A., Sadowski, M. I. & Jones, D. T. pGenTHREADER and pDomTHREADER: new methods for improved protein fold recognition and superfamily discrimination. Bioinformatics (Oxford, England) 25, 1761-1767, doi:10.1093/bioinformatics/btp302 (2009). 157 Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Research 28, 235-242 (2000). 158 Eswar, N. et al. Comparative Protein Structure Modeling Using Modeller. Current protocols in bioinformatics / editoral board, Andreas D. Baxevanis ... [et al.] 0 5, Unit- 5.6, doi:10.1002/0471250953.bi0506s15 (2006). 159 Rackovsky, S. Global characteristics of protein sequences and their implications. Proceedings of the National Academy of Sciences of the United States of America 107, 8623-8626, doi:10.1073/pnas.1001299107 (2010).


160 Almeida, J. G., Preto, A. J., Koukos, P. I., Bonvin, A. & Moreira, I. S. MEMBRANE PROTEINS STRUCTURES: A review on computational modeling tools. Biochim Biophys Acta, doi:10.1016/j.bbamem.2017.07.008 (2017). 161 Nugent, T. & Jones, D. T. Transmembrane protein topology prediction using support vector machines. BMC Bioinformatics 10, 159-159, doi:10.1186/1471-2105-10-159 (2009). 162 Wimley, W. C. & White, S. H. Experimentally determined hydrophobicity scale for proteins at membrane interfaces. Nature structural biology 3, 842-848 (1996). 163 Koehler, J., Woetzel, N., Staritzbichler, R., Sanders, C. R. & Meiler, J. A unified hydrophobicity scale for multispan membrane proteins. Proteins 76, 13-29, doi:10.1002/prot.22315 (2009). 164 Viklund, H. & Elofsson, A. OCTOPUS: improving topology prediction by two-track ANN- based preference scores and an extended topological grammar. Bioinformatics (Oxford, England) 24, 1662-1668, doi:10.1093/bioinformatics/btn221 (2008). 165 Hayat, S. & Elofsson, A. BOCTOPUS: improved topology prediction of transmembrane proteins. Bioinformatics (Oxford, England) 28, 516-522, doi:10.1093/bioinformatics/btr710 (2012). 166 Viklund, H., Bernsel, A., Skwark, M. & Elofsson, A. SPOCTOPUS: a combined predictor of signal peptides and membrane protein topology. Bioinformatics (Oxford, England) 24, 2928-2929, doi:10.1093/bioinformatics/btn550 (2008). 167 Tsaousis, G. N. et al. ExTopoDB: a database of experimentally derived topological models of transmembrane proteins. Bioinformatics (Oxford, England) 26, 2490-2492, doi:10.1093/bioinformatics/btq362 (2010). 168 Angarica, V. E. & Sancho, J. Protein dynamics governed by interfaces of high polarity and low packing density. PloS one 7, e48212-e48212, doi:10.1371/journal.pone.0048212 (2012). 169 Nguyen, D., Helms, V. & Lee, P.-H. PRIMSIPLR: prediction of inner-membrane situated pore-lining residues for alpha-helical transmembrane proteins. Proteins 82, 1503-1511, doi:10.1002/prot.24520 (2014). 170 Shen, H. & Chou, J. J. MemBrain: Improving the Accuracy of Predicting Transmembrane Helices. PLoS ONE 3, e2399-e2399 (2008). 171 Meruelo, A. D., Samish, I. & Bowie, J. U. TMKink: a method to predict transmembrane helix kinks. Protein science : a publication of the Protein Society 20, 1256-1264, doi:10.1002/pro.653 (2011). 172 Ebejer, J. P., Hill, J. R., Kelm, S., Shi, J. & Deane, C. M. Memoir: template-based structure prediction for membrane proteins. Nucleic Acids Res 41, W379-383, doi:10.1093/nar/gkt331 (2013). 173 Kelm, S., Shi, J. & Deane, C. M. MEDELLER: homology-based coordinate generation for membrane proteins. Bioinformatics (Oxford, England) 26, 2833-2840, doi:10.1093/bioinformatics/btq554 (2010). 174 Kozma, D. & Tusnady, G. E. TMFoldWeb: a web server for predicting transmembrane protein fold class. Biology direct 10, 54, doi:10.1186/s13062-015-0082-5 (2015). 175 Kozma, D. & Tusnady, G. E. TMFoldRec: a statistical potential-based transmembrane protein fold recognition tool. BMC Bioinformatics 16, 201, doi:10.1186/s12859-015- 0638-5 (2015). 176 Yarov-Yarovoy, V., Schonbrun, J. & Baker, D. Multipass membrane protein structure prediction using Rosetta. Proteins 62, 1010-1025, doi:10.1002/prot.20817 (2006). 177 Yarov-Yarovoy, V., Baker, D. & Catterall, W. A. Voltage sensor conformations in the open and closed states in ROSETTA structural models of K(+) channels. Proc Natl Acad Sci U S A 103, 7292-7297, doi:10.1073/pnas.0602350103 (2006).


178 Vargas, E. et al. An emerging consensus on voltage-dependent gating from computational modeling and molecular dynamics simulations. The Journal of general physiology 140, 587-594, doi:10.1085/jgp.201210873 (2012). 179 Koehler Leman, J., Mueller, B. K. & Gray, J. J. Expanding the toolkit for membrane protein modeling in Rosetta. Bioinformatics (Oxford, England) 33, 754-756, doi:10.1093/bioinformatics/btw716 (2017). 180 Zhang, Q. C., Petrey, D., Norel, R. & Honig, B. H. Protein interface conservation across structure space. Proc Natl Acad Sci USA 107 (2010). 181 Landau, M. et al. ConSurf 2005: the projection of evolutionary conservation scores of residues on protein structures. Nucleic Acids Res 33, W299-302, doi:10.1093/nar/gki370 (2005). 182 Higgins, D. G. & Sharp, P. M. CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. 73, 237-244 (1988). 183 Katoh, K., Misawa, K., Kuma, K. & Miyata, T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research 30, 3059- 3066, doi:Doi 10.1093/Nar/Gkf436 (2002). 184 Edgar, R. C. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5, 113, doi:10.1186/1471-2105-5-113 (2004). 185 Hopf, T. A. et al. Sequence co-evolution gives 3D contacts and structures of protein complexes. Elife 3, e03430 (2014). 186 Marks, D. S. et al. Protein 3D structure computed from evolutionary sequence variation. PLoS One 6, e28766, doi:10.1371/journal.pone.0028766 (2011). 187 Xue, L. C., Dobbs, D. & Honavar, V. HomPPI: a class of sequence homology based protein- protein interface prediction methods. BMC Bioinformatics 12, 1-24, doi:10.1186/1471- 2105-12-244 (2011). 188 Xue, L. C., Dobbs, D., Bonvin, A. M. J. J. & Honavar, V. Computational prediction of protein interfaces: A review of data driven methods. FEBS Letters 589, 3516-3526, doi:10.1016/j.febslet.2015.10.003 (2015). 189 Murakami, Y. & Mizuguchi, K. Applying the Naive Bayes classifier with kernel density estimation to the prediction of protein-protein interaction sites. Bioinformatics (Oxford, England) 26 (2010). 190 Porollo, A. & Meller, J. Prediction-based fingerprints of protein-protein interactions. Proteins 66 (2007). 191 de Vries, S. J., van Dijk, A. D. & Bonvin, A. M. WHISCY: what information does surface conservation yield? Application to data-driven docking. Proteins 63, 479-489, doi:10.1002/prot.20842 (2006). 192 de Vries, S. J. & Bonvin, A. M. J. J. CPORT: A Consensus Interface Predictor and Its Performance in Prediction-Driven Docking with HADDOCK. PLoS ONE 6, e17695-e17695 (2011). 193 He, B., Mortuza, S. M., Wang, Y., Shen, H. B. & Zhang, Y. NeBcon: Protein contact map prediction using neural network training coupled with naive Bayes classifiers. Bioinformatics (Oxford, England), doi:10.1093/bioinformatics/btx164 (2017). 194 Wang, S., Sun, S., Li, Z., Zhang, R. & Xu, J. Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model. PLoS Comput Biol 13, e1005324, doi:10.1371/journal.pcbi.1005324 (2017). 195 Humphrey, W., Dalke, A. & Schulten, K. VMD: visual molecular dynamics. Journal of molecular graphics 14, 33-38, 27-38 (1996). 196 Schrodinger, LLC. The PyMOL Molecular Graphics System, Version 1.8 (2015). 197 Vangone, A., Spinelli, R., Scarano, V., Cavallo, L. & Oliva, R. COCOMAPS: a web application to analyze and visualize contacts at the interface of biomolecular complexes. 
Bioinformatics (Oxford, England) 27, 2915-2916, doi:10.1093/bioinformatics/btr484 (2011).


198 Bogan, A. A. & Thorn, K. S. Anatomy of hot spots in protein interfaces. J Mol Biol 280 (1998). 199 Moreira, I. S. et al. SpotOn: a web server for protein-protein binding hot-spots. Sci. Rep. (accepted) (2017). 200 Wagner, S. et al. Consequences of membrane protein overexpression in Escherichia coli. Molecular & cellular proteomics : MCP 6, 1527-1550, doi:10.1074/mcp.M600431- MCP200 (2007). 201 Keskin, O., Ma, B. & Nussinov, R. Hot regions in protein--protein interactions: the organization and contribution of structurally conserved hot spot residues. J Mol Biol 345, 1281-1294, doi:10.1016/j.jmb.2004.10.077 (2005). 202 Cukuroglu, E., Gursoy, A. & Keskin, O. HotRegion: a database of predicted hot spot clusters. Nucleic Acids Res 40, D829-833, doi:10.1093/nar/gkr929 (2012). 203 Tuncbag, N., Keskin, O. & Gursoy, A. HotPoint: hot spot prediction server for protein interfaces. Nucleic Acids Res 38, W402-406, doi:10.1093/nar/gkq323 (2010). 204 Vakser, I. A. Protein-protein docking: from interaction to interactome. Biophys J 107, 1785-1793, doi:10.1016/j.bpj.2014.08.033 (2014). 205 Pierce, B. G. et al. ZDOCK server: interactive docking prediction of protein-protein complexes and symmetric multimers. Bioinformatics (Oxford, England) 30, 1771-1773, doi:10.1093/bioinformatics/btu097 (2014). 206 Van Zundert, G. C. P. et al. The HADDOCK2.2 Web Server: User-Friendly Integrative Modeling of Biomolecular Complexes. Journal of Molecular Biology 428, 720-725, doi:10.1016/j.jmb.2015.09.014 (2016). 207 Asadabadi, E. B. & Abdolmaleki, P. Predictions of Protein-Protein Interfaces within Membrane Protein Complexes. Avicenna journal of medical biotechnology 5, 148-157 (2013). 208 Bordner, A. J. Predicting protein-protein binding sites in membrane proteins. BMC Bioinformatics 10, 312, doi:10.1186/1471-2105-10-312 (2009). 209 Li, B. et al. Accurate Prediction of Contact Numbers for Multi-Spanning Helical Membrane Proteins. Journal of chemical information and modeling 56, 423-434, doi:10.1021/acs.jcim.5b00517 (2016). 210 Neuvirth, H., Raz, R. & Schreiber, G. ProMate: a structure based prediction program to identify the location of protein-protein binding sites. J Mol Biol 338 (2004). 211 Ahmad, S. & Mizuguchi, K. Partner-Aware Prediction of Interacting Residues in Protein- Protein Complexes from Sequence Data. PLoS ONE 6, e29104-e29104 (2011). 212 ul Amir Afsar Minhas, F., Geiss, B. J. & Ben-Hur, A. PAIRpred: Partner-specific prediction of interacting residues from sequence and structure. Proteins 82, 1142-1155, doi:10.1002/prot.24479 (2014). 213 Hurwitz, N., Schneidman-Duhovny, D. & Wolfson, H. J. Memdock: An {alpha}-helical membrane protein docking algorithm. Bioinformatics (Oxford, England), btw184-- btw184-, doi:10.1093/bioinformatics/btw184 (2016). 214 Isberg, V. et al. GPCRdb: an information system for G protein-coupled receptors. Nucleic Acids Res 45, 2936, doi:10.1093/nar/gkw1218 (2017). 215 Ballesteros, J. A. & Weinstein, H. in Methods in Neurosciences Vol. Volume 25 (ed C. Sealfon Stuart) 366-428 (Academic Press, 1995). 216 Esguerra, M., Siretskiy, A., Bello, X., Sallander, J. & Gutierrez-de-Teran, H. GPCR- ModSim: A comprehensive web based solution for modeling G-protein coupled receptors. Nucleic Acids Res 44, W455-462, doi:10.1093/nar/gkw403 (2016). 217 Rasmussen, S. G. et al. Crystal structure of the beta2 adrenergic receptor-Gs protein complex. Nature 477, 549-555, doi:10.1038/nature10361 (2011). 218 Yu, J. & Guerois, R. 
PPI4DOCK: large scale assessment of the use of homology models in free docking over more than 1000 realistic targets. Bioinformatics (Oxford, England) 32, 3760-3767, doi:10.1093/bioinformatics/btw533 (2016).


219 Thorn, K. S. & Bogan, A. A. ASEdb: a database of alanine mutations and their effects on the free energy of binding in protein interactions. Bioinformatics (Oxford, England) 17, 284-285 (2001). 220 Fischer, T. B. et al. The binding interface database (BID): a compilation of amino acid hot spots in protein interfaces. Bioinformatics (Oxford, England) 19, 1453-1454 (2003). 221 Kumar, M. D. & Gromiha, M. M. PINT: Protein-protein Interactions Thermodynamic Database. Nucleic Acids Res 34, D195-198, doi:10.1093/nar/gkj017 (2006). 222 Moal, I. H. & Fernandez-Recio, J. SKEMPI: a Structural Kinetic and Energetic database of Mutant Protein Interactions and its use in empirical models. Bioinformatics (Oxford, England) 28, 2600-2607, doi:10.1093/bioinformatics/bts489 (2012). 223 Berman, H. M. et al. The Protein Data Bank. Nucleic acids research 28, 235-242 (2000). 224 Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421, doi:10.1186/1471-2105-10-421 (2009). 225 Xiao, N., Cao, D. S., Zhu, M. F. & Xu, Q. S. protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences. Bioinformatics (Oxford, England) 31, 1857-1859, doi:10.1093/bioinformatics/btv042 (2015). 226 Du, P., Gu, S. & Jiao, Y. PseAAC-General: fast building various modes of general form of Chou's pseudo-amino acid composition for large-scale protein datasets. International journal of molecular sciences 15, 3495-3506, doi:10.3390/ijms15033495 (2014). 227 van Westen, G. J. P., Wegner, J. K., IJzerman, A. P., van Vlijmen, H. W. T. & Bender, A. Proteochemometric modeling as a tool to design selective compounds and for extrapolating to novel targets. Medchemcomm 2, 16-30, doi:10.1039/c0md00165a (2011). 228 Lin, H. The modified Mahalanobis Discriminant for predicting outer membrane proteins by using Chou's pseudo amino acid composition. Journal of theoretical biology 252, 350- 356, doi:10.1016/j.jtbi.2008.02.004 (2008). 229 Ding, H., Luo, L. & Lin, H. Prediction of cell wall lytic enzymes using Chou's amphiphilic pseudo amino acid composition. Protein Pept Lett 16, 351-355 (2009). 230 Lin, H. & Ding, H. Predicting ion channels and their types by the dipeptide mode of pseudo amino acid composition. Journal of theoretical biology 269, 64-69, doi:10.1016/j.jtbi.2010.10.019 (2011). 231 Ding, H., Liu, L., Guo, F. B., Huang, J. & Lin, H. Identify Golgi protein types with modified Mahalanobis discriminant algorithm and pseudo amino acid composition. Protein Pept Lett 18, 58-63 (2011). 232 Ding, H. et al. iCTX-type: a sequence-based predictor for identifying the types of in targeting ion channels. BioMed research international 2014, 286419, doi:10.1155/2014/286419 (2014). 233 R: A Language and Environment for Statistical Computing (Vienna, Austria, 2013). 234 Kuhn, M. Building Predictive Models in R Using the caret package. Journal of Statistical Software 28, 1-28 (2008). 235 Abdi, H. & Williams, L. J. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics 2, 433-459, doi:10.1002/wics.101 (2010). 236 Valentini, G. & Masulli, F. in Neural Nets: 13th Italian Workshop on Neural Nets, WIRN VIETRI 2002 Vietri sul Mare, Italy, May 30 – June 1, 2002 Revised Papers. (eds Maria Marinaro & Roberto Tagliaferri) 3-20 (Springer Berlin Heidelberg). 237 Xu, D., Tsai, C. J. & Nussinov, R. Hydrogen bonds and salt bridges across protein-protein interfaces. Protein Eng 10, 999-1012 (1997). 238 Tsai, C. J., Lin, S. L., Wolfson, H. J. & Nussinov, R. 
Studies of protein-protein interfaces: a statistical analysis of the hydrophobic effect. Protein Sci 6 (1997).


239 Landgraf, R., Xenarios, I. & Eisenberg, D. Three-dimensional cluster analysis identifies interfaces and functional residue clusters in proteins. J Mol Biol 307, 1487-1502, doi:10.1006/jmbi.2001.4540 (2001). 240 Yan, C., Wu, F., Jernigan, R. L., Dobbs, D. & Honavar, V. Characterization of protein- protein interfaces. The protein journal 27, 59-70, doi:10.1007/s10930-007-9108-x (2008). 241 Liu, Q., Ren, J., Song, J. & Li, J. Co-Occurring Atomic Contacts for the Characterization of Protein Binding Hot Spots. PLoS One 10, e0144486, doi:10.1371/journal.pone.0144486 (2015). 242 Keskin, O., Ma, B., Rogale, K., Gunasekaran, K. & Nussinov, R. Protein-protein interactions: organization, cooperativity and mapping in a bottom-up Systems Biology approach. Physical biology 2, S24-35, doi:10.1088/1478-3975/2/2/S03 (2005). 243 Suzek, B. E. et al. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics (Oxford, England) 31, 926-932, doi:10.1093/bioinformatics/btu739 (2015). 244 Wickam, H. ggplot2: Elegant Graphics for Data Analysis., (Springer-Verlag, 2009). 245 Inc., P. T. Collaborative data science, (2015). 246 Inc., R. Easy web applications in R., (2013). 247 Candel, A. & Parmar, V. Deep Learning with H2O. (H2O.ai Inc., 2015). 248 Grant, B. J., Rodrigues, A. P., ElSawy, K. M., McCammon, J. A. & Caves, L. S. Bio3d: an R package for the comparative analysis of protein structures. Bioinformatics (Oxford, England) 22, 2695-2696, doi:10.1093/bioinformatics/btl461 (2006). 249 Nielsen, M. Neural Networks and Deep Learning. (2015). 250 LeCun, Y., Bottou, L., Orr, G. B. & Müller, K. R. in Neural Networks: Tricks of the Trade (eds Genevieve B. Orr & Klaus-Robert Müller) 9-50 (Springer Berlin Heidelberg, 1998). 251 GoodFellow, I. J., Warde-Farley, D., Mirza, M., Courville, A. & Bengio, Y. Maxout Networks. JMLR WCP 28, 1319-1327 (2013). 252 Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: Synthetic minority over-sampling technique. J Artif Intell Res 16, 321-357 (2002). 253 Blagus, R. & Lusa, L. SMOTE for high-dimensional class-imbalanced data. Bmc Bioinformatics 14, doi:Artn 106 10.1186/1471-2105-14-106 (2013). 254 The UniProt, C. UniProt: the universal protein knowledgebase. Nucleic Acids Res 45, D158-D169, doi:10.1093/nar/gkw1099 (2017). 255 Kang, Y. et al. Crystal structure of rhodopsin bound to arrestin by femtosecond X-ray laser. Nature 523, 561-567, doi:10.1038/nature14656 (2015). 256 Kruse, A. C. et al. Structure and dynamics of the M3 muscarinic acetylcholine receptor. Nature 482, 552-556, doi:10.1038/nature10867 (2012). 257 Bahar, I., Lezon, T. R., Bakan, A. & Shrivastava, I. H. Normal mode analysis of biomolecular structures: functional mechanisms of membrane proteins. Chemical reviews 110, 1463-1497, doi:10.1021/cr900095e (2010). 258 Munteanu, C. R. et al. Solvent accessible surface area-based hot-spot detection methods for protein-protein and protein-nucleic acid interfaces. J Chem Inf Model 55, 1077-1086, doi:10.1021/ci500760m (2015). 259 Kim, D. E., Chivian, D. & Baker, D. Protein structure prediction and analysis using the Robetta server. Nucleic Acids Res 32, W526-531, doi:10.1093/nar/gkh468 (2004). 260 Zhu, X. & Mitchell, J. C. KFC2: a knowledge-based hot spot prediction method based on interface solvation, atomic density, and plasticity features. Proteins 79, 2671-2683, doi:10.1002/prot.23094 (2011).


261 Martins, J. M., Ramos, R. M., Pimenta, A. C. & Moreira, I. S. Solvent-accessible surface area: How well can be applied to hot-spot detection? Proteins 82, 479-490, doi:10.1002/prot.24413 (2014). 262 Zhanhua, C., Gan, J. G., Lei, L., Sakharkar, M. K. & Kangueane, P. Protein subunit interfaces: heterodimers versus homodimers. Bioinformation 1, 28-39 (2005). 263 Jones, S. Computational and structural characterisation of protein associations. Advances in experimental medicine and biology 747, 42-54, doi:10.1007/978-1-4614- 3229-6_3 (2012). 264 Halperin, I., Wolfson, H. & Nussinov, R. Protein-protein interactions; coupling of structurally conserved residues and of hot spots across interfaces. Implications for docking. Structure 12, 1027-1038, doi:10.1016/j.str.2004.04.009 (2004). 265 Li, X., Keskin, O., Ma, B., Nussinov, R. & Liang, J. Protein-protein interactions: hot spots and structurally conserved residues often locate in complemented pockets that pre- organized in the unbound states: implications for docking. J Mol Biol 344, 781-795, doi:10.1016/j.jmb.2004.09.051 (2004). 266 Carbonell, P., Nussinov, R. & del Sol, A. Energetic determinants of protein binding specificity: insights into protein interaction networks. Proteomics 9, 1744-1753, doi:10.1002/pmic.200800425 (2009). 267 Fleming, P. J. & Richards, F. M. Protein packing: dependence on protein size, secondary structure and amino acid composition. J Mol Biol 299, 487-498, doi:10.1006/jmbi.2000.3750 (2000). 268 Yogurtcu, O. N., Erdemli, S. B., Nussinov, R., Turkay, M. & Keskin, O. Restricted mobility of conserved residues in protein-protein interfaces in molecular simulations. Biophys J 94, 3475-3485, doi:10.1529/biophysj.107.114835 (2008). 269 Jones, S. & Thornton, J. M. Searching for functional sites in protein structures. Current opinion in chemical biology 8, 3-7, doi:10.1016/j.cbpa.2003.11.001 (2004). 270 Janin, J., Bahadur, R. P. & Chakrabarti, P. Protein-protein interaction and quaternary structure. Q Rev Biophys 41, 133-180, doi:10.1017/S0033583508004708 (2008). 271 Ma, B., Elkayam, T., Wolfson, H. & Nussinov, R. Protein-protein interactions: structurally conserved residues distinguish between binding sites and exposed protein surfaces. Proc Natl Acad Sci U S A 100, 5772-5777, doi:10.1073/pnas.1030237100 (2003). 272 Saldano, T. E., Monzon, A. M., Parisi, G. & Fernandez-Alberti, S. Evolutionary Conserved Positions Define Protein Conformational Diversity. PLoS Comput Biol 12, e1004775, doi:10.1371/journal.pcbi.1004775 (2016). 273 Moreira, I. S. The Role of Water Occlusion for the Definition of a Protein Binding Hot- Spot. Current topics in medicinal chemistry 15, 2068-2079 (2015). 274 Moreira, I. S., Ramos, R. M., Martins, J. M., Fernandes, P. A. & Ramos, M. J. Are hot-spots occluded from water? Journal of biomolecular structure & dynamics 32, 186-197, doi:10.1080/07391102.2012.758598 (2014). 275 Melo, R., Fieldhouse, R., Melo, A., Correia, J. D. G. & Cordeiro, M. N. D. S. A Machine- Learning Approach for Hot-Spot Detection at Protein-Protein Interfaces. IJMS, 1-16 (2016). 276 Hu, Z., Ma, B., Wolfson, H. & Nussinov, R. Conservation of polar residues as hot spots at protein interfaces. Proteins 39, 331-342 (2000). 277 Jones, S. & Thornton, J. M. Analysis of protein-protein interaction sites using surface patches. J Mol Biol 272 (1997). 278 Jones, S. & Thornton, J. M. Prediction of protein-protein interaction sites using patch analysis. J Mol Biol 272, 133-143, doi:10.1006/jmbi.1997.1233 (1997). 279 Xu, D., Lin, S. L. 
& Nussinov, R. Protein binding versus protein folding: the role of hydrophilic bridges in protein associations. J Mol Biol 265, 68-84, doi:10.1006/jmbi.1996.0712 (1997).


280 Sibi, P., Jones, S. A. & Siddarth, P. Analysis of Different Activation Functions Using Back Propagation Neural Networks. Journal of Theoretical and Applied Information Technology 47, 1264-1268 (2013). 281 Heaton, J. Introduction to Neural Networks for Java, 2nd Edition. (Heaton Research, Inc., 2008). 282 Zou, W., Li, Y. & Tang, A. Effects of the number of hidden nodes used in a structured- based neural network on the reliability of image classification. Neural Computing and Applications 18, 249-260, doi:10.1007/s00521-008-0177-3 (2009). 283 Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15, 1929-1958 (2014). 284 Sensoy, O., Almeida, J. G., Shabbir, J., Moreira, I. S. & Morra, G. in Molecular Biology Protocols (Submitted). 285 Ozcan, O., Uyar, A., Doruker, P. & Akten, E. D. Effect of intracellular loop 3 on intrinsic dynamics of human beta2-adrenergic receptor. BMC structural biology 13, 29, doi:10.1186/1472-6807-13-29 (2013). 286 Pydi, S. P., Singh, N., Upadhyaya, J., Bhullar, R. P. & Chelikani, P. The third intracellular loop plays a critical role in bitter taste receptor activation. Biochim Biophys Acta 1838, 231-236, doi:10.1016/j.bbamem.2013.08.009 (2014). 287 Sun, B. et al. Crystal structure of the adenosine A2A receptor bound to an antagonist reveals a potential allosteric pocket. Proc Natl Acad Sci U S A 114, 2066-2071, doi:10.1073/pnas.1621423114 (2017). 288 Gomez-Mouton, C. et al. Filamin A interaction with the CXCR4 third intracellular loop regulates endocytosis and signaling of WT and WHIM-like receptors. Blood 125, 1116- 1125, doi:10.1182/blood-2014-09-601807 (2015). 289 Almeida, J. G., Bonvin, A. M. J. J. & Moreira, I. S. SpotOnDB: a global assessment of protein-protein interfaces and binding hot-spots (In preparation). 290 Preto, A. J. et al. Understanding the Binding Selectivity of G-protein Coupled Receptors Toward G-proteins and Arrestins: Application to the Dopamine Receptor Family. (In preparation). 291 Shirian, J., Sharabi, O. & Shifman, J. M. Cold Spots in Protein Binding. Trends in biochemical sciences 41, 739-745, doi:10.1016/j.tibs.2016.07.002 (2016). 292 Miller, S., Lesk, A. M., Janin, J. & Chothia, C. The accessible surface area and stability of oligomeric proteins. Nature 328, 834-836, doi:10.1038/328834a0 (1987). 293 Sun, T., Zhou, B., Lai, L. & Pei, J. Sequence-based prediction of protein protein interaction using a deep-learning algorithm. BMC Bioinformatics 18, 277, doi:10.1186/s12859-017- 1700-2 (2017). 294 Gomes, J., Ramsundar, B., Feingberg, E. N. & Pande, V. S. Atomic Convolutional Networks for Predicting Protein-Ligand Binding Affinity. arXiv, doi:arXiv:1703.10603 (2017). 295 Ma, J. Usefulness and limitations of normal mode analysis in modeling dynamics of biomolecular complexes. Structure 13, 373-380, doi:10.1016/j.str.2005.02.002 (2005).


F. List of Abbreviations

AAC – Amino Acid Composition
AMBER – Assisted Model Building with Energy Refinement
AUROC – Area Under the Receiver Operating Characteristic curve
ASEdb – Alanine Scanning Energetics database
BID – Binding Interface Database
BLAST – Basic Local Alignment Search Tool
BLOSUM – BLOcks Substitution Matrix
BW – Ballesteros-Weinstein
CASP – Critical Assessment of protein Structure Prediction
cAMP – Cyclic Adenosine Monophosphate
CATH – Class, Architecture, Topology, and Homologous superfamily
CD-HIT – Cluster Database at High Identity with Tolerance
CHARMM – Chemistry at HARvard Macromolecular Mechanics
CLC – Chloride Conducting Channel
Cons – Conservation Scores
CPORT – Consensus Prediction Of interface Residues in Transient complexes
Cryo-EM – Cryo-electron Microscopy
CS-BLAST – Context-Specific Basic Local Alignment Search Tool
ECL1 – Extracellular Loop 1
ECL2 – Extracellular Loop 2
ECL3 – Extracellular Loop 3
E. Factor – Enrichment Factor
FASTA – FAST-All (sequence alignment program and file format)
FDR – False Discovery Rate
FN – False Negative
FNR – False Negative Rate
FP – False Positive
FSSP – Families of Structurally Similar Proteins
GROMACS – GROningen MAchine for Chemical Simulations
HADDOCK – High Ambiguity Driven protein-protein DOCKing
H-bonds – Hydrogen Bonds
HMM – Hidden Markov Model
HS – Hot-Spot
HX8 – Helix 8
ICL1 – Intracellular Loop 1
ICL2 – Intracellular Loop 2
ICL3 – Intracellular Loop 3
MANOVA – Multivariate Analysis of Variance
MCC – Matthews Correlation Coefficient
MD – Molecular Dynamics
mfDCA – mean-Field Direct Coupling Analysis
MP – Membrane Protein
MSA – Multiple Sequence Alignment
MUSCLE – MUltiple Sequence Comparison by Log-Expectation
NMA – Normal Mode Analysis
NPV – Negative Predictive Value
NS – Null-Spot
PAAC – Pseudo-Amino Acid Composition
PAM – Point Accepted Mutation


PDB – Protein Data Bank
PCA – Principal Component Analysis
PCM – ProteoChemometric Modelling
PINT – Protein-protein Interactions Thermodynamic database
PPIPP – Protein-Protein Interaction Prediction Platform
PPV – Positive Predictive Value
PSI-BLAST – Position-Specific Iterative Basic Local Alignment Search Tool
PTM – Post-Translational Modification
PMP – Peripheral Membrane Protein
PSSM – Position-Specific Scoring Matrix
RF – Random Forest
RMSD – Root-Mean-Square Deviation
SASA – Solvent Accessible Surface Area
sd – Standard Deviation
SKEMPI – Structural Kinetic and Energetic database of Mutant Protein Interactions
SVM – Support Vector Machine
TM – Transmembrane
TMH – Transmembrane Helix
TMH1 – Transmembrane Helix 1
TMH2 – Transmembrane Helix 2
TMH3 – Transmembrane Helix 3
TMH4 – Transmembrane Helix 4
TMH5 – Transmembrane Helix 5
TMH6 – Transmembrane Helix 6
TMH7 – Transmembrane Helix 7
TN – True Negative
TP – True Positive
UniProt – Universal Protein Resource
VMD – Visual Molecular Dynamics
WHISCY – WHat Information does Surface Conservation Yield?


G. Annex

Annex 1 - This table summarizes the machine-learning algorithms mentioned throughout this thesis, with a short explanation of each.

Artificial Neural Network (ANN) – An ANN is a machine-learning algorithm inspired by how neurons connect and interact with each other. Essentially, it is composed of layers of neurons, each of which outputs information to all neurons in the following layer and receives information from all neurons in the previous layer. Each connection has a trainable weight (w) and each neuron a trainable bias (b), which act as follows: O = w * i(O') + b, where O is the output of the current layer and i(O') is a function of the output O' of the previous layer. Three types of layers are essential to ANNs – the input layer, which takes direct input from the data; the output layer, which provides the prediction; and one or more hidden layers, which pass data from one layer to the next. To train an ANN with the most popular training algorithm, back-propagation, the weights are adjusted iteratively according to the error 1. (A minimal numerical sketch of a forward pass is given at the end of this group of entries.)

Bagging algorithms – Bagging is short for bootstrap aggregating; it is not a machine-learning algorithm per se but rather a meta-algorithm that uses other algorithms to achieve better performance. Instead of training a single model on the entire training set, it creates several random subsets with replacement and trains identical models on these subsets. It then aggregates the predictions of the resulting models (for example by averaging or majority vote) to obtain a final, more robust model.

Convolutional Neural Network (CNN) – CNNs are a type of deep ANN which has gained a great deal of popularity in image processing and high-throughput gene analysis. Their main feature is the convolution of the data – a grid (bidimensional for images and unidimensional for sequences) takes as input several pixels or DNA bases and, across successive hidden layers, progressively reduces the dimensionality of the input data 1.

Deep-learning – Deep learning is more easily considered an ANN architecture than an actual machine-learning algorithm. When an ANN has more than one hidden layer, it can be considered as having a deep architecture, thus being referred to as deep-learning. Deep-learning has spawned a field of its own within machine-learning, with intense research on both theoretical and practical aspects, particularly in bioinformatics 1-19. However, despite this intense research, its theoretical foundation is still lacking – which hyperparameters are best, such as the number of hidden layers and the number of neurons in each hidden layer, is not well understood – and it can be rather difficult to determine exactly what makes deep learning such a successful technique. Furthermore, it tends to require large amounts of data, making it hard to apply to many biological problems where data retrieval can be challenging.

Discriminant Analysis – Discriminant analysis aims to determine whether a set of independent variables is able to predict a dependent variable. To do so, it combines various uncorrelated functions of the independent variables.

Hidden Markov Models (HMM) – A Markov model is composed of several Markov processes – state transitions that depend solely on the current state and not on past states – in a chain or network of possible states. HMMs are similar processes, but the sequence of state transitions is unknown – it is hidden, hence the term. The reason they are so sought after in computational biology is that they are able to address three distinct and highly relevant problems: scoring a sequence in a multiple sequence alignment by determining the probability of an HMM generating that sequence; finding the optimal state sequence for an HMM to generate a sequence; and finding the structure and parameters that allow an HMM to solve a problem as a machine-learning model 20.

k-means clustering – k-means clustering is a machine-learning procedure that finds the best possible fit for a given number k of clusters. It determines the best position of the centroid (average position) of each cluster. At each iteration, the squared error function is calculated:

SE(V) = Σ (i = 1 to c) Σ (j = 1 to c_i) ||x_j − v_i||²,

where ||x_j − v_i|| is the Euclidean distance between observation x_j (assigned to cluster i) and centroid v_i, c_i is the number of observations in cluster i and c is the number of centroids. Minimizing this function yields the centroids with the smallest total distance to the observations assigned to them. (A short usage sketch is given at the end of this group of entries.)

k-nearest neighbours – The k-nearest neighbours algorithm uses Euclidean distance to assign a class to a new observation. To do so, it calculates the distance to the k nearest labelled observations and, using a majority vote or another operation (a logistic regression, for example), assigns the most likely class to the unlabelled observation. A common refinement is to give labelled observations that are nearer more weight in the decision than those that are distant. It is considered an instance-based (lazy) algorithm because all computation is done at classification time 21.

Logistic regression – Logistic regressions are regression models used for categorical dependent variables. For multivariate cases, the output of a logistic regression is:

O(x) = e^(a0 + a1X1 + a2X2 + … + anXn) / (1 + e^(a0 + a1X1 + a2X2 + … + anXn)),

where X1, X2, …, Xn are the variables, a1, a2, …, an their coefficients and a0 the intercept 22.

Random Forest (RF) – RFs are ensembles of binary decision trees (BDTs). A BDT provides one of two outputs based on the input; this decision can be triggered by simple Boolean tests (True or False) or by more complex activation functions 23.

Support Vector Machine (SVM) – SVMs are non-probabilistic binary linear classifiers – rather than providing the probability that an object belongs to either class, they assign it to a single class. An SVM places an (n−1)-dimensional hyperplane in an n-dimensional space that divides the data into two classes. It became highly popular because it is reliable and computationally inexpensive thanks to its simple construction – it only needs to determine which observations define the margins in order to draw the hyperplane. The optimal hyperplane maximizes its distance (the margin) to the closest observations of each class – the support vectors – allowing the two classes to be separated accurately using only a small number of observations 24.
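As a brief illustration of the k-means objective above, the base R sketch below clusters a small random dataset with the built-in kmeans() function and recomputes SE(V) by hand; the toy data and the choice of k = 3 are arbitrary assumptions made only for this example.

# Minimal k-means sketch in base R; the toy data and k = 3 are arbitrary choices.
set.seed(42)
X <- matrix(rnorm(200), ncol = 2)      # 100 observations, 2 features

fit <- kmeans(X, centers = 3)          # built-in k-means clustering

# Recompute SE(V): sum over clusters i and their members j of ||x_j - v_i||^2
se <- sum(sapply(seq_len(nrow(X)), function(j) {
  v <- fit$centers[fit$cluster[j], ]   # centroid of the cluster assigned to x_j
  sum((X[j, ] - v)^2)
}))

se                   # squared error computed by hand
fit$tot.withinss     # the same quantity as reported by kmeans()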

Synthetic Minority Over-sampling Technique (SMOTE) – SMOTE is a technique which uses the k-nearest neighbours algorithm to generate new entries in the dataset. For each minority-class entry, it draws lines between that entry and its k nearest neighbours; synthetic data points are then generated along these lines 25 (see the sketch below). While this technique is effective in low-dimensional spaces, high dimensionality has been shown to decrease the performance of SMOTE 26.
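The base R sketch below illustrates the SMOTE interpolation step described above for a single minority-class observation. The toy data, the choice of k = 3 and the helper name smote_point are assumptions made only for this illustration and do not correspond to the SMOTE implementation used in this work.

# Minimal sketch of SMOTE-style interpolation in base R (illustrative only).
set.seed(7)
minority <- matrix(rnorm(20), ncol = 2)   # 10 minority-class observations

smote_point <- function(data, i, k = 3) {
  # Euclidean distances from observation i to every observation in the data
  d <- sqrt(rowSums((data - matrix(data[i, ], nrow(data), ncol(data), byrow = TRUE))^2))
  neighbours <- order(d)[2:(k + 1)]       # k nearest neighbours (excluding the point itself)
  nb <- data[sample(neighbours, 1), ]     # pick one neighbour at random
  gap <- runif(1)                         # random position along the connecting line
  data[i, ] + gap * (nb - data[i, ])      # synthetic sample on the segment
}

smote_point(minority, i = 1)              # one new synthetic minority observation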

Annex 2 - This table describes all columns of the dataset used to construct SpotOnDB.

Column number – Explanation

1 – A unique identifier for each residue entry
2 – Protein Data Bank 27 entry chains considered for each PPI4DOCK 28 complex, separated by an underscore
3:5 – Chain, residue number and residue name
6 – Prediction of a residue as HS (1) or NS (0) according to the SpotOn algorithm 29
7:18 – Features related to Solvent Accessible Surface Area (SASA) 30,31
19:58 – Position-Specific Scoring Matrix (PSSM) values and PSSM proportions as calculated by PSI-BLAST 32
59, 60 – Number of atoms within 2.5 and 4.0 Å
61 – Number of hydrogen bonds being formed by the residue
62 – Number of hydrophobic interactions with the residue
63 – Number of π-π interactions (aromatic) with the residue
64 – Number of T-stacking interactions (aromatic) with the residue
65 – Number of cation-π interactions (aromatic) with the residue
66 – Number of salt bridges formed by the residue
67:86 – Number of monomeric interfacial residues
87 – Total number of interfacial residues
88 – Total ΔSASA 30,31
89:168 – Amphiphilic Pseudo-Amino Acid Composition (features extracted with the protr package 33 from R)
169:188 – Amino Acid Composition (features extracted with the protr package 33 from R)
189:888 – Scales-based descriptors derived from 20+ classes of 2D and 3D descriptors, including topological chemical descriptors, weighted holistic invariant molecular descriptors and vectors of hydrophobic, steric and electronic properties, among others (features extracted with the protr package 33 from R)
889:891 – Raw b-factor values for the whole residue, backbone atoms and sidechain atoms
892:894 – Normalized b-factor values for the whole residue, backbone atoms and sidechain atoms according to 34
895 – Normalized conservation scores from Consurf 35
896:901 – Number of intramonomer, intermonomer and HS neighbours at 8 Å and 10 Å

A minimal example of loading this table and inspecting a few of these columns is given below.
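Assuming the per-residue table above is exported as a flat comma-separated file without a header row (the file name spoton_db.csv is hypothetical and not part of the actual SpotOnDB distribution), a few of the columns could be inspected in R as follows.

# Hypothetical example: loading the SpotOnDB per-residue table and inspecting
# a few of the columns described above.
db <- read.csv("spoton_db.csv", header = FALSE)

ids        <- db[, 1]     # column 1  - unique residue identifier
complexes  <- db[, 2]     # column 2  - PDB chains of the PPI4DOCK complex
hs_label   <- db[, 6]     # column 6  - SpotOn prediction: 1 = HS, 0 = NS
sasa_feats <- db[, 7:18]  # columns 7:18 - SASA-related features

# Fraction of interfacial residues predicted as hot-spots
mean(hs_label == 1)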


Annex 3 - Papers published, accepted or in preparation during this thesis work.

Almeida JG, Preto AJ, Koukos P, Bonvin AMJJ, Moreira IS, Membrane proteins structures: a review on computational modeling tools, BBA Biomembranes 1859, 10, 2021-2039 (2017) (Review article) 37

Background – Membrane proteins (MPs) play diverse and important functions in living organisms. They constitute 20% to 30% of the known bacterial, archaean and eukaryotic organisms' genomes. In humans, their importance is emphasized as they represent 50% of all known drug targets. Nevertheless, experimental determination of their three-dimensional (3D) structure has proven to be both time consuming and rather expensive, which has led to the development of computational algorithms to complement the available experimental methods and provide valuable insights.
Scope of review – This review highlights the importance of membrane proteins and how computational methods are capable of overcoming challenges associated with their experimental characterization. It covers various MP structural aspects, such as lipid interactions, allostery, and structure prediction, based on methods such as Molecular Dynamics (MD) and Machine-Learning (ML).
Major conclusions – Recent developments in algorithms, tools and hybrid approaches, together with the increase in both computational resources and the amount of available data, have resulted in increasingly powerful and trustworthy approaches to model MPs.
General significance – Even though MPs are elementary and important in nature, the determination of their 3D structure has proven to be a challenging endeavor. Computational methods provide a reliable alternative to experimental methods. In this review, we focus on computational techniques to determine the 3D structure of MPs and characterize their binding interfaces. We also summarize the most relevant databases and software programs available for the study of MPs.

Lemos A, Melo R, Preto AJ, Almeida JG, Moreira IS, Cordeiro MNDS, In silico studies targeting G-protein coupled receptors for drug research against Parkinson's disease, Current Neuropharmacology, submitted (Book chapter) 38

Parkinson's Disease (PD) is a long-term neurodegenerative brain disorder that mainly affects the motor system. The causes are still unknown and, even though there is currently no cure, several therapeutic options are available to manage its symptoms. The development of novel anti-parkinsonian agents and an understanding of their proper and optimal use are, indeed, in high demand. For the last decades, L-3,4-DihydrOxyPhenylAlanine or levodopa (L-DOPA) has been the gold-standard therapy for the symptomatic treatment of motor dysfunctions associated with PD. However, the development of dyskinesias and motor fluctuations (wearing-off and on-off phenomena) associated with long-term L-DOPA replacement therapy has limited its antiparkinsonian efficacy. Non-dopaminergic therapies have been widely investigated as an attempt to counteract the motor side effects associated with dopamine replacement therapy. Being one of the largest cell membrane protein families, G-Protein-Coupled Receptors (GPCRs) have become a relevant target for drug discovery focused on a wide range of therapeutic areas, including Central Nervous System (CNS) diseases. The modulation of specific GPCRs potentially implicated in PD, excluding dopamine receptors, may provide promising non-dopaminergic therapeutic alternatives for the symptomatic treatment of PD. In this review, we focused on the impact of specific GPCR subclasses, including dopamine receptors, adenosine receptors, muscarinic acetylcholine receptors, metabotropic glutamate receptors, and 5-hydroxytryptamine receptors, on the pathophysiology of PD and the importance of structure- and ligand-based in silico approaches for the development of small molecules to target these receptors.

Moreira IS, Koukos P, Melo R, Almeida JG, Preto AJ, Schaarschmidt J, Trellet M, Gumus ZH, Costa J, Bonvin AMJJ, SpotOn: a web server for protein-protein binding hot-spots, Scientific Reports 7, 8007 (2017) (Scientific article) 29

We present SpotOn, a web server to identify and classify interfacial residues as Hot-Spots (HS) and Null-Spots (NS). SpotOn implements a robust algorithm with a demonstrated accuracy of 0.95 and sensitivity of 0.98 on an independent test set. The predictor was developed using an ensemble machine learning approach with up-sampling of the minor class. It was trained on 53 complexes using various features, based on both protein 3D structure and sequence. The SpotOn web interface is freely available at: http://milou.science.uu.nl/services/SPOTON/.

Almeida JG, Bonvin AMJJ, Moreira IS, SpotOnDB: a global assessment of protein-protein interfaces and binding hot-spots (in preparation)

Understanding protein-protein interfaces is crucial to explain complex formation and determine the foundations that govern biological networks. One key aspect is the existence and prevalence of Hot-Spots (HS), residues which, upon alanine mutation, negatively impact the formation of protein-protein complexes. While several protein complexes have been individually studied regarding their interface and the existence of HS, studies comprising large amounts of data would be highly valuable in revealing relevant patterns. In this work, we use our computational pipeline, SpotOn, to determine several structural and sequence-related features of all predicted HS in the complexes of the non-redundant database PPI4DOCK (3,746 HS for 1,403 complexes with 66,710 interfacial residues). The resulting big data analysis, which is available as an interactive online database at http://milou.science.uu.nl/services/SPOTONDB/, provides insights into HS structural and physico-chemical characteristics in protein-protein interfaces.

Sensoy O, Almeida JG, Shabbir J, Moreira IS, Morra G, Computational studies of G-protein coupled receptor complexes: structure and dynamics, Molecular Biology Protocols, accepted (Book chapter) 36

G-protein coupled receptors (GPCRs) are ubiquitously expressed transmembrane proteins associated with a wide range of diseases such as Alzheimer's disease, Parkinson's disease and schizophrenia, and are also implicated in several abnormal heart conditions. As such, this family of receptors is regarded as an excellent set of drug targets. However, due to the high number of intracellular signaling partners, these receptors have complex interaction networks and it becomes challenging to modulate their function. Experimentally determined structures give detailed information on the salient structural properties of these signaling complexes, but they fall short of providing mechanistic insights into the underlying processes. This chapter presents some of the computational tools, namely molecular dynamics, molecular docking and molecular modeling, and the related analysis methods that have been used to complement experimental findings.

Preto AJ, Almeida JG, Melo A, Kurkcuoglu Z, Melo R, Telle M, Melo A, Natalia MNDS, Morra G, Sensoy O, Bonvin AMJJ, Moreira IS, Understanding the Binding Selectivity of G-protein Coupled Receptors Toward G-proteins and Arrestins: Application to the Dopamine Receptor Family (in preparation)

Elucidation of the crystal structures pertaining to ternary complexes of the β2-Adrenergic Receptor (β2-AR) and Rhodopsin bound to Gs and Arrestin-1, respectively, has enhanced our understanding of the molecular determinants responsible for selective coupling of G-protein-coupled-receptor (GPCR)-G-protein or GPCR-Arrestin (Arr) complexes. Nevertheless, these are single examples of an immense number of possible interfaces established between these molecular systems. In this study, we focus on catecholamine-bound GPCRs, in particular the Dopamine Receptor (DR) family, in order to bring new insights into the physiological and pharmacological properties of these important drug targets. A variety of computational methods were applied to investigate the putative interactions between the protein interfaces of all members of the DR family (D1R, D2R, D3R, D4R and D5R) and the protein interfaces of their binding partners (Arrs: Arr-2, Arr-3; G-proteins: Gq, Gz, Gt2, Gi1, Gi2, Gi3, GsS, Go, GsL). To effectively compare DR-partner interactions, we calculated solvent accessibility, structural conservation, residue composition and propensity, packing density, interfacial residue mobility, and residue pairing. We have also employed an elastic network model to understand their dynamics upon coupling and the differences in behaviour between the DR binding partners. Elucidation of the pharmacologically relevant interactions between DR complexes and their binding partners at the molecular level allows us not only to unlock the determinants of functional selectivity, but also to expedite the design of more selective and hence safer therapeutic molecules for the treatment of GPCR-mediated diseases.


H. Annex References

1 Nielsen, M. Neural Networks and Deep Learning. (2015). 2 Hinton, G. E., Osindero, S. & Teh, Y.-W. A Fast Learning Algorithm for Deep Belief Nets. Neural Computation 18, 1527-1554, doi:10.1162/neco.2006.18.7.1527 (2006). 3 Salakhutdinov, R. & Hinton, G. E. Deep Boltzmann Machines. Proceedings of The 12th International Conference on Artificial Intelligence and Statics, 448-455, doi:10.1109/CVPR.2009.5206577 (2009). 4 Deepa, S. N. & Devi, B. A. Modified Radial Basis Function Network for Brain Tumor Classification. Swarm, Evolutionary, and Memetic Computing, Pt I 7076, 366-371 (2011). 5 Spencer, M., Eickholt, J. & Jianlin, C. A Deep Learning Network Approach to ab initio Protein Secondary Structure Prediction. IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM 12, 103-112, doi:10.1109/TCBB.2014.2343960 (2015). 6 Heffernan, R. et al. Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning. Sci Rep 5, 11476, doi:10.1038/srep11476 (2015). 7 Carneiro, G., Nascimento, J. & Bailey, A. P. Unregistered Multiview Mammogram Analysis with Pre-trained Deep Learning Models. Medical Image Computing and Computer-Assisted Intervention 9351, 652-660 (2015). 8 Liang, M., Li, Z., Chen, T. & Zeng, J. Integrative Data Analysis of Multi-Platform Cancer Data with a Multimodal Deep Learning Approach. IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM 12, 928-937, doi:10.1109/TCBB.2014.2377729 (2015). 9 Hua, K. L., Hsu, C. H., Hidayati, S. C., Cheng, W. H. & Chen, Y. J. Computer-aided classification of lung nodules on computed tomography images via deep learning technique. OncoTargets and therapy 8, 2015-2022, doi:10.2147/OTT.S80733 (2015). 10 Wang, S., Peng, J., Ma, J. & Xu, J. Protein Secondary Structure Prediction Using Deep Convolutional Neural Fields. Sci Rep 6, 18962, doi:10.1038/srep18962 (2016). 11 Pang, S. & Yang, X. Deep Convolutional Extreme Learning Machine and Its Application in Handwritten Digit Classification. Computational intelligence and neuroscience 2016, 3049632, doi:10.1155/2016/3049632 (2016). 12 Shimizu, R. et al. in 2016 International SoC Design Conference (ISOCC). 191-192. 13 Nie, D., Zhang, H., Adeli, E., Liu, L. & Shen, D. 3D Deep Learning for Multi-modal Imaging- Guided Survival Time Prediction of Brain Tumor Patients. Medical image computing and computer-assisted intervention : MICCAI ... International Conference on Medical Image Computing and Computer-Assisted Intervention 9901, 212-220, doi:10.1007/978-3-319- 46723-8_25 (2016). 14 Yuan, Y. et al. DeepGene: an advanced cancer type classifier based on deep learning and somatic point mutations. BMC Bioinformatics 17, 476, doi:10.1186/s12859-016-1334-9 (2016). 15 Vougas, K. N. et al. Deep Learning and Association Rule Mining for Predicting Drug Response in Cancer. bioRxiv, doi:10.1101/070490 (2016). 16 Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res 26, 990-999, doi:10.1101/gr.200535.115 (2016). 17 Kadurin, A. et al. The cornucopia of meaningful leads: Applying deep adversarial autoencoders for new molecule development in oncology. Oncotarget, doi:10.18632/oncotarget.14073 (2016). 18 Angermueller, C., Parnamaa, T., Parts, L. & Stegle, O. Deep learning for computational biology. Molecular systems biology 12, 878, doi:10.15252/msb.20156651 (2016).


19 Wang, S., Sun, S., Li, Z., Zhang, R. & Xu, J. Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model. PLoS Comput Biol 13, e1005324, doi:10.1371/journal.pcbi.1005324 (2017). 20 Eddy, S. R. Hidden Markov models. Curr Opin Struct Biol 6, 361-365 (1996). 21 Peterson, L. E. K-nearest neighbor. Scholarpedia 4, 1883, doi:doi::10.4249/scholarpedia.1883 (2009). 22 Hosmer, D. W. & Wang, C. Y. Stepwise Logistic Regression. Biometrics 34, 158-158 (1978). 23 Ho, T. K. Vol. 1 278-282 vol.271 (1995). 24 Drew Conway, J. M. W. Machine Learning for Hackers. (1989). 25 Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: Synthetic minority over-sampling technique. J Artif Intell Res 16, 321-357 (2002). 26 Blagus, R. & Lusa, L. SMOTE for high-dimensional class-imbalanced data. Bmc Bioinformatics 14, doi:Artn 106 10.1186/1471-2105-14-106 (2013). 27 Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Research 28, 235-242 (2000). 28 Yu, J. & Guerois, R. PPI4DOCK: large scale assessment of the use of homology models in free docking over more than 1000 realistic targets. Bioinformatics (Oxford, England) 32, 3760-3767, doi:10.1093/bioinformatics/btw533 (2016). 29 Moreira, I. S. et al. SpotOn: a web server for protein-protein binding hot-spots. Sci. Rep. (accepted) (2017). 30 Munteanu, C. R. et al. Solvent accessible surface area-based hot-spot detection methods for protein-protein and protein-nucleic acid interfaces. J Chem Inf Model 55, 1077-1086, doi:10.1021/ci500760m (2015). 31 Melo, R., Fieldhouse, R., Melo, A., Correia, J. D. G. & Cordeiro, M. N. D. S. A Machine- Learning Approach for Hot-Spot Detection at Protein-Protein Interfaces. IJMS, 1-16 (2016). 32 Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25 (1997). 33 Xiao, N., Cao, D. S., Zhu, M. F. & Xu, Q. S. protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences. Bioinformatics (Oxford, England) 31, 1857-1859, doi:10.1093/bioinformatics/btv042 (2015). 34 Liu, Q., Li, Z. & Li, J. Use B-factor related features for accurate classification between protein binding interfaces and crystal packing contacts. BMC Bioinformatics 15 Suppl 16, S3, doi:10.1186/1471-2105-15-S16-S3 (2014). 35 Ashkenazy, H. et al. ConSurf 2016: an improved methodology to estimate and visualize evolutionary conservation in macromolecules. Nucleic Acids Res 44, W344-350, doi:10.1093/nar/gkw408 (2016). 36 Sensoy, O., Almeida, J. G., Shabbir, J., Moreira, I. S. & Morra, G. in Molecular Biology Protocols (Submitted). 37 Almeida, J. G., Preto, A. J., Koukos, P. I., Bonvin, A. & Moreira, I. S. MEMBRANE PROTEINS STRUCTURES: A review on computational modeling tools. Biochim Biophys Acta, doi:10.1016/j.bbamem.2017.07.008 (2017). 38 Lemos, A. et al. In silico studies targeting G-protein coupled receptors for drug research against Parkinson’s disease. Current Neuropharmacology (Submitted).
