Modelling the structure and interactions of leukocyte integrins

Kyle-Richard Dawson

Submitted in fulfilment of the requirements for the degree of Magister Scientiae: Biochemistry (MSc) in the Faculty of Science at the Nelson Mandela University

9 April 2019

Supervisor: Prof Vaughan Oosthuizen


Declaration

I know that plagiarism is wrong. Plagiarism is to use another’s work and pretend that it is one’s own.

I have used the Bioinformatics convention for citation and referencing. Each contribution to, and quotation in, this dissertation from the work(s) of other people has been attributed, and has been cited and referenced.

This dissertation is my own work. This work has not been submitted to any institution other than Nelson Mandela University.

I have not allowed, and will not allow, anyone to copy my work with the intention of passing it off as his or her own work.

Signature ______

Date ______


Acknowledgements

I would like to express my sincere gratitude and appreciation to my supervisor, Prof V Oosthuizen, for his positive attitude and guidance, and to the National Research Foundation (NRF) for financial assistance. I would also like to thank Dr R Hatherley, Dr Vuyani Moses and Prof O Bishop for their guidance in the field of Bioinformatics. Finally, I would like to thank Abigail Sephton, Liza Findt, Eugen Schnautz, Jan Batelka, Travis Dugmore, Martin Dorfling and Blake Callahan for their encouragement, occasional ideas and helpful assistance in maintaining my composure.


Contents

Abbreviations and acronyms
List of Figures
List of Tables
Abstract
1. Integrin
1.1. Structure
1.2. Activation and regulation
1.3. Functions
1.3.1. Cell signalling
1.3.2. Cell survival
1.3.3. Cell proliferation
1.3.4. Cell differentiation
1.3.5. Leukocyte recruitment, activation and adhesion
1.3.6. Integrin ligands
2. Homology modelling
2.1. Comparative homology modelling overview
2.1.1. Template searching
2.1.2. Selecting templates
2.1.3. Sequence alignment
2.1.4. Model construction
2.2. Modelling programs
2.2.1. MODELLER
2.2.2. MEDELLER
2.2.3. PRIMO
2.2.4. Phyre2
2.2.5. I-TASSER
2.3. Errors in homology modelling
2.4. Determining model accuracy
2.4.1. PROSA
2.4.2. Verify 3D
2.4.3. PROSESS
2.5. Docking programs
2.5.1. HADDOCK2.2
2.5.2. CLUSPRO
2.5.3. AutoDock Vina


3. Problem statement, approach and objectives for this study
4. Methods
4.1. Obtaining target sequences
4.2. Modelling using online servers
4.2.1. PRIMO
4.2.2. I-TASSER and Phyre2
4.2.3. MEDELLER
4.3. MODELLER
4.3.1. Generating alignment files, PDB files and python scripts
4.3.2. Modelling the complete
4.3.3. Determining optimal model iterations
4.3.4. Determining optimal refinement iterations
4.3.5. Determining optimal template numbers
4.3.6. Very_slow against slow_large refinement
4.4. Modelling monomeric subunits
4.4.1. Initial modelling
4.4.2. Separating “closed” and “open” templates
4.4.3. Modelling with secondary structure arguments
4.4.4. Modelling with forced transmembrane regions
4.4.5. Fragmented integrin modelling
4.5. Final modelling
4.5.1. BLASTp
4.5.2. Separation of “closed” and “open” templates
4.5.3. Preparing multiple runs under varying template numbers
4.5.4. Generating alignment files
4.5.5. Generating python scripts
4.5.6. Z-DOPE evaluation and output expansion
4.5.7. Z-DOPE evaluation and monomer selection
4.5.8. Protein evaluation
4.5.9. Obtaining ligands used for docking
4.5.10. Docking using HADDOCK
4.5.11. Docking using AutoDock Vina
5. Results and Discussion
5.1. The use of online webservers to generate protein models
5.2. Modelling heterodimeric proteins using MODELLER
5.3. Modelling monomeric proteins using MODELLER


5.3.1. Fragmentation of monomeric models
5.4. Final modelling procedure (Extracellular domains)
5.4.1. Docking using ClusPro
5.4.2. Docking using HADDOCK2.2
5.4.3. Docking using AutoDock Vina
5.4.4. Metal ion coordination
5.4.5. PROSA, Verify-3D and PROSESS Results
5.5. Final modelling procedure (Transmembrane and Cytoplasmic)
6. Conclusion
7. Appendix
Section A: PDB files used in the study
Section B: Alignment file setup
Section C: Python script example
Section D: ClusPro and HADDOCK2.2
Section E: Verify-3D example with ligand
8. References


Abbreviations and acronyms

Alternative messenger ribonucleic acid: mRNA
Assisted Model Building with Energy Refinement: AMBER
Basic Local Alignment Search Tool: BLAST
Bipartite 4.1 protein, ezrin, radixin and moesin: FERM
Block substitution matrix: BLOSUM
Broyden-Fletcher-Goldfarb-Shanno: BFGS
Central processing unit: CPU
Chemistry at Harvard Macromolecular Mechanics: CHARMM22
Conditional random field: CRF
Dynamic programming: DP
Epidermal growth factor: EGF
Extracellular matrix: ECM
Generate NMR structure: GeNMR
Graphics processing unit: GPU
Growth factor receptor binding protein-2: Grb2
Guanine nucleotide-binding protein: G-protein
Guanosine triphosphatase: GTPase
Hidden Markov model: HMM
High ambiguity driven protein-protein docking: HADDOCK2
Iterative threading assembly refinement: I-TASSER
Ligand-associated metal-binding site: LIMBS
Local meta-threading server: LOMETS
Metal-ion-dependent adhesion site: MIDAS
Mitogen-activated protein kinase: MAPK
Molecular dynamics: MD
Multiple alignment using fast Fourier transform: MAFFT
Multiple sequence alignment program: MSA
Nuclear Overhauser effect: NOE
Optimized potential for liquid simulations: OPLS
Particle-Mesh-Ewald: PME
Phosphatidylinositol-3-kinase: PI3K
Plexin-semaphorin-integrin: PSI
Point accepted mutation: PAM
Position specific iterated secondary structure prediction: PSIPRED
Position-Specific Iterated: PSI
Protein homology/analogy recognition engine: Phyre2
Protein interactive modelling: PRIMO
Protein structure analysis: PROSA
Protein structure evaluation suite and server: PROSESS
Receptor tyrosine kinases: RTKs
Root mean square deviation: RMSD


Small angle X-ray scattering: SAXS
Specificity determining loop: SDL
The Protein Data Bank: PDB
Three-dimensional position-specific scoring matrix: 3D-PSSM
Transmembrane: TM
Tree-based Consistency Objective Function for Alignment Evaluation: T-COFFEE
Universal Protein Resource: UniProt
Unweighted pair group method with arithmetic mean: UPGMA
Volume area dihedral angle reporter: VADAR
von Willebrand factor type A: vWFA


List of Figures

Figure 1. The integrin protein family
Figure 2. Structure of integrin proteins using αVβ3 as example
Figure 3. Homology modelling restrictions
Figure 4. Models generated by PRIMO
Figure 5. Standard deviation of models generated by PRIMO
Figure 6. Comparison of the best and worst β1 and β2 integrin monomers to the templates 3VI3 and 3K6S
Figure 7. Comparison of the best and worst α3, αE and αM models with their templates superimposed on the worst αM model with an RMSD of 3.093
Figure 8. Comparing the best models generated by Phyre2, I-TASSER and PRIMO
Figure 9. Phyre2 output of the α3 integrin monomer
Figure 10. I-TASSER output of the α3 integrin monomer
Figure 11. Comparison of the β1 and β2 integrin monomers generated by PRIMO, Phyre2 and I-TASSER
Figure 12. Comparison of the α3, αE and αM models generated from Phyre2, PRIMO and I-TASSER
Figure 13. Initial α3β1 integrin models produced by MODELLER and the 100 and 300 models
Figure 14. Comparing four sets of five models generated by MODELLER separated by levels of refinement using standard template numbers
Figure 15. Comparing four sets of five models generated by MODELLER separated by levels of refinement using increased template numbers
Figure 16. Very_slow against slow_large refinement
Figure 17. Comparison of heterodimeric models generated by MODELLER
Figure 18. Comparing the α3 and β1 monomeric models generated by MODELLER
Figure 19. Changes in integrin conformation
Figure 20. Standard deviation values associated with all model sets using only closed and open templates, only closed with arguments and only closed for forced transmembrane portions
Figure 21. Comparing the Z-DOPE score of models generated using only closed and only open templates with models generated using arguments and forced transmembrane helices
Figure 22. Comparison of the monomeric models generated by MODELLER
Figure 23. Overlay of the templates 3VI3 and 1JV2, the monomeric subunits generated by the forced transmembrane set and the heterodimeric model generated from the increased template and refined set
Figure 24. Comparing the Z-DOPE scores of extracellular, transmembrane and cytoplasmic models of α3 and β1 generated by MODELLER
Figure 25. Comparison of the extracellular portion of the α3 and β1 integrin chains
Figure 26. Comparison of the Z-DOPE scores of models generated for the closed conformation
Figure 27. Comparison of the final models generated for the closed conformation
Figure 28. Overlay of closed conformation integrin models with their templates
Figure 29. Comparison of the Z-DOPE models generated for the open conformation
Figure 30. Comparison of the final models generated for the open conformation
Figure 31. Overlay of open conformation integrin models with their templates
Figure 32. Overlay of the closed and open templates of the α4β1, αDβ2, αLβ2 and αXβ2
Figure 33. Protonation states of Histidine
Figure 34. Interactions of the ligand (LDV) with αXβ2
Figure 35. Interactions between Integrin Binding Fragment of Laminin-511 (ELV) sequence and integrin α3β1
Figure 36. Failed result of αEβ7 and passed result of α3β1


Figure 37. PROSA graphical results of chain A and B of α3β1
Figure 38. PROSA graphical results of α3β1
Figure 39. Summarised PROSESS results of closed and open conformations of chain A of α3β1
Figure 40. Cytoplasmic domains of all integrin subunits overlaid with templates
Figure 41. Transmembrane domains of all integrin subunits overlaid with templates
Figure 42. The α4β1 integrin with associated LDV ligand


List of Tables

Table 1. Description of integrins involved in leukocyte activation
Table 2. UNIPROT database sequence information of the integrin monomers
Table 3. Template selection panel of PRIMO for β1
Table 4. Template names and numbers used for the generation of each integrin
Table 5. Output of PRIMO server
Table 6. P-values between data sets of each integrin
Table 7. Correlation coefficients of the balanced, coverage and identity and resolution sets
Table 8. P-values of Z-DOPE models generated by online servers obtained from a two factor ANOVA without replication
Table 9. P-values of standard and increased template sets
Table 10. P-values associated with the closed and open, only closed, only open, closed with arguments and closed with forced transmembrane segment
Table 11. P-values associated with each integrin subunit of all, fewer and fewest data sets
Table 12. Output results of the αXβ2 integrin
Table 13. Verify 3D results of final integrin heterodimeric models
Table 14. Z-scores obtained by PROSA indicating global quality of final models
Table 15. Overall quality of each heterodimeric model as graded by PROSESS
Table 16. Z-DOPE scores of cytoplasmic and transmembrane models generated by MODELLER
Table 17. PDB files used within this study and their description
Table 18. PDB files used for PRIMO server modelling
Table 19. PDB files used for MEDELLER server modelling
Table 20. PDB files used for testing MODELLER parameters
Table 21. PDB files used for the final cytoplasmic and transmembrane segment modelling
Table 22. PDB files used for the final extracellular segment modelling
Table 24. Problematic residues within the open and closed conformation of integrin proteins
Table 25. Predicted ligand associating residues for various integrin subunits


Abstract

Heterodimeric transmembrane protein structure is complex, and insufficient structural information exists concerning leukocyte integrin proteins. To determine protein structure, homology modelling was conducted and modelling software was evaluated. Leukocyte integrin homologs were obtained from the PDB and models were generated using online servers and MODELLER. Template homologs were fewer in number and of lower quality than those available for monomeric extracellular proteins. Models were docked using ClusPro, HADDOCK2.2 and AutoDock Vina, and evaluated using PROSA, Verify-3D and PROSESS. Higher quality models were generated when using MODELLER to separately model monomeric subunits in three defined domain regions (extracellular, transmembrane and cytoplasmic). Template selection for these proteins is critical, as an intricate relationship exists between model quality, template quality, template quantity, template resolution, target-template identity and template sequence coverage. Docking monomeric subunits was challenging when using ClusPro, and the best ligand docking procedures were completed using AutoDock Vina. PROSESS provided the most accurate evaluation of protein models, in comparison to PROSA and Verify-3D. These results indicate that although homology modelling is a powerful tool, there is much room for improvement. Experimentally obtained templates should be expanded upon within the PDB and energy functions should cater for both monomeric and transmembrane heterodimeric proteins. Leukocyte integrins appear to adopt a closed conformation, which may still facilitate LDV ligand association within the α/β interface. The α3β1 integrin may interact with laminin-5 through the ELV sequence within the G-domain of the α laminin subunit.


1. Integrin proteins

Kindlin-activated proteins, such as integrins, are classified as transmembrane and cell surface adhesion proteins that permit communication between cells and between the cell surface and the extracellular matrix (ECM) (Huveneers et al., 2007). Integrins also facilitate cellular adhesion, which permits multicellularity. Integrin ligands modulate these proteins differently, allowing various biological functions. Ligands include collagen, fibronectin, vitronectin and laminin. Although communication and cellular adhesion are the primary roles of integrins, these functions facilitate other processes such as cell proliferation, migration, signalling, morphology, movement, differentiation, ECM homeostasis, tissue integrity, lymphocyte recruitment and adhesion (Anderson et al., 2013).

1.1. Structure

Combinations of type I transmembrane glycoprotein subunits associated through non-covalent forces compose the integrin protein family (Figure 1) (Pozzi and Zent, 2013). The variety of α and β subunits is further increased through formation of subunit isoforms. Isoforms result from alternative messenger ribonucleic acid (mRNA) splicing and affect both extracellular and intracellular domains (de Melker and Sonnenberg, 1999). Alternative mRNA splicing occurs through alternative donor sites, acceptor sites or intron retention. Isoform synthesis is dependent on cell type and cellular environment, such as cell cycle stage or cellular differentiation level.

Figure 1. The integrin protein family. Leukocyte specific integrin proteins are shown on the left. Integrin proteins not associated with leukocytes or having dual functions are shown on the right. Insert-domains (blue), no insert-domains (green), β subunit (red) (Gahmberg et al., 2009).

Integrin structure is complex: each integrin is a heterodimer built from different domains, not all of which are present in every known structure. The general structure of integrins (Figure 2) can be described using the αVβ3 protein.

Figure 2. Structure of integrin proteins using αVβ3 as example. The α chain (left) in combination with the β chain (right). The α chain contains a β propeller preceded by three β-sandwich modules. The β chain contains an A-domain preceded by a β-sandwich hybrid domain, a PSI-domain, four epidermal growth factor (EGF) repeats and a β-tail domain. The colours indicate α helices (red), β strands (blue), loops (grey) and calcium ions (yellow) (Humphries et al., 2003).

The α subunit is larger than the β subunit, with lengths of approximately 1000 and 700 amino acids, respectively (Danen, 2013). β subunits are highly homologous in comparison to α subunits. The α subunit contains a metal-ion-dependent adhesion site (MIDAS) located within an A-domain (Michishita et al., 1993). This αA-domain is present within nine integrin α-subunits and within proteins unrelated to integrins (Lee and Richards, 1971; Whittaker and Hynes, 2002). The αA-domain is responsible for mediating ligand binding to other αA-domain containing integrins and regulates Mg2+-dependent binding of the receptor to ligands (Diamond et al., 1993; Michishita et al., 1993). It contains a compact GTPase-like fold composed of parallel β-sheets surrounded by seven amphipathic α-helices (Arnaout, 2016). The apex in GTPases is replaced with the MIDAS, in which a trio of surface loops coordinates an Mg2+ ion. This Mg2+ ion modulates ligand-binding specificity of the αA-domain by associating with a solvent-exposed glutamate or aspartate provided by the ligand, creating an octahedral coordination sphere surrounding the ion (Lee et al., 1995). The αA-domain is structurally similar to the trimeric G-protein α subunit (Liddington and Ginsberg, 2002).

The β subunit ligand-binding domain, or βA-domain, is formed through seven repeats of 60 amino acids that fold similarly to trimeric guanine nucleotide-binding protein (G-protein) β subunits (Shimaoka et al., 2002). About half of the known integrin α subunits and all known β subunits contain an A-domain located within the globular head region. This domain harbours a MIDAS, which interacts with integrin ligands that alter the conformation of the αA-domain, facilitating shifts between closed and open conformations (Liddington and Ginsberg, 2002).

A closed conformation of the αA-domain exists in which the coordinating carboxyl oxygen of the ligand is replaced with water. Conformational changes associated with the shift from closed to open include inward movement of the N-terminal α1 helix and rearrangement of metal-coordinating residues at the MIDAS, followed by a 10 Å downward shift of the C-terminal α7 helix at the opposite pole to the MIDAS (Bajic et al., 2013; Lee et al., 1995).

Physical interactions between chains within the ectodomain were first determined through crystallisation of the αVβ3 integrin protein in an unaligned state occupied by a cyclic peptide ligand (Xiong et al., 2002). The αV subunit is composed of a seven-bladed propeller domain, a thigh domain and two Ig-like Calf domains. The β3 subunit is composed of an N-terminal plexin-semaphorin-integrin (PSI) domain, an Ig-like domain in which an A-like domain is present, followed by four successive epidermal growth factor (EGF)-like repeats and finally a novel membrane-proximal β-tail domain (βTD).

Integrin heads which lack an αA-domain are composed of non-covalently bonded βA-domain and propeller domains, resembling the association between subunits of heterotrimeric G proteins (Xiong et al., 2001). For αA-domain containing integrins, the integrin head contains the αA-domain, which projects from a surface loop within the propeller. Five metal ions (Ca2+ or Mn2+) associate with the bases of propeller blades four to seven. These ions are suspected to increase the rigidity of interfaces of thigh domains (Arnaout, 2016).

Inactive βA-domains are structurally identical to the αA-domain except for two loop insertions. The first forms the core of the interface with the α-subunit’s propeller and the second forms the specificity-determining loop (SDL). The SDL contributes to ligand binding and to βA/propeller interfaces for some integrins. A Ca2+ cation site lies adjacent to the MIDAS in the βA-domain and links the two activation-sensitive α1 and α7 helices. These helices stabilise the domain within the closed state. By comparison, in αA-domain containing integrins this interaction is absent and is replaced by a hydrophobic interaction. In addition to the MIDAS, ligand-bound βA also contains a ligand-associated metal-binding site (LIMBS) occupied by Ca2+ in ligand- or pseudoligand-bound integrin proteins (Xiao et al., 2004; Xiong et al., 2002). The LIMBS is regulated in ligand-free integrins by the α-subunit’s propeller domain (Rui et al., 2014).

For αA-domain containing integrins, the ligand-induced downward shift of the C-terminal α7 helix enables association between a glutamate and the βA MIDAS ion. Failure of this association prevents integrin function, hinting that the αA-domain may act as an intrinsic ligand for βA (Arnaout et al., 2007).

An asparagine-proline-X-tyrosine (NPxY) motif located within β subunits functions as an association site, modulated by tyrosine phosphorylation, for signalling and cytoskeletal proteins that contain a phosphotyrosine-binding domain (Calderwood et al., 2003). The cytoplasmic tail of ±75 amino acids lacks enzymatic or actin-binding activity but facilitates direct association with adaptor proteins, allowing intracellular signal transduction (Pozzi and Zent, 2013).

1.2. Activation and regulation

Integrins are activated by the focal adhesion proteins Kindlin-1, -2 and -3 (Ali and Khan, 2014). Activation occurs through interactions between integrins and a bipartite 4.1 protein, ezrin, radixin and moesin (FERM) domain interrupted by a pleckstrin homology domain (Lai-Cheong and McGrath, 2010). A FERM subdomain contains a phosphotyrosine-binding fold structurally homologous to Talin proteins (Kloeker et al., 2004). Kindlin proteins allow protein-protein interactions and perform roles in cell migration, cell spreading and cancer progression.

Self-regulation of integrins is through modulation of association strength between cells. This regulates recruitment and localisation of signalling proteins and associated substrates affecting and modulating cytoskeletal dynamics (Dowling et al., 2008; Kruger et al., 2008; Moser et al., 2008).

1.3. Functions

An important integrin protein function is leukocyte recruitment, activation and fixation to the vascular endothelium (Diamond and Springer, 1994). Leukocytes are attracted to sites of infection or sterile inflammation and form part of both the innate and adaptive immune system. Integrins function to facilitate the above processes and permit leukocyte functioning.

The β1 subunit is heavily involved in cell-matrix interactions, β2 functions in many cell-cell interactions (Buck and Horwitz, 1987; Ruoslahti, 1991) and β3 has many adhesive functions (Albelda and Buck, 1990).

1.3.1. Cell signalling

Integrin-mediated adhesion induces a variety of processes such as signal transduction cascades, modulation of calcium fluxes, activation of inositol lipid metabolism, regulation of interactions between signalling proteins and receptors such as G-protein-coupled receptors, cytokine receptor kinases, receptor tyrosine kinases (RTKs), and both serine and threonine protein kinases (Danen and Yamada, 2001; Huveneers et al., 2007).

Growth factor receptors and integrins synergistically activate pathways that induce co-activation of downstream signalling proteins (Renshaw et al., 1997). Cell-matrix adhesion induces association of integrins with cellular membranes, activating redistribution of the actin cytoskeleton and associated proteins. Redistribution facilitates multi-protein platform formation in the form of podosomes, hemidesmosomes or focal adhesions (Huveneers et al., 2007). These multi-protein platforms amplify signals originating from growth receptors by limiting the distance between kinases and their substrates (Geiger et al., 2001).

Cell shape is affected and modulated by the cytoskeleton, which is dependent on integrin activity. Changes in cytoskeletal components induce shape changes of cellular nuclei, affecting chromatin structure. Changes in chromatin structure modulate genes involved in integrin synthesis and impact integrin-mediated cell adhesion activity (Lelièvre et al., 1998). Integrin-mediated adhesion promotes clustering and activation of RTKs such as platelet-derived growth factor receptors, epidermal growth factor receptors, Ron receptors, Met receptors, vascular endothelial growth factor receptors and proto-oncogene tyrosine-protein kinase (Src) family kinases (Shattil, 2005). Regulation of RTKs is enhanced by integrin modulation of upstream signalling through ECM organisation. Proteoglycans in the ECM modify growth factors prior to presentation to various growth factor receptors. Integrin-mediated cell adhesion facilitates interaction between modified growth factors and receptors, allowing for enhanced regulation of cell growth (Faham et al., 1998).

1.3.2. Cell survival

Cell survival is linked to integrin-maintained adhesion, as loss of adhesion leads to anoikis (Frisch and Screaton, 2001). Integrins regulate this form of apoptosis through stimulation of phosphatidylinositol-3-kinase (PI3K)-mediated protein kinase B (PKB), autologous tumour killing (ATK) activity and B-cell lymphoma-2 (Bcl-2) production, which regulate cell survival through various signals (Giancotti and Ruoslahti, 1999). Other means of survival regulation include integrin-mediated adhesion to fibronectin, which activates c-Jun N-terminal kinase through focal adhesion kinase processes (Oktay et al., 1999). An additional method by which integrin proteins support cell survival is activation of nuclear factor kappa-B (NFκB) by integrin α6β4 (Weaver et al., 2002). Alternatively, non-ligand-bound integrins induce apoptosis via recruitment and activation of caspase-8, suggesting that a cell is dependent on particular environmental conditions of the ECM to support cell survival (Stupack et al., 2001).

1.3.3. Cell proliferation

Cell proliferation is regulated by integrins through their influence on the growth 1 (G1) phase of the cell cycle. Integrins interact with RTKs and facilitate stimulation of the cyclin E and cyclin-dependent kinase 2 (cdk2) complex, which regulates entry into the synthesis (S) phase of the cell cycle. Integrin proteins and RTKs also regulate the cell cycle by influencing cyclin D1 gene expression (Huveneers et al., 2007). Transcriptional regulation of the cyclin D1 gene is related to mitogen-activated protein kinase (MAPK) activation, which relies on integrin-mediated adhesion or on activation of MAPK by RTKs. The latter induces a more powerful MAPK response due to synergistic effects of RTK and integrin signalling through Raf, MAP or MAPK (Renshaw et al., 1997).

Integrins regulate MAPK activity in three ways. Firstly, association between integrins and the ECM induces complexation of active focal adhesion kinase (FAK, also known as protein tyrosine kinase 2, PTK2) and Src signalling proteins at adhesion sites. This involves autophosphorylation of FAK at Tyr397, which creates a suitable binding site for the Src homology 2 (SH2) domain of Src (Schlaepfer and Hunter, 1998) and various downstream effectors. Downstream signalling proceeds through association of growth factor receptor binding protein-2 (Grb2) with the active FAK and Src complex, either directly or indirectly via the Shc protein, activating the Grb2-Sos-Ras-Raf-mitogen-activated protein kinase kinase (MEK)-MAPK pathway (Huveneers et al., 2007). Src may also phosphorylate p130Cas (Crk-associated substrate), which associates with FAK through SH3 domains, creating binding sites for the Crk adaptor protein. Crk activates MAPK through interaction either with son of sevenless (Sos) or with C3G, a guanine nucleotide exchange factor for the small GTPase Rap-1. Stimulation of the adaptor protein Nck and p130Cas via integrin-mediated adhesion is another method by which p130Cas leads to MAPK activation. PI3K may also interact with phosphorylated Tyr397 in FAK, leading to its activation following integrin-mediated adhesion. PI3K leads to MAPK activation either via its role as a protein kinase or through regulation of Sos activity via phosphatidylinositol-3,4,5-trisphosphate (PIP3).

Secondly, activation of the MAPK pathway may occur via particular integrin α-subunits associating with the Src family kinase, Fyn, via the oligomeric transmembrane protein Caveolin-1 (Guo and Giancotti, 2004). Activation of Fyn induces the recruitment and phosphorylation of Shc, thereby generating a link to the Grb2-Sos-Ras-Raf-MEK-MAPK pathway.

Thirdly, the activation of protein kinase C (PKC) and other PKC isoforms by integrin-mediated adhesion may activate Raf. This is likely due to the increased levels of phospholipids. Additionally, integrin-mediated adhesion leads to the activation of p21-activated protein kinases (PAK) activating both Raf and MEK.

Alternative control methods over the cell cycle include regulation of p21 and p27 by integrin-mediated cell adhesion, as both proteins exert influence over entry into the G1 phase of the cell cycle. Adhesion increases the expression rates of Myc via Src activation (Benaud et al., 2001). Organisation of cytoskeletal components of the cell by integrins influences Rho GTPase activity, which aids in regulating levels of cyclin D1 and cdk-inhibitors.

1.3.4. Cell differentiation

Integrin-mediated cell adhesion modulates expression of genes involved in cellular differentiation. Stimulation of milk proteins via phosphorylation of prolactin receptors in mammary epithelial cells is one such example (Edwards et al., 1998). Other examples include activation of monocytes for inflammatory responses, inhibition by integrin-blocking antibodies of the formation of contracting myotubes, and expression of meromyosin by embryonic myoblasts (Menko and Boettiger, 1987; Shi and Simon, 2006).

Differentiation of myogenic, cardiac and embryonic stem cells is dependent on β1 integrins. The β1 integrin also regulates differentiation of keratinocytes, where tumour-associated mutations in β1 reveal an increase in ligand association that prevents keratinocyte differentiation, contributing to epidermal neoplasia development (Evans et al., 2003).

1.3.5. Leukocyte recruitment, activation and adhesion

Integrins mediate interactions between leukocytes and vascular endothelium, facilitating leukocyte recruitment (Table 1) (Springer, 1994). Once leukocytes are recruited, the initial adhesion process, involving selectins and α4β1 integrins, begins. This is followed by tethering of leukocytes to endothelial tissue. Tethering relies on chemokine activity leading to leukocyte rolling. Activation of rolling leukocytes induces increased activity of cell adhesion molecules by recruiting β2 and α1β1 integrins to fix leukocytes to the endothelium. However, this may not always occur, as leukocytes may detach before this process is initiated. Fixation to the endothelium is completed by L-, E- and P-selectin ligands, which facilitate initial fixation, while the β2 integrins αLβ2 and αMβ2 perform long-term fixation (Abitorabi et al., 1997). The α4β1 and α4β7 integrins perform both initial and long-term fixation of leukocytes to the endothelium (Berlin et al., 1995).

Leukocyte activation occurs in stages initiated by chemoattraction via cytokine release, which induces recruitment of cellular adhesion molecules (integrins, selectins and cadherins). Once adhesion molecules are recruited, a rolling adhesion process is initiated, which may lead to fixed adhesion. This fixed adhesion precedes transmigration of the cellular adhesion molecules to allow movement to the required location.

Table 1. Description of integrins involved in leukocyte activation, their alternative names, associated β chains, receptors involved in leukocyte activation and general functions (Arnaout, 1990; Bechard et al., 2001; Bilsland et al., 1994; Blackford et al., 1996; Fujita et al., 2012; Granger and Senchenkova, 2010; Humphries et al., 2006; Ihanus et al., 2007; Kristof et al., 2013; Malhotra et al., 1986; Ostermann et al., 2002; Sadhu et al., 2007; Tian et al., 1997; Van der Vieren et al., 1999; Xia et al., 1999; Yakubenko et al., 2008; Yakubenko et al., 2001).

Integrin subunit | Alternative name | Associated β chain | Leukocyte receptor | Leukocyte function | General function
α3 | CD49c | 1 | ICAM-1/VCAM-1 | Detachment | Receptor for fibronectin, laminin, collagen, epiligrin and thrombospondin. Participates in invadopodia formation and matrix degradation processes. Mediates cell migration.
α4 | CD49d | 1, 7 | VCAM-1/MadCAM-1 | Rolling/fixed adhesion | Receptors for fibronectin and participates in cytolytic T-cell interactions with target cells. Functions in cell signalling.
αD | CD11d | 2 | VCAM-1, ICAM-3, several matrix proteins | Adhesion | Role in atherosclerotic processes.
αE | CD103 | 2 | CAM120/80 | Recruitment | Receptor for E-cadherin. Mediates adhesion of intra-epithelial T-lymphocytes to epithelial cell monolayers.
αL | CD11a | 2 | ICAM-1-5, telencephalin, endothelial cell-specific molecule-1 (ESM-1), junctional adhesion molecule 1 (JAM1) | All | Receptor for ICAM1-4 and F11R. Functions in many immune phenomena and roles in lymphopoiesis.
αM | CD11b | 2 | >40, including iC3b, ICAM1-4, fibrinogen, fibronectin, Factor X, Platelet Iba, JAM-3, proteinase 3 | Adhesion | Receptor for fibrinogen, factor X and ICAM1. Recognises P1 and P2 peptides of fibrinogen gamma chain.
αX | CD11c | 2 | ICAM1, 4, iC3b, VCAM-1, heparin, polysaccharides, negative charges of proteins | Crawling | Receptor for fibrinogen. Recognises the GPR sequence. Mediates cell-cell interaction during inflammatory responses. Important in monocyte adhesion and chemotaxis.

Although β2 integrins associate solely with leukocytes, the association varies between leukocyte subpopulations. αL is predominantly located on lymphocytes, whereas αM associates strongly with myeloid cells, natural killer cells, fibrocytes, mast cells, B cells, CD8+ T cells and γδ T cells (Arnaout, 1990; Clements et al., 2016; Ghosn et al., 2008; Lahmers et al., 2006; Pilling et al., 2009; Rosenkranz et al., 1998; Rubtsova et al., 2013; Wagner et al., 2001). αX is located on myeloid dendritic cells, natural killer cells, B cells and T cells (Keizer et al., 1987). αD is associated with neutrophils, monocytes, natural killer cells and a portion of the T cells (Miyazaki et al., 2014; Van der Vieren et al., 1999).

β2 integrin function was studied using knockout experiments. It was determined that αL and β2 both play a role in the onset of neutrophilia; however, the effect of the former is larger than that of the latter, thereby indicating additional contributions of the other β2 integrins (Arnaout, 2016). αL, αM and αX also display roles in preventing defective host-versus-graft reactions and tumour rejection by preventing homotypic aggregation and antigen-, mitogen- and alloantigen-induced lymphoproliferation (Shier et al., 1999; Shier et al., 1996).

The functions of some β2 integrins appear to overlap with those of other integrins, such as α4β1 or α9β1, as the systemic responses to viral infections in αL knockout mice were unchanged (Johnson and Overington, 1993; Taooka et al., 1999).

Concerning adhesion, α subunits contributed to varying degrees to the adhesion of phagocytes to inflamed regions (Arnaout et al., 1988; Ding et al., 1999; Sadhu et al., 2007).

The αM integrin has been implicated in phagocytosis of serum-opsonised particles, in phagocytosis-induced apoptosis in neutrophils and in the prevention of sepsis and endotoxin shock (Han et al., 2010). Although neutrophil adhesion to endothelial tissue requires all available β2 integrins, this is not the case for either transendothelial migration or phagocytosis, which appear to be reliant on αL and αM activity. αM aids in the regulation of lipid metabolism and mast cell development, which plays a role in early peritoneal neutrophil response (Rosenkranz et al., 1998).

1.3.6. Integrin ligands

Integrins fall into four major ligand-association types: leucine-aspartic acid-valine (LDV)-binding, arginine-glycine-aspartic acid (RGD)-binding, A-domain-containing β1 and non-A-domain-containing laminin-binding integrins (Humphries et al., 2006). LDV-binding integrins include α4β1, α4β7, α9β1, αEβ7 and all members of the β2 family. This group constitutes the majority of the integrins involved in this study; the exception is α3β1, which associates with laminin (Humphries et al., 2006). Structural data on interactions involving LDV ligands are lacking, but these interactions are expected to be functionally similar to those of RGD ligands (Humphries et al., 2006). RGD-binding integrins include all αV, α5β1, α8β1 and αIIbβ3 integrins. The RGD ligand binds within the α/β subunit interface. The R residue lies within a β-propeller cleft of the α subunit whilst the D residue coordinates a cation bound within the A-domain of the β subunit (Humphries et al., 2006). The β2 family associate with their ligands through an A-domain of the α subunit (Shimaoka et al., 2003). The major difference between β2 and β1/β7 integrins is the use of glutamate for ligand coordination in the former, whilst the latter use aspartate.

A-domain-containing β1 integrins such as α1, α2, α10 and α11 associate with the laminin-collagen ligands. Association appears to lie within a collagenous GFOGER motif (where O indicates hydroxyproline), which provides the key cation-coordinating residue through a glutamate residue (Emsley et al., 2000). The mechanism of laminin association is unknown. The non-αA-domain-containing laminin-binding receptors, such as α3β1, α6β1, α7β1 and α6β4, appear to have no definite trend of ligand association, and the exact site and mechanism have yet to be determined (Humphries et al., 2006). It appears that laminin association occurs within the G-domain, which represents the C-terminal globular domain of the α chain (Colognato and Yurchenco, 2000; Hirosaki et al., 2000; Ido et al., 2004). It has been determined that α3β1 in particular associates strongly with laminin-5 and laminin-10/11 (Nishiuchi et al., 2006).

The specific recognition sequences associated with these ligands differ. Fibronectin contains a CS1 site and an REDV site, with the former containing the LDV recognition sequence (Yamada, 1991). The CS1 site elicits a stronger biological reaction when bound than the REDV site and is therefore the site considered when performing ligand-integrin binding studies. Laminin, however, contains many other potential ligand recognition sites such as YIGSR (Graf et al., 1987), PDSGR (Kanemoto et al., 1990), RYVVLPR (Skubitz et al., 1990), LGTIPG (Mecham et al., 1989), RGD (Aumailley et al., 1990), IKVAV (Tashiro et al., 1989) and LRE (Hunter et al., 1989). Although it is unknown exactly how α3β1 associates with laminin, it is likely through one of these recognition sites in a location structurally homologous to that of RGD and LDV integrin-binding ligands. However, it is now possible to eliminate some of these recognition sequences based on whether they fall within the α subunit G-domain of laminin-5 or laminin-10/11.
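The elimination step described above can be illustrated with a short sketch. It scans a sequence for the listed recognition motifs and keeps only those lying wholly inside a given region. The function names, the example sequence and the region boundaries are hypothetical and serve only as an illustration; they are not taken from a real laminin entry.

```python
# Candidate laminin recognition motifs listed in the text.
MOTIFS = ["YIGSR", "PDSGR", "RYVVLPR", "LGTIPG", "RGD", "IKVAV", "LRE"]


def find_motifs(sequence, motifs=MOTIFS):
    """Return {motif: [1-based start positions]} for every motif found."""
    hits = {}
    for motif in motifs:
        positions = []
        start = sequence.find(motif)
        while start != -1:
            positions.append(start + 1)  # convert to 1-based residue numbering
            start = sequence.find(motif, start + 1)
        if positions:
            hits[motif] = positions
    return hits


def motifs_in_region(sequence, region_start, region_end, motifs=MOTIFS):
    """Keep only hits lying wholly inside [region_start, region_end] (1-based)."""
    kept = {}
    for motif, positions in find_motifs(sequence, motifs).items():
        inside = [p for p in positions
                  if p >= region_start and p + len(motif) - 1 <= region_end]
        if inside:
            kept[motif] = inside
    return kept


# Hypothetical toy sequence, NOT a real laminin chain.
toy_sequence = "AAYIGSRAARGDAALRE"
print(find_motifs(toy_sequence))
print(motifs_in_region(toy_sequence, 8, 17))
```

In practice the region boundaries would be the annotated G-domain limits of the laminin α chain of interest.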


2. Homology modelling

Homology modelling is a tool for predicting structural characteristics of biomolecules whose structures have yet to be solved experimentally through nuclear magnetic resonance (NMR) spectroscopy or X-ray diffraction (X-D) techniques (Krieger et al., 2005). Biomolecule modelling conducted using software relies on “knowledge-based” or “de novo” approaches (Blaszczyk et al., 2013). Knowledge-based techniques require prior information about the target. This information is acquired through sequence and structure data obtained from the target’s homologues, known as templates. Template suitability is based on target-template identity and template resolution. Percentage identity is an indication of evolutionary relatedness between target and template, whilst resolution is an indication of template structure quality.

High sequence identity ensures these sequences are similar enough in terms of their residues to reflect similar secondary, tertiary and quaternary protein structure, as these factors are dictated by sequence information (Xiang, 2006). Template resolution is dependent on the method by which the structure was solved. Although NMR and X-D techniques are not mutually exclusive, each has its own advantages and disadvantages, with NMR dominating protein structures of 10 kDa or less whilst 90% of the total PDB structures are solved using X-D (Krishnan and Rupp, 2012). NMR is particularly useful if no protein crystals can be obtained and provides information on solution dynamics; however, NMR is limited to proteins of 50 kDa or less (Krishnan and Rupp, 2012). X-D does not have a size limitation and provides higher atomic detail, but information on molecular dynamics is limited (Krishnan and Rupp, 2012).

Resolution is measured in angstroms (Å) where lower values indicate increased structural detail. Homology modelling can be completed by using a combination of sequence and structural data from homologues, restricting the modelling procedure to rely on either sequence or structural data or utilisation of ab initio techniques to model regions in the absence of templates (Krieger et al., 2005). Knowledge-based approaches rely on template information and facilitate higher model quality compared to that of ab initio techniques. However, in many modelling scenarios, regions for which no template may be identified exist and thus ab initio techniques are used in combination with knowledge-based approaches to solve complicated modelling problems.


Various homology modelling programs exist, each with its own advantages and disadvantages in handling the challenges of the modelling process. These challenges include low template identity, resolution or coverage (Zhang, 2008). Software parameters relating to iterations of model construction, ab initio modelling techniques and restriction of regions of the model during construction based on template information also vary. Restraints on model construction force the model to fold in particular ways, which may not be the most energetically favourable. If the native state of a protein is not the lowest energy state, this may be due to the evolutionary heritage of the protein. Over time the protein evolved to adopt a “stable enough” native state while allowing biological function during conformational changes (Metcalf et al., 2016). During protein function, the structure is altered and a new energy state is adopted, which must still be stable enough to perform biologically.

Model validation programs ensure models are of highest quality. Each model validation program has its own advantages and disadvantages and grades models differently, providing information regarding different aspects of the generated model. Evaluation of monomers, homodimers, heterodimers and multimeric proteins cannot rely on similar model evaluation programs. Complex protein-protein interfaces of multimeric proteins are particularly challenging to evaluate and are not catered for by certain software. Evaluation programs may rely on libraries to generate comparisons between qualities of models and proteins within the library. Variations in fundamental structure between models and library proteins prevent accurate comparisons.

2.1. Comparative homology modelling overview

Steps include database mining for potential templates, aligning target and template sequences using multiple sequence alignment programs (MSAs), and generating models by backbone modelling, loop modelling and side-chain modelling, followed by model validation. Once this process is complete, errors should be rectified and the process repeated to improve model quality.

2.1.1. Template searching

Suitable templates are obtained from databases such as the Protein Data Bank (PDB), Universal Protein Resource (UniProt) or Class, Architecture, Topology, Homologous Superfamily (CATH). Templates are obtained using programs such as the Basic Local Alignment Search Tool (BLAST), Position-Specific Iterated BLAST (PSI-BLAST) or fold recognition search methods (Altschul et al., 1994; Holm and Sander, 1996). BLAST obtains homologues through pairwise sequence-sequence comparison, aligning two sequences at a time. Other search programs such as HHsearch are also used and fall within the HH-suite, which includes HHpred and HHblits. The HH-suite package and HMMER use hidden Markov models (HMMs) to identify homologues (Marks et al., 2011; Soding, 2005).

2.1.1.1. Pairwise sequence-sequence methods

Pairwise sequence-sequence comparisons use either dot-matrix, dynamic programming (DP) or word methods. The dot-matrix is simple to construct; however, matrix analysis is time consuming. An advantage of the dot-matrix is visualisation of insertions, deletions, repeats or inverted repeats (Dumas and Ninio, 1982; Maizel and Lenk, 1981). However, high background alignment noise levels make identifying these events difficult (Huang and Zhang, 2004). To construct the matrix, target and homologue sequences are written along the top row and leftmost column of a two-dimensional matrix. A dot is placed wherever the characters of the two sequences are identical. Highly homologous sequences display a single diagonal line originating in the top leftmost corner and ending in the bottom rightmost corner. These plots are susceptible to noise, lack clarity, are not intuitive and take a long time to analyse.
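The construction described above can be sketched in a few lines. This is a minimal illustration only, with hypothetical function names and toy sequences; published dot-plot programs typically add window and threshold filtering to suppress background noise, which is omitted here.

```python
def dot_matrix(target, homologue):
    """Build a dot matrix: target along the top row, homologue down the left column.

    Cell [i][j] is True where homologue[i] and target[j] are identical.
    """
    return [[t == h for t in target] for h in homologue]


def render(matrix):
    """Draw the matrix as text, using '*' for a dot and '.' for an empty cell."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)


def main_diagonal_fraction(matrix):
    """Fraction of dots on the main diagonal (1.0 for identical sequences)."""
    n = min(len(matrix), len(matrix[0]))
    return sum(matrix[i][i] for i in range(n)) / n


# Toy peptide sequences chosen for illustration only.
plot = dot_matrix("LDVPSTRGD", "LDVPATRGD")
print(render(plot))
print(main_diagonal_fraction(plot))
```

Off-diagonal dots in the printed plot are the background noise referred to in the text.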

DP relies on the Needleman-Wunsch algorithm for global alignments and the Smith-Waterman algorithm for local alignments (Muhamad et al., 2018). A scoring matrix is constructed for the target-template pair, with scores and gap penalties associated with each amino-acid match or mismatch. The penalty for introducing a gap is greater than that for extending one. The optimal alignment is generated by summing scores along the path that gives the highest total, originating from the top leftmost corner and usually ending at the bottom rightmost corner. These methods are accurate; however, obtaining an optimal alignment between multiple sequences is inefficient.
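A minimal sketch of global alignment by dynamic programming follows. It is a simplification under stated assumptions: identity scoring (match/mismatch) is used instead of a substitution matrix, and a single linear gap penalty is used instead of separate gap-opening and gap-extension penalties. The function name and default scores are illustrative.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return (score, aligned_a, aligned_b).

    Assumption: linear gap penalty and identity scoring, for brevity.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: leading gaps in b
        score[i][0] = i * gap
    for j in range(1, cols):          # first row: leading gaps in a
        score[0][j] = j * gap

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill: each cell takes the best of diagonal, up (gap in b), left (gap in a).
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Traceback from the bottom-right corner to the top-left corner.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

The fill step costs time proportional to the product of the two sequence lengths, which is why extending exact DP to many sequences at once becomes inefficient.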

The final pairwise sequence-sequence method is the word (k-tuple) method used by BLAST (Wilbur and Lipman, 1983). This heuristic method may not generate the optimal alignment; however, word-based methods are more efficient. Short non-overlapping sub-sequences (words) are identified within the target and matched against words in the homologue. Distances between word locations are calculated and an offset score directly proportional to distance is assigned. These methods are efficient because only regions identified as having similar words are tested, saving the computation time otherwise spent analysing non-homologous sequences. These methods detect only about 50% of possible homologues in the range of 20% to 30% sequence identity (Brenner et al., 1998).
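The word-matching idea can be sketched as follows. This is a simplified illustration and not the BLAST implementation: it indexes every word in the homologue, looks up non-overlapping words from the target, and counts how many word matches share the same offset (diagonal). The function names, word length and sequences are illustrative assumptions.

```python
from collections import Counter, defaultdict


def word_offsets(target, homologue, k=3):
    """Count word matches per offset (homologue position minus target position)."""
    index = defaultdict(list)
    for j in range(len(homologue) - k + 1):
        index[homologue[j:j + k]].append(j)

    offsets = Counter()
    for i in range(0, len(target) - k + 1, k):   # non-overlapping target words
        for j in index.get(target[i:i + k], ()):
            offsets[j - i] += 1
    return offsets


def best_offset(target, homologue, k=3):
    """Return (offset, number_of_matching_words) for the best diagonal, or None."""
    offsets = word_offsets(target, homologue, k)
    return offsets.most_common(1)[0] if offsets else None


# Toy sequences: the target is embedded in the homologue two residues in.
print(best_offset("ACDEFGHIK", "MMACDEFGHIKLL"))  # (2, 3)
```

Only the region around the best-scoring offset would then need to be examined in detail, which is where the saving in computation time comes from.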

2.1.1.2. Profile-profile comparison methods

Homologues can be identified using profile analysis methods such as profile-profile comparisons, intermediate sequence search and HMMs (Eddy, 1998; Park et al., 1998; Rychlewski et al., 2000). One example, PSI-BLAST, drastically increases the number of detected homologues if sequence identity is below 25% (Muller et al., 1999). Profile-profile search methods rely on pairwise amino acid substitution matrices (Henikoff and Henikoff, 1993). A profile contains the log-odds scores (the logarithm of the ratio between the probability of observing a residue as a result of mutation and the probability of observing it by chance) of every amino acid at each position within the target and is constructed using pairwise alignment matrices and homologues as input. This profile is used to query the database a second time; however, position-specific scores, rather than basic pairwise amino acid comparisons, are used to evaluate the quality of alignments.
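A position-specific log-odds profile can be sketched as below. The sketch makes simplifying assumptions that differ from PSI-BLAST: a uniform background frequency for the 20 amino acids, a flat pseudocount, and no sequence weighting. The function names and the toy alignment are illustrative.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_odds_profile(aligned, pseudocount=1.0):
    """Return one {amino_acid: log2(observed / background)} dict per column.

    Assumption: uniform background (1/20) and a flat pseudocount.
    """
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*aligned):
        counts = Counter(res for res in column if res in AMINO_ACIDS)  # skip gaps
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def profile_score(profile, sequence):
    """Sum the position-specific scores of an ungapped sequence."""
    return sum(column[res] for column, res in zip(profile, sequence))


# Toy alignment of equal-length sequences, for illustration only.
profile = log_odds_profile(["LDV", "LDV", "LEV", "IDV"])
print(round(profile_score(profile, "LDV"), 2))
print(round(profile_score(profile, "WWW"), 2))
```

Residues that are over-represented in a column receive positive scores and rare residues receive negative scores, so a conserved motif scores higher than an unrelated sequence.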

The latest search methods include HMMs and conditional random fields (CRFs). The former is an advanced sequence profile algorithm incorporating data pertaining to insertions and deletions found in MSAs. In addition to this, HMMs incorporate secondary structure information (Lam et al., 2017). CRFs are similar to HMMs; however, they do not assume residue independence (Lam et al., 2017).

Homologues are also identified through threading, which relies on a pairwise comparison of homologues to the target and a comparison of the target sequence to homologue structural information (David et al., 2000). Such programs include 3D-PSSM (Kelley et al., 2000), which identifies homologues when the target sequence is compatible with one of the known 3D folds. The folds are predicted by optimising the sequence alignment with respect to a structure-dependent scoring function, which independently scores each sequence-structure pair. This limits the number of possible alignments available between homologues and target.

2.1.2. Selecting templates

Homologues of high identity are indicated by an E-value (Expected value) close to zero. Homologues with a sequence identity of 40% or higher are suitable for homology modelling (Figure 3). Sequence length should be considered, as shorter sequences have a higher probability of appearing homologous by chance and therefore require a higher percentage identity (Figure 3). Other factors to be considered when selecting templates include the template environment, such as pH, solvent, ligands and quaternary interactions associated with the template. Templates obtained under environmental conditions similar to those of the target are likely to be better templates, as they share additional features with the target.

Figure 3. Homology modelling restrictions. The percentage identity required between template and target sequences is affected by the number of aligned residues. The safe homology zone (top) and the twilight zone (bottom). The midnight zone (not shown) acts as a threshold to prevent non-homologous sequences from being selected for homology modelling and lies below 20% sequence identity (Krieger et al., 2005).

2.1.3. Sequence alignment

MSA programs such as Clustal Omega and Multiple Alignment using Fast Fourier Transform (MAFFT) are used for generating accurate alignments. The algorithms that govern the alignment process differ between MSA programs and may cause variation in alignment output. In spite of these differences, if sequence identity is high, MSA programs will likely produce identical alignments. Once sequences are aligned, alignments can be manually edited to ensure high quality modelling, particularly if the sequence identity is low (Johnson and Overington, 1993). DP algorithms use point accepted mutation (PAM) and blocks substitution matrix (BLOSUM) matrices to align sequences. These matrices are limited when sequence identity lies in the twilight zone. These programs incorporate specific information about the protein family to which the protein belongs. However, if insufficient information is obtained, programs such as PSI-BLAST generate sequence profiles from multiple alignments of members of the protein family. The procedure is improved by aligning sequence profiles in programs such as SALIGN (Saxena et al., 2013).

2.1.3.1. Progressive and iterative alignment

MSA programs such as Clustal and T-COFFEE use progressive or iterative alignment protocols, or both, to achieve high quality alignments. Progressive methods generate an alignment between the most similar homologues, then add less similar homologues to the alignment one at a time before recalculating the overall alignment.

Iterative methods remove the dependency on the initial alignment, which may limit the accuracy of the final alignment in progressive alignment strategies. An initial global alignment involving all sequences is generated and graded, and the process is repeated multiple times to improve on the previous total global alignment grade. If the previous score is worse than the current score, the previous alignment is discarded and replaced.

2.1.3.1.1. Clustal Omega

Clustal Omega begins by calculating pairwise alignments using the k-tuple method, relying on the Needleman-Wunsch algorithm to generate a similarity matrix (Daugelaite et al., 2013). Similarity scores computed within the matrix are converted to distance scores indicating evolutionary relatedness between sequences. Sequences are clustered using the mBed and k-means methods. The mBed method embeds each selected sequence in a space of n dimensions, where n is directly proportional to log N. The mBed algorithm has a complexity of O(N log N), a common run time for most sorting algorithms which contain a tree structure (Daugelaite et al., 2013). Sequences are then replaced by an n-element vector. These vectors are clustered using the k-means method; alternatively, clustering may be completed via the unweighted pair group method with arithmetic mean (UPGMA) (Daugelaite et al., 2013). UPGMA functions in four steps. The first is estimation of the rooted dendrogram branch length, followed by a distance matrix update. Clustering of the most related sequences occurs in the second step, followed by a second branch length estimation preceding a repeat of the clustering step. The final step produces the UPGMA dendrogram indicating the sequences that are most similar. Clustering using the k-means method minimises the distance between points within the same cluster. The k-means method is fast, simple and overcomes problems associated with defining initial cluster centres (Arthur and Vassilvitskii, 2007). The dendrogram created using UPGMA is used as a basis from which final alignments are achieved. The UPGMA method uses distance scores computed through conversion of the similarity score matrix. The final alignment process is completed using HHalign, which aligns two HMM profiles (Daugelaite et al., 2013).
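The clustering and distance-update cycle of UPGMA can be sketched as follows. This is a generic textbook version written for illustration, not the Clustal Omega code: clusters are represented as sets of sequence labels, the closest pair is merged, and distances to the merged cluster are size-weighted averages. The function name and the toy distances are assumptions.

```python
def upgma(distances):
    """Cluster labels by UPGMA.

    distances: {(a, b): d} for every unordered pair of labels.
    Returns the merges in order as (frozenset_of_labels, node_height).
    """
    # Represent every starting cluster as a frozenset holding one label.
    dist = {frozenset((frozenset([a]), frozenset([b]))): float(d)
            for (a, b), d in distances.items()}
    clusters = {c for pair in dist for c in pair}
    merges = []

    while len(clusters) > 1:
        pair = min(dist, key=dist.get)            # closest pair of clusters
        left, right = tuple(pair)
        merged = left | right
        merges.append((merged, dist[pair] / 2))   # branch height = distance / 2
        clusters -= {left, right}

        # Distance matrix update: size-weighted mean of the old distances.
        updated = {p: d for p, d in dist.items()
                   if left not in p and right not in p}
        for other in clusters:
            d_left = dist[frozenset((left, other))]
            d_right = dist[frozenset((right, other))]
            updated[frozenset((merged, other))] = (
                (len(left) * d_left + len(right) * d_right) / len(merged))
        clusters.add(merged)
        dist = updated
    return merges


# Toy distances: A and B are closest, C is equidistant from both.
for members, height in upgma({("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6}):
    print(sorted(members), height)
```

The order of the merges defines the dendrogram that is then used to guide the final alignment.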

2.1.3.1.2. MAFFT

MAFFT prioritises speed without incurring unnecessary loss in alignment quality. MAFFT relies on a library of algorithms, from which one is chosen to align sequences based on their length and number. Speed originates from identification of homologous regions by the fast Fourier transform method, which converts residues to vectors composed of volume and polarity values (Daugelaite et al., 2013). A scoring system reduces CPU workload and increases alignment accuracy by grading these values.

The strategy uses two-cycle heuristics: a progressive method followed by an iterative refinement method. In the first cycle, pairwise distances are calculated, allowing for an initial MSA from which refined distances are calculated (Daugelaite et al., 2013). The second cycle applies iterative refinement, in which the refined alignment is compared with the original alignment over many cycles. A PartTree option is also available if there are more than 50 000 sequences to be aligned, which allows scalability, providing increased speed and accuracy (Katoh et al., 2002).

2.1.3.1.3. T-COFFEE

T-COFFEE uses a tree-based consistency objective function for alignment procedures (Daugelaite et al., 2013). The T-COFFEE alignment algorithm uses heterogeneous data sources provided to T-COFFEE through a library containing both local and global pairwise alignments.

The progressive alignment stage of T-COFFEE provides pairwise alignments using distance matrices. Matrices are used to generate a guide tree using the Neighbour-Joining method. The guide tree functions by grouping similar sequences together into clusters during alignment, providing increased accuracy. DP methods align the two closest sequences, which are weighted before aligning the next two closest sequences to the obtained alignment (Darden et al., 1993).

The dependency on library information gives T-COFFEE an accuracy 5% to 10% higher than that of other programs such as ClustalW. However, T-COFFEE can only align up to 100 sequences without losing accuracy and thus suffers from poor scalability (Sievers et al., 2013).

2.1.4. Model construction

Model construction can be completed by methods such as assembly of rigid bodies, modelling by segment matching or coordinate reconstruction and modelling by satisfaction of spatial restraints.

2.1.4.1. Modelling by assembly of rigid bodies

Modelling of proteins by rigid bodies requires fragmentation of protein structures into three regions: the conserved backbone core, the variable loops and the side chains associated with the backbone (Greer, 1990). This process comprises backbone assembly, loop modelling, side chain modelling and energy minimisation.

2.1.4.1.1. Backbone assembly

Backbone generation is the replication of superimposed template residue coordinates which result from the target-template alignment. If the chosen templates lack acceptable levels of quality, a possible solution is to select certain regions from each and combine them to create more suitable templates. The carbon atoms involved in the backbone are averaged with respect to their X, Y and Z coordinates if the region is structurally conserved between all templates. The main chain atoms involved in each core region of the target are determined by superposition of the target sequence onto the superimposed template sequences (Saxena et al., 2013).
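The coordinate averaging step can be illustrated with a short sketch. It assumes the templates have already been superimposed and that each list holds the same aligned, structurally conserved backbone positions in the same order; the function name and the coordinates are hypothetical.

```python
def average_backbone(templates):
    """Average X, Y and Z per aligned position across superimposed templates.

    templates: list of equal-length lists of (x, y, z) tuples, one list per template.
    """
    lengths = {len(t) for t in templates}
    if len(lengths) != 1:
        raise ValueError("all templates must cover the same aligned positions")
    n = len(templates)
    return [
        tuple(sum(atom[axis] for atom in position) / n for axis in range(3))
        for position in zip(*templates)
    ]


# Two toy templates with two aligned backbone positions each.
template_a = [(0.0, 0.0, 0.0), (2.0, 2.0, 2.0)]
template_b = [(2.0, 0.0, 0.0), (4.0, 2.0, 2.0)]
print(average_backbone([template_a, template_b]))
# [(1.0, 0.0, 0.0), (3.0, 2.0, 2.0)]
```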

2.1.4.1.2. Loop modelling

Gaps occurring within templates should be covered either by obtaining additional template information or by removing the corresponding residues of the target within the gapped region. Gaps occurring within regions of known secondary structure can be shifted to the end of that region. However, gaps within loop regions may be difficult to resolve, and due to the nature of loop regions their effects may be difficult to predict. Loop modelling can be completed using two different methods: a knowledge-based approach relying on a library of solved structures obtained from the PDB, or an approach relying on energy functions which qualify the loop region (Saxena et al., 2013).

If the first method is used, comparisons are drawn between the loop region currently being evaluated and those present in the database. The loop is then modelled on the loop regions deemed homologous. In the second method, energy minimisation is the overall goal, and the quality of the loop is gradually improved over many iterations, generating high quality loop regions.

2.1.4.1.3. Side chain modelling

Modelling of protein side-chains is more complicated than modelling of backbone or loop regions, as all attainable conformations cannot realistically be determined within a reasonable time. To overcome this challenge, partial knowledge-based systems are put into place in which a library of rotamers is obtained from high-resolution X-ray or NMR structures and compared to the template (Krieger et al., 2005). Conformations of the side chains in terms of position and geometry within 3D space can then be calculated based on information obtained from highly homologous structures within the library.

2.1.4.2. Structure refinement

It is important to grade the quality of both the backbone and the side-chains. However, backbone quality is dependent on the position of the amino acid side-chains, whilst side-chain positioning and geometry are dependent on the protein backbone. This creates a circular challenge: the overall quality of the protein is difficult to determine because the parameter used to grade the backbone (the side-chains) itself depends on the quality of the backbone it is grading.

Optimisation of protein models is completed in an iterative manner in which many structures are generated and analysed in order to determine the highest quality model from the generated set. Energy minimisation procedures are imperative at this stage to ensure that models are energetically favourable. From these sets, the most energetically favourable models are chosen to be graded further.

Structure refinement by molecular dynamics (MD) analysis is another means by which accuracy of models may be improved (Saxena et al., 2013). Energy functions used to grade model quality are either in the form of quantum force fields or self-parameterising force fields. How these energy functions grade models varies with the former being more accurate in terms of grading the electrostatic interactions within proteins. However, this increased level in accuracy comes at a cost of increased computational power and time. This illustrates the trade-off between accuracy, computational requirements and time. In spite of these differences, overall model quality reported by either energy function is similar.


If the sequence identity between target and template is greater than 90%, the model is of very high quality, comparable to a crystallographically determined structure. If the identity lies between 50% and 90%, the root mean square deviation (RMSD) value for the modelled coordinates may be as high as 1.5 Å (Krieger et al., 2005). If sequence identity between target and template is below 25%, the alignment is classified as the limiting factor.
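For reference, the RMSD between two sets of equivalent atoms can be computed as in the sketch below. It assumes the two coordinate sets are already optimally superimposed and listed in the same atom order; the superposition step itself is not shown, and the function name and coordinates are illustrative.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation (in the input units, e.g. Å) between
    two equally long lists of (x, y, z) coordinates of equivalent atoms."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Toy example: one atom displaced by a 3-4-5 triangle gives an RMSD of 5.0.
print(rmsd([(0.0, 0.0, 0.0)], [(3.0, 4.0, 0.0)]))  # 5.0
```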

Errors present in template sequences and associated structural information should also be considered in terms of their location. If errors occur within regions lacking structural relevance, their effect on model quality is limited. Conversely, if errors occur within biologically relevant regions, the effects may be elevated. Force fields are used to determine the effects these errors may have on the model by taking their frequency of occurrence and location into account. The force field will analyse, from an energy perspective, parameters such as bond lengths, bond angles and bumps that may be present in the model, amongst others. These parameters are also evaluated in terms of the physical constraints.

2.1.4.3. Modelling by segment matching or coordinate reconstruction

Models are generated by comparing a subset of coordinates from segments of the target to those of other protein structures classified into 100 structurally different classes (Bystroff and Baker, 1998). Coordinate subsets are used as guiding positions from which the remainder of the model is constructed and are most likely obtained from conserved carbon atom segments. Coordinates are obtained in one of two ways: by scanning all known protein structures or by conformational searches which are restrained via energy functions (van Gelder et al., 1994).

2.1.4.4. Modelling by satisfaction of spatial restraints

This method relies on model generation governed by spatial restraints placed at the start of modelling procedures. Restraints are based on homologues and include features such as bond lengths, angles, dihedral angles and non-bonded atom-atom contact points (Saxena et al., 2013). These features are obtained from a force field. Model refinement is completed using distance geometry or real-space optimisation. Modelling is completed by first aligning the target with known 3D structures. These 3D structures are used to implement spatial restraints that govern the rest of the modelling process. Stereochemistry of the models is enforced using spatial restraints coupled with the CHARMM22 force field (Saxena et al., 2013). Once these parameters are incorporated and converted into an objective function from which the model can be constructed, optimisation can be completed by altering the objective function.

2.2. Modelling programs

Various programs can be used to model proteins, such as MODELLER (which relies on modelling by spatial restraints), MEDELLER, I-TASSER and Phyre2, each completing this process differently.

2.2.1. MODELLER

MODELLER utilises a spatial restraints protocol prior to model generation. A set of aligned sequences is taken as input for MODELLER, which models within predefined boundaries using suitable templates. MODELLER also performs other functions, such as aligning multiple sequences, and facilitates ab initio modelling of loop regions (Eswar et al., 2006). Modelling by spatial restraints is completed using distance geometry or optimisation techniques, with the restraints obtained from the alignment of templates. These spatial restraints are as follows: homology-derived restraints on distances and dihedral angles, and stereochemical restraints such as bond length and bond angle preferences, which are obtained from force fields such as CHARMM22 and from the templates involved in modelling (Eswar et al., 2006). Once a model is generated, the structure is optimised using MD simulations to minimise modelling errors.

2.2.2. MEDELLER

Homology modelling programs such as MODELLER allow for accurate modelling of globular or water-soluble proteins but fail to provide this level of accuracy for transmembrane proteins (Kelm et al., 2010). Of the approximately 11 million sequences currently present within the UniProt database, only approximately 65 000 have entries within the PDB. About 1.7 million of the UniProt sequences involve membranes, with only 4 700 corresponding PDB structures (Berman et al., 2000). As of 30 October 2018, the membrane protein browser contains 3 569 α-helical, 970 β-barrel and 484 monotopic membrane protein structures. Although there has been substantial growth in membrane protein structure submissions and curation, these structures remain a small portion of the PDB.

Structural differences between soluble and transmembrane proteins are highlighted by their interactions with the external environment. Soluble proteins adopt a globular conformation with a hydrophobic core bounded by a surface of polar and charged residues, whilst transmembrane proteins are more linear and are situated within a lipid bilayer, involving a high level of hydrophobic interactions (Eyre et al., 2004). A transmembrane segment is also unique to transmembrane proteins and may adopt either an α-helix or a β-strand secondary structure.

Many computational methods used to produce models do so with the physical constraints and characteristics of soluble proteins and do not take the physical and physicochemical differences of transmembrane proteins into account. The ROSETTA modelling software was one of the first able to accurately model transmembrane proteins through ab initio techniques. However, this process required cluster computing and long computational times.

Experimentation with MODELLER for the generation of transmembrane protein models indicated a lack of accuracy when compared to soluble protein counterparts. Issues arose particularly with the transmembrane segment, which interacts with the lipid bilayer; in some cases this region may form loops, thereby reducing the accuracy of the surrounding model sections (Kelm et al., 2010).

MEDELLER has been developed specifically for the generation of transmembrane protein models and achieves higher accuracy than MODELLER in 65% of modelling cases and is at least as accurate in 77% of cases (Kelm et al., 2010). MEDELLER functions by determining the membrane insertion of the protein using iMembrane, from which the “core” model is constructed and then expanded using the FREAD loop modelling protocol and MODELLER (Choi and Deane, 2010).

The algorithm functions in four main stages. Firstly, user input (the target protein’s sequence and one or more homologous sequences with associated 3D coordinates) is submitted, followed by annotation of the sequence alignment by iMembrane (Kelm et al., 2009) and JOY (Mizuguchi et al., 1998). The core is generated in four phases, followed by a refinement stage whereby poorly modelled regions or regions that lack a template are modelled using FREAD (a database search loop prediction method). FREAD selects possible fragments based on a substitution score and an RMSD score, with a focus on accuracy over coverage (Kelm et al., 2010). Each phase during core construction uses masks to prevent particular alignment columns from being selected if they contain gaps in the target or template. Regions annotated by the iMembrane process are classified as “loop” regions if they reside outside the central membrane layer. The remaining alignment columns are graded using a specific substitution score, Scand, which determines the order in which columns are added to the core. A cut-off score is also associated with the model to prevent poorly graded columns from being integrated (Kelm et al., 2010). The iMembrane process attempts to determine the location of the protein within the membrane by relying on a database of known transmembrane protein structures that have previously been simulated in an artificial bilayer using MD simulations (Scott et al., 2008).

2.2.3. PRIMO

The PRIMO server allows users to select from a variety of regimes to perform homology modelling. The server bundles the underlying programs into a single package that focuses on visual aids and diagrams to help users perform high quality homology modelling.

PRIMO provides the choice of using either a local version of BLAST or HHsearch for the identification of possible templates. The local version of BLAST is far more time efficient than HHsearch and is used to query the target against the National Center for Biotechnology Information (NCBI) database (Hatherley et al., 2016). The returned PDB files are parsed to extract information such as chain ID and target-template sequence identity, which is tabulated and displayed. HHsearch can be employed if distant homologues are to be identified (Hatherley et al., 2016). In HHsearch, the target is queried against the UniProt20 database with associated secondary structure information and converted to an HMM, which is queried against the HH-suite pdb70 database to identify potential homologues (Hatherley et al., 2016).

During the alignment step, the PDB is parsed to extract sequence information allowing for missing and non-standard amino acids to be replaced with “X” such that this information can still be included in the upcoming alignment generated by either MAFFT, MUSCLE, ClustalOmega or T-COFFEE.

During the modelling procedure the alignment is processed to substitute gaps with “-” and modified residues with “.” characters. Sequences are trimmed at the ends and each sequence is checked against its respective PDB file (Hatherley et al., 2016). The models are then built using MODELLER and evaluated using Z-DOPE scores and PROCHECK.

2.2.4. Phyre2

Phyre2 performs the modelling process in four stages and in one of two modes, intensive or normal. The former attempts to generate full-length models of sequences iteratively through combinations of multiple template modelling and ab initio folding simulations. The latter is initiated by searching for homologous sequences of the target sequence in various databases using HHblits (Kelley et al., 2015). HHblits performs profile-profile and sequence-profile matching efficiently, increasing the number of available templates. An MSA is generated while PSI-BLAST based secondary structure prediction (PSIPRED) attempts to predict protein secondary structure. PSIPRED functions through a web-based server featuring artificial neural network machine learning. The generated MSA and predicted secondary structure are used to generate an HMM, which is queried against a precompiled database of HMMs of known structures using HHsearch (Kelley et al., 2015). HHsearch functions by determining HMM profiles for the target and template and using these profiles to identify additional templates. The target-template alignment is generated before crude backbone generation is initiated. This process is followed by loop modelling and finally the addition of side chains, leading to final model construction.

Loop modelling is completed using a library of known fragments, which are between 2 and 15 amino acids in length. Segments of the target sequence used to model the protein are compared to fragments present in the library, from which the structural information is obtained. Once suitable templates are obtained, overlapping fragments are ranked from most suitable to least suitable (Kelley et al., 2015). The most suitable fragments are used to construct loop regions.

Modelling is performed using Poing, a simplified protein-folding simulator, which is only used if the intensive mode is selected (Jefferys et al., 2010). Multiple iterations of the modelling process generate a set of models from which a subset must be extracted to allow for maximum target protein coverage while ensuring maximum confidence (Kelley et al., 2015). These models supply distance constraints between various pairs of residues. Poing treats these restraints as linear inelastic springs. A virtual ribosome slowly synthesises the target protein, which is modelled with respect to the specified restraints. Residues that are not restrained or lack a template are modelled using ab initio techniques such as Poing’s solvent bombardment model, predicted secondary structure springs and penalisation of steric clashes (Kelley et al., 2015). Multiple models are constructed that are composed solely of α-carbons; the backbone of each model is then generated using the powerful chain restoration algorithm (Pulchra) (Rotkiewicz and Skolnick, 2008). Side chain modelling is completed using the R3 protocol, which utilises a rapid graph-based technique coupled with a side chain rotamer library to place the side chains in the most suitable positions.

2.2.5. I-TASSER

The I-TASSER server performs homology modelling in four stages: threading, structural assembly, model selection and refinement, and finally structure-based functional annotation.

Identification of homologous sequences is completed using PSI-BLAST and the non-redundant sequence database (Altschul et al., 1994). PSIPRED is used to predict the secondary structure of the protein model, using a sequence profile based on a multiple sequence alignment of homologous sequences. The target sequence is then threaded through a PDB structure library using LOMETS, which consists of seven different threading programs (Wu and Zhang, 2007). For each threading program, the templates are ranked based on sequence-based and structure-based scores. The top templates for each program are taken for further consideration.

Regions covered well by these templates are modelled from the corresponding template structures, whilst regions not aligned well are modelled using ab initio modelling. To increase accuracy of the conformational search of the protein, I-TASSER uses a reduced model to represent the protein chain, whereby each residue is described by its α-carbon and the centre of mass of its side chain (Roy et al., 2010). To limit the entropy of the conformational search, regions modelled using ab initio techniques are restricted to a lattice system of fixed grid size. For regions aptly covered by templates, modelling is completed off lattice and the regions are kept rigid during simulations, maintaining high-resolution modelling. Fragment assembly is completed using a modified replica-exchange Monte Carlo simulation technique (Zhang et al., 2002). The simulations are guided by a composite knowledge-based force field, which includes factors such as general statistical terms, spatial restraints and sequence-based contact predictions. The generated models are clustered using SPICKER, identifying low free-energy states.

Model selection and refinement is completed by generating multiple models, which are clustered. External constraints are pooled from LOMETS threading alignments, and PDB structures structurally similar to the cluster centroids are identified by TM-align (Zhang and Skolnick, 2005). Incorporation of restraints allows removal of steric clashes and refinement of the global topology of the cluster centroid. The final models are generated by REMO through assembly of all-atom models from the α-carbon traces via the optimisation of hydrogen-bonding networks (Roy et al., 2010). Structure-based functional annotations are completed by matching predicted 3D models against proteins of known structure and function present in the PDB using TM-align.

2.3. Errors in homology modelling

There are five main categories of errors commonly associated with homology modelling. Firstly, errors in side chain packing occur as sequences become more divergent and may be critical if these side chains are involved in a protein’s biological function. Secondly, for more divergent sequences, distortions and shifts in aligned regions may occur, causing alterations in the protein’s structure. Additional errors may occur if a target sequence segment is modelled without a template or, alternatively, if there is misalignment between sequences. The latter is most likely to occur for sequences sharing less than 30% homology with the target sequence. Finally, incorrect templates will cause major errors in model construction.

2.4. Determining model accuracy

Determining model accuracy and identifying problem regions is an important step in homology modelling. This allows the user to correct any regions that may be poorly modelled, thereby generating the highest quality model possible. Model validation software includes PROSA and Verify 3D. More intricate software also exists, such as the protein structure evaluation suite and server (PROSESS), which utilises a variety of programs to determine the quality of produced models. PROSESS includes VADAR, GeNMR, ShiftX, RCI, Preditor, MolProbity, XPLOR-NIH and NAMD (Berjanskii et al., 2010).

2.4.1. PROSA

Calculations involved in determining model quality using PROSA are performed using α-carbon potentials (Wiederstein and Sippl, 2007). Once coordinates of amino acids are parsed, a distance-based pair potential coupled with the potential associated with solvent exposure of protein residues evaluates the energy of the protein (Wiederstein and Sippl, 2007). The output generated by PROSA includes a z-score and a plot of residue energies.


The z-score provides an indication of overall model accuracy and is plotted on a graph alongside the z-scores of experimentally solved protein structures. The z-score of the model can then be compared to the z-scores of these other proteins.

An energy plot is created for local model quality assessment by plotting energies as a function of amino acid sequence position. Positive values indicate problematic regions. These local energies are averaged over both 40-residue fragments and 20-residue fragments. A 3D visualisation of the model, which highlights problematic regions in red, is also created using Jmol.
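The windowed averaging behind such an energy plot can be sketched as follows. This is a minimal illustration rather than PROSA's own implementation; the function names and the per-residue energies are hypothetical, and the window size is left as a parameter.

```python
def windowed_average(energies, window):
    """Average per-residue energies over a sliding window of the given size."""
    if window < 1 or window > len(energies):
        raise ValueError("window must be between 1 and len(energies)")
    running = sum(energies[:window])
    averages = [running / window]
    for i in range(window, len(energies)):
        # Slide the window one residue along the sequence.
        running += energies[i] - energies[i - window]
        averages.append(running / window)
    return averages

def problem_windows(energies, window):
    """Start positions of windows whose average energy is positive."""
    averaged = windowed_average(energies, window)
    return [i for i, value in enumerate(averaged) if value > 0]

# Hypothetical per-residue energies for a six-residue stretch.
example = [-1.0, -0.5, 0.75, 1.25, -0.5, -1.5]
print(windowed_average(example, 3))  # [-0.25, 0.5, 0.5, -0.25]
print(problem_windows(example, 3))   # [1, 2]
```

Larger windows smooth out isolated unfavourable residues, so a region that stays positive at a large window size is a stronger indication of a poorly modelled segment.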

2.4.2. Verify 3D

Proteins are evaluated using the sequence coupled with a 3D profile. Each residue is characterised by its environment and is represented by a row of twenty numbers in the profile. These numbers are 3D-1D scores, which are the statistical preferences of each of the twenty amino acids for that environment. The characteristics of each residue's environment include the buried area of the residue, the fraction of the side-chain area covered by polar atoms and the local secondary structure (Eisenberg et al., 1997). An overall score, called the s-score, is provided to grade model quality. Assessment of local segments is completed by plotting the compatibility of the sequence with the 3D structure of the model.

2.4.3. PROSESS

There are multiple programs associated with the protein structure evaluation suite and server (PROSESS), each being updated and maintained separately. PROSESS is useful in that the package draws on the strengths of each of these programs while avoiding their individual disadvantages by selectively using different programs for different proteins.

2.4.3.1. VADAR

Volume, area, dihedral angle reporter (VADAR) uses over 15 different programs to analyse models generated through homology modelling. These programs allow VADAR to read in PDB files and calculate hydrogen bond energies (Baker and Hubbard, 1984; Kabsch and Sander, 1983), side-chain and backbone torsion angles, excluded volume (Richards, 1977), secondary structure (Kabsch and Sander, 1983; Levitt and Greer, 1977; Richards and Kundrot, 1988), β-turn identity, secondary structure propensity (Garnier et al., 1978), 3D profiles (Luthy et al., 1992), stereochemical quality (Morris et al., 1992), solvation free energy (Chiche et al., 1990; Eisenberg and McLachlan, 1986), accessible surface area (Lee and Richards, 1971) and statistical analysis for both global and local regions of proteins.

2.4.3.2. GeNMR

Generate NMR structure (GeNMR) is a web server used for the construction of models by predicting 3D protein structures using nuclear Overhauser effect (NOE) based distance restraints coupled with NMR chemical shifts (Berjanskii et al., 2009). GeNMR focuses on the identification of secondary structure and protein folds (Wishart and Case, 2001; Wishart and Sykes, 1994), prediction of protein flexibility and torsion angles, as well as the overall computation of protein structures (Berjanskii et al., 2006; Berjanskii and Wishart, 2005; Wishart et al., 2008).

2.4.3.3. ShiftX

This program uses a semi-empirical approach to calculate chemical shifts. Prediction of hydrogen, carbon and nitrogen shifts provides information on dihedral angles, side chain orientation, secondary structure and the effects of nearby chemical groups (Neal et al., 2003).

2.4.3.4. RCI

Random coil index (RCI) predicts the flexibility of a protein by calculating the RCI from the protein’s backbone chemical shifts. Values of model-free order parameters and the per-residue RMSF of NMR and MD ensembles are then estimated from the RCI (Berjanskii and Wishart, 2005).

2.4.3.5. Preditor

Predicting φ, ψ, χ1, and ω torsion angles (Preditor) uses information obtained from ¹³C, ¹⁵N and ¹H atoms as well as sequence homology. The accuracy of predicting these values is approximately 84.00% for χ1, and 99.98% and 93.00% for ω with regard to trans peptide bond identification and cis peptide bond identification, respectively (Berjanskii et al., 2006).

2.4.3.6. MolProbity, XPLOR-NIH and NAMD

MolProbity functions by assessing the clashes between atoms as well as His or Asn flips. XPLOR-NIH functions to identify and quantify NOE restraint violations whilst NAMD computes the energies associated with the protein structures (Berjanskii et al., 2010).


2.5. Docking programs

In order to dock ligands with generated protein models, servers and programs exist that find optimal locations for such interactions. These include, but are not limited to, AutoDock, AutoDock Vina, HADDOCK2.2 and ClusPro. Docking procedures may be classified into two categories. Firstly, direct methods rely on thermodynamics to determine optimal docking locations, which represent minimum Gibbs free energy states (Vajda and Kozakov, 2009). Secondly, template-based or information-driven docking procedures rely on template structures that share 30% or more sequence homology with the target (Aloy et al., 2003). However, due to the limited number of docked protein complexes that have been solved, the former method of docking is more widely employed.

2.5.1. HADDOCK2.2

The information-driven HADDOCK2.2 server is compartmentalised, providing up to seven different interfaces that each allow a different level of control over docking procedures (van Zundert et al., 2015; Vangone et al., 2017). Although information-driven docking procedures are not as popular as their thermodynamics-reliant counterparts, these procedures provide more accurate results. Incorporating biochemical, biophysical or bioinformatic information increases sampling and scoring accuracy (Rodrigues and Bonvin, 2014). HADDOCK2.2 can incorporate information such as interface restraints (de Vries and Bonvin, 2011), mutagenesis experiments (Hopf et al., 2014), bioinformatics predictions (Karaca and Bonvin, 2013; van Zundert et al., 2015), shape data obtained from small angle X-ray scattering (SAXS) (van Dijk et al., 2005), cryo-electron microscopy data, orientations of individual structures, relaxation anisotropy (van Dijk et al., 2006) and pseudocontact shift data (Schmitz and Bonvin, 2011).

2.5.1.1. Overview

The main interfaces associated with HADDOCK2.2 are easy, expert and guru. Easy provides basic control over protein docking parameters, which is expanded upon in both the expert and guru interfaces. The server-based nature of this docking program facilitates a quick return of results and job completion; a grid-based version is also available, hosted by the European Grid Initiative (EGI) and associated National Grid Initiatives (NGIs). Guru exposes approximately 500 parameters that may be altered to cater for any docking procedure (van Zundert et al., 2015). The server also offers prediction, refinement, multi-body, gentbl and file upload interfaces, which, respectively, use bioinformatics interface predictors, perform the water refinement stage on uploaded models, facilitate docking of multiple molecules, construct customised restraint files and allow resubmission of files to redo a docking procedure with altered parameters (van Zundert et al., 2015).

2.5.1.2. Method

The protocol used by HADDOCK2.2 can be classified into three stages. This protocol uses previously obtained information within structure calculations. The first stage includes randomisation of orientation and rigid body docking via energy minimisation, which is guided by the submitted restraints (van Zundert et al., 2015). Secondly, a semi-flexible refinement in torsion angle space is set up in which side chains and backbone residues are maintained in a flexible state (van Zundert et al., 2015). Finally, the structure is refined in explicit solvent (van Zundert et al., 2015).

The scoring function used by HADDOCK2.2 includes various terms such as van der Waals and electrostatic energies for non-bonded residues, van der Waals and Coulomb intermolecular energies, a desolvation energy, the buried surface area upon complex formation, the restraint violation energy and energies relating to symmetry restraints, if specified. These energies are calculated in accordance with the OPLS force field (Jorgensen and Tirado-Rives, 1988), using distance cut-offs of 8.5 Å for electrostatic energies and between 6.5 Å and 8.5 Å for van der Waals energies of non-bonded residues (van Zundert et al., 2015).

Water-refined structures generated by HADDOCK are clustered by one of two possible procedures: fraction of common contacts (FCC) clustering, the default setting of HADDOCK2.2, which clusters structures based on contact similarities at the interface (Rodrigues et al., 2012), or clustering by backbone interface-ligand RMSD similarity. Clusters are defined by a minimum of four models that share enough similarity to be grouped together. Models are considered similar enough if they fall within a cut-off of 0.75 (a fraction rather than a distance) under FCC clustering or, if clustering is completed by RMSD, within 7.5 Å with regard to the backbone interface (van Zundert et al., 2015).
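The FCC similarity criterion can be sketched in a few lines. This is a simplified illustration, not the HADDOCK implementation: it assumes that the interface contacts of each model are available as a set, treats the 0.75 cut-off as a fraction, and uses hypothetical function names and integer contact identifiers.

```python
def fraction_common_contacts(contacts_a, contacts_b):
    """Fraction of the contacts in model A that are also present in model B."""
    if not contacts_a:
        return 0.0
    return len(contacts_a & contacts_b) / len(contacts_a)

def fcc_neighbours(models, cutoff=0.75):
    """Map each model to the other models that share enough contacts with it."""
    return {
        a: sorted(
            b for b in models
            if b != a and fraction_common_contacts(models[a], models[b]) >= cutoff
        )
        for a in models
    }

def possible_cluster_seeds(neighbours, min_size=4):
    """Models with enough neighbours to form a cluster of at least min_size."""
    return sorted(m for m, others in neighbours.items() if len(others) + 1 >= min_size)

# Hypothetical models; each integer stands for one interface residue-residue contact.
models = {
    "m1": {1, 2, 3, 4},
    "m2": {1, 2, 3, 4},
    "m3": {1, 2, 3, 5},
    "m4": {1, 2, 3, 4, 6},
    "m5": {7, 8, 9},
}
neighbours = fcc_neighbours(models)
print(neighbours["m1"])                    # ['m2', 'm3', 'm4']
print(possible_cluster_seeds(neighbours))  # ['m1', 'm2', 'm3']
```

Note that the measure is asymmetric in this sketch: m4 shares four of its five contacts with m1 (0.8) but only three of five with m3 (0.6), so m3 counts m4 as a neighbour while m4 does not count m3.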

2.5.1.3. Output

Upon file submission, the server prompts users to download a parameter file containing all essential information regarding the docking procedure. In addition, a link to the results page is sent to the user's email address. The results page also allows the user to follow the docking job; once the job is complete, the results are sent to the user's email address. The results page indicates the number of clusters, the refined structures within these clusters, their HADDOCK scores, statistics for each cluster indicating average values of the four best structures within each cluster, and a Z-score alongside other scoring parameters such as van der Waals energies.

2.5.2. CLUSPRO

The ClusPro server functions as a tool for protein-protein docking. The server also provides additional tools that facilitate removal of regions for which no template exists, specification of attraction and repulsion forces, addition of pairwise distance restraints, docking of homo-multimers, incorporation of SAXS data and specification of heparin-binding site locations (Kozakov et al., 2017). ClusPro is fast despite the extensive use of the server, with most docking runs completed in approximately four hours or less.

2.5.2.1. Protocols

The direct docking ClusPro protocol, originally introduced in 2004, has seen a vast degree of improvement (Kozakov et al., 2017). The computational stages of ClusPro include rigid-body docking by sampling billions of conformations, followed by RMSD-based clustering of the 1 000 lowest energy structures generated by the server to determine the models most likely to represent the complex. Finally, the models are refined via energy minimisation before the output is returned to the user (Kozakov et al., 2017).

2.5.2.1.1. Rigid-body docking

The initial docking procedure is carried out by PIPER, which is based on the FFT correlation approach (Kozakov et al., 2006). Rigid-body docking procedures are carried out by fixing the orientation and coordinates of the receptor upon a fixed grid whilst the ligand is placed upon a moveable grid. The interaction energy between the two proteins is represented as a correlation function (Kozakov et al., 2017). The FFT-based method allows the calculation of these correlation functions by exhaustively sampling billions of possible conformations of the interacting proteins. Therefore, no a priori information on the protein complex is required. FFT-based methods incorporate shape complementarity, electrostatic interactions and desolvation contributions (Chen and Weng, 2002; Gabb et al., 1997; Mandell et al., 2001). A major improvement on this FFT approach originates from the PIPER protocol incorporating structural information in the form of a structure-based pairwise interaction term within its scoring function. This interaction term includes terms facilitating increased docking detail (Kozakov et al., 2006). Selection of the lowest 1 000 models based on energy is a means to capture models that may be correct but may not be the most energetically favourable. As stated previously, some proteins adopt a “stable-enough” state that may not reflect the most energetically favourable conformation (Kozakov et al., 2005).

2.5.2.1.2. RMSD-based clustering and refinement

Interface RMSD (IRMSD) based clustering is completed using pairwise IRMSD as the distance measurement (Comeau et al., 2004). IRMSD values are calculated for each pair among the 1 000 structures, followed by selection of the structure that has the highest number of neighbours within a 9 Å IRMSD distance. This structure and all structures within this distance are chosen as the centre of the first cluster and as the members of the cluster, respectively (Kozakov et al., 2017). This cluster is then removed from the 1 000 structures, the server once again searches for the structure with the highest number of neighbours within a 9 Å IRMSD distance and the procedure repeats. This process continues until all structures are clustered. Up to 30 clusters may be generated in this fashion before being refined for 300 steps using van der Waals forces within the CHARMM force field (Brooks et al., 1983). Although this refinement only causes slight conformational alterations to the models, steric hindrances are removed in this process. ClusPro then outputs the 10 most populated clusters and separates the output into four categories: balanced, electrostatic-favoured, hydrophobic-favoured, and van der Waals and electrostatic favoured models.
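The greedy clustering procedure described above can be sketched as follows. This is a minimal illustration and not ClusPro's own code: it assumes that the pairwise IRMSD values are already available as a symmetric matrix, and the function name and the example distances are hypothetical.

```python
def greedy_radius_clustering(distance, radius=9.0, max_clusters=30):
    """Repeatedly pick the structure with the most neighbours within the radius.

    distance is a symmetric matrix (list of lists) of pairwise IRMSD values in Å.
    Returns a list of (centre index, sorted member indices) tuples.
    """
    remaining = set(range(len(distance)))
    clusters = []
    while remaining and len(clusters) < max_clusters:
        def neighbours(i):
            # A structure is always its own neighbour because distance[i][i] is 0.
            return {j for j in remaining if distance[i][j] <= radius}

        # Ties are broken in favour of the lowest index.
        centre = max(sorted(remaining), key=lambda i: len(neighbours(i)))
        members = neighbours(centre)
        clusters.append((centre, sorted(members)))
        remaining -= members
    return clusters

# Hypothetical IRMSD matrix (Å) for five docked structures.
irmsd = [
    [0.0, 3.0, 12.0, 20.0, 20.0],
    [3.0, 0.0, 4.0, 20.0, 20.0],
    [12.0, 4.0, 0.0, 20.0, 20.0],
    [20.0, 20.0, 20.0, 0.0, 2.0],
    [20.0, 20.0, 20.0, 2.0, 0.0],
]
print(greedy_radius_clustering(irmsd))  # [(1, [0, 1, 2]), (3, [3, 4])]
```

In this example structure 1 becomes the first centre because it is the only structure within 9 Å of both structure 0 and structure 2, even though those two are 12 Å apart from each other.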

2.5.2.1.3. Limitations

Although ClusPro is adept at docking enzyme-inhibitor complexes by incorporating a “soft” potential within the PIPER algorithm, which caters for particular steric overlaps, protein-protein docking often includes “difficult” targets, such as those in which the backbone undergoes conformational changes upon binding (Chen et al., 2003; Hwang et al., 2008; Hwang et al., 2010; Mintseris et al., 2005). These types of docking procedures are difficult to complete, especially for rigid-body algorithms, which require fixation of both receptor and ligand. The lack of automation in the final selection of the best model requires users to draw on previous knowledge of the docked protein complex to determine which model will best represent it.

Technical limitations of ClusPro include the inability to dock non-standard amino acids and the restriction of non-protein molecules to nucleic acids, with RNA molecules defined as the receptor, and heparin, defined as the ligand, therefore excluding cofactors and small ligands from the docking process (Kozakov et al., 2017). Modified amino acid residues, such as those that have been phosphorylated, cannot be taken into account, and only dimers and trimers are supported within the “multimer docking” mode. Finally, grid size is limited to approximately 350 Å in each dimension and the grid must contain both the receptor and the ligand (Kozakov et al., 2017).

2.5.3. AutoDock Vina

Similarly to other docking programs, AutoDock Vina uses a scoring function to determine the energies associated with each conformational state of the docked ligand (Trott and Olson, 2010). The lowest energy state is often assumed to be the correct docking conformation. During docking, many assumptions are made, such as those regarding the solvent, which may not accurately represent the biological system, and this should be taken into consideration when performing docking procedures. One of the major advantages of AutoDock Vina over AutoDock4 is the increase in algorithm speed achieved through implementation of an iterated local search global optimiser (Baxter, 1981).

A Metropolis criterion is used to accept or reject each of a series of steps, which consist of a mutation and a local optimisation by the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method (Nocedal and Wright, 1999). The Metropolis criterion can be described as a Markov chain Monte Carlo method that is used to obtain random samples from a probability distribution for which direct sampling may be unavailable. The BFGS method is classified as a quasi-Newton method, which focuses on locating local maxima and minima of functions. This method takes into consideration the derivatives of the scoring function with respect to its arguments, which in the case of ligand docking are the position and orientation of the ligand and the torsion angle values of rotatable ligand bonds, if present (Trott and Olson, 2010).
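The interplay between mutation, local optimisation and the Metropolis criterion can be illustrated with a minimal sketch. It is not AutoDock Vina's implementation: the one-dimensional scoring function is a toy, simple gradient descent with a numerical derivative stands in for BFGS, and all names, step sizes and the temperature are hypothetical.

```python
import math
import random

def score(x):
    """Toy one-dimensional scoring function with several local minima."""
    return 0.1 * x * x + math.sin(3.0 * x)

def local_optimise(x, step=0.01, iterations=500, h=1e-6):
    """Gradient descent with a numerical derivative (a stand-in for BFGS)."""
    for _ in range(iterations):
        gradient = (score(x + h) - score(x - h)) / (2.0 * h)
        x -= step * gradient
    return x

def metropolis_accept(old, new, temperature, rng):
    """Always accept improvements; accept worse scores with Boltzmann probability."""
    if new <= old:
        return True
    return rng.random() < math.exp((old - new) / temperature)

def iterated_local_search(start, steps=200, temperature=0.3, seed=0):
    """Mutate, locally optimise, then accept or reject by the Metropolis criterion."""
    rng = random.Random(seed)
    current = local_optimise(start)
    best = current
    for _ in range(steps):
        candidate = local_optimise(current + rng.uniform(-2.0, 2.0))  # mutation
        if metropolis_accept(score(current), score(candidate), temperature, rng):
            current = candidate
        if score(candidate) < score(best):
            best = candidate
    return best

best = iterated_local_search(4.0)
print(best, score(best))
```

Occasionally accepting a worse candidate is what lets the search leave a local minimum, while the best candidate seen so far is tracked separately so that such moves never lose a good solution.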

Although the inclusion of derivatives may add to the computational load, the optimisation achieved by the algorithm outweighs this apparent negative effect. Multiple runs are initiated simultaneously from a random set of conformations and the process is streamlined through multithreading. This is facilitated through shared-memory hardware parallelism, which is ubiquitous in modern multi-core central processing units. The sets of function minima obtained from each of these separate runs are combined and used during structure refinement and clustering.


AutoDock Vina functions with PDBQT files similar to those of AutoDock4 and many of the processes remain similar. However, AutoDock Vina adopts a more automatic approach with regard to grid selection, result ranking and output clustering to limit the amount of unnecessary transitional detail provided to the user (Trott and Olson, 2010). AutoDock Vina also adapts to the input data with regard to maximum grid size, maximum number of rotatable bonds and maximum number of atoms. Previously, these parameters were fixed for AutoDock4 and many users were unable to alter them due to complexity or a lack of computing knowledge.


3. Problem statement, approach and objectives for this study

As of October 2018, there have been approximately 210 million sequence submissions to the GenBank database and approximately 722 million sequence submissions to the whole genome shotgun (WGS) database (NCBI, 2018). In contrast, the number of structure submissions to the PDB as of November 2018 was 145 892, with a current annual growth of 9 515, less than in 2017 (PDB, 2018). This surplus of sequence information coupled with a deficit of structure information leads to sequences not being associated with known structures. Although sequence information is useful, protein structure and protein molecular dynamics cannot be determined using only sequence information. Furthermore, protein structures provide indications of protein function, elucidate mechanisms of protein function, provide a framework on which other proteins lacking structural information can be based, and play a crucial role in drug discovery by enabling drug design that targets specific biological processes within cells. However, as noted above, the number of structure submissions within the PDB to date accounts for only approximately 0.016% of the total sequence submissions within both the GenBank and WGS databases.
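The percentage quoted above follows directly from the three submission counts given in the text. A short check, using only those figures (the variable names are illustrative):

```python
# Submission counts as quoted in the text (NCBI, 2018; PDB, 2018)
genbank_sequences = 210_000_000
wgs_sequences = 722_000_000
pdb_structures = 145_892

total_sequences = genbank_sequences + wgs_sequences
percentage = 100.0 * pdb_structures / total_sequences

print(f"{percentage:.3f}% of sequence submissions have a PDB structure")
```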

It may appear that the rate of sequence submissions to their respective databases is too high and that more effort should be placed on structure prediction. However, this only becomes apparent when noting how many structures have been submitted to the PDB. Instead, it may be more appropriate to assume that the rate of structure submissions to the PDB is far too low. The first solution would be to increase the number of individuals who submit these structures by providing the training and equipment required for both X-ray diffraction (X-D) and NMR techniques. This is not always possible, however, as of the top five global structural genomics centres, three reside within the United States, one resides in Japan and a joint operation between Canada and England takes the final place (PDB, 2018). The remaining five of the top ten also reside within the United States. These top ten provide a total of 13 103 entries, forming a large portion of the PDB (PDB, 2018). The number of biochemistry graduates from the United States in 2016 was only 9 554, with a growth rate of 8.94% (U. S. Department of Education, 2018). In contrast, China and particularly India have seen rapid growth in recent years, with much focus on the biotechnology sector (Vale and Dell, 2009). Although many Western academic centres and laboratories have greatly benefitted from this biotechnological focus within India, a social stigma against India persists on the simple basis that Western academics remain oblivious to the scientific and education system within India (Vale and Dell, 2009). India is also the most likely candidate for extensive future scientific collaboration with the West as it is a democratic country, in comparison to China. Thus, as of 2018 the workforce needed to determine protein structures using NMR and X-D techniques is simply not available to deal with the current situation. Alternatively, it may be possible to update or streamline these protein structure techniques or develop a new technology. However, this is unlikely to occur at a rate fast enough to deal with the current sequence and structure crisis.

Structures can be particularly difficult to solve, especially those of multimeric transmembrane proteins, which may require a combination of X-D and NMR techniques for the extracellular and cytoplasmic portions. Membrane protein structures in particular are very limited within the PDB. Although these proteins constitute 25% of all proteins, there are only approximately 200 unique structures available (Carpenter et al., 2008). These proteins are difficult to solve experimentally due to their partially hydrophobic surfaces, flexibility and lack of in vitro stability (Carpenter et al., 2008). These characteristics complicate all stages of protein structure determination, such as expression, solubilisation, purification, crystallisation, data collection and final structure solution (Overington et al., 2006). These proteins are, however, very important drug targets, with over 40% of drug targets being related to membrane proteins (Overington et al., 2006).

Concerning integrins specifically, the β chain of integrins has been solved to a greater degree than the α chain. As of 30 October 2018, the UniProt database indicates that β1 is largely solved concerning the extracellular portion, from residue 21 through 798 of the total sequence of 798 residues (Anthis et al., 2009; Bai et al., 2012; Liu et al., 2013; Nagae et al., 2012; Xia and Springer, 2014). Both β2 and β3 follow a similar pattern, with β2 being solved from residue numbers 23 through 699 and 735 through 769 of 769 residues (Beglova et al., 2002; Sen and Springer, 2016; Sen et al., 2013; Shi et al., 2007; Shi et al., 2005; Takala et al., 2008; Xie et al., 2010). The β3 has been solved for residues 27 through 788 of 788 residues (Borst et al., 2017; Choi et al., 2013; Deshmukh et al., 2010; Deshmukh et al., 2011; Dong et al., 2012; Garcia-Alvarez et al., 2003; Kim et al., 2011; Lau et al., 2009; Lau et al., 2008; Lin et al., 2016; Liu et al., 2015; Mahalingam et al., 2014; Metcalf et al., 2010; Parry et al., 2007; Schmidt et al., 2016; Springer et al., 2008; Van Agthoven et al., 2014; Vinogradova et al., 2004; Vinogradova et al., 2002; Weljie et al., 2002; Xiao et al., 2004; Xiong et al., 2009; Xiong et al., 2001; Xiong et al., 2004; Xiong et al., 2002; Yang et al., 2009; Zhou et al., 2018; Zhu et al., 2012; Zhu et al., 2008; Zhu et al., 2010; Zhu et al., 2013). Integrin β4, β6 and β7 are similar in that large portions of structural information are missing. In the case of the β4 integrin, only residues 989 to 1107, 1126 to 1369 and 1527 to 1736 have been solved (Alonso-Garcia et al., 2015; Alonso-Garcia et al., 2009; de Pereda et al., 2009; de Pereda et al., 1999).

The β6 and β7 both lack structural information regarding a large portion of the first few residues and last few hundred residues (Dong et al., 2014; Dong et al., 2017; Kiema et al., 2006; Kotecha et al., 2017; Yu et al., 2012). Although these integrin subunits lack a substantial amount of information, in comparison, the β5 integrin has no associated PDB files listed within the UniProt database. The α chain integrin subunits 1 through to 10, including IIb, have vastly different levels of structural information, with the most complete being the αIIb integrin. In the case of IIb only a small percentage of residue structural information is missing (Choi et al., 2013; Lau et al., 2008; Lau et al., 2009; Lin et al., 2016; Liu et al., 2015; Schmidt et al., 2016; Springer et al., 2008; Vinogradova et al., 2000; Vinogradova et al., 2004; Vinogradova et al., 2002; Weljie et al., 2002; Xiao et al., 2004; Yang et al., 2009; Zhu et al., 2012; Zhu et al., 2008; Zhu et al., 2010; Zhu et al., 2013). In comparison, α3 and α6 through to α10 contain no structural information. The remaining chains within this selection predominantly contain structural information regarding the first half of the integrin. However, this information is severely limited concerning the α1 (Brown et al., 2018; Chin et al., 2013; Lahti et al., 2011; Lai et al., 2013; Nymalm et al., 2004; Rich et al., 1999) and α2 (Brown et al., 2018; Carafoli et al., 2013; Eble et al., 2017; Emsley et al., 1997; Emsley et al., 2000; Horii et al., 2004) chains. Finally, αD, αE, αL, αM, αV, and αX contain vastly different levels of structural information (Bajic et al., 2013; Baldwin et al., 1998; Chua et al., 2011; Jensen and Bajic, 2016; Lee et al., 1995; Lee et al., 1995; Mahalingam et al., 2011; McCleverty and Liddington, 2003; Xiong et al., 2000; Bhunia et al., 2009; Dodd et al., 2007; Guckian et al., 2008; Legge et al., 2000; Li et al., 2009; Lin et al., 2008; Potin et al., 2006; Qu and Leahy, 1995; Qu and Leahy, 1996; Sen and Springer, 2016; Shimaoka et al., 2003; Song et al., 2005; Wattanasin et al., 2005; Weitz-Schmidt et al., 2004; Zhang et al., 2009; Zhang et al., 2008; Zhang et al., 2009; Zhou et al., 2018; Borst et al., 2017; Cormier and Campbell, 2018; Dong et al., 2014; Dong et al., 2012; Dong et al., 2017; Kotecha et al., 2017; Mahalingam et al., 2014; Xiong et al., 2001; Xiong et al., 2004; Xiong et al., 2002; Sen and Springer, 2016; Sen et al., 2013; Vorup-Jensen et al., 2003; Xie et al., 2010). The αV and αX are largely complete with only small gap regions remaining, whilst αD and αE lack any structural information. In comparison, the αM and αL subunits are mostly completed only for the extracellular portion; however, there is limited information pertaining to the transmembrane and cytoplasmic portion of the αM integrin.

Due to the complications of modelling transmembrane proteins, their biological importance as drug targets, and the low rate at which protein structures are solved owing to insufficient personnel, equipment, funding or training, a solution does not seem possible within the current focus. A method is required for high-throughput and accurate prediction of structure based on sequence alone, which makes maximal use of the structures available within the PDB. Homology modelling provides such a means. This technique allows researchers to determine protein structure using energy minimisation and template structures. It is a powerful tool for determining protein structure and in many cases can be completed on a desktop computer or high-end laptop.

The objective of this study was to address the current state of homology modelling in the context of integrin proteins. Specific objectives were:

a) To utilise various online modelling software and draw a comparison to MODELLER.
b) To evaluate MODELLER protocols and determine the best means to overcome challenges associated with transmembrane protein modelling.
c) To evaluate various means of protein validation, such as using online servers, how these servers evaluate models and any errors that may occur during evaluation.
d) Finally, to provide more information on the structure of integrin proteins which have yet to be determined.

Homology modelling is thus chosen as the approach to determine integrin structure as these proteins are of particular importance for many biological processes and are prominent drug targets.


4. Methods

Modelling procedures were completed in the following categories:

1. Use of online servers.
2. Generation of suitable MODELLER support input files.
3. Determining optimal parameters for modelling by altering python job files.
4. Final modelling procedures.
5. Docking procedures using CLUSPRO and HADDOCK2.2 webservers.
6. Evaluation of final files using Verify-3D, PROSA and PROSESS webservers.

4.1. Obtaining target sequences Sequence information was obtained from the UniProt database (Table 2). Canonical sequences were chosen and included α3, α4, αD, αE, αL, αM, αX, β1, β2 and β7. These sequences were obtained by limiting search results to human and specifying the integrin family as “organism:human family:integrin”. The PDB files used for each of these experiments can be found within the appendix (Section A: PDB files used in the study: Table 17, Table 18, Table 19, Table 20, Table 21, Table 22).

Table 2. UNIPROT database sequence information of the integrin monomers. Integrin monomer (left) with the UNIPROT entry name (right).

Integrin    UniProt Entry
α3          P26006
α4          P13612
αD          Q13349
αE          P38570
αL          P20701
αM          P11215
αX          P20702
β1          P05556
β2          P05107
β7          P26010

4.2. Modelling using online servers

Three online webservers, PRIMO, I-TASSER and Phyre2, were tested and used to generate models, which were then graded. The integrin chains α3, αM, αE, β1 and β2 were selected to determine how these programs differed. The extracellular sequence was submitted to each server. In the case of PRIMO, homologues were chosen, and in all cases models were generated based on the submitted sequences. These models were evaluated using PROSA.

4.2.1. PRIMO

Target sequences were submitted to the Rhodes University Protein Interactive Modelling (PRIMO) server for homology modelling. Homologues were identified within the PDB through the standard BLASTp search algorithm under default settings. Templates were selected based on three criteria: coverage, identity and resolution. For each target sequence, three different template regimes were prepared. The first regime focused on high coverage at the expense of resolution and identity, while the second regime focused on resolution and sequence identity at the expense of sequence coverage. The final regime attempted to balance coverage, sequence identity and template resolution. T-COFFEE in standard mode was used to generate the multiple sequence alignment. The number of models generated was set to 10 for each regime, and the regimes were compared based on the Z-DOPE scores provided by the PRIMO server. The top models based on Z-DOPE scores were submitted to PROSA for evaluation.

4.2.2. I-TASSER and Phyre2

In both cases default server settings were selected. No information other than the original UniProt sequence, the job title and the required e-mail address was submitted. The intensive mode of Phyre2 was selected as the protocol to model the proteins. The Z-DOPE score of models was determined, with the best scoring models submitted to PROSA for further evaluation.

4.2.3. MEDELLER The MEDELLER protocol was chosen and a target-template alignment generated by Clustal Omega was submitted. The template structure was inputted and the complete model type was selected to be generated from a bigger “minimal core”.

4.3. MODELLER A series of experiments was conducted to determine how varied templates, template numbers, loop refinement level, refinement protocol and iteration numbers affected model construction. Experiments are divided into two categories, construction of heterodimeric proteins and construction of separated monomeric protein subunits (α and β chain).


4.3.1. Generating alignment files, PDB files and python scripts

The method of generating protein alignment files can be viewed within the appendix (Section B: Alignment file setup). When sequence information was obtained from BLASTp searches, the SEQRES sequence was used to determine sequence homology and not the sequence of residues actually resolved through X-D, NMR or EM techniques. This complicated matters, as PDB files often contain missing residues that were not determined for a multitude of reasons. In some instances, a homologous sequence could be found for a particular region, but the template did not necessarily contain that region, as it might be missing within the PDB file. This highlighted the importance of checking each PDB file and ensuring that the region it is intended to cover is actually present in the structure. In all experiments various types of python job scripts were created; some included more information than others or specified a slightly different algorithm to be utilised. An example format of the full python script can be found within the appendix (Section B: Alignment file setup, Section C: Python script example).

4.3.2. Modelling the complete protein Initial testing of MODELLER included completion of the full α3β1 integrin to determine how MODELLER dealt with heterodimeric integrin models without excessive input. Templates were selected on information obtained from PRIMO and aligned using MAFFT. No additional information, such as restraints, was specified. The job script included seven templates and specified 10 models with normal refinement in the absence of model or loop refinement using the automodel() class.

4.3.3. Determining optimal model iterations

MODELLER was then specified to generate two sets of 100 and 300 models of the α3β1 integrin using the automodel() class. Templates were selected based on PRIMO webserver output and aligned using MAFFT. Other integrins were also modelled in a similar fashion; however, only four models each were generated to provide a brief overview of how MODELLER dealt with each protein. The number of templates used to generate the α3β1 target was increased from seven to 18 templates for both sets. These two sets were then compared on the basis of their Z-DOPE scores. If the difference in scores between the two sets was insignificant, it would be unlikely that generating a pool of models larger than 100 would provide better results. Conversely, if the scoring difference between the first and second set was significant, it would indicate that MODELLER requires more than 100 models to be generated.


4.3.4. Determining optimal refinement iterations

In a similar manner, the α3β1 integrin was modelled to generate five base models from which four refined loop models were generated. Templates were identified using PRIMO output and aligned using MAFFT. Both refinement levels, md_level and loop.md_level, were set to refine.very_slow and the loopmodel() class was used for the modelling protocol. This allowed comparative analysis to determine how multiple iterations of refinement affect model construction and how these models differ from the base starting models previously created. The same templates as the previous experiment (4.3.3. Determining optimal model iterations) were used such that the alignments were identical.

4.3.5. Determining optimal template numbers

The previous experiment (4.3.4. Determining optimal refinement iterations) was repeated; however, the number of templates was increased from 18 to 35, all of which had been identified by PRIMO and aligned using MAFFT. This experiment set was dubbed the “increased” set.

4.3.6. Very_slow against slow_large refinement The αMβ2 protein was modelled whereby templates were identified from PRIMO output and aligned using T-COFFEE in standard mode. Two near-identical experiments were conducted using different refinement protocols, very_slow and slow_large refinement for models and loops.

4.4. Modelling monomeric subunits

In subsequent experiments templates were identified by direct BLASTp searches that utilised the PSI-BLAST algorithm and the BLOSUM 80 scoring matrix, and that excluded non-human or uncultured/environmental samples. The expected threshold and word size were altered to five and three, respectively. Clustal Omega was used for multiple sequence alignments between targets and templates with five combined iterations, five guide tree iterations and five hidden Markov model iterations.

4.4.1. Initial modelling

The α3β1 integrin chains were generated separately. Templates were obtained from the BLASTp searches as described previously. Twelve models each of chain “A” and chain “B” were constructed from 25 and 48 templates, respectively, with one iteration of model loop from which the structure of individual chains was determined.


4.4.2. Separating “closed” and “open” templates

Templates of the α3 and β1 protein chains used for the initial modelling experiment (4.4.1. Initial modelling) were separated into two classes, open and closed, where in the former the template contained an associated ligand or antibody, whilst the closed templates were unbound. An experiment was conducted whereby closed and open templates were each used to generate a series of 12 refined models to determine the effects of using only one template type.

4.4.3. Modelling with secondary structure arguments

Model structure was improved using secondary structure arguments, incorporated by adding α-helix and β-strand information to the α3 and β1 models. The models were based on the same templates used for the closed conformation test of each chain (4.4.2. Separating “closed” and “open” templates) and 12 refined models were constructed.

4.4.4. Modelling with forced transmembrane regions The chains were then modelled using secondary structure restraints and templates of only closed conformations (4.4.3. Modelling with secondary structure arguments) with an added specified restraint of a forced α helix for the transmembrane region of both chains.

4.4.5. Fragmented integrin modelling Subunits were then modelled by separating the sequence into three sections, extracellular, transmembrane and cytoplasmic. The cytoplasmic and transmembrane regions were generated for all protein subunits whilst only the extracellular domain of the α3β1 integrin was generated. Only closed conformational templates were used for the generation of extracellular regions in conjunction with secondary structure information. Cytoplasmic and transmembrane regions were modelled in sets of 10, each with one iteration of refinement where possible. The extracellular domain of both chains was modelled in a single set of 10 and specified one iteration of refinement. Both chains used the same templates as those for the initial closed model generation.

4.5. Final modelling

The generation of final models used the most suitable and complete MODELLER script but varied as some information available for particular sequences was unavailable for others, such as secondary structure information pertaining to α-helix, β-strand and disulfide bridge locations. Additionally, only sequence information labelled “extracellular” was used in the final construction of protein models and was obtained from the UniProt database.

4.5.1. BLASTp

The UniProt database sequence information of all integrin subunits was submitted to the BLASTp search page in duplicate to account for an open and closed conformation type. The BLASTp search parameters were altered to only search within PDB, “human (taxid:9606)” and to exclude models (XM/XP) and uncultured/environmental sample sequences. The PSI-BLAST algorithm was chosen, with the expected threshold and word size altered to five and three, respectively, and a BLOSUM 80 matrix was utilised.

4.5.2. Separation of “closed” and “open” templates

In all cases results were divided into two groups, open and closed templates. Open templates were ligand or antibody bound whilst closed templates remained unbound. To ensure that templates were in either the open or closed conformation, the potential template was searched for within the PDB and analysed manually. Additional templates were obtained by selecting templates with high identity for PSI-BLAST iterations if applicable. The templates are too numerous to mention but the list can be found within each python script for each integrin chain.

4.5.3. Preparing multiple runs under varying template numbers Once templates were separated, three MODELLER runs were prepared for each protein subunit of both open and closed conformations. The first setup included the majority of returned applicable results, the second slightly favoured sequence identity but maintained a very high level of sequence coverage while the final run heavily favoured identity and then coverage. These three setups were dubbed “all”, “fewer” and “fewest”, respectively.

4.5.4. Generating alignment files

The alignment file was generated using Clustal Omega and sequence information obtained from the “seq_dump” files acquired from BLASTp searches. Missing residues were removed from the alignment file and substituted with the appropriate gap “-” characters, and the alignment was converted into the “.pir” format required by MODELLER.
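The gap substitution and PIR conversion can be sketched as follows. This is an illustrative helper rather than the procedure given in the appendix: the function names, the entry codes `tmplA` and `target`, the toy sequences and the residue numbers are all hypothetical. The header layout (ten colon-separated fields) follows the PIR format that MODELLER reads.

```python
def mask_missing(aligned_sequence, missing_columns):
    """Replace residues at the given 0-based alignment columns with gap characters."""
    return "".join(
        "-" if index in missing_columns else residue
        for index, residue in enumerate(aligned_sequence)
    )


def pir_entry(code, aligned_sequence, kind, first=("", ""), last=("", "")):
    """Format one alignment entry in PIR layout.

    kind  : 'structureX' for a template, 'sequence' for the target
    first : (residue number, chain) of the first template residue
    last  : (residue number, chain) of the last template residue
    """
    fields = [kind, code, str(first[0]), first[1], str(last[0]), last[1], "", "", "", ""]
    payload = aligned_sequence + "*"  # '*' terminates the sequence
    wrapped = [payload[i:i + 75] for i in range(0, len(payload), 75)]
    return "\n".join([">P1;" + code, ":".join(fields)] + wrapped)


# Hypothetical template in which SEQRES residues 4 and 5 are absent from the structure
template = mask_missing("ACDEFGHIK", {3, 4})
alignment = "\n\n".join([
    pir_entry("tmplA", template, "structureX", first=(1, "A"), last=(9, "A")),
    pir_entry("target", "ACDEFGHIK", "sequence"),
])
print(alignment)
```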


4.5.5. Generating python scripts

A python script was created that specified all secondary structures and disulfide bridges, used the DOPEhr_loopmodel class, applied very_slow refinement to both model and loop refinement, and constructed ten models with one loop model refinement iteration.
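A job script of this kind might look like the sketch below. It is not the script from the appendix: the file name, template codes, residue ranges and disulfide positions in `SETTINGS` are hypothetical placeholders, and the class and attribute names follow the MODELLER 9.x Python interface (`dopehr_loopmodel`, `refine.very_slow`), which is assumed to be installed before `run_job` is called. Residue identifiers such as '20:' may need a chain suffix in other MODELLER versions.

```python
SETTINGS = {
    "alnfile": "alignment.pir",        # hypothetical alignment file name
    "knowns": ("tmplA", "tmplB"),      # hypothetical template codes
    "sequence": "target",              # hypothetical target code
    "n_models": 10,                    # ten base models
    "n_loop_models": 1,                # one loop refinement per base model
    "helices": [("20:", "30:")],       # hypothetical alpha-helix range
    "strands": [("40:", "46:")],       # hypothetical beta-strand range
    "disulfides": [("50:", "75:")],    # hypothetical disulfide pair
}


def run_job(settings=SETTINGS):
    """Build models with MODELLER; the import is deferred so it is only needed here."""
    from modeller import environ, secondary_structure
    from modeller.automodel import dopehr_loopmodel, refine, assess

    class IntegrinModel(dopehr_loopmodel):
        def special_restraints(self, aln):
            # Secondary structure restraints
            for start, end in settings["helices"]:
                self.restraints.add(
                    secondary_structure.alpha(self.residue_range(start, end)))
            for start, end in settings["strands"]:
                self.restraints.add(
                    secondary_structure.strand(self.residue_range(start, end)))

        def special_patches(self, aln):
            # Disulfide bridges
            for first, second in settings["disulfides"]:
                self.patch(residue_type="DISU",
                           residues=(self.residues[first], self.residues[second]))

    env = environ()
    env.io.atom_files_directory = ["."]
    model = IntegrinModel(env,
                          alnfile=settings["alnfile"],
                          knowns=settings["knowns"],
                          sequence=settings["sequence"],
                          assess_methods=(assess.DOPE, assess.normalized_dope))
    model.starting_model = 1
    model.ending_model = settings["n_models"]
    model.md_level = refine.very_slow
    model.loop.starting_model = 1
    model.loop.ending_model = settings["n_loop_models"]
    model.loop.md_level = refine.very_slow
    model.make()
    return model.outputs
```

Keeping the run parameters in one dictionary makes it straightforward to prepare several runs that differ only in their template lists.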

4.5.6. Z-DOPE evaluation and output expansion

Once models were generated, the Z-DOPE scores of the refined loop model files were determined for each run of each protein monomer. Those runs with the lowest Z-DOPE scores were selected and used to generate a total of 100 loop models by generating 100 base models followed by loop refinement. Therefore, each monomer generated 200 models for a total of 2200 models for the open conformation and 2200 models for the closed conformation. These included both unrefined and refined models.
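Selecting by lowest Z-DOPE score amounts to a minimum search over the scores of each run. A minimal sketch, in which the run names match those used above but the file names and score values are invented purely for illustration:

```python
def best_model(zdope_by_run):
    """Return (run, model, score) for the lowest (most negative) Z-DOPE score."""
    best = None
    for run, scores in zdope_by_run.items():
        for model, score in scores.items():
            if best is None or score < best[2]:
                best = (run, model, score)
    return best


# Hypothetical scores for one monomer; not results from this study
example_scores = {
    "all":    {"model_1.pdb": -0.41, "model_2.pdb": -0.38},
    "fewer":  {"model_1.pdb": -0.52, "model_2.pdb": -0.47},
    "fewest": {"model_1.pdb": -0.35, "model_2.pdb": -0.44},
}

print(best_model(example_scores))
```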

4.5.7. Z-DOPE evaluation and monomer selection

The Z-DOPE scores of all models were then determined and one representative of each monomer was chosen to be incorporated as part of the final heterodimeric model. These representatives were paired so as to generate the final eight heterodimeric proteins. The α and β subunits were submitted to the CLUSPRO server for docking. Docking was performed using one chain as the receptor and the other as the ligand, specifying chain A for α subunits and chain B for β subunits. The resulting output was downloaded in four categories, “balanced”, “electrostatic favoured”, “hydrophobic favoured” and “VdW+Elec”. Each category contained ten models, thereby generating 320 docked proteins each for the “open” and “closed” conformations. The final models were selected based on their ClusPro energy results and PyMOL was used to further investigate the output by aligning the results of each category with an experimentally solved structure deemed suitable.
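The figure of 320 docked proteins per conformation follows from eight heterodimers, four scoring categories and ten models per category. The sketch below enumerates them; the list of pairings is an assumption (the known heterodimers that can be formed from the chains in Table 2), since the pairs are not listed explicitly in this section.

```python
from itertools import product

# Assumed pairings of the Table 2 chains
HETERODIMERS = ["α3β1", "α4β1", "α4β7", "αDβ2", "αEβ7", "αLβ2", "αMβ2", "αXβ2"]
CATEGORIES = ["balanced", "electrostatic favoured", "hydrophobic favoured", "VdW+Elec"]
MODELS_PER_CATEGORY = 10

docked = [
    (dimer, category, rank)
    for dimer, category, rank in product(
        HETERODIMERS, CATEGORIES, range(1, MODELS_PER_CATEGORY + 1))
]

print(len(docked))
```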

4.5.8. Protein evaluation

Final models were evaluated using three programs. They were submitted to PROSA, Verify-3D and PROSESS. The output was saved and in the case of PROSESS the text-based information was saved to Excel spreadsheets whilst the graphical data was downloaded.


4.5.9. Obtaining ligands used for docking The Leucine-Aspartic acid-Valine (LDV) sequence was obtained to represent the fibronectin ligand used for the majority of the integrins. Only α3β1 used laminin as a ligand, however, the specific site of binding remains unknown for both the receptor and ligand in this instance.

The LDV sequence within the fibronectin type III repeat segment is important for integrin association; therefore, this region was searched against the PDB to obtain structural information. However, none was available. The sequence was then searched for again including three residues on each side of the LDV sequence and the 5C16 PDB file was chosen to represent the LDV sequence. The structural information was extracted from the PDB file and used for both HADDOCK and AutoDock Vina docking procedures. The 5AUX PDB file was obtained to represent laminin. The first to third G-domains were isolated from the structure. The structure was then submitted to the 3DLigandSite server for ligand binding site prediction. The highest scoring ligand binding sites were then used in future experiments.

4.5.10. Docking using HADDOCK

The open conformations were docked with the LDV ligand, except for α3β1, which utilises laminin. The subunit templates were submitted to the I-TASSER server, from which ligand binding predictions were computed and the residue numbers saved. A docked ligand template most similar to the target was obtained from the PDB using BLASTp searches as previously completed. The template was overlaid with the target using PyMOL and corroborated with the suspected ligand involvement residues obtained from the I-TASSER list. Three runs were completed to include “all”, “fewer” and “fewest” residues suspected to be involved in ligand docking. Comparisons were made between the overlay and the predicted involved residues to determine the most likely active residues. Runs were submitted using the HADDOCK easy interface and the best results analysed to determine possible docking outcomes. Since HADDOCK relies on a numbered list of involved residues, chain B was renamed to chain A and its residue numbering continued on from that of chain A, so that each residue number occurred only once. This was taken into consideration when uploading the numbers for chain B to the HADDOCK webserver. These models were evaluated in the same manner as for the closed conformation.
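The renaming and renumbering of chain B can be sketched as below. This is an illustrative helper, not the procedure actually used: it assumes standard fixed-column PDB ATOM records (chain identifier in column 22, residue number in columns 23 to 26), and the `atom_line` builder and the three-atom example are hypothetical.

```python
def atom_line(serial, name, residue, chain, number, x=0.0, y=0.0, z=0.0):
    """Build a fixed-column PDB ATOM record (toy helper for the example)."""
    return (f"ATOM  {serial:>5d} {name:<4s} {residue:>3s} {chain}{number:>4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}")


def merge_chain_b_into_a(pdb_lines):
    """Rename chain B to A and shift its residue numbers to follow chain A."""
    atoms = [line for line in pdb_lines if line.startswith(("ATOM", "HETATM"))]
    last_a = max(int(line[22:26]) for line in atoms if line[21] == "A")
    first_b = min(int(line[22:26]) for line in atoms if line[21] == "B")
    offset = last_a - first_b + 1
    merged = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and line[21] == "B":
            new_number = int(line[22:26]) + offset
            line = line[:21] + "A" + f"{new_number:>4d}" + line[26:]
        merged.append(line)
    return merged, offset


lines = [
    atom_line(1, "CA", "ALA", "A", 1),
    atom_line(2, "CA", "GLY", "A", 2),
    atom_line(3, "CA", "SER", "B", 1),
]
merged, offset = merge_chain_b_into_a(lines)

# Active residues originally numbered on chain B must be shifted by the same offset
active_b = [1]
shifted_active = [number + offset for number in active_b]
print(offset, shifted_active)
```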

4.5.11. Docking using AutoDock Vina

The numbered active residues obtained from the HADDOCK experiment provided a reference point for the general location at which ligands were to be associated. Four residues within integrins were chosen to represent the location of ligand docking. These residues were present within the α/β subunit interface, which represents the docking site of both RGD and LDV ligands. The coordinates of these residues were averaged to obtain the general X, Y and Z coordinates for each docking procedure. The search box size and exhaustiveness were set to 28 and 128, respectively, in all cases. Docking was performed using LDV in all cases except for α3β1, where laminin was used.
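The box centre calculation and the resulting AutoDock Vina configuration can be sketched as follows. The four coordinates and the file names are hypothetical, and the box is assumed to be cubic with 28 Å sides; the configuration keys (`center_x`, `size_x`, `exhaustiveness` and so on) are those read by AutoDock Vina.

```python
def box_centre(coordinates):
    """Average a list of (x, y, z) tuples to obtain the search box centre."""
    count = len(coordinates)
    return tuple(sum(point[axis] for point in coordinates) / count for axis in range(3))


def vina_config(receptor, ligand, centre, size=28, exhaustiveness=128):
    """Return the text of an AutoDock Vina configuration file."""
    lines = [f"receptor = {receptor}", f"ligand = {ligand}"]
    lines += [f"center_{axis} = {value:.3f}" for axis, value in zip("xyz", centre)]
    lines += [f"size_{axis} = {size}" for axis in "xyz"]
    lines.append(f"exhaustiveness = {exhaustiveness}")
    return "\n".join(lines)


# Hypothetical coordinates of four interface residues
residue_coordinates = [
    (10.0, 20.0, 30.0),
    (12.0, 22.0, 28.0),
    (14.0, 18.0, 32.0),
    (16.0, 20.0, 30.0),
]
centre = box_centre(residue_coordinates)
config = vina_config("receptor.pdbqt", "ligand.pdbqt", centre)
print(config)
```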


5. Results and Discussion

5.1. The use of online webservers to generate protein models

Although PRIMO provides clear output and eases template selection (Table 3), it fails to highlight “missing residues” in PDB files. A potentially optimal template may fail to contain the required structural information for accurate modelling. “Missing residues” of a PDB segment, such as solvent-exposed side chains of proteins, often result from high B-factor values (Deller and Rupp, 2015). The B-factor is an indicator of average model quality and highly flexible residues are viewed as outliers and are subsequently removed from the solved structure (Deller and Rupp, 2015). This practice, however, is not as damaging as including these “missing residues” within a PDB file but assigning an occupancy of zero. These residues would be completely ignored by most modelling software. Thus, many crystallographic researchers opt to assign occupancy values near zero (0.01) (Deller and Rupp, 2015). It is possible to view the occupancy of atoms within a PDB file using the Atom.occ (occupancy) attribute within MODELLER; however, it is more beneficial to use a program that highlights such atoms from within the structure. The PDB validation parameters such as the Rfree value, clashscore, Ramachandran outliers, side-chain outliers and RSRZ outliers are also omitted from the PRIMO interface. These indicate the quality of the model by evaluating the goodness of fit between the model and experimental diffraction data (Yang et al., 2016), the quantity of too-close contacts within 1 000 atom segments, the percentage of residues with incorrect backbone conformation, the percentage of residues with incorrect side-chain conformations (Chen et al., 2010), and the percentage of residues not conforming to the electron density map when compared with other similar resolution structures (Kleywegt et al., 2004), respectively. Therefore, it is difficult to select optimal templates from the PRIMO interface as it fails to provide sufficient information regarding the PDB files.
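Atoms with zero or near-zero occupancy can also be flagged directly from the PDB file. The sketch below is one possible approach and not a tool used in this study: it assumes standard fixed-column ATOM records (occupancy in columns 55 to 60), and the `atom_line` builder, the threshold of 0.01 and the example atoms are illustrative.

```python
def atom_line(serial, name, residue, chain, number, occupancy, b_factor):
    """Build a fixed-column PDB ATOM record (toy helper for the example)."""
    return (f"ATOM  {serial:>5d} {name:<4s} {residue:>3s} {chain}{number:>4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{occupancy:6.2f}{b_factor:6.2f}")


def low_occupancy_atoms(pdb_lines, threshold=0.01):
    """Return (chain, residue number, residue name, atom name, occupancy) tuples."""
    flagged = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            occupancy = float(line[54:60])
            if occupancy <= threshold:
                flagged.append((line[21], int(line[22:26]), line[17:20].strip(),
                                line[12:16].strip(), occupancy))
    return flagged


# Hypothetical records: one well-ordered atom and two placeholder atoms
records = [
    atom_line(1, "CA", "ALA", "B", 10, 1.00, 25.0),
    atom_line(2, "CG", "LYS", "B", 11, 0.01, 99.0),
    atom_line(3, "CD", "LYS", "B", 11, 0.00, 99.0),
]
for chain, number, residue, atom, occupancy in low_occupancy_atoms(records):
    print(chain, number, residue, atom, occupancy)
```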


Table 3. Template selection panel of PRIMO for β1. Templates have their chains and sequence identity to the target listed alongside the coverage and template resolution.

Template    Chain    Identity (%)    Coverage       Resolution (Å)
4WJK        B        99%             1-445 (62%)    1.85
3VI3        B        99%             1-445 (62%)    2.90
3K6S        B        45%             7-708 (99%)    3.50
5ES4        B        45%             7-706 (98%)    3.30
4NEH        B        45%             7-707 (99%)    2.75
3IJE        B        44%             5-708 (99%)    2.90

To combat this lack of suitable template information, three modelling methods were implemented based on the available information. The first attempted to use only high resolution and identity templates whilst the second focused more on target coverage. Finally, a balance between target coverage and template homology and resolution was set up to accommodate full target modelling using templates that are moderately high in quality.

Manual editing of the alignment was unnecessary, most likely due to the limited number of templates having high enough sequence homology to the target. T-COFFEE is one of the most accurate alignment programs (Chang et al., 2012) and is able to align 500 sequences with very high accuracy, provided these homologs have an identity higher than the midnight zone (Figure 3) (Nute et al., 2018). Although gaps present within known regions of secondary structure should be moved outside this region, prerequisite knowledge is required to complete such a task. PRIMO does allow users to edit and search the alignment for such regions and colour codes the alignment for easy identification of problematic regions.


Table 4. Template names and numbers used for the generation of each integrin. Integrins are divided into identity and resolution, coverage and balance sets with “#” indicating the number of templates. Sets that generated the lowest average Z-DOPE scoring models are highlighted. The percentage indicates the percentage target-template homology.

Integrin   Set                       #   Templates (target-template identity)
β1         Identity and Resolution   6   1BD3 (57%), 3FCS (43%), 3VI3 (99%), 4NEH (45%), 4UM8 (42%), 4WJK (99%)
β1         Coverage                  9   1JV2 (43%), 3FCS (43%), 3IJE (44%), 3K6S (45%), 4G1E (44%), 4G1M (45%), 4WJK (99%), 4NEH (45%), 5ES4 (45%)
β1         Balance                   5   3FCS (43%), 3V4P (50%), 3NID (45%), 4NEH (45%), 4WJK (99%)
β2         Identity and Resolution   1   4NEH (99%)
β2         Coverage                  2   3K6S (100%), 4NEH (99%)
β2         Balance                   4   3K6S (100%), 4NEH (99%), 5E6R (99%), 5ES4 (99%)
α3         Identity and Resolution   4   3IJE (28%), 4G1M (28%), 4UM8 (30%), 4WJK (32%)
α3         Coverage                  5   1JV2 (28%), 3IJE (28%), 4G1E (28%), 4G1M (28%), 4O02 (28%)
α3         Balance                   6   3FCS (29%), 3IJE (28%), 4G1M (28%), 4UM8 (30%), 4WJK (32%), 4WK4 (32%)
αE         Identity and Resolution   3   4OKU (24%), 4NEH (29%), 5E6R (32%)
αE         Coverage                  6   3IJE (29%), 3K6S (29%), 4OKU (24%), 4NEH (29%), 5E6R (32%), 5ES4 (29%)
αE         Balance                   5   1IDN (39%), 3K6S (29%), 4OKU (24%), 4NEH (29%), 5E6R (32%)
αM         Identity and Resolution   5   1IDO (61%), 1JLM (100%), 1N9Z (98%), 1NA5 (100%), 4NEH (60%)
αM         Coverage                  3   3K6S (61%), 4NEH (60%), 5ES4 (61%)
αM         Balance                   7   1IDO (61%), 1JLM (100%), 1MF7 (99%), 1NA5 (100%), 3Q3G (100%), 4NEH (60%), 5ES4 (61%)

Consideration of template numbers and average Z-DOPE scores (Figure 4, Table 4) revealed that higher template numbers produced higher quality α3 and αE models. More negative Z-DOPE scores indicate higher quality models. The α3 and αE targets were best modelled from the balance and coverage sets, respectively, which used six templates each. In contrast, the identity and resolution sets used four and three templates, while the α3 coverage and αE balance sets used five templates each. The β1 and αM models were better constructed when using moderate template numbers. The highest quality β1 and αM models were obtained from the identity and resolution sets, which used six and five templates, respectively. In comparison, the coverage sets used nine and three templates while the balance sets used five and seven templates, respectively. β2 was modelled best when using the fewest templates.
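The comparison of sets reduces to summarising the Z-DOPE scores of each set and selecting the set with the lowest average. A minimal sketch follows; the function names are illustrative and any scores passed to them would be the per-model Z-DOPE values exported from the server.

```python
from statistics import mean, median


def summarise(scores):
    """Average, maximum, minimum and median Z-DOPE of one model set."""
    return {
        "average": mean(scores),
        "max": max(scores),
        "min": min(scores),
        "median": median(scores),
    }


def best_set(sets):
    """Name of the set with the lowest (best) average Z-DOPE score.

    `sets` maps a set label (for example "B", "C" or "IR") to its scores.
    """
    return min(sets, key=lambda name: mean(sets[name]))
```

Because more negative Z-DOPE scores indicate better models, the minimum of the averages is taken rather than the maximum.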

Templates of α3 and αE targets were limited to lower sequence identity than their αM, β1 and β2 counterparts. Differences between the sequence identity of an αE template to target and a β2 template to target were as high as 50%. The β2 templates reached identity levels of 90% whilst some αE templates were limited to 40% or less.

It is accepted within the modelling community that increasing template numbers improves model quality (Fernandez-Fuentes et al., 2007; Fernandez-Fuentes et al., 2007; Fiser, 2010; Sanchez and Sali, 1997; Venclovas and Margelevicius, 2005). Modelling with multiple templates has two effects, firstly the user is able to adequately cover a target protein and secondly, if overlap of templates occurs, the best template will be used to model that particular region (Fiser, 2010). However, when modelling with PRIMO, it appears that when templates have high sequence identity the fewest number of templates should be selected to obtain high quality models. This is deduced from comparing the lowest average Z-DOPE scoring models of each set (Figure 4) with the template numbers (Table 4) and their respective template-target identity. Integrins can be solved in a closed and bent conformation or a closed and extended conformation in addition to the open conformation. A description of each template is absent from the PRIMO interface and thus selecting multiple templates may cause both types of closed conformation and open conformation templates to be selected.

However, one of the major factors to take into consideration is the truncation of protein termini not covered by templates (Hatherley et al., 2016). PRIMO performs truncation of protein termini for which no template is available. These regions are often difficult to solve experimentally and are often associated with “missing residues” (Deller and Rupp, 2015). Therefore, in the case of β chains having a high level of sequence homology, the truncation of ends improves the overall model quality. In contrast, the α3 and αE models have very low template sequence identity. The negative effects of not including multiple templates, which slightly overlap the termini of integrins, are outweighed by the improvement of average model quality regardless of termini incorporation.


Figure 4. Models generated by PRIMO. The Z-DOPE score (y-axis) is illustrated for each generated model of α3, αE, αM, β1 and β2 (x-axis), subdivided into balanced (B), high coverage (C) and identity and resolution (IR) priority models. The average (blue), maximum (orange), minimum (grey) and median (yellow) Z-DOPE values are illustrated for each model.

However, only extracellular portions of integrins were modelled and both termini of the best scoring models were covered by templates. Another factor to take into consideration is the introduction of gaps within the target if additional templates are specified. During model evaluation, the energy values associated with these regions are not computed, which negatively affects model quality (Eswar et al., 2006).

Although many researchers believe that multiple templates improve model quality, no actual evidence or large-scale experiments exist to prove this notion (Larsson et al., 2008; Venclovas, 2003). Many programmers emphasise that multiple templates increase model quality by better including variability and divergence of natural structures (Larsson et al., 2008). Additionally, there is a lack of information regarding exactly how these modelling programs obtain the “extra” information provided by templates, how they select the “most” suitable template if templates overlap and whether or not it is possible for multiple templates to have a negative effect on modelling (Larsson et al., 2008). It has also been proposed that modelling software is able to discern between multiple templates because it is able to nontrivially identify the best template within a set (Contreras-Moreira et al., 2003). However, this would also imply that single template modelling is on par with multi-template modelling and this is not the case. This prompted a large scale study which indicated that MODELLER performs best when using either two or three templates when compared to single templates (Larsson et al., 2008). This was expected. However, when more templates were included the quality of the models gradually fell (Larsson et al., 2008). This trend was also observed when using Nest. The conclusion of the study indicated that although MODELLER was superior to other software, it failed to prevent model deterioration when multiple templates were submitted. It is suspected that modelling software simply incorporates an “average template” when multiple templates are submitted (Larsson et al., 2008). The exact mechanism by which this occurs has yet to be established.

Similarly to the previous study (Larsson et al., 2008), utilising both lower and higher quality templates appears to have had a negative effect on the final integrin model energy. A percentage threshold of sequence identity must exist whereby if identity falls below this value more templates are required, whilst if identity is higher than this value templates should be selectively chosen. This threshold is most likely affected differently by each target structure. Therefore, no standard method can be designed and implemented to overcome this particular modelling challenge. An improvement to the algorithm has been suggested (Qian et al., 2004) but not yet implemented, which may allow these modelling packages to be more selective in their templates.

The α3 and αE homologs had lower sequence identity, increased RMSD values between final models and templates, and lacked full sequence coverage (Table 5). Analysing the balanced set for α3 models revealed chosen templates displayed lower average RMSD values compared with other regimes. Many templates for the best α3 and αM models displayed RMSD values between ±5.5 and ±20, respectively. In contrast, other regimes displayed much higher average RMSD values or contained excessive numbers of extreme outliers. The best αM, β1 and β2 models appear to have been generated due to their high sequence identity and coverage.


Table 5. Output of PRIMO server. Templates are listed against each integrin with their associated RMSD values between the template and the final model.

Integrin   Set                       Template: RMSD
β1         Identity and Resolution   1BD3: 17.78, 3FCS: 15.66, 3VI3: 12.35, 4NEH: 14.50, 4UM8: 5.35, 4WJK: 10.78
β1         Coverage                  1JV2: 25.06, 3FCS: 2.28, 3IJE: 2.34, 3K6S: 9.16, 4G1E: 4.48, 4G1M: 2.23, 4WJK: 0.91, 4NEH: 8.95, 5ES4: 8.24
β1         Balance                   3FCS: 4.28, 3V4P: 30.60, 3NID: 2.80, 4NEH: 6.36, 4WJK: 10.20
β2         Identity and Resolution   4NEH: N/A
β2         Coverage                  3K6S: 3.21, 4NEH: 3.29
β2         Balance                   3K6S: 3.84, 4NEH: 1.98, 5E6R: 2.35, 5ES4: 3.33
α3         Identity and Resolution   3IJE: 24.01, 4G1M: 23.27, 4UM8: 2.68, 4WJK: 2.32
α3         Coverage                  1JV2: 9.26, 3IJE: 9.57, 4G1E: 8.88, 4G1M: 9.48, 4O02: 9.28
α3         Balance                   3FCS: 4.83, 3IJE: 5.91, 4G1M: 53.86, 4UM9: 11.53, 4WJK: 5.20, 4WK4: 5.34
αE         Identity and Resolution   4OKU: 18.47, 4NEH: 25.20, 5E6R: 4.89
αE         Coverage                  3IJE: 15.08, 3K6S: 16.38, 4OKU: 28.20, 4NEH: 17.10, 5E6R: 14.19, 5ES4: 16.14
αE         Balance                   1IDN: 15.77, 3K6S: 15.77, 4OKU: 33.66, 4NEH: 14.71, 5E6R: 1.83
αM         Identity and Resolution   1IDO: 23.74, 1JLM: 0.87, 1N9Z: 23.96, 1NA5: 23.86, 4NEH: 24.82
αM         Coverage                  3K6S: 4.35, 4NEH: 12.22, 5ES4: 4.13
αM         Balance                   1IDO: 10.18, 1JLM: 24.96, 1MF7: 23.90, 1NA5: 24.39, 3Q3G: 24.05, 4NEH: 15.45, 5ES4: 23.68

It can be concluded that if target homologs are of poor sequence identity the resulting model will not be optimally generated. This low quality model will have a high RMSD value when compared to templates (Table 5). Models generated by MODELLER also have an optimal template number dependent on the degree of identity between target and homologs and the physical structure of the target. Theory states that the sequence identity of templates predetermines protein structure (Krieger et al., 2005). Instances in which sequence identity between target and template are high imply that both have similar physical structures. Therefore, the final model will adopt a similar structure to that of the template and be reflected through the RMSD value.
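For reference, the RMSD reported between a model and a template is the root of the mean squared distance between paired atoms. The sketch below assumes that the two structures have already been superimposed and that the coordinate lists are in matching atom order (for example, Cα atoms of aligned residues); it does not perform the superposition itself, and the function name is illustrative.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two paired lists of (x, y, z) tuples."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))
```

Because the squared distances are averaged over all paired atoms, a single badly placed domain can dominate the value, which is why a large RMSD may reflect one confined region rather than the whole structure.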


It may be required to utilise a poor template to cover a specific region of the protein for which no other template may be acquired. The RMSD value between target and template may be large but confined to a particular region. Coverage of these templates should be taken into account to determine the magnitude of the effect they have on final model construction. This is most likely the case when comparing RMSD results of αM and αE models to their templates, which displayed values of ±17 and ±20, respectively (Table 5). In the latter case, the Z-DOPE score of the final model was much lower than the former. This may be due to the latter relying on a few selective templates that cover the majority of the protein. This is reinforced by the best model, which was generated by the identity and resolution set of αM that contained the fewest number of templates.

It appeared that when modelling, PRIMO utilises large templates as a base from which the final model is constructed, regardless of sequence overlap with higher quality templates. This is indicated when comparing final models to their templates (Table 5). The RMSD score between the model and the highest coverage template is often the lowest. This again highlights previous findings that MODELLER cannot choose the best template when presented with template overlap (Larsson et al., 2008) but rather attempts to find a “local average” of model quality. In this experiment, lower coverage but high quality templates have higher RMSD values than lower quality, high coverage templates when used in combination. Therefore, it appears that PRIMO prefers modelling proteins from high coverage templates first before utilising smaller, higher quality templates.

Template resolutions were greater than 7Å in some cases. These low resolution templates are unreliable and should be avoided as the coordinates of physical structures may not be trustworthy (Fiser, 2010). The effects of poor resolution are superimposed on those of low sequence identity templates. Often, templates displaying low sequence identity will also include those with low resolution. This is particularly true for heterodimeric transmembrane proteins that have limited template selections. This is due to the environment from which these templates originate, causing them to be difficult to solve experimentally.

Analysing target-template alignments indicates that models with lower Z-DOPE scores have better templates, more consistent coverage and higher sequence identity. More accurate and consistent information provided a better framework for MODELLER to function. The average target-template sequence identity of the α3 and αE templates (30%) is lower than that of the αM, β1 and β2 templates (70%). Therefore, it is expected that the latter three should be modelled better than the former two on this basis alone.

P-values (Table 6) obtained for each integrin indicate the variation between balanced, coverage and identity and resolution through statistical significance (p < 0.05). The highest and lowest variation between results were obtained from the αE and β1, respectively.

Table 6. P-values between data sets (B, C and IR) of each integrin. The P-value was obtained by comparing each data set, balance, coverage and identity and resolution using the ANOVA single factor test. α = 0.05.

Integrin   P-value
α3         8.56 x 10^-7
αE         3.15 x 10^-24
αM         1.64 x 10^-10
β1         6.50 x 10^-5
β2         3.57 x 10^-9
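The single-factor ANOVA behind these P-values compares the variation between the B, C and IR sets with the variation within them. The sketch below computes only the F statistic and its degrees of freedom using the standard library; converting F to a P-value requires the F distribution, which is not available in the standard library, so that step is left out here. The function name is illustrative.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of each score around its own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

A large F relative to the critical value at α = 0.05 corresponds to the small P-values reported in Table 6.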

The balanced set has the lowest standard deviation (Figure 5) for β1, followed by the identity and resolution set for α3 and β2 and then the coverage set for αE and αM. A correlation (Table 7) between the standard deviation and Z-DOPE score appears to exist, whereby the average standard deviation values of the αE and α3 sets are higher than those of the αM, β1 and β2 sets. This implies that model quality is inversely proportional to the associated standard deviation. Standard deviation associated with poorly modelled proteins is expected to increase, as MODELLER may have to rely more on ab initio techniques (Walsh et al., 2009).

Table 7. Correlation coefficients of the balanced, coverage and identity and resolution sets. The data set each have their own correlation functions between standard deviation and the Z- DOPE score averages obtained by models.

Data set                  Correlation coefficient
Balanced                  0.788
Coverage                  0.234
Identity and Resolution   0.543
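The coefficients in Table 7 relate the standard deviation of each model set to its average Z-DOPE score. A minimal sketch of the Pearson correlation coefficient is given below, assuming this is the coefficient that was calculated; the source does not name the method explicitly, and the function name is illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)
```

Here `xs` would hold the five per-integrin standard deviations of one data set and `ys` the corresponding average Z-DOPE scores. With only five points per data set the coefficient is sensitive to a single integrin.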


Figure 5. Standard deviation of models generated by PRIMO. The Z-DOPE standard deviation (y-axis) of each model set (x-axis: α3, αE, αM, β1 and β2) is illustrated in its respective category: balanced set (blue [B]), coverage set (orange [C]) and identity and resolution set (grey [IR]).

However, multiple sets of data for each regime would need to be generated to determine the exact strength of the correlation between model quality and standard deviation and, by extension, template quality. Currently, the data only indicate that model quality is inversely proportional to the associated standard deviation. The correlation may not even be linear, as indicated by the variance between the correlation values of each data set.

The highest and lowest Z-DOPE scoring models generated by PRIMO were overlaid to determine structural variations of each model series (Figure 6, Figure 7). RMSD values are lowest for the β1 series and highest for the αE series. This trend appears to follow for β chains in general with less variation between highest and lowest Z-DOPE scoring models. Although PRIMO provides an easy way to generate accurate models the server automatically truncates regions for which no template has been specified as noted in the comparison between the worst and best α3 model.


Figure 6. Comparison of the best and worst β1 and β2 integrin monomers to the templates 3VI3 and 3K6S. (A) The best β1 model (purple) superimposed on the worst model (orange) with an RMSD of 0.595. The 3VI3 template (blue) is superimposed on the best β1 model with an RMSD of 1.019. (B) The best β2 model (pale green) superimposed on the worst model (teal) with an RMSD of 1.094. The 3K6S template (grey) is superimposed on the best β2 model with an RMSD of 1.856.


Figure 7. Comparison of the best and worst α3 (A), αE (B) and αM (C) models with their templates. The best α3 model (green) is superimposed on the worst α3 model (cyan) with an RMSD of 4.787. The 1JV2 template (orange) is superimposed on the best α3 model with an RMSD of 5.009. The best αE model (purple) is superimposed on the worst αE model (yellow) with an RMSD of 5.293. The 5E6R template (white) is superimposed on the best αE model with an RMSD of 2.665. The best αM model (grey) is superimposed on the worst αM model (blue) with an RMSD of 3.093. The template 5ES4 (red) is superimposed on the best αM model with an RMSD of 2.678. (D) The effects of PRIMO's truncation, indicating a missing calf domain for the worst α3 model (cyan). (E) The protruding loop region of the best αE model (purple).


Truncation of termini is completed during the pre-processing step of .pir files within the PRIMO pipeline (Hatherley et al., 2016). This is not ideal when attempting to model full protein sections for which no template exists. However, this does prevent long non-templated regions from branching out from the protein complex or incorrectly associating with the surface of the protein. In comparison to the best model of α3, the worst model lacks an entire calf-domain in the left-upmost region of the protein (Figure 7D). Further analysis indicates a particularly high RMSD score for the αE model which also has the highest standard deviation for two out of the three model sets (balanced and identity and resolution, Figure 5).

Additionally, αE has the highest Z-DOPE scores for all model sets (Figure 4). When structurally analysing the αE models, a loop region is found protruding from the left side of the purple model (Figure 7E). Despite this protrusion, the Z-DOPE score of the best model (0.787) remains much lower than that of the worst model (1.503).

Models generated by I-TASSER and Phyre2 (Figure 8) indicated that I-TASSER performed much better than Phyre2 but only generated better α3 and αE models in comparison to PRIMO. This implies that I-TASSER is able to perform better than PRIMO when the sequence identity of templates is low. The ab initio modelling procedure of I-TASSER appears to be superior to that of both Phyre2 and PRIMO. When comparing the best models generated by each online server (Figure 11, Figure 12) it became apparent that I-TASSER and Phyre2 generated very similar models. In the absence of high sequence identity the PRIMO model of αE appears to more closely associate with itself, forming a more compact structure than that of the I-TASSER and Phyre2 counterparts. This may highlight the protocol for MODELLER to generate globular extracellular models over more structurally complex transmembrane models (Kelm et al., 2010).


Figure 8. Comparing the best models generated by Phyre2, I-TASSER and PRIMO. The Z-DOPE scores (y-axis) of Phyre2 (blue), I-TASSER (orange) and PRIMO (grey) are shown for each integrin (x-axis: α3, αE, αM, β1 and β2).

Output of Phyre2 and I-TASSER appears non-uniform in that Phyre2 provides a single model output file called “final.casp.pdb”. In comparison to this, I-TASSER provides multiple model output files called “model(X).pdb” where X is the model number iteration. However, this was not always the case as I-TASSER provided only a single model concerning the β2 monomer. The PRIMO server also provides validation output pertaining to associated models beyond that of their Z-DOPE scores and utilises PROCHECK, ProSA, QMEAN and Verify 3D. The validation output provided by PRIMO through these incorporated servers may seem overwhelming; additionally the option to view these data may not be inherently obvious within the drop-down arrow alongside the “show” option. In contrast to this, the validation results of I-TASSER and Phyre2 appear more ordered and tabulated with summaries associated with each section.

A major advantage of Phyre2 and I-TASSER over PRIMO is the internal, detailed validation of the generated models (Figure 9, Figure 10). Although PRIMO does use programs such as ProSA, QMEAN, Verify 3D and Procheck to evaluate models the output is cumbersome and more disordered than that of Phyre2 and I-TASSER.


Figure 9. Phyre2 output of the α3 integrin monomer. A) An overview of the modelling confidence. B) A domain analysis of modelled sequence. C) Detailed template information (Kelley et al., 2015; Roy et al., 2012; Zhang, 2009).

The greatest advantage of Phyre2 and I-TASSER is ligand binding site prediction. Phyre2 implements 3DLigandSite, which functions by overlapping homologous structures associated with ligands over the models to predict binding sites (Wass et al., 2010). Previously, ligand binding site prediction for I-TASSER was completed using COFACTOR; however, improvements to the algorithm were made to include ConCavity (Capra et al., 2009), FINDSITE (Brylinski and Skolnick, 2008) and two new methods developed within the I-TASSER suite itself, called TM-SITE and S-SITE, for structure and sequence based ligand prediction. Many of these processes rely on overlapping homologous structures onto the model to determine potential ligand binding sites. This method was followed for experiments within this study (section 4.5.10).

Figure 10. I-TASSER output of the α3 integrin monomer. A) The normalized B-factor of the models. B) Top threading templates used by I-TASSER (Yang and Zhang, 2015).

Structurally, the models generated by these servers are nearly identical to the templates (Figure 11, Figure 12). Many of the models exhibit loop regions (Figure 12D, E and F) which originate due to the lack of a suitable template. Of the generated models, only the PRIMO model of the αE monomer failed to adopt a structure similar to that of the template. The full protein models were generated in all cases, with the β models more closely resembling their templates than the α models.


Figure 11. Comparison of the β1 and β2 integrin monomers generated by PRIMO, Phyre2 and I-TASSER. A) β1 models include those of Phyre2 (yellow), PRIMO (purple), I-TASSER (grey) and 3VI3 (blue-green). B) β2 models include those of Phyre2 (blue), PRIMO (orange), I-TASSER (green) and 3K6S (light pink).


Figure 12. Comparison of the α3 (A), αE (B) and αM (C) models generated from Phyre2, PRIMO and I-TASSER. A) Phyre2 (purple), PRIMO (yellow), I-TASSER (light brown), 1JV2 (light yellow). B) Phyre2 (white), PRIMO (blue), I-TASSER (orange), 5ES4 (grey). C) Phyre2 (red), PRIMO (light green), I-TASSER (dark green), 5ES4 (grey). D) Large loop region. E) Protruding segment. F) Non-conforming secondary structure.


As noted by the ANOVA (Table 8), the server should be carefully chosen, since modelling methodologies differ to such a degree that model quality can be markedly different. It appears, at least in the context of extracellular portions of transmembrane proteins, that I-TASSER outperformed Phyre2 and PRIMO. The P-values for both servers and integrins indicate that the results obtained differ significantly between servers and between integrins, as P < 0.05.

Table 8. P-values of Z-DOPE models generated by online servers obtained from a two factor ANOVA without replication. The P-values are divided into two source categories, between integrins and between servers, where α = 0.05.

Source               P-value
Between integrins    4.7 x 10^-4
Between servers      7.6 x 10^-3
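A two-factor ANOVA without replication partitions the total variation of a table with one score per cell into a row effect (integrins), a column effect (servers) and a residual. The sketch below computes the two F statistics with the standard library only; as before, converting F to a P-value requires the F distribution and is omitted. The function name and the table layout (rows as integrins, columns as servers) are illustrative assumptions.

```python
def two_way_anova_f(table):
    """Return (F_rows, F_columns) for a rectangular table with one value per cell."""
    r, c = len(table), len(table[0])
    grand_mean = sum(sum(row) for row in table) / (r * c)
    row_means = [sum(row) / c for row in table]
    col_means = [sum(table[i][j] for i in range(r)) / r for j in range(c)]

    ss_rows = c * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = r * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in table for x in row)
    ss_error = ss_total - ss_rows - ss_cols  # residual variation

    ms_error = ss_error / ((r - 1) * (c - 1))
    f_rows = (ss_rows / (r - 1)) / ms_error
    f_cols = (ss_cols / (c - 1)) / ms_error
    return f_rows, f_cols
```

For the comparison above, the table would have five rows (α3, αE, αM, β1, β2) and three columns (Phyre2, I-TASSER, PRIMO), each cell holding the best Z-DOPE score for that pairing.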

These online servers focus primarily on modelling extracellular proteins, which are mostly globular in structure. This is disadvantageous for transmembrane proteins containing both a transmembrane and cytoplasmic segment. MEDELLER, which is designed for membrane proteins, failed at the first stage by being unable to determine the membrane insertion region. This may be caused by iMembrane being unable to acquire any BLAST sequence search results against the Coarse-Grained database (CGDB) (Kelm et al., 2009). If no results are returned from the CGDB, MEDELLER is unable to locate the transmembrane portion of the protein, as it is annotated from these CGDB results. These annotations are either “N”, “H” or “T” for each residue, indicating not in contact with the membrane, in contact with the polar head groups of membrane lipids, and in contact with the hydrophobic lipid tails, respectively (Kelm et al., 2009). The CGDB includes protein models generated by coarse-grained MD simulations (Chetwynd et al., 2008). However, this database is relatively small and information relating to integrins may simply not yet exist. This is highlighted when attempting to run only the iMembrane portion of the procedure by manually selecting parameters. MEDELLER’s failure to detect iMembrane hits for input sequences appears to be a common issue. When attempting to model all entries within the class ‘membrane proteins’ of the PDB, many failed to annotate the membrane insertion but no solution to the problem was proposed (Kelm et al., 2010).
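The per-residue N/H/T annotation described above can be read as a string with one letter per residue. The following sketch, which assumes the annotation is available as such a plain string, collapses it into contiguous spans so that the membrane-embedded stretch is easy to locate. It does not call iMembrane or MEDELLER, and the function names are hypothetical.

```python
from itertools import groupby


def annotation_spans(annotation):
    """Collapse a per-residue annotation string into (label, start, end) spans.

    Residue positions are 1-based and inclusive.
    """
    spans = []
    position = 1
    for label, run in groupby(annotation):
        length = len(list(run))
        spans.append((label, position, position + length - 1))
        position += length
    return spans


def has_membrane_region(annotation):
    """True if any residue is annotated as contacting lipid tails ("T")."""
    return "T" in annotation
```

An annotation consisting only of "N" would correspond to the failure described above, in which no membrane insertion region can be placed.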

Of the servers tested, only Phyre2 and I-TASSER performed modelling without any complications. As mentioned, MEDELLER was unable to model any proteins and, in some cases, PRIMO modelling failed. Modelling using PRIMO failed if multiple models were to be generated in a single run, such as attempting to complete a set of 200 models. The threshold at which PRIMO failed seemed to depend on the protein, on how many templates were chosen and on which templates were chosen. In particular, some templates caused PRIMO to fail if more than 20 models were generated in a single run, whilst removal of these templates allowed a single run to reach 100 iterations. It may be that during model generation the server became overloaded; however, this is unlikely due to the ability to model multiple sets of proteins simultaneously. Therefore, it may be related to MODELLER itself, and this is supported by the associated error message provided by PRIMO support: “Exception: An unknown exception has occurred when running MODELLER”. This error may indicate a flaw with PDB formatting. Thus, it may be that missing residues or modified residues are not being formatted correctly in the alignment to match the PDB, causing the models to fail. However, models were generated until a certain point before failing, which implies that pre-processing was completed correctly. Nevertheless, additional testing of the PRIMO back-end code would need to be completed to determine the fault, which remains unknown.

5.2. Modelling heterodimeric proteins using MODELLER

Initial protein models generated by MODELLER of the α3β1 integrin displayed Z-DOPE scores of ±1.4 with a maximum of ±1.5 and a minimum of ±1.3 (Figure 13). A drawback of oligomeric homology modelling is the increased number of errors associated with the final model (Fiser, 2010). These errors present in the model originate both from the modelling process and the solved template structures themselves (Fiser, 2010). The structurally complex oligomeric interface regions degrade the quality of X-D techniques causing decreasing resolution quality (Faber and Matthews, 1990). The environment in which these proteins are present such as their solvent, ligand association status and crystal packing also affect template quality and subsequently model quality (Faber and Matthews, 1990).


Figure 13. Initial α3β1 integrin models (blue) produced by MODELLER and the first set of 100 (orange) and second set of 300 (grey) models. The average, maximum, minimum and median Z-DOPE values (y-axis) are displayed with their associated standard deviation values.

In an attempt to improve model quality, 11 additional templates were acquired and two sets of 100 and 300 models were generated. However, the Z-DOPE scores of these models are much higher than those of the initial experiment (Figure 13), thereby reiterating that acquiring additional templates does not necessarily improve model quality (Larsson et al., 2008).

When comparing the 100 and 300 model sets, no significant difference is discernible (Figure 13). It appears the first set yielded slightly worse models with higher Z-DOPE scores. However, the P-value (0.288) associated with these sets highlights their insignificance as it is greater than α (0.05). This P-value was generated using the Z-test two sample for means statistical analysis with a 0 mean hypothesis difference.
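The two-sample comparison can be sketched as follows. This is an approximation under a stated assumption: a formal z-test takes known population variances, whereas the sketch substitutes the sample variances, which is reasonable only for sets as large as the 100 and 300 models used here. The function name is illustrative.

```python
import math
from statistics import mean, variance


def two_sample_z_test(a, b, hypothesised_diff=0.0):
    """Return (z, two-tailed P-value) for the difference between two means."""
    # Sample variances stand in for the population variances (assumption).
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b) - hypothesised_diff) / standard_error
    # Two-tailed P-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A P-value above α = 0.05, such as the 0.288 reported above, means the hypothesis of equal mean Z-DOPE scores for the two sets cannot be rejected.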

This suggests that the best model can be generated within the first 100 models. Increasing iteration numbers may provide a slightly better model. However, time constraints and computational power may limit MODELLER’s ability to generate a large number of models from which refinement may then be completed. If templates display high levels of sequence coverage and identity, it may be possible to reduce the total number of iterations and still obtain the best model. This is due to MODELLER’s ability to model proteins better and more consistently when high quality information can be obtained, limiting dependence on ab initio modelling and MODELLER randomisation (Speranskiy et al., 2007). Many publications use a base set of 100 models; however, some prefer to use a set of 1 000 or more (Wallner and Elofsson, 2005), which may not be feasible for researchers who do not have access to cloud computing or supercomputing.

Refinement was completed to determine how refinement iterations affected MODELLER output (Figure 14). Z-DOPE scores appear to have dropped to ±2.4. An improvement of ±1.1 was achieved by minimising loop regions and overall model structure using the very_slow refinement protocol. When comparing effects of multiple refinement iterations very little change can be observed. Each set appeared to obtain similar refinement with initial refinement appearing directly proportional to model Z-DOPE scores for the second iteration before adopting an inversely proportional relationship for the latter iterations.
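A run of this kind can be sketched with MODELLER's Python interface. The sketch is not the script used in this study: the alignment file, template codes and target name are supplied by the caller, and the output key "Normalized DOPE score" is an assumption about how MODELLER labels the Z-DOPE assessment and should be checked against the installed version. MODELLER is imported inside the function so that the ranking helper can be used without a MODELLER installation.

```python
def best_models(outputs, n=5):
    """Rank successfully built models by normalised (Z-)DOPE, lowest first."""
    built = [m for m in outputs if m.get("failure") is None]
    return sorted(built, key=lambda m: m["Normalized DOPE score"])[:n]


def build_models(alnfile, knowns, sequence, n_models=100):
    """Build n_models comparative models with very_slow MD refinement."""
    # Lazy import: requires a licensed MODELLER installation.
    from modeller import environ
    from modeller.automodel import automodel, assess, refine

    env = environ()
    a = automodel(
        env,
        alnfile=alnfile,      # target-template alignment in PIR format
        knowns=knowns,        # tuple of template codes in the alignment
        sequence=sequence,    # target code in the alignment
        assess_methods=(assess.DOPE, assess.normalized_dope),
    )
    a.starting_model = 1
    a.ending_model = n_models
    a.md_level = refine.very_slow  # slowest built-in MD refinement schedule
    a.make()
    return a.outputs
```

A typical use would be `best_models(build_models("alignment.pir", ("tmplA", "tmplB"), "target"))`, where the file name and codes are placeholders.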

Figure 14. Comparing four sets of five models generated by MODELLER separated by levels of refinement using standard template numbers. The average, maximum, minimum and median Z-DOPE values (y-axis) are displayed as in Figure 13 for each set (blue: set 1 (iteration one); red: set 2 (iteration two); grey: set 3 (iteration three); yellow: set 4 (iteration four)).

Templates were then increased to 34 causing a large decrease in Z-DOPE values of obtained models (Figure 15). However, it appeared that standard deviation of associated sets increased. Taking increased standard deviation into account the maximum theoretical Z-DOPE value could be as high as ±1.5, although still lower than obtained previously (Figure 14).


Figure 15. Comparing four sets of five models generated by MODELLER separated by levels of refinement using increased template numbers. The average, maximum, minimum and median Z-DOPE values (y-axis) are displayed as in Figure 14 for each set. Each set represents an iteration, where the iteration number is in turn represented by the set number.

Two things to take into consideration are the effect of the base model on refinement and the approach to refinement taken by MODELLER. If the base model contains a significant number of errors, energy minimisation refinement will be unable to generate a more accurate model (Park et al., 2018). These models can only be improved by improving the base model quality through high quality homologs. This is due to the strict and often inconsistent energy function requirements needed to improve model accuracy (Park et al., 1998). Additionally, the energy minimisation function may contain false energy minima, causing inaccuracies, which then cause model degradation during refinement (Park et al., 2018). This is particularly true when employing coarse-grained conformational searches or unrestrained molecular dynamics simulations (He et al., 2013; Modi and Dunbrack, 2016; Raval et al., 2012). Furthermore, the refinement process only improves locally confined regions of the model and fails to address the global model quality directly (Park et al., 1998). It has also been noted that smaller proteins are often refined to a better degree than larger, more complicated structures (Park et al., 1998). This was highlighted in CASP 2012, which indicated that small monomeric proteins under 200 residues were often refined to a better degree than larger protein counterparts. This highlights the possible improvement to integrin model quality if the protein were to be modelled as its separate monomers rather than as a single large heterodimeric structure.

Refinement of models reduced the Z-DOPE score (Figure 14), and increasing the number of templates improved it further (Figure 15). In the latter scenario, this is most likely due to the improvement of the base model through the increased number of templates. Since these models were better constructed initially, the refinement procedure further aided model improvement. This echoes previous findings from large-scale energy optimisation of models (Park et al., 1998).

Nevertheless, beyond the first level of refinement, no additional refinement improved model quality in either case, as P > 0.05 (Table 9). In both cases, Z-DOPE scores appeared to increase initially before decreasing during refinement. Additional testing would be needed to determine accurately why MODELLER causes such fluctuations in Z-DOPE scores during the refinement of transmembrane proteins. It is most likely due to the presence of false minima or general energy function inaccuracies.

Table 9. P-values of standard and increased template sets. The P-value associated with each data set is greater than α where α = 0.05.

Data set                     P-value
Using Standard Templates     0.531
Using Increased Templates    0.980

Correlations of 0.973 and -0.355 exist between the standard deviation and the average Z-DOPE scores of the standard template and increased template sets, respectively. This implies that, although increasing the number of templates initially increases the standard deviation, it may be reduced through multiple iterations of refinement. This is once again most likely due to energy function inaccuracies being amplified by the size and complexity of the base models (Park et al., 1998).

When comparing the very_slow and very_large refinement regimes provided by MODELLER (Figure 16), no significant difference can be shown, with many of the parameters showing nearly identical Z-DOPE scores. There does appear to be a small increase in Z-DOPE scores for the very_large set; however, the standard deviation of both sets renders this difference insignificant. Moreover, the computational time of very_large is greater than that of very_slow by a factor of approximately 1.5.

[Chart for Figure 16: Z-DOPE score (y-axis, 0 to 1.2) by parameter (Average, Max, Min, Median) for the Very Slow and Very Large refinement sets.]

Figure 16. Very_slow against very_large refinement. Very_large (orange) and very_slow (blue) refinement was tested using the αMβ2 heterodimeric protein model.

The main challenge regarding the refinement of large proteins originates from their size. The search space grows exponentially with chain length, and thus accurate refinement requires far more computational power than for smaller proteins (Park et al., 1998). There are also more potential instances for errors to occur within the energy function method itself, and thus more advanced sampling strategies and energy functions should be developed to cater specifically for large proteins.
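The exponential growth of the search space can be illustrated with a deliberately simplified sketch. It assumes that each residue independently adopts one of a fixed number of backbone states; the value of three states per residue and the function name `conformation_count` are hypothetical choices for illustration, not parameters used by MODELLER.

```python
def conformation_count(chain_length, states_per_residue=3):
    """Number of conformations if every residue independently adopts one of a fixed number of states."""
    if chain_length < 0 or states_per_residue < 1:
        raise ValueError("chain_length must be >= 0 and states_per_residue >= 1")
    return states_per_residue ** chain_length


# Compare a small monomer with progressively larger chains.
for length in (100, 200, 1000):
    digits = len(str(conformation_count(length)))
    print(f"{length} residues: a {digits}-digit number of conformations")
```

Under this assumption, doubling the chain length squares the size of the search space, which is why refinement of a large heterodimer is far more demanding than refinement of a protein under 200 residues.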

When analysing the very_slow against the very_large refinement (Figure 16), no significant difference exists with regard to model quality, as P (0.443) > α (0.05). If similar findings are observed for structural characteristics, selecting either protocol should be viable. However, the very_large protocol took longer to complete. Loop refinement protocols should be selected on the basis of loop size, location and severity. The very_large refinement seems ambiguously labelled in that it could also be viewed as a reference to protein size and not loop size. This is especially true since the differences between energy minimisation of large and small protein structures are well known within the community. Longer loop refinement periods will increase overall model quality, but this should be considered in conjunction with time constraints. Increasing the number of iterations of model creation may have a more positive effect on final model structure by providing a greater base from which the best model may be refined. However, the number of models to be generated would have to be much higher than 100, as noted in the previous experiment, and it may still be infeasible to generate a bin of 500 or 1 000 models owing to computational power and time constraints. It is therefore only beneficial to use slow_large refinement for models for which suitable templates are severely lacking.

Structurally, it appears that although refinement aided in providing better overall structure, the process was only effective when used in conjunction with higher template numbers (Figure 17). When modelling heterodimeric models using MODELLER, many more templates are required than when modelling monomeric subunits. This is due to the complex interface region that must be modelled and the increased overall sequence length of the entire protein, amplified in this instance by the association of both transmembrane and cytoplasmic domains. Protein structures still appear disorderly, and thus more focus should be placed on carefully selecting templates when modelling heterodimeric proteins (Figure 17). When comparing the generated models to their templates, only the model built from the increased number of templates that underwent refinement (Figure 17D) adopted a similar structure.

The initial model (Figure 17A) appears to display a long protruding loop region, most likely the result of an untemplated transmembrane and cytoplasmic region. Structurally, the initial model was poor, with many disordered loop regions. Additionally, it is difficult to determine overall structure from the generated model, as secondary structures seem to be misplaced in a globular model unrepresentative of general integrin structure. The initial model displayed a lower Z-DOPE score (Figure 13) than models generated from the sets of 100 and 300. This is most likely due to the globular nature adopted by the initial model, where additional bonds between amino acids increased overall structural integrity. This again reiterates that MODELLER is not necessarily suited for modelling heterodimeric transmembrane proteins (Kelm et al., 2010).

Figure 17. Comparison of heterodimeric models generated by MODELLER. A - Initial α3β1 model, B - Model from set of 100, C - Model from standard template number refinement, D - Model from increased template number refinement, E - Comparison of the large against slow refinement protocol using αMβ2. A - D are compared against the template 3VI3 (red) whilst E is compared against 5ES4 (purple). Chain A (green) and chain B (blue).

Models obtained from the set of 100 (Figure 17B) reveal poor structural quality. Domains such as the calf and thigh appear to have separated and flattened, creating an elongated model that is once again unrepresentative of integrin structure.

The standard template level refinement procedure (Figure 17C) appears to lack general integrin shape, whilst increasing template numbers under refinement (Figure 17D) produced a model resembling solved integrin structures. A multitude of disordered loop regions is also observed within the standard refinement models. The most suitable model thus appears to be that obtained from refinement with increased templates. This modelling procedure also produced models with the fewest and shortest loop regions and mitigated their effects. From a structural standpoint, it appears that the effects of increasing the iterations of refinement were negligible. Similarly, models generated with the very_slow and very_large refinement protocols (Figure 17E) appear to be structurally identical.


5.3. Modelling monomeric proteins using MODELLER

Models were broken into separate chains to determine the individual quality of each monomer. The previous PRIMO experiments (Figure 4) indicated that model quality was not uniform between subunits. From the initial monomeric modelling procedures (Figure 18) it can be observed that the β chain was more accurately modelled than the α chain. The worst models for each subunit displayed Z-DOPE scores of 1.245 and 0.546 for the α and β chains, respectively. There is a statistically significant difference between these two data sets, as P (0.000) < 0.05.

The breakdown of the α3β1 model allowed selective use of templates for each subunit and allowed the effects of using only open templates, only closed templates, and closed templates with various degrees of secondary structure arguments to be explored (Figure 21). Although the primary sequence dictates protein structure, some proteins, such as moth chemosensory proteins, may undergo drastic conformational changes (Campanacci et al., 2003). Likewise, integrin proteins undergo conformational alterations and may adopt either a closed-bent, closed-extended, open or transition conformational state (Figure 19) (Li and Springer, 2017). Therefore, although some templates may well be suitable for modelling owing to their sequence, their conformation must be taken into account. Many PDB files contain a description specifying the conformational state of the solved structure; however, this is not true for every file. In some cases a more detailed description placed within the “literature” section will provide such information, and in extreme cases the conformational state of the template is not specified. It would be futile to model the open conformation of a protein using templates representative only of the closed conformation. This places additional limitations on templates, specifically those catering for the α chain, which in many cases are not optimal.

The number of templates and the average identity were higher for the β1 chain (45 templates and 52.2%) than for the α3 chain (25 templates and 26.0%). The overall sequence was more complete for the β1 chain, partly due to the increased number of templates but also due to the high sequence coverage associated with each template. In comparison, the α3 templates were shorter and contained higher numbers of missing residues.

[Chart for Figure 18: Z-DOPE score (y-axis, 0.000 to 1.600) by parameter (Average, Max, Min, Median) for the α3 and β1 subunits.]

Figure 18. Comparing the α3 and β1 monomeric models generated by MODELLER. The average, maximum, minimum and median Z-DOPE scores for the initial monomeric modelling procedure of α3 (blue) and β1 (orange) subunits.

The first experiment utilised both open and closed templates (Figure 18) and displayed Z-DOPE scores very similar to those obtained using only open or only closed templates (Figure 21). The P-values (Table 10) associated with these data sets indicate that no significant difference exists, in terms of Z-DOPE scores, between the combined closed and open set and either the closed or the open set, except in two instances for the β chain.

Figure 19. Changes in integrin conformation. The different conformations of the integrin protein: A and D) bent (closed), B and E) extended closed, and C and F) extended open, for αI-less and αI-containing integrins, respectively (Zhu et al., 2013).

Keeping the conformation requirements of templates in mind, the selection of templates becomes more tedious. The low (40%) target-template identity of the α chain hinders the benefits of using multiple templates. The average difference between the template and target structure is less than the average difference among alternative template structures (Fernandez-Fuentes et al., 2007). More damaging, however, is the lack of target-template sequence identity, which indicates that the alignment is the limiting factor for modelling (Fiser, 2010). For target-template sequence identity lower than 40%, the alignment limits model accuracy, as a single residue misalignment will induce an error of 4 Å (Fiser, 2010).
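Target-template identity can be estimated from a pairwise alignment as in the sketch below. It assumes identity is counted over aligned, non-gap positions only; other definitions use a different denominator, and the one used to obtain the values in this study is not stated. The function name `percent_identity`, the example alignment and the use of 40% as a flagging threshold are illustrative.

```python
def percent_identity(target_aln, template_aln, gap="-"):
    """Percentage of identical residues over positions where neither sequence has a gap."""
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have the same length")
    aligned_pairs = [
        (a, b) for a, b in zip(target_aln, template_aln) if a != gap and b != gap
    ]
    if not aligned_pairs:
        return 0.0
    matches = sum(1 for a, b in aligned_pairs if a == b)
    return 100.0 * matches / len(aligned_pairs)


# Hypothetical aligned fragments (illustrative only, not integrin sequences).
target = "MKT-LLVAGC"
template = "MRTALLI-GC"
identity = percent_identity(target, template)
flag = "alignment-limited" if identity < 40.0 else "acceptable"
print(f"identity = {identity:.1f}% ({flag})")
```

A template flagged in this way would be one for which the alignment, rather than the modelling step, is expected to limit accuracy.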

In extreme cases of exceptionally poor templates, and when modelling very large proteins with limited structural change between conformations, it may be beneficial to use templates representative of both conformations. Regions of structural similarity between closed and open conformations could be isolated within the PDB. These sections could be modelled with higher accuracy, improving a particular region. Although using the incorrect conformation could negatively affect model structure, the method by which MODELLER selects templates for modelling is not mutually exclusive, and an “average” template would be used for that region.

Adding specific secondary structure restraints to model construction caused no major change to the modelling results, except for the β chain forced transmembrane experiment in comparison to the other generated β chains (Figure 22). Modelling the closed conformation with arguments appeared to have no effect at all, and the final models obtained the same Z-DOPE scores in the same order. The secondary structure information in all these cases was obtained from the templates themselves. Thus, although in most cases it would be unnecessary to include secondary structure information, it may be useful in one or two cases, and incorporating this data into Python scripts may slightly improve the Z-DOPE scores of models.


Table 10. P-values associated with the closed and open, only closed, only open, closed with arguments and closed with forced transmembrane segment. The P-value against each experiment type.

Experiment                                P-value
α Closed & Open Vs Closed                 0.205
β Closed & Open Vs Closed                 0.200
α Closed & Open Vs Open                   0.295
β Closed & Open Vs Open                   0.000
α Closed Vs Open                          0.414
β Closed Vs Open                          0.001
α Closed Vs α With Arguments              N/A
β Closed Vs β With Arguments              N/A
α Closed Vs α Forced Transmembrane        N/A
β Closed Vs β Forced Transmembrane        0.015

The P-values (Table 10) indicate that, for the α chain, there is no significant difference between using a combination of closed and open templates and using only open or only closed templates. For the β chain, however, there is a significant difference between using only open templates and using either the combination or only closed templates. This highlights previous findings that model quality is limited by target-template identity for the α chain. The model quality would be similar regardless of whether open templates, closed templates or a combination were chosen.

[Chart for Figure 20: standard deviation of Z-DOPE scores (y-axis, 0.000 to 0.140) for α3 and β1 under four conditions: Only Closed, Only Closed with Arguments, Only Open, and Only Closed with Arguments and Forced Transmembrane.]

Figure 20. Standard deviation values associated with all model sets using only closed and open templates, only closed with arguments and only closed for forced transmembrane portions.


Because the results are nearly identical, the standard deviations (Figure 20) of many of the closed conformation sets are also identical. Using only open templates appears to have decreased the standard deviation for the β chain, whilst the standard deviation increased for the forced transmembrane model of the β chain.

Further experimentation using only closed and open templates (Figure 21) revealed that the Z-DOPE score of the α3 chain remained mostly unchanged whilst the Z-DOPE score of the β1 chain was higher for the open conformation than that of the closed conformation.

[Chart for Figure 21: Z-DOPE score (y-axis, 0.000 to 1.400) for α3 and β1 under four conditions: Only Closed, Only Closed with Arguments, Only Open, and Only Closed with Arguments and Forced Transmembrane; series: Average, Maximum, Minimum, Median.]

Figure 21. Comparing the Z-DOPE score of models generated using only closed and only open templates with models generated using arguments and forced transmembrane helices. The average, maximum, minimum and median values of all model sets are displayed in terms of their Z-DOPE scores.

Structurally (Figure 22, Figure 23) the models appear very similar and adopt a structure resembling that of integrins. This conclusion can be drawn from the structural similarities between the templates 1JV2 and 3VI3 and their respective α and β chain models. In comparison to previous heterodimeric models generated by MODELLER, the structural results appear to have improved greatly (Figure 17).


When comparing the monomeric models generated by MODELLER many of the results appear to overlap perfectly for both chains (Figure 22). Most of the variation between models originates from regions for which no suitable template exists. Closed templates for both chains overlap with each other, however, in comparison with the open chains slight differences occur. For the α chain, the closed conformation appears to shift outwards (1). Very little difference is discernible concerning the β chain; however, it appears that many of the secondary structures associated with the model do not fully overlap (2).

Figure 22. Comparison of the monomeric models generated by MODELLER. The α3 models (A) are compared, using only closed templates (grey), only open templates (blue), closed and open templates (purple), closed with arguments (yellow-green), closed with forced transmembrane (light green) and the template 1JV2 (light blue-green). The β1 models (B) are also compared, using only closed templates (light green), only open templates (blue-green), closed and open templates (orange), closed with arguments (pink), closed with forced transmembrane (light yellow) and the template 3VI3 (blue). 1) Disordered loop region. 2) Non-conforming transmembrane segment.


Figure 23. Overlay of the templates 3VI3 and 1JV2, the monomeric subunits generated by the forced transmembrane set and the heterodimeric model generated from the increased template and refined set. 1JV2 (yellow), 3VI3 (blue), heterodimeric model (red), α3 monomer (green) and β1 monomer (cyan).

5.3.1. Fragmentation of monomeric models

To further increase model quality, the chains were modelled in fragments, whereby the cytoplasmic, transmembrane and extracellular domains were modelled separately (Figure 24). The cytoplasmic domain of the α chain was modelled better than that of the β chain; however, this trend does not hold for the transmembrane or extracellular domains.

Transmembrane domains are unique and templates for this region are limited (de Brevern, 2010). The number of available transmembrane templates is less than 1% of the available structures within the PDB (Fleishman et al., 2006). Therefore, in many cases accurate templates are unavailable. These regions are highly stabilised by the membrane, and it is suspected that a single membrane-buried hydrogen bond may contribute as much stabilisation as all the van der Waals contacts along the length of the domain (Perrin and Nielson, 1997). Non-polar hydrogens and aromatic residues of the transmembrane domain also interact with the membrane to bring about increased domain stabilisation. Therefore, without the membrane, the transmembrane domain is quite unstable, and this is reflected by a high Z-DOPE score, as noted in this study (Figure 24). This region is linear and adopts the most energetically unfavourable state.

The breakdown of the model facilitated a reduction in the Z-DOPE score of the extracellular region (Figure 21, Figure 24). The high Z-DOPE transmembrane and cytoplasmic regions were removed from the generated models, thereby increasing average model quality, as the extracellular region is better modelled. The separation of these three domains is justified by their natural isolation in three separate environments. Additionally, the smaller models allow better energy minimisation and refinement, as the search space is smaller and there is less room for energy minimisation function error (Park et al., 2018). The extracellular, transmembrane and cytoplasmic environments are completely different, which poses one of the greatest challenges in modelling transmembrane proteins. However, owing to these differences, the physical effects each region has on another are quite limited.
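Splitting a chain into its three domains before modelling can be sketched as below. The sketch assumes a type I membrane topology, with the extracellular domain N-terminal to a single transmembrane helix and the cytoplasmic tail C-terminal to it, as is the case for integrin subunits. The function name `split_domains`, the toy sequence and the boundary positions are hypothetical and are not the boundaries used in this study.

```python
def split_domains(sequence, tm_start, tm_end):
    """Split a chain into extracellular, transmembrane and cytoplasmic fragments.

    tm_start and tm_end are 1-based, inclusive residue numbers of the
    transmembrane helix.
    """
    if not 1 <= tm_start <= tm_end <= len(sequence):
        raise ValueError("transmembrane boundaries must lie within the sequence")
    return {
        "extracellular": sequence[: tm_start - 1],
        "transmembrane": sequence[tm_start - 1 : tm_end],
        "cytoplasmic": sequence[tm_end:],
    }


# Toy sequence with a hypothetical helix at residues 11 to 20 (illustrative only).
toy_chain = "MSTNGQKDEE" + "LLIVALLAVL" + "KRRKQ"
fragments = split_domains(toy_chain, tm_start=11, tm_end=20)
for domain, residues in fragments.items():
    print(f"{domain}: {len(residues)} residues")
```

Each fragment can then be written to its own alignment file and modelled against templates selected for that region alone.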

[Chart for Figure 24: Z-DOPE score (y-axis, -1.000 to 7.000) by parameter (Average, Maximum, Minimum, Median) for the cytoplasmic, transmembrane and extracellular regions of α3 and β1.]

Figure 24. Comparing the Z-DOPE scores of extracellular, transmembrane and cytoplasmic models of α3 and β1 generated by MODELLER.

The Z-DOPE scores of the extracellular regions improved greatly under this fragmented form of modelling. The lowest Z-DOPE score achieved was -0.348 for the β1 chain in contrast to the 0.721 value attained by the α3 chain. However, the Z-DOPE score of the extracellular portion of the α3 chain was still lower than previously achieved.

When comparing the extracellular regions obtained here to the lowest scoring Z-DOPE model sets obtained in previous monomeric experiments, the P-values for the α and β chains are 0.059 and 0.000, respectively. These were generated by comparing the top ten scoring models of the closed and open data set for the α chain and of the closed forced transmembrane set for the β chain. A significant difference in favour of the extracellular set is observed for the β chain, whilst the difference for the α chain is not significant in terms of energy scores. However, the α set used for comparison utilised both closed and open templates, which may not be representative of either closed or open integrin structure.

Figure 25. Comparison of the extracellular portion of the α3 (A) and β1 (B) integrin chains. The α3 monomer (light green) overlaid with 1JV2 (blue) and the β1 monomer (cyan) overlaid with 3VI3 (red).


As most of the biological activity originates from the extracellular portion (Humphries et al., 2006), this region was selected for further study. The structures of the models generated (Figure 25) resemble general integrin structure. However, improvements could still be made concerning the terminal ends, which remain insufficiently covered by templates. The α3 chain appears to have adopted a 90° structure, which is different to previously modelled structures (Figure 12). The Z-DOPE scores of the extracellular regions are at their lowest thus far, with the β1 chain attaining negative values.

5.4. Final modelling procedure (Extracellular domains)

Based on the previously completed experiments (Figure 4), three sets of modelling regimes were designed, with varying levels of template numbers for both the closed (Figure 26) and open (Figure 29) conformations. The best models from each integrin conformation were then determined (Figure 27, Figure 30).

[Chart for Figure 26: Z-DOPE score (y-axis, -1.000 to 1.500) for each integrin subunit (α3, α4, αD, αE, αL, αM, αX, β1, β2, β7) under the All, Fewer and Fewest template regimes; series: Average, Max, Min, Median.]

Figure 26. Comparison of the Z-DOPE scores of models generated for the closed conformation. The average (light blue), max (red), min (grey), median (orange) is displayed for each model set, all, fewer and fewest templates.

[Chart for Figure 27: Z-DOPE score (y-axis, -0.800 to 1.200) for each integrin subunit (α3, α4, αD, αE, αL, αM, αX, β1, β2, β7); series: Average, Max, Min, Median.]

Figure 27. Comparison of the final models generated for the closed conformation. Displayed as seen for figure 26.

Structurally (Figure 28, Figure 31), the models resemble their templates, particularly those of both β chain conformations. The open and closed conformations of the models themselves are structurally similar. However, the α3 subunit of the open conformation (Figure 31A) has a bent calf and thigh domain in comparison to the closed conformation (Figure 28A). This trend follows for the α4 integrin subunits of both the closed (Figure 28B) and open (Figure 31B) conformations. More structural differences exist between the closed and open conformations of the β1 (Figure 28H, Figure 31H) and β2 (Figure 28I, Figure 31I) subunits, which adopt the closed bent and extended open forms for the closed and open conformations, respectively (Figure 19). The closed conformation models were compared to closed conformation templates, which have the highest template-target identity to the models. In some instances, such as αD, αE and αM, the 5ES4 α chain template was chosen, as these integrins lack their own solved templates.


Comparison of the final model results for each set for the closed conformation. Displayed as seen in Figure 32.

Figure 28. Overlay of closed conformation integrin models with their templates. A) α3 (green) and template 4G1M (pink). B) α4 (light blue) and template 3IJE (yellow). C) αD (yellow) and template 5ES4 (purple). D) αE (light green) and template 5ES4 (purple). E) αL (red) and template 5E6R (grey). F) αM (white) and template 5ES4 (purple). G) αX (blue) and template 5ES4 (light green). H) β1 (orange) and template 3VI3 (blue). I) β2 (blue) and template 3K6S (red). J) β7 (dark green) and template 5ES4 (light green).

[Chart for Figure 29: Z-DOPE score (y-axis, -1.000 to 4.500) for each integrin subunit (α3, α4, αD, αE, αL, αM, αX, β1, β2, β7) under the All, Fewer and Fewest template regimes; series: Average, Max, Min, Median.]

Figure 29. Comparison of the Z-DOPE scores of models generated for the open conformation. Displayed as seen for Figure 26.

[Chart for Figure 30: Z-DOPE score (y-axis, -1.000 to 2.500) for each integrin subunit (α3, α4, αD, αE, αL, αM, αX, β1, β2, β7); series: Average, Max, Min, Median.]

Figure 30. Comparison of the final models generated for the open conformation. Displayed as seen for Figure 26.


Figure 31. Overlay of open conformation integrin models with their templates. A) α3 (green) and template 4O02 (purple). B) α4 (light blue) and template 3V4P (yellow). C) αD (green) and template 3V4P (purple). D) αE (yellow) and template 40O2 (grey). E) αL (orange) and template 3V4P (grey). F) αM (white) and template 3V4P (purple). G) αX (white) and template 3V4P (purple). H) β1 (orange) and template 3VI4 (blue). I) β2 (light green) and template 3V4P (yellow). J) β7 (dark green) and template 3V4P (light green).


Templates with sufficient coverage to overlay against models are very limited for the open conformations (Figure 31). Most of the high quality and high coverage templates are of the closed conformation which cannot be compared against open conformation models.

Table 11. P-values associated with each integrin subunit of all, fewer and fewest data sets. The data sets were compared using the ANOVA statistical test where α = 0.05.

Integrin subunit    P-value of Closed Models    P-value of Open Models
α3                  9.2 x 10^-5                 0.089
α4                  0.263                       0.263
αD                  0.040                       3.4 x 10^-35
αE                  0.957                       7.0 x 10^-6
αL                  0.027                       2.8 x 10^-32
αM                  1.0 x 10^-17                3.1 x 10^-22
αX                  6.2 x 10^-8                 1.2 x 10^-22
β1                  3.5 x 10^-7                 6.8 x 10^-12
β2                  0.001                       8.4 x 10^-7
β7                  0.074                       1.9 x 10^-14

When comparing each data set, it appears that significant differences exist between “all”, “fewer” and “fewest” for many of the integrin subunits (Table 11). However, variation between these sets for the α3 open models, both conformations of α4 subunits, αE, αL and β7 closed models are insignificant as P > 0.05. This suggests that model quality, particularly of open models, is dictated primarily by the selection of templates and template numbers. This highlights previous findings that incorrect template numbers cause a decrease in model quality and further indicates that low target-template identity exacerbates the problem (Park et al., 1998).
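The comparison of the “all”, “fewer” and “fewest” sets can be sketched as follows. The one-way ANOVA F statistic is computed directly; because the F distribution is not available in the Python standard library, the P-value here is estimated with a permutation test, which is a substitute technique and not the procedure used to produce Table 11. The function names and the example scores are hypothetical.

```python
import random
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    values = [v for group in groups for v in group]
    grand_mean = mean(values)
    k, n = len(groups), len(values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    if ss_within == 0:
        return float("inf")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def permutation_p_value(groups, n_permutations=2000, seed=0):
    """Estimate P as the share of label shuffles with an F at least as large as observed."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [v for group in groups for v in group]
    sizes = [len(group) for group in groups]
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start : start + size])
            start += size
        if f_statistic(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Hypothetical Z-DOPE scores for one subunit under three template regimes.
all_set = [0.10, 0.20, 0.15, 0.12, 0.18]
fewer_set = [1.00, 1.10, 1.05, 0.95, 1.20]
fewest_set = [2.00, 2.10, 1.90, 2.05, 2.20]
p_value = permutation_p_value([all_set, fewer_set, fewest_set])
print(f"P = {p_value:.4f}; significant at 0.05: {p_value < 0.05}")
```

The smallest P-value this estimate can return is 1 / (n_permutations + 1), so values as small as those in Table 11 would require the analytical F distribution instead.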

In the majority of cases, except for the α3 integrin subunit, the lowest average Z-DOPE scoring model was generated by the closed conformation of templates. This suggests that template quality variance between the open and closed conformations is large. Crystallising open conformation templates is more challenging due to the presence of the ligand.

Quality may also diminish due to poor target coverage, template resolution or target-template identity. Analysing the open and closed experiments (Figure 27, Figure 30) indicates that the α3, α4, αE, αL and, to an extent, β7 subunits were particularly challenging to model for the closed conformation. The open conformation, however, only achieved negative Z-DOPE scores for the α4, αX, β1 and β2 integrins. The αD, αE and αL subunits in particular attained exceptionally high Z-DOPE scores for the open conformation. However, in the cases of αD and αL it was possible to improve model quality by utilising different template regimes, namely the fewer and all regimes, respectively. It appears that αE remained particularly problematic regardless of the conformation. Generally, high template numbers facilitated lower Z-DOPE scores for β subunits, with specific selection of templates being preferred for α subunits. The results highlight the relationship between target-template identity and template numbers. In all cases, models adopted a structure similar to that of integrin subunits and can be viewed in their respective directories.

5.4.1. Docking using ClusPro

Although not outside the scope of capability of ClusPro, docking protein-protein subunits is more challenging than that of enzyme-inhibitor complexes (Kozakov et al., 2017). This is due to the presence of conformational changes within the backbone during the docking procedure (Kozakov et al., 2017). However, integrins were modelled in each conformation separately and docking would induce no conformational changes within targets. Many of these targets utilised templates which included the interface region between integrin subunits thereby already providing relevant information required for accurate docking.

Although many final models shared similar interactions reflecting those of templates (Figure 32), orientations of these subunits were often incorrect. They were aligned against templates to determine how the structure of models and templates differed. It appears that the β-tail and EGF repeat domains, such as those of the αDβ2 open conformation, clash with thigh and calf domains from the β and α subunits. Templates of these integrins do not cover these regions and thus MODELLER must rely on ab initio procedures that are less accurate.

The highest coverage and highest average templates, 3VI3 and 1JV2, were chosen to be overlaid with the models. These templates have relatively high target-template identity over the whole heterodimeric model relative to other potential closed and open templates.


Figure 32. Overlay of the closed and open templates of the α4β1, αDβ2, αLβ2 and αXβ2. The closed (A) and open (B) α4β1, closed (C) and open (D) αDβ2, closed (E) and open (F) αLβ2 and closed (G) and open (H) αXβ2. The α chain (green), β chain (red) and templates 1JV2 (open, white) and 3VI3 (closed, white).

Energies associated with these docked subunits indicate that the hydrophobic-favoured models provide the best results. For many integrins, hydrophobic-favoured results represented the highest number of models for many of the obtained clusters. However, the closed αXβ2, open αMβ2 and open αDβ2 all displayed a lowest energy model originating from the electrostatic-favoured set. The ClusPro energy results are cumbersome and even an adequate summary would be lengthy. Supporting information can be located within the ClusPro energy results spreadsheet. The number of models in each cluster has been highlighted using data bars and the results themselves are compared within each cluster set through conditional formatting. The lowest energy model is represented by a blue background.


Structural integrity between integrin subunits is maintained by non-covalent forces (Baker and Zaman, 2010). The contribution of each type of non-covalent force that binds the subunits of integrin proteins together does not yet appear to be known. The final models were chosen from the lowest energy models, highlighted in blue within the ClusPro energy results, of the first ten clusters. The first ten clusters contain over half of the overall results from each set and in many cases represent the most likely biological results. Smaller clusters with lower energy models, particularly those below ten, may not be representative of true biological structure and may simply have been generated by random chance (Kozakov et al., 2017). These energies should not be considered a measure of binding affinity, as the PIPER algorithm does not attempt to estimate true interaction energy (Kozakov et al., 2017). However, these energy values do attempt to evaluate the shape complementarity (Katchalski-Katzir et al., 1992), electrostatic interactions (Gabb et al., 1997) and desolvation contributions (Chen and Weng, 2002) to the docking process. Although incorporation of the FFT correlation method into the ClusPro algorithm facilitates better docking accuracy (Kozakov et al., 2006), the algorithm may still be improved, especially concerning transmembrane heterodimeric protein subunits. This is due to the increased tolerance required for the steric clashes that result during docking, noted in previous studies (Kozakov et al., 2017) and in the current study. Researchers have also noted that the lowest energy structures reported by ClusPro may not necessarily be correct and that the low energy conformations reported may not always resemble X-D structures (Kozakov et al., 2017).
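The selection rule described above can be sketched as follows, assuming the clusters are listed in the order in which they are ranked (largest first) and that each carries the energy of its lowest energy member. The dictionary keys, the function name `pick_final_model` and the example values are hypothetical and do not reflect the format of the ClusPro output files.

```python
def pick_final_model(clusters, top_n=10):
    """Return the cluster with the lowest energy model among the first top_n clusters."""
    considered = clusters[:top_n]
    if not considered:
        raise ValueError("no clusters supplied")
    return min(considered, key=lambda cluster: cluster["lowest_energy"])


# Hypothetical clusters in ranked order (illustrative values only).
example_clusters = [
    {"cluster": 0, "members": 120, "lowest_energy": -950.2},
    {"cluster": 1, "members": 95, "lowest_energy": -1010.7},
    {"cluster": 2, "members": 60, "lowest_energy": -880.4},
]
# An eleventh-ranked cluster with a lower energy is deliberately ignored.
example_clusters += [
    {"cluster": i, "members": 20, "lowest_energy": -700.0} for i in range(3, 10)
]
example_clusters.append({"cluster": 10, "members": 4, "lowest_energy": -1200.0})

best = pick_final_model(example_clusters)
print(f"cluster {best['cluster']} with energy {best['lowest_energy']}")
```

Restricting the search to the first ten clusters keeps a small, possibly chance-generated cluster from being selected on energy alone.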

The final models, particularly those of the closed conformation, appear consistent with the templates. Conflicts in which residues were modelled and docked so as to pass through those of the other chain have been highlighted within the appendix (Section D: CluPro and HADDOCK2.2, Table 23).

Although the αEβ7 integrin can be considered the worst model in terms of energy, no conflicts like those described exist. Likewise, for the closed conformation of α4β1, αDβ2 and αMβ2 these conflicts were absent. Adding additional templates to these regions is unlikely to solve the issue, as the problem is not necessarily structural. Instead, the orientations of the docks appear to be incorrect, which is likely due to the energy scoring functions of ClusPro, which may not always be accurate (Kozakov et al., 2017).


An inconvenient disadvantage of ClusPro is its handling of PDB chain names during docking. When submitting chain A and chain B to ClusPro, the user is asked to specify the ligand and the receptor. Once docking was completed, the chain names “A” and “B” had been removed but no reassignment had been performed. Chain names therefore had to be reassigned before submission for model validation.
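The chain reassignment described above can be scripted. The following is a minimal sketch only, assuming that the docked chains appear in order in the output file and that each chain is terminated by a TER record; the function name `reassign_chain_ids` is hypothetical and is not part of ClusPro. It relies on the fixed-column PDB format, in which the chain identifier occupies column 22.

```python
def reassign_chain_ids(pdb_lines, chain_ids=("A", "B")):
    """Write a chain ID into column 22 of every ATOM, HETATM and TER record.

    Assumption (not from ClusPro documentation): chains appear in order and
    each chain ends with a TER record, so a TER advances to the next chain ID.
    Lines are expected without trailing newline characters.
    """
    relabelled = []
    chain_index = 0
    for line in pdb_lines:
        record = line[:6].strip()
        if record in ("ATOM", "HETATM", "TER"):
            if chain_index >= len(chain_ids):
                raise ValueError("more chains in file than chain IDs supplied")
            padded = line.ljust(22)  # short TER lines are padded to reach column 22
            line = padded[:21] + chain_ids[chain_index] + padded[22:]
            if record == "TER":
                chain_index += 1
        relabelled.append(line)
    return relabelled


if __name__ == "__main__":
    # Hypothetical two-chain fragment with blank chain identifiers.
    atom = "ATOM".ljust(6) + "1".rjust(5) + "  N   MET" + "  " + "1".rjust(4)
    for row in reassign_chain_ids([atom, "TER", atom, "TER", "END"]):
        print(row)
```

Records other than ATOM, HETATM and TER are passed through unchanged, so header and remark lines are preserved.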

5.4.2. Docking using HADDOCK2.2

Ligand prediction was completed using the I-TASSER server, which modelled each subunit and determined possible ligand-associated interactions. Residues were predicted in sets, with each set being associated with a ligand or cofactor. The prediction was based on interactions of homologs using the FINDSITE protocol, amongst others, which have been incorporated into the I-TASSER algorithm (Capra et al., 2009). The predicted residues can be found within the appendix (Section D: CluPro and HADDOCK2.2, Table 24).

Although not a disadvantage, acquiring the active residues involved in the docking process may be challenging. This is particularly pertinent when templates representing the dock are absent. The exact heterodimeric interface templates of the integrins within this study are absent from the PDB. Using software such as I-TASSER to predict points of interaction is one way to overcome this challenge. Phyre2 and the 3DLigandSite protocol can also be used to gain further insight into regions of importance concerning ligand association (Wass et al., 2010). Alternatively, it would be beneficial to manually obtain a similar template displaying exact points of interaction between residues of different chains. These templates can be aligned against models and points of interaction can be highlighted.

In this instance both were completed; however, when overlaying the most similar template structure (3VI4) onto the models, very little overlap was observed. Each of the specified active residue sets was chosen from top to bottom to indicate the “fewest”, “fewer” and “all” active residues. The more residues present in the active residue list, the “blinder” the dock becomes, which may reduce docking accuracy but will most likely provide the optimal dock.

Several of the HADDOCK runs failed, including those of α4β1, αDβ2, αEβ7 and αLβ2. Many of these procedures failed due to the inability of MolProbity to determine the protonation state of a histidine residue. It is possible to assign protonation states manually through the expert and guru interfaces of HADDOCK. Information about the protein would also be required to set the protonation states. If such information is lacking, the states may not be possible to set correctly, and all protonation states would therefore have to be tested.

There are four protonation states of histidine: biprotonated, neutral τ tautomer, anionic τ tautomer and anionic π tautomer (Li and Hong, 2011). The structures of these states differ (Figure 33) and are dependent on environmental pH (Kim et al., 2013).

Figure 33. Protonation states of Histidine. From left to right (biprotonated, neutral τ tautomer, anionic τ tautomer and anionic π tautomer).
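Testing every protonation state quickly becomes expensive, since the number of assignments grows exponentially with the number of histidine residues. The sketch below illustrates this, assuming the four state labels listed in the text; the residue numbers are hypothetical and the function name is illustrative only, not part of HADDOCK.

```python
from itertools import product

# State labels taken from the text above; treated here as plain strings.
STATES = ("biprotonated", "neutral tau", "anionic tau", "anionic pi")


def enumerate_protonation_assignments(histidine_residues, states=STATES):
    """Yield one {residue number: state} mapping per possible combination."""
    for combination in product(states, repeat=len(histidine_residues)):
        yield dict(zip(histidine_residues, combination))


if __name__ == "__main__":
    # Hypothetical histidine positions; two residues give 4 ** 2 assignments.
    assignments = list(enumerate_protonation_assignments([112, 245]))
    print(len(assignments))  # 16
    print(assignments[0])
```

With n undetermined histidine residues there are 4^n assignments to submit, which is why prior information about the protein is preferable to exhaustive testing.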

The protonation state of histidine residues impacts protein structure by influencing surrounding residues. Histidine residues have biological roles as enzymatically active residues (Cleland, 2000), proton transfer shuttles (Ren et al., 1995; Tu et al., 1989) and ligand coordinators (Olson et al., 1988; Stockel et al., 1998). Of these, ligand coordination is of particular importance within this study.

The second issue, encountered with αEβ7, likely resulted either from specifying a residue within the integrin “active residue list” that does not exist or from a failure to read the appropriate distance restraints. Analysis of the list indicates that residue number “439” was specified in addition to the residues already incorporated within the “fewest” data set. This residue was also specified within the “all” data set, along with the residues surrounding it. Thus, it is likely that distance restraints prevented the former from docking because residue “439” is embedded some distance away from the would-be docked ligand. In the case of the “all” set, specifying the residues surrounding this residue most likely catered for any potential interactions between this region and the docked ligand. Insufficient published information exists regarding the problems encountered when using HADDOCK2.2, as most users prefer to make use of the online forum to troubleshoot docking procedures. However, this causes additional problems, as other online users may be inexperienced concerning the particular protein under study and the intricacies of the desired docking results.
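One of the possible causes named above, an active residue that does not exist in the submitted structure, can be ruled out before submission. The following is a minimal sketch under the assumption that the model is a standard fixed-column PDB file (chain identifier in column 22, residue number in columns 23 to 26); the function names are hypothetical and are not part of HADDOCK. Insertion codes are ignored.

```python
def residues_in_pdb(pdb_lines, chain=None):
    """Collect the residue numbers found in ATOM records (optionally one chain)."""
    found = set()
    for line in pdb_lines:
        if line.startswith("ATOM") and len(line) >= 26:
            if chain is None or line[21] == chain:
                found.add(int(line[22:26]))
    return found


def missing_active_residues(active_residues, pdb_lines, chain=None):
    """Return the active residues that are absent from the structure."""
    present = residues_in_pdb(pdb_lines, chain)
    return sorted(r for r in active_residues if r not in present)


if __name__ == "__main__":
    def atom_line(chain_id, residue_number):
        # Build a truncated, hypothetical ATOM record up to the residue number.
        return ("ATOM".ljust(6) + "1".rjust(5) + "  CA  ALA " + chain_id
                + str(residue_number).rjust(4))

    model = [atom_line("A", 438), atom_line("A", 440)]
    print(missing_active_residues([438, 439, 440], model, chain="A"))
```

An empty list indicates that every specified residue exists, in which case the failure is more likely related to the restraints themselves.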

Table 12. Output results of the αXβ2 integrin. Parameters such as the HADDOCK score and Z-score are displayed for the clusters of each model run.

Parameter                     Final              All                Fewer              Fewest
HADDOCK score                 -279.2 +/- 4.2     -251.9 +/- 4.1     -267.9 +/- 16.2    -269.4 +/- 7.3
Cluster size                  40                 54                 30                 38
RMSD from OLES*               3.5 +/- 0.6        2.9 +/- 0.7        4.9 +/- 1.1        2.6 +/- 0.2
Van der Waals energy          -15.7 +/- 1.3      -13.6 +/- 2.5      -21.5 +/- 3.9      -19.6 +/- 3.7
Electrostatic energy          -21.0 +/- 20.1     -50.6 +/- 35.4     -6.5 +/- 4.5       -50.5 +/- 47.5
Desolvation energy            -273.8 +/- 4.2     -258.5 +/- 9.9     -266.4 +/- 11.2    -259.8 +/- 5.8
Restraints violation energy   144.9 +/- 52.45    303.1 +/- 64.94    212.8 +/- 28.20    200.2 +/- 26.57
Buried Surface Area           584.5 +/- 38.6     582.6 +/- 71.5     560.6 +/- 66.9     560.9 +/- 34.9
Z-Score                       -1.4               -1.2               -1.4               -1.2
*overall lowest-energy structure

From the energy output (Table 12) of αXβ2, it can be concluded that the more energetically favourable model was generated from either the “final” or “fewest” sets. The energy results are very similar; however, the “final” cluster is much larger than the “fewer” cluster. Since more of the generated models resemble the best “final” model, it is likely that this state is more representative of the actual biological state of the ligand-integrin interaction. HADDOCK also provides detailed graphical representations of the results, which can be consulted to determine more precisely which model is best. This output includes the interface-RMSD, calculated on the backbone atoms of all residues involved in intermolecular contact within 10 Å. The ligand-RMSD is calculated in a similar fashion after being associated with the backbone atoms. The fraction of common contacts is also displayed, which indicates intermolecular contact based on the best HADDOCK model.
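The relationship between the HADDOCK score and the individual energy terms in Table 12 can be illustrated with a short calculation. This is a sketch only, assuming the default HADDOCK2.2 weights for the final water-refinement stage (1.0 for van der Waals, 0.2 for electrostatic, 1.0 for desolvation and 0.1 for restraints violation energy); these weights are not stated in the text and are labelled as assumptions in the code. The mean energy values are taken from Table 12.

```python
# Assumed default weights of the final HADDOCK2.2 refinement stage (not given in the text).
WEIGHTS = {"vdw": 1.0, "elec": 0.2, "desolv": 1.0, "air": 0.1}


def haddock_score(vdw, elec, desolv, air, weights=WEIGHTS):
    """Weighted sum of the energy terms reported by HADDOCK."""
    return (weights["vdw"] * vdw + weights["elec"] * elec
            + weights["desolv"] * desolv + weights["air"] * air)


# Mean energy terms from Table 12: (van der Waals, electrostatic, desolvation, restraints violation).
TABLE_12 = {
    "Final": (-15.7, -21.0, -273.8, 144.9),
    "All": (-13.6, -50.6, -258.5, 303.1),
    "Fewer": (-21.5, -6.5, -266.4, 212.8),
    "Fewest": (-19.6, -50.5, -259.8, 200.2),
}

if __name__ == "__main__":
    for run, terms in TABLE_12.items():
        print(run, round(haddock_score(*terms), 1))
```

Under these assumed weights the recomputed values agree with the reported HADDOCK scores to within rounding, which shows that the desolvation term dominates the score in every run.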

It may have been more beneficial to use HADDOCK for docking integrin monomers using a PyMOL generated surface residue list as input. In this instance, PyMOL could be used to visualise each monomer separately and three runs could be created, similar to those of the current HADDOCK experiment. Surface residues at distances of 0.5 Å, 1.0 Å and 1.5 Å could be extracted on the interface side of the monomers. These generated lists could then be submitted to HADDOCK as input.

Previous physics-based studies indicated that although HADDOCK2.2 performs well, the algorithm and energy function can be improved (Vangone et al., 2017). Although HADDOCK2.2 is able to dock a wide variety of targets and uses the same scoring function for both protein-protein and protein-nucleic acid systems, the software fails, in some cases, to select the best available model (Vangone et al., 2017).

5.4.3. Docking using AutoDock Vina

The CS1 site (EILDVPST) of fibronectin is known to contain the LDV sequence involved in ligand-integrin associations (Yamada, 1991). A BLASTp search of the PDB yielded no structures directly representing this region within fibronectin. However, the 1YSJ file contained a solved structure that shared 100% sequence identity with the CS1 site. The 1YSJ structure was used, as the CS1 site of fibronectin has yet to be solved. Other potential LDV structures could be acquired from 2LGW, 3A7K, 1OXS and 1OXX, which all share 100% sequence identity with the CS1 site. In contrast to this, α3β1 associates with laminin (Humphries et al., 2006). For A-domain containing integrins, the association lies within a collagenous GFOGER motif (Emsley et al., 2000).

However, α3β1 lacks an A-domain and no specific ligand association site has yet been determined (Humphries et al., 2006). Additionally, this integrin is known to associate with laminin-5 and not fibronectin. The region has been narrowed to the G-domain at the C-terminal of the α chain (Colognato and Yurchenco, 2000; Hirosaki et al., 2000; Ido et al., 2004), specifically to the first three domains, known as LG1-3 (Ido et al., 2004; Kunneken et al., 2004). The known ligand recognition sites within laminin include YIGSR (Graf et al., 1987), PDSGR (Kanemoto et al., 1990), RYVVLPR (Skubitz et al., 1990), LGTIPG (Mecham et al., 1989), RGD (Aumailley et al., 1990), IKVAV (Tashiro et al., 1989) and LRE (Hunter et al., 1989). However, when analysing the G-domain of laminin-5, which is known to associate strongly with α3β1, these recognition sites are absent from the canonical sequence. In general, not much is known about the association between non αA-domain containing integrins and laminin (Humphries et al., 2006). The mechanism and location of interaction within the integrin and laminin have yet to be determined. Although mouse laminin α5 contains an integrin associating RGD sequence, this does not occur for human integrin-laminin associations, as the RGD sequence is not evolutionarily conserved within human laminin-binding integrin interactions (Aumailley et al., 1990; Schulze et al., 1996).
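Checking a sequence for the known recognition sites listed above amounts to a simple substring search. The sketch below shows one way this could be done; the motif list is taken from the text, whereas the example sequence is a short hypothetical string and not the laminin G-domain sequence.

```python
# Recognition motifs listed in the text.
MOTIFS = ["YIGSR", "PDSGR", "RYVVLPR", "LGTIPG", "RGD", "IKVAV", "LRE"]


def find_motifs(sequence, motifs=MOTIFS):
    """Return {motif: [1-based start positions]} for every motif, including overlaps."""
    hits = {}
    for motif in motifs:
        positions = []
        start = sequence.find(motif)
        while start != -1:
            positions.append(start + 1)
            start = sequence.find(motif, start + 1)
        hits[motif] = positions
    return hits


if __name__ == "__main__":
    toy_sequence = "MKRGDAAYIGSRLLRGD"  # hypothetical sequence for illustration only
    for motif, positions in find_motifs(toy_sequence).items():
        print(motif, positions)
```

A motif that returns an empty list is absent from the sequence, which is the outcome reported above for the G-domain of laminin-5.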

It is expected that LDV associations with integrins are functionally similar to those of the RGD ligand (Humphries et al., 2006). The RGD ligand associates within the α/β subunit interface. The R residue can be located within the α subunit β-propeller cleft whilst the D residue coordinates a cation bound within the βA-domain (Humphries et al., 2006). For β2 related integrins, ligand association is through the αA-domain (Shimaoka et al., 2003) and is coordinated by an aspartate, whilst for β1 and β7 integrins the coordination is performed by a glutamate (Shimaoka et al., 2003). As stated, α3β1 associates strongly with laminin α5 and the association lies within the G-domain of the C-terminal end. However, the location of association within integrins is unknown.

Thus, for integrin-ligand associations involving LDV, the interaction is expected within the α/β subunit interface and involves the αA-domain. Ligand coordination for β1 and β7 is performed by glutamate, while β2 ligand coordination is performed by aspartate.

Two sets of AutoDock Vina results were generated: the first utilised models generated from the open conformation whilst the second made use of closed conformation models. The LDV ligand associated with these integrins is expected to bind to the same interface region (Humphries et al., 2006). This region lies within the extracellular headpiece between the two chains. The interactions between the ligand and the open conformation of αXβ2 indicate that VAL-58 interacts with LYS-164, ASP-57 interacts with ASN-159 and LEU-56 interacts with ASN-207 (Figure 34).

Although the energy associated with the closed conformation (-6.8 kcal/mol) is lower than that of the open conformation (-6.7 kcal/mol), the difference is negligible. Additionally, the structural integrity of the closed conformation is likely higher than that of the open conformation due to the modelling confidence being superior for the closed conformation (Figure 27, Figure 30).


Figure 34. Interactions of the ligand (LDV) with αXβ2. The ligand (purple), α (green) and β (light blue). The dotted yellow lines indicate polar interactions determined by PyMOL.

The V-K, D-N and L-N interactions can be best characterised by their amino acid properties. V is non-polar and hydrophobic while K is positively charged, polar and hydrophilic. D is negatively charged, polar and hydrophilic while N has no charge, is polar and hydrophilic. L is non-polar and hydrophobic while, as stated, N has no charge, is polar and hydrophilic. Additional experiments and accurate solved structures would be required to provide further information on the ligand-protein interaction. Furthermore, the ligands interacting with these integrins have yet to be solved in their entirety (Humphries et al., 2006).

Previous experiments have indicated that multiple conformations of integrin can support ligand binding (Nagae et al., 2012). The tripeptide which associates with the integrin binding cleft located between the two integrin subunits is noted to be small; however, high quality structural detail concerning ligand binding of these integrins is largely absent (Nagae et al., 2012). Additionally, the existence of a second ligand bound conformation state of the β3 integrin subunit has brought previous knowledge concerning these integrin-ligand interactions into question (Nagae et al., 2012). The ligands associated with the closed conformation of integrins in a similar fashion to the open conformation. The docking energies are very similar; however, much like in the case of the open conformation, accurate high-quality templates are required for both these particular integrins and fibronectin to perform optimal docking experiments.

Docking experiments such as these are limited by exhaustiveness (Trott and Olson, 2010). The search space between the two integrin subunits is large and low exhaustiveness limits the ability of the program to obtain accurate results. The exhaustiveness was set to 128 and a moderately large surface area was specified. Time constraints and limits on computational power can hinder the ability of the program to determine the optimal dock, as it may be unable to accurately assess the entire search space. Additionally, there is a lack of information concerning the structures and ligand association, as no examples exist within databases such as the PDB. Assumptions are made from both a docking perspective and a biological perspective. These results would need to be verified by laboratory experimentation.
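The exhaustiveness and search space are supplied to AutoDock Vina through a plain-text configuration file. The sketch below generates such a file using option names from the Vina configuration format (receptor, ligand, out, center_x/y/z, size_x/y/z, exhaustiveness). Only the exhaustiveness value of 128 comes from the text; the file names, box centre and box dimensions are hypothetical placeholders, as is the helper function name.

```python
def build_vina_config(receptor, ligand, center, size, exhaustiveness=128,
                      out="docked.pdbqt"):
    """Return the text of an AutoDock Vina configuration file.

    center and size are (x, y, z) tuples in Angstrom describing the search box.
    """
    axes = ("x", "y", "z")
    lines = [f"receptor = {receptor}", f"ligand = {ligand}", f"out = {out}"]
    lines += [f"center_{axis} = {value}" for axis, value in zip(axes, center)]
    lines += [f"size_{axis} = {value}" for axis, value in zip(axes, size)]
    lines.append(f"exhaustiveness = {exhaustiveness}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical file names and search box; only exhaustiveness = 128 is from the text.
    config_text = build_vina_config("integrin_open.pdbqt", "ldv.pdbqt",
                                    center=(10.0, 25.0, -5.0),
                                    size=(40.0, 40.0, 40.0))
    print(config_text)
```

A larger box increases the volume that must be sampled, so box size and exhaustiveness need to be chosen together when computational time is limited.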

Even less information exists regarding the α3β1 integrin and its association with ligands (Humphries et al., 2006). In this instance, laminin is used, which is a known ligand of this integrin. The particular site of association within the laminin sequence is unknown but has been localised to the G-domain (Emsley et al., 2000). Structural information regarding the rest of the laminin protein can be removed. There are known regions of interaction within the laminin sequence, however, none of these known sequences fall within the G-domain. Therefore, a new point of interaction must exist. An assumption is made whereby the laminin protein will associate in a region similar to that of other ligands to the integrin protein. This is supported by evidence which indicates that mouse laminin isoforms associate with integrins through an RGD sequence (Barczyk et al., 2010). However, this is not present within the laminin α5 subunit.


3DLigandSite prediction indicated that laminin contained 17 residues which are involved in ligand binding. These ligand binding sites can be found within the Excel support file called 3DLigandSite. From this ligand site list, the tripeptide ELV was indicated as a potential ligand binding site. The E residue of this tripeptide is negatively charged, polar and hydrophilic whilst both the L and V residues are non-polar and hydrophobic. This tripeptide has similar properties to the LDV ligand, in which the L and V residues are non-polar and hydrophobic whilst the D residue is negatively charged, polar and hydrophilic. Docking the ELV ligand to the α/β interface has an associated energy of -6.7 kcal/mol.
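The property comparison between the ELV and LDV tripeptides can be expressed as a simple lookup. This is an illustrative sketch only, using the residue classifications given in the text; the dictionary and function names are hypothetical, and residue order is deliberately ignored.

```python
# Side-chain classifications as described in the text.
PROPERTIES = {
    "L": "non-polar, hydrophobic",
    "V": "non-polar, hydrophobic",
    "D": "negatively charged, polar, hydrophilic",
    "E": "negatively charged, polar, hydrophilic",
    "K": "positively charged, polar, hydrophilic",
    "N": "uncharged, polar, hydrophilic",
}


def property_profile(peptide):
    """Return the sorted list of residue property classes, ignoring residue order."""
    return sorted(PROPERTIES[residue] for residue in peptide)


if __name__ == "__main__":
    print(property_profile("ELV"))
    print(property_profile("LDV") == property_profile("ELV"))
```

The comparison shows only that the two tripeptides share the same composition of property classes; it says nothing about the positions of the residues or the geometry of binding.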

The ELV tripeptide interacts with only the β chain through two THR residues and a single PRO (Figure 35). It is very challenging to obtain even a realistic computer generated dock of the laminin ligand with integrins. This is due to the number of assumptions which would have to be made for both the integrin and the ligand. Firstly, this would assume that laminin associates with integrins through the α/β interface. Secondly, the RGD sequences and other known ligand recognition sites within laminin are not present within the G-domain. Thirdly, the known interactions within the G-domain (Figure 35) would have to be assumed representative of the same potential interactions with integrins. Fourthly, the proteolytic cleavage of laminin (Steadman et al., 1993) may introduce new potential interactions with residues which, in the current state represented by 5AUX, may not be accessible due to steric hindrance. Fifthly, the lack of structural information regarding the G-domain of laminin and the lack of accurate structural information regarding the α3β1 integrin negatively affect docking options, as potential points of interaction have yet to be determined. Sixthly, the lack of evolutionary relatedness between mouse and human laminin α5 subunits (Aumailley et al., 1990; Schulze et al., 1996), coupled with the differences between the α3β1 integrin and other integrins, leukocyte related or otherwise, further complicates the issue.


Figure 35. Interactions between Integrin Binding Fragment of laminin-511 (ELV) sequence and integrin α3β1. Laminin (red), β1 (cyan) and polar interactions (dotted yellow lines).

These factors would first have to be addressed and more information should be acquired before accurate docking procedures can be completed. However, based on sequence information and the current knowledge, it seems more likely that laminin associates with the α3β1 integrin through the G2-domain. Additional studies would have to be completed, and further experimental data acquired, to verify and provide more insight into laminin-α3β1 interactions.

5.4.4. Metal ion coordination

Integrins contain metal ions (Ca2+ or Mn2+) that associate with propeller blades. These ions increase thigh domain surface rigidity (Arnaout, 2016). Effects of these metal ions on ligand association can only be represented through molecular dynamics simulations using programs such as the GROningen MAchine for Chemical Simulations (GROMACS).


GROMACS incorporates leap-frog Verlet, velocity Verlet, Brownian and stochastic dynamics, energy minimisation, normal-mode analysis and simulated annealing (Abraham et al., 2015). GROMACS includes 15 variations of the AMBER, CHARMM, GROMOS and OPLS force fields, among others (Abraham et al., 2015), allows for temperature and pressure regulation, and uses SHAKE (Ryckaert et al., 1977) and P-LINCS (Hess, 2008) to enforce holonomic constraints, which can be combined with virtual interaction sites (Berendsen and Van Gunsteren, 1984). Geometric restraints, explicit or implicit solvent types, generalized ensemble methods such as replica-exchange, and non-equilibrium methods such as pulling and umbrella sampling, in conjunction with alchemical free-energy transformations, are available within the GROMACS package (Abraham et al., 2015).
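To illustrate the leap-frog integration scheme named above, the following is a minimal one-dimensional sketch in which velocities are stored at half time steps and positions at full time steps. It is a toy example applied to a harmonic oscillator with assumed unit mass and unit force constant; it is not GROMACS code and does not reflect how GROMACS implements the integrator internally.

```python
def leapfrog(x, v_half, force, mass, dt, steps):
    """Leap-frog integration in one dimension.

    x is the position at time t and v_half is the velocity at t - dt/2.
    Each step updates the velocity to t + dt/2 and then the position to t + dt.
    """
    trajectory = [x]
    for _ in range(steps):
        acceleration = force(x) / mass
        v_half = v_half + acceleration * dt
        x = x + v_half * dt
        trajectory.append(x)
    return trajectory, v_half


if __name__ == "__main__":
    # Toy harmonic oscillator (assumed k = 1, m = 1), starting at rest at x = 1.
    def spring(position):
        return -1.0 * position

    dt = 0.01
    # Shift the initial velocity back by half a step: v(-dt/2) = v(0) - a(0) * dt / 2.
    v_start = 0.0 - spring(1.0) * dt / 2.0
    positions, _ = leapfrog(1.0, v_start, spring, mass=1.0, dt=dt, steps=628)
    print(positions[-1])
```

After roughly one oscillation period the particle returns close to its starting position, which reflects the good long-term stability of the scheme for small time steps.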

Although GROMACS has recently undergone vast improvement through multi-level parallelism (Abraham et al., 2015), incorporation of the ensemble framework Copernicus, Verlet lists with automatic buffering (Páll and Hess, 2013) and load distribution over multithreading-based GPUs and single instruction, multiple data CPUs, the procedure is time intensive and computationally expensive to complete.

Electrostatic forces in GROMACS are computed using particle-mesh Ewald (PME) methods (Darden et al., 1993) through MPI ranks for summation of boxed areas and the 2D pencil decomposition (Pronk et al., 2013) necessary for 3D-FFT completion. Similarly, GROMACS will compute ionic forces within ligand associations through implementation of the correct force fields.

5.4.5. PROSA, Verify-3D and PROSESS Results

The final models were submitted to three different model verification programs. Verify-3D (Table 13, Figure 36) indicates that all closed conformations passed except the αEβ7 integrin, which failed at 72.98%. In comparison, the best modelled was α3β1 at 87.40% accuracy. The pass threshold was set to a 3D-1D score greater than or equal to 0.2. In many of the modelled heterodimeric proteins the α chain brought down the total score, particularly in the initial or final sections. Despite this, many of the closed conformation models pass with an overall accuracy of approximately 85%. β chains appear to be modelled more consistently, with problem areas occurring sporadically throughout the model.


Table 13. Verify 3D results of final integrin heterodimeric models. The closed and open conformation results are displayed in terms of percentage accuracy. Accurate models are defined as the 3D-1D score >= 0.2.

Integrin   Closed Conformation (%)   Result   Open Conformation (%)   Result
α3β1       87.40                     Pass     86.56                   Pass
α4β1       92.25                     Pass     84.98                   Pass
α4β7       89.28                     Pass     83.12                   Pass
αDβ2       84.32                     Pass     93.22                   Pass
αEβ7       72.98                     Fail     71.05                   Fail
αLβ2       87.15                     Pass     78.94                   Fail
*αMβ2      97.64*                    Pass*    93.07                   Pass
αXβ2       83.41                     Pass     84.20                   Pass
*The αMβ2 integrin fails to be processed for the α chain.
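The percentages in Table 13 can be understood as the proportion of residues whose averaged 3D-1D score reaches the 0.2 cut-off. The sketch below illustrates the calculation, assuming the criterion commonly reported by the Verify3D server that at least 80% of residues must reach the cut-off for a model to pass. The text states only the 0.2 cut-off, although the pass and fail labels in Table 13 are consistent with an 80% requirement. The scores in the example are hypothetical.

```python
def verify3d_summary(scores, residue_cutoff=0.2, required_fraction=0.80):
    """Return (percentage of residues at or above the cut-off, pass flag).

    required_fraction = 0.80 is an assumption; it is not stated in the text.
    """
    acceptable = sum(1 for score in scores if score >= residue_cutoff)
    fraction = acceptable / len(scores)
    return 100.0 * fraction, fraction >= required_fraction


if __name__ == "__main__":
    # Hypothetical per-residue averaged 3D-1D scores.
    example_scores = [0.45, 0.31, 0.12, 0.28, 0.22, 0.05, 0.36, 0.40, 0.27, 0.33]
    percentage, passed = verify3d_summary(example_scores)
    print(f"{percentage:.2f}% of residues score >= 0.2; pass = {passed}")
```

Because the criterion is a simple proportion, a model can fail because of a small number of poorly scoring regions, such as the chain termini, even when most of the structure is well modelled.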

Verify-3D was unable to fully grade the αMβ2 integrin and omitted the α chain regardless of how information was submitted to the server. The model could be viewed from the server interface but no output files could be acquired. Instead, the output that followed was a statement indicating the omission of data.

Figure 36. Verify-3D results showing the failed result of αEβ7 (A) and the passed result of α3β1 (B). The 0.2 cut-off threshold is indicated by the green line, with each amino acid being graded (green dot) and averaged (blue dot). Scores run from 0.8 to -0.8.

The open conformation displayed lower percentage pass rates than the closed conformation and two of the eight models failed. The αLβ2 and, in particular, the αEβ7 integrin models are poor. These results are similar to those of the αEβ7 closed conformation. In cases where the ligand was still attached to the integrin, Verify-3D was unable to grade the protein structure.

It appears that Verify-3D is unable to recognize remark numbers within remark lines concerning ligands, preventing evaluation. Verify-3D is known to be able to evaluate proteins with associated ligands (Eisenberg et al., 1997). Removal of these lines from the PDB allows grading of the protein with the associated ligand. However, the average scores minimise any effect the associated ligand would have on the final score. An example can be found within the appendix (Section E: Verify-3D example with ligand, Figure 42).

PROSA indicates that the overall best scoring model for the closed conformation was αMβ2, acquiring Z-scores of -12.76 and -12.39 for chains A and B, respectively (Table 14). In contrast, the worst scoring model was αEβ7. However, the α3 chain of α3β1 has a higher Z-score than the αE chain.

Table 14. Z-scores obtained by PROSA indicating global quality of final models. Z-scores of the closed and open conformations of integrin subunits divided into α, β and ligands (L).

Integrin   Z-score (closed)      Z-score (open)
           α         β           α         β         L
α3β1       -8.27     -12.06      -9.29     -10.72    0.00
α4β1       -9.66     -12.06      -9.77     -10.74    0.00
α4β7       -9.87     -12.06      -9.42     -8.60     0.70
αDβ2       -12.06    -12.37      -9.69     -9.58     -0.79
αEβ7       -8.47     -9.96       -7.09     -8.60     0.03
αLβ2       -11.98    -12.39      -10.99    -9.58     0.03
αMβ2       -12.76    -12.39      -10.78    -9.57     -0.68
αXβ2       -12.31    -12.39      -12.31    -12.39    0.70

Graphical representations of the α and β chains of the α3β1 integrin (Figure 37) indicate that these subunits were modelled well enough to be placed within the X-ray group. This highlights how well the β chain was modelled in comparison to the α chain.


Figure 37. PROSA graphical results of chain A (left) and B (right) of α3β1. The black dot indicates where the generated model compares to solved X-ray and NMR structures in terms of length and quality.

PROSA provides energy scores averaged over a window of either 10 or 40 residues, indicating poorly modelled regions (Figure 38). In this instance, positive energy scores (above the threshold of 0) indicate poorly modelled regions. When analysing the smaller window size of the α chain, PROSA suggests that the termini of protein sequences are particularly challenging to model accurately.

Figure 38. PROSA graphical results of the α (left) and β (right) chains of α3β1. The Z-score is calculated for each amino acid and averaged over a window size of 10 (light green) and 40 amino acids (dark green).
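The effect of the two window sizes can be illustrated with a simple sliding-window average. The sketch below is an approximation for illustration only: it assumes a centred window that is truncated at the chain termini, which may differ from the exact windowing used by PROSA, and the per-residue values are hypothetical.

```python
def window_average(values, window):
    """Average each position over a centred window, truncated at the ends."""
    averaged = []
    for index in range(len(values)):
        start = max(0, index - window // 2)
        stop = min(len(values), index - window // 2 + window)
        segment = values[start:stop]
        averaged.append(sum(segment) / len(segment))
    return averaged


def poorly_modelled_positions(values, window, threshold=0.0):
    """Return 1-based positions whose windowed average lies above the threshold."""
    return [index + 1 for index, value in enumerate(window_average(values, window))
            if value > threshold]


if __name__ == "__main__":
    # Hypothetical per-residue energies with an unfavourable N-terminal stretch.
    energies = [0.8, 0.6, 0.4, -0.2, -0.5, -0.7, -0.6, -0.4, -0.3, -0.5, -0.6, -0.2]
    print(poorly_modelled_positions(energies, window=4))
```

A larger window smooths out short unfavourable stretches, which is why the smaller window is more informative for locating problems at the termini.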


PROSA and Verify-3D are widely used quality assessment programs and assess structure in terms of protein sequence using single models (Kalman and Ben-Tal, 2010). However, newer programs such as ModFOLDclust (McGuffin, 2009) and Pcons (Larsson et al., 2009) use a set of models, develop a consensus-set as input and grade models within that set against each other (Kalman and Ben-Tal, 2010). The disadvantage of consensus-sets lies in the scenario-based nature of implementation. If the generated set is unlikely to contain many structurally correct models, or if few template structures are available with which to draw comparisons, the reliability of the output could be questioned (Cozzetto et al., 2007). An important factor to take into consideration is the inaccuracies associated with force fields such as OPLS (Jorgensen et al., 1996), CHARMM (Brooks et al., 1983) or AMBER (Weiner et al., 1984), on which validation programs are based. These energy scoring functions and the complex intricacies within force fields do not accurately cover every model and biological situation (Petrov and Zagrovic, 2014). It has been noted that both Verify-3D and PROSA tend to give higher scores than expected concerning higher quality models (Wallner and Elofsson, 2003).

PROSESS provides comprehensive output. A result summary of closed α3β1 chain A and B is provided as an example (Figure 39). The result summary alone is sufficient to grade the average quality of the models.

Figure 39. Summarised PROSESS results of closed and open conformations of chain A of α3β1. The covalent (red), packing (green), non-covalent (blue), torsion angle (purple), chemical shift (light blue), NOE (yellow) and flexibility (black) quality outliers are shown in relation to residue numbers. The closed conformation (C) and the open conformation (O).


Tabulated results are also made available, whereby each parameter is described as being within or outside acceptable levels. If a parameter is outside acceptable levels, the number of deviations by which it is an outlier is reported.

In this summarised graphical data (Figure 39) it appears that flexibility is the greatest issue regarding both chains. A more concise way of viewing the results is through the tabulated summary (Table 15), which indicates the overall level of quality of the protein. The global quality of both chains remains consistent for the closed conformation, with α and β attaining 2.5 and 4.5, respectively. The value is graded out of 10 in 0.5 intervals. When analysing the results it becomes evident, particularly for the α chain, that torsion angle quality is poor, whilst non-covalent or packing quality appears to be the greatest obstacle for the β chain. The results do not differ greatly from templates such as 1JV2, which has a global quality of 3.5 for both chains.

Table 15. Overall quality of each heterodimeric model as graded by PROSESS. The global, covalent, non-covalent / packing and torsion angle qualities are shown for the closed (top) and open (bottom) conformation of α chain and β chain. Values are graded out of ten.

Closed conformation
           Chain α                                                 Chain β
Integrin   Global    Covalent   Non-Covalent   Torsion            Global    Covalent   Non-Covalent   Torsion
           Quality              / Packing      Angle              Quality              / Packing      Angle
α3β1       2.5       7.5        4.5            1.5                4.5       6.5        3.5            4.5
α4β1       2.5       7.5        4.5            1.5                4.5       6.5        3.5            4.5
α4β7       2.5       7.5        3.5            1.5                4.5       6.5        3.5            4.5
αDβ2       2.5       7.5        3.5            1.5                4.5       7.5        4.5            5.5
αEβ7       2.5       7.5        4.5            1.5                4.5       6.5        3.5            4.5
αLβ2       2.5       7.5        4.5            1.5                4.5       7.5        4.5            5.5
αMβ2       2.5       7.5        4.5            1.5                4.5       7.5        4.5            5.5
αXβ2       2.5       5.5        3.5            2.5                4.5       7.5        4.5            5.5

Open conformation
           Chain α                                                 Chain β
Integrin   Global    Covalent   Non-Covalent   Torsion            Global    Covalent   Non-Covalent   Torsion
           Quality              / Packing      Angle              Quality              / Packing      Angle
α3β1       3.5       6.5        3.5            2.5                3.5       4.5        3.5            2.5
α4β1       2.5       7.5        4.5            1.5                3.5       4.5        3.5            2.5
α4β7       2.5       7.5        4.5            1.5                3.5       4.5        3.5            2.5
αDβ2       2.5       7.5        5.5            1.5                3.5       4.5        3.5            2.5
αEβ7       3.5       7.5        3.5            2.5                3.5       5.5        3.5            2.5
αLβ2       2.5       7.5        5.5            1.5                3.5       4.5        3.5            2.5
αMβ2       2.5       7.5        5.5            1.5                3.5       4.5        3.5            2.5
αXβ2       2.5       6.5        4.5            1.5                4.5       7.5        4.5            5.5

Template
           Chain α                                                 Chain β
Template   Global    Covalent   Non-Covalent   Torsion            Global    Covalent   Non-Covalent   Torsion
           Quality              / Packing      Angle              Quality              / Packing      Angle
1JV2       3.5       7.5        3.5            2.5                3.5       6.5        3.5            2.5


PROSESS uses VADAR and GeNMR to evaluate proteins, both of which have been evaluated to be largely accurate in identifying misfolded structures (Berjanskii et al., 2009; Willard et al., 2003). The structures are flagged using RMSD equations based on Chothia and Lesk criteria, which are known to be very accurate (Chothia and Lesk, 1986). PROSESS is very intensive and, by evaluating residues individually, prevents good structures with localised problems from being labelled as high quality. Although PROSESS is adept at identifying misfolded structures, structures with poor stereochemistry and structures containing localised problems, the output used to display these data is cumbersome (Berjanskii et al., 2010). PROSESS can also only evaluate protein-protein systems and ignores small molecules and nucleic acids. In addition, the algorithm fails to perform nomenclature, file formatting or labelling checks and cannot determine packing quality among solvent residues (Berjanskii et al., 2010).

5.5. Final modelling procedure (Transmembrane and Cytoplasmic)

The transmembrane and cytoplasmic portions of integrins were modelled separately using a refinement protocol where possible. Sequence identity for homologs was low at approximately 50%. In many cases, the entire region could not be adequately templated. Loop refinement was possible for segments such as the α4 transmembrane, αD cytoplasmic, αE cytoplasmic and β chain cytoplasmic segments. The lack of template information is likely due to the difficulty in obtaining suitable templates from cytoplasmic portions of membrane proteins.

Many structures within the PDB representing the cytoplasmic portion of integrin proteins do not accurately represent native contacts, as they fail to accommodate transmembrane-cytoplasmic interactions (Vinogradova et al., 2002; Weljie et al., 2002). The complicated nature of interactions within and surrounding the cytoplasmic region hinders efforts to solve its structure. Accurate study of the cytoplasmic portion of αIIbβ3 required NMR and hydrogen-deuterium exchange mass spectrometry, which were used to obtain the structure and evaluate cytoplasmic dynamics (Metcalf et al., 2010). This process also made use of an intersubunit disulfide bond located near the C-terminals of the integrin subunit helices (Metcalf et al., 2010). The disulfide bond was placed in accordance with structure information provided by a generated model. Researchers noted that the cytoplasmic region was largely disordered but contained a contiguous transmembrane-proximal helix segment for the β3 subunit and two other helices (Metcalf et al., 2010). This description is similar to the structure obtained for the αL cytoplasmic portion within this study but fails to represent the β1, β2 and β7 cytoplasmic regions (Figure 40).

These regions are largely solved using NMR techniques, which excludes the more accurate and popular X-ray diffraction technique (Cavagnero, 2003). The cytoplasmic portion of the protein is difficult to solve due to the surrounding environment, transmembrane domain interactions and water solvent. The membrane-bound nature of integrins adds an additional layer of complexity due to the hydrophobic and hydrophilic regions of the phospholipid bilayer. This bilayer consists of phospholipids (O'Brien and Rouser, 1964), glycolipids (Sonnino et al., 2007) and cholesterol (Singer and Nicolson, 1972). Although the membrane alters membrane protein function and structure, this field of study is largely overlooked. Most researchers focus either on studying the biophysics of lipid membrane components using simplified model systems (Andersen and Koeppe, 2007) or on studying the structure and function of membrane proteins while neglecting membrane effects entirely (Grouleff et al., 2015).

The cell membrane should therefore be mimicked throughout the experiment to ensure stability of the protein and accurate structural results. This complicates the purification process and in many cases prevents reliable NMR spectroscopy from taking place (Chatham and Blackband, 2001). Although detergents such as Triton X-100 solubilise membrane proteins and form micelles in water, which mimic the membrane, these detergents prevent accurate NMR readings by increasing background noise (Chamberlain, 2004). Detergent micelles are also not identical in structure to the membrane, due to their inability to attach to non-polar regions of proteins in a parallel fashion (Landreh et al., 2017). The perpendicular orientation adopted in these detergent-solubilised membrane proteins may distort the shape of the protein, thereby preventing accurate results (Landreh et al., 2017). This led to the development of lipopeptide detergents, which facilitate the correct orientation of membrane proteins and are stable, thereby limiting background noise (McGregor et al., 2003). However, these detergents are very expensive.

In recent years, new technologies and methods, such as cryo-electron microscopy (Liao et al., 2013), femtosecond crystallography (Liu et al., 2013), coarse-grained methods (Koldso et al., 2014) and continuum-MD (Mondal et al., 2014), have been developed to overcome problems encountered when using NMR spectroscopy.


The complicated environment, lack of accurate structural information, reliance on NMR data, lack of accurate force interaction information and absence of suitable technologies and methods severely limit the number of suitable PDB files relating to the cytoplasmic portion of integrin subunits. This hinders modelling of these regions, and the structural accuracy of the available template data is itself open to debate.

When modelling the cytoplasmic regions of integrins, MODELLER was unable to perform refinement in some cases. The error indicated that loop regions could not be determined. In contrast to transmembrane segments, cytoplasmic regions do not contain many secondary structures; therefore, some loop regions should exist. Only a single transmembrane segment, α4, included a loop region, even though the entire segment was specified as an α-helix through an added secondary structure restraint. Due to the presence of this loop region, the resulting models were refined via MODELLER's energy refinement protocol.

In cases where loop refinement failed, the target was covered by multiple templates, was covered by a single complete template of 100% identity, or was housed entirely within an α-helix.

Examples of these instances include the αM cytoplasmic region, which used multiple templates (Table 21) that adequately covered the target; the αX cytoplasmic region, with 100% target-template identity; and the α3 transmembrane domain, which formed a complete α-helix (Figure 40, Figure 41). Loop regions would therefore not exist in these models, as the target is either completely covered at 100% identity or forms a single secondary structure. If multiple templates together account for 100% sequence identity, such as in the αL cytoplasmic model, no refinement will take place.
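The reasoning above, that loop refinement has nothing to act on when a segment is entirely helical, can be illustrated with a short sketch. It does not call MODELLER. It only scans a per-residue secondary-structure string (assumed here to use 'H' for helix, 'E' for strand and '-' for coil, in the style of DSSP output) and reports contiguous stretches that are neither helix nor strand. The function name and the example strings are hypothetical.

```python
from typing import List, Tuple


def loop_segments(ss: str, min_len: int = 1) -> List[Tuple[int, int]]:
    """Return 1-based (start, end) ranges of residues not in helix or strand."""
    segments = []
    start = None
    for i, state in enumerate(ss, start=1):
        in_loop = state not in ("H", "E")
        if in_loop and start is None:
            start = i                      # a loop stretch begins here
        elif not in_loop and start is not None:
            if i - start >= min_len:
                segments.append((start, i - 1))
            start = None
    if start is not None and len(ss) - start + 1 >= min_len:
        segments.append((start, len(ss)))  # loop running to the C-terminus
    return segments


# Hypothetical assignments: a fully helical segment and one with a short loop.
print(loop_segments("HHHHHHHHHHHHHHHHHHHH"))
print(loop_segments("HHHHHHHH---HHHHHHHHH"))
```

An empty result for a model corresponds to the situation described above, where no loop region is available for refinement.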

The lowest Z-DOPE scores of the cytoplasmic models are lower than those of the transmembrane models (Table 16). In contrast, the standard deviation and the confidence interval of the cytoplasmic models appear to have a greater effect on the data sets than those of their transmembrane counterparts.


Table 16. Z-DOPE scores of cytoplasmic and transmembrane models generated by MODELLER. The lowest Z-DOPE score, the associated standard deviation (stdev) and the confidence interval (CI) at 95.0% are shown for all data sets of all integrins.

Z-DOPE score

Integrin   Cytoplasmic                       Transmembrane
           Lowest    Stdev    CI (95.0%)     Lowest    Stdev    CI (95.0%)
α3         -0.208    0.219    0.157          5.650     0.102    0.073
α4         -0.403    0.251    0.179          5.041     0.241    0.172
αD          0.915    0.153    0.109          3.710     0.198    0.142
αE          0.847    0.539    0.386          4.191     0.214    0.153
αL         -0.740    0.347    0.248          5.198     0.197    0.141
αM         -0.498    0.245    0.175          3.072     0.138    0.099
αX         -1.138    0.079    0.056          3.697     0.175    0.125
β1          0.792    0.144    0.103          4.122     0.126    0.090
β2          0.337    0.169    0.121          5.299     0.399    0.182
β7          0.439    0.279    0.200          2.768     0.286    0.130
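Summary statistics of the kind reported in Table 16 can be computed from a set of Z-DOPE scores roughly as follows. This is a sketch only. The scores below are invented for illustration, and the confidence interval uses a normal approximation, which is an assumption: the formula and sample size behind the tabulated values are not stated here, and a Student's t interval would be wider for small samples.

```python
import math
import statistics


def summarise_scores(scores, confidence=0.95):
    """Return (lowest, sample stdev, CI half-width) for a list of scores."""
    if len(scores) < 2:
        raise ValueError("at least two scores are needed for a standard deviation")
    lowest = min(scores)
    stdev = statistics.stdev(scores)  # sample standard deviation (n - 1)
    # Two-sided critical value from the standard normal distribution.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * stdev / math.sqrt(len(scores))
    return lowest, stdev, half_width


# Invented Z-DOPE scores for illustration only; not data from this study.
example = [-0.21, 0.05, 0.12, -0.10, 0.30, 0.18, -0.02, 0.25, 0.09, 0.14]
lowest, stdev, ci = summarise_scores(example)
print(f"lowest={lowest:.3f} stdev={stdev:.3f} CI(95%)=+/-{ci:.3f}")
```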

Structurally, cytoplasmic segments are small and consist largely of secondary structures (Figure 40). Although they associate with ligands within the cytoplasmic portion of the cell (Morse et al., 2014), much of the biological activity of integrins originates from the extracellular region. The Z-DOPE scores of these models vary widely (Table 16), but no obvious structural faults are visible. Many of the cytoplasmic regions overlay well with their respective templates (Figure 40). Many of the cytoplasmic portions of integrins contain a GFFKR motif, which is important for the association of the subunits forming the complete integrin (De Melker et al., 1997). It has also been noted that this motif is critical for cell signalling in the α6β1 integrin, and it most likely has similar roles in other integrins.


Figure 40. Cytoplasmic domains of all integrin subunits overlaid with templates. A) α3 (light blue), α4 (brown), αE (purple), αX (yellow) and the template 2LUV (green). B) αD (blue), αL (orange) and the template 2K8O (grey). C) αM (dark green) and the template 2LKE (light green). D) β1 (yellow), β2 (purple), β7 (grey) and the template 3G9W (red).

However, the exact biologically active location of these motifs in 3-dimensional space may not necessarily be accurately represented by computer generated models. This is due to the challenges associated with modelling these regions.
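Locating the GFFKR motif in a subunit sequence is a simple string search, as sketched below. This only reports sequence positions; as noted above, it says nothing about where the motif sits in 3-dimensional space. The function name and the example tail sequence are hypothetical and are not taken from the models in this study.

```python
import re


def find_motif(sequence: str, motif: str = "GFFKR"):
    """Return 1-based start positions of every occurrence of motif in sequence."""
    # A lookahead is used so that overlapping occurrences are also reported.
    pattern = re.compile(f"(?={re.escape(motif)})")
    return [m.start() + 1 for m in pattern.finditer(sequence.upper())]


# Hypothetical cytoplasmic tail sequence, for illustration only.
tail = "KVGFFKRNLKEKMEAG"
print(find_motif(tail))
```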


Figure 41. Transmembrane domains of all integrin subunits overlaid with templates. A) α3 (light blue), αL (purple), αM (yellow), αX (pink) and template 2M3E (green). B) αD (light green) and template 5LV6 (orange). C) αE (purple) and template 2MV6 (dark green). D) α4 (blue) and template 2KSR (white). E) β1 (purple) and template 2N9Y (yellow). F) β2 (blue), β7 (orange) and template 2LOQ (grey).

Transmembrane segments (Figure 41) are structurally simple, consisting of helices which traverse the membrane. These segments are easier to model than cytoplasmic domains. However, energy refinement of this region is difficult because the stabilising effect of the membrane is not taken into account (Park et al., 2018), even though a single membrane-buried hydrogen bond can greatly stabilise the transmembrane domain (Perrin and Nielson, 1997). These regions have limited biological activity but stabilise the integrin protein. Stabilisation occurs through hydrophobic and ionic interactions between the α-helix and the lipid interior as well as the polar head groups of phospholipids (Lodish et al., 2000). Among the modelled transmembrane domains, it should be noted that the β2 transmembrane domain did not adopt a secondary structure. This was in spite of submitting templates containing secondary structure information and specifying additional restraints to model the region as an α-helix. Analysis of the alignment showed that the submitted templates, 4UM8 and 2LOQ, had been aligned against each other, effectively removing the β2 sequence from the alignment. This caused the β2 sequence to be modelled entirely using ab initio techniques, even though the 2LOQ template would have provided the correct structural information. This highlights the importance of ensuring not only that the template PDB file contains the required structural information but also that the sequence to be modelled has been accurately aligned against the correct templates.
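A simple check for the alignment problem described above is to measure how much of the target is aligned to a residue in at least one template before modelling is started. The sketch below parses PIR-style alignment text and reports that fraction. It is an illustration under stated assumptions: the function names, the entry codes and the toy sequences are hypothetical, and the parser handles only the simple layout shown, not every variation of the format.

```python
def parse_pir(text):
    """Parse PIR alignment text into a list of (code, kind, aligned_sequence)."""
    entries = []
    blocks = [b for b in text.split(">P1;") if b.strip()]
    for block in blocks:
        lines = [ln.strip() for ln in block.strip().splitlines() if ln.strip()]
        code = lines[0]
        kind = lines[1].split(":")[0].lower()   # e.g. "structuren" or "sequence"
        seq = "".join(lines[2:]).replace(" ", "").rstrip("*")
        entries.append((code, kind, seq))
    return entries


def target_coverage(text):
    """Fraction of target residues aligned to a residue in at least one template."""
    entries = parse_pir(text)
    targets = [seq for _, kind, seq in entries if kind == "sequence"]
    templates = [seq for _, kind, seq in entries if kind.startswith("structure")]
    if len(targets) != 1 or not templates:
        raise ValueError("expected exactly one target and at least one template")
    target = targets[0]
    if any(len(t) != len(target) for t in templates):
        raise ValueError("all aligned sequences must have equal length")
    non_residue = {"-", "/"}                    # gap and chain-break characters
    covered = total = 0
    for i, res in enumerate(target):
        if res in non_residue:
            continue
        total += 1
        if any(t[i] not in non_residue for t in templates):
            covered += 1
    return covered / total if total else 0.0


# Toy alignment in which the templates align to each other but not to the target.
BAD = """>P1;tmplA
structureN:tmplA:1:A:5:A::::
LLVIA-----*
>P1;tmplB
structureN:tmplB:1:A:5:A::::
LIVLA-----*
>P1;target
sequence:target:1:A:5:A::::
-----IVLLA*
"""
print(target_coverage(BAD))
```

A coverage at or near zero indicates that the target would be built without template information, which is the situation described for the β2 transmembrane domain.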


6. Conclusion

The online modelling servers PRIMO, Phyre2 and I-TASSER perform well considering the limited amount of input required. I-TASSER and Phyre2 offer a completely streamlined approach to homology modelling, which allows users to generate high quality models with minimal knowledge. These online servers also allow rapid generation of models by providing the necessary computing power to those who do not have access to supercomputing. However, transmembrane heterodimeric proteins are far more complex than monomeric extracellular proteins, and many of these online servers do not cater for the associated modelling challenges. To a large degree, MODELLER itself lacks the energy functions needed to model this form of protein accurately. However, manual modelling through MODELLER allows more control over the process. Therefore, in comparison to online servers, MODELLER will likely produce higher quality models, provided the user has some knowledge of bioinformatics, and manual modelling using MODELLER should be the preferred choice where possible. Online servers such as MEDELLER appear to still be in their infancy and in this case failed to model integrins. This failure was likely the result of insufficient template information within the relevant databases, which highlights a significant problem of transmembrane protein modelling: databases such as the PDB contain significantly more templates for extracellular proteins than for their transmembrane counterparts. The limited template numbers severely hinder transmembrane protein modelling accuracy. The expansion of these templates, through experimental determination, should be a major focus in order to provide higher quality transmembrane protein models in future work. Online docking services, such as HADDOCK2.2 and ClusPro, use either general energy functions or energy functions designed specifically for docking globular proteins. The interface regions of heterodimeric transmembrane proteins are complex and docking using these programs may not always be accurate. When comparing HADDOCK2.2 with AutoDock Vina, the increased level of control provided by the latter is more beneficial when tackling complicated docking problems. However, knowledge of the docking process is required. To a large degree, the convenience of online servers is outweighed by the increased accuracy and control associated with manual homology modelling using MODELLER and docking using AutoDock Vina.

The exact intricacies of the MODELLER algorithm appear to be unknown. This is highlighted by the question of how many templates should be used when modelling proteins. The notion that increasing template numbers improves model quality is widespread; however, this may not be the case. Conditions such as template quality (resolution and target-template identity) and target type (monomeric or oligomeric) should be considered. Comparing the very_large and very_slow refinement protocols revealed no apparent change in the resulting models. Information concerning the MODELLER algorithm is limited, such as how MODELLER selects the best templates to use, how it performs loop refinement or full model refinement, and how these refinement protocols are procedurally performed. The agreement between the results acquired here concerning template numbers and previous findings in other experiments highlights the importance of determining exactly how MODELLER performs all of these actions. Bioinformatics is largely focused on generating accurate models and docking ligands to proteins; however, the tools used for such procedures should themselves be developed and fully understood. The lack of information surrounding how these modelling programs perform homology modelling is of great concern. It may be beneficial to perform a large-scale, in-depth study analysing each modelling program in terms of its ability to model extracellular and transmembrane proteins as well as monomeric and oligomeric proteins. It does appear, however, that when modelling transmembrane heterodimeric proteins a good approach is to separate the subunits and then separate each subunit into its extracellular, transmembrane and cytosolic domains. This will alleviate some of the problems incurred, such as MODELLER generating globular shaped models. The monomeric subunit docking process may introduce new problems, such as the clashes generated when using ClusPro. Modelling both chains simultaneously in terms of the extracellular, transmembrane and cytoplasmic domains may overcome this challenge; however, suitable heterodimeric templates which cater for the interface regions would have to be acquired. These templates are rare and it is unlikely that high quality templates for oligomeric proteins would exist in high numbers.

Model evaluation programs, such as PROSA, Verify-3D and PROSESS, each have their own energy scoring functions. PROSA and Verify-3D appear to give higher scores than expected when models are submitted. The method by which PROSA evaluates models appears particularly susceptible to this, as models are compared to structures already present in the database. This may give users a false sense of accuracy, as the quality of other structures present in the database may itself be questioned. In recent years fraudulent structures have been removed from the PDB, and their detection is becoming a major focus in structure curation. Verify-3D provides a more detailed analysis of models but does not allow ligand information generated by AutoDock Vina to be evaluated without editing the ligand files into a particular format. PROSESS provides the most comprehensive and accurate assessment of model quality by determining quality per residue. The server highlights poorly modelled regions better than PROSA or Verify-3D. The disadvantages of PROSESS lie predominantly in the large quantity of output and in the server connection. PROSESS lacks a “download everything” option, so users are required to select each output to be downloaded individually. In addition, the server tends to time out repeatedly and the procedure can be lengthy. Nevertheless, the PROSESS server is largely superior to PROSA and Verify-3D.

The models themselves provide some insight concerning LDV ligand interactions between the subunits. Additionally, the ELV tripeptide is highlighted as a potential region of interaction between the laminin G-domain and the α3β1 integrin. The calf domains associated with these integrin models are lacking in structural detail. This reflects the lack of suitable templates present within the PDB and highlights the complexity of acquiring high quality transmembrane protein templates. These problems are being addressed through new technologies such as cryo-electron microscopy, femtosecond crystallography, coarse-grained methods and continuum-MD. Additionally, the use of lipopeptide detergents facilitates accurate experimental determination of transmembrane protein structures. The closed conformation of integrins associating with the LDV ligand in this study may support other studies indicating that an additional closed conformational state exists in which ligands can still associate. However, the assumptions surrounding LDV ligand docking weaken this support. Experimentally determined protein structures are therefore required to further support or refute the model structures generated within this study.

It should be noted that bioinformatics is a new field of study and, although “simple” modelling problems such as monomeric proteins have largely been covered, there is still much development ahead concerning heterodimeric proteins. This applies both to the templates available and to the energy functions used by modelling programs.


7. Appendix

Section A: PDB files used in the study

Table 17. PDB files used within this study and their description.

Code  Description
1BHQ  MAC-1 I-domain cadmium complex
1CQP  Crystal structure analysis of the complex lfa-1 (cd11a) I-domain / lovastatin at 2.6 a resolution
1IDN  MAC-1 I-domain metal free
1IDO  I-domain from integrin CR3, mg2+ bound
1IRU  Crystal structure of the mammalian 20S proteasome at 2.75 A resolution
1JLM  I-domain from integrin CR3, mn2+ bound
1JV2  Crystal structure of the extracellular segment of integrin alphavbeta3
1KUP  Solution structure of the membrane proximal regions of alpha-iib and beta-3 integrins
1L3Y  Integrin egf-like module 3 from the beta-2 subunit
1LFA  Cd11a I-domain with bound mn++
1M1U  An isoleucine-based allosteric switch controls affinity and shape shifting in integrin cd11b a-domain
1M8O  Platelet integrin alfaIIb-beta3 cytoplasmic domain
1MF7  Integrin alpha m I domain
1MJN  Crystal structure of the intermediate affinity al I domain mutant
1MQ8  Crystal structure of alphaL I domain in complex with ICAM-1
1MQ9  Crystal structure of high affinity alphaL I domain with ligand mimetic crystal contact
1N3Y  Crystal structure of the alpha-X beta2 integrin I domain
1N9Z  Integrin alpha m I domain mutant
1NA5  Integrin alpha m I domain
1RD4  An allosteric inhibitor of LFA-1 bound to its I-domain
1S4X  Nmr structure of the integrin b3 cytoplasmic domain in dpc micelles
1T0P  Structural Basis of ICAM recognition by integrin alpahLbeta2 revealed in the complex structure of binding domains of ICAM-3 and alphaLbeta2 at 1.65 A
1TYE  Nmr structure of the integrin b3 cytoplasmic domain in dpc micelles
1V7P  Structure of EMS16-alpha2-I domain complex
1XDD  X-ray structure of LFA-1 I-domain in complex with LFA703 at 2.2A resolution
1YUK  The crystal structure of the PSI/Hybrid domain/ I-EGF1 segment from the human integrin beta2 at 1.8 resolution
1YSJ  Crystal structure of Bacillus Subtillis YXEP protein (APC1829), a dinuclear metal binding peptidase from the M20 Family
2BRQ  Crystal structure of the filamin A repeat 21 complexed with the integrin beta7 cytoplasmic tail peptide
2H7D  Solution structure of the talin F3 domain in complex with a chimeric beta3 integrin-PIP kinase peptide
2ICA  CD11a (LFA1) I-domain complexed with BMS-587101 aka 5-[(5S, 9R)-9-(4-cyanophenyl)-3-(3,5-dichlorophenyl)-1-methyl-2,4-dioxo-1,3,7-triazaspiro [4.4]non-7-yl]methyl]-3-thiophenecarboxylicacid
2IUE  Pactolus I-domain: functional switching of the rossmann fold
2JF1  Crystal structure analysis of tet repressor (class d) in complex with 7-chlortetracycline-nickel(ii)
2K1A  Bicelle-embedded integrin alpha(IIB) transmembrane segment
2K8O  Solution structure of integrin Alpha L
2KNC  Platelet integrin ALFAIIB-BETA3 transmembrane-cytoplasmic heterocomplex
2KS1  Heterodimeric association of Transmembrane domains of ErbB1 and ErbB2 receptors Enabling Kinase Activation
2KSR  NMR structures of TM domain of the n-Acetylcholine receptor b2 subunit
2KV9  Integrin beta3 subunit in a disulfide linked alphaIIb-beta3 cytosolic domain
2L91  Structure of the Integrin beta3 (A711P,K716A) Transmembrane Segment
2LJD  Monophosphorylated (747pY) beta3 integrin cytoplasmic tail under membrane mimetic conditions
2LJE  Biphosphorylated (747pY, 759pY) beta3 integrin cytoplasmic tail under membrane mimetic conditions
2LKE  Structures and Interaction Analyses of the Integrin Alpha-M Beta-2 Cytoplasmic Tails
2LKJ  Structures and Interaction Analyses of the Integrin Alpha-M Beta-2 Cytoplasmic Tails
2LM2  NMR structures of the transmembrane domains of the AChR b2 subunit
2LOQ  Backbone structure of human membrane protein FAM14B (Interferon alpha-inducible protein 27-like protein 1)
2LUV  Structure and binding interface of the cytosolic tails of axb2 integrin
2M32  Alpha-1 integrin I-domain in complex with GLOGEN triple helical peptide
2M3E  The integrin alpha l transmembrane domain in bicelles: structure and interaction with integrin beta 2
2MTP  The structure of Filamin repeat 21 bound to integrin
2MV6  Solution structure of the transmembrane domain and the juxta-membrane domain of the Erythropoietin Receptor in micelles
2N5S  Spatial structure of EGFR transmembrane and juxtamembrane domains in DPC micelles
2N9Y  Structure of the integrin alphaIIb-beta3(A711P) transmembrane complex
2P26  Structure of the PHE2 and PHE3 fragments of the integrin beta2 subunit
2P28  Structure of the PHE2 and PHE3 fragments of the integrin beta2 subunit
2RMZ  Bicelle-embedded integrin beta3 transmembrane segment
2VC2  Re-refinement of integrin alphaiibbeta3 headpiece bound to antagonist l-739758
3BN3  Crystal structure of ICAM-5 in complex with aL I domain
3EOA  Crystal structure the Fab fragment of Efalizumab in complex with LFA-1 I domain, Form I
3FCS  Structure of complete ectodomain of integrin aIIBb3
3FCU  Structure of headpiece of integrin aIIBb3 in open conformation
3G9W  Crystal Structure of Talin2 F2-F3 in complex with the integrin Beta1D Cytoplasmic Tail
3IJE  Crystal structure of the complete integrin alhaVbeta3 ectodomain plus an Alpha/beta transmembrane fragment
3K6S  Structure of integrin alphaXbeta2 ectodomain
3NID  The closed headpiece of integrin alphaiib beta3 and its complex with an alpahiib beta3 - specific antagonist that does not induce opening
3PGW  Crystal structure of human U1 snRNP
3Q3G  Crystal Structure of A-domain in complex with antibody
3QA3  Crystal Structure of A-domain in complex with antibody
3T3M  A novel high affinity integrin alphaiibbeta3 receptor antagonist that unexpectedly displaces mg2+ from the beta3 midas
3T9K  Crystal Structure of ACAP1 C-portion mutant S554D fused with integrin beta1 peptide
3UNB  Mouse constitutive 20S proteasome in complex with PR-957
3V4P  Crystal structure of a4b7 headpiece complexed with Fab ACT-1
3VI3  Crystal structure of alpha5beta1 integrin headpiece (ligand-free form)
3VI4  Crystal structure of alpha5beta1 integrin headpiece in complex with RGD peptide
4BJ3  Integrin alpha2 I domain E318W-collagen complex
4DX9  ICAP1 in complex with integrin beta 1 cytoplasmic tail
4G1E  Crystal structure of integrin alpha V beta 3 with coil-coiled tag.
4G1M  Re-refinement of alpha V beta 3 structure
4HKC  14-3-3-zeta in complex with S1011 phosphorylated integrin alpha-4 peptide
4M76  Integrin I domain of complement receptor 3 in complex with C3d
4NEH  An internal ligand-bound, metastable state of a leukocyte integrin, aXb2
4O02  AlphaVBeta3 integrin in complex with monoclonal antibody FAB fragment.
4R3O  Human Constitutive 20S Proteasome
4UM8  Crystal structure of alpha V beta 6
4UM9  Crystal structure of alpha V beta 6 with peptide
4WJK  Metal Ion and Ligand Binding of Integrin
4WK4  Metal Ion and Ligand Binding of Integrin
4Z7N  Integrin alphaIIbbeta3 in complex with AGDV peptide
4Z7Q  Integrin alphaIIbbeta3 in complex with AGDV-NH2 peptide
5A2F  Two membrane distal IgSF domains of CD166
5AUX  Crystal structure of DAPK1 in complex with kaempferol
5C16  Myotubularin-related protein 1
5E6R  Structures of leukocyte integrin aLb2: The aI domain, the headpiece, and the pocket for the internal ligand
5E6V  Re-refinement of the crystal structure of the Plexin-Semaphorin-Integrin Domain/Hybrid Domain/I-EGF1 segment from the human integrin b2 Subunit
5E6W  Re-refinement of the crystal structure of the Plexin-Semaphorin-Integrin Domain/Hybrid Domain/I-EGF1 segment from the human integrin b2 Subunit
5ES4  Re-refinement of integrin alphaxbeta2 ectodomain in the closed/bent conformation
5FFG  Crystal structure of integrin alpha V beta 6 head
5KXI  X-ray structure of the human Alpha4Beta2 nicotinic receptor
5L5K  Plexin A4 full extracellular region, domains 1 to 10, data to 7.5 angstrom, spacegroup P4(1)
5LV6  N-terminal motif dimerization of EGFR transmembrane domain in bicellar environment
5NEM  Localised reconstruction of alpha v beta 6 bound to Foot and Mouth Disease Virus O PanAsia - Pose A.
5NER  Localised reconstruction of alpha v beta 6 bound to Foot and Mouth Disease Virus O PanAsia - Pose A.
5THP  Localised reconstruction of alpha v beta 6 bound to Foot and Mouth Disease Virus O PanAsia - Pose A prime.
6AVQ  The therapeutic antibody lm609 selectively inhibits ligand binding to human alpha-v beta-3 integrin via steric hindrance

Table 18. PDB files used for PRIMO server modelling.

Integrin Identity and Resolution Coverage Balance 1JV2, 3FCS, 3IJE, 3K6S, 1BD3, 3FCS, 3FCS, 3V4P, 3NID, β1 4G1E, 4G1M, 4WJK, 3VI3, 4NEH, 4UM8, 4WJK 4NEH, 4WJK 4NEH, 5ES4 β2 4NEH 3K6S, 4NEH 3K6S, 4NEH, 5E6R, 5ES4 1JV2, 3IJE, 4G1E, 4G1M, 3FCS, 3IJE, 4G1M, α3 3IJE, 4G1M, 4UM8, 4WJK 4O02 4UM9, 4WJK, 4WK4 3IJE, 3K6S, 4OKU, 4NEH, 1IDN, 3K6S, 4OKU, αE 4OKU, 4NEH, 5E6R 5E6R, 5ES4 4NEH, 5E6R 1IDO, 1JLM, 1MF7, 1IDO, 1JLM, 1N9Z, 1NA5, αM 3K6S, 4NEH, 5ES4 1NA5, 3Q3G, 4NEH, 4NEH 5ES4

Table 19. PDB files used for MEDELLER server modelling.

Integrin Template

α3 4UM8

αE 5ES4

αM 3K6S

β1 4G1M

β2 3K6S


Table 20. PDB files used for testing MODELLER parameters.

Experiment Templates 3.4. 4G1M, 3FCS, 3IJE, 5BYP, 4NEH, 3G9W, 4WJK 3.4.2. and 3.4.3. 1D3B, 1TYE, 2BRQ, 2JF1, 3FCU, 3G9W, 3IJE, 3VI3, 4DX9, 4G1M, 4NEH, 4UM8, 4UM9, 4WJK, 4WK4, 4Z7N, 4Z7Q, 5BYP 3.4.4. 1IRU, 1JV2, 2VC2, 3FCS, 3K6S, 3NID, 3PGW, 3T3M, 3T9K, 3UNB, 3V4P, 4G1E, 4O02, 4R3O, 5E6R, 5ES4, 5L5K 3.4.5. 1BHQ, 1IDN, 1IDO, 1JLM, 1M1U, 1MF7, 1N9Z, 1NA5, 1YUK, 2JF1, 2P28, 3K6S, 3Q3G, 4M76, 4NEH, 5E6R, 5E6V, 5ES4 3.5.1. (A) 1JV2, 1TYE, 2VC2, 3FCS, 3FCU, 3IJE, 3K6S, 3V4P, 3VI3, 4G1E, 4G1M, 4NEH, 4O02, 4UM8, 4UM9, 4WJK, 4WK4, 4Z7N, 4Z7Q, 5E6R, 5ES4, 5FFG, 5NEM, 5NER, 6AVQ 3.5.1. (B) 2L91, 2RMZ, 2N9Y, 3G9W, 2MTP, 2H7D, 2LJE, 2LJD, 1S4X, 2KV9, 1KUP, 2KNC, 3V4P, 2IUE, 5E6W, 2P28, 1YUK, 5E6V, 1L3Y, 2P26, 5E6R, 4NEH, 5ES4, 3K6S, 2JF1, 3VI3, 4WJK, 1TYE, 2VC2, 3FCS, 1JV2, 4G1E, 3IJE, 4G1M, 3T3M, 4Z7N, 3NID, 4UM8, 5NEM, 4UM9, 5FFG, 1M8O 3.5.2. (A) (open) 1JV2, 2VC2,3FCU, 3V4P, 3VI4, 4NEH, 4O02, 4UM9, 4WK4, 4Z7N, 4Z7Q, 6AVQ 3.5.2. (B) (open) 2L91, 2RMZ, 2N9Y, 3G9W, 2MTP, 2H7D, 2LJE, 2LJD, 1S4X, 2KV9, 1KUP, 2KNC, 3V4P, 2IUE, 2P28, 1YUK, 1L3Y, 2P26, 2P28A, 2JF1, 3VI4, 2VC2, 1JV2, 3T3M, 4Z7N, 3NID, 1M8O 3.5.2. (A) (closed) 1TYE, 3FCS, 3IJE, 3K6S, 3VI3, 4G1E, 4G1M, 4NEH,4UM8, 4WJK, 5E6R, 5ES4, 5FFG 3.5.2. (B) (closed) 2L91, 2RMZ, 2N9Y, 1S4X, 2KV9, 1KUP, 2KNC, 5E6W, 2P28, 1YUK, 5E6V, 1L3Y, 2P26, 5E6R, 5ES4, 3K6S, 4WJK, 1TYE, 3FCS, 1JV2, 4G1E, 3IJE, 4G1M, 4UM8, 5NEM, 5FFG, 1M8O 3.5.3. (A) (closed) 1TYE, 3FCS, 3IJE, 3K6S, 3VI3, 4G1E, 4G1M, 4NEH,4UM8, 4WJK, 5E6R, 5ES4, 5FFG 3.5.3. (B) (closed) 2L91, 2RMZ, 2N9Y, 1S4X, 2KV9, 1KUP, 2KNC, 5E6W, 2P28, 1YUK, 5E6V, 1L3Y, 2P26, 5E6R, 5ES4, 3K6S, 4WJK, 1TYE, 3FCS, 1JV2, 4G1E, 3IJE, 4G1M, 4UM8, 5NEM, 5FFG, 1M8O 3.5.4. (A) (closed) 1TYE, 3FCS, 3IJE, 3K6S, 3VI3, 4G1E, 4G1M, 4NEH,4UM8, 4WJK, 5E6R, 5ES4, 5FFG 3.5.4. (B) (closed) 2L91, 2RMZ, 2N9Y, 1S4X, 2KV9, 1KUP, 2KNC, 5E6W, 2P28, 1YUK, 5E6V, 1L3Y, 2P26, 5E6R, 5ES4, 3K6S, 4WJK, 1TYE, 3FCS, 1JV2, 4G1E, 3IJE, 4G1M, 4UM8, 5NEM, 5FFG, 1M8O 3.5.5. (A) (closed) 3IJE, 4G1E, 4G1M (E) 3.5.5. 
(B) (closed) 5E6W, 2P28, 1YUK, 5E6V, 1L3Y, 2P26, 5E6R, 5ES4, 3K6S, 4WJK, 1TYE, 3FCS, 1JV2, 4G1E, 3IJE, (E) 4G1M, 4UM8, 5NEM, 5FFG 3.5.5. (A) (closed) 2M3E, 2KNC, 2RMZ, 2L91, 2N9Y, 2KNCA, 2K1A, 2N9YA AND 2L91, 2N9Y, 2RMZ, 2KNC (T) 3.5.5. (B) (closed) 2M3E, 2KNC, 2RMZ, 2L91, 2N9Y, 2KNCA, 2K1A, 2N9YA AND 2L91, 2N9Y, 2RMZ, 2KNC (T) 3.5.5. (A) (closed) 2LUV (C) 3.5.5. (B) (closed) 3T9K, 3G9W, 4DX9 (C)

Table 21. PDB files used for the final cytoplasmic and transmembrane segment modelling.

Integrin  Cytoplasmic                                       Transmembrane
α3        2LUV                                              2M3E, 2KNC, 2RMZ, 2L91, 2N9Y, 2KNCA, 2K1A
α4        2LUV, 4HKC                                        5KXI, 2LM2, 2KSR
αD        2LUV, 4HKC, 2K8O, 2LKJ, 2LKE, 2M3E                2KS1, 5LV6, 2N5S
αE        2LUV                                              2MV6
αL        2M3E, 2K8O                                        2M3E
αM        4HKC, 2LKE, 2LKJ, 2LUV                            2M3E, 5A2F
αX        2LUV                                              2M3E
β1        3T9K, 3G9W, 4DX9                                  2L91, 2N9Y, 2RMZ, 2KNC
β2        2JF1, 3G9W, 2KV9, 2KNC, 1M8O, 1S4X                4UM8, 2LOQ
β7        2JF1, 3G9W, 2KV9, 2KNC, 1M8O, 1S4X, 2BRQ, 2MTP    2LOQ


Table 22. PDB files used for the final extracellular segment modelling.

Conformati Integri all less least on n Closed α3 1TYE, 3FCS, 3IJE, 3K6S, 3VI3, 3FCS, 3IJE, 3VI3, 4G1E, 3IJE, 4G1E, 4G1M 4G1E, 4G1M, 4UM8, 4WJK, 4G1M, 4UM8, 5FFG 5E6R, 5ES4, 5FFG α4 1JV2, 2VC2, 3FCU, 3V4P, 3V4P, 3VI4, 4WK4, 4O02, 1JV2, 3V4P, 6AVQ, 4O02 3VI4, 4UM9,4WK4, 4Z7Q, 1JV2, 6AVQ, 4UM9 6AVQ αD 1IDN, 1N9Z, 1M1U, 1NA5, 1N9Z, 1M1U, 1MF7, 5ES4, 5ES4, 3K6S 1MF7, 5ES4, 3K6S, 1N3Y 3K6S αE 5E6R, 5ES4, 3K6S, 1IDN, 5E6R, 5ES4, 3K6S, 1IDN, 5E6R, 5ES4, 3K6S 4WJK, 1TYE, 1N9Z, 1M1U, 4WJK, 1TYE 1NA5, 1MF7, 1N3Y αL 5E6R, 3K6S, 4WJK, 1TYE, 5E6R, 3K6S, 4WJK, 1TYE, 3IJE, 4G1M, 4G1E, 3FCS, 5ES4, 4UM8, 5ES4, 5FFG, 3FCS, 4UM8, 5ES4, 3FCS, 3VI3, 3K6S, 5E6R 3VI3, 1MJN, 4G1E, 4G1M, 4G1E, 4G1M, 3IJE 3IJE, 1MQ9 αM 1IDN, 1N9Z, 1M1U, 1NA5, 1IDN, 1N9Z, 1M1U, 1NA5, 1IDN, 1N9Z, 1M1U, 1NA5, 1MF7, 5ES4, 3K6S, 5E6R , 1MF7, 5ES4, 3K6S, 5E6R, 1MF7, 5ES4, 3K6S 1MQ9, 4WJK, 3VI3, 5FFG, 3FCS, 4G1E 4UM8, 3FCS, 1TYE, 3IJE, 4G1E, 4G1M αX 1N3Y, 5ES4, 3K6S, 1MF7, 1N3Y, 5ES4, 3K6S, 5E6R 1N3Y, 5ES4, 3K6S 1M1U, 1IDN, 5E6R β1 5E6W, 2P28, 1YUK, 5E6V, 5E6W, 5E6V, 5E6R, 5ES4, 4WJK, 3VI3, 3K6S, 5ES4, 3IJE, 1L3Y, 2P26, 5E6R, 5ES4, 3K6S, 4WJK, 1TYE, 3FCS, 4G1M, 4G1E, 3FCS, 4UM8 3K6S, 1YUK, 2P28, 4WJK, 3VI3, 4G1E, 3IJE, 4G1M, 1TYE, 3FCS, 3VI3, 4G1E, 3IJE, 3NID, 4UM8, 5FFG 4G1M, 3NID, 4UM8,5FFG β2 5E6W, 2P28, 1YUK, 5E6V, 5E6W, 2P28, 1YUK, 5E6V, 5ES4, 3K6S 1L3Y, 2P26, 5E6R, 5ES4, 1L3Y, 2P26, 5E6R, 5ES4, 3K6S, 1YUK, 2P28, 4WJK, 3K6S, 1YUK, 2P28 1TYE, 3FCS, 3VI3, 4G1E, 3IJE, 4G1M, 3NID, 4UM8,5FFG β7 5E6W, 2P28, 1YUK, 5E6V, 5E6W, 5E6V, 5E6R, 5ES4, 3V4P, 5ES4, 3K6S, 3FCS, 4G1E, 1L3Y, 2P26, 5E6R, 5ES4, 3K6S, 4WJK, 1TYE, 3FCS, 3IJE, 4G1M, 4UM8 3K6S, 1YUK, 2P28, 4WJK, 3VI3, 4G1E, 3IJE, 4G1M, 1TYE, 3FCS, 3VI3, 4G1E, 3IJE, 4UM8,5FFG 4G1M, 3NID, 4UM8,5FFG, 3V4P Open α3 4G1E, 2VC2, 3FCU, 3V4P, 3V4P, 3VI4, 4O02, 6AVQ, 3V4P, 6AVQ, 4O02 3VI4, 4O02, 4UM9, 4WK4, 4UM9 4Z7N, 4Z7Q, 6AVQ α4 1JV2, 2VC2, 3FCU, 3V4P, 3V4P, 3VI4, 4WK4, 4O02, 1JV2, 3V4P, 6AVQ, 4O02 3VI4, 4UM9,4WK4, 4Z7Q, 1JV2, 6AVQ, 4UM9 6AVQ αD 3V4P, 3FCU, 4Z7N, 4Z7Q, 3K6S, 
5ES4, 3V4P, 3VI4, 3Q3G, 4M76, 3QA3 2VC2, 3VI4, 4WK4, 4O02, 4WK4, 4O02, 1JV2, 6AVQ, 1JV2,6AVQ, 4UM9, 2M32, 4UM9, 1MQ9, 2ICA, 1LFA, 4BJ3, 1MQ9, 3BN3, 2ICA, 1RD4, 1CQP, 3Q3G, 4M76, 1LFA, 1RD4, 1CQP, 3Q3G, 3QA3 4M76, 3QA3


αE 5E6R, 3V4P, 4O02, 1JV2, 3Q3G, 1NA5, 3QA3, 4M76, 3Q3G, 1NA5, 3QA3, 4M76, 6AVQ,3VI4, 4WK4, 4Z7N, 3V4P, 4O02, 1JV2, 6AVQ, 3V4P, 4O02, 1JV2, 6AVQ, 3FCU, 4Z7Q, 2VC2 4UM9, 4WK4 4UM9, 4WK4, 3VI4, 4Z7N, 3FCU, 4Z7Q, 2VC2

αL 3HI6, 1MQ8, 1T0P, 3BN3, 5E6R, 6AVQ, 4O02, 3VI4, 3HI6, 1MQ8, 1T0P, 3BN3, 1XDD, 1CQP, 1RD4, 3EOA, 1JV2, 3FCU 1XDD, 1CQP, 1RD4, 3EOA, 2ICA, 3Q3G, 4M76, 3QA3, 2ICA 3V4P, 3VI4, 4O02, 6AVQ, 1JV2

αM 3K6S, 5ES4, 3V4P, 4Z7N, 4M76, 3QA3, 3Q3G, 3V4P, 4M76, 3QA3, 3Q3G, 3V4P, 3FCU, 4Z7Q, 2VC2, 3VI4, 4O02, 1JV2, 6AVQ, 1T0P, 4O02, 1JV2, 6AVQ 4WK4, 4O02, 1JV2, 6AVQ, 1MQ8, 3BN3, 1XDD, 1RD4, 4UM9, 2M32, 4M76, 3QA3, 2ICA, 3VI4, 4WK4, 1JV2, 3Q3G, 1T0P, 1MQ8, 3BN3, 4UM9 1XDD, 1RD4, 2ICA αX 3K6S, 5ES4, 3V4P, 3FCU, 3VI4, 4UM9, 6AVQ, 1JV2, 1JV2, 4O02, 3V4P, 1MQ8, 4Z7N, 4Z7Q, 2VC2, 3VI4, 4O02, 3V4P, 1MQ8, 3BN3, 1RD4, 3EOA, 2ICA, 1XDD, 4UM9, 6AVQ, 1JV2, 4O02, 1RD4, 3EOA, 2ICA, 1XDD, 1CQP, 3Q3G, 4M76, 3QA3 2M32, 4BJ3, 5THP, 1V7P, 1CQP, 3Q3G, 4M76, 3QA3 1MQ8, 3BN3, 1RD4, 3EOA, 2ICA, 1XDD, 1CQP, 3Q3G, 4M76, 3QA3 β1 1YUK, 2P26, 2P28, 1L3Y, 5E6W, 5E6V, 4UM9, 2VC2, 4UM9, 4UM9, 1JV2, 3V4P, 5E6W, 2P28, 1YUK, 5E6V, 1JV2, 3T3M, 4Z7N, 3V4P, 3VI4 4UM9, 4UM9, 2VC2, 1JV2, 3VI4 3T3M, 4Z7N, 3V4P, 3VI4 β2 1YUK, 2P26, 2P28, 1L3Y, 2P26, 5E6W, 5E6V, 4UM9, 5E6W, 1JV2, 3V4P 5E6W, 2P28, 1YUK, 5E6V, 2VC2, 1JV2, 3T3M, 4Z7N, 4UM9, 4UM9, 2VC2, 1JV2, 3V4P, 3VI4 3T3M, 4Z7N, 3V4P, 3VI4 β7 1YUK, 2P26, 2P28, 1L3Y, 4UM9, 1JV2, 3T3M, 4Z7N, 1JV2, 3V4P, 3VI4 5E6W, 2P28, 1YUK, 5E6V, 3V4P, 3VI4 4UM9, 4UM9, 2VC2, 1JV2, 3T3M, 4Z7N, 3V4P, 3VI4

Section B: Alignment file setup

1.1
>P1; template 1
structure(X/N/E): PDB ID: start: chain ID: end: ::::
S R K L R M K W M D F V P K A N K Q C Y *

1.2
>P1; template 2
structure(X/N/E): PDB ID: start: chain ID: end: ::::
Y L E T S V K K M M G Q E A F G T T K M *

1.3
>P1; template 3
structure(X/N/E): PDB ID: start: chain ID: end: ::::
S P V E L M Y A R Y S I D D F P K Q S Y *

1.4
>P1; target sequence
sequence: target sequence: start: chain ID(X): end: ::::
M N L L R G P R Q Q F V S D F Q S Q S Y *

In contrast, the alignment file for modelling heterodimeric proteins was altered as follows:

2.1
>P1; template 1
structure(X/N/E): PDB ID: start: chain ID: end: ::::
A D A R K H D E A F K R D C D R D H A A / L D K Q M V T H M A L K K S T L N M Q V *

2.2
>P1; template 2
structure(X/N/E): PDB ID: start: chain ID: end: ::::
------ / L W W A P Q E S T A P M N L A R A L Y S *

2.3
>P1; template 3
structure(X/N/E): PDB ID: start: chain ID: end: ::::
M N L L R G P R Q Q F V S D F Q S Q S Y / ------ *

2.4
>P1; target sequence
sequence: target sequence: start: chain ID(A): end: start: chain ID(B): end::
K I P C Q W C E T P Q D G V G G V S I P / A R G P L M N Q W T S Y T A L G R V E N *

If the template applied to both chains, scenario 2.1 was used; if the template applied only to the second chain, scenario 2.2 was used; and if it applied only to the first chain, scenario 2.3 was used. The target sequence was set up to reflect both chain “A” and chain “B”, and its header was altered accordingly to include the start and end amino acid position numbers of chain B. Chains were separated by the “/” delimiter, and a “*” at the end of the final sequence indicated sequence termination. The gap character “-” indicated gap regions in the template or target sequence. If a template was used only for chain A, the chain B portion of that template sequence contained only gap characters, which indicated the lack of template information, as seen in example 2.3. In contrast, when modelling single monomeric subunits, the alignment file needed only the termination signal at the end of each sequence, with gap characters used where required.
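The chain-break, gap and termination conventions described above can be sketched in a few lines of Python. This is a minimal illustration rather than part of the modelling workflow: the helper names `chain_block` and `pir_sequence` are hypothetical, the sequences are the illustrative 20-residue strings from examples 2.1 to 2.3, and a template that does not cover a chain is assumed to be padded with gap characters to the full aligned length of that chain.

```python
from typing import Optional, Sequence

GAP = "-"          # gap character: no template information at this position
CHAIN_BREAK = "/"  # separates chain A from chain B within one entry
TERMINATOR = "*"   # marks the end of the final sequence of an entry


def chain_block(sequence: Optional[str], length: int) -> str:
    """Return the aligned chain, or a run of gaps if the template lacks it."""
    if sequence is None:
        return GAP * length
    if len(sequence) != length:
        raise ValueError("aligned chain must span the full alignment length")
    return sequence


def pir_sequence(chains: Sequence[Optional[str]], lengths: Sequence[int]) -> str:
    """Join the chains with the chain-break character and add the terminator."""
    blocks = [chain_block(seq, n) for seq, n in zip(chains, lengths)]
    return CHAIN_BREAK.join(blocks) + TERMINATOR


# Aligned length of chain A and chain B in the illustrative examples.
lengths = (20, 20)

# Scenario 2.1: the template covers both chains.
both = pir_sequence(("ADARKHDEAFKRDCDRDHAA", "LDKQMVTHMALKKSTLNMQV"), lengths)
# Scenario 2.2: the template covers chain B only, so chain A is all gaps.
chain_b_only = pir_sequence((None, "LWWAPQESTAPMNLARALYS"), lengths)
# Scenario 2.3: the template covers chain A only, so chain B is all gaps.
chain_a_only = pir_sequence(("MNLLRGPRQQFVSDFQSQSY", None), lengths)

for line in (both, chain_b_only, chain_a_only):
    print(line)
```

Keeping the three special characters as named constants makes it harder to confuse the chain break with the terminator when alignment files for many templates are assembled by hand or by script.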

Common to both setups was the header information, which contained the following, using 1.1 as an example:

1.1
>P1; template 1
structure(X/N/E): PDB ID: start: chain ID: end: ::::

The “>” character was the delimiter that indicated the start of a new input sequence and was followed by “P1;” and the name of the protein to which the sequence belonged. In this example, “template 1” stood for the PDB name of the associated protein sequence. It was possible to construct an alignment file such that multiple proteins could be generated in a single run; however, the associated Python job script would need to be altered to reflect this. The first field of the second line, structure(X/N/E), indicated the method by which the PDB coordinates of the amino acid sequence were determined. PDB files containing coordinates that originated from X-ray crystallography (X) and electron microscopy (E) data were displayed in similar formats, while coordinate information determined through nuclear magnetic resonance (N) methods was divided into models. “Structure” could also be used as valid input for this field. The second field contained the code of the PDB file present in the working directory, from which structural information about the sequence was extracted. It was beneficial to state which chain of the PDB file was to be used, such as 4wjk_A for chain A or 4wjk_B for chain B. Fields three to six contained the start and end amino acid positions for chains A and B, as shown previously in examples 1.1 – 1.4 and 2.1 – 2.4. Fields seven to ten were optional and could be used to contain the name of the protein from which the structural information was derived (field seven), the source of the protein (field eight), the resolution of the solved structure (field nine) and the associated R-factor of the structure (field ten).
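The ten-field header can be checked programmatically before a modelling run. The sketch below is a minimal illustration, assuming the field order documented for MODELLER alignment files (type, code, first residue, first chain, last residue, last chain, then the four optional fields). The function name `parse_pir_header`, the field labels and the example header line, including the residue numbers 1 and 440, are hypothetical and chosen only for illustration.

```python
from typing import Dict

# Descriptive labels for the ten colon-separated header fields (assumed order).
FIELD_NAMES = (
    "type",           # e.g. structureX, structureN or sequence
    "code",           # PDB code of the file in the working directory
    "start_residue",
    "start_chain",
    "end_residue",
    "end_chain",
    "protein_name",   # optional
    "source",         # optional
    "resolution",     # optional
    "r_factor",       # optional
)


def parse_pir_header(line: str) -> Dict[str, str]:
    """Split a header line into its ten fields; empty fields are kept as ''."""
    parts = [part.strip() for part in line.split(":")]
    if len(parts) != len(FIELD_NAMES):
        raise ValueError(
            f"expected {len(FIELD_NAMES)} fields, found {len(parts)}"
        )
    return dict(zip(FIELD_NAMES, parts))


# Hypothetical header for chain A of the 4wjk template.
fields = parse_pir_header("structureX:4wjk:1:A:440:A::::")
for name, value in fields.items():
    print(f"{name}: {value!r}")
```

A check of this kind catches a miscounted colon, which otherwise shifts every later field by one position.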

Section C: Python script example

    from modeller import *
    from modeller.automodel import *

    log.verbose()
    env = environ()
    # Directories searched for the template PDB files
    env.io.atom_files_directory = ['.', '../atom_files']

    class MyModel(automodel):
        def special_restraints(self, aln):
            rsr = self.restraints
            at = self.atoms
            # Restrain residues 35 to 40 to an alpha helix and 45 to 47 to a beta strand
            rsr.add(secondary_structure.alpha(self.residue_range('35:', '40:')))
            rsr.add(secondary_structure.strand(self.residue_range('45:', '47:')))

        def special_patches(self, aln):
            # Disulphide bridge between residues 27 and 45
            self.patch(residue_type='DISU',
                       residues=(self.residues['27'], self.residues['45']))

    # loopmodel is instantiated directly here, so the restraints and patches
    # defined in MyModel are not applied to this run.
    a = loopmodel(env, alnfile='β1.ali',
                  knowns=('5e6w_A', '2p28_B', '1yuk_B', '5e6v_A', '1l3y_A',
                          '2p26_A', '5e6r_B', '5es4_B', '3k6s_B', '1yuk_A',
                          '2p28_A', '4wjk_B', '1tye_B', '3fcs_B', '3vi3_B',
                          '4g1e_B', '3ije_B', '4g1m_B', '3nid_B', '4um8_B',
                          '5ffg_B'),
                  sequence='β1')
    # Build models 1 to 100 with the slowest (most thorough) refinement
    a.starting_model = 1
    a.ending_model = 100
    a.md_level = refine.very_slow
    # Build one loop-refined model per initial model
    a.loop.starting_model = 1
    a.loop.ending_model = 1
    a.loop.md_level = refine.very_slow
    a.make()

Section D: ClusPro and HADDOCK2.2

Table 23. Problematic residues within the open and closed conformation of integrin proteins. The residues listed in each chain are problematic in that they transit through residues of the other chain. In particular, the α3β1 open conformation is problematic in that both chains display a region entangled with the opposite chain.

Integrin | Open conformation: residue chain A | Open conformation: residue chain B | Closed conformation: residue chain A | Closed conformation: residue chain B
α3β1 | 596 to 726 | 469 to 561 | 862 | 669
α4β1 | 487, 638, 678 | 480, 489, 493 | - | -
α4β7 | 492, 493, 494 | 540, 541, 542 | 700 | 590
αDβ2 | 242, 243, 272, 276, 280, 283, 663 | 165, 168, 170, 178, 204, 205, 467 | - | -
αLβ2 | 607, 623, 688, 689, 743, 748 | 454, 458, 468, 485, 493, 507 | 992, 995, 999, 1050, 1052 | 590, 620, 623, 665, 666
αMβ2 | 273, 667, 709, 710 | 165, 468, 471, 473 | - | -


Table 24. Predicted ligand associating residues for various integrin subunits. The integrin subunit (left) is displayed in terms of each ligand set (right) from one to five. These residue numbers in each set were predicted to associate with a ligand or cofactor.

Residue numbers

Integrin | Set 1 | Set 2 | Set 3 | Set 4 | Set 5
α3 | 160, 186, 187, 210, 211, 212, 214 | 283, 285, 287, 289, 291 | 346, 348, 350, 352, 354 | 407, 409, 411, 413, 415, 430 | 583, 586, 587, 588, 589, 636
α4 | 343, 344, 345, 347, 349, 351 | 158, 186, 188, 210, 213 | 280, 282, 284, 286, 288 | 405, 407, 409, 411, 413, 429 | 817, 869, 870, 871
αD | 343, 344, 345, 347, 349, 351 | 158, 186, 188, 210, 213 | 280, 282, 284, 286, 288 | 405, 407, 409, 411, 413, 429 | 817, 869, 870, 871
αE | 190, 192, 194, 261, 294 | 504, 505, 506, 508, 510, 511, 512 | 568, 569, 570, 571, 572, 574, 575, 576 | 636, 638, 640, 642, 644 | 855, 922, 924
αL | 505, 506, 507, 508, 509, 511, 512, 513 | 565, 566, 567, 568, 569, 571, 573 | 443, 444, 445, 447, 449, 450, 451 | 137, 139, 141, 239 | 132, 134, 153, 166, 233, 235, 255, 257, 258, 259, 284, 285, 286, 287, 298, 301, 302
αM | 140, 142, 144, 209, 242 | 449, 450, 451, 453, 455, 456, 457 | 513, 514, 515, 516, 517, 519, 520, 521 | 576, 577, 578, 579, 580, 582, 584 | 754, 755, 757, 758, 759, 795
αX | 138, 140, 142, 207, 240 | 447, 448, 449, 451, 453, 454, 455 | 511, 512, 513, 514, 515, 517, 518, 519 | 574, 575, 576, 577, 578, 580, 582 | 657, 659, 716, 741
β1 | 132, 133, 134, 223, 224, 225, 226, 227, 229 | 134, 137, 138, 259, 342 | 130, 132, 134, 229, 259 | 169, 224, 226, 227, 228, 229, 263 | 376, 378, 388, 407, 408, 409
β2 | 116, 117, 119, 120, 242, 325 | 112, 114, 116, 212, 242 | 151, 207, 209, 210, 211, 212, 246 | 114, 115, 116, 119, 206, 207, 208, 209, 210, 212 | 361, 371, 372, 390, 392
β7 | 144, 145, 147, 148, 270, 354 | 142, 143, 144, 147, 234, 235, 236, 237, 238, 240 | 140, 142, 144, 240, 270 | 179, 235, 237, 238, 239, 240, 274 | 187, 234, 235, 236, 237, 240

Section E: Verify-3D example with ligand

Figure 42. The α4β1 integrin with associated LDV ligand. Chain A (α4), B (β1) and C (ligand).


8. References

Abitorabi, M. A., Pachynski, R. K., Ferrando, R. E., Tidswell, M., and Erle, D. J. (1997) Presentation of Integrins on Leukocyte Microvilli: A Role for the Extracellular Domain in Determining Membrane Localization, The Journal of cell biology 139, 563-571.

Abraham, M. J., Murtola, T., Schulz, R., Páll, S., Smith, J. C., Hess, B., and Lindahl, E. (2015) GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers, SoftwareX 1-2, 19-25.

Albelda, S. M., and Buck, C. A. (1990) Integrins and other cell adhesion molecules, FASEB journal : official publication of the Federation of American Societies for Experimental Biology 4, 2868-2880.

Ali, R. H., and Khan, A. A. (2014) Tracing the evolution of FERM domain of Kindlins, Molecular phylogenetics and evolution 80, 193-204.

Alonso-Garcia, N., Garcia-Rubio, I., Manso, J. A., Buey, R. M., Urien, H., Sonnenberg, A., Jeschke, G., and de Pereda, J. M. (2015) Combination of X-ray crystallography, SAXS and DEER to obtain the structure of the FnIII-3,4 domains of integrin alpha6beta4, Acta crystallographica. Section D, Biological crystallography 71, 969-985.

Alonso-Garcia, N., Ingles-Prieto, A., Sonnenberg, A., and de Pereda, J. M. (2009) Structure of the Calx-beta domain of the integrin beta4 subunit: insights into function and cation-independent stability, Acta crystallographica. Section D, Biological crystallography 65, 858-871.

Aloy, P., Ceulemans, H., Stark, A., and Russell, R. B. (2003) The relationship between sequence and interaction divergence in proteins, Journal of molecular biology 332, 989-998.

Altschul, S. F., Boguski, M. S., Gish, W., and Wootton, J. C. (1994) Issues in searching molecular sequence databases, Nature genetics 6, 119-129.

Andersen, O. S., and Koeppe, R. E., 2nd. (2007) Bilayer thickness and membrane protein function: an energetic perspective, Annual review of biophysics and biomolecular structure 36, 107-130.

Anderson, L. R., Owens, T. W., and Naylor, M. J. (2013) Structural and mechanical functions of integrins, Biophysical reviews 6, 203-213.

Anthis, N. J., Wegener, K. L., Ye, F., Kim, C., Goult, B. T., Lowe, E. D., Vakonakis, I., Bate, N., Critchley, D. R., Ginsberg, M. H., and Campbell, I. D. (2009) The structure of an integrin/talin complex reveals the basis of inside-out signal transduction, The EMBO journal 28, 3623-3632.

Arnaout, M. A. (1990) Structure and function of the leukocyte adhesion molecules CD11/CD18, Blood 75, 1037-1050.


Arnaout, M. A. (2016) Biology and structure of leukocyte β2 integrins and their role in inflammation, F1000Research 5, F1000 Faculty Rev-2433.

Arnaout, M. A., Goodman, S., and Xiong, J.-P. (2007) Structure and mechanics of integrin-based cell adhesion, Current opinion in cell biology 19, 495-507.

Arnaout, M. A., Lanier, L. L., and Faller, D. V. (1988) Relative contribution of the leukocyte molecules Mo1, LFA-1, and p150,95 (LeuM5) in adhesion of granulocytes and monocytes to vascular endothelium is tissue- and stimulus-specific, Journal of cellular physiology 137, 305-309.

Arthur, D., and Vassilvitskii, S. (2007) k-means++: the advantages of careful seeding, In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pp 1027-1035, Society for Industrial and Applied Mathematics, New Orleans, Louisiana.

Aumailley, M., Gerl, M., Sonnenberg, A., Deutzmann, R., and Timpl, R. (1990) Identification of the Arg-Gly-Asp sequence in laminin A chain as a latent cell-binding site being exposed in fragment P1, FEBS letters 262, 82-86.

Bai, M., Pang, X., Lou, J., Zhou, Q., Zhang, K., Ma, J., Li, J., Sun, F., and Hsu, V. W. (2012) Mechanistic insights into regulated cargo binding by ACAP1 protein, J Biol Chem 287, 28675-28685.

Bajic, G., Yatime, L., Sim, R. B., Vorup-Jensen, T., and Andersen, G. R. (2013) Structural insight on the recognition of surface-bound opsonins by the integrin I domain of complement receptor 3, Proceedings of the National Academy of Sciences of the United States of America 110, 16426-16431.

Baker, E. L., and Zaman, M. H. (2010) The biomechanical integrin, Journal of biomechanics 43, 38-44.

Baker, E. N., and Hubbard, R. E. (1984) Hydrogen bonding in globular proteins, Progress in biophysics and molecular biology 44, 97-179.

Baldwin, E. T., Sarver, R. W., Bryant, G. L., Jr., Curry, K. A., Fairbanks, M. B., Finzel, B. C., Garlick, R. L., Heinrikson, R. L., Horton, N. C., Kelley, L. L., Mildner, A. M., Moon, J. B., Mott, J. E., Mutchler, V. T., Tomich, C. S., Watenpaugh, K. D., and Wiley, V. H. (1998) Cation binding to the integrin CD11b I domain and activation model assessment, Structure (London, England : 1993) 6, 923-935.

Barczyk, M., Carracedo, S., and Gullberg, D. (2010) Integrins, Cell and tissue research 339, 269-280.

Baxter, J. (1981) Local Optima Avoidance in Depot Location, The Journal of the Operational Research Society 32, 815-819.


Bechard, D., Scherpereel, A., Hammad, H., Gentina, T., Tsicopoulos, A., Aumercier, M., Pestel, J., Dessaint, J. P., Tonnel, A. B., and Lassalle, P. (2001) Human endothelial-cell specific molecule-1 binds directly to the integrin CD11a/CD18 (LFA-1) and blocks binding to intercellular adhesion molecule-1, Journal of immunology (Baltimore, Md. : 1950) 167, 3099-3106.

Beglova, N., Blacklow, S. C., Takagi, J., and Springer, T. A. (2002) Cysteine-rich module structure reveals a fulcrum for integrin rearrangement upon activation, Nature structural biology 9, 282-287.

Benaud, C., Dickson, R. B., and Lin, C. Y. (2001) Regulation of the activity of matriptase on epithelial cell surfaces by a blood-derived factor, European journal of biochemistry 268, 1439-1447.

Berjanskii, M. V., and Wishart, D. S. (2005) A simple method to predict protein flexibility using secondary chemical shifts, Journal of the American Chemical Society 127, 14970-14971.

Berjanskii, M. V., Neal, S., and Wishart, D. S. (2006) PREDITOR: a web server for predicting protein torsion angle restraints, Nucleic acids research 34, W63-69.

Berjanskii, M., Liang, Y., Zhou, J., Tang, P., Stothard, P., Zhou, Y., Cruz, J., MacDonell, C., Lin, G., Lu, P., and Wishart, D. S. (2010) PROSESS: a protein structure evaluation suite and server, Nucleic acids research 38, W633-640.


Berjanskii, M., Tang, P., Liang, J., Cruz, J. A., Zhou, J., Zhou, Y., Bassett, E., MacDonell, C., Lu, P., Lin, G., and Wishart, D. S. (2009) GeNMR: a web server for rapid NMR-based protein structure determination, Nucleic acids research 37, W670-W677.


Berlin, C., Bargatze, R. F., Campbell, J. J., von Andrian, U. H., Szabo, M. C., Hasslen, S. R., Nelson, R. D., Berg, E. L., Erlandsen, S. L., and Butcher, E. C. (1995) α4 integrins mediate lymphocyte attachment and rolling under physiologic flow, Cell 80, 413-422.

Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N., and Bourne, P. E. (2000) The Protein Data Bank, Nucleic acids research 28, 235-242.


Bhunia, A., Tang, X. Y., Mohanram, H., Tan, S. M., and Bhattacharjya, S. (2009) NMR solution conformations and interactions of integrin alphaLbeta2 cytoplasmic tails, J Biol Chem 284, 3873-3884.

Bilsland, C. A., Diamond, M. S., and Springer, T. A. (1994) The leukocyte integrin p150,95 (CD11c/CD18) as a receptor for iC3b. Activation by a heterologous beta subunit and localization of a ligand recognition site to the I domain, Journal of immunology (Baltimore, Md. : 1950) 152, 4582-4589.

Blackford, J., Reid, H. W., Pappin, D. J. C., Bowers, F. S., and Wilkinson, J. M. (1996) A monoclonal antibody, 3/22, to rabbit CD11c which induces homotypic T cell aggregation: evidence that ICAM-1 is a ligand for CD11c/CD18, 26, 525-531.

Blaszczyk, M., Jamroz, M., Kmiecik, S., and Kolinski, A. (2013) CABS-fold: Server for the de novo and consensus-based prediction of protein structure, Nucleic acids research 41, W406-W411.

Borst, A. J., James, Z. M., Zagotta, W. N., Ginsberg, M., Rey, F. A., DiMaio, F., Backovic, M., and Veesler, D. (2017) The Therapeutic Antibody LM609 Selectively Inhibits Ligand Binding to Human alphaVbeta3 Integrin via Steric Hindrance, Structure (London, England : 1993) 25, 1732-1739.e1735.

Brenner, S. E., Chothia, C., and Hubbard, T. J. (1998) Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships, Proceedings of the National Academy of Sciences of the United States of America 95, 6073-6078.


Brooks, B. R., Bruccoleri, R. E., Olafson, B. D., States, D. J., Swaminathan, S., and Karplus, M. (1983) CHARMM: A program for macromolecular energy, minimization, and dynamics calculations, Journal of computational chemistry 4, 187-217.

Brown, K. L., Banerjee, S., Feigley, A., Abe, H., Blackwell, T. S., Pozzi, A., Hudson, B. G., and Zent, R. (2018) Salt-bridge modulates differential calcium-mediated ligand binding to integrin alpha1- and alpha2-I domains, Scientific reports 8, 2916.

Brylinski, M., and Skolnick, J. (2008) A threading-based method (FINDSITE) for ligand-binding site prediction and functional annotation, Proceedings of the National Academy of Sciences of the United States of America 105, 129-134.

Buck, C. A., and Horwitz, A. F. (1987) Cell surface receptors for extracellular matrix molecules, Annual review of cell biology 3, 179-205.


Bystroff, C., and Baker, D. (1998) Prediction of local structure in proteins using a library of sequence-structure motifs, Journal of molecular biology 281, 565-577.

Calderwood, D. A., Fujioka, Y., de Pereda, J. M., Garcia-Alvarez, B., Nakamoto, T., Margolis, B., McGlade, C. J., Liddington, R. C., and Ginsberg, M. H. (2003) Integrin beta cytoplasmic domain interactions with phosphotyrosine-binding domains: a structural prototype for diversity in integrin signaling, Proceedings of the National Academy of Sciences of the United States of America 100, 2272-2277.

Campanacci, V., Lartigue, A., Hällberg, B. M., Jones, T. A., Giudici-Orticoni, M.-T., Tegoni, M., and Cambillau, C. (2003) Moth chemosensory protein exhibits drastic conformational changes and cooperativity on ligand binding, Proceedings of the National Academy of Sciences 100, 5069.

Capra, J. A., Laskowski, R. A., Thornton, J. M., Singh, M., and Funkhouser, T. A. (2009) Predicting protein ligand binding sites by combining evolutionary sequence conservation and 3D structure, PLoS computational biology 5, e1000585.

Carafoli, F., Hamaia, S. W., Bihan, D., Hohenester, E., and Farndale, R. W. (2013) An activating mutation reveals a second binding mode of the integrin alpha2 I domain to the GFOGER motif in collagens, PLoS One 8, e69833.

Carpenter, E. P., Beis, K., Cameron, A. D., and Iwata, S. (2008) Overcoming the challenges of membrane protein crystallography, Current opinion in structural biology 18, 581-586.

Cavagnero, S. (2003) Using NMR to Determine Protein Structure in Solution, Journal of Chemical Education 80, 125.

Chamberlain, L. H. (2004) Detergents as tools for the purification and classification of lipid rafts, FEBS letters 559, 1-5.

Chang, J.-M., Di Tommaso, P., Taly, J.-F., and Notredame, C. (2012) Accurate multiple sequence alignment of transmembrane proteins with PSI-Coffee, BMC bioinformatics 13 Suppl 4, S1-S1.

Chatham, J. C., and Blackband, S. J. (2001) Nuclear Magnetic Resonance Spectroscopy and Imaging in Animal Research, ILAR Journal 42, 189-208.

Chen, R., and Weng, Z. (2002) Docking unbound proteins using shape complementarity, desolvation, and electrostatics, Proteins 47, 281-294.

Chen, R., Mintseris, J., Janin, J., and Weng, Z. (2003) A protein-protein docking benchmark, Proteins 52, 88-91.


Chen, V. B., Arendall, W. B., III, Headd, J. J., Keedy, D. A., Immormino, R. M., Kapral, G. J., Murray, L. W., Richardson, J. S., and Richardson, D. C. (2010) MolProbity: all-atom structure validation for macromolecular crystallography, Acta Crystallographica Section D 66, 12-21.

Chetwynd, A. P., Scott, K. A., Mokrab, Y., and Sansom, M. S. (2008) CGDB: a database of membrane protein/lipid interactions by coarse-grained molecular dynamics simulations, Molecular membrane biology 25, 662-669.

Chiche, L., Gregoret, L. M., Cohen, F. E., and Kollman, P. A. (1990) Protein model structure evaluation using the solvation free energy of folding, Proceedings of the National Academy of Sciences of the United States of America 87, 3240-3243.

Chin, Y. K., Headey, S. J., Mohanty, B., Patil, R., McEwan, P. A., Swarbrick, J. D., Mulhern, T. D., Emsley, J., Simpson, J. S., and Scanlon, M. J. (2013) The structure of integrin alpha1I domain in complex with a collagen-mimetic peptide, J Biol Chem 288, 36796-36809.

Choi, W. S., Rice, W. J., Stokes, D. L., and Coller, B. S. (2013) Three-dimensional reconstruction of intact human integrin alphaIIbbeta3: new implications for activation-dependent ligand binding, Blood 122, 4165-4171.

Choi, Y., and Deane, C. M. (2010) FREAD revisited: Accurate loop structure prediction using a database search algorithm, Proteins 78, 1431-1440.

Chothia, C., and Lesk, A. M. (1986) The relation between the divergence of sequence and structure in proteins, The EMBO journal 5, 823-826.

Chua, G. L., Tang, X. Y., Amalraj, M., Tan, S. M., and Bhattacharjya, S. (2011) Structures and interaction analyses of integrin alphaMbeta2 cytoplasmic tails, J Biol Chem 286, 43842-43854.

Cleland, W. W. (2000) Low-barrier hydrogen bonds and enzymatic catalysis, Archives of biochemistry and biophysics 382, 1-5.

Clements, M., Gershenovich, M., Chaber, C., Campos-Rivera, J., Du, P., Zhang, M., Ledbetter, S., and Zuk, A. (2016) Differential Ly6C Expression after Renal Ischemia-Reperfusion Identifies Unique Macrophage Populations, Journal of the American Society of Nephrology : JASN 27, 159-170.

Colognato, H., and Yurchenco, P. D. (2000) Form and function: the laminin family of heterotrimers, Developmental dynamics : an official publication of the American Association of Anatomists 218, 213-234.

Comeau, S. R., Gatchell, D. W., Vajda, S., and Camacho, C. J. (2004) ClusPro: a fully automated algorithm for protein-protein docking, Nucleic acids research 32, W96-99.


Contreras-Moreira, B., Fitzjohn, P. W., and Bates, P. A. (2003) In silico protein recombination: enhancing template and sequence alignment selection for comparative protein modelling, Journal of molecular biology 328, 593-608.

Cormier, A., and Campbell, M. G. (2018) Cryo-EM structure of the alphavbeta8 integrin reveals a mechanism for stabilizing integrin extension, Nature structural & molecular biology 25, 698-704.

Cozzetto, D., Kryshtafovych, A., Ceriani, M., and Tramontano, A. (2007) Assessment of predictions in the model quality assessment category, Proteins 69 Suppl 8, 175-183.

Danen, E. H. J. (2000-2013) Integrins: An Overview of Structural and Functional Aspects. In: Madame Curie Bioscience Database, 1 ed., Landes Bioscience.

Danen, E. H., and Yamada, K. M. (2001) Fibronectin, integrins, and growth control, Journal of cellular physiology 189, 1-13.

Darden, T., York, D., and Pedersen, L. (1993) Particle mesh Ewald: An N⋅log(N) method for Ewald sums in large systems, The Journal of chemical physics 98, 10089-10092.

Daugelaite, J., O'Driscoll, A., and Sleator, R. D. (2013) An Overview of Multiple Sequence Alignments and Cloud Computing in Bioinformatics, ISRN Biomathematics 2013, 14.

David, R., Korenberg, M. J., and Hunter, I. W. (2000) 3D-1D threading methods for protein fold recognition, Pharmacogenomics 1, 445-455.

de Brevern, A. G. (2010) 3D structural models of transmembrane proteins, Methods in molecular biology (Clifton, N.J.) 654, 387-401.

de Melker, A. A., and Sonnenberg, A. (1999) Integrins: alternative splicing as a mechanism to regulate ligand binding and integrin signaling events, BioEssays : news and reviews in molecular, cellular and developmental biology 21, 499-509.

De Melker, A. A., Kramer, D., Kuikman, I., and Sonnenberg, A. (1997) The two phenylalanines in the GFFKR motif of the integrin alpha6A subunit are essential for heterodimerization, The Biochemical journal 328 (Pt 2), 529-537.

de Pereda, J. M., Lillo, M. P., and Sonnenberg, A. (2009) Structural basis of the interaction between integrin alpha6beta4 and plectin at the hemidesmosomes, The EMBO journal 28, 1180-1190.

de Pereda, J. M., Wiche, G., and Liddington, R. C. (1999) Crystal structure of a tandem pair of fibronectin type III domains from the cytoplasmic tail of integrin alpha6beta4, The EMBO journal 18, 4087-4095.

de Vries, S. J., and Bonvin, A. M. (2011) CPORT: a consensus interface predictor and its performance in prediction-driven docking with HADDOCK, PLoS One 6, e17695.


Deller, M. C., and Rupp, B. (2015) Models of protein-ligand crystal structures: trust, but verify, Journal of computer-aided molecular design 29, 817-836.

Deshmukh, L., Gorbatyuk, V., and Vinogradova, O. (2010) Integrin beta3 phosphorylation dictates its complex with the Shc phosphotyrosine-binding (PTB) domain, J Biol Chem 285, 34875-34884.

Deshmukh, L., Meller, N., Alder, N., Byzova, T., and Vinogradova, O. (2011) Tyrosine phosphorylation as a conformational switch: a case study of integrin beta3 cytoplasmic tail, J Biol Chem 286, 40943-40953.

Diamond, M. S., and Springer, T. A. (1994) The dynamic regulation of integrin adhesiveness, Current biology : CB 4, 506-517.

Diamond, M. S., Garcia-Aguilar, J., Bickford, J. K., Corbi, A. L., and Springer, T. A. (1993) The I domain is a major recognition site on the leukocyte integrin Mac-1 (CD11b/CD18) for four distinct adhesion ligands, The Journal of cell biology 120, 1031-1043.

Ding, Z. M., Babensee, J. E., Simon, S. I., Lu, H., Perrard, J. L., Bullard, D. C., Dai, X. Y., Bromley, S. K., Dustin, M. L., Entman, M. L., Smith, C. W., and Ballantyne, C. M. (1999) Relative contribution of LFA-1 and Mac-1 to neutrophil adhesion and migration, Journal of immunology (Baltimore, Md. : 1950) 163, 5029-5038.

Dodd, D. S., Sheriff, S., Chang, C. J., Stetsko, D. K., Phillips, L. M., Zhang, Y., Launay, M., Potin, D., Vaccaro, W., Poss, M. A., McKinnon, M., Barrish, J. C., Suchard, S. J., and Murali Dhar, T. G. (2007) Design of LFA-1 antagonists based on a 2,3-dihydro-1H-pyrrolizin-5(7aH)-one scaffold, Bioorganic & medicinal chemistry letters 17, 1908-1911.

Dong, X., Hudson, N. E., Lu, C., and Springer, T. A. (2014) Structural determinants of integrin beta-subunit specificity for latent TGF-beta, Nature structural & molecular biology 21, 1091-1096.

Dong, X., Mi, L. Z., Zhu, J., Wang, W., Hu, P., Luo, B. H., and Springer, T. A. (2012) alpha(V)beta(3) integrin crystal structures and their functional implications, Biochemistry 51, 8814-8828.

Dong, X., Zhao, B., Iacob, R. E., Zhu, J., Koksal, A. C., Lu, C., Engen, J. R., and Springer, T. A. (2017) Force interacts with macromolecular structure in activation of TGF-beta, Nature 542, 55-59.

Dowling, J. J., Vreede, A. P., Kim, S., Golden, J., and Feldman, E. L. (2008) Kindlin-2 is required for myocyte elongation and is essential for myogenesis, BMC cell biology 9, 36.


Dumas, J. P., and Ninio, J. (1982) Efficient algorithms for folding and comparing nucleic acid sequences, Nucleic acids research 10, 197-206.

Eble, J. A., McDougall, M., Orriss, G. L., Niland, S., Johanningmeier, B., Pohlentz, G., Meier, M., Karrasch, S., Estevao-Costa, M. I., Martins Lima, A., and Stetefeld, J. (2017) Dramatic and concerted conformational changes enable rhodocetin to block alpha2beta1 integrin selectively, PLoS biology 15, e2001492.

Eddy, S. R. (1998) Profile hidden Markov models, Bioinformatics 14, 755-763.

Edwards, G. M., Wilford, F. H., Liu, X., Hennighausen, L., Djiane, J., and Streuli, C. H. (1998) Regulation of mammary differentiation by extracellular matrix involves protein-tyrosine phosphatases, J Biol Chem 273, 9495-9500.

Eisenberg, D., and McLachlan, A. D. (1986) Solvation energy in protein folding and binding, Nature 319, 199-203.

Eisenberg, D., Luthy, R., and Bowie, J. U. (1997) VERIFY3D: assessment of protein models with three-dimensional profiles, Methods in enzymology 277, 396-404.

Emsley, J., King, S. L., Bergelson, J. M., and Liddington, R. C. (1997) Crystal structure of the I domain from integrin alpha2beta1, J Biol Chem 272, 28512-28517.

Emsley, J., Knight, C. G., Farndale, R. W., Barnes, M. J., and Liddington, R. C. (2000) Structural basis of collagen recognition by integrin alpha2beta1, Cell 101, 47-56.

Eswar, N., Webb, B., Marti-Renom, M. A., Madhusudhan, M. S., Eramian, D., Shen, M. Y., Pieper, U., and Sali, A. (2006) Comparative protein structure modeling using Modeller, Current protocols in bioinformatics Chapter 5, Unit-5.6.

Evans, R. D., Perkins, V. C., Henry, A., Stephens, P. E., Robinson, M. K., and Watt, F. M. (2003) A tumor-associated beta 1 integrin mutation that abrogates epithelial differentiation control, The Journal of cell biology 160, 589-596.

Eyre, T. A., Partridge, L., and Thornton, J. M. (2004) Computational analysis of alpha-helical membrane protein structure: implications for the prediction of 3D structural models, Protein engineering, design & selection : PEDS 17, 613-624.

Faber, H. R., and Matthews, B. W. (1990) A mutant T4 lysozyme displays five different crystal conformations, Nature 348, 263-266.

Faham, S., Linhardt, R. J., and Rees, D. C. (1998) Diversity does make a difference: fibroblast growth factor-heparin interactions, Current opinion in structural biology 8, 578-586.


Fernandez-Fuentes, N., Madrid-Aliste, C. J., Rai, B. K., Fajardo, J. E., and Fiser, A. (2007) M4T: a comparative protein structure modeling server, Nucleic acids research 35, W363-368.

Fernandez-Fuentes, N., Rai, B. K., Madrid-Aliste, C. J., Fajardo, J. E., and Fiser, A. (2007) Comparative protein structure modeling by combining multiple templates and optimizing sequence-to-structure alignments, Bioinformatics 23, 2558-2565.

Fiser, A. (2010) Template-based protein structure modeling, Methods in molecular biology (Clifton, N.J.) 673, 73-94.

Fleishman, S. J., Unger, V. M., and Ben-Tal, N. (2006) Transmembrane protein structures without X-rays, Trends in biochemical sciences 31, 106-113.

Frisch, S. M., and Screaton, R. A. (2001) Anoikis mechanisms, Curr Opin Cell Biol 13, 555-562.

Fujita, M., Takada, Y. K., and Takada, Y. (2012) Integrins αvβ3 and α4β1 act as co-receptors for fractalkine and the integrin-binding defective mutant of fractalkine is an antagonist of CX3CR1, Journal of immunology (Baltimore, Md. : 1950) 189, 5809-5819.

Gabb, H. A., Jackson, R. M., and Sternberg, M. J. (1997) Modelling protein docking using shape complementarity, electrostatics and biochemical information, Journal of molecular biology 272, 106-120.

Gahmberg, C. G., Fagerholm, S. C., Nurmi, S. M., Chavakis, T., Marchesan, S., and Gronholm, M. (2009) Regulation of integrin activity and signalling, Biochimica et biophysica acta 1790, 431-444.

Garcia-Alvarez, B., de Pereda, J. M., Calderwood, D. A., Ulmer, T. S., Critchley, D., Campbell, I. D., Ginsberg, M. H., and Liddington, R. C. (2003) Structural determinants of integrin recognition by talin, Molecular cell 11, 49-58.

Garnier, J., Osguthorpe, D. J., and Robson, B. (1978) Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins, Journal of molecular biology 120, 97-120.

Geiger, B., Bershadsky, A., Pankov, R., and Yamada, K. M. (2001) Transmembrane crosstalk between the extracellular matrix--cytoskeleton crosstalk, Nature reviews. Molecular cell biology 2, 793-805.

Ghosn, E. E. B., Yang, Y., Tung, J., Herzenberg, L. A., and Herzenberg, L. A. (2008) CD11b expression distinguishes sequential stages of peritoneal B-1 development, Proceedings of the National Academy of Sciences of the United States of America 105, 5195-5200.

Giancotti, F. G., and Ruoslahti, E. (1999) Integrin signaling, Science (New York, N.Y.) 285, 1028-1032.

Graf, J., Iwamoto, Y., Sasaki, M., Martin, G. R., Kleinman, H. K., Robey, F. A., and Yamada, Y. (1987) Identification of an amino acid sequence in laminin mediating cell attachment, chemotaxis, and receptor binding, Cell 48, 989-996.

Granger, D. N., and Senchenkova, E. (2010) Inflammation and the Microcirculation, Morgan & Claypool Life Sciences, San Rafael (CA).

Greer, J. (1990) Comparative modeling methods: application to the family of the mammalian serine proteases, Proteins 7, 317-334.

Grouleff, J., Irudayam, S. J., Skeby, K. K., and Schiott, B. (2015) The influence of cholesterol on membrane protein structure, function, and dynamics studied by molecular dynamics simulations, Biochimica et biophysica acta 1848, 1783-1795.

Guckian, K. M., Lin, E. Y., Silvian, L., Friedman, J. E., Chin, D., and Scott, D. M. (2008) Design and synthesis of a series of meta aniline-based LFA-1 ICAM inhibitors, Bioorganic & medicinal chemistry letters 18, 5249-5251.

Guo, W., and Giancotti, F. G. (2004) Integrin signalling during tumour progression, Nature reviews. Molecular cell biology 5, 816-826.

Han, C., Jin, J., Xu, S., Liu, H., Li, N., and Cao, X. (2010) Integrin CD11b negatively regulates TLR-triggered inflammatory responses by activating Syk and promoting degradation of MyD88 and TRIF via Cbl-b, Nature immunology 11, 734-742.

Hatherley, R., Brown, D. K., Glenister, M., and Tastan Bishop, Ö. (2016) PRIMO: An Interactive Homology Modeling Pipeline, PLOS ONE 11, e0166698.

He, Y., Mozolewska, M. A., Krupa, P., Sieradzan, A. K., Wirecki, T. K., Liwo, A., Kachlishvili, K., Rackovsky, S., Jagiela, D., Slusarz, R., Czaplewski, C. R., Oldziej, S., and Scheraga, H. A. (2013) Lessons from application of the UNRES force field to predictions of structures of CASP10 targets, Proceedings of the National Academy of Sciences of the United States of America 110, 14936-14941.

Henikoff, S., and Henikoff, J. G. (1993) Performance evaluation of amino acid substitution matrices, Proteins 17, 49-61.

Hess, B. (2008) P-LINCS: A Parallel Linear Constraint Solver for Molecular Simulation, Journal of chemical theory and computation 4, 116-122.

Hirosaki, T., Mizushima, H., Tsubota, Y., Moriyama, K., and Miyazaki, K. (2000) Structural requirement of carboxyl-terminal globular domains of laminin alpha 3 chain for promotion of rapid cell adhesion and migration by laminin-5, J Biol Chem 275, 22495-22502.

Holm, L., and Sander, C. (1996) Mapping the protein universe, Science (New York, N.Y.) 273, 595-603.

Hopf, T. A., Schärfe, C. P. I., Rodrigues, J. P. G. L. M., Green, A. G., Kohlbacher, O., Sander, C., Bonvin, A. M. J. J., and Marks, D. S. (2014) Sequence co-evolution gives 3D contacts and structures of protein complexes, eLife 3, e03430.

Horii, K., Okuda, D., Morita, T., and Mizuno, H. (2004) Crystal structure of EMS16 in complex with the integrin alpha2-I domain, Journal of molecular biology 341, 519-527.

Huang, Y., and Zhang, L. (2004) Rapid and sensitive dot-matrix methods for genome analysis, Bioinformatics 20, 460-466.

Humphries, J. D., Byron, A., and Humphries, M. J. (2006) Integrin ligands at a glance, Journal of cell science 119, 3901-3903.

Humphries, M. J., Symonds, E. J., and Mould, A. P. (2003) Mapping functional residues onto integrin crystal structures, Current opinion in structural biology 13, 236-243.

Hunter, D. D., Porter, B. E., Bulock, J. W., Adams, S. P., Merlie, J. P., and Sanes, J. R. (1989) Primary sequence of a motor neuron-selective adhesive site in the synaptic basal lamina protein S-laminin, Cell 59, 905-913.

Huveneers, S., Truong, H., and Danen, H. J. (2007) Integrins: signaling, disease, and therapy, International journal of radiation biology 83, 743-751.

Hwang, H., Pierce, B., Mintseris, J., Janin, J., and Weng, Z. (2008) Protein-Protein Docking Benchmark Version 3.0, Proteins 73, 705-709.

Hwang, H., Vreven, T., Janin, J., and Weng, Z. (2010) Protein-Protein Docking Benchmark Version 4.0, Proteins 78, 3111-3114.

Ido, H., Harada, K., Futaki, S., Hayashi, Y., Nishiuchi, R., Natsuka, Y., Li, S., Wada, Y., Combs, A. C., Ervasti, J. M., and Sekiguchi, K. (2004) Molecular dissection of the alpha-dystroglycan- and integrin-binding sites within the globular domain of human laminin-10, J Biol Chem 279, 10946-10954.

Ihanus, E., Uotila, L. M., Toivanen, A., Varis, M., and Gahmberg, C. G. (2007) Red-cell ICAM-4 is a ligand for the monocyte/macrophage integrin CD11c/CD18: characterization of the binding sites on ICAM-4, Blood 109, 802-810.

Jefferys, B. R., Kelley, L. A., and Sternberg, M. J. (2010) Protein folding requires crowd control in a simulated cell, Journal of molecular biology 397, 1329-1338.

Jensen, M. R., and Bajic, G. (2016) Structural Basis for Simvastatin Competitive Antagonism of Complement Receptor 3, J Biol Chem 291, 16963-16976.

Johnson, M. S., and Overington, J. P. (1993) A Structural Basis for Sequence Comparisons: An Evaluation of Scoring Methodologies, Journal of molecular biology 233, 716-738.

Jorgensen, W. L., and Tirado-Rives, J. (1988) The OPLS [optimized potentials for liquid simulations] potential functions for proteins, energy minimizations for crystals of cyclic peptides and crambin, Journal of the American Chemical Society 110, 1657-1666.

Jorgensen, W. L., Maxwell, D. S., and Tirado-Rives, J. (1996) Development and Testing of the OPLS All-Atom Force Field on Conformational Energetics and Properties of Organic Liquids, Journal of the American Chemical Society 118, 11225-11236.

Kabsch, W., and Sander, C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features, Biopolymers 22, 2577-2637.

Kalman, M., and Ben-Tal, N. (2010) Quality assessment of protein model-structures using evolutionary conservation, Bioinformatics (Oxford, England) 26, 1299-1307.

Kanemoto, T., Reich, R., Royce, L., Greatorex, D., Adler, S. H., Shiraishi, N., Martin, G. R., Yamada, Y., and Kleinman, H. K. (1990) Identification of an amino acid sequence from the laminin A chain that stimulates metastasis and collagenase IV production, Proceedings of the National Academy of Sciences of the United States of America 87, 2279-2283.

Karaca, E., and Bonvin, A. M. (2013) On the usefulness of ion-mobility mass spectrometry and SAXS data in scoring docking decoys, Acta crystallographica. Section D, Biological crystallography 69, 683-694.

Katchalski-Katzir, E., Shariv, I., Eisenstein, M., Friesem, A. A., Aflalo, C., and Vakser, I. A. (1992) Molecular surface recognition: determination of geometric fit between proteins and their ligands by correlation techniques, Proceedings of the National Academy of Sciences of the United States of America 89, 2195-2199.

Katoh, K., Misawa, K., Kuma, K., and Miyata, T. (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform, Nucleic acids research 30, 3059-3066.

Keizer, G. D., Borst, J., Visser, W., Schwarting, R., de Vries, J. E., and Figdor, C. G. (1987) Membrane glycoprotein p150,95 of human cytotoxic T cell clone is involved in conjugate formation with target cells, Journal of immunology (Baltimore, Md. : 1950) 138, 3130-3136.

Kelley, L. A., MacCallum, R. M., and Sternberg, M. J. (2000) Enhanced genome annotation using structural profiles in the program 3D-PSSM, Journal of molecular biology 299, 499-520.

Kelley, L. A., Mezulis, S., Yates, C. M., Wass, M. N., and Sternberg, M. J. E. (2015) The Phyre2 web portal for protein modeling, prediction and analysis, Nature protocols 10, 845.

Kelm, S., Shi, J., and Deane, C. M. (2009) iMembrane: homology-based membrane-insertion of proteins, Bioinformatics 25, 1086-1088.

Kiema, T., Lad, Y., Jiang, P., Oxley, C. L., Baldassarre, M., Wegener, K. L., Campbell, I. D., Ylanne, J., and Calderwood, D. A. (2006) The molecular basis of filamin binding to integrins and competition with talin, Molecular cell 21, 337-347.

Kim, C., Schmidt, T., Cho, E. G., Ye, F., Ulmer, T. S., and Ginsberg, M. H. (2011) Basic amino-acid side chains regulate transmembrane integrin signalling, Nature 481, 209-213.

Kim, M. O., Nichols, S. E., Wang, Y., and McCammon, J. A. (2013) Effects of histidine protonation and rotameric states on virtual screening of M. tuberculosis RmlC, Journal of computer-aided molecular design 27, 235-246.

Kleywegt, G. J., Harris, M. R., Zou, J.-y., Taylor, T. C., Wahlby, A., and Jones, T. A. (2004) The Uppsala Electron-Density Server, Acta Crystallographica Section D 60, 2240-2249.

Kloeker, S., Major, M. B., Calderwood, D. A., Ginsberg, M. H., Jones, D. A., and Beckerle, M. C. (2004) The Kindler syndrome protein is regulated by transforming growth factor-beta and involved in integrin-mediated adhesion, J Biol Chem 279, 6824-6833.

Koldso, H., Shorthouse, D., Helie, J., and Sansom, M. S. (2014) Lipid clustering correlates with membrane curvature as revealed by molecular simulations of complex lipid bilayers, PLoS computational biology 10, e1003911.

Kotecha, A., Wang, Q., Dong, X., Ilca, S. L., Ondiviela, M., Zihe, R., Seago, J., Charleston, B., Fry, E. E., Abrescia, N. G. A., Springer, T. A., and Huiskonen, J. T. (2017) Rules of engagement between alphavbeta6 integrin and foot-and-mouth disease virus, Nature communications 8, 15408.

Kozakov, D., Brenke, R., Comeau, S. R., and Vajda, S. (2006) PIPER: an FFT-based protein docking program with pairwise potentials, Proteins 65, 392-406.

Kozakov, D., Clodfelter, K. H., Vajda, S., and Camacho, C. J. (2005) Optimal Clustering for Detecting Near-Native Conformations in Protein Docking, Biophysical Journal 89, 867-875.

Kozakov, D., Hall, D. R., Xia, B., Porter, K. A., Padhorny, D., Yueh, C., Beglov, D., and Vajda, S. (2017) The ClusPro web server for protein-protein docking, Nature protocols 12, 255-278.

Krieger, E., Nabuurs, S. B., and Vriend, G. (2005) Homology Modeling, In Structural Bioinformatics.

Krishnan, V., and Rupp, B. (2012) Macromolecular Structure Determination: Comparison of X-ray Crystallography and NMR Spectroscopy, In eLS.

Kristof, E., Zahuczky, G., Katona, K., Doro, Z., Nagy, E., and Fesus, L. (2013) Novel role of ICAM3 and LFA-1 in the clearance of apoptotic neutrophils by human macrophages, Apoptosis : an international journal on programmed cell death 18, 1235-1251.

Kruger, M., Moser, M., Ussar, S., Thievessen, I., Luber, C. A., Forner, F., Schmidt, S., Zanivan, S., Fassler, R., and Mann, M. (2008) SILAC mouse for quantitative proteomics uncovers kindlin-3 as an essential factor for red blood cell function, Cell 134, 353-364.

Kunneken, K., Pohlentz, G., Schmidt-Hederich, A., Odenthal, U., Smyth, N., Peter-Katalinic, J., Bruckner, P., and Eble, J. A. (2004) Recombinant human laminin-5 domains. Effects of heterotrimerization, proteolytic processing, and N-glycosylation on alpha3beta1 integrin binding, J Biol Chem 279, 5184-5193.

Lahmers, K. K., Hedges, J. F., Jutila, M. A., Deng, M., Abrahamsen, M. S., and Brown, W. C. (2006) Comparative gene expression by WC1+ gammadelta and CD4+ alphabeta T lymphocytes, which respond to Anaplasma marginale, demonstrates higher expression of chemokines and other myeloid cell-associated genes by WC1+ gammadelta T cells, Journal of leukocyte biology 80, 939-952.

Lahti, M., Bligt, E., Niskanen, H., Parkash, V., Brandt, A. M., Jokinen, J., Patrikainen, P., Kapyla, J., Heino, J., and Salminen, T. A. (2011) Structure of collagen receptor integrin alpha(1)I domain carrying the activating mutation E317A, J Biol Chem 286, 43343-43351.

Lai, C., Liu, X., Tian, C., and Wu, F. (2013) Integrin alpha1 has a long helix, extending from the transmembrane region to the cytoplasmic tail in detergent micelles, PLoS One 8, e62954.

Lai-Cheong, J. E., and McGrath, J. A. (2010) Kindler syndrome, Dermatologic clinics 28, 119-124.

Lam, S. D., Das, S., Sillitoe, I., and Orengo, C. (2017) An overview of comparative modelling and resources dedicated to large-scale modelling of genome sequences, Acta Crystallographica. Section D, Structural Biology 73, 628-640.

Landreh, M., Costeira-Paulo, J., Gault, J., Marklund, E. G., and Robinson, C. V. (2017) Effects of Detergent Micelles on Lipid Binding to Proteins in Electrospray Ionization Mass Spectrometry, Analytical chemistry 89, 7425-7430.

Larsson, P., Skwark, M. J., Wallner, B., and Elofsson, A. (2009) Assessment of global and local model quality in CASP8 using Pcons and ProQ, Proteins 77 Suppl 9, 167-172.

Larsson, P., Wallner, B., Lindahl, E., and Elofsson, A. (2008) Using multiple templates to improve quality of homology models in automated homology modeling, Protein science : a publication of the Protein Society 17, 990-1002.

Lau, T. L., Dua, V., and Ulmer, T. S. (2008) Structure of the integrin alphaIIb transmembrane segment, J Biol Chem 283, 16162-16168.

Lau, T. L., Kim, C., Ginsberg, M. H., and Ulmer, T. S. (2009) The structure of the integrin alphaIIbbeta3 transmembrane complex explains integrin transmembrane signalling, The EMBO journal 28, 1351-1361.

Lau, T. L., Partridge, A. W., Ginsberg, M. H., and Ulmer, T. S. (2008) Structure of the integrin beta3 transmembrane segment in phospholipid bicelles and detergent micelles, Biochemistry 47, 4008-4016.

Lee, B., and Richards, F. M. (1971) The interpretation of protein structures: estimation of static accessibility, Journal of molecular biology 55, 379-400.

Lee, J. O., Bankston, L. A., Arnaout, M. A., and Liddington, R. C. (1995) Two conformations of the integrin A-domain (I-domain): a pathway for activation?, Structure (London, England : 1993) 3, 1333-1340.

Lee, J. O., Rieu, P., Arnaout, M. A., and Liddington, R. (1995) Crystal structure of the A domain from the alpha subunit of integrin CR3 (CD11b/CD18), Cell 80, 631-638.

Legge, G. B., Kriwacki, R. W., Chung, J., Hommel, U., Ramage, P., Case, D. A., Dyson, H. J., and Wright, P. E. (2000) NMR solution structure of the inserted domain of human leukocyte function associated antigen-1, Journal of molecular biology 295, 1251-1264.

Lelièvre, S. A., Weaver, V. M., Nickerson, J. A., Larabell, C. A., Bhaumik, A., Petersen, O. W., and Bissell, M. J. (1998) Tissue phenotype depends on reciprocal interactions between the extracellular matrix and the structural organization of the nucleus, Proceedings of the National Academy of Sciences 95, 14711.

Levitt, M., and Greer, J. (1977) Automatic identification of secondary structure in globular proteins, Journal of molecular biology 114, 181-239.

Li, J., and Springer, T. A. (2017) Integrin extension enables ultrasensitive regulation by cytoskeletal force, Proceedings of the National Academy of Sciences of the United States of America 114, 4685-4690.

Li, S., and Hong, M. (2011) Protonation, Tautomerization, and Rotameric Structure of Histidine: A Comprehensive Study by Magic-Angle-Spinning Solid-State NMR, Journal of the American Chemical Society 133, 1534-1544.

Li, S., Wang, H., Peng, B., Zhang, M., Zhang, D., Hou, S., Guo, Y., and Ding, J. (2009) Efalizumab binding to the LFA-1 alphaL I domain blocks ICAM-1 binding via steric hindrance, Proceedings of the National Academy of Sciences of the United States of America 106, 4349-4354.

Liao, M., Cao, E., Julius, D., and Cheng, Y. (2013) Structure of the TRPV1 ion channel determined by electron cryo-microscopy, Nature 504, 107-112.

Liddington, R. C., and Ginsberg, M. H. (2002) Integrin activation takes shape, The Journal of cell biology 158, 833-839.

Lin, E. Y., Guckian, K. M., Silvian, L., Chin, D., Boriack-Sjodin, P. A., van Vlijmen, H., Friedman, J. E., and Scott, D. M. (2008) Structure-activity relationship of ortho- and meta-phenol based LFA-1 ICAM inhibitors, Bioorganic & medicinal chemistry letters 18, 5245-5248.

Lin, F. Y., Zhu, J., Eng, E. T., Hudson, N. E., and Springer, T. A. (2016) beta-Subunit Binding Is Sufficient for Ligands to Open the Integrin alphaIIbbeta3 Headpiece, J Biol Chem 291, 4537-4546.

Liu, J., Das, M., Yang, J., Ithychanda, S. S., Yakubenko, V. P., Plow, E. F., and Qin, J. (2015) Structural mechanism of integrin inactivation by filamin, Nature structural & molecular biology 22, 383-389.

Liu, W., Draheim, K. M., Zhang, R., Calderwood, D. A., and Boggon, T. J. (2013) Mechanism for KRIT1 release of ICAP1-mediated suppression of integrin activation, Molecular cell 49, 719-729.

Liu, W., Wacker, D., Gati, C., Han, G. W., James, D., Wang, D., Nelson, G., Weierstall, U., Katritch, V., Barty, A., Zatsepin, N. A., Li, D., Messerschmidt, M., Boutet, S., Williams, G. J., Koglin, J. E., Seibert, M. M., Wang, C., Shah, S. T., Basu, S., Fromme, R., Kupitz, C., Rendek, K. N., Grotjohann, I., Fromme, P., Kirian, R. A., Beyerlein, K. R., White, T. A., Chapman, H. N., Caffrey, M., Spence, J. C., Stevens, R. C., and Cherezov, V. (2013) Serial femtosecond crystallography of G protein-coupled receptors, Science (New York, N.Y.) 342, 1521-1524.

Lodish, H., Berk, A., Zipursky, S., Matsudaira, P., Baltimore, D., and Darnell, J. (2000) Membrane Proteins, In Molecular Cell Biology 4 ed., W. H. Freeman, New York.

Luthy, R., Bowie, J. U., and Eisenberg, D. (1992) Assessment of protein models with three-dimensional profiles, Nature 356, 83-85.

Mahalingam, B., Ajroud, K., Alonso, J. L., Anand, S., Adair, B. D., Horenstein, A. L., Malavasi, F., Xiong, J. P., and Arnaout, M. A. (2011) Stable coordination of the inhibitory Ca2+ ion at the metal ion-dependent adhesion site in integrin CD11b/CD18 by an antibody-derived ligand aspartate: implications for integrin regulation and structure-based drug design, Journal of immunology (Baltimore, Md. : 1950) 187, 6393-6401.

Mahalingam, B., Van Agthoven, J. F., Xiong, J. P., Alonso, J. L., Adair, B. D., Rui, X., Anand, S., Mehrbod, M., Mofrad, M. R., Burger, C., Goodman, S. L., and Arnaout, M. A. (2014) Atomic basis for the species-specific inhibition of alphaV integrins by monoclonal antibody 17E6 is revealed by the crystal structure of alphaVbeta3 ectodomain-17E6 Fab complex, J Biol Chem 289, 13801-13809.

Maizel, J. V., Jr., and Lenk, R. P. (1981) Enhanced graphic matrix analysis of nucleic acid and protein sequences, Proceedings of the National Academy of Sciences of the United States of America 78, 7665-7669.

Malhotra, V., Hogg, N., and Sim, R. B. (1986) Ligand binding by the p150,95 antigen of U937 monocytic cells: properties in common with complement receptor type 3 (CR3), European journal of immunology 16, 1117-1123.

Mandell, J. G., Roberts, V. A., Pique, M. E., Kotlovyi, V., Mitchell, J. C., Nelson, E., Tsigelny, I., and Ten Eyck, L. F. (2001) Protein docking using continuum electrostatics and geometric fit, Protein engineering 14, 105-113.

Marks, D. S., Colwell, L. J., Sheridan, R., Hopf, T. A., Pagnani, A., Zecchina, R., and Sander, C. (2011) Protein 3D structure computed from evolutionary sequence variation, PLoS One 6, e28766.

McCleverty, C. J., and Liddington, R. C. (2003) Engineered allosteric mutants of the integrin alphaMbeta2 I domain: structural and functional studies, Biochem J 372, 121-127.

McGregor, C. L., Chen, L., Pomroy, N. C., Hwang, P., Go, S., Chakrabartty, A., and Prive, G. G. (2003) Lipopeptide detergents designed for the structural study of membrane proteins, Nature biotechnology 21, 171-176.

McGuffin, L. J. (2009) Prediction of global and local model quality in CASP8 using the ModFOLD server, Proteins 77 Suppl 9, 185-190.

Mecham, R. P., Hinek, A., Griffin, G. L., Senior, R. M., and Liotta, L. A. (1989) The elastin receptor shows structural and functional similarities to the 67-kDa tumor cell laminin receptor, J Biol Chem 264, 16652-16657.

Menko, A. S., and Boettiger, D. (1987) Occupation of the extracellular matrix receptor, integrin, is a control point for myogenic differentiation, Cell 51, 51-57.

Metcalf, D. G., Moore, D. T., Wu, Y., Kielec, J. M., Molnar, K., Valentine, K. G., Wand, A. J., Bennett, J. S., and DeGrado, W. F. (2010) NMR analysis of the alphaIIb beta3 cytoplasmic interaction suggests a mechanism for integrin regulation, Proceedings of the National Academy of Sciences of the United States of America 107, 22481-22486.

Metcalf, K. J., Bevington, J. L., Rosales, S. L., Burdette, L. A., Valdivia, E., and Tullman-Ercek, D. (2016) Proteins adopt functionally active conformations after type III secretion, Microbial cell factories 15, 213.

Michishita, M., Videm, V., and Arnaout, M. A. (1993) A novel divalent cation-binding site in the A domain of the beta 2 integrin CR3 (CD11b/CD18) is essential for ligand binding, Cell 72, 857-867.

Mintseris, J., Wiehe, K., Pierce, B., Anderson, R., Chen, R., Janin, J., and Weng, Z. (2005) Protein-Protein Docking Benchmark 2.0: an update, Proteins 60, 214-216.

Miyazaki, Y., Vieira-de-Abreu, A., Harris, E. S., Shah, A. M., Weyrich, A. S., Castro-Faria-Neto, H. C., and Zimmerman, G. A. (2014) Integrin α(D)β(2) (CD11d/CD18) Is Expressed by Human Circulating and Tissue Myeloid Leukocytes and Mediates Inflammatory Signaling, PLoS ONE 9, e112770.

Mizuguchi, K., Deane, C. M., Blundell, T. L., Johnson, M. S., and Overington, J. P. (1998) JOY: protein sequence-structure representation and analysis, Bioinformatics 14, 617-623.

Modi, V., and Dunbrack, R. L., Jr. (2016) Assessment of refinement of template-based models in CASP11, Proteins 84 Suppl 1, 260-281.

Mondal, S., Khelashvili, G., and Weinstein, H. (2014) Not just an oil slick: how the energetics of protein-membrane interactions impacts the function and organization of transmembrane proteins, Biophysical journal 106, 2305-2316.

Morris, A. L., MacArthur, M. W., Hutchinson, E. G., and Thornton, J. M. (1992) Stereochemical quality of protein structure coordinates, Proteins 12, 345-364.

Morse, E. M., Brahme, N. N., and Calderwood, D. A. (2014) Integrin cytoplasmic tail interactions, Biochemistry 53, 810-820.

Moser, M., Nieswandt, B., Ussar, S., Pozgajova, M., and Fassler, R. (2008) Kindlin-3 is essential for integrin activation and platelet aggregation, Nature medicine 14, 325-330.

Muhamad, F. N., Ahmad, R. B., Asi, S. M., and Murad, M. N. (2018) Performance Analysis Of Needleman-Wunsch Algorithm (Global) And Smith-Waterman Algorithm (Local) In Reducing Search Space And Time For DNA Sequence Alignment, Journal of Physics: Conference Series 1019, 012085.

Muller, A., MacCallum, R. M., and Sternberg, M. J. (1999) Benchmarking PSI-BLAST in genome annotation, Journal of molecular biology 293, 1257-1271.

Nagae, M., Re, S., Mihara, E., Nogi, T., Sugita, Y., and Takagi, J. (2012) Crystal structure of alpha5beta1 integrin ectodomain: atomic details of the fibronectin receptor, The Journal of cell biology 197, 131-140.

NCBI. (2018) GenBank and WGS Statistics, U. S. National Library of Medicine, Rockville Pike, Bethesda MD, 20894 USA.

Neal, S., Nip, A. M., Zhang, H., and Wishart, D. S. (2003) Rapid and accurate calculation of protein 1H, 13C and 15N chemical shifts, Journal of biomolecular NMR 26, 215-240.

Nishiuchi, R., Takagi, J., Hayashi, M., Ido, H., Yagi, Y., Sanzen, N., Tsuji, T., Yamada, M., and Sekiguchi, K. (2006) Ligand-binding specificities of laminin-binding integrins: A comprehensive survey of laminin–integrin interactions using recombinant α3β1, α6β1, α7β1 and α6β4 integrins, Matrix Biology 25, 189-197.

Nocedal, J., and Wright, S. (1999) Quasi-Newton Methods In Numerical optimization (Glynn, P., and Robinson, S., Eds.) 1 ed., pp 192-220, Springer Series in Operations Research, Berlin.

Nute, M. G., Saleh, E., and Warnow, T. (2018) Benchmarking Statistical Multiple Sequence Alignment, bioRxiv.

Nymalm, Y., Puranen, J. S., Nyholm, T. K., Kapyla, J., Kidron, H., Pentikainen, O. T., Airenne, T. T., Heino, J., Slotte, J. P., Johnson, M. S., and Salminen, T. A. (2004) Jararhagin-derived RKKH peptides induce structural changes in alpha1I domain of human integrin alpha1beta1, J Biol Chem 279, 7962-7970.

O'Brien, J. S., and Rouser, G. (1964) The fatty acid composition of brain sphingolipids: sphingomyelin, ceramide, cerebroside, and cerebroside sulfate, Journal of lipid research 5, 339-342.

Oktay, M., Wary, K. K., Dans, M., Birge, R. B., and Giancotti, F. G. (1999) Integrin-mediated Activation of Focal Adhesion Kinase Is Required for Signaling to Jun NH(2)-terminal Kinase and Progression through the G1 Phase of the Cell Cycle, The Journal of cell biology 145, 1461-1470.

Olson, J. S., Mathews, A. J., Rohlfs, R. J., Springer, B. A., Egeberg, K. D., Sligar, S. G., Tame, J., Renaud, J. P., and Nagai, K. (1988) The role of the distal histidine in myoglobin and haemoglobin, Nature 336, 265-266.

Ostermann, G., Weber, K. S., Zernecke, A., Schroder, A., and Weber, C. (2002) JAM-1 is a ligand of the beta(2) integrin LFA-1 involved in transendothelial migration of leukocytes, Nature immunology 3, 151-158.

Overington, J. P., Al-Lazikani, B., and Hopkins, A. L. (2006) How many drug targets are there?, Nature reviews. Drug discovery 5, 993-996.

Páll, S., and Hess, B. (2013) A flexible algorithm for calculating pair interactions on SIMD architectures, Computer Physics Communications 184, 2641-2650.

Park, H., Ovchinnikov, S., Kim, D. E., DiMaio, F., and Baker, D. (2018) Protein homology model refinement by large-scale energy optimization, Proceedings of the National Academy of Sciences of the United States of America 115, 3054-3059.

Park, J., Karplus, K., Barrett, C., Hughey, R., Haussler, D., Hubbard, T., and Chothia, C. (1998) Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods, Journal of molecular biology 284, 1201-1210.

Parry, C. S., Gorski, J., and Stern, L. J. (2007) Crystallographic structure of the human leukocyte antigen DRA, DRB3*0101: models of a directional alloimmune response and autoimmunity, Journal of molecular biology 371, 435-446.

PDB, R. (2018) PDB Data Distribution by Structural Genomics Centers.

PDB, R. (2018) PDB Statistics: Overall Growth of Released Structures Per Year.

Perrin, C. L., and Nielson, J. B. (1997) “Strong” hydrogen bonds in chemistry and biology, Annual Review of Physical Chemistry 48, 511-544.

Petrov, D., and Zagrovic, B. (2014) Are current atomistic force fields accurate enough to study proteins in crowded environments?, PLoS computational biology 10, e1003638.

Pilling, D., Fan, T., Huang, D., Kaul, B., and Gomer, R. H. (2009) Identification of markers that distinguish monocyte-derived fibrocytes from monocytes, macrophages, and fibroblasts, PLoS One 4, e7475.

Potin, D., Launay, M., Monatlik, F., Malabre, P., Fabreguettes, M., Fouquet, A., Maillet, M., Nicolai, E., Dorgeret, L., Chevallier, F., Besse, D., Dufort, M., Caussade, F., Ahmad, S. Z., Stetsko, D. K., Skala, S., Davis, P. M., Balimane, P., Patel, K., Yang, Z., Marathe, P., Postelneck, J., Townsend, R. M., Goldfarb, V., Sheriff, S., Einspahr, H., Kish, K., Malley, M. F., DiMarco, J. D., Gougoutas, J. Z., Kadiyala, P., Cheney, D. L., Tejwani, R. W., Murphy, D. K., McIntyre, K. W., Yang, X., Chao, S., Leith, L., Xiao, Z., Mathur, A., Chen, B. C., Wu, D. R., Traeger, S. C., McKinnon, M., Barrish, J. C., Robl, J. A., Iwanowicz, E. J., Suchard, S. J., and Dhar, T. G. (2006) Discovery and development of 5-[(5S,9R)-9-(4-cyanophenyl)-3-(3,5-dichlorophenyl)-1-methyl-2,4-dioxo-1,3,7-triazaspiro[4.4]non-7-yl-methyl]-3-thiophenecarboxylic acid (BMS-587101)--a small molecule antagonist of leukocyte function associated antigen-1, Journal of medicinal chemistry 49, 6946-6949.

Pozzi, A., and Zent, R. (2013) Integrins in kidney disease, J Am Soc Nephrol 24, 1034-1039.

Pronk, S., Pall, S., Schulz, R., Larsson, P., Bjelkmar, P., Apostolov, R., Shirts, M. R., Smith, J. C., Kasson, P. M., van der Spoel, D., et al. (2013) GROMACS 4.5: a high-throughput and highly parallel open source molecular simulation toolkit, Bioinformatics 29, 845-854.

Qian, B., Ortiz, A. R., and Baker, D. (2004) Improvement of comparative model accuracy by free-energy optimization along principal components of natural structural variation, Proceedings of the National Academy of Sciences of the United States of America 101, 15346-15351.

Qu, A., and Leahy, D. J. (1995) Crystal structure of the I-domain from the CD11a/CD18 (LFA-1, alpha L beta 2) integrin, Proceedings of the National Academy of Sciences of the United States of America 92, 10277-10281.

Qu, A., and Leahy, D. J. (1996) The role of the divalent cation in the structure of the I domain from the CD11a/CD18 integrin, Structure (London, England : 1993) 4, 931-942.

Raval, A., Piana, S., Eastwood, M. P., Dror, R. O., and Shaw, D. E. (2012) Refinement of protein structure homology models via long, all-atom molecular dynamics simulations, Proteins 80, 2071-2079.

Ren, X., Tu, C., Laipis, P. J., and Silverman, D. N. (1995) Proton transfer by histidine 67 in site-directed mutants of human carbonic anhydrase III, Biochemistry 34, 8492-8498.

Renshaw, M. W., Ren, X. D., and Schwartz, M. A. (1997) Growth factor activation of MAP kinase requires cell adhesion, The EMBO journal 16, 5592-5599.

Rich, R. L., Deivanayagam, C. C., Owens, R. T., Carson, M., Hook, A., Moore, D., Symersky, J., Yang, V. W., Narayana, S. V., and Hook, M. (1999) Trench-shaped binding sites promote multiple classes of interactions between collagen and the adherence receptors, alpha(1)beta(1) integrin and Staphylococcus aureus cna MSCRAMM, J Biol Chem 274, 24906-24913.

Richards, F. M. (1977) Areas, volumes, packing and protein structure, Annual review of biophysics and bioengineering 6, 151-176.

Richards, F. M., and Kundrot, C. E. (1988) Identification of structural motifs from protein coordinate data: secondary structure and first-level supersecondary structure, Proteins 3, 71-84.

Rodrigues, J. P. G. L. M., and Bonvin, A. M. J. J. (2014) Integrative computational modeling of protein interactions, The FEBS journal 281, 1988-2003.

Rodrigues, J. P., Trellet, M., Schmitz, C., Kastritis, P., Karaca, E., Melquiond, A. S., and Bonvin, A. M. (2012) Clustering biomolecular complexes by residue contacts similarity, Proteins 80, 1810-1817.

Rosenkranz, A. R., Coxon, A., Maurer, M., Gurish, M. F., Austen, K. F., Friend, D. S., Galli, S. J., and Mayadas, T. N. (1998) Impaired mast cell development and innate immunity in Mac-1 (CD11b/CD18, CR3)-deficient mice, Journal of immunology (Baltimore, Md. : 1950) 161, 6463-6467.

Rotkiewicz, P., and Skolnick, J. (2008) Fast procedure for reconstruction of full-atom protein models from reduced representations, Journal of computational chemistry 29, 1460-1465.

Roy, A., Kucukural, A., and Zhang, Y. (2010) I-TASSER: a unified platform for automated protein structure and function prediction, Nature protocols 5, 725-738.

Roy, A., Yang, J., and Zhang, Y. (2012) COFACTOR: an accurate comparative algorithm for structure-based protein function annotation, Nucleic acids research 40, W471-477.

Rubtsova, K., Rubtsov, A. V., van Dyk, L. F., Kappler, J. W., and Marrack, P. (2013) T-box transcription factor T-bet, a key player in a unique type of B-cell activation essential for effective viral clearance, Proceedings of the National Academy of Sciences of the United States of America 110, E3216-3224.

Rui, X., Mehrbod, M., Van Agthoven, J. F., Anand, S., Xiong, J.-P., Mofrad, M. R. K., and Arnaout, M. A. (2014) The α-Subunit Regulates Stability of the Metal Ion at the Ligand-associated Metal Ion-binding Site in β(3) Integrins, The Journal of Biological Chemistry 289, 23256-23263.

Ruoslahti, E. (1991) Integrins, The Journal of clinical investigation 87, 1-5.

Rychlewski, L., Jaroszewski, L., Li, W., and Godzik, A. (2000) Comparison of sequence profiles. Strategies for structural predictions using sequence information, Protein science : a publication of the Protein Society 9, 232-241.

Ryckaert, J.-P., Ciccotti, G., and Berendsen, H. J. C. (1977) Numerical integration of the cartesian equations of motion of a system with constraints: molecular dynamics of n-alkanes, Journal of Computational Physics 23, 327-341.

Sadhu, C., Ting, H. J., Lipsky, B., Hensley, K., Garcia-Martinez, L. F., Simon, S. I., and Staunton, D. E. (2007) CD11c/CD18: novel ligands and a role in delayed-type hypersensitivity, Journal of leukocyte biology 81, 1395-1403.

Sanchez, R., and Sali, A. (1997) Evaluation of comparative protein structure modeling by MODELLER-3, Proteins Suppl 1, 50-58.

Saxena, A., Sangwan, R., and Mishra, S. (2013) Fundamentals of Homology Modeling Steps and Comparison among Important Bioinformatics Tools: An Overview, Vol. 1.

Schlaepfer, D. D., and Hunter, T. (1998) Integrin signalling and tyrosine phosphorylation: just the FAKs?, Trends in cell biology 8, 151-157.

Schmidt, T., Situ, A. J., and Ulmer, T. S. (2016) Structural and thermodynamic basis of proline-induced transmembrane complex stabilization, Scientific reports 6, 29809.

Schmitz, C., and Bonvin, A. M. J. J. (2011) Protein–protein HADDocking using exclusively pseudocontact shifts, Journal of biomolecular NMR 50, 263-266.

Schulze, B., Mann, K., Poschl, E., Yamada, Y., and Timpl, R. (1996) Structural and functional analysis of the globular domain IVa of the laminin alpha 1 chain and its impact on an adjacent RGD site, Biochem J 314 ( Pt 3), 847-851.

Scott, K. A., Bond, P. J., Ivetac, A., Chetwynd, A. P., Khalid, S., and Sansom, M. S. (2008) Coarse-grained MD simulations of membrane protein-bilayer self-assembly, Structure (London, England : 1993) 16, 621-630.

Sen, M., and Springer, T. A. (2016) Leukocyte integrin alphaLbeta2 headpiece structures: The alphaI domain, the pocket for the internal ligand, and concerted movements of its loops, Proceedings of the National Academy of Sciences of the United States of America 113, 2940-2945.

Sen, M., Yuki, K., and Springer, T. A. (2013) An internal ligand-bound, metastable state of a leukocyte integrin, alphaXbeta2, The Journal of cell biology 203, 629-642.

Shi, C., and Simon, D. I. (2006) Integrin signals, transcription factors, and monocyte differentiation, Trends in cardiovascular medicine 16, 146-152.

Shi, M., Foo, S. Y., Tan, S. M., Mitchell, E. P., Law, S. K., and Lescar, J. (2007) A structural hypothesis for the transition between bent and extended conformations of the leukocyte beta2 integrins, J Biol Chem 282, 30198-30206.

Shi, M., Sundramurthy, K., Liu, B., Tan, S. M., Law, S. K., and Lescar, J. (2005) The crystal structure of the plexin-semaphorin-integrin domain/hybrid domain/I-EGF1 segment from the human integrin beta2 subunit at 1.8-A resolution, J Biol Chem 280, 30586-30593.

Shier, P., Ngo, K., and Fung-Leung, W. P. (1999) Defective CD8+ T cell activation and cytolytic function in the absence of LFA-1 cannot be restored by increased TCR signaling, Journal of immunology (Baltimore, Md. : 1950) 163, 4826-4832.

Shier, P., Otulakowski, G., Ngo, K., Panakos, J., Chourmouzis, E., Christjansen, L., Lau, C. Y., and Fung-Leung, W. P. (1996) Impaired immune responses toward alloantigens and tumor cells but normal thymic selection in mice deficient in the beta2 integrin leukocyte function-associated antigen-1, Journal of immunology (Baltimore, Md. : 1950) 157, 5375-5386.

Shimaoka, M., Takagi, J., and Springer, T. A. (2002) Conformational regulation of integrin structure and function, Annual review of biophysics and biomolecular structure 31, 485-516.

Shimaoka, M., Xiao, T., Liu, J. H., Yang, Y., Dong, Y., Jun, C. D., McCormack, A., Zhang, R., Joachimiak, A., Takagi, J., Wang, J. H., and Springer, T. A. (2003) Structures of the alpha L I domain and its complex with ICAM-1 reveal a shape-shifting pathway for integrin regulation, Cell 112, 99-111.

Sievers, F., Dineen, D., Wilm, A., and Higgins, D. G. (2013) Making automated multiple alignments of very large numbers of protein sequences, Bioinformatics 29, 989-995.

Singer, S. J., and Nicolson, G. L. (1972) The fluid mosaic model of the structure of cell membranes, Science (New York, N.Y.) 175, 720-731.

Skubitz, A. P., McCarthy, J. B., Zhao, Q., Yi, X. Y., and Furcht, L. T. (1990) Definition of a sequence, RYVVLPR, within laminin peptide F-9 that mediates metastatic fibrosarcoma cell adhesion and spreading, Cancer research 50, 7612-7622.

Soding, J. (2005) Protein homology detection by HMM-HMM comparison, Bioinformatics 21, 951-960.

Song, G., Yang, Y., Liu, J. H., Casasnovas, J. M., Shimaoka, M., Springer, T. A., and Wang, J. H. (2005) An atomic resolution view of ICAM recognition in a complex between the binding domains of ICAM-3 and integrin alphaLbeta2, Proceedings of the National Academy of Sciences of the United States of America 102, 3366-3371.

Sonnino, S., Mauri, L., Chigorno, V., and Prinetti, A. (2007) Gangliosides as components of lipid membrane domains, Glycobiology 17, 1r-13r.

Speranskiy, K., Cascio, M., and Kurnikova, M. (2007) Homology modeling and molecular dynamics simulations of the glycine receptor ligand binding domain, Proteins 67, 950-960.

Springer, T. A. (1994) Traffic signals for lymphocyte recirculation and leukocyte emigration: the multistep paradigm, Cell 76, 301-314.

Springer, T. A., Zhu, J., and Xiao, T. (2008) Structural basis for distinctive recognition of fibrinogen gammaC peptide by the platelet integrin alphaIIbbeta3, The Journal of cell biology 182, 791-800.

Steadman, R., Irwin, M. H., St John, P. L., Blackburn, W. D., Heck, L. W., and Abrahamson, D. R. (1993) Laminin cleavage by activated human neutrophils yields proteolytic fragments with selective migratory properties, Journal of leukocyte biology 53, 354-365.

Stockel, J., Safar, J., Wallace, A. C., Cohen, F. E., and Prusiner, S. B. (1998) Prion protein selectively binds copper(II) ions, Biochemistry 37, 7185-7193.

Stupack, D. G., Puente, X. S., Boutsaboualoy, S., Storgard, C. M., and Cheresh, D. A. (2001) Apoptosis of adherent cells by recruitment of caspase-8 to unligated integrins, The Journal of cell biology 155, 459-470.

Takala, H., Nurminen, E., Nurmi, S. M., Aatonen, M., Strandin, T., Takatalo, M., Kiema, T., Gahmberg, C. G., Ylanne, J., and Fagerholm, S. C. (2008) Beta2 integrin phosphorylation on Thr758 acts as a molecular switch to regulate 14-3-3 and filamin binding, Blood 112, 1853-1862.

Taooka, Y., Chen, J., Yednock, T., and Sheppard, D. (1999) The Integrin α9β1 Mediates Adhesion to Activated Endothelial Cells and Transendothelial Neutrophil Migration through Interaction with Vascular Cell Adhesion Molecule-1, The Journal of cell biology 145, 413-420.

Tashiro, K., Sephel, G. C., Weeks, B., Sasaki, M., Martin, G. R., Kleinman, H. K., and Yamada, Y. (1989) A synthetic peptide containing the IKVAV sequence from the A chain of laminin mediates cell attachment, migration, and neurite outgrowth, J Biol Chem 264, 16174-16182.

Tian, L., Yoshihara, Y., Mizuno, T., Mori, K., and Gahmberg, C. G. (1997) The neuronal glycoprotein telencephalin is a cellular ligand for the CD11a/CD18 leukocyte integrin, Journal of immunology (Baltimore, Md. : 1950) 158, 928-936.

Trott, O., and Olson, A. J. (2010) AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization and multithreading, Journal of computational chemistry 31, 455-461.

Tu, C. K., Silverman, D. N., Forsman, C., Jonsson, B. H., and Lindskog, S. (1989) Role of histidine 64 in the catalytic mechanism of human carbonic anhydrase II studied with a site-specific mutant, Biochemistry 28, 7913-7918.

U. S. Department of Education, U. S. C. B., U. S. Department of Labor. (2018) Biochemistry.

Vajda, S., and Kozakov, D. (2009) Convergence and combination of methods in protein-protein docking, Current opinion in structural biology 19, 164-170.

Vale, R. D., and Dell, K. (2009) The biological sciences in India: aiming high for the future, The Journal of cell biology 184, 342-353.

Van Agthoven, J. F., Xiong, J. P., Alonso, J. L., Rui, X., Adair, B. D., Goodman, S. L., and Arnaout, M. A. (2014) Structural basis for pure antagonism of integrin alphaVbeta3 by a high-affinity form of fibronectin, Nature structural & molecular biology 21, 383-388.

Van der Vieren, M., Crowe, D. T., Hoekstra, D., Vazeux, R., Hoffman, P. A., Grayson, M. H., Bochner, B. S., Gallatin, W. M., and Staunton, D. E. (1999) The leukocyte integrin alpha D beta 2 binds VCAM-1: evidence for a binding interface between I domain and VCAM-1, Journal of immunology (Baltimore, Md. : 1950) 163, 1984-1990.

van Dijk, A. D., Fushman, D., and Bonvin, A. M. (2005) Various strategies of using residual dipolar couplings in NMR-driven protein docking: application to Lys48-linked di-ubiquitin and validation against 15N-relaxation data, Proteins 60, 367-381.

van Dijk, A. D., Kaptein, R., Boelens, R., and Bonvin, A. M. (2006) Combining NMR relaxation with chemical shift perturbation data to drive protein-protein docking, Journal of biomolecular NMR 34, 237-244.

van Gelder, C. W. G., Leusen, F. J. J., Leunissen, J. A. M., and Noordik, J. H. (1994) A molecular dynamics approach for the generation of complete protein structures from limited coordinate data, 18, 174-185.

van Zundert, G. C. P., Melquiond, A. S. J., and Bonvin, A. (2015) Integrative Modeling of Biomolecular Complexes: HADDOCKing with Cryo-Electron Microscopy Data, Structure (London, England : 1993) 23, 949-960.

Vangone, A., Rodrigues, J. P. G. L. M., Xue, L. C., van Zundert, G. C. P., Geng, C., Kurkcuoglu, Z., Nellen, M., Narasimhan, S., Karaca, E., van Dijk, M., Melquiond, A. S. J., Visscher, K. M., Trellet, M., Kastritis, P. L., and Bonvin, A. M. J. J. (2017) Sense and simplicity in HADDOCK scoring: Lessons from CASP-CAPRI round 1, Proteins 85, 417-423.

Venclovas, C. (2003) Comparative modeling in CASP5: progress is evident, but alignment errors remain a significant hindrance, Proteins 53 Suppl 6, 380-388.

Venclovas, C., and Margelevicius, M. (2005) Comparative modeling in CASP6 using consensus approach to template selection, sequence-structure alignment, and structure assessment, Proteins 61 Suppl 7, 99-105.

Vinogradova, O., Haas, T., Plow, E. F., and Qin, J. (2000) A structural basis for integrin activation by the cytoplasmic tail of the alpha IIb-subunit, Proceedings of the National Academy of Sciences of the United States of America 97, 1450-1455.

Vinogradova, O., Vaynberg, J., Kong, X., Haas, T. A., Plow, E. F., and Qin, J. (2004) Membrane-mediated structural transitions at the cytoplasmic face during integrin activation, Proceedings of the National Academy of Sciences of the United States of America 101, 4094-4099.

Vinogradova, O., Velyvis, A., Velyviene, A., Hu, B., Haas, T., Plow, E., and Qin, J. (2002) A structural mechanism of integrin alpha(IIb)beta(3) "inside-out" activation as regulated by its cytoplasmic face, Cell 110, 587-597.

Vorup-Jensen, T., Ostermeier, C., Shimaoka, M., Hommel, U., and Springer, T. A. (2003) Structure and allosteric regulation of the alpha X beta 2 integrin I domain, Proceedings of the National Academy of Sciences of the United States of America 100, 1873-1878.

Wagner, C., Hansch, G. M., Stegmaier, S., Denefleh, B., Hug, F., and Schoels, M. (2001) The complement receptor 3, CR3 (CD11b/CD18), on T lymphocytes: activation-dependent up-regulation and regulatory function, European journal of immunology 31, 1173-1180.

Wallner, B., and Elofsson, A. (2003) Can correct protein models be identified?, Protein science : a publication of the Protein Society 12, 1073-1086.

Wallner, B., and Elofsson, A. (2005) All are not equal: a benchmark of different homology modeling programs, Protein science : a publication of the Protein Society 14, 1315-1327.

Walsh, I., Baù, D., Martin, A. J. M., Mooney, C., Vullo, A., and Pollastri, G. (2009) Ab initio and template-based prediction of multi-class distance maps by two-dimensional recursive neural networks, BMC structural biology 9, 5-5.

Wass, M. N., Kelley, L. A., and Sternberg, M. J. E. (2010) 3DLigandSite: predicting ligand-binding sites using similar structures, Nucleic acids research 38, W469-W473.

Wattanasin, S., Kallen, J., Myers, S., Guo, Q., Sabio, M., Ehrhardt, C., Albert, R., Hommel, U., Weckbecker, G., Welzenbach, K., and Weitz-Schmidt, G. (2005) 1,4-Diazepane-2,5-diones as novel inhibitors of LFA-1, Bioorganic & medicinal chemistry letters 15, 1217-1220.

Weaver, V. M., Lelievre, S., Lakins, J. N., Chrenek, M. A., Jones, J. C., Giancotti, F., Werb, Z., and Bissell, M. J. (2002) beta4 integrin-dependent formation of polarized three-dimensional architecture confers resistance to apoptosis in normal and malignant mammary epithelium, Cancer cell 2, 205-216.

Weiner, S. J., Kollman, P. A., Case, D. A., Singh, U. C., Ghio, C., Alagona, G., Profeta, S., and Weiner, P. (1984) A new force field for molecular mechanical simulation of nucleic acids and proteins, Journal of the American Chemical Society 106, 765-784.

Weitz-Schmidt, G., Welzenbach, K., Dawson, J., and Kallen, J. (2004) Improved lymphocyte function-associated antigen-1 (LFA-1) inhibition by statin derivatives: molecular basis determined by x-ray analysis and monitoring of LFA-1 conformational changes in vitro and ex vivo, J Biol Chem 279, 46764-46771.

Weljie, A. M., Hwang, P. M., and Vogel, H. J. (2002) Solution structures of the cytoplasmic tail complex from platelet integrin alpha IIb- and beta 3-subunits, Proceedings of the National Academy of Sciences of the United States of America 99, 5878-5883.

Whittaker, C. A., and Hynes, R. O. (2002) Distribution and evolution of von Willebrand/integrin A domains: widely dispersed domains with roles in cell adhesion and elsewhere, Molecular biology of the cell 13, 3369-3387.

Wiederstein, M., and Sippl, M. J. (2007) ProSA-web: interactive web service for the recognition of errors in three-dimensional structures of proteins, Nucleic acids research 35, W407-410.

Wilbur, W. J., and Lipman, D. J. (1983) Rapid similarity searches of nucleic acid and protein data banks, Proceedings of the National Academy of Sciences of the United States of America 80, 726-730.

Willard, L., Ranjan, A., Zhang, H., Monzavi, H., Boyko, R. F., Sykes, B. D., and Wishart, D. S. (2003) VADAR: a web server for quantitative evaluation of protein structure quality, Nucleic acids research 31, 3316-3319.

Wishart, D. S., and Case, D. A. (2001) Use of chemical shifts in macromolecular structure determination, Methods in enzymology 338, 3-34.

Wishart, D. S., and Sykes, B. D. (1994) The 13C chemical-shift index: a simple method for the identification of protein secondary structure using 13C chemical-shift data, Journal of biomolecular NMR 4, 171-180.

Wishart, D. S., Arndt, D., Berjanskii, M., Tang, P., Zhou, J., and Lin, G. (2008) CS23D: a web server for rapid protein structure generation using NMR chemical shifts and sequence data, Nucleic acids research 36, W496-W502.

Wu, S., and Zhang, Y. (2007) LOMETS: A local meta-threading-server for protein structure prediction, Nucleic acids research 35, 3375-3382.

Xia, W., and Springer, T. A. (2014) Metal ion and ligand binding of integrin alpha5beta1, Proceedings of the National Academy of Sciences of the United States of America 111, 17863-17868.

Xia, Y., Vetvicka, V., Yan, J., Hanikyrova, M., Mayadas, T., and Ross, G. D. (1999) The beta-glucan-binding lectin site of mouse CR3 (CD11b/CD18) and its function in generating a primed state of the receptor that mediates cytotoxic activation in response to iC3b-opsonized target cells, Journal of immunology (Baltimore, Md. : 1950) 162, 2281-2290.

Xiang, Z. (2006) Advances in homology protein structure modeling, Current protein & peptide science 7, 217-227.

Xiao, T., Takagi, J., Coller, B. S., Wang, J. H., and Springer, T. A. (2004) Structural basis for allostery in integrins and binding to fibrinogen-mimetic therapeutics, Nature 432, 59-67.

Xie, C., Zhu, J., Chen, X., Mi, L., Nishida, N., and Springer, T. A. (2010) Structure of an integrin with an alphaI domain, complement receptor type 4, The EMBO journal 29, 666-679.

Xiong, J. P., Li, R., Essafi, M., Stehle, T., and Arnaout, M. A. (2000) An isoleucine-based allosteric switch controls affinity and shape shifting in integrin CD11b A-domain, J Biol Chem 275, 38762-38767.

Xiong, J. P., Mahalingham, B., Alonso, J. L., Borrelli, L. A., Rui, X., Anand, S., Hyman, B. T., Rysiok, T., Muller-Pompalla, D., Goodman, S. L., and Arnaout, M. A. (2009) Crystal structure of the complete integrin alphaVbeta3 ectodomain plus an alpha/beta transmembrane fragment, The Journal of cell biology 186, 589-600.

Xiong, J. P., Stehle, T., Diefenbach, B., Zhang, R., Dunker, R., Scott, D. L., Joachimiak, A., Goodman, S. L., and Arnaout, M. A. (2001) Crystal structure of the extracellular segment of integrin alpha Vbeta3, Science (New York, N.Y.) 294, 339-345.

Xiong, J. P., Stehle, T., Goodman, S. L., and Arnaout, M. A. (2004) A novel adaptation of the integrin PSI domain revealed from its crystal structure, J Biol Chem 279, 40252-40254.

Xiong, J. P., Stehle, T., Zhang, R., Joachimiak, A., Frech, M., Goodman, S. L., and Arnaout, M. A. (2002) Crystal structure of the extracellular segment of integrin alpha Vbeta3 in complex with an Arg-Gly-Asp ligand, Science (New York, N.Y.) 296, 151-155.

Yakubenko, V. P., Belevych, N., Mishchuk, D., Schurin, A., Lam, S. C. T., and Ugarova, T. P. (2008) The Role of Integrin α(D)β(2) (CD11d/CD18) in Monocyte/Macrophage Migration, Experimental cell research 314, 2569-2578.

Yakubenko, V. P., Solovjov, D. A., Zhang, L., Yee, V. C., Plow, E. F., and Ugarova, T. P. (2001) Identification of the binding site for fibrinogen recognition peptide gamma 383-395 within the alpha(M)I-domain of integrin alpha(M)beta2, J Biol Chem 276, 13995-14003.

Yamada, K. M. (1991) Adhesive recognition sequences, J Biol Chem 266, 12809-12812.

Yang, H., Peisach, E., Westbrook, J. D., Young, J., Berman, H. M., and Burley, S. K. (2016) DCC: a Swiss army knife for structure factor analysis and validation, Journal of Applied Crystallography 49, 1081-1084.

Yang, J., and Zhang, Y. (2015) I-TASSER server: new development for protein structure and function predictions, Nucleic acids research 43, W174-W181.

Yang, J., Ma, Y. Q., Page, R. C., Misra, S., Plow, E. F., and Qin, J. (2009) Structure of an integrin alphaIIb beta3 transmembrane-cytoplasmic heterocomplex provides insight into integrin activation, Proceedings of the National Academy of Sciences of the United States of America 106, 17729-17734.

Yu, Y., Zhu, J., Mi, L. Z., Walz, T., Sun, H., Chen, J., and Springer, T. A. (2012) Structural specializations of alpha(4)beta(7), an integrin that mediates rolling adhesion, The Journal of cell biology 196, 131-146.

Zhang, H., Astrof, N. S., Liu, J. H., Wang, J. H., and Shimaoka, M. (2009) Crystal structure of isoflurane bound to integrin LFA-1 supports a unified mechanism of volatile anesthetic action in the immune and central nervous systems, FASEB journal : official publication of the Federation of American Societies for Experimental Biology 23, 2735-2740.

Zhang, H., Casasnovas, J. M., Jin, M., Liu, J. H., Gahmberg, C. G., Springer, T. A., and Wang, J. H. (2008) An unusual allosteric mobility of the C-terminal helix of a high-affinity alphaL integrin I domain variant bound to ICAM-5, Molecular cell 31, 432-437.

Zhang, H., Liu, J. H., Yang, W., Springer, T., Shimaoka, M., and Wang, J. H. (2009) Structural basis of activation-dependent binding of ligand-mimetic antibody AL-57 to integrin LFA-1, Proceedings of the National Academy of Sciences of the United States of America 106, 18345-18350.

Zhang, Y. (2008) Progress and challenges in protein structure prediction, Current opinion in structural biology 18, 342-348.

Zhang, Y. (2009) I-TASSER: fully automated protein structure prediction in CASP8, Proteins 77 Suppl 9, 100-113.

Zhang, Y., and Skolnick, J. (2005) TM-align: a protein structure alignment algorithm based on the TM-score, Nucleic acids research 33, 2302-2309.

Zhang, Y., Kihara, D., and Skolnick, J. (2002) Local energy landscape flattening: parallel hyperbolic Monte Carlo sampling of protein folding, Proteins 48, 192-201.

Zhou, D., Thinn, A. M. M., Zhao, Y., Wang, Z., and Zhu, J. (2018) Structure of an extended beta3 integrin, Blood 132, 962-972.

Zhu, J., Choi, W. S., McCoy, J. G., Negri, A., Zhu, J., Naini, S., Li, J., Shen, M., Huang, W., Bougie, D., Rasmussen, M., Aster, R., Thomas, C. J., Filizola, M., Springer, T. A., and Coller, B. S. (2012) Structure-guided design of a high-affinity platelet integrin alphaIIbbeta3 receptor antagonist that disrupts Mg(2)(+) binding to the MIDAS, Science translational medicine 4, 125ra132.

Zhu, J., Luo, B. H., Xiao, T., Zhang, C., Nishida, N., and Springer, T. A. (2008) Structure of a complete integrin ectodomain in a physiologic resting state and activation and deactivation by applied forces, Molecular cell 32, 849-861.

Zhu, J., Zhu, J., and Springer, T. A. (2013) Complete integrin headpiece opening in eight steps, The Journal of cell biology 201, 1053-1068.

Zhu, J., Zhu, J., Negri, A., Provasi, D., Filizola, M., Coller, B. S., and Springer, T. A. (2010) Closed headpiece of integrin alphaIIbbeta3 and its complex with an alphaIIbbeta3-specific antagonist that does not induce opening, Blood 116, 5050-5059.
