
Using C-Alpha Geometry to Describe Protein Secondary Structure and Motifs

by

Christopher Joseph Williams

Department of Biochemistry Duke University

Date:______Approved:

______David C. Richardson, Co-Supervisor

______Jane S. Richardson, Co-Supervisor

______Charles William Carter, Jr.

______Harold P. Erickson

______Terrence G. Oas

______Maria A. Schumacher

Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Biochemistry in the Graduate School of Duke University

2015

ABSTRACT

Using C-Alpha Geometry to Describe Protein Secondary Structure and Motifs

by

Christopher Joseph Williams

Department of Biochemistry Duke University

Date:______Approved:

______David C. Richardson, Co-Supervisor

______Jane S. Richardson, Co-Supervisor

______Charles William Carter, Jr.

______Harold P. Erickson

______Terrence G. Oas

______Maria A. Schumacher

An abstract of a dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Biochemistry in the Graduate School of Duke University

2015

Copyright © by Christopher Joseph Williams 2015. All rights reserved except the rights granted by the Creative Commons Attribution-Noncommercial License

Abstract

Three-dimensional atomic models from X-ray crystallography are used in a variety of research areas to understand and manipulate proteins. Both research and application depend on the quality of these models. Low-resolution experimental data is a common problem in crystallography, and it makes solving structures and producing the reliable models that many scientists depend on difficult.

In this work, I develop new, automated tools for validation and correction of low-resolution structures. These tools are gathered under the name CaBLAM, for C-alpha Based Low-resolution Annotation Method. CaBLAM uses a unique, Cα geometry-based parameter space to identify outliers in protein backbone geometry, and to identify secondary structure that may be masked by modeling errors.

CaBLAM was developed in the Python programming language as part of the Phenix crystallography suite and the open CCTBX Project. It makes use of architecture and methods available in the CCTBX toolbox. Quality-filtered databases of high-resolution protein structures, especially the Top8000, were used to construct contours of expected protein behavior for CaBLAM. CaBLAM has also been integrated into the codebase for the Richardson Lab’s online MolProbity validation service.

CaBLAM succeeds in providing useful validation feedback for protein structures in the 2.5-4.0Å resolution range. This success demonstrates the relative reliability of the Cα trace of a protein in this resolution range. Full mainchain information can be extrapolated from the Cα trace, especially for regular secondary structure elements.

CaBLAM has also informed our approach to validation for low-resolution structures. Moderation of feedback, to reduce validation overload and to focus user attention on modeling errors that are both significant and correctable, is one of our goals. CaBLAM and the related methods that have grown around it demonstrate the progress towards this goal.


Dedication

Ars gratia artis.

Mens gratia mentis.


Contents

Abstract

List of Tables

List of Figures

Acknowledgements

1. Introduction

2. CASP Retrospective

2.1 The CASP experiment

2.2 My CASP tools

2.2.1 Completeness

2.2.2 Sidechain alignment

2.3 Lessons from CASP

2.3.1 Mainchain reality score

2.3.2 Adjusted clash cutoff

2.4 Discussion

3. Challenges of low-resolution protein modeling

3.1 Causes of low-resolution data

3.1.1 Resolution

3.1.2 Crystal quality

3.1.3 Mobility and disorder

3.1.4 Correlation of Interest and Difficulty

3.2 Data quality

3.2.1 Ambiguous density

3.2.2 Misleading density

3.2.3 Missing density

3.2.4 Core versus surface density

3.3 Missing and truncated sidechains

3.4 Loops

3.5 Data versus geometry restraints

3.6 Discussion

4. The Cα geometry parameter spaces

4.1 Problems with all-atom parameter spaces

4.1.1 Ramachandran analysis at low resolution

4.1.2 DSSP annotation at low resolution

4.2 Developing the Cα parameter space

4.2.1 Cα pseudodihedrals

4.2.2 The plane dihedral

4.2.3 The Cα virtual angle

4.3 Representations of the parameter space

4.3.1 3D CaBLAM space

4.3.2 2D CaBLAM space

4.3.3 3D Cα geometry space

4.4 Populating the 3D parameter spaces

4.4.1 Dataset selection

4.4.2 Residue-level quality filtering

4.4.3 3D parameter spaces

4.5 DSSP letter-code prediction with CaBLAM

4.5.1 Dataset selection

4.5.2 Residue-level quality filtering

4.5.3 Defining secondary structure with DSSP

4.5.4 Bayesian probabilities for DSSP codes

4.5.5 Using DSSP code prediction

4.5.6 Visual annotation for DSSP code prediction

4.5.7 Problems with DSSP code prediction

4.6 Populating the secondary structure parameter spaces

4.6.1 Dataset selection

4.6.2 Residue-level quality filtering

4.6.3 Developing “fingerprints” for defining secondary structure

4.6.4 Alpha helix

4.6.4.1 Regular alpha helix

4.6.4.2 Loose alpha helix

4.6.5 Beta sheet

4.6.5.1 Long beta strand fingerprints

4.6.5.2 Beta bridge fingerprints

4.6.5.3 Regular beta sheet

4.6.5.4 Loose beta sheet

4.6.6 3₁₀ helix

4.6.6.1 Loose 3₁₀ helix

4.7 Locating structure in CaBLAM space

4.7.1 Handedness

4.7.2 Alpha helix in CaBLAM space

4.7.3 Beta sheet in CaBLAM space

4.7.4 Other motifs in CaBLAM space

4.8 Discussion

5. CaBLAM validation

5.1 Methodology predecessors

5.2 Setting contour cutoffs

5.2.1 3D CaBLAM space

5.2.2 2D CaBLAM space

5.2.3 3D Cα geometry space

5.3 CaBLAM workflow

5.3.1 Read-in and calculation

5.3.2 Find CaBLAM space outliers

5.3.3 Check for Cα geometry outliers

5.3.4 Assign secondary structure

5.4 Validation feedback

5.4.1 Text

5.4.2 Oneline

5.4.3 Kinemage markup

5.4.3.1 CaBLAM outlier and disfavored markup

5.4.3.2 Cα geometry outlier markup

5.4.3.3 Embedded text feedback

5.4.3.4 CaBLAM-generated secondary structure ribbons

5.4.4 Other feedback options

5.5 Success rates

5.5.1 False positives

5.5.2 False negatives

5.5.3 Comparison to other methods

5.5.3.1 Comparison to Ramachandran analysis

5.5.3.2 Comparison to DSSP

5.5.3.3 Comparison to expert human inspection

5.6 Manual corrections from CaBLAM validation

5.6.1 Methods

5.6.2 Technical challenges

5.6.3 Results of manual correction

5.7 Discussion

6. Protein structure motifs in CaBLAM

6.1 Motif “fingerprints”

6.2 The cablam_training tool

6.2.1 Kinemage output

6.2.2 Structure annotation

6.2.3 Sequence

6.2.4 Superposition

6.3 Tyrosine corners

6.3.1 Characteristics

6.3.2 In CaBLAM space

6.3.3 Sequence correlations

6.4 Widened helix turns

6.4.1 Characteristics

6.4.2 In CaBLAM space

6.4.3 Sequence correlations

6.5 Double tight turns

6.5.1 Characteristics

6.5.2 In CaBLAM space

6.5.3 Sequence correlations

6.6 Non-sequential helix bonding

6.7 Discussion

7. Low-resolution validation in MolProbity

7.1 Challenges of low-resolution validation

7.1.1 Validation overload

7.1.2 Secondary structure errors

7.1.3 Loop uncertainty

7.2 Clash cutoff adjustment

7.3 New coloring

7.3.1 Coloring for clashes

7.3.2 Coloring for Ramachandran and rotamers

7.3.3 Coloring for geometry validation

7.3.4 Coloring for CaBLAM

7.3.5 Coloring for cis-peptides

7.3.6 Coloring for RNA validation

7.3.7 Extendibility of coloring

7.4 CaBLAM in MolProbity

7.4.1 The MolProbity structure summary table

7.4.2 The MolProbity multicriterion table

7.5 “Omegalyze” cis-peptide validation

7.5.1 Cis-peptide geometry and occurrence

7.5.2 A call for cis-peptide validation

7.5.3 Twisted peptides

7.5.4 Omegalyze

7.5.4.1 Text output

7.5.4.2 Kinemage markup

7.5.4.3 MolProbity feedback

7.6 Future validations

7.6.1 Cis versus trans validation with CaBLAM

7.6.2 Motif validation with CaBLAM

7.7 Discussion

8. Conclusion

A. cablam_training commandline options

A.1 How cablam_training works

A.2 Available measures

A.2.1 Standard measures

A.2.2 CaBLAM measures

A.2.3 Artifact measures

A.3 Quality control

A.4 Other output options

A.5 Motif searching

A.5.1 Motif search outputs

B. How to write and format a motif fingerprint for CaBLAM

B.1 Setup

B.2 Class instantiation

B.3 Adding residues

B.4 Adding bonds

B.5 Adding bond targets

B.6 General notes and final checks

B.7 Producing the fingerprint

References

Biography

List of Tables

Table 4.1: Success rates for automated DSSP annotation by CaBLAM

Table 5.1: Validation statistics for 4hum.pdb through our intervention process.

Table 5.2: Validation statistics for E. coli ribosomal protein S7 through our intervention process.

List of Figures

Figure 2.1: Sample alignment of CASP predictions to target.

Figure 2.2: Results of CASP8 assessment.

Figure 2.3: Proliferation of clashes and other errors reveals misplaced loop insertion. Taken from Keedy et al, 2009.

Figure 3.1: Falloff of diffracting power as a function of increasing diffraction angle.

Figure 3.2: Ambiguous electron density for a helix in 4o6p.pdb, a 3.0Å structure.

Figure 3.3: Misleading electron density for a "helix" in 2o01.pdb, a 3.4Å structure.

Figure 3.4: Abrupt disappearance of electron density for a loop in 2gec.pdb, a 1.3Å structure.

Figure 3.5: Dependency of clashscore on data resolution in structures refined without geometry restraints.

Figure 3.6: Dependency of bond geometry outliers on data resolution in structures refined without geometry restraints.

Figure 3.7: Dependency of Ramachandran and Rotamer outliers on data resolution in structures refined without geometry restraints.

Figure 4.1: Typical helices from 2o01.pdb, showing misplaced peptide planes and disrupted H-bonding.

Figure 4.2: Poorly modeled helix residues from 2o01.pdb in Ramachandran space.

Figure 4.3: Typical three-carbonyl modeling error from 70S E. coli ribosome.

Figure 4.4: Beta strand modeling errors from 70S E. coli ribosome in Ramachandran space.

Figure 4.5: Poorly modeled helices from 2o01.pdb with DSSP-based secondary structure annotation.

Figure 4.6: Ribbons show sparse identification of helices by ksdssp for 2o01.pdb chain A.

Figure 4.7: Ramachandran space distribution for characteristic residue of widened helix

Figure 4.8: Schematic representation of μin and μout for a single residue.

Figure 4.9: μin relationship for residue i, shown in an alpha helix.

Figure 4.10: Schematic representation of ν for a residue.

Figure 4.11: The ν dihedral, shown in an alpha helix.

Figure 4.12: Schematic representation of the Cα virtual angle.

Figure 4.13: General case contours in 3D CaBLAM space at 5% and 10%.

Figure 4.14: General case contours in 3D Cα geometry space at 1% and 5%.

Figure 4.15: Proline case contours in 3D CaBLAM space at 5% and 10%.

Figure 4.16: Proline case contours in 3D Cα geometry space at 1% and 5%.

Figure 4.17: Glycine case contours in 3D CaBLAM space at 5% and 10%.

Figure 4.18: Glycine case contours in 3D Cα geometry space at 1% and 5%.

Figure 4.19: CaBLAM's visual annotation of predicted DSSP codes for helices from 2o01.pdb.

Figure 4.20: CaBLAM's visual annotation of predicted DSSP codes for beta strands from 2a0l.pdb

Figure 4.21: Contours for regular alpha helix definition in 2D CaBLAM space. (x-axis is μin, y-axis is μout.)

Figure 4.22: Contours for loose alpha helix definitions in 2D CaBLAM space. (x-axis is μin, y-axis is μout.)

Figure 4.23: Contours for regular beta strand definition in 2D CaBLAM space. (x-axis is μin, y-axis is μout.)

Figure 4.24: Contours for loose beta strand definition in 2D CaBLAM space. (x-axis is μin, y-axis is μout.)

Figure 4.25: Contours of loose 3₁₀ helix definition in 2D CaBLAM space. (x-axis is μin, y-axis is μout.)

Figure 4.26: Abstract representation of Cα trace handedness in 2D CaBLAM space.

Figure 4.27: Comparison of alpha helix and 3₁₀ helix contours in 2D CaBLAM space. (x-axis is μin, y-axis is μout.)

Figure 5.1: Comparison of regular (red) and loose (orange) alpha helix definitions in 2D CaBLAM space. (x-axis is μin, y-axis is μout.)

Figure 5.2: Comparison of regular (green) and loose (blue) beta strand definitions in 2D CaBLAM space. (x-axis is μin, y-axis is μout.)

Figure 5.3: All final secondary structure cutoffs superimposed in 2D CaBLAM space. (x-axis is μin, y-axis is μout.)

Figure 5.4: Superposition of CaBLAM's Cα geometry contours (blue mesh) with Gerard Kleywegt's.

Figure 5.5: Cα geometry contours for the general case show near-total coverage of μin/μout space at the 0.5% cutoff.

Figure 5.6: CaBLAM kinemage annotation for a severe outlier (pink) and a disfavored conformation (purple).

Figure 5.7: CaBLAM kinemage annotation for residues in helices from 2o01.pdb

Figure 5.8: Cα geometry outlier annotation for helices in 2o01.pdb.

Figure 5.9: Ribbons for 2o01.pdb generated by Molikin used HELIX records supplied by CaBLAM.

Figure 5.10: Poorly modeled helix residues from 2o01.pdb in 3D CaBLAM space.

Figure 5.11: Poorly modeled helix residues from 2o01.pdb in 2D CaBLAM space with contours for alpha helix. (x-axis is μin, y-axis is μout.)

Figure 5.12: Beta strand modeling errors from 70S E. coli ribosome in 3D CaBLAM space.

Figure 5.13: Beta strand modeling errors from 70S ribosome in 2D CaBLAM space with contours for beta sheet. (x-axis is μin, y-axis is μout.)

Figure 5.14: Comparison of secondary structure identification by ksdssp (left) versus CaBLAM (right) as demonstrated by helix ribbons on 2o01.pdb.

Figure 5.15: 4hum.pdb before (left) and after (right) our intervention.

Figure 5.16: E. coli ribosomal protein S7 before (left) and after (right) our intervention.

Figure 6.1: Typical cablam.training kinemage output, showing the 2D CaBLAM space distribution of a characteristic residue from the widened helix turn. (x-axis is μin, y-axis is μout.)

Figure 6.2: Annotation output from cablam.training for a bifurcation-stabilized widened helix turn in 1y7t.pdb.

Figure 6.3: Sample WebLogo sequence frequency summary for bifurcation-stabilized widened helix turns.

Figure 6.4: Superposition of bifurcation-stabilized widened helix turns by CaBLAM, showing 20-30° bend of helix.

Figure 6.5: Example of NH of Y-5 tyrosine corner from 1epw.pdb.

Figure 6.6: Superposition of instances of CO of Y-4 tyrosine corners shows one of the common conformations of the motif. (x-axis is μin, y-axis is μout.)

Figure 6.7: Sequence frequency Logos for the six Tyr corner variants. Residue 2 is the Y’s H-bond partner in all cases.

Figure 6.8: The characteristic cluster in Ramachandran space for the bifurcation-stabilized widened helix turn.

Figure 6.9: Typical bifurcation-stabilized widened helix turn from 1y7t, showing bifurcated bond in the foreground and missing bonds to the left.

Figure 6.10: Typical threonine-dependent widened helix turn, showing threonine (front) inserted into the helix bonding pattern.

Figure 6.11: CaBLAM space comparison of member residues of widened helix turns.

Figure 6.12: Instances of widened helix turns superimposed in 2D CaBLAM space show comparison of bifurcation-stabilized turn (left) to threonine-dependent turn (right). (x-axis is μin, y-axis is μout.)

Figure 6.13: WebLogos for bifurcation-stabilized (top) and threonine-dependent (bottom) widened helix turns, aligned for comparison.

Figure 6.14: A double tight turn from 2cn3.pdb shows characteristic 3 H-bonds among 3 residues.

Figure 6.15: Instances of double tight turns superimposed in CaBLAM space show the motif favors a conserved, repeating path. (x-axis is μin, y-axis is μout.)

Figure 6.16: WebLogo for double tight turns.

Figure 6.17: Non-sequential helix bonding of two helices.

Figure 6.18: Not what “three-helix junction” usually means.

Figure 6.19: Non-sequential helix bonding with an antiparallel beta strand.

Figure 6.20: Non-sequential helix bonding bracketing a long loop.

Figure 7.1: Validation overload in 2o01.pdb.

Figure 7.2: Laterally compressed helix from 2o01.pdb.

Figure 7.3: Validation overload reduced via new clash cutoff and CaBLAM markup in place of rotamer outliers.

Figure 7.4: Comparison of old (left) and new (right) multicriterion table coloring for clashes in 4hum.pdb.

Figure 7.5: Comparison of old (left) and new (right) multicriterion table coloring for Ramachandran and rotamer analysis in 4hum.pdb.

Figure 7.6: Multicriterion table coloring for CaBLAM validation in 4hum.pdb.

Figure 7.7: Multicriterion table coloring for cis-peptide validation.

Figure 7.8: New summary table for MolProbity, including peptide omegas and low-resolution validation for 4hum.pdb.

Figure 7.9: A cis-proline.

Figure 7.10: Schematic representation of dihedral space revealing "twisted" regions not near either planar conformation.

Figure 7.11: Kinemage annotation for cis-peptides.

Figure 7.12: Kinemage annotation for twisted peptides.

Figure 7.13: Comparison of distribution of trans-proline (blue) and cis-proline (orange) in 2D CaBLAM space. (x-axis is μin, y-axis is μout.)

Acknowledgements

I wish in particular to thank my advisors, David and Jane Richardson. It has been a pleasure to work for and with them. Their dedication to education, to mentoring, and to student advocacy has been an inspiration to me. I hope to carry on their example in the rest of my career.

The Richardson Lab is a remarkable group of people, all of whom have contributed to my education and to this work. I owe special gratitude to Vincent Chen for his work on the 70S ribosome which provided an inspiration and reference point for my work with CaBLAM, to Jeff Headd for his knowledge of the Phenix environment, to Dan Keedy and Bryan Arendall for developing the Top8000 database, and to Bradley Hintze for his friendship and collaboration on low-resolution matters.

Our collaborators in the Phenix project have provided resources and knowledge of incalculable value. Nat Echols and Tom Terwilliger have been especially encouraging of my work on secondary structure.

My family has done well by me (Trelease, 2013). Both their nature and their nurture have set me on a rewarding path.

I have been involved in the University Scholars Program since my enrollment at Duke. This program brings together undergraduates, graduate students, and professional school students with interdisciplinary interests. It has presented me with an unparalleled opportunity to get out of the lab, get out of the department, and stimulate my mind with diverse worldviews. My sincere gratitude goes to Tori Lodewick, who advises the program, and to Melinda French Gates who originated it and funds it.

I am grateful to all the sources of funding that have supported my work. In particular, I wish to recognize the Eureka Grant, which supports speculative research like the early stages of my work on low-resolution validation. Speculative research is generally underappreciated, and the Eureka Grant helps support the reality of scientific exploration.


1. Introduction

X-ray crystallography revolutionized biochemistry. The ability to see all the atoms in a macromolecule and their relationships to each other made possible new experiments and gave birth to new fields of study. It had been possible to get a general sense of a protein’s geometry through various biochemical assays. But biochemical assays are generally limited by the expectations of the researchers performing them. It is difficult, though not impossible, to find something you are not looking for with these methods. Crystallography reveals all and lays bare the unexpected to anyone with the ability simply to look at a structure (Richardson & Richardson, 2014).

My supervisors, David and Jane Richardson, are fond of a story from the early days of protein crystallography, in the midst of this revolution. They were building a large physical model of a protein structure they were solving (Staph nuclease, see Arnone et al, 1971). In those days, this was a long and arduous process. But as they progressed, they were encouraged by the slow stream of biochemists who would come to look at the growing model. The biochemists would peer into the brass skeleton of the protein, take a few notes on an interaction evident from the model, then hurry off to test that interaction with their notes and familiar assays. And they would return excited by the successes that the Richardsons’ model had enabled.


We now take the ability to see the structure of proteins largely for granted. Crystallography has advanced to become an automated tool for most biochemists, rather than a discipline unto itself. Structural study has become a staple of biochemistry.

Structures are, of course, useful in elucidating functions. Binding sites and active sites are better understood when they can be seen. In some cases, and with the aid of carefully selected ligands, models of transitional states can be obtained that explain or suggest reaction mechanisms (e.g. Rittinger, 1997 – 1tx4.pdb; Akamine, 2002 – 1l3r.pdb). And the structural patterns we see across all proteins help us to understand how proteins maintain the delicate balance of stability necessary for their functions.

Structures also feed into design methods. Rational drug design, one of the potential “holy grails” of medicine, relies upon protein structures. If a protein can be understood at the atomic level, as a crystallographic structure allows, then a drug can be designed to target that protein with specific effects (Gane & Dean, 2000). Protein design likewise aspires to start a medical revolution and likewise starts from protein structures. Protein design looks to redesign known proteins to serve new functions, from biosensors (e.g. Tinberg et al., 2013) to efficient drug synthesis (e.g. Geogiev, 2008).

The Richardson Lab itself started in the field of protein design, but moved into structure validation following a revelation concerning the significance of hydrogen atoms in understanding van der Waals contacts in proteins. We have been validators of macromolecular structures ever since. I often describe us as “proofreaders of protein structures” to those outside of our field. While modeling errors in protein structures are more complex than misspellings, the analogy is apt. We “read” structures, find errors, and help the authors of those structures try to improve them.

Accurate, error-free structures are important to their authors as more than just a matter of pride. Function cannot be reliably understood from an unreliable structure. Drugs cannot be reliably designed to fit a structure with modeling errors in its active site. Proteins cannot be reliably redesigned if they were not modeled realistically to start with. Structure authors have a responsibility to themselves to validate their structures and build the best models that they can.

But accurate, error-free structures are important to the entire biological and biomedical research community as well. The Worldwide Protein Data Bank (PDB) acts as the single public repository of structural knowledge in the form of deposited structures of biological macromolecules (Berman et al., 2000). Scientists other than the original authors can and do access deposited structures for their own work. Many of the Richardson Lab’s own validation criteria come from statistical studies of the structures in the PDB, as do the findings of other structural bioinformatics studies (e.g. Berkholz, 2009). Protein design and prediction algorithms draw from the PDB for rotamer (e.g. Dunbrack & Karplus, 1993) and fragment (e.g. Xu & Zhang, 2012; Gront, 2011) libraries. Even new structures being deposited into the PDB depend on the reliability of the database, since the majority of crystal structures are phased by molecular replacement from homologous prior PDB entries. Structure authors have a duty to the community to validate their structures and share the best models that they can.

The PDB has grown enormously in recent years. At the time of my college graduation (05-20-2007), there were 46,635 structures deposited in the PDB. My graduate career has seen that number more than double to over 100,000. Interest in (and ability to solve) low-resolution structures has grown even more. At the time of my college graduation, there were 9,649 structures at 2.5-5Å resolution in the PDB. Now there are almost 25,000.

Low-resolution structures are of particular interest to the Richardson Lab in general and to me in particular because of the unique difficulties involved in their solution and validation. Low-resolution experimental data presents the intriguing puzzle of how to construct a good protein structure model from much less input information. This work presents a new validation method I developed called CaBLAM (C-alpha Based Low-resolution Annotation Method) for addressing the challenges of low-resolution structures, as well as the other methods that I have developed alongside, in support of, or in response to CaBLAM.

Chapter 2 describes the lab’s role as assessors for the CASP8 experiment. CASP8 assessment was my introduction to the Richardson Lab, its practices and its philosophies. I briefly reflect on how the experience influenced my subsequent work.


Chapter 3 describes the challenges inherent to the low-information environment of low-resolution crystallography. I define “low-resolution” as the range from about 2.5-4.0Å because of the character of the experimental data in this region. Specifically, electron density at resolutions better than 2.5Å is sufficient to resolve individual atoms or groups, and electron density at resolutions worse than 5.0Å is no longer sufficient to resolve the mainchain trace of a protein. Within this region, individual atoms cannot be resolved, but a fairly reliable Cα trace can be modeled. I give context for the ramifications of this level of data quality on protein structures solved in this resolution range.

Chapter 4 describes the unique parameters I developed for CaBLAM. These parameters take advantage of the reliable Cα trace in the 2.5-4.0Å resolution range to create a parameter space that reliably conveys conformational information for low-resolution structures. The various parameter spaces I constructed are discussed along with their uses. The properties of these spaces are described, using the familiar secondary structure types as reference points.

Chapter 5 finally describes the validation method central to CaBLAM. CaBLAM validation depends on contours similar to those used in Ramachandran validation. The specific contours used and the development of significant cutoff values for those contours are discussed. The workflow for CaBLAM validation is shown, and its various forms of output are detailed. Our difficulties in effecting improvements in low-resolution structures are discussed along with some hopes for future methods to improve the process.

Chapter 6 describes the second major feature of CaBLAM: a tool for searching structures for motifs of interest. This feature was used to generate the contours for secondary structure identification used by CaBLAM’s validation functionality. The methods by which motifs are described and identified are discussed. The power of the tool for more general use is demonstrated with a collection of motifs it has been used to study, ranging from the familiar tyrosine corner to the fantastical non-sequential helix bonding.

Chapter 7 describes the integration of CaBLAM validation into the MolProbity webservice. CaBLAM is the first of our efforts to develop a suite of methods to better validate low-resolution structures. The challenges of low-resolution validation are discussed, along with changes to MolProbity to better address these challenges. An additional new method, “Omegalyze”, is described. Omegalyze is a validation tool for cis-peptides. The chapter closes with speculation on future validations stemming from CaBLAM.


2. CASP Retrospective

2.1 The CASP experiment

Although I had done a rotation in the Richardson Lab before joining, my real introduction to the lab, its philosophies, and its practices was our assessment of CASP8 (Keedy et al., 2009). CASP, or Critical Assessment of protein Structure Prediction, is a semi-competitive group experiment held every two years among labs interested in prediction of folded protein structure. Predictor labs are given the primary sequences for a number of proteins. These labs then attempt to predict the correct final folded conformation of those proteins. Predictions are submitted to a panel of assessors and compared against the “target” experimentally solved crystal structures for those proteins (temporarily held back from publication for the sake of the CASP experiment) for accuracy. The Richardson Lab served as assessors for CASP8, the eighth iteration of the experiment, in 2008, just after I joined the lab.

Previous iterations of CASP (Tramontano & Morea, 2003 for CASP5; Kryshtafovych, 2007 for CASP6 and CASP7; Kopp, 2007 for CASP7) had focused primarily on the alignment of the Cα atoms between a predicted structure and its target structure. As validators of protein structure, we wanted to apply a broader range of quality criteria to the assessment of predicted structures. We also believed that the field was ready for and would benefit from a fresh set of assessments. So we set about crafting new validation criteria for the match between prediction and target that took into account protein features beyond just the Cα trace.

Figure 2.1: Sample alignment of CASP predictions to target.

Taken from Keedy et al, 2009. To demonstrate the variety of alignments generated by CASP predictors, all predictions for target T0512 (shown as colored ribbons) are superimposed on the target in their best alignment.

2.2 My CASP tools

My own work on CASP8 focused on two main areas – completeness of prediction and sidechain positioning.

2.2.1 Completeness

Because previous CASP assessments focused strongly on the Cα positions, standards for structural completeness in CASP models were unreliable. However, all of the new quality criteria we wished to apply required complete mainchain modeling, and many also required complete sidechains. We therefore had to determine for each prediction whether enough of the protein had been modeled to allow a meaningful comparison to the target. I wrote a short script to check the percent completeness of a predicted model against its target. This script reported both mainchain completeness and sidechain completeness so that appropriate assessments could be used on each model. While not a revolutionary piece of code by any means, it gave me an appreciation for the many foibles of interpreting PDB files (and PDB-like files).
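The following is a minimal sketch of such a completeness check, not the original script. It assumes well-formed, fixed-column PDB ATOM records and matching chain IDs and residue numbers between model and target; the function names are hypothetical.

```python
MAINCHAIN_ATOMS = {"N", "CA", "C", "O"}

def heavy_atoms_by_residue(pdb_lines):
    """Map (chain, residue number, insertion code) -> set of non-hydrogen atom names."""
    residues = {}
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        name = line[12:16].strip()       # atom name, PDB columns 13-16
        element = line[76:78].strip()    # element symbol, PDB columns 77-78
        # Fall back on the atom name when the element column is empty.
        is_hydrogen = element in ("H", "D") or (
            not element and name.lstrip("0123456789")[:1] == "H")
        if is_hydrogen:
            continue
        residue_key = (line[21:22], line[22:26].strip(), line[26:27])
        residues.setdefault(residue_key, set()).add(name)
    return residues

def percent_complete(target_lines, model_lines):
    """Percent of the target's mainchain and sidechain heavy atoms present in the model."""
    target = heavy_atoms_by_residue(target_lines)
    model = heavy_atoms_by_residue(model_lines)
    expected = {"mainchain": 0, "sidechain": 0}
    found = {"mainchain": 0, "sidechain": 0}
    for residue_key, target_atoms in target.items():
        model_atoms = model.get(residue_key, set())
        for atom in target_atoms:
            # Anything that is not N, CA, C, or O (including a terminal OXT)
            # is counted as sidechain in this sketch.
            group = "mainchain" if atom in MAINCHAIN_ATOMS else "sidechain"
            expected[group] += 1
            if atom in model_atoms:
                found[group] += 1
    return {g: (100.0 * found[g] / expected[g] if expected[g] else None)
            for g in expected}
```

A Cα-only prediction would score 25% mainchain completeness and 0% sidechain completeness under this scheme, which is the kind of distinction needed to decide which assessments apply to a given model.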

Fortunately for me, many of the most severe difficulties in PDB interpretation were managed by Vincent Chen. Presumably because structure predictors do not deposit their results into a public repository like the PDB, standards for file formatting among CASP predictors were frighteningly lax. The occasional non-standard atom name was to be expected, but the PDB format is defined by columns, and many predictions placed information in the wrong columns, rendering their files unreadable to standard methods. Vincent put considerable effort into writing “remediator” programs to correct these formatting errors so that my scripts and others could accurately assess the contents of the predictions.

2.2.2 Sidechain alignment

For my contribution to assessments proper, I worked with Adam Zemla on a measure of sidechain position recapitulation. Adam had written a program called LGA (Local-Global Alignment) that had been used as one of the main assessment tools in previous years of CASP (Zemla, 2003). LGA could search for the best spatial alignment between a predicted model and its target and output a value corresponding to the closeness of the alignment. By default, LGA only aligned on Cα atoms and only reported on the closeness of Cα atoms in the alignment. I compiled a list of one or two atoms for each amino acid type that would be characteristic of sidechain position. The list was as follows: Val:CG1, Leu:CD1, Ile:CD1, Pro:CG, Met:CE, Phe:CZ, Trp:CH2, Ser:OG, Thr:OG1, Cys:SG, Tyr:OH, Asn:OD1, Gln:OE1, Asp:OD1/OD2, Glu:OE1/OE2, Lys:NZ, Arg:NH1/NH2, His:NE2. The selected atoms were generally at the end of the sidechain farthest from the mainchain. The selected atoms were hydrogen-bonding partners when possible, since recapitulation of those atom positions would be important to recapitulating structurally or functionally important bonds. Where atoms were functionally and geometrically identical, as in the case of aspartate’s OD1 and OD2 atoms, the atoms were considered equivalent. Adam added a mode to LGA to include these atoms in alignment and assessment, and I ran the program on all predictions with modeled sidechains. I believe that the assessment Gary Kapral used to determine whether sidechain hydrogen-bonds from the target were recapitulated in the prediction was ultimately a better indicator of prediction quality. However, my incorporation of a sidechain assessment into LGA has the advantage that it made a sidechain assessment available to future CASP assessors in a program with which they would already be comfortable and familiar.
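The atom list above can be written as a lookup table. The sketch below is illustrative only and is not LGA's implementation: the function name is hypothetical, the coordinates are assumed to be already superimposed, and treating equivalent atoms by taking the better of the direct and swapped label assignments is my assumption about one reasonable way to honor the equivalence.

```python
import math

# Representative sidechain atoms per residue type, as listed in the text.
# Two-atom entries are the pairs treated as equivalent.
REPRESENTATIVE_ATOMS = {
    "VAL": ("CG1",), "LEU": ("CD1",), "ILE": ("CD1",), "PRO": ("CG",),
    "MET": ("CE",), "PHE": ("CZ",), "TRP": ("CH2",), "SER": ("OG",),
    "THR": ("OG1",), "CYS": ("SG",), "TYR": ("OH",), "ASN": ("OD1",),
    "GLN": ("OE1",), "ASP": ("OD1", "OD2"), "GLU": ("OE1", "OE2"),
    "LYS": ("NZ",), "ARG": ("NH1", "NH2"), "HIS": ("NE2",),
}

def sidechain_deviation(resname, target_atoms, model_atoms):
    """Mean distance (in Å) between representative atoms of one residue.

    target_atoms and model_atoms map atom names to (x, y, z) tuples taken
    from already-superimposed structures. Returns None for residue types
    without an entry (Gly, Ala) or when a needed atom is missing.
    """
    names = REPRESENTATIVE_ATOMS.get(resname)
    if names is None:
        return None
    if any(n not in target_atoms or n not in model_atoms for n in names):
        return None
    direct = sum(math.dist(target_atoms[n], model_atoms[n]) for n in names)
    # For equivalent pairs, also try the swapped labeling (OD1<->OD2, etc.).
    swapped = sum(math.dist(target_atoms[n], model_atoms[m])
                  for n, m in zip(names, reversed(names)))
    return min(direct, swapped) / len(names)
```

With this treatment, an aspartate whose OD1 and OD2 labels are exchanged relative to the target, but whose oxygens occupy the same two positions, scores as a perfect match.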


Figure 2.2: Results of CASP8 assessment.

Taken from Keedy et al, 2009. These plots show the distribution of predicted structures with respect to our assessment measures. The x-axis for each plot is the alignment of the prediction to the target, as calculated by Adam Zemla’s LGA; the y-axis of each plot is one of our new assessment measures. Panels d. and e. are of most interest to this work. Panel d. shows the sidechain alignment I developed with Adam. Panel e. shows the Mainchain Reality Score.


Overall, our new assessment criteria were a mixed success. There were a number of predictor groups who had indeed put in the effort to predict sidechain positions, and those groups were grateful for the recognition of their efforts. The Baker Lab (originators of Rosetta, a perennial “winner” of CASP) fared particularly well, since they have long used MolProbity as a measure of structure quality for their predictions. However, we were certainly rocking the boat, and it was an uphill fight to get the community as a whole thinking beyond Cα-only assessment.

2.3 Lessons from CASP

It is not clear that our lab’s assessments have had a lasting effect on the CASP community. However, the experience of CASP assessment has had a lasting effect on the Richardson Lab and on me in particular. CASP8 was one of the first experiences that forced us to deal systematically with very difficult and poorly-modeled structures. I suspect that my involvement with these difficult structures was part of what motivated me to pursue low-resolution validation.

2.3.1 Mainchain reality score

For the CASP predictions we developed a sort of alternate MolProbity Score called the Mainchain Reality Score, or MCRS. The MolProbity score is a combination of clashscore, Ramachandran score, and rotamer score, scaled to resemble a crystallographic resolution. Relatively few CASP predictors put significant effort into accurate or protein-like sidechain modeling. For predictions with poor sidechains, the poor rotamer score could mask the other quality measures in the MolProbity Score. We needed a validation measure that focused on protein mainchain quality, without requiring or being influenced by sidechains.

The MCRS was a weighted combination of mainchain atom-mainchain atom clashes, Ramachandran outliers, mainchain bond length outliers, and mainchain bond angle outliers. Because “score” was a more intuitive quality measure than resolution in the CASP context, the MCRS was scaled to act as a score out of 100, though a sufficiently poor mark in any of its constituent categories would be sufficient to send the score to 0.

The Mainchain Reality Score gave us a new measure, specifically tuned to the problem of CASP predicted models. The MCRS has not found a permanent home among our validation methods. However, once we have developed additional low-resolution validation methods, I expect that we will revisit the MCRS, with an eye towards building a similar alternate aggregate score for low-resolution structures.

2.3.2 Adjusted clash cutoff

The other major finding from CASP assessment that has carried forward into my work involves all-atom contact clashes. We found during CASP that for structures of dubious quality our usual assessment of clashes became saturated. By using a more forgiving clash cutoff, we reduced the saturation and could more clearly identify the most severe problems in the predicted structures. Our customary clash cutoff identified any overlap of 0.4Å or greater between non-bonding atoms as a clash. Our more forgiving cutoff required a greater overlap of 0.5Å to begin counting an interaction as a clash. This was especially effective in identifying places where a prediction contained a loop insertion at the wrong point in the sequence (Figure 2.3).
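The effect of the cutoff change can be illustrated with a small sketch. It assumes that the van der Waals overlaps for non-bonded atom pairs have already been computed by an all-atom contact program; the function name and the overlap values are made up for illustration.

```python
def count_clashes(overlaps, cutoff=0.4):
    """Count non-bonded atom pairs whose overlap (in Å) is at least the cutoff."""
    return sum(1 for overlap in overlaps if overlap >= cutoff)

# Hypothetical overlaps (in Å) for five non-bonded atom pairs in one model.
example_overlaps = [0.10, 0.42, 0.45, 0.55, 0.90]

customary = count_clashes(example_overlaps, cutoff=0.4)  # counts 4 pairs
forgiving = count_clashes(example_overlaps, cutoff=0.5)  # counts 2 pairs
```

Raising the cutoff drops the marginal overlaps from the count, so the pairs that remain are the ones most likely to mark a serious local error.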

Figure 2.3: Proliferation of clashes and other errors reveals misplaced loop insertion. Taken from Keedy et al, 2009.

Saturation of validation is a problem we face in low-resolution crystallographic structures as well, so we have carried this change of cutoff through to our regular validations. The case of misplaced loop insertions is analogous to one of our low-resolution validation interests: finding register-shifted regions in crystal structures. Improved signal-to-noise in low-resolution structures allows for better distinguishing of the strong signal clusters indicative of register errors. Certainly, we might have arrived at this solution without CASP, but the CASP8 assessment put us in an experimental mindset that led to us changing our usual parameters in new ways.


2.4 Discussion

It is an amusing irony that, after spending so much effort trying to look at structures in greater detail than the Cα trace, I found myself “regressing” to the Cα trace in my main project. The previous CASP assessors had a point, however, about the utility of the Cα trace in describing difficult structures. In their case, they were dealing with dubious predicted models. In mine, I found that challenging low-resolution models were of similar quality to challenging predicted models. Thus lessons we learned about managing challenging structures in CASP have applied to our later efforts to manage low-resolution structures.

The Cα trace proves to be a useful abstraction of protein geometry when other structural details are unreliable. Indeed, my work shows that the Cα trace is considerably more useful as an abstraction than we suspected at the time of CASP. It would be interesting to revisit CASP assessment with a more complete Cα trace assessment based on my work, though I doubt any of us are eager to wrangle 80,000 predicted structure files again. (Another signifier of progress: the PDB had ~60,000 structures at the time of our CASP8 assessment, so 80,000 structures seemed like an enormous number. The PDB and our file handling abilities have both expanded since then.)

The CASP assessment was a remarkable introduction to the lab for me. It was a time that showcased all of our validation methods, and also a time that required that we carefully consider the meanings and mechanisms of those validations. While the CASP experience did not lead directly into my main project, the philosophies we developed influenced my approach to low-resolution structures.


3. Challenges of low-resolution protein modeling

X-ray crystallography is a powerful tool for discovering the atomic-level structure of proteins (and RNAs) of biological interest. Unfortunately, the power of this method is limited by the quality of the data obtained by it.

3.1 Causes of low-resolution data

A precise, in-depth understanding of the origins of low-resolution crystallography data is much less important for following the logic of this project than an understanding of the implications of low-resolution data for model building and validation. Nevertheless, a general understanding of crystallographic data quality is useful in providing context for the challenges of low-resolution structures that this project addresses.

3.1.1 Resolution

The quality of x-ray crystallography data is most often evaluated in terms of “resolution”. Crystallographic resolution has an intuitive (and mathematical) correlation to optical resolution. In optics, resolution refers to the ability to distinguish between two adjacent points. In crystallography, this is analogous to the ability to distinguish between two adjacent features. Thus “atomic-scale resolution” affords distinction between adjacent atoms.

In crystallography, resolution can also be understood through Bragg’s Law (Bragg & Bragg, 1913), commonly expressed as:

\[ n\lambda = 2d\sin\theta \]

Each diffraction spot in a crystallography experiment is associated with a particular set of Bragg planes. Bragg planes with a small distance between them (a small d in the Bragg equation) represent features that are close together in the protein. A small d in the Bragg equation results in a large angle θ for constant n and λ. Thus Bragg planes with a small d have a high scattering angle, and Bragg planes with a large d have a low scattering angle. The relationship is inverse.
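The inverse relationship can be checked numerically by solving Bragg’s Law, nλ = 2d sin θ, for θ. The sketch below is only an illustration; the wavelength of 1.0Å is an assumed round value, and the function name is hypothetical.

```python
import math

def bragg_angle_degrees(d, wavelength=1.0, n=1):
    """Solve n*wavelength = 2*d*sin(theta) for theta, in degrees.

    d and wavelength are in Å. Raises ValueError when no angle satisfies
    the equation (that is, when n*wavelength exceeds 2*d).
    """
    sin_theta = n * wavelength / (2.0 * d)
    if not 0.0 < sin_theta <= 1.0:
        raise ValueError("no diffraction angle satisfies Bragg's Law")
    return math.degrees(math.asin(sin_theta))

# Halving the plane spacing d roughly doubles the angle at small angles.
angles = {d: bragg_angle_degrees(d) for d in (4.0, 2.0, 1.0)}
```

For a 1.0Å wavelength, plane spacings of 4Å, 2Å, and 1Å give θ of roughly 7°, 14°, and 30°, so finer features diffract to higher angles.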

Therefore, diffraction spots containing low-resolution information are found close to the center of the diffraction pattern, and spots contain higher and higher-resolution information as one moves further from the center of the pattern. However, the scattering power of atoms falls off as scattering angle increases (Figure 3.1). Eventually, the signal strength for a diffraction spot falls below the level of background noise, and no further spots can be detected at higher resolutions.

The point at which diffraction spots can no longer be distinguished from background is the resolution limit of the diffraction experiment. This limit is generally reported as the resolution of a crystallography structure. Resolution is reported in Ångströms (Å), a unit of distance equal to 1×10⁻¹⁰ meters. This distance indicates the spacing of the Bragg planes associated with the last significant shell of visible diffraction spots.


Figure 3.1: Falloff of diffracting power as a function of increasing diffraction angle.

Due to some engineering background in signal processing, I tend to think of the resolution limit of a crystallography experiment as an interaction between signal strength and background noise. When the signal:noise ratio is too small, diffraction spots can no longer be distinguished. Therefore, I think of contributions to resolution limit in terms of contributions to signal and contributions to noise.

3.1.2 Crystal quality

The signal from any single diffraction event is too small to measure, so a multitude of simultaneous diffraction events must occur to produce a detectable diffraction spot. Crystallization produces a vast, ordered grid of individual instances of a protein to allow for this multitude of diffractions. Crystal quality strongly affects crystallographic signal quality. Dissimilarities among unit cells disrupt diffraction events, leading to a decrease in spot intensity. Proteins that crystallize poorly therefore tend to produce lower-resolution data.

The proteins most infamous for difficult crystallization are transmembrane proteins. Transmembrane proteins are typically found in the hydrophobic environment of the cell membrane and are difficult to solubilize as a result. Although advances have been made in crystallization of membrane proteins, the process remains difficult enough that mediocre crystals may be accepted for data collection simply by virtue of their existence.

Extremely large proteins or protein complexes may also be difficult to crystallize. With more atoms come more opportunities for variance in atom positions across unit cells, and these variants reduce coherency of diffraction. Two of the most-used examples of low-resolution structures in this project, Jamie Cate’s 70S E. coli ribosome and the 2o01.pdb Photosystem I structure, are of very large multi-chain complexes. Their size doubtless contributed to the low resolution of the data their crystals produced.

3.1.3 Mobility and disorder

Protein structures may contain highly mobile regions. These regions may be as small as a single loop or sidechain, or as large as an entire domain or protein chain. A high degree of mobility results in a high degree of variance among unit cells and a corresponding decrease in diffraction intensity.

Regions that adopt a small number of discrete conformations appear in the electron density as having alternate conformations. In the case of alternates, the electron content of a region is divided among two or more conformations. If the divided electron content of an alternate falls below the noise level, that alternate disappears from the density. Thus in low-resolution structures, where the noise level is relatively high, residues with alternate conformations may disappear, even if the alternates themselves are well-ordered.

Disordered regions tend to disappear from electron density entirely, but their disappearance belies their pernicious effect on the effective resolution of a structure. Disordered regions do not contribute systematically to diffraction spots, but do still diffract. The disordered diffraction of disordered regions contributes to the background noise between reflections. This raises the noise level in the diffraction pattern, burying already-faint higher-resolution diffraction spots in the background.

Protein regions may be mobile without being fully disordered. Concerted mobility still interferes with the alignment of proteins across unit cells, and thus still reduces diffraction spot intensity and increases background noise. However, in the case of concerted motions, there is accessible information content in the “noise”. TLS refinement can extract information on domain motions (namely the Translation, Libration, and Screw of TLS) from the pattern of the background. Advancements in TLS refinement may eventually help reduce the noise-like effect of protein mobility, but even then mobility-based noise will remain an important artifact in historical structures.

3.1.4 Correlation of Interest and Difficulty

An unfortunate irony of structural biology is the correlation between macromolecules of interest and macromolecules for which good experimental data is difficult to collect or whose structures are difficult to solve. The factors discussed above that contribute to low-resolution diffraction data are also associated with structures of great current interest to the community. Membrane proteins are attractive drug targets, but are notoriously difficult to crystallize. Ribosome structures are constantly improving (Yusupov, 2001; Schuwirth, 2005; Zhang, 2009, etc.), but their models are still hampered by their extreme size. Proteins with inherently unstructured regions or rapid folding-unfolding processes like SpA-N (Deis, 2014) can only be crystallized in some of their biological states. All these frontiers of structural research have to contend with low-resolution data on a regular basis.

Such proteins and systems have always been of interest, but recent technological advances have made pursuing their structures more plausible. In particular, improved access to synchrotron radiation sources and the continuing rise of computer storage and processing power have made possible data collection and structure solution for previously intractable proteins. More and more, the field will stand in need of improved structure solution methods to make sense of low-resolution data.

3.2 Data quality

While the crystallographic origins of resolution-limited structures are useful to understand, I have been most concerned with their practical effects on real-space electron density. Electron density is a common visualization used in the Richardson Lab to evaluate structure quality and intent. Crystallographic refinements can be conducted in (diffraction spot) reciprocal space or in (electron density) real space, so understanding the effects of low-resolution experimental data on electron density is key to understanding the challenges of solving low-resolution structures.

In this context “low resolution” means roughly 2.5Å to 4.0Å. The project originally targeted the 3Å-4Å range. However, we found that structures began to suffer from modeling errors typical of low-resolution density at 2.5Å or worse resolution. We also found that structures with resolutions worse than 3.5Å or so tended to be intractable, even with our new low-resolution-targeted methods. However, the quality of individual structures (and their data) is highly variable in this resolution range.

“Low resolution” is somewhat of a loaded term. Expectations of data quality vary from system to system and from method to method, and “low” sounds like a judgment upon the data. I use “low resolution” as shorthand for the 2.5Å-4.0Å range because of the content of the data in this range. The data does not contain enough information to resolve atomic positions, but does contain enough information to resolve the Cα trace of a protein. Section 3.5 will discuss the information content of this range further.

3.2.1 Ambiguous density

The most common symptom of low-resolution electron density is ambiguity. At high resolution, individual atoms can be clearly seen in the density. At middle resolutions, atom centers are no longer isolated peaks in the density, but many atoms still have an associated bulge in the density. At low resolution, electron density trends towards the appearance of a featureless tube that follows along the mainchain trace (Figure 3.2). At very low resolution, the distinction among adjacent mainchains is lost; alpha helices begin to appear as single thick tubes of density, and beta sheets begin to appear as wide, flat slabs of density.

Fitting an atomically-accurate protein model into the sort of ambiguous electron density encountered at low resolution is a daunting challenge. The details necessary to position the all-atom backbone trace are often missing. However, given continuous density, the Cα trace of the protein backbone is generally identifiable. (The identifiability of the Cα trace is significant, since the CaBLAM validation system exploits this property of low-resolution density.)


Figure 3.2: Ambiguous electron density for a helix in 4o6p.pdb, a 3.0Å structure.

3.2.2 Misleading density

A more pernicious symptom of low-resolution electron density is the presence of misleading features. Density ambiguities in the right (or wrong) place can imply the presence of incorrect features. The truncated electron density often associated with low-resolution sidechains (discussed in more detail shortly) can appear as an alternate mainchain trace. The true mainchain trace may not show a clear peak for the carbonyl oxygen, so a truncated sidechain peak may mislead humans and programs as an apparently attractive location for the mainchain oxygen (Figure 3.3).


Figure 3.3: Misleading electron density for a "helix" in 2o01.pdb, a 3.4Å structure.

Because low-resolution data results in broader density peaks, erroneous connectivity can appear at low resolution. For example, at good resolution, the beta strands of a sheet are clearly separated in the electron density. At very low resolution, the individual strands merge together into a continuous sheet of density. At the resolutions considered by this project, the strand density is transitioning from separate to continuous. Sidechains, ligands, and even the inter-strand hydrogen bonds can contribute to density that connects adjacent strands perpendicular to the actual strand direction. Misleading connectivity among protein regions can also interfere with loop modeling.


3.2.3 Missing density

Some features or regions effectively disappear from the electron density at low resolution. Properly, their presence in the density is close to or indistinguishable from the noise level. Mobile and unstructured regions are prone to disappearance from the density at all resolution levels. As the signal:noise ratio degrades at lower resolution, more features and regions fall below the noise level. Fitting these regions based on the experimental data becomes impossible, and they must either be left unmodeled or be estimated by programs like JiffiLoop (Chen, 2009) or Rosetta (Leaver-Fay et al, 2011).

3.2.4 Core versus surface density

An interesting feature of crystal structures is that they appear to have inhomogeneous resolution. The well-packed core appears to have higher resolution than loops or the exterior. This is curious because crystallographic resolution is a property of the diffraction data and is nominally a constant for a given structure. However, the electron density’s ability to resolve the difference between two proximal features is greater in some areas of structures than in others.

The appearance of inhomogeneous resolution can occur at any nominal resolution. In high-resolution structures, chain termini are known to be frequently disordered, with little or no density. If these regions are modeled without geometry restraints, impossible conformations result. The effect of inhomogeneous resolution is more pernicious at low resolution, however, since there is no clear distinction between hard-to-fit and impossible-to-fit regions. While high-resolution crystallographers are advised to be careful of under-resolved regions in their structures, lest they model improbable numbers of cis-peptides in the absence of density (see Section 7.5), I will focus on the phenomenon of inhomogeneous resolution in low-resolution structures.

The difference in apparent resolution is due to the same difficulties with mobility and disorder already discussed. The packed core of a protein is more constrained than the solvent-exposed edges and its conformation will be reproduced more consistently from unit cell to unit cell of a crystal. The edges of a protein undergo motion relative to that consistent core, and this motion reduces their contribution to the diffraction pattern.

The apparent inhomogeneity of electron density at low resolution is important to understanding what can be achieved with that density. We believe that the better electron density for protein cores and the packing constraints on those cores will allow protein cores to be well-modeled, even from low-resolution data, given appropriate tools. CaBLAM validation is a tool intended for this purpose. We expect that the edges of low-resolution proteins will prove much more difficult to model at high quality, since the effective resolution of the experimental data for those regions is even lower than for the rest of the protein.

3.3 Missing and truncated sidechains

Protein sidechains tend to be more mobile than the backbone. Even when the backbone does assume different local conformations, it tends to do so through small and well-defined motions such as the backrub (Davis, 2006), while sidechains may assume many disparate rotamers. In addition, greater resolving power is needed to differentiate sidechains from each other compared to the relatively large secondary structure elements the backbone forms. Therefore, electron density for sidechains tends to disappear at low resolution.

Often, the sidechain density does not disappear completely. Sidechain density near the mainchain remains, either due to those atoms’ conformational restriction or due to proximity to the stronger density peaks near the mainchain. The effect is an appearance of “truncated” sidechain density, with density extending only to a sidechain’s Cβ or possibly Cγ position (Figure 3.2). Further information on sidechain position is lost from the density.

Both humans and automated fitting programs seem to have difficulty modeling around truncated sidechain density. There is a strong impulse to fit all the atoms of a protein into density somehow. In the case of truncated sidechain density, this can result in sidechain atoms being impossibly squished into the truncated nubs of sidechain density or crammed into mainchain density. These sidechain misfittings result in distortions of the mainchain trace, rotamer outliers, terrible clashes, and general protein-unlikeness.

Seeing protein models with non-alanine sidechains that stop at their Cβ position is frustrating for people like me who are interested in the details of protein structure. However, truncating the model to match the truncations in the density seems to avoid some of the fitting and refinement problems associated with low-resolution sidechain density. Without a force field that can consistently pack sidechains in the absence of density, modeling only the parts of sidechains visible in the density may be necessary at low resolution.

3.4 Loops

Loop regions of proteins are often mobile or relatively unstructured. As a result, loops often have poor or no electron density associated with them at low resolution. Accurate modeling of loops is further complicated by the great variety of possible loop conformations. Loops cannot be restrained to an ideal repeating geometric pattern as secondary structure elements can. With their residues restrained neither by experimental data nor by strong geometric expectations, loops are especially prone to modeling errors in low-resolution structures.


Figure 3.4: Abrupt disappearance of electron density for a loop in 2gec.pdb, a 1.3Å structure.

Challenges in loop modeling are not limited to low-resolution structures, however. As will be discussed in the later section on cis-peptide validation (Section 7.5), high-resolution structures are also susceptible to errors in loops, especially if the electron density for those loops is poor (Figure 3.4). Such apparent inhomogeneity of resolution may be particularly problematic in otherwise high-resolution structures, since the fitting and refinement methods that work (very well) for the high-resolution regions of the structure fail on the few poorly defined regions. Given the general expectation that high-resolution structures essentially solve themselves, these failures of the method may go unnoticed without extra care.

3.5 Data versus geometry restraints

The refinement process in crystallography is a balancing act between the experimental x-ray crystallography data and geometry restraints. With high-resolution experimental data, the data can carry the balance of refinement on its own. With low-resolution experimental data, geometry restraints must carry a greater and greater share of the responsibility for producing a protein-like model. Crystallography programs like Phenix (Adams et al, 2010) can adjust the relative weighting of experimental data and geometry restraints automatically, but also include options to enable users to adjust this balance manually.

Figure 3.5: Dependency of clashscore on data resolution in structures refined without geometry restraints.

To demonstrate the importance of geometry restraints in refinement, I conducted an informal experiment. I collected a number of recent structures from the PDB across a range of resolutions, approximately 5 structures in each 0.1Å bin from 1.0Å to 3.0Å. I then refined these structures using Phenix in two ways. Each structure from the PDB was refined using Phenix’s default settings as a reference point. Then each PDB structure was refined with Phenix drastically overweighting the experimental data. The degree of overweighting effectively removed the contribution of geometry restraints to refinement. The reference refined structure and the overweighted refined structure were each assessed for quality using MolProbity’s validation criteria. The difference in the numerical validation scores for overweighted versus reference was calculated and plotted as a function of resolution. A value of 0 indicates no net change in structure quality according to that validation criterion. The larger the value, the greater the change in structure quality when geometry restraints were effectively removed.
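The comparison step of this experiment can be sketched as follows. This is an illustration of the bookkeeping only, not the scripts actually used: the function names are hypothetical, the scores are assumed to have been obtained already from MolProbity for each refined structure, and averaging within each 0.1Å bin is an optional summary that I am assuming here rather than something the plots require.

```python
import math
from collections import defaultdict

def score_changes(records):
    """Per-structure change in a validation score, sorted by resolution.

    records: iterable of (resolution_in_angstroms, reference_score,
    overweighted_score). A change of 0 means that removing the geometry
    restraints made no net difference for that structure.
    """
    return sorted((res, over - ref) for res, ref, over in records)

def mean_change_by_bin(records):
    """Average the score changes within each 0.1 Å resolution bin."""
    bins = defaultdict(list)
    for resolution, change in score_changes(records):
        # The small epsilon guards against floating-point values such as
        # 2.3 * 10 landing just below 23.
        bin_start = math.floor(resolution * 10 + 1e-9) / 10
        bins[bin_start].append(change)
    return {start: sum(values) / len(values)
            for start, values in sorted(bins.items())}
```

Each tuple from `score_changes` corresponds to one plotted point: resolution on the x-axis and the overweighted-minus-reference score difference on the y-axis.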

Figure 3.6: Dependency of bond geometry outliers on data resolution in structures refined without geometry restraints.


The results are entertaining. For high-resolution structures, down to about 1.5Å, the experimental data is good enough that the presence or absence of geometry restraints makes no appreciable difference in most cases. Sidechains (Figure 3.7) are the first feature to suffer without geometry restraints, and between 1.5Å and 2.0Å are no longer systematically modelable without restraints. Clashes (Figure 3.5) and bond geometry (Figure 3.6) are the next to go, becoming problematic between 2.0Å and 2.5Å. The protein backbone, as measured here by Ramachandran statistics, remains reliably represented by the data alone at lower resolutions than any of the other measures. Ramachandran scores do not suffer systematically from the absence of restraints until 2.5Å (Figure 3.7).

Figure 3.7: Dependency of Ramachandran and Rotamer outliers on data resolution in structures refined without geometry restraints.

These results confirm 2.5Å as a meaningful approximation of the beginning of a low-resolution regime in which the experimental data alone is no longer sufficient for building a realistic model of any part of a protein. Other information in the form of refinement restraints, validation criteria, or geometry idealizations becomes necessary to produce a protein-like model beyond this point. The results also show, as expected, that the data quality for the protein backbone is generally better at low resolutions than the data quality for other parts of the protein. The relative quality of backbone density and the relative reliability of backbone modeling will be important to the development of CaBLAM validation.

3.6 Discussion

Low-resolution experimental data poses significant challenges to the accurate modeling of protein structures. Humans and crystallography programs alike cannot extract sufficient information from low-resolution electron density to build consistently protein-like models. Additional information must be brought in from external sources.

To this end, small model studies have yielded bond geometry restraints (Engh & Huber, 1991; Engh & Huber, 2001), sidechain studies have yielded rotamer statistics (Ponder & Richards, 1987; Lovell, 2000), and quantum mechanics has yielded atomic overlap “clash” cutoffs (Bondi, 1964; Gavezzotti, 1983; Word, 2000). These methods and others have yielded a variety of restraint criteria that supplement x-ray crystallography data to ensure the production of a model that conforms to our expectations of protein-likeness.

Low-resolution experimental data provides less structural information and is thus in greater need of supplementation by external data. New sources of restraints will be needed to allow low-resolution structures to attain a high level of quality. We believe that high-quality, protein-like structures are possible, even at low resolution, at least in the well-packed core regions of proteins. To this end, the CaBLAM system described here introduces a new source of restraints tailored to the needs of low-resolution protein models.

Although techniques for solving structures from poor experimental data are constantly improving, new sources of restraints to inform and empower these methods will always be needed. Exciting new experimental techniques such as the recent femtosecond laser pulse “diffract and destroy” method for microcrystals allow the collection of datasets for otherwise intractable proteins (e.g. Fromme & Spence, 2011).

However, these techniques also represent what I call “bold new frontiers in the collection of mediocre data”. Development of new supplemental restraints, tailored to the specific needs of these methods, will represent an important part of structural biology for the foreseeable future and beyond.

4. The Cα geometry parameter spaces

This chapter is concerned with the parameter spaces used by CaBLAM for validation of protein structure, identification of outlier backbone geometries, and assignment of secondary structure types. The parameter spaces used by CaBLAM are unique to the system and were designed to address shortcomings in existing methods and to take advantage of features that remain relatively well-modeled even in low-resolution structures.

4.1 Problems with all-atom parameter spaces

Low-resolution structures present considerable challenges to traditional structure assessment and annotation methods. Building a better method for low-resolution structures required an understanding of the reasons that existing methods fail. Ramachandran assessment for conformational outliers and DSSP annotation of secondary structure as modeled provide instructive examples of familiar methods that cannot describe error-prone structures reliably.

4.1.1 Ramachandran analysis at low resolution

Ramachandran analysis is probably the most familiar and most popular method of protein backbone assessment. Hailing from the very dawn of protein structural biology (Ramachandran, 1963), it has been a staple of protein structure validation (Laskowski, 1993 for PROCHECK; Hooft, 1996 for WHAT_CHECK; Lovell, 2003 for MolProbity; etc.). Recent developments in protein structure solution methods have led to speculation about the use of Ramachandran restraints as an aid in solving difficult structures (Headd et al, 2012). Much of the push-back against restraining residues to expected Ramachandran values has come from concern over losing such a broadly applicable and familiar validation method. My experience with Ramachandran analysis of low-resolution structures shows that it is not a reliable method of detecting intended protein structure where modeling errors are prevalent.

Figure 4.1: Typical helices from 2o01.pdb, showing misplaced peptide planes and disrupted H-bonding.

The two structures that have served as my primary touchstones for low-resolution modeling errors, 2o01.pdb (Amunts, 2007) and the 70S E. coli ribosome (Dunkle et al, 2011, following our collaboration), illustrate the failure of Ramachandran analysis to provide meaningful validation at low resolution. The 2o01.pdb Photosystem I structure is dominated by transmembrane helices. These helices are riddled with modeling errors (Figure 4.1). In particular, many of the peptide planes are oriented incorrectly, sometimes even turned about 180° such that the carbonyl carbon-carbonyl oxygen (C-O) vectors face the wrong direction along the helix axis. Multiples of these errors frequently occur adjacent to each other in sequence.

Figure 4.2: Poorly modeled helix residues from 2o01.pdb in Ramachandran space.

The Ramachandran diagram for these helices is a mess (Figure 4.2). Rather than being clustered tightly in the region for alpha helix, the residues are spread out over a wide area. Many of the residues have apparently acceptable ϕ,ψ values and are not identified as outliers, despite problems with their geometry. Many of the residues that are identified as outliers are so far from the contours for expected behavior that their intended structure cannot be guessed.

Figure 4.3: Typical three-carbonyl modeling error from 70S E. coli ribosome.

The 70S E. coli ribosome structure contained many instances of a systematic modeling error in its beta sheets. In regular beta strands, the C-O vectors alternate direction by about 180° along the sequence of the strand. In this structure, there were many instances where one (or more) residue did not alternate, resulting in three (or more) C-O vectors in a row pointing in the same direction (Figure 4.3). While it is certainly possible for beta structure to have two sequential C-O in parallel, as in the case of beta bulges or at the corners of beta-helix folds, this requires some compensatory changes in backbone geometry that were not present in the 70S ribosome cases. A beta strand cannot place adjacent C-O vectors in parallel without bending. This error was evidently resistant to correction by refinement and required human intervention to fix.
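The alternation rule described above lends itself to a simple automated check. The following is a minimal sketch under stated assumptions, not the procedure by which these errors were found in the 70S structure: each residue is supplied as a pair of carbonyl C and O coordinates (a hypothetical input layout), two successive C-O vectors are treated as pointing the same way when their dot product is positive, and runs of three or more are reported. The function names and the zero threshold are illustrative choices.

```python
def co_vector(c, o):
    """Vector from the carbonyl carbon to the carbonyl oxygen."""
    return (o[0] - c[0], o[1] - c[1], o[2] - c[2])


def parallel_co_runs(carbonyls, min_run=3):
    """Find runs of consecutive residues whose C-O vectors point the same way.

    carbonyls: list of (C_xyz, O_xyz) tuples in sequence order (assumed layout).
    Returns a list of (start_index, end_index) pairs, inclusive.
    """
    vectors = [co_vector(c, o) for c, o in carbonyls]
    runs = []
    start = 0
    for i in range(1, len(vectors) + 1):
        # A positive dot product means residue i points roughly the same
        # way as residue i-1; the end of the list always closes a run.
        same_direction = (
            i < len(vectors)
            and sum(a * b for a, b in zip(vectors[i - 1], vectors[i])) > 0
        )
        if not same_direction:
            if i - start >= min_run:
                runs.append((start, i - 1))
            start = i
    return runs
```

A regular strand, in which every C-O vector opposes its neighbors, yields an empty list. A check like this only flags candidates; legitimate beta bulges and beta-helix corners would still need to be distinguished by their compensating backbone geometry.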

Figure 4.4: Beta strand modeling errors from 70S E. coli ribosome in Ramachandran space.

The Ramachandran diagram for these beta strand modeling errors is less of a mess, but no more helpful (Figure 4.4). Because the 3-parallel-C-O modeling error is “conserved”, points for residues involved in these errors form two distinct clusters, rather than a semi-random scatter. Each instance of a 3-parallel-C-O error comprises two residues, one on the N-terminal side of the flipped peptide plane and one on the C-terminal side. The N-terminal residue falls in the general alpha-helix region of the Ramachandran plot, often within contours that make it appear to be good structure. The C-terminal residue falls just north of the L-alpha region of the plot and is often, but not necessarily, identified as an outlier. Significantly, there is no straightforward way to extract the intended structure type from the locations of these residues in Ramachandran space.

To be fair to Ramachandran analysis, it’s a sensitive technique and very much appropriate to high-resolution structures and structures with relatively few modeling errors. However, its sensitivity is confounded in the case of low-resolution structures, especially where multiple errors become compounded together. Ramachandran analysis uses backbone dihedrals, which rely on the reasonably accurate modeling of all backbone atoms. When only a few atoms are out of place, it’s very sensitive to the error.

When all the atoms are out of place, as is the case with compound sequential errors, it loses the frame of reference for its assessment. Low-resolution structures, because of the nature of the errors they often contain, present a challenge beyond the proper scope of Ramachandran analysis.

4.1.2 DSSP annotation at low resolution

DSSP (Kabsch and Sander, 1983) is not a validation system, but is the most commonly encountered protein structure annotation software. DSSP and its related programs, like the phenix.ksdssp implementation in Phenix, generate most of the HELIX and SHEET records found in .pdb files. Like Ramachandran analysis, DSSP annotation is dependent on atoms that are not reliably modeled at low resolution and is vulnerable to compound modeling errors. DSSP relies primarily on recognition of hydrogen bonding patterns for its assignment of secondary structure. Residues involved in alpha helix hydrogen bonding receive an “H”, residues with 3₁₀ bonding receive a “G”, and residues with helix-like bonding that does not quite match alpha or 3₁₀ receive a “T” (π helix is so rare that “I” virtually never appears). Residues with beta sheet hydrogen bonding receive an “E” for extended or a “B” for short bridge. “S” is the only DSSP category not based on hydrogen bonding, and serves as a catch-all for regions of interesting geometry that do not match one of the expected hydrogen bonding patterns.

Figure 4.5: Poorly modeled helices from 2o01.pdb with DSSP-based secondary structure annotation.

Since DSSP is dependent on a regular hydrogen bonding network to identify secondary structure, it functions poorly when modeling errors disrupt secondary structure. In a typical case from 2o01.pdb (Figure 4.5), DSSP performs passably on a less distressed helix, but when compound errors disrupt hydrogen bonding, DSSP becomes confused, even assigning a beta structure tag “B” to a helix residue. The HELIX and SHEET records for 2o01.pdb are only useful because the depositors must have entered them manually. Automated HELIX records result in poor coverage of the transmembrane helices (Figure 4.6).

Figure 4.6: Ribbons show sparse identification of helices by ksdssp for 2o01.pdb chain A.

There is not, to my knowledge, any existing popular method designed specifically for the assessment of low-resolution or error-dense protein backbone.

Existing methods can provide some information, but make assumptions about structure quality, reliability of atom placement, and frequency of compound errors that do not hold at low resolution. If I was to deal with low-resolution protein backbone, I was going to need a new method that made assumptions more appropriate to the low-resolution regime.

(In one of the ironies of science, the very need for a low-resolution method made testing that method challenging. If there had been something automated and reliable that I could check my eventual secondary structure assignment against, I would not have needed CaBLAM in the first place.)

4.2 Developing the Cα parameter space

Typically in science, the appearance of a linear and logical path from inspiration to insight to implementation is a construct laid over an indirect progression with the benefit of hindsight and the necessity of publication. This is very true of the origins of the CaBLAM project. The happenstance and misunderstandings that ultimately yielded such fruitful results cannot be usefully shared in journal articles, but they can be recounted here.

After the wrap-up of CASP8 assessment, I was searching for an appropriate thesis project. Even then, I was interested in low resolution and protein secondary structure as a new direction for validation methods. I was working on a method for automated detection of helix capping motifs using DSSP letter codes. With the assistance of Dan Keedy, I used our topX database of the time (the top5200) to identify segments of continuous secondary structure. I then pulled data for residues that were adjacent to segments of continuous “H” DSSP codes and plotted those residues in Ramachandran space.

Figure 4.7: Ramachandran space distribution for characteristic residue of widened helix turn.

I personally did not learn much from the distribution of those alpha helix-adjacent residues, until Jane walked behind me as I was looking at the plot, pointed, and said, “Oh, that’s interesting.” What Jane had spotted was a cluster of residues to the lower left of the alpha helix region (Figure 4.7). This cluster was near a region of the Ramachandran plot we sometimes refer to as the “Prisant Conjecture”, after an idea from Michael Prisant about the amide hydrogens of two adjacent residues approaching each other through hydrogen bonding with a single water (Lovell, 2003). I followed up on Jane’s interest and discovered that this region was not in fact the Prisant Conjecture, but a related motif in alpha helices. This motif was the “widened helix turn” discussed in Section 6.4.

I had stumbled on the widened helix turn while looking for helix caps because my DSSP-based method could not differentiate between an interruption in a continual secondary structure element (like this widened turn) and the end of an element (the helix caps I was looking for). Indeed, the tools I had available at the outset of this project were generally insufficient to the task of automated identification and description of non-repeating protein structure motifs. My work to develop a method to describe the widened helix turn ultimately led to the development of the CaBLAM parameter space, a more sensitive method for identifying non-repeating motifs, and a new validation method for low-resolution protein structures.

4.2.1 Cα pseudodihedrals

The alpha carbon is arguably the most important atom of each amino acid residue. The functional sidechain attaches to the backbone at the Cα, and the bonds connected to the Cα are (mostly) free to rotate through dihedral space. The Cα is thus the central point for the functional and conformational characterization of most residues.

The bare Cα trace also constitutes the most minimal representation of protein geometry.

I view the problem of low-resolution structures as a problem of low information content. One of the ways to compensate for low information content is to require less information. The Cα positions and their minimal representation of protein geometry were therefore suitable for building a parameter space that required minimal information. Another way to compensate for low information content is to draw on more context for each data point. A Cα virtual dihedral uses the position of 4 different Cα atoms, and the combination of two adjacent Cα dihedrals uses 5 Cα positions. Thus each dihedral captures a relatively large amount of the local conformational context.

Figure 4.8: Schematic representation of μin and μout for a single residue.

Since each dihedral involves four atom positions, each Cα is involved in four dihedrals in turn. The dihedrals for which the Cα of the residue of interest is one of the central two atom positions were deemed the most relevant to that residue’s overall protein geometry. I therefore defined two Cα pseudodihedrals for each residue. For a residue i, the dihedral µin is defined by the atom positions Cα(i-2), Cα(i-1), Cα(i), Cα(i+1). The dihedral µout is defined by atom positions Cα(i-1), Cα(i), Cα(i+1), Cα(i+2) (Figure 4.8, Figure 4.9).

There is an overlap in the Cα dihedrals of adjacent residues calculated in this system.

The µout of one residue is also the µin of the next residue.
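The two pseudodihedrals can be sketched in a few lines of code. This is an illustrative sketch rather than the CaBLAM source: it assumes the Cα coordinates of one unbroken chain segment are available as a list of (x, y, z) tuples, it uses the common convention of signed dihedrals in degrees with cis at 0°, and the function names are hypothetical.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral in degrees for four points (cis = 0, trans = 180)."""
    b0, b1, b2 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b0, b1)  # normal of the plane through p0, p1, p2
    n2 = _cross(b1, b2)  # normal of the plane through p1, p2, p3
    x = _dot(n1, n2)
    y = _dot(b1, _cross(n1, n2)) / math.sqrt(_dot(b1, b1))
    return math.degrees(math.atan2(y, x))


def mu_in_out(ca, i):
    """Return (mu_in, mu_out) for residue index i of a CA coordinate list.

    A value is None when the required neighbors run off the segment.
    """
    n = len(ca)
    mu_in = None
    mu_out = None
    if i >= 2 and i + 1 < n:
        mu_in = dihedral(ca[i - 2], ca[i - 1], ca[i], ca[i + 1])
    if i >= 1 and i + 2 < n:
        mu_out = dihedral(ca[i - 1], ca[i], ca[i + 1], ca[i + 2])
    return mu_in, mu_out
```

Because µout of residue i and µin of residue i+1 use the same four atoms, a full chain needs only one dihedral calculation per residue. An idealized right-handed helix trace (2.3 Å radius, 1.5 Å rise, 100° per residue) gives a value near +50° for both angles under this convention.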

Figure 4.9: μin relationship for residue i, shown in an alpha helix.

4.2.2 The peptide plane dihedral

With Cα geometry captured by the Cα pseudodihedrals µin and µout, a third measure was needed to capture the other information of the protein mainchain trace.

Dave made a suggestion about how to capture the relative orientations of adjacent peptide planes. I believe that I misunderstood the specifics of his suggestion, but it set me onto the development of the peptide plane dihedral. At that time, I was trying to create as minimal a parameter set as possible due to my distrust in all-atom representations of low-resolution structures. I therefore defined the peptide plane dihedral using only Cα positions (which had already been used in the Cα dihedrals) and the backbone carbonyl oxygen positions. Since those oxygens are the heaviest backbone atoms, I assumed they would be the most reliably modeled.

That assumption was so exactly wrong that the resulting measure became a powerful diagnostic of modeling errors.

Figure 4.10: Schematic representation of ν for a residue.

The peptide plane dihedral is named ν (nu). For a residue i, the dihedral ν is defined by atom positions O(i-1) and O(i), and constructed pseudoatom positions X(i-1) and X(i).

The constructed pseudoatoms are located at the point of closest approach (or perpendicular intersection) of a carbonyl oxygen to a Cα-Cα line. X(i) is located at the perpendicular intersection of a line drawn from O(i) to the Cα(i)-Cα(i+1) line. X(i-1) is located at the perpendicular intersection of a line drawn from O(i-1) to the Cα(i-1)-Cα(i) line. Thus the peptide plane dihedral ν is defined by O(i-1), X(i-1), X(i), O(i) (Figure 4.10, Figure 4.11).
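The construction of ν can be sketched as a point-to-line projection followed by an ordinary dihedral. This is a hedged illustration, not the CaBLAM implementation: coordinates are assumed to be (x, y, z) tuples, the dihedral is signed in degrees with cis at 0°, and all function names are hypothetical.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral in degrees for four points (cis = 0, trans = 180)."""
    b0, b1, b2 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b0, b1)
    n2 = _cross(b1, b2)
    x = _dot(n1, n2)
    y = _dot(b1, _cross(n1, n2)) / math.sqrt(_dot(b1, b1))
    return math.degrees(math.atan2(y, x))


def project_onto_line(p, a, b):
    """Point on the infinite line through a and b that is closest to p."""
    ab = _sub(b, a)
    t = _dot(_sub(p, a), ab) / _dot(ab, ab)
    return (a[0] + t * ab[0], a[1] + t * ab[1], a[2] + t * ab[2])


def nu(ca_prev, ca_i, ca_next, o_prev, o_i):
    """Peptide plane dihedral for residue i.

    X(i-1) is the projection of O(i-1) onto the CA(i-1)-CA(i) line and
    X(i) is the projection of O(i) onto the CA(i)-CA(i+1) line.
    """
    x_prev = project_onto_line(o_prev, ca_prev, ca_i)
    x_i = project_onto_line(o_i, ca_i, ca_next)
    return dihedral(o_prev, x_prev, x_i, o_i)
```

Two carbonyls pointing the same way relative to their Cα-Cα lines give ν near 0°, and opposed carbonyls give ν near ±180°, so the measure responds directly to peptide plane orientation.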

It is possible to construct another ν for residue i relating it to residue i+1. The first incarnations of CaBLAM calculated both νin and νout, under the designations CO_d_in and CO_d_out. As in the case of the µ dihedral, the νout of one residue is also the νin of the next. Since the νin dihedral bridges the Cα of the residue of interest, νin was deemed more indicative of the behavior of that residue. The νout dihedral was dropped, and νin simplified to ν. This definition for ν also has the incidental advantage of roughly matching our convention of associating the peptide bond dihedral ω with the following residue.

Figure 4.11: The ν dihedral, shown in an alpha helix.

4.2.3 The Cα virtual angle

The ν dihedral is unique to CaBLAM, while one µ dihedral or the other occasionally appears in other systems (Oldfield & Hubbard, 1994; Kleywegt, 1997; Labresse, 1997). In contrast, the Cα virtual angle is a relatively familiar measure. For a residue i, the Cα virtual angle is defined by the atom positions Cα(i-1), Cα(i), Cα(i+1) (Figure 4.12). It was initially explored as a possible CaBLAM parameter because a Cα dihedral and the Cα virtual angle together contain enough information to reconstruct the full Cα trace (assuming constant Cα-Cα distances). The Cα virtual angle proved less useful for describing the variety of protein conformational space than the ν dihedral, since the Cα virtual angle has a restricted range of about 90° while ν has a full dihedral range of 360°.

The Cα virtual angle was therefore discarded from consideration as a CaBLAM parameter for some time.
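For reference, the Cα virtual angle is simply the angle at Cα(i) between the directions to its two neighboring Cα atoms. A minimal sketch, assuming coordinates as (x, y, z) tuples and using a hypothetical function name:

```python
import math


def ca_virtual_angle(ca_prev, ca_i, ca_next):
    """Angle in degrees at CA(i) between CA(i-1) and CA(i+1)."""
    u = tuple(p - q for p, q in zip(ca_prev, ca_i))
    v = tuple(p - q for p, q in zip(ca_next, ca_i))
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))
```

Unlike a dihedral, this angle is unsigned and bounded between 0° and 180°, which is one way to see why it spans a narrower range than ν.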

Figure 4.12: Schematic representation of the Cα virtual angle.

The Cα virtual angle came back into favor after it became clear that CaBLAM would be relying heavily on the Cα trace to identify secondary structure elements. A quality check on the Cα trace itself was needed to ensure that unusually severe backbone distortions would not misinform CaBLAM’s secondary structure identification. Since the virtual angle provides a complete description of the Cα trace when combined with the Cα dihedrals, it was natural to reintroduce it in that context.

4.3 Representations of the parameter space

There are four main parameters calculated in CaBLAM: μin, μout, ν, and the Cα virtual angle. There are three main parameter spaces used to describe protein structure: a 3D “CaBLAM” space, a 2D “CaBLAM” space, and a 3D Cα geometry space. The μin and μout dihedrals are central to the CaBLAM method and are used in all of these parameter spaces. As a common frame of reference, μin and μout are usually the x- and y-axes in all representations of these parameter spaces.

4.3.1 3D CaBLAM space

The 3D parameter space referred to as “CaBLAM space” uses the parameters µin, µout, and ν (Figures 4.13, 4.15, & 4.17). Most representations of CaBLAM space will show µin as the x-axis, µout as the y-axis, and ν as the z-axis, down which the viewer is looking in projection. Because the ν dihedral is sensitive to modeling errors at low resolution, this space is the one used in the first pass of validation.

4.3.2 2D CaBLAM space

The 2D parameter space referred to as “2D CaBLAM space” uses only the parameters µin and µout. Representations of this space show µin as the x-axis and µout as the y-axis. Because the Cα trace is considered reliable at low resolution, this Cα-only space is used to identify secondary structure elements in low-resolution structures (Figures 4.21-4.25).

4.3.3 3D Cα geometry space

The 3D parameter space referred to as “3D Cα geometry space” uses the parameters µin, µout, and Cα virtual angle (Figures 4.14, 4.16, & 4.18). Most representations of Cα geometry space will show µin as the x-axis, µout as the y-axis, and the Cα virtual angle as the z-axis, down which the viewer is looking in projection. This space is used to check the quality of the Cα trace before assigning secondary structure elements based on the 2D CaBLAM space.

4.4 Populating the 3D parameter spaces

4.4.1 Dataset selection

The Richardson Lab has used several quality-filtered datasets of protein structures. Over time, and with the growth of the Protein Data Bank, these datasets have grown from the Top100 (Word, 1999), to the Top240, to the Top500 (Lovell et al, 2003), to the Top5200, to the current Top8000 (Richardson, 2013). The TopX datasets comprise protein chains, homology filtered for redundancy and quality filtered for reliability of modeling. Each chain in the Top8000 was required to have resolution < 2.0Å, MolProbity score < 2.0, ≤ 5% of residues with bond length outliers, ≤ 5% of residues with bond angle outliers, and ≤ 5% of residues with Cβ deviation outliers. The main working version of the Top8000 contains chains with sequence identity ≤ 70%. Sequence identity was determined using the PDB homology clusters, and the entry with the lowest (i.e. best) average of crystallography resolution and MolProbity score was selected from each cluster.
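The chain-level criteria above amount to a threshold filter followed by a best-per-cluster choice. The sketch below is illustrative only and is not the script used to build the Top8000: the dictionary keys are hypothetical field names, and the order of operations (quality filter first, then cluster selection) is an assumption, since the text does not specify it.

```python
def passes_chain_filter(chain):
    """Apply the chain-level quality cutoffs quoted in the text."""
    return (
        chain["resolution"] < 2.0
        and chain["molprobity_score"] < 2.0
        and chain["pct_bond_length_outliers"] <= 5.0
        and chain["pct_bond_angle_outliers"] <= 5.0
        and chain["pct_cbeta_outliers"] <= 5.0
    )


def select_representatives(chains):
    """Keep one chain per homology cluster.

    The winner has the lowest mean of resolution and MolProbity score.
    Assumption: chains failing the quality filter are dropped first.
    """
    best = {}
    for chain in chains:
        if not passes_chain_filter(chain):
            continue
        rank = (chain["resolution"] + chain["molprobity_score"]) / 2.0
        cluster = chain["cluster_id"]  # hypothetical cluster label
        if cluster not in best or rank < best[cluster][0]:
            best[cluster] = (rank, chain)
    return [chain for _, chain in best.values()]
```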

The CaBLAM project made use of the Top5200 during its early stages. Once the Top8000 dataset became available, I transitioned to using that new, larger dataset. The contours used by the current version of CaBLAM as of this writing are derived from the Top8000. These contours may be updated as newer and improved datasets become available.

4.4.2 Residue-level quality filtering

Even within structures of high total quality such as those within the Top8000, model quality and model certainty vary from residue to residue. For any study interested in individual residues, it is important to have a means of filtering out potentially problematic residues. For most Richardson Lab purposes B-factor is the most useful and accessible filtering criterion.

The structural B-factor has its origins in the crystallographic temperature factor, a measure of the thermal motion of atoms in an ordered crystal. Unfortunately, macromolecule crystals are never well ordered enough for “temperature factor” to be an accurate term. Additionally, various inaccuracies of refinement resulting from challenges such as anisotropic data and especially partial occupancy are often lumped into the B-factor. As a result, the B-factors associated with atoms in protein structures are generally catch-alls for all the sources of error and uncertainty in modeling.

Each atom in a structure has an associated B-factor, and that B-factor represents the certainty or uncertainty of that atom’s position based on the refinement. Atoms with a low B are well supported by the data. Atoms with a high B are less certainly supported. Filtering on B-factor allows us to be sure that residues that deviate from expected behavior are nevertheless justified by experimental data.

Common Richardson Lab practice is to identify the highest B-factor among the relevant atoms (here, in the mainchain case N, CA, C, and O) for each residue. Each residue with a highest mainchain B > 30.0 fails the quality filter. This method was used for CaBLAM. Each residue in the Top8000 containing a mainchain atom with B-factor > 30.0 was excluded from all calculations. Calculation of the full CaBLAM parameters for a residue requires the two preceding and two succeeding residues. If any one residue is excluded from calculation due to high B, the full CaBLAM parameters cannot be calculated for the surrounding residues. One of the advantages of the 2 million residue Top8000 dataset is that increasing the strictness of filtering criteria still leaves a more than sufficient number of residues for analysis.

4.4.3 3D parameter spaces

The full four CaBLAM parameters µin, µout, ν, and the Cα virtual angle were calculated for each eligible residue in the Top8000. The CaBLAM parameters, especially µin and µout, require atom positions from multiple residues in sequence. Therefore, residues were ineligible for calculation if they were within two positions of a chain terminus or chain break. Residues were also ineligible if they contained a mainchain atom (N, CA, C, or O) with B-factor > 30.0 or if they were within two positions of such a residue.
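The eligibility rules can be expressed as a sliding five-residue window. The following is a minimal sketch under stated assumptions rather than the production filter: each residue is represented as a dictionary mapping mainchain atom names to B-factors (a hypothetical layout), all four mainchain atoms are assumed present, and the input list is one unbroken chain segment, so a chain break is handled by calling the function once per segment.

```python
MAINCHAIN_ATOMS = ("N", "CA", "C", "O")
B_CUTOFF = 30.0


def fails_b_filter(residue):
    """True when the highest mainchain B-factor exceeds the cutoff."""
    return max(residue[name] for name in MAINCHAIN_ATOMS) > B_CUTOFF


def eligible_indices(segment):
    """Indices of residues with a full, well-ordered five-residue window.

    segment: list of residue dicts for one unbroken chain segment.
    A residue is eligible only if it and its two neighbors on each side
    all exist and all pass the B-factor filter.
    """
    bad = [fails_b_filter(res) for res in segment]
    eligible = []
    for i in range(2, len(segment) - 2):
        if not any(bad[i - 2:i + 3]):
            eligible.append(i)
    return eligible
```

A single high-B residue therefore removes up to five residues from the dataset, which is why the size of the Top8000 matters for keeping enough data after filtering.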

Figure 4.13: General case contours in 3D CaBLAM space at 5% and 10%.

Figure 4.14: General case contours in 3D Cα geometry space at 1% and 5%.

Residues were separated into three categories based on residue type following calculation of the CaBLAM parameters: glycine, proline, and general case. Glycine residues allowed a greater variety of conformations than the general case. Proline residues allowed a lesser variety of conformations. Branched Cβ residues (isoleucine and valine) were briefly considered as a separate category, but did not display a sufficiently distinct distribution to merit separation. Due to the dependence of µin and µout on adjacent residues, pre-proline, pre-glycine, post-proline, and post-glycine residues may also exhibit distinct behaviors in CaBLAM space. These residues were not made into separate categories because 3D contours are memory-intensive, and I wanted as few categories as possible. Fortunately, residues before and after Gly and Pro have not shown a systematic tendency to be marked as outliers in the final system.

Figure 4.15: Proline case contours in 3D CaBLAM space at 5% and 10%.

Figure 4.16: Proline case contours in 3D Cα geometry space at 1% and 5%.

The CaBLAM parameters for residues in these categories were submitted to our contouring program Silk. 3D contours were produced for the 3D CaBLAM and 3D Cα geometry spaces for each of the glycine, proline, and general case categories. The contour files (generally typed as “.stat”) produced by Silk serve as input to other programs that interpret the contours. Visualizations of the contours were produced by the program kin3Dcont and displayed in the KiNG viewer. For proper display of contours, it is important to ensure that the sampling grid used by kin3Dcont matches the grid used by Silk.

Figure 4.17: Glycine case contours in 3D CaBLAM space at 5% and 10%.

Figure 4.18: Glycine case contours in 3D Cα geometry space at 1% and 5%.

4.5 DSSP letter-code prediction with CaBLAM

The original intent of the CaBLAM parameter space had been to provide a tool for describing non-repeating protein backbone motifs like the widened helix turn.

However, the need for a validation tool to identify secondary structure in poorly-modeled structures and the potential for CaBLAM to serve as that tool became evident during the initial explorations of CaBLAM space.

My first method for identifying secondary structure in low-resolution models was a prediction system based on DSSP letter-codes and Bayesian statistics. This method ultimately proved unsatisfying for a variety of reasons and was abandoned in favor of the system of custom secondary structure definitions discussed in section 4.6 and onwards. The DSSP letter-code prediction system represents one of the major dead ends of CaBLAM development. The rise and fall of this method will be discussed in this section. Despite its failure, it laid groundwork for the subsequent successful system and informed my feelings on the virtues of open-licensed software.

4.5.1 Dataset selection

To identify intended secondary structure in low-resolution models, a training set of high-resolution examples of secondary structure was necessary. The training dataset for secondary structure identification was the same as the training dataset discussed in Section 4.4.1 for general protein geometry. Specifically, the Top5200 was used during my attempts to use DSSP letter codes, and the DSSP prediction method was abandoned before the Top8000 became available.

4.5.2 Residue-level quality filtering

To ensure that only confidently modeled residues were included in the datasets for secondary structure, a residue-level quality filter was necessary. The same B-factor based quality filter discussed in 4.4.2 for general protein geometry was used for potential secondary structure residues. Any residue containing a mainchain atom (N, CA, C, or O) with B-factor > 30.0 was excluded from all calculation. The geometry parameters used in describing secondary structure are µin and µout. These measures require the preceding two and succeeding two residues to be available for their calculation. Therefore, as usual with CaBLAM calculations, a chain break or a residue with high mainchain B-factor prevents calculation not only for itself, but also for the immediately surrounding residues.

4.5.3 Defining secondary structure with DSSP

DSSP (Define Secondary Structure in Proteins) is one of the standard methods of identifying and annotating protein secondary structure (Kabsch and Sander, 1983). As discussed in Section 4.1.2, DSSP depends on accurate modeling of hydrogen bonding networks to identify secondary structure. In low-resolution structures, where hydrogen-bonding networks may not be modeled reliably, DSSP tends to fail at secondary structure identification. However, for high-resolution structures, like those in the Top8000 training set, DSSP can provide a reliable identification of secondary structure types.

My first attempt to train CaBLAM to recognize secondary structure took advantage of the already-existing DSSP categorizations of secondary structure for structures in the Top5200. DSSP categorizes secondary structure by assigning one of seven single-letter codes to each residue in a protein. The DSSP letter codes are: H for alpha helix, G for 3₁₀ helix, I for π helix, E for extended beta sheet, B for beta bridge, T for bonded turn, and S for high curvature. A residue may also be left unlabeled. For my purposes, I labeled as “X” any residues left unlabeled by DSSP. A database of DSSP letter-code assignments for all protein residues in the PDB is maintained at the DSSP webpage (http://swift.cmbi.ru.nl/gv/dssp/). I downloaded DSSP letter-code assignments for the proteins in the Top5200 from this resource.

I calculated µin and µout for all eligible residues in the Top5200. Then I separated those residues into categories for each of the DSSP letter codes: H, G, I, E, B, T, S, and “X” for residues not labeled by DSSP. I processed the 2D CaBLAM space data for each of these categories using our contouring program Silk.

For all other parts of the CaBLAM project, I produced percentile contours from the raw data with Silk. Percentile contours describe the percent of data included or excluded by each isoline or isosurface. But Silk can produce other kinds of contours and contour-like data structures. For each DSSP letter code, I used Silk to produce a probability field in two-dimensional μin, μout space based on the distribution of residue geometries in that space. The probability fields produced by Silk use the same format as the percentile contours and are interpreted by the same interpolation functions. It would generally be possible to draw isolines in a probability field, and the isolines would bear some resemblance to percentile contours, but the probability isolines would not have an intuitive relationship to inclusion/exclusion percentages. Because isolines through the probability data do not carry intuitive meaning, the results of Silk for the DSSP letter codes are better understood as probability fields than as contours.

4.5.4 Bayesian probabilities for DSSP codes

My intent at that point was to create a system that would produce DSSP letter-code annotations for low-resolution structures (which DSSP itself could not assess).

DSSP produces eight distinct categories (seven letter-codes, plus unmarked), so a system that could manage several simultaneous possibilities was necessary. Constructing probability fields allowed me to use Bayesian statistics to manage the simultaneous probabilities of the eight DSSP categories.

Bayesian statistics is a mathematical method of representing and calculating conditional probabilities. The Bayesian probability P(DSSP code | Cα geometry), or the probability of a residue being assigned a specific DSSP letter-code given an observed Cα geometry, would allow prediction of DSSP letter-code from Cα geometry. Bayes’ theorem (Bayes and Price, 1763) is written generically as:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

In the terms of this system, the theorem states:

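$$P(\text{DSSP code} \mid \text{C}\alpha\text{ geometry}) = \frac{P(\text{C}\alpha\text{ geometry} \mid \text{DSSP code})\,P(\text{DSSP code})}{P(\text{C}\alpha\text{ geometry})}$$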

Therefore, P(Cα geometry), P(DSSP code), and P(Cα geometry | DSSP code) must be known to predict DSSP code from Cα geometry.

The probability field for each DSSP letter-code represents P(Cα geometry | DSSP code) for that code. A probability field for a given DSSP letter code can be queried with a (μin, μout) coordinate to yield the probability of finding a residue with that geometry among residues with that letter code.

P(DSSP code) was straightforward to calculate. For each DSSP code, this is the fraction out of the total residues in the Top5200 identified with that code.
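As a minimal sketch of this fraction, assuming the per-residue letter codes are available as a flat sequence (the function name `code_priors` is hypothetical):

```python
from collections import Counter

def code_priors(codes):
    """Return P(DSSP code) as the fraction of residues carrying each code.

    codes: iterable of one-letter codes, one per residue (assumed input).
    """
    counts = Counter(codes)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}
```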

P(Cα geometry) required the construction of one additional probability field.

The Cα geometry probability field was constructed in the same way as the DSSP code probability fields, but contained all residues that passed the B-factor quality filter, regardless of letter code. This probability field gives the probability of finding any residue with a (μin, μout) combination of interest.

P(DSSP code) and P(Cα geometry) provide a normalization to the P(Cα geometry | DSSP code) derived from the letter-code contours. With this normalization, the calculated P(DSSP code | Cα geometry) values for the (μin, μout) of a given residue sum to 1.0 across the possible letter codes. This allows for direct comparison and ranking of probabilities for different DSSP codes for a residue of interest.
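The normalization can be sketched as follows. One deliberate difference from the text should be noted: the text obtains P(Cα geometry) from a separate all-residue probability field, whereas this sketch computes it with the law of total probability, as the sum of P(Cα geometry | DSSP code) · P(DSSP code) over all codes. That substitution is an assumption, chosen because it makes the sum-to-1.0 property explicit. The function name and the example numbers are illustrative only.

```python
def posterior_by_code(likelihoods, priors):
    """Apply Bayes' theorem at a single (mu_in, mu_out) point.

    likelihoods: code -> P(geometry | code), read from each code's field
    priors:      code -> P(code)
    returns:     code -> P(code | geometry)
    """
    joint = {code: likelihoods[code] * priors[code] for code in priors}
    # Assumption: P(geometry) taken as the total probability over all codes.
    p_geometry = sum(joint.values())
    if p_geometry == 0.0:
        # No code has any density at this point; nothing can be ranked.
        return {code: 0.0 for code in priors}
    return {code: joint[code] / p_geometry for code in priors}

# Made-up values for three codes, for illustration only.
example = posterior_by_code(
    likelihoods={"H": 0.8, "E": 0.1, "X": 0.1},
    priors={"H": 0.5, "E": 0.3, "X": 0.2},
)
```

With these made-up inputs, H dominates and the three posteriors add up to 1.0, which is what allows direct ranking across codes.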

4.5.5 Using DSSP code prediction

To obtain a DSSP letter-code prediction for a residue of interest, first the μin and μout Cα geometry parameters are calculated. These parameters are used as a 2D coordinate (μin, μout) in the probability field space to determine P(Cα geometry | DSSP code) for each DSSP code and to determine P(Cα geometry). P(DSSP code | Cα geometry) is then calculated for each possible code. The DSSP codes for that residue are then ranked in descending order of probability.

Taking the highest-ranked DSSP code for each residue allows the prediction of a complete DSSP-style annotation of a structure from just its Cα geometry.
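The ranking and top-choice steps might look like the sketch below. It assumes the posterior probabilities for each residue are already available as a dictionary; the function names are hypothetical.

```python
def rank_codes(posteriors):
    """Return (code, probability) pairs in descending order of probability."""
    return sorted(posteriors.items(), key=lambda item: item[1], reverse=True)

def predict_annotation(per_residue_posteriors):
    """Build a DSSP-style string from the top-ranked code of each residue."""
    return "".join(rank_codes(p)[0][0] for p in per_residue_posteriors)
```

Keeping the full ranked list, rather than only the winner, leaves the second-ranked code available for later inspection.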

I tested the accuracy of CaBLAM’s DSSP code predictions by running predictions on all the structures in my training set, then comparing those predictions to the official DSSP annotations. The results of this comparison are shown in Table 4.1. CaBLAM’s predictions fared reasonably well for the major secondary structure types alpha helix (H) and beta sheet (E). However, 3₁₀ helix was very poorly identified. Identification of the non-regular structure types suffered from difficulty distinguishing them from regular structure types; in particular T (bonded turn) was ambiguous with H, and X (unlabeled) was ambiguous with E. While the Cα trace is a powerful indicator of protein structure, it is insufficient on its own to recreate the detail of a full DSSP annotation.

Table 4.1: Success rates for automated DSSP annotation by CaBLAM

DSSP code                        H     G     E     T     S     X
Correct first ranking            93%   27%   77%   57%   52%   62%
Most common misidentification    T     T     X     H     T     E

In retrospect, I might have been able to get better accuracy out of the DSSP code prediction method if I had applied a continuity requirement to it, similar to the continuity requirement I would apply to the identification of potential secondary structure by the final implementation of CaBLAM. The DSSP code prediction method’s accuracy improves markedly – especially for the troublesome 3₁₀ helix – if the first and second ranked predictions are both considered. A continuity requirement could have taken the second ranked predictions into account to build complete secondary structure elements.

4.5.6 Visual annotation for DSSP code prediction

The series of ranked percentages for DSSP codes for each residue called for a novel visualization. The “balls” that KiNG draws are in fact circles with highlight marks. The highlight mark can be turned off to leave a colored disc around a point. For each residue, I drew a series of these colored discs. The color of the disc corresponds to the DSSP code (red for H, purple for G, green for E, orange for T, yellow for S, white for X). The radius of the disc corresponds to the probability of that letter code for that residue. I manipulated the draw order for the colored discs so that all discs would be visible from smallest to largest for each residue. Use of KiNG’s “pointmasters” system allowed each letter-code’s discs to be toggled on and off individually.


Figure 4.19: CaBLAM's visual annotation of predicted DSSP codes for helices from 2o01.pdb.

The result was a visual annotation of concentric colored circles around each residue. An additional ring is drawn at the 100% certainty level to serve as a reference point. When one DSSP code prediction dominates, a single color nearly fills the reference ring. When the assignment is ambiguous and multiple DSSP codes contend for dominance, the colored area shrinks and the concentric circles become apparent.


Figure 4.20: CaBLAM's visual annotation of predicted DSSP codes for beta strands from 2a0l.pdb

I chose red for H and green for E to maintain the red for helices/green for beta sheets color scheme familiar to users of KiNG. As a result, areas dominated by alpha helix (Figure 4.19) and beta sheet (Figure 4.20) are distinguishable at a glance from the visual annotation. However, humans have difficulty accurately judging visual weights, and the areas of circles can be misleading. In addition, as a necessity of visibility, the circles for less favored predictions are drawn in front of those for the more favored predictions. This can give undue visual weight to the less likely letter codes. Because of these difficulties in interpretation, the concentric circles were not an ideal annotation.

CaBLAM would ultimately use a less novel and detailed but more readily understood annotation in the form of the familiar ribbon annotations for helix and sheet.


4.5.7 Problems with DSSP code prediction

DSSP proved to be the wrong system on which to build a method for identifying protein secondary structure at low resolution. DSSP is both too powerful and not powerful enough for the tasks I had in mind for CaBLAM. DSSP is too powerful in that it produces a detailed annotation of the protein backbone, with distinctions between helix (H), near-helix (T), and bent structures (S), among others. This level of detail could not be satisfactorily reproduced from a simple parsing of Cα geometry. Recapitulating full DSSP annotation was not feasible.

However, while DSSP can distinguish between regular and non-regular structures, DSSP is not powerful enough to distinguish among different related non-regular motifs. For example, the DSSP annotation for an interrupted alpha helix is generally “~HHHTHHH~”. There are multiple possible motifs hidden in that “T” (see section 6.4). DSSP cannot distinguish among these, but one of my intents in developing CaBLAM was to build a method that could find and distinguish among non-repeating motifs. Using only DSSP letter-codes, significant processing of the annotation sequence would be necessary to find motifs of interest. Even then, there would be no automated way to distinguish among related motifs. A method with a more detailed awareness of motifs would be necessary.

The proverbial nail in the coffin for CaBLAM predicting DSSP codes was DSSP’s licensing. DSSP is free to academic users. But DSSP is not free to industry users, and industry users are a significant portion of Phenix’s user base. It might not have been possible to have DSSP distributed with Phenix, which would have severely limited what that version of CaBLAM could do to assess a structure. (Phenix uses an open-licensed but limited version of DSSP, ksdssp, to generate HELIX and SHEET records.) The realization that a distribution license was limiting my research options has made me strongly in favor of open licensing where possible, and I hope to release any future software I develop with research-friendly terms.

4.6 Populating the secondary structure parameter spaces

4.6.1 Dataset selection

The training dataset for secondary structure identification was the same as the training dataset discussed in Section 4.4.1 for general protein geometry. The Top5200 dataset was used in the initial explorations, and the Top8000 (Richardson, 2013) dataset was used once it became available. The contours used by the current version of CaBLAM as of this writing are derived from the Top8000. These contours may be updated as new and improved datasets become available.

4.6.2 Residue-level quality filtering

To ensure that only confidently modeled residues were included in the datasets for secondary structure, a residue-level quality filter was necessary. The same B-factor based quality filter discussed in 4.4.2 for general protein geometry was used for potential secondary structure residues. Any residue containing a mainchain atom (N, CA, C, or O) with B-factor > 30.0 was excluded from all calculation. The geometry parameters used in describing secondary structure are µin and µout. These measures require the preceding two and succeeding two residues to be available for their calculation. Therefore, as usual with CaBLAM calculations, a chain break or a residue with high mainchain B-factor prevents calculation not only for itself, but also for the immediately surrounding residues.
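The way a single rejected residue removes its neighbors from the dataset can be sketched as below. The `Residue` record, its field names, and `eligible_indices` are hypothetical and exist only for this illustration; only the 30.0 cutoff and the two-residue window on each side come from the text.

```python
from dataclasses import dataclass

B_FACTOR_CUTOFF = 30.0  # from the text: mainchain B-factor > 30.0 is excluded

@dataclass
class Residue:
    # Hypothetical minimal record: B-factors keyed by mainchain atom name,
    # e.g. {"N": 12.1, "CA": 11.8, "C": 12.5, "O": 14.0}.
    mainchain_b: dict
    # True if this residue is not connected to the residue before it.
    chain_break_before: bool = False

def passes_b_filter(res):
    """True if no mainchain atom exceeds the B-factor cutoff."""
    return all(b <= B_FACTOR_CUTOFF for b in res.mainchain_b.values())

def eligible_indices(chain):
    """Indices whose full i-2..i+2 window is unbroken and passes the filter."""
    eligible = []
    for i in range(2, len(chain) - 2):
        window = chain[i - 2 : i + 3]
        if not all(passes_b_filter(r) for r in window):
            continue
        # A break before any of residues i-1..i+2 splits the window.
        if any(r.chain_break_before for r in window[1:]):
            continue
        eligible.append(i)
    return eligible
```

In this sketch one high-B residue disqualifies up to five window centers: itself and the two residues on either side.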

Regular secondary structure, especially in high-resolution structures like those of the Top8000, tends to be well defined by the electron density. For this reason, I expect that relatively few secondary structure residues were lost to the B-factor filtering. I expect that the majority of the residues that failed the B-factor filter were in loop regions. These residues would not be participating in secondary structure, but could be adjacent to the N-terminal or C-terminal residues of a secondary structure element. In that case, their exclusion from calculations would prevent the calculation of CaBLAM parameters for residues at the ends of those secondary structure elements. For this reason, I expect that the B-factor quality filter biased the final dataset somewhat towards more regular secondary structure and somewhat away from secondary structure termini and capping motifs. (The relatively high B-factors of typical loop residues mean that this is always the case, but CaBLAM’s need for the preceding and succeeding two residues enhances the effect.) For the purposes of this aspect of CaBLAM – that is, the identification of probable regular secondary structure elements in low-resolution structures – a bias towards regular structure is reasonable. As will be seen in the loose alpha helix contours in particular, the bias away from secondary structure termini was not very strong.

4.6.3 Developing “fingerprints” for defining secondary structure

My attempt to use DSSP letter-codes as CaBLAM’s reference point for secondary structure identification revealed a need for a secondary structure identification system better aligned to CaBLAM’s specific needs. The system had to be able to distinguish among similar related non-repeating motifs, something DSSP was not designed to do. The system had to identify regular secondary structure with enough leniency that the resulting contours would be usable for identifying secondary structure amid the modeling errors typical of low-resolution structures. And the system could not depend on software that could not be easily distributed with Phenix. To meet these needs, I developed a novel system to use hydrogen bonding patterns to identify motifs of interest. The hydrogen bonds for this system would be identified by Reduce and Probe (Word et al., 1999), Richardson Lab programs already distributed in Phenix.

I called the hydrogen bonding patterns for motifs “fingerprints”, because each motif fingerprint is meant to be a unique identifier for a motif of interest, just as a human fingerprint is a unique identifier for a person of interest. A “fingerprint” consists of a pattern of hydrogen bonding, connectivity information, and optional residue specifications. CaBLAM can iterate through each residue in a protein structure, testing each residue and its surrounding environment for a match to the pattern. A more complete description of fingerprints and their interpretation can be found in Chapter 6, which also discusses the use of fingerprints to identify non-repeating structural motifs in proteins.

For identification of regular secondary structure, use of fingerprints meant reinventing the DSSP wheel, at least in part. Hydrogen bonding pattern definitions were needed for the major secondary structure types. However, the advantages of defining custom secondary structure definitions were total certainty of the definitions and the power to tweak those definitions to better describe secondary structure in a way useful to CaBLAM’s low-resolution validation goals.

To arrive at useful secondary structure definitions, for each major secondary structure type, two main definitions were constructed. Each secondary structure type was defined in a strict fashion that focused on its most regular segments, and in a loose fashion that allowed more conformational variety and captured transitions from that secondary structure type into other structure types. These “regular” and “loose” definitions were constructed partly with motif fingerprints and partly with post-processing of the residues identified by those fingerprints. The following sections describe the definitions for the major secondary structure types (Richardson, J. S. 1981).

4.6.4 Alpha helix

Alpha helix is defined by a regular pattern of sequence-related hydrogen bonding. In regular alpha helix, the carbonyl oxygen of residue i hydrogen bonds with the amide hydrogen of residue i+4, and the amide hydrogen of residue i hydrogen bonds with the carbonyl oxygen of residue i-4.

The i to i+4 hydrogen bond defines the alpha helix, but a single i to i+4 hydrogen bond is not sufficient to exclude other motifs. Multiple i to i+4 hydrogen bonds in sequence are necessary to uniquely define alpha helix. Two complementary fingerprints were used to define alpha helix.

The first definition is oxygen-centric and requires three i to i+4, carbonyl oxygen to amide hydrogen bonds in sequence. If a residue and both its preceding and succeeding residue have i to i+4 hydrogen bonds, that residue is marked as a member of the oxygen-centric definition. The oxygen-centric definition captures residues in the body of a helix and at the N-terminus, but misses residues at the C-terminus of the helix.

The second definition is hydrogen-centric and requires three i to i-4, amide hydrogen to carbonyl oxygen bonds in sequence. If a residue and both its preceding and succeeding residue have i to i-4 hydrogen bonds, that residue is marked as a member of the hydrogen-centric definition. The hydrogen-centric definition captures residues in the body of a helix and at the C-terminus, but misses residues at the N-terminus of the helix.

Together, these two definitions can capture all member residues of a helix. For both definitions, the presence of a bifurcated hydrogen bond was considered a deviation from regular secondary structure, and regions containing such a bifurcation were considered not to be part of alpha helix.
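The two definitions can be illustrated with a simplified sketch. It assumes hydrogen bonds are supplied as a set of `(o_res, h_res)` integer pairs, meaning the carbonyl oxygen of residue `o_res` bonds to the amide hydrogen of residue `h_res`; that representation and the function names are hypothetical. Bifurcated hydrogen bonds, which the definitions above reject, are not handled here.

```python
def has_o_bond(hbonds, i):
    """Carbonyl O of residue i bonds to the amide H of residue i+4."""
    return (i, i + 4) in hbonds

def has_h_bond(hbonds, i):
    """Amide H of residue i bonds to the carbonyl O of residue i-4."""
    return (i - 4, i) in hbonds

def oxygen_centric(hbonds, residues):
    """Residues whose own, preceding, and succeeding O(i)->H(i+4) bonds exist."""
    return {i for i in residues
            if all(has_o_bond(hbonds, k) for k in (i - 1, i, i + 1))}

def hydrogen_centric(hbonds, residues):
    """Residues whose own, preceding, and succeeding H(i)->O(i-4) bonds exist."""
    return {i for i in residues
            if all(has_h_bond(hbonds, k) for k in (i - 1, i, i + 1))}
```

For an idealized helix spanning residues 1 to 12 with bonds `(i, i + 4)` for i from 1 to 8, the oxygen-centric set is residues 2 to 7 and the hydrogen-centric set is residues 6 to 11, which reproduces the N-terminal and C-terminal coverage described above.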

4.6.4.1 Regular alpha helix

To find residues participating in the most regular segments of uninterrupted alpha helix, I first found all the residues in the Top8000 that matched the oxygen-centric helix definition and all the residues that matched the hydrogen-centric definition. I then took the intersection of these two sets of residues. The resulting set of residues represented regular alpha helix. These residues matched both the oxygen-centric and the hydrogen-centric definitions. Therefore, each of these residues was the center residue of three in sequence each with full i to i+4 and i to i-4 hydrogen bonding. These residues were thus necessarily found in the middle sections of regular helices.

I also tried a regular helix definition that required five fully bonded residues in sequence instead of three. This even stricter definition did not produce substantially different contours. On the other hand, requiring only two fully bonded residues in sequence allowed clearly non-helical residues to pass the definition. From these results, I conclude that three fully bonded residues in sequence is a necessary and sufficient definition for highly regular alpha helix in this system.

Figure 4.21: Contours for regular alpha helix definition in 2D CaBLAM space. (x-axis is µin, y-axis is µout.)

Each residue selected by this regular alpha helix definition was represented as a (μin, μout) point in the 2D CaBLAM parameter space. This distribution of points was processed with Silk to produce population percentile contours (Figure 4.21). The resulting contours are extremely tight and define the distribution of highly regular helix residues. Helix caps, helix bends, and other deviations from the absolute standard are not included here.

4.6.4.2 Loose alpha helix

To find residues participating in a broader selection of alpha helical behaviors, I first found all the residues in the Top8000 that matched the oxygen-centric helix definition and all the residues that matched the hydrogen-centric definition. I then took the non-redundant union of these two sets of residues. The resulting set of residues represented a loose definition of alpha helix. These residues match at least one of the oxygen-centric and the hydrogen-centric definitions, but not necessarily both. These residues were therefore not restricted to the middle sections of helices and could include helix caps, helix bends, and other minor deviations from highly regular helix.
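The relationship between the regular and loose helix sets reduces to two set operations, sketched here under the assumption that each fingerprint's matches are available as a collection of hashable residue identifiers (the function name is hypothetical).

```python
def regular_and_loose(oxygen_matches, hydrogen_matches):
    """Return (regular, loose) helix residue sets.

    regular: residues matching both definitions (intersection)
    loose:   residues matching at least one definition (union);
             using sets makes the union non-redundant automatically.
    """
    oxygen, hydrogen = set(oxygen_matches), set(hydrogen_matches)
    return oxygen & hydrogen, oxygen | hydrogen
```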

Figure 4.22: Contours for loose alpha helix definitions in 2D CaBLAM space. (x-axis is µin, y-axis is µout.)

Each residue selected by this loose alpha helix definition was represented as a (μin, μout) point in the 2D CaBLAM parameter space. This distribution of points was processed with Silk to produce population percentile contours (Figure 4.22). The resulting contours have a broader center peak than the regular helix contours do. The loose definition contours also show “arms” stretching horizontally and vertically from the main peak. These arms represent helix caps and other helix to non-helix transitional behaviors.

4.6.5 Beta sheet

Beta sheets are defined by regular patterns of spatially-related hydrogen bonds. No individual hydrogen bonding pair has a predictable sequence relationship. Instead, the hydrogen bonding relationships occur between paired strands, which may run parallel or antiparallel to one another. The multiple bonding relationships and the lack of a consistent sequence relationship among bonding partners make defining a hydrogen bonding pattern for beta sheet more complex than for alpha helix.

Antiparallel beta strands are joined by pairs of hydrogen bonds. Depending on the frame of reference, the Cα of a residue of interest may be surrounded by a “narrow pair” of hydrogen bonds or by a “wide pair” of hydrogen bonds. In the narrow pair case, the amide hydrogen and the carbonyl oxygen of that residue are hydrogen bonding to the partner strand. In the wide pair case, the carbonyl oxygen of the preceding residue and the amide hydrogen of the succeeding residue are hydrogen bonding to the partner strand. If the residue of interest is part of a beta sheet, then it may have a narrow pair bonding relationship to one partner strand and a wide pair bonding relationship to a partner strand on the opposite side. In antiparallel beta, narrow pairs bond to narrow pairs and wide pairs bond to wide pairs.

Parallel beta strands are also joined by pairs of hydrogen bonds. However, in parallel beta every pair of bonds is formed by a narrow pair bonding to a wide pair. That is, the amide hydrogen and the carbonyl oxygen of a residue on one strand bond with the carbonyl oxygen of the residue preceding and the amide hydrogen of the residue succeeding the residue on the opposite strand. Much as in antiparallel beta, a residue in a parallel beta sheet may act as the center of a narrow pair with respect to one partner strand and as the center of a wide pair with respect to a partner strand on the opposite side.

Three definitions were needed to capture beta sheet structure: one centered on an antiparallel narrow pair, one centered on an antiparallel wide pair, and one centered on a parallel pair (which contains both narrow and wide components). Two variations of these three definition sets were produced to capture beta structure at different levels of regularity.

4.6.5.1 Long beta strand fingerprints

The long beta strand definitions required 6 beta-patterned hydrogen bonds in sequence joining two strands. Long beta strand represents residues in the core of continuous segments of beta structure. In the following fingerprint definitions, residues i and j are opposite each other on a pair of beta strands.


If residues i and j are joined by a narrow pair, and there is an additional narrow pair in each direction along the strands for a total of 6 hydrogen bonds, then residues i and j are identified as long beta strand residues joined by a narrow pair. Ten residues, six of which participate in hydrogen bonding, are involved in the long beta strand narrow pair definition.

If residues i and j are joined by a wide pair, and there is an additional wide pair in each direction along the strands for a total of 6 hydrogen bonds, then residues i and j are identified as long beta strand residues joined by a wide pair. Fourteen residues, eight of which participate in hydrogen bonding, are involved in the long beta strand wide pair definition.

If residues i and j are joined by the narrow-to-wide pair characteristic of parallel beta, and there is an additional narrow-to-wide pair in each direction along the strands for a total of 6 hydrogen bonds, then residues i and j are identified as long beta strand residues joined by a parallel beta pair. Twelve residues, seven of which participate in hydrogen bonding, are involved in the parallel beta pair definition.

For all beta strand definitions, appropriate parallel or antiparallel sequence relationships between partner strands are required. Bifurcated hydrogen bonds are considered deviations from the expected structure, and regions containing such bifurcations are considered not to be beta structure.


4.6.5.2 Beta bridge fingerprints

The beta bridge definitions required 4 beta-patterned hydrogen bonds in sequence joining two strands. Beta bridge represents a near-minimal unit of beta structure. The inspiration for this category came from the DSSP category of the same name, designated “B” in DSSP annotation. In the following fingerprint definitions, residues i and j are opposite each other on a pair of beta strands.

If residues i and j are joined by a narrow pair, and there is an additional hydrogen bond (one half of a narrow pair) in each direction along the strands for a total of 4 hydrogen bonds, then residues i and j are identified as beta bridge residues joined by a narrow pair. Ten residues, six of which participate in hydrogen bonding, are involved in the beta bridge narrow pair definition.

If residues i and j are joined by a wide pair, and there is an additional hydrogen bond (one half of a wide pair) in each direction along the strands for a total of 4 hydrogen bonds, then residues i and j are identified as beta bridge residues joined by a wide pair. Fourteen residues, eight of which participate in hydrogen bonding, are involved in the beta bridge wide pair definition.

If residues i and j are joined by the narrow-to-wide pair characteristic of parallel beta, and there is an additional hydrogen bond (one half of a narrow-to-wide pair) in each direction along the strands for a total of 4 hydrogen bonds, then residues i and j are identified as beta bridge residues joined by a parallel beta pair. Ten residues, five of which participate in hydrogen bonding, are involved in the parallel beta bridge definition.

For all beta strand definitions, appropriate parallel or antiparallel sequence relationships between partner strands are required. Bifurcated hydrogen bonds are considered deviations from the expected structure, and regions containing such bifurcations are considered not to be beta structure.

4.6.5.3 Regular beta sheet

To find residues participating in highly regular beta sheet, I first found all the residues in the Top8000 that matched each of the long beta strand fingerprints in turn. Then I selected the residues that appeared in any two of those lists. This selection found only residues from strands internal to larger sheets. Paired beta strands not incorporated into larger sheets were excluded due to a suspicion that such paired strands are capable of severe twists not typical of regular beta structure.
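The "any two of those lists" selection can be sketched with a counter. Two assumptions are made here: "any two" is read as "at least two", and each fingerprint's matches arrive as a collection of hashable residue identifiers. The function name is hypothetical.

```python
from collections import Counter

def regular_beta(narrow_matches, wide_matches, parallel_matches):
    """Residues matched by at least two of the three long beta strand fingerprints."""
    counts = Counter()
    for matches in (narrow_matches, wide_matches, parallel_matches):
        # set() ensures a residue listed twice by one fingerprint counts once.
        counts.update(set(matches))
    return {residue for residue, n in counts.items() if n >= 2}
```

Requiring two independent matches is what restricts the selection to residues with partner strands on both sides.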

Figure 4.23: Contours for regular beta strand definition in 2D CaBLAM space. (x-axis is µin, y-axis is µout.)

Each residue selected by this regular beta sheet definition was represented as a (μin, μout) point in the 2D CaBLAM parameter space. This distribution of points was processed with Silk to produce population percentile contours (Figure 4.23). The resulting contours cover a significant portion of the 2D CaBLAM space, but fall off sharply enough to be well-defined. Because the residues selected by the long beta strand fingerprint were at least one full bonding pair away from the end of a beta strand, the set of regular beta sheet residues also tends to exclude turns between beta strands and other beta to non-beta transitional behaviors. As a result, an analogue to the “arms” seen in the loose alpha helix contours does not appear.

4.6.5.4 Loose beta sheet

To find residues participating in a broader selection of beta strand behaviors, I first found all the residues in the Top8000 that matched each of the beta bridge fingerprints in turn. Then I took the non-redundant union of these three sets of residues. Thus each residue in the loose beta sheet set participates in at least one beta-type bonding partnership with at least one partner strand. A second partner strand is not required for loose beta sheet as it was for regular beta sheet.

Figure 4.24: Contours for loose beta strand definition in 2D CaBLAM space. (x-axis is µin, y-axis is µout.)

Each residue selected by this loose beta sheet definition was represented as a (μin, μout) point in the 2D CaBLAM parameter space. This distribution of points was processed with Silk to produce population percentile contours (Figure 4.24). The resulting contours are extremely loose and cover much more of the 2D CaBLAM space than the contours for other secondary structure definitions do at comparable contour levels. “Arms” indicative of transitional behaviors may be visible in the contours, stretching along the edges of the typical representation of CaBLAM space. However, the multitude of conformations accepted by the loose beta sheet definition prevents such details from being clearly visible.

4.6.6 3₁₀ helix

3₁₀ helix, like alpha helix, is defined by a regular pattern of sequence-related hydrogen bonding. (Alpha helix would be 4₁₃ in that nomenclature; the main number refers to the number of residues between hydrogen bonds, and the subscript refers to the number of atoms in the cycle enclosed by the helical hydrogen bond.) In regular 3₁₀ helix, the carbonyl oxygen of residue i hydrogen bonds with the amide hydrogen of residue i+3, and the amide hydrogen of residue i hydrogen bonds with the carbonyl oxygen of residue i-3. However, 3₁₀ helix is a relatively strained conformation and “regular” 3₁₀ helix is relatively rare. More often, 3₁₀ helix occurs in short stretches, often transitioning into or out of regular alpha helix.

Assessment of 3₁₀ helix was added relatively late in CaBLAM development. By that time, I had settled on using the loose alpha helix definition to generate contours for identifying alpha helix in low-resolution models (see section 5.2.2). Because 3₁₀ helix behaves similarly to alpha helix and the two helix types would need to be compared directly, the contours for 3₁₀ helix had to behave similarly to the loose alpha helix contours. As a result, there was no serious effort to define a regular 3₁₀ helix pattern. The relative rarity of extended 3₁₀ helix would likely have frustrated efforts to define tight contours, regardless.

As a result of 3₁₀ helix’s scarcity and the need to identify 3₁₀ helix transitioning into alpha helix, the fingerprint for 3₁₀ helix is slightly unusual. For residue i to be identified as 3₁₀ helix, the carbonyl oxygen of residue i must hydrogen bond to the amide hydrogen of residue i+3, and the amide hydrogen of residue i must hydrogen bond to the carbonyl oxygen of residue i-3. In addition, the carbonyl oxygen of residue i-1 must bond to the amide hydrogen of residue i+2, and the carbonyl oxygen of residue i-2 must bond to the amide hydrogen of residue i+1. The result is a box of consistent 3₁₀ helix behavior around residue i that does not require greater continuity.
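The four required bonds can be written out as a simplified check. As an assumption for illustration, hydrogen bonds are represented as a set of `(o_res, h_res)` integer pairs, meaning the carbonyl oxygen of `o_res` bonds to the amide hydrogen of `h_res`; the function name is hypothetical and this is not the fingerprint syntax CaBLAM itself uses.

```python
def is_310_center(hbonds, i):
    """True if the four i to i+3 hydrogen bonds surrounding residue i are present."""
    required = [
        (i, i + 3),      # O of i     -> H of i+3
        (i - 3, i),      # O of i-3   -> H of i
        (i - 1, i + 2),  # O of i-1   -> H of i+2
        (i - 2, i + 1),  # O of i-2   -> H of i+1
    ]
    return all(pair in hbonds for pair in required)
```

Because the test looks only at the bonds spanning residue i, a residue can match even when the helix does not continue beyond this box on either side.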

4.6.6.1 Loose 3₁₀ helix

To define loose 3₁₀ helix behavior, I simply used the 3₁₀ helix fingerprint to find all matching residues in the Top8000. No further processing was necessary to modify the selected set.


Figure 4.25: Contours of loose 310 helix definition in 2D CaBLAM space. (x-axis is µin, y-axis is µout.)

Each residue selected by this 310 helix definition was represented as a (μin, μout) point in the 2D CaBLAM parameter space. This distribution of points was processed with Silk to produce population percentile contours (Figure 4.25). The resulting contours are similar in character to the loose alpha helix contours, as intended, with a high central peak and vertical and horizontal arms indicative of transitional behaviors.

Interestingly, the arms on the 310 helix contours do not extend as far in 2D CaBLAM space as the arms for alpha helix. This may be a foible of the specific fingerprint definition used here, or may indicate that the conformational strain of 310 helix restricts the conformational space of transitions out of 310 helix.

4.7 Locating structure in CaBLAM space

By default, discussion of motifs in this section will be referring to the 2D CaBLAM space with μin as its x-axis and μout as its y-axis. This space is the one used in CaBLAM validation for automated identification of secondary structure elements. Motifs do have distinct distributions in the full 3D CaBLAM space, but that space is more complex, and the ν dimension becomes diagnostic of errors rather than intent at low resolution. To understand CaBLAM space, it also is crucial to remember that the μout of one residue is also the μin of the next. Thus any two sequential residues plotted in CaBLAM space will share a value, albeit on different axes.
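The shared value follows directly from how the two dihedrals overlap. The sketch below assumes that μin for residue i is the dihedral over the Cα atoms of residues i-2, i-1, i, and i+1, and that μout is the dihedral over residues i-1, i, i+1, and i+2. That assumption is consistent with the five-residue dependence described in this work, but the helper names and the toy coordinates are purely illustrative.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for four points (0 = cis, 180 = trans)."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    length = math.sqrt(_dot(b1, b1))
    b1 = tuple(c / length for c in b1)
    # Project the outer vectors onto the plane perpendicular to the central axis.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


def mu_in_out(ca, i):
    """(mu_in, mu_out) for residue i from a list of C-alpha coordinates (assumed definition)."""
    mu_in = dihedral(ca[i - 2], ca[i - 1], ca[i], ca[i + 1])
    mu_out = dihedral(ca[i - 1], ca[i], ca[i + 1], ca[i + 2])
    return mu_in, mu_out


# Arbitrary toy trace of six C-alpha positions (not real protein geometry).
trace = [(0, 0, 0), (3.8, 0, 0), (5, 3.5, 0), (8, 4, 2), (9, 7, 3), (12, 8, 1)]
residue_2 = mu_in_out(trace, 2)
residue_3 = mu_in_out(trace, 3)
```

Here the μout of residue 2 and the μin of residue 3 are computed from the same four atoms, so the two plotted points share that value on different axes.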

4.7.1 Handedness

One of the significant protein backbone properties represented in CaBLAM space is handedness (Figure 4.26). The CaBLAM representation of handedness is not to be confused with the handedness nomenclature for the stereochemistry of amino acids. CaBLAM does not analyze the Cα as a chiral center and does not distinguish specifically between L and D amino acids. Handedness in CaBLAM is a descriptor of the path of a protein’s mainchain trace.

Figure 4.26: Abstract representation of Cα trace handedness in 2D CaBLAM space.

The origin point (0,0) is located at the center of most representations of CaBLAM space, and corresponds to two sequential cis Cα dihedrals. Moving right from the origin indicates that the μin dihedral is increasing in right-handed character. After crossing the 90° point, moving further right indicates that the μin dihedral is becoming increasingly linear, until it reaches the edge of the plot and a fully planar trans configuration.

Similarly, moving left from the origin introduces left-handed character to the μin dihedral. Passing the midpoint introduces increasing linearity, until the μin dihedral becomes fully planar trans at the edge of the plot. (The space is continuous across the edge of the plot.) Likewise, moving upwards from the origin makes the μout dihedral right-handed, and moving downwards makes the μout dihedral left-handed, with planar trans conformation at the top and bottom edge of the plot.

The locations of major secondary structure types in the 2D CaBLAM space are instructive to understanding how movements along the axes interact.

4.7.2 Alpha helix in CaBLAM space

Residues falling in the upper right quadrant on the plot (Quadrant I) have right-handed character entering the residue (μin) and right-handed character exiting the residue (μout). This, therefore, is where regular right-handed helices appear (Figure 4.21).

Two “arms” extend from the contours for alpha helix (Figure 4.22). The arm in the upper left quadrant (Quadrant II) contains residues with left-handed μin and right-handed μout. These are residues transitioning into alpha helix from other conformations. The arm in the lower right quadrant (Quadrant IV) contains residues with right-handed μin and left-handed μout. These are residues transitioning out of alpha helix and into other conformations. Many, though not all, helix capping motifs will appear in these arms.
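The quadrant language used here maps onto the signs of the two dihedrals. The following sketch only labels quadrants and handedness; it does not reproduce the actual helix contours, and the function name, the dictionary, and the treatment of exactly zero as positive are assumptions made for illustration.

```python
# Handedness read from the sign of each dihedral (positive = right-handed).
QUADRANT_MEANING = {
    "I": "right-handed in, right-handed out",
    "II": "left-handed in, right-handed out",
    "III": "left-handed in, left-handed out",
    "IV": "right-handed in, left-handed out",
}


def quadrant(mu_in, mu_out):
    """Return the plot quadrant for a (mu_in, mu_out) point given in degrees."""
    if mu_in >= 0 and mu_out >= 0:
        return "I"
    if mu_in < 0 and mu_out >= 0:
        return "II"
    if mu_in < 0 and mu_out < 0:
        return "III"
    return "IV"


helix_core = quadrant(50.0, 50.0)
entering_arm = quadrant(-60.0, 50.0)
exiting_arm = quadrant(50.0, -60.0)
```

The three example points are hypothetical values chosen only to land in Quadrants I, II, and IV respectively.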

4.7.3 Beta sheet in CaBLAM space

Residues participating in beta sheets are mostly linear. Therefore, the distribution for beta sheet residues is centered around the corners of the plot, where both μin and μout are close to the linear trans conformation (Figure 4.23). There is more variation in allowable conformations for regular beta strands than for regular helix, since strands can twist significantly but remain beta structure as long as they remain paired. To my surprise, distributions for parallel beta and antiparallel beta were not substantially different in CaBLAM space. Due to the fingerprint definitions used to generate these contours, the areas of transition into and out of regular beta structure are not readily visible. These transition areas would appear along the edges, but away from the corners of the plot. Residues along the vertical edges would be those leaving a beta strand, having a beta-like linear μin and a more twisted μout. Residues along the horizontal edges would be those entering a beta strand, having a twisted μin and a linear μout.

4.7.4 Other motifs in CaBLAM space

Three other discrete motifs merit consideration here for what they demonstrate about the CaBLAM parameter space.

310 helix is a tighter-turning helix variant with i to i+3 hydrogen bonding, rather than the i to i+4 bonding of the more common alpha helix. This bonding pattern results in a distinctive triangle shape to the Cα trace when a 310 helix is viewed down the helix axis. The Cα dihedral must become elongated to accommodate the tighter turn. Thus residues in 310 helix are generally further right and further up in Quadrant I than residues in alpha helix, placing them closer to the linear trans configuration in the upper right corner (Figure 4.27).


Figure 4.27: Comparison of alpha helix and 310 helix contours in 2D CaBLAM space. (x-axis is µin, y-axis is µout.)

If there were enough π helix residues to establish motif contours, those contours would fall below and to the left of alpha helix in Quadrant I. The extremely rare π helix has an i to i+5 hydrogen bonding pattern and a much wider turn than alpha helix. The Cα dihedrals must assume conformations closer to cis to accommodate this wider turn.

Beta bulges are a particularly well-documented (Richardson, 1978; Milner-White, 1987; Dion-Schultz & Howell, 1997; etc.) deviation from regular beta sheet structure. Bulges are a single-residue break from linear strand behavior on the “bulged” strand, so they appear in 2D CaBLAM along an edge, but not in a corner like regular beta structure. In the 3D CaBLAM space (Figure 4.13), beta bulges are even more distinctive. The bulge places two carbonyl oxygens (which are used to calculate ν) roughly in parallel, such that ν is near 0°. As a result, the cluster is centered on the right and left hand walls of the cube that defines 3D CaBLAM space.

4.8 Discussion

The four CaBLAM parameters µin, µout, ν, and Cα virtual angle define a series of unique parameter spaces describing protein backbone conformation. This parameter set grew out of a desire for a minimal representation of protein geometry, suitable for describing structural motifs at low resolution. It evolved into a system for low-resolution protein structure validation, which will be discussed in the next chapter.

One of the advantages of the primary 3D CaBLAM parameter space (µin, µout, ν) is its large size, combined with the wide range of values that µin, µout, and ν can assume. Essentially every value for each of µin, µout, and ν is populated, given an appropriate combination of the other two parameters. Compare this behavior to that of the Ramachandran plot, where residues are observed for essentially all possible ψ values, but there are large gaps in the accessible ϕ values. The CaBLAM space’s large size and its parameters’ large observed range allow motifs to be spread out in CaBLAM space where they would be compressed in Ramachandran space. Nuanced differences among related motifs are easier to see, in general, in the expanded CaBLAM space. This property makes CaBLAM uniquely useful for describing even high resolution structures where its validation method is (usually) unnecessary.


The CaBLAM parameters and parameter spaces are now largely set in stone due to their use in the validation and annotation methods discussed in the next chapter. If I could reinvent the CaBLAM parameters, I would define ν in a less idiosyncratic fashion. The ν definition I use was born out of an overriding interest in a minimal-atom parameterization. The cutoffs for identifying outliers in CaBLAM space were set and in use before I realized that my needs had changed, and reworking the parameter spaces and resetting the cutoffs did not seem like a worthwhile investment of time. If developing a new ν, I would first try the dihedral defined by O(i-1), Cα(i-1), Cα(i), and O(i), and proceed from there.

One parameterization that may be open to change is my definition of beta structure. The definition I tested for loose beta was very loose, allowing very short edge strands. The definition I tested for regular beta was very regular, allowing only long center strands of a beta sheet. Of these two, the regular beta definition was clearly more useful for finding structure that was unambiguously beta, and so that definition was put into use. However, a middle-ground definition was not tested, and the regular beta definition may not adequately capture long edge strands. A “long, loose” beta definition would be worth exploring.


5. CaBLAM validation

One of the primary uses for the CaBLAM parameter spaces is the validation of protein backbone geometry for low-resolution structures. This section deals with the specifics of developing, implementing, and using the CaBLAM validation system.

5.1 Methodology predecessors

CaBLAM’s data storage and method were based on two of the Richardson Lab’s existing validation methods, Ramalyze and Rotalyze. Ramalyze is a method for identifying protein backbone conformation outliers in Ramachandran space (Lovell et al, 2003). Rotalyze is a method for identifying protein sidechain rotamer outliers (Lovell, 2000). Both of these methods had been implemented in Phenix by the time I began developing CaBLAM, and they served as references for how to develop in the Phenix/CCTBX system (Adams et al, 2010).

Ramalyze and Rotalyze store and access information on the expected protein behavior as percentile contours. These contours are stored as text files (usually with a .stat extension) generated by the Richardson Lab program Silk. The contours for Ramalyze are two-dimensional, using the familiar Ramachandran dihedrals ϕ and ψ. Rotalyze contours use the χ dihedrals and vary in dimensionality depending on the length of the sidechain they represent. Separate sets of Ramachandran contours are stored for each type of residue that behaves differently in Ramachandran space – Gly, trans-Pro, cis-Pro, prePro, Ile/Val, and the general case (Richardson, 2013). Separate sets of rotamer contours are stored for each amino acid type with a sufficiently long sidechain.

CaBLAM stores and accesses contour information in the same way as Ramalyze and Rotalyze. It also stores multiple contour sets to capture different behaviors. Three contour sets are stored for 3D CaBLAM space: Gly, Pro, and general. Likewise three contour sets are stored for the 3D Cα geometry space: Gly, Pro, and general. Three contour sets are stored for the 2D CaBLAM space: alpha helix, beta sheet, and 310 helix.

5.2 Setting contour cutoffs

Ramalyze uses a system with three regions divided by two contour cutoffs to describe residue quality during validation. Cutoffs are based on the population distribution of residues from our Top8000 dataset that pass a mainchain b-factor <= 30 quality filter. Favored Ramachandran conformations are at or within the contour that contains 98% of our quality-filtered data. Allowed Ramachandran conformations are at or within the contour that contains 99.8% of the data. Outliers fall outside of the Allowed contour, and are therefore in the bottom 0.2% of observed protein backbone behavior.

CaBLAM uses a similar system of two contour cutoffs to define three validation regions: Favored, Disfavored, and Outlier. The Favored and Outlier categories in CaBLAM have clear analogs in their counterparts in Ramachandran space. The Disfavored region of CaBLAM space is roughly analogous to the Allowed region of Ramachandran space, but is named differently to emphasize more strongly the need to inspect and address the conformations of residues in this region.

5.2.1 3D CaBLAM space

The 3D CaBLAM space is defined by the parameters µin, µout, and ν. This is the parameter space used to identify CaBLAM outliers, as well as residues with Favored or Disfavored conformations.

Contour cutoffs for the 3D CaBLAM space were set empirically through inspection of low-resolution structures with known outliers. 2o01.pdb, a 3.4Å Photosystem I structure containing many trans-membrane helices with a multitude of modeling errors, was the primary reference for alpha helix (Amunts, 2007). A 70S E. coli ribosome structure supplied to us by Jamie Cate for a collaboration on improving that structure was the primary reference for beta sheet (Dunkle et al, 2011). The proteins of this ribosome structure contained a number of easily identified backbone modeling errors in which three (or more) C-O bond vectors in a row point in the same direction rather than alternating.

I had developed a kinemage markup scheme for showing CaBLAM outliers, so I tried different cutoff levels and observed the relative coverage of the resulting markup. I was looking for a cutoff level that correctly identified most of the known outliers, but which did not incorrectly flag many other residues.

One of the major challenges of CaBLAM validation was quickly evident. CaBLAM does not like loops. The CaBLAM contours are based on the percentage frequency with which a conformation occurs. As in Ramachandran space, highly populated regions such as alpha helix and beta sheet are therefore highly favored. There is a far greater variety of loop conformations, so loop conformations are spread out over conformational space and appear to be less favored. It was not possible to find a single cutoff level that both rejected known outliers and accepted legitimate loop conformations due to the low apparent favorability of even legitimate loops.

A system using two cutoffs was the solution. An Outlier cutoff set at 1% exclusion (which rejects the bottom 1% of observed conformations) captures the most severe outliers, and mostly accepts loops as valid conformations. A second Disfavored cutoff set at 5% exclusion (which rejects the bottom 5% of observed conformations) rejects many more loop residues, but also captures outliers that the more severe cutoff accepts. The combination of these two cutoffs provides a more complete and nuanced description of low-favorability residues than either could alone.

It should be noted that, although the contours for Ramachandran space and for CaBLAM space are produced in the same way, contour levels are referenced differently in the two spaces. Ramachandran space was traditionally concerned with describing favorable conformations, so contour levels are referenced as percent inclusion. Thus the usual Ramachandran cutoffs are 98% inclusion and 99.8% inclusion. CaBLAM space, like MolProbity rotamers, is more concerned with describing outlier conformations, so contour levels are referenced as percent exclusion. The Outlier cutoff for CaBLAM space is 1% exclusion, which would be 99% inclusion in the Ramachandran scheme. Likewise, CaBLAM’s Disfavored cutoff of 5% exclusion corresponds to 95% inclusion. CaBLAM contour values are presented as percent exclusion in part to emphasize the approach towards 0% favorability of highly distorted residues.
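The two-cutoff scheme and the inclusion/exclusion conversion can be summarized in a few lines. This is a sketch under stated assumptions, not the Phenix implementation: contour levels are taken as fractions (0.01 for 1%), and the handling of a score that falls exactly on a cutoff is my assumption here. The constant and function names are hypothetical.

```python
OUTLIER_CUTOFF = 0.01     # 1% exclusion
DISFAVORED_CUTOFF = 0.05  # 5% exclusion


def classify_3d_cablam(score):
    """Map a 3D CaBLAM contour level (as a fraction) to a validation region."""
    if score < OUTLIER_CUTOFF:
        return "Outlier"
    if score < DISFAVORED_CUTOFF:
        return "Disfavored"
    return "Favored"


def exclusion_to_inclusion(percent_exclusion):
    """Convert a percent-exclusion contour level to percent inclusion."""
    return 100.0 - percent_exclusion


regions = [classify_3d_cablam(s) for s in (0.00391, 0.01458, 0.29194)]
ramachandran_style = [exclusion_to_inclusion(p) for p in (1.0, 5.0)]
```

With these definitions the CaBLAM Outlier and Disfavored cutoffs translate to 99% and 95% inclusion, which can be set beside the 99.8% and 98% inclusion levels used in Ramachandran validation.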

5.2.2 2D CaBLAM space

The 2D CaBLAM space is defined by the parameters µin and µout. This is the parameter space used to identify intended secondary structure types from the protein Cα trace.

As with the 3D CaBLAM space contours, contour cutoffs for the 2D CaBLAM space were set empirically through inspection of low-resolution structures with known outliers. For these contours, too, 2o01.pdb served as the primary reference for alpha helix and Jamie Cate’s 70S ribosome served as the primary reference for beta sheet. The 2D CaBLAM space is intended to identify residues as probable secondary structure from their Cα geometry, so I looked for good coverage of known secondary structure elements with minimal identification of loop residues as secondary structure.

For testing, I added numerical feedback on secondary structure identity to CaBLAM’s outlier markup. Each outlier residue was compared against four different contour sets for secondary structure: regular alpha helix, loose alpha helix, regular beta sheet and loose beta sheet (Figures 5.1 & 5.2). The contour level of the residue for each of these contour sets was reported for each outlier. I paid particular attention to how CaBLAM outlier residues were identified as secondary structure.

Once again, loops presented a challenge for CaBLAM. In this case, individual loop residues might fall within secondary structure contours, while remaining isolated from any continuous sequence of secondary structure residues. The solution to this problem was to enforce continuity among residues falling within secondary structure contours before identifying those residues as probable secondary structure. In the final system, a residue that falls within a secondary structure cutoff must also be preceded and succeeded by residues that fall within that cutoff to be identified as probable secondary structure. This requirement effectively differentiates between residues that are part of continuous secondary structure elements and isolated residues that just happen to have secondary structure-like conformations. The three-in-a-row requirement draws a clear parallel to the hydrogen bonding motif definition for alpha helix also used by CaBLAM.

I determined that the regular alpha helix contours were much too tightly clustered to be useful for identifying helix in error-prone low-resolution structures. The loose alpha helix contours are more forgiving of modeling errors and also capture a greater range of helix behavior (Figure 5.1). The 0.1% exclusion contour (or 99.9% inclusion) of the loose alpha helix contours provided good coverage of known helix elements, even in the presence of CaBLAM outliers (Figure 5.3).

Figure 5.1: Comparison of regular (red) and loose (orange) alpha helix definitions in 2D CaBLAM space. (x-axis is µin, y-axis is µout.)

Conversely, I determined that the loose beta sheet contours were too permissive to prevent misidentification of non-sheet residues as beta sheet. The regular beta sheet contours were more discriminating and served as the contours of choice for beta sheet (Figure 5.2). Even so, a more permissive cutoff level was needed to adequately capture the range of beta strand conformations. The 0.01% exclusion contour (99.99% inclusion) of the regular beta sheet contours provided adequate coverage of known sheet elements, even in the presence of CaBLAM outliers (Figure 5.3). As discussed in Section 4.8, developing a “long, loose” set of beta sheet contours may help resolve the difficulties with this space.

Figure 5.2: Comparison of regular (green) and loose (blue) beta strand definitions in 2D CaBLAM space. (x-axis is µin, y-axis is µout.)

Eventually, a third set of contours was added for 310 helix. A loose definition of 310 helix was used to complement the loose definition of alpha helix. The cutoff for 310 helix was set at the same 0.1% exclusion level as for alpha helix to allow for direct comparison between the two contour sets. There is overlap at the 0.1% level between the alpha helix contours and the 310 helix contours in 2D CaBLAM space (Figure 5.3). If a residue meets both cutoffs, the ambiguity is resolved by assigning it to the helix type for which it scores higher (see Section 5.3.4 for further details on 310 helix assignment).

Figure 5.3: All final secondary structure cutoffs superimposed in 2D CaBLAM space. (x-axis is µin, y-axis is µout.)

5.2.3 3D Cα geometry space

The 3D Cα geometry space is defined by the parameters µin, µout, and Cα virtual angle. These parameters are calculated solely from Cα positions. This is the parameter space used to identify Cα geometry outliers that may interfere with CaBLAM’s other assessments.

The 3D Cα geometry space was a relatively late addition to the family of CaBLAM parameter spaces. CaBLAM depends on having a reliable Cα trace on which to base its assessments, and so a safety check on the quality of the Cα trace was necessary. Unlike the 2D and 3D CaBLAM spaces, the contour cutoff level for the Cα geometry space was not set through empirical observation of known outliers. It was instead set to the 0.5% exclusion (99.5% inclusion) level as a conservative estimate of the level of Cα trace distortion below which I would lose confidence in CaBLAM’s ability to assess protein backbone.

Figure 5.4: Superposition of CaBLAM's Cα geometry contours (blue mesh) with Gerard Kleywegt's.

This contour level proved to match reasonably well with 2D Cα geometry contours used by Gerard Kleywegt in protein validation (Kleywegt, 1997), if allowances are made for the differences between 2D and 3D parameter spaces and the smoother contours afforded us by our larger Top8000 database (Figure 5.4). In addition, the Cα geometry contours show near total coverage of space in the µin and µout dimensions (Figure 5.5). Essentially all µin/µout combinations are viable in the 3D Cα geometry space at this contour level, given an appropriate Cα virtual angle. Because of the agreement with previous work and the good coverage of the relevant space, I feel confident that the 0.5% exclusion level is a useful cutoff for CaBLAM’s purposes.

Figure 5.5: Cα geometry contours for the general case show near-total coverage of μin/μout space at the 0.5% cutoff.

5.3 CaBLAM workflow

This section presents a general outline of CaBLAM’s workflow as it assesses a protein structure. The next section will present the types of feedback generated by specific commandline calls.


5.3.1 Read-in and calculation

When a PDB file is submitted to CaBLAM, it is first read into the “hierarchy” object used by many of the programs in the Phenix crystallography suite. The hierarchy is then interpreted into a custom storage object unique to CaBLAM. This object contains links between residues adjacent in sequence, a property vital to some of CaBLAM’s operations and missing from the hierarchy, at least at the time of CaBLAM’s development.

Residue adjacency is determined by read-in order and by peptide bond distance. The Phenix hierarchy object reliably orders residues according to sequence. Residues that are not adjacent to each other in the hierarchy’s read order are presumed not to be adjacent to each other in sequence. Each residue is checked against the previous residue in the read order for spatial proximity. The length of the C-N peptide bond between the previous residue and the current one is calculated. If that bond length is <= 2.0Å, the residues are considered bonded and a link is made to treat them as adjacent in sequence.

A 2.0Å cutoff is quite generous for the peptide bond. The bond distance given by Engh and Huber (2001) for the general case peptide bond is 1.336 ± 0.023Å, for comparison. The bonding cutoff was intentionally set to be generous because CaBLAM deals with low-resolution models that may have poor bonding geometry. The 2.0Å value was selected because 2.0Å is the bond cutoff used by the program O for model-building (Jones, 1991). With this generous bonding definition, it should be possible to run CaBLAM on very early or very poor models to obtain validation feedback and secondary structure restraint suggestions.
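The adjacency test can be sketched as a distance check between consecutive residues in read order. This sketch does not use the Phenix hierarchy or CaBLAM's own storage object; residues are represented as plain dictionaries holding "C" and "N" coordinates, which is an assumption made purely for illustration, as are the names below.

```python
import math

PEPTIDE_BOND_CUTOFF = 2.0  # Angstroms; the generous cutoff described above


def link_adjacent(residues):
    """For each consecutive pair in read order, report whether they are bonded.

    Each residue is a dict with "C" and "N" keys giving (x, y, z) coordinates.
    Returns a list of booleans one shorter than the input list.
    """
    links = []
    for previous, current in zip(residues, residues[1:]):
        bond_length = math.dist(previous["C"], current["N"])
        links.append(bond_length <= PEPTIDE_BOND_CUTOFF)
    return links


# Toy coordinates: the first pair is about 1.33 A apart, the second pair is 5 A apart.
toy_residues = [
    {"N": (0.0, 0.0, 0.0), "C": (2.4, 0.0, 0.0)},
    {"N": (3.73, 0.0, 0.0), "C": (6.1, 0.0, 0.0)},
    {"N": (11.1, 0.0, 0.0), "C": (13.5, 0.0, 0.0)},
]
toy_links = link_adjacent(toy_residues)
```

The second pair fails the distance test, so the toy chain would be treated as two separate segments.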

After read-in is complete, the four CaBLAM parameters – µin, µout, ν, and Cα virtual angle – are calculated for each residue. These calculations are not possible for the first two and the last two residues in a continuous segment because the CaBLAM parameters depend on the positions of atoms in five residues in sequence. Therefore residues within two positions of chain termini or chain breaks cannot be evaluated with CaBLAM. After calculation, the CaBLAM parameters for each residue are stored with that residue object.
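The effect of chain breaks on which residues can be evaluated is easy to see with index arithmetic. The sketch below assumes bonding between consecutive residues has already been reduced to a list of booleans (True where a peptide link was made); the function name and that input layout are hypothetical.

```python
def evaluable_indices(links):
    """Return indices of residues that have two linked neighbors on each side.

    links[k] is True when residue k and residue k+1 are bonded, so a model
    with n residues supplies n-1 entries.
    """
    residue_count = len(links) + 1
    # Split the read order into continuous segments at each missing link.
    segments = []
    start = 0
    for k, bonded in enumerate(links):
        if not bonded:
            segments.append((start, k))
            start = k + 1
    segments.append((start, residue_count - 1))
    # Within each segment, drop the first two and last two residues.
    evaluable = []
    for first, last in segments:
        evaluable.extend(range(first + 2, last - 1))
    return evaluable


# Eleven residues with a chain break between residues 4 and 5.
broken_chain = [True, True, True, True, False, True, True, True, True, True]
usable = evaluable_indices(broken_chain)
```

In this example a five-residue segment yields one evaluable residue and a six-residue segment yields two, so only three of the eleven residues receive CaBLAM parameters.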

5.3.2 Find CaBLAM space outliers

Three 3D CaBLAM space contour sets are loaded: Glycine case, Proline case, and general case (Figures 4.17, 4.15, and 4.13). For each residue with a complete set of CaBLAM parameters, that residue’s µin, µout and ν parameters are interpreted as a point in 3-space thus: (µin, µout, ν). This point is compared to the contour set that matches that residue’s type. This generates a numerical score corresponding to the contour level at which the point falls. If the residue scores at 5% or worse, it is added to a list of CaBLAM outliers. Later processing will distinguish between the Outlier (1%) and Disfavored (5%) categories.

5.3.3 Check for Cα geometry outliers

Three 3D Cα geometry space contour sets are loaded: Glycine case, Proline case, and general case (Figures 4.18, 4.16, and 4.14). For each residue with a complete set of CaBLAM parameters, that residue’s µin, µout and Cα virtual angle parameters are interpreted as a point in 3-space thus: (µin, µout, virtual angle). This point is compared to the contour set that matches that residue’s type, and the contour score is stored. If the residue scores 0.5% or worse, it is added to a list of Cα geometry outliers.
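The third coordinate of this space and the cutoff test can be sketched as follows. The sketch assumes the Cα virtual angle is the angle at the Cα of residue i between the Cα atoms of residues i-1 and i+1, and it takes the contour score as already looked up; reading the contour files is not shown. The names and the handling of a score exactly at the cutoff are assumptions.

```python
import math

CA_GEOMETRY_CUTOFF = 0.005  # 0.5% exclusion, expressed as a fraction


def ca_virtual_angle(ca_prev, ca_curr, ca_next):
    """Angle in degrees at ca_curr between ca_prev and ca_next (assumed definition)."""
    u = [a - b for a, b in zip(ca_prev, ca_curr)]
    v = [a - b for a, b in zip(ca_next, ca_curr)]
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(a * a for a in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


def is_ca_geometry_outlier(score):
    """True if a C-alpha geometry contour level falls below the cutoff."""
    return score < CA_GEOMETRY_CUTOFF


right_angle = ca_virtual_angle((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0))
flags = [is_ca_geometry_outlier(s) for s in (0.00236, 0.25812)]
```

The two scores in the usage line are the Cα geometry values reported for residues 720 and 722 of 2o01.pdb in the sample text output of Section 5.4.1.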

5.3.4 Assign secondary structure

Three 2D CaBLAM space contour sets are loaded: (loose) alpha helix, (regular) beta sheet, and (loose) 310 helix (Figure 5.3). For each residue with a complete set of CaBLAM parameters, that residue’s µin and µout are interpreted as a point in 2-space thus: (µin, µout). This point is compared to the contour sets for each secondary structure type, and the contour score is stored for each.

The contour cutoffs for identification as secondary structure are 0.1% for alpha helix, 0.1% for 310 helix, and 0.01% for beta sheet. However, passing the contour cutoff for a single residue is insufficient for that residue to be identified as secondary structure; it must be part of a continuous (and contiguous) pattern of residues that all pass the cutoff.

For alpha helix, if residue i-1, residue i, and residue i+1 all pass the alpha helix cutoff, then residue i is identified as alpha helix. This requirement that a residue must be in the center of helical continuity to be identified as alpha helix helps ensure that helices terminate neatly, despite the “arms” in the alpha helix contours. As a result, helix identification may terminate before reaching the helix caps. Helix caps may have unique geometries and will be better assessed by methods targeted at non-repeating motifs.

For 310 helix, if residue i passes the 310 cutoff and either residue i-1 or residue i+1 also passes the cutoff, that residue is identified as 310 helix. 310 helix tends to occur in shorter segments than alpha helix, so a less strict continuity requirement was necessary. Since there is overlap between the alpha helix contours and the 310 helix contours, residue i must also score higher for 310 helix than it scores for the more common alpha helix to be identified as 310 helix. As a result, residues in transition between alpha helix and 310 helix tend to be identified as alpha helix.

For beta sheet, if residue i-1, residue i, and residue i+1 all pass the beta sheet cutoff then residue i-1, residue i, and residue i+1 are all identified as beta sheet. Beta sheet does not have strong “capping” motifs like alpha helix does, so strand ends are not strongly captured by the regular beta sheet contours used by CaBLAM. Therefore a definition that tended to extend sheet identification was necessary.

As a precaution against assigning secondary structure based on unreliable Cα geometry, if a residue is in the Cα geometry outliers list, it is skipped during secondary structure assignment.
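The assignment rules in this section can be collected into one routine. This is a sketch under stated assumptions rather than the code in Phenix: per-residue contour scores are supplied as three parallel lists of fractions for one continuous segment, Cα geometry outliers are given as a set of indices, and a skipped residue is assumed to still count toward its neighbors' continuity. The text does not say how a conflict between a helix label and a beta label is resolved, so keeping the helix label is also an assumption.

```python
ALPHA_CUTOFF = 0.001     # 0.1% exclusion
THREETEN_CUTOFF = 0.001  # 0.1% exclusion
BETA_CUTOFF = 0.0001     # 0.01% exclusion


def _passes(flags, i):
    """True if index i exists and that residue passed its cutoff."""
    return 0 <= i < len(flags) and flags[i]


def assign_secondary_structure(alpha, beta, threeten, ca_geom_outliers=()):
    """Return a per-residue list of "alpha", "threeten", "beta", or None."""
    skip = set(ca_geom_outliers)
    ok_alpha = [s >= ALPHA_CUTOFF for s in alpha]
    ok_beta = [s >= BETA_CUTOFF for s in beta]
    ok_310 = [s >= THREETEN_CUTOFF for s in threeten]
    labels = [None] * len(alpha)
    for i in range(len(alpha)):
        if i in skip:
            continue
        # 310 helix: one passing neighbor suffices, but it must outscore alpha.
        if (ok_310[i] and (_passes(ok_310, i - 1) or _passes(ok_310, i + 1))
                and threeten[i] > alpha[i]):
            labels[i] = "threeten"
        # Alpha helix: the residue must sit at the center of three in a row.
        elif ok_alpha[i] and _passes(ok_alpha, i - 1) and _passes(ok_alpha, i + 1):
            labels[i] = "alpha"
    # Beta sheet: three in a row labels all three residues, extending strand ends.
    for i in range(1, len(beta) - 1):
        if ok_beta[i - 1] and ok_beta[i] and ok_beta[i + 1]:
            for j in (i - 1, i, i + 1):
                if j not in skip and labels[j] is None:
                    labels[j] = "beta"
    return labels


# Hypothetical scores for a six-residue segment with four helix-like residues.
helix_scores = [0.0, 0.05, 0.1, 0.2, 0.1, 0.0]
zeros = [0.0] * 6
helix_labels = assign_secondary_structure(helix_scores, zeros, zeros)
```

In the hypothetical helix example only the two central residues are labeled, which shows how the centered three-in-a-row rule trims helix ends, while the beta rule deliberately does the opposite and labels all three residues of a passing triple.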


5.4 Validation feedback

CaBLAM provides several modes of validation feedback, appropriate for different purposes.

5.4.1 Text

Text output is the default for cablam_validate if no other arguments are specified, though it can be explicitly generated with the following commandline:

phenix.cablam_validate output=text file.pdb

Sample text output from 2o01.pdb:

A 720 THR: CA Geo Outlier :0.04835:0.00236: :0.00000:0.00000:0.00000
A 721 GLN: :0.29194:0.10072: :0.00000:0.00134:0.00000
A 722 PRO: :0.11147:0.25812: try beta sheet :0.00000:0.01481:0.00000
A 723 ARG: :0.10530:0.22413: :0.00000:0.00027:0.00000
A 724 ALA: Peptide Disfavored :0.01458:0.11435: :0.00000:0.00000:0.00000
A 725 LEU: Peptide Outlier :0.00000:0.16084: :0.00000:0.00000:0.00000
A 726 SER: :0.30666:0.45294: :0.00000:0.00126:0.00000
A 727 ILE: Peptide Disfavored :0.03020:0.39316: :0.04614:0.00000:0.00000
A 728 VAL: :0.72136:0.63512: try alpha helix :0.13848:0.00000:0.03051
A 729 GLN: :0.45590:0.71627: try alpha helix :0.13572:0.00000:0.06540
A 730 GLY: :0.18039:0.87774: try alpha helix :0.22709:0.00000:0.00473
A 731 ARG: CA Geo Outlier :0.27272:0.00002: :0.11537:0.00000:0.15138
A 732 ALA: :0.06449:0.46345: try alpha helix :0.03475:0.00000:0.00984
A 733 VAL: Peptide Outlier :0.00391:0.19215: try alpha helix :0.04802:0.00000:0.00000
A 734 GLY: :0.06608:0.15031: try alpha helix :0.20330:0.00000:0.09040
A 735 VAL: :0.07576:0.63680: try alpha helix :0.13231:0.00000:0.00283

This output is formatted to be both human and machine readable, being both column and colon-delimited. The columns are as follows:

1) A unique residue identifier formatted as ccnnnnilttt. cc is a two-character chain ID, nnnn is a right-justified residue number, i is an insertion code, l is an alternate conformation ID, and ttt is a three-character residue name. This is the same residue identification scheme used by other validation tools in Phenix, including Ramalyze and Rotalyze. MolProbity also parses this residue id format.

2) An outlier type designation. This designation may be “Peptide Outlier” for a CaBLAM outlier, “Peptide Disfavored” for a residue in the disfavored region of CaBLAM space, “CA Geo Outlier” for a Cα geometry outlier, or blank for a residue falling in the Favored region of 3D CaBLAM space. A “CA Geo Outlier” is considered the most severe type of outlier and therefore overrides the other types if a residue falls into more than one outlier category.

3) A numerical representation of the residue’s contour level in 3D CaBLAM space. This and other numbers are presented as fractions rather than percents for consistency in column arrangement. Thus 0.01000 in this column corresponds to the 1% cutoff for outlier conformations. This column provides a more detailed view of outlier severity than the outlier type column.

4) The residue’s contour level in Cα geometry space. 0.00500 corresponds to the 0.5% cutoff for outlier conformations. This column provides a more detailed view of Cα geometry outlier severity than the outlier type column.

5) A recommendation of probable secondary structure for the residue. This recommendation may be “try alpha helix”, “try beta sheet”, or “try threeten” and matches the secondary structure type CaBLAM identified that residue as. The secondary structure recommendation is presented for every possible residue, whether outlier or not, because contiguity of secondary structure residues is a significant indicator of that secondary structure assignment’s correctness. An isolated “try beta sheet” like residue 722 is much less convincing than a series of contiguous “try alpha helix” recommendations like residues 732+. The secondary structure recommendations are each of different character length so that in a fixed character-width format, secondary structure elements are visually encoded.

6) A numerical representation of the residue’s contour level in the 2D CaBLAM space for loose alpha helix. 0.00100 corresponds to the 0.1% cutoff for identification as possible alpha helix. This column and the following columns allow for direct numerical comparison of a residue’s scores for the different secondary structure types.

7) The residue’s contour level in the 2D CaBLAM space for regular beta sheet. 0.00010 corresponds to the 0.01% cutoff for identification as possible beta sheet.

8) The residue’s contour level in the 2D CaBLAM space for loose 3₁₀ helix. 0.00100 corresponds to the 0.1% cutoff for identification as possible 3₁₀ helix. Due to the overlap between the alpha helix and 3₁₀ helix contours, direct comparison of this value with the alpha helix score column may be important in understanding complex helices.

When MolProbity runs CaBLAM validation, it is this text output that is generated and parsed to produce MolProbity-formatted validation.
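The relationship between the outlier type column and the two contour-level columns can be illustrated with a short sketch. This is not CaBLAM’s own code: the function name and the labels other than “CA Geo Outlier” are assumptions made for illustration. The cutoffs are passed as parameters, with defaults taken from the 5%, 1%, and 0.5% contours; the Cα geometry cutoff in particular is left adjustable.

```python
def classify_residue(cablam_level, ca_geom_level,
                     disfavored_cutoff=0.05,
                     outlier_cutoff=0.01,
                     ca_geom_cutoff=0.005):
    """Return an outlier-type label from fractional contour levels.

    Levels are fractions (0.01000 means the 1% contour), as in the text
    output. A Calpha geometry outlier overrides the other categories.
    Labels other than "CA Geo Outlier" are illustrative assumptions.
    """
    if ca_geom_level < ca_geom_cutoff:
        return "CA Geo Outlier"
    if cablam_level < outlier_cutoff:
        return "CaBLAM Outlier"
    if cablam_level < disfavored_cutoff:
        return "CaBLAM Disfavored"
    return ""  # favored conformation: the field is left blank
```

Checking the Cα geometry level first is what implements the override: a residue that is both a CaBLAM outlier and a Cα geometry outlier is reported only as the latter.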

5.4.2 Oneline

MolProbity provides a “oneline” functionality for quick whole-model-level assessment of structures when exhaustive residue-by-residue feedback is not needed. CaBLAM therefore provides a similar functionality. Oneline output is accessed through the following commandline:

phenix.cablam_validate output=oneline file.pdb

Sample oneline output from 2o01.pdb:

2o01_ah:724:48.1:29.6:7.60

Like the default text output, CaBLAM’s oneline output is colon-delimited for machine readability. The columns are as follows:

1) A PDB ID or filename identifier. CaBLAM’s oneline may be run in batch mode on a directory of PDB files. This column preserves the association between those files and their contents.

2) A count of residues that CaBLAM can evaluate. Since CaBLAM cannot evaluate residues near chain breaks or termini, this number will be less than the total residue count of the structure. The residue count can be used with the percent values later in the output to find absolute counts for the different outlier types.

3) The percent of residues CaBLAM has evaluated as Disfavored or worse. Since the contour cutoff for Disfavored is 5%, a value around 5% is to be expected in a well-modeled, high-resolution structure. A value significantly higher than 5% is suspicious and may indicate modeling errors.

4) The percent of residues CaBLAM has evaluated as Outlier. Since the contour cutoff for Outlier is 1%, a value around 1% is to be expected in a well-modeled, high-resolution structure. A value significantly higher than 1% is suspicious and may indicate modeling errors. In a high-quality structure, these Outlier residues should almost certainly be confined to loop regions.

5) The percent of residues CaBLAM has evaluated as Cα Geo Outlier. Since the contour cutoff for Cα Geo Outlier is 0.5%, a value around 0.5% may be acceptable in a well-modeled structure with high-resolution electron density to support deviation from expected behavior. However, Cα geometry outliers are the most severe category identified by CaBLAM and should always be viewed with suspicion.

A residue may be identified as more than one kind of outlier by CaBLAM. Indeed, all Outlier residues necessarily fall outside the 5% Disfavored contour as well. Thus in a particularly troubled structure, these percentages may sum to more than 100%. This apparent double counting is necessary to preserve the correspondence between the numbers reported here and the 5%, 1%, and 0.5% contours used to identify outliers.
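Because the oneline output is colon-delimited, downstream scripts can read it with a few lines of code. The sketch below parses the sample line shown above and converts the percentages back into approximate residue counts. The function and field names are hypothetical, and the sketch assumes that the identifier itself contains no colon.

```python
from typing import NamedTuple


class OnelineSummary(NamedTuple):
    identifier: str             # PDB ID or filename identifier
    residues: int               # residues CaBLAM could evaluate
    disfavored_pct: float       # percent Disfavored or worse
    outlier_pct: float          # percent Outlier
    ca_geom_outlier_pct: float  # percent Calpha geometry outlier


def parse_oneline(line):
    """Split one colon-delimited oneline record into typed fields."""
    ident, count, disfavored, outlier, ca_geom = line.strip().split(":")
    return OnelineSummary(ident, int(count), float(disfavored),
                          float(outlier), float(ca_geom))


def absolute_counts(summary):
    """Recover approximate residue counts from the reported percentages."""
    n = summary.residues
    return {
        "disfavored_or_worse": round(n * summary.disfavored_pct / 100),
        "outlier": round(n * summary.outlier_pct / 100),
        "ca_geom_outlier": round(n * summary.ca_geom_outlier_pct / 100),
    }


summary = parse_oneline("2o01_ah:724:48.1:29.6:7.60")
counts = absolute_counts(summary)
```

The Outlier residues are a subset of the Disfavored-or-worse residues, so these counts overlap and should not be added together.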

5.4.3 Kinemage markup

While text-based feedback is useful for automation, the Richardson Lab highly values visual communication with human users. Therefore a visual markup scheme for CaBLAM validation was a must. CaBLAM can generate an annotated kinemage similar to (or part of) the “multicriterion” kinemages generated by MolProbity. CaBLAM markup in these kinemages takes the form of CaBLAM outlier annotations, Cα geometry outlier annotations, and ribbon annotations for secondary structure elements. Kinemage markup can be accessed with the following commandline:

phenix.cablam_validate output=full_kin file.pdb

5.4.3.1 CaBLAM outlier and disfavored markup

Figure 5.6: CaBLAM kinemage annotation for a severe outlier (pink) and a disfavored conformation (purple).

The primary form of markup denotes outliers in 3D CaBLAM space. In general, CaBLAM assumes that the Cα trace of a protein is relatively reliable and that any residue that is an outlier is an outlier because of its ν dihedral. Therefore, when a CaBLAM outlier is found, it is marked in the kinemage with a set of colored bars that follows the four points defining the ν dihedral (Figure 5.6). This annotation draws attention to the likely source of the problem and also helps clarify the otherwise idiosyncratic ν dihedral for users. CaBLAM outlier markup uses two colors to encode outlier severity (Figure 5.7). Outliers are colored pink. Disfavored residues (aka “mild outliers”) are colored purple. Pink and purple were chosen for being visually related to each other and because those colors were not already in use by any protein validation markup. These colors are, however, used to mark sugar pucker outliers in RNA validation. Since the RNA sugar pucker validation is another method that derives a modeling recommendation from the best-modeled parts of a structure, the parallel coloring is a nice touch.

Figure 5.7: CaBLAM kinemage annotation for residues in helices from 2o01.pdb

5.4.3.2 Cα geometry outlier markup

Cα geometry outliers are marked with red lines following the Cα trace (Figure 5.8). These lines highlight the Cα virtual angle that is the parameter unique to the Cα geometry space in CaBLAM. The lines are slightly shortened to reduce ambiguity about which markup is associated with which residue if multiple Cα geometry outliers occur in close proximity. The red coloration is also used by “angle dev” and “bond dev” geometry markup in the usual multicrit kinemage. Since Cα geometry outliers are similar in character and severity to bond angle deviations, the overlap in coloration was deemed to aid rather than confuse user understanding.

Figure 5.8: Cα geometry outlier annotation for helices in 2o01.pdb.

5.4.3.3 Embedded text feedback

The kinemage format allows each drawn point to carry text information with it. The CaBLAM outlier markup makes use of this by storing contour level information with each markup annotation. Clicking on any point in the CaBLAM markup will show fractional values for contour level of the marked residue in 3D CaBLAM space and for the 2D CaBLAM spaces for secondary structure types. This is the same information available through the residue-by-residue text output. Because sequential CaBLAM outlier markup overlaps, an additional selectable point is drawn in the middle of each of the line segments unique to a residue’s markup.

5.4.3.4 CaBLAM-generated secondary structure ribbons

Selecting each outlier in turn to check its secondary structure contour levels is tedious and impractical, so CaBLAM also provides markup for residues it determines to be probable secondary structure. Via Molikin, the KiNG viewer already possessed the ability to convert the HELIX and SHEET records found in PDB files into ribbon diagram representations of secondary structure. CaBLAM therefore generates HELIX and SHEET records based on its identification of secondary structure elements, and these are interpreted into ribbons in the final kinemage. This ribbon markup provides a comprehensive at-a-glance view of where probable secondary structure elements occur in a structure (Figure 5.9). Because CaBLAM will not further assess residues found to be Cα geometry outliers, any Cα geometry outliers that occur within a secondary structure element will interrupt the ribbon for that helix or strand. Numerical feedback for how secondary structure-like those outlier residues are can still be viewed by selecting any part of the Cα geometry outlier markup, but should be interpreted with care and with emphasis on correcting the geometry outlier.

5.4.4 Other feedback options

CaBLAM provides a number of other output options. For full details on these options, use phenix.cablam_validate help=True. This section discusses development for a few of these options.

Many existing programs accept PDB-style HELIX and SHEET records as a means of identifying secondary structure. Our KiNG viewer is one such program, and various Phenix tools in development likewise can use HELIX and SHEET records. To facilitate interaction with these programs, CaBLAM can output HELIX and SHEET records with or without the associated PDB file, through the commandlines:

phenix.cablam_validate output=records file.pdb

or

phenix.cablam_validate output=records_plus_pdb file.pdb

These records are unfortunately imperfect. The HELIX records generated by CaBLAM are mostly complete, but do not number the helices. The SHEET records generated by CaBLAM may be more problematic. Full PDB SHEET records contain hydrogen bonding information, showing the pairing of beta strands through space. CaBLAM detects individual beta strands. It does not detect the hydrogen bonds (which may in fact be absent in a low-resolution structure) between them and does not have a way to join beta strands through space into a beta sheet. Thus each SHEET record generated by CaBLAM is an individual strand record without bonding partner information. Solving the strand to sheet conversion remains a future goal for CaBLAM. The HELIX and SHEET records as generated are sufficient for KiNG to interpret into ribbons (Figure 5.9), but may need to be modified for use with other programs.
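As one example of the kind of modification another program might need, the sketch below assigns serial numbers to unnumbered HELIX records. It is an illustrative post-processing step, not part of CaBLAM. It assumes the standard fixed-column PDB layout, in which the helix serial number occupies columns 8-10 and the helix identifier occupies columns 12-14. The function name is hypothetical, and input lines are assumed to carry no trailing newline.

```python
def number_helix_records(lines):
    """Write sequential serial numbers and IDs into PDB HELIX records.

    Columns 8-10 (serNum) and 12-14 (helixID) are overwritten with a
    running count; all other columns and all non-HELIX lines are kept.
    Numbered HELIX lines are padded to the 80-column record width.
    """
    numbered = []
    serial = 0
    for line in lines:
        if line.startswith("HELIX"):
            serial += 1
            padded = line.ljust(80)
            tag = f"{serial:>3d}"
            # columns 1-7, serNum, column 11, helixID, columns 15-80
            line = padded[:7] + tag + padded[10] + tag + padded[14:]
        numbered.append(line)
    return numbered
```

A running count is the simplest valid choice for both fields. It does not address the SHEET records, where the missing strand pairing information cannot be supplied by renumbering alone.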

Figure 5.9: Ribbons for 2o01.pdb generated by Molikin using HELIX records supplied by CaBLAM.

Ramachandran analysis generally produces a plot of residues in Ramachandran space. Selecting “points” as the output for CaBLAM will similarly produce a kinemage of a structure’s residues in CaBLAM space. These points can then be compared to kinemage representations of CaBLAM-space contours. This can be useful for determining the relationship of outliers to the contours. The commandline is:

phenix.cablam_validate output=points file.pdb

5.5 Success rates

Determining whether CaBLAM was accurate in its designation of outliers and probable secondary structure was one of the most challenging aspects of the project. My intent with CaBLAM is to fill a gap in the ability of software to accurately assess low-resolution structures. There is not another tool that provides reliable identification of secondary structure at low resolution. Thus there was no automated way to assess how accurately CaBLAM identified secondary structure. Indeed, if there had been a way to check CaBLAM’s accuracy automatically, CaBLAM would have been less necessary to create.

Human inspection of CaBLAM’s behavior on low-resolution structures was necessarily the mechanism of choice for assessing CaBLAM’s accuracy. Fortunately, the Richardson Lab trains its members in visual intuition for protein (and RNA) structure. My own efforts to this end were greatly aided by Lizbeth Videau, who conducted exhaustive residue-by-residue inspections of several low-resolution structures during our efforts to improve structure quality using CaBLAM feedback. Thanks to her work, I feel confident that CaBLAM provides consistently useful information on low-resolution structures.

5.5.1 False positives

False positive CaBLAM outliers (residues marked as CaBLAM outliers that may not be true outliers) are the most prevalent problem with CaBLAM markup. As discussed in Section 5.2.1, CaBLAM finds an excessive number of outlier and disfavored residues in loop regions. Loops have a greater variety of real conformations than the more regular secondary structure regions, and so loop residues occur in less-populated regions of CaBLAM space. Because CaBLAM assesses both highly populated secondary structure and sparsely populated loops with the same 3D contours, rejection of some loops was inevitable.

Providing better feedback for loops was a major reason why CaBLAM requires continuity among adjacent residues for identification as secondary structure. Very few loop regions maintain conformations that resemble secondary structure for more than one or two residues at a time. CaBLAM requires at least three residues in a row to resemble secondary structure before it starts identifying secondary structure. Thus CaBLAM only very rarely produces false positives for secondary structure identification.
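The three-in-a-row requirement amounts to a run-length filter over per-residue labels. A minimal sketch follows; it assumes that the list holds one label per residue in sequence order with no chain breaks, and that an empty string means no secondary structure recommendation. The function name is hypothetical and this is not CaBLAM’s own implementation.

```python
def contiguous_runs(labels, min_length=3):
    """Find runs of identical, non-empty labels at least min_length long.

    Returns (start, end, label) tuples, where start is inclusive and end
    is exclusive, both as indices into the labels list.
    """
    runs = []
    start = 0
    for i in range(1, len(labels) + 1):
        # A run ends at the end of the list or where the label changes.
        if i == len(labels) or labels[i] != labels[start]:
            if labels[start] and i - start >= min_length:
                runs.append((start, i, labels[start]))
            start = i
    return runs
```

With this filter, an isolated “try beta sheet” residue is discarded, while four consecutive “try alpha helix” residues are reported as a single helical segment.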

It is possible that CaBLAM disproportionately identifies proline residues as outliers. Lizbeth, something of a proline aficionado (Videau, 2004), noticed many being identified as outliers without apparent cause, despite proline having its own unique set of contours in CaBLAM space. However, prolines are disruptive to regular secondary structures. Alpha helices bend at prolines and beta sheets tend to bulge. Prolines are thus disproportionately found in non-regular and loop regions. For prolines that were associated with secondary structure, Lizbeth saw very few that were incorrectly marked as outliers. CaBLAM’s apparent disfavor for prolines is therefore probably just a specific manifestation of its general disfavor for loops.

5.5.2 False negatives

False negative CaBLAM outliers (residues with obviously incorrect geometry that CaBLAM does not mark as outliers) are very rare. A CaBLAM outlier is associated with a single residue for naming conventions, but is actually determined based on the relationship between the peptide planes of two adjacent residues. Thus an incorrect peptide plane orientation may be viewed with respect to the preceding residue or with respect to the succeeding residue. A residue may be identified as a CaBLAM outlier with respect to the preceding residue, the succeeding residue, or both. CaBLAM does not always mark both of the possible outlier relationships, but does nearly always mark at least one of them.

False negatives for alpha helix markup are likewise very rare. Apparent false negatives for helices may occur where Cα geometry outliers interrupt CaBLAM’s ribbon markup, but inspection of the Cα geometry outliers generally reveals that even these residues fall within the 2D CaBLAM space contours for helix behavior.

False negatives for beta sheet markup are more common. Beta sheets are more varied and less well-behaved than helices, and so setting an appropriate contour cutoff for them proved challenging. CaBLAM’s current ignorance of through-space relationships, like the one that defines beta strand pairing, also makes accurate assessment of beta sheets difficult. Nevertheless, Lizbeth found CaBLAM’s identification of beta strands adequate in her survey of low-resolution structures.

Generally, CaBLAM has difficulty identifying complete beta strands. It usually identifies at least a part of each strand, however. If CaBLAM is integrated into an iterative method of structure improvement, this partial identification may be sufficient to “grow” well-modeled beta sheet out from the identified regions over several rounds of refinement or rebuilding.

5.5.3 Comparison to other methods

One of the motivating forces behind CaBLAM validation was the poor performance of existing secondary structure identification methods at low resolution (Section 4.1). Comparison to these existing methods shows that CaBLAM does indeed offer improved assessment of low-resolution structures.

5.5.3.1 Comparison to Ramachandran analysis

Comparison of CaBLAM to Ramachandran analysis clearly demonstrates the greater reliability of the Cα trace over the full mainchain trace at low resolution. In the case of the poorly-modeled helices from 2o01.pdb, Ramachandran analysis (Figure 4.2) cannot reliably identify poorly-modeled residues as outliers. Ramachandran analysis also cannot reliably identify the modeling “intent” of these residues as alpha helix. If these same residues are plotted in the 3D CaBLAM space (Figure 5.10), the points form a column above, through, and below the region of alpha helix (see section 4.7.2). More of these residues are identified as outlier (and more clearly) than by Ramachandran analysis.

Figure 5.10: Poorly modeled helix residues from 2o01.pdb in 3D CaBLAM space.

If these residues are plotted in the 2D CaBLAM space along with the contours for expected alpha helix behavior (Figure 5.11), the modeling “intent” of these residues becomes absolutely clear. Virtually all of the residues fall within the cutoff for identification as alpha helix, based on their Cα trace.

Figure 5.11: Poorly modeled helix residues from 2o01.pdb in 2D CaBLAM space with contours for alpha helix. (x-axis is µin, y-axis is µout.)

In the case of the “conserved” beta sheet error from the 70S E. coli ribosome structures, Ramachandran analysis (Figure 4.4) places the involved residues into two distinct clusters. Half of the residues are misidentified as having “Allowed” or better conformations, despite the error. The other half, while identified as outliers, are too far from the beta sheet region for their intended structure to be clear. If these same residues are plotted in the 3D CaBLAM space (Figure 5.12), they fall in the corners (±180°) of the µin/µout axes, and about halfway down (around 0°) the ν axis. This region is far outside the contours of expected protein behavior, and so these residues are very clear outliers. (It is worth noting that ν dihedral values in this range do occur, as in the case of beta bulges, but require compensating changes in the Cα trace that move such residues out of the corners of µin/µout space.)

Figure 5.12: Beta strand modeling errors from 70S E. coli ribosome in 3D CaBLAM space.

If these residues are plotted in the 2D CaBLAM space along with the contours for expected beta strand behavior (Figure 5.13), the modeling “intent” of the residues is clear. All the residues fall within the cutoff for identification as beta sheet, based on their Cα trace.

Figure 5.13: Beta strand modeling errors from 70S ribosome in 2D CaBLAM space with contours for beta sheet. (x-axis is µin, y-axis is µout.)

5.5.3.2 Comparison to DSSP

Comparison to DSSP provides a visually striking example of this improvement (Figure 5.14). Running phenix.ksdssp file.pdb will generate HELIX and SHEET records based on the hydrogen bonding present in a structure. The HELIX and SHEET records generated by CaBLAM show similar coverage of secondary structure elements in high-resolution structures and much better coverage of secondary structure in low-resolution or error-prone structures. CaBLAM’s coverage of secondary structure elements appears to be roughly consistent across a wide resolution range, while DSSP and ksdssp’s coverage quality falls off at low resolution.

Figure 5.14: Comparison of secondary structure identification by ksdssp (left) versus CaBLAM (right) as demonstrated by helix ribbons on 2o01.pdb.

5.5.3.3 Comparison to expert human inspection

Human visual intuition is a significant method within the Richardson Lab. Fortunately, CaBLAM compares favorably to human intervention for identifying outliers in secondary structure. When Vincent Chen was working on exhaustive manual corrections to Jamie Cate’s early 70S ribosome structure, he encountered a particular, frequently occurring modeling error in the ribosomal proteins’ beta structure. In this error, three or more sequential carbonyl oxygens along a beta strand would be modeled on the same side of the strand, rather than alternating sides (Figure 4.3). Vincent made a list of every one of these outliers that he found during his manual inspection of the structure (Chen, 2010). After CaBLAM was developed, I ran its validation on the same version of the 70S ribosome that Vincent had worked on and compared the CaBLAM outlier list to Vincent’s list. CaBLAM identified as outlier or disfavored at least one residue in all but a handful of Vincent’s outliers. CaBLAM also identified a similar handful of additional outliers of this type that Vincent had overlooked. For identifying modeling errors in secondary structure, CaBLAM’s automation performs similarly to an exhaustive manual search performed by a human expert.

5.6 Manual corrections from CaBLAM validation

CaBLAM’s validation and markup were put to a practical test during work to improve low-resolution protein structures. We attempted corrections to three structures of particular interest: a later version of the 70S E. coli ribosome we had worked on with Jamie Cate; 4KIX.pdb/4KIY.pdb, another ribosome structure; and 4HUM.pdb, a low-resolution structure specifically selected for having an abundance of CaBLAM outliers.

Lizbeth Videau did most of the direct work on improving these structures, since she has great patience for exhaustive residue-by-residue inspection. I provided Lizbeth with advice on interpreting CaBLAM and took notes on her successes and frustrations with the annotation. This also served as a test of CaBLAM’s user-friendliness.

5.6.1 Methods

Validation markup from CaBLAM and MolProbity was displayed in KiNG. Lizbeth used this markup to locate residues of interest and residues in need of correction. CaBLAM markup confirmed her visual intuition for the secondary structure character of outlier residues. She then used various Coot tools (Emsley & Cowtan, 2004) to address those residues. Coot’s coarse geometry adjustment “rubber banding” tool was used to break outlier residues out of incorrect local minima and its local realspace refinement was used to reminimize them into better conformations. Alpha helix or beta sheet geometry restraints could be selected for the local realspace refinements, and were used to increase the idealization of the structures. Lizbeth also used Coot’s Ramachandran tool to move residues’ ϕ and ψ angles into favored Ramachandran territory.

5.6.2 Technical challenges

One of the primary frustrations of the structure correction process was that Coot provides relatively little real-time validation feedback with which to track whether changes are indeed improvements. Coot does display clash dots after each local refinement. Ramachandran outliers can be assessed one at a time through the Ramachandran tool, but the interface is awkward and the contours outdated. Other MolProbity validation criteria are not accessible in Coot. Real-time CaBLAM feedback was particularly missed, since one of our goals was to see whether manual intervention could correct CaBLAM outliers. Fortunately, many Coot users operate in the Phenix environment, which offers greater integration between validation and Coot intervention. Adding CaBLAM to the Phenix validation GUI should likewise assist with Coot-based CaBLAM corrections.

5.6.3 Results of manual correction

Our success in manually correcting structures based on CaBLAM validation was marginal. Tables 5.1 and 5.2 show CaBLAM validation statistics for two proteins of interest, 4hum.pdb (Lu et al, 2013) and the S7 protein of Jamie Cate’s E. coli ribosome. These statistics show validation for the starting model, the model following manual correction, and the corrected model after refinement. S7 was generally amenable to human intervention, and we were able to reduce the number of CaBLAM outliers dramatically. The remaining outliers were confined to loop regions. However, many of the residues reverted to outlier or disfavored conformations after refinement. 4HUM (Figure 5.15) was larger and less tractable than S7 (Figure 5.16), and a smaller percentage of outliers were corrected. Some of these residues also reverted to outliers after refinement.

Table 5.1: Validation statistics for 4hum.pdb through our intervention process.

               CaBLAM      CaBLAM         Cα Geometry   Clashscore
               Outlier %   Disfavored %   Outlier %
Input 4hum     17.9        38.6           8.49          40.23
Fixed 4hum     14.8        29.0           7.20          54.85
Refined 4hum   15.7        31.9           5.90          53.08

Figure 5.15: 4hum.pdb before (left) and after (right) our intervention.

More seriously, our corrections to the protein backbone introduced severe outliers elsewhere in the structure. Clashscores jumped significantly for both of these structures after our corrections, and refinement could not resolve the clashes. The new clashes were primarily sidechain-sidechain or sidechain-mainchain clashes. In the process of correcting mainchain outliers, we seem to have moved a great many sidechains into untenable positions, despite Coot’s minimization and feedback on clashes.

Table 5.2: Validation statistics for E. coli ribosomal protein S7 through our intervention process.

               CaBLAM      CaBLAM         Cα Geometry   Clashscore
               Outlier %   Disfavored %   Outlier %
Input S7       12.2        25.9           0.68          4.96
Fixed S7       2.7         6.8            0             12.81
Refined S7     6.1         16.3           0.68          13.22

Figure 5.16: E. coli ribosomal protein S7 before (left) and after (right) our intervention.

On launching this part of the project, I had hoped that fixing the protein backbone would allow the rest of the structure to fall into place more readily. Clearly, this was not the case. While manual intervention is probably sufficient for isolated outliers, structures containing many related modeling errors (i.e. many structures in the resolution range targeted by CaBLAM validation) will require a more involved correction method.

5.7 Discussion

CaBLAM was originally designed as a method for describing protein backbone motifs. However, it has grown to fill a necessary role as a validation method for low-resolution protein structures. CaBLAM has a uniquely powerful ability to combine identification of modeling errors with identification of the intended secondary structure masked by those errors.

CaBLAM’s unique power comes from its unique combination of parameters and the suitability of those parameters to the problem of low-resolution validation. Other methods have used the same or similar parameters, but have not combined them in CaBLAM’s fashion, nor explicitly exploited their power at low resolution. A 2009 review of structure assignment methods (Tyagi, 2009) has served as my primary guide to the variety of protein structure assessment programs.

Cα dihedrals have long been recognized as useful descriptors of protein structure. Levitt and Greer probably came the closest to building CaBLAM’s 2D parameter space, in part because they were working early enough to have been concerned with unreliability in ϕ and ψ values. However, despite exploring both µout (Levitt, 1976) and µin (Levitt & Greer, 1977), they did not combine the measures to form the space I have found so useful. The typical combination of parameters has instead been one µ dihedral and the Cα virtual angle. Oldfield and Hubbard explored this possibility in some detail (Oldfield & Hubbard, 1994). They used a reference frame centered on the peptide bond, so my Cα-centric “in” and “out” nomenclature does not map cleanly to their system.

Oldfield and Hubbard’s work influenced at least two automated methods. P-SEA (Labesse, 1997) uses a combination of µ, Cα virtual angle, and Cα-Cα distances to identify secondary structure using a Cα-only method. Gerard Kleywegt developed a set of contours in 2D Cα virtual angle/µout space (Figure 5.4) for use as a Cα geometry validation tool (Kleywegt, 1997). His method is in use at the PDBe.

Other methods have used Cα positions to describe or identify secondary structure without defining an explicit µ dihedral. DEFINE (Richards & Kundrot, 1988) uses a linear mask to match the Cα positions of a region of structure to the Cα positions of a segment of ideal secondary structure. KAKSI (Martin, 2005) uses Cα-Cα distances to supplement ϕ/ψ information. PROSIGN (Hosseini, 2008) uses Cα positions to describe a periodic oscillation of a protein strand.

The ν dihedral used by CaBLAM appears to be entirely unique to this system, probably due to the use of virtual atom positions in its definition. However, there is one method that came close. XTLSSTR (King & Johnson, 1999) defines a dihedral ζ using O(i-1), C(i-1), C(i), and O(i). This dihedral relates the relative orientations of adjacent peptide planes and carries similar information to CaBLAM’s ν. Indeed, I might well have defined ν in the same way as ζ, had I not been so concerned about having a minimal parameter space at the time. XTLSSTR combines this dihedral with the Cα virtual angle to form its parameter space.
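For concreteness, a dihedral such as ζ can be computed from four atom positions with the standard cross-product formula. The sketch below is a generic dihedral calculation, not code taken from XTLSSTR or CaBLAM. The function names are hypothetical, and coordinates are assumed to be plain (x, y, z) tuples in Ångströms.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def dihedral(p1, p2, p3, p4):
    """Signed dihedral angle in degrees about the p2-p3 axis.

    A cis arrangement gives 0 and a trans arrangement gives +/-180.
    """
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    b3 = _sub(p4, p3)
    n1 = _cross(b1, b2)  # normal to the plane of p1, p2, p3
    n2 = _cross(b2, b3)  # normal to the plane of p2, p3, p4
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))


def zeta(o_prev, c_prev, c_curr, o_curr):
    """Dihedral through O(i-1), C(i-1), C(i), O(i), as described for XTLSSTR."""
    return dihedral(o_prev, c_prev, c_curr, o_curr)
```

The same dihedral function applies to any four points, including virtual atom positions; only the choice of points changes.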

None of these methods seem to have been developed with an eye to low-resolution structures. Most of the methods are old enough that the recent proliferation of low-resolution structures had not yet begun, though. My greatest concern as I review these methods is not, however, their intellectual similarity to or difference from my CaBLAM method. Rather, my concern is that most of these methods seem to have fallen out of use, their innovations lost. CaBLAM’s power will be wasted if it becomes just another back-issue curiosity upon my graduation. Fortunately, CaBLAM’s integration into MolProbity will keep it in the public eye, and its integration into Phenix promises that it may have a lasting impact on crystallography. CaBLAM is also better-suited than most of these methods to identifying non-repeating motifs, as will be discussed in Chapter 6.

The greatest challenge still in CaBLAM’s future is the question of how to implement its validation suggestions as structure corrections. While CaBLAM’s validation markup proved useful in identifying outliers in low-resolution structures, making manual corrections that neither introduced new clashes nor reverted to outlier after refinement proved very difficult. Correction of errors in low-resolution structures is a larger and more complicated problem than any single method such as CaBLAM can address.

One promising possibility is to use the HELIX and SHEET records generated by CaBLAM to replace troubled segments of a model with ideal secondary structure. Applying secondary structure restraints to these segments may help prevent the residues from reverting to outlier conformations during refinement. Sidechains would have to be completely repacked during this process. I spoke with Pavel Janowski during a 2014 Gordon conference, and he indicated that an improved force field, specifically AMBER (Cornell et al, 1995), would be coming to Phenix refinement in the near future. This force field may provide better sensitivity to clashes and thus help resolve the sidechain clash problem.

6. Protein structure motifs in CaBLAM

Despite its utility in structure validation for low-resolution models, CaBLAM was originally conceived as a method for describing non-repeating protein structure motifs. These non-repeating motifs are conserved and identifiable, but stand in contrast to repeating structures such as the familiar alpha helix and beta sheet. Most of these non-repeating motifs are properly considered as secondary structure, as they describe the fold of the protein backbone. I generally refer to such motifs as non-regular secondary structure. Early in the project, I thought of them as “secondary structure interrupts”, since the motifs I had found were primarily interruptions to regular secondary structure elements, e.g. the widened helix turn. Eventually, a wider variety of motifs became evident, and the “interrupts” moniker was deemed too narrow and was dropped.

Development of CaBLAM as a tool for identification and exploration of motifs has been secondary to development of CaBLAM as a validation tool, due to the pressing need for better low-resolution validation. However, building the datasets from which CaBLAM’s secondary structure contours were constructed required the same tools as did motif identification. It is because of their use in building training datasets for CaBLAM validation that the motif exploration tools are gathered under the name “cablam_training”.

6.1 Motif “fingerprints”

The central conceit of CaBLAM’s motif identification is that every motif has a unique structural “fingerprint”. For most motifs, this fingerprint is defined by the presence (or absence) of a particular pattern of hydrogen bonds. In a simple, familiar case, in alpha helix, the O atom of residue i is hydrogen bonded to the H (amide hydrogen) atom of residue i+4. In regular alpha helix, Oi+1 is also bonded to Hi+5, Oi+2 is bonded to Hi+6, and so on. Thus the complete hydrogen bonding pattern for alpha helix extends to several residues with a distinct sequence relationship.
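The alpha helix pattern lends itself to a simple illustration of fingerprint matching. The sketch below looks for consecutive residues i whose O atom is bonded to the H atom of residue i+4. It assumes hydrogen bonds are supplied as (acceptor index, donor index) pairs over sequential integer residue indices. This representation and the function name are hypothetical and much simpler than the structures cablam_training uses.

```python
def find_helix_hbond_runs(hbonds, min_bonds=2):
    """Find stretches of consecutive i -> i+4 backbone hydrogen bonds.

    hbonds holds (acceptor_index, donor_index) pairs, meaning the O of
    the acceptor residue is bonded to the H of the donor residue.
    Returns (first_i, last_i) for each run of at least min_bonds bonds.
    """
    # Keep only acceptors whose partner is four residues later.
    starts = sorted({i for (i, j) in hbonds if j - i == 4})
    runs = []
    for i in starts:
        if runs and i == runs[-1][1] + 1:
            runs[-1][1] = i      # extends the current run
        else:
            runs.append([i, i])  # begins a new run
    return [(a, b) for a, b in runs if b - a + 1 >= min_bonds]
```

A single i to i+4 bond is not reported by default, since one bond alone does not establish the repeating pattern that distinguishes regular alpha helix.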

Like the rest of CaBLAM, the CaBLAM motif tools were developed using code available in the Phenix/cctbx_project. Despite its considerable power, at the time of CaBLAM development, Phenix lacked an easy way to access connectivity among adjacent residues. For this reason, I built a custom data structure to establish a sequential relationship among residues. Each residue object in this data structure contains a link to the previous residue in sequence (if any) and the next residue in sequence (if any). This structure can be found in the code as linked_residue in the file cablam_res.py.
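The idea of a residue that knows its sequence neighbors can be sketched as a small doubly linked structure. The class below is a hypothetical illustration and is not the linked_residue class in cablam_res.py; in particular, treating any gap in residue numbering as a chain break is a simplifying assumption of this sketch.

```python
class LinkedResidue:
    """Illustrative residue with links to its sequence neighbors."""

    def __init__(self, resnum, resname, prev_res=None):
        self.resnum = resnum
        self.resname = resname
        self.prev_res = prev_res   # previous residue in sequence, if any
        self.next_res = None       # next residue in sequence, if any
        if prev_res is not None:
            prev_res.next_res = self

    def walk(self, offset):
        """Follow links by `offset` residues; return None at a chain end."""
        res = self
        for _ in range(abs(offset)):
            res = res.next_res if offset > 0 else res.prev_res
            if res is None:
                return None
        return res


def build_chain(residues):
    """Link (resnum, resname) tuples in order.

    Assumption of this sketch: a numbering gap means a chain break,
    so no link is made across it.
    """
    chain, prev = [], None
    for resnum, resname in residues:
        if prev is not None and resnum != prev.resnum + 1:
            prev = None
        prev = LinkedResidue(resnum, resname, prev)
        chain.append(prev)
    return chain
```

With links in place, a relationship such as "the residue four positions later" becomes a single call to walk(4), which returns None when the chain ends first.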

Residue connectivity was established through use of linked_residue objects (as discussed in Section 5.3.1). Hydrogen bonding relationships were determined by the preexisting Richardson Lab programs Reduce and Probe (Word et al., 1999). Reduce adds hydrogen atoms to a structure, and Probe determines the contacts between all atoms in a structure. Together, these programs return hydrogen-bonding information for a structure. Storage and reading of hydrogen-bonding information was greatly aided by Swati Jain’s development of a condensed output for Probe around the time of CaBLAM development. The condensed output simplified Probe output from one text line per contact dot to one line per whole contact. Hydrogen bonds identified by Probe are stored in linked_residue objects as links between residues, thus linking residues in space as well as in sequence.

Hydrogen bonds are not the only criteria for protein motifs. Residue type plays a role in many motifs, either by correlation or by requirement. Motif fingerprints in CaBLAM may also require or disallow residue types at specified positions. Residues in a fingerprint may also be specified as cis or trans if restriction of backbone geometry is required.
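A minimal sketch of how hydrogen-bond and residue-type criteria might be combined in a single test follows. The dictionary layout used for the fingerprint is hypothetical and is not the format described in Appendix B; cis/trans restrictions are left out for brevity.

```python
def matches_fingerprint(residues, hbonds, start, fingerprint):
    """Test whether a motif instance begins at index `start`.

    residues:    list of three-letter residue names in sequence order.
    hbonds:      set of (acceptor_index, donor_index) backbone O...H pairs.
    fingerprint: dict (hypothetical layout) with keys
        'length'     number of member residues
        'bonds'      required (acceptor_offset, donor_offset) pairs
        'no_bonds'   offsets whose carbonyl O must have no partner
        'required'   {offset: set of allowed residue names}
        'disallowed' {offset: set of forbidden residue names}
    """
    if start + fingerprint['length'] > len(residues):
        return False
    for acc, don in fingerprint.get('bonds', []):
        if (start + acc, start + don) not in hbonds:
            return False
    for acc in fingerprint.get('no_bonds', []):
        if any(a == start + acc for a, _ in hbonds):
            return False
    for offset, names in fingerprint.get('required', {}).items():
        if residues[start + offset] not in names:
            return False
    for offset, names in fingerprint.get('disallowed', {}).items():
        if residues[start + offset] in names:
            return False
    return True


# Hypothetical fingerprint: O(1) bonded to H(5), threonine required at
# position 5, proline forbidden at position 2 (offsets are zero-based).
example_fingerprint = {
    'length': 5,
    'bonds': [(0, 4)],
    'required': {4: {'THR'}},
    'disallowed': {1: {'PRO'}},
}
```

Scanning a chain then amounts to calling the function at every start index and collecting the positions where it returns True.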

The fingerprint code could be extended to include other criteria for motif identification. The ability to select on sidechain rotamer would be a natural extension of selection of residue type, and rotamer information is readily available through phenix.rotalyze. Selection on van der Waals contacts in addition to hydrogen bonds would require a different Probe call and a more robust contact storage structure, but might be useful for identifying motifs with few or no hydrogen bonds. These features were not needed to recognize the motifs we were interested in during CaBLAM’s development.


For a complete description of the fingerprint format and instructions on how to write a custom fingerprint, see Appendix B.

6.2 The cablam_training tool

The cablam_training program contains many tools for identifying and visualizing motifs of interest. This section describes the primary output modes for cablam_training. The full commandline options for cablam_training are extensive, and a complete description of the options can be found in Appendix A. The sample commandlines provided below focus on the most common usages.

These visualizations can be produced for single structures, but most are intended to display motifs from multiple structures simultaneously. CaBLAM can be run on an entire dataset at once, and the final visualization will describe the dataset as a whole.

For nomenclature purposes, each complete motif found in the dataset is referred to as an instance of the motif. A single structure may contain multiple instances of a motif, and any sufficiently large dataset will almost certainly contain multiple instances.

Each motif contains some number of member residues, each occurring at a distinct position in the motif, as defined by the motif fingerprint. For each instance of a motif, there will be an instance of each member residue.

6.2.1 Kinemage output

phenix.cablam_training cablam=True probe_motifs=motif_name1,motif_name2 probe_mode=kin


CaBLAM provides kinemage-formatted distributions of the datapoints of motif member residues in CaBLAM space (Figure 6.1). Each member residue of a non-repeating motif can be conformationally unique, so each member residue is printed to a separate kinemage. Kinemages are saved to the working directory with names automatically generated based on unique member residue names included in the motif fingerprint. These kinemages can be appended together to create an animation of a motif’s member residues in CaBLAM space. So long as every member residue in every motif has a unique name, this output mode can be used to search for several motifs simultaneously.
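The grouping of datapoints by member residue can be sketched as follows. This is an illustration only: the function, the input tuples, and the choice of a bare @dotlist are assumptions, and the actual cablam_training output carries more markup than this.

```python
from collections import defaultdict


def member_kinemages(datapoints):
    """Build one kinemage text per member residue name.

    datapoints: iterable of (member_name, point_id, (x, y, z)) tuples,
                where the coordinates are positions in CaBLAM space.
    Returns a dict mapping member_name to kinemage-format text.
    """
    grouped = defaultdict(list)
    for member, point_id, xyz in datapoints:
        grouped[member].append((point_id, xyz))
    kins = {}
    for member, points in grouped.items():
        lines = ["@kinemage 1",
                 "@group {%s}" % member,
                 "@dotlist {%s}" % member]
        for point_id, (x, y, z) in points:
            # Kinemage points are written as {point ID} x y z
            lines.append("{%s} %.3f %.3f %.3f" % (point_id, x, y, z))
        kins[member] = "\n".join(lines) + "\n"
    return kins
```

Because the dictionary is keyed by member residue name, two motifs can share one run only if their member names do not collide, which mirrors the uniqueness requirement stated above.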


Figure 6.1: Typical cablam_training kinemage output, showing the 2D CaBLAM space distribution of a characteristic residue from the widened helix turn. (x-axis is µin, y-axis is µout.)

Many more geometric parameters can be calculated for each residue than are typically needed for any one exploration. The cablam=True flag calculates the four main CaBLAM parameters μin, μout, ν, and Cα virtual angle, and is the generally recommended option. However, other useful parameters are available, including Ramachandran ϕ and ψ. Expanded Ramachandran angles ψ-1 and ϕ+1 can also be calculated. The ω peptide bond dihedral was added to the available parameters early, and proved to be relevant for later cis-peptide validation. The dependence of certain backbone bond angles on Ramachandran ϕ and ψ has been an area of recent interest in structural biology (Berkholz, 2009). The backbone N-Cα-C angle, sometimes called τ, is of particular interest in this context and so is included in the available options.
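Most of the parameters listed above are dihedrals or angles over small sets of atoms. The sketch below computes a signed dihedral and a virtual angle from Cartesian coordinates using only the standard library. The assignment of Cα atoms to μin (residues i-2 through i+1) and μout (residues i-1 through i+2) reflects my reading of the definitions given earlier in this work and should be treated as an assumption of the example, as should all function names.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for four points."""
    b0, b1, b2 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b0, b1), _cross(b1, b2)
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b1, b1)) * _dot(b0, n2)
    return math.degrees(math.atan2(y, x))


def virtual_angle(p0, p1, p2):
    """Angle in degrees at p1 between p0 and p2."""
    v1, v2 = _sub(p0, p1), _sub(p2, p1)
    cosang = _dot(v1, v2) / math.sqrt(_dot(v1, v1) * _dot(v2, v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosang))))


def cablam_mu(ca, i):
    """(mu_in, mu_out) for residue i from a list of CA coordinates.

    Assumed definitions: mu_in spans CA(i-2)..CA(i+1) and
    mu_out spans CA(i-1)..CA(i+2).
    """
    mu_in = dihedral(ca[i - 2], ca[i - 1], ca[i], ca[i + 1])
    mu_out = dihedral(ca[i - 1], ca[i], ca[i + 1], ca[i + 2])
    return mu_in, mu_out
```

The same dihedral function serves for ϕ, ψ, and ω once the appropriate backbone atoms are supplied, and virtual_angle serves for both the Cα virtual angle and τ.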

Some parameters left over from CaBLAM development are also available for the kinemage output. Most notable is a measure that would be νout if it were a parameter in use. νout carries similar information content to ν, but has a less intuitive relationship to its associated residue, and so was discarded from the main CaBLAM parameter space.

Measures of the Cα virtual angles for each residue’s preceding and succeeding residues can also be calculated. These were part of a brief attempt to create a 3D parameter space using only Cα virtual angles. The Cα virtual angle proved too restricted to yield useful validation contours. The measures remain as a memorial to negative results.

The kinemage representation of motifs in CaBLAM space shows whether, where, and how motifs are conformationally conserved in CaBLAM space. Most non-repeating motifs lack enough instances to generate satisfying contours. However, clustering in CaBLAM space is readily apparent to human eyes.

When searching for motifs, it can be useful to write an overly-general motif definition at first. If a member residue shows multiple clusters in CaBLAM space, then each instance of the motif in that cluster can be investigated for similarities. The fingerprint definition can then be made more specific to isolate that cluster and better define the motif.

phenix.cablam_training cablam=True probe_motifs=motif_name1,motif_name2 probe_mode=instance

Residue-by-residue kinemage output is useful for identifying which residues in a motif have conserved conformations, but does not easily display relationships among the member residues of each instance of a motif. CaBLAM provides an alternative kinemage output to show these relationships. In this kinemage, each instance of a motif is joined together as a vectorlist. Thus, each instance of a motif appears in the kinemage as a continuous line. Correlations in conformation, especially among adjacent residues, become much more apparent in this view (Figure 6.12).
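A minimal sketch of the instance view, assuming each instance is available as an ordered list of CaBLAM-space points: the kinemage "P" flag marks the first point of each instance as a move-to, so every instance is drawn as its own continuous line. The function and point labels are hypothetical.

```python
def instance_vectorlist(instances, list_name="instances"):
    """Write motif instances as polylines in one kinemage vectorlist.

    instances: list of (instance_id, [(x, y, z), ...]) with points
               ordered by member residue position.
    """
    lines = ["@vectorlist {%s}" % list_name]
    for inst_id, points in instances:
        for n, (x, y, z) in enumerate(points):
            # "P" starts a new polyline; later points draw from the last
            flag = "P " if n == 0 else ""
            lines.append("{%s res%d} %s%.3f %.3f %.3f"
                         % (inst_id, n + 1, flag, x, y, z))
    return "\n".join(lines) + "\n"
```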

6.2.2 Structure annotation

phenix.cablam_training probe_motifs=motif_name1,motif_name2 probe_mode=annote

CaBLAM provides automated structure annotation as an aid to locating motifs of interest in protein structures. The output of annotation mode is a set of automatically-named kinemages, one per input pdb file, containing balllists that highlight the positions of the specified motifs on those structures (Figure 6.2). The balls are drawn at the Cα positions of each member residue in each motif instance. Due to the variability of motif length, the balls are not colored automatically, but they can be manually recolored within KiNG to better show the locations of key member residues.


Figure 6.2: Annotation output from cablam_training for a bifurcation-stabilized widened helix turn in 1y7t.pdb.

As a historical note, this display method has its origins in the early and ill-fated DSSP-based version of CaBLAM validation (see Section 4.5.3). In that system, each DSSP letter code had a unique set of 2D CaBLAM contours, analogous to the secondary structure contours used in the final version of CaBLAM. An outlier residue would be scored against each of these contours and the values recorded. In the kinemage markup, a series of concentric circles would be drawn at that residue’s Cα position, one circle for each DSSP letter code, with each circle’s radius proportional to the contour score for that letter code (Figures 4.19 & 4.20). Due to human perceptions of visual weight, these differently-colored concentric circles were not an effective means of communicating validation information. Fortunately the core of the method found a better home where relative visual weight is not important.

6.2.3 Sequence

phenix.cablam_training probe_motifs=motif_name1 probe_mode=sequence

Sequence conservation is frequently relevant to exploration and explanation of protein motifs. For this reason, CaBLAM provides a sequence identification option. The output for this option is a text list of protein sequences printed to screen. This is one of the few CaBLAM options not suitable for running on multiple motifs at once. The output prints the one-letter residue code for each member residue of a motif instance, then a newline before starting the next instance of the motif. This output is compatible with online sequence analysis tools, in particular WebLogo (Crooks, 2004) (Figure 6.3).

Figure 6.3: Sample WebLogo sequence frequency summary for bifurcation-stabilized widened helix turns.

Nonstandard amino acids may occasionally occur in the middle of motifs of interest. These residue types do not necessarily have one-letter codes associated with them. For commonly-encountered nonstandard amino acid types, CaBLAM maintains a mapping of those residue types to the closest one-letter codes (e.g., selenomethionine is also assigned the methionine “M”, methyllysine is also assigned the lysine “K”). If a residue type is not recognized by CaBLAM, it is printed as an “X” followed by the full three-letter code for the residue type. Motif sequences containing nonstandard residues can be easily identified either by a visual search for output lines with extra characters or an automated search for the letter X. Because these lines contain extra characters, WebLogo will refuse to process them until the discrepancy is resolved. I view this as a feature, if an occasionally tedious one, since it draws attention to the nonstandard residues.
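The sequence line for one motif instance can be sketched as below. The function name and the two-entry nonstandard table are illustrative assumptions; the residue codes MSE and MLZ are assumed here to stand for selenomethionine and methyllysine, and CaBLAM's own mapping is not reproduced.

```python
ONE_LETTER = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

# Illustrative subset only; codes assumed for this example.
NONSTANDARD = {"MSE": "M", "MLZ": "K"}


def motif_sequence(resnames):
    """One-letter sequence for one motif instance.

    Unrecognized residue types are written as 'X' plus the full
    three-letter code, which makes the line longer than expected.
    """
    letters = []
    for name in resnames:
        if name in ONE_LETTER:
            letters.append(ONE_LETTER[name])
        elif name in NONSTANDARD:
            letters.append(NONSTANDARD[name])
        else:
            letters.append("X" + name)
    return "".join(letters)
```

Lines whose length differs from the number of member residues are exactly the ones that need attention before the output is passed to a sequence logo tool.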

6.2.4 Superposition

Superposition of related structures is a powerful tool for visual comparison. However, achieving a good superposition by hand is a meticulous process. Simultaneous superposition of multiple structures quickly becomes a daunting task to manage by hand. For this reason, the Richardson Lab has long desired a tool for automated structure superposition, especially one specific to motif analysis.

Phenix provides one such tool. phenix.superpose_pdbs automates structure superposition on a specifiable set of atoms. However, as with manual superposition, manual identification of the correct atoms is a daunting task for more than a handful of structures, especially if the residue numbering is not the same among all those structures. An automated method targeted at our Lab’s needs was required.


phenix.cablam_training probe_motifs=motif_name1,motif_name2 probe_mode=superpose

The superposition output for CaBLAM was a relative latecomer to the system. The output for this mode is a directory of PDB files. Each of these files contains one instance of a motif of interest, extracted from its original structure. These files are automatically named such that multiple instances of a motif from a single structure do not overwrite each other. The motifs in these PDB files are oriented such that they will be superposed on each other if loaded into a molecular viewer such as KiNG (Figure 6.4).

Figure 6.4: Superposition of bifurcation-stabilized widened helix turns by CaBLAM, showing 20-30° bend of helix.

The superposition mode of CaBLAM takes advantage of its Phenix integration to use phenix.superpose_pdbs for generating the superposition. A general atom selection for superposition is specified in the fingerprint for each motif. The specific atom selection (with the necessary residue numbers) is automatically generated for each instance of the motif when that instance is found. CaBLAM manages the superposition selection syntax for the user.
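Turning a general selection into an instance-specific one is mostly string assembly. The sketch below is a hypothetical helper, not CaBLAM's code: it assumes the instance occupies a contiguous residue range in one chain, ignores insertion codes, and writes the selection in the style of Phenix atom selection strings.

```python
def instance_selection(chain_id, first_resseq, last_resseq,
                       atom_names=("CA",)):
    """Build an atom selection string for one motif instance.

    chain_id:     chain identifier of the instance.
    first_resseq: residue number of the first member residue.
    last_resseq:  residue number of the last member residue.
    atom_names:   atoms to superpose on (the general selection).
    """
    names = " or ".join("name %s" % name for name in atom_names)
    return "chain %s and resseq %d:%d and (%s)" % (
        chain_id, first_resseq, last_resseq, names)
```

The general part (which atoms) comes from the fingerprint, and the specific part (which chain and residue numbers) comes from the instance, which is why the selection can only be completed once the instance has been found.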

The instance is extracted from its parent structure and superposed. Fingerprint definitions must sometimes be extended to residues beyond the motif of interest to ensure that there are enough residues to provide context for the superposition. The superposition target is the first instance of the motif of interest found by CaBLAM. If the first instance that CaBLAM finds during a superposition run is sufficiently nonstandard, this may cause superposition to behave strangely.

6.3 Tyrosine corners

The tyrosine corner is a motif previously characterized by the Richardson Lab (Hemmingsen, 1994). Revisitation of previous studies is one of the opportunities afforded by the growth of the PDB and made practical by CaBLAM’s automated motif identification. In 1994, when the original tyrosine corner study was conducted, there were only about 3,000 protein structures available in the PDB, and only a set of 162 quality-filtered proteins (Hobohm, 1992) was used. Now, with over 100,000 protein structures in the PDB and a quality-filtered dataset 8,000 chains strong, a more comprehensive study is possible. However, with the enlarged dataset comes a need for automation. With a 1990s-era dataset of a hundred-some structures, it was possible to inspect every structure individually. It is not feasible to manually inspect 8,000 structures for each motif of interest. CaBLAM provides the automation necessary to take advantage of the ever-increasing data available in the PDB.

6.3.1 Characteristics

The 1994 paper describes the tyrosine corner as follows: “The Tyr corner is a conformation in which a tyrosine (residue “Y”) near the beginning or end of an antiparallel β-strand makes an H bond from its side-chain OH group to the backbone NH and/or CO of residue Y - 3, Y - 4, or Y - 5 in the nearby connection.” Tyrosine corners play an apparent role in the folding and stabilization of Greek key β-barrels.

Figure 6.5: Example of NH of Y-5 tyrosine corner from 1epw.pdb.

I constructed 6 different fingerprint definitions for tyrosine corners, depending on the sequence distance of the bond and whether the amide hydrogen or the carbonyl oxygen of the Y-X residue was involved in the H-bond of interest: NH of Y-3, NH of Y-4, NH of Y-5, CO of Y-3, CO of Y-4, and CO of Y-5. The bond from the end of the tyrosine sidechain to the protein backbone (Figure 6.5) was sufficient to define the motif, and no other criteria were necessary.
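Since a single sidechain-to-backbone bond defines each variant, the six definitions can be sketched as one classification over sequence distance and partner group. The function and its input representation are assumptions made for illustration and are not the fingerprint files themselves.

```python
def find_tyr_corners(residues, sidechain_hbonds):
    """Classify tyrosine corners by variant.

    residues:         list of three-letter residue names.
    sidechain_hbonds: set of (tyr_index, partner_index, partner_group),
                      where partner_group is 'NH' or 'CO' on the backbone
                      of the partner residue (hypothetical representation
                      of the Tyr OH hydrogen bond).
    Returns a list of (variant_label, tyr_index) tuples.
    """
    found = []
    for tyr, partner, group in sorted(sidechain_hbonds):
        offset = tyr - partner  # partner precedes the tyrosine in sequence
        if (residues[tyr] == "TYR" and offset in (3, 4, 5)
                and group in ("NH", "CO")):
            found.append(("%s of Y-%d" % (group, offset), tyr))
    return found
```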

All told, about 600 tyrosine corners were identified. By far the most common kind was the CO of Y-4 variant with 260 instances. (Note that this is more tyrosine corners than there were structures in the original survey!) The NH of Y-5 is next with 158 instances. The CO of Y-5 has 87, the CO of Y-3 has 54, and the NH of Y-3 has 50. The NH of Y-4 is the least common variant, with only 25 instances in the Top8000.
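As a quick tally of the per-variant counts quoted above (the counts are from the text; the variable names are only for this example), the six variants sum to 634 instances, and grouping by sequence distance shows how the Y-4 and Y-5 distances dominate.

```python
counts = {
    "CO of Y-4": 260,
    "NH of Y-5": 158,
    "CO of Y-5": 87,
    "CO of Y-3": 54,
    "NH of Y-3": 50,
    "NH of Y-4": 25,
}

total = sum(counts.values())

# Group by sequence distance, read from the last character of the label
by_distance = {}
for label, n in counts.items():
    distance = int(label[-1])
    by_distance[distance] = by_distance.get(distance, 0) + n

# Fraction of all instances contributed by each variant
fractions = {label: n / total for label, n in counts.items()}
```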

6.3.2 In CaBLAM space

My studies of tyrosine corners are still preliminary. My primary interest thus far has been determining how similar or different the 6 variants are.


Figure 6.6: Superposition of instances of CO of Y-4 tyrosine corners shows one of the common conformations of the motif. (x-axis is µin, y-axis is µout.)

In CaBLAM space, the Y-3 and Y-4 tyrosine corners bear a general resemblance to the tight turns found at the ends of “beta hairpins”. In the Y-4 variants (Figure 6.6), there would be some room for the tight turn-like conformation to move forward or backward in sequence, but it stays earlier in sequence, closer to the Y-4 residue than the tyrosine. The Y-5 variants are perhaps too long to support a neat tight turn-like conformation, and so adopt a different, distinctive conformation in the residues near the Y-5 bonding partner. Because of the association of this motif with beta structure, the tyrosine residue generally assumes a beta-like conformation in all variants.


The conformations of the residues between the bonding partners appear to be fairly conserved for each sequence distance pair. Both the CO of Y-3 variant and the NH of Y-3 variant adopt very similar conformations in CaBLAM space. Likewise, both Y-4 variants are similar to each other, and both Y-5 variants are similar to each other.

6.3.3 Sequence correlations

The motif fingerprints were defined so that the residue numbering of the easily-identified tyrosine would change, and its bonding partner would always be found as Residue 2.

Figure 6.7: Sequence frequency Logos for the six Tyr corner variants. Residue 2 is the Y’s H-bond partner in all cases.

Interestingly, sequence conservation among the variants is not as strong as the conformational conservation discussed above (Figure 6.7). The 1994 paper asserts a consensus sequence of LxPGXY, where the first “x” is a hydrophilic residue, and the “L” forms a hydrophobic packing contact with the tyrosine ring. The significance of glycine is certainly borne out by the new sequence analysis, with glycine appearing frequently in all variants, often enriched at either position Y-2 or Bonding-Partner+2. Proline is indeed enriched in the position before glycine, but less significantly, indicating that its presence may be less crucial to the motif than previously believed. Leucine seems to retain its significance in most cases. Moreover, a more general pattern of one hydrophilic residue and one hydrophobic residue emerges, except in the case of the CO of Y-5 variant, which seems to have very little sequence preference.

Work on revisiting the tyrosine corner is ongoing. The expanded dataset offered by the Top8000 will allow us to construct a more complete picture of the motif, even as it raises new questions.

6.4 Widened helix turns

The widened helix turn was the first non-repeating motif I discovered. Early in the project, I had been looking for helix terminus motifs. I had not developed CaBLAM’s fingerprint motif search at that time, so I used DSSP letter-codes as a guide to secondary structure. I searched the Top5200 database for residues not labeled as H, but adjacent to a string of continuous H-labeled residues. This method found helix caps (Richardson & Richardson, 1988), but also found interruptions in regular helix, such as the widened helix turn. One of the residues in the widened turn falls in a distinctive region of the Ramachandran plot (Figure 6.8), as mentioned in Section 4.2.
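The DSSP-based search described above can be sketched with a regular expression over a per-residue string of DSSP letter codes. The function name and the minimum run length of four H-labeled residues are assumptions of this example and are not the settings used in the original search.

```python
import re


def helix_adjacent(dssp, min_helix=4):
    """Indices of non-H residues directly flanking a run of H residues.

    dssp:      string of DSSP letter codes, one character per residue.
    min_helix: shortest run of 'H' that counts as helix (assumed value).
    """
    hits = set()
    for match in re.finditer("H{%d,}" % min_helix, dssp):
        if match.start() > 0:
            hits.add(match.start() - 1)   # residue just before the run
        if match.end() < len(dssp):
            hits.add(match.end())         # residue just after the run
    return sorted(hits)
```

A residue that sits between two helical runs is reported once, and it is these helix-interrupting positions, as opposed to the caps at either end, where motifs such as the widened helix turn show up.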


Figure 6.8: The characteristic cluster in Ramachandran space for the bifurcation-stabilized widened helix turn.

I struggled to characterize widened helix turns with the tools available to me. The need for an automated method for reliable motif identification inspired the CaBLAM fingerprints method. I performed a number of rough superpositions of this motif by hand, a process that inspired CaBLAM’s automated motif superposition method.

Much later, Dave introduced me to a second kind of widened helix turn, one dependent on backbone-sidechain hydrogen bonding with a threonine sidechain. These two related motifs are discussed together in this section.


6.4.1 Characteristics

The first type of widened helix turn is defined by a bifurcated backbone-backbone hydrogen bond and two missing backbone hydrogen bonds (Figure 6.9). The widened turn is embedded in regular helix. The carbonyl oxygen of Residue 1 of the widened turn has a bifurcated hydrogen bond, bonding to the amide hydrogens of Residue 5 and Residue 6. These hydrogen bonds are i to i+4 and i to i+5, respectively, a bonding pattern that places the turn about halfway between the familiar alpha helix and π helix. Residue 2 has an i to i+5 hydrogen bond. The carbonyl oxygens of Residues 3 and 4 do not hydrogen bond to the backbone. The amide hydrogens of Residues 5 and 6 participate in the π-helix-like hydrogen bonds already described, but the carbonyl oxygens of those residues return to standard i to i+4 helix-pattern hydrogen bonding. Residue 8 is not part of the formal fingerprint definition for the widened helix turn, but is sometimes significant to understanding the motif. Often, but not in every case, Residue 8 is a proline, the sidechain of which interferes with regular helix-pattern hydrogen bonding and explains the missing hydrogen bonds in Residues 3 and 4. With or without a proline, the widened helix turn results in (and stabilizes) a bend of 20-30° in the helix (Figure 6.4).
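The two defining criteria, the bifurcated bond from Residue 1 and the unbonded carbonyls of Residues 3 and 4, can be sketched as a single test. This is a simplified illustration that assumes backbone hydrogen bonds are available as (acceptor index, donor index) pairs; it is not the fingerprint definition itself and it does not check the surrounding helix.

```python
def is_bifurcated_widened_turn(hbonds, r1):
    """Test the core criteria of the bifurcation-stabilized turn.

    hbonds: set of (acceptor_index, donor_index) backbone O...H pairs.
    r1:     index of Residue 1 of the candidate turn.
    """
    acceptors = {acc for acc, _ in hbonds}
    # Residue 1 O bonds to the amide H of both Residue 5 and Residue 6
    bifurcated = (r1, r1 + 4) in hbonds and (r1, r1 + 5) in hbonds
    # Residues 3 and 4 have no backbone partner for their carbonyl O
    missing = (r1 + 2) not in acceptors and (r1 + 3) not in acceptors
    return bifurcated and missing
```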


Figure 6.9: Typical bifurcation-stabilized widened helix turn from 1y7t, showing bifurcated bond in the foreground and missing bonds to the left.

The threonine-dependent widened helix turn is defined by a threonine residue inserted into the hydrogen-bonding pattern of the helix (Figure 6.10). The effect of the threonine on hydrogen bonding is dramatic enough that the threonine alone is sufficient to define this version of the motif. In the threonine widened helix turn, the carbonyl oxygen of Residue 1 hydrogen bonds to the HG1 atom of the threonine sidechain of Residue 4. The peptide plane of Residue 3/4 is often rotated such that the amide hydrogen of Residue 4 bonds with the carbonyl oxygen of Residue 2 instead of Residue 1, although this is not necessary. (If this is the case, Residue 3, which shares the rotated peptide plane with Residue 4, will not have a backbone hydrogen-bonding partner for its carbonyl oxygen.) Thus the threonine sidechain substitutes for the protein backbone to form an i to i+4 hydrogen bond. The OG1 atom of the threonine sidechain of Residue 4 then bonds with the amide hydrogen of Residue 5. In effect, a very long i to i+5 hydrogen bond is formed, passing through the threonine sidechain. This pushes the helix apart in the vicinity of the threonine, creating a bend in the helix similar to the one found in the widened helix turn stabilized by a bifurcated bond.

Figure 6.10: Typical threonine-dependent widened helix turn, showing threonine (front) inserted into the helix bonding pattern.


6.4.2 In CaBLAM space

In CaBLAM space, widened helix turns are distinguishable from regular alpha helix. This discussion focuses on the 2D CaBLAM space described by μin and μout, as that is the space that remains reliable at low resolution.

Figure 6.11: CaBLAM space comparison of member residues of widened helix turns. Bifurcation-stabilized (left) and threonine-dependent (right). Residue 3 is green, Residue 4 is cyan, Residue 5 is blue. (x-axis is µin, y-axis is µout.)

The bifurcation-stabilized turn contains three residues with distinct conformations in CaBLAM space (Figure 6.11, left). Residues 1 and 2 are largely indistinguishable from regular helix. Residue 3 appears down and to the left of regular helix. This position means that the Cα dihedrals around Residue 3 are flatter and more cis-like than those for regular helix. Down and left from alpha helix is where π helix is found (see Section 4.7.4), so the appearance there of the almost-π-helix residues of the widened turn is to be expected. Residue 4 appears directly below the position of Residue 3, roughly in the lower arm of the loose alpha helix contours. Observed residues are distributed across about 30° in µin and about 60° in µout. The distribution of Residue 5 appears as an approximate reflection of the Residue 4 distribution, with the axis of reflection defined by the x=y (more properly, µin=µout) line. Residue 6 returns to the conformation of regular helix, and the subsequent residues follow suit unless interrupted by another motif. The overall path of a bifurcation-stabilized helix turn through CaBLAM space sketches a rough triangle (Figure 6.12, left).

The threonine-dependent turn likewise contains three residues with distinct conformations in CaBLAM space (Figure 6.11, right). Residues 1 and 2 are largely indistinguishable from regular helix. Residue 3 appears below the position of regular helix. Residue 4 appears to the left of Residue 3. Residue 5 appears above Residue 4 and to the left of regular alpha helix. Residue 6 returns to the conformation of regular helix, and the subsequent residues follow suit unless interrupted by another motif. The distributions of observed residues for the threonine-dependent turn are not as tight as the distributions for the bifurcation-stabilized turn, but are more isotropic. The overall path of a threonine-dependent helix turn through CaBLAM space sketches a rough square (Figure 6.12, right).


Figure 6.12: Instances of widened helix turns superimposed in 2D CaBLAM space show comparison of bifurcation-stabilized turn (left) to threonine-dependent turn (right). (x-axis is µin, y-axis is µout.)

For examples derived from high-resolution structures, like those in our Top8000 dataset, these two types of widened helix turns are somewhat distinguishable from each other in CaBLAM space. The threonine-dependent motif generally stays closer to the regular alpha helix conformation, while the bifurcation-stabilized version allows some dramatic departures, especially in Residues 4 and 5. However, the degree of difference between the distributions is probably insufficient to distinguish between the motifs in a low-resolution or poorly modeled structure. The difference in the triangular versus square path the motifs follow through CaBLAM space may be more robustly represented in low-resolution models.


6.4.3 Sequence correlations

The bifurcation-stabilized widened helix turn shows significant sequence preferences (Figure 6.13, top). Most notable is a strong preference for proline at the Residue 8 position. Proline occurs here in about 40% of observed instances of this motif, and no other residue types are significantly favored at this position. Where present, proline plays a clear role in the motif. The proline sidechain ring prohibits regular helix-pattern hydrogen bonding and forces the helix to bend. The real puzzle of the motif is how and why it accomplishes its characteristic bend and bonding disruption in cases where the proline is absent. An evolutionary study of the motif in related proteins might reveal whether the proline position is subject to mutation. A structural explanation for the shape of the motif without proline has not been evident in my studies. Residue positions 2, 3, and 4 show slight enrichment in glutamate. Residue 4 is notably enriched in asparagine as well. This motif did not show obvious correlation to active sites, but Residues 3 and 4 are the residues missing their hydrogen bonding partners for their carbonyl oxygens, so glutamate and asparagine may help to stabilize the motif by forming compensating hydrogen bonds. In my initial, Ramachandran-driven explorations, the motif had appeared significantly enriched in branched-Cβ residues, but the more complete study enabled by CaBLAM revealed this enrichment to be only marginal.


Figure 6.13: WebLogos for bifurcation-stabilized (top) and threonine-dependent (bottom) widened helix turns, aligned for comparison.

The threonine-dependent widened helix turn has much less sequence correlation, with the obvious exception of its obligate threonine residue (Figure 6.13, bottom). The helix bend in this motif is accomplished largely through the insertion of the threonine into the helix-pattern hydrogen bonding, and proline never plays a role in bending the helix. The other residue positions in the motif mostly show a typical mild enrichment in helix-favoring residue types. Residue 3 shows a mild enrichment in polar and negatively charged residue types. These residues are likely involved in the formation of hydrogen bonds to compensate for bonding lost due to the helix bend.


6.5 Double tight turns

The double tight turn is a motif that serves as an example of the benefits of developing visual literacy and the way that structural models put the unexpected in plain view. I was searching a structure for an example of a different (and ultimately uninteresting) motif. One of these double tight turns happened to be near where I was looking and caught my eye as something unusual. CaBLAM then allowed me to very quickly write a fingerprint definition and automate a search for more examples.

6.5.1 Characteristics

Single tight turns are a common structural feature (Richardson, 1981). They are often found joining antiparallel beta strands, since the length of one peptide plane matches well with the length of the hydrogen bonds that join beta strands. The double tight turn is, quite simply, two tight turns in a row. The shape of the resulting strand trace is reminiscent of a cloverleaf, though with only two leaves on its stem.

The double tight turn is seven residues long, and is defined by three hydrogen bonds among three of those residues (Figure 6.14). The first hydrogen bond is between the amide hydrogen of Residue 1 and the carbonyl of Residue 7. These residues are often the last/first in a beta strand pair or “beta hairpin”, and this bond is often part of the antiparallel beta bonding pattern. The second hydrogen bond is between the carbonyl oxygen of Residue 1 and the amide hydrogen of Residue 4. This hydrogen bond and the peptide plane between Residues 2 and 3 form the first tight turn conformation. The third hydrogen bond is between the carbonyl oxygen of Residue 4 and the amide hydrogen of Residue 7. This hydrogen bond and the peptide plane between Residues 5 and 6 form the second tight turn conformation. Each residue involved in this motif’s hydrogen bonding pattern is bonded to each of the other two involved residues. There is no sequence distance between the two tight turns.
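The three-bond definition translates directly into a set test. This is a minimal sketch assuming backbone hydrogen bonds are stored as (acceptor index, donor index) pairs; the function name is hypothetical and the check covers only the three bonds named above.

```python
def is_double_tight_turn(hbonds, r1):
    """Test for the three hydrogen bonds of a double tight turn.

    hbonds: set of (acceptor_index, donor_index) backbone O...H pairs.
    r1:     index of Residue 1; Residues 4 and 7 are r1+3 and r1+6.
    """
    required = {
        (r1 + 6, r1),       # O of Residue 7 to H of Residue 1
        (r1, r1 + 3),       # O of Residue 1 to H of Residue 4
        (r1 + 3, r1 + 6),   # O of Residue 4 to H of Residue 7
    }
    return required <= hbonds
```

Each of Residues 1, 4, and 7 appears in exactly two of the three required pairs, which is the "each bonded to each of the other two" property of the motif.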

Figure 6.14: A double tight turn from 2cn3.pdb shows characteristic 3 H-bonds among 3 residues.

My search with CaBLAM revealed a hundred-some well-modeled double tight turns within the Top8000. This unusual-looking motif generally appears at the end of an antiparallel beta strand pair, as a single tight turn would. It often appears in “propeller” folds. 2cn3.pdb, for instance, contains an 8-fold beta propeller arranged around a central axis. Four of the eight beta strand pairs involved in the propeller have a double tight turn at one end. The strand pairs with the double tight turn are not evenly distributed around the propeller, rather lying next to each other. The electron density supports the absence of double tight turns in the other strand pairs. For the time being, it remains mysterious why double tight turns should form at all, let alone why they should be unevenly distributed as in 2cn3.pdb.

6.5.2 In CaBLAM space

CaBLAM can, of course, find single tight turns. In that case, the challenge is largely in differentiating among the types of tight turn. Single tight turns are familiar and not particularly exciting, so I have omitted a full discussion of them from this work in favor of the more unusual double tight turn. However, I have briefly studied single tight turns in CaBLAM space, and can confirm that the Cα geometry of the double tight turn is nothing more than two single tight turns appended to each other.

Residues 2 and 5 play the same role in their respective tight turns and have the same distributions in CaBLAM space (Figure 6.15). Likewise Residues 3 and 6 play the same roles and have the same distributions. Residue 4 joins the two tight turns and promises to be the most conformationally interesting. The distribution of Residue 4 in CaBLAM space turns out to follow the distribution of Residue 7, but not Residue 1, leaving Residue 1 as the only residue that does not follow the same distribution as another in the motif. Instead, Residue 1 tends to follow the distribution of beta structure.

Figure 6.15: Instances of double tight turns superimposed in CaBLAM space show the motif favors a conserved, repeating path. (x-axis is µin, y-axis is µout.)

The self-similarity of this motif is remarkable. Evidently, very little if any deviation from the regular tight turn structure is needed in order to string two together.

6.5.3 Sequence correlations

As in the case of its distribution in CaBLAM space, the double tight turn’s sequence correlation is fascinatingly unsurprising. In its sequence preferences, the double tight turn is very clearly two single tight turns in a row (Figure 6.16). Residues 1 and 4 have similar sequence preferences for aspartate or asparagine. Residues 2 and 5 tend to favor proline, as is common in tight turns. However, instances of the motif in which both Residue 2 and Residue 5 were proline were not the norm. Residues 3 and 6 have similar, though weaker sequence preferences. Here, Residue 7 was the odd residue out, having weak sequence preferences and no strong resemblance to another residue in the motif.

Figure 6.16: WebLogo for double tight turns.

6.6 Non-sequential helix bonding

A motif that showcases CaBLAM’s power to automate identification of unusual structural features is non-sequential helix bonding. In a typical alpha helix, the amide hydrogen of each residue hydrogen bonds to the carbonyl oxygen of the residue four positions prior in sequence. Non-sequential helix bonding occurs at the N-terminus of a helix, where the hydrogen bonding opportunities of the amide hydrogens typically go unfulfilled (unless a capping motif is present). In this case, two of the amide hydrogens bond to the carbonyl oxygens of a piece of the protein distant in sequence. The distant bonding partners run in parallel to the helix residues, such that helix residue i bonds to distant residue j, and helix residue i+1 bonds to distant residue j+1. The helix residues and their distant bonding partners are not otherwise sequence-related. The local conformation of the distant residues is similar to helix conformation for the duration of the bonding partnership.
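The pairing rule just described (helix residue i bonds to distant residue j, and i+1 bonds to j+1) can be sketched in a few lines of Python. This is only an illustration, not the CaBLAM motif fingerprint itself: the function name, the representation of hydrogen bonds as (donor, acceptor) residue-index pairs within one chain, the `helix_starts` input, and the `min_separation` threshold are all assumptions made for the example.

```python
def find_parallel_bond_pairs(hbonds, helix_starts, min_separation=6):
    """Find donors i, i+1 that bond to sequence-distant acceptors j, j+1.

    hbonds: iterable of (donor, acceptor) residue indices in one chain, where
        the donor's amide hydrogen bonds to the acceptor's carbonyl oxygen.
    helix_starts: set of residue indices at helix N-termini (hypothetical
        input; a real search would derive this from the structure).
    min_separation: assumed minimum sequence distance for "distant" partners.
    """
    bonds = set(hbonds)
    matches = []
    for donor, acceptor in sorted(bonds):
        if donor not in helix_starts:
            continue  # only consider the first turn of a helix
        if abs(donor - acceptor) < min_separation:
            continue  # partner is too close in sequence to count as distant
        if (donor + 1, acceptor + 1) in bonds:
            matches.append((donor, acceptor))  # i -> j and i+1 -> j+1
    return matches


example_bonds = [(50, 10), (51, 11), (54, 50), (55, 51), (60, 58)]
example_matches = find_parallel_bond_pairs(example_bonds, helix_starts={50, 60})
```

Raising `min_separation` is one way such a check could be made more stringent, for example to exclude short insertions within a single helix.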

Figure 6.17: Non-sequential helix bonding of two helices.

I wrote a motif fingerprint definition for this non-sequential helix bonding behavior at the N-terminus of a regular alpha helix. I then used CaBLAM to automate a search through the Top8000 for all well-modeled instances of the bonding pattern.

CaBLAM found 169 matches to the fingerprint definition used. The search showed that non-sequential helix bonding is not a single motif, but occurs as the unifying behavior in a number of otherwise unrelated motifs.

Figure 6.18: Not what “three-helix junction” usually means.

18 of the fingerprint matches were not true non-sequential bonding, but were extra-wide helix turns. These turns contained more than two extra residues inserted into the helix as a short loop and thus passed my check for “distant” bonding. A more stringent fingerprint definition could exclude these helix turns from consideration if necessary.

Figure 6.19: Non-sequential helix bonding with an antiparallel beta strand.

34 of the fingerprint matches were helices placed end to end (Figure 6.17), such that the C-terminus of one helix fulfilled the hydrogen bonding opportunities of the N-terminus of the other helix. These end-to-end helix arrangements were roughly linear, and could be mistaken for continuous helix, if not for the mainchain traces departing and entering midway through the helix. The majority of the end-to-end helix pairs were alpha helices paired with alpha helices. However, 8 of the observed cases involved 3₁₀ helices.

A very few additional cases featured an intersection of three helices, rather than a collinear arrangement of two helices. The temptation to refer to this phenomenon as a three-helix-junction to annoy students of nucleic acids has been strong (Figure 6.18).

57 of the fingerprint matches were associated with beta structure (Figure 6.19). The majority of these featured a beta strand pair at the N-terminus of a helix. One strand leaves the beta pairing relationship to form the helix; the other strand distorts slightly to continue bonding with its partner as the helix begins. The associated beta strands may be parallel or antiparallel. The beta strands are often short, but may be part of a larger sheet. Unsurprisingly given the distortion necessary to bond with the helix, the beta sheet forms a clear beta bulge in at least 7 cases.

Figure 6.20: Non-sequential helix bonding bracketing a long loop.

46 of the fingerprint matches do not correspond strongly to a secondary structure type, but instead bracket a long loop between the helix N-terminus and its bonding partners (Figure 6.20). These loops vary in length from about 11 residues to 20-some residues. (The exact cutoffs in each case were subject to my human interpretation.) These loops generally lacked obvious structure, save that many of them contained a very short alpha helix. This helix was generally only one to one-and-a-half turns in length and occurred far away from the helix used to identify the motif.

The remaining matches to the motif fingerprint did not fall into a clear, unifying category.

6.7 Discussion

CaBLAM allowed rapid and automated description and identification of a wide variety of protein structural motifs. Any motif that can be described with a hydrogen bonding pattern – from regular motifs like alpha helix, to familiar motifs like the tyrosine corner, to unusual behaviors like non-sequential helix bonding – can be studied faster and more comprehensively with CaBLAM. Automated motif searching is a tool that has been long desired in the Richardson Lab.

There remain a vast number of protein motifs and variants on motifs to be described (Richardson, 1981; Aurora & Rose, 1998; etc.). CaBLAM can help accelerate this work. With only one instance of a motif of interest, a motif fingerprint can be written, and CaBLAM will automate the search for all other instances of the motif in a database.

7. Low-resolution validation in MolProbity

The Richardson Lab has long specialized in macromolecular structure validation. The methods developed by the lab over the years are powerful tools for assessing structures in the middle range of data quality. However, experimental data at either low resolution or very high resolution present unique challenges for structure solution. Structures solved at the extremes of data quality likewise present unique challenges to structure validation. While the existing MolProbity tools (Chen et al, 2009) still provide valuable feedback for such structures, the tools have not been developed or tuned to address their unique challenges.

CaBLAM is the vanguard method in our push towards development of methods better suited to validating low-resolution protein structures. The integration of CaBLAM into MolProbity has also served as the launch point for other low-resolution validation methods. While our work to validate structures at the extremes of data quality is far from finished, CaBLAM and the other low-res validations represent a major step towards resolution-specific validation.

Clear visual communication is key to validation feedback, both in the multicriterion kinemage and on the MolProbity site. Throughout this chapter, I will attempt to show the rationale that informed our design and display choices for the new MolProbity validations.

7.1 Challenges of low-resolution validation

Low-resolution structures present problems for our traditional validation methods due to the types and quantity of modeling errors found in them.

7.1.1 Validation overload

The validation challenge immediately apparent upon completing a MolProbity assessment of a typical low-resolution structure is validation overload. Because of the unreliability of their experimental data and the difficulties in solving a structure from that data, low-resolution structures typically have far more outliers in far higher density than high- or mid-resolution structures. The endless sea of validation markup (Figure 7.1) in the multicriterion kinemage for an especially poor structure such as 2o01.pdb is intimidating. How does a user know where or how to start correcting such a structure?

Figure 7.1: Validation overload in 2o01.pdb.

The layering of multiple modeling errors upon each other also confuses some validation methods. In particular, Ramachandran analysis becomes unpredictable when many backbone modeling errors occur in close proximity. As discussed in Section 4.1.1, Ramachandran analysis is very sensitive to single modeling errors, but its sensitivity is dependent on having reliable atomic positions around each outlier. Multiple errors together confound its sensitivity.

We need to develop new validation methods (or variants of existing methods) that can provide focus for users attempting to improve difficult structures. Tools that focus on the most severe outliers and ignore lesser outliers will reduce the amount of intimidating visual clutter in validation kinemages while guiding users to the most important areas to fix. Tools that focus on identifying easily corrected errors will help users get a foothold in an intimidating structure. CaBLAM and other associated improvements in MolProbity begin to introduce these new tools, but reducing validation overload will be an ongoing area of development.

7.1.2 Secondary structure errors

Low-resolution structures contain more modeling errors in secondary structure elements than high- or mid-resolution structures. This may be due to the ambiguous electron density discussed in Section 3.2.1, the misleading truncated sidechain density discussed in Section 3.2.2, or simply the general poor information content of the experimental data.

Alpha helices are often laterally compressed in low-resolution structures (Figure 7.2). When viewed down the helix axis, normal alpha helices have a roughly circular appearance, with the Cα trace following the path of a 7-pointed star. The compressed, low-resolution helices instead have an ovoid appearance and a less clear Cα trace when viewed down the axis. These compressed helices are not consistently identified as outlier behavior by our current methods.

Figure 7.2: Laterally compressed helix from 2o01.pdb.

Incorrect orientations of peptide planes are surprisingly common in low-resolution structures (Figures 4.1 & 4.2). These are most noticeable in secondary structure, where the correct orientations are clear, but they also occur in loop regions. Attempts by people and programs to fit backbone carbonyl oxygens into truncated sidechain density (Figure 3.3) are likely the cause of many of these errors. CaBLAM validation was specifically developed to assess these modeling errors in a systematic fashion.

7.1.3 Loop uncertainty

Accurate modeling of loop regions is challenging at all resolutions. Loops tend to be more mobile than other parts of a protein and therefore not as well resolved by the experimental data. In high-resolution structures, poorly resolved loops cannot be accurately refined with the same target function weights as the well resolved majority of the structure. At low resolution, loop regions tend to disappear entirely from the electron density (Figure 3.4). Loops solved from little or no data are prone to all manner of modeling errors. Since loop regions are highly varied by nature, reliable reference points for correctly modeled loops are difficult to find.

Loop validation remains one of the uncharted frontiers of low-resolution validation. Tools exist to predict plausible loops from the surrounding structure. Vincent Chen’s JiffiLoop (Chen, 2009) searches a library of vetted loops for loops that will close a gap in a model. Various computational tools including Rosetta (Leaver-Fay et al, 2011), Qfit (Jackson, 2002), and molecular dynamics generally have powerful methods for predicting protein structures, including loop regions. These methods and others provide solutions for how to rebuild a loop known to be in error. But these methods do not yet produce reliably correct solutions. And how does one identify problematic loops except by the presence of Ramachandran errors or clashes within them? Can a validation method be devised that assesses a loop as a whole, rather than just assessing each constituent residue? I do not yet have an answer.

7.2 Clash cutoff adjustment

We came upon one significant solution to the problem of validation overload during our time as CASP8 assessors (Keedy et al, 2009). Many of the predicted structures submitted to CASP for assessment fared poorly in MolProbity assessment. Like low-resolution structures, they contained a multitude of modeling errors, especially minor steric clashes. These minor clashes may have resulted from the difficulty of the structure prediction task, or from optimization against less strict geometric criteria than those used by MolProbity. Regardless of their cause, the prevalence of minor clashes made assessing the quality of the predicted structures difficult. Severe errors were visually – and sometimes statistically – masked by the multitude of minor ones.

Our solution was to increase the amount of steric overlap necessary for an interaction to be identified as a clash from 0.4Å to 0.5Å (see section 2.3.2). In clash assessment as performed by Probe, a spherical probe is rolled over the surface of each atom. Whenever the probe touches or passes within the surface of another atom, an interaction is indicated. Close proximity without overlap is a van der Waals interaction. An overlap between an appropriate bonding pair is a hydrogen bond. And an overlap between a non-bonding pair is either a close contact or a steric clash.

The cutoff for distinguishing between a close contact and an unfavorable clash is generally set at an overlap of 0.4Å. Increasing the necessary overlap to 0.5Å expands the close contact category to include interactions that were previously minor clashes. In a kinemage (Figure 7.3), this reduces the visual clutter resulting from those minor clashes, allowing a user to better locate and address sites of major modeling errors.
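The effect of moving the cutoff can be illustrated with a small sketch. This is a simplification for illustration and not Probe's implementation: the function names, the `can_hbond` flag, and the example overlap values are assumptions, and overlap is taken to be positive when van der Waals surfaces interpenetrate.

```python
def classify_contact(overlap, can_hbond, clash_cutoff=0.4):
    """Label one atom-atom interaction from its surface overlap in Å.

    overlap: positive when the two atomic surfaces interpenetrate.
    can_hbond: True for an appropriate donor/acceptor pair (assumed input).
    clash_cutoff: overlap at or above which a non-bonding pair is a clash.
    """
    if overlap <= 0.0:
        return "van der Waals"
    if can_hbond:
        return "hydrogen bond"
    if overlap >= clash_cutoff:
        return "clash"
    return "close contact"


def count_clashes(overlaps, clash_cutoff):
    """Count clashes among non-bonding pairs at a given cutoff."""
    return sum(
        classify_contact(o, can_hbond=False, clash_cutoff=clash_cutoff) == "clash"
        for o in overlaps
    )


# Invented overlaps for non-bonding pairs, in Å.
example_overlaps = [-0.1, 0.1, 0.42, 0.45, 0.6, 1.0]
clashes_at_04 = count_clashes(example_overlaps, 0.4)
clashes_at_05 = count_clashes(example_overlaps, 0.5)
```

With these invented values, the two overlaps between 0.4Å and 0.5Å move from the clash category to the close contact category when the cutoff is raised, while the larger overlaps remain flagged.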

Figure 7.3: Validation overload reduced via new clash cutoff and CaBLAM markup in place of rotamer outliers.

The altered clash cutoff is not appropriate for all structures. High-quality structures at high resolution should be very much concerned with identifying and either fixing or explaining the minor overlaps that the altered cutoff hides. For this reason, the multicriterion kinemage for low-resolution structures – which displays clash dots only for overlaps ≥ 0.5Å – is available by default only for structures solved at a resolution of 2.5Å or worse.

7.3 New coloring

One of the main forms of validation feedback provided by MolProbity is a sortable HTML table containing validation information for each residue in a structure. As I worked on new validation markup for the multicriterion kinemage, I also worked on improving the representation of outliers in the table of residues. A new coloring scheme was the result.

The MolProbity table is organized with one row for each residue and one column for each validation category. Each individual cell defaults to background colors, but may be specifically colored to draw attention to information of interest. The traditional use of this coloration has been to color the cells for outliers in bright pink. Thus the cell for a steric clash ≥ 0.4Å, the cell for a Ramachandran space outlier, and the cell for a bond geometry deviation would all be colored pink. By default, the MolProbity table is sorted by sequence, so this coloration has served as a quick visual guide to locations in need of attention from users.

However, as in the case of the multicriterion kinemage, validation overload can make the MolProbity table difficult to use for low-resolution or otherwise challenging structures. For challenging structures, some of the validation columns, especially the column for clash, can become solid walls of undifferentiated bright pink.

Differentiation among outliers was achieved by the addition of new colors to the MolProbity table. The majority of outliers continue to be colored in pink. Mild outliers and conformations that are suspicious but not necessarily incorrect are now colored in light pink. Severe outliers are now colored in bright red. The new colors were chosen based on their relationship to the existing standard pink coloration. In this color scheme, increasing level of saturation correlates with increasing outlier severity, which should be an intuitive encoding of information for most users. The three levels of saturation are distinguishable in gray scale as well, which should ensure their readability by colorblind users.

7.3.1 Coloring for clashes

All three colors are used for the Clashes column (Figure 7.4). The mild clashes with steric overlap between 0.4 and 0.5Å discussed in section 7.2 are classified as mild outliers and colored light pink. Clashes with a very large steric overlap (>0.9Å) are classified as severe outliers and colored bright red. The remaining clashes in the middle range of steric overlap are colored the same pink as before. The new coloring scheme brings differentiation to the “Clashes” column. In the multicriterion kinemage, the visual density of clash spikes provides users with intuitive information about the severity of a clash. The new chart color scheme uses color saturation to recapitulate that visual intuition.
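The three clash tiers can be written as a short lookup. The sketch below is illustrative only: the function name is hypothetical, the colors are given by name rather than by MolProbity's actual hex values, and the handling of overlaps exactly at 0.5Å and 0.9Å is an assumption, since the text gives the ranges but not the boundary behavior.

```python
def clash_cell_color(overlap):
    """Return the table color name for a clash with the given overlap in Å.

    Returns None when the overlap is below the 0.4 Å clash cutoff, so the
    cell keeps its default background.
    """
    if overlap < 0.4:
        return None            # not a clash
    if overlap < 0.5:
        return "light pink"    # mild outlier
    if overlap > 0.9:
        return "bright red"    # severe outlier
    return "pink"              # standard outlier


example_colors = [clash_cell_color(o) for o in (0.3, 0.45, 0.7, 1.2)]
```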

Figure 7.4: Comparison of old (left) and new (right) multicriterion table coloring for clashes in 4hum.pdb.

7.3.2 Coloring for Ramachandran and rotamers

The Ramachandran and Rotamer columns use standard pink and light pink (Figure 7.5). Our Ramachandran analysis has long differentiated among three categories of conformational behavior: Favored, Allowed, and Outlier. Until the recent work of Bradley Hintze to improve our rotamer analysis with new contours based on the Top8000, sidechain rotamers had only been divided into Favored and Outlier categories. With the new Top8000 rotamer contours, an Allowed category can be defined for rotamers. In both the Ramachandran and Rotamer columns, Outlier residues are marked with standard pink, and Allowed residues are marked with light pink. There is not a clear severe outlier category in Ramachandran or rotamer analysis, so the bright red coloration remains unused.

Figure 7.5: Comparison of old (left) and new (right) multicriterion table coloring for Ramachandran and rotamer analysis in 4hum.pdb.

For Ramachandran analysis, this is a substantial increase in markup. Allowed residues are not outliers, and a certain percentage (about 2%) of residues in a given structure are expected to fall in the Allowed region. It had been inappropriate to mark these residues with the same color as Outliers when only that one color was available. With the new light pink coloration, user attention can be drawn to Allowed residues more strongly than text alone permitted, but without obfuscating the difference between unfavorable conformations and true outliers.

The presentation of rotamer analysis is patterned on the presentation of Ramachandran analysis and shares the same logic. Comparison of old representations versus new ones is less straightforward for rotamers, due to the updated rotamer contours. With the addition of the Allowed category, many residues have changed category from the old system to the new. Regardless, the new color scheme allows at-a-glance differentiation among the Favored, Allowed, and Outlier categories without a need to read the text in every cell.

7.3.3 Coloring for geometry validation

MolProbity presents three forms of fundamental geometry validation: Cβ deviations, bond lengths and bond angles. For each of the geometry validation columns, standard pink and bright red colorations are used. Deviations from expected geometry usually indicate a significant problem in a structure, so there is no such thing as a “mild” geometry outlier and no need for the light pink coloration.

Cβ deviation is a measure of bond geometry around the Cα atom, measured as the deviation in Å of the modeled Cβ atom position from its ideal position (Lovell et al, 2003). A residue with a deviation of 0.25Å or greater from the ideal position is an outlier, and is colored standard pink. A residue with a deviation of 0.7Å or greater is considered a severe outlier and is colored bright red.

Our expectations for bond lengths and bond angles are derived from the standard Engh and Huber criteria (Engh & Huber, 2001). Outliers from these expectations are expressed in multiples of standard deviation, σ. Most bond length and angle outliers are colored the standard pink. Outliers that are 10σ or more away from the expected value are considered especially severe and colored bright red. Such severe bond geometry outliers are relatively rare, but represent substantial statistical deviation from expected behavior and are important to draw user attention to.
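The two-tier geometry coloring can be sketched as follows. The function names are hypothetical and colors are given by name only. The 0.25Å, 0.7Å, and 10σ thresholds come from the text; the `outlier_sigma` default of 4σ for ordinary bond length and angle outliers is an assumption, since this section does not state that cutoff.

```python
def cbeta_cell_color(deviation):
    """Color name for a Cβ deviation in Å, or None if not an outlier."""
    if deviation >= 0.7:
        return "bright red"   # severe outlier
    if deviation >= 0.25:
        return "pink"         # standard outlier
    return None


def bond_geometry_cell_color(sigma, outlier_sigma=4.0):
    """Color name for a bond length or angle deviation in multiples of σ.

    outlier_sigma is an assumed cutoff for ordinary outliers; only the
    10σ severe threshold is taken from the text.
    """
    magnitude = abs(sigma)    # deviations may be positive or negative
    if magnitude >= 10.0:
        return "bright red"
    if magnitude >= outlier_sigma:
        return "pink"
    return None
```

Neither function ever returns light pink, which reflects the absence of a "mild" category for geometry outliers.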

7.3.4 Coloring for CaBLAM

CaBLAM’s integration into MolProbity will be more fully discussed in section 7.4. CaBLAM analysis produces three categories of notable residues – Outliers, Disfavored Conformations, and Cα Geometry Outliers – so the CaBLAM column makes use of all three colorations (Figure 7.6).

Figure 7.6: Multicriterion table coloring for CaBLAM validation in 4hum.pdb.

CaBLAM Outliers are the primary outlier type represented in the column and are colored in standard pink. These Outlier residues are always in need of user attention, so their color was selected to match the expected color for standard outliers. Disfavored residues are roughly analogous to the Allowed conformations in Ramachandran and rotamer analysis, although Disfavored CaBLAM residues are more likely to be modeling errors than their Ramachandran counterparts. Light pink is used for Disfavored residues to reinforce this analogy. Cα Geometry Outliers are a difference in kind rather than in degree, but represent the most severe type of outlier identified by the CaBLAM system. Cα Geometry Outliers are colored bright red due to their severity.

As a historical note, it was the integration of CaBLAM and its three outlier categories into MolProbity that prompted the development of multiple-color coding in the MolProbity chart. I needed a way to distinguish between CaBLAM Outliers and CaBLAM Disfavored. Horizontal space in the chart is at a premium, so adding additional columns was a poor solution. Outlier versus Disfavored was already visually distinguished in the multicriterion kinemage; Outliers were pink and Disfavored residues were purple. I initially used these same pink and purple colors for the CaBLAM column (and briefly used a dark gray for Cα Geometry Outliers). I also tried marking Ramachandran Allowed with the same yellow used as a warning color in MolProbity’s whole-model summary chart. These different colors were standardized into the unified light pink/pink/red scale described here to aid in intuitive understanding of the system.

7.3.5 Coloring for cis-peptides

The integration of cis-peptide validation into MolProbity will be more fully discussed in section 7.5. Cis-peptide analysis produces five categories of residues: trans-peptides, cis-proline, cis-nonproline, twisted proline, and twisted nonproline. However, the MolProbity chart uses only the standard pink to mark suspicious residues (Figure 7.7).

Figure 7.7: Multicriterion table coloring for cis-peptide validation.

Trans-peptides and cis-prolines are left uncolored, although cis-prolines are identified by their text box. Trans-peptides exhibit the generally expected behavior, and cis-prolines are only suspicious if they are over-represented in a model, a statistic better captured by MolProbity’s whole-model summary chart. All other categories – cis-nonproline, twisted proline, and twisted nonproline – are suspicious and merit user attention, so are colored standard outlier pink. Cis-peptide validation is likely to evolve over the next several years, but at this time there is not enough data or experience to warrant parsing outlier levels for cis-peptides any finer.
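The five categories and their coloring can be summarized in a brief sketch. The function names and the use of three-letter residue codes are assumptions for illustration; the conformation label ("trans", "cis", or "twisted") is taken as an input here rather than computed.

```python
def peptide_category(conformation, following_residue):
    """Name the cis-peptide category for one peptide bond.

    conformation: "trans", "cis", or "twisted" (assumed to be precomputed).
    following_residue: three-letter code of the residue after the peptide.
    """
    if conformation == "trans":
        return "trans-peptide"
    kind = "proline" if following_residue.upper() == "PRO" else "nonproline"
    if conformation == "cis":
        return f"cis-{kind}"
    if conformation == "twisted":
        return f"twisted {kind}"
    raise ValueError(f"unknown conformation: {conformation!r}")


def peptide_cell_color(category):
    """Trans-peptides and cis-prolines stay uncolored; all others are pink."""
    if category in ("trans-peptide", "cis-proline"):
        return None
    return "pink"
```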

7.3.6 Coloring for RNA validation

The RNA-specific validations in MolProbity are sugar pucker validation and RNA backbone “suiteness” validation. Both sugar pucker and backbone suiteness are binary outlier/non-outlier evaluations at this time, without specific distinctions for mild or severe outliers. These columns in the MolProbity chart continue to use only the standard pink for outlier identification. However, the new MolProbity coloration method is easily extended, so if these validations develop in a way that can take advantage of the new coloring, it will be available.

7.3.7 Extendibility of coloring

The new color scheme for the MolProbity chart is extendible and modifiable. HTML allows each cell in a table to be assigned its own color. During the processing of validation output that populates each cell, cells are also assigned colors. Colors are stored in the code as hex values, so any tool that allows exploration of colors by their hex values can be used to test new color palettes. It may be worthwhile to develop alternate color palettes with higher or lower contrast or colors more readable by colorblind users. An appropriate palette could be selected by the user before starting a validation.
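A minimal sketch of this kind of per-cell coloring with a swappable palette is given below. It is written in Python for illustration and is not the MolProbity code: the function name, the severity keys, and in particular the hex values are placeholders invented for the example, not MolProbity's actual palette.

```python
from html import escape

# Placeholder hex values for illustration only; not MolProbity's palette.
DEFAULT_PALETTE = {
    "mild": "#ffdddd",
    "outlier": "#ff99aa",
    "severe": "#ee3333",
}


def render_cell(text, severity=None, palette=DEFAULT_PALETTE):
    """Return one HTML table cell, colored according to outlier severity.

    A severity of None (or one missing from the palette) leaves the cell
    with its default background.
    """
    color = palette.get(severity)
    style = f' style="background-color:{color}"' if color else ""
    return f"<td{style}>{escape(text)}</td>"


# An alternate palette can be passed in without changing the rendering code.
HIGH_CONTRAST = {"mild": "#ffeeaa", "outlier": "#ff8800", "severe": "#990000"}
example_cell = render_cell("Outlier", "outlier", palette=HIGH_CONTRAST)
```

Keeping the palette in a single mapping is one way to let a user choose among palettes before a validation run, as suggested above.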

7.4 CaBLAM in MolProbity

CaBLAM validation for protein backbone geometry has been integrated into MolProbity. The changes to MolProbity necessary to accomplish this integration (especially the introduction of a new coloring scheme for the residue table) provided an opportunity for the other changes to MolProbity discussed in this chapter.

CaBLAM’s validation methodology is discussed at length in Chapter 5. In short, CaBLAM uses a combination of Cα geometry and carbonyl oxygen positions to identify modeling errors in protein backbone. Cα geometry is then used to identify probable secondary structure elements that may be disguised by those modeling errors. CaBLAM also performs a check on the Cα geometry itself, since its other assessments are dependent on a reliable Cα trace. CaBLAM is designed for use at low resolution.

7.4.1 The MolProbity structure summary table

The Richardson Lab has been working to develop new methods specifically for validation of low-resolution structures, and CaBLAM is the first of these methods to come to fruition. As the harbinger of this initiative, CaBLAM’s validation summary is found in a new low-resolution validation section of the MolProbity structure summary table (Figure 7.8). This section contains two lines for CaBLAM feedback. The first line gives the count and percent for residues identified as CaBLAM Outliers. The second line gives the count and percent for residues identified as Cα geometry outliers. The count and percent of CaBLAM Disfavored conformations are excluded to reduce the size and complexity of the summary chart.

The MolProbity summary table uses “stoplight” coloration as visual shorthand for structure quality. The line for each metric in the table is colored green, yellow, or red based on a structure’s overall quality according to that metric.

Figure 7.8: New summary table for MolProbity, including peptide omegas and low-resolution validation for 4hum.pdb.

The CaBLAM Outliers line is colored green if the structure contains 1% or fewer outliers, yellow if the outlier population is between 1% and 5% of residues, and red if over 5% of residues are CaBLAM outliers. The 1% green-yellow transition corresponds to the statistically expected population of outliers, based on the contour level used to identify outliers. Setting the green-yellow transition at the statistically expected maximum for a validation is a practice in keeping with our treatment of other validations in the summary table, including Ramachandran and rotamer analysis. The yellow-red transition was set at 5% in part because that represented a significant increase from the green-yellow transition, and in part because 5% corresponds to the cutoff value for identifying Disfavored conformations in CaBLAM. If there are as many Outlier residues as there should be Disfavored residues, something has certainly gone wrong in the modeling process.

The Cα geometry outliers line is colored green if the structure contains 0.5% or fewer outliers, yellow if the outlier population is between 0.5% and 1%, and red if over 1% of residues are Cα geometry outliers. As in the CaBLAM Outliers case, the 0.5% green-yellow transition corresponds to the contour level used to identify outliers. The 1% yellow-red transition was obtained through a simple doubling of the green-yellow transition. Outliers in the 3D Cα geometry space are presumed to be severe modeling errors, so a relatively narrow yellow zone is appropriate.
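The stoplight rules for both summary lines reduce to one comparison with two thresholds. The sketch below uses hypothetical function names; the threshold values are those given in the text, with a population exactly at the upper threshold treated as yellow because red is described as "over" that value.

```python
def stoplight(percent, green_max, yellow_max):
    """Return "green", "yellow", or "red" for an outlier percentage."""
    if percent <= green_max:
        return "green"
    if percent <= yellow_max:
        return "yellow"
    return "red"


def cablam_outlier_light(percent):
    """CaBLAM Outliers line: green up to 1%, yellow up to 5%, else red."""
    return stoplight(percent, green_max=1.0, yellow_max=5.0)


def ca_geometry_light(percent):
    """Cα geometry outliers line: green up to 0.5%, yellow up to 1%, else red."""
    return stoplight(percent, green_max=0.5, yellow_max=1.0)
```

For example, a structure with 3% CaBLAM Outliers and 1.5% Cα geometry outliers would show a yellow CaBLAM line and a red Cα geometry line.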

7.4.2 The MolProbity multicriterion table

MolProbity produces a multicriterion chart with a row for each residue and a column for each validation criterion. A single new column has been added to this chart for CaBLAM validation. The CaBLAM column reports four distinct categories of behavior, which necessitated the development of a new coloring scheme, as discussed in Sections 7.3 generally and 7.3.4 specifically.

The four categories reported by the CaBLAM column are: CaBLAM Favored, CaBLAM Disfavored, CaBLAM Outlier, and Cα Geometry Outlier. Cells belonging to residues for which CaBLAM parameters could not be calculated (e.g. residues near chain termini) remain empty. CaBLAM Outlier status takes priority over CaBLAM Disfavored status, just as Ramachandran Outlier takes priority over Ramachandran Allowed. Cα Geometry Outlier status takes the highest priority. If a residue is both a CaBLAM Outlier and a Cα Geometry Outlier, its cell will be populated with the Cα Geometry Outlier information, because Cα Geometry modeling errors are considered more severe.

As visual shorthand, cells for Cα Geometry Outliers are colored bright red, cells for CaBLAM Outliers are colored standard outlier pink, and cells for CaBLAM Disfavored are colored light pink (Figure 7.6).
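The priority order and coloring for a CaBLAM cell can be sketched as a single function. The function name and return shape are hypothetical. Contour values are expressed as fractions, and the 0.5%, 1%, and 5% cutoffs are the contour levels mentioned in Section 7.4.1; whether a value exactly at a cutoff counts as an outlier is an assumption.

```python
def cablam_cell(cablam_contour, ca_contour):
    """Return (status, color name) for one residue, or None for an empty cell.

    cablam_contour: contour level in the 3D CaBLAM space, as a fraction.
    ca_contour: contour level in the 3D Cα geometry space, as a fraction.
    Either value may be None when it could not be calculated.
    """
    if cablam_contour is None or ca_contour is None:
        return None  # e.g. residues near chain termini
    if ca_contour < 0.005:
        return ("Cα Geometry Outlier", "bright red")   # highest priority
    if cablam_contour < 0.01:
        return ("CaBLAM Outlier", "pink")
    if cablam_contour < 0.05:
        return ("CaBLAM Disfavored", "light pink")
    return ("CaBLAM Favored", None)
```

Checking the Cα geometry contour first is what gives that status priority when a residue qualifies as both kinds of outlier.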

Each cell contains text feedback on the residue’s location in CaBLAM space. The first line of text contains the residue’s validation status (Favored, Outlier, Disfavored, or Cα Geometry Outlier), followed by a number showing the contour level at which it falls in the appropriate CaBLAM space. For Favored, Outlier, and Disfavored residues, this number refers to the 3D CaBLAM space (μin, μout, ν). For Cα Geometry Outlier residues, the number refers to the 3D Cα geometry space (μin, μout, Cα virtual angle) used to identify those outliers. The presentation of this contour information follows the pattern established by the presentation of Ramachandran analysis.

Each cell can contain a second line in smaller text. This line, if present, is a recommendation of probable secondary structure based on CaBLAM’s assessment. This line takes the form of a suggestion (i.e. “try alpha helix”, “try beta sheet”, or “try three ten”) in part to make clear that it is a suggestion not a certainty and in part to encourage user intervention in the structure through use of the imperative “try”. The secondary structure recommendation appears for every assessable residue identified as probable secondary structure, not just for outlier residues. When the residue table is sorted by sequence, continuous secondary structure elements can therefore be identified in the text feedback.

The CaBLAM column is sortable, and sorts on each residue’s contour value. No special distinction is made between CaBLAM Outliers and Cα Geometry Outliers for sorting. CaBLAM Outliers with sufficiently low contour values deserve as much attention as Cα Geometry Outliers.
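The priority and sorting rules for the CaBLAM column can be summarized in a short Python sketch. This is an illustration only, not the MolProbity implementation: the function names are hypothetical, and the three cutoff values are placeholders chosen for the example rather than values stated in this section.

```python
# Placeholder cutoffs (assumptions for illustration; not taken from this section).
CA_GEOM_OUTLIER_CUTOFF = 0.005
CABLAM_OUTLIER_CUTOFF = 0.01
CABLAM_DISFAVORED_CUTOFF = 0.05


def cell_status(cablam_contour, ca_geom_contour):
    """Return (status, contour) for one residue, or None for an empty cell."""
    if cablam_contour is None or ca_geom_contour is None:
        # Parameters could not be calculated, e.g. near chain termini.
        return None
    # Cα Geometry Outlier takes the highest priority and reports its own contour.
    if ca_geom_contour < CA_GEOM_OUTLIER_CUTOFF:
        return ("Cα Geometry Outlier", ca_geom_contour)
    # CaBLAM Outlier takes priority over CaBLAM Disfavored.
    if cablam_contour < CABLAM_OUTLIER_CUTOFF:
        return ("CaBLAM Outlier", cablam_contour)
    if cablam_contour < CABLAM_DISFAVORED_CUTOFF:
        return ("CaBLAM Disfavored", cablam_contour)
    return ("Favored", cablam_contour)


def sort_cells(cells):
    """Sort populated cells by contour value alone, lowest first; empty cells last."""
    populated = [cell for cell in cells if cell is not None]
    empty = [cell for cell in cells if cell is None]
    return sorted(populated, key=lambda cell: cell[1]) + empty
```

Because the sort key is the contour value alone, a CaBLAM Outlier with a very low contour value sorts alongside the Cα Geometry Outliers, which matches the behavior described above.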

7.5 “Omegalyze” cis-peptide validation

Cis-peptide validation is particularly important at low resolution, where the experimental data may not be sufficient to constrain protein backbone geometry. However, erroneous cis-peptides appear in a surprising number of high-resolution structures as well. This section describes the motivation for and development of a new method for validation of cis-peptides in protein structures.

7.5.1 Cis-peptide geometry and occurrence

The peptide bond joins adjacent amino acids. This bond has partial double bond character as part of the conjugated system that includes the double bond between the carbonyl carbon and its oxygen. As a result, the dihedral ω that describes the peptide bond (defined by atom positions Cα(i-1), C(i-1), N(i), and Cα(i)) is roughly planar. Most peptide bonds have a planar trans configuration with the protein backbone continuing on opposite sides of the pseudo double bond, but a few are planar cis with the incoming and outgoing protein backbone placed on the same side of the pseudo double bond.
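As an illustration of the ω definition above, the following is a minimal Python sketch of a signed dihedral calculation from four atom positions. The function name and the idea of passing plain (x, y, z) tuples are assumptions made for this example; the sketch is not taken from the Omegalyze source.

```python
import math


def dihedral(p1, p2, p3, p4):
    """Signed dihedral angle in degrees for four (x, y, z) points.

    For omega, the points would be CA(i-1), C(i-1), N(i), CA(i).
    """
    def sub(a, b):
        return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    b1, b2, b3 = sub(p2, p1), sub(p3, p2), sub(p4, p3)
    n1, n2 = cross(b1, b2), cross(b2, b3)  # normals of the two planes
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

With this convention, a planar cis arrangement gives a value near 0° and a planar trans arrangement gives a value near ±180°.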

Figure 7.9: A cis-proline.

The peptide bond properly falls between two amino acid residues. However, when referring to cis-residues, we associate the peptide bond with the immediately following residue. This is because of the unique relationship of proline residues with cis-peptides. About 5% of prolines follow cis-peptides. No other residue type has such a strong relationship with cis-peptides. Indeed, all non-proline residues together follow a cis-peptide only about 0.05% of the time. Associating peptide dihedrals with their immediately following residue allows us to speak of cis-proline (Figure 7.9) as a phenomenon of special importance. This convention is not observed consistently in the field, unfortunately, so the reader is advised to be cautious in discussions of cis-peptides.

7.5.2 A call for cis-peptide validation

Late in 2014, we were contacted by Tristan Croll from Queensland University of Technology. He had found an increasing proliferation of protein structures deposited into the PDB with suspiciously high numbers of non-proline cis-peptides (Croll, 2015).

Cis-prolines are well recognized and relatively prevalent in protein structures, constituting about 5% of prolines. Non-proline residues assume a cis configuration far more rarely, in only about 0.05% of cases. The structures Tristan found often contained 1.5% or more non-proline cis-peptides, orders of magnitude more than expected. These suspicious structures occurred at all resolutions, but most often at resolutions >2.5Å.

Broadly speaking, at that time there was no easily accessible or widely-used systematic validation method to identify cis-peptides. If a structure modification method (such as Coot) introduced a cis-peptide where one was not correct, and the error was not caught immediately, it might never be found again before the structure was deposited into the public record. Coot’s record-keeping for cis-peptides, the PDB’s report of cis-peptides in its structure headers, and ProCheck’s cis-peptide flags all existed, but required more than the typical crystallographer’s attention to cis-peptides to access.

Tristan, amusingly, horrifyingly, and probably accurately, laid some of the blame for the proliferation of cis-peptides at our Richardson Lab feet. MolProbity is a staple of the protein structure community, but it did not validate cis-peptides. If erroneous cis-peptides did not cause any other structural problems, they could pass through MolProbity validation unnoticed, leaving users with the impression that all was well in their structure.

I therefore set out to build a cis-peptide validation tool for MolProbity and Phenix.

7.5.3 Twisted peptides

The variety of possible modeling errors is sometimes staggering. During cis-peptide validation development, it quickly became evident that there existed peptide bonds modeled far from either planar configuration that could not be accurately classified as cis or trans (Figure 7.10). We decided that peptide bonds modeled with ω dihedrals more than 30° away from a planar configuration would fall into a separate category. I called these non-planar peptides “twisted peptides”. 30° from planar was chosen as the cutoff because 0±30° is an industry standard used by the PDB for defining cis-peptides. Defining a twisted peptides category also avoided the problem of arbitrarily splitting dihedral space into cis and trans at ±90°, which might otherwise have been an unattractive but necessary evil. It should be noted, though, that like all outliers from expected behavior, twisted peptides can be real and justified by data and biology (Berkholz, 2012).
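The three-way classification described above can be sketched in a few lines of Python. This is a minimal illustration, not the Omegalyze code: the function name is hypothetical, and the treatment of values falling exactly on the 30° boundary is an assumption, since the text does not specify it.

```python
def classify_omega(omega, cutoff=30.0):
    """Classify a peptide omega dihedral (degrees) as Cis, Trans, or Twisted."""
    # Wrap the angle into the range [-180, 180).
    wrapped = (omega + 180.0) % 360.0 - 180.0
    deviation_from_cis = abs(wrapped)             # distance from 0 degrees
    deviation_from_trans = 180.0 - abs(wrapped)   # distance from +/-180 degrees
    if deviation_from_cis <= cutoff:      # boundary handling is an assumption
        return "Cis"
    if deviation_from_trans <= cutoff:    # boundary handling is an assumption
        return "Trans"
    return "Twisted"
```

Under this scheme no split at ±90° is needed: anything more than 30° from both planar configurations falls into the twisted category.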

Figure 7.10: Schematic representation of dihedral space revealing "twisted" regions not near either planar conformation.

One of the responses to Tristan’s work provides an amusing piece of evidence in support of validating twisted peptides as a unique category. The depositors of 4q8j.pdb, a structure with a particular wealth of cis and twisted non-proline peptides, revised their structure and redeposited it (as 4xr7.pdb), obsoleting the old version. In the new version, all the non-proline cis-peptides and all the twisted peptides on the cis side of ±90° in dihedral space had been corrected into trans configurations. However, all of the twisted peptides on the trans side of ±90° remained in the structure, apparently untouched. Without this new MolProbity validation, there was no system to identify those twisted (and almost certainly erroneous) peptides.

7.5.4 Omegalyze

I built a peptide dihedral validation program named “Omegalyze”. The name is short for “omega analyze” and parallels the names of our existing validation programs Ramalyze and Rotalyze, with omega (ω) being the Greek letter designation of the peptide dihedral. Omegalyze is modeled heavily on the existing Ramalyze code, which performs similar analysis of the protein backbone.

Omegalyze is available as part of the CCTBX project, specifically in mmtbx.validation alongside Ramalyze and Rotalyze. It is therefore available to any system that can make use of the open CCTBX project, including Phenix and the (newish) CCTBX-based MolProbity.

7.5.4.1 Text output

Omegalyze can be run on the commandline with the command:

phenix.omegalyze filename.pdb

The default output of omegalyze is a text summary of the non-trans residues in a structure. The first part of the summary is a list of every non-trans residue. The following sample output is from 4q8j.pdb:

residue:type:omega:conformation
A 661 ASP:General:34.12:Twisted
A 663 TYR:General:-144.62:Twisted
A 716 CYS:General:-21.55:Cis
A 980 PRO:Pro:-12.87:Cis
A 984 ASP:General:-22.52:Cis
A 987 LYS:General:11.31:Cis
A 988 SER:General:15.51:Cis
A 996 ASN:General:-15.85:Cis
A1034 ARG:General:9.79:Cis
A1059 MET:General:136.20:Twisted
A1063 GLU:General:-25.61:Cis
A1089 ALA:General:-22.93:Cis
A1097 SER:General:2.50:Cis

The output is organized in four colon-separated columns: 1) a unique residue identifier standard to programs that interface with MolProbity, 2) the category of the residue, either proline or general case, 3) the peptide ω dihedral calculated for that residue, 4) the residue’s conformation type, either cis, twisted, or trans. By default, trans residues are not printed, but can be displayed with the flag nontrans_only=False if desired.
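Because the residue lines are colon-separated, they are simple to read back into a script. The following Python sketch parses one such line; the function name and the returned dictionary keys are assumptions made for this example, and the exact spacing inside the residue identifier is passed through unchanged rather than interpreted.

```python
def parse_omegalyze_line(line):
    """Split one residue line of omegalyze text output into its four columns."""
    # Split from the right so the residue identifier is kept whole.
    residue, category, omega, conformation = line.strip().rsplit(":", 3)
    return {
        "residue": residue,            # e.g. "A 661 ASP"
        "category": category,          # "Pro" or "General"
        "omega": float(omega),         # peptide dihedral in degrees
        "conformation": conformation,  # "Cis", "Twisted", or "Trans"
    }
```

The header line (residue:type:omega:conformation) and the SUMMARY lines would need to be skipped before calling this function, since their third field is not a number.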

The residue-by-residue output is followed by a whole-structure summary in four lines:

SUMMARY: 21 cis prolines out of 252 PRO
SUMMARY: 1 twisted prolines out of 252 PRO
SUMMARY: 81 other cis residues out of 5895 nonPRO
SUMMARY: 19 other twisted residues out of 5895 nonPRO

This summary provides counts of each of the four non-trans categories and total counts for the appropriate residue types. Percents were deemed to be potentially misleading (81 non-proline cis-peptides looks more impressive than 1.4%, even though a percent occurrence that high is a red flag to those in the know), so only counts were displayed. Even a single non-proline cis-peptide in a large structure is worth investigating, and the format helps emphasize this. Percents can be easily calculated from the provided values if necessary.
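For readers who do want percentages, the calculation from a summary line is a one-step division. The sketch below is an illustration under the assumption that summary lines follow the pattern shown in the sample output above; the function name and the regular expression are hypothetical, not part of Omegalyze.

```python
import re

# Assumed pattern: "SUMMARY: <count> <label> out of <total> <residue class>"
SUMMARY_PATTERN = re.compile(r"SUMMARY:\s+(\d+)\s+(.+?)\s+out of\s+(\d+)\s+(\S+)")


def summary_percent(line):
    """Return (label, count, total, percent) for one SUMMARY line."""
    match = SUMMARY_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("not a recognized SUMMARY line: %r" % line)
    count = int(match.group(1))
    total = int(match.group(3))
    percent = 100.0 * count / total if total else 0.0
    return match.group(2), count, total, percent
```

Applied to the 4q8j.pdb sample, 81 out of 5895 non-proline residues works out to roughly 1.4%, the figure quoted above.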

7.5.4.2 Kinemage markup

Omegalyze provides visual feedback on the location and type of non-trans peptides in the form of kinemage markup. This markup is available through omegalyze itself with the commandline phenix.omegalyze kinemage=True filename.pdb, or as part of the comprehensive multicriterion kinemages produced by phenix.kinemage or the MolProbity website.

The visual annotation marks two distinct categories: cis-peptides and twisted peptides. No special visual distinction is made between proline residues and non-prolines because prolines are simple to identify from the shape of their sidechains in the kinemage.

Figure 7.11: Kinemage annotation for cis-peptides.

Cis-peptides are marked in sea green with a two-member trianglelist (Figure 7.11). The two triangles form a trapezoid that fills the space between the Cα trace and the mainchain trace of the cis-peptide. A further vectorlist line is drawn along the Cα trace in the same color. This line ensures visibility from any distance and orientation, although the trapezoid may disappear in some views.

Twisted peptides are marked in lime green with a similar two-member trianglelist (Figure 7.12). The triangles fill the space between the Cα trace and the mainchain trace of the twisted peptide. However, the twist in the peptide dihedral prevents a simple trapezoid from being formed. The angle between the two triangles visually represents the non-planarity of the peptide bond in the twisted case. An additional vectorlist line is drawn along the shared edge of the triangles to aid in visibility from a distance and to further emphasize nonplanarity by highlighting the intersection of the triangles.

Figure 7.12: Kinemage annotation for twisted peptides.

These visualizations were inspired by a trick I have used to identify cis-peptides at a glance in kinemages. If both the Cα trace and the full mainchain trace are turned on in KiNG, then an open trapezoid shape appears between the traces, which is easy to identify even at a distance. The kinemage markup fills this distinctive shape with color.

Greens were chosen for the markup because green is already associated with annotation for backbone geometry outliers in the form of Ramachandran outlier markup. Both of the greens used in non-trans peptide annotation are distinct from the green used in Ramachandran markup, however. A gentler sea green is used for cis-peptides, since these may be real conformations, especially in the cis-proline case. A more visually striking lime green is used for twisted peptides, since these are almost certainly modeling errors.

7.5.4.3 MolProbity feedback

Non-trans peptide validation is available as a standard validation in MolProbity. This validation is useful at all resolution ranges, and so is active by default for all protein structures. MolProbity feedback for non-trans peptides takes three forms: kinemage annotation, the whole-structure summary table, and the residue-by-residue chart.

Kinemage annotation for non-trans peptides through MolProbity is straightforward. The annotation described in section 7.5.4.2 is included as a part of the multicriterion kinemage generated by MolProbity.

The structure summary table contains a new section entitled “Peptide Omegas”. This new section contains one, two, or three lines for peptide dihedral validation. The first line is a count and percentage of cis-prolines. This line always appears, even when cis-prolines are absent from the structure, since cis-prolines are a significant and sometimes expected structural feature. The second line is a count and percentage of cis non-proline residues. The third line is a count and percentage of twisted residues. The second and third lines only appear if the structure contains at least one residue in their category. Otherwise, they are skipped to reduce visual clutter in the summary table.

The residue chart contains a new column for peptide dihedrals. Cells in this column are populated for each residue that is associated with a non-trans peptide dihedral. Each cell contains a text description of the type of non-trans peptide: Cis PRO, Twisted PRO, Cis nonPRO, or Twisted nonPRO. Each cell also contains the numerical value of that peptide’s ω dihedral. Twisted peptides and non-proline cis-peptides are colored as outliers in standard pink and can be sorted to the top of the column. Cis-prolines are much less likely to be outliers, and so are not colored. Cis-prolines sort into a position below the other non-trans peptides so that they can be easily identified and analyzed by users.

7.6 Future validations

The Richardson Lab continues to develop new validations to better assess macromolecular structures. This section describes progress towards additional validations based on CaBLAM and the challenges still facing their development.

7.6.1 Cis versus trans proline validation with CaBLAM

At the time of this writing, Omegalyze (see section 7.5) identifies cis-peptides and provides a whole-structure level validation of cis-peptide occurrence. Omegalyze does not currently distinguish between correct and incorrect cis-peptides. Instead, it relies on user expertise to make sense of the cis-peptides it calls attention to. A natural next step for Omegalyze would be a residue-level validation that could distinguish between correct and incorrect cis-peptides.

One promising lead towards individual cis-peptide validation comes from CaBLAM. Cis-proline and trans-proline are known to have somewhat different distributions in Ramachandran space. The difference in their distributions in CaBLAM space, however, is far more striking (Figure 7.13). While there is some overlap between the distributions for cis- and trans-proline in CaBLAM space, enough regions are unambiguously cis or unambiguously trans that a validation method based on CaBLAM space contours may be possible for proline.

Figure 7.13: Comparison of distribution of trans-proline (blue) and cis-proline (orange) in 2D CaBLAM space. (x-axis is µin, y-axis is µout.)

The main question that must be answered first is one of modeling psychology. When proline peptide bonds are modeled incorrectly, is the error one that CaBLAM is sensitive to? When a trans-proline is mismodeled as cis, does the surrounding Cα trace (on which CaBLAM depends) still resemble the trace that would surround a proper trans proline? Or does the erroneous cis-peptide distort the local trace sufficiently to hide the error from CaBLAM? Until recently, we have not had a sufficient collection of residues that we could confidently say were mismodeled in this way to answer these questions. Our recent contact with Tristan Croll (Croll, 2015) has provided us with a possible list of residues with which to test a cis-proline residue-level validation method.

Unfortunately, a residue-level validation for non-proline cis-peptides will require a different method. Real non-proline cis-peptides are too rare to allow for the construction of reliable contours in CaBLAM space. A library-based method using the real examples as reference may be possible. MolProbity has not used electron density for validations in the past, but validation via real-space correlation coefficients would enable a variety of new validation methods, possibly including cis-peptide validation. Further study of cis-peptides will be necessary.

7.6.2 Motif validation with CaBLAM

As discussed in Chapter 6, CaBLAM has a powerful ability to identify structural motifs from their sequence and hydrogen bonding. CaBLAM’s identification of regular secondary structure via hydrogen bonding in high-resolution structures powers its identification of probable secondary structure via Cα geometry in low-resolution structures. Can CaBLAM’s identification of non-repeating motifs in high-resolution structures be used to power a similar method for validation of motifs in low-resolution structures?

Even closely related motifs such as the widened helix turns discussed in Section 6.4 can be distinguished in CaBLAM space at high resolution. The question of whether non-repeating motifs are distinguishable at low resolution – distinguishable from the surrounding structure and distinguishable from other related motifs – remains unanswered at this time. Given CaBLAM’s success in using the Cα trace to find regular structure even in very poor models such as 2o01.pdb, I am optimistic that the Cα trace will prove to contain enough reliable information to identify some non-repeating motifs.

One of the major challenges facing motif validation with CaBLAM is the high dimensionality of multi-residue motifs. Because the μout of one residue is also the μin of the next, CaBLAM can represent a motif with approximately the same number of μ parameters as the motif has residues. For motifs of more than three residues, a higher-dimensional space is necessary than can be easily visualized. A contour-based approach to motif identification would be difficult, especially for rare motifs that cannot adequately populate a high-dimensional space.
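The parameter count follows directly from the shared dihedrals. As a small Python sketch (the function name and the sample values in the usage note are hypothetical), per-residue (μin, μout) pairs for consecutive residues collapse into a single vector with one more entry than the motif has residues:

```python
def motif_mu_vector(residue_mus):
    """Collapse consecutive (mu_in, mu_out) pairs into one shared vector.

    Assumes the residues are sequential, so that each residue's mu_out is
    the same dihedral as the next residue's mu_in.
    """
    if not residue_mus:
        return []
    vector = [residue_mus[0][0]]  # mu_in of the first residue
    for _mu_in, mu_out in residue_mus:
        vector.append(mu_out)     # each mu_out is shared with the next residue
    return vector
```

For example, a three-residue motif with pairs (50, 60), (60, -170), (-170, 40) is described by the four values 50, 60, -170, 40, so each additional residue adds one dimension to the space that contours would have to populate.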

As with non-proline cis-peptides, a library-based approach is a likely solution. Well-modeled examples of motifs of interest can be collected using CaBLAM’s motif search functionality. When an appropriate approximate geometry is detected in a structure, the library can be accessed and the sample motifs within compared to the structure. Sufficient geometric deviation from the sample motifs would flag the regions for user inspection, and the user would be provided with suggested replacement motifs from the motif library.

7.7 Discussion

Low-resolution validation is a tricky balancing act. Modeling errors are abundant, but too much feedback is overwhelming. Intervention is necessary to improve the structure, but identifying correct solutions for compound errors is difficult. Human expertise is required where automation fails, but many of those faced with low-resolution structures are effectively first-time crystallographers.

One solution would be the introduction of truly low-resolution validations. Even CaBLAM, which takes a deliberately blurry view of the protein mainchain, is ultimately an atom-by-atom validation of protein conformation. At very low resolutions, starting around 3.5Å and lower, there is no hope of locating individual atoms in the electron density. Instead, secondary structure elements as a whole and the fold of the protein are the most distinguishable features. So a very low-resolution validation might be concerned with those larger features, rather than individual atomic placements.

A very low-resolution validation method would likely answer questions about the protein-likeness of entire secondary structure elements. Is the bend of a helix realistic? Does it turn with the correct tightness? Does the twist of a sheet conform to known behaviors? Are the distances between strand pairs appropriate to allow hydrogen bonding? A very low-resolution validation could also validate the fold of a protein, alerting users to discrepancies between that protein and known folds. Those discrepancies might be real and justified, just as some Ramachandran or rotamer outliers are, which would be all the more reason to draw user attention to them.

The real hope for structure validation in CaBLAM’s target resolution range of 2.5Å-4.0Å is integration into refinement and rebuilding methods. Refinement is not intimidated by an abundance of modeling errors (though it can be overwhelmed without access to appropriate tools). The iterative nature of refinement will allow it to unravel compound errors and grow well-modeled areas to include more of the structure. And the refinement process automates the collected expertise of biophysicists like nothing else. The Richardson Lab’s collaboration with the Phenix project has started us down a road towards greater partnership with refinement and rebuilding methods. Hopefully the next stage of our journey will yield a fuller integration.

8. Conclusion

The unifying theme of the work presented here is the surprising power of Cα geometry parameters to describe and predict complex protein conformations and motifs. I have built a variety of structural biology tools, collected under the name CaBLAM (C-alpha Based Low-resolution Annotation Method) to harness the power of Cα geometry for validation and exploration of protein structures. These tools are of particular use in validating structures in the resolution range of 2.5-3.5Å. In this resolution range, experimental data derived from x-ray crystallography experiments is insufficient to resolve atomic-level details in proteins. I refer to “low-resolution” structures as starting at about 2.5Å resolution because I have found empirically that the content of the experimental data changes around that resolution. Lower resolutions are certainly possible through methods such as cryo-EM, despite my nomenclature.

The power of Cα geometry is due in great part to the relative reliability of Cα trace modeling in low-resolution structures. Whether due to the nature of low-resolution experimental data or due to the fitting tools used in model building, Cα positions are mostly well-modeled in low-resolution structures, even in regions where other atoms are catastrophically misplaced.

CaBLAM is a reaction to and an exploitation of this particular state of the crystallographic art, wherein Cα positions are well-modeled at low resolution but the full mainchain trace is not. Will CaBLAM retain its power and relevancy as crystallographic techniques evolve? The pattern of relatively reliable Cα traces seems to hold throughout crystallography’s past, so the fundamental properties of low-resolution experimental data and our interpretation of it have not greatly changed. However, low-resolution crystallography has attracted greater and greater attention in recent years. New methods are allowing the collection of data from lower-quality crystals and increasing computer power is allowing the solution of previously intractable structures. A revolution in data processing may well be coming that will replace the need for CaBLAM with accurate low-resolution model building. For the immediate future, however, CaBLAM will remain a uniquely powerful tool for validating and improving difficult structures.

The rise of high-resolution cryo-EM will pose an interesting new challenge to CaBLAM and its successor validations. Highly-successful cryo-EM experiments can now yield experimental data in the 2.5-3.5Å resolution range targeted by CaBLAM validation. CaBLAM is successful in part because it exploits a feature (Cα trace reliability) of how x-ray crystallography structures are solved and addresses a common error in how low-resolution structures are modeled (incorrect peptide plane orientations). If cryo-EM structures are solved by a sufficiently different method with sufficiently different assumptions, then those structures may have different properties to exploit and different modeling errors to address. However, to the extent that crystallography programs such as Phenix take over the model-building and refinement stages of structure solution for high-resolution cryo-EM experiments, CaBLAM will certainly remain relevant for those structures.

The validation philosophy practiced by CaBLAM is somewhat more intrusive than our typical validation tools. As Vincent Chen noted in the conclusion of his thesis, “Instead of making decisions for users, my tools make it easier for users to make educated judgments about structures.” Our guiding philosophy has been to direct user attention rather than to assert “correct” answers to modeling errors. Exceptions to this philosophy have generally involved clear binary choices between alternatives, as in the case of Asn/Gln/His flips or RNA pucker outliers. CaBLAM goes a step further than most of our validations and recommends a specific secondary structure conformation (a non-binary choice) for outlier residues where possible. Of course, acting on that recommendation is still left to the user for now.

CaBLAM makes these secondary structure recommendations partly due to the low information content of low-resolution data and partly due to a change in the demographics of modern crystallographers. Low-resolution electron density is difficult to interpret, and the correct conformation may not be clear from the data alone. Indeed, in locations where a modeling error has occurred, the density is likely to be actively misleading. CaBLAM supplements the low information content of the electron density by looking at protein conformation over a wide area to make a secondary structure recommendation. The challenging nature of low-resolution structure modeling necessitates a more interventionist approach.

The other reason that more interventionist validation is needed at low resolution is a change in the demographics of crystallographers. In a sense, automated methods have become too successful. If the experimental data is high-resolution, automated methods like Phenix can take a structure from experimental data to solved model rapidly, accurately, and – significantly – with very little user input. In addition, crystallography is not usually the focus of a graduate career. Instead, a graduate student may solve a handful of structures to supplement other biochemical research. Crystallography has become another tool for many researchers, rather than a discipline. As a result, when a difficult structure is finally encountered, where the automated methods struggle to produce an accurate model, the crystallographers involved may have no real experience dealing directly with any structures, let alone difficult ones. Greater outside assistance is needed to supplement the lesser experience of new crystallographers.

Integration of CaBLAM validation into Phenix refinement remains a major goal in providing a more interventionist validation. Three main possibilities for correcting structures present themselves. First, discrete moves to correct misplaced peptide planes. Second, hydrogen bonding restraints applied to secondary structure elements identified by CaBLAM. Third, wholesale replacement of mismodeled secondary structure elements with idealized versions. Progress on possible methods for correcting low-resolution structures within a refinement loop has been difficult for me because the Richardson Lab deals with refinement relatively little. Integration of CaBLAM into Phenix refinement will likely have to wait for a member of the lab to join one of the Phenix labs more involved in refinement.

Motif identification is a major area for future development. CaBLAM’s motif identification tools have developed in parallel to its validation tools, but there has been relatively little interaction between them so far. I remain hopeful that the Cα trace retains enough information at low resolution to distinguish among non-repeating motifs. In particular, identification of the motifs I once called “secondary structure interrupts” will be particularly important once CaBLAM begins to inform Phenix refinement. These motifs represent real and significant structural phenomena that should not be lost to idealization, even idealization in the name of a more protein-like structure.

I hope to revisit CaBLAM’s fingerprint-based motif identification tool. I would like to rebuild it as a tool for the structural community at large (and for the Richardson Lab especially), separate from the rest of the CaBLAM system. In particular, it is in need of a real graphical user interface for building fingerprint definitions, since the current version forces the user to write idiosyncratic Python code to define each motif. To anyone who wishes to take up this task or to build a similar system of their own: the ability to move through the structure along its bonds, both covalent bonds between sequential residues and hydrogen bonds between distant residues, is essential to building a program that can understand motifs.

The work presented here was conducted with an eye towards the future of macromolecular structure validation. CaBLAM is a vanguard of a new paradigm of resolution-dependent validation in MolProbity. As we build more validation techniques tuned to the particular needs of different resolution ranges, MolProbity will be able to automatically activate or deactivate the validations based on the resolution or quality of the structure under consideration. Changing the validations presented to match the needs of the structure will help reduce the feeling of “validation overload” sometimes caused by assessing low-resolution structures with our high-resolution tools. Of course, some validations that were meant for low-resolution structures, like the omegalyze cis-peptide validation, may prove to be appropriate to all resolutions. Managing the presentation of an increasingly diverse array of validations will be an important challenge for future MolProbity developers.

Every resolution range and every modeling technique presents unique challenges and holds unique properties to be exploited. In CaBLAM’s case, low-resolution x-ray crystallography structures often have incorrectly oriented peptide planes, but generally have reliable Cα geometry. CaBLAM exploits Cα geometry to identify errors in peptide plane orientation and serves as a uniquely powerful validation tool as a result. I encourage future validators to think explicitly about the properties of the systems they wish to assess and to understand as clearly as they can the data and practices underlying modeling errors. Such thinking will ensure the continuing relevancy of validation techniques to the challenges presented by whatever new structural methods the state of the art can produce.

A. cablam_training commandline options

This appendix contains information on using the commandline options in cablam_training.py. The material presented here is adapted from online Phenix documentation, and in the event of any question or discrepancy, that documentation should be referenced, as it may have received updates.

cablam_training is primarily a set of developer tools for exploring protein structure conformation. It provides tools for assessing protein geometry, with both traditional measures and with the idiosyncratic CaBLAM measures. It contains tools for identifying known and novel motifs based on hydrogen bonding, amino acid sequence, and other properties. And it contains tools for extracting motif and structure information into forms usable by CaBLAM validation for assessing low-resolution protein structures.

For information on cablam_validate, the main user-end interface for CaBLAM, please refer to the online documentation for CaBLAM Validation.

Running "phenix.cablam_training help=True" on the commandline will provide fairly complete documentation. This document is intended to expand on and supplement that documentation, not replace it.

A.1 How cablam_training works

For each residue in a submitted structure file, cablam_training calculates several geometry measures selected by the user. It will print these measures for various user-defined selections. These selections may be residues of a certain type or may be residues that are members of a structural motif of interest.

A.2 Available measures

A.2.1 Standard measures

Many standard measures are available through cablam_training. These include:
- rama=True for Ramachandran dihedrals
- exrama=True for expanded Ramachandran dihedrals, that is psi-1 and phi+1 in addition to phi and psi
- tau=True for the N-CA-C backbone angle
- omega=True for the peptide bond dihedral (used to define if a residue is cis or trans)

A.2.2 CaBLAM measures

The CaBLAM system uses some idiosyncratic measures of protein geometry.

- Two CA-defined pseudo dihedrals. For residue i, CA_d_in (aka mu in) is defined by the CA positions of i-2,i-1,i,i+1 and CA_d_out (aka mu out) is defined by the CA positions of i-1,i,i+1,i+2.
- One dihedral that relates peptide plane orientation across the residue. For residue i, CO_d_in (aka nu) is defined by O(i-1), a virtual atom in i-1, a virtual atom in i, and O(i). Each virtual atom is placed where the perpendicular dropped from O meets the CA-CA line through the residue.
- The CA virtual angle. For residue i, CA_a is defined by the CA positions of i-1,i,i+1.

cablam=True will return these measures and is the recommended output selection for cablam_training.
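The definitions above can be sketched in a few lines of standalone Python. This is an illustration only, not the cablam_training implementation. The function names (angle, dihedral, perpendicular_foot, cablam_measures) and the list-based inputs are assumptions made for the example. So is the reading that the virtual atom for O(i-1) lies on the CA(i-1)-CA(i) line and the virtual atom for O(i) lies on the CA(i)-CA(i+1) line.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def angle(p1, p2, p3):
    """Angle in degrees at p2, between the directions to p1 and p3."""
    v1, v2 = _sub(p1, p2), _sub(p3, p2)
    cos_t = _dot(v1, v2) / math.sqrt(_dot(v1, v1) * _dot(v2, v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

def dihedral(p1, p2, p3, p4):
    """Signed dihedral in degrees defined by four points."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = _dot(_cross(n1, n2), b2) / math.sqrt(_dot(b2, b2))
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def perpendicular_foot(a1, a2, b):
    """Point on the line a1-a2 where the perpendicular from b meets it."""
    d = _sub(a2, a1)
    t = _dot(_sub(b, a1), d) / _dot(d, d)
    return tuple(p + t * q for p, q in zip(a1, d))

def cablam_measures(ca, o, i):
    """CaBLAM-style measures for residue i.

    ca and o are lists of (x, y, z) tuples for CA and carbonyl O atoms,
    indexed by residue (hypothetical input layout). The caller must pass
    an index i with two residues available on either side.
    """
    # Assumed placement of the two virtual atoms (see lead-in).
    x_prev = perpendicular_foot(ca[i - 1], ca[i], o[i - 1])
    x_this = perpendicular_foot(ca[i], ca[i + 1], o[i])
    return {
        "CA_d_in": dihedral(ca[i - 2], ca[i - 1], ca[i], ca[i + 1]),
        "CA_d_out": dihedral(ca[i - 1], ca[i], ca[i + 1], ca[i + 2]),
        "CO_d_in": dihedral(o[i - 1], x_prev, x_this, o[i]),
        "CA_a": angle(ca[i - 1], ca[i], ca[i + 1]),
    }
```

Every measure for residue i draws on positions from i-2 through i+2, which is why the quality controls described below affect the neighbors of any pruned residue.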

A.2.3 Artifact measures

Some other measures are available as artifacts of the initial training and measure selection process. CO_d_out is equivalent to CO_d_in, but is calculated using the current and succeeding residues. CA_a_in and CA_a_out are the virtual angles for the preceding and succeeding residues. None of these measures are recommended as more than curiosities.

A.3 Quality control

Quality filtering your data is extremely important. cablam_training provides a few tools for this at the residue level. Note that because the CaBLAM measures for each residue depend on many of that residue's neighbors, pruning one residue with these controls will also remove reporting for other residues in its proximity. This is judged worthwhile in the interests of quality, but users should be aware of it.

b_max=#.# Sets a B-factor cutoff. Residues with any mainchain atom (N, CA, C, O) with a B higher than this number will be excluded from all calculations. Using b_max=30.0 or stricter is strongly recommended unless you have good reason not to.

prune_alts=True/False Residues with alternate conformations for any mainchain atom are excluded from all calculations. Useful if alternates are modeled inconsistently. Default behavior uses the first alt (and only the first alt) for each residue.

prune=restype1,restype2 Residues of the listed (3-letter) residue types are excluded from all calculations. prune=GLY,PRO will totally remove the effects of the "weird" residues from the reporting, for better or worse.
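The knock-on effect of pruning can be illustrated with a short sketch. It assumes a hypothetical input layout (a list of dicts carrying a residue name and mainchain B-factors for one unbroken chain) and hypothetical function names. It is not the cablam_training code, only a model of the behavior described above: a residue is reported only if it and its neighbors from i-2 to i+2 all survive the filters.

```python
MAINCHAIN = ("N", "CA", "C", "O")

def passes_filters(residue, b_max=None, prune=()):
    """True if the residue survives the prune and b_max style filters."""
    if residue["resname"] in prune:
        return False
    if b_max is not None:
        # Any single mainchain atom above the cutoff excludes the residue.
        if any(residue["b"][atom] > b_max for atom in MAINCHAIN):
            return False
    return True

def reportable_indices(residues, b_max=None, prune=(), reach=2):
    """Indices whose full i-reach..i+reach window survives filtering.

    reach=2 reflects the span of the CA pseudo dihedrals (i-2 to i+2).
    """
    kept = [passes_filters(r, b_max, prune) for r in residues]
    reportable = []
    for i in range(reach, len(residues) - reach):
        if all(kept[i - reach:i + reach + 1]):
            reportable.append(i)
    return reportable
```

In a nine-residue chain where only the seventh residue has a mainchain B above the cutoff, this sketch drops the fifth through ninth residues from reporting, which shows how a single high-B residue removes its neighbors as well.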

A.4 Other output options

The default printing is comma-separated text, with a column for the residue id and columns for each of the user-requested measures. A few modifications to this are possible:

- give_kin=True Instead of .csv format, prints a multidimensional point cloud in .kin format for use with the KiNG viewer.
- cis_or_trans=cis/trans/both Selects whether cis-peptides, trans-peptides, or both will be printed. The default is "both". omega=True must be used with this option.
- skip_types=restype1,restype2 Similar to prune in quality control, but less severe. Ignores these types for output, but not for calculation, so their neighbors can still be evaluated.
- include_types=restype1,restype2 Works with skip_types. If only include_types is used, only the listed restypes will be printed. If include_types and skip_types are both used, then the types given to include_types will override those skipped by skip_types. Sequence relationships may be represented with underscores.

A.5 Motif searching

Motif searching is another group of output modes, dedicated to looking at entire motifs rather than individual residues. Motifs are primarily defined by hydrogen bonding pattern, but may also require specific residue types and cis/trans peptides.

Motif searching is activated by giving probe_motifs= one or more motifs to look for. Available motif names may be found with list_motif=True, or ambitious users may define their own. See mmtbx/cablam/fingerprints/how_to.py for details. (how_to.py is not an actual script; it is a .py file so that fancy editors will color the example code correctly.)

Because motifs are defined by hydrogen bonds, motif searching needs hydrogens and phenix.probe information. Motif search will therefore run phenix.reduce and phenix.probe for each file unless a precomputed .probe file is provided via probe_path=path/to/probefiles. If you expect to run motif searching many times, it may be worth precomputing these files.

Run one of the following commands on a phenix.reduce'd .pdb file to generate an appropriate file:

- phenix.probe -u -condense -self -mc -NOVDWOUT -NOCLASHOUT ALL filename.pdb > filename.probe if you need sidechain H-bonding information
- phenix.probe -u -condense -self -mc -NOVDWOUT -NOCLASHOUT MC filename.pdb > filename.probe if you only need mainchain H-bonding
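For a large fileset, the precomputation could be scripted. The sketch below only assembles the two command lines given above and redirects their output to a matching .probe file. The helper names (probe_command, probe_output_path, precompute_probe) are hypothetical, and the sketch assumes that phenix.probe is on the PATH and that the input files have already been through phenix.reduce.

```python
import os
import subprocess

def probe_command(pdb_path, mainchain_only=False):
    """Argument list for the phenix.probe call described in the text."""
    # ALL keeps sidechain H-bonding information; MC is mainchain only.
    selection = "MC" if mainchain_only else "ALL"
    return ["phenix.probe", "-u", "-condense", "-self", "-mc",
            "-NOVDWOUT", "-NOCLASHOUT", selection, pdb_path]

def probe_output_path(pdb_path):
    """filename.pdb -> filename.probe, in the same directory."""
    root, _ext = os.path.splitext(pdb_path)
    return root + ".probe"

def precompute_probe(pdb_path, mainchain_only=False):
    """Run phenix.probe and capture its stdout in the .probe file."""
    with open(probe_output_path(pdb_path), "w") as out:
        subprocess.run(probe_command(pdb_path, mainchain_only),
                       stdout=out, check=True)
```

The directory holding the resulting files would then be the one passed to probe_path.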

A.5.1 Motif search outputs

These outputs generate automatically-named files in the working directory. These files may overwrite each other if the program is called multiple times.

probe_mode=kin returns an automatically-named .kin file for each member residue in each motif. The kins are high-dimensional dotlists containing the measures specified in the commandline (see above for options) for each residue that falls in the specified place in the motif. Recommended for finding motifs of interest within large filesets.

probe_mode=instance returns an automatically-named vectorlist kinemage file for each motif of interest. Each kin is a high-dimensional vectorlist that shows the path of a multi-residue motif through the measures specified in the commandline. This is an alternative way to view the information available through probe_mode=kin.

probe_mode=annote returns an automatically-named kinemage file for each pdb file. These kins are balllists that highlight the selected motifs of interest if appended to existing kinemages of the structures. Recommended to aid close inspection of motifs in single files of interest.

B. How to write and format a motif fingerprint for CaBLAM

This appendix contains instructions on how to write motif fingerprints for cablam_training. The material presented here has been adapted from the file how_to.py that is distributed along with the CaBLAM code through CCTBX and Phenix. Please reference how_to.py for any serious fingerprint development, as there may have been updates since this publication.

B.1 Setup

Put "from __future__ import division" at the top of the file.

Import the cablam_fingerprints module.

from __future__ import division
from mmtbx.cablam import cablam_fingerprints

Write a short description of the motif being coded.

B.2 Class instantiation

Create an instance of the motif class.

replace_this_with_name_of_motif = cablam_fingerprints.motif(
  motif_name = "replace_this_with_name_of_motif",
  residue_names = {"a":"residue1","b":"residue"},
  superpose_order = {"b":["CA","N"],"c":["CA"],"d":["CA"],"e":["CA","OH"]})

Pass the class a name for the motif (as a string). This will be used in printing, file naming, etc.

motif_name is an attribute of the motif class, not something to replace with the name of the motif.

superpose_order defines the atoms from each indexed residue to be used for automated superposition with superpose_pdbs. Optional unless that feature is desired for the motif.

Pass the class a dictionary of names for the residues in the motif. The keys for this dictionary must correspond to the indices used to identify residues later in the fingerprint. Some functions may .sort() these keys for printing, so using alphabetization-friendly keys is advised.

B.3 Adding residues

Add the first residue to the motif. The add_residue() method also returns the new residue for easy access. Here it's named residue1.

residue1 = replace_this_with_name_of_motif.add_residue(
  allowed_resname = [],
  banned_resname = [],
  sequence_move=None,
  bond_move='',
  end_of_motif=False,
  index='')

The default values for the function parameters are shown.

allowed_resname and banned_resname accept lists of 3-character amino acid names, e.g. ['GLY','PRO']. Neither is required. The default [] value (empty list) allows all residue types.

Residue types in banned_resname are disallowed at this position.

sequence_move and bond_move describe how to get to the next residue in the motif. One or the other, but not both, is *required* for all but the final residue.

sequence_move accepts a positive or negative integer. This number represents the sequence relationship to the next residue. E.g. sequence_move=2 would result in a move two residues forward (towards the C-terminus) in sequence.

bond_move can be used when the sequence relationship of the destination residue is not known. bond_move accepts a character string. This string *must* match the index of a residue already found by the motif (see add_bond()).

end_of_motif is a necessary flag for the last residue in the motif. Set it to True in that case and only that case.

index accepts a character string matching a key of the residue_names dictionary in the motif object. Different residues cannot have the same index, unless that index is the default empty string ''.

B.4 Adding bonds

Add a bond to the residue. The add_bond() method also returns the new bond for easy access.

bond1 = residue1.add_bond(
  required=True,
  banned=False,
  allow_bifurcated=False,
  src_atom='',
  trg_index='')

The default values for the function parameters are shown.

required is a boolean flag. If True, this bond must be present in the motif.

banned is a boolean flag. If True, this bond must be absent in the motif.

allow_bifurcated is a boolean flag. If True, this Hbond is allowed to be bifurcated. By default, bifurcated bonds are disallowed. If both targets of a bifurcated bond are of interest, another add_bond() must be performed for the other target.

src_atom accepts a 4-character string, formatted as a pdb-style atom name. It represents the atom in this residue from which the bond of interest originates. This is *required*. ' O  ' and ' H  ' are the strings for the protein backbone carbonyl oxygen and amide hydrogen most often involved in hydrogen bonding.

trg_index accepts a character string matching a key of the residue_names dictionary in the motif object. This index will be assigned to the residue on the target end of this bond (this may be relevant for bond_move). Trying to assign one index to different residues or trying to assign different indices to the same residue is considered a failure to match the motif.

trg_index may be left as the default empty string '' without provoking this failure to match, however.

B.5 Adding bond targets

Add target atoms to the bond. This is the final step and does *not* return anything.

bond1.add_target_atom(
  atomname=None,
  anyatom=False,
  seqdist=None,
  anyseqdist=False)

The default values for the function parameters are shown.

A bond may have multiple target atoms. Each one represents an atom at which the bond in question may terminate. Typically, there will only be one option, but multiple possible bonding partners can be represented by running add_target_atom multiple times.

atomname accepts a 4-character string, formatted as a pdb-style atom name. ' O  ' and ' H  ' are the strings for the protein backbone carbonyl oxygen and amide hydrogen most often involved in hydrogen bonding.

Alternatively, a bond to any atom, regardless of name, may be allowed by setting anyatom to True.

One of either atomname or anyatom is *required*.

seqdist accepts a positive or negative integer and represents the sequence distance to the end of the bond in question.

Alternatively, a bond to an atom at any sequence separation may be allowed by setting anyseqdist to True. This is particularly useful in beta structure.

One of either seqdist or anyseqdist is *required*.

B.6 General notes and final checks

Any number of possible target atoms may be added for each bond.

Any number of bonds may be required or banned for each residue.

Any number of residues may be defined for each motif. Order matters when defining residues. New residues are append()ed to a list, and so have order. Cablam will search for them in the order they are added. Movement instructions in the form of sequence_move or bond_move are required to get from one residue in this list to the next.

When finished:

Check that the last residue has end_of_motif=True.

Check that atom names use pdb-format 4-character names, including whitespace.

Check that the motif_name defined at the top of the motif is non-redundant with any existing motif in cctbx/mmtbx/cablam/fingerprints that you do not wish to overwrite.
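Several of these checks are mechanical and could be automated before a fingerprint is pickled. The sketch below works on a plain list-of-dicts description of a motif rather than on cablam_fingerprints objects, whose internal attributes are not described here. The function name check_motif_spec and the dict layout are hypothetical, and the keys simply mirror the keyword arguments documented in this appendix.

```python
def check_motif_spec(residues):
    """Return a list of problems found in a motif description.

    residues is a list of dicts in motif order (hypothetical layout).
    Each dict may hold sequence_move, bond_move, end_of_motif and a
    "bonds" list; each bond holds src_atom and a "targets" list.
    """
    problems = []
    last = len(residues) - 1
    for n, res in enumerate(residues):
        has_seq = res.get("sequence_move") is not None
        has_bond = bool(res.get("bond_move"))
        if n == last:
            if not res.get("end_of_motif"):
                problems.append("residue %d: last residue needs end_of_motif=True" % n)
        else:
            if res.get("end_of_motif"):
                problems.append("residue %d: end_of_motif set before the last residue" % n)
            # One or the other, but not both, is required.
            if has_seq == has_bond:
                problems.append("residue %d: give exactly one of sequence_move or bond_move" % n)
        for bond in res.get("bonds", []):
            if len(bond.get("src_atom", "")) != 4:
                problems.append("residue %d: src_atom must be a 4-character name" % n)
            for target in bond.get("targets", []):
                atomname = target.get("atomname")
                if atomname is None and not target.get("anyatom"):
                    problems.append("residue %d: target needs atomname or anyatom" % n)
                elif atomname is not None and len(atomname) != 4:
                    problems.append("residue %d: atomname must be a 4-character name" % n)
                if target.get("seqdist") is None and not target.get("anyseqdist"):
                    problems.append("residue %d: target needs seqdist or anyseqdist" % n)
    return problems
```

An empty list means the description passed these particular checks. It does not check the motif_name against existing fingerprints, which still has to be done by hand.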

B.7 Producing the fingerprint

Add this to the bottom of the code:

if __name__ == "__main__":
  cablam_fingerprints.make_pickle(replace_this_with_name_of_motif)

Multiple motifs can be stored in the same code, but each needs its own call to cablam_fingerprints.make_pickle, like so:

cablam_fingerprints.make_pickle(another_motif)

Place all make_pickle calls at the end, inside an if __name__ == "__main__": block, to prevent spontaneous generation of pickle files if the definitions are imported.

Run the code from the commandline using phenix.python. This will generate motif files in cctbx/mmtbx/cablam/fingerprints for cablam to use.

If you make any changes to the motif definition, you must re-run the code to regenerate the pickled motif files with updated information.
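Putting the steps of this appendix together, a complete definition might look like the sketch below. The motif itself (one backbone O to H hydrogen bond between two residues at any sequence separation) is invented purely for illustration and is not one of the distributed fingerprints. To keep the sketch runnable without Phenix, the fingerprints module is passed in as an argument. That is a choice made for this example, not a convention of how_to.py, and the names define_example_motif and example_motif are hypothetical. Only keyword arguments described in this appendix are used.

```python
def define_example_motif(fingerprints):
    """Build an illustrative two-residue motif.

    fingerprints is expected to be the cablam_fingerprints module
    (from mmtbx.cablam import cablam_fingerprints).
    """
    example_motif = fingerprints.motif(
        motif_name="example_motif",
        residue_names={"a": "donor_side", "b": "acceptor_side"})

    # First residue: reach the next residue by following a bond,
    # because its sequence position is not known in advance.
    residue_a = example_motif.add_residue(
        bond_move="b",
        index="a")
    bond = residue_a.add_bond(
        required=True,
        src_atom=" O  ",   # pdb-style 4-character name
        trg_index="b")     # index assigned to the bonded residue
    bond.add_target_atom(
        atomname=" H  ",
        anyseqdist=True)   # any sequence separation is accepted

    # Last residue: no move instruction, but end_of_motif is required.
    example_motif.add_residue(
        end_of_motif=True,
        index="b")
    return example_motif
```

In a real fingerprint file, the returned motif would be handed to cablam_fingerprints.make_pickle inside the __main__ guard described above, and the file would be run with phenix.python.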

References

Adams, P. D., Afonine, P. V., Bunkóczi, G., Chen, V. B., Davis, I. W., Echols, N., Headd, J. J., Hung, L.-W., Kapral, G. J., Grosse-Kunstleve, R. W., McCoy, A. J., Moriarty, N. W., Oeffner, R., Read, R. J., Richardson, D. C., Richardson, J. S., Terwilliger, T. C., & Zwart, P. H. (2010). PHENIX: a comprehensive Python-based system for macromolecular structure solution. Acta Crystallographica Section D: Biological Crystallography, 66(2), 213-221.

Akamine, P., Xuong, N. H., & Taylor, S. S. (2002). Crystal structure of a transition state mimic of the catalytic subunit of cAMP-dependent protein kinase. Nature Structural & Molecular Biology, 9(4), 273-277.

Amunts, A., Drory, O., & Nelson, N. (2007). The structure of a plant photosystem I supercomplex at 3.4 Å resolution. Nature, 447(7140), 58-63.

Arnone, A., Bier, C. J., Cotton, F. A., Day, V. W., Hazen, E. E., Richardson, D. C., Richardson, J. S. & Yonath, A. (1971). A high resolution structure of an inhibitor complex of the extracellular nuclease of Staphylococcus aureus I. Experimental procedures and chain tracing. Journal of Biological Chemistry, 246(7), 2302-2316.

Aurora, R., & Rose, G. D. (1998). Helix capping. Protein Science, 7(1), 21-38.

Mr. Bayes, & Price, M. (1763). An essay towards solving a problem in the doctrine of chances. By the late Rev. Mr. Bayes, F. R. S., communicated by Mr. Price, in a letter to John Canton, A. M. F. R. S. Philosophical Transactions (1683-1775), 370-418.

Berkholz, D. S., Shapovalov, M. V., Dunbrack, R. L., & Karplus, P. A. (2009). Conformation dependence of backbone geometry in proteins. Structure, 17(10), 1316-1325.

Berkholz, D. S., Driggers, C. M., Shapovalov, M. V., Dunbrack, R. L., & Karplus, P. A. (2012). Nonplanar peptide bonds in proteins are common and conserved but not biased toward active sites. Proceedings of the National Academy of Sciences, 109(2), 449-453.

Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). The protein data bank. Nucleic acids research, 28(1), 235-242.

Bondi, A. (1964). van der Waals volumes and radii. The Journal of physical chemistry, 68(3), 441-451.

Bragg, W. H., & Bragg, W. L. (1913). The reflection of X-rays by crystals. Proceedings of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Character, 428-438.

Chen, V. B., Arendall, W. B., Headd, J. J., Keedy, D. A., Immormino, R. M., Kapral, G. J., Murray, L. W., Richardson, J. S., & Richardson, D. C. (2009). MolProbity: all-atom structure validation for macromolecular crystallography. Acta Crystallographica Section D: Biological Crystallography, 66(1), 12-21.

Chen, V. B. (2010). Building better backbones: visualizations, analyses, and tools for higher quality macromolecular structure models. (Doctoral dissertation, Duke University).

Cornell, W. D., Cieplak, P., Bayly, C. I., Gould, I. R., Merz, K. M., Ferguson, D. M., Spellmeyer, D. C., Fox, T., Caldwell, J. W., & Kollman, P. A. (1995). A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. Journal of the American Chemical Society, 117(19), 5179-5197.

Croll, T. I. (2015). The rate of cis–trans conformation errors is increasing in low-resolution crystal structures. Acta Crystallographica Section D: Biological Crystallography, 71(3), 706-709.

Crooks, G. E., Hon, G., Chandonia, J. M., & Brenner, S. E. (2004). WebLogo: a sequence logo generator. Genome research, 14(6), 1188-1190.

Davis, I. W., Arendall, W. B., Richardson, D. C., & Richardson, J. S. (2006). The backrub motion: how protein backbone shrugs when a sidechain dances. Structure, 14(2), 265-274.

Deis, L. N., Pemble, C. W., Qi, Y., Hagarman, A., Richardson, D. C., Richardson, J. S., & Oas, T. G. (2014). Multiscale Conformational Heterogeneity in Staphylococcal Protein A: Possible Determinant of Functional Plasticity. Structure, 22(10), 1467-1477.

Dion-Schultz, A., & Howell, E. E. (1997). Effects of insertions and deletions in a beta-bulge region of Escherichia coli dihydrofolate reductase. Protein engineering, 10(3), 263-272.

Dunbrack, R. L., & Karplus, M. (1993). Backbone-dependent rotamer library for proteins: application to side-chain prediction. Journal of molecular biology, 230(2), 543-574.

Dunkle, J. A., Wang, L., Feldman, M. B., Pulk, A., Chen, V. B., Kapral, G. J., Noeske, J., Richardson, J. S., Blanchard, S. C., & Cate, J. H. D. (2011). Structures of the bacterial ribosome in classical and hybrid states of tRNA binding. Science, 332(6032), 981-984.

Emsley, P., & Cowtan, K. (2004). Coot: model-building tools for molecular graphics. Acta Crystallographica Section D: Biological Crystallography, 60(12), 2126-2132.

Engh, R. A., & Huber, R. (1991). Accurate bond and angle parameters for X-ray protein structure refinement. Acta Crystallographica Section A: Foundations of Crystallography, 47(4), 392-400.

Engh, R. A., & Huber, R. (2001). Structure quality and target parameters. In International Tables for Crystallography Volume F: Crystallography of biological macromolecules (pp. 382-392). Springer Netherlands.

Fromme, P., & Spence, J. C. (2011). Femtosecond nanocrystallography using X-ray lasers for membrane protein structure determination. Current opinion in structural biology, 21(4), 509-516.

Gane, P. J., & Dean, P. M. (2000). Recent advances in structure-based rational drug design. Current opinion in structural biology, 10(4), 401-404.

Gavezzotti, A. (1983). The calculation of molecular volumes and the use of volume analysis in the investigation of structured media and of solid-state organic reactivity. Journal of the American Chemical Society, 105(16), 5220-5225.

Georgiev, I., Keedy, D., Richardson, J. S., Richardson, D. C., & Donald, B. R. (2008). Algorithm for backrub motions in protein design. Bioinformatics, 24(13), i196-i204.

Gront, D., Kulp, D. W., Vernon, R. M., Strauss, C. E., & Baker, D. (2011). Generalized fragment picking in Rosetta: design, protocols and applications. PloS one, 6(8), e23294.

Headd, J. J., Echols, N., Afonine, P. V., Grosse-Kunstleve, R. W., Chen, V. B., Moriarty, N. W., Richardson, D. C., Richardson, J. S. & Adams, P. D. (2012). Use of knowledge-based restraints in phenix.refine to improve macromolecular refinement at low resolution. Acta Crystallographica Section D: Biological Crystallography, 68(4), 381-390.

Hemmingsen, J. M., Gernert, K. M., Richardson, J. S., & Richardson, D. C. (1994). The tyrosine corner: a feature of most Greek key β-barrel proteins. Protein Science, 3(11), 1927-1937.

Hobohm, U., Scharf, M., Schneider, R., & Sander, C. (1992). Selection of representative protein data sets. Protein Science, 1(3), 409-417.

Hooft, R. W., Vriend, G., Sander, C., & Abola, E. E. (1996). Errors in protein structures. Nature, 381(6580), 272-272.

Hosseini, S. R., Sadeghi, M., Pezeshk, H., Eslahchi, C., & Habibi, M. (2008). PROSIGN: A method for protein secondary structure assignment based on three-dimensional coordinates of consecutive Cα atoms. Computational biology and chemistry, 32(6), 406-411.

Jackson, R. M. (2002). Q-fit: a probabilistic method for docking molecular fragments by sampling low energy conformational space. Journal of computer-aided molecular design, 16(1), 43-57.

Jones, T. A., Zou, J. Y., Cowan, S. T., & Kjeldgaard, M. (1991). Improved methods for building protein models in electron density maps and the location of errors in these models. Acta Crystallographica Section A: Foundations of Crystallography, 47(2), 110-119.

Kabsch, W., & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22(12), 2577-2637.

Keedy, D. A., Williams, C. J., Headd, J. J., Arendall, W. B., Chen, V. B., Kapral, G. J., Gillespie, R. A., Block, J. N., Zemla, A., Richardson, D. C. & Richardson, J. S. (2009), The other 90% of the protein: Assessment beyond the Cαs for CASP8 template-based and high-accuracy models. Proteins, 77: 29–49.

King, S. M., & Johnson, W. C. (1999). Assigning secondary structure from protein coordinate data. Proteins: Structure, Function, and Bioinformatics, 35(3), 313-320.

Kleywegt, G. J. (1997). Validation of protein models from Cα coordinates alone. Journal of molecular biology, 273(2), 371-376.

Kopp, J., Bordoli, L., Battey, J. N.D., Kiefer, F. and Schwede, T. (2007), Assessment of CASP7 predictions for template-based modeling targets. Proteins, 69: 38–56.

Kryshtafovych, A., Fidelis, K. and Moult, J. (2007), Progress from CASP6 to CASP7. Proteins, 69: 194–207.

Labesse, G., Colloc'h, N., Pothier, J., & Mornon, J. P. (1997). P-SEA: a new efficient assignment of secondary structure from Cα trace of proteins. Computer applications in the biosciences: CABIOS, 13(3), 291-295.

Laskowski, R. A., MacArthur, M. W., Moss, D. S., & Thornton, J. M. (1993). PROCHECK: a program to check the stereochemical quality of protein structures. Journal of applied crystallography, 26(2), 283-291.

Leaver-Fay, A., Tyka, M., Lewis, S. M., Lange, O. F., Thompson, J., Jacak, R., Kaufman, K., Renfrew, D. P., Smith, C. A., Sheffler, W., Davis, I. W., Cooper, S., Treuille, A., Mandell, D. J., Richter, F., Ban, Y. A., Fleishman, S. J., Corn J. E., Kim, D. E., Lyskov, S., Berrondo, M., Mentzer, S., Popović, Z., Havranek, J. J., Karanicolas, J., Das, R., Meiler, J., Kortemme, T., Gray, J. J., Kuhlman, B., Baker D., & Bradley, P. (2011). ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules. Methods in enzymology, 487, 545.

Levitt, M. (1976). A simplified representation of protein conformations for rapid simulation of protein folding. Journal of molecular biology, 104(1), 59-107.

Levitt, M., & Greer, J. (1977). Automatic identification of secondary structure in globular proteins. Journal of molecular biology, 114(2), 181-239.

Lovell, S. C., Word, J. M., Richardson, J. S., & Richardson, D. C. (2000). The penultimate rotamer library. Proteins: Structure, Function, and Bioinformatics, 40(3), 389-408.

Lovell, S. C., Davis, I. W., Arendall, W. B., de Bakker, P. I., Word, J. M., Prisant, M. G., Richardson, J. S., & Richardson, D. C. (2003). Structure validation by Cα geometry: ϕ, ψ and Cβ deviation. Proteins: Structure, Function, and Bioinformatics, 50(3), 437-450.

Lu, M., Symersky, J., Radchenko, M., Koide, A., Guo, Y., Nie, R., & Koide, S. (2013). Structures of a Na+-coupled, substrate-bound MATE multidrug transporter. Proceedings of the National Academy of Sciences, 110(6), 2099-2104.

Martin, J., Letellier, G., Marin, A., Taly, J. F., De Brevern, A. G., & Gibrat, J. F. (2005). Protein secondary structure assignment revisited: a detailed analysis of different assignment methods. BMC Structural Biology, 5(1), 17.

Milner-White, E. J. (1987). Beta-bulges within loops as recurring features of protein structure. Biochimica et Biophysica Acta (BBA)-Protein Structure and Molecular Enzymology, 911(2), 261-265.

Oldfield, T. J., & Hubbard, R. E. (1994). Analysis of Cα geometry in protein structures. Proteins: Structure, Function, and Bioinformatics, 18(4), 324-337.

Ponder, J. W., & Richards, F. M. (1987). Tertiary templates for proteins: use of packing criteria in the enumeration of allowed sequences for different structural classes. Journal of molecular biology, 193(4), 775-791.

Ramachandran, G.N.; Ramakrishnan, C.; Sasisekharan, V. (1963). "Stereochemistry of polypeptide chain configurations". Journal of Molecular Biology 7: 95–9.

Richards, F. M., & Kundrot, C. E. (1988). Identification of structural motifs from protein coordinate data: Secondary structure and first-level supersecondary structure. Proteins: Structure, Function, and Bioinformatics, 3(2), 71-84.

Richardson, J. S., Getzoff, E. D., & Richardson, D. C. (1978). The beta bulge: a common small unit of nonrepetitive protein structure. Proceedings of the National Academy of Sciences, 75(6), 2574-2578.

Richardson, J. S. (1981). The anatomy and taxonomy of protein structure. Advances in protein chemistry, 34, 167-339.

Richardson, J. S., & Richardson, D. C. (1988). Amino acid preferences for specific locations at the ends of alpha helices. Science, 240(4859), 1648-1652.

Richardson J. S., Keedy D. A., Richardson D. C. (2013) The Plot thickens: more data, more dimensions, more uses, pp. 46-61 in Biomolecular Forms and Functions: A Celebration of 50 Years of the Ramachandran Map, ed. Bansal M, Srinivasan N, World Scientific Publishing, Singapore, ISBN 978-981-4449-13-27

Richardson, J. S., & Richardson, D. C. (2014). Biophysical highlights from 54 years of macromolecular crystallography. Biophysical journal, 106(3), 510-525.

Rittinger, K., Walker, P. A., Eccleston, J. F., Smerdon, S. J., & Gamblin, S. J. (1997). Structure at 1.65 Å of RhoA and its GTPase-activating protein in complex with a transition-state analogue. Nature, 389(6652), 758-762.

Schuwirth, B. S., Borovinskaya, M. A., Hau, C. W., Zhang, W., Vila-Sanjurjo, A., Holton, J. M., & Cate, J. H. D. (2005). Structures of the bacterial ribosome at 3.5 Å resolution. Science, 310(5749), 827-834.

Tinberg, C. E., Khare, S. D., Dou, J., Doyle, L., Nelson, J. W., Schena, A., Jankowski, W., Kalodimos, C. G., Johnsson, K., Stoddard, B. L. & Baker, D. (2013). Computational design of ligand-binding proteins with high affinity and selectivity. Nature, 501(7466), 212-216.

Tramontano, A. and Morea, V. (2003), Assessment of homology-based predictions in CASP5. Proteins, 53: 352–368.

Trelease, J. (2013). The read-aloud handbook, 7th ed. Penguin. xii-xiii.

Tyagi, M., Bornot, A., Offmann, B., & de Brevern, A. G. (2009). Analysis of loop boundaries using different local structure assignment methods. Protein Science, 18(9), 1869-1881.

Videau, L. L., Arendall, W. B., & Richardson, J. S. (2004). The cis-pro touch-turn: A rare motif preferred at functional sites. PROTEINS: Structure, Function, and Bioinformatics, 56(2), 298-309.

Williams, C. J., Hintze, B. J., Richardson, D. C. & Richardson, J. S. (2013). CaBLAM: Identification and scoring of disguised secondary structure at low resolution. Computational Crystallography Newsletter, 4, 33-35.

Williams, C. J. & Richardson, J. S. (2015). Avoiding excess cis peptides at low resolution or high B. Computational Crystallography Newsletter, 6, 2-6.

Word, J. M., Lovell, S. C., LaBean, T. H., Taylor, H. C., Zalis, M. E., Presley, B. K., Richardson, J. S. & Richardson, D. C. (1999). Visualizing and quantifying molecular goodness-of-fit: small-probe contact dots with explicit hydrogen atoms. Journal of molecular biology, 285(4), 1711-1733.

Word, J. M. (2000). All-atom small probe contact surface analysis: an information-rich description of molecular goodness-of-fit. (Doctoral dissertation, Duke University).

Xu, D., & Zhang, Y. (2012). Ab initio protein structure assembly using continuous structure fragments and optimized knowledge-based force field. Proteins: Structure, Function, and Bioinformatics, 80(7), 1715-1735.

Yusupov, M. M., Yusupova, G. Z., Baucom, A., Lieberman, K., Earnest, T. N., Cate, J. H. D., & Noller, H. F. (2001). Crystal structure of the ribosome at 5.5 Å resolution. Science, 292(5518), 883-896.

Zemla A. (2003) LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Res, 31: 3370–3374.

Zhang, W., Dunkle, J. A., & Cate, J. H. (2009). Structures of the ribosome in intermediate states of ratcheting. Science, 325(5943), 1014-1017.

Biography

Christopher Joseph Williams was born in Cleveland, Ohio on July 7th, 1984 to Tad Williams and Susan Williams (née Herbert). His father is an engineer and his mother is a teacher. He has one sibling, his younger brother, David. His family moved to Pennsylvania, near Philadelphia, when he was very young. They now reside in eastern Kentucky.

Christopher graduated from Russell High School as valedictorian in 2003. He returned to Cleveland and enrolled as an undergraduate at Case Western Reserve University. He followed the Biomedical Engineering program, specifically in the Tissue Engineering track. He was unsatisfied with the state of tissue engineering, but one day, while listening to a friend complain about his biochemistry courses, Christopher decided that he’d like to complain about biochemistry, too. He graduated magna cum laude with a B.S.E. in Biomedical Engineering and a minor in biochemistry on May 20, 2007. He then enrolled as a biochemist in the graduate program at Duke University, which has culminated in the Ph.D. dissertation before you.

He has published the following articles: “The other 90% of the protein: Assessment beyond the Cαs for CASP8 template-based and high-accuracy models” (Keedy et al, 2009), “CaBLAM: Identification and scoring of disguised secondary structure at low resolution” (Williams et al, 2013), and “Avoiding excess cis peptides at low resolution or high B” (Williams & Richardson, 2015). He anticipates further publications stemming from this work in the near future.

Christopher has been a member of the University Scholars program since his enrollment at Duke. As part of this membership, he has received a James B. Duke Fellowship.
