Tools for Validation of NMR-Structures

Geerten W. Vuister
Department of Biochemistry, University of Leicester
Biophysics, IMM, Radboud University Nijmegen
http://proteins.dyndns.org/Validation

Motivation

Analysis of ~400 NMR structures
NMR structures are generally not very good
~25% of recently deposited structures are seriously flawed

Structural quality can often be improved by:
• Proper computational procedures
• Validation of input data
• Validation of results

[Figure: plot with levels labelled 10%, 25%, 50%, 75%, 90% and average]

http://proteins.dyndns.org Nabuurs et al. PLoS Comp. Biol. 2, e9, 2006

NMR Structure Validation EMBO course, Basel, July 2013


Structural quality

1Q7X human PDZ2-AS        1OZI mouse PDZ2-AS

Statistics over first 10 deposited structures    hPDZ2    mPDZ2
Number of NOE restraints                          1648     1354
Number of torsion angle restraints                  80       76
RMSD all backbone atoms (Å)                       0.24     2.56
RMSD all heavy atoms (Å)                          0.86     3.00
PROCHECK Most favoured                             59%      79%
PROCHECK Additionally allowed                      27%      16%
PROCHECK Generously allowed / Disallowed           14%       5%
WHAT IF Ramachandran plot Z-score                 -6.7     -3.7
WHAT IF Packing quality Z-score                   -3.7     -1.2
WHAT IF Rotamer normality Z-score                 -6.5     -1.8
WHAT IF Backbone normality Z-score                -8.6     -3.8
Backbone RDC R-factor                              69%      40%

Structural quality

[Figure: RMS Z-scores output of WHAT IF]
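The Z-scores and RMS Z-scores quoted for WHAT IF compare an observed value with a reference distribution. A minimal sketch of the two quantities follows; the reference mean and standard deviation and the bond lengths are made-up numbers for illustration, not WHAT IF parameters.

```python
import math

def z_score(value, ref_mean, ref_sd):
    """Number of reference standard deviations between value and the reference mean."""
    return (value - ref_mean) / ref_sd

def rms_z_score(values, ref_mean, ref_sd):
    """Root mean square of the Z-scores of several observations of the same kind."""
    zs = [z_score(v, ref_mean, ref_sd) for v in values]
    return math.sqrt(sum(z * z for z in zs) / len(zs))

# Hypothetical reference distribution and observations (Å), for illustration only.
REF_MEAN, REF_SD = 1.53, 0.02
bond_lengths = [1.52, 1.54, 1.53, 1.55]

print(z_score(1.57, REF_MEAN, REF_SD))
print(rms_z_score(bond_lengths, REF_MEAN, REF_SD))
```

A Z-score far below zero means the value is much worse than typical for the reference set. An RMS Z-score near 1 means the spread of the observations matches the reference spread.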

Maybe the wrong ensemble was deposited?

‣ Unfortunately it looks like this was not the case.

‣ The images used in the publication also contain the errors.

‣ Furthermore, the structural observations described in the paper are in agreement with the incorrect structure… [JMB, 2003, 334:143-155]

Structural quality

[Figure: RMS Z-scores output of WHAT IF. Panels: DR1885 apo, 1X7L (replaced by 2JQA); copper-bound, 1X9L]

Structural quality

DR1885 restraints [PNAS, 2005, 102 (11) 3994-3999]

[Figure: RMS Z-scores output of WHAT IF]

Structural quality

How did such errors pass unnoticed?

Background reading

Chris A.E.M. Spronk, Sander B. Nabuurs, Elmar Krieger, Gert Vriend, Geerten W. Vuister. Validation of protein structures derived by NMR spectroscopy. Progress in Nuclear Magnetic Resonance Spectroscopy 45 (2004) 315–337. doi:10.1016/j.pnmrs.2004.08.003

Contents
1. Introduction
2. NMR structure determination
   2.1. Structure calculation procedures
   2.2. Structure selection
3. Validation of experimental data
   3.1. Fit of structures to experimental restraints
        3.1.1. Restraint violations
        3.1.2. RMS deviations and energies of restraints
        3.1.3. NMR R-factors and cross-validation
        3.1.4. Independent validation and Q-factors
   3.2. Information content in experimental restraints
        3.2.1. Number of restraints, completeness and redundancy
        3.2.2. Quantitative evaluation of experimental NMR restraints
4. Precision and accuracy of NMR structure ensembles
   4.1. Precision versus accuracy
5. Validation of geometric quality
   5.1. Z-scores and RMS Z-scores
   5.2. Bonded geometry
        5.2.1. Bond lengths and angles
        5.2.2. Chirality and tetrahedral geometry
        5.2.3. Side chain planarity
        5.2.4. Side chain rotamers
        5.2.5. Backbone conformation
   5.3. Non-bonded interactions
        5.3.1. Inter-atomic bumps
        5.3.2. Hydrogen bonding
        5.3.3. Electrostatics
        5.3.4. The packing of residues in protein structures
   5.4. NMR versus X-ray structures

Sander B. Nabuurs, Chris A.E.M. Spronk, Gert Vriend, Geerten W. Vuister. Concepts and Tools for NMR Restraint Analysis and Validation. Concepts in Magnetic Resonance Part A, Vol. 22A(2) 90–105 (2004). DOI 10.1002/cmr.a.20016

ABSTRACT: The quality of NMR-derived biomolecular structure models can be assessed by validation on the level of structural characteristics as well as the NMR data used to derive the structure models. Here, an overview is given of the common methods to validate experimental NMR data. These methods provide measures of quality and goodness of fit of the structure to the data. A detailed discussion is given of newly developed methods to assess the information contained in experimental NMR restraints, which provide powerful tools for validation and error analysis in NMR structure determination.

KEY WORDS: structure validation; experimental restraints; restraint validation; structure refinement

G.W. Vuister, R.H. Fogh, P.M.S. Hendrickx, J.F. Doreleijers, A. Gutmanas. An overview of tools for the validation of protein NMR structures. J. Biomol. NMR, in press.
Tools discussed: PROCHECK-NMR, PSVS, GLM-RMSD, CING, Molprobity, Vivaldi, ResProx, NMR constraints analyzer and QMEAN.

wwPDB NMR-VTF

Task: to formulate generally accepted routines and procedures for the validation of NMR-derived biomolecular structures

Phase 1. Tasks to be implemented by PDB using largely existing software

Phase 2. Tasks for which software / methods are available, but which need more assessment before defining standard validation conventions for PDB

Phase 3. Tasks requiring further research over the coming year

Recommendations of the wwPDB NMR Validation Task Force
Gaetano T. Montelione, Michael Nilges, Ad Bax, Peter Güntert, Torsten Herrmann, Jane S. Richardson, Charles Schwieters, Wim F. Vranken, Geerten W. Vuister, David S. Wishart, Helen M. Berman, Gerard J. Kleywegt, and John L. Markley
Structure, under review

Structural quality

Proper computation.
Error detection.
Mobility and structural variation.
Assessment of structural quality.

Structural quality

Proper computation.
• Short annealing/restraint MD calculation using electrostatics and explicit solvent (Spronk et al. J. Biomol NMR 22, 281-9, 2002; Linge et al. Proteins 50, 496-506, 2003)
• Publicly available databases: DRESS (http://www.cmbi.ru.nl/dress) and RECOORD (http://www.ebi.ac.uk/msd-srv/docs/NMR/recoord/main.html).
• >100, 500 refined structures and validation reports.
• NMR_REDO
Error detection.
Mobility and structural variation.
Assessment of structural quality.

Structural quality

Proper computation.
Error detection.
Mobility and structural variation.
Assessment of structural quality.
• Data are ‘complex in nature’ (NOEs, J, RDC, SAXS, databases, ..).
• Differences in interpretation and parametrization.
• Differences in protocols.
• Multiple structures (dynamics).
• Tools for structure validation.

Structure determination process

[Figure: flowchart of the structure determination process; labels include "Experimental data", "Reinterpretation of data" and "Validated structure". Spronk et al., Fig. 1]

Structure calculation and selection

[Figure: labels "Experimental data (restraints)", "Repeat n-times", "Selection", "Ensemble". Spronk et al., Fig. 2]
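The calculate-n-times-then-select step can be sketched as below. Ranking by a single "energy" value and keeping the lowest ones is a common choice assumed here for illustration; the slides do not specify the selection criterion, and the record layout is hypothetical.

```python
def select_ensemble(structures, n_keep):
    """Keep the n_keep calculated structures with the lowest target-function value."""
    ranked = sorted(structures, key=lambda s: s["energy"])
    return ranked[:n_keep]

# Hypothetical results of six independent structure calculations.
energies = [12.1, 3.4, 8.9, 2.2, 15.0, 4.1]
calculated = [{"id": i, "energy": e} for i, e in enumerate(energies)]

ensemble = select_ensemble(calculated, 3)
selected_ids = [s["id"] for s in ensemble]
print(selected_ids)
```

Whatever criterion is used, it changes every statistic computed on the ensemble afterwards (RMSD, violations), so it needs to be reported with the results.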

Structure determination process

[Figure: Spronk et al., Fig. 1]

Precision and accuracy

Precision is the variation of X around its mean ⟨X⟩
• expressed as standard deviation or variance

Accuracy is the closeness of ⟨X⟩ to the “true” value of X

Precision and accuracy are often mixed in the literature
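A small numerical sketch of the distinction, using made-up measurements of a quantity whose true value is assumed to be known:

```python
import statistics

def precision(values):
    """Spread of the values around their own mean (sample standard deviation)."""
    return statistics.stdev(values)

def accuracy_error(values, true_value):
    """Distance between the mean of the values and the true value."""
    return abs(statistics.mean(values) - true_value)

TRUE_VALUE = 4.0                         # assumed known, for illustration only
tight_but_off = [5.1, 5.0, 5.2, 5.1]     # precise, not accurate
loose_but_centred = [3.0, 5.0, 4.0, 4.0] # accurate, not precise

for name, vals in [("tight", tight_but_off), ("loose", loose_but_centred)]:
    print(name, precision(vals), accuracy_error(vals, TRUE_VALUE))
```

The tight set has the smaller standard deviation but the larger error, which is why a low ensemble RMSD on its own says nothing about accuracy.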

Precision and accuracy

[Figure: Spronk et al., Fig. 5]

Accuracy of NMR structures

Accuracy can only be assessed when the true structure is known (“Gold Standard”)
• Only the case for simulated data-sets

Sometimes X-ray structures are used
• Different experimental conditions
• Crystal contacts
• In some cases X-ray structures fit NMR data better than NMR structures

Uncertainty in structure coordinates

X-ray crystallography: B-factor
• Quality of the crystal
• Dynamic behavior of the molecule
• Disorder

NMR: atomic Root Mean Square Deviation or RMSD
• Should reflect measured dynamics and the uncertainty in the experimental data
• Used as a measure for precision and accuracy!

Coordinate RMSDs

Calculation requires superposition of structures
• Region dependent

Structure selection criteria are subjective

Use Circular Variance, CV?
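Unlike a coordinate RMSD, a circular variance of a dihedral angle over the ensemble needs no superposition. A minimal sketch, assuming the usual definition CV = 1 − R/N, where R is the length of the sum of the N unit vectors; the angle lists are invented examples:

```python
import math

def circular_variance(angles_deg):
    """Circular variance of angles in degrees: 0 = identical, 1 = fully spread."""
    n = len(angles_deg)
    sum_cos = sum(math.cos(math.radians(a)) for a in angles_deg)
    sum_sin = sum(math.sin(math.radians(a)) for a in angles_deg)
    resultant = math.hypot(sum_cos, sum_sin)
    return 1.0 - resultant / n

# Hypothetical phi angles of one residue across the models of an ensemble.
well_defined = [-60.0, -62.0, -58.0, -61.0]
disordered = [0.0, 90.0, 180.0, 270.0]
wraparound = [179.0, -179.0]  # only 2 degrees apart despite the sign change

print(circular_variance(well_defined))
print(circular_variance(disordered))
print(circular_variance(wraparound))
```

Working with unit vectors handles the ±180° wrap-around that a plain standard deviation of angles would get wrong.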

Superposition

Fig. 1 (Snyder and Montelione, Proteins 59, 673 (2005)). Superimpositions of structural bundles calculated from a single ordered region. A, B: Structural bundles obtained from the PDB are superimposed using only a single stretch of locally ordered residues (i.e. those for which the sum of the φ and ψ dihedral angle order parameters is greater than or equal to 1.5), shown in blue. Other locally ordered residues are shown in green, and residues which have φ and ψ dihedral angle order parameters adding to less than 1.5 are shown in red. For the sake of clarity, only four structural models are shown, although all models reported in the PDB file were used in calculating the average structure for the superimposition. Residues are numbered as in the PDB file. A: In 1PKT (phosphatidylinositol 3-kinase, SH3 domain), residues 6–33 were used to determine the superimposition, which results in a relatively tight superimposition of the other locally ordered residues (36–77) as well. B: In 1CFC (calcium-free calmodulin), the superimposition is calculated using residues 1–75, which results in a poor superimposition of residues 78–148. C, D: The sum of the φ and ψ dihedral angle order parameters is shown as a function of sequential residue number for both (C) 1PKT and (D) 1CFC. In (C), the disordered loop residues 34–35 separate ordered structures which are well defined with respect to one another. In (D), disordered residues 76–77 separate ordered structures that are not well defined with respect to each other.

Excerpt from the same page: Recall that each element of the IVM is a sum of n square displacements from a mean value. If scaled by the appropriate parametric variance, each element would therefore approximately sample a χ²[n] distributed random variable (i.e., with n degrees of freedom). For even a moderate number of degrees of freedom, the distribution of the cube-root of a χ² random variable approximates a normal distribution.16 Since scaling a normally distributed random variable results in a variable that maintains a Gaussian distribution, by taking the cube root of each element in the IVM, the transformed row vectors from the sub-matrix consisting only of the (now cube-root of the) variances in distance between core atoms represent core backbone atoms as vectors of approximately normally distributed random variables. The optimal SAHN variant to use for clustering data represented by vectors whose elements are normally dis…

NMR-VTF Phase 1

Superposition: Cyrange (distance variance matrix)
Assessing structured regions: Cyrange
Representative model: medoid
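The medoid named as representative model is the ensemble member closest to all others. A minimal sketch, assuming a precomputed pairwise RMSD matrix and taking "closest" to mean the smallest summed distance; the slide does not state the exact criterion, and the matrix values are invented.

```python
def medoid_index(dist):
    """Index of the model with the smallest summed distance to all other models."""
    totals = [sum(row) for row in dist]
    return min(range(len(dist)), key=totals.__getitem__)

# Hypothetical symmetric pairwise RMSD matrix (Å) for four models.
pairwise_rmsd = [
    [0.0, 0.8, 1.5, 0.9],
    [0.8, 0.0, 0.7, 0.6],
    [1.5, 0.7, 0.0, 1.1],
    [0.9, 0.6, 1.1, 0.0],
]

representative = medoid_index(pairwise_rmsd)
print(representative)
```

In contrast to an averaged structure, the medoid is one of the actual models, so its geometry has not been distorted by averaging coordinates.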

Precision revisited

A good estimate of the precision requires maximizing the conformational variability within a given set of experimental restraints

Precision and accuracy

[Figure: Spronk et al., Fig. 6]
“Tight” bundle: low RMSD, suggesting high precision
“Loose” bundle: high RMSD, more realistic estimate of precision

Resampling to assess precision

[Figure: labels "Ensemble of protein structures", "CONCOORD", "Re-sampled ensemble", "refinement", "Refined ensemble". Spronk et al., J. Biomol. NMR. 25, 225-234 (2003)]

Resampling to assess precision

[Figure: Ubiquitin; CONCOORD; axes labelled Z-scores, RMSD (Å), RMSD distances (Å) and RMSD angles (º). Spronk et al., J. Biomol. NMR. 25, 225-234 (2003)]

Resampling to assess precision

NMR ensembles tend not to accurately reflect the experimental uncertainty.

Important to consider when assessing structural differences, e.g. structural changes resulting from interaction.

Structure determination process

[Figure: Spronk et al., Fig. 1]

Evaluation of experimental data

Quality of experimental data:
• Number (?)
• Completeness / RPF scores
• QUEEN.

Agreement of the structure with the experimental data:
• Number and size of violations.
• Root mean square deviations.
  • NOE, Dihedral angles etc.
• R-factors, Q-factors.
  • RDC, chemical shifts etc.
• Cross-validation.

Evaluation of experimental data

Number of restraints:

Intra-residual (|i-j|=0):
• Side chain conformation
Sequential (|i-j|=1):
• Secondary structure
Medium range (1<|i-j|≤4):
• Secondary structure
Long range (|i-j|>4):
• Secondary and tertiary structure

Total: 1413

Evaluation of experimental data

Redundancy of restraints.

[Figure: HN and Hα protons; maximum HN-Hα distance = 3.0 Å. Nabuurs et al, Fig. 3]
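The four restraint classes follow directly from the residue separation |i-j|. A minimal sketch; the restraint list of residue-number pairs is invented for illustration.

```python
from collections import Counter

def restraint_class(res_i, res_j):
    """Classify a distance restraint by the sequence separation of its residues."""
    separation = abs(res_i - res_j)
    if separation == 0:
        return "intra-residual"
    if separation == 1:
        return "sequential"
    if separation <= 4:
        return "medium range"
    return "long range"

# Hypothetical restraints given as (residue i, residue j).
restraints = [(5, 5), (5, 6), (5, 8), (5, 9), (5, 10), (12, 40)]
counts = Counter(restraint_class(i, j) for i, j in restraints)
print(counts)
```

A per-class count is more informative than the total, because only the long-range class constrains the tertiary fold.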

Evaluation of experimental data

Redundancy of restraints.

Multiple entries of the same restraint!

1D3Z (Ubiquitin):

                     counts
singly defined          940
multiple defined        696
total unique:          1636
duplicates:            1091
total all:             2727

Evaluation of experimental data

Completeness

[Figure: Fig. 3 Nabuurs et al.]
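Both ideas on the two slides above, duplicate entries and completeness, can be sketched together. Completeness is taken here as the percentage of expected contacts that appear among the observed restraints. In practice the expected contacts would be the proton pairs closer than a chosen cutoff in the structure; here they are simply listed by hand, and all atom names are invented.

```python
def unique_restraints(restraints):
    """Collapse repeated entries; (a, b) and (b, a) count as the same restraint."""
    return {tuple(sorted(pair)) for pair in restraints}

def completeness(observed, expected):
    """Percentage of expected contacts that are present among the observed restraints."""
    obs = unique_restraints(observed)
    exp = unique_restraints(expected)
    return 100.0 * len(obs & exp) / len(exp)

# Hypothetical restraint list containing repeated entries.
observed = [("A1.HN", "A2.HN"), ("A2.HN", "A1.HN"),
            ("A1.HA", "A2.HN"), ("A1.HN", "A2.HN")]
# Hypothetical contacts expected from the structure.
expected = [("A1.HN", "A2.HN"), ("A1.HA", "A2.HN"),
            ("A2.HA", "A3.HN"), ("A3.HN", "A2.HN")]

n_unique = len(unique_restraints(observed))
n_duplicates = len(observed) - n_unique
print(n_unique, n_duplicates, completeness(observed, expected))
```

Counting after deduplication avoids the inflation shown for 1D3Z, where 2727 entries reduce to 1636 unique restraints.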

Evaluation of experimental data

Restraints per residue
Completeness per residue

[Figure: Fig. 3 Nabuurs et al.]

Evaluation of experimental data

Quality of experimental data:
• Number (?)
• Completeness / RPF scores
• QUEEN.

Agreement of the structure with the experimental data:
• Number and size of violations.
• Root mean square deviations.
  • NOE, Dihedral angles etc.
• R-factors, Q-factors.
  • RDC, chemical shifts etc.
• Cross-validation.

Evaluation of experimentalProtein NMR RPF Scores data EvaluationARTICLES of experimental data

can also be calibrated from the NOESY data. Accordingly, the RPF scores performance score (F-measure) of the final ensemble of structuresQualityF(Gh ) of experimental data: is assessed by the following set of statistics: • Number (?) {p (h1, h2, p) GANOE,(h1, h2, d) Gh } Recall (Gh ) ) | | ∈ ∈ | (1) Protein NMR RPF Scores ARTICLES {p (h1, h2, p) GANOE} • Completeness / RPF scores Figure 1. Comparison of distance network Gh generated from an ensemble | | ∈ | of 3D query structures and G generated from input NOE peaklist (NOE) ANOE -6 QUEEN. and resonance assignment (R) data. Edges that are present in both Gh and ! d(h1, h2) • GANOE are true positives (TP). Edges present in Gh , but not in GANOE are (h1,h2,d) Gh , ∈ h (h1,h2,p) GANOE false positives (FP). Edges that are not present in both G and GANOE are ∈ true negatives (TN). NOE cross peaks (p) are counted (only once) as false Precisionw(Gh ) ) (2) negatives (FN) if corresponding linking edges in GANOE are not present in d(h1,h2)-6 Gh . ! Agreement of the structure with the experimental (h1,h2,d) Gh ∈ ously linked to more than one proton pair, as indicated by chemical data: 2 Recall(Gh ) Precisionw(Gh ) shift degeneracies and match tolerances. The solution network, GNOE, F(Gh ) ) × × (3) Recall(Gh ) + Precision (Gh ) corresponding to the true 3D structure, is a subgraph of GANOE. w • Number and size of violations. Given complete NOESY peak lists and resonance assignments, for -6 each NOESY cross peak, at least one of its linked proton pairs belongs In this analysis, a distance (d )weightingoftheprecisionmetric,• Root mean square deviations. precisonw(Gh ), is used to reduce the otherwise dominant influence of to GNOE.Fromanensembleofquery3Dstructures,anensemble-average distance network Gh is then calculated from the sum of inverse sixth the many weak NOEs arising from interproton distances close• to theNOE, Dihedral angles etc. 
powers of individual degenerate proton-proton distances, assuming upper-bound detection limit, dNOE_max. This weighting also makes the uniform effects of nuclear relaxation processes (Figure 1). Protons quality scores less sensitive to the value chosen for dNOE_max• . R-factors, Q-factors. (vertices) are connected (edges) if their corresponding midrange Discriminating Power (DP-score). While the F-measure statistic Figure 5. Graphic representations of false positive (FP) distributions on IL-13 structures. False positives correspond to short average distances (

Downloaded by UKB CONSORTIA NETHERLANDS on July 10, 2009 input NOESY peak lists and resonance assignments. F(Gideal), and such as packing contacts, dihedral angle distributions,true and positiveslists (TP), from which while they “not-relevant” are generated, these documents constraint retrieved lists are by the Published on January 22, 2005 http://pubs.acs.org | doi: 10.1021/ja047109h conformational energies, are valuable tools for protein structurealgorithm areinterpretations false positives of NOESY (FP). peak “Relevant” lists, while RPF documents scores directly not retrieved particularly the Precision of Gideal, thus provides a measure of the validation,15-17 comparing observed conformational distributionsby an algorithmmeasure are the false quality negatives of structures (FN) against and the “not-relevant” NOESY peak list documents combined quality of the resonance assignment and NOESY peak lists data. For example, Precision has similarities with NOE Com- and packing with values observed in nature and/or expectedthat on are also not retrieved by an algorithm are true negatives (TN). for one or more spectra. F(Gideal) and F(Gfree) describe the two bounds first principles. RPF scores measure global goodness-of-fits of pleteness score;10 the Precision score measures the completeness of the performance F(Gh ); i.e., F(Gideal) g F(Gh ) g F(Gfree). With these NOE peak lists with NMR structures. In general, the goalRecall should is definedof back-calculated as the fraction peak lists ofrelati relevantVe to documentsNOESY peak list that data are, retrieved be to generate protein structures that score well in theseby several the algorithmwhile the and NOE Precision Completeness is scoredefined computes as the the fractioncompleteness of retrieved definitions, the fold Discriminating Power (DP) for Gh is then estimated different and complementary views of structure quality.documents For of that the back-calculated are in fact relevant. 
[...] distance constraints

Excerpts from J. Am. Chem. Soc. 127 (6), 2005, pp. 1667 and 1673 (published on January 22, 2005; doi: 10.1021/ja047109h):

In the context of NOESY-based structure analysis, proton pair interactions (h1, h2) are analogous to "documents". Observed NOESY cross peaks are defined as true relevant documents, assuming the peak lists (set NOE) have no noise. Potential NOESY peaks not observed in the data are analogous to not-relevant documents, assuming the input data are complete. As illustrated in Figure 1, particular proton pair interactions present in (or "retrieved by") the atomic coordinates of a model structure may either be represented in the graphical representation of the NOESY peak list data G_ANOE (TP), or not represented in G_ANOE (FP). Proton pair interactions "not retrieved" by the structure and also not represented in G_ANOE are defined as TNs. Proton pair interactions not retrieved by the structure but represented in G_ANOE have to be considered carefully with respect to the ambiguous relationship between peaks and their multiple possible assignments. Since G_ANOE is an ambiguous network, a FN score is assigned to the peak only if none of the several possible interactions are observed in G_h. In this context, Recall (eq 1) measures the fraction of NOE cross peaks that are retrieved by the query structures, while Precision (eq 2) measures the fraction of retrieved proton pair interactions in the query structure that are relevant (in G_ANOE), weighted by interproton distance. The upper-bound observed distance, d_NOE_max, used in these measures is 5 Å, but [...]

The F-measure characterizes the combined performance of Recall and Precision. [...] relative to a [...] derived as:

$$DP(G_h) = \frac{F(G_h) - F(G_{free})}{F(G_{ideal}) - F(G_{free})} \qquad (4)$$

where DP(G_ideal) = 1 and DP(G_free) = 0. The F-measure score provides an assessment of the overall fit between the query model structure(s) and the experimental data, assuming that the input data are near complete; the Discriminating Power score, DP(G_h), measures how the query structure is distinguished from the freely rotating chain model.

[...] example, high RPF scores and high Procheck scores indicate that the structures both fit the data well and have good stereochemical qualities. High RPF scores and slightly lower Procheck scores indicate that the structures fit the data well, but that the data may not be sufficient to define correct local structure, and additional data and/or refinement processes may be required. Importantly, good stereochemical and/or packing scores alone do not necessarily demonstrate that the corresponding structure fits well to the experimental NOE data. Similarly, [...]

[...] (and potentially incorrect) constraint list. While the NOESY peak lists themselves are "derived" information, they are closer to the raw NMR spectral data than constraint lists, which involve much higher levels of interpretation and (sometimes) data omission.

Conclusions

"NMR R-factors" provide a quality measure of the agreement between the experimental and back-calculated NOESY peak lists. Although critical to the development of the field, such analyses have not been routinely used in NMR structure calculations either because conventional methods of back calculating [...]

NMR Datasets. We have validated the sensitivities of NMR RPF scores on experimental NMR data sets of: human basic fibroblast growth factor (FGF-2, 154 a.a.),26,27 the inhibitor-free catalytic fragment of human fibroblast collagenase (MMP-1, 169 a.a.),28,29 and human [...]

(24) Flory, P. J. Statistical Mechanics of Chain Molecules; Interscience Publishers: New York, 1969.
(25) Cantor, C. R.; Schimmel, P. R. Biophysical Chemistry; W. H. Freeman: San Francisco, 1980.
(26) Moy, F. J.; Seddon, A. P.; Campbell, E. B.; Bohlen, P.; Powers, R. J. Biomol. NMR 1995, 6, 245-254.
(27) Moy, F. J.; Seddon, A. P.; Bohlen, P.; Powers, R. Biochemistry 1996, 35, 13552-13561.
(28) Moy, F. J.; Pisano, M. R.; Chanda, P. K.; Urbano, C.; Killar, L. M.; Sung, M.-L.; Powers, R. J. Biomol. NMR 1997, 10, 9-19.
(45) Laskowski, R. A.; Moss, D. S.; Thornton, J. M. J. Mol. Biol. 1993, 231, 1049-1067.
(46) Sayle, R. A.; Milner-White, E. J. Trends Biochem. Sci. 1995, 20, 374.
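The Recall, Precision, F-measure and DP definitions quoted above can be sketched numerically. This is a minimal illustration, not the published implementation: it assumes the standard information-retrieval forms (Recall = TP/(TP+FN), Precision = TP/(TP+FP), F as their harmonic mean), it ignores the interproton-distance weighting of Precision mentioned in the excerpt, and all counts and F values in the example are invented.

```python
def recall(tp, fn):
    """Fraction of observed NOE peaks that the structure explains."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Fraction of short proton pairs in the structure backed by a peak.
    Unweighted here; the excerpt describes a distance-weighted version."""
    return tp / (tp + fp)


def f_measure(r, p):
    """Harmonic mean of recall and precision (assumed standard form)."""
    return 2.0 * r * p / (r + p) if (r + p) > 0 else 0.0


def dp_score(f_query, f_free, f_ideal):
    """Eq. 4: rescale F so the freely rotating chain gives 0 and the ideal gives 1."""
    return (f_query - f_free) / (f_ideal - f_free)


# Hypothetical counts for one query structure
r = recall(tp=90, fn=10)        # 0.9
p = precision(tp=90, fp=30)     # 0.75
f = f_measure(r, p)
dp = dp_score(f, f_free=0.4, f_ideal=1.0)  # f_free and f_ideal are made-up values
print(f"Recall={r:.2f} Precision={p:.2f} F={f:.3f} DP={dp:.3f}")
```

By construction, `dp_score` returns 0 for the freely rotating chain model and 1 for the ideal case, matching the boundary conditions stated with eq 4.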

J. Am. Chem. Soc. 127 (6), 2005, p. 1667

QUEEN

Structure calculations
Experimental data (restraints)
Uncertainty:
Calculate H using concepts of Shannon's information theory

QUEEN: average information

Average information of restraint r :
Restraints of 1GB1
-> Iave is a measure for the average contribution of the restraint to the determination of the fold

QUEEN: average information

Average information of restraint r :
-> Iave is a measure for the average contribution of the restraint to the determination of the fold.
[Figure axis: fraction restraints (%)]

QUEEN: average information

Average information of restraint r :
YBOX (1H95): 10 restraints with highest Iave
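The idea of an average information per restraint can be illustrated with a toy model. The sketch below is not the QUEEN algorithm: it assumes a simplified uncertainty H (the sum over atom pairs of log2 of the allowed distance range), hypothetical default bounds of 1.8 to 50 Å, and invented restraints. The average information of a restraint is taken as its mean reduction of H over all orders in which the restraints can be added.

```python
import math
from itertools import permutations

DEFAULT_BOUNDS = (1.8, 50.0)  # assumed distance range (Å) for an unrestrained pair


def uncertainty(restraints, pairs):
    """Toy uncertainty H: sum of log2(width of the allowed distance range)."""
    bounds = {pair: DEFAULT_BOUNDS for pair in pairs}
    for pair, lower, upper in restraints:
        lo, hi = bounds[pair]
        bounds[pair] = (max(lo, lower), min(hi, upper))
    return sum(math.log2(hi - lo) for lo, hi in bounds.values())


def average_information(restraints):
    """Mean drop in H caused by each restraint, averaged over all addition orders."""
    pairs = sorted({pair for pair, _, _ in restraints})
    gains = [0.0] * len(restraints)
    orders = list(permutations(range(len(restraints))))
    for order in orders:
        used = []
        h_before = uncertainty(used, pairs)
        for idx in order:
            used.append(restraints[idx])
            h_after = uncertainty(used, pairs)
            gains[idx] += h_before - h_after
            h_before = h_after
    return [g / len(orders) for g in gains]


# Hypothetical restraints: (atom pair, lower bound, upper bound) in Å
restraints = [
    (("A", "B"), 1.8, 5.0),
    (("A", "B"), 1.8, 6.0),  # looser duplicate of the first restraint
    (("A", "C"), 1.8, 4.0),
]
i_ave = average_information(restraints)
for r, info in zip(restraints, i_ave):
    print(r, round(info, 3))
```

In this toy model the per-restraint averages sum to the total information of the set, H(no restraints) minus H(all restraints). Enumerating all permutations is only feasible for a handful of restraints; a real data set would need sampling.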

QUEEN: unique information

Unique information of restraint r :
Restraints of 1GB1
[Figure axis: fraction restraints (%)]

QUEEN: Redundancy

Unique information of restraint r :
-> Is a measure for the degree of support by other restraints
Total: 1413
Total: 565
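Unique information and redundancy can be illustrated with the same kind of toy model. The sketch below is an assumption-laden illustration, not QUEEN itself: H is again a simplified sum of log2 distance-range widths, the default bounds and the restraints are invented, and the unique information of a restraint is taken as the increase in H when only that restraint is left out.

```python
import math

DEFAULT_BOUNDS = (1.8, 50.0)  # assumed distance range (Å) for an unrestrained pair


def uncertainty(restraints, pairs):
    """Toy uncertainty H: sum of log2(width of the allowed distance range)."""
    bounds = {pair: DEFAULT_BOUNDS for pair in pairs}
    for pair, lower, upper in restraints:
        lo, hi = bounds[pair]
        bounds[pair] = (max(lo, lower), min(hi, upper))
    return sum(math.log2(hi - lo) for lo, hi in bounds.values())


def unique_information(restraints):
    """H without restraint r minus H with all restraints, for every r."""
    pairs = sorted({pair for pair, _, _ in restraints})
    h_all = uncertainty(restraints, pairs)
    return [
        uncertainty(restraints[:i] + restraints[i + 1:], pairs) - h_all
        for i in range(len(restraints))
    ]


# Hypothetical restraints: (atom pair, lower bound, upper bound) in Å
restraints = [
    (("A", "B"), 1.8, 5.0),
    (("A", "B"), 1.8, 6.0),  # fully supported by the tighter restraint above
    (("A", "C"), 1.8, 4.0),
]
i_uni = unique_information(restraints)
redundant = [r for r, info in zip(restraints, i_uni) if info < 1e-12]
print(i_uni, redundant)
```

A restraint whose unique information is zero is fully supported by the others and can be removed without changing the toy uncertainty; a large value marks a restraint that nothing else backs up.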

Evaluation of experimental data

Quality of experimental data:
• Number (?)
• Completeness.
• QUEEN.

Agreement of the structure with the experimental data:
• Number and size of violations.
• Root mean square deviations.
  • NOE, Dihedral angles etc.
• R-factors, Q-factors.
  • RDC, chemical shifts etc.
• Cross-validation.

Evaluation of experimental data

Number and size of restraint violations:
(Remember selection!)
Fig. 4 Nabuurs et al.
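Counting restraint violations and reporting their size can be sketched as follows. This is a minimal illustration assuming upper-bound distance restraints only; the thresholds (0.1, 0.3, 0.5 Å) and the example distances are arbitrary choices, not values from the slides.

```python
import math


def violations(distances, upper_bounds):
    """Amount (Å) by which each model distance exceeds its upper bound."""
    return [max(0.0, d - u) for d, u in zip(distances, upper_bounds)]


def summarize(viol, thresholds=(0.1, 0.3, 0.5)):
    """Counts above each threshold, RMS violation and largest violation."""
    counts = {t: sum(v > t for v in viol) for t in thresholds}
    rms = math.sqrt(sum(v * v for v in viol) / len(viol))
    return counts, rms, max(viol)


# Hypothetical model distances and restraint upper bounds (Å)
model_distances = [4.0, 5.2, 6.0, 3.0]
upper_bounds = [5.0, 5.0, 5.5, 3.5]

viol = violations(model_distances, upper_bounds)
counts, rms, largest = summarize(viol)
print(counts, round(rms, 3), largest)
```

The reported numbers depend strongly on which restraints and which threshold are selected, which is presumably the point of the "Remember selection!" note.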

Evaluation of experimental data

RMS deviations:

Evaluation of experimental data

R-factors:
General: Gonzales et al., J. Magn. Reson. 91, 659-64 (1991), others
Clore et al., J.Am.Chem.Soc. 121, 9008-12 (1999)

Q-factors:
Cornilescu et al., J.Am.Chem.Soc. 120, 6836-7 (1998)
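The formulas on these slides did not survive extraction, so the sketch below uses forms that are assumptions rather than transcriptions: the RMS deviation between observed and back-calculated values, a generic R-factor of the form sum|obs - calc| / sum|obs|, and the Q-factor as rms(obs - calc) / rms(obs). The RDC values are invented.

```python
import math


def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))


def rmsd(obs, calc):
    """Root mean square deviation between observed and calculated values."""
    return rms([o - c for o, c in zip(obs, calc)])


def r_factor(obs, calc):
    """Generic R-factor (assumed form): sum|obs - calc| / sum|obs|."""
    return sum(abs(o - c) for o, c in zip(obs, calc)) / sum(abs(o) for o in obs)


def q_factor(obs, calc):
    """Q-factor (assumed form): rms(obs - calc) / rms(obs)."""
    return rmsd(obs, calc) / rms(obs)


# Hypothetical observed and back-calculated RDCs (Hz)
observed = [10.0, -5.0, 8.0, -12.0]
calculated = [9.0, -6.0, 10.0, -11.0]

print(round(rmsd(observed, calculated), 3),
      round(r_factor(observed, calculated), 3),
      round(q_factor(observed, calculated), 3))
```

All three measures are zero for perfect agreement; the R- and Q-factors are normalised by the size of the observed values, so they can be compared across data sets with different scales.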

Evaluation of experimental data

Quality of experimental data:
• Number (?)
• Completeness / RPF scores
• QUEEN.

Agreement of the structure with the experimental data:
• Number and size of violations.
• Root mean square deviations.
  • NOE, Dihedral angles etc.
• R-factors, Q-factors.
  • RDC, chemical shifts etc.
• Cross-validation.

Cross validation

J-Couplings
T1/T2 ratios
S2
RDCs
H-bonds
Chemical shifts / Chemical shift anisotropy
Database potentials
SAXS restraints
Ensemble or time-averaged
-> Cross-validation of the results
(Stone, J. Roy. Stat. Soc. B 36, 111-47, 1974)

Cross-validation

[Diagram labels: data, Partition, working set, test set, Algorithm, Parameters, model, Back-calculate, Score, n]

Cross-validation

Clore et al. J. Mol. Biol. (2006) 355, 879-86

Cross-validation

[Diagram labels: data, Partition, working set, test set, Algorithm, Parameters, model, Back-calculate, Score, n]

Problems:
• Different types of NMR data have very different information content.
• Individual data points have very different information content.

QUEEN: cross-validation

3GB1 model calculations
Cross validation approach
Nabuurs et al., J. Biomol. NMR 33, 123-34, 2005
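The partition / fit / back-calculate / score loop in the diagram can be sketched with a deliberately simple stand-in for structure calculation. In this hypothetical example the "model" is a single slope fitted by least squares to the working set, and the score is the RMS error of the back-calculated test-set values; the data, the number of folds and the random seed are all arbitrary.

```python
import math
import random


def fit_slope(points):
    """Least-squares slope of y = a*x through the origin (toy 'structure calculation')."""
    return sum(x * y for x, y in points) / sum(x * x for x, _ in points)


def rms_error(slope, points):
    """Back-calculate y from the model and score against the held-out data."""
    return math.sqrt(sum((y - slope * x) ** 2 for x, y in points) / len(points))


def cross_validate(points, n_folds=5, seed=0):
    shuffled = points[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]   # partition
    scores = []
    for i, test_set in enumerate(folds):
        working_set = [p for j, fold in enumerate(folds) if j != i for p in fold]
        model = fit_slope(working_set)                       # algorithm -> model
        scores.append(rms_error(model, test_set))            # back-calculate, score
    return scores


# Hypothetical noise-free data: y = 2x
data = [(float(x), 2.0 * x) for x in range(1, 11)]
scores = cross_validate(data)
print(scores)
```

The test set never enters the fit, so a low score on it indicates that the model generalises rather than merely reproduces its own input. The two problems listed on the slide apply directly: if the held-out points carry much more (or less) information than the rest, the score changes for reasons unrelated to model quality.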

NMR-VTF Phase 1

Superposition: Cyrange (distance variance matrix)
Assessing structured regions: Cyrange
Representative model: medoid
Restraints: 'simple validation' (number, violations per bin) of all restraints (distance, dihedral, H-bonds, RDCs, ..), also per model

Structure determination process

[Diagram labels: Experimental Data, Reinterpretation of data, Validated Structure, Data]
Spronk et al, Fig. 1
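Choosing a medoid as the representative model can be sketched as follows. This is a minimal illustration assuming a precomputed, symmetric matrix of pairwise RMSD values between the models of an ensemble; the matrix below is invented.

```python
def medoid_index(pairwise_rmsd):
    """Index of the model with the smallest summed distance to all other models."""
    return min(range(len(pairwise_rmsd)), key=lambda i: sum(pairwise_rmsd[i]))


# Hypothetical pairwise backbone RMSD matrix (Å) for a three-model ensemble
rmsd_matrix = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.5],
    [2.0, 1.5, 0.0],
]
representative = medoid_index(rmsd_matrix)
print("representative model index:", representative)
```

Unlike an averaged structure, the medoid is an actual member of the ensemble, so its geometry is that of a real calculated model.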

Validation of protein structure quality

How is the quality of the properties expressed?
• Z-scores, RMS Z-scores (WHAT IF)
What type of properties are important?
How can we check these properties?

Normal distributions and Z-scores

Normal distributions and RMS Z-scores

[Figure panels: RMS Z-score ~0.5; RMS Z-score = 1.0 (reference); RMS Z-score ~2]
Spronk et al., Fig. 8

Z-scores and RMS Z-scores

Structure Z-scores:
• Z-scores > 0 are "better" than average
• Z-scores < 0 are "worse" than average
• However: A Z-score of -1 is equally normal as a Z-score of +1!!
RMS Z-scores:
• Too tight restraining of geometry: RMS Z-score < 1
• Too loose restraining of geometry: RMS Z-score > 1
• Proper Gaussian distribution: RMS Z-score ~1

The WHAT IF reference set

Structure Z-scores:
• Well refined high resolution X-ray structures (resolution < 2.0 Angstrom, R-factor < 19%)
• Continuously updated
Local geometry RMS Z-scores:
• Cambridge small molecule database (CSD)
• Well refined high resolution X-ray structures
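The distinction between a Z-score and an RMS Z-score can be made concrete with a short calculation. The sketch below assumes a reference mean and standard deviation for one bond type (1.525 ± 0.021 Å, a hypothetical value chosen for illustration) and two invented sets of bond lengths.

```python
import math


def z_score(value, ref_mean, ref_sd):
    """Number of reference standard deviations a value lies from the reference mean."""
    return (value - ref_mean) / ref_sd


def rms_z_score(values, ref_mean, ref_sd):
    """Root mean square of the individual Z-scores; ~1 for a proper Gaussian spread."""
    zs = [z_score(v, ref_mean, ref_sd) for v in values]
    return math.sqrt(sum(z * z for z in zs) / len(zs))


REF_MEAN, REF_SD = 1.525, 0.021  # hypothetical reference values (Å)

tight = [1.525, 1.527, 1.523, 1.526]   # geometry restrained too tightly
loose = [1.490, 1.570, 1.500, 1.560]   # geometry restrained too loosely

print(round(rms_z_score(tight, REF_MEAN, REF_SD), 2))  # well below 1
print(round(rms_z_score(loose, REF_MEAN, REF_SD), 2))  # above 1
```

An RMS Z-score far below 1 means the values cluster more tightly than in the reference set, which is as much a warning sign as a value far above 1.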

Validation of protein structure quality

How is the quality of the properties expressed?
What type of properties are important?
How can we check these properties?

Validation criteria for protein structures

Overall quality:
• Ramachandran plot, rotameric states, packing quality, backbone conformation
Local geometry:
• Bond lengths, bond angles, chirality, omega angles, side chain planarity
Others:
• Inter-atomic bumps, buried hydrogen-bonds, electrostatics

Validation criteria for protein structures

Overall quality:
• Ramachandran plot, rotameric states, packing quality, backbone conformation
Local geometry:
• Bond lengths, bond angles, chirality, omega angles, side chain planarity
Others:
• Inter-atomic bumps, buried hydrogen-bonds, electrostatics

Overall quality: Ramachandran Plot

Ramachandran plot of φ and ψ angles
[Figure regions: Allowed; Additionally allowed; Generously allowed; Forbidden]
Spronk et al., Fig. 12
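The φ and ψ angles plotted in a Ramachandran plot are dihedral angles defined by four consecutive backbone atoms (φ: C(i-1), N, CA, C; ψ: N, CA, C, N(i+1)). The sketch below shows one common way to compute a dihedral from Cartesian coordinates; the coordinates in the example are invented and not taken from any structure.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) around the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Components of b0 and b2 perpendicular to the central bond
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Hypothetical four-atom arrangement with a +90 degree dihedral
angle = dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0))
print(round(angle, 1))
```

Applying `dihedral` to the four atoms listed for φ and ψ of each residue gives the coordinates of that residue in the Ramachandran plot.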

Overall quality: Ramachandran plot

[Figure panels: Z-score = +1.8; Z-score = -8.5]
Spronk et al., Fig. 13

Overall quality: Ramachandran Plot

Residue-specific Ramachandran plot

Overall quality: Rotameric states

[Figure panels: Eclipsed; Staggered]

Overall quality: dihedral angle distributions

Chi-1/Chi-2: Janin plot
[Figure panels: Z-score = +1.9; Z-score = -1.5]

Overall quality: packing quality

[Figure panels: Bad packing; Good packing]

Overall quality: backbone conformation

[Figure panels: Very normal, Z-score = +0.8; Very unique, Z-score = -14]
Spronk et al., Fig. 16

Validation criteria for protein structures

Overall quality:
• Ramachandran plot, rotameric states, packing quality, backbone conformation
Local geometry:
• Bond lengths, bond angles, chirality, omega angles, side chain planarity
Others:
• Inter-atomic bumps, buried hydrogen-bonds, electrostatics

Local quality: bonded geometry

[Figure panels: D-amino acid; L-amino acid; Distorted Cα-chirality (common)]

Local quality: bond length distributions

[Figure panels: RMS Z-score = 0.96; RMS Z-score = 0.22]
Spronk et al., Fig. 8

Local quality: omega angles

Trans-conformation (omega=180°)
Cis-conformation (omega=0°)
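Classifying a peptide bond as trans or cis and measuring how far it deviates from planarity can be sketched as below. This is a minimal illustration using the two ideal values from the slide (180° and 0°); any cut-off for calling a deviation "too large" would be an additional assumption and is deliberately left out.

```python
def omega_deviation(omega_deg):
    """Return (conformation, deviation in degrees from the nearest planar value)."""
    wrapped = (omega_deg + 180.0) % 360.0 - 180.0   # wrap into [-180, 180)
    dev_trans = 180.0 - abs(wrapped)
    dev_cis = abs(wrapped)
    if dev_trans <= dev_cis:
        return "trans", dev_trans
    return "cis", dev_cis


# Hypothetical omega values (degrees)
for omega in (180.0, 175.0, -178.0, 8.0):
    print(omega, omega_deviation(omega))
```

Collecting the deviations over all residues gives the kind of |ω deviation| distribution that is compared between structures.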

Local quality: angle distributions

[Figure panels: Lysozyme X-ray (PDB 3LZT), |ω deviation|; Lysozyme NMR (PDB 1E8L), |ω deviation|]
Spronk et al. Fig. 15

Validation criteria for protein structures

Overall quality:
• Ramachandran plot, rotameric states, packing quality, backbone conformation
Local geometry:
• Bond lengths, bond angles, chirality, omega angles, side chain planarity
Others:
• Inter-atomic bumps, buried hydrogen-bonds, electrostatics

Local quality: side chain planarity

[Figure panels: Planar ARG side-chain (Good); Non-planar ARG side-chain (Bad)]
Spronk et al. Fig. 10

Other quality indicators: inter-atomic bumps

Overlap of two backbone atoms
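A simple bump check flags non-bonded atom pairs that come closer than the sum of their van der Waals radii allows. The sketch below is an illustration only: the radii are commonly quoted approximate values, the 0.4 Å tolerance is an arbitrary assumption, the atoms are invented, and covalently bonded pairs must be excluded by the caller.

```python
import math

# Approximate van der Waals radii in Å (assumed values for illustration)
VDW_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52}


def find_bumps(atoms, tolerance=0.4, bonded=frozenset()):
    """Return (name_i, name_j, overlap) for atom pairs closer than r_i + r_j - tolerance.

    atoms: list of (name, element, (x, y, z)); bonded: set of index pairs to skip.
    """
    bumps = []
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            if (i, j) in bonded:
                continue
            name_i, elem_i, xyz_i = atoms[i]
            name_j, elem_j, xyz_j = atoms[j]
            limit = VDW_RADII[elem_i] + VDW_RADII[elem_j] - tolerance
            distance = math.dist(xyz_i, xyz_j)
            if distance < limit:
                bumps.append((name_i, name_j, round(limit - distance, 2)))
    return bumps


# Hypothetical atoms: the first two overlap, the third is far away
atoms = [
    ("CA1", "C", (0.0, 0.0, 0.0)),
    ("O2", "O", (2.5, 0.0, 0.0)),
    ("N3", "N", (6.0, 0.0, 0.0)),
]
bumps = find_bumps(atoms)
print(bumps)
```

The all-pairs loop is quadratic in the number of atoms; for a whole protein a spatial grid or neighbour list would normally be used instead.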

Other quality indicators: electrostatics

[Figure panels: "Bad" electrostatics; Good electrostatics]
Spronk et al. Fig. 17

Other quality indicators: internal hydrogen bonding

Spronk et al. Fig. 17

Validation of protein structure quality

How is the quality of the properties expressed?
What type of properties are important?
How can we check these properties?
• Visual inspection
• "Legacy tools": PROCHECK / WHAT IF / WHAT_CHECK / Molprobity
• Program suites: Protein Structure Validation Suite (PSVS), ResProx, Vivaldi, CING

Validation of protein structure quality

Table 1. Reviewed validation programs and supported features. Abbreviations used: DR: Distance Restraint; DHR: Dihedral restraint; RDC: Residual Dipolar Coupling; CV: Circular Variance; IVM: Inter-atomic distance variance matrix; ROG: Red/Orange/Green; RMSD: Root-Mean-Square Deviation; HB: Hydrogen bond
[Table body not recoverable from the extracted text]
Vuister et al., J.Biomol.NMR, in press

Structural quality

DR1885 apo / copper-bound
Visual inspection is the easiest tool!
[Figure: RMS Z-scores output of WHATIF]

Structure validation programs

PROCHECK and PROCHECK_NMR:
• Very useful graphical and text output.
• No longer maintained.
• Relatively easy to run.
• Part of PSVS and CING.

Structure validation programs

WHAT IF / WHAT_CHECK:
• More checks and more critical checks.
• The reference data base of X-ray structures is continuously updated.
• Difficult to install and run, massive (text) output.
• Part of Vivaldi and CING.

A WHAT IF summary report

Structure validation programs

Molprobity

Structure validation programs

Protein Structure Validation Suite (PSVS)
Bhattacharya et al. Proteins (2007) vol. 66 (4) pp. 778-95

In this article, we present the Protein Structure Validation Software (PSVS) suite for consistent and rapid evaluation of the quality of protein structures, with a focus on NMR structures and homology models. PSVS provides a standardized set of quality scores and constraint analyses for each input structure. In addition to experimental constraints, this set encompasses a number of parameters evaluating different aspects of the structure, including fold and packing, local residue separation, deviations of bond lengths and bond angles, backbone and side-chain torsion angle stereochemistry, and steric overlaps between atoms. These data allow both global and site-specific structure quality assessments. A graphical user interface (GUI) runs the analysis and integrates information reported by several structure quality evaluation tools. Quality scores, calibrated with a set of high-quality X-ray crystal structures ( 1.8 Å resolution), are summarized as Z scores for several structure validation analysis programs. The output consists of a standard set of tables and graphs and a concise validation report. The PSVS software is broadly applicable in projects. As a demonstration of the value of the PSVS server, we apply these tools to evaluate protein structures determined by different Structural Genomics Consortia, and compare the distributions of quality scores in these structures with X-ray crystal and solution NMR structures deposited in the PDB in recent years.

METHODS

Tools for Structure Quality Evaluation

PSVS incorporates published software developed by other research groups and in our own laboratory that have been integrated under a single graphical user interface. Table I summarizes the software tools supported by the current version of PSVS. PDBStat is a C++ program used to perform various statistical analyses given the Cartesian coordinates of a protein and a list of spatial constraints used to generate the structure (if available). The program is able to read and write coordinates and/or constraints in different standard formats (CONGEN,[29] XPLOR/CNS,[30][31] PDB,[8] and DYANA/CYANA[32]), handling the different hydrogen naming conventions, and can deal with proteins with multiple chains and/or models. PDBStat also produces a constraint satisfaction analysis for distance or dihedral constraints, giving a summary with minimum, maximum, average, and root-mean-square violations, with violations classified in ranges. The program also provides a summary of experimental distance constraints, including the numbers of conformationally restricting distance constraints classified into intra and sequential (backbone/backbone, backbone/side-chain and side-chain/side-chain), long range, hydrogen bond, and disulfide bond constraints. PDBStat is also used to filter out conformationally nonrestricting intraresidue and sequential NOE constraints, if the constraint is too restricting, nonrestricting, or corresponds to a fixed distance, based on the ranges imposed by molecular geometry. In addition, it filters out duplicate constraints, and constraints for identical atom pairs with different bounds. PDBStat also generates contact maps based on coordinates or constraints, calculates atomic rmsd's for an ensemble of structures, and evaluates structural order parameters for backbone and dihedral angles[33] in order to assess how well local structure is defined across an ensemble of models. The program is also used to fit coordinates to a specified model, and translate and rotate coordinates to optimally superpose them for all or a selected set of atoms, over the average structure or an individual model. Some other functions of PDBStat include performing a simple close contact analysis, main chain and side chain (for Ile and Thr) chirality analysis, and an analysis of hydrogen bond satisfaction and classification (based on geometric parameters).

Structure validation programs

Protein Structure Validation Suite (PSVS)
• Based upon 'known' tools.

Table I. Tools Used by PSVS to Evaluate Different Aspects of Structure Quality

Tool(s) | Parameter(s) evaluated
PDBStat | Define ordered regions of the structure, and analyze numbers of conformationally-restricting constraints, violations of constraints, and rmsd of superimposed atomic coordinates
RPF[15] | Goodness-of-fit of protein NMR structure with NOESY peak list and resonance assignment data
DSSP[36] | Calculate secondary structure
PROCHECK G-factors[3] | Probability of dihedral angles of a residue type to be within a given range
MolProbity[5][26-28] | Calculate and visualize bad contacts, atomic overlaps, and Cβ position deviations
Verify3D[6] | Likelihood of the amino acid sequence to have the three-dimensional packing seen in the structure
ProsaII[7] | Pseudo energy of pair-wise interactions from the spatial separation of residues
PDB validation software[8] | Close contacts, deviations of bond lengths, and bond angles from ideality

Bhattacharya et al. Proteins (2007) vol. 66 (4) pp. 778-95

Structure validation programs

Vivaldi

Structure validation programs

CING

CING: Common Interface for NMR structure Generation

User friendly interface to WHAT IF/QUEEN/PROCHECK/Aqua/SHIFTX/Wattos/DSSP/.. results and reports.
• Residue oriented
• Validation and data together
• Hyperlinked HTML
• Color-coded (red, orange, green) (ROG-score)
• Automated export to multiple formats
• API to data and validation results
Smart and guide user with suggestions in troublesome areas!

CING: Data flow

• Input: CYANA, XEASY, PDB files, XPLOR, CCPN, ...
• Output: "ROG" HTML, Text "ROG", ...
• Plugins: access to external programs

CING: checks

Correction of minor errors; e.g. nomenclature.
• CING's internal database
• WHATIF
• CCPN import
Validation of resonance assignments.
• Relative to database
• Inconsistencies (e.g. both pseudo-atom and explicit atoms assigned, stereo-specific assignments).
• Expected assignments.
Validation of experimental restraints.
• CING
• QUEEN
• AQUA
• Wattos
• RPF (soon)
Validation of stereochemical quality.
• Whatif
Validation of structural quality.
• Whatif
• Procheck_NMR
• CING
Analysis of structural results.
• Secondary structure (dssp).
• Salt-bridges.
• Potential di-sulfide bridges.
• Proline cis-trans, Leu/Val pro-chiral methyls.
• Talos+ predictions

CING API

API: high-level language, high-level constructs.
Scriptable and interactive usage (ipython).
Easy access to all data (shifts, peaks, restraints, coordinates, validation data).
Data is 'linked', reflecting their relationships.
Simple solutions for complicated problems.

CING: ROG color coding

ROG Color coding:
red: bad
orange: problems
green: good

CING: ROG color coding

TAF homology domain
2PP4: Wei et al, Nature. Struct. Mol. Biol. 14, 653, 2007
2H7B: Plevin et al. PNAS 103, 10242, 2006

CING: ROG color coding

2PP4: red: 25 (23%), orange: 36 (34%), green: 46 (43%)
2H7B: red: 48 (46%), orange: 44 (42%), green: 13 (12%)

NRG-CING

Converted NRG NMR restraints
Import as CCPN projects
4102 5080 5669 5839 8878 10002 Entries (>95%)

http://nmr.cmbi.ru.nl/NRG-CING

Doreleijers et al, J.Biomol. NMR, 2009
Doreleijers et al, NAR 2012
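The percentages quoted for the two TAF homology domain entries follow directly from the per-residue ROG counts. The short sketch below reproduces them; the function name is hypothetical and is not part of the CING API.

```python
def rog_percentages(counts):
    """Convert per-colour residue counts into rounded percentages of the total."""
    total = sum(counts.values())
    return {colour: round(100 * n / total) for colour, n in counts.items()}


# Residue counts as given on the slide
entry_2pp4 = {"red": 25, "orange": 36, "green": 46}
entry_2h7b = {"red": 48, "orange": 44, "green": 13}

print("2PP4", rog_percentages(entry_2pp4))
print("2H7B", rog_percentages(entry_2h7b))
```

The same sequence region thus scores 43% green in one entry and 12% in the other, which is the contrast the colour-coded structures are meant to show.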

NRG-CING: results

[Histogram: Frequency vs. protein size (# residues)] Average: 92
[Histogram: Frequency vs. # of models] Most often occurring: 20

NRG-CING: results
Distance restraints

[Histogram: Frequency vs. average # distance restraints per residue] 15 restraints per residue
[Plot: Restraint surplus (%) vs. entries] Doreleijers et al. J.Biomol. NMR, 2009
[Histogram: Frequency vs. completeness per residue] ~50% completion

NRG-CING: results
Dihedral restraints

[Histogram: Frequency vs. average # dihedral restraints per residue] 1.1 restraints per residue
[Histogram: Frequency vs. # dihedral restraints per residue] 2 restraints per residue; Talos?

NRG-CING: results
RDC restraints

[Histogram: Frequency vs. average # RDC restraints per residue] 1.2 restraints per residue
[Histogram: Frequency vs. # RDC restraints per residue] 1 restraint per residue; 15N-1H

NRG-CING: results
Quality

[Histogram: Frequency vs. overall WHATIF QualityCheck] -3.3
[Histogram: Frequency vs. per residue WHATIF QualityCheck]
[Histogram: Frequency vs. rms Z-score omega angle]
> 2500 Non-refined structures

NRG-CING: results

[Plot: ROG green (%) vs. ROG red (%), both axes 0 to 100; regions labelled "fine" and "problematic"]

CING: ROG color coding

1HKT (Vuister et al, 1994)
red: 67 (63%)
orange: 20 (19%)
green: 19 (18%)

NRG-CING: results
[Scatter plots: ROG red (%) and ROG green (%) against PROCHECK most favoured (%); all axes 0 to 100]

NRG-CING: results
[Scatter plots: ROG red (%) and ROG green (%), 0 to 100, against PROCHECK additionally allowed (%), 0 to 60]

NMR-VTF Phase 1
• Superposition: Cyrange (distance variance matrix)
• Assessing structured regions: Cyrange
• Representative model: medoid
• Restraints: 'simple validation' (number, violations per bin) of all restraints (distance, dihedral, H-bonds, RDCs, ..), also per model
• Geometric validation: follow the wwPDB X-ray VTF choices: MolProbity, RosettaHoles; also residue-specific values
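The "representative model: medoid" choice can be illustrated with a small Python sketch. The medoid of an ensemble is the member whose summed distance to all other members is smallest. The code assumes a symmetric pairwise RMSD matrix has already been computed after superposition; the matrix values and the name `medoid_index` are hypothetical and only serve the illustration.

```python
def medoid_index(dist):
    """Index of the model with the smallest summed distance to all other models."""
    sums = [sum(row) for row in dist]
    return min(range(len(dist)), key=sums.__getitem__)


# Hypothetical symmetric pairwise backbone RMSD matrix (in Å) for four models.
rmsd = [
    [0.0, 1.2, 0.8, 2.5],
    [1.2, 0.0, 0.9, 2.1],
    [0.8, 0.9, 0.0, 1.9],
    [2.5, 2.1, 1.9, 0.0],
]

representative = medoid_index(rmsd)
print(f"Representative model: {representative + 1}")
```

Unlike an averaged structure, the medoid is always an actual member of the deposited ensemble, so its geometry is not distorted by coordinate averaging.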

NMR-redo

Automated framework for the re-computation starting from the NRG-CING data

Protocol
• Selection
• Calculate
• Validate

Water refinement protocol in Xplor-NIH

Aim: Automatically fix errors in the structures

NMR_REDO: large-scale recalculation of NMR structures in the cloud

Wouter G. Touw (1), Jurgen F. Doreleijers (1), Gjalt van Rutten (2), Geerten W. Vuister (3), Gert Vriend (1)
(1) CMBI, NCMLS, Radboud University Medical Centre, Nijmegen, NL; (2) Bitbrains IT Services, Amstelveen, NL; (3) Department of Biochemistry, University of Leicester, UK

Introduction

The determination of solution structures of proteins by NMR is an elaborate process with many imperfect experimental and computational steps that are often based on largely empirically determined procedures. Over time, these procedures have become more advanced. By applying today's improved technology to the original data, better structures can be calculated [1,2], especially when the original structure was calculated 10 or 20 years ago. The format, completeness, and quality of deposited experimental data are highly heterogeneous and until recently have precluded large-scale recalculations. The NMR Restraints Grid now provides much of the original data in properly remediated, consistent, and validated repositories that allow us to redetermine very many NMR structures in the PDB.

Objective: NMR_REDO aims to improve the quality of NMR-derived protein structure ensembles in terms of fit with the experimental data and geometric quality by automated recalculations.

Cloud setup

[Figure 1 diagram: NRG-CING (10,000 entries) → (A) get experimental data (distance, dihedral angle and hydrogen bond restraints), prepare recalculation → (B) 200 extended structures → 50 annealed structures → 25 refined structures → (C) validate ensemble (CING, WHAT IF, PROCHECK-NMR, QUEEN, WATTOS) → (D) NMR_REDO CING project (3400 entries, 10^5 h CPU)]

Figure 1: The NMR_REDO pipeline runs fully automatically in the cloud and consists of three main steps for each structure. (A) A CING [3] project with all necessary data is retrieved from the NRG-CING [4] database and prepared for (B) simulated annealing with Xplor-NIH and water refinement [2]. The final ensemble is (C) validated and (D) stored in the NMR_REDO database.

Better fit to the original data

NMR_REDO ensembles generally show a slightly better fit with experimentally measured distance restraints (Fig. 2A) and dihedral restraints (Fig. 2B) than the original structures. Furthermore, they are closer to crystal structures of the same protein in cases where the latter are available.

[Figure 2 panels: (A) Distance restraints: violations > 0.1 Å against violations > 0.3 Å; (B) Dihedral angle restraints: RMS violation (°) against frequency]

Figure 2: (A) Difference between the number of distance restraint violations per model in the ensemble (recalculated - original) for violation thresholds 0.1 and 0.3 Å. The number of structures in each quadrant is shown in grey (structures on the line not counted). (B) RMS dihedral angle violation distributions of the original and recalculated ensembles.

Higher geometrical quality

The recalculated ensembles on average have better scores for several independent validation criteria, such as coarse- and fine-grained packing quality and Ramachandran and Janin plot quality (Fig. 3). The improved fit to high-resolution X-ray structures generally coincides with an improved fit to experiment.

[Figure 3 panels: (A) Ramachandran and (B) Janin: Z-score against frequency]

Figure 3: (A) φ,ψ and (B) χ1,χ2 plot appearance Z-score distributions of the original and recalculated ensembles. The Z-scores have been calculated with WHAT IF and have been standardized to a reference database of high-resolution crystal structures.

Future

The protocol will be adapted to allow calculations involving orientational restraints, ligands, and multimers. To date, 3400 structures have been recalculated and we expect this number to increase to 5000 of the 10,000 structures deposited in the PDB when the protocol can handle all usable deposited experimental data. In due time, the NMR_REDO database may therefore serve as a baseline for potential methodological improvements.

Acknowledgements

We acknowledge all people who have contributed to CING. This project would not have been possible without the generous donation of High Performance Cloud core hours by Bitbrains.

References

1. DRESS: Nabuurs SB et al. (2004) Proteins 55(3):483-6.
2. RECOORD: Nederveen AJ et al. (2005) Proteins 59(4):662-72.
3. CING: Doreleijers JF et al. (2012) J Biomol NMR 54(3):267-83.
4. NRG-CING: Doreleijers JF et al. (2012) Nucleic Acids Res. 40:D519-24.

Wouter Touw, MSc.
Protein Structure Bioinformatics
http://www.cmbi.ru.nl/~wtouw/NBIC2013.pdf
[email protected]

NMR-redo
[Figure panels: differences in the number of distance restraint violations before and after; dihedral restraint violations before and after]

NMR-redo
[Figure panels: Ramachandran Z-scores before and after; Janin (χ1-χ2) Z-scores before and after]
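The quantity plotted in Figure 2A of the NMR_REDO poster, the difference in the number of distance restraint violations per model (recalculated minus original) at thresholds of 0.1 and 0.3 Å, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-restraint violation values are hypothetical, the function names are invented here, and the sketch does not reproduce how CING or the NMR_REDO pipeline actually computes violations.

```python
def count_violations(violations, threshold):
    """Number of restraints violated by more than `threshold` (in Å)."""
    return sum(1 for v in violations if v > threshold)


def violation_difference(original, recalculated, thresholds=(0.1, 0.3)):
    """Per-threshold difference (recalculated - original) in violation counts."""
    return {
        t: count_violations(recalculated, t) - count_violations(original, t)
        for t in thresholds
    }


# Hypothetical per-restraint violations (in Å) for one model of each ensemble.
original = [0.05, 0.12, 0.35, 0.50, 0.0, 0.22]
recalculated = [0.02, 0.08, 0.31, 0.15, 0.0, 0.05]

diff = violation_difference(original, recalculated)
print(diff)
```

A negative value at both thresholds places a structure in the quadrant where the recalculated model violates fewer restraints than the original.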

NMR-redo
• Increase success rate of automated setup.
• Include automated RDC refinement.
• Implement automated error corrections, e.g. resulting from dynamics, mis-assignments, cis-trans prolines, Talos restraints.
• Secure funding.

CASD-NMR
http://www.wenmr.eu/wenmr/casd-nmr
Rosato et al., Nature Meth. 6, 625, 2009
Rosato et al., Structure 2012

[Slide image: Correspondence, Nature Methods, Vol. 6, No. 9, September 2009, p. 625]

CASD-NMR: critical assessment of automated structure determination by NMR

To the Editor: NMR spectroscopy is currently the only technique for determining the solution structure of biological macromolecules. This typically requires both the assignment of resonances and a labor-intensive analysis of multidimensional nuclear Overhauser effect spectroscopy (NOESY) spectra, in which peaks are matched to assigned resonances. Software tools that fully automate the NOESY assignment and the structure calculation steps have the potential to boost the efficiency, reproducibility and reliability of NMR structures.

Within the e-NMR project (http://www.e-nmr.eu/), which is funded by the European Commission (project number 213010), we sought to assess whether such automated methods can indeed produce structures that closely match those manually refined by experts using the same experimental data (the 'reference structures'). We just completed the first comparison of automated NMR protein structure calculation methods and now announce its continuation in the form of an ongoing, community-wide experiment, called critical assessment of automated structure determination of proteins by NMR (CASD-NMR). CASD-NMR is open for members of any laboratory to participate and/or to submit targets. The concept closely resembles that of other community-wide experiments, such as the critical assessment of techniques for protein structure prediction (CASP) [1] and the critical assessment of prediction of interactions (CAPRI) [2]. Unlike CASP and CAPRI, CASD-NMR is entirely based on the analysis of experimental data, which presents special issues in assembling, organizing and distributing these data among participants.

In the first year of the CASD-NMR experiment, we provided seven research teams involved in developing fully automated structure assignment tools with ten experimental data sets for various protein systems of known structure and two sets for protein structures not yet publicly available (tests performed in a blinded fashion), courtesy of the Northeast Structural Genomics (NESG) consortium. We then met in Florence, Italy on May 4-6, 2009 to analyze the structures generated (Fig. 1) by comparison to the reference structures and by using independent methods for structure validation. This first experiment indicated that although most of the automated structure determinations had correct overall folds, for certain targets some programs did not calculate accurate packing and length of secondary structure elements. The root mean square (r.m.s.) deviations of the automatically generated backbone coordinates with respect to the reference structures were typically 1-2 Å but in some cases were as high as 9 Å.

We anticipate that the complete automation of protein solution structure determination from assigned chemical shift lists and unassigned NOESY peak lists may soon reach the point at which 'unsupervised' results can be directly deposited to the Protein Data Bank (PDB). It is therefore meaningful and timely [3] to implement CASD-NMR as a community-wide rolling experiment. We invite software developers to participate in CASD-NMR to test their fully automated protocols on masked data sets and produce structures as if they would directly deposit them to the PDB. We will regularly release masked test data sets for proteins whose solution structure will be kept on hold by the PDB for at least eight weeks. We also invite members of any NMR group that is about to deposit a structure in the PDB to contribute a masked test case to CASD-NMR. To guarantee that a sufficient amount of data is available, the NESG consortium of the NIH Protein Structure Initiative (PSI) will also provide one data set per month. A masked data set will include the protein sequence, chemical shift assignments and unassigned integrated NOESY peak lists. Data providers may also include additional biochemical information and raw spectral data. These experimental data will be available from a central database of the e-NMR project (http://www.e-nmr.eu/CASD-NMR/) and through the PSI Knowledgebase (http://kb.psi-structuralgenomics.org/), also after the release of the reference PDB structure. This will allow ...

Figure 1 | Performance of various automated structure calculation methods. The results of fully automated calculations by various programs for one of the masked test data sets of the 2009 Florence workshop compared to the reference structure (bottom right) determined by Aramini, J.M. et al., NESG Consortium (unpublished data; PDB: 2kif).

CASD-NMR 2013: Entries
Total: 177 (10 original)
Targets: 10; Groups: 12; Methods: 16

Results
Total: 177 (10 original)
[Figure labels: "recognised as incorrect"; "Original"]

Results
Total: 177 (10 original)
[Figure label: "Original"]

Rosato et al., Structure 2012

[Slide image, shown on two consecutive slides: journal page from Structure with the running header "Automated Protein Structure Determination by NMR"]

Figure 1. Structural Similarity between Reference and CASD-NMR2010 Structures
RMSD (A) and GDT_TS score (B) deviation of the backbone coordinates (for ordered residues only; see Table S1) with respect to the reference structure for the various algorithms. GDT_TS is the average fraction of residues that can be superimposed to within four different distance cutoffs (1, 2, 4, and 8 Å) and ranges between 0% and 100%. For each structure, the automatically generated average conformer has been used for calculations. The dashed lines are at 2 Å for RMSD and at 80% of superimposable residues for GDT_TS, corresponding to our thresholds for acceptable performance. See also Table S2. The box parameters are as follows: The box range goes from the first to the third quartile; box whiskers identify the minimum and maximum values; the square within the box identifies the mean; the thick line in the box identifies the median. The starred boxes correspond to algorithms for which less than 60% of the targets were submitted.

... and run on three randomly selected targets, featured a similarity to the corresponding reference structures in line with NOESY restraint-driven methods. Cheshire-YAPP uses initial (pure CS) Cheshire models to assign NOESY distance restraints used to refine the models. For CS-DP-Rosetta, which uses NOESY information only to re-rank the CS-based models, the deviation from the manual reference structures was close to that of the NOESY restraint methods, with a range of RMSD and GDT_TS values of ...

... [Fig]ure 1. Note that the poorer appearance of the CS-Rosetta server, which was run via the web server developed in the e-NMR project (Bonvin et al., 2010), is partly due to inclusion of nonconverged solutions in the comparison. It can be concluded that NOESY-based methods delivered more consistent and robust performances than CS-based methods (resulting in smaller boxes in Figures 1A and 1B), yielding structures on average closer to the reference. NOESY-filtering as in CS-DP-Rosetta could recover some but not all of the consistency and reliability of the restraint-driven methods (discussed later). Notably, the CS-methods (regardless of whether augmented with NOESY information) are computationally much more demanding than NOESY-based methods.

Regarding individual targets, the one with the lowest performance across all methods was AR3436A (Table 2), a 97-amino-acid protein. Our target selection included three proteins with more than 100 residues (HR5536A, AtT13, and CgR26A), for all of which NOESY-based methods were able to automatically generate accurate structures. Instead, purely CS-based methods failed for all of them, whereas CS-based methods augmented with NOESY data were successful in nearly all cases.

All the results examined in the preceding paragraphs address the degree of similarity to the manually solved reference structure. Additional insight can be obtained by the evaluation of the degree of convergence among the different programs. This has been measured as the mean RMSD among the average conformers obtained with the automatically generated methods (Table S3). For the NOESY-based algorithms, the mean RMSD for each target was in the range of 0.9 to 3.0 Å, with four targets featuring a mean RMSD lower than 1.0 Å and eight targets being within 2.0 Å. If CS-based methods augmented with NOE cross-peak information are also included, the mean RMSD range widens slightly up to 3.3 Å, still with eight targets having a mean RMSD lower than the 2.0 Å threshold. Instead, inclusion of all methods yielded values as large as 6.2 Å (Table S3). The present evaluation of convergence is much more stringent than the standard recalculation with different random number seeds, because in each calculation the NOE assignments have been determined independently and with different methods.

A further measure of accuracy would be the comparison with a completely independent structure determination. This is, at present, possible for only two targets (VpR247 and PgR122A), for which the PDB contains X-ray structures of relatively close homologs (40%-50% sequence identity). These allowed us to build reliable structural models that can be used as the structural reference for comparisons (Table S4). For PgR122A, the relevant structure is 3HVZ (Forouhar et al., 2009). The homology model of PgR122A built on this structure shows a backbone RMSD of 0.77 Å to the average coordinates of the reference structure. All methods yielded structures within 1.5 Å from the homology model, with the majority being actually within 1 Å. For VpR247, there are several related crystal structures of the S. pombe homolog, in the free or ligand-bound form. The model built on the DNA-complexed protein (3GX4; Tubbs et al., 2009) is closer to the reference VpR247 structure than the model built on the ...
This is, at 0.3–3.3The box parameters A˚ and 55%–90%, are as follows: respectively, The box range and 70%goes from of targets the first falling to the free protein (3GVA), with backbone RMSD values of 1.4 A˚ and present, possible for only two targets (VpR247 and PgR122A), withinthird quartile; the thresholds box whiskers described identify the earlier. minimum Finally, and maximum pure CS-based values; the 2.1 A˚ , respectively. Similarly, nearly all the automatically gener- for which the PDB contains X-ray structures of relatively close methodssquare within had the the box poorest identifies performance the mean; the thick in terms line in of the closeness box identifies to ated structures are more similar to the former than the latter homologs (40%–50% sequence identity). These allowed us to the median. reference The structures, starred boxes as correspond it is apparent to algorithms from Table for which 2 and lessFig- than model. With the exception of the ARIA and CS-Rosetta server 60% of the targets were submitted. build reliable structural models that can be used as the structural 230 Structure 20, 227–236, February 8, 2012 ª2012 Elsevier Ltd All rightsreference reserved for comparisons (Table S4). For PgR122A, the relevant structure is 3HVZ (Forouhar et al., 2009). The homology model of and run on three randomly selected targets, featured a similarity PgR122A built on this structure shows a backbone RMSD of to the corresponding reference structures in line with NOESY 0.77 A˚ to the average coordinates of the reference structure. restraint-driven methods. Cheshire-YAPP uses initial (pure CS) All methods yielded structures within 1.5 A˚ from the homology Cheshire models to assign NOESY distance restraints used to model, with the majority being actually within 1 A˚ . For VpR247, refine the models. For CS-DP-Rosetta, which uses NOESY infor- there are several related crystal structures of the S. 
pombe mation only to re-rank the CS-based models, the deviation from homolog, in the free or ligand-bound form. The model built on Results: quality assessmentthe manual overview reference structures was close to that of the NOESY the DNA-complexed protein (3GX4; Tubbs etResults: al., 2009) is closer quality assessment overview restraint methods, with a range of RMSD and GDT_TS values of to the reference VpR247 structure than the model built on the 0.3–3.3 A˚ and 55%–90%, respectively, and 70% of targets falling free protein (3GVA), with backbone RMSD values of 1.4 A˚ and within the thresholds described earlier. Finally, pure CS-based 2.1 A˚ , respectively. Similarly, nearly all the automatically gener- methodsbetter had the poorest performance in terms of closeness to ated structures are more similar to the former than the latter better better the reference structures, as it is apparent from Table 2 and Fig- model. With the exception of the ARIA andbetter CS-Rosetta server

230 Structure 20, 227–236, February 8, 2012 ª2012 Elsevier Ltd All rights reserved
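The two measures named in the Figure 1 caption can be illustrated with a minimal Python sketch. It assumes the two coordinate sets are already superimposed and restricted to ordered residues; a full GDT_TS calculation searches for the best superposition at each cutoff, which this simplification skips. The function names and the toy coordinates are hypothetical and not taken from the CASD-NMR software.

```python
import math

def backbone_rmsd(ref, model):
    """RMSD (in Angstrom) between two equally long lists of (x, y, z) tuples.

    ASSUMPTION: the coordinates are already superimposed.
    """
    if len(ref) != len(model) or not ref:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = [math.dist(p, q) ** 2 for p, q in zip(ref, model)]
    return math.sqrt(sum(squared) / len(squared))

def gdt_ts(ref, model, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS in percent.

    Average, over the four cutoffs, of the fraction of atoms lying within
    each cutoff of their reference position. Uses one fixed superposition
    (ASSUMPTION), so it can underestimate the true GDT_TS.
    """
    if len(ref) != len(model) or not ref:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    dists = [math.dist(p, q) for p, q in zip(ref, model)]
    fractions = [sum(d <= c for d in dists) / len(dists) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)

def within_thresholds(ref, model, rmsd_max=2.0, gdt_min=80.0):
    """Return (rmsd_ok, gdt_ok) against the dashed-line thresholds of Figure 1."""
    return backbone_rmsd(ref, model) <= rmsd_max, gdt_ts(ref, model) >= gdt_min

# Toy example: four atoms displaced by 0.5, 1.5, 3.0 and 9.0 Angstrom.
ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
model = [(0.0, 0.5, 0.0), (1.0, 1.5, 0.0), (2.0, 3.0, 0.0), (3.0, 9.0, 0.0)]
print(round(backbone_rmsd(ref, model), 2), gdt_ts(ref, model))
```

The two flags are returned separately because the caption gives one threshold per measure and does not say whether both must hold at once.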

Results: quality assessment overview

[Figure: quality assessment plots; axis direction marked "better", reference structures marked "Original"]

CS-Rosetta is very efficient in selecting structures with excellent local conformation! This currently presents a problem in validation.

NMR-VTF Phase 1

• Superposition: Cyrange (distance variance matrix)
• Assessing structured regions: Cyrange
• Representative model: medoid
• Restraints: ‘simple validation’ (number, violations per bin) of all restraints (distance, dihedral, H-bonds, RDCs, ..), also per model
• Geometric validation: follow wwPDB X-ray VTF choices: MolProbity, RosettaHoles; also residue-specific values
• Chemical shift validation: databases, back-calculation

Conclusions

• Proper validation of the structural quality in relation to the experimental data is essential.
• Error detection should be an integral part of the computation/validation process.
• Usage of any global validation parameters as a quality indicator appears to be rather useless. We advocate a residue-specific validation of NMR data.
• There are several tools for assessing the quality of your structures; e.g. PSVS, CING or ResProx. Use them!
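The NMR-VTF choice of a medoid as the representative model can be illustrated with a short Python sketch: the medoid is the ensemble member with the smallest summed RMSD to all other members. This is only a sketch under stated assumptions. The models are taken to be already superimposed on the well-ordered residues (the step the slide assigns to Cyrange, which is not implemented here), and the function names are hypothetical.

```python
import math

def rmsd(a, b):
    """RMSD between two equally long lists of (x, y, z) tuples.

    ASSUMPTION: both models are already superimposed.
    """
    squared = [math.dist(p, q) ** 2 for p, q in zip(a, b)]
    return math.sqrt(sum(squared) / len(squared))

def medoid_index(models):
    """Index of the model with the smallest summed RMSD to all others.

    Ties are resolved in favour of the lowest index.
    """
    if not models:
        raise ValueError("ensemble must contain at least one model")
    totals = [sum(rmsd(m, other) for other in models) for m in models]
    return min(range(len(models)), key=totals.__getitem__)

# Toy ensemble: three single-atom "models" at x = 0, 1 and 5 Angstrom.
ensemble = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(5.0, 0.0, 0.0)]]
print(medoid_index(ensemble))
```

Unlike an averaged conformer, the medoid is an actual member of the ensemble, so it keeps physically sensible geometry.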

CING: Data flow

CING server
http://nmr.cmbi.ru.nl/iCing/
Web-based interaction


Websites

http://nmr.cmbi.ru.nl/CING
http://nmr.cmbi.ru.nl/iCing
http://nmr.cmbi.ru.nl/dress
QUEEN: http://nmr.cmbi.ru.nl/QUEEN (Version 1.1, > 50 downloads)
http://proteins.dyndns.org
http://nmr.cmbi.ru.nl/Validation
http://www.ccpn.ac.uk
http://www.wenmr.eu

Acknowledgements

CING & NRG-CING: Jurgen Doreleijers & Gert Vriend (RU Nijmegen); Rasmus Fogh, Khushwant Sidhu (Univ. of Leicester)

NMR_REDO: Wouter Touw, Jurgen Doreleijers, Karin Berntsen & Gert Vriend (RU Nijmegen); Gjalt van Rutten (BitBrains)

CASD-NMR: Rasmus Fogh, Antonio Rosato, Wim Vranken, Guy Montelione, Alexandre Bonvin, .. all the NMR spectroscopists and depositors

External program authors: Roman Laskowski, David Wishart, Charles Schwieters, Yang Shen, Xiang-Jun Lu...

CCPN & others: Ernest Laue, Rasmus Fogh, Wayne Boucher & Tim Stevens; Aleksandras Gutmanas, Pieter Hendriks

Funding:

Jobs

• Validation (1 post-doc)
• CCPN (1 developer)

Email me: [email protected]