Chemical Shift Secondary Structure Population Inference
Total Page:16
File Type:pdf, Size:1020Kb
bioRxiv preprint doi: https://doi.org/10.1101/2021.02.20.432095; this version posted February 20, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 1 CheSPI: Chemical shift Secondary structure Population Inference Jakob Toudahl Nielsen1* and Frans A.A. Mulder1* 1Interdisciplinary Nanoscience Center (iNANO) and Department of Chemistry, Aarhus University, Gustav Wieds VeJ 14, 8000 Aarhus C, Denmark *Correspondence to: [email protected], [email protected] ORCID: Jtn http://orcid.org/0000-0002-9554-4662 faam http://orcid.org/0000-0001-9427-8406 Keywords: NMR, protein, order, disorder, chemical shifts Abstract NMR chemical shifts (CSs) are delicate reporters of local protein structure, and recent advances in random coil CS (RCCS) prediction and interpretation now offer the compelling prospect of inferring small populations of structure from small deviations from RCCSs. Here, we present CheSPI, a simple and efficient method that provides unbiased and sensitive aggregate measures of local structure and disorder. It is demonstrated that CheSPI can predict even very small amounts of residual structure and robustly delineate subtle differences into four structural classes for intrinsically disordered proteins. For structured regions and proteins, CheSPI can assign up to eight structural classes, which coincide with the well-known DSSP classification. The program is freely available, and can either be invoked from URL www.protein-nmr.org as a web implementation, or run locally from command line as a python program. CheSPI generates comprehensive numeric and graphical output for intuitive annotation and visualization of protein structures. A number of examples are provided. Introduction NMR chemical shifts are very sensitive to the local structure of proteins, and can be measured and assigned routinely with great precision for both structured and unstructured proteins(Felli and Pierattelli 2012, Brutscher, Felli et al. 2015). The relationship between chemical shifts and local protein structure is well-established for folded proteins e.g. through simple index methods(Wishart, Sykes et al. 1992, Wishart and Sykes 1994) statistic and probabilistic methods(Eghbalnia, Wang et bioRxiv preprint doi: https://doi.org/10.1101/2021.02.20.432095; this version posted February 20, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 2 al. 2005, Wang, Chen et al. 2007) or by methods relying more on sequence homology through neural networks or database searches(Jones 1999, Labudde, Leitner et al. 2003, Shen and Bax 2013, Shen and Bax 2015), and super-secondary structure predictions(Hafsa, Arndt et al. 2015). Furthermore, CSs have also been used to aid the structural and dynamical characterization of proteins(Wishart and Sykes 1994, Wishart and Case 2001, Wishart and Case 2002, BerJanskii and Wishart 2007, Cavalli, Salvatella et al. 2007, Mielke and Krishnan 2009, KJaergaard and Poulsen 2012, Robustelli, Stafford et al. 2012). In stark contrast, intrinsically disordered proteins and regions (IDPs and IDRs) display no or very little regular secondary structure, are not folded into a globular structure, but rather constitute a dynamical equilibrium between several conformations with less regularity. NMR spectroscopy is an ideal technique to study these dynamic IDPs(Tompa and Fersht 2009, Uversky and Longhi 2010). Intrinsically disordered polypeptides display population-averaged chemical shifts that provide an operational definition of random coil chemical shifts. Deviations from RCCSs contain information about structural composition, but also pose a challenge to deconvolute induced from intrinsic structure. Fortunately, however, RCCSs have been predicted with increasing accuracy over time(Braun, Wider et al. 1994, Wishart, Bigam et al. 1995, Schwarzinger, Kroon et al. 2001, De Simone, Cavalli et al. 2009, Tamiola, Acar et al. 2010, KJaergaard, Brander et al. 2011), culminating in the most accurate predictor to date, known as POTENCI(Nielsen and Mulder 2018). As a result, deviations from RCCSs (the secondary chemical shifts, SCSs) are now common parameters used to identify and quantitate order/disorder in IDPs(BerJanskii and Wishart 2007, KJaergaard and Poulsen 2012, Nielsen and Mulder 2016, Sormanni, Piovesan et al. 2017). The improved accuracy has also permitted a benchmark of the performance of disorder prediction methods(Nielsen and Mulder 2019) and CSs were recently used to train the disorder predictor ODINPred (Dass, Mulder et al. 2020). A yet more challenging task is to quantify the statistical composition of structural states for IDPs, since accurate reference experimental data with unique structural interpretation do not exist; As this problem is ill-posed, astronomical numbers of ensembles could be constructed that all give rise to the experimentally observed averages. One possible avenue to plausible solutions is to use physics-based models of protein conformational sampling (by e.g. molecular mechanics force fields) coupled to parametrization of chemical shifts, and possibly other NMR observables, to bioRxiv preprint doi: https://doi.org/10.1101/2021.02.20.432095; this version posted February 20, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 3 conformation(Ozenne, Schneider et al. 2012, Varadi, Vranken et al. 2015). Alternative empirical approaches adopt a heuristic treatment of CSs and secondary structure relationships in folded proteins to IDPs for the inference of secondary structure populations of α-helix, β-strand, random coil, and polyproline II(Camilloni, De Simone et al. 2012). A more robust reductionist approach was taken with SSP (secondary structure propensity)(Marsh, Singh et al. 2006, Tamiola, Acar et al. 2010) where linear combinations of SCSs were aggregated into a scale between -1 (sheet) and 1 (helix) and interpreted as a structural propensity. Although such an approach implicitly remedies correlated CSs, information potentially contained in the individual CSs risks being lost by the reduction to single value. For example, a rigid loop between a sheet and a helix will display near- zero propensity, and risks being falsely interpreted as disordered. Furthermore, neither of the above methods can so-far discriminate between disordered and ordered turns, whereas such information is potentially contained in the particular combination of SCSs. Here, we introduce the linear analysis of signed SCSs, introducing Chemical shift Secondary structure Population Inference (CheSPI). This approach extends the previously-introduced CheZOD Z-score for quantifying local order and disorder in proteins(Nielsen and Mulder 2016, Nielsen and Mulder 2020) derived from the statistical analysis of sums of squared SCSs. Technically, CheSPI applies multivariate analysis and dimension reduction techniques to generate linear combinations (CheSPI components) of SCSs that optimally describing the variance: The first CheSPI component ensures the optimal distinction between secondary structure classes, whereas the second component accounts for the variance within the classes (which was found to be closely related to the local structure such as backbone conformation, flexibility and hydrogen bonding). CheSPI components offer an accurate and comprehensive quantification of local dynamic and structural composition in structured as well as disordered proteins, being sensitive to local protein structure as well as dynamics. CheSPI components are presented to the user as a color scale, which conveys the information in a simple, intuitive, and visually appealing manner. As shown herein, CheSPI colors can be used to annotate 3D structures, and thereby highlight important and detailed structural changes in proteins. The power of CheSPI to discriminate between secondary structure classes was exploited to derive estimates for the populations of helix, extended structure, turn, and “non-folded” structures (CheSPI populations) through statistical inference, and these populations were validated through bioRxiv preprint doi: https://doi.org/10.1101/2021.02.20.432095; this version posted February 20, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 4 comparison to simulated ensembles for four distinct IDPs (vide infra). Contemporary methods for inferring secondary structure from NMR-data typically provide only three-class predictions. A notable exception is CSI3.0, which offers a four-state prediction including turns, as well as the distinction between internal and external strands. CheSPI takes a stride further, and extends secondary-structure inference to encompass the prediction of the eight structural classes (SS8) defined by the popular DSSP classification algorithm(Kabsch and Sander 1983); a-helix, 310-helix, p- helix, extended b-strand, bridge (isolated single residue b-strand), and –