An Open-Source Software Package for Graphing, Analysis and Simulation of Thermodynamic and Kinetic Models of Protein Folding
Total Page:16
File Type:pdf, Size:1020Kb
bioRxiv preprint doi: https://doi.org/10.1101/191593; this version posted September 20, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. PyFolding: An open-source software package for graphing, analysis and simulation of thermodynamic and kinetic models of protein folding Alan R. Lowe1,2,3*, Albert Perez-Riba4, Laura S. Itzhaki4 & Ewan R.G. Main5* 1 London Centre for Nanotechnology 17-19 Gordon Street, London WC1H 0AH, UK 2 Structural & Molecular Biology, University College London, Gower Street, London, WC1E 6BT, UK 3 Department of Biological Sciences, Birkbeck College, University of London Malet Street, London WC1E 7HX, UK 4 Department of Pharmacology, University of Cambridge Tennis Court Road, Cambridge CB2 1PD, UK 5 School of Biological and Chemical Sciences, Queen Mary University of London Mile End Road, London E1 4NS, UK *corresponding authors (ARL: [email protected], ERGM: [email protected]) Figure 1: Illustrative Pyfolding outputs for fitting equilibrium and kinetic datasets of: (A) the two- state folding FKBP12 protein (1) and (B) the 3-state folding thermophilic AR protein (tANK) identified in the archaeon Thermoplasma (2). Both show three graphs, the first is the equilibrium chemical denaturation, the second is the chevron plot and the third is the residuals for the fit of the chevron plot. In (A) the fits shown are to two-state folding models (both equilibrium and kinetic). In (B) fits shown are to three-state folding models (both equilibrium and kinetic - SI Jupyter Notebook 1). For the kinetic three state-model the multiple kinetic phases of the chevron plot are fitted using two linked equations describing the slow and fast phases (SI Jupyter Notebook 4). Figure 2: Illustrative PyFolding outputs for global fitting of GuHCl-induced equilibrium unfolding experiments of series of single-helix deletion CTPRn proteins to a heteropolymer Ising model (3). Each output shows (A) the parameters obtained with error and correlation coefficient of the fit of the data, (B) graphical representation of the topology used to fit the data, (D) the graphs of the fitted data, (E) the graph of the first derivative of the fit function for each curve and (F) graph of the denaturant dependence of each subunit used. bioRxiv preprint doi: https://doi.org/10.1101/191593; this version posted September 20, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. [Abstract] Our understanding of how proteins find and adopt their functional three-dimensional structure has largely arisen through experimental studies of the denaturant- and primary sequence- dependence of protein stability and the kinetics of folding. For many years, curve fitting software packages have been heavily utilized to fit simple models to these data. Although such software packages are easy to use for simple functions, they are often expensive and provide substantial impediments to applying more complex models or for the analysis of large datasets. Moreover, over the past decade, increasingly sophisticated analytical models have been generated, but without simple tools to enable routine analysis. Consequently, users have needed to generate their own tools or otherwise find willing collaborators. Here we present PyFolding, a free, open source, and extensible Python framework for the analysis and modeling of experimental protein folding and thermodynamic data. To demonstrate the utility of PyFolding, we provide examples of complex analysis: (i) multi-phase kinetic folding data fitted to linked equations and (ii) thermodynamic equilibrium data from consensus designed repeat proteins to both homo- and heteropolymer variants of the Ising model. Example scripts to perform these and other operations are supplied with the software. Further, we show that PyFolding can be used in conjunction with Jupyter notebooks as an easy way to share methods and analysis for publication and amongst research teams. bioRxiv preprint doi: https://doi.org/10.1101/191593; this version posted September 20, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. [Introduction] The last decade has seen a shift in the analysis of experimental protein folding and thermodynamic stability data from the fitting of individual datasets using simple models to more and more complex models employed using global optimization over multiple large datasets [examples include Refs: (3- 21)]. This shift in focus has required moving from user-friendly, but expensive software packages to bespoke solutions developed in computing environments such as MATLAB and Mathematica or by using in-house solutions [examples include: (3, 6, 12, 21, 22)]. However, as these methods of analysis have become more essential, simple curve fitting software no longer provides sufficient flexibility to implement the models. Thus, there is increasingly a need for substantially more computational expertise than previously required. In this respect the protein folding field contrasts with other fields, for example x-ray crystallography, where free or inexpensive and user-friendly interfaces and analysis packages have been developed (23). Here we present PyFolding, a free, open-source and extensible framework for analysing and modelling protein folding kinetics and thermodynamic stability. The software, coupled with the supplied models / Jupyter (iPython) notebooks, can be used by researchers with less programming expertise to access more complex models/analyses and share their work with others. Moreover, PyFolding also enables researchers to automate the time-consuming process of combinatorial calculations, fitting data to multiple models or multiple models to specific data. To demonstrate these and other functions we present a number of examples as Jupyter notebooks. This enables novice users to simply replace the data path and rerun for their systems. The Jupyter notebooks provided also show how PyFolding provides an easy way to share analysis for publication and amongst research teams. [Results & Discussion] PyFolding is distributed as a lightweight, open-source Python library through github and can be downloaded with instructions for installation from the authors’ site 1 . PyFolding has several dependencies, requiring Numpy, Scipy and Matplotlib. These are now conveniently packaged in several Python frameworks, enabling easy installation of PyFolding even for those who have never used Python before (described in the “Setup.md” file of PyFolding). As part of PyFolding, we have 1 https://github.com/quantumjot/PyFolding bioRxiv preprint doi: https://doi.org/10.1101/191593; this version posted September 20, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. provided many commonly used folding models, such as two- and three-state equilibrium folding and various equivalent kinetic variations, as standard (S.I Jupyter notebook 1-4). Functions and models themselves are open source and are thus available for inspection or modification by both reviewers and authors. Moreover, due to the open source nature, users can introduce new functionality by adding new models into the library building upon the template classes provided. Fitting and evaluation of typical folding models within PyFolding: PyFolding uses a hierarchical representation of data internally. Proteins exist as objects that can have metadata as well as multiple sets of kinetic and thermodynamic data associated with them. Input data such as chevron plots or equilibrium denaturation curves can be supplied as comma separated value files (.CSV). Once loaded, each dataset is represented in PyFolding as an object, associating the data with numerous common calculations. Models are represented as functions that can be associated with the data objects you wish to fit. As such, datasets can have multiple models and vice versa enabling automated fitting and evaluation (S.I Jupyter notebooks 1-3). Parameter estimation for simple (non- Ising) models is performed using the Levenberg–Marquardt non-linear least-mean-squares optimization algorithm to optimize the objective function [as implemented in SciPy (24)]. The output variables (with standard error) and fit of the model to the dataset (with R2 coefficient of determination) can be viewed within PyFolding and/or the fit function and parameters written out as a CSV file for plotting in your software of choice (Figure 1 & S.I Jupyter notebook 1-3). Importantly, by representing proteins as objects, containing both kinetic and equilibrium datasets, PyFolding enables users to perform and automate higher-level calculations such as Phi-value analysis (25, 26), which can be tedious and time-consuming to perform otherwise (S.I Jupyter notebook 3). Moreover, users can define their own calculations so that more complex data analysis can be performed. For example, Figure 1B and S.I Jupyter notebook