bioRxiv preprint doi: https://doi.org/10.1101/480194; this version posted November 27, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Universal architectural concepts underlying protein folding patterns

Arthur M. Lesk [a,b], Ramanan Subramanian [c], Lloyd Allison [c], David Abramson [d], Peter J. Stuckey [c,e], Maria Garcia de la Banda [c], and Arun S. Konagurthu [c,*]

[a] MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge CB2 0QH, U.K.
[b] Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, PA 16802, U.S.A.
[c] Faculty of Information Technology, Monash University, Clayton, VIC 3800, Australia
[d] Research Computing Center, University of Queensland, Brisbane, QLD 4072, Australia
[e] School of Computing and Information Systems, University of Melbourne, VIC 3010, Australia

ABSTRACT

What is the architectural ‘basis set’ of the observed universe of protein structures? Using information-theoretic inference, we answer this question with a comprehensive dictionary of 1,493 substructural concepts. Each concept represents a topologically-conserved assembly of helices and strands that make contact. Any protein structure can be dissected into instances of concepts from this dictionary. We dissected the world-wide Protein Data Bank and completely inventoried all concept instances. This yields an unprecedented source of biological insights. These include: correlations between concepts and catalytic activities or binding sites, useful for rational drug design; local amino-acid sequence–structure correlations, useful for ab initio structure prediction methods; and information supporting the recognition and exploration of evolutionary relationships, useful for structural studies.
An interactive site, PROÇODIC, at http://lcb.infotech.monash.edu.au/prosodic provides access to and navigation of the entire dictionary of concepts, and all associated information.

KEYWORDS: architectural concepts, protein building blocks, structural motifs, minimum message length, mml, lossless compression, information theory

INTRODUCTION

The polypeptide chains of amino acids (primary structure) in most proteins fold into helices and strands of sheet (secondary structure), which in turn assemble to give proteins their intricate three-dimensional shapes and folding patterns (tertiary structure). Experimental methods have already provided over 140,000 entries in the world-wide Protein Data Bank (wwPDB), containing the three-dimensional coordinates of proteins and protein-nucleic acid complexes from a wide range of species. Unravelling protein architecture and discovering the relationship among these three major levels of structural description provides the key to understanding how proteins function, how their 3D folding patterns form, and how they evolve (1).

Investigations of protein folding patterns have revealed recurrent themes at all structural levels (2–8), which form the basis for widely-used hierarchical classifications of protein structures (9–11). Nevertheless, many aspects of the relationships across structural levels have remained unresolved.

Chothia and Lesk (6) introduced the idea of a core of the folding patterns of homologous proteins. This core comprises a maximal set of secondary structural elements that assemble in a common 3D topology, while withstanding a certain amount of distortion. The parts outside the core are structurally more variable. Many related proteins contain some but not all of the same common substructures that form their cores.

* To whom correspondence should be addressed.
E-mail: [email protected]

Conceptualization: ASK; Methodology: AML, LA, DA, PJS, MG, and ASK; Software: RS and ASK; Validation: AML, LA, and ASK; Analysis: AML, LA and ASK; Investigation: AML, RS, PJS, MG and ASK; Resources: AML and DA; Data Curation: ASK; Writing - Original Draft: AML and ASK; Writing - Review & Editing: RS, LA, DA, PJS and MG; Visualization: AML and ASK; Supervision: ASK; Project Administration: ASK; Funding Acquisition: AML, PJS, MG and ASK.

November 25, 2018

Therefore, it is of crucial interest to discover the nature of the substructures that contribute to the cores of protein families. Some of these are supersecondary structures: small conserved combinations of successive elements of secondary structure, such as the β-α-β subunit. Supersecondary structures recur within many protein folds, and can be shared even by unrelated proteins. For example, the β-α-β subunit appears in NAD-binding domains, in TIM barrels, and in many other proteins.

Early definitions of supersecondary structures relied strongly on experts spotting and naming them (4, 12). With the steady growth of the wwPDB, several methods have been developed to automatically identify, under varying operational definitions, a library of substructures that can be considered the 3D building blocks of protein structures (8, 13–25). However, these approaches yielded limited libraries containing mostly short oligopeptide fragments or assemblies of typically 2 to 4 secondary structural elements.
So far it has been a challenge to go further and dissect protein structures into a more complete set that includes larger conserved substructures. Apart from the enormous computational challenge this problem poses, the attempts made so far lacked a statistically-rigorous framework in which to describe, compute, identify and resolve a dictionary of conserved assemblies of secondary structures.

Here we address this problem and present a universal dictionary of substructural concepts, PROÇODIC, that advances the current knowledge of these conserved patterns. Our approach relies on the rigorous information-theoretic framework of Minimum Message Length inference, which allows the inference of a dictionary that (a) avoids overfitting (i.e., inferring a dictionary that is more complex than necessary to explain the observed folding patterns) and (b) achieves an objective trade-off between the descriptive complexity of the concepts in the dictionary and their fidelity (i.e., the amount of compression gained) when explaining the observed protein folding patterns. Thus, this work presents the ‘basis set’ of concepts underlying all observed protein folding patterns. PROÇODIC can contribute to: understanding fundamental principles of protein structure, correlations of concepts with ligand binding sites to suggest function, and application of sequence conservation within concepts for protein structure prediction.

RESULTS

Automatic identification of a dictionary of substructural concepts. This work uses the concise tableau representation of protein folding patterns introduced by Lesk (26), which is based on the idea that the essence of a protein folding pattern is captured by the order, contacts and geometry of the assembly of secondary structural elements along the amino-acid chain. A tableau corresponds to the 3D structure of a single protein domain (or sometimes chain), and has the form of a symmetric matrix (Fig. 1(a,c); Supplementary §S1).
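To make the tableau idea concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes a simplified encoding in which the diagonal holds the type of each secondary structural element (H or E) in chain order, and each off-diagonal entry holds an inter-element angle together with a contact flag. The encoding actually used in the paper (Supplementary §S1) may differ. The function names `make_tableau`, `is_symmetric` and `contact_pairs`, and the angle values, are hypothetical and chosen only for illustration.

```python
def make_tableau(sse_types, pair_info):
    """Build a symmetric n x n tableau-like matrix (simplified, assumed encoding).

    sse_types: element labels in chain order, e.g. ["E", "H", "E"].
    pair_info: maps (i, j) with i < j to (angle_degrees, in_contact).
    """
    n = len(sse_types)
    tableau = [[None] * n for _ in range(n)]
    # Diagonal: the type of each secondary structural element.
    for i, kind in enumerate(sse_types):
        tableau[i][i] = kind
    # Off-diagonal: the same pairwise entry is mirrored across the diagonal.
    for (i, j), entry in pair_info.items():
        tableau[i][j] = entry
        tableau[j][i] = entry
    return tableau


def is_symmetric(tableau):
    """Return True if the matrix equals its transpose."""
    n = len(tableau)
    return all(tableau[i][j] == tableau[j][i]
               for i in range(n) for j in range(n))


def contact_pairs(tableau):
    """List the (i, j) pairs with i < j whose entry is flagged as a contact."""
    n = len(tableau)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if tableau[i][j] is not None and tableau[i][j][1]]


# A beta-alpha-beta arrangement with made-up, purely illustrative angles.
sse_types = ["E", "H", "E"]
pair_info = {(0, 1): (-140.0, True), (1, 2): (140.0, True), (0, 2): (10.0, True)}
tableau = make_tableau(sse_types, pair_info)
print(is_symmetric(tableau))
print(contact_pairs(tableau))
```

Storing each pairwise entry once and mirroring it keeps the matrix symmetric by construction, so order (row index), contacts (flag) and geometry (angle) are all readable from a single structure.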
Importantly, in this representation supersecondary structures find compact and computable definitions as subtableaux containing two or more successive secondary structure elements in contact (Fig. 1(d-e)).

We constructed the universal dictionary reported here using our recently-developed method to infer, automatically, conserved assemblies of secondary structural elements within any given source collection of tableaux (27). The idea of a concept is constrained by the requirement that every secondary structural element in the concept must be in contact with at least one other secondary-structure element in that concept. Our concept inference approach (27) is based on the powerful minimum message length criterion for statistical inductive inference (28–30) and lossless data compression (Supplementary §S2).

We applied this method to compress the source collection of Astral SCOP domains (9, 10, 31) (Supplementary §S1). This allowed us to infer a dictionary of 1,493 substructural concepts that most concisely and losslessly describes the entire source collection, and does so without any prior knowledge or preconceived notions of these recurrent substructures. The total computational effort to identify this dictionary is equivalent to about 7 years of runtime on a modern computer. We parallelised our method and ran it on a high-performance computing cluster (Supplementary §S2).

Fig. 1. (a) Secondary-structural cartoon representation of the crystal structure of the Actin-binding protein actophorin from Acanthamoeba (1AHQ) (32).
(b) Secondary structural assignment (using SST (33); H = helix, E = strand of sheet) and the optimal dissection of the protein chain