Deep Learning of Protein Structural Classes: Any Evidence for an 'Urfold'?
1st Menuka Jaiswal, School of Data Science, University of Virginia, Charlottesville ([email protected])
1st Saad Saleem, School of Data Science, University of Virginia, Charlottesville ([email protected])
1st Yonghyeon Kweon, School of Data Science, University of Virginia, Charlottesville ([email protected])
2nd Eli J Draizen, Department of Biomedical Engineering, University of Virginia, Charlottesville ([email protected])
2nd Stella Veretnik, Department of Biomedical Engineering, University of Virginia, Charlottesville ([email protected])
2nd Cameron Mura, Department of Biomedical Engineering, University of Virginia, Charlottesville ([email protected])
2nd Philip E Bourne, School of Data Science, University of Virginia, Charlottesville ([email protected])

Abstract—Recent computational advances in the accurate prediction of protein three-dimensional (3D) structures from amino acid sequences now present a unique opportunity to decipher the interrelationships between proteins. This task entails, but is not equivalent to, a problem of 3D structure comparison and classification. Historically, protein domain classification has been a largely manual and subjective activity, relying upon various heuristics. Databases such as CATH represent significant steps towards a more systematic (and automatable) approach, yet there still remains much room for the development of more scalable and quantitative classification methods, grounded in machine learning. We suspect that re-examining these relationships via a Deep Learning (DL) approach may entail a large-scale restructuring of classification schemes, improved with respect to the interpretability of distant relationships between proteins. Here, we describe our training of DL models on protein domain structures (and their associated physicochemical properties) in order to evaluate classification properties at CATH's homologous superfamily (SF) level. To achieve this, we have devised and applied an extension of image-classification methods and image-segmentation techniques, utilizing a convolutional autoencoder model architecture. Our DL architecture allows models to learn structural features that, in a sense, 'define' different homologous SFs. We evaluate and quantify pairwise 'distances' between SFs by building one model per SF and comparing the loss functions of the models. Hierarchical clustering on these distance matrices provides a new view of protein interrelationships, a view that extends beyond simple structural/geometric similarity and towards the realm of structure/function properties.

Index Terms—Autoencoders; CNNs; CATH; deep learning; protein domain classification; protein structure

I. INTRODUCTION

Proteins are key biological macromolecules that consist of long, unbranched chains of amino acids (AAs) linked via peptide bonds [1]; they perform most of the physiological functions of cellular life (enzymes, receptors, etc.). Different sequences of the 20 naturally-occurring AAs can fold into tertiary structures with variable degrees of geometric similarity. At the sequence level, the differences between any two proteins can range from relatively modest single-point edits ("point mutations" or 'substitutions') to larger-scale changes such as reorganization of entire segments of a polypeptide chain. Such changes are critical because they influence 3D structure, and protein function stems from 3D structure [2]. Indeed, our ability to elucidate protein function and evolution is intimately linked to our knowledge of protein structure. Equally important, interrelationships between structures define a map of the protein universe [3]. Thus, it is paramount to have a robust classification system for categorically organizing protein structures based upon their similarities. Even more basically, what does 'similarity' mean in this context (geometrically, chemically, etc.)? And are there particular formulations of 'similarity' that are more salient than others? Ideally, any system of comparison would also take into account functional properties, i.e., not just raw geometric shape but also properties such as acidity/basicity, physicochemical features of the surface, "binding pockets", and so on.

An unprecedented volume of protein structural and functional data is now available, largely because of exponential growth in genome sequencing and efforts such as structural genomics [4]; alongside these data, novel computational approaches can now discover subtle signals in sets of sequences [5]. These advances have yielded a vast trove of potential protein sequences and structures, but the utility of this information has been limited because the mapping from sequence to structure (or 'fold') has remained enigmatic; this decades-long grand challenge is known as the "protein folding problem" [6]. As computational approaches to the folding problem continually improve, we will increasingly compute 3D structures from protein sequences de novo. Therefore, we expect the demand for identifying and cataloging new protein structures to grow at an ever-increasing pace with the rise in the number of 3D structures, both experimentally determined and computationally predicted (via physics-based simulations [7] or via Deep Learning approaches [8]).

Historically, efforts to catalog protein structures have been a largely manual and painstaking process, fraught with heuristics. There has been a paradigm shift since the introduction of methodical, hierarchical database structures such as CATH, which engender more robust classification schemes into which new protein structures can be incorporated [9]. This is critical as we expand our knowledge of known protein structures. The CATH database has seen phenomenal growth, going from 53M protein domains classified into 2737 homologous SFs in 2016 [10] to a current 95M protein domains classified into 6119 SFs. Despite being one of the most comprehensive resources, the CATH database (like any) is not without its limitations, in terms of its underlying assumptions about the relationships between entities. For instance, it was recently argued that there exists an intermediate level of structural granularity that lies between the CATH hierarchy's architecture (A) and topology (T) strata; dubbed the Urfold [11], this representational level is thought to capture the phenomenon of "architectural similarity despite topological variability", as exhibited by the structure/function similarity in a deeply-varying collection of proteins that contain a characteristic small β-barrel (SBB) domain [12].

With recent advances in computing power, Deep Learning methods have begun to be applied to protein structures, in terms of predictions, similarity assessment and classification [13], [14]. However, deep neural networks (DNNs), and specifically 3D CNNs, have not yet seen widespread use with protein structures. This is likely the case because: (1) there is no single, 'standard'/canonical orientation across the set of all protein structures [15], which is problematic for CNNs (which are not rotationally invariant), and (2) the computational demands to train such models are exorbitant [16]. In this paper, we present a new Deep Learning method to quantify similarities between protein domains using a 3D-CNN-based autoencoder architecture. Our approach treats protein structures as 3D images, with any associated properties (physicochemical, evolutionary, etc.) incorporated as regional
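To make the workflow described above concrete, the following is a minimal sketch (not the authors' implementation) of its two ingredients: a 3D convolutional autoencoder over voxelized domain "images", and a pairwise SF distance matrix obtained by cross-evaluating each SF-specific model's reconstruction loss, which can then be clustered hierarchically. The 64-cubed grid, the eight feature channels, the layer widths, the binary cross-entropy loss, the average-linkage clustering, and the helper names (Conv3DAutoencoder, sf_distance_matrix) are illustrative assumptions, not details taken from the paper.

import numpy as np
import torch
import torch.nn as nn
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform


class Conv3DAutoencoder(nn.Module):
    """Toy 3D convolutional autoencoder over voxelized protein domains
    (channels-first tensors of shape [batch, channels, 64, 64, 64])."""

    def __init__(self, in_channels: int = 8):
        super().__init__()
        # Encoder: compress the 3D "image" of a domain into a small latent volume.
        self.encoder = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=4, stride=2, padding=1),  # 64 -> 32
            nn.ReLU(inplace=True),
            nn.Conv3d(16, 32, kernel_size=4, stride=2, padding=1),           # 32 -> 16
            nn.ReLU(inplace=True),
            nn.Conv3d(32, 64, kernel_size=4, stride=2, padding=1),           # 16 -> 8
            nn.ReLU(inplace=True),
        )
        # Decoder: reconstruct the voxelized domain from the latent volume.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1),  # 8 -> 16
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(32, 16, kernel_size=4, stride=2, padding=1),  # 16 -> 32
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(16, in_channels, kernel_size=4, stride=2, padding=1),  # 32 -> 64
            nn.Sigmoid(),  # binary-valued feature channels
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))


def reconstruction_loss(model: nn.Module, batch: torch.Tensor) -> float:
    """Mean binary cross-entropy between voxelized domains and their reconstructions."""
    model.eval()
    with torch.no_grad():
        return nn.functional.binary_cross_entropy(model(batch), batch).item()


def sf_distance_matrix(models: dict, batches: dict, sf_names: list) -> np.ndarray:
    """Cross-evaluate each SF-specific model on every SF's held-out domains and
    symmetrize the resulting loss matrix into a pairwise distance matrix."""
    loss = np.array([[reconstruction_loss(models[a], batches[b]) for b in sf_names]
                     for a in sf_names])
    dist = 0.5 * (loss + loss.T)   # symmetrize
    np.fill_diagonal(dist, 0.0)    # self-distance set to zero by convention
    return dist


# Example usage, assuming one trained model and one held-out batch per superfamily:
# sfs = ["Ig", "SH3", "OB"]
# dist = sf_distance_matrix(models, batches, sfs)
# tree = linkage(squareform(dist, checks=False), method="average")
# dendrogram(tree, labels=sfs)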
CATH is a hierarchical classification scheme that organizes all known protein 3D structures (from the Protein Data Bank [PDB; [21]]) by their similarity, with implicit inclusion of some evolutionary information. The PDB houses nearly 180,000 biomolecular structures, determined via experimental means such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM). CATH uses both automated and manual methods to parse each polypeptide chain in a PDB structure into individual domains. Domain-level entities are then classified within a structural hierarchy: Class, Architecture, Topology and Homologous superfamily (see also Fig. 1 in [11] for more on this). If there is compelling evidence that two domains are evolutionarily related (i.e., homologous, based on sequence similarity), then they are classified within the same superfamily. For each domain, we obtain 3D structures and corresponding homologous superfamily labels from CATH. Next, we compute a host of derived properties for each domain in CATH (Draizen et al., in prep), including (i) purely geometric/structural quantities (e.g., secondary structure [21], solvent accessibility [22]), (ii) physicochemical properties (e.g., hydrophobicity, partial charges, electrostatic potentials [23]), and (iii) basic chemical descriptors (atom and residue types). As the initial use-cases reported here, we examined three homologous superfamilies of particular interest to us: namely, immunoglobulins (2.60.40.10), the SH3 domain (2.30.30.100), and the OB fold (2.40.50.140). Our models were built using 5583, 2834 and 585 domain structures of Ig, SH3 and OB, respectively.
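As a purely illustrative sketch of how per-domain, per-atom properties like those listed above can be incorporated into a protein "3D image", the snippet below maps binary per-atom features onto a multi-channel voxel grid. The cubic grid of side 64, the 1 Angstrom resolution, and the function name voxelize are assumptions for the example, not the exact representation used in this work.

import numpy as np


def voxelize(coords: np.ndarray, features: np.ndarray,
             grid_size: int = 64, resolution: float = 1.0) -> np.ndarray:
    """Map per-atom binary features onto a multi-channel 3D grid.

    coords:   (n_atoms, 3) Cartesian coordinates, in Angstroms.
    features: (n_atoms, n_channels) binary per-atom properties.
    Returns a (n_channels, grid_size, grid_size, grid_size) float array.
    """
    n_channels = features.shape[1]
    grid = np.zeros((n_channels, grid_size, grid_size, grid_size), dtype=np.float32)
    # Center the domain within the grid.
    centered = coords - coords.mean(axis=0)
    idx = np.floor(centered / resolution).astype(int) + grid_size // 2
    # Discard atoms that fall outside the grid (a simplification for this sketch).
    inside = np.all((idx >= 0) & (idx < grid_size), axis=1)
    idx, feats = idx[inside], features[inside]
    for (i, j, k), f in zip(idx, feats):
        grid[:, i, j, k] = np.maximum(grid[:, i, j, k], f)  # per-channel occupancy
    return grid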
B. Preprocessing; Further Data Engineering

We began by considering the aforementioned features for each atom in the primary sequence of each domain. Our first step was to 'clean' the data by eliminating features that contained more than 25 percent missing values. Next, we converted continuous-valued numerical features into binary. For example, hydrophobicity values
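The preprocessing just described can be sketched as follows, assuming the per-atom features are held in a pandas DataFrame. Only the 25 percent missing-value threshold comes from the text; the column names and the hydrophobicity cutoff are assumptions chosen for the example.

import pandas as pd


def preprocess(atoms: pd.DataFrame) -> pd.DataFrame:
    """Drop sparse features, then binarize an example continuous feature."""
    # 1) Eliminate features with more than 25 percent missing values.
    keep = atoms.columns[atoms.isna().mean() <= 0.25]
    cleaned = atoms[keep].copy()
    # 2) Convert continuous-valued features to binary; e.g., flag an atom as
    #    hydrophobic when its hydrophobicity value exceeds an assumed cutoff of 0.
    if "hydrophobicity" in cleaned.columns:
        cleaned["is_hydrophobic"] = (cleaned["hydrophobicity"] > 0.0).astype(int)
        cleaned = cleaned.drop(columns=["hydrophobicity"])
    return cleaned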