biomolecules

Review The Future of Secondary Structure Prediction Was Invented by Oleg Ptitsyn

1, 1, 1 2 1,3 Daniel Rademaker †, Jarek van Dijk †, Willem Titulaer , Joanna Lange , Gert Vriend and Li Xue 1,*

1 Centre for Molecular and Biomolecular Informatics (CMBI), Radboudumc, 6525 GA Nijmegen, The Netherlands; [email protected] (D.R.); [email protected] (J.v.D.); [email protected] (W.T.); [email protected] (G.V.) 2 Bio-Prodict, 6511 AA Nijmegen, The Netherlands; [email protected] 3 Baco Institute of Protein Science (BIPS), Mindoro 5201, Philippines * Correspondence: [email protected] These authors contributed equally to this work. †  Received: 15 May 2020; Accepted: 2 June 2020; Published: 16 June 2020 

Abstract: When Oleg Ptitsyn and his group published the first secondary structure prediction for a protein sequence, they started a research field that is still active today. Oleg Ptitsyn combined fundamental rules of physics with human understanding of protein structures. Most followers in this field, however, use machine learning methods and aim at the highest (average) percentage correctly predicted residues in a set of that were not used to train the prediction method. We show that one single method is unlikely to predict the secondary structure of all protein sequences, with the exception, perhaps, of future deep learning methods based on very large neural networks, and we suggest that some concepts pioneered by Oleg Ptitsyn and his group in the 70s of the previous century likely are today’s best way forward in the protein secondary structure prediction field.

Keywords: Oleg Ptitsyn; secondary structure prediction

1. Introduction In 1969, Oleg Ptitsyn published the article “Statistical analysis of the distribution of residues among helical and non-helical regions in globular proteins” [1]. This article forms the bridge between visual inspection of the (few) available protein structures to learn about their fold, stability, and (local) structure characteristics, and the protein secondary structure prediction (PSSP) methods of the 70s. The oldest article about PSSP we could find in PubMed is by Ptitsyn et al. [2] in 1973, albeit older articles from his group were already published in Russian in 1970 [3,4]. Thousands of articles have been written about PSSP. In the beginning, these studies were mainly performed to deepen our understanding of the principles of and stability. More recently, however, PSSP became mostly a topic to teach bioinformatics master students how to write software, or to illustrate that a newly developed machine learning technique can be made useful. The predicted secondary structure of proteins seems to have very little practical use. Manual inspection of the first 500 hits obtained by a Google Scholar search for “protein secondary structure” revealed a couple dozen articles about either the determination of secondary structure by—or the use of the known secondary structure in cryo-EM, atomic force microscopy, CD, IR, NMR, ESR, and Raman and vibration spectroscopy based biophysical studies. Additionally, we found “protein secondary structure” mentioned in articles related to spray-drying proteins [5], micelle density [6], protein design [7], aging of dried pollen [8], biofilm formation [9], dried seeds [10], and the effect of aspirin [11], as well as in several articles related to protein folding, protein–protein interfaces, and the

Biomolecules 2020, 10, 910; doi:10.3390/biom10060910 www.mdpi.com/journal/biomolecules Biomolecules 2020, 10, 910 2 of 14 stability of secondary structure elements. In bioinformatics studies, such as molecular dynamics simulations, fold recognition, structure superposition, or clustering protein structures in other ways, secondary structure can be used to evaluate or visualize the results. Fold recognition is the only example we found in which predicted secondary structures are sometimes used as an essential aspect of the method. In the past, one of us (G.V.) used predicted secondary structure in a project aimed at designing epitopes for peptide-based antibody generation, and in a project aimed at designing thermosensitive mutations, but these studies were not published. The vast majority of the 500 manually inspected articles describe methods to predict the secondary structure of a protein given its sequence. The method designed by Chou and Fasman [12,13] in the early 70s, for example, was first implemented in computer code by Lenstra in 1977 [14]; the last implementation we found was published 36 years later [15]. Most articles that describe yet another PSSP program start with a sentence like: “The ability to predict local structural features of a protein from the sequence is of paramount importance for unravelling its function in absence of experimental structure information”, which simply is nonsense as the function of a protein cannot be determined from its 3D structure, and thus even less from its (predicted) secondary structure. We found a series of articles in which the authors determine the secondary structure of proteins from their three-dimensional coordinates, but it is clear that DSSP [16] is the de facto standard for this work, and in this article, we only use DSSP to assign the secondary structure from 3D coordinates. Yang et al. [17] recently reviewed the 65-year history of secondary structure predictions rather beautifully. They start with the work of Pauling, Corey, and Branson [18,19], and end with today’s deep-learning methods that seem to outperform everything done in the previous 64 years. We can only urge the reader to read that review before continuing reading this article. Yang et al. end with four conclusions: (1) The accuracy of state-of-the-art three-state methods is around 83%; (2) Recent improvements come from larger databases, the use of templates, and powerful deep learning techniques; (3) Prediction of backbone angles (which can help predicting 3D structures from sequence alone) has room for further improvement; (4) The future (of PSSP) is bright as next-generation deep learning techniques can remember long-range interactions. Ptitsyn, Denesyuk, Finkelstein, and Lim based their landmark prediction of the secondary structure of the L7, L12 Escherichia coli ribosomal protein [2] on fundamental concepts of polymer physics for helix–coil transitions in water-soluble synthetic polypeptides. Today, the structure of the L7, L12 protein is known [20], so we can see how well they did in 1973. Ptitsyn et al. started from the assumption that the protein was mainly helical and contained no strands; this was based on biophysical measurements that, in retrospect, were wrong. Consequently, they missed the three strands in their prediction, but as Figure1A shows, their helix predictions were remarkably accurate. One of us (G.V.) was involved (in 1982) in the prediction of the secondary structure of phosphatidylcholine-transfer protein from bovine liver using Lenstra’s implementation of two PSSP methods [14]. Figure1B shows that this prediction at best can be called poor now that we can map it on the structure of a close homolog. Biomolecules 2020, 10, 910 3 of 14 Biomolecules 2020, 10, x FOR PEER REVIEW 3 of 14

(A)

(B)

Figure 1. (A) The structure of one monomer of the L7/L12 protein dimer. PDBid 1rqu [20] is colored by Figure 1. (A) The structure of one monomer of the L7/L12 protein dimer. PDBid 1rqu [20] is colored the helix predictions made by Ptitsyn et al. in 1973 (helix = red; nonhelix = green). The small orange by the helix predictions made by Ptitsyn et al. in 1973 (helix = red; nonhelix = green). The small orange stretch was predicted as helix by Ptitsyn et al., but falls in a region for which no NMR data is available, stretch was predicted as helix by Ptitsyn et al., but falls in a region for which no NMR data is available, but for which Bocharov et al. [20] wrote “There is an indirect evidence of the existence of transitory but for which Bocharov et al. [20] wrote “There is an indirect evidence of the existence of transitory helical structures at least in the first part (residues 33–43) of the hinge region”. To the right the sequence helical structures at least in the first part (residues 33–43) of the hinge region”. To the right the is shown colored by its secondary structure with underneath it the predicted secondary structure. sequence is shown colored by its secondary structure with underneath it the predicted secondary Helices are green, strands are yellow, and everything else is green. Loops are shown as dashes. (B) The structure. Helices are green, strands are yellow, and everything else is green. Loops are shown as structure of the phosphatidylcholine-transfer protein from bovine liver. PDBid 1ln3 [21] is colored by dashes. (B) The structure of the phosphatidylcholine-transfer protein from bovine liver. PDBid 1ln3 the secondary structure predictions by Akeroyd et al. [22] using the methods of Chou and Fasman, [21] is colored by the secondary structure predictions by Akeroyd et al. [22] using the methods of and Lim as implemented by Lenstra [14]. (same coloring as in A). Many things can be said about this Chou and Fasman, and Lim as implemented by Lenstra [14]. (same coloring as in A). Many things can prediction, but not that it could have been useful for any practical purpose. The sequence and predicted be said about this prediction, but not that it could have been useful for any practical purpose. The secondary structure to the right are also colored as in A. sequence and predicted secondary structure to the right are also colored as in A. Biomolecules 2020, 10, 910 4 of 14

The methods developed by Ptitsyn et al. were soon extended, refined, and published by Lim [23]. Unlike Ptitsyn and Finkelstein, who mainly paid attention to residue propensities to be inside or at the endsBiomolecules of secondary 2020, 10, x FOR structures, PEER REVIEW Lim concentrated on hydrophobic and hydrophilic surfaces4 of of 14 these structures. It should be mentioned that both these rather different methods, Ptitsyn–Finkelstein’s and Lim’s,the performed ends of secondary well in structures, the first world-wide Lim concentrated PSSP “competition”on hydrophobic [and24], hydrophilic which one surfaces could call of these “CASP-0”. Lim’sstructures. article is It hard should to read,be mentioned and the that method both these is even rather harder different to implement methods, Ptitsyn–Finkelstein’s in a computer program. and Lim’s,The concept performed of Jackknifing well in the firs or att world-wide least using PSSP disjunct “competition” training and [24], testing which datasets one could was call not “CASP- yet widely 0”. Lim’s article is hard to read, and the method is even harder to implement in a computer program. known in those days and consequently Lim claimed that the method was 80–85% correct. Kabsch and The concept of Jackknifing or at least using disjunct training and testing datasets was not yet Sander have later corrected this number to ~56% [25]. They described the lack of Jackknifing as widely known in those days and consequently Lim claimed that the method was 80–85% correct. “TheKabsch difference and canSander be understoodhave later corrected to be due this to numb specialer rulesto ~56% tailored [25]. They to particular described proteins the lackin of Lim’s method”.Jackknifing The people as “The in difference Ptitsyn’s can group be understood actually spent to be time due looking to special at rules wire-models tailored to and particular space-filling modelsproteins of protein in Lim’s structures method”. (see The Figurepeople2 in). Ptitsyn’s The small group set ofactually structures spent availabletime looking to themat wire-models in those days madeand that space-filling the rules models they saw of protein as “general” structures actually (see Figure were 2). often The rathersmall set specific of structures for particular available (classesto of) proteins.them in those In retrospect, days made it that is not the clear rules whetherthey saw Ptitsynas “general” initially actually focused were onoften the rather prediction specific of for helices becauseparticular the protein (classes structures of) proteins. available In retrospect, around it is 1970 not clear contained whether many Ptitsyn more initially residues focused in heliceson the than in strands,prediction or becauseof helices helices because are the easier protein to understand structures available in terms around of protein 1970 physics.contained More many recent more PSSP residues in helices than in strands, or because helices are easier to understand in terms of protein methods usually are aiming to predict the secondary structure of all proteins, and although examples physics. More recent PSSP methods usually are aiming to predict the secondary structure of all probably exist, the rule is that people who publish a PSSP method have not looked at any structure proteins, and although examples probably exist, the rule is that people who publish a PSSP method theyhave predicted. not looked at any structure they predicted.

(B)

(A) (C)

FigureFigure 2. The 2. The early early history history of of protein protein secondary secondary structure prediction. prediction. (A) ( AOleg) Oleg Ptitsyn Ptitsyn (with (with hand handin in hair)hair) and and V.I. V.I. Lim Lim study study physical physical models models ofof proteinsproteins and and peptides. peptides. (B) ( BRecent) Recent photo photo of some of some of the of the wirewire models models used used in Ptitsyn’sin Ptitsyn’s lab lab before before (and(and also after) after) computer computer graphics graphics systems systems became became available. available. (C) Model of one of the early structure predictions by Ptitsyn's group. One of us (G.V.) spent time in (C) Model of one of the early structure predictions by Ptitsyn’s group. One of us (G.V.) spent time the early 90s with Ptitsyn and Finkelstein looking at these models, at which time Oleg told him that in the early 90s with Ptitsyn and Finkelstein looking at these models, at which time Oleg told him “looking at models is the best theoretical approach”, a lesson G.V. never forgot and taught all his that “lookingstudents, including at models most is the coauthors best theoretical on this article. approach”, a lesson G.V. never forgot and taught all his students, including most coauthors on this article. From the hundreds of PSSP methods that have been published, the method by Chou and Fasman (C&F) [12,13] definitely became the most famous, together with the methods of Lim [23], and Biomolecules 2020, 10, 910 5 of 14

From the hundreds of PSSP methods that have been published, the method by Chou and Fasman (C&F) [12,13] definitely became the most famous, together with the methods of Lim [23], and Garnier–Osguthorpe–Robson [26]. For many years, new methods were published with the main conclusion of how much (actually how little) better they worked than these three methods. Chou and Fasman simply counted how often each amino acid type were observed in each of the four secondary structure types (helix, strand, turn, loop), and thereby derived rules for PSSP. Like the method of Lim, the C&F method suffers from a problem that Kabsch and Sander described as “There are ambiguities in two of the best known methods, those of Chou and Lim, in that they often give different results in the hands of different people and are therefore not programmable without extension or modification”. The first serious breakthrough in PSSP came when Rost et al. used neural networks with multiple sequence alignments as input [27]. We expect the second major breakthrough to come from deep neural networks (which essentially are, or according to us should be, very large neural networks (e.g., [28])). The upper limit of predictability of secondary structure has often been discussed (e.g., [29]), and it is commonly accepted that this upper limit is around 90%. We will shed some extra light on this topic. We find hints in the literature (e.g., [29]), about the lack of generality of PSSPs, and we will address this topic with some detail, to arrive at the conclusion that the work of the real, biological neural network of Oleg Ptitsyn, simulated in today’s artificial neural networks, holds the best hope for the future of PSSPs.

2. Materials and Methods DSSP was used for all secondary structure assignments. This software, and the assignments for all 3D protein structures in the PDB, are freely available from the CMBI protein structure bioinformatics facilities website [30]. The list of dimers in the PDB was determined with WHAT IF [31]. This list is available from https://swift.cmbi.umcn.nl/gv/facilities/too. PDB-REDO [32] files (freely available from https://pdb-redo.eu/) are reinterpreted PDB files that are re-refined with today’s software and today’s CPU power. It was shown convincingly that a PDB-REDO file normally is a better model for the underlying macromolecule than the corresponding PDB file [33]. To classify proteins based on the predictability of their secondary structures, we designed a secondary structure normality score (SSN score). Our SSN score is based on C&F-like parameters. C&F-like parameters for single amino acids and for amino acid pairs were determined with the proprietary software CFcheck that is available from the CMBI’s github area (https://github.com/cmbi/ CF-check). This software determines the secondary structure preference parameters P (AA , SS ) and   1 i k P2 AAi, AAj, SSi, SSj . P1(AAi, SSi) is the likelihood of amino acid type AAi appearing in secondary structure state SS . DSSP secondary structure states were reduced to three states (helix, strand, rest); and  i  P2 AAi, AAj, SSi, SSj is the likelihood of AAi and its C-terminally adjacent neighbor AAj appearing in secondary structure states SSi, SSj, respecitvely. Box1 explains how these parameters are determined, and how they are used. Although mathematically “cleaner” than the method published by Chou and Fasman [12,13], the use of P1 and P2 parameters gives similar results. The advantage of preference parameters over the original Chou and Fasman parameters is that preference parameters for individual amino acids and pairs can simply be added up over a whole protein. The secondary structure of a protein with a high SSN score will be predicted well by C&F like methods, and probably by most PSSP methods. A low SSN score implies poor predictability of the secondary structure. We define the secondary structure normality (SSN-score) of a protein of n amino acids as the average of all n P and n 1 P values. SSN-scores were determined for 22,405 protein chains longer 1 − 2 than 49 amino acids that share no pair-wise sequence identity larger than 30% [34]. An SSN score measures how “normal” a protein’s secondary structure is in terms of amino acid compositions in secondary structures. A protein has a high SSN score when the amino acid composition of its secondary structures is similar to the average of 22,405 proteins. SSN-score frequency plots are scaled to have the same area under the curve. The distributions were fitted using the Parzen–Rosenblatt window method (https://en.wikipedia.org/wiki/Kernel_density_ Biomolecules 2020, 10, 910 6 of 14 estimation#cite_note-Ros1956-1) as implemented in the Seaborn software (https://seaborn.pydata.org/ generated/seaborn.kdeplot.html#seaborn.kdeplot). For the analysis of intrinsically disordered proteins we have selected a subset of 225 sequences from the DisProt database [35] with a disorder content of at least 60%.

Box 1. Preference parameter determination.

60 Preference parameters are determined for the 20 amino acid types in 3 states and 3600 parameters are determined for 400 pairs of amino acid types in 3 3 states. These preference parameters are determined as the × logarithm of the observed frequency of each (aai,ssj) or (aai,aani,ssj,ssnj) combination divided by their predicted frequency. These predictions are the result of the multiplication of the chance that any amino acid type is aai with the chance that the secondary structure is ssj. For example, with 7% of the amino acids in the training set being Leu and 35% of all amino acids being helical, the expected frequency F(Leu,H) is 7% of 35% which is about 2.5%. The number of leucines in helices is considerably higher than 2.5% of the total number of amino acids in the dataset and therefore F(Leu,H)obs/F(Leu,H)pred is larger than 1, and thus the preference parameter P1(Leu,helix) = log(F(Leu,H)obs/F(Leu,H)pred) is larger than 0.0. The calculation of P2 values go in a similar fashion. Preference parameter usage. For each protein of length n, n P and n 1 P values are determined by 1 − 2 simple lookup in the P1 and P2 tables. All so obtained preference parameter are added up and divided by (n+n 1) to arrive at the protein’s SSN score. If this SSN score is high, many amino acids have the most preferred − secondary structure, which also implies that C&F-like PSSP methods should be able to predict their secondary structure well.

3. Results Oleg Ptitsyn studied secondary structures by looking at 2D and 3D images of proteins rather than analyzing numbers that describe protein structures. We picked three topics that illustrate that his approach to protein secondary structure prediction was a good one.

How accurate are the determinations of secondary structure from 3D structures? • How important is the size of the training set? • How important is the composition of the training set? • We address these topics because they illustrate how the methods designed by Oleg Ptitsyn and his group in the 70s were actually ahead of their time, and we will conclude that deep learning can merge the best of both worlds by automatically combining large datasets, big computers, and everything else we learned about PSSP methods with the idea that one cannot apply one PSSP method to all classes of proteins. These three topics also illustrate Oleg Ptitsyn’s idea that understanding the laws of physics that apply to the formation of secondary structure elements is very important.

3.1. How Accurate Are the Determinations of Secondary Structure from 3D Structures? Computer programs, such as DSSP, use patterns of hydrogen bonds, sometimes augmented with local motif superpositions, to determine which residues are in a helix, strand, etc. These methods use cut-offs to determine if a exists or not; with a cut-off of 3.50 Å, for example, a distance of 3.49 Å between the backbone N and O leads to the assignment “hydrogen bond”, while a distance of 3.51 Å does not. Obviously, this effect is strongest near the ends of secondary structure elements. It therefore makes great sense to either not care about the precise location of these ends [29], or to determine the ends of secondary structure elements by visual inspection of three-dimensional structures. The applied crystallographic methods, from crystallization (and resulting crystal packing) to model building and refinement, influence fine details in the structure such as lengths of hydrogen bonds near the ends of helices, and thus also the secondary structure assignment by DSSP. To evaluate the magnitude of this problem, we could compare DSSP assignments for structures solved at low resolution with the same structure later solved at higher resolution, or we could compare the DSSP assignments for close homologs. It would probably be best to have the structures of a large number of Biomolecules 2020, 10, x FOR PEER REVIEW 7 of 14

The applied crystallographic methods, from crystallization (and resulting crystal packing) to model building and refinement, influence fine details in the structure such as lengths of hydrogen bonds near the ends of helices, and thus also the secondary structure assignment by DSSP. To

Biomoleculesevaluate the2020 magnitude, 10, 910 of this problem, we could compare DSSP assignments for structures solved7 of 14 at low resolution with the same structure later solved at higher resolution, or we could compare the DSSP assignments for close homologs. It would probably be best to have the structures of a large proteinsnumber solvedof proteins independently solved independently by several crystallography by several crystallography groups and compare groups the outcomes,and compare but thisthe outcomes,still has the but problem this still that has athe protein problem crystallized that a protein from crystallized a saturated from ammonium a saturated sulfate ammonium solution sulfate is in a verysolution diff erentis in physicochemical a very different environment physicochemical than a protein environment solved from than a polyethylenea protein solved glycol solution.from a Whenpolyethylene a protein glycol crystallizes solution. as When a dimer a protein in the crystall asymmetricizes as unit, a dimer however, in the asymmetric the two monomers unit, however, in this thedimer two normally monomers are in covalently this dimer identical normally and are both covalently are certainly identical present and inboth the are same certainly physicochemical present in theenvironment. same physicochemical The only diff erenceenvironment. between The the only two monomersdifference isbetween that their the crystal two monomers packing environments is that their crystalare diff erent.packing Comparing environments the DSSPare different. assignments Comp ofaring monomers the DSSP in aassignments dimer are thus of monomers a good way in toa dimerestablish are to thus what a extentgood way DSSP to assignments establish to arewhat the ex resulttent DSSP of crystal assignments packing artefactsare the result and refinement of crystal packingstrategies, artefacts and thus and provide refinement a good strategies, estimate and of thus the provide very upper a good limit estimate of the accuracyof the very of upper secondary limit ofstructure the accuracy determinations, of secondary and structur thus ofe PSSP. determinations, and thus of PSSP. We used WHAT IF [[31]31] to findfind in the PDB all filesfiles thatthat have two covalently identical monomers in the asymmetric unit, and asked how different different those two monomers were in terms of secondary structure as assigned by DSSP. As crystallographerscrystallographers often force the two monomers to to be identical, we diddid notnot include include structures structures in in which which the the two two monomers monomers could could be superposed be superposed exactly. exactly. Figure Figure3 shows 3 showsthe results the results of comparing of comparing the eight-state the eight-state assignments assignments by DSSP by for DSSP about for 7000 about PDB 7000 files. PDB files. Figure3 3 shows shows that that there there are, are, on on average, average, 3% 3% di differentfferent DSSP DSSP assignments assignments between between monomers monomers in inadimer, a dimer, and and DSSP DSSP assignment assignment di differencesfferences of of more more than than ten tenpercent percent areare observedobserved for many dimers. dimers. In the PDB-REDOPDB-REDO files,files, thesethese di differencesfferences are are about abou twot two times times smaller. smaller. Clearly, Clearly, re-refinement re-refinement has has a big a biginfluence influence on theon the DSSP DSSP assignments, assignments, but but that that is beyond is beyond the the scope scope of thisof this article article [32 [32].]. Here, Here, it suit sufficesffices to toconclude conclude that that there there is already is already a 3% a 3% diff erencedifference in secondary in secondary structure structure depending depending on whether on whether one usesone useseither either the A the chain A chain or the or B the chain B chain from from the same the same dimer, dimer, and theseand these differences differences are most are most prominent prominent near nearthe ends the ofends helices. of helices. Needless Needless to say, betterto say, results bette arer results obtained are if obtained the secondary if the structure secondary assignments structure assignmentsfor the ends offor helices the ends are of either helices left are fuzzy either [29 left] or fuzzy are determined [29] or are bydetermine visual inspection,d by visual as inspection, was Oleg asPtitsyn’s was Oleg preferred Ptitsyn’s approach. preferred approach.

Figure 3. BoxBox and and whisker plots plots for the comparison of of DSSP assignment of of the A and B chains in covalently identical dimers. (Left) (Left) About About 7000 7000 dimers dimers found in PDB files; files; (Right) The corresponding PDB-REDO files. files. The The secondary secondary structure structure assignment assignmentss by by DSSP DSSP are are reduced reduced to to three three states (helix, strand, and rest). Assignments Assignments are are called called different different if if the assignment character for the residues at equivalent positions in the two chainschains in the dimer are didifferent.fferent. 3.2. How Important Is the Size of the Training Set? 3.2. How Important is the Size of the Training Set? When Oleg Ptitsyn started working on PSSP, neither the PDB nor graphics capable computers existed yet, and the world knew the 3D-structure of just a couple dozen proteins. Consequently, he could only learn from a very small dataset. Upon browsing through the first 500 Google Scholar hits for “protein secondary structure”, we found several hundred articles describing a PSSP method, and what the majority of these methods have in common is the use of a training set and a test set that were selected to contain only sequence unique proteins. Many of these articles mention that at Biomolecules 2020, 10, x FOR PEER REVIEW 8 of 14

When Oleg Ptitsyn started working on PSSP, neither the PDB nor graphics capable computers existed yet, and the world knew the 3D-structure of just a couple dozen proteins. Consequently, he could only learn from a very small dataset. Upon browsing through the first 500 Google Scholar hits for “protein secondary structure”, we found several hundred articles describing a PSSP method, and Biomolecules 2020, 10, 910 8 of 14 what the majority of these methods have in common is the use of a training set and a test set that were selected to contain only sequence unique proteins. Many of these articles mention that at some latersome time, later with time, more with data more available, data available, their method their method will work will work better. better. ThisThis latter, latter, however, however, is not is not applicableapplicable to tothe the work work of ofOleg Oleg Ptitsyn, Ptitsyn, who who wanted wanted to tounderstand understand the the physics physics of protein of protein structures. structures. PhysicsPhysics is isphysics; physics; once once gravity gravity is isunderstood, understood, for for example, example, we we do do not not need need a a thousand thousand more apples to to fallfall toto betterbetter understandunderstand gravity.gravity. WeWe took took 22,405 22,405 protein protein chains chains that that are are all all sequen sequencece unique unique at atthe the 30% 30% level. level. We We used used preference preference parametersparameters determined determined from from a (small) a (small) subset subset of ofthese these 22,405 22,405 and and asked asked how how “normal” “normal” all allproteins proteins in in ourour big big dataset dataset were. were. Figure Figure 4 shows4 shows that that once once a fe aw few thousand thousand non-redundant non-redundant proteins proteins are are available, available, a furthera further increase increase of of the the size size of of the training-set doesdoesnot not matter matter much much for for a preferencea preference parameter-based parameter- basedmethod, method, and weand expect we expect that thisthat will this also will hold also for hold the for hundreds the hundreds of PSSP of methods PSSP methods published published since 1973. sinceThis 1973. observation This observation can be explained can be explained with the so-called with the bottleneck so-called ebottleneckffect from evolutioneffect from theory, evolution in line theory,with thein line conclusions with the byconclusions Kabsch and by SanderKabsch [an25d] aboutSander the [25] dataset about used the dataset by Ptitsyn. used by Ptitsyn. WeWe were were not not aiming aiming at at predicting predicting the the secondary structurestructure ofof proteins. proteins. We We only only wanted wanted to to see see how howwell well such such methods methods could, could, in principle,in principle, perform. perform. The The underlying underlying idea idea is thatis that proteins proteins with with a very a very high highSSN-score SSN-score are are likely likely to beto be predicted predicted well well by by C&F-like C&F-like PSSP PSSP methods, methods, while while proteins proteins with with a verya very low lowSSN-score SSN-score will will not. not. We We repeated repeated the the experiments experiment manys many times times with with di ffdifferenerent subsets,t subsets, and and got got each each time timehighly highly similar similar results. results. We We have have not not Jackknifed Jackknifed the the method, method, which which explains explains (partly) (partly) why why the the SSN SSN score scoreof the of proteinsthe proteins in the in trainingthe training set is set higher is higher than th thatan ofthat the of proteins the proteins in the in test the set. test Once set. theOnce training the trainingset is largeset is enoughlarge enough to no longerto no longer suffer fromsuffer counting from counting statistics statistics problems, problems, this eff thisect iseffect gone. is Agone. larger A traininglarger training set would set would have improved have improved PSSP methods PSSP me (athods bit) in (a the bit) 80s in butthe by80s the but time by the the time PDB the held PDB many heldthousands many thousands of proteins, of proteins, this stopped this beingstopped true. being true.

Figure 4. Secondary structure normality (SSN) scores for all 22,405 proteins when preference parameters Figureare based 4. Secondary on a small, structure randomly normality selected subset(SSN) ofscores all proteins. for all From 22,405 (A –proteinsD) the preference when preference parameters parametersare determined are based from 50,on 150,a small, 500, andrandomly 2000 proteins, selected respectively subset of (inall blue).proteins. SSN From scores ( forA) theto proteins(D) the in preferencethe small parameters training sets are are determined in blue and from for the 50, complete 150, 500, setand of 2000 22,405 proteins, proteins resp in yellow.ectively Both (in blue). training SSN and scorestest setfor arethe shownproteins as in a the normalized small training histogram sets are with in the blue width and for of the the bins complete chosen set to of have 22,405 the proteins maximum in yellow.count between Both training 6 and 10, and and test similar set are between shown theas a two normalized distributions. histogram The curves with the are width obtained of the from bins those histograms using a kernel density estimation. The abscissa is based on the SSN scores for the proteins in the training set. Biomolecules 2020, 10, x FOR PEER REVIEW 9 of 14

chosen to have the maximum count between 6 and 10, and similar between the two distributions. The curves are obtained from those histograms using a kernel density estimation. The abscissa is based on the SSN scores for the proteins in the training set. Biomolecules 2020, 10, 910 9 of 14

3.3. How Important is the Composition of the Training Set? 3.3. How Important Is the Composition of the Training Set? Several hundred proteins with either the highest or the lowest SSN scores were manually analyzed.Several These hundred are the proteins proteins with that either are the likely highest toor perform the lowest best SSN or scoresworst, were respectively, manually analyzed.in PSSP projects.These are Figure the proteins 5 shows that the are protein likely with to perform the highes bestt and or worst, the protein respectively, with the in lowest PSSPprojects. SSN score Figure in the5 dataset.shows the protein with the highest and the protein with the lowest SSN score in the dataset.

(A) (B)

FigureFigure 5. 5. StructuresStructures with with extreme extreme secondary secondary structures. structures. (A (A) )1nay 1nay [36] [36 ]is is the the protein protein with with the the highest highest

“normality”, defineddefined asas thethe highesthighest averageaverage P(aa P(aai,ssi,ssj)j)+P(aa+ P(aai,aani,aani,ssi,ssj,ssnj,ssnj).j ).This This protein protein has has a along long stalk stalk withwith a a type type of of helix helix that that DSSP DSSP recognizes recognizes as as loop. loop. Th Thisis stalk stalk is is rich rich in in prolines and and P(Pro,Loop) P(Pro,Loop) is is very very highhigh so so that that the the normality normality of of this this protein protein is is high. high. (B (B) )3du1 3du1 [37] [37 ]is is the the protein protein with with the the lowest lowest “normality”. “normality”. TheThe repeats repeats in in this this protein are kept together by (c (cooperative)ooperative) interactions al alongong the the tube. The The strands strands (red)(red) are are accidentally accidentally rich rich in in residues residues for for which which P( P(aa,strand)aa,strand) is is low low while while the the (green) (green) hydrogen hydrogen bonded bonded turnsturns and and (blueish) (blueish) loops loops contain contain residues residues for for which which this this parameter parameter is is high high (and (and thus thus P(aa,turn) P(aa,turn) is is low). low).

AmongAmong the the proteins proteins with with the the lowest lowest normality normality sc score,ore, we we found found a a large large number number of of ubiquitins, ubiquitins, VEGF-AVEGF-A proteins, proteins, and and haemoglobins. Visual Visual inspec inspectiontion revealed revealed that that these these extremely extremely low-scoring low-scoring proteinsproteins contain contain many, many, or or sometimes sometimes exclusively exclusively d-amino d-amino acids, acids, but but we we could could not not find find a a trivial trivial explanationexplanation for for the the large large number number of of haemoglobins haemoglobins with with low low normality. normality. Figure Figure 6 shows the histograms ofof all all 22,405 22,405 SSN SSN scores scores with with as as insert insert a a histogram histogram for for the the SSN SSN scores scores of of 79 79 files files in in the the dataset dataset with with “hemoglobin” eithereither inin thethe text text of of the the PDBFINDER PDBFINDER [34 [34]] entry, entry, or inor the in PDBthe PDB website’s website’s “Structure “Structure Title” Title”or “Classification” or “Classification” record. record. HaemoglobinsHaemoglobins are are among among the the oldest oldest proteins proteins for for which which the the 3D 3D structure structure was was determined. determined. Consequently,Consequently, OlegOleg PtitsynPtitsyn andand his his group group studied studied these these proteins proteins longest longest [1], [1], and and therefore therefore might might have haveunderstood understood them betterthem andbetter predicted and predicted their secondary their structuressecondary better.structures These better. same haemoglobins,These same haemoglobins,however, are likely however, to be are predicted likely to worsebe predicted by C&F-like worse PSSPby C&F-like methods. PSSP Figure methods.6 indeed Figure shows 6 indeed that showshaemoglobins that haemoglobins tend to have tend a worse to have SSN a score worse than SSN all score other than proteins; all other this ledproteins; us to ask this if led there us areto ask many if theregroups are of many proteins groups that of have proteins aberrant that SSNhave scores. aberrant We SSN first scores. looked We at first GPCRs looked as their at GPCRs core structure as their corehas sevenstructure helices has and seven essentially helices noand strands, essentially just likeno strands, the haemoglobins. just like the GPCRs, haemoglobins. however, getGPCRs, very however,normal SSN get scoresvery (despitenormal beingSSN scores membrane (despite proteins), being asmembrane do ubiquitins, proteins), antibodies, as do andubiquitins, a whole antibodies,series of protein and a classeswhole series that have of protein one representative classes that have among oneeither representative the lowest among or the either highest the 250 lowest SSN orscores. the highest The only 250 other SSN aberrantly scores. The scoring only other class weaber foundrantly was scoring based class on the we word found “finger” was based (zinc-fingers on the wordand ring-fingers “finger” (zinc-fingers are recognizable and ring-fingers local folding are motifs recognizable that often local are mentionedfolding motifs in the that name often or are the mentionedannotation in of the a protein). name or the annotation of a protein). On the other hand, if we look at all de novo proteins, we see that they have very high SSN scores, something that is not surprising, as secondary structure preference parameters undoubtedly are used Biomolecules 2020, 10, x FOR PEER REVIEW 10 of 14

Biomolecules 2020, 10, 910 10 of 14 On the other hand, if we look at all de novo proteins, we see that they have very high SSN scores, something that is not surprising, as secondary structure preference parameters undoubtedly are used in indesigning designing the the sequences sequences of ofthese these proteins. proteins. Figure Figure 6 also6 also shows shows that that thermostable thermostable proteins proteins tend tend to to havehave high high SSN SSN scores scores (in (in line line with with old old work work about about thermostabilizing thermostabilizing mutations[e.g. mutations (e.g., 38]). [38 ])). So,So, when when it comes it comes to toSSN SSN scores, scores, there there seem seem to tobebe different different types types of ofprotein protein classes, classes, those those that that behavebehave like like the the average, average, and and those those that that do do not. not.

Figure 6. Normality scores for four classes of proteins determined using SSN scores obtained from Figurea large, 6. Normality representative scores dataset. for four The classes green of histogram proteins determined is for all 22,405 using proteins SSN scores in the obtained dataset, from the bluea large,histogram representative is for the dataset. subset. TheTheaxes green are histogram similar as inis Figurefor all4 22,405.( A) The proteins blue histogram in the dataset, is for 79the files blue with histogram“hemoglobin” is for the in thesubset. text. The (B) axes 65 proteins are similar with as “finger” in figure in 4. the (Top text left (mostly) The zinc-fingerblue histogram or ring-finger). is for 79 files(C )with 387 proteins“hemoglobin” with “thermophiles” in the text. (Top in theright text.) 65 ( Dproteins) 66 proteins with with“finger” “de in novo” the text in the (mostly text. zinc- finger or ring-finger). (Bottom left) 387 proteins with “thermophiles” in the text. (Bottom right) 66 proteinsWe asked with what“de novo” would in the happen text. if we determine the preference parameters from plant proteins only, and then determine the scores for all proteins. We asked this question because plant proteins areWe known asked to what be di ffwoulderent fromhappen animal if we proteins determine in many the preference ways [39]. parameters Figure7 suggests from plant that itproteins does not only,matter and much then whetherdetermine preference the scores parameters for all proteins are determined. We asked from this 22,405 question proteins because or from plant just proteins 156 plant areproteins. known to It be is surprising,different from however, animal that proteins preference in many parameters ways[39]. determined Figure 7 suggests from plant that proteinsit does not give matterhigher much SSN whether scores for preference haemoglobins parameters than preference are determined parameters from determined 22,405 proteins from or all from 22,405 just proteins, 156 plantwhile proteins. the opposite It is surprising, is found for however, antibodies. that Bothpreference haemoglobins paramete andrs determined antibodies arefrom non-plant plant proteins proteins givethat higher are typically SSN scores found for in haemoglobins animals only. than preference parameters determined from all 22,405 proteins,Figure while8 showsthe opposite the SSN is found scores for antibodies all proteins. Both calculated haemoglobins with preference and antibodies parameters are basednon-plant on all proteinsproteins. that Small are typically proteins clearlyfound in have animals more extremeonly. scores than large proteins. Manual inspection of the scoresFigure of all8 shows proteins the of SSN 50–60 scores amino for acid all lengthproteins revealed calculated that with among preference the low scoring parameters proteins, based we on find allmany proteins. peptides Small that proteins are fixated clearly in have a complex more extreme (like inhibitors scores boundthan large to the proteins. Manual they should inspection inhibit), of whilethe scores among of all the proteins high scoring of 50–60 proteins, amino we acid find length mainly revealed monomers that among and homomultimers. the low scoring The proteins, peptides wethat find are many bound peptides in a complex that are are fixated likely in to a havecomple anx induced-fit (like inhibitors structure bound that to mightthe enzyme not necessarily they should need inhibit),to follow while the among preference the high parameters. scoring proteins, we find mainly monomers and homomultimers. The peptides that are bound in a complex are likely to have an induced-fit structure that might not necessarily need to follow the preference parameters. Biomolecules 2020, 10, x FOR PEER REVIEW 11 of 14

Biomolecules 2020, 10, x FOR PEER REVIEW 11 of 14

Biomolecules 2020, 10, 910 11 of 14

Figure 7. SSNSSN scores scores based based on onplant-based plant-based versus versus general general preference preference parameters. parameters. SSN scores SSN scoresfor all for22,405 all 22,405proteins proteins were determined were determined twice. twice.For the For abscissa, the abscissa, the SSN the scores SSN scoreswere determined were determined using usingpreference preference parameters parameters extracted extracted from fromall protei all proteins,ns, while while for forthe the ordinate, ordinate, the the SSN SSN scores scores were Figure 7. SSN scores based on plant-based versus general preference parameters. SSN scores for all determined using preference parameters obtained from 156 plant proteins. So, dots above the (white) 22,405 proteins were determined twice. For the abscissa, the SSN scores were determined using Plant-SSNPlant-SSN=All-SSN=All-SSN diagonal line represent proteins that have a higher SSN score when the preference preference preference parameters extracted from all proteins, while for the ordinate, the SSN scores were parameters are are determined determined from from plant plant proteins proteins only only,, while while dots dots below below the line the lineget aget higher a higher SSN score SSN determined using preference parameters obtained from 156 plant proteins. So, dots above the (white) scorewhen whengeneral general preference preference parameters parameters are areused. used. Yellow: Yellow: all allproteins; proteins; Blue: Blue: plant plant proteins; proteins; Black: Plant-SSN=All-SSN diagonal line represent proteins that have a higher SSN score when the preference Haemoglobins; Red: antibodies. ( A)) the the whole whole plot plot including including all all 22,405 22,405 proteins; ( B) blown-up part of parameters are determined from plant proteins only, while dots below the line get a higher SSN score the left-handleft-hand sideside plotplot showing showing most most red, red, blue, blue, and and black black dots dots to to better better see see the the systematic systematic nature nature of the of when general preference parameters are used. Yellow: all proteins; Blue: plant proteins; Black: asymmetriesthe asymmetries of these of these colored colored dots. dots. Haemoglobins; Red: antibodies. (A) the whole plot including all 22,405 proteins; (B) blown-up part of the left-hand side plot showing most red, blue, and black dots to better see the systematic nature of the asymmetries of these colored dots.

Figure 8. SSN scores as function of protein length.

Long proteins tend to have homogeneous SSN sc scores,ores, while while shorter shorter proteins proteins are are more more extreme. extreme. Short proteins can be more Figure influenced influenced 8. SSN scores by the as environment environmentfunction of protein than length.long ones, ones, and and indeed, indeed, short short proteins, proteins, when bound to something, tend toto havehave poorpoor SSNSSN scores.scores. Long proteins tend to have homogeneous SSN scores, while shorter proteins are more extreme. One of the the referees referees suggested suggested that that we we analyze analyze na nativelytively unfolded unfolded proteins. proteins. We Weused used a set a of set 225 of Short proteins can be more influenced by the environment than long ones, and indeed, short proteins, 225intrinsically intrinsically disordered disordered proteins proteins and asked and askedthe questi theon question what the what SSN thescore SSN would score be would if we predicted be if we when bound to something, tend to have poor SSN scores. predictedall amino acids all amino as helix, acids as as strand, helix, asor strand,as turn. or as turn. One of the referees suggested that we analyze natively unfolded proteins. We used a set of 225 Figure 99 shows shows that,that, comparedcompared toto thethe nativenative secondarysecondary structure,structure, thethe SSNSSN scoresscores ofof allall proteinsproteins intrinsically disordered proteins and asked the question what the SSN score would be if we predicted decrease when one type of secondary structure is assumed.assumed. The assumption that the whole protein is all amino acids as helix, as strand, or as turn. one long helix gives the highest scores, while the as assumptionsumption that the whole pr proteinotein is one long loop givesFigure the 9 worstshows score. that, Whencompared the sameto the is native done forseco 225ndary intrinsically structure, disordered the SSN scores proteins, of all the proteins SSN scores decreaseare worst when for one the type Strand of secondary assumption, structure while is the assumed. scores for The the assumption Loop assumption that the arewhole close protein to the is SSN one scoreslong helix of folded gives proteins.the highest This scores, latter while observation the assumption is not overly that surprisingthe whole asprotein unfolded is one protein long loop regions Biomolecules 2020, 10, x FOR PEER REVIEW 12 of 14

gives the worst score. When the same is done for 225 intrinsically disordered proteins, the SSN scores Biomolecules 2020, 10, 910 12 of 14 are worst for the Strand assumption, while the scores for the Loop assumption are close to the SSN scores of folded proteins. This latter observation is not overly surprising as unfolded protein regions essentiallyessentially areare unstructuredunstructured loops.loops. C&F-likeC&F-like PSSPPSSP methodsmethods are thus expected the predict that natively unfoldedunfolded proteinsproteins consistconsist mainlymainly ofof loops,loops, withwith thethe occasionaloccasional helix,helix, andand barelybarely nono strands.strands.

FigureFigure 9.9. SSN scores forfor intrinsicallyintrinsically disordered proteinprotein regions assuming one secondary structure. TheThe greygrey curvecurve representsrepresents thethe SSNSSN scoresscores forfor thethe fullfull setset ofof 22,40522,405 proteins.proteins. TheThe SSNSSN scoresscores ofof thesethese 22,405 proteins were determined assuming each protein was either one long Helix (dashed blue), 22,405 proteins were determined assuming each protein was either one long Helix (dashed blue), Strand (dashed green), or Turn (dashed magenta). Similarly, solid lines (in the same colors) show Strand (dashed green), or Turn (dashed magenta). Similarly, solid lines (in the same colors) show the the SSN scores determined for 225 natively unfolded protein regions assuming one single secondary SSN scores determined for 225 natively unfolded protein regions assuming one single secondary structure for that whole region. structure for that whole region. 4. Discussion 4. Discussion When we apply C&F-style preference parameters to determine the secondary structure normality of allWhen 22,405 we proteins apply in C&F-style the dataset preference and look atparameters the proteins to withdetermine the lowest the secondary structurestructure normality,normality weof all observe 22,405 several proteins classes in the that, dataset on average, and look are at either the proteins very “normal” with the or lowest very “abnormal”. secondary “Abnormal”structure normality, proteins we would observe be predicted several classes poorly that, by a on C&F-like average, PSSP are tool.either Indeed, very “normal” the sequence or very of 3DU1“abnormal”. (see Figure “Abnormal”5) was predicted proteins bywould the mostbe predicte recentd C&F poorly prediction by a C&F-like server addedPSSP tool. to the Indeed, internet the (CFSSP9)sequence toof contain3DU1 (see much Figure more 5) helixwas predicted than strand by the and most loop, recent while C&F Figure prediction5 shows server that it added consists to ofthe strand, internet loop, (CFSSP9) and turn to contain only. Among much more the 500 helix most than abnormal strand and proteins, loop, while we find Figure a large 5 shows number that of it haemoglobins.consists of strand, When loop, we and recalculate turn only. the Among parameters the 500 using most only abnormal 79 haemoglobins proteins, then,we find obviously, a large thenumber haemoglobins of haemoglobins. all are considered When we normal, recalculate but amazingly,the parameters the large using dataset only of79 22,405haemoglobins proteins then, does notobviously, score very the abnormal.haemoglobins It is clearall are that considered PSSPs based normal, on C&F-like but amazingly, parameters the willlarge only dataset work of well 22,405 for middle-of-the-roadproteins does not score proteins, very abnormal. and this is It probably is clear that true PSSPs for most based PSSP on C&F-like projects, evenparameters when basedwill only on smallwork orwell medium for middle-of-the-road size neural networks. proteins, and this is probably true for most PSSP projects, even whenIn based Ptitsyn’s on small group, or people medium looked size atneural protein networks. structures and tried to understand the principles that governedIn Ptitsyn’s their folding group, and people stability. looked Just at this—“looking”protein structures at proteinsand tried using to understand deep learning the principles methods basedthat governed on large neural their networksfolding and and bigstability. data—is Just that this—“looking” what modern PSSPs at proteins try to do! using Our resultsdeep learning suggest thatmethods PSSP based projects on mightlarge benefitneural networks from a (fully and unsupervised) big data—is that protein-class what modern prediction, PSSPs try followed to do! byOur a predictionresults suggest with that a PSSP PSSP method projects specific might forbenefit that fr class.om a These (fully classesunsupervised) will likely protein-class neither be theprediction, typical all-followedα, all-β by, α a/β prediction, α+β and furtherwith a PSSP subclasses method used specific by most for protein that class. classification These classes schemes, will likely nor will neither they alwaysbe the typical be classes all-α of, all- proteinsβ, α/β, α with+β and a similar further function. subclasses Classes used by might most beprotein function-related, classification related schemes, to cellularnor will location, they always to species, be classes to origin, of proteins etcetera. with We a expectsimilar that function. if all kinds Classes of technical might be problems, function-related, such as overtraining,related to cellular local minimalocation, in to neural species, space, to origin, or computational etcetera. We times expect required, that if canall bekinds overcome, of technical then modernproblems, deep-learning such as overtraining, based methods local minima (with a very,in neural very bigspace, neural or computational network) can automatically times required, detect can thesebe overcome, classes, and then it modern is remarkable deep-learning that (albeit based without methods knowing (with it) a this very, is rather very big much neural what network) Oleg Ptitsyn can and his group did when they started the PSSP research field in the early 70s. Biomolecules 2020, 10, 910 13 of 14

Acknowledgments: The authors thank Alyosha Finkelstein for technical assistance. Conflicts of Interest: The authors declare no conflict of interest.

References

1. Ptitsyn, O.B. Statistical analysis of the distribution of amino acid residues among helical and non-helical regions in globular proteins. J. Mol. Biol. 1969, 42, 501–510. [CrossRef] 2. Ptitsyn, O.B.; Denesyuk, A.I.; Finkelstein, A.V.; Lim, V.I. Prediction of the secondary structure of the L7, L12 proteins of the E. coli ribosome. FEBS Lett. 1973, 34, 55–57. [CrossRef] 3. Ptitsyn, O.B.; Finkel’shte˘ın, A.V. Predicting the spiral portions of globular proteins from their primary structure. Doklady Akademii nauk SSSR 1970, 195, 221–224. [PubMed] 4. Ptitsyn, O.B.; Finkel’shte˘ın, A.V. Relation of the secondary structure of globular proteins to their primary structure. Biofizika 1970, 15, 757–768. [PubMed] 5. Conformational analysis of protein secondary structure during spray-drying of antibody/mannitol formulations. Eur. J. Pharm. Biopharm. 2007, 65, 1–9. [CrossRef] 6. Sallach, R.E.; Wei, M.; Biswas, N.; Conticello, V.P.; Lecommandoux, S.; Dluhy, R.A.; Chaikof, E.L. Micelle Density Regulated by a Reversible Switch of Protein Secondary Structure. J. Am. Chem. Soc. 2006, 128, 12014–12019. [CrossRef] 7. Stigers, K.D.; Soth, M.J.; Nowick, J.S. Designed molecules that fold to mimic protein secondary structures. Curr. Opin. Chem. Biol. 1999, 3, 714–723. [CrossRef] 8. Wolkers, W.F.; Hoekstra, F.A. Aging of Dry Desiccation-Tolerant Pollen Does Not Affect Protein Secondary Structure. Plant Physiol. 1995, 109, 907–915. [CrossRef] 9. Gao, C.; Taylor, J.; Wellner, N.; Byaruhanga, Y.B.; Parker, M.L.; Mills, E.C.; Belton, P.S. Effect of Preparation Conditions on Protein Secondary Structure and Biofilm Formation of Kafirin. J. Agric. Food Chem. 2005, 53, 306–312. [CrossRef] 10. Golovina, E.A.; Wolkers, W.F.; Hoekstra, F.A. Long-Term Stability of Protein Secondary Structure in Dry Seeds. Comp. Biochem. Physiol. A Physiol. 1997, 117, 343–348. [CrossRef] 11. Neault, J.F.; Novetta-Delen, A.; Arakawa, H.; Malonga, H.; Tajmir-Riahi, H.A. The effect of aspirin-HSA complexation on the protein secondary structure. Can. J. Chem. 2000, 78, 291–296. [CrossRef] 12. Chou, P.Y.; Fasman, G.D. Conformational parameters for amino acids in helical, β-sheet, and random coil regions calculated from proteins. Biochemistry 1974, 13, 211–222. [CrossRef] 13. Chou, P.Y.; Fasman, G.D. Prediction of protein conformation. Biochemistry 1974, 13, 222–245. [CrossRef] [PubMed] 14. Lenstra, J.A. Evaluation of secondary structure predictions in proteins. Biochim. Biophys. Acta BBA Protein Struct. 1977, 491, 333–338. [CrossRef] 15. Ashok Kumar, T. CFSSP: Chou and Fasman Secondary Structure Prediction server. Wide Spectr. 2013, 1, 15–19. [CrossRef] 16. Kabsch, W.; Sander, C. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22, 2577–2637. [CrossRef] 17. Yang, Y.; Gao, J.; Wang, J.; Heffernan, R.; Hanson, J.; Paliwal, K.; Zhou, Y. Sixty-five years of the long march in protein secondary structure prediction: The final stretch? Brief. Bioinform. 2018, 19, 482–494. [CrossRef] 18. Pauling, L.; Corey, R.B. Configurations of Polypeptide Chains with Favored Orientations around Single Bonds: Two New Pleated Sheets. Proc. Natl. Acad. Sci. USA 1951, 37, 729–740. [CrossRef] 19. Pauling, L.; Corey, R.B.; Branson, H.R. The structure of proteins: Two hydrogen-bonded helical configurations of the polypeptide chain. Proc. Natl. Acad. Sci. USA 1951, 37, 205–211. [CrossRef] 20. Bocharov, E.V.; Sobol, A.G.; Pavlov, K.V.; Korzhnev, D.M.; Jaravine, V.A.; Gudkov, A.T.; Arseniev, A.S. From Structure and Dynamics of Protein L7/L12 to Molecular Switching in Ribosome. J. Biol. Chem. 2004, 279, 17697–17706. [CrossRef] 21. Roderick, S.L.; Chan, W.W.; Agate, D.S.; Olsen, L.R.; Vetting, M.W.; Rajashankar, K.R.; Cohen, D.E. Structure of human phosphatidylcholine transfer protein in complex with its ligand. Nat. Struct. Biol. 2002, 9, 507–511. [CrossRef][PubMed] Biomolecules 2020, 10, 910 14 of 14

22. Akeroyd, R.; Lenstra, J.A.; Westerman, J.; Vriend, G.; Wirtz, K.W.A.; Deenen, L.L.M. Prediction of Secondary Structura1 Elements in the Phosphatidylcholine-Transfer Protein from Bovine Liver. Eur. J. Biochem. 1982, 121, 391–394. [CrossRef][PubMed] 23. Lim, V.I. Algorithms for prediction of α-helical and β-structural regions in globular proteins. J. Mol. Biol. 1974, 88, 873–894. [CrossRef] 24. Schulz, G.E.; Barry, C.D.; Friedman, J.; Chou, P.Y.; Fasman, G.D.; Finkelstein, A.V.; Lim, V.I.; Ptitsyn, O.B.; Kabat, E.A.; Wu, T.T.; et al. Comparison of predicted and experimentally determined secondary structure of adenyl kinase. Nature 1974, 250, 140–142. [CrossRef][PubMed] 25. Kabsch, W.; Sander, C. How good are predictions of protein secondary structure? FEBS Lett. 1983, 155, 179–182. [CrossRef] 26. Garnier, J.; Osguthorpe, D.J.; Robson, B. Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol. 1978, 120, 97–120. [CrossRef] 27. Rost, B.; Sander, C. Improved prediction of protein secondary structure by use of sequence profiles and neural networks. Proc. Natl. Acad. Sci. USA 1993, 90, 7558–7562. [CrossRef] 28. Wang, S.; Peng, J.; Ma, J.; Xu, J. Protein Secondary Structure Prediction Using Deep Convolutional Neural Fields. Sci. Rep. 2016, 6, 18962. [CrossRef][PubMed] 29. Rost, B.; Sander, C.; Schneider, R. Redefining the goals of protein secondary structure prediction. J. Mol. Biol. 1994, 235, 13–26. [CrossRef] 30. Lange, J.; Baakman, C.; Pistorius, A.; Krieger, E.; Hooft, R.; Joosten, R.P.; Vriend, G. Facilities that make the PDB data collection more powerful. Protein Sci. 2020, 29, 330–344. [CrossRef][PubMed] 31. Vriend, G. WHAT IF: A molecular modeling and drug design program. J. Mol. Graph. 1990, 8, 52–56. [CrossRef] 32. Joosten, R.P.; Long, F.; Murshudov, G.N.; Perrakis, A. The PDB_REDO server for macromolecular structure model optimization. IUCrJ 2014, 1, 213–220. [CrossRef] 33. Joosten, R.P.; Womack, T.; Vriend, G.; Bricogne, G. Re-refinement from deposited X-ray data can deliver improved models for most PDB entries. Acta Crystallogr. D Biol. Crystallogr. 2009, 65, 176–185. [CrossRef] [PubMed] 34. Hooft, R.W.W.; Sander, C.; Scharf, M.; Vriend, G. The PDBFINDER database: A summary of PDB, DSSP and HSSP information with added value. Bioinformatics 1996, 12, 525–529. [CrossRef][PubMed] 35. Hatos, A.; Hajdu-Soltész, B.; Monzon, A.M.; Palopoli, N.; Álvarez, L.; Aykac-Fas, B.; Bassot, C.; Benítez, G.I.; Bevilacqua, M.; Chasapi, A.; et al. DisProt: Intrinsic protein disorder annotation in 2020. Nucleic Acids Res. 2020, 48, D269–D276. [CrossRef][PubMed] 36. Stetefeld, J.; Frank, S.; Jenny, M.; Schulthess, T.; Kammerer, R.A.; Boudko, S.; Landwehr, R.; Okuyama, K.; Engel, J. Collagen stabilization at atomic level: Crystal structure of designed (GlyProPro)10foldon. Struct. Lond. Engl. 2003, 11, 339–346. [CrossRef] 37. Casarotto, M.G.; Gibson, F.; Pace, S.M.; Curtis, S.M.; Mulcair, M.; Dulhunty, A.F. A structural requirement for activation of skeletal ryanodine receptors by peptides of the dihydropyridine receptor II-III loop. J. Biol. Chem. 2000, 275, 11631–11637. [CrossRef][PubMed] 38. Querol, E.; Perez-Pons, J.A.; Mozo-Villarias, A. Analysis of protein conformational characteristics related to thermostability. Protein Eng. Des. Sel. 1996, 9, 265–271. [CrossRef] 39. Ramírez-Sánchez, O.; Pérez-Rodríguez, P.; Delaye, L.; Tiessen, A. Plant Proteins Are Smaller Because They Are Encoded by Fewer Exons than Animal Proteins. Genom. Proteom. Bioinform. 2016, 14, 357–370. [CrossRef]

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).