Prediction of DNA-Binding Proteins from Relational Features
Andrea Szabóová1*, Ondřej Kuželka1, Filip Železný1 and Jakub Tolar2
Szabóová et al. Proteome Science 2012, 10:66
http://www.proteomesci.com/content/10/1/66

RESEARCH Open Access

Abstract

Background: The process of protein-DNA binding has an essential role in the biological processing of genetic information. We use relational machine learning to predict DNA-binding propensity of proteins from their structures. Automatically discovered structural features are able to capture some characteristic spatial configurations of amino acids in proteins.

Results: Prediction based only on structural relational features already achieves results competitive with existing methods based on physicochemical properties on several protein datasets. Predictive performance is further improved when structural features are combined with physicochemical features. Moreover, the structural features provide some insights not revealed by physicochemical features. Our method is able to detect common spatial substructures. We demonstrate this in experiments with zinc finger proteins.

Conclusions: We introduced a novel approach to DNA-binding propensity prediction using relational machine learning, which could potentially also be used for protein function prediction in general.

Keywords: DNA-binding propensity prediction, DNA-binding proteins, Relational machine learning

*Correspondence: [email protected]
1Czech Technical University, Prague, Czech Republic
Full list of author information is available at the end of the article

© 2012 Szabóová et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

The process of protein-DNA interaction has been an important subject of recent computational-biology research; however, it is not yet completely understood. DNA-binding proteins have a vital role in the biological processing of genetic information, such as DNA transcription, replication, maintenance and the regulation of gene expression. Several computational approaches have recently been proposed for the prediction of DNA-binding function from protein structure. In this paper we are interested in prediction of DNA-binding propensity of proteins using their structural information and physicochemical properties. This approach is in contrast with some of the most recent methods, which are based on similarity of proteins, for example structural alignment or threading-based methods [1-3], or methods exploiting information about evolutionary conservation of amino acids in proteins [4]. In general, methods exploiting evolutionary information can be more accurate than the approaches aiming to infer binding propensity purely from physicochemical or structural protein properties. On the other hand, the main advantage of the approaches not using evolutionary information is that they do not rely on the existence of homologous proteins, and they may also provide interpretable patterns describing the binding principles.

In one of the pioneering works on the prediction of DNA-binding propensity, Stawiski et al. [5] investigated the structural and sequence properties of large, positively charged electrostatic patches on DNA-binding protein surfaces. They used a neural network with 12 features such as molecular weight per residue, patch size, percent α-helix in patch, average surface area per residue, number of residues with hydrogen-bonding capacity, percent of patch and cleft overlap, number of lysine and polar isosteres in Lysoff patches, and percent of conserved positive and aromatic residues in the patch. Ahmad and Sarai [6] trained a neural network based on the net charge and the electric dipole and quadrupole moments of the protein. The combination of charge and dipole moment performed best, while information about the quadrupole moment improved the accuracy only slightly. They found that DNA-binding proteins have significantly higher net positive charges and electric moments than other proteins. Bhardwaj et al. [7] examined the sizes of positively charged patches on the surface of DNA-binding proteins. They trained a support vector machine classifier using the protein's overall charge and its overall and surface amino acid composition. Szilágyi and Skolnick [8] created a logistic regression classifier based on the amino acid composition, the asymmetry of the spatial distribution of specific residues and the dipole moment of the protein. Patel et al. [9] used an artificial neural network to discriminate DNA-binding proteins from non-DNA-binding proteins using amino-acid sequential information. For each amino acid sequence they created a set of 62 sequence features. Nimrod et al. [4] presented a random forest classifier for identifying DNA-binding proteins among proteins with known 3D structures. First, their method detects clusters of evolutionarily conserved regions on the surface of proteins using the PatchFinder algorithm. Next, a classifier is trained using features like the electrostatic potential, cluster-based amino acid conservation patterns, the secondary structure content of the patches and features of the whole protein, including all the features used by Szilágyi and Skolnick [8].

In the present work, we use an automatic feature construction method based on relational machine learning to discover structural patterns capturing spatial configuration of amino acids in proteins. Numbers of occurrences of each discovered pattern in a protein become attributes of the protein, which are then used by a machine learning algorithm to predict the DNA-binding propensity of the protein. We combine two categories of features to predict the DNA-binding propensity of proteins. The first category contains physicochemical features which enabled Szilágyi and Skolnick's method [8] to achieve state-of-the-art predictive accuracies. The second category contains structural features representing the discovered spatial patterns in protein structures. Using predictive classifiers based on these features we obtain accuracies competitive with existing physicochemical-based methods on several datasets of proteins. Moreover, our method is able to detect conserved spatial substructures, which we demonstrate in experiments with zinc finger proteins.

Nassif et al. [10] previously used a relational-learning-based approach in a similar context, in particular to classify hexose-binding proteins. The main differences of our approach from the method of Nassif et al. [10] are as follows. First, the fast relational learning algorithm [11] that we use enables us to produce features by inspecting much larger structures (up to tens of thousands of entries in a learning example) than those considered in the work of Nassif et al. [10] using the standard learning system Aleph. Second, our structural features acquire values equal to the number of occurrences of the corresponding spatial pattern, whereas Nassif et al. [10] only distinguish the presence of a pattern from its absence. Our preliminary results [12] indicated that occurrence-counting indeed substantially lifts predictive accuracy. Lastly, the approach of Nassif et al. [10] resulted in classifiers that are more easily interpretable than state-of-the-art classifiers and comparable in predictive accuracy. Here we maintain the interpretability advantage and achieve accuracies competitive with the state of the art both by a purely structural approach (without the physicochemical features) and through the combination of structural and physicochemical features.

Materials and methods

Data

DNA-binding proteins are proteins that are composed of DNA-binding domains. A DNA-binding domain is an independently folded protein domain that contains at least one motif that recognizes double- or single-stranded DNA. We worked with the following datasets in our experiments:

• PD138 - dataset of 138 DNA-binding protein structures in complex with DNA,
• UD54 - dataset of 54 DNA-binding protein structures in unbound conformation,
• BD54 - dataset of 54 DNA-binding protein structures in DNA-bound conformation corresponding to the set UD54,
• APO104 - dataset of 104 DNA-binding protein structures in unbound conformation,
• ZF - dataset of 33 Zinc Finger protein structures in complex with DNA,
• NB110 - dataset of 110 non-DNA-binding protein structures,
• NB843 - dataset of 843 non-DNA-binding protein structures.

Dataset PD138 was created using the Nucleic Acid Database (NDB) by Szilágyi and Skolnick [8]; it contains a set of DNA-binding proteins in complex with DNA strands with a maximum pairwise sequence identity of 35% between any two sequences.

Both the protein and the DNA can alter their conformation during the process of binding. This conformational change can involve small changes in side-chain location and, in the case of the protein, also local refolding. Predicting DNA-binding propensity from a structural model of a protein makes sense if the available structure is not a protein-DNA complex, i.e. it does not contain a bound nucleic acid molecule. In order to find out how the results would change according to the conformation before and after binding, we used two other datasets (UD54, BD54). BD54