2018 26th European Signal Processing Conference (EUSIPCO) Improved Residue–Residue Contact Prediction Using Image Denoising Methods

Amelia Villegas-Morcillo, Juan A. Morales-Cordovilla, Angel M. Gomez, Victoria Sanchez Dept. of Signal Theory, Telematics and Communications, University of Granada, Spain E-mail: [email protected], {jamc,amgg,victoria}@ugr.es

Abstract—A is a simplified matrix rep- contact map. Two residues within a protein are said resentation of the , where the spatial proximity to form a contact when they share a spatial proximity that is of two amino acid residues is reflected. Although the accurate sufficient for a molecular interaction to occur. Thus, the true prediction of protein inter-residue contacts from the amino acid sequence is an open problem, considerable progress has contact map for a protein with L residues consists of an L×L been made in recent years. This progress has been driven symmetrical matrix, where each element specifies whether by the development of contact predictors that identify the co- two inter-residues are in contact under a certain Euclidean evolutionary events occurring in a protein multiple sequence distance threshold. By contrast, the matrix elements within an alignment (MSA). However, it has been shown that these methods estimated contact map indicate the contact probability of two introduce Gaussian noise in the estimated contact map, making its reduction necessary. In this paper, we propose the use of two residues. Recent research has shown that even a low proportion different Gaussian denoising approximations in order to enhance of correctly-predicted contacts can be enough for accurately the protein contact estimation. These approaches are based on modeling the protein structure [3]. (i) sparse representations over learned dictionaries, and (ii) deep Current state-of-the-art contact predictors can be classified residual convolutional neural networks. The results highlight that into two main categories, based on evolutionary coupling the residual learning strategy allows a better reconstruction of the contact map, thus improving contact predictions. analysis (EC) and supervised (ML), respec- Index Terms—Protein Contact Map, Evolutionary Coupling, tively [4]. In the first group we can find methods such as Image Denoising, Sparse Representations, Dictionary Learning, PSICOV [5] and CCMpred [6], which aim to identify co- Deep Convolutional Neural Networks, Residual Learning evolved residues from the protein multiple sequence alignment (MSA). Co-evolution is related to correlated mutations occur- I.INTRODUCTION ring in . This means that if one of the two residues Proteins are life essential macromolecules composed of long in contact has mutated during evolution, its counterpart has chains that contain 20 different types of amino acid residues. to change as well in order for the 3D structure to remain Protein folding is one of the most challenging problems stable [7]. On the other hand, supervised machine learning- found in bioinformatics, which still remains unsolved. It refers based predictors use a set of input features derived from the to the spatial arrangement of the amino acid sequence in MSA to estimate the contact map, including position-specific three-dimensional (3D) space, which is closely related to the scoring matrices (PSSM), secondary structure predictions or biological function performed by the protein. solvent accessibility information [8]. ML-based methods learn Several experimental techniques, such as X-ray crystallog- from these features using several algorithms, such as support raphy, nuclear magnetic resonance spectroscopy and electron vector machines, neural networks and ra