S-BLOSUM: Classification of 2D Shapes with Biological Sequence Alignment

2014 22nd International Conference on Pattern Recognition

Pietro Lovato∗, Alessio Milanese†, Cesare Centomo†, Alejandro Giorgetti†, Manuele Bicego∗
∗Department of Computer Science, University of Verona
†Department of Biotechnology, University of Verona
Strada le Grazie 15, 37134 Verona (ITALY)
Corresponding author: Pietro Lovato; Email: [email protected]
1051-4651/14 $31.00 © 2014 IEEE. DOI 10.1109/ICPR.2014.405

Abstract: Recent works have investigated the possibility of designing solutions for pattern recognition problems by exploiting the huge amount of work done in bioinformatics. If the pattern recognition problem is cast in biological terms, then a huge range of algorithms, exploitable for classification, detection, visualization, etc., can be effectively borrowed. In this paper, we exploit biological sequence alignment tools to classify 2D shapes, tailoring the biological parameters of these tools to account for the different semantics of the 2D shape scenario. In particular, we propose a novel substitution matrix, which is the crucial parameter determining the sequence alignment solution. The new matrix, called S-BLOSUM, learns the rates of matches/mismatches in conserved portions of shapes belonging to the same category, and incorporates prior knowledge on the chosen representation for the 2D shape. On one hand, the experimental evaluation showed that the S-BLOSUM provides a significant improvement over the biological counterpart (BLOSUM); on the other hand, classification results prove that our approach is competitive with respect to the state of the art.

I. INTRODUCTION

2D shape analysis is an important and still open research area in computer vision, often representing the basis for recognition of 3D real-world objects. The goal in a classification / recognition task is to assign a category to an unknown 2D shape on the basis of a set of classifiers, learned from a training set which contains examples of the different categories. Several approaches have been proposed in the past (see for example the reviews [1], [2]), and many of them are based on the analysis of the boundary: very often, the 2D shape is encoded by its contour, which proved to be an effective and perceptually reasonable choice in many applications. Different techniques exhibit different characteristics: robustness to noise and occlusions, invariance to translation, rotation, and scale, computational requirements, and accuracy (see [3], [4], [5], [6] and references therein).

Recently, an alternative class of approaches has been proposed by some of the authors: in particular, we presented a parallelism between the 2D shape recognition problem and biological sequence alignment [7], [8], exploring the idea of borrowing bioinformatics tools to solve pattern recognition problems such as shape classification. The main observation was that, in the past, the huge and profitable interaction between pattern recognition and biology has been mainly unidirectional, namely devoted to studying how to apply PR tools, algorithms and ideas to analyse biological data [9].

An alternative, complementary way of interaction may be to translate advanced bioinformatics solutions into ideas and methodologies useful to solve a pattern recognition task¹. For example, sequence analysis is a problem encountered every day in the life sciences: a vast number of tools and solutions, refined over more than 40 years of research, has been developed to analyse biological sequences (sequence alignment, motif discovery, phylogenesis are just a few examples). Thus, encoding a 2D contour as a string (like the simple chaincode descriptor [11]) and mapping it into a biological sequence opened up the possibility of exploiting a wide class of techniques coming from the biological sequence alignment community, where specialized algorithms for string matching, visualization, and interpretation have been developed.

¹ The same alternative way of thinking was also pioneered in the Video Genome Project [10] (see http://v-nome.org/about.html), where internet videos were encoded as “video DNA sequences” and analysed with phylogenetic related tools.

Indeed, results obtained in [7], [8] proved the suitability of the approach, even in its simplest scheme, which employs the biological tools “as they are”. Clearly, there is room for many improvements and refinements. In this sense, we can observe that parameters in sequence alignment techniques are finely tuned to take into account the biological nature of the input sequences, so that evolutionary events such as mutations or rearrangements can be clearly expressed. If we use biological tools as they are, we do not take into account the fact that symbols in the shape alphabet (for example, chaincodes) have very different semantics than amino acids in nature.

This paper is aimed at investigating this aspect, trying to understand to what extent it is crucial. In particular, beyond exploiting biological alignment tools to classify 2D shapes, we also tune the biological parameters of these tools to account for the change in the applicative scenario. We start from the observation that biological knowledge (in the alignment process) is encoded in the form of a substitution matrix (the most famous one being BLOSUM [12]): an entry (i, j) in such a matrix indicates the score to assign for a match/mismatch between symbols i and j (the lower, the more penalized it is), and this models the fact that in nature the substitutions between amino acids are not all equally likely.

In this paper, we propose a novel substitution matrix, which we will refer to as Shape BLOSUM (S-BLOSUM), learned from our shape data in a fashion similar to how BLOSUM is built from biological data [12]. In this way, each element models the variations that are likely to occur within shapes of the same category, due for example to local deformations or random noise (a graphical explanation is depicted in Fig. 1). Furthermore, we introduced some a priori assumptions on the chaincode representation, leading to a matrix S-BLOSUM_full which accounts for prior knowledge and is also driven by the data we actually observe; as expected, combining these aspects proves to be the best choice in terms of classification accuracy.

Fig. 1. A substitution matrix can be learned by observing the match/mismatch frequencies that occur in the chaincode strings composing a category of shapes. In the example, the symbol 0 is highly conserved, so it will have a strong negative penalty. On the other hand, substitutions between 3 and 4 happen very often, resulting in a low penalty value for the mismatch between the two.

Once a proper matrix is chosen, it can be plugged into any biological sequence alignment program, which gives for any pair of sequences an alignment score reflecting “how well” those sequences are aligned under the substitution matrix. This quantity is the similarity measure exploited for classification in a nearest-neighbor setting. To quantitatively test the proposed approach, we performed experiments on three benchmark shape datasets, demonstrating the suitability of the proposed scheme: results showed that our matrices improve the results obtained with the biological BLOSUM, and are also very competitive with recent state-of-the-art techniques we evaluated for comparison.

II. BACKGROUND: BLOSUM MATRICES AND SEQUENCE ALIGNMENT

Understanding and modeling living cell behavior is strongly based on the analysis of sequences, both nucleotide sequences, i.e. strings made with the 4 symbols of DNA (A, T, C, and G), and amino acid sequences, i.e. [...] sequences, in order to maximize their point-wise similarity. A huge number of algorithms for sequence alignment exists in the literature [13], [14], [15], [16], and they can be classified in several different categories. The main taxonomy divides the approaches into three categories: global alignment methods, which are aimed at finding the best overall alignment between two sequences; local alignments, which detect related segments in a pair of sequences; and multiple alignments, which are aimed at simultaneously aligning more than two sequences. All of these techniques heavily rely on a fundamental parameter, called the substitution matrix, which assigns a score for matches/mismatches based on the rate at which one character in a sequence is likely to mutate into another one (the higher, the more likely it is). Another important parameter, the gap penalties, is specified by a pair of values representing the cost for inserting a gap and for extending an existing one.

A fundamental issue is therefore how to properly choose and design the substitution matrix B. Intuitively, B should have the highest values on the diagonal: if two symbols match, a high score should be assigned. For mismatches, B should reflect the fact that some of them are highly improbable, due for example to physical or chemical properties of amino acids. In the protein domain, the most employed substitution matrix is the one called BLOSUM [12], which is the default choice of many bioinformatics tools available on the web. Note that many alternatives exist, the oldest one being the PAM [17].

III. THE PROPOSED APPROACH

Our proposal is to build a substitution matrix able to deal with our peculiar scenario. We assume that a shape is encoded with the 8-directional chaincode, thus strings are defined over an alphabet of 8 symbols. Note however that our approach can be employed with any descriptor that is able to map a 2D shape into a string.

A. S-BLOSUM construction

In the following, we explain how the S-BLOSUM is built: starting from the description of the BLOSUM matrix (found in [12]), we will detail in every step how we account for the change in representation.

Block extraction: The starting concept in building a BLOSUM is that of a block of related sequences. In biological terms, related sequences are the ones which belong to the same evolutionary family, namely, they share the same biological function. Blocks then are ungapped, highly similar portions of sequences, which have been previously aligned and extracted [...]
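As a rough illustration of the idea in Fig. 1, the sketch below counts how often each pair of chaincode symbols shares a column in one ungapped block and converts those counts into log-odds scores, following the standard BLOSUM recipe of [12]. It is not the S-BLOSUM procedure itself, whose remaining steps are not described in the text above. The toy block, the function name `log_odds_matrix`, and the pseudocount (added only to avoid taking the logarithm of zero) are assumptions made for illustration.

```python
import math
from itertools import combinations

ALPHABET = "01234567"  # symbols of the 8-directional chaincode


def log_odds_matrix(block, alphabet=ALPHABET, pseudocount=0.1):
    """Log-odds scores from one ungapped block of equal-length strings.

    Assumption: `pseudocount` is added to every unordered pair count so
    that pairs never observed in the block do not lead to log(0).
    """
    k = len(alphabet)
    idx = {s: i for i, s in enumerate(alphabet)}
    # f[i][j] with i <= j holds the count of the unordered pair (i, j).
    f = [[pseudocount if i <= j else 0.0 for j in range(k)] for i in range(k)]
    for column in zip(*block):
        for a, b in combinations(column, 2):
            i, j = sorted((idx[a], idx[b]))
            f[i][j] += 1.0
    total = sum(map(sum, f))
    # Observed pair frequencies, mirrored into a symmetric matrix.
    q = [[f[min(i, j)][max(i, j)] / total for j in range(k)] for i in range(k)]
    # Marginal probability of each symbol.
    p = [q[i][i] + sum(q[i][j] for j in range(k) if j != i) / 2
         for i in range(k)]
    scores = [[0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            # Pair frequency expected if symbols were independent.
            expected = p[i] * p[j] if i == j else 2 * p[i] * p[j]
            scores[i][j] = round(2 * math.log2(q[i][j] / expected))
    return scores


# Toy block: 0 is fully conserved, 3 and 4 are occasionally swapped.
toy_block = ["00334400", "00334400", "00334400", "00343400"]
S = log_odds_matrix(toy_block)
```

On this toy block the 3/4 mismatch receives a much milder score than the 0/3 mismatch, which is never observed, while matches on the diagonal score highest.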

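The overall pipeline described in Sections I to III (encode a contour as a chaincode string, score pairs of strings by alignment under a substitution matrix, classify by nearest neighbour) can be sketched as follows. This is only a minimal illustration under several assumptions: a plain Needleman-Wunsch global alignment with a single linear gap penalty stands in for a full alignment program (the text mentions separate costs for opening and extending a gap), and the direction numbering in `DIRECTIONS`, the toy matrix `B`, the toy training strings and all function names are hypothetical. No actual S-BLOSUM values are reproduced.

```python
# Assumed numbering: 0 = east, then counter-clockwise in 45-degree steps
# (y grows upward).
DIRECTIONS = {(1, 0): "0", (1, 1): "1", (0, 1): "2", (-1, 1): "3",
              (-1, 0): "4", (-1, -1): "5", (0, -1): "6", (1, -1): "7"}


def chaincode(points):
    """Encode a closed contour of 8-connected pixel coordinates as a string."""
    closed = points + points[:1]
    return "".join(DIRECTIONS[(x1 - x0, y1 - y0)]
                   for (x0, y0), (x1, y1) in zip(closed, closed[1:]))


def alignment_score(s, t, B, gap=-4):
    """Global (Needleman-Wunsch) alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        cur = [i * gap]
        for j, b in enumerate(t, start=1):
            cur.append(max(prev[j - 1] + B[int(a)][int(b)],  # match/mismatch
                           prev[j] + gap,                    # gap in t
                           cur[j - 1] + gap))                # gap in s
        prev = cur
    return prev[-1]


def nearest_neighbour(query, training, B):
    """Label of the training string with the highest alignment score."""
    return max(training,
               key=lambda item: alignment_score(query, item[0], B))[1]


# Toy matrix: +2 on the diagonal, -1 for adjacent directions, -3 otherwise.
B = [[2 if i == j else (-1 if (i - j) % 8 in (1, 7) else -3)
      for j in range(8)] for i in range(8)]

square = chaincode([(0, 0), (1, 0), (2, 0), (2, 1),
                    (2, 2), (1, 2), (0, 2), (0, 1)])
training = [("00124466", "square"), ("11335577", "diamond")]
label = nearest_neighbour(square, training, B)
```

Swapping `B` for a learned matrix changes only the scores fed to the nearest-neighbour rule, which is why the substitution matrix is the natural place to inject shape-specific knowledge.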