BMC Bioinformatics Biomed Central

BMC Bioinformatics BioMed Central Software Open Access An interactive visualization tool to explore the biophysical properties of amino acids and their contribution to substitution matrices Blazej Bulka1,2, Marie desJardins1 and Stephen J Freeland*2 Address: 1Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, 1000 Hilltop Circle, Baltimore, MD 21250, USA and 2Department of Biological Sciences, University of Maryland, Baltimore County, 1000 Hilltop Circle, Baltimore, MD 21250, USA Email: Blazej Bulka - [email protected]; Marie desJardins - [email protected]; Stephen J Freeland* - [email protected] * Corresponding author Published: 03 July 2006 Received: 12 December 2005 Accepted: 03 July 2006 BMC Bioinformatics 2006, 7:329 doi:10.1186/1471-2105-7-329 This article is available from: http://www.biomedcentral.com/1471-2105/7/329 © 2006 Bulka et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Abstract Background: Quantitative descriptions of amino acid similarity, expressed as probabilistic models of evolutionary interchangeability, are central to many mainstream bioinformatic procedures such as sequence alignment, homology searching, and protein structural prediction. Here we present a web-based, user-friendly analysis tool that allows any researcher to quickly and easily visualize relationships between these bioinformatic metrics and to explore their relationships to underlying indices of amino acid molecular descriptors. Results: We demonstrate the three fundamental types of question that our software can address by taking as a specific example the connections between 49 measures of amino acid biophysical properties (e.g., size, charge and hydrophobicity), a generalized model of amino acid substitution (as represented by the PAM74-100 matrix), and the mutational distance that separates amino acids within the standard genetic code (i.e., the number of point mutations required for interconversion during protein evolution). We show that our software allows a user to recapture the insights from several key publications on these topics in just a few minutes. Conclusion: Our software facilitates rapid, interactive exploration of three interconnected topics: (i) the multidimensional molecular descriptors of the twenty proteinaceous amino acids, (ii) the correlation of these biophysical measurements with observed patterns of amino acid substitution, and (iii) the causal basis for differences between any two observed patterns of amino acid substitution. This software acts as an intuitive bioinformatic exploration tool that can guide more comprehensive statistical analyses relating to a diverse array of specific research questions. Background substitution matrices (such as the PAM and BLOSUM Molecular biology has made great progress in observing matrix families), which lie at the heart of mainstream bio- and quantifying the patterns by which amino acids informatics procedures, from algorithms that determine exchange for one another within protein sequences over whether [1] and how exactly [2] two proteins are homol- time. A key motivation here has been to create amino acid ogous, to those that predict protein tertiary structure by Page 1 of 9 (page number not for citation purposes) BMC Bioinformatics 2006, 7:329 http://www.biomedcentral.com/1471-2105/7/329 comparison with known folds [3]. However, these matri- a property that contains the term "hydrophobicity" in its ces represent generalized patterns of change "averaged" description.) Moreover, this comparison allows easy visu- across all proteins: although they typically encompass the alization of non-intuitive correlations (e.g., hydrophobic- idea that patterns of substitution will vary with evolution- ity and volume). The authors applied similarity-based ary distance, other systematic sources of variation are methods to their AAindex database to build a minimum overlooked. An increasing literature supports the idea that spanning tree: a graph-theoretic structure that connects dis- this generalization may compromise the sensitivity of crete elements together based on similarity, by minimiz- sequence comparison for various specialized subsets of ing the overall sum of the distances of the direct proteins (e.g., for particular protein families [4-8], or for connections. The result is a data structure in which ele- genomes that have evolved under unusual mutation ments are grouped together based on similarity (a detailed biases or selection regimes [9-11]). Thus a worthy chal- description and justification is given in the work of Tomii lenge is to seek the underlying ontology that can link indi- and Kanehisa who first applied this methodology to visu- vidually derived, specialized models of amino acid alizing amino acid similarity [23]). This minimum span- substitution into a common framework: if we can ulti- ning tree showed the underlying structure (clustering) for mately replace generalized patterns of observed change the 402 indices of their database. Since this time, numer- with a flexible, quantitative model of amino acid substitu- ous further indices and matrices have been developed: tion, then this offers significant potential to increase the some have been incorporated into updates of the AAin- sophistication of standard bioinformatics procedures. dex, while others remain isolated in the scientific litera- Such research may in fact be viewed as a subset of current ture (e.g., [10,25]). efforts to find a general, chemical ontology for bioactivity (e.g., [12-14]) where researchers face the same challenge In this context, we have developed free, user-friendly, of unifying diverse observations into a model that predicts publicly available web-based software that enables molecular interactions from first principles. researchers to repeat and extend the ideas of Nakai et al., [22] and Tomii and Kanehisa [23] using interactive data In this context, it has long been understood that amino visualization. We thus present the Amino Acid Explorer, a acid substitution matrices reflect a combination chemical web tool that facilitates quantitative exploration of simi- and evolutionary factors: most intuitively the biophysical larity between physiochemical properties of amino acids properties (known within chemical disciplines as "molec- and their evolutionary dynamics. Our tool allows users to ular descriptors") of the amino acids [15,16] and the explore the similarity between any of the 83 matrices and mutational distance of their encodings within the genetic any subset of the 494 indices housed by AAindex version code [5,17,18]. However, establishing accurate, quantita- 6.0, and to include any custom index or matrix (e.g., from tive connections between the outcomes of molecular evo- recent scientific literature or from unpublished research, lution and amino acids' molecular descriptors remains a as a matrix derived from an alignment of proteins in a par- complex issue under active research (e.g., [19-21]). ticular functional class, or an index derived by combining several physiochemical properties). We have embedded In this context, Nakai et al. created an innovative database, this analysis tool within a comprehensive web context: the AAindex [22], comprising both amino acid substitu- both a moderated user forum http://www.evolving tion matrices (20 × 20 matrices in which each element code.net/forum/viewforum.php?f=24 in which to discuss reflects some measure of the exchangeability of a pair of problems, findings or questions and an open wiki http:// amino acids) and amino acid indices (vectors of 20 ele- www.evolvingcode.net/ ments, each element being a value that describes some index.php?page=Amino_Acid_Indices in which the com- physiochemical property such as size or hydrophobicity, munity of those researching the interface of biochemistry for one of the twenty amino acids encoded by the stand- and protein evolution may contribute their knowledge. ard genetic code). In a later publication that expanded this database, Tomii and Kanehisa [23] suggested procedures Implementation for correlating any amino acid molecular descriptor with Our web tool, which may be accessed at http:// an observed exchange rate (e.g., substitution matrix) and www.evolvingcode.net:8080/aaindex/, comprises two for clustering indices together by similarity. major parts: one client side, one server side. The client side consists of the graphical interface that runs as a Java applet This latter technique of index clustering, is especially use- within a user's browser. The server side (residing on http:/ ful when exploring the relationship between indices, /www.evolvingcode.net), is a web application that per- given that properties of widespread interest have often forms all computations on the data, and is part of a larger been measured in many different ways by different computational infrastructure created around UMBC researchers. (For example, the latest version of the AAin- AAIndex database. Figure 1 shows an overview of our dex database [24] contains 29 different measurements of tool's architecture. Additionally, a short paragraph Page 2 of 9 (page number not for citation purposes) BMC Bioinformatics 2006, 7:329 http://www.biomedcentral.com/1471-2105/7/329 OverviewFigure 1 of Amino Acid Explorer Architecture

BMC Bioinformatics Biomed Central

Optimal Matching Distances Between Categorical Sequences: Distortion and Inferences by Permutation Juan P

Sequence Motifs, Correlations and Structural Mapping of Evolutionary

Computational Biology Lecture 8: Substitution Matrices Saad Mneimneh

Development of Novel Classical and Quantum Information Theory Based Methods for the Detection of Compensatory Mutations in Msas

3D Representations of Amino Acids—Applications to Protein Sequence Comparison and Classiﬁcation

Assume an F84 Substitution Model with Nucleotide Frequ

Amino Acid Substitution Matrices from an Information Theoretic Perspective

Comparison Theorems for Splittings of M-Matrices in (Block) Hessenberg

Pathological Rate Matrices: from Primates to Pathogens

3D Deep Convolutional Neural Networks for Amino Acid Environment Similarity Analysis Wen Torng1 and Russ B

Multiplicatively Closed Markov Models Must Form Lie Algebras

Metzlerian and Generalized Metzlerian Matrices: Some Properties and Economic Applications