Alignstat: a Web-Tool and R Package for Statistical Comparison of Alternative Multiple Sequence Alignments Thomas Shafee1* and Ira Cooke2,1

Shafee and Cooke BMC Bioinformatics (2016) 17:434 DOI 10.1186/s12859-016-1300-6 SOFTWARE Open Access AlignStat: a web-tool and R package for statistical comparison of alternative multiple sequence alignments Thomas Shafee1* and Ira Cooke2,1 Abstract Background: Alternative sequence alignment algorithms yield different results. It is therefore useful to quantify the similarities and differences between alternative alignments of the same sequences. These measurements can identify regions of consensus that are likely to be most informative in downstream analysis. They can also highlight systematic differences between alignments that relate to differences in the alignment algorithms themselves. Results: Here we present a simple method for aligning two alternative multiple sequence alignments to one another and assessing their similarity. Differences are categorised into merges, splits or shifts in one alignment relative to the other. A set of graphical visualisations allow for intuitive interpretation of the data. Conclusions: AlignStat enables the easy one-off online use of MSA similarity comparisons or into R pipelines. The web-tool is available at AlignStat.Science.LaTrobe.edu.au. The R package, readme and example data are available on CRAN and GitHub.com/TS404/AlignStat. Background synthetically generated [3, 4] or a curated database Multiple sequence alignments (MSAs) aim to organise a [5, 6], and to refine phylogenies [7]. set of sequences by placing homologous residues into A common method of alignment comparison is columns, and their accuracy affects subsequent steps in though a combination of the sum of pairs score bioinformatic pipelines such as phylogenetic inference (SPS), and total column score (CS) [8]. The sum of [1] and protein structure prediction [2]. However, since pairs score measures what proportion of all residue there is no objective function to measure true ‘biological pairs within columns of one alignment are retained in correctness’ of an alignment, an array of alternative a comparison alignment and the total column score methods exist based on different assumpitions. These al- measures the proportion of columns where both gorithms often make different alignment predictions [2], alignments agree completely (ie for all sequences). especially in MSAs with many insertions and deletions, These methods have the benefit of including all hom- for example in cysteine-rich proteins. Quantitative com- ology information in a single score, however their in- parison and intuitive visualisation of alternative MSAs terpretation can be hampered by the fact that they can help users make decisions as to which regions are scale non-linearly with the degree of similarity at a generally agreed upon and whether any regions should site (Additional file 1: Figure S1). be removed in further analyses. Quantitative similarity Here, we use a complementary method based on a measures are also used when assessing the accuracy of matrix of equivalency functions to allow comparable alignment algorithms against benchmark MSAs, either quantification of both similarity and of alternative sources of dissimilarity. Each position in the matrix corresponds to a residue of the reference MSA, with * Correspondence: [email protected] 1Department of Biochemistry and Genetics, La Trobe Institute for Molecular Science, La Trobe University, Melbourne 3086, Australia Full list of author information is available at the end of the article © The Author(s). 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Shafee and Cooke BMC Bioinformatics (2016) 17:434 Page 2 of 6 an equivalency function indicating its relationship to leading to the dissimilarity matrix, D (of dimensions the corresponding residue in the comparison MSA. n x p x 5) based on the functions defined in Eq. 3 We present a simple set of quantitative measures and and Eq. 4. This matrix categorises five types of out- graphical visualisations for interpreting MSA compari- come when the reference and comparison alignments sons. An R package generates a standardised set of are compared. It is called the dissimilarity matrix be- comparison matrices and scores for analysis pipelines cause four of the five alternatives correspond to vari- and graphing, and a user-friendly web-tool interface ous types of mismatch. enables easy one-off use. ¼ ε ; ð Þ Dxik k Pxi Qxj 3 Implementation ε Quantifying similarity Where k(a,b) is the kth equivalency function as defined When alignment algorithms make different homology in Eq. 4. predictions for a set of sequences, the columns of the 1 if a ¼ b ∧ a ≠}−} resulting MSAs will contain different residues. The ε ðÞa; b 1 0 otherwise AlignStat R package contains functions for calculating 1 if a ¼ b ∧ a ¼ }−} all MSA comparison statistics and creating plots quanti- ε ðÞa; b 2 0 otherwise fying differences in a manner that is equivalent for nu- 1 if a≠b ∧ b ¼ }−} cleotide or amino acid sequences. ε ðÞa; b ð4Þ 3 0 otherwise Each MSA of n sequences is treated as a matrix of 1 if a≠b ∧ a ¼ }−} characters (residues plus a gap character) with the ε4ðÞa; b same number of rows. The two matrices are therefore 0 otherwise 1 if a≠b ∧ a ≠}−} ∧ b ≠}−} defined as P (of dimensions n x p)andQ (of dimen- ε5ðÞa; b sions n x q), where each row represents an aligned 0 otherwise sequence. Residues can occur multiple times in a se- Where the five εk(a,b) are equivalency functions (see quence and so are numbered by occurrence such that supplementary information for formal definitions) each character in a row has a unique designation with the following meanings. The first equivalency (Additional file 1: Figure S4). This ensures that align- (ε1)isa‘match’, in which the two characters are iden- ment columns that contain a non-homologous occur- tical and not gaps. The second equivalency (ε2)isa rence of a residue are correctly distinguished. For the ‘conserved gap’, when the both characters are gaps. A p q matrices P and Q, each column vector pair i, j is ‘merge’ is when P contains a gap, but Q contains any compared to calculate the similarity measure Sij de- other character (ε ). Similarly, a ‘split’ is when Q con- p 3 fined in Eq. 1 (where i is the ith column of P,and tains a gap, but P contains any other character (ε ). q 4 j is the jth column of Q). Finally, a ‘shift’ is when two characters are not identi- Xn ε 1 cal and neither are gaps ( 5). The D matrix is visua- Sij ¼ ε Pxi; Q ð1Þ lised using the plot_dissimilarity_matrix function of n xj x¼1 the AlignStat R package. Where Sij is the similarity score for each column pair between P and Q, the equivalency function ε is de- Summary statistics fined in Eq. 2. The column averages of D are used to describe the sources of dissimilarity between the reference and com- 1 if a ¼ b ∧ a ≠ }−} parison alignments at each alignment position and each εðÞa; b ð2Þ 0 otherwise equivalency, k. This leads to the results matrix R (of dimensions 5 x p) defined by Eq. 5. The similarity matrix S can be visualised using the plot_similarity_heatmap function of the AlignStat R Xn ¼ 1 ð Þ package. Evaluating S is the most computationally ex- Rki Dxik 5 n x¼1 pensive calculation in the AlignStat scoring method and has been implemented in C++ for maximum Where R is the results matrix, each row of which is efficiency. used to summarise a source of dissimilarity from the D matrix. Detailed match scoring for comparable MSA columns The match row of the R matrix (R1i) is visualised using For each column in P we find its “match” in Q by the plot_similarity_summary function of the AlignStat R finding the index j at which Sij is maximized. The package. The merge, split and shift rows of the R matrix match between columns, Pi and Qj is then categorised (R3i, R4i and R5i) are referred to collectively as Shafee and Cooke BMC Bioinformatics (2016) 17:434 Page 3 of 6 dissimilarities in AlignStat. They are visualised using the indicate regions of high consensus, with regions of po- plot_dissimilarity_summary function. tential conflict as parallel grey lines. A single, overall similarity score describes the A discrete character heatmap of the dissimilarity matrix weighted average similarity of the two MSAs, as de- D is generated by the plot_dissimilarity_matrix function fined in Eq. 6. The treatment of gaps in MSAs is (Fig. 2b). The reference MSA is arranged on the x-axis complex [9, 10]. In this case, the most instructive with sequences arranged on the y-axis. For each character measure is to exclude conserved gaps, to prevent re- of the reference alignment, the heatmap colour reports sults being skewed by the “similarity” of conserved whether it is a match, merge, split, shift, or conserved gap. gaps in low occupancy columns. Therefore, the over- This indicates how sequence regions (columns) or se- all score is the sum of the match characters as a quence sets (rows) differ between the MSAs. proportion of characters thatarenotconservedgaps. The similarity of the MSAs is summarised as a line A more stringent column score can also be calcu- graph by the plot_similarity_summary function (Fig. 2c). lated as the proportion of all columns that have a The average column match is shown for each reference perfectly identical between the MSAs.

Load more