Performance Evaluation of Leading Protein Multiple Sequence Alignment Methods
Total Page:16
File Type:pdf, Size:1020Kb
International Journal of Engineering and Advanced Technology (IJEAT) ISSN: 2249 – 8958, Volume-9 Issue-1, October 2019 Performance Evaluation of Leading Protein Multiple Sequence Alignment Methods Arunima Mishra, B. K. Tripathi, S. S. Soam MSA is a well-known method of alignment of three or more Abstract: Protein Multiple sequence alignment (MSA) is a biological sequences. Multiple sequence alignment is a very process, that helps in alignment of more than two protein intricate problem, therefore, computation of exact MSA is sequences to establish an evolutionary relationship between the only feasible for the very small number of sequences which is sequences. As part of Protein MSA, the biological sequences are not practical in real situations. Dynamic programming as used aligned in a way to identify maximum similarities. Over time the sequencing technologies are becoming more sophisticated and in pairwise sequence method is impractical for a large number hence the volume of biological data generated is increasing at an of sequences while performing MSA and therefore the enormous rate. This increase in volume of data poses a challenge heuristic algorithms with approximate approaches [7] have to the existing methods used to perform effective MSA as with the been proved more successful. Generally, various biological increase in data volume the computational complexities also sequences are organized into a two-dimensional array such increases and the speed to process decreases. The accuracy of that the residues in each column are homologous or having the MSA is another factor critically important as many bioinformatics same functionality. Many MSA methods were developed over inferences are dependent on the output of MSA. This paper the period of time such as CLUSTALW [8], webPRANK[9], elaborates on the existing state of the art methods of protein MSA and performs a comparison of four leading methods namely PAGAN[10],Clustal Omega[11], MUSCLE MAFFT, Clustal Omega, MUSCLE and ProbCons based on the [12,13],T-Coffee[14], MAFFT[15,16], PRRP[17], speed and accuracy of these methods. BAliBASE version 3.0 ProbCons[18] , PROMALS [19], (BAliBASE is a repository of manually refined multiple sequence PROMALS3D[20],GLprobs[21],MSAprobs[22], and NAST alignments) has been used as a benchmark database and accuracy [23] etc. of alignment methods is computed through the two widely used MSA methods are generally applied for the identification of criteria named Sum of pair score (SPscore) and total column family of a new protein, locating DNA regulatory elements score (TCscore). We also recorded the execution time for each such as binding sites, protein structure prediction, and method in order to compute the execution speed. phylogenetic analysis. The assumption behind the MSA that Keywords: Multiple Sequence Alignment, Execution speed, actually help in performing these functions is that all the Sum of pair score, Total column score. biological sequences that are aligned share evolutionary relationship. The reliability of alignment results plays a key I. INTRODUCTION role in such analyses, but it is seen that results generated by various methods of alignment are pretty different. [24]. There are two types of methods to perform sequence Several studies are earlier done on the assessment of MSA alignment. One is the Pairwise sequence alignment (PSA) methods by taking BAliBASE [25], HOMSTRAD [26] as method and the other is multiple sequence alignment (MSA) benchmark datasets. The output of these studies pointed method. PSA is a procedure to expose the evolutionary towards the fact that none of the existing MSA method was relationship between two sequences of DNA, RNA or protein, suitable for all type of datasets. [27,28,29], Each MSA by finding out the maximum similarity between the method has its own advantages and limitations. BAliBASE is sequences. It was first done by Needleman and Wunsch a repository of manually refined correct alignments of (1970) [1] leveraging the dynamic programming algorithm conserved residues with 3D structural super-positions. that can provide globally optimum alignment of two HOMSTRAD is also a benchmark database which provided full-length protein sequences. Smith and Waterman (1981) protein sequences and their structure information, extracted [2] introduced another algorithm with a different scoring from various protein databases that stored structures and families of protein like- PDB [30], Pfam [31] and SCOP [32], scheme where in place of global alignment optimum local etc. Some studies are also done using simulated databases but alignment of sub-sequences are used. There are many PSA the main disadvantage of the simulated sequences is that if the methods such as BLAST [3], AlignMe[4], EMBOSS [5] and settings made for simulation are more similar to any PSI-BLAST[6]. alignment method than the other, they can produce the results in favour of that method [33]. We have used BAliBASE v3.0 Revised Manuscript Received on October 15, 2019 as benchmark database which contained specially constructed * Correspondence Author reference sequence sets of protein and corresponding sets of Arunima Mishra*, computer science & engineering, Dr APJ Abdul alignments which are meant with the help of 3-dimensional Kalam Technical University, Lucknow, India. Email: [email protected] superpositions of protein structures and then refined manually Bipin Kumar Tripathi , computer science & engineering, Rajkiya to ensure the accuracy of the Engineering college , Bijnor, India. Email: [email protected] MSA. Apart from the Sudhir Singh Soam, computer science & engineering, Institute of benchmark dataset, a good Engineering and Technology, Lucknow, India. Email: [email protected] means for comparison also Published By: Retrieval Number: A1369109119/2019©BEIESP Blue Eyes Intelligence Engineering DOI: 10.35940/ijeat.A1369.109119 771 & Sciences Publication Performance Evaluation of Leading Protein Multiple Sequence Alignment Methods required and hence we chose two broadly accepted score Da-Fei Feng and Doolittle [34] developed a heuristic sum-of-pair score (SPscore) and total -column score approach for multiple sequence alignment in 1987 called (TCscore) for the assessment of accuracy of four alignment progressive alignment. This approach performs multiple methods. SPscore shows the level to which an alignment sequence alignments in two stages, first stage builds a guide program is able to align the sequences in a most accurate tree by any clustering method like neighbour-joining [35] or manner into an alignment, instead TCscore access the UPGMA [36], second step constructs the final MSA by capability of the alignment program to align all the residues adding sequences with the help of guide tree i.e. adding most correctly in each column. SPscore is helpful to assess the similar sequences first followed by the most distantly related alignment of sequences where the data set is divergent, sequences. Progressive alignments do not guarantee for whereas the TCscore is more accurate with an alignment of globally optimal results. but are very fast and efficient enough sequences that are closely related. In the event when the when a large number of protein sequences are there. The main datasets are closely related with few very divergent or orphan disadvantage of this method is that if errors are made at any sequences and if any tool aligns only closely related stage, it cannot be rectified later on and propagated through to sequences correctly and misalign the divergent or orphan the final result. Clustal W is the most used progressive sequences, will get the high SPscore which is not the correct alignment-based tools. representation. In such scenarios, the TCscore gives more B. Iterative- Progressive alignment Method meaningful result as it is able to distinguish between the Iterative-Progressive alignment in contrast to the alignment programs which could align the highly divergent or progressive alignment method repeatedly performs dynamic orphan sequences correctly. programming to realign the already aligned sequences and at To perform analysis using MSA, it is imperative to use the the same time keeps adding new sequences to the MSA. This right method as right choice of a method can lead to the method due to iterative dynamic programming, improves the correct results and vice versa, therefore, it seemed essential to accuracy of the alignment and is able to correct the errors conduct a comparative study of various MSA methods on the made at the initial stage. The limitation of this method is that speed and accuracy of the results, which will help even the Iterative methods cannot perform alignments for a large non-specialist biologists to select the right method . This number of sequences [37]. Key iterative alignment methods study focuses on the critical issues like accuracy and speed of include PRRP, MUSCLE, and Dialign[38]. four popular MSA methods MAFFT v7, ClustalΩ, MUSCLE C. Hidden Markov model and ProbCons by the systematic comparison and evaluation. Hidden Markov model is a graphical model that use Accuracy and speed are the standard parameters used for probability to predict a sequence of unknown variables from evaluation of methods. The main algorithms and availability the set of observed variables. It is extensively used in multiple of MSA methods are mentioned in table 1. sequence alignment. This model assigns the probability of various combinations of alignments with matches, Table 1: Description of MSA methods chosen for mismatches, and gaps at different positions to arrive at the evaluation possible MSAs. it provides as an output a single highest Tools Main Link scoring MSA and also other possible alignments that can be name/version Algorithm analyzed further for other biological implications. Clustal Progressive http://www.clustal.org/o The accuracy of the output is dependent upon the posterior Omega/1.2.4 method mega/ probabilities of residues. By using this model, the computing followed by speed is improved for the sequences having overlapping HMM regions. Examples of Hidden Markov model are ProbCons, ProbCons/1.1 Markov http://probcons.stanfor GL prob and MSAprobs etc. 2 model based d.edu/download.html D.