Optimization of Multiple Sequence Alignment Software ClustalW
Available online at www.prace-ri.eu

Partnership for Advanced Computing in Europe

Optimization of Multiple Sequence Alignment Software ClustalW

Soon-Heum Ko a*, Plamenka Borovska b‡, Veska Gancheva c†

a National Supercomputing Center, Linkoping University, 58183 Linkoping, Sweden
b Department of Computer Systems, Technical University of Sofia, Sofia, Bulgaria
c Department of Programming and Computer Technologies, Technical University of Sofia, Sofia, Bulgaria

* Corresponding author. E-mail address: [email protected]
‡ Corresponding author. E-mail address: [email protected]
† Corresponding author. E-mail address: [email protected]

Abstract

This activity within the PRACE-2IP project aims to investigate and improve the performance of the multiple sequence alignment software ClustalW on the BlueGene/Q supercomputer JUQUEEN, for the case study of influenza virus sequences. Porting, tuning, profiling, and scaling of the code have been accomplished as part of this work. A parallel I/O interface has been designed for efficient input of the sequence dataset, in which the local masters of sub-groups take care of the read operation and broadcast the dataset to their slaves. The optimal group size has been investigated, and the effect of the read buffer size on read performance has been examined experimentally. The application to the ClustalW software shows that the current implementation with parallel I/O provides considerably better performance than the original code in the I/O segment, leading to a speed-up of up to 6.8 times for reading the input dataset when using 8192 JUQUEEN cores.

1. Overview of the Project

The study of the variability of the influenza virus is of great importance nowadays. The world DNA databases are open for common use and usually contain information on more than one (up to several thousand) individual genomes for each species. To date, 84049 isolates of the influenza virus have been sequenced and are available through GenBank [1]. In silico biological sequence processing is a key technology for molecular biology. Scientists now depend on databases and access to biological information. This scientific area requires powerful computing resources for exploring large sets of biological data. The parallel implementation of methods and algorithms for the analysis of biological data using high-performance computing is essential to accelerate the research and reduce the financial cost.

Multiple sequence alignment (MSA) is an important method for biological sequence analysis and involves more than two biological sequences, generally of the protein, DNA, or RNA type. The method is computationally difficult and is classified as an NP-hard problem. ClustalW has become the most popular MSA software and implements a progressive method for multiple sequence alignment [2]. ClustalW computes the best match for the selected sequences and lines them up so that the identities, similarities and differences can be seen. The basic algorithm behind ClustalW proceeds in three stages: pairwise alignment (PA), guide tree (GT) and multiple alignment (MA). Pairwise alignment computes the optimal alignment cost for each pair of sequences. A distance matrix is built up, the entries of which show the degree of evolutionary divergence for each pair of sequences. The distance is calculated as the percentage of non-identical residues between two sequences. An evolutionary tree is built from the distance matrix using the sequence similarity matrix and the Neighbor-Joining algorithm [3]. The tree holds values for each sequence that represent its similarity to all other sequences.
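As an illustration of the distance measure used in the pairwise alignment stage, the following sketch (hypothetical code, not taken from the ClustalW sources; the function name pairwise_distance is ours) computes the fraction of non-identical residues over the aligned, non-gap positions of two sequences:

/* Hypothetical illustration of a ClustalW-style distance: the fraction of
 * non-identical residues over the positions where neither sequence has a gap. */
double pairwise_distance(const char *a, const char *b, int aligned_len)
{
    int compared = 0, identical = 0;
    for (int i = 0; i < aligned_len; i++) {
        if (a[i] == '-' || b[i] == '-')   /* skip gap positions */
            continue;
        compared++;
        if (a[i] == b[i])
            identical++;
    }
    if (compared == 0)
        return 0.0;
    return 1.0 - (double)identical / (double)compared;  /* 0 = identical */
}

A distance of 0.25, for example, means that 25% of the compared residues differ between the two aligned sequences.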
The algorithm aligns the sequences progressively according to the branching order in the guide tree, first aligning the most similar pair of sequences, then the next most similar pair, and so on. The ClustalW phases are relatively independent: each phase produces intermediate data that is used as input for the next one. The execution time depends strongly on the number of sequences as well as on their length. ClustalW-MPI [4] is a distributed and parallel implementation for distributed computer clusters and traditional parallel computers.

The computational aspect of this project is to investigate the parallel performance, with respect to efficiency, scaling and profiling, of parallel multiple alignment on the supercomputer JUQUEEN [5], utilizing parallel I/O deployment and optimization of the MPI-based parallel implementation of the ClustalW algorithm for the case study of influenza viral sequence comparison.

2. Parallel I/O Deployment to the ClustalW-MPI Software

a. MPI-I/O

The parallel I/O strategy is receiving a lot of attention these days, owing to the growing need to handle massive data input and output during a single simulation run. The main idea is to enable multiple processes to access a single file concurrently for I/O, so that the additional overhead of distributing/gathering the global dataset in the traditional I/O approach can be diminished. Several parallel I/O implementations are in use these days, as depicted in Figure 1. MPI-I/O [6] was standardized in 1997, and the open-source implementation ROMIO [7] is used by most MPI libraries. Parallel NetCDF [8] and parallel HDF5 [9] are implemented on top of MPI-I/O in order to provide easier controllability and higher portability. On the other hand, these two implementations only allow parallel I/O operations on their specific file formats. Therefore, we directly apply the raw MPI-I/O functions in the current application for the parallel I/O of the genetic sequence dataset.

Figure 1: The Parallel I/O Software Stack (referenced from W. Frings, M. Stephan, F. Janetzko [11])

The parallel MPI I/O model for simultaneous and independent access to a single file collectively is presented in Figure 2. The slaves read the input sequences and write the output results to the file collectively. Even if the file system lacks support for parallel operations, this approach is still more efficient, as the slaves carry out more work in parallel (i.e. slaves write out their own results rather than relying on the master to do it).

Figure 2: Parallel MPI I/O Programming Model for Simultaneous and Independent Access to a Single File Collectively (processes of ranks 0 to n access the input/output data file concurrently through MPI I/O)

Since the crude MPI-I/O functions are applied directly to the application code, performance can vary considerably depending on the maturity of the implementation. First, tunable parameters exist in the middleware layer and are set via MPI_Info_set. Among them, cb_buffer_size and cb_nodes are two notable parameters that affect the actual read/write speed. cb_buffer_size controls the size (in bytes) of the intermediate buffer used for collective buffering; its default is 4194304 bytes (4 MB). cb_nodes specifies the maximum number of aggregators and is set by default to the number of unique hosts in the communicator used when opening the file. We tune these parameters based on multiple benchmark runs.
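As a brief illustration of how such hints are passed to the MPI-I/O layer, the following sketch (with illustrative hint values and a placeholder file name, not the tuned settings used in this work) opens a file with cb_buffer_size and cb_nodes hints and issues a collective read:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Hints are attached to an MPI_Info object; the values below are illustrative. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_buffer_size", "16777216");  /* 16 MB collective buffer */
    MPI_Info_set(info, "cb_nodes", "32");               /* at most 32 aggregators  */

    MPI_File fh;
    /* "sequences.fasta" is a placeholder name for the input dataset. */
    if (MPI_File_open(MPI_COMM_WORLD, "sequences.fasta",
                      MPI_MODE_RDONLY, info, &fh) == MPI_SUCCESS) {
        char buf[4096];
        /* All ranks read the first 4 KB collectively, just to exercise the hints. */
        MPI_File_read_at_all(fh, 0, buf, sizeof buf, MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
    }

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}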
Other notable parameters that strongly affect parallel I/O performance are the number of processes participating in I/O calls and the size of the read request issued by a single I/O call. Depending on the size of the dataset, a parallel I/O operation performed by all ranks may give poorer performance than the combination of a serial read and a collective communication. Therefore, we restrict the number of I/O ranks among the total processes. This is implemented by dividing all processes into several groups through the creation of MPI sub-communicators. The master rank of each group performs the I/O operation and then uses collective operations to distribute the data to its slave processes. The optimal group size can differ depending on the total number of processes, the I/O sizes, and the computer hardware. Likewise, the size of the read request per single I/O call (which we will refer to as the 'read buffer size') affects the read performance. In general, a larger read chunk size gives better performance because it requires fewer MPI read calls. However, the read buffer size is limited by the file server's specification, since an array of the same size must be created on the file server for passing the data. As more ranks participate in parallel I/O calls, the individual read buffer size should be reduced to keep the total memory consumption on the file server bounded. Therefore, the choice of the read buffer size is coupled to the choice of the group size. The synthetic benchmark code for determining the best combination of group size and read buffer size is attached in 'Appendix A: Implementation of grouped I/O operation through the creation of sub-communicators'; a condensed sketch of this scheme is also shown below.

b. Deployment of the Parallel I/O to ClustalW-MPI Software

ClustalW-MPI [4] is an MPI-parallelized version of the ClustalW software. The Clustal family of codes [2] consists of tools for aligning multiple nucleic acid and protein sequences. The alignment is achieved in three steps: pairwise alignment, guide-tree generation and progressive alignment [4]. ClustalW-MPI adopts master-slave parallelism (Figure 3), in which the master stores the entire global dataset, interfaces with all I/O operations, and schedules the slaves' tasks. The benefit of this parallelism is that load balancing is easily achieved for embarrassingly parallel tasks (the pairwise alignments in this code), because the master dynamically assigns tasks to the slaves at runtime. On the other hand, the master rank is prone to memory overflow, since all results are stored on the master, and the cost of master-slave communication becomes a bottleneck. With regard to I/O, the master processor in this code performs all POSIX-style, sequential I/O operations and uses point-to-point communication with individual slaves to send/receive the I/O datasets. This implementation makes it impossible to directly impose the MPI-I/O formulation on the code, since MPI-I/O collective functions are operations among all processes in the same communicator.
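The following is a condensed sketch of the grouped read scheme described in section 2.a (hypothetical code with illustrative parameter values; the benchmark implementation actually used is listed in Appendix A). Processes are split into groups with MPI_Comm_split, the local master of each group reads the dataset in chunks of the read buffer size through MPI-I/O, and then broadcasts it to the slaves of its group; the sketch assumes every rank eventually needs the full dataset and that the dataset size fits an int count.

#include <mpi.h>
#include <stdlib.h>

/* Sketch: split MPI_COMM_WORLD into groups of 'group_size' ranks; the local
 * master (group rank 0) of each group reads the whole file in chunks of
 * 'read_buf_size' bytes via MPI-I/O and broadcasts it to its group. */
char *grouped_read(const char *filename, MPI_Offset file_size,
                   int group_size, MPI_Offset read_buf_size)
{
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Group communicator: ranks 0..group_size-1 form group 0, and so on. */
    MPI_Comm group_comm;
    MPI_Comm_split(MPI_COMM_WORLD, world_rank / group_size, world_rank, &group_comm);
    int group_rank;
    MPI_Comm_rank(group_comm, &group_rank);

    /* Communicator containing only the local masters, which perform the I/O. */
    MPI_Comm io_comm;
    MPI_Comm_split(MPI_COMM_WORLD, (group_rank == 0) ? 0 : MPI_UNDEFINED,
                   world_rank, &io_comm);

    char *buf = malloc((size_t)file_size);

    if (group_rank == 0) {
        MPI_File fh;
        MPI_File_open(io_comm, filename, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
        for (MPI_Offset off = 0; off < file_size; off += read_buf_size) {
            MPI_Offset n = file_size - off;
            if (n > read_buf_size)
                n = read_buf_size;
            MPI_File_read_at(fh, off, buf + off, (int)n, MPI_BYTE,
                             MPI_STATUS_IGNORE);
        }
        MPI_File_close(&fh);
        MPI_Comm_free(&io_comm);
    }

    /* The local master broadcasts the dataset to the slaves of its group. */
    MPI_Bcast(buf, (int)file_size, MPI_BYTE, 0, group_comm);
    MPI_Comm_free(&group_comm);
    return buf;
}

As a hypothetical usage example, grouped_read("sequences.fasta", size, 64, 4194304) would let one out of every 64 ranks perform the reads with a 4 MB read buffer; both values are placeholders to be replaced by the benchmarked optimum.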