
Enhancing Parallelism of Data-Intensive Bioinformatics Applications


Zheng Xie, Liangxiu Han
School of Computing, Mathematics and Digital Technology
Manchester Metropolitan University
Manchester M1 5GD, UK
[email protected], [email protected]

Richard Baldock
MRC Human Genetics Unit, MRC IGMM
University of Edinburgh, Western General Hospital
Edinburgh EH4 2XU, UK
[email protected]

Abstract — Bioinformatics data-resources collected from heterogeneous and distributed sources can contain hundreds of Terabytes, and efficient exploration of these large amounts of data is a critical task to enable scientists to gain new biological insight. In this work, an MPI-based parallel architecture has been designed for enhancing the performance of biomedical data-intensive applications. The experimental results show that the system has achieved super-linear speedup and high scalability.

Keywords — gene pattern recognition, data intensive application, Message Passing Interface, collective/point-to-point communicators, task/data parallelisms, fine-grain parallelism, super-linear speedup, super-ideal scalability.

I. INTRODUCTION

With the exploitation of advanced high-throughput instrumentation, the quantity and variety of bioinformatics data have become overwhelming. Such data, collected from heterogeneous and distributed sources, typically consist of tens or hundreds of Terabytes (TB) comprising ten to fifty thousand individual assays with supplementary metadata and spatial-mapping information. This is the case with the mouse embryo gene-expression resources EMAGE [13], EurExpress [14] and the Allen Brain Atlas [15]. To enable scientists to gain new biological insights, efficient processing of these data is required. Parallel and distributed computing, along with parallel programming models (e.g., MPI [1]), provides solutions by splitting massive data-intensive tasks into smaller fragments and carrying out the much smaller computations concurrently.

This work explores how parallel processing could accelerate data-intensive biomedical applications. The Message Passing Interface (MPI) has been chosen to implement our parallel structure. MPI is a standardized and portable message-passing application programmer interface [4], which is designed to function on a wide variety of parallel computers. Its library provides communication functionality among a set of processes. A rich range of functions in the library enables us to implement the complicated communication which we need for enhancing the effectiveness of parallel computing as computational resources increase.

Our contribution in this work lies in 1) the design and implementation of an enhanced parallel architecture for a bioinformatics data-intensive application based on MPI; 2) quantitative analyses of the super-linear speedup and high scalability obtained from the parallel system; 3) consideration of maximizing the benefits of super-linear speedup and high scalability by maximizing the benefits of caching.

The rest of the paper is organized as follows: Section II describes the background and the parallel solution for the biomedical use case; Section III presents the experimental evaluation; in Section IV, we conclude our work.

II. PARALLEL PROCESSING FOR DATA INTENSIVE BIOMEDICAL APPLICATIONS

A. Background of the bioinformatics application

In this research, we aim to accelerate a task from biomedical science. This particular task concerns ontological annotation of gene expression in the mouse embryo. Ontological annotation of gene expression has been widely used to identify gene interactions and networks that are associated with developmental and physiological functions in the embryo. It entails labelling embryo images produced from RNA in situ Hybridization (ISH) with terms from the anatomy ontology for mouse development. If an image is tagged with a term, it means the corresponding anatomical component shows expression of that gene. The input is a set of image files and corresponding metadata. The output will be an identification of the anatomical components that exhibit gene expression patterns in each image. This is a typical pattern recognition task. As shown in Figure 1 (a), we first need to identify the features of 'humerus' in the embryo image and then annotate the image using ontology terms listed on the left ontology panel. To automatically annotate images, three stages are required: at the training stage, the classification model has to be built, based on training image datasets with annotations; at the testing stage, the performance of the classification model has to be tested and evaluated; then at the deployment stage, the model has to be deployed to perform the classification of all non-annotated images.



Figure 1. Automatic ontological gene annotation [3].

We mainly focus on the training stage in this case. The processes in the training stage include integration of images and annotations, image processing, feature generation, feature selection and extraction, and classifier design, as shown in Figure 1 (b).

Currently, gene expression annotation is mainly done manually by domain experts. This is both time-consuming and costly, especially with the rapidly growing volume of image data of gene expression patterns on tissue sections produced by advanced high-throughput instruments (e.g. ISH). For example, the EurExpress-II project [2] delivered partially annotated and curated datasets of over 20 Terabytes, including images for the developing mouse embryo and the ontological terms for anatomic components of the mouse embryo, which provides a powerful resource for discovery of genetic control and functional pathways of processes controlling embryo organisation. To alleviate the issues with manual annotation, we have developed data mining algorithms for automatically identifying an anatomical component in the embryo image and annotating the image using the provided ontological terms [3], programmed these in sequential code and executed the task on a single commodity machine. Note that this task is a specific instance of a generic image-processing analysis pipeline that could be applied to many datasets. To cope with the ever-increasing growth in the volume of stored data, it is necessary to design parallel solutions for speeding up such data-intensive applications.

B. Overview of parallel approach

It is well known that the speedup of an application solving large computational problems is mainly gained by parallelisation at either hardware or software levels or both (e.g., signal, circuit, component and system levels).

In general, three considerations when parallelising an application at the software level are:
• How to distribute workloads, or decompose an application into parts as tasks?
• How to map the tasks onto various computing nodes and execute the subtasks in parallel?
• How to coordinate and communicate among subtasks on those computing nodes?

There are two common methods for dealing with the first two questions: data parallelism and task parallelism. Data parallelism means that workloads are distributed onto different computing nodes and the same task can be executed on different subsets of the data simultaneously. Task parallelism means that the tasks are independent and can be executed purely in parallel. There is another special kind of task parallelism called 'pipelining': a task is processed at different stages of a pipeline, which is especially suitable for the case when the same task is used repeatedly. The extent of parallelisation is determined by the dependencies among the individual parts of the algorithms and tasks.

As for the coordination and communication among tasks or processes on various nodes or computing cores, it depends on the memory architecture (shared memory or distributed memory). A number of communication models have been developed [7][8]. Among them, the Message Passing Interface (MPI) has been developed for HPC parallel applications with distributed memory architectures and has become the de-facto standard. There is a set of implementations of MPI, for example, OpenMPI [9], MPICH [10], GridMPI [11] and LAM/MPI [12]. Two types of MPI communication functionality are point-to-point and collective communication, and there are a number of functions involved in each. Point-to-point functions deal with communication between two specific processes, in which a message needs to be sent from a specified process to another specified process. They are suitable for patterned or irregular communication. Collective functions manage communication among all processes in a process group at the same time. The process group can either be the entire process pool or a program-defined process subset. Collective communication is more efficient than point-to-point communication, and ought to be used in the parallel architecture wherever it is suitable.

C. Parallel processing using MPI for the bioinformatics application

This section presents how we have designed and developed a parallel approach based on MPI for efficiently processing large-scale data.

In terms of the nature of the algorithms used in the biomedical application use case, we have mainly employed data and task parallelisms. The communication model used here is MPI, which supports both point-to-point and collective communications.

For data parallelism, we need to distribute workloads onto different computing cores and execute the same computation on different subsets of the data simultaneously. This functionality can be implemented using one of the MPI collective functions, 'MPI_Scatter', which takes the workload as an array and splits it into a number of segments. The number of segments is the program-defined number of parallel processes, with which MPI programs always work. Then, at runtime, all the processes are assigned to computing cores in the order of process rank through the agent that starts the MPI program. When regions of data need to be swapped among specific processors between calculation steps, point-to-point communication functions such as 'MPI_Send/Recv' are used. For task parallelism, especially for 'pipelining', point-to-point communication functions are used to pass new task data from one set of cores to another set of cores whenever the prior tasks are completed.

Our workflow for the application is divided into three parts according to the characteristics of the task and data processing:
1) Image Processing and Feature Generation,
2) Feature Selection and Extraction,
3) Classification.
The overview of the parallel architecture is illustrated in Figure 2.

In the first part of the workflow, the image processing of each image works independently on its own dataset, and data parallelism is the suitable form of parallel processing. The efficient collective communication of MPI is able to implement the data parallelism and enhance the processing speed with minimum communication time. It is implemented with the MPI collective communicators 'MPI_Scatter' and 'MPI_Gather', and the minimum processing unit is a single image object. 'MPI_Scatter', as shown in Figure 3, takes an array of image objects and distributes the objects in the order of process rank. 'MPI_Gather' is the reverse function of 'scatter'. After image processing, all the feature data vectors representing the processed images are put together in the order of process rank to produce a new data array, which is the input for the second part of the workflow.

In the second part of the workflow, the Fisher-ratio algorithm is used for feature selection and extraction [3]. Taking into account the large computing resources and the regular calculation of mean and standard deviation, fine-grain parallelism is implemented. The implementation is similar to that of part 3, and the detail is discussed in the next paragraph. In the context of this research, 'fine-grain' means that a task is split into very small fragments to make efficient use of the computing resources, though the increased communication time among processors is a tradeoff.

In part 3 of the workflow, a Linear Discriminant Analysis (LDA) algorithm [16] is used to implement a classifier, considering the optimization on classification [3]. In the LDA implementation, both regular and irregular computations are involved, such as matrix multiplication, finding the global and sub-data-set means and covariance matrices, and the calculation of a discriminant function. Taking into account the complexity of the computation, fine-grain and task parallelisms are implemented by jointly using point-to-point and collective communicators. Various kinds of computations are defined as individual tasks. Sub-data sets for different computations are distributed by way of 'MPI_Scatter'. Data transfer among tasks is managed by MPI's point-to-point communicators 'MPI_Send/Recv'. This is a task-parallel implementation. Within each individual computation, fine-grain parallelism is designed, which splits each vector of an array into sub-vectors by using 'MPI_Scatter'. Figure 4 illustrates the fine-grained array multiplication calculation with an increasing number of processes. The parallel implementation is sensitive to the number of processes: when the number of processes or computing cores is less than the number of vectors in an array, it is a coarse-grained parallel implementation; it is a fine-grained parallel implementation when the number of computing cores is greater than this number of vectors. The advantage of this structure is that it makes efficient use of the computing resources, so that no computing cores have significantly different workloads from others as the workflow progresses.
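To make the data-parallel first part of the workflow concrete, the following C sketch (illustrative only: the authors' implementation used a Java MPI binding, and the image size, feature length and extract_features routine are assumed placeholders rather than the real pipeline) scatters a flattened array of images from rank 0, lets each process generate features for its own chunk, and gathers the feature vectors back at rank 0 in rank order, the MPI_Scatter/MPI_Gather pattern used in part 1.

------------------------------------------------------------------
/* Sketch of the part-1 data parallelism: scatter images, process locally,
 * gather features. Assumes the image count divides evenly by the number of
 * processes; all sizes and names are illustrative. */
#include <mpi.h>
#include <stdlib.h>

#define PIXELS_PER_IMAGE 4096   /* assumed flattened image size  */
#define FEATS_PER_IMAGE    32   /* assumed feature vector length */

/* placeholder for the real image-processing / feature-generation step */
static void extract_features(const float *img, float *feat)
{
    for (int i = 0; i < FEATS_PER_IMAGE; ++i)
        feat[i] = img[i * (PIXELS_PER_IMAGE / FEATS_PER_IMAGE)];
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int n_images = 64 * size;               /* assumed total workload      */
    int local_n  = n_images / size;         /* images handled by each rank */

    float *all_images = NULL, *all_feats = NULL;
    if (rank == 0) {                        /* rank 0 holds the full data  */
        all_images = calloc((size_t)n_images * PIXELS_PER_IMAGE, sizeof(float));
        all_feats  = calloc((size_t)n_images * FEATS_PER_IMAGE,  sizeof(float));
    }
    float *my_images = malloc((size_t)local_n * PIXELS_PER_IMAGE * sizeof(float));
    float *my_feats  = malloc((size_t)local_n * FEATS_PER_IMAGE  * sizeof(float));

    /* distribute image chunks in rank order, as in Figure 3 */
    MPI_Scatter(all_images, local_n * PIXELS_PER_IMAGE, MPI_FLOAT,
                my_images,  local_n * PIXELS_PER_IMAGE, MPI_FLOAT,
                0, MPI_COMM_WORLD);

    for (int i = 0; i < local_n; ++i)
        extract_features(my_images + (size_t)i * PIXELS_PER_IMAGE,
                         my_feats  + (size_t)i * FEATS_PER_IMAGE);

    /* collect feature vectors back at rank 0, again in rank order */
    MPI_Gather(my_feats,  local_n * FEATS_PER_IMAGE, MPI_FLOAT,
               all_feats, local_n * FEATS_PER_IMAGE, MPI_FLOAT,
               0, MPI_COMM_WORLD);

    free(my_images); free(my_feats);
    if (rank == 0) { free(all_images); free(all_feats); }
    MPI_Finalize();
    return 0;
}
------------------------------------------------------------------

Because MPI_Scatter and MPI_Gather place data in rank order, the gathered feature array at rank 0 is already the contiguous input expected by the second part of the workflow.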

[Figure 2 diagram: the big image data set is scattered from rank 0 (MPI Scatter); ranks 0 to N each perform image processing and produce a feature data set; the feature data sets are gathered to form the complete feature data set at the rank 0 processor (MPI Gather); sub feature data sets and indices are scattered from rank 0 for fine-grain feature selection on each rank and gathered at rank 0 for feature extraction; a further scatter feeds the fine-grain computation of the global, positive-group and negative-group means, which are exchanged via MPI Send/Recv and followed by the fine-grain covariance over the test data set, the fine-grain discrimination function, and a final MPI Gather at rank 0 for the classification prediction.]

Figure 2. Parallel architecture for the workflow.
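The MPI Send/Recv hand-off between tasks in Figure 2 can be sketched as follows. In this hedged C example the rank assignments, message tags and vector length are illustrative assumptions, not the authors' actual task mapping: two ranks compute the positive-group and negative-group means and pass them to the rank that continues with the covariance and discriminant-function step.

------------------------------------------------------------------
/* Sketch of the point-to-point hand-off between tasks in part 3 (Figure 2).
 * Ranks 1 and 2 send their group means to rank 0, which continues with the
 * covariance / discriminant-function computation. Needs >= 3 processes. */
#include <mpi.h>

#define N_FEATURES   128        /* assumed length of a mean vector */
#define TAG_POS_MEAN 10
#define TAG_NEG_MEAN 11

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double mean[N_FEATURES] = {0.0};

    if (rank == 1) {            /* task: positive-group mean */
        /* ... fill mean[] from the positive sub-data set ... */
        MPI_Send(mean, N_FEATURES, MPI_DOUBLE, 0, TAG_POS_MEAN, MPI_COMM_WORLD);
    } else if (rank == 2) {     /* task: negative-group mean */
        /* ... fill mean[] from the negative sub-data set ... */
        MPI_Send(mean, N_FEATURES, MPI_DOUBLE, 0, TAG_NEG_MEAN, MPI_COMM_WORLD);
    } else if (rank == 0) {     /* task: covariance and discriminant function */
        double pos_mean[N_FEATURES], neg_mean[N_FEATURES];
        MPI_Recv(pos_mean, N_FEATURES, MPI_DOUBLE, 1, TAG_POS_MEAN,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(neg_mean, N_FEATURES, MPI_DOUBLE, 2, TAG_NEG_MEAN,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ... proceed with covariance matrix and discriminant function ... */
    }

    MPI_Finalize();
    return 0;
}
------------------------------------------------------------------

The sketch needs at least three processes (e.g. mpirun -np 3); in the real workflow the receiving task would combine these means with the scattered sub-data sets for the fine-grain covariance computation.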

Figure 3. 'MPI_Scatter' operation with n processors and a data series.


(a) Process number less than the number of rows; (b) process number greater than the number of rows.

Figure 4. Array multiplication under process number dependent fine-grain algorithm.
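As a concrete, hedged companion to Figure 4 and to the process number dependent fine-grain algorithm listed below, the following C sketch computes a matrix-vector product y = A x. When there are at least as many rows as processes it scatters whole rows (the coarse-grained case); otherwise it splits each row into sub-vectors. The array sizes are illustrative, even divisibility by the process count is assumed, and the sub-vector branch, whose listing is truncated in the source, is reconstructed here with an MPI_Reduce of partial products rather than taken from the authors' code.

------------------------------------------------------------------
/* Sketch of the "process number dependent fine-grain" idea for y = A * x.
 * Coarse branch: nrow >= nP, scatter whole rows (as in the paper's listing).
 * Fine branch:   nrow <  nP, split each row into sub-vectors and sum the
 * partial products with MPI_Reduce (an assumed reconstruction). */
#include <mpi.h>
#include <stdlib.h>

#define NROW 8      /* assumed number of rows (vectors) in the array */
#define NCOL 64     /* assumed length of each row                    */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nP;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nP);

    double *A = NULL, y[NROW], x[NCOL];
    for (int j = 0; j < NCOL; ++j) x[j] = 1.0;          /* every rank holds x */
    if (rank == 0) A = calloc((size_t)NROW * NCOL, sizeof(double));

    if (NROW >= nP) {
        /* coarse-grained branch: each process receives NROW/nP full rows */
        int rows = NROW / nP;                           /* assumes NROW % nP == 0 */
        double *my_rows = malloc((size_t)rows * NCOL * sizeof(double));
        double *my_y    = malloc((size_t)rows * sizeof(double));

        MPI_Scatter(A, rows * NCOL, MPI_DOUBLE,
                    my_rows, rows * NCOL, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        for (int r = 0; r < rows; ++r) {
            my_y[r] = 0.0;
            for (int j = 0; j < NCOL; ++j)
                my_y[r] += my_rows[(size_t)r * NCOL + j] * x[j];
        }
        MPI_Gather(my_y, rows, MPI_DOUBLE, y, rows, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        free(my_rows); free(my_y);
    } else {
        /* fine-grained branch (assumed): split each row into NCOL/nP sub-vectors */
        int seg = NCOL / nP;                            /* assumes NCOL % nP == 0 */
        double *my_seg = malloc((size_t)seg * sizeof(double));
        for (int r = 0; r < NROW; ++r) {
            MPI_Scatter(rank == 0 ? A + (size_t)r * NCOL : NULL, seg, MPI_DOUBLE,
                        my_seg, seg, MPI_DOUBLE, 0, MPI_COMM_WORLD);
            double partial = 0.0;
            for (int j = 0; j < seg; ++j)
                partial += my_seg[j] * x[rank * seg + j];
            /* combine the partial dot products for row r at rank 0 */
            MPI_Reduce(&partial, &y[r], 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        }
        free(my_seg);
    }

    if (rank == 0) free(A);
    MPI_Finalize();
    return 0;
}
------------------------------------------------------------------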

The fine-grain algorithm corresponding to the number of processes is given below; we call it the process number dependent fine-grain algorithm.

------------------------------------------------------------------
Algorithm. Process number dependent fine-grain algorithm
------------------------------------------------------------------
Input: process number nP
Input: number of array columns ncolumn
Input: number of array rows nrow
If nrow >= nP
    MPI_Scatter buffer length = (nrow / nP) * ncolumn
    Calculation
    MPI_Gather buffer length = nrow / nP
End
While nrow < nP …
------------------------------------------------------------------

standalone form on the cluster, and it is compiled with FastMPJ [17] and packed into an executable .jar file. We have performed experiments by varying the number of processors and the size of the dataset:
• Number of processors: 4 cores, 8 cores, 16 cores, 32 cores, 64 cores.
• Data size (number of images): 1X, 2X, 4X, 8X, 16X, 32X, 64X, where X = 128 images.

2) Performance evaluation metrics

We have measured the performance of the parallel solution using two metrics: speedup and scalability.

a) Speedup

Amdahl's law [18] describes the ideal linear speedup for a parallel system with the following equation:

\mathrm{Speedup} = \frac{1}{\alpha + \frac{1-\alpha}{P}}    (1)

In this expression, \alpha is the fraction of a calculation that is sequential, (1 - \alpha) is the fraction that can be parallelised, and P is the number of processors. The maximum speedup is therefore bounded by 1/\alpha as P grows.
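As a brief numerical illustration of Eq. (1), with an assumed sequential fraction (not a value measured for this application) of \alpha = 0.05 and P = 64 processors:

\mathrm{Speedup} = \frac{1}{0.05 + \frac{0.95}{64}} \approx 15.4

The bound 1/\alpha = 20 is approached but never exceeded as P grows, so any measured speedup above the ideal linear line has to come from effects outside this model, such as caching.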

Figure 5. Speedup performance analysis (speedup relative to the baseline versus number of cores, 4 to 64, for data sizes 1X to 64X where X = 128 images; the ideal linear speedup is shown for comparison).

b) Scalability

The scalability concerns the relationship between the computation time and the increasing number of processors and dataset size. It can be expressed in terms of the relationship between the factor \Delta T, by which the computation time changes, the factor N_P, by which the number of parallel processors increases, and the factor N_D, by which the dataset size increases. Factors less than 1 represent a decrease. \Delta T may be expressed as

\Delta T = \frac{N_D}{N_P} K(N_D, N_P)    (2)

The function K may depend on many factors including N_D and N_P. Over ranges of N_D and N_P for which the function K remains constant, the speedup is said to be linear. Ideal linear speedup occurs where K = 1, which means that if the factors N_P and N_D are equal, \Delta T = 1; the computation time does not change when the number of parallel processors and the dataset size increase by the same ratio. The speedup becomes 'super-linear' (or 'super-ideal') if K becomes less than 1 over any range of factors N_D and N_P, which means that a reduction in computation time (\Delta T < 1) will occur when the number of processors and the dataset size increase by the same ratio.

We can thus derive the following equations based on Eq. (2). For a fixed number of computing cores with an increasing dataset size, the scalability can be represented by Eq. (3); when both the dataset size and the number of computing cores increase, the scalability can be represented by Eq. (4):

\frac{\partial \Delta T}{\partial N_D} = \frac{N_D}{N_P} \frac{\partial K(N_D, N_P)}{\partial N_D} + \frac{1}{N_P} K(N_D, N_P)    (3)

\frac{\partial \Delta T}{\partial N_P} = \frac{1}{N_P} \left( N_D \frac{\partial K(N_D, N_P)}{\partial N_P} - \Delta T \right)    (4)

In Figure 6 (a), it may be seen that the increasing factor of computation time is almost equal to the increasing factor of dataset size for all numbers of cores, and the curves are close to the ideal linear relationship. Figure 6 (b) shows that the decreasing factor of computation time is almost equal to the increasing factor of the number of computing cores. All these results demonstrate the good scalability of the parallel processing.

[Figure 6 plots: (a) relative time to process versus relative data size (1x to 64x) for 4, 8, 16, 32 and 64 cores; (b) relative processing time to baseline versus relative number of cores (1x to 16x, baseline = 4 cores) for data sizes 76.8 M to 4.8 G.]

Figure 6. Scalability analysis: (a) time increasing factor vs dataset size increasing factor; (b) time decreasing factor vs increasing factor of computing core number.
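As a hedged numerical illustration of how K is read off from measurements (the timing factor below is an assumed value, not a figure reported in the paper): suppose the dataset size and the core count are both increased by a factor of 8 and the computation time is then observed to fall slightly, i.e. \Delta T = 0.9. Rearranging Eq. (2) gives

K(N_D, N_P) = \Delta T \cdot \frac{N_P}{N_D} = 0.9 \cdot \frac{8}{8} = 0.9 < 1

so the scaling over that range is super-ideal; a value of exactly 1 would correspond to ideal linear scaling, and a value above 1 to sub-linear scaling.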

Figure 7 shows the value of K(N_D, N_P) in Eq. (2) plotted over the ranges of N_D and N_P used in the experiments. In the figure, the labels 'log2NP' and 'log2ND' represent the increasing factors of computing cores and dataset size respectively; for example, 'log2NP' equal to 3 means that the core number increases by a factor of 8. It may be seen in Figure 7 that the function K becomes less than one over a significant range of values of (N_D, N_P), meaning that the scalability becomes super-ideal.

[Figure 7 plot: K factor surface (values roughly 0.7 to 1.3) over log2(NP) and log2(ND).]

Figure 7. Scalability K(N_D, N_P) for factors N_D and N_P.

The performance of this parallel structure has also been tested on a large dataset of 78.6 G. It shows super-linear speedups of 2.66, 5.24 and 9.87 when the number of processors is increased by factors of 2, 4 and 8 respectively.

IV. CONCLUSIONS

The process number dependent fine-grained algorithm enhances the parallel calculation needed in Linear Discriminant Analysis (LDA) classification. The advantage of this parallel structure is that it makes efficient use of the computing resources, so that no computing cores have significantly different workloads from others as the workflow progresses. The parallel architecture, designed using MPI communicators, has enhanced the parallel processing performance in both speedup and scalability, and super-linear speedup and super-ideal scalability have occurred under this parallel structure. The method of quantitative analysis of scalability may be used for complicated high performance systems. To maximize the benefits of super-linear speedup, the data objects may be clustered according to readily discernible aspects of their similarity; similar objects are sent to the same processor to maximize the benefits of caching.

ACKNOWLEDGMENT

This research was sponsored by the Biotechnology and Biological Sciences Research Council (BBSRC). The authors acknowledge the financial support from BBSRC and the collaboration among the people of the AGILE project.

REFERENCES

[1] J. Benzi and M. Damodaran, "Parallel Three Dimensional Direct Simulation Monte Carlo for Simulating Micro Flows," Parallel Computational Fluid Dynamics: Implementations and Experiences on Large Scale and Grid Computing, p. 95, Springer, 2007, retrieved 21 March 2013.
[2] Eurexpress website, http://www.eurexpress.org/ee/, accessed May 2013.
[3] L. Han, J. van Hemert, and R. Baldock, "Automatically identifying and annotating mouse embryo gene expression patterns," Bioinformatics, vol. 27, no. 8, pp. 1101–1107, 2011.
[4] Y. Aoyama and J. Nakano, "Practical MPI Programming," http://www.redbooks.ibm.com/abstracts/sg245380.html, ITSO, June 1999.
[5] EURExpress-II project, http://www.eurexpress.org/ee/, retrieved 10 May 2010.
[6] L. Silva and R. Buyya, "High Performance Cluster Computing: Programming and Applications," ch. Parallel Programming Models and Paradigms, pp. 4–27, ISBN 0-13-013785-5, Prentice Hall PTR, NJ, USA, 1999.
[7] P. S. Pacheco, Parallel Programming with MPI, Morgan Kaufmann Publishers, Inc., 1997.
[8] PVM, http://www.csm.ornl.gov/pvm/, retrieved 5 May 2010.
[9] OpenMPI, http://www.open-mpi.org/, retrieved 5 May 2010.
[10] MPICH, http://www.mcs.anl.gov/research/projects/mpi/mpich1/, retrieved 5 May 2010.
[11] GridMPI, http://www.gridmpi.org/index.jsp, retrieved 5 May 2010.
[12] LAM/MPI, http://www.lam-mpi.org/, retrieved 5 May 2010.
[13] J. H. Christiansen, Y. Yang, S. Venkataraman, L. Richardson, P. Stevenson, N. Burton, R. A. Baldock, and D. R. Davidson, "EMAGE: a spatial database of gene expression patterns during mouse embryo development," Nucleic Acids Research, vol. 34, pp. D637, 2006.
[14] Diez-Roux et al., "A High-Resolution Anatomical Atlas of the Transcriptome in the Mouse Embryo," PLoS Biology, vol. 9, no. 1, e1000582, 2011.
[15] E. S. Lein, M. J. Hawrylycz, et al., "Genome-wide atlas of gene expression in the adult mouse brain," Nature, vol. 445, no. 7124, pp. 168–176, Jan. 2007.
[16] G. Perriere and J. Thioulouse, "Use of Correspondence Discriminant Analysis to predict the subcellular location of bacterial proteins," Computer Methods and Programs in Biomedicine, vol. 70, pp. 99–105, 2003.
[17] FastMPJ: "High Performance Java Message Passing Library," http://www.fastmpj.com/, accessed January 2013.
[18] G. Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities," AFIPS Conference Proceedings, vol. 30, pp. 483–485, 1967.
