Enhancing Parallelism of Data-Intensive Bioinformatics Applications

Enhancing Parallelism of Data-Intensive Bioinformatics Applications Zheng Xie, Liangxiu Han Richard Baldock School of Computing, Mathematics and Digital MRC Human Genetics Unit MRC IGMM, Technology University of Edinburgh Western General Hospital Manchester Metropolitan University Edinburgh EH4 2XU, UK Manchester M1 5GD, UK [email protected] [email protected] , [email protected] Abstract — Bioinformatics data-resources collected from Our contribution of this work lies in 1) Design and heterogeneous and distributed sources can contain hundreds of implementation of enhanced parallel algorithms for a Terra-Bytes and the efficient exploration on these large bioinformatics data intensive application based on MPI. 2) amounts of data is a critical task to enable scientists to gain Quantitative analyses of super-linear speedup and high new biological insight. In this work, an MPI-based parallel scalability obtained from the parallel system. 3) architecture has been designed for enhancing performance of biomedical data intensive applications. The experiment results Consideration of maximizing the benefits of super-linear show the system has achieved super-linear speedup and high speedup and high scalability by maximizing the benefits of scalability. caching. The rest of the paper is organized as follows: section II Keywords- gene pattern recognition, data intensive describes the background and parallel solution for the application, Message Passing Interface, collective/point-to-point biomedical use case; Section III presents the experimental communicators, task/data parallelisms, fine-grain parallelism, evaluation. In Section IV, we conclude our work. super-linear speedup, super-ideal scalability. I. INTRODUCTION II. PARALLEL PROCESSING FOR DATA INTENSIVE With the exploitation of advanced high-throughput BIOMEDICAL APPLICATIONS instrumentation, the quantity and variety of bioinformatics A. Background of the bioinformatics application data have become overwhelming. Such data, collected from heterogeneous and distributed sources, typically consists of In this research, we aim to accelerate a task from the tens or hundreds of Terra Bytes (TB) comprising ten to fifty biomedical science. This particular task concerns thousand individual assays with supplementary metadata ontological annotation of gene expression in the mouse and spatial-mapping information. This is the case with the Embryo. Ontological annotation of gene expression has mouse embryo gene-expression databases EMAGE [13], been widely used to identify gene interactions and networks EurExpress [14] and the Allen Brain Atlas [15]. To enable that are associated with developmental and physiological scientists to gain new biological insights, massively parallel functions in the embryo. It entails labelling embryo images processing on these data is required. Parallel and distributed produced from RNA in situ Hybridization (ISH) with terms computing, along with parallel programming models (e.g., from the anatomy ontology for mouse development. If an MPI [1]), provides solutions by splitting massive data- image is tagged with a term, it means the corresponding intensive tasks into smaller fragments and carrying out anatomical component shows expression of that gene. The much smaller computations concurrently. input is a set of image files and corresponding metadata. This work explores how parallel processing could The output will be an identification of the anatomical accelerate data intensive biomedical applications. Message components that exhibit gene expression patterns in each Passing Interface (MPI) has been chosen to implement our image. This is a typical pattern recognition task. As shown parallel structure. MPI is a standardized and portable in Figure1 (a), we first need to identify the features of message-passing application programmer interface [4], `humerus' in the embryo image and then annotate the image which is designed to function on a wide variety of parallel using ontology terms listed on the left ontology panel . To computers. Its library provides communication functionality automatically annotate images, three stages are required: at among a set of processes. A rich range of functions in the the training stage, the classification model has to be built, library enables us to implement complicated communication based on training image datasets with annotations; at the which we need for enhancing the effectiveness of parallel testing stage, the performance of the classification computing as computational resources increase. 1 (a) (b) Figure 1. Automatic ontological gene annotation [3]. model has to be tested and evaluated; then at the parallelisation at either hardware or software levels or both deployment stage, the model has to be deployed to perform (e.g., signal, circuit, component and system levels). the classification of all non-annotated images. We mainly In general, three considerations when parallelising an focus on the training stage in this case. The processes in the application at software level include: training stage include integration of images and annotations, • How to distribute workloads or decompose an algorithm image processing, feature generation, feature selection and into parts as tasks? extraction, and classifier design, as shown in Figure1 (b). • How to map the tasks onto various computing nodes and Currently gene expression annotation is mainly done execute the subtasks in parallel? manually by domain experts. This is both time-consuming • How to coordinate and communicate subtasks on those and costly, especially with the rapidly growing volume of computing nodes. image data of gene expression patterns on tissue sections There are mainly two common methods for dealing produced by advanced high-throughput instruments (e.g. with the first two questions: data parallelism and task ISH). For example, the EuExpress-II project [2] delivered parallelism. Data parallelism represents workloads are partially annotated and curated datasets of over 20 distributed into different computing nodes and the same task Terabytes including images for the developing mouse can be executed on different subsets of the data embryo and the ontological terms for anatomic components simultaneously. Task parallelism means the tasks are of the mouse embryo, which provides a powerful resource independent and can be executed purely in parallel. There is for discovery of genetic control and functional pathways of another special kind of the task parallelism is called processes controlling embryo organisation. To alleviate the ‘pipelining’. A task is processed at different stages of a issues with the manual annotation, we have developed data pipeline, which is especially suitable for the case when the mining algorithms for automatically identifying an same task is used repeatedly. The extent of parallelisation is anatomical component in the embryo image and annotating determined by dependencies of each individual part of the the image using the provided ontological terms [3], algorithms and tasks. programmed these in sequential code and executed the task As for the coordination and communication among on a single commodity machine. Note that this task is a tasks or processes on various nodes or computing cores, it specific instance of a generic image-processing analysis depends on different memory architectures (shared memory pipeline that could be applied to many datasets. To process or distributed memory). A number of communication the everincreasing growth in the volume of stored data, it is models have been developed [7][8]. Among them, the necessary to design parallel solutions for speedup the data Message Passing Interface (MPI) has been developed for intensive applications. HPC parallel applications with distributed memory architectures and has become the de-facto standard. There is B. Overview of parallel approach a set of implementations of MPI, for example, OpenMPI [9], MPICH [10], GridMPI [11] and LAM/MPI [12]. Two It is well known that the speedup of an application to types of MPI communication functionality are point-to- solve large computational problems is mainly gained by the point and collective communication, and there are a number of functions involved in them. Point-to-point functions deal 2 with communication between two specific processes, in implement the data parallel and enhance the proceeding which a message needs to be sent from a specified process speed with minimum communication time. It is implemented to another specified process. They are suitable for patterned with MPI collective communicator ‘MPI_Scatter’ and or irregular communication. Collective functions manage ‘MPI_Gather’, and the minimum processing unit is a single communication among all processes in a process group at image object. ‘MPI_Scatter’, as shown in Figure3, takes an same time. The process group could either be the entire array of image objects and distributes the objects in the order process pool or a program-defined process subset. of process rank. ‘MPI_Gather’ is the reverse function of Collective communication is more efficient than point-to- ‘scatter’. After image processing, all the feature data vectors representing the processed images are put together in the point communication, and ought to be used in the parallel order of the process rank to produce a new data array, which architecture wherever it is suitable. is the input for the second part of the workflow. C. Parallel processing using MPI for the bioinformatics In the second part of the workflow, fisher-ratio

Enhancing Parallelism of Data-Intensive Bioinformatics Applications

Data-Level Parallelism

High Performance Computing Through Parallel and Distributed Processing

2.5 Classification of Parallel Computers

A Massively-Parallel Mixed-Mode Computer Designed to Support

Distributed Algorithms with Theoretic Scalability Analysis of Radial and Looped Load ﬂows for Power Distribution Systems

Balanced Multithreading: Increasing Throughput Via a Low Cost Multithreading Hierarchy

Scalability and Performance Management of Internet Applications in the Cloud

Amdahl's Law Threading, Openmp

Introduction to Multi-Threading and Vectorization Matti Kortelainen Larsoft Workshop 2019 25 June 2019 Outline

Scalability and Optimization Strategies for GPU Enhanced Neural Networks (Genn)

Parallel Generation of Image Layers Constructed by Edge Detection

Massively Parallel Computing with CUDA