
2011 IEEE International Parallel & Distributed Processing Symposium

An Ultrafast Scalable Many-core Motif Discovery Algorithm for Multiple GPUs

Yongchao Liu, Bertil Schmidt, Douglas L. Maskell
School of Computer Engineering, Nanyang Technological University, Singapore
e-mail: {liuy0039, asbschmidt, asdouglas}@ntu.edu.sg

Abstract—The identification of genome-wide transcription factor binding sites is a fundamental and crucial problem for fully understanding transcriptional regulatory processes. However, the high computational cost of many motif discovery algorithms heavily constrains their application to large-scale datasets. The rapid growth of genomic sequence and gene transcription data further deteriorates the situation and establishes a strong requirement for time-efficient, scalable motif discovery algorithms. The emergence of many-core architectures, typically CUDA-enabled GPUs, provides an opportunity to reduce the execution time by an order of magnitude without loss of accuracy. In this paper, we present mCUDA-MEME, an ultrafast scalable many-core motif discovery algorithm for multiple GPUs based on the MEME algorithm. Our algorithm is implemented using a hybrid combination of the CUDA, OpenMP and MPI parallel programming models in order to harness the powerful compute capability of modern GPU clusters. At present, our algorithm supports the OOPS and ZOOPS models, which are sufficient for most motif discovery applications. mCUDA-MEME achieves significant speedups for the starting point search stage (and the overall execution) when benchmarked, using real datasets, against parallel MEME running on 32 CPU cores. Speedups of up to 1.4 (1.1) on a single GPU of a Fermi-based Tesla S2050 quad-GPU computing system and up to 10.8 (8.3) on the eight GPUs of a two Tesla S2050 system were observed. Furthermore, our algorithm shows good scalability with respect to dataset size and the number of GPUs (availability: https://sites.google.com/site/yongchaosoftware/mcuda-meme).

Keywords: Motif discovery; MEME; CUDA; MPI; OpenMP; GPU

I. INTRODUCTION

De novo motif discovery is crucial to the complete understanding of transcription regulatory processes by identifying transcription factor binding sites (TFBSs) on a genome-wide scale. Algorithmic approaches for motif discovery can be classified into two categories: iterative and combinatorial. Iterative approaches generally exploit probabilistic matching models, such as Expectation Maximization (EM) [1] and Gibbs sampling [2]. These approaches are often preferred because they use position-specific scoring matrices to describe the matching between a motif instance and a sequence, instead of a simple Hamming distance. Combinatorial approaches employ deterministic matching models, such as word enumeration [3] and dictionary methods [4].

MEME [5] is a popular and well-established motif discovery algorithm, which is primarily comprised of the starting point search (SPS) stage and the EM stage. However, the high computational cost of MEME constrains its application to large-scale datasets, such as motif identification in whole peak regions from ChIP-Seq datasets for transcription factor binding experiments [6]. This encourages the use of high-performance computing solutions to meet the execution time requirement. Several attempts have been made to improve execution speed on conventional computing systems, including distributed-memory clusters [7] and special-purpose hardware [8].

The emerging many-core architectures, typically Compute Unified Device Architecture (CUDA)-enabled GPUs [9], have demonstrated their power for accelerating bioinformatics algorithms [10] [11] [12] [13]. This convinces us to employ CUDA-enabled GPUs to accelerate motif discovery. Previously, we presented CUDA-MEME, based on MEME version 3.5.4, for a single GPU device, which accelerates motif discovery using two parallelization approaches, namely sequence-level parallelization and substring-level parallelization. The detailed implementations of the two approaches have been described in [14].

However, the increasing size and availability of ChIP-Seq datasets have established the need for parallel motif discovery with even higher performance. Therefore, in this paper, we present mCUDA-MEME, an ultrafast scalable many-core motif discovery algorithm for multiple GPUs. Our algorithm is designed based on MEME version 4.4.0, which incorporates position-specific priors (PSP) to improve accuracy [15]. In order to harness the powerful compute capability of GPU clusters, we employ a hybrid combination of the CUDA, Open Multi-Processing (OpenMP) and Message Passing Interface (MPI) parallel programming models to implement this algorithm. Compared to CUDA-MEME, mCUDA-MEME introduces four significant new features:

• Supports multiple GPUs in a single host or a GPU cluster through an MPI-based design;
• Supports starting point search from the reverse complements of DNA sequences;
• Employs a multi-threaded design using OpenMP to accelerate the EM stage;
• Incorporates PSP prior probabilities to improve accuracy.

Since the sequence-level parallelization is advantageous over the substring-level parallelization in execution speed [14], the following discussion only refers to the sequence-level parallelization. Real datasets are employed to evaluate the performance of mCUDA-MEME and parallel MEME (i.e., the MPI version of MEME distributed with the MEME software package). Compared to parallel MEME (version 4.4.0) running on 32 CPU cores, mCUDA-MEME achieves a speedup of the starting point search stage (and the overall execution) of up to 1.4 (1.1) times on a single GPU of a Fermi-based Tesla S2050 quad-GPU computing system and up to 10.8 (8.3) times on eight GPUs of a two Tesla S2050 system. Furthermore, our algorithm shows good scalability with respect to dataset size and the number of GPUs.

The rest of this paper is organized as follows. Section 2 briefly introduces the MEME algorithm and the CUDA, OpenMP and MPI parallel programming models. Section 3 details the new features of mCUDA-MEME. Section 4 evaluates the performance using real datasets, and Section 5 concludes this paper.

II. BACKGROUND

A. The MEME Algorithm

Given a set of protein or DNA sequences, MEME attempts to search for statistically significant (unknown) motif occurrences, which are believed to be shared by the sequences, by optimizing the parameters of statistical motif models using the EM approach. MEME provides support for three types of search modes: one occurrence per sequence (OOPS), zero or one occurrence per sequence (ZOOPS), and two-component mixture (TCM). The OOPS model postulates that there is exactly one motif occurrence per sequence in the dataset, the ZOOPS model postulates zero or one motif occurrence per sequence, and the TCM model postulates that there can be any number of non-overlapping occurrences per sequence [5]. Since the OOPS and ZOOPS models are sufficient for most motif finding applications, our algorithm concentrates on these two models.

MEME begins a motif search with the creation of a set of motif models. Each motif model θ is a position-specific probability matrix representing frequency estimates of letters occurring at different positions. Given a motif of width w defined over an alphabet Σ = {A1, A2, ..., A|Σ|}, each value θ(i, j) (1 ≤ i ≤ |Σ| and 0 ≤ j ≤ w) of the matrix is defined as:

\theta(i, j) = \begin{cases} \text{probability of } A_i \text{ at position } j \text{ of the motif}, & 1 \le j \le w \\ \text{probability of } A_i \text{ not in the motif}, & j = 0 \end{cases}    (1)

A starting point is an initial motif model θ(0) from which the EM stage runs for a fixed number of iterations, or until convergence, to find the final motif model θ(q) with maximal posterior probability. To find the set of starting points for a given motif width, MEME converts each substring in a sequence dataset into a potential motif model and then determines its statistical significance by calculating the weighted log likelihood ratio [15] on different numbers of predicted sites, subject to the model used. The potential motif models with the highest weighted log likelihood ratio are selected as starting points for the successive EM algorithm.

When performing the starting point search, independent computation from each substring of length w in the dataset S = {S1, S2, ..., Sn} of n input sequences is conducted to determine a set of initial motif models. The following notations are used for the convenience of discussion: li denotes the length of Si, S̅i denotes the reverse complement of Si, Si,j denotes the substring of length w starting from position j of Si, and Si(j) denotes the j-th letter of Si, where 1 ≤ i ≤ n and 0 ≤ j ≤ li − w. The starting point search process is primarily comprised of three steps for the OOPS and ZOOPS models:

• Calculate the probability score P(Si,j, Sk,l) from the forward strand (or P(Si,j, S̅k,l) from the reverse complement), which is the probability that a site starts at position l in Sk when a site starts at position j in Si. The time complexity is O(li·lk) for each sequence pair Si and Sk.
• Identify the highest-scoring substring Sk,maxk (as well as its strand orientation) for each Sk. The time complexity is O(lk) for each sequence Sk.
• Sort the n highest-scoring substrings {Sk,maxk} in decreasing order of scores and determine the potential starting points. The time complexity is O(n log n) for OOPS and O(n²w) for ZOOPS.

The probability score P(Si,j, Sk,l) is computed as:

P(S_{i,j}, S_{k,l}) = \sum_{p=0}^{w-1} mat[S_i(j+p)][S_k(l+p)]    (2)

where mat denotes the letter frequency matrix of size |Σ|×|Σ|. To reduce computation redundancy, (2) can be further simplified to (3), where the computation of the probability scores {P(Si,j, Sk,l)} in the j-th iteration depends on the resulting scores {P(Si,j-1, Sk,l-1)} of the (j−1)-th iteration. However, P(Si,j, Sk,0) needs to be computed individually using (2).

P(S_{i,j}, S_{k,l}) = P(S_{i,j-1}, S_{k,l-1}) + mat[S_i(j+w-1)][S_k(l+w-1)] - mat[S_i(j-1)][S_k(l-1)]    (3)
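To make the use of (2) and (3) concrete, the following C-style sketch computes one row of probability scores for a fixed reference substring Si,j; the function and variable names are illustrative only and are not taken from the MEME or mCUDA-MEME sources.

    /* Sketch of the score computation in (2) and (3); names are hypothetical.
     * Si and Sk hold letter codes (indices into mat), lk is the length of Sk,
     * mat is the |alphabet| x |alphabet| letter frequency matrix, w is the motif width.
     * P_j receives P(Si,j, Sk,l) for all l; P_jm1 holds the previous row P(Si,j-1, Sk,l-1). */
    void score_row(const int *Si, const int *Sk, int lk, int w, int j,
                   const float mat[4][4], const float *P_jm1, float *P_j)
    {
        for (int l = 0; l <= lk - w; ++l) {
            if (j == 0 || l == 0) {
                float s = 0.0f;                      /* direct evaluation, equation (2) */
                for (int p = 0; p < w; ++p)
                    s += mat[Si[j + p]][Sk[l + p]];
                P_j[l] = s;
            } else {                                 /* incremental update, equation (3) */
                P_j[l] = P_jm1[l - 1]
                       + mat[Si[j + w - 1]][Sk[l + w - 1]]
                       - mat[Si[j - 1]][Sk[l - 1]];
            }
        }
    }

Once the previous row is available, each new row costs O(lk) instead of O(w·lk) operations, which is the saving exploited by the recurrence.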
B. The CUDA Programming Model

More than a software and hardware co-processing architecture, CUDA is also a parallel programming language that extends general programming languages, such as C, C++ and Fortran, with a minimalist set of abstractions for expressing parallelism. CUDA enables users to write parallel, scalable programs for CUDA-enabled processors with familiar languages [16]. A CUDA program is comprised of two parts: a host program running one or more sequential threads on a host CPU, and one or more parallel kernels able to execute on the Tesla [17] and Fermi [18] unified graphics and computing architectures.
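The following minimal, generic example (not code from mCUDA-MEME) illustrates this host/kernel split; the thread-block, shared-memory and warp concepts it relies on are described in the next paragraphs.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* Each thread block loads a tile of the input into per-block shared memory,
     * synchronizes at a barrier, and reduces the tile to a single partial sum. */
    __global__ void blockSum(const float *in, float *blockOut, int n)
    {
        __shared__ float tile[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                tile[threadIdx.x] += tile[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            blockOut[blockIdx.x] = tile[0];
    }

    int main(void)                                   /* host program part */
    {
        const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
        float *h_in = (float *)malloc(n * sizeof(float));
        float *h_out = (float *)malloc(blocks * sizeof(float));
        for (int i = 0; i < n; ++i) h_in[i] = 1.0f;
        float *d_in, *d_out;
        cudaMalloc((void **)&d_in, n * sizeof(float));
        cudaMalloc((void **)&d_out, blocks * sizeof(float));
        cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
        blockSum<<<blocks, threads>>>(d_in, d_out, n);   /* parallel kernel launch */
        cudaMemcpy(h_out, d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
        printf("partial sum of block 0: %.1f\n", h_out[0]);
        cudaFree(d_in); cudaFree(d_out); free(h_in); free(h_out);
        return 0;
    }

Thread blocks map naturally onto independent sub-problems (here, disjoint tiles); in mCUDA-MEME, a thread block processes a sequence pair in the same spirit [14].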

A kernel is a sequential program launched on a set of lightweight concurrent threads. The parallel threads are organized into a grid of thread blocks, where all threads in a thread block can synchronize through barriers and communicate via a high-speed, per-block shared memory (PBSM). This hierarchical organization of threads enables thread blocks to implement coarse-grained task and data parallelism, and the lightweight threads comprising a thread block to provide fine-grained thread-level parallelism. Threads from different thread blocks in the same grid are able to cooperate through atomic operations on global memory shared by all threads.

The CUDA-enabled processors are built around a fully programmable, scalable processor array organized into a number of streaming multiprocessors (SMs). In the Tesla architecture, each SM contains 8 scalar processors (SPs) and shares a fixed 16 KB of PBSM. For the Tesla-based GPU series, the number of SMs per device varies from generation to generation. The Fermi architecture contains 16 SMs, with each SM having 32 SPs. Each SM in the Fermi architecture has a configurable PBSM size drawn from 64 KB of on-chip memory. This on-chip memory can be configured as 48 KB of PBSM with 16 KB of L1 cache, or as 16 KB of PBSM with 48 KB of L1 cache. When executing a thread block, both architectures split all the threads in the thread block into small groups of 32 parallel threads, called warps, which are scheduled in a single-instruction, multiple-thread fashion. Divergence of execution paths is allowed for threads in a warp, but SMs realize full efficiency and performance when all threads of a warp take the same execution path.

C. The OpenMP Programming Model

OpenMP is a compiler-directive-based application program interface for explicitly directing multi-threaded, shared-memory parallelism [19]. This explicit programming model enables programmers to have full control over parallelism. OpenMP is primarily comprised of compiler directives, runtime library routines and environment variables that are specified for the C/C++ and Fortran programming languages. As an open specification and standard for expressing explicit parallelism, OpenMP is portable across a variety of shared-memory architectures and platforms. The basic idea behind OpenMP is the existence of multiple threads in the shared-memory programming paradigm, which enables data-shared parallel execution. In OpenMP, the unit of work is a thread and the fork-join model is used for parallel execution. An OpenMP program begins as the master thread and executes sequentially until encountering a parallel region construct specified by the programmer. Subsequently, the master thread creates a team of parallel threads to execute in parallel the statements that are enclosed by the parallel region construct. Having completed the statements in the parallel region construct, the team threads synchronize and terminate, leaving only the master thread.
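A small, generic C example of the fork-join pattern just described (not taken from mCUDA-MEME):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        double sum = 0.0;
        /* The master thread forks a team here; loop iterations are divided
         * among the threads and the partial sums are combined at the join. */
        #pragma omp parallel for reduction(+ : sum)
        for (int i = 1; i <= 1000000; ++i)
            sum += 1.0 / i;
        /* Only the master thread continues past the implicit barrier. */
        printf("harmonic sum with up to %d threads: %f\n", omp_get_max_threads(), sum);
        return 0;
    }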
D. The MPI Programming Model

MPI is a de facto standard for developing portable parallel applications using the message passing mechanism [20]. MPI works on both shared- and distributed-memory architectures and platforms, offering a highly portable solution to parallel programming on a variety of hardware topologies. MPI defines each worker as a process and enables the processes to execute different programs. This multiple program, multiple data model offers more flexibility for data-shared or data-distributed parallel program design. Within a computation, processes communicate data by calling runtime library routines, specified for the C/C++ and Fortran programming languages, including point-to-point and collective communication routines. Point-to-point communication is used to send and receive messages between two named processes, and is suitable for local and unstructured communications. Collective (global) communication is used to perform commonly used global operations (e.g., reduction and broadcast operations).
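A minimal, generic C example contrasting the two communication styles (illustrative only, not part of mCUDA-MEME):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, token = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Point-to-point: process 0 sends a token to process 1. */
        if (size > 1) {
            if (rank == 0) {
                token = 42;
                MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            } else if (rank == 1) {
                MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
        }

        /* Collective: every process contributes its rank; the sum arrives at process 0. */
        int local = rank, global = 0;
        MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum of ranks = %d\n", global);

        MPI_Finalize();
        return 0;
    }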

III. METHODS

mCUDA-MEME employs a hybrid CPU+GPU computing framework for each of its MPI processes in order to maximize performance by overlapping the CPU and GPU computation. An MPI process therefore consists of two threads: a CPU thread and a GPU thread. mCUDA-MEME requires a non-overlapping one-to-one correspondence between an MPI process and a GPU device. Unfortunately, the MPI runtime environment does not provide a mechanism to perform the automatic mapping between MPI processes and GPU devices. In this case, mCUDA-MEME employs a registration-based management mechanism (as shown in Fig. 1) to dynamically map GPU devices to MPI processes.

Figure 1. Registration-based GPU mapping management diagram.
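A rough sketch of this registration-based mapping, as we read Fig. 1 and the procedure detailed below (illustrative C/MPI/CUDA code, not the actual mCUDA-MEME source):

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Every process registers its host name with the master (rank 0); the master
     * assigns GPU identifiers 0, 1, ... per host and scatters them back; each
     * process then binds to its GPU.  The check that no host has more processes
     * than qualified GPUs is omitted for brevity. */
    int main(int argc, char **argv)
    {
        int rank, size, len, gpu_id = 0;
        char host[MPI_MAX_PROCESSOR_NAME] = "";
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &len);

        char *all = NULL;
        int *ids = NULL;
        if (rank == 0) {
            all = (char *)malloc((size_t)size * MPI_MAX_PROCESSOR_NAME);
            ids = (int *)malloc((size_t)size * sizeof(int));
        }
        /* Registration: gather all host names at the master process. */
        MPI_Gather(host, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                   all, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, MPI_COMM_WORLD);
        if (rank == 0) {
            for (int i = 0; i < size; ++i) {     /* next free identifier on process i's host */
                int id = 0;
                for (int j = 0; j < i; ++j)
                    if (strcmp(all + (size_t)j * MPI_MAX_PROCESSOR_NAME,
                               all + (size_t)i * MPI_MAX_PROCESSOR_NAME) == 0)
                        ++id;
                ids[i] = id;
            }
        }
        /* Assignment: each process receives its GPU identifier and binds to that device. */
        MPI_Scatter(ids, 1, MPI_INT, &gpu_id, 1, MPI_INT, 0, MPI_COMM_WORLD);
        cudaSetDevice(gpu_id);
        printf("rank %d on %s uses GPU %d\n", rank, host, gpu_id);
        free(all); free(ids);
        MPI_Finalize();
        return 0;
    }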

Firstly, a master process is specified and each worker process (including the master process) registers the name of the host machine on which it resides, as well as the number of qualified GPUs in that host, with the master process. Secondly, the master process iterates over each registered host and checks the number of processes distributed to it and the number of qualified GPUs it has. The registration is acceptable if each registered host does not have more processes than qualified GPUs, and is unacceptable otherwise. If the mapping between processes and GPUs is valid, the master process enumerates each host and assigns a unique GPU identifier to each process running in that host.

For the starting point search, most execution time is spent on the computation of probability scores to select the highest-scoring substring Sk,maxk for each Sk. mCUDA-MEME selects the highest-scoring substrings while computing the scores from the forward or reverse strands on GPUs. When using the reverse complements, mCUDA-MEME calculates the probability scores {P(Si,j, Sk,l)} and {P(Si,j, S̅k,l)} simultaneously and selects the highest-scoring substring in a single CUDA kernel launch. This is different to MEME, which performs two iterations of probability score calculation separately for the two strands of Sk. Fig. 2 shows the CUDA kernel pseudocode. Hence, when searching from two strands, MEME will (nearly) double the execution time compared with that from a single strand, whereas mCUDA-MEME is able to save some execution time by reusing common resources (e.g., the reference substring Si,j is fetched only once from texture memory instead of twice). Our benchmarking shows that, when searching from two strands, the execution time of mCUDA-MEME only increases by about 60%, compared to MEME.

*****************************************************************
Define S̅i to denote the reverse complement of Si.
*****************************************************************
Step 1: Load the letter frequency matrix into shared memory from constant memory;
Step 2: Get the sequence information;
Step 3: All threads calculate the probability scores {P(Si,0, Sk,l)} and {P(Si,0, S̅k,l)} in parallel, and select and save the highest-scoring substring Sk,maxk for Si,0, as well as its strand orientation;
Step 4: for j from 1 to li − w
            All threads calculate the probability scores {P(Si,j, Sk,l)} and {P(Si,j, S̅k,l)} in parallel using {P(Si,j-1, Sk,l-1)} and {P(Si,j-1, S̅k,l-1)}, and select and save the highest-scoring substring Sk,maxk for Si,j, as well as its strand orientation.
        end

Figure 2. CUDA kernel pseudocode for probability score calculation from two strands.
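As a simplified, hypothetical sketch of the idea behind Fig. 2 (not the actual mCUDA-MEME kernel, which additionally uses texture fetches and the recurrence in (3) across iterations), a kernel that evaluates both strands in a single pass might look as follows:

    /* One thread per candidate start position l in Sk.  seqSi and seqSk hold letter
     * codes, rcSk holds the reverse complement of Sk, and c_mat is the letter
     * frequency matrix kept in constant memory.  Each thread scores the reference
     * substring Si[j..j+w-1] against both strands and keeps the better score and
     * orientation; a separate reduction then picks the highest-scoring substring. */
    __constant__ float c_mat[4][4];

    __global__ void scoreBothStrands(const unsigned char *seqSi, int j,
                                     const unsigned char *seqSk, const unsigned char *rcSk,
                                     int lk, int w, float *score, char *strand)
    {
        int l = blockIdx.x * blockDim.x + threadIdx.x;
        if (l > lk - w) return;
        float fwd = 0.0f, rev = 0.0f;
        for (int p = 0; p < w; ++p) {
            int a = seqSi[j + p];
            fwd += c_mat[a][seqSk[l + p]];   /* forward strand, as in (2) */
            rev += c_mat[a][rcSk[l + p]];    /* reverse complement */
        }
        score[l]  = (fwd >= rev) ? fwd : rev;
        strand[l] = (fwd >= rev) ? '+' : '-';
    }

Evaluating both strands in the same pass is what allows the reference substring to be reused, which is the source of the saving discussed above.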
The sequence-level parallelization uses multiple iterations to complete the starting point search [14]. Generally, the execution time of all iterations heavily depends on the grid size (dimGrid) and the thread block size (dimBlock) used in each iteration. When determining the grid size, we consider the maximum number of resident threads per SM. As in [14], we assume the maximum number of resident threads per SM is 1024, even though Fermi-based GPUs support a maximum of 1536 resident threads per multiprocessor. Thus, mCUDA-MEME calculates the grid size per iteration as

dimGrid = 2 \times \#SM \times \frac{1024}{dimBlock}    (4)

where #SM is the number of SMs in a single GPU device and dimBlock is set to 64. For the sequence-level parallelization, the GPU device memory consumption only depends on the grid size and the maximum sequence length. The consumed device memory size can be estimated as

memSize = 32 \times dimGrid \times \max\{l_i\}, \quad 1 \le i \le n    (5)

From (5), we can see that the device memory consumption is directly proportional to the maximum sequence length for a specific GPU device. Thus, mCUDA-MEME sets a maximum sequence length threshold (by default 64K) to determine whether to use the sequence-level parallelization or the substring-level parallelization, which is slower but much more memory efficient.
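The rules in (4) and (5) translate directly into a few lines of host code; the following sketch (illustrative names, not the mCUDA-MEME source) also applies the default 64K length threshold mentioned above:

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Compute the launch configuration and the device memory estimate of (4) and (5). */
    void chooseConfig(int maxSeqLen /* max{l_i} over the input sequences */)
    {
        const int dimBlock = 64;             /* thread block size used per iteration */
        const int residentThreads = 1024;    /* assumed resident threads per SM, as in [14] */
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        int dimGrid = 2 * prop.multiProcessorCount * (residentThreads / dimBlock);   /* (4) */
        size_t memSize = 32u * (size_t)dimGrid * (size_t)maxSeqLen;                  /* (5) */
        int sequenceLevel = (maxSeqLen <= 64 * 1024);        /* default 64K length threshold */
        printf("dimGrid = %d, estimated device memory = %zu bytes, %s-level parallelization\n",
               dimGrid, memSize, sequenceLevel ? "sequence" : "substring");
    }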

When employing multiple GPUs, an appropriate load balancing policy is critical for high overall performance. In this paper, we assume that mCUDA-MEME is deployed on a homogeneous GPU cluster, which is often the case. Since the workload depends only on the input dataset S, this assumption simplifies the workload distribution between GPUs. In mCUDA-MEME, both the SPS stage and the EM stage are parallelized using MPI.

For the SPS stage, to achieve good workload balance, we calculate the total number of bases in the input sequences and (nearly) equally distribute these bases to all processes, using a sequence as the unit of distribution, because a thread block in the CUDA kernel processes a sequence pair at a time [14]. This makes each process hold a (nearly) equal number of bases in a (likely) different number of sequences, and thus a (nearly) equal number of substrings for a specific motif width. In this case, each MPI process holds a non-overlapping sequence subset of S, where the process converts each substring of this subset into a potential motif model and calculates the highest-scoring substrings of all sequences in S for this substring by invoking the CUDA kernels. After each process obtains the best starting points in its subset, a reduction operation is conducted across all processes to compute the new best starting points. For the EM stage, as we are aware of the number of initial models, we simply divide the set of initial models equally among all processes, as in [7]. Unfortunately, because the execution time of the EM algorithm from each initial model is not equal and depends on the real data, we might not be able to balance the workloads perfectly. However, considering that the EM stage takes only a very small portion of the total execution time, this imperfect workload balancing does not impact the overall performance significantly.
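A straightforward way to realize the base-count-balanced distribution described above is sketched here (illustrative C code, not the mCUDA-MEME implementation):

    /* Assign whole sequences to MPI processes so that every process receives a
     * (nearly) equal number of bases.  lens[i] is the length of sequence i;
     * owner[i] receives the rank of the process that sequence i is assigned to. */
    void assignSequences(const int *lens, int n, int nprocs, int *owner)
    {
        long long total = 0;
        for (int i = 0; i < n; ++i)
            total += lens[i];
        long long target = (total + nprocs - 1) / nprocs;   /* bases per process, rounded up */
        long long acc = 0;
        int p = 0;
        for (int i = 0; i < n; ++i) {
            owner[i] = p;                                   /* sequence i goes to process p */
            acc += lens[i];
            if (acc >= target && p < nprocs - 1) {
                ++p;
                acc = 0;
            }
        }
    }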

After obtaining the best motif model for each MPI process from its subset of initial models, all processes conduct a reduction operation to get the final motif model. Since all processes only need to communicate to collect the final results in the SPS and EM stages, the communication overhead can be neglected.

When executing mCUDA-MEME with only one MPI process, the sequential EM stage limits the overall speedup, as per Amdahl's law, after accelerating the SPS stage. As multi-core CPUs have become commonplace, we employ a multi-threaded design using OpenMP to parallelize the EM stage and exploit the compute power of multi-core CPUs. Since the number of threads used can be specified using OpenMP runtime routines, the multi-threaded design is also well exploited when running multiple MPI processes. As mentioned above, each MPI process consists of a CPU thread and a GPU thread. We generally recommend that each MPI process runs the two threads on two CPU cores to maximize performance. In this case, after completing the SPS stage, the CPU core for the GPU thread of each MPI process would be wasted if the successive EM stage did not use it. Hence, it is viable to employ two threads to parallelize the EM algorithm within each MPI process. By default, mCUDA-MEME uses as many threads as the available CPU cores in a single machine when running with only one MPI process, and two threads per process when using multiple MPI processes. Note that the number of threads used can also be specified by parameters.
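A sketch of how the EM runs over a process's share of starting points can be threaded with OpenMP (the types and the run_em routine below are hypothetical placeholders, not the MEME implementation):

    #include <omp.h>

    typedef struct { double seed;           } StartPoint;  /* placeholder type */
    typedef struct { double log_likelihood; } Model;       /* placeholder type */

    /* Placeholder for the real EM routine; here it merely scores the seed. */
    static Model run_em(const StartPoint *p) { Model m = { p->seed }; return m; }

    /* Each OpenMP thread runs EM from a different starting point; dynamic scheduling
     * helps because the run time differs between starting points.  The caller must
     * initialize *best with a very small log likelihood. */
    void em_stage(const StartPoint *points, int npoints, int nthreads, Model *best)
    {
        omp_set_num_threads(nthreads);       /* e.g. 2 per MPI process, as discussed above */
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < npoints; ++i) {
            Model m = run_em(&points[i]);
            #pragma omp critical
            {
                if (m.log_likelihood > best->log_likelihood)
                    *best = m;
            }
        }
    }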
As mentioned above, the accuracy of MEME 4.4.0 has been improved by incorporating PSP, as described in [15], which defines prior probabilities that a site starts at each position of each input sequence. The PSP approach enables the incorporation of multiple types of additional information into motif discovery and converts the additional information into a measure of the likelihood that a motif starts at each position of each input sequence. When incorporating PSP, for the SPS stage the highest-scoring substring of each sequence Sk is determined using scores calculated from P(Si,j, Sk,l) (or P(Si,j, S̅k,l)) and the PSP prior probabilities, and for the EM algorithm the uniform assumption is substituted with the PSP prior probabilities. Details about the PSP approach in MEME can be obtained from [15]. Since mCUDA-MEME is implemented based on MEME 4.4.0, the PSP approach is also incorporated in our algorithm.

IV. PERFORMANCE EVALUATION

We use the peaks identified by MICSA [6] in the ChIP-Seq data for the neuron-restrictive silencer factor (the peaks are available at http://bioinfo-out.curie.fr/projects/micsa) to evaluate the performance of mCUDA-MEME. To evaluate the scalability of our algorithm with respect to dataset size, three subsets of peaks with different numbers of sequences and base pairs (bps) are selected from the whole set of peaks (as shown in Table 1).

TABLE I. PEAK DATASETS FOR PERFORMANCE EVALUATION

Datasets     No. of Sequences    Min (bps)    Max (bps)    Total (bps)
NRSF500             500             307          529          226070
NRSF1000           1000             307          589          506378
NRSF2000           2000             307          747         1169957

All the following tests are conducted on a computing cluster with eight compute nodes connected by a high-speed InfiniBand switch. Each compute node consists of an AMD Opteron 2378 quad-core 2.4 GHz processor and 8 GB RAM, and uses the MPICH2 MPI library. Furthermore, two Fermi-based Tesla S2050 quad-GPU computing systems are installed and connected to four nodes of the cluster, with each node having access to two GPUs. A single GPU of a Tesla S2050 consists of 14 SMs (a total of 448 SPs) with a core frequency of 1.15 GHz and has 2.6 GB of user-available device memory. The common parameters used for all tests are "-mod zoops -revcomp", and the other parameters use the default values.

We have evaluated the performance of mCUDA-MEME running on a single GPU and on eight GPUs, respectively, compared to parallel MEME (version 4.4.0) running on the eight compute nodes with a total of 32 CPU cores (as shown in Table 2). From Table 2, the mCUDA-MEME speedups, either on a single GPU or on eight GPUs, increase as the dataset size grows, suggesting good scalability for dealing with large-scale datasets. On eight GPUs, mCUDA-MEME significantly outperforms parallel MEME on 32 CPU cores for all datasets, where it achieves an average speedup of 8.8 (6.5) with a highest of 10.8 (8.3) for the SPS stage alone (and for the overall execution). Even on a single GPU, our algorithm still holds its own compared to parallel MEME on 32 CPU cores, with a highest speedup of 1.4 (1.1) for the SPS stage (and for the overall execution) on the largest dataset, even though it only gives a speedup of 0.9 (0.7) for the NRSF500 dataset.

In addition to the scalability with respect to dataset size, we further evaluated the scalability of our algorithm with respect to the number of GPUs. Figures 3 and 4 show the speedups of mCUDA-MEME on different numbers of GPUs compared to parallel MEME on 32 CPU cores for all datasets. From the figures, we can see that for each dataset the speedup increases (nearly) linearly as the number of GPUs increases, both for the SPS stage and for the overall execution.

The above evaluation and discussion demonstrate the power of mCUDA-MEME to accelerate motif discovery for large-scale datasets with relatively inexpensive GPU clusters. The genome-wide identification of TFBSs has been heavily constrained by the long execution time of motif discovery algorithms. Thus, we believe biologists will greatly benefit from the speedup of our algorithm.

TABLE II. EXECUTION TIME (IN SECONDS) AND SPEEDUP COMPARISON BETWEEN MCUDA-MEME AND PARALLEL MEME

             mCUDA-MEME         mCUDA-MEME        parallel MEME        Speedup          Speedup
              (8 GPUs)            (1 GPU)         (32 CPU cores)       (8 GPUs)         (1 GPU)
Datasets     SPS      All       SPS      All       SPS      All       SPS    All       SPS    All
NRSF500       72      115       532      698       486      523       6.8    4.5       0.9    0.7
NRSF1000     279      392      2129     2699      2489     2615       8.9    6.7       1.2    1.0
NRSF2000    1244     1663      9554    12150     13428    13830      10.8    8.3       1.4    1.1

Figure 3. The speedups for the SPS stage of mCUDA-MEME on different numbers of GPUs.

Figure 4. The speedups for the overall execution of mCUDA-MEME on different numbers of GPUs.

V. CONCLUSION

In this paper, we have presented mCUDA-MEME, an ultrafast scalable many-core motif discovery algorithm for multiple GPUs. The use of the hybrid CUDA, OpenMP and MPI programming models enables our algorithm to harness the powerful compute capability of GPU clusters. mCUDA-MEME achieves significant speedups for the starting point search stage (and the overall execution) when benchmarked, using real datasets, against parallel MEME running on 32 CPU cores. Speedups of up to 1.4 (1.1) on a single GPU of a Fermi-based Tesla S2050 quad-GPU computing system and up to 10.8 (8.3) on the eight GPUs of a two Tesla S2050 system were observed. Furthermore, our algorithm shows good scalability with respect to dataset size and the number of GPUs. Since the long execution time of motif discovery algorithms has heavily constrained genome-wide TFBS identification, we believe that biologists will benefit from the high-speed motif discovery of our algorithm on relatively inexpensive GPU clusters. As mentioned above, the load balancing policy of mCUDA-MEME is targeted towards homogeneous GPU clusters, distributing the workload symmetrically between GPUs. However, our current load balancing policy might not work well for heterogeneous GPU clusters. Hence, designing a robust and appropriate load balancing policy for heterogeneous GPU clusters is part of our future work.

The source code of mCUDA-MEME and the benchmark datasets are available for download at https://sites.google.com/site/yongchaosoftware/mcuda-meme.

REFERENCES

[1] C. E. Lawrence and A. A. Reilly, "An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences," Proteins, vol. 7, no. 1, 1990, pp. 41-51.
[2] C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, and J. C. Wootton, "Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment," Science, vol. 262, 1993, pp. 208-214.
[3] P. Sumazin, G. Chen, N. Hata, A. D. Smith, T. Zhang, and M. Q. Zhang, "DWE: discriminating word enumerator," Bioinformatics, vol. 21, no. 1, 2005, pp. 31-38.
[4] C. Sabatti, L. Rohlin, K. Lange, and J. C. Liao, "Vocabulon: a dictionary model approach for reconstruction and localization of transcription factor binding sites," Bioinformatics, vol. 21, no. 7, 2005, pp. 922-931.
[5] T. L. Bailey and C. Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers," Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, 1994, pp. 28-36.
[6] V. Boeva, D. Surdez, N. Guillon, F. Tirode, A. P. Fejes, O. Delattre, and E. Barillot, "De novo motif identification improves the accuracy of predicting transcription factor binding sites in ChIP-Seq data analysis," Nucleic Acids Research, vol. 38, no. 11, 2010, p. e126.
[7] W. N. Grundy, T. L. Bailey, and C. P. Elkan, "ParaMEME: a parallel implementation and a web interface for a DNA and protein motif discovery tool," Comput. Appl. Biosci., vol. 12, no. 4, 1996, pp. 303-310.
[8] G. K. Sandve, M. Nedland, Ø. B. Syrstad, L. A. Eidsheim, O. Abul, and F. Drabløs, "Accelerating motif discovery: motif matching on parallel hardware," LNCS, vol. 4175, 2006, pp. 197-206.
[9] NVIDIA CUDA C programming guide version 3.2, http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/CUDA_C_Programming_Guide.pdf.
[10] Y. Liu, D. L. Maskell, and B. Schmidt, "CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units," BMC Research Notes, vol. 2, no. 73, 2009.
[11] Y. Liu, B. Schmidt, and D. L. Maskell, "CUDASW++2.0: enhanced Smith-Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions," BMC Research Notes, vol. 3, no. 93, 2010.
[12] Y. Liu, B. Schmidt, and D. L. Maskell, "MSA-CUDA: Multiple Sequence Alignment on Graphics Processing Units with CUDA," 20th IEEE International Conference on Application-specific Systems, Architectures and Processors, 2009, pp. 121-128.
[13] H. Shi, B. Schmidt, W. Liu, and W. Müller-Wittig, "A parallel algorithm for error correction in high-throughput short-read data on CUDA-enabled graphics hardware," J. Comput. Biol., vol. 17, no. 4, 2010, pp. 603-615.
[14] Y. Liu, B. Schmidt, W. Liu, and D. L. Maskell, "CUDA-MEME: accelerating motif discovery in biological sequences using CUDA-enabled graphics processing units," Pattern Recognition Letters, vol. 31, no. 14, 2010, pp. 2170-2177.
[15] T. L. Bailey, M. Bodén, T. Whitington, and P. Machanick, "The value of position-specific priors in motif discovery using MEME," BMC Bioinformatics, vol. 11, no. 179, 2010.

[16] J. Nickolls, I. Buck, M. Garland, and K. Skadron, "Scalable parallel programming with CUDA," ACM Queue, vol. 6, no. 2, 2008, pp. 40-53.
[17] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, "NVIDIA Tesla: a unified graphics and computing architecture," IEEE Micro, vol. 28, no. 2, 2008, pp. 39-55.
[18] Fermi: NVIDIA's next generation CUDA compute architecture, http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf.
[19] OpenMP tutorial, https://computing.llnl.gov/tutorials/openMP.
[20] MPI tutorial, https://computing.llnl.gov/tutorials/mpi.
