
Efficient GPU implementation of applications

Nuno Miguel Trindade Marcos

Thesis to obtain the Master of Science Degree in Information Systems and Engineering

Supervisors: Prof. Nuno Filipe Valentim Roma and Prof. Pedro Filipe Zeferino Tomás

Examination Committee:
Chairperson: Prof. José Carlos Martins Delgado
Supervisor: Prof. Nuno Filipe Valentim Roma
Member of the Committee: Prof. David Manuel Martins de Matos

November 2014

Acknowledgments

First of all, I would like to thank Professors Nuno Roma and Pedro Tomás, my supervisors in this work, for their guidance and collaboration. Without their constant patience and tolerance it would not have been possible to finish this work. To them, a big thank you. Next, I would like to thank the colleagues who accompanied me throughout the course, especially David Gaspar for the encouragement and for all the moments we shared, Jhonny Aldeia for all the help during the course, and Dionisio Sousa, Rui Mestre and Artur Ferreira, who helped me and gave me the motivation to keep moving forward. In addition, a special thanks to my friend Pedro Monteiro for the help with his work and for the constant encouragement. I would also like to thank my friends Daniela Coelho, Miguel Matos, David Dias, João Velez and Pedro Chagas for all the support during this work, and my godson Tiago Carreira for repeatedly insisting that I finish this work and for all the help during it. I would also like to thank my colleagues at Premium Minds, who were always available to help me and to take over my tasks in my absence, especially Márcio Nóbrega, André Soares, Renil Lacmane and Afonso Vilela. Finally, and most importantly, I would like to thank my parents and my brother for all the strength and motivation that allowed me to reach the end of this work, and especially my girlfriend Ana Daniela for the motivation in the final stretch of this work.


Abstract

Biological sequence data is becoming more accessible to researchers around the world. In particular, rich databases of protein and DNA sequence data are already made available to biologists, and their size is increasing every day. However, all this obtained information needs to be processed and classified. Several bioinformatics algorithms, such as the Needleman-Wunsch and the Smith-Waterman algorithms, have been proposed for this purpose. Both consist of the execution of dynamic programming schemes, which allow the usage of parallelism to achieve better execution performance. Under this context, this thesis proposes the integration of two previously presented parallel implementations: an adaptation of the SWIPE implementation, for multi-core CPUs, that exploits SIMD vectorial instructions, and an implementation of the Smith-Waterman algorithm for GPU platforms (CUDASW++ 2.0). Accordingly, the presented work offers a unified solution that tries to take advantage of all computational resources that are made available in heterogeneous platforms, composed of CPUs and GPUs, by integrating a dynamic load balancing layer. The obtained results show that the attained speedup can reach values as high as 6x when executing on a quad-core CPU and two distinct GPUs.

Keywords

Bioinformatics Algorithms; Sequence Alignment; Smith-Waterman Algorithm; Heterogeneous Parallel Architectures; Load Balancing; CUDA.

Resumo

Nowadays, the amount of genetic information available to researchers keeps increasing. Databases with genetic information are available on the Internet and grow with each passing day. In order to be used by Biology, all this information needs to be processed and classified. To classify it, several bioinformatics algorithms exist, such as the Needleman-Wunsch algorithm and the Smith-Waterman algorithm. Both consist of the execution of multiple iterations, which allows their parallelization in order to obtain better execution performance. Two of the existing parallel implementations are an adaptation of Rognes' SWIPE implementation, presented by Pedro Monteiro, based on CPU thread-level parallelization, and CUDASW++ 2.0, presented by Liu et al., based on thread- and data-level parallelization on GPUs. Considering both solutions, this work proposes a heterogeneous orchestration that, by using both, is able to process sequences on the CPU cores and on the GPUs available in the machine. In addition to this implementation, an additional layer responsible for balancing the data among the different workers is proposed. The results show that the execution can reach a speedup above 6x when executed with four CPU cores and two distinct GPUs.

Palavras Chave

Bioinformatics Algorithms; Sequence Alignment; Smith-Waterman Algorithm; Heterogeneous Parallel Architectures; Load Balancing Module; CUDA.


Contents

1 Introduction 1
  1.1 Motivation ...... 2
  1.2 Objectives ...... 3
  1.3 Document Outline ...... 3

2 Parallel Architectures 5
  2.1 Flynn's Taxonomy ...... 6
  2.2 CPU - Central Processing Unit ...... 7
  2.3 GPU - Graphics Processing Unit ...... 7
  2.4 CPU vs GPU ...... 12
  2.5 Hybrid Solution: Accelerated Processing Unit ...... 14
  2.6 CUDA - Compute Unified Device Architecture ...... 14
    2.6.1 Definition and Architecture ...... 14
    2.6.2 Programming Model ...... 15
    2.6.3 Execution Model ...... 16
    2.6.4 Limitations ...... 17
  2.7 Open Computing Language ...... 17

3 Sequence Alignment in Bioinformatics 19
  3.1 Alignment Scoring Model ...... 20
  3.2 Optimal Alignment Algorithms ...... 21
    3.2.1 Needleman-Wunsch Algorithm ...... 22
    3.2.2 Smith-Waterman Algorithm ...... 22
  3.3 Heuristic Sub-Optimal Algorithms ...... 24
    3.3.1 FASTA ...... 24
    3.3.2 BLAST - Basic Local Alignment Search Tool ...... 25
  3.4 Parallel Implementations ...... 26
    3.4.1 CPU Implementations ...... 27
      3.4.1.A Wozniak ...... 27
      3.4.1.B Farrar ...... 27
      3.4.1.C SWIPE (Rognes) ...... 28
      3.4.1.D Pedro Monteiro's Implementation ...... 29
    3.4.2 GPU Implementations ...... 31
      3.4.2.A Manavski's Implementation ...... 31
      3.4.2.B CUDASW++ ...... 32
    3.4.3 Discussion on the Presented Implementations ...... 35

4 Heterogeneous Parallel Alignment MultiSW 38
  4.1 Introduction ...... 39
  4.2 Architecture ...... 40
    4.2.1 CPU Worker ...... 41
      4.2.1.A CPU Wrapper ...... 42
    4.2.2 GPU Worker ...... 42
      4.2.2.A Asynchronous Transfers ...... 43
      4.2.2.B CUDA Streams in Kernel Execution ...... 43
      4.2.2.C Loading Sequences with Execution ...... 43
  4.3 Application Execution Flow ...... 44
  4.4 Implementation Details and Optimizations ...... 44
    4.4.1 File Format ...... 44
    4.4.2 Database Sequences Pre-Loading ...... 45
    4.4.3 Data Structures ...... 45
  4.5 Dynamic Load-balancing Layer ...... 47
  4.6 Conclusion ...... 49

5 Experimental Results 50
  5.1 Experimental Setup ...... 51
    5.1.1 Experimental Dataset ...... 51
  5.2 Evaluating Metrics ...... 52
  5.3 Results ...... 52
    5.3.1 Scenario A - Single CPU core ...... 53
    5.3.2 Scenario B - Four CPU cores ...... 53
    5.3.3 Scenario C - Single GPU - GeForce GTX 780 Ti ...... 54
    5.3.4 Scenario D - Single GPU - GeForce GTX 660 Ti ...... 55
    5.3.5 Scenario E - Four CPU cores + Single GPU Execution ...... 55
    5.3.6 Scenario F - Four CPU cores + Double GPUs Execution ...... 57
  5.4 Summary ...... 58

6 Conclusions and Future Work 59
  6.1 Conclusions ...... 60
  6.2 Future Work ...... 60

Bibliography 61

List of Figures

2.1 NVIDIA GK110 Kepler Architecture ...... 9
2.2 Coalesced memory access ...... 10
2.3 GPU Memory organization ...... 11
2.4 CPU and GPU architectures [1] ...... 12
2.5 GPU vs CPU GFLOPS comparison [1] ...... 13
2.6 Fermi Architecture ...... 15
2.7 CUDA Kernel definition and invocation example [1] ...... 16
2.8 Execution Flows Representation [1] ...... 17

3.1 Pairwise Alignment Example ...... 20
3.2 Needleman-Wunsch alignment matrix example ...... 22
3.3 Smith-Waterman alignment matrix example ...... 23
3.4 FASTA algorithm step 1 ...... 24
3.5 FASTA algorithm step 2 ...... 25
3.6 FASTA algorithm step 3 ...... 25
3.7 FASTA algorithm step 4 ...... 25
3.8 Multi-sequence vectors ...... 29
3.9 Rognes' algorithm core instructions ...... 29
3.10 Sequences Database in several chunks [2] ...... 30
3.11 Processing Block - Message [2] ...... 30
3.12 Processing Block FIFOs [2] ...... 31
3.13 Coalesced Subject Sequence Arrangement [3] ...... 33
3.14 Coalesced Global Memory Access [3] ...... 33
3.15 Program workflow of CUDASW++ 3.0 [4] ...... 35

4.1 Heterogeneous Architecture ...... 39
4.2 MultiSW block diagram ...... 40
4.3 Master Worker Model [2] ...... 41
4.4 Master Worker Model [2] ...... 41
4.5 CPU Wrapper Function ...... 42

4.6 Execution Sequence Diagram ...... 44
4.7 Workers execution not balanced ...... 47
4.8 Workers execution balanced ...... 48

5.1 Processing times considering a single CPU core execution and a processing block with 30,000 sequences ...... 53
5.2 Processing times for 4 CPU cores, considering a block size of 30,000 sequences ...... 54
5.3 Processing times for a single GPU in Machine A, considering a block size of 65,000 sequences. Total execution time about 6.35 seconds ...... 54
5.4 Processing times for a single GPU in Machine B, considering a block size of 65,000 sequences. Total execution time about 7.38 seconds ...... 55
5.5 Processing times for 4 CPU cores and a GeForce GTX 780 Ti GPU, considering CPU blocks of 30,000 sequences and GPU blocks of 65,000 sequences. Total execution time was 6.112 seconds ...... 56
5.6 Number of sequences processed by CPU cores and GPU ...... 56
5.7 Processing times for 4 CPU cores, GPU A and GPU B, considering an initial block size of 30,000 sequences for the CPU solution and 65,000 for the GPU solution. Total execution time of 4.957 seconds. Near some of the iteration blocks the new considered block size is presented ...... 57
5.8 Number of sequences processed by CPU, GPU A, and GPU B workers ...... 58

List of Tables

2.1 Flynn’s Taxonomy [5]...... 6

5.1 Execution Speedups...... 58


List of Acronyms

ALU Arithmetic Logic Unit

AMD Advanced Micro Devices - North American Technology Company

APU Accelerated Processing Unit

BLOSUM Blocks Substitution Matrix

CPU Central Processing Unit

CUDA Compute Unified Device Architecture

GPC Graphics Processing Clusters

GPGPU General-Purpose Computation on Graphics Hardware

GPU Graphics Processing Unit

NVIDIA North American technology company that invented the GPU in 1999

PAM Point Accepted Mutation

SMX Streaming Multiprocessor

SP Streaming Processor


1 Introduction

Contents
  1.1 Motivation ...... 2
  1.2 Objectives ...... 3
  1.3 Document Outline ...... 3


1.1 Motivation

Nowadays, numerous databases spread all over the world host large amounts of biological data, and they are growing exponentially in size as the genomes of more species are being sequenced. Specifically, rich databases of protein and DNA sequence data are available on the Internet. The outcome of the DNA sequencing work is very ample and can lead to many potential benefits in distinct fields, such as molecular medicine (aiming at improved diagnosis of disease, drug design, etc.), bioarchaeology and evolution (study of evolution and similarity between organisms, among others), DNA forensics (identification of crime or catastrophe victims, establishment of paternity and other family relationships, among others) and others, such as agriculture or bioprocessing (disease- and drought-resistant crops, biopesticides, edible vaccines to incorporate into food products, among others) [6]. There are several online knowledge bases that contain information on millions of genes [7]:

• GenBank DNA database [8];

• National Center for Biotechnology Information (NCBI) [9];

• Universal Protein Resource (UniProt) [10];

• Nucleotide sequence database (EMBL) [11];

• Swiss-Prot [12];

• TrEMBL [13].

With this proliferation of data comes a large computational cost to perform a genetic sequence alignment between new genetic information and the online databases. As a consequence, genetic sequence alignment is considered to be one of the application domains that require further improvements in execution speed, mostly because it involves several computationally intensive tasks, as well as databases whose size will continue to increase. This is leading researchers to look for even faster, high-throughput alignment tools, which can give an efficient response to this intensive growth. One of the best-known bioinformatics algorithms is the Smith-Waterman algorithm [14]. This algorithm is presented in Section 3.2.2 and consists of the alignment of two sequences, using a score matrix approach. Recently, several parallel computer architectures exploiting some of the possible parallel approaches were presented: multiple-core processors; multiple processors installed in a single motherboard; multiple computers connected through a common network - clusters or computer grids [15]. Some of these implementations are presented in Chapter 2 and explore Thread-Level and Data-Level parallelism using Central Processing Unit (CPU) and Graphics Processing Unit (GPU) architectures. Under this context, this thesis proposes the integration of two previously presented parallel implementations: an adaptation of the SWIPE implementation [16], for multi-core CPUs, that exploits SIMD vectorial instructions [2], and an implementation of the Smith-Waterman algorithm for GPU platforms (CUDASW++ 2.0) [17]. Accordingly, the presented work offers a unified solution that tries to take advantage of all computational resources that are made available in heterogeneous platforms, composed of CPUs and GPUs, by integrating a convenient dynamic load balancing layer.


This implementation was extensively evaluated considering several execution scenarios, combining both kinds of workers.

1.2 Objectives

The aim of the present work is, considering two of the existing Smith-Waterman algorithm implementations, to implement a unified solution that tries to take advantage of all computational resources that are made available in heterogeneous platforms, composed of CPUs and GPUs, by integrating a convenient dynamic load balancing layer. With this load balancing layer it is expected that the execution times of the multiple considered workers become equal along the execution timeline, as explained in Section 4.5. This way it is possible to minimize the waiting times between workers and guarantee that all the workers finish their work at the same time. Besides the load balancing layer, several optimizations introduced in the GPU module are presented in Section 4.1.

1.3 Document Outline

This thesis is organized as follows. First, Chapter 2 presents the main characteristics of parallel architectures, describing the considered ones, the CPU and the GPU. Chapter 2 also presents and describes the Compute Unified Device Architecture (CUDA). Next, Chapter 3 briefly presents the sequence alignment algorithms and some of the considered applications. Chapter 4 presents the developed work, MultiSW. Finally, the results of this implementation and the corresponding discussion are presented in Chapter 5.


2 Parallel Architectures

Contents
  2.1 Flynn's Taxonomy ...... 6
  2.2 CPU - Central Processing Unit ...... 7
  2.3 GPU - Graphics Processing Unit ...... 7
  2.4 CPU vs GPU ...... 12
  2.5 Hybrid Solution: Accelerated Processing Unit ...... 14
  2.6 CUDA - Compute Unified Device Architecture ...... 14
    2.6.1 Definition and Architecture ...... 14
    2.6.2 Programming Model ...... 15
    2.6.3 Execution Model ...... 16
    2.6.4 Limitations ...... 17
  2.7 Open Computing Language ...... 17


According to Almasi et al. [18], a parallel computer architecture can be defined as "that collection of processing elements that communicate and cooperate to solve large problems fast". Taking this definition into consideration, we realize that there are several types of architectures that use different memory organizations and communication topologies, as well as different processor execution models. In Section 2.1, we review Flynn's taxonomy, which classifies the parallel architectures in four different classes. According to the several possible approaches, four different types of parallelism may be defined: Bit-level parallelism; Instruction-level parallelism; Data-level parallelism; and Task/Thread-level parallelism. In the remaining sections, we describe the main parallel computing architectures that are used nowadays: the Central Processing Unit (CPU), the Graphics Processing Unit (GPU), and architectures that combine both, the Accelerated Processing Unit (APU). Finally, we present the parallel-programming model used by NVIDIA GPUs, the Compute Unified Device Architecture (CUDA).

2.1 Flynn’s Taxonomy

In 1966, Michael J. Flynn proposed a simple model that is still used to categorize computers, taking into account the parallelism in instruction execution and memory data accesses. Flynn looked at the parallelism in the instruction and data streams1 called for by the instructions at the most constrained component of the multiprocessor, and placed all existing computers in four distinct categories [19], as defined below and presented in Table 2.1.

Table 2.1: Flynn’s Taxonomy [5].

                 Single Instruction   Multiple Instruction
Single Data          SISD                  MISD
Multiple Data        SIMD                  MIMD

1. Single instruction stream, single data stream (SISD): This category corresponds to the uniprocessor model. One example is the conventional sequential computer based on the von Neumann architecture, i.e., a uniprocessor computer which can only perform one single instruction at a time.

2. Single instruction stream, multiple data streams (SIMD): The same instruction is executed by multiple processors using different data streams. SIMD computers exploit data-level parallelism by applying the same operations to multiple items of data in parallel. Many current CPUs use this kind of architecture by supporting instruction set extensions. Examples are MMX, established by Intel [20], and the SSEx family of Streaming SIMD Extensions, representing an evolution of the MMX architecture. The Advanced Vector Extensions (AVX) extension is also a kind of SIMD extension proposed by Intel. This category is also followed by the programming model used in Graphics Processing Units (CUDA and OpenCL), which we describe in Section 2.6. A minimal SIMD sketch, using SSE2 intrinsics in C, is shown after this list.

1The concept of stream refers to the sequence of data or instructions as seen by the machine during the execution of a program.


3. Multiple instruction streams, single data stream (MISD): This category indicates the use of multiple independently executing functional units operating on a single stream of data, forwarding the results from one functional unit to the next [5].

4. Multiple instruction streams, multiple data streams (MIMD): Each processor fetches its own instructions and operates on its own dataset. This model exploits thread-level parallelism, since multiple threads operate in parallel. Examples of this architecture are the current processors with multi-threading support. Other examples are distributed systems and computer clusters.
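As a small illustration of the SIMD category on a current CPU, the following C sketch (a minimal example assuming an x86 processor with SSE2 support; the function and array names are not taken from this work) adds two arrays of 16-bit integers eight elements at a time, so a single vector instruction performs eight additions that a SISD processor would execute one by one.

#include <emmintrin.h> /* SSE2 intrinsics */

/* Adds two arrays of 16-bit integers, eight elements per iteration.
   Assumes n is a multiple of 8 and that the pointers are 16-byte aligned. */
void add_i16_simd(const short *a, const short *b, short *c, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m128i va = _mm_load_si128((const __m128i *)(a + i)); /* load 8 values */
        __m128i vb = _mm_load_si128((const __m128i *)(b + i));
        __m128i vc = _mm_add_epi16(va, vb);                     /* 8 additions in one instruction */
        _mm_store_si128((__m128i *)(c + i), vc);                /* store 8 results */
    }
}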

2.2 CPU - Central Processing Unit

The central processing unit (CPU) is the unit responsible for interpreting and executing the program instructions. One of the first commercial CPUs was the Intel 4004, presented by Intel in 1971.

A CPU is usually composed of the following components [21]:

• Arithmetic Logic Unit (ALU) - Responsible for the execution of logical and arithmetic operations;

• Control Unit – Decodes instructions, gets operands and controls the execution point;

• Registers – Memory cells of the CPU that store the data needed by the CPU to execute the instructions;

• CPU interconnection - communication channels among the control unit, ALU, and registers.

Nowadays, in order to reduce power consumption and to process multiple tasks simultaneously and more efficiently, commercial CPUs are built with multi-core technology, having between 4 and 16 execution cores. This way, a multi-core CPU can process 4 or more instructions at a time, following a MIMD parallelization approach. Some solutions that take advantage of parallel processing on Intel CPUs are presented in Section 3.4.

2.3 GPU - Graphics Processing Unit

A Graphics Processing Unit (GPU) is the processing unit that is present in every graphics card. This unit is designed specifically for performing the complex mathematical and geometric calculations that are necessary for graphics rendering. Although GPUs were originally developed to process and display computer graphics, they have also been used for processing general purpose operations, leading to the General-Purpose Computation on Graphics Hardware (GPGPU) paradigm. There are several frameworks to adapt GPU programming to this paradigm. The best-known ones are OpenCL and NVIDIA's CUDA, presented in Section 2.6. Early approaches to computing on GPUs cast computations into a graphics framework, allocating buffers (arrays) and writing shaders (kernel functions).

GPUs provide massive parallel execution resources and high memory bandwidth. Among the most popular GPU-accelerated applications, we can mention several research and industry fields, specifically:

• Higher Education and Supercomputing (numerical analytics, physics and weather and climate forecasting, for example);

• Defense and Intelligence applications (such as geospatial intelligence);

• Computational Finance (financial analysis, etc.);

• Media and Entertainment (animation, modeling and rendering, color correction and grain management, editing, review and stereo tools, encoding and digital distribution, etc.).

In Figure 2.1, we present the architecture of NVIDIA's GK110 GPUs. The GeForce GTX 780 Ti GPU used in our work is built on this architecture, which is part of the Kepler GPU family. As shown in this figure, the GK110 NVIDIA GPU has several Graphics Processing Clusters (GPC)2, organized in a scalable array. Each GPC contains several Streaming Multiprocessors (SMX), which perform the executions and run the CUDA kernels, presented below in this document. The design of the SMX has been evolving rapidly since the introduction of the first CUDA-capable hardware in 2006, with four major revisions, codenamed Tesla, Fermi, Kepler and Maxwell [22]. Kepler's new Streaming Multiprocessor, called SMX, has significantly more CUDA cores than the SM of Fermi GPUs. Each SMX contains thousands of registers that can be partitioned among the threads under execution, several caches, warp schedulers (presented below in this document) that can quickly switch contexts between threads and issue instructions to warps that are ready to execute, and execution cores for integer and floating-point operations [1]. A GPU is connected to a host through a high-speed IO slot (a PCI-Express bus in current systems). The considered GPU model contains four GPCs, fifteen SMXs and six 64-bit memory controllers [23]. In Figure 2.3, we present the architecture of a GPU based on CUDA, composed of a set of streaming multiprocessors sharing a global memory. In addition to the shared memory, each SMX is composed of [22]:

• thousands of registers that can be partitioned among threads of execution;

• several kinds of memory caches (explained below);

• warp schedulers that can quickly switch contexts between threads and issue instructions to warps that are ready to execute and

• Execution cores for integer and floating-point operations.

The memory system for current NVIDIA GPUs is more complex, as we now explain.

2also known as Streaming Processors (SPs)


Figure 2.1: NVIDIA GK110 Kepler Architecture


Memory

NVIDIA GPUs include a complex memory system. The array of threads in a block (see Section 2.6) was designed to have 1, 2 or 3 dimensions, leading to a memory access pattern known as coalesced. A coalesced memory access can be explained as follows: considering that all threads in a warp (explained below in this chapter) execute the same instruction, when all threads in a warp execute a load instruction, the hardware detects whether the threads access consecutive memory locations. The most favorable global memory access is achieved when the same instruction for all threads in a warp accesses consecutive global memory locations. In this case, the hardware coalesces all memory accesses into a consolidated access to consecutive DRAM locations. If thread 0 accesses location n, thread 1 accesses location n + 1, ... and thread 31 accesses location n + 31, then all these accesses are coalesced, that is, combined into one single access.

Figure 2.2: Coalesced memory access
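To illustrate this pattern, the following CUDA C sketch (an illustrative example, not code from this thesis) shows a kernel whose warp threads read consecutive global memory positions, which the hardware can coalesce into a single transaction, and a strided variant in which neighboring threads touch distant addresses and coalescing is lost.

__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];                 /* thread k of a warp reads position base+k: coalesced */
}

__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i * stride) % n];  /* neighboring threads read far-apart positions: not coalesced */
}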

A coalesced memory access occurs when the address locality and alignment meet certain criteria, taking advantage of the distributed bus of the main memory, as presented in Figure 2.2. Specifically, the types of memory present in a GK110 architecture NVIDIA GPU are [24]:

• Global memory: the largest one (typically greater than 1 GB), but with high latency, low bandwidth (when compared with the other types), and is not cached. The effective bandwidth of global mem- ory depends heavily on the memory access pattern (e.g. coalesced access generally improves bandwidth).

• Local memory: readable and writable per-thread memory with very limited size (16 kB per thread) and is not cached. Access to this memory is as expensive as access to global memory.

• Constant memory: read-only memory with limited size (typically 64 kB) and cached. The reading cost scales with the number of different addresses read by all threads. Reading from constant memory can be as fast as reading from a register.

• Texture memory: read-only memory that is mapped and allocated in global memory. This memory can be used like a cache.

• Shared memory: fast on-chip memory of limited size (16 kB per block), readable and writable on a per-block basis. This memory can only be accessed by all threads in a thread block and is


divided into equally-sized banks that can be accessed simultaneously by each thread. Accessing this memory is as fast as accessing a register as long as there are no bank conflicts.

• Registers: readable and writable per-thread registers. These are the fastest memory to access, but the amount of registers is limited.

Figure 2.3: GPU Memory organization

The transfer of data between the host and the GPU is done using Direct Memory Access (DMA), and can operate concurrently with both the host and the GPU computation units.

As said before, there are some programming models used in the GPGPU context, such as the CUDA and OpenCL models. Section 2.6 presents the CUDA programming model.


2.4 CPU vs GPU

With the observed increase of the computational demands imposed by the gaming market, the manufacturers of GPUs had to propose powerful processing units, in order to allow gamers to run their increasingly graphically demanding games. A direct consequence is the fact that the GPU now represents some of the most powerful and cost-effective computer hardware available. Consequently, GPUs are no longer exclusively applied with the purpose of displaying computer graphics. An increasing interest of researchers and developers in the potential of GPUs for applications with large amounts of computation has arisen over the last few years.

Today, CPUs in consumer devices have several cores in a chip, and each of them has some ALUs that perform the arithmetic and logical operations (see Figure 2.4). In comparison, GPUs have hundreds or even thousands of cores, each one with four units: one floating point unit, a logic unit, a move or compare unit and a branch unit. An advantage of GPUs is the ability to perform multiple simultaneous operations, up to the order of $10^3$, since there are hundreds of execution cores in a single GPU.

Figure 2.4: CPU and GPU architectures [1].

According to Owens et al. [25], one of the major architectural differences between CPUs and GPUs is the fact that CPUs are optimized to achieve high performance in sequential code, with some of the processing stages dedicated to extracting instruction-level parallelism with techniques such as branch prediction and out-of-order execution. On the other hand, GPUs, with their entirely parallel computing nature, allow the processing stages to be more focused on computing. This allows achieving a higher level of arithmetic intensity with around the same number of transistors as CPUs. Regarding execution performance, one of the metrics that has been used is floating-point operations per second (FLOPS). As shown in Figure 2.5, during the last years GPUs have surpassed CPUs in this measure of theoretical peak performance.

In order to compare a sequential and a parallel software implementation, the fundamental metric is the speedup. The expression of the speed up is:

$$\text{Speedup} = \frac{t_{sequential}}{t_{parallel}} \qquad (2.1)$$


Figure 2.5: GPU vs CPU GFLOPS comparison [1]

It gives a ratio that indicates how much faster a parallelized system is when compared to a sequential system. When comparing with CPUs, some advantages and disadvantages can be identified with respect to GPUs [25, 26]:

Advantages:

• Faster and Cheaper;

• Fully programmable processing units that support vectorized floating-point operations[27];

• Very flexible and powerful, with the introduction of new capabilities in modern GPUs, such as high-level language support for the programmability of the vertex and pixel pipelines. Other features are the implementation of vertex texture access, the full branching support in the vertex pipeline, and the limited branching capability in the fragment pipeline.

Disadvantages:

• Memory transfers between host and device can slow the whole application;

• Complex memory management, since there are several limitations regarding memory size (which is limited) and a hierarchical memory organization (see Section 2.3 for the CUDA memory model);

• Only applications with a high degree of parallelism can benefit from the full GPU execution power.


2.5 Hybrid Solution: Accelerated Processing Unit

Nowadays, new hybrid solutions are appearing in the market, such as the Accelerated Processing Units (APU) by Advanced Micro Devices (AMD). This new hardware is based on a single processor chip that combines CPU and GPU elements into a unified architecture.

Examples of these APUs are the AMD Fusion [28], the Kaveri, the Athlon and the Sempron series. In this architecture, the CPU cores and the programmable GPU cores share a common path to the system memory. The key aspect to highlight is that the x86 CPU cores and the vector engines are attached to the system memory through the same high-speed bus and memory controller. This feature allows the AMD Fusion architecture to alleviate the fundamental PCIe constraint that traditionally has limited performance on a discrete GPU. The Fusion architecture obviates the necessity of PCIe accesses to and from the GPU, improving application performance [29]. However, the graphics cores that have been placed on current APUs are not meant to be competitive with high-end or even mid-range discrete graphics cards [30]. Recently, in November 2013, Sony introduced an AMD 1.6 GHz APU in the Playstation 4 console. This was the fastest APU produced by AMD when the console was presented. Despite the Playstation APU being Sony property, AMD took some of its improvements and included them in their consumer APUs, improving the processing power of the available APUs.

2.6 CUDA - Compute Unified Device Architecture

2.6.1 Definition and Architecture

The Compute Unified Device Architecture (CUDA) is a parallel-programming model and software environment, designed by NVIDIA in order to deliver all the performance of NVIDIA's GPU technology to general purpose GPU computing. It was first introduced in March 2007 and, since then, more than 100 million CUDA-enabled GPUs have been sold.

This programming model implements a MIMD parallel processing paradigm, since it divides the execution flow between groups, with the result that every group is independent from the others. Inside each group, an adapted SIMD parallelism is adopted, named single instruction, multiple-thread (SIMT), where many threads execute each function. In CUDA, the GPU is denoted as the "device" and the CPU is referred to as the "host". "Kernel" refers to the function that runs on the device. Using this nomenclature, the host invokes kernel executions on the device.

Current NVIDIA graphics cards are composed of streaming multi-processors. In these, the kernel function runs in parallel. This execution is done according to a special execution flow (explained in Section 2.6.3). Figure 2.6 presents the Fermi architecture of NVIDIA’s graphic cards.


Figure 2.6: Fermi Architecture

On Kepler, each multiprocessor has 192 processing cores, while on Fermi each multiprocessor has a group of 32 SPs. The high-end Kepler has 15 multiprocessors, for a total of 2880 cores (15 × 192), and the Fermi accelerators have 16 multiprocessors, for a total of 512 cores (32 × 16). Another difference is the shared memory size. On Kepler, each SMX has 64 KB of on-chip memory that can be configured as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache, just like the Fermi GPUs. The memory types available in NVIDIA's graphics cards are explained more fully in Section 2.3.

Another important difference is related to the maximum number of active warps (groups of 32 threads that execute the kernel code at a time) that can exist in each multiprocessor. When one warp stalls on a memory operation, the multiprocessor selects another ready warp and switches to that one. This way, the cores can be productive as long as there is enough parallelism to keep them busy [31]. Tesla supports up to 32 active warps on each multiprocessor, and Fermi supports up to 48.

In order to allow its use by a great number of developers, NVIDIA based its language on C/C++ and added some specific keywords in order to expose some special features of CUDA. This new language is called CUDA C and the compiler is NVCC [24].

2.6.2 Programming Model

CUDA is an extension of the C programming language with some reserved keywords. CUDA C extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads. The __global__ keyword declares a function as being a kernel; it is executed on the device and can only be invoked by the host using a specific syntax configuration - <<< ... >>> - as shown in Figure 2.7. Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable. These kernel functions must be highly parallelized, in order to obtain maximum efficiency for the application [1].

The basic entities involved in the execution of this heterogeneous programming model are the host, which is traditionally the CPU, and the devices, which are GPUs in this case. The execution flow for a simple CUDA application can be summarized as follows (a minimal sketch is given after this list):

1. Allocate device memory

2. Copy memory from host to device

3. Invoke Kernel

4. Copy memory from device to host

5. Free device memory
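The five steps above can be sketched as follows in CUDA C (a minimal illustrative example; the vecAdd kernel, the launch configuration and the absence of error checking are assumptions for brevity, not code from this work).

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

void run_vec_add(const float *h_a, const float *h_b, float *h_c, int n)
{
    size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_c;

    cudaMalloc(&d_a, bytes);                              /* 1. allocate device memory          */
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  /* 2. copy memory from host to device */
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);        /* 3. invoke kernel                   */

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);  /* 4. copy memory from device to host */

    cudaFree(d_a);                                        /* 5. free device memory              */
    cudaFree(d_b);
    cudaFree(d_c);
}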

Figure 2.7 illustrates a kernel definition and a kernel invocation. The two values within the triple angle brackets, 1 and N, represent, respectively, the dimension of the execution grid (total number of blocks) and the dimension of each block (the total number of threads per block that will run the kernel). These numbers are specified by the programmer, but have a limit according to the maximum number of blocks and threads supported by the adopted GPU [1].

Figure 2.7: CUDA Kernel definition and invocation example [1]

The next section presents the execution model.

2.6.3 Execution Model

In CUDA, the execution flow is organized by a hierarchy that is represented in Figure 2.8. Threads represent the fundamental flow of parallel execution and are executed by the core processors [32]. A set of threads is called a thread block. Thread blocks are executed on multi-processors and do not migrate between multi-processors. Several concurrent thread blocks can reside on one multi-processor.


Figure 2.8: Execution Flows Representation [1]

This number is delimited by multi-processor resources (shared memory and register file). Finally, a set of thread blocks is called a grid. One kernel is launched as a grid.
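To make the hierarchy concrete, the following illustrative CUDA C fragment (the kernel name, the matrix layout and the 16x16 block size are assumptions) launches a two-dimensional grid of two-dimensional blocks; each thread derives its unique position from the built-in blockIdx, blockDim and threadIdx variables.

__global__ void scale(float *m, int rows, int cols, float factor)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;   /* global position, y direction */
    int col = blockIdx.x * blockDim.x + threadIdx.x;   /* global position, x direction */
    if (row < rows && col < cols)
        m[row * cols + col] *= factor;
}

void launch_scale(float *d_m, int rows, int cols)
{
    dim3 block(16, 16);                                /* a 2D thread block            */
    dim3 grid((cols + block.x - 1) / block.x,
              (rows + block.y - 1) / block.y);         /* a 2D grid of thread blocks   */
    scale<<<grid, block>>>(d_m, rows, cols, 2.0f);     /* one kernel launch = one grid */
}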

2.6.4 Limitations

Despite the versatility offered by this architecture, GPUs have some limitations, particularly in mem- ory management and allocation. Memory transfer time between the host and the device represents an overhead that delays execution time, since data has to be transferred to the device before being pro- cessed. Afterwards, the results of data processing need to be transferred from the device to the host. This overhead can become larger, since limited system bus bandwidth and system bus contention can increase the latency between the host and device components.

2.7 Open Computing Language

Open Computing Language (OpenCL) is an open standard that can be used not only for programming NVIDIA GPUs, but also to program CPUs and GPU devices from different manufacturers, providing a portable language for programming in the GPGPU context. As with the CUDA technology, the OpenCL language denotes as kernel the execution code block that will run on the GPU. The difference between a CUDA kernel and an OpenCL kernel relates to the fact that an OpenCL kernel is compiled at run-time, which increases the running time of this solution. In addition, CUDA has the advantage of being developed by the same company that develops the hardware where it runs, so better performance at execution time is expected. This technology is used by the GPGPU community alongside the CUDA programming model.


3 Sequence Alignment in Bioinformatics

Contents
  3.1 Alignment Scoring Model ...... 20
  3.2 Optimal Alignment Algorithms ...... 21
    3.2.1 Needleman-Wunsch Algorithm ...... 22
    3.2.2 Smith-Waterman Algorithm ...... 22
  3.3 Heuristic Sub-Optimal Algorithms ...... 24
    3.3.1 FASTA ...... 24
    3.3.2 BLAST - Basic Local Alignment Search Tool ...... 25
  3.4 Parallel Implementations ...... 26
    3.4.1 CPU Implementations ...... 27
    3.4.2 GPU Implementations ...... 31
    3.4.3 Discussion on the Presented Implementations ...... 35


Sequence alignment is a fundamental procedure in Bioinformatics, specifically used for molecular sequence analysis, which attempts to identify the maximally homologous subsequences among sets of long sequences [14]. In the context of this thesis, the processing of biological sequences consisting of a single, continuous molecule of nucleic acid or protein was considered [33]. While DNA sequences can be expressed by four symbols (corresponding to the four nucleotides A, C, T and G), the amino acids in proteins can be expressed by 22 symbols: A, B, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y, Z. When comparing sequences, one looks for patterns that diverged from a common ancestor by a process of mutation and selection. According to Dewey et al. [34], the main objectives of sequence alignment are to establish input data for phylogenetic analysis, to determine the evolutionary history of a set of sequences, to discover a common motif3 in a set of sequences, to characterize the set of sequences, and also to build profiles for database sequence searching.

The considered mutational processes involved in the alignments are residue substitutions, residue insertions, and residue deletions. Insertions and deletions are commonly referred to as gaps [35]. The basic idea in the aligning process of two sequences (of possibly different sizes) is to write one on top of the other and break them into smaller pieces by inserting spaces in one or the other, so that identical subsequences are eventually aligned in a one-to-one correspondence. Naturally, spaces are not inserted in both sequences at the same position. Figure 3.1 illustrates an alignment between the sequences A="ACAAGACAGCGT" and B="AGAACAAGGCGT".

Figure 3.1: Pairwise Alignment Example

In order to understand all the steps involved in the algorithms that will be presented in Sections 3.2 and 3.3, we need to go through some of the employed concepts: the scoring model and the concept of gap penalties (Section 3.1). After explaining these concepts, this chapter provides a brief overview of optimal sequence alignment algorithms (Section 3.2) and heuristic sequence alignment algorithms (Section 3.3). Finally, taking into account the parallel architectures presented in Chapter 2, we present some implementations of sequence alignment using parallel architectures, based either on the CPU (Section 3.4.1) or on the GPU (Section 3.4.2).

3.1 Alignment Scoring Model

Many sequence alignment algorithms are based on a scoring model, which classifies the several matching and mismatching patterns according to predefined score values. The simplest approaches consider a positive constant value assigned to a match between both residues.

3Sequence motifs are short, recurring patterns in DNA that are presumed to have a biological function.

Alternatively, instead of using fixed score values when there is a match in the alignment, biologists frequently use scoring schemes that take into account physicochemical properties or evolutionary knowledge of the sequences being aligned. This is common when protein sequences are compared. The best-known schemes are the Point Accepted Mutation (PAM) and Blocks Substitution Matrix (BLOSUM) alphabet-weight scoring schemes, which are usually implemented by a substitution matrix. The BLOSUM matrices were developed by Henikoff & Henikoff, in 1992, to detect more distant relationships. In particular, BLOSUM50 and BLOSUM62 are widely used for pairwise alignment and database searching.

Substitution matrices allow for the possibility of giving a negative score to a mismatch, which is sometimes called an approximate or partial match. Just like the score values, the gap penalty can be represented by a constant value or by one of the following models. In these models, the gap open/start score (d) represents the cost of starting a gap, while the gap extension score (e) represents the cost of extending a gap by one more space. The standard cost associated with a gap of length g is given either by a linear score [35]:

$$\gamma(g) = -gd \qquad (3.1)$$

or by the affine score:

$$\gamma(g) = -d - (g-1)e \qquad (3.2)$$

The gap-extension penalty e is usually set to a value less than the gap-open penalty (d), allowing long insertions and deletions to be penalized less than they would be by the linear gap cost. This is desirable when gaps of a few residues are expected almost as frequently as gaps of a single residue [35].
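As a small worked illustration of the two models (a hypothetical helper, not part of any of the tools discussed here), the C functions below return the cost of a gap of length g; with d = 10 and e = 1, a gap of length 5 costs 50 under the linear model but only 14 under the affine model, which is why long gaps are penalized comparatively less by the affine scheme.

/* Gap penalties expressed as positive costs.
   d: gap open penalty, e: gap extension penalty, g: gap length. */
int linear_gap_cost(int g, int d)        { return g * d; }
int affine_gap_cost(int g, int d, int e) { return d + (g - 1) * e; }

/* Example: linear_gap_cost(5, 10) == 50, affine_gap_cost(5, 10, 1) == 14. */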

3.2 Optimal Alignment Algorithms

The optimal alignment of two DNA or protein sequences is the alignment that maximizes the sum of pair-scores minus any penalty for the introduced gaps [35].

Optimal Alignment algorithms include:

• Global Alignment algorithms, which align every residue in both sequences. One example is the Needleman-Wunsch algorithm, which we present in Section 3.2.1.

• Local Alignment algorithms, which only consider parts of the sequences and obtain the best subsequence alignments, i.e., the identification of common molecular subsequences [14]. One example is the Smith-Waterman algorithm, which we present in Section 3.2.2.


3.2.1 Needleman-Wunsch Algorithm

In 1970, Needleman & Wunsch [36] proposed the following algorithm. Given two molecular sequences, $A = a_1a_2...a_n$ and $B = b_1b_2...b_m$, the goal is to return an alignment matrix H which indicates the optimal global-alignment score between both sequences. In order to understand this algorithm, consider the following definitions:

• $H(i, j)$ represents the similarity score of the two sequences A and B, ending at positions i and j;

• $s(a_i, b_j)$ is the score for each aligned pair of residues. This value can be defined by a constant value, or can be obtained using scoring matrices like PAM or BLOSUM for protein sequences;

• $W_k$, $W_l$ represent the gap penalties, according to the considered gap model.

Each matrix cell is filled with the maximum value that results from Equation 3.3:

$$H(i, j) = \max \begin{cases} H_{i-1,j-1} + s(a_i, b_j), & \text{if } a_i \text{ and } b_j \text{ are similar symbols} \\ H_{i-k,j} - W_k, & \text{if } a_i \text{ is at the end of a deletion of length } k \\ H_{i,j-k} - W_l, & \text{if } b_j \text{ is at the end of a deletion of length } l \end{cases} \qquad (3.3)$$

This equation is repeatedly applied in order to fill in the matrix with the H(i, j) values, by calculating the value in the bottom right-hand corner of each square of four cells from one of the remaining three values [36]. By definition, the value in the bottom-right cell of the entire matrix, H(n,m), corresponds to the best score for an alignment between A and B. Figure 3.2 illustrates the algorithm with the alignment between sequences A="AACGTT" and B="ATGTT". The obtained score was 13 and the best global alignment is presented with the green arrows presented in the figure.

Figure 3.2: Needleman-Wunsch alignment matrix example

3.2.2 Smith-Waterman Algorithm

In 1981, Smith and Waterman [14] proposed a dynamic programming algorithm4 that computes the similarity scores corresponding to the maximally homologous subsequences among sets of long sequences. Given two sequences $A = a_1a_2...a_n$ and $B = b_1b_2...b_m$, the goal of this algorithm is to return an alignment matrix H which indicates the optimal local alignments between both sequences. For each cell, this algorithm computes the similarity value between the current symbol of sequence A and the current symbol of sequence B.

4Dynamic programming is a programming method that solves problems by combining the solutions to their subproblems[37].

This algorithm has some data dependencies, since each cell of the alignment matrix depends on its left, upper and upper-left neighbors.

In this algorithm, we consider the same definitions of $H(i, j)$, $s(a_i, b_j)$, $W_k$ and $W_l$ used in the Needleman-Wunsch algorithm (Section 3.2.1). Receiving the sequences A and B as input, the algorithm begins with the initialization of the first column and the first row, which is given by:

$$H_{k0} = H_{0l} = 0, \text{ for } 0 \le k \le n \text{ and } 0 \le l \le m \qquad (3.4)$$

Then the algorithm computes the similarity score H(i, j) by using the following equation:

$$H_{ij} = \max \begin{cases} H_{i-1,j-1} + s(a_i, b_j), & \text{if } a_i \text{ and } b_j \text{ are similar} \\ H_{i-k,j} - W_k, & \text{if } a_i \text{ is at the end of a deletion of length } k \\ H_{i,j-k} - W_l, & \text{if } b_j \text{ is at the end of a deletion of length } l \\ 0, & \text{otherwise} \end{cases} \qquad (3.5)$$

The output of the algorithm is the optimal local alignment of sequence A and sequence B with maximum score. Unlike the Needleman-Wunsch algorithm, the Smith-Waterman algorithm always gives matrix scores greater than or equal to 0.

In order to get all the optimal local alignments between sequences A and B, a trace-back algorithm starts from the highest score in the whole matrix and ends at a score of 0.

Figure 3.3 presents the optimal local alignments between sequence A: WPCIWWPC and sequence

B: IIWPC. In this example, the BLOSUM50 matrix scoring model is used in order to obtain the $s(a_i, b_j)$ value. The gap penalty is -5. The optimal local alignments between sequences A and B are represented by the cells with green background color. These alignments occurred between the subsequences WPC of A and WPC of B.

Figure 3.3: Smith-Waterman alignment matrix example
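The recurrence of Equation 3.5 translates almost directly into code. The following unoptimized C sketch (an illustrative implementation assuming a simple match/mismatch score and a linear gap penalty, unlike the BLOSUM50/affine setting of the example above) fills the matrix row by row, keeping only two rows in memory, and returns the best local alignment score; the parallel implementations discussed in Section 3.4 compute exactly these values but reorganize the computation to expose parallelism.

#include <stdlib.h>
#include <string.h>

#define MATCH      2
#define MISMATCH (-1)
#define GAP        1   /* linear gap penalty */

static int max4(int a, int b, int c, int d)
{
    int m = a > b ? a : b;
    m = m > c ? m : c;
    return m > d ? m : d;
}

/* Returns the best local alignment score between a (length n) and b (length m). */
int smith_waterman_score(const char *a, int n, const char *b, int m)
{
    /* Two rows suffice for the score; the full matrix is only needed for trace-back. */
    int *prev = calloc(m + 1, sizeof(int));
    int *curr = calloc(m + 1, sizeof(int));
    int best = 0;

    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            int s = (a[i - 1] == b[j - 1]) ? MATCH : MISMATCH;
            curr[j] = max4(0,
                           prev[j - 1] + s,     /* diagonal: substitution  */
                           prev[j] - GAP,       /* up: gap in sequence B   */
                           curr[j - 1] - GAP);  /* left: gap in sequence A */
            if (curr[j] > best)
                best = curr[j];
        }
        int *tmp = prev; prev = curr; curr = tmp;   /* reuse the two rows  */
        memset(curr, 0, (m + 1) * sizeof(int));
    }
    free(prev);
    free(curr);
    return best;
}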


3.3 Heuristic Sub-Optimal Algorithms

Although providing optimal solutions, the described algorithms are characterized by a quadratic complexity O(mn), where m is the size of sequence A and n the size of sequence B. This becomes evident for large databases with a high number of residues. The current protein UniProt Swiss-Prot [12] database contains hundreds of millions of residues; for a query sequence of length one thousand, approximately $10^{11}$ matrix cells must be evaluated to search the complete database. At ten million matrix cells per second, which is reasonable for a single workstation at the time this is being written, this would take 10,000 seconds, i.e., around three hours [35]. Heuristic algorithms address this issue at the expense of not guaranteeing to find the optimal solution. Examples of these algorithms are FASTA and BLAST, presented in Section 3.3.1 and in Section 3.3.2, respectively.

3.3.1 FASTA

The FASTA algorithm (also known as "fast A", which stands for "FAST-All") was presented by Pearson & Lipman in 1985 [38] and further improved in 1988 [39]. This algorithm uses local high-scoring alignments with a multistep approach, starting from exact short word matches, through maximal scoring ungapped extensions, to finally identify gapped alignments. This algorithm can be described in four steps [35]:

• Step 1 (Figure 3.4): locate all identically matching words of length ktup (which specifies the size of the word) between the two sequences. For proteins, ktup is typically 1 or 2; for DNA, it may be 4 or 6. The algorithm then looks for diagonals with many mutually supporting word matches.

Figure 3.4: FASTA algorithm step 1.

• Step 2 (Figure 3.5): search for the best diagonals, extending the exact word matches to find maximal scoring ungapped regions (and, in the process, possibly joining together several seed matches).

• Step 3 (Figure 3.6): check if any of these ungapped regions can be joined by a gapped region, allowing for gap costs.

Figure 3.5: FASTA algorithm step 2.

Figure 3.6: FASTA algorithm step 3.

• Step 4 (Figure 3.7): the highest scoring candidate matches in a database search are realigned using the full dynamic programming algorithm, but restricted to a subregion of the dynamic programming matrix forming a band around the candidate heuristic match. This step uses a standard dynamic programming algorithm, such as Needleman-Wunsch or Smith-Waterman, to obtain the final scores.

Figure 3.7: FASTA algorithm step 4.

There is a tradeoff between speed and sensitivity in the choice of the ktup parameter: higher values of ktup are faster, but more likely to miss true significant matches. To achieve sensitivities close to those of the optimal algorithms for protein sequences, ktup needs to be set to 1.

3.3.2 BLAST - Basic Local Alignment Search Tool

The Basic Local Alignment Search Tool (BLAST) was presented by Altschul et al. in 1990 [40], and finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can

be used to infer functional and evolutionary relationships between sequences, as well as to help identify members of gene families [9]. This algorithm is most effective with polypeptide5 sequences and uses a score matrix (BLOSUM, PAM, etc.) to find the maximal segment pair (MSP) for two sequences, defined as locally optimal if its score cannot be improved either by lengthening or shortening the segment pair. This algorithm is the most widely used for protein-coding sequence alignment6. The BLAST algorithm steps are [40]:

1. Compile a list of high-scoring words

• Given a length parameter w and a threshold parameter T, find all the w-length substrings (words) of the database sequences that align with words from the query with an alignment score higher than T. This is called a hit in BLAST.

• Discard those words that score below T (these are assumed to carry too little information to be useful starting seeds)

2. Scan the database for hits

• When T is high, the search will be rapid, but potentially informative matches will be missed.

3. Extend the hits

• Attempt to extend this match to see if it is part of a longer segment that scores above the MSP score S

• Report only those hits that yield a score above S

From the score S it is also possible to calculate an expectation score E, which is an estimate of how many local alignments of at least this score would be expected given the characteristics of the query sequence and database. The original BLAST did not permit gaps, so it would find relatively short regions of similarity, and it was often necessary to extend the alignment manually or with a second alignment tool.

3.4 Parallel Implementations

3.4 Parallel Implementations

Smith-Waterman is the best-known algorithm in this context, and it has been explored in many software implementations, each improving the execution time and optimizing the parallelization method. Concerning the parallelization method, the implementations presented in this section follow different approaches to parallelism and can be grouped according to their level of parallelism [2]:

• Coarse-Grained Parallelism: an example of this kind of parallelism is the master/worker model adopted in our work, where a single processor, named master, sends work to the workers. In parallel sequence alignment, the sequence database is split into n parts, and

5Short chains of amino acid monomers linked by peptide (amide) bonds.
6http://cmns.umd.edu/


each worker node processes one of those parts. When the execution ends, the worker sends back the results to the master and gets more parts to process, until the processing of all parts is finished. This parallel method is used in the implementations proposed in this work, presented in Chapter 4.

• Fine-Grained Parallelism: examples of this parallelization methodology are the implementation presented by Wozniak [41], which takes advantage of the Visual Instruction Set (VIS) of the SUN ULTRA SPARC processors, and the Farrar and Rognes implementations [16, 42], which take advantage of the Streaming SIMD Extensions (SSE) technologies, available in most modern Intel processors. All these implementations are presented in Sections 3.4.1.A, 3.4.1.B and 3.4.1.C.

• Intermediate-Grained Parallelism: this kind of implementation is the most explored nowadays, with the growth of General-Purpose Computation on Graphics Hardware (GPGPU), taking advantage of the GPUs to parallelize the execution of the algorithm. Section 3.4.2.A presents Manavski's solution [43], one of the first ones considering the CUDA framework for modern NVIDIA GPUs. The CUDASW++ implementation, introduced by Liu et al. [3, 17], is presented in Section 3.4.2.B.

3.4.1 CPU Implementations

In this section, we survey the state of the art on CPU-based implementations of the Smith-Waterman algorithm.

3.4.1.A Wozniak

One of the first parallel implementations of the Smith-Waterman algorithm was presented in 1997 by A. Wozniak [41], who proposed the use of SIMD instructions for the parallelization of the algorithm, by exploiting specialized video instructions. These instructions, SIMD-like in their design, make possible the parallelization of the algorithm at the instruction level. In particular, this implementation uses the Visual Instruction Set (VIS) instructions found in the SUN ULTRA SPARC processors. These VIS instructions can be used to process in parallel four cells of the Smith-Waterman score matrix, enabling data-level parallelization. VIS instructions use special 64-bit registers, making it possible to add two sets of four 16-bit integers and get four 16-bit results with a single instruction. This implementation reaches over 18 million matrix cell updates per second on a single ULTRA SPARC running at 167 MHz. The global performance scales with the number of processors used, reaching, with 12 processors, 200 million matrix cell updates per second.

3.4.1.B Farrar

In order to improve the performance of the original Smith-Waterman algorithm, Michael Farrar also proposed, in 2006 [42], a SIMD solution to parallelize the algorithm at the data level. This solution takes advantage of three different optimizations. The first one is called query profile and was presented by Rognes and Seeberg [44]. It avoids calculating the score between both sequence residues for every cell of the Smith-Waterman matrix, by pre-calculating a query profile parallel to the query for each possible residue. Then, the calculation of $H_{ij}$ requires just an addition of the pre-calculated score to the previous $H_{ij}$. The query profile is stored in memory at 16-byte boundaries. By aligning the profile at a 16-byte boundary, the values are read with a single aligned load instruction, which is faster than reading unaligned data. Another optimization proposed by Farrar is the use of the SSE2 instructions, available on Intel processors. To maximize the number of cells calculated per instruction, the SIMD SSE2 registers are divided into their smallest possible units: the 128-bit wide registers are divided into 16 8-bit elements for processing. One instruction can therefore operate on 16 cells in parallel. Dividing the register into 8-bit elements limits the cell's range to between 0 and 255. In most cases, the scores fit in the 8-bit range, unless the sequences are long and similar. If a query's score exceeds the cells' maximum, that query is recalculated using a higher precision. Finally, Farrar proposed the lazy F evaluation. In order to avoid calculating every cell of the matrix, this optimization makes the algorithm skip the calculation of the H value when F remains at zero (thus not contributing to the value of H). In order to avoid wrong results, this optimization has a second-pass loop that corrects all the matrix cells that were not calculated in the first pass. This second-pass loop is executed until all elements in F are less than $H - G_{init}$, $G_{init}$ being the gap open penalty. According to the presented results, this algorithm achieves over 3 billion cell updates per second using a 2.0 GHz Xeon Core 2 Duo processor [42].
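A minimal sketch of the query profile idea is given below (illustrative C code assuming residues encoded as small integers over a 24-letter alphabet; the actual striped layout used by Farrar is more elaborate). For each possible database residue r, a whole row of substitution scores against the query is pre-computed once, so the inner loop only needs a sequential lookup plus an addition instead of indexing the substitution matrix with both residues.

#define ALPHA 24   /* assumed alphabet size; residues encoded as 0..ALPHA-1 */

/* Builds the query profile: profile[r * qlen + j] = score of residue r against query[j]. */
void build_query_profile(const signed char subst[ALPHA][ALPHA],
                         const unsigned char *query, int qlen,
                         signed char *profile /* ALPHA * qlen entries */)
{
    for (int r = 0; r < ALPHA; r++)        /* every possible database residue */
        for (int j = 0; j < qlen; j++)     /* every query position            */
            profile[r * qlen + j] = subst[r][query[j]];
}

/* Inner-loop usage: for database residue d, the scores for the whole query are read
   sequentially from profile + d * qlen, ready to be loaded sixteen at a time with SSE2. */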

3.4.1.C SWIPE (Rognes)

Building on Farrar's implementation, in 2011 Torbjørn Rognes proposed SWIPE, an efficient parallel solution based on SIMD instructions [16], which runs the Smith-Waterman search more than six times faster. SWIPE performs rapid local alignment searches in amino acid or nucleotide sequence databases.

SWIPE compares sixteen residues from sixteen different database sequences in parallel for the same query residue. This operation is carried out using Intel SSE2 vectors consisting of sixteen independent bytes (Figure 3.8).

Another important characteristic of this algorithm is the use of a compact code of ten instructions written in assembly, which constitute the core of the inner loop of the computations. These ten instructions are presented in Figure 3.9 and compute in parallel the values for each vector of 16 cells in independent alignment matrices. The exact selection of instructions and their order is important; this part of the code was therefore hand-coded in assembly to maximize performance. In this figure, H represents the main score vector. The H vector is saved in the N vector for the next cell on the diagonal. E and F represent the score vectors for alignments ending in a gap in the query and database sequence, respectively. P is the vector of substitution scores for the database sequences versus the query residue q (see temporary score profiles below).


Q represents the vector of gap open plus gap extension penalties. R represents the gap extension penalty vector. S represents the current best score vector. All vectors, except N, are initialized prior to this code.

Figure 3.8: Multi-sequence vectors.

Figure 3.9: Rognes' algorithm core instructions.

Using a 375-residue query sequence, SWIPE achieved 106 billion cell updates per second (106 GCUPS) on a dual Intel Xeon X5650 six-core processor system, which is more than six times faster than software based on Farrar's approach (the previous fastest implementation).
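The sketch below, written with SSE2 intrinsics, illustrates the kind of vector operations performed by this hand-coded core: saturated 8-bit additions and maxima applied simultaneously to 16 lanes, each lane belonging to a different database sequence. It is only an approximation with our own naming; the real SWIPE core is hand-tuned assembly and also handles the bias needed to represent negative substitution scores in unsigned bytes.

#include <emmintrin.h>   /* SSE2 intrinsics */

/* One cell update for 16 independent alignments packed in 128-bit vectors.
 * Every vector holds 16 unsigned 8-bit lanes, one lane per database sequence:
 *   H: score of the diagonal predecessor, N: score saved for the next diagonal,
 *   E, F: scores of alignments ending in a gap in the query / database sequence,
 *   P: substitution scores for this query residue, QR: gap open + extend penalty,
 *   R: gap extend penalty, S: current best score.
 * Saturated arithmetic keeps all the scores clamped to [0, 255].              */
static inline void swipe_like_cell(__m128i H, __m128i *N, __m128i *E,
                                   __m128i *F, __m128i P, __m128i QR,
                                   __m128i R, __m128i *S)
{
    __m128i h = _mm_adds_epu8(H, P);    /* diagonal score + substitution score   */
    h = _mm_max_epu8(h, *E);            /* best score ending with a query gap    */
    h = _mm_max_epu8(h, *F);            /* best score ending with a database gap */
    *S = _mm_max_epu8(*S, h);           /* keep track of the best local score    */
    *N = h;                             /* saved for the next cell on the diagonal */

    /* extend the current gap (subtract R) or open a new one (subtract QR) */
    *E = _mm_max_epu8(_mm_subs_epu8(*E, R), _mm_subs_epu8(h, QR));
    *F = _mm_max_epu8(_mm_subs_epu8(*F, R), _mm_subs_epu8(h, QR));
}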

3.4.1.D Pedro Monteiro’s Implementation

Extending the Rognes implementation, Pedro Monteiro proposed in his Master's thesis [2] an extension to the presented thread-level parallelization model, exploring a fine-grained parallelization in an inter-task SIMD solution. Basically, in this implementation the database sequences are split into several database chunks, and each chunk is processed using the Rognes execution module, as presented in Figure 3.10. This implementation explores both intra-task and inter-task level parallelization.


Figure 3.10: Sequence database split into several chunks [2].

To support this implementation, Pedro Monteiro's solution proposes a different basic processing element, represented by a structure called message or processing block, which is presented in Figure 3.11.

Figure 3.11: Processing Block - Message [2].

The message presented in Figure 3.11 contains all the needed elements for one processing iteration by one of the system workers.

To avoid an excessive accumulation of processing blocks waiting to be processed by the system workers, Pedro Monteiro's solution also implements two First In, First Out (FIFO) lists, through which the master is able to communicate asynchronously with all the workers and vice versa (Figure 3.12). To support the inclusion of these two queues, Pedro Monteiro also introduced multiple synchronization barriers at the distinct access points.

Figure 3.12: Processing Block FIFOs [2].

This way, the solution obtained significant speedup values using the Dell PowerEdge R810 processing platform. This implementation attained a performance of more than 71 GCUPS by using 32 parallel worker threads on a distributed-memory architecture, which is nearly 2.5 times faster than SWIPE running on a different memory architecture [2].

3.4.2 GPU Implementations

We now present some of the GPU-based implementations of the Smith-Waterman algorithm found in the literature.

3.4.2.A Manavski’s Implementation

In order to get a fast implementation of the Smith-Waterman algorithm on commodity GPU hardware, Manavski et al. [43] proposed what they refer to as "the first solution based on commodity hardware that efficiently computes the exact Smith-Waterman algorithm". In this implementation, they used an optimization of the Smith-Waterman algorithm previously proposed by Rognes and Seeberg [16]. This optimization consists of pre-computing the query profile parallel to the query sequence for each possible residue, in order to avoid the lookup of s(ai, bj) in the internal cycle of the algorithm. Thus, the random accesses to the substitution matrix are replaced by sequential ones. In their implementation, the query profile is stored in the GPU texture memory space, since it is a low-latency memory.

The strategy that was adopted in this implementation consists of making each GPU thread compute the whole alignment of the query sequence with one database sequence. Before that, the database is ordered and stored in the global memory of the GPU, while the query profile is saved into texture memory. Another optimization of this implementation is the inclusion of an initialization process, where the number of available computational resources is automatically detected. This number will help achieve dynamic load balancing. After this step, the database is divided into as many segments as the number of stream-processors present in the GPU. Each stream-processor then computes the alignment of the

query with one database sequence.

To analyze the obtained performance, Manavski's implementation was compared with three previous implementations, by running the application both on single and on double GPU configurations. The first comparison was carried out against Liu's implementation of the Smith-Waterman algorithm based on OpenGL instructions; the obtained results show that this implementation is 18 times faster than Liu's [45]. The second comparison was made with the BLAST and SSEARCH algorithms [46, 47]; the obtained results show that this implementation is up to 30 times faster than SSEARCH and up to 2.4 times faster than BLAST. Finally, the last test compares this implementation with Farrar's implementation [42], showing a three-fold performance increase.

3.4.2.B CUDASW++

Just like the algorithm presented above, CUDASW++ is an optimized implementation of the Smith- Waterman algorithm using CUDA. It was proposed by Liu et al. [3] and uses the computational power of CUDA-enabled GPUs to accelerate Smith-Waterman algorithm sequence database searches.

Liu et al. presented two different approaches for the parallelization of the algorithm: inter-task parallelization and intra-task parallelization. In inter-task parallelization, each task is assigned to exactly one thread and dimBlock tasks are performed in parallel by different threads in a thread block. In intra-task parallelization, each task is assigned to one thread block and all dimBlock threads in the thread block cooperate to perform the task in parallel, exploiting the parallel characteristics of the cells in the minor diagonals.

In order to achieve the best performance, their implementation uses two stages. The first stage exploits inter-task parallelization and the second stage exploits intra-task parallelization. The transition between these stages is determined by a defined threshold: only when the query sequence length is above that threshold are the alignments carried out in the second stage. Besides this two-stage process, their implementation uses three techniques to improve the performance: coalesced subject sequence arrangement, coalesced global memory access, and the cell block division method.

Coalesced subject sequence arrangement (Figure 3.13) - For inter-task parallelization, the sorted subject sequences are arranged in an array where the symbols of each sequence are restricted to the same column, stored from top to bottom, and the sequences are arranged in increasing length order from left to right and top to bottom in the array. For intra-task parallelization, the sorted subject sequences are sequentially stored in an array, row by row, from the top-left corner to the bottom-right corner, with all symbols of a sequence restricted to the same row, from left to right. The texture cache can be utilized in order to achieve maximum performance on coalesced access patterns (a small packing sketch is given after Figure 3.13).


Figure 3.13: Coalesced Subject Sequence Arrangement [3].
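To make the inter-task arrangement more concrete, the sketch below packs a batch of length-sorted sequences column-wise, so that thread t reads the symbols of its sequence down column t and consecutive threads touch consecutive addresses in each row, which is the property that enables coalesced (or texture-cached) loads. The names, the sentinel padding and the single-batch handling are illustrative assumptions, not the exact CUDASW++ layout.

#include <stdlib.h>
#include <string.h>

/* Pack `count` length-sorted sequences column-wise: column t holds sequence t
 * and row r holds the r-th symbol of every sequence.  Shorter sequences are
 * padded with a sentinel so all columns have the height of the longest one.
 * Threads t, t+1, ... of a warp then read adjacent elements of the same row. */
static unsigned char *pack_columnwise(unsigned char **seqs, const int *lens,
                                      int count, int max_len,
                                      unsigned char sentinel)
{
    unsigned char *arr = malloc((size_t)max_len * count);
    if (!arr)
        return NULL;
    memset(arr, sentinel, (size_t)max_len * count);
    for (int t = 0; t < count; t++)            /* one column per sequence    */
        for (int r = 0; r < lens[t]; r++)
            arr[(size_t)r * count + t] = seqs[t][r];
    return arr;
}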

Coalesced global memory access (Figure 3.14) - This technique explores memory organization patterns in order to achieve the best performance. All threads in a half-warp should access the intermediate results in a coalesced pattern; thus, the words accessed by all threads in a half-warp must lie in the same segment. To achieve this, the intermediate results of all threads in a half-warp are allocated in the form of an array, keeping them in contiguous memory.

Figure 3.14: Coalesced Global Memory Access [3].

Cell block division method - This method consists of dividing the alignment matrix into cell blocks of equal size for inter-task parallelization.

When executing their implementation using a single-GPU version, CUDASW++ [3], achieves a per- formance value of about 10 GCUPS on an NVIDIA GeForce GTX 280 graphics card. In a multi-GPU version, it achieves a performance of up to 16 GCUPS on an NVIDIA GeForce GTX 295 graphics card, which has two G200 GPU-chips on a single card.

Meanwhile, the same authors proposed a new version of this implementation, CUDASW++ 2.0 [17]. In this new version, they proposed three different implementations: an optimized SIMT SW algorithm, a basic vectorized SW algorithm and a partitioned vectorized SW algorithm.

Optimized SIMT SW algorithm - This implementation is an optimized version of CUDASW++ focused on its first stage, with the introduction of two optimizations: the introduction of a sequential query profile


and the utilization of a packed data format. The packed data format is used in the re-organization of each subject sequence: four successive residues of each subject sequence are packed together and represented using the uchar4 vector data type. When using the cell block division method, the four residues loaded by one texture fetch are further stored in shared memory for use by the inner loop.
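A possible host-side sketch of this packing step is shown below. It is simplified: it assumes the sequence length is already padded to a multiple of four and ignores the texture binding that the real implementation uses to fetch the packed data.

#include <cuda_runtime.h>   /* provides the uchar4 vector type */

/* Pack four successive residues of a subject sequence into one uchar4, so a
 * single fetch brings four symbols into the kernel.  `len` is assumed to be
 * a multiple of four (padded beforehand).                                    */
static void pack_uchar4(const unsigned char *seq, int len, uchar4 *out)
{
    for (int i = 0; i < len; i += 4) {
        uchar4 p;
        p.x = seq[i];
        p.y = seq[i + 1];
        p.z = seq[i + 2];
        p.w = seq[i + 3];
        out[i / 4] = p;
    }
}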

Basic Vectorized SW algorithm - This implementation is based on Michael Farrar's striped SW implementation [42]. It directly maps Farrar's implementation onto CUDA, based on the virtualized SIMD vector programming model. As seen before, Farrar denotes as F values that part of the similarity values for H(i, j) which derives from the same line: H(i−k, j) − Wk. The lazy-F loop technique avoids these calculations when running the algorithm: for most cells in the matrix H(i, j), the value of H(i−k, j) − Wk remains at zero and does not contribute to the value of H; only when H is greater than Wk will F start to influence the value of H. For the computation of each column of the alignment matrix, the striped SW algorithm consists of two loops: an inner loop calculating local alignment scores under the assumption that F values do not contribute to the corresponding H values, and a lazy-F loop correcting any errors introduced by the inner loop. This algorithm uses a striped query profile.

Partitioned Vectorized SW algorithm - In this implementation, the algorithm first divides a query sequence into a series of non-overlapping, consecutive small partitions, according to a pre-specified partition length. Then, it aligns the query sequence to a subject sequence, partition by partition, considering each one a new query sequence. Finally, it constructs a striped query profile for each partition.

Concerning performance evaluation, just like in the first version of the CUDASW++ implementation, Liu et al. use two different approaches: a single-GPU implementation (NVIDIA GeForce GTX 280) and a multi-GPU implementation (GeForce GTX 295). The optimized SIMT SW algorithm achieves an average performance of 16.5 GCUPS on the GeForce GTX 280 and of 27.2 GCUPS when running on the GTX 295. On the GTX 280, the partitioned vectorized algorithm achieved an average performance of 15.3 GCUPS using a gap penalty of 10-2k (gap open penalty of 10 and gap extension penalty of 2), 16.3 GCUPS using a gap penalty of 20-2k, and 16.8 GCUPS using a gap penalty of 40-3k. The same partitioned vectorized algorithm, when running on the GTX 295, achieved average performances of 22.9, 24.8 and 26.2 GCUPS for the same three gap penalties. When comparing this version with the first CUDASW++ implementation, the optimized SIMT algorithm runs 1.74 times faster on the GTX 280 and 1.72 times faster on the GTX 295, while the partitioned vectorized algorithm runs about 1.58 to 1.77 times faster on the GTX 280 and about 1.45 to 1.66 times faster on the GTX 295. In 2013, Liu et al. [4] presented the third version of this algorithm, CUDASW++ 3.0. This implementation

couples CPU and GPU SIMD instructions and carries out concurrent CPU and GPU computations. For the CPU computation, this algorithm employs SSE-based vector execution units as accelerators. Besides the inclusion of the CPU implementation, this version investigated for the first time a GPU SIMD parallelization, based on the CUDA PTX SIMD video instructions, to gain more data parallelism beyond the SIMT execution model. Moreover, sequence alignment workloads are automatically distributed over CPUs and GPUs based on their respective computing capabilities. The GPU implementation was specified for GPUs based on the Kepler architecture7. In order to balance the runtimes of CPU and GPU computations, all sequence alignment workloads are dynamically distributed over the CPUs and GPUs according to their compute power. For the computation on CPUs, Liu et al. [4] employed streaming SIMD extensions (SSE) based vector execution units and multithreading to speed up the SW algorithm. The program workflow is presented in Figure 3.15.

Figure 3.15: Program workflow of CUDASW++ 3.0 [4].

Evaluation on the Swiss-Prot database shows that CUDASW++ 3.0 gains a performance improvement over CUDASW++ 2.0 of up to 2.9 and 3.2 times, with a maximum performance of 119.0 and 185.6 GCUPS, on a single-GPU GeForce GTX 680 and a dual-GPU GeForce GTX 690 graphics card, respectively. In addition, Liu et al.'s CUDASW++ 3.0 algorithm [4] has demonstrated good speedups over other top-performing tools: SWIPE and BLAST+.

3.4.3 Discussion on the Presented Implementations

The Smith-Waterman algorithm is one of the most used bioinformatics algorithms. As a result, many solutions based on it have been proposed in recent years. In the previous sections, we presented the main parallel implementations of this algorithm.

Considering task parallelization, we observe two types of implementation:

• intra-task parallelization considers the parallelization within a single alignment, breaking the sequences into multiple parts;

• inter-task parallelization considers the parallelization where multiple database or query sequences are processed simultaneously, considering a single query sequence and breaking the database into several sequences, parallelizing at the sequence level.

7http://www.nvidia.com/object/nvidia-kepler.html


Concerning the presented CPU implementations, Wozniak's [41] implementation, exploring instruction-level parallelism, achieved a great performance improvement over the original Smith-Waterman algorithm implementations. In addition, Wozniak proposed a different processing approach, named anti-diagonal (processing the matrix along its diagonals), within the intra-task parallelization method. This algorithm achieves 18 million cell updates per second on a single processor.

This optimization was then explored and surpassed by Michael Farrar [42], exploiting the Intel SSE instructions present on modern Intel processors. This implementation considers a striped pattern in the query sequence access and achieved a performance of over 3 billion cell updates per second (3 GCUPS), reaching a speedup of approximately 8 times over the previous SIMD implementations.

Finally, Rognes' [16] solution explores not only instruction-level parallelism, with the usage of Intel's Streaming SIMD Extensions (SSE) on ordinary CPUs, but also data parallelism, implementing the master/worker model. This implementation can use the inter-task approach, when considering the execution of one query alignment against one database sequence, but also the intra-task approach, when splitting the several database sequences between the several workers configured in the environment. This model was implemented on Intel processors with the SSE3 instruction set extension, such as the Intel Core i7. SWIPE achieved performances of over 9 GCUPS for a single thread and up to 106 GCUPS for 24 parallel threads.

In the GPU context, the two solutions presented in Section 3.4.2 use NVIDIA's CUDA. Manavski's [43] implementation adopts the query profile optimization, also used by Farrar [42], pre-computing the query profile parallel to the query sequence in order to avoid the lookup of s(ai, bj) in the internal cycle of the algorithm. This optimization removes the random accesses to the score matrix, replacing them by sequential accesses to the query profile. The strategy adopted by this implementation makes each GPU thread compute the whole alignment of the query sequence with one database sequence, in an inter-task parallelization approach. Another optimization was pre-ordering the database sequences. This algorithm achieved speeds of more than 3.5 GCUPS, less than Rognes [16], but faster than any other previous attempt available on commodity hardware [43]. In turn, Liu's CUDASW++ 2.0 [17] application also considers the query profile used by Farrar [42], and proposes three different optimizations. The first one (Optimized SIMT SW algorithm) takes into account an intra-task parallelization model, using a packed data format in the re-organization of each subject sequence: four successive residues of each subject sequence are packed together and represented using the uchar4 vector data type. When using the cell block division method, the four residues loaded by one texture fetch are further stored in shared memory for use by the inner loop. The second one (Basic Vectorized SW algorithm) is based on Michael Farrar's striped SW implementation [42]. It directly maps Farrar's implementation onto CUDA, based on the virtualized SIMD vector programming model. Farrar denotes as F values that part of the similarity values for H(i, j) which derives from the same line: H(i−k, j) − Wk. The lazy-F loop is a technique used by Farrar to avoid the calculation of these similarity scores when running the algorithm: for most cells in the matrix H(i, j), H(i−k, j) − Wk remains at zero and does not contribute to the value of H; only when H is greater than Wk will F start to influence the value of H. Finally, the third optimization, the Partitioned Vectorized SW algorithm, first divides a query sequence into a series of non-overlapping, consecutive small partitions, according to a pre-specified partition length; then, it aligns the query sequence to a subject sequence, partition by partition, considering each one a new query sequence, and constructs a striped query profile for each partition.
Considering performance, in CUDASW++ 2.0 the optimized SIMT SW algorithm achieves an average performance of 16.5 GCUPS on the GeForce GTX 280 and 27.2 GCUPS on the GTX 295. The partitioned vectorized algorithm achieved average performances of 15.3, 16.3 and 16.8 GCUPS on the GTX 280 using gap penalties of 10-2k, 20-2k and 40-3k, respectively, and of 22.9, 24.8 and 26.2 GCUPS on the GTX 295 for the same gap penalties. The third implementation, CUDASW++ 3.0, gains a performance improvement over CUDASW++ 2.0 of up to 2.9 and 3.2 times, with a maximum performance of 119.0 and 185.6 GCUPS, on a single-GPU GeForce GTX 680 and a dual-GPU GeForce GTX 690 graphics card, respectively, and in addition shows significant speedups over other top-performing tools: SWIPE and BLAST+.

So, considering the presented values and all these implementations, this work implements an application that combines Rognes' implementation with Liu's CUDASW++ 2.0 implementation, in order to develop a master/worker model implementation that can speed up the execution of the Smith-Waterman algorithm. This application determines the alignments between a database sequence file with thousands of sequences and one query sequence faster, with dynamic load balancing when obtaining the chunks of work, making the best possible use of each of the solution's workers.

4 Heterogeneous Parallel Alignment MultiSW

Contents

4.1 Introduction
4.2 Architecture
    4.2.1 CPU Worker
    4.2.2 GPU Worker
4.3 Application Execution Flow
4.4 Implementation Details and Optimizations
    4.4.1 Database File Format
    4.4.2 Database Sequences Pre-Loading
    4.4.3 Data Structures
4.5 Dynamic Load-balancing Layer
4.6 Conclusion


4.1 Introduction

In Section 3.4, a set of parallel implementations of the Smith-Waterman algorithm proposed in recent years was presented. Considering two of those solutions, CUDASW++ 2.0 by Liu et al. [17] (Section 3.4.2.B) and Pedro Monteiro's SWIPE extension [2] (Section 3.4.1.D), our work proposes an efficient parallel implementation of the Smith-Waterman algorithm, named MultiSW. This implementation consists of the orchestration of both applications' execution modules in a single solution, exploiting multiple CPU cores and the NVIDIA GPUs that may be available on the running machine, in a heterogeneous approach, as presented in Figure 4.1. Each one of the modules is called a worker, so we have the CPU workers (Section 4.2.1) and the GPU workers (Section 4.2.2). The MultiSW application also considers a load balancing layer, in order to efficiently split the database sequences during the execution; this layer is explained in Section 4.5. Another implemented optimization is a wrapper8 function for the CPU worker execution (Section 4.2.1.A). These additional mechanisms were proposed in order to improve the CPU worker execution time. Besides these improvements on the CPU side, several optimizations were also implemented in the GPU worker (Section 4.2.2). During the execution, the proposed MultiSW application receives multiple arguments from the prompt, specifying the running parameters. Then, it prepares all the execution structures (presented in Section 4.4.3) and coordinates the execution of all executable work between the available workers (specified at invocation time). This coordination process is referred to as the orchestration process.

Figure 4.1: Heterogeneous Architecture

This way, multiple parallelization techniques are considered in a single software solution, in a medium-grained parallelization approach where multiple database sequences are processed simultaneously, as will be explained in Section 4.2. In this kind of application, the main objective is to process all data in the minimum execution time, leading to the maximum execution speedup (a concept explained in Section 5.2). Considering both

8A wrapper function is a subroutine in a software library or a computer program whose main purpose is to call a second subroutine or a system call with little or no additional computation.

execution workers, the execution time is directly related to the amount of data (database sequences) processed in each iteration. Due to the base implementations on which MultiSW builds, it was necessary to create several auxiliary processing structures (see Section 4.4.3). Section 4.2 presents the architecture of this solution and the adaptation of the existing solutions that enables it. To improve MultiSW, Section 4.5 presents a model that changes the block size across run-time iterations, in order to minimize the application's total execution time.

4.2 Architecture

The solution's architecture is presented in Figure 4.2. The orchestration can be considered the application's core: it invokes the CPU and GPU implementations to execute work that consists in processing alignments between the database sequences and the query sequence. Both workers are adapted from the considered applications (Pedro Monteiro's solution, Section 3.4.1.D, and CUDASW++ 2.0, Section 3.4.2.B) to this thesis' solution. This adaptation is explained below.


Figure 4.2: MultiSW block diagram.

In order to adapt both solutions to this work, the considered model was the master/worker model originally proposed by Pedro Monteiro in the SWIPE extension [2]. A possible representation of this model's execution is shown in Figure 4.3. The split of the database into multiple chunks represents the inter-task parallelization model introduced by Pedro Monteiro in his solution.

During the execution, all running workers, both the CPU and the GPU ones, keep getting new work to process, iteration after iteration, by invoking the function get_fasta_sequences(). This function loads the next database sequences to process from the database sequence file specified at the application run time. The access to this function is protected by a pthread_mutex_t, to ensure that only one of the workers can obtain sequences at a time (a minimal sketch of this synchronized access is given after Figure 4.3). The worker then gets the respective processing block from the profile_seqs structure. GPU workers use the processing block structure presented in Section 4.4.3.


Figure 4.3: Master Worker Model [2]
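A minimal sketch of this synchronization point is shown below. The field names and the signature of get_fasta_sequences() are simplified assumptions for illustration; in the real implementation the function also fills the processing-block structures described in Section 4.4.3.

#include <pthread.h>

/* Shared cursor over the pre-loaded database (see Section 4.4.2). */
static pthread_mutex_t db_lock = PTHREAD_MUTEX_INITIALIZER;
static int next_seq = 0;      /* index of the next unprocessed sequence */
static int total_seqs = 0;    /* total number of pre-loaded sequences   */

/* Reserve at most `want` sequences starting at *first; returns how many were
 * actually reserved (0 when the database is exhausted).  Only the cursor
 * update is protected, so workers never hold the lock while aligning.       */
int get_fasta_sequences(int want, int *first)
{
    pthread_mutex_lock(&db_lock);
    int got = total_seqs - next_seq;
    if (got > want)
        got = want;
    *first = next_seq;
    next_seq += got;
    pthread_mutex_unlock(&db_lock);
    return got;
}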

Sections 4.2.1 and 4.2.2 present the adaptations made to each existing application so that its original code could be used in the orchestration implementation. A CPU wrapper function is also presented, used to minimize the number of thread accesses to the global shared variables that need synchronization among all execution threads.

4.2.1 CPU Worker

The CPU worker of our work consists of the adaptation of Pedro Monteiro's solution [2], transforming the master of the original master/worker model into one of our workers, since in the original implementation the master thread controls all the execution and creates new processing work for the workers. In the original implementation, the master thread creates processing blocks of 16 sequences, blocking the access of the other workers to the get_fasta_sequences() function (explained above) in every execution iteration. This represents an efficiency problem in the final solution, because of the low parallelization level over the database sequences, so a CPU wrapper function was developed in our work (Section 4.2.1.A) to avoid this problem. The architecture of the original solution was not changed, and the application still works as represented in Figure 4.4.

Figure 4.4: Master Worker Model [2]

So, in our implementation, the worker itself creates the processing blocks to be processed: it gets the database sequences from the CPU wrapper function and then creates the 16-sequence blocks to be inserted in the queue. Besides the CPU wrapper implementation, some of the initialization functions were adapted, since the original database format considered was the BLAST sequence type [48], while our implementation works with the FASTA [49] database file format. This demanded that the initialization functions were changed in order to support this different file format.

4.2.1.A CPU Wrapper

When workers get a new execution block, it is necessary to guarantee that the method which obtains the database sequences does not block the access of the other execution workers. In the CPU implementation this is achieved using one mutex that serializes every concurrent access to these shared variables. Pedro Monteiro's implementation [2] considers that executable blocks have only 16 sequences, and the method getwork() (explained above) that gets those sequences was blocking the access of the other workers while obtaining that information. So, in order to avoid having the CPU worker fetch only 16 sequences at a time and block the other workers, this work introduces a CPU wrapper function that fetches a bigger block (the default value is 30000), avoiding that the other workers wait across several small accesses. After that, the CPU worker creates its processing blocks from the block obtained by this wrapper, taking 16 sequences from it at a time (as shown in Figure 4.5 and sketched in the code after the figure).


Figure 4.5: CPU Wrapper Function.
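The following sketch illustrates the wrapper idea under the same simplified get_fasta_sequences() interface assumed in the previous sketch; the real wrapper also builds the 16-sequence processing-block messages and inserts them into the FIFO queue.

#define WRAPPER_BLOCK 30000   /* sequences fetched per access to the shared lock */
#define CPU_BLOCK     16      /* sequences per SWIPE processing block            */

int get_fasta_sequences(int want, int *first);   /* simplified interface assumed above */

/* Local buffer owned by the CPU module; refilled once per WRAPPER_BLOCK. */
static int buf_first = 0, buf_count = 0, buf_used = 0;

/* Return the next 16-sequence chunk, refilling the local buffer from
 * get_fasta_sequences() only when it runs out.  The global mutex is touched
 * once every WRAPPER_BLOCK/CPU_BLOCK chunks instead of once per chunk.      */
static int cpu_wrapper_get_chunk(int *first, int *count)
{
    if (buf_used >= buf_count) {                       /* buffer exhausted  */
        buf_count = get_fasta_sequences(WRAPPER_BLOCK, &buf_first);
        buf_used = 0;
        if (buf_count == 0)
            return 0;                                  /* no work left      */
    }
    *first = buf_first + buf_used;
    *count = (buf_count - buf_used < CPU_BLOCK) ? buf_count - buf_used
                                                : CPU_BLOCK;
    buf_used += *count;
    return 1;
}

With the default values above, one shared-lock access serves roughly 1875 (30000/16) processing blocks, which is the reason the other workers no longer wait on the CPU worker's frequent small requests.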

4.2.2 GPU Worker

The GPU module considers several GPU workers, each one assigned to a physical NVIDIA GPU device. The number of running GPUs is specified in the prompt at run time. The application creates a CPU pthread for each one of the considered GPUs. This thread runs a function named gpu_worker(), which gets the database sequences to process from the get_fasta_sequences() function and runs all the preparation and execution flows from the original CUDASW++ implementation [17]. Liu et al.'s solution works with FASTA sequences, so it was not necessary to change the sequence preparation functions. To minimize the application execution time, some optimizations that reduce the execution time of each worker iteration are presented. A CUDA Stream is "a sequence of operations that execute on the device in the order in which they are issued by the host code. While operations within a stream are guaranteed to execute in the prescribed order, operations in different streams can be interleaved and,

when possible, they can even run concurrently" [50]. Using CUDA streams, the memory transfers between the host and the device can be made asynchronous (Section 4.2.2.A). With streams, it is also possible to parallelize the execution of kernels (Section 4.2.2.B). Besides these, the loading of the next processing sequences is done in parallel with the execution of kernels on the device side (Section 4.2.2.C).

4.2.2.A Asynchronous Transfers

By creating CUDA streams, assigning them to data transfers and using the asynchronous versions of the memory copy functions (the ones with the Async suffix in their name), the data transfer call does not block the host execution, as shown in the code below:

//...
cudaStream_t mystream1;
cudaStreamCreate(&mystream1);
// the copy is issued on mystream1 and returns immediately to the host thread
// (for a truly asynchronous copy, hostArray should be page-locked memory)
cudaMemcpyAsync(deviceArray, hostArray, size, cudaMemcpyHostToDevice, mystream1);
// the host can immediately launch work on other, independent data
// (grid and block dimensions defined elsewhere)
kernel<<<grid, block>>>(otherDataArray);
//...

4.2.2.B CUDA Streams in Kernel Execution

It is possible to execute two different kernel functions at the same time, if the data processed by each one is different and independent:

//...
cudaStream_t mystream1, mystream2;
cudaStreamCreate(&mystream1);
cudaStreamCreate(&mystream2);
// each launch goes to its own stream, so the two kernels may overlap on the device
kernel<<<1, N, 0, mystream1>>>(DataArray);
kernel<<<1, N, 0, mystream2>>>(differentDataArray);
//...

4.2.2.C Loading Sequences with Execution

It is also possible to execute some host code during the kernel execution on the device. This way, the three operations below are executed in parallel:

//...
cudaStream_t mystream1, mystream2;
cudaStreamCreate(&mystream1);
cudaStreamCreate(&mystream2);
kernel<<<1, N, 0, mystream1>>>(DataArray);       // runs asynchronously on the GPU
get_fasta_sequences();                           // host loads the next sequences meanwhile
kernel<<<1, N, 0, mystream2>>>(differentDataArray);
//...


4.3 Application Execution Flow

The application execution flow is presented in Figure 4.6. In the beginning of the application the workers are started (the number of CPU cores and GPUs chosen for the execution is specified in the prompt using the t parameter for the CPU threads and the g parameter for the GPUs). Considering the CPU wrapper as one big CPU worker, this implementation only considers one CPU worker, since it is the CPU module master that accesses the CPU wrapper function and gets the work for all the CPU module sub-workers. At the same time, the GPU workers also start to get blocks of database sequences to process (a sketch of the per-GPU thread creation is given after Figure 4.6). At the end of the execution, after all database sequences are processed, the CPU module workers and the GPU workers are terminated with the pthread_exit function and control returns to the main function, which shows the best alignment scores and finishes the application execution.

Figure 4.6: Execution Sequence Diagram.
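A sketch of how one host thread per GPU can be created is shown below. The pthread and cudaSetDevice calls are standard; the body of gpu_worker() is only a placeholder for the adapted CUDASW++ preparation and execution flow, and gpu_block_size and get_fasta_sequences() follow the simplified interface assumed in the earlier sketches.

#include <pthread.h>
#include <cuda_runtime.h>

static int gpu_block_size = 65000;               /* default GPU block size (Section 4.5) */
int get_fasta_sequences(int want, int *first);   /* simplified interface assumed above   */

static void *gpu_worker(void *arg)
{
    int device = *(int *)arg;
    cudaSetDevice(device);            /* bind this host thread to one physical GPU */

    int first, got;
    while ((got = get_fasta_sequences(gpu_block_size, &first)) > 0) {
        /* prepare the block [first, first + got), copy it to the device,
         * launch the adapted CUDASW++ kernels and collect the scores (omitted) */
    }
    return NULL;
}

/* Launch one worker thread per requested GPU. */
static void start_gpu_workers(pthread_t *threads, int *ids, int ngpus)
{
    for (int g = 0; g < ngpus; g++) {
        ids[g] = g;
        pthread_create(&threads[g], NULL, gpu_worker, &ids[g]);
    }
}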

4.4 Implementation Details and Optimizations

4.4.1 Database File Format

As for the database file format, Pedro Monteiro's implementation [2] only considers the BLAST sequence type, while CUDASW++ only considers the FASTA format. All the functions used from Pedro Monteiro's implementation were adapted to use the FASTA database sequence file format.


4.4.2 Database Sequences Pre-Loading

The BLAST file format indexes all the sequences present in the file and can be used efficiently to obtain the desired profile sequences at run time. On the other hand, the FASTA file format is not indexed, making it very inefficient to locate the correct sequences during the execution. The solution was to create the profile_seqs structure and pre-load all the sequences of the file into this structure. This way, it becomes possible to index any desired sequence during the run-time execution.
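A simplified sketch of this pre-loading step is shown below. The fasta_seq type is an illustrative stand-in for the actual profile_seqs structure, and the parsing keeps only the essential FASTA handling (header lines starting with '>' followed by residue lines); error handling and growth strategy are kept minimal.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char  *header;      /* description line (without the leading '>')   */
    char  *residues;    /* sequence symbols, concatenated across lines   */
    size_t length;      /* number of residues                            */
} fasta_seq;

/* Pre-load every sequence of a FASTA file into an array, so that any
 * sequence can later be indexed in O(1) by the workers.                 */
static fasta_seq *load_fasta(const char *path, size_t *count)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return NULL;

    fasta_seq *seqs = NULL;
    size_t n = 0, cap = 0;
    char line[4096];

    while (fgets(line, sizeof line, f)) {
        line[strcspn(line, "\r\n")] = '\0';
        if (line[0] == '>') {                        /* new sequence header */
            if (n == cap) {
                cap = cap ? cap * 2 : 1024;
                seqs = realloc(seqs, cap * sizeof *seqs);
            }
            seqs[n].header   = strdup(line + 1);
            seqs[n].residues = NULL;
            seqs[n].length   = 0;
            n++;
        } else if (n > 0) {                          /* residue line        */
            size_t add = strlen(line);
            seqs[n - 1].residues = realloc(seqs[n - 1].residues,
                                           seqs[n - 1].length + add + 1);
            memcpy(seqs[n - 1].residues + seqs[n - 1].length, line, add + 1);
            seqs[n - 1].length += add;
        }
    }
    fclose(f);
    *count = n;
    return seqs;
}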

4.4.3 Data Structures

In order to organize the code and make it possible to separate the blocks into several files with distinct purposes, several structures were created in this implementation to keep the running arguments of the application. The first structure is named execution_params and contains all the execution parameters of the application:

typedef struct execution_parameters {

    char *progname;           // application name
    char *matrixname;         // score matrix name used for the scoring model
    char *databasename;       // name of the database used
    char *queryname;          // query sequence name
    long maxmatches;          // maximum number of results to show
    long minscore;            // minimum score to show
    long threads;             // number of cpu cores
    long blocksize;           // blocksize for the cpu worker
    long p_blocksize;         // size of the profile blocksize
    long workerIter;          // contains number of worker iterations
    long nodes;               // execution nodes in cpu
    long gpu_enable;          // number of gpus enabled
    long gpu_max_seqs;        // blocksize for gpu execution

    long gapopen;             // gap open value
    long gapextend;           // gap extend value

    BYTE gap_open_penalty;    // gap open penalty
    BYTE gap_extend_penalty;  // gap extend penalty

} execution_params;

Another structure used is query_seq_parameters, which contains the query sequence parameters:

typedef struct query_seq_parameters {

    int qlen;                        // query length
    int qlen_aligned;                // query length aligned by 8

    char *filename;                  // query filename
    char *description;               // query description read from the file

    BYTE *query_sequence;            // query sequence residues
    BYTE *query_sequence_padded;     // query sequence residues padded by 8

} query_seq_params;

Another one is the score_matrixes structure. This structure keeps the matrices and the score limits for the SWIPE execution modes. These values are read from the score matrix file indicated in the prompt when executing the application:

typedef struct score_matrixes {

    long SCORELIMIT_7;       // score limit for worker7 execution
    long SCORELIMIT_16;      // score limit for worker16 execution
    long SCORELIMIT_63;      // score limit for worker63 execution

    char *score_matrix_7;    // score matrix for worker7
    short *score_matrix_16;  // score matrix for worker16
    long *score_matrix_63;   // score matrix for worker63

} score_matrixes;

Another structure created is profile_params, which keeps the information about the database profile parameters:

typedef struct profile_params {

    // GenBank NCBI format
    int fd_psq;                  // file descriptor for BLAST psq file
    int fd_phr;                  // file descriptor for BLAST phr file

    // FASTA format
    int using_fasta;             // flag to indicate if file is in FASTA file format
    int fd_fasta;                // file descriptor for FASTA file format
    FILE *fasta_file;            // file pointer
    int fasta_pos;               // next considered sequence

    UINT32 *adr_pin;             // address pin variable for BLAST mode execution

    off_t len_psq;               // offset of psq BLAST file
    off_t len_phr;               // offset of phr BLAST file
    off_t len_pin;               // offset of pin BLAST file

    char *dbname;                // database name obtained from the file
    char *dbdate;                // database date obtained from the file

    unsigned seqcount;           // total number of sequences present in the database file
    unsigned longest;            // longest sequence of the database file

    unsigned long totalaa;       // total number of amino acids for all the sequences in the database file

    unsigned long phroffset;     // offset of phr BLAST file
    unsigned long psqoffset;     // offset of psq BLAST file

} profile_params;

To enable the use of these structures, all implementation functions were changed to return the structures back to the application's main function. The main application then sends the initialized execution structures to both kinds of workers: CPU and GPU.

4.5 Dynamic Load-balancing Layer

"Load balancing is dividing the amount of work that a computer has to do between two or more computers so that more work gets done in the same amount of time and, in general, all users get served faster"9. In our implementation, the processing data unit is the database sequence, which needs to be aligned against the query sequence. The execution time of each worker iteration is directly affected by the size of the processed block. To make the implementation more efficient and to adjust the execution times of all workers, this implementation also considers a load balancing module, which dynamically adjusts the block size of the obtained work. The load balancing layer dynamically adjusts the block size of each worker (the concept of block size is presented above). In this implementation, all workers were considered equal; the only difference is that the default CPU worker block size is 30000 while the default GPU block size is 65000. These sizes were defined taking into account the average execution time of each processing module. Imagine the scenario presented in Figure 4.7: worker A spends almost twice as much time as worker B to process its block. If the application finishes its execution after worker B finishes its iteration, worker A has not processed all its information. This way, the solution does not use each worker in the most efficient way.


Figure 4.7: Workers execution not balanced.

In order to minimize this inefficiency, in each iteration the proposed load balancing layer adjusts the block size of each worker, to make the execution times as close as possible (Figure 4.8). In the example above, this means adjusting the block size in order to reduce the execution time of worker A. In the developed model, the following variables were considered:

9http://searchnetworking.techtarget.com/definition/load-balancing

47 4. Heterogeneous Parallel Alignment MultiSW


Figure 4.8: Workers execution balanced.

• blocksize(w, i) - Represents the block size computed by worker w in iteration i;

• Texecution(w, i) - Represents the execution time of worker w in iteration i;

• Tminexecution(i) - Represents the minimum execution time for all workers in iteration i.

When a worker finishes its execution, it calls the registerExecutionTime function, which registers the execution time and the processed block size of worker deviceNum. This function updates the attributes of the current worker execution and calls a method named adjustBlockSizes that recomputes all workers' block sizes. The first worker to finish its work (the fastest one) increases its block size by 10% with respect to the previous block size. Thus, its next iteration's block size is going to be:

blocksize = blocksize × 1.1 (4.1)

After all the workers finish their execution, the new block size of each worker is calculated taking into account the fastest worker's execution and the time spent in that execution. This time is presented in Equation 4.2 and is given by the function getMinExecTime().

Tminexecution = getMinExecTime() (4.2)

Then, for each worker, the block size of the next iteration can be calculated by Equation 4.3 (since the block size must be an integer value, the ceil function is used to round the value obtained by the formula):

blockSize(i) = ceil( (Tminexecution × blockSize(i − 1)) / Texecution(i − 1) ) (4.3)

Consider an execution example like the one presented in Figure 4.7, with worker A being the CPU worker (w = 0) and worker B the GPU worker (w = 1). The initialization values are given by:

blocksize(0, 0) = 30000 (4.4)

blocksize(1, 0) = 65000 (4.5)

Texecution(0, 0) = 0 seconds (4.6)


Texecution(1, 0) = 0 seconds (4.7)

These values, presented in Equations 4.4, 4.5, 4.6, and 4.7, represent the initial execution structures for the workers of our implementation. After the first worker finishes its execution, its block size is updated and its execution time is registered:

blocksize(1, 1) = 65000 × 1.1 = 71500 (4.8)

Texecution(1, 0) = 2.01 seconds (4.9)

Texecution(0, 0) = 3.5 seconds (4.10)

Then, the load balancing layer updates the block sizes of the other workers:

blocksize(0, 1) = ceil( (2.01 × 30000) / 3.5 ) = 17229 (4.11)
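A compact sketch of this adjustment step is shown below, with the per-worker bookkeeping reduced to plain arrays. The real registerExecutionTime()/adjustBlockSizes() functions also protect these updates with a mutex and are invoked as each worker reports its iteration; names and structure here are simplified assumptions.

#include <math.h>

#define MAX_WORKERS 8

static long   block_size[MAX_WORKERS];   /* block size for the next iteration */
static double exec_time[MAX_WORKERS];    /* last measured iteration time (s)  */
static int    n_workers;

/* Once every worker has reported its last iteration time, grow the fastest
 * worker's block by 10% (Equation 4.1) and rescale the remaining block sizes
 * towards the fastest execution time (Equation 4.3).                          */
static void adjust_block_sizes(void)
{
    double t_min = exec_time[0];
    int fastest = 0;
    for (int w = 1; w < n_workers; w++)
        if (exec_time[w] < t_min) {
            t_min = exec_time[w];
            fastest = w;
        }

    for (int w = 0; w < n_workers; w++) {
        if (w == fastest)
            block_size[w] = (long)(block_size[w] * 1.1);            /* Eq. 4.1 */
        else if (exec_time[w] > 0.0)
            block_size[w] = (long)ceil(t_min * block_size[w]
                                       / exec_time[w]);             /* Eq. 4.3 */
    }
}

Applied to the example above, with exec_time = {3.5, 2.01} and block_size = {30000, 65000}, the GPU worker's block grows to 71500 while the CPU worker's block is rescaled by 2.01/3.5, matching Equation 4.11.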

4.6 Conclusion

The main objective of this work is to study efficient implementations of the Smith-Waterman algorithm based on heterogeneous architectures. For this algorithm, as seen in Chapter 3, several efficient implementations have been proposed, such as Pedro Monteiro's implementation, an extension to the initial SWIPE proposal presented by Rognes, and also Liu et al.'s CUDASW++ 2.0 implementation. Considering the execution modules of both implementations, this work uses each of them as the base of its execution workers, in order to process all of the work chunks that exist in the given database file in the fastest possible way. Thus, the architecture of the MultiSW implementation presented in Section 4.2 represents an orchestration between the execution modules, considering all the synchronization mechanisms and the access to the database file with the sequences to process. In general, to accomplish the orchestration of the two different solutions, it was necessary to understand and adapt both modules so that they could access the same database file and process the sequences in FASTA format. Complementing this orchestration, optimizations are proposed for the GPU module, introducing CUDA Streams in the data transfer code and in the kernel invocations issued to the execution GPUs. These GPU worker optimizations are presented in Section 4.2. Finally, in order to assure that the several system workers process the correct amount of data, an extra Load Balancing module is introduced, with the objective of adjusting the size of the blocks processed along the execution of the application, in such a way that the final execution time of the application is the minimum possible, as represented in the example presented in Section 4.5.

5 Experimental Results

Contents

5.1 Experimental Setup
    5.1.1 Experimental Dataset
5.2 Evaluating Metrics
5.3 Results
    5.3.1 Scenario A - Single CPU core
    5.3.2 Scenario B - Four CPU cores
    5.3.3 Scenario C - Single GPU - GeForce GTX 780 Ti
    5.3.4 Scenario D - Single GPU - GeForce GTX 660 Ti
    5.3.5 Scenario E - Four CPU cores + Single GPU Execution
    5.3.6 Scenario F - Four CPU cores + Double GPUs Execution
5.4 Summary


The performance of our implementation, MultiSW, was evaluated by considering multiple execution scenarios. The results are presented below in Section 5.3.

5.1 Experimental Setup

To correctly evaluate the implemented solution, the considered experimental setup was a workstation with the following characteristics:

• Intel(R) Core(TM) i7 4770K @ 3.5GHz (CPU);

• Four Kingston HyperX DDR3 CL9 8GB @ 1.6GHz Memory RAM modules;

• ASUS Z87-Pro Motherboard;

• GPU A - MSI GeForce GTX 780 Ti Gaming 3GB DDR5;

• GPU B - GeForce GTX 660 Ti 2GB DDR5;

The code was compiled for 64-bit Linux using the Intel C compiler version 13.1.3 and the NVIDIA compiler release 6.5. When comparing the used GPUs, it is easy to identify which one will obtain the best results. The GeForce GTX 780 Ti has more processing cores (2880) than the GeForce GTX 660 Ti (1344), so it can run the kernels with more parallelization power. Another big difference is the memory bandwidth, which is 336 GB/s in the GTX 780 Ti, whereas in the GTX 660 Ti it is only 144.2 GB/s, less than half. The memory interface width is also different: 384 bits in the first one and 192 bits in the second one. So it is expected that the first GPU runs the kernel functions faster and transfers the data more quickly than the second GPU.

5.1.1 Experimental Dataset

The query sequence that was used in the experimental scenarios was the IFNA6 interferon, alpha 6 [Homo sapiens (human)] [51], with 189 residues. The considered database was release 2014_02 of the UniProtKB/Swiss-Prot [52] database, in the FASTA format, repeated 5 times in the file. This database contains 542,503 sequences of several sizes, comprising 192,888,369 amino acids abstracted from 226,190 references. The total number of processed sequences is therefore 2,712,515.


5.2 Evaluating Metrics

In order to compare the considered scenarios, the speedup metric will be used. This metric measures how much faster an optimized implementation is than the base implementation. It is given by Equation 5.1:

speedup = tsequential / tparallel (5.1)

5.3 Results

This section presents multiple scenarios and their results when running the application with various execution parameter configurations. It starts with the simplest scenario, corresponding to a single CPU core execution, and finishes with the most complex configuration, an orchestration of workers based on a multicore CPU and multiple GPUs that processes all the available work. The execution block sizes for each kind of worker were pre-adjusted, by running the application with several block size configurations before obtaining the experimental results, in order to obtain the best overall execution times. Each execution scenario was executed ten times, and the presented results correspond to the average of those executions. An iteration execution represents the time that the application spends processing the block size defined for the execution worker. In each presented scenario, for the global orchestration, the CPU execution worker represents the CPU wrapper module presented in Section 4.2.1.A, regardless of the execution being done with one or four CPU cores. The iteration time may vary because the processed sequences have different sizes. The block sizes considered for the CPU and the GPU were adjusted by varying the block sizes and checking the best execution times using only one CPU core and a single GPU. The obtained CPU block size was 30,000 sequences and the default GPU block size was 65,000 sequences.


5.3.1 Scenario A - Single CPU core

Considering a single CPU core execution, the total execution time was about 31.52 seconds, as shown in Figure 5.1.


Figure 5.1: Processing times considering a single CPU core execution and a processing block with 30000 sequences.

The multiple grey-colored blocks represent the execution of each CPU wrapper iteration, considering its size of 30,000 sequences. These iteration execution times vary between 0.0688 and 2.391 seconds and together represent the total execution time of about 31.52 seconds. Between iteration executions, the preparation time at the beginning is about 0.0009 seconds and is not visible in the figure presented above. The difference in the iteration execution times is explained by the different sequence sizes in each iteration: for bigger sequences, the iteration execution time will be bigger.

5.3.2 Scenario B - Four CPU cores

Considering a 4-core CPU execution, the total execution time was about 15.55 seconds, as shown in Figure 5.2.



Figure 5.2: Processing times for 4 CPU cores, considering a block size of 30,000 sequences.

The distinct grey-colored blocks represent the processing time of a block of 30,000 sequences (one CPU wrapper iteration) by the four CPU cores. These iteration values vary between 0.046 and 0.957 seconds, and the total execution time was about 15.55 seconds. The reason why the solution with four CPU cores is not four times faster than the single CPU core one is the synchronization between multiple threads and the data partitioning and organization times; this way, the obtained speedup was not linear as it would ideally be.

5.3.3 Scenario C - Single GPU - GeForce GTX 780 Ti

Considering a single GeForce GTX 780 Ti GPU execution, the total execution time was 6.35 seconds, as shown in Figure 5.3.


Figure 5.3: Processing times for a single GPU (GPU A, GeForce GTX 780 Ti), considering a block size of 65,000 sequences. Total execution time of about 6.35 seconds.

In the figure, several grey-colored execution blocks are presented, each representing the time of processing 65,000 database sequences against the query sequence. These iteration values vary between 0.118 and 0.266 seconds. Considering the several optimizations mentioned in Section 4.2.2, especially the use of the CUDA Streams provided by NVIDIA in its framework, it is possible to significantly reduce the preparation time between iterations and obtain the best overall execution times.

5.3.4 Scenario D - Single GPU - GeForce GTX 660 Ti

Considering a single GeForce GTX 660 Ti GPU execution, the total execution time was about 7.38 seconds, as shown in Figure 5.4.


Figure 5.4: Processing times for a single GPU (GPU B, GeForce GTX 660 Ti), considering a block size of 65,000 sequences. Total execution time of about 7.38 seconds.

In the figure, several grey-colored execution blocks are presented. Each one of them represents the time of processing 65,000 database sequences against the query sequence. The total execution time was 7.38 seconds, and the iteration values vary between 0.126 and 0.304 seconds.

5.3.5 Scenario E - Four CPU cores + Single GPU Execution

In this scenario, the considered workers for the execution are the four CPU cores and the GeForce GTX 780 Ti GPU. The execution time was about 6.112 seconds, as shown in Figure 5.5.



Figure 5.5: Processing Times for 4 CPU cores and a GeForce GTX780 Ti GPU, considering CPU blocks of 30,000 sequences and GPU blocks of 65,000 sequences. Total execution time was 6.112 seconds.

This time benefits from the fact that the considered GPU was the GeForce GTX 780 Ti, which executes faster than the GeForce GTX 660 Ti, as presented in Scenarios C and D. Figure 5.6 presents the number of sequences processed by each kind of worker: the CPU worker processed 817,087 sequences, while the GPU worker processed 1,895,428 sequences.


Figure 5.6: Number of Sequences Processed by CPU cores and GPU.

The orchestration represented in this scenario is better than the single GPU execution, but a linear speedup was not achieved, since the number of synchronization points increases with the number of workers in the orchestration.

Figure 5.5 also presents the dynamic block size over time; these values are shown next to the execution blocks of the GPU worker and of the CPU worker. The CPU worker starts with a block size of 30,000 and finishes with a size of 15,000, while the GPU worker starts with a block size of 65,000 sequences and finishes with 40,000 sequences. For both workers, the number of sequences to process next decreases along the execution, as a result of the way the load balancing module works. The iteration execution times vary between 0.082 and 0.316 seconds for the GPU worker and between 0.072 and 0.258 seconds for the CPU worker.

5.3.6 Scenario F - Four CPU cores + Double GPUs Execution

The last scenario considered is composed of a 4-core CPU execution and both available GPUs: the GeForce GTX 780 Ti (GPU A) and the GeForce GTX 660 Ti (GPU B).

As expected, this execution was the fastest one, although not the most efficient, taking about 4.957 seconds, as shown in Figure 5.7.


Figure 5.7: Processing Times for 4 cores CPU, GPU A and GPU B, considering the initial block size of 30,000 sequences blocks to the CPU solution and 65,000 to the GPU solution. Total execution time of 4.957 seconds. Near some of the iteration blocks it is presented the new considered block size.

Figure 5.7 shows the execution blocks of the three workers. The CPU worker starts with a block size of 30,000 and finishes with a block size of 15,000. The execution times range from 0.06 to 0.379 seconds for this worker, from 0.067 to 0.394 seconds for the GPU A worker, and from 0.067 to 0.520 seconds for the GPU B worker. The number of sequences computed by each worker is presented in Figure 5.8: the CPU worker processed 411,817 sequences, the GPU A worker computed 1,241,496 sequences and the GPU B worker processed 1,059,202 sequences. The smaller quantity processed by the CPU worker is explained by its block size being smaller than the GPU workers' block sizes.



Figure 5.8: Number of Sequences Processed by CPU, GPU A, and GPU B workers.

5.4 Summary

As shown in Table 5.1, considering the multiple scenarios, Scenario F, presented in Section 5.3.6, achieved a speedup of about 6.4x when compared with the single CPU core execution presented in Scenario A (Section 5.3.1).

Execution                                   Time (s)   Speedup
Single core                                 31.52      -
Four cores                                  15.55      2.03
GeForce GTX 780 Ti                          6.350      4.96
GeForce GTX 660 Ti                          7.380      4.271
Four CPU cores + GeForce GTX 780 Ti         6.112      5.16
Four CPU cores + 2 GPUs                     4.96       6.36

Table 5.1: Execution speedups.

Increasing the number of workers in the orchestration also increases the synchronization needed between the involved threads. This causes execution delays and makes the workers wait longer. This situation is minimized by the load balancing layer of our solution, since the block sizes are adapted so that the iteration times become similar. However, the load balancing module has some limitations, since the total number of sequences to process is not known at the beginning of the application. Despite these limitations, as can be verified in Table 5.1, the orchestration obtained relatively good speedups for the different execution scenarios as new workers were included in the execution.

6 Conclusions and Future Work

Contents

6.1 Conclusions
6.2 Future Work


6.1 Conclusions

Multiple solutions have been proposed in recent years to respond to the large amount of biological information produced every day. Exploiting parallel architectures based on CPUs and GPUs enables this data to be processed quickly. Hence, our work proposed a solution that combines both to obtain better results. Under this context, this thesis proposed the integration of two previously presented parallel implementations: an adaptation of the SWIPE implementation [16], for multi-core CPUs, that exploits SIMD vectorial instructions [2], and an implementation of the Smith-Waterman algorithm for GPU platforms (CUDASW++ 2.0) [17]. Accordingly, the presented work offers a unified solution that tries to take advantage of all computational resources that are made available in heterogeneous platforms, composed of CPUs and GPUs, by integrating a convenient dynamic load balancing layer. The obtained results presented in Chapter 5 show that the attained speedup can reach values as high as 6x, when executing on a quad-core CPU and two distinct GPUs.

6.2 Future Work

The presented solution already considers both intra-task and inter-task processing approaches. However, in the CPU module, it would be worthwhile to explore additional inter-task approaches. Another possible future work is to add an extra thread to the solution dedicated to preparing all the GPU work, as already happens in the CPU module; a sketch of this idea is given below.
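As an illustration only, the sketch below shows one way a dedicated host thread could prepare the next block of sequences while the GPU processes the current one; prepare_block and process_block_on_gpu are hypothetical stand-ins for the packing and transfer-plus-kernel routines of the actual solution.

```cuda
// Hedged sketch: a helper thread prepares block i while the GPU worker
// processes block i-1. The two routines below are simplified placeholders.
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

struct Block { std::vector<char> packed_sequences; };

Block prepare_block(std::size_t block_size) {
    return Block{std::vector<char>(block_size)};   // stand-in for sequence packing
}

void process_block_on_gpu(const Block& block) {
    (void)block;                                   // stand-in for transfers + kernels
}

void run_gpu_worker(std::size_t n_blocks, std::size_t block_size) {
    Block current = prepare_block(block_size);     // first block prepared up front
    for (std::size_t i = 1; i <= n_blocks; ++i) {
        Block next;
        std::thread preparer([&] {                 // overlap preparation with GPU work
            if (i < n_blocks)
                next = prepare_block(block_size);
        });
        process_block_on_gpu(current);             // GPU worker consumes the current block
        preparer.join();                           // ensure the next block is ready
        current = std::move(next);
    }
}
```

In the same spirit, CUDA streams and asynchronous copies could be used to overlap the host-to-device transfer of one block with the computation of the previous one [50].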

The new NVIDIA Kepler GPUs present a technology designated Dynamic Parallelism, which allows new chunks of work to be created directly on the device, without requiring additional data transfers between the device and the host. Thus, it is possible to spend less time transferring data between device and host, optimizing the total execution time.
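A minimal sketch of this mechanism is shown below, assuming a hypothetical child_align kernel that stands in for the alignment work; a real integration would require a device of compute capability 3.5 or higher and compilation with relocatable device code (e.g., nvcc -arch=sm_35 -rdc=true).

```cuda
// Minimal Dynamic Parallelism sketch: a parent kernel launches child kernels
// directly on the device, so new chunks of work are created without a round
// trip to the host. child_align is a hypothetical stand-in for an alignment kernel.
#include <cstdio>

__global__ void child_align(int work_block) {
    if (threadIdx.x == 0)
        printf("child grid for work block %d\n", work_block);  // stand-in for alignment work
}

__global__ void parent(int n_blocks) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_blocks)
        child_align<<<1, 32>>>(i);     // device-side launch of a child grid
}

int main() {
    parent<<<1, 8>>>(8);               // each parent thread spawns one child grid
    cudaDeviceSynchronize();           // wait for the parent and all child grids
    return 0;
}
```

In the context of this work, such device-side launches could, for example, spawn additional alignment passes without returning control to the host between them.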

Finally, another possible optimization concerns the load balancing module: refining the algorithm it uses, in order to compute block sizes closer to the optimum for each of the workers available in the solution.

Bibliography

[1] NVIDIA CUDA - NVIDIA CUDA C Programming Guide, February 2014. URL http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf.

[2] Pedro Matos Monteiro. Profiling biological applications for parallel implementation in multicore computers. Master's thesis, Av. Rovisco Pais, 1, November 2012.

[3] Yongchao Liu, Douglas L. Maskell, and Bertil Schmidt. CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units. BMC Research Notes, 2(1):73+, 2009. ISSN 1756-0500. doi: 10.1186/1756-0500-2-73. URL http://dx.doi.org/10.1186/1756-0500-2-73.

[4] Yongchao Liu, Adrianto Wirawan, and Bertil Schmidt. CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU instructions. BMC Bioinformatics, 14(1):117, 2013. ISSN 1471-2105. doi: 10.1186/1471-2105-14-117. URL http://www.biomedcentral.com/1471-2105/14/117.

[5] M. J. Flynn. Very high-speed computing systems. Proc. IEEE, 54(12):1901–1909, December 1966.

[6] Eberly College of Arts and Sciences. http://eberly.wvu.edu/, 2014. Accessed on October 9, 2014.

[7] Daniel Reiter Horn, Mike Houston, and Pat Hanrahan. ClawHMMer: A streaming HMMer-search implementation. In Supercomputing, 2005.

[8] Genbank DNA Database. Genbank dna database. http://www.ncbi.nlm.nih.gov/genbank/, 2014. Accessed on October 9, 2014.

[9] National Center for Biotechnology Information (NCBI). National center for biotechnology information (ncbi). http://www.ncbi.nlm.nih.gov/, 2014. Accessed on October 9, 2014.

[10] Universal Protein Resource (UniProt). Universal protein resource (uniprot). http://www.uniprot.org/, 2014. Accessed on October 9, 2014.

[11] Nucleotide sequence database (EMBL). Nucleotide sequence database (embl). http://www.ebi.ac.uk/ena/, 2014. Accessed on October 9, 2014.

[12] Swiss-Prot. Swiss-prot. http://www.ebi.ac.uk/uniprot/, 2014. Accessed on October 9, 2014.


[13] TrEMBL. Trembl. http://www.ebi.ac.uk/uniprot/, 2014. Accessed on October 9, 2014.

[14] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, March 1981. ISSN 0022-2836. URL http://view.ncbi.nlm.nih.gov/pubmed/7265238.

[15] D.E. Culler, J.P. Singh, and A. Gupta. Parallel computer architecture: a hardware/software approach. The Morgan Kaufmann Series in Computer Architecture and Design. Morgan Kaufmann Publishers, 1999. ISBN 9781558603431. URL http://books.google.pt/books?id=gftcVOn7iGsC.

[16] Torbjorn Rognes. Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation. BMC Bioinformatics, 12(1):221+, June 2011. ISSN 1471-2105. doi: 10.1186/1471-2105-12-221. URL http://dx.doi.org/10.1186/1471-2105-12-221.

[17] Yongchao Liu, Bertil Schmidt, and Douglas Maskell. CUDASW++2.0: enhanced Smith-Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions. BMC Research Notes, 3(1):93+, 2010. ISSN 1756-0500. doi: 10.1186/1756-0500-3-93. URL http://dx.doi.org/10.1186/1756-0500-3-93.

[18] G.S. Almasi and A. Gottlieb. Highly parallel computing. The Benjamin/Cummings series in computer science and engineering. Benjamin/Cummings Pub. Co., 1994. ISBN 9780805304435. URL http://books.google.pt/books?id=rohQAAAAMAAJ.

[19] J.L. Hennessy, D.A. Patterson, and A.C. Arpaci-Dusseau. Computer architecture: a quantitative approach. Number vol. 1 in The Morgan Kaufmann Series in Computer Architecture and Design. Morgan Kaufmann, 2007. ISBN 9780123704900. URL http://books.google.pt/books?id=57UIPoLt3tkC.

[20] Alex Peleg and Uri Weiser. MMX Technology Extension to the Intel Architecture. IEEE Micro, 16(4):42–50, 1996. ISSN 0272-1732. doi: 10.1109/40.526924. URL http://dx.doi.org/10.1109/40.526924.

[21] W. Stallings. Computer Organization and Architecture: Designing for Performance. Prentice Hall, 2010. ISBN 9780136073734. URL http://books.google.es/books?id=-7nM1DkWb1YC.

[22] N. Wilt. The CUDA Handbook: A Comprehensive Guide to GPU Programming. Pearson Education, 2013. ISBN 9780133261509. URL http://books.google.pt/books?id=ynydqKP225EC.

[23] NVIDIA. Kepler GK110 whitepaper, 2012. URL http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf.

[24] Jason Sanders and Edward Kandrot. CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional, 1st edition, 2010. ISBN 0131387685, 9780131387683.


[25] John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron E. Lefohn, and Timothy J. Purcell. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum, 26(1):80–113, 2007. ISSN 1467-8659. doi: 10.1111/j.1467-8659.2007.01012.x.

[26] Xiaoqing Tang. Introduction to general purpose GPU computing. University of Rochester - Class Lecture, March 16, 2011. URL http://www.cs.rochester.edu/~kshen/csc258-spring2011/lectures/student_Tang.pdf.

[27] John Nickolls and William J. Dally. The gpu computing era. IEEE Micro, 30(2):56–69, March 2010. ISSN 0272-1732. doi: 10.1109/MM.2010.41. URL http://dx.doi.org/10.1109/MM.2010.41.

[28] AMD. AMD Fusion Website. http://www.amd.com/us/products/technologies/fusion/Pages/fusion.aspx. Accessed on January 1, 2012.

[29] Mayank Daga, Ashwin M. Aji, and Wu-chun Feng. On the efficacy of a fused CPU+GPU processor (or APU) for parallel computing. In Proceedings of the 2011 Symposium on Application Accelerators in High-Performance Computing, SAAHPC ’11, pages 141–149, Washington, DC, USA, 2011. IEEE Computer Society. ISBN 978-0-7695-4448-9. doi: 10.1109/SAAHPC.2011.29. URL http://dx.doi.org/10.1109/SAAHPC.2011.29.

[30] Math Smith. What is an APU? [Technology explained]. http://www.makeuseof.com/tag/apu-technology-explained/, February 18, 2011. Accessed on January 1, 2012.

[31] Michael Wolfe. Understanding the CUDA Data Parallel Threading Model: A Primer. http://www.pgroup.com/lit/articles/insider/v2n1a5.htm, February 2010. Accessed on 7-1-2012.

[32] Brent Oster and Greg Ruetsch. Getting started with CUDA. In NVISION 2008, The World of Visual Computing. NVIDIA, 2008. URL http://www.nvidia.com/content/nvision2008/tech_presentations/CUDA_Developer_Track/NVISION08-Getting_Started_with_CUDA.pdf.

[33] Biological Sequences. Biological sequences. http://www.ncbi.nlm.nih.gov/IEB/ToolBox/SDKDOCS/BIOSEQ.HTML, 2014. Accessed on October 9, 2014.

[34] Colin Dewey. Multiple sequence alignment. http://www.biostat.wisc.edu/bmi576/lectures/multiple-alignment.pdf, Fall 2011. Accessed on December 21, 2011.

[35] Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, July 1998. ISBN 0521629713.

[36] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48:443–453, 1970.

[37] Thomas H. Cormen, Clifford Stein, Ronald L. Rivest, and Charles E. Leiserson. Introduction to Algorithms. McGraw-Hill Higher Education, 2nd edition, 2001. ISBN 0070131511.


[38] D. J. Lipman and W. R. Pearson. Rapid and Sensitive protein Similarity Searches. Science, 227: 1435–1441, March 1985.

[39] W. R. Pearson and D. J. Lipman. Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the United States of America, 85(8):2444–2448, April 1988. ISSN 0027-8424. doi: 10.1073/pnas.85.8.2444. URL http://dx.doi.org/10.1073/pnas.85.8.2444.

[40] Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, October 1990.

[41] A. Wozniak. Using video-oriented instructions to speed up sequence comparison. Computer Applications in the Biosciences, 13(2):145–150, 1997. URL http://dblp.uni-trier.de/db/journals/bioinformatics/bioinformatics13.html#Wozniak97.

[42] Michael Farrar. Striped Smith–Waterman speeds database searches six times over other SIMD implementations. Bioinformatics, 23:156–161, January 2007. ISSN 1367-4803. doi: 10.1093/bioinformatics/btl582. URL http://dx.doi.org/10.1093/bioinformatics/btl582.

[43] Svetlin A Manavski and Giorgio Valle. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinformatics, 9(Suppl 2):S10, 2008. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2323659&tool=pmcentrez&rendertype=abstract.

[44] T. Rognes and E. Seeberg. Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors. Bioinformatics (Oxford, England), 16(8):699–706, August 2000. ISSN 1367-4803. doi: 10.1093/bioinformatics/16.8.699. URL http://dx.doi.org/10.1093/bioinformatics/16.8.699.

[45] Bio-sequence database scanning on a GPU, April 2006. doi: 10.1109/ipdps.2006.1639531. URL http://dx.doi.org/10.1109/ipdps.2006.1639531.

[46] NCBI BLAST. Blast - ncbi. http://blast.ncbi.nlm.nih.gov/Blast.cgi, 2014. Accessed on October 9, 2014.

[47] EBI. Ssearch algorithm. http://www.ebi.ac.uk/Tools/sss/, 2014. Accessed on October 9, 2014.

[48] Blast DB Format. Blast format. http://selab.janelia.org/people/farrarm/blastdbfmtv4/blastdbfmt.html, 2014. Accessed on October 9, 2014.

[49] FASTA DB Format. Fasta format. http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml, 2014. Accessed on October 9, 2014.

[50] NVIDIA. Nvidia overlap data transfers and cuda executions. http://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/, 2014. Accessed on October 9, 2014.


[51] NCBI. Ifna6 interferon, alpha 6 [homo sapiens (human)] - ncbi. http://www.ncbi.nlm.nih.gov/gene?Db=gene&Cmd=ShowDetailView&TermToSearch=3443, 2014. Accessed on October 9, 2014.

[52] Uniprot. Uniprotkb/swiss-prot release 2014 february - uniprot. http://www.uniprot.org/downloads, 2014. Accessed on October 9, 2014.

[53] Kamran Karimi, Neil G. Dickson, and Firas Hamze. A performance comparison of CUDA and OpenCL. arXiv preprint arXiv:1005.2581, 2010. URL http://arxiv.org/abs/1005.2581.

[54] E.A. Lee. The problem with threads. Computer, 39(5):33 – 42, may 2006. ISSN 0018-9162. doi: 10.1109/MC.2006.180.

[55] Microsoft. MMX, SSE, and SSE2 intrinsics. http://msdn.microsoft.com/en-us/library/y0dh78ez(v=vs.90).aspx, May 2011. Accessed on May 14, 2014.

[56] W. Stallings. Computer Organization and Architecture: Designing for Performance. Prentice Hall, 2010. ISBN 9780136073734. URL http://books.google.es/books?id=-7nM1DkWb1YC.

[57] J. D. Thompson, D. G. Higgins, and T. J. Gibson. CLUSTAL W: improving the sensitivity of pro- gressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22(22):4673–4680, November 1994. ISSN 1362-4962. doi: 10.1093/nar/22.22.4673. URL http://dx.doi.org/10.1093/nar/22.22.4673.

[58] N. Whitehead and A. Fit-Florea. Precision & performance: Floating point and IEEE 754 compliance for NVIDIA GPUs. NVIDIA technical white paper, 2011.

