Parallel Simulations for Analysing Portfolios of Catastrophic Event Risk

Thesis submitted in partial fulfillment of the requirements for the degree of

MS by Research in Computer Science and Engineering

by

Aman Kumar Bahl 200802003 [email protected]

Center for Security, Theory & Algorithmic Research (CSTAR)
International Institute of Information Technology
Hyderabad - 500 032, INDIA
July 2014

Copyright © Aman Kumar Bahl, 2014
All Rights Reserved

International Institute of Information Technology
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled "Parallel Simulations for Analysing Portfolios of Catastrophic Event Risk" by Aman Kumar Bahl, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date                                        Adviser: Dr. Kishore Kothapalli

To My Family

Acknowledgments

First, I would like to thank Dr. Kishore Kothapalli for all the support, guidance, suggestions and opportunities he gave me throughout the duration of this thesis. He supported me with great patience and knowledge, directing my energy towards useful tasks while allowing me to work on topics of my choice. He listened to all my arguments and proposals, even when they were naive. His modesty and enthusiasm inspire me greatly. I am grateful to CSTAR for providing an atmosphere in which to excel and do research. I would like to thank all the people in the lab: Shashank, Nadeem, Tushar, Jatin, Dharmeet, Niraj, Manoj and SRK. Special thanks to my mentor Dip Sankar Banerjee for his morale-boosting enthusiasm and support. Discussions with them were not only fruitful but entertaining too. Thanks to my friends Chandrashekar, Govind, Abhishek, Ayush, Sagar, Aman Tyagi, Siddarth Shandilya, Harshit Jain, Nitesh, Jayant, Abhishek Bhatia, Siddarth Gupta, Tarun, Jasmeet, Abhilash, Akshay, Ashish Masand, Rohit, Mohit, Pranav, Abhijeet, Apoorv, Jaspal and Omar, who helped me during difficult times. Finally, and most importantly, I would like to thank my parents and brother for their unconditional love and support throughout this thesis.

Abstract

After a stall in the improvement of clock speeds, the High Performance Computing (HPC) community has turned to general purpose computing on graphics processing units (GPGPU). Due to their low cost, high performance per watt and evolving programming models, GPUs are becoming very popular not only in the HPC community, but also for obtaining good performance on commodity devices like laptops and mobile phones. Unlike traditional graphics rendering cards, manufacturers now provide architectural features such as caches that make GPUs promising for general purpose computing. A wide range of applications in the scientific and financial domains, for example those presented in [24, 25, 26, 27, 28] and [29], benefit from the merits of the GPU architecture. GPUs offer an alternative machine architecture in three ways: firstly, many cycles for independent parallelism; secondly, fast memory access under the right circumstances; and finally, fast mathematical computation at relatively low cost. Because of this different architecture, applications need to be reinterpreted in a highly threaded manner. The architecture is a good fit for applications with regular memory access patterns, such as dense matrix multiplication [31, 32], but less so for irregular applications like list ranking [33, 36, 37], graph algorithms [34] and sparse matrix computations [35]. Exploring such applications and accelerating them using GPUs is the need of the hour. Moreover, most applications accelerated by GPUs tend to use only the GPU in the computation: the CPU's role is limited to passing the code and data to the GPU and accumulating the results. While the GPU is busy executing the given code, the CPU stays idle, so CPU resources are not utilized properly. By carefully designing our applications we can get a significant speedup over a GPU-only implementation. To use both CPUs and GPUs in the computation, tasks and data have to be divided efficiently. This partitioning of data depends largely on the underlying CPU and GPU platform specifications, and is also referred to as static load balancing. Static load balancing through empirical or analytical methods has several drawbacks: analytical methods are not available for most applications, while empirical methods output a partition that is specific to the platform, making the application hard to port efficiently. In this thesis we explore aggregate risk analysis, a stochastic simulation technique for portfolio risk analysis and pricing, using multi-core CPUs and many-core GPUs. Firstly, a sequential aggregate risk analysis algorithm is implemented in C on a CPU, followed by parallel implementations using C and OpenMP on a multi-core CPU and using C and CUDA on many-core GPU platforms. Aggregate risk analysis is a form of Monte Carlo simulation performed on the portfolio of risks that a reinsurer holds rather than on individual risks. It involves a high number of simulations with lookups from a real-world database. These lookups are highly irregular and hence slow down the entire computation. There are also huge data structures which need to be handled very carefully: storing

them in a dense format is quite slow, while sparse formats that are easy to look up take too much space in memory. We design a hybrid implementation which uses both CPU and GPU resources in the computation and is hence more efficient. The implementation is based on work sharing. We validate our results by comparing them to GPU-only implementations on two different CPU-GPU combinations. The work partition is computed empirically and experimentally to obtain maximum efficiency. This partition is platform dependent, making the application less portable. To address this challenge we propose an alternate automated hybrid solution, which works well for all CPU-GPU combinations and is hence platform independent. This solution adds portability to the application. We analyzed this automated solution and found its performance very close to that of the non-portable hybrid implementation.

Contents


1 Introduction
  1.1 Relevance of parallel computing
  1.2 Challenges in Aggregate risk analysis on GPUs
  1.3 Contribution of this thesis

2 Background and related work
  2.1 GPU computation and CUDA model
  2.2 Risk analysis

3 Catastrophic risk analysis
  3.1 Introduction
    3.1.1 Risk assessment
  3.2 Aggregate risk analysis
    3.2.1 Inputs
      3.2.1.1 Year Event Table (YET)
      3.2.1.2 Event Loss Tables (ELT)
      3.2.1.3 Layers (L)
    3.2.2 Algorithm sketch
    3.2.3 Output
  3.3 Experimental evaluation
    3.3.1 Platform details
      3.3.1.1 Platform 1 - A Multi-core CPU
      3.3.1.2 Platform 2 - A Many-core GPU
    3.3.2 Implementation
      3.3.2.1 Basic Implementations
      3.3.2.2 Optimised/Chunked Implementation
    3.3.3 Results
      3.3.3.1 Results for Aggregate Analysis on CPUs
      3.3.3.2 Results for the basic aggregate analysis algorithm on GPU
      3.3.3.3 Results for the optimised aggregate analysis algorithm on GPU

4 Hybrid Computing
  4.1 Introduction
    4.1.1 Work sharing
    4.1.2 Task parallelism


  4.2 Static partitioning hybrid model
    4.2.1 Results on static partitioning hybrid model
  4.3 Work-queue hybrid model
    4.3.1 Results on Work-queue hybrid model

5 Conclusions and future work

Bibliography

List of Figures


1.1 A tightly coupled CPU-GPU combination

3.1 The inputs and output of the Aggregate Risk Engine (ARE)
3.2 Performance of the basic aggregate analysis algorithm on a CPU using a single core
3.3 Graphs plotted for the parallel version of the basic aggregate analysis algorithm on a multi-core CPU
3.4 Graphs plotted for number of threads vs the time taken for executing the parallel version of the Basic Algorithm on many-core GPU
3.5 Performance of the optimized aggregate analysis algorithm on GPU
3.6 Total time taken for executing the algorithm
3.7 Percentage of time taken for fetching Events from memory, time for look-up of ELTs in the direct access table, time for financial term calculations and time for layer term calculations

4.1 Case of hybrid computing
4.2 Performance of the static partitioning hybrid model compared to CPU only and GPU only implementations
4.3 Work queue model
4.4 Percentage of work done by CPU, GPU and total work
4.5 Performance of the work-queue hybrid model compared to static partitioning hybrid model

List of Tables


3.1 Basic Algorithm for Aggregate Risk Analysis
3.2 Layer Terms applicable to Aggregate Risk Analysis

4.1 The specifications for the different GPUs and CPUs used in our experiments

Chapter 1

Introduction

Most modern quantitative insurance/reinsurance companies use stochastic simulation techniques for their portfolio risk analysis and pricing processes. This process, a prime component of their analytical pipeline, is commonly known as aggregate analysis. Typically, catastrophe insurance contracts like Cat eXcess of Loss (XL), Per-Occurrence XL, and Aggregate XL have quite complex properties. Aggregate analysis outputs risk measures such as the Probable Maximum Loss (PML) and the Tail Value at Risk (TVAR) for these contracts; it also supports contracts that combine per-occurrence and aggregate features. The computation involves a huge number of simulations with lookups from a real-world database, which makes the problem regular in computation but irregular in lookups. These simulations are independent and hence can be performed in parallel.

1.1 Relevance of parallel computing

Parallel computing is a form of computation in which many computations are carried out simultaneously. It is based on the principle that large problems can often be broken into smaller ones which can be solved at the same time. Though parallel computing has long been practiced, mainly in high performance computing, its mainstream adoption began only after frequency scaling and memory speeds slowed down due to physical constraints. Power consumption soon became a bottleneck, and the community moved towards parallel computing devices and models. Single-core processors turned into multi-core ones, and hardware accelerators like the Cell, FPGAs and GPUs gained widespread attention. Graphics processing units (GPUs) have traditionally been used for graphics rendering: they support the main processor (CPU) and accelerate graphics computations. Graphics applications are by their very nature highly parallel, so GPUs are built on highly parallel architectures with high memory bandwidth for efficient data transfer. A variety of devices like personal computers, gaming consoles, workstations and mobile phones use GPUs to speed up their computations. GPUs are also power efficient and hence becoming highly popular. Because of these properties, GPUs have drawn the attention of the high performance computing community for general purpose calculations.

Figure 1.1: A tightly coupled CPU-GPU combination

A good amount of work has been done in recent years on using GPUs for general purpose computations [34, 39, 40]. This is usually referred to as General Purpose Computing on Graphics Processing Units (GPGPU). By their very design, GPUs are highly suitable for data-parallel computations because they can make use of regular, coalesced memory accesses [31, 38]. But many applications need quite irregular memory accesses, which slow down the whole execution. Some recent research focuses on such applications [33, 36, 37]. In this thesis we explore one such application and show how we can gain speedup by using the GPU's architectural features and by using the CPU efficiently in this heterogeneous computation.

1.2 Challenges in Aggregate risk analysis on GPUs

Aggregate risk analysis involves a huge number of independent simulations which can be executed in parallel, yet straightforward kernels do not achieve the expected speedup and Gflops. Some of the challenges that aggregate risk analysis poses are:

1. There is a huge number of simulations, and each simulation is typically for one calendar year, so storing them is a huge memory overhead. GPUs usually have less global memory than the CPU's main memory. Managing and computing on chunks of simulations on such systems needs very high bandwidth.

2. There is a large number of loss sets on which the simulation threads perform lookups. Choosing the right data structure for these loss sets poses a big trade-off: keeping them in dense structures (like sorted arrays) results in slower lookups, while keeping them in sparse direct lookup tables gives away a huge portion of global memory.

3. The lookups are highly irregular. Threads in a GPU warp may perform memory accesses which are not adjacent. Such scattered memory accesses by the threads of a warp are usually referred to as uncoalesced memory accesses, and they can degrade the performance of GPU kernels by a large factor.

4. Using both the CPU and the GPU fairly in the computation. In this scenario, effectively partitioning the work and data between the CPU and GPU is very important.

5. While partitioning the data between the CPU and GPU, there is a need to find an effective partition ratio. This ratio is generally calculated empirically or experimentally; with either method, the resulting partition becomes platform sensitive.

1.3 Contribution of this thesis

We explore parallel methods for aggregate risk analysis on both multi-core CPUs and many-core GPUs. Some of the chief contributions of this thesis are:

1. We propose a parallel aggregate risk analysis algorithm and an engine based on the algorithm. This engine is implemented in C and OpenMP for multi-core CPUs and in C and CUDA for many-core GPUs.

2. A detailed performance analysis has been done, and based on it several different implementations are proposed. These implementations exploit architectural features of GPUs. Performance analysis of the algorithm indicates that GPUs offer an alternative, cost-effective and feasible HPC solution for aggregate risk analysis. The optimized algorithm on the GPU performs a 1 million trial aggregate simulation with 1000 catastrophic events per trial on a typical exposure set and contract structure in just over 20 seconds, which is approximately 15x faster than the sequential counterpart.

3. We also propose a hybrid solution which uses both the CPU and GPU in the computation. This offers better resource utilization, and hence more computing power, by effectively dividing the data between the two platforms.

4. We propose an automated partitioning hybrid solution which is platform independent and hence works for all CPU-GPU combinations. We analyzed this solution and found its performance very close to that of static data partitioning.

Chapter 2

Background and related work

2.1 GPU computation and CUDA model

NVidia's unified architecture in its Tesla and Fermi series GPUs supports both graphics rendering and general purpose computing. Though the graphics community is still a major market for these devices, the high performance computing community views the GPU through the lens of a general purpose computing device. In this view, a GPU is a device with a highly multi-threaded architecture consisting of hundreds of computation cores. The GPU's architecture draws on both the parallel RAM (PRAM) model and the Shared Memory Processors (SMP) model. For example, the Tesla C2050 has 448 cores organized as 14 streaming multi-processors (SMs), each with 32 streaming processors. Each core can run a number of threads and can switch between threads with negligible overhead; this switching can be used to hide memory fetch latency, making GPUs highly efficient for parallel applications. In the Compute Unified Device Architecture (CUDA), the user can launch a large number of threads, which are grouped into grids, blocks and warps. A grid consists of many blocks and a block consists of many threads. The threads of a block are divided into warps; a warp is a group of 32 threads which execute in parallel in SIMD fashion.

Each CUDA device has several memories that programmers can use to achieve a high Computation to Global Memory Access (CGMA) ratio and thus high execution speed in their kernels. Variables that reside in registers and shared memory can be accessed at very high speed in a highly parallel manner. Registers are allocated to individual threads; each thread can only access its own registers. A kernel function typically uses registers to hold frequently accessed variables that are private to each thread. Shared memory is allocated to thread blocks; all threads in a block can access variables in the shared memory locations allocated to the block. Shared memory is an efficient means for threads to cooperate by sharing the results of their work. There is also global memory, which the host code can write and read by calling API functions; global memory can be accessed by all the threads at any time during program execution. The constant memory allows read-only access by the device and provides faster and more parallel data access paths for CUDA kernel execution than the global memory. CUDA uses explicit functions, called kernel functions, to perform computations on the GPU. Before launching a kernel, all the data must be transferred from the host CPU memory

to the GPU's global memory. Upon kernel invocation the GPU gets control and executes the code inside the kernel function. CUDA also provides the user with facilities to synchronize the threads within a block, as well as global synchronization, using built-in functions. Details of the platforms we used are given in Section 3.3.1.
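To make the flow above concrete, here is a minimal, self-contained CUDA sketch of the host-device interaction: copy data to global memory, launch a kernel over a grid of blocks, synchronize, and copy the result back. The kernel and variable names are illustrative and not part of the thesis engine.

// Minimal CUDA host/device flow: allocate, transfer, launch, synchronize.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, n * sizeof(float));                        // global memory
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;                                        // threads per block
    int blocks = (n + threads - 1) / threads;                 // blocks per grid
    scale<<<blocks, threads>>>(d, n, 2.0f);                   // kernel invocation
    cudaDeviceSynchronize();                                  // wait for the GPU

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[0] = %f\n", h[0]);
    cudaFree(d);
    free(h);
    return 0;
}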

2.2 Risk analysis

Risk analysis is the process of systematically identifying and assessing the potential uncertainties and risks that arise while trying to achieve a certain goal. The goal can be finishing a project, reaching a target income, or even sustaining an architectural structure. Understanding the uncertainties can help us make much better decisions, and risk analysis can help us find a feasible strategy for controlling these risks efficiently. Risk analytics, the model-based computational analysis of risk [1], has become an integral part of business processes in domains ranging from financial services to engineering. Simulation-based risk analytics has been applied to areas as diverse as analysis of catastrophic events [2, 3], financial instruments [4], structures [5], chemicals [6], diseases [7], power systems [8], nuclear power plants [9], radioactive waste disposal [10] and terrorism [11]. In many of these areas, models must both consume huge data sets and perform hundreds of thousands or even millions of simulations, making the application of parallel computing techniques very attractive. For financial analysis problems, especially those concerned with the pricing of assets, parallelism and high-performance computing have been applied to very good effect (for example [12, 13, 14, 15] and [16]). However, in the insurance and reinsurance settings, where data sizes are arguably as large or larger, relatively few HPC-based methods have been reported.

Chapter 3

Catastrophic risk analysis

3.1 Introduction

Insurance and reinsurance companies hold portfolios of contracts that cover most of the risks associated with catastrophic events such as floods, hurricanes, earthquakes, cyclones, storms and volcanic eruptions. Efficiently and accurately quantifying individual risks, and portfolios of these risks, is essential for a marketplace of such risks. Typically, the analytical pipeline of a modern quantitative insurance or reinsurance company consists of three major stages:

1. Risk assessment

2. Portfolio risk management and pricing

3. Enterprise risk management

These stages are explained in detail in the following sections.

3.1.1 Risk assessment

In the risk assessment stage, catastrophe models [17] are used to provide scientifically credible loss estimates for individual risks by taking the following two inputs:

1. A mathematical representation of the natural occurrence patterns and characteristics of catastrophe perils such as hurricanes, tornadoes, severe winter storms or earthquakes, generally referred to as a stochastic event catalog.

2. Exposure databases that describe thousands or millions of structures and buildings to be analyzed, their location, use, construction types, value, and coverage.

Each event-exposure pair is then analyzed by a risk model that quantifies the hazard intensity at the exposure site, the vulnerability of the building and resulting damage level, and the resultant expected loss, given the customer’s financial terms. The output of a catastrophe model is an Event Loss Table (ELT) which specifies the probability of occurrence and the expected loss for every event in the catalog.

Figure 3.1: The inputs and output of the Aggregate Risk Engine (ARE)

However, an ELT does not capture which events are likely to occur in a contractual year, in which order, and how they will interact with complex treaty terms to produce an aggregated loss. Reinsurers may have thousands or tens of thousands of contracts and must analyze the risk associated with their whole portfolio. These contracts often have an ‘eXcess of Loss’ (XL) structure and can take many forms, including (i) Cat XL or Per-Occurrence XL contracts providing coverage for single event occurrences up to a specified limit with an optional retention by the insured and (ii) Aggregate XL contracts (also called stop-loss contracts) providing coverage for multiple event occurrences up to a specified aggregate limit and with an optional retention by the insured. In addition, combinations of such contract terms providing both Per-Occurrence and Aggregate features are common. In the second stage of the analysis pipeline, portfolio risk management and pricing of portfolios of contracts necessitates a further level of stochastic simulation, called aggregate analysis [18, 19, 20, 21, 22, 23] (see Figure 3.1). Aggregate analysis is a form of Monte Carlo simulation in which each simulation trial represents an alternative view of which events occur and in which order they occur within a predetermined period, i.e., a contractual year. In order to provide actuaries and decision makers with a consistent lens through which to view results, rather than using random values generated on-the-fly, a pre-simulated Year Event Table (YET) containing between several thousand and millions of alternative views of a single contractual year is used. The output of aggregate analysis is a Year Loss Table (YLT). From a YLT, a reinsurer can derive important portfolio risk metrics such as the Probable Maximum Loss (PML) and the Tail Value at Risk (TVAR) which are used for both internal risk management and reporting to regulators and rating agencies. Furthermore, these metrics then flow into the final stage in the risk analysis pipeline, namely Enterprise Risk Management, where liability, asset, and other forms of risks are combined and correlated to generate an enterprise wide view of risk.

3.2 Aggregate risk analysis

This section considers the Aggregate Risk Engine, referred to as ARE (see Figure 3.1). The description of ARE is separated into the inputs to the engine, the basic sequential algorithm for aggregate analysis, and the output of the engine.

3.2.1 Inputs

There are three main inputs to the Aggregate Risk Engine:

1. Year Event Table (YET)

2. Event Loss Tables (ELT)

3. Layers (L)

3.2.1.1 Year Event Table (YET)

The Year Event Table, denoted as YET, is a database of pre-simulated occurrences of events from a catalog of stochastic events. Each record in a YET, called a "trial" and denoted Ti, represents a possible sequence of event occurrences for any given year. The sequence of events is defined by an ordered set of tuples containing the ID of an event and the time-stamp of its occurrence in that trial, Ti = {(Ei,1, ti,1), ..., (Ei,k, ti,k)}. The set is ordered by ascending time-stamp values. A typical YET may comprise thousands to millions of trials, and each trial may have approximately between 800 and 1500 'event, time-stamp' pairs, based on a global event catalog covering multiple perils. The YET can be represented as:

YET = {Ti = {(Ei,1, ti,1),..., (Ei,k, ti,k)}}

where i = 1, 2,... and k = 1, 2,..., 800 − 1500.

3.2.1.2 Event Loss Tables (ELT)

Event Loss Tables, denoted as ELT, represent collections of specific events and their corresponding losses with respect to an exposure set. An event may be part of multiple ELTs and associated with a different loss in each ELT. For example, one ELT may contain losses derived from one exposure set, while another ELT may contain the same events but with different losses derived from a different exposure set. Each ELT is characterized by its own metadata, including information about currency exchange rates and terms that are applied at the level of each individual event loss. Each record in an ELT is denoted as an event loss ELi = {Ei, li}, and the financial terms associated with the ELT are represented as a tuple I = (I1, I2, ...). A typical aggregate analysis may comprise 10,000 ELTs, each containing 10,000-30,000 event losses, with exceptions even up to 2,000,000 event losses. The ELTs can be represented as:

ELT = { ELi = {Ei, li}, I = (I1, I2, ...) }

with i = 1, 2, ..., 10,000-30,000.

3.2.1.3 Layers (L)

Layers, denoted as L, cover a collection of ELTs under a set of layer terms. A single layer Li is composed of two attributes. Firstly, the set of ELTs E = {ELT1, ELT2, . . . , ELTj}, and secondly, the

Layer Terms, denoted as T = (TOccR, TOccL, TAggR, TAggL). A typical layer covers approximately 3 to 30 individual ELTs. The Layer can be represented as:

L = { E = {ELT1, ELT2, ..., ELTj}, T = (TOccR, TOccL, TAggR, TAggL) }

with j = 1, 2,..., 3 − 30.
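To fix ideas, the three inputs can be pictured as plain C structures. The following is a minimal sketch of one possible in-memory layout; the type and field names are illustrative assumptions, not the engine's actual definitions.

/* A minimal sketch of the ARE inputs as C structures. Field names and the
   array-based layout are illustrative assumptions. */
typedef struct {
    int    event_id;           /* Ei */
    double time_stamp;         /* ti */
} EventOccurrence;

typedef struct {
    EventOccurrence *events;   /* ordered by ascending time-stamp */
    int num_events;            /* roughly 800-1500 per trial */
} Trial;                       /* one record Ti of the YET */

typedef struct {
    int    *event_ids;         /* Ei for records with non-zero loss */
    double *losses;            /* li */
    int num_records;           /* typically 10K-30K */
    double fin_terms[2];       /* financial terms I = (I1, I2, ...) */
} EventLossTable;

typedef struct {
    EventLossTable **elts;     /* the 3-30 ELTs covered by the layer */
    int num_elts;
    double occ_ret, occ_lim;   /* TOccR, TOccL */
    double agg_ret, agg_lim;   /* TAggR, TAggL */
} Layer;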

3.2.2 Algorithm sketch

The principal sequential algorithm for aggregate analysis utilized in ARE consists of two stages: 1) a preprocessing stage in which data is loaded into local memory, and 2) the analysis stage, which performs the simulation and produces the resulting YLT output. In the preprocessing stage the input data, YET, ELT and L, is loaded into memory. The analysis stage, shown in Table 3.1, is composed of four steps which are all executed for each Layer and each trial in the YET. In the first step (lines 3-5), for each event of a trial, its corresponding event loss in the set of ELTs associated with the Layer is determined. In the second step (lines 6 and 7), a set of financial terms is applied to each event loss pair extracted from an ELT; in other words, contractual financial terms to the benefit of the layer are applied in this step. The losses for a specific event, net of the financial terms I, are then accumulated across all ELTs into a single event loss (lines 8 and 9). In the third step (lines 10-13), the event loss for each event occurrence in the trial, combined across all ELTs associated with the layer, is subjected to the occurrence terms (TOccR and TOccL). Occurrence terms are part of the layer terms (see Table 3.2) and are applicable to individual event occurrences, independent of any other occurrences in the trial. The occurrence terms capture the specific contractual properties of Cat XL and Per-Occurrence XL treaties as they apply to individual event occurrences only. The event losses net of occurrence terms are then accumulated into a single aggregate loss for the given trial. In the fourth and final step (lines 14-19), the aggregate terms are applied to the trial's aggregate loss for a layer. Unlike occurrence terms, aggregate terms are applied to the cumulative sum of occurrence losses within a trial, and thus the result depends on the sequence of prior events in the trial. This behaviour captures the properties of common Stop-Loss or Aggregate XL contracts. The aggregate loss net of the aggregate terms is referred to as the trial loss or the year loss, and is stored in a Year Loss Table (YLT) as the result of the aggregate analysis.

3.2.3 Output

The algorithm provides an aggregate loss value lr for each trial (line 19). Filters (financial functions) are then applied to the aggregate loss values.

Table 3.1: Basic Algorithm for Aggregate Risk Analysis

 1  for all a ∈ L
 2    for all b ∈ YET
 3      for all c ∈ (EL ∈ a)
 4        for all d ∈ (Et ∈ b)
 5          xd ← E ∈ d in El ∈ f, where f ∈ ELT and (EL ∈ f) = c
 6        for all d ∈ (Et ∈ b)
 7          lxd ← Apply Financial Terms(I)
 8        for all d ∈ (Et ∈ b)
 9          loxd += lxd
10      for all d ∈ (Et ∈ b)
11        loxd = min(max(loxd − TOccR, 0), TOccL)
12      for all d ∈ (Et ∈ b)
13        loxd = Σ(i=1..d) loxi
14      for all d ∈ (Et ∈ b)
15        loxd = min(max(loxd − TAggR, 0), TAggL)
16      for all d ∈ (Et ∈ b)
17        loxd = loxd − loxd−1
18      for all d ∈ (Et ∈ b)
19        lr += loxd
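To make the control flow of Table 3.1 concrete, the following is a minimal sequential C sketch of the per-trial computation for a single layer. The names are illustrative, the financial-terms step is stubbed out, and the running-sum formulation folds lines 12-17 into one pass; this is a sketch of the algorithm, not the engine's code.

#include <math.h>

/* Placeholder for applying the financial terms I to an event loss. */
static double fin_terms(double loss) { return loss; }

/* Sequential sketch of lines 3-19 of Table 3.1 for one layer and one trial.
   elt_loss[e] is the direct access table of ELT e; the retentions and limits
   are the layer terms of Table 3.2. */
double trial_loss(const int *events, int num_events,
                  double **elt_loss, int num_elts,
                  double t_occ_r, double t_occ_l,
                  double t_agg_r, double t_agg_l)
{
    double lr = 0.0, agg = 0.0, prev_net = 0.0;
    for (int d = 0; d < num_events; d++) {
        /* Lines 3-9: look the event up in every ELT, apply the financial
           terms, and accumulate into a single event loss. */
        double lox = 0.0;
        for (int e = 0; e < num_elts; e++)
            lox += fin_terms(elt_loss[e][events[d]]);
        /* Lines 10-11: occurrence terms, applied per event occurrence. */
        lox = fmin(fmax(lox - t_occ_r, 0.0), t_occ_l);
        /* Lines 12-17: aggregate terms applied to the running sum; the net
           per-event contribution is the difference of successive capped
           cumulative losses. */
        agg += lox;
        double net = fmin(fmax(agg - t_agg_r, 0.0), t_agg_l);
        /* Lines 18-19: accumulate the trial (year) loss. */
        lr += net - prev_net;
        prev_net = net;
    }
    return lr;
}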

Table 3.2: Layer Terms applicable to Aggregate Risk Analysis

Notation   Terms                  Description
TOccR      Occurrence Retention   Retention or deductible of the insured for an individual occurrence loss
TOccL      Occurrence Limit       Limit or coverage the insurer will pay for occurrence losses in excess of the retention
TAggR      Aggregate Retention    Retention or deductible of the insured for an annual cumulative loss
TAggL      Aggregate Limit        Limit or coverage the insurer will pay for annual cumulative losses in excess of the aggregate retention

3.3 Experimental evaluation

The aims of the ARE are to handle large data, to organize the input data in efficient data structures, and to define the granularity at which parallelism can be applied to the aggregate risk analysis problem so as to achieve a significant speedup. This section shows, through experimental evaluation of a number of implementations of the aggregate risk analysis algorithm, how these aims are achieved. The hardware platforms used in the experimental evaluation are considered first.

3.3.1 Platform details

Two hardware platforms were used in the experimental evaluation of the aggregate risk algorithm.

3.3.1.1 Platform 1 - A Multi-core CPU

The multi-core CPU employed in this evaluation was a 3.40 GHz quad-core Intel(R) Core(TM) i7-2600 processor with 16.0 GB of RAM. The processor had 256 KB of L2 cache per core, an 8 MB L3 cache and a maximum memory bandwidth of 21 GB/sec. Both sequential and parallel versions of the aggregate risk analysis algorithm were implemented on this platform. The sequential version was implemented in C++, while the parallel version was implemented in C++ and OpenMP. Both versions were compiled using the GNU Compiler Collection g++ 4.4 with the "-O3" and "-m64" flags.

3.3.1.2 Platform 2 - A Many-core GPU

An NVIDIA Tesla C2075 GPU, consisting of 448 processor cores (organized as 14 streaming multi-processors, each with 32 streaming processors) and a global memory of 5.375 GB, was employed for the GPU implementations of the aggregate risk analysis algorithm. CUDA is employed for a basic GPU implementation of the aggregate risk analysis algorithm and for an optimized implementation.

3.3.2 Implementation

Four variations of the aggregate risk analysis algorithm are presented in this section: (i) a sequential implementation, (ii) a parallel implementation using OpenMP for multi-core CPUs, (iii) a basic GPU implementation, and (iv) an optimized/"chunked" GPU implementation. In all implementations a single thread is employed per trial, Ti. The key design decision from a performance perspective is the selection of a data structure for representing Event Loss Tables (ELTs). ELTs are essentially dictionaries consisting of key-value pairs, and the fundamental requirement is to support fast random key lookup. The ELTs corresponding to a layer were implemented as direct access tables. A direct access table is a highly sparse representation of an ELT, one that provides very fast lookup performance at the cost of high memory usage. For example, consider an event catalog of 2 million events and an ELT consisting of 20K events for which non-zero losses were obtained. To represent the ELT using a direct access table, an array of 2 million losses is generated in memory, of which 20K are non-zero loss values and the remaining 1.98 million entries are zero. So if a layer has 15 ELTs, then 15 × 2 million = 30 million event-loss pairs are generated in memory. A direct access table was employed in all implementations over any alternate compact representation for the following reasons. A search operation is required to find an event-loss pair in a compact representation. If sequential search is adopted, then O(n) memory accesses are required to find an event-loss pair. Even if sorting is performed in a pre-processing phase to facilitate binary search, O(log(n)) memory accesses are required. If a constant-time, space-efficient hashing scheme such as cuckoo hashing [30] is adopted, then only a constant number of memory accesses may be required, but this comes at considerable implementation and run-time performance complexity. This overhead is particularly high on GPUs, with their complex memory hierarchies consisting of both global and shared memories. Compact representations therefore place a very high cost on the time taken to access an event-loss pair. Essentially, the aggregate analysis process is memory access bound. For example, to perform aggregate analysis on a YET of 1 million trials (each trial comprising 1000 events) for a layer covering 15 ELTs, there are 1000 × 1 million × 15 = 15 billion events, which require random access to 15 billion loss values. Direct access tables, although wasteful of memory space, allow for the fewest memory accesses, as each lookup in an ELT requires only one memory access per search operation.
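As a sketch of this design choice, a direct access table can be built and queried as follows; the calloc-based layout and names are illustrative assumptions.

/* Build a direct access table for one ELT: a dense array indexed by event
   ID. Most entries stay 0.0; only events present in the ELT get a loss. */
#include <stdlib.h>

double *build_direct_access_table(const int *event_ids, const double *losses,
                                  int num_records, int catalog_size)
{
    double *table = (double *)calloc(catalog_size, sizeof(double));
    for (int r = 0; r < num_records; r++)
        table[event_ids[r]] = losses[r];
    return table;
}

/* Lookup is a single memory access -- the property the analysis relies on. */
static inline double elt_lookup(const double *table, int event_id)
{
    return table[event_id];
}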

3.3.2.1 Basic Implementations

The data structures used for the basic implementations are:

(i) A vector consisting of all Ei,k that contains approximately 800M-1500M integer values requiring 3.2GB-6GB memory.

(ii) A vector of 1M integer values indicating trial boundaries to support the above vector requiring 4MB memory.

(iii) A structure consisting of all Eli that contains approximately 100M-300M integer and double pairs requiring 1.2GB-3.6GB.

(iv) A vector to support the above vector by providing ELT boundaries, containing approximately 10K integer values requiring 40KB.

(v) A number of smaller vectors for representing I and T .

In the basic implementation on the multi-core CPU platform, the entire data required for the algorithm is processed in memory. The GPU implementation of the basic algorithm uses the GPU's global memory to store all of the required data structures. Global memory on the GPU is large and reasonably efficient (if the memory access patterns are carefully managed), but considerably slower than the much smaller shared and constant memories available on each streaming multi-processor.

3.3.2.2 Optimised/Chunked Implementation

The optimized version of the GPU algorithm endeavours to utilize shared and constant memory as much as possible. The key concept employed in the optimized algorithm is chunking. In this context, chunking means processing a block of events of fixed size (referred to as the chunk size) to make efficient use of shared memory. In the optimised GPU implementation, all three major steps of the basic algorithm (lines 3-19, i.e., the events in a trial and both the financial and layer terms computations) are chunked. In addition, the financial terms, I, and the layer terms, T, are stored in the streaming multi-processor's constant memory. In the basic implementation, lxd and loxd are kept in global memory; therefore, in each step, while applying the financial and layer terms, the global memory has to be accessed and updated, adding considerable overhead. This overhead is minimized in the optimized implementation.
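The chunking idea can be sketched as the following CUDA kernel, under several simplifying assumptions: one thread per trial, a single pre-combined direct access table (so the financial-terms step is folded away), layer terms in constant memory, an event count divisible by the chunk size, and a block size fixed at 192 threads.

// Sketch of the chunked kernel: each thread processes its trial's events
// CHUNK at a time, keeping the per-chunk losses in shared memory rather
// than global memory. Names and layout are illustrative.
#define CHUNK 4
#define THREADS_PER_BLOCK 192

__constant__ double d_occR, d_occL, d_aggR, d_aggL;   // layer terms T

__global__ void chunked_trial_kernel(const int *events, int events_per_trial,
                                     int num_trials,
                                     const double *elt_table,  // direct access table
                                     double *ylt)              // one loss per trial
{
    __shared__ double lox[THREADS_PER_BLOCK][CHUNK];
    int trial = blockIdx.x * blockDim.x + threadIdx.x;
    if (trial >= num_trials) return;
    const int *my_events = events + (size_t)trial * events_per_trial;

    double lr = 0.0, agg = 0.0, prev = 0.0;
    // Assumes events_per_trial is a multiple of CHUNK and blockDim.x equals
    // THREADS_PER_BLOCK.
    for (int base = 0; base < events_per_trial; base += CHUNK) {
        // Stage one chunk: the irregular global-memory lookups are done once
        // and their results kept in shared memory.
        for (int j = 0; j < CHUNK; j++)
            lox[threadIdx.x][j] = elt_table[my_events[base + j]];
        // Apply occurrence and aggregate terms out of shared/constant memory.
        for (int j = 0; j < CHUNK; j++) {
            double l = fmin(fmax(lox[threadIdx.x][j] - d_occR, 0.0), d_occL);
            agg += l;
            double net = fmin(fmax(agg - d_aggR, 0.0), d_aggL);
            lr += net - prev;
            prev = net;
        }
    }
    ylt[trial] = lr;
}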

3.3.3 Results

The results obtained from the basic and the optimized implementations are described in this section. The basic implementation results are presented for both multi-core CPU and many-core GPU platforms, while the optimized implementation is applicable only to the GPU.

3.3.3.1 Results for Aggregate Analysis on CPUs

The size of an aggregate analysis problem is determined by four key parameters of the input, namely:

(i) Number of Events in a Trial, |Et|av, which affects computations in line nos. 4-19 of the basic algorithm.

(ii) Number of Trials, |T |, which affects the loop in line no. 2 of the basic algorithm.

(iii) Average number of ELTs per Layer, |ELT |av, which affects line no. 3 of the basic algorithm.

(iv) Number of Layers, |L|, which affects the loop in line no. 1 of the basic algorithm

Figure 3.2: Performance of the basic aggregate analysis algorithm on a CPU using a single core. Panels: (a) no. of events in a trial vs execution time; (b) no. of trials vs execution time; (c) average no. of ELTs per layer vs execution time; (d) number of layers vs execution time.

Figure 3.3: The parallel version of the basic aggregate analysis algorithm on a multi-core CPU. Panels: (a) no. of cores vs execution time; (b) total no. of threads vs execution time.

Figure 3.2 shows the impact on running time of executing the sequential version of the basic aggregate analysis algorithm on a CPU using a single core when the number of events in a trial, the number of trials, the average number of ELTs per layer and the number of layers are increased. The range chosen for each of the input parameters represents the range expected to be observed in practice and is based on discussions with industrial practitioners. In Figure 3.2a the number of ELTs per Layer is varied from 3 to 15; the number of Layers is 1, the number of Trials is set to 1 million and each Trial comprises 1000 events. In Figure 3.2b the number of Trials is varied from 200,000 to 1,000,000, with each trial comprising 1000 events, and the experiment is considered for one Layer and 15 ELTs. In Figure 3.2c the number of Layers is varied from 1 to 5, and the experiment is considered for 15 ELTs per Layer, 1 million trials and 1000 events per Trial. In Figure 3.2d the number of Events in a Trial is varied from 800 to 1200, and the experiment is performed for 1 Layer, 15 ELTs per Layer and 100,000 trials. Asymptotic analysis of the aggregate analysis algorithm suggests that performance should scale linearly in these parameters, and this is indeed what is observed. In all the remaining performance experiments the focus is on a large, fixed-size input that is representative of the kind of problem size observed in practice. Figure 3.3 illustrates the performance of the basic aggregate analysis engine on a multi-core CPU. In Figure 3.3a, a single thread is run on each core and the number of cores is varied from 1 to 8. Each thread performs aggregate analysis for a single trial, and threading is implemented by introducing OpenMP directives into the C++ source. Limited speedup is observed: for two cores we achieve a speedup of 1.5x, for four cores the speedup is 2.2x, and for 8 cores it is only 2.6x. As we increase the number of cores we do not equally increase the bandwidth to memory, which is the limiting factor. The algorithm spends most of its time performing random access reads into the ELT data structures; since these accesses exhibit no locality of reference, they are not aided by the processor's cache hierarchy. A number of approaches were attempted, including the chunking method described later for GPUs, but they were not successful in achieving a high speedup on our multi-core CPU. However, a moderate reduction in absolute time was achieved by running many threads on each core. Figure 3.3b illustrates the performance of the basic aggregate analysis engine when all 8 cores are used and each core is allocated many threads. As the number of threads is increased, an improvement in the performance is noted. With 256 threads per core (i.e. 2048 in total) the overall runtime drops from 135 seconds to 125 seconds; beyond this point we observe diminishing returns, as illustrated in Figure 3.3b.
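As an illustration, the trial loop can be parallelized by a single OpenMP directive over the per-trial routine sketched earlier; the scheduling choice shown is an assumption.

/* Parallelizing the trial loop with OpenMP: one iteration per trial,
   mirroring the one-thread-per-trial design. Reuses the trial_loss()
   sketch from Section 3.2.2; names are illustrative. */
#include <omp.h>

void analyze_layer(const int *events, int events_per_trial, int num_trials,
                   double **elt_loss, int num_elts,
                   double t_occ_r, double t_occ_l,
                   double t_agg_r, double t_agg_l,
                   double *ylt)
{
    #pragma omp parallel for schedule(static)
    for (int t = 0; t < num_trials; t++)
        ylt[t] = trial_loss(events + (size_t)t * events_per_trial,
                            events_per_trial, elt_loss, num_elts,
                            t_occ_r, t_occ_l, t_agg_r, t_agg_l);
}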

3.3.3.2 Results for the basic aggregate analysis algorithm on GPU

In the GPU implementations, CUDA provides an abstraction over the streaming multi-processors, referred to as a CUDA block. When implementing the basic aggregate analysis algorithm on a GPU we need to select the number of threads executed per CUDA block. For example, consider 1 million threads used to represent the simulation of 1 million trials on the GPU, with 256 threads executed on a streaming multi-processor. There will be 1,000,000 / 256 ≈ 3906 blocks in total, which will have to be executed on the 14 streaming multi-processors. Each streaming multi-processor will therefore have to execute 3906 / 14 ≈ 279 blocks. Since the threads on the same streaming multi-processor share fixed-size

allocations of shared and constant memory, there is a real trade-off to be made. If we have a smaller number of threads, each thread can have a larger amount of shared and constant memory, but with a small number of threads we have less opportunity to hide the latency of accessing the global memory.

Figure 3.4: Graphs plotted for number of threads vs the time taken for executing the parallel version of the Basic Algorithm on many-core GPU

Figure 3.4 shows the time taken for executing the parallel version of the basic implementation on the GPU when the number of threads per CUDA block is varied between 128 and 640. At least 128 threads per block are required to use the available hardware efficiently. Improved performance is observed with 256 threads per block, but beyond that point the performance improvements diminish greatly.
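The launch configuration arithmetic above is a ceiling division, sketched below with the numbers from the text.

#include <stdio.h>

/* Ceiling division used to map trials onto CUDA blocks. */
static int blocks_for(int trials, int threads_per_block)
{
    return (trials + threads_per_block - 1) / threads_per_block;
}

int main(void)
{
    int b = blocks_for(1000000, 256);   /* 3907 blocks (the text's ~3906 rounds down) */
    printf("blocks = %d, per SM ~ %d\n", b, (b + 13) / 14);  /* ~279-280 on 14 SMs */
    return 0;
}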

3.3.3.3 Results for the optimised aggregate analysis algorithm on GPU

The optimized or chunked version of the GPU algorithm aims to utilize shared and constant memory as much as possible by processing "chunks", blocks of events of fixed size (referred to as the chunk size), to improve the utilization of the faster shared memories that exist on each streaming multi-processor. Figure 3.5a illustrates the performance of the optimized aggregate analysis algorithm as the chunk size is increased. With a chunk size of 4, the optimized algorithm significantly reduces the runtime, from 38.47 seconds down to 22.72 seconds, representing a 1.7x improvement. Interestingly, increasing the chunk size further does not improve the performance: the curve is observed to be flat up to a chunk size of 12, and beyond that the performance deteriorates rapidly as the shared memory overflow is handled by the slow global memory.

Figure 3.5: Performance of the optimized aggregate analysis algorithm on GPU. Panels: (a) chunk size vs execution time; (b) no. of threads per block vs execution time.

Figure 3.6: Total time taken for executing the algorithm

Figure 3.7: Percentage of time taken for fetching Events from memory, time for look-up of ELTs in the direct access table, time for financial term calculations and time for layer term calculations

Figure 3.5b illustrates the performance of the optimized aggregate analysis algorithm as the number of threads is increased. The number of threads ranges in multiples of 32 due to the warp size, which corresponds to the number of streaming processors per streaming multi-processor. With a chunk size of 4, the maximum number of threads that can be supported is 192. As the number of threads is increased there is a small, gradual improvement in performance, but the overall improvement is not significant.

Chapter 4

Hybrid Computing

4.1 Introduction

General purpose computing using hardware accelerators like GPUs has provided very fast and efficient solutions for many applications. However, most implementations do not use the CPU in the computation: data and code are sent to the GPU, GPU threads perform the computation in parallel, and the results are sent back to the CPU. The CPU stays idle for most of the GPU's execution, which is quite a waste of CPU computing power and resources. Hybrid computing aims to use both the CPU and GPU in the computation. Such a hybrid platform is heterogeneous, as CPUs and GPUs differ in their architecture, instruction set and computing power, but carefully designing the application to use both platforms can result in better performance than using just the device (GPU) in the computation [36, 37, 41, 42, 43]. Figure 4.1 shows the case of hybrid computing. In this model, code and some proportion of the data are sent to the GPU. CUDA provides a mechanism for asynchronous kernel calls and memory transfers: the CPU and GPU both take part in the computation and exchange data via asynchronous memory transfers. Thus, instead of one kernel call, there are many kernel calls. By balancing the execution times and transfer times we can minimize the idle time of both devices. In the end, the output is gathered on either device, depending on the application. This keeps both the CPU and GPU busy, leaving little idle time and resulting in high efficiency. Hybrid computing can be modeled based on what and how much work is assigned to which device. There are two different algorithm engineering approaches to hybrid computing, described in the following sections.

4.1.1 Work sharing

In this approach the problem is decomposed into different parts which are assigned to different devices: the data is divided into parts and sent to the devices for execution [41, 43]. The devices can use the same or different algorithms, and these algorithms may differ from the best possible algorithm on each device. The proportion of data each device gets depends on the computation power of that device and on the throughput of the specific application on that device. Since both devices do essentially the same type of work in this scenario, it is important to balance the workload.

GPUs

Device   Cores   # of SMs   Clock      Global Memory   L2 Cache   Threads per block
GTX580   512     16         1.54 GHz   1535 MB         768 KB     1024
GT520    48      1          1.62 GHz   1024 MB         64 KB      1024

CPUs

Device       # of Cores   SIMD width   Clock     Last Level Cache   L1 Cache   # of Threads
i7 980x      6            4            3.4 GHz   12 MB              32 KB      12
Core 2 Duo   2            2            2.8 GHz   3 MB               32 KB      4

Table 4.1: The specifications for the different GPUs and CPUs used in our experiments.

4.1.2 Task parallelism

The task parallelism approach views the problem as a set of subtasks which depend on each other. Each subtask can be assigned to a device based on the suitability of that device for the subtask, and the output of one subtask can be the input to another [36, 37]. Hence this approach aims at choosing the best device for each job.

We used the work sharing approach in implementing a hybrid version of the aggregate risk analysis engine.

4.2 Static partitioning hybrid model

In the work sharing hybrid approach we partition the data and send it to the different devices for execution. Choosing the right partition for the data division, so that the desired amount of data is input to each device, is very important. In the static partitioning hybrid model we choose the partition experimentally, or empirically by analyzing the throughput on each device. In theory, this method aims at an overall throughput equal to the sum of the throughputs of the individual devices. In aggregate risk analysis we divided the trials across the devices, and by exploiting features like asynchronous memory transfers and kernel calls, and by using techniques like double buffering, we achieved good results.
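Putting the pieces together, the following is a minimal sketch of the static split. It reuses the kernel and CPU routine sketched earlier, uses one stream rather than full double buffering, takes the GPU fraction as a given platform-specific constant, and assumes the constant-memory layer terms have been set separately; all names are illustrative.

// Static work sharing sketch: a fixed fraction of the trials goes to the
// GPU asynchronously while the CPU processes the remainder with OpenMP.
// (Truly asynchronous copies require pinned host memory, omitted here.)
#include <cuda_runtime.h>

void hybrid_static(const int *h_events, int events_per_trial, int num_trials,
                   const double *h_elt, int catalog_size, double *h_ylt,
                   double **elt_loss, int num_elts,
                   double t_occ_r, double t_occ_l, double t_agg_r, double t_agg_l,
                   float gpu_fraction /* e.g. 0.8 on a high-end GPU */)
{
    int gpu_trials = (int)(gpu_fraction * num_trials);
    int cpu_trials = num_trials - gpu_trials;

    int *d_events; double *d_elt, *d_ylt;
    size_t ev_bytes = (size_t)gpu_trials * events_per_trial * sizeof(int);
    cudaMalloc(&d_events, ev_bytes);
    cudaMalloc(&d_elt, catalog_size * sizeof(double));
    cudaMalloc(&d_ylt, gpu_trials * sizeof(double));

    cudaStream_t s;
    cudaStreamCreate(&s);
    // Asynchronous copies and kernel launch: the host thread returns
    // immediately and the CPU share proceeds concurrently.
    cudaMemcpyAsync(d_events, h_events, ev_bytes, cudaMemcpyHostToDevice, s);
    cudaMemcpyAsync(d_elt, h_elt, catalog_size * sizeof(double),
                    cudaMemcpyHostToDevice, s);
    int tpb = 256, blocks = (gpu_trials + tpb - 1) / tpb;
    chunked_trial_kernel<<<blocks, tpb, 0, s>>>(d_events, events_per_trial,
                                                gpu_trials, d_elt, d_ylt);
    cudaMemcpyAsync(h_ylt, d_ylt, gpu_trials * sizeof(double),
                    cudaMemcpyDeviceToHost, s);

    // The CPU computes its share of trials while the GPU works.
    analyze_layer(h_events + (size_t)gpu_trials * events_per_trial,
                  events_per_trial, cpu_trials, elt_loss, num_elts,
                  t_occ_r, t_occ_l, t_agg_r, t_agg_l, h_ylt + gpu_trials);

    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
    cudaFree(d_events); cudaFree(d_elt); cudaFree(d_ylt);
}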

4.2.1 Results on static partitioning hybrid model

Figure 4.2 shows the results of static data partitioning on two different platforms, which we call the low-end platform and the high-end platform. The low-end platform is a combination of an Intel Core 2 Duo E7400 CPU coupled with an NVidia GT520 GPU. The high-end platform is a combination of an Intel i7 980 CPU and an NVidia GTX 580 GPU. Details and specifications of the platforms are given in Table 4.1.

We achieved a 15-20% speedup over the GPU-only implementation on both platforms.

Figure 4.1: The case of hybrid computing. Panels: (a) device-only computing; (b) hybrid computing.

Figure 4.2: Performance of the static partitioning hybrid model compared to CPU-only and GPU-only implementations. Panels: (a) time taken on Intel Core 2 Duo E7400 CPU + NVidia GT520 GPU; (b) time taken on Intel i7 980 CPU + NVidia GTX 580 GPU.

Figure 4.3: Work queue model

4.3 Work-queue hybrid model

In the work-queue hybrid model, instead of deciding on a partition we decide on a chunk of work for each of the two devices. This chunk of computation, also referred to as a work unit, is decided for each device. We maintain a logical double-ended queue of these work units, as shown in Figure 4.3. The CPU and GPU start computing from their respective ends. We maintain two heads for this logical queue, termed the CPU front and the GPU front: the CPU front indicates how many CPU work units have been finished by the CPU, and the GPU front indicates the same for GPU work units. When these fronts cross each other, the computation terminates. For the hybrid aggregate risk analysis engine we choose a chunk size which is the same for both the CPU and GPU: a work unit consists of 1792 trials for both devices. Choosing the same work unit size makes this implementation platform independent and hence more generic.
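The queue bookkeeping can be sketched with two atomic fronts gated by a shared counter of unclaimed units, as below; this is a simplified host-side sketch rather than the engine's actual implementation.

// Work-queue sketch: trials are grouped into work units, and the CPU and
// GPU consume units from opposite ends of a logical deque. The shared
// 'remaining' counter gates the two fronts so they can never overlap. The
// GPU dispatch path (transfers, kernel launches) is abstracted away.
#include <atomic>

static const int UNIT = 1792;        // trials per work unit, as in the text

static std::atomic<int> remaining;   // work units not yet claimed
static std::atomic<int> cpu_front;   // next unit index for the CPU (grows)
static std::atomic<int> gpu_front;   // next unit index for the GPU (shrinks)

// Claim the next unit for the CPU; returns -1 once the fronts have crossed.
int next_cpu_unit() {
    if (remaining.fetch_sub(1) <= 0) return -1;
    return cpu_front.fetch_add(1);
}

// Claim the next unit for the GPU; returns -1 once the fronts have crossed.
int next_gpu_unit() {
    if (remaining.fetch_sub(1) <= 0) return -1;
    return gpu_front.fetch_sub(1);
}

// One host thread runs the CPU loop and another drives the GPU: each claims
// a unit u and processes trials [u*UNIT, (u+1)*UNIT). Computation terminates
// when both loops see -1.
void init_queue(int num_trials) {
    int units = num_trials / UNIT;   // assume divisibility for the sketch
    remaining.store(units);
    cpu_front.store(0);
    gpu_front.store(units - 1);
}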

4.3.1 Results on Work-queue hybrid model

Figure 4.5 shows a timing comparison of the work-queue hybrid model on both the platforms listed in Section 4.2.1. The results of this automated, platform-independent hybrid implementation are very close to those of the static partitioning hybrid model. Due to the fine-grained distribution of work units, we get an accuracy of more than 95% compared to the static partitioning counterpart. Figure 4.4 shows the percentage of total work done by the CPU for both the static load balancing implementation and the work-queue implementation on both platforms. These results demonstrate the accuracy of the work-queue model.

Figure 4.4: Percentage of work done by CPU, GPU and total work. Panels: (a) split percentage of work on Intel Core 2 Duo E7400 CPU + NVidia GT520 GPU; (b) split percentage of work on Intel i7 980 CPU + NVidia GTX 580 GPU.

Figure 4.5: Performance of the work-queue hybrid model compared to the static partitioning hybrid model. Panels: (a) time taken on Intel Core 2 Duo E7400 CPU + NVidia GT520 GPU; (b) time taken on Intel i7 980 CPU + NVidia GTX 580 GPU.

Chapter 5

Conclusions and future work

In this work, we have shown how GPUs can be used efficiently to solve problems with highly irregular memory accesses, such as aggregate risk analysis. We investigated parallel methods for aggregate risk analysis which utilize large input data, organize the input data in efficient data structures, and define the granularity at which parallelism can be applied to the aggregate risk analysis problem to achieve a significant speedup. In this implementation we used sparse lookup tables; while they take a lot of space, they achieve good speedups. In the future we wish to experiment with hashing techniques which can exploit the GPU's features and replace the sparse direct lookup tables. We then emphasized better resource management and worked on a hybrid implementation of the engine. This hybrid implementation uses both resources, the multi-core CPU and the many-core GPU, in the computation and tends to minimize idle time. We then made the hybrid application portable by removing the static workload balancing: we implemented an automated load balancing model which makes the application platform independent. These implementations seem promising, and in the future we wish to explore multi-GPU platform-independent implementations.

Related Publications

1. A. K. Bahl, O. Baltzer, A. Rau-Chaplin and B. Varghese, "Parallel Simulations for Analysing Portfolios of Catastrophic Event Risk," in Proc. of the Workshop on High Performance Computational Finance, 2012.

2. A. K. Bahl, O. Baltzer, A. Rau-Chaplin, B. Varghese and A. Whiteway, "Multi-GPU Computing for Achieving Speedup in Real-time Aggregate Risk Analysis," High Performance Computing on Graphics Processing Units (hgpu.org).

3. Dip Sankar Banerjee, Aman Kumar Bahl, and Kishore Kothapalli. “An On-Demand Fast Parallel Pseudo Random Number Generator with Applications”, in Proc. of the Workshop on Large Scale Parallel Processing (LSPP), 2012, in conjunction with IPDPS 2012

Bibliography

[1] G. Connor, L. R. Goldberg and R. A. Korajczyk, “Portfolio Risk Analysis,” Princeton University Press, 2010.

[2] E. Mason, A. Rau-Chaplin, K. Shridhar, B. Varghese and N. Varshney, "Rapid Post-Event Catastrophe Modelling and Visualization," accepted to the 23rd International Workshop on Database and Expert Systems Applications, 2012.

[3] C. Trumbo, M. Lueck, H. Marlatt and L. Peek, “The Effect of Proximity to Hurricanes Katrina and Rita on Subsequent Hurricane Outlook and Optimistic Bias,” Risk Analysis, Vol. 31, Issue 12, pp. 1907-1918, December 2011.

[4] A. Melnikov, “Risk Analysis in Finance and Insurance, Second Edition” CRC Press, 2011.

[5] D. S. Bowles, A. M. Parsons, L. R. Anderson and T. F. Glover, “Portfolio Risk Assessment of SA Water’s Large Dams,” Australian Committee on Large Dams Bulletin, Vol. 112, Issue 27-39, August 1999.

[6] D. J. Moschandreas and S. Karuchit, “Scenario-Model-Parameter: A New Method of Cumulative Risk Uncertainty Analysis,” Environment International, Vol. 28, Issue 4, pp. 247-261, September 2002.

[7] T. Walshe and M. Burgman, “A Framework for Assessing and Managing Risks Posed by Emerging Diseases,” Risk Analysis, Vol. 30, Issue 2, pp. 236-249, February 2010.

[8] A. M. L. da Silva, L. C. Resende and L. A. F. Manso, “Application of Monte Carlo Simulation to Well-Being Analysis of Large Composite Power Systems,” Proceedings of the International Conference on Probabilistic Methods Applied to Power Systems, 2006.

[9] C. L. Smith and E. Borgonovo, “Decision Making During Nuclear Power Plant Incidents - A New Approach to the Evaluation of Precursor Events,” Risk Analysis, Vol. 27, Issue 4, pp. 1027-1042.

[10] B. J. Garrick, J. W. Stetkar and P. J. Bembia, "Quantitative Risk Assessment of the New York State Operated West Valley Radioactive Waste Disposal Area," Risk Analysis, Vol. 30, Issue 8, pp. 1219-1230, August 2010.

[11] P. Gordon, J. E. Moore II, J. Y. Park and H. W. Richardson, "The Economic Impacts of a Terrorist Attack on the U.S. Commercial Aviation System," Risk Analysis, Vol. 27, Issue 3, pp. 505-512.

[12] M. B. Giles and C. Reisinger, "Stochastic Finite Differences and Multilevel Monte Carlo for a Class of SPDEs in Finance," SIAM Journal of Financial Mathematics, Vol. 3, Issue 1, 2012, pp. 572-592.

[13] A. J. Lindeman, "Opportunities for Shared Memory Parallelism in Financial Modelling," Proceedings of the IEEE Workshop on High Performance Computational Finance, 2010.

[14] S. Weston, J. -T. Marin, J. Spooner, O. Pell and O. Mencer, "Accelerating the Computation of Portfolios of Tranched Credit Derivatives," Proceedings of the IEEE Workshop on High Performance Computational Finance, 2010.

[15] D. B. Thomas, “Acceleration of Financial Monte-Carlo Simulations using FPGAs,” Proceedings of the IEEE Workshop on High Performance Computational Finance, 2010.

[16] K. Smimou and R. K. Thulasiram, "A Simple Parallel Algorithm for Large-Scale Portfolio Problems," Journal of Risk Finance, Vol. 11, Issue 5, 2010, pp. 481-495.

[17] P. Grossi, H. Kunreuther (Editors), “Catastrophe Modelling: A New Approach to Managing Risk,” Springer, 2005.

[18] R. R. Anderson and W. Dong, “Pricing Catastrophe Reinsurance with Reinstatement Provisions using a Catastrophe Model,” Casualty Actuarial Society Forum, Summer 1998, pp. 303-322.

[19] G. G. Meyers, F. L. Klinker and D. A. Lalonde, “The Aggregation and Correlation of Reinsurance Exposure,” Casualty Actuarial Society Forum, Spring 2003, pp. 69-152.

[20] G. G. Meyers, F. L. Klinker and D. A. Lalonde, “The Aggregation and Correlation of Insurance Exposure,” Casualty Actuarial Society Forum, Summer 2003 , pp. 15-82.

[21] W. Dong, H. Shah and F. Wong, “A Rational Approach to Pricing of Catastrophe Insurance,” Journal of Risk and Uncertainty, Vol. 12, 1996, pp. 201-218.

[22] W. Dong and F. S. Wong, "Portfolio Theory for Earthquake Insurance Risk Assessment," Proceedings of the 12th World Conference on Earthquake Engineering, Paper No. 0163, 2000.

[23] R. M. Berens, “Reinsurance Contracts with a Multi-Year Aggregate Limit,” Casualty Actuarial Society Forum, Spring 1997, pp. 289-308.

[24] E. Alerstam, T. Svensson and S. Andersson-Engels, "Parallel Computing with Graphics Processing Units for High-Speed Monte Carlo Simulation of Photon Migration," Journal of Biomedical Optics, Vol. 13, Issue 6, 2008.

[25] J. Tolke and M. Krafczyk, "TeraFLOP computing on a desktop PC with GPUs for 3D CFD," International Journal of Computational Fluid Dynamics - Mesoscopic Methods and their Applications to CFD, Vol. 22, Issue 7, pp. 443-456, 2008.

[26] I. S. Haque and V. S. Pande, “Large-Scale Chemical Informatics on GPUs,” GPU Computing Gems: Emerald Edition (Edited by: W. -M. W. Hwu), 2011.

[27] A. Eklund, M. Andersson and H. Knutsson, "fMRI Analysis on the GPU - Possibilities and Challenges," Computer Methods and Programs in Biomedicine, Vol. 105, Issue 2, pp. 145-161, 2012.

[28] G. Pages and B. Wilbertz, "Parallel Implementation of Quantization Methods for the Valuation of Swing Options on GPGPU," Proceedings of the IEEE Workshop on High Performance Computational Finance, 2010.

[29] D. M. Dang, C. C. Christara and K. R. Jackson, “An Efficient Graphics Processing Unit-Based Parallel Algorithm for Pricing Multi-Asset American Options,” Concurrency and Computation: Practice & Experience, Vol. 24, Issue 8, June 2012, pp. 849-866.

[30] R. Pagh and F. F. Rodler, “Cuckoo hashing,” Journal of Algorithms, Vol. 51, 2004.

[31] Shubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. Owens. Scan primitives for GPU computing. In Proc. of the ACM Symposium on Graphics Hardware, pp. 97-106, 2007.

[32] Nadathur Satish, Mark Harris, and Michael Garland. Designing efficient sorting algorithms for manycore GPUs. In Proc. of IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2009, pp. 1-10.

[33] Zheng Wei, Joseph JaJa. Optimization of linked list prefix computations on multithreaded GPUs using CUDA. In Proc. IEEE International Parallel & Distributed Processing Symposium (IPDPS) 2010, pp. 1-8.

[34] Pawan Harish, P. Narayanan. Accelerating Large Graph Algorithms on the GPU Using CUDA. In Proc. of International Conference on High Performance Computing (HiPC) 2007, Vol. 4873, pp. 197-208.

[35] James W. Demmel. 1997. Applied Numerical Linear Algebra. Soc. for Industrial and Applied Math., Philadelphia, PA, USA.

[36] D. S. Banerjee and K. Kothapalli, Hybrid multicore algorithms for list ranking and graph connected components, in 18th Annual International Conference on High Performance Computing (HiPC), 2011.

[37] Dip Sankar Banerjee, Aman Kumar Bahl, and Kishore Kothapalli. An On-Demand Fast Parallel Pseudo Random Number Generator with Applications, in Proc. of the Workshop on Large Scale Parallel Processing (LSPP), 2012, in conjunction with IEEE International Parallel & Distributed Processing Symposium (IPDPS) 2012.

[38] V. Volkov and J. W. Demmel. Benchmarking GPUs to tune dense linear algebra. In Proc. SC, 2008.

[39] Zheng Wei, Joseph JaJa. Optimization of linked list prefix computations on multithreaded GPUs using CUDA. In Proc. IEEE International Parallel & Distributed Processing Symposium (IPDPS) 2010, pp. 1-8.

[40] Nathan Bell and Michael Garland. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Proc. of SC '09, ACM, New York, NY, USA, Article 18, 11 pages, 2009.

[41] A. Deshpande, I. Misra, and P. J. Narayanan, Hybrid implementation of error diffusion dithering, in High Performance Computing Conference (HiPC), 2011, pp. 1-10.

[42] S. Choudhary, S. Gupta, and P. J. Narayanan, Practical time bundle adjustment for 3D reconstruction on the GPU, in European Conference on Computer Vision (ECCV) Workshop on Computer Vision on GPUs (CVGPU), 2011.

[43] K. Matam, S. K. Bharadwaj, and K. Kothapalli, Sparse matrix matrix multiplication on hybrid CPU+GPU platforms, High Performance Computing Conference (HiPC), 2012
