Power-performance Analysis of Search Algorithms on the GPU

Tiffany Connors, Department of Computer Science, Texas State University, San Marcos, TX. Email: [email protected]
Apan Qasem, Department of Computer Science, Texas State University, San Marcos, TX. Email: [email protected]

Abstract—This paper presents a power-performance analysis of three metaheuristic search algorithms on the GPU. We investigate the impact of varying thread configurations with constraints on the maximum allowable registers per thread. The experimental results reveal that thread geometry can have a significant impact on both the performance and power consumption of the studied codes. Generally, larger block sizes with a low constraint on the number of allocated registers provide the best results. A particularly interesting outcome of this study is the discovery of thread and grid dimensions which yield improvements in both performance and power consumption. These improvements come from better data locality across thread blocks and a reduction in DRAM traffic.

Fig. 1. Compute and memory resource utilization of heuristic search algorithms on an Nvidia Tesla K20c GPU (percent utilization of compute and memory resources for 2opt, 2opt-shared, tabu, and anneal).

I. INTRODUCTION

The ability to deliver high degrees of parallelism through fine-grain multithreading has made GPUs a key component in today's HPC systems. GPUs can also offer higher performance-per-watt than their CPU counterparts for applications running at close to theoretical peak [1]. Although GPU architecture can provide energy efficiency for many HPC applications, the mapping of a particular application to a GPU can cause huge variations in power and performance. Recent investigations into the energy behavior of GPU codes have shown that not only are there significant variations, but the effect that certain optimizations have on power consumption can be starkly different on GPUs as compared to CPUs [2]. These results provide the motivation for further exploration of the energy-efficiency behavior of different classes of GPU applications.

This paper presents a power-performance analysis of a set of metaheuristic search algorithms on the GPU. Specifically, we are interested in examining how the number of registers allocated to a thread block and the number of threads assigned to it impact the execution time and power consumption of these algorithms. It has been shown that effectively managing the GPU thread hierarchy is instrumental in producing high-performing GPU codes [3], [4]. A thread configuration which results in higher occupancy can lead to improved performance; however, it can cause degradation if it inhibits the utilization of shared resources. Although the influence of thread geometry and the allocation of shared resources on performance has been studied previously [5], [6], the impact on power consumption and the associated trade-offs has not been explored.

For this study, we selected three widely used search algorithms: (i) 2opt, (ii) tabu search, and (iii) simulated annealing. All three algorithms exhibit regular memory access patterns and tend to be memory-bound for larger data sets. The compute and memory utilization of the three algorithms for one input instance is shown in Fig. 1. We specifically chose these algorithms because un-optimized implementations of these codes tend to exhibit high register pressure, requiring spills when the register capacity is set below 36. On the Tesla K20c GPU, the register pressure is high enough to restrict the thread block to a lower occupancy level, which leads to interesting power-performance trade-offs. Moreover, although each algorithmic variant has high register pressure, the relative register demands of the different algorithms vary, with simulated annealing requiring the most registers at 58 and 2opt requiring the fewest at 37. Thus, this set of codes and their variants gives us a good test-bench to explore the relationship between register allocation and thread configuration and their impact on power and performance.

This material is based upon work supported by the National Science Foundation through grant no. CNS-1305302 and CAREER award no. CNS-1253292. Equipment support was provided by Nvidia.
II. SEARCH ALGORITHMS

The three algorithms we use in this study follow a fairly simple scheme of task decomposition. The search space is divided into equal-sized regions using diversification. Each GPU thread implements a separate instance of the search and explores the assigned sub-space. In this study, we use the three algorithms to solve the Quadratic Assignment Problem (QAP) using real-world data sets from Operations Research. In this section, we provide an explanation of the QAP problem and briefly describe the implementation of the three algorithms.

A. Quadratic Assignment Problem (QAP)

The Quadratic Assignment Problem (QAP) is an NP-hard combinatorial optimization problem. The objective is to assign n units to n locations such that the total cost, measured as the sum of the products of flows between units and distances between locations, is minimized. QAP maps directly to the facility layout problem in manufacturing systems and plays an important role in the area of Operations Research. Although many heuristic algorithms have been proposed, finding an implementation that can efficiently solve a variety of problem instances has proven particularly challenging. This is because the performance of QAP implementations tends to be highly sensitive to both the size and the shape of the input data sets. This makes QAP an interesting problem for our study.
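Stated compactly (in our own notation, for illustration): given an n-by-n flow matrix f and distance matrix d, the goal is to find a permutation pi of the n units that minimizes the flow-distance product sum

$$\min_{\pi \in S_n} \; \sum_{i=1}^{n} \sum_{j=1}^{n} f_{ij}\, d_{\pi(i)\pi(j)}$$

where $\pi(i)$ is the location assigned to unit $i$.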
B. 2-Opt

As part of this study, we wanted to develop a relatively simple search heuristic and concentrate our efforts on learning about the GPU capabilities and how to exploit them. We use the 2-opt implementation described in [7]. This implementation starts with an initial feasible solution and explores its neighborhood. To get a single neighborhood solution, two positions i and j are randomly selected and a pairwise exchange of their content is performed. An O(n) cost function [8] is used to compute the delta in the objective function following an exchange. Once the cost of all neighboring solutions is obtained, it is checked against the best solution obtained thus far. If a better solution has been found, an update occurs. To start a new iteration, the new best solution replaces the old one. This process is repeated for a pre-determined number of iterations.

We also implemented a variant of 2opt called 2opt-shared, which attempts to exploit inter-thread data locality. In this variation, portions of the flow and distance matrices that comprise a neighborhood are copied into shared memory and the computation is adjusted accordingly. The idea is to restrict neighborhood explorations by each thread to shared memory, thereby reducing cost calculation time.
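For illustration, the sketch below shows one plausible device-side rendering of the O(n) delta computation for a pairwise exchange, assuming a symmetric QAP instance with zero diagonals; the names and data layout are ours, not taken from the actual kernels. In the 2opt-shared variant, the rows of flow and dist touched here would first be staged into __shared__ arrays.

    // Sketch: O(n) change in the QAP objective when the units at
    // positions i and j swap locations (symmetric instance assumed).
    __device__ int swap_delta(const int *flow, const int *dist,
                              const int *perm, int n, int i, int j)
    {
        int pi = perm[i], pj = perm[j];   // current locations of the two units
        int delta = 0;
        for (int k = 0; k < n; ++k) {
            if (k == i || k == j) continue;
            int pk = perm[k];
            // Contribution of unit k after the exchange minus before it.
            delta += (flow[i * n + k] - flow[j * n + k]) *
                     (dist[pj * n + pk] - dist[pi * n + pk]);
        }
        return 2 * delta;                 // symmetry: each pair counted twice
    }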

C. Tabu

We use the tabu search implementation proposed by Zhu et al. [9]. The tabu implementation works similarly to 2opt in that it starts with an initial feasible solution and then continues to explore neighboring values by performing pairwise exchange of units. A tabu list is maintained to prevent getting stuck in local minima. If the best solution in a neighborhood resides in the tabu list, then it is eliminated from consideration for k subsequent iterations. The algorithm also implements an aspiration criterion that allows the tabu tenure of a solution to be overridden. The tabu tenure, aspiration criterion, and maximum number of iterations are all tunable parameters in our implementation. However, tuning of these parameters was not explored in this study.

D. Simulated Annealing

For this study, we developed a new CUDA implementation of simulated annealing for solving QAP. The implementation follows the standard simulated annealing heuristic. It explores the search space one neighborhood at a time and moves to a new neighborhood using the steepest-descent criterion. Exceptions to moving to a new location are made when the temperature threshold is reached; in that case a higher-valued neighbor is chosen.
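As an illustration of the acceptance step in the standard heuristic, a minimal device-side sketch under the usual Metropolis criterion is shown below; the curand usage and names are ours, not lifted from the implementation.

    #include <curand_kernel.h>

    // Sketch: standard simulated-annealing acceptance test. Improving
    // moves are always taken; an uphill (higher-cost) neighbor is
    // accepted with probability exp(-delta / temperature).
    __device__ bool accept_move(int delta, float temperature, curandState *rng)
    {
        if (delta < 0) return true;                     // steepest descent
        float p = expf(-(float)delta / temperature);    // Metropolis criterion
        return curand_uniform(rng) < p;                 // occasional uphill move
    }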
III. METHODOLOGY

In this section, we provide a brief description of the search space of control parameters.

A. Compiler optimization

The Nvidia nvcc compiler does not make many optimization flags visible to the user. In our framework, we include in the search space all visible compiler and ptxas flags from nvcc version 6.0 and make provisions for including more optimizations, should more flags be exposed in future versions. For the experiments in this paper, we set both the nvcc and ptxas optimization levels to -O3, which is the highest level available.

B. Thread configuration

In our implementations of the search algorithms, each thread performs a search from a unique initial starting position. Thus, the total number of threads across all blocks equals the number of initial solutions. The range of the initial solutions is determined by the quality of the solution produced. Prior studies have shown that fewer than 2^9 instances can hamper the solution quality while more than 2^14 instances start to produce diminishing returns [7]. Based on this result, we set the lower and upper bounds for initial solutions to 2^9 and 2^14, respectively.

The number of blocks is determined by evenly dividing the total number of threads. We also ensure that each block contains threads in multiples of the warp size. The maximum number of threads per block is further constrained by the maximum number of threads allowed per block on the target platform (1024 on the Tesla K20c). Thus, the thread configuration is a three-dimensional space that can be expressed as the set

$$T = \{(p, t, b) \mid P_{MIN} \le p \le P_{MAX} \ \text{and}\ p \bmod w_s = 0,\ \ w_s \le t \le t_{MAX} \ \text{and}\ t \bmod w_s = 0,\ \ b = p/t \ \text{where}\ p \bmod t = 0\} \quad (1)$$

where p = number of initial solutions (permutations), b = number of blocks, t = number of threads per block, w_s = warp size, t_MAX = maximum threads per block on the architecture, and P_MIN and P_MAX are the lower and upper bounds for the initial solutions. We generate values of p and t that meet the above constraints. The value of b is computed from p and t. The generated values are inserted into the CUDA source by our framework.
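A host-side sketch of how this space might be enumerated is given below. The bounds follow the values stated above (2^9 to 2^14 initial solutions, warp size 32, 1024 threads per block on the K20c), while the function and type names are illustrative.

    #include <vector>

    struct Config { int p, t, b; };   // initial solutions, threads/block, blocks

    // Enumerate all (p, t, b) triples satisfying the constraints of Eq. (1).
    std::vector<Config> enumerate_configs(int p_min = 512, int p_max = 16384,
                                          int ws = 32, int t_max = 1024)
    {
        std::vector<Config> space;
        for (int p = p_min; p <= p_max; p += ws)      // p mod ws == 0
            for (int t = ws; t <= t_max; t += ws)     // t mod ws == 0
                if (p % t == 0)                       // blocks divide evenly
                    space.push_back({p, t, p / t});   // b = p / t
        return space;
    }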

C. Register pressure

As mentioned earlier, implementations of all three search algorithms exhibit high degrees of register pressure. Since the number of required registers in a thread cannot be modified arbitrarily, we use the maxrregcount flag in nvcc to control the number of allocated registers, and thereby the register pressure, in each kernel. The number of allocated registers per thread is always chosen to be a multiple of 4, with a lower bound of 16 registers per thread and an upper bound enforced by the system (64K per block on the K20c).
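For example, a kernel could be compiled with a register cap of 32 along these lines; the source and binary names are placeholders, while the flags are standard nvcc options:

    nvcc -O3 -Xptxas -O3 --maxrregcount=32 -arch=sm_35 -o qap_search qap_search.cu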

D. Algorithm selection

We developed a CUDA code template that allows a specific algorithm to be invoked based on a runtime parameter. The I/O, setup, and clean-up code are identical for all variants. The only difference is in the kernel function invocation and the parameters that are passed into the kernel. In each iteration of the experiment, one of the four algorithmic variants is selected and the specific CUDA source is compiled with the requisite parameters.
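A skeleton of such a template might look as follows; the enum, kernel names, and signatures are hypothetical stand-ins for the actual variants.

    enum Algorithm { TWO_OPT, TWO_OPT_SHARED, TABU, ANNEAL };

    __global__ void twoOptKernel(const int *f, const int *d, int *best, int n);
    __global__ void twoOptSharedKernel(const int *f, const int *d, int *best, int n);
    __global__ void tabuKernel(const int *f, const int *d, int *best, int n);
    __global__ void annealKernel(const int *f, const int *d, int *best, int n);

    // Shared I/O, setup, and clean-up live in the caller;
    // only the kernel launch differs between variants.
    void run_search(Algorithm alg, int blocks, int threads,
                    const int *d_flow, const int *d_dist, int *d_best, int n)
    {
        switch (alg) {
        case TWO_OPT:
            twoOptKernel<<<blocks, threads>>>(d_flow, d_dist, d_best, n); break;
        case TWO_OPT_SHARED:
            twoOptSharedKernel<<<blocks, threads>>>(d_flow, d_dist, d_best, n); break;
        case TABU:
            tabuKernel<<<blocks, threads>>>(d_flow, d_dist, d_best, n); break;
        case ANNEAL:
            annealKernel<<<blocks, threads>>>(d_flow, d_dist, d_best, n); break;
        }
    }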

IV. RESULTS

A. Experimental Setup

We ran experiments on an Nvidia Tesla K20c GPU with compute capability 3.5. All code was compiled with nvcc version 6.0. The C portion of the codes was compiled with GCC version 4.8.2. Execution times and power measurements were collected using the Nvidia nvprof tool and the built-in power sensor of the Tesla K20c. A shell script was used to extract the relevant metrics from the nvprof output and generate CSV files, which were later used for analysis and visualization. 22 real-world data sets collected from QAPLIB [10] were used in the study.
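A representative invocation might look like the following; the binary and instance names are placeholders, while --metrics, --csv, and --log-file are standard nvprof options:

    nvprof --metrics l1_cache_local_hit_rate,dram_utilization --csv --log-file run.csv ./qap_search <qaplib-instance>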

B. Block Size

Fig. 2 presents four bi-plots that illustrate the effect of block size on the execution time and power consumption of the four algorithms. The numbers reported are speedup and power gain values, where speedup is the improvement in execution time and power gain is the improvement in power consumption, computed over the baseline version. For each algorithm, the baseline is a variant that yields the highest occupancy (for instance, a <208, 128> configuration with 26,624 total threads). However, the baseline is not the only configuration that leads to the highest occupancy; our data set contains other points of highest occupancy but with different block and grid sizes. Each bi-plot is divided into quadrants (for better visualization the quadrants are not set to the same size) based on whether the speedup and power gain values are greater than or less than one. Thus, the top-right quadrant represents instances where there is an improvement in both performance and power, while the bottom-left quadrant represents instances where there is a degradation in both dimensions.

First, we observe that except for a few cases, points are generally concentrated on the right half of the bi-plots. This indicates that for these codes there are many points, at different levels of occupancy, that do not impact power negatively. In fact, the particular baseline configuration that is chosen produces one of the worst power-performance points for all four algorithms.

Next, we notice that for tabu and anneal there are many points that land in the top-right quadrant, indicating improvements in both power and performance. This is somewhat counter-intuitive. Generally, techniques that lead to improved performance incur a cost of higher power consumption. On the other hand, power reduction techniques often sacrifice some performance. We further investigate this phenomenon in Section V.

For 2opt, increasing the block size beyond the baseline results in a decrease in performance, although the power consumption is reduced. For block sizes in the range 768-1024, there are several points that result in a reduction in power without a noticeable loss in performance. Thus, for this algorithm, block sizes in the range 768-1024 appear to provide the sweet spot. 2opt-shared exhibits similar patterns to 2opt but with a more pronounced degradation in performance at increased block sizes. Because of its shared memory usage, 2opt-shared suffers a more dramatic decrease in occupancy when the block size is increased. For example, the lowest occupancy for tabu in these experiments was 25%, whereas for 2opt-shared it was just 6%.

Overall, these experiments suggest that choosing block sizes much larger than the ones that provide the highest occupancy can be helpful for both performance and power. The exception is the implementation that utilizes shared memory, where performance drops significantly for larger block sizes due to low occupancy. However, even for 2opt-shared, if the goal is to reduce power consumption then a large block size may be chosen.

Fig. 2. Effect of block size on performance and power consumption. Bi-plots of speedup vs. power gain for (a) 2opt, (b) 2opt-shared, (c) tabu, and (d) anneal, with points grouped by block size range (32-64, 96-192, 224-480, 512-736, 768-1024).

C. Allocated Registers

Fig. 3 shows how the allocation of registers to a GPU kernel impacts performance and power consumption. For these experiments, each algorithmic variant was executed with a cap on the maximum registers per thread. The minimum cap was 16, which is the minimum number of registers required by nvcc to compile the kernels. The cap was progressively increased in increments of 4. The highest cap was set to 512. With a block size of 128 (our baseline version), a cap of 512 resulted in 64K registers per block, which is the maximum number of registers allowed on the Kepler GPU. Although the maximum cap was 512, we observed that the number of registers allocated by the compiler did not change beyond a cap of 128. Therefore, in Fig. 3 we only report numbers for caps in the range 16-128.

We observe that increasing the number of allocated registers has no significant impact on the power consumption of any of the algorithms. However, the number of allocated registers does influence the performance of tabu and anneal. For tabu, speedups greater than 1.5, with no increase in power consumption, are observed when the register cap is increased above 32. The speedups for anneal are somewhat lower at the same threshold, with slight improvements in power. 2opt also experiences small gains in performance with a more generous register allocation; there is about a 2% increase in power consumption for these improved performance points. 2opt-shared is not sensitive to the allocation of registers. This makes sense because for this algorithm the reuse in each thread is exploited via shared memory and not registers.

These results are consistent with the register pressure observed in the individual codes. Increasing the register threshold leads to larger performance gains for programs with higher register pressure. On the target GPU, a value greater than 32 starts to negatively affect the occupancy. Thus, these results corroborate earlier studies which have shown that it is better to run some codes at lower occupancy to ameliorate the ill-effects of inefficient memory usage [11].

Another interesting aspect is that the 36-40 range is less than the number of registers actually used by the nvcc compiler when there is no restriction on the number of registers to be allocated. In fact, when using the actual numbers from the compiler, the performance gains are somewhat diminished. These results suggest opportunities for tuning the register allocation heuristic in the nvcc compiler.

Fig. 3. Effect of register cap on performance and power consumption. Speedup vs. power gain for 2opt, 2opt-shared, tabu, and anneal, with points grouped by register cap range (16-24, 32-64, 72-96, 104-128).

D. Block Size and Register Pressure

To evaluate the interaction of block size and register pressure, we ran an experiment where each variant was executed with different block sizes (as in Section IV-B) under a constraint on the number of registers allocated per thread. As with our register allocation experiments, we started with a constraint of 16 and progressively increased it in increments of 4. Somewhat surprisingly, the numbers for the different constraints were very similar. Therefore, in Fig. 4, we only present results with a register constraint of 16.

Fig. 4. Effect of block size with register constraint on performance and power consumption. Bi-plots of speedup vs. power gain for (a) 2opt, (b) 2opt-shared, (c) tabu, and (d) anneal, with the same block size ranges as in Fig. 2.

As before, we illustrate the power-performance trade-offs using a series of bi-plots. We observe that the bi-plots in Fig. 4 exhibit very similar patterns to those seen in Fig. 2. For example, for tabu the highest speedup in the block size experiments was 2.52, whereas in these experiments the best speedup is very close to a factor of three. Similar gains are seen for anneal, while 2opt yields several points in the top-right quadrant. The values for 2opt-shared remain mostly unchanged.

These added performance gains can be mainly attributed to the increased occupancy observed in the codes. Since the number of allocated registers is no longer the dominant factor, the kernels (except for 2opt-shared) are able to achieve higher levels of occupancy than was possible in the block size experiments. However, this does not tell the whole story. As with the block size experiments, we observe significant variations in power-performance between points with the same occupancy level. Moreover, in the register allocation experiments, we observed that allocating a higher number of registers for a fixed-size thread block provides significant performance gains. The results in Fig. 4 show that much higher speedups and power gains can be achieved by kernels that use fewer registers. Thus, as an optimization strategy for power-performance, it would be beneficial to emphasize thread block size (but not necessarily occupancy) over increasing the register count.

V. ANALYSIS

To further investigate some of the non-intuitive results, we ran an additional set of experiments and conducted post-mortem analysis. This section describes our findings.

To discover why certain instances with the same occupancy levels showed improvements in both power and performance, we analyzed the experimental results in the following manner. We first re-ran all instances and collected all available CUDA metrics for each instance. The number of available metrics on the Kepler K20c is 115. Some of these metrics returned a value of zero or were not computed for the given codes (e.g., texture memory utilization) and were eliminated from the set. The remaining metrics constituted the feature space for all instances. We then applied a grid search feature selection technique to extract the most relevant metrics representative of all program instances. This feature set included 25 metrics. We then selected 32 instances from the data set that provide substantial improvements in both power and performance over the baseline version. We label these as good instances. These instances typically had larger block sizes (i.e., > 512) and were associated with code variants where the baseline incurred a high amount of spills (e.g., anneal with 1KB of spill allocation to local memory). We averaged the feature values for each instance and computed a relative feature score which represents the rate of change between two feature values.

Fig. 5 shows the average feature scores between good and baseline instances. As we can see, the L1 cache hit rate and the local memory throughput are the dominant factors leading to the improved performance. The savings in power are a direct result of the reduced DRAM utilization. As mentioned, the codes used in this study all exhibit high register pressure and hence incur substantial spill cost. For configurations where many blocks are scheduled on the same SM, the warp scheduler schedules warps from different thread blocks. Since the address spaces of thread blocks are not necessarily coalesced, these configurations suffer from poor locality, which results in higher miss rates for both the L1 and L2 caches. Furthermore, the reduction in DRAM traffic not only improves performance but also results in power savings.

Fig. 5. Comparison of performance characteristics (% change in feature score) between default and good instances.
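For concreteness, one natural definition of this relative feature score (ours, for illustration; the precise formula may differ) is the percentage change of the good-instance average of a metric m over the baseline average:

$$\mathrm{score}(m) \;=\; \frac{\bar{m}_{\mathrm{good}} - \bar{m}_{\mathrm{baseline}}}{\bar{m}_{\mathrm{baseline}}} \times 100$$

which would correspond to the "% change in feature score" axis of Fig. 5.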

VI. CONCLUSIONS

This paper studied the impact of thread block size and register allocation on the performance and power consumption of three heuristic search algorithms. The experimental results reveal several interesting aspects of the behavior of these codes with respect to changing those parameters. We observed significant variations in performance for codes executing at the same level of occupancy. The best power-performance trade-offs were not always achieved at the highest level of occupancy. The study also found that, for a given occupancy level, having larger thread blocks, and consequently fewer thread blocks per SM, generally leads to improvements in both performance and power consumption. This is a result of better locality in the L1 and L2 caches and reduced DRAM transactions. The improved locality is a consequence of coalesced accesses between warps within the same block.

REFERENCES

[1] Y. Abe, H. Sasaki, M. Peres, K. Inoue, K. Murakami, and S. Kato, "Power and performance analysis of GPU-accelerated systems," in Proceedings of the 2012 USENIX Conference on Power-Aware Computing and Systems, ser. HotPower'12. Berkeley, CA, USA: USENIX Association, 2012, pp. 10–10. [Online]. Available: http://dl.acm.org/citation.cfm?id=2387869.2387879
[2] A. Sethia, G. Dasika, M. Samadi, and S. Mahlke, "Apogee: Adaptive prefetching on GPUs for energy efficiency," in Parallel Architectures and Compilation Techniques (PACT), 2013 22nd International Conference on, Sept. 2013, pp. 145–156.
[3] A. Magni, C. Dubach, and M. F. P. O'Boyle, "A large-scale cross-architecture evaluation of thread-coarsening," in Proc. of the 2013 ACM/IEEE Conf. on Supercomputing, 2013.
[4] S. Unkule, C. Shaltz, and A. Qasem, "Automatic restructuring of GPU kernels for exploiting inter-thread data locality," in Proc. Int'l Conf. on Compiler Construction (CC'12), 2012, pp. 21–40.
[5] X. Cui, Y. Chen, C. Zhang, and H. Mei, "Auto-tuning dense matrix multiplication for GPGPU with cache," in Parallel and Distributed Systems (ICPADS), 2010 IEEE 16th International Conference on, 2010.
[6] N. Fauzia, L.-N. Pouchet, and P. Sadayappan, "Characterizing and enhancing global memory data coalescing on GPUs," in Code Generation and Optimization (CGO), 2015 IEEE/ACM International Symposium on, 2015.
[7] A. Chaparala, C. Novoa, and A. Qasem, "A SIMD solution for the quadratic assignment problem with GPU acceleration," in 3rd Annual XSEDE Conf., 2014.
[8] R. Burkard and F. Rendl, "A thermodynamically motivated simulation procedure for combinatorial optimization problems," European Journal of Operational Research, vol. 17, no. 2, pp. 169–174, 1984.
[9] W. Zhu, J. Curry, and A. Marquez, "SIMD tabu search for the quadratic assignment problem with graphics hardware acceleration," International Journal of Production Research, vol. 48, no. 4, pp. 1035–1047, 2010.
[10] "QAPLIB - a quadratic assignment problem library," http://anjos.mgi.polymtl.ca/qaplib/, 2014.
[11] V. Volkov and J. W. Demmel, "Benchmarking GPUs to tune dense linear algebra," in Proc. of the 2008 ACM/IEEE Conf. on Supercomputing, 2008.