Power-performance Analysis of Search Algorithms on the GPU

Tiffany Connors, Department of Computer Science, Texas State University, San Marcos, TX. Email: [email protected]
Apan Qasem, Department of Computer Science, Texas State University, San Marcos, TX. Email: [email protected]

Abstract—This paper presents a power-performance analysis of three metaheuristic search algorithms on the GPU. We investigate the impact of varying thread configurations with constraints on the maximum allowable registers per thread. The experimental results reveal that thread geometry can have a significant impact on both the performance and power consumption of the studied codes. Generally, larger block sizes with a low constraint on the number of allocated registers provide the best results. A particularly interesting outcome of this study is the discovery of thread and grid dimensions which yield improvements in both performance and power consumption. These improvements come from better data locality across thread blocks and a reduction in DRAM traffic.

Fig. 1. Compute and memory resource utilization of heuristic search algorithms on an Nvidia Tesla K20c GPU (percent utilization of compute and memory resources for 2opt, 2opt-shared, tabu, and anneal).

I. INTRODUCTION

The ability to deliver high degrees of parallelism through fine-grain multithreading has made GPUs a key component in today's HPC systems. GPUs can also offer higher performance-per-watt than their CPU counterparts for applications running at close to theoretical peak [1]. Although GPU architecture can provide energy efficiency for many HPC applications, the mapping of a particular application to a GPU can cause huge variations in power and performance. Recent investigations into the energy behavior of GPU codes have shown that not only are there significant variations, but the effect that certain optimizations have on power consumption can be starkly different on GPUs as compared to CPUs [2]. These results provide the motivation for further exploration of the energy-efficiency behavior of different classes of GPU applications.

This paper presents a power-performance analysis of a set of metaheuristic search algorithms on the GPU. Specifically, we are interested in examining how the number of registers allocated to a thread block and the number of threads assigned to it impact the execution time and power consumption of these algorithms. It has been shown that effectively managing the GPU thread hierarchy is instrumental in producing high-performing GPU codes [3], [4]. A thread configuration which results in higher occupancy can lead to improved performance; however, it can cause degradation if it inhibits the utilization of shared resources. Although the influence of thread geometry and the allocation of shared resources on performance has been studied previously [5], [6], the impact on power consumption and the associated trade-offs has not been explored.

For this study, we selected three widely used search algorithms: (i) 2opt, (ii) tabu search, and (iii) simulated annealing. All three algorithms exhibit regular memory access patterns and tend to be memory-bound for larger data sets. The compute and memory utilization of the three algorithms for one input instance is shown in Fig. 1. We specifically chose these algorithms because un-optimized implementations of these codes tend to exhibit high register pressure, requiring spills when the register capacity is set below 36. On the Tesla K20c GPU, the register pressure is high enough to restrict the thread block to a lower occupancy level, which leads to interesting power-performance trade-offs. Moreover, although each algorithmic variant has high register pressure, the relative register demands of the different algorithms vary, with simulated annealing requiring the most registers at 58 and 2opt requiring the fewest at 37. Thus, this set of codes and their variants gives us a good test-bench to explore the relationship between register allocation and thread configuration and their impact on power and performance.

This material is based upon work supported by the National Science Foundation through grant no. CNS-1305302 and CAREER award no. CNS-1253292. Equipment support was provided by Nvidia.
II. SEARCH ALGORITHMS

The three algorithms we use in this study follow a fairly simple scheme of task decomposition. The search space is divided into equal-sized regions using diversification. Each GPU thread implements a separate instance of the search and explores the assigned sub-space. In this study, we use the three algorithms to solve the Quadratic Assignment Problem (QAP) using real-world data sets from Operations Research. In this section, we provide an explanation of the QAP problem and briefly describe the implementation of the three algorithms.

A. Quadratic Assignment Problem (QAP)

The Quadratic Assignment Problem (QAP) is an NP-hard combinatorial optimization problem. The objective is to assign n units to n locations such that the total cost, measured as the sum of the products of flows between units and distances between locations, is minimized. QAP maps directly to the facility layout problem in manufacturing systems and plays an important role in the area of Operations Research. Although many heuristic algorithms have been proposed, finding an implementation that can efficiently solve a variety of problem instances has proven particularly challenging. This is because the performance of QAP implementations tends to be highly sensitive to both the size and the shape of the input data sets. This makes QAP an interesting problem for our study.
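Stated compactly (in our own notation, for illustration): given an n-by-n flow matrix f and distance matrix d, the goal is to find a permutation pi of the n units that minimizes the flow-distance product sum

$$\min_{\pi \in S_n} \; \sum_{i=1}^{n} \sum_{j=1}^{n} f_{ij}\, d_{\pi(i)\pi(j)}$$

where $\pi(i)$ is the location assigned to unit $i$.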
B. 2-Opt

As part of this study, we wanted to develop a relatively simple search heuristic and concentrate our efforts on learning about the GPU capabilities and how to exploit them. We use the 2-opt implementation described in [7]. This implementation starts with an initial feasible solution and explores its neighborhood. To get a single neighborhood solution, two positions i and j are randomly selected and a pairwise exchange of their content is performed. An O(n) cost function [8] is used to compute the delta in the objective function following an exchange. Once the cost of all neighboring solutions is obtained, it is checked against the best solution obtained thus far. If a better solution has been found, an update occurs. To start a new iteration, the new best solution replaces the old one. This process is repeated for a pre-determined number of iterations.

We also implemented a variant of 2opt called 2opt-shared, which attempts to exploit inter-thread data locality. In this variation, portions of the flow and distance matrices that comprise a neighborhood are copied into shared memory and the computation is adjusted accordingly. The idea is to restrict neighborhood explorations by each thread to shared memory, thereby reducing cost calculation time.
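For illustration, the sketch below shows one plausible device-side rendering of the O(n) delta computation for a pairwise exchange, assuming a symmetric QAP instance with zero diagonals; the names and data layout are ours, not taken from the actual kernels. In the 2opt-shared variant, the rows of flow and dist touched here would first be staged into __shared__ arrays.

    // Sketch: O(n) change in the QAP objective when the units at
    // positions i and j swap locations (symmetric instance assumed).
    __device__ int swap_delta(const int *flow, const int *dist,
                              const int *perm, int n, int i, int j)
    {
        int pi = perm[i], pj = perm[j];   // current locations of the two units
        int delta = 0;
        for (int k = 0; k < n; ++k) {
            if (k == i || k == j) continue;
            int pk = perm[k];
            // Contribution of unit k after the exchange minus before it.
            delta += (flow[i * n + k] - flow[j * n + k]) *
                     (dist[pj * n + pk] - dist[pi * n + pk]);
        }
        return 2 * delta;                 // symmetry: each pair counted twice
    }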

C. Tabu

We use the tabu search implementation proposed by Zhu et al. [9]. The tabu implementation works similarly to 2opt in that it starts with an initial feasible solution and then continues to explore neighboring values by performing pairwise exchange of units. A tabu list is maintained to prevent getting stuck in local minima. If the best solution in a neighborhood resides in the tabu list, then it is eliminated from consideration for k subsequent iterations. The algorithm also implements an aspiration criterion that allows the tabu tenure of a solution to be overridden. The tabu tenure, aspiration criterion, and maximum number of iterations are all tunable parameters in our implementation. However, tuning of these parameters was not explored in this study.

D. Simulated Annealing

For this study, we developed a new CUDA implementation of simulated annealing for solving QAP. The implementation follows the standard simulated annealing heuristic. It explores the search space one neighborhood at a time and moves to a new neighborhood using the steepest-descent criterion. Exceptions to moving to a new location are made when the temperature threshold is reached; in that case a higher-valued neighbor is chosen.
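As an illustration of the acceptance step in the standard heuristic, a minimal device-side sketch under the usual Metropolis criterion is shown below; the curand usage and names are ours, not lifted from the implementation.

    #include <curand_kernel.h>

    // Sketch: standard simulated-annealing acceptance test. Improving
    // moves are always taken; an uphill (higher-cost) neighbor is
    // accepted with probability exp(-delta / temperature).
    __device__ bool accept_move(int delta, float temperature, curandState *rng)
    {
        if (delta < 0) return true;                     // steepest descent
        float p = expf(-(float)delta / temperature);    // Metropolis criterion
        return curand_uniform(rng) < p;                 // occasional uphill move
    }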
III. METHODOLOGY

In this section, we provide a brief description of the search space of control parameters.

A. Compiler optimization

The Nvidia nvcc compiler does not make many optimization flags visible to the user. In our framework, we include in the search space all visible compiler and ptxas flags from nvcc version 6.0 and make provisions for including more optimizations, should more flags be exposed in future versions. For the experiments in this paper, we set both the nvcc and ptxas optimization levels to -O3, which is the highest level available.

B. Thread configuration

In our implementations of the search algorithms, each thread performs a search from a unique initial starting position. Thus, the total number of threads across all blocks equals the number of initial solutions. The range of the initial solutions is determined by the quality of the solution produced. Prior studies have shown that fewer than 2^9 instances can hamper the solution quality while more than 2^14 instances start to produce diminishing returns [7]. Based on this result, we set the lower and upper bounds for initial solutions to 2^9 and 2^14, respectively.

The number of blocks is determined by evenly dividing the total number of threads. We also ensure that each block contains threads in multiples of the warp size. The maximum number of threads per block is further constrained by the maximum number of threads allowed per block on the target platform (1024 on the Tesla K20c). Thus, the thread configuration is a three-dimensional space that can be expressed as the set

$$T = \{(p, t, b) \mid P_{MIN} \le p \le P_{MAX} \ \text{and}\ p \bmod w_s = 0,\ \ w_s \le t \le t_{MAX} \ \text{and}\ t \bmod w_s = 0,\ \ b = p/t \ \text{where}\ p \bmod t = 0\} \quad (1)$$

where p = number of initial solutions (permutations), b = number of blocks, t = number of threads per block, w_s = warp size, t_MAX = maximum threads per block on the architecture, and P_MIN and P_MAX are the lower and upper bounds for the initial solutions. We generate values of p and t that meet the above constraints. The value of b is computed from p and t. The generated values are inserted into the CUDA source by our framework.
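A host-side sketch of how this space might be enumerated is given below. The bounds follow the values stated above (2^9 to 2^14 initial solutions, warp size 32, 1024 threads per block on the K20c), while the function and type names are illustrative.

    #include <vector>

    struct Config { int p, t, b; };   // initial solutions, threads/block, blocks

    // Enumerate all (p, t, b) triples satisfying the constraints of Eq. (1).
    std::vector<Config> enumerate_configs(int p_min = 512, int p_max = 16384,
                                          int ws = 32, int t_max = 1024)
    {
        std::vector<Config> space;
        for (int p = p_min; p <= p_max; p += ws)      // p mod ws == 0
            for (int t = ws; t <= t_max; t += ws)     // t mod ws == 0
                if (p % t == 0)                       // blocks divide evenly
                    space.push_back({p, t, p / t});   // b = p / t
        return space;
    }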

C. Register pressure

As mentioned earlier, implementations of all three search algorithms exhibit high degrees of register pressure. Since the number of required registers in a thread cannot be modified arbitrarily, we use the maxrregcount flag in nvcc to control the number of allocated registers, and thereby the register pressure, in each kernel. The number of allocated registers per thread is always chosen to be a multiple of 4, with a lower bound of 16 registers per thread and an upper bound enforced by the system (64K per block on the K20c).
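For example, a kernel could be compiled with a register cap of 32 along these lines; the source and binary names are placeholders, while the flags are standard nvcc options:

    nvcc -O3 -Xptxas -O3 --maxrregcount=32 -arch=sm_35 -o qap_search qap_search.cu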

D. Algorithm selection

We developed a CUDA code template that allows a specific algorithm to be invoked based on a runtime parameter. The I/O, setup, and clean-up code are identical for all variants. The only difference is in the kernel function invocation and the parameters that are passed into the kernel. In each iteration of the experiment, one of the four algorithmic variants is selected and the specific CUDA source is compiled with the requisite parameters.
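A skeleton of such a template might look as follows; the enum, kernel names, and signatures are hypothetical stand-ins for the actual variants.

    enum Algorithm { TWO_OPT, TWO_OPT_SHARED, TABU, ANNEAL };

    __global__ void twoOptKernel(const int *f, const int *d, int *best, int n);
    __global__ void twoOptSharedKernel(const int *f, const int *d, int *best, int n);
    __global__ void tabuKernel(const int *f, const int *d, int *best, int n);
    __global__ void annealKernel(const int *f, const int *d, int *best, int n);

    // Shared I/O, setup, and clean-up live in the caller;
    // only the kernel launch differs between variants.
    void run_search(Algorithm alg, int blocks, int threads,
                    const int *d_flow, const int *d_dist, int *d_best, int n)
    {
        switch (alg) {
        case TWO_OPT:
            twoOptKernel<<<blocks, threads>>>(d_flow, d_dist, d_best, n); break;
        case TWO_OPT_SHARED:
            twoOptSharedKernel<<<blocks, threads>>>(d_flow, d_dist, d_best, n); break;
        case TABU:
            tabuKernel<<<blocks, threads>>>(d_flow, d_dist, d_best, n); break;
        case ANNEAL:
            annealKernel<<<blocks, threads>>>(d_flow, d_dist, d_best, n); break;
        }
    }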

IV. RESULTS

A. Experimental Setup

We ran experiments on an Nvidia Tesla K20c GPU with compute capability 3.5. All code was compiled with nvcc version 6.0. The C portion of the codes was compiled with GCC version 4.8.2. Execution times and power measurements were collected using the Nvidia nvprof tool and the built-in power sensor of the Tesla K20c. A shell script was used to extract the relevant metrics from the nvprof output and generate CSV files, which were later used for analysis and visualization. 22 real-world data sets collected from QAPLIB [10] were used in the study.
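A representative invocation might look like the following; the binary and instance names are placeholders, while --metrics, --csv, and --log-file are standard nvprof options:

    nvprof --metrics l1_cache_local_hit_rate,dram_utilization --csv --log-file run.csv ./qap_search <qaplib-instance>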

B. Block Size

Fig. 2 presents four bi-plots that illustrate the effect of block size on the execution time and power consumption of the four algorithms. The numbers reported are speedup and power gain values, where speedup is the improvement in execution time and power gain is the improvement in power consumption, computed over the baseline version. For each algorithm, the baseline is a variant that yields the highest occupancy (for instance, a <208, 128> configuration with 26,624 total threads). However, the baseline is not the only configuration that leads to the highest occupancy; our data set contains other points of highest occupancy but with different block and grid sizes. Each bi-plot is divided into quadrants (for better visualization the quadrants are not set to the same size) based on whether the speedup and power gain values are greater than or less than one. Thus, the top-right quadrant represents instances where there is an improvement in both performance and power, while the bottom-left quadrant represents instances where there is a degradation in both dimensions.

First, we observe that except for a few cases, points are generally concentrated on the right half of the bi-plots. This indicates that for these codes there are many points, at different levels of occupancy, that do not impact power negatively. In fact, the particular baseline configuration that is chosen produces one of the worst power-performance points for all four algorithms.

Next, we notice that for tabu and anneal there are many points that land in the top-right quadrant, indicating improvements in both power and performance. This is somewhat counter-intuitive. Generally, techniques that lead to improved performance incur a cost of higher power consumption. On the other hand, power reduction techniques often sacrifice some performance. We further investigate this phenomenon in Section V.

For 2opt, increasing the block size beyond the baseline results in a decrease in performance, although the power consumption is reduced. For block sizes in the range 768-1024, there are several points that result in a reduction in power without a noticeable loss in performance. Thus, for this algorithm, block sizes in the range 768-1024 appear to provide the sweet spot. 2opt-shared exhibits similar patterns to 2opt but with a more pronounced degradation in performance at increased block sizes. Because of its shared memory usage, 2opt-shared suffers a more dramatic decrease in occupancy when the block size is increased. For example, the lowest occupancy for tabu in these experiments was 25%, whereas for 2opt-shared it was just 6%.

Overall, these experiments suggest that choosing block sizes much larger than the ones that provide the highest occupancy can be helpful for both performance and power. The exception is the implementation that utilizes shared memory, where performance drops significantly for larger block sizes due to low occupancy. However, even for 2opt-shared, if the goal is to reduce power consumption then a large block size may be chosen.

Fig. 2. Effect of block size on performance and power consumption. Bi-plots of speedup vs. power gain for (a) 2opt, (b) 2opt-shared, (c) tabu, and (d) anneal, with points grouped by block size range (32-64, 96-192, 224-480, 512-736, 768-1024).

C. Allocated Registers

Fig. 3 shows how the allocation of registers to a GPU kernel impacts performance and power consumption. For these experiments, each algorithmic variant was executed with a cap on the maximum registers per thread. The minimum cap was 16, which is the minimum number of registers required by nvcc to compile the kernels. The cap was progressively increased in increments of 4. The highest cap was set to 512. With a block size of 128 (our baseline version), a cap of 512 resulted in 64K registers per block, which is the maximum number of registers allowed on the Kepler GPU. Although the maximum cap was 512, we observed that the number of registers allocated by the compiler did not change beyond a cap of 128. Therefore, in Fig. 3 we only report numbers for caps in the range 16-128.

We observe that increasing the number of allocated registers has no significant impact on the power consumption of any of the algorithms. However, the number of allocated registers does influence the performance of tabu and anneal. For tabu, speedups greater than 1.5, with no increase in power consumption, are observed when the register cap is increased above 32. The speedups for anneal are somewhat lower at the same threshold, with slight improvements in power. 2opt also experiences small gains in performance with a more generous register allocation; there is about a 2% increase in power consumption for these improved performance points. 2opt-shared is not sensitive to the allocation of registers. This makes sense because for this algorithm the reuse in each thread is exploited via shared memory and not registers.

These results are consistent with the register pressure observed in the individual codes. Increasing the register threshold leads to larger performance gains for programs with higher register pressure. On the target GPU, a value greater than 32 starts to negatively affect the occupancy. Thus, these results corroborate earlier studies which have shown that it is better to run some codes at lower occupancy to ameliorate the ill-effects of inefficient memory usage [11].

Another interesting aspect is that the 36-40 range is less than the number of registers actually used by the nvcc compiler when there is no restriction on the number of registers to be allocated. In fact, when using the actual numbers from the compiler, the performance gains are somewhat diminished. These results suggest opportunities for tuning the register allocation heuristic in the nvcc compiler.

Fig. 3. Effect of register cap on performance and power consumption. Speedup vs. power gain for 2opt, 2opt-shared, tabu, and anneal, with points grouped by register cap range (16-24, 32-64, 72-96, 104-128).

D. Block Size and Register Pressure

To evaluate the interaction of block size and register pressure, we ran an experiment where each variant was executed with different block sizes (as in Section IV-B) under a constraint on the number of registers allocated per thread. As with our register allocation experiments, we started with a constraint of 16 and progressively increased it in increments of 4. Somewhat surprisingly, the numbers for the different constraints were very similar. Therefore, in Fig. 4, we only present results with a register constraint of 16.

Fig. 4. Effect of block size with register constraint on performance and power consumption. Bi-plots of speedup vs. power gain for (a) 2opt, (b) 2opt-shared, (c) tabu, and (d) anneal, with the same block size ranges as in Fig. 2.

As before, we illustrate the power-performance trade-offs using a series of bi-plots. We observe that the bi-plots in Fig. 4 exhibit very similar patterns to those seen in Fig. 2. For example, for tabu the highest speedup in the block size experiments was 2.52, whereas in these experiments the best speedup is very close to a factor of three. Similar gains are seen for anneal, while 2opt yields several points in the top-right quadrant. The values for 2opt-shared remain mostly unchanged.

These added performance gains can be mainly attributed to the increased occupancy observed in the codes. Since the number of allocated registers is no longer the dominant factor, the kernels (except for 2opt-shared) are able to achieve higher levels of occupancy than was possible in the block size experiments. However, this does not tell the whole story. As with the block size experiments, we observe significant variations in power-performance between points with the same occupancy level. Moreover, in the register allocation experiments, we observed that allocating a higher number of registers for a fixed-size thread block provides significant performance gains. The results in Fig. 4 show that much higher speedups and power gains can be achieved by kernels that use fewer registers. Thus, as an optimization strategy for power-performance, it would be beneficial to emphasize thread block size (but not necessarily occupancy) over increasing the register count.

V. ANALYSIS

To further investigate some of the non-intuitive results, we ran an additional set of experiments and conducted post-mortem analysis. This section describes our findings.

To discover why certain instances with the same occupancy levels showed improvements in both power and performance, we analyzed the experimental results in the following manner. We first re-ran all instances and collected all available CUDA metrics for each instance. The number of available metrics on the Kepler K20c is 115. Some of these metrics returned a value of zero or were not computed for the given codes (e.g., texture memory utilization) and were eliminated from the set. The remaining metrics constituted the feature space for all instances. We then applied a grid search feature selection technique to extract the most relevant metrics representative of all program instances. This feature set included 25 metrics. We then selected 32 instances from the data set that provide substantial improvements in both power and performance over the baseline version. We label these as good instances. These instances typically had larger block sizes (i.e., > 512) and were associated with code variants where the baseline incurred a high amount of spills (e.g., anneal with 1KB of spill allocation to local memory). We averaged the feature values for each instance and computed a relative feature score which represents the rate of change between two feature values.

Fig. 5 shows the average feature scores between good and baseline instances. As we can see, the L1 cache hit rate and the local memory throughput are the dominant factors leading to the improved performance. The savings in power are a direct result of the reduced DRAM utilization. As mentioned, the codes used in this study all exhibit high register pressure and hence incur substantial spill cost. For configurations where many blocks are scheduled on the same SM, the warp scheduler schedules warps from different thread blocks. Since the address spaces of thread blocks are not necessarily coalesced, these configurations suffer from poor locality, which results in higher miss rates for both the L1 and L2 caches. Furthermore, the reduction in DRAM traffic not only improves performance but also results in power savings.

Fig. 5. Comparison of performance characteristics (% change in feature score) between default and good instances.
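For concreteness, one natural definition of this relative feature score (ours, for illustration; the precise formula may differ) is the percentage change of the good-instance average of a metric m over the baseline average:

$$\mathrm{score}(m) \;=\; \frac{\bar{m}_{\mathrm{good}} - \bar{m}_{\mathrm{baseline}}}{\bar{m}_{\mathrm{baseline}}} \times 100$$

which would correspond to the "% change in feature score" axis of Fig. 5.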

VI. CONCLUSIONS

This paper studied the impact of thread block size and register allocation on the performance and power consumption of three heuristic search algorithms. The experimental results reveal several interesting aspects of the behavior of these codes with respect to changing those parameters. We observed significant variations in performance for codes executing at the same level of occupancy. The best power-performance trade-offs were not always achieved at the highest level of occupancy. The study also found that, for a given occupancy level, having larger thread blocks, and consequently fewer thread blocks per SM, generally leads to improvements in both performance and power consumption. This is a result of better locality in the L1 and L2 caches and reduced DRAM transactions. The improved locality is a consequence of coalesced accesses between warps within the same block.

REFERENCES

[1] Y. Abe, H. Sasaki, M. Peres, K. Inoue, K. Murakami, and S. Kato, "Power and performance analysis of GPU-accelerated systems," in Proceedings of the 2012 USENIX Conference on Power-Aware Computing and Systems, ser. HotPower'12. Berkeley, CA, USA: USENIX Association, 2012, pp. 10–10. [Online]. Available: http://dl.acm.org/citation.cfm?id=2387869.2387879
[2] A. Sethia, G. Dasika, M. Samadi, and S. Mahlke, "Apogee: Adaptive prefetching on GPUs for energy efficiency," in Parallel Architectures and Compilation Techniques (PACT), 2013 22nd International Conference on, Sept. 2013, pp. 145–156.
[3] A. Magni, C. Dubach, and M. F. P. O'Boyle, "A large-scale cross-architecture evaluation of thread-coarsening," in Proc. of the 2013 ACM/IEEE Conf. on Supercomputing, 2013.
[4] S. Unkule, C. Shaltz, and A. Qasem, "Automatic restructuring of GPU kernels for exploiting inter-thread data locality," in Proc. Int'l Conf. on Compiler Construction (CC'12), 2012, pp. 21–40.
[5] X. Cui, Y. Chen, C. Zhang, and H. Mei, "Auto-tuning dense matrix multiplication for GPGPU with cache," in Parallel and Distributed Systems (ICPADS), 2010 IEEE 16th International Conference on, 2010.
[6] N. Fauzia, L.-N. Pouchet, and P. Sadayappan, "Characterizing and enhancing global memory data coalescing on GPUs," in Code Generation and Optimization (CGO), 2015 IEEE/ACM International Symposium on, 2015.
[7] A. Chaparala, C. Novoa, and A. Qasem, "A SIMD solution for the quadratic assignment problem with GPU acceleration," in 3rd Annual XSEDE Conf., 2014.
[8] R. Burkard and F. Rendl, "A thermodynamically motivated simulation procedure for combinatorial optimization problems," European Journal of Operational Research, vol. 17, no. 2, pp. 169–174, 1984.
[9] W. Zhu, J. Curry, and A. Marquez, "SIMD tabu search for the quadratic assignment problem with graphics hardware acceleration," International Journal of Production Research, vol. 48, no. 4, pp. 1035–1047, 2010.
[10] "QAPLIB - a quadratic assignment problem library," http://anjos.mgi.polymtl.ca/qaplib/, 2014.
[11] V. Volkov and J. W. Demmel, "Benchmarking GPUs to tune dense linear algebra," in Proc. of the 2008 ACM/IEEE Conf. on Supercomputing, 2008.