Priority-Based Task Management in a GPGPU Megakernel

Bernhard Kerbl∗
Supervised by: Markus Steinberger†

Institute for Computer Graphics and Vision
Graz University of Technology
Graz / Austria

∗[email protected]  †[email protected]

Abstract

In this paper, we examine the challenges of implementing priority-based task management with respect to user-defined preferential attributes on a graphics processing unit (GPU). Previous approaches usually rely on constant synchronization with the CPU to determine the proper chronological sequence for execution. We transfer the responsibility for evaluating and arranging planned tasks to the GPU, where they are continuously processed in a persistent kernel. By implementing a dynamic work queue with segments of variable size, we introduce possibilities for gradually improving the overall order and identify the meta data required to avoid write-read conflicts in a massively parallel environment. We employ an autonomous controller module to allow queue management to run concurrently with task execution. We revise Batcher's bitonic merge sort and show its eligibility for sorting partitioned work queues. The performance of the algorithm for tasks with constant execution time is evaluated with respect to multiple setups and priority types. An implementation of a Monte Carlo Ray-Tracing engine is presented in which we use appropriate priority functions to direct the available resources of the GPU to areas of interest. The results show increased precision for the desired features at early rendering stages, demonstrating the selective preference established by our management system.

Keywords: GPGPU, megakernel, GPU sorting, priority-based task management, dynamic work queue, Monte Carlo Ray-Tracing

1 Introduction

Parallel computing has become a valuable tool for handling time-consuming procedures that contain a high number of homogeneous, independent operations. Since the rise of the Unified Shader Model, which features explicit programmability of GPUs, exploiting data level parallelism on a customary personal computer has become increasingly straightforward. The advent of NVIDIA's GeForce 8 Series introduced the first release of the CUDA architecture and associated compilers for industry standard programming languages. Subject to certain restrictions, the architecture enables programmers to run general purpose computations on compatible devices in parallel. Newer models of GPUs supporting these features are therefore often referred to as general purpose graphics processing units (GPGPUs). Several hundred Stream Processors (SPs), also referred to as CUDA cores, which are grouped in clusters called Streaming Multiprocessors (SMs), can be instructed to execute commands at the same time; the programmer simply specifies the number of necessary threads when launching a GPGPU kernel. The combined capacities of these SPs have produced a significant gap in raw speed between high-end GPUs and CPUs [12]. However, using the available resources to their full potential is not an easy task. Memory latencies, insufficient parallelization or unfavorable occupancy at runtime may cause implementations to fall short of their ideal behavior.

One particular cause of poor efficiency in CUDA kernels is unbalanced work load distribution, which can have a limiting effect on performance, especially for problems that exhibit irregular behavior, e.g. adaptive ray-tracing techniques [2]. Instead of relying on the built-in CUDA work distribution units, custom task management strategies can be implemented to address these issues [4]. One specific example is given by the OptiX Ray-Tracing Engine and its dynamically load-balanced GPU execution model [14].

In this context, the term megakernel refers to a solution where improved work load distribution is achieved using a persistent kernel in order to benefit from uninterrupted computational activity. One or more queues are commonly used to store tasks that are constantly being fetched and executed until the queues are empty. For static work queues, tasks can only be added in between megakernel launches, while dynamic implementations also allow for the insertion of new tasks at runtime [4].
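To make the megakernel pattern concrete, the following CUDA sketch shows a minimal persistent worker loop. It is an illustration under simplifying assumptions rather than the system described in this paper; the Task and Queue types and the runTask function are invented for the example.

    // Minimal persistent-kernel sketch: launched blocks loop until the queue
    // is drained. For brevity, every thread fetches its own task; the system
    // in this paper instead assigns whole thread blocks to tasks.
    struct Task { int type; int data; };

    struct Queue {
        Task* entries;  // preallocated array of queued tasks
        int*  head;     // index of the next entry to fetch
        int*  tail;     // one past the last inserted entry
    };

    __device__ bool dequeue(Queue& q, Task* out) {
        int idx = atomicAdd(q.head, 1);       // claim the frontmost entry
        if (idx >= *q.tail) return false;     // nothing left to fetch
        *out = q.entries[idx];
        return true;
    }

    __device__ void runTask(const Task& t) { /* dispatch on t.type */ }

    __global__ void megakernel(Queue q, int totalTasks) {
        Task t;
        while (*q.head < totalTasks)          // persist until all work is claimed
            if (dequeue(q, &t))
                runTask(t);
    }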

When using a megakernel for processing diverse problems and the implied task level parallelism, assigning meaningful priorities to each task can have a positive effect. Especially procedures that involve rendering may experience a considerable boost in usability through proper task classification. Considering applications with guaranteed frame rates, the perceived quality of each frame can be improved by focusing on areas that exhibit prominent features [9]. High quality initial views of rendered scenes can be generated by directing the resources of the GPU based on priorities. In an interactive environment with a modifiable view model, early perception of the visible dataset enables efficient adjustment of the extrinsic parameters in order to obtain a desired setup. Similarly, prioritizing user interactions achieves faster response to input commands and postpones time-consuming rendering procedures that would otherwise be wasted on an inadequate setup.

Previous approaches based on priorities commonly exploit the sophisticated sorting methods available on the CPU and establish means for communicating the favored order of execution to the GPU [10, 6]. Moving the responsibility for organizing the runtime agenda to the GPU can be expected to eliminate the otherwise significant overhead of inter-component communication. We therefore aim to provide a functional, autonomous task management system for dynamic work queues on the GPU, in order to enable adaptive behavior for CUDA applications without interrupting the execution of tasks.

2 Previous Work

While analyzing the influence of possible limiting factors on ray-tracing kernels, Aila and Laine experimentally bypassed the default CUDA work distribution units, suggesting that this approach might show an improvement for tasks with varying durations [2]. In their implementation, they use a work queue from which tasks are constantly fetched and executed in individual thread blocks, which proved to be very effective. A detailed analysis of possible techniques for using work queues in a GPGPU kernel was authored by Cederman and Tsigas, addressing sophisticated load balancing schemes which are made possible through the atomic hardware primitives supported by the CUDA architecture [4].

In [14], Parker et al. successfully employed a self-provided strategy for balancing work load with the goal of improving efficiency in a ray-tracing engine using a persistent megakernel. Furthermore, they use static prioritization to achieve fine-grained internal scheduling for tasks to avoid penalties resulting from state divergence.

A recent example of a priority-based solution that achieves GPU task management in real-time is TimeGraph [10]. The routine relies on interrupts and substantial communication between the GPU and the CPU, which acts as an executive supervisor. A similar system was implemented by Chen et al., in which queues are shared and accessed regularly by the CPU and GPU [6]. In recent developments, Kainz et al. employed prioritization in a rendering application by sorting work queues between kernel launches to achieve improved richness of detail for predetermined frame rates [9].

Efficient sorting algorithms for parallel systems, especially those targeting GPGPUs, have become a popular research topic. Following the implementations of Batcher's bitonic merge sort, which was developed for parallel sorting networks [3], other algorithms designed for use on massive datasets were adopted for the GPU as well [5, 7, 13, 15].

3 GPU Megakernel System

3.1 Available Resources and Challenges

We target dynamic GPGPU task management by introducing a global, self-organizing work queue in a megakernel system which allows for arbitrary tasks to be added at runtime. Apart from the functions to be invoked upon execution of a task, each queue entry provides additional attributes that are considered during different stages of the management process. Sorting the queue is a time-critical problem, since the megakernel system constantly removes the frontmost entries and processes the associated behavior in the available SPs to achieve proper workload distribution.

Due to the intended autonomy of our management system, we cannot rely on the CPU to rearrange queue entries based on their priorities. Instead, one thread block is reserved and used as a controlling unit which is responsible for sorting the queue at runtime. The remaining thread blocks are henceforth referred to as the working module to clarify their intended purpose. Since both modules constantly process the shared contents of the queue, leaving them unprotected would cause interference and lead to severe runtime errors. Therefore, we require secure methods for classifying queue entries and deciding whether they are safe to be processed by either module.

3.2 Work Queue Segmentation

Since the contents of a dynamic work queue are potentially volatile, we partition the queue and thereby enable faster detection of segments that are unlikely to change in the near future. Consequently, the controller can quickly select segments that are not yet in use and perform sorting while the working module constantly removes entries from segments at the front of the queue. Treating segments as instances of classes gives us the advantage of storing additional information about the contained queue entries as attributes, such as a reference to the task with the shortest execution time or the current availability.

When using segments to restructure the work queue, we identify two additional constraints in order to avoid access conflicts. First, only full segments qualify as appropriate candidates for sorting, which leads to entries at the back being ignored if they are located in a partially populated segment. Second, since we must not tamper with the entries of segments whose contents are currently being executed, we cannot sort the tasks at the very front, which leads to inevitable disarray for the leading segments.
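As a concrete picture of such a segment class, the following sketch lists plausible attributes; the field names are assumptions for illustration and are not taken from the actual implementation.

    // Hypothetical per-segment meta data for the partitioned work queue.
    struct Segment {
        int  first;   // offset of the segment's first entry in the work queue
        int  size;    // number of entries a full segment holds
        int  state;   // availability: 0 free, >0 accesses taken by workers,
                      // <0 locked by the controller (see Section 3.3)
        bool sorted;  // set once the controller has visited the segment
        unsigned long long shortestTask; // shortest measured task duration in
                                         // clock cycles (used in Section 5)
    };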

Considering these limitations and their effects on kernel execution, we provide settings to adapt the behavior of the controller as required. The user may select a specific segment size for a particular purpose: while a smaller size yields finer granularity and reduces the number of neglected entries both at the front and at the back of the queue, it incurs high organizational overhead and increases the time needed for establishing a decent order. A bigger segment size may improve the sorting performance for a higher number of queue entries, but at the same time larger chunks of the queue will be ignored due to incomplete segments.

3.3 Mutual Exclusion

Even though we established that unused segments can be quickly identified by the controller using work queue partitioning alone, we need to consider overlapping accesses, since sorting algorithms are time-consuming as well and may take even longer than task execution. Dealing with these situations requires an unambiguous, system-wide regulation for deciding which module is currently allowed to access a segment. We decided to introduce a thread-safe policy for checking and setting the current availability state of a segment, combining two common approaches in a multipurpose attribute variable.

The CUDA instruction set provides means to change the contents of the attribute variable atomically. Each SP in the system tests the state of a particular segment before accessing its contents. For the working module, the state variable behaves like a semaphore, allowing a specified number of accesses before resetting it to the initial zero value. If the controller requires the contents of an available segment, it exchanges the current state with a negative integer. Threads in the working module that are trying to acquire the contents of segments that are being sorted will perform a busy wait until the controller signals completion of the sorting procedure by assigning a positive value to the variable.
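Such a combined semaphore/lock can be pictured with CUDA's atomic primitives as follows; the protocol details and function names below are assumptions, not the authors' published code.

    // Sketch of the multipurpose state variable: 0 = initial, positive =
    // number of entries claimed by workers, negative = controller lock.
    __device__ bool controllerTryLock(int* state) {
        // Lock only a segment no worker has started to consume.
        return atomicCAS(state, 0, -1) == 0;
    }

    __device__ void controllerUnlock(int* state) {
        atomicExch(state, 0);  // non-negative value signals completion
    }

    __device__ int workerTryFetch(int* state, int capacity) {
        int seen = *(volatile int*)state;
        if (seen < 0 || seen >= capacity)  // being sorted, or exhausted
            return -1;                     // caller busy-waits and retries
        // Claim slot 'seen' only if the state did not change in the meantime.
        if (atomicCAS(state, seen, seen + 1) == seen)
            return seen;
        return -1;
    }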

4 Sorting the Queue

4.1 Bitonic Merge Sort

Our controller utilizes an adaptation of the bitonic merge sort to rearrange the contents of the work queue. The algorithm was devised by Ken Batcher for parallel sorting networks as an alternative to the odd-even merge sort and operates by constructing bitonic sequences of increasing lengths and merging them to generate sorted output [3]. A simple setup demonstrating the procedure for applying the algorithm in parallel is illustrated in Figure 1. For an array of length N and T concurrent threads, the algorithm requires O((N/T) · log² N) parallel comparison operations. Popular fields of application include collision detection and visibility calculation in particle systems [11]. The most efficient GPU implementations of the bitonic merge sort were able to outperform the sophisticated std::sort methods on contemporary CPUs [15]. Recently developed algorithms for sorting on the GPU are more commonly based on radix or merge sort, although some frameworks implement hybrid variants in order to exploit the characteristics of bitonic sequences [13, 7, 16].

Figure 1: One possible implementation of the bitonic merge sort using one thread per element index. The colored rectangles indicate different comparison operators being used for evaluating whether two values should be exchanged. By forming bitonic sequences in each step, the merge phases are applied consecutively until the input is sorted in descending order.

4.2 Benefits

We aim to provide the user with the possibility of choosing custom priority values. The bitonic merge sort is comparison based and is therefore applicable to any data type that supports the logical operators < and >.

For smaller data sets, the bitonic merge sort is one of the fastest algorithms if the underlying architecture is optimally exploited [8]. Ideally, the thread block size equals the number of elements to be sorted. In such a case, N/T = 1.0 and the algorithm requires exactly log² N parallel steps to execute.

Concatenating two arrays sorted in opposite orders yields a bitonic sequence by definition, which corresponds to the input of the final top-level merge phase of the bitonic merge sort. Based on this property, we can reduce the number of necessary comparison operations significantly when sorting two presorted arrays by inverting one array and appending it to the other, thereby creating a bitonic sequence. For fusing two sorted arrays of length N/2, the worst-case complexity can thus be expressed as O((N/T) · log N).
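Written out in CUDA, the procedure of Figure 1 takes the compact form below: a textbook in-block bitonic sort of N key-index pairs in descending order with one thread per element, assuming N is a power of two and the data has been staged in shared memory. It illustrates the principle rather than the authors' optimized variant.

    // Textbook in-block bitonic merge sort, descending order, one thread
    // per element; 'key' and 'idx' are assumed to reside in shared memory.
    __device__ void bitonicSort(float* key, int* idx, int N) {
        int tid = threadIdx.x;
        for (int k = 2; k <= N; k <<= 1) {           // bitonic sequence length
            for (int j = k >> 1; j > 0; j >>= 1) {   // comparison distance
                int partner = tid ^ j;
                if (tid < N && partner > tid) {
                    // Alternating directions per k-block form bitonic
                    // sequences; the final merge yields descending order.
                    bool desc = (tid & k) == 0;
                    if (desc ? key[tid] < key[partner]
                             : key[tid] > key[partner]) {
                        float tk = key[tid]; key[tid] = key[partner]; key[partner] = tk;
                        int   ti = idx[tid]; idx[tid] = idx[partner]; idx[partner] = ti;
                    }
                }
                __syncthreads();
            }
        }
    }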

4.3 Using Bitonic Merge Sort on Segments

We use an optimized bitonic merge sort to rearrange segments of the work queue sequentially in successive passes. For each pass, we need to acquire available segments and apply the bitonic algorithm to their combined data set. The key-index pairs are retrieved from the work queue using the segment indices as offsets for addressing the associated tasks. We identified two basic qualities that the controller should exhibit when sorting segments:

• Constant refinement: the accuracy of the resulting sequence should increase with the number of passes

• Effectiveness: high-priority work packages should advance towards the front as soon as possible

The basic idea for improving the overall order is to compare each segment with its predecessor iteratively. For a work queue with N segments of size S, we need N − 1 passes to move the S most important tasks to the leading segment. This approach, though basic, continually improves the order in the queue. If we reach a segment where no predecessor can be acquired, we reinitiate the procedure starting from the back, which is illustrated in Figure 2.

Figure 2: A pair-wise sorting algorithm applied to the segments in a queue. Neighboring segments are combined to achieve gradual restructuring of the contents, starting from the back and advancing towards the front. If a segment is encountered that is currently unavailable, the algorithm restarts from the back.

The resulting accuracy of the sort is proportional to the number of passes performed. Obtaining a fully sorted list would require a total of (N² + N)/2 passes, but approximate sorting with an emphasis on prioritizing important tasks over regular ones is sufficient to induce adaptive behavior. Also, as mentioned in Section 4.2, partially sorted input can be processed much faster. This property translates well to the segment-based approach: by monitoring a boolean member variable that is true for segments that are revisited, we can decide whether it is sufficient to apply a top-level bitonic merge. We consider three different combinations of segments and provide an optimized sorting method for each of the following pairings, dispatched as sketched below:

• Unsorted – Unsorted

• Unsorted – Sorted or Sorted – Unsorted

• Sorted – Sorted
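A controller could dispatch over these pairings as follows; the three sorting routines are assumed to exist (declarations only), and the structure is our illustration of the idea rather than the paper's implementation.

    // Hypothetical dispatch over the three segment pairings; 'sorted'
    // is the boolean member variable mentioned above.
    __device__ void topLevelMerge(Segment&, Segment&);   // both halves sorted
    __device__ void mergeWithResort(Segment&, Segment&); // one half sorted
    __device__ void fullBitonicSort(Segment&, Segment&); // no order known

    __device__ void sortSegmentPair(Segment& a, Segment& b) {
        if (a.sorted && b.sorted)
            topLevelMerge(a, b);     // single merge phase suffices (Section 4.2)
        else if (a.sorted || b.sorted)
            mergeWithResort(a, b);   // sort the unsorted half, then merge
        else
            fullBitonicSort(a, b);   // full bitonic merge sort over both
        a.sorted = b.sorted = true;  // revisited segments take the fast path
    }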

5 Time Management

5.1 Motivation

The system described thus far is capable of managing queued tasks based on their priorities without assistance from external components. However, due to the necessary synchronization of the controller and the working module by mutual exclusion, a significant delay is added to the total megakernel execution time whenever an SP in the working module transitions into a state of busy waiting, caused by segments being unavailable while they are being sorted. We target this issue by implementing a management strategy that uses time-based regulations to avoid collisions between the working module and the controller when accessing the work queue. The chosen approach requires the collection and maintenance of related meta data for task duration and sorting performance.

5.2 Avoiding Collisions

Usually, the primary objective of a megakernel is to exploit the resources of the GPU. Hence, we do not put any additional constraints on accesses made by the working module. Instead, the controller performs advanced sanity checks before locking two segments for sorting. Following the algorithm described in Section 4.3, we select two qualified segments. The time needed for performing the bitonic merge sort on the selected segments is stored in time_sorting. We then proceed to probe the anterior sections of the queue: segments in between the chosen candidates for sorting and those that are currently being executed are regarded as time buffers. The required amount of time for the working module to process a buffer segment is estimated by the execution time of the shortest contained task. The respective variables are read for each segment and accumulated from back to front in a second variable, time_buffers. Once the condition time_buffers ≥ time_sorting is fulfilled, we abort the traversal stage and initiate the bitonic merge sort procedure. If the available time is not sufficient for sorting, the procedure is reinitiated at the back of the queue to prevent possible collisions.
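The probing stage can be pictured as a backward accumulation over the buffer segments; the sketch below assumes the hypothetical Segment layout from Section 3.2 and is not the authors' exact routine.

    // Controller-side sketch: starting behind the pair chosen for sorting,
    // walk towards the front and accumulate the estimated buffering effect
    // until it covers the expected sorting time.
    __device__ bool enoughTimeToSort(const Segment* seg, int firstBuffer,
                                     int lastExecuting,
                                     unsigned long long timeSorting) {
        unsigned long long timeBuffers = 0;
        for (int i = firstBuffer; i > lastExecuting; --i) {  // back to front
            timeBuffers += seg[i].shortestTask;              // pessimistic estimate
            if (timeBuffers >= timeSorting)
                return true;                                 // safe to start sorting
        }
        return false;  // insufficient buffers: restart at the back of the queue
    }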

5.3 Collecting Meta Data

5.3.1 Task Execution

For each task, the working module records the time needed for execution and compares it to previous results by default. It is necessary to monitor these values constantly for automatic runtime detection, since the execution time of a task may fluctuate considerably and cannot be predicted without error. Whenever a new task is added to the work queue, the corresponding segment updates its estimated buffering effect based on these measurements. In order to minimize the probability of collisions, we choose the pessimistic approach and exclusively store the shortest duration measured so far for each task.

5.3.2 Sorting Methods

For each of the three sorting modes, we measure and update the number of clock cycles needed by the respective method whenever it is invoked. The time_sorting variable can thus be estimated more accurately by the controller, based on the orderliness of tasks in the segments that are selected for sorting. As opposed to task execution, measurements are discarded if they are lower than previous results, although sorting performance is less likely to change over time. This approach further reduces the likelihood of collisions in the work queue, since the longest duration is always assumed for each sorting method.

5.4 Custom Timing Settings

As an alternative to the default pessimistic behavior, we provide settings for the advanced user to customize the internal time management. A more flexible strategy can be achieved by defining a global multiplier for the time buffers. If a task is known to have a reliable average duration or requires a fixed number of clock cycles defined by its inherent complexity, the automatic runtime detection may be disabled selectively. Instead, a static value can be provided to indicate how many clock cycles should be assumed for execution.

For projects where priority sorting is of utmost importance, meta data acquisition may be disabled for all tasks. The user can then define a constant global variable that is substituted for each buffer segment.
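The measurements described above map naturally onto CUDA's cycle counter and atomic minimum/maximum updates. The sketch below is an assumption of how this could look; note that 64-bit atomicMin/atomicMax require compute capability 3.5 or higher, so an implementation on older devices would need a different update scheme.

    // Sketch of meta data collection: keep the shortest task duration
    // (Section 5.3.1) and the longest sorting duration (Section 5.3.2).
    __device__ void executeAndMeasure(const Task& t, Segment& s) {
        long long start = clock64();               // SM cycle counter
        runTask(t);
        unsigned long long cycles = (unsigned long long)(clock64() - start);
        atomicMin(&s.shortestTask, cycles);        // pessimistic: shortest seen
    }

    __device__ void recordSortTime(unsigned long long* methodTime,
                                   unsigned long long cycles) {
        atomicMax(methodTime, cycles);             // pessimistic: longest seen
    }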
Figure 3: Comparison of performance for the tested sorters with the standard CUDPP radix sort for unsigned integers. Our implementation of the bitonic merge sort yields almost constant results for setups where the number of elements does not exceed the number of threads used.

6 Results

6.1 Bitonic Merge Sort Performance

The performance of the bitonic merge sort in our system depends on two factors, namely the segment size and the number of active threads. The segment size can be modified in order to achieve a desired granularity for the sort. Furthermore, the selected block size for a kernel also defines the dimension of the controller. The algorithm was therefore evaluated using a representative selection of settings. In order to compare its potential efficiency to existing sophisticated routines, the optimized bitonic merge sort was executed using a single thread block for sorting a limited number of keys outside of the megakernel system. This procedure is equivalent to a sorting pass of the controller module and thus models the expected behavior for sorting two segments. We chose thread block dimensions and array lengths as powers of 2. Thread block dimensions range from the common CUDA warp size of 32 threads to the largest possible block size of 512 threads. Array lengths start at 32 and end at 1024 elements, which equates to double the maximum segment size in our megakernel system. The tests were conducted using a GeForce GTX 560 Ti. A visualization of the parametrized setups and the resulting runtimes for sorting unsigned integers can be found in Figure 3. We compared our results with the recorded times from the CUDPP test suite [1].

Even for our most unfavorable setting, with 32 threads sorting an array of 1024 values, the optimized bitonic merge sort beat the CUDPP radix sort by a little over 100 µs (∼20%). For more balanced setups, our implementation was up to 56 times faster in comparison. We would like to point out that the CUDPP project targets larger data sets and is indeed very potent for lengthy input [16]. Considering the non-linear growth rates, the emergent trend suggests that with increasing array sizes the bitonic merge sort will eventually be bested. However, given that we intend to frequently sort pairs of segments containing 512 tasks or less using only one thread block, it appears to be a very suitable solution.

Figure 4: Algorithm performance with a fixed task duration of 100 µs. Starting at 10,000 tasks, we observe a steady decline in accuracy. This suggests that although the negative influence of unreachable segments is constantly reduced, the performance eventually drops due to insufficient time buffers.

Figure 5: Evaluation of the algorithm performance for complex tasks with runtimes exceeding 1 ms. For smaller thread block sizes, we observe an early decline due to the increased effort for sorting a high number of entries with fewer threads. The highest recorded score of 95% is obtained using 512 threads.
6.2 Sorting the Queue at Runtime

We evaluated the performance of the presented strategy for sorting the queue segment-wise in our megakernel system. Since the duration of a megakernel is implicitly determined by the tasks it executes, we instead assessed the accuracy of our management system for a given number of tasks. We used blocking tasks to occupy thread blocks for a specified number of clock cycles. The segment size for the work queue was set to 256 tasks in order to balance granularity and sorting efficiency. Since both modules in the megakernel start simultaneously, the leading segments are immediately locked by the working module and can never be processed by the controller, which makes it impossible to achieve a fully sorted queue at runtime. Regarding these constraints, we estimated the actual performance by rating each executed task based on its predecessors. We enabled automatic runtime detection and stored the priorities of executed tasks chronologically in a list of entries L, which was subsequently evaluated. Ideally, we anticipate a descending sequence of values where an entry at index i (starting from 0) has i previous entries representing tasks with higher priorities. Hence, the score for each tested setup with the specified number of tasks n is defined as follows:

S(n) = 1/(n − 1) · Σ_{i=1}^{n−1} ( (1/i) · Σ_{j=0}^{i−1} f(i, j) ),   where f(a, b) = 1 if L(b) ≥ L(a) and 0 otherwise

Assuming worst case conditions for our test setup, the queue was initiated with task priorities in ascending order, which would yield a total score of 0. The results for a given number of entries where each associated task took at least 100 µs are illustrated in Figure 4. The accuracy of the order in which they were executed peaked at 74% when using a block size of 512 threads for 10,000 tasks. Lower values preceding the apex of each function were caused by the statistical influence of the reserved leading segments, which were not sorted before execution.

Due to device-related delays between fetching and processing, tasks may not be executed precisely in the same order as they are ranked in the queue. The resulting effect caused slightly lower scores when using 512 threads compared with a block size of 256 threads for a higher number of tasks. For time-consuming procedures, such as those found in elaborate ray-tracing engines, we considered the results after raising the blocking interval to 1 ms (see Figure 5). With the highest possible number of threads, we achieve a maximum score of 95% for 80,000 entries. Based on these results, we can conclude that the accuracy of the order of execution increases with the complexity of the planned tasks, since the prolonged execution time can be utilized to issue additional sorting passes.
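For reference, S(n) can be transcribed directly into host code; the snippet below is our transcription of the formula above, not code from the paper.

    #include <vector>

    // Score of an executed-priority list L: for each task i, count the
    // fraction of its i predecessors with a priority at least as high,
    // then average over all tasks (1.0 for a perfectly descending list).
    float score(const std::vector<float>& L) {
        int n = (int)L.size();
        float sum = 0.0f;
        for (int i = 1; i < n; ++i) {
            int higher = 0;
            for (int j = 0; j < i; ++j)
                if (L[j] >= L[i]) ++higher;
            sum += (float)higher / i;
        }
        return sum / (n - 1);
    }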
6.3 Adaptive Monte Carlo Ray-Tracing

We demonstrate the effects of using custom priorities in a progressive Monte Carlo Ray-Tracing engine and two different scenes. The engine evaluates 512 paths per pixel at a resolution of 800×600 and checks intersections with objects iteratively. Pixels are grouped into patches of 4×4, and each patch is assigned to a designated task instead of using one task for each pixel. The reduced number of work queue entries to be sorted improves the efficiency of the task management system. A task that is executed computes two random traversal paths for each pixel in a patch and evaluates the new priority based on different strategies, two of which are presented in this section.

Tasks repeatedly append themselves to the end of the queue until all rays for the associated pixels have been cast. We provide representative snapshots of both scenes after executing ∼60% of all planned tasks. In order to illustrate the associated progress, we use heat maps to indicate the number of rays cast for each pixel patch.

Figure 6: Snapshot of a scene containing objects whose eventual surface colors are influenced by illumination and reflection. Prioritizing patches with high accumulated intensity leads to selective rendering of light sources and white surfaces, which enables a faster perception of details in these areas.

In our first test case, the color values returned by rays for each pixel were simply accumulated and the image was then rendered using maximum-to-white mapping. The examined scene shows the interior of a softly lit room containing reflecting objects and light sources. We used our task management system to focus on high intensity color values. This eventually led to a global preference for brighter areas. Figure 6 shows the scene at an intermediate stage. We can clearly discern the objects that are emitting or reflecting light and observe the increased detail for the left sphere and the ceiling. The priority for each patch was calculated using priority = Σ_{i=1}^{pixels} color_i.
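Accumulating such a patch priority in parallel requires one atomic addition per contributing pixel. The sketch below illustrates both priority functions used in this section (the second one appears in the next test case) and assumes one thread per traced pixel; it is an illustration of the strategy, not the engine's code.

    // Illustrative accumulation of patch priorities; patchPriority is
    // assumed to be reset before each re-enqueued pass over the patch.
    __device__ void addIntensityPriority(float* patchPriority, float color) {
        atomicAdd(patchPriority, color);            // first test case: sum of colors
    }

    __device__ void addDifferencePriority(float* patchPriority,
                                          float color, float colorOld) {
        atomicAdd(patchPriority, color - colorOld); // second test case: convergence
    }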

Figure 7: Using difference values as priorities leads to a preference for patches with low convergence rates. We observe more rays being cast early on for reflections of the spheres on the floor, since the surface generates diverse color values depending on the direction of the rebounded rays.

For our second example, the output was normalized at each pixel using the heat map data. The selected setup allowed for the detection of pixel patches with low convergence rates and the prioritization of these areas to achieve enhanced initial views. A large portion of the scene could be neglected at first due to an opaque, unlit wall at the far end of the room (see Figure 7). In Figure 8, we emphasize the improved image quality compared with uniform rendering by magnifying the affected regions of images generated 210 ms after initiation. The applied priority formula can be expressed as priority = Σ_{i=1}^{pixels} (color_i − color_old,i).

Figure 8: We compare the conventional approach of distributing rays uniformly with the priority-based procedure. The left side shows an early closeup of the scene using the default method and exhibits more noise than the image on the right, which features smooth color reflections as a result of proper task management.

We used the normalized output images from the second test case to assess the mean squared error (MSE) for default and priority-based rendering when compared to the ground truth. Even though evaluating the priority formula required expensive atomic operations, we noticed a clear reduction of the MSE at each point in time (see Figure 9).

7 Conclusion

We presented a priority-based task management system with elaborate timing strategies to enable adaptive behavior for GPGPU programs. We successfully incorporate a controller module in a megakernel system and prevent it from interfering with the continuous execution of queued tasks. We eliminate inter-component communication and the associated overhead by sorting the contents of our dynamic work queue at runtime using a designated thread block. By optimizing the bitonic merge sort algorithm, we establish a basis for iteratively rearranging the segments of the queue. We evaluate the effectiveness of the sorting algorithm by invoking large numbers of tasks and comparing performance for tasks with different durations. For more complex tasks, we reach promising scores regarding the order of execution even in unfavorable setups. A Monte Carlo Ray-Tracing engine running in our system shows adaptive behavior and demonstrates how prioritization in a rendering application can speed up the assessment of desired features in a scene. Adaptive rendering offers an extensive field of research regarding possible priority functions and their impact on image quality. Since the effects of complex formulas do not necessarily outweigh the corresponding overhead, the development of new, efficient priorities requires profound research and sophisticated methodology.

Proceedings of CESCG 2012: The 16th Central European Seminar on Computer Graphics (non-peer-reviewed) −4 x 10 MSE Comparison for Rendering Methods graphics co-processor sorting for large database 5 Default Rendering Priority Rendering management. In Proceedings of the 2006 ACM SIG- 4.5 MOD international conference on Management of 4 data, SIGMOD ’06, pages 325–336, New York, NY, 3.5 USA, 2006. ACM.

3 [8] Mihai F. Ionescu. Optimizing parallel bitonic sort. 2.5 MSE In Proceedings of the 11th International Symposium 2 on Parallel Processing, IPPS ’97, pages 303–309, 1.5 Washington, DC, USA, 1997. IEEE Computer So- 1 ciety.

0.5 [9] Bernhard Kainz, Markus Steinberger, Stefan

0,5 1,0 2,0 3,0 4,0 5,0 6,0 Hauswiesner, Rostislav Khlebnikov, and Dieter Time (s) Schmalstieg. Stylization-based ray prioritization Figure 9: We calculate the mean squared error by subtract- for guaranteed frame rates. In Proceedings of ing snapshots of the rendered scene generated at regular the ACM SIGGRAPH/Eurographics Symposium intervals from the ground truth and evaluating the differ- on Non-Photorealistic Animation and Rendering, ence of pixel values. Prioritization of pixel patches that NPAR ’11, pages 43–54, New York, NY, USA, return ambiguous color values leads to improved results 2011. ACM. for our second test case. [10] Shinpei Kato, Karthik Lakshmanan, Ragunathan Ra- jkumar, and Yutaka Ishikawa. Timegraph: Gpu scheduling for real-time multi-tasking environments. References In Proceedings of the 2011 USENIX conference on USENIX annual technical conference, USENIX- [1] Cudpp: Cuda data-parallel primitives library, 2012. ATC’11, Berkeley, CA, USA, 2011. USENIX Asso- http://gpgpu.org/developer/cudpp. ciation.

[2] Timo Aila and Samuli Laine. Understanding the ef- [11] Peter Kipfer, Mark Segal, and Rudiger¨ Wester- ficiency of ray traversal on gpus. In Proceedings mann. Uberflow: a gpu-based particle engine. of the Conference on High Performance Graphics In HWWS ’04: Proceedings of the ACM SIG- 2009, HPG ’09, pages 145–149, New York, NY, GRAPH/EUROGRAPHICS conference on Graphics USA, 2009. ACM. hardware, pages 115–122, New York, NY, USA, 2004. ACM Press. [3] K. E. Batcher. Sorting networks and their appli- cations. In Proceedings of the April 30–May 2, [12] David B. Kirk and Wen-mei W. Hwu. Program- 1968, spring joint computer conference, AFIPS ’68 ming Massively Parallel Processors: A Hands-on (Spring), pages 307–314, New York, NY, USA, Approach. Morgan Kaufmann Publishers Inc., San 1968. ACM. Francisco, CA, USA, 1st edition, 2010. GPU Gems 3 [4] Daniel Cederman and Philippas Tsigas. On [13] Hubert Nguyen. . Addison-Wesley Pro- dynamic load balancing on graphics proces- fessional, first edition, 2007. sors. In Proceedings of the 23rd ACM SIG- [14] Steven G. Parker, James Bigler, Andreas Dietrich, GRAPH/EUROGRAPHICS symposium on Graphics Heiko Friedrich, Jared Hoberock, David Luebke, hardware, GH ’08, pages 57–64, Aire-la-Ville, David McAllister, Morgan McGuire, Keith Morley, Switzerland, Switzerland, 2008. Eurographics Austin Robison, and Martin Stich. Optix: A general Association. purpose ray tracing engine. ACM Transactions on Graphics, August 2010. [5] Daniel Cederman and Philippas Tsigas. On sorting and load balancing on gpus. SIGARCH Comput. Ar- [15] Matt Pharr and Randima Fernando. GPU Gems chit. News, 36:11–18, June 2009. 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation (Gpu [6] Long Chen, Oreste Villa, Sriram Krishnamoorthy, Gems). Addison-Wesley Professional, 2005. and Guang R. Gao. Dynamic load balancing on single- and multi-gpu systems. In IPDPS, pages 1– [16] Nadathur Satish, Mark Harris, and Michael Garland. 12. IEEE, 2010. Designing efficient sorting algorithms for manycore gpus. NVIDIA Technical Report NVR-2008-001, [7] Naga Govindaraju, Jim Gray, Ritesh Kumar, and NVIDIA Corporation, September 2008. Dinesh Manocha. Gputerasort: high performance
