Execution Of MIMD MIPSEL Assembly Programs Within CUDA/OpenCL GPUs
Henry G. Dietz and Frank Roberts
University of Kentucky, Department of Electrical and Computer Engineering

Abstract—Earlier work demonstrated that general MIMD-parallel programs could be transformed to be efficiently interpreted within a CUDA GPU. Unfortunately, the quirky split-stack instruction set used to make the GPU interpreter efficient meant that only a specially-constructed C-subset compiler could be used with the system.

In this paper, a new system is described that can directly use a complete, production-quality, compiler toolchain such as GCC. The toolchain is used to compile a MIMD application program into a standard assembly language – currently, MIPSEL. This assembly code is then processed by a series of transformations that convert it into a new instruction set that manages GPU local memory as registers. From this code, an optimizing assembler generates a customized interpreter in either NVIDIA CUDA or portable OpenCL.

I. INTRODUCTION

Traditional SIMD architectures always have been very easy to scale, but clock rates generally plummet as fanout becomes large. GPUs (Graphics Processing Units) solve the classic SIMD problem by implementing a group of loosely-coupled, relatively-narrow SIMD engines instead of a huge synchronous one. For example, ClearSpeed's more conventional CSX700 SIMD chip [1] reached a maximum clock rate of 250MHz, whereas contemporary NVIDIA [2] and AMD [3] GPUs ran at twice that rate. The potential efficiency of GPUs is further increased by avoiding hardware-intensive features that do not increase peak arithmetic performance. For example, features like interrupt handling, various forms of memory protection, and large caches would all reduce peak performance per unit circuit complexity.

The problem is that the resulting more scalable, but more complex and restrictive, execution model is difficult to program. The goal of the research reported here is making GPUs able to efficiently run parallel programs that were developed targeting MIMD systems using either shared memory or message-passing communication via MPI [4].

The concept of MIMD execution on SIMD hardware was at best a curiosity until the early 1990s. At that time, large-scale SIMD machines were widely available and, especially using architectural features of the MasPar MP1 [5], a number of researchers began to achieve reasonable efficiency. For example, Wilsey, et al., [6] implemented a MasPar MP1 interpreter for a toy instruction set called MINTABS. Our first MIMD interpreter running on the MasPar MP1 [7] achieved approximately 1/4 the theoretical peak native distributed-memory SIMD speed while supporting a full-featured shared-memory MIMD programming model. Earlier versions of our MOG (MIMD On GPU) environment have demonstrated similar efficiency on CUDA GPUs using carefully tuned interpreters [8].

Given that the MOG concept has been proven viable, the work presented in this paper is more focused on making MOG practical. Beyond reimplementing and making minor improvements to the best methods discovered in earlier research, the primary contributions are:

• Rather than processing a stack assembly language, the new MOG system uses an accumulator/register instruction set. Both assembly languages needed somewhat unusual features in order to obtain good efficiency. However, management of the very limited low-latency memory resources is mapped into an apparently conventional register allocation problem in the new instruction set, rather than the unusual problem of explicit movement of data between local and global portions of a split stack. This makes the new instruction set much more compatible with existing compiler backends without imposing a significant runtime performance penalty.

• Whereas the old versions required compilers to be retargeted, the new version alternatively allows existing compiler toolchains to be used unchanged. The system described here converts MIPSEL assembly code into the new MOG assembly language, thus allowing any existing toolchain generating MIPSEL code to be used. The method has been tested with both LLVM [9] and GCC.

• Earlier versions of the MOG system exclusively targeted NVIDIA CUDA GPUs. The MOG system described in this paper targets both NVIDIA CUDA and portable OpenCL [10]. It is worthwhile noting that although OpenCL is intended to be vendor neutral and is supported by both NVIDIA and AMD/ATI GPUs, writing code to be portable between GPUs from different vendors and efficient on all requires very careful use of OpenCL.

The new instruction set architecture most directly exposes the key issues, and is presented in Section II. Translation of MIPSEL assembly code is described in Section III. Section IV briefly discusses the MOG assembler and interpreter structure – more details about how and why MOG interpreters work can be found in our LCPC 2009 paper [8]. Conclusions are given in Section V.
II. INSTRUCTION SET ARCHITECTURE

Earlier MOG systems were very closely tied to the properties of the NVIDIA CUDA GPUs they targeted. In contrast, the new MOG system explicitly is designed to be able to be efficiently implemented by both NVIDIA CUDA GPUs and OpenCL on any of a wide variety of hardware including GPUs from both NVIDIA and AMD/ATI. Thus, it is useful to view the latest MOG as a true instruction set architecture offering many implementation choices. There is an abstract model of the hardware environment for each PE (processing element) and an instruction set specification including both assembly language and bit-level encoding. This is the compiler target for portable MOG.

A. PE Hardware Environment

Each PE as we count them is actually a virtual PE inside a GPU, generally not a dedicated block of physical hardware. The number of virtual PEs is not trivially derived from the number of physical PEs, but is computed as a function of the number of SIMD engines and various constraints that together determine the optimal degree of multithreading. The smallest GPUs contain at least 256 MOG PEs and the largest GPUs could contain more than 64K MOG PEs. A cluster of nodes each containing a GPU would multiply the PE count by the number of nodes, easily reaching millions of PEs.
For most purposes, each MOG PE behaves exactly like a conventional processor running a sequential user-level program. It apparently executes independently of both the other PEs and the host computer's processor. Although the architectural model could support fully general MIMD execution, the current compilation toolchain does require the text of the program to be a single image shared by all PEs: a MIMD variant more precisely known as SPMD (Single Program, Multiple Data). Each PE may apparently asynchronously take its own path through the program, but all PEs share the same program.

Whereas conventional processors handle system calls by entering a protected execution mode provided by the processor's hardware, MOG PEs have no such support. Indeed, most GPU hardware has no mechanism by which it can initiate anything like a system call, nor can it access I/O devices – other than the video display it drives. Our solution is to hand-off any such system call to be done by the host processor. A GPU PE cannot directly interrupt the host processor, but it can drop a message to the host in a place that both can access. The primary way a PE can attract the host processor's attention is to cleanly terminate the GPU kernel, thus causing the host to examine the message buffer and act as it directs. This is a very slow system call mechanism, but fully general, and system calls are fairly slow even within conventional processors.

All of this structure becomes much clearer when viewing Figure 1, which shows the logical structure of a MOG PE's hardware environment. Each of the logical function blocks is color-coded by approximate access speed from blue (fast) to red (slow). In addition, each is marked with the associated implementation constructs in CUDA and OpenCL. The following paragraphs briefly explain the function of each block.

Accumulator. Both CUDA and OpenCL name and use GPU PE registers as the top level of the memory hierarchy. The only programmer-visible register in this function unit is the accumulator, but a number of interpreter-internal registers also are placed here.

REGs. What CUDA calls shared memory and OpenCL calls local memory is ideally the patch of memory within a SIMD engine. Although it can be accessed by any virtual PEs within that SIMD engine, GPU hardware imposes a performance penalty for violating hardware banking constraints. In MOG, this memory is partitioned along bank boundaries and used to hold the programmer-visible PE registers.

CPOOL and TEXT. The constant pool (CPOOL) and program text (TEXT) together make up the program code. The reason they are two separate structures is a word-size difference; the CPOOL constants are 32-bit words, while instructions are just 16-bits long to maximize bandwidth utilization. Note that this is a Harvard architecture, with separate memories for code and data, so addresses in the TEXT are given in units of instructions, not bytes. In CUDA, both can be naturally implemented as 1D textures. Unfortunately, OpenCL images (the equivalent to textures) only support 2D or 3D addressing. Thus, the OpenCL version currently marks both as being stored in constant memory.
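As an illustration of the banked register layout just described under REGs, the C sketch below strides one warp's registers so that every lane of a warp touches a distinct bank. This is not the actual MOG source: WARPSIZE and NREGS are illustrative values, and in the real interpreter the lane index is implicit in the thread identity rather than an explicit parameter.

    /* Minimal sketch of the strided register file: register r of every PE
       in a warp lands in a different bank, so a warp-wide access to
       "register r" is conflict-free. */
    #define WARPSIZE 32
    #define NREGS    8

    static unsigned regs[NREGS * WARPSIZE];  /* one warp's slice of shared/local memory */

    static inline unsigned *REGI(unsigned lane, unsigned r)
    {
        return &regs[(r * WARPSIZE) + lane]; /* stride keeps lanes in separate banks */
    }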
DATA. The DATA for all PEs is kept in global memory. The global memory is roughly 100X higher latency than the registers, and has banking constraints similar to those for shared/local memory. Thus, a layout respecting the banking is used to keep all PE references within the correct bank. The preferred data memory layout treats memory as a three-dimensional array of 32-bit datum_t values: mem[NPROC / WARPSIZE][MEMSIZE][WARPSIZE], in which NPROC is the number of logical processing elements (assumed to be a multiple of WARPSIZE, which in turn is assumed to be a power-of-two multiple of the number of memory banks). One might have expected that a two-dimensional layout with mem[MEMSIZE][NPROC] would suffice, but that would significantly complicate address arithmetic because NPROC is not necessarily a power of 2 and might not even be a compile-time constant. In fact, the CUDA compilation system does not handle the constant power-of-two stride of WARPSIZE*sizeof(datum_t) any better, but explicitly using shift and mask operations on pointer offsets implements the desired addressing without multiply and modulus operations. One final complication is that although the DATA memory is banked for 4-byte word access, addresses are given as byte addresses and both byte and half-word operations are supported.

SYSBUF. Passing data to and from system calls is done using a separately-allocated SYSBUF space. Interpreter start-up and shut-down code handles copying between strided system buffers in the DATA space and the unit-stride SYSBUF; copying is not done by the host nor the PEs per se. Thus, the PEs operate on data using the stride that is most efficient for both CUDA and OpenCL and the host uses its native unit stride. Further, on appropriate hardware, both CUDA and OpenCL have the ability to allocate memory for SYSBUF that may be slower to access, but can be directly accessed by both the PEs and host. Even if the host and PEs cannot share access without using explicit copy operations, using a separate space in global memory means that only the SYSBUF structure needs to be copied – not various discontinuous portions of the DATA space.
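The shift-and-mask DATA addressing described above can be sketched in a few lines of C. This is a minimal sketch under assumed names and sizes: the array shown is one warp's slab of the real mem[NPROC/WARPSIZE][MEMSIZE][WARPSIZE] layout, and only the 32-bit word variant is shown.

    #include <stdint.h>

    typedef uint32_t datum_t;
    #define WARPSIZE 32                       /* assumed power of two */
    #define LOG2WARP 5
    #define MEMSIZE  1024                     /* words of DATA per PE (illustrative) */

    static datum_t mem[MEMSIZE * WARPSIZE];   /* one warp's slab of the 3D array */

    /* 32-bit word load for warp lane `lane` at PE-local byte address `addr`:
       the low two address bits are ignored to force alignment, and the word
       index is scaled by WARPSIZE using a shift rather than a multiply. */
    static inline datum_t MEMW(unsigned lane, uint32_t addr)
    {
        uint32_t word = addr >> 2;            /* byte address -> aligned word index */
        return mem[(word << LOG2WARP) | (lane & (WARPSIZE - 1))];
    }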

[Figure 1 here. Blocks and their implementations: Accumulator (CUDA register / OpenCL register); REGs (CUDA shared / OpenCL local); CPOOL and TEXT (CUDA 1D texture / OpenCL constant); DATA (CUDA global / OpenCL global); SYSBUF (CUDA global/host / OpenCL global/host); SYS Call Handler (host), signaled by the PEs at kernel end.]

Figure 1. PE Hardware Environment and CUDA/OpenCL Implementations

SYS Call Handler. The host processor is not a processing element in our architectural model, but essentially a dedicated system call handler: a very intelligent DMA controller for interactions with various I/O devices that the PEs cannot directly access. These devices range from host main memory and disk drives to system functions and resources on other nodes in a cluster.

A PE requesting a host action will experience some delay in having its request processed. However, the significance of that delay is very dependent on the type of request being made. For example, waiting a millisecond to initiate a disk I/O request that will take 10ms to complete causes negligible loss of performance. In contrast, waiting a millisecond to send a message from one PE to another might be unacceptable. There also is a tradeoff between delay in processing requests and improved efficiency due to aggregation of requests – the host does not need to process requests individually if requests from many PEs can be combined.

Although all of the resources available to the host processor can be made available to PEs through this interface, at this writing, only a few Posix-based system calls are implemented. Implementation of the standard MPI (Message Passing Interface) library for accessing PEs in GPUs within other nodes in a cluster has been a research interest of our group for over two years. It also is easily possible to access native GPU libraries or arbitrary user-supplied code via this system call interface with minimal overhead.
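The host loop below sketches the kernel-relaunch system call mechanism described above. It is an illustration only: emulate_once(), the sysreq structure, and the request codes are all assumptions standing in for the actual MOG wrapper, and the stub "kernel" merely terminates every PE.

    #include <stdio.h>

    enum { REQ_NONE = 0, REQ_EXIT, REQ_WRITE };
    #define NPROC 4                            /* illustrative PE count */

    struct sysreq { int op; int args[4]; };
    static struct sysreq sysbuf[NPROC];        /* in reality, shared with the GPU */

    /* Stub standing in for one CUDA/OpenCL kernel invocation: PEs run until
       one needs service, drop requests into SYSBUF, and the kernel ends. */
    static int emulate_once(void)
    {
        for (int pe = 0; pe < NPROC; ++pe) sysbuf[pe].op = REQ_EXIT;
        return 0;                              /* number of PEs still running */
    }

    int main(void)
    {
        int live;
        do {
            live = emulate_once();             /* clean kernel exit = attention */
            for (int pe = 0; pe < NPROC; ++pe) {   /* service aggregated requests */
                if (sysbuf[pe].op == REQ_WRITE)
                    printf("write on behalf of PE %d\n", pe);
                sysbuf[pe].op = REQ_NONE;
            }
        } while (live > 0);
        return 0;
    }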

B. Instruction Set Design

One might expect that the best instruction set to use for the MOG system would be the native instruction set for the GPU, but that generally is not the case. Different GPUs have significantly different instruction sets and vendors generally discourage direct use of those instruction sets. Further, the need to have the exact same register use and control bits would make it very difficult to obtain significant factoring of instructions across MOG PEs.

The original MOG instruction set is based on a split-stack model which factors very well. The active top of the stack is kept in low-latency memory and explicit operations are used to shift portions of the stack to or from global memory as needed. This model executes efficiently and is easy to target with a custom code generator, but most existing compiler back-ends cannot easily support it.

The new MOG instruction set is designed to be much more compatible with existing compiler toolchains. It explicitly manages CUDA shared or OpenCL local memory as a set of general-purpose registers. Using a strided layout to ensure that all "registers" of each MOG PE are accessible without bank conflicts results in essentially the same access time as physical GPU registers, while indirect addressing allows factoring despite PEs accessing different registers. However, our general registers cannot be used as targets for ALU operations – a single accumulator (kept in a real GPU register) is used instead.

Why not use a more common two- or three-operand instruction format? For our interpreter, processing each operand field implies significant overhead. Compared to a three-operand format, two operands reduce the number of field decodes when a source register is the same as the destination register. The single-accumulator model avoids processing yet another operand field when the result of one instruction becomes an operand to the next. These optimization opportunities occur naturally and can be made more common by register allocation and code scheduling.

Table I
THREE-, TWO-, AND ONE-OPERAND FORMATS

Three Operand Format    Two Operand Format    Accumulator Format
(9 operands)            (8 operands)          (7 operands)
                                              lr $1
add $1,$1,$2            add $1,$2             add $2
xor $1,$1,$3            xor $1,$3             xor $3
                                              sr $1
                        mov $4,$5             lr $5
and $4,$5,$6            and $4,$6             and $6
                                              sr $4

The reduction in operand processing is clearly seen in the functionally-equivalent instruction sequences shown in Table I. The additional instructions for the accumulator model always are load and store register operations, further increasing the probability of factoring across different MIMD PEs by reducing opcode entropy.

C. Instruction Set Overview

The new MOG instruction set is summarized in Table II. Braces are used to indicate variants; for example, in the table add{,f} represents add and addf. Note that the "meaning" entries in the table are in most cases the actual kernel code implementations in CUDA and OpenCL.

There is a 32-bit load immediate instruction, li, that can be used for both integer and floating-point values – the accumulator and registers are essentially typeless 32-bit containers. To keep instruction size small, the value loaded is taken from the constant pool. The interesting twist is that the address in the CPOOL is derived by hashing the instruction encoding itself and the address at which the li instruction is placed. The instruction field that normally would have held a register number is used to modify the hash value to avoid collisions in the CPOOL. This instruction field is set appropriately by the assembler, which also constructs the constant pool.
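A minimal C sketch of the two mechanisms just described: the typeless 32-bit accumulator and the hashed CPOOL lookup. The union mirrors the a.i/a.u/a.f notation of Table II; the mixing function, however, is an illustrative placeholder, not the hash the real assembler uses.

    #include <stdint.h>

    typedef union { int32_t i; uint32_t u; float f; } acc_t;

    static acc_t a;                    /* the accumulator: a typeless 32-bit box */

    #define CPOOLSIZE 1024             /* illustrative size */
    static uint32_t cpool[CPOOLSIZE];

    /* li locates its 32-bit constant by hashing its own 16-bit encoding (ir)
       with the address (pc) at which it sits; the assembler picked the
       register field inside ir so that no two li instructions collide. */
    static inline uint32_t CPOOL(uint32_t pc, uint32_t ir)
    {
        return cpool[((pc * 2654435761u) ^ ir) % CPOOLSIZE];
    }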
This instruction field is set appropriately by arithmetic instead of 64-bit instructions, and ongoing research the assembler, which also constructs the constant pool. will determine if this is the correct answer. Most of the non-ALU instructions are loads and stores. There are three instructions that convert the value in the The load register and store register, lr and sr, instructions accumulator from one type to another. These instructions are are really moves between the accumulator and a register – named using the type suffixes. For example, i2f converts the and these are generally the most common instructions. The signed integer value in the accumulator into the 32-bit floating- memory load and store instructions come in 32-bit word (w), point number representation with the same value. 16-bit half-word (h), and 8-bit byte (b) variants. Half-word Control flow is very simple, implemented by just three and byte values are sign-extended to 32 bits when loaded. instructions. The jf and jt instructions are jumps testing the Memory appears to be byte addressed, but aligned access is value in the accumulator and conditionally jumping to a 32- forced for half-word or word data by ignoring the value of bit immediate instruction address. As in the C programming the low bit or lowest two bits, respectively. (It is worthwhile language and MIPS instruction set, the value 0 is false and any noting that the current implementation does not require the non-zero value is considered to be true. The immediate value is OpenCL byte addressing extension.) Loads are always from placed in the constant pool just as it is for the li instruction. memory into the accumulator, and stores always copy the value All other control flow is implemented by the j instruction, from the accumulator into the memory location specified in a which copies the accumulator’s value into the program . register. The only remaining instruction is sys. In conventional The arithmetic and logical operation instructions have the architectures, there typically is a “privileged” execution mode obvious meanings and, as in MIPS, comparison for less than and a group of special instructions intended for the operating DIETZ AND ROBERTS: MIMD MIPSEL WITHIN GPUS 5

Table II
TARGET INSTRUCTION SET

Instruction        Format    Meaning
li const           op x      a.i = CPOOL(pc, ir)
lr $r              op r      a.i = REGI(ir)
sr $r              op r      REGI(ir) = a.i
l{b,h,w}           op        a.i = MEM{B,H,W}(a.u)
s{b,h,w} $r        op r      MEM{B,H,W}(REGI(ir)) = a.i
add{,f} $r         op r      a.{i,f} += REG{I,F}(ir)
{and,or,xor} $r    op r      a.u {&=,|=,^=} REGU(ir)
div{,f,u} $r       op r      a.{i,f,u} /= REG{I,F,U}(ir)
mul{,f} $r         op r      a.{i,f} *= REG{I,F}(ir)
neg{,f}            op        a.{i,f} = -a.{i,f}
rem{,u} $r         op r      a.{i,u} %= REG{I,U}(ir)
sll $r             op r      a.u <<= REGU(ir)
slt{,f,u} $r       op r      a.{i,f,u} = (a.{i,f,u} < REG{I,F,U}(ir))
sr{a,l} $r         op r      a.{i,u} >>= REG{I,U}(ir)
{f,i,u}2{i,f,f}    op        a.{i,f,f} = a.{f,i,u}
j{f,t} lab         op x      if (a.u {==,!=} 0) pc = CPOOL(pc, ir)
j                  op        pc = a.u
sys func           op func   system(func)

The only remaining instruction is sys. In conventional architectures, there typically is a "privileged" execution mode and a group of special instructions intended for the operating system. With so many relatively poorly resourced PEs, it does not make sense to suffer the overhead of having a copy of the operating system on each. Further, special operating system instructions would rarely factor, and the GPU hardware generally does not even have a hardware mechanism by which it could access most system I/O devices. Thus, sys simply suspends the PE until something else – typically the host – has performed the requested operation on its behalf. The function code embedded in the instruction is not intended to directly identify the desired function, but where and how the function should be processed. For example, it can distinguish between system calls needing immediate processing (e.g., an error exit) and those that can be delayed somewhat to allow more efficient aggregated processing. Similarly, it can distinguish invocations of user-supplied code or native libraries.

III. TRANSLATION OF MIPSEL ASSEMBLY CODE

By design, it should be reasonably straightforward to retarget an existing compiler toolchain to generate code for the target instruction set architecture described above. However, our goal is even more aggressive. By transforming a commonly targeted assembly language into the desired assembly language, we can effectively retarget a multitude of compiler toolchains in a single development effort. In particular, it was our goal to be able to retarget code from at least one of the major production-quality toolchains, such as LLVM or GCC.

The MIPS (Microprocessor without Interlocked Pipeline Stages) architecture began as an experimental RISC architecture at Stanford University in the early 1980s. Since then, the basic 32-bit MIPS instruction set has been extended in various ways. A multitude of commercial systems have employed MIPS-based processors – especially devices that are not computers per se, but require significant embedded processing power. The basic MIPS processor architecture also is widely used as a teaching and research tool, partly because the pipeline structure is very simple and partly because the MIPS pipeline plays a central role in a very popular architecture textbook [12].

The MIPS instruction set is unusual in a variety of ways that make it easier to transform than most other instruction sets. For example, many processors use a multitude of special condition code registers whose values are changed as side-effects of various instructions; this makes it difficult to determine which condition code settings are significant. In contrast, the basic MIPS handling of conditional expressions follows the C language convention, using explicit instructions to evaluate conditionals and set an integer register to 0 or 1. MIPS does have some unpleasant aspects, such as coprocessor-oriented floating-point arithmetic instructions, and it also is necessary to be specific about which MIPS variant is being used.

The preferred MIPS variant for our purpose is the MIPS 1 instruction set architecture with "little endian" byte order within a word. MIPS 1 is a relatively simple 32-bit model, which facilitates mapping operations to GPUs that are tuned for performance of operations on 32-bit data objects. Although most MIPS hardware can be used with either "big endian" or "little endian" byte order, the default for MIPS is generally "big endian" – which is the opposite of the byte order used by GPUs hosted by either IA32 or AMD64/Intel64 processors. Thus, the easiest MIPS variant to transform for our purpose is "little endian" MIPS 1, which is commonly known as the MIPSEL compiler target.
The GCC MIPSEL Target. Many compilers for C, C++, Fortran, and various other languages can generate MIPSEL code. While any compiler targeting MIPSEL should be usable with the system discussed in this paper, our experiments have centered on reprocessing the assembly code generated by one particular compiler.

Initially, the compiler used was LLVM's MIPSEL target. That compiler generated standard MIPS 1 instructions, but proved unable to generate valid code for floating-point operations. The problem was traced to a major flaw in the compiler's model for floating-point, which essentially resulted in the compiler confusing 32-bit and 64-bit register allocations. Rather than spending time fixing this bug, we switched to using the GCC MIPSEL target.

Beyond its wide availability and robustness, GCC is not just a compiler for a single language, but the "GNU Compiler Collection" with front-ends supporting C, C++, Fortran, Pascal, Java, Ada, etc. However, GCC's MIPSEL target code generation also proved somewhat problematic. It was relatively late in the project that it was discovered that GCC does not generate standard MIPSEL assembly language code, but uses a wide range of additional features that apparently were built into the MIPSEL assembler used by GCC. For example, MIPS assembly language uses the $ prefix to indicate a register name, but GCC's MIPSEL target also creates local variable names with the $ prefix. In fact, GCC also hallucinates a wide variety of instructions that are not part of MIPS 1, but that the assembler will expand. MIPS assemblers always have had various built-in macros for handy instruction sequences, so it is easy to understand how adding more might have been viewed as a harmless feature.

This proliferation of extensions to the assembly language is not harmless to our purpose. As a result, it was decided that taking an incremental approach to processing MIPSEL code would make more sense than attempting to build a highly tuned translation tool. Thus, the conversion is currently done by an easily extendable sequence of scripts, mostly using awk. These scripts process not just MIPSEL code, but all the extensions that we have seen GCC use in its output. Once we are reasonably certain that GCC's MIPSEL output holds no more surprises, we intend to re-implement the processing as a proper optimizing translator. Currently, very little optimization is done during the conversion process.

The mogcc Script. Rather than manually running GCC to create MIPSEL code and then running a command to convert that code, we created a script that directly runs a particular version of the GCC MIPSEL compiler. This is not merely a convenience, but a hedge against different GCC versions using additional extensions to the MIPSEL assembly language that would not be understood by our current conversion process. The script performing C compile and conversion is called mogcc.

The latest script begins by running the GCC cross-compiler version 4.1.1 for the "mipsel-" target. Important command line options given to GCC include those listed in Table III. The first three are critical in forcing the expected MIPSEL instruction set variant. Note that neither the DSP nor 3D extensions can be enabled. The GPU PEs have very limited low-latency memory, so the decision was made to avoid reserving a word for a frame pointer.

After invoking GCC, mogcc uses sed and awk scripts to simplify and normalize the MIPSEL code. Various unnecessary directives and comments are removed, symbolic register names are converted into their numerical equivalents (e.g., $ra becomes $31), the $ prefix of compiler-generated labels is converted into DOL_, etc. using sed. The awk scripts perform higher-level cleaning and normalization of the code. For example, the cleaning involves tasks such as removal of global offset table references, conversion of MIPS multiply and divide sequences into more conventional instructions, and elimination of offsets in load and store instruction address references. The normalization largely concerns conversion of floating-point coprocessor operations into instructions of the same form as the integer instructions.

The mogcc script then passes the cleaned and normalized code through several conversion passes, as described in the following sections.
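The actual normalization is done with sed and awk; the C fragment below merely illustrates the kind of symbolic-to-numeric register mapping involved. The table is abridged and the function name is an assumption made for illustration.

    #include <string.h>

    /* Maps a symbolic MIPS register name to its number, e.g. "$ra" -> 31.
       In the real system this is a sed substitution over the whole
       assembly file rather than C code. */
    static int regnum(const char *name)
    {
        static const struct { const char *sym; int num; } map[] = {
            { "$zero", 0 }, { "$gp", 28 }, { "$sp", 29 },
            { "$fp", 30 }, { "$ra", 31 },
        };
        for (unsigned i = 0; i < sizeof(map) / sizeof(map[0]); ++i)
            if (strcmp(name, map[i].sym) == 0)
                return map[i].num;
        return -1;   /* not symbolic; may already be numeric */
    }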
The conflow Script. Control flow in MIPS is relatively straightforward, consisting of just conditional branches, unconditional jump, jump register, and two forms of jump-and-link. However, MIPS assemblers traditionally offer additional operations implemented as sequences of actual MIPS instructions, and the code output by GCC extends that apparent instruction set even further. The purpose of the conflow script is to convert all the control flow into simple jump-true (jt), jump-false (jf), and jump-register (j) instructions.

Converting all the conditional branches into true or false jumps is straightforward. A large fraction of the conditional branches are testing if the value in a register is equal to the value in MIPS register $0, which is hardwired as zero – the only false value in the C language. All of these branches can be directly coded using jt or jf without wasting a register to hold zero. Branches testing equality of two non-$0 registers can be converted into an xor of those registers followed by jt or jf, as appropriate.

Unconditional branches (a GCC extension) and jumps are simply translated into jump-register instructions.

The jump-and-link instructions are functionally unconditional jumps preceded by placing the return address into the return address register, $31. The return address is set by inserting a label and loading its value.

The twe2acc Script. Typical of RISC architecture instruction sets, MIPS arithmetic operations generally use a three-register form. Each such instruction encodes two source registers and a destination register. Alternatively, the second source register can be replaced with a short 16-bit immediate value. However, this encoding is not efficient for MOG.

The twe2acc script performs the straightforward conversion of three-operand MIPS instructions into single-accumulator code. It is significant to note that the single accumulator is accompanied by a set of general-purpose registers. Thus, as shown in the example of Table I, the explicit operand for each instruction is generally a register – not a memory address, as older single-accumulator instruction sets would have required.

The lrsropt Script. Although conversion from the MIPS three-operand model into a single-accumulator with general registers model offers the potential to improve performance, the code generated by twe2acc contains many unnecessary lr and sr instructions. The lrsropt script attempts to remove redundant lr and sr instructions. The analysis used is very simple. There is no need to load a value just saved, nor is there a need to save a value just loaded from the same register. This simple analysis removes most of the unnecessary register accesses.
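A sketch of the lrsropt peephole rule in C, applied to a toy instruction list; the opcode names and data structure are assumptions made for illustration, not the script's actual representation.

    /* An lr $r immediately following an sr $r (or an sr $r immediately
       following an lr $r) is redundant, because the accumulator already
       holds register $r's value. */
    enum op { LR, SR, OTHER };
    struct insn { enum op op; int reg; };

    /* Rewrites the sequence in place, returning the new length. */
    static int lrsropt(struct insn *code, int n)
    {
        int out = 0;
        for (int i = 0; i < n; ++i) {
            if (out > 0 && code[i].reg == code[out - 1].reg &&
                ((code[i].op == LR && code[out - 1].op == SR) ||
                 (code[i].op == SR && code[out - 1].op == LR)))
                continue;                 /* drop the redundant load or store */
            code[out++] = code[i];
        }
        return out;
    }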

-mips1                 Select the MIPS 1 architecture model
-mfp32                 Force floating-point registers to be 32-bits wide
-msym32                Force all symbols to be 32 bits wide
--omit-frame-pointer   Disable use of a frame pointer register

Table III
RELEVANT MIPSEL GCC OPTIONS

The regmap Script. The greater the fraction of low-latency GPU memory allocated to each virtual PE, the fewer virtual PEs can be co-resident. Typically, global memory latency dominates execution time such that increasing the number of co-resident virtual PEs roughly multiplies the work done per unit time, so transformations that result in use of less low-latency memory per virtual PE produce a proportionate parallel speedup.

The basic MIPS instruction set has 32 user-visible integer registers and various other special and coprocessor registers. Blindly allocating enough low-latency memory for each PE to have all of these registers would result in a tiny virtualization factor, very limited parallelism, and correspondingly poor performance. The purpose of the regmap script is to minimize and prioritize the set of registers needed; the following MIPS registers are treated specially:

$0   In MIPS, this is $zero – a hardwired 0 value. It is not used in MOG.

$28  In MIPS, this is $gp – a pointer to a 64K-byte global constant pool that also can be used to store small global variables. The mogcc script removes all references to $28, replacing constant references with immediate loads and allowing variables to be allocated normally.

$29  This is $sp – the stack pointer. To preserve the calling conventions of MIPSEL, the stack pointer is always allocated as MOG register $0.

$30  In MIPS, this is $fp – the frame pointer. Since a frame pointer is not strictly necessary, mogcc requests GCC generate code that does not use it and it is not preallocated. However, if used in the MIPS code, it will be allocated.

$31  This is $ra – the return address register. To preserve the calling conventions of MIPSEL, the return address is always kept in MOG register $1.

Other registers, including the coprocessor floating-point registers, are allocated space by regmap only if they are used. However, it is not necessarily true that all used registers must be allocated registers in the MOG system. Thus, regmap not only detects which registers are used, but counts their usage frequency. After pre-allocating registers $0 and $1, registers are renumbered in order of decreasing frequency of use. A future version of the system is expected to intelligently convert higher-renumbered registers into variables in global memory as well as performing more sophisticated analysis and transformations to minimize MAXLIVE.
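The renumbering idea can be sketched in C as follows: count uses, pre-allocate $sp and $ra as MOG registers $0 and $1, then hand out the remaining numbers by decreasing frequency. The data layout and function name are assumptions; the real pass works over the assembly text.

    /* Renumber MIPS registers 0..31 into MOG register numbers so that more
       frequently used registers get lower numbers. uses[] would be filled
       by scanning the assembly code; here it is assumed given. */
    #define NMIPSREGS 32

    static void regmap(const unsigned uses[NMIPSREGS], int mognum[NMIPSREGS])
    {
        int next = 2;                        /* $29 -> MOG $0, $31 -> MOG $1 */
        for (int r = 0; r < NMIPSREGS; ++r) mognum[r] = -1;
        mognum[29] = 0;
        mognum[31] = 1;
        for (;;) {                           /* repeatedly pick the most-used register */
            int best = -1;
            for (int r = 0; r < NMIPSREGS; ++r)
                if (mognum[r] < 0 && uses[r] > 0 &&
                    (best < 0 || uses[r] > uses[best]))
                    best = r;
            if (best < 0) break;             /* unused registers get no MOG register */
            mognum[best] = next++;
        }
    }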
original concept [14] does somewhat better by picking an However, it is not necessarily true that all used registers order for the subinterpreters that is intended to maximize the must be allocated registers in the MOG system. Thus, regmap expected number of instructions executed per processor per not only detects which registers are used, but counts their interpreter cycle. For example, given that the subinterpreters usage frequency. After pre-allocating registers $0 and $1, are in the order LR, ADD, XOR then the instruction sequence registers are renumbered in order of decreasing frequency of LR, ADD, XOR would take only one cycle – but ADD, LR, XOR use. A future version of the system is expected to intelligently would take 2 cycles and XOR, ADD, LR would take 3. In our convert higher-renumbered registers into variables in global earlier work targeting the MasPar MP1 [15], which used a memory as well as performing more sophisticated analysis fundamentally different approach, one of the techniques used and transformations to minimize MAXLIVE. was “frequency biasing” in which expensive operations were The colseg Script. As a result of converting code constructs deliberately made to execute less frequently. Abu-ghazaleh et such as $gp references and .comm references into definitions al. [16] later integrated this idea with the single-instruction of data, the transformed code now freely mixes both code and subinterpreter sequencing concept by allowing subinterpreter data. The purpose of the colseg script (collect into segments) sequences that do not incorporate all instructions in every is simply to separate the data and code into two segments. interpreter cycle. First, all data is defined in a single data segment. Then, all The assembler uses frequency estimates of both individual code is defined in a single code segment. instructions and instruction digrams to create an optimized The first data defined are not derived from the MIPSEL sequence of single-instruction interpreters. By default, the 8 ICPP12 PAPER SUBMISSION assembler uses static analysis to obtain these estimates, but REFERENCES they also can be obtained by tracing one or more program [1] ClearSpeed, “ClearSpeed whitepaper: CSX processor architecture,” executions on test data. The assembler first creates a sequence ClearSpeed Technology plc, vol. PN-1110-0702, 2007. of single instruction subinterpreters in which each instruction [2] NVIDIA, “NVIDIA CUDA compute unified device architecture pro- gramming guide version 1.0,” June 2007. type used in the program occurs at least once, and frequency [3] ATI, “ATI stream SDK user guide v1.3-beta,” December 2008. of occurrence of subinterpreters for a particular instruction [4] Message Passing Interface Forum, “MPI: A Message Passing Interface,” type mirrors the frequency with which that instruction is in Proceedings of Supercomputing ’93, pp. 878–883, IEEE Computer Society Press, 1993. expected to execute. The assembler then uses a modified [5] T. Blank, “The MasPar MP-1 Architecture,” 35th IEEE Computer evolutionary search to optimize the order of the subinterpreters Society International Conference (COMPCON), February 1990. so that the digram-frequency-weighted average span between [6] P. Wilsey, D. Hensgen, C. Slusher, N. Abu-Ghazaleh, and D. Hollinden, “Exploiting computers for mutant program execution,” Technical subinterpreters for instructions appearing in each digram is Report No. TR 133-11-91, Department of Electrical and Computer minimized. 
For example, using 64 subinterpreters to cover the 36 instruction types, a random ordering usually has a weighted average span greater than 15 – meaning that at runtime an average of at least 15 subinterpreters would be attempted before executing a subinterpreter for the next instruction. The optimizer typically reduces that number to between 3 and 6.

Space does not permit detailed discussion of the interpreters or their performance, but interpreter performance using CUDA on the same GPUs is not worse than the split-stack system achieved [8]. On the same NVIDIA hardware, the OpenCL version appears to be slightly slower than the CUDA version. The OpenCL version compiles on AMD/ATI hardware, but hangs for some PE virtualizations; where the system works, performance is again comparable. Overall, one can expect less than an order of magnitude slower execution than tuned native code on the same GPU. Wildly dynamic MIMD code, including recursion, also works with apparently good efficiency.

V. CONCLUSIONS

Earlier work showed how it is possible to efficiently execute MIMD On GPU (MOG) via interpretation, but failed to provide a practical method by which existing compilers could be used. The new accumulator/register instruction set described here, combined with translation from MIPSEL assembly code, solves this problem. The new system also can generate portable OpenCL interpreter code, not just CUDA, and has run on both NVIDIA and AMD/ATI hardware (although not without issues).

A full source code version of this system targeting only CUDA has been freely available from Aggregate.Org since November 2010. The complete version including OpenCL support will be posted later this year. Future work includes replacing the MIPSEL conversion scripts with an optimizing translator and implementation of various system calls, especially those supporting MPI message passing between PEs within and across GPUs.
REFERENCES

[1] ClearSpeed, "ClearSpeed whitepaper: CSX processor architecture," ClearSpeed Technology plc, vol. PN-1110-0702, 2007.
[2] NVIDIA, "NVIDIA CUDA compute unified device architecture programming guide version 1.0," June 2007.
[3] ATI, "ATI stream SDK user guide v1.3-beta," December 2008.
[4] Message Passing Interface Forum, "MPI: A Message Passing Interface," in Proceedings of Supercomputing '93, pp. 878–883, IEEE Computer Society Press, 1993.
[5] T. Blank, "The MasPar MP-1 Architecture," 35th IEEE Computer Society International Conference (COMPCON), February 1990.
[6] P. Wilsey, D. Hensgen, C. Slusher, N. Abu-Ghazaleh, and D. Hollinden, "Exploiting SIMD computers for mutant program execution," Technical Report No. TR 133-11-91, Department of Electrical and Computer Engineering, University of Cincinnati, Cincinnati, Ohio, November 1991.
[7] H. G. Dietz and W. E. Cohen, "A massively parallel MIMD implemented by SIMD hardware," Purdue University School of Electrical Engineering Technical Report TR-EE 92-4, 28 pages, January 1992.
[8] H. Dietz and B. Young, "MIMD Interpretation on a GPU," in Languages and Compilers for Parallel Computing (G. Gao, L. Pollock, J. Cavazos, and X. Li, eds.), vol. 5898 of Lecture Notes in Computer Science, pp. 65–79, Springer Berlin / Heidelberg, 2010.
[9] C. Lattner, "LLVM: An Infrastructure for Multi-Stage Optimization," Master's thesis, Computer Science Dept., University of Illinois at Urbana-Champaign, Urbana, IL, Dec 2002. See http://llvm.cs.uiuc.edu.
[10] Khronos OpenCL Working Group, "The OpenCL specification version 1.0," December 2008.
[11] H. Dietz, B. Dieter, R. Fisher, and K. Chang, "Floating-point computation with just enough accuracy," Lecture Notes in Computer Science, vol. 3991, pp. 226–233, Apr 2006.
[12] D. A. Patterson and J. L. Hennessy, Computer Organization and Design, Fourth Edition: The Hardware/Software Interface (The Morgan Kaufmann Series in Computer Architecture and Design). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 4th ed., 2008.
[13] W. B. Langdon and W. Banzhaf, "A SIMD interpreter for genetic programming on GPU graphics cards," in Proceedings of the 11th European Conference on Genetic Programming, EuroGP 2008 (M. O'Neill, L. Vanneschi, S. Gustafson, A. I. Esparcia Alcazar, I. De Falco, A. Della Cioppa, and E. Tarantino, eds.), vol. 4971 of Lecture Notes in Computer Science, (Naples), pp. 73–85, Springer, 26-28 Mar. 2008.
[14] M. Nilsson and H. Tanaka, "MIMD Execution by SIMD Computers," Journal of Information Processing, Information Processing Society of Japan, vol. 13, no. 1, pp. 58–61, 1990.
[15] H. G. Dietz and W. E. Cohen, "A control-parallel programming model implemented on SIMD hardware," in Languages and Compilers for Parallel Computing (U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, eds.), Springer-Verlag, New York, New York, 1993.
[16] N. B. Abu-ghazaleh, P. A. Wilsey, X. Fan, and D. A. Hensgen, "Synthesizing variable instruction issue interpreters for implementing functional parallelism on SIMD computers," IEEE Transactions on Parallel and Distributed Systems, 1997.