Nyami: A Synthesizable GPU Architectural Model for General-Purpose and Graphics-Specific Workloads

Jeff Bush
San Jose, California
[email protected]

Philip Dexter†, Timothy N. Miller†, and Aaron Carpenter*
†Dept. of Computer Science
*Dept. of Electrical & Computer Engineering
Binghamton University
{pdexter1, millerti, carpente}@binghamton.edu

978-1-4799-1957-4/15/$31.00 ©2015 IEEE

Abstract

Graphics processing units (GPUs) continue to grow in popularity for general-purpose, highly parallel, high-throughput systems. This has forced GPU vendors to increase their focus on general purpose workloads, sometimes at the expense of the graphics-specific workloads. Using GPUs for general-purpose computation is a departure from the driving forces behind programmable GPUs that were focused on a narrow subset of graphics rendering operations. Rather than focus on purely graphics-related or general-purpose use, we have designed and modeled an architecture that optimizes for both simultaneously to efficiently handle all GPU workloads.

In this paper, we present Nyami, a co-optimized GPU architecture and simulation model with an open-source implementation written in Verilog. This approach allows us to more easily explore the GPU design space in a synthesizable, cycle-precise, modular environment. An instruction-precise functional simulator is provided for co-simulation and verification. Overall, we assume a GPU may be used as a general-purpose GPU (GPGPU) or a graphics engine and account for this in the architecture’s construction and in the options and modules selectable for synthesis and simulation.

To demonstrate Nyami’s viability as a GPU research platform, we exploit its flexibility and modularity to explore the impact of a set of architectural decisions. These include sensitivity to cache size and associativity, barrel and switch-on-stall multi-threaded instruction scheduling, and software vs. hardware implementations of rasterization. Through these experiments, we gain insight into commonly accepted GPU architecture decisions, adapt the architecture accordingly, and give examples of the intended use as a GPU research tool.

1 Introduction

Historically, high-performance computing (HPC) performance growth has generally followed Moore’s law. This trend continues today, except for one major recent discontinuity: the adoption of GPUs. In terms of performance per Watt and performance per cubic meter, GPUs can outperform CPUs by orders of magnitude on many important workloads. The adoption of GPUs into HPC systems has therefore been both a major boost in performance and a shift in how supercomputers are programmed.

Unfortunately, this shift has suffered slow adoption because the GPU programming model is unfamiliar to those who are accustomed to writing software for traditional CPUs. In an attempt to bridge this gap, Intel developed Larrabee (now called Xeon Phi) [25]. Larrabee is architected around small in-order cores with wide vector ALUs to facilitate graphics rendering and multi-threading to hide instruction latencies. The use of small, simple processor cores allows many cores to be packed onto a single die and into a limited power envelope.

Although GPUs were originally designed to render images for visual displays, today they are used frequently for more general-purpose applications. However, they must still efficiently perform what would be considered traditional graphics tasks (i.e. rendering images onto a screen). GPUs optimized for general-purpose computing may downplay graphics-specific optimizations, even going so far as to offload them to software. Ideally, a GPU would have the capability to process both general-purpose and graphics-specific workloads with high performance and efficiency. Additionally, a modular approach can allow system integrators to make static selections among alternative components to optimize more for one paradigm or the other.

These factors make it important for the research and development communities to have effective tools that allow them to contribute new performance and energy-efficiency improvements, particularly as GPUs are traditionally very power-hungry [19]. Unfortunately, most GPU design details are proprietary, and there are limited options for accurate and modular architectural simulations.

With this work, we have attempted to take a fresh look at both graphics and GPGPU applications to develop an architecture, and associated simulation model, that performs well for both types of workload. Like Larrabee [25], we adopt a more traditional programming model, but we avoid the performance and die-area drawbacks by using a more GPU-like RISC ISA and pipeline architecture. As such, a given die area can hold a greater number of processor cores, increasing the aggregate throughput and performance per Watt.

Thus, we present Nyami, implemented as synthesizable logic in Verilog. This allows us to conduct design space exploration for GPUs, with very precise simulation of GPU operation at the RTL and gate levels. The Nyami architecture and model allow us to explore the trade-offs inherent in simultaneously designing a GPU for general-purpose and graphics-centric workloads in order to facilitate selecting an optimization target (graphics, GPGPU, or a compromise). The Nyami Verilog model provides a flexible framework for exploring architectural trade-offs that affect GPU performance in general, including changes to the cache hierarchy, pipeline structure, and hardware thread scheduling.

Most importantly, Nyami is offered up as a new platform for research, designed to help researchers contribute to high-performance GPU research. Although Nyami is synthesizable, it is still easy to modify, which is important for testing architectural hypotheses and performing design space exploration. (Source code and documentation are available on-line.) Our accompanying software also includes an LLVM-based C++ compiler [16] that targets the Nyami ISA and a functional simulator written in C, allowing us to test the architecture across the various design stack levels, including cycle-precise simulation, instruction emulation, and power/area analysis.

While numerous open-source CPU implementations have been available for many years [2, 15, 20, 24], this is not true for GPUs. There are only two open-source fully-functional GPUs currently active, OpenShader [18], which is still under development, and Nyami. This currently leaves Nyami as the only fully-functional, synthesizable open source GPU implementation available. Nyami is also written to be a research tool, where the implementation in Verilog directly reflects the architecture, with minimal obfuscating performance optimizations. To demonstrate the significance of Nyami’s contribution, we will describe the native architecture, as well as a number of design explorations, including hardware thread scheduling techniques, rasterization methods, and cache configuration design space exploration.

The rest of the paper is organized as follows. Section 2 discusses relevant existing work. Section 3 gives an overview of the Nyami simulation model and relevant baseline design choices. Section 4 presents examples of the significance of open-source GPU simulation and co-optimization techniques. Finally, Section 5 concludes.

2 Related Work

GPU Architectures. The earliest 3D graphics accelerators implemented a fixed-function pipeline, specialized for graphics.

[...] CUDA and OpenCL have exposed general-purpose compute functionality for most modern GPU platforms.

Intel’s Larrabee architecture [25] is an often-cited modern example of a divergence from the architectural trends of traditional GPU architecture. Instead of numerous scalar processor cores, Larrabee is an array of general-purpose SIMD cores being used as a graphics processor, adapted from the existing x86 CPU architecture. These in-order x86 cores are enhanced with wide vector functional units, with very little special-purpose logic for graphics. While many traditional GPUs use hardware for task control, scheduling, and rasterization, Larrabee does all of this in software under the assumption that software control gives flexibility that confers performance advantages. Specifically, rasterization and graphics rendering is done in software. In Section 4, we will explore this option, as well.

Aspects of modern GPU architecture are described in other sections below. Top vendors of discrete high-performance GPUs include Nvidia and AMD (ATI). AMD and Intel also produce GPUs that integrate on the same die as their CPUs. There is also a sizable market for embedded GPUs, which are licensed as IP blocks to be integrated into systems-on-chip for portable devices, and vendors include Imagination Technologies (PowerVR), ARM, and Qualcomm. Relatively recent literature on the basics of GPU architecture includes [8, 17, 31, 32].

GPU Simulators. GPUs and GPGPUs are an active area of research. However, most GPU architectures are proprietary, and information about internal details is trade secret. This makes it challenging for researchers to evaluate microarchitectural trade-offs in a simulation environment. There are a few notable simulators that warrant discussion. The Guppy project modified an open source processor, LEON3, to be more GPU-like, and is synthesizable for FPGA [3]; unfortunately it is not widely available to the public. Similar work has been done to create soft GPGPU frameworks in FPGA hardware [4]. The ATILLA project is a cycle-accurate emulator for GPU
