S. Chatterjee Design and exploitation L. R. Bachega of a high-performance P. Bergner K. A. Dockser SIMD floating-point J. A. Gunnels M. Gupta unit for Blue Gene/L F. G. Gustavson C. A. Lapkowski We describe the design of a dual-issue single-instruction, multiple- G. K. Liu data-like (SIMD-like) extension of the IBM PowerPCt 440 M. Mendell floating-point unit (FPU) core and the compiler and algorithmic R. Nair techniques to exploit it. This extended FPU is targeted at both the C. D. Wait IBM massively parallel Blue Genet/L machine and the more T. J. C. Ward pervasive embedded platforms. We discuss the hardware and software codesign that was essential in order to fully realize the P. Wu performance benefits of the FPU when constrained by the memory bandwidth limitations and high penalties for misaligned data access imposed by the memory hierarchy on a Blue Gene/L node. Using both hand-optimized and compiled code for key linear algebraic kernels, we validate the architectural design choices, evaluate the success of the compiler, and quantify the effectiveness of the novel algorithm design techniques. Our measurements show that the combination of algorithm, compiler, and hardware delivers a significant fraction of peak floating-point performance for compute-bound-kernels, such as matrix multiplication, and delivers a significant fraction of peak memory bandwidth for memory- bound kernels, such as DAXPY, while remaining largely insensitive to data alignment.
Introduction beginning that actual performance relies heavily on the Blue Gene*/L [1] is a massively parallel computer system optimization of software for the platform. Feedback from under development at the IBM Thomas J. Watson the software teams was instrumental in identifying and Research Center and the IBM Engineering and refining new extensions to the PowerPC instruction set Technology Services Group in Rochester, Minnesota, in to speed up target applications without adding excessive collaboration with the Lawrence Livermore National complexity. On the hardware side, we needed to double Laboratory. The Blue Gene/L program targets a the raw performance of our FPU while still being able to machine with 65,536 dual-processor nodes and a peak connect it to the preexisting auxiliary processor unit performance of 360 trillion floating-point operations per (APU) interface of the PPC440 G5 central processing unit second (360 Tflops). It is expected to deliver previously (CPU) core [3] under the constraints of working with the unattainable levels of performance for a wide range of dual-issue nature of the CPU and keeping the floating- scientific applications, such as molecular dynamics, point pipes fed with data and instructions. On the turbulence modeling, and three-dimensional dislocation software side, we needed tight floating-point kernels dynamics. optimized for the latency and throughput of the FPU This paper describes a hardware and software codesign and a compiler that could produce code optimized for to enhance the floating-point performance of a Blue this unit. Gene/L node, based on extensions to the IBM PowerPC* This paper reports four major contributions: First, a 440 (PPC440) floating-point unit (FPU) core [2], a high- minor modification of the PPC440 FPU core design performance dual-issue FPU. We recognized from the produces a single-instruction, multiple-data-like (SIMD-