
Infrastructure architecture essentials, Part 7: High-performance computing off the shelf

Concepts and techniques

Sam Siewert ( [email protected] ), Principal Software Architect/Adjunct Professor, University of Colorado

Summary: The year 2008 will forever be remembered as the year of the off-the-shelf (OTS) supercomputer, thanks to the Los Alamos National Laboratory (LANL) and IBM team that constructed the world's first machine to break the peta-FLOP (1,000,000,000,000,000 floating-point operations per second) barrier. Get an overview of OTS strategies for architecting high-performance computing (HPC) systems as well as the methods and concepts behind building HPC systems from OTS components and open source software.

Date: 09 Dec 2008 Level: Intermediate


Continuing the Infrastructure architecture essentials series, this article provides an overview of methods for building HPC systems with OTS components and open source software. The architectures covered employ clusters and hybrid nodes that combine traditional multi-core symmetrical multiprocessing (SMP)/non-uniform memory access (NUMA) designs with single-instruction, multiple-data (SIMD) Cell-based or graphics processing unit (GPU)-based offloading. Methods for implementing Cell-based and GPU-based offloads are not reviewed here in detail, but you can find numerous excellent references on Cell-based algorithm acceleration (see Resources) as well as significant help with GPU offload in the NVIDIA Compute Unified Device Architecture (CUDA) environment (see Resources). Open source code that assists with HPC cluster and hybrid offload applications is prevalent, and the skills and competencies necessary for such architectures are reviewed here to help you get started.

Advances in OTS processor complexes

Numerous individual architecture advances made by IBM and IBM partners have made OTS HPC a reality. The best proof came this past summer, when the Roadrunner system broke the petaflop (1x10^15 floating-point operations per second) barrier using OTS IBM® BladeCenter® server boards (see Resources). The Roadrunner system employs two BladeCenter QS22 blades with IBM PowerXCell™ 8i processors and one LS21 AMD Opteron blade in a tri-blade configuration. The Roadrunner system is currently first on the TOP500 supercomputing list (see Resources). Here's a quick review of the emerging OTS technologies that are making OTS HPC possible:

Virtualization software. The emergence of software that makes one resource look like many and many resources look like one, first demonstrated by IBM with the original mainframe virtual machine (VM), is fundamental to authoring scalable applications that can exploit large clusters of OTS processing, memory, input/output (I/O) and storage resources.


Multicore processors. Since uniprocessor clock rates peaked just below 4GHz, AMD and Intel have both developed a wide offering of SMP and NUMA architectures for OTS mainboards and have interesting new multi-core architectures coming out with the AMD Shanghai and Intel® Nehalem processor complexes. Multi-core processor complexes have become typical for all of general-purpose computing (GPC) and have helped to motivate OTS HPC solutions built from scalable clusters of OTS compute nodes along with software libraries to exploit multiple-instruction, multiple-data (MIMD) architectures. (A short sketch following this overview shows how an application can query the cores available to it.)

Scalable I/O hubs. The IBM xSeries® system includes both traditional SMP memory controller hub interfaces to the PCI-E bus, memory, and processor cores as well as NUMA scaling with options like the IBM System x3950. Many new chip sets will employ protocols such as Intel's QuickPath Interconnect (QPI) and AMD's HyperTransport for NUMA scaling in 2009. The x3950 provides NUMA scaling of up to four systems and a total of 28 PCI-E x8 I/O expansion slots (seven interfaces per x3950 system).

Scalable memory interfaces. As memory is scaled, many systems are employing protocols such as DDR3, which increases transfer rates to 12800 MB/sec per memory bank, with the capability to easily scale to 256GB of memory per processing node with OTS memory technology.

Manycore SIMD offload engines. The Cell Broadband Engine™ (Cell/B.E.™) and PowerXCell 8i processors as well as GP-GPUs from NVIDIA and AMD/ATI provide tens to hundreds of offload cores for SIMD acceleration of applications.

IBM xSeries Cluster 1350. IBM-supported clustering of xSeries rack-mount or BladeCenter MIMD clusters.

IBM pSeries® Cluster 1600. IBM-supported clustering of pSeries IBM POWER™ architecture systems.

BladeCenter. A highly integrated blade server platform with a mid-plane and IBM BladeCenter Open Fabric I/O for a variety of IBM POWER6, AMD Opteron, Intel Xeon®, and Cell processing boards.
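
Because the number of cores varies widely across OTS nodes, portable code usually discovers the core count at run time and sizes its thread pool accordingly. Here is a minimal sketch, assuming a Linux/UNIX-style system where sysconf(_SC_NPROCESSORS_ONLN) is available; the thread-pool sizing heuristic shown is purely illustrative.

#include <stdio.h>
#include <unistd.h>     /* sysconf() */

int main(void)
{
    /* Number of processor cores currently online, as reported by the OS. */
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    if (cores < 1)
        cores = 1;      /* fall back to a single worker if the query fails */

    /* A common heuristic: one worker thread per core for CPU-bound work,
     * more than one per core for I/O-bound work that blocks frequently. */
    long cpu_workers = cores;
    long io_workers  = 2 * cores;

    printf("online cores: %ld, CPU-bound workers: %ld, I/O-bound workers: %ld\n",
           cores, cpu_workers, io_workers);
    return 0;
}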

Skills and competencies: Offloading and SIMD instruction set extensions

HPC OTS clusters can now leverage SIMD instruction sets as well as Cell and GP-GPU SIMD many-core processors like the NVIDIA Tesla. Here's a quick overview of options:

Cell processor offload. The Cell design, originally developed for digital media with the Cell/B.E., has found its way into IBM Blue Gene®/L and now, with the PowerXCell 8i processor, into OTS solutions like the BladeCenter QS22 used for Roadrunner as well as OTS offload PCI-E cards like the Fixstars GigaAccel 180 (see Resources).

GPU offload. NVIDIA CUDA for the Tesla GP-GPU and GeForce/Quadro GPUs, along with the AMD/ATI Stream Computing software development kit (SDK), provides programming environments for writing SIMD kernels for offload in hybrid architectures as well as methods for developing and debugging HPC applications that employ OTS components like GP-GPUs. (See Resources for more information on CUDA/Tesla and Stream Computing/AMD FireStream.)

SIMD instruction set extension. Although GP-GPUs are helping to bring hundreds of cores to OTS HPC for offloading mathematically intensive kernels, Intel (with SSE 4.x) and AMD are likewise adding SIMD instruction set extensions to traditional processors. Both the Nehalem and Shanghai processor complexes will bring additional SIMD instructions to the market in 2009 (see Resources for Intel Performance Primitives).
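
To give a concrete feel for SIMD instruction set extensions, the following is a minimal sketch (not part of the downloadable example code) that XORs two 4KB blocks 128 bits at a time using SSE2 integer intrinsics, which are available with essentially any current x86 compiler; the block size and function name are illustrative assumptions. Compilers can often auto-vectorize a plain byte-wise XOR loop, but explicit intrinsics make the data-parallel intent obvious.

#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_BYTES 4096

/* XOR src into dst, 128 bits at a time (the core of RAID-5 parity). */
static void xor_block_sse2(uint8_t *dst, const uint8_t *src)
{
    for (size_t i = 0; i < BLOCK_BYTES; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_xor_si128(a, b));
    }
}

int main(void)
{
    uint8_t data[BLOCK_BYTES], parity[BLOCK_BYTES];
    memset(data, 0xAB, sizeof(data));
    memset(parity, 0x00, sizeof(parity));

    xor_block_sse2(parity, data);   /* parity now equals data */
    xor_block_sse2(parity, data);   /* XOR with the same data cancels it out */
    printf("parity[0] after double XOR: 0x%02X (expect 0x00)\n", parity[0]);
    return 0;
}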

Tools and techniques: Multi-core programming

In this section, you get a quick look at programming methods and the value of threading on multi-core processors as well as offloading to many-core Cell and GP-GPU hybrid architectures. Programming Cell/B.E. and PowerXCell 8i OTS offload engines has been made much easier by the programming environments that IBM makes available. The best way to get started with Cell programming is to install Linux® on a Sony PlayStation 3 (PS3) and write some code to accelerate threaded code with Synergistic Processing Element (SPE) offload. The article " SoC drawer—The Cell Broadband Engine chip: High-speed offload for the masses " provides an example to help get you going at home.

Programming GP-GPUs, by comparison, can be tricky; however, the newer NVIDIA Tesla GP-GPUs and the CUDA programming environment have made GP-GPU SIMD programming far easier than it was a year or two ago. Both offload methods provide an excellent way to accelerate compute/math kernels in larger-scale OTS HPC cluster applications. Spending time with both is recommended to determine how well your applications of interest can be accelerated using Cell or GP-GPU offload.

The redundant array of independent disks (RAID)-5 example code provided with this article (see Download) offers a simple demonstration of how threading can significantly speed up arithmetic logic unit (ALU) processing on the multi-core Intel Core™ 2 Duo processor I happen to have on my laptop. Running this code single threaded, once the data is cached, yields about 430,000 RAID-5 operations per second. The threaded version, running 16 threads on the Core 2 Duo processor, shows a significant improvement at about 980,000 RAID-5 operations per second. The following sessions from my laptop show the power of threading: Listing 1 shows the singly threaded RAID-5 run for the example code provided for download with this article; Listing 2 then shows the speed-up that threading on an OTS dual-core processor provides.

Listing 1. Singly threaded RAID-5 computations on a dual-core system

Sam Siewert@sam-laptop /cygdrive/c/Publishing/HPC-OTS/developerworks/hpcots/

$ ./testraid5
Test Done in 315000 microsecs for 100000 iterations
317460.317460 RAID-5 OPS computed per second
WITH PRECHECK ON WITH MODIFY ON WITH REBUILD ON WITH VERIFY ON
Test Done in 231000 microsecs for 100000 iterations
432900.432900 RAID-5 OPS computed per second
WITH PRECHECK ON WITH MODIFY ON WITH REBUILD ON WITH VERIFY ON

Now, the same RAID-5 block-level data verification ( PRECHECK ), XOR encoding of a parity block ( MODIFY ), and restoration of a lost block in the parity set ( REBUILD ), followed by data verification again, is repeated using 16 threads to process 16 blocks concurrently on my dual-core laptop, doubling performance.

Listing 2. The threaded version of the same RAID-5 computations on a dual-core system

Sam Siewert@sam-laptop /cygdrive/c/Publishing/HPC-OTS/developerworks/hpcots/raid

$ ./testraid5n 16
Will start 16 synthetic IO workers

**************** MULTI THREAD TESTS
Pthread Policy is SCHED_FIFO


min prio = 15, max prio = -14
PTHREAD SCOPE PROCESS

***************** TOTAL PERFORMANCE SUMMARY

For 16 threads, Total rate=983587.704950

Granted, RAID-5 encoding and rebuilding is embarrassingly parallel, but many applications in fact are, too, or have sections that can be significantly sped up with concurrency. Examples include many data-driven algorithms: Computational Fluid Dynamics (CFD), simulation, Monte Carlo analysis (running the same simulation with randomly varied initial conditions), image processing, data mining, global climate-change models, bioinformatics, and the list goes on.
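
To illustrate this kind of block-level concurrency, here is a simplified sketch (not the downloadable testraid5 code itself) that farms independent XOR parity computations out to POSIX threads; the thread count, block size, and structure names are assumptions chosen for demonstration. Because the stripes share no data, no locking is needed, which is exactly why this workload scales so well across cores.

/* Compile with, for example: gcc -O2 parity_threads.c -lpthread */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_THREADS 16
#define BLOCK_BYTES 4096
#define DATA_DISKS  4   /* 4 data blocks + 1 parity block per stripe */

struct stripe_job {
    uint8_t data[DATA_DISKS][BLOCK_BYTES];
    uint8_t parity[BLOCK_BYTES];
};

/* Each worker computes the XOR parity for one independent stripe. */
static void *parity_worker(void *arg)
{
    struct stripe_job *job = arg;
    memcpy(job->parity, job->data[0], BLOCK_BYTES);
    for (int d = 1; d < DATA_DISKS; d++)
        for (int i = 0; i < BLOCK_BYTES; i++)
            job->parity[i] ^= job->data[d][i];
    return NULL;
}

int main(void)
{
    static struct stripe_job jobs[NUM_THREADS];
    pthread_t tid[NUM_THREADS];

    /* Fill the stripes with synthetic data. */
    for (int t = 0; t < NUM_THREADS; t++)
        for (int d = 0; d < DATA_DISKS; d++)
            memset(jobs[t].data[d], t + d, BLOCK_BYTES);

    /* One thread per stripe: the stripes are independent, so no locks are required. */
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_create(&tid[t], NULL, parity_worker, &jobs[t]);
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);

    printf("parity[0] of stripe 0 = 0x%02X\n", jobs[0].parity[0]);
    return 0;
}

On a dual-core machine, the 16 workers simply time-slice two at a time, which is why the testraid5n run above roughly doubles throughput rather than scaling 16 times.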

The potential to further accelerate algorithms like RAID-5 and RAID-6 using not just multi-core but many-core processors like the PowerXCell 8i and GP-GPUs is significant, in addition to the more obvious scientific algorithms that are adaptable to SIMD programming. In fact, as noted in the November 2008 Linux Journal, efforts to create extensions to software RAID that use GP-GPUs to accelerate RAID-6 are in progress, making double-fault protection for OTS HPC using software RAID a real possibility. Traditionally, RAID-6 has required custom application-specific integrated circuits (ASICs) that provide RAID on chip (RoC). Adapting the example RAID-5 code is beyond the scope of this introductory article, but for those of you interested in digging deeper using either CUDA or the IBM Linux Cell programming tools, you can easily adapt the code to Cell or GP-GPU acceleration (see Resources). In fact, the ambitious reader can construct a small-scale version of an OTS HPC system using PS3 systems or NVIDIA GP-GPUs. A teraflop-capable machine is within the reach of modest budgets for the first time ever using OTS HPC approaches.

Flynn's classification of architectures

Flynn's classification of architectures includes all possible combinations of Single Instruction, Single Data (SISD), Single Instruction, Multiple Data (SIMD), Multiple Instruction, Single Data (MISD), and Multiple Instruction, Multiple Data (MIMD) architectures. Traditional uniprocessors are SISD; the Cell SPE is SIMD; and GPUs are typically SIMD, as are vector instruction set extensions like Intel Streaming SIMD Extensions (SSE) version 4. Pipelined and systolic array architectures are often considered MISD, and MIMD is typical of SMP, NUMA, and multi-core architectures. Multiple processors may share memory (as in SMP or NUMA) or may be fully distributed, using message passing to synchronize and share data.

Advances in OTS interconnection networks

The open fabric architecture for 10G Ethernet, storage area network (SAN) Internet Small Computer System Interface (iSCSI), and Fibre Channel, along with InfiniBand, has simplified OTS HPC cluster node interconnection and connection to storage. Scale-out of OTS HPC components is enabled by software and hardware technology from the Open Fabrics Alliance (see Resources). OTS interconnection options found in Open Fabrics include:

4G and 8G Fibre Channel. For initiator-to-SCSI block storage RAID and solid-state drive (SSD) devices

Single Data Rate (SDR), Double Data Rate (DDR), and Quad Data Rate (QDR) InfiniBand (10, 20, and 40G, respectively). For cluster message passing

10G Ethernet. For cluster message passing and iSCSI storage interfaces


I encourage you to learn more about these options from IBM and the Open Fabrics Alliance (see Resources). Gigabit Ethernet has become a commodity, and 10G Ethernet is rapidly being cost-reduced along with SDR 10G and DDR 20G InfiniBand, once again making gigascale clustering attainable with modest budgets using OTS HPC concepts.

Skills and competencies: Programming for clusters

Achieving good utilization of scaled-out clusters requires message passing between nodes for data sharing and synchronization as well as process and thread control across all nodes in the cluster. The OpenMP and MPI programming frameworks provide excellent application programming interfaces (APIs) for developers writing cluster applications: OpenMP for threading across the cores within a node and MPI for message passing between nodes (see Resources). These two frameworks abstract the systems programming burden of balancing workload, sharing data between distributed nodes, and synchronizing concurrent computation, so cluster computing requires far less systems programming than it has in the past.
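
As a small taste of the combined OpenMP+MPI style, here is a minimal sketch (independent of any particular cluster product) in which OpenMP threads a partial sum across the cores of each node and MPI combines the partial sums across nodes. The loop bound is arbitrary, and the compiler wrapper and launcher names (mpicc, mpirun) vary by MPI distribution.

/* hello_hybrid.c: compile with, for example, "mpicc -fopenmp hello_hybrid.c -o hello_hybrid"
 * and run with, for example, "mpirun -np 4 ./hello_hybrid" (names depend on your MPI stack). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nodes;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process (node) am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);  /* how many processes in the job? */

    double local_sum = 0.0;

    /* OpenMP spreads this node's share of the loop over its cores. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = rank; i < 1000000; i += nodes)
        local_sum += (double)i;

    /* MPI combines the per-node partial sums on rank 0. */
    double total = 0.0;
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Sum over %d processes = %.0f\n", nodes, total);

    MPI_Finalize();
    return 0;
}

A launch such as mpirun -np 4 ./hello_hybrid starts four MPI processes (typically one per node), each of which spawns one OpenMP thread per core by default.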

Tools and techniques: Bonding and multi-pathing

The use of multiple Gigabit Ethernet, 10G Ethernet, SDR/DDR/QDR InfiniBand, and 4G/8G Fibre Channel links for SANs and OTS HPC node clustering becomes far more scalable and fault tolerant when multiple links and paths through scalable switches can be managed in an active-active or active-failover fashion. In the Ethernet message-passing world, the ability to make multiple Ethernet links appear as one IP interface is known as bonding. In Linux, Ethernet bonding is supported by open source code and most GigE and 10GE drivers (see Resources). Bonding allows you to configure OTS HPC systems to use any number of links as if they were one and to run active-active so that if one link fails, clustered nodes can continue to communicate with degraded performance.

Likewise, in Linux, for SANs using GigE, 10G Ethernet, or Fibre Channel, /dev/mapper multi-pathing can be configured so that SCSI devices can have multiple ports and paths from initiators to target logical unit numbers (LUNs) that are active-active. When multi-pathing is applied, initiators see target storage as one LUN with load-balanced use of all the paths to that storage and resiliency to path failures. Comparable multi-pathing features are provided by all flavors of Linux through /dev/mapper as well as by Windows Server® 2003 and Windows Server 2008 (see Resources).

Advances in OTS high-performance storage

Traditionally, RAID has required expensive custom hardware RAID controllers or RoC. Although these are still great options, advances in multi-core and many-core processing have made software RAID a much more viable option (see Linux Journal, November 2008). Simple RAID-0 striping over multiple disks is easily achieved in software and provides tremendous storage I/O speed-up, but it lacks failure protection. Striping over RAID-1 mirror sets, known as RAID-10, is a great solution: it is easy to implement with software RAID and provides single-disk failure protection along with the speed-up from striping.

RAID-5, which uses XOR parity calculation on blocks to provide recovery with 80 percent or better effective capacity (compared to 50 percent effective capacity in RAID-1), is more compute intensive and suffers inefficiencies for small write operations that require read-modify-write updates to disk. RAID-6, a Galois math-based double-parity scheme, provides protection against double faults in RAID sets with good capacity as well, but the computation is very complex. Both RAID-5 and RAID-6 can be striped for RAID-50 and RAID-60; however, this is traditionally done only with a hardware RAID controller or RoC. With the advent of GP-GPUs, it's possible that open source software RAID will make use of many-core accelerators to support RAID-5 and RAID-6 in the future.
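
To make the small-write read-modify-write cost concrete, here is a minimal sketch of the RAID-5 parity update rule (new parity = old parity XOR old data XOR new data): the array must read the old data and old parity before it can write the new data and new parity, which is where the small-write penalty comes from. The block size and function names are illustrative assumptions.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_BYTES 512

/* RAID-5 small-write update: read the old data block and old parity block,
 * recompute parity, then write the new data and new parity.
 * new_parity = old_parity ^ old_data ^ new_data */
static void raid5_update_parity(uint8_t *parity,
                                const uint8_t *old_data,
                                const uint8_t *new_data)
{
    for (int i = 0; i < BLOCK_BYTES; i++)
        parity[i] ^= old_data[i] ^ new_data[i];
}

int main(void)
{
    uint8_t d0[BLOCK_BYTES], d1[BLOCK_BYTES], parity[BLOCK_BYTES];
    memset(d0, 0x11, sizeof(d0));
    memset(d1, 0x22, sizeof(d1));

    /* Full-stripe parity for two data blocks. */
    for (int i = 0; i < BLOCK_BYTES; i++)
        parity[i] = d0[i] ^ d1[i];

    /* Small write to d0: update parity without touching d1. */
    uint8_t new_d0[BLOCK_BYTES];
    memset(new_d0, 0x33, sizeof(new_d0));
    raid5_update_parity(parity, d0, new_d0);
    memcpy(d0, new_d0, sizeof(d0));

    /* Rebuild d1 from the surviving block and parity to check correctness. */
    uint8_t rebuilt[BLOCK_BYTES];
    for (int i = 0; i < BLOCK_BYTES; i++)
        rebuilt[i] = parity[i] ^ d0[i];
    printf("rebuild %s\n", memcmp(rebuilt, d1, BLOCK_BYTES) == 0 ? "OK" : "FAILED");
    return 0;
}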

Flash devices have provided SSD options for storage for years; however, with newer NAND flash multi-level cell (MLC) and single-level cell (SLC) density advancements, the capacities and cost per gigabyte are making flash interesting as a Tier-0 for OTS HPC. Numerous SAS/SATA flash-based SSDs have been announced this year—for example, the Intel SLC and MLC SATA drives shown at the International Conference for High Performance Computing, Networking, Storage and Analysis 2008 (SC08). Likewise, emergent companies like FusionIO offer PCI-E card-based SSDs. The cost of a few terabytes of SSD is not low, but it is affordable. It is most likely that the industry will focus on how to pair SSDs in a terabyte Tier-0 with petabytes of traditional RAID storage for OTS HPC (see Resources).

In the past, RAID arrays have been large custom storage sub-systems integrated with HPC, but many new options for OTS RAID are emerging today, including both block-level and file-level RAID offerings. Some examples of much lower-cost commodity RAID for the home now exist (see the Resources entry for Netgear RAID) along with offerings from Dell and many new lower-cost OTS storage vendors. These systems may not have the performance density that HPC requires, but emergent companies such as Atrato Inc. and Data Direct Networks offer high-density solutions with gigabyte-per-second I/O rates from scalable OTS storage.

Skills and competencies: Software RAID

Basic software RAID levels—including RAID-1 (mirroring), RAID-0 (striping), and RAID-5 (parity encoding and recovery)—as well as their combinations can now be handled in software at gigabyte-per-second rates on today's processor cores. Likewise, combinations such as RAID-10 (striping over many mirror pairs) and RAID-50 (striping over many RAID-5 sets) can be handled by software RAID modules like the open source mdadm program in Linux. RAID-6 provides double-fault protection using "P" XOR parity along with "Q" Galois-encoded parity. RAID-6 sets can likewise be striped for a RAID-60 implementation (see Resources).
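
To give a sense of what the "Q" calculation involves, here is a minimal sketch of a Galois-field GF(2^8) multiply and a Q-parity computation over one byte from each of four data disks, following the common generator-based scheme (Q is the sum of g^i times each data byte, with g = {02}). Production implementations typically use table-driven and SIMD-optimized versions of this arithmetic; the constants and names here are for illustration only.

#include <stdint.h>
#include <stdio.h>

/* Multiply in GF(2^8) with the RAID-6 polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d). */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;
        uint8_t carry = a & 0x80;
        a <<= 1;
        if (carry)
            a ^= 0x1d;      /* reduce modulo the field polynomial */
        b >>= 1;
    }
    return p;
}

int main(void)
{
    /* One byte from each of four data disks at the same stripe offset. */
    uint8_t d[4] = { 0x11, 0x22, 0x33, 0x44 };

    uint8_t P = 0, Q = 0, g = 1;    /* g runs through powers of the generator {02} */
    for (int i = 0; i < 4; i++) {
        P ^= d[i];                  /* RAID-5-style XOR parity */
        Q ^= gf_mul(g, d[i]);       /* Q = sum of g^i * d[i] over GF(2^8) */
        g = gf_mul(g, 0x02);
    }
    printf("P = 0x%02X, Q = 0x%02X\n", P, Q);
    return 0;
}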

Tools and techniques: Software RAID

For a quick start with software RAID, you can experiment with mdadm using RAM disks, as shown in Listing 3. By default, Linux includes several small RAM disks that you can use to learn mdadm so that you don't have to buy new drives or risk using your system drive.

Listing 3. Basic commands in Linux using RAM disk and mdadm to build a RAID set

First, create a simple mirrored RAM disk pair with:

mdadm --create /dev/md0 --chunk=4 --level=1 --raid-devices=2 /dev/ram0 /dev/ram1
mke2fs -m 0 /dev/md0
mount /dev/md0 /mnt/r1

You can verify that it works using "dd" tests:

dd if=/dev/zero of=/mnt/r1/newfile bs=64k count=100

The state of the mirror pair can be seen with:

[root@localhost r1]# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Sat May 10 00:47:22 2008
     Raid Level : raid1
     Array Size : 16320 (15.94 MiB 16.71 MB)
  Used Dev Size : 16320 (15.94 MiB 16.71 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sat May 10 00:59:34 2008
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 1f0e5e8e:9b47c52d:6d4724ce:92d48e8b
         Events : 0.4

    Number   Major   Minor   RaidDevice   State
       0       1       0        0         active sync   /dev/ram0
       1       1       1        1         active sync   /dev/ram

Now, to test the protection provided by MDADM RAID-1, set one of the RAM disks faulty:

[root@localhost r1]# mdadm --manage --set-faulty /dev/md0 /dev/ram0
mdadm: set /dev/ram0 faulty in /dev/md0

[root@localhost r1]# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Sat May 10 00:47:22 2008
     Raid Level : raid1
     Array Size : 16320 (15.94 MiB 16.71 MB)
  Used Dev Size : 16320 (15.94 MiB 16.71 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sat May 10 01:06:42 2008
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           UUID : 1f0e5e8e:9b47c52d:6d4724ce:92d48e8b
         Events : 0.6

    Number   Major   Minor   RaidDevice   State
       0       0       0        0         removed
       1       1       1        1         active sync   /dev/ram

       2       1       0        -         faulty spare   /dev/ram0

Now, with one RAM disk set faulty, run a write test to the degraded RAID-1 set:

[root@localhost mnt]# dd if=/dev/zero of=/mnt/r1/newfile2 bs=64k count=100
100+0 records in
100+0 records out
6553600 bytes (6.6 MB) copied, 0.0162858 s, 402 MB/s

Finally, recover the faulty drive by removing it and adding it back, and let it resync the data:

[root@localhost mnt]# mdadm /dev/md0 -r /dev/ram0
mdadm: hot removed /dev/ram0
[root@localhost mnt]# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Sat May 10 00:47:22 2008
     Raid Level : raid1
     Array Size : 16320 (15.94 MiB 16.71 MB)
  Used Dev Size : 16320 (15.94 MiB 16.71 MB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sat May 10 01:20:20 2008
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

[root@localhost mnt]# mdadm /dev/md0 -a /dev/ram0
mdadm: added /dev/ram0
[root@localhost mnt]# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Sat May 10 00:47:22 2008
     Raid Level : raid1
     Array Size : 16320 (15.94 MiB 16.71 MB)
  Used Dev Size : 16320 (15.94 MiB 16.71 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sat May 10 01:22:11 2008
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 1f0e5e8e:9b47c52d:6d4724ce:92d48e8b
         Events : 0.24

    Number   Major   Minor   RaidDevice   State
       0       1       0        0         active sync   /dev/ram0
       1       1       1        1         active sync   /dev/ram

You can see MDADM resync the data to the restored RAM disk as follows:

[root@localhost mnt]# cat /proc/mdstat

           UUID : 1f0e5e8e:9b47c52d:6d4724ce:92d48e8b
         Events : 0.16

    Number   Major   Minor   RaidDevice   State
       0       0       0        0         removed
       1       1       1        1         active sync   /dev/ram

Advances in scalable file systems

Scalable parallel file systems and storage are fundamental to OTS HPC systems. IBM's General Parallel File System (GPFS) for the IBM AIX® and Linux operating systems provides a great solution and can be built upon scalable block storage using SAN scaling and Open Fabric networks. Going in depth on GPFS configuration is beyond the scope of this article, but you can find many excellent developerWorks articles on the subject—for example, Harish Chauhan's " Install and configure General Parallel File System (GPFS) on xSeries ." For an open source solution, Parallel NFS (pNFS) for Linux is a good option, along with the Parallel Virtual File System version 2 (PVFS2) (see Resources).

Putting all the OTS HPC components together still isn't easy, but Figure 1 and Figure 2 show examples built entirely from OTS components around IBM xSeries hardware, including PowerXCell 8i blades or PCI-E cards for SIMD acceleration, GPFS for parallel file system access to scalable DS 4xxx storage, and either a scalable cluster of xSeries 3655 nodes (Figure 1) or a scalable tri-blade configuration similar to Roadrunner (Figure 2).

Figure 1. Example of an HPC OTS xSeries configuration

In Figure 1, xSeries 3655 nodes are used to host a PowerXCell 8i offload PCI-E x16 card. This type of card is offered by Fixstars (see Resources) and can provide 180 GFLOPS of single-precision (SP) or 90 GFLOPS of double-precision (DP) offload for Cell-based algorithms. The integration of the PowerXCell 8i offload card and xSeries 3655 technology provides a hybrid HPC node that can also be clustered on a 10G Ethernet network or 10G/20G InfiniBand fabric. This type of configuration supports cluster digital signal processing (DSP) algorithms, image processing, or finite element methods (for example) that can benefit from Cell-based acceleration and likewise can scale through OpenMP+MPI methods for threading and message passing or similar software clustering and load-balancing approaches.

Processing is, of course, only one part of the overall system. In this example, GPFS is shown with scalable GPFS server nodes running on xSeries 3650 machines interfaced to the scalable block storage that IBM DS 4xxx controllers and disk expansion subsystems provide. Storage and I/O access performance often become the bottleneck in HPC systems, which are frequently high-throughput (HT) computing systems as well. To deal with this storage access bottleneck, you can build tiered storage approaches that use SSD (as shown in Figure 1) along with high-performance-density RAID solutions (see Resources).

Figure 2. Example of an HPC OTS BladeCenter H/HT configuration


Figure 2 shows an OTS HPC system similar to that in Figure 1 but employing BladeCenter H (BC-H) scalable processing and Open Fabrics I/O scaling. The advantage of BC-H is that a total of 14 blades can be accommodated in nine rack units of space, including the "tri-blade" configuration used in the Roadrunner petaflop hybrid system, which was composed of two QS22 blades and one LS21 blade per compute node interconnected through DDR InfiniBand. As in Figure 1, it is vitally important that HT computing also include scalable file systems like GPFS and scalable, high-performance-density block storage.

The scale of HPC

Today, terascale computing is within the reach of OTS solutions—for example, teraflops of compute, tens to hundreds of gigabits of interconnection bandwidth per node, many gigabytes to terabytes of RAM, terabytes of RAID storage, and a few terabytes of SSD fast-access storage. These scales can be reached, for example, with OTS PowerXCell blades, Opteron/Xeon blades, or xSeries systems plus GPU offloads, clustered with InfiniBand or 10G Ethernet, with scalable DDR3 memory, SSD fast-access storage, software RAID arrays, and open source Linux and GPFS. Petascale has been achieved recently by OTS systems like Roadrunner. Exascale remains a future vision for all of HPC, and zettascale can hardly be imagined. (As a reminder, mega=10^6, giga=10^9, tera=10^12, peta=10^15, exa=10^18, and zetta=10^21.)

Looking forward

Hybrid HPC OTS architectures, including Cell and GP-GPU offload, along with simplified clustering and threading and scalable storage and parallel file systems, have made OTS HPC a much more viable option. Designing, configuring, and building an OTS HPC system still isn't easy, but it is possible today—especially at the terascale level and, as Roadrunner has proven, even at the very limits of HPC (petascale, today). It is likely that HPC will continue to blend OTS and custom solutions, but the really good news is that the cost barrier to entry for HPC is coming down, which means that much more research important to society will get done more quickly, at lower cost, and with much greater efficiency.


Download

Description          Name      Size   Download method
Sample raid setup    raid.zip  60KB   HTTP

Information about download methods

Resources

Learn

Most operating systems include the ability to multi-path SAN-attached storage. This Guide on Multipath is an excellent resource for Linux. Likewise, in Windows Server 2008, MPIO (multipath) options are built directly into Device Manager, and in Windows Server 2003 they are available as a driver module; this Web page is very helpful for Windows MPIO users. Just like multi-pathing for SAN storage, bonding of Ethernet interfaces for active-active or active-failover redundancy on GigE or 10GE links is critical; this Linux Corner help page is a great place to start.

RAID-6 "P" encoding is simply XOR like RAID-5, but "Q" encoding requires mastery of Galois math to provide double fault protection. This Intel Intelligent RAID-6 whitepaper , provides a great overview of the Galois math to implement RAID-6.

The concept of "tiered" or hierarchical storage management is well defined by Wikipedia , with SSD or RAM-based tiers often referred to as a "tier-0" in this type of strategy.

IBM Deep Computing helped Los Alamos National Laboratory build Roadrunner using QS22 PowerXCell 8i blades and Opteron LS21 blades with InfiniBand interconnection. The November 2008 issue of Linux Journal has an excellent article on Roadrunner.

For the second year in a row, I've been lucky enough to attend the supercomputing conference— SC-08 this year—where the theme was OTS HPC systems. This was a significant change from the previous year's conference, when the most notable themes were the quest for exascale computing (still beyond our reach) and hybrid computers built from multi-processor clusters and FPGA offload engines. This year, offload seems to have gone OTS with the emergence of numerous OTS Cell and GP-GPU solutions. This trend appears to be more than a fad.

The TOP500 ranks HPC systems by compute performance and includes many systems that employ OTS HPC designs as well as more customized designs like Blue Gene/L.

This LLNL OpenMP tutorial is a great place to get started with cluster programming methods along with the Wikipedia page on MPI .

The developerWorks article " SoC drawer: The Cell Broadband Engine chip: High-speed offload for the masses " (Sam Siewert, developerWorks, April 2007) is a good place to get going with Cell programming on Linux using a PS3. The Cell Broadband Engine resource center is a great place to get all the latest documentation and latest SDK.

Browse the technology bookstore for books on these and other technical topics.


Get products and technologies

Download CUDA from NVIDIA's Web site to develop SIMD offload/acceleration code for Tesla GP-GPUs, GeForce GPUs, or Quadro GPUs.

Download the ATI Stream Computing SDK from AMD/ATI's Web site for use with the new AMD FireStream GP-GPU.

pNFS provides an alternative parallel file system to GPFS and can be downloaded as open source from pNFS . Likewise, PVFS is an open source option.

More information on PCI-E-integrated PowerXCell 8i processors can be found on the Fixstars site for the GigaAccel 180 .

The BladeCenter LS21 and QS22 blades, along with the Open Fabric integration used in the Roadrunner configuration, are available OTS from IBM.

The Cavium OCTEON Plus XL is another interesting combined offload and GigE PCI-E card that can be used for both networking and algorithm acceleration.

Many new RAID devices have lowered the cost on high-performance storage, from the very low-cost home/small business Netgear RAID box to Dell's EqualLogic to the IBM DS4xxx and DS8xxx scalable storage systems to newer HPC OTS offerings such as DDN and Atrato Inc. V1000 .

Options from Intel , Micron , and FusionIO as well as many other flash and SSD vendors exist that can be integrated with xSeries and BladeCenter either via PCI-E or SAS/SATA interfaces.

The Intel Performance Primitives library provides support for taking advantage of SSE4 and multi-core processor architectures.

The Open Fabrics Alliance is well supported at IBM for BladeCenter and xSeries.

Discuss

Check out developerWorks blogs and get involved in the developerWorks community .

About the author

Dr. Sam Siewert is a systems and software architect who has worked in the aerospace, telecommunications, digital cable, and storage industries. He also teaches at the University of Colorado at Boulder in the Embedded Systems Certification Program, which he co-founded in 2000. His research interests include high-performance computing, broadband networks, real-time media, distance learning environments, and embedded real-time systems.

