
Infrastructure architecture essentials, Part 7: High-performance computing off the shelf

Concepts and techniques

Sam Siewert ( [email protected] ), Principal Software Architect/Adjunct Professor, University of Colorado

Summary: The year 2008 will forever be remembered as the year of the off-the-shelf (OTS) supercomputer, thanks to the Los Alamos National Laboratory (LANL) and IBM team that constructed the world's first machine to break the peta-FLOP (1,000,000,000,000,000 floating-point operations per second) barrier. Get an overview of OTS strategies for architecting high-performance computing (HPC) systems as well as the methods and concepts behind building HPC systems from OTS components and open source software.

Date: 09 Dec 2008 Level: Intermediate


Continuing the Infrastructure architecture essentials series, this article provides an overview of methods for building HPC systems with OTS components and open source software. The architectures covered employ clusters and hybrid nodes that combine traditional multi-core symmetrical multiprocessing (SMP)/non-uniform memory access (NUMA) designs with single-instruction, multiple-data (SIMD) Cell-based or graphics processing unit (GPU)-based offloading. Methods for implementing Cell-based and GPU-based offloads are not reviewed here in detail, but you can find numerous excellent references on Cell-based algorithm acceleration (see Resources) as well as significant help with GPU offload in the NVIDIA Compute Unified Device Architecture (CUDA) environment (see Resources). Open source code that assists with HPC cluster and hybrid offload applications is prevalent, and the skills and competencies necessary for such architectures are reviewed here to help you get started.

Advances in OTS processor complexes

Numerous individual architecture advances made by IBM and IBM partners have made OTS HPC a reality. The best proof came this past summer, when the Roadrunner system broke the petaflop (1x10^15 floating-point operations per second) barrier using OTS IBM® BladeCenter® server boards (see Resources). The Roadrunner system employs two BladeCenter QS22 blades with IBM PowerXCell™ 8i processors and one LS21 AMD Opteron blade in a tri-blade configuration. The Roadrunner system is currently first on the TOP500 supercomputing list (see Resources). Here's a quick review of the emerging OTS technologies that are making OTS HPC possible:

Virtualization software. The emergence of software that makes one resource look like many and many resources look like one, first demonstrated by IBM with the original mainframe virtual machine (VM), is fundamental to authoring scalable applications that can exploit large clusters of OTS processing, memory, input/output (I/O) and storage resources.


Multicore processors. Since uniprocessor clock rates peaked just below 4GHz, AMD and Intel have both developed a wide offering of SMP and NUMA architectures for OTS mainboards and have interesting new multi-core architectures coming out with the AMD Shanghai and Intel® Nehalem processor complexes. Multi-core processor complexes have become typical for all of general-purpose computing (GPC) and have helped to motivate OTS HPC solutions built from scalable clusters of OTS compute nodes along with software libraries to exploit multiple-instruction, multiple-data (MIMD) architectures. (A short sketch following this overview shows how an application can query the cores available to it.)

Scalable I/O hubs. The IBM xSeries® system includes both traditional SMP memory controller hub interfaces to the PCI-E bus, memory, and processor cores as well as NUMA scaling with options like the IBM System x3950. Many new chip sets will employ protocols such as Intel's QuickPath Interconnect (QPI) and AMD's HyperTransport for NUMA scaling in 2009. The x3950 provides NUMA scaling of up to four systems and a total of 28 PCI-E x8 I/O expansion slots (seven interfaces per x3950 system).

Scalable memory interfaces. As memory is scaled, many systems are employing protocols such as DDR3, which increases transfer rates to 12800 MB/sec per memory bank, with the capability to easily scale to 256GB of memory per processing node with OTS memory technology.

Manycore SIMD offload engines. The Cell Broadband Engine™ (Cell/B.E.™) and PowerXCell 8i processors as well as GP-GPUs from NVIDIA and AMD/ATI provide tens to hundreds of offload cores for SIMD acceleration of applications.

IBM xSeries Cluster 1350. IBM-supported clustering of xSeries rack-mount or BladeCenter MIMD clusters.

IBM pSeries® Cluster 1600. IBM-supported clustering of pSeries IBM POWER™ architecture systems.

BladeCenter. A highly integrated blade server platform with a mid-plane and IBM BladeCenter Open Fabric I/O for a variety of IBM POWER6, AMD Opteron, Intel Xeon®, and Cell processing boards.
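
Because the number of cores varies widely across OTS nodes, portable code usually discovers the core count at run time and sizes its thread pool accordingly. Here is a minimal sketch, assuming a Linux/UNIX-style system where sysconf(_SC_NPROCESSORS_ONLN) is available; the thread-pool sizing heuristic shown is purely illustrative.

#include <stdio.h>
#include <unistd.h>     /* sysconf() */

int main(void)
{
    /* Number of processor cores currently online, as reported by the OS. */
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    if (cores < 1)
        cores = 1;      /* fall back to a single worker if the query fails */

    /* A common heuristic: one worker thread per core for CPU-bound work,
     * more than one per core for I/O-bound work that blocks frequently. */
    long cpu_workers = cores;
    long io_workers  = 2 * cores;

    printf("online cores: %ld, CPU-bound workers: %ld, I/O-bound workers: %ld\n",
           cores, cpu_workers, io_workers);
    return 0;
}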

Skills and competencies: Offloading and SIMD instruction set extensions

HPC OTS clusters can now leverage SIMD instruction sets as well as Cell and GP-GPU SIMD many-core processors like the NVIDIA Tesla. Here's a quick overview of options:

Cell processor offload. The Cell design, originally developed for digital media with the Cell/B.E., has found its way into IBM Blue Gene®/L and now, with the PowerXCell 8i processor, into OTS solutions like the BladeCenter QS22 used for Roadrunner as well as OTS offload PCI-E cards like the Fixstars GigaAccel 180 (see Resources).

GPU offload. NVIDIA CUDA for the Tesla GP-GPU and GeForce/Quadro GPUs, along with the AMD/ATI Stream Computing software development kit (SDK), provides programming environments for writing SIMD kernels for offload in hybrid architectures as well as methods for developing and debugging HPC applications that employ OTS components like GP-GPUs. (See Resources for more information on CUDA/Tesla and Stream Computing/AMD FireStream.)

SIMD instruction set extension. Although GP-GPUs are helping to bring hundreds of cores to OTS HPC for offloading mathematically intensive kernels, Intel (with SSE 4.x) and AMD are likewise adding SIMD instruction set extensions to traditional processors. Both the Nehalem and Shanghai processor complexes will bring additional SIMD instructions to the market in 2009 (see Resources for Intel Performance Primitives).
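
To give a concrete feel for SIMD instruction set extensions, the following is a minimal sketch (not part of the downloadable example code) that XORs two 4KB blocks 128 bits at a time using SSE2 integer intrinsics, which are available with essentially any current x86 compiler; the block size and function name are illustrative assumptions. Compilers can often auto-vectorize a plain byte-wise XOR loop, but explicit intrinsics make the data-parallel intent obvious.

#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_BYTES 4096

/* XOR src into dst, 128 bits at a time (the core of RAID-5 parity). */
static void xor_block_sse2(uint8_t *dst, const uint8_t *src)
{
    for (size_t i = 0; i < BLOCK_BYTES; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_xor_si128(a, b));
    }
}

int main(void)
{
    uint8_t data[BLOCK_BYTES], parity[BLOCK_BYTES];
    memset(data, 0xAB, sizeof(data));
    memset(parity, 0x00, sizeof(parity));

    xor_block_sse2(parity, data);   /* parity now equals data */
    xor_block_sse2(parity, data);   /* XOR with the same data cancels it out */
    printf("parity[0] after double XOR: 0x%02X (expect 0x00)\n", parity[0]);
    return 0;
}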

Tools and techniques: Multi-core programming

In this section, you get a quick look at programming methods and the value of threading on multi-core processors as well as offloading to many-core Cell and GP-GPU hybrid architectures. Programming Cell/B.E. and PowerXCell 8i OTS offload engines has been made much easier by the programming environments that IBM makes available. The best way to get started with Cell programming is to install Linux® on a Sony PlayStation 3 (PS3) and write some code to accelerate threaded code with Synergistic Processing Element (SPE) offload. The article " SoC drawer—The Cell Broadband Engine chip: High-speed offload for the masses " provides an example to help get you going at home.

Programming GP-GPUs, by comparison, can be tricky; however, the newer NVIDIA Tesla GP-GPUs and the CUDA programming environment have made GP-GPU SIMD programming far easier than it was a year or two ago. Both offload methods provide an excellent way to accelerate compute/math kernels in larger-scale OTS HPC cluster applications. Spending time with both is recommended to determine how well your applications of interest can be accelerated using Cell or GP-GPU offload.

The redundant array of independent disks (RAID)-5 example code provided with this article (see Download) offers a simple demonstration of how threading can significantly speed up arithmetic logic unit (ALU) processing on the multi-core Intel Core™ 2 Duo processor I happen to have on my laptop. Running this code single threaded, once the data is cached, yields about 430,000 RAID-5 operations per second. The threaded version, running 16 threads on the Core 2 Duo processor, shows a significant improvement at about 980,000 RAID-5 operations per second. The following sessions from my laptop show the power of threading: Listing 1 shows the singly threaded RAID-5 run for the example code provided for download with this article; Listing 2 then shows the speed-up that threading on an OTS dual-core processor provides.

Listing 1. Singly threaded RAID-5 computations on a dual-core system

Sam Siewert@sam-laptop /cygdrive/c/Publishing/HPC-OTS/developerworks/hpcots/

$ ./testraid5
Test Done in 315000 microsecs for 100000 iterations
317460.317460 RAID-5 OPS computed per second
WITH PRECHECK ON WITH MODIFY ON WITH REBUILD ON WITH VERIFY ON
Test Done in 231000 microsecs for 100000 iterations
432900.432900 RAID-5 OPS computed per second
WITH PRECHECK ON WITH MODIFY ON WITH REBUILD ON WITH VERIFY ON

Now, the same RAID-5 block-level data verification ( PRECHECK ), XOR encoding of a parity block ( MODIFY ), and restoration of a lost block in the parity set ( REBUILD ), followed by data verification again, is repeated using 16 threads to process 16 blocks concurrently on my dual-core laptop, doubling performance.

Listing 2. The threaded version of the same RAID-5 computations on a dual-core system

Sam Siewert@sam-laptop /cygdrive/c/Publishing/HPC-OTS/developerworks/hpcots/raid

$ ./testraid5n 16
Will start 16 synthetic IO workers

**************** MULTI THREAD TESTS
Pthread Policy is SCHED_FIFO


min prio = 15, max prio = -14
PTHREAD SCOPE PROCESS

***************** TOTAL PERFORMANCE SUMMARY

For 16 threads, Total rate=983587.704950

Granted, RAID-5 encoding and rebuilding is embarrassingly parallel, but many applications in fact are, too, or have sections that can be significantly sped up with concurrency. Examples include many data-driven algorithms: Computational Fluid Dynamics (CFD), simulation, Monte Carlo analysis (running the same simulation with randomly varied initial conditions), image processing, data mining, global climate-change models, bioinformatics, and the list goes on.
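
To illustrate this kind of block-level concurrency, here is a simplified sketch (not the downloadable testraid5 code itself) that farms independent XOR parity computations out to POSIX threads; the thread count, block size, and structure names are assumptions chosen for demonstration. Because the stripes share no data, no locking is needed, which is exactly why this workload scales so well across cores.

/* Compile with, for example: gcc -O2 parity_threads.c -lpthread */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_THREADS 16
#define BLOCK_BYTES 4096
#define DATA_DISKS  4   /* 4 data blocks + 1 parity block per stripe */

struct stripe_job {
    uint8_t data[DATA_DISKS][BLOCK_BYTES];
    uint8_t parity[BLOCK_BYTES];
};

/* Each worker computes the XOR parity for one independent stripe. */
static void *parity_worker(void *arg)
{
    struct stripe_job *job = arg;
    memcpy(job->parity, job->data[0], BLOCK_BYTES);
    for (int d = 1; d < DATA_DISKS; d++)
        for (int i = 0; i < BLOCK_BYTES; i++)
            job->parity[i] ^= job->data[d][i];
    return NULL;
}

int main(void)
{
    static struct stripe_job jobs[NUM_THREADS];
    pthread_t tid[NUM_THREADS];

    /* Fill the stripes with synthetic data. */
    for (int t = 0; t < NUM_THREADS; t++)
        for (int d = 0; d < DATA_DISKS; d++)
            memset(jobs[t].data[d], t + d, BLOCK_BYTES);

    /* One thread per stripe: the stripes are independent, so no locks are required. */
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_create(&tid[t], NULL, parity_worker, &jobs[t]);
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);

    printf("parity[0] of stripe 0 = 0x%02X\n", jobs[0].parity[0]);
    return 0;
}

On a dual-core machine, the 16 workers simply time-slice two at a time, which is why the testraid5n run above roughly doubles throughput rather than scaling 16 times.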

The potential to further accelerate algorithms like RAID-5 and RAID-6 using not just multi-core but many-core processors like the PowerXCell 8i and GP-GPUs is significant, in addition to the more obvious scientific algorithms that are adaptable to SIMD programming. In fact, as noted in the November 2008 Linux Journal, efforts to create extensions to software RAID that use GP-GPUs to accelerate RAID-6 are in progress, making double-fault protection for OTS HPC using software RAID a real possibility. Traditionally, RAID-6 has required custom application-specific integrated circuits (ASICs) that provide RAID on chip (RoC). Adapting the example RAID-5 code is beyond the scope of this introductory article, but for those of you interested in digging deeper using either CUDA or the IBM Linux Cell programming tools, you can easily adapt the code to Cell or GP-GPU acceleration (see Resources). In fact, the ambitious reader can construct a small-scale version of an OTS HPC system using PS3 systems or NVIDIA GP-GPUs. A teraflop-capable machine is within the reach of modest budgets for the first time ever using OTS HPC approaches.

Flynn's classification of architectures

Flynn's classification of architectures includes all possible combinations of Single Instruction, Single Data (SISD), Single Instruction, Multiple Data (SIMD), Multiple Instruction, Single Data (MISD), and Multiple Instruction, Multiple Data (MIMD) architectures. Traditional uniprocessors are SISD; the Cell SPE is SIMD; and GPUs are typically SIMD, as are vector instruction set extensions like Intel Streaming SIMD Extensions (SSE) version 4. Pipelined and systolic array architectures are often considered MISD, and MIMD is typical of SMP, NUMA, and multi-core architectures. Multiple processors may share memory (as in SMP or NUMA) or may be fully distributed, using message passing to synchronize and share data.

Advances in OTS interconnection networks

The open fabric architecture for 10G Ethernet, storage area network (SAN) Internet Small Computer System Interface (iSCSI), and Fibre Channel, along with InfiniBand, has simplified OTS HPC cluster node interconnection and connection to storage. Scale-out of OTS HPC components is enabled by software and hardware technology from the Open Fabrics Alliance (see Resources). OTS interconnection options found in Open Fabrics include:

4G and 8G Fibre Channel. For initiator-to-SCSI block storage RAID and solid-state drive (SSD) devices

Single Data Rate (SDR), Double Data Rate (DDR), and Quad Data Rate (QDR) InfiniBand (10, 20, and 40G, respectively). For cluster message passing

10G Ethernet. For cluster message passing and iSCSI storage interfaces


I encourage you to learn more about these options from IBM and the Open Fabrics Alliance (see Resources). Gigabit Ethernet has become a commodity, and 10G Ethernet is rapidly being cost-reduced along with SDR 10G and DDR 20G InfiniBand, once again making gigascale clustering attainable with modest budgets using OTS HPC concepts.

Skills and competencies: Programming for clusters

Achieving good utilization of scaled-out clusters requires message passing between nodes for data sharing and synchronization as well as process and thread control across all nodes in the cluster. The OpenMP and MPI programming frameworks provide excellent application programming interfaces (APIs) for developers writing cluster applications: OpenMP for threading across the cores within a node and MPI for message passing between nodes (see Resources). These two frameworks abstract the systems programming burden of balancing workload, sharing data between distributed nodes, and synchronizing concurrent computation, so cluster computing requires far less systems programming than it has in the past.
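
As a small taste of the combined OpenMP+MPI style, here is a minimal sketch (independent of any particular cluster product) in which OpenMP threads a partial sum across the cores of each node and MPI combines the partial sums across nodes. The loop bound is arbitrary, and the compiler wrapper and launcher names (mpicc, mpirun) vary by MPI distribution.

/* hello_hybrid.c: compile with, for example, "mpicc -fopenmp hello_hybrid.c -o hello_hybrid"
 * and run with, for example, "mpirun -np 4 ./hello_hybrid" (names depend on your MPI stack). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nodes;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process (node) am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);  /* how many processes in the job? */

    double local_sum = 0.0;

    /* OpenMP spreads this node's share of the loop over its cores. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = rank; i < 1000000; i += nodes)
        local_sum += (double)i;

    /* MPI combines the per-node partial sums on rank 0. */
    double total = 0.0;
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Sum over %d processes = %.0f\n", nodes, total);

    MPI_Finalize();
    return 0;
}

A launch such as mpirun -np 4 ./hello_hybrid starts four MPI processes (typically one per node), each of which spawns one OpenMP thread per core by default.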

Tools and techniques: Bonding and multi-pathing

The use of multiple Gigabit Ethernet, 10G Ethernet, SDR/DDR/QDR InfiniBand, and 4G/8G Fibre Channel links for SANs and OTS HPC node clustering becomes far more scalable and fault tolerant when multiple links and paths through scalable switches can be managed in an active-active or active-failover fashion. In the Ethernet message-passing world, the ability to make multiple Ethernet links appear as one IP interface is known as bonding. In Linux, Ethernet bonding is supported by open source code and most GigE and 10GE drivers (see Resources). Bonding allows you to configure OTS HPC systems to use any number of links as if they were one and to run active-active so that if one link fails, clustered nodes can continue to communicate with degraded performance.

Likewise, in Linux, for SANs using GigE, 10G Ethernet, or Fibre Channel, /dev/mapper multi-pathing can be configured so that SCSI devices can have multiple ports and paths from initiators to target logical unit numbers (LUNs) that are active-active. When multi-pathing is applied, initiators see target storage as one LUN with load-balanced use of all the paths to that storage and resiliency to path failures. Comparable multi-pathing features are provided by all flavors of Linux through /dev/mapper as well as by Windows Server® 2003 and Windows Server 2008 (see Resources).

Advances in OTS high-performance storage

Traditionally, RAID has required expensive custom hardware RAID controllers or RoC. Although these are still great options, advances in multi-core and many-core processing have made software RAID a much more viable option (see Linux Journal, November 2008). Simple RAID-0 striping over multiple disks is easily achieved in software and provides tremendous storage I/O speed-up, but it lacks failure protection. Striping over RAID-1 mirror sets, known as RAID-10, is a great solution: it is easy to implement with software RAID and provides single-disk failure protection along with the speed-up from striping.

RAID-5, which uses XOR parity calculation on blocks to provide recovery with 80 percent or better effective capacity (compared to 50 percent effective capacity in RAID-1), is more compute intensive and suffers inefficiencies for small write operations that require read-modify-write updates to disk. RAID-6, a Galois math-based double-parity scheme, provides protection against double faults in RAID sets with good capacity as well, but the computation is very complex. Both RAID-5 and RAID-6 can be striped for RAID-50 and RAID-60; however, this is traditionally done only with a hardware RAID controller or RoC. With the advent of GP-GPUs, it's possible that open source software RAID will make use of many-core accelerators to support RAID-5 and RAID-6 in the future.
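
To make the small-write read-modify-write cost concrete, here is a minimal sketch of the RAID-5 parity update rule (new parity = old parity XOR old data XOR new data): the array must read the old data and old parity before it can write the new data and new parity, which is where the small-write penalty comes from. The block size and function names are illustrative assumptions.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_BYTES 512

/* RAID-5 small-write update: read the old data block and old parity block,
 * recompute parity, then write the new data and new parity.
 * new_parity = old_parity ^ old_data ^ new_data */
static void raid5_update_parity(uint8_t *parity,
                                const uint8_t *old_data,
                                const uint8_t *new_data)
{
    for (int i = 0; i < BLOCK_BYTES; i++)
        parity[i] ^= old_data[i] ^ new_data[i];
}

int main(void)
{
    uint8_t d0[BLOCK_BYTES], d1[BLOCK_BYTES], parity[BLOCK_BYTES];
    memset(d0, 0x11, sizeof(d0));
    memset(d1, 0x22, sizeof(d1));

    /* Full-stripe parity for two data blocks. */
    for (int i = 0; i < BLOCK_BYTES; i++)
        parity[i] = d0[i] ^ d1[i];

    /* Small write to d0: update parity without touching d1. */
    uint8_t new_d0[BLOCK_BYTES];
    memset(new_d0, 0x33, sizeof(new_d0));
    raid5_update_parity(parity, d0, new_d0);
    memcpy(d0, new_d0, sizeof(d0));

    /* Rebuild d1 from the surviving block and parity to check correctness. */
    uint8_t rebuilt[BLOCK_BYTES];
    for (int i = 0; i < BLOCK_BYTES; i++)
        rebuilt[i] = parity[i] ^ d0[i];
    printf("rebuild %s\n", memcmp(rebuilt, d1, BLOCK_BYTES) == 0 ? "OK" : "FAILED");
    return 0;
}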

Flash devices have provided SSD options for storage for years; however, with newer NAND flash multi-level cell (MLC) and single-level cell (SLC) density advancements, the capacities and cost per gigabyte are making flash interesting as a Tier-0 for OTS HPC. Numerous SAS/SATA flash-based SSDs have been announced this year—for example, the Intel SLC and MLC SATA drives shown at the International Conference for High Performance Computing, Networking, Storage and Analysis 2008 (SC08). Likewise, emergent companies like FusionIO offer PCI-E card-based SSDs. The cost of a few terabytes of SSD is not low, but it is affordable. It is most likely that the industry will focus on how to pair SSDs in a terabyte Tier-0 with petabytes of traditional RAID storage for OTS HPC (see Resources).

In the past, RAID arrays have been large custom storage sub-systems integrated with HPC, but many new options for OTS RAID are emerging today, including both block-level and file-level RAID offerings. Some examples of much lower-cost commodity RAID for the home now exist (see the Resources entry for Netgear RAID) along with offerings from Dell and many new lower-cost OTS storage vendors. These systems may not have the performance density that HPC requires, but emergent companies such as Atrato Inc. and Data Direct Networks offer high-density solutions with gigabyte-per-second I/O rates from scalable OTS storage.

Skills and competencies: Software RAID

Basic software RAID levels—including RAID-1 (mirroring), RAID-0 (striping), and RAID-5 (parity encoding and recovery)—as well as their combinations can now be handled in software at gigabyte-per-second rates on today's processor cores. Likewise, combinations such as RAID-10 (striping over many mirror pairs) and RAID-50 (striping over many RAID-5 sets) can be handled by software RAID modules like the open source mdadm program in Linux. RAID-6 provides double-fault protection using "P" XOR parity along with "Q" Galois-encoded parity. RAID-6 sets can likewise be striped for a RAID-60 implementation (see Resources).
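
To give a sense of what the "Q" calculation involves, here is a minimal sketch of a Galois-field GF(2^8) multiply and a Q-parity computation over one byte from each of four data disks, following the common generator-based scheme (Q is the sum of g^i times each data byte, with g = {02}). Production implementations typically use table-driven and SIMD-optimized versions of this arithmetic; the constants and names here are for illustration only.

#include <stdint.h>
#include <stdio.h>

/* Multiply in GF(2^8) with the RAID-6 polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d). */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;
        uint8_t carry = a & 0x80;
        a <<= 1;
        if (carry)
            a ^= 0x1d;      /* reduce modulo the field polynomial */
        b >>= 1;
    }
    return p;
}

int main(void)
{
    /* One byte from each of four data disks at the same stripe offset. */
    uint8_t d[4] = { 0x11, 0x22, 0x33, 0x44 };

    uint8_t P = 0, Q = 0, g = 1;    /* g runs through powers of the generator {02} */
    for (int i = 0; i < 4; i++) {
        P ^= d[i];                  /* RAID-5-style XOR parity */
        Q ^= gf_mul(g, d[i]);       /* Q = sum of g^i * d[i] over GF(2^8) */
        g = gf_mul(g, 0x02);
    }
    printf("P = 0x%02X, Q = 0x%02X\n", P, Q);
    return 0;
}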

Tools and techniques: Software RAID

For a quick start with software RAID, you can experiment with mdadm using RAM disks, as shown in Listing 3. By default, Linux includes several small RAM disks that you can use to learn mdadm so that you don't have to buy new drives or risk using your system drive.

Listing 3. Basic commands in Linux using RAM disk and mdadm to build a RAID set

First, create a simple mirrored RAM disk pair with:

mdadm --create /dev/md0 --chunk=4 --level=1 --raid-devices=2 /dev/ram0 /dev/ram1
mke2fs -m 0 /dev/md0
mount /dev/md0 /mnt/r1

You can verify that it works using "dd" tests:

dd if=/dev/zero of=/mnt/r1/newfile bs=64k count=100

The state of the mirror pair can be seen with:

[root@localhost r1]# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Sat May 10 00:47:22 2008
     Raid Level : raid1
     Array Size : 16320 (15.94 MiB 16.71 MB)
  Used Dev Size : 16320 (15.94 MiB 16.71 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sat May 10 00:59:34 2008
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 1f0e5e8e:9b47c52d:6d4724ce:92d48e8b
         Events : 0.4

    Number   Major   Minor   RaidDevice   State
       0       1       0        0         active sync   /dev/ram0
       1       1       1        1         active sync   /dev/ram

Now, to test the protection provided by MDADM RAID-1, set one of the RAM disks faulty:

[root@localhost r1]# mdadm --manage --set-faulty /dev/md0 /dev/ram0
mdadm: set /dev/ram0 faulty in /dev/md0

[root@localhost r1]# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Sat May 10 00:47:22 2008
     Raid Level : raid1
     Array Size : 16320 (15.94 MiB 16.71 MB)
  Used Dev Size : 16320 (15.94 MiB 16.71 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sat May 10 01:06:42 2008
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           UUID : 1f0e5e8e:9b47c52d:6d4724ce:92d48e8b
         Events : 0.6

    Number   Major   Minor   RaidDevice   State
       0       0       0        0         removed
       1       1       1        1         active sync   /dev/ram

       2       1       0        -         faulty spare   /dev/ram0

Now, with one RAM disk set faulty, run a write test to the degraded RAID-1 set:

[root@localhost mnt]# dd if=/dev/zero of=/mnt/r1/newfile2 bs=64k count=100
100+0 records in
100+0 records out
6553600 bytes (6.6 MB) copied, 0.0162858 s, 402 MB/s

Finally, recover the faulty drive by removing it and adding it back, and let it resync the data:

[root@localhost mnt]# mdadm /dev/md0 -r /dev/ram0
mdadm: hot removed /dev/ram0
[root@localhost mnt]# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Sat May 10 00:47:22 2008
     Raid Level : raid1
     Array Size : 16320 (15.94 MiB 16.71 MB)
  Used Dev Size : 16320 (15.94 MiB 16.71 MB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sat May 10 01:20:20 2008
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

[root@localhost mnt]# mdadm /dev/md0 -a /dev/ram0
mdadm: added /dev/ram0
[root@localhost mnt]# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Sat May 10 00:47:22 2008
     Raid Level : raid1
     Array Size : 16320 (15.94 MiB 16.71 MB)
  Used Dev Size : 16320 (15.94 MiB 16.71 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sat May 10 01:22:11 2008
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 1f0e5e8e:9b47c52d:6d4724ce:92d48e8b
         Events : 0.24

    Number   Major   Minor   RaidDevice   State
       0       1       0        0         active sync   /dev/ram0
       1       1       1        1         active sync   /dev/ram

You can see MDADM resync the data to the restored RAM disk as follows:

[root@localhost mnt]# cat /proc/mdstat

           UUID : 1f0e5e8e:9b47c52d:6d4724ce:92d48e8b
         Events : 0.16

    Number   Major   Minor   RaidDevice   State
       0       0       0        0         removed
       1       1       1        1         active sync   /dev/ram

Advances in scalable file systems

Scalable parallel file systems and storage are fundamental to OTS HPC systems. IBM's General Parallel File System (GPFS) for the IBM AIX® and Linux operating systems provides a great solution and can be built upon scalable block storage using SAN scaling and Open Fabric networks. Going in depth on GPFS configuration is beyond the scope of this article, but you can find many excellent developerWorks articles on the subject—for example, Harish Chauhan's " Install and configure General Parallel File System (GPFS) on xSeries ." For an open source solution, Parallel NFS (pNFS) for Linux is a good option, along with the Parallel Virtual File System version 2 (PVFS2) (see Resources).

Putting all the OTS HPC components together still isn't easy, but Figure 1 and Figure 2 show examples built entirely from OTS components around IBM xSeries hardware, including PowerXCell 8i blades or PCI-E cards for SIMD acceleration, GPFS for parallel file system access to scalable DS 4xxx storage, and either a scalable cluster of xSeries 3655 nodes (Figure 1) or a scalable tri-blade configuration similar to Roadrunner (Figure 2).

Figure 1. Example of an HPC OTS xSeries configuration

In Figure 1, xSeries 3655 nodes are used to host a PowerXCell 8i offload PCI-E x16 card. This type of card is offered by Fixstars (see Resources) and can provide 180 GFLOPS of single-precision (SP) or 90 GFLOPS of double-precision (DP) offload for Cell-based algorithms. The integration of the PowerXCell 8i offload card and xSeries 3655 technology provides a hybrid HPC node that can also be clustered on a 10G Ethernet network or 10G/20G InfiniBand fabric. This type of configuration supports cluster digital signal processing (DSP) algorithms, image processing, or finite element methods (for example) that can benefit from Cell-based acceleration and likewise can scale through OpenMP+MPI methods for threading and message passing or similar software clustering and load-balancing approaches.

Processing is, of course, only one part of the overall system. In this example, GPFS is shown with scalable GPFS server nodes running on xSeries 3650 machines interfaced to the scalable block storage that IBM DS 4xxx controllers and disk expansion subsystems provide. Storage and I/O access performance often become the bottleneck in HPC systems, which are frequently high-throughput (HT) computing systems as well. To deal with this storage access bottleneck, you can build tiered storage approaches that use SSD (as shown in Figure 1) along with high-performance-density RAID solutions (see Resources).

Figure 2. Example of an HPC OTS BladeCenter H/HT configuration


Figure 2 shows an OTS HPC system similar to that in Figure 1 but employing BladeCenter H (BC-H) scalable processing and Open Fabrics I/O scaling. The advantage of BC-H is that a total of 14 blades can be accommodated in nine rack units of space, including the "tri-blade" configuration used in the Roadrunner petaflop hybrid system, which was composed of two QS22 blades and one LS21 blade per compute node interconnected through DDR InfiniBand. As in Figure 1, it is vitally important that HT computing also include scalable file systems like GPFS and scalable, high-performance-density block storage.

The scale of HPC

Today, terascale computing is within the reach of OTS solutions—for example, teraflops of compute, tens to hundreds of gigabits of interconnection bandwidth per node, many gigabytes to terabytes of RAM, terabytes of RAID storage, and a few terabytes of SSD fast-access storage. These scales can be reached, for example, with OTS PowerXCell blades, Opteron/Xeon blades, or xSeries systems plus GPU offloads, clustered with InfiniBand or 10G Ethernet, with scalable DDR3 memory, SSD fast-access storage, software RAID arrays, and open source Linux and GPFS. Petascale has been achieved recently by OTS systems like Roadrunner. Exascale remains a future vision for all of HPC, and zettascale can hardly be imagined. (As a reminder, mega=10^6, giga=10^9, tera=10^12, peta=10^15, exa=10^18, and zetta=10^21.)

Looking forward

Hybrid HPC OTS architectures, including Cell and GP-GPU offload, along with simplified clustering and threading and scalable storage and parallel file systems, have made OTS HPC a much more viable option. Designing, configuring, and building an OTS HPC system still isn't easy, but it is possible today—especially at the terascale level and, as Roadrunner has proven, even at the very limits of HPC (petascale, today). It is likely that HPC will continue to blend OTS and custom solutions, but the really good news is that the cost barrier to entry for HPC is coming down, which means that much more research important to society will get done more quickly, at lower cost, and with much greater efficiency.


Download

Description          Name      Size   Download method
Sample raid setup    raid.zip  60KB   HTTP

Information about download methods

Resources

Learn

Most operating systems include the ability to multi-path SAN-attached storage. This Guide on Multipath is an excellent resource for Linux. Likewise, in Windows Server 2008, MPIO (multipath) options are built directly into Device Manager, and in Windows Server 2003 they are available as a driver module; this Web page is very helpful for Windows MPIO users. Just like multi-pathing for SAN storage, bonding of Ethernet interfaces for active-active or active-failover redundancy on GigE or 10GE links is critical; this Linux Corner help page is a great place to start.

RAID-6 "P" encoding is simply XOR like RAID-5, but "Q" encoding requires mastery of Galois math to provide double fault protection. This Intel Intelligent RAID-6 whitepaper , provides a great overview of the Galois math to implement RAID-6.

The concept of "tiered" or hierarchical storage management is well defined by Wikipedia , with SSD or RAM-based tiers often referred to as a "tier-0" in this type of strategy.

IBM Deep Computing helped Los Alamos National Laboratory build Roadrunner using QS22 PowerXCell 8i blades and Opteron LS21 blades with InfiniBand interconnection. The November 2008 issue of Linux Journal has an excellent article on Roadrunner.

For the second year in a row, I've been lucky enough to attend the supercomputing conference— SC-08 this year—where the theme was OTS HPC systems. This was a significant change from the previous year's conference, when the most notable themes were the quest for exascale computing (still beyond our reach) and hybrid computers built from multi-processor clusters and FPGA offload engines. This year, offload seems to have gone OTS with the emergence of numerous OTS Cell and GP-GPU solutions. This trend appears to be more than a fad.

The TOP500 ranks HPC systems by compute performance and includes many systems that employ OTS HPC designs as well as more customized designs like Blue Gene/L.

This LLNL OpenMP tutorial is a great place to get started with cluster programming methods along with the Wikipedia page on MPI .

The developerWorks article " SoC drawer: The Cell Broadband Engine chip: High-speed offload for the masses " (Sam Siewert, developerWorks, April 2007) is a good place to get going with Cell programming on Linux using a PS3. The Cell Broadband Engine resource center is a great place to get all the latest documentation and latest SDK.

Browse the technology bookstore for books on these and other technical topics.


Get products and technologies

Download CUDA from NVIDIA's Web site to develop SIMD offload/acceleration code for Tesla GP-GPUs, GeForce GPUs, or Quadro GPUs.

Download the ATI Stream Computing SDK from AMD/ATI's Web site for use with the new AMD FireStream GP-GPU.

pNFS provides an alternative parallel file system to GPFS and can be downloaded as open source from pNFS . Likewise, PVFS is an open source option.

More information on PCI-E-integrated PowerXCell 8i processors can be found on the Fixstars site for the GigaAccel 180 .

The BladeCenter LS21 and QS22 blades, along with the Open Fabric integration used in the Roadrunner configuration, are available OTS from IBM.

The Cavium OCTEON Plus XL is another interesting combined offload and GigE PCI-E card that can be used for both networking and algorithm acceleration.

Many new RAID devices have lowered the cost on high-performance storage, from the very low-cost home/small business Netgear RAID box to Dell's EqualLogic to the IBM DS4xxx and DS8xxx scalable storage systems to newer HPC OTS offerings such as DDN and Atrato Inc. V1000 .

Options from Intel , Micron , and FusionIO as well as many other flash and SSD vendors exist that can be integrated with xSeries and BladeCenter either via PCI-E or SAS/SATA interfaces.

The Intel Performance Primitives library provides support for taking advantage of SSE4 and multi-core processor architectures.

The Open Fabrics Alliance is well supported at IBM for BladeCenter and xSeries.

Discuss

Check out developerWorks blogs and get involved in the developerWorks community .

About the author

Dr. Sam Siewert is a systems and software architect who has worked in the aerospace, telecommunications, digital cable, and storage industries. He also teaches at the University of Colorado at Boulder in the Embedded Systems Certification Program, which he co-founded in 2000. His research interests include high-performance computing, broadband networks, real-time media, distance learning environments, and embedded real-time systems.

