SGI Multi-Paradigm Architecture

Michael Woodacre Chief Engineer, Platform Group [email protected] A History of Innovation in HPC

Challenge® XL media server fuels Steven Spielberg’s Shoah NASA Ames and project to document ® set world Power Series™, Holocaust survivor record for multi-processing stories systems provide STREAMS Jim Clark compute power First systems benchmark founded SGI on SGI introduces for high-end deployed in Stephen the vision of its first 64-bit graphics Hawking’s COSMOS Altix®, first scalable Computer operating applications system 64-bit ® Server Visualization system

1982 1984 1988 1994 1995 1996 1997 1998 2001 2003 2004

DOE deploys 6144p Introduced First generation Origin 2000 to IRIS® Workstations modular NUMA System: monitor and become first integrated NUMAflex™ Origin® 2000 simulate nuclear 3D graphics systems architecture stockpile with Origin® 3000

First 512p Altix cluster Dockside engineering analysis on Origin® drives ocean research at NASA Ames 2000 and Indigo2™ helps Team New +10000p upgrade! Zealand win America’s Cup

Images courtesy of Team New Zealand and the University of Cambridge

SGI Proprietary

2 Over Time, Problems Get More Complex, Data Sets Exploding

Bumper, hood, engine, wheels Entire car E-crash dummy Organ damage

This Trend Continues Across SGI's Markets

Improve design Improve patient safety Improve oil exploration Improve hurricane prediction & manufacturing

First Row Images: EAI, Lana Rushing, Engineering Animation, Inc, Volvo Car Corporation, Images courtesy of the SCI, Second Row Images: The MacNeal-Schwendler Corp , Manchester Visualization Center and University Department of Surgery, Paradigm Geophysical, the Laboratory for Atmospheres,SGI Proprietary NASA Goddard Space Flight Center.

3 SGI Scalable ccNUMA Architecture Basic Node Structure and Interconnect

C C C C A A A A C CPU CPU C C CPU CPU C H H H H E E E E

NUMAlink™ Interconnect Interface Interface Chip Chip

Physical Memory Physical Memory

SGI Proprietary

4 SGI Scalable ccNUMA Architecture Basic Node Structure and Interconnect

C C C C A A A A C CPU CPU C C CPU CPU C H H H H E E E E

NUMAlink™ Interconnect Interface Interface Chip Chip

Global Shared Memory

SGI Proprietary

5 Logical Layout - 8TB

Altix 128 Processor 8TB - 1.6GB/s Uniform Memory Bandwidth Plane A Level 2 Routers RC2A RC4A RC6A RC8A RC3A RC5A RC7A RC9A

Level 1 Routers

RB2 RB3 RB6 RB7 RB10 RB11 RB14 RB15 RB18 RB19 RB22 RB23 RB26 RB27 RB30 RB31

R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28 R29 R30 R31 R32

C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C

R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28 R29 R30 R31 R32

RB2 RB3 RB6 RB7 RB10 RB11 RB14 RB15 RB18 RB19 RB22 RB23 RB26 RB27 RB30 RB31 Level 1 Routers

RC2B RC4B RC6B RC8B SGI Proprietary Level 2 Routers RC3B RC5B RC7B RC9B Plane B 6 3.2 Interconnect Topology 3.0

Bi-Section Bandwidth Profiles ) 2.8

GBs/sec/cpu cpu 2.6

sec/ 2.4 Bs/ G

( 2.2 th

id 2.0 w d

n 1.8 Ba 1.6 ion ct

e 1.4 -S i Dual Plane - NL3 router - 1.2

8 port router bricks wB a

R 1.0 Dual Plane - NL4 router - 8 port router bricks 0.8 0.6

0.4

0.2

SGI Proprietary 4 8 16 32 64 128 256 512 1024 2048 Processors 7 Examples of Single-Paradigm Architectures

Scalar Vector App-Specific

Intel X1 Graphics - GPU SGI MIPS NEC SX Signals - DSP IBM Power Prog’ble - FPGA Sun SPARC Other ASICs HP PA

SGI Proprietary

8 Paradigms to Applications

Application-specific high

Application-specific

Vector Scalar Intensity Compute Low

Low Data locality High

SGI Proprietary

9 Peer I/O: Increased I/O Flexibility & Performance

XIO+ I/O C R C

NL I/O XIO+ C

1:1 Ratio C-brick to I/O

Peer I/O

I/O R C

NL I/O

SGI Proprietary

10 SGI Scalable ccNUMA Architecture Multi-Paradigm Computing Architecture

General Purpose RASC™ (FPGA) Scalable GPUs I/O Interfaces

C C C C C C C General C A A A A A A GeneralA A C C C C C C C C H CPU CPU H H CPU FPGA(s) CPU H GPUsH CPU GPUsCPU H PurposeH CPU PurposeCPU H E E E E E E EI/O I/O E

Interface Interface Interface Interface TIO Chip Chip TIOChip TIOChip

Physical Memory Physical Memory Physical Memory Physical Memory

NUMAlink™ Interconnect Fabric

SGI Proprietary

11 Data-Centric Architecture

CPU CPU CPU U CP IO

IO Very Large GAM FPGA

APU . Globally Addressable . Low Latency . High Bandwidth . Many Ports APU FPGA

PU 3 A U- Graphics Graphics APU U GP G APU AP GPU-0 GPU-1 PU-0 GPU-1 GPU-2 osition Graphics Graphics Comp GPU-3 GPU-2

SGI Proprietary

12 Big Data

1TB, 32*32=1024 elements

Each box represents 1GB

SGI Proprietary

13 Big Data

SGI Proprietary

14 Big Data

SGI Proprietary

15 Big Datasets : 3D Interactive Visualization

1993 2004 100 MB 40,000x 400 GB 10% viewed / year Productivity 100% viewed / month ~1 MB / month 400 GB / month

SGI Proprietary

16 Commodity GPU systems 5X the price of a Scale-up System

March 17, 2005 nVIDIA visualizes large data set •473 million triangles •128 GPU’s on Dell Systems •~$1million system Compliments of nVIDIA

January 21, 2005 SGI visualizes large data set •350 million triangles •12P, 56GB memory •Utilizes a ray tracer •~$180,000 system Compliments of Boeing

SGI Proprietary

17 Dynamic Load Balancing

Load Balancing OFF Load Balancing ON Given Most Work

SGI Proprietary

18 Dimensions of Scalability

• Processors • Processor bandwidth • Memory bandwidth • Memory capacity • Interconnect bandwidth • IO bandwidth • Graphics processing • Reconfigurable processing • Other acceleration elements

SGI Proprietary

19 Origin3000 Building Blocks (Bricks)

C-brick CPU Module

R-brick Router Interconnect

I-brick Base I/O Module

P-brick PCI Expansion

X-brick XIO Expansion

D-brick Disk Storage

SGI Proprietary

20 SGI Altix™ 3700 Bx2 Platform Introduction: Building Blocks

Itanium® 2 CR-brick CPU and memory

M-brick Memory

R-brick SGI® Router interconnect Advanced Linux Environment IX-brick With Base I/O module SGI ProPack PA-brick, PX-brick PCI-X expansion

D-brick2 Disk expansion

SGI Proprietary

21 High-End Servers – Moving Forward: Altix® 4700 Platform….. Blade Packaging

•Innovative Blade-to-NUMALink4 Concept: Provides Unprecedented Versatility, Density

•Blade Architecture Leads Next-Wave of HPC Blade-Based Platforms: With Better Upgradeability, Expansion & Repair

•Investment Protection: Processor-Only Upgrade to Future Dual Core Processors

•Enables Flexible Multi-Paradigm Computing: Enhanced integrated RASC, Graphics

SGI Proprietary

22 Next Generation RASC™ Technology Blade Base Package Blade Based Package

SGI Proprietary

23 Standardized Blades, NUMAlink Backbone

Rack Individual Rack Unit (IRU) Small Rack = 4 IRUs (Contains 10 Blades)

L1 Display

L1 Display Blade L1 Display Blade Slot 8 Blade Slot 10 Filler Panel 2 Slot Blade 4 Slot Blade 6 Slot Blade L1 Display

L1 Display

L1 Display Filler Panel Filler Blade Slot1 Blade Slot3 Blade Slot5 Blade Slot 7 Blade Slot 9

L1 Display

L1 Display

L1 Display

SGI Proprietary

24 Altix® 4700 Compute Blades

• Support for Madison9M Processors (Montecito/Montvale as Available)

L1 Display • Two Compute Blade Options to Blade Slot 8 Blade Slot 10 Filler Panel 2 Slot Blade 4 Slot Blade 6 Slot Blade Provide Different System Capabilities: Graphics Blade – Best $/FLOP, Best Density

(Density Compute Blade) OR – Best Performance, Memory BW Filler Panel Filler Blade Slot1 Blade Slot3 Blade Slot5 Blade Slot 7 Blade Slot 9

I/O Blades (Bandwidth Compute Blade) RASC Blade Memory Blade Memory Compute Blade

SGI Proprietary

25 Altix® 4700 Compute Blades

Top View Front View Highest Memory BW, Performance: Bandwidth DDR2 DIMM DDR2 DIMM Bandwidth Compute Blade Compute DDR2 DIMM DDR2 DIMM Blade • 667MHz FSB Madison9M -> DDR2 DIMM DDR2 DIMM 10.7GB/s Local Memory Bandwidth M9M Shub • 32 M9M Sockets / S-Rack NL4 6.4GB/s Socket 10.7GB/s 2.0 • Processors Supported: 1.66GHz/9M, 1.66GHz/6M Madison9M with DDR2 DIMM DDR2 DIMM 667MHz FSB DDR2 DIMM DDR2 DIMM DDR2 DIMM DDR2 DIMM • Memory Sizes: 2G – 48G/core Single Blade

Top View Front View Best $/FLOP, Best Density: Density Density Compute DDR2 DIMM DDR2 DIMM Compute Blade Blade DDR2 DIMM DDR2 DIMM • 533MHz FSB Madison9M -> M9M DDR2 DIMM DDR2 DIMM 8.524GB/s Local Memory Bandwidth Socket Shub • 64 M9M Sockets / S-Rack NL4 6.4GB/s 8.5GB/s 2.0 • Processors Supported: 1.6GHz/9M, M9M 1.6GHz/6M Madison9M with 533MHz DDR2 DIMM DDR2 DIMM Socket FSB DDR2 DIMM DDR2 DIMM DDR2 DIMM DDR2 DIMM • Memory Sizes: 1G – 24GB/core Single Blade SGI Proprietary

26 Memory Blade

Top View Front View

DDR2 DIMM 06 DDR2 DIMM Memory Blade: CY DDR2 DIMM DDR2 DIMM Q2 DDR2 DIMM DDR2 DIMM • Scale Memory Independently with 12 DDR2 DIMM Slots Per Blade Shub NL4 6.4GB/s • Up to 128TB 2.0

DDR2 DIMM DDR2 DIMM DDR2 DIMM DDR2 DIMM DDR2 DIMM DDR2 DIMM Single Blade

SGI Proprietary

27 Altix® 4700 RASC Blade

L1 Display • RASC Blade Blade Slot 8 Blade Slot 10 Filler Panel 2 Slot Blade 4 Slot Blade 6 Slot Blade – Abacus Computation Blade – Enhanced Performance, Tightly Graphics Blade Integrated

Filler Panel Filler Blade Slot1 Blade Slot3 Blade Slot5 Blade Slot 7 Blade Slot 9 I/O Blades CPU Blade RASC Blade Memory Blade Memory

SGI Proprietary

28 RASC Blades – Cont.

Top View Front View

SSRAM SSRAM Abacus Computation Blade: SSP SSRAM • New Levels of Performance: FPGA TIO SSRAM PCI – High Performance V4LX160 SSRAM SSRAM Selmap FGPA with 160K Logic Cells Loader NL4 6.4GB/s SSRAM SSRAM Selmap – Increased Memory Sizes,12 DIMM per Blade SSRAM • Optional Brick Packaging for SSRAM FPGA SSP TIO Legacy Platforms SSRAM SSRAM Single Blade

SGI Proprietary

29 How does RASC™ Technology Differ from Traditional CPUs?

Compare Application Run Time %’s Application Run-Time Comparison

Algorithm Algorithm Memory Calls Branche inst. 100% Algorithm

90% A Execution Time l 80% gorithm toR 01001000010010 01110100101010 70%

Exp 11100101010001 e n Time

m Time i 60% 10001000110001 nt u 01010101010111 o R

50% Savings Ru rt of 00000111100100 Algorithm % 40% 00010010111010 Execution time Job

A 0 11 001 00011 1 30% 01001000010010 S 1 11011110011 0 01110100101010

20% C

10% Traditional RASC Method 0% Method Key Algorithm App 1 App 2 App 3 App 4 App 5 CPU only running on FPGA

Identify RASC appropriate algorithm Directly map computationally-intensive algorithms to hardware with RASC

SGI Proprietary

30 Application Segments

Application segments Sample applications Transcoding (digital watermarks, format conversion), Image and video compression (JPEG, MPEG), color correction, ray-tracing, processing edge detection (Sobel)

Digital Signal Processing FFT, IFFT, Filtering (FIR and IIR)

Interleaver/de-interleaver, coding/decoding (Reed Solomon, Network and Viterbi), convolution encoders, encryption, error correction, Communication packet processing (IPsec)

Database Acceleration Query, sorting, pattern recognition, data compression

HPC Algorithm MATLAB, STAR-P, random number generators, Sigint/Elint, Acceleration– Gov/Defense image recognition (radar/vision/IR), DEM HPC Algorithm Acceleration– Blast, Smith-Waterman Bioinformatics

SGI Proprietary

31 Ease of Use

•Leverage 3rd Party Std Language Tools – Celoxica, Mitrionics, Starbridge Systems, Nallatech

•Developed an FPGA aware version of GDB – Capable of debugging the FPGA and System Software – Capable of multiple CPUs and multiple FPGAs

•Developed RASC Abstraction Layer (RASCAL)

•Provide for HDL modules – Integrated environment with debugger – Highest performance

SGI Proprietary

32 3rd Party Tools

• Celoxica – http://www.celoxica.com – Handel-C • Mitrionics - http://www.mitrionics.com – Mitrion C • Starbridge Systems - http://www.starbridgesystems.com/ – Viva graphical development environment • Nallatech - http://www.nallatech.com/ – SGI strategic partner

SGI Proprietary

33 Ease of Use v. Efficiency

x VHDL High Verilog

x

Efficiency x x Low

Easy Ease of Use Difficult

SGI Proprietary

34 Bitstream Generation… HLL Tools

HLL Design Entry Design Verification (Handel-C, Impulse C, Mitrion C, Viva) RTL Generation and Integration with Core Services Behavioral Simulation .v, .vhd .v, .v, (VCS, Modelsim) .vhd IA-32 .vhd Design Synthesis Linux® (Synplify Pro, Machine Metadata Amplify) Processing (Python) .edf Static Timing Analysis .ncd, (ISE Timing Analyzer) Design Implementation .pcf .cfg (ISE) .bin

Altix® Device Programming .c Real-time (RASC™ Abstraction Layer, Verification Server Device Manager, Device Driver) (gdb)

SGI Proprietary

35 Ease of Use

•Leverage 3rd Party Std Language Tools – Celoxica, Impulse Acceleration, Mitrion, Starbridge Viva – In discussions with other HLL tool vendors

•Developed an FPGA aware version of GDB – Capable of debugging the FPGA and System Software – Capable of multiple CPUs and multiple FPGAs

•Developed RASC Abstraction Layer (RASCAL)

•Provide for HDL modules – Integrated environment with debugger – Highest performance

SGI Proprietary

36 FPGA Aware Debugger

• Based on Open Source GNU Debugger (GDB) • Uses extensions to current command set • Can debug host application and FPGA • Provides notification when FPGA starts or stops • Supplies information on FPGA characteristics • Can “single-step” or “run N steps” of the algorithm • Can HLL line step per C-line source • Dumps data regarding the set of “registers” that are visible when the FPGA is active

SGI Proprietary

37 GDB Debugging Environment

Algorithm.c

tmp = a & b; (gdb) fpgastep Debugger running d = tmp | c; (gdb) p/x $a $6 = 0x444433 in real time (gdb) p/x $b $7 = 0x111122 (gdb) p/x $tmp $8 = 0x555533 (gdb) fpgastep a COP FPGA (gdb) p/x $tmp $9 = 0x555533 tmp & (gdb) p/x $c $10 = 0x331222 b | d (gdb) p/x $d $11 = 0x111022 c

SGI Proprietary

38 Ease of Use

•Leverage 3rd Party Std Language Tools – Celoxica, Impulse Acceleration, Mitrion, Starbridge Viva – In discussions with other HLL tool vendors

•Developed an FPGA aware version of GDB – Capable of debugging the FPGA and System Software – Capable of multiple CPUs and multiple FPGAs

•Developed RASC Abstraction Layer (RASCAL)

•Provide for HDL modules – Integrated environment with debugger – Highest performance

SGI Proprietary

39 RASC™ Software Stack

Download Debugger (GDB) SpeedShop™ Utilities

Application User Space Device Manager Abstraction Layer Library

Algorithm Device Driver Download Driver Linux® Kernel

COP (TIO, Algorithm FPGA, Memory, Download FPGA) Hardware

SGI Proprietary

40 Abstraction Layer: Algorithm API

The Abstraction Layer’s algorithm API mirrors the COP API with a few Algorithm additions that enable: Input Data Output Data COP

COP Wide Scaling Application

COP

-and - Input Data Algorithm Output Data

COP Deep Scaling Application COP

Working with industry/customers (www.openfpga.org) on API stds…

SGI Proprietary

41 Ease of Use

•Leverage 3rd Party Std Language Tools – Celoxica, Impulse Acceleration, Mitrion, Starbridge Viva – In discussions with other HLL tool vendors

•Developed an FPGA aware version of GDB – Capable of debugging the FPGA and System Software – Capable of multiple CPUs and multiple FPGAs

•Developed RASC Abstraction Layer (RASCAL)

•Provide for HDL modules – Integrated environment with debugger – Highest performance

SGI Proprietary

42 FPGA Architecture Overview

RAM Bank 0

3.2 GB/s Write Read port 0 port 0 Core

SSP Services Algorithm Block Block

3.2 GB/s Read Write port N port N

RAM Bank N

SGI Proprietary

43 Algorithm Block as Submodule

alg_clk do_step Algorithm alg_rst controller step_flag Algorithm alg_done Block

debug0 debug63 sram_wr_data[63:0] sram_wr_addr[17:0] sram_rd_cmd_vld sram_wr_be[7:0] sram_rd_addr[17:0] sram_wr_req sram_rd_dvld sram_rd_data sram_wr_gnt Debug sram_wr_dvld sram_rd_gnt port sram_rd_req

SRAM controller (one bank shown)

SGI Proprietary

44 Verilog / VHDL Module Support

• Templates for Verilog and VHDL – Fast start to algorithm coding • Provide a system simulation stub – Allows both simulation debug or system debug • Provide source code for core service – Allows user to modify to meet special needs • Extractor tools supports GDB meta-data – Application and FPGA debugging

SGI Proprietary

45 RASC™ Technology — Demonstrated Application Speed-up

Bit Manipulation (Cryptography)1 • 79x 1.5GHz ® Itanium® 2 Processor (single RASC Unit) • 119x 1.5GHz Intel® Itanium® 2 Processor (dual RASC Unit)

Graphics Edge Detection1 • 7.4x 1.5GHz Intel® Itanium® 2 Processor (single RASC Unit)

Customer Application • 20,000x speedup on scalar microprocessor

• EXERGY – MAPLD 2005 paper 190

1 SGI Proprietary Based on internal testing

46 RASC Platform Capabilities

• Direct Connection to NUMAlink4 6.4GB/s/connection • Fast System Level Reprogramming of FPGA FPGA load at memory speeds • Atomic Memory Operations Same set as System CPUs • Hardware Barriers Dynamic Load Balancing • Configurations to 8191 NUMA/FPGA Nodes Scalability

SGI Proprietary

47 Thank You Strategy for Big Data

MPU MPU IO MPU MPU MPU IO MPU Heterogeneous TBs Memory TBs Memory APU-GPU . IRIX Dataset Dataset . Linux IO IO IO . Windows IO IO APU . Solaris IO APU APU . IBM AIX . HP-UX . Mac OS X Heterogeneous . IRIX . Linux PBs . Windows Disk . Solaris Open Source (Datasets) . IBM AIX Scalable . HP-UX Filesystem . Mac OS X

SGI Proprietary

49 SGI® RASC™ Technology Summary

• Tightly coupled, high bandwidth/low latency integration into NUMA fabric – Significant bandwidth advantage (6.4GB/s) – Coherent shared memory access – Atomic memory operations – Scalability (wide scaling and deep scaling) • Orders-of-magnitude performance improvement and application speedup – Beneficial when running data intensive applications critical to oil and gas exploration, defense and intelligence, bioinformatics, medical imaging, broadcast media, and other data-dependent industries. • Ease of programming—complete software stack – RASClib (API and core services library) provides abstraction layer to support reconfigurable elements in a multi-processing, multi-user environment – Fully integrated third-party party HLL development tools – FPGA-aware enhancements to GNU debugger (open-source) • Add-in module that seamlessly operates with SGI® Altix® servers and Prism™ visualization systems

SGI Proprietary

50 Multi-Paradigm Computing Other Non-traditional Processing Initiatives

• GPU-based processing – High potential performance (200-300GF peak today) and performance/price on single precision floating point applications…clear roadmap to future semiconductor process technologies – SGI working with SI on scaling to multiple GPUs and on development environment/programming paradigms…initial focus on signal processing apps

• Specialized processors… ClearSpeed™ processors, custom processors (MD-GRAPE, classified chip) – High potential performance/watt on certain apps

This slide contains forward-looking statements. The results and forecasts as stated SGI Proprietary may vary. Other risks and uncertainties relating to this slide may be found in the "Safe-Harbor" statement at the beginning of this presentation. 51