SGI Multi-Paradigm Architecture
Michael Woodacre Chief Engineer, Server Platform Group [email protected] A History of Innovation in HPC
Challenge® XL media server fuels Steven Spielberg’s Shoah NASA Ames and project to document Altix® set world Power Series™, Holocaust survivor record for multi-processing stories systems provide STREAMS Jim Clark compute power First systems benchmark founded SGI on SGI introduces for high-end deployed in Stephen the vision of its first 64-bit graphics Hawking’s COSMOS Altix®, first scalable Computer operating applications system 64-bit Linux® Server Visualization system
1982 1984 1988 1994 1995 1996 1997 1998 2001 2003 2004
DOE deploys 6144p Introduced First generation Origin 2000 to IRIS® Workstations modular NUMA System: monitor and become first integrated NUMAflex™ Origin® 2000 simulate nuclear 3D graphics systems architecture stockpile with Origin® 3000
First 512p Altix cluster Dockside engineering analysis on Origin® drives ocean research at NASA Ames 2000 and Indigo2™ helps Team New +10000p upgrade! Zealand win America’s Cup
Images courtesy of Team New Zealand and the University of Cambridge
SGI Proprietary
2 Over Time, Problems Get More Complex, Data Sets Exploding
Bumper, hood, engine, wheels Entire car E-crash dummy Organ damage
This Trend Continues Across SGI's Markets
Improve design Improve patient safety Improve oil exploration Improve hurricane prediction & manufacturing
First Row Images: EAI, Lana Rushing, Engineering Animation, Inc, Volvo Car Corporation, Images courtesy of the SCI, Second Row Images: The MacNeal-Schwendler Corp , Manchester Visualization Center and University Department of Surgery, Paradigm Geophysical, the Laboratory for Atmospheres,SGI Proprietary NASA Goddard Space Flight Center.
3 SGI Scalable ccNUMA Architecture Basic Node Structure and Interconnect
C C C C A A A A C CPU CPU C C CPU CPU C H H H H E E E E
NUMAlink™ Interconnect Interface Interface Chip Chip
Physical Memory Physical Memory
SGI Proprietary
4 SGI Scalable ccNUMA Architecture Basic Node Structure and Interconnect
C C C C A A A A C CPU CPU C C CPU CPU C H H H H E E E E
NUMAlink™ Interconnect Interface Interface Chip Chip
Global Shared Memory
SGI Proprietary
5 Logical Layout - 8TB
Altix 128 Processor 8TB - 1.6GB/s Uniform Memory Bandwidth Plane A Level 2 Routers RC2A RC4A RC6A RC8A RC3A RC5A RC7A RC9A
Level 1 Routers
RB2 RB3 RB6 RB7 RB10 RB11 RB14 RB15 RB18 RB19 RB22 RB23 RB26 RB27 RB30 RB31
R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28 R29 R30 R31 R32
C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C
R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21 R22 R23 R24 R25 R26 R27 R28 R29 R30 R31 R32
RB2 RB3 RB6 RB7 RB10 RB11 RB14 RB15 RB18 RB19 RB22 RB23 RB26 RB27 RB30 RB31 Level 1 Routers
RC2B RC4B RC6B RC8B SGI Proprietary Level 2 Routers RC3B RC5B RC7B RC9B Plane B 6 3.2 Interconnect Topology 3.0
Bi-Section Bandwidth Profiles ) 2.8
GBs/sec/cpu cpu 2.6
sec/ 2.4 Bs/ G
( 2.2 th
id 2.0 w d
n 1.8 Ba 1.6 ion ct
e 1.4 -S i Dual Plane - NL3 router - 1.2
8 port router bricks wB a
R 1.0 Dual Plane - NL4 router - 8 port router bricks 0.8 0.6
0.4
0.2
SGI Proprietary 4 8 16 32 64 128 256 512 1024 2048 Processors 7 Examples of Single-Paradigm Architectures
Scalar Vector App-Specific
Intel Itanium Cray X1 Graphics - GPU SGI MIPS NEC SX Signals - DSP IBM Power Prog’ble - FPGA Sun SPARC Other ASICs HP PA
SGI Proprietary
8 Paradigms to Applications
Application-specific high
Application-specific
Vector Scalar Intensity Compute Low
Low Data locality High
SGI Proprietary
9 Peer I/O: Increased I/O Flexibility & Performance
XIO+ I/O C R C
NL I/O XIO+ C
1:1 Ratio C-brick to I/O
Peer I/O
I/O R C
NL I/O
SGI Proprietary
10 SGI Scalable ccNUMA Architecture Multi-Paradigm Computing Architecture
General Purpose RASC™ (FPGA) Scalable GPUs I/O Interfaces
C C C C C C C General C A A A A A A GeneralA A C C C C C C C C H CPU CPU H H CPU FPGA(s) CPU H GPUsH CPU GPUsCPU H PurposeH CPU PurposeCPU H E E E E E E EI/O I/O E
Interface Interface Interface Interface TIO Chip Chip TIOChip TIOChip
Physical Memory Physical Memory Physical Memory Physical Memory
NUMAlink™ Interconnect Fabric
SGI Proprietary
11 Data-Centric Architecture
CPU CPU CPU U CP IO
IO Very Large GAM FPGA
APU . Globally Addressable . Low Latency . High Bandwidth . Many Ports APU FPGA
PU 3 A U- Graphics Graphics APU U GP G APU AP GPU-0 GPU-1 PU-0 GPU-1 GPU-2 osition Graphics Graphics Comp GPU-3 GPU-2
SGI Proprietary
12 Big Data
1TB, 32*32=1024 elements
Each box represents 1GB
SGI Proprietary
13 Big Data
SGI Proprietary
14 Big Data
SGI Proprietary
15 Big Datasets : 3D Interactive Visualization
1993 2004 100 MB 40,000x 400 GB 10% viewed / year Productivity 100% viewed / month ~1 MB / month 400 GB / month
SGI Proprietary
16 Commodity GPU systems 5X the price of a Scale-up System
March 17, 2005 nVIDIA visualizes large data set •473 million triangles •128 GPU’s on Dell Systems •~$1million system Compliments of nVIDIA
January 21, 2005 SGI visualizes large data set •350 million triangles •12P, 56GB memory •Utilizes a ray tracer •~$180,000 system Compliments of Boeing
SGI Proprietary
17 Dynamic Load Balancing
Load Balancing OFF Load Balancing ON Given Most Work
SGI Proprietary
18 Dimensions of Scalability
• Processors • Processor bandwidth • Memory bandwidth • Memory capacity • Interconnect bandwidth • IO bandwidth • Graphics processing • Reconfigurable processing • Other acceleration elements
SGI Proprietary
19 Origin3000 Building Blocks (Bricks)
C-brick CPU Module
R-brick Router Interconnect
I-brick Base I/O Module
P-brick PCI Expansion
X-brick XIO Expansion
D-brick Disk Storage
SGI Proprietary
20 SGI Altix™ 3700 Bx2 Platform Introduction: Building Blocks
Itanium® 2 CR-brick CPU and memory
M-brick Memory
R-brick SGI® Router interconnect Advanced Linux Environment IX-brick With Base I/O module SGI ProPack PA-brick, PX-brick PCI-X expansion
D-brick2 Disk expansion
SGI Proprietary
21 High-End Servers – Moving Forward: Altix® 4700 Platform….. Blade Packaging
•Innovative Blade-to-NUMALink4 Concept: Provides Unprecedented Versatility, Density
•Blade Architecture Leads Next-Wave of HPC Blade-Based Platforms: With Better Upgradeability, Expansion & Repair
•Investment Protection: Processor-Only Upgrade to Future Dual Core Processors
•Enables Flexible Multi-Paradigm Computing: Enhanced integrated RASC, Graphics
SGI Proprietary
22 Next Generation RASC™ Technology Blade Base Package Blade Based Package
SGI Proprietary
23 Standardized Blades, NUMAlink Backbone
Rack Individual Rack Unit (IRU) Small Rack = 4 IRUs (Contains 10 Blades)
L1 Display
L1 Display Blade L1 Display Blade Slot 8 Blade Slot 10 Filler Panel 2 Slot Blade 4 Slot Blade 6 Slot Blade L1 Display
L1 Display
L1 Display Filler Panel Filler Blade Slot1 Blade Slot3 Blade Slot5 Blade Slot 7 Blade Slot 9
L1 Display
L1 Display
L1 Display
SGI Proprietary
24 Altix® 4700 Compute Blades
• Support for Madison9M Processors (Montecito/Montvale as Available)
L1 Display • Two Compute Blade Options to Blade Slot 8 Blade Slot 10 Filler Panel 2 Slot Blade 4 Slot Blade 6 Slot Blade Provide Different System Capabilities: Graphics Blade – Best $/FLOP, Best Density
(Density Compute Blade) OR – Best Performance, Memory BW Filler Panel Filler Blade Slot1 Blade Slot3 Blade Slot5 Blade Slot 7 Blade Slot 9
I/O Blades (Bandwidth Compute Blade) RASC Blade Memory Blade Memory Compute Blade
SGI Proprietary
25 Altix® 4700 Compute Blades
Top View Front View Highest Memory BW, Performance: Bandwidth DDR2 DIMM DDR2 DIMM Bandwidth Compute Blade Compute DDR2 DIMM DDR2 DIMM Blade • 667MHz FSB Madison9M -> DDR2 DIMM DDR2 DIMM 10.7GB/s Local Memory Bandwidth M9M Shub • 32 M9M Sockets / S-Rack NL4 6.4GB/s Socket 10.7GB/s 2.0 • Processors Supported: 1.66GHz/9M, 1.66GHz/6M Madison9M with DDR2 DIMM DDR2 DIMM 667MHz FSB DDR2 DIMM DDR2 DIMM DDR2 DIMM DDR2 DIMM • Memory Sizes: 2G – 48G/core Single Blade
Top View Front View Best $/FLOP, Best Density: Density Density Compute DDR2 DIMM DDR2 DIMM Compute Blade Blade DDR2 DIMM DDR2 DIMM • 533MHz FSB Madison9M -> M9M DDR2 DIMM DDR2 DIMM 8.524GB/s Local Memory Bandwidth Socket Shub • 64 M9M Sockets / S-Rack NL4 6.4GB/s 8.5GB/s 2.0 • Processors Supported: 1.6GHz/9M, M9M 1.6GHz/6M Madison9M with 533MHz DDR2 DIMM DDR2 DIMM Socket FSB DDR2 DIMM DDR2 DIMM DDR2 DIMM DDR2 DIMM • Memory Sizes: 1G – 24GB/core Single Blade SGI Proprietary
26 Memory Blade
Top View Front View
DDR2 DIMM 06 DDR2 DIMM Memory Blade: CY DDR2 DIMM DDR2 DIMM Q2 DDR2 DIMM DDR2 DIMM • Scale Memory Independently with 12 DDR2 DIMM Slots Per Blade Shub NL4 6.4GB/s • Up to 128TB 2.0
DDR2 DIMM DDR2 DIMM DDR2 DIMM DDR2 DIMM DDR2 DIMM DDR2 DIMM Single Blade
SGI Proprietary
27 Altix® 4700 RASC Blade
L1 Display • RASC Blade Blade Slot 8 Blade Slot 10 Filler Panel 2 Slot Blade 4 Slot Blade 6 Slot Blade – Abacus Computation Blade – Enhanced Performance, Tightly Graphics Blade Integrated
Filler Panel Filler Blade Slot1 Blade Slot3 Blade Slot5 Blade Slot 7 Blade Slot 9 I/O Blades CPU Blade RASC Blade Memory Blade Memory
SGI Proprietary
28 RASC Blades – Cont.
Top View Front View
SSRAM SSRAM Abacus Computation Blade: SSP SSRAM • New Levels of Performance: FPGA TIO SSRAM PCI – High Performance V4LX160 SSRAM SSRAM Selmap FGPA with 160K Logic Cells Loader NL4 6.4GB/s SSRAM SSRAM Selmap – Increased Memory Sizes,12 DIMM per Blade SSRAM • Optional Brick Packaging for SSRAM FPGA SSP TIO Legacy Platforms SSRAM SSRAM Single Blade
SGI Proprietary
29 How does RASC™ Technology Differ from Traditional CPUs?
Compare Application Run Time %’s Application Run-Time Comparison
Algorithm Algorithm Memory Calls Branche inst. 100% Algorithm
90% A Execution Time l 80% gorithm toR 01001000010010 01110100101010 70%
Exp 11100101010001 e n Time
m Time i 60% 10001000110001 nt u 01010101010111 o R
50% Savings Ru rt of 00000111100100 Algorithm % 40% 00010010111010 Execution time Job
A 0 11 001 00011 1 30% 01001000010010 S 1 11011110011 0 01110100101010
20% C
10% Traditional RASC Method 0% Method Key Algorithm App 1 App 2 App 3 App 4 App 5 CPU only running on FPGA
Identify RASC appropriate algorithm Directly map computationally-intensive algorithms to hardware with RASC
SGI Proprietary
30 Application Segments
Application segments Sample applications Transcoding (digital watermarks, format conversion), Image and video compression (JPEG, MPEG), color correction, ray-tracing, processing edge detection (Sobel)
Digital Signal Processing FFT, IFFT, Filtering (FIR and IIR)
Interleaver/de-interleaver, coding/decoding (Reed Solomon, Network and Viterbi), convolution encoders, encryption, error correction, Communication packet processing (IPsec)
Database Acceleration Query, sorting, pattern recognition, data compression
HPC Algorithm MATLAB, STAR-P, random number generators, Sigint/Elint, Acceleration– Gov/Defense image recognition (radar/vision/IR), DEM HPC Algorithm Acceleration– Blast, Smith-Waterman Bioinformatics
SGI Proprietary
31 Ease of Use
•Leverage 3rd Party Std Language Tools – Celoxica, Mitrionics, Starbridge Systems, Nallatech
•Developed an FPGA aware version of GDB – Capable of debugging the FPGA and System Software – Capable of multiple CPUs and multiple FPGAs
•Developed RASC Abstraction Layer (RASCAL)
•Provide for HDL modules – Integrated environment with debugger – Highest performance
SGI Proprietary
32 3rd Party Tools
• Celoxica – http://www.celoxica.com – Handel-C • Mitrionics - http://www.mitrionics.com – Mitrion C • Starbridge Systems - http://www.starbridgesystems.com/ – Viva graphical development environment • Nallatech - http://www.nallatech.com/ – SGI strategic partner
SGI Proprietary
33 Ease of Use v. Efficiency
x VHDL High Verilog
x
Efficiency x x Low
Easy Ease of Use Difficult
SGI Proprietary
34 Bitstream Generation… HLL Tools
HLL Design Entry Design Verification (Handel-C, Impulse C, Mitrion C, Viva) RTL Generation and Integration with Core Services Behavioral Simulation .v, .vhd .v, .v, (VCS, Modelsim) .vhd IA-32 .vhd Design Synthesis Linux® (Synplify Pro, Machine Metadata Amplify) Processing (Python) .edf Static Timing Analysis .ncd, (ISE Timing Analyzer) Design Implementation .pcf .cfg (ISE) .bin
Altix® Device Programming .c Real-time (RASC™ Abstraction Layer, Verification Server Device Manager, Device Driver) (gdb)
SGI Proprietary
35 Ease of Use
•Leverage 3rd Party Std Language Tools – Celoxica, Impulse Acceleration, Mitrion, Starbridge Viva – In discussions with other HLL tool vendors
•Developed an FPGA aware version of GDB – Capable of debugging the FPGA and System Software – Capable of multiple CPUs and multiple FPGAs
•Developed RASC Abstraction Layer (RASCAL)
•Provide for HDL modules – Integrated environment with debugger – Highest performance
SGI Proprietary
36 FPGA Aware Debugger
• Based on Open Source GNU Debugger (GDB) • Uses extensions to current command set • Can debug host application and FPGA • Provides notification when FPGA starts or stops • Supplies information on FPGA characteristics • Can “single-step” or “run N steps” of the algorithm • Can HLL line step per C-line source • Dumps data regarding the set of “registers” that are visible when the FPGA is active
SGI Proprietary
37 GDB Debugging Environment
Algorithm.c
tmp = a & b; (gdb) fpgastep Debugger running d = tmp | c; (gdb) p/x $a $6 = 0x444433 in real time (gdb) p/x $b $7 = 0x111122 (gdb) p/x $tmp $8 = 0x555533 (gdb) fpgastep a COP FPGA (gdb) p/x $tmp $9 = 0x555533 tmp & (gdb) p/x $c $10 = 0x331222 b | d (gdb) p/x $d $11 = 0x111022 c
SGI Proprietary
38 Ease of Use
•Leverage 3rd Party Std Language Tools – Celoxica, Impulse Acceleration, Mitrion, Starbridge Viva – In discussions with other HLL tool vendors
•Developed an FPGA aware version of GDB – Capable of debugging the FPGA and System Software – Capable of multiple CPUs and multiple FPGAs
•Developed RASC Abstraction Layer (RASCAL)
•Provide for HDL modules – Integrated environment with debugger – Highest performance
SGI Proprietary
39 RASC™ Software Stack
Download Debugger (GDB) SpeedShop™ Utilities
Application User Space Device Manager Abstraction Layer Library
Algorithm Device Driver Download Driver Linux® Kernel
COP (TIO, Algorithm FPGA, Memory, Download FPGA) Hardware
SGI Proprietary
40 Abstraction Layer: Algorithm API
The Abstraction Layer’s algorithm API mirrors the COP API with a few Algorithm additions that enable: Input Data Output Data COP
COP Wide Scaling Application
COP
-and - Input Data Algorithm Output Data
COP Deep Scaling Application COP
Working with industry/customers (www.openfpga.org) on API stds…
SGI Proprietary
41 Ease of Use
•Leverage 3rd Party Std Language Tools – Celoxica, Impulse Acceleration, Mitrion, Starbridge Viva – In discussions with other HLL tool vendors
•Developed an FPGA aware version of GDB – Capable of debugging the FPGA and System Software – Capable of multiple CPUs and multiple FPGAs
•Developed RASC Abstraction Layer (RASCAL)
•Provide for HDL modules – Integrated environment with debugger – Highest performance
SGI Proprietary
42 FPGA Architecture Overview
RAM Bank 0
3.2 GB/s Write Read port 0 port 0 Core
SSP Services Algorithm Block Block
3.2 GB/s Read Write port N port N
RAM Bank N
SGI Proprietary
43 Algorithm Block as Submodule
alg_clk do_step Algorithm alg_rst controller step_flag Algorithm alg_done Block
debug0 debug63 sram_wr_data[63:0] sram_wr_addr[17:0] sram_rd_cmd_vld sram_wr_be[7:0] sram_rd_addr[17:0] sram_wr_req sram_rd_dvld sram_rd_data sram_wr_gnt Debug sram_wr_dvld sram_rd_gnt port sram_rd_req
SRAM controller (one bank shown)
SGI Proprietary
44 Verilog / VHDL Module Support
• Templates for Verilog and VHDL – Fast start to algorithm coding • Provide a system simulation stub – Allows both simulation debug or system debug • Provide source code for core service – Allows user to modify to meet special needs • Extractor tools supports GDB meta-data – Application and FPGA debugging
SGI Proprietary
45 RASC™ Technology — Demonstrated Application Speed-up
Bit Manipulation (Cryptography)1 • 79x 1.5GHz Intel® Itanium® 2 Processor (single RASC Unit) • 119x 1.5GHz Intel® Itanium® 2 Processor (dual RASC Unit)
Graphics Edge Detection1 • 7.4x 1.5GHz Intel® Itanium® 2 Processor (single RASC Unit)
Customer Application • 20,000x speedup on scalar microprocessor
• EXERGY – MAPLD 2005 paper 190
1 SGI Proprietary Based on internal testing
46 RASC Platform Capabilities
• Direct Connection to NUMAlink4 6.4GB/s/connection • Fast System Level Reprogramming of FPGA FPGA load at memory speeds • Atomic Memory Operations Same set as System CPUs • Hardware Barriers Dynamic Load Balancing • Configurations to 8191 NUMA/FPGA Nodes Scalability
SGI Proprietary
47 Thank You Strategy for Big Data
MPU MPU IO MPU MPU MPU IO MPU Heterogeneous TBs Memory TBs Memory APU-GPU . IRIX Dataset Dataset . Linux IO IO IO . Windows IO IO APU . Solaris IO APU APU . IBM AIX . HP-UX . Mac OS X Heterogeneous . IRIX . Linux PBs . Windows Disk . Solaris Open Source (Datasets) . IBM AIX Scalable . HP-UX Filesystem . Mac OS X
SGI Proprietary
49 SGI® RASC™ Technology Summary
• Tightly coupled, high bandwidth/low latency integration into NUMA fabric – Significant bandwidth advantage (6.4GB/s) – Coherent shared memory access – Atomic memory operations – Scalability (wide scaling and deep scaling) • Orders-of-magnitude performance improvement and application speedup – Beneficial when running data intensive applications critical to oil and gas exploration, defense and intelligence, bioinformatics, medical imaging, broadcast media, and other data-dependent industries. • Ease of programming—complete software stack – RASClib (API and core services library) provides abstraction layer to support reconfigurable elements in a multi-processing, multi-user environment – Fully integrated third-party party HLL development tools – FPGA-aware enhancements to GNU debugger (open-source) • Add-in module that seamlessly operates with SGI® Altix® servers and Silicon Graphics Prism™ visualization systems
SGI Proprietary
50 Multi-Paradigm Computing Other Non-traditional Processing Initiatives
• GPU-based processing – High potential performance (200-300GF peak today) and performance/price on single precision floating point applications…clear roadmap to future semiconductor process technologies – SGI working with SI on scaling to multiple GPUs and on development environment/programming paradigms…initial focus on signal processing apps
• Specialized processors… ClearSpeed™ processors, custom processors (MD-GRAPE, classified chip) – High potential performance/watt on certain apps
This slide contains forward-looking statements. The results and forecasts as stated SGI Proprietary may vary. Other risks and uncertainties relating to this slide may be found in the "Safe-Harbor" statement at the beginning of this presentation. 51