The XD1

Cray XD1 Technical Overview Amar Shan, Senior Product Marketing Manager

Cray Proprietary The Cray XD1

• Built for price performance • 30 times interconnect performance • 2 times the density Cray XD1 • High availability • Single system command & control

PurposePurpose-built-built andand optimizedoptimized forfor highhigh performanceperformance workloadsworkloads

May 04 Cray Proprietary 2 Cray XD1 System Architecture

Compute • 12 AMD Opteron 32/64 bit, x86 processors • High Performance RapidArray Interconnect • 12 communications processors • 1 Tb/s switch fabric Active Management • Dedicated processor Application Acceleration • 6 co-processors

ProcessorsProcessors directlydirectly connectedconnected viavia integratedintegrated switchswitch fabricfabric

May 04 Cray Proprietary 3 Under the Covers

Six 2-way SMP 4 Fans Blades 500 Gb/s crossbar Six SATA switch Hard Drives Chassis Front

12-port Inter- chassis Four connector independent PCI-X Slots Connector to 2nd 500 Gb/s crossbar switch and 12-port inter-chassis connector

Chassis Rear

May 04 Cray Proprietary 4 Compute Blade

4 DIMM Sockets for DDR 400 RapidArray Registered ECC Communications AMD Opteron Memory Processor 2XX Processor

4 DIMM Sockets AMD Opteron for DDR 400 Connector to 2XX Registered ECC Main Board Processor Memory

May 04 Cray Proprietary 5 The AMD Opteron Processor

64KB Dedicated Memory Bus

64KB Native 32 & 64 bit x86 compatibility 1 MB Up to 19.2 GB/s I/O

May 04 Cray Proprietary 6 Cray Innovations

Balanced Interconnect

Active Management

Application Cray XD1 Acceleration

PerformancePerformance andand UsabilityUsability

May 04 Cray Proprietary 7 Interconnect

Cray Proprietary Cray XD1 Interconnect System

RapidArray –Interconnect processors –Switch fabric –Communication s software

May 04 Cray Proprietary 9 Typical HPC Application

• HPC applications Compute Communicate Compute Communicate Compute…. exhibit intense compute/ communicate cycles • 20% - 60% of time, CPUs sit idle, stalled by communications • Application performance is very sensitive to latency and bandwidth

InterconnectInterconnectInterconnect DrivesDrivesDrives SystemSystemSystem PerformancePerformancePerformance

May 04 Cray Proprietary 10 Removing the Communications Bottleneck

GigaBytes GigaBytes per Second GFLOPS Memory Processor I/O Interconnect

1 GB/s 0.25 GB/s PCI-X Xeon GigE Server 5.3 GB/s DDR 333

Cray A R XD1 6.4GB/s 8 GB/s DDR 400

6.4GB/s DDR 400 8 GB/s Cray S RS/Strider S 6.4 GB/s 31 GB/s DDR 400

Cray 31 GB/s X1

34.1 GB/s 102 GB/s May 04 Cray Proprietary 11 HPC Communications Optimizations

Cray Communications Libraries RapidArray Communications Processor – MPI 1.2 library – HT/RA tunnelling with bonding –TCP/IP – Routing with route redundancy – PVM – Reliable transport – Shmem – Short message latency – Global Arrays optimization – System-wide process & time synchronization – DMA operations – System-wide clock synchronization

AMD RapidArray Opteron 2XX Communications Processor Processor 2 GB/s

3.2 GB/s A R 2 GB/s

DirectDirect ConnectedConnected ProcessorProcessor ArchitectureArchitecture

May 04 Cray Proprietary 12 Interconnect Benchmarks (MPI Latency)

MPI Latency versus Message Size

40.00 30.00

20.00 10.00 0.00 Latency (M icrosec.) 4 8 32 64 128 256 512 1024 2048 4096

Message Length (Bytes)

Cray XD1 RapidArrayMVAPICH (MPI/VAPI/IB)MPICH-GM (Myrinet D card)MPICH-QsNet (Quadrics Elan3)

44 times times lower lower latency latency than than Myrinet Myrinet (small (small message). message). TheThe Cray Cray XD1 XD1 has has sent sent 1 1 KB KB before before others others have have sent sent a a single single byte byte..

May 04 Cray Proprietary 13 Interconnect Benchmarks (MPI Throughput)

Bandwidth versus Message Size

Cray XD1 RapidArray

1000 MVAPICH (IB) 900 800 MPICH-GM (Myrinet) 700 600 MPICH-QsNet (Quadrics) 500 400

Bandwidth (M B/s)300 200 100 0 1 4 8 64 128 256 512 1024 4096 256000

Data length (Bytes)

TheThe Cray Cray XD1 XD1 is is 5X 5X throughput throughput of of Quadrics Quadrics at at 1KB 1KB messages messages

May 04 Cray Proprietary 14 Processor Interaction

HowHowHow tototo DoubleDoubleDouble ApplicationApplicationApplication PerformancePerformancePerformance May 04 Cray Proprietary 15 Synchronized Linux Scheduler

May 04 Cray Proprietary 16 Direct Connect Topology

1 Cray XD1 Chassis 12 AMD Opteron Processors 53 GFLOPS 8 GB/s between SMPs 1.6 µsec interconnect Integrated switching

3 Cray XD1 Chassis 36 AMD Opteron Processors 158 GFLOPS 8 GB/s between SMPs 1.8 µsec interconnect Integrated switching

25 Cray XD1 Chassis, two racks 300 AMD Opteron Processors 1.3 TFLOPS 2 - 8 GB/s between SMPs 1.8 µsec interconnect Integrated switching

May 04 Cray Proprietary 17 Fat Tree Topology

• 12 Cray XD1 chassis • 144 AMD Opteron Processors • 633 GFLOPS Spine switch • 4/8 GB/s between SMPs Spine switch Spine switch • 1.9 µsec interconnect • Fat tree switching, integrated first & third order • 6/12 RapidArray spine switches (24-ports)

May 04 Cray Proprietary 18 Management

Cray Proprietary Causes of System Outages

• Most problems occur during: – Upgrades, – Problem diagnosis, – Configuration Other Power Supply • Complex systems 1% 2% Disks • Patchwork software 7% • Inadequate tools Fa ns 5% • Multivendor h/w, s/w Hardware • Driver & software 15% compatibility • Corrupt, stale software & configs

Other (Software & Operations) 85% Source: UC Berkeley May 04 Cray Proprietary 20 Active Manager System

Usability – Single System Command and Control Resiliency – Dedicated management processors, real-time CLI and Active Management OS and communications fabric. Web Access Software – Proactive background diagnostics with self- healing.

AutomatedAutomated managementmanagement forfor exceptionalexceptional reliability,reliability, availability,availability, serviceabilityserviceability

May 04 Cray Proprietary 21 FCAPS

FUNCTION Active Manager Features • Maintain health of the system • System monitoring Fault • Monitor, report on and automatically heal or • Hardware management • Allow operator intervention to resolve • Alarm management problems

• Define, monitor and change operational • Commission, upgrade, and expand system Configuration parameters of the system • Software management • Provide administrators with a real-time view • Network configuration management of total system configuration and • Manage users • Automate configuration tasks to minimize • Partition Management effort and inconsistencies • Monitor system resources and usage •Quotas Accounting • Support cost accounting or bill back • Usage tracking – per job, user, department • Reporting • Resource and Queue Management Allow administrators to fine tune system • Job management Performance operation to improve job and system • File system management performance • System performance analysis

Control access to • User privilege management Security • the system • Data backups • system resources and to • specific functions within the system

May 04 Cray Proprietary 22 Active Manager GUI: SysAdmin Portal

GUIGUI provides provides quick quick accessaccess to to status status info info andand system system functions functions

May 04 Cray Proprietary 23 Active Manager Command Line Interface

AdministratorsAdministrators can can access access ActiveActive Manager Manager software software and and LinuxLinux through through the the CLI CLI

May 04 Cray Proprietary 24 XML Integration Point

SimplifiesSimplifies datadata transfertransfer

May 04 Cray Proprietary 25 System Partitions

Users & Administrators

File Services Compute Partition 1 Partition

Front End Partition Compute Partition 2 Compute Partition 1 • Front End Partition • Compute Partition ManageManage multiplemultiple processorsprocessors • Service Partition – File Services andand copiescopies ofof LinuxLinux asas single,single, – Database unified system – DNS unified system

May 04 Cray Proprietary 26 Single System Command and Control

Users & Administrators

File Services Compute Partition 1 Partition

• Partition management Front End Partition Compute Partition 2 Compute Partition 1 • Linux configuration • Hardware monitoring • Software upgrades • File system management • Data backups • Network configuration • Accounting & user management • Security • Performance analysis • Resource & queue management May 04 Cray Proprietary 27 Active Manager System/Partition View

ManagingManaging virtual virtual computerscomputers instead instead of of individualindividual processors processors

May 04 Cray Proprietary 28 Active Manager Task Wizard

SimplifiesSimplifies complex complex tasks tasks toto increase increase efficiency efficiency and and reducereduce downtime downtime

May 04 Cray Proprietary 29 Active Manager Job Scheduler

JobJob management management is is integrated integrated withwith self self-healing-healing features features to to increaseincrease job job completion completion rates rates

May 04 Cray Proprietary 30 Active Manager User Portal

IntuitiveIntuitive user user access access for for job job submissionsubmission and and status status lets lets usersusers focus focus on on applications, applications, notnot computing computing

May 04 Cray Proprietary 31 Self-Monitoring

Parity Heartbeat Temperature Fan speed Diagnostics Air Velocity Thermals Voltage Processors Current Memory Hard Drive

Fans Power supply

Active Manager Interconnect

PCI-X Power Supply

DedicatedDedicated ManagementManagement Processor,Processor, OS,OS, FabricFabric May 04 Cray Proprietary 32 Self-Healing

Users & Administrators

File Services Compute Partition 1 Partition

Front End Partition Compute Partition 2 Compute Partition 1 • Continuous Monitoring • Detect (Future) Failure • Attempt reset • Isolate failed component • Re-allocate resources (N+1 sparing or policy-based) AutomatedAutomated RecoveryRecovery ReducesReduces MTTRMTTR fromfrom hourshours toto minutesminutes May 04 Cray Proprietary 33 Active Manager Thermal Management

May 04 Cray Proprietary 34 Active Manager Alarm Management

May 04 Cray Proprietary 35 Roll back

• Undo command: – Journaling rlsemaster

apprlse otherappmaster

v1.3 v1.4 v2.0 gamess nwchem

RPM RPM RPM Files Files Files v4.6 r2.3a r2.4

Install Install Install Files Files Files

QuickQuick RollbackRollback ReducesReduces MTTRMTTR fromfrom hourshours toto minutesminutes

May 04 Cray Proprietary 36 Application Acceleration FPGA

Cray Proprietary Application Acceleration

Application Accelerator Application Acceleration • Reconfigurable Computing • Tightly coupled to Opteron • FPGA acts like a programmable co- processor • Well-suited for: – Searching, sorting, signal processing, audio/video/image manipulation, error P correction, coding/decoding, packet RAP processing, random number generation.

P RAP SuperLinearSuperLinearspeedup speedup forfor keykey algorithmsalgorithms

May 04 Cray Proprietary 38 Application Acceleration FPGA

... do for each array element DataSet . . . end …

Application Acceleration FPGA Compute Processor …

FineFine-grained-grained parallelismparallelism appliedapplied for 100x potential speedup … for 100x potential speedup May 04 Cray Proprietary 39 FPGA and Vector Processors

Vector Processors FPGAs • SIMD – instruction level • MIMD – code fragments are parallelism replicated • Mature Development • Emerging Development Tools Environment (mostly HW tools – VHDL, (C, Fortran) Verilog) • Integer or Floating Point • Integer or Fixed Point

• Suited to matrix operations: • Suited to pre-processing BLAS, LAPACK, … (signal processing, sorting/searching, error correction, coding/decoding, packet processing)

May 04 Cray Proprietary 40 Cray XD1: Built for Performance

FasterFaster interconnect interconnect throughput throughput LowerLower interconnect interconnect latency latency System-wideSystem-wide process process synchronization synchronization

Ideal 60

50

40 Cray Greater Efficiency 30

Speedup Greater Scalability Myrinet 20 Gigabit Ethernet Faster Performance Fast Ethernet 10

0 0 102030405060 Processors

IncreasingIncreasing applicationapplication effiefeffificiencyciencyciency andand scalabilityscalability forfor breakthroughbreakthrough performanceperformance gainsgains May 04 Cray Proprietary 41 The Cray XD1

• Built for price/performance – Interconnect bandwidth/latency – System-wide process synchronization – Application Acceleration FPGAs • Standards-based – 32/64-bit X86, Linux, MPI • High resiliency Cray XD1 – Self-configuring, self-monitoring, self- healing • Single system command & control – Intuitive, tightly integrated management software

PurposePurpose-built-built and and optimized optimized for for high high performanceperformance workloads workloads

May 04 Cray Proprietary 42 Questions?

Amar Shan, Senior Product Marketing Manager [email protected]

Cray Proprietary Additional Slides

Cray Proprietary Cray System Portfolio

• 1 to 50+ TFLOPS • $3 M - $10 M+ • Vectorized apps • Cray MSP/UNICOS/mp

• 1 to 50+ TFLOPS • 256 – 10,000+ processors • $1 M - $5 M+ • Opteron/Linux/Catamount RS/Strider

• 48 GFLOPS - 1.2+ TFLOPS • 12 – 288+ processors • $50 K - $2 M+ • AMD Opteron/Linux Cray XD1

PurposePurpose-Built-Built HighHigh PerformancePerformance ComputersComputers

May 04 Cray Proprietary 45 Communications Protocols

May 04 Cray Proprietary 46 Switchless Toroid Topology

27 chassis, 324 Processors 1.4 TFLOP 3x3x3 Toroid

Nearest Neighbor 4-8 GB/s bandwidth per SMP 1.8 µsec latency

Worst case: 6 Hops, 2.8 µsec 3D torus

WellWell-Suited-Suited forfor NearestNearest NeighborNeighbor ProblemsProblems

May 04 Cray Proprietary 47 Active Manager Self Healing Policies

May 04 Cray Proprietary 48 Application Acceleration Co-Processor

AMD Opteron HyperTransport

3.2 GB/s 3.2 GB/s

3.2 GB/s 3.2 GB/s

3.2 GB/s

3.2 GB/s RAP RapidArray QDR SRAM Application Acceleration FPGA Xilinx Virtex II Pro

2 GB/s 2 GB/s

Cray RapidArray Interconnect

May 04 Cray Proprietary 49 Application Acceleration Interface

RapidArray User Transport Logic QDR RAM Core Interface Core

ADDR(20:0) D(35:0) Q(35:0)

ADDR(20:0) D(35:0) TX Q(35:0) QDR RX SRAM RAP ADDR(20:0) D(35:0) Q(35:0)

ADDR(20:0) D(35:0) RapidArray Q(35:0)

• XC2VP30 running at 200 MHz. • 4 QDR II RAM with over 400 HSTL-I I/O at 200 MHz DDR (400 MTransfers/s). • 16 bit simplified HyperTransport I/F at 400 MHz DDR (800 MTransfers/s.) • QDR and HT I/F take up <20 % of XC2VP30. The rest is available for user applications.

May 04 Cray Proprietary 50 Application Acceleration Variants

A variety of Application Acceleration variants can be manufactured by populating different pin compatible FPGAs and QDR II RAMs.

Logic 18x18 FPGA Speed Elements PowerPC Multipliers XC2VP30 -6 30,816 2 136 XC2VP50 -7 53,136 2 232

RAMs Speed Dimensions Quantity Total Size K7R643682 200 MHz 1M x 36 4 16 MByte

May 04 Cray Proprietary 51 FPGA Development Flow

Cores

RAP I/F, Metadata QDR RAM I/F 0100010101 1010101011 0100101011 HDL Synthesize Implement 0101011010 Download 1001110101 0110101010

Binary File

Synplicity, Xilinx ISE Leonardo, From Command line Precision, or Application Xilinx ISE Verify VHDL, Simulate Verilog, C

Xilinx ChipScope Modelsim

May 04 Cray Proprietary 52 FPGA Linux API

• Admininstration Commands – fpga_open – allocate and open fpga – fpga_close – close allocated fpga – fpga_load – load binary into fpga • Control Commands – fpga_start – start fpga (release from reset) – fpga_stop – stop fpga • Status Commands – fpga_status – get status of fpga • Data Commands ProgrammerProgrammer seessees get/get/putput andand messagemessage – fpga_putpassingpassing programmingprogramming – put model modeldata to fpga May 04ram Cray Proprietary 53 Storage Strategy

Cray Proprietary File Systems: Local Disks

One S-ATA HD per SMP; Local Linux directory per HD

EXT2/3 EXT2/3 EXT2/3

RapidArray

EXT2/3 EXT2/3 EXT2/3

Cray XD1

May 04 Cray Proprietary 55 File Systems: Diskless compute nodes

Single Boot Disk for System, External NAS for User Data

Front EXT2/3 End Partition system

NFS NFS NAS

Compute Partition

Cray XD1

May 04 Cray Proprietary 56 File Systems: SAN

SMP acting as a File Server for the SAN

File EXT2/3 FC Server FC HBA SAN

NFS

Compute

Cray XD1

May 04 Cray Proprietary 57 File Systems: Local Disk per Compute Node

One disk per SMP; NFS Cross-mounting; External NAS Optional

EXT2/3 EXT2/3 EXT2/3

NFS

EXT2/3 EXT2/3 EXT2/3

Cray XD1

May 04 Cray Proprietary 58 Designed for High Availability

Fewer Failures Today – Stateless Hardware 99% Availability – Reduced Variability – 100 minutes downtime / week – 1 failure / week – Self-Configuring

Faster Recovery Target – Self-Monitoring 99.99% Availability – Self-Healing – 53 minutes downtime / year – Software Rollbacks – 1 failure of 5 minutes/month – No incremental cost

OneOne vendorvendor OneOne phonephone callcall FullyFully integratedintegrated andand testedtested

May 04 Cray Proprietary 59