The Cray XD1
Cray XD1 Technical Overview Amar Shan, Senior Product Marketing Manager
Cray Proprietary The Cray XD1
• Built for price performance • 30 times interconnect performance • 2 times the density Cray XD1 • High availability • Single system command & control
PurposePurpose-built-built andand optimizedoptimized forfor highhigh performanceperformance workloadsworkloads
May 04 Cray Proprietary 2 Cray XD1 System Architecture
Compute • 12 AMD Opteron 32/64 bit, x86 processors • High Performance Linux RapidArray Interconnect • 12 communications processors • 1 Tb/s switch fabric Active Management • Dedicated processor Application Acceleration • 6 co-processors
ProcessorsProcessors directlydirectly connectedconnected viavia integratedintegrated switchswitch fabricfabric
May 04 Cray Proprietary 3 Under the Covers
Six 2-way SMP 4 Fans Blades 500 Gb/s crossbar Six SATA switch Hard Drives Chassis Front
12-port Inter- chassis Four connector independent PCI-X Slots Connector to 2nd 500 Gb/s crossbar switch and 12-port inter-chassis connector
Chassis Rear
May 04 Cray Proprietary 4 Compute Blade
4 DIMM Sockets for DDR 400 RapidArray Registered ECC Communications AMD Opteron Memory Processor 2XX Processor
4 DIMM Sockets AMD Opteron for DDR 400 Connector to 2XX Registered ECC Main Board Processor Memory
May 04 Cray Proprietary 5 The AMD Opteron Processor
64KB Dedicated Memory Bus
64KB Native 32 & 64 bit x86 compatibility 1 MB Up to 19.2 GB/s I/O
May 04 Cray Proprietary 6 Cray Innovations
Balanced Interconnect
Active Management
Application Cray XD1 Acceleration
PerformancePerformance andand UsabilityUsability
May 04 Cray Proprietary 7 Interconnect
Cray Proprietary Cray XD1 Interconnect System
RapidArray –Interconnect processors –Switch fabric –Communication s software
May 04 Cray Proprietary 9 Typical HPC Application
• HPC applications Compute Communicate Compute Communicate Compute…. exhibit intense compute/ communicate cycles • 20% - 60% of time, CPUs sit idle, stalled by communications • Application performance is very sensitive to latency and bandwidth
InterconnectInterconnectInterconnect DrivesDrivesDrives SystemSystemSystem PerformancePerformancePerformance
May 04 Cray Proprietary 10 Removing the Communications Bottleneck
GigaBytes GigaBytes per Second GFLOPS Memory Processor I/O Interconnect
1 GB/s 0.25 GB/s PCI-X Xeon GigE Server 5.3 GB/s DDR 333
Cray A R XD1 6.4GB/s 8 GB/s DDR 400
6.4GB/s DDR 400 8 GB/s Cray S RS/Strider S 6.4 GB/s 31 GB/s DDR 400
Cray 31 GB/s X1
34.1 GB/s 102 GB/s May 04 Cray Proprietary 11 HPC Communications Optimizations
Cray Communications Libraries RapidArray Communications Processor – MPI 1.2 library – HT/RA tunnelling with bonding –TCP/IP – Routing with route redundancy – PVM – Reliable transport – Shmem – Short message latency – Global Arrays optimization – System-wide process & time synchronization – DMA operations – System-wide clock synchronization
AMD RapidArray Opteron 2XX Communications Processor Processor 2 GB/s
3.2 GB/s A R 2 GB/s
DirectDirect ConnectedConnected ProcessorProcessor ArchitectureArchitecture
May 04 Cray Proprietary 12 Interconnect Benchmarks (MPI Latency)
MPI Latency versus Message Size
40.00 30.00
20.00 10.00 0.00 Latency (M icrosec.) 4 8 32 64 128 256 512 1024 2048 4096
Message Length (Bytes)
Cray XD1 RapidArrayMVAPICH (MPI/VAPI/IB)MPICH-GM (Myrinet D card)MPICH-QsNet (Quadrics Elan3)
44 times times lower lower latency latency than than Myrinet Myrinet (small (small message). message). TheThe Cray Cray XD1 XD1 has has sent sent 1 1 KB KB before before others others have have sent sent a a single single byte byte..
May 04 Cray Proprietary 13 Interconnect Benchmarks (MPI Throughput)
Bandwidth versus Message Size
Cray XD1 RapidArray
1000 MVAPICH (IB) 900 800 MPICH-GM (Myrinet) 700 600 MPICH-QsNet (Quadrics) 500 400
Bandwidth (M B/s)300 200 100 0 1 4 8 64 128 256 512 1024 4096 256000
Data length (Bytes)
TheThe Cray Cray XD1 XD1 is is 5X 5X throughput throughput of of Quadrics Quadrics at at 1KB 1KB messages messages
May 04 Cray Proprietary 14 Processor Interaction
HowHowHow tototo DoubleDoubleDouble ApplicationApplicationApplication PerformancePerformancePerformance May 04 Cray Proprietary 15 Synchronized Linux Scheduler
May 04 Cray Proprietary 16 Direct Connect Topology
1 Cray XD1 Chassis 12 AMD Opteron Processors 53 GFLOPS 8 GB/s between SMPs 1.6 µsec interconnect Integrated switching
3 Cray XD1 Chassis 36 AMD Opteron Processors 158 GFLOPS 8 GB/s between SMPs 1.8 µsec interconnect Integrated switching
25 Cray XD1 Chassis, two racks 300 AMD Opteron Processors 1.3 TFLOPS 2 - 8 GB/s between SMPs 1.8 µsec interconnect Integrated switching
May 04 Cray Proprietary 17 Fat Tree Topology
• 12 Cray XD1 chassis • 144 AMD Opteron Processors • 633 GFLOPS Spine switch • 4/8 GB/s between SMPs Spine switch Spine switch • 1.9 µsec interconnect • Fat tree switching, integrated first & third order • 6/12 RapidArray spine switches (24-ports)
May 04 Cray Proprietary 18 Management
Cray Proprietary Causes of System Outages
• Most problems occur during: – Upgrades, – Problem diagnosis, – Configuration Other Power Supply • Complex systems 1% 2% Disks • Patchwork software 7% • Inadequate tools Fa ns 5% • Multivendor h/w, s/w Hardware • Driver & software 15% compatibility • Corrupt, stale software & configs
Other (Software & Operations) 85% Source: UC Berkeley May 04 Cray Proprietary 20 Active Manager System
Usability – Single System Command and Control Resiliency – Dedicated management processors, real-time CLI and Active Management OS and communications fabric. Web Access Software – Proactive background diagnostics with self- healing.
AutomatedAutomated managementmanagement forfor exceptionalexceptional reliability,reliability, availability,availability, serviceabilityserviceability
May 04 Cray Proprietary 21 FCAPS
FUNCTION Active Manager Features • Maintain health of the system • System monitoring Fault • Monitor, report on and automatically heal or • Hardware management • Allow operator intervention to resolve • Alarm management problems
• Define, monitor and change operational • Commission, upgrade, and expand system Configuration parameters of the system • Software management • Provide administrators with a real-time view • Network configuration management of total system configuration and • Manage users • Automate configuration tasks to minimize • Partition Management effort and inconsistencies • Monitor system resources and usage •Quotas Accounting • Support cost accounting or bill back • Usage tracking – per job, user, department • Reporting • Resource and Queue Management Allow administrators to fine tune system • Job management Performance operation to improve job and system • File system management performance • System performance analysis
Control access to • User privilege management Security • the system • Data backups • system resources and to • specific functions within the system
May 04 Cray Proprietary 22 Active Manager GUI: SysAdmin Portal
GUIGUI provides provides quick quick accessaccess to to status status info info andand system system functions functions
May 04 Cray Proprietary 23 Active Manager Command Line Interface
AdministratorsAdministrators can can access access ActiveActive Manager Manager software software and and LinuxLinux through through the the CLI CLI
May 04 Cray Proprietary 24 XML Integration Point
SimplifiesSimplifies datadata transfertransfer
May 04 Cray Proprietary 25 System Partitions
Users & Administrators
File Services Compute Partition 1 Partition
Front End Partition Compute Partition 2 Compute Partition 1 • Front End Partition • Compute Partition ManageManage multiplemultiple processorsprocessors • Service Partition – File Services andand copiescopies ofof LinuxLinux asas single,single, – Database unified system – DNS unified system
May 04 Cray Proprietary 26 Single System Command and Control
Users & Administrators
File Services Compute Partition 1 Partition
• Partition management Front End Partition Compute Partition 2 Compute Partition 1 • Linux configuration • Hardware monitoring • Software upgrades • File system management • Data backups • Network configuration • Accounting & user management • Security • Performance analysis • Resource & queue management May 04 Cray Proprietary 27 Active Manager System/Partition View
ManagingManaging virtual virtual computerscomputers instead instead of of individualindividual processors processors
May 04 Cray Proprietary 28 Active Manager Task Wizard
SimplifiesSimplifies complex complex tasks tasks toto increase increase efficiency efficiency and and reducereduce downtime downtime
May 04 Cray Proprietary 29 Active Manager Job Scheduler
JobJob management management is is integrated integrated withwith self self-healing-healing features features to to increaseincrease job job completion completion rates rates
May 04 Cray Proprietary 30 Active Manager User Portal
IntuitiveIntuitive user user access access for for job job submissionsubmission and and status status lets lets usersusers focus focus on on applications, applications, notnot computing computing
May 04 Cray Proprietary 31 Self-Monitoring
Parity Heartbeat Temperature Fan speed Diagnostics Air Velocity Thermals Voltage Processors Current Memory Hard Drive
Fans Power supply
Active Manager Interconnect
PCI-X Power Supply
DedicatedDedicated ManagementManagement Processor,Processor, OS,OS, FabricFabric May 04 Cray Proprietary 32 Self-Healing
Users & Administrators
File Services Compute Partition 1 Partition
Front End Partition Compute Partition 2 Compute Partition 1 • Continuous Monitoring • Detect (Future) Failure • Attempt reset • Isolate failed component • Re-allocate resources (N+1 sparing or policy-based) AutomatedAutomated RecoveryRecovery ReducesReduces MTTRMTTR fromfrom hourshours toto minutesminutes May 04 Cray Proprietary 33 Active Manager Thermal Management
May 04 Cray Proprietary 34 Active Manager Alarm Management
May 04 Cray Proprietary 35 Roll back
• Undo command: – Journaling rlsemaster
apprlse otherappmaster
v1.3 v1.4 v2.0 gamess nwchem
RPM RPM RPM Files Files Files v4.6 r2.3a r2.4
Install Install Install Files Files Files
QuickQuick RollbackRollback ReducesReduces MTTRMTTR fromfrom hourshours toto minutesminutes
May 04 Cray Proprietary 36 Application Acceleration FPGA
Cray Proprietary Application Acceleration
Application Accelerator Application Acceleration • Reconfigurable Computing • Tightly coupled to Opteron • FPGA acts like a programmable co- processor • Well-suited for: – Searching, sorting, signal processing, audio/video/image manipulation, error P correction, coding/decoding, packet RAP processing, random number generation.
P RAP SuperLinearSuperLinearspeedup speedup forfor keykey algorithmsalgorithms
May 04 Cray Proprietary 38 Application Acceleration FPGA
... do for each array element DataSet . . . end …
Application Acceleration FPGA Compute Processor …
FineFine-grained-grained parallelismparallelism appliedapplied for 100x potential speedup … for 100x potential speedup May 04 Cray Proprietary 39 FPGA and Vector Processors
Vector Processors FPGAs • SIMD – instruction level • MIMD – code fragments are parallelism replicated • Mature Development • Emerging Development Tools Environment (mostly HW tools – VHDL, (C, Fortran) Verilog) • Integer or Floating Point • Integer or Fixed Point
• Suited to matrix operations: • Suited to pre-processing BLAS, LAPACK, … (signal processing, sorting/searching, error correction, coding/decoding, packet processing)
May 04 Cray Proprietary 40 Cray XD1: Built for Performance
FasterFaster interconnect interconnect throughput throughput LowerLower interconnect interconnect latency latency System-wideSystem-wide process process synchronization synchronization
Ideal 60
50
40 Cray Greater Efficiency 30
Speedup Greater Scalability Myrinet 20 Gigabit Ethernet Faster Performance Fast Ethernet 10
0 0 102030405060 Processors
IncreasingIncreasing applicationapplication effiefeffificiencyciencyciency andand scalabilityscalability forfor breakthroughbreakthrough performanceperformance gainsgains May 04 Cray Proprietary 41 The Cray XD1
• Built for price/performance – Interconnect bandwidth/latency – System-wide process synchronization – Application Acceleration FPGAs • Standards-based – 32/64-bit X86, Linux, MPI • High resiliency Cray XD1 – Self-configuring, self-monitoring, self- healing • Single system command & control – Intuitive, tightly integrated management software
PurposePurpose-built-built and and optimized optimized for for high high performanceperformance workloads workloads
May 04 Cray Proprietary 42 Questions?
Amar Shan, Senior Product Marketing Manager [email protected]
Cray Proprietary Additional Slides
Cray Proprietary Cray System Portfolio
• 1 to 50+ TFLOPS • $3 M - $10 M+ • Vectorized apps • Cray MSP/UNICOS/mp Cray X1
• 1 to 50+ TFLOPS • 256 – 10,000+ processors • $1 M - $5 M+ • Opteron/Linux/Catamount RS/Strider
• 48 GFLOPS - 1.2+ TFLOPS • 12 – 288+ processors • $50 K - $2 M+ • AMD Opteron/Linux Cray XD1
PurposePurpose-Built-Built HighHigh PerformancePerformance ComputersComputers
May 04 Cray Proprietary 45 Communications Protocols
May 04 Cray Proprietary 46 Switchless Toroid Topology
27 chassis, 324 Processors 1.4 TFLOP 3x3x3 Toroid
Nearest Neighbor 4-8 GB/s bandwidth per SMP 1.8 µsec latency
Worst case: 6 Hops, 2.8 µsec 3D torus
WellWell-Suited-Suited forfor NearestNearest NeighborNeighbor ProblemsProblems
May 04 Cray Proprietary 47 Active Manager Self Healing Policies
May 04 Cray Proprietary 48 Application Acceleration Co-Processor
AMD Opteron HyperTransport
3.2 GB/s 3.2 GB/s
3.2 GB/s 3.2 GB/s
3.2 GB/s
3.2 GB/s RAP RapidArray QDR SRAM Application Acceleration FPGA Xilinx Virtex II Pro
2 GB/s 2 GB/s
Cray RapidArray Interconnect
May 04 Cray Proprietary 49 Application Acceleration Interface
RapidArray User Transport Logic QDR RAM Core Interface Core
ADDR(20:0) D(35:0) Q(35:0)
ADDR(20:0) D(35:0) TX Q(35:0) QDR RX SRAM RAP ADDR(20:0) D(35:0) Q(35:0)
ADDR(20:0) D(35:0) RapidArray Q(35:0)
• XC2VP30 running at 200 MHz. • 4 QDR II RAM with over 400 HSTL-I I/O at 200 MHz DDR (400 MTransfers/s). • 16 bit simplified HyperTransport I/F at 400 MHz DDR (800 MTransfers/s.) • QDR and HT I/F take up <20 % of XC2VP30. The rest is available for user applications.
May 04 Cray Proprietary 50 Application Acceleration Variants
A variety of Application Acceleration variants can be manufactured by populating different pin compatible FPGAs and QDR II RAMs.
Logic 18x18 FPGA Speed Elements PowerPC Multipliers XC2VP30 -6 30,816 2 136 XC2VP50 -7 53,136 2 232
RAMs Speed Dimensions Quantity Total Size K7R643682 200 MHz 1M x 36 4 16 MByte
May 04 Cray Proprietary 51 FPGA Development Flow
Cores
RAP I/F, Metadata QDR RAM I/F 0100010101 1010101011 0100101011 HDL Synthesize Implement 0101011010 Download 1001110101 0110101010
Binary File
Synplicity, Xilinx ISE Leonardo, From Command line Precision, or Application Xilinx ISE Verify VHDL, Simulate Verilog, C
Xilinx ChipScope Modelsim
May 04 Cray Proprietary 52 FPGA Linux API
• Admininstration Commands – fpga_open – allocate and open fpga – fpga_close – close allocated fpga – fpga_load – load binary into fpga • Control Commands – fpga_start – start fpga (release from reset) – fpga_stop – stop fpga • Status Commands – fpga_status – get status of fpga • Data Commands ProgrammerProgrammer seessees get/get/putput andand messagemessage – fpga_putpassingpassing programmingprogramming – put model modeldata to fpga May 04ram Cray Proprietary 53 Storage Strategy
Cray Proprietary File Systems: Local Disks
One S-ATA HD per SMP; Local Linux directory per HD
EXT2/3 EXT2/3 EXT2/3
RapidArray
EXT2/3 EXT2/3 EXT2/3
Cray XD1
May 04 Cray Proprietary 55 File Systems: Diskless compute nodes
Single Boot Disk for System, External NAS for User Data
Front EXT2/3 End Partition system
NFS NFS NAS
Compute Partition
Cray XD1
May 04 Cray Proprietary 56 File Systems: SAN
SMP acting as a File Server for the SAN
File EXT2/3 FC Server FC HBA SAN
NFS
Compute
Cray XD1
May 04 Cray Proprietary 57 File Systems: Local Disk per Compute Node
One disk per SMP; NFS Cross-mounting; External NAS Optional
EXT2/3 EXT2/3 EXT2/3
NFS
EXT2/3 EXT2/3 EXT2/3
Cray XD1
May 04 Cray Proprietary 58 Designed for High Availability
Fewer Failures Today – Stateless Hardware 99% Availability – Reduced Variability – 100 minutes downtime / week – 1 failure / week – Self-Configuring
Faster Recovery Target – Self-Monitoring 99.99% Availability – Self-Healing – 53 minutes downtime / year – Software Rollbacks – 1 failure of 5 minutes/month – No incremental cost
OneOne vendorvendor OneOne phonephone callcall FullyFully integratedintegrated andand testedtested
May 04 Cray Proprietary 59