BlackWidow System Update
BlackWidow • System Update • Performance Update • Programming Environment • Scalar & Vector Computing • Roadmap

Slide 2 The Cray Roadmap – Realizing Our Adaptive Supercomputing Vision

[Roadmap diagram]
• Phase 0 (2006) – Individually Architected Machines: unique products serving individual market needs (Cray X1E, Cray XD1, Cray XT3)
• Phase I (2007–2008) – Rainier Program: multiple processor types with integrated infrastructure and user environment (Cray XT4 and upgrade, BlackWidow, Cray XMT)
• Phase II (2009–2010) – Cascade Program: adaptive hybrid system (Cray XT4+ upgrade, Baker, Granite)

Slide 3 BlackWidow Summary

• Project name for Cray's next-generation vector system
• Follow-on to the Cray X1 and Cray X1E vector systems
• Special focus on improving scalar performance
• Significantly improved price-performance over earlier systems
• Closely integrated with Cray XT infrastructure
• Product will be launched and shipments will begin towards the end of this year

“Vector MPP” System
• Highly scalable – to thousands of CPUs
• High network bandwidth
• Minimal OS “jitter” on compute blades

Slide 4 BlackWidow System Overview

• Tightly integrated with Cray XT3 or Cray XT4 system
• Cray XT SIO nodes provide I/O and login support for BlackWidow
• Users can easily run on either or both vector and scalar compute blades
[Figure: BlackWidow cabinets attached to a Cray XT4 system (artist's conception)]

Slide 5 Compute Cabinet

[Compute cabinet diagram, front and back views]
• Blower
• Chassis: 2 per cabinet
• Rank1 router modules: up to 4 per chassis
• Compute and bridge blades: up to 8 per chassis
• Power supplies

Slide 6 Cray BlackWidow Compute Blade

[Compute blade diagram, 22.8" x 14.4"]
• BlackWidow vector CPUs: 8 per blade, configured as two 4-way SMP nodes
• Memory: 32, 64, or 128 GB per node
• Voltage regulator modules

Slide 7 BlackWidow Chassis

[Chassis diagram: Blades 0–7, each with two 4-way nodes (Node 0 and Node 1, CPUs 0–3 each), 32, 64, or 128 GB of memory per node, and router connections R0/R1 to other blades]
• Eight blades per chassis: 64 CPUs configured as 16 4-way SMP nodes
• Two chassis per cabinet: 128 CPUs per cabinet

Slide 8 Bridge Blade

[Bridge blade diagram, 22.8" x 14.4"]
• Cable connectors to Cray XT system
• SerDes chips
• StarGate FPGAs: 8 per blade
• Voltage regulator modules

Slide 9 Rank1 Router Module

[Rank1 router module diagram, 23" x 7"]
• YARC chips: 2 per module
• Network cable connectors
• Voltage regulator modules

Slide 10 Fat Tree Network Scales to 1000s of CPUs

• Rank 1 building block – 32 CPUs

[Rank 1 diagram: 4 routers (R) above 8 nodes of 4 processors (P) each – 32 CPUs total]

Slide 11 Fat Tree Network Scales to 1000s of CPUs
• Rank 2 building block – up to 1024 processors
• Diagram shows 512 processors (4 compute cabinets); the sizing arithmetic is sketched after the diagram

[Rank 2 diagram: an upper row of 16 routers connected through a ¼ taper to a lower row of 16 rank-1 routers, serving processor groups 0–15 across four Rank 1 blocks]
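The CPU counts quoted in the last few slides are simple products of the configuration figures already given (4 CPUs per node, 2 nodes per blade, 8 blades per chassis, 2 chassis per cabinet, 32 CPUs per Rank 1 group). A minimal C sketch of that arithmetic, purely for illustration – the constants come from the slides, not from any Cray tool:

```c
#include <stdio.h>

/* Configuration figures quoted on the preceding slides. */
#define CPUS_PER_NODE        4   /* 4-way SMP node                      */
#define NODES_PER_BLADE      2   /* two nodes per compute blade         */
#define BLADES_PER_CHASSIS   8
#define CHASSIS_PER_CABINET  2
#define CPUS_PER_RANK1      32   /* Rank 1 building block               */
#define RANK1_PER_RANK2     32   /* up to 32 Rank 1 groups in a Rank 2  */

int main(void)
{
    int cpus_per_chassis = CPUS_PER_NODE * NODES_PER_BLADE * BLADES_PER_CHASSIS;
    int cpus_per_cabinet = cpus_per_chassis * CHASSIS_PER_CABINET;
    int cpus_per_rank2   = CPUS_PER_RANK1 * RANK1_PER_RANK2;

    printf("CPUs per chassis : %d\n", cpus_per_chassis);  /* 64   */
    printf("CPUs per cabinet : %d\n", cpus_per_cabinet);  /* 128  */
    printf("CPUs per Rank 2  : %d\n", cpus_per_rank2);    /* 1024 */
    return 0;
}
```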

Slide 12 Compute Cabinet

Back of cabinet showing cables connecting Rank1 router modules • 2 modules in top chassis • 4 modules in bottom chassis

Slide 13 I/O and Networking
• Delivered by Cray XT Service & I/O (SIO) nodes
• Shared with scalar MPP compute nodes

[Diagram: BlackWidow compute and scalar MPP compute PEs attached to Cray XT SIO nodes – login PEs (GbE), network PEs (10 GbE), I/O PEs (FC to RAID), and a system PE; user RAID and system RAID with Lustre]

Slide 14 BlackWidow • System Update • Performance Update • Programming Environment • Scalar & Vector Computing • Roadmap

Slide 15 Key Comparisons with Cray X1E

(Cray X1E | BlackWidow)
• Processor: Dual-core, MSP | Single-core, single CPU (no MSP!)
• Processor performance: 4.5 GFLOPS (SSP) | 20+ GFLOPS
• Memory bandwidth: 34 GB/s/MSP | 41 GB/s/CPU
• Network bandwidth: 6.4 GB/s/MSP | 8 GB/s/CPU
• Price/MFLOP: ~$5/MFLOP | $1/MFLOP
• Cabinets: Two – air-cooled, liquid-cooled | One – air-cooled (with liquid-cooled upgrade option)
• I/O infrastructure: Unique Cray-developed components | Cray XT SIO nodes and software
• Programming environment: Same as BlackWidow | Same as Cray X1E
• OS: Irix-based | Linux-based (CNL)
• User environment: Cray X1E CPU, CPES | Cray XT login nodes
• File system: XFS | Lustre

Slide 16 Early Benchmark Results

Results are on prototype BlackWidow hardware • Some performance limitations; e.g. speculative loads are disabled on prototype hardware • Comparisons are to X1 (i.e. not X1E) performance

Slide 17 Key Metrics
• HPL (Linpack): currently estimate BlackWidow will get 18+ GFLOPS/CPU (90+% of peak)

• STREAM Triad estimates (GB/s by CPU count):
  • Triad: 1 CPU – 39; 2 CPUs – 83; 4 CPUs – 121
  • Triad, stride = 4 (no cache locality): 1 CPU – 27; 2 CPUs – 44; 4 CPUs – 40
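For reference, the kernel behind these estimates is the STREAM triad, a(i) = b(i) + q*c(i); the stride-4 variant defeats cache-line reuse. A minimal C sketch of both loops – the array length and stride constant are illustrative assumptions, not the actual benchmark configuration:

```c
#include <stdlib.h>

#define N      (1 << 24)   /* illustrative array length            */
#define STRIDE 4           /* "no cache locality" variant          */

/* Unit-stride STREAM triad: a[i] = b[i] + q*c[i] */
void triad(double *a, const double *b, const double *c, double q, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + q * c[i];
}

/* Stride-4 triad: touches every 4th element, defeating cache-line reuse */
void triad_strided(double *a, const double *b, const double *c, double q, size_t n)
{
    for (size_t i = 0; i < n; i += STRIDE)
        a[i] = b[i] + q * c[i];
}

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;

    for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    triad(a, b, c, 3.0, N);          /* ~3 doubles of traffic per iteration */
    triad_strided(a, b, c, 3.0, N);

    free(a); free(b); free(c);
    return 0;
}
```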

• MPI metric: estimated latency for a zero-length message is 4.9 microseconds
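A zero-length-message latency like the 4.9 µs estimate is typically measured with a ping-pong loop between two ranks; the slide does not say which benchmark Cray used, so the following is a generic sketch using standard MPI calls (run with at least two ranks):

```c
#include <mpi.h>
#include <stdio.h>

#define REPS 10000

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    /* Zero-length ping-pong between ranks 0 and 1. */
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("one-way latency: %.2f us\n",
               (t1 - t0) / (2.0 * REPS) * 1e6);  /* half the round-trip time */

    MPI_Finalize();
    return 0;
}
```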

Slide 18 STREAM Triad Performance Comparison

[Bar chart, GBytes/second: X1 SSP (1 of 4), X1 SSP (1 of 1), X1 MSP, BW. BW results are single-processor, with non-allocating loads & stores.]

Slide 19 DGEMM, SGEMM, DGEMV, SGEMV (compiler-generated user-level Fortran routines, N = 5000)

[Bar chart, GFlops: DGEMM, SGEMM, DGEMV, and SGEMV for X1 SSP, X1 MSP, and BW]

Slide 20 HPCC Performance (4 BW PEs vs 4 X1 MSPs)

[Bar chart, times faster (0–4): HPL, FFT, PTRANS, DGEMM, STREAM, MPI Latency, Random Access, Random Ring Bandwidth, Random Ring Latency, Min Ping Pong Latency]

Slide 21 BlackWidow Price/Performance Gains

• Cray X1 (2003): $15/MFLOP
• Cray X1E (2005): $5/MFLOP
• BlackWidow (2007): $1/MFLOP

Slide 22 BlackWidow • System Update • Performance Update • Programming Environment • Scalar & Vector Computing • Roadmap

Slide 23 Programming Environment

• BlackWidow has a single Programming Environment product
• Includes both Fortran and C/C++
• Also includes MPI and SHMEM libraries (a minimal SHMEM sketch follows below)
  • No MPT product
• Licensing: base fee plus a fee per socket

• Compiles are performed on Cray XT login nodes
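Since the Programming Environment bundles SHMEM alongside MPI, a minimal one-sided put example is sketched below. It is written against the portable OpenSHMEM-style API; the header and initialization names in the Cray SHMEM of this era may differ, so treat it as an illustrative assumption rather than the exact BlackWidow interface:

```c
#include <shmem.h>
#include <stdio.h>

static long src[4];   /* symmetric arrays: same address on every PE */
static long dst[4];

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    for (int i = 0; i < 4; i++) src[i] = me * 10 + i;

    /* One-sided put: write my src[] into dst[] on the next PE (ring). */
    shmem_long_put(dst, src, 4, (me + 1) % npes);
    shmem_barrier_all();   /* ensure all puts are complete and visible */

    printf("PE %d received dst[0] = %ld from PE %d\n",
           me, dst[0], (me - 1 + npes) % npes);

    shmem_finalize();
    return 0;
}
```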

Slide 24 BlackWidow OS Features

• UNICOS/lc operating system is distributed between BlackWidow compute nodes and Cray XT SIO nodes
• BlackWidow compute nodes:
  • Linux (SLES 10) kernel with a minimal set of commands and packages
  • Lustre client
  • TCP/IP networking support
  • Dump/crash support
  • HSS/RAS support and resiliency
  • Application monitoring and management
  • Linux process accounting, application time reporting
• Cray XT SIO nodes:
  • Linux (SLES 10) kernel and “full set” of commands and packages on login nodes
  • Other SIO nodes run Lustre OSS and MDS, networking connections, and scheduling and workload management software

Slide 25 Parallel Programming

• MPI-2
• Co-Array Fortran
• UPC 1.2
• OpenMP (within a node)
• SHMEM
• Pthreads (within a node)
• MPMD (multiple-binary launch)
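The models above mix distributed-memory programming (MPI, SHMEM, CAF, UPC) with node-level threading (OpenMP, Pthreads), which maps naturally onto the 4-way SMP nodes described earlier. A minimal hybrid MPI + OpenMP sketch in C – the 4-thread choice simply mirrors the 4-way node and is not a required setting:

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Request threaded MPI: OpenMP runs inside each rank (e.g. one rank per node). */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;

    /* Threads within the (4-way SMP) node share memory via OpenMP. */
    #pragma omp parallel for reduction(+:local) num_threads(4)
    for (int i = 0; i < 1000000; i++)
        local += 1.0 / (1.0 + i);

    /* Ranks across nodes combine their partial sums with MPI. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```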

Slide 26 Resource Scheduling
• ALPS resource scheduler
• Design goals:
  • Manage scheduling and placement for distributed-memory applications
  • Support predictable application performance
  • Conceal architecture-specific details
  • Guarantee resource availability for batch and interactive requests
  • Integrate with and rely upon workload management applications (PBS Pro, LSF, etc.)
  • Extensible, scalable, and maintainable
• Distributed design:
  • apinit (one per process) runs on BlackWidow compute nodes
  • All other daemons and commands run on Cray XT service nodes

Slide 27 TotalView Licensing Model

• Moving to a new licensing model for TotalView
• TotalView Technologies calls this “team licensing”
  • They still offer the old model – “enterprise licensing”
• Customer buys “tokens”, where a token allows one MPI process to be debugged on one CPU
• Tokens are managed as a pool
• So 32 tokens can be used in a variety of ways:
  • One user debugging a 32-process application
  • 32 users each debugging a 1-process application
  • And everything in between!
• Easy to upgrade – just buy more tokens!
• See the TotalView web site for details and more examples

Slide 28 BlackWidow Software Roadmap

• New BlackWidow-specific capabilities, including:
  • Scheduling enhancements and improvements
  • New versions of the Linux kernel for BlackWidow compute nodes
  • New versions of Lustre client software
  • New versions of PBS Pro and TotalView

• BlackWidow customers will also benefit from new releases of XT SIO node software:
  • New storage, networking, and device support
  • New and updated login node (Linux distribution) features
  • New versions of Lustre server software

Slide 29 BlackWidow • System Update • Performance Update • Programming Environment • Scalar & Vector Computing • Roadmap

Slide 30 Scalar and Vector Processing at Boeing

[Diagram: Boeing applications (CFD++, FLOW-3D, NASTRAN, Overflow, and a wing design code) running on Linux Networx clusters and a Cray X1]

Slide 31 Integrated Vector/Scalar Computing

• Single point of login with complete Linux software environment
• Single Lustre environment provides shared files
• Common ALPS for scheduling mixed applications on both types of compute nodes (a minimal file-handoff sketch follows the workflow diagram below)

[Workflow diagram: pre-process input data (scalar) → perform simulation (vector) → post-process results (scalar); Input File → Temp File A → Temp File B → Output File]
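Because the scalar and vector jobs share one Lustre namespace, the handoff in the diagram can be done through ordinary files. A minimal, hypothetical C sketch of the scalar pre-processing step – the path and record format are invented for illustration and are not part of any Cray workflow:

```c
#include <stdio.h>

/* Hypothetical handoff file on the shared Lustre file system. */
#define TEMP_FILE_A "/lustre/scratch/temp_file_a.dat"

/* Scalar pre-processing step: write the data the vector job will read. */
int main(void)
{
    FILE *f = fopen(TEMP_FILE_A, "w");
    if (!f) { perror(TEMP_FILE_A); return 1; }

    for (int i = 0; i < 10; i++)
        fprintf(f, "%d %f\n", i, i * 0.5);   /* illustrative payload */

    fclose(f);
    /* The vector simulation, launched separately through ALPS,
       opens the same path and continues the pipeline. */
    return 0;
}
```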

Slide 32 BlackWidow • System Update • Performance Update • Programming Environment • Scalar & Vector Computing • Roadmap

Slide 33 The Cray Roadmap

[Roadmap diagram, annotated with integration levels]
• Phase 0 (2006) – Individually Architected Machines: unique products serving individual market needs (Cray X1E, Cray XD1, Cray XT3)
• Phase I (2007–2008) – Rainier Program: multiple processor types with integrated infrastructure and user environment (Cray XT4 and upgrade, BlackWidow, Cray XMT) – cabinet-level scalar-vector integration
• Phase II (2009–2010) – Cascade Program: adaptive hybrid system (Cray XT4+ upgrade, Baker, Granite) – blade-level integration

Slide 34