Cray XT and XE System Overview

Customer Documentation and Training Overview Topics

• System Overview
  – Cabinets, Chassis, and Blades
  – Compute and Service Nodes
  – Components of a Node
    ƒ Processor
    ƒ SeaStar ASIC
    ƒ Gemini ASIC
• Portals API Design
• System Networks
• Interconnection Topologies

Cray XT System

System Overview

[System overview diagram: compute, login, network, boot, syslog, database, I/O, and metadata nodes on the X/Y/Z interconnect; the SMW attaches over GigE, external networks over 10 GigE, and the RAID subsystem over Fibre Channel]

Cabinet

– The cabinet contains three chassis, a blower for cooling, a power distribution unit (PDU), a control system (CRMS), and the compute and service blades (modules)
– All components of the system are air cooled
  ƒ A blower in the bottom of the cabinet cools the blades within the cabinet
    • Other rack-mounted devices within the cabinet have their own internal fans for cooling
– The PDU is located behind the blower in the back of the cabinet

Liquid Cooled Cabinets

[Liquid-cooled cabinet diagram: cages 0–2 with backplane assemblies, cage VRMs, cage ID controllers, 48 Vdc flexible buses and 48 Vdc shelves 1–3, interconnect network cables, cage inlet air temperature sensors, heat exchangers and airflow conditioner (LC / XT5-HE LC cabinets), L1 controller, blower and blower speed controller (VFD), PDU line filter, and XDP temperature/humidity sensor and interface board behind perforated panels]

Blades

• The system contains two types of blades (summarized in the sketch after this list):
  – Compute blades
    ƒ 4 SeaStar or 2 Gemini ASICs
    ƒ 4 nodes
  – Service blades
    ƒ SIO (XT systems)
      • 4 SeaStar ASICs
      • 2 nodes
      • One dual-slot PCI-X or PCIe riser assembly per node
    ƒ XIO (XE systems)
      • 2 Gemini ASICs
      • 4 nodes
      • One single-slot PCIe riser per node (except on the boot node)
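The blade populations above can be captured in a small table in code; this is an editorial summary, and the struct and its field names are illustrative, not a Cray data structure.

```c
#include <stdio.h>

/* Blade types and their node/ASIC populations, as listed on the slide. */
struct blade_type {
    const char *name;
    int nodes;
    int asics;
    const char *asic;
};

int main(void)
{
    const struct blade_type blades[] = {
        { "Compute (XT, SeaStar)", 4, 4, "SeaStar" },
        { "Compute (XE, Gemini)",  4, 2, "Gemini"  },
        { "SIO service (XT)",      2, 4, "SeaStar" },
        { "XIO service (XE)",      4, 2, "Gemini"  },
    };
    for (size_t i = 0; i < sizeof blades / sizeof blades[0]; i++)
        printf("%-22s %d nodes, %d %s ASICs\n",
               blades[i].name, blades[i].nodes, blades[i].asics, blades[i].asic);
    return 0;
}
```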

PCI Cards

• Cray supports the following types of PCIe cards:
  – Gigabit Ethernet
  – 10 Gigabit Ethernet
  – Fibre Channel (FC2, FC4, and FC8)
  – InfiniBand

Node Components

– Opteron processor
  ƒ Single-, dual-, quad-, six-, eight-, and twelve-core versions
– Node memory
  ƒ 2- to 8-GB of DDR1 memory in Cray XT3 compute nodes
  ƒ 2- to 8-GB of DDR1 memory in Cray XT service (SIO) nodes
  ƒ 4- to 16-GB of DDR2 memory in Cray XT4 compute nodes
  ƒ 4- to 32-GB of DDR2 memory in Cray XT5 compute nodes
  ƒ 8- to 64-GB of DDR3 memory in Cray XT6 compute nodes
  ƒ 8- to 64-GB of DDR3 memory in Cray XE6 compute nodes
  ƒ 8- to 16-GB of DDR2 memory in Cray XE6 service (XIO) nodes
– SeaStar or Gemini ASIC

Cray XT4 Compute Blade

[Cray XT4 compute blade: four nodes (0–3) with node memory and node VRMs, SeaStar ASICs, an L0 controller, and 48 Vdc to 12 Vdc VRMs]

XT4 Node Block Diagram

[XT4 node block diagram: an Opteron processor (X86-64) with four DIMM slots connects over a HyperTransport link to the SeaStar, which attaches to the high-speed network (HSN) interconnect; an L0 controller monitors the blade]

SIO Service Blade

[SIO service blade: two nodes (node 0 and node 3), each with a PCI riser (PCI-X or PCIe) holding PCI cards]

SeaStar ASIC

– Direct HyperTransport connection to the Opteron
– The DMA engine transfers data between host memory and the network
  ƒ Separate read and write sides in the DMA engine
  ƒ The Opteron does not directly load/store to the network
  ƒ On the receive side, a set of content addressable memory (CAM) locations is used to route incoming messages (see the sketch after this list)
    • There are 256 CAM entries
– Designed for the Sandia Portals API
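The receive-side CAM is essentially an associative lookup: the header of an incoming message is compared against every entry at once, and a hit tells the DMA engine where to deposit the payload. Below is a minimal software sketch of that idea; only the 256-entry count comes from the slide, and the entry fields and function names are assumptions.

```c
#include <stdint.h>
#include <stdio.h>

#define CAM_ENTRIES 256          /* per the slide: 256 receive-side CAM entries */

/* Hypothetical CAM entry: matches an incoming header to a receive buffer. */
struct cam_entry {
    int      valid;
    uint32_t src_nid;            /* sending node to match */
    uint64_t match_bits;         /* tag to match */
    void    *buffer;             /* where the DMA engine should place the payload */
};

/* Hardware compares all entries in parallel; software models it as a scan. */
static void *cam_route(const struct cam_entry cam[CAM_ENTRIES],
                       uint32_t src_nid, uint64_t match_bits)
{
    for (int i = 0; i < CAM_ENTRIES; i++)
        if (cam[i].valid && cam[i].src_nid == src_nid &&
            cam[i].match_bits == match_bits)
            return cam[i].buffer;
    return NULL;                 /* no match: treated as an unexpected message */
}

int main(void)
{
    static char rxbuf[4096];
    static struct cam_entry cam[CAM_ENTRIES];
    cam[0] = (struct cam_entry){ .valid = 1, .src_nid = 12,
                                 .match_bits = 0xabcd, .buffer = rxbuf };
    printf("routed to %p\n", cam_route(cam, 12, 0xabcd));
    return 0;
}
```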

SeaStar Diagram

[SeaStar diagram: the Opteron (core, northbridge, DDR-SDRAM memory interface) connects over HyperTransport to the SeaStar, which contains a DMA engine, local memory, a PPU, an SSI port, a router with ±X/±Y/±Z ports, and a second HyperTransport interface that bridges to PCI-X on service blades; an L0 controller attaches to the blade]

Node Connection Bandwidths

[Node bandwidth diagram: the Opteron core(s) and memory connect to the SeaStar over HyperTransport; the SeaStar drives six HSN links (+X, -X, +Y, -Y, +Z, -Z) at 9.6 GB/s each]
– HyperTransport: Cray XT3 6.4 GB/s bidirectional (2.17 GB/s sustained); Cray XT4 8.0 GB/s bidirectional
– Memory:
  ƒ Cray XT3 DDR1 6.4 GB/s (400 MHz DIMMs), 5.7 GB/s sustained, 51.6 ns measured latency
  ƒ Cray XT4 DDR2 10.4 GB/s (667 MHz DIMMs), 8.5 GB/s sustained, 50 ns measured latency
  ƒ Cray XT4 DDR2 12.8 GB/s (800 MHz DIMMs)
  ƒ Cray XT5 DDR2 10.4 GB/s (667 MHz DIMMs)
  ƒ Cray XT5 DDR2 12.8 GB/s (800 MHz DIMMs)
– High-speed network: 9.6 GB/s bidirectional per link, 6.5 GB/s sustained
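As a rough cross-check, the peak memory figures follow from the DIMM transfer rate × 8 bytes per transfer × 2 memory channels: 400 MT/s × 8 B × 2 = 6.4 GB/s for DDR1-400, and 800 MT/s × 8 B × 2 = 12.8 GB/s for DDR2-800.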

Portals API Design

– Designed for messaging among thousands of nodes
  ƒ Performance is critical only in terms of scalability
  ƒ Success is measured by how large an application is allowed to scale, not by a 2-node ping-pong time
– Designed to avoid scalability limitations
  ƒ Is network independent
  ƒ Bypasses the OS – no memory copies into or out of the kernel; few interrupts
  ƒ Bypasses the application – does not require activity by the application to ensure progress
  ƒ Reliable, ordered delivery of messages between two nodes

Portals API Design

– Receiver (target) managed – no reserved memory for thousands of potential senders
  ƒ Is connectionless; no explicit connection is established with other processes in the application
    • When connected, maintains a minimum amount of state
  ƒ The target process determines how to respond to the incoming message
    • The target node can choose to accept or ignore a message from any node
    • The destination of a message is not an address – instead, "match bits" enable the receiver to place it (see the sketch after this list)
  ƒ Unexpected messages are handled by the unexpected message queue
  ƒ All buffers are in user space
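A minimal sketch of receiver-managed matching, using hypothetical structures rather than the real Portals API: the receiver posts buffers tagged with match bits, an arriving message is placed into the first buffer whose bits match, and anything else is parked on the unexpected-message queue.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical posted receive buffer, selected by match bits, not by address. */
struct posted_buf {
    uint64_t match_bits;
    uint64_t ignore_bits;            /* bits the receiver does not care about */
    void    *data;
    size_t   len;
    struct posted_buf *next;
};

struct unexpected_msg {
    uint64_t match_bits;
    void    *payload;
    size_t   len;
    struct unexpected_msg *next;
};

static struct posted_buf     *posted;      /* receiver-managed buffer list */
static struct unexpected_msg *unexpected;  /* messages with no matching buffer */

/* Place an incoming message by match bits, or queue it as unexpected. */
static void deliver(uint64_t match_bits, const void *payload, size_t len)
{
    for (struct posted_buf *b = posted; b; b = b->next)
        if (((b->match_bits ^ match_bits) & ~b->ignore_bits) == 0 && len <= b->len) {
            memcpy(b->data, payload, len);      /* buffers live in user space */
            printf("matched 0x%llx\n", (unsigned long long)match_bits);
            return;
        }
    struct unexpected_msg *u = malloc(sizeof *u);
    u->match_bits = match_bits;
    u->payload = malloc(len);
    memcpy(u->payload, payload, len);
    u->len = len;
    u->next = unexpected;
    unexpected = u;
    printf("unexpected 0x%llx\n", (unsigned long long)match_bits);
}

int main(void)
{
    static char buf[64];
    static struct posted_buf rx = { .match_bits = 0x42, .data = buf, .len = sizeof buf };
    posted = &rx;
    deliver(0x42, "hello", 6);       /* matches the posted buffer */
    deliver(0x99, "later", 6);       /* no match: goes to the unexpected queue */
    return 0;
}
```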

X6 Compute Blade with Gemini

[X6 compute blade: four nodes (0–3); nodes 0 and 1 share Gemini 0, nodes 2 and 3 share Gemini 1]

XE6 Node Block Diagram

[XE6 node block diagram: two Opteron processors (X86-64), each with four DIMM slots, connect through HyperTransport links to a SeaStar (XT) or Gemini (XE) ASIC that attaches to the high-speed network (HSN) interconnect; an L0 controller monitors the node]

XIO Service Blades

• Four nodes; each node contains:
  – One AMD Opteron processor with up to 16 GB of DDR2 memory
    ƒ The processor is a six-core Opteron
  – A connection to a Gemini ASIC
  – Voltage regulating modules (VRMs)
• L0 controller
• Gemini mezzanine card
  – Contains two Gemini ASICs
• Four PCIe risers
  – One riser per node

XIO Blade

[XIO blade: four nodes; node 1 and node 3 connect to the lower mezzanine; nodes 0 and 1 share Gemini 0, nodes 2 and 3 share Gemini 1]

XIO Riser Configuration

[XIO riser configuration: the upper bay holds nodes 2 and 3 and the lower bay holds nodes 0 and 1; each Opteron node has DDR2 memory and connects over HyperTransport to a Gemini and to a mezzanine bridge driving a x16 PCIe Gen2 riser, with inner risers in 75 W slots and outer risers in 225 W slots]

XIO Boot Node Configuration

[XIO boot node configuration: same layout as the standard riser configuration, except node 0 (the boot node) connects to an outer x8 PCIe Gen2 riser]

Opteron Processors

[Block diagrams of AMD Opteron single-, dual-, quad-, and hex-core processors: each core has L1 I/D caches and an L2 cache; the quad-core adds a shared L3 cache; all share a system request queue, crossbar, memory controller, and HyperTransport links to DDR memory]
– Single-core: Cray XT3 systems (Socket 940: DDR1 buffered DIMMs and 3 HyperTransports)
– Dual-core: Cray XT3 systems (DDR1 buffered DIMMs and 3 HyperTransports); Cray XT4 systems (Socket AM2: DDR2 unbuffered DIMMs and one HyperTransport)
– Quad-core: Cray XT4 systems (Socket AM2: DDR2 unbuffered DIMMs and one HyperTransport); Cray XT5 systems (DDR2 registered/buffered DIMMs and 3 HyperTransports)

Opteron Processors

[Block diagrams of AMD Opteron six-core and twelve-core processors: each core has L1 caches and an L2 cache, with a shared L3 cache, system request queue, crossbar, memory controller, and HyperTransport links to DDR memory]
– Six-core: Cray XT5 systems (Socket F: DDR2 unbuffered DIMMs and 3 HyperTransports); L3 cache: 6 MB
– Twelve-core: Cray XT6 and Cray XE6 systems; dual-die socket (for an eight-core socket, two CPUs are removed from each side); DDR3 unbuffered DIMMs and 4 HyperTransports; L3 cache: 6 MB per die, 12 MB per socket

Gemini

• Gemini is designed to pass data to and from the network with less control
  – Gemini is suited for SMP
  – Gemini allows for adaptive routing
  – Hardware support for PGAS languages
    ƒ Remote memory address translation
  – Atomic memory operations
• The Gemini chip uses HyperTransport version 3

Gemini Block Diagram

[Gemini block diagram: node 0 and node 1, each with two Opterons, connect through HyperTransport to NIC 0 and NIC 1 respectively; both NICs share a router; an L0 processor monitors the ASIC]

Terminology

• BTE – Block Transfer Engine
• DMAPP – Distributed Memory Application
• FMA – Fast Memory Access
• GHAL – Generic Hardware Abstraction Layer
• GNI – Generic Network Interface
• PGAS – Partitioned Global Address Space
  – Programming language extensions such as Unified Parallel C (UPC) and Co-Array Fortran (CAF)

Performance and Resiliency Features

• Automatic link-level and HT3 retries
  – HT3 supports an improved CRC
• Congestion feedback to enable routing around network bottlenecks
  – Packets are reordered in the receive buffer (see the sketch after this list)
    ƒ Separate completion notification when all packets are stored
• Automatic link failure handling with software help
  – To route around a failed link, system traffic must be quiesced to prevent deadlock and preserve ordering
    ƒ This feature is also used to support hot blade swap
• Router errors are detected and reported at the point of error
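A minimal sketch of the reordering idea, assuming an arbitrary packet format and window size rather than Gemini hardware detail: packets carry a sequence number, each arrival is parked in its slot, and the completion notification fires only once every packet of the message is stored.

```c
#include <stdbool.h>
#include <stdio.h>

#define WINDOW 16                       /* assumed reorder window size */

struct reorder_buf {
    char  data[WINDOW][64];             /* payload slots, indexed by sequence number */
    bool  present[WINDOW];
    int   expected;                     /* total packets in this message */
    int   received;
};

/* Park a packet at its sequence slot; report completion when all have arrived. */
static void receive_packet(struct reorder_buf *rb, int seq, const char *payload)
{
    if (seq < 0 || seq >= rb->expected || rb->present[seq])
        return;                          /* duplicate or out-of-range: drop */
    snprintf(rb->data[seq], sizeof rb->data[seq], "%s", payload);
    rb->present[seq] = true;
    if (++rb->received == rb->expected)
        printf("completion notification: all %d packets stored\n", rb->expected);
}

int main(void)
{
    struct reorder_buf rb = { .expected = 3 };
    receive_packet(&rb, 2, "gamma");     /* packets may arrive out of order */
    receive_packet(&rb, 0, "alpha");
    receive_packet(&rb, 1, "beta");      /* completion fires here */
    return 0;
}
```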

Gemini ASIC Diagram

[Gemini ASIC diagram: two HT3 interfaces feed NIC 0 and NIC 1, which connect through the Netlink block to a 48-port YARC router built from a 6 x 8 grid of tiles (tile 0 at coordinate 0,0 through tile 47 at coordinate 5,7)]

Node Types

– Compute nodes
  ƒ No user login
  ƒ Only for execution of parallel applications
  ƒ Diskless – no permanent storage
  ƒ CNL operating system
    • Symmetrical multi-processing (SMP)
    • Supports the OpenMP programming model
    • apinit is part of ALPS (Application Level Placement Scheduler)
– Service nodes
  ƒ Provide services based on hardware/software configuration
  ƒ Node names: boot node, SDB, syslog, login, I/O, and network nodes
  ƒ Run the Linux operating system (SLES)

10/18/2010 Cray Private 31 Service Nodes

– Boot node
  ƒ First node booted
  ƒ Has its own root file system
  ƒ Exports the "shared root" file system to the other service nodes
– System database (SDB) node
  ƒ Stores system configuration information and system state
  ƒ Implemented with MySQL Pro
– Syslog node
  ƒ Collects syslog data from other service nodes
– Login nodes
  ƒ Provide users with a familiar Linux environment
  ƒ Edit, compile, submit, and monitor interactive or batch jobs
    • PBS Pro is the standard batch system

10/18/2010 Cray Private 32 Service Nodes

– I/O nodes
  ƒ Attach to RAID storage devices
  ƒ Perform remote DMA I/O through the HSN
  ƒ Function as metadata servers (MDS) and object storage servers (OSSs) for file systems
– Network nodes
  ƒ 10 Gigabit Ethernet (10GbE) adapter
  ƒ Connect to other file servers

10/18/2010 Cray Private 33 Service Node Software Components

– ALPS – Application Level Placement Scheduler
  ƒ Application placement, launch, and management for CNL
  ƒ Includes several daemons and client processes
– RCA – Resiliency Communication Agent
  ƒ Part of the kernel, on all nodes
  ƒ Sends a heartbeat message to the SMW (CRMS/HSS/CMS)
  ƒ Detects and responds to failing application launchers

10/18/2010 Cray Private 34 System Networks

• The high-speed network (HSN), or interconnect network
  – Connects the service and compute nodes
    ƒ The preferred topology is a folded torus
      • A torus is a circle: the ends of each dimension wrap around (see the neighbor sketch after this list)
      • Folding enables the use of the shortest cables possible to connect all nodes within a dimension
      • Depending on the class, the topology may be a mesh
    ƒ The class of the system governs the configuration
  – Applications move data across the HSN
• The Hardware Supervisory System (HSS) network
  – Monitors and controls the system
  – Ethernet network with the System Management Workstation (SMW) at the top of the network
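A small illustration of what "torus in a dimension" means for neighbor addressing, assuming an arbitrary dimension size rather than a real Cray configuration: the last position in a dimension wraps around to the first.

```c
#include <stdio.h>

/* Neighbor coordinates along one dimension of a torus of a given size:
 * the ends wrap around, so every position has exactly two neighbors. */
static int torus_plus(int coord, int size)  { return (coord + 1) % size; }
static int torus_minus(int coord, int size) { return (coord + size - 1) % size; }

int main(void)
{
    int size_z = 8;                              /* e.g. 8 positions in the Z dimension */
    for (int z = 0; z < size_z; z++)
        printf("z=%d  -Z neighbor=%d  +Z neighbor=%d\n",
               z, torus_minus(z, size_z), torus_plus(z, size_z));
    return 0;
}
```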

10/18/2010 Cray Private 35 Interconnect Network Topology Classes

• Configurations are based on topology classes (a rough code mapping follows this list):
  – Class 0 contains 1 to 9 chassis (1 to 3 cabinets)
    ƒ 1 row
  – Class 1 contains 4 to 16 cabinets
    ƒ 1 row
  – Class 2 contains 16 to 48 cabinets
    ƒ 2 equal-length rows
  – Class 3 contains 48 to 320 cabinets
    ƒ Three rows of equal length
      • A mesh in one or more dimensions
    ƒ Even number of 4 or more equal-length rows
      • Torus in all dimensions
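A rough mapping of cabinet count to topology class based on the ranges above; the ranges overlap at their boundaries and the real choice also depends on row layout, so the function below is illustrative only.

```c
#include <stdio.h>

/* Illustrative class selection from cabinet count, per the slide's ranges. */
static const char *topology_class(int cabinets)
{
    if (cabinets < 1 || cabinets > 320)   return "out of range";
    if (cabinets <= 3)                    return "Class 0 (1 row)";
    if (cabinets <= 16)                   return "Class 1 (1 row)";
    if (cabinets <= 48)                   return "Class 2 (2 equal rows)";
    return "Class 3 (3 or more equal rows)";
}

int main(void)
{
    int sizes[] = { 2, 8, 24, 96 };
    for (int i = 0; i < 4; i++)
        printf("%3d cabinets -> %s\n", sizes[i], topology_class(sizes[i]));
    return 0;
}
```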

10/18/2010 Cray Private 36 A Single-chassis Configuration

• The basic building block is a single chassis
  – A chassis is 1 x 4 x 8
    ƒ Dimensions are X x Y x Z
  – Each node on a blade is connected in the Y dimension (mezzanine)
  – Each node in a chassis is connected in the Z dimension (backplane)
  – All X-dimension connections are cables

[Chassis diagram: the 4 SeaStars in a blade run along Y, the 8 blade slots run along Z, and X connections leave the chassis on cables]

Network Connectors

Class 0 Topology

Class 0 Cable Drawing, 3 Cabinets

[Cable drawing showing the X, Y, and Z cabling directions]

Class 1 Topology

Class 1 Cable Drawing, 4 Cabinets

Class 2 Topology

Class 2 Cable Drawing, 16 Cabinets

Class 3 Topology

HSN Cabling Photo

Cray XT5m

• Cray XT5m topology
  – 1 to 18 chassis, 1 to 6 cabinets
  – All cabinets are in a single row
  – The system scales in units of chassis
  – 2D torus topology (Y and Z)
  – (Y x 4) x 8 topology, where Y is the number of populated chassis in the system (see the example below)
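As a worked instance of that formula, a Cray XT5m with 6 populated chassis forms a (6 x 4) x 8 = 24 x 8 torus in the Y and Z dimensions.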

Identifying Components

• System components are labeled according to physical ID (HSS identification), node ID, IP address, or class (a formatting helper follows the table)

  Component      Format         Description
  System         s0, all        All components attached to the SMW
  Cabinet        cX-Y           Cabinet number and row; this is the L1 host name
  Cage           cX-Yc#         Physical cage in cabinet: 0, 1, 2; cages are numbered from bottom to top
  Blade or slot  cX-Yc#s#       Physical blade slot in cage: 0 – 7, numbered from left to right; this is the L0 host name
  Node           cX-Yc#s#n#     Opteron chip on a blade: 0 – 3 for compute blades, 0 and 3 for SIO blades
  SeaStar ASIC   cX-Yc#s#s#     Cray SeaStar ASIC on a blade: 0 – 3
  Gemini ASIC    cX-Yc#s#g#     Cray Gemini ASIC on a blade: 0 – 1
  Link           cX-Yc#s#s#l#   Link port of a SeaStar ASIC: 0 – 5
                 cX-Yc#s#g#l#   Link port of a Gemini ASIC: 0 – 57
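A small helper that assembles a node's physical ID from its coordinates, following the cX-Yc#s#n# format in the table; the function name and argument order are illustrative, not a Cray utility.

```c
#include <stdio.h>

/* Build a node name of the form cX-Yc<cage>s<slot>n<node>, e.g. "c0-0c1s4n2". */
static void node_cname(char *buf, size_t len,
                       int cabinet, int row, int cage, int slot, int node)
{
    snprintf(buf, len, "c%d-%dc%ds%dn%d", cabinet, row, cage, slot, node);
}

int main(void)
{
    char name[32];
    node_cname(name, sizeof name, 0, 0, 1, 4, 2);
    printf("%s\n", name);        /* prints c0-0c1s4n2 */
    return 0;
}
```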

Node ID (NID)

• The node ID is a unique hexadecimal or decimal number that identifies each CLE node
  – The NID reflects the node location in the network (see the sketch below)
    ƒ In a SeaStar-based system, the NIDs are assigned on 128-number boundaries for each cabinet
    ƒ In a Gemini-based system, NIDs are sequential
  – The NID format is nidnnnnn (decimal) when it refers to the hostname of a node, for example: nid00003
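A short sketch of the NID hostname format, together with the 128-number cabinet boundary mentioned for SeaStar-based systems; the helper names are illustrative.

```c
#include <stdio.h>

/* Format a NID in the hostname style shown on the slide: nid00003. */
static void nid_hostname(char *buf, size_t len, int nid)
{
    snprintf(buf, len, "nid%05d", nid);
}

int main(void)
{
    char host[16];
    nid_hostname(host, sizeof host, 3);
    printf("%s\n", host);                        /* nid00003 */

    /* On a SeaStar-based system, NIDs start on 128-number boundaries per
     * cabinet, so cabinet n's first NID is n * 128 (illustrative). */
    for (int cab = 0; cab < 3; cab++)
        printf("cabinet %d: first NID %d\n", cab, cab * 128);
    return 0;
}
```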

Cabinet 0 NID Numbering Example

[NID numbering diagrams for cabinet 0 in a Cray XT (SeaStar-based) system and a Cray XE (Gemini-based) system]

Cray System

External Services for Cray Systems

– Implementation of services external to the Cray XT and Cray XE systems to address customer requirements
  ƒ More flexible user access
  ƒ More options for data management and data protection
  ƒ Leverage commodity components in customer-specific implementations
  ƒ Provide faster access to new devices and technologies
  ƒ Repeatable solutions that remain open to custom configuration
  ƒ Each solution can be used, scaled, and configured independently

[Diagram: esFS, esLogin, and esDM external services]

Specifically

– esFS
  ƒ Provides globally shared data between multiple systems
    • Cray systems and others
  ƒ Provides access to other file systems
  ƒ The Data Virtualization Service (DVS) is used to project Panasas or StorNext to the compute nodes
– esLogin
  ƒ Less dependence on a single Cray system for data access and compiles
  ƒ An enhanced user environment
    • Larger memory, swap space, and more horsepower
  ƒ Increased availability of data and system services to users
– esDM
  ƒ More options for data management and data protection

esLogin

[esLogin diagram: an esLogin node running the XE/AP client-side commands, the XE programming environment, modules, and batch client software connects over the user network to a service/MOM node in the Cray XE system, to the esMS management server, and to other services]