Cray XT and Cray XE System Overview
Customer Documentation and Training
Overview Topics
• System Overview
  – Cabinets, Chassis, and Blades
  – Compute and Service Nodes
  – Components of a Node
    • Opteron Processor
    • SeaStar ASIC
      – Portals API Design
    • Gemini ASIC
• System Networks
• Interconnection Topologies
Cray XT System
System Overview
[Diagram: system overview; compute nodes, login node, network node, boot/syslog/database nodes, and I/O and metadata nodes on the X, Y, Z interconnect; SMW and RAID subsystem attached via GigE, 10 GigE, and Fibre Channel]
Cabinet
– The cabinet contains three chassis, a blower for cooling, a power distribution unit (PDU), a control system (CRMS), and the compute and service blades (modules)
– All components of the system are air cooled
  • A blower in the bottom of the cabinet cools the blades within the cabinet
  • Other rack-mounted devices within the cabinet have their own internal fans for cooling
– The PDU is located behind the blower in the back of the cabinet
Liquid Cooled Cabinets
[Diagram: liquid-cooled cabinet; cages 0–2 with cage VRMs, backplane assemblies, cage ID controllers, interconnect network cable connections, and cage inlet air temperature sensors; heat exchangers (LC cabinets only) and airflow conditioner; 48 Vdc shelves 1–3 with flexible buses; L1 controller; blower speed controller (VFD); blower; PDU line filter; XDP temperature and humidity sensor and XDP interface board (behind perforated panels, XT5-HE LC cabinets only)]
Blades
• The system contains two types of blades:
  – Compute blades
    • 4 SeaStar or 2 Gemini ASICs
    • 4 nodes
  – Service blades
    • SIO (XT systems)
      – 4 SeaStar ASICs
      – 2 nodes
      – One dual-slot PCI-X or PCIe riser assembly per node
    • XIO (XE systems)
      – 2 Gemini ASICs
      – 4 nodes
      – One single-slot PCIe riser per node (except on the boot node)
PCI Cards
• Cray supports the following types of PCIe cards:
  – Gigabit Ethernet
  – 10 Gigabit Ethernet
  – Fibre Channel (FC2, FC4, and FC8)
  – InfiniBand
Node Components
– Opteron processor
  • Single-, dual-, quad-, six-, eight-, and twelve-core versions
– Node memory
  • 2 to 8 GB of DDR1 memory in Cray XT3 compute nodes
  • 2 to 8 GB of DDR1 memory in Cray XT service (SIO) nodes
  • 4 to 16 GB of DDR2 memory in Cray XT4 compute nodes
  • 4 to 32 GB of DDR2 memory in Cray XT5 compute nodes
  • 8 to 64 GB of DDR3 memory in Cray XT6 compute nodes
  • 8 to 64 GB of DDR3 memory in Cray XE6 compute nodes
  • 8 to 16 GB of DDR2 memory in Cray XE6 service (XIO) nodes
– SeaStar or Gemini ASIC
Cray XT4 Compute Blade
[Diagram: Cray XT4 compute blade; nodes 0–3 with node memory, node VRMs, and SeaStar ASICs; L0 controller; 48 Vdc to 12 Vdc VRMs]
XT4 Node Block Diagram
[Diagram: XT4 node block diagram; an x86-64 Opteron processor with 4 DIMM slots connects over a HyperTransport link to a SeaStar ASIC, which attaches to the high-speed network (HSN) interconnect; the blade is managed by an L0 controller]
SIO Service Blade
[Diagram: SIO service blade; nodes 0 and 3, each with a PCI riser (PCI-X or PCIe) and PCI cards]
SeaStar ASIC
– Direct HyperTransport connection to the Opteron
– The DMA engine transfers data between host memory and the network
  • Separate read and write sides in the DMA engine
  • The Opteron does not directly load/store to the network
  • On the receive side, a set of content addressable memory (CAM) locations is used to route incoming messages
    – There are 256 CAM entries
– Designed for the Sandia Portals API
SeaStar Diagram
[Diagram: SeaStar ASIC block diagram; a HyperTransport interface connects to the Opteron (core, northbridge, DDR SDRAM memory interface) and, on service blades, to PCI-X; the SeaStar contains a DMA engine, local memory, a PPU, an SSI connection to the L0 controller, and a router with x, y, and z links]
Node Connection Bandwidths
[Diagram: node connection bandwidths; each of the SeaStar's six HSN links (+X, -X, +Y, -Y, +Z, -Z) runs at 9.6 GB/s]
HyperTransport: Cray XT3, 6.4 GB/s bidirectional (2.17 GB/s sustained); Cray XT4, 8.0 GB/s bidirectional
Memory: Cray XT3, DDR1 6.4 GB/s (400 MHz DIMMs), 5.7 GB/s sustained, 51.6 ns measured latency; Cray XT4, DDR2 10.4 GB/s (667 MHz DIMMs), 8.5 GB/s sustained, 50 ns measured latency, or DDR2 12.8 GB/s (800 MHz DIMMs); Cray XT5, DDR2 10.4 GB/s (667 MHz DIMMs) or 12.8 GB/s (800 MHz DIMMs)
High-speed network: 9.6 GB/s bidirectional per link, 6.5 GB/s sustained
Portals API Design
– Designed for messaging among thousands of nodes
  • Performance is critical only in terms of scalability
  • Success is measured by how large an application is allowed to scale, not by a 2-node ping-pong time
– Designed to avoid scalability limitations
  • Is network independent
  • Bypasses the OS – no memory copies into or out of the kernel; few interrupts
  • Bypasses the application – does not require activity by the application to ensure progress
  • Reliable, ordered delivery of messages between two nodes
Portals API Design
– Receiver (target) managed – no memory is reserved for thousands of potential senders
  • Is connectionless; no explicit connection is established with other processes in the application
    – When connected, maintains a minimum amount of state
  • The target process determines how to respond to the incoming message
    – The target node can choose to accept or ignore a message from any node
  • The destination of any message is not an address – instead, "match bits" enable the receiver to place it
  • Unexpected messages are handled by the unexpected message queue
  • All buffers are in user space
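To make the receiver-managed matching concrete, below is a minimal, self-contained C sketch of the match-bits idea: the target posts entries with match and ignore bits, and an incoming message either lands in a posted user-space buffer or falls to the unexpected queue. The structures and function names are hypothetical illustrations of the concept, not the actual Portals API.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical match entry: the receiver posts these in advance.      */
    /* ignore_bits mask out bits that do not participate in matching.      */
    struct match_entry {
        uint64_t match_bits;
        uint64_t ignore_bits;
        void    *buffer;          /* user-space buffer owned by the target */
    };

    /* Return the first posted entry whose match/ignore bits accept the    */
    /* incoming message, or NULL so the caller can queue it as unexpected. */
    static struct match_entry *
    match_incoming(struct match_entry *list, int n, uint64_t incoming_bits)
    {
        for (int i = 0; i < n; i++) {
            uint64_t relevant = ~list[i].ignore_bits;
            if (((incoming_bits ^ list[i].match_bits) & relevant) == 0)
                return &list[i];
        }
        return NULL;   /* no match: message goes to the unexpected queue */
    }

    int main(void)
    {
        char bufA[1024], bufB[1024];
        struct match_entry posted[] = {
            { 0x10ULL, 0x00ULL, bufA },  /* accepts match bits 0x10 exactly        */
            { 0x00ULL, 0xffULL, bufB },  /* ignores low 8 bits: accepts 0x00-0xff  */
        };

        struct match_entry *e = match_incoming(posted, 2, 0x10);
        printf("message with bits 0x10 -> %s\n", e == &posted[0] ? "bufA" : "other");
        return 0;
    }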
X6 Compute Blade with Gemini
[Diagram: X6 compute blade with Gemini; nodes 0 and 1 share Gemini 0, nodes 2 and 3 share Gemini 1]
XE6 Node Block Diagram
[Diagram: XE6 node block diagram; two x86-64 Opteron processors, each with 4 DIMM slots, connect over HyperTransport links to a SeaStar (XT) or Gemini (XE) ASIC on the high-speed network (HSN) interconnect; the blade is managed by an L0 controller]
XIO Service Blades
• Four nodes; each node contains:
  – One AMD Opteron processor with up to 16 GB of DDR2 memory
    • The processor is a six-core Opteron
  – A connection to a Gemini ASIC
  – Voltage regulating modules (VRMs)
• L0 controller
• Gemini mezzanine card
  – Contains two Gemini ASICs
• Four PCIe risers
  – One riser per node
XIO Blade
[Diagram: XIO blade; nodes 0–3 with Gemini ASICs 0 and 1; nodes 1 and 3 connect to the lower mezzanine]
XIO Riser Configuration
[Diagram: XIO riser configuration; each node's Opteron, with DDR2 memory, connects through a HyperTransport mezzanine bridge to an x16 PCIe Gen2 riser slot (inner 75-watt and outer 225-watt slots) and through HyperTransport to a Gemini; two nodes sit in the upper bay and two in the lower bay]
XIO Boot Node Configuration
[Diagram: XIO boot node configuration; the same layout as the standard XIO riser configuration, except that node 0 uses an x8 PCIe Gen2 boot riser]
Opteron Processors
[Diagram: block diagrams of AMD Opteron single-, dual-, and quad-core processors, each showing CPU cores with L1 and L2 caches (plus a shared L3 cache on the quad-core), a system request queue, crossbar, memory controller, and HyperTransport links]
Single-core: Cray XT3 systems; Socket 940, DDR1 buffered DIMMs, 3 HyperTransports
Dual-core: Cray XT3 systems (Socket 940, DDR1 buffered DIMMs, 3 HyperTransports) and Cray XT4 systems (Socket AM2, DDR2 unbuffered DIMMs, one HyperTransport)
Quad-core: Cray XT4 systems (Socket AM2, DDR2 unbuffered DIMMs, one HyperTransport) and Cray XT5 systems (Socket F, DDR2 registered (buffered) DIMMs, 3 HyperTransports)
Opteron Processors
[Diagram: block diagrams of AMD Opteron six-core and twelve-core processors, each showing CPU cores with L1 and L2 caches, a shared L3 cache, system request queue, crossbar, memory controller, and HyperTransport links]
Six-core: Cray XT5 systems; Socket F, DDR2 unbuffered DIMMs, 3 HyperTransports; 6 MB L3 cache
Twelve-core: dual-die socket (an eight-core socket removes two CPUs from each die); Cray XT6 and Cray XE6 systems; Socket G34, DDR3 unbuffered DIMMs, 4 HyperTransports; L3 cache of 6 MB per die, 12 MB per socket
Gemini
• Gemini is designed to pass data to and from the network with less control
  – Gemini is suited for SMP
  – Gemini allows for adaptive routing
  – Hardware support for PGAS languages
    • Remote memory address translation, as with Cray X2 systems
  – Atomic memory operations
• The Gemini chip uses HyperTransport version 3
Gemini Block Diagram
[Diagram: Gemini block diagram; node 0 and node 1, each with two Opterons, connect via HyperTransport to NIC 0 and NIC 1, which share a common router; an L0 processor manages the device]
Terminology
• BTE – Block Transfer Engine
• DMAPP – Distributed Memory Application
• FMA – Fast Memory Access
• GHAL – Generic Hardware Abstraction Layer
• GNI – Generic Network Interface
• PGAS – Partitioned Global Address Space
  – Programming language extensions such as Unified Parallel C (UPC) and Co-Array Fortran (CAF); the addressing idea is sketched below
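As a conceptual illustration of PGAS addressing (not actual UPC, CAF, or DMAPP code; the names and the block distribution are assumptions made for the example), the following C sketch translates a global array index into a (processing element, local offset) pair, which is the kind of remote-address translation the Gemini hardware is said to support.

    #include <stdio.h>

    /* Conceptual PGAS addressing: a "global" array of n elements is       */
    /* partitioned into equal blocks across npes processing elements.      */
    struct pgas_addr { int pe; int local; };

    static struct pgas_addr translate(int global_index, int n, int npes)
    {
        int block = (n + npes - 1) / npes;              /* elements per PE */
        struct pgas_addr a = { global_index / block, global_index % block };
        return a;
    }

    int main(void)
    {
        struct pgas_addr a = translate(1000, 4096, 16); /* 4096 elements on 16 PEs */
        printf("global 1000 -> PE %d, local %d\n", a.pe, a.local); /* PE 3, local 232 */
        return 0;
    }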
Performance and Resiliency Features
• Automatic link-level and HT3 retries
  – HT3 supports an improved CRC
• Congestion feedback to enable routing around network bottlenecks
  – Packets are reordered in the receive buffer
    • Separate completion notification when all are stored
• Automatic link failure handling with software help
  – To route around a failed link, system traffic must be quiesced to prevent deadlock and preserve ordering
  – This feature is also used to support hot blade swap
• Router errors are detected and reported at the point of error
Gemini ASIC Diagram
[Diagram: Gemini ASIC; two HT3 interfaces feed NIC 0 and NIC 1, which connect through the Netlink block to a 48-port YARC router built from 48 tiles (tile 0 at coordinate 0,0 through tile 47 at coordinate 5,7)]
Node Types
– Compute nodes
  • No user login
  • Only for execution of parallel applications
  • Diskless – no permanent storage
  • CNL operating system
    – Linux symmetric multi-processing (SMP)
    – Supports the OpenMP programming model
    – apinit is part of ALPS (Application Level Placement Scheduler)
– Service nodes
  • Provide services based on hardware/software configuration
  • Node names: boot node, SDB, syslog, login, I/O, and network nodes
  • Run the Linux operating system (SLES)
Service Nodes
– Boot node
  • First node booted
  • Has its own root file system
  • Exports the "shared root" file system to the other service nodes
– System database (SDB) node
  • Stores system configuration information and system state
  • Implemented with MySQL Pro
– Syslog node
  • Collects syslog data from other service nodes
– Login nodes
  • Provide users with a familiar Linux environment
  • Edit, compile, submit, and monitor interactive or batch jobs
    – PBS Pro is the standard batch system
Service Nodes
– I/O nodes
  • Attach to RAID storage devices
  • Perform remote DMA I/O through the HSN
  • Function as metadata servers (MDSs) and object storage servers (OSSs) for Lustre file systems
– Network nodes
  • 10 Gigabit Ethernet (10GbE) adapter
  • Connect to other file servers
Service Node Software Components
– ALPS – Application Level Placement Scheduler
  • Application placement, launch, and management for CNL
  • Includes several daemons and client processes
– RCA – Resiliency Communication Agent
  • Part of the kernel, on all nodes
  • Sends a heartbeat message to the SMW (CRMS/HSS/CMS)
  • Detects and responds to failing application launchers
System Networks
• The high-speed network (HSN), or interconnect network
  – Connects the service and compute nodes
  – The preferred topology is a folded torus
    • A torus is a circle – each dimension wraps around on itself (see the sketch after this list)
    • Folding enables the use of the shortest cables possible to connect all nodes within a dimension
    • Depending on the class, the topology may be a mesh
  – The class of the system governs the configuration
  – Applications move data across the HSN
• The Hardware Supervisory System (HSS) network
  – Monitors and controls the system
  – An Ethernet network with the System Management Workstation (SMW) at the top of the network
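A minimal C sketch of the torus wraparound mentioned above (the dimension size and coordinates are example values only): neighbor positions are computed with modular arithmetic, so the last position in a dimension is adjacent to the first; a mesh would clamp at the edges instead of wrapping.

    #include <stdio.h>

    /* Neighbor coordinate in a torus dimension of a given size:         */
    /* coordinates wrap around, so position size-1 is adjacent to 0.     */
    static int torus_neighbor(int coord, int step, int size)
    {
        return ((coord + step) % size + size) % size;  /* handles step = -1 too */
    }

    int main(void)
    {
        int z_size = 8;   /* example: 8 positions in the Z dimension */
        int z = 7;        /* last position in the dimension          */
        printf("+Z neighbor of z=%d is z=%d\n", z, torus_neighbor(z, +1, z_size)); /* 0 */
        printf("-Z neighbor of z=0 is z=%d\n", torus_neighbor(0, -1, z_size));     /* 7 */
        return 0;
    }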
Interconnect Network Topology Classes
• Configurations are based on topology classes:
  – Class 0 contains 1 to 9 chassis (1 to 3 cabinets)
    • 1 row
  – Class 1 contains 4 to 16 cabinets
    • 1 row
  – Class 2 contains 16 to 48 cabinets
    • 2 equal-length rows
  – Class 3 contains 48 to 320 cabinets
    • Three rows of equal length
      – A mesh in one or more dimensions
    • An even number of 4 or more equal-length rows
      – Torus in all dimensions
A Single-chassis Configuration
• The basic building block is a single chassis
  – A chassis is 1 x 4 x 8
    • Dimensions are: X x Y x Z
  – Each node on a blade is connected in the Y dimension (mezzanine)
  – Each node in a chassis is connected in the Z dimension (backplane)
  – All X-dimension connections are cables
[Diagram: a single chassis with X, Y, and Z axes; the 4 SeaStars in a blade run along the Y dimension]
Network Connectors
Class 0 Topology
Class 0 Cable Drawing, 3 Cabinets
Class 1 Topology
Class 1 Cable Drawing, 4 Cabinets
Class 2 Topology
Class 2 Cable Drawing, 16 Cabinets
Class 3 Topology
HSN Cabling Photo
Cray XT5m
• Cray XT5m topology
  – 1 to 18 chassis, 1 to 6 cabinets
  – All cabinets are in a single row
  – The system scales in units of chassis
  – 2D torus topology (Y and Z)
  – (Y x 4) x 8 topology, where Y is the number of populated chassis in the system
    • For example, a fully populated 6-cabinet system has 18 chassis, giving a (18 x 4) x 8, that is 72 x 8, torus in Y and Z
Identifying Components
• System components are labeled according to physical ID (HSS identification), node ID, IP address, or class

Component       Format          Description
System          s0, all         All components attached to the SMW
Cabinet         cX-Y            Cabinet number and row; this is the L1 host name
Cage            cX-Yc#          Physical cage in cabinet: 0, 1, or 2, numbered from bottom to top
Blade or slot   cX-Yc#s#        Physical blade slot in cage: 0 – 7, numbered from left to right; this is the L0 host name
Node            cX-Yc#s#n#      Opteron chip on a blade: 0 – 3 for compute blades, 0 and 3 for SIO blades
SeaStar ASIC    cX-Yc#s#s#      Cray SeaStar ASIC on a blade: 0 – 3
Gemini ASIC     cX-Yc#s#g#      Cray Gemini ASIC on a blade: 0 – 1
Link            cX-Yc#s#s#l#    Link port of a SeaStar ASIC: 0 – 5
                cX-Yc#s#g#l#    Link port of a Gemini ASIC: 0 – 57
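As an illustration of the physical ID format in the table above, this minimal C sketch (a hypothetical helper, not a Cray utility) builds the cname of a node from its cabinet, row, cage, slot, and node numbers.

    #include <stdio.h>

    /* Build the HSS physical ID (cname) of a node: cX-Yc#s#n#            */
    /* cab = cabinet number, row = row number,                            */
    /* cage = 0-2, slot = 0-7, node = 0-3 (0 and 3 on SIO blades).        */
    static void node_cname(char *buf, size_t len,
                           int cab, int row, int cage, int slot, int node)
    {
        snprintf(buf, len, "c%d-%dc%ds%dn%d", cab, row, cage, slot, node);
    }

    int main(void)
    {
        char cname[32];
        node_cname(cname, sizeof cname, 0, 0, 1, 4, 2);
        printf("%s\n", cname);  /* c0-0c1s4n2: cabinet 0, row 0, cage 1, slot 4, node 2 */
        return 0;
    }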
Node ID (NID)
• The node ID (NID) is a unique hexadecimal or decimal number that identifies each CLE node
  – The NID reflects the node's location in the network
    • In a SeaStar-based system, NIDs are assigned on 128-number boundaries for each cabinet
    • In a Gemini-based system, NIDs are sequential
  – The NID format is nidnnnnn (decimal) when it refers to the hostname of a node, for example: nid00003
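A small C sketch of the NID naming rule above (illustrative only): it formats a decimal NID as its nidnnnnn hostname and shows the 128-per-cabinet boundary used in SeaStar-based systems.

    #include <stdio.h>

    /* Format a decimal NID as its hostname, e.g. 3 -> "nid00003". */
    static void nid_hostname(char *buf, size_t len, int nid)
    {
        snprintf(buf, len, "nid%05d", nid);
    }

    int main(void)
    {
        char name[16];
        nid_hostname(name, sizeof name, 3);
        printf("%s\n", name);                              /* nid00003 */

        /* SeaStar-based systems assign NIDs on 128-number boundaries   */
        /* per cabinet, so the first NID of cabinet n is n * 128        */
        /* (Gemini-based systems simply number nodes sequentially).     */
        int cabinet = 2;
        printf("first NID in cabinet %d: %d\n", cabinet, cabinet * 128);
        return 0;
    }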
Cabinet 0 NID Numbering Example
[Diagram: NID numbering example for cabinet 0, comparing a Cray XT (SeaStar-based) system with a Cray XE (Gemini-based) system; NIDs are listed per blade slot in each cage]
Cray System
External Services for Cray Systems
– Implementation of services external to the Cray XT and Cray XE systems to address customer requirements
  • More flexible user access
  • More options for data management and data protection
  • Leverage commodity components in customer-specific implementations
  • Provide faster access to new devices and technologies
  • Repeatable solutions that remain open to custom configuration
  • Each solution can be used, scaled, and configured independently
esFS esLogin esDM
Specifically
– esFS
  • Provides globally shared data between multiple systems
    – Cray systems and others
  • Provides access to other file systems
    – Data Virtualization Service (DVS) is used to project Panasas or StorNext to the compute nodes
– esLogin
  • Less dependence on a single Cray system for data access and compiles
  • An enhanced user environment
    – Larger memory, swap space, and more horsepower
    – Increased availability of data and system services to users
– esDM
  • More options for data management and data protection
esLogin
[Diagram: esLogin architecture; an esLogin node on the user network connects to an esMS, to other services, and to a Cray XE system service/MOM node; the esLogin node hosts the XE programming environment, XE client-side commands, modules, and batch client software]