XD1 DATASHEET

Cray XD1™ Release 1.3

■ Purpose-built for HPC — delivers exceptional application The Cray XD1 supercomputer combines breakthrough performance interconnect, management and reconfigurable computing

■ Affordable power — designed for a broad range of HPC technologies to meet users’ demands for exceptional workloads and budgets performance, reliability and usability. Designed to meet

, 32 and 64-bit x86 compatible — runs a wide variety of ISV the requirements of highly demanding HPC applications applications and open source codes in fields ranging from product design to weather prediction to

■ Simplified system administration — automates configuration and scientific research, the Cray XD1 system is an indispensable management functions tool for engineers and scientists to simulate and analyze

■ Highly reliable — monitors and maintains system health faster, solve more complex problems, and bring solutions

■ Scalable to hundreds of compute nodes — high bandwidth and to market sooner. low latency let applications scale Direct Connected Processor Architecture Cray XD1 System Highlights

The Cray XD1 system is based on the Direct Connected Processor (DCP) architecture,

harnessing many processors into a single, unified system to deliver new levels of application � � � ���� � � � Compute� � � Processors performance. ���� �

Cray’s implementation of the DCP architecture optimizes message-passing applications by 12 AMD Opteron™ 64-bit single directly linking processors to each other through a high performance interconnect fabric, or dual core processors run eliminating shared memory contention and PCI bus bottlenecks. Linux and are organized as six nodes of 2 or 4-way SMPs to deliver up to 106 GFLOPS* per chassis. Matching memory and I/O performance removes bottlenecks and maximizes processor performance.

� � � ���� � � � R� �api� dArray™ Interconnect ���� � The industry’s fastest embedded switching fabric, the RapidArray interconnect uses 12 custom communications processors and a 96 GigaBytes (GB) per second nonblocking switching fabric per chassis to deliver 8 GB per second bandwidth �� between nodes with 1.7 microsecond MPI latency. Each

� � chassis presents 24 RapidArray � ���� � � � � � � links externally with an � �� � � aggregate 48 GB per second bandwidth between chassis.

� � � ���� � � � � � � Application Acceleration ���� � ������������������������������ Six Xilinx Virtex-4 Field Programmable Gate Arrays (FPGAs) per chassis attach to the RapidArray fabric for massively parallel execution of critical algorithm components, promising orders of magnitude performance improvement for target applications.

� � � ���� � Highly modular, the Cray XD1 base unit is a chassis. Up to 12 chassis can be installed in a � � � � � Active Management � �� � cabinet. Multi-cabinet configurations integrate hundreds of processors into a single system. � A management processor Chassis Each Cabinet in each chassis, a realtime Compute Processors 12 144 , distributed software and an independent Performance 106 GFLOPS* 1.27 TFLOPS* supervisory network operate Aggregate Switching Capacity 96 GB/s 1152 GB/s together to monitor, control and MPI Interprocessor Latency 1.7 µsec 2.0 µsec manage every aspect of the

Aggregate Memory Bandwidth 77 GB/s 922 GB/s computer. The Cray XD1 system enables single system command Maximum Memory 96 GB 1.2 TB and control and provides Maximum Local Disk Storage 1.5 TB 18 TB extensive high availability features. * peak theoretical performance with 2.2 GHz Dual Core AMD Opteron processors

CRAY XD1 DATASHEET The Cray XD1 HPC system is based on the � � � ���� � � � R� �api� dArray Interconnect DCP architecture. This innovative new computer ���� � unifies up to hundreds of processors into The Cray XD1 RapidArray interconnect a single, balanced computer, delivering directly connects processors over high-speed, exceptional performance, reliability and 1.7 low-latency pathways. This DCP architecture usability. The Cray XD1 architecture includes eliminates the memory contention and PCI microsecond MPI latency four key subsystems: bus bottlenecks. • Compute Environment • RapidArray Interconnect Each fully configured chassis includes: • Application Acceleration • 12 custom communications processors System Topologies • Active Management • a 96 GB per second nonblocking The Cray XD1 embedded switching fabric embedded switch fabric, which includes and 24 interchassis links enable a wide variety main and optional expansion fabrics

� � of network topologies. Topologies initially � ���� � � � • Optimized communications libraries Compute� � � Environment ���� � supported include direct connect and fat tree. The Cray XD1 compute subsystem is composed A fat tree topology is ideal for larger systems of a high-performance Linux operating system (or small systems that are expected to grow) and AMD Opteron 64-bit processors. 96 GB/s that run applications that use all-to-one, one- to-all, or all-to-all communications. Systems up Direct Connect Bandwidth to 24 chassis can be supported with a single The AMD Opteron processor’s integrated nonblocking switching layer fat tree. Each half of the RapidArray fab- memory controller increases memory fabric per chassis ric forms a separate fat tree, with its own set bandwidth and reduces latencies. Three of external switches. HyperTransport™ links per processor provide up to 19.2 GB per second I/O bandwidth. RapidArray Communications Processors Lustre Global Parallel File System The Lustre parallel file system provides a high- Dual Core Processors Tightly coupled to the AMD Opteron performance, high-availability, object-based Dual Core AMD Opteron processors are processors and switching fabric, the storage architecture for the Cray XD1 system available as an option on the Cray XD1 RapidArray processors offload and accelerate and can scale to thousands of nodes. Lustre system. Two processing cores exist on a single communications functions from the Opteron runs natively across the Cray XD1 RapidArray die, doubling peak performance, without processor, freeing it to perform core compute interconnect, avoiding TCP/IP overheads. raising power consumption and heat levels. tasks and enabling concurrent computing and communication.

The communications processors deliver up to

� � � ���� � 8 GB per second bandwidth with 1.7 � � � � � Application Acceleration ���� � microsecond MPI latency between nodes. The communications processors also deliver The application acceleration subsystem Linux 1.1 microsecond latency on the randomly incorporates reconfigurable computing capabilities to deliver superlinear speedup for a wide variety ordered ring latency test, part of the HPC Challenge (HPCC) benchmark suite. With of targeted applications. of HPC codes interconnect bandwidth on par with memory Each Cray XD1 chassis can be configured bandwidth, a major system bottleneck is removed with six application acceleration processors, for many applications, improving performance Global Process Synchronization based on Xilinx Virtex-4 FPGAs, that can be and greatly simplifying software development. Optimizations to the Linux scheduler result in programmed to accelerate key algorithms. the synchronization of process execution RapidArray Embedded Switching Fabric Well suited to functions such as all_reduce across the system, minimizing time spent by operations, searching, sorting and signal applications at barriers, and increasing A 96 GB per second, nonblocking, crossbar switching fabric in each chassis provides processing, the application acceleration application performance as much as 50 subsystem acts as a coprocessor to the percent. four 2 GB per second links to each node and twenty-four 2 GB per second AMD Opteron processors, handling the interchassis links. computationally intensive and highly repetitive Standards Based algorithms that can be significantly The Linux operating system, combined with the accelerated through parallel execution. AMD Opteron processor’s support for 32 and RapidArray Communications Libraries 64-bit x86 compatible computing, allows users The RapidArray communications libraries The application acceleration processors to take advantage of a wide range of work in conjunction with the RapidArray are tightly integrated with Linux and the commercial and open-source applications. communications processors to streamline AMD Opteron processors and use standard communications protocols, bypassing the software programming APIs, removing a major Linux kernel where possible and optimizing obstacle to application speedup. the compute/communicate interaction. The RapidArray interconnect delivers outstanding performance for highly parallel applications built upon the MPI, shmem, and Global Arrays standards. Balance

The Key to Exceptional Application Performance

It takes more than simply adding processors to increase HPC application performance. MPI applications, such as simulation and modeling, are compute- and

� � Self-Healing communicate- � ���� � � � � � � Active Management ���� � The Cray XD1 system provides extensive fault intensive. They must The active management subsystem delivers detection, isolation, and prediction capabilities, perform millions of coupled with automated proactive and reactive outstanding usability and reliability through calculations and then self-healing intelligence. single system command and control and self- exchange intermediate healing capabilities. results before beginning another iteration. Single System Command and Control

The active management system combines To maximize processor partitioning and intelligent self-configuring 200 performance, this features to allow administrators to view and sensors to monitor system health exchange of data must be manage hundreds of processors as one or fast enough to prevent more logical computers. the compute processor Partitions divide the Cray XD1 system Fault Detection: In each chassis, a dedicated from sitting idle, waiting into logical computers. Administrators view management processor with its own supervisory for data. Thus, the faster and operate on partitions, rather than on network continuously monitors over 200 critical the processors, the individual nodes. Self-configuration features hardware functions, including air flow, greater the bandwidth automatically translate partition details into temperatures, voltages, in-rush currents, parity required and the lower node configurations. Transaction processing errors and component diagnostics. techniques ensure configuration integrity. the latency that can be Proactive Management: Sophisticated proactive tolerated. Single system control is provided over common controls adjust a broad range of operating administrative functions: system configuration, parameters to maintain peak performance and A well-balanced system monitoring and fault tolerance, software optimal operating conditions. These proactive matches processing upgrades, storage management, network measures improve the mean time between power with memory, management, user management, security, failures (MTBF) and ensure system resiliency interprocessor and I/O and resource and queue management. These and job completion. functions may be accessed using the active bandwidth, ensuring that manager graphical user interface (GUI) or a Recovery from Failures: The Cray XD1’s self- applications are free to command line interface (CLI). healing intelligence facilitates a quick and communicate results automated recovery in the event of a hardware without experiencing Single system command and control failure, reducing outages from hours to minutes. eliminates common configuration problems, delays. increases system uptime, and significantly Redundancy features include “N+1 sparing” reduces the time and effort necessary to manage and the ability to reallocate resources in the a multiprocessor, high performance computer. event of a failure, enabling a replacement node to assume the persona of a failed node and restore full capacity to the affected partition. Jobs are automatically rescheduled from the last checkpoint. CRAY XD1 SUPERCOMPUTER Technical data Release 1.3

CPU 12 64-bit AMD Opteron 200 series single or dual core processors per chassis Cache 64K L1 instruction cache, 64K L1 data cache, 1 MB L2 cache per processor FLOPS 106 GFLOPS theoretical peak performance (@ 2.2 GHz Dual Core) SMP 2-way (single core) or 4-way (dual core) nodes Main Memory 12–48 GB PC3200 (DDR400) Registered ECC SDRAM per chassis or 96 GB PC2700 (DDR333) Registered EEC SDRAM per chassis (1- 8 GB per socket) Memory Bandwidth 12.8 GB/s per node

2 or 4 Cray RapidArray links per node (4 or 8 GB/s per node) Fully nonblocking Cray RapidArray switch fabric (48 GB/s or 96 GB/s) Interconnect 12 or 24 external Cray RapidArray interchassis links — 24 or 48 GB/s aggregate 1.7 µs MPI latency between nodes Direct Connect or Fat Tree Topology

Application Acceleration (FPGA) 6 Xilinx Virtex-4 FPGAs, XC4VLX160-10 or XC4VSX55-10, 16 MB QDR RAM, 3.2 GB/s interconnect

4 PCI-X bus slots Quad Port Gigabit Ethernet PCI-X card External I/O Options Single Port 10 Gigabit Ethernet PCI-X Card Single and Dual Port Fibre Channel HBAs Single Port Infiniband HCA

Disk Up to six 3.5 inch Serial ATA drives (74GB 10K RPM or 250 GB 7200 RPM)

Graphical and command line system administration tool sets for partition management, fault management, configuration, security, software updating, telemetry and provisioning across all chassis in a system Partitioning of system into multiple logical computers Administration of entire partition as a single entity (single system command and control) Transaction processing used to ensure configuration consistency System Administration Workload management - Grid Engine, PBS Pro, Platform LSF Automatic interconnect topology verification and auto-configuration of L2 and IP networking Automatic response to component failures: isolation of hard failures, re-initialization on soft failures, switching around redundant components. Automatic re-start of jobs from last checkpoint following system failure

Management processor, network, over 200 measurement points on each Cray XD1 chassis Independent 100 Mb/s management fabric within and between Cray XD1 chassis Thermal stability maintained through temperature monitoring and regulation of fan speed Reliability Proactive detection of impending component failure (CPUs, fans, power supplies, memory, interconnect) and automatic isolation of failed components Hot swappable fans and disk blades

Operating System Cray HPC enhanced Linux, Kernel version 2.6.5 File Systems EXT2/3, NFS v2/3, ReiserFS, Lustre Global Parallel File System External Storage Fibre Channel disk controllers and drive enclosures Parallel Processing MPI 1.2 MPI-IO, MPI-2, Sockets Direct Protocol (SDP) for TCP/IP acceleration Shared Memory Access Shmem, OpenMP, Global Arrays Compilers Fortran 77, 90, 95, HPF; C/C++; Java Power 2200 W typical. 3 phase: Circuit requirement: 20A, 208 V per chassis Single phase: Circuit requirement: 30A, 230 V per chassis Dimensions 3 VU (5.25”) x 23” W x 32” D per chassis (13.3 cm high x 58.4 cm wide x 81.3 cm deep), 12 chassis per cabinet

For the most up-to-date information, please visit www.cray.com The Supercomputer Company

Global Headquarters: Cray Inc. 411 First Avenue S., Suite 600 Seattle, WA 98104-2860 USA tel (206) 701 2000 fax (206) 701 2500

Sales Inquiries: North America: 1 (877) CRAY INC Worldwide: 1 (651) 605 8817 [email protected] www.cray.com

© 2005 Cray Inc. All rights reserved. Specifications subject to change without notice. Cray is a registered trademark, and the Cray logo, Cray XD1 and RapidArray are trademarks of Cray Inc. All other trademarks mentioned herein are the properties of their respective owners. 0405