XK6 REDEFINING SUPERCOMPUTING

- Sanjana Rakhecha
- Nishad Nerurkar

CONTENTS

| Introduction
| History
| Specifications
| Cray XK6
| Architecture
| Performance
| Industry acceptance and applications
| Summary

INTRODUCTION

| The Cray XK6 is a trifecta of scalar, network and many-core innovation.
| Hybrid supercomputer
| Combination of Cray's Gemini interconnect, AMD's leading multi-core scalar processors and NVIDIA's powerful many-core GPU processors
| Enhanced version of the XE6
| Uses the same blade architecture as the Cray XE6
| Capable of scaling to 500,000 scalar processors and 50 petaflops of hybrid peak performance

HISTORY

| In 1988, Cray Research introduced the Cray Y-MP, the world's first supercomputer to sustain over 1 gigaflop on many applications.
| Fujitsu's Numerical Wind Tunnel supercomputer used 166 vector processors to gain the top spot in 1994, with a peak speed of 1.7 gigaflops per processor.
| The Hitachi SR2201 reached a peak performance of 600 gigaflops in 1996 by using 2048 processors.
| The Intel Paragon had 1000 to 4000 Intel i860 processors and was ranked the fastest in the world in 1993.

SUPER-COMPUTER STATISTICS

COMPARISON WITH THE PRESENT CRAY

CRAY XK6 - ARCHITECTURE

| Four nodes per blade
| Adaptive hybrid computing
| Scalable compute nodes and I/O
| Gemini mezzanine
| Plug compatible with the Cray XE6 blade
| Configurable processor, memory and SXM GPU
| AMD Opteron 6200 Series processor:
  - Highly associative on-chip data cache supports aggressive out-of-order execution
  - Integrated memory controller
  - Significant performance advantage for algorithms that stress local memory bandwidth
| NVIDIA Tesla 20-series: based on the next-generation CUDA GPU architecture codenamed "Fermi"

NODE - ARCHITECTURE

XK6 ACCELERATOR BLADE

GEMINI INTERCONNECTION NETWORK

| Each Gemini ASIC acts as 2 nodes on a 3D torus.
| Each node is provided with a high-radix YARC router supporting up to 168 Gbps.
| Parallel electrical and optical paths
  - High bandwidth and lower latency for both long and short messages
  - Low cost of integration
| Gemini mezzanine card avoids bottlenecks between memory and the interconnection network (ICN).

NVIDIA TESLA X2090

| Special embedded version of the Tesla M2090.
| Provides high-performance computing for highly parallel applications.
| 512 cores with 6 GB of GDDR5 memory; 600+ GFLOPs of peak performance.
| High bandwidth to the host for quick master-slave communication.
| CUDA capable for easy programmability.

CRAY XK6 CABINETS

| Each cabinet has up to 96 processors
| Two processors wrapped in the form of a “blade” (XE6 compatible)
| With 1536 cores, a cabinet can deliver 70+ TFLOPs of performance

SPECIFICATIONS

PERFORMANCE - LUDWIG

| 10 cabinets of Cray XK6
| 936 GPUs (nodes)
| Only 4% deviation from perfect scaling between 8 and 936 GPUs
| Application sustaining 40+ Tflop/s and still scaling
| Strong scaling is also very good, but physicists want to simulate larger systems

PERFORMANCE - HIMENO

| Parallel 3D Poisson equation solver benchmark
| Iterative loop evaluating a 19-point stencil
| Co-Array Fortran version of the code
| Fully ported to accelerators using 27 directive pairs
| Strong scaling
| Use asynchronous GPU data transfers and kernel launches to help avoid this

INDUSTRIAL ACCEPTANCE

• Oak Ridge National Laboratory Jaguar/

| High computation capacity for Scientific research

| 200 cabinets with > 18000 nodes.

| Estimated 10 – 20 PFLOPs

| Currently being upgraded from the XT5-based Jaguar system to the XK6-based Titan system, with increased performance.
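As a rough consistency check on the Titan figures above, the sketch below combines them with the per-cabinet and per-GPU numbers quoted elsewhere in these slides (up to 96 processors and 70+ TFLOPs per cabinet, 600+ GFLOPs per GPU). It assumes one processor and one GPU per node and treats the vendor peak figures as exact, so the results are order-of-magnitude bounds and not measurements. The function and constant names are illustrative only.

```python
# Back-of-envelope check of the Titan estimate using only figures quoted in these slides.
# Assumptions (not from the source): one processor and one GPU per node,
# and the "up to" / "+" vendor peak figures are treated as exact values.

CABINETS = 200                  # planned cabinet count
NODES_QUOTED = 18_000           # "> 18000 nodes"
PROCESSORS_PER_CABINET = 96     # "up to 96 processors" per cabinet
TFLOPS_PER_CABINET = 70         # "70+ TFLOPs" per cabinet
GFLOPS_PER_GPU = 600            # "600+ GFLOPs" per Tesla X2090


def node_slots(cabinets: int, per_cabinet: int) -> int:
    """Maximum node count if every cabinet is fully populated."""
    return cabinets * per_cabinet


def peak_from_cabinets(cabinets: int, tflops_per_cabinet: float) -> float:
    """Hybrid peak in PFLOPs, scaled up from the per-cabinet figure."""
    return cabinets * tflops_per_cabinet / 1_000


def peak_from_gpus(nodes: int, gflops_per_gpu: float) -> float:
    """GPU-only peak in PFLOPs, ignoring the CPU contribution."""
    return nodes * gflops_per_gpu / 1_000_000


if __name__ == "__main__":
    print("node slots:", node_slots(CABINETS, PROCESSORS_PER_CABINET))
    print("peak from cabinets (PFLOPs):", peak_from_cabinets(CABINETS, TFLOPS_PER_CABINET))
    print("peak from GPUs only (PFLOPs):", peak_from_gpus(NODES_QUOTED, GFLOPS_PER_GPU))
```

Under these assumptions, 200 fully populated cabinets give 19,200 node slots, which is consistent with "> 18000 nodes", and both peak estimates (14 and 10.8 PFLOPs) fall inside the quoted 10 to 20 PFLOPs range.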

| CSCS - Swiss National Supercomputing Centre
| Cray XE6
  - 402 Tflops
  - 1496 nodes
  - Gemini interconnects
| Cray XK6
  - 176 nodes with one AMD and one GPU element each

SUMMARY

| Higher supercomputing potential with GPU-accelerated computing
| Better inter-node communication with the Gemini optical interconnects
| Backward compatible with XE6 cabinets and can be merged with XE6 systems
| Highly suited to scientific research computations requiring computational power on the order of hundreds of TFLOPs

REFERENCES

| http://www.cray.com/Products/XK6/XK6.aspx
| CrayXK6Brochure.pdf
| http://en.wikipedia.org/wiki/Supercomputer
| http://i.top500.org/stats
| Applications on Cray XK6, Roberto Ansaloni