Appro Supercomputer Solutions and University of Tsukuba Accelerator Cluster Collaboration

Steven Lyness, VP HPC Solutions Engineering

About Appro :: Over 20 Years of Experience

1991–2000: OEM Server Manufacturer
2001–2007: Branded Servers
2007–2012: End-To-End Supercomputer Cluster Solutions Manufacturer
Moving Forward…


Company Overview :: Appro Celebrates 20 Years of HPC Success…

Appro on Top 500

• Over 2 PFLOPs (peak) from just five Top100 systems added to the Top500 in November
• Variety of technologies:
− Intel, AMD, NVIDIA
− Multiple server form factors
− InfiniBand and GigE
− Fat Tree and 3D Torus

• Excellent Linpack efficiency on non-optimized Sandy Bridge (SB) systems
− 85.5% Fat Tree
− 83%–85% 3D Torus

Appro Milestones :: Installations in 2012

Site Peak Performance

Los Alamos (LANL) > 1.8 PFLOPs

Sandia (SNL) > 1.2 PFLOPs

Livermore (LLNL) > 1.5 PFLOPs

Japan (Tsukuba, Kyoto) > 1 PFLOPs

About University of Tsukuba :: HA-PACS Project

• HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences)
 Apr. 2011 – Mar. 2014, 3-year project
 Project Leader: Prof. M. Sato (Director, CCS, Univ. of Tsukuba)
• Develop next generation GPU system: 15 members
 Project Office for Exascale Computing System Development (Leader: Prof. T. Boku)
 GPU cluster based on the Tightly Coupled Accelerators architecture
• Develop large scale GPU applications: 15 members
 Project Office for Exascale Computational Sciences (Leader: Prof. M. Umemura)
 Elementary Particle Physics, Astrophysics, Bioscience, Nuclear/Quantum Physics, Global Environmental Science, High Performance Computing

University of Tsukuba HA-PACS Project :: Problem Definition

• Many technology discussions to determine the KEY requirements:

 Fixed budget

 High Availability

 Latest Processor / High Flops

 1:2 CPU:Accelerator Ratio

 High Bandwidth to the Accelerator

 High bandwidth, low latency interconnect

 Applications could take advantage of “more than QDR IB”

 High IO Bandwidth to storage

 “Easy to Manage”

Solution Keys

Fixed Budget Considerations
 Need to find a balance between:
. Performance – FLOPS, bandwidth (memory, IO)
. Capacity (CPU qty, GPU qty, memory per core, IO, storage)
. Availability features
. Ease of management / supportability

Architecture needed: High Availability
 Nodes (PS, fans)
 IPC networks (e.g. InfiniBand)
 Service networks (provisioning and management)

Meeting Key Requirements

. Challenge: Create a solution with High Availability
− Redundant power supplies
− Redundant hot-swap fan trays
− Redundant hot-swap disk drives
− Redundant networks

. Solution: Appro Xtreme-X™ Supercomputer, the flagship product line using the GreenBlade™ sub-rack component used for the DoE TLCC2 project, expanded to add support for new custom blade nodes

Solution Architecture :: Appro Xtreme-X™ Supercomputer

Unified scalable cluster architecture that can be provisioned and managed as a stand-alone supercomputer.

Improved power & cooling efficiency to dramatically lower total cost of ownership

Offers high performance and high availability features with lower latency and higher bandwidth.

Appro HPC Software Stack - Complete HPC Cluster Software tools combined with the Appro Cluster Engine™ (ACE) Management Software including the following capabilities:

System Management, Network Management, Server Management, Cluster Management, Storage Management

Meeting Key Requirements :: Optimal Performance

Peak Performance
. CPU contribution
  Sandy Bridge-EP 2.6 GHz E5-2670 processors (332.8 GFLOPS per node)
. GPU contribution
  665 GFLOPS per NVIDIA M2090
  Four (4) M2090s per node, or 2.66 TFLOPS per node
. Combined peak performance is ~3 TFLOPS per node
. Two hundred and sixty-eight (268) nodes provide 802 TFLOPS
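For reference, these peak figures are simple products of the quantities quoted in this deck (the 8 flop/clock AVX rate appears later in the node diagram):

$$P_{\text{CPU/node}} = 2\ \text{sockets} \times 8\ \text{cores} \times 2.6\,\text{GHz} \times 8\,\tfrac{\text{flop}}{\text{clock}} = 332.8\ \text{GFLOPS}$$
$$P_{\text{GPU/node}} = 4 \times 665\ \text{GFLOPS} = 2660\ \text{GFLOPS}$$
$$P_{\text{node}} \approx 2.99\ \text{TFLOPS}, \qquad P_{\text{system}} = 268 \times P_{\text{node}} \approx 802\ \text{TFLOPS}$$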

Accelerator Performance
. DEDICATED PCI-e Gen3 x16 for each NVIDIA GPU
. The GPUs use Gen2, so up to 8 GB/s is available per GPU

IO Performance
. 2 x QDR (Mellanox CX3) – up to 4 GB/s per link (on a PCI-e Gen3 x8 bus)
. GigE for operations networks

Appro GreenBlade™ Sub-Rack with Accelerator Expansion Blades

Up to 4x 2P GB812X blades
− Expandability for HDD, SSD, GPU, MIC
Six cooling fan units
− Hot swappable & redundant
Up to six 1600W power supplies
− Platinum-rated; 95%+ efficient
− Hot swappable & redundant
Supports one or two (redundant) iSCB platform manager modules with enhanced management capabilities
− Active & dynamic fan control
− Power monitoring
− Remote power control
− Integrated console server

Appro GreenBlade™ Subrack

iSCB Modules
• Server board
− Increased memory footprint (2 DPC)
− Provides access to two (2) PCI-e Gen3 x16 PER SOCKET
• Provides for increased IO capability
− QDR or FDR InfiniBand on the motherboard
− Internal RAID adapter on the Gen3 bus
• Up to two (2) 2.5” hard drives
NOTE: Can run diskless/stateless because of the Appro Cluster Engine, but local scratch was needed

Meeting Key Requirements :: Server Node Design

Challenge:

Create a server node with:
− Latest generation of processors: need for FLOPS AND IO capacity
− HIGH bandwidth to the accelerators
− High memory capacity

Solution: High-bandwidth Intel Sandy Bridge-EP for the CPU and NVIDIA Tesla for the GPU

Working with Intel® EPSD early on to design a motherboard
− Washington Pass (S2600WP) motherboard with:

 Dual Sandy Bridge-EP (Xeon E5-2600 series) sockets
 Exposes four (4) PCI-e Gen3 x16 links for accelerator connectivity
 Exposes one (1) PCI-e Gen3 x8 for an expansion slot/IO
 Two (2) DIMMs per channel (16 DIMMs total)
− 2U form factor for fit and air flow/cooling

Meeting Key Requirements :: Intel® EPSD S2600WP Motherboard

(S2600WP block diagram)
− Two Sandy Bridge-EP sockets connected by QPI; Patsburg PCH (DMI/ESI), BIOS, BMC, dual GbE
− 4 DDR3 channels per socket at 1,600 MHz (51.2 GB/s per socket)
− PCI-e Gen3 x16 links to 4x NVIDIA M2090 GPUs
− PCI-e Gen3 x8 links for 2x QDR InfiniBand and expansion

GreenBlade Node Design

(GreenBlade node design diagram)
− 2x local hard drives (HDD0, HDD1)
− GigE – cluster management / operations network (primary and secondary)
− QDR InfiniBand (Port 0 and Port 1)

Meeting Key Requirements :: Network Availability

Challenge: To provide cost-effective redundant networks to eliminate/reduce failures (improve MTTI)

Solution
− Build the system with redundant operations Ethernet networks
  Redundant on-board GigE, each with access to IPMI
  Redundant iSCB modules for baseboard management, node control and monitoring
− Build the system with redundant InfiniBand networks
  DUAL QDR for price/performance
  Selected Mellanox due to Gen3 x8 support (dual-port adapter)

Meeting Key Requirements :: Operations Networking

(Operations network diagram)
− Management node(s) and login node(s) on the external network (GbE / 10GigE)
− 10GigE switch connecting sub-management nodes (GreenBlade™ GB812X)
− 48-port leaf switches serving compute node racks: Rack (1) … Rack (N)

Meeting Key Requirements :: Ease of Use

Challenge
• Need the system to install quickly to get into production
• Most sites have limited “people resources”
• Need to be able to keep the system running and doing science

Solution
• Appro HPC Software Stack
− Tested and validated
− Full stack from the HW layer to the application layer
− Allows for quick bring-up of a cluster

Appro HPC Software Stack

User Applications

Performance Monitoring: HPCC, Perfctr, IOR, PAPI/IPM, netperf
Compilers: Intel® Cluster Studio, PGI (PGI CDK), GNU, PathScale
Message Passing: MVAPICH2, OpenMPI, Intel® MPI (Intel Cluster Studio)
Job Scheduling: Grid Engine, SLURM, PBS Pro
Local FS: ext3, ext4, XFS
Storage: NFS (3.x), PanFS, Lustre
Cluster Monitoring: ACE™ (iSCB and OpenIPMI)
Remote Power Mgmt: ACE™, PowerMan
Console Mgmt: ACE™, ConMan

Appro HPC Software Stack
Provisioning: Appro Cluster Engine (ACE™) Virtual Clusters
OS: Red Hat, CentOS, SuSE

Appro Xtreme-X™ Supercomputer – Building Blocks

Appro Turn-Key Integration & Delivery Services – HW and SW integration, pre-acceptance testing, dismantling, packing and shipping

Appro HPC Professional Services - On-site Installation services and/or Customized services

Appro Key Advantages :: Summary

• Partnering with key technology partners to offer cutting-edge integrated solutions:
− Performance
  Storage (IOR)
  Networking bandwidth, latencies and message rates
− Features
  High Availability (high standard MTBF, redundancy – PS)
  Ease of Management
− Flexibility
− Price/Performance
− Training Programs
  Pre-Sales (sell everything it does and ONLY that)
  Installation and Tuning
  Post-Install Support

Appro Xtreme-X™ Supercomputer :: Turn-Key Solution Summary

Appro HPC Software Stack

Appro Cluster Engine™ (ACE) Management Software Suite

Appro Xtreme-X™ Supercomputer addressing 4 HPC workload configurations

Data Intensive Computing, Capacity Computing, Hybrid Computing, Capability Computing

Turn-Key Integration & Delivery Services – node, rack, switch, interconnect, cable, network, storage, software, burn-in; pre-acceptance testing, performance validation, dismantling, packing and shipping
Appro HPC Professional Services – on-site installation services and/or customized services

Questions? Ask now or see us at Table #54

Appro Supercomputer Solutions

Steve Lyness, VP HPC Solutions Engineering
Learn more at www.appro.com

HA-PACS: Next Step for Scientific Frontier by Accelerated Computing

Taisuke Boku
Center for Computational Sciences, University of Tsukuba
[email protected]

Project plan of HA-PACS

HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences)
Accelerating critical problems in various scientific fields at the Center for Computational Sciences, University of Tsukuba
− The target application fields will be partially limited
− Current targets: QCD, Astro, QM/MM (quantum mechanics / molecular mechanics, for life science)
Two parts
− HA-PACS base cluster:
  . for development of GPU-accelerated code for the target fields, and performing production runs of them
− HA-PACS/TCA (TCA = Tightly Coupled Accelerators):
  . for elementary research on new technology for accelerated computing
  . our original communication system based on PCI Express, named “PEARL”, and a prototype communication chip named “PEACH2”

GPU Computing: current trend of HPC
GPU clusters in the TOP500 of Nov. 2011
− 2nd: 天河 Tianhe-1A (Rpeak = 4.70 PFLOPS)
− 4th: 星雲 Nebulae (Rpeak = 2.98 PFLOPS)
− 5th: TSUBAME2.0 (Rpeak = 2.29 PFLOPS)
− (1st: K Computer, Rpeak = 11.28 PFLOPS)
Features
− high peak performance / cost ratio
− high peak performance / power ratio
− large-scale applications with GPU acceleration do not yet run in production on GPU clusters

⇒ Our first target is to develop large-scale applications accelerated by GPUs in real computational sciences

Problems of GPU Clusters

Problems of GPGPU for HPC
− Data I/O performance limitation
  . Ex) GPGPU: PCIe gen2 x16; peak performance: 8 GB/s (I/O) ⇔ 665 GFLOPS (NVIDIA M2090)
− Memory size limitation
  . Ex) M2090: 6 GByte vs. CPU: 4–128 GByte
− Communication between accelerators: no direct (external) path ⇒ communication latency via the CPU becomes large
  . Ex) GPGPU: GPU mem ⇒ CPU mem ⇒ (MPI) ⇒ CPU mem ⇒ GPU mem (sketched below)

Our other target: developing a direct communication system between external GPUs, as a feasibility study for future accelerated computing
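To make the three-hop path above concrete, here is a minimal MPI + CUDA sketch (illustrative only, not HA-PACS code) of the conventional staged transfer between two ranks, assumed to run on different nodes with one GPU each; the buffer size and names are arbitrary:

```cuda
// Illustrative sketch only (not HA-PACS code): the conventional staged path
// GPU mem -> CPU mem -> (MPI) -> CPU mem -> GPU mem between two ranks.
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaSetDevice(0);                          // each rank uses its local GPU 0

    const int n = 1 << 20;                     // 1M doubles per message (arbitrary)
    double* d_buf = nullptr;
    cudaMalloc((void**)&d_buf, n * sizeof(double));
    std::vector<double> h_buf(n);              // host staging buffer

    if (rank == 0) {
        // Hop 1: device -> host staging buffer
        cudaMemcpy(h_buf.data(), d_buf, n * sizeof(double), cudaMemcpyDeviceToHost);
        // Hop 2: host -> remote host over the interconnect
        MPI_Send(h_buf.data(), n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(h_buf.data(), n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        // Hop 3: host staging buffer -> device
        cudaMemcpy(d_buf, h_buf.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```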

⇒ Research on direct communication between GPUs is required

Project Formation

. HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences)
 Apr. 2011 – Mar. 2014, 3-year project
 Project Leader: Prof. M. Sato (Director, CCS, Univ. of Tsukuba)
. Develop next generation GPU system: 15 members
 Project Office for Exascale Computing System Development (Leader: Prof. T. Boku)
 GPU cluster based on the Tightly Coupled Accelerators architecture
. Develop large scale GPU applications: 15 members
 Project Office for Exascale Computational Sciences (Leader: Prof. M. Umemura)
 Elementary Particle Physics, Astrophysics, Bioscience, Nuclear/Quantum Physics, Global Environmental Science, High Performance Computing

HA-PACS base cluster (Feb. 2012)

HA-PACS base cluster

(Photos: front view and side view)

HA-PACS base cluster

Front view of 3 blade chassis

Rear view of the InfiniBand switch and cables (yellow = fibre, black = copper); rear view of one blade chassis with 4 blades

HA-PACS: base cluster (computation node)

(Computation node block diagram)
− CPU memory: (16 GB, 12.8 GB/s) x 8 = 128 GB, 102.4 GB/s
− CPU: AVX, 2.6 GHz x 8 flop/clock = 20.8 GFLOPS/core x 16 cores = 332.8 GFLOPS
− GPU: 665 GFLOPS x 4 = 2,660 GFLOPS; GPU memory: (6 GB, 177 GB/s) x 4 = 24 GB, 708 GB/s
− PCIe: 8 GB/s per GPU
− Total: approx. 3 TFLOPS per node

HA-PACS: base cluster unit (CPU)

Intel E5 (SandyBridge-EP) x 2
− 8 cores/socket (16 cores/node) at 2.6 GHz
− AVX (256-bit SIMD) on each core
  ⇒ peak perf./socket = 2.6 GHz x 8 flop/clock x 8 cores = 166.4 GFLOPS
  ⇒ peak perf./node = 332.8 GFLOPS
− Each socket supports up to 40 lanes of PCIe gen3
  ⇒ great capacity to connect multiple GPUs without an I/O performance bottleneck
  ⇒ the current NVIDIA M2090 supports only PCIe gen2, but the next generation (Kepler) will support PCIe gen3
− M2090 x4 can be connected to 2 SandyBridge-EP sockets with PCIe gen3 x8 x2 still remaining ⇒ InfiniBand QDR x 2

HA-PACS: base cluster unit (GPU)

NVIDIA M2090 x 4
− Number of processor cores: 512
− Processor core clock: 1.3 GHz
− DP 665 GFLOPS, SP 1331 GFLOPS
− PCI Express gen2 x16 system interface
− Board power dissipation: <= 225 W
− Memory clock: 1.85 GHz; size: 6 GB with ECC; bandwidth: 177 GB/s
− Shared/L1 cache: 64 KB, L2 cache: 768 KB

HA-PACS: base cluster unit (blade node)

(Photo annotations: 2x 2.6 GHz 8-core SandyBridge-EP, 2x NVIDIA Tesla M2090, 1x PCIe slot for HCA, air flow)

(Photo annotations, front and rear views: 2x 2.5” HDD, 2x NVIDIA Tesla M2090, power supply units and fans)
− 8U enclosure, 4 nodes
− 3 PSUs (hot swappable)
− 6 fans (hot swappable)

Basic performance data

MPI pingpong

− 6.4 GB/s (N1/2 = 8 KB)
− with dual-rail InfiniBand QDR (Mellanox ConnectX-3)
− actually FDR for the HCA and QDR for the switch

PCIe benchmark (Device -> Host memory copy), aggregated performance for 4 GPUs simultaneously

− 24 GB/s (N1/2 = 20 KB)
− PCIe gen2 x16 x 4, theoretical peak = 8 GB/s x 4 = 32 GB/s

Stream (memory)
− 74.6 GB/s
− theoretical peak = 102.4 GB/s
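For illustration (this is not the benchmark actually used for the numbers above), a per-GPU device-to-host copy bandwidth test can be sketched with pinned memory and CUDA event timing; running one such measurement per GPU concurrently and summing gives the aggregated figure:

```cuda
// Minimal sketch of a device->host copy bandwidth test for one GPU, using
// pinned host memory and CUDA event timing (64 MB transfer size assumed).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 64 << 20;
    float *d_buf = nullptr, *h_buf = nullptr;
    cudaMalloc((void**)&d_buf, bytes);
    cudaMallocHost((void**)&h_buf, bytes);      // pinned memory for full PCIe rate

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int iters = 100;
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("Device->Host: %.2f GB/s\n",
                (double)bytes * iters / (ms * 1e-3) / 1e9);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```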

PCIe Host:Device communication performance

(Plot) Slower start on Host->Device compared with Device->Host

HA-PACS Application (1): Elementary Particle Physics

Multi-scale physics
− Investigate hierarchical properties via direct construction of nuclei in lattice QCD (quark → proton/neutron → nucleus)
− GPU to perform matrix-matrix products of dense matrices

Finite temperature and density
− Phase analysis of QCD at finite temperature and density (expected QCD phase diagram)
− GPU to solve large sparse linear systems of equations
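As a generic illustration of the dense matrix-matrix offload mentioned above (not the actual lattice QCD code; the matrix size and values are placeholders), cuBLAS can be used roughly as follows:

```cuda
// Illustrative sketch: offloading a dense matrix-matrix product C = A*B
// to the GPU with cuBLAS (not the actual lattice QCD solver code).
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int n = 1024;                       // assumed square matrices for brevity
    const size_t bytes = (size_t)n * n * sizeof(double);
    std::vector<double> hA(n * n, 1.0), hB(n * n, 2.0), hC(n * n, 0.0);

    double *dA, *dB, *dC;
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const double alpha = 1.0, beta = 0.0;
    // cuBLAS assumes column-major storage; with lda = n this computes
    // C = alpha*A*B + beta*C in double precision on the GPU.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```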

HA-PACS Applications (2): Astrophysics

(A) Collisional N-body Simulation
Globular Clusters
• Formation of the most primordial objects, formed more than 10 giga-years ago
• Fossil objects as a clue to investigate the primordial universe
Massive Black Holes in Galaxies
• Understanding of the formation of massive black holes in galaxies
• Numerical simulations of complicated gravitational interactions between stars and multiple black holes in galaxy centers
 Direct (brute-force) calculations of accelerations and jerks are required to achieve the required numerical accuracy
 Computations of the accelerations of particles and their time derivatives (jerks) are time consuming
 Accelerations and jerks are computed on the GPU

(B) Radiation Transfer
First Stars and Re-ionization of the Universe
• Understanding of the formation of the first stars in the universe and the subsequent re-ionization of the universe
Accretion Disks around Black Holes
• Study of the high-temperature regions around black holes
 Calculation of the physical effects of photons emitted by stars and galaxies onto the surrounding matter
 So far poorly investigated due to its huge computational cost, though it is of critical importance in the formation of stars and galaxies
 Computations of the radiation intensity and the resulting chemical reactions based on ray-tracing methods can be highly accelerated with GPUs owing to their high concurrency
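A minimal, unoptimized sketch of the direct-summation acceleration and jerk evaluation described above is shown below (this is not the HA-PACS production kernel; the names, the softening parameter eps2, the mass stored in pos[i].w and the one-thread-per-particle layout are illustrative assumptions):

```cuda
// Illustrative sketch of direct (brute-force) acceleration + jerk evaluation
// for a Hermite-type N-body integrator. One thread per particle i.
//   a_i = sum_j m_j r_ij / |r_ij|^3
//   j_i = sum_j m_j ( v_ij / |r_ij|^3 - 3 (r_ij . v_ij) r_ij / |r_ij|^5 )
#include <cuda_runtime.h>

__global__ void acc_jerk_kernel(const double4* pos,   // xyz = position, w = mass
                                const double3* vel,
                                double3* acc, double3* jerk,
                                int n, double eps2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    double3 ai = {0.0, 0.0, 0.0};
    double3 ji = {0.0, 0.0, 0.0};

    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        double dx = pos[j].x - pos[i].x, dy = pos[j].y - pos[i].y, dz = pos[j].z - pos[i].z;
        double dvx = vel[j].x - vel[i].x, dvy = vel[j].y - vel[i].y, dvz = vel[j].z - vel[i].z;

        double r2    = dx*dx + dy*dy + dz*dz + eps2;   // softened distance^2
        double rinv  = rsqrt(r2);
        double r3inv = rinv * rinv * rinv;
        double rv    = dx*dvx + dy*dvy + dz*dvz;       // r_ij . v_ij
        double m     = pos[j].w;

        ai.x += m * dx * r3inv;
        ai.y += m * dy * r3inv;
        ai.z += m * dz * r3inv;

        double c = 3.0 * rv * r3inv / r2;              // 3 (r.v) / r^5
        ji.x += m * (dvx * r3inv - c * dx);
        ji.y += m * (dvy * r3inv - c * dy);
        ji.z += m * (dvz * r3inv - c * dz);
    }
    acc[i] = ai;
    jerk[i] = ji;
}

int main() {
    const int n = 1024;                                // assumed particle count
    const double eps2 = 1e-4;                          // assumed softening parameter
    double4* d_pos;  double3 *d_vel, *d_acc, *d_jerk;
    cudaMalloc((void**)&d_pos,  n * sizeof(double4));
    cudaMalloc((void**)&d_vel,  n * sizeof(double3));
    cudaMalloc((void**)&d_acc,  n * sizeof(double3));
    cudaMalloc((void**)&d_jerk, n * sizeof(double3));
    // (initial positions and velocities would be copied to the device here)
    acc_jerk_kernel<<<(n + 255) / 256, 256>>>(d_pos, d_vel, d_acc, d_jerk, n, eps2);
    cudaDeviceSynchronize();
    cudaFree(d_pos); cudaFree(d_vel); cudaFree(d_acc); cudaFree(d_jerk);
    return 0;
}
```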

HA-PACS Application (3): Bioscience

GPU acceleration

− Direct Coulomb (macroscale MD: Gromacs, NAMD, Amber)
− Two-electron integrals (QM region, > 100 atoms)
(Figure: DNA-protein complex)

Reaction mechanisms (QM/MM-MD)

HA-PACS Application (4)

Other advanced research in the HPC Division of CCS
− XcalableMP-dev (XMP-dev): an easy and simple programming language to support distributed-memory & GPU-accelerated computing for large-scale computational sciences
− G8 NuFuSE (Nuclear Fusion Simulation for Exascale) project: platform for porting plasma simulation code with GPU technology
− Climate simulation, especially LES (Large Eddy Simulation) for cloud-level resolution on city-model-size simulations
− Any other collaboration ...

HA-PACS: TCA (Tightly Coupled Accelerator)

TCA: Tightly Coupled Accelerator
− Direct connection between accelerators (GPUs)
− Using PCIe as a communication device between accelerators
  . Most acceleration devices and other I/O devices are connected by PCIe as PCIe end-points (slave devices)
  . An intelligent PCIe device logically enables an end-point device to communicate directly with other end-point devices
PEARL: PCI Express Adaptive and Reliable Link
− We already developed such a PCIe device (PEACH, PCI Express Adaptive Communication Hub) in the JST-CREST project “low power and dependable network for embedded systems”
− It enables direct connection between nodes by a PCIe Gen2 x4 link
⇒ Improving PEACH for HPC to realize TCA

PEACH

PEACH: PCI-Express Adaptive Communication Hub
− An intelligent PCI-Express communication switch that uses the PCIe link directly for node-to-node interconnection
− The edge of a PEACH PCIe link can be connected to any peripheral device, including GPUs
Prototype PEACH chip
− 4-port PCI-E gen.2 with x4 lanes / port
− PCI-E link edge control feature: “root complex” and “end point” are automatically switched (flipped) according to the connection handling
− Other fault-tolerant (reliability) functions are implemented: “flip network link” to tolerate a single link fault
In HA-PACS/TCA prototype development, we will enhance the current PEACH chip ⇒ PEACH2

HA-PACS/TCA (Tightly Coupled Accelerator)
 True GPU-direct
  − current GPU clusters require 3-hop communication (3–5 memory copies)
  − for strong scaling, an inter-GPU direct communication protocol is needed for lower latency and higher throughput
 Enhanced version of PEACH ⇒ PEACH2
  − x4 lanes -> x8 lanes
  − hardwired on the main data path and PCIe interface fabric

(Diagram: two nodes, each with CPUs, GPUs, memory, PEACH2 and an IB HCA on PCIe; PEACH2 links the GPUs directly across nodes, while the IB switch connects the HCAs)

Implementation of PEACH2: ASIC ⇒ FPGA

FPGA-based implementation
− today’s advanced FPGAs allow implementing a PCIe hub with multiple ports
− currently gen2 x8 lanes x 4 ports are available ⇒ gen3 will be available soon (?)
− easy modification and enhancement
− fits on a standard (full-size) PCIe board
− an internal multi-core general-purpose CPU with programmability is available ⇒ easy to split the hardwired/firmware partitioning at a certain level of the control layer

Controlling PEACH2 for the GPU communication protocol
− collaboration with NVIDIA for information sharing and discussion
− based on the CUDA 4.0 device-to-device direct memory copy protocol (see the sketch below)

HA-PACS/TCA
(Node cluster diagram)
− Node Cluster (NC): 16 nodes (4 GPUs + 2 CPUs each, i.e. GPU x64, CPU x32 per NC) connected by the PEARL ring network through PEACH2
− High-speed GPU-GPU communication by PEACH2 within an NC (PCI-E gen2 x8 = 5 GB/s/link)
− InfiniBand QDR (x2) for NC-to-NC communication (4 GB/s/link) and connection to the base cluster; one IB link per node
− 4 NCs with 16 nodes, or 8 NCs with 8 nodes with extension = 360 TFLOPS
− CPU: Xeon E5; GPU: Kepler
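For context, the intra-node building block referenced above (CUDA 4.x device-to-device direct copy) can be exercised with the peer-to-peer API roughly as follows; this is an illustrative sketch only, assuming two GPUs (0 and 1) in one node, and it does not show PEACH2's inter-node extension:

```cuda
// Illustrative sketch: intra-node GPU-to-GPU direct copy via the CUDA
// peer-to-peer API (the building block referenced above; PEACH2 extends the
// idea to inter-node PCIe links and is not shown here).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1);   // can GPU 0 reach GPU 1?
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0);
    if (!canAccess01 || !canAccess10) {
        std::printf("Peer access between GPU 0 and GPU 1 not available\n");
        return 0;
    }

    const size_t bytes = 64 << 20;                 // 64 MB test buffer
    float *buf0 = nullptr, *buf1 = nullptr;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);              // allow GPU 0 <-> GPU 1 traffic
    cudaMalloc((void**)&buf0, bytes);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc((void**)&buf1, bytes);

    // Direct device-to-device copy over PCIe, no host staging buffer involved.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}
```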

PEARL/PEACH2 variation (1)

 Option 1:

 Performance of IB and PEARL can be compared on an even basis

 Additional latency by PCIe switch

(Block diagram, Option 1: two CPUs linked by QPI; PCIe switches on Gen3 x16 feeding 4 GPUs; IB HCA on Gen3 x8; PEACH2 on Gen2 x8)

PEARL/PEACH2 variation (2)

Option 2:
− Requires only 72 lanes in total
− Asymmetric connection among 3 blocks of GPUs

(Block diagram, Option 2: two CPUs linked by QPI; PCIe switches on Gen3 x16/x8 connecting 4 GPUs, the IB HCA and PEACH2)

PEACH2 prototype board for TCA

(Photo annotations: FPGA daughter board (Altera Stratix IV GX530), power regulators, PCIe external link connector x2 (one more on the daughter board), PCIe edge connector to the host server)

Summary

 HA-PACS consists of two elements: the HA-PACS base cluster for application development, and HA-PACS/TCA for elementary study of advanced technology for direct communication among accelerating devices (GPUs)
 The HA-PACS base cluster started operation in Feb. 2012 with 802 TFLOPS peak performance (Linpack results will come in June 2012; we also expect a good score on Green500)
 The FPGA implementation of PEACH2 was finished for the prototype version in Mar. 2012 and will be enhanced into the final version in the following 6 months
 HA-PACS/TCA, with at least 300 TFLOPS of additional performance, will be installed around Mar. 2013