Moore’s Law

Processor Technology

[Figure: POWER8 price/performance stack - firmware / OS, accelerator software, OpenStack, storage, network, ...]

POWER8:
• Up to 12 cores, up to 96 threads
• L1, L2, L3 + L4 caches
• Up to 1 TB per socket (DRAM chips attached through memory buffers)
• Up to 230 GB/s sustained memory bandwidth
https://www.ibm.com/blogs/systems/power-systems-openpower-enable-acceleration/

[Diagram: each POWER8 CPU links to system memory at 115 GB/s and to a pair of Tesla P100 GPUs, each with its own GPU memory, over NVLink at 80 GB/s.]
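The headline POWER8 figures above reduce to a little arithmetic. A quick sketch, using only the slide's numbers; the derived values are plain division, not measurements:

```python
# Arithmetic on the POWER8 figures quoted above (12 cores, 96 threads,
# 230 GB/s sustained bandwidth, 1 TB per socket).
cores = 12
threads = 96
bandwidth_gbs = 230   # sustained memory bandwidth, GB/s
capacity_gb = 1024    # 1 TB per socket

smt = threads // cores                      # 8 hardware threads per core
bw_per_core = bandwidth_gbs / cores         # ~19.2 GB/s per core
full_scan_s = capacity_gb / bandwidth_gbs   # ~4.5 s to stream a full socket

print(f"SMT{smt}: {smt} threads per core")
print(f"{bw_per_core:.1f} GB/s of memory bandwidth per core")
print(f"~{full_scan_s:.1f} s to stream 1 TB at {bandwidth_gbs} GB/s")
```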


[Diagram: on an x86 system the GPU attaches to the CPU over PCIe at 16 GB/s, the system bottleneck between graphics memory and system memory.]
IBM advantage: data communication and GPU performance

Training iteration time:
• POWER8 + Tesla P100 + NVLink: 78 ms
• x86-based GPU system: 170 ms
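The per-iteration times quoted above (78 ms vs. 170 ms, AlexNet with a 128-image minibatch) imply the following speedup and throughput; this is simple arithmetic on the slide's numbers, not an independent measurement:

```python
# Derive speedup and images/s from the AlexNet iteration times above.
batch = 128          # minibatch size from the benchmark
t_power8 = 0.078     # s per iteration, POWER8 + P100 + NVLink
t_x86 = 0.170        # s per iteration, x86-based GPU system

speedup = t_x86 / t_power8       # ~2.18x
thr_power8 = batch / t_power8    # ~1641 images/s
thr_x86 = batch / t_x86          # ~753 images/s

print(f"speedup: {speedup:.2f}x")
print(f"throughput: {thr_power8:.0f} vs {thr_x86:.0f} images/s")
```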

(ImageNet / AlexNet, minibatch size = 128)

Coherent Accelerator Processor Interface (CAPI)

[Diagram: the CAPP unit in the POWER8 processor attaches an FPGA over PCIe as a coherent peer - applicable to FPGAs, networking, memory, and other accelerators.]

Typical I/O Model Flow

1. DD (device driver) call
2. Copy or pin source data
3. MMIO notify accelerator
4. Acceleration
5. Poll / interrupt for completion
6. Copy or unpin result data
7. Return from DD (completion)

Flow with a Coherent Model
1. Notify accelerator via shared memory
2. Acceleration, operating directly on shared memory
3. Completion via shared memory

Roadmap focus: enterprise scale-up, technology and performance driven; scale-out and enterprise future, cost and acceleration driven.
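The contrast between the two flows can be sketched in code. This is a software-only simulation: the "accelerator" here is an ordinary function, and none of these names belong to a real CAPI or device-driver API.

```python
# Software sketch of the typical I/O flow vs. the coherent (CAPI-style) flow.

def accelerate(buf: bytearray) -> None:
    """Simulated device-side work: uppercase the buffer in place."""
    buf[:] = buf.upper()

def typical_io_model(src: bytearray) -> None:
    """DD call -> copy/pin source -> MMIO notify -> poll -> copy/unpin result."""
    device_buf = bytearray(src)   # copy (or pin) source data into device memory
    accelerate(device_buf)        # MMIO notify + acceleration (poll omitted here)
    src[:] = device_buf           # copy (or unpin) result data; return from DD

def coherent_model(shared: bytearray) -> None:
    """Coherent model: accelerator works directly on shared application memory."""
    accelerate(shared)            # notify; no staging copies, no pinning

a = bytearray(b"hello capi")
b = bytearray(b"hello capi")
typical_io_model(a)
coherent_model(b)
print(a.decode(), b.decode())   # both read HELLO CAPI
```

The point of the sketch: the coherent path drops the two staging copies, which is exactly the overhead the slide's "typical I/O model" steps 2 and 6 describe.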

IBM POWER processor roadmap (partner chips on the POWER8/9 architecture):

• 2007: POWER6, 65 nm, 2 cores, new microarchitecture, new process technology
• 2008: POWER6+, 65 nm, 2 cores, enhanced microarchitecture
• 2010: POWER7, 45 nm, 8 cores, new microarchitecture, new process technology
• 2012: POWER7+, 32 nm, 8 cores, enhanced microarchitecture
• 2014: POWER8, 22 nm, 12 cores, new microarchitecture, new process technology, buffered memory
• 2016: POWER8 with NVLink, 22 nm, 12 cores, enhanced microarchitecture with NVLink
• 2017: POWER9 SO (scale-out), 14 nm, 24 cores, new microarchitecture, direct-attach memory
• 2018-20: POWER9 SU (scale-up), 14 nm, 12 cores, enhanced microarchitecture, buffered memory
• 2020+ (dates TBD): POWER10, 10 nm - 7 nm, new microarchitecture, new foundry and process technology

OpenPOWER: 300+ members, 80+ technologies revealed

The OpenPOWER community accelerates innovation:
• 30 countries, 6 continents, 100s of innovations under way
• Over 2,500 Linux ISVs developing on Power
• 50 IBM Innovation Centers
• Compelling PoCs
• Support for little endian applications

Workloads spanning Big Data & Machine Learning, HPC, Cloud, Mobile, and Enterprise, including: CHARMM, miniDFT, GROMACS, CTH, NAMD, BLAST, AMBER, Bowtie, RTM, BWA, GAMESS, FASTA, WRF, HMMER, HYCOM, GATK, HOMME, SOAP3, LES, STAC-A2, MiniGhost, SHOC, AMG2013, Graph500, OpenFOAM, ILOG

Major Linux distros supported.

Available now: Barreleye G1

In partnership with Avago, IBM, Mellanox, PMC & Samsung.
http://blog.rackspace.com/first-look-zaius-server-platform-google-rackspace-collaboration

The IBM Power Systems LC line: OpenPOWER servers for cloud and cluster deployments that are different by design.

High Performance Computing

S812LC
• 1 socket, 2U
• Storage rich for big data applications
• Big data acceleration with CAPI and GPUs

S822LC
• 2 socket, 2U
• Storage-centric and high-throughput workloads
• Memory intensive workloads

S821LC
• 2 sockets, 1U
• Dense computing
• Memory intensive workloads

S822LC for High Performance Computing
• 2 socket, 2U
• POWER8 with NVLink
• Up to 4 integrated NVIDIA "Pascal" P100 GPUs

POWER9: dual memory subsystems

Scale Out: Direct Attach Memory (8 direct DDR4 ports)
• Up to 120 GB/s of sustained bandwidth
• Low latency access
• Commodity packaging form factor
• Adaptive 64B / 128B reads

Scale Up: Buffered Memory (8 buffered channels)
• Up to 230 GB/s of sustained bandwidth
• Extreme capacity: up to 8 TB per socket
• Superior RAS with chipkill and lane sparing
• Compatible with POWER8 system memory
• Agnostic interface for alternate memory innovations

Power Systems for High Performance Computing (aka "Minsky")

• NVIDIA: Tesla P100 GPU accelerator with NVLink

• Ubuntu by Canonical: launch OS supporting NVLink and Page Migration Engine
• Wistron: platform co-design
• Mellanox: InfiniBand/Ethernet connectivity in and out of server
• HGST: optional NVMe adapters
• Broadcom: optional PCIe adapters
• QLogic: optional Fibre Channel PCIe
• Samsung: 2.5" SSDs
• Hynix, Samsung, Micron: DDR4 memory
• IBM: POWER8 CPU with NVLink

PowerAI: Enterprise Deep Learning Distribution

• Enterprise software distribution: binary package of major deep learning frameworks, with enterprise support
• Tools for ease of development: graphical tools to enhance the data scientist and developer experience
• Faster training times for data scientists: performance optimized for single-node and distributed computing scaling

Deep learning frameworks: Caffe, NVCaffe, IBMCaffe, Torch, TensorFlow, Distributed TensorFlow, Theano, Chainer

Supporting libraries and distributed communications: OpenBLAS, Bazel, NCCL, DIGITS
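OpenBLAS in the list above is the optimized BLAS these frameworks lean on for CPU-side math. A quick, illustrative way to check which BLAS a NumPy build is linked against and roughly what it delivers (this check is generic, not part of PowerAI itself):

```python
# Check numpy's BLAS backend and time a matmul that exercises it.
import time
import numpy as np

np.__config__.show()   # prints the BLAS/LAPACK libraries numpy was built with

n = 1024
a = np.random.rand(n, n)
b = np.random.rand(n, n)

t0 = time.perf_counter()
c = a @ b              # dispatched to the underlying BLAS (dgemm)
dt = time.perf_counter() - t0

gflops = 2 * n**3 / dt / 1e9   # matmul performs ~2*n^3 floating-point ops
print(f"{n}x{n} matmul: {dt * 1e3:.1f} ms, ~{gflops:.1f} GFLOP/s")
```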

Accelerated servers and infrastructure for scaling: cluster of NVLink servers; Spectrum Scale (high-speed parallel file system); scale to cloud

Data Science Experience (DSX)

PowerAI Enterprise (coming): DL developer tools

PowerAI: application dev services; DL frameworks + libraries (TensorFlow, CAFFE, ..); IBM enterprise support

Distributed Computing with Spark & MPI

Cluster of NVLink servers; Spectrum Scale high-speed file system via HDFS APIs; scale to cloud.

[Chart: Training time (minutes), AlexNet with Caffe to top-1 50% accuracy, lower is better: S822LC/HPC with 4 Tesla P100 GPUs is 2.2x faster than 4x Tesla M40 GPUs (x86 with 4x M40 / PCIe vs. POWER8 with 4x Tesla P100 / NVLink).]

[Chart: BVLC Caffe vs. IBM Caffe, VGGNet, time to top-1 50% accuracy, lower is better: S822LC/HPC with 4 Tesla P100 GPUs is 24% faster than 8x Tesla M40 GPUs (x86 with 8x M40 / PCIe vs. POWER8 with 4x Tesla P100 / NVLink).]

IBM S822LC (Minsky): 20 cores, 2.86 GHz, 512 GB memory / 4 NVIDIA Tesla P100 GPUs / Ubuntu 16.04 / CUDA 8.0.44 / cuDNN 5.1 / IBM Caffe 1.0.0-rc3 / ImageNet data
Intel Broadwell E5-2640v4: 20 cores, 2.6 GHz, 512 GB memory / 8 NVIDIA Tesla M40 GPUs / Ubuntu 16.04 / CUDA 8.0.44 / cuDNN 5.1 / BVLC Caffe 1.0.0-rc3 / ImageNet data

[Chart: Images processed per second (TensorFlow, Inception v3): the optimized POWER8 S822LC "Minsky" server is 30% faster than the Intel x86-based E5-2640v4 server.]

IBM S822LC: 20 cores, 2.86 GHz, 512 GB memory / 4 NVIDIA Tesla P100 GPUs / Ubuntu 16.04 / CUDA 8.0.44 / cuDNN 5.1 / TensorFlow 0.12.0 / Inception v3 benchmark (64-image minibatch)
Intel Broadwell E5-2640v4: 20 cores, 2.6 GHz, 512 GB memory / 4 NVIDIA Tesla P100 GPUs / Ubuntu 16.04 / CUDA 8.0.44 / cuDNN 5.1 / TensorFlow 0.12.0 / Inception v3 benchmark (64-image minibatch)

PowerAI: no need to compile from open source
• OS: Ubuntu 16.04
• CUDA 8.0, cuDNN 5.1
• Built with MASS: yes
• OpenBLAS 0.2.19
• Caffe 1.0 rc5 / NVIDIA Caffe 0.14.5 and 0.15.14 / IBM Caffe 1.0 rc3
• Chainer 1.20.1
• NVIDIA DIGITS 5
• Torch 7
• Theano 0.9
• TensorFlow 0.12
• GPU: 4x P100
• Base system: S822LC/HPC
http://ibm.biz/powerai

https://power.jarvice.com/