Welcome to the waitless world IBM POWER8 HPC Review April 12th, 2017
Jamie Syptak, Systems Architect, IBM
Harry Parks HPC|Data Analytics - Solution Sales Welcome to the waitless world Topics • OpenPower … why • Power8 Platform & Chip Design • NVidia Tesla P100 GPU Deep Learning • CAPI & Open CAPI • Spectrum Scale & Components • Power8 Solutions with Elastic Storage Sever • Big Data Analytics - Hadoop • Open Software Solutions and Application Performance
© 2016 IBM Corporation 2 Welcome to the waitless world
POWER – OpenPower …why
© 2016 IBM Corporation 3 Today’s challenges demand innovationWelcome to the waitless world
Full system and stack Data holds competitive value open innovation required
Moore’s 44 zettabytes Law Processor Technology Data Growth
Firmware / OS Accelerators Price/Performance Software Storage You are here Network unstructured data
structured data
2000 2020 2010 2020
© 2016 IBM Corporation © 2016 OpenPOWER Foundation OpenPOWER, a catalyst for Open InnovationWelcome to the waitless world
Market Shifts Moore’s law no longer satisfies Growing workload Numerous IT consumption Mature Open software performance gain demands models ecosystem
OpenPOWER Strategy Vibrant ecosystem through Accelerated innovation through Amplified capabilities driving open development collaboration of partners industry performance leadership
Industry adoption, Open choice Cloud Computing High Performance Hyperscale & Large scale Domestic Computing & Analytics IT Agendas Datacenters OpenPOWER is an open development community, 5 © 2016 IBM Corporation using the POWER Architecture to serve the evolving needs of customers. © 2016 OpenPOWER Foundation This is what a revolution looks likeWelcome to the waitless world
© 2016 IBM Corporation Welcome to the waitless world
POWER – Power8 Platform & Chip Design
© 2016 IBM Corporation 7 Welcome to the waitless world Introducing the LC Portfolio of OpenPOWER servers HPC NEW NEW S822LC For High NEW S822LC For Big Data Performance Computing MTM 8335-GTB S821LC MTM 8001-22C MTM 8001-22C S812LC S822LC MTM 8348-21C MTM 8335-GCA
•Introducing CPU-GPU NVLink, delivering >2.5X the bandwidth •Ideal for storage-centric and to GPUs •Storage rich single socket •2 POWER8 sockets in a 1U high data through-put •Up to 2.2X memory system for big data •POWER8 with NVIDIA NVLink form factor workloads bandwidth of Intel x86 applications •Up to 4 integrated NVIDIA •Ideal for environments •Brings 2 POWER8 sockets systems •Memory Intensive “Pascal” GPUs requiring dense computing for Big Data workloads workloads •Memory Intensive •Big data acceleration with workloads work CAPI and GPUs IBM Power Systems LC Portfolio
Systems designed to take Data Rich and High Performance Computing (HPC) Linux workloads to the next level 8 © 2016 IBM Corporation Current Power Linux Servers
S822LC for High S822LC for Big Data S821LC Performance Computing
New New Tesla P100 K80 or Tesla P100 POWER8 with NVLink K80 or Tesla P100 with NVLink Processor
System Details System Details System Details . 2-socket, 2U . 2-socket, 2U . 2 socket, 1U . . Up to 20 cores (2.86-3.26Ghz) . Up to 20 cores (2.9-3.3Ghz) Up to 20 cores (2.09-2.32Ghz) . 1 TB Memory (32 DIMMs) . 512 GB Memory (16 DIMMs) . 512 GB Memory (16 DIMMs) . 230GB/sec memory bandwidth . 115 GB/sec memory bandwidth . . 2x SFF (HDD/SSD), SATA 115GB/sec memory bandwidth . 4 SFF/LFF (HDD/SSD), 32 TB Storage . Up to 4 integrated NVIDIA Tesla P100 GPUs . 12 SFF/LFF (HDD/SSD) 96 TB storage . 4 PCIe slots, 3 CAPI enabled . 3 PCIe slots, 3 CAPI enabled, IB Add-in . 5 PCIe slots, 4 CAPI enabled . 1 NVIDIA PCIe GPU capable . . Air or water cooled 2 NVIDIA PCIe GPU capable
IBM Systems Welcome to the waitless world IBM Innovates with POWER8: Breakthrough performance for YOUR data 4X 4X 4X Threads per core* Mem. Bandwidth* More cache* Data flow x86
POWER8 x86 POWER8 x86 SMT8 Hyperthread POWER8 + pipe OpenPOWER Parallel Processing x86 pipe POWER8
These design decisions result in best performance for data centric workloads like: Database, NoSQL, Big Data Analytics, OLTP
SMT=Simultaneous Multi-Threading OLTP = On-Line Transaction Processing © 2016 IBM Corporation 10 Power 8 Internal Speeds –Important for Big Data Welcome to the waitless world (Best Version) Processor Comparisons: Haswell POWER8 Comments
SMT / Core (Threads) 2T 8T 4X L1 DCache / Core 32 KB 64 KB 2X L2 Cache / Core 256 KB 512 KB 2X L3 Cache / Processor 16 - 45 MB 80 - 96 MB 2.1X – 6.0X L4 Cache / System None 64 MB – 2 GB Helps Big Data
L2 Cache Bw / Core 116 GB/s 320 GB/s 2.76X Maximum “Sustained” Mem Bw 53 GB/s 224 GB/s 4.2X STAC-A2 Greeks / Core 121/ms 38/ms 3.2X STAC-A2 Assets / Core 3.125 1.667 1.9X STAC-A2 Paths / Core 1.17 0.22 5.3X
Transactional Memory No Yes Helps Scalability PCI-Exp Gen3 / Processor 16x (32 GB/s) 48x (96 GB/s) 3X
Native/Custom Engine Accelerator No CAPI/FPGA Helps Analytics Large “In Memory” Flash No CAPI/Flash Helps Big Data POWER8 Core is ~2X to 3X of a Haswell Core! POWER8 Provides a “New Class” of Capabilities for Attacking Analytics, Big Data, HPC & Cloud Workloads! © 2016 IBM Corporation Return to top Welcome to the waitless world Hot Chips 2017 – Roadmap
© 2016 IBM Corporation 12 Welcome to the waitless world Hot Chips 2016 – POWER9 Processor Common Features
© 2016 IBM Corporation 13 Welcome to the waitless world
POWER – NVidia with NVlink Tesla P100 GPU – Deep Learning Machine Learning
© 2016 IBM Corporation 14 Welcome to the waitless world POWER8+P100+NVLink for increases system bandwidth
• NVLink between CPUs and GPUs System System enables fast memory access to large Memory Memory data sets in system memory 115 GB/s 115 GB/s Power8 Power8 CPU CPU • Two NVLink connections between NVLink NVLink 80 GB/s 80 GB/s each GPU and CPU-GPU leads to P100 P100 P100 P100 faster data exchange GPU GPU GPU GPU
GPU GPU GPU GPU Memory Memory Memory Memory • First to market: volume shipments starting September, 2016
© 2016 IBM Corporation Why it Matters: Use Cases where NVLink will have the most Impact
Stream Data at Same Burst Data at Startup Constant Data Transfers Mask Bus Transfers Rate as Computation and Teardown between adjacent GPUs from Host-Device .
Genomics, Cryptography, Video CFD/CAE, Machine Learning, Molecular Dynamics, Accelerated Databases, Processing, etc. Deep Learning, etc. Amber, etc. Analytics, etc.
IBM Systems Introducing PowerAI: Get Started Fast with Deep Learning
Package of Pre-Compiled Easy to install & get started Optimized for Performance Major Deep Learning with Deep Learning with To Take Advantage of Frameworks Enterprise-Class Support NVLink
Enabled by High Performance Computing Infrastructure
17 Welcome to the waitless world
Simplify Access and Installation Already ported Future focus • Tested, binary builds of common Deep Learning OS Ubuntu 14.04 Ubuntu 16.04 frameworks for ease of implementation CUDA 7.5 8.0 cuDNN 5.1 5.1 • Simple, complete installation process documented Built w/ on IBM OpenPOWER MASS Yes Yes – http://openpowerfoundation.org/blogs/ and search Deep OpenBLAS 0.2.18 Optimize Learning Caffe 1.0 rc3 • Future focus on optimizing specific packages for NVIDIA Caffe 0.14.5 Optimize POWER: NVIDIA Caffe, TensorFlow, and Torch NVIDIA DIGITS 3.2 Torch 7 Optimize Theano 0.8.2 TensorFlow 0.9 Optimize CNTK Nov 2015(*) DL4J 0.5.0(*) • Additional docs… DIY… Chainer https://www.ibm.com/developerworks/community/blogs/home/search?q=%22Deep+Learning+on+ OpenPOWER%22&t=entry&f=all&maxresults=50&sortby=0&order=desc&lang=en GPU 2x K80 4 x P100 © 2016 IBM Corporation Base System 822LC Minsky
Machine Learning / Deep Learning Software Stack
Healthcare Industry Retail Customer Banking Diagnosis / Self-driving Cars ... Service Agents Fraud analysis Solutions Treatment MLDL Building Machine Pattern Vision / Image Knowledge NLP Blocks Learning Recognition processing ... Representation
DL Frameworks ML Libraries Cloud Services Cognitive Distributed Software Libraries CAFFE, Torch, Spark ML, Spark R, Amazon, Microsoft, Platforms Computing Frameworks, Services TensorFlow, … SciKit, Numpy, Google, IBM Cloud Watson Hadoop, Spark, SystemML, Mahout MPI PowerAI Math Libraries: OpenBLAS/MASS, ATLAS, ESSL, LAPACK, cuDNN, cuBLAS, cuSparse, … Accelerated Data Layer NoSQL DBs Streams Hadoop / HDFS Legacy DBs DBs
On-Prem & Cloud Accelerators Unique Power Infrastructure Scale-out servers GPUs, FPGAs, CAPI Technologies Cloud Deployment Flash, DL Accelerators NVLink, CAPI 19 Welcome to the waitless world Improve performance: 2.2X faster training time
AlexNet Trained in Under 1 Hour (57 mins)
© 2016 IBM Corporation Welcome to the waitless world Realize CFD results 40% faster on OpenFOAM on IBM Power System S822LC compared to Xeon E5-2600v3 Systems Iterate faster through improved time to solution
• IBM Power S822LC delivers 1.4x the OpenFOAM performance from a superior processor design Higher Memory bandwidth Increased L3 cache SMT (simultaneous multi-threading)
• Moreover, IBM Power S822LC delivers exceptional throughput on the largest meshes
• Results are based on IBM internal testing of systems running OpenFOAM version 2.3.0 code benchmarked on POWER8 systems.. Individual results will vary depending on individual workloads, configurations and conditions. • IBM Power System S822LC, POWER8; 3.5 GHz, 512 GB memory; 2x 10 core processors/ 4 threads per core • Bull R424-E4, Intel E5-2680v3, 2.5 GHz, 128 GB memory, 2x 12 core processors / 1 thread per core 22 © 2016 IBM Corporation Welcome to the waitless world The POWER8 processor is designed for Big Data and HPC, with up to 2X the performance on VASP • Results based on five VASP relative performance 250% standard test cases 203% – CPU-only 200% 167% Dual socket Intel E5-2698 v3
• Memory bandwidth helps 150% with application 108% IBM POWER S822LC 100% 100% 100% 100% 100% 100% performance 87% Performance ratio 75%
50%
0% Fe Mn2O4 PAW_GGA PAW_PBE PdO2 Test cases
System #MPI #OM • Results are based on IBM & BSC internal testing of systems. Individual results will vary depending on individual workloads, configurations and conditions. • IBM Power System S822LC; 10 cores, POWER8; 2.93 GHz, 256 GB memory P • Intel Xeon data is based on IBM & BSC internal measurements. 16 cores Intel Xeon E5-2698 v3, 2.3GHz, 256 GB Dual socket Intel 32 1 IBM Power System S822LC 20 1 IBM Confidential © 2016 IBM Corporation 23 Welcome to the waitless world
POWER – CAPI & Open CAPI
© 2016 IBM Corporation 24 Welcome to the waitless world Power 8 CAPI – Coherent Accelerator Processor Interface
• Virtual Addressing – Accelerator can work with same memory addresses POWER8 that the processors use
Coherence Bus • Hardware Managed Cache Coherence
– Enables the accelerator to participate in “Locks” as a CAPP normal thread Lowers Latency over IO communication model PCIe Gen 3 Transport for encapsulated messages _ Processor Service Customizable Hardware Layer (PSL) Application Accelerator • Present robust, • Specific system SW, PSL durable interfaces to middleware, or user application applications • Written to durable interface FPGA or ASIC • Offload complexity / provided by PSL content from CAPP
© 2016 IBM Corporation 25 What was done before CAPI?
Prior to CAPI, an application called a device driver to utilize an FPGA Accelerator. The device driver performed a memory mapping operation. CAPI Device Driver Virt Addr Storage Area Coherency
Variables Variables Vs. IO Model Input Input Data Data
Output Memory Subsystem Data
3 versions of the data (not coherent). 1000s of instructions in the device driver. FPGA PCIE Output Data
VariablesInput Data
POWER8 POWER8 POWER8 POWER8 POWER8 POWER8 Core Core Core Core Core Core App DD
© 2016 IBM Corporation CAPI Coherency
With CAPI, the FPGA shares memory with the cores
CAPI Virt Addr Coherency Vs. IO Model
Memory Subsystem 1 coherent version of the data. No device driver call/instructions.
PSL FPGA PCIE Output Data
Input Variables Data
POWER8 POWER8 POWER8 POWER8 POWER8 POWER8 Core Core Core Core Core Core App
© 2016 IBM Corporation CAPI vs. I/O Device Driver: Data Prep
Typical I/O Model Flow: Total ~13µs for data prep
Copy or Pin MMIO Notify Poll / Interrupt Copy or Unpin Ret. From DD DD Call Acceleration Source Data Accelerator Completion Result Data Completion
1,000 Instructions 300 Instructions 10,000 Instructions Application 3,000 Instructions Dependent, but 1,000 Instructions 7.9µs Equal to below 4.9µs
Flow with a Coherent Model: Total 0.36µs
Shared Mem. Shared Memory Acceleration Notify Accelerator Completion
400 Instructions Application 100 Instructions Dependent, but 0.3µs Equal to above 0.06µs
© 2016 IBM Corporation Welcome to the waitless world Storage Access Acceleration Demonstrating the Value of CAPI Attachment 120000 108,438 500 Identical hardware with 2 466 different paths to data 450 100000 FlashSystem 400
350 80000
300
Conventional 60000 250 199 PCIe I/O CAPI 200 40000 150 16,150 100 20000 50
0 0 PCIe I/O CAPI PCIe I/O CAPI
IOPs per HW Thread Latency (us) Power S822
© 2016 IBM Corporation Welcome to the waitless world
© 2016 IBM Corporation Welcome to the waitless world
POWER – Spectrum Scale & Components
© 2016 IBM Corporation 33 IBM Software Defined Infrastructure Welcome to the waitless world Multi-scale Infrastructure for High Performance Computing & Analytics
High Performance Computing High Performance Analytics ‘New-gen Workloads’ Design / Simulation / Modeling Trade / Risk Analytics Hadoop, Spark, Containers Workload Aware Scheduling
Shared Resource Management
Shared Multi-tier Data Management Heterogeneous Servers & Storage
Flash Disk Tape Power x86 Linux on z Sparc ARM docker VM
Hybrid Cloud Infrastructure
© 2016 IBM Corporation IBM Systems | 34 IBM Academic Initiative
From IBM + Suite Advanced with Spark for HPC Edition ibm.onthehub.com
https://developer.ibm.com/academic/ Welcome to the waitless world IBM Spectrum LSF Application Center, IBM Spectrum LSF Process Manager Simplify User Experience and Automate Workflows IBM Spectrum LSF Application Center
Provides flexible, application-centric interfaces for cluster users and administrators that are easy to use, deploy, manage and support
IBM Spectrum LSF Process Manager
Enables organizations to design and automate computational or analytic processes, making them reliable and capturing repeatable best practices
© 2016 IBM Corporation The IBM Spectrum LSF Family
IBM Spectrum IBM Spectrum Application LSF Analytics Center
IBM Spectrum IBM Spectrum LSF Process LSF RTM IBM Manager Spectrum LSF
IBM Spectrum IBM Spectrum Hadoop Connector LSF Data LSF License Manager Scheduler
Spark Connector IBM Spectrum IBM Spectrum MPI LSF Session Scheduler Docker Connector IBM Spectrum Cluster Resource Connector for Foundation OpenStack & Cloud
© 2017 IBM Corporation IBM Spectrum LSF Process Manager Welcome to the waitless world Visual Flow Editing, Management IBM Spectrum LSF Process Manager Flow Editor
• Intuitive drag-and-drop interface • Creates self-documenting flows • Support for sub-flows, job arrays • Rich error-handling / retry capability • Save workflows in XML format • Publish flows directly to Flow Manager
IBM Spectrum LSF Process Manager Flow Manager • Manages multiple flows for multiple users and groups simultaneously • Monitor workflow execution graphically • Trigger flows automatically through calendar events, the flow manager or the command line.
© 2016 IBM Corporation IBM Spectrum LSF Application Center Welcome to the waitless world Intuitive Application Interfaces Guided, self-documenting interfaces boost productivity, reduce training and lower support costs.
Integrated console boosts productivity enabling access to multiple interactive applications through the browser.
© 2016 IBM Corporation Spectrum Scale deployment models Welcome to the waitless world
Enterprise Integrated Model Network Shared Disk (NSD) Model
Unify and parallelize storage silos Modular High-Performance Scaling
Shared Nothing Cluster (SNC) Model
Span storage rich servers for converged architecture or HDFS deployment
© 2016 IBM Corporation IBM Systems | 40 Welcome to the waitless world
POWER – Power8 HPC Solutions & Elastic Storage Sever
© 2016 IBM Corporation 41 Welcome to the waitless world System Head Rack (Login, Management, Storage)
Top of Rack Mellanox Leaf Software Stack Switches (Ethernet or Infiniband) Spectrum Compute Foundation • Cluster Manager Compute Nodes (FAT) • LSF Scheduler • Process Manager Spectrum Foundation • Application Center Pipeline Manager SW. • Workload Manager Spectrum Scale Storage • Real Time Manager Management Node End User, Usability Login / SW Mgt Nodes
Spectrum Scale ESS Gl2 Storage disc 500 TB Disc Parallel 6-18 GB/sec
© 2016 IBM Corporation Sample IBM HPC Dense Compute Welcome to the waitless world
• Multiple racks of 1U compute nodes based on Power8+ cpus
• 20 or 24 cores per server. • 160/192 threads simultaneous
• Infiniband or Ethernet Interconnect
• 22.5KW per rack (no GPU)
© 2016 IBM Corporation Welcome to the waitless world
InfiniBand switch IBM Delivers Full HPC Solutions Ethernet switch
S822LC servers 2 POWER8 CPUs 1U, 2U
NVIDIA GPUs compute nodes P100, K80
Mellanox IB, Ethnt switches Keyboard/monitor UFM appliance Ethernet switches
Parallel File System Management server Elastic Storage Server Login node
ESS Storage
HPC Software
XL Compilers Spectrum Spectrum Compute Fa Parallel Env CUDA ESSL Scale xCAT miily PGI LSF,PAC,PPM
© 2016 IBM Corporation 44 Welcome to the waitless world Compute Intense FAT nodes for GPU Workloads Nvidia NVLINK© Technology = IBM OpenPower 822LC Sample: One or More racks of 2U Servers 16 High Performance IBM Power822LC servers 2 or 4 Nvidia Pascal P100 GPUs , 256 Gb Memory + FAT Nodes w 1TB Memory, GPU (or not)
Intense Deep Learning and Bioinformatics Workloads are Seeing 5-10X improvement over Intel x86 PCI based systems © 2016 IBM Corporation Welcome to the waitless world Sample Solution Architecture
Campus 10 Gb Network Campus Network
Campus Gb Network
Internal Management Network (1Gb Ethernet – 2X Redundancy) Compute Nodes Storage Nodes / Pools
FPGA Node Server Server 1 2 1 2 3 CAPI Node 2X Infra Mgmt Tier 1 Storage Tier 2 Storage X86 Nodes Database (Performance) (Capacity) Tape Block 2X App Center TBs TBs-PBs Storage Storage
Stratton 2X Data G’way Computing (GPFS Protocol) (non GPU) Tier 3 TBs
10Gb Ethernet Switch Ethernet 10Gb (Backup) (Archive) Minsky PBs Computing (P100 GPU)
Internal Data Network (EDR InfiniBand)
© 2016 IBM Corporation Welcome to the waitless world HPC Storage - Elastic Storage Server Elastic Storage Server (ESS) is an integrated file/object based storage solution consisting of: IBM Power System Servers Storage Enclosure and drives Spectrum Scale Software including Spectrum Scale RAID software Networking components Scalable Parallel File System **Built on 15 Years of IBM GPFS CIFS, NFS, SMB, Posix gateways Spectrum Object Transparent Cloud Tiering.
ESS is a building block solution for Spectrum Scale ESS’s can be scaled horizontally to provide massive scaling of: Capacity Throughput (3-12 GB/second) *** Number of files Number of Objects
© 2016All IBM of Corporation the intelligence in ESS lies in Spectrum Scale and Spectrum Scale RAID Software 47 ESS – The World’s Fastest Spectrum Scale Product GPFS „Classic“ IBM ESS IBM Spectrum Scale Raid
21 stripes (42 strips)
7 stripes per group 49 strips (2 strips per stripe)
3 groups spare 7 disks 7 spare 6 disks disk strips
IBM Systems | 48 IBM’s DeepFlash ESS enhances the Elastic Storage Server family. Welcome to the waitless world GS models use 2U 24x2.5” JBODs or SSDs, GL models use 4U 60x3.5” JBODs, GF models use 3U 32 JBOF enclosures Support drives: 1.8TB SAS, 400GB, 800GB, 1.6TB SSD 2.5”; 2TB,4TB,6TB and 8TB NL-SAS 3.5” HDDs Supported NICs: 10GbE, 40GbE Ethernet and EDR Infiniband FC 5887
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 FC 5887
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 FC FC 5887 5887 GF1 building GF2 building 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
block block FC 5887
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 FC 5887
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 FC 5887
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 FC FC 5887 5887
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 FC 5887 FC 5887 FC 5887 FC 5887
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 GS1 building GS2 building GS4 building GS6 building GL2 building GL4 building GL6 building block block block block block block block
© 2016 IBM Corporation Welcome to the waitless world
POWER – IBM Data Engine for Hadoop & Spark
© 2016 IBM Corporation 50 Welcome to the waitless world IBM Data Engine for Hadoop and Spark OpenPOWER innovation with IBM Open Platform with Apache Hadoop for a high performance, storage dense and fully integrated cluster offering.
Optimized configurations Single vendor support for Hadoop or Spark workloads better price performance Based on S812LC Up to 2x servers with up to for Spark workloads* 14*6TB disk drives per server Delivered as a fully integrated Optionally preloaded with IBM BigInsights and cluster ready to run IBM Open Platform Simplify operations – OpenPOWER innovation with IBM easy to deploy and S821LC servers manage Adapt and scale to your changing analytics needs
• All results are based on IBM Internal Testing of 3 SparkBench benchmarks consisting of SQL RDD Relation, Logistic Regression, © 2016 IBM Corporation SVM Welcome to the waitless world New Members of LC Server line ideal for Big Data
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48
M S 1GbE
49 51 50 52 42U 12 24 36 48
2 4 2 Mgmt D D 14 26 38 A A Rst 11 10GbE23 35 47 1 3 S D D 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 A A 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 41U 12 24 36 48
2 4 2 Mgmt D D 14 26 38 A A Rst 11 10GbE23 35 47 1 3 S D D 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 A A 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 40U
System x 3650 M4 BD 1 2 3 4 3 U 39U Data Node 2U 2S 2 U 38U System x 3650 M4 BD 1 2 3 4 3 U 37U Data Node 2U 2S 2 U 36U System x 3650 M4 BD 1 2 3 4 3 U 35U Data Node 2U 2S 2 U 34U System x 3650 M4 BD 1 2 3 4 3 U 33U Data Node 2U 2S 2 U 32U System x 3650 M4 BD 1 2 3 4 3 U 31U Data Node 2U 2S 2 U 30U
System x 3650 M4 BD 1 2 3 4 3 U 29U Data Node 2U 2S 2 U 28U System x 3650 M4 BD 1 2 3 4 3 U 27U Data Node 2U 2S 2 U 26U System x 3650 M4 BD 1 2 3 4 3 U 25U Data Node 2U 2S 2 U 24U System x 3650 M4 BD 1 2 3 4 3 U 23U Data Node 2U 2S 2 U 22U System x 3650 M4 BD 1 2 3 4 3 U 21U Data Node 2U 2S 2 U 20U System x 3650 M4 BD 1 2 3 4 3 U 19U Data Mgmt Data Node 2U 2S 2 U 18U System x 3650 M4 BD 1 2 3 4 3 U 17U Nodes Nodes Data Node 2U 2S 2 U 16U System x 3650 M4 BD 1 2 3 4 3 U Data Node 2U 2S 15U 2 U 14U
System x 3650 M4 BD 1 2 3 4 3 U 13U Data Node 2U 2S 2 U 12U System x 3650 M4 BD 1 2 3 4 3 U 11U Data Node 2U 2S 2 U 10U System x 3650 M4 BD 1 2 3 4 3 U 9 U Data Node 2U 2S 2 U 8 U System x 3650 M4 BD 1 2 3 4 3 U 7 U Data Node 2U 2S 2 U 6 U Management Node 1U2S 5 U Management Node 1U2S 4 U Management Node 1U2S 3 U System Management Node 1U1S 2 U Cable Ingress / Egress 1 U
© 2016 IBM Corporation 52 Welcome to the waitless world Typical IDEHS Configurations Small Medium Large • Starter/Small (max 1 rack) – 1 sys mgmt, 1 mgmt, 3 data – Ideal for dev/test, POC or single use where HA and performance not a consideration – Capacity: 72TB raw • Medium – 1 sys mgmt., 3-6 mgmt., 7 data (min 3) – Redundant 10GB top of rack switches – Ideal for small to medium production cluster – Capacity: 504TB raw • Large (max 2 rack) – 1 sys mgmt., 3-6 mgmt., 12 data – Ideal for multiple lines of business in production – Capacity: 864TB raw Data Node Default Options: • Standard Analytics Node Default Configuration – 1x POWER8 2.92GHz 10Core + 128GB (16x8GB) DRAM + 12 x 6TB (front drives) + 2 x 1TB HDD (rear drives) • ‘Memory Rich’ Node Default Configuration: – 1x POWER8 2.92GHz 10Core + 256GB (16x16GB) DRAM + 10 x 6TB HDD + 2 x 960GB SSD + 2 x 1TB HDD(rear drives)
© 2016 IBM Corporation 53 Welcome to the waitless world Actual Use case for Hadoop with Power8 SMT8 Louisiana State University- Genome Assembly with Hadoop
Hadoop workload with SMT8 enabled + External GPFS storage (HDFS) connector
• 9x Performance over Intel nodes. • 120 Nodes Dell vs 40 Nodes IBM 3x Time performance, 1/3 nodes.
LSU Whitepaper published http://www.lsu.edu/mediacenter/docs/LSU-IBM_POWER8_GenomeBenchmark.pdf
© 2016 IBM Corporation 54 Welcome to the waitless world POWER8 Delivers more for Spark
2.5
2 • S812LC delivers optimized Spark price-performance 1.94 based on an average of 10 SparkBench benchmarks 1.5 1 1 Relative Perform Relative – Complete the same Spark workloads for less than ½ the 0.5 0 cost of Intel Xeon E5-2690 v3 systems POWER8 x86 IBM S812LC HP DL380 2.3X BETTER performance per dollar spent 10c/80t E5-2690 v3 o 24c/48t 2.5
2.3 – 94% more Spark workloads in the same rack space 2 versus Intel Xeon E5-2690 v3 systems 1.5 1 o 1.94X BETTER performance per system (10 core S812LC 1
vs 24 core DL380) Performance Relative 0.5
0 POWER8 x86 IBM S812LC HP DL380 • All results are based on IBM Internal Testing of 10 SparkBench benchmarks consisting of SQL RDD Relation, Twitter, Pageview Streaming, PageRank, Logistic Regression, SVD++, TriangleCount, SVM, MF, SQL Hive 10c/80t E5-2690 v3 • IBM Power System S812LC 10 cores / 80 threads, POWER8; 2.9GHz, 256 GB memory, Ubuntu 15.04, Spark 1.4, OpenJDK 1.8 24c/48t • Intel Xeon HP DL380; 24 cores / 48 threads, E5-2690 v3; 2.3GHz , 256 GB memory. Ubuntu 15.04, Spark 1.4, OpenJDK 1.8 • Pricing is based on list prices of HP DL380 and estimated prices of IBM Power S82LC © 2016 IBM Corporation Welcome to the waitless world IBM Data Engine for Hadoop and Spark (IDE-HS) Cluster Performance Designed for the Cognitive Era to Make Better Decisions even Faster
• IBM Data Engine for Hadoop and Spark infrastructure delivers Spark workload scaling to minimize execution times and reduce batch windows
- 2.1X more performance per dollar spent for Spark Logistic Regression based Machine Learning used in model training by wide variety of lines of business Logres SV SQL M - 1.4X more performance per dollar spent for Support Vector Machine (SVM) – a Machine Learning algorithm used in product Recommender Systems
- 1.7X more performance per dollar spent for Spark SQL query processing used widely in Big Data clusters
• All results are based on IBM Internal Testing of 3 SparkBench benchmarks consisting of SQL RDD Relation, Logistic Regression, SVM • 6 Data Nodes and 1 Management Node. Each node is IBM Power System S812LC 10 cores / 80 threads, POWER8; 2.92GHz, 256 GB memory, RedHat 7.2, Spark 1.5.1, OpenJDK 1.8 • 6 Data Nodes and 1 Management Node. Each node is x86 E5-2620V3 12 cores / 24 threads, E5-2620 V3; 2.4GHz, 256 GB memory, RedHat 7.1, Spark 1.5.1, OpenJDK 1.8 • Pricing is based on web prices of HP DL380 and list prices of IBM Power S812LC Logres SV SQL M
© 2016 IBM Corporation Welcome to the waitless world
POWER – Open Software Solutions and Application Performance
© 2016 IBM Corporation 57 Welcome to the waitless world Open Source Ecosystem
Other Databases Cloud Dev. Env Big Data HA, Technical & search Managemnt /Tools & Analytics Security Computing engines Stack etc.
Available: Available: Available: Available: Available: Available: Angular.js,Backbone, Accumulo (column), Hadoop Base, Hive, Apache Web Server Alfresco, BTRFS, ALLPATH-LG, bfast, Bootstrap, buildbot, Cassandra HBase, Accumolo, Apache tomcat Bootstrap BioConductor, BLAST, busyBox, Continuum, CouchDB (document) Ambari, Avro, Ceph, Chef Client, CentOS BOOST, Bowtie2, BWA, CruiseControl, Docker , Couchbase (noSQL) ElasticSearch Glassfish, Jetty, Chroma-key EMBOSS, FASTA, FastQC, Eigen lib, Erlang, Ganglia, Derby, InfluxDB, Firebird Falcon, FluentD Juju & Juju gui Cluster Glue, HMMER, HTSeq, ISAAC, GCC, gccgo, GDB, ElasticSearch Flume, Hue, Kafka, Landscape client corosync, CRMshell, MUSCLE, PICARD, PLINK, GoLang, Gump, Jenkins, EnterpriseDB, LevelDB, Knox, Lucene-Solr, MAAS, OpenStack, DRBD RNAStar, RSEM, SAMTools, Jruby, Lynx, MariaDB (v10.x optimized) Mahout, Oozie, Puppet Evolution data svr, SAMTools 1.0, SeqAn, LLVM, logstash, logstash- Memcached (KVS) Parquet, Phoenix Apache Qpid ClusterGlue, SHRiMP, SOAPAligner, forwarder, kibana, MongoDB 3.2 (document) Pig, Ranger, Slider, Thrift HAProxy, Heartbeat SOAPbuilder, SOAPDenovo, Maven, Nagios, NGINX, MySQL, Percona Server, Spark, Sqoop, Ceilometer client, keepalived TMAP, TopHat, ABySS, Balsa, node.js, OCaml, PostgreSQL, PrestoDB, Storm, Tez, Sensu Server & Ldirectord, Caffe, Code-Saturne, GATK, OpenJDK, Phantom.js, RabbitMQ, Redis (KVS), Zookeeper Client Linux-HA. mesos GMP, Heat3d, IOR, Lattice- PHP, phpMy Admin, Perl, Riak, Sphinx, SQLite, OpenSSL Boltzmann, LAPACK, PyCi, Python, Python- TokyoCabinet Optimizing: Pacemaker, REAR, OpenFOAM, waterman …. Django, Python-Pip Virtuoso (graph) Hadoop Port In Progress: resource agents, ecosystem, PyPy, R, Wired Tiger, ZODB Chef Server v12 Samba, Tophat Optimized & GPU enabled: RabitMQ, rsyslog, Ruby, Evaluating: WordPress AMBER14, CHARMM, Galaxy, Ruby on Rails (rbenv), Port In Progress: Clusterpoint GAMESS, GROMACS, HPC- Ruby Gems, scala, GPUDB, Neo4j (graph) Port In Progress: challenge, KKRnano, snappy, Socket.io LatticeQCD, MAFIA, NAMD. (npmjs), SpiderMonkey, NEST, PEASOUP, PLUTO, Open source application is ported and Strider, Supervisord, Available: Quantum Espresso, QUDA, Optimizing: available on distro (Ubuntu or RHEL or SLES) (black), Evaluating: SystemTap, Vagrant, V8, CouchDB in community (purple), Lab7 (green) or IBM IOP Cluster-Network SOAP3-DP, WRF Xerces (blue). Does not mean it is optimized. Does not Open Identity Stack Evaluating: mean that a commercial ISV version is available. (forgerock.com) Evaluating: Optimized: InfiniSQL CEGMA, DIALIGN-TX, R Evaluating: Needs to be vetted in new business openQCD, SIFT, STREAM, development prioritization process. Some of these miRdeep2, ucsctools, Port In Progress: are available codes that need optimization to be ViennaRNA, competitive with x86. Travis-CI © 2016 IBM Corporation 2 © 2016 International Business Machines Corporation Welcome to the waitless world Available Technical Computing Packages
attice-Boltzmann, LatticeQCD (LQCD), LAPACK ABySS, ALLPATHS-LG, ALYA, Amber14, ATLAS L (liblapack3), LES, LSQR, LibGD
IGV, IOR, iRODS (beta), iSSAC (Illumina) VAS P, variant_tools, Velvet/Oases, native vector support
Jurrassic, KKRnano libs, Waterman, WRF
© 2016 IBM Corporation 59 © 2016 International Business Machines Corporation HPC Software Stack for Power Systems Welcome to the waitless world
Products Client Benefits • Ease of Use: web portal Systems PCM Standard Ed • Customizable: admin productivity • Faster time to system productivity Management xCAT • Robust monitoring
IBM MPI runtime • Optimized Parallel Runtime Application ESSL / PESSL • Optimized LAPACK and ScaLPACK libraries Runtime CUDA runtime • User controlled workflow support
PE Developer Ed • Modern application development environment using Eclipse Development • Performance analysis tools to help analyze applications XL Complier suite • Productivity Optimized compiler for Power Totalview/DDT debuggers
Workload • Optimized utilization of resources Spectrum LSF • Policy and resource aware scheduling Management • Robust add-on features
Spectrum Scale • Scalable/reliable storage for parallel files system (ESS) Data Management HPSS • ILM for transparent migration of data from storage to tape and back Spectrum Protect • Enhance availability with RAID-based ESS and Tape
Spectrum Application Center • Simplify job submission for repeatable workload: customization Application • Customizable Spectrum Process Manager • Workflow management - Faster time to system productivity Environment Spark,Hadoop Connectors © 2016 IBM Corporation IBM Confidential 60 Welcome to the waitless world
Compiler Offerings includeUnlock the Proprietary power of big data with and IBM TechnicalOpen ComputingSource versions of Acceleration Enabled Programing Models
CUDA Key Features: Key Features: Key Features:
• Gives direct access to the GPU • Designed to simplify Programing of • OpenMP 4.0 introduces offloading and instruction set heterogeneous CPU/GPU systems support for heterogeneous CPU/GPU
• Supports C, C++ and Fortran • Directive based parallelization for • Leverage existing OpenMP high level accelerator device directives support • Generally achieves best leverage of GPUs for best application • PGI/NVIDIA Compiler performance • IBM XL Compiler • OpenACC/gcc • PGI/NVIDIA Compiler • Open Source LLVM OpenMP Compiler • CUDA C/C++ for Power via XL NVCC
61
© 2016 IBM Corporation IBM Confidential Return to top Welcome to the waitless world
Give us your hardest workload! How can IBM make you and the university successful? Thank you….
© 2016 IBM Corporation 62