
Building a Successful Platform: Experiences in Building GPU Enabled HPC Clusters

Brian Michael Belgodere Software Engineer Cluster

Agenda

• Who am I
• Why should you listen to IBM Research
• What did we build
• Why did we do this
• How do we keep our users productive
• How do we keep it running
• Future Work

IBM 5/10/17

About Me

• Software Engineer
• IBM Research
• HPC and ML/Deep Learning
• DevOps
• Background in Computational Finance and Economics
• Law Degree
• Coffee and Beer Enthusiast
• I Love Solving Problems

Why should you listen to IBM Research?

Bluemix Services

In 2 short years IBM Research...

• 20+ high-value patents submitted
• 50+ papers accepted at top conferences: NIPS, COLING, INTERSPEECH, AISTATS, EACL, ICASSP, EMNLP, etc.
• Several commercialized services and client engagements
• 5 moonshots

What did we build?

Cognitive Computing Cluster

What did we build?

• 263 x86 Compute Nodes
• 130 P8 Compute Nodes
• 946 GPU Devices
• 3.45 Petabytes of Storage
• 10GbE + InfiniBand
• 750+ Active Users
• 1.2 Million Jobs a Month

• Off the Shelf Parts
• IBM Software Defined Infrastructure
• Open Source Frameworks

Why did we do this?

• Users of Blue Gene/Q, Speech, and several other clusters were looking for a next-generation compute platform
• Colocation of data and workload across disciplines
• Centralization of management, cost, and workload visibility
• Advanced scheduling to maximize cluster utilization
• Demand for Deep Learning & GPUs

Why did we do this?

• The Cognitive Computing Cluster was built to explore the possibilities of Deep Learning and to speed up our researchers' experiments. This lets IBM Research bring innovation to our customers faster and helps them extract the maximum value from their ever-growing proprietary data sets.

IBM Research: A global research capability

Ireland, Zurich, Almaden, Watson, Tokyo, China, Austin, Haifa, India, Africa, Brazil, Australia

© 2017 International Business Machines Corporation

IBM Research: A global research capability

3,000 researchers

[Map of IBM Research labs annotated with focus areas, including Healthcare, IoT, Cloud, Security, Quantum Computing, Blockchain, Nanotechnology, Exascale, and Financial Services. Labs shown: Ireland, Zurich, Almaden, Watson, Tokyo, Haifa, China, Austin, India, Africa, Brazil, Australia.]

How do we keep our users productive

Cognitive Systems Lifecycle

Providing a Full Software Stack

Stack layers:
• Cognitive Applications
• Platform Enablers for GPU
• Application Libraries
• Programming Platforms
• Compilers & Runtime
• Distributed Runtime Support
• OS Services & Data Access
• Hardware Platform, GPU, Storage

Components:
• User files & project directories cross-mounted across the cluster
• TensorFlow, Torch, Caffe, Theano
• Licensed Tools: Intel MKL, Matlab
• CPU: C/C++, Python, Java
• GPU: CUDA, OpenCL
• Distributed Software: SpectrumMPI, OpenMPI, & OpenMP; Spark; TensorFlowOnSpark; CaffeOnSpark; SystemML
• OS: Red Hat
• Filesystems: NFS, GPFS
• Cluster and Job Manager: Spectrum LSF
• Ethernet & InfiniBand

Parallelizing Deep Learning

• SpectrumMPI & OpenMPI
• Parameter Servers
• Spark
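
To make the parameter-server option above concrete, here is a minimal single-process sketch of synchronous data-parallel training. It is an illustration under stated assumptions, not the cluster's actual code: the names `ParameterServer`, `worker_gradient`, and `train` are hypothetical, the "workers" are simulated in a loop rather than run on separate nodes, and the model is a one-parameter linear fit chosen only to keep the example short.

```python
# Toy synchronous parameter server (illustrative sketch only).
# ParameterServer, worker_gradient, and train are hypothetical names,
# and the workers are simulated sequentially in one process.


class ParameterServer:
    """Holds the shared model weight and applies averaged gradients."""

    def __init__(self, weight=0.0, learning_rate=0.01):
        self.weight = weight
        self.learning_rate = learning_rate

    def pull(self):
        """Workers fetch the current weight before computing gradients."""
        return self.weight

    def push(self, gradients):
        """Average the workers' gradients and take one SGD step."""
        mean_gradient = sum(gradients) / len(gradients)
        self.weight -= self.learning_rate * mean_gradient


def worker_gradient(weight, shard):
    """Gradient of mean squared error for y = weight * x on one data shard."""
    return sum(2.0 * (weight * x - y) * x for x, y in shard) / len(shard)


def train(shards, steps=100):
    """Run synchronous rounds: pull, compute per-shard gradients, push."""
    server = ParameterServer()
    for _ in range(steps):
        weight = server.pull()
        gradients = [worker_gradient(weight, shard) for shard in shards]
        server.push(gradients)
    return server.pull()


# Data generated from y = 3x, split into one shard per simulated worker.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[0:4], data[4:8]]
learned_weight = train(shards)
print(round(learned_weight, 3))
```

Each round, every worker sees only its own shard, and the server combines the results, so the learned weight should approach the true slope of 3. An MPI-based setup would typically replace the central server with an all-reduce across workers, while the pull/compute/push pattern stays the same.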

How do we keep it running?

"There are no silver bullets for this, only lead bullets." - Ben Horowitz

xCAT

Nagios

ELK Stack

Future Work

Thanks


For more information: http://research.ibm.com