Building a Successful Deep Learning Platform: Experiences in Building GPU Enabled HPC Clusters
Brian Michael Belgodere
Software Engineer, Cognitive Computing Cluster
Agenda
• Who am I
• Why should you listen to IBM Research
• What did we build
• Why did we do this
• How do we keep our users productive
• How do we keep it running
• Future Work
IBM 5/10/17
About Me
• Software Engineer
• IBM Research
• HPC and ML/Deep Learning
• DevOps
• Background in Computational Finance and Economics
• Law Degree
• Coffee and Beer Enthusiast
• I Love Solving Problems
Why should you listen to IBM Research?
Bluemix Services
In 2 short years IBM Research...
• 20+ high-value patents submitted
• 50+ Papers Accepted at Top Conferences
  • NIPS, COLING, INTERSPEECH, AISTATS, EACL, ICASSP, EMNLP, etc.
• Several Commercialized Services and Client Engagements
• 5 Moonshots
What did we build?
Cognitive Computing Cluster
What did we build?
• 263 x86 Compute Nodes
• 130 P8 Compute Nodes
• 946 GPU Devices
• 3.45 Petabytes of Storage
• 10GbE + Infiniband
• 750+ Active Users
• 1.2 Million Jobs a Month
• Off the Shelf Parts
• IBM Software Defined Infrastructure
• Open Source Frameworks
Why did we do this?
• Users of Blue Gene/Q, Speech, and several other clusters were looking for a next-gen compute platform
• Colocation of Data and Workload across Disciplines
• Centralization of Management, Cost, and Workload Visibility
• Advanced Scheduling to Maximize Cluster Utilization
• Demand for Deep Learning & GPUs
Why did we do this?
• The Cognitive Computing Cluster was built to explore the enormous possibilities of Deep Learning in order to accelerate the velocity of our researchers' experiments. This enables IBM Research to bring innovation to our customers faster and help extract the maximum value from their ever-growing proprietary data sets.
IBM Research: A global research capability
Ireland, Zurich, Almaden, Watson, Tokyo, China, Austin, Haifa, India, Africa, Brazil, Australia
© 2017 International Business Machines Corporation
IBM Research: A global research capability
3,000 researchers
Research focus areas shown across the labs include Healthcare, Cognitive IoT, Big Data, Data Centric Computing, Cloud, Core Technologies, Quantum Computing, Green Horizon, Energy, IoT & Mobile, OpenPOWER, Cloud Security, Cognitive Health, Analytics, Security, Nanotechnology, Cognitive Robotics, Nanomaterials, Exascale, Financial Services, Neurosynaptics, POWER, Accessibility, Aging, Blockchain, Cognitive Fashion, Education & Skilling, Cognitive Financial Services, Cognitive Oil & Gas, Insurance Analytics, Industry Cloud, and Government.
How do we keep our users productive?
Cognitive Systems Lifecycle
Providing a Full Software Stack
• Cognitive Applications: user files & project directories cross-mounted across the cluster
• Platform Enablers for GPU, Application Libraries: TensorFlow, Torch, Caffe, Theano
• Programming Platforms, Compilers & Runtime
  • CPU: C/C++, Python, Java
  • GPU: CUDA, OpenCL
• Licensed Tools: Intel MKL, MATLAB
• Distributed Runtime Support (Distributed Software)
  • SpectrumMPI, OpenMPI, & OpenMP
  • Spark
  • TensorFlowOnSpark
  • CaffeOnSpark
  • SystemML
• OS Services & Data Access
  • OS: Red Hat
  • Filesystems: NFS, GPFS
• Cluster and Job Manager: Spectrum LSF Software
• Hardware Platform, GPU, Storage: Ethernet & Infiniband
Parallelizing out Deep Learning
• SpectrumMPI & OpenMPI
• Parameter Servers
• Spark
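To make the parameter-server option above concrete, here is a minimal single-process sketch of synchronous data-parallel training. It is an illustration under stated assumptions, not the cluster's actual setup: the `ParameterServer` class, the `local_gradient` and `train` helpers, the toy model y = w * x, and the learning rate are all hypothetical, and the "workers" are simulated by a plain loop rather than MPI ranks or Spark executors.

```python
from typing import List, Tuple

Shard = List[Tuple[float, float]]  # (x, y) pairs held by one simulated worker


def local_gradient(w: float, shard: Shard) -> float:
    """Gradient of mean squared error for the toy model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


class ParameterServer:
    """Hypothetical server: holds the shared weight, applies averaged gradients."""

    def __init__(self, w: float = 0.0, lr: float = 0.01) -> None:
        self.w = w
        self.lr = lr

    def pull(self) -> float:
        # Workers fetch the current weight before computing gradients.
        return self.w

    def push(self, grads: List[float]) -> None:
        # Synchronous update: average the gradients from all workers, then step.
        self.w -= self.lr * sum(grads) / len(grads)


def train(shards: List[Shard], steps: int = 200) -> float:
    """Run synchronous data-parallel SGD; each shard stands in for one worker."""
    server = ParameterServer()
    for _ in range(steps):
        w = server.pull()
        grads = [local_gradient(w, shard) for shard in shards]
        server.push(grads)
    return server.w


if __name__ == "__main__":
    data = [(float(x), 3.0 * x) for x in range(1, 9)]  # samples of y = 3x
    shards = [data[0::2], data[1::2]]  # split the data across two workers
    print(round(train(shards), 3))
```

In an MPI-based setup the averaging step would typically be an allreduce across ranks instead of a central server, but the arithmetic per step is the same: every worker computes a gradient on its own data shard and the update uses the mean.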
How do we keep it running?
There are no silver bullets for this, only lead bullets. - Ben Horowitz
xCAT
Nagios
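The slide does not say which checks the cluster runs, so the following is only a hedged sketch of what a custom Nagios check for a GPU node could look like. It relies on the standard Nagios plugin convention (exit code 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN, one line of status text, optional performance data after a `|`). The function names, the temperature thresholds, and the idea of passing temperatures as arguments are assumptions made for illustration.

```python
from typing import List, Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_gpu_temps(
    temps_c: List[float], warn: float = 80.0, crit: float = 90.0
) -> Tuple[int, str]:
    """Map GPU temperatures (hypothetical input) to a status code and message."""
    if not temps_c:
        return UNKNOWN, "GPU TEMP UNKNOWN - no readings"
    hottest = max(temps_c)
    if hottest >= crit:
        status = CRITICAL
    elif hottest >= warn:
        status = WARNING
    else:
        status = OK
    # Text after "|" is performance data in label=value;warn;crit form.
    perf = " ".join(
        f"gpu{i}={t:.0f};{warn:.0f};{crit:.0f}" for i, t in enumerate(temps_c)
    )
    return status, f"GPU TEMP {LABELS[status]} - hottest {hottest:.0f}C | {perf}"


def main(argv: List[str]) -> int:
    """Parse temperatures from arguments, print the status line, return the code."""
    try:
        readings = [float(arg) for arg in argv]
    except ValueError:
        print("GPU TEMP UNKNOWN - non-numeric reading")
        return UNKNOWN
    code, message = check_gpu_temps(readings)
    print(message)
    return code
```

To use this as an actual plugin, a small wrapper script would end with `sys.exit(main(sys.argv[1:]))` so that Nagios sees the return value as the process exit code. The readings themselves would have to come from whatever GPU query tool is available on the node.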
ELK Stack
Future Work
Thanks
For more information: http://research.ibm.com