Experiences in Building GPU Enabled HPC Clusters
Building a Successful Deep Learning Platform: Experiences in Building GPU Enabled HPC Clusters
Brian Michael Belgodere
Software Engineer
Cognitive Computing Cluster
IBM 5/10/17

Agenda
• Who am I
• Why should you listen to IBM Research
• What did we build
• Why did we do this
• How do we keep our users productive
• How do we keep it running
• Future Work

About Me
• Software Engineer
• IBM Research
• HPC and ML/Deep Learning
• DevOps
• Background in Computational Finance and Economics
• Law Degree
• Coffee and Beer Enthusiast
• I Love Solving Problems

Why should you listen to IBM Research?

Bluemix Services

In 2 short years IBM Research...
• 20+ high value patents submitted
• 50+ Papers Accepted at Top Conferences
  • NIPS, COLING, INTERSPEECH, AISTATS, EACL, ICASSP, EMNLP, etc.
• Several Commercialized Services and Client Engagements
• 5 Moonshots

What did we build?
Cognitive Computing Cluster
• 263 x86 Compute Nodes
• 130 P8 Compute Nodes
• 946 GPU Devices
• 3.45 Petabytes of Storage
• 10GbE + InfiniBand
• 750+ Active Users
• 1.2 Million Jobs a Month

• Off the Shelf Parts
• IBM Software Defined Infrastructure
• Open Source Frameworks

Why did we do this?
• Users of Blue Gene/Q, Speech, and several other clusters looking for a next-gen compute platform
• Colocation of Data and Workload across Disciplines
• Centralization of Management, Cost, and Workload Visibility
• Advanced Scheduling to Maximize Cluster Utilization
• Demand for Deep Learning & GPUs
• The Cognitive Computing Cluster was built to explore the enormous possibilities of Deep Learning in order to accelerate our researchers' experiments. This enables IBM Research to bring innovation to our customers faster and help extract the maximum value from their ever-growing proprietary data sets.

IBM Research: A global research capability
• 3,000 researchers
• Labs: Ireland, Zurich, Almaden, Watson, Tokyo, China, Austin, Haifa, India, Africa, Brazil, Australia
• Focus areas shown on the map (per-lab assignment not recoverable from the extracted text) include Healthcare, IoT, Big Data, Cloud, Quantum Computing, Security, Analytics, Nanotechnology, Robotics, Nanomaterials, Exascale, Neurosynaptics, OpenPOWER, Accessibility, Mobile, Aging, Blockchain, Education & Skilling, Financial Services, Oil & Gas, Insurance, Government, and Energy
© 2017 International Business Machines Corporation

How do we keep our users productive

Cognitive Systems Lifecycle

Providing a Full Software Stack
• User files & project directories cross mounted across cluster
Layers:
• Cognitive Applications
• Platform Enablers for GPU
• Application Libraries
• Programming Platforms
• Compilers & Runtime
• Distributed Runtime Support
• OS Services & Data Access
• Hardware Platform, GPU, Storage
Software:
• TensorFlow, Torch, Caffe, Theano
• Licensed Tools: Intel MKL, MATLAB
• CPU: C/C++, Python, Java
• GPU: CUDA, OpenCL
• Distributed Software:
  • SpectrumMPI, OpenMPI, & OpenMP
  • Spark
  • TensorFlowOnSpark
  • CaffeOnSpark
  • SystemML
• OS: Red Hat
• Filesystems: NFS, GPFS
• Cluster and Job Manager:
  • Spectrum LSF
• Ethernet & InfiniBand

Parallelizing out Deep Learning
• SpectrumMPI & OpenMPI
• Parameter Servers
• Spark

How do we keep it running?

There are no silver bullets for this, only lead bullets. - Ben Horowitz

xCAT

Nagios

ELK Stack

Future Work

Thanks

For more information: http://research.ibm.com.