Delivering a Machine Learning Course on HPC Resources
Stefano Bagnasco, Federica Legger, Sara Vallero
This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement LHCBIGDATA No. 799062.

The course
● Title: Big Data Science and Machine Learning
● Graduate Program in Physics at the University of Torino
● Academic year 2018-2019:
○ Starts in 2 weeks
○ 2 CFU, 10 hours (theory + hands-on)
○ 7 registered students
● Academic year 2019-2020:
○ March 2020
○ 4 CFU, 16 hours (theory + hands-on)
○ Already 2 registered students

The Program
● Introduction to big data science
○ The big data pipeline: state-of-the-art tools and technologies
● ML and DL methods:
○ supervised and unsupervised models
○ neural networks
● Introduction to computer architecture and parallel computing patterns
○ Initiation to OpenMP and MPI (2019-2020)
● Parallelisation of ML algorithms on distributed resources
○ ML applications on distributed architectures
○ Beyond CPUs: GPUs, FPGAs (2019-2020)

The aim
● Applied ML course:
○ Many courses on advanced statistical methods are available elsewhere
○ Focus on hands-on sessions
● Students will
○ Familiarise with:
■ ML methods and libraries
■ Analysis tools
■ Collaborative models
■ Container and cloud technologies
○ Learn how to:
■ Optimise ML models
■ Tune distributed training
■ Work with the available resources

Hands-on
● Python with Jupyter notebooks
● Prerequisites: some familiarity with numpy and pandas
● ML libraries:
○ Day 2: Spark MLlib
■ Gradient-Boosted Trees (GBT)
■ Multilayer Perceptron Classifier (MPC)
○ Day 3: Keras
■ Sequential model
○ Day 4: BigDL
■ Sequential model
● Coming:
○ CUDA
○ MPI
○ OpenMP

ML input dataset for the hands-on sessions
● Open HEP dataset at UCI, 7 GB (.csv): https://archive.ics.uci.edu/ml/datasets/HIGGS
● Signal (heavy Higgs) + background (ttbar)
● 10M MC events (balanced, 50%:50%)
○ 21 low-level features
■ pt's, angles, MET, b-tag, …
○ 7 high-level features
■ invariant masses (m(jj), m(jjj), …)
● Reference: Baldi, Sadowski and Whiteson, "Searching for Exotic Particles in High-Energy Physics with Deep Learning", Nature Communications 5 (2014)
A PySpark sketch of the Day 2 workflow on this dataset follows below.
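To make the Day 2 hands-on concrete, here is a minimal PySpark sketch that loads the HIGGS CSV and trains an MLlib GBT classifier. It is an illustration rather than the course notebook: the HDFS path, the train/test split and the GBT hyperparameters are assumptions; only the dataset layout (label in the first column, followed by the 28 features) follows the UCI description.

```python
# Minimal sketch (not the course notebook): MLlib GBT on the UCI HIGGS CSV.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("higgs-gbt").getOrCreate()

# HIGGS.csv has no header: column 0 is the label (1 = signal, 0 = background),
# followed by 21 low-level and 7 high-level features. The path is an assumption.
df = spark.read.csv("hdfs:///data/HIGGS.csv", inferSchema=True)
label_col, feature_cols = df.columns[0], df.columns[1:]

# Pack the 28 feature columns into a single vector column, as MLlib expects.
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
data = assembler.transform(df).withColumnRenamed(label_col, "label")
train, test = data.randomSplit([0.8, 0.2], seed=42)

gbt = GBTClassifier(labelCol="label", featuresCol="features",
                    maxIter=50, maxDepth=5)
model = gbt.fit(train)

auc = BinaryClassificationEvaluator(metricName="areaUnderROC") \
    .evaluate(model.transform(test))
print(f"AUC = {auc:.3f}")
```

Scaling this to the full 10M events mainly means pointing the same code at more executors, which is where the Kubernetes/Spark infrastructure described next comes in.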
Infrastructure: requirements
● Commodity hardware (CPUs)
● Non-dedicated and heterogeneous resources:
○ Bare metal
■ 1 x 24 cores, 190 GB RAM
■ 4 x 28 cores, 260 GB RAM
○ IaaS cloud (on premises)
■ 10 VMs, 8 cores, 70 GB RAM
● Uniform application/service orchestration layer -> Kubernetes
● High throughput vs. high performance -> Spark
● Distributed datasets -> HDFS
● Elasticity: allow scaling up if there are unused resources

What about HPCs?
● HPC = high-performance processors + low-latency interconnect
● HPC clusters are typically managed with a batch system
● The OCCAM HPC at the University of Torino employs a cloud-like management strategy coupled to lightweight virtualization -> OCCAM facility
○ https://c3s.unito.it/index.php/super-computer

The OCCAM supercomputer
● Application virtualization
○ The pivotal technologies for the middleware architecture are Linux containers, currently managed with Docker
○ Package, ship and run distributed application components with guaranteed platform parity across different environments
○ Democratizes virtualization by providing it to developers in a usable, application-focused form
● Cloud-like computing model: each application is defined by its runtime environment, resource requirements and execution model
○ HPC cluster: batch-like, multi-node workloads using MPI and inter-node communication
○ Virtual workstation: code execution (e.g. R or ROOT) in a single multicore node, possibly with GPU acceleration
○ Pipelines: multi-step data analysis requiring high-memory, large single-image nodes
○ JupyterHub on demand: Spark backend for ML and big data workloads, with autoscaling

Infrastructure: architecture
● Total resources: 216 CPU cores, 1.9 TB memory, 2.3 TB HDFS, 1 Gbps network
● OAuth login
● Spark drivers and executors run as pods on Kubernetes workers spanning high-class hardware, lower-class hardware and virtual machines; the Kubernetes control plane manages the cluster and HDFS hosts the datasets
(Architecture diagram)

Infrastructure: elasticity
● The Spark driver continuously scales up to reach the requested number of executors
● No static quotas are enforced, but a minimum number of executors is granted to each tenant
● Custom Kubernetes operator (alpha version):
○ lets tenants occupy all available resources in a FIFO manner
○ undeploys exceeding executors only to grant the minimum number of resources to all registered tenants
(Diagram: the Farm Operator scaling executor pods up and down)

Scaling tests
● #cores per executor
● #cores per machine
● #cores in a homogeneous cluster
● Strong scaling efficiency = time(1) / (N * time(N)), with N = #cores
(Plot: strong scaling efficiency vs. #cores for BigDL NN, MLlib GBT and MLlib MPC, with the perfect-scaling line for reference)

ML models && lessons learned

Model | AUC | Time | # events | # cores | Note
MLlib GBT | 82 | 15 m | 10M | 25 | Doesn't scale
MLlib MPC (4 layers, 30 hidden units) | 74 | 9 m | 10M | 25 | Scales well, but can't build complex models
Keras Sequential (1 layer, 100 hidden units) | 81 | 18 m | 1M | 25 | No distributed training, cannot process 10M events
BigDL Sequential (2 layers, 300 hidden units) | 86 | 3 h 15 m | 10M | 88 | Requires 1 core per executor

A Keras sketch of the Sequential model in this table follows below.
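For reference, here is a minimal Keras sketch of the Sequential model from the table above: one hidden layer of 100 units on the 28 HIGGS features, trained on a 1M-event subset. The pandas loading step, the use of tf.keras, the optimizer and the training settings (epochs, batch size, validation split) are illustrative assumptions, not the course notebook.

```python
# Minimal sketch (not the course notebook): Keras Sequential model, one hidden
# layer of 100 units, trained on a 1M-event subset of the HIGGS CSV.
import pandas as pd
from tensorflow import keras

# Column 0 is the label (1 = signal, 0 = background); columns 1-28 are features.
df = pd.read_csv("HIGGS.csv", header=None, nrows=1_000_000)
X, y = df.iloc[:, 1:].values, df.iloc[:, 0].values

model = keras.Sequential([
    keras.Input(shape=(28,)),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[keras.metrics.AUC(name="auc")])
model.fit(X, y, epochs=10, batch_size=1024, validation_split=0.2)
```

Because plain Keras trains on a single node, the hands-on restricts it to 1M events, as noted in the table.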
Summary
● Applied ML course for Ph.D. students focusing on distributed training of ML models
● The infrastructure runs on 'opportunistic' resources
● The architecture can be 'reused' on OCCAM

Spares

Farm Kube Operator
https://github.com/svallero/farmcontroller
● The Spark driver deploys executor Pods with a given namespace/label/name (let's call this triad a selector)
● But a Pod is not a scalable Kubernetes resource (whereas a Deployment is)
● The Farm Operator implements two Custom Resource Definitions (CRDs), each with its own controller:
○ Farm resource
○ FarmManager resource
● The Farm Operator can be applied to any other app (farm type) with similar features
● Caveat: the farm app should be resilient to the live removal of executors (e.g. Spark, HTCondor)

Farm Kube Operator (continued)
● Farm resource:
○ Collects Pods with a given selector
○ Implements scale-down
○ Defines a minimum number of executors (quota)
○ Reconciles on selected Pod events
● FarmManager resource:
○ Reconciles on Farm events
○ Scales down Farms over quota only if some other Farm requests resources and is below its quota
○ Simple algorithm: the number of killed pods per Farm is proportional to the number of Pods over the quota (should be improved)

OCCAM HPC facility at the University of Turin
● Managed using container-based, cloud-like technologies
● Computing applications are run on virtual clusters deployed on top of the physical infrastructure

OCCAM specs
● 2 management nodes
○ CPU: 2x Intel Xeon E5-2640 v3, 8 cores / 2.6 GHz
○ RAM: 64 GB / 2133 MHz
○ Disk: 2x 1 TB HDD, RAID0
○ Net: IB 56 Gb + 2x 10 Gb + 4x 1 Gb
○ Form factor: 1U
● 4 fat nodes
○ CPU: 4x Intel Xeon E7-4830 v3, 12 cores / 2.1 GHz
○ RAM: 768 GB / 1666 MHz (48x 16 GB) DDR4
○ Disk: 1x 800 GB SSD + 1x 2 TB HDD 7200 rpm
○ Net: IB 56 Gb + 2x 10 Gb
● 4 GPU nodes
○ CPU: 2x Intel Xeon E5-2680 v3, 12 cores / 2.1 GHz
○ RAM: 128 GB / 2133 MHz (8x 16 GB) DDR4
○ Disk: 1x 800 GB SSD, SAS 6 Gbps, 2.5"
○ Net: IB 56 Gb + 2x 10 Gb
○ GPU: 2x NVIDIA K40 on PCI-E Gen3 x16
● 32 light nodes
○ CPU: 2x Intel Xeon E5-2680 v3, 12 cores / 2.5 GHz
○ RAM: 128 GB / 2133 MHz (8x 16 GB)
○ Disk: 400 GB SSD, SATA, 1.8"
○ Net: IB 56 Gb + 2x 10 Gb
○ Form factor: high density (4 nodes per RU)

Scaling tests #1
● Optimize the number of cores per executor
● Model: MLlib MPC and GBT, 1M events
● One machine (t2-mlwn-01.to.infn.it)
● In the 'literature', #cores = 5 is the magic number to achieve maximum HDFS throughput
● Results: #cores = 5 per executor is optimal; GBT does not scale well, as expected since GBT training is hard to parallelise

Scaling tests #2
● Optimize the number of executors
● #cores/executor = 5
● Model: MLlib MPC, 1M and 10M events
● One machine
A PySpark sketch of this MPC configuration is given below.

Scaling tests #3
● Scaling on homogeneous resources
○ bare metal, 4 machines with 56 cores and 260 GB RAM each

ML models && lessons learned 2
(Plots comparing the GBT, MPC and Keras Sequential models; annotations: "GBT fast", "GBT slow")
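As a companion to scaling tests #1 and #2, here is a hedged PySpark sketch of an MLlib Multilayer Perceptron Classifier run with 5 cores per executor. The executor count, the HDFS path and the reading of "4 layers, 30 hidden units" as a [28, 30, 30, 2] architecture are assumptions, not the measured course configuration.

```python
# Sketch (assumptions flagged inline): MLlib MPC with the 5-cores-per-executor
# geometry explored in the scaling tests.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import MultilayerPerceptronClassifier

spark = (SparkSession.builder
         .appName("higgs-mpc")
         .config("spark.executor.cores", "5")      # the 'magic number' for HDFS throughput
         .config("spark.executor.instances", "5")  # 5 x 5 = 25 cores; illustrative value
         .getOrCreate())

# Same HIGGS layout as before: label in column 0, 28 features after it.
df = spark.read.csv("hdfs:///data/HIGGS.csv", inferSchema=True)
data = (VectorAssembler(inputCols=df.columns[1:], outputCol="features")
        .transform(df)
        .withColumnRenamed(df.columns[0], "label"))

# "4 layers, 30 hidden units" is read here as two hidden layers of 30 units
# between the 28-feature input and the 2-class output; this is an assumption.
mpc = MultilayerPerceptronClassifier(layers=[28, 30, 30, 2],
                                     maxIter=100, blockSize=128, seed=42)
model = mpc.fit(data)
```

With spark.executor.cores fixed at 5, scaling test #2 then varies spark.executor.instances to find the best total core count.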