Delivering a course on HPC resources

Stefano Bagnasco, Federica Legger, Sara Vallero

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement LHCBIGDATA No 799062.

The course

● Title: Big Data and Machine Learning

● Graduate Program in Physics at the University of Torino

● Academic year 2018-2019:
○ starts in 2 weeks
○ 2 CFU, 10 hours (theory + hands-on)
○ 7 registered students

● Academic year 2019-2020:
○ March 2020
○ 4 CFU, 16 hours (theory + hands-on)
○ already 2 registered students

The Program

● Introduction to big data science
○ The big data pipeline: state-of-the-art tools and technologies
● ML and DL methods:
○ supervised and unsupervised models
○ neural networks
● Introduction to computer architecture and parallel computing patterns
○ Initiation to OpenMP and MPI (2019-2020)
● Parallelisation of ML algorithms on distributed resources
○ ML applications on distributed architectures
○ Beyond CPUs: GPUs, FPGAs (2019-2020)

The aim

● Applied ML course:
○ Many courses on advanced statistical methods are available elsewhere
○ Focus on hands-on sessions
● Students will:
○ Familiarise with:
■ ML methods and libraries
■ Analysis tools
■ Collaborative models
■ Container and cloud technologies
○ Learn how to:
■ Optimise ML models
■ Tune distributed training
■ Work with the available resources

Hands-on

● Python with Jupyter notebooks
● Prerequisites: some familiarity with numpy and pandas
● ML libraries:
○ Day 2: Spark MLlib (a minimal sketch follows below)
■ Gradient-boosted trees (GBT)
■ Multilayer Perceptron Classifier (MPC)
○ Day 3: Keras
■ Sequential model
○ Day 4: BigDL
■ Sequential model
● Coming:
○ CUDA
○ MPI
○ OpenMP
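As referenced above, a hedged sketch of a Day 2-style Spark MLlib pipeline (a GBT classifier on the HIGGS CSV); the file path, column handling and hyperparameters are illustrative assumptions, not the actual course notebooks.

```python
# Sketch only: train a Spark MLlib GBT classifier on the HIGGS CSV.
# Path and settings are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("higgs-gbt").getOrCreate()

# Assumed layout: first column is the label, the remaining 28 are features.
df = spark.read.csv("hdfs:///data/HIGGS.csv", inferSchema=True)
df = df.withColumnRenamed(df.columns[0], "label")
assembler = VectorAssembler(inputCols=df.columns[1:], outputCol="features")
data = assembler.transform(df).select("label", "features")

train, test = data.randomSplit([0.8, 0.2], seed=42)
model = GBTClassifier(labelCol="label", featuresCol="features", maxIter=50).fit(train)

auc = BinaryClassificationEvaluator(metricName="areaUnderROC").evaluate(model.transform(test))
print(f"AUC = {auc:.3f}")
```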

ML Input Dataset for hands-on

● Open HEP dataset @ UCI, 7 GB (.csv); see the loading sketch below
● Signal (heavy Higgs) + background (ttbar)
● 10M MC events (balanced, 50%:50%)
○ 21 low-level features
■ pT’s, angles, MET, b-tag, …
○ 7 high-level features
■ Invariant masses (m(jj), m(jjj), …)

Baldi, Sadowski, and Whiteson, “Searching for Exotic Particles in High-Energy Physics with Deep Learning”, Nature Communications 5 (2014). Dataset: https://archive.ics.uci.edu/ml/datasets/HIGGS
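A hedged pandas sketch of how the HIGGS CSV above could be loaded and split into the label, the 21 low-level and the 7 high-level features; the column names and the row limit are assumptions made for illustration.

```python
# Sketch only: read the UCI HIGGS CSV and split feature groups.
import pandas as pd

n_low, n_high = 21, 7
names = (["label"]
         + [f"low_{i}" for i in range(n_low)]
         + [f"high_{i}" for i in range(n_high)])

# The full file is ~7 GB; nrows limits the read for a quick look.
df = pd.read_csv("HIGGS.csv", header=None, names=names, nrows=100_000)

low_level = df[names[1:1 + n_low]]    # pT's, angles, MET, b-tag, ...
high_level = df[names[1 + n_low:]]    # invariant masses m(jj), m(jjj), ...
print(df["label"].value_counts(normalize=True))  # ~50%:50% signal/background
```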

Infrastructure: requirements

● Commodity hardware (CPUs)
● Non-dedicated and heterogeneous resources:
○ Bare metal
■ 1 x 24 cores, 190 GB RAM
■ 4 x 28 cores, 260 GB RAM
○ IaaS Cloud (on premises)
■ 10 VMs, 8 cores, 70 GB RAM
● Uniform application/service orchestration layer -> Kubernetes
● High-throughput vs. high-performance -> Spark
● Distributed datasets -> HDFS
● Elasticity: allow scaling up when there are unused resources
(a Spark-on-Kubernetes configuration sketch follows below)
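As referenced above, a sketch of how a PySpark session could be pointed at the Kubernetes and HDFS layer just described; the master URL, container image and resource sizes are placeholders, not the actual deployment values.

```python
# Sketch only: Spark driver requesting executors from a Kubernetes cluster
# and reading data from HDFS. All endpoints and sizes are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("course-hands-on")
    .master("k8s://https://kubernetes.example:6443")            # assumed API endpoint
    .config("spark.kubernetes.container.image", "spark-py:3.1")  # assumed image
    .config("spark.executor.instances", "10")
    .config("spark.executor.cores", "5")
    .config("spark.executor.memory", "10g")
    .getOrCreate()
)

df = spark.read.csv("hdfs://namenode:9000/data/HIGGS.csv", inferSchema=True)
print(df.count())
```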

What about HPCs?

● HPC = high-performance processors + low-latency interconnect
● HPC clusters are typically managed with a batch system
● The OCCAM HPC system at the University of Torino employs a cloud-like management strategy coupled to lightweight virtualization -> OCCAM facility
○ https://c3s.unito.it/index.php/super-computer

The OCCAM supercomputer

● APPLICATION: defined by runtime environment, resource requirements, execution model
● VIRTUALIZATION: the pivotal technologies for the middleware architecture are Linux containers, currently managed with Docker
○ package, ship and run distributed application components with guaranteed platform parity across different environments
○ democratizing virtualization by providing it to developers in a usable, application-focused form
● COMPUTING MODEL: cloud-like HPC cluster
○ HPC: batch-like, multi-node workloads using MPI and inter-node communication
○ VIRTUAL WORKSTATION: code execution (e.g. R or ROOT) in a single multicore node, possibly with GPU acceleration
○ PIPELINES: multi-step data analysis requiring high-memory, large single-image nodes
○ JUPYTER-HUB ON DEMAND: with a Spark backend for ML and Big Data workloads, with autoscaling

Infrastructure: architecture

[Architecture diagram: several Spark clusters (one driver plus executors each) deployed on Kubernetes workers, with HDFS for the datasets and a Kubernetes control plane; workers span high-class bare-metal hardware, lower-class hardware and virtual machines; access via OAuth login. Totals: 216 CPUs, 1.9 TB memory, 2.3 TB HDFS, 1 Gbps network.]

Infrastructure: elasticity

● The Spark driver continuously scales up until it reaches the requested number of executors
● No static quotas are enforced, but a minimum number of executors is granted to each tenant
● Custom Kubernetes Operator (alpha version):
○ lets tenants occupy all available resources in a FIFO manner
○ undeploys exceeding executors only to grant the minimum number of resources to all registered tenants
[Diagram: Spark drivers scaling their executors up and down, managed by the Farm Operator.]

Scaling tests

● #cores per executor
● #cores per machine
● #cores in a homogeneous cluster
● Strong scaling efficiency = time(1) / (N × time(N)), with N = #cores (see the sketch below)
[Plot: strong scaling efficiency vs. number of cores for BigDL NN, MLlib GBT and MLlib MPC, compared to perfect scaling.]
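A small sketch of the strong-scaling efficiency defined above; the runtimes in the example are made-up numbers, not the measured results.

```python
# Sketch only: strong-scaling efficiency as defined on the slide,
# efficiency(N) = time(1) / (N * time(N)), with N the number of cores.
def strong_scaling_efficiency(times):
    """times maps #cores N -> wall-clock time(N); returns N -> efficiency."""
    t1 = times[1]
    return {n: t1 / (n * t) for n, t in times.items()}

# Illustrative numbers only (not the measurements from these tests):
measured = {1: 1200.0, 5: 260.0, 25: 60.0, 50: 36.0}
for n, eff in sorted(strong_scaling_efficiency(measured).items()):
    print(f"N = {n:3d}   efficiency = {eff:.2f}")
```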

ML models && lessons learned

Model | AUC (%) | Time | # events | Cores | Note
MLlib GBT | 82 | 15 min | 10M | 25 | Doesn’t scale
MLlib MPC (4 layers, 30 hidden units) | 74 | 9 min | 10M | 25 | Scales well, but can’t build complex models
Keras Sequential (1 layer, 100 hidden units) | 81 | 18 min | 1M | 25 | No distributed training, cannot process 10M events
BigDL Sequential (2 layers, 300 hidden units) | 86 | 3 h 15 min | 10M | 88 | 1 core/executor required
(a sketch of the Keras model quoted above follows below)
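For reference, a minimal Keras Sequential model of the size quoted in the table (1 hidden layer, 100 units, 28 input features); the optimizer, loss and metric choices are assumptions, not necessarily those used in the course.

```python
# Sketch only: a Sequential model of the size listed in the table above.
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(100, activation="relu", input_shape=(28,)),  # 1 hidden layer, 100 units
    keras.layers.Dense(1, activation="sigmoid"),                    # signal vs. background
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[keras.metrics.AUC(name="auc")])
model.summary()
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10)
```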

Summary

● Applied ML course for Ph.D. students, focusing on distributed training of ML models
● The infrastructure runs on ‘opportunistic’ resources
● The architecture can be ‘reused’ on OCCAM

Spares

Farm Kube Operator
https://github.com/svallero/farmcontroller

● The Spark driver deploys executor Pods with a given namespace/label/name (let’s call this triad a selector); see the sketch after this list

● But a Pod is not a scalable Kubernetes resource (whereas, e.g., a Deployment is)

● The Farm Operator implements two Custom Resource Definitions (CRDs), each with its own Controller:
○ Farm Resource
○ FarmManager Resource

● The Farm Operator can be applied to any other app (farm type) with similar features

● CAVEAT:
○ The Farm app should be resilient to the live removal of executors (e.g. Spark, HTCondor)
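Not the actual farmcontroller code: a sketch, using the official Kubernetes Python client, of what selecting executor Pods by the namespace/label triad and removing the ones above a minimum quota could look like; the namespace, labels and quota are assumed values.

```python
# Sketch only: select executor Pods by selector and scale down to a quota.
from kubernetes import client, config

config.load_kube_config()                     # or load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

namespace = "tenant-a"                                  # assumed tenant namespace
label_selector = "spark-role=executor,farm=tenant-a"    # assumed labels
min_executors = 5                                       # the Farm quota

pods = v1.list_namespaced_pod(namespace, label_selector=label_selector).items
excess = pods[min_executors:]                 # keep the first min_executors Pods
for pod in excess:
    # The farm app (Spark here) must tolerate live removal of executors.
    v1.delete_namespaced_pod(pod.metadata.name, namespace)
```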

Farm Kube Operator (continued)

Farm Resource:
● collects Pods with a given selector
● implements scale-down
● defines a minimum number of executors (quota)
● reconciles on selected Pod events

FarmManager Resource:
● reconciles on Farm events
● scales down Farms over quota only if some other Farm requests resources and is below its quota
● simple algorithm: the number of killed Pods per Farm is proportional to the number of Pods over the quota (should be improved); a small sketch of this rule follows below
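A small arithmetic sketch of the “simple algorithm” above, with the number of Pods killed per Farm proportional to how far each Farm is over its quota; the farm names and numbers are made up for illustration.

```python
# Sketch only: proportional scale-down across over-quota Farms.
def pods_to_kill(over_quota, needed):
    """over_quota: farm -> #Pods above its quota; needed: total Pods to free."""
    total_over = sum(over_quota.values())
    if total_over == 0:
        return {farm: 0 for farm in over_quota}
    return {farm: round(needed * over / total_over)
            for farm, over in over_quota.items()}

# A Farm below quota requests 6 executors; two Farms are over quota by 8 and 4.
print(pods_to_kill({"farm-a": 8, "farm-b": 4}, needed=6))  # {'farm-a': 4, 'farm-b': 2}
```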

OCCAM HPC facility at the University of Turin

● managed using container-based, cloud-like technologies
● computing applications run on Virtual Clusters deployed on top of the physical infrastructure

OCCAM SPECS

2 Management nodes
● CPU - 2x Intel® Xeon® E5-2640 v3, 8 core / 2.6 GHz
● RAM - 64 GB / 2133 MHz
● DISK - 2x HDD 1 TB RAID0
● NET - IB 56 Gb + 2x 10 Gb + 4x 1 Gb
● FORM FACTOR - 1U

4 Fat nodes
● CPU - 4x Intel® Xeon® E7-4830 v3, 12 core / 2.1 GHz
● RAM - 768 GB / 1666 MHz (48 x 16 GB) DDR4
● DISK - 1x SSD 800 GB + 1x HDD 2 TB 7200 rpm
● NET - IB 56 Gb + 2x 10 Gb

4 GPU nodes
● CPU - 2x Intel® Xeon® E5-2680 v3, 12 core / 2.1 GHz
● RAM - 128 GB / 2133 MHz (8 x 16 GB) DDR4
● DISK - 1x SSD 800 GB SAS 6 Gbps 2.5”
● NET - IB 56 Gb + 2x 10 Gb
● GPU - 2x NVIDIA K40 on PCI-E Gen3 x16

32 Light nodes
● CPU - 2x Intel® Xeon® E5-2680 v3, 12 core / 2.5 GHz
● RAM - 128 GB / 2133 MHz (8 x 16 GB)
● DISK - SSD 400 GB SATA 1.8”
● NET - IB 56 Gb + 2x 10 Gb
● FORM FACTOR - high density (4 nodes per RU)

Scaling tests #1

● Optimize #cores per executor
● Model: MLlib MPC and GBT, 1M events
● One machine (t2-mlwn-01.to.infn.it)
● In the ‘literature’, #cores = 5 is the magic number to achieve maximum HDFS throughput
● Result: #cores = 5 per executor is optimal
● GBT does not scale well: expected, since GBT training is hard to parallelise

Scaling tests #2

● Optimize #executors
● #cores/executor = 5
● Model: MLlib MPC, 1M and 10M events
● One machine

Scaling tests #3

● Scaling on homogeneous resources
○ bare metal, 4 machines with 56 cores and 260 GB RAM

ML models && lessons learned 2

[Plots: GBT (fast), MPC, GBT (slow) and Keras Sequential training results.]
