Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* for Xeon Phi Cluster
Performance Optimization of Deep Learning Frameworks on Modern Intel Architectures
ElMoustapha Ould-Ahmed-Vall, AG Ramesh, Vamsi Sripathi and Karthik Raman
Representing the work of many at Intel

Agenda
• Optimization matters on modern architectures
• Intel's recent Xeon and Xeon Phi products
• Introduction to Deep Learning
• Optimizing DL frameworks on IA
  o Key challenges
  o Optimization techniques
  o Performance data
  o DL scaling

Moore's Law Goes On!
Increasing clock speeds have given way to more cores + wider SIMD (hierarchical parallelism).

Combined Amdahl's Law for Vector Multicores
Speedup = (1 / (F_serial + (1 - F_serial)/NumCores)) * (1 / (F_scalar + (1 - F_scalar)/VectorLength))
Goal: reduce the serial fraction and the scalar fraction of the code.
Ideal speedup: NumCores * VectorLength (requires zero scalar and zero serial work).

Compute-Bound Performance
Most kernels of ML codes are compute bound, i.e. raw FLOPS matter.

Roofline Model
Attainable Gflops/s = min(Peak Gflops/s, Stream BW * flops/byte)
[Figure: roofline of attainable Gflops/s vs. compute intensity (flops/byte), with ceilings at peak "compute" Gflops/s with and without SIMD.]
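As a concrete illustration of the two formulas above, the short Python sketch below plugs hypothetical numbers into them; the core count, vector length, serial/scalar fractions, peak Gflops/s and stream bandwidth are made-up values for illustration, not measurements from this talk.

```python
# Minimal sketch of the combined Amdahl's law and roofline bounds (illustrative numbers only).

def amdahl_vector_multicore(f_serial, f_scalar, num_cores, vector_length):
    """Combined Amdahl speedup: core scaling limited by the serial fraction,
    SIMD scaling limited by the scalar fraction."""
    core_term = 1.0 / (f_serial + (1.0 - f_serial) / num_cores)
    simd_term = 1.0 / (f_scalar + (1.0 - f_scalar) / vector_length)
    return core_term * simd_term

def roofline_gflops(peak_gflops, stream_bw_gbs, flops_per_byte):
    """Attainable Gflops/s = min(peak Gflops/s, stream BW * compute intensity)."""
    return min(peak_gflops, stream_bw_gbs * flops_per_byte)

if __name__ == "__main__":
    # Hypothetical 68-core machine with 16-wide SP SIMD, 1% serial and 5% scalar code.
    print(amdahl_vector_multicore(0.01, 0.05, num_cores=68, vector_length=16))   # ~372x
    # Zero serial and zero scalar work gives the ideal NumCores * VectorLength.
    print(amdahl_vector_multicore(0.0, 0.0, num_cores=68, vector_length=16))     # 1088x
    # Hypothetical kernel: 6000 Gflop/s peak, 450 GB/s bandwidth, 10 flops/byte.
    print(roofline_gflops(6000.0, 450.0, 10.0))                                  # bandwidth bound: 4500
```

Even these small serial and scalar fractions cut the ideal 1088x speedup to roughly a third of it, which is why reducing both fractions is the stated goal.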
Overview of Current Generation of Intel Xeon and Xeon Phi Products

Current Intel® Xeon Platforms
• Nehalem (45nm): NEW Intel® microarchitecture (Nehalem) – TOCK
• Westmere (32nm): Intel® microarchitecture (Nehalem) – TICK
• Sandy Bridge (32nm): NEW Intel® microarchitecture (Sandy Bridge) – TOCK
• Ivy Bridge (22nm): Intel® microarchitecture (Sandy Bridge) – TICK
• Haswell (22nm): NEW Intel® microarchitecture (Haswell) – TOCK
• Broadwell (14nm): Intel® microarchitecture (Haswell) – TICK

Latest released – Broadwell (14nm process)
• Intel's foundation of HPC and ML performance
• Suited for the full scope of workloads
• Industry-leading performance/watt for serial & highly parallel workloads
• Up to 22 cores/socket (Broadwell-EP), with Hyper-Threading technology
Software optimization helps maximize the benefit and adoption of new features.

2nd Generation Intel® Xeon Phi™ Platform

Intel® AVX Technology
• SNB/IVB: 256-bit AVX – 16 SP / 8 DP flops/cycle
• HSW/BDW: 256-bit AVX2 – 32 SP / 16 DP flops/cycle (FMA)
• SKX & KNL: 512-bit AVX-512 – 64 SP / 32 DP flops/cycle (FMA)
• AVX: 256-bit basic FP, 16 registers, NDS (and AVX128), improved blend, MASKMOV, implicit unaligned
• AVX2: Float16 (IVB 2012), 256-bit FP FMA, 256-bit integer, PERMD, gather
• AVX-512: 512-bit FP/integer, 32 registers, 8 mask registers, embedded rounding, embedded broadcast, scalar/SSE/AVX "promotions", native media additions, HPC additions, transcendental support, gather/scatter

Overview of Deep Learning and DL Frameworks

Deep Learning – Convolutional Neural Network
Convolution parameters:
• Number of outputs/feature maps: 4
• Filter size: 3 x 3
• Stride: 2
• Pad size (for the corner case): 1

Deep Learning: Train Once, Use Many Times
• Step 1: Training (over hours/days/weeks): input data is used to create a deep neural network and produce a trained model (e.g., output classification: 90% person, 8% traffic light).
• Step 2: Inference (real time): new input from cameras and sensors is scored against the trained neural network model (e.g., output classification: 97% person).

Deep Learning: Why Now?
• Bigger data: image ~1,000 KB per picture; audio ~5,000 KB per song; video ~5,000,000 KB per movie.
• Better hardware: transistor density doubles every 18 months; cost/GB in 1995: $1000.00; cost/GB in 2015: $0.03.
• Smarter algorithms: advances in algorithm innovation, including neural networks, leading to better accuracy in training models.

Intel Caffe – ML Framework Optimized for Xeon and Xeon Phi Products
• Fork of BVLC Caffe by Intel to optimize for IA
• Leverages the Intel MKL Deep Neural Network (DNN) APIs
• Optimized for BDW (AVX2) and KNL (MIC_AVX512)
• https://github.com/intel/caffe
[Figure: data layer (LMDB, LevelDB, HDF5) feeding compute layers (convolution, ReLU, pooling) built on Intel MKL DNN, running on BDW and KNL.]

TensorFlow™: Open Source ML Framework (Google)
• Computation is a dataflow graph with tensors
• General mathematical computing framework, widely used for deep neural networks, other machine learning algorithms and HPC applications
• Key computational kernels; extendable user operations
• Core in C++, front-end wrapper in Python
• Multi-node support using gRPC (Google Remote Procedure Calls)
(Dataflow-graph example from Jeff Dean's presentation.)

Optimizing Deep Learning Frameworks

Performance Optimization on Modern Platforms
Hierarchical parallelism:
• Coarse-grained parallelism (multi-node): domain decomposition into sub-domains – 1) multi-level domain decomposition (e.g., across layers), 2) data decomposition (layer parallelism). Scaling: improve load balancing; reduce synchronization events, serial code and all-to-all comms.
• Fine-grained parallelism (within a node):
  o Utilize all the cores: OpenMP, MPI, TBB...; reduce synchronization events and serial code; improve load balancing.
  o Vectorize/SIMD: unit-strided access per SIMD lane; high vector efficiency; data alignment.
  o Efficient memory/cache use: blocking; data reuse; prefetching; memory allocation.

Intel Strategy: Optimized Deep Learning Environment
• Fuel the development of vertical solutions
• Intel® Deep Learning SDK: accelerate design, training and deployment
• Drive optimizations across open-source deep learning frameworks
• Intel® Math Kernel Library (Intel® MKL) and Intel® MKL-DNN: maximum performance on Intel architecture
• Intel® Omni-Path Architecture (Intel® OPA): deliver the best single-node and multi-node performance
• Covers both training and inference

Example Challenge 1: Data Layout Has Big Impact on Performance
• Data layout impacts performance:
  o Sequential access to avoid gather/scatter
  o Have iterations in the innermost loop to ensure high vector utilization
  o Maximize data reuse, e.g. weights in a convolution layer
• Converting to/from an optimized layout is sometimes less expensive than operating on an unoptimized layout.
[Figure: the same 2-D data stored in its native layout vs. a blocked layout that is better optimized for some operations.]
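To make the layout point concrete, here is a minimal NumPy sketch, not Intel Caffe's actual code: it reorders an NCHW activation tensor into a channel-blocked layout so that a 16-element channel block becomes the contiguous innermost axis and can be processed with unit-strided, SIMD-friendly access. The tensor shape and the block size of 16 are illustrative assumptions (the blocking mirrors the nChw16c-style layouts used by MKL-DNN), and the function names are made up for this example.

```python
# Hedged sketch: converting NCHW activations to a channel-blocked layout (nChw16c-like).
import numpy as np

def nchw_to_blocked(x, block=16):
    """Reorder NCHW -> [N][C/block][H][W][block] so a channel block is the contiguous
    innermost axis. Assumes C is a multiple of `block` to keep the example short."""
    n, c, h, w = x.shape
    assert c % block == 0, "illustrative sketch: channel remainder padding omitted"
    return (x.reshape(n, c // block, block, h, w)
             .transpose(0, 1, 3, 4, 2)   # move the channel block innermost
             .copy())                    # materialize it contiguously

def blocked_to_nchw(xb):
    """Inverse conversion back to the framework's native NCHW layout."""
    n, cb, h, w, block = xb.shape
    return xb.transpose(0, 1, 4, 2, 3).reshape(n, cb * block, h, w).copy()

if __name__ == "__main__":
    x = np.random.rand(1, 32, 8, 8).astype(np.float32)   # hypothetical activation tensor
    xb = nchw_to_blocked(x)
    print(xb.shape)                                       # (1, 2, 8, 8, 16)
    assert np.array_equal(blocked_to_nchw(xb), x)         # the round trip is lossless
```

Because each conversion is itself a copy, the natural next step is to convert as rarely as possible and keep tensors in the optimized layout across consecutive layers, which is exactly the next challenge.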
Example Challenge 2: Minimize Conversions Overhead
• End-to-end optimization can reduce conversions
• Staying in the optimized layout as long as possible becomes one of the tuning goals
• Minimize the number of back-and-forth conversions
• Use graph optimization techniques
[Figure: a convolution – max pool – convolution chain in which each layer converts from the native layout to the MKL layout and back; keeping the data in the MKL layout across the chain removes the intermediate conversions.]

Example Challenge 3: Ensuring Enough Parallelism to Leverage All Cores
• Maximize parallelism to use all cores efficiently:
  o Intra operation/layer parallelism within operators (OpenMP), e.g. convolution of tiles in parallel.
  o Inter operation parallelism across operators, e.g. parallel execution of the 1x1, 3x3 and 5x5 convolutions feeding a concat layer.

Example Challenge 4: Optimizing the Data Layer
• The data layer comprises 3 major ops:
  o Read data
  o Decode data: e.g. JPEG decode, decompression
  o Transform data
• The result of read, decode & transform is the input to the DNN layers.
• Reduce the number of cores dedicated to feeding the DNN:
  o IO optimization: consider compression
  o Decode: consider LMDB instead of JPEG
  o Resizing/data processing: consider pre-processing
  o Then vectorize and parallelize
[Figure: cores C0 and C1 run Boost threads for the data layer while cores C2 ... Cn-1 run OpenMP threads for the DNN.]

Optimizing Deep Learning Frameworks for Intel® Architecture
• Leverage high-performance compute libraries and tools, e.g. Intel® Math Kernel Library, Intel® Python, Intel® Compiler, etc.
• Data format/shape: the right format/shape for maximum performance: blocking, gather/scatter
• Data layout: minimize the cost of data layout conversions
• Parallelism: use all cores; eliminate serial sections and load imbalance
• Other functions/primitives (un-optimized in libraries): optimize via compiler knobs; improve existing implementations
• Memory allocation: unique characteristics and the ability to reuse buffers
• Data layer optimizations: parallelization, vectorization, IO
• Optimize hyperparameters: e.g. batch size for more parallelism; learning rate and optimizer to ensure accuracy/convergence

AlexNet Optimization Progression
[Chart: cumulative speedup over the baseline after each optimization step on Broadwell and Knights Landing, growing from 1.00x at the baseline to a maximum of 49.07x.]

VGG Optimization Progression
[Chart: cumulative speedup over the baseline after each optimization step – baseline, MKL integration, thread optimization, compiler knobs tuning, matrix transpose/data transformations, memory allocations, conversions optimization – on Broadwell and Knights Landing, growing from 1.00x to a maximum of 273.50x.]

Configuration details
• Intel® Xeon® processor E5-2699 v4 (22 cores, 2.2 GHz), 128 GB DDR memory, CentOS 7.2 based on Red Hat* Enterprise Linux 7.2
• Intel® Xeon Phi™ processor 7250 (68 cores, 1.4 GHz, 16 GB MCDRAM: flat mode), 96 GB DDR memory, CentOS 7.2 based on Red Hat* Enterprise Linux 7.2
• AlexNet and VGG benchmarks: https://github.com/soumith/convnet-benchmarks

Multi-Node Distributed Training
• Model parallelism: break the model into N nodes; the same data is in all the nodes.
• Data parallelism: break the dataset into N nodes; the same model is in all the nodes. Good for networks with few weights, e.g. GoogLeNet.
• You can use either model or data parallelism, or a hybrid of both.
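The following is a minimal sketch of the data-parallel scheme, not Intel's actual multi-node implementation: every worker holds the same model, computes gradients on its own shard of the data, and the gradients are averaged before the shared weights are updated. The toy least-squares model, the worker count and the learning rate are made-up illustration values.

```python
# Hedged sketch of data parallelism: same model everywhere, data sharded across workers,
# gradients averaged before each update (the mean stands in for an allreduce across nodes).
import numpy as np

def local_gradient(weights, x_shard, y_shard):
    """Least-squares gradient on one worker's shard of the data (toy model)."""
    err = x_shard @ weights - y_shard
    return x_shard.T @ err / len(x_shard)

def data_parallel_step(weights, x, y, num_workers=4, lr=0.1):
    """One synchronous SGD step: shard the data, compute per-worker gradients, average, update."""
    x_shards = np.array_split(x, num_workers)
    y_shards = np.array_split(y, num_workers)
    grads = [local_gradient(weights, xs, ys) for xs, ys in zip(x_shards, y_shards)]
    return weights - lr * np.mean(grads, axis=0)   # the "allreduce" + weight update

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x, true_w = rng.normal(size=(256, 8)), rng.normal(size=8)
    y = x @ true_w
    w = np.zeros(8)
    for _ in range(200):
        w = data_parallel_step(w, x, y)
    print(np.allclose(w, true_w, atol=1e-2))   # converges toward the true weights
```

In a real multi-node run the gradient averaging is performed by communication (e.g. MPI or gRPC) between nodes rather than a local `np.mean`, and its cost is what the scaling-efficiency measurements below exercise.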
Data Parallelism Scaling Efficiency: Intel® Xeon Phi™ Processor
Deep learning image classification training performance: multi-node scaling.
[Chart: scaling efficiency (%) for OverFeat, AlexNet, VGG-A and GoogLeNet across 1, 2, 4, 8, 16, 32, 64 and 128 Intel® Xeon Phi™ processor 7250 (68 cores, 1.4 GHz, 16 GB) nodes; labeled efficiencies include 87% and 62%.]

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.