TensorFlow* on Modern Intel® Architectures
Jing Huang and Vivek Rane, Artificial Intelligence Product Group, Intel

TensorFlow* on CPU Has Been Very Slow
https://www.tensorflow.org/install/install_linux
UNTIL TODAY: up to 72x speedup in training and 86x speedup in inference, up-streamed and ready to use!

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

Agenda
• Deep Learning & TensorFlow
• Optimizing TensorFlow on Intel® Architecture
• Summary & Call to Action
• Tutorial on Cray (Cori) systems

Deep Learning: Convolutional Neural Network
Convolution parameters:
• Number of outputs/feature maps: 4
• Filter size: 3 x 3
• Stride: 2
• Pad size (for corner cases): 1

Deep Learning: Train Once, Use Many Times
• Step 1: Training (over hours/days/weeks): input data is used to create a deep network and train a model; output classification, e.g. 90% person, 8% traffic light.
• Step 2: Inference (real time): new input from cameras and sensors is run through the trained neural network model; output classification, e.g. 97% person.

Deep Learning: Why Now?
• Bigger data: image, 1,000 KB/picture; audio, 5,000 KB/song; video, 5,000,000 KB/movie. Cost per GB in 1995: $1000.00; in 2015: $0.03.
• Better hardware: transistor density doubles every 18 months.
• Smarter algorithms: advances in algorithm innovation, including neural networks, lead to better accuracy when training models.

TensorFlow
• 2nd-generation open-source machine learning framework from Google*
• Widely used across Google in many key apps: Search, Gmail, Photos, Translate, etc.
• General mathematical computing framework, used for:
  • Deep neural networks
  • Other machine learning algorithms
  • HPC applications
• Core system provides a set of key computational kernels, extendable with user ops
• Core is in C++; a Python front-end wrapper specifies and drives the computation
• Multi-node support using Google's gRPC protocol

TensorFlow: Computation Is a Dataflow Graph with Tensors
• Google's open-source machine learning framework
• https://github.com/tensorflow/tensorflow
• Example graph (from Jeff Dean's presentation): Input and Weights feed a MatMul node; Add combines the result with Biases; ReLU applies the nonlinearity; Xent computes the loss against Labels.

Agenda
• Deep Learning & TensorFlow
• Optimizing TensorFlow on Intel® Architecture
  • Why Optimize
  • Optimization Challenges & Techniques Used
  • Performance Results
• Summary & Call to Action
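The example graph (MatMul → Add → ReLU → Xent) can be sketched in plain NumPy to show the computation it encodes. This is an illustration only, not TensorFlow code; shapes and the seed are arbitrary:

```python
import numpy as np

def softmax_xent(logits, labels):
    # Numerically stable softmax cross-entropy, averaged over the batch.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1).mean()

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4)).astype(np.float32)   # Input
w = rng.standard_normal((4, 3)).astype(np.float32)   # Weights
b = np.zeros(3, dtype=np.float32)                    # Biases

hidden = np.maximum(x @ w + b, 0.0)                  # MatMul -> Add -> ReLU
labels = np.eye(3, dtype=np.float32)[rng.integers(0, 3, size=8)]
loss = softmax_xent(hidden, labels)                  # Xent against Labels
print(loss)
```

In TensorFlow the same nodes are declared first and executed later as a dataflow graph, which is what gives the runtime freedom to optimize their layout and placement.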
Optimization Matters on Modern Architectures with High Core Counts and Wide SIMD Vectors

Moore's Law Goes On!
Instead of increasing clock speeds: more cores + wider SIMD (hierarchical parallelism).

Combined Amdahl's Law for Vector Multicores
\[
\mathrm{Speedup} = \frac{1}{\mathit{Serial}_{frac} + \dfrac{1 - \mathit{Serial}_{frac}}{\mathit{NumCores}}} \times \frac{1}{\mathit{Scalar}_{frac} + \dfrac{1 - \mathit{Scalar}_{frac}}{\mathit{VectorLength}}}
\]
Goal: reduce both the serial fraction and the scalar fraction of the code.

Compute-Bound Performance
Most kernels of machine learning codes are compute bound, i.e. raw FLOPS matter.
Roofline model: Attainable GFLOP/s = min(Peak GFLOP/s, Stream BW × flops/byte).
[Figure: roofline plot of attainable GFLOP/s vs. compute intensity (flops/byte), showing peak compute with and without SIMD.]

Optimizing TensorFlow and Deep Learning Workloads
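The two formulas above can be combined into a few lines of Python (function and parameter names are mine, not from the slides):

```python
def combined_amdahl_speedup(serial_frac, scalar_frac, num_cores, vector_length):
    """Combined Amdahl's law for vector multicores: the core term and the
    SIMD term multiply, so a small serial or scalar fraction hurts both."""
    core_term = 1.0 / (serial_frac + (1.0 - serial_frac) / num_cores)
    simd_term = 1.0 / (scalar_frac + (1.0 - scalar_frac) / vector_length)
    return core_term * simd_term

def roofline_gflops(peak_gflops, stream_bw_gb_s, flops_per_byte):
    """Roofline model: attainable GFLOP/s is capped by either peak compute
    or memory bandwidth times compute intensity."""
    return min(peak_gflops, stream_bw_gb_s * flops_per_byte)

# Fully parallel, fully vectorized code on 22 cores with 8-wide SIMD:
print(combined_amdahl_speedup(0.0, 0.0, 22, 8))
# Even 5% serial / 10% scalar code loses most of that ideal 176x:
print(combined_amdahl_speedup(0.05, 0.10, 22, 8))
```

This is why the slides emphasize reducing both fractions: the two terms multiply, so leaving either one large wastes the whole machine.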
Performance Optimization on Modern Platforms: Hierarchical Parallelism

Coarse-grained parallelism (multi-node), scaling via domain decomposition:
• Multi-level domain decomposition (e.g. across layers)
• Data decomposition (layer parallelism)
• Improve load balancing; reduce synchronization events and all-to-all comms

Fine-grained parallelism (within a node):
• Utilize all the cores: OpenMP, MPI, TBB…; reduce synchronization events and serial code; improve load balancing
• Vectorize/SIMD: unit-strided access per SIMD lane; high vector efficiency; data alignment
• Efficient memory/cache use: blocking; data reuse; prefetching; memory allocation

Intel Strategy: Optimized Deep Learning Environment
• Fuel the development of vertical solutions: Intel® Nervana™ Cloud accelerates design, training, and deployment
• Drive optimizations across open-source machine learning frameworks
• Deliver maximum single-node and multi-node performance on Intel architecture, for both training and inference, via Intel® Math Kernel Library (Intel® MKL), Intel® MKL-DNN, and Intel® Nervana™ Graph

Example Challenge 1: Data Layout Has Big Impact on Performance
• Data layouts impact performance:
  • Sequential access avoids gather/scatter
  • Keep iterations in the innermost loop to ensure high vector utilization
  • Maximize data reuse, e.g. weights in a convolution layer
• Converting to/from an optimized layout is sometimes less expensive than operating on an unoptimized layout
[Figure: the same 2-D array stored contiguously row by row (A1 A2 … B1 …, then A'1 A'2 …) versus interleaved (A1 A'1 A2 A'2 … B1 B'1 …); one layout is better optimized for some operations.]

Example Challenge 2: Minimize Conversions Overhead
• End-to-end optimization can reduce conversions
• Staying in the optimized layout as long as possible becomes one of the tuning goals
• Minimize the number of back-and-forth conversions
• Use graph-optimization techniques
[Figure: a graph of Convolution → Max Pool → Convolution, where each op is wrapped in "Native to MKL layout" and "MKL layout to Native" conversion nodes; graph optimization removes the redundant intermediate conversions.]

Optimizing TensorFlow & Other DL Frameworks for Intel® Architecture
• Leverage high-performance compute libraries and tools, e.g. Intel® Math Kernel Library, Intel® Distribution for Python, Intel® Compiler, etc.
• Data format/shape: the right format/shape for maximum performance; blocking, gather/scatter
• Data layout: minimize the cost of data-layout conversions
• Parallelism: use all cores; eliminate serial sections and load imbalance
• Memory allocation: unique characteristics; ability to reuse buffers
• Data-layer optimizations: parallelization, vectorization, I/O
• Optimize hyperparameters: e.g. batch size for more parallelism; learning rate and optimizer to ensure accuracy/convergence
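A minimal NumPy sketch of the kind of channel-blocked layout Challenge 1 describes: MKL-DNN names this family of formats nChw16c, but the helper function and blocking factor here are illustrative, not a library API:

```python
import numpy as np

def nchw_to_blocked(x, block=16):
    """Convert NCHW to a channel-blocked layout so that `block` adjacent
    channels become unit-strided, matching SIMD register width."""
    n, c, h, w = x.shape
    assert c % block == 0, "pad channels to a multiple of the block size"
    # (N, C, H, W) -> (N, C//block, block, H, W) -> (N, C//block, H, W, block)
    blocked = x.reshape(n, c // block, block, h, w).transpose(0, 1, 3, 4, 2)
    # Materialize the new memory order; transpose alone is only a view.
    return np.ascontiguousarray(blocked)

x = np.arange(2 * 32 * 4 * 4, dtype=np.float32).reshape(2, 32, 4, 4)
blocked = nchw_to_blocked(x)
print(blocked.shape)  # 16 adjacent channels are now contiguous in memory
```

A vectorized convolution kernel can then load 16 channels of one pixel with a single aligned SIMD load, which is exactly the unit-strided access the slide calls for.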
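The graph-optimization idea in Challenge 2 can be sketched as a peephole pass over a linear op list: a conversion back to native layout immediately followed by a conversion into MKL layout is a no-op pair and can be dropped. The op names here are hypothetical labels, not TensorFlow node types:

```python
def drop_redundant_conversions(ops):
    """Remove back-to-back layout conversions that cancel each other out."""
    out = []
    for op in ops:
        if out and out[-1] == "mkl_to_native" and op == "native_to_mkl":
            out.pop()  # the pair converts back and forth: drop both ops
        else:
            out.append(op)
    return out

# Conv -> Max Pool -> Conv, each naively wrapped in layout conversions:
graph = ["native_to_mkl", "conv", "mkl_to_native",
         "native_to_mkl", "maxpool", "mkl_to_native",
         "native_to_mkl", "conv", "mkl_to_native"]
print(drop_redundant_conversions(graph))
# -> ['native_to_mkl', 'conv', 'maxpool', 'conv', 'mkl_to_native']
```

After the pass, the data stays in the optimized layout across the whole Conv → Max Pool → Conv chain, with a single conversion at each end, which is the tuning goal the slide states.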
Initial Performance Gains on Modern Xeon (2-Socket Broadwell, 22 Cores)

Benchmark             Metric      Batch  Baseline  Baseline   Optimized  Optimized  Speedup   Speedup
                                  Size   Training  Inference  Training   Inference  Training  Inference
ConvNet-AlexNet       images/sec  128    33.52     84.2       524        1696       15.6x     20.2x
ConvNet-GoogleNet v1  images/sec  128    16.87     49.9       112.3      439.7      6.7x      8.8x
ConvNet-VGG           images/sec  64     8.2       30.7       47.1       151.1      5.7x      4.9x

• Baseline: TensorFlow 1.0 release with standard compiler knobs
• Optimized: TensorFlow with Intel optimizations, built with
  bazel build --config=mkl --copt="-DEIGEN_USE_VML"
