DEEP LEARNING
Alison B Lowndes, Deep Learning Solutions Architect & Community Manager | EMEA

THE BIG BANG IN MACHINE LEARNING
DNN BIG DATA GPU
“ Google’s AI engine also reflects how the world of computer hardware is changing. (It) depends on machines equipped with GPUs… And it depends on these chips more than the larger tech universe realizes.”
Progression from classical to automated
WHAT IS DEEP LEARNING?
[Image: “Volvo XC90”]
Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”, ICML 2009 & Comm. ACM 2011. Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng.

Machine Learning Software
Training
• Forward propagation: the network maps an input image to class scores (Tree, Cat, Dog); at first it wrongly predicts “turtle”.
• Backward propagation: compute a weight update to nudge the prediction from “turtle” towards “dog”.
• Repeat over many examples; the result is a Trained Model.

Inference
• The trained model classifies new inputs, e.g. labelling a new image “cat”.
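The forward/backward/repeat loop above can be sketched in a few lines of numpy: a toy softmax classifier (a hypothetical 4-class setup with made-up features, not the network from the slides) whose forward pass produces class probabilities and whose backward pass computes the weight update that nudges the prediction towards “dog”.

```python
import numpy as np

# Hypothetical 4-class toy classifier: tree, cat, dog, turtle.
classes = ["tree", "cat", "dog", "turtle"]
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 8))   # weights: 4 classes x 8 input features
b = np.zeros(4)

def forward(x):
    """Forward propagation: class scores -> softmax probabilities."""
    z = W @ x + b
    e = np.exp(z - z.max())              # numerically stable softmax
    return e / e.sum()

x = rng.normal(size=8)                   # stand-in for image features
target = classes.index("dog")

for step in range(200):
    p = forward(x)
    # Backward propagation for softmax + cross-entropy: dL/dz = p - one_hot(target)
    dz = p.copy()
    dz[target] -= 1.0
    # Weight update nudging the prediction towards "dog"
    W -= 0.5 * np.outer(dz, x)
    b -= 0.5 * dz

print(classes[int(np.argmax(forward(x)))])   # prediction after training
```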
Convolutional Neural Networks: Stacked Repeating Triplets
Convolutions → Activation (point-wise ReLU) → Max Pooling (block-wise max)
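One such triplet can be sketched in numpy; the single-channel 8x8 image and the 2x2 edge-detector kernel are toy values chosen for illustration, not taken from any real network.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Dot product of the kernel with every image patch ('valid' convolution)."""
    kh, kw = kernel.shape
    h = img.shape[0] - kh + 1
    w = img.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

def relu(x):
    """Point-wise ReLU activation."""
    return np.maximum(x, 0.0)

def max_pool(x, k=2):
    """Block-wise max over non-overlapping k x k blocks."""
    h, w = x.shape[0] // k, x.shape[1] // k
    return x[:h*k, :w*k].reshape(h, k, w, k).max(axis=(1, 3))

rng = np.random.default_rng(1)
img = rng.normal(size=(8, 8))
edge = np.array([[1.0, -1.0], [1.0, -1.0]])   # toy edge-detector kernel

feat = max_pool(relu(conv2d_valid(img, edge)))
print(feat.shape)   # (3, 3)
```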
Recurrent Neural Networks
• x: input; h: hidden state vector; y: output
• f maps the input x_t and the previous hidden state h_t-1 into the new hidden state h_t; f can be a huge feed-forward network
• g maps the hidden state h_t into the output y_t
• Unfolding through time gives a “classical” feed-forward network with shared weights f
• Most commonly trained by back-propagation
• How to put f’s weight updates together: sum with a much smaller learning rate, or averaging
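The recurrence can be sketched in numpy: the same shared-weight f is applied at every time step, and g reads the output from the hidden state. Shapes, weights and the length-4 input sequence are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_hid, n_out = 3, 5, 2
Wxh = rng.normal(scale=0.3, size=(n_hid, n_in))
Whh = rng.normal(scale=0.3, size=(n_hid, n_hid))
Why = rng.normal(scale=0.3, size=(n_out, n_hid))

def f(x_t, h_prev):
    """f: maps the input and the previous hidden state into the new hidden state."""
    return np.tanh(Wxh @ x_t + Whh @ h_prev)

def g(h_t):
    """g: maps the hidden state into the output."""
    return Why @ h_t

# Unfold through time: the same shared-weight f is applied at every step.
h = np.zeros(n_hid)
xs = rng.normal(size=(4, n_in))        # a length-4 input sequence
ys = []
for x_t in xs:
    h = f(x_t, h)
    ys.append(g(h))
print(len(ys), ys[0].shape)   # 4 (2,)
```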
Slide courtesy of P Molchanov, NVIDIA

Long Short-Term Memory (LSTM)
Hochreiter (1991) analysed the vanishing gradient problem; “LSTM falls out of this almost naturally” (Schmidhuber)
LSTM:
• Input gate, forget gate and output gate
• Training via backprop unfolded in time
• Long time dependencies are preserved while the input gate is closed (-) and the forget gate is open (O)
• Gates control the importance of the corresponding activations
Fig from Graves, Schmidhuber et al, Supervised Sequence Labelling with RNNs; Fig from Vinyals et al, Google, April 2015, NIC Generator
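One LSTM step can be written out in numpy to show how the gates gate the cell state; the stacked-gate weight layout and toy sizes below are one common convention, chosen here for brevity rather than taken from the slide.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: input, forget and output gates control the cell state."""
    n = h_prev.size
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0*n:1*n])      # input gate: admit new information
    f = sigmoid(z[1*n:2*n])      # forget gate: keep (open) or erase old cell state
    o = sigmoid(z[2*n:3*n])      # output gate: expose the cell state
    g = np.tanh(z[3*n:4*n])      # candidate cell update
    c = f * c_prev + i * g       # long dependencies survive while f ~ 1 and i ~ 0
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(3)
n_in, n_hid = 4, 6
W = rng.normal(scale=0.2, size=(4 * n_hid, n_in + n_hid))
b = np.zeros(4 * n_hid)

h = np.zeros(n_hid)
c = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):   # run a length-5 input sequence
    h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)   # (6,) (6,)
```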
Slide courtesy of P Molchanov, NVIDIA

GRUs
Bengio et al, NIPS15
http://www.ausy.tu-darmstadt.de/

Systap: GPUs are a game changer for graph analytics
• Fraud detection
• Time series analysis
• Compliance
http://images.nvidia.com/events/sc15/SC5122-overcoming-barriers-graphs-gpus.html

Marvin Weinstein: Dynamic Quantum Clustering
http://arxiv.org/ftp/arxiv/papers/1310/1310.2700.pdf

Deep Learning Platform
CUDA FOR DEEP LEARNING DEVELOPMENT
DEEP LEARNING SDK
DIGITS cuDNN cuSPARSE cuBLAS NCCL
TITAN X DEVBOX GPU CLOUD
NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.

cuDNN: Powering Deep Learning
Applications
Frameworks
mocha.jl, KERAS, Deeplearning4j, CNTK
cuDNN
TensorFlow
www.tensorflow.org

Torch7
Circa 2000 to now; Torch7 is the 4th version (versions use odd numbers only: 1, 3, 5, 7)
Web-scale learning in speech, image and video applications
Maintained by top researchers:
• Ronan Collobert - Research Scientist @ Facebook
• Clement Farabet - Senior Software Engineer @ Twitter
• Koray Kavukcuoglu - Research Scientist @ Google DeepMind
• Soumith Chintala - Research Engineer @ Facebook
Cheatsheet
Cheatsheet: https://github.com/torch/torch7/wiki/Cheatsheet
Github: https://github.com/torch/torch7
Google Group for new users and installation queries: https://groups.google.com/forum/embed/?place=forum%2Ftorch7#!forum/torch7
Advanced only: https://gitter.im/torch/torch7
What is Caffe? An open framework for deep learning developed by the Berkeley Vision and Learning Center (BVLC)
• Pure C++/CUDA architecture
• Command line, Python, MATLAB interfaces
• Fast, well-tested code
• Pre-processing and deployment tools, reference models and examples
• Image data management
• Seamless GPU acceleration
• Large community of contributors to the open-source project
caffe.berkeleyvision.org | http://github.com/BVLC/caffe
Caffe features: Deep Neural Network training
Network training also requires no coding - just define a “solver” file:

net: “lenet_train.prototxt”
base_lr: 0.01
momentum: 0.9
max_iter: 10000
snapshot_prefix: “lenet_snapshot”
solver_mode: GPU    <- all you need to run things on the GPU

> caffe train -solver lenet_solver.prototxt -gpu 0
Multiple optimization algorithms available: SGD (+momentum), ADAGRAD, NAG

NVIDIA GPU: THE ENGINE OF DEEP LEARNING
WATSON CHAINER THEANO MATCONVNET
TENSORFLOW CNTK TORCH CAFFE
NVIDIA CUDA ACCELERATED COMPUTING PLATFORM Deep Learning Software
GPU Accelerated Libraries: “Drop-in” Acceleration for Your Applications
Linear Algebra (FFT, BLAS, SPARSE, Matrix): NVIDIA cuFFT, cuBLAS, cuSPARSE
Numerical & Math (RAND, Statistics): NVIDIA Math Lib, NVIDIA cuRAND
Data Structures & AI (Sort, Scan, Zero Sum, Path Finding): GPU AI - Board Games
Visual Processing (Image & Video): NVIDIA NPP, NVIDIA Video Encode

Caffe Performance
CUDA BOOSTS DEEP LEARNING: 5X PERFORMANCE IN 2 YEARS
AlexNet training throughput based on 20 iterations. CPU: 1x E5-2680v3 12 Core 2.5GHz, 128GB System Memory, Ubuntu 14.04

cuDNN: Deep Learning Primitives
• GPU-accelerated Deep Learning subroutines
• High performance neural network training
• Accelerates major Deep Learning frameworks: Caffe, Theano, Torch
• Up to 3.5x faster AlexNet training in Caffe than baseline GPU
• Tiled FFT up to 2x faster than FFT
IGNITING ARTIFICIAL INTELLIGENCE: Millions of Images Trained Per Day
developer.nvidia.com/cudnn 29 WHAT’S NEW IN CUDNN 4? Faster Training, Optimized for Inference
• Train neural networks up to 14x faster using Google's Batch Normalization technique
• Increase training and inference performance for convolution layers up to 2x with the new 2D Tiled-FFT algorithm (up to 2x faster on VGG layers)
• Accelerate inference for small batch sizes: up to 2x using convolution layers on Maxwell-based GPUs (up to 2x faster on AlexNet layers)
• Optimize for energy-efficient inference with 10x better performance/Watt on Jetson TX1
developer.nvidia.com/cudnn

NVIDIA DIGITS: Interactive Deep Learning GPU Training System
Process Data → Configure DNN → Monitor Progress → Visualize Layers
Test Image
developer.nvidia.com/digits

WHAT'S NEW IN DIGITS 3?
DIGITS 3 Improves Training Productivity with Enhanced Workflows
• Train neural network models with Torch support (preview)
• Save time by quickly iterating to identify the best model
• Easily manage multiple jobs to optimize use of system resources
• Active open source project with valuable community contributions
• New Results Browser!
developer.nvidia.com/digits

cuBLAS: Accelerated Linear Algebra for Deep Learning
Accelerated Level 3 BLAS - GEMM, SYMM, TRSM, SYRK - >3 TFlops Single Precision on a single K40
Multi-GPU BLAS support available in cuBLAS-XT
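The reason GEMM matters so much for deep learning: a fully-connected layer applied to a whole batch is a single matrix-matrix multiply (plus a broadcast bias), which is exactly what cuBLAS accelerates. A numpy sketch with made-up layer sizes:

```python
import numpy as np

rng = np.random.default_rng(4)
batch, n_in, n_out = 32, 256, 128
X = rng.normal(size=(n_in, batch))            # a batch of inputs, one per column
W = rng.normal(scale=0.05, size=(n_out, n_in))
b = rng.normal(size=(n_out, 1))

# The whole fully-connected layer over the batch is one GEMM
# (C = alpha * A @ B + beta * C in BLAS terms) plus a broadcast bias add.
Y = W @ X + b
print(Y.shape)   # (128, 32)
```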
developer.nvidia.com/cublas

cuSPARSE: (DENSE MATRIX) X (SPARSE VECTOR) Speeds up Natural Language Processing
y = A * x, where A is a dense matrix, x a sparse vector and y a dense vector
The sparse vector could be the frequencies of words in a text sample
cuSPARSE provides a full suite of accelerated sparse matrix functions
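The dense-matrix x sparse-vector product can be sketched in numpy by storing the sparse vector as (indices, values) and touching only the matching columns of the matrix; the vocabulary size and word counts below are illustrative, not from the slide.

```python
import numpy as np

rng = np.random.default_rng(5)
vocab, n_topics = 1000, 8
A = rng.normal(size=(n_topics, vocab))       # dense matrix

# Sparse vector as (indices, values): word frequencies of a short text sample,
# where only a handful of vocabulary words actually occur.
idx = np.array([3, 42, 97, 512])
val = np.array([2.0, 1.0, 5.0, 1.0])

# Dense-matrix x sparse-vector: only touch the columns with nonzero entries.
y_sparse = A[:, idx] @ val

# Same result as the full dense product, at a fraction of the work.
x_dense = np.zeros(vocab)
x_dense[idx] = val
assert np.allclose(y_sparse, A @ x_dense)
print(y_sparse.shape)   # (8,)
```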
developer.nvidia.com/cusparse

NCCL: Accelerating Multi-GPU Communications for Deep Learning
GOAL: • Build a research library of accelerated collectives that is easily integrated and topology-aware so as to improve the scalability of multi-GPU applications
APPROACH:
• Pattern the library after MPI's collectives
• Handle the intra-node communication in an optimal way
• Provide the necessary functionality for MPI to build on top to handle inter-node communication

github.com/NVIDIA/nccl

Deep Learning Hardware
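Collectives like the ones NCCL provides are typically built on ring patterns such as the ring all-reduce; here is a simulated numpy sketch (arrays stand in for GPU buffers, snapshot-then-update loops stand in for simultaneous transfers; the pattern is generic, not NCCL's actual implementation).

```python
import numpy as np

def ring_allreduce(bufs):
    """Simulated ring all-reduce (sum) over n 'GPU' buffers.

    Split each buffer into n segments, circulate partial sums around the ring
    (reduce-scatter), then circulate the finished segments (all-gather). Each
    link carries ~1/n of the data per step, which is why the pattern scales.
    """
    n = len(bufs)
    segs = [np.array_split(b.astype(float), n) for b in bufs]
    # Reduce-scatter: after n-1 steps, each peer holds the full sum of one segment.
    for t in range(n - 1):
        sends = [segs[p][(p - t) % n].copy() for p in range(n)]
        for p in range(n):
            segs[(p + 1) % n][(p - t) % n] += sends[p]
    # All-gather: pass each completed segment the rest of the way around the ring.
    for t in range(n - 1):
        sends = [segs[p][(p + 1 - t) % n].copy() for p in range(n)]
        for p in range(n):
            segs[(p + 1) % n][(p + 1 - t) % n] = sends[p]
    return [np.concatenate(s) for s in segs]

grads = [np.full(8, g) for g in (1.0, 2.0, 3.0, 4.0)]  # per-"GPU" gradients
out = ring_allreduce(grads)
print(out[0])   # every peer ends with the element-wise sum: all tens
```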
GPUS AND DEEP LEARNING
                      NEURAL NETWORKS   GPUS
Inherently Parallel          ✓            ✓
Matrix Operations            ✓            ✓
FLOPS                        ✓            ✓
Bandwidth                    ✓            ✓
GPUs deliver:
• same or better prediction accuracy
• faster results
• smaller footprint
• lower power
DIGITS™ DevBox
• 4 TITAN X GPUs
• NVIDIA DIGITS software pre-installed on standard Ubuntu 14.04, with Caffe, Torch, Theano, BIDMach, cuDNN v4 and CUDA 7.5
• A single deskside machine that plugs into a standard wall plug
38 DIGITS 2 trains models up to 2x faster with Multi-GPU Scaling
DIGITS DEVBOX Fastest training for deep learning from a single wall socket
DIGITS 2 performance vs. previous version on an NVIDIA DIGITS DevBox system
Note: Caffe benchmark with AlexNet; CPU server uses 2x E5-2680v3 12 Core 2.5GHz CPU, 128GB System Memory, Ubuntu 14.04

13x Faster Training (Caffe)
TESLA M40: World's Fastest Accelerator for Deep Learning Training
Reduce Training Time from 13 Days to just 1 Day
CUDA Cores: 3072 | Peak SP: 7 TFLOPS | GDDR5 Memory: 12 GB | Bandwidth: 288 GB/s | Power: 250W
Note: Caffe benchmark with AlexNet; CPU server uses 2x E5-2680v3 12 Core 2.5GHz CPU, 128GB System Memory, Ubuntu 14.04

TESLA M4: Highest Throughput Hyperscale Workload Acceleration
• Video Processing (stabilization and auto-enhance): 4x
• Image Processing (resize, filter, search, enhancements): 5x
• Video Transcode (H.264 & H.265, SD & HD): 2x
• Machine Learning Inference: 2x
CUDA Cores: 1024 | Peak SP: 2.2 TFLOPS | GDDR5 Memory: 4 GB | Bandwidth: 88 GB/s | Form Factor: PCIe Low Profile | Power: 50-75 W
Preliminary specifications. Subject to change.
JETSON TX1: Embedded Deep Learning
• Unmatched performance under 10W
• Advanced tech for autonomous machines
• Smaller than a credit card

GPU: 1 TFLOP/s 256-core Maxwell
CPU: 64-bit ARM A57 CPUs
Memory 4 GB LPDDR4 | 25.6 GB/s
Storage 16 GB eMMC
Wifi/BT 802.11 2x2 ac/BT Ready
Networking 1 Gigabit Ethernet
Size 50mm x 87mm
Interface 400 pin board-to-board connector
Jetson TX1 Developer Kit
Jetson TX1 Developer Board 5MP Camera Jetson SDK
Develop and deploy
Jetson TX1 and Jetson TX1 Developer Kit

ONE ARCHITECTURE: END-TO-END AI
Tesla for Cloud | Titan X for PC Gaming | Jetson for Embedded | DRIVE PX for Auto
Your Next Steps: Many Learning Opportunities
Free Hands-On Self-Paced Labs and In-Depth On-Line Courses
Deep Learning: developer.nvidia.com/deep-learning-courses
OpenACC: developer.nvidia.com/openacc-course
Learn from hundreds of experts at GTC 2016, April 4-7, 2016, Silicon Valley

All about Accelerated Computing: developer.nvidia.com
More about Deep Learning? Our free Deep Learning courses: https://developer.nvidia.com/deep-learning-courses
https://github.com/junhyukoh/deep-reinforcement-learning-papers
Class
#1 Introduction to Deep Learning
#2 Getting Started with DIGITS interactive training system for image classification
#3 Getting Started with the Caffe Framework
#4 Getting Started with the Theano Framework
#5 Getting Started with the Torch Framework
http://www.deeplearningbook.org/ (Goodfellow, Bengio and Courville)
April 4-7, 2016 | Silicon Valley | #GTC16 www.gputechconf.com
CONNECT: Connect with technology experts from NVIDIA and other leading organizations
LEARN: Gain insight and hands-on training through the hundreds of sessions and research posters
DISCOVER: See how GPU technologies are creating amazing breakthroughs in important fields such as deep learning
INNOVATE: Hear about disruptive innovations as early-stage companies and startups present their work
Questions?