THE BIG BANG IN MACHINE LEARNING: DNN + BIG DATA + GPU

Alison B Lowndes, Deep Learning Solutions Architect & Community Manager | EMEA

“Google’s AI engine also reflects how the world of computer hardware is changing. (It) depends on machines equipped with GPUs… And it depends on these chips more than the larger tech universe realizes.”

Progression from classical to … automated

WHAT IS DEEP LEARNING?

Image: “Volvo XC90”

Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”, ICML 2009 & Comm. ACM 2011. Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng.

Machine Learning Software

Forward Propagation: the network takes an input image and scores it against the classes (tree, cat, dog, …); the untrained model wrongly answers “turtle”.

Backward Propagation: compute a weight update to nudge the output from “turtle” towards “dog”.

Training: repeat forward and backward propagation over the training data until the weights converge, yielding a Trained Model.

Inference: run forward propagation through the trained model on new data, e.g. a new image is now correctly labelled “cat”.
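This train-then-infer loop can be sketched in plain Python. A hedged toy version, not from the slides: the single sigmoid neuron, the learning rate, the squared-error gradient and the made-up 3-feature “images” are all illustrative assumptions.

```python
import math, random

def forward(w, b, x):
    # forward propagation: weighted sum squashed by a sigmoid
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, lr=0.5, epochs=200):
    random.seed(0)
    w = [random.uniform(-0.1, 0.1) for _ in range(len(samples[0][0]))]
    b = 0.0
    for _ in range(epochs):                      # Repeat
        for x, target in samples:
            y = forward(w, b, x)                 # Forward propagation
            delta = (y - target) * y * (1 - y)   # Backward propagation
            for i, xi in enumerate(x):
                w[i] -= lr * delta * xi          # nudge weights toward label
            b -= lr * delta
    return w, b

# toy 3-feature "images": label 1 = cat, 0 = not-cat
data = [([1.0, 0.0, 0.2], 1), ([0.0, 1.0, 0.9], 0),
        ([0.9, 0.1, 0.1], 1), ([0.1, 0.8, 1.0], 0)]
w, b = train(data)
score = forward(w, b, [0.95, 0.05, 0.15])  # inference on a new sample
```

Real frameworks do exactly this at scale: millions of weights, mini-batches, and GPU kernels for the forward and backward passes.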

Convolutional Neural Networks: Stacked Repeating Triplets

Convolutions → Activation (point-wise ReLU) → Max Pooling (block-wise max)
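One such triplet can be sketched in plain Python. A hedged, single-channel toy version (the 5×5 input and the edge-detecting kernel are illustrative assumptions, not from the slides):

```python
def conv2d(img, kernel):
    # valid convolution (really cross-correlation, as in most frameworks)
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(img[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)] for i in range(out_h)]

def relu(fmap):
    # point-wise ReLU activation
    return [[max(0.0, v) for v in row] for row in fmap]

def maxpool(fmap, size=2):
    # block-wise max over non-overlapping size x size windows
    return [[max(fmap[i + di][j + dj]
                 for di in range(size) for dj in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

img = [[float((i + j) % 3) for j in range(5)] for i in range(5)]
edge = [[1.0, -1.0], [1.0, -1.0]]        # toy vertical-edge kernel
fmap = maxpool(relu(conv2d(img, edge)))  # one stacked triplet: 5x5 -> 2x2
```

Stacking several such triplets, each with many learned kernels, gives the deep feature hierarchies CNNs are known for.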

Recurrent Neural Networks

x – input; h – hidden state vector; y – output
f – maps the input xt and the previous hidden state ht-1 into the new hidden state ht
g – maps the hidden state ht into the output yt
f can be a huge feed-forward network

Unfolding through time (ht-1 → ht → ht+1 → ht+2, with outputs yt, yt+1, yt+2) gives a “classical” feed-forward network with shared weights f

Most commonly trained by back-propagation

How to put f’s weight updates together:
• sum with a much smaller learning rate
• averaging
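The unrolled recurrence ht = f(xt, ht-1), yt = g(ht) can be sketched in a few lines of Python. A hedged scalar toy (the tanh nonlinearity and the weight values are illustrative assumptions):

```python
import math

def rnn_forward(xs, w_xh=0.8, w_hh=0.5, w_hy=1.0, h0=0.0):
    # the same weights (f) are reused at every time step as the net unfolds
    hs, ys = [], []
    h = h0
    for x in xs:
        h = math.tanh(w_xh * x + w_hh * h)  # f: new hidden state
        hs.append(h)
        ys.append(w_hy * h)                 # g: output from hidden state
    return hs, ys

hs, ys = rnn_forward([1.0, 0.0, 0.0])
# the first input still influences later steps through the hidden state
```

Because f is shared across steps, back-propagation through time produces one gradient per step for the same weights, which are then summed (with a smaller learning rate) or averaged, as noted above.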

Slide courtesy of P. Molchanov, NVIDIA

Long Short-Term Memory (LSTM)

Hochreiter (1991) analysed the vanishing gradient problem; “LSTM falls out of this almost naturally” (Schmidhuber)

LSTM adds gates to the recurrent cell: an input gate, a forget gate and an output gate. Training is via back-propagation unfolded in time.

Long time dependencies are preserved while the input gate is closed (–) and the forget gate is open (O).

The gates control the importance of the corresponding activations.

Figs from Graves, Schmidhuber et al., Supervised Sequence Labelling with RNNs, and from Vinyals et al., Google, April 2015 (NIC Generator)
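The gate mechanics can be sketched in a scalar toy cell. A hedged, illustrative version (the weight layout is an assumption; real LSTMs use weight matrices over vectors):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    # W holds (w_x, w_h, bias) per gate; scalar toy version of the cell
    i = sigmoid(W['i'][0] * x + W['i'][1] * h_prev + W['i'][2])  # input gate
    f = sigmoid(W['f'][0] * x + W['f'][1] * h_prev + W['f'][2])  # forget gate
    o = sigmoid(W['o'][0] * x + W['o'][1] * h_prev + W['o'][2])  # output gate
    g = math.tanh(W['g'][0] * x + W['g'][1] * h_prev + W['g'][2])
    c = f * c_prev + i * g   # memory is preserved while i ~ 0 and f ~ 1
    h = o * math.tanh(c)     # output gate scales what the cell exposes
    return h, c

# force the input gate closed and the forget gate open:
W = {'i': (0.0, 0.0, -100.0), 'f': (0.0, 0.0, 100.0),
     'o': (0.0, 0.0, 0.0), 'g': (1.0, 0.0, 0.0)}
h, c = lstm_step(5.0, 0.0, 0.7, W)  # c stays ~ c_prev: dependency preserved
```

With the input gate closed and the forget gate open, the cell state c passes through unchanged, which is exactly how long time dependencies survive many steps.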

Slide courtesy of P. Molchanov, NVIDIA

GRUs

Bengio et al., NIPS 2015

http://www.ausy.tu-darmstadt.de/

Systap: GPUs are a game changer for graph analytics

• Fraud detection

• Time series analysis

• Compliance

http://images.nvidia.com/events/sc15/SC5122-overcoming-barriers-graphs-gpus.html

Marvin Weinstein: Dynamic Quantum Clustering
http://arxiv.org/ftp/arxiv/papers/1310/1310.2700.pdf

Deep Learning Platform

CUDA FOR DEEP LEARNING DEVELOPMENT

DEEP LEARNING SDK

DIGITS cuDNN cuSPARSE cuBLAS NCCL

TITAN X DEVBOX GPU CLOUD

NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.

cuDNN: Powering Deep Learning

Applications

Frameworks

mocha.jl

KERAS CNTK

cuDNN

TensorFlow

www.tensorflow.org

Torch

Circa 2000 to now; Torch7 is the 4th major version (version numbers use odd numbers only: 1, 3, 5, 7). Web-scale learning in speech, image and video applications.

Maintained by top researchers:
Ronan Collobert - Research Scientist @ Facebook
Clement Farabet - Senior Software Engineer @ Twitter
Koray Kavukcuoglu - Research Scientist @ Google DeepMind
Soumith Chintala - Research Engineer @ Facebook

Cheatsheet

Cheatsheet: https://github.com/torch/torch7/wiki/Cheatsheet

Github: https://github.com/torch/torch7

Google Group for new users and installation queries: https://groups.google.com/forum/embed/?place=forum%2Ftorch7#!forum/torch7

Advanced only: https://gitter.im/torch/torch7

What is Caffe?

An open framework for deep learning developed by the Berkeley Vision and Learning Center (BVLC)

• Pure C++/CUDA architecture
• Command line, Python, MATLAB interfaces

caffe.berkeleyvision.org
http://github.com/BVLC/caffe

• Fast, well-tested code

• Pre-processing and deployment tools, reference models and examples

• Image data management

• Seamless GPU acceleration

• Large community of contributors to the open-source project

Caffe Features: Deep Neural Network Training

Network training also requires no coding – just define a “solver” file. This is all you need to run things on the GPU:

    net: "lenet_train.prototxt"
    base_lr: 0.01
    momentum: 0.9
    max_iter: 10000
    snapshot_prefix: "lenet_snapshot"
    solver_mode: GPU

> caffe train --solver lenet_solver.prototxt --gpu 0
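The base_lr and momentum fields in the solver file configure the SGD-with-momentum update rule. A hedged sketch of that rule (illustrative Python, not Caffe code):

```python
def sgd_momentum_step(w, v, grad, base_lr=0.01, momentum=0.9):
    # v accumulates a decaying history of past gradients ("velocity");
    # each step moves the weights by the velocity, not the raw gradient
    v = [momentum * vi - base_lr * gi for vi, gi in zip(v, grad)]
    w = [wi + vi for wi, vi in zip(w, v)]
    return w, v

w, v = [1.0], [0.0]
for _ in range(2):  # two steps down a constant slope of 1.0
    w, v = sgd_momentum_step(w, v, [1.0])
```

Momentum damps oscillation across ravines in the loss surface and accelerates progress along consistent gradient directions.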

Multiple optimization algorithms available: SGD (+momentum), ADAGRAD, NAG

NVIDIA GPU: THE ENGINE OF DEEP LEARNING

WATSON MATCONVNET

TENSORFLOW CNTK TORCH CAFFE

NVIDIA CUDA ACCELERATED

Deep Learning Software

GPU Accelerated Libraries: “Drop-in” Acceleration for Your Applications

Linear Algebra: FFT (NVIDIA cuFFT), BLAS (NVIDIA cuBLAS), SPARSE (NVIDIA cuSPARSE)

Numerical & Math: RAND (NVIDIA cuRAND), Statistics (NVIDIA Math Lib)

Data Struct. & AI: Sort, Scan, Zero Sum; GPU AI for board games and path finding

Visual Processing: Image & Video (NVIDIA NPP), NVIDIA Video Encode

Caffe Performance

CUDA BOOSTS DEEP LEARNING: 5X IN 2 YEARS

AlexNet training throughput based on 20 iterations. CPU: 1x E5-2680v3 12-core 2.5GHz, 128GB system memory, Ubuntu 14.04.

cuDNN: Deep Learning Primitives, Igniting Artificial Intelligence

▪ GPU-accelerated deep learning subroutines
▪ High performance neural network training
▪ Accelerates major deep learning frameworks: Caffe, Theano, Torch
▪ Up to 3.5x faster AlexNet training in Caffe than baseline GPU
▪ Tiled FFT up to 2x faster than FFT

Chart: millions of images trained per day

developer.nvidia.com/cudnn

WHAT’S NEW IN CUDNN 4? Faster Training, Optimized for Inference

• Train neural networks up to 14x faster using Google’s Batch Normalization technique

• Increase training and inference performance for convolution layers: up to 2x faster with the new 2D Tiled-FFT algorithm (up to 2x faster on VGG layers)

• Accelerate inference for small batch sizes: up to 2x faster on AlexNet layers using convolution layers on Maxwell-based GPUs

• Optimize for energy-efficient inference with 10x better performance/Watt on Jetson TX1

developer.nvidia.com/cudnn

NVIDIA DIGITS: Interactive Deep Learning GPU Training System

Process Data → Configure DNN → Monitor Progress → Visualize Layers → Test Image

developer.nvidia.com/digits

WHAT’S NEW IN DIGITS 3? DIGITS 3 Improves Training Productivity with Enhanced Workflows

• Train neural network models with Torch support (preview)
• Save time by quickly iterating to identify the best model
• Easily manage multiple jobs to optimize use of system resources
• Active open source project with valuable community contributions
• New results browser

developer.nvidia.com/digits

cuBLAS: Accelerated Linear Algebra for Deep Learning

Accelerated Level 3 BLAS (GEMM, SYMM, TRSM, SYRK): >3 TFLOPS single precision on a single K40
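GEMM is the workhorse behind fully-connected layers (and convolutions lowered to matrix multiplies). A hedged pure-Python sketch of the operation cuBLAS accelerates, C = αAB + βC (illustrative only; cuBLAS runs this on the GPU):

```python
def gemm(alpha, A, B, beta, C):
    # C = alpha * A @ B + beta * C, the Level-3 BLAS matrix-multiply kernel
    n, k, m = len(A), len(B), len(B[0])
    return [[beta * C[i][j] + alpha * sum(A[i][p] * B[p][j] for p in range(k))
             for j in range(m)] for i in range(n)]

A = [[1.0, 2.0], [3.0, 4.0]]
I = [[1.0, 0.0], [0.0, 1.0]]
C = gemm(1.0, A, I, 0.0, [[0.0, 0.0], [0.0, 0.0]])  # A times identity = A
```

The O(n³) flops against O(n²) data is what makes GEMM such a good fit for GPU throughput.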

Multi-GPU BLAS support available in cuBLAS-XT

developer.nvidia.com/cublas

cuSPARSE: (Dense Matrix) × (Sparse Vector) Speeds Up Natural Language Processing

cusparsegemvi(): y = α ∙ op(A) ∙ x + β ∙ y

A = dense matrix
x = sparse vector
y = dense vector

The sparse vector could be, for example, the frequencies of words in a text sample. cuSPARSE provides a full suite of accelerated sparse matrix functions.
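A hedged sketch of what this routine computes, with the sparse vector stored as index/value pairs (illustrative Python, not the cuSPARSE API):

```python
def gemvi(alpha, A, x_idx, x_val, beta, y):
    # y = alpha * A @ x + beta * y, where x is sparse: only entries at
    # x_idx are nonzero (e.g. word counts for the few words in a document)
    return [beta * yi + alpha * sum(A[i][j] * v for j, v in zip(x_idx, x_val))
            for i, yi in enumerate(y)]

A = [[1.0, 2.0, 0.5],
     [3.0, 4.0, 0.5]]
y = gemvi(1.0, A, [1], [2.0], 0.0, [0.0, 0.0])  # only word 1 occurs, twice
```

Skipping the zero entries is the whole win: a vocabulary-sized vector with a handful of nonzeros touches only a handful of matrix columns.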

developer.nvidia.com/cusparse

NCCL: Accelerating Multi-GPU Communications for Deep Learning

GOAL:
• Build a research library of accelerated collectives that is easily integrated and topology-aware, so as to improve the scalability of multi-GPU applications

APPROACH:
• Pattern the library after MPI’s collectives
• Handle the intra-node communication in an optimal way
• Provide the necessary functionality for MPI to build on top to handle inter-node communication

github.com/NVIDIA/nccl

Deep Learning Hardware

GPUS AND DEEP LEARNING

                      GPUs | Neural Networks
Inherently Parallel:    ✓  |       ✓
Matrix Operations:      ✓  |       ✓
FLOPS:                  ✓  |       ✓
Bandwidth:              ✓  |       ✓

GPUs deliver:
- same or better prediction accuracy
- faster results
- smaller footprint
- lower power

DIGITS™ DevBox

• 4 TITAN X GPUs
• NVIDIA DIGITS software pre-installed
• Standard Ubuntu 14.04 with Caffe, Torch, Theano, BIDMach, cuDNN v4, and CUDA 7.5
• A single deskside machine that plugs into a standard wall plug

DIGITS 2 trains models up to 2x faster with Multi-GPU Scaling

DIGITS DEVBOX Fastest training for deep learning from a single wall socket

DIGITS 2 performance vs. previous version on an NVIDIA DIGITS DevBox system

Note: Caffe benchmark with AlexNet; CPU server uses 2x E5-2680v3 12-core 2.5GHz CPUs, 128GB system memory, Ubuntu 14.04

13x Faster Training with Caffe

TESLA M40: World’s Fastest Accelerator for Deep Learning Training

Reduce training time from 13 days to just 1 day

CUDA Cores: 3072 | Peak SP: 7 TFLOPS | GDDR5 Memory: 12 GB | Bandwidth: 288 GB/s | Power: 250W

Note: Caffe benchmark with AlexNet; CPU server uses 2x E5-2680v3 12-core 2.5GHz CPUs, 128GB system memory, Ubuntu 14.04

TESLA M4: Highest Throughput Hyperscale Workload Acceleration

Video Processing (stabilization, auto-enhance): 4x
Image Processing (resize, filter, search, enhancements): 5x
Video Transcode (H.264 & H.265, SD & HD): 2x
Machine Learning Inference: 2x

CUDA Cores: 1024 | Peak SP: 2.2 TFLOPS | GDDR5 Memory: 4 GB | Bandwidth: 88 GB/s | Form Factor: PCIe Low Profile | Power: 50–75 W

Preliminary specifications. Subject to change.

JETSON TX1: Embedded Deep Learning

• Unmatched performance under 10W
• Advanced tech for autonomous machines
• Smaller than a credit card

GPU: 1 TFLOP/s 256-core Maxwell
CPU: 64-bit ARM A57 CPUs

Memory 4 GB LPDDR4 | 25.6 GB/s

Storage 16 GB eMMC

Wifi/BT 802.11 2x2 ac/BT Ready

Networking 1 Gigabit Ethernet

Size 50mm x 87mm

Interface 400 pin board-to-board connector

Jetson TX1 Developer Kit

Jetson TX1 Developer Board | 5MP Camera | Jetson SDK

Develop and Deploy: Jetson TX1 and Jetson TX1 Developer Kit

ONE ARCHITECTURE: END-TO-END AI

Tesla for Cloud | Titan X for PC gaming | Jetson for Embedded | DRIVE PX for Auto

Your Next Steps: Many Learning Opportunities

Free Hands-On Self-Paced Labs and In-Depth On-Line Courses

Deep Learning: developer.nvidia.com/deep-learning-courses

OpenACC: developer.nvidia.com/openacc-course

Learn from hundreds of experts at GTC 2016: April 4-7, 2016, Silicon Valley

All about Accelerated Computing: developer.nvidia.com

More about Deep Learning?

Our free deep learning courses: https://developer.nvidia.com/deep-learning-courses
Deep reinforcement learning papers: https://github.com/junhyukoh/deep-reinforcement-learning-papers

Class

#1 Introduction to Deep Learning

#2 Getting Started with DIGITS interactive training system for image classification

#3 Getting Started with the Caffe Framework

#4 Getting Started with the Theano Framework

#5 Getting Started with the Torch Framework

Bengio et al.: http://www.deeplearningbook.org/

April 4-7, 2016 | Silicon Valley | #GTC16 | www.gputechconf.com

CONNECT: Connect with technology experts from NVIDIA and other leading organizations
LEARN: Gain insight and hands-on training through the hundreds of sessions and research posters
DISCOVER: See how GPU technologies are creating amazing breakthroughs in important fields such as deep learning
INNOVATE: Hear about disruptive innovations as early-stage companies and startups present their work

Questions?