Lecture 8: Deep Learning Software

Administrative

- Project proposals were due Tuesday
- We are assigning TAs to projects, stay tuned
- We are grading A1
- A2 is due Thursday 5/4
- Remember to stop your instances when not in use
- Only use GPU instances for the last notebook

Last time

- Optimization: SGD+Momentum, Nesterov, RMSProp, Adam
- Regularization: Dropout (add noise, then marginalize out at test time)
- Transfer learning: freeze the early conv layers of a pretrained net, reinitialize the final FC layers and train them

[Diagram: VGG-style conv net with early layers frozen and the final FC layers retrained]

Today

- CPU vs GPU
- Deep Learning Frameworks
  - Caffe / Caffe2
  - Theano / TensorFlow
  - Torch / PyTorch

CPU vs GPU

My computer

Spot the CPU! (central processing unit)
This image is licensed under CC-BY 2.0

Spot the GPUs! (graphics processing unit)
This image is in the public domain

NVIDIA vs AMD

CPU vs GPU

CPU (Intel Core i7-7700k):  4 cores (8 threads with hyperthreading),   4.4 GHz,  memory shared with system,  $339
CPU (Intel Core i7-6950X):  10 cores (20 threads with hyperthreading), 3.5 GHz,  memory shared with system,  $1723
GPU (NVIDIA Titan Xp):      3840 cores,                                1.6 GHz,  12 GB GDDR5X,               $1200
GPU (NVIDIA GTX 1070):      1920 cores,                                1.68 GHz, 8 GB GDDR5,                 $399

CPU: fewer cores, but each core is much faster and much more capable; great at sequential tasks.
GPU: more cores, but each core is much slower and "dumber"; great for parallel tasks.

Example: Matrix Multiplication

[Figure: an A x B matrix times a B x C matrix gives an A x C matrix; each output element is an independent dot product, so the work parallelizes well]
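The slides do not include code here, but a minimal sketch of the same idea in PyTorch (the sizes and the use of torch are illustrative assumptions, not from the lecture):

import torch

A, B, C = 1024, 1024, 1024
x = torch.randn(A, B)          # A x B matrix
y = torch.randn(B, C)          # B x C matrix

z = x.mm(y)                    # A x C result; every output element is an
                               # independent dot product, so the work parallelizes

if torch.cuda.is_available():  # the same computation on the GPU
    z_gpu = x.cuda().mm(y.cuda())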

Programming GPUs

● CUDA (NVIDIA only)
  ○ Write C-like code that runs directly on the GPU
  ○ Higher-level APIs: cuBLAS, cuFFT, cuDNN, etc.
● OpenCL
  ○ Similar to CUDA, but runs on anything
  ○ Usually slower :(
● Udacity: Intro to Parallel Programming https://www.udacity.com/course/cs344
  ○ For deep learning, just use existing libraries

CPU vs GPU in practice
(CPU performance not well-optimized, a little unfair)

[Chart: GPU speedups of roughly 64x-76x over CPU across several CNN architectures]
Data from https://github.com/jcjohnson/cnn-benchmarks

CPU vs GPU in practice
cuDNN much faster than "unoptimized" CUDA

[Chart: cuDNN roughly 2.8x-3.4x faster than naive CUDA across the same networks]
Data from https://github.com/jcjohnson/cnn-benchmarks

CPU / GPU Communication

[Figure: the model lives in GPU memory; the data lives on disk / in system RAM]

If you aren't careful, training can bottleneck on reading data and transferring it to the GPU!

Solutions:
- Read all data into RAM
- Use SSD instead of HDD
- Use multiple CPU threads to prefetch data

Deep Learning Frameworks

Last year...
- Caffe (UC Berkeley)
- Torch (NYU / Facebook)
- Theano (U Montreal)
- TensorFlow (Google)

This year...
- Caffe (UC Berkeley) → Caffe2 (Facebook)
- Torch (NYU / Facebook) → PyTorch (Facebook)
- Theano (U Montreal) → TensorFlow (Google)
- Paddle (Baidu)
- CNTK (Microsoft)
- MXNet (Amazon): developed by U Washington, CMU, MIT, Hong Kong U, etc., but the main framework of choice at AWS
- And others...

Today: a bit about Caffe / Caffe2; mostly TensorFlow and PyTorch.

Recall: Computational Graphs

[Graph: x and W feed a multiply node producing scores s; the scores feed a hinge loss; W also feeds a regularization term R; the hinge loss and R are summed into the total loss L]

Recall: Computational Graphs

[Figure: AlexNet as a computational graph, from input image and weights to loss]
Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.

[Figure: a much larger computational graph, again from input image to loss]
Figure reproduced with permission from a Twitter post by Andrej Karpathy.

The point of deep learning frameworks

(1) Easily build big computational graphs
(2) Easily compute gradients in computational graphs
(3) Run it all efficiently on GPU (wrap cuDNN, cuBLAS, etc.)

Computational Graphs: Numpy

[Graph: a = x * y, b = a + z, c = Σ b]

Problems:
- Can't run on GPU
- Have to compute our own gradients
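The slide's code is not reproduced in this transcript; a minimal numpy sketch of the graph above, with gradients computed by hand (N and D are arbitrary):

import numpy as np

np.random.seed(0)
N, D = 3, 4

x = np.random.randn(N, D)
y = np.random.randn(N, D)
z = np.random.randn(N, D)

# Forward pass
a = x * y
b = a + z
c = np.sum(b)

# Backward pass: compute gradients by hand
grad_c = 1.0
grad_b = grad_c * np.ones((N, D))
grad_a = grad_b.copy()
grad_z = grad_b.copy()
grad_x = grad_a * y
grad_y = grad_a * x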

Computational Graphs: TensorFlow

Create the forward computational graph, then ask TensorFlow to compute gradients. You can tell TensorFlow to run on CPU or on GPU.
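A hedged reconstruction in TensorFlow 1.x style (names, sizes, and the device string are illustrative; the slide's exact code may differ):

import numpy as np
import tensorflow as tf

N, D = 3, 4

# Build the graph on the CPU (swap '/cpu:0' for '/gpu:0' to run on the GPU)
with tf.device('/cpu:0'):
    x = tf.placeholder(tf.float32)
    y = tf.placeholder(tf.float32)
    z = tf.placeholder(tf.float32)

    a = x * y
    b = a + z
    c = tf.reduce_sum(b)

# Ask TensorFlow to compute gradients of c with respect to x, y, z
grad_x, grad_y, grad_z = tf.gradients(c, [x, y, z])

with tf.Session() as sess:
    values = {
        x: np.random.randn(N, D),
        y: np.random.randn(N, D),
        z: np.random.randn(N, D),
    }
    c_val, gx, gy, gz = sess.run([c, grad_x, grad_y, grad_z], feed_dict=values)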

Computational Graphs: PyTorch

Define Variables to start building a computational graph. The forward pass looks just like numpy. Calling c.backward() computes all gradients. Run on GPU by casting the tensors to a cuda datatype.
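A hedged reconstruction using the 2017-era Variable API (names and sizes are illustrative):

import torch
from torch.autograd import Variable

N, D = 3, 4

# Define Variables to start building a computational graph
x = Variable(torch.randn(N, D), requires_grad=True)
y = Variable(torch.randn(N, D), requires_grad=True)
z = Variable(torch.randn(N, D), requires_grad=True)

# Forward pass looks just like numpy
a = x * y
b = a + z
c = torch.sum(b)

# Calling c.backward() computes all gradients
c.backward()
print(x.grad.data)
print(y.grad.data)
print(z.grad.data)

# To run on the GPU, cast the tensors to a cuda datatype, e.g.
# x = Variable(torch.randn(N, D).cuda(), requires_grad=True)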

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 8 -3838 April 27, 2017 Numpy TensorFlow PyTorch

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 8 -3939 April 27, 2017 TensorFlow (more detail)

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 8 -4040 April 27, 2017 TensorFlow: Neural Net

Running example: train a two-layer ReLU network on random data with L2 loss

(Assume imports at the top of each snippet.)

First define the computational graph; then run the graph many times.

Create placeholders for input x, weights w1 and w2, and targets y.

Forward pass: compute the prediction for y and the loss (L2 distance between y and y_pred). No computation happens here - we are just building the graph!

Tell TensorFlow to compute the gradient of the loss with respect to w1 and w2. Again, no computation here - just building the graph.

Now we are done building the graph, so we enter a session to actually run it.

Create numpy arrays that will fill in the placeholders above.

Run the graph: feed in the numpy arrays for x, y, w1, and w2; get back numpy arrays for the loss, grad_w1, and grad_w2.

Train the network: run the graph over and over, using the gradients to update the weights.
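The slide's code is not in this transcript; a minimal TensorFlow 1.x sketch of the placeholder-based training loop just described (layer sizes, learning rate, and iteration count are illustrative assumptions):

import numpy as np
import tensorflow as tf

N, D, H = 64, 1000, 100

# First define the computational graph
x = tf.placeholder(tf.float32, shape=(N, D))
y = tf.placeholder(tf.float32, shape=(N, D))
w1 = tf.placeholder(tf.float32, shape=(D, H))
w2 = tf.placeholder(tf.float32, shape=(H, D))

h = tf.maximum(tf.matmul(x, w1), 0)          # ReLU hidden layer
y_pred = tf.matmul(h, w2)
diff = y_pred - y
loss = tf.reduce_mean(tf.reduce_sum(diff * diff, axis=1))   # L2 loss

grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])

# Then run the graph many times
with tf.Session() as sess:
    values = {
        x: np.random.randn(N, D),
        y: np.random.randn(N, D),
        w1: np.random.randn(D, H),
        w2: np.random.randn(H, D),
    }
    learning_rate = 1e-5
    for t in range(50):
        out = sess.run([loss, grad_w1, grad_w2], feed_dict=values)
        loss_val, grad_w1_val, grad_w2_val = out
        # The update happens in numpy, so weights bounce between
        # CPU and GPU on every iteration
        values[w1] -= learning_rate * grad_w1_val
        values[w2] -= learning_rate * grad_w2_val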

Problem: we are copying the weights between CPU and GPU on every step.

Fix: change w1 and w2 from placeholder (fed on each call) to Variable (persists in the graph between calls), and add assign operations to update w1 and w2 as part of the graph!

Run the graph once to initialize w1 and w2, then run it many times to train.

Problem: the loss is not going down! The assign calls are not actually being executed!

Fix: add a dummy graph node that depends on the updates, and tell the graph to compute the dummy node.
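A hedged sketch of the Variable + assign + group version described above (TensorFlow 1.x style; hyperparameters are illustrative):

import numpy as np
import tensorflow as tf

N, D, H = 64, 1000, 100

x = tf.placeholder(tf.float32, shape=(N, D))
y = tf.placeholder(tf.float32, shape=(N, D))

# w1 and w2 are now Variables, so they persist in the graph between calls
w1 = tf.Variable(tf.random_normal((D, H)))
w2 = tf.Variable(tf.random_normal((H, D)))

h = tf.maximum(tf.matmul(x, w1), 0)
y_pred = tf.matmul(h, w2)
diff = y_pred - y
loss = tf.reduce_mean(tf.reduce_sum(diff * diff, axis=1))

grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])

# Update the weights inside the graph with assign operations
learning_rate = 1e-5
new_w1 = w1.assign(w1 - learning_rate * grad_w1)
new_w2 = w2.assign(w2 - learning_rate * grad_w2)

# Dummy node that depends on the updates, so they actually get executed
updates = tf.group(new_w1, new_w2)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())   # run once to initialize w1, w2
    values = {x: np.random.randn(N, D), y: np.random.randn(N, D)}
    for t in range(50):
        loss_val, _ = sess.run([loss, updates], feed_dict=values)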

TensorFlow: Optimizer

Can use an optimizer to compute the gradients and update the weights. Remember to execute the output of the optimizer!

TensorFlow: Loss

Use predefined common losses.

TensorFlow: Layers

tf.layers automatically sets up the weights (and biases) for us, here using the Xavier initializer.
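A hedged sketch combining the optimizer, a predefined loss, and tf.layers (TensorFlow 1.x; the Xavier initializer here comes from tf.contrib.layers; hyperparameters are illustrative):

import numpy as np
import tensorflow as tf

N, D, H = 64, 1000, 100

x = tf.placeholder(tf.float32, shape=(N, D))
y = tf.placeholder(tf.float32, shape=(N, D))

# tf.layers sets up the weights (and biases) for us, with Xavier initialization
init = tf.contrib.layers.xavier_initializer()
h = tf.layers.dense(inputs=x, units=H, activation=tf.nn.relu,
                    kernel_initializer=init)
y_pred = tf.layers.dense(inputs=h, units=D, kernel_initializer=init)

# Predefined common loss
loss = tf.losses.mean_squared_error(labels=y, predictions=y_pred)

# Optimizer computes gradients and updates the weights
optimizer = tf.train.GradientDescentOptimizer(1e-3)
updates = optimizer.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    values = {x: np.random.randn(N, D), y: np.random.randn(N, D)}
    for t in range(50):
        # Remember to execute the output of the optimizer!
        loss_val, _ = sess.run([loss, updates], feed_dict=values)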

Keras: High-Level Wrapper

Keras is a layer on top of TensorFlow that makes common things easy to do.

(Also supports a Theano backend.)

Define the model object as a sequence of layers. Define the optimizer object. Build the model, specifying the loss function. Then train the model with a single line!
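A hedged Keras sketch of the four steps above (Keras 2 style API; layer sizes and the SGD learning rate are illustrative assumptions):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

N, D, H = 64, 1000, 100

# Define the model object as a sequence of layers
model = Sequential()
model.add(Dense(units=H, input_dim=D, activation='relu'))
model.add(Dense(units=D))

# Define the optimizer object, then build the model, specifying the loss
optimizer = SGD(lr=1e0)
model.compile(loss='mean_squared_error', optimizer=optimizer)

# Train the model with a single line!
x = np.random.randn(N, D)
y = np.random.randn(N, D)
history = model.fit(x, y, epochs=50, batch_size=N, verbose=0)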

TensorFlow: Other High-Level Wrappers

- Keras (https://keras.io/)
- TFLearn (http://tflearn.org/)
- TensorLayer (http://tensorlayer.readthedocs.io/en/latest/)
- tf.layers (https://www.tensorflow.org/api_docs/python/tf/layers) - ships with TensorFlow
- TF-Slim (https://github.com/tensorflow/models/tree/master/inception/inception/slim) - ships with TensorFlow
- tf.contrib.learn (https://www.tensorflow.org/get_started/tflearn) - ships with TensorFlow
- Pretty Tensor (https://github.com/google/prettytensor) - from Google
- Sonnet (https://github.com/deepmind/sonnet) - from DeepMind

TensorFlow: Pretrained Models

TF-Slim: https://github.com/tensorflow/models/tree/master/slim/nets
Keras: https://github.com/fchollet/deep-learning-models

[Diagram: transfer learning with a pretrained conv net - freeze the early conv layers, reinitialize and train the final FC layers]

TensorFlow: Tensorboard

Add logging to your code to record loss, stats, etc. Run the TensorBoard server and get pretty graphs!

TensorFlow: Distributed Version

Split one graph over multiple machines!
https://www.tensorflow.org/deploy/distributed

Side Note: Theano

TensorFlow is similar in many ways to Theano (earlier framework from Montreal)

Define symbolic variables (similar to TensorFlow placeholders).

Forward pass: compute predictions and loss (no computation performed yet).

Ask Theano to compute gradients for us (no computation performed yet).

Compile a function that computes loss, scores, and gradients from data and weights.

Run the function many times to train the network.
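A hedged Theano sketch of the steps above (sizes and learning rate are illustrative assumptions):

import numpy as np
import theano
import theano.tensor as T

# Symbolic variables (similar to TensorFlow placeholders)
x = T.matrix('x')
y = T.matrix('y')
w1 = T.matrix('w1')
w2 = T.matrix('w2')

# Forward pass: builds a symbolic expression, no computation yet
h = T.maximum(x.dot(w1), 0.0)
y_pred = h.dot(w2)
loss = T.sum((y_pred - y) ** 2.0)

# Ask Theano for gradients (still symbolic)
dw1, dw2 = T.grad(loss, [w1, w2])

# Compile a function mapping data and weights to loss and gradients
f = theano.function(inputs=[x, y, w1, w2], outputs=[loss, dw1, dw2])

# Run the compiled function many times to train
N, D, H = 64, 1000, 100
xx = np.random.randn(N, D)
yy = np.random.randn(N, D)
ww1 = np.random.randn(D, H)
ww2 = np.random.randn(H, D)

learning_rate = 1e-5
for t in range(50):
    loss_val, grad_w1, grad_w2 = f(xx, yy, ww1, ww2)
    ww1 -= learning_rate * grad_w1
    ww2 -= learning_rate * grad_w2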

PyTorch (Facebook)

PyTorch: Three Levels of Abstraction

- Tensor: imperative ndarray, but runs on GPU (TensorFlow equivalent: Numpy array)
- Variable: node in a computational graph; stores data and gradient (TensorFlow equivalent: Tensor, Variable, Placeholder)
- Module: a neural network layer; may store state or learnable weights (TensorFlow equivalent: tf.layers, TF-Slim, TFLearn, Sonnet, ...)

PyTorch: Tensors

PyTorch Tensors are just like numpy arrays, but they can run on GPU.

No built-in notion of computational graph, or gradients, or deep learning.

Here we fit a two-layer net using PyTorch Tensors:

Create random tensors for data and weights.

Forward pass: compute predictions and loss.

Backward pass: manually compute gradients.

Gradient descent step on the weights.

To run on GPU, just cast the tensors to a cuda datatype!
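A minimal sketch of the two-layer net with PyTorch Tensors and hand-written gradients (sizes and learning rate are illustrative; swap torch.FloatTensor for torch.cuda.FloatTensor to run on GPU):

import torch

dtype = torch.FloatTensor          # torch.cuda.FloatTensor to run on GPU
N, D, H = 64, 1000, 100

# Create random tensors for data and weights
x = torch.randn(N, D).type(dtype)
y = torch.randn(N, D).type(dtype)
w1 = torch.randn(D, H).type(dtype)
w2 = torch.randn(H, D).type(dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predictions and loss
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    loss = (y_pred - y).pow(2).sum()

    # Backward pass: manually compute gradients
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Gradient descent step on the weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2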

PyTorch: Autograd

A PyTorch Variable is a node in a computational graph:
- x.data is a Tensor
- x.grad is a Variable of gradients (same shape as x.data)
- x.grad.data is a Tensor of gradients

PyTorch Tensors and Variables have the same API! Variables remember how they were created (for backprop).

We do not want gradients of the loss with respect to the data, but we do want gradients with respect to the weights.

The forward pass looks exactly the same as the Tensor version, but everything is a Variable now.

Compute the gradient of the loss with respect to w1 and w2 (zero out the grads first), then make a gradient step on the weights.
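A hedged autograd version of the same network (2017-era Variable API; hyperparameters are illustrative):

import torch
from torch.autograd import Variable

N, D, H = 64, 1000, 100

# No gradients w.r.t. the data, but we do want gradients w.r.t. the weights
x = Variable(torch.randn(N, D), requires_grad=False)
y = Variable(torch.randn(N, D), requires_grad=False)
w1 = Variable(torch.randn(D, H), requires_grad=True)
w2 = Variable(torch.randn(H, D), requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass looks exactly like the Tensor version
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()

    # Zero out gradients before the backward pass
    if w1.grad is not None:
        w1.grad.data.zero_()
        w2.grad.data.zero_()

    # Compute gradient of the loss w.r.t. w1 and w2
    loss.backward()

    # Make a gradient step on the weights
    w1.data -= learning_rate * w1.grad.data
    w2.data -= learning_rate * w2.grad.data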

PyTorch: New Autograd Functions

Define your own autograd functions by writing forward and backward for Tensors (similar to the modular layers in A2). We can then use our new autograd function in the forward pass.
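A hedged sketch of a custom ReLU autograd function. Note this uses the current static-method Function API with a ctx object, which differs slightly from the 2017-era instance-method style shown in lecture:

import torch

class MyReLU(torch.autograd.Function):
    # Custom autograd function: define forward and backward for Tensors

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        grad_x = grad_output.clone()
        grad_x[x < 0] = 0
        return grad_x

# Use the new autograd function in the forward pass
x = torch.randn(64, 100, requires_grad=True)
y = MyReLU.apply(x)
y.sum().backward()        # x.grad now holds the gradient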

PyTorch: nn

Higher-level wrapper for working with neural nets. Similar to Keras and friends... but there is only one, and it's good =)

Define our model as a sequence of layers; nn also defines common loss functions.

Forward pass: feed data to the model, and the prediction to the loss function.

Backward pass: compute all gradients.

Make a gradient step on each model parameter.

PyTorch: optim

Use an optimizer for different update rules; update all parameters after computing the gradients.
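A hedged sketch putting nn and optim together (layer sizes, the Adam learning rate, and the loop length are illustrative assumptions):

import torch
from torch.autograd import Variable

N, D, H = 64, 1000, 100
x = Variable(torch.randn(N, D))
y = Variable(torch.randn(N, D), requires_grad=False)

# Define the model as a sequence of layers; nn also defines common losses
model = torch.nn.Sequential(
    torch.nn.Linear(D, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D),
)
loss_fn = torch.nn.MSELoss()

# optim provides different update rules (SGD, Adam, ...)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for t in range(500):
    # Forward pass: feed data to the model, and predictions to the loss
    y_pred = model(x)
    loss = loss_fn(y_pred, y)

    # Backward pass: compute all gradients
    optimizer.zero_grad()
    loss.backward()

    # Update all parameters after computing the gradients
    optimizer.step()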

PyTorch: nn - Define new Modules

A PyTorch Module is a neural net layer; it takes Variables as input and produces Variables as output. Modules can contain weights (as Variables) or other Modules, and you can define your own Modules using autograd!

Define our whole model as a single Module. The initializer sets up two children (Modules can contain Modules). Define the forward pass using child modules and autograd ops on Variables; there is no need to define backward - autograd will handle it. Then construct and train an instance of our model.
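A hedged sketch of a two-layer net defined as a custom Module (names and hyperparameters are illustrative):

import torch
from torch.autograd import Variable

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        # Initializer sets up two children (Modules can contain Modules)
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        # Forward uses child modules and autograd ops on Variables;
        # no need to define backward - autograd handles it
        h_relu = self.linear1(x).clamp(min=0)
        return self.linear2(h_relu)

# Construct and train an instance of the model
N, D, H = 64, 1000, 100
x = Variable(torch.randn(N, D))
y = Variable(torch.randn(N, D), requires_grad=False)

model = TwoLayerNet(D, H, D)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

for t in range(500):
    loss = criterion(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()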

PyTorch: DataLoaders

A DataLoader wraps a Dataset and provides minibatching, shuffling, and multithreading for you. When you need to load custom data, just write your own Dataset class.

Iterate over the loader to form minibatches. The loader gives Tensors, so you need to wrap them in Variables.
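A hedged sketch using TensorDataset and DataLoader (batch size, num_workers, and the random data are illustrative assumptions):

import torch
from torch.autograd import Variable
from torch.utils.data import TensorDataset, DataLoader

N, D, H = 64, 1000, 100
x = torch.randn(N, D)
y = torch.randn(N, D)

# A DataLoader wraps a Dataset and handles minibatching, shuffling, multithreading
dataset = TensorDataset(x, y)
loader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=2)

model = torch.nn.Sequential(
    torch.nn.Linear(D, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D),
)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

for epoch in range(10):
    # Iterate over the loader to form minibatches
    for x_batch, y_batch in loader:
        # The loader gives Tensors, so wrap them in Variables
        x_var, y_var = Variable(x_batch), Variable(y_batch)
        loss = criterion(model(x_var), y_var)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()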

PyTorch: Pretrained Models

Super easy to use pretrained models with torchvision: https://github.com/pytorch/vision
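For example (2017-era torchvision API; pretrained=True downloads the weights on first use):

import torch
import torchvision

# Download and load pretrained models with a single call
alexnet = torchvision.models.alexnet(pretrained=True)
vgg16 = torchvision.models.vgg16(pretrained=True)
resnet101 = torchvision.models.resnet101(pretrained=True)

# Use them like any other Module, e.g. run a forward pass on a fake image batch
resnet101.eval()                                   # switch to inference mode
x = torch.autograd.Variable(torch.randn(1, 3, 224, 224))
scores = resnet101(x)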

PyTorch: Visdom

Somewhat similar to TensorBoard: add logging to your code, then visualize it in a browser.

Can't visualize computational graph structure (yet?)

https://github.com/facebookresearch/visdom
This image is licensed under CC-BY 4.0; no changes were made to the image.

Aside: Torch

Direct ancestor of PyTorch (they share a lot of C backend)

Written in Lua, not Python

PyTorch has 3 levels of abstraction: Tensor, Variable, and Module

Torch only has 2: Tensor, Module

More details: Check 2016 slides

Build a model as a sequence of layers, and a loss function.

Define a callback that takes weights as input and produces the loss and the gradient on the weights.

Forward: compute scores and loss.

Backward: compute gradients (no autograd, so you need to pass grad_scores around yourself).

Pass the callback to the optimizer over and over.

Torch vs PyTorch

Torch:
(-) Lua
(-) No autograd
(+) More stable
(+) Lots of existing code
(0) Fast

PyTorch:
(+) Python
(+) Autograd
(-) Newer, still changing
(-) Less existing code
(0) Fast

Conclusion: probably use PyTorch for new projects.

Static vs Dynamic Graphs

TensorFlow (static): build the graph once, then run it many times.
PyTorch (dynamic): each forward pass defines a new graph.

Static vs Dynamic: Optimization

With static graphs, the framework can optimize the graph for you before it runs!

[Figure: the graph you wrote (Conv, ReLU, Conv, ReLU, Conv, ReLU) vs. an equivalent graph with fused Conv+ReLU operations]

Static vs Dynamic: Serialization

Static: once the graph is built, you can serialize it and run it without the code that built the graph!
Dynamic: graph building and execution are intertwined, so you always need to keep the code around.

Static vs Dynamic: Conditional

y = w1 * x   if z > 0
y = w2 * x   otherwise

PyTorch: just use normal Python control flow.
TensorFlow: requires a special TF control flow operator (e.g. tf.cond)!
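A hedged sketch of the contrast (the TensorFlow half uses tf.cond; shapes and the value of z are illustrative assumptions):

import numpy as np
import tensorflow as tf
import torch
from torch.autograd import Variable

# PyTorch: just a normal Python if
x = Variable(torch.randn(3, 4))
w1 = Variable(torch.randn(4, 5))
w2 = Variable(torch.randn(4, 5))
z = 10
if z > 0:
    y = x.mm(w1)
else:
    y = x.mm(w2)

# TensorFlow: the conditional must live inside the graph, via tf.cond
x_tf = tf.placeholder(tf.float32, shape=(3, 4))
w1_tf = tf.placeholder(tf.float32, shape=(4, 5))
w2_tf = tf.placeholder(tf.float32, shape=(4, 5))
z_tf = tf.placeholder(tf.float32, shape=())

y_tf = tf.cond(tf.greater(z_tf, 0.0),
               lambda: tf.matmul(x_tf, w1_tf),
               lambda: tf.matmul(x_tf, w2_tf))

with tf.Session() as sess:
    feed = {x_tf: np.random.randn(3, 4), w1_tf: np.random.randn(4, 5),
            w2_tf: np.random.randn(4, 5), z_tf: 10.0}
    y_val = sess.run(y_tf, feed_dict=feed)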

Static vs Dynamic: Loops

y_t = (y_{t-1} + x_t) * w

[Figure: the recurrence unrolled over x1, x2, x3 starting from y0]

PyTorch: just use a normal Python loop.
TensorFlow: requires special TF control flow (e.g. tf.foldl).

Dynamic Graphs in TensorFlow

TensorFlow Fold makes dynamic graphs easier in TensorFlow through dynamic batching.

Looks et al, "Deep Learning with Dynamic Computation Graphs", ICLR 2017
https://github.com/tensorflow/fold

Dynamic Graph Applications

- Recurrent networks (e.g. image captioning)
  Karpathy and Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions", CVPR 2015. Figure copyright IEEE, 2015. Reproduced for educational purposes.
- Recursive networks (e.g. trees over sentences like "The cat ate a big rat")
- Modular networks (e.g. visual question answering: "What color is the cat?" becomes find[cat] then query[color], giving "white"; "Are there more cats than dogs?" becomes find[cat] and find[dog], then count each and compare, giving "no")
  Andreas et al, "Neural Module Networks", CVPR 2016
  Andreas et al, "Learning to Compose Neural Networks for Question Answering", NAACL 2016
- (Your creative idea here)

Caffe (UC Berkeley)

Caffe Overview

● Core written in C++
● Has Python and MATLAB bindings
● Good for training or finetuning feedforward classification models
● Often no need to write code!
● Not used as much in research anymore, but still popular for deploying models

Caffe: Training / Finetuning

No need to write code!
1. Convert data (run a script)
2. Define net (edit prototxt)
3. Define solver (edit prototxt)
4. Train (with pretrained weights) (run a script)

Caffe step 1: Convert Data

● DataLayer reading from LMDB is the easiest
● Create LMDB using convert_imageset
● Need a text file where each line is "[path/to/image.jpeg] [label]"
● Or create an HDF5 file yourself using h5py

Other options:
● ImageDataLayer: read from image files
● WindowDataLayer: for detection
● HDF5Layer: read from an HDF5 file
● From memory, using the Python interface
● All of these are harder to use (except Python)

Caffe step 2: Define Network (prototxt)

● .prototxt can get ugly for big models
● The ResNet-152 prototxt is 6775 lines long!
● Not "compositional"; you can't easily define a residual block and reuse it

https://github.com/KaimingHe/deep-residual-networks/blob/master/prototxt/ResNet-152-deploy.prototxt

Caffe step 3: Define Solver (prototxt)

● Write a prototxt file defining a SolverParameter
● If finetuning, copy an existing solver.prototxt file
  ○ Change net to be your net
  ○ Change snapshot_prefix to your output
  ○ Reduce the base learning rate (divide by 100)
  ○ Maybe change max_iter and snapshot

Caffe step 4: Train!

./build/tools/caffe train \
    -gpu 0 \
    -model path/to/trainval.prototxt \
    -solver path/to/solver.prototxt \
    -weights path/to/pretrained_weights.caffemodel

Use -gpu -1 for CPU-only, and -gpu all for multi-GPU.

https://github.com/BVLC/caffe/blob/master/tools/caffe.cpp

Caffe Model Zoo

AlexNet, VGG, GoogLeNet, ResNet, plus others

https://github.com/BVLC/caffe/wiki/Model-Zoo

Caffe Python Interface

Not much documentation... Read the code! Two most important files:
● caffe/python/caffe/_caffe.cpp:
  ○ Exports Blob, Layer, Net, and Solver classes
● caffe/python/caffe/pycaffe.py:
  ○ Adds extra methods to the Net class

Good for:
● Interfacing with numpy
● Extract features: run the net forward
● Compute gradients: run the net backward (DeepDream, etc.)
● Define layers in Python with numpy (CPU only)

Caffe Pros / Cons

● (+) Good for feedforward networks
● (+) Good for finetuning existing networks
● (+) Train models without writing any code!
● (+) Python interface is pretty useful!
● (+) Can deploy without Python
● (-) Need to write C++ / CUDA for new GPU layers
● (-) Not good for recurrent networks
● (-) Cumbersome for big networks (GoogLeNet, ResNet)

Caffe2 (Facebook)

Caffe2 Overview

● Very new - released a week ago =)
● Static graphs, somewhat similar to TensorFlow
● Core written in C++
● Nice Python interface
● Can train a model in Python, then serialize and deploy without Python
● Works on iOS / Android, etc.

Google: TensorFlow ("one framework to rule them all")
Facebook: PyTorch for research + Caffe2 for production

My Advice:

TensorFlow is a safe bet for most projects. It is not perfect, but it has a huge community and wide usage. Maybe pair it with a high-level wrapper (Keras, Sonnet, etc.).

I think PyTorch is best for research. However, it is still new, and there can be rough patches.

Use TensorFlow if you need to run one graph over many machines.

Consider Caffe, Caffe2, or TensorFlow for production deployment.

Consider TensorFlow or Caffe2 for mobile.

Next Time: CNN Architecture Case Studies

Extra slides - Caffe step 2: Define Network (prototxt)

[Annotated prototxt example:]
- Layers and Blobs often have the same name!
- Per-layer learning rates (weight + bias); set these to 0 to freeze a layer
- Per-layer regularization (weight + bias)
- Number of output classes is set on the final layer