Lecture 6: Hardware and Software. Deep Learning Hardware, Dynamic & Static Computational Graphs, PyTorch & TensorFlow

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 1 April 15, 2021 Administrative

Assignment 1 is due tomorrow April 16th, 11:59pm.

Assignment 2 will be out tomorrow, due April 30th, 11:59 pm.

Project proposal due Monday April 19.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 2 April 15, 2021 Administrative

Friday’s section topic: course project

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 3 April 15, 2021 two more layers to go: POOL/FC

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 4 April 15, 2021 Convolution Layers (continued from last time)

[Example: a 56x56x32 input passed through a 3x3 CONV layer with 64 filters, stride 1, padding 1 produces a 56x56x64 output.]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 5 April 15, 2021 Pooling layer - makes the representations smaller and more manageable - operates over each activation map independently:

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 6 April 15, 2021 MAX POOLING

Single depth slice (4x4 input), max pool with 2x2 filters and stride 2:

    1 1 2 4        6 8
    5 6 7 8   ->   3 4
    3 2 1 0
    1 2 3 4

(x and y denote the spatial axes of the activation map.)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 7 April 15, 2021 Pooling layer: summary

Let’s assume the input is W1 x H1 x C. The pooling layer requires 2 hyperparameters: - The spatial extent F - The stride S

This will produce an output of W2 x H2 x C where: - W2 = (W1 - F)/S + 1 - H2 = (H1 - F)/S + 1. For example, a 224x224x64 input with F = 2 and S = 2 produces a 112x112x64 output.

Number of parameters: 0

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 8 April 15, 2021 Fully Connected Layer (FC layer) - Contains neurons that connect to the entire input volume, as in ordinary Neural Networks

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 9 April 15, 2021 Lecture 6: Hardware and Software Deep Learning Hardware, Dynamic & Static Computational Graph, PyTorch & TensorFlow

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 10 April 15, 2021 Where we are now... Computational graphs

[Computational graph: inputs x and W feed a multiply node producing scores s; the hinge loss and the regularization term R are summed (+) to give the total loss L.]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 11 April 15, 2021 Where we are now... Convolutional Neural Networks

Illustration of LeCun et al. 1998 from CS231n 2017 Lecture 1

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 12 April 15, 2021 Where we are now… (more on optimization in lecture 8) Learning network parameters through optimization

Landscape image is CC0 1.0 public domain Walking man image is CC0 1.0 public domain

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 13 April 15, 2021 Today

- Deep learning hardware - CPU, GPU - Deep learning software - PyTorch and TensorFlow - Static and Dynamic computation graphs

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 14 April 15, 2021 Deep Learning Hardware

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 15 April 15, 2021 Inside a computer

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 16 April 15, 2021 Spot the CPU! (central processing unit)

This image is licensed under CC-BY 2.0

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 17 April 15, 2021 Spot the GPUs! (graphics processing unit)

This image is licensed under CC-BY 2.0

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 18 April 15, 2021 CPU vs GPU

Cores / Clock Speed / Memory / Price / Speed (throughput):
CPU (Core i9-7900k): 10 cores, 4.3 GHz, System RAM, $385, ~640 GFLOPS FP32
GPU (RTX 3090): 10496 cores, 1.6 GHz, 24 GB GDDR6X, $1499, ~35.6 TFLOPS FP32

CPU: fewer cores, but each core is much faster and much more capable; great at sequential tasks.

GPU: more cores, but each core is much slower and “dumber”; great for parallel tasks.
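To make the throughput gap concrete, here is a rough timing sketch (illustrative, not from the slides) of a large matrix multiply on CPU vs GPU with PyTorch; it assumes torch is installed and a CUDA device is available for the GPU half:

```python
# Rough CPU-vs-GPU matrix-multiply timing sketch (sizes are illustrative).
import time
import torch

def time_matmul(device, n=4096, reps=10):
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()      # make sure setup finished before timing
    start = time.time()
    for _ in range(reps):
        c = a @ b                     # large GEMM
    if device == "cuda":
        torch.cuda.synchronize()      # wait for the GPU kernels to finish
    return (time.time() - start) / reps

print("CPU:", time_matmul("cpu"))
if torch.cuda.is_available():
    print("GPU:", time_matmul("cuda"))
```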

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 19 April 15, 2021 Example: Matrix Multiplication

[Figure: multiplying an A x B matrix by a B x C matrix yields an A x C matrix; each of the A x C output elements can be computed independently, which is why matrix multiply parallelizes so well on a GPU.]

cuBLAS::GEMM (GEneral Matrix-to-matrix Multiply) Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 20 April 15, 2021 CPU vs GPU in practice (CPU performance not well-optimized, a little unfair)

[Benchmark figure: GPU is roughly 64x-76x faster than CPU across the tested CNN architectures.]

Data from https://github.com/jcjohnson/cnn-benchmarks

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 21 April 15, 2021 CPU vs GPU in practice: cuDNN much faster than “unoptimized” CUDA

[Benchmark figure: cuDNN gives roughly a 2.8x-3.4x speedup over “unoptimized” CUDA across the same models.]

Data from https://github.com/jcjohnson/cnn-benchmarks

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 22 April 15, 2021 Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 23 April 15, 2021 NVIDIA vs AMD

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 24 April 15, 2021 NVIDIA vs AMD

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 25 April 15, 2021 CPU vs GPU

Cores / Clock Speed / Memory / Price / Speed:
CPU (Intel Core i7-7700k): 10 cores, 4.3 GHz, System RAM, $385, ~640 GFLOPs FP32
GPU (NVIDIA RTX 3090): 10496 cores, 1.6 GHz, 24 GB GDDR6X, $1499, ~35.6 TFLOPs FP32
GPU (Data Center, NVIDIA A100): 6912 CUDA cores + 432 Tensor cores, 1.5 GHz, 40/80 GB HBM2, $3/hr (GCP), ~9.7 TFLOPs FP64 / ~20 TFLOPs FP32 / ~312 TFLOPs FP16
TPU (Google Cloud TPUv3): 2 Matrix Units (MXUs) per core, 4 cores, clock ?, 128 GB HBM, $8/hr (GCP), ~420 TFLOPs (non-standard FP)

CPU: fewer cores, but each core is much faster and much more capable; great at sequential tasks.
GPU: more cores, but each core is much slower and “dumber”; great for parallel tasks.
TPU: specialized hardware for deep learning.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 26 April 15, 2021 Programming GPUs

● CUDA (NVIDIA only) ○ Write C-like code that runs directly on the GPU ○ Optimized APIs: cuBLAS, cuFFT, cuDNN, etc ● OpenCL ○ Similar to CUDA, but runs on anything ○ Usually slower on NVIDIA hardware ● HIP https://github.com/ROCm-Developer-Tools/HIP ○ New project that automatically converts CUDA code to something that can run on AMD GPUs ● Stanford CS 149: http://cs149.stanford.edu/fall20/

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 27 April 15, 2021 CPU / GPU Communication

[Figure: the Model is here (on the GPU); the Data is here (on disk).]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 28 April 15, 2021 CPU / GPU Communication

[Figure: the Model is here (on the GPU); the Data is here (on disk).]

If you aren’t careful, training can bottleneck on reading data and transferring to GPU!

Solutions: - Read all data into RAM - Use SSD instead of HDD - Use multiple CPU threads to prefetch data
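As an illustration of the prefetching solution, here is a minimal sketch (illustrative, not from the slides) using PyTorch's DataLoader with a toy dataset; num_workers and pin_memory are the knobs that matter here:

```python
# Prefetching sketch: worker processes load batches while the GPU trains,
# and pinned memory speeds up host-to-GPU copies.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(10000, 3, 32, 32),
                        torch.randint(0, 10, (10000,)))
loader = DataLoader(dataset, batch_size=128, shuffle=True,
                    num_workers=4,      # CPU worker processes prefetch batches
                    pin_memory=True)    # page-locked memory for faster transfers

device = "cuda" if torch.cuda.is_available() else "cpu"
for images, labels in loader:
    images = images.to(device, non_blocking=True)  # overlap copy with compute
    labels = labels.to(device, non_blocking=True)
    # ... forward / backward / update ...
```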

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 29 April 15, 2021 Deep Learning Software

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 30 April 15, 2021 A zoo of frameworks!
- Caffe (UC Berkeley) -> Caffe2 (Facebook): features mostly absorbed by PyTorch
- Torch (NYU / Facebook) -> PyTorch (Facebook)
- Theano (U Montreal) -> TensorFlow (Google)
- Chainer (Preferred Networks): the company has officially migrated its research infrastructure to PyTorch
- PaddlePaddle (Baidu)
- MXNet (Amazon): developed by U Washington, CMU, MIT, Hong Kong U, etc., but the main framework of choice at AWS
- CNTK (Microsoft)
- JAX (Google)
- And others...

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 31 April 15, 2021 A zoo of frameworks! (same zoo as above) We’ll focus on these: PyTorch and TensorFlow.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 32 April 15, 2021 Recall: Computational Graphs

[Computational graph: inputs x and W feed a multiply node producing scores s; the hinge loss and the regularization term R are summed (+) to give the total loss L.]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 33 April 15, 2021 Recall: Computational Graphs

input image

weights

loss

Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 34 April 15, 2021 Recall: Computational Graphs

input image

loss

Figure reproduced with permission from a Twitter post by Andrej Karpathy.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 35 April 15, 2021 The point of deep learning frameworks

(1) Quick to develop and test new ideas (2) Automatically compute gradients (3) Run it all efficiently on GPU (wrap cuDNN, cuBLAS, OpenCL, etc)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 36 April 15, 2021 Computational Graphs: Numpy. [Graph: a = x * y, b = a + z, c = sum(b)]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 37 April 15, 2021 Computational Graphs: Numpy. [Same graph: a = x * y, b = a + z, c = sum(b)]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 38 April 15, 2021 Computational Graphs: Numpy. Good: clean API, easy to write numeric code. Bad: we have to compute our own gradients, and it can’t run on GPU.
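A minimal numpy sketch of the graph above (a = x * y, b = a + z, c = sum(b)), with the gradients derived by hand; shapes are illustrative:

```python
# Forward and manual backward pass for the toy graph in numpy.
import numpy as np

np.random.seed(0)
N, D = 3, 4
x, y, z = np.random.randn(N, D), np.random.randn(N, D), np.random.randn(N, D)

# Forward pass
a = x * y
b = a + z
c = np.sum(b)

# Backward pass: gradients of c with respect to x, y, z, derived by hand
grad_c = 1.0
grad_b = grad_c * np.ones((N, D))
grad_a = grad_b.copy()
grad_z = grad_b.copy()
grad_x = grad_a * y
grad_y = grad_a * x
```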

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 39 April 15, 2021 Computational Graphs: PyTorch. [Same graph: a = x * y, b = a + z, c = sum(b)]

Looks exactly like numpy!

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 40 April 15, 2021 Computational Graphs: PyTorch. PyTorch handles gradients for us!

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 41 April 15, 2021 Computational Graphs: PyTorch.

Trivial to run on GPU - just construct the arrays on a different device!
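For comparison, a sketch of the same graph in PyTorch: autograd supplies the gradients, and moving to GPU is just a matter of the device argument (this falls back to CPU if CUDA is unavailable):

```python
# The same toy graph in PyTorch: autograd gradients, optionally on the GPU.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
N, D = 3, 4
x = torch.randn(N, D, device=device, requires_grad=True)
y = torch.randn(N, D, device=device, requires_grad=True)
z = torch.randn(N, D, device=device, requires_grad=True)

a = x * y
b = a + z
c = torch.sum(b)

c.backward()                    # PyTorch computes the gradients for us
print(x.grad, y.grad, z.grad)
```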

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 42 April 15, 2021 PyTorch (More details)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 43 April 15, 2021 PyTorch: Fundamental Concepts

torch.Tensor: Like a numpy array, but can run on GPU

torch.autograd: Package for building computational graphs out of Tensors, and automatically computing gradients

torch.nn.Module: A neural network layer; may store state or learnable weights

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 44 April 15, 2021 PyTorch: Versions

For this class we are using PyTorch version 1.7

Major API change in release 1.0

Be careful if you are looking at older PyTorch code (<1.0)!

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 45 April 15, 2021 PyTorch: Tensors

Running example: Train a two-layer ReLU network on random data with L2 loss

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 46 April 15, 2021 PyTorch: Tensors

Create random tensors for data and weights

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 47 April 15, 2021 PyTorch: Tensors

Forward pass: compute predictions and loss

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 48 April 15, 2021 PyTorch: Tensors

Backward pass: manually compute gradients

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 49 April 15, 2021 PyTorch: Tensors

Gradient descent step on weights

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 50 April 15, 2021 PyTorch: Tensors

To run on GPU, just use a different device!
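A minimal sketch of the running example with plain Tensors and hand-computed gradients, using hypothetical sizes (N=64, D_in=1000, H=100, D_out=10):

```python
# Two-layer ReLU network on random data with L2 loss; gradients by hand.
import torch

device = torch.device("cpu")     # swap for torch.device("cuda") to run on GPU
N, D_in, H, D_out = 64, 1000, 100, 10

x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)
w1 = torch.randn(D_in, H, device=device)
w2 = torch.randn(H, D_out, device=device)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: predictions and L2 loss
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    loss = (y_pred - y).pow(2).sum()

    # Backward pass: manually computed gradients
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Gradient descent step on the weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
```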

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 51 April 15, 2021 PyTorch: Autograd

Creating Tensors with requires_grad=True enables autograd

Operations on Tensors with requires_grad=True cause PyTorch to build a computational graph

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 52 April 15, 2021 PyTorch: Autograd

Forward pass looks exactly the same as before, but we don’t need to track intermediate values - PyTorch keeps track of them for us in the graph

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 53 April 15, 2021 PyTorch: Autograd

Compute gradient of loss with respect to w1 and w2

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 54 April 15, 2021 PyTorch: Autograd

Make a gradient step on the weights, then zero their gradients. torch.no_grad() means “don’t build a computational graph for this part”.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 55 April 15, 2021 PyTorch: Autograd

PyTorch methods that end in underscore modify the Tensor in-place; methods that don’t return a new Tensor
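Putting the autograd pieces together, a sketch of the same training loop where loss.backward() replaces the manual backward pass (same assumed sizes as before):

```python
# Two-layer network trained with autograd instead of manual gradients.
import torch

N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
w1 = torch.randn(D_in, H, requires_grad=True)   # requires_grad=True enables autograd
w2 = torch.randn(H, D_out, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: no need to keep intermediate values ourselves
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()

    loss.backward()                   # gradients with respect to w1 and w2

    with torch.no_grad():             # don't build a graph for the update
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()               # underscore methods modify tensors in place
        w2.grad.zero_()
```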

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 56 April 15, 2021 PyTorch: New Autograd Functions

Define your own autograd functions by writing forward and backward functions for Tensors

Use ctx object to “cache” values for the backward pass, just like cache objects from A2

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 57 April 15, 2021 PyTorch: New Autograd Functions

Define your own autograd functions by writing forward and backward functions for Tensors

Use ctx object to “cache” values for the backward pass, just like cache objects from A2

Define a helper function to make it easy to use the new function

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 58 April 15, 2021 PyTorch: New Autograd Functions

Can use our new autograd function in the forward pass

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 59 April 15, 2021 PyTorch: New Autograd Functions

In practice you almost never need to define new autograd functions! Only do it when you need custom backward. In this case we can just use a normal Python function
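For completeness, a sketch of what such a custom function can look like, here a ReLU written against the standard torch.autograd.Function API, followed by the plain-function alternative that usually suffices:

```python
# Custom autograd function with a ctx object to "cache" values for backward.
import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)        # cache x for the backward pass
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        grad_x = grad_output.clone()
        grad_x[x < 0] = 0
        return grad_x

def my_relu(x):                         # helper so callers don't see .apply
    return MyReLU.apply(x)

# In practice a normal Python function is usually enough:
def my_relu_simple(x):
    return x.clamp(min=0)               # autograd already knows how to differentiate this
```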

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 60 April 15, 2021 PyTorch: nn

Higher-level wrapper for working with neural nets

Use this! It will make your life easier

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 61 April 15, 2021 PyTorch: nn

Define our model as a sequence of layers; each layer is an object that holds learnable weights

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 62 April 15, 2021 PyTorch: nn

Forward pass: feed data to model, and compute loss

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 63 April 15, 2021 PyTorch: nn

Forward pass: feed data to model, and compute loss torch.nn.functional has useful helpers like loss functions

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 64 April 15, 2021 PyTorch: nn

Backward pass: compute gradient with respect to all model weights (they have requires_grad=True)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 65 April 15, 2021 PyTorch: nn

Make gradient step on each model parameter (with gradients disabled)
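A sketch of the same training loop written with torch.nn (assumed sizes as before): the Sequential model owns the weights, torch.nn.functional supplies the loss, and the update runs under torch.no_grad():

```python
# Two-layer network expressed with torch.nn layers.
import torch

N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Model as a sequence of layers; each layer object holds its learnable weights
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

learning_rate = 1e-4
for t in range(500):
    y_pred = model(x)                                 # forward pass
    loss = torch.nn.functional.mse_loss(y_pred, y)    # predefined loss helper

    model.zero_grad()
    loss.backward()          # gradients for all parameters (requires_grad=True)

    with torch.no_grad():    # gradient step on each parameter, gradients disabled
        for param in model.parameters():
            param -= learning_rate * param.grad
```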

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 66 April 15, 2021 PyTorch: optim

Use an optimizer for different update rules

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 67 April 15, 2021 PyTorch: optim

After computing gradients, use optimizer to update params and zero gradients
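A sketch of the loop with torch.optim handling the update rule (Adam here, chosen only for illustration):

```python
# Let an optimizer object perform the update and zero the gradients.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1000, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 10),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x, y = torch.randn(64, 1000), torch.randn(64, 10)
for t in range(500):
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()            # compute gradients
    optimizer.step()           # update all parameters
    optimizer.zero_grad()      # zero gradients for the next iteration
```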

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 68 April 15, 2021 PyTorch: nn Define new Modules

A PyTorch Module is a neural net layer; it inputs and outputs Tensors

Modules can contain weights or other modules

You can define your own Modules using autograd!

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 69 April 15, 2021 PyTorch: nn Define new Modules

Define our whole model as a single Module

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 70 April 15, 2021 PyTorch: nn Define new Modules

Initializer sets up two children (Modules can contain modules)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 71 April 15, 2021 PyTorch: nn Define new Modules

Define forward pass using child modules

No need to define backward - autograd will handle it

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 72 April 15, 2021 PyTorch: nn Define new Modules

Construct and train an instance of our model
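A sketch of such a Module subclass and a short training loop around it (sizes and optimizer chosen for illustration):

```python
# Whole model as a single Module; children are set up in __init__,
# forward uses them, and autograd handles backward.
import torch

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.linear1 = torch.nn.Linear(D_in, H)   # Modules can contain Modules
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        h_relu = self.linear1(x).clamp(min=0)
        return self.linear2(h_relu)

model = TwoLayerNet(1000, 100, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(64, 1000), torch.randn(64, 10)
for t in range(500):
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```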

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 73 April 15, 2021 PyTorch: nn Define new Modules

Very common to mix and match custom Module subclasses and Sequential containers

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 74 April 15, 2021 PyTorch: nn Define new Modules

Define network component as a Module subclass

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 75 April 15, 2021 PyTorch: nn Define new Modules

Stack multiple instances of the component in a sequential
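A sketch of this pattern: a hypothetical ParallelBlock component defined as a Module subclass, then stacked inside a Sequential container:

```python
# Mix a custom Module subclass with a Sequential container.
import torch

class ParallelBlock(torch.nn.Module):
    def __init__(self, D_in, D_out):
        super().__init__()
        self.linear1 = torch.nn.Linear(D_in, D_out)
        self.linear2 = torch.nn.Linear(D_in, D_out)

    def forward(self, x):
        h1 = self.linear1(x)
        h2 = self.linear2(x)
        return (h1 * h2).clamp(min=0)   # combine the two parallel branches

model = torch.nn.Sequential(
    ParallelBlock(1000, 100),
    ParallelBlock(100, 100),
    torch.nn.Linear(100, 10),
)
```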

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 76 April 15, 2021 PyTorch: Pretrained Models

Super easy to use pretrained models with torchvision https://github.com/pytorch/vision
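A sketch of loading pretrained torchvision models (the pretrained=True flag matches the torchvision API of that era):

```python
# Load ImageNet-pretrained models from torchvision and run inference.
import torch
import torchvision

alexnet = torchvision.models.alexnet(pretrained=True)
resnet101 = torchvision.models.resnet101(pretrained=True)

resnet101.eval()                         # inference mode
x = torch.randn(1, 3, 224, 224)          # dummy image batch
with torch.no_grad():
    scores = resnet101(x)                # 1 x 1000 class scores
```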

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 77 April 15, 2021 PyTorch: torch.utils.tensorboard

A Python wrapper around TensorFlow’s web-based visualization tool.
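A minimal logging sketch with torch.utils.tensorboard, using a hypothetical run directory and a stand-in loss value:

```python
# Log scalars during training and view them in TensorBoard.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/experiment1")   # hypothetical log directory
for step in range(100):
    loss = 1.0 / (step + 1)                  # stand-in for a real training loss
    writer.add_scalar("train/loss", loss, step)
writer.close()
# Then view with:  tensorboard --logdir runs
```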

This image is licensed under CC-BY 4.0; no changes were made to the image

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 78 April 15, 2021 PyTorch: Computational Graphs

input image

loss

Figure reproduced with permission from a Twitter post by Andrej Karpathy.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 79 April 15, 2021 PyTorch: Dynamic Computation Graphs

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 80 April 15, 2021 PyTorch: Dynamic Computation Graphs

[Graph: Tensor nodes x, w1, w2, y]

Create Tensor objects

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 81 April 15, 2021 PyTorch: Dynamic Computation Graphs

[Graph: x and w1 feed mm; its output feeds clamp; the clamped output and w2 feed a second mm, producing y_pred]

Build graph data structure AND perform computation

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 82 April 15, 2021 PyTorch: Dynamic Computation Graphs

[Graph: as above, and y_pred and y feed pow -> sum, producing loss]

Build graph data structure AND perform computation

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 83 April 15, 2021 PyTorch: Dynamic Computation Graphs

[Graph: the complete forward graph from x, w1, w2, y through mm, clamp, mm, pow, sum to loss]

Search for a path between loss and w1, w2 (for backprop) AND perform computation

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 84 April 15, 2021 PyTorch: Dynamic Computation Graphs

[Graph: only the Tensor nodes x, w1, w2, y remain]

Throw away the graph and backprop path, and rebuild them from scratch on every iteration

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 85 April 15, 2021 PyTorch: Dynamic Computation Graphs

[Graph (next iteration): x, w1 -> mm -> clamp; the clamped output and w2 -> mm -> y_pred]

Build graph data structure AND perform computation

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 86 April 15, 2021 PyTorch: Dynamic Computation Graphs

[Graph (next iteration): as above, plus y_pred and y -> pow -> sum -> loss]

Build graph data structure AND perform computation

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 87 April 15, 2021 PyTorch: Dynamic Computation Graphs

[Graph (next iteration): the complete forward graph from x, w1, w2, y to loss]

Search for a path between loss and w1, w2 (for backprop) AND perform computation

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 88 April 15, 2021 PyTorch: Dynamic Computation Graphs

Building the graph and computing the graph happen at the same time.

Seems inefficient, especially if we are building the same graph over and over again...

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 89 April 15, 2021 Static Computation Graphs

Alternative: Static graphs

Step 1: Build computational graph describing our computation (including finding paths for backprop)

Step 2: Reuse the same graph on every iteration

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 90 April 15, 2021 TensorFlow

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 91 April 15, 2021 TensorFlow Versions

Pre-2.0 (1.14 latest): static graph by default, optionally dynamic graph (eager mode). 2.0+: dynamic graph by default, optionally static graph. We use 2.4 in this class.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 92 April 15, 2021 TensorFlow: Neural Net (Pre-2.0)

(Assume imports at the top of each snippet)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 93 April 15, 2021 TensorFlow: Neural Net (Pre-2.0) First define computational graph

Then run the graph many times
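A pre-2.0-style sketch of the two steps, assuming a TensorFlow 1.x installation (placeholders and a Session):

```python
# Step 1: define the static graph. Step 2: run the same graph many times.
import numpy as np
import tensorflow as tf   # assumes TensorFlow 1.x

N, D, H = 64, 1000, 100
x = tf.placeholder(tf.float32, shape=(N, D))
y = tf.placeholder(tf.float32, shape=(N, D))
w1 = tf.placeholder(tf.float32, shape=(D, H))
w2 = tf.placeholder(tf.float32, shape=(H, D))

h = tf.maximum(tf.matmul(x, w1), 0.0)          # two-layer ReLU network
y_pred = tf.matmul(h, w2)
loss = tf.reduce_sum((y_pred - y) ** 2)        # L2 loss
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])

with tf.Session() as sess:
    feed = {x: np.random.randn(N, D), y: np.random.randn(N, D),
            w1: np.random.randn(D, H), w2: np.random.randn(H, D)}
    for _ in range(20):                        # reuse the same graph every iteration
        loss_val, g1, g2 = sess.run([loss, grad_w1, grad_w2], feed_dict=feed)
```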

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 94 April 15, 2021 TensorFlow: 2.0+ vs. pre-2.0

Tensorflow 2.0+: “Eager” Mode by default assert(tf.executing_eagerly()) Tensorflow 1.13

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 95 April 15, 2021 TensorFlow: 2.0+ vs. pre-2.0

Tensorflow 2.0+: “Eager” Mode by default assert(tf.executing_eagerly()) Tensorflow 1.13

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 96 April 15, 2021 TensorFlow: 2.0+ vs. pre-2.0

Tensorflow 2.0+: “Eager” Mode by default assert(tf.executing_eagerly()) Tensorflow 1.13

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 97 April 15, 2021 TensorFlow: Neural Net

Convert input numpy arrays to TF tensors. Create weights as tf.Variable

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 98 April 15, 2021 TensorFlow: Neural Net

Use tf.GradientTape() context to build dynamic computation graph.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 99 April 15, 2021 TensorFlow: Neural Net

All forward-pass operations inside the context (including function calls) get traced for computing gradients later.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 100 April 15, 2021 TensorFlow: Neural Net

Forward pass

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 101 April 15, 2021 TensorFlow: Neural Net

tape.gradient() uses the traced computation graph to compute gradient for the weights

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 102 April 15, 2021 TensorFlow: Neural Net

Backward pass

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 103 April 15, 2021 TensorFlow: Neural Net

Train the network: Run the training step over and over, use gradient to update weights

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 104 April 15, 2021 TensorFlow: Neural Net

Train the network: Run the training step over and over, use gradient to update weights
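Gathering the steps above, a 2.x-style sketch of the whole training loop (shapes assumed; tf.Variable weights, a GradientTape, tape.gradient, and assign_sub updates):

```python
# TensorFlow 2.x eager-mode training loop for the two-layer ReLU network.
import numpy as np
import tensorflow as tf

N, D, H = 64, 1000, 100
x = tf.convert_to_tensor(np.random.randn(N, D), dtype=tf.float32)
y = tf.convert_to_tensor(np.random.randn(N, D), dtype=tf.float32)
w1 = tf.Variable(tf.random.normal((D, H)))
w2 = tf.Variable(tf.random.normal((H, D)))

learning_rate = 1e-6
for step in range(500):
    with tf.GradientTape() as tape:          # forward-pass operations get traced
        h = tf.maximum(tf.matmul(x, w1), 0.0)
        y_pred = tf.matmul(h, w2)
        loss = tf.reduce_sum((y_pred - y) ** 2)
    grad_w1, grad_w2 = tape.gradient(loss, [w1, w2])
    w1.assign_sub(learning_rate * grad_w1)   # gradient descent update
    w2.assign_sub(learning_rate * grad_w2)
```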

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 105 April 15, 2021 TensorFlow: Optimizer

Can use an optimizer to compute gradients and update weights
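A sketch with a tf.keras optimizer applying the update from the tape's gradients (SGD chosen only for illustration):

```python
# Use an optimizer object to apply the gradient update.
import tensorflow as tf

w1 = tf.Variable(tf.random.normal((1000, 100)))
w2 = tf.Variable(tf.random.normal((100, 1000)))
x = tf.random.normal((64, 1000))
y = tf.random.normal((64, 1000))

optimizer = tf.keras.optimizers.SGD(learning_rate=1e-6)
for step in range(500):
    with tf.GradientTape() as tape:
        y_pred = tf.matmul(tf.maximum(tf.matmul(x, w1), 0.0), w2)
        loss = tf.reduce_sum((y_pred - y) ** 2)
    grads = tape.gradient(loss, [w1, w2])
    optimizer.apply_gradients(zip(grads, [w1, w2]))   # optimizer updates the weights
```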

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 106 April 15, 2021 TensorFlow: Loss

Use predefined loss functions
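A short sketch using a predefined loss from tf.keras.losses instead of writing the L2 loss by hand:

```python
# Predefined loss object instead of a hand-written squared error.
import tensorflow as tf

mse = tf.keras.losses.MeanSquaredError()
y_true = tf.random.normal((64, 10))
y_pred = tf.random.normal((64, 10))
loss = mse(y_true, y_pred)    # scalar mean-squared-error loss
```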

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 107 April 15, 2021 Keras: High-Level Wrapper

Keras is a layer on top of TensorFlow that makes common things easy to do

(Used to be third-party, now merged into TensorFlow)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 108 April 15, 2021 Keras: High-Level Wrapper

Define model as a sequence of layers

Get output by calling the model

Apply gradient to all trainable variables (weights) in the model
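A sketch of that pattern: a Keras Sequential model, outputs obtained by calling the model, and gradients applied to model.trainable_variables (layer sizes are illustrative):

```python
# Keras model driven by a hand-written training loop.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation="relu", input_shape=(1000,)),
    tf.keras.layers.Dense(10),
])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

x = tf.random.normal((64, 1000))
y = tf.random.normal((64, 10))
for step in range(500):
    with tf.GradientTape() as tape:
        y_pred = model(x)                        # get output by calling the model
        loss = tf.reduce_mean(tf.reduce_sum((y_pred - y) ** 2, axis=1))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
```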

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 109 April 15, 2021 Keras: High-Level Wrapper

Keras can handle the training loop for you!
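A sketch of handing the loop to Keras via compile() and fit() (optimizer, loss, and sizes are illustrative):

```python
# Keras runs the training loop itself: batching, updates, and logging.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation="relu", input_shape=(1000,)),
    tf.keras.layers.Dense(10),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")

x = np.random.randn(64, 1000).astype(np.float32)
y = np.random.randn(64, 10).astype(np.float32)
model.fit(x, y, batch_size=64, epochs=10)
```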

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 110 April 15, 2021 TensorFlow: High-Level Wrappers

Keras (https://keras.io/) tf.keras (https://www.tensorflow.org/api_docs/python/tf/keras) tf.estimator (https://www.tensorflow.org/api_docs/python/tf/estimator)

Sonnet (https://github.com/deepmind/sonnet) TFLearn (http://tflearn.org/) TensorLayer (http://tensorlayer.readthedocs.io/en/latest/)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 111 April 15, 2021 @tf.function: compile static graph

The tf.function decorator (implicitly) compiles Python functions into a static graph for better performance
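A sketch of wrapping a training step in @tf.function (model and sizes are illustrative); the first call traces the function into a graph, and later calls reuse it:

```python
# Compile the training step into a static graph with @tf.function.
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(1000,))])
optimizer = tf.keras.optimizers.SGD(1e-4)

@tf.function                    # traces this Python function into a static graph
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean((model(x) - y) ** 2)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

x = tf.random.normal((64, 1000))
y = tf.random.normal((64, 10))
for step in range(100):
    loss = train_step(x, y)     # first call builds the graph; later calls reuse it
```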

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 112 April 15, 2021 @tf.function: compile static graph (benchmark ran on Google Colab, April 2020). Here we compare the forward-pass time of the same model under dynamic graph mode and static graph mode

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 113 April 15, 2021 @tf.function: compile static graph (benchmark ran on Google Colab, April 2020)

Static graph is in theory faster than dynamic graph, but the performance gain depends on the type of model / layer / computation graph.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 114 April 15, 2021 @tf.function: compile static graph (benchmark ran on Google Colab, April 2020)

Static graph is in theory faster than dynamic graph, but the performance gain depends on the type of model / layer / computation graph.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 115 April 15, 2021 Static vs Dynamic: Optimization

With static graphs, the framework can optimize the graph for you before it runs!

[Figure: the graph you wrote (Conv -> ReLU -> Conv -> ReLU -> Conv -> ReLU) next to an equivalent graph with fused operations (Conv+ReLU -> Conv+ReLU -> Conv+ReLU)]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 116 April 15, 2021 Static PyTorch: ONNX Support

You can export a PyTorch model to ONNX (Open Neural Network Exchange)

Run the graph on a dummy input, and save the graph to a file

This will only work if your model doesn’t actually make use of the dynamic graph: it must build the same graph on every forward pass
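A sketch of such an export with torch.onnx.export, tracing a small Sequential model on a dummy input (the file name is illustrative):

```python
# Run the model once on a dummy input and save the traced graph as ONNX.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1000, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 10),
)
dummy_input = torch.randn(64, 1000)
torch.onnx.export(model, dummy_input, "two_layer_net.onnx", verbose=True)
```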

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 117 April 15, 2021 Static PyTorch: ONNX Support

After exporting to ONNX, you can run the PyTorch model in Caffe2. The exported graph looks like:

graph(%0 : Float(64, 1000) %1 : Float(100, 1000) %2 : Float(100) %3 : Float(10, 100) %4 : Float(10)) { %5 : Float(64, 100) = onnx::Gemm[alpha=1, beta=1, broadcast=1, transB=1](%0, %1, %2), scope: Sequential/Linear[0] %6 : Float(64, 100) = onnx::Relu(%5), scope: Sequential/ReLU[1] %7 : Float(64, 10) = onnx::Gemm[alpha=1, beta=1, broadcast=1, transB=1](%6, %3, %4), scope: Sequential/Linear[2] return (%7); }

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 118 April 15, 2021 Static PyTorch: ONNX Support

ONNX is an open-source standard for neural network models

Goal: Make it easy to train a network in one framework, then run it in another framework

Supported by PyTorch, Caffe2, Microsoft CNTK, Apache MXNet (third-party support for TensorFlow)

https://github.com/onnx/onnx

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 119 April 15, 2021 Static PyTorch: TorchScript

Build a static graph with torch.jit.trace. The traced graph looks like:

graph(%self.1 : __torch__.torch.nn.modules.module.___torch_mangle_4.Module, %input : Float(3, 4), %h : Float(3, 4)): %19 : __torch__.torch.nn.modules.module.___torch_mangle_3.Module = prim::GetAttr[name="linear"](%self.1) %21 : Tensor = prim::CallMethod[name="forward"](%19, %input) %12 : int = prim::Constant[value=1]() %13 : Float(3, 4) = aten::add(%21, %h, %12) %14 : Float(3, 4) = aten::tanh(%13) %15 : (Float(3, 4), Float(3, 4)) = prim::TupleConstruct(%14, %14) return (%15)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 120 April 15, 2021 PyTorch vs TensorFlow, Static vs Dynamic

PyTorch: dynamic graphs by default; static graphs via ONNX export or TorchScript. TensorFlow: dynamic graphs (eager) by default; static graphs via @tf.function.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 121 April 15, 2021 Static vs Dynamic: Serialization

Static: once the graph is built, you can serialize it and run it without the code that built the graph! Dynamic: graph building and execution are intertwined, so you always need to keep the code around.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 122 April 15, 2021 Dynamic Graph Applications

- Recurrent networks

Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015 Figure copyright IEEE, 2015. Reproduced for educational purposes.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 123 April 15, 2021 Dynamic Graph Applications

- Recurrent networks - Recursive networks

The cat ate a big rat

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 124 April 15, 2021 Dynamic Graph Applications

- Recurrent networks - Recursive networks - Modular networks

Figure copyright Justin Johnson, 2017. Reproduced with permission. Andreas et al, “Neural Module Networks”, CVPR 2016 Andreas et al, “Learning to Compose Neural Networks for Question Answering”, NAACL 2016 Johnson et al, “Inferring and Executing Programs for Visual Reasoning”, ICCV 2017

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 125 April 15, 2021 Dynamic Graph Applications

- Recurrent networks - Recursive networks - Modular Networks - (Your creative idea here)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 126 April 15, 2021 Model Parallel vs. Data Parallel

Model parallelism: split the computation graph into parts and distribute them across GPUs / nodes. Data parallelism: split the minibatch into chunks and distribute them across GPUs / nodes.

[Figure: a model split across devices (Model Parallel) vs. a minibatch split across devices (Data Parallel)] Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 127 April 15, 2021 PyTorch: Data Parallel

nn.DataParallel
Pro: easy to use (just wrap the model and run the training script as normal).
Con: single process & single node; can be bottlenecked by the CPU with a large number of GPUs (8+).

nn.DistributedDataParallel
Pro: multi-node & multi-process training.
Con: need to hand-designate devices and manually launch the training script for each process / node.
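A minimal nn.DataParallel sketch (model and batch are illustrative); the wrapper scatters each minibatch across the visible GPUs and gathers the outputs:

```python
# Single-node data parallelism: wrap the model and train as usual.
import torch

model = torch.nn.Linear(1000, 10)
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)   # replicates the model on each GPU
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(256, 1000).to(next(model.parameters()).device)
scores = model(x)   # minibatch is split across GPUs, outputs are gathered back
```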

Horovod (https://github.com/horovod/horovod): supports both PyTorch and TensorFlow. https://pytorch.org/docs/stable/nn.html#dataparallel-layers-multi-gpu-distributed Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 128 April 15, 2021 TensorFlow: Data Parallel tf.distribute.Strategy
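A sketch of tf.distribute.MirroredStrategy with Keras (model and data are illustrative); variables created under the strategy scope are mirrored across GPUs and each batch is split between the replicas:

```python
# Data parallelism in TensorFlow via MirroredStrategy.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(100, activation="relu", input_shape=(1000,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(optimizer="adam", loss="mse")

x = tf.random.normal((256, 1000))
y = tf.random.normal((256, 10))
model.fit(x, y, batch_size=64, epochs=2)   # each replica gets a slice of every batch
```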

https://www.tensorflow.org/tutorials/distribute/keras Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 129 April 15, 2021 PyTorch vs. TensorFlow: Academia

https://thegradient.pub/state-of-ml-frameworks-2019-pytorch-dominates-re search-tensorflow-dominates-industry/ Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 130 April 15, 2021 PyTorch vs. TensorFlow: Academia

https://thegradient.pub/state-of-ml-frameworks-2019-pytorch-dominates-re search-tensorflow-dominates-industry/

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 131 April 15, 2021 PyTorch vs. TensorFlow: Industry (2021)

● No official survey / study on the comparison.

● A quick search on a job posting website turns up 2389 search results for TensorFlow and 1366 for PyTorch.

● The trend is unclear. Industry is also known to be slower on adopting new frameworks.

● TensorFlow mostly dominates mobile deployment / embedded systems.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 132 April 15, 2021 My Advice:

PyTorch is my personal favorite. Its clean API and native dynamic graphs make it very easy to develop and debug. You can build a model with the default API and then compile a static graph with the JIT. Lots of research repositories are built on PyTorch.

TensorFlow’s syntax became a lot more intuitive after 2.0. Not perfect, but it has a huge community and wide usage. You can use the same framework for research and production. Probably use a higher-level wrapper (Keras, Sonnet, etc.).

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 133 April 15, 2021 Next Time: Training Neural Networks

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 134 April 15, 2021