Lecture 6: Hardware and Software. Deep Learning Hardware, Dynamic & Static Computational Graphs, PyTorch & TensorFlow

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 1 April 15, 2021 Administrative

Assignment 1 is due tomorrow April 16th, 11:59pm.

Assignment 2 will be out tomorrow, due April 30th, 11:59 pm.

Project proposal due Monday April 19.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 2 April 15, 2021 Administrative

Friday’s section topic: course project

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 3 April 15, 2021 two more layers to go: POOL/FC

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 4 April 15, 2021 Convolution Layers (continued from last time)

[Example: a 56x56x32 input passed through a 3x3 CONV layer with 64 filters, stride 1, padding 1 produces a 56x56x64 output.]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 5 April 15, 2021 Pooling layer - makes the representations smaller and more manageable - operates over each activation map independently:

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 6 April 15, 2021 MAX POOLING

Single depth slice (4x4 input), max pool with 2x2 filters and stride 2:

    1 1 2 4        6 8
    5 6 7 8   ->   3 4
    3 2 1 0
    1 2 3 4

(x and y denote the spatial axes of the activation map.)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 7 April 15, 2021 Pooling layer: summary

Let’s assume the input is W1 x H1 x C. The pooling layer requires 2 hyperparameters: - The spatial extent F - The stride S

This will produce an output of W2 x H2 x C where: - W2 = (W1 - F)/S + 1 - H2 = (H1 - F)/S + 1. For example, a 224x224x64 input with F = 2 and S = 2 produces a 112x112x64 output.

Number of parameters: 0

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 8 April 15, 2021 Fully Connected Layer (FC layer) - Contains neurons that connect to the entire input volume, as in ordinary Neural Networks

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 9 April 15, 2021 Lecture 6: Hardware and Software Deep Learning Hardware, Dynamic & Static Computational Graph, PyTorch & TensorFlow

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 10 April 15, 2021 Where we are now... Computational graphs

[Computational graph: inputs x and W feed a multiply node producing scores s; the hinge loss and the regularization term R are summed (+) to give the total loss L.]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 11 April 15, 2021 Where we are now... Convolutional Neural Networks

Illustration of LeCun et al. 1998 from CS231n 2017 Lecture 1

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 12 April 15, 2021 Where we are now… (more on optimization in lecture 8) Learning network parameters through optimization

Landscape image is CC0 1.0 public domain Walking man image is CC0 1.0 public domain

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 13 April 15, 2021 Today

- Deep learning hardware - CPU, GPU - Deep learning software - PyTorch and TensorFlow - Static and Dynamic computation graphs

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 14 April 15, 2021 Deep Learning Hardware

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 15 April 15, 2021 Inside a computer

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 16 April 15, 2021 Spot the CPU! (central processing unit)

This image is licensed under CC-BY 2.0

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 17 April 15, 2021 Spot the GPUs! (graphics processing unit)

This image is licensed under CC-BY 2.0

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 18 April 15, 2021 CPU vs GPU

Cores / Clock Speed / Memory / Price / Speed (throughput):
CPU (Core i9-7900k): 10 cores, 4.3 GHz, System RAM, $385, ~640 GFLOPS FP32
GPU (RTX 3090): 10496 cores, 1.6 GHz, 24 GB GDDR6X, $1499, ~35.6 TFLOPS FP32

CPU: fewer cores, but each core is much faster and much more capable; great at sequential tasks.

GPU: more cores, but each core is much slower and “dumber”; great for parallel tasks.
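To make the throughput gap concrete, here is a rough timing sketch (illustrative, not from the slides) of a large matrix multiply on CPU vs GPU with PyTorch; it assumes torch is installed and a CUDA device is available for the GPU half:

```python
# Rough CPU-vs-GPU matrix-multiply timing sketch (sizes are illustrative).
import time
import torch

def time_matmul(device, n=4096, reps=10):
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()      # make sure setup finished before timing
    start = time.time()
    for _ in range(reps):
        c = a @ b                     # large GEMM
    if device == "cuda":
        torch.cuda.synchronize()      # wait for the GPU kernels to finish
    return (time.time() - start) / reps

print("CPU:", time_matmul("cpu"))
if torch.cuda.is_available():
    print("GPU:", time_matmul("cuda"))
```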

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 19 April 15, 2021 Example: Matrix Multiplication

[Figure: multiplying an A x B matrix by a B x C matrix yields an A x C matrix; each of the A x C output elements can be computed independently, which is why matrix multiply parallelizes so well on a GPU.]

cuBLAS::GEMM (GEneral Matrix-to-matrix Multiply) Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 20 April 15, 2021 CPU vs GPU in practice (CPU performance not well-optimized, a little unfair)

[Benchmark figure: GPU is roughly 64x-76x faster than CPU across the tested CNN architectures.]

Data from https://github.com/jcjohnson/cnn-benchmarks

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 21 April 15, 2021 CPU vs GPU in practice: cuDNN much faster than “unoptimized” CUDA

[Benchmark figure: cuDNN gives roughly a 2.8x-3.4x speedup over “unoptimized” CUDA across the same models.]

Data from https://github.com/jcjohnson/cnn-benchmarks

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 22 April 15, 2021 Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 23 April 15, 2021 NVIDIA vs AMD

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 24 April 15, 2021 NVIDIA vs AMD

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 25 April 15, 2021 CPU vs GPU

Cores / Clock Speed / Memory / Price / Speed:
CPU (Intel Core i7-7700k): 10 cores, 4.3 GHz, System RAM, $385, ~640 GFLOPs FP32
GPU (NVIDIA RTX 3090): 10496 cores, 1.6 GHz, 24 GB GDDR6X, $1499, ~35.6 TFLOPs FP32
GPU (Data Center, NVIDIA A100): 6912 CUDA cores + 432 Tensor cores, 1.5 GHz, 40/80 GB HBM2, $3/hr (GCP), ~9.7 TFLOPs FP64 / ~20 TFLOPs FP32 / ~312 TFLOPs FP16
TPU (Google Cloud TPUv3): 2 Matrix Units (MXUs) per core, 4 cores, clock ?, 128 GB HBM, $8/hr (GCP), ~420 TFLOPs (non-standard FP)

CPU: fewer cores, but each core is much faster and much more capable; great at sequential tasks.
GPU: more cores, but each core is much slower and “dumber”; great for parallel tasks.
TPU: specialized hardware for deep learning.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 26 April 15, 2021 Programming GPUs

● CUDA (NVIDIA only) ○ Write C-like code that runs directly on the GPU ○ Optimized APIs: cuBLAS, cuFFT, cuDNN, etc ● OpenCL ○ Similar to CUDA, but runs on anything ○ Usually slower on NVIDIA hardware ● HIP https://github.com/ROCm-Developer-Tools/HIP ○ New project that automatically converts CUDA code to something that can run on AMD GPUs ● Stanford CS 149: http://cs149.stanford.edu/fall20/

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 27 April 15, 2021 CPU / GPU Communication

[Figure: the Model is here (on the GPU); the Data is here (on disk).]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 28 April 15, 2021 CPU / GPU Communication

[Figure: the Model is here (on the GPU); the Data is here (on disk).]

If you aren’t careful, training can bottleneck on reading data and transferring to GPU!

Solutions: - Read all data into RAM - Use SSD instead of HDD - Use multiple CPU threads to prefetch data
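As an illustration of the prefetching solution, here is a minimal sketch (illustrative, not from the slides) using PyTorch's DataLoader with a toy dataset; num_workers and pin_memory are the knobs that matter here:

```python
# Prefetching sketch: worker processes load batches while the GPU trains,
# and pinned memory speeds up host-to-GPU copies.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(10000, 3, 32, 32),
                        torch.randint(0, 10, (10000,)))
loader = DataLoader(dataset, batch_size=128, shuffle=True,
                    num_workers=4,      # CPU worker processes prefetch batches
                    pin_memory=True)    # page-locked memory for faster transfers

device = "cuda" if torch.cuda.is_available() else "cpu"
for images, labels in loader:
    images = images.to(device, non_blocking=True)  # overlap copy with compute
    labels = labels.to(device, non_blocking=True)
    # ... forward / backward / update ...
```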

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 29 April 15, 2021 Deep Learning Software

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 30 April 15, 2021 A zoo of frameworks!
- Caffe (UC Berkeley) -> Caffe2 (Facebook): features mostly absorbed by PyTorch
- Torch (NYU / Facebook) -> PyTorch (Facebook)
- Theano (U Montreal) -> TensorFlow (Google)
- Chainer (Preferred Networks): the company has officially migrated its research infrastructure to PyTorch
- PaddlePaddle (Baidu)
- MXNet (Amazon): developed by U Washington, CMU, MIT, Hong Kong U, etc., but the main framework of choice at AWS
- CNTK (Microsoft)
- JAX (Google)
- And others...

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 31 April 15, 2021 A zoo of frameworks! (same zoo as above) We’ll focus on these: PyTorch and TensorFlow.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 32 April 15, 2021 Recall: Computational Graphs

[Computational graph: inputs x and W feed a multiply node producing scores s; the hinge loss and the regularization term R are summed (+) to give the total loss L.]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 33 April 15, 2021 Recall: Computational Graphs

input image

weights

loss

Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 34 April 15, 2021 Recall: Computational Graphs

input image

loss

Figure reproduced with permission from a Twitter post by Andrej Karpathy.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 35 April 15, 2021 The point of deep learning frameworks

(1) Quick to develop and test new ideas (2) Automatically compute gradients (3) Run it all efficiently on GPU (wrap cuDNN, cuBLAS, OpenCL, etc)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 36 April 15, 2021 Computational Graphs: Numpy. [Graph: a = x * y, b = a + z, c = sum(b)]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 37 April 15, 2021 Computational Graphs: Numpy. [Same graph: a = x * y, b = a + z, c = sum(b)]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 38 April 15, 2021 Computational Graphs: Numpy. Good: clean API, easy to write numeric code. Bad: we have to compute our own gradients, and it can’t run on GPU.
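A minimal numpy sketch of the graph above (a = x * y, b = a + z, c = sum(b)), with the gradients derived by hand; shapes are illustrative:

```python
# Forward and manual backward pass for the toy graph in numpy.
import numpy as np

np.random.seed(0)
N, D = 3, 4
x, y, z = np.random.randn(N, D), np.random.randn(N, D), np.random.randn(N, D)

# Forward pass
a = x * y
b = a + z
c = np.sum(b)

# Backward pass: gradients of c with respect to x, y, z, derived by hand
grad_c = 1.0
grad_b = grad_c * np.ones((N, D))
grad_a = grad_b.copy()
grad_z = grad_b.copy()
grad_x = grad_a * y
grad_y = grad_a * x
```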

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 39 April 15, 2021 Computational Graphs: PyTorch. [Same graph: a = x * y, b = a + z, c = sum(b)]

Looks exactly like numpy!

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 40 April 15, 2021 Computational Graphs: PyTorch. PyTorch handles gradients for us!

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 41 April 15, 2021 Computational Graphs: PyTorch.

Trivial to run on GPU - just construct the arrays on a different device!
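For comparison, a sketch of the same graph in PyTorch: autograd supplies the gradients, and moving to GPU is just a matter of the device argument (this falls back to CPU if CUDA is unavailable):

```python
# The same toy graph in PyTorch: autograd gradients, optionally on the GPU.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
N, D = 3, 4
x = torch.randn(N, D, device=device, requires_grad=True)
y = torch.randn(N, D, device=device, requires_grad=True)
z = torch.randn(N, D, device=device, requires_grad=True)

a = x * y
b = a + z
c = torch.sum(b)

c.backward()                    # PyTorch computes the gradients for us
print(x.grad, y.grad, z.grad)
```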

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 42 April 15, 2021 PyTorch (More details)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 43 April 15, 2021 PyTorch: Fundamental Concepts

torch.Tensor: Like a numpy array, but can run on GPU

torch.autograd: Package for building computational graphs out of Tensors, and automatically computing gradients

torch.nn.Module: A neural network layer; may store state or learnable weights

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 44 April 15, 2021 PyTorch: Versions

For this class we are using PyTorch version 1.7

Major API change in release 1.0

Be careful if you are looking at older PyTorch code (<1.0)!

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 45 April 15, 2021 PyTorch: Tensors

Running example: Train a two-layer ReLU network on random data with L2 loss

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 46 April 15, 2021 PyTorch: Tensors

Create random tensors for data and weights

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 47 April 15, 2021 PyTorch: Tensors

Forward pass: compute predictions and loss

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 48 April 15, 2021 PyTorch: Tensors

Backward pass: manually compute gradients

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 49 April 15, 2021 PyTorch: Tensors

Gradient descent step on weights

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 50 April 15, 2021 PyTorch: Tensors

To run on GPU, just use a different device!
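A minimal sketch of the running example with plain Tensors and hand-computed gradients, using hypothetical sizes (N=64, D_in=1000, H=100, D_out=10):

```python
# Two-layer ReLU network on random data with L2 loss; gradients by hand.
import torch

device = torch.device("cpu")     # swap for torch.device("cuda") to run on GPU
N, D_in, H, D_out = 64, 1000, 100, 10

x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)
w1 = torch.randn(D_in, H, device=device)
w2 = torch.randn(H, D_out, device=device)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: predictions and L2 loss
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    loss = (y_pred - y).pow(2).sum()

    # Backward pass: manually computed gradients
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Gradient descent step on the weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
```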

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 51 April 15, 2021 PyTorch: Autograd

Creating Tensors with requires_grad=True enables autograd

Operations on Tensors with requires_grad=True cause PyTorch to build a computational graph

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 52 April 15, 2021 PyTorch: Autograd

Forward pass looks exactly the same as before, but we don’t need to track intermediate values - PyTorch keeps track of them for us in the graph

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 53 April 15, 2021 PyTorch: Autograd

Compute gradient of loss with respect to w1 and w2

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 54 April 15, 2021 PyTorch: Autograd

Make a gradient step on the weights, then zero their gradients. torch.no_grad() means “don’t build a computational graph for this part”.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 55 April 15, 2021 PyTorch: Autograd

PyTorch methods that end in underscore modify the Tensor in-place; methods that don’t return a new Tensor
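Putting the autograd pieces together, a sketch of the same training loop where loss.backward() replaces the manual backward pass (same assumed sizes as before):

```python
# Two-layer network trained with autograd instead of manual gradients.
import torch

N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
w1 = torch.randn(D_in, H, requires_grad=True)   # requires_grad=True enables autograd
w2 = torch.randn(H, D_out, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: no need to keep intermediate values ourselves
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()

    loss.backward()                   # gradients with respect to w1 and w2

    with torch.no_grad():             # don't build a graph for the update
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()               # underscore methods modify tensors in place
        w2.grad.zero_()
```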

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 56 April 15, 2021 PyTorch: New Autograd Functions

Define your own autograd functions by writing forward and backward functions for Tensors

Use ctx object to “cache” values for the backward pass, just like cache objects from A2

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 57 April 15, 2021 PyTorch: New Autograd Functions

Define your own autograd functions by writing forward and backward functions for Tensors

Use ctx object to “cache” values for the backward pass, just like cache objects from A2

Define a helper function to make it easy to use the new function

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 58 April 15, 2021 PyTorch: New Autograd Functions

Can use our new autograd function in the forward pass

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 59 April 15, 2021 PyTorch: New Autograd Functions

In practice you almost never need to define new autograd functions! Only do it when you need custom backward. In this case we can just use a normal Python function
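For completeness, a sketch of what such a custom function can look like, here a ReLU written against the standard torch.autograd.Function API, followed by the plain-function alternative that usually suffices:

```python
# Custom autograd function with a ctx object to "cache" values for backward.
import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)        # cache x for the backward pass
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        grad_x = grad_output.clone()
        grad_x[x < 0] = 0
        return grad_x

def my_relu(x):                         # helper so callers don't see .apply
    return MyReLU.apply(x)

# In practice a normal Python function is usually enough:
def my_relu_simple(x):
    return x.clamp(min=0)               # autograd already knows how to differentiate this
```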

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 60 April 15, 2021 PyTorch: nn

Higher-level wrapper for working with neural nets

Use this! It will make your life easier

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 61 April 15, 2021 PyTorch: nn

Define our model as a sequence of layers; each layer is an object that holds learnable weights

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 62 April 15, 2021 PyTorch: nn

Forward pass: feed data to model, and compute loss

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 63 April 15, 2021 PyTorch: nn

Forward pass: feed data to model, and compute loss torch.nn.functional has useful helpers like loss functions

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 64 April 15, 2021 PyTorch: nn

Backward pass: compute gradient with respect to all model weights (they have requires_grad=True)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 65 April 15, 2021 PyTorch: nn

Make gradient step on each model parameter (with gradients disabled)
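A sketch of the same training loop written with torch.nn (assumed sizes as before): the Sequential model owns the weights, torch.nn.functional supplies the loss, and the update runs under torch.no_grad():

```python
# Two-layer network expressed with torch.nn layers.
import torch

N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Model as a sequence of layers; each layer object holds its learnable weights
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

learning_rate = 1e-4
for t in range(500):
    y_pred = model(x)                                 # forward pass
    loss = torch.nn.functional.mse_loss(y_pred, y)    # predefined loss helper

    model.zero_grad()
    loss.backward()          # gradients for all parameters (requires_grad=True)

    with torch.no_grad():    # gradient step on each parameter, gradients disabled
        for param in model.parameters():
            param -= learning_rate * param.grad
```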

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 66 April 15, 2021 PyTorch: optim

Use an optimizer for different update rules

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 67 April 15, 2021 PyTorch: optim

After computing gradients, use optimizer to update params and zero gradients
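A sketch of the loop with torch.optim handling the update rule (Adam here, chosen only for illustration):

```python
# Let an optimizer object perform the update and zero the gradients.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1000, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 10),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x, y = torch.randn(64, 1000), torch.randn(64, 10)
for t in range(500):
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()            # compute gradients
    optimizer.step()           # update all parameters
    optimizer.zero_grad()      # zero gradients for the next iteration
```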

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 68 April 15, 2021 PyTorch: nn Define new Modules

A PyTorch Module is a neural net layer; it inputs and outputs Tensors

Modules can contain weights or other modules

You can define your own Modules using autograd!

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 69 April 15, 2021 PyTorch: nn Define new Modules

Define our whole model as a single Module

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 70 April 15, 2021 PyTorch: nn Define new Modules

Initializer sets up two children (Modules can contain modules)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 71 April 15, 2021 PyTorch: nn Define new Modules

Define forward pass using child modules

No need to define backward - autograd will handle it

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 72 April 15, 2021 PyTorch: nn Define new Modules

Construct and train an instance of our model
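A sketch of such a Module subclass and a short training loop around it (sizes and optimizer chosen for illustration):

```python
# Whole model as a single Module; children are set up in __init__,
# forward uses them, and autograd handles backward.
import torch

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.linear1 = torch.nn.Linear(D_in, H)   # Modules can contain Modules
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        h_relu = self.linear1(x).clamp(min=0)
        return self.linear2(h_relu)

model = TwoLayerNet(1000, 100, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(64, 1000), torch.randn(64, 10)
for t in range(500):
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```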

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 73 April 15, 2021 PyTorch: nn Define new Modules

Very common to mix and match custom Module subclasses and Sequential containers

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 74 April 15, 2021 PyTorch: nn Define new Modules

Define network component as a Module subclass

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 75 April 15, 2021 PyTorch: nn Define new Modules

Stack multiple instances of the component in a sequential
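A sketch of this pattern: a hypothetical ParallelBlock component defined as a Module subclass, then stacked inside a Sequential container:

```python
# Mix a custom Module subclass with a Sequential container.
import torch

class ParallelBlock(torch.nn.Module):
    def __init__(self, D_in, D_out):
        super().__init__()
        self.linear1 = torch.nn.Linear(D_in, D_out)
        self.linear2 = torch.nn.Linear(D_in, D_out)

    def forward(self, x):
        h1 = self.linear1(x)
        h2 = self.linear2(x)
        return (h1 * h2).clamp(min=0)   # combine the two parallel branches

model = torch.nn.Sequential(
    ParallelBlock(1000, 100),
    ParallelBlock(100, 100),
    torch.nn.Linear(100, 10),
)
```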

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 76 April 15, 2021 PyTorch: Pretrained Models

Super easy to use pretrained models with torchvision https://github.com/pytorch/vision
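A sketch of loading pretrained torchvision models (the pretrained=True flag matches the torchvision API of that era):

```python
# Load ImageNet-pretrained models from torchvision and run inference.
import torch
import torchvision

alexnet = torchvision.models.alexnet(pretrained=True)
resnet101 = torchvision.models.resnet101(pretrained=True)

resnet101.eval()                         # inference mode
x = torch.randn(1, 3, 224, 224)          # dummy image batch
with torch.no_grad():
    scores = resnet101(x)                # 1 x 1000 class scores
```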

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 77 April 15, 2021 PyTorch: torch.utils.tensorboard

A Python wrapper around TensorFlow’s web-based visualization tool.
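A minimal logging sketch with torch.utils.tensorboard, using a hypothetical run directory and a stand-in loss value:

```python
# Log scalars during training and view them in TensorBoard.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/experiment1")   # hypothetical log directory
for step in range(100):
    loss = 1.0 / (step + 1)                  # stand-in for a real training loss
    writer.add_scalar("train/loss", loss, step)
writer.close()
# Then view with:  tensorboard --logdir runs
```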

This image is licensed under CC-BY 4.0; no changes were made to the image

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 78 April 15, 2021 PyTorch: Computational Graphs

input image

loss

Figure reproduced with permission from a Twitter post by Andrej Karpathy.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 79 April 15, 2021 PyTorch: Dynamic Computation Graphs

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 80 April 15, 2021 PyTorch: Dynamic Computation Graphs

[Graph: Tensor nodes x, w1, w2, y]

Create Tensor objects

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 81 April 15, 2021 PyTorch: Dynamic Computation Graphs

[Graph: x and w1 feed mm; its output feeds clamp; the clamped output and w2 feed a second mm, producing y_pred]

Build graph data structure AND perform computation

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 82 April 15, 2021 PyTorch: Dynamic Computation Graphs

[Graph: as above, and y_pred and y feed pow -> sum, producing loss]

Build graph data structure AND perform computation

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 83 April 15, 2021 PyTorch: Dynamic Computation Graphs

[Graph: the complete forward graph from x, w1, w2, y through mm, clamp, mm, pow, sum to loss]

Search for a path between loss and w1, w2 (for backprop) AND perform computation

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 84 April 15, 2021 PyTorch: Dynamic Computation Graphs

[Graph: only the Tensor nodes x, w1, w2, y remain]

Throw away the graph and backprop path, and rebuild them from scratch on every iteration

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 85 April 15, 2021 PyTorch: Dynamic Computation Graphs

[Graph (next iteration): x, w1 -> mm -> clamp; the clamped output and w2 -> mm -> y_pred]

Build graph data structure AND perform computation

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 86 April 15, 2021 PyTorch: Dynamic Computation Graphs

[Graph (next iteration): as above, plus y_pred and y -> pow -> sum -> loss]

Build graph data structure AND perform computation

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 87 April 15, 2021 PyTorch: Dynamic Computation Graphs

[Graph (next iteration): the complete forward graph from x, w1, w2, y to loss]

Search for a path between loss and w1, w2 (for backprop) AND perform computation

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 88 April 15, 2021 PyTorch: Dynamic Computation Graphs

Building the graph and computing the graph happen at the same time.

Seems inefficient, especially if we are building the same graph over and over again...

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 89 April 15, 2021 Static Computation Graphs

Alternative: Static graphs

Step 1: Build computational graph describing our computation (including finding paths for backprop)

Step 2: Reuse the same graph on every iteration

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 90 April 15, 2021 TensorFlow

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 91 April 15, 2021 TensorFlow Versions

Pre-2.0 (1.14 latest): static graph by default, optionally dynamic graph (eager mode). 2.0+: dynamic graph by default, optionally static graph. We use 2.4 in this class.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 92 April 15, 2021 TensorFlow: Neural Net (Pre-2.0)

(Assume imports at the top of each snippet)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 93 April 15, 2021 TensorFlow: Neural Net (Pre-2.0) First define computational graph

Then run the graph many times
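A pre-2.0-style sketch of the two steps, assuming a TensorFlow 1.x installation (placeholders and a Session):

```python
# Step 1: define the static graph. Step 2: run the same graph many times.
import numpy as np
import tensorflow as tf   # assumes TensorFlow 1.x

N, D, H = 64, 1000, 100
x = tf.placeholder(tf.float32, shape=(N, D))
y = tf.placeholder(tf.float32, shape=(N, D))
w1 = tf.placeholder(tf.float32, shape=(D, H))
w2 = tf.placeholder(tf.float32, shape=(H, D))

h = tf.maximum(tf.matmul(x, w1), 0.0)          # two-layer ReLU network
y_pred = tf.matmul(h, w2)
loss = tf.reduce_sum((y_pred - y) ** 2)        # L2 loss
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])

with tf.Session() as sess:
    feed = {x: np.random.randn(N, D), y: np.random.randn(N, D),
            w1: np.random.randn(D, H), w2: np.random.randn(H, D)}
    for _ in range(20):                        # reuse the same graph every iteration
        loss_val, g1, g2 = sess.run([loss, grad_w1, grad_w2], feed_dict=feed)
```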

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 94 April 15, 2021 TensorFlow: 2.0+ vs. pre-2.0

Tensorflow 2.0+: “Eager” Mode by default assert(tf.executing_eagerly()) Tensorflow 1.13

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 95 April 15, 2021 TensorFlow: 2.0+ vs. pre-2.0

Tensorflow 2.0+: “Eager” Mode by default assert(tf.executing_eagerly()) Tensorflow 1.13

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 96 April 15, 2021 TensorFlow: 2.0+ vs. pre-2.0

Tensorflow 2.0+: “Eager” Mode by default assert(tf.executing_eagerly()) Tensorflow 1.13

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 97 April 15, 2021 TensorFlow: Neural Net

Convert input numpy arrays to TF tensors. Create weights as tf.Variable

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 98 April 15, 2021 TensorFlow: Neural Net

Use tf.GradientTape() context to build dynamic computation graph.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 99 April 15, 2021 TensorFlow: Neural Net

All forward-pass operations inside the context (including function calls) get traced for computing gradients later.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 100 April 15, 2021 TensorFlow: Neural Net

Forward pass

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 101 April 15, 2021 TensorFlow: Neural Net

tape.gradient() uses the traced computation graph to compute gradient for the weights

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 102 April 15, 2021 TensorFlow: Neural Net

Backward pass

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 103 April 15, 2021 TensorFlow: Neural Net

Train the network: Run the training step over and over, use gradient to update weights

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 104 April 15, 2021 TensorFlow: Neural Net

Train the network: Run the training step over and over, use gradient to update weights
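Gathering the steps above, a 2.x-style sketch of the whole training loop (shapes assumed; tf.Variable weights, a GradientTape, tape.gradient, and assign_sub updates):

```python
# TensorFlow 2.x eager-mode training loop for the two-layer ReLU network.
import numpy as np
import tensorflow as tf

N, D, H = 64, 1000, 100
x = tf.convert_to_tensor(np.random.randn(N, D), dtype=tf.float32)
y = tf.convert_to_tensor(np.random.randn(N, D), dtype=tf.float32)
w1 = tf.Variable(tf.random.normal((D, H)))
w2 = tf.Variable(tf.random.normal((H, D)))

learning_rate = 1e-6
for step in range(500):
    with tf.GradientTape() as tape:          # forward-pass operations get traced
        h = tf.maximum(tf.matmul(x, w1), 0.0)
        y_pred = tf.matmul(h, w2)
        loss = tf.reduce_sum((y_pred - y) ** 2)
    grad_w1, grad_w2 = tape.gradient(loss, [w1, w2])
    w1.assign_sub(learning_rate * grad_w1)   # gradient descent update
    w2.assign_sub(learning_rate * grad_w2)
```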

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 105 April 15, 2021 TensorFlow: Optimizer

Can use an optimizer to compute gradients and update weights
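A sketch with a tf.keras optimizer applying the update from the tape's gradients (SGD chosen only for illustration):

```python
# Use an optimizer object to apply the gradient update.
import tensorflow as tf

w1 = tf.Variable(tf.random.normal((1000, 100)))
w2 = tf.Variable(tf.random.normal((100, 1000)))
x = tf.random.normal((64, 1000))
y = tf.random.normal((64, 1000))

optimizer = tf.keras.optimizers.SGD(learning_rate=1e-6)
for step in range(500):
    with tf.GradientTape() as tape:
        y_pred = tf.matmul(tf.maximum(tf.matmul(x, w1), 0.0), w2)
        loss = tf.reduce_sum((y_pred - y) ** 2)
    grads = tape.gradient(loss, [w1, w2])
    optimizer.apply_gradients(zip(grads, [w1, w2]))   # optimizer updates the weights
```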

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 106 April 15, 2021 TensorFlow: Loss

Use predefined loss functions
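A short sketch using a predefined loss from tf.keras.losses instead of writing the L2 loss by hand:

```python
# Predefined loss object instead of a hand-written squared error.
import tensorflow as tf

mse = tf.keras.losses.MeanSquaredError()
y_true = tf.random.normal((64, 10))
y_pred = tf.random.normal((64, 10))
loss = mse(y_true, y_pred)    # scalar mean-squared-error loss
```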

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 107 April 15, 2021 Keras: High-Level Wrapper

Keras is a layer on top of TensorFlow that makes common things easy to do

(Used to be third-party, now merged into TensorFlow)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 108 April 15, 2021 Keras: High-Level Wrapper

Define model as a sequence of layers

Get output by calling the model

Apply gradient to all trainable variables (weights) in the model
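A sketch of that pattern: a Keras Sequential model, outputs obtained by calling the model, and gradients applied to model.trainable_variables (layer sizes are illustrative):

```python
# Keras model driven by a hand-written training loop.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation="relu", input_shape=(1000,)),
    tf.keras.layers.Dense(10),
])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

x = tf.random.normal((64, 1000))
y = tf.random.normal((64, 10))
for step in range(500):
    with tf.GradientTape() as tape:
        y_pred = model(x)                        # get output by calling the model
        loss = tf.reduce_mean(tf.reduce_sum((y_pred - y) ** 2, axis=1))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
```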

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 109 April 15, 2021 Keras: High-Level Wrapper

Keras can handle the training loop for you!
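A sketch of handing the loop to Keras via compile() and fit() (optimizer, loss, and sizes are illustrative):

```python
# Keras runs the training loop itself: batching, updates, and logging.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation="relu", input_shape=(1000,)),
    tf.keras.layers.Dense(10),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")

x = np.random.randn(64, 1000).astype(np.float32)
y = np.random.randn(64, 10).astype(np.float32)
model.fit(x, y, batch_size=64, epochs=10)
```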

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 110 April 15, 2021 TensorFlow: High-Level Wrappers

Keras (https://keras.io/) tf.keras (https://www.tensorflow.org/api_docs/python/tf/keras) tf.estimator (https://www.tensorflow.org/api_docs/python/tf/estimator)

Sonnet (https://github.com/deepmind/sonnet) TFLearn (http://tflearn.org/) TensorLayer (http://tensorlayer.readthedocs.io/en/latest/)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 111 April 15, 2021 @tf.function: compile static graph

The tf.function decorator (implicitly) compiles Python functions into a static graph for better performance
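A sketch of wrapping a training step in @tf.function (model and sizes are illustrative); the first call traces the function into a graph, and later calls reuse it:

```python
# Compile the training step into a static graph with @tf.function.
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(1000,))])
optimizer = tf.keras.optimizers.SGD(1e-4)

@tf.function                    # traces this Python function into a static graph
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean((model(x) - y) ** 2)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

x = tf.random.normal((64, 1000))
y = tf.random.normal((64, 10))
for step in range(100):
    loss = train_step(x, y)     # first call builds the graph; later calls reuse it
```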

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 112 April 15, 2021 @tf.function: compile static graph (benchmark ran on Google Colab, April 2020). Here we compare the forward-pass time of the same model under dynamic graph mode and static graph mode

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 113 April 15, 2021 @tf.function: compile static graph (benchmark ran on Google Colab, April 2020)

Static graph is in theory faster than dynamic graph, but the performance gain depends on the type of model / layer / computation graph.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 114 April 15, 2021 @tf.function: compile static graph (benchmark ran on Google Colab, April 2020)

Static graph is in theory faster than dynamic graph, but the performance gain depends on the type of model / layer / computation graph.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 115 April 15, 2021 Static vs Dynamic: Optimization

With static graphs, the framework can optimize the graph for you before it runs!

[Figure: the graph you wrote (Conv -> ReLU -> Conv -> ReLU -> Conv -> ReLU) next to an equivalent graph with fused operations (Conv+ReLU -> Conv+ReLU -> Conv+ReLU)]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 116 April 15, 2021 Static PyTorch: ONNX Support

You can export a PyTorch model to ONNX (Open Neural Network Exchange)

Run the graph on a dummy input, and save the graph to a file

This will only work if your model doesn’t actually make use of the dynamic graph: it must build the same graph on every forward pass
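A sketch of such an export with torch.onnx.export, tracing a small Sequential model on a dummy input (the file name is illustrative):

```python
# Run the model once on a dummy input and save the traced graph as ONNX.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1000, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 10),
)
dummy_input = torch.randn(64, 1000)
torch.onnx.export(model, dummy_input, "two_layer_net.onnx", verbose=True)
```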

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 117 April 15, 2021 Static PyTorch: ONNX Support

After exporting to ONNX, you can run the PyTorch model in Caffe2. The exported graph looks like:

graph(%0 : Float(64, 1000) %1 : Float(100, 1000) %2 : Float(100) %3 : Float(10, 100) %4 : Float(10)) { %5 : Float(64, 100) = onnx::Gemm[alpha=1, beta=1, broadcast=1, transB=1](%0, %1, %2), scope: Sequential/Linear[0] %6 : Float(64, 100) = onnx::Relu(%5), scope: Sequential/ReLU[1] %7 : Float(64, 10) = onnx::Gemm[alpha=1, beta=1, broadcast=1, transB=1](%6, %3, %4), scope: Sequential/Linear[2] return (%7); }

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 118 April 15, 2021 Static PyTorch: ONNX Support

ONNX is an open-source standard for neural network models

Goal: Make it easy to train a network in one framework, then run it in another framework

Supported by PyTorch, Caffe2, Microsoft CNTK, Apache MXNet (third-party support for TensorFlow)

https://github.com/onnx/onnx

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 119 April 15, 2021 Static PyTorch: TorchScript

Build a static graph with torch.jit.trace. The traced graph looks like:

graph(%self.1 : __torch__.torch.nn.modules.module.___torch_mangle_4.Module, %input : Float(3, 4), %h : Float(3, 4)): %19 : __torch__.torch.nn.modules.module.___torch_mangle_3.Module = prim::GetAttr[name="linear"](%self.1) %21 : Tensor = prim::CallMethod[name="forward"](%19, %input) %12 : int = prim::Constant[value=1]() %13 : Float(3, 4) = aten::add(%21, %h, %12) %14 : Float(3, 4) = aten::tanh(%13) %15 : (Float(3, 4), Float(3, 4)) = prim::TupleConstruct(%14, %14) return (%15)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 120 April 15, 2021 PyTorch vs TensorFlow, Static vs Dynamic

PyTorch: dynamic graphs by default; static graphs via ONNX export or TorchScript. TensorFlow: dynamic graphs (eager) by default; static graphs via @tf.function.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 121 April 15, 2021 Static vs Dynamic: Serialization

Static: once the graph is built, you can serialize it and run it without the code that built the graph! Dynamic: graph building and execution are intertwined, so you always need to keep the code around.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 122 April 15, 2021 Dynamic Graph Applications

- Recurrent networks

Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015 Figure copyright IEEE, 2015. Reproduced for educational purposes.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 123 April 15, 2021 Dynamic Graph Applications

- Recurrent networks - Recursive networks

The cat ate a big rat

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 124 April 15, 2021 Dynamic Graph Applications

- Recurrent networks - Recursive networks - Modular networks

Figure copyright Justin Johnson, 2017. Reproduced with permission. Andreas et al, “Neural Module Networks”, CVPR 2016 Andreas et al, “Learning to Compose Neural Networks for Question Answering”, NAACL 2016 Johnson et al, “Inferring and Executing Programs for Visual Reasoning”, ICCV 2017

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 125 April 15, 2021 Dynamic Graph Applications

- Recurrent networks - Recursive networks - Modular Networks - (Your creative idea here)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 126 April 15, 2021 Model Parallel vs. Data Parallel

Model parallelism: split the computation graph into parts and distribute them across GPUs / nodes. Data parallelism: split the minibatch into chunks and distribute them across GPUs / nodes.

[Figure: a model split across devices (Model Parallel) vs. a minibatch split across devices (Data Parallel)] Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 127 April 15, 2021 PyTorch: Data Parallel

nn.DataParallel
Pro: easy to use (just wrap the model and run the training script as normal).
Con: single process & single node; can be bottlenecked by the CPU with a large number of GPUs (8+).

nn.DistributedDataParallel
Pro: multi-node & multi-process training.
Con: need to hand-designate devices and manually launch the training script for each process / node.
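A minimal nn.DataParallel sketch (model and batch are illustrative); the wrapper scatters each minibatch across the visible GPUs and gathers the outputs:

```python
# Single-node data parallelism: wrap the model and train as usual.
import torch

model = torch.nn.Linear(1000, 10)
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)   # replicates the model on each GPU
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(256, 1000).to(next(model.parameters()).device)
scores = model(x)   # minibatch is split across GPUs, outputs are gathered back
```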

Horovod (https://github.com/horovod/horovod): supports both PyTorch and TensorFlow. https://pytorch.org/docs/stable/nn.html#dataparallel-layers-multi-gpu-distributed Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 128 April 15, 2021 TensorFlow: Data Parallel tf.distribute.Strategy
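A sketch of tf.distribute.MirroredStrategy with Keras (model and data are illustrative); variables created under the strategy scope are mirrored across GPUs and each batch is split between the replicas:

```python
# Data parallelism in TensorFlow via MirroredStrategy.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(100, activation="relu", input_shape=(1000,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(optimizer="adam", loss="mse")

x = tf.random.normal((256, 1000))
y = tf.random.normal((256, 10))
model.fit(x, y, batch_size=64, epochs=2)   # each replica gets a slice of every batch
```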

https://www.tensorflow.org/tutorials/distribute/keras Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 129 April 15, 2021 PyTorch vs. TensorFlow: Academia

https://thegradient.pub/state-of-ml-frameworks-2019-pytorch-dominates-re search-tensorflow-dominates-industry/ Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 130 April 15, 2021 PyTorch vs. TensorFlow: Academia

https://thegradient.pub/state-of-ml-frameworks-2019-pytorch-dominates-re search-tensorflow-dominates-industry/

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 131 April 15, 2021 PyTorch vs. TensorFlow: Industry (2021)

● No official survey / study on the comparison.

● A quick search on a job posting website turns up 2389 search results for TensorFlow and 1366 for PyTorch.

● The trend is unclear. Industry is also known to be slower on adopting new frameworks.

● TensorFlow mostly dominates mobile deployment / embedded systems.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 132 April 15, 2021 My Advice:

PyTorch is my personal favorite. Its clean API and native dynamic graphs make it very easy to develop and debug. You can build a model with the default API and then compile a static graph with the JIT. Lots of research repositories are built on PyTorch.

TensorFlow’s syntax became a lot more intuitive after 2.0. Not perfect, but it has a huge community and wide usage. You can use the same framework for research and production. Probably use a higher-level wrapper (Keras, Sonnet, etc.).

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 133 April 15, 2021 Next Time: Training Neural Networks

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 6 - 134 April 15, 2021