Practical

December 12-13, 2019 CSC – IT Center for Science Ltd., Espoo

Markus Koskela, Mats Sjöberg

All original material (C) 2019 by CSC – IT Center for Science Ltd. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 Unported License, http://creativecommons.org/licenses/by-sa/4.0

All other material copyrighted by their respective authors.

Course schedule

Thursday
  9.00-10.30   Lecture 1: Introduction to deep learning
  10.30-10.45  Coffee break
  10.45-11.00  Exercise 1: Introduction to Notebooks
  11.00-11.30  Lecture 2: Multi-layer perceptron networks
  11.30-12.00  Exercise 2: Classification with MLPs
  12.00-13.00  Lunch break
  13.00-14.00  Lecture 3: Images and convolutional neural networks
  14.00-14.30  Exercise 3: Image classification with CNNs
  14.30-14.45  Coffee break
  14.45-15.30  Lecture 4: Text data, embeddings, 1D CNN, recurrent neural networks, attention
  15.30-16.00  Exercise 4: Text sentiment classification with CNNs and RNNs

Friday
  9.00-9.45    Lecture 5: Deep learning frameworks
  9.45-10.15   Lecture 6: GPUs and batch jobs
  10.15-10.30  Coffee break
  10.30-12.00  Exercise 5: Image classification: dogs vs. cats; traffic signs
  12.00-13.00  Lunch break
  13.00-14.00  Exercise 6: Text categorization: 20 newsgroups
  14.00-14.45  Lecture 7: Cloud, GPU utilization, using multiple GPUs
  14.45-15.00  Coffee break
  15.00-16.00  Exercise 7: Using multiple GPUs

Up-to-date agenda and lecture slides can be found at https://tinyurl.com/r3fd3st
Exercise materials are at GitHub: https://github.com/csc-training/intro-to-dl/
Wireless accounts for the CSC-guest network are behind the badges. Alternatively, use the eduroam network with your university accounts or the LAN cables on the tables.
Accounts to the Puhti-AI cluster are delivered separately.

Lecture 1: Introduction to deep learning

Practical deep learning

About this course

• Introduction to deep learning
  • basics of ML assumed
  • mostly high-school math
  • much of theory, many details skipped
• 1st day: lectures + small-scale exercises using notebooks.csc.fi
• 2nd day: experiments using GPUs at Puhti-AI
• Slides at: https://tinyurl.com/r3fd3st
• Other materials (and link to Gitter) at GitHub: https://github.com/csc-training/intro-to-dl
• Focus on text and image classification, no fancy stuff
• Using Python, TensorFlow 2 / Keras, and PyTorch

Further resources

• This course is largely “inspired by”: “Deep Learning with Python” by François Chollet
• Recommended textbook: “Deep learning” by Goodfellow, Bengio, Courville
• Lots of further material available online, e.g.:
  http://cs231n.stanford.edu/
  http://course.fast.ai/
  https://developers.google.com/machine-learning/crash-course/
  www.nvidia.com/dlilabs
  http://introtodeeplearning.com/
  https://github.com/oxford-cs-deepnlp-2017/lectures
  https://jalammar.github.io/
• Academic courses

What is artificial intelligence?

Artificial intelligence is the ability of a computer to perform tasks commonly associated with intelligent beings.

What is machine learning?

Machine learning is the study of algorithms that learn from examples and experience instead of relying on hard-coded rules and make predictions on new data.

What is deep learning?

Deep learning is a subfield of machine learning focusing on learning data representations as successive layers of increasingly meaningful representations.

“Traditional” machine learning:
  handcrafted features → learned classifier → “cat”

Deep, “end-to-end” learning:
  learned low-level features → learned mid-level features → learned high-level features → learned classifier → “cat”

Image from https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/

Demotivational slide

“All of these AI systems we see, none of them is ‘real’ AI” – Josh Tenenbaum

“Neural networks are … neither neural nor even networks.” – François Chollet, author of Keras

From: Wang & Raj: On the Origin of Deep Learning (2017)

Main types of machine learning

• Supervised learning
• Unsupervised learning
• Self-supervised learning
• Reinforcement learning

Image from https://arxiv.org/abs/1710.10196

By Chire [CC BY-SA 3.0], from Wikimedia Commons

Animation from https://yanpanlau.github.io/2016/07/10/FlappyBird-Keras.html

Fundamentals of machine learning

Data

• Humans learn by observation and unsupervised learning
  • model of the world / common sense reasoning
• Machine learning needs lots of (labeled) data to compensate
• Tensors: generalization of matrices to n dimensions (or rank, order, degree)
  • 1D tensor: vector
  • 2D tensor: matrix
  • 3D, 4D, 5D tensors
  • numpy.ndarray(shape, dtype)
• Training – validation – test split (+ adversarial test)
• Minibatches
  • small sets of input data used at a time
  • usually processed independently

Image from: https://arxiv.org/abs/1707.08945

Model – learning/training – inference

• parameters (θ) and hyperparameters
• http://playground.tensorflow.org/

By Rebecca Wilson (originally posted to Flickr as Vicariously) [CC BY 2.0], via Wikimedia Commons

Optimization

• Mathematical optimization: “the selection of a best element (with regard to some criterion) from some set of available alternatives” (Wikipedia)
• Main types: finite-step, iterative, heuristic
• Learning as an optimization problem
  • cost function: loss + regularization

Optimization

• Derivative and minima/maxima of functions

• Gradient: the derivative of a multivariable function • Gradient descent:

• (Mini-batch) stochastic gradient descent (and its variants)
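
The gradient descent idea above can be illustrated with a minimal sketch in plain Python. The toy loss, the function names, and the learning rate of 0.1 are assumptions chosen for illustration and do not come from the course exercises; the update rule is the usual "step against the gradient" one.

```python
def gradient_descent(grad, theta, learning_rate=0.1, steps=100):
    """Repeat the update theta <- theta - learning_rate * grad(theta)."""
    for _ in range(steps):
        g = grad(theta)
        theta = [t - learning_rate * gi for t, gi in zip(theta, g)]
    return theta


def toy_grad(theta):
    """Gradient of the toy loss L = (theta0 - 3)^2 + (theta1 + 1)^2."""
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]


# Start from (0, 0); the minimum of the toy loss is at (3, -1).
theta_min = gradient_descent(toy_grad, [0.0, 0.0])
print(theta_min)
```

In (mini-batch) stochastic gradient descent the same update is used, but the gradient is estimated from a small random subset of the training data at each step.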

Image from: Li et al. “Visualizing the Loss Landscape of Neural Nets”, arXiv:1712.09913
Image from: https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3

Over- and underfitting, generalization, regularization

• Models with lots of parameters can easily overfit to training data
• Generalization: the quality of an ML model is measured on new, unseen samples
• Regularization: any method* to prevent overfitting
  • simplicity, sparsity, dropout, early stopping
  • *) other than adding more data

By Chabacano [GFDL or CC BY-SA 4.0], from Wikimedia Commons

Deep learning

Anatomy of a deep neural network

• Layers
• Input data and targets
• Loss function
• Optimizer

Layers

• Data processing modules
• Many different kinds exist
  • densely connected
  • convolutional
  • recurrent
  • pooling, flattening, merging, normalization, etc.
• Input: one or more tensors, output: one or more tensors
• Usually have a state, encoded as weights
  • learned, initially random
• When combined, form a network or a model

Input data and targets

• The network maps the input data X to predictions Y′
• During training, the predictions Y′ are compared to true targets Y using the loss function

Loss function

• The quantity to be minimized (optimized) during training
  • the only thing the network cares about
  • there might also be other metrics you care about
• Common tasks have “standard” loss functions:
  • mean squared error for regression
  • binary cross-entropy for two-class classification
  • categorical cross-entropy for multi-class classification
  • etc.
• https://lossfunctions.tumblr.com/
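
As a rough illustration of the three "standard" losses listed above, here is a plain-Python sketch using their textbook definitions, averaged over the samples. The function names and the clipping constant `eps` are assumptions made for this example and are not the tf.keras implementations.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Regression: average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_crossentropy(y_true, p_pred, eps=1e-12):
    """Two-class classification: targets are 0 or 1, predictions are probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


def categorical_crossentropy(y_true_onehot, p_pred, eps=1e-12):
    """Multi-class classification: one-hot targets, one probability row per sample."""
    total = 0.0
    for t_row, p_row in zip(y_true_onehot, p_pred):
        total += -sum(t * math.log(max(p, eps)) for t, p in zip(t_row, p_row))
    return total / len(y_true_onehot)


print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))
print(binary_crossentropy([1], [0.5]))
print(categorical_crossentropy([[0, 1]], [[0.5, 0.5]]))
```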

Optimizer

• How to update the weights based on the loss function
• Learning rate (+scheduling)
• Stochastic gradient descent, momentum, and their variants
  • RMSProp is usually a good first choice
  • more info: http://ruder.io/optimizing-gradient-descent/

Animation from: https://imgur.com/s25RsOr

Deep learning frameworks

Deep learning frameworks

• Actually tools for defining static or dynamic general-purpose computational graphs
• Automatic differentiation
• Seamless CPU / GPU usage
  • multi-GPU, distributed
• Python/numpy or R interfaces
  • instead of C, C++, or CUDA
• Open source

[Figure: a small computational graph with inputs x, y and 5, two ✕ nodes and + nodes]

Deep learning frameworks

[Figure: framework stack. High-level APIs: Lasagne, Keras, TF Estimator, torch.nn, Gluon. Frameworks: Theano, TensorFlow, CNTK, PyTorch, MXNet. Libraries: CUDA, cuDNN (GPUs); MKL, MKL-DNN (CPUs)]

• Keras is a high-level neural networks API
  • we will use TensorFlow as the compute backend
  • included in TensorFlow 2 as tf.keras
  • https://keras.io/ , https://www.tensorflow.org/guide/keras
• PyTorch is:
  • a GPU-based tensor library
  • an efficient library for dynamic neural networks
  • https://pytorch.org/

Lecture 2: Multi-layer perceptron networks

Practical deep learning

Neuron as a linear classifier

By User:ZackWeinberg, based on PNG version by User:Cyc [CC BY-SA 3.0], via Wikimedia Commons

A non-linear classifier? Non-linearity =

• A smooth (differentiable) nonlinear function that is applied after the inner product with the weights

Neural network

● An (artificial) neural network is a collection of neurons
● Usually organized in layers
  ○ input layer
  ○ one or more hidden layers (sizes, activation functions are hyperparameters)
  ○ output layer

Output activation

● Usually a non-linear activation after each layer
● Typically ReLU between the layers
● At the output layer we need to consider the task, i.e., what kinds of outputs we want, e.g.,
  ○ Multi-label classification: each of the K outputs is a separate probability (values 0.0-1.0) → sigmoid
  ○ Multi-class classification: probability distribution over K classes (sums to 1.0) → softmax
  ○ Regression: free range of values → linear
● softmax: softmax(z)_i = exp(z_i) / Σ_j exp(z_j)

[Figure: Input → Layer 1 → ReLU → Layer 2 → ReLU → Layer 3 → e.g. softmax → Output (“cat”, “dog”)]

• Neural networks are trained with gradient descent, starting from a random weight initialization
• Algorithm for computing the gradients for a neural network:
  • compute an error (loss) by comparing the output to the correct label
  • backpropagate errors = how to change each weight to reduce the error
• Based on the chain rule of calculus
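
The three output activations can be compared on the same raw layer outputs with a minimal plain-Python sketch. The example values in `logits` are made up for illustration; subtracting the maximum inside `softmax` is a common numerical-stability trick and does not change the result.

```python
import math


def sigmoid(z):
    """Squash a single value into (0, 1); used per output for multi-label tasks."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Turn K raw outputs into a probability distribution that sums to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.0, -1.0]                    # raw outputs of the last layer
multi_label = [sigmoid(z) for z in logits]   # K independent probabilities
multi_class = softmax(logits)                # one distribution over K classes
regression = logits                          # linear: values used as they are

print(multi_label)
print(multi_class, sum(multi_class))
```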

Multilayer perceptron (MLP) / Dense network

• Classic feedforward neural network
• Densely connected: all inputs from the previous layer connected
• In tf.keras:
    layers.Dense(units, activation=None)
  or:
    layers.Dense(units)
    layers.Activation(activation)

Dropout

• Randomly setting a fraction rate of input units to 0 at each update during training
• Helps to prevent overfitting (regularization)
• In tf.keras:
    layers.Dropout(rate)

Image from Srivastava et al (2014), JMLR 15: 1929-1958

Flatten

• Flatten
  • flattens the input into a vector
  • needed if the input has more than one dimension (2D, 3D, … ), e.g., image data
• Example:
    1 2 3
    4 5 6   →   1 2 3 4 5 6 7 8 9
    7 8 9
• In tf.keras:
    layers.Flatten()
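
To make the flatten, dense, and dropout layers concrete, the sketch below re-implements their forward computations in plain Python. It is an illustration only, not the tf.keras code: the helper names, the random weights, and the layer size of 4 are assumptions, and the dropout shown is the common "inverted" variant that rescales the kept units by 1 / (1 - rate) during training.

```python
import random


def flatten(matrix):
    """Turn a 2D grid (list of rows) into a single vector."""
    return [value for row in matrix for value in row]


def relu(z):
    return max(0.0, z)


def dense(inputs, weights, biases, activation=None):
    """Each output unit is a weighted sum of ALL inputs plus a bias."""
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b
        outputs.append(activation(z) if activation else z)
    return outputs


def dropout(inputs, rate, rng, training=True):
    """Zero each unit with probability `rate` while training; identity otherwise."""
    if not training:
        return list(inputs)
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in inputs]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
x = flatten(image)                          # the 3 x 3 example from the slide

rng = random.Random(0)
weights = [[rng.uniform(-0.1, 0.1) for _ in x] for _ in range(4)]  # random init
hidden = dense(x, weights, [0.0] * 4, activation=relu)
dropped = dropout(hidden, rate=0.5, rng=rng)
print(x, hidden, dropped)
```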

Lecture 3: Images and convolutional neural networks

Practical deep learning

Computer vision = giving computers the ability to understand visual information

Examples:
○ A robot that can move around obstacles by analysing the input of its camera(s)
○ A computer system finding images of cats among millions of images on the Internet

From picture to pixels

An image has to be digitized for computer processing. It is turned into millions of “pixel” elements, each a set of numbers quantifying the color of that element.

From pixels to … understanding?

There’s a cat among some flowers in the grass

● This is easy for humans
● But for AI it’s actually one of the harder problems!
● How do you transform that grid of numbers into understanding… or even something useful?

Image understanding

• Humans are so good at vision that it’s not even considered intelligence

Convolutional neural networks

Convolutional neural network (CNN, ConvNet)

● Dense or fully-connected: each neuron connected to all neurons in previous layer
● CNN: only connected to a small “local” set of neurons
● Radically reduces number of network connections

[Figure: dense layer vs. convolutional layer]

Convolution for image data

● Image represented as 2D grid of values
● Each output neuron connected to small 2D area in the image
● Output value = weighted sum of inputs
● Idea: nearby pixels are related ⇒ we can learn local relationships of pixels

[Figure: 3✕3 image area, 3✕3 weights (conv. kernel), output neuron]

Convolution for image data

● We repeat for each output neuron
● Weights stay the same (shared weights)
● Border effect: without padding output area is smaller
● Outputs form a “feature map”

A real example

Side note: color images

● Example: 256 ✕ 256 color image with 3 color channels (red, green, and blue) ⇒ single image is a 3D tensor: 256 ✕ 256 ✕ 3
● Example: 5 ✕ 5 convolution is actually also a 3D tensor: 5 ✕ 5 ✕ 3
● Slides over width and height, but covers the full color depth

Convolution for image data

● We can repeat for different sets of weights (kernels)
● Each learns a different “feature”
● Typically: edges, corners, etc
● Each outputs a feature map

Convolution for image data

● We stack the feature maps into a single tensor
● Depth of output tensor = number of kernels K
● Tensor is the output of the entire convolutional layer

Convolution in layers: intuition

● We can then add another convolutional layer
● This operates on the previous layer’s output tensor (feature maps)
● Features layered from simple to more complex
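
The sliding-window computation described on the convolution slides can be sketched in plain Python for a single-channel image. The 5 ✕ 5 test image and the two hand-written edge kernels are assumptions for illustration; in a real CNN the kernel weights are learned. As in deep learning frameworks, the kernel is applied without flipping.

```python
def conv2d_valid(image, kernel):
    """Slide one kernel over a 2D image without padding.

    The same weights are reused at every position (shared weights), and the
    output is smaller than the input (border effect).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Output value = weighted sum of the local image area
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        feature_map.append(row)
    return feature_map


# 5 x 5 image with a vertical edge between the second and third column
image = [[0, 0, 1, 1, 1] for _ in range(5)]

vertical_edge = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
horizontal_edge = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]

# One feature map per kernel; stacked, they form the layer's output tensor
feature_maps = [conv2d_valid(image, k) for k in (vertical_edge, horizontal_edge)]
for fm in feature_maps:
    print(fm)
```

The vertical-edge kernel responds where the image changes from left to right, while the horizontal-edge kernel outputs only zeros for this image.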

[Figure: learned low-level features → learned mid-level features → learned high-level features → learned classifier → “cat”]

Image datasets

• Color image mini-batches are 4D tensors: width ✕ height ✕ color channels ✕ samples
• Plenty of big datasets for training exist, e.g., ImageNet with 1.2 million images in 1000 classes
• For small datasets: generate more training data by transforming existing data
  • e.g., shifting, rotation, cropping, scaling, adding noise, etc.

Convolutional layers

• Input: tensor of size N × Wi × Hi × Ci
• Hyperparameters:
  • K: number of filters
  • w, h: kernel size
  • padding: how to handle image borders
  • activation function
• Output: tensor of size N × Wo × Ho × K
• In tf.keras:
    layers.Conv2D(filters, kernel_size, padding, activation)
  (there is also Conv1D and Conv3D)

Pooling layers

• Used to reduce the spatial resolution
  • independently on each channel
  • reduce complexity and number of parameters
• MAX operator most common
  • sometimes also AVERAGE
• In tf.keras:
    layers.MaxPooling2D(pool_size)
    layers.AveragePooling2D(pool_size)

Other layers

• Flatten
  • flattens the input into a vector (typically before dense layers)
• Dropout
  • similar as with dense layers
• In tf.keras:
    layers.Flatten()
    layers.Dropout(rate)

Typical architecture

1. Input layer = image pixels
2. Convolution
3. ReLU
4. Pooling
5. One or more fully connected layers (+ReLU)
6. Final fully connected layer to get to the number of classes we want
7. Softmax to get probability distribution over classes
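
How padding and pooling change the spatial size can be checked with a small plain-Python sketch. Both helpers are assumptions written for illustration (stride 1 for the convolution, non-overlapping windows for the pooling) and are not the tf.keras implementations.

```python
def conv_output_size(input_size, kernel_size, padding="valid"):
    """Spatial output size of a stride-1 convolution along one dimension."""
    if padding == "same":
        return input_size                    # borders are padded with zeros
    return input_size - kernel_size + 1      # "valid": no padding, output shrinks


def max_pool2d(feature_map, pool_size=2):
    """Keep the largest value in each non-overlapping pool_size x pool_size window."""
    pooled = []
    for i in range(0, len(feature_map) - pool_size + 1, pool_size):
        row = []
        for j in range(0, len(feature_map[0]) - pool_size + 1, pool_size):
            row.append(max(feature_map[i + a][j + b]
                           for a in range(pool_size) for b in range(pool_size)))
        pooled.append(row)
    return pooled


print(conv_output_size(28, 3))           # 3 x 3 kernel on a 28-pixel side
print(conv_output_size(28, 3, "same"))

feature_map = [[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12],
               [13, 14, 15, 16]]
print(max_pool2d(feature_map))
```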

Computer vision applications

Image credit: Li Fei-Fei et al
Image credit: Noh et al, Learning Deconvolution Network for Semantic Segmentation, ICCV 2015

Large-scale CNNs with pre-trained weights

• For many applications, an existing CNN can be re-used instead of training a new model from scratch:
  • extract features from a suitable layer, or
  • retrain the top layers with new data
• Keras contains several models trained with ImageNet:
  • Xception, VGG16, VGG19, ResNet50, InceptionV3, InceptionResNetV2, MobileNet, DenseNet, NASNet

Some selected applications

• Object detection: https://pjreddie.com/darknet/yolo/
• Semantic segmentation: https://www.youtube.com/watch?v=qWl9idsCuLQ
• Human pose estimation: https://www.youtube.com/watch?v=pW6nZXeWlGM
• Video recognition: https://valossa.com/
• Digital pathology: https://www.aiforia.com/

Lecture 4: Text, embeddings, 1D CNN, recurrent neural networks, attention

Practical deep learning

Representations for text

Sequence data

By Mogrifier5 [CC BY-SA 3.0], from Wikimedia Commons
By Der Lange 11/6/2005, http://commons.wikimedia.org/w/index.php?title=File:Spike-waves.png&action=edit&section=2

Text data

• sequence of words (or characters)
• main representations:
  • one-hot encoding
  • word embedding

[Figure: raw text → preprocessing → cleaned text → tokenization → tokens → vectorization → one-hot encoding / word embedding]

One-hot encoding and bag-of-words

• dimensionality equals the number of distinct tokens in dictionary
  • 1000’s or 10000’s
• tokens are independent of each other
• bag-of-words loses the ordering of tokens
• lots of important applications: IR etc.
• n-grams

[Figure: cleaned text “The cat is in the moon.” → tokens [“the”, “cat”, “is”, “in”, “the”, “moon”] → dictionary {“a”: 1, “aardvark”: 2, “aardwolf”: 3, …} → one-hot encoding / bag of words]

Word embeddings

• dense vector representations
• dimensionality typically much lower than in one-hot ⇒ bag-of-words not needed
• learned based on context of words
  • semantics
  • similar words have similar vectors
  • directions in the vector space map to semantic relationships
• context-free and contextual embeddings
• either learn from data or use a pre-trained embedding

(Context-free) word embeddings

Images from: https://www.tensorflow.org/tutorials/word2vec
Image from: http://wiki.fast.ai/index.php/Lesson_5_Notes

Standalone word embedding algorithms

• unsupervised learning, no annotation needed
• popular context-free algorithms include:
  • word2vec
  • GloVe
  • fastText
• recently proposed contextual algorithms include:
  • ELMo
  • BERT
• to learn a task-based embedding or to use a pre-trained one?
  • pre-trained embeddings encode general semantic relationships
  • need to handle OOV (out-of-vocabulary) words
  • task-based may sometimes be better if enough data
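
The difference between one-hot encoding and bag-of-words can be shown on the example sentence from the slides with a short plain-Python sketch. The regular-expression tokenizer and the alphabetical dictionary built from this single sentence are simplifying assumptions; a real dictionary would cover the whole corpus.

```python
import re
from collections import Counter


def tokenize(text):
    """Very simple preprocessing: lowercase and keep runs of letters."""
    return re.findall(r"[a-z']+", text.lower())


def one_hot(token, dictionary):
    """A vector as long as the dictionary with a single 1 at the token's index."""
    vec = [0] * len(dictionary)
    vec[dictionary[token]] = 1
    return vec


def bag_of_words(tokens, dictionary):
    """Count how often each dictionary entry occurs; the token order is lost."""
    counts = Counter(tokens)
    return [counts[tok] for tok in sorted(dictionary, key=dictionary.get)]


tokens = tokenize("The cat is in the moon.")
dictionary = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}

one_hot_sequence = [one_hot(t, dictionary) for t in tokens]  # keeps the order
bow = bag_of_words(tokens, dictionary)                       # one vector per text

print(tokens)
print(dictionary)
print(one_hot_sequence[0])
print(bow)
```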

Word sequence embedding

• usually a fixed-size matrix or sequence
• in tf.keras:
    layers.Embedding(input_dim, output_dim, input_length, trainable, weights)

[Figure: cleaned text “The cat is in the moon.” → tokens [“the”, “cat”, “is”, “in”, “the”, “moon”] → padding / truncate [“the”, “cat”, “is”, “in”, “the”, “moon”, ∅, ∅, ∅, ∅] → learned embedding → sequence embedding: 10 × N matrix or sequence of length 10]

Deep learning for sequences
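
The padding / truncation and embedding lookup in the figure can be sketched in plain Python as follows. The vocabulary indices, the reserved padding index 0, the embedding size N = 4, and the random embedding values are all assumptions for illustration; in a trained model the embedding matrix is learned or loaded from a pre-trained embedding.

```python
import random

PAD = 0  # index reserved for padding (a convention assumed for this sketch)


def pad_or_truncate(indices, length, pad_value=PAD):
    """Make every sequence exactly `length` long."""
    return (indices + [pad_value] * length)[:length]


def embed(indices, embedding_matrix):
    """Look up one dense vector per token index."""
    return [embedding_matrix[i] for i in indices]


vocabulary = {"the": 1, "cat": 2, "is": 3, "in": 4, "moon": 5}
tokens = ["the", "cat", "is", "in", "the", "moon"]
indices = [vocabulary[t] for t in tokens]
padded = pad_or_truncate(indices, length=10)

N = 4  # embedding dimensionality
rng = random.Random(0)
embedding_matrix = [[rng.uniform(-1.0, 1.0) for _ in range(N)]
                    for _ in range(len(vocabulary) + 1)]

sequence_embedding = embed(padded, embedding_matrix)  # 10 x N matrix
print(padded)
print(len(sequence_embedding), len(sequence_embedding[0]))
```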

Deep learning for sequences

• first layer is usually an embedding
• then there are three main approaches (that can also be combined):
  • 1D convolutional layers
  • recurrent layers
  • dense layers + attention
• last layers are often dense

(1) CNNs for sequences

• a fixed-length embedded sequence is a matrix
  • can be considered as an image ⇒ CNNs can be applied
• 1D convolution
  • as we want to process the full embedding each time
• simple and cheap approach for simple tasks
• in tf.keras:
    layers.Conv1D(filters, kernel_size, padding, activation)
    layers.MaxPooling1D(pool_size)
    layers.GlobalMaxPooling1D()

1D convolution

[Figure: tokens → embedding]

(2) Recurrent neural networks

• MLPs and CNNs expect fixed-sized input, not sequences
• RNNs have memory and recurrent connections, i.e. loops
• last output contains a representation of the whole sequence
• learning by backpropagation through time
  • vanishing or exploding gradients!
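
The "memory and loops" of a recurrent network can be illustrated with a minimal plain-Python forward pass of a simple (non-LSTM) RNN. The tanh activation, the hand-picked weights, and the tiny input sequence are assumptions for illustration; real weights are learned by backpropagation through time.

```python
import math


def rnn_forward(sequence, w_in, w_rec, bias):
    """Process a sequence one timestep at a time, carrying a hidden state."""
    hidden = [0.0] * len(bias)          # the memory, empty at the start
    outputs = []
    for x_t in sequence:
        new_hidden = []
        for k in range(len(bias)):
            z = bias[k]
            z += sum(w * x for w, x in zip(w_in[k], x_t))      # current input
            z += sum(w * h for w, h in zip(w_rec[k], hidden))  # recurrent loop
            new_hidden.append(math.tanh(z))
        hidden = new_hidden
        outputs.append(hidden)
    return outputs


w_in = [[0.5, -0.3], [0.1, 0.8]]    # input -> hidden weights
w_rec = [[0.2, 0.0], [-0.4, 0.3]]   # hidden -> hidden weights
bias = [0.0, 0.0]

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 timesteps, 2 features each
outputs = rnn_forward(sequence, w_in, w_rec, bias)
reversed_outputs = rnn_forward(sequence[::-1], w_in, w_rec, bias)

# The last output summarizes the whole sequence and depends on the order
print(outputs[-1])
print(reversed_outputs[-1])
```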

By François Deloche [CC BY-SA 4.0], from Wikimedia Commons

Recurrent neural networks

Image from: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Long short-term memory (LSTM) network

• specialized architecture to solve the vanishing gradient problem
• additional “conveyor belt” dataflow to carry information across timesteps
• “forget”, “input”, and “output” gates

By François Deloche [CC BY-SA 4.0], from Wikimedia Commons

LSTM layer

• simple RNNs do not usually work in practice ⇒ use LSTM or its variants (e.g. GRU)
• can also be used bidirectionally
• in tf.keras:
    layers.LSTM(units, return_sequences)
    layers.GRU(units, return_sequences)
    layers.Bidirectional(layer, merge_mode)

Language models and text generation

• RNNs can be trained to predict the next word and then used to generate novel text (or music, etc.)

Image from: https://support.apple.com/en-us/HT207525
Image from: https://github.com/oxford-cs-deepnlp-2017/lectures/blob/master/Lecture%204%20-%20Language%20Modelling%20and%20RNNs%20Part%202.pdf

Encoder–decoder networks

(3) Attention

Problem: the final encoder output vector is a bottleneck

“You can’t cram the meaning of a whole %&!$ing sentence into a single $&!*ing vector!” -- Ray Mooney

Solution: attention
• allows the model to focus on the relevant part of the input sequence
• all encoder output vectors are passed to the decoder and are weighted using a learned alignment
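
The weighting of encoder outputs can be sketched in plain Python as below. This is a simplification under stated assumptions: a plain dot product between the decoder state and each encoder output stands in for the learned alignment model, and the vectors are made-up toy values.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_outputs):
    """Weight all encoder output vectors by their relevance to the decoder state."""
    # Alignment scores: dot product used here in place of a learned alignment
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_outputs]
    weights = softmax(scores)            # attention weights sum to 1
    dim = len(encoder_outputs[0])
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_outputs))
               for k in range(dim)]
    return weights, context


encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one vector per input token
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_outputs)
print(weights)   # the first and third positions get the most weight
print(context)   # weighted sum passed to the decoder instead of a single vector
```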

Image from: https://arxiv.org/abs/1409.0473 Image from: https://devblogs.nvidia.com/introduction-neural-machine-translation-gpus-part-2/

Attention is all you need

• Transformer
• No CNNs or RNNs
• Self-attention: relating different positions of a sequence in forming its representation
• See also: https://jalammar.github.io/

Images from: https://arxiv.org/abs/1706.03762
Image from: https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html

Models based on Transformer

• XLNet, XLM, RoBERTa, DistilBERT, Megatron, ALBERT, BART, …
• Available as pre-trained models
• See also: https://github.com/huggingface/transformers

Image from: https://arxiv.org/abs/1810.04805
Image from: https://jalammar.github.io/illustrated-gpt2/

Some applications

CSC, a supercomputing center in Finland, announces natural language processing to be the main strategic target of the company. Google and Baidu have been working on this technology for years, but now the two companies hope to see a product available for commercial deployment. "When I first joined Google in 2007, they had an idea of what they wanted to do. I was working in machine learning, and I started thinking about how could we solve the biggest problems in AI," says John Halligan, CSC's vice president of research and chief architect. Halligan says the company's goal was to build a "self-learning and self-improving system" so that it could become an expert in solving any number of challenges, even the hardest problems in the world. To that end, Google launched the project in 2012. The project was developed by Stanford researchers who worked with the company's deep learning group and the neural networks research lab of Google's artificial intelligence department. It involved the use of a neural network, a type of artificial neural network. This type of neural network is like a computer's "neuron," and it works by looking at each piece of data

https://talktotransformer.com/
https://transformer.huggingface.co/doc/gpt2-large
https://gpt2.apps.allenai.org/
https://rajpurkar.github.io/SQuAD-explorer/
https://sheng-z.github.io/ReCoRD-explorer/

From: arxiv.org/abs/1910.13461
From: https://techcrunch.com/2019/10/25/google-brings-in-bert-to-improve-its-search-results/

Lecture 5: Deep learning frameworks

Practical deep learning

TensorFlow is just a deep learning framework, right… ?

“TensorFlow is a free and open-source library for dataflow and differentiable programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks.” – Wikipedia, https://en.wikipedia.org/wiki/TensorFlow

Neural network as a computational graph

Example network from Exercise 2:
● Each box represents a mathematical transformation from input tensor to output tensor
● E.g. dense_1 takes input n⨯784 sized matrix and does a matrix multiplication to get n⨯20 matrix (n is batch size)
● This can be seen as a graph where each node performs some computation...

Computational graphs

• Mathematical computations can often be expressed as a computational graph
• Neural networks are just a (huge) number of simple computations
• With the graph it is easy to automatically calculate the gradients backwards for each node (backpropagation!)
• https://en.wikipedia.org/wiki/Automatic_differentiation

[Figure: a small computational graph with inputs x, y and 5, two ✕ nodes and + nodes]
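
How gradients flow backwards through a computational graph can be shown with a tiny reverse-mode automatic differentiation sketch in plain Python. The `Node` class and helper functions are hypothetical and written only for this illustration, and the expression z = x ✕ y + y ✕ 5 is an assumed reading of the small graph with inputs x, y and 5; frameworks such as TensorFlow and PyTorch do the same thing at a much larger scale.

```python
class Node:
    """A value in the graph, plus links to the nodes it was computed from."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (parent node, local derivative)
        self.grad = 0.0


def add(a, b):
    return Node(a.value + b.value, ((a, 1.0), (b, 1.0)))


def mul(a, b):
    return Node(a.value * b.value, ((a, b.value), (b, a.value)))


def backward(output):
    """Apply the chain rule from the output back towards the inputs."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)       # a node comes after everything it depends on

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local_derivative in node.parents:
            parent.grad += node.grad * local_derivative


x, y, five = Node(42.0), Node(2.0), Node(5.0)
z = add(mul(x, y), mul(y, five))   # z = x*y + y*5
backward(z)

print(z.value)   # value of the whole expression
print(x.grad)    # dz/dx = y
print(y.grad)    # dz/dy = x + 5
```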

Static computational graph

● A static graph is defined as a fixed graph with inputs being undefined or abstract (variables)
● Then “execute” the graph with specific inputs
● This can be cumbersome and hard to debug
● In theory fast as graph can be optimized during compilation
    model.compile(...)

Dynamic computational graph

• Dynamic graph or “eager execution” mode
• You define concrete tensors, e.g.,
    x = tf.constant(42.)
    y = tf.constant(2.)
• Then just write the calculations, e.g.,
    z = tf.multiply(x, y)
• The computational graph is generated “on the fly” in the background
• Easy to debug, feels more like normal Python coding

Sequential style

• Keras models typically defined in a sequential style
• Each layer is added in sequence to a list

Example: 2-layer MLP:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    model = Sequential()
    model.add(Dense(units=64, activation='relu', input_dim=100))
    model.add(Dense(units=10, activation='softmax'))
    model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
    model.fit(x_train, y_train)

Functional style

• Keras also supports a functional style
• Each step is written as a function of some input (or output from a previous step)

Example: 2-layer MLP:

    from tensorflow.keras.models import Model
    from tensorflow.keras.layers import Dense, Input

    # No value, just an “open slot” for a future value
    inputs = Input(shape=(100,))
    x = Dense(64, activation='relu')(inputs)
    predictions = Dense(10, activation='softmax')(x)
    model = Model(inputs=inputs, outputs=predictions)
    model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
    model.fit(x_train, y_train)

Functional with subclassing

• Define the model initialization and forward pass explicitly
• Makes sense in “eager execution” mode

Example: 2-layer MLP:

    import tensorflow as tf
    from tensorflow.keras import layers

    class MyModel(tf.keras.Model):
        def __init__(self):
            super(MyModel, self).__init__(name='my_model')
            self.dense_1 = layers.Dense(64, activation='relu')
            self.dense_2 = layers.Dense(10, activation='softmax')

        def call(self, inputs):
            x = self.dense_1(inputs)
            return self.dense_2(x)

    model = MyModel()
    model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
    model.fit(x_train, y_train)

Software frameworks for deep learning

TensorFlow:
● Most popular framework
● “Traditional” TF 1.0 pretty low-level and hard to use
● Keras easy-to-use “front end”, built in since TF 2.0
● TF and Keras originally based on static computational graphs
● TF 2.0: “eager execution”, i.e., dynamic graphs
● Supports all styles (in principle)

PyTorch:
● Python version of (Lua) Torch
● Strongest growth, especially in academia
● Based from the start on dynamic graphs
● Functional style with subclassing
● More low-level than Keras, but easier to use than traditional TF

Popularity (among researchers)

Image source: https://www.oreilly.com/ideas/one-simple-graphic-researchers-love--and-tensorflow

PyTorch: functional with subclassing

    import torch.nn as nn
    import torch.nn.functional as F
    import torch.optim as optim

    # Network defined as a Python class
    class Net(nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            # nn.Linear needs both sizes; 100 inputs as in the Keras examples
            self.fc1 = nn.Linear(100, 64)
            self.fc2 = nn.Linear(64, 10)

        def forward(self, inputs):
            x = F.relu(self.fc1(inputs))
            predictions = self.fc2(x)
            return predictions

    net = Net()
    optimizer = optim.RMSprop(net.parameters())
    criterion = nn.CrossEntropyLoss()

    # We have to handle the training loop manually
    for i in range(num_epochs):
        for x_train, y_train in batch_loader:
            optimizer.zero_grad()
            outputs = net(x_train)
            loss = criterion(outputs, y_train)
            loss.backward()      # backpropagation
            optimizer.step()     # weight updates

Common neural network modules in PyTorch

    torch.nn.Linear(in_features, out_features, bias=True)
    torch.nn.Dropout(p=0.5, ...)
    torch.nn.Conv2d(in_channels, out_channels, kernel_size, ...)
    torch.nn.Embedding(num_embeddings, embedding_dim, ...)
    torch.nn.LSTM(input_size, hidden_size, num_layers=1, dropout=0, bidirectional=False, ...)
    torch.nn.GRU(input_size, hidden_size, num_layers=1, dropout=0, bidirectional=False, ...)

TF/Keras or PyTorch?

We’ll provide examples on how to do things with PyTorch, it’s up to you if you wish to learn PyTorch or stick with TF/Keras
• PyTorch allows more control and customization, easier experimentation with new architectures
• Keras is easier if you just want to apply deep learning, and not do research in machine learning

Useful PyTorch links:
https://pytorch.org/tutorials/
https://pytorch.org/docs/stable/index.html

Lecture 6: GPUs and batch jobs

Practical deep learning

GPU computing

• CPUs are optimized for latency whereas GPUs are optimized for throughput
• Example: CSC’s GPU nodes with V100’s:

                    #cores      max clock speed   memory
    2 x Xeon CPUs   2 x 20      3.9 GHz           384 GB
    4 x V100 GPUs   4 x 5120    1.455 GHz         4 x 32 GB

CSC - IT Center for Science Ltd.

● Finnish non-profit state enterprise with special tasks
● Owned by the Finnish state (70%) and higher education institutions (30%)
● ICT expertise for research, education, public administration
● Headquarters in Espoo, datacenter in Kajaani

Running GPU batch jobs on CSC’s clusters

CSC’s solutions

• Computing and software
• Data management and analytics for research
• Support and training for research
• Research administration
• Solutions for managing and organizing education
• Solutions for learners and teachers
• Solutions for educational and teaching cooperation
• Hosting services tailored to customers’ needs
• Identity and authorisation
• Management and use of data

ICT platforms, Funet network and data center functions are the base for our solutions

CSC’s computing resources

• Supercomputer Sisu → Mahti, April 2020
• Supercluster Taito → Puhti (since 2.9.2019)
• EuroHPC pre-exascale supercomputer LUMI, Q4/2020
  • will be among world’s fastest computers, ~ 200 petaflop/s
  • https://datacenter.csc.fi/wp/about-eurohpc/
• Cloud services (cPouta, ePouta, Rahti)
• Accelerated computing (GPUs, Pouta, and Puhti AI)
• Grid (FGCI)
• International resources
  • Extremely large computing (PRACE)
  • Nordic resources (NEIC)

Puhti AI

V100 nodes consisting of 80 servers with:
• 2x Xeon Gold 6230 Cascade Lake CPUs with 20 cores each running at 2.1 GHz
  • CPUs support VNNI instructions for AI inference workloads
• 384 GB memory
• 4x V100 GPUs with 32 GB memory each, connected with NVLink
• 3.6 TB fast local NVMe storage
• Dual-rail HDR100 InfiniBand interconnect network connectivity, providing 200 Gbps of aggregate bandwidth

CSC compute nodes are used via a queuing system

Do not use the login node for heavy computation!

puhti.csc.fi

Batch jobs

Steps for running a batch job:
1. Write a batch job script
2. Make sure you have all the input files where the program can find them
3. Submit your job (sbatch batch_job_file.sh)
4. Wait (or check progress: tail slurm-jobid.out)
5. Look at the results, e.g., standard output in slurm-jobid.out

You have to specify the necessary resources:
• resources need to be sufficient for the job
• requested resources consume BUs and affect time spent in queue ⇒ realistic resource requests give best results

Example batch job script on Puhti

    #!/bin/bash
    #SBATCH --partition=gpu
    #SBATCH --gres=gpu:v100:1
    #SBATCH --time=1:00:00
    #SBATCH --mem=8G
    #SBATCH --cpus-per-task=4
    #SBATCH --account=

    python my_python_program.py

Relevant sbatch options

    -J, --job-name            name of job
    -c, --cpus-per-task       number of processors per task
    -p, --partition           specify partition (gpu, gputest, gpulong)
    --gres=gpu:v100:number    request number of GPUs of type v100
    --gres=nvme:size          request $LOCAL_SCRATCH fast local disk storage of the given size in GB
    -t, --time                time limit in DD-HH:MM:SS
    --mem                     the real (host) memory required per node
    -o, --output              file for script’s standard output
    -e, --error               file for script’s standard error
    -A, --account             project name (required on Puhti!)

Managing batch jobs Module system

• Different software packages have different, possibly conflicting, requirements sbatch batch_job_file.sh submit a job • Most commonly used module commands: sbatch --options batch_job_file.sh submit job and add or modify option module help Show available options module load modulename Load the given environment module scancel jobid delete a job module load modulename/version module list List the loaded modules squeue -l show all jobs module avail List modules that are available to load squeue -l -u username show all jobs for a single user module spider List all existing modules squeue -l -j jobid show status of a single job module spider name Search the list of existing modules module swap module1 module2 Replaces a module, including sinfo check all available queues compatible versions of other loaded modules seff jobid show CPU, mem and GPU utilization module unload modulename Unload the given environment module module purge Unload all modules


Modules for deep learning

● All applications on Puhti: https://docs.csc.fi/#apps/
● Specific for data analytics and machine learning: https://docs.csc.fi/#apps/#data-analytics-and-machine-learning
● Split into several modules:
  ○ Python Data
  ○ PyTorch
  ○ TensorFlow
  ○ MXNet
● Usage, e.g.:
    module load tensorflow/2.0.0

Loading large datasets with tf.data

● MNIST is a toy dataset - we can load it entirely into memory
● For real datasets we need to be more sophisticated
  ○ Typically load just the next set of batches into memory
● TensorFlow provides the tf.data API to create an input pipeline
● Example:
    # from_tensor_slices takes a single argument, so paths and labels go in as a tuple
    ds = tf.data.Dataset.from_tensor_slices((im_paths, im_labels))
    ds = ds.map(load_and_augment_image, num_parallel_calls=10)
    ds = ds.shuffle(2000).batch(32, drop_remainder=True)
    # in TensorFlow 2.0 the constant lives under tf.data.experimental
    ds = ds.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
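The idea of keeping only the next few batches in memory, as in the shuffle(...).batch(...) steps above, can be illustrated without TensorFlow. This is a conceptual sketch in plain Python, not how tf.data is implemented. shuffled_batches and its parameters are hypothetical names chosen to mirror the pipeline.

```python
import random


def shuffled_batches(items, batch_size, buffer_size, seed=0,
                     drop_remainder=True):
    """Yield shuffled batches; the shuffle buffer holds at most buffer_size items."""
    rng = random.Random(seed)
    buffer = []
    batch = []
    for item in items:
        buffer.append(item)
        if len(buffer) < buffer_size:
            continue
        # Buffer is full: emit one randomly chosen element.
        index = rng.randrange(len(buffer))
        buffer[index], buffer[-1] = buffer[-1], buffer[index]
        batch.append(buffer.pop())
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Input exhausted: drain what is left in the buffer in random order.
    rng.shuffle(buffer)
    for item in buffer:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # A final partial batch is only emitted when drop_remainder is False.
    if batch and not drop_remainder:
        yield batch


batches = list(shuffled_batches(range(10), batch_size=4, buffer_size=5))
```

Because the input is consumed lazily, items could equally be a generator over file paths that is far too large to hold in memory at once.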

Don’t read lots of small files - use TFRecords

● It is often inefficient to read lots of small files from disk
  ○ E.g., each batch can contain 100 images ⇒ 100 separate files to read
  ○ Better design: read a small number of large files
● TensorFlow’s solution: TFRecords - a simple record-oriented binary format
● Example:
    ds = tf.data.TFRecordDataset(tfrecord_paths)
    ds = ds.map(...).shuffle(...).batch(...) # same as before…
● For more, see: https://www.tensorflow.org/guide/datasets

TensorBoard

• Tool to visualize TF graphs, plot quantitative metrics, and show additional data like images
• Operates by reading TF event files, which contain summary data that can be generated while running TF (or Keras, PyTorch, etc.)
• Instructions in the exercises if you want to try
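To make the "small number of large files" design from the TFRecords slide concrete, the sketch below packs many small byte records into one stream and reads them back sequentially. It is an illustration only and is not TFRecord-compatible: the real TFRecord format also stores checksums alongside each record. write_records and read_records are hypothetical helpers.

```python
import io
import struct


def write_records(stream, records):
    """Append each bytes record to stream, prefixed by its 8-byte length."""
    for record in records:
        stream.write(struct.pack("<Q", len(record)))
        stream.write(record)


def read_records(stream):
    """Yield records back one at a time by reading the stream sequentially."""
    while True:
        header = stream.read(8)
        if not header:
            return
        if len(header) != 8:
            raise ValueError("truncated header")
        (length,) = struct.unpack("<Q", header)
        record = stream.read(length)
        if len(record) != length:
            raise ValueError("truncated record")
        yield record


# Stand-ins for 100 small image files; a real pipeline would use file contents.
small_files = [("image-%d" % i).encode("ascii") for i in range(100)]
packed = io.BytesIO()
write_records(packed, small_files)
packed.seek(0)
restored = list(read_records(packed))
```

Reading the packed stream is one sequential pass over a single file, instead of 100 separate open and read operations.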


Running TensorBoard in the cluster

puhti.csc.fi

Lecture 7: Cloud, GPU utilization, using multiple GPUs

Data analytics in the cloud

• Cloud environment allows flexible data analytics
• Pouta (OpenStack): run and manage your own VMs
  • GPU nodes and IO intensive nodes
  • ePouta for sensitive data
• Rahti (OpenShift/Kubernetes): run and manage your own containers (in public beta)
• Allas for shared data storage
• Good for: web applications, big data frameworks, installing custom software, building computing infrastructure

GPU utilization

• a GPU cannot be shared among users
  • running multiple parallel processes possible (in theory) but cumbersome
⇒ GPU jobs should be optimized to utilize the GPU as efficiently as possible
  • one standard solution: increase mini-batch size
• do not reserve multiple GPUs unless you can utilize all of them
• monitor your GPU usage (not yet in Puhti): seff jobid

Mixed precision

• The Volta architecture includes Tensor Cores:
  • specialized hardware for tensor multiplication
  • uses half-precision (FP16) for intermediate values
• In PyTorch we need to use a separate library: https://github.com/NVIDIA/apex

Animation from: https://www.nvidia.com/en-us/data-center/tensorcore/

Using multiple GPUs

Using multiple GPUs for model training

• Model and data parallelism
• Single-node multi-GPU and distributed training
• All the main frameworks offer some level of support
  • TensorFlow and PyTorch both good choices
  • external tools: Horovod, ...

[Figure panels: Model parallelism; Data parallelism]

Model parallelism in AlexNet

Model parallelism in Google’s NMT

Image from https://arxiv.org/pdf/1609.08144.pdf

Model and data parallelism in Nvidia Megatron

• 8.3 billion parameters!

Table from: https://nv-adlr.github.io/MegatronLM

MPI allreduce

• In data parallelism, we need to gather all gradients and send the mean of the gradients back to all GPUs

Images from: https://cwiki.apache.org/confluence/display/MXNET/Single+machine+All+Reduce+Topology-aware+Communication

Horovod and ring-allreduce

• Horovod is a Python framework for distributed deep learning
• supports TensorFlow, Keras, and PyTorch
• uses Nvidia’s NCCL 2, which provides a highly optimized version of ring-allreduce
• uses MPI, which launches all tasks and transparently sets up the distributed infrastructure for communication between tasks
• readily compatible with Slurm!

Image from: https://eng.uber.com/horovod/
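The allreduce step described above, where gradients are gathered and their mean is returned to every GPU, can be simulated in a single process to show the ring communication pattern. This is a sketch for illustration only: it uses plain Python lists in place of GPUs and performs no NCCL or MPI communication. ring_allreduce_mean is a hypothetical name.

```python
def ring_allreduce_mean(worker_grads):
    """Simulate ring-allreduce and return the mean gradient on every worker.

    worker_grads[i] is worker i's gradient, a list of floats whose length
    is divisible by the number of workers.
    """
    n = len(worker_grads)
    size = len(worker_grads[0]) // n  # elements per chunk
    # chunks[i][c] is worker i's copy of chunk c.
    chunks = [[list(g[c * size:(c + 1) * size]) for c in range(n)]
              for g in worker_grads]

    # Phase 1 (scatter-reduce): after n-1 steps worker i holds the full
    # sum of chunk (i + 1) % n.
    for step in range(n - 1):
        sent = [(i, (i - step) % n, list(chunks[i][(i - step) % n]))
                for i in range(n)]
        for i, c, payload in sent:
            target = chunks[(i + 1) % n][c]
            for k, value in enumerate(payload):
                target[k] += value

    # Phase 2 (allgather): pass the completed chunks around the ring.
    for step in range(n - 1):
        sent = [(i, (i + 1 - step) % n, list(chunks[i][(i + 1 - step) % n]))
                for i in range(n)]
        for i, c, payload in sent:
            chunks[(i + 1) % n][c] = payload

    # Every worker divides the summed gradient by n to get the mean.
    return [[value / n for chunk in worker for value in chunk]
            for worker in chunks]


grads = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
averaged = ring_allreduce_mean(grads)
```

In each step a worker only sends one chunk (1/n of the gradient) to its neighbour, so the amount each worker transmits stays roughly constant as the number of workers grows.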

GPU topology

[Figures: GPU topology of the NVIDIA DGX-2, CSC’s P100 nodes, and CSC’s Puhti V100 nodes; diagram labels: CPU, PCIe switch, GPU0 to GPU3]

Using multiple CPUs for ETL

[Figure: EXTRACT, TRANSFORM, LOAD]

• In TF: tf.data and num_parallel_calls and prefetch

    dataset = dataset.map(..., num_parallel_calls=N)
    dataset = dataset.prefetch(buffer_size)

• In PyTorch: workers (multiple processes)

    train_loader = torch.utils.data.DataLoader(..., num_workers=N)

Using multiple GPUs in Puhti

1. request multiple GPUs with sbatch:

    --gres=gpu:type:number    request number of GPUs of type (“v100” in Puhti)

2. modify your code to utilize multiple GPUs
  • if you use some existing code, there might already be an option for this

• a single process/CPU may not be able to feed GPUs fast enough => use multiple CPU cores for data processing
• IO easily becomes the bottleneck (especially when reading from parallel storage, network file access, lots of small files) => use local NVMe disk and/or TFRecords or similar

Using multiple GPUs in Keras

• Keras/TF2 supports single node multi-GPU data parallelism with tf.distribute.MirroredStrategy:

    mirrored_strategy = tf.distribute.MirroredStrategy()
    with mirrored_strategy.scope():
        model = Sequential(...)
        model.add(...)
        model.add(...)
        model.compile(...)

• Notes:
  • batch_size is split among GPUs (each gets batch_size/gpus of data)
  • other strategies currently experimental or planned, see https://www.tensorflow.org/beta/guide/distribute_strategy

Using multiple GPUs in PyTorch

• PyTorch supports single node multi-GPU data parallelism by wrapping your model with torch.nn.DataParallel()

    model = MyModel(...)
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
    model.to(device)

• Notes:
  • batch_size is split among GPUs (each gets batch_size/gpus of data)
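The note that each GPU gets batch_size/gpus of the data can be checked on a toy problem: with equal-sized shards, averaging the per-shard gradients reproduces the gradient of the full batch. The sketch below assumes a one-parameter linear model with a mean squared error loss, which is not part of the course material. Both function names are hypothetical.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w*x - y)**2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_gradient(w, xs, ys, num_gpus):
    """Split the batch evenly, compute per-shard gradients, average them."""
    per_gpu = len(xs) // num_gpus  # each replica gets batch_size/gpus samples
    shard_grads = []
    for g in range(num_gpus):
        lo, hi = g * per_gpu, (g + 1) * per_gpu
        shard_grads.append(mse_gradient(w, xs[lo:hi], ys[lo:hi]))
    return sum(shard_grads) / num_gpus


xs = [float(i) for i in range(32)]
ys = [3.0 * x + 1.0 for x in xs]
full = mse_gradient(0.5, xs, ys)
parallel = data_parallel_gradient(0.5, xs, ys, num_gpus=4)
print(full, parallel)
```

The two values agree up to floating-point rounding, so a batch of 32 on one GPU and four shards of 8 on four GPUs lead to the same update. The equivalence relies on the shards being the same size.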