Introduction to Deep Learning
Zaikun Xu, HPC Advisory Council Switzerland, 2017
Overview
• In-depth Introduction
• Big Data + Big computational power
• Representation Learning
Paradigms of Learning
• Supervised Learning (Data with ground truth)
• Unsupervised Learning (Data without ground truth)
• Reinforcement Learning (feedback loop)
Inspiration from the Brain
• Over 100 billion neurons
• Around 1,000 connections per neuron
• Electrical signals are transmitted through the axon and can be inhibitory or excitatory
• Activation of a neuron depends on its inputs
Artificial Neurons
source: http://cs231n.stanford.edu/
Circles → neurons
Arrows → synapses
Weights → signals
Perceptron
• Perceptron with a linear threshold activation:
f(x) = 1 if w·x + b > 0, else 0
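A minimal Python sketch of this decision rule; the weights, bias, and AND-gate example are illustrative, not from the slides:

```python
import numpy as np

def perceptron(x, w, b):
    """Fire (output 1) iff the weighted input exceeds the threshold."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Illustrative weights: this perceptron computes logical AND.
w = np.array([1.0, 1.0])
b = -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array(x), w, b))
```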
Nonlinearity
• A neural network can only learn complicated, structured data with non-linear activation functions; a stack of purely linear layers collapses into a single linear map (see the sketch below)
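A quick numpy illustration (random matrices, assumed for this example) of why the non-linearity matters:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

# Two purely linear layers collapse into a single linear map ...
two_linear = W2 @ (W1 @ x)
one_linear = (W2 @ W1) @ x
print(np.allclose(two_linear, one_linear))  # True: extra depth adds nothing

# ... but a non-linearity such as ReLU between them breaks the collapse.
relu = lambda z: np.maximum(z, 0.0)
nonlinear = W2 @ relu(W1 @ x)
```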
Universal approximation theorem
• A feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of R^n, under mild assumptions on the activation function
• Can it actually learn well?
Back-Propagation
• 1970: Geoffrey Hinton earned a bachelor's degree in psychology and went on to study neural networks at the University of Edinburgh
• 1986: Hinton and David Rumelhart published the Nature paper “Learning Representations by Back-propagating Errors”
• Hidden layers added
• Computation scales linearly with the number of neurons
• Of course, computers by then were much faster
BP - forward
• Forward pass, propagating information from input to output
o = f(Wx + b)
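A one-layer forward pass in numpy; the sigmoid is an assumed choice for the activation f:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W, b, x):
    # Propagate information from input x to output o = f(Wx + b).
    return sigmoid(W @ x + b)
```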
BP - backward
• Backward pass: propagating errors back and updating the weights accordingly by calculating gradients
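Continuing the sketch above, a hand-rolled backward pass for the same one-layer network with a squared-error loss (an illustration of the chain rule, not any particular library's API):

```python
import numpy as np

def backward_step(W, b, x, y, lr=0.1):
    """One forward + backward pass with E = 0.5 * ||o - y||^2."""
    o = 1.0 / (1.0 + np.exp(-(W @ x + b)))   # forward pass
    delta = (o - y) * o * (1.0 - o)          # dE/dz via the sigmoid derivative
    grad_W = np.outer(delta, x)              # dE/dW by the chain rule
    grad_b = delta                           # dE/db
    return W - lr * grad_W, b - lr * grad_b  # gradient-descent update
```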
Convolutional NN
• Yann LeCun, a post-doc of Hinton at the University of Toronto
• LeNet for handwritten-digit recognition at Bell Labs, 5% error rate
• Key techniques: convolution and pooling
• Simple operations, repeated many times
Local connection
Weight sharing
Convolution
Pooling
http://vaaaaaanquish.hatenablog.com/entry/2015/01/26/060622
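A naive numpy sketch of both operations: one small kernel is reused at every image location (local connections + weight sharing), and max pooling keeps the strongest response per window. Shapes and the kernel itself are illustrative:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid convolution: the same small kernel slides over every location."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling: keep the strongest response per window."""
    H, W = x.shape
    x = x[:H - H % size, :W - W % size]  # trim to a multiple of the window
    return x.reshape(H // size, size, W // size, size).max(axis=(1, 3))
```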
Support Vector Machine
• Vladimir Vapnik, born in the Soviet Union and a colleague of Yann LeCun, invented the SVM in 1963
• Key techniques: max-margin + kernel mapping (sketch below)
• Works well when the dataset is small
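A hedged scikit-learn sketch of both ideas on a made-up toy dataset: SVC fits a max-margin classifier, and kernel='rbf' supplies the kernel mapping:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D dataset that is not linearly separable: class = inside/outside a ring.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) < 1.0).astype(int)

# Max-margin classifier with an RBF kernel mapping.
clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print(clf.score(X, y))
```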
Another winter looming
• NN: good at modeling complicated datasets, but its capacity is hard to control
• SVM: easy to use, and training is repeatable
• SVM achieved a 0.8% error rate in 1998 and 0.56% in 2002 on the same handwritten-digit recognition task
• NNs get stuck in local optima, are hard and slow to train, and tend to overfit
Recurrent Neural Nets
s_t = f(x_t U + s_{t−1} W)
ŷ_t = o_t = Softmax(V s_t)
The weights U, V, W are shared across all time steps.
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/
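A minimal numpy sketch of the equations above, unrolled over a sequence; the shared weights U, W, V are assumed to be given (random in practice before training):

```python
import numpy as np

def rnn_forward(xs, U, W, V):
    """xs: sequence of input vectors; returns the softmax output per step."""
    s = np.zeros(W.shape[0])                 # initial hidden state s_0
    outputs = []
    for x in xs:
        s = np.tanh(U @ x + W @ s)           # s_t = f(U x_t + W s_{t-1})
        logits = V @ s                       # same V at every step
        outputs.append(np.exp(logits) / np.sum(np.exp(logits)))  # softmax
    return outputs
```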
Vanishing Gradient Problem
E(y, ŷ) = Σ_t E_t(y_t, ŷ_t) = −Σ_t y_t log ŷ_t

∂E_T/∂W = Σ_{k=0}^{T} (∂E_T/∂ŷ_T)(∂ŷ_T/∂s_T)(∂s_T/∂s_k)(∂s_k/∂W)

∂s_T/∂s_k = Π_{j=k+1}^{T} ∂s_j/∂s_{j−1} = Π_{j=k+1}^{T} Wᵀ diag(f′(s_{j−1}))

Because ∂s_T/∂s_k is a product of T − k Jacobians, gradients shrink (or explode) exponentially with the distance T − k, so long-range dependencies are hard to learn.

LSTM
• Gating mechanism to deal with long-term dependencies
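A sketch of one LSTM cell in numpy, following the standard gating formulation (biases omitted; weight names and shapes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h, c, Wf, Wi, Wo, Wg):
    """One step: gates decide what to forget, admit, and expose."""
    z = np.concatenate([x, h])
    f = sigmoid(Wf @ z)     # forget gate: keep or erase old cell state
    i = sigmoid(Wi @ z)     # input gate: admit new information
    o = sigmoid(Wo @ z)     # output gate: expose cell state
    g = np.tanh(Wg @ z)     # candidate values
    c = f * c + i * g       # additive path preserves long-term dependencies
    h = o * np.tanh(c)
    return h, c
```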
Canadian Institute for Advanced Research
• Hinton was happy to rebrand his neural network research as “deep learning”. The story goes on …
Optimization
• Non-convex, non-linear objective functions
W ← W − α ∂E/∂W
animation by Alec Radford
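The update rule written out as a small gradient-descent loop; the toy objective E(W) = ||W − 3||² is made up for illustration (real losses are non-convex):

```python
import numpy as np

def grad(W):
    # Gradient of the toy objective E(W) = ||W - 3||^2.
    return 2.0 * (W - 3.0)

W, alpha = np.zeros(2), 0.1
for _ in range(100):
    W = W - alpha * grad(W)   # W <- W - alpha * dE/dW
print(W)                      # approaches [3. 3.]
```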
Summary
• While neural networks can in principle approximate any function from input X to output y, they are very hard to train with multiple layers:
• Initialization/Optimization
• Activation function
• More data / dropout to avoid overfitting (sketch below)
• Accelerate with GPU
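A sketch of the dropout idea from the list above, in the common “inverted dropout” form (the keep probability is an assumed example value):

```python
import numpy as np

def dropout(a, p_keep=0.5, training=True):
    """Inverted dropout: randomly silence units while training, rescaling
    so expected activations match test time (where this is a no-op)."""
    if not training:
        return a
    mask = (np.random.rand(*a.shape) < p_keep) / p_keep
    return a * mask
```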
Big data + Big computational power
• Large number of images collected with labels
• ImageNet: ~1.2M labeled images across 1,000 classes
• YouTube-8M: 450,000 hours of video with 4,700+ classes
• Nvidia GPUs
• Tesla/Pascal
Parallelization
• Data parallelism: same weights but different data; gradients need to be synchronized (sketch after the link below)
http://research.baidu.com/bringing-hpc-techniques-deep-learning/
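A hedged mpi4py sketch of this scheme, in the spirit of the linked Baidu post: each rank computes gradients on its own data shard, then an allreduce averages them so all replicas stay in sync (the local gradient here is a random stand-in for a real backward pass):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Every rank holds the same weights but sees a different data shard.
local_gradients = np.random.rand(1000)   # stand-in for a real backward pass

# Sum gradients across all ranks, then average: replicas stay identical.
summed = np.empty_like(local_gradients)
comm.Allreduce(local_gradients, summed, op=MPI.SUM)
gradients = summed / comm.Get_size()
```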
Parallelization
• Model parallelism: same data, but different parts of the weights
• The parameters of a big neural net may not fit into the memory of a single GPU
Figure from Yi Wang
Model + Data Parallelization
• Hybrid scheme: model parallelism for the weight-heavy fully-connected layers, data parallelism for the computation-heavy convolutional layers (Krizhevsky et al. 2014)
AlphaGo
Summary
• Big data and computational power make it possible to train much more complicated NN models and accomplish complex tasks
Representation Learning
• http://www.rsipvision.com/exploring-deep-learning/
Transfer learning
• Use features learned on ImageNet to train your own classifier on other tasks
• 9 categories, 200,000 images
• Extract features from the last 2 layers + a linear SVM (sketch below)
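A hedged sketch of this recipe with torchvision and scikit-learn: strip the classifier head from an ImageNet-pretrained ResNet, use the penultimate activations as features, and fit a linear SVM. The model choice and placeholder tensors are assumptions, not the talk's exact setup:

```python
import torch
import torchvision.models as models
from sklearn.svm import LinearSVC

# Pretrained ImageNet model with its final classifier layer removed.
resnet = models.resnet18(pretrained=True)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])
feature_extractor.eval()

def features(images):
    """images: (N, 3, 224, 224) tensor -> (N, 512) pooled ImageNet features."""
    with torch.no_grad():
        return feature_extractor(images).flatten(1).numpy()

# Placeholder data (9 categories); load and normalize real images in practice.
train_images = torch.randn(32, 3, 224, 224)
train_labels = torch.randint(0, 9, (32,))
clf = LinearSVC().fit(features(train_images), train_labels.numpy())
```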
Fully Convolutional Networks for Semantic Segmentation
Faster-RCNN for Object Detection
Playing FPS Games with Deep Reinforcement Learning
Summary
• “The takeaway is that deep learning excels in tasks where the basic unit, a single pixel, a single frequency, or a single word/character has little meaning in and of itself, but a combination of such units has a useful meaning.” — Dallin Akagi
Q&A
• Thanks for your attention