Introduction to Deep Learning
Zaikun Xu, HPC Advisory Council Switzerland, 2017
Overview
• In-depth Introduction
• Big Data + Big computational power
• Representation Learning
Paradigms of Learning
• Supervised Learning (Data with ground truth)
• Unsupervised Learning (Data without ground truth)
• Reinforcement Learning (feedback loop)
Inspiration from the Brain
• Over 100 billion neurons
• Around 1,000 connections per neuron
• Electrical signals are transmitted through the axon and can be inhibitory or excitatory
• Activation of a neuron depends on its inputs
Artificial Neurons
source: http://cs231n.stanford.edu/
Circles → neurons
Arrows → synapses
Weights → signals
Perceptron
• Perceptron with a linear threshold activation:
f(x) = 1 if w·x + b > 0, else 0
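A minimal Python sketch of this decision rule; the weights, bias, and AND-gate example are illustrative, not from the slides:

```python
import numpy as np

def perceptron(x, w, b):
    """Fire (output 1) iff the weighted input exceeds the threshold."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Illustrative weights: this perceptron computes logical AND.
w = np.array([1.0, 1.0])
b = -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array(x), w, b))
```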
Nonlinearity
• A neural network can only learn complicated, structured data with non-linear activation functions; a stack of purely linear layers collapses into a single linear map (see the sketch below)
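A quick numpy illustration (random matrices, assumed for this example) of why the non-linearity matters:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

# Two purely linear layers collapse into a single linear map ...
two_linear = W2 @ (W1 @ x)
one_linear = (W2 @ W1) @ x
print(np.allclose(two_linear, one_linear))  # True: extra depth adds nothing

# ... but a non-linearity such as ReLU between them breaks the collapse.
relu = lambda z: np.maximum(z, 0.0)
nonlinear = W2 @ relu(W1 @ x)
```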
Universal approximation theorem
• A feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of R^n, under mild assumptions on the activation function
• Can it actually learn well?
Back-Propagation
• 1970: Geoffrey Hinton earned a bachelor's degree in psychology and went on to study neural networks at the University of Edinburgh
• 1986: Hinton and David Rumelhart published the Nature paper “Learning Representations by Back-propagating Errors”
• Hidden layers added
• Computation scales linearly with the number of neurons
• Of course, computers by then were much faster
BP - forward
• Forward pass, propagating information from input to output
o = f(Wx + b)
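A one-layer forward pass in numpy; the sigmoid is an assumed choice for the activation f:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W, b, x):
    # Propagate information from input x to output o = f(Wx + b).
    return sigmoid(W @ x + b)
```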
BP - backward
• Backward pass: propagating errors back and updating the weights accordingly by calculating gradients
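Continuing the sketch above, a hand-rolled backward pass for the same one-layer network with a squared-error loss (an illustration of the chain rule, not any particular library's API):

```python
import numpy as np

def backward_step(W, b, x, y, lr=0.1):
    """One forward + backward pass with E = 0.5 * ||o - y||^2."""
    o = 1.0 / (1.0 + np.exp(-(W @ x + b)))   # forward pass
    delta = (o - y) * o * (1.0 - o)          # dE/dz via the sigmoid derivative
    grad_W = np.outer(delta, x)              # dE/dW by the chain rule
    grad_b = delta                           # dE/db
    return W - lr * grad_W, b - lr * grad_b  # gradient-descent update
```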
Convolutional NN
• Yann LeCun, a post-doc of Hinton at the University of Toronto
• LeNet for handwritten-digit recognition at Bell Labs, 5% error rate
• Key techniques: convolution and pooling
• Simple operations, repeated many times
Local connection
Weight sharing
Convolution
Pooling
http://vaaaaaanquish.hatenablog.com/entry/2015/01/26/060622
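A naive numpy sketch of both operations: one small kernel is reused at every image location (local connections + weight sharing), and max pooling keeps the strongest response per window. Shapes and the kernel itself are illustrative:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid convolution: the same small kernel slides over every location."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling: keep the strongest response per window."""
    H, W = x.shape
    x = x[:H - H % size, :W - W % size]  # trim to a multiple of the window
    return x.reshape(H // size, size, W // size, size).max(axis=(1, 3))
```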
Support Vector Machine
• Vladimir Vapnik, born in the Soviet Union and a colleague of Yann LeCun, invented the SVM in 1963
• Key techniques: max-margin + kernel mapping (sketch below)
• Works well when the dataset is small
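A hedged scikit-learn sketch of both ideas on a made-up toy dataset: SVC fits a max-margin classifier, and kernel='rbf' supplies the kernel mapping:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D dataset that is not linearly separable: class = inside/outside a ring.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) < 1.0).astype(int)

# Max-margin classifier with an RBF kernel mapping.
clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print(clf.score(X, y))
```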
Another winter looming
• NN: good at modeling complicated datasets, but its capacity is hard to control
• SVM: easy to use, and training is repeatable
• SVM achieved a 0.8% error rate in 1998 and 0.56% in 2002 on the same handwritten-digit recognition task
• NNs get stuck in local optima, are hard and slow to train, and tend to overfit
Recurrent Neural Nets
s_t = f(x_t U + s_{t−1} W)
ŷ_t = o_t = Softmax(V s_t)
The weights U, V, W are shared across all time steps.
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/
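A minimal numpy sketch of the equations above, unrolled over a sequence; the shared weights U, W, V are assumed to be given (random in practice before training):

```python
import numpy as np

def rnn_forward(xs, U, W, V):
    """xs: sequence of input vectors; returns the softmax output per step."""
    s = np.zeros(W.shape[0])                 # initial hidden state s_0
    outputs = []
    for x in xs:
        s = np.tanh(U @ x + W @ s)           # s_t = f(U x_t + W s_{t-1})
        logits = V @ s                       # same V at every step
        outputs.append(np.exp(logits) / np.sum(np.exp(logits)))  # softmax
    return outputs
```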
Vanishing Gradient Problem
E(y, ŷ) = Σ_t E_t(y_t, ŷ_t) = −Σ_t y_t log ŷ_t

∂E_T/∂W = Σ_{k=0}^{T} (∂E_T/∂ŷ_T)(∂ŷ_T/∂s_T)(∂s_T/∂s_k)(∂s_k/∂W)

∂s_T/∂s_k = Π_{j=k+1}^{T} ∂s_j/∂s_{j−1} = Π_{j=k+1}^{T} Wᵀ diag(f′(s_{j−1}))

Because ∂s_T/∂s_k is a product of T − k Jacobians, gradients shrink (or explode) exponentially with the distance T − k, so long-range dependencies are hard to learn.

LSTM
• Gating mechanism to deal with long-term dependencies
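A sketch of one LSTM cell in numpy, following the standard gating formulation (biases omitted; weight names and shapes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h, c, Wf, Wi, Wo, Wg):
    """One step: gates decide what to forget, admit, and expose."""
    z = np.concatenate([x, h])
    f = sigmoid(Wf @ z)     # forget gate: keep or erase old cell state
    i = sigmoid(Wi @ z)     # input gate: admit new information
    o = sigmoid(Wo @ z)     # output gate: expose cell state
    g = np.tanh(Wg @ z)     # candidate values
    c = f * c + i * g       # additive path preserves long-term dependencies
    h = o * np.tanh(c)
    return h, c
```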
Canadian Institute for Advanced Research
• Hinton was happy to rebrand his neural network research as “deep learning”. The story goes on …
Optimization
• Non-convex, non-linear objective functions
W ← W − α ∂E/∂W
animation by Alec Radford
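The update rule written out as a small gradient-descent loop; the toy objective E(W) = ||W − 3||² is made up for illustration (real losses are non-convex):

```python
import numpy as np

def grad(W):
    # Gradient of the toy objective E(W) = ||W - 3||^2.
    return 2.0 * (W - 3.0)

W, alpha = np.zeros(2), 0.1
for _ in range(100):
    W = W - alpha * grad(W)   # W <- W - alpha * dE/dW
print(W)                      # approaches [3. 3.]
```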
Summary
• While neural networks can in principle approximate any function from input X to output y, they are very hard to train with multiple layers:
• Initialization/Optimization
• Activation function
• More data / dropout to avoid overfitting (sketch below)
• Accelerate with GPU
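A sketch of the dropout idea from the list above, in the common “inverted dropout” form (the keep probability is an assumed example value):

```python
import numpy as np

def dropout(a, p_keep=0.5, training=True):
    """Inverted dropout: randomly silence units while training, rescaling
    so expected activations match test time (where this is a no-op)."""
    if not training:
        return a
    mask = (np.random.rand(*a.shape) < p_keep) / p_keep
    return a * mask
```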
Big data + Big computational power
• Large number of images collected with labels
• ImageNet: ~1.2M labeled images across 1,000 classes
• YouTube-8M: 450,000 hours of video with 4,700+ classes
• Nvidia GPUs
• Tesla/Pascal
Parallelization
• Data parallelism: same weights but different data; gradients need to be synchronized (sketch after the link below)
http://research.baidu.com/bringing-hpc-techniques-deep-learning/
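A hedged mpi4py sketch of this scheme, in the spirit of the linked Baidu post: each rank computes gradients on its own data shard, then an allreduce averages them so all replicas stay in sync (the local gradient here is a random stand-in for a real backward pass):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Every rank holds the same weights but sees a different data shard.
local_gradients = np.random.rand(1000)   # stand-in for a real backward pass

# Sum gradients across all ranks, then average: replicas stay identical.
summed = np.empty_like(local_gradients)
comm.Allreduce(local_gradients, summed, op=MPI.SUM)
gradients = summed / comm.Get_size()
```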
Parallelization
• Model parallelism: same data, but different parts of the weights
• The parameters of a big neural net may not fit into the memory of a single GPU
Figure from Yi Wang
Model + Data Parallelization
• Hybrid scheme: model parallelism for the weight-heavy fully-connected layers, data parallelism for the computation-heavy convolutional layers (Krizhevsky et al. 2014)
AlphaGo
Summary
• Big data and computational power make it possible to train much more complicated NN models and accomplish complex tasks
Representation Learning
• http://www.rsipvision.com/exploring-deep-learning/
Transfer learning
• Use features learned on ImageNet to train your own classifier on other tasks
• 9 categories, 200,000 images
• Extract features from the last 2 layers + a linear SVM (sketch below)
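A hedged sketch of this recipe with torchvision and scikit-learn: strip the classifier head from an ImageNet-pretrained ResNet, use the penultimate activations as features, and fit a linear SVM. The model choice and placeholder tensors are assumptions, not the talk's exact setup:

```python
import torch
import torchvision.models as models
from sklearn.svm import LinearSVC

# Pretrained ImageNet model with its final classifier layer removed.
resnet = models.resnet18(pretrained=True)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])
feature_extractor.eval()

def features(images):
    """images: (N, 3, 224, 224) tensor -> (N, 512) pooled ImageNet features."""
    with torch.no_grad():
        return feature_extractor(images).flatten(1).numpy()

# Placeholder data (9 categories); load and normalize real images in practice.
train_images = torch.randn(32, 3, 224, 224)
train_labels = torch.randint(0, 9, (32,))
clf = LinearSVC().fit(features(train_images), train_labels.numpy())
```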
Fully Convolutional Networks for Semantic Segmentation
Faster-RCNN for Object Detection
Playing FPS Games with Deep Reinforcement Learning
Summary
• “The takeaway is that deep learning excels in tasks where the basic unit, a single pixel, a single frequency, or a single word/character has little meaning in and of itself, but a combination of such units has a useful meaning.” — Dallin Akagi
Q&A
• Thanks for your attention