@ NVIDIA GTC, April 6th, 2016
Chainer: A Powerful, Flexible, and Intuitive Deep Learning Framework
Shohei Hido, Chief Research Officer, Preferred Networks, Inc.

Overview
http://chainer.org/
• Chainer is a Python-based deep learning framework
• Chainer v1.0 was released as open source in June 2015
• It DOESN'T rely on Theano, unlike other Python frameworks
• Chainer uses a unique scheme named Define-by-Run
• Why do users still need another framework?
• How different and effective is Chainer?

Preferred Networks (PFN)
A startup that applies deep learning to industrial IoT
• Founded: March 2014
• Headquarters: Tokyo, Japan
• U.S. subsidiary: San Mateo, California
• Company size: 35 engineers & researchers
• Investors: Toyota, FANUC, NTT
[Diagram: deep learning applied to industrial IoT in manufacturing, healthcare, and automotive]

Partnering with world-leading companies using Chainer
• R&D collaboration on industrial problems with real-world data
  – Specific requirements, modified algorithms, many trials and errors, etc.
  – Different from building a general-purpose recognition system
• Partners include Toyota, NTT, FANUC, Panasonic, Cisco, and NVIDIA

Two types of background behind DL frameworks
1. Scalability-oriented
• Use-cases in mind
  – Image/speech recognition system
  – Fast DL as a service in cloud
• Problem type
  – A few general applications
  – 10+ million training samples
  – 10+ node cluster w/ fast network
• Possible bottleneck
  – Tuning of well-known algorithms
  – Distributed computation for model/data-parallel training

2. Flexibility-oriented
• Use-cases in mind
  – Algorithm research
  – R&D projects for new products
• Problem type
  – Various specific applications
  – 10+ k training samples
  – 1 node with multiple GPUs
• Possible bottleneck
  – Trial-and-error in prototyping
  – Debugging, profiling & refactoring
  – (Wait time during compilation)

Designed for efficient research & development
• Flexible: new kinds of complex models for various applications
• Intuitive: rapid prototyping and efficient trial-and-error
• Powerful: comparable performance for 1 node & multiple GPUs
[Diagram: frameworks positioned along a scalability-oriented vs. flexibility-oriented spectrum]

Agenda
• Deep learning framework basics
• Introduction to Chainer
• CuPy: NumPy-compatible GPU library
• Performance and applications

Neural network and computation
[Diagram - Forward computation: inputs x1..xN (image, sensor, text) pass through hidden units (h1..hH, k1..kM) to outputs y1..yM such as "Object: Tulip", "Anomaly score: 0.35", and "Category: Sports". Backward computation (backpropagation) flows in the reverse direction.]

Chainer focuses on network representation/training
• Design choices for deep learning frameworks
  – How to build neural networks?
  – How to train neural networks?
  – Which text format/language for modeling?
  – Which language for computing?
  – Run with GPU?
  – Run on multiple GPUs?
  – Run on multiple compute nodes?

Building and training neural networks: computational graph construction is the key
1. Construct a computational graph
  – Based on the network definition given by the user
  – Chains of functions and operations on input variables
2. Compute loss and gradients
  – Forward computation calculates the loss for a minibatch
  – Backpropagation gives gradients for all parameters
3. Optimize the model
  – Update each parameter with its gradient
  – Repeat until convergence
Step 1 is the most important, and there are many approaches to it; a minimal sketch of steps 2 and 3 follows below.
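
To make steps 2 and 3 concrete, here is a minimal framework-free sketch on a hypothetical one-parameter model y = w * x with a squared-error loss (plain NumPy; the values are chosen only for illustration):

import numpy as np

w = np.float32(2.0)                      # parameter to be learned
x, t = np.float32(3.0), np.float32(4.5)  # one training sample (input, target)
lr = 0.01                                # learning rate

for _ in range(100):
    y = w * x                            # step 2: forward computation
    loss = (y - t) ** 2                  # loss for this minibatch of one sample
    grad_w = 2.0 * (y - t) * x           # step 2: backpropagation gives d(loss)/dw
    w = w - lr * grad_w                  # step 3: update the parameter; repeat until convergence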

Building blocks
• These functionalities are very similar between frameworks
• But the structure, abstraction level, and interface are different
• It comes down to the design of a domain-specific language for NN
  – Array data structure (vector/matrix/tensor)
  – Network (computational graph)
  – Operations & functions
  – Optimizer (SGD/AdaGrad/Adam)

Types of domain-specific language for neural networks

• Text DSL
  – Ex. Caffe (prototxt)
  – Ex. CNTK (NDL)

    %% Definition in text
    f: {
      "A": "Variable",
      "B": "Variable",
      "C": ["B", "*", "A"],
      "ret": ["C", "+", 1]
    }
    # Compile
    f = compile("f.txt")
    d = f(A=np.ones(10),
          B=np.ones(10) * 2)

• Symbolic program
  – Operations on symbols
  – Ex. Theano
  – Ex. TensorFlow
  – Ex. MXNet

    # Symbolic definition
    A = Variable('A')
    B = Variable('B')
    C = B * A
    D = C + Constant(1)
    # Compile
    f = compile(D)
    d = f(A=np.ones(10),
          B=np.ones(10) * 2)

• Imperative program
  – Direct computations on raw data arrays
  – Ex. Torch.nn
  – Ex. Chainer

    # Imperative declaration
    a = np.ones(10)
    b = np.ones(10) * 2
    c = b * a
    d = c + 1

Comparison of DSL types

• Text DSL
  – Pros: human-readable definition; non-programmers can easily edit the network
  – Cons: users must study the format; the format might have to be extended for new algorithms
• Symbolic program
  – Pros: static analysis at compile time; optimization before training; easy to parallelize
  – Cons: users must study a special syntax; may need more effort to implement new algorithms
• Imperative program (internal DSL)
  – Pros: less effort to learn syntax; easy debugging and profiling; suitable for new algorithms with complex logic
  – Cons: hard to optimize in advance; less efficient in memory allocation and parallelization
Chainer is at the extreme end of the imperative-program style for high flexibility; a small illustration of the debugging advantage follows below.
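
As a small illustration of the "easy debugging and profiling" row above: because an imperative network definition is ordinary Python, standard tooling works inside the forward pass. This is only a sketch; l1, l2, and the print/pdb lines are illustrative, using the Chainer links and functions introduced later in the talk.

import numpy as np
import chainer.functions as F
import chainer.links as L
from chainer import Variable

l1 = L.Linear(784, 100)
l2 = L.Linear(100, 10)

def forward(x):
    h = F.relu(l1(x))
    print(h.data.mean())             # inspect an intermediate activation directly
    # import pdb; pdb.set_trace()    # or drop into the standard debugger here
    return l2(h)

y = forward(Variable(np.zeros((1, 784), dtype=np.float32)))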

Agenda
• Deep learning framework basics
• Introduction to Chainer
• CuPy: NumPy-compatible GPU library
• Performance and applications

Chainer as an open-source project
• https://github.com/pfnet/chainer
• 50 contributors
• 1,277 stars & 255 forks
• 3,708 commits
• Original developer: Seiya Tokui
• Active development & releases for the last 10 months
  – v1.0.0 (June 2015) to v1.7.2 (March 2016)

Chainer software stack
• Chainer is built on top of NumPy and CUDA
• CuPy is also introduced as an equivalent of NumPy on GPU (see the snippet below)
[Diagram of the stack: Chainer on top; CuPy and NumPy underneath; cuDNN, BLAS, and CUDA below them; running on the CPU and NVIDIA GPU]
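
A rough sketch of what the NumPy-compatible CuPy layer looks like in use. Assumptions: in the Chainer v1.x line CuPy is bundled under chainer.cuda rather than shipped as a separate package, and a CUDA-capable GPU is available.

import numpy as np
from chainer import cuda

cp = cuda.cupy                        # CuPy module bundled with Chainer v1.x

x_cpu = np.arange(6, dtype=np.float32).reshape(2, 3)
x_gpu = cp.asarray(x_cpu)             # copy the array to GPU memory
y_gpu = cp.exp(x_gpu).sum(axis=1)     # the same calls as NumPy, executed on the GPU
y_cpu = cuda.to_cpu(y_gpu)            # copy the result back to the host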

Graph build scheme (1/2) - Define-and-Run: most frameworks use this scheme (Chainer does not)
• Define: build a computational graph based on the network definition
• Run: update the model (parameters) using the training dataset
[Diagram - Define: the network definition is turned into a computational graph and gradient function via auto differentiation, together with the parameters. Run: training data flows through the computational graph and gradient function to produce the loss & gradients, which update the parameters.]

Graph build scheme (2/2) - Define-by-Run: computational graph construction on the fly
• No graph is constructed before training
• Instead, the graph is built at each forward computation
• The computational graph can be modified dynamically for each iteration/sample or depending on some conditions (see the sketch below)
[Diagram: the model definition, training data, and run-time conditions feed the forward pass, which builds the computational graph and gradient function on the fly (with dynamic changes), and the parameters are updated.]
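
A minimal sketch of what "modified dynamically" means in practice; the link l and the per-call depth n_steps are hypothetical names used only for illustration:

import numpy as np
import chainer.functions as F
import chainer.links as L
from chainer import Variable

l = L.Linear(100, 100)

def forward(x, n_steps):
    # Ordinary Python control flow decides the graph shape at run time,
    # so the depth can differ for every sample, iteration, or condition.
    h = x
    for _ in range(n_steps):
        h = F.relu(l(h))
    return h

x = Variable(np.zeros((1, 100), dtype=np.float32))
y_shallow = forward(x, 1)   # builds a one-layer graph
y_deep = forward(x, 5)      # builds a five-layer graph on the fly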

Define-by-Run example: MLP for MNIST
• Only the transformations between units are set before training

    l1 = Linear(784, n_units)
    l2 = Linear(n_units, 10)

• The connections are given as the forward computation

    def forward(x):
        h1 = ReLU(l1(x))
        return l2(h1)

[Diagram: x → Linear l1 (W, bias) → ReLU → h1 → Linear l2 (W, bias) → y, scoring digits 0-9]

Define-by-Run: an interpreted language for neural networks
• Idea
  – Forward computation actually goes through the computational graph
  – By remembering the history, the actual graph can be obtained
• Advantages
  – Flexibility for new algorithms with complex components
    · Ex. recurrent, recursive, attention, memory, adversarial, etc.
  – Intuitive coding with a highly imperative network definition
    · Ex. stochastic networks whose graph changes for each iteration
• Current drawbacks
  – The graph is generated every time, even for fixed networks
  – No optimization, even for the static parts of the graph
    · JIT-like analysis and subgraph caching might be useful

Basic components (1/2): Variable and Function
• Variable
  – Variable wraps arrays (.data)
  – It remembers its parent Function (.creator)
  – It will be assigned a gradient (.grad)
  – It keeps track of not only data but also computations (see the sketch below)
• Function
  – Transformation between Variables
  – Stateless
  – e.g. sigmoid, tanh, ReLU, max pooling, dropout
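
A small sketch of these attributes in use, assuming the Chainer v1 API described in this talk (the input values are arbitrary):

import numpy as np
import chainer.functions as F
from chainer import Variable

x = Variable(np.array([[0.5, -1.0]], dtype=np.float32))
y = F.tanh(x)                    # applying a Function records the computation history
print(y.data)                    # .data: the underlying array
print(y.creator)                 # .creator: the parent Function that produced y
y.grad = np.ones_like(y.data)    # seed the gradient of the non-scalar output
y.backward()
print(x.grad)                    # .grad: gradient assigned to the input Variable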

Basic components (2/2): Link and Chain
• Link = Function with state, e.g. a Linear link computing y = f(W*x + b)
  – Parameters are also Variables, and gradients will be assigned to them
  – e.g. Linear (fully connected), LSTM, Convolution2D, word embedding
• Chain = network
  – A Chain has a set of child Links
  – The forward computation is defined in __call__()
  – e.g. MLP2, AlexNet, GoogLeNet, RNNLM, seq2seq, ...
[Diagram: a Chain (MLP2) composed of Linear l1 (W, bias) → ReLU → Linear l2 (W, bias)]

Backpropagation through the computational graph
• Consider an objective (using Link.Linear): L = f(x * W + b)
• This computes the value of L in the forward computation, and simultaneously builds the following computational graph
[Diagram: x, W, and b (Variables) feed into *, +, and f (Functions), producing L]
• The gradient of L can be computed with respect to any Variable by backpropagation
• Then the optimizer updates the values of the parameters (see the sketch below)
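
A minimal sketch of this slide in the Chainer v1 API; the shapes and the choice of f (a sigmoid followed by a sum, so that L is a scalar) are assumptions made only so the snippet runs end to end:

import numpy as np
import chainer.functions as F
import chainer.links as L
from chainer import Variable, optimizers

lin = L.Linear(3, 2)                 # the Link holds the parameters W and b
optimizer = optimizers.SGD()
optimizer.setup(lin)

x = Variable(np.ones((1, 3), dtype=np.float32))
loss = F.sum(F.sigmoid(lin(x)))      # forward pass: L = f(x * W + b); the graph is built as a side effect
lin.zerograds()
loss.backward()                      # backpropagation fills .grad for W and b (and for x)
print(lin.W.grad, lin.b.grad)
optimizer.update()                   # the optimizer updates the parameter values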

Code sample (1/4): Multi-layer perceptron

class MLP2(Chain):
    def __init__(self):
        super(MLP2, self).__init__(
            l1=L.Linear(784, 100),
            l2=L.Linear(100, 10),
        )

    def __call__(self, x):
        h1 = F.relu(self.l1(x))
        y = self.l2(h1)
        return y

class Classifier(Chain):
    def __init__(self, predictor):
        super(Classifier, self).__init__(predictor=predictor)

    def __call__(self, x, t):
        y = self.predictor(x)
        self.accuracy = F.accuracy(y, t)
        self.loss = F.softmax_cross_entropy(y, t)
        return self.loss, self.accuracy

# Model and optimizer setup
model = Classifier(MLP2())
optimizer = optimizers.SGD()
optimizer.setup(model)

# Training loop with minibatches
for i in range(0, datasize, batchsize):
    x = Variable(x_tr[i:i+batchsize])
    t = Variable(y_tr[i:i+batchsize])
    model.zerograds()
    loss, acc = model(x, t)
    loss.backward()
    optimizer.update()

Code sample (2/4): Convolutional neural network

class AlexNet(Chain):
    def __init__(self):
        super(AlexNet, self).__init__(
            conv1=L.Convolution2D(3, 96, 11, stride=4),
            conv2=L.Convolution2D(96, 256, 5, pad=2),
            conv3=L.Convolution2D(256, 384, 3, pad=1),
            conv4=L.Convolution2D(384, 384, 3, pad=1),
            conv5=L.Convolution2D(384, 256, 3, pad=1),
            fc6=L.Linear(9216, 4096),
            fc7=L.Linear(4096, 4096),
            fc8=L.Linear(4096, 1000),
        )

    def __call__(self, x, t):
        h = F.max_pooling_2d(F.relu(
            F.local_response_normalization(self.conv1(x))), 3, stride=2)
        h = F.max_pooling_2d(F.relu(
            F.local_response_normalization(self.conv2(h))), 3, stride=2)
        h = F.relu(self.conv3(h))
        h = F.relu(self.conv4(h))
        h = F.max_pooling_2d(F.relu(self.conv5(h)), 3, stride=2)
        h = F.dropout(F.relu(self.fc6(h)), train=self.train)
        h = F.dropout(F.relu(self.fc7(h)), train=self.train)
        y = self.fc8(h)
        return y

* ImageNet Classification with Deep Convolutional Neural Networks
  http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf

Code sample (3/4): Recurrent neural network
def __call__(self, x, t): conv2d h = F.max_pooling_2d(F.relu( F.local_response_normalization(self.conv1(x))), 3, stride=2) linear h = F.max_pooling_2d(F.relu( F.local_response_normalization(self.conv2(h))), 3, stride=2) h = F.relu(self.conv3(h)) linear h = F.relu(self.conv4(h)) h = F.max_pooling_2d(F.relu(self.conv5(h)), 3, stride=2) linear h = F.dropout(F.relu(self.fc6(h)), train=self.train) h = F.dropout(F.relu(self.fc7(h)), train=self.train) y = self.fc8(h) return y * ImageNet Classification with Deep Convolutional Neural Networks http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf 25 Code sample (3/4): Recurrent neural network