@ GTC, April 6th, 2016

A Powerful, Flexible, and Intuitive Framework

Shohei Hido, Chief Research Officer, Preferred Networks, Inc.

Overview

http://chainer.org/

● Chainer is a Python-based deep learning framework
● Chainer v1.0 was released as open source in June 2015
● It DOESN'T rely on Theano, unlike other Python frameworks
● Chainer uses a unique scheme named Define-by-Run

● Why do users still need another framework?
● How different and effective is Chainer?

Preferred Networks (PFN)
A startup that applies deep learning to industrial IoT

● Founded: March 2014
● Headquarters: Tokyo, Japan
● U.S. subsidiary: San Mateo, California
● Company size: 35 engineers & researchers
● Investors: Toyota, FANUC, NTT

[Diagram: deep learning x industrial IoT, targeting manufacturing, automotive, and healthcare]

Partnering with world-leading companies using Chainer

● R&D collaboration on industrial problems with real-world data
  – Specific requirements, modified algorithms, many trials and errors, etc.
  – Different from building a general-purpose recognition system

Partners: Toyota, NTT, FANUC, Panasonic, Cisco, NVIDIA

4 Two types of background behind DL frameworks

1. Scalability-oriented 2. Flexibility-oriented l Use-cases in mind l Use-cases in mind

̶ Image/speech recognion system ̶ Algorithm research

̶ Fast DL as a service in cloud ̶ R&D projects for new products l Problem type l Problem type

̶ A few general applicaons ̶ Various specific applicaons

̶ 10+ million training samples ̶ 10+ k training samples

̶ 10+ nodes cluster w/ fast network ̶ 1 node with mulple GPUs l Possible boleneck l Possible boleneck

̶ Tuning of well-known algorithms ̶ Trial-and-error in prototyping

̶ Distributed computaon for ̶ Debugging, profiling & refactoring model/data-parallel training ̶ (wait me during compilaon) Designed for efficient research & development

l Flexible: new kinds of complex models for various applicaons l Intuive: rapid prototyping and efficient trial-and-error l Powerful: comparable performance for 1 node & mul-GPUs

Scalability-oriented Flexibility-oriented

Agenda

● Deep learning framework basics
● Introduction to Chainer
● CuPy: NumPy-compatible GPU library
● Performance and applications

Neural network and computation

Forward computation

[Figure: inputs x1…xN (image, sensor, text) → hidden units h1…hH, k1…kM → outputs y1…yM, e.g. object = tulip, anomaly score = 0.35, category = sports]

Backward computation (backpropagation)

Chainer focuses on network representation/training

● Design choices for deep learning frameworks
  – How to build neural networks?
  – How to train neural networks?
  – Which text format/language for modeling?
  – Which language for computing?
  – Run with GPU?
  – Run on multiple GPUs?
  – Run on multiple compute nodes?

Building and training neural networks: Computational graph construction is the key

1. Construct a computational graph
   – Based on the network definition given by users
   – Chains of functions and operations on input variables
2. Compute loss and gradients
   – Forward computation to calculate the loss for a minibatch
   – Backpropagation gives gradients to all parameters
3. Optimize the model
   – Update each parameter with its gradient
   – Repeat until convergence

Step 1 is the most important, and there are many approaches to it.

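To make the three steps concrete, here is a minimal sketch in Chainer v1 style (the tiny two-layer model, the random minibatch, and names such as TinyMLP are illustrative assumptions, not part of the slides):

    import numpy as np
    import chainer.functions as F
    import chainer.links as L
    from chainer import Chain, Variable, optimizers

    class TinyMLP(Chain):
        def __init__(self):
            super(TinyMLP, self).__init__(l1=L.Linear(784, 100), l2=L.Linear(100, 10))
        def __call__(self, x):
            return self.l2(F.relu(self.l1(x)))

    model = TinyMLP()
    optimizer = optimizers.SGD()
    optimizer.setup(model)

    x = Variable(np.random.rand(32, 784).astype(np.float32))     # dummy minibatch
    t = Variable(np.random.randint(0, 10, 32).astype(np.int32))  # dummy labels

    y = model(x)                          # Step 1: the forward pass builds the graph
    loss = F.softmax_cross_entropy(y, t)  # Step 2: loss for the minibatch
    model.zerograds()
    loss.backward()                       #         backprop assigns all gradients
    optimizer.update()                    # Step 3: update parameters (repeat until convergence)
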
Building blocks

● These functionalities are very similar between frameworks
● But the structure, abstraction level, and interface are different
● It comes down to the design of a domain-specific language for NNs

● Array data structure (vector/matrix/tensor)
● Network (computational graph)
● Operations & functions
● Optimizer (SGD/AdaGrad/Adam)

Types of domain-specific language for neural networks

● Text DSL
  – Ex. Caffe (prototxt), CNTK (NDL)

    %% Definition in text
    f: {
      "A": "Variable",
      "B": "Variable",
      "C": ["B", "*", "A"],
      "ret": ["C", "+", 1]
    }
    # Compile
    f = compile("f.txt")
    d = f(A=np.ones(10), B=np.ones(10) * 2)

● Symbolic program
  – Operations on symbols
  – Ex. Theano, TensorFlow, MXNet

    # Symbolic definition
    A = Variable('A')
    B = Variable('B')
    C = B * A
    D = C + Constant(1)
    # Compile
    f = compile(D)
    d = f(A=np.ones(10), B=np.ones(10) * 2)

● Imperative program
  – Direct computations on raw data arrays
  – Ex. Torch.nn, Chainer

    # Imperative declaration
    a = np.ones(10)
    b = np.ones(10) * 2
    c = b * a
    d = c + 1

Comparison of DSL types

● Text DSL
  – Pros: human-readable definition; non-programmers can easily edit the network
  – Cons: users must study the format; the format might have to be extended for new algorithms
● Symbolic program
  – Pros: static analysis at compile time; optimization before training; easy to parallelize
  – Cons: users must study special syntax; may need more effort to implement new algorithms
● Imperative program (internal DSL)
  – Pros: less effort to learn syntax; easy debugging and profiling; suitable for new algorithms with complex logic
  – Cons: hard to optimize in advance; less efficient in memory allocation and parallelization

Chainer is at the extreme end of imperative programs, for high flexibility

Agenda

● Deep learning framework basics
● Introduction to Chainer
● CuPy: NumPy-compatible GPU library
● Performance and applications

Chainer as an open-source project

● https://github.com/pfnet/chainer
● 50 contributors
● 1,277 stars & 255 forks
● 3,708 commits
● Original developer: Seiya Tokui
● Active development & releases for the last 10 months

̶ v1.0.0 (June 2015) to v1.7.2 (March 2016)

Chainer software stack

● Chainer is built on top of NumPy and CUDA
● CuPy is also introduced as an equivalent of NumPy on GPU

[Diagram: Chainer sits on CuPy and NumPy; CuPy uses cuDNN and CUDA on NVIDIA GPUs, NumPy uses BLAS on the CPU]

Graph build scheme (1/2) – Define-and-Run: Most frameworks use this scheme (Chainer does not)

● Define: build a computational graph based on the network definition
● Run: update the model (parameters) using the training dataset

[Diagram: Define phase: network definition → computational graph + gradient function (auto differentiation). Run phase: training data flows through the computational graph and gradient function → loss & gradients → parameter update]

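As an illustration of the Define-and-Run style (this snippet is not from the slides; it uses Theano, one of the frameworks named earlier, and mirrors the symbolic example from the DSL slide):

    import numpy as np
    import theano
    import theano.tensor as T

    # Define: build and compile a fixed symbolic graph once
    a = T.dvector('a')
    b = T.dvector('b')
    d = b * a + 1
    f = theano.function([a, b], d)   # compilation happens here, before any data is seen

    # Run: repeatedly feed concrete arrays through the compiled graph
    print(f(np.ones(10), np.ones(10) * 2))
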
Graph build scheme (2/2) – Define-by-Run: Computational graph construction on the fly

● No graph is constructed before training
● Instead, the graph is built during each forward computation
● The computational graph can be modified dynamically for each iteration/sample, or depending on some conditions

[Diagram: Define-by-Run: model definition + training data → computational graph and gradient function built on the fly (dynamic changes driven by conditions) → parameter update]

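A hedged sketch of what dynamic modification means in practice: ordinary Python control flow decides, at each forward call, which functions enter the graph (the DynamicNet class, layer sizes, and flag below are illustrative, not from the slides):

    import numpy as np
    import chainer.functions as F
    import chainer.links as L
    from chainer import Chain, Variable

    class DynamicNet(Chain):
        def __init__(self):
            super(DynamicNet, self).__init__(
                l1=L.Linear(10, 10), l2=L.Linear(10, 10), l3=L.Linear(10, 2))

        def __call__(self, x, use_extra_layer):
            h = F.relu(self.l1(x))
            if use_extra_layer:          # plain Python branching changes the graph
                h = F.relu(self.l2(h))   # this node exists only on some iterations
            return self.l3(h)

    model = DynamicNet()
    x = Variable(np.random.rand(4, 10).astype(np.float32))
    y_small = model(x, use_extra_layer=False)   # graph with two Linear layers
    y_large = model(x, use_extra_layer=True)    # graph with three Linear layers
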
Define-by-Run example: MLP for MNIST

● Only the transformations between units are set up before training

    l1 = Linear(784, n_units)
    l2 = Linear(n_units, 10)

● The connections are given as the forward computation

    def forward(x):
        h1 = ReLU(l1(x))
        return l2(h1)

[Figure: x → Linear l1 (W, bias) → ReLU → h1 → Linear l2 (W, bias) → y (digits 0–9)]

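For completeness, a sketch of how such a forward function would be driven (not on the slide; in real Chainer v1 code Linear and ReLU above would be L.Linear and F.relu, and the loss function, dummy data, and dtypes here are illustrative):

    import numpy as np
    import chainer.functions as F
    from chainer import Variable

    # assumes l1, l2, and forward() are defined as in the snippet above
    x = Variable(np.random.rand(32, 784).astype(np.float32))      # dummy minibatch
    t = Variable(np.random.randint(0, 10, 32).astype(np.int32))   # dummy labels

    y = forward(x)                          # the graph is built by this very call
    loss = F.softmax_cross_entropy(y, t)    # loss node is appended to the same graph
    loss.backward()                         # gradients flow back into l1 and l2
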
Define-by-Run: An interpreted language for neural networks

● Idea
  – Forward computation actually goes through the computational graph
  – By remembering this history, the actual graph can be obtained
● Advantages
  – Flexibility for new algorithms with complex components
    ◆ Ex. recurrent, recursive, attention, memory, adversarial, etc.
  – Intuitive coding with a highly imperative network definition
    ◆ Ex. a stochastic network whose graph changes at each iteration
● Current drawbacks
  – The graph is regenerated every time, even for fixed networks
  – No optimization, even for the static parts of a graph
    ◆ JIT-like analysis and subgraph caching might be useful

Basic components (1/2): Variable and Function

● Variable
  – Variable wraps arrays (.data)
  – It remembers its parent function (.creator)
  – It will be assigned a gradient (.grad)
  – It keeps track of not only data but also computations
● Function
  – A transformation between Variables
  – Stateless
  – e.g. sigmoid, tanh, ReLU, max pooling, dropout

Basic components (2/2): Link and Chain

● Link = Function with state (W, b)
  – Parameters are also Variables and will be assigned gradients
  – e.g. Linear (fully connected, y = f(W*x + b)), LSTM, Convolution2D, word embedding
● Chain = network
  – A Chain has a set of child Links
  – Forward computation is defined in __call__()
  – e.g. MLP2, AlexNet, GoogLeNet, RNNLM, seq2seq, …

[Figure: Chain (MLP2): x → Linear l1 (W, bias) → ReLU → h1 → Linear l2 (W, bias) → y]

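A minimal sketch that exercises these pieces (the array values are arbitrary; the attribute names .data, .creator, .grad and the links follow the Chainer v1 API described above):

    import numpy as np
    import chainer.functions as F
    import chainer.links as L
    from chainer import Variable

    # Variable wraps an array and records the computation that produced it
    x = Variable(np.array([[1.0, -2.0, 3.0]], dtype=np.float32))
    h = F.relu(x)              # Function: a stateless transformation between Variables
    print(h.data)              # the wrapped array
    print(h.creator)           # the function node that created h

    # Link = Function with state: its parameters W and b are themselves Variables
    linear = L.Linear(3, 2)    # y = W*x + b
    y = linear(h)
    y.grad = np.ones((1, 2), dtype=np.float32)   # seed gradient at the output
    y.backward()
    print(linear.W.grad)       # gradient assigned to the parameter Variable
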
Backpropagation through the computational graph

● Consider an objective (Link.Linear): L = f(x * w + b)
● Computing the value of L in the forward computation simultaneously builds the following computational graph:

[Figure: computational graph x, w → * → + (b) → f → L, with Variables and Functions as nodes]

● The gradient of L can be computed with respect to any variable by backpropagation
● Then the optimizer updates the values of the parameters

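Written out as a hedged sketch (not from the slides; here f is taken to be a sum of squares purely so that L is a scalar and backward() needs no seed gradient):

    import numpy as np
    import chainer.functions as F
    from chainer import Variable

    x = Variable(np.array([[1.0, 2.0]], dtype=np.float32))
    w = Variable(np.array([[0.5, -0.5]], dtype=np.float32))   # shape (1, 2): one output unit
    b = Variable(np.array([0.1], dtype=np.float32))

    y = F.linear(x, w, b)     # x * w + b, recording the graph as it runs
    loss = F.sum(y * y)       # the outer f, chosen so the loss is a scalar
    loss.backward()           # backpropagation through the recorded graph
    print(w.grad, b.grad)     # gradients of the loss with respect to the parameters
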
Code sample (1/4): Multi-layer perceptron

    class MLP2(Chain):
        def __init__(self):
            super(MLP2, self).__init__(
                l1=L.Linear(784, 100),
                l2=L.Linear(100, 10),
            )

        def __call__(self, x):
            h1 = F.relu(self.l1(x))
            y = self.l2(h1)
            return y

    class Classifier(Chain):
        def __init__(self, predictor):
            super(Classifier, self).__init__(predictor=predictor)

        def __call__(self, x, t):
            y = self.predictor(x)
            self.accuracy = F.accuracy(y, t)
            self.loss = F.softmax_cross_entropy(y, t)
            return self.loss, self.accuracy

    # Model and optimizer setup
    model = Classifier(MLP2())
    optimizer = optimizers.SGD()
    optimizer.setup(model)

    # Training loop with minibatches
    for i in range(0, datasize, batchsize):
        x = Variable(x_tr[i:i+batchsize])
        t = Variable(y_tr[i:i+batchsize])
        model.zerograds()
        loss, acc = model(x, t)
        loss.backward()
        optimizer.update()

Code sample (2/4): Convolutional neural network

    class AlexNet(Chain):
        def __init__(self):
            super(AlexNet, self).__init__(
                conv1=L.Convolution2D(3, 96, 11, stride=4),
                conv2=L.Convolution2D(96, 256, 5, pad=2),
                conv3=L.Convolution2D(256, 384, 3, pad=1),
                conv4=L.Convolution2D(384, 384, 3, pad=1),
                conv5=L.Convolution2D(384, 256, 3, pad=1),
                fc6=L.Linear(9216, 4096),
                fc7=L.Linear(4096, 4096),
                fc8=L.Linear(4096, 1000),
            )

        def __call__(self, x, t):
            h = F.max_pooling_2d(F.relu(
                F.local_response_normalization(self.conv1(x))), 3, stride=2)
            h = F.max_pooling_2d(F.relu(
                F.local_response_normalization(self.conv2(h))), 3, stride=2)
            h = F.relu(self.conv3(h))
            h = F.relu(self.conv4(h))
            h = F.max_pooling_2d(F.relu(self.conv5(h)), 3, stride=2)
            h = F.dropout(F.relu(self.fc6(h)), train=self.train)
            h = F.dropout(F.relu(self.fc7(h)), train=self.train)
            y = self.fc8(h)
            return y

[Figure: layer stack: conv2d ×5 → linear ×3]
* ImageNet Classification with Deep Convolutional Neural Networks
  http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf

Code sample (3/4): Recurrent neural network

    class SimpleRNN(Chain):
        def __init__(self, n_vocab, n_units):
            super(SimpleRNN, self).__init__(
                embed=L.EmbedID(n_vocab, n_units),
                x2h=L.Linear(n_units, n_units),
                h2h=L.Linear(n_units, n_units),
                h2y=L.Linear(n_units, n_vocab),)
            self.h = None

        def __call__(self, x):
            y, h_new = self.fwd_one_step(x, self.h)
            self.h = h_new
            return y

        def fwd_one_step(self, x, h):
            x = F.tanh(self.embed(x))
            if h is None:
                h = F.tanh(self.x2h(x))
            else:
                h = F.tanh(self.x2h(x) + self.h2h(h))
            y = F.softmax(self.h2y(h))
            return y, h

    # Truncated BPTT (length=3)
    for i in range(0, datasize, batchsize):
        ...
        accum_loss += model(x, t)
        if i % bptt_length == 0:
            model.zerograds()
            accum_loss.backward()
            accum_loss.unchain_backward()
            optimizer.update()

[Figure: unrolled recurrence — input words x_1…x_4 → recurrent state h → outputs y_1…y_4; BPTT length = 3]

Code sample (4/4): Deep Networks with Stochastic Depth
A paper published on arXiv, March 30, 2016

● A variant of Residual Net that skips connections stochastically

  – Outperformed the original Residual Net (ImageNet 2015 winner, MSR)
  – Stochastic skip: with survival probability p[i], each residual block f[i] is kept; otherwise only the identity connection is used (see the mock code below)

    # Mock code in Chainer
    class StochasticResNet(Chain):
        def __init__(self, prob, size, …):
            super(StochasticResNet, self).__init__(
                ## Define f[i] the same as for Residual Net
            )
            self.p = prob  # survival probabilities

        def __call__(self, h):
            for i in range(self.size):
                b = numpy.random.binomial(1, self.p[i])
                c = self.f[i](h) + h if b == 1 else h
                h = F.relu(c)
            return h

Taken from http://arxiv.org/abs/1603.09382v2, G. Huang et al.

Miscellaneous

● Other features

  – Installation with pip in one line: $ pip install chainer
  – Multi-GPU support by explicitly selecting the device ID to use
  – Pre-trained Caffe model import from Model Zoo
  – Model serialization, save & load: HDF5 or NumPy npz (a sketch follows after this list)

● Future directions (not only for Chainer)
  – JIT-like optimization during Define-by-Run
  – Reducing memory consumption (GPU memory is still small)
  – Handling variable-length inputs without minibatching
  – Maximizing performance on multi-node & multi-GPU environments

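As referenced above, a minimal serialization sketch (the file name and the tiny MLP2 definition are illustrative; chainer.serializers also provides save_hdf5/load_hdf5 for the HDF5 variant):

    import chainer.links as L
    from chainer import Chain, serializers

    class MLP2(Chain):
        def __init__(self):
            super(MLP2, self).__init__(l1=L.Linear(784, 100), l2=L.Linear(100, 10))

    model = MLP2()
    serializers.save_npz('mlp2.npz', model)   # save parameters as a NumPy npz file
    serializers.load_npz('mlp2.npz', model)   # restore them into an identically shaped Chain
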
Agenda

● Deep learning framework basics
● Introduction to Chainer
● CuPy: NumPy-compatible GPU library
● Performance and applications

CuPy: a (partially) NumPy-compatible GPU library

● Motivation: NumPy + CUDA = CuPy
  – NumPy is the standard library in Python for numerical computation
  – CUDA is the standard API for using GPUs for high performance
  – Unfortunately, NumPy does NOT work with CUDA
● CuPy supports:
  – Fast computation using NVIDIA's cuBLAS and cuDNN
  – Array indexing, slicing, transpose, and reshape
  – Most operations/functions in NumPy
    ◆ Chainer v1.7.2 already supports more than 170 functions
  – User-defined functions and kernels

  – All dtypes, broadcasting, memory pool, etc.

How to use CuPy

● Usage of CuPy: just replace NumPy with CuPy

    import numpy, cupy
    enable_cupy = True
    xp = cupy if enable_cupy else numpy

● Conversion between numpy.ndarray and cupy.ndarray

    w_c = cupy.asarray(numpy.ones(10))  # cupy.ndarray
    w_n = cupy.asnumpy(cupy.ones(10))   # numpy.ndarray

● Ex. a CPU/GPU-agnostic logsumexp function

    def logsumexp(x, axis=None):
        xp = cuda.get_array_module(x)  # get CuPy or NumPy
        x_max = x.max(axis)
        exp_sum = xp.exp(x - x_max).sum(axis)
        return x_max + xp.log(exp_sum)

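A short usage sketch of the CPU/GPU-agnostic function above (not on the slide; it assumes a CUDA-capable GPU for the second call and uses chainer.cuda.to_gpu to move the array):

    import numpy
    from chainer import cuda

    x_cpu = numpy.random.rand(1000).astype(numpy.float32)
    print(logsumexp(x_cpu))        # runs on the CPU with NumPy

    x_gpu = cuda.to_gpu(x_cpu)     # copy the array to the GPU (cupy.ndarray)
    print(logsumexp(x_gpu))        # the same code runs on the GPU with CuPy
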
CuPy implementation: Optimized for performance & NumPy compatibility

● Uses Cython for cupy.core & cupy.cuda
● Dynamic code generation & compilation
  – CUDA code is generated for the specific tensor dimension & data type
  – On-the-fly compilation by nvcc, with a binary cache (faster after the 1st use)

[Diagram: cupy (tensor operations & functions) → cupy.core (ndarray; ufunc, elementwise, reduction) → cupy.cuda (CUDA Python wrapper) → CUDA libraries (cuBLAS, cuRAND, cuDNN)]

CuPy performance on linear algebra: 5 to 25 times faster than NumPy

    def test(xp):
        a = xp.arange(1000000).reshape(1000, -1)
        return a.T * 2

    test(numpy)
    t1 = datetime.datetime.now()
    for i in range(1000):
        test(numpy)
    t2 = datetime.datetime.now()
    print(t2 - t1)

    test(cupy)
    t1 = datetime.datetime.now()
    for i in range(1000):
        test(cupy)
    t2 = datetime.datetime.now()
    print(t2 - t1)

Results:
                          msec    speed-up
    NumPy                 2,929   1.0
    CuPy                    585   5.0
    CuPy + Memory Pool      123   23.8

(Core i7-4790 @ 3.60 GHz, 32 GB RAM, GeForce GTX 970)

Use CuPy for GPU-based computation

● Three patterns are supported as wrappers
  – ElementwiseKernel: for element-wise computation
  – ReductionKernel: for reduce operations along an axis
  – ufunc: universal functions as in NumPy
● Ex. definition of an element-wise function

    squared_diff = cupy.ElementwiseKernel(
        'float32 x, float32 y',   # Input
        'float32 z',              # Output
        'z = (x - y) * (x - y)',  # Operation
        'squared_diff')           # Name

● Usage (automatic broadcasting and type checking are supported)

    squared_diff(cupy.arange(10), 10)

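For the second pattern, a hedged ReductionKernel sketch (not on the slide; the row-wise L2 norm below follows the ReductionKernel interface as documented for CuPy):

    import cupy

    l2norm = cupy.ReductionKernel(
        'T x',          # input params
        'T y',          # output params
        'x * x',        # map: applied to each element
        'a + b',        # reduce: how mapped values are combined
        'y = sqrt(a)',  # post-reduction map
        '0',            # identity of the reduction
        'l2norm')       # kernel name

    x = cupy.arange(10, dtype=cupy.float32).reshape(2, 5)
    print(l2norm(x, axis=1))   # one value per row
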
Agenda

● Deep learning framework basics
● Introduction to Chainer
● CuPy: NumPy-compatible GPU library
● Performance and applications

Public benchmark results (CNN): Chainer shows comparable performance

● Forward computation time is almost the same as TensorFlow's
● Training with backward computation is slower, but this can be offset by the absence of compilation time while debugging/tuning

[Charts: forward and backward computation time (msec) of Torch, TensorFlow, Chainer, and Caffe (native) on AlexNet, GoogLeNet, VGG-A, and OverFeat]
Taken from https://github.com/soumith/convnet-benchmarks, using cuDNN except for Caffe

Chainer can benefit from the latest CUDA libraries: Ex. the Winograd algorithm in cuDNN v5

● Conv3x3 is common in CNNs & is now computed with Winograd
● State-of-the-art CNN models (e.g., GoogLeNet, VGG-A) can be accelerated up to 2.0x at test time (forward only)

[Charts: forward and backward computation time (msec) with cuDNN v4 vs. cuDNN v5 on AlexNet, GoogLeNet, VGG-A, and OverFeat]
Independently measured with a modified version of soumith/convnet-benchmarks. cuDNN v5 can be used in Chainer v1.8.0.

Algorithm implementation in Chainer: A Neural Algorithm of Artistic Style (Gatys et al., 2015)

● https://github.com/mattya/chainer-gogh

[Figure: content image (cat) + style image = new artistic image]

Main code: 45 lines

Chainer in industry: Used in demonstrations & being commercialized

● Many collaborations are ongoing w/ Chainer-based computer vision, deep reinforcement learning, etc.
● Ex. 1: Chainer-controlled toy cars in the Toyota booth at CES 2016
● Ex. 2: FANUC's highly accurate bin-picking robot at IREX 2015
  – 8 hours of training to reach expert level; commercialization by the end of 2016

http://tinyurl.com/pfn-ces16
http://tinyurl.com/pfn-irex15

Summary

● Chainer is a Python-based deep learning framework with a dynamic network construction scheme and CuPy
● It is designed for efficient research and prototyping while keeping comparable performance, thanks to NVIDIA GPUs
● Official web: http://chainer.org/
● GitHub: https://github.com/pfnet/chainer

Your contributions will be appreciated & we are hiring!
