@ GTC, April 6th, 2016

A Powerful, Flexible, and Intuitive Framework

Shohei Hido, Chief Research Officer, Preferred Networks, Inc.

Overview

http://chainer.org/

● Chainer is a Python-based deep learning framework
● Chainer v1.0 was released as open source in June 2015
● It DOESN'T rely on Theano, unlike other Python frameworks
● Chainer uses a unique scheme named Define-by-Run

● Why do users still need another framework?
● How different and effective is Chainer?

Preferred Networks (PFN)
A startup that applies deep learning to industrial IoT

● Founded: March 2014
● Headquarters: Tokyo, Japan
● U.S. subsidiary: San Mateo, California
● Company size: 35 engineers & researchers
● Investors: Toyota, FANUC, NTT

[Diagram: deep learning x industrial IoT, targeting manufacturing, automotive, and healthcare]

Partnering with world-leading companies using Chainer

● R&D collaboration on industrial problems with real-world data
  – Specific requirements, modified algorithms, many trials and errors, etc.
  – Different from building a general-purpose recognition system

Partners: Toyota, NTT, FANUC, Panasonic, Cisco, NVIDIA

4 Two types of background behind DL frameworks

1. Scalability-oriented 2. Flexibility-oriented l Use-cases in mind l Use-cases in mind

̶ Image/speech recognion system ̶ Algorithm research

̶ Fast DL as a service in cloud ̶ R&D projects for new products l Problem type l Problem type

̶ A few general applicaons ̶ Various specific applicaons

̶ 10+ million training samples ̶ 10+ k training samples

̶ 10+ nodes cluster w/ fast network ̶ 1 node with mulple GPUs l Possible boleneck l Possible boleneck

̶ Tuning of well-known algorithms ̶ Trial-and-error in prototyping

̶ Distributed computaon for ̶ Debugging, profiling & refactoring model/data-parallel training ̶ (wait me during compilaon) Designed for efficient research & development

l Flexible: new kinds of complex models for various applicaons l Intuive: rapid prototyping and efficient trial-and-error l Powerful: comparable performance for 1 node & mul-GPUs

Scalability-oriented Flexibility-oriented

Agenda

● Deep learning framework basics
● Introduction to Chainer
● CuPy: NumPy-compatible GPU library
● Performance and applications

Neural network and computation

Forward computation

[Figure: inputs x1…xN (image, sensor, text) → hidden units h1…hH, k1…kM → outputs y1…yM, e.g. object = tulip, anomaly score = 0.35, category = sports]

Backward computation (backpropagation)

Chainer focuses on network representation/training

● Design choices for deep learning frameworks
  – How to build neural networks?
  – How to train neural networks?
  – Which text format/language for modeling?
  – Which language for computing?
  – Run with GPU?
  – Run on multiple GPUs?
  – Run on multiple compute nodes?

Building and training neural networks: Computational graph construction is the key

1. Construct a computational graph
   – Based on the network definition given by users
   – Chains of functions and operations on input variables
2. Compute loss and gradients
   – Forward computation to calculate the loss for a minibatch
   – Backpropagation gives gradients to all parameters
3. Optimize the model
   – Update each parameter with its gradient
   – Repeat until convergence

Step 1 is the most important, and there are many approaches to it.

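To make the three steps concrete, here is a minimal sketch in Chainer v1 style (the tiny two-layer model, the random minibatch, and names such as TinyMLP are illustrative assumptions, not part of the slides):

    import numpy as np
    import chainer.functions as F
    import chainer.links as L
    from chainer import Chain, Variable, optimizers

    class TinyMLP(Chain):
        def __init__(self):
            super(TinyMLP, self).__init__(l1=L.Linear(784, 100), l2=L.Linear(100, 10))
        def __call__(self, x):
            return self.l2(F.relu(self.l1(x)))

    model = TinyMLP()
    optimizer = optimizers.SGD()
    optimizer.setup(model)

    x = Variable(np.random.rand(32, 784).astype(np.float32))     # dummy minibatch
    t = Variable(np.random.randint(0, 10, 32).astype(np.int32))  # dummy labels

    y = model(x)                          # Step 1: the forward pass builds the graph
    loss = F.softmax_cross_entropy(y, t)  # Step 2: loss for the minibatch
    model.zerograds()
    loss.backward()                       #         backprop assigns all gradients
    optimizer.update()                    # Step 3: update parameters (repeat until convergence)
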
Building blocks

● These functionalities are very similar between frameworks
● But the structure, abstraction level, and interface are different
● It comes down to the design of a domain-specific language for NNs

● Array data structure (vector/matrix/tensor)
● Network (computational graph)
● Operations & functions
● Optimizer (SGD/AdaGrad/Adam)

Types of domain-specific language for neural networks

● Text DSL
  – Ex. Caffe (prototxt), CNTK (NDL)

    %% Definition in text
    f: {
      "A": "Variable",
      "B": "Variable",
      "C": ["B", "*", "A"],
      "ret": ["C", "+", 1]
    }
    # Compile
    f = compile("f.txt")
    d = f(A=np.ones(10), B=np.ones(10) * 2)

● Symbolic program
  – Operations on symbols
  – Ex. Theano, TensorFlow, MXNet

    # Symbolic definition
    A = Variable('A')
    B = Variable('B')
    C = B * A
    D = C + Constant(1)
    # Compile
    f = compile(D)
    d = f(A=np.ones(10), B=np.ones(10) * 2)

● Imperative program
  – Direct computations on raw data arrays
  – Ex. Torch.nn, Chainer

    # Imperative declaration
    a = np.ones(10)
    b = np.ones(10) * 2
    c = b * a
    d = c + 1

Comparison of DSL types

● Text DSL
  – Pros: human-readable definition; non-programmers can easily edit the network
  – Cons: users must study the format; the format might have to be extended for new algorithms
● Symbolic program
  – Pros: static analysis at compile time; optimization before training; easy to parallelize
  – Cons: users must study special syntax; may need more effort to implement new algorithms
● Imperative program (internal DSL)
  – Pros: less effort to learn syntax; easy debugging and profiling; suitable for new algorithms with complex logic
  – Cons: hard to optimize in advance; less efficient in memory allocation and parallelization

Chainer is at the extreme end of imperative programs, for high flexibility

Agenda

● Deep learning framework basics
● Introduction to Chainer
● CuPy: NumPy-compatible GPU library
● Performance and applications

Chainer as an open-source project

● https://github.com/pfnet/chainer
● 50 contributors
● 1,277 stars & 255 forks
● 3,708 commits
● Original developer: Seiya Tokui
● Active development & releases for the last 10 months

̶ v1.0.0 (June 2015) to v1.7.2 (March 2016)

Chainer software stack

● Chainer is built on top of NumPy and CUDA
● CuPy is also introduced as an equivalent of NumPy on GPU

[Diagram: Chainer sits on CuPy and NumPy; CuPy uses cuDNN and CUDA on NVIDIA GPUs, NumPy uses BLAS on the CPU]

Graph build scheme (1/2) – Define-and-Run: Most frameworks use this scheme (Chainer does not)

● Define: build a computational graph based on the network definition
● Run: update the model (parameters) using the training dataset

[Diagram: Define phase: network definition → computational graph + gradient function (auto differentiation). Run phase: training data flows through the computational graph and gradient function → loss & gradients → parameter update]

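As an illustration of the Define-and-Run style (this snippet is not from the slides; it uses Theano, one of the frameworks named earlier, and mirrors the symbolic example from the DSL slide):

    import numpy as np
    import theano
    import theano.tensor as T

    # Define: build and compile a fixed symbolic graph once
    a = T.dvector('a')
    b = T.dvector('b')
    d = b * a + 1
    f = theano.function([a, b], d)   # compilation happens here, before any data is seen

    # Run: repeatedly feed concrete arrays through the compiled graph
    print(f(np.ones(10), np.ones(10) * 2))
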
Graph build scheme (2/2) – Define-by-Run: Computational graph construction on the fly

● No graph is constructed before training
● Instead, the graph is built during each forward computation
● The computational graph can be modified dynamically for each iteration/sample, or depending on some conditions

[Diagram: Define-by-Run: model definition + training data → computational graph and gradient function built on the fly (dynamic changes driven by conditions) → parameter update]

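A hedged sketch of what dynamic modification means in practice: ordinary Python control flow decides, at each forward call, which functions enter the graph (the DynamicNet class, layer sizes, and flag below are illustrative, not from the slides):

    import numpy as np
    import chainer.functions as F
    import chainer.links as L
    from chainer import Chain, Variable

    class DynamicNet(Chain):
        def __init__(self):
            super(DynamicNet, self).__init__(
                l1=L.Linear(10, 10), l2=L.Linear(10, 10), l3=L.Linear(10, 2))

        def __call__(self, x, use_extra_layer):
            h = F.relu(self.l1(x))
            if use_extra_layer:          # plain Python branching changes the graph
                h = F.relu(self.l2(h))   # this node exists only on some iterations
            return self.l3(h)

    model = DynamicNet()
    x = Variable(np.random.rand(4, 10).astype(np.float32))
    y_small = model(x, use_extra_layer=False)   # graph with two Linear layers
    y_large = model(x, use_extra_layer=True)    # graph with three Linear layers
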
Define-by-Run example: MLP for MNIST

● Only the transformations between units are set up before training

    l1 = Linear(784, n_units)
    l2 = Linear(n_units, 10)

● The connections are given as the forward computation

    def forward(x):
        h1 = ReLU(l1(x))
        return l2(h1)

[Figure: x → Linear l1 (W, bias) → ReLU → h1 → Linear l2 (W, bias) → y (digits 0–9)]

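For completeness, a sketch of how such a forward function would be driven (not on the slide; in real Chainer v1 code Linear and ReLU above would be L.Linear and F.relu, and the loss function, dummy data, and dtypes here are illustrative):

    import numpy as np
    import chainer.functions as F
    from chainer import Variable

    # assumes l1, l2, and forward() are defined as in the snippet above
    x = Variable(np.random.rand(32, 784).astype(np.float32))      # dummy minibatch
    t = Variable(np.random.randint(0, 10, 32).astype(np.int32))   # dummy labels

    y = forward(x)                          # the graph is built by this very call
    loss = F.softmax_cross_entropy(y, t)    # loss node is appended to the same graph
    loss.backward()                         # gradients flow back into l1 and l2
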
Define-by-Run: An interpreted language for neural networks

● Idea
  – Forward computation actually goes through the computational graph
  – By remembering this history, the actual graph can be obtained
● Advantages
  – Flexibility for new algorithms with complex components
    ◆ Ex. recurrent, recursive, attention, memory, adversarial, etc.
  – Intuitive coding with a highly imperative network definition
    ◆ Ex. a stochastic network whose graph changes at each iteration
● Current drawbacks
  – The graph is regenerated every time, even for fixed networks
  – No optimization, even for the static parts of a graph
    ◆ JIT-like analysis and subgraph caching might be useful

Basic components (1/2): Variable and Function

● Variable
  – Variable wraps arrays (.data)
  – It remembers its parent function (.creator)
  – It will be assigned a gradient (.grad)
  – It keeps track of not only data but also computations
● Function
  – A transformation between Variables
  – Stateless
  – e.g. sigmoid, tanh, ReLU, max pooling, dropout

Basic components (2/2): Link and Chain

● Link = Function with state (W, b)
  – Parameters are also Variables and will be assigned gradients
  – e.g. Linear (fully connected, y = f(W*x + b)), LSTM, Convolution2D, word embedding
● Chain = network
  – A Chain has a set of child Links
  – Forward computation is defined in __call__()
  – e.g. MLP2, AlexNet, GoogLeNet, RNNLM, seq2seq, …

[Figure: Chain (MLP2): x → Linear l1 (W, bias) → ReLU → h1 → Linear l2 (W, bias) → y]

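A minimal sketch that exercises these pieces (the array values are arbitrary; the attribute names .data, .creator, .grad and the links follow the Chainer v1 API described above):

    import numpy as np
    import chainer.functions as F
    import chainer.links as L
    from chainer import Variable

    # Variable wraps an array and records the computation that produced it
    x = Variable(np.array([[1.0, -2.0, 3.0]], dtype=np.float32))
    h = F.relu(x)              # Function: a stateless transformation between Variables
    print(h.data)              # the wrapped array
    print(h.creator)           # the function node that created h

    # Link = Function with state: its parameters W and b are themselves Variables
    linear = L.Linear(3, 2)    # y = W*x + b
    y = linear(h)
    y.grad = np.ones((1, 2), dtype=np.float32)   # seed gradient at the output
    y.backward()
    print(linear.W.grad)       # gradient assigned to the parameter Variable
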
Backpropagation through the computational graph

● Consider an objective (Link.Linear): L = f(x * w + b)
● Computing the value of L in the forward computation simultaneously builds the following computational graph:

[Figure: computational graph x, w → * → + (b) → f → L, with Variables and Functions as nodes]

● The gradient of L can be computed with respect to any variable by backpropagation
● Then the optimizer updates the values of the parameters

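Written out as a hedged sketch (not from the slides; here f is taken to be a sum of squares purely so that L is a scalar and backward() needs no seed gradient):

    import numpy as np
    import chainer.functions as F
    from chainer import Variable

    x = Variable(np.array([[1.0, 2.0]], dtype=np.float32))
    w = Variable(np.array([[0.5, -0.5]], dtype=np.float32))   # shape (1, 2): one output unit
    b = Variable(np.array([0.1], dtype=np.float32))

    y = F.linear(x, w, b)     # x * w + b, recording the graph as it runs
    loss = F.sum(y * y)       # the outer f, chosen so the loss is a scalar
    loss.backward()           # backpropagation through the recorded graph
    print(w.grad, b.grad)     # gradients of the loss with respect to the parameters
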
Code sample (1/4): Multi-layer perceptron

    class MLP2(Chain):
        def __init__(self):
            super(MLP2, self).__init__(
                l1=L.Linear(784, 100),
                l2=L.Linear(100, 10),
            )

        def __call__(self, x):
            h1 = F.relu(self.l1(x))
            y = self.l2(h1)
            return y

    class Classifier(Chain):
        def __init__(self, predictor):
            super(Classifier, self).__init__(predictor=predictor)

        def __call__(self, x, t):
            y = self.predictor(x)
            self.accuracy = F.accuracy(y, t)
            self.loss = F.softmax_cross_entropy(y, t)
            return self.loss, self.accuracy

    # Model and optimizer setup
    model = Classifier(MLP2())
    optimizer = optimizers.SGD()
    optimizer.setup(model)

    # Training loop with minibatches
    for i in range(0, datasize, batchsize):
        x = Variable(x_tr[i:i+batchsize])
        t = Variable(y_tr[i:i+batchsize])
        model.zerograds()
        loss, acc = model(x, t)
        loss.backward()
        optimizer.update()

Code sample (2/4): Convolutional neural network

    class AlexNet(Chain):
        def __init__(self):
            super(AlexNet, self).__init__(
                conv1=L.Convolution2D(3, 96, 11, stride=4),
                conv2=L.Convolution2D(96, 256, 5, pad=2),
                conv3=L.Convolution2D(256, 384, 3, pad=1),
                conv4=L.Convolution2D(384, 384, 3, pad=1),
                conv5=L.Convolution2D(384, 256, 3, pad=1),
                fc6=L.Linear(9216, 4096),
                fc7=L.Linear(4096, 4096),
                fc8=L.Linear(4096, 1000),
            )

        def __call__(self, x, t):
            h = F.max_pooling_2d(F.relu(
                F.local_response_normalization(self.conv1(x))), 3, stride=2)
            h = F.max_pooling_2d(F.relu(
                F.local_response_normalization(self.conv2(h))), 3, stride=2)
            h = F.relu(self.conv3(h))
            h = F.relu(self.conv4(h))
            h = F.max_pooling_2d(F.relu(self.conv5(h)), 3, stride=2)
            h = F.dropout(F.relu(self.fc6(h)), train=self.train)
            h = F.dropout(F.relu(self.fc7(h)), train=self.train)
            y = self.fc8(h)
            return y

[Figure: layer stack: conv2d ×5 → linear ×3]
* ImageNet Classification with Deep Convolutional Neural Networks
  http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf

Code sample (3/4): Recurrent neural network

    class SimpleRNN(Chain):
        def __init__(self, n_vocab, n_units):
            super(SimpleRNN, self).__init__(
                embed=L.EmbedID(n_vocab, n_units),
                x2h=L.Linear(n_units, n_units),
                h2h=L.Linear(n_units, n_units),
                h2y=L.Linear(n_units, n_vocab),)
            self.h = None

        def __call__(self, x):
            y, h_new = self.fwd_one_step(x, self.h)
            self.h = h_new
            return y

        def fwd_one_step(self, x, h):
            x = F.tanh(self.embed(x))
            if h is None:
                h = F.tanh(self.x2h(x))
            else:
                h = F.tanh(self.x2h(x) + self.h2h(h))
            y = F.softmax(self.h2y(h))
            return y, h

    # Truncated BPTT (length=3)
    for i in range(0, datasize, batchsize):
        ...
        accum_loss += model(x, t)
        if i % bptt_length == 0:
            model.zerograds()
            accum_loss.backward()
            accum_loss.unchain_backward()
            optimizer.update()

[Figure: unrolled recurrence — input words x_1…x_4 → recurrent state h → outputs y_1…y_4; BPTT length = 3]

Code sample (4/4): Deep Networks with Stochastic Depth
A paper published on arXiv, March 30, 2016

● A variant of Residual Net that skips connections stochastically

  – Outperformed the original Residual Net (ImageNet 2015 winner, MSR)
  – Stochastic skip: with survival probability p[i], each residual block f[i] is kept; otherwise only the identity connection is used (see the mock code below)

    # Mock code in Chainer
    class StochasticResNet(Chain):
        def __init__(self, prob, size, …):
            super(StochasticResNet, self).__init__(
                ## Define f[i] the same as for Residual Net
            )
            self.p = prob  # survival probabilities

        def __call__(self, h):
            for i in range(self.size):
                b = numpy.random.binomial(1, self.p[i])
                c = self.f[i](h) + h if b == 1 else h
                h = F.relu(c)
            return h

Taken from http://arxiv.org/abs/1603.09382v2, G. Huang et al.

Miscellaneous

● Other features

  – Installation with pip in one line: $ pip install chainer
  – Multi-GPU support by explicitly selecting the device ID to use
  – Pre-trained Caffe model import from Model Zoo
  – Model serialization, save & load: HDF5 or NumPy npz (a sketch follows after this list)

● Future directions (not only for Chainer)
  – JIT-like optimization during Define-by-Run
  – Reducing memory consumption (GPU memory is still small)
  – Handling variable-length inputs without minibatching
  – Maximizing performance on multi-node & multi-GPU environments

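As referenced above, a minimal serialization sketch (the file name and the tiny MLP2 definition are illustrative; chainer.serializers also provides save_hdf5/load_hdf5 for the HDF5 variant):

    import chainer.links as L
    from chainer import Chain, serializers

    class MLP2(Chain):
        def __init__(self):
            super(MLP2, self).__init__(l1=L.Linear(784, 100), l2=L.Linear(100, 10))

    model = MLP2()
    serializers.save_npz('mlp2.npz', model)   # save parameters as a NumPy npz file
    serializers.load_npz('mlp2.npz', model)   # restore them into an identically shaped Chain
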
Agenda

● Deep learning framework basics
● Introduction to Chainer
● CuPy: NumPy-compatible GPU library
● Performance and applications

CuPy: a (partially) NumPy-compatible GPU library

● Motivation: NumPy + CUDA = CuPy
  – NumPy is the standard library in Python for numerical computation
  – CUDA is the standard API for using GPUs for high performance
  – Unfortunately, NumPy does NOT work with CUDA
● CuPy supports:
  – Fast computation using NVIDIA's cuBLAS and cuDNN
  – Array indexing, slicing, transpose, and reshape
  – Most operations/functions in NumPy
    ◆ Chainer v1.7.2 already supports more than 170 functions
  – User-defined functions and kernels

  – All dtypes, broadcasting, memory pool, etc.

How to use CuPy

● Usage of CuPy: just replace NumPy with CuPy

    import numpy, cupy
    enable_cupy = True
    xp = cupy if enable_cupy else numpy

● Conversion between numpy.ndarray and cupy.ndarray

    w_c = cupy.asarray(numpy.ones(10))  # cupy.ndarray
    w_n = cupy.asnumpy(cupy.ones(10))   # numpy.ndarray

● Ex. a CPU/GPU-agnostic logsumexp function

    def logsumexp(x, axis=None):
        xp = cuda.get_array_module(x)  # get CuPy or NumPy
        x_max = x.max(axis)
        exp_sum = xp.exp(x - x_max).sum(axis)
        return x_max + xp.log(exp_sum)

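A short usage sketch of the CPU/GPU-agnostic function above (not on the slide; it assumes a CUDA-capable GPU for the second call and uses chainer.cuda.to_gpu to move the array):

    import numpy
    from chainer import cuda

    x_cpu = numpy.random.rand(1000).astype(numpy.float32)
    print(logsumexp(x_cpu))        # runs on the CPU with NumPy

    x_gpu = cuda.to_gpu(x_cpu)     # copy the array to the GPU (cupy.ndarray)
    print(logsumexp(x_gpu))        # the same code runs on the GPU with CuPy
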
CuPy implementation: Optimized for performance & NumPy compatibility

● Uses Cython for cupy.core & cupy.cuda
● Dynamic code generation & compilation
  – CUDA code is generated for the specific tensor dimension & data type
  – On-the-fly compilation by nvcc, with a binary cache (faster after the 1st use)

[Diagram: cupy (tensor operations & functions) → cupy.core (ndarray; ufunc, elementwise, reduction) → cupy.cuda (CUDA Python wrapper) → CUDA libraries (cuBLAS, cuRAND, cuDNN)]

CuPy performance on linear algebra: 5 to 25 times faster than NumPy

    def test(xp):
        a = xp.arange(1000000).reshape(1000, -1)
        return a.T * 2

    test(numpy)
    t1 = datetime.datetime.now()
    for i in range(1000):
        test(numpy)
    t2 = datetime.datetime.now()
    print(t2 - t1)

    test(cupy)
    t1 = datetime.datetime.now()
    for i in range(1000):
        test(cupy)
    t2 = datetime.datetime.now()
    print(t2 - t1)

Results:
                          msec    speed-up
    NumPy                 2,929   1.0
    CuPy                    585   5.0
    CuPy + Memory Pool      123   23.8

(Core i7-4790 @ 3.60 GHz, 32 GB RAM, GeForce GTX 970)

Use CuPy for GPU-based computation

● Three patterns are supported as wrappers
  – ElementwiseKernel: for element-wise computation
  – ReductionKernel: for reduce operations along an axis
  – ufunc: universal functions as in NumPy
● Ex. definition of an element-wise function

    squared_diff = cupy.ElementwiseKernel(
        'float32 x, float32 y',   # Input
        'float32 z',              # Output
        'z = (x - y) * (x - y)',  # Operation
        'squared_diff')           # Name

● Usage (automatic broadcasting and type checking are supported)

    squared_diff(cupy.arange(10), 10)

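For the second pattern, a hedged ReductionKernel sketch (not on the slide; the row-wise L2 norm below follows the ReductionKernel interface as documented for CuPy):

    import cupy

    l2norm = cupy.ReductionKernel(
        'T x',          # input params
        'T y',          # output params
        'x * x',        # map: applied to each element
        'a + b',        # reduce: how mapped values are combined
        'y = sqrt(a)',  # post-reduction map
        '0',            # identity of the reduction
        'l2norm')       # kernel name

    x = cupy.arange(10, dtype=cupy.float32).reshape(2, 5)
    print(l2norm(x, axis=1))   # one value per row
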
Agenda

● Deep learning framework basics
● Introduction to Chainer
● CuPy: NumPy-compatible GPU library
● Performance and applications

Public benchmark results (CNN): Chainer shows comparable performance

● Forward computation time is almost the same as TensorFlow's
● Training with backward computation is slower, but this can be offset by the absence of compilation time while debugging/tuning

[Charts: forward and backward computation time (msec) of Torch, TensorFlow, Chainer, and Caffe (native) on AlexNet, GoogLeNet, VGG-A, and OverFeat]
Taken from https://github.com/soumith/convnet-benchmarks, using cuDNN except for Caffe

Chainer can benefit from the latest CUDA libraries: Ex. the Winograd algorithm in cuDNN v5

● Conv3x3 is common in CNNs & is now computed with Winograd
● State-of-the-art CNN models (e.g., GoogLeNet, VGG-A) can be accelerated up to 2.0x at test time (forward only)

[Charts: forward and backward computation time (msec) with cuDNN v4 vs. cuDNN v5 on AlexNet, GoogLeNet, VGG-A, and OverFeat]
Independently measured with a modified version of soumith/convnet-benchmarks. cuDNN v5 can be used in Chainer v1.8.0.

Algorithm implementation in Chainer: A Neural Algorithm of Artistic Style (Gatys et al., 2015)

● https://github.com/mattya/chainer-gogh

[Figure: content image (cat) + style image = new artistic image]

Main code: 45 lines

Chainer in industry: Used in demonstrations & being commercialized

● Many collaborations are ongoing w/ Chainer-based computer vision, deep reinforcement learning, etc.
● Ex. 1: Chainer-controlled toy cars in the Toyota booth at CES 2016
● Ex. 2: FANUC's highly accurate bin-picking robot at IREX 2015
  – 8 hours of training to reach expert level; commercialization by the end of 2016

http://tinyurl.com/pfn-ces16
http://tinyurl.com/pfn-irex15

Summary

● Chainer is a Python-based deep learning framework with a dynamic network construction scheme and CuPy
● It is designed for efficient research and prototyping while keeping comparable performance, thanks to NVIDIA GPUs
● Official web: http://chainer.org/
● GitHub: https://github.com/pfnet/chainer

Your contributions will be appreciated & we are hiring!
