
NVIDIA GTC, April 6th, 2016

A Powerful, Flexible, and Intuitive Deep Learning Framework
Shohei Hido, Chief Research Officer, Preferred Networks, Inc.

Overview (http://chainer.org/)
- Chainer is a Python-based deep learning framework.
- Chainer v1.0 was released as open source in June 2015.
- It does NOT rely on Theano, unlike other Python frameworks.
- Chainer uses a unique scheme named Define-by-Run.
- Why do users still need another framework?
- How is Chainer different, and how effective is it?

Preferred Networks (PFN): a startup that applies deep learning to industrial IoT
- Founded: March 2014
- Headquarters: Tokyo, Japan
- U.S. subsidiary: San Mateo, California
- Company size: 35 engineers & researchers
- Investors: Toyota, FANUC, NTT
- Focus areas: manufacturing, healthcare, automotive (deep learning for industrial IoT)

Partnering with world-leading companies using Chainer
- R&D collaboration on industrial problems with real-world data
  - Specific requirements, modified algorithms, many trials and errors, etc.
  - Different from building a general-purpose recognition system
- Partners include Toyota, NTT, FANUC, Panasonic, Cisco, and NVIDIA.

Two types of background behind DL frameworks
1. Scalability-oriented
  - Use cases in mind: image/speech recognition systems; fast DL as a service in the cloud
  - Problem type: a few general applications; 10+ million training samples; clusters of 10+ nodes with a fast network
  - Possible bottlenecks: tuning of well-known algorithms; distributed computation for model-/data-parallel training
2. Flexibility-oriented
  - Use cases in mind: algorithm research; R&D projects for new products
  - Problem type: various specific applications; 10+ k training samples; one node with multiple GPUs
  - Possible bottlenecks: trial and error in prototyping; debugging, profiling & refactoring; (wait time during compilation)

Designed for efficient research & development
- Flexible: new kinds of complex models for various applications
- Intuitive: rapid prototyping and efficient trial and error
- Powerful: comparable performance for one node with multiple GPUs
[Figure: deep learning frameworks placed on a spectrum from scalability-oriented to flexibility-oriented.]

Agenda
- Deep learning framework basics
- Introduction to Chainer
- CuPy: NumPy-compatible GPU library
- Performance and applications

Neural network and computation
[Figure: forward computation maps inputs (image, sensor, text) through hidden units h1..hH to outputs (object: tulip; anomaly score: 0.35; category: sports); backward computation (backpropagation) runs in the reverse direction.]

Chainer focuses on network representation and training
- Design choices for deep learning frameworks:
  - How to build neural networks?
  - How to train neural networks?
  - Which text format/language for modeling?
  - Which language for computing?
  - Run with a GPU?
  - Run on multiple GPUs?
  - Run on multiple compute nodes?

Building and training neural networks: computational graph construction is the key
1. Construct a computational graph
  - based on the network definition given by the user
  - chains of functions and operations on input variables
2. Compute loss and gradients
  - forward computation calculates the loss for a minibatch
  - backpropagation gives the gradients of all parameters
3. Optimize the model
  - update each parameter with its gradient
  - repeat until convergence
Step 1 is the most important, and there are many approaches to it.
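Before any framework is introduced, steps 2 and 3 can be illustrated with a minimal, framework-free NumPy sketch of a single training step for a tiny two-layer network. The network sizes, loss function, and variable names below are hypothetical choices for illustration, not taken from the deck.

  import numpy as np

  # All sizes here are hypothetical: N input features, H hidden units, 1 output.
  N, H, lr = 4, 3, 0.01
  rng = np.random.RandomState(0)
  W1, b1 = rng.randn(H, N), np.zeros(H)
  W2, b2 = rng.randn(1, H), np.zeros(1)
  x, t = rng.randn(N), np.array([1.0])      # one training sample and its target

  # Step 2a: forward computation of the loss for this sample
  h = np.maximum(0.0, W1.dot(x) + b1)       # hidden units (ReLU)
  y = W2.dot(h) + b2                        # output
  loss = 0.5 * np.sum((y - t) ** 2)         # squared-error loss

  # Step 2b: backpropagation gives the gradient of the loss for every parameter
  gy = y - t                                # dL/dy
  gW2, gb2 = np.outer(gy, h), gy            # gradients of W2, b2
  gh = W2.T.dot(gy) * (h > 0)               # gradient flowing back through ReLU
  gW1, gb1 = np.outer(gh, x), gh            # gradients of W1, b1

  # Step 3: update each parameter with its gradient (plain SGD); repeat until convergence
  for p, g in ((W1, gW1), (b1, gb1), (W2, gW2), (b2, gb2)):
      p -= lr * g

A framework automates step 2b by recording the forward computation as a graph; how that graph gets built is exactly the design choice discussed next.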
Building blocks
- These functionalities are very similar across frameworks,
- but the structure, abstraction level, and interface are different.
- It comes down to the design of a domain-specific language (DSL) for neural networks.
- Building blocks: array data structure (vector/matrix/tensor), network (computational graph), operations & functions, optimizer (SGD/AdaGrad/Adam).

Types of domain-specific language for neural networks
- Text DSL: the network is defined in a text format; e.g. Caffe (prototxt), CNTK (NDL).
- Symbolic program: operations on symbols; e.g. Theano, TensorFlow, MXNet.
- Imperative program: direct computations on raw data arrays; e.g. Torch.nn, Chainer.

  %% Definition in text (Text DSL)
  f: {
    "A": "Variable",
    "b": "Variable",
    "C": ["b", "*", "A"],
    "ret": ["C", "+", 1]
  }
  # Compile
  f = compile("f.txt")
  d = f(A=np.ones(10), B=np.ones(10) * 2)

  # Symbolic definition
  A = Variable('A')
  b = Variable('b')
  C = b * A
  D = C + Constant(1)
  # Compile
  f = compile(D)
  d = f(A=np.ones(10), B=np.ones(10) * 2)

  # Imperative declaration
  a = np.ones(10)
  b = np.ones(10) * 2
  c = b * a
  d = c + 1

Comparison of DSL types
- Text DSL
  - Pros: human-readable definitions; non-programmers can easily edit the network.
  - Cons: users must study the format; the format might have to be extended for new algorithms.
- Internal DSL, symbolic
  - Pros: static analysis at compile time; optimization before training; easy to parallelize.
  - Cons: users must study a special syntax; may need more effort to implement new algorithms.
- Internal DSL, imperative
  - Pros: less effort to learn the syntax; easy debugging and profiling; suitable for new algorithms with complex logic.
  - Cons: hard to optimize in advance; less efficient in memory allocation and parallelization.
- Chainer is at the extreme end of the imperative approach, for high flexibility.

Agenda
- Deep learning framework basics
- Introduction to Chainer
- CuPy: NumPy-compatible GPU library
- Performance and applications

Chainer as an open-source project
- https://github.com/pfnet/chainer
- 50 contributors, 1,277 stars & 255 forks, 3,708 commits
- Original developer: Seiya Tokui
- Active development & releases for the last 10 months: v1.0.0 (June 2015) to v1.7.2 (March 2016)

Chainer software stack
- Chainer is built on top of NumPy and CUDA.
- CuPy is also introduced as an equivalent of NumPy on GPU.
- CPU side: Chainer → NumPy → BLAS → CPU
- GPU side: Chainer → CuPy → cuDNN/CUDA → NVIDIA GPU

Graph build scheme (1/2): Define-and-Run
- Most frameworks use this scheme (Chainer does not).
- Define: build a computational graph, plus a gradient function via automatic differentiation, from the network definition.
- Run: feed training data to the fixed graph, compute loss & gradients, and update the parameters.

Graph build scheme (2/2): Define-by-Run, computational graph construction on the fly
- No graph is constructed before training.
- Instead, the graph is built during each forward computation.
- The computational graph can be modified dynamically for each iteration/sample, or depending on some conditions.

Define-by-Run example: MLP for MNIST
- Only the transformations between units are set up before training:

  l1 = Linear(784, n_units)
  l2 = Linear(n_units, 10)

- The connection is given as the forward computation:

  def forward(x):
      h1 = ReLU(l1(x))
      return l2(h1)

- x → Linear l1 → ReLU → Linear l2 → y (digit classes 0-9)
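The following sketch illustrates what Define-by-Run allows in practice: the depth of the network is chosen by an ordinary Python for-loop at call time, so a different computational graph is built on each forward pass. It assumes the Chainer v1-style Chain/Link API used in the code samples later in this deck; the model (VariableDepthMLP) and its n_steps argument are hypothetical, not from the slides.

  import numpy as np
  import chainer
  import chainer.functions as F
  import chainer.links as L

  class VariableDepthMLP(chainer.Chain):
      # Hypothetical model: the hidden layer is applied a caller-chosen number of times.
      def __init__(self, n_units=100):
          super(VariableDepthMLP, self).__init__(
              l_in=L.Linear(784, n_units),
              l_mid=L.Linear(n_units, n_units),
              l_out=L.Linear(n_units, 10),
          )

      def __call__(self, x, n_steps):
          # Define-by-Run: the graph is recorded while this Python code executes,
          # so the for-loop below directly decides the structure of this pass's graph.
          h = F.relu(self.l_in(x))
          for _ in range(n_steps):
              h = F.relu(self.l_mid(h))
          return self.l_out(h)

  model = VariableDepthMLP()
  x = chainer.Variable(np.random.rand(32, 784).astype(np.float32))
  y_shallow = model(x, n_steps=1)   # graph with one application of l_mid
  y_deep = model(x, n_steps=3)      # a deeper graph, built from the same parameters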
Define-by-Run: an interpreted language for neural networks
- Idea
  - The forward computation actually goes through the computational graph;
  - by remembering the history, the actual graph can be obtained.
- Advantages
  - Flexibility for new algorithms with complex components, e.g. recurrent, recursive, attention, memory, adversarial.
  - Intuitive coding with a highly imperative network definition, e.g. stochastic networks whose graph changes at each iteration.
- Current drawbacks
  - The graph is regenerated every time, even for fixed networks.
  - No optimization, even for the static parts of graphs; JIT-like analysis and subgraph caching might be useful.

Basic components (1/2): Variable and Function
- Variable
  - wraps arrays (.data)
  - remembers its parent function (.creator)
  - will be assigned a gradient (.grad)
  - keeps track of not only the data but also the computations
- Function
  - a transformation between Variables
  - stateless
  - e.g. sigmoid, tanh, ReLU, max pooling, dropout

Basic components (2/2): Link and Chain
- Link = a function with state
  - its parameters are also Variables, and gradients will be assigned to them
  - e.g. Linear (fully connected, y = f(W*x + b)), LSTM, Convolution2D, word embedding
- Chain = a network
  - a Chain has a set of child Links
  - the forward computation is defined in __call__()
  - e.g. MLP2, AlexNet, GoogLeNet, RNNLM, seq2seq

Backpropagation through the computational graph
- Consider an objective (Link.Linear): L = f(x * W + b).
- Computing the value of L in the forward pass simultaneously builds the computational graph x, W, b → (*) → (+) → f → L, where x, W, b and the intermediate results are Variables and *, +, f are Functions.
- The gradient of L can then be computed with respect to any variable by backpropagation.
- The optimizer then updates the values of the parameters.
- (A small runnable sketch of this mechanism is given at the end of this section.)

Code sample (1/4): multi-layer perceptron

  class MLP2(Chain):
      def __init__(self):
          super(MLP2, self).__init__(
              l1=L.Linear(784, 100),
              l2=L.Linear(100, 10),
          )

      def __call__(self, x):
          h1 = F.relu(self.l1(x))
          y = self.l2(h1)
          return y

  class Classifier(Chain):
      def __init__(self, predictor):
          super(Classifier, self).__init__(predictor=predictor)

      def __call__(self, x, t):
          y = self.predictor(x)
          self.accuracy = F.accuracy(y, t)
          self.loss = F.softmax_cross_entropy(y, t)
          return self.loss, self.accuracy

  # Model and optimizer setup
  model = Classifier(MLP2())
  optimizer = optimizers.SGD()
  optimizer.setup(model)

  # Training loop with minibatches
  for i in range(0, datasize, batchsize):
      x = Variable(x_tr[i:i+batchsize])
      t = Variable(y_tr[i:i+batchsize])
      model.zerograds()
      loss, acc = model(x, t)
      loss.backward()
      optimizer.update()

Code sample (2/4): convolutional neural network (excerpt)

  class AlexNet(Chain):
      def __init__(self):
          super(AlexNet, self).__init__(
              conv1=L.Convolution2D(3, 96, 11, stride=4),
              conv2=L.Convolution2D(96, 256, 5, pad=2),
              conv3=L.Convolution2D(256, 384, 3, pad=1),
              conv4=L.Convolution2D(384, 384, 3, pad=1),
              conv5=L.Convolution2D(384, 256, 3, pad=1),
              fc6=L.Linear(9216, 4096),
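As promised in the backpropagation slide above, here is a minimal sketch of the L = f(x * W + b) mechanics: the forward expression records the graph, backward() fills in .grad on every Variable, and .creator points at the parent Function. It assumes the Chainer v1-era API described in this deck; the concrete numbers and the choice of f (a sigmoid followed by a sum, just to obtain a scalar objective) are illustrative assumptions.

  import numpy as np
  import chainer
  import chainer.functions as F

  # Illustrative values; f is taken to be sum(sigmoid(.)) so that L is a scalar.
  x = chainer.Variable(np.array([1.0, 2.0], dtype=np.float32))
  W = chainer.Variable(np.array([0.5, -0.5], dtype=np.float32))
  b = chainer.Variable(np.array([0.1, 0.1], dtype=np.float32))

  L = F.sum(F.sigmoid(x * W + b))   # forward computation; the graph is recorded on the fly

  L.backward()                      # backpropagation through the recorded graph
  print(W.grad, b.grad)             # gradients are written into .grad of each Variable
  print(L.creator)                  # the Variable remembers the Function that created it

In a real model the parameters live inside Links (as in the code samples above), and the optimizer reads exactly these .grad arrays when it updates them.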