Sayan Pathak, Principal ML Scientist
Chris Basoglu, Partner Dev Manager
With many contributors: A. Agarwal, E. Akchurin, E. Barsoum, C. Basoglu, G. Chen, S. Cyphers, W. Darling, J. Droppo, K. Deng, A. Eversole, B. Guenter, P. He, M. Hillebrand, X. Huang, Z. Huang, R. Hoens, V. Ivanov, A. Kamenev, N. Karampatziakis, P. Kranen, O. Kuchaiev, W. Manousek, C. Marschner, A. May, B. Mitra, O. Nano, G. Navarro, A. Orlov, M. Radmilac, A. Reznichenko, P. Parthasarathi, S. Pathak, B. Peng, W. Richert, F. Seide, M. Seltzer, M. Slaney, A. Stolcke, T. Will, H. Wang, Z. Wang, W. Xio. Yao, D. Yu, C. Zhang, Y. Zhang, G. Zweig

Outline
• CNTK overview
• Key features
• Symbolic loop
• Batch scheduling
• Data parallel training
• Conclusions
Microsoft Cognitive Toolkit Deep learning at Microsoft
• Microsoft Cognitive Services
• Skype Translator
• Cortana
• Bing
• HoloLens
• Microsoft Research
Microsoft Cognitive Services

ImageNet: Microsoft 2015 ResNet
ImageNet classification top-5 error (%):

  ILSVRC year   Entry         Top-5 error (%)
  2010          NEC America   28.2
  2011          Xerox         25.8
  2012          AlexNet       16.4
  2013          Clarifai      11.7
  2014          VGG           7.3
  2014          GoogLeNet     6.7
  2015          ResNet        3.5
This year, all 5 Microsoft entries took first place: ImageNet classification, ImageNet localization, ImageNet detection, COCO detection, and COCO segmentation
Microsoft Translator
http://translate.it
You can follow along to this presentation on your own device, in the language of your choice.
Download the Microsoft Translator app for Android, iOS, or Windows, or visit translate.it/
Microsoft Translator is powered by machine learning. Any voice or text information you provide will be used to improve Microsoft products and services.

Bing / Bing Ads
Microsoft’s historic speech breakthrough
• Microsoft 2016 research system for conversational speech recognition
• 5.9% word-error rate
• enabled by CNTK’s multi-server scalability
[W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, G. Zweig: “Achieving Human Parity in Conversational Speech Recognition,” https://arxiv.org/abs/1610.05256]
Microsoft Customer Support Agent

Microsoft Cognitive Toolkit (CNTK)
• Microsoft’s open-source deep-learning toolkit
• https://github.com/Microsoft/CNTK
• Created by Microsoft Speech researchers in 2012
• On GitHub since Jan 2016 under MIT license
• Python support since Oct 2016 (beta), rebranded as “Cognitive Toolkit”
• External contributions, e.g. from MIT, Stanford and NVIDIA
• Over 80% of Microsoft's internal DL workload runs on CNTK
• First-class on Linux and Windows, Docker support
• Python, C++, C#, Java
• Internal == External
CNTK – The Fastest Toolkit
http://dlbench.comp.hkbu.edu.hk/
Benchmarking by HKBU, Version 8
Single Tesla K80 GPU, CUDA: 8.0, cuDNN: v5.1
Versions: Caffe 1.0rc5 (39f28e4), CNTK 2.0 Beta10 (1ae666d), MXNet 0.93 (32dc3a2), TensorFlow 1.0 (4ac9c09), Torch 7 (748f5e3)

                   Caffe       CNTK         MXNet         TensorFlow    Torch
  FCN5 (1024)      55.329ms    51.038ms     60.448ms      62.044ms      52.154ms
  AlexNet (256)    36.815ms    27.215ms     28.994ms      103.960ms     37.462ms
  ResNet (32)      143.987ms   81.470ms     84.545ms      181.404ms     90.935ms
  LSTM (256)       -           43.581ms     288.142ms     -             1130.606ms
  (v7 benchmark)               (44.917ms)   (284.898ms)   (223.547ms)   (906.958ms)
“CNTK is production-ready: State-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server.”
[Figure: speed comparison (samples/second), higher = better (note: December 2015). Toolkits: CNTK, Theano, TensorFlow, Torch 7, Caffe. Configurations: 1 GPU, 1 x 4 GPUs, 2 x 4 GPUs (8 GPUs). y-axis: 0 to 80000. Annotations: “Achieved with 1-bit gradient quantization algorithm”; “Theano only supports 1 GPU”; “Superior performance”; “Scalability”.]

What is new in CNTK 2.0?
Microsoft has now released a major upgrade of the software and rebranded it as part of the Microsoft Cognitive Toolkit. This release is a major improvement over the initial release. There are two major changes from the first release that you will see when you begin to look at the new release. First is that CNTK now has a very nice Python API and, second, the documentation and examples are excellent.
Installing the software from the binary builds is very easy on both Ubuntu Linux and Windows.
https://esciencegroup.com/2016/11/10/cntk-revisited-a-new-deep-learning-toolkit-release-from-microsoft/

CNTK Other Advantages
• Python and C++ API
  • Mostly implemented in C++
  • Low level + high level Python API
• Extensibility
  • User functions and learners in pure Python
• Readers
  • Distributed, highly efficient built-in data readers
• Internal == External
The Microsoft Cognitive Toolkit (CNTK)
• CNTK expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks, supporting relevant network types and applications.
• CNTK is production-ready: State-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server.

MNIST Handwritten Digits (OCR)
[Figure: sample handwritten digit images and their corresponding labels]
• Data set of handwritten digits with 60,000 training images and 10,000 test images
• Each image is 28 x 28 pixels

Multi-layer perceptron
https://github.com/Microsoft/CNTK/tree/master/Tutorials
[Figure: deep model. A 28 x 28 pixel image is flattened to 784 pixels (x) and passed through three dense layers (D):]

  Layer (D)   Input (i)   Output (O)   Activation (a)   Weights
  1           784         400          relu             784 x 400 + 400 bias
  2           400         200          relu             400 x 200 + 200 bias
  3           200         10           None             200 x 10 + 10 bias

The 10 output nodes z0 ... z9 are turned into probabilities by softmax:
$\mathrm{softmax}(z)_i = e^{z_i} / \sum_{j=0}^{9} e^{z_j}$
Example output: 0.08 0.08 0.10 0.17 0.11 0.09 0.08 0.08 0.13 0.01

Loss function
[Figure: a 28 x 28 pixel image passes through the model (w, b) to give predicted probabilities (p); the cross-entropy (ce) error function compares them with the one-hot encoded label (Y)]
Label, one-hot encoded (Y): 0 0 0 1 0 0 0 0 0 0
Predicted probabilities (p): 0.08 0.08 0.10 0.17 0.11 0.09 0.08 0.08 0.13 0.01
Cross entropy error: $\mathrm{Loss} = -\sum_{j=0}^{9} y_j \log p_j$

CNTK Model
Example: 2-hidden layer feed-forward NN

$h_1 = \sigma(W_1 x + b_1)$                               h1 = sigmoid (x @ W1 + b1)
$h_2 = \sigma(W_2 h_1 + b_2)$                             h2 = sigmoid (h1 @ W2 + b2)
$P = \mathrm{softmax}(W_{out} h_2 + b_{out})$             P = softmax (h2 @ Wout + bout)

with input $x \in \mathbb{R}^M$ and one-hot label $y \in \mathbb{R}^J$
and cross-entropy training criterion

$ce = y^T \log P$                                         ce = cross_entropy (P, y)
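To make the formulas concrete, here is a minimal plain-Python sketch of the same forward pass. It is not CNTK code: the helper names (`affine`, `init`), the layer sizes, and the random weights are all assumptions chosen for illustration. It follows the row-vector convention of the code column (`x @ W + b`) and the slide's sign convention `ce = y^T log P`; the MNIST loss shown earlier is its negative.

```python
import math
import random

def sigmoid(v):
    """Element-wise logistic function on a list of floats."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def affine(x, W, b):
    """Row-vector convention: x @ W + b, with W of shape len(x) x len(b)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def softmax(z):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(z)
    e = [math.exp(a - m) for a in z]
    s = sum(e)
    return [a / s for a in e]

def cross_entropy(P, y):
    """Slide convention ce = y^T log P (a log-likelihood, to be maximized)."""
    return sum(yj * math.log(pj) for yj, pj in zip(y, P))

def init(rows, cols, rng):
    """Small random weights; an arbitrary choice for this sketch."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
M, H, J = 4, 3, 2                      # illustrative sizes, not from the slides
W1, b1 = init(M, H, rng), [0.0] * H
W2, b2 = init(H, H, rng), [0.0] * H
Wout, bout = init(H, J, rng), [0.0] * J

x = [0.5, -1.0, 0.25, 2.0]             # input in R^M
y = [0.0, 1.0]                         # one-hot label in R^J

h1 = sigmoid(affine(x, W1, b1))
h2 = sigmoid(affine(h1, W2, b2))
P = softmax(affine(h2, Wout, bout))
ce = cross_entropy(P, y)
```

Because `y` is one-hot, `ce` reduces to the log of the probability assigned to the correct class, so it is always negative and approaches 0 as the prediction improves.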
CNTK Model
[Figure: computation graph of the 2-hidden layer network. x is multiplied by W1, b1 is added, and the sigmoid (σ) gives h1; the same pattern with W2 and b2 gives h2; Wout, bout and softmax give P; cross_entropy of P and y gives ce]
• Nodes: functions (primitives)
  • Can be composed into reusable composites
• Edges: values
  • Incl. tensors, sparse
• Automatic differentiation
  • ∂F / ∂in = ∂F / ∂out ∙ ∂out / ∂in
• Deferred computation → execution engine
• Editable, clonable

LEGO-like composability allows CNTK to support a wide range of networks & applications
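The chain rule quoted above, ∂F / ∂in = ∂F / ∂out ∙ ∂out / ∂in, is the whole mechanism behind reverse-mode automatic differentiation. The following is a minimal sketch on scalar values, not CNTK's implementation; the `Node` class and the helper functions are hypothetical names introduced only for this illustration.

```python
import math

class Node:
    """A value in the graph plus the local derivatives w.r.t. its inputs."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents      # tuple of (input_node, d_out / d_in)
        self.grad = 0.0             # accumulates dF / d(this node)

def add(a, b):
    return Node(a.value + b.value, ((a, 1.0), (b, 1.0)))

def mul(a, b):
    return Node(a.value * b.value, ((a, b.value), (b, a.value)))

def sigmoid(a):
    s = 1.0 / (1.0 + math.exp(-a.value))
    return Node(s, ((a, s * (1.0 - s)),))

def backward(output):
    """Walk the graph from the output back to the inputs,
    applying dF/din += dF/dout * dout/din on every edge."""
    order, seen = [], set()
    def visit(n):
        if id(n) not in seen:
            seen.add(id(n))
            for p, _ in n.parents:
                visit(p)
            order.append(n)         # post-order: inputs before outputs
    visit(output)
    output.grad = 1.0
    for n in reversed(order):       # outputs before inputs
        for p, local in n.parents:
            p.grad += n.grad * local

x, w, b = Node(2.0), Node(-0.5), Node(0.25)
h = sigmoid(add(mul(x, w), b))      # h = sigmoid(x * w + b)
backward(h)
```

After `backward(h)`, `w.grad`, `x.grad` and `b.grad` hold the derivatives of `h` with respect to each leaf. Building the graph first and evaluating gradients afterwards mirrors the deferred-computation idea on the slide.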
Authoring networks as functions
[Figure: the computation graph of the 2-hidden layer network, split into the model function (x to P) and the criterion function (P and y to ce)]
• “model function”
  • features → predictions
  • defines the model structure & parameter initialization
  • holds parameters that will be learned by training
• “criterion function”
  • (features, labels) → (training loss, additional metrics)
  • defines training and evaluation criteria on top of the model function
  • provides gradients w.r.t. training criteria
• CNTK model: neural networks are functions
  • pure functions
  • with “special powers”:
    • can compute a gradient w.r.t. any of its nodes
    • external deity can update model parameters
• user specifies network as function objects:
  • formula as a Python function (low level, e.g. LSTM)
  • function composition of smaller sub-networks (layering)
  • higher-order functions (equiv. of scan, fold, unfold)
  • model parameters held by function objects
• “compiled” into the static execution graph under the hood
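The split between a model function and a criterion function can be sketched with ordinary Python closures. This is an illustration of the idea only: the `Dense` and `Sequential` helpers below are hypothetical stand-ins written for this sketch, not the CNTK layers of the same name, and `make_criterion` and the weights are likewise assumptions.

```python
import math

def Dense(weights, bias, activation=None):
    """Hypothetical layer factory: returns a function that owns its parameters.
    weights[j] is the weight vector of output unit j."""
    def layer(x):
        z = [sum(xi * w for xi, w in zip(x, row)) + b
             for row, b in zip(weights, bias)]
        return [activation(v) for v in z] if activation else z
    layer.parameters = (weights, bias)
    return layer

def Sequential(*layers):
    """Function composition of smaller sub-networks (layering)."""
    def model(x):
        for layer in layers:
            x = layer(x)
        return x
    model.parameters = [layer.parameters for layer in layers]
    return model

def relu(v):
    return max(0.0, v)

def make_criterion(model):
    """Criterion function: (features, labels) -> (training loss, error metric)."""
    def criterion(x, y):
        z = model(x)
        m = max(z)
        log_norm = m + math.log(sum(math.exp(v - m) for v in z))
        loss = -sum(yj * (zj - log_norm) for yj, zj in zip(y, z))
        predicted = z.index(max(z))
        error = 0.0 if y[predicted] == 1.0 else 1.0
        return loss, error
    return criterion

# Model function: features -> predictions (here: unnormalized scores).
model = Sequential(
    Dense([[0.5, -0.5], [-1.0, 1.0], [0.25, 0.25]], [0.0, 0.0, 0.0], relu),
    Dense([[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]], [0.0, 0.0]),
)
criterion = make_criterion(model)
loss, error = criterion([1.0, 2.0], [0.0, 1.0])
```

The model function can be used on its own for prediction, while the criterion function wraps it for training and evaluation; the parameters stay attached to the function objects that created them.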
Layers lib: full list of layers/blocks
• layers/blocks.py:
  • LSTM(), GRU(), RNNUnit()
  • Stabilizer(), identity
  • ForwardDeclaration(), Tensor[], SparseTensor[], Sequence[], SequenceOver[]
• layers/layers.py:
  • Dense(), Embedding()
  • Convolution(), Convolution1D(), Convolution2D(), Convolution3D(), Deconvolution()
  • MaxPooling(), AveragePooling(), GlobalMaxPooling(), GlobalAveragePooling(), MaxUnpooling()
  • BatchNormalization(), LayerNormalization()
  • Dropout(), Activation()
  • Label()
• layers/higher_order_layers.py:
  • Sequential(), For(), operator >>, (function tuples)
  • ResNetBlock(), SequentialClique()
• layers/sequence.py:
  • Delay(), PastValueWindow()
  • Recurrence(), RecurrenceFrom(), Fold(), UnfoldFrom()
• models/models.py:
  • AttentionModel()
CNTK unique features
• Symbolic loops over sequences with dynamic scheduling
• Turn graph into parallel program through minibatching
• Unique parallel training algorithms (1-bit SGD, Block Momentum)
Symbolic loops over sequential data
Extend our example to a recurrent network (RNN):

$h_1(t) = \sigma(W_1 x(t) + H_1 h_1(t-1) + b_1)$          h1 = sigmoid(x @ W1 + past_value(h1) @ H1 + b1)
$h_2(t) = \sigma(W_2 h_1(t) + H_2 h_2(t-1) + b_2)$        h2 = sigmoid(h1 @ W2 + past_value(h2) @ H2 + b2)
$P(t) = \mathrm{softmax}(W_{out} h_2(t) + b_{out})$       P = softmax(h2 @ Wout + bout)
$ce(t) = y^T(t) \log P(t)$                                ce = cross_entropy(P, y)
$\sum_{corpus} ce(t) = \max$

The code on the right has no explicit notion of time.
[Figure: computation graph of the recurrent network. Each hidden layer output (h1, h2) feeds back into its own input sum through a delay node (z^-1) and a recurrent weight matrix (H1, H2)]
• CNTK automatically unrolls cycles → deferred computation
• Efficient and composable
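What unrolling a cycle amounts to can be shown with an explicit loop over time. The sketch below is a scalar, plain-Python rendering of the first recurrence, h1(t) = σ(W1 x(t) + H1 h1(t-1) + b1); the function name `run_recurrence`, the zero initial state, and the numeric values are assumptions made for the example, and in CNTK this loop is generated from the symbolic graph rather than written by hand.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def run_recurrence(xs, W1, H1, b1, initial_state=0.0):
    """Scalar version of h1(t) = sigmoid(W1 x(t) + H1 h1(t-1) + b1)."""
    h_prev = initial_state          # what past_value(h1) yields at the first step
    outputs = []
    for x_t in xs:                  # the explicit time loop the symbolic form hides
        h_t = sigmoid(x_t * W1 + h_prev * H1 + b1)
        outputs.append(h_t)
        h_prev = h_t                # becomes past_value(h1) for the next step
    return outputs

hs = run_recurrence([1.0, 0.0, -1.0], W1=0.5, H1=2.0, b1=0.0)
```

The symbolic version states only the relation between `h1` and `past_value(h1)`; the loop, the initial state, and the hand-off between steps are left to the execution engine.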
Batch-scheduling of variable-length sequences
• minibatches containing sequences of different lengths are automatically packed and padded
[Figure: seven sequences of different lengths (sequence 1 to sequence 7) packed into rows of parallel sequences along the time axis, with padding filling the remaining gap; time steps are computed in parallel across the rows]
• CNTK handles the special cases:
  • past_value operation correctly resets state and gradient at sequence boundaries
  • non-recurrent operations just pretend there is no padding (“garbage-in/garbage-out”)
  • sequence reductions
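The packing step can be illustrated with a small scheduler. This sketch uses a generic greedy first-fit strategy, which is an assumption: the slides do not say which packing strategy CNTK uses. The function names and the sequence lengths are also invented for the example.

```python
def pack_sequences(lengths):
    """Greedy first-fit packing of sequences (given by length) into rows
    whose width is the longest sequence.

    Returns (width, rows); each row is a list of (sequence_id, start, length).
    """
    width = max(lengths)
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    rows, used = [], []
    for seq in order:
        for r in range(len(rows)):
            if used[r] + lengths[seq] <= width:
                rows[r].append((seq, used[r], lengths[seq]))
                used[r] += lengths[seq]
                break
        else:                           # no existing row had room
            rows.append([(seq, 0, lengths[seq])])
            used.append(lengths[seq])
    return width, rows

def layout(width, rows):
    """Grid of sequence ids per (row, time step); None marks padding."""
    grid = []
    for row in rows:
        cells = [None] * width
        for seq, start, length in row:
            cells[start:start + length] = [seq] * length
        grid.append(cells)
    return grid

def is_sequence_start(grid, r, t):
    """True where a recurrent state would have to be reset."""
    return grid[r][t] is not None and (t == 0 or grid[r][t - 1] != grid[r][t])

lengths = [9, 4, 5, 6, 3, 2, 3]         # illustrative lengths for seven sequences
width, rows = pack_sequences(lengths)
grid = layout(width, rows)
padding = sum(cell is None for row in grid for cell in row)
```

With these lengths, seven sequences fit into four parallel rows of nine time steps. `is_sequence_start` marks the positions where an operation such as `past_value` must reset its state instead of reading the previous time step of a different sequence.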
• speed-up is automatic:
[Figure: speed comparison on RNNs (bar chart, axis 0 to 25). Naïve, single sequence: 1. Optimized, multi sequence: >20.]
Data-parallel training
• Data-parallelism: distribute minibatch over workers, all-reduce partial gradients
[Figure: node 1, node 2 and node 3 each compute a partial gradient; the partial gradients are summed (Σ) by all-reduce]
• ring algorithm: O(2 (K-1)/K M), i.e. O(1) w.r.t. K
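The cost figure can be checked with a small simulation of a ring all-reduce. The slide does not define K and M; the sketch reads K as the number of workers and M as the number of gradient values per worker, which is an assumption. The function `ring_all_reduce` and the test data are written for this illustration and run in a single process with no real communication.

```python
def ring_all_reduce(vectors):
    """Simulate a ring all-reduce (sum) over K workers.

    vectors[k] is worker k's gradient, already split into K chunks.
    Returns (reduced_vectors, values_sent_per_worker).
    """
    K = len(vectors)
    data = [[list(chunk) for chunk in v] for v in vectors]
    sent = [0] * K

    # Phase 1, reduce-scatter: after K-1 steps, worker k holds the
    # complete sum of chunk (k + 1) % K.
    for step in range(K - 1):
        messages = [(k, (k - step) % K) for k in range(K)]
        payloads = [list(data[k][c]) for k, c in messages]
        for (k, c), payload in zip(messages, payloads):
            dst = (k + 1) % K
            data[dst][c] = [a + b for a, b in zip(data[dst][c], payload)]
            sent[k] += len(payload)

    # Phase 2, all-gather: pass the finished chunks around the ring.
    for step in range(K - 1):
        messages = [(k, (k + 1 - step) % K) for k in range(K)]
        payloads = [list(data[k][c]) for k, c in messages]
        for (k, c), payload in zip(messages, payloads):
            data[(k + 1) % K][c] = payload
            sent[k] += len(payload)

    return data, sent

workers = [
    [[1, 2], [3, 4], [5, 6]],
    [[10, 20], [30, 40], [50, 60]],
    [[100, 200], [300, 400], [500, 600]],
]
reduced, sent = ring_all_reduce(workers)
```

Each worker sends 2 (K-1) chunks of M/K values, that is 2 (K-1)/K ∙ M values in total. With K = 3 and M = 6 this gives 8 values per worker, and the factor (K-1)/K stays below 1 however many workers are added.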
Data-parallel training: how to reduce communication cost

Communicate less each time
• 1-bit SGD: [F. Seide, H. Fu, J. Droppo, G. Li, D. Yu: “1-Bit Stochastic Gradient Descent... Distributed Training of Speech DNNs”, Interspeech 2014]
  • Quantize gradients to 1 bit per value
  • Trick: carry over quantization error to next minibatch
[Figure: the minibatch is distributed over GPU 1, GPU 2 and GPU 3; the exchanged gradients are 1-bit quantized with residual]

Communicate less often
• Automatic MB sizing [F. Seide, H. Fu, J. Droppo, G. Li, D. Yu: “On Parallelizability of Stochastic Gradient Descent...”, ICASSP 2014]
• Block momentum [K. Chen, Q. Huo: “Scalable training of deep learning machines by incremental block training…,” ICASSP 2016]
  • Very recent, very effective parallelization method
  • Combines model averaging with error-residual idea
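The 1-bit SGD trick, carrying the quantization error over to the next minibatch, can be sketched in a few lines. This is an illustration of the error-feedback idea only. The reconstruction values used here (the means of the non-negative and of the negative entries) are an assumed choice, and the function name and gradient values are made up for the example.

```python
def one_bit_quantize(gradient, residual):
    """Quantize (gradient + residual) to one bit per value.

    Returns (bits, levels, decoded, new_residual). The reconstruction
    levels are the means of the non-negative and of the negative entries,
    an assumed choice for this sketch.
    """
    corrected = [g + r for g, r in zip(gradient, residual)]
    bits = [c >= 0.0 for c in corrected]
    pos = [c for c in corrected if c >= 0.0]
    neg = [c for c in corrected if c < 0.0]
    pos_value = sum(pos) / len(pos) if pos else 0.0
    neg_value = sum(neg) / len(neg) if neg else 0.0
    decoded = [pos_value if b else neg_value for b in bits]
    # The part that could not be transmitted is kept for the next minibatch.
    new_residual = [c - d for c, d in zip(corrected, decoded)]
    return bits, (pos_value, neg_value), decoded, new_residual

minibatch_gradients = (
    [0.9, -0.1, 0.3, -0.7],
    [0.2, 0.4, -0.6, -0.2],
    [-0.5, 0.1, 0.1, 0.3],
)
residual = [0.0] * 4
total_true = [0.0] * 4
total_sent = [0.0] * 4
for gradient in minibatch_gradients:
    bits, levels, decoded, residual = one_bit_quantize(gradient, residual)
    total_true = [t + g for t, g in zip(total_true, gradient)]
    total_sent = [t + d for t, d in zip(total_sent, decoded)]
```

Only `bits` and the two `levels` would need to be communicated per minibatch. Because the residual is added back before the next quantization, the sum of what was sent differs from the sum of the true gradients by exactly the current residual, so no gradient information is permanently dropped.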
Benchmark result of parallel training on CNTK
• Training data: 2,670 hours of speech from real traffic of VS, SMD, and Cortana
• About 16 and 20 days to train DNN and LSTM on 1 GPU, respectively
[Figure: 1bit/BMUF speedup factors in LSTM training. Series: 1bit-average, 1bit-peak, BMUF-average, BMUF-peak. x-axis: 4, 8, 16, 32, 64 GPUs. y-axis: speedup factor, 0.0 to 60.0. Data labels range from 2.9 to 54.0.]
Credit: Yongqiang Wang, Kai Chen, Qiang Huo

Results
• Achievement
  • Almost linear speedup without degradation of model quality
  • Verified for training DNN, CNN, LSTM up to 64 GPUs for speech recognition, image classification, OCR, and click prediction tasks
  • Released in CNTK as a critical differentiator
  • Used for enterprise scale production data loads
  • Production tools in other companies such as iFLYTEK and Alibaba
Where to begin?
On GitHub: https://github.com/Microsoft/CNTK/wiki
Tutorials: https://www.cntk.ai/pythondocs/tutorials.html (latest release), https://github.com/Microsoft/CNTK/tree/master/Tutorials (latest)
Azure Notebooks: try for free, pre-hosted: https://notebooks.azure.com/cntk/libraries/tutorials
Seek help on Stack Overflow: http://stackoverflow.com/search?q=cntk (please add cntk tag)