Microsoft Cognitive Toolkit Outline


Outline

• Overview
• CNTK introduction
• Symbolic loop
• Batch scheduling
• Data parallel training
• Educational resources
• Conclusions

Deep Learning at Microsoft

• Microsoft Cognitive Services
• Skype Translator
• Cortana
• Bing
• HoloLens
• Microsoft Research

ImageNet: Microsoft 2015 ResNet

[Chart: ImageNet classification top-5 error (%) by ILSVRC entry: 2010 NEC America 28.2, 2011 Xerox 25.8, 2012 AlexNet 16.4, 2013 Clarifai 11.7, 2014 VGG 7.3, 2014 GoogLeNet 6.7, 2015 ResNet 3.5]

Microsoft had all 5 first-place entries in 2015: ImageNet classification, ImageNet localization, ImageNet detection, COCO detection, and COCO segmentation.

Microsoft Cognitive Services

Image Similarity

Goal: given a query image, find similar images.

• Customer: anonymous ISV (Azure partner)
• Task: given a retail image, find the same product on competitor websites (to compare prices)
• Existing solution: based solely on mining text from the websites of Target, Macy's, etc.
• The customer asked for individual similarity measures (e.g. texture, neck style)
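Image-similarity tasks like the one above are commonly solved by comparing image embeddings with cosine similarity. A minimal NumPy sketch of that retrieval step, using toy vectors in place of real CNN features (the embeddings and function names here are illustrative assumptions, not the customer's actual solution):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(query, catalog):
    # Index of the catalog embedding closest to the query.
    scores = [cosine_similarity(query, c) for c in catalog]
    return int(np.argmax(scores))

# Toy example: 4-dimensional "embeddings" standing in for CNN features.
query = np.array([1.0, 0.0, 1.0, 0.0])
catalog = [np.array([0.9, 0.1, 1.1, 0.0]),   # similar item
           np.array([0.0, 1.0, 0.0, 1.0])]   # dissimilar item
print(most_similar(query, catalog))  # → 0
```

Per-attribute similarity (texture, neck style) would use the same comparison over separate embeddings, one per attribute.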
Bing / Bing Ads

Microsoft Translator (http://translate.it): PowerPoint plug-in for translating speech to subtitles

Video: https://www.youtube.com/watch?v=eu9kMIeS0wQ;start=70

Microsoft Customer Support Agent

Microsoft Cognitive Toolkit (CNTK)

• Microsoft's open-source deep-learning toolkit: https://github.com/Microsoft/CNTK
• Created by Microsoft Speech researchers (Dong Yu et al.) in 2012 as the "Computational Network Toolkit"
• On GitHub since January 2016 under the MIT license
• Python support since October 2016 (beta); rebranded as "Cognitive Toolkit"
• External contributions, e.g. from MIT, Stanford, and Nvidia
• Over 80% of Microsoft's internal deep-learning workload runs on CNTK
• First-class on Linux and Windows, Docker support
• Python, C++, C#, Java, Keras
• Internal == External

CNTK Speed

Benchmarking by HKBU (version 8), http://dlbench.comp.hkbu.edu.hk/. Single Tesla K80 GPU, CUDA 8.0, cuDNN 5.1. Versions: Caffe 1.0rc5 (39f28e4), CNTK 2.0 Beta10 (1ae666d), MXNet 0.93 (32dc3a2), TensorFlow 1.0 (4ac9c09), Torch 7 (748f5e3). Time per minibatch, lower is better.

                 Caffe        CNTK         MXNet         TensorFlow    Torch
FCN5 (1024)      55.329 ms    51.038 ms    60.448 ms     62.044 ms     52.154 ms
AlexNet (256)    36.815 ms    27.215 ms    28.994 ms     103.960 ms    37.462 ms
ResNet (32)      143.987 ms   81.470 ms    84.545 ms     181.404 ms    90.935 ms
LSTM (256)       -            43.581 ms    288.142 ms    -             1130.606 ms
(v7 benchmark)                (44.917 ms)  (284.898 ms)  (223.547 ms)  (906.958 ms)

"CNTK is production-ready: state-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server."

[Chart, December 2015: speed comparison (samples/second, higher is better) for CNTK, Theano, TensorFlow, Torch 7, and Caffe on 1 GPU, 1 x 4 GPUs, and 2 x 4 GPUs (8 GPUs). CNTK's multi-GPU lead is achieved with the 1-bit gradient quantization algorithm; Theano only supports 1 GPU.]
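The 1-bit gradient quantization mentioned above reduces communication by sending only one bit per gradient element and carrying the quantization error forward into the next minibatch (error feedback). A simplified NumPy sketch of that idea, using a single mean-magnitude scale per tensor (my simplification for illustration, not CNTK's exact per-column 1-bit SGD implementation):

```python
import numpy as np

def one_bit_quantize(grad, residual):
    # Add the residual (error feedback) from the previous step, then
    # quantize every element to one bit: its sign, scaled so that the
    # mean magnitude is preserved. Return what would be transmitted,
    # plus the new residual to carry into the next minibatch.
    g = grad + residual
    scale = np.mean(np.abs(g))           # one shared scale per tensor
    quantized = np.where(g >= 0, scale, -scale)
    new_residual = g - quantized         # error kept locally, not sent
    return quantized, new_residual

# Toy example: with a zero initial residual, the transmitted gradient
# plus the new residual reconstructs the true gradient exactly.
residual = np.zeros(4)
g1 = np.array([0.5, -0.1, 0.2, -0.4])
q1, residual = one_bit_quantize(g1, residual)
print(q1)  # each element is ±0.3, the mean magnitude of g1
```

Over many minibatches the residuals keep the accumulated transmitted gradient close to the accumulated true gradient, which is why the aggressive compression barely hurts convergence.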
Microsoft Cognitive Toolkit: First Deep Learning Framework Fully Optimized for Pascal

A toolkit delivering near-linear multi-GPU scaling.

[Chart: AlexNet performance in images/sec, rising from 78 on a dual-socket CPU server through 2,400, 3,500, and 7,600 on 1x, 2x, and 4x P100, to 13,000 on a DGX-1 (8x P100), 170x faster than the CPU server. AlexNet training, batch size 128, Grad Bit = 32, dual-socket E5-2699v4 CPUs (44 cores total); CNTK 2.0b3 (to be released) includes cuDNN 5.1.8 and NCCL 1.6.1, NVLink enabled.]

Superior performance; CNTK scalability: ResNet-50 on DGX-1, tested by Nvidia, Dec. 2016.

CNTK Other Advantages

• Python and C++ API: mostly implemented in C++; low-level plus high-level Python API
• Extensibility: user functions and learners in pure Python
• Readers: distributed, highly efficient built-in data readers
• Internal == External
• Keras interoperability (with the CNTK backend): try models trained with TF/Theano in CNTK, and vice versa; demo: reinforcement learning with DQN
• Binary evaluation: 10x speed-up in model execution

Reinforcement Learning with Flappy Bird: binary evaluation, Conv-Net vs. binary Conv-Net

Outline • Overview • CNTK introduction • Symbolic loop • Batch scheduling • Data parallel training • Educational resources • Conclusions

What is CNTK?

CNTK expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks, supporting relevant network types and applications.
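The composition idea can be made concrete with a plain NumPy sketch of a two-hidden-layer feed-forward network and its cross-entropy criterion (a sketch of the math only, with arbitrary illustrative dimensions; this is not CNTK's API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())        # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(P, y):
    # y is a one-hot label; the slides maximize y^T log P,
    # equivalently minimize -y^T log P, computed here.
    return float(-np.dot(y, np.log(P)))

rng = np.random.default_rng(0)
M, H, J = 8, 16, 4                 # input dim, hidden dim, classes (illustrative)
W1, b1 = rng.standard_normal((M, H)), np.zeros(H)
W2, b2 = rng.standard_normal((H, H)), np.zeros(H)
Wout, bout = rng.standard_normal((H, J)), np.zeros(J)

x = rng.standard_normal(M)         # input
y = np.eye(J)[2]                   # one-hot label for class 2

# The network is literally a composition of simple building blocks:
h1 = sigmoid(x @ W1 + b1)
h2 = sigmoid(h1 @ W2 + b2)
P = softmax(h2 @ Wout + bout)
ce = cross_entropy(P, y)
print(P.sum(), ce)                 # P sums to 1; ce is nonnegative
```

In CNTK the same composition is written symbolically, so the toolkit can differentiate and schedule it rather than execute it eagerly.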
Example: a 2-hidden-layer feed-forward NN, in math and in CNTK's Python form:

h1 = σ(W1 x + b1)                h1 = sigmoid(x @ W1 + b1)
h2 = σ(W2 h1 + b2)               h2 = sigmoid(h1 @ W2 + b2)
P  = softmax(Wout h2 + bout)     P  = softmax(h2 @ Wout + bout)

with input x ∈ R^M, one-hot label y ∈ R^J, and the cross-entropy training criterion

ce = y^T log P                   ce = cross_entropy(P, y)
Σ_corpus ce = max

[Diagram: the same model as a computational graph, with cross_entropy at the root fed by softmax, three affine blocks (W, b with times and plus nodes), two sigmoid nodes, and the inputs x and y at the leaves]

• Nodes: functions (primitives); can be composed into reusable composites
• Edges: values, incl. tensors, sparse
• Automatic differentiation: ∂F / ∂in = ∂F / ∂out · ∂out / ∂in
• Deferred computation, execution engine
• Editable, clonable
• LEGO-like composability allows CNTK to support a wide range of networks and applications

CNTK Unique Features

• Symbolic loops over sequences with dynamic scheduling
• Turn the graph into a parallel program through minibatching
• Unique parallel training algorithms (1-bit SGD, Block Momentum)

Symbolic Loops over Sequential Data

Extend our example to a recurrent network (RNN):

h1(t) = σ(W1 x(t) + R1 h1(t-1) + b1)    h1 = sigmoid(x @ W1 + past_value(h1) @ R1 + b1)
h2(t) = σ(W2 h1(t) + R2 h2(t-1) + b2)   h2 = sigmoid(h1 @ W2 + past_value(h2) @ R2 + b2)
P(t)  = softmax(Wout h2(t) + bout)       P  = softmax(h2 @ Wout + bout)
ce(t) = L^T(t) log P(t)                  ce = cross_entropy(P, L)
Σ_corpus ce(t) = max

Note: the code form has no explicit notion of time.

Compare to Id (Irvine Dataflow) [Arvind et al., TR114a, Dept. ISC, UC Irvine, Dec. 1978; "Executing a Program on the MIT Tagged-Token Dataflow Architecture", 1988]:

@Function
def ip(a: Sequence[tensor], b: Sequence[tensor]):
    s0 = 0
    s_ = ForwardDeclaration()
    s = past_value(s_, initial_value=s0) + a * b
    s_.resolve_to(s)
    s = last(s)
    return s

[Diagram: the RNN's computational graph; each past_value appears as a z^-1 delay node, closing a cycle through the layer's affine and sigmoid nodes]

• CNTK automatically unrolls cycles at execution time
• Cycles are detected with Tarjan's algorithm; only nodes in cycles are unrolled
• Efficient and composable

cf. TensorFlow, where the user unrolls the loop explicitly [https://www.tensorflow.org/versions/r1.0/tutorials/recurrent/index.html]:

lstm = rnn_cell.BasicLSTMCell(lstm_size)
state = tf.zeros([batch_size, lstm.state_size])
probabilities = []
loss = 0.0
for current_batch_of_words in words_in_dataset:
    output, state = lstm(current_batch_of_words, state)
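The recurrence that past_value expresses symbolically is what CNTK unrolls at execution time. The same computation, unrolled explicitly in NumPy (a sketch of the math only, with illustrative shapes; not CNTK's API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_layer(xs, W, R, b, h0):
    # h(t) = sigmoid(x(t) @ W + h(t-1) @ R + b): the recurrence that
    # past_value() expresses symbolically, unrolled step by step here.
    h, out = h0, []
    for x in xs:                   # one step per sequence element
        h = sigmoid(x @ W + h @ R + b)
        out.append(h)
    return out

rng = np.random.default_rng(0)
T, d_in, d_h = 5, 3, 4             # sequence length, input dim, hidden dim
xs = [rng.standard_normal(d_in) for _ in range(T)]
W = rng.standard_normal((d_in, d_h))
R = rng.standard_normal((d_h, d_h))
b = np.zeros(d_h)

hs = rnn_layer(xs, W, R, b, np.zeros(d_h))
print(len(hs), hs[-1].shape)       # → 5 (4,)
```

Writing the loop symbolically instead lets the toolkit batch many variable-length sequences through the same graph, which is the point of the dynamic scheduling discussed next.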
