Microsoft Cognitive Toolkit Outline
Outline
• Overview
• CNTK introduction
• Symbolic loop
• Batch scheduling
• Data parallel training
• Educational resources
• Conclusions

Deep Learning at Microsoft
• Microsoft Cognitive Services
• Skype Translator
• Cortana
• Bing
• HoloLens
• Microsoft Research

ImageNet: Microsoft 2015 ResNet
[Chart: ImageNet classification top-5 error (%) — ILSVRC 2010 NEC America: 28.2; 2011 Xerox: 25.8; 2012 AlexNet: 16.4; 2013 Clarifai: 11.7; 2014 VGG: 7.3; 2014 GoogleNet: 6.7; 2015 ResNet: 3.5]
Microsoft took first place in all five 2015 challenges: ImageNet classification, ImageNet localization, ImageNet detection, COCO detection, and COCO segmentation.

Microsoft Cognitive Services

Image Similarity
Goal: given a query image, find similar images.
• Customer: anonymous ISV (Azure partner)
• Task: given a retail image, find the same product on competitor websites (to compare prices).
• Existing solution: based solely on mining text from the websites of Target, Macy's, etc.
• Customer asked for individual similarity measures (e.g. texture, neck style).
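A common way to implement such an image-similarity service (an illustrative assumption, not the ISV's actual solution) is to embed each catalog image into a feature vector with a conv net and rank candidates by cosine similarity to the query's vector:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_by_similarity(query_vec, catalog):
    """Return catalog image ids sorted by similarity to the query.
    `catalog` maps image id -> feature vector (e.g. from a conv net)."""
    scores = {img_id: cosine_similarity(query_vec, vec)
              for img_id, vec in catalog.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy example with hand-made 3-dim "features" standing in for conv-net embeddings
catalog = {
    "red_dress":  np.array([0.9, 0.1, 0.0]),
    "blue_shirt": np.array([0.1, 0.9, 0.2]),
    "red_skirt":  np.array([0.8, 0.2, 0.1]),
}
query = np.array([1.0, 0.1, 0.0])
ranking = rank_by_similarity(query, catalog)
```

Per-attribute similarity (texture, neck style, etc.) would follow the same pattern with one embedding per attribute.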
Bing / Bing Ads

Microsoft Translator
http://translate.it — PowerPoint plug-in for translating speech to subtitles
Video: https://www.youtube.com/watch?v=eu9kMIeS0wQ;start=70

Microsoft Customer Support Agent

Microsoft Cognitive Toolkit (CNTK)
• Microsoft's open-source deep-learning toolkit: https://github.com/Microsoft/CNTK
• Created by Microsoft Speech researchers (Dong Yu et al.) in 2012 as the "Computational Network Toolkit"
• On GitHub since Jan 2016 under the MIT license
• Python support since Oct 2016 (beta); rebranded as "Cognitive Toolkit"
• External contributions, e.g. from MIT, Stanford, and Nvidia
• Over 80% of Microsoft's internal deep-learning workload runs on CNTK
• First-class on Linux and Windows; Docker support
• APIs: Python, C++, C#, Java, Keras
• Internal == external

CNTK Speed
Benchmarking by HKBU (version 8): http://dlbench.comp.hkbu.edu.hk/
Single Tesla K80 GPU, CUDA 8.0, cuDNN v5.1.
Framework versions: Caffe 1.0rc5 (39f28e4), CNTK 2.0 Beta10 (1ae666d), MXNet 0.93 (32dc3a2), TensorFlow 1.0 (4ac9c09), Torch 7 (748f5e3).

Time per minibatch (lower is better):
                  Caffe       CNTK        MXNet        TensorFlow   Torch
FCN5 (1024)       55.329ms    51.038ms    60.448ms     62.044ms     52.154ms
AlexNet (256)     36.815ms    27.215ms    28.994ms     103.960ms    37.462ms
ResNet (32)       143.987ms   81.470ms    84.545ms     181.404ms    90.935ms
LSTM (256)        -           43.581ms    288.142ms    -            1130.606ms
  (v7 benchmark)              (44.917ms)  (284.898ms)  (223.547ms)  (906.958ms)

"CNTK is production-ready: state-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server."
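The 1-bit gradient quantization behind CNTK's multi-GPU scaling cuts communication cost by sending roughly one bit per gradient value and feeding the quantization error back into the next minibatch. CNTK's actual 1-bit SGD differs in its details (e.g. per-column quantization and aggregation); this NumPy sketch only illustrates the quantize-with-error-feedback idea:

```python
import numpy as np

def one_bit_quantize(grad, residual):
    """Quantize a gradient tensor to one bit per value, carrying the
    quantization error over to the next step (error feedback)."""
    g = grad + residual                     # add back last step's error
    scale = np.mean(np.abs(g))              # one shared magnitude per tensor
    quantized = np.where(g >= 0, scale, -scale)
    new_residual = g - quantized            # error is kept locally, not transmitted
    return quantized, new_residual

# Toy usage: two steps over fake gradients
rng = np.random.default_rng(0)
g1 = rng.normal(size=4)
q1, r1 = one_bit_quantize(g1, np.zeros(4))
q2, r2 = one_bit_quantize(rng.normal(size=4), r1)
```

Only the sign bits plus one scale per tensor need to cross the wire; the residual stays on the worker, so no gradient information is permanently lost.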
[Chart: speed comparison (samples/second, higher = better), December 2015 — CNTK vs. Theano, TensorFlow, Torch 7, and Caffe at 1 GPU, 1 x 4 GPUs, and 2 x 4 GPUs (8 GPUs); CNTK's 8-GPU result achieved with the 1-bit gradient quantization algorithm; Theano only supports 1 GPU.]

Microsoft Cognitive Toolkit: First Deep Learning Framework Fully Optimized for Pascal
Superior performance and near-linear multi-GPU scaling (ResNet-50 on DGX-1, tested by Nvidia, Dec. 2016).
[Chart: AlexNet training throughput (images/sec) — dual-socket CPU server: 78; 1x P100: 2,400; 2x P100: 3,500; 4x P100: 7,600; DGX-1 (8x P100): 13,000; about 170x faster than the CPU server. AlexNet batch size 128, Grad Bit = 32, dual-socket E5-2699v4 CPUs (44 cores total); CNTK 2.0b3 (to be released) includes cuDNN 5.1.8, NCCL 1.6.1, NVLink enabled.]

CNTK Other Advantages
• Python and C++ API
  • Mostly implemented in C++
  • Low-level + high-level Python API
• Extensibility
  • User functions and learners in pure Python
• Readers
  • Distributed, highly efficient built-in data readers
• Internal == external
• Keras interoperability (with CNTK backend)
  • Try models trained with TF/Theano in CNTK (and vice versa)
  • Demo: RL with DQN
• Binary evaluation
  • 10x speed-up in model execution

Reinforcement Learning with Flappy Bird
Binary evaluation: Conv-Net vs. Binary Conv-Net

What is CNTK?
• CNTK expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks, supporting relevant network types and applications.
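The "simple building blocks composed into computational networks" idea can be mimicked in a few lines. This toy sketch is my own illustration, not CNTK's implementation: nodes are functions, edges carry values, and the chain rule ∂F/∂in = ∂F/∂out ∙ ∂out/∂in runs backward over the graph.

```python
import numpy as np

class Node:
    """A toy deferred-computation node: building blocks compose into a graph,
    values flow forward along edges, gradients flow back via the chain rule."""
    def __init__(self, value, parents=(), grad_fns=()):
        self.value = value
        self.parents = parents        # edges to input nodes
        self.grad_fns = grad_fns      # d(out)/d(parent), one callable per parent
        self.grad = np.zeros_like(value)

def times(a, b):
    """Composable primitive: matrix product, with its local gradients."""
    return Node(a.value @ b.value,
                parents=(a, b),
                grad_fns=(lambda g: g @ b.value.T,      # d/da
                          lambda g: a.value.T @ g))     # d/db

def backward(out):
    """Chain rule over the graph: dF/din = dF/dout * dout/din."""
    out.grad = np.ones_like(out.value)
    stack = [out]
    while stack:
        node = stack.pop()
        for parent, gfn in zip(node.parents, node.grad_fns):
            parent.grad = parent.grad + gfn(node.grad)
            stack.append(parent)

x = Node(np.ones((1, 3)))
W = Node(np.full((3, 2), 0.5))
y = times(x, W)   # graph built here; in CNTK this would stay deferred
backward(y)
```

A real toolkit adds much more (deferred execution, composites, editing/cloning), but the node/edge/autodiff core is this small.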
What is CNTK?
Example: 2-hidden-layer feed-forward NN

h1 = σ(W1 x + b1)                  h1 = sigmoid(x @ W1 + b1)
h2 = σ(W2 h1 + b2)                 h2 = sigmoid(h1 @ W2 + b2)
P  = softmax(Wout h2 + bout)       P  = softmax(h2 @ Wout + bout)

with input x ∈ R^M, one-hot label y ∈ R^J, and cross-entropy training criterion

ce = y^T log P                     ce = cross_entropy(P, y)

maximized over the corpus: Σ_corpus ce → max

[Diagram: the same network as a computation graph — cross_entropy(ce) ← softmax(P) ← Wout, bout ← σ(h2) ← W2, b2 ← σ(h1) ← W1, b1 ← x, with label y feeding cross_entropy]

• Nodes: functions (primitives)
  • Can be composed into reusable composites
• Edges: values, incl.
tensors and sparse tensors
• Automatic differentiation: ∂F/∂in = ∂F/∂out ∙ ∂out/∂in
• Deferred computation, execution engine
• Editable, clonable
• LEGO-like composability allows CNTK to support a wide range of networks & applications

CNTK Unique Features
• Symbolic loops over sequences with dynamic scheduling
• Turning the graph into a parallel program through minibatching
• Unique parallel training algorithms (1-bit SGD, Block Momentum)

Symbolic Loops over Sequential Data
Extend the example to a recurrent network (RNN):

h1(t) = σ(W1 x(t) + R1 h1(t-1) + b1)     h1 = sigmoid(x @ W1 + past_value(h1) @ R1 + b1)
h2(t) = σ(W2 h1(t) + R2 h2(t-1) + b2)    h2 = sigmoid(h1 @ W2 + past_value(h2) @ R2 + b2)
P(t)  = softmax(Wout h2(t) + bout)       P = softmax(h2 @ Wout + bout)
ce(t) = L^T(t) log P(t)                  ce = cross_entropy(P, L)
Σ_corpus ce(t) → max

Note that the CNTK code has no explicit notion of time: past_value(h1) stands for h1 delayed by one step.

Compare to id (Irvine Dataflow) [Arvind et al., TR114a, Dept ISC, UC Irvine, Dec 1978; "Executing a Program on the MIT Tagged-Token Dataflow Architecture", 1988]:

@Function
def ip(a: Sequence[tensor], b: Sequence[tensor]):
    s0 = 0
    s_ =
ForwardDeclaration()
    s = past_value(s_, initial_value=s0) + a * b
    s_.resolve_to(s)
    s = last(s)
    return s

Symbolic Loops over Sequential Data
[Diagram: the RNN computation graph — the recurrences over h1 and h2 are closed through one-step delay (z⁻¹) nodes feeding back through R1 and R2]
• CNTK automatically unrolls cycles at execution time
  • Cycles are detected with Tarjan's algorithm (only nodes in cycles are unrolled)
  • Efficient and composable
• Cf. TensorFlow, where the user unrolls the loop explicitly [https://www.tensorflow.org/versions/r1.0/tutorials/recurrent/index.html]:

lstm = rnn_cell.BasicLSTMCell(lstm_size)
state = tf.zeros([batch_size, lstm.state_size])
probabilities = []
loss = 0.0
for current_batch_of_words in words_in_dataset:
    output, state =