
Machine Learning on TensorFlow

liaobaohua@.com

Mar, 2018

Outlines

● Machine learning introduction
● What is TensorFlow?
● TensorFlow in China
● TensorFlow in
● TensorFlow Basics
  ○ Distributed TensorFlow
● New Features
  ○ Eager execution
  ○ TF Lite
  ○ XLA
  ○ Performance
  ○ Dataset (tf.datasets)
  ○ Cloud TPU

NPC CONTROL

Camera Effects: Input, Saturation, Defocus

[Figure: two images combined, "+ = ?", followed by the stylized result, "+ ="]

Image source: Wikimedia. A Neural Algorithm of Artistic Style, http://arxiv.org/abs/1508.06576

Source: Instacart

Source: Google blog

Ophthalmology

Radiology

“The network performed similarly to senior orthopedic surgeons when presented with images at the same resolution as the network.”
www.tandfonline.com/doi/full/10.1080/17453674.2017.1344459

Algorithm: 0.95
Ophthalmologist (median): 0.91

Our shared goal: machine learning makes the future better

What is TensorFlow

Tensor: multidimensional array

Flow: flow

TensorFlow: Computation Graph

● Computation is defined as a graph
● Nodes represent computation or states
  ○ Can be run on any device
● Data flows along edges
● The graph can be defined in any language
● The graph is compiled and optimized

TensorFlow: ML for Everyone

● An open-source machine learning platform for everyone
● Fast, flexible, and production-ready
● Scales from research to production

Machine learning gets complex quickly

Deep Learning: just like regular learning, but with more layers. Inception v3 has ~25 M parameters.

TensorFlow Handles Complexity

● Modeling complexity
● Distributed System
● Heterogeneous System

[Diagram: TensorFlow API stack, top to bottom]
Canned Estimators
Keras Model, Estimator, Datasets
Layers
Python Frontend, C++, Java, Go, ...
TensorFlow Distributed Execution Engine
CPU, GPU, Android, iOS, XLA (CPU, GPU, TPU, ...)

TensorFlow provides great tools like TensorBoard

TensorBoard

TensorFlow supports many platforms

CPU, GPU, iOS, Android, Raspberry Pi, 1st-gen TPU, Cloud TPU

TensorFlow supports many languages

Java

An active open-source community

Positive Reviews
● 81,000+ GitHub Stars
● 23,000+ GitHub repositories with ‘TensorFlow’ in the title

Rapid Development
● 1,100+ Contributors
● 21,000+ Commits in 21 months

Direct Engagement
● 8,000+ Stack Overflow questions answered
● 100+ Community-submitted GitHub issues responded to weekly

[Chart: TensorFlow GitHub Star Count, 2013 to 2017, y-axis 0 to 50000, reaching 81K+]

Confidential + Proprietary

TensorFlow appears in courses

University of California, Berkeley; Udacity; Coursera; Stanford University; deeplearning.ai; University of Toronto; Andreessen Horowitz

TensorFlow in China

TensorFlow Chinese website: .google.cn

Follow TensorFlow on WeChat

JD, Xiaomi

Document Scanning and Text Detection

With convolutional neural networks running on TensorFlow, we bring powerful features to Youdao Translate and Youdao Note.

Document Scanning: real time, stable, applicable to multiple scenarios

Text Detection: accurate, fast, fewer computing resources needed

More research examples

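● “Sensing Urban Land-Use Patterns by Integrating Google TensorFlow and Scene-Classification Models” -- Sun Yat-sen University
● “Prediction of Sea Surface Temperature using Long Short-Term Memory” -- Ocean University of China

The iotAI-one AI sorting machine is a second-generation prototype of an AI sorting system. It was built by the team of Chen Wanjun at Jiangxi Environmental Engineering Vocational College (the applicant for the iotAI trademark) on Google's open-source AI framework TensorFlow, with intelligent sorting as its application.

It is the prototype of a device that is about to be commercialized: a machine-vision-based intelligent appearance inspection and sorting device for NdFeB blanks.

SDK downloaded more than 10 million times, in 180 countries and regions.

TensorFlow in China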

Supporting AI education
● Supporting all levels, from universities and vocational schools to K-12 schools
● Investing millions of RMB in AI education in 2018, following the 5-year MoU signed with China's Ministry of Education
● Externalizing machine learning courses to train more teachers

Developing strong communities
● TensorFlow DevSummit viewing parties in February
● TensorFlow symposiums in Beijing and Shanghai
● Nationwide GDG DevFests in 27 cities
● More online and offline community activities coming in 2018!

TensorFlow in Google Neural Machine Translation

Reduces Errors By 55%-85%

Sutskever et al. NIPS, Dec 2014
Google Research Blog, Sept 2016
Wu et al. arXiv, Sept 2016

Google Translate, a truly global product...

● 1B+ translations every single day, that is 1M books
● 1B+ monthly active users
● 103 Google Translate languages cover 99% of online population

Machine Translation: A Brief History

Statistical machine learning -> Statistical MT

Deep learning -> Neural MT

[Timeline: 1950’s, 1960’s, 1990’s, 2000’s, 2014, 2016. Labels, in extraction order: Google’s, Early research, SYSTRAN founded, Example-based MT, Phrase-based MT, Neural MT, Multilingual Neural MT, ALPAC report, IBM models, Syntax-based MT, Word-based MT, Semantics-based MT, Rule-based MT]

NMT: A Brief History

Figure credit - Orhan Firat

Old Tech: PBMT

Translation pyramid

Neural Machine Translation: The Game Changer

Phrase-based machine translation: discrete, local decision
Neural Machine Translation (NMT): continuous, global decision

Sequence Modeling

● What does this mean?
● Predict the likelihood of a sequence:
  ○ What is the probability P(X1, X2, ..., XN)?
● A sequence (X1, X2, ..., XN) can be:
  ○ A piece of text.

Figure(s) credit - Orhan Firat

● Non-Markovian sequence modelling
  ○ Markovian modelling ignores dependency beyond context
  ○ Directly model the original conditional probabilities
    ■ $P(X_1, X_2, \ldots, X_N) = \prod_{t=1}^{N} P(X_t \mid X_1, \ldots, X_{t-1})$
● Input sequences can be variable length
● How can we handle variable length sequences?
  ○ Recurrence

Credits - Orhan Firat, Martin Görner

Recurrent Neural Networks

● Modelling P(the,cat,sat,on) with RNNs:
  ○ an input sequence (X1, X2, ..., XN)
  ○ an internal memory state - tracks state so far
  ○ a function - recurses over input Xi and memory
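The three ingredients listed above (input sequence, memory state, recurrent function) can be sketched in plain Python. This is a minimal illustration, not TensorFlow code and not the model from the slides: the vocabulary, the hidden size, the tanh/softmax choices, and the untrained random weights are all assumptions made for the example.

```python
import math
import random

VOCAB = ["the", "cat", "sat", "on"]
HIDDEN = 3
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Hypothetical, untrained parameters.
W_xh = rand_matrix(HIDDEN, len(VOCAB))   # input word -> memory
W_hh = rand_matrix(HIDDEN, HIDDEN)       # previous memory -> memory
W_hy = rand_matrix(len(VOCAB), HIDDEN)   # memory -> scores for the next word

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def step(h, word):
    """The recurrent function: fold one input word into the memory state h."""
    x = [1.0 if w == word else 0.0 for w in VOCAB]   # one-hot input
    pre = [a + b for a, b in zip(matvec(W_xh, x), matvec(W_hh, h))]
    return [math.tanh(p) for p in pre]

def sequence_probability(words):
    """P(X1..XN) as the product of P(Xt | X1..Xt-1), one factor per step."""
    h = [0.0] * HIDDEN
    prob = 1.0
    for word in words:
        next_word_dist = softmax(matvec(W_hy, h))    # P(Xt | everything so far)
        prob *= next_word_dist[VOCAB.index(word)]
        h = step(h, word)
    return prob

p = sequence_probability(["the", "cat", "sat", "on"])
```

Because the memory state is carried from step to step, the same function handles a sequence of any length.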

[Figure: an RNN unrolled over the inputs "the cat sat", with outputs P(the), P(cat|...), P(sat|...), P(on|...)]

Recurrent Neural Networks

● Deepest neural networks possible
  ○ Unlimited depth
● Most general neural networks
  ○ They are general computers
● Universal function approximators
  ○ Can learn any program

Credits - Orhan Firat, Chris Olah

Training RNNs

● Unroll the loop
● Apply back-propagation through time (BPTT) (Mozer et al.’95)
● Problems: vanishing or exploding gradients (Hochreiter et al.’91, Bengio et al.’94)
  ○ Long Short Term Memory Units (LSTM) (Hochreiter & Schmidhuber’95)
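To see why unrolling causes trouble, consider the simplest possible case. The sketch below assumes a scalar, linear recurrence h_t = w * h_{t-1} (an illustrative simplification, not the networks on the slides) and multiplies the per-step factors that back-propagation through time would accumulate.

```python
def gradient_through_time(w, steps):
    """d h_T / d h_0 for the linear scalar recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w   # each unrolled step contributes one factor of w
    return grad

vanishing = gradient_through_time(0.5, 50)   # |w| < 1: the gradient shrinks toward 0
exploding = gradient_through_time(1.5, 50)   # |w| > 1: the gradient blows up
```

In a real RNN the factor at each step is a Jacobian matrix rather than a scalar, but the same repeated product is what drives gradients toward zero or infinity over long sequences.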

Credits - Chris Olah

LSTMs

[Figures: cell diagrams comparing a Vanilla RNN with an LSTM RNN, followed by further LSTM diagrams]

Sequence to Sequence Modelling

● Learn to map: X1, X2,...,XN -> Y1, Y2,...,YN
● Encoder/Decoder framework
● Theoretically any sequence length for input/output works

[Diagram: the encoder reads "the cat sat"; a Bottleneck state connects it to the decoder, which reads "Die Katze saß" and emits "Die Katze saß EOS"]

Deep Sequence to Sequence

[Diagram: Encoder LSTMs and Decoder LSTMs with a SoftMax layer on top; node labels X2, X3, Y1, Y2, Y3]

Attention Mechanism

● Solves information bottleneck problem

Google Neural Machine Translation Model

GNMT

https://github.com/tensorflow/nmt

Google Open Models

https://github.com/tensorflow

Learn2Learn & Evolutionary Algorithms

AM!!!

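[Diagram: architecture search loop]
Controller (RNN): sample architecture A with probability p
Train a child network with architecture A to obtain accuracy R
Compute the gradient of p and scale it by R to update the controller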
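The loop above is a policy-gradient (REINFORCE-style) update. A minimal sketch follows, under loud assumptions: the controller is reduced to three logits instead of an RNN, the three candidate architectures and their accuracies are invented, and "training the child network" is replaced by a table lookup.

```python
import math
import random

# Hypothetical stand-in for "train a child network and measure accuracy R".
ACCURACY = {"2-layer": 0.70, "4-layer": 0.80, "8-layer": 0.90}
ARCHITECTURES = list(ACCURACY)

def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, action, reward, lr):
    """Move the logits along reward * grad log p(action)."""
    p = softmax(logits)
    # d log p[action] / d logits[k] = (1 if k == action else 0) - p[k]
    return [
        logit + lr * reward * ((1.0 if k == action else 0.0) - p[k])
        for k, logit in enumerate(logits)
    ]

def search(steps=200, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(ARCHITECTURES)   # controller parameters
    for _ in range(steps):
        p = softmax(logits)
        action = rng.choices(range(len(p)), weights=p)[0]   # sample architecture A
        reward = ACCURACY[ARCHITECTURES[action]]            # accuracy R of the child
        logits = reinforce_step(logits, action, reward, lr) # scale the gradient by R
    return softmax(logits)

probabilities = search()
```

Each update raises the probability of the sampled architecture in proportion to its reward, so higher-accuracy architectures are reinforced more strongly over time.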

Why Evolution?

Worker

● Pick 2 at random
● Kill worst
● Select best as parent
● Copy-mutate parent
● Train, evaluate child
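The worker loop above can be sketched as a small tournament-style evolution. Everything concrete here is an assumption for illustration: an "architecture" is a single number, the fitness function is a toy with its peak at 3.0, and the mutation is Gaussian noise. A single worker is shown; the slides run many in parallel.

```python
import random

def train_and_evaluate(arch):
    """Hypothetical fitness: stands in for training a child network."""
    return -(arch - 3.0) ** 2

def evolve(population_size=20, steps=500, seed=0):
    rng = random.Random(seed)
    population = []
    for _ in range(population_size):
        arch = rng.uniform(-10.0, 10.0)
        population.append((arch, train_and_evaluate(arch)))

    for _ in range(steps):
        i, j = rng.sample(range(population_size), 2)          # pick 2 at random
        worst, best = sorted((i, j), key=lambda k: population[k][1])
        parent_arch = population[best][0]                     # select best as parent
        child_arch = parent_arch + rng.gauss(0.0, 0.5)        # copy-mutate parent
        child = (child_arch, train_and_evaluate(child_arch))  # train, evaluate child
        population[worst] = child                             # kill worst, child takes its slot
    return population

initial = evolve(steps=0)
final = evolve(steps=500)
```

Because only the worse of each pair is removed, the best fitness in the population can never get worse from one step to the next.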

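[Diagram: several Workers running in parallel]

● Insert convolution layer
● Remove convolution layer
● Insert nonlinearity layer
● Remove nonlinearity layer
● Insert skip connection
● Remove skip connection
● Change stride
● Change number of channels
● Change horizontal filter size
● Change vertical filter size
● Change learning rate
● No change
● Reset weights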

Parsey McParseface

https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html

[Figure: recurrent cell diagram; labels: ht, Prediction, A, Xt]

Google's Project Magenta

https://magenta.tensorflow.org/

[Figure: retinal images, Healthy vs. Diseased, with Hemorrhages marked. Grades: No DR, Mild DR, Moderate DR, Severe DR, Proliferative DR]

Robotics

Data center optimization

[Chart: PUE, from low to high; labels: machine learning control on, machine learning control off]

TensorFlow Basics

TensorFlow: Computation Graph

● Computation is defined as a graph
● Nodes represent computation or states
  ○ Can be run on any device
● Data flows along edges
● The graph can be defined in any language
● The graph is compiled and optimized

Simple ML Model: Linear Regression

Core TF code without using high-level API

y = Wx + b

    import tensorflow as tf

    # Model parameters
    W = tf.Variable([0.3], dtype=tf.float32)
    b = tf.Variable([0.1], dtype=tf.float32)
    # Model input and prediction
    x = tf.placeholder(tf.float32)
    y = W*x + b

(x, y’)

    # Target values
    y_prime = tf.placeholder(tf.float32)

$\text{loss} = \sum_i (y_i - y'_i)^2$

    # Minimize loss.
    loss = tf.reduce_sum(tf.square(y - y_prime))

Dataflow based computation

[Diagram: Python Program -> TensorFlow Graph]

https://www.tensorflow.org/get_started/get_started

Build a graph; then run it.

    ...
    c = tf.add(a, b)
    ...
    session = tf.Session()
    # The feed dict maps placeholders to values
    value_of_c = session.run(c, {a: 1, b: 2})

[Diagram: nodes a and b feed an add node that produces c]

Any Computation is a TensorFlow Graph

[Diagram: graph with nodes examples, weights, MatMul, biases, Add, Relu, labels, Xent]

Any Computation is a TensorFlow Graph: variables with state

[Diagram: the same graph, with biases and weights marked as variables]

Automatic Differentiation

Automatically add ops which compute gradients for variables
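As a rough illustration of what "adding gradient ops" means, here is a tiny reverse-mode differentiation sketch in plain Python. It is not TensorFlow's implementation: the `Node` class, the four ops, and the sample values are assumptions chosen to mirror the linear regression loss for a single example.

```python
class Node:
    """A graph node: a value, its parents, and the local gradient to each parent."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # tuples of (parent_node, d(self)/d(parent))
        self.grad = 0.0

def add(a, b):
    return Node(a.value + b.value, ((a, 1.0), (b, 1.0)))

def sub(a, b):
    return Node(a.value - b.value, ((a, 1.0), (b, -1.0)))

def mul(a, b):
    return Node(a.value * b.value, ((a, b.value), (b, a.value)))

def square(a):
    return Node(a.value ** 2, ((a, 2.0 * a.value),))

def backward(output):
    """Walk the graph in reverse topological order, accumulating gradients."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local   # chain rule

# loss = (W*x + b - y_prime)^2 for one example
W, b = Node(0.3), Node(0.1)
x, y_prime = Node(2.0), Node(1.0)
loss = square(sub(add(mul(W, x), b), y_prime))
backward(loss)
```

After `backward(loss)`, `W.grad` and `b.grad` hold the derivatives an optimizer would use to update the variables.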

[Diagram: biases ... Xent, followed by an added grad op]

Any Computation is a TensorFlow Graph: with state

[Diagram: the output of grad is multiplied (Mul) by the learning rate and applied to biases with −=]

Any Computation is a TensorFlow Graph: distributed

[Diagram: the same graph split across Device A and Device B]

Devices: Processes, Machines, CPUs, GPUs, TPUs, etc

Send and Receive Nodes: distributed

[Diagram: Send and Recv node pairs inserted between Device A and Device B]

Devices: Processes, Machines, CPUs, GPUs, TPUs, etc

TensorFlow APIs

[Diagram: TensorFlow API stack. Canned Estimators; Keras Model, Estimator, Datasets; Layers; Python Frontend, C++, Java, Go, ...; TensorFlow Distributed Execution Engine; CPU, GPU, Android, iOS, XLA (CPU, GPU, TPU, ...)]

API: Layers

    # conv 5x5 (relu)
    x = tf.layers.conv2d(x, kernel_size=[5,5], ...)
    # max pool 2x2
    x = tf.layers.max_pooling2d(x, pool_size=[2,2], ...)
    # conv 5x5 (relu)
    x = tf.layers.conv2d(x, kernel_size=[5,5], ...)
    # max pool 2x2
    x = tf.layers.max_pooling2d(x, pool_size=[2,2], ...)
    # dense (relu)
    x = tf.layers.dense(x, ..., activation=tf.nn.relu)
    # dropout 0.5
    x = tf.layers.dropout(x, 0.5)
    # dense (linear)
    x = tf.layers.dense(x, ...)

+ Fast iteration
+ Best practices
+ Layers compatible with Keras: dense(x, 10) == Dense(10)(x)
+ See: tf.keras

API: Keras

Toy video QA problem

“A previously very hard problem, made accessible to anyone with basic Python scripting abilities”

[Diagram: video QA model, built up over several slides, bottom to top]
Inputs: video as 5D tensor; question as integer sequences (2D tensor)
Video branch: TimeDistributed InceptionV3, then LSTM
Question branch: Embedding, then LSTM
Concat
Dense
Dense
Output: answer word as one-hot vector

Access to all features of tf.training.Experiment: distributed training, Cloud ML training, hyperparameter tuning

    model = tf.keras.models.Model(inputs=input, outputs=output)
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy', 'top_k_categorical_accuracy'])

[Diagram: TensorFlow API stack, repeated]

Model: inputs, labels -> predictions

Estimator

[Diagram: model_fn takes inputs and labels and produces train_op, eval_op, and predictions]

[Diagram: Estimator wraps model_fn together with Sessions, Graphs, and Loops, and exposes fit(), evaluate(), predict(), and export_savedmodel()]

API: Canned Estimator

    # TF 1.x contrib modules
    from tensorflow.contrib.layers import (
        real_valued_column, sparse_column_with_integerized_feature,
        embedding_column)
    from tensorflow.contrib.learn import LinearRegressor, DNNRegressor

    area = real_valued_column("square_foot")
    rooms = real_valued_column("num_rooms")
    zip_code = sparse_column_with_integerized_feature("zip_code", 100000)

    regressor = LinearRegressor(feature_columns=[area, rooms, zip_code], ...)

    regressor.fit(train_input_fn)
    regressor.evaluate(eval_input_fn)

    # Same feature columns with a deep model; zip_code is embedded
    regressor = DNNRegressor(
        feature_columns=[area, rooms, embedding_column(zip_code, 8)],
        hidden_units=[1024, 512, 256])
    regressor.fit(train_input_fn)
    regressor.evaluate(eval_input_fn)

Distributed TensorFlow

Data Parallelism

[Diagram: Parameter Servers exchange Δp’ and p’ with the Model Replicas, and each replica reads its own part of the Data]
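The parameter-server pattern in the diagram can be sketched without any cluster. The example below is a single-process illustration, not distributed TensorFlow: the `ParameterServer` class, the one-weight model y = w * x, the data, and the learning rate are all assumptions. Each replica pulls the current parameter (p’), computes a gradient on its own shard, and pushes the update (Δp’) back.

```python
def gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one data shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

class ParameterServer:
    def __init__(self, w, learning_rate):
        self.w = w
        self.learning_rate = learning_rate

    def pull(self):
        return self.w                            # replicas fetch the current parameters

    def push(self, delta):
        self.w -= self.learning_rate * delta     # apply one replica's update

data = [(float(x), 3.0 * x) for x in range(1, 9)]   # hypothetical data, true w = 3
shards = [data[0::2], data[1::2]]                   # one shard per model replica
ps = ParameterServer(w=0.0, learning_rate=0.01)

for _ in range(200):
    for shard in shards:       # in a real cluster these replicas run in parallel
        w = ps.pull()
        ps.push(gradient(w, shard))
```

Running the replicas one after another keeps the sketch deterministic; in an asynchronous deployment a replica may push a gradient computed from slightly stale parameters.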

Describe a cluster: ClusterSpec

    tf.train.ClusterSpec({
        "worker": [
            "worker0.example.com:2222",
            "worker1.example.com:2222",
            "worker2.example.com:2222"
        ],
        "ps": [
            "ps0.example.com:2222",
            "ps1.example.com:2222"
        ]})

Share the graph across devices

    # Variables live on the parameter server tasks
    with tf.device("/job:ps/task:0"):
        weights_1 = tf.Variable(...)
        biases_1 = tf.Variable(...)
    with tf.device("/job:ps/task:1"):
        weights_2 = tf.Variable(...)
        biases_2 = tf.Variable(...)
    # Computation runs on a worker task
    with tf.device("/job:worker/task:7"):
        input, labels = ...
        layer_1 = tf.nn.relu(tf.matmul(input, weights_1) + biases_1)
        logits = tf.nn.relu(tf.matmul(layer_1, weights_2) + biases_2)
        train_op = ...
    with tf.Session("://worker7.example.com:2222") as sess:
        for _ in range(10000):
            sess.run(train_op)

Input Pipelines with Queues

[Diagram: parallel Reader -> Decoder -> Preprocess chains feeding Workers; the queues between stages hold Filenames, Raw Examples, and Examples]

https://arxiv.org/abs/1604.00981

Image Model (Inception) Synchronous Training

[Chart: training time in Hours for 1 GPU, 10 GPUs, and 50 GPUs; 2.6 hours vs. 79.3 hours (30.5X)]

New Features in TensorFlow

Eager Execution

As simple as possible

Simple, Instant error

    a = tf.constant(6)
    while a != 1:
        if a % 2 == 0:
            a = a / 2
        else:
            a = 3 * a + 1
        print(a)

    # Outputs
    tf.Tensor(3, dtype=int32)
    tf.Tensor(10, dtype=int32)
    tf.Tensor(5, dtype=int32)
    tf.Tensor(16, dtype=int32)
    tf.Tensor(8, dtype=int32)
    tf.Tensor(4, dtype=int32)
    tf.Tensor(2, dtype=int32)
    tf.Tensor(1, dtype=int32)

Classic TensorFlow: Some Boilerplate

    import tensorflow as tf

    x = tf.placeholder(tf.float32, shape=[3])
    y = tf.placeholder(tf.float32, shape=[3])
    a = x + y
    with tf.Session() as sess:
        a_out = sess.run(a, feed_dict={x: [1., 1., 1.], y: [0., 2., 4.]})
        print(a_out)

    # outputs: [ 1. 3. 5.]

Eager: No Boilerplate

    from tensorflow.contrib.eager.python import tfe

    x = tfe.Tensor([1., 1., 1.])
    y = tfe.Tensor([0., 2., 4.])
    print(x + y)

    # outputs: tfe.Tensor([ 1. 3. 5.], dtype=tf.float64)

Eager: Instant errors

    from tensorflow.contrib.eager.python import tfe

    a = tfe.ops.gather([0, 2, 4], 7)

    InvalidArgumentError: indices = 7 is not in [0, 3) [Op:Gather]

Eager: Python control flow

    from tensorflow.contrib.eager.python import tfe

    a = tfe.Tensor(6)
    while a != 1:
        if a % 2 == 0:
            a = a / 2
        else:
            a = 3 * a + 1
        print(a)

    # outputs:
    tfe.Tensor(3, dtype=int32)
    tfe.Tensor(10, dtype=int32)
    tfe.Tensor(5, dtype=int32)
    tfe.Tensor(16, dtype=int32)
    tfe.Tensor(8, dtype=int32)
    tfe.Tensor(4, dtype=int32)
    tfe.Tensor(2, dtype=int32)
    tfe.Tensor(1, dtype=int32)

Eager supports GPUs

● Controlled using .as_gpu_tensor(), .as_cpu_tensor(), with tfe.device("gpu:0"):
● Checks for errors and then runs kernel asynchronously
● Uses tensor handles instead of values
● Have to copy to CPU to inspect a value (e.g. print)
  ○ This blocks until the value is ready
  ○ Automatic for if/while

Eager improves usability

● Debugging
  ○ The Python debugger can be used out of the box
  ○ The Python profilers can be used out of the box
  ○ Errors have meaningful stack traces
● Interactivity
  ○ Teaching new users is easier, as trying things is immediate
● High-level libraries
  ○ Will be fully compatible with tf.estimator and tf.dataset

But...we want graph benefits too!

● Speed
● Distributed execution
● Import/export
● Graph transformations
  ○ mobile inference
  ○ memory optimizations
  ○ layout optimizations
● Compilation for multiple platforms
  ○ CPUs, GPUs, TPUs, mobile, embedded, ...

What is TensorFlow Lite?

A lightweight machine learning library for mobile and embedded devices.

Easier

Faster

Smaller

Our inspiration

● Our mobile experts at Google: inspired by internal solutions
● Our partners on Android: excitement about targeting custom hardware
● Our users: more and more interest, and GitHub questions

Our goals

● Small Binary Size
● Startup / Latency
● Low-overhead
● Throughput
● Optimized Set of Core Kernels
● Quantized Kernels

What is TensorFlow Lite?

● Model file-format
● Interpreter
● Ops / Kernels
● Interface to Hardware Acceleration

TensorFlow Lite

[Diagram, built up over several slides. Workstation: TensorFlow (Graphdef, Checkpoint), Converter, TensorFlow Lite Format. Mobile: TensorFlow Lite Interpreter, Neon Kernels, Hardware Acceleration Interface.]

What is TensorFlow Lite?

Model file-format

● Lightweight
● Few dependencies
● TOCO GraphDef conversion
● Quantization is first-class
● Flatbuffer-based

Flatbuffers

● Open source Google project
● Originally designed for video games
● Similar to protobufs, except:
  ○ More memory efficient
  ○ No unmarshalling step (mmap)
  ○ Less code

Interpreter

● Optimized for small devices
● Few dependencies
● Small binary (~100k without ops, ~300k with ops)
● Fast load-time
● Static memory plan
● Static execution plan
● No control flow

Ops / Kernels

● Core builtin ops
● NEON on ARM
● Float & quantized
● Many kernels optimized for our mobile apps
● Pre-fused activations and biases
● C API for custom ops

Interface to Hardware Acceleration

[Diagram: on Mobile, the App uses the TensorFlow Lite Interpreter, which loads the tflite binary (flatbuffer) and runs Neon Kernels or Hardware Acceleration]

How do I use it?

https://www.tensorflow.org/mobile/tflite/

Option 1: Run from Command Line Tool.

Freeze Graph:

    bazel build tensorflow/python/tools:freeze_graph && \
      bazel-bin/tensorflow/python/tools/freeze_graph \
      --input_graph=/tmp/mobilenet_v1_224.pb \
      --input_checkpoint=/tmp/checkpoints/mobilenet-10202.ckpt \
      --input_binary=true \
      --output_graph=/tmp/frozen_mobilenet_v1_224.pb \
      --output_node_names=MobileNet/Predictions/Reshape_1

Convert:

    bazel build tensorflow/contrib/lite/toco:toco && \
      bazel run --config=opt tensorflow/contrib/lite/toco:toco -- \
      --input_file=$(pwd)/mobilenet_v1_1.0_224/frozen_graph.pb \
      --input_format=TENSORFLOW_GRAPHDEF --output_format=TFLITE \
      --output_file=/tmp/mobilenet_v1_1.0_224.lite --inference_type=FLOAT \
      --input_type=FLOAT --input_arrays=input \
      --output_arrays=MobilenetV1/Predictions/Reshape_1 \
      --input_shapes=1,224,224,3

Option 2: Integrate with Python Workflow.

    import tensorflow as tf

    img = tf.placeholder(name="img", dtype=tf.float32, shape=(1, 64, 64, 3))
    val = img + tf.constant([1., 2., 3.]) + tf.constant([1., 4., 4.])
    out = tf.identity(val, name="out")
    with tf.Session() as sess:
        tflite_model = tf.contrib.lite.toco_convert(sess.graph_def, [img], [out])
        # Write the converted model to disk
        with open("converted_model.tflite", "wb") as f:
            f.write(tflite_model)

TensorFlow Mobile

[Diagram: on the Workstation, TensorFlow, Converter, TensorFlow Lite Format; on Mobile, TensorFlow Lite Interpreter]

XLA

A key question: Why write every new op in C++? Why can't we just compose them out of existing TF ops?

An answer: you don't want to pay a performance penalty.

But, what if op composition had the performance of C++?

The kind of stuff SoftMax has inside...

    auto weighted = Dot(input, weights);
    auto weighted_sum = Add(weighted, biases, /*broadcast=*/{1});
    auto max_activation = Reduce(
        weighted_sum, Constant(MinValue(F32)), Max, /*reduce_dims=*/{1});
    auto activations_normalized =
        Exp(Sub(weighted_sum, max_activation, /*broadcast=*/{0}));
    auto activations_sum = Reduce(activations_normalized, Constant(0.0f), Add,
                                  /*reduce_dims=*/{1});
    auto predicted = Div(activations_normalized, activations_sum,
                         /*broadcast=*/{0});

primitive operation composition ⇒ fused & optimized composite kernel

Automatic Operation Fusion

XLA composes & specializes primitive operations
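To make the idea of fusion concrete, the sketch below writes softmax twice in plain Python: once as a chain of small primitives that each allocate a temporary list, and once as a single hand-fused function. It only illustrates the concept; it is not XLA, and the hand-written fused version merely stands in for what a compiler might generate.

```python
import math

# Unfused: each primitive makes its own pass and materializes a temporary list.
def reduce_max(xs):
    return max(xs)

def sub(xs, c):
    return [x - c for x in xs]

def exp(xs):
    return [math.exp(x) for x in xs]

def reduce_sum(xs):
    return sum(xs)

def div(xs, c):
    return [x / c for x in xs]

def softmax_composed(xs):
    shifted = sub(xs, reduce_max(xs))
    activations = exp(shifted)
    return div(activations, reduce_sum(activations))

# "Fused": the same math with one output buffer and no intermediate lists.
def softmax_fused(xs):
    top = max(xs)
    out = [0.0] * len(xs)
    total = 0.0
    for i, x in enumerate(xs):
        out[i] = math.exp(x - top)
        total += out[i]
    for i in range(len(out)):
        out[i] /= total
    return out

composed = softmax_composed([1.0, 2.0, 3.0])
fused = softmax_fused([1.0, 2.0, 3.0])
```

Both functions compute the same values; the fused one avoids the intermediate buffers, which is the kind of saving automatic fusion aims for without hand-writing a kernel per combination.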

Note: this is all expressible in TensorFlow

XLA removes the performance concern

Avoids combinatorial explosion of op fusions (e.g. for custom LSTM cell): macro-ops * primitives * dim sizes * backends * devices!

TensorFlow in one picture

[Diagram: a TensorFlow Graph built from the python, java, go, or C frontends runs on the TensorFlow runtime (C++), where the Executor dispatches ops such as add and softmax to Kernels]

tf2xla: Symbolic graph execution

[Diagram: TensorFlow Graph, TensorFlow runtime (C++) with a Local Executor and tf2xla kernels for add and softmax, XLA Graph]

TensorFlow Performance

Benchmarks!

https://www.tensorflow.org/performance/benchmarks (batch size 64)

TensorFlow Scaling

Near-linear performance gains with each additional 8x NVIDIA® Tesla® K80 server added to the cluster.

Try it yourself!

Reference benchmark results, a guide to building high performance models, and breakdowns of how to achieve peak performance on various platforms:

● NVIDIA® DGX-1™
● with NVIDIA® Tesla® K80s
● Amazon EC2 with NVIDIA® Tesla® K80s

Benchmarks: Try them yourself!

https://www.tensorflow.org/performance/benchmarks

Fast accelerators expose new bottlenecks

[Chart: training step time, split into Input processing (CPU) and Model Training (CPU vs. Accelerator, 10x faster)]

[Diagram: TensorFlow API stack, repeated]

New Dataset API for data infeed

    input_dataset = tf.contrib.data.TFRecordDataset([FLAGS.train_file])
    dataset = (input_dataset
               .repeat()
               .map(parser, num_threads=16)  # Parallel map
               .batch(FLAGS.batch_size))

Second-generation TPU

TPU Pod: 64 second-generation TPUs

Second-generation TPU: for neural network training and inference

● 180 teraflops, 64 GB of memory
● Designed so that multiple TPUs can be connected together

Performance preview

ResNet-50 trains to ~74% accuracy in under 20 hours

TensorFlow Research Cloud = 1000 x
to accelerate open machine learning research

One idea: AutoML

Zoph, B. & Le, Q. https://arxiv.org/pdf/1611.01578.pdf

Many other possibilities!

Cloud TPU Alpha Program
TensorFlow Research Cloud Program

Sign up at g.co/tpusignup

Cloud TPUs are coming soon to Google Cloud

Thank you!

WeChat