李双峰
[Figure: image panels labeled Input, Saturation, Defocus]
• An open-source machine learning platform for everyone
• Fast, flexible, and production-ready
• Scales from research to production
Deep Learning
Just like regular learning, but with more layers.
Inception v3 has ~25 M parameters.
Canned Estimators
Keras Estimator Model Datasets Layers
Python Frontend C++ Java Go ...
TensorFlow Distributed Execution Engine
CPU GPU Android iOS XLA
CPU GPU TPU ...
CPU GPU
iOS Android
Raspberry Pi
1st-gen TPU, Cloud TPU
Java
Positive Reviews, Rapid Development, Direct Engagement
67,000+ GitHub Stars
1,000+ Contributors
8,000+ Stack Overflow questions answered
17,000+ GitHub repositories with ‘TensorFlow’ in the title
21,000+ Commits in 21 months
100+ Community-submitted GitHub issues responded to weekly
TensorFlow in Courses
University of California, Berkeley
Udacity
Coursera
Stanford University
deeplearning.ai
University of Toronto
Andreessen Horowitz
Neural Machine Translation
Reduces Errors By 55%-85%
Sutskever et al., NIPS, Dec 2014
Google Research Blog, Sept 2016
Wu et al., arXiv, Sept 2016
Parsey McParseface
https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html ht
Prediction A
Xt
Google's Project Magenta
https://magenta.tensorflow.org/
Hemorrhages
Healthy, Diseased
No DR, Mild DR, Moderate DR, Severe DR, Proliferative DR
● Computation is defined as a graph
● Nodes represent computation or states
  ○ Can be run on any device
● Data flows along edges
● Graph can be defined in any language
● Graph is compiled and optimized
y = Wx + b
(x, y’)
loss = ∑_i (y_i - y’_i)^2
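The linear model and squared-error loss above can be sketched in plain Python. This is an illustration only, not TensorFlow code: the function names and the data points are made up, and the model is reduced to scalars.

```python
# Plain-Python sketch of y = W*x + b and loss = sum_i (y_i - y'_i)^2.
# The function names and data values are illustrative assumptions.

def predict(W, b, x):
    """Model output y = W * x + b (scalar case)."""
    return W * x + b

def squared_error_loss(W, b, examples):
    """Sum of squared differences between predictions y and labels y'."""
    return sum((predict(W, b, x) - y_label) ** 2 for x, y_label in examples)

# (x, y') pairs generated by y' = 2x + 1
examples = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

loss_exact = squared_error_loss(2.0, 1.0, examples)  # the generating parameters
loss_off = squared_error_loss(1.0, 0.0, examples)    # a poor guess
print(loss_exact, loss_off)
```

With the generating parameters the loss is zero; any other choice of W and b gives a positive loss, which is what training tries to reduce.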
Python Program TensorFlow Graph
tensorflow.google.cn/get_started/get_started
...
c = tf.add(a, b)
...
session = tf.Session()
value_of_c = session.run(c, {a: 1, b: 2})
[Figure: TensorFlow Graph in which nodes a and b feed an add node that produces c]
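The split between building a graph and running it can be imitated in a few lines of plain Python. The `Node`, `placeholder`, `add` and `run` names below are hypothetical stand-ins, not TensorFlow APIs; the sketch only shows that constructing `c` computes nothing until values are fed in.

```python
class Node:
    """A graph node: a placeholder (op is None) or an op over input nodes."""
    def __init__(self, name, op=None, inputs=()):
        self.name = name
        self.op = op
        self.inputs = inputs

def placeholder(name):
    return Node(name)

def add(x, y, name="add"):
    return Node(name, op=lambda u, v: u + v, inputs=(x, y))

def run(node, feed):
    """Evaluate `node`, reading placeholder values from `feed` (dict keyed by Node)."""
    if node.op is None:
        return feed[node]
    return node.op(*(run(i, feed) for i in node.inputs))

a = placeholder("a")
b = placeholder("b")
c = add(a, b)                        # builds the graph; nothing is computed yet
value_of_c = run(c, {a: 1, b: 2})    # execution happens here
print(value_of_c)
```

Because the graph is plain data, the same `c` can be evaluated again with different inputs, or inspected and rewritten before it runs.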
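[Figure: dataflow graph in which examples and weights feed MatMul, followed by Add with biases, Relu, and Xent with labels]
with state
[Figure: the same graph, with weights and biases as variables that hold state]
Automatically add ops which compute gradients for variables
[Figure: a grad node after Xent computes the gradient for biases]
with state
[Figure: the gradient is multiplied by the learning rate (Mul) and applied to biases by a −= node]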
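The grad, Mul and −= nodes amount to one gradient-descent step per run. A minimal sketch in plain Python, assuming a scalar linear model with a squared-error loss and hand-derived gradients (TensorFlow derives these automatically); all names and values are illustrative.

```python
def loss(W, b, data):
    return sum((W * x + b - y) ** 2 for x, y in data)

def train_step(W, b, data, learning_rate):
    """One step: compute gradients, scale by the learning rate, subtract."""
    grad_W = sum(2 * (W * x + b - y) * x for x, y in data)
    grad_b = sum(2 * (W * x + b - y) for x, y in data)
    W -= learning_rate * grad_W   # the "Mul" and "-=" nodes
    b -= learning_rate * grad_b
    return W, b

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]   # generated by y = 2x + 1
W, b = 0.0, 0.0
initial_loss = loss(W, b, data)
for _ in range(500):
    W, b = train_step(W, b, data, learning_rate=0.05)
final_loss = loss(W, b, data)
print(W, b, final_loss)
```

The variables W and b are the only state that survives between steps, which is why the slides call them variables with state.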
distributed
[Figure: the graph split across Device A and Device B, with Send and Recv node pairs on every edge that crosses between the two devices]
Devices: Processes, Machines, CPUs, GPUs, TPUs, etc
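One way to picture the Send/Recv pairs is as a rewrite of the graph's edges once every node has been assigned a device. The sketch below is an assumption-laden illustration in plain Python: the `partition` function, the node placement, and the `Send[...]`/`Recv[...]` naming are all hypothetical, not TensorFlow's implementation.

```python
def partition(edges, placement):
    """Split edges by device, inserting Send/Recv pairs on cross-device edges.

    edges:     list of (src, dst) node names
    placement: dict mapping node name -> device name
    Returns a dict mapping device -> list of edges local to that device.
    """
    per_device = {device: [] for device in set(placement.values())}
    for src, dst in edges:
        src_dev, dst_dev = placement[src], placement[dst]
        if src_dev == dst_dev:
            per_device[src_dev].append((src, dst))
        else:
            channel = f"{src}->{dst}"
            per_device[src_dev].append((src, f"Send[{channel}]"))
            per_device[dst_dev].append((f"Recv[{channel}]", dst))
    return per_device

# Hypothetical placement: variables and their update on A, math on B.
edges = [("biases", "Add"), ("Add", "Mul"), ("learning rate", "Mul"), ("Mul", "-=")]
placement = {"biases": "A", "learning rate": "A", "-=": "A", "Add": "B", "Mul": "B"}

per_device = partition(edges, placement)
sends = sum(1 for es in per_device.values() for _, dst in es if dst.startswith("Send["))
print(per_device["A"])
print(per_device["B"])
```

After the rewrite each device holds a self-contained subgraph, and the only coupling between devices is through the matched Send and Recv endpoints.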
Parameter Servers
Δp’ p’
Model Replicas ...
Data ...

tf.train.ClusterSpec({
    "worker": [
        "worker0.example.com:2222",
        "worker1.example.com:2222",
        "worker2.example.com:2222"
    ],
    "ps": [
        "ps0.example.com:2222",
        "ps1.example.com:2222"
    ]})

with tf.device("/job:ps/task:0"):
  weights_1 = tf.Variable(...)
  biases_1 = tf.Variable(...)

with tf.device("/job:ps/task:1"):
  weights_2 = tf.Variable(...)
  biases_2 = tf.Variable(...)

with tf.device("/job:worker/task:7"):
  input, labels = ...
  layer_1 = tf.nn.relu(tf.matmul(input, weights_1) + biases_1)
  logits = tf.nn.relu(tf.matmul(layer_1, weights_2) + biases_2)
  train_op = ...

with tf.Session("grpc://worker7.example.com:2222") as sess:
  for _ in range(10000):
    sess.run(train_op)
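The parameter-server pattern (replicas pull p, compute Δp on their own data, and the server applies the update to get p’) can be sketched without any cluster. Everything below is a single-process illustration under stated assumptions: the `ParameterServer` class, the scalar linear model, and the two data shards are hypothetical, and the replicas take turns instead of running in parallel.

```python
class ParameterServer:
    """Holds the shared parameters p and applies updates pushed by replicas."""
    def __init__(self, params):
        self.params = dict(params)

    def pull(self):
        return dict(self.params)            # a replica fetches the current p

    def push(self, grads, learning_rate):
        for name, g in grads.items():       # p' = p - learning_rate * delta_p
            self.params[name] -= learning_rate * g

def replica_gradients(params, shard):
    """Gradient of sum (W*x + b - y)^2 over one replica's data shard."""
    W, b = params["W"], params["b"]
    return {
        "W": sum(2 * (W * x + b - y) * x for x, y in shard),
        "b": sum(2 * (W * x + b - y) for x, y in shard),
    }

# Two model replicas, each with its own shard of data from y = 2x + 1.
shards = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
ps = ParameterServer({"W": 0.0, "b": 0.0})
for _ in range(2000):
    for shard in shards:
        grads = replica_gradients(ps.pull(), shard)
        ps.push(grads, learning_rate=0.01)
print(ps.params)
```

In the TensorFlow code above the same roles are played by the variables pinned to `/job:ps` and the training op pinned to `/job:worker`.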
[Figure: input pipeline. Filenames → Reader → Raw Examples → Decoder → Preprocess → Examples → Worker, with several Preprocess stages in parallel]
https://arxiv.org/abs/1604.00981
[Figure: training time in hours for 1 GPU, 10 GPUs, and 50 GPUs]
2.6 hours vs. 79.3 hours (30.5X)
a = tf.constant(6)
while a != 1:
  if a % 2 == 0:
    a = a / 2
  else:
    a = 3 * a + 1
  print(a)

# Outputs
# tf.Tensor(3, dtype=int32)
# tf.Tensor(10, dtype=int32)
# tf.Tensor(5, dtype=int32)
# tf.Tensor(16, dtype=int32)
# tf.Tensor(8, dtype=int32)
# tf.Tensor(4, dtype=int32)
# tf.Tensor(2, dtype=int32)
# tf.Tensor(1, dtype=int32)

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[3])
y = tf.placeholder(tf.float32, shape=[3])
a = x + y

with tf.Session() as sess:
  a_out = sess.run(a, feed_dict={x: [1., 1., 1.], y: [0., 2., 4.]})
  print(a_out)
# outputs: [ 1. 3. 5.]

from tensorflow.contrib.eager.python import tfe

x = tfe.Tensor([1., 1., 1.])
y = tfe.Tensor([0., 2., 4.])
print(x + y)
# outputs: tfe.Tensor([ 1. 3. 5.], dtype=tf.float64)

from tensorflow.contrib.eager.python import tfe

a = tfe.ops.gather([0, 2, 4], 7)
# InvalidArgumentError: indices = 7 is not in [0, 3) [Op:Gather]

from tensorflow.contrib.eager.python import tfe

a = tfe.Tensor(6)
while a != 1:
  if a % 2 == 0:
    a = a / 2
  else:
    a = 3 * a + 1
  print(a)

# outputs:
# tfe.Tensor(3, dtype=int32)
# tfe.Tensor(10, dtype=int32)
# tfe.Tensor(5, dtype=int32)
# tfe.Tensor(16, dtype=int32)
# tfe.Tensor(8, dtype=int32)
# tfe.Tensor(4, dtype=int32)
# tfe.Tensor(2, dtype=int32)
# tfe.Tensor(1, dtype=int32)

● Controlled using .as_gpu_tensor(), .as_cpu_tensor(), with tfe.device("gpu:0"):
● Checks for errors and then runs kernel asynchronously
● Uses tensor handles instead of values
● Have to copy to CPU to inspect a value (e.g. print)
  ○ This blocks until the value is ready
  ○ Automatic for if/while

● Debugging
  ○ The Python debugger can be used out of the box
  ○ The Python profilers can be used out of the box
  ○ Errors have meaningful stack traces
● Interactivity
  ○ Teaching new users is easier, as trying things is immediate
● High-level libraries
  ○ Will be fully compatible with tf.estimator and tf.data

● Speed
● Distributed execution
● Import/export
● Graph transformations
  ○ mobile inference
  ○ memory optimizations
  ○ layout optimizations
● Compilation for multiple platforms
  ○ CPUs, GPUs, TPUs, mobile, embedded, ...

def lstm_cell(x, w, h, c):
  xhw = tfe.ops.matmul(tfe.ops.concat([x, h], axis=1), w)
  y = tfe.ops.split(xhw, 4, axis=1)
  in_value = tfe.ops.tanh(y[0])
  (in_gate, forget_gate, out_gate) = [tfe.ops.sigmoid(x) for x in y[1:]]
  c = (forget_gate * c) + (in_gate * in_value)
  h = out_gate * tfe.ops.tanh(c)
  return h, c

h, c = lstm_cell(x, w, h, c)
print(h)

@tfe.func_to_object
def lstm_cell(x, w, h, c):
  xhw = tfe.ops.matmul(tfe.ops.concat([x, h], axis=1), w)
  y = tfe.ops.split(xhw, 4, axis=1)
  in_value = tfe.ops.tanh(y[0])
  (in_gate, forget_gate, out_gate) = [tfe.ops.sigmoid(x) for x in y[1:]]
  c = (forget_gate * c) + (in_gate * in_value)
  h = out_gate * tfe.ops.tanh(c)
  return h, c

h, c = lstm_cell(x, w, h, c)
print(h)
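For readers without the eager prototype installed, the arithmetic inside `lstm_cell` can be followed with plain Python lists. This is a sketch for a single example (no batch dimension); it mirrors the concat, matmul, split, gate and state-update steps above but uses only the `math` module, and the shapes and test values are illustrative assumptions.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_cell(x, w, h, c):
    """One LSTM step for a single example.

    x: input vector (length n_in); h, c: state vectors (length n_hidden)
    w: weights with n_in + n_hidden rows and 4 * n_hidden columns
    """
    n_hidden = len(h)
    xh = list(x) + list(h)                                        # concat([x, h])
    xhw = [sum(xh[i] * w[i][j] for i in range(len(xh)))           # matmul with w
           for j in range(4 * n_hidden)]
    y = [xhw[k * n_hidden:(k + 1) * n_hidden] for k in range(4)]  # split into 4
    in_value = [math.tanh(v) for v in y[0]]
    in_gate, forget_gate, out_gate = [[sigmoid(v) for v in part] for part in y[1:]]
    c = [f * c_i + i * g for f, c_i, i, g in zip(forget_gate, c, in_gate, in_value)]
    h = [o * math.tanh(c_i) for o, c_i in zip(out_gate, c)]
    return h, c

x = [1.0, -1.0]
h0, c0 = [0.0], [2.0]
w_zero = [[0.0] * 4 for _ in range(3)]   # 3 rows = n_in + n_hidden, 4 columns
h1, c1 = lstm_cell(x, w_zero, h0, c0)
print(h1, c1)
```

With all-zero weights every gate is 0.5 and the candidate value is 0, so the cell state simply halves, which makes the update rule easy to check by hand.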
A lightweight machine learning library for mobile and embedded devices.
Easier, Faster, Smaller
Our mobile experts at Google: Inspired by internal solutions
Our partners on Android: Excitement about targeting custom hardware
Our users: More and more interest, and GitHub questions
Small Binary Size, Startup / Latency, Throughput
Low-overhead, Optimized Set of Core Kernels, Quantized Kernels
Model file-format
Interpreter
Ops / Kernels
Interface to Hardware Acceleration
[Figure: TensorFlow (Graphdef, Checkpoint), Converter, TensorFlow Lite Format, Interpreter (Neon Kernels), Hardware Acceleration Interface; Workstation and Mobile]
Model file-format
Lightweight
Few dependencies
TOCO GraphDef conversion
Quantization is first-class
Flatbuffer-based
Open source Google project
Originally designed for video games
Similar to protobufs except
  More memory efficient
  No unmarshalling step (mmap)
  Less code
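The "no unmarshalling step (mmap)" point can be illustrated with the standard library alone. The toy file layout below (a uint32 count followed by float32 values) is a made-up assumption and is not the FlatBuffers or TensorFlow Lite format; the sketch only shows reading one value directly from a memory-mapped file at a computed offset, without parsing the whole file into objects first.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical toy format: little-endian uint32 count, then float32 values.
values = [0.5, -1.25, 3.0, 8.0]
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<I", len(values)))
    f.write(struct.pack(f"<{len(values)}f", *values))

def read_value(buf, index):
    """Read one float32 straight from the mapped buffer at a computed offset."""
    (count,) = struct.unpack_from("<I", buf, 0)
    if not 0 <= index < count:
        raise IndexError(index)
    (value,) = struct.unpack_from("<f", buf, 4 + 4 * index)
    return value

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
        third = read_value(buf, 2)   # touches only the bytes it needs
print(third)
```

Load time then depends on which bytes are actually touched, not on the size of the file, which is the property that matters for fast startup on small devices.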
Interpreter
Optimized for small devices
Few dependencies
Small binary (~100k w/o ops, ~300k for ops)
Fast load-time
Static memory plan
Static execution plan
No control flow
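A static memory plan is possible because, without control flow, the execution order and every tensor's lifetime are known before the model runs. The sketch below is a generic greedy first-fit planner written for illustration; the `plan_memory` function, the tensor names and the sizes are assumptions, and it is not the planner TensorFlow Lite ships.

```python
def plan_memory(tensors):
    """Assign each tensor a fixed offset in one shared arena before execution.

    tensors: list of (name, size, first_use, last_use); the use indices are
             positions in the fixed execution order.
    Returns (offsets, arena_size). Tensors with overlapping lifetimes never
    share bytes; tensors with disjoint lifetimes may reuse the same region.
    """
    placed = []    # (offset, size, first_use, last_use)
    offsets = {}
    for name, size, first, last in sorted(tensors, key=lambda t: -t[1]):
        # Regions held by already-placed tensors that are alive at the same time.
        busy = sorted((o, s) for o, s, f, l in placed
                      if not (l < first or last < f))
        offset = 0
        for o, s in busy:
            if offset + size <= o:
                break                       # fits in the gap before this region
            offset = max(offset, o + s)
        offsets[name] = offset
        placed.append((offset, size, first, last))
    arena_size = max(o + s for o, s, _, _ in placed)
    return offsets, arena_size

tensors = [
    ("input", 100, 0, 1),
    ("conv1", 200, 1, 2),
    ("conv2", 200, 2, 3),
    ("output", 50, 3, 4),
]
offsets, arena_size = plan_memory(tensors)
print(offsets, arena_size)
```

Here the four tensors need 550 bytes in total but fit in a 400-byte arena, because `input` reuses the region of `conv2` and `output` reuses the region of `conv1`.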
Ops / Kernels
Core builtin ops
NEON on ARM
Float & quantized
Many kernels optimized for our mobile apps
Pre-fused activations and biases
C API for custom ops
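The "Float & quantized" kernels rely on mapping real values to small integers. A minimal sketch of generic 8-bit affine quantization in plain Python, assuming a scale and zero-point scheme; the function names and the value range are illustrative, and the exact scheme used by TensorFlow Lite kernels is not taken from these slides.

```python
def quantize_params(min_val, max_val, num_bits=8):
    """Scale and zero point mapping [min_val, max_val] onto [0, 2^num_bits - 1]."""
    qmax = 2 ** num_bits - 1
    min_val, max_val = min(min_val, 0.0), max(max_val, 0.0)  # range must contain 0
    scale = (max_val - min_val) / qmax
    zero_point = round(-min_val / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, num_bits=8):
    """Real value -> integer code, clamped to the representable range."""
    q = round(x / scale) + zero_point
    return max(0, min(2 ** num_bits - 1, q))

def dequantize(q, scale, zero_point):
    """Integer code -> approximate real value."""
    return scale * (q - zero_point)

scale, zero_point = quantize_params(-1.0, 3.0)
q_one = quantize(1.0, scale, zero_point)
approx_one = dequantize(q_one, scale, zero_point)
print(scale, zero_point, q_one, approx_one)
```

Each value is stored in one byte instead of four, and the round trip is off by at most half a quantization step for values inside the chosen range.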
Interface to Hardware Acceleration
[Figure: tflite binary (flatbuffer), TensorFlow Lite Interpreter, Neon Kernels, Hardware Acceleration]
Mobile
tensorflow.google.cn/mobile/tflite/

Freeze Graph
bazel build tensorflow/python/tools:freeze_graph && \
  bazel-bin/tensorflow/python/tools/freeze_graph \
  --input_graph=/tmp/mobilenet_v1_224.pb \
  --input_checkpoint=/tmp/checkpoints/mobilenet-10202.ckpt \
  --input_binary=true \
  --output_graph=/tmp/frozen_mobilenet_v1_224.pb \
  --output_node_names=MobileNet/Predictions/Reshape_1

Convert
bazel build tensorflow/contrib/lite/toco:toco && \
  bazel run --config=opt tensorflow/contrib/lite/toco:toco -- \
  --input_file=$(pwd)/mobilenet_v1_1.0_224/frozen_graph.pb \
  --input_format=TENSORFLOW_GRAPHDEF --output_format=TFLITE \
  --output_file=/tmp/mobilenet_v1_1.0_224.lite --inference_type=FLOAT \
  --input_type=FLOAT --input_arrays=input \
  --output_arrays=MobilenetV1/Predictions/Reshape_1 --input_shapes=1,224,224,3

import tensorflow as tf

img = tf.placeholder(name="img", dtype=tf.float32, shape=(1, 64, 64, 3))
val = img + tf.constant([1., 2., 3.]) + tf.constant([1., 4., 4.])
out = tf.identity(val, name="out")
with tf.Session() as sess:
  tflite_model = tf.contrib.lite.toco_convert(sess.graph_def, [img], [out])
  open("converted_model.tflite", "wb").write(tflite_model)

[Figure: TensorFlow, TensorFlow Lite Converter, TensorFlow Lite Format, Interpreter; Workstation and Mobile]
[Figure: python, java, go, and C frontends; TensorFlow Graph; TensorFlow runtime (C++); Executor; Kernels (add, softmax, ...)]
[Figure: TensorFlow Graph; TensorFlow runtime (C++); Local Executor; tf2xla kernels (add, softmax, ...); XLA Graph]
[Figure: training step time, split into input processing (CPU) and model training. Moving model training from the CPU to an accelerator makes it 10x faster, while input processing stays on the CPU]
Input data is the lifeblood of machine learning
Modern accelerators need faster input pipelines
Getting your data into TensorFlow can be painful
Functional programming to the rescue!
Data elements have the same type
Dataset might be too large to materialize all at once… or infinite
Compose functions like map() and filter() to preprocess
tf.data.Dataset: Represents input pipeline using functional transformations
tf.data.Iterator: Provides sequential access to elements of a Dataset
New Dataset API for data infeed
input_dataset = tf.data.TFRecordDataset([TRAIN_FILE])
dataset = (input_dataset.repeat()
           .map(lambda x: tf.image.decode_jpeg(x),
                num_parallel_calls=16)  # Parallel map
           .batch(BATCH_SIZE))
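The functional style of the Dataset API (repeat, then map, then batch, all evaluated lazily) can be imitated with Python generators. This sketch uses only `itertools`; the `repeat` and `batch` helpers and the string records are hypothetical stand-ins, and nothing here calls `tf.data`.

```python
import itertools

def repeat(dataset):
    """Cycle through the elements forever, in the spirit of Dataset.repeat()."""
    return itertools.cycle(dataset)

def batch(elements, batch_size):
    """Group consecutive elements into lists of batch_size, lazily."""
    iterator = iter(elements)
    while True:
        chunk = list(itertools.islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk

records = ["img0", "img1", "img2"]            # stand-ins for raw records
decoded = map(str.upper, repeat(records))     # stand-in for a decode step
batches = batch(decoded, batch_size=2)

first_three = list(itertools.islice(batches, 3))
print(first_three)
```

The repeated dataset is infinite, yet only the six elements needed for three batches are ever produced, which is why a dataset that is too large to materialize is not a problem.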
Both for NN training and inference
● 180 TFLOPS, 64 GB memory
● Can connect multiple TPUs together
ResNet-50 training reaches ~74% accuracy in 20 hours
Cloud TPU Alpha Program
TensorFlow Research Cloud Program
Sign up at g.co/tpusignup
TPU Pod: 64 next-gen TPUs
Controller (RNN): sample architecture A with probability p
Train child network A and get accuracy R
Compute the gradient of p and scale it by R to update the controller
Zoph, B. & Le, Q. https://arxiv.org/pdf/1611.01578.pdf
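The three-step loop above is a policy-gradient (REINFORCE) update. The toy below is a sketch under strong assumptions: the "controller" is just a softmax over three fixed candidate architectures instead of an RNN, "training a child network" is replaced by looking up a made-up accuracy, and a moving-average baseline is subtracted from R to reduce variance.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_controller(rewards, steps=3000, learning_rate=0.1, seed=0):
    """Toy controller: learn which of len(rewards) architectures scores best."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    baseline = 0.0
    for _ in range(steps):
        p = softmax(logits)
        a = rng.choices(range(len(rewards)), weights=p)[0]  # sample A with probability p
        R = rewards[a]                                      # stand-in for child accuracy
        advantage = R - baseline
        # Gradient of log p[a] w.r.t. logits[k] is (1 if k == a else 0) - p[k];
        # scale it by the advantage and step the controller.
        for k in range(len(logits)):
            logits[k] += learning_rate * advantage * ((1.0 if k == a else 0.0) - p[k])
        baseline = 0.9 * baseline + 0.1 * R
    return softmax(logits)

rewards = [0.1, 0.9, 0.3]      # made-up accuracies of three candidate networks
probs = train_controller(rewards)
print(probs)
```

Over many samples the controller shifts probability toward the candidate with the highest reward, which is the behaviour the slide's loop is designed to produce at much larger scale.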
Tensorflow.google.cn