
李双峰 (Shuangfeng Li)


• An open-source machine learning platform for everyone
• Fast, flexible, and production-ready
• Scales from research to production

Deep learning: just like regular learning, but with more layers. Inception v3 has ~25M parameters.

TensorFlow architecture, top to bottom:

• Canned Estimators
• Keras | Estimator
• Models | Datasets | Layers
• Frontends: Python, C++, Java, Go, ...
• TensorFlow Distributed Execution Engine
• Platforms: CPU, GPU, TPU, Android, iOS, XLA, ...

Runs on CPUs, GPUs, iOS, Android, Raspberry Pi, and Google's TPUs.

[Photos: 1st-gen TPU, Cloud TPU]

Positive Reviews • Rapid Development • Direct Engagement

67,000+  GitHub stars
1,000+   contributors
8,000+   Stack Overflow questions answered
17,000+  GitHub repositories with 'TensorFlow' in the title
21,000+  commits in 21 months
100+     community-submitted GitHub issues responded to weekly

TensorFlow in courses: University of California, Berkeley • Stanford University • University of Toronto • Udacity • Coursera • deeplearning.ai • Andreessen Horowitz

Neural Machine Translation

Reduces translation errors by 55%-85%

Sutskever et al., NIPS, Dec 2014 • Google Research Blog, Sept 2016 • Wu et al., arXiv, Sept 2016

Parsey McParseface

https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html


Google's Project Magenta

https://magenta.tensorflow.org/

[Retinal fundus images: healthy vs. diseased, with hemorrhages highlighted]
Diabetic retinopathy (DR) grades: No DR | Mild DR | Moderate DR | Severe DR | Proliferative DR

● Computation is defined as a graph
● Nodes represent computations or state
  ○ Can be run on any device
● Data flows along the edges
● The graph can be defined in any language
● The graph is compiled and optimized

Example: a linear model y = Wx + b, trained on pairs (x, y′) with loss = ∑i (yi − y′i)²
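A minimal graph-mode sketch of this linear model (the shapes, initializers, and feed values here are illustrative assumptions, not from the slides):

import tensorflow as tf

# Define the graph: y = Wx + b and a squared-error loss.
x = tf.placeholder(tf.float32, shape=[None, 3])        # inputs
y_target = tf.placeholder(tf.float32, shape=[None, 1])  # labels y'
W = tf.Variable(tf.zeros([3, 1]))
b = tf.Variable(tf.zeros([1]))
y = tf.matmul(x, W) + b
loss = tf.reduce_sum(tf.square(y - y_target))  # sum_i (y_i - y'_i)^2

# Nothing runs until a Session executes the graph.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(loss, feed_dict={x: [[1., 2., 3.]], y_target: [[1.]]}))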

Python program → TensorFlow graph (tensorflow.google.cn/get_started/get_started):

c = tf.add(a, b)  # adds an 'add' node to the graph, with edges from a and b

session = tf.Session()
value_of_c = session.run(c, feed_dict={a: 1, b: 2})

[Graph: examples and weights feed MatMul → Add (with biases) → ReLU → Xent (with labels); weights and biases are variables with state]

TensorFlow automatically adds ops which compute gradients for the variables:

[Graph: a grad subgraph is appended after Xent, mirroring the forward ops]

Each gradient is multiplied by the learning rate and applied to its variable with a −= update:

[Graph: ... → Xent → grad → Mul (learning rate) → −= biases]
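A minimal sketch of what those added ops amount to, using tf.gradients and explicit −= updates (the layer sizes and names are illustrative):

import tensorflow as tf

learning_rate = 0.01
weights = tf.Variable(tf.random_normal([784, 10]))
biases = tf.Variable(tf.zeros([10]))
examples = tf.placeholder(tf.float32, [None, 784])
labels = tf.placeholder(tf.float32, [None, 10])

logits = tf.nn.relu(tf.matmul(examples, weights) + biases)
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))

# tf.gradients adds the gradient-computing ops to the graph...
grad_w, grad_b = tf.gradients(loss, [weights, biases])
# ...and the update is gradient * learning_rate, applied with -=.
train_op = [weights.assign_sub(learning_rate * grad_w),
            biases.assign_sub(learning_rate * grad_b)]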

distributed

[Graph partitioned across Device A and Device B: the biases variable and its −= update live on one device; the Add, Mul, and learning-rate ops on the other]

Devices: processes, machines, CPUs, GPUs, TPUs, etc.

When the graph is split across devices, paired Send and Recv nodes are inserted on every edge that crosses a device boundary:

[Same partitioned graph, with Send/Recv node pairs added at each cross-device edge]

Devices: processes, machines, CPUs, GPUs, TPUs, etc.

Parameter Servers

[Diagram: model replicas send gradient updates Δp′ to parameter servers and read back updated parameters p′; input data is sharded across the replicas]

tf.train.ClusterSpec({
    "worker": [
        "worker0.example.com:2222",
        "worker1.example.com:2222",
        "worker2.example.com:2222"
    ],
    "ps": [
        "ps0.example.com:2222",
        "ps1.example.com:2222"
    ]})

with tf.device("/job:ps/task:0"):
    weights_1 = tf.Variable(...)
    biases_1 = tf.Variable(...)

with tf.device("/job:ps/task:1"):
    weights_2 = tf.Variable(...)
    biases_2 = tf.Variable(...)

with tf.device("/job:worker/task:7"):
    input, labels = ...
    layer_1 = tf.nn.relu(tf.matmul(input, weights_1) + biases_1)
    logits = tf.nn.relu(tf.matmul(layer_1, weights_2) + biases_2)
    train_op = ...

with tf.Session("grpc://worker7.example.com:2222") as sess:
    for _ in range(10000):
        sess.run(train_op)
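For context, each process in such a cluster also starts a tf.train.Server built from the same ClusterSpec; a minimal sketch for one of the parameter servers (hostnames copied from the example above):

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "worker": ["worker0.example.com:2222",
               "worker1.example.com:2222",
               "worker2.example.com:2222"],
    "ps": ["ps0.example.com:2222",
           "ps1.example.com:2222"]})

# On the first parameter-server machine:
server = tf.train.Server(cluster, job_name="ps", task_index=0)
server.join()  # ps processes just serve variables to the workers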

[Input pipeline: filenames → Reader → Decoder → Preprocess → Worker, with many reader/decoder/preprocess stages running in parallel to turn raw examples into examples]

Revisiting Distributed Synchronous SGD: https://arxiv.org/abs/1604.00981

[Chart: training time in hours for 1, 10, and 50 GPUs]
2.6 hours on 50 GPUs vs. 79.3 hours on 1 GPU (30.5x)

What we would like to write, with the outputs we would like to see:

a = tf.constant(6)
while a != 1:
    if a % 2 == 0:
        a = a / 2
    else:
        a = 3 * a + 1
    print(a)

# Outputs:
# tf.Tensor(3, dtype=int32)
# tf.Tensor(10, dtype=int32)
# tf.Tensor(5, dtype=int32)
# tf.Tensor(16, dtype=int32)
# tf.Tensor(8, dtype=int32)
# tf.Tensor(4, dtype=int32)
# tf.Tensor(2, dtype=int32)
# tf.Tensor(1, dtype=int32)

Today's graph mode requires placeholders and an explicit Session:

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[3])
y = tf.placeholder(tf.float32, shape=[3])
a = x + y
with tf.Session() as sess:
    a_out = sess.run(a, feed_dict={x: [1., 1., 1.], y: [0., 2., 4.]})
    print(a_out)
# outputs: [ 1.  3.  5.]

With eager execution, ops run as soon as they are called:

from tensorflow.contrib.eager.python import tfe

x = tfe.Tensor([1., 1., 1.])
y = tfe.Tensor([0., 2., 4.])
print(x + y)
# outputs: tfe.Tensor([ 1.  3.  5.], dtype=tf.float64)

Errors are raised immediately, at the offending call:

from tensorflow.contrib.eager.python import tfe

a = tfe.ops.gather([0, 2, 4], 7)
# InvalidArgumentError: indices = 7 is not in [0, 3) [Op:Gather]

And ordinary Python control flow just works:

from tensorflow.contrib.eager.python import tfe

a = tfe.Tensor(6)
while a != 1:
    if a % 2 == 0:
        a = a / 2
    else:
        a = 3 * a + 1
    print(a)

# Outputs:
# tfe.Tensor(3, dtype=int32)
# tfe.Tensor(10, dtype=int32)
# tfe.Tensor(5, dtype=int32)
# tfe.Tensor(16, dtype=int32)
# tfe.Tensor(8, dtype=int32)
# tfe.Tensor(4, dtype=int32)
# tfe.Tensor(2, dtype=int32)
# tfe.Tensor(1, dtype=int32)

Device placement:
● Controlled using .as_gpu_tensor(), .as_cpu_tensor(), with tfe.device("gpu:0"):
● Checks for errors and then runs the kernel asynchronously
● Uses tensor handles instead of values
● Have to copy to the CPU to inspect a value (e.g. print)
  ○ This blocks until the value is ready
  ○ Automatic for if/while

Why eager?
● Debugging
  ○ The Python debugger can be used out of the box
  ○ The Python profilers can be used out of the box
  ○ Errors have meaningful stack traces
● Interactivity
  ○ Teaching new users is easier, as trying things is immediate
● High-level libraries
  ○ Will be fully compatible with tf.estimator and tf.dataset

What graphs still provide:
● Speed
● Distributed execution
● Import/export
● Graph transformations
  ○ Mobile inference
  ○ Memory optimizations
  ○ Layout optimizations
● Compilation for multiple platforms
  ○ CPUs, GPUs, TPUs, mobile, embedded, ...

An LSTM cell written eagerly:

def lstm_cell(x, w, h, c):
    xhw = tfe.ops.matmul(tfe.ops.concat([x, h], axis=1), w)
    y = tfe.ops.split(xhw, 4, axis=1)
    in_value = tfe.ops.tanh(y[0])
    (in_gate, forget_gate, out_gate) = [tfe.ops.sigmoid(x) for x in y[1:]]
    c = (forget_gate * c) + (in_gate * in_value)
    h = out_gate * tf.tanh(c)
    return h, c

h, c = lstm_cell(x, w, h, c)
print(h)

The same cell, staged into a graph with a decorator:

@tfe.func_to_object
def lstm_cell(x, w, h, c):
    xhw = tfe.ops.matmul(tfe.ops.concat([x, h], axis=1), w)
    y = tfe.ops.split(xhw, 4, axis=1)
    in_value = tfe.ops.tanh(y[0])
    (in_gate, forget_gate, out_gate) = [tfe.ops.sigmoid(x) for x in y[1:]]
    c = (forget_gate * c) + (in_gate * in_value)
    h = out_gate * tf.tanh(c)
    return h, c

h, c = lstm_cell(x, w, h, c)
print(h)
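Eager execution also makes gradients ordinary function calls. A hedged sketch, assuming this tfe build exposes gradients_function the way tf.contrib.eager later did:

from tensorflow.contrib.eager.python import tfe

def loss(w):
    # A toy scalar loss with its minimum at w = 1.
    return w * w - 2.0 * w

# gradients_function wraps loss and returns its derivatives w.r.t. the args.
grad_fn = tfe.gradients_function(loss)
print(grad_fn(3.0))  # expect [4.0], since d/dw (w^2 - 2w) = 2w - 2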

TensorFlow Lite: a lightweight machine learning library for mobile and embedded devices. Easier, faster, smaller.

Why build it?
• Our mobile experts at Google: inspired by internal solutions
• Our partners on Android: excitement about targeting custom hardware
• Our users: more and more interest, and questions

Design goals:
• Small binary size
• Low-overhead startup / latency
• Optimized throughput: a set of core kernels, plus quantized kernels

TensorFlow Lite has four components:

• Model file-format
• Interpreter
• Ops / Kernels
• Interface to hardware acceleration

Workflow: on the workstation, a TensorFlow GraphDef and checkpoint are converted (via TOCO) into the TensorFlow Lite format; on the mobile device, the TensorFlow Lite interpreter executes the model with NEON kernels and the hardware-acceleration interface.

Model file-format:
• Lightweight, few dependencies
• FlatBuffer-based; quantization is first-class
• FlatBuffers: an open-source Google project, originally designed for video games; similar to protobufs, except more memory-efficient, no unmarshalling step (mmap), and less code

Interpreter:
• Optimized for small devices, with few dependencies
• Small binary (~100 KB without ops, ~300 KB with ops)
• Fast load time
• Static memory plan and static execution plan; no control flow

Ops / Kernels:
• Core built-in ops: NEON-optimized on ARM, float & quantized
• Many kernels optimized for our mobile apps
• Pre-fused activations and biases
• C API for custom ops

On device, the tflite binary (a FlatBuffer) runs on the interpreter with NEON kernels and hardware acceleration.

Mobile: tensorflow.google.cn/mobile/tflite/

Freeze the graph:

bazel build tensorflow/python/tools:freeze_graph && \
bazel-bin/tensorflow/python/tools/freeze_graph \
  --input_graph=/tmp/mobilenet_v1_224.pb \
  --input_checkpoint=/tmp/checkpoints/mobilenet-10202.ckpt \
  --input_binary=true \
  --output_graph=/tmp/frozen_mobilenet_v1_224.pb \
  --output_node_names=MobileNet/Predictions/Reshape_1

Convert with TOCO:

bazel build tensorflow/contrib/lite/toco:toco && \
bazel run --config=opt tensorflow/contrib/lite/toco:toco -- \
  --input_file=$(pwd)/mobilenet_v1_1.0_224/frozen_graph.pb \
  --input_format=TENSORFLOW_GRAPHDEF --output_format=TFLITE \
  --output_file=/tmp/mobilenet_v1_1.0_224.lite --inference_type=FLOAT \
  --input_type=FLOAT --input_arrays=input \
  --output_arrays=MobilenetV1/Predictions/Reshape_1 --input_shapes=1,224,224,3

Or convert directly from Python:

import tensorflow as tf

img = tf.placeholder(name="img", dtype=tf.float32, shape=(1, 64, 64, 3))
val = img + tf.constant([1., 2., 3.]) + tf.constant([1., 4., 4.])
out = tf.identity(val, name="out")
with tf.Session() as sess:
    tflite_model = tf.contrib.lite.toco_convert(sess.graph_def, [img], [out])
    open("converted_model.tflite", "wb").write(tflite_model)

[Workflow: TensorFlow model → TensorFlow Lite Converter → TF Lite format → TF Lite Interpreter; conversion on the workstation, inference on mobile]
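To sanity-check the converted model, one option (an illustrative sketch, assuming a TensorFlow build that ships the Python tf.contrib.lite.Interpreter) is to run it directly:

import numpy as np
import tensorflow as tf

# Load the model converted above and run one inference.
interpreter = tf.contrib.lite.Interpreter(model_path="converted_model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

img = np.zeros((1, 64, 64, 3), dtype=np.float32)
interpreter.set_tensor(input_details[0]['index'], img)
interpreter.invoke()
print(interpreter.get_tensor(output_details[0]['index']))  # img + [2., 6., 7.]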

How TensorFlow runs a graph today:

[Diagram: frontends (Python, C++, Java, Go) build a TensorFlow graph of ops such as add and softmax; the TensorFlow runtime (C++) walks the graph with an Executor, dispatching each op to a kernel]

With XLA:

[Diagram: the tf2xla bridge translates the TensorFlow graph (add, softmax, ...) into an XLA graph, which is compiled and run by a local executor instead of op-by-op kernels]
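One way to try XLA in TF 1.x is to turn on JIT compilation at the session level; a minimal sketch (global_jit_level is a standard TF 1.x ConfigProto option, the matmul is illustrative):

import tensorflow as tf

# Enable XLA JIT compilation for the whole session.
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = (
    tf.OptimizerOptions.ON_1)

a = tf.random_normal([1000, 1000])
b = tf.matmul(a, a)
with tf.Session(config=config) as sess:
    # Ops on supported devices may now be clustered and compiled by XLA
    # instead of being dispatched kernel-by-kernel.
    sess.run(b)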

[Chart: training step time = input processing + model training; with an accelerator, model training becomes ~10x faster, leaving input processing (CPU) as the bottleneck]

• Input data is the lifeblood of machine learning
• Modern accelerators need faster input pipelines
• Getting your data into TensorFlow can be painful

[TensorFlow architecture stack, highlighting Datasets]

Functional programming to the rescue!

• Data elements all have the same type
• A Dataset might be too large to materialize all at once... or infinite
• Compose functions like map() and filter() to preprocess

tf.data.Dataset: represents an input pipeline as functional transformations
tf.data.Iterator: provides sequential access to the elements of a Dataset

New Dataset API for data infeed:

input_dataset = tf.data.TFRecordDataset([TRAIN_FILE])
dataset = (input_dataset
           .repeat()
           .map(lambda x: tf.decode_jpeg(x),
                num_threads=16)  # parallel map
           .batch(BATCH_SIZE))

[TensorFlow architecture stack diagram]
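A hedged sketch of consuming the pipeline above through the tf.data.Iterator mentioned earlier (the one-shot iterator is standard TF 1.x; `dataset` is the object built in the previous snippet and NUM_STEPS is illustrative):

import tensorflow as tf

# Sequential access to the Dataset built above (TF 1.x graph mode).
iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
    for _ in range(NUM_STEPS):
        images = sess.run(next_batch)  # one BATCH_SIZE batch of decoded JPEGs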

Cloud TPU: for both NN training and inference

● 180 TFLOPS, 64 GB of memory
● Multiple TPUs can be connected together
● ResNet-50 training reaches ~74% accuracy in 20 hours

Cloud TPU Alpha • TensorFlow Research Cloud Program • Sign up at g.co/tpusignup


TPU Pod: 64 next-gen TPUs

Neural architecture search (Zoph, B. & Le, Q., https://arxiv.org/pdf/1611.01578.pdf):

1. The controller (an RNN) samples an architecture A with probability p
2. The child network A is trained, yielding accuracy R
3. Compute the gradient of p, scale it by R, and use it to update the controller
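Step 3 is the REINFORCE policy-gradient update from the paper; with θ denoting the controller's parameters, the update direction is

∇θ J(θ) ≈ R · ∇θ log P(A; θ)

so architectures that earn higher accuracy R make the controller more likely to sample them again.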

tensorflow.google.cn