李双峰

[Figure: image examples — input + saturation = ?, input + defocus = ?]

TensorFlow
● An open-source machine learning platform for everyone
● Fast, flexible, and production-ready
● Scales from research to production

Deep Learning: just like regular learning, but with more layers. Inception v3 has ~25M parameters.

The TensorFlow stack
● High-level APIs: Canned Estimators, Keras, Estimator, Models, Datasets, Layers
● Frontends: Python, C++, Java, Go, ...
● The TensorFlow Distributed Execution Engine, plus the XLA compiler
● Hardware: CPU, GPU, TPU (1st-gen and Cloud TPU), Android, iOS, Raspberry Pi, ...

A strong community
● Positive reviews: 67,000+ GitHub stars; 17,000+ GitHub repositories with 'TensorFlow' in the title
● Rapid development: 1,000+ contributors; 21,000+ commits in 21 months
● Direct engagement: 8,000+ Stack Overflow questions answered; 100+ community-submitted GitHub issues responded to weekly

TensorFlow in courses: University of California, Berkeley; Stanford University; University of Toronto; Udacity; Coursera; deeplearning.ai; Andreessen Horowitz.

Applications
● Neural machine translation reduces errors by 55%-85% (Sutskever et al., NIPS, Dec 2014; Google Research Blog, Sept 2016; Wu et al., arXiv, Sept 2016)
● Parsey McParseface: https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html
● Google's Project Magenta: https://magenta.tensorflow.org/ [Figure: recurrent cell A maps input x_t to prediction h_t]
● Detecting diabetic retinopathy: retinal images are graded from healthy to diseased (No DR, Mild DR, Moderate DR, Severe DR, Proliferative DR), spotting features such as hemorrhages

TensorFlow graphs
● Computation is defined as a graph
● Nodes represent computation or state
○ Nodes can be run on any device
● Data flows along the edges
● The graph can be defined in any language
● The graph is compiled and optimized

Example: fit y = Wx + b to training pairs (x, y′), minimizing

  loss = ∑ᵢ (yᵢ − y′ᵢ)²

The Python program defines the computation; TensorFlow compiles it into a graph. (tensorflow.google.cn/get_started/get_started)

  # ... a and b defined earlier in the graph ...
  c = tf.add(a, b)            # adds an "add" node with inputs a and b

  session = tf.Session()
  value_of_c = session.run(c, {a: 1, b: 2})   # feeds a and b, fetches c

A training graph: examples and labels flow through MatMul, Add, and ReLU into a cross-entropy loss (Xent); the weights and biases are stateful variables. TensorFlow automatically adds ops which compute gradients for the variables; each gradient is multiplied by the learning rate and subtracted from its variable (−=).

Distributed execution
● The graph is partitioned across devices, e.g. the biases on Device A and the rest of the graph on Device B
● Devices: processes, machines, CPUs, GPUs, TPUs, etc.
● Send/Recv node pairs are inserted automatically on every edge that crosses devices
● Data parallelism: parameter servers hold the parameters p′; model replicas each process a shard of the data and send updates Δp′ back

  tf.train.ClusterSpec({
      "worker": [
          "worker0.example.com:2222",
          "worker1.example.com:2222",
          "worker2.example.com:2222"
      ],
      "ps": [
          "ps0.example.com:2222",
          "ps1.example.com:2222"
      ]})

  with tf.device("/job:ps/task:0"):
      weights_1 = tf.Variable(...)
      biases_1 = tf.Variable(...)

  with tf.device("/job:ps/task:1"):
      weights_2 = tf.Variable(...)
      biases_2 = tf.Variable(...)

  with tf.device("/job:worker/task:7"):
      input, labels = ...
      layer_1 = tf.nn.relu(tf.matmul(input, weights_1) + biases_1)
      logits = tf.nn.relu(tf.matmul(layer_1, weights_2) + biases_2)
      train_op = ...

  with tf.Session("grpc://worker7.example.com:2222") as sess:
      for _ in range(10000):
          sess.run(train_op)

The input pipeline is a graph too: filenames feed parallel Readers, raw examples feed Decoders and Preprocess stages, and the resulting examples feed the training workers.
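As a concrete illustration of that Filenames → Reader → Decoder → Preprocess flow, here is a minimal sketch using the TF 1.x input-queue API; the file names, feature keys, and image size are illustrative assumptions, not from the talk:

  import tensorflow as tf

  # Filenames: a queue of input files, reshuffled each epoch
  filename_queue = tf.train.string_input_producer(
      ["train-0.tfrecord", "train-1.tfrecord"])

  # Reader: pulls one serialized tf.Example at a time
  reader = tf.TFRecordReader()
  _, serialized = reader.read(filename_queue)

  # Decoder: parse the record and decode the image bytes
  features = tf.parse_single_example(serialized, {
      "image": tf.FixedLenFeature([], tf.string),
      "label": tf.FixedLenFeature([], tf.int64)})
  image = tf.image.decode_jpeg(features["image"], channels=3)

  # Preprocess: fixed-size images for the model
  image = tf.image.resize_images(image, [224, 224])

  # Batches are assembled by background queue-runner threads
  images, labels = tf.train.shuffle_batch(
      [image, features["label"]], batch_size=32,
      num_threads=4, capacity=5000, min_after_dequeue=1000)
  # Inside a session, tf.train.start_queue_runners(sess) starts the threads.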
Scaling works: per "Revisiting Distributed Synchronous SGD" (https://arxiv.org/abs/1604.00981), training time falls from 79.3 hours on 1 GPU to 2.6 hours on 50 GPUs, a 30.5x speedup (the chart also plots a 10-GPU curve).

Eager execution. The pitch: write ordinary Python control flow over tensors and see values immediately:

  a = tf.constant(6)
  while a != 1:
      if a % 2 == 0:
          a = a / 2
      else:
          a = 3 * a + 1
      print(a)

  # Outputs
  # tf.Tensor(3, dtype=int32)
  # tf.Tensor(10, dtype=int32)
  # tf.Tensor(5, dtype=int32)
  # tf.Tensor(16, dtype=int32)
  # tf.Tensor(8, dtype=int32)
  # tf.Tensor(4, dtype=int32)
  # tf.Tensor(2, dtype=int32)
  # tf.Tensor(1, dtype=int32)

Today, with graphs, you build the computation and then feed it through a session:

  import tensorflow as tf

  x = tf.placeholder(tf.float32, shape=[3])
  y = tf.placeholder(tf.float32, shape=[3])
  a = x + y

  with tf.Session() as sess:
      a_out = sess.run(a, feed_dict={x: [1., 1., 1.], y: [0., 2., 4.]})
      print(a_out)
  # outputs: [ 1.  3.  5.]

With eager execution, ops run immediately:

  from tensorflow.contrib.eager.python import tfe

  x = tfe.Tensor([1., 1., 1.])
  y = tfe.Tensor([0., 2., 4.])
  print(x + y)
  # outputs: tfe.Tensor([ 1.  3.  5.], dtype=tf.float64)

Errors surface at the call site:

  from tensorflow.contrib.eager.python import tfe

  a = tfe.ops.gather([0, 2, 4], 7)
  # InvalidArgumentError: indices = 7 is not in [0, 3) [Op:Gather]

And Python control flow just works:

  from tensorflow.contrib.eager.python import tfe

  a = tfe.Tensor(6)
  while a != 1:
      if a % 2 == 0:
          a = a / 2
      else:
          a = 3 * a + 1
      print(a)

  # outputs:
  # tfe.Tensor(3, dtype=int32)
  # tfe.Tensor(10, dtype=int32)
  # tfe.Tensor(5, dtype=int32)
  # tfe.Tensor(16, dtype=int32)
  # tfe.Tensor(8, dtype=int32)
  # tfe.Tensor(4, dtype=int32)
  # tfe.Tensor(2, dtype=int32)
  # tfe.Tensor(1, dtype=int32)

Using GPUs eagerly
● Controlled using .as_gpu_tensor(), .as_cpu_tensor(), and with tfe.device("gpu:0"):
● Checks for errors, then runs the kernel asynchronously
● Uses tensor handles instead of values
● You have to copy a value to the CPU to inspect it (e.g. print)
○ This blocks until the value is ready
○ Automatic for if/while

Benefits of eager execution
● Debugging
○ The Python debugger can be used out of the box
○ The Python profilers can be used out of the box
○ Errors have meaningful stack traces
● Interactivity
○ Teaching new users is easier, as trying things is immediate
● High-level libraries
○ Will be fully compatible with tf.estimator and tf.dataset

What graphs still buy you
● Speed
● Distributed execution
● Import/export
● Graph transformations
○ mobile inference
○ memory optimizations
○ layout optimizations
● Compilation for multiple platforms
○ CPUs, GPUs, TPUs, mobile, embedded, ...

An LSTM cell, written eagerly:

  def lstm_cell(x, w, h, c):
      xhw = tfe.ops.matmul(tfe.ops.concat([x, h], axis=1), w)
      y = tfe.ops.split(xhw, 4, axis=1)
      in_value = tfe.ops.tanh(y[0])
      # loop variable renamed so it no longer shadows the input x
      (in_gate, forget_gate, out_gate) = [tfe.ops.sigmoid(g) for g in y[1:]]
      c = (forget_gate * c) + (in_gate * in_value)
      h = out_gate * tfe.ops.tanh(c)
      return h, c

  h, c = lstm_cell(x, w, h, c)
  print(h)

The same function can be staged into a graph for speed by adding a single decorator; the body is unchanged:

  @tfe.func_to_object
  def lstm_cell(x, w, h, c):
      ...  # body identical to the eager version above

  h, c = lstm_cell(x, w, h, c)
  print(h)
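The slides stop at forward computation; as a sketch of how gradients look in this style, here is gradients_function from tf.contrib.eager as it later shipped in TF 1.5+ — note the module path differs from the pre-release tfe import used above, so this is an assumption about the released API, not the talk's code:

  import tensorflow.contrib.eager as tfe

  tfe.enable_eager_execution()

  def loss(w):
      # A toy scalar "loss": f(w) = w^2
      return w * w

  grad_fn = tfe.gradients_function(loss)  # df/dw, as an ordinary Python function

  print(loss(3.0))     # tf.Tensor(9.0, ...)
  print(grad_fn(3.0))  # [tf.Tensor(6.0, ...)] — gradients w.r.t. each argument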
TensorFlow Lite: a lightweight machine learning library for mobile and embedded devices. Easier. Faster. Smaller.

Why we built it
● Our mobile experts at Google: inspired by internal solutions
● Our partners on Android: excitement about targeting custom hardware
● Our users: more and more interest, and GitHub questions

Design goals: small binary size, fast startup / low latency, low-overhead throughput, an optimized set of core kernels, and quantized kernels.

Four components, each covered in turn: the model file format, the interpreter, the ops/kernels, and the interface to hardware acceleration.

Workflow: on the workstation, the converter (TOCO) turns a TensorFlow GraphDef plus checkpoint into the TensorFlow Lite format; on the mobile device, the TensorFlow Lite interpreter executes it with NEON kernels and the hardware acceleration interface.

Model file format
● Lightweight, with few dependencies
● TOCO performs the GraphDef conversion
● Quantization is first-class
● FlatBuffer-based

FlatBuffers
● An open-source Google project, originally designed for video games
● Similar to protobufs, except: more memory-efficient, no unmarshalling step (mmap), and less code

Interpreter
● Optimized for small devices, with few dependencies
● Small binary (~100 KB without ops, ~300 KB with ops)
● Fast load time: static memory plan, static execution plan, no control flow

Ops / Kernels
● Core built-in ops: NEON on ARM, float and quantized, many kernels optimized for our mobile apps, pre-fused activations and biases
● C API for custom ops

On the device, the tflite binary (a FlatBuffer) is loaded by the interpreter, which dispatches to the NEON kernels or to hardware acceleration. See tensorflow.google.cn/mobile/tflite/

Converting a model. First, freeze the graph:

  bazel build tensorflow/python/tools:freeze_graph && \
  bazel-bin/tensorflow/python/tools/freeze_graph \
      --input_graph=/tmp/mobilenet_v1_224.pb \
      --input_checkpoint=/tmp/checkpoints/mobilenet-10202.ckpt \
      --input_binary=true \
      --output_graph=/tmp/frozen_mobilenet_v1_224.pb \
      --output_node_names=MobileNet/Predictions/Reshape_1

Then convert it with TOCO:

  bazel build tensorflow/contrib/lite/toco:toco && \
  bazel run --config=opt tensorflow/contrib/lite/toco:toco -- \
      --input_file=$(pwd)/mobilenet_v1_1.0_224/frozen_graph.pb \
      --input_format=TENSORFLOW_GRAPHDEF --output_format=TFLITE \
      --output_file=/tmp/mobilenet_v1_1.0_224.lite --inference_type=FLOAT \
      --input_type=FLOAT --input_arrays=input \
      --output_arrays=MobilenetV1/Predictions/Reshape_1 \
      --input_shapes=1,224,224,3

Or do the conversion from Python:

  import tensorflow as tf

  img = tf.placeholder(name="img", dtype=tf.float32, shape=(1, 64, 64, 3))
  val = img + tf.constant([1., 2., 3.]) + tf.constant([1., 4., 4.])
  out = tf.identity(val, name="out")

  with tf.Session() as sess:
      tflite_model = tf.contrib.lite.toco_convert(sess.graph_def, [img], [out])
      open("converted_model.tflite", "wb").write(tflite_model)

XLA
● Today: the TensorFlow runtime (C++, with Python, Java, Go, and C frontends) walks the graph in an executor and dispatches each op (softmax, add, matmul, ...) to a hand-written kernel
● With XLA: tf2xla lowers a subgraph into an XLA graph, which is compiled into fused kernels and run by a local executor

Input pipelines
● A training step is input processing (on the CPU) plus model training (on the accelerator); shrinking the input-processing share of the step time can make training up to 10x faster
● Input data is the lifeblood of machine learning
● Modern accelerators need faster input pipelines
● Getting your data into TensorFlow can be painful

This is what the Datasets layer of the TensorFlow stack shown at the start (Canned Estimators, Keras, Estimator, Models, Datasets, Layers, over the distributed execution engine) is for.
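To make the Datasets layer concrete, here is a minimal sketch using the tf.data API (tf.data.Dataset shipped in TF 1.4); the file names, feature keys, and sizes are illustrative assumptions. Compare it with the queue-based sketch earlier: the whole pipeline becomes a single declarative chain.

  import tensorflow as tf

  def parse_fn(record):
      # Decode one serialized tf.Example into an (image, label) pair
      features = tf.parse_single_example(record, {
          "image": tf.FixedLenFeature([], tf.string),
          "label": tf.FixedLenFeature([], tf.int64)})
      image = tf.image.decode_jpeg(features["image"], channels=3)
      return tf.image.resize_images(image, [224, 224]), features["label"]

  dataset = (tf.data.TFRecordDataset(["train-0.tfrecord", "train-1.tfrecord"])
             .map(parse_fn, num_parallel_calls=4)  # parallel decode on CPU threads
             .shuffle(buffer_size=10000)
             .batch(32)
             .prefetch(1))  # overlap input processing with the training step

  images, labels = dataset.make_one_shot_iterator().get_next()
  # Each sess.run([images, labels]) yields the next batch; no feed_dict needed.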