Google Dataflow: A Quick Trial

Simon Su @ LinkerNetworks { Developer Expert }

var simon = { /** I am at GCPUG.TW **/ };
simon.aboutme = 'http://about.me/peihsinsu';
simon.nodejs = 'http://opennodes.arecord.us';
simon.googleshare = 'http://gappsnews.blogspot.tw';
simon.nodejsblog = 'http://nodejs-in-example.blogspot.tw';
simon.blog = 'http://peihsinsu.blogspot.com';
simon.slideshare = 'http://slideshare.net/peihsinsu/';
simon.email = 'simonsu.mail@.com';
simon.say('Good luck to everybody!');

https://www.facebook.com/groups/GCPUG.TW/

https://plus.google.com/u/0/communities/116100913832589966421

● 1st Wave: Colocation. Your kit, someone else's building. Yours to manage.
● 2nd Wave: Virtualized Data Centers ("assembly required"). Standard virtual kit for rent. Still yours to manage.
● 3rd Wave ("next"): a true on-demand cloud, an actual, global elastic cloud. Invest your energy in great apps.

● Clusters and Containers
● Distributed Storage, Processing, Memory & Network
● Machine Learning
● Application Runtime Services

Enabling No-Touch Operations

Data Services

Breakthrough Insights, Breakthrough Applications

Foundation Infrastructure & Operations

The Gear that Powers Google: Capture → Store → Process → Analyze

● Capture: Pub/Sub, Logs, App Engine, BigQuery streaming. Smart devices, IoT devices, and sensors as data sources; physical or virtual servers as the frontend for handling large-scale data; a strong queue service for large-scale data ingestion (on the order of 16.6K events/sec, or 43B events/month).
● Store: Cloud Storage, Cloud SQL (MySQL), Datastore (NoSQL).
● Process: Cloud Dataflow and Dataproc (Hadoop ecosystem). MapReduce-style servers for large-scale data transformation, in batch or streaming mode.
● Analyze: BigQuery, Dataproc, and the larger Hadoop ecosystem. A large-scale data store that stores data and serves on the order of 1M query workloads (Dremel).

Behind these services is Google's internal data stack, which evolved from GFS, MapReduce, and BigTable through Dremel, Colossus, Flume, and MillWheel between 2002 and 2013, and is now offered as a managed service.

Managed Service

● User Code & SDK: deployed and scheduled onto the service by the Job Manager
● Work Manager: runs the work and reports progress & logs
● Monitoring UI: surfaces job progress and logs
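As a sketch of how a job reaches the managed service: with the Dataflow Java SDK (1.x-style API), the pipeline options select the Dataflow runner, project, and staging location. The project ID and staging bucket below are hypothetical placeholders.

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner;

public class SubmitToService {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);

    options.setRunner(DataflowPipelineRunner.class);      // run on the Dataflow service
    options.setProject("my-gcp-project");                 // hypothetical project ID
    options.setStagingLocation("gs://my-bucket/staging"); // hypothetical staging bucket

    // The service takes care of deploy & schedule; progress and logs
    // then show up in the Monitoring UI.
    Pipeline p = Pipeline.create(options);
    // ... apply transforms here ...
    p.run();
  }
}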

GCP: ETL, Analysis, Orchestration

● ETL: movement, filtering, enrichment, shaping
● Analysis: reduction, batch computation, continuous computation
● Orchestration: composition, external orchestration, simulation

Cloud Storage

● Offline import (third party)
● Online cloud import (Cloud Storage Transfer Service)
● Regional buckets
● Object versioning
● Object change notification
● Object lifecycle management
● ACLs


From MapReduce to Dataflow

● Map → ParDo
● Shuffle → GroupByKey
● Reduce → ParDo

Pipeline

● A directed acyclic graph (DAG) of data processing transformations
● Can be submitted to the Dataflow Service for optimization and execution, or executed on an alternate runner, e.g. Spark
● May include multiple inputs and multiple outputs
● May encompass many logical MapReduce operations
● PCollections flow through the pipeline

Inputs & Outputs

❯ Read from standard data sources: GCS, Pub/Sub, BigQuery, Datastore
❯ Write your own custom source ("Your Source/Sink Here") by teaching Dataflow how to read it in parallel
  • Currently for bounded sources only
❯ Write to GCS, BigQuery, Pub/Sub
  • More coming…
❯ Can use a combination of text, JSON, XML, Avro formatted data

PCollection

❯ A collection of data of type T in a pipeline, e.g. {Seahawks, NFC, Champions, Seattle, ...} or {..., "NFC Champions #GreenBay", "Green Bay #superbowl!", ..., "#GoHawks", ...}
❯ May be either bounded or unbounded in size
❯ Created by using a PTransform to:
  • build from a java.util.Collection
  • read from a backing data store
  • transform an existing PCollection
❯ Often contains key-value pairs using KV
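To make these pieces concrete, here is a minimal sketch using the Dataflow Java SDK (1.x-style API): it builds a pipeline, reads a bounded PCollection of Strings from GCS, builds another from a java.util.Collection, and writes text back to GCS. The bucket paths are hypothetical placeholders.

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Create;
import com.google.cloud.dataflow.sdk.values.PCollection;

import java.util.Arrays;

public class MinimalPipeline {
  public static void main(String[] args) {
    // Build the pipeline (the DAG); nothing runs until p.run() is called.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // A bounded PCollection<String> read from a standard source (GCS text files).
    PCollection<String> lines =
        p.apply(TextIO.Read.from("gs://my-bucket/input/*.txt"));  // hypothetical bucket

    // A bounded PCollection<String> built from an in-memory java.util.Collection
    // (shown only to illustrate Create; it is not used further here).
    PCollection<String> sample =
        p.apply(Create.of(Arrays.asList("Seahawks", "NFC", "Champions", "Seattle")));

    // Write one of the collections to a standard sink (GCS text files).
    lines.apply(TextIO.Write.to("gs://my-bucket/output/lines"));  // hypothetical bucket

    p.run();  // submit for optimization and execution
  }
}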

PTransform

● A step, or a processing operation, that transforms data
  ○ e.g. convert formats, group, or filter data
● Types of transforms:
  ○ ParDo
  ○ GroupByKey
  ○ Combine
  ○ Flatten
    ■ If multiple PCollection objects contain the same data type, you can merge them into a single logical PCollection using the Flatten transform

ParDo

❯ Processes each element of a PCollection independently, using a user-provided DoFn (e.g. a KeyBySessionId transform over {Seahawks, NFC, Champions, Seattle, ...})
❯ Corresponds to both the Map and Reduce phases in Hadoop, i.e. ParDo → GBK → ParDo
❯ Useful for (see the sketch after this list):
  • Filtering a data set
  • Formatting or converting the type of each element in a data set
  • Extracting parts of each element in a data set
  • Performing computations on each element in a data set
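A minimal sketch of such a ParDo with the Dataflow Java SDK (1.x-style API): a user-provided DoFn that filters out empty lines and keys each remaining line by its first word. The helper class, transform name, and input collection are hypothetical.

import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class KeyByFirstWord {
  // Assumes `lines` is a PCollection<String> produced earlier in the pipeline.
  public static PCollection<KV<String, String>> apply(PCollection<String> lines) {
    return lines.apply(ParDo.named("KeyByFirstWord").of(
        new DoFn<String, KV<String, String>>() {
          @Override
          public void processElement(ProcessContext c) {
            String line = c.element().trim();
            if (line.isEmpty()) {
              return;                             // filtering: drop empty elements
            }
            String key = line.split("\\s+")[0];   // extracting: first word as the key
            c.output(KV.of(key, line));           // formatting: emit a key-value pair
          }
        }));
  }
}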

GroupByKey

❯ Takes a PCollection of key-value pairs and gathers up all values with the same key, e.g. turning {KV<k1, Seahawks>, KV<k1, Seattle>, KV<k2, NFC>, ...} into {KV<k1, {Seahawks, Seattle, …}>, KV<k2, {NFC, …}>, ...}
❯ Corresponds to the shuffle phase in Hadoop

Wait a minute… how do you do a GroupByKey on an unbounded PCollection? (See the windowing sketch below.)
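The usual answer is windowing: divide the unbounded PCollection into finite windows, then group within each window. A minimal sketch with the Dataflow Java SDK (1.x-style API), assuming a hypothetical unbounded PCollection of key-value pairs named `events` produced earlier in a streaming pipeline:

import com.google.cloud.dataflow.sdk.transforms.GroupByKey;
import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;
import org.joda.time.Duration;

public class WindowedGrouping {
  // `events` is assumed to be an unbounded PCollection of key-value pairs,
  // e.g. read from Pub/Sub and keyed by a ParDo earlier in the pipeline.
  public static PCollection<KV<String, Iterable<Long>>> apply(
      PCollection<KV<String, Long>> events) {
    return events
        // Slice the unbounded stream into fixed one-minute windows.
        .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(1))))
        // GroupByKey is now well-defined: it groups per key, per window.
        .apply(GroupByKey.<String, Long>create());
  }
}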


Cloud Pub/Sub

● Globally redundant
● Low latency (sub-second)
● N-to-N coupling
● Batched read/write
● Push & pull delivery
● Guaranteed delivery
● Auto expiration

Diagram: publishers (Publisher A, B, C) publish messages (Message 1, 2, 3) to topics (Topic A, B, C); each topic fans out to its subscriptions (Subscription XA, XB, YC, ZC), which subscribers X, Y, and Z consume.
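In a Dataflow pipeline, Pub/Sub is the typical unbounded source. A minimal sketch with the Dataflow Java SDK (1.x-style API); the topic path is a hypothetical placeholder, and the pipeline options must have streaming enabled.

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class PubsubSource {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setStreaming(true);  // unbounded sources need a streaming pipeline

    Pipeline p = Pipeline.create(options);

    // An unbounded PCollection<String>: one element per Pub/Sub message payload.
    PCollection<String> messages =
        p.apply(PubsubIO.Read.topic("projects/my-gcp-project/topics/my-topic"));  // hypothetical topic

    // ... window, key, and group the messages as shown above ...

    p.run();
  }
}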

● Datalab: an easy tool for analysis and reporting
● BigQuery: an interactive analysis service
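To close the loop, here is a hedged sketch of wiring Dataflow output into BigQuery for interactive analysis, again in the 1.x-style Java SDK; the table name, schema, and the `counts` input collection are hypothetical.

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;

import java.util.Arrays;

public class WriteCountsToBigQuery {
  // Assumes `counts` is a PCollection<KV<String, Long>>, e.g. per-key counts
  // produced by the grouping shown earlier.
  public static void apply(PCollection<KV<String, Long>> counts) {
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("word").setType("STRING"),
        new TableFieldSchema().setName("count").setType("INTEGER")));

    counts
        // Convert each KV into a BigQuery TableRow.
        .apply(ParDo.of(new DoFn<KV<String, Long>, TableRow>() {
          @Override
          public void processElement(ProcessContext c) {
            c.output(new TableRow()
                .set("word", c.element().getKey())
                .set("count", c.element().getValue()));
          }
        }))
        // Append the rows into a (hypothetical) BigQuery table for interactive queries.
        .apply(BigQueryIO.Write
            .to("my-gcp-project:my_dataset.word_counts")
            .withSchema(schema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
  }
}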