A First Try at Google Dataflow
Simon Su @ LinkerNetworks {Google Developer Expert}
var simon = {/** I am at GCPUG.TW **/};
simon.aboutme = 'http://about.me/peihsinsu';
simon.nodejs = 'http://opennodes.arecord.us';
simon.googleshare = 'http://gappsnews.blogspot.tw';
simon.nodejsblog = 'http://nodejs-in-example.blogspot.tw';
simon.blog = 'http://peihsinsu.blogspot.com';
simon.slideshare = 'http://slideshare.net/peihsinsu/';
simon.email = 'simonsu.mail@gmail.com';
simon.say('Good luck to everybody!');
https://www.facebook.com/groups/GCPUG.TW/
https://plus.google.com/u/0/communities/116100913832589966421
Assembly required
● 1st Wave: Colocation
  ○ Your kit, someone else's building. Yours to manage.
● 2nd Wave: Virtualized Data Centers
  ○ Standard virtual kit for rent. Still yours to manage.
Next: True On Demand Cloud
● 3rd Wave: An actual, global elastic cloud
  ○ Invest your energy in great apps.
● Data Services: Breakthrough Insights, Breakthrough Applications
  ○ Distributed Storage, Processing & Machine Learning
● Application Runtime Services: Enabling No-Touch Operations
  ○ Clusters, Containers
● Foundation Infrastructure & Operations: The Gear that Powers Google
  ○ Storage, Processing, Memory, Network
Capture, Store, Process, Analyze
● Capture: Pub/Sub, Logs, App Engine, BigQuery streaming
● Store: Cloud Storage, BigQuery Storage, Cloud SQL (MySQL), Cloud Datastore (NoSQL)
● Process: Dataflow, Dataproc
● Analyze: BigQuery, Dataproc, Larger Hadoop Ecosystem
Devices
● Smart devices, IoT devices, sensors
● Physical or virtual servers as frontend data receiver
● Strong queue service for handling large-scale data injection
● MapReduce servers for large data transformation, in batch or streaming mode
● Large-scale data store for storing data and serving query workloads
● Scale shown: 1M devices, 16.6K events/sec, 43B events/month
Technologies on the timeline: MapReduce, Dremel, Spanner, GFS, Big Table, Colossus, MillWheel, Flume
Years on the timeline axis: 2002, 2004, 2006, 2008, 2010, 2012, 2013
Managed Service
● Job Manager
● Work Manager
● User Code & SDK
● Deploy & Schedule
● Progress & Logs
● Monitoring UI
● ETL
  ○ Movement
  ○ Filtering
  ○ Enrichment
  ○ Shaping
● Analysis
  ○ Reduction
  ○ Batch computation
  ○ Continuous computation
● Orchestration
  ○ Composition
  ○ External orchestration
  ○ Simulation
Cloud Storage features
● Offline import (third party)
● Online cloud import (Cloud Storage Transfer Service)
● Regional buckets
● Object versioning
● Object change notification
● Object lifecycle management
● ACLs
MapReduce phases and their Dataflow counterparts
● Map: ParDo
● Shuffle: GroupByKey
● Reduce: ParDo
A pipeline is
● A Directed Acyclic Graph of data processing transformations
● Can be submitted to the Dataflow Service for optimization and execution, or executed on an alternate runner, e.g. Spark
● May include multiple inputs and multiple outputs
● May encompass many logical MapReduce operations
● PCollections flow through the pipeline
❯ Read from standard Google Cloud Platform data sources
  • GCS, Pub/Sub, BigQuery, Datastore
❯ Write your own custom source by teaching Dataflow how to read it in parallel
  • Currently for bounded sources only
❯ Write to GCS, BigQuery, Pub/Sub
  • More coming…
❯ Can use a combination of text, JSON, XML, Avro formatted data
(Diagram label: Your Source/Sink Here)
PCollection
❯ A collection of data of type T in a pipeline
  {Seahawks, NFC, Champions, Seattle, ...}
● Transforms
  ○ Convert format, group, and filter data
● Types of Transforms
  ○ ParDo
  ○ GroupByKey
  ○ Combine
  ○ Flatten
    ■ When multiple PCollection objects contain the same data type, you can merge them into a single logical PCollection using the Flatten transform
ParDo
❯ Processes each element of a PCollection independently using a user-provided DoFn
  {Seahawks, NFC, Champions, Seattle, ...}
❯ Corresponds to both the Map and Reduce phases in Hadoop, i.e. ParDo->GBK->ParDo
❯ Useful for
  • Filtering a data set.
  • Formatting or converting the type of each element in a data set.
(Diagram labels: KeyBySessionId; { KV, KV, KV, ... })
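The ParDo behaviour described above can be imitated in plain Python to make the idea concrete. This is only a minimal sketch and does not use the Dataflow SDK: `par_do`, `keep_long_words`, and `key_by_first_letter` are hypothetical names invented for illustration, and keying by first letter stands in for the KeyBySessionId step on the slide.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def par_do(elements: Iterable[T], do_fn: Callable[[T], Iterable[U]]) -> List[U]:
    """Apply do_fn to each element independently and collect everything it emits."""
    output: List[U] = []
    for element in elements:
        output.extend(do_fn(element))
    return output


def keep_long_words(word: str) -> Iterable[str]:
    # Filtering: emit the element only if it passes the check (hypothetical rule).
    if len(word) > 3:
        yield word


def key_by_first_letter(word: str) -> Iterable[Tuple[str, str]]:
    # Formatting or converting: turn each word into a (key, value) pair.
    yield (word[0], word)


words = ["Seahawks", "NFC", "Champions", "Seattle"]
filtered = par_do(words, keep_long_words)
keyed = par_do(words, key_by_first_letter)
print(filtered)
print(keyed)
```

Because each element is handled on its own, with zero, one, or many outputs per input, a real runner is free to spread the work across many machines.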
GroupByKey
• Corresponds to the shuffle phase in Hadoop
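To illustrate the ParDo->GBK->ParDo pattern, here is a small plain-Python sketch of the three steps. It is not Dataflow SDK code: `group_by_key` is a hypothetical helper, and the first-letter key is an assumption chosen only to give the grouping something to do.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_key(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Gather all values that share a key, like the shuffle phase in MapReduce."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


words = ["Seahawks", "NFC", "Champions", "Seattle"]

# Step 1 ("Map", a ParDo): emit one (key, value) pair per element.
keyed = [(word[0], word) for word in words]

# Step 2 ("Shuffle", GroupByKey): collect the values for each key.
grouped = group_by_key(keyed)

# Step 3 ("Reduce", another ParDo): process each key's group independently.
counts = {key: len(values) for key, values in grouped.items()}

print(grouped)
print(counts)
```

The grouping step is the only one that needs to see more than one element at a time, which is why it corresponds to the shuffle.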
Cloud Pub/Sub
● Globally redundant
● Low latency (sub sec.)
● N to N coupling
● Batched read/write
● Push & Pull
● Guaranteed Delivery
● Auto expiration
(Diagram labels: Publishers A, B, C; Messages 1, 2, 3; Topics A, B, C; Subscriptions XA, XB, YC, ZC; Subscribers X, Y, Z)
Datalab
● An easy tool for analysis and reporting
BigQuery
● An interactive analysis service
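The Pub/Sub publisher, topic, and subscription layout can be modelled with a toy in-memory class. This is a rough sketch only: `PubSubSketch` is a hypothetical class, not the Cloud Pub/Sub client library, and the assignment of Messages 1 to 3 to Topics A to C is an assumption. It shows only the fan-out, where every subscription on a topic receives its own copy of each message; acknowledgement, push delivery, and expiration are left out.

```python
from collections import deque
from typing import Deque, Dict, List, Optional


class PubSubSketch:
    """Toy model: every subscription on a topic gets its own copy of each message."""

    def __init__(self) -> None:
        self._topic_subs: Dict[str, List[str]] = {}  # topic -> subscription names
        self._queues: Dict[str, Deque[str]] = {}     # subscription name -> pending messages

    def create_subscription(self, topic: str, name: str) -> None:
        self._topic_subs.setdefault(topic, []).append(name)
        self._queues[name] = deque()

    def publish(self, topic: str, message: str) -> None:
        # Fan out: append the message to every subscription attached to the topic.
        for name in self._topic_subs.get(topic, []):
            self._queues[name].append(message)

    def pull(self, name: str) -> Optional[str]:
        # Return the oldest pending message, or None when the queue is empty.
        queue = self._queues[name]
        return queue.popleft() if queue else None


bus = PubSubSketch()
for topic, name in [("Topic A", "XA"), ("Topic B", "XB"), ("Topic C", "YC"), ("Topic C", "ZC")]:
    bus.create_subscription(topic, name)

bus.publish("Topic A", "Message 1")
bus.publish("Topic B", "Message 2")
bus.publish("Topic C", "Message 3")

received = {name: bus.pull(name) for name in ["XA", "XB", "YC", "ZC"]}
print(received)
```

In this sketch Subscribers Y and Z both see Message 3 because each holds a separate subscription on Topic C, which is the N to N coupling listed among the features.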