Google Dataflow 小試 (A Quick Try)
Simon Su @ LinkerNetworks {Google Developer Expert}

var simon = {/** I am at GCPUG.TW **/};
simon.aboutme = 'http://about.me/peihsinsu';
simon.nodejs = 'http://opennodes.arecord.us';
simon.googleshare = 'http://gappsnews.blogspot.tw';
simon.nodejsblog = 'http://nodejs-in-example.blogspot.tw';
simon.blog = 'http://peihsinsu.blogspot.com';
simon.slideshare = 'http://slideshare.net/peihsinsu/';
simon.email = '[email protected]';
simon.say('Good luck to everybody!');

https://www.facebook.com/groups/GCPUG.TW/
https://plus.google.com/u/0/communities/116100913832589966421

Three waves of cloud
● 1st Wave – Colocation: your kit, someone else's building. Rent. Assembly required.
● 2nd Wave – Virtualized Data Centers: standard virtual kit for rent. Still yours to manage.
● 3rd Wave – An actual, global elastic cloud: a true on-demand cloud. Invest your energy in great apps.

The gear that powers Google
● Foundation: infrastructure & operations – clusters, containers, distributed storage and processing, memory, network
● Application runtime services – enabling no-touch operations
● Data services & machine learning – breakthrough insights, breakthrough applications

Capture → Store → Process → Analyze
● Capture: Pub/Sub, logs, devices – smart devices, IoT devices, and sensors (physical or virtual servers as frontend data receivers); a strong queue service for handling large-scale data injection (16.6K events/sec)
● Store: Cloud Storage, Cloud SQL (MySQL), Datastore (NoSQL), BigQuery streaming – large-scale data stores (16.6K events/sec, or 43B events/month)
● Process: Cloud Dataflow, Dataproc (Hadoop ecosystem), App Engine – MapReduce-style servers for large-scale data transformation, in batch or streaming mode
● Analyze: BigQuery, Dataproc – store large data and serve a 1M-query workload

Google's data technology timeline
2002 GFS · 2004 MapReduce · 2006 BigTable · 2008 Dremel · 2010 Colossus · 2012 Spanner · 2013 MillWheel & Flume

Dataflow as a managed service
● Job Manager and Work Manager run on GCP
● User code & SDK → deploy & schedule
● Progress & logs → Monitoring UI

Use cases
● ETL: movement, filtering, enrichment, shaping
● Analysis: reduction, batch computation, continuous computation
● Orchestration: composition, external orchestration, simulation

Cloud Storage
● Offline import (third party) and online cloud import (Cloud Storage Transfer Service)
● Regional buckets
● Object versioning
● Object change notification
● Object lifecycle management
● ACLs

MapReduce concepts map onto Dataflow
● Map → ParDo
● Shuffle → GroupByKey
● Reduce → ParDo

Pipeline
● A Directed Acyclic Graph (DAG) of data processing transformations
● Can be submitted to the Dataflow Service for optimization and execution, or executed on an alternate runner, e.g. Spark
● May include multiple inputs and multiple outputs
● May encompass many logical MapReduce operations
● PCollections flow through the pipeline

Sources and sinks
❯ Read from standard Google Cloud Platform data sources
• GCS, Pub/Sub, BigQuery, Datastore
❯ Write your own custom source by teaching Dataflow how to read it in parallel
• Currently for bounded sources only
❯ Write to GCS, BigQuery, Pub/Sub
• More coming…
❯ Can use a combination of text, JSON, XML, and Avro formatted data

PCollection
❯ A collection of data of type T in a pipeline, e.g. PCollection<K,V>
• {Seahawks, NFC, Champions, Seattle, ...}
• {..., "NFC Champions #GreenBay", "Green Bay #superbowl!", ..., "#GoHawks", ...}
❯ May be either bounded or unbounded in size
❯ Created by using a PTransform to:
• Build from a java.util.Collection
• Read from a backing data store
• Transform an existing PCollection
❯ Often contains key-value pairs using KV

Transforms
● A step, or a processing operation that transforms data
○ convert format, group, filter data
● Types of transforms
○ ParDo
○ GroupByKey
○ Combine
○ Flatten
■ When multiple PCollection objects contain the same data type, you can merge them into a single logical PCollection using the Flatten transform

ParDo
❯ Processes each element of a PCollection independently using a user-provided DoFn
❯ Corresponds to both the Map and Reduce phases in Hadoop, i.e. ParDo → GBK → ParDo
❯ Useful for:
• Filtering a data set
• Formatting or converting the type of each element in a data set
• Extracting parts of each element in a data set
• Performing computations on each element in a data set
❯ Example – KeyBySessionId:
{Seahawks, NFC, Champions, Seattle, ...} → {KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...}

GroupByKey
• Takes a PCollection of key-value pairs and gathers up all values with the same key
• Corresponds to the shuffle phase in Hadoop
{KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...} → {KV<S, {Seahawks, Seattle, …}>, KV<N, {NFC, …}>, KV<C, {Champions, …}>}
Wait a minute… How do you do a GroupByKey on an unbounded PCollection?

Cloud Pub/Sub
● Globally redundant
● Low latency (sub-second)
● N-to-N coupling: publishers A/B/C → topics → subscriptions → subscribers X/Y/Z
● Batched read/write
● Push & pull delivery
● Guaranteed delivery
● Auto expiration

Analysis tools
● Datalab – an easy tool for analysis and reporting
● BigQuery – an interactive analysis service
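The ParDo → GroupByKey → ParDo shape described above can be sketched in plain Python. This is not the Dataflow SDK: `par_do`, `group_by_key`, and `key_by_first_letter` are illustrative stand-ins, using the deck's Seahawks example (key each word by its first letter, then group).

```python
from collections import defaultdict

def par_do(pcollection, do_fn):
    """ParDo: apply a user-provided DoFn to each element independently.
    A DoFn may emit zero or more outputs per element, so do_fn returns an iterable."""
    return [out for element in pcollection for out in do_fn(element)]

def group_by_key(pcollection):
    """GroupByKey: gather all values that share a key (Hadoop's shuffle phase)."""
    groups = defaultdict(list)
    for key, value in pcollection:
        groups[key].append(value)
    return list(groups.items())

# A KeyBySessionId-style DoFn, keying each word by its first letter.
def key_by_first_letter(word):
    yield (word[0], word)

words = ["Seahawks", "NFC", "Champions", "Seattle"]
keyed = par_do(words, key_by_first_letter)
# [('S', 'Seahawks'), ('N', 'NFC'), ('C', 'Champions'), ('S', 'Seattle')]
grouped = group_by_key(keyed)
# [('S', ['Seahawks', 'Seattle']), ('N', ['NFC']), ('C', ['Champions'])]
counts = par_do(grouped, lambda kv: [(kv[0], len(kv[1]))])
# [('S', 2), ('N', 1), ('C', 1)]
```

Note that `group_by_key` here assumes a bounded, in-memory input; it cannot finish on an unbounded stream, which is exactly the "wait a minute" question above that windowing answers in the real service.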