Creating #Serverless Data Analytics System Using BigQuery
Creating #serverless data analytics system using BigQuery
Márton Kodok / @martonkodok
Google Developer Expert at REEA.net
April 2018 - Timisoara, Romania

About me
● Geek. Hiker. Do-er.
● Among the top 3 Romanians on Stack Overflow (120k reputation)
● Google Developer Expert on Cloud technologies
● Crafting Web/Mobile backends at REEA.net
● BigQuery/Redis and database engine expert
● Active in mentoring and the IT community
Twitter: @martonkodok
StackOverflow: pentium10
Slideshare: martonkodok
GitHub: pentium10

REEA.net uses GCP
Build on the same infrastructure that powers Google

Google Cloud Platform (GCP)
[Diagram: Google Cloud Platform product catalog, grouped into Compute, Big Data, Identity & Security, Internet of Things, Machine Learning, Storage & Databases, Management Tools, Networking, and Developer Tools.]

Meet Serverless
[Image: serverless data center depicted]

Event-driven serverless compute platform
[Diagram labels: Event Source; Changes in data state; Business logic events; Event Router; Application Gateway; HTTPS; Pub/Sub; Cloud Functions; Cloud Services; Streaming Analysis; Data Warehouse; Task Integrations; Multiple Platforms; Business Value.]

Serverless is about maximizing elasticity, cost savings, and agility of cloud computing.

Goal today
Crafting a solution for building a high-performance, petabyte-scale, serverless data analytics and reporting system on Google Cloud Platform.

Legacy Reporting System
[Diagram labels: Cloud Load Balancing; NGINX; App; Database Service (Master/Slave); four Compute Engine instances with a 10GB PD each; Batch Processing; Scheduled Tasks on Compute Engine (multiple instances); Report & Share; Business Analysis.]

Serverless Reporting System
[Diagram: the same components as the legacy system, with BigQuery and Data Studio added for Report & Share and Business Analysis.]

What is BigQuery?
● Analytics-as-a-Service - Data Warehouse in the Cloud
● Scales into petabytes on managed Google infrastructure (US or EU zone)
● SQL 2011 + JavaScript UDF (User Defined Functions)
● Familiar DB structure (table, views, struct, nested, JSON)
● Integrates with Google Sheets + Cloud Storage + Pub/Sub connectors
● Decent pricing (queries: $5/TB, storage: $20/TB, cold storage: $10/TB) *Mar 2018
● Open interfaces (Web UI, BQ command line tool, REST, ODBC)

BigQuery: Convenience of SQL
● Columnar storage (max 10 000 columns in a table)
● Large files for loading: 5TB (CSV or JSON)
● UDF in JavaScript or SQL
● Append-only tables preferred (DML syntax available)
● Day column partitioned tables (select * from t where day='2018-01-01')
● Rich SQL 2011: JSON, IP, Math, RegExp, Geocode, Window functions
● Modern data types: Record, Nested, Struct, Array

Architecting for The Cloud
[Diagram labels: On-Premises Servers; Frontend Platform Services; Event Sourcing; Metrics / Logs / Streaming; Pipelines; ETL Engine; BigQuery.]

“Our project generates many/big files. How can I seamlessly ingest them?”

Serverless file ingest
[Diagram labels: On-Premises Servers; Frontend Platform Services; Event Sourcing; Application; Metrics / Logs / Streaming; Cloud Storage; Cloud Functions (Triggered Code); BigQuery.]

“Data needs to be processed in multiple services. How can we pipe to multiple places?”
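The serverless file ingest flow above (a file lands in Cloud Storage, triggered code in Cloud Functions loads it into BigQuery) can be sketched roughly as follows. This is a minimal illustration, not code from the talk: the function names, the project/dataset/table identifiers, and the sample event are all hypothetical, and the sketch only builds the load-job request body instead of submitting it to BigQuery.

```python
import json


def build_load_job(event, project_id, dataset_id, table_id):
    """Build a BigQuery load-job request body for a file in Cloud Storage.

    `event` mimics a Cloud Storage trigger payload, which carries the
    bucket and the object name of the uploaded file.
    """
    source_uri = "gs://{}/{}".format(event["bucket"], event["name"])
    return {
        "configuration": {
            "load": {
                "sourceUris": [source_uri],
                "destinationTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
                # The slides mention CSV or JSON; newline-delimited JSON is assumed here.
                "sourceFormat": "NEWLINE_DELIMITED_JSON",
                # Append-only tables are the preferred pattern per the slides.
                "writeDisposition": "WRITE_APPEND",
            }
        }
    }


def on_file_uploaded(event, context=None):
    """Hypothetical trigger entry point; identifiers below are placeholders."""
    job = build_load_job(event, "my-project", "analytics", "uploads")
    # A deployed function would submit `job` to the BigQuery jobs API here.
    return job


if __name__ == "__main__":
    sample_event = {"bucket": "my-ingest-bucket", "name": "exports/2018-04-01.json"}
    print(json.dumps(on_file_uploaded(sample_event), indent=2))
```

Keeping the request construction in a pure function makes the trigger logic easy to check locally before wiring it to a real bucket.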
Architecting for The Cloud
[Diagram labels: On-Premises Servers; Frontend Platform Services; Event Sourcing; Metrics / Logs / Streaming; Cloud Dataflow (Stream, Batch); Process; BigQuery; Cloud SQL; Cloud Storage; Analyze; Data Studio; Third-Party Tools.]

“We have our app outside of GCP. How can we use the benefits of BigQuery?”

Data Pipeline Integration at REEA.net
[Diagram labels: Development Team; On-Premises Servers; Frontend Platform Services; Event Sourcing; Metrics / Logs / Streaming; Standard Devices; HTTPS; FluentD Pipelines; Backend Application Servers; Database SQL; Analytics; BigQuery; Cloud Storage (Load, Export, Replay, archive); Report & Share; Business Analysis; Data Analysts; Tools: Tableau, QlikView, Data Studio, Internal Dashboard.]

The following slides present a sample Fluentd configuration to:
1. Transform a record
2. Copy event to multiple outputs
3. Store event data in File (for backup/log purposes)
4. Stream to BigQuery (for immediate analyses)

    <filter frontend.user.*>
      @type record_transformer
    </filter>
    <match frontend.user.*>
      @type copy
      <store>
        @type forest
        subtype file
      </store>
      <store>
        @type bigquery
      </store>
      …
    </match>

1 (record_transformer): event data transform. The filter plugin mutates incoming data. Add/modify/delete attributes without a code deploy.
2 (copy): The copy output plugin copies events to multiple outputs: file(s), multiple databases, DB engines. Great to ship the same event to multiple subsystems.
3 (forest / file): stores the event in a file.
4 (bigquery): The BigQuery output plugin streams the event on the fly to the BigQuery warehouse. No need to write integration. Data is available immediately for querying.
Whenever needed, other output plugins can be wired in: Kafka, Google Cloud Storage output plugin.
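The four steps of the configuration (transform, copy, store in file, stream to the warehouse) follow a simple fan-out pattern. The Python sketch below mirrors that pattern conceptually; it is not Fluentd code, and every class and function name in it is hypothetical. The in-memory `BufferSink` merely stands in for the BigQuery output, and the specific transformation (dropping `host`, adding `avg`) is an illustrative assumption.

```python
import json
import os
import tempfile


def transform(record):
    """Step 1: return a modified copy of the record (drop `host`, add `avg`)."""
    out = {k: v for k, v in record.items() if k != "host"}
    if record.get("count"):
        out["avg"] = record["total"] / record["count"]
    return out


class FileSink:
    """Step 3: append each event as one JSON line to a local file."""

    def __init__(self, path):
        self.path = path

    def emit(self, record):
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record, sort_keys=True) + "\n")


class BufferSink:
    """Step 4 stand-in: keeps rows in memory where a real sink would stream them."""

    def __init__(self):
        self.rows = []

    def emit(self, record):
        self.rows.append(record)


def copy_to_all(record, sinks):
    """Step 2: deliver the same event to every configured output."""
    for sink in sinks:
        sink.emit(dict(record))  # each sink receives its own copy


def process(event, sinks):
    """Run one event through the transform and fan it out to all sinks."""
    copy_to_all(transform(event), sinks)


if __name__ == "__main__":
    log_path = os.path.join(tempfile.mkdtemp(), "frontend.user.log")
    warehouse = BufferSink()
    process({"uid": "u1", "host": "web-1", "total": 10, "count": 4},
            [FileSink(log_path), warehouse])
    print(warehouse.rows)
```

Adding another destination means appending one more object with an `emit` method to the sink list, which is the same idea as wiring in an extra output plugin.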
1 record_transformer

    <filter frontend.user.*>
      @type record_transformer
      enable_ruby
      remove_keys host
      <record>
        bq {"insert_id":"${uid}","host":"${host}","created":"${time.to_i}"}
        avg ${record["total"] / record["count"]}
      </record>
    </filter>

Syntax: Ruby, easy to use. Great for:
- date transformation,
- quick normalizations,
- calculating something on the fly and storing it in clear in the log/analytics db,
- renaming without code deploy.

2 copy, 3 file

    <match frontend.user.*>
      @type copy
      <store>
        @type forest
        subtype file
        <template>
          path /tank/storage/${tag}.*.log
          time_slice_format %Y%m%d
        </template>
      </store>
    </match>

4 BigQuery

    <match frontend.user.*>
      @type bigquery
      method insert
      auth_method json_key
      json_key /etc/td-agent/keys/key-31da042be48c.json
      time_field timestamp
      time_slice_format %Y%m%d
      table user$%{time_slice}
      ignore_unknown_values
      schema_path /etc/td-agent/schema/user_login.json
    </match>

Connector uses:
- JSON key auth file
- JSON table schema
Pro features:
- streaming to partitioned tables
- ignore unknown values (not reflected in schema)

Where to use BigQuery?
● On data that is difficult to process/analyze using traditional databases
● Not a replacement for traditional DBs, but it complements the system
● Major strength is handling large datasets
● Applying JavaScript UDF on columnar storage to resolve complex tasks (e.g. JS for natural language processing)
● On streams (forms, IoT, Kafka)
● On exploring unstructured data

Achievements - goal reached by measuring everything
➢ Optimize product pages
➢ Email engagement
➢ Funnel Analysis

Achievements
● Funnel Analysis

Creating #serverless data