Flexible Network Analytics in the Cloud
Jon Dugan & Peter Murphy
ESnet Software Engineering Group
October 18, 2017, TechEx 2017, San Francisco

Introduction
● Harsh realities of network analytics
● netbeam
● Demo
● Technology Stack
● Alternative Approaches
● Lessons Learned

Architecture
The Harsh Realities of Network Analytics
1. It’s a mess
   ● Your data isn’t neat and tidy
2. Things change
   ● What you need today may not be what you need tomorrow
3. There’s always more
   ● More devices & more telemetry
4. It’s never really done
   ● Time and money are limited
Coping strategies
1. It’s a mess
   ● Design knowing things won’t be tidy
2. Things change
   ● Keep raw data to keep your options open
3. There’s always more
   ● Rely on the cloud for scaling
4. It’s never really done
   ● “What” not “How”
netbeam
Network Analytics in Google Cloud
Three Pillars
1. Real time analytics
   ○ Low latency, incomplete
2. Offline analytics
   ○ High latency, complete
3. Flexible data model
   ○ Changing needs? Recompute from raw data!
Secret sauce: Apache Beam
What is Apache Beam?
1. The Beam Programming Model
2. SDKs for writing Beam pipelines
3. Runners for existing distributed processing backends
   ○ Apache Apex
   ○ Apache Flink
   ○ Apache Spark
   ○ Google Cloud Dataflow
   ○ Local runner for testing
Slide courtesy of the Apache Beam Project

The Evolution of Apache Beam
[Diagram: Google’s internal data systems (MapReduce, Colossus, BigTable, PubSub, Dremel, Spanner, Megastore, Millwheel, Flume) led to Google Cloud Dataflow, whose programming model was open sourced as Apache Beam.]
Slide courtesy of the Apache Beam Project

Architecture Diagram
[Diagram: the SNMP collection system and the old SNMP system (via avro files) feed the pipeline. Apache Beam (stream processing) writes to Bigtable (realtime) and BigQuery (immutable); Apache Beam (batch processing) reads the BigQuery source of truth and writes rollups (5m, 1h, 1d averages), percentiles, and other views back to Bigtable; BigQuery (historical) holds the imported avro data. An API serves data from Bigtable to clients.]

SNMP collection system
● Google Pubsub
● Uses Python outside of Google Cloud to poll devices and write to a Pubsub topic
● Code within Google Cloud subscribes to the topic to process data
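The slides do not show the poller’s message format. As a hedged sketch only, a Python poller might publish one JSON message per SNMP sample to a Pub/Sub topic; the field names and topic usage here are assumptions, not the actual netbeam schema (the Pub/Sub publish call is indicated as a comment):

```python
import json
import time

def make_sample_message(device, if_name, in_octets, out_octets, ts=None):
    """Build one JSON-encoded SNMP counter sample for a Pub/Sub topic.

    Field names are illustrative; the real netbeam schema is not
    shown in the slides.
    """
    payload = {
        "device": device,
        "ifName": if_name,
        "ifHCInOctets": in_octets,    # 64-bit SNMP octet counters
        "ifHCOutOctets": out_octets,
        "timestamp": ts if ts is not None else int(time.time()),
    }
    return json.dumps(payload).encode("utf-8")

# With the google-cloud-pubsub client this would be published roughly as:
#   publisher.publish(topic_path, make_sample_message(...))

msg = make_sample_message("router1", "xe-0/0/0", 123456789, 987654321, ts=1508284800)
print(json.loads(msg)["device"])  # router1
```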
Apache Beam (Stream)
● Apache Beam / Google Dataflow
● Stream processing
● Subscribes to the Pubsub topic
● Raw data is written to BigQuery
● Real time transformed data (e.g. aligned data rates) written to Bigtable
● Writes and makes use of metadata in Bigtable (not shown)
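The stream job aligns counter samples and computes data rates. A minimal pure-Python sketch of that idea; the bin size, counter-wrap handling, and function names are assumptions, not the actual netbeam code:

```python
COUNTER_MAX = 2 ** 64  # ifHCInOctets is a 64-bit wrapping counter

def compute_rates(samples, bin_secs=30):
    """Turn (timestamp, octet_counter) samples into per-second bit
    rates aligned to bin_secs boundaries.

    samples: list of (unix_ts, counter) tuples, sorted by time.
    Returns a list of (aligned_ts, bits_per_second).
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        delta = (c1 - c0) % COUNTER_MAX        # handles counter wrap
        span = t1 - t0
        if span <= 0:
            continue                           # skip duplicate/out-of-order samples
        aligned = (t1 // bin_secs) * bin_secs  # snap to the bin boundary
        rates.append((aligned, delta * 8 / span))  # octets -> bits/sec
    return rates

print(compute_rates([(0, 0), (30, 300), (60, 900)]))
# [(30, 80.0), (60, 160.0)]
```

Grouping by interface before applying this (as the pipeline on the code slide does with GroupByKey) keeps each interface’s counter deltas independent.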
Cloud Bigtable
● Like HBase
● Write to cells in rows, indexed by keys
● We write 1 day of data to a single row (columns are the time of day, key is metric and day)
● Fast access to a row by key, can serve data from here
● Store one year
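A sketch of the one-row-per-day layout described above, using a plain dict to stand in for a Bigtable table; the `metric::day` key format and second-of-day column names are illustrative assumptions:

```python
from datetime import datetime, timezone

table = {}  # {row_key: {column: value}} -- stands in for a Bigtable table

def write_point(metric, ts, value):
    """Write one data point into the one-row-per-day layout:
    row key = metric and UTC day, column = time of day."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    row_key = "{}::{}".format(metric, dt.strftime("%Y-%m-%d"))
    col = dt.hour * 3600 + dt.minute * 60 + dt.second  # second of day
    table.setdefault(row_key, {})[col] = value

def read_day(metric, day):
    """One key lookup returns the whole day, sorted by time of day."""
    return sorted(table.get("{}::{}".format(metric, day), {}).items())

write_point("snmp::router1::xe-0/0/0::in", 1503964800, 4.2e9)  # 2017-08-29 00:00:00 UTC
write_point("snmp::router1::xe-0/0/0::in", 1503964830, 4.5e9)
print(read_day("snmp::router1::xe-0/0/0::in", "2017-08-29"))
```

The point of the layout is that one key fetch retrieves a full day of samples, which is what makes serving timeseries "tiles" straight out of Bigtable fast.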
BigQuery
● Data warehousing solution
● Cheap storage, SQL access, but not suitable for real-time access
● Allows SQL queries for ad hoc investigation
● We store our source of truth here
● Also store historical data (7 years), imported via avro files
Apache Beam (Batch)
● Apache Beam / Google Dataflow
● Batch processing
● Run with a cron job
● Recalculate Bigtable data each night from the source of truth in BigQuery
● Process Bigtable rows into new rows of 5 min, 1 hr and 1 day aggregations
● Additional pre-computed views, e.g. percentiles for traffic distribution over a month
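The nightly rollups and percentile views can be sketched in pure Python over in-memory lists; the real job is a Beam pipeline reading BigQuery, and the sample spacing, nearest-rank percentile method, and function names here are assumptions:

```python
def rollup(points, width_secs):
    """Average (timestamp, value) points into fixed windows, e.g.
    width_secs=300 for 5m, 3600 for 1h, 86400 for 1d rollups."""
    buckets = {}
    for ts, val in points:
        buckets.setdefault((ts // width_secs) * width_secs, []).append(val)
    return sorted((t, sum(vs) / len(vs)) for t, vs in buckets.items())

def percentile(sorted_vals, pct):
    """Nearest-rank percentile over an ascending-sorted list."""
    idx = max(0, int(round(pct / 100.0 * len(sorted_vals))) - 1)
    return sorted_vals[idx]

# 30s points rolled up into 5m averages
points = [(t, float(t)) for t in range(0, 600, 30)]
print(rollup(points, 300))          # [(0, 135.0), (300, 435.0)]

# a percentile view over the 5m averages
avgs = sorted(v for _, v in rollup(points, 300))
print(percentile(avgs, 99))         # 435.0
```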
API (Dataserver, node.js)
● Currently runs on App Engine
● Node.js
● Serves data out of Bigtable
● Timeseries data is served as ‘tiles’; each tile is one row
● Would like to use Cloud Endpoints and provide a gRPC API service
● Looking forward to a grpc-web solution
Use case example: Historical Trends
[Diagram: the stream pipeline writes per-day interface totals from BigQuery into Bigtable; BigQuery (historical), loaded from the old avro data, is aggregated into per-month totals, also stored as Bigtable rows and served by the Dataserver API (node.js).]

Bigtable rows:

                                     Jan 1     Jan 2     ...   Dec 31
  snmp-daily::2017-08::$interface    1.8 Pb    1.9 Pb    ...   3.1 Pb

                        Jan 1991    Feb 1991    ...   Sep 2017
  snmp-monthly-totals   28 Gb       29 Gb       ...   56 Pb

Use case: real time anomaly detection

Baseline generation produces an average for each interface over the past 3 months for that hour/day.
[Diagram: baseline generation reads the stream/BigQuery data and writes per-interface baselines to Bigtable; anomaly detection compares the baseline to real time values to generate the current deviation from normal, served via the Dataserver API (node.js).]

                                   Mon 12am   Mon 1am   Mon 2am   ...   Sun 11pm
  baseline::5m::avg::$interface    2.1        1.9       0.3       ...   0.5

                      iface-1   iface-2   ...   iface-n
  anomaly::5m::avg    +0.1      +2.0      ...   -1.5
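A pure-Python sketch of the baseline/deviation idea; the slides do not show the actual computation, so averaging per (weekday, hour) slot and reporting a simple difference are assumptions:

```python
from collections import defaultdict
from datetime import datetime, timezone

def build_baseline(history):
    """history: list of (unix_ts, rate). Returns the average rate per
    (weekday, hour) slot, like the baseline::5m::avg rows above."""
    slots = defaultdict(list)
    for ts, rate in history:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        slots[(dt.weekday(), dt.hour)].append(rate)
    return {slot: sum(v) / len(v) for slot, v in slots.items()}

def deviation(baseline, ts, rate):
    """Current deviation from normal for one sample, or None when no
    baseline exists for that hour/day slot."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    expected = baseline.get((dt.weekday(), dt.hour))
    return None if expected is None else rate - expected

hist = [(0, 2.0), (7 * 86400, 4.0)]      # two Thursdays at 00:00 UTC
base = build_baseline(hist)
print(deviation(base, 14 * 86400, 5.0))  # 5.0 - avg(2.0, 4.0) = 2.0
```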
Use case example: Percentiles

[Diagram: daily rollups of 5m averages stream from the SNMP collection system into Bigtable; a percentiles view sorts a month of 5m rollups (8640 samples) into percentile rows, served via the Dataserver API (node.js).]

                                               1         2         ...   8640
  rollup-month-5m::2017-08::$interface::in     6 Gbps    5 Gbps    ...   2 Gbps

                                           1 pct       2 pct       ...   99 pct
  percentiles::2017-08::$interface::in     0.1 Gbps    0.3 Gbps    ...   22.1 Gbps

Demo
Example: Computing Total Traffic
# Python Beam SDK
import apache_beam as beam
from apache_beam.io import ReadFromText

# FormatCSVDoFn, group_by_device_interface and compute_rate are
# defined in the full code (link below).
pipeline = beam.Pipeline('DirectRunner')

(pipeline
 | 'read' >> ReadFromText('./example.csv')
 | 'csv' >> beam.ParDo(FormatCSVDoFn())
 | 'ifName key' >> beam.Map(group_by_device_interface)
 | 'group by iface' >> beam.GroupByKey()
 | 'compute rate' >> beam.FlatMap(compute_rate)
 | 'timestamp key' >> beam.Map(lambda row: (row['timestamp'], row['rateIn']))
 | 'group by timestamp' >> beam.GroupByKey()
 | 'sum by timestamp' >> beam.Map(lambda rates: (rates[0], sum(rates[1])))
 | 'format' >> beam.Map(lambda row: '{},{}'.format(row[0], row[1]))
 | 'save' >> beam.io.WriteToText('./total_by_timestamp'))

pipeline.run()
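The helper transforms in the pipeline are defined in the full code linked below. As a hedged illustration only, `group_by_device_interface` and `compute_rate` might look roughly like this; the `row['timestamp']` and `row['rateIn']` fields follow the pipeline above, everything else is an assumption:

```python
def group_by_device_interface(row):
    """Key each parsed CSV row by (device, ifName) so that samples for
    one interface end up grouped together by GroupByKey."""
    return ((row['device'], row['ifName']), row)

def compute_rate(keyed):
    """For one interface's samples, turn successive octet counters into
    rates; used with FlatMap, so it yields zero or more rows."""
    _, rows = keyed
    rows = sorted(rows, key=lambda r: r['timestamp'])
    for prev, cur in zip(rows, rows[1:]):
        span = cur['timestamp'] - prev['timestamp']
        if span > 0:
            yield {'timestamp': cur['timestamp'],
                   'rateIn': (cur['inOctets'] - prev['inOctets']) / span}

samples = [{'device': 'r1', 'ifName': 'eth0', 'timestamp': t, 'inOctets': t * 10}
           for t in (0, 30, 60)]
key, _ = group_by_device_interface(samples[0])
print(list(compute_rate((key, samples))))
# [{'timestamp': 30, 'rateIn': 10.0}, {'timestamp': 60, 'rateIn': 10.0}]
```

Because both helpers are plain Python callables, Beam can apply them with `beam.Map` and `beam.FlatMap` unchanged.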
Full code available at: http://x1024.net/blog/2017/05/chinog-flexible-network-analytics-in-the-cloud/

Our Stack
● Apache Beam using Scio
● Google Cloud Platform
   ○ Dataflow
   ○ Bigtable
   ○ BigQuery
   ○ Pub/Sub
   ○ App Engine
● Languages
   ○ Scala
   ○ Javascript / Typescript
   ○ Python
Current Status & Future Plans
Current

Alpha version for SNMP data:
● Ingest to BigQuery is working
● Migration of historical data is implemented; awaiting final details before full conversion
● Streaming ingest to Bigtable still in process
● Early version of utilization visualization
● Simple data server can provide data to clients, but gRPC API coming
● Interface timeseries charts functional

Future

More types of data:
● Flow data
● perfSONAR

Machine Learning

Anomaly Detection

“Mash up” various data sources
Why not InfluxDB, Elastic or ${FAVORITE_DB}?
● We have a data processing problem, not a data storage problem per se
   ○ Beam and the ecosystem around it give a huge amount of flexibility; we can try new ideas as they occur to us
   ○ Ability to move to different platform components
   ○ Machine learning (TensorFlow and others)
● InfluxDB & Elastic
   ○ Require care and feeding: we would have to think about disks, machines, etc.
   ○ At our last evaluation (a while ago now) InfluxDB wasn’t able to keep up with our load; this may have changed, but the other benefits outweigh it
   ○ Elastic doesn’t seem to be a good fit for long term storage: everything is in the “hot” tier
Why the cloud? Why Google Cloud Platform?

Why the cloud?
● Focus on our problems, not on infrastructure
● Scalability without needing to own lots of systems
● Managed services for databases and compute

Why Google Cloud?
● Apache Beam was Google Dataflow when we first encountered it
● More cohesive ecosystem than AWS, in our experience
Lessons learned / Life in the cloud / Good & Bad

● This approach is not a silver bullet, but it definitely makes many things easier
● Scaling is pretty sweet: we processed 4,005,271,066 points in 13 hours
● GCP tech support could be better
● Despite early indications, Python streaming support in Beam has been slow to appear; Python is a second class citizen. Fortunately Scio and Scala allow working with the Java SDK at a high level of abstraction.
● Scala is powerful but challenging at times
● Focus on developing your services, not on setting up machines to run them
   ○ Nice options for decomposing services (Endpoints/ESP, load balancing, etc.)
   ○ Service oriented
   ○ Battle tested software stacks
Thank you!

Peter Murphy

● MyESnet: https://my.es.net
● ESnet Open Source: http://software.es.net/
   ○ http://software.es.net/react-timeseries-charts/
   ○ http://software.es.net/pond/
   ○ http://software.es.net/react-network-diagrams/
● Scio: https://github.com/spotify/scio
● Beam: https://beam.apache.org