High Altitude, Low Risk: Measuring Reliability in the Cloud Using Open Source Technology

Table of contents

1. alex@digitalocean:~$ whoami
2. OSS building blocks at DigitalOcean
3. Service Level Management: productizing reliability data
4. What's Next/Wrap-Up
5. Questions (and Answers?)

alex@digitalocean:~$ whoami

DigitalOcean (DO):
  • Global cloud hosting provider
  • 12 data centers, worldwide
  • DO builds products that help engineering teams build, deploy and scale cloud applications

My team: Observability Applications. What is our mission?
1. To simplify and optimize consumption of data internally
2. To reduce incident MTTD/MTTR through custom applications
3. To help define, maintain, and broadcast source-of-truth performance and reliability data to the rest of the organization

How can we achieve these things?
  • By building data products...
  • ...for specific stakeholders...
  • ...with widely available open source tools...
  • ...to integrate with systems already in place!
(Hint: Service Level Management)

Open Source Software Building Blocks

OSS tools in use at DigitalOcean (a sampling): https://github.com/digitalocean

Service Level Management (SLM)

SLM covers SLAs and SLOs:
  • SLA: an agreement with consequences.
  • SLO: an objective, or goal, but not a commitment.
  • SLA = service consumption
  • SLO = service production

Prioritization:
1. SLOs
2. SLAs

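To make the SLO definition concrete: an availability objective implies a fixed amount of permitted downtime per measurement window. The sketch below only does that arithmetic. The function name `allowed_downtime_minutes`, the 99.99% target, and the 30-day window are illustrative assumptions, not figures from the talk.

```python
def allowed_downtime_minutes(slo_target, window_days=30):
    """Minutes of downtime that still meet an availability objective.

    slo_target is a fraction such as 0.9999 (hypothetical example value).
    """
    if not 0 < slo_target <= 1:
        raise ValueError("slo_target must be in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


# Hypothetical objective: 99.99% availability over a 30-day window.
budget = allowed_downtime_minutes(0.9999)
print(f"{budget:.2f} minutes of downtime allowed")
```

Because an SLO is a goal rather than a commitment, a number like this is something a team tracks internally; an SLA would attach consequences to missing it.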
The goal of SLM (ahem): ...to standardize, centralize, and publish objective data about the reliability of both customer- and internal-facing products and services.
aka: a data product to serve all of DO

ETL...E?

SLM: E

Extraction points: [five items shown as logos; not recoverable from the text]

  • Easy to implement and deploy at scale
  • Flexible time-series metrics
  • Counters: monotonically increasing
  • Gauges: metrics for the current state of a service

Export to an HTTP endpoint for consumption:

    import (
        "net"
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // ServeSLAMetrics registers the SLA collector and serves it over HTTP.
    func ServeSLAMetrics(e *Environment) error {
        // Use a dedicated registry so only the SLA collector is exposed.
        reg := prometheus.NewRegistry()
        err := reg.Register(NewSLACollector(e.SLALastRunFile))
        if err != nil {
            return err
        }
        handler := promhttp.HandlerFor(reg, promhttp.HandlerOpts{})
        listener, err := net.Listen("tcp", e.SLAAddr)
        if err != nil {
            return err
        }
        // Serve in the background; errors from Serve are not handled here.
        go http.Serve(listener, handler)
        return nil
    }

Globally-distributed servers: ??? ...

PANDORA
  • a bunch of Prometheus servers on management droplets
  • configured using a Git repository and some Go code

    services:
      # eng-platcore
      - luca-kubernetes-alerts
      - luca-kubernetes-apiservers
      - luca-kubernetes-kubelet
      - luca-kubernetes-nodes
      - luca-kubernetes-pods
      - luca-kubernetes-service-endpoints

Caveat: data retention buffers

Efficient distribution/consumption of global data, including:
  • Centralized logging
  • Regional product lifecycle events
  • Global metrics data

Caveat: data retention buffers

SLM: ET

Fetch the query result from Prometheus:

    # Decode the response body first, then parse it as JSON.
    prom_data_payload = json.loads(
        requests.get(full_prometheus_query_string, timeout=50)
        .content.decode())['data']['result']

Flatten each time series into one row per sample:

    # Fragment from inside a function; rows is a list created earlier.
    for timeseries in prom_data_payload:
        metric = timeseries["metric"]
        values = timeseries["values"]
        rows.extend([
            {
                **metric,
                "timestamp": datetime.fromtimestamp(v[0]),
                "value": float(v[1])
            } for v in values])
    return pd.DataFrame(rows)

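The two Python fragments above can be read as one transform step. Below is a minimal, standard-library-only sketch of that step, assuming the payload has the `data.result[*].metric` / `values` shape the fragments index into. The function name `payload_to_rows` and the sample payload are hypothetical. It returns plain dicts instead of a pandas DataFrame and uses UTC timestamps, which differs from the naive local time in the slide code.

```python
import json
from datetime import datetime, timezone


def payload_to_rows(payload_text):
    """Flatten a range-query style JSON response into one dict per sample."""
    result = json.loads(payload_text)["data"]["result"]
    rows = []
    for timeseries in result:
        metric = timeseries["metric"]  # label set shared by every sample
        for ts, value in timeseries["values"]:
            rows.append({
                **metric,
                # Assumption: timestamps are Unix seconds; converted to UTC here.
                "timestamp": datetime.fromtimestamp(ts, tz=timezone.utc),
                "value": float(value),  # sample values arrive as strings
            })
    return rows


# Hypothetical payload with two series and three samples in total.
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {"metric": {"metric_name": "up", "region": "nyc3"},
             "values": [[1600000000, "1"], [1600000300, "0"]]},
            {"metric": {"metric_name": "up", "region": "ams3"},
             "values": [[1600000000, "1"]]},
        ],
    },
})

rows = payload_to_rows(SAMPLE)
print(len(rows), rows[0])
```

Passing `rows` to `pd.DataFrame(rows)` would give the same tabular form the slide code returns.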
Resample every series onto a fixed 300-second grid:

    @F.pandas_udf(df_fixed.schema, F.PandasUDFType.GROUPED_MAP)
    def resample_by_ts(x):
        return x.drop_duplicates('ts')\
            .sort_values('ts')\
            .set_index('ts')\
            .asfreq('300s', method='ffill')\
            .resample('300s')\
            .first().reset_index()

    df_resampled = df_fixed.groupBy(
        "droplet_id", "node", "metric_name").apply(resample_by_ts)

SLM: ETL

Write partitioned Parquet from Spark:

    spark_df \
        .write \
        .mode(saveMode="append") \
        .partitionBy(
            "metric_name", "region", "year", "month", "day", "hour") \
        .parquet(full_hdfs_target)

Or from pandas, using pyarrow:

    import pyarrow as pa
    import pyarrow.parquet as pq

    def write(pandas_df, hdfs_url, *partition_cols):
        __check_url(hdfs_url)
        table = pa.Table.from_pandas(pandas_df, preserve_index=False)
        pq.write_to_dataset(
            table,
            root_path=hdfs_url,
            partition_cols=list(partition_cols),
            preserve_index=False,
            compression="snappy",
            flavor="spark")

SLM: ETLE

  • massively parallel, cross-catalog query engine
  • schema metadata
  • data warehouse

HV Droplet Availability: an SLO Pipeline example

A quick product history:
  • Droplet product launched in 2011
  • Pioneered selling VMs on SSDs
  • Over 100MM droplets deployed

Virtualization API:

    enum virDomainState {
        VIR_DOMAIN_NOSTATE = 0,
        VIR_DOMAIN_RUNNING = 1,
        VIR_DOMAIN_BLOCKED = 2,
        VIR_DOMAIN_PAUSED = 3,
        VIR_DOMAIN_SHUTDOWN = 4,
        VIR_DOMAIN_SHUTOFF = 5,
        VIR_DOMAIN_CRASHED = 6,
        VIR_DOMAIN_PMSUSPENDED = 7
    }

[Pipeline diagram: tool logos joined by plus signs; not recoverable from the text]

SLM: A Recap
1. Portfolio
2. (Nearly) unlimited data history
3. API

Next Steps in Observability Applications

  • SLM v2: batch, streaming

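Returning to the HV droplet availability example: once the state samples sit on a fixed 300-second grid, an availability ratio can be computed by counting samples. The sketch below is an assumption about how such a ratio might be defined, not the pipeline from the talk. It treats only `VIR_DOMAIN_RUNNING` as available, and the `availability` function and sample data are hypothetical.

```python
from enum import IntEnum


class VirDomainState(IntEnum):
    """Mirror of the virDomainState values listed in the slide."""
    NOSTATE = 0
    RUNNING = 1
    BLOCKED = 2
    PAUSED = 3
    SHUTDOWN = 4
    SHUTOFF = 5
    CRASHED = 6
    PMSUSPENDED = 7


def availability(samples, up_states=(VirDomainState.RUNNING,)):
    """Fraction of equally spaced samples whose state counts as up.

    Assumption: every sample covers the same interval (for example 300 s),
    so the ratio of sample counts equals the ratio of time.
    """
    if not samples:
        raise ValueError("no samples")
    up = sum(1 for s in samples if VirDomainState(s) in up_states)
    return up / len(samples)


# Hypothetical droplet: ten samples, one of them in the CRASHED state.
samples = [1, 1, 1, 6, 1, 1, 1, 1, 1, 1]
ratio = availability(samples)
print(f"availability: {ratio:.1%}")
```

Which states count as available is a policy choice; a real objective would likely need to decide how to treat a droplet its owner powered off.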
Some final thoughts
  • Productize your reliability data
  • Mix n' match your OSS tools
  • Love 2 the 9s

Thank You!
