Analyzing Performance in the Cloud: Solving an Elastic Problem with a Scientific Approach

Nicholas Wakou (Dell EMC), Alex Krzos (Red Hat)
Thursday, October 27, 2016, OpenStack Summit 2016, Barcelona

Presenters

Nicholas Wakou is a Principal Performance Engineer with Dell EMC Open Source Solutions. [email protected]
Alex Krzos is a Senior Performance Engineer at Red Hat working on OpenStack. [email protected]

https://www.openstack.org/summit/barcelona-2016/summit-schedule/events/16204/analyzing-performance-in-the-cloud-solving-an-elastic-problem-with-a-scientific-approach

Agenda

➢ CLOUD DEFINITION & CHARACTERISTICS
➢ PERFORMANCE MEASURING TOOLS
➢ SPEC CLOUD IaaS 2016
➢ PERFORMANCE MONITORING TOOLS
➢ PERFORMANCE CHARACTERIZATION
➢ TUNING TIPS

CLOUD DEFINITION & CHARACTERISTICS

DEFINING A CLOUD

NIST SPECIAL PUBLICATION 800-145 Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.

http://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-145.pdf

CLOUD CHARACTERISTICS

PERFORMANCE MEASURING TOOLS

RALLY

OpenStack Benchmarking Tool

➢ as-an-App and as-a-Service
➢ Verification
➢ Benchmarking
➢ Profiling
➢ Reports
➢ SLAs for Benchmarks
➢ Many plugins
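As a sketch of how a benchmark run is described, a minimal Rally task can be generated and handed to the CLI. The scenario name below is a stock Rally scenario, but the flavor and image names are placeholders for whatever exists in your cloud:

```python
import json

# Minimal Rally task: boot and delete a Nova server 10 times, 2 at a time.
# "NovaServers.boot_and_delete_server" is a standard Rally scenario; the
# flavor and image names are placeholders (assumptions) for your environment.
task = {
    "NovaServers.boot_and_delete_server": [{
        "args": {
            "flavor": {"name": "m1.small"},   # placeholder flavor
            "image": {"name": "cirros"},      # placeholder image
        },
        "runner": {"type": "constant", "times": 10, "concurrency": 2},
        "sla": {"failure_rate": {"max": 0}},  # SLA: fail the task on any error
    }]
}

with open("boot_and_delete.json", "w") as f:
    json.dump(task, f, indent=2)

# Then run it against a deployment registered with Rally:
#   rally task start boot_and_delete.json
```

The SLA block is what lets Rally gate a run as pass/fail rather than only reporting timings.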

Source: What is Rally?, https://rally.readthedocs.io/en/latest/

PERFKIT BENCHMARKER

Open-source, living benchmarking framework containing a set of benchmarks used to compare cloud offerings/environments

➢ 10+ Cloud Providers/Environments
➢ 34+ Benchmarks
➢ Large community involvement
➢ Captures cloud elasticity with benchmark results
➢ Uses cloud/environment CLI tooling
➢ Publishes results to BigQuery for comparison

Source: Introduction to Perfkit Benchmark and How to Extend it, https://github.com/GoogleCloudPlatform/PerfKitBenchmarker/wiki/Tech-Talks

PERFKIT EXPLORER

Dashboarding and Performance Analysis Tool for PerfKitBenchmarker Results

➢ Multiple chart options
➢ Uses BigQuery as the backend data store
➢ Hosted in App Engine

Source: https://github.com/GoogleCloudPlatform/PerfKitExplorer

CLOUDBENCH

➢ Framework that automates cloud-scale evaluation and benchmarking
➢ Benchmark Harness:
  ▪ Requests the Cloud Manager to create instance(s)
  ▪ Submits a configuration plan and steps to the Cloud Manager on how the test will be performed
  ▪ At the end of the test, collects and logs applicable performance data and logs
  ▪ Destroys instances no longer needed for the test

HARNESS AND WORKLOAD CONTROL

(Diagram: Benchmark Harness and Cloud SUT; a group of boxes represents an application instance.)

The Benchmark Harness comprises CloudBench (CBTOOL), baseline/elasticity drivers, and report generators. For white-box clouds, the benchmark harness is outside the SUT. For black-box clouds, it can be in the same location or campus.

BROWBEAT

Orchestration tool for existing OpenStack workloads
➢ Combines workloads, metrics, and results into a single tool
➢ Runs performance workloads:
  ➢ Rally - Control Plane
  ➢ Rally Plugins & Rally+pBench Plugins - Control+Data Plane
  ➢ Shaker - Network Data Plane
  ➢ PerfKitBenchmarker - Data Plane + Cloud Elasticity
➢ Provides performance infrastructure installation and configuration for:
  ➢ Carbon/Graphite/Grafana
  ➢ Collectd
  ➢ ELK
  ➢ FluentD
➢ Provides dashboards for visualizing and comparing results and system performance metrics

BROWBEAT - RESULTS

BROWBEAT - METRICS

SPEC CLOUD IAAS 2016 BENCHMARK

➢ Measures performance of Infrastructure-as-a-Service (IaaS) clouds
➢ Measures both control and data plane
  ▪ Control: management operations, e.g., instance provisioning time
  ▪ Data: virtualization, network performance, runtime performance
➢ Uses workloads that
  ➢ resemble “real” customer applications
  ➢ benchmark the cloud, not the application
➢ Produces metrics (“elasticity”, “scalability”, “provisioning time”) which allow comparison
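The benchmark's instance provisioning time runs from the create request until the instance answers a probe on port 22. A minimal sketch of that readiness probe (the host address and timeout values below are illustrative):

```python
import socket
import time

def wait_for_ssh(host, port=22, timeout=600, interval=5):
    """Poll until a TCP connection to `port` succeeds; return elapsed seconds.

    Mirrors the netcat-style check SPEC Cloud uses to decide an instance is
    provisioned: the clock starts at the create request and stops when the
    guest first answers on port 22.
    """
    start = time.time()
    while time.time() - start < timeout:
        try:
            with socket.create_connection((host, port), timeout=2):
                return time.time() - start
        except OSError:
            time.sleep(interval)
    raise TimeoutError(f"{host}:{port} not reachable within {timeout}s")

# Hypothetical usage, immediately after asking Nova for a new instance:
#   elapsed = wait_for_ssh("10.0.0.5")
```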

Source: SPEC Cloud IaaS Benchmarking: Dell Leads the Way, http://en.community.dell.com/techcenter/cloud/b/dell-cloud-blog/archive/2016/06/24/spec-cloud-iaas-benchmarking-dell-leads-the-way

Scalability and Elasticity Analogy: Climbing a Mountain

IDEAL

Scalability
• Mountain: keep on climbing
• Cloud: keep on adding load without errors

Elasticity
• Mountain: each step takes identical time
• Cloud: performance stays within limits as load increases

(Figure: elasticity illustrated as the time for each step; scalability as conquering an infinitely high mountain.)

WHAT IS MEASURED?

➢ Measures the number of AIs that can be loaded onto a cluster before SLA violations occur
➢ Measures the scalability and elasticity of the Cloud under Test (CuT)
➢ Not a measure of instance density
➢ SPEC Cloud workloads can individually be used to stress the CuT:
  ▪ KMeans - CPU/Memory
  ▪ YCSB - IO

SPEC CLOUD BENCHMARK PHASES

Baseline Phase

▪ Determine the baseline results for a single application instance (AI) of each workload
▪ AI = stream of 5 runs
(Figure: KMeans baseline AI; YCSB baseline AI)
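Given baseline numbers, the per-run QoS checks applied later can be sketched as follows. The thresholds are those listed under the benchmark's stopping conditions (KMeans completion time within 3.33x baseline, YCSB throughput at least baseline/3, response times within 20x baseline); the dictionary keys are illustrative, not the benchmark's own field names:

```python
def qos_violations(baseline, elastic):
    """Return the metrics of one elasticity-phase run that breach
    SPEC Cloud-style QoS limits relative to the baseline phase."""
    v = []
    # KMeans completion time must stay within 3.33x the baseline time
    if elastic["kmeans_time"] > 3.33 * baseline["kmeans_time"]:
        v.append("kmeans_time")
    # YCSB throughput must stay at or above one third of baseline
    if elastic["ycsb_tput"] < baseline["ycsb_tput"] / 3:
        v.append("ycsb_tput")
    # YCSB read/insert response times must stay within 20x baseline
    for m in ("ycsb_read_ms", "ycsb_insert_ms"):
        if elastic[m] > 20 * baseline[m]:
            v.append(m)
    return v
```

A run with an empty result list passes; once 50% of AIs report violations, the benchmark stops.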

Elasticity Phase
▪ Determine cloud elasticity and scalability results when multiple workloads are run

BENCHMARK STOPPING CONDITIONS

➢ 20% of AIs fail to provision
➢ 10% of AIs have errors in any run
➢ Max number of AIs set by the cloud provider is reached
➢ 50% of AIs have QoS violations:
  ▪ KMeans completion time ≤ 3.33 x baseline-phase completion time
  ▪ YCSB throughput ≥ baseline throughput / 3
  ▪ YCSB read response time ≤ 20 x baseline read response time
  ▪ YCSB insert response time ≤ 20 x baseline insert response time

HIGH LEVEL REPORT SUMMARY

PUBLISHED RESULTS WEBSITE

https://www.spec.org/cloud_iaas2016/results/cloudiaas2016.html

PERFORMANCE MONITORING TOOLS

CEILOMETER

Another familiar OpenStack project
➢ https://wiki.openstack.org/wiki/Telemetry

➢ Goal is to efficiently collect, normalize and transform data produced by OpenStack services

➢ Interacts directly with the OpenStack services through defined interfaces

➢ Applications can leverage Ceilometer to gather OpenStack performance data

Source: http://docs.openstack.org/developer/ceilometer/architecture.html

COLLECTD/GRAPHITE/GRAFANA

➢ Collectd
  ➢ Daemon to collect system performance statistics
  ➢ Plugins for CPU, Memory, Disk, Network, Process, ...
➢ Graphite/Carbon
  ➢ Carbon receives metrics and flushes them to Whisper database files
  ➢ Graphite is the webapp frontend to Carbon
➢ Grafana
  ➢ Visualizes metrics from multiple backends

GANGLIA

Ganglia is a scalable, distributed monitoring system for high-performance computing systems such as server nodes, clusters, and grids.
- Relatively easy to set up
- Tracks a lot of hardware-centric metrics
- Low operational burden

PERFORMANCE CHARACTERIZATION

PROVISIONING TIME: SPEC CLOUD

➢ The time needed to bring up a new instance, or to add more resources (like CPU or storage) to an existing instance
➢ Instance: time FROM the request to create a new instance TO the time the instance responds to a netcat probe on port 22
➢ Application instance: time FROM the request to create a new instance TO the time the AI reports readiness to accept client requests
➢ Provisioning time characterization using the Baseline phase:
  ➢ Increase the number of VMs (vary YCSB seeds and/or KMeans Hadoop slaves) and note the impact on provisioning time
  ➢ Vary instance configuration (flavor)

IO LIMITS

PCI-E Limits: for PCI-E Gen-3 capable slots. http://www.tested.com/tech/457440-theoretical-vs-actual-bandwidth-pci-express-and-thunderbolt/
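The per-lane arithmetic behind these limits is easy to reproduce. A back-of-envelope check of the PCI-E Gen-3 figures discussed here:

```python
# PCI-E Gen-3 signals at 8 GT/s per lane and uses 128b/130b encoding, so the
# usable rate sits just under the raw 8 Gb/s per lane.
GEN3_GTPS = 8.0             # raw giga-transfers per second, per lane
ENCODING = 128 / 130        # 128b/130b line-coding efficiency
LANES = 8                   # a typical x8 slot

lane_gbps = GEN3_GTPS * ENCODING        # ~7.88 Gb/s usable per lane
x8_GBps = lane_gbps * LANES / 8         # 8 bits per byte -> GB/s
print(f"x8 slot: ~{x8_GBps:.2f} GB/s")  # commonly rounded to 8 GB/s
```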

SAS Limit: an LSI whitepaper, Switched SAS: Sharable, Scalable SAS Infrastructure. http://www.abacus.cz/prilohy/_5025/5025548/SAS_Switch_White%20Paper_US-EN_092210.pdf

NETWORK/IO CHARACTERIZATION

➢ SPEC Cloud YCSB Baseline tests - throughput (ops/s)
  ➢ Vary number of seeds
  ➢ Increase number of YCSB records and operations
  ➢ Increase number of YCSB threads
➢ CloudBench fio
➢ CloudBench netperf
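On the disk side, a small fio job file is enough to drive this kind of IO characterization. The block size, queue depth, and runtime below are illustrative starting points, not the exact parameters CloudBench uses:

```python
# Write a minimal fio job file for a 4k random-read test inside a guest.
job = """\
[global]
ioengine=libaio
direct=1
bs=4k
iodepth=32
runtime=60
time_based=1

[randread]
rw=randread
size=1G
"""

with open("randread.fio", "w") as f:
    f.write(job)

# Run inside the instance with:  fio randread.fio
```

Sweeping `bs` and `iodepth` across runs shows where the storage path hits the PCI-E or SAS ceilings discussed above.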

➢ Understand network utilization under load
  ➢ Management networks
  ➢ Data networks (Neutron tenant)
➢ Monitor with Ganglia, collectd, Linux tools (vmstat, iostat, etc.)

CPU CHARACTERIZATION

➢ Understand CPU utilization under load
➢ Monitor with Ganglia, collectd, Grafana
➢ Linux tools (top, vmstat), SPEC Cloud, KMeans

Note:
✓ CPU user time
✓ CPU system time
✓ CPU iowait time
✓ CPU irq time
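On Linux, the counters called out above come straight from the first line of /proc/stat; a minimal reader:

```python
def cpu_times():
    """Return the aggregate CPU counters noted above (user, system, iowait,
    irq, plus their neighbors) from /proc/stat, in clock ticks since boot."""
    with open("/proc/stat") as f:
        fields = f.readline().split()   # first line is the "cpu" aggregate
    names = ("user", "nice", "system", "idle", "iowait", "irq", "softirq")
    return dict(zip(names, map(int, fields[1:1 + len(names)])))

# Sampling this twice and differencing gives utilization over an interval,
# which is what tools like collectd's CPU plugin do internally.
```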

➢ Use SPEC Cloud Baseline tests for CPU characterization
  ➢ Vary the number of Hadoop slaves
  ➢ Increase sample size, number of dimensions, number of clusters

SCALABILITY/ELASTICITY

➢ Understand Scalability/Elasticity of the CuT

➢ SPEC Cloud Elasticity phase
  ➢ Vary number of AIs
  ➢ Monitor with the FDR HTML report

TUNING TIPS

HARDWARE/OS TUNING

➢ Latest BIOS and firmware revs
➢ Appropriate BIOS settings
➢ RAID/JBOD
➢ Disk controller
➢ NIC driver: interrupt coalescing and affinitization
➢ NIC bonding
➢ NIC jumbo frames
➢ OS configuration settings

CLOUD TUNING

▪ HW/OS tuning
▪ Cloud configs/settings
▪ Workload tuning

INSTANCE CONFIGURATION

Performance is impacted by:
▪ Instance type (flavor)
▪ Number of instances

OVER-SUBSCRIPTION

Beware of over-subscription!

LOCAL STORAGE

Use of local storage instead of shared storage (like Ceph) could improve performance by over 50% ... depending on Ceph replication.

Source: OpenStack: Install and configure a storage node - OpenStack Kilo. http://docs.openstack.org/kilo/install-guide/install/yum/content/cinder-install-storage-node.html (2015)

NUMA NODES

Pinning instance CPU to physical CPUs (NUMA nodes) on local storage further improves performance.

Source: Red Hat: CPU pinning and NUMA topology awareness in OpenStack Compute. http://redhatstackblog.redhat.com/2015/05/05/cpu-pinning-and-numa-topology-awareness-in-openstack-compute/ (2015)

DISK PINNING

Disk Pinning shows a 15% performance improvement

Source: OpenStack: OpenStack Cinder multi-backend. https://wiki.openstack.org/wiki/Cinder-multi-backend (2015)

WORKER COUNT CONFIGURATION

Allow Services to use available resources with higher concurrency
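A common rule of thumb (an assumption here, not an official OpenStack recommendation) is to scale API worker counts with available cores, capped so the many services sharing a controller do not exhaust its memory:

```python
import os

def suggested_workers(cap=16):
    """Heuristic worker count: one per core, capped.

    The cap of 16 is an assumption; the right value depends on controller
    memory and how many services share the node.
    """
    return min(os.cpu_count() or 1, cap)

# Candidate settings for the services named on this slide (illustrative
# option names; check each service's configuration reference):
for opt in ("keystone processes", "neutron api_workers",
            "glance workers", "gnocchi api workers"):
    print(f"{opt} = {suggested_workers()}")
```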

Examples:
▪ Keystone process count
▪ Neutron workers
▪ Glance workers
▪ Gnocchi API workers

UNEVEN CONTROLLER USAGE

One controller had more cores available than the other two and ended up with all the jobs. This scenario was easy to identify because the right dashboards were in place.

HEAT MEMORY USAGE

About 1GB of memory is used by Heat for every 10 compute nodes deployed. Size your controller memory appropriately.

DEPLOYMENT TIMINGS

OSPD 9 Overcloud Deployment

CONCLUSION

➢ Define what you are trying to measure
  ▪ Define a cloud
  ▪ Define what metrics are important
➢ Use the correct tools
  ▪ Rally
  ▪ PerfKitBenchmarker
  ▪ CloudBench
  ▪ SPEC Cloud IaaS 2016 Benchmark
  ▪ Ceilometer
  ▪ Collectd/Graphite/Grafana
  ▪ Ganglia
  ▪ Browbeat
➢ Gather and analyze data
  ▪ Apply tuning tips based on the data

PARTICIPATE!

ADDITIONAL INFORMATION

➢ Guidelines and Considerations for Performance and Scaling your Red Hat Enterprise Linux OpenStack Platform 6 Cloud
  ▪ https://access.redhat.com/articles/1507893
➢ Guidelines and Considerations for Performance and Scaling your Red Hat Enterprise Linux OpenStack Platform 7 Cloud
  ▪ https://access.redhat.com/articles/2165131
➢ Red Hat OpenStack Blog
  ▪ http://redhatstackblog.redhat.com/
➢ Red Hat Developer Blog
  ▪ http://developerblog.redhat.com/
➢ Red Hat Enterprise Linux Blog
  ▪ http://rhelblog.redhat.com/

Rally

Source: https://github.com/openstack/rally/blob/master/doc/source/images/Rally-Actions.png

Rally

Rally is a familiar OpenStack project

▪ https://github.com/openstack/rally
▪ An automated benchmark tool for OpenStack benchmarking
▪ Multiple use cases:

• Development and QA
• DevOps
• CI/CD

PERFKIT BENCHMARKER

Source: Introduction to Perfkit Benchmark and How to Extend it, https://github.com/GoogleCloudPlatform/PerfKitBenchmarker/wiki/Tech-Talks

BROWBEAT

➢ Scale and performance automation
➢ Ansible playbooks for automation
➢ Provides an automation wrapper around existing tooling:
  ➢ Rally - control plane tests
  ➢ Shaker - data plane network tests
  ➢ Perfkit - data plane tests
➢ Leverages existing upstream test frameworks rather than replacing them
➢ Performance monitoring:
  ➢ Collectd/Graphite/Grafana
➢ Results capture/storage/analytics:
  ➢ ELK stack
➢ Allows for results comparison

BROWBEAT

BROWBEAT - RESULTS

COLLECTD/GRAPHITE/GRAFANA

Example Grafana dashboards

UNEVEN CONTROLLER USAGE

HEAT MEMORY USAGE

About 1GB of memory is used by Heat for every 10 compute nodes deployed. Size your controller memory appropriately.

DEPLOYMENT TIMINGS

Saw many instance reschedules with the default scheduler. Deployment time dropped dramatically by setting up assignments via Ironic.

DEFINING A CLOUD

Cloud = Private
Cloud = Public
Cloud = Community
Cloud = OpenStack
Cloud = OpenShift
Cloud = Rain
Cloud = Cumulus
Cloud = Cirrus
Cloud = Funnel

Ten different people will probably give you ten different answers.

DEFINING A CLOUD

NIST SPECIAL PUBLICATION 800-145

Private cloud: The cloud infrastructure is provisioned for exclusive use by a single organization comprising multiple consumers (e.g., business units). It may be owned, managed, and operated by the organization, a third party, or some combination of them, and it may exist on or off premises.

Public cloud: The cloud infrastructure is provisioned for open use by the general public. It may be owned, managed, and operated by a business, academic, or government organization, or some combination of them. It exists on the premises of the cloud provider.

Hybrid cloud: The cloud infrastructure is a composition of two or more distinct cloud infrastructures (private, community, or public) that remain unique entities, but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load balancing between clouds).

http://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-145.pdf DEFINING A CLOUD

SPEC OSG Cloud Subcommittee Glossary

Blackbox Cloud: The cloud provider gives a general specification of the SUT, usually in terms of how the cloud consumer may be billed. The exact hardware details corresponding to these compute units may not be known. This will typically be the case when the entity benchmarking the cloud is different from the cloud provider.

Whitebox Cloud: The SUT's exact engineering specifications, including all hardware and software, are known and under the control of the tester. This will typically be the case for private clouds.

Source: https://www.spec.org/cloud_iaas2016/docs/faq/html/glossary.html

DEFINING A CLOUD

➢ The focus of this presentation will be predominantly on white-box private cloud environments
➢ Primary example is OpenStack

➢ Many of the tools and methodologies are usable in other cloud environments as well

CLOUD CHARACTERISTICS

SPEC RESEARCH GROUP - CLOUD WORKING GROUP
https://research.spec.org/working-groups/rg-cloud-working-group.html

READY FOR RAIN? A VIEW FROM SPEC RESEARCH ON THE FUTURE OF CLOUD METRICS
https://research.spec.org/fileadmin/user_upload/documents/rg_cloud/endorsed_publications/SPEC-RG-2016-01_CloudMetrics.pdf

ELASTICITY

THE DEGREE TO WHICH A SYSTEM IS ABLE TO ADAPT TO WORKLOAD CHANGES BY PROVISIONING AND DE-PROVISIONING RESOURCES IN AN AUTONOMIC MANNER, SUCH THAT AT EACH POINT IN TIME THE AVAILABLE RESOURCES MATCH THE CURRENT DEMAND AS CLOSELY AS POSSIBLE

Source: READY FOR RAIN? A VIEW FROM SPEC RESEARCH ON THE FUTURE OF CLOUD METRICS, SPEC RG Cloud Working Group ELASTICITY

ELASTICITY

Source: http://www.today.com/news/remember-stretch-armstrong-how-buy-your-favorite-retro-toys-your-1D80377927

HOW FAR WILL HE STRETCH? WILL HE BREAK WHEN STRETCHED?

AS YOU STRETCH HIM DOES IT GET HARDER TO STRETCH HIM MORE?

WHEN I LET GO DOES HE RETURN TO HIS ORIGINAL SHAPE?

HOW LONG DOES HE TAKE TO RETURN TO HIS NORMAL SHAPE?

RESULTS COMPARED

Number of submissions: 2

Metric                          Dell_12g                Dell_13g                Comment
Cloud Type                      Private / White box     Private / White box
Hardware Platform               12g, 7x R720 compute    13g, 9x R630 compute
Job Date                        03/05/2016              06/08/2016
Scalability@AIs                 10.3@10                 29.5@20                 Higher is better
Scalability per AI              1.03                    1.45                    Higher is better
Elasticity                      63.0%                   71.9%                   Higher is better
Inst. Prov. Time (s)            163                     135                     Lower is better
AI Prov. Success                100%                    86.96%                  Higher is better
AI Run Success                  100%                    100%                    Higher is better
Total Instances                 65                      131                     Higher is better
Baseline YCSB Throughput        13,082.6                17,742.0                Higher is better
Baseline KMeans Job time (s)    115.7                   109.7                   Lower is better
Elasticity YCSB Throughput      9,480.9                 14,890.8                Higher is better
Elasticity KMeans Job time (s)  211.5                   186.2                   Lower is better

PERFKIT BENCHMARKER

Source: Introduction to Perfkit Benchmark and How to Extend it, https://github.com/GoogleCloudPlatform/PerfKitBenchmarker/wiki/Tech-Talks

BENCHMARK HARNESS SUPPORTED WORKLOADS

BROWBEAT

REPEATABLE AUTOMATED TESTING

PROVISIONING TIME: RALLY

(Charts: automated VM provisioning; Nova success rate)

Source: Measuring the Cloud Using Rally & CloudBench, Douglas Shakshober, Red Hat Inc.

IO LIMITS

PCI-E Limits
For PCI-E Gen-3 capable slots (http://www.tested.com/tech/457440-theoretical-vs-actual-bandwidth-pci-express-and-thunderbolt/):
▪ Gen-3 is defined at 8 GT/s; with scrambling and 128b/130b encoding (instead of 8b/10b encoding) this gives a bandwidth of nearly 8.0 Gb/s per lane, so for example a PCI-E Gen-3 x8 link delivers an aggregate bandwidth of about 8 GB/s

SAS Limit
An LSI whitepaper, Switched SAS: Sharable, Scalable SAS Infrastructure (http://www.abacus.cz/prilohy/_5025/5025548/SAS_Switch_White%20Paper_US-EN_092210.pdf), shows how to calculate the SAS limit of an 8-lane controller port with a SAS bandwidth of 6 Gb/s:
▪ 6 Gb/s x 8 lanes = 48 Gb/s per x8 port
▪ 48 Gb/s (8b/10b encoding) = 4.8 GB/s per port (per node)
▪ 4.8 GB/s per port x 88.33% (arbitration delays and additional framing) = 4320 MB/s per port

CEILOMETER: HIGH-LEVEL ARCHITECTURE

Source: http://docs.openstack.org/developer/ceilometer/architecture.html

CEILOMETER

Another familiar OpenStack project

➢ https://wiki.openstack.org/wiki/Telemetry
➢ Goal is to efficiently collect, normalize, and transform data produced by OpenStack services
➢ Interacts directly with the OpenStack services through defined interfaces
➢ Many tools utilize Ceilometer to gather OpenStack performance data

COLLECTD/GRAPHITE/GRAFANA

➢ Collectd
  ➢ Daemon to collect system performance statistics
  ➢ CPU, Memory, Disk, Network, Process, MariaDB, Load, logged errors, and more
➢ Graphite/Carbon
  ➢ Carbon receives metrics and flushes them to Whisper database files
  ➢ Graphite is the webapp frontend to Carbon
➢ Grafana
  ➢ Visualizes metrics from multiple backends

SPEC CLOUD WORKLOADS

YCSB
- Framework with a common set of workloads for evaluating the performance of different key-value and cloud serving stores

KMeans
- Hadoop-based CPU-intensive workload
- Chose the HiBench implementation