Analyzing Performance in the Cloud: Solving an Elastic Problem with a Scientific Approach
Analyzing Performance in the Cloud: Solving an Elastic Problem with a Scientific Approach
Nicholas Wakou (Dell EMC), Alex Krzos (Red Hat)
Thursday, October 27, 2016, OpenStack Summit Barcelona 2016

PRESENTERS
Nicholas Wakou is a Principal Performance Engineer with the Dell EMC Open Source Solutions team. [email protected]
Alex Krzos is a Senior Performance Engineer at Red Hat working on OpenStack. [email protected]
https://www.openstack.org/summit/barcelona-2016/summit-schedule/events/16204/analyzing-performance-in-the-cloud-solving-an-elastic-problem-with-a-scientific-approach

AGENDA
➢ CLOUD DEFINITION & CHARACTERISTICS
➢ PERFORMANCE MEASURING TOOLS
➢ SPEC CLOUD IaaS 2016 BENCHMARK
➢ PERFORMANCE MONITORING TOOLS
➢ PERFORMANCE CHARACTERIZATION
➢ TUNING TIPS

CLOUD DEFINITION & CHARACTERISTICS

DEFINING A CLOUD
NIST SPECIAL PUBLICATION 800-145: "Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction."
http://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-145.pdf

CLOUD CHARACTERISTICS

PERFORMANCE MEASURING TOOLS

RALLY
OpenStack benchmarking tool (a sample task file is sketched at the end of this section)
➢ Runs as-an-App and as-a-Service
➢ Verification
➢ Benchmarking
➢ Profiling
➢ Reports
➢ SLAs for benchmarks
➢ Many plugins
Source: What is Rally?, https://rally.readthedocs.io/en/latest/

PERFKIT BENCHMARKER
Open-source "living" benchmarking framework containing a set of benchmarks used to compare cloud offerings/environments (a sample invocation is sketched at the end of this section)
➢ 10+ cloud providers/environments
➢ 34+ benchmarks
➢ Large community involvement
➢ Captures cloud elasticity in benchmark results
➢ Uses each cloud/environment's CLI tooling
➢ Publishes results to BigQuery for comparison
Source: Introduction to PerfKit Benchmarker and How to Extend It, https://github.com/GoogleCloudPlatform/PerfKitBenchmarker/wiki/Tech-Talks

PERFKIT EXPLORER
Dashboarding and performance-analysis tool for PerfKit Benchmarker results
➢ Multiple chart options
➢ Uses BigQuery as the backend data store
➢ Hosted in Google App Engine
Source: https://github.com/GoogleCloudPlatform/PerfKitExplorer

CLOUDBENCH
➢ Framework that automates cloud-scale evaluation and benchmarking (a sample CBTOOL session is sketched at the end of this section)
➢ Benchmark harness:
▪ Requests that the Cloud Manager create one or more instances
▪ Submits a configuration plan and steps to the Cloud Manager describing how the test will be performed
▪ At the end of the test, collects and logs applicable performance data and logs
▪ Destroys instances no longer needed for the test

HARNESS AND WORKLOAD CONTROL
(Diagram: Benchmark Harness driving the Cloud SUT; a group of boxes represents an application instance.) The benchmark harness comprises CloudBench (CBTOOL), baseline/elasticity drivers, and report generators. For white-box clouds the benchmark harness sits outside the SUT; for black-box clouds, it can be in the same location or campus.
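As a concrete illustration of driving Rally, here is a minimal task file for the stock NovaServers.boot_and_delete_server scenario; the image and flavor names are placeholders for whatever your cloud provides:

    {
      "NovaServers.boot_and_delete_server": [
        {
          "args": {
            "flavor": {"name": "m1.tiny"},
            "image": {"name": "cirros"}
          },
          "runner": {"type": "constant", "times": 10, "concurrency": 2},
          "context": {"users": {"tenants": 2, "users_per_tenant": 1}},
          "sla": {"failure_rate": {"max": 0}}
        }
      ]
    }

Started with "rally task start boot-and-delete.json", this boots and deletes 10 servers, 2 at a time, and fails the SLA if any iteration errors.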
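A PerfKit Benchmarker run against an OpenStack cloud might look like the sketch below; the flag names are as we recall them (verify against ./pkb.py --help), and the machine type and BigQuery table are placeholders:

    ./pkb.py --cloud=OpenStack \
             --benchmarks=iperf,fio \
             --machine_type=m1.medium \
             --bigquery_table=pkb_results.barcelona   # optional: publish results for comparison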
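CloudBench is driven from its own CLI; the session below is a hedged sketch assuming CBTOOL's documented verbs (cldattach, expid, aiattach, monextract) and its "osk" OpenStack adapter name:

    # inside the CBTOOL CLI (./cb)
    cldattach osk MYOPENSTACK    # attach an OpenStack cloud under test
    expid BASELINE01             # tag the experiment for later extraction
    aiattach hadoop              # deploy a Hadoop application instance
    monextract all               # dump the collected performance data and logs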
BROWBEAT
Orchestration tool for existing OpenStack workloads
➢ Combines workloads, metrics, and results into a single tool
➢ Runs performance workloads:
▪ Rally - control plane
▪ Rally plugins & Rally+pbench plugins - control + data plane
▪ Shaker - network data plane
▪ PerfKit Benchmarker - data plane + cloud elasticity
➢ Provides performance-infrastructure installation and configuration for:
▪ Carbon/Graphite/Grafana
▪ Collectd
▪ ELK
▪ FluentD
➢ Provides dashboards for visualizing and comparing results and system performance metrics

BROWBEAT - RESULTS
(Screenshot: Browbeat results dashboard.)

BROWBEAT - METRICS
(Screenshot: Browbeat system-metrics dashboard.)

SPEC CLOUD IAAS 2016 BENCHMARK
➢ Measures performance of Infrastructure-as-a-Service (IaaS) clouds
➢ Measures both control and data plane
▪ Control: management operations, e.g., instance provisioning time
▪ Data: virtualization, network performance, runtime performance
➢ Uses workloads that:
▪ resemble "real" customer applications
▪ benchmark the cloud, not the application
➢ Produces metrics ("elasticity", "scalability", "provisioning time") that allow comparison
SPEC Cloud IaaS Benchmarking: Dell Leads the Way
http://en.community.dell.com/techcenter/cloud/b/dell-cloud-blog/archive/2016/06/24/spec-cloud-iaas-benchmarking-dell-leads-the-way

SCALABILITY AND ELASTICITY ANALOGY
Climbing a mountain (ideal case):
➢ Scalability
▪ Mountain: keep on climbing
▪ Cloud: keep on adding load without errors
➢ Elasticity
▪ Mountain: each step takes identical time
▪ Cloud: performance stays within limits as load increases
(Figure: elasticity is the time for each step; scalability is conquering an infinitely high mountain.)

WHAT IS MEASURED?
➢ Measures the number of AIs (application instances) that can be loaded onto a cluster before SLA violations occur
➢ Measures the scalability and elasticity of the Cloud under Test (CuT)
➢ Not a measure of instance density
➢ SPEC Cloud workloads can individually be used to stress the CuT:
▪ KMeans - CPU/memory
▪ YCSB - IO

SPEC CLOUD BENCHMARK PHASES
➢ Baseline phase
▪ Determines the baseline result for a single application instance of each workload: a KMeans baseline AI and a YCSB baseline AI
▪ Each baseline AI result is a stream of 5 runs
➢ Elasticity phase
▪ Determines the cloud elasticity and scalability results when multiple workloads are run

BENCHMARK STOPPING CONDITIONS (the QoS arithmetic is sketched in the snippet after this section)
➢ 20% of AIs fail to provision
➢ 10% of AIs have errors in any run
➢ Max number of AIs set by the cloud provider is reached
➢ 50% of AIs have QoS violations; the QoS limits are:
▪ KMeans completion time ≤ 3.33 x baseline completion time
▪ YCSB throughput ≥ baseline throughput / 3
▪ YCSB read response time ≤ 20 x baseline read response time
▪ YCSB insert response time ≤ 20 x baseline insert response time

HIGH LEVEL REPORT SUMMARY

PUBLISHED RESULTS WEBSITE
https://www.spec.org/cloud_iaas2016/results/cloudiaas2016.html

PERFORMANCE MONITORING TOOLS

CEILOMETER
Another familiar OpenStack project
➢ https://wiki.openstack.org/wiki/Telemetry
➢ Goal is to efficiently collect, normalize, and transform data produced by OpenStack services
➢ Interacts directly with the OpenStack services through defined interfaces
➢ Applications can leverage Ceilometer to gather OpenStack performance data
Source: http://docs.openstack.org/developer/ceilometer/architecture.html

COLLECTD/GRAPHITE/GRAFANA
➢ Collectd
▪ Daemon that collects system performance statistics
▪ Plugins for CPU, memory, disk, network, processes, ...
➢ Graphite/Carbon
▪ Carbon receives metrics and flushes them to Whisper database files
▪ Graphite is the web-app frontend to Carbon
➢ Grafana
▪ Visualizes metrics from multiple backends
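To make the QoS limits above concrete, here is a small Python sketch (our illustration, not part of the SPEC Cloud kit) that applies them to one AI's measurement against its baseline:

    def within_qos(metric, baseline, measured):
        """True if a run satisfies the SPEC Cloud IaaS 2016 QoS limits above."""
        if metric == "kmeans_completion_time":
            return measured <= 3.33 * baseline
        if metric == "ycsb_throughput":
            return measured >= baseline / 3.0
        if metric in ("ycsb_read_response_time", "ycsb_insert_response_time"):
            return measured <= 20 * baseline
        raise ValueError("unknown metric: %s" % metric)

    # Example: a YCSB AI whose throughput fell to 40% of baseline still passes.
    print(within_qos("ycsb_throughput", baseline=1000.0, measured=400.0))  # True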
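To reproduce the Collectd-to-Carbon pipeline above, a minimal collectd.conf fragment could look like the following; the Graphite host is a placeholder:

    LoadPlugin cpu
    LoadPlugin memory
    LoadPlugin interface
    LoadPlugin write_graphite

    <Plugin write_graphite>
      <Node "carbon">
        Host "graphite.example.com"   # Carbon line receiver
        Port "2003"
        Protocol "tcp"
        Prefix "collectd."
      </Node>
    </Plugin>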
GANGLIA
Ganglia is a scalable, distributed monitoring system for high-performance computing systems such as server nodes, clusters, and grids.
➢ Relatively easy to set up
➢ Tracks a lot of hardware-centric metrics
➢ Low operational burden

PERFORMANCE CHARACTERIZATION

PROVISIONING TIME: SPEC CLOUD
➢ The time needed to bring up a new instance, or to add more resources (like CPU or storage) to an existing instance (a shell sketch of the instance measurement follows this section)
➢ Instance: time FROM the request to create a new instance TO the time when the instance responds to a netcat probe on port 22
➢ Application instance: time FROM the request to create a new instance TO the time when the AI reports readiness to accept client requests
➢ Provisioning-time characterization using the baseline phase:
▪ Increase the number of VMs (vary the number of YCSB seeds and/or KMeans Hadoop slaves) and note the impact on provisioning time
▪ Vary the instance configuration (flavor)

IO LIMITS
➢ PCI-E limits, for PCI-E Gen-3 capable slots: Theoretical vs. Actual Bandwidth: PCI Express and Thunderbolt, http://www.tested.com/tech/457440-theoretical-vs-actual-bandwidth-pci-express-and-thunderbolt/
➢ SAS limits: LSI whitepaper, Switched SAS: Sharable, Scalable SAS Infrastructure, http://www.abacus.cz/prilohy/_5025/5025548/SAS_Switch_White%20Paper_US-EN_092210.pdf

NETWORK/IO CHARACTERIZATION
➢ SPEC Cloud YCSB baseline tests - throughput (ops/s)
▪ Vary the number of seeds
▪ Increase the number of YCSB records and operations
▪ Increase the number of YCSB threads
➢ CloudBench fio
➢ CloudBench netperf (representative fio and netperf invocations follow this section)
➢ Understand network utilization under load
▪ Management networks
▪ Data networks (Neutron tenant)
➢ Monitor with Ganglia, collectd, Linux tools (vmstat, iostat, etc.)

CPU CHARACTERIZATION
➢ Understand CPU utilization under load
➢ Monitor with Ganglia, collectd, Grafana
➢ Linux tools (top, vmstat), SPEC Cloud KMeans
Note:
✓ CPU user time
✓ CPU system time
✓ CPU iowait time
✓ CPU irq time
➢ Use SPEC Cloud baseline tests for CPU characterization
▪ Vary the number of Hadoop slaves
▪ Increase the sample size, number of dimensions, and number of clusters

SCALABILITY/ELASTICITY
➢ Understand the scalability/elasticity of the CuT
➢ SPEC Cloud elasticity phase
▪ Vary the number of AIs
➢ Monitor with the FDR (Full Disclosure Report) HTML report

TUNING TIPS

HARDWARE/OS TUNING
➢ Latest BIOS and firmware revisions
➢ Appropriate BIOS settings
➢ RAID/JBOD
➢ Disk controller
➢ NIC driver: interrupt coalescing and affinitization
➢ NIC bonding
➢ NIC jumbo frames (example NIC tuning commands follow this section)
➢ OS configuration settings

CLOUD TUNING
▪ HW/OS tuning
▪ Cloud configs/settings
▪ Workload tuning

INSTANCE CONFIGURATION
Performance is impacted by:
▪ Instance type (flavor)
▪ Number of instances

OVER-SUBSCRIPTION
Beware of over-subscription! (illustrative Nova allocation ratios follow this section)

LOCAL STORAGE
Use of local storage instead of shared storage (like Ceph) can improve performance by over 50%, depending on Ceph replication.
Source: OpenStack: Install and configure a storage node, OpenStack Kilo. http://docs.openstack.org/kilo/install-guide/install/yum/content/cinder-install-storage-node.html (2015)

NUMA NODES
Pinning instance CPUs to physical CPUs (NUMA nodes) on local storage further improves performance (a flavor example follows this section).
Source: Red Hat: CPU pinning and NUMA topology awareness in OpenStack Compute. http://redhatstackblog.redhat.com/2015/05/05/cpu-pinning-and-numa-topology-awareness-in-openstack-compute/ (2015)
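The instance provisioning-time definition above can be approximated with a shell sketch like the following, assuming python-openstackclient, a CirrOS image, and an instance address that is directly reachable; the address-extraction logic is a placeholder:

    START=$(date +%s)
    openstack server create --image cirros --flavor m1.tiny probe-vm
    # wait for an address to be allocated (placeholder extraction logic)
    until IP=$(openstack server show probe-vm -f value -c addresses \
               | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' | head -1); do sleep 1; done
    # netcat probe on port 22, per the SPEC Cloud definition above
    until nc -z -w 1 "$IP" 22; do sleep 1; done
    echo "instance provisioning time: $(( $(date +%s) - START ))s"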
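Representative fio and netperf invocations for the IO and network characterization above might be the following; sizes, runtimes, and the peer address are placeholders:

    # random-read IO characterization with fio
    fio --name=randread --rw=randread --bs=4k --size=1G \
        --ioengine=libaio --iodepth=32 --runtime=60 --time_based --direct=1

    # tenant-network throughput with netperf (peer must be running netserver)
    netperf -H 10.0.0.5 -l 60 -t TCP_STREAM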
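Several of the NIC items above map onto standard Linux tooling; for example (device names and values are placeholders to be tuned per NIC):

    ip link set dev eth0 mtu 9000    # jumbo frames (must match switch config)
    ethtool -C eth0 rx-usecs 64      # interrupt coalescing
    # affinitization: pin NIC interrupts to local cores, e.g. via irqbalance
    # or by writing masks to /proc/irq/<n>/smp_affinity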
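Over-subscription on OpenStack is governed by Nova's allocation ratios; the nova.conf values below show the historical defaults as we recall them (verify for your release):

    [DEFAULT]
    cpu_allocation_ratio = 16.0    # 16 vCPUs per physical core
    ram_allocation_ratio = 1.5     # 1.5x memory over-subscription
    disk_allocation_ratio = 1.0    # no disk over-subscription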
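CPU pinning of the kind described above is typically requested through flavor extra specs; a minimal sketch using the standard hw:cpu_policy and hw:numa_nodes keys (flavor name and sizes are placeholders):

    openstack flavor create --vcpus 4 --ram 8192 --disk 40 m1.large.pinned
    openstack flavor set m1.large.pinned \
        --property hw:cpu_policy=dedicated \
        --property hw:numa_nodes=1

Instances booted from this flavor get dedicated physical CPUs confined to a single NUMA node, avoiding cross-node memory access.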