Contention-Aware Frequency Scaling on CMPs with Guaranteed Quality of Service
Hao Shen and Qinru Qiu
Department of Electrical Engineering and Computer Science, Syracuse University, Syracuse, New York, USA
{hshen01, qiqiu}@syr.edu

Abstract
Workload consolidation is usually performed in datacenters to improve server utilization for higher energy efficiency. One of the key issues related to workload consolidation is contention for shared resources such as the last level cache, main memory, memory controller, etc. Dynamic voltage and frequency scaling (DVFS) of the CPU is another effective technique that has widely been used to trade performance for power reduction. We have found that the degree of resource contention in a system affects its performance sensitivity to CPU frequency. In this paper, we apply machine learning techniques to construct a model that quantifies runtime performance degradation caused by resource contention and frequency scaling. The inputs of our model are readings from Performance Monitoring Units (PMUs) screened using a standard feature selection technique. The model is tested on an SMT-enabled chip multi-processor and reaches up to 90% accuracy. Experimental results show that, guided by the performance model, runtime power management techniques such as DVFS can achieve a more accurate power and performance tradeoff without violating the quality of service (QoS) agreement. The QoS violation of the proposed system is significantly lower than that of systems that have no performance degradation information.

Key words: consolidation, frequency scaling, power management, contention

1. Introduction

It has been pointed out [2] that server energy efficiency reduces super-linearly as utilization goes down. Due to the severe lack of energy proportionality in today's computers, workload consolidation is usually performed in datacenters to improve server utilization for higher energy efficiency. When used together with power management on idle machines, this technique can lead to significant power savings [1].

Today's high-end servers have multiple processing units that consist of several symmetric multiprocessing (SMP) cores. Each physical core also comprises more than one logical core enabled by the simultaneous multithreading (SMT) technique. One of the key issues related to workload consolidation is performance degradation due to contention for shared resources. At the SMP level these shared resources include main memory, the last level cache, the memory controller, etc. At the SMT level, the shared resources also include execution modules such as instruction issue ports, ALUs, branch target buffers, low level caches, etc. [5][6]. The degree of performance degradation is a function of the resource usage of all processes that are co-running and hence is hard to predict. Even if we can measure the execution time of an application accurately, there is no direct way to tell how much degradation the process went through unless we have a reference copy of the same application running alone on the same hardware.

Dynamic voltage and frequency scaling (DVFS) is another effective low power technique that has been widely used. Compared to workload consolidation and runtime power management, DVFS provides finer adjustment of the performance and power consumption tradeoff and incurs much less control overhead. In a hierarchical power management framework [1][8], the upper level is usually virtual machine management that performs workload consolidation, while the lower level is usually based on voltage and frequency scaling. Due to the speed gap between the CPU and the memory subsystem, the performance impact of DVFS is not linearly proportional to the scale of frequency reduction [12]. Different applications have different sensitivity to frequency scaling. A memory intensive application usually suffers less performance degradation from DVFS than a CPU intensive one, as the CPU speed is no longer the performance bottleneck. The same can be expected for many systems running multiple consolidated workloads. As their performance is constrained by the shared resources, such as memory, power reduction can be achieved by applying DVFS without significant performance impact. However, similar to systems with resource contention, it is hard to directly tell an application's performance sensitivity to frequency scaling without having a reference copy running.

Performance degradation should be hidden from the customers, especially in a cloud environment, where the quality of service (QoS) is specified by the service level agreement (SLA) between service providers and customers, and customers are charged based upon usage or reservation of cloud resources. How to guarantee the service level in a system that performs workload consolidation and DVFS for power control is an urgent research problem.

Previous works studied how to optimize task scheduling to mitigate resource contention ([4], [8]~[12]). Many of them aim at finding a metric that must be balanced across the running threads to minimize the resource contention. The metrics are normally related to the last level cache miss rate. These works make a best effort to mitigate resource contention; however, they do not report the performance degradation during runtime. Hence, without a reference copy, it is almost not possible to tell at runtime whether a certain scheduling algorithm does improve the performance and by how much. After all, the resource usage of a software program is dynamically changing. An increase in IPS (instructions per second) does not necessarily indicate the adoption of a more efficient scheduling algorithm; it may simply be because the program has completed loading all data from the hard disk and started processing them. It would be beneficial if the service provider knew how much degradation the target process is undergoing when it is co-scheduled with other processes competing for the shared resources and when DVFS is applied. With such information, further adjustment of the performance-power tradeoff can be made.

*This work is supported in part by NSF under grant CNS-0845947.

The problem is further complicated when CPU frequency scaling is performed in a system with resource contention, because its impact on the usage of different resources is not equal. Finding an architecture-level analytical model to quantify performance degradation in a system with resource contention and frequency scaling is almost impossible. Machine learning techniques seem to be the only feasible solution [3].

Some works have applied machine learning techniques to model the performance change of tasks that have different co-runners [3][7][13]. Among these works, [3] is the most similar to this work. It uses hardware performance counter information to estimate the performance degradation caused by resource contention on an SMP machine. However, none of these works consider the possibility that a system could also run at different voltage and frequency levels. All of these previous works consider an SMP machine where only a single thread is running on each core; therefore, they ignore the contention for shared execution resources.

In this work, we apply machine learning techniques to develop a model that estimates the performance degradation of a task considering the impact of resource contention and frequency scaling simultaneously. We need to point out that this model does not "predict" the performance of a given task schedule and frequency setting. Instead, it monitors the PMUs of the current server and estimates its performance degradation with respect to an ideal reference system (i.e., a system without any resource contention or frequency scaling). This information can be used as feedback to guide scheduling and DVFS.
Compared to previous works (especially [3]), the contributions of this paper are:

1. It studies the performance impact of resource contention and frequency scaling. Our results demonstrate the necessity of considering them together for performance modeling.

2. A model is created to quantify the performance degradation of a task under resource contention and frequency scaling on an SMT-enabled chip multi-processor.

3. It demonstrates how the model can be used in a cloud server to guide power management while maintaining the required quality of service of each task.

The rest of the paper is organized as follows: Section 2 presents observations that motivate the proposed performance model. Section 3 presents the model construction procedure. Experimental results are presented in Section 4, and Section 5 gives the conclusions.

2. Motivational observations

In this section, we provide experimental data that motivate the search for a model that captures the performance impact of both resource contention and frequency scaling. Our experimental system is an Intel Ivy Bridge i7-3770K machine with 4 physical cores and 8 logical cores (SMT2). Each physical core has dedicated L1 and L2 caches (shared by its two logical cores), while all cores share the same 8MB L3 cache. It supports frequency scaling from 3.5 GHz down to 1.6 GHz with a step of 0.1 GHz. It is also equipped with 8GB of two-channel 1600 MHz DDR3 memory. Ubuntu Linux is installed. The configuration of this experimental platform is representative of many commercial computers on the market nowadays.

Though many research papers assume that frequency scaling can be applied at the core level, Intel Ivy Bridge processors have only one voltage regulator, and per-core frequency scaling is disabled by the firmware and OS [15]. Each physical core can be put into a deep sleep C-state independently [15] when it becomes idle. This state has very low power consumption due to power and clock gating. The socket power of our experimental system is around 24W in the idle state when the deep sleep C-state is enabled. When the deep C-state is disabled, the idle power becomes 36W at the lowest frequency and 63W at the highest frequency.

Nowadays the memory subsystem has become relatively fast. We observe that running one single memory intensive task is far from saturating the memory subsystem of the server. The performance of such a task scales almost linearly during frequency scaling, as the CPU and cache speed are still the bottleneck even for memory intensive tasks. The linear relation stops only when multiple memory intensive tasks are actively running simultaneously.

Our hypothesis is that different co-scheduled jobs not only affect the performance of an application by generating resource contention, but also affect its sensitivity to frequency scaling. To demonstrate this, we create workloads that have various levels of resource contention. Our workloads consist of two benchmarks from SPEC CPU2006 [18]. One is lbm, which is memory intensive; the other is gamess, which is CPU intensive. Different workloads are generated using these two benchmarks. In these workloads, each logical core executes at most one benchmark program. We refer to the two processes sharing the same physical core as SMT neighbors and to two processes running on different physical cores as SMP neighbors. The performance of these two benchmarks and their sensitivities to frequency scaling are tested in the context of different workload mappings. The test cases are labeled as n-m-T-SMT[SMP]. The parameters n and m specify that there are n lbm processes and m gamess processes running. The parameter "T" is the name of the target process whose performance we are interested in. The label "SMT" indicates that the SMT neighbor of our target process is the same benchmark program; otherwise the label "SMP" is attached to the workload.

[Figure 1. Performance sensitivity to resource contention and frequency scaling. (a) Performance of uniform workload; (b) memory intensive task in hybrid workload; (c) CPU intensive task in hybrid workload; (d) impact from different SMT neighbors. In (a)~(c) the x-axis is frequency (GHz) and the y-axis is normalized performance; the curves correspond to test cases such as 8-0-lbm, 0-8-gamess, 2-6-lbm-SMT and 4-4-gamess-SMP.]

Figure 1(a)~(c) shows the performance degradation for each test case. The x-axis is the CPU frequency and the y-axis is the normalized performance of the target benchmark program compared with the reference program running alone on a dedicated processor at the highest frequency (i.e., 3.5 GHz).
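For concreteness, the following sketch shows one way such a normalized performance number can be obtained on Linux (an illustration only, not the authors' measurement scripts: the benchmark binaries, core numbering, reference count and 10-second window are assumptions). It pins the co-runners of a 4-4-lbm-SMT style test case to logical cores with taskset, counts the target's retired instructions over a fixed window with perf, and divides by the count obtained from a solo run at 3.5 GHz, which is consistent with the PF definition used later in Section 3.2.

```python
#!/usr/bin/env python3
"""Sketch: measure normalized performance of a target process in a test case."""
import re
import subprocess

def retired_instructions(pid: int, seconds: int = 10) -> int:
    """Count instructions retired by `pid` over a fixed window using perf stat."""
    out = subprocess.run(
        ["perf", "stat", "-e", "instructions", "-p", str(pid), "sleep", str(seconds)],
        capture_output=True, text=True).stderr        # perf stat reports on stderr
    m = re.search(r"([\d,]+)\s+instructions", out)
    return int(m.group(1).replace(",", ""))

def launch_pinned(cmd: list, core: int) -> subprocess.Popen:
    """Pin a benchmark process to one logical core with taskset."""
    return subprocess.Popen(["taskset", "-c", str(core)] + cmd,
                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

# Example test case 4-4-lbm-SMT: 4 lbm and 4 gamess processes, lbm is the target.
# Logical cores 0 and 4 are assumed to be SMT siblings of physical core 0.
target = launch_pinned(["./lbm_base"], core=0)          # placeholder binary names
co_runners = ([launch_pinned(["./lbm_base"], core=c) for c in (4, 1, 5)] +
              [launch_pinned(["./gamess_base"], core=c) for c in (2, 6, 3, 7)])

n_test = retired_instructions(target.pid)     # contended (and possibly frequency-scaled) run
n_ref = 35_000_000_000                        # placeholder: measured once from a solo run at 3.5 GHz
print("normalized performance PF = %.3f" % (n_test / n_ref))
```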
In Figure 1(a) we can see that when only one task is running, regardless of whether it is memory intensive or CPU intensive, the performance scales linearly with CPU frequency at the same rate. This is because of the high memory bandwidth of the modern server. When all 8 logical cores run the same task, the memory intensive task (lbm) suffers much more degradation than the CPU intensive task (gamess) due to memory contention. However, it is also much less sensitive to frequency scaling than gamess, because the CPU is no longer the performance bottleneck.

Figure 1(b) and (c) show the performance of lbm and gamess separately when they are scheduled with different co-runners. Three major observations are made from the two figures.

(1) Having lbm as the SMT neighbor causes more performance degradation than having gamess. For example, 2-6-lbm-SMT has lower performance than 2-6-lbm-SMP, and 6-2-gamess-SMT has better performance than 6-2-gamess-SMP. A similar trend can be observed for the other test cases. This is mainly because a memory intensive SMT neighbor competes for the L1 and L2 caches.

(2) With more and more lbm processes running on the processor, the performance degradation of the target is exacerbated. For example, 2-6-lbm-SMT has better performance than 4-4-lbm-SMT, and 2-6-gamess-SMT also has better performance than 4-4-gamess-SMT. This trend is more prominent for a memory intensive target (i.e., lbm) than for a CPU intensive target (i.e., gamess).

(3) gamess is more sensitive to frequency scaling than lbm. Figure 1(c) shows that its performance decreases almost linearly with frequency scaling. However, as more lbm processes are added into the system, the slope of this decrease reduces, which indicates a reduced sensitivity to frequency scaling. For example, the performance of 2-6-gamess-SMT changes more slowly with frequency scaling than that of 4-4-gamess-SMT. On the other hand, lbm's performance is a nonlinear function of the CPU frequency. When the number of lbm processes increases, its performance is almost constant as shown in Figure 1(b), indicating a low sensitivity to frequency scaling.

To sum up the discussion above, contention and DVFS both affect a workload's performance. To make things more interesting, a program's sensitivity to frequency scaling is determined not only by the program itself but also by its SMT and SMP neighbors, and the performance does not always scale linearly. A performance model that considers only one frequency will no longer be accurate when DVFS is enabled. In order to provide accurate performance estimation to guide power management at different levels, our performance model must provide accurate estimation across a wide range of CPU frequencies.

Many previous works focus only on performance degradation due to SMP level contention; however, SMT level contention has an even greater performance impact. To further show this impact, we pick two processes from lbm, gamess and mcf (another memory intensive benchmark in SPEC CPU2006) and run them as SMT neighbors. Because only two processes are running, the SMP level contention is almost negligible. Figure 1(d) shows the normalized performance of each process running with different neighbors; the two benchmarks running together are bundled. As we can see, gamess has large performance degradation when running with either lbm or mcf, while lbm has relatively less degradation in either case, which indicates low sensitivity to SMT level contention. The performance of mcf is bimodal: it has large degradation when running with lbm and marginal degradation when running with gamess. This suggests that, compared to the other two, mcf is more sensitive to having a memory intensive SMT neighbor. In other words, its performance is a function of the characteristics of its SMT neighbor.
3. Model construction

In this section, we apply machine learning techniques to construct a model that assesses the performance degradation of an application considering the impact from its neighbors and the CPU frequency. The degradation is measured with respect to a reference system that has no resource contention and no frequency scaling. The discussion is carried out based on the Intel Ivy Bridge i7-3770K CPU, which has 4 physical cores and 8 logical cores. However, the same method can be applied to other processors. Because we focus on CPU-bound workloads (i.e., SPEC CPU2006), in our model we assume only one thread is running on each logical core [4]. Supporting multiple threads running on one logical core will be our future work.

In our model, we classify the processes running on the same processor into 3 categories: Target, SMT and SMP. Target is the process whose performance degradation needs to be characterized. The SMT process shares the same physical core with the Target, and the rest of the processes running on the same chip belong to the SMP category. To train and test the model, we created 30 groups of workloads. Each workload consists of 4 benchmarks from SPEC CPU2006. One of them is the Target, and another one is its SMT neighbor. The other two benchmarks are duplicated into 6 processes and run on the remaining 3 physical cores. During the selection, we try to involve as many benchmarks as possible while exploring different combinations of memory and CPU intensive benchmarks [14] to create variety. Each workload is run at 8 different frequencies swept from 1.6 GHz to 3.5 GHz.

3.1 Feature selection

There are around 260 PMU event candidates on each logical core. We use perf [16] to collect the PMU values. There are only 4 hardware performance counters on each logical core, which means only 4 events can be monitored at the same time without loss of accuracy; if more events are to be recorded, the counters are time-multiplexed. Even if we collect 8 events in each run, collecting around 260 PMU events requires running the same workload repeatedly more than 30 times. As we can see, not only is it impossible to use all of these events as inputs for model construction, collecting all of them would also take a prohibitively long time. A feature selection step must be performed first to reduce the number of events and simplify modeling and data collection.

In this step, we run each workload for only 10 seconds and repeat this about 40 times. Each time 6~8 PMU events are collected. The data form the preliminary training set. First, we consolidate the PMU events of the 6 SMP processes by calculating their average: to the target process they are like background activities, and it is not necessary to keep the individual information. After consolidation, we have 3 sets of PMU events, from the Target, SMT and SMP processes respectively, plus the CPU frequency.
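The consolidation step can be pictured with the small sketch below (illustrative only; the event names come from Table 1, while the data layout and the values are assumptions, not the authors' format). The counters of the six SMP processes are averaged into a single set, so each sample yields one feature vector holding Target, SMT and averaged SMP readings plus the frequency.

```python
from statistics import mean

# One 10-second sample: raw per-process PMU readings keyed by event name.
# Event names are taken from Table 1; the surrounding structure is assumed.
sample = {
    "target": {"UOPS_DISPATCHED.PORT_3": 1.2e9, "CYCLE_ACTIVITY.CYCLES_LDM_PENDING": 3.4e8},
    "smt":    {"MOVE_ELIMINATION.INT_ELIMINATED": 5.6e7},
    "smp": [  # six co-running SMP processes, two per remaining physical core
        {"CYCLE_ACTIVITY.CYCLES_NO_EXECUTE": 2.0e8, "IDQ.ALL_DSB_CYCLES_ANY_UOPS": 9.1e7},
        # ... five more dictionaries of the same shape
    ],
    "freq_ghz": 2.8,
}

def consolidate(sample: dict) -> dict:
    """Average the SMP processes' events and flatten everything into one feature vector."""
    features = {}
    for event, value in sample["target"].items():
        features["TARGET." + event] = value
    for event, value in sample["smt"].items():
        features["SMT." + event] = value
    smp_events = {e for proc in sample["smp"] for e in proc}
    for event in smp_events:
        features["SMP." + event] = mean(proc.get(event, 0.0) for proc in sample["smp"])
    features["FREQ_GHZ"] = sample["freq_ghz"]   # kept as an input even though it is dropped later
    return features

print(consolidate(sample))
```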
We then apply Weka [17] for feature selection. The events are evaluated using the CfsSubsetEval algorithm, which evaluates a subset of events by considering the individual predictive ability of each feature along with the degree of redundancy between them. A set of 24 events is selected at the end. Table 1 shows the top 9 selected events. Interestingly, we found that the attribute frequency itself is not selected. However, the frequency information is reflected in the PMU readings.

Table 1. Top 9 selected events, sorted by their correlation to the performance
UOPS_DISPATCHED.PORT_3 (Target): 0.83
CYCLE_ACTIVITY.CYCLES_NO_EXECUTE (SMP): 0.77
CYCLE_ACTIVITY.CYCLES_LDM_PENDING (Target): 0.67
IDQ.ALL_DSB_CYCLES_ANY_UOPS (SMP): 0.63
CYCLE_ACTIVITY.CYCLES_L1D_PENDING (SMP): 0.61
L2_LINES_OUT.PF_CLEAN (Target): 0.54
MEM_LOAD_UOPS_RETIRED.HIT_LFB (Target): 0.40
LOAD_HIT_PRE.HW_PF (Target): 0.36
MOVE_ELIMINATION.INT_ELIMINATED (SMT): 0.33

3.2 Model construction

After feature selection, a more comprehensive and accurate data collection is performed. Each workload runs for 40 seconds to get more coverage, and the 24 selected PMU events are recorded, with 4 of them collected in each run. The model output is the normalized performance (PF) with respect to the reference system. It is calculated as PF = N_test / N_ref, where N_test is the number of retired instructions of the target application running on the test system that has contention and frequency scaling, and N_ref is the number of retired instructions of the target application running on the reference system without contention and frequency scaling. Both are collected over the same amount of time. It is easy to see that the performance degradation can be calculated as 1 - PF.

About 16 different modeling algorithms are evaluated for their relative absolute error through 10-fold cross-validation. The results show that the MultilayerPerceptron (neural network) model yields the best accuracy. We refer to this model as "model_full".

Two reference models are also constructed in a similar way. The first one ignores the impact of frequency scaling; its training data are collected from systems performing no frequency scaling (i.e., running at 3.5 GHz). This model is referred to as "model_no_freq". The second one does not explicitly consider the impact of the SMT neighbor; its training set does not contain PMU data for the SMT process. This model is referred to as "model_no_SMT". The accuracy of the three models and the correlation between the estimated and the actual performance are given in Table 2. As we can see, "model_full" gives the highest accuracy and correlation. "model_no_SMT" also has a low error rate. This is because the impact from the SMT process has partially been reflected in the Target process's PMU readings.

Table 2. Accuracy of the 3 models
Relative absolute error: model_full 11.2%; model_no_freq 30.2%; model_no_SMT 13.5%
Correlation: model_full 0.994; model_no_freq 0.940; model_no_SMT 0.985
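The models themselves are built in Weka; the sketch below only restates the training flow with scikit-learn's MLPRegressor as a stand-in for Weka's MultilayerPerceptron, so the mapping from the 24 selected features to PF and the 10-fold cross-validation become concrete. The synthetic data, hidden layer size and error metric are assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder training matrix: one row per (workload, frequency) sample with the
# 24 selected PMU features; y is the measured normalized performance PF.
# Real data would come from the consolidation step sketched in Section 3.1.
rng = np.random.default_rng(0)
X = rng.random((240, 24))
y = rng.random(240)

# A multilayer perceptron regressor as a stand-in for Weka's MultilayerPerceptron.
model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0))

# 10-fold cross-validation, mirroring the evaluation used to compare the ~16 algorithms.
mae = -cross_val_score(model, X, y, cv=10, scoring="neg_mean_absolute_error")
print("mean absolute error of estimated PF: %.3f" % mae.mean())

# model_full uses all features; model_no_freq and model_no_SMT would simply drop
# the frequency-scaled samples or the SMT columns from the training data.
model.fit(X, y)
pf_estimate = model.predict(X[:1])   # runtime use: estimate PF from live PMU readings
print("estimated PF for one sample: %.3f" % pf_estimate[0])
```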

4. DVFS under resource contention with guaranteed QoS

The proposed model is applied to provide performance feedback to guide the DVFS controller in a resource contended system. Four different workloads are generated and tested. Each workload contains 8 copies of SPEC CPU2006 benchmarks. The first workload (WL1) has 2 memory intensive processes and 6 CPU intensive processes. The second workload (WL2) has 4 memory intensive processes and 4 CPU intensive processes. The third and fourth workloads consist of only memory intensive benchmarks and only CPU intensive benchmarks respectively. Two different scheduling methods are applied to WL1 and WL2. The first schedules a memory intensive process to be the SMT neighbor of a CPU intensive process. The second schedules two memory intensive (or two CPU intensive) processes to be SMT neighbors of each other. The first method causes less resource contention [4] and is denoted as "G", which stands for "good" scheduling. The second method is denoted as "B", which stands for "bad" scheduling. The detailed information of the workloads and their mappings is presented in Table 3. Labels (M) and (C) indicate whether a benchmark is memory or CPU intensive. In this work, we do not consider task migration; the performance feedback from the model is only used to guide DVFS settings. Please note that the testing workloads are significantly different from the training set: none of the training workloads has more than 50% similarity to a testing workload.

[Figure 2. Performance for all workloads: (a) WL1; (b) WL2; (c) WL3 and WL4.]

Table 3. Workloads used in the evaluation. Each physical core hosts the two listed processes as SMT neighbors; (M) marks memory intensive and (C) CPU intensive benchmarks.
Core 0: WL1(G): lbm (M) + gamess (C); WL1(B): lbm (M) + lbm (M); WL2(G): mcf (M) + hmmer (C); WL2(B): mcf (M) + libq (M); WL3: milc (M) + milc (M); WL4: namd (C) + namd (C)
Core 1: WL1(G): lbm (M) + namd (C); WL1(B): povray (C) + namd (C); WL2(G): libq (M) + namd (C); WL2(B): mcf (M) + libq (M); WL3: milc (M) + milc (M); WL4: namd (C) + namd (C)
Core 2: WL1(G): povray (C) + h264ref (C); WL1(B): namd (C) + h264ref (C); WL2(G): mcf (M) + gromacs (C); WL2(B): hmmer (C) + gromacs (C); WL3: milc (M) + milc (M); WL4: namd (C) + namd (C)
Core 3: WL1(G): namd (C) + gobmk (C); WL1(B): gamess (C) + gobmk (C); WL2(G): libq (M) + tonto (C); WL2(B): namd (C) + tonto (C); WL3: milc (M) + milc (M); WL4: namd (C) + namd (C)

Each workload runs for 400 seconds (benchmarks are restarted if they finish before 400 seconds). A user-level shell script is developed to implement performance monitoring and DVFS control. It dynamically calls the perf tool to collect the 24 PMU attributes as model inputs from each logical core. The interval of data collection is set to 10 seconds, which is long enough to let each event be monitored for a substantial period of time and obtain good sampling accuracy. We assume that a set of target processes are critical and have QoS constraints. The constraint is expressed as the normalized performance (PF) of the process with respect to the reference system. If any critical task falls below its performance threshold, the chip frequency is increased by 0.1 GHz (the chip voltage is adjusted accordingly); otherwise, the frequency is decreased. Two sets of critical tasks are selected for WL1 and WL2: the first consists of all memory intensive tasks, while the second consists of all CPU intensive tasks. For WL3 and WL4, all tasks are critical.
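The control loop just described can be summarized as follows (a sketch of the policy with hypothetical helper callbacks; the actual implementation is a user-level shell script built around perf and the cpufreq interface). Every 10 seconds it estimates PF for each critical task with the model and moves the chip-wide frequency by one 0.1 GHz step.

```python
import time

F_MIN, F_MAX, F_STEP = 1.6, 3.5, 0.1        # GHz, the range of the Ivy Bridge test machine

def dvfs_loop(critical_pids, pf_threshold, estimate_pf, set_chip_frequency,
              read_pmu_features, interval_s=10):
    """Chip-level DVFS guided by the PF model.

    `estimate_pf`, `set_chip_frequency` and `read_pmu_features` are assumed
    callbacks: the model inference, the cpufreq write and the perf collection.
    """
    freq = F_MAX
    while True:
        time.sleep(interval_s)                       # let the counters accumulate
        features = {pid: read_pmu_features(pid, freq) for pid in critical_pids}
        pf = {pid: estimate_pf(f) for pid, f in features.items()}
        if any(v < pf_threshold for v in pf.values()):
            freq = min(F_MAX, freq + F_STEP)         # a critical task is violating: speed up
        else:
            freq = max(F_MIN, freq - F_STEP)         # all critical tasks have slack: save power
        set_chip_frequency(freq)                     # voltage follows the frequency setting

# Example wiring (all names hypothetical):
# dvfs_loop([1234, 1235], pf_threshold=0.3, estimate_pf=model_full_predict,
#           set_chip_frequency=write_cpufreq, read_pmu_features=collect_perf_counters)
```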
We refer to a system that uses our model as "model_full". It is compared with 4 reference systems: (1) model_no_smt: the system conducts performance assessment without explicitly considering the SMT neighbor's impact; (2) model_no_freq: the system conducts performance assessment without considering the impact of frequency scaling; (3) direct_scaling: the system scales the CPU frequency linearly according to the given performance threshold; (4) capping: instead of frequency scaling, the system sets a cap on the CPU quota that a task can take based on the given performance threshold. The cap is set using Linux cgroups, and the same cap is given to all tasks on the chip. The processor runs at the highest speed and enters deep sleep mode when it is capped. Both "direct_scaling" and "capping" ignore SMP level resource contention; a constant 50% performance degradation is assumed for SMT level contention. Although not very accurate, this is the best approximation we can have without dynamically tracking the performance, which is exactly the appeal of simple management approaches such as "direct_scaling" and "capping". The CPU frequency and cap are set accordingly. For example, if the performance threshold is 30% of the reference system, then "direct_scaling" will set the CPU frequency to 0.3/0.5 = 60% of the maximum frequency, while "capping" will cap the CPU quota to 60%.
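For reference, the arithmetic behind the two model-free baselines is restated below (a worked version of the 0.3/0.5 = 60% example; the function names are mine, not the paper's). Both assume a constant 50% SMT-contention degradation and scale either the frequency or the CPU quota linearly with the threshold.

```python
F_MAX = 3.5                    # GHz
SMT_DEGRADATION = 0.5          # constant 50% degradation assumed for SMT level contention

def direct_scaling_frequency(pf_threshold: float) -> float:
    """Frequency (GHz) chosen by 'direct_scaling' for a given PF threshold."""
    fraction = min(1.0, pf_threshold / SMT_DEGRADATION)
    return round(fraction * F_MAX, 1)          # e.g. 0.3 / 0.5 = 60% -> 2.1 GHz

def capping_quota(pf_threshold: float) -> float:
    """CPU quota (fraction of a core) set through cgroups by 'capping'."""
    return min(1.0, pf_threshold / SMT_DEGRADATION)

print(direct_scaling_frequency(0.3))   # 2.1 (60% of 3.5 GHz)
print(capping_quota(0.3))              # 0.6 (60% CPU quota)
```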

The performance of all 4 workloads running on the 5 different systems is reported in Figure 2. Please note that WL1 and WL2 both have 2 different task mappings, and for each mapping two sets of critical tasks are tested; therefore, four plots are presented for each of these workloads. The left two plots in Figure 2(a) are for WL1(G) and the right two are for WL1(B). The top two plots in Figure 2(a) are for systems where the memory intensive tasks are critical, while the bottom two plots are for systems where the CPU intensive tasks are critical. Each bundle of bars corresponds to the performance of one task running on the different systems. The bars with a dark solid outline are the critical tasks whose performance is important, while the bars with a dotted outline are noncritical. The dotted horizontal lines indicate the performance threshold. If a solid bar falls below this line, there is a performance violation; a task with a performance violation is marked by a small red cross underneath its bar. The critical tasks that have the lowest performance are referred to as bottleneck tasks, as their performance is the bottleneck that determines the CPU frequency of the entire chip. They are marked with red boxes. In order to minimize power consumption, the performance of these bottleneck tasks should exactly meet the threshold.

From the figure we can see that systems using our model (i.e., model_full) have almost no performance violation except for WL1(B). Furthermore, our model keeps the performance of the bottleneck tasks much closer to the performance threshold than all other techniques. This means that a lower frequency level is used and hence more energy savings are achieved. Comparing systems with different mapping choices, our model can correctly identify the bottleneck tasks and make frequency scaling decisions accordingly. We also observed that "capping" gives large violations most of the time when the critical tasks are memory intensive. This is because it runs the CPU at full speed, hence the memory becomes the performance bottleneck; furthermore, when the CPU is throttled, the memory access is stopped too. The same is not observed for the DVFS based approaches, where both CPU and memory operate all the time.

The third observation is that "model_no_smt", "model_no_freq" and "direct_scaling" lead to more violations when the critical tasks are CPU intensive. This is because frequency scaling based on an inferior performance model or simple linear scaling cannot accurately capture the performance degradation of CPU intensive tasks, which varies greatly during frequency scaling, while the performance of memory intensive tasks generally does not change that much. We also observed that there are fewer violations for WL2 than for WL1 when the critical jobs are CPU intensive. It seems that the more memory intensive tasks are running, the easier it is for all the models to make the right decision, since the sensitivity to frequency scaling reduces.

Please note that all systems use the same task mapping, and all of the first four systems, "model_full", "model_no_SMT", "model_no_freq" and "direct_scaling", perform DVFS based power management. Since the Intel Ivy Bridge processor only supports chip level frequency scaling, the system that has the minimum power consumption without performance violation should be the one whose bottleneck task performance exactly meets the threshold. Therefore, it is not necessary to compare the power consumption among "model_full", "model_no_SMT", "model_no_freq" and "direct_scaling". However, "capping" performs power management using CPU capping instead of DVFS; therefore, we still need to compare its power consumption with that of "model_full". The power consumption of the 10 test cases in Figure 2 is measured using a Watts up? PRO power meter. Figure 3 shows their energy and energy delay product (EDP). Note that we scaled the energy so that the same benchmark executes the same number of instructions, and the whole-system idle power (around 24W) is removed from the calculation. As we can see, on average "model_full" achieves a 24% energy reduction and a 38% EDP reduction compared to "capping".

[Figure 3. Energy (J) and EDP of model_full and capping for the 10 test cases.]

5. Conclusions

In this work, we demonstrate the importance of considering both resource contention and frequency scaling in system performance modeling. A model is constructed to dynamically quantify task performance degradation with respect to a reference system in which the target process is executed standalone at the highest frequency. The proposed model is used to provide performance feedback to guide DVFS control. Experimental results show that the proposed model effectively controls the system performance and keeps it close to the given constraint, and hence leads to minimum power consumption without violating the performance constraint.
6. References
[1] R. Raghavendra, P. Ranganathan, V. Talwar, Z. Wang, and X. Zhu, "No 'Power' Struggles: Coordinated Multi-level Power Management for the Data Center," Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2008.
[2] L. Barroso and U. Holzle, "The Case for Energy-Proportional Computing," Computer, vol. 40, no. 12, pp. 33-37, 2007.
[3] T. Dwyer, A. Fedorova, S. Blagodurov, M. Roth, F. Gaud and J. Pei, "A Practical Method for Estimating Performance Degradation on Multicore Processors, and its Application to HPC Workloads," SC'12, Article No. 83, 2012.
[4] S. Zhuravlev, S. Blagodurov and A. Fedorova, "Addressing Shared Resource Contention in Multicore Processors via Scheduling," ASPLOS XV, pp. 129-142, 2010.
[5] J. R. Funston, K. E. Maghraoui, J. Jann, P. Pattnaik and A. Fedorova, "An SMT-Selection Metric to Improve Multithreaded Applications' Performance," IPDPS'12, pp. 1388-1399, 2012.
[6] M. E. Thomadakis, "The Architecture of the Nehalem Processor and Nehalem-EP SMP Platforms," Texas A&M University, Tech. Rep., 2011.
[7] K. K. Pusukuri, D. Vengerov, A. Fedorova and V. Kalogeraki, "FACT: a Framework for Adaptive Contention-aware Thread Migrations," CF'11, Article No. 35, 2011.
[8] G. Dhiman, G. Marchetti and T. Rosing, "vGreen: A System for Energy-Efficient Management of Virtual Machines," ACM TODAES, vol. 16, no. 1, Nov. 2010.
[9] A. Snavely and D. M. Tullsen, "Symbiotic Jobscheduling for a Simultaneous Multithreading Processor," ASPLOS IX, pp. 234-244, 2000.
[10] K. Deng, K. Ren and J. Song, "Symbiotic Scheduling for Virtual Machines on SMT Processors," CGC'12, pp. 145-152, 2012.
[11] R. Knauerhase, P. Brett, B. Hohlt, T. Li and S. Hahn, "Using OS Observations to Improve Performance in Multicore Systems," IEEE Micro, vol. 28, no. 3, May 2008.
[12] A. Merkel, J. Stoess and F. Bellosa, "Resource-conscious Scheduling for Energy Efficiency on Multicore Processors," EuroSys'10, pp. 153-166, 2010.
[13] Y. Koh, R. Knauerhase, P. Brett, M. Bowman, Z. Wen and C. Pu, "An Analysis of Performance Interference Effects in Virtual Environments," ISPASS 2007, pp. 200-209, 2007.
[14] A. Phansalkar, A. Joshi and L. K. John, "Subsetting the SPEC CPU2006 Benchmark Suite," ACM SIGARCH Computer Architecture News, vol. 35, no. 1, Mar. 2007.
[15] J. D. Gelas, "Dynamic Power Management: A Quantitative Approach," AnandTech, Jan. 2010: http://www.anandtech.com/show/2919
[16] Perf tool: https://perf.wiki.kernel.org/index.php/Main_Page
[17] Weka 3: Data Mining Software in Java: http://www.cs.waikato.ac.nz/ml/weka/index.html
[18] SPEC CPU2006: http://www.spec.org/cpu2006/