Effective Cache Apportioning for Performance Isolation Under Compiler Guidance

Bodhisatwa Chatterjee, Sharjeel Khan, Santosh Pande
Georgia Institute of Technology, Atlanta, USA
[email protected], [email protected], [email protected]

Abstract

With a growing number of cores per socket in modern datacenters, where multi-tenancy of a diverse set of applications must be efficiently supported, effective sharing of the last-level cache is a very important problem. This is challenging because modern workloads exhibit dynamic phase behaviour: their cache requirements and sensitivity vary across different execution points. To tackle this problem, we propose Com-CAS, a compiler-guided cache apportioning system that provides smart cache allocation to co-executing applications in a system. The front-end of Com-CAS is primarily a compiler framework equipped with learning mechanisms to predict cache requirements, while the back-end consists of an allocation framework with a proactive scheduler that apportions cache dynamically to co-executing applications. Our system improved average throughput by 21%, with a maximum of 54%, while keeping the worst individual application's execution-time degradation within 15% to meet SLA requirements.

1 Introduction

High-performance computing systems facilitate concurrent execution of multiple applications by sharing resources among them. The Last-Level Cache (LLC) is one such resource, usually shared by all running applications in the system. However, sharing the cache often results in inter-application interference, where multiple applications can map to the same cache line and incur conflict misses, resulting in potential performance degradation. Another aspect of this application co-existence in the LLC is increased vulnerability to shared-cache attacks such as side-channel attacks [9, 17, 33], timing attacks [3, 13], and cache-based denial-of-service (DOS) attacks [2, 11, 19, 35]. Furthermore, these problems are exacerbated by the fact that the number of cores (and thus processes) sharing the LLC is rapidly increasing in recent architectural designs. Therefore, from both a performance and a security point of view, it is imperative that LLCs are carefully managed.

To address these issues, modern computing systems use cache partitioning to divide the LLC among the co-executing applications in the system. Ideally, a cache partitioning scheme obtains overall gains in system performance by providing a dedicated region of cache memory to high-priority cache-intensive applications, and ensures security against cache-sharing attacks through isolated execution in an otherwise shared LLC. Apart from achieving superior application performance and improving system throughput [7, 20, 31], cache partitioning can also serve a variety of purposes: improving system power and energy consumption [6, 23], ensuring fairness in resource allocation [26, 36], and even enabling worst-case execution-time analysis of real-time systems [18]. Owing to these overwhelming benefits of cache partitioning, modern processor families (Intel® Xeon series) implement hardware way-partitioning through Cache Allocation Technology (CAT) [10]. Such systems aim to provide extended control and flexibility to the user by allowing them to customize cache partitions according to each application's requirements.
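As a concrete illustration of hardware way-partitioning, the sketch below creates an isolated LLC partition through Linux's resctrl filesystem, which exposes Intel CAT to userspace. This is a minimal sketch and not part of Com-CAS: it assumes a resctrl-capable kernel mounted at /sys/fs/resctrl, and the group name "hipri", the way mask 0x00f, and the PID are all illustrative values.

/* Minimal sketch: give a high-priority process 4 isolated LLC ways
 * via Linux resctrl (Intel CAT). Assumes resctrl is mounted at
 * /sys/fs/resctrl; the group name "hipri", the mask 00f, and the
 * PID are illustrative, not values produced by Com-CAS. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

static void write_file(const char *path, const char *text) {
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); exit(1); }
    fputs(text, f);
    fclose(f);
}

int main(void) {
    /* Create a resource group, i.e., a CAT class of service. */
    mkdir("/sys/fs/resctrl/hipri", 0755);

    /* Restrict the group to ways 0-3 of the L3 on cache domain 0.
     * For full isolation, other groups' masks must be updated to
     * exclude these ways. */
    write_file("/sys/fs/resctrl/hipri/schemata", "L3:0=00f\n");

    /* Move the target process into the group (PID is illustrative). */
    write_file("/sys/fs/resctrl/hipri/tasks", "12345\n");
    return 0;
}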
On the other hand, modern workloads exhibit dynamic phase behaviour throughout their execution, resulting in rapidly changing cache demands. These behaviours arise from complex control flows and input-data-driven behaviour in applications and their co-execution environments. It is often the case that even the same program artifact (such as a loop) exhibits different cache requirements during different invocations. The majority of prior works on cache partitioning [6, 20, 26, 31] tend to classify applications into several categories along the lines of whether they are cache-sensitive or not. Based on this offline characterization, cache allocations for the various applications are decided through a mix of static and runtime policies. The fact that a single application (or even a given loop) can exhibit dual behaviour, i.e., both cache-sensitive and cache-insensitive behaviour during its execution, is not taken into account. Some approaches employ a 'damage-control' methodology, where attempts are made to adjust cache partitions dynamically after detecting changes in application behaviour through runtime monitoring with performance counters and hardware monitors [6, 22, 28, 32]. However, these approaches are reactive and suffer from detection and reaction lag: the behaviour of modern workloads is likely to change before the adjustments are made, leading to lost performance.

In order to account for applications' dynamic phase behaviours and provide smart, proactive cache partitioning, we propose the Compiler-Guided Cache Apportioning System (Com-CAS), with the goal of providing superior performance and maximal isolation to reduce the threat of shared-cache attacks. The front-end of Com-CAS consists of the Probes Compiler Framework, which uses traditional compiler analysis and learning mechanisms to predict the dynamic cache requirements of co-executing applications. The attributes that determine cache requirements (memory footprint and data-reuse) are encapsulated in specialized library markers called 'probes', which are statically inserted outside each loop-nest in an application. These probes communicate the dynamic resource information to the BCache Allocation Framework, which makes up the back-end of Com-CAS. It consists of allocation algorithms that dynamically apportion the LLC among co-executing applications by aggregating probe-generated information, and a proactive scheduler that invokes Intel CAT to perform the actual cache partitioning according to application phase changes.
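To make the idea of probes concrete, the sketch below shows roughly what a probe-instrumented loop nest could look like. The probe functions and their signatures are hypothetical (the actual probe library is presented in Section 5): a probe before the loop nest reports the phase's predicted footprint and reuse, and a probe after it marks the end of the phase.

/* Hypothetical sketch of compiler-inserted probes around a loop
 * nest. probe_begin/probe_end and their signatures are illustrative
 * stand-ins for the paper's probe library (Section 5). */
void probe_begin(long footprint_bytes, double reuse_estimate);
void probe_end(void);

void stencil(double *a, const double *b, int n) {
    /* Probe: report this loop nest's predicted memory footprint
     * (reads b, writes a: about 2*n*n doubles) and a data-reuse
     * estimate to the allocation framework before the phase starts. */
    probe_begin((long)n * n * 2 * sizeof(double), /*reuse=*/0.8);

    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            a[i * n + j] = 0.25 * (b[(i - 1) * n + j] + b[(i + 1) * n + j]
                                 + b[i * n + j - 1] + b[i * n + j + 1]);

    /* Probe: the phase ends; the cache allocation may be revised. */
    probe_end();
}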
We evaluated Com-CAS on an Intel Xeon Gold system with 35 application mixes from GAP [1], a graph benchmark suite; Polybench [21], a numerical computational suite; and Rodinia [5], a heterogeneous compute-intensive benchmark suite, which together represent a variety of workloads typically encountered in a cloud environment. The average throughput gain over all the mixes is 21%, with a maximum of up to 54%, with no degradation relative to a vanilla co-execution environment under CFS with no cache apportioning scheme. We also show that Com-CAS maximizes the isolation of reuse-heavy loops, providing protection against DOS attacks while maintaining the SLA agreement of degradation < 15%. Overall, the main contributions of the paper can be summarized as follows:

• We present a compiler-guided Cache Apportioning System that dynamically adjusts the LLC partitioning to account for the changing phase behaviours of modern workloads.
• We show that cache requirements of applications can be estimated through a combination of memory footprints and data-reuse, and communicated through probes to the BCache Allocation Framework. The framework maintains locality of the partitions allocated to a given application. A net result of such apportioning decisions is that LLC misses are shifted from applications with higher cache-sensitivity needs (high-sensitivity reuse) to the lower ones (lower-sensitivity reuse, or streaming ones).
• Our results on all the benchmark suites show that superior performance (a 21% average throughput increase, with a maximum of 54%), enhanced security against shared-cache attacks, and SLA compliance for all applications can be achieved by our system.

The remainder of this paper is structured as follows. We begin by providing a high-level overview of Com-CAS in Section 3. Prior works and their shortcomings are discussed in Section 4. The Probes Compiler Framework and the techniques to estimate cache requirements are presented in Section 5, and the apportioning algorithms and cache-partitioning scheme of the BCache Allocation Framework are explained in Section 6. The experimental evaluation of our system, along with detailed case studies, is discussed in Section 7. Final remarks and conclusions are presented in Section 8.

2 Motivation: Dynamic Cache Apportioning

Determining the memory requirements of a mix of running applications in a system is quite a complex task. The cache demand of each process depends on specific program points and can change drastically throughout its execution. It is possible for the same program artifact (loop) to exhibit different behaviour across multiple invocations during a single execution. Identifying these dynamic variations in an application's cache requirements is paramount for apportioning cache and obtaining superior performance. For example, let us consider an application mix consisting of 5 instances of the BC benchmark from the GAP suite [1]. These applications are provided with the same input: a Uniform Random Graph with 2^24 nodes. Our goal is to leverage cache partitioning through Intel® CAT to obtain superior performance. We now consider different possibilities to show that the obvious solutions do not work. Can we partition the cache equally among identical processes? Since the processes in the mix are identical, it is a natural inclination to provide each process an equal amount of isolated cache ways. In an 11-way set-associative cache system, one possible partition is giving any four processes 2 ways each and the fifth process 3 ways.
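To make the equal-split option concrete, the sketch below computes CAT-style way masks that divide an 11-way LLC among 5 processes as evenly as possible (2 ways each for four processes, 3 for one). The helper is purely illustrative; as the rest of this section argues, such a static equal split ignores the processes' phase behaviour.

/* Illustrative helper: build contiguous CAT way masks that split an
 * 11-way LLC as evenly as possible among 5 processes (3+2+2+2+2).
 * CAT requires each mask to be a contiguous run of set bits. */
#include <stdio.h>

int main(void) {
    const int ways = 11, procs = 5;
    int base = ways / procs;      /* 2 ways per process */
    int extra = ways % procs;     /* 1 process gets one more way */
    int lo = 0;

    for (int p = 0; p < procs; p++) {
        int n = base + (p < extra ? 1 : 0);
        unsigned mask = ((1u << n) - 1) << lo;  /* contiguous run */
        printf("process %d: mask 0x%03x (%d ways)\n", p, mask, n);
        lo += n;
    }
    return 0;
}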