Observing the Invisible: Live Cache Inspection for High-Performance Embedded Systems
Dharmesh Tarapore∗, Shahin Roozkhosh∗, Steven Brzozowski∗ and Renato Mancuso∗
∗Boston University, USA
{dharmesh, shahin, sbrz, [email protected]}

Abstract—The vast majority of high-performance embedded systems implement multi-level CPU cache hierarchies. But the exact behavior of these CPU caches has historically been opaque to system designers. Absent expensive hardware debuggers, an understanding of cache makeup remains tenuous at best. This enduring opacity further obscures the complex interplay among applications and OS-level components, particularly as they compete for the allocation of cache resources. Notwithstanding the relegation of cache comprehension to proxies such as static cache analysis, performance counter-based profiling, and cache hierarchy simulations, the underpinnings of cache structure and evolution continue to elude software-centric solutions. In this paper, we explore a novel method of studying cache contents and their evolution via snapshotting. Our method complements extant approaches for cache profiling to better formulate, validate, and refine hypotheses on the behavior of modern caches. We leverage cache introspection interfaces provided by vendors to perform live cache inspections without the need for external hardware. We present CacheFlow, a proof-of-concept Linux kernel module which snapshots cache contents on an NVIDIA Tegra TX1 system on chip and a Hardkernel Odroid XU4.

Index Terms—cache, cache snapshotting, ramindex, cacheflow, cache debugging

1 INTRODUCTION

The burgeoning demand for high-performance embedded systems among a diverse range of applications such as telemetry, embedded machine vision, and vector processing has outstripped the capabilities of traditional microcontrollers. For manufacturers, this has engendered a discernible shift to system-on-chip modules (SoCs). Coupled with their extensibility and improved mean time between failures, SoCs offer improved reliability and functionality. To bridge the gap between increasingly faster CPU speeds and comparatively slower main memory technologies (e.g., DRAM), most SoCs feature cache-based architectures. Indeed, caches allow modern embedded CPUs to meet the performance requirements of emerging data-intensive workloads. At the same time, the strong need for predictability in embedded applications has rendered analyses of caches and their contents vital for system design, validation, and certification.

Unfortunately, this interest runs counter to the general desire to abstract complexity. Cache mechanisms and policies are ensconced entirely in hardware, to eliminate software interference and encourage portability. Consequently, software-based techniques used to study caches suffer from several shortcomings, which we detail in Section 5.3.

In contrast, we propose CacheFlow: a technique that can be implemented in software and in existing high-performance SoCs to extract and analyze the contents of cache memories. CacheFlow can be deployed in a live system without the need for an external hardware debugger. By periodically sampling the cache state, we show that we can reconstruct the behavior of multiple applications in the system; observe the impact of scheduling policies; and study how multi-core contention affects the composition of cached content. Importantly, our technique is not meant to replace other cache analysis approaches. Rather, it seeks to supplement them with insights on the exact behavior of applications and system components that were not previously possible. While in this work we specifically focus on last-level (shared) cache analysis, the same technique can be used to profile private cache levels, TLBs, and the internal states of coherence controllers. In summary, we make the following contributions:

1) This is the first paper to describe in detail an interface, namely RAMINDEX, available on modern embedded CPUs that can be used to inspect the content of CPU caches. Despite its usefulness, the interface has received little to no attention in the research community thus far;
2) We present a technique called CacheFlow to perform cache content analysis via event-driven snapshotting;
3) We demonstrate that the technique can be implemented on modern hardware by leveraging the RAMINDEX interface and propose a proof-of-concept open-source Linux implementation¹;
4) We describe how to correlate information retrieved via cache snapshotting to user-level and kernel software components deployed on the system under analysis;
5) We evaluate some of the insights provided by the proposed technique using real and synthetic benchmarks.

The rest of this paper is organized as follows: in Section 2, we document related research that inspired and informed this paper. In Sections 3 and 4, we provide a bird's-eye view of the concepts necessary to understand the mechanics of CacheFlow. In Sections 5 and 6, we detail the fundamentals of CacheFlow and its implementation. Section 7 outlines the experiments we performed and documents their results. We also examine the implications of those results. Section 8 concludes the paper with an outlook on future research.

¹ Our implementation can be found at: https://github.com/weirdindiankid/cacheflow

2 RELATED WORK

Caches have a significant impact on the temporal behavior of embedded applications. But their design—oriented toward programming transparency and average-case optimization—makes performance impact analysis difficult. A plethora of techniques have approached cache analysis from multiple angles. We hereby provide a brief overview of the research in this space.

Static Cache Analysis derives bounds on the access time of memory operations when caches are present [1]–[3]. Works in this domain study the set of possible cache states in the control-flow graph (CFG) of applications. Abstract interpretation is widely employed for static cache analysis, as first proposed in [4] and [5]. For static analysis to be carried out, a precise model of the cache behavior is required. Techniques that consider Least-Recently Used (LRU), Pseudo-LRU, and FIFO replacement policies [3], [6]–[9] have been studied.

Symbolic Execution is a software technique for feasible path exploration and WCET analysis [10], [11] of a program subject to variable input vectors. It proffers a middle ground between simulation and static analysis. An interpreter follows the program; if the execution path depends on an unknown, a new symbolic executor is forked. Each symbolic executor stands for many actual program runs whose actual values satisfy the path condition.

As systems grow more complex, Cache Simulation tools are essential to study new designs and evaluate existing ones. Simulation of the entire processor—including cores, cache hierarchy, and on-chip interconnect—was proposed in [12]–[15]. Simulators that only focus on the cache hierarchy were studied in [16]–[18]. Depending on the component under analysis, simulations abound. In the (i) execution-driven approach, the program to be traced runs locally on the host platform; in the (ii) emulation-driven approach, the target program runs on an emulated platform and environment created by the host; finally, in the (iii) trace-driven

thus far, a few important limitations are worth noting. Techniques that rely on cache models—i.e. static analysis, symbolic execution, simulation—work under the assumption that the employed models accurately represent the true behavior of the hardware. Unfortunately, complex modern hardware often deviates from textbook models in unpredictable ways. Access to event counters subsequently only reveals partial information on the actual state of the cache hierarchy.

Undocumented cache behaviors have thus engendered measurement-based reverse engineering of cache policies, in an attempt to empirically infer otherwise opaque behavior [28]. Abel and Reineke, for instance, present a set of microbenchmarks tailored specifically to deduce the replacement policies on a multitude of Intel processors [29]. They further extend this work in [30] using chi, a framework that uses parameter inference algorithms to perform a predetermined set of measurements that expose the cache's replacement policy. Nonetheless, all approaches enumerated so far suffer from a common limitation, best expressed by Abel: "since only a cache's hit/miss behavior can be observed [in software], it is impossible to infer anything about how it is realized internally." [30]

In contrast, the technique proposed in this paper is meant to complement the analysis and profiling strategies reviewed thus far by allowing system designers to snapshot the actual contents of CPU caches. This in turn enables a new set of strategies to extract/validate hardware models, or to conduct application and system-level analysis on the utilization of cache resources. Unlike works that proposed cache snapshotting by means of hardware modifications [31]–[33], our technique can be entirely implemented in software and leverages hardware support that already exists in a broad line of high-performance embedded CPUs.

3 BACKGROUND

In this section, we introduce a few fundamental concepts required to understand CacheFlow's inner workings. First, we review the structure and functioning of multi-level set-associative caches. Next, we briefly explore the organization of virtual memory in the target class of SoCs. Readers familiar with set-associative caches and memory management fundamentals