1

”This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.” arXiv:2007.12271v1 [cs.OH] 23 Jul 2020 2 Observing the Invisible: Live Inspection for High-Performance Embedded Systems

Dharmesh Tarapore∗, Shahin Roozkhosh∗, Steven Brzozowski∗ and Renato Mancuso∗ ∗Boston University, USA {dharmesh, shahin, sbrz, rmancuso}@bu.edu

Abstract—The vast majority of high-performance embedded systems implement multi-level CPU cache hierarchies. But the exact behavior of these CPU caches has historically been opaque to system designers. Absent expensive hardware debuggers, an understanding of cache makeup remains tenuous at best. This enduring opacity further obscures the complex interplay among applications and OS-level components, particularly as they compete for the allocation of cache resources. Notwithstanding the relegation of cache comprehension to proxies such as static cache analysis, performance -based profiling, and simulations, the underpinnings of cache structure and evolution continue to elude software-centric solutions. In this paper, we explore a novel method of studying cache contents and their evolution via snapshotting. Our method complements extant approaches for cache profiling to better formulate, validate, and refine hypotheses on the behavior of modern caches. We leverage cache introspection interfaces provided by vendors to perform live cache inspections without the need for external hardware. We present CacheFlow, a proof-of-concept Linux kernel module which snapshots cache contents on an TX1 SoC (system on ).

Index Terms—cache, cache snapshotting, ramindex, cacheflow, cache debugging !

1 INTRODUCTION content. Importantly, our technique is not meant to replace other The burgeoning demand for high-performance embedded cache analysis approaches. Rather, it seeks to supplement systems among a diverse range of applications such as them with insights on the exact behavior of applications telemetry, embedded machine vision, and vector - and system components that were not previously possi- ing has outstripped the capabilities of traditional micro- ble. While in this work we specifically focus on last-level controllers. For manufacturers, this has engendered a dis- (shared) cache analysis, the same technique can be used to cernible shift to system-on-chip modules (SoCs). Coupled profile private cache levels, TLBs, and the internal states of with their extensibility and improved mean time between coherence controllers. In summary, we make the following failures, SoCs offer improved reliability and functional- contributions: ity. To bridge the gap between increasingly faster CPU 1) This is the first paper to describe in detail an inter- speeds and comparatively slower main memory technolo- face, namely RAMINDEX, available on modern embedded gies (e.g. DRAM) most SoCs feature cache-based archi- CPUs that can be used to inspect the content of CPU tectures. Indeed, caches allow modern embedded CPUs caches. Despite its usefulness, the interface has received to meet the performance requirements of emerging - little to no attention in the research community thus far; intensive workload. At the same time, the strong need 2) We present a technique called CacheFlow to perform for predictability in embedded applications has rendered cache content analysis via event-driven snapshotting; analyses of caches and their contents vital for system design, 3) We demonstrate that the technique can be implemented validation, and certification. on modern hardware by leveraging the RAMINDEX inter- Unfortunately, this interest runs counter to the general face and propose a proof-of-concept open- Linux desire to abstract complexity. Cache mechanisms and poli- implementation1 cies are ensconced entirely in hardware, to eliminate soft- 4) We describe how to correlate information retrieved via ware interference and encourage portability. Consequently, cache snapshotting to user-level and kernel software software-based techniques used to study caches suffer from components deployed on the system under analysis; several shortcomings, which we detail in Section 5.5. 5) We evaluate some of the insights provided by the pro- In contrast, we propose CacheFlow: a technique that posed technique using real and synthetic benchmarks. can be implemented in software and in existing high- The rest of this paper is organized as follows: in Sec- performance SoCs to extract and analyze the contents of tion II, we document related research that inspired and cache memories. CacheFlow can be deployed in a live informed this paper. In Sections III and IV, we provide a system without the need for an external hardware debugger. bird’s-eye view of the concepts necessary to understand the By periodically sampling the cache , we show that we mechanics of CacheFlow. In Sections V and VI, we detail the can reconstruct the behavior of multiple applications in the system; observe the impact of scheduling policies; and study 1. Our implementation can be found at: https://github.com/ how multi-core contention affects the composition of cached weirdindiankid/cacheflow 3

fundamentals of CacheFlow and its implementation. Section thus far, a few important limitations are worth noting. VII outlines the experiments we performed and documents Techniques that rely on cache models — i.e. static anal- their results. We also examine the implications of those ysis, symbolic execution, simulation — work under the results. Section VIII concludes the paper with an outlook assumption that the employed models accurately represent on future research. the true behavior of the hardware. Unfortunately, complex modern hardware often deviates from textbook models in unpredictable ways. Access to event counters subsequently 2 RELATED WORK only reveals partial information on the actual state of the Caches have a significant impact on the temporal behav- cache hierarchy. ior of embedded applications. But their design—oriented The technique proposed in this paper is meant to com- toward programming transparency and average-case opti- plement the analysis and profiling strategies reviewed thus mization—makes performance impact analysis difficult. A far. It does so by allowing system designers to snapshot the plethora of techniques have approached cache analysis from actual content of CPU caches. This, in turn, enables a new multiple angles. We hereby provide a brief overview of the set of strategies to extract/validate hardware models, or to research in this space. conduct application and system-level analysis on the utiliza- Static Cache Analysis derives bounds on the access tion of cache resources. Unlike works that proposed cache time of memory operations when caches are present [1]– snapshotting by means of hardware modifications [28]–[30] [3]. Works in this domain study the set of possible cache our technique can be entirely implemented in software and states in the control-flow graph (CFG) of applications. Ab- leverages hardware support that already exists in a broad stract interpretation is widely employed for static cache line of high-performance embedded CPUs. analysis, as first proposed in [4] and [5]. For static analysis to be carried out, a precise model of the cache behavior is required. Techniques that consider Least-Recently Used 3 BACKGROUND (LRU), Pseudo-LRU, and FIFO replacement policies [3], [6]– In this section, we introduce a few fundamental concepts [9] have been studied. required to understand CacheFlow’s inner workings. First, Symbolic Execution is a software technique for feasible we review the structure and functioning of multi-level set- path exploration and WCET analysis [10], [11] of a program associative caches. Next, we briefly explore the organization subject to variable input vectors. It proffers a middle ground of in the target class of SoCs. between simulation and static analysis. An interpreter fol- lows the program; if the execution path depends on an unknown, a new symbolic executor is forked. Each symbolic 3.1 Multi-Level Caches executor stands for many actual program runs whose actual Modern high-performance embedded processors imple- values satisfy the path condition. ment multiple levels of caching. The first level (L1) is the As systems grow more complex, Cache Simulation tools closest to the CPU and its contents are usually private to are essential to study new designs and evaluate existing the local . Cache misses in L1 trigger a look-up in ones. 
Simulation of the entire processor — including cores, the next cache level (L2), which can be still private, shared cache hierarchy, and on-chip interconnect — was proposed among a subset (cluster) of cores, or globally shared by all in [12]–[15]. Simulators that only focus on the cache hierar- the cores. Additional levels (L3, L4, etc.) may also be present. chy were studied in [16]–[18]. Depending on the component The last act before a look-up in the main memory is to under analysis, simulations abound. In the (i) execution- query the Last-Level Cache (LLC). Without loss of generality, driven approach, the program to be traced runs locally on and to be more in line with our implementation, we consider the host platform; in the (ii) emulation-driven approach, the the typical cache layout of ARM-based processors. That is, target program runs on an emulated platform and environ- we assume private L1 caches and a globally shared L2, ment created by the host; finally, in the (iii) trace-driven which is also the LLC. approach, a trace file generated by the target application Set-associativity: Caching at any level typically follows is fed into to the simulator. An excellent survey reviewing a set-associative scheme. A set-associative cache with associa- 28 CPU cache simulators was published by Brais et. al [19]. tivity W , features W ways, where each way has an identical The most popular is perhaps Cachegrind that belongs to the structure. A cache with total size CS is thus structured in Suite [20]. W ways of size WS = CS/W each. Caches store multiple Statistic Profiling is performed by leveraging perfor- blocks of consecutive in cache lines (a.k.a. cache blocks). mance monitoring units (PMUs) integrated in modern pro- We use LS to indicate the number of bytes in each cache line. cessors. PMUs can monitor a multitude of hardware events Typical line sizes are 32 or 64 bytes. S = WS/LS denotes the that occur as applications execute on the platform. Unlike number of lines in a way or sets in a cache. the aforementioned strategies, sampling the PMU provides When the CPU accesses a cacheable memory location, information on the real behavior of the hardware platform. the determines how the cache look-up (or As such, a number of works have used statistic profiling to allocation) is performed. In the case of a cache miss, the study memory-related performance issues [21]–[23]. High least-significant bits of the address encode the specific level libraries such as PAPI [24], [25], Likwid [26] and inside the cache line and are called offset bits. For instance, numap [27] provide a set of APIs to ease the use of PMUs. in systems with LS = 64 bytes, the offset bits are [5:0]. Despite the seminal results achieved in the last decade The second group of bits in the memory address encodes in the cache analysis and profiling techniques described the specific cache set in which the memory content can be 4

cached. Since we have S possible sets, the next log2S bits TABLE 1 after the offset bits select one of the possible sets and are Availability of RAMINDEX in ARM Cortex-A CPUs. called index bits. Finally, the remaining bits, called tag bits, are stored alongside cached content to detect cache hits after CPU Max. Freq. Release Privilege Level Notable SoCs Zynq-7000 Cortex-A9 (32-bit) 800 MHz 2007 Not supported look-up. Nvidia Tegra, 2, 3, 4i Nvidia Tegra 4, K1 Virtual and Physical Caches: Addresses used for cache Cortex-A15 (32-bit) 2.5 GHz 2011 EL1 or Higher MediaTek MT8135/V look-ups can be physical or virtual. In the vast majority of Stratix 10 Cortex-A53 (64-bit) 1.5 GHz 2012 Not supported embedded multi-core systems, tag bits are physical address Xilinx ZynqMP Nvidia Tegra X1 Cortex-A57 (64-bit) 1.6 GHz 2012 EL1 or Higher bits signifying a physically-tagged cache. In shared caches Nvidia Tegra X2 MediaTek Helio X2x, MT817x (e.g. L2), index bits are also derived from physical ad- Cortex-A72 (62-bit) 2.5 GHz 2015 EL1 or Higher Xilinx Versal dresses. For this reason, they are said to be physically-indexed, Cortex-A76 (64-bit) 3 GHz 2018 EL3 MediaTek Helio G90 physically-tagged (PIPT) caches [31]–[35]. Cortex-A77 (64-bit) 3.3 GHz 2019 EL3 MediaTek Dimensity 1000

3.2 Memory Management & Representation experiments. These CPUs belong to the ARMv8-A archi- tecture and support two main modes of operations: 64-bit Virtual Memory Modern computing architectures rely on mode (AArch64) and 32-bit mode (AArch32). We base our software and hardware support to map virtual addresses discussion on target machines operating in AArch64 mode. used by processes to physical addresses represented in the RAMINDEX operates on a set of 10 32-bit wide system hardware. Giving processes distinct views of the system’s registers which are local to each of the processing cores. memory obviates problems stemming from fragmentation These registers are read via the move from system register or limited physical memory. The operating system man- (mrs) instruction. Similarly, write operations can be per- ages the translation between virtual and physical addresses formed using the move to system register (msr) instruction. through a construct known as the . When a process RAMINDEX tries to reference a virtual address, the operating system first The main interface is the eponymous register. RAMINDEX checks for that address’ presence in the referencing process’ In a nutshell, the CPU writes a command to the . If present, the system then checks for the register by specifying (1) the target memory to be accessed, page’s presence in memory via its page table entry (PTE). and (2) appropriate coordinates to access the target memory. From there, the address is either resolved to a physical The result of the command is then available in the remaining address, or the page is brought into main memory and then 9 registers and can be read. Of these, the first group of regis- IL1DATA _EL1 resolved. ters named n , with n ∈ {0,..., 3} holds the Virtual Memory Areas: When a process terminates in result of any operation that accesses the instruction L1 cache. DL1DATA _EL1 Linux, the kernel is tasked with freeing the process’ mem- The second group of registers, namely n with ory mappings. Older versions of Linux accomplished this n ∈ {0,..., 4}, holds the result of accesses performed to by iterating through a chain of reverse pointers to page L1 or L2 cache data entries. The resources that can be RAMINDEX tables that referenced each physical frame. The mounting accessed through the interface (see [32] for exact intractability of this approach spurred the development of details of the valid commands) include (1) L1 instruction virtual memory areas, or VMAs. and data cache content, both tag and data memories; (2) VMAs are contiguous regions of virtual memory rep- L1 instruction cache and indirect jump 2 resented as a range of start and end addresses. They sim- predictor memories ; (3) L1 instructions and data translation plify the kernel’s management of a process’ address space, look-aside buffers (TLBs); (4) L2 tag and data memories; (5) thus facilitating granular control permissions on a per-VMA L2 TLB; (6) L2 Error Correction Code (ECC) memory; and basis. VMAs record frame to page mappings on a per- (7) L2 snoop-control memory yielding information about VMA basis (as opposed to on a per-frame basis) and were the Modified/Owned/Exclusive/Shared/Invalid (MOESI) incorporated into the kernel, as of version 2.6 [36]. state of each cache line. As can be noted, the coverage of the RAMINDEX interface is quite extensive. In this paper we decided to focus specif- 4 THE RAMINDEX INTERFACE ically on the L2 cache which is shared among all the cores. 
The RAMINDEX interface was originally introduced on the Because we are specifically interested in the relationship ARM Cortex-A15 [34] family of CPUs and is currently avail- between cache state and memory owned by applications, able on the high-performance line of ARM embedded pro- we program the RAMINDEX interface to access the L2 tag cessors. These include ARM Cortex A15 [34], ARM Cortex memory. The coordinates to retrieve the content of a specific A57 [32], ARM Cortex A76 [37], and the recently announced entry are expressed in terms of (1) cache way number, and ARM Cortex A77 [38]. Table 1 reviews the availability of (2) cache set number to be accessed. Because the L2 is a the RAMINDEX interface across ARM CPUs and provides a PIPT cache, the set number corresponds to the index bits of few notable examples of known SoCs equipped with such a physical address. The content of the tag memory returned CPUs. Given the consistent support for the RAMINDEX in- on the DL1DATAn_EL1 registers contains the remaining bits terface across the high-performance line of ARM processors, of the physical address cached at the specified coordinates, there is good indication that RAMINDEX will continue to be and an indication that the entry is indeed valid. supported in future families of CPUs. 2. These are three different memory resources, i.e. the indirect pre- We now make specific reference to the RAMINDEX inter- dictor memory, the Branch Target Buffer (BTB) and the Global Hostory face available on ARM Cortex-A57 CPUs and used in our Buffer (GHB). 5 5.1 Flush vs. Transparent Mode CacheFlow can operate in a number of modes to facilitate different types of cache analyses. A mode is selected by appropriately configuring the Shutter and Trigger modules. A more in-depth discussion is provided in Section 6. Here, we provide a short overview of the most important modes. First, CacheFlow can operate in flush or transparent mode. When operating in flush mode, the acquisition of a snap- shot is intentionally destructive to cache contents. In this mode, after acquiring a snapshot, the cache contents of the application(s) under analysis are flushed from the cache. Conversely, when operating in transparent mode, cache snapshotting is performed while minimizing the impact on the contents of the cache. We quantify the involuntary pollution overhead when operating in transparent mode in Fig. 1. Trigger and Shutter modules operation over time in synchronous Section 7.3. (a) and asynchronous (b) mode with periodic sampling. 4.1 Security Considerations 5.2 Synchronous vs. Asynchronous Mode Among the four privilege levels E0-E3 of ARMv8(-A), on ARM Cortex-A57 and Cortex-A72 CPUs, RAMINDEX is avail- CacheFlow can also operate in synchronous or asynchronous able at all exception levels, excluding EL0, i.e. the lowest mode. Synchronous mode is best suited to analyzing a spe- privilege level (see Table 1). This has important implications. cific subset of applications executing in parallel on multiple For example, RAMINDEX could be used to expose memory cores. In this mode, the Trigger spawns the applications and cache contents of other guest operating systems sharing under analysis and delivers POSIX signals to pause them the same hardware. For this reason, while in ARM Cortex- once a new snapshot acquisition is initiated, and to resume A57, the RAMINDEX is accessible starting from EL1, it ap- them afterwards. 
Figure 1 (a) provides a timeline of events pears that it is gradually being elevated in the privilege as they occur in the synchronous mode. When a new snap- level. For instance, in Cortex-A77, EL3 (Trust-Zone security shot is to be acquired (dashed blue up-arrow in the figure), monitor level) is required. all the tasks under analysis are paused (dashed red down- arrows). Once the acquisition of the current snapshot is complete, all the observed applications are resumed (dashed 5 CACHEFLOW OVERVIEW blue down-arrows). This mode ensures that all the observed In this section we discuss the general workflow of the applications — including the one executing on the same proposed CacheFlow technique. We first provide a high- CPU as the Trigger— are equally affected by the activation level description of the different moving parts; we then of the trigger. The extra complexity of pause/resume sig- describe the main challenges faced and the possible usage nals is unnecessary (1) when a single application is being scenarios for the proposed CacheFlow. observed, pinned to the same core as the Trigger; and (2) CacheFlow is structured in two modular components. when one is not conducting an analysis on a specific set of The first module, namely the Shutter concerns the low-level applications, but, for instance, on the the background noise of logic that leverages the RAMINDEX interface described in system services. Section 4. The Shutter is responsible for initiating a snapshot To cover the latter two cases, CacheFlow can operate of the content of the target memory — e.g. L1 data/tag, in asynchronous mode. In this case, there is no explicit L2 data/tag, TLBs, etc. The Shutter is implemented as an pause/resume signal delivery to applications, as depicted in OS component and hence runs with kernel-level privileges. Figure 1 (b). Upon activation (dashed up-arrow), the Trigger It exposes two interfaces to the rest of the system: (1) an preempts all the applications on the same core but does not interface to configure and initiate snapshot acquisition; and explicitly pause applications on other cores. It then invokes (2) a data channel where the content of acquired snapshots the Shutter. is passed to user-space for final processing. Regardless of the mode, note that the Shutter temporar- The second module is the Trigger that implements user- ily spins all the cores (solid down-arrow) once invoked to RAMINDEX level logic to commandeer a new snapshot to be performed perform the low-level interaction with registers. by the Shutter. The Trigger is designed to support a number This is necessary to ensure the correctness of the snapshot, of event-based activation strategies. For instance, to perform as discussed in Section 6.3 and highlighted in Figure 1. periodic sampling of the cache content, the Trigger is acti- vated via timer events. Alternatively the Trigger can be acti- 5.3 Key Challenges vated when the application under analysis reaches a code or Three key challenges have been addressed in the proposed data breakpoint, invokes a system call, or delivers a specific design of CacheFlow, which are hereby summarized. More POSIX signal to the Trigger process. Periodic activation, and details on how each challenge was solved are provided in event-driven activation initiated through signal delivery are Section 6. currently implemented. Avoiding Pollution: The first challenge we faced is quite intuitive. Acquiring a snapshot of the cache involves the 6

execution of logic on the very same system we are trying ical addresses. Both applications and OS components are to observe. Worse yet, while the content of the cache is then directly compiled against physical memory addresses. progressively read, one must use a memory buffer to store In this case, ownership of cache blocks can be inferred by the resulting data. But writes into the buffer might trigger simply comparing the obtained physical addresses w.r.t. the cache allocations and hence pollute the state of the cache global system memory map. that is being sampled. Because the size of the used buffer The second case corresponds to systems where, albeit needs to be in the same order of magnitude as the size of an MMU exists, it is configured to perform a flat linear the cache, this issue can significantly impact the validity of mapping between virtual addresses and physical addresses. the snapshot. In this case there exists a (potentially null) constant offset To solve this challenge, we statically reserve a portion between virtual addresses and corresponding physical ad- of main memory used by CacheFlow to acquire a snapshot. dresses. The memory region reserved for CacheFlow is marked as The third scenario corresponds to OS’s that use demand non-cacheable. For CacheFlow operating in flush mode, it is paging. In this case, there is no fixed mapping between only necessary to reserve enough memory for a single snap- virtual pages assigned to applications and physical mem- shot. This space is reused for subsequent snapshots. Con- ory. In this case, contiguous pages in virtual memory are versely, to operate in transparent mode, enough memory to arbitrarily mapped to physical memory, following the OS’s store all the snapshots required for the current experiment internal memory allocation scheme. With demand paging, needs to be reserved. We refer the reader to Section 6 for applications are initially given a virtual addressing space. more details on the size of each snapshot. With this setup, Only when the application “touches” a virtual page, a new the Shutter minimizes snapshot pollution by performing a physical page is allocated from a pool of free pages. In loop of a few instructions that (1) iterate through all the CacheFlow we consider this case because it represents the cache ways and sets to be read; (2) operate exclusively on most general and challenging scenario. CPU registers; and (3) only perform memory stores toward the non-cacheable buffer. 5.4 Usage Scenarios Pausing Progress: Capturing a snapshot can take a non- We envision that, in addition to the use cases directly ex- negligible amount of time. While a snapshot capture is in plored in this paper, CacheFlow and variations of our tech- progress, it is important to ensure that the applications nique can be employed in a number of use cases, including under analysis do not progress in their execution. In other but not limited to the following: (1) To study the heat-map words, the Shutter should be able to temporarily freeze all of cached content over time and understand the evolution the running applications and resume their execution once of the active memory of user applications. (2) the capture operation is complete. Not doing so would To study the cache footprint of specific functions within an result in snapshots that do not reflect a real cache state. application, a library or at the level of system calls. 
(3) In a This is because the state of the cache would be continuously single-core setup with multiple co-running applications, to changing while the capture is still in progress. study how scheduling decisions and preemptions affect the On a single-core implementation, it is enough to run the composition of cache content over time. (4) In a multi-core Shutter with interrupts and disabled to ensure setup with multiple tasks running in parallel to study the that the application under analysis does not continue to contention over cache resources. (5) To debug and validate execute while a capture operation is in progress. But this set-based (e.g. coloring) and way-based cache partitioning is not sufficient in a multi-core implementation. To solve the techniques. (6) To validate hypotheses on the expected problem, we first designate a master core responsible for behavior of a multi-level cache hierarchy, e.g., in terms completing the capture operation. Next, we use kernel-level of inclusiveness, replacement policy, employed coherency inter-core locking primitives to temporarily stall all the other protocol, prefetching strategies. (7) To asses the vulnerability cores. Once the snapshot has been acquired, all the other of a machine to cache-based side channel attacks arising cores are released and resume normal execution. from and memory traffic. Inferring Content Ownership: As recalled in Section III, shared caches — like the L2 targeted in our implementation — are generally PIPT. As such, when a snapshot is captured, 5.5 Comparison to Other Approaches we obtain a list of physical address tags. A first important CacheFlow offers a novel method to study caches, where step consists in reconstructing the full physical address traditionally hardware debuggers and simulation models given the obtained tag bits and the cache index bits used have been used. System designers have traditionally re- to retrieve each tag. The end goal of our analysis, however, sorted to hardware debuggers to inspect the contents of is to attribute captured cache lines to running applications cache memories, using them as a proxy to study correct or OS-level components. This step is strictly dependent in system behavior and explain applications’ performance. The the strategy used by the OS/platform under analysis to main advantage of using an external hardware debugger map applications’ virtual addresses to physical memory. We to inspect the state of caches is that the impact of the distinguish three cases. debugger on the cache itself can be kept to a minimum. The first and simplest case corresponds to (RT)OS’s op- But making sense out of a cache snapshot requires access erating on small micro-controller that do not have support to OS-level data structures such as page tables and VMA for virtual memory, i.e. where no MMU is present. These layouts, to name a few. Debuggers that provide some of systems usually feature a (MPU) the cache analysis features provided by CacheFlow rely on that allows defining permission regions for ranges of phys- high-bandwidth trace ports — as opposed to traditional 7

JTAG ports — often unavailable in production systems. The Lauterbach PowerTrace II and the ARM DS-5 with the DSTREAM adapter are examples of solutions that can provide snapshots of cache contents. Their price tag exceeds USD 6,000 3. While in principle an inexpensive JTAG de- bugger could be used to halt the CPUs, interact with the RAMINDEX interface, perform physical→virtual translation and application layout resolution, no such implementation exists to the best of our knowledge. In contrast, CacheFlow runs entirely in software, im- poses minimal system overhead, does not requires the ex- istence of a debug port nor extra hardware, and can run on most machines with support for the RAMINDEX interface, while managing to provide much of the same information as hardware debuggers with minimal effort. CacheFlow’s most Fig. 2. Logical interplay between the modules of CacheFlow and se- obvious shortcoming is some inevitable overhead, in terms quence of operations performed to capture a cache snapshot. of cache pollution, compared to hardware debuggers4. Another method often used to perform cache profiling is 6.1 Target Platform simulation. Unfortunately, simulation models are often too We conducted our experiments on an NVIDIA Tegra X1 generic to capture implementation-specific design choices. SoC [39]. The TX1 chip features a cluster of four ARM Gem5 [12], for example, only simulates a generic cache Cortex A57 [32] CPUs operating at a frequency of 1.9 GHz, model, which may not be in match with the behavior of the along with four unused ARM Cortex-A53 cores. Each CPU actual hardware. It is also challenging to simulate entire sys- contains a private 48 KB L1 instruction cache and a 32 KB tems with production-like setups in terms of active system L1 data cache. The L2—also the LLC—is unified and shared services, active I/O devices and concurrent applications. among all the cores. It is implemented as a PIPT cache and Conversely, CacheFlow can be used to observe the behavior employs a random replacement policy. of a system in the field, and/or to validate and refine In terms of geometry, the L2 cache is CS = 2 MB in size. platform-specific cache simulation models. The line size is LS = 64 bytes and the associativity is W = Yet another class of cache analysis approaches are 16. It follows, then, that the cache is divided into 2048 cache based on performance-counter sampling. These only pro- sets—each containing 16 cache lines, which in turn contain vide quantitative information on system-wide metrics that 64 bytes of data each. Bits [0, 5] of a physical address mark are best interpreted with a good understanding of the micro- the offset bits; bits [6, 16] are the index bits; bits [17, 43]5 architecture at hand. In comparison, CacheFlow, provides correspond to the tag bits. behavioral information about the cache that is akin to what The information acquired for each cache line in our a hardware debugger could provide. implementation is limited to 16 bytes. Of these, 8 bytes An additional benefit CacheFlow offers compared to the are for the PID of the process that owns the line, and aforementioned approaches is its relative versatility apropos the remaining 8 bytes encode an address field. If address RAMINDEX deployability. Since it relies exclusively on and resolution is turned on, the field holds the resolved virtual Linux’s scaffolding for building and loading kernel mod- address. 
If address resolution is turned off, the field is used ules, CacheFlow serves as an excellent candidate for remote to store the raw physical address instead. For a 16-way set- deployment. On virtual private servers (VPS), for instance, associative cache with a line size of 64 bytes and total size CacheFlow can provide information that developers would of 2 MB, like the one considered for our evaluation, a single traditionally rely on debuggers for, without necessitating snapshot is 512 KB in size. In our setup, we have dedicated physical access to the system. As such, CacheFlow’s value 1 GB of memory for CacheFlow, meaning that for any given lies primarily in its simplicity and its reliance on ubiqui- experiment, up to 2048 snapshots can be collected. With a tous support structures, both of which then engender an typical snapshot period of 5-10ms, this allows studying the acceptable compromise between the effort needed to setup behavior of applications with a runtime of 10-20 seconds. a hardware debugger and the loss of granularity incurred when using simulators. 6.2 Trigger Implementation 6 IMPLEMENTATION Starting from the top-level module of CacheFlow, i.e. the Additional details about our CacheFlow implementation Trigger, we hereby review the proposed implementation follow below. We begin by describing the relevant features following the logic flow of operations provided in Figure 2. of our target SoC and then illustrate the workings of the The Trigger module is always executed with the highest Trigger and Shutter modules. An open-source version of real-time priority. The module is designed for event-based CacheFlow is available at: URL omitted for blind submission. activation and hence a new snapshot is initiated when a new event activates the Trigger, as shown in Figure 2 1 . In our 3. See http://www.wg.com.pl/pliki/cennik/2017%20Prices% current prototype, we implemented two activation modes: 20DS-5.pdf 4. We chose to implement CacheFlow as a Linux kernel module for (1) periodic activation with a configurable inter-activation flexibility. It is theoretically possible, however, to rewrite the module to run at a lower level (i.e. at the hypervisor level). 5. The platform supports a 44-bit physical address space. 8

time; and (2) event-based. In periodic mode, events are filesystem. Configuration parameters and snapshot acquisi- generated using a real-time timer set to deliver a SIGRTMAX tion commands are sent to the Shutter via ioctl calls. If signal to the Trigger. The tasks under analysis are launched CacheFlow is operating in transparent mode, this interface directly by the Trigger to ensure in-phase snapshotting, and is also used to specify which snapshot to retrieve at the con- to allow the Trigger to set a specific scheduler and priority clusion of the experiment. A read system call (Figure 2 3 ) on the spawned children processes. can be used to retrieve the selected snapshot in user-space. The Trigger implements event-based activation by ini- Inter-CPU Synchronization: As discussed in Section 5, tiating a snapshot upon receipt of a SIGRTMAX-1 signal. being able to perform a full capture of L2’s content while In synchronous mode, a task under analysis spawned by minimizing pollution is of the utmost importance. For this the Trigger can initiate a snapshot with a combination of reason, the Shutter delivers an Inter-Processor Interrupt (IPI) getppid and kill system calls. This mode was used in to all the other CPUs, forcing them to enter a busy-waiting Section 7.7 to study the properties of the cache replacement loop — see Figure 2 4 . Specifically, right before broadcasting policy in the target platform. The limitation of this approach the IPI, the Shutter acquires a spinlock. Next, the payload is that some instrumentation in the applications’ code is of the delivered IPI makes all the other CPUs wait on the required. Future extensions will leverage the ptrace family spinlock. of operations to allow attaching to unmodified applications, Pollution-free Retrieval: While holding the spinlock, and to trigger a new snapshot upon reaching an instruction the Shutter proceeds to use the aforementioned RAMINDEX or data breakpoint. (see Section IV) interface to access the contents of the L2 The next step depends on the mode (synchronous vs. tag memory, as shown in Figure 2 5 . Entry by entry, a asynchronous) in which the Trigger is configured to run. kernel-side buffer is filled with the result of the RAMINDEX In the synchronous mode, the trigger immediately stops all operations — as per Figure 2 6 . To avoid polluting the L2 at the observed tasks with a SIGSTOP — see Figure 2 2 . This this step, the buffer is allocated as a non-cacheable memory is particularly useful if multiple cores are active, but intro- region. To do so, we use the boot-time parameter mem to duces unnecessary additional overhead when performing restrict the amount of main memory seen (and used) by single-core analysis. Hence, this is an optional step and Linux to carve out a large-enough physically-contiguous skipped when the Trigger operates in asynchronous mode. memory buffer. This area is then mapped by the Shutter us- Next, the Trigger commandeers the acquisition of a new ing the ioremap_nocache kernel API and used to receive snapshot to the Shutter module via a proc filesystem inter- the content of the L2 tag entries. face, as depicted in Figure 2 3 . If the trigger is operating in From Physical to Virtual: After all the L2 tag entries flush mode, the binary content of the snapshot is immedi- have been retrieved, the buffer contains a collection of phys- ately copied to a user-space buffer via the same interface, ical addresses, one per each cache block that was marked as as shown in Figure 2 8 . 
At this point, the trigger might valid in L2. Recall from Section 5.3 that operating systems collect in-line statistics on the content of the snapshot, or that have full support for MMUs define a non-linear map- (optionally) (see Figure 2 9 ) render the snapshot in human- ping between virtual pages and physical memory such that readable format and store it in persistent memory for later a set of contiguous virtual pages is comprised of arbitrarily analysis. This step is deferred to the end of the experiment scattered physical pages. Therefore, while the conversion for all the collected snapshots if CacheFlow operates in virtual→physical can be easily performed using page-table transparent mode. walks, the reverse translation is non-trivial. The physical Typical embedded applications make limited use of address resolution step depicted in Figure 2 7 refers to such dynamic memory and dynamically linked libraries, which a reverse translation performed on each of the retrieved L2 are instead commonly used features in general-purpose entries. applications. It follows that the virtual memory layout — i.e. To perform this resolution, we leverage Linux’s specific the list of VMAs — of embedded applications is generally representation of memory pages. Linux defines a descriptor static. It is easy to infer to which VMA a given virtual of type struct page for each of the physical memory address belongs to in applications with static memory lay- pages available in the system. The conversion fom phys- out. Conversely, dynamically linking libraries and allocat- ical address to page descriptor is possible through the ing/freeing memory at runtime can substantially change the phys_to_page kernel macro. We first derive the page VMA layout of an application. For this reason, the Trigger descriptor of the physical address to be resolved. Next, optionally allows recording the current memory layout of we effectively re-purpose the reverse-map interface6 (rmap) an application at the time a cache snapshot is acquired. This used by Linux to efficiently free physical memory when is done by reading the /proc/PID/maps interface, where swapping is initiated. The entry point of the interface is PID is the process id of the application under analysis. the rmap_walk kernel API. Given a target struct page Finally, if the Trigger is operating in synchronous mode, descriptor, the procedure allows one to specify a callback a SIGCONT is sent to all the tasks under analysis as shown function to be invoked when a possible candidate for the in Figure 2 10 . reverse translation is found7. A successful rmap_walk op- eration returns (i) a reference to the VMA that maps the 6.3 Shutter Implementation page; and (ii) the virtual address of the page inside the

The Shutter is implemented as a Linux kernel module. 6. See https://lwn.net/Articles/75198/ for more details. At startup, it establishes a communication channel with 7. Because multiple processes might be mapping the same physical user-space by creating a new entry in the proc pseudo- memory, the reverse translation is not always unique. 9

VMA. Importantly, the reference to the VMA allows one to not meant to be an exhaustive evaluation of all the scenarios derive the original memory space (struct mm_struct); in which CacheFlow might be employed, but rather to and from there, the descriptor of the process (struct demonstrate that CacheFlow is capable of producing useful task_struct) associated to the memory space and its insights on the cache usage of real applications in a real PID. After the translation step, the entries in the buffer are system. converted to contain two pieces of information: (i) the PID of the process to which the cache block belongs, and (ii) 7.1 Setup and Goals the virtual address of the block within the process’ virtual All the experiments described in this section have been memory space. carried out on an TX1 development system The virtual→physical translation can be optionally dis- running Linux v4.14. The Jetson TX1 features an NVIDIA abled. This is useful, for instance, when profiling the cache Tegra X1 SoC, in line with what is described in Section 6. behavior of an application pinned to a specific subset of For our workload, we use a combination of synthetic physical pages, of a different virtual machine, or of the and real benchmarks. Additional details about the synthetic kernel itself. benchmarks we designed are provided contextually to the From Kernel to User: Lastly, the contents of the kernel- experiment in which they are employed. For our real bench- side buffer are copied into a user-space buffer defined in marks, we considered applications from the the San Diego the Trigger. On this very last step, a distinction needs to Vision Benchmarks (SD-VBS) [40], which come with multi- be made because the behavior of CacheFlow significantly ple input sizes. Our goal is to demonstrate the usefulness differs when it operates in flush mode, compared to trans- of CacheFlow in analyzing an application’s cache behav- parent mode operation. ior. As such, we include only a selection of the obtained In flush mode, the goal is to analyze what cache blocks results covering the most interesting cases. We selected are actively loaded by an application in between snapshots. the DISPARITY,MSER,SIFT, and TRACK benchmarks with For this reason, every snapshot acquisition is immediately intermediate input sizes, namely VGA (640x480) and CIF followed by a copy of the snapshot to the Trigger in user- (352x288) images. space8. The Trigger also converts the binary format of the The remainder of this section is organized to address the snapshot to human-readable format and writes it to disk. following questions: This step corresponds to Figure 2 8 . The copy to user-space, as well as any post-processing performed by the Trigger, 1) Is CacheFlow able to provide an output that is repre- is conducted in cacheable memory. Because the amount of sentative of the actual cache behavior of an application? data moved after each snapshot is comparable in size to the This is covered in Section 7.2. L2, the post-processing acts as a tacit flush operation. How- 2) What is the overhead in terms of cache content pollu- ever, to cope with random cache replacement policies, the tion and time? We discuss this aspect in Section 7.3. Trigger performs additional cache trashing before resuming 3) Is it possible to track the cache behavior of real applica- the applications to ensure that the content of the cache is tions in term of WSS and frequently accessed memory indeed flushed. 
The presence of this flush operation is vital locations using CacheFlow? We tackle this question in to the correct interpretability of the results. By doing so, Section 7.4 each snapshot contains only cache blocks allocated during 4) Can CacheFlow reveal system-level properties, such the last sampling period. Therefore, the extracted content is as (1) how the cache is being shared by concurrent representative of the recent activity of the applications and applications; (2) how scheduling decisions impact cache enables active cache working-set analysis. usage? Section 7.5 approaches this question. In transparent mode, the goal is to analyze the evolution 5) Is it possible to use CacheFlow to predict whether an of cache content over time while minimizing the impact application will suffer measurable cache interference of CacheFlow on the cache state. In this mode, no post- from a co-running application? This is explored in snapshot flush is performed. Thus, subsequent snapshots Section 7.6. are accumulated in non-cacheable memory. They are then 6) Can we study the replacement policy and implemented moved to user-space and post-processed only at the very by the target platform and its statistical characteristics? end of the experiment. We evaluate in Section 7.3 how much This analysis is conducted in Section 7.7. pollution is introduced in the various modes of operation. Because the size of a snapshot to be transferred to user- 7.2 Is CacheFlow’s Output Meaningful? space is on the order of hundreds of pages, we use the The first aspect to validate is whether or not CacheFlow 9 sequential file (seq_file) kernel interface . This interface is able to provide an output that can be trusted, in the safely handles proc filesystem outputs spanning multiple sense that it is meaningfully related to the behavior of the memory pages. application under analysis. We study the output produced by CacheFlow on a synthetic benchmark. The benchmark 7 EVALUATION allocates two buffers of 512 KB each. It then performs a full This section aims to demonstrate the capabilities of the write followed by a full read on the first buffer. Next, it proposed and implemented CacheFlow technique. This is performs a full write followed by a full read on the second buffer. 8. Note that it is still important to prevent cache pollution while the current snapshot is being acquired. Thus, pollution-free retrieval is still We set our trigger to operate in periodic, synchronous crucial. mode, with an interval of2 milliseconds between snapshots. 9. See https://lwn.net/Articles/22355/ for more details. We then plot a heat-map of the number of cache lines 10

+2 [vvar] TABLE 2 +1 +28 Pollution and Time overhead of CacheFlow. +24 +20 libc +16 +12 6 +8 Space Overhead (%) Time Overhead (10 Cycles) +4 Mode Avg. Std. Dev Max Avg. Std. Dev. Max [text] +0 Full Flush +252 95.23 3.21 99.10 10.13 0.78 12.88 +210 Resolve+Layout 13.40 0.43 14.70 0.34 0.00 0.35 +168 +32 Resolve 6.78 0.45 8.23 0.34 0.00 0.34 [stack] [anon] +126 +31 Layout 6.55 0.50 8.71 0.20 0.00 0.20 +84+4 +3 libc +42 Full Transparent 1.06 0.19 1.38 0.19 0.00 0.19 +2 +1+0 +4 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 libc +3 Snapshot # in a write-back of a dirty line followed by a load from Fig. 3. SYNTH+2 memory heat-map analysis. Flush mode. +1 +252 main memory; and (2) a cache with many outstanding store +210 transactions will stall when its internal write-buffer is full. +168 [anon] Because Figure 3 was produced in flush mode, the only +126

+84 cache lines highlighted in the heat-map are those that were +42 accessed by the application in-between snapshots. It goes +0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 to demonstrate that flush mode is particularly well suited Snapshot # Fig. 4. SYNTH memory heat-map analysis. Transparent mode. to study the active cache set of applications. Conversely, operating in transparent mode allows us to understand found in the snapshot for each of the pages that belong to how/if cache lines allocated at some point during execu- the benchmark under analysis. We perform the experiment tion persist in cache. Figure 4 depicts one such case. After twice, once in flush mode and then in transparent mode. snapshot 12, the application will not access the top portion The results of the two experiments are depicted in Figure 3 of its addressing space. Nonetheless, since the overall buffer and 4. touched by the application is smaller than the cache size, In Figure 3 we depict the heat-map for the 4 most the unused lines remain in cache. These lines slightly fade used memory regions — as determined by reading the in color because some eviction still occurs due to the random content of the snapshots. We only depict the most used replacement policy of this cache. (and interesting) region in Figure 4. The color scale we used indicates the number of lines present per 4 KB page in each snapshot, with darker blue tones indicating fewer lines 7.3 What is the Overhead of Snaphotting? and tones shifting towards yellow signifying more lines. We evaluate the overhead introduced by CacheFlow along The progression of the snapshots is plotted on the x axis. two dimensions: cache pollution and timing. We measure Page addresses are then plotted on the y axis in terms of cache pollution as the change in cache content that is intro- relative offset (in pages) from the beginning of the region. duced solely because of CacheFlow’s activity. To evaluate We annotate the name of the region on the left-hand side of cache pollution, we first execute a cache-intensive applica- the plot. tion for a given lead time set to 100 ms. The lead time allows A number of important observations can be made. We the application to populate the cache. Then, we set the summarize the most interesting ones. First, we notice that at Trigger to activate at the end of the lead time and to acquire the beginning of the application, the pages of the stack, the two consecutive snapshots back-to-back. The first snapshot text and glibc library are mostly present in cache. This captures the content of the cache status as populated by the is in line with the initialization of the application which application under analysis. The second snapshot is used to uses malloc to allocate its buffer. For small allocations, understand the change in cache status introduced by kernel- malloc expands the heap of the calling process. But if the to-user copy of the first snapshot and its post-processing. allocation of larger buffers is requested, the glibc resorts The overhead then is evaluated in terms of percentage of to performing a mmap call instead to create a new memory cache lines that have changed over the total number of cache region not backed by a file (anonymous memory). It is in lines. The overhead in time is evaluated by measuring the this anonymous region, marked as “[anon]” in the plot, end-to-end time required to acquire a single snapshot. 
that the bulk of memory accesses will be performed by our The results of the overhead measurements are reported benchmark. We can also notice that the region comprises 256 in Table 2 and are the aggregation of 100 experiments pages, i.e. 1 MB. for each setup. We measure pollution and time overhead Let us now focus on this region. In Figure 3, one can in synchronous mode because it is the most suitable for easily distinguish the scan performed by the benchmark general analysis scenarios. We then consider a host of dif- over the buffers. Each scan shows up as bands of yellow ferent sub-modes. In “Full Flush”, CacheFlow operates in moving left-to-right. Because a full pass of stores followed flush mode with physical→virtual translation and VMA by a full pass of loads is performed in each scan, two yellow layout acquisition active. As expected, around 95% of the bands are visible in each scan. By looking at the slope of these cache content is modified in this mode after a snapshot bands, it is possible to understand the rate of progress of the is completed. The next four cases correspond to the over- benchmark moving through the buffer. It can be noted that head of CacheFlow operating in transparent mode. In the stores are slower than loads — it takes around 8 snapshots “Resolve+Layout” mode, both address resolution and VMA to complete the first store sub-iteration, and only 5 for the layout acquisition are turned on. In the “Resolve” (resp., loads. This might sound counter-intuitive but indeed makes “Layout”) case, only address resolution (resp., VMA layout sense because (1) in write-back caches a store might result acquisition) are enabled. Finally in the “Full Transparent” 11

+4 libc +3 parent track. (vga) mser (vga) sift (cif) disp. (cif) other unres +2 +1 100% +4 [text] +2 +0 90% +276 +230 80% +184 [heap] +138 70% +92 +46 60% +0 +2796 50% +2330

+1864 40% [anon] +1398 L2 Cache Occupancy +932 30% +466 20% +0 0 9 18 27 36 45 54 63 72 81 90 99 108 117 126 135 144 153 162 171 180 189 10% Snapshot #

0% Fig. 5. DISPARITY memory heat-map when executed on VGA input. 0 33 66 99 132 165 198 231 264 297 330 363 396 429 462 495 528 561 594 627 660 Snapshot # +32 [stack] +31 +30 +6 Fig. 7. Single-core execution of 4 SD-VBS benchmarks with default [text] +3 +0 +2700 completely fair scheduler (Linux’s default). +2250 +1800 parent track. (vga) mser (vga) sift (cif) disp. (cif) other unres [anon] +1350 100% +900 +450 90% +0 +6072 80% +5060 +4048 70% [heap] +3036 60% +2024

case, both address resolution and VMA layout acquisition are skipped. Notably, cache pollution in full transparent mode does not exceed 1.4%, which is acceptable for a software-only solution. Times are reported in millions of CPU cycles to better generalize to other platforms. The CPUs operate at 1.9 GHz on the target platform.

Fig. 6. SIFT memory heat-map when executed on VGA input.

7.4 Can CacheFlow Analyze Real Applications?

Having assessed that CacheFlow can indeed provide accurate insights into an application's cache utilization, we analyze two real applications. Figure 5 and Figure 6 report the heat-map analysis of the SD-VBS benchmarks DISPARITY and SIFT, respectively, using VGA resolution images in input.

It can be observed that DISPARITY is characterized by clearly distinguishable intro (snap. 0-25) and outro (snap. 172-198) phases. In the intro, the input buffer is first pre-processed in the “anon” region. It is followed by an intermediate processing phase that actively uses around 83% of the region's memory with a recurring pattern, with pages at offsets 466-930 being the most frequently accessed. During the outro phase, the final output is produced sequentially on the heap. A quite different picture is painted by the SIFT application. In this case, there exists an initial phase (snap. 0-40) where pre-processing is performed on an anonymous region; then the bulk of the processing is carried out on the heap. From around snapshot 30 until 170, two non-contiguous sets of memory pages are in use, which gradually span 82% of the region. The final computation step is performed on the top 75% of the heap.

7.5 Can CacheFlow Discover System Properties?

CacheFlow can also provide valuable insights on the cache-related behavior of the whole system when multiple applications are executed. To evaluate this, we concurrently run four SD-VBS applications, namely TRACK, MSER, SIFT, and DISPARITY, all with VGA resolution images in input.

In the first experiment, the benchmarks are executed on a single core — the other three cores are turned off. Moreover, the default Linux scheduling policy (SCHED_OTHER) is used. We then set the Trigger to acquire periodic snapshots every 10 ms and study the per-process L2 occupancy over time. The results are shown in Figure 7. The SCHED_OTHER scheduler implements a Completely Fair Scheduler (CFS), which tries to ensure even progress of the co-running workload. This results in frequent context switches that cause quick changes in the composition of the L2 cache content, visible in the figure. The figure also shows how more cache-intensive benchmarks such as DISPARITY can dominate the other workload in terms of utilization of cache resources.

Fig. 8. Single-core execution of 4 SD-VBS benchmarks with fixed-priority real-time scheduler.

To understand the impact of scheduling policies on cache utilization, we conduct a similar experiment where we use a fixed-priority real-time scheduler. Specifically, we run the benchmarks with the SCHED_FIFO policy and set their priorities in decreasing order — i.e., TRACK has the highest priority and DISPARITY the lowest. With this setup, we obtain Figure 8. In this case, it is clear that the execution of the benchmarks is performed in strict order with no interleaving. It can also be noted how the benchmarks under analysis vary their WSS as they progress. Another interesting observation is that, overall, the execution takes less time (640 snapshots) compared to Figure 7, likely due to better cache locality in the absence of frequent context switching.
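For reference, this fixed-priority setup can be reproduced with the standard POSIX scheduling interface. The sketch below is only an illustration of the configuration described above (the exact launcher used for the experiments is not specified here) and assumes the four benchmark PIDs are passed on the command line, ordered from TRACK (highest priority) to DISPARITY (lowest).

/* Illustrative only: assign decreasing SCHED_FIFO priorities to four
 * already-running benchmark processes. Must run with sufficient
 * privileges (e.g., as root or with CAP_SYS_NICE). */
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char *argv[])
{
    if (argc != 5) {
        fprintf(stderr, "usage: %s <track_pid> <mser_pid> <sift_pid> <disparity_pid>\n",
                argv[0]);
        return 1;
    }

    int base = 80; /* arbitrary starting point within the SCHED_FIFO 1-99 range */
    for (int i = 0; i < 4; i++) {
        struct sched_param sp = { .sched_priority = base - i };
        pid_t pid = (pid_t)atol(argv[i + 1]);

        if (sched_setscheduler(pid, SCHED_FIFO, &sp) < 0)
            perror("sched_setscheduler");
    }
    return 0;
}

The same effect can be obtained from the shell with the chrt utility.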

Fig. 9. Execution of 4 SD-VBS benchmarks with fixed-priority real-time scheduler on 3 cores.

Next, we demonstrate that CacheFlow can also be used to investigate the behavior of the shared L2 cache when multiple cores are active. In this setup, we turn on 3 out of 4 cores10 and deploy the same set of 4 SD-VBS benchmarks with the same arrangement of real-time priorities. We do not control process-to-CPU assignment and let the Linux scheduler allocate processes to cores at runtime. What emerges in terms of per-process L2 occupancy is depicted in Figure 9. Given that the L2 is a shared cache, we see multiple applications competing for cache space. As they progress, the amount of occupied cache depends not only on their own current WSS, but also on the WSS of the co-running applications. This is particularly clear when looking at the interplay in cache between the TRACK and MSER benchmarks compared to what we observed in Figure 8. Because only 3 cores are available, it can also be noted that DISPARITY only starts executing at snapshot 68, once again significantly dominating cache utilization in the central portion of its execution (snap. 98-215).

10. We also conducted the experiment on the full 4-core setup, which yielded similar results and is omitted due to space constraints.

7.6 Can Clash on Shared Cache be Predicted?

In this section, we investigate whether cache snapshotting can be used to predict clashes on cache resources between co-running applications. To conduct this analysis, we first analyze the behavior of our applications in isolation. We use snapshotting to derive two types of profiles for the SD-VBS applications under analysis. Both profiles are constructed from snapshots acquired in flush mode. In the first type of profile, we study the sheer amount of data allocated in cache at each snapshot — “Active” set analysis. In the second type of profile, namely “Reused” set analysis, we evaluate only the number of lines that are the same and present in two successive snapshots. The reused set captures the amount of data that, if evicted, causes a penalty in execution time. While active set analysis could be carried out by carefully interpreting cache performance counters, reused set analysis can only be conducted by leveraging exact knowledge of which lines are cached from time to time. Hence, this type of analysis was previously limited to simulation studies.

Figure 10 depicts the two profiles for each of the considered real benchmarks. In addition, we considered a synthetic benchmark called BOMB, which continually and sequentially accesses a 2.5 MB buffer. From the figure, it appears that while there exists a positive correlation between the active and reused sets, the two often (and substantially) differ.

Fig. 10. Active set (left) and reused set (right) analysis of SD-VBS and BOMB benchmarks.
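To make the two profiles concrete, the sketch below illustrates how the reused set can be derived offline once each snapshot has been reduced to the set of cache lines attributed to the process under analysis. This is a simplified illustration of our reading of the definitions above, not the released analysis scripts; parsing raw snapshots and attributing lines to processes is assumed to happen elsewhere. The active set of a snapshot is simply the number of such lines; the reused set is the intersection of two successive snapshots.

/* Illustrative reused-set computation. A snapshot is modeled here as a
 * sorted array of cache-line addresses belonging to the process under
 * analysis. The reused set between two successive snapshots is their
 * intersection, computed with a linear merge-style scan. */
#include <stddef.h>

size_t reused_set(const unsigned long *prev, size_t n_prev,
                  const unsigned long *cur, size_t n_cur)
{
    size_t i = 0, j = 0, common = 0;

    while (i < n_prev && j < n_cur) {
        if (prev[i] < cur[j]) {
            i++;
        } else if (prev[i] > cur[j]) {
            j++;
        } else {
            common++;   /* line cached in both snapshots */
            i++;
            j++;
        }
    }
    return common;
}
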
Using the profiles, we build two metrics to try and predict the impact that two applications running in parallel would have on each other. For this experiment, we establish the ground truth by observing the slowdown suffered by an application under analysis when running in parallel with an interfering application. In these measurements, CacheFlow is not used. The results are reported in Table 3, in the first group of rows.

The first metric, namely “Active Set Excess”, is based on active sets only. In this case, we consider two applications at a time: an observed one and an interfering one. We then consider the sum of their active sets on a per-snapshot basis, and compute how much that sum exceeds the size of the cache (e.g., by 0%, by 50%, etc.). We then average this quantity over the full length of the profile. The results obtained by applying this metric to all the considered SD-VBS applications are reported in Table 3 — second group of rows.

A second metric, namely “Reused Set Eviction”, considers how much of the reused set of the application under analysis is potentially evicted by an interfering application. To build this metric, we once again reason on a per-snapshot basis. We multiply the reused set quota of the application under analysis by the active set quota of the interfering application. The average over the full length of the profile is then computed. The third group of rows in Table 3 reports the reused set eviction metric computed for the considered benchmarks.

Finally, we computed the correlation between the two metrics described above and the ground truth on the measured slowdown. The reused set eviction metric revealed an 80% correlation with slowdown. It outperformed the active set excess metric, which achieved a 74% correlation. These results serve as a proof-of-concept that CacheFlow can be used, for instance, to make interference-aware scheduling decisions.
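The two indices can then be computed directly from the per-snapshot profiles. The sketch below is an illustrative re-implementation of the description above (not the scripts used to produce Table 3) and assumes both profiles have already been expressed as fractions of the L2 cache, so that 0.4 means 40% of the L2 lines, and truncated to a common length n.

/* Illustrative computation of the two contention indices of Table 3.
 * All inputs are per-snapshot cache quotas in [0, 1]. */
#include <stddef.h>

/* Active Set Excess: average amount by which the combined active sets
 * of the observed and interfering applications exceed the cache size. */
double active_set_excess(const double *active_obs,
                         const double *active_intf, size_t n)
{
    double acc = 0.0;

    for (size_t i = 0; i < n; i++) {
        double sum = active_obs[i] + active_intf[i];
        acc += (sum > 1.0) ? (sum - 1.0) : 0.0;
    }
    return acc / (double)n;
}

/* Reused Set Eviction: average of the observed application's reused
 * quota scaled by the interfering application's active quota. */
double reused_set_eviction(const double *reused_obs,
                           const double *active_intf, size_t n)
{
    double acc = 0.0;

    for (size_t i = 0; i < n; i++)
        acc += reused_obs[i] * active_intf[i];
    return acc / (double)n;
}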

TABLE 3
Correlation of Slowdown and Cache Activity-Based Indices

                       INTERF.      BENCHMARK UNDER ANALYSIS
                       APPLIC.      DISPARITY   MSER   SIFT   TRACK
  Slowdown (×)         DISPARITY      1.16      1.13   1.02   1.02
                       MSER           1.02      1.06   1.01   1.02
                       SIFT           1.03      1.05   1.05   1.06
                       TRACK          1.01      1.02   1.03   1.06
                       BOMB           1.12      1.20   1.05   1.04
  Active Set           DISPARITY      0.55      0.33   0.19   0.19
  Excess (%)           MSER           0.12      0.25   0.04   0.03
                       SIFT           0.24      0.14   0.14   0.07
                       TRACK          0.12      0.05   0.04   0.15
                       BOMB           0.64      0.42   0.46   0.31
                       CORRELATION: 0.74
  Reused Set           DISPARITY      0.15      0.12   0.08   0.07
  Eviction (%)         MSER           0.03      0.10   0.02   0.01
                       SIFT           0.08      0.08   0.07   0.04
                       TRACK          0.04      0.04   0.03   0.06
                       BOMB           0.18      0.15   0.13   0.09
                       CORRELATION: 0.80

Fig. 11. Probability of k lines on the same cache set being cached after having been accessed 1 (top) through 8 (bottom) times.

7.7 Can we Study the Cache Replacement Policy?

The last aspect we evaluated is the capability, introduced by CacheFlow, to study the replacement policy implemented by the hardware. We focus on two aspects. First, we validate that the policy indeed follows random replacement. Second, we study how well the implemented replacement policy matches a truly random replacement.

To conduct these experiments, we devised a special synthetic benchmark, namely REPL. The REPL benchmark allocates from user-space a set of contiguous physical pages with a total size equal to the L2 cache size. This is done by leveraging mmap support for huge pages — the MAP_HUGE_2MB flag. Because the allocated buffer is aligned with the cache size, the first line of the buffer, say Line A1, necessarily maps to cache set 0. Similarly, the line (Line A2) that is exactly 128 KB away from Line A1 (the size of one cache way) also maps to set 0. With similar reasoning, we identify a set of 16 lines {A1, ..., A16} that all map to set 0. If the cache implements a deterministic replacement policy (e.g., LRU or FIFO), then accessing lines A1 through A16 will result in a snapshot containing all 16 lines, since the set has exactly 16 ways; under random replacement, some of these lines may instead evict one another. Following this idea, the REPL benchmark touches lines A1 through A16 a configurable number of times (iterations). Then, it delivers a signal to the Trigger (event-based activation) to acquire a snapshot in transparent mode.
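The core of REPL can be sketched as follows. The constants assume the 2 MB, 16-way L2 of the target platform (128 KB ways, 64 B lines). How the Trigger is notified depends on how CacheFlow is deployed, so the SIGUSR1 delivery to a trigger PID below is a hypothetical stand-in for the event-based activation rather than the actual interface of the released module.

/* Sketch of the REPL benchmark core. Assumes a 2 MB, 16-way L2 with
 * 128 KB ways; the buffer is backed by a single 2 MB huge page, hence
 * physically contiguous and cache-size aligned. The SIGUSR1 delivery
 * to a "trigger PID" is a hypothetical stand-in for the event-based
 * activation of the Trigger. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_HUGE_2MB
#define MAP_HUGE_2MB (21 << 26)        /* log2(2 MB) << MAP_HUGE_SHIFT */
#endif

#define L2_SIZE   (2UL * 1024 * 1024)  /* total L2 size               */
#define WAY_SIZE  (128UL * 1024)       /* one cache way               */
#define NUM_WAYS  16

int main(int argc, char *argv[])
{
    int iters = (argc > 1) ? atoi(argv[1]) : 1;                /* 1..8 in the paper */
    pid_t trigger_pid = (argc > 2) ? (pid_t)atol(argv[2]) : 0;

    volatile char *buf = mmap(NULL, L2_SIZE, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS |
                              MAP_HUGETLB | MAP_HUGE_2MB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Touch lines A1..A16: one line per way-sized stride, all in set 0. */
    for (int it = 0; it < iters; it++)
        for (int w = 0; w < NUM_WAYS; w++)
            buf[(unsigned long)w * WAY_SIZE] += 1;

    if (trigger_pid > 0)
        kill(trigger_pid, SIGUSR1);  /* request a transparent-mode snapshot */

    pause();                         /* keep the mapping alive during acquisition */
    return 0;
}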

In the snapshot, we evaluate how many of the 16 lines are actually present in cache. By repeating the same experiment 1000 times, we can plot the probability density that k ∈ {1, ..., 16} out of the 16 lines are found in cache. We repeat the experiment performing from 1 to 8 iterations over lines A1 through A16 before acquiring a snapshot. The resulting density plots are reported in Figure 11.

In our last experiment, we evaluate how closely the implemented random replacement policy matches a truly random replacement policy. To evaluate this aspect, we use a variant of the REPL benchmark. We consider again lines A1 through A16, but in this case the benchmark activates the Trigger after touching each line and stops after reaching line A16. A single run thus produces 16 snapshots. Moreover, from snapshot 1 we can derive which cache way was selected to allocate A1, from snapshot 2 which way was selected for A2, and so on. We run the experiment 2000 times and collect a total of 32,000 replacement decisions. We then compute the number of times each of the 16 cache ways was selected for allocation over the total number of observations. The results are reported in Figure 12. In a perfectly random replacement scheme, each way has a probability of 1/16 = 6.25% of being selected. The implemented replacement policy does not deviate significantly from perfect random replacement, although, interestingly, the central ways (ways 7 to 11) appear statistically less likely to be selected for allocation.

Fig. 12. Probability of cache way 1-16 being selected for replacement.
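The per-way probabilities shown in Figure 12 then reduce to simple counting. The fragment below illustrates that post-processing step; it assumes the way index (0-based) selected for each of the 32,000 replacement decisions has already been recovered from the snapshots by parsing code not shown here.

/* Illustrative post-processing: per-way selection frequency versus the
 * ideal 1/16 probability of a truly random replacement policy. */
#include <stdio.h>

#define NUM_WAYS 16

void way_histogram(const int *way_of_decision, int n_decisions)
{
    int count[NUM_WAYS] = { 0 };

    for (int i = 0; i < n_decisions; i++)
        count[way_of_decision[i]]++;

    for (int w = 0; w < NUM_WAYS; w++)
        printf("way %2d: observed %.4f, ideal %.4f\n",
               w + 1, (double)count[w] / n_decisions, 1.0 / NUM_WAYS);
}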

8 CONCLUSION AND FUTURE WORK

In this work, we have proposed a technique, namely CacheFlow, that leverages existing yet untapped micro-architectural support to enable cache content snapshotting. The proposed implementation has highlighted that CacheFlow can provide unprecedented insights on application-level features in the usage of cache resources, and on system-level properties. As such, we envision that CacheFlow and its analogues will serve as a powerful instrument for system designers to better understand the interplay between applications, system components, and the cache hierarchy. Ultimately, we expect that it will complement existing simulation-based and static analysis approaches by providing a way to refine and validate cache memory models. While we have restricted ourselves to analyzing the LLC in this work, RAMINDEX's capabilities far exceed that. Hence, we encourage the community to extend and refine the proposed open-source implementation to conduct a wider range of studies, e.g., on the behavior of private caches, TLBs, and coherence controllers.

REFERENCES

[1] M. Lv, N. Guan, J. Reineke, R. Wilhelm, and W. Yi, "A survey on static cache analysis for real-time systems," Leibniz Transactions on Embedded Systems, vol. 3, no. 1, pp. 05:1–05:48, 2016.
[2] R. Wilhelm, J. Engblom, A. Ermedahl, N. Holsti, S. Thesing, D. Whalley, G. Bernat, C. Ferdinand, R. Heckmann, T. Mitra et al., "The worst-case execution-time problem — overview of methods and survey of tools," ACM Transactions on Embedded Computing Systems (TECS), vol. 7, no. 3, pp. 1–53, 2008.
[3] D. Grund, Static Cache Analysis for Real-Time Systems: LRU, FIFO, PLRU. epubli, 2012.
[4] C. Ferdinand and R. Wilhelm, "Efficient and precise cache behavior prediction for real-time systems," Real-Time Systems, vol. 17, no. 2-3, pp. 131–181, 1999.
[5] P. Cousot and R. Cousot, "Basic concepts of abstract interpretation," in Building the Information Society. Springer, 2004, pp. 359–366.
[6] R. T. White, F. Mueller, C. A. Healy, D. B. Whalley, and M. G. Harmon, "Timing analysis for data caches and set-associative caches," in Proceedings Third IEEE Real-Time Technology and Applications Symposium. IEEE, 1997, pp. 192–202.
[7] J. Reineke, D. Grund, C. Berg, and R. Wilhelm, "Timing predictability of cache replacement policies," Real-Time Systems, vol. 37, no. 2, pp. 99–122, 2007.
[8] N. Guan, X. Yang, M. Lv, and W. Yi, "FIFO cache analysis for WCET estimation: A quantitative approach," in 2013 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2013, pp. 296–301.
[9] D. Grund and J. Reineke, "Precise and efficient FIFO-replacement analysis based on static phase detection," in 2010 22nd Euromicro Conference on Real-Time Systems. IEEE, 2010, pp. 155–164.
[10] D.-H. Chu and J. Jaffar, "Symbolic simulation on complicated loops for WCET path analysis," in 9th ACM International Conference on Embedded Software (EMSOFT). IEEE, 2011, pp. 319–328.
[11] D. Chu, J. Jaffar, and R. Maghareh, "Precise cache timing analysis via symbolic execution," in 2016 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), 2016, pp. 1–12.
[12] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, "The gem5 simulator," SIGARCH Comput. Archit. News, vol. 39, no. 2, pp. 1–7, Aug. 2011. [Online]. Available: https://doi.org/10.1145/2024716.2024718
[13] S. R. Sarangi, R. Kalayappan, P. Kallurkar, S. Goel, and E. Peter, "Tejas: A Java based versatile micro-architectural simulator," in 2015 25th International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), 2015, pp. 47–54.
[14] J. Wang, J. Beu, R. Bheda, T. Conte, Z. Dong, C. Kersey, M. Rasquinha, G. Riley, W. Song, H. Xiao, P. Xu, and S. Yalamanchili, "Manifold: A parallel simulation framework for multicore systems," in 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2014, pp. 106–115.
[15] N. Nethercote and J. Seward, "Valgrind: A program supervision framework," Electronic Notes in Theoretical Computer Science, vol. 89, no. 2, pp. 44–66, 2003.
[16] Z. Wang and J. Henkel, "Fast and accurate cache modeling in source-level simulation of embedded software," in 2013 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2013, pp. 587–592.
[17] L. M. N. Coutinho, J. L. D. Mendes, and C. A. P. S. Martins, "MSCSim – multilevel and split cache simulator," in Proceedings. Frontiers in Education. 36th Annual Conference, Oct 2006, pp. 7–12.
[18] S. E. Arda, A. NK, A. A. Goksoy, N. Kumbhare, J. Mack, A. L. Sartor, A. Akoglu, R. Marculescu, and U. Y. Ogras, "DS3: A system-level domain-specific system-on-chip simulation framework," 2020.
[19] H. Brais, R. Kalayappan, and P. R. Panda, "A survey of cache simulators," ACM Comput. Surv., vol. 53, no. 1, Feb. 2020.
[20] N. Nethercote and J. Seward, "Valgrind: A framework for heavyweight dynamic binary instrumentation," in Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '07. New York, NY, USA: Association for Computing Machinery, 2007, pp. 89–100.
[21] D. Levinthal, "Performance analysis guide for Intel Core i7 processor and Intel Xeon 5500 processors," Intel Performance Analysis Guide, vol. 30, p. 18, 2009.
[22] J. Treibig, G. Hager, and G. Wellein, "Performance patterns and hardware metrics on modern multicore processors: Best practices for performance engineering," in European Conference on Parallel Processing. Springer, 2012, pp. 451–460.
[23] D. Molka, R. Schöne, D. Hackenberg, and W. E. Nagel, "Detecting memory-boundedness with hardware performance counters," in Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering, ser. ICPE '17. New York, NY, USA: Association for Computing Machinery, 2017, pp. 27–38.
[24] P. J. Mucci, S. Browne, C. Deane, and G. Ho, "PAPI: A portable interface to hardware performance counters," in Proceedings of the Department of Defense HPCMP Users Group Conference, vol. 710, 1999.
[25] S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci, "A portable programming interface for performance evaluation on modern processors," The International Journal of High Performance Computing Applications, vol. 14, no. 3, pp. 189–204, 2000.
[26] J. Treibig, G. Hager, and G. Wellein, "LIKWID: A lightweight performance-oriented tool suite for multicore environments," in 2010 39th International Conference on Parallel Processing Workshops. IEEE, 2010, pp. 207–216.
[27] M. Selva, L. Morel, and K. Marquet, "numap: A portable library for low-level memory profiling," in 2016 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS). IEEE, 2016, pp. 55–62.
[28] A. Vishnoi, P. R. Panda, and M. Balakrishnan, "Online cache state dumping for processor debug," in Proceedings of the 46th Annual Design Automation Conference, 2009, pp. 358–363.
[29] B. R. Buck and J. K. Hollingsworth, "A new hardware monitor design to measure data structure-specific cache eviction information," The International Journal of High Performance Computing Applications, vol. 20, no. 3, pp. 353–363, 2006.
[30] P. R. Panda, A. Vishnoi, and M. Balakrishnan, "Enhancing post-silicon processor debug with incremental cache state dumping," in 2010 18th IEEE/IFIP International Conference on VLSI and System-on-Chip. IEEE, 2010, pp. 55–60.
[31] ARM Holdings, "Cortex-A53 MPCore technical reference manual (r0p4)," 2018.
[32] ——, "Cortex-A57 MPCore technical reference manual (r1p3)," 2016.
[33] ——, "Cortex-A72 MPCore technical reference manual (r0p2)," 2016.
[34] ——, "Cortex-A15 technical reference manual (r2p0)," 2011.
[35] ——, "Cortex-A9 technical reference manual (r4p0)," 2009.
[36] J. Corbet, J. Edge, and R. Sobol, "Kernel Development," Linux Weekly News – https://lwn.net/Articles/74295/, 2004, [Online; accessed 7-May-2019].
[37] ARM Holdings, "Cortex-A76 Core technical reference manual (r3p0)," 2018.
[38] ——, "Cortex-A77 Core technical reference manual (r1p1)," 2019.
[39] Nvidia Corporation, "NVIDIA Tegra X1 Technical Reference Manual," https://developer.nvidia.com/embedded/tegra-2-reference.
[40] S. K. Venkata, I. Ahn, D. Jeon, A. Gupta, C. Louie, S. Garcia, S. Belongie, and M. B. Taylor, "SD-VBS: The San Diego Vision Benchmark Suite," in 2009 IEEE International Symposium on Workload Characterization (IISWC), Oct 2009, pp. 55–64.