Computational Caches
Amos Waterland (1), Elaine Angelino (1), Ekin D. Cubuk (1), Efthimios Kaxiras (2,1), Ryan P. Adams (1), Jonathan Appavoo (3), Margo Seltzer (1)

(1) School of Engineering and Applied Sciences, Harvard University
(2) Department of Physics, Harvard University
(3) Department of Computer Science, Boston University

ABSTRACT

Caching is a well-known technique for speeding up computation. We cache data from file systems and databases; we cache dynamically generated code blocks; we cache page translations in TLBs. We propose to cache the act of computation, so that we can apply it later and in different contexts. We use a state-space model of computation to support such caching, involving two interrelated parts: speculatively memoized predicted/resultant state pairs that we use to accelerate sequential computation, and trained probabilistic models that we use to generate predicted states from which to speculatively execute. The key techniques that make this approach feasible are designing probabilistic models that automatically focus on regions of program execution state space in which prediction is tractable, and identifying state space equivalence classes so that predictions need not be exact.

Categories and Subject Descriptors

C.5.1 [Large and Medium ("Mainframe") Computers]: Super (very large) computers; I.2.6 [Artificial Intelligence]: Learning - Connectionism and neural nets

General Terms

Performance, Theory

1. INTRODUCTION

Caching has been used for decades to trade space for time. File system and database caches use memory to store frequently used items to avoid costly I/O; code caches store dynamically instrumented code blocks to avoid costly reinstrumentation; TLBs cache page translations to avoid costly page table lookups. Caches are used both to speed up repeated accesses to static data and to avoid recomputation by storing the results of computation. However, computation is dynamic: an action that advances a state, such as that of a program plus associated data, to another state. We pose the question: can we cache the act of computation so that it can be applied repeatedly and in new contexts?

Conventional caches have a crucial property that we would like to replicate in a computational cache: the ability to load their contents at one time and place but use them repeatedly at different times and places. There are several ways to exploit a computational cache having this property: we could load the cache and then use it later on the same machine, or we could load the cache using large, powerful computers and stream it to smaller, more resource-constrained computers where the cost of a cache lookup is less than the cost of executing the computation.

Our computational cache design arises from a model of computation that we originally developed to extract parallelism from sequential programs [30]. In this model, computation is expressed as a walk through a high-dimensional state space describing the registers and memory; we call such a walk a trajectory. The model is practical enough that we have built a prototype x86 virtual machine, and in current work we have successfully used it to automatically parallelize programs with obvious structure. The virtual machine embeds learning, prediction, and parallel speculative execution in its execution loop; we call this trajectory-based execution.

Computational cache entries are stored pieces of observed trajectories. The entries can be pieces of trajectories we encountered during normal execution or during speculative execution. In either case, we can think of these entries as begin-state and end-state pairs: any time the virtual machine finds itself in a state matching the begin-state of some cache entry, it can immediately speed up execution by using the cache entry to jump to the corresponding end-state and then continue normal execution.

In the rest of this paper we summarize our model of computation, present the specifics of our computational cache, address the dual challenges of representing cache entries efficiently and allowing each entry to be reused in as many different ways as possible, analyze the relationship between the predictive accuracy of a cache's models and the resultant speedup, and report experimental results.
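As a concrete illustration of the begin-state/end-state idea, the following is a minimal sketch in C of an exact-match computational cache, keyed by a hash of the whole state vector. All names (state_t, cache_entry, cache_fill, cache_apply) and sizes are our own illustrative choices, not the prototype's actual interface, and Section 2 relaxes the exact-match requirement.

```c
#include <stdint.h>
#include <string.h>

#define STATE_BYTES 4096   /* illustrative; a real x86 state vector is far larger */
#define CACHE_SLOTS 1024

typedef struct { uint8_t bytes[STATE_BYTES]; } state_t;

/* A computational cache entry: two states known to lie on one trajectory. */
typedef struct {
    int     valid;
    state_t begin;   /* state in which the entry applies        */
    state_t end;     /* state reached after the cached sequence */
} cache_entry;

static cache_entry cache[CACHE_SLOTS];

static uint64_t hash_state(const state_t *s) {
    uint64_t h = 5381;                       /* djb2-style hash over the vector */
    for (size_t i = 0; i < STATE_BYTES; i++)
        h = h * 33 + s->bytes[i];
    return h;
}

/* Record a begin/end pair observed during normal or speculative execution. */
void cache_fill(const state_t *begin, const state_t *end) {
    cache_entry *e = &cache[hash_state(begin) % CACHE_SLOTS];
    e->begin = *begin;
    e->end   = *end;
    e->valid = 1;
}

/* On a hit, jump straight to the end state, fast-forwarding past the
 * entire cached sequence; on a miss, report failure so the caller can
 * execute normally. */
int cache_apply(state_t *cur) {
    cache_entry *e = &cache[hash_state(cur) % CACHE_SLOTS];
    if (e->valid && memcmp(&e->begin, cur, sizeof *cur) == 0) {
        *cur = e->end;
        return 1;
    }
    return 0;
}
```

Because such a table is just data, it can be filled on one machine and shipped to another, which is the load-once, use-elsewhere property described above.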
2. MODEL OF COMPUTATION

The two key ideas from trajectory-based execution that give rise to computational caching are: representing the state of computation as a state vector, and representing the execution of an instruction as applying a fixed transition function. A state vector contains all information relevant to the future of the computation. A transition function depends only on an input state vector to deterministically produce an output state vector.

Consider the memory and registers of a conventional computer as defining a very large state space. Applying the transition function executes one instruction, deterministically moving the system from one state vector to the next state vector. Program execution thus traces out a trajectory through the state space. For the moment, we exclude external non-determinism and I/O: we assume that execution begins after all input data and the program have been loaded into memory. Programs can still use a pseudo-random number generator; we simply require that it be seeded deterministically. Given these constraints, execution is a memoryless process: each state depends only upon the previous state. This means that any pair of state vectors known to lie on the same trajectory serves as a compressed representation of all computation that took place in between; our computational cache design arises from this fact.
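To make this vocabulary concrete, here is a toy sketch of our own devising, not the prototype's machinery: the state vector is reduced to a program counter plus a small memory, and step() plays the role of the transition function over a made-up one-instruction ISA.

```c
#include <stdint.h>

/* Illustrative state vector: a program counter and a tiny memory. A real
 * state vector covers all of the machine's registers and memory. */
typedef struct { uint64_t pc; uint8_t mem[64]; } vmstate;

/* The transition function: execute one "instruction", deterministically
 * mapping a state to its unique successor. */
static vmstate step(vmstate s) {
    s.mem[s.pc % sizeof s.mem] += 1;   /* toy instruction semantics */
    s.pc += 1;
    return s;
}

/* Execution is a memoryless walk through state space: each state depends
 * only on the previous state, so the trajectory is fully determined by
 * the initial state vector. */
static vmstate run(vmstate s, long n) {
    while (n-- > 0)
        s = step(s);
    return s;
}
```

Because run(s, n) is fully determined by s, the pair (s, run(s, n)) stands in for all n intermediate steps, which is exactly what a computational cache entry stores.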
Our prototype x86 virtual machine is a parallel program that has hard-wired into its execution loop the tasks of (1) learning patterns from the observed state vector trace of the program it is executing, (2) issuing predictions for future states of the program, (3) executing the program speculatively based on the predictions, (4) storing the speculative execution results in our computational cache, and (5) using the computational cache to speed up execution. The virtual machine smoothly increases or decreases the scale of these tasks as a function of the raw compute, memory, and communications resources available to it. More raw resources cause the virtual machine to query and fill a larger computational cache, learn more complex models, predict further into the future, and execute more speculations in parallel.

As aggressive as these ideas sound, we have previously demonstrated that we can use them in limited cases to automatically speed up certain classes of binary programs [30]. This paper addresses a broader question: how can we go beyond speeding up a single computation to reusing work performed in the context of one program to improve the performance of a different program? We examine two ways to reuse and generalize information gained from executing a program.

[...] how long the sequence. The two vectors are themselves often highly compressible, so in practice a billion-instruction sequence of megabit state vectors often results in just a kilobit cache entry.

Our execution loop continually queries our cache for a pair whose first vector matches our current vector. If the cache hits, we immediately transform our current vector into the second entry of the pair, speeding up computation by fast-forwarding through potentially billions of instructions. If the cache misses, we execute one instruction by applying the transition function as normal, then try again. Sequences of state vectors usually have many symmetries, so our current vector need only match along the causal coordinates of a stored pair in order to get a cache hit. Matching is conservative: we tolerate false negatives in order to guarantee that we will never have false positives. Cache hits speed up execution but produce the exact same computation as normal, and cache misses just result in normal execution with query overhead.

There are characteristics of both execution and the state space that we exploit in our computational cache: trajectories are not random, each instruction modifies only small portions of the state, and even long sequences of instructions frequently depend upon only small portions of the state. Programs are frequently constructed out of components: functions, library routines, loops, etc. Within each one of these constructs, execution typically depends on only a small portion of the space and modifies only a small portion of the space. We'll use these observations to construct useful entries in a computational cache.

Let's begin by identifying some useful characteristics of a cache entry and then move on to techniques for producing such entries. First, an entry has to encapsulate enough computation that the benefit of using the entry to fast-forward through a certain minimum number of instructions outweighs the cost of looking it up. We do not yet have a comprehensive cost model for the trade-off between caching a sequence of computation and regenerating that sequence [1], but in our prototype virtual machine we find that entries must comprise a pair of state vectors separated by at least [...]
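A sketch of the conservative, causal-coordinate matching and fast-forwarding described above appears below. The read/write masks standing in for causal coordinates are our own simplification of the symmetry and equivalence-class machinery, the linear scan stands in for whatever indexing a real cache would use, and transition_fn is assumed to be a deterministic single-instruction step like the toy one in the previous sketch.

```c
#include <stdint.h>
#include <stddef.h>

#define STATE_BYTES 4096                     /* illustrative size */

typedef struct { uint8_t bytes[STATE_BYTES]; } state_t;
typedef state_t (*transition_fn)(state_t);   /* executes one instruction */

/* An entry matches along its causal coordinates only: bytes the cached
 * sequence never read are "don't care", so one entry serves a whole
 * equivalence class of begin states. */
typedef struct {
    state_t begin;                    /* causal coordinate values          */
    state_t end;                      /* state after the cached sequence   */
    uint8_t read_mask[STATE_BYTES];   /* 1 = byte the sequence depended on */
    uint8_t write_mask[STATE_BYTES];  /* 1 = byte the sequence modified    */
} masked_entry;

/* Conservative match: every byte the sequence read must agree exactly, so
 * a hit can never alter the computation (no false positives); a missing
 * or stale entry is merely a false negative and costs only query time. */
static int matches(const masked_entry *e, const state_t *cur) {
    for (size_t i = 0; i < STATE_BYTES; i++)
        if (e->read_mask[i] && e->begin.bytes[i] != cur->bytes[i])
            return 0;
    return 1;
}

/* Fast-forward: apply only the bytes the cached sequence wrote, leaving
 * the rest of the current state untouched. */
static void fast_forward(const masked_entry *e, state_t *cur) {
    for (size_t i = 0; i < STATE_BYTES; i++)
        if (e->write_mask[i])
            cur->bytes[i] = e->end.bytes[i];
}

/* The execution loop: query the cache first; on a hit, jump potentially
 * billions of instructions ahead; on a miss, take one normal step with
 * the transition function and try again. */
void run(state_t *s, const masked_entry *cache, size_t n_entries,
         transition_fn step, long budget) {
    while (budget-- > 0) {
        int hit = 0;
        for (size_t i = 0; i < n_entries && !hit; i++) {
            if (matches(&cache[i], s)) {
                fast_forward(&cache[i], s);
                hit = 1;
            }
        }
        if (!hit)
            *s = step(*s);
    }
}
```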