Determinism, Complexity, and Predictability in Computer Performance

Joshua Garland∗, Ryan G. James‡ and Elizabeth Bradley∗†

∗Dept. of Computer Science, University of Colorado, Boulder, Colorado 80309-0430 USA. Email: [email protected]
†Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, New Mexico 87501 USA. Email: [email protected]
‡Complexity Sciences Center & Physics Dept., University of California, Davis, California 95616 USA. Email: [email protected]

arXiv:1305.5408v1 [nlin.CD] 23 May 2013

Abstract—Computers are deterministic dynamical systems [1]. Among other things, that implies that one should be able to use deterministic forecast rules to predict their behavior. That statement is sometimes—but not always—true. The memory and processor loads of some simple programs are easy to predict, for example, but those of more-complex programs like gcc are not. The goal of this paper is to determine why that is the case. We conjecture that, in practice, complexity can effectively overwhelm the predictive power of deterministic forecast models. To explore that, we build models of a number of performance traces from different programs running on different Intel-based computers. We then calculate the permutation entropy—a temporal entropy metric that uses ordinal analysis—of those traces and correlate those values against the prediction success.

I. INTRODUCTION

Computers are among the most complex engineered artifacts in current use. Modern microprocessor chips contain multiple processing units and multi-layer memories, for instance, and they use complicated hardware/software strategies to move data and threads of computation across those resources. These features—along with all the others that go into the design of these chips—make the patterns of their processor loads and memory accesses highly complex and hard to predict. Accurate forecasts of these quantities, if one could construct them, could be used to improve computer design. If one could predict that a particular computational thread would be bogged down for the next 0.6 seconds waiting for data from main memory, for instance, one could save power by putting that thread on hold for that time period (e.g., by migrating it to a processing unit whose clock speed is scaled back). Computer performance traces are, however, very complex. Even a simple “microkernel,” like a three-line loop that repeatedly initializes a matrix in column-major order, can produce chaotic performance traces [1], as shown in Figure 1, and chaos places fundamental limits on predictability.

[Fig. 1. A small snippet of the L2 cache miss rate of col_major, a three-line C program that repeatedly initializes a matrix in column-major order, running on an Intel Core Duo®-based machine. Even this simple program exhibits chaotic performance dynamics. Axes: cache misses vs. time (instructions × 100,000).]

The computer systems community has applied a variety of prediction strategies to traces like this, most of which employ regression. An appealing alternative builds on the recently established fact that computers can be effectively modeled as deterministic nonlinear dynamical systems [1]. This result implies the existence of a deterministic forecast rule for those dynamics.
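For concreteness, the column-major microkernel described above amounts to a loop of roughly the following form. This is an illustrative C sketch of our own; the matrix size, element type, and repetition count are assumptions, not the values used in [1].

#define N    2048      /* matrix dimension (assumed) */
#define REPS 1000      /* number of re-initialization passes (assumed) */

static double a[N][N];

int main(void) {
    /* Repeatedly initialize the matrix in column-major order.  Because C
     * stores arrays in row-major order, the inner loop strides
     * N * sizeof(double) bytes between successive writes, which is what
     * stresses the cache and drives the miss-rate dynamics of Figure 1. */
    for (int rep = 0; rep < REPS; rep++)
        for (int col = 0; col < N; col++)
            for (int row = 0; row < N; row++)
                a[row][col] = (double)(row + col);
    return (int)a[N - 1][N - 1];   /* keep the work from being optimized away */
}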
In particular, one can use delay-coordinate embedding to reconstruct the underlying dynamics of computer performance, then use the resulting model to forecast the future values of computer performance metrics such as memory or processor loads [2]. In the case of simple microkernels like the one that produced the trace in Figure 1, this deterministic modeling and forecast strategy works very well. In more-complicated programs, however, such as speech recognition software or compilers, this forecast strategy—as well as the traditional methods—breaks down quickly.

[Fig. 2. A 3D projection of a delay-coordinate embedding of the trace from Figure 1 with a delay (τ) of 100,000 instructions. Axes: cache misses(t), cache misses(t + τ), cache misses(t + 2τ).]

This paper is a first step in understanding when, why, and how deterministic forecast strategies fail when they are applied to deterministic systems. We focus here on the specific example of computer performance. We conjecture that the complexity of traces from these systems—which results from the inherent dimension, nonlinearity, and nonstationarity of the dynamics, as well as from measurement issues like noise, aggregation, and finite data length—can make those deterministic signals effectively unpredictable. We argue that permutation entropy [3], a method for measuring the entropy of a real-valued, finite-length time series through ordinal analysis, is an effective way to explore that conjecture. We study four examples—two simple microkernels and two complex programs from the SPEC benchmark suite—running on different Intel-based machines. For each program, we calculate the permutation entropy of the processor load (instructions per cycle) and memory-use efficiency (cache-miss rates), then compare it to the prediction accuracy attainable for that trace using a simple deterministic model.
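Because permutation entropy is central to the analysis that follows, a minimal sketch of the computation may help. The code below is our own illustration of the standard ordinal-analysis construction, not the implementation used in the paper; the pattern length m, the unit delay between pattern elements, and the normalization by log2(m!) are common conventions that we assume here.

#include <math.h>
#include <stdlib.h>

/* Number of permutations of m symbols (m is small, e.g. 3..6). */
static long factorial(int m) {
    long f = 1;
    for (int i = 2; i <= m; i++) f *= i;
    return f;
}

/* Lehmer-code index, in [0, m!), of the ordinal pattern of w[0..m-1].
 * Ties are broken by order of appearance. */
static long ordinal_index(const double *w, int m) {
    long idx = 0;
    for (int i = 0; i < m; i++) {
        int smaller = 0;
        for (int j = i + 1; j < m; j++)
            if (w[j] < w[i]) smaller++;
        idx = idx * (m - i) + smaller;
    }
    return idx;
}

/* Permutation entropy of x[0..n-1] with pattern length m and unit delay,
 * normalized by log2(m!) so the result lies in [0, 1]. */
double permutation_entropy(const double *x, long n, int m) {
    long nperm   = factorial(m);
    long windows = n - m + 1;
    long *count  = calloc((size_t)nperm, sizeof *count);

    for (long t = 0; t < windows; t++)
        count[ordinal_index(x + t, m)]++;          /* tally each ordinal pattern */

    double H = 0.0;
    for (long p = 0; p < nperm; p++) {
        if (count[p] == 0) continue;
        double prob = (double)count[p] / (double)windows;
        H -= prob * log2(prob);                    /* Shannon entropy of the pattern distribution */
    }
    free(count);
    return H / log2((double)nperm);
}

Values near 0 indicate that a few ordinal patterns dominate the trace, i.e., high redundancy and hence high potential predictability; values near 1 indicate that all patterns occur roughly equally often.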
II. MODELING COMPUTER PERFORMANCE

Delay-coordinate embedding allows one to reconstruct a system's full state-space dynamics from a single scalar time-series measurement, provided that some conditions hold regarding that data. Specifically, if the underlying dynamics and the measurement function—the mapping from the unknown state vector $\vec{X}$ to the scalar value $x$ that one is measuring—are both smooth and generic, Takens [4] formally proves that the delay-coordinate map

$$F(\tau, m)(x) = [\, x(t) \;\; x(t+\tau) \;\; \ldots \;\; x(t+m\tau) \,]$$

from a d-dimensional smooth compact manifold M to $\mathbb{R}^{2d+1}$, where t is time, is a diffeomorphism on M—in other words, that the reconstructed dynamics and the true (hidden) dynamics have the same topology.

This is an extremely powerful result: among other things, it means that one can build a formal model of the full system dynamics without measuring (or even knowing) every one of its state variables. This is the foundation of the modeling approach that is used in this paper. The first step in the process is to estimate values for the two free parameters in the delay-coordinate map: the delay τ and the dimension m. We follow standard procedures for this, choosing the first minimum of the average mutual information as an estimate of τ [5] and using the false-nearest-neighbor method of [6], with a threshold of 10%, to estimate m. A plot of the data from Figure 1, embedded following this procedure, is shown in Figure 2. The coordinates of each point on this plot are differently delayed elements of the col_major L2 cache miss rate time series y(t): that is, y(t) on the first axis, y(t + τ) on the second, y(t + 2τ) on the third, and so on. Structure in these kinds of plots—clearly visible in Figure 2—is an indication of determinism.¹ That structure can also be used to build a forecast model.

¹A deeper analysis of Figure 2—as alluded to above—supports that diagnosis, confirming the presence of a chaotic attractor in these cache-miss dynamics, with largest Lyapunov exponent λ₁ = 8000 ± 200 instructions, embedded in a 12-dimensional reconstruction space [1].

Given a nonlinear model of a deterministic dynamical system in the form of a delay-coordinate embedding like Figure 2, one can build deterministic forecast algorithms by capturing and exploiting the geometry of the embedding. Many techniques have been developed by the dynamical systems community for this purpose (e.g., [7], [8]). Perhaps the most straightforward is the “Lorenz method of analogues” (LMA), which is essentially nearest-neighbor prediction in the embedded state space [9]. Even this simple algorithm—which builds predictions by finding the nearest neighbor of the given point in the embedded space, then taking that neighbor's path as the forecast—works quite well on the trace in Figure 1, as shown in Figure 3. On the other hand, if we use the same strategy on traces from more-complex programs, the predictions are far less accurate.

[Fig. 3. LMA forecast of the trace in Figure 1. Axes: col_major cache misses vs. time (instructions × 100,000).]

Table I presents detailed results about the prediction accuracy of this algorithm on four different examples: the col_major and 482.sphinx3 programs in Figures 3 and 4, as well as another simple microkernel that initializes the same matrix as col_major, but in row-major order, and another complex program (403.gcc) from the SPEC cpu2006 benchmark suite. Both microkernels were run on the Intel Core Duo® machine; both SPEC benchmarks were run on the Intel i7® machine. We calculated a figure of merit for each prediction as follows. We held back the last k elements of the N points in each measured time series, built the forecast model by embedding the first N − k points, used that embedding and the LMA method to predict the next k points, then computed the root mean squared error (RMSE) between the true and predicted signals:

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{k} (c_i - \hat{p}_i)^2}{k}}$$

To compare the success of predictions across …
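To make the forecasting procedure concrete, the following C sketch treats the last k points of a scalar trace as the held-back test set, uses the remaining points as embedded training data, predicts each test point with the Lorenz method of analogues (nearest neighbor in delay space), and reports the RMSE defined above. This is our own illustration rather than the authors' implementation: the embedding parameters m and tau are assumed to have been estimated already (e.g., via average mutual information and false nearest neighbors), Euclidean distance is an assumed choice, and for simplicity the code makes one-step-ahead predictions from the true history rather than iterating its own forecasts over the full horizon.

#include <math.h>

/* Squared Euclidean distance between the m-dimensional delay vectors
 * ending at times s and t: (x[s], x[s-tau], ..., x[s-(m-1)tau]). */
static double dist2(const double *x, long s, long t, int m, long tau) {
    double d = 0.0;
    for (int j = 0; j < m; j++) {
        double diff = x[s - j * tau] - x[t - j * tau];
        d += diff * diff;
    }
    return d;
}

/* Forecast the last k points of x[0..n-1] with the Lorenz method of
 * analogues: for each point to be predicted, find the nearest training
 * delay vector to the most recent known delay vector and use that
 * neighbor's successor as the prediction.  Returns
 *   sqrt( sum_i (c_i - p_i)^2 / k )
 * between the true and predicted values.  Assumes n - k is large enough
 * that the training set contains at least one full delay vector. */
double lma_forecast_rmse(const double *x, long n, int m, long tau, long k) {
    long first     = (long)(m - 1) * tau;  /* earliest index with a full delay vector */
    long train_end = n - k;                /* x[0..train_end-1] is the training set   */
    double sse     = 0.0;

    for (long t = train_end; t < n; t++) {
        long   query  = t - 1;             /* most recent fully known time index */
        long   best   = first;
        double best_d = HUGE_VAL;
        /* Nearest neighbor among training vectors that have a successor. */
        for (long s = first; s < train_end - 1; s++) {
            double d = dist2(x, s, query, m, tau);
            if (d < best_d) { best_d = d; best = s; }
        }
        double pred = x[best + 1];         /* the analogue's next value */
        double err  = x[t] - pred;
        sse += err * err;
    }
    return sqrt(sse / (double)k);
}

A closer reproduction of the experiments described above would iterate the forecast k steps ahead, feeding each prediction back into the delay vector, and would re-estimate m and τ separately for each trace.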
