
Determinism, Complexity, and Predictability in Computer Performance

Joshua Garland∗, Ryan G. James‡ and Elizabeth Bradley∗†
∗Dept. of Computer Science, University of Colorado, Boulder, Colorado 80309-0430 USA. Email: [email protected]
†Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, New Mexico 87501 USA. Email: [email protected]
‡Complexity Sciences Center & Physics Dept., University of California, Davis, California 95616 USA. Email: [email protected]

Abstract—Computers are deterministic dynamical systems [1]. Among other things, that implies that one should be able to use deterministic forecast rules to predict their behavior. That statement is sometimes—but not always—true. The memory and processor loads of some simple programs are easy to predict, for example, but those of more-complex programs like gcc are not. The goal of this paper is to determine why that is the case. We conjecture that, in practice, complexity can effectively overwhelm the predictive power of deterministic forecast models. To explore that, we build models of a number of performance traces from different programs running on different Intel-based computers. We then calculate the permutation entropy—a temporal entropy metric that uses ordinal analysis—of those traces and correlate those values against the prediction success.

I. INTRODUCTION

Computers are among the most complex engineered artifacts in current use. Modern microprocessor chips contain multiple processing units and multi-layer memories, for instance, and they use complicated hardware/software strategies to move data and threads of computation across those resources. These features—along with all the others that go into the design of these chips—make the patterns of their processor loads and memory accesses highly complex and hard to predict. Accurate forecasts of these quantities, if one could construct them, could be used to improve computer design. If one could predict that a particular computational thread would be bogged down for the next 0.6 seconds waiting for data from main memory, for instance, one could save power by putting that thread on hold for that time period (e.g., by migrating it to a processing unit whose clock speed is scaled back).

Computer performance traces are, however, very complex. Even a simple “microkernel,” like a three-line loop that repeatedly initializes a matrix in column-major order, can produce chaotic performance traces [1], as shown in Figure 1, and chaos places fundamental limits on predictability.

Fig. 1. A small snippet of the L2 cache miss rate of col_major, a three-line C program that repeatedly initializes a matrix in column-major order, running on an Intel Core Duo®-based machine. Even this simple program exhibits chaotic performance dynamics. [Axes: cache misses vs. time (instructions × 100,000).]
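The paper does not reproduce the microkernel's source, but a minimal C sketch of what such a three-line loop could look like is given below; the matrix size N and the repetition count are our own arbitrary choices, not values from the paper.

    #include <stddef.h>

    #define N 2048                   /* matrix dimension: hypothetical */
    static double a[N][N];

    int main(void) {
        /* Repeatedly initialize the matrix in column-major order.
           C stores arrays row-major, so successive writes land
           N doubles apart in memory: a cache-hostile stride. */
        for (long rep = 0; rep < 1000; rep++)
            for (size_t col = 0; col < N; col++)
                for (size_t row = 0; row < N; row++)
                    a[row][col] = 0.0;
        return 0;
    }

The row_major microkernel used for comparison later in the paper would differ only in the order of the two inner loops, which makes the writes sequential in memory.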

The computer systems community has applied a variety of prediction strategies to traces like this, most of which employ regression. An appealing alternative builds on the recently established fact that computers can be effectively modeled as deterministic nonlinear dynamical systems [1]. This result implies the existence of a deterministic forecast rule for those dynamics. In particular, one can use delay-coordinate embedding to reconstruct the underlying dynamics of computer performance, then use the resulting model to forecast the future values of computer performance metrics such as memory or processor loads [2]. In the case of simple microkernels like the one that produced the trace in Figure 1, this deterministic modeling and forecast strategy works very well. In more-complicated programs, however, such as speech recognition software or compilers, this forecast strategy—as well as the traditional methods—breaks down quickly.

This paper is a first step in understanding when, why, and how deterministic forecast strategies fail when they are applied to deterministic systems. We focus here on the specific example of computer performance. We conjecture that the complexity of traces from these systems—which results from the inherent dimension, nonlinearity, and nonstationarity of the dynamics, as well as from measurement issues like noise, aggregation, and finite data length—can make those deterministic signals effectively unpredictable. We argue that permutation entropy [3], a method for measuring the entropy of a real-valued, finite-length time series through ordinal analysis, is an effective way to explore that conjecture. We study four examples—two simple microkernels and two complex programs from the SPEC benchmark suite—running on different Intel-based machines. For each program, we calculate the permutation entropy of the processor load (instructions per cycle) and memory-use efficiency (cache-miss rates), then compare that to the prediction accuracy attainable for that trace using a simple deterministic model.

II. MODELING COMPUTER PERFORMANCE

Delay-coordinate embedding allows one to reconstruct a system's full state-space dynamics from a single scalar time-series measurement—provided that some conditions hold regarding that data. Specifically, if the underlying dynamics and the measurement function—the mapping from the unknown state vector $\vec{X}$ to the scalar value $x$ that one is measuring—are both smooth and generic, Takens [4] formally proves that the delay-coordinate map

$$F(\tau, m)(x(t)) = [x(t),\ x(t+\tau),\ \ldots,\ x(t+m\tau)]$$

from a $d$-dimensional smooth compact manifold $M$ to $\mathbb{R}^{2d+1}$, where $t$ is time, is a diffeomorphism on $M$—in other words, that the reconstructed dynamics and the true (hidden) dynamics have the same topology.

This is an extremely powerful result: among other things, it means that one can build a formal model of the full dynamics without measuring (or even knowing) every one of its state variables. This is the foundation of the modeling approach that is used in this paper. The first step in the process is to estimate values for the two free parameters in the delay-coordinate map: the delay τ and the dimension m. We follow standard procedures for this, choosing the first minimum in the average mutual information as an estimate of τ [5] and using the false near(est) neighbor method of [6], with a threshold of 10%, to estimate m. A plot of the data from Figure 1, embedded following this procedure, is shown in Figure 2. The coordinates of each point on this plot are differently delayed elements of the col_major L2 cache miss rate time series y(t): that is, y(t) on the first axis, y(t + τ) on the second, y(t + 2τ) on the third, and so on. Structure in these kinds of plots—clearly visible in Figure 2—is an indication of determinism¹. That structure can also be used to build a forecast model.

Fig. 2. A 3D projection of a delay-coordinate embedding of the trace from Figure 1 with a delay (τ) of 100,000 instructions. [Axes: cache misses (t), cache misses (t + τ), cache misses (t + 2τ).]

¹A deeper analysis of Figure 2—as alluded to above—supports that diagnosis, confirming the presence of a chaotic attractor in these cache-miss dynamics, with largest Lyapunov exponent λ₁ = 8000 ± 200 instructions, embedded in a 12-dimensional reconstruction space [1].
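Concretely, the reconstruction step amounts to sliding a window of delayed samples along the scalar series. The following C sketch is our own illustration, not the authors' code; it assumes τ and m have already been estimated as described above.

    #include <stddef.h>

    /* Delay-coordinate embedding: map a scalar series y[0..len-1]
       into (m+1)-dimensional points [y(t), y(t+tau), ..., y(t+m*tau)],
       stored row-major in out (which must hold (len - m*tau)*(m+1)
       doubles). Returns the number of embedded points. */
    size_t embed(const double *y, size_t len, size_t m, size_t tau,
                 double *out) {
        size_t dim = m + 1;              /* coordinates per point */
        size_t npts = len - m * tau;     /* last start with a full window */
        for (size_t t = 0; t < npts; t++)
            for (size_t j = 0; j < dim; j++)
                out[t * dim + j] = y[t + j * tau];
        return npts;
    }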
Given a nonlinear model of a deterministic dynamical system in the form of a delay-coordinate embedding like Figure 2, one can build deterministic forecast models by capturing and exploiting the geometry of the embedding. Many techniques have been developed by the dynamical systems community for this purpose (e.g., [7], [8]). Perhaps the most straightforward is the “Lorenz method of analogues” (LMA), which is essentially nearest-neighbor prediction in the embedded state space [9]. Even this simple algorithm—which builds predictions by finding the nearest neighbor in the embedded space of the given point, then taking that neighbor's path as the forecast—works quite well on the trace in Figure 1, as shown in Figure 3.

Fig. 3. A forecast of the last 4,000 points of the signal in Figure 1 using an LMA-based strategy on the embedding in Figure 2. Red circles and blue ×s are the true and predicted values, respectively; vertical bars show where these values differ. [Axes: col_major cache misses vs. time (instructions × 100,000).]

On the other hand, if we use the same approach to forecast the processor load² of the 482.sphinx3 program from the SPEC cpu2006 benchmark suite, running on an Intel i7®-based machine, the prediction is far less accurate; see Figure 4.

Fig. 4. An LMA-based forecast of the last 4,000 points of a processor-load performance trace from the 482.sphinx3 benchmark. Red circles and blue ×s are the true and predicted values, respectively; vertical bars show where these values differ. [Axes: 482.sphinx3 IPC vs. time (instructions × 100,000).]

²Instructions per cycle, or IPC.
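In code, a single LMA step reduces to a nearest-neighbor search over the embedded points. This sketch assumes the point layout produced by embed() above; a production implementation would use a spatial index such as a k-d tree rather than a linear scan.

    #include <float.h>
    #include <stddef.h>

    /* One LMA forecast step: find the nearest neighbor of the most
       recent embedded point among all earlier points and take that
       neighbor's successor as the prediction. pts holds npts points
       of dimension dim, row-major. Returns the predicted scalar
       (the last coordinate of the successor point). */
    double lma_predict(const double *pts, size_t npts, size_t dim) {
        const double *query = &pts[(npts - 1) * dim];
        size_t best = 0;
        double bestd = DBL_MAX;
        for (size_t i = 0; i + 1 < npts; i++) {  /* successor must exist */
            double d = 0.0;
            for (size_t j = 0; j < dim; j++) {
                double diff = pts[i * dim + j] - query[j];
                d += diff * diff;
            }
            if (d < bestd) { bestd = d; best = i; }
        }
        return pts[(best + 1) * dim + (dim - 1)];
    }

Following more of the neighbor's forward path, or iterating this one-step rule on the growing trace, extends the forecast to the multi-point horizons used below.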

Table I presents detailed results about the prediction accuracy of this strategy on four different examples: the col_major and 482.sphinx3 programs in Figures 3 and 4, as well as another simple microkernel that initializes the same matrix as col_major, but in row-major order, and another complex program (403.gcc) from the SPEC cpu2006 benchmark suite. Both microkernels were run on the Intel Core Duo® machine; both SPEC benchmarks were run on the Intel i7® machine. We calculated a figure of merit for each prediction as follows. We held back the last k³ elements of the N points in each measured time series, built the forecast model by embedding the first N − k points, used that embedding and the LMA method to predict the next k points, then computed the Root Mean Squared Error (RMSE) between the true and predicted signals:

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{k} (c_i - \hat{p}_i)^2}{k}}$$

To compare the success of predictions across signals with different units, we normalized RMSE as follows:

$$\mathrm{nRMSE} = \frac{\mathrm{RMSE}}{X_{\max,obs} - X_{\min,obs}}$$

³Several different prediction horizons were analyzed in our experiment; the results reported in this paper are for k = 4000.
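Both formulas transcribe directly into code; the helper below uses our own naming, with the observed extrema passed in as arguments.

    #include <math.h>
    #include <stddef.h>

    /* Normalized RMSE between true values c[0..k-1] and predictions
       p[0..k-1]; xmax and xmin are the extrema of the observed signal,
       so the result is unitless and comparable across traces. */
    double nrmse(const double *c, const double *p, size_t k,
                 double xmax, double xmin) {
        double sum = 0.0;
        for (size_t i = 0; i < k; i++) {
            double e = c[i] - p[i];
            sum += e * e;
        }
        return sqrt(sum / (double)k) / (xmax - xmin);
    }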

TABLE I
NORMALIZED ROOT MEAN SQUARED ERROR (NRMSE) OF 4000-POINT PREDICTIONS OF MEMORY & PROCESSOR PERFORMANCE FROM DIFFERENT PROGRAMS

                 cache miss rate   instrs per cycle
    row_major         0.0324            0.0778
    col_major         0.0080            0.0161
    403.gcc           0.1416            0.2033
    482.sphinx3       0.2032            0.3670

The results in Table I show a clear distinction between the two microkernels, whose future behavior can be predicted effectively using this simple deterministic modeling strategy, and the more-complex SPEC benchmarks, for which this prediction strategy does not work nearly as well. This begs the question: If these traces all come from deterministic systems—computers—then why are they not equally predictable? Our conjecture is that the sheer complexity of the dynamics of the SPEC benchmarks running on the Intel i7® machine makes them effectively impossible to predict.

III. MEASURING COMPLEXITY

For the purposes of this paper, one can view entropy as a measure of complexity and predictability in a time series. A high-entropy time series is almost completely unpredictable—and conversely. This can be made more rigorous: Pesin's relation [10] states that in chaotic dynamical systems, the Shannon entropy rate is equal to the sum of the positive Lyapunov exponents, λᵢ. The Lyapunov exponents directly quantify the rate at which nearby states of the system will diverge with time: $|\Delta x(t)| \approx e^{\lambda t} |\Delta x(0)|$. The faster the divergence, the more difficult prediction becomes.

Utilizing entropy as a measure of temporal complexity is by no means a new idea [11], [12]. Its effective usage requires categorical data: $x_t \in S$ for some finite or countably infinite alphabet $S$—and data taken from real-world systems are effectively real-valued. To get around this, one must discretize the data, typically by binning. Unfortunately, this is rarely a good solution to the problem, as the binning of the values introduces an additional dynamic on top of the intrinsic dynamics whose entropy is desired. The field of symbolic dynamics studies how to discretize a time series in such a way that the intrinsic behavior is not perverted, but these methods are fragile in the face of noise and require further understanding of the underlying system, which defeats the purpose of measuring the entropy in the first place.

Bandt and Pompe introduced permutation entropy (PE) as a “natural complexity measure for time series” [3]. Permutation entropy employs a method of discretizing real-valued time series that follows the intrinsic behavior of the system under examination. Rather than looking at the statistics of sequences of values, as is done when computing the Shannon entropy, permutation entropy looks at the statistics of the orderings of sequences of values using ordinal analysis. Ordinal analysis of a time series is the process of mapping successive time-ordered elements of a time series to their value-ordered permutation of the same size. By way of example, if $(x_1, x_2, x_3) = (9, 1, 7)$, then its ordinal pattern, $\phi(x_1, x_2, x_3)$, is 231, since $x_2 \le x_3 \le x_1$. This method has many features; among other things, it is robust to noise and requires no knowledge of the underlying mechanisms.

Definition (Permutation Entropy). Given a time series $\{x_t\}_{t=1,\ldots,T}$, define $S_n$ as the set of all $n!$ permutations $\pi$ of order $n$. For each $\pi \in S_n$, determine the relative frequency of that permutation occurring in $\{x_t\}_{t=1,\ldots,T}$:

$$p(\pi) = \frac{\left|\{\, t \mid t \le T-n,\ \phi(x_{t+1}, \ldots, x_{t+n}) = \pi \,\}\right|}{T-n+1},$$

where $|\cdot|$ denotes set cardinality. The permutation entropy of order $n \ge 2$ is then defined as

$$H(n) = -\sum_{\pi \in S_n} p(\pi) \log_2 p(\pi).$$

Notice that $0 \le H(n) \le \log_2(n!)$ [3]. With this in mind, it is common in the literature to normalize permutation entropy as $H(n)/\log_2(n!)$. With this convention, “low” entropy is close to 0 and “high” entropy is close to 1. Finally, it should be noted that permutation entropy has been shown to be identical to the Shannon entropy for many large classes of systems [13].
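This definition transcribes into a short routine. In the sketch below (ours, not the authors' code), each window's ordinal pattern is mapped to an integer via its Lehmer code, and ties are broken by order of appearance, consistent with the ≤ convention in the example above.

    #include <math.h>
    #include <stdlib.h>

    /* Index of the ordinal pattern of w[0..n-1] in [0, n!-1]: for each
       position, count how many later values are strictly smaller, and
       accumulate those counts as factorial-base digits (Lehmer code). */
    static size_t ordinal_index(const double *w, size_t n) {
        size_t idx = 0;
        for (size_t i = 0; i < n; i++) {
            size_t rank = 0;
            for (size_t j = i + 1; j < n; j++)
                if (w[j] < w[i]) rank++;
            idx = idx * (n - i) + rank;
        }
        return idx;
    }

    /* Permutation entropy of order n for x[0..T-1], normalized by
       log2(n!): 0 for a perfectly ordered series, near 1 for white
       noise. Patterns that never occur (forbidden ordinals) contribute
       nothing to the sum. */
    double permutation_entropy(const double *x, size_t T, size_t n) {
        size_t nfact = 1;
        for (size_t i = 2; i <= n; i++) nfact *= i;
        size_t *count = calloc(nfact, sizeof *count);
        size_t total = T - n + 1;            /* number of windows */
        for (size_t t = 0; t < total; t++)
            count[ordinal_index(&x[t], n)]++;
        double h = 0.0;
        for (size_t k = 0; k < nfact; k++) {
            if (count[k] == 0) continue;
            double pk = (double)count[k] / (double)total;
            h -= pk * log2(pk);
        }
        free(count);
        return h / log2((double)nfact);
    }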
In practice, calculating permutation entropy involves choosing a good value for the wordlength n. The key consideration here is that the value be large enough that forbidden ordinals are discovered, yet small enough that reasonable statistics over the ordinals are gathered: e.g., $n = \operatorname{argmax}_{\ell} \{ T > 100\,\ell! \}$, assuming an average of 100 counts per ordinal. In the literature, 3 ≤ n ≤ 6 is a standard choice—generally without any formal justification. In theory, the permutation entropy should reach an asymptote with increasing n, but that requires an arbitrarily long time series. In practice, what one should do is calculate the persistent permutation entropy by increasing n until the result converges, but data length issues can intrude before that convergence is reached.
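Under the 100-counts-per-ordinal rule of thumb above, the choice of n can be mechanized as in this sketch (ours, and it encodes only that heuristic; it assumes roughly uniform coverage of the ordinals):

    #include <stddef.h>

    /* Largest wordlength n such that a series of length T could cover
       all n! ordinal patterns about 100 times each, i.e. the heuristic
       n = argmax_l { T > 100 * l! }. */
    size_t choose_wordlength(size_t T) {
        size_t n = 2;
        size_t fact = 2;                      /* 2! */
        while (100 * fact * (n + 1) < T) {    /* does (n+1)! still fit? */
            n++;
            fact *= n;
        }
        return n;
    }

For example, choose_wordlength(100000) returns 6, consistent with the 3 ≤ n ≤ 6 range reported in Table II.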
Table II shows the permutation entropy results for the examples considered in this paper, with the nRMSE prediction accuracies from the previous section included alongside for easy comparison. The relationship between prediction accuracy and the permutation entropy (PE) is as we conjectured: performance traces with high PE—those whose temporal complexity is high, in the sense that little information is being propagated forward in time—are indeed harder to predict using the simple deterministic forecast model described in the previous section. The effects of changing n are also interesting: using a longer wordlength generally lowers the PE—a natural consequence of finite-length data—but the falloff is less rapid in some traces than in others, suggesting that those values are closer to the theoretical asymptote that exists for perfect data. The persistent PE values of 0.5–0.6 for the row_major and col_major cache-miss traces are consistent with dynamical chaos, further corroborating the results of [1]. (PE values above 0.97 are consistent with white noise.) Interestingly, the processor-load traces for these two microkernels exhibit more temporal complexity than the cache-miss traces. This may be a consequence of the lower baseline value of this time series.

TABLE II
PREDICTION ERROR (IN NRMSE) AND PERMUTATION ENTROPY FOR DIFFERENT WORDLENGTHS n

    cache misses      error     n = 4     n = 5     n = 6
    row_major         0.0324    0.6751    0.5458    0.4491
    col_major         0.0080    0.5029    0.4515    0.3955
    403.gcc           0.1416    0.9916    0.9880    0.9835
    482.sphinx3       0.2032    0.9913    0.9866    0.9802

    instrs per cycle  error     n = 4     n = 5     n = 6
    row_major         0.0778    0.9723    0.9354    0.8876
    col_major         0.0161    0.8356    0.7601    0.6880
    403.gcc           0.2033    0.9862    0.9814    0.9764
    482.sphinx3       0.3670    0.9951    0.9914    0.9849

IV. CONCLUSIONS & FUTURE WORK

The results presented here suggest that permutation entropy—an ordinal calculation of forward information transfer in a time series—is an effective predictability metric for computer performance traces. Experimentally, traces with a persistent PE ≳ 0.97 have a natural level of complexity that may overshadow the inherent determinism in the system dynamics, whereas traces with PE ≲ 0.7 seem to be highly predictable (viz., at least an order of magnitude improvement in nRMSE).

If information is the limit, then gathering and using more information is an obvious next step. There is an equally obvious tension here between data length and prediction speed: a forecast that requires half a second to compute is not useful for the purposes of real-time control of a computer system with a MHz clock rate. Another alternative is to sample several system variables simultaneously and build multivariate delay-coordinate embeddings. Existing approaches to that are computationally prohibitive [14]. We are working on alternative methods that sidestep that complexity.

ACKNOWLEDGMENT

This work was partially supported by NSF grant #CMMI-1245947 and ARO grant #W911NF-12-1-0288.

REFERENCES

[1] T. Mytkowicz, A. Diwan, and E. Bradley, “Computers are dynamical systems,” Chaos, vol. 19, p. 033124, 2009, doi:10.1063/1.3187791.
[2] J. Garland and E. Bradley, “Predicting computer performance dynamics,” in Proceedings of the 10th International Conference on Advances in Intelligent Data Analysis X, Berlin, Heidelberg, 2011, pp. 173–184.
[3] C. Bandt and B. Pompe, “Permutation entropy: A natural complexity measure for time series,” Physical Review Letters, vol. 88, no. 17, p. 174102, 2002.
[4] F. Takens, “Detecting strange attractors in fluid turbulence,” in Dynamical Systems and Turbulence, D. Rand and L.-S. Young, Eds. Berlin: Springer, 1981, pp. 366–381.
[5] A. Fraser and H. Swinney, “Independent coordinates for strange attractors from mutual information,” Physical Review A, vol. 33, no. 2, pp. 1134–1140, 1986.
[6] M. B. Kennel, R. Brown, and H. D. I. Abarbanel, “Determining minimum embedding dimension using a geometrical construction,” Physical Review A, vol. 45, pp. 3403–3411, 1992.
[7] M. Casdagli and S. Eubank, Eds., Nonlinear Modeling and Forecasting. Addison-Wesley, 1992.
[8] A. Weigend and N. Gershenfeld, Eds., Time Series Prediction: Forecasting the Future and Understanding the Past. Addison-Wesley, 1993.
[9] E. N. Lorenz, “Atmospheric predictability as revealed by naturally occurring analogues,” Journal of the Atmospheric Sciences, vol. 26, pp. 636–646, 1969.
[10] Y. B. Pesin, “Characteristic Lyapunov exponents and smooth ergodic theory,” Russian Mathematical Surveys, vol. 32, no. 4, p. 55, 1977.
[11] C. E. Shannon, “Prediction and entropy of printed English,” Bell Systems Technical Journal, vol. 30, pp. 50–64, 1951.
[12] R. Mantegna, S. Buldyrev, A. Goldberger, S. Havlin, C. Peng, M. Simons, and H. Stanley, “Linguistic features of noncoding DNA sequences,” Physical Review Letters, vol. 73, no. 23, pp. 3169–3172, 1994.
[13] J. Amigó, Permutation Complexity in Dynamical Systems: Ordinal Patterns, Permutation Entropy and All That. Springer, 2012.
[14] L. Cao, A. Mees, and K. Judd, “Dynamics from multivariate time series,” Physica D, vol. 121, pp. 75–88, 1998.