A Statistical Performance Model of the Opteron Processor
Total Page:16
File Type:pdf, Size:1020Kb
A Statistical Performance Model of the Opteron Processor Jeanine Cook Jonathan Cook Waleed Alkohlani New Mexico State University New Mexico State University New Mexico State University Klipsch School of Electrical Department of Computer Klipsch School of Electrical and Computer Engineering Science and Computer Engineering [email protected] [email protected] [email protected] ABSTRACT m5 [11], Simics/GEMS [25], and more recently MARSSx86 [5]{ Cycle-accurate simulation is the dominant methodology for the only one supporting the popular x86 architecture. Some processor design space analysis and performance prediction. of these have not been validated against real hardware in However, with the prevalence of multi-core, multi-threaded the multi-core mode (SESC and M5), and the others that architectures, this method has become highly impractical as are validated show large errors of up to 50% [5]. All cycle- the sole means for design due to its extreme slowdowns. We accurate multi-core simulators are very slow; a few hundred have developed a statistical technique for modeling multi- thousand simulated instructions per second is the most op- core processors that is based on Monte Carlo methods. Us- timistic speed with the help of native mode execution. Fur- ing this method, processor models of contemporary archi- ther, the architectures supported by many of these simula- tectures can be developed and applied to performance pre- tors are now obsolete. diction, bottleneck detection, and limited design space anal- ysis. To date, we have accurately modeled the IBM Cell, the To address these issues and to satisfy our own desire for Intel Itanium, and the Sun Niagara 1 and Niagara 2 proces- performance models of contemporary processors, we devel- sors [34, 33, 10]. In this paper, we present a work in progress oped a statistical modeling technique based on Monte Carlo which is applying this methodology to an out-of-order exe- methods that we believe can be applied to in-order and out- cution processor. We present the initial single-core model of-order execution, multi-core processors. We introduced and results for the AMD Barcelona (Opteron) processor. this method in [34, 33]; the method was presented in detail using the Niagara 2 model as an example in [10]. These 1. INTRODUCTION previous models were all of in-order execution processors. The complexity of contemporary processor architectures in The model presented here of the Opteron is the first we are combination with the large slowdowns of cycle-accurate sim- developing for an out-of-order execution architecture. ulators has given rise to the need for alternate performance analysis and prediction techniques that enable faster analy- Monte Carlo methods are a class of algorithms that rely on sis and identification of performance bottlenecks. As noted repeated random sampling to compute a result. Our pro- recently in a panel discussion on computer architecture re- cessor modeling technique probabilistically represents the search directions, \Intel cycle-accurate performance simula- executed instruction stream and randomly samples several tors currently run at roughly 10 thousand instructions per histograms that are related to stall conditions. Instructions second (KIPS) when simulating a unicore processor. At 10 proceed to specific delay centers (e.g., execution units, cache, KIPS, it will take more than nine days to simulate 1 second memory) based on probabilities. Cycles are counted for each of the eight-way multicore processor execution. Things will instruction according to the stall conditions and the latency only get worse as the number of simulated cores grows." [17]. of the specific delay center that executes the instruction. This time perturbation makes simulating a realistic work- The model converges and produces a result when it reaches load unrealistic! a threshold based on the difference in CPI between consec- utive intervals of a pre-defined size. All of the in-order exe- Another motivation for alternate performance analysis tech- cution models converge within seconds; the current Opteron niques is the lack of availability of simulators for modern model usually converges within seconds, but for some appli- multi-core architectures. A limited number of academically- cations, convergence takes minutes. available cycle-accurate simulators have various levels of multi- core simulation support, the most popular being SESC [31], New processor models using the Monte Carlo modeling tech- nique can be developed quickly. A basic model can be cre- ated in a few months; validating and re-developing for better accuracy takes another few months. The newer models (Ni- agara and Opteron) are coded in C++; the older models are written in C. The Opteron model is the largest at ap- proximately 2000 lines of code; all of the other models are implemented in hundreds of lines. In this paper, we present the initial Monte Carlo single-core processor model that we are developing for the Opteron, to extract a particular latency or to study a specific behav- Barcelona architecture. The Barcelona is a Family 10h ar- ior or micro-architecture feature. The data dependence his- chitecture; we model the micro-architecture core of Family tograms are sampled by the model to determine the latency 10h processors. We extended the methodology to enable of subsequent dependent instructions. modeling of out-of-order execution, which represents a sig- nificant extension in comparison to prior models with respect Data dependence information is collected according to which to the micro-architecture components modeled and overall type of data hazards can occur in the micro-architecture: model complexity. The Opteron model is parameterized and read-after-write (RAW), write-after-write (WAW), and write- generalized so that it is not just an attempt to model an ex- after-read (WAR). WAW and WAR hazards are typically re- isting processor, but can be configured to understand the solved through register renaming. Therefore, in most mod- performance tradeoffs when alternatives are evaluated. els, only true dependences (RAW) need to be accounted for. Our effort is not yet complete but is producing initial results. The data dependence information is represented as a dis- In this paper we present initial performance prediction re- tance in number of instructions between the operand pro- sults. Although we have performed some initial validation ducing and operand consuming instructions. For example, and model refinement, we are still working on areas where the distance between the producing and consuming instruc- the model is inaccurate, mostly due to iteratively under- tion in the following sequence is 3. standing the details of available documentation as they per- ldub [%i4 + %i2]; %i0 tain to actual processor behavior. The results are not final nop and, from experience, are expected to improve as validation nop is completed. or %i0; 2; %i0 2. MONTE CARLO PROCESSOR MODEL- These dependence distances are collected over the entire dy- ING (MCPM) TECHNIQUE namic execution of an application and are used to produce The Monte Carlo modeling methodology is a technique that a histogram. For the Opteron model, a histogram is gener- statistically models processor behavior and predicts the per- ated for each executed instruction; the histogram contains a formance of contemporary, single- and multi-core processors. distribution of the number of dynamic instructions that sep- Our statistical modeling technique is based on the Monte arate the producing instruction (e.g.,movq %rax, -8(%rbp)) Carlo method and uses both dynamic application and pro- and its consumer. Figure 1 shows the MOV64-to-use his- cessor characteristics as parameters to the model. Models are developed based on a particular existing processor and can subsequently be modified to model a different processor 4E+09 in the same family. 3.5E+09 3E+09 The models predict CPI (cycles per instruction) and use the equation: 2.5E+09 2E+09 CPI = CPIi + CPIs (1) 1.5E+09 # of Instrucons where CPIi is the intrinsic or ideal CPI based on the issue 1E+09 width, and CPI is the CPI due to stalls. The cause of s 500000000 stalls depends on the micro-architecture; data dependences, branch mis-predictions, and cache misses are stall causes 0 common to many processors. Stalls due to a full reorder 1 3 5 7 9 11 13 15 17 19 21 23 25 buffer, full reservation stations, and full scheduling queues Distance are also common in out-of-order execution processors. Once stall causes are identified, we collect data on their frequency using on-chip performance counters (branch mis-predictions, Figure 1: MOV64-to-use Histogram, astar cache misses, TLB misses) and binary instrumentation tools (data dependences). The cache and transition lookaside togram for the SPEC CPU2006 [4] INT benchmark, astar . buffer (TLB) hit/miss data is used to probabilistically deter- The figure shows that the instruction dependent on a pro- mine (realized as a transition probability) where in the cache ducing mov instruction will most likely be 4, 2, or 3 instruc- hierarchy a memory reference instruction is satisfied. The tions, respectively, following the producing instruction. branch mis-prediction rate is used to statistically determine whether a branch is predicted correctly or not. The dynamic instruction mix and any data associated with variable latency micro-architecture components is collected Latency characteristics for various processor components such using binary instrumentation. The instruction mix is used as caches, execution units (EUs), and branch predictors are to generate an instruction stream for the model that sta- also used in the model. These are primarily obtained from tistically represents the instruction stream generated by the the processor hardware manual and verified using micro- application. benchmarks. In cases where latencies are not given in the manual or when component latencies are variable, we use Monte Carlo processor models converge to a predicted CPI targeted micro-benchmarks to determine these numbers. Tar- based on a pre-defined threshold value.