Precise and Accurate Processor Simulation

Harold W. Cain, Kevin M. Lepak, Brandon A. Schwartz, and Mikko H. Lipasti

Computer Sciences Department, University of Wisconsin, 1210 W. Dayton Street, Madison, WI 53706
{cain,baschwar}@cs.wisc.edu

Electrical and Computer Engineering, University of Wisconsin, 1415 Engineering Drive, Madison, WI 53706
{lepak,mikko}@ece.wisc.edu

Abstract

Precise and accurate simulation of processors and computer systems is a painstaking, time-consuming, and error-prone task. Abstraction and simplification are powerful tools for reducing the cost and complexity of simulation, but are known to reduce precision. Similarly, limiting and simplifying the workloads that are used to drive simulation can simplify the task of the computer architect, while placing the accuracy of the simulation at risk. Historically, precision has been favored over accuracy, resulting in simulators that are able to analyze minute machine model variations while mispredicting, sometimes dramatically, the actual performance of the processor being simulated. In this paper, we argue that both precise and accurate simulators are needed, and provide evidence that counters conventional wisdom for three factors that affect precision and accuracy in simulation. First, we show that operating system effects are important not just for commercial workloads, but also for SPEC integer benchmarks. Next, we show that simulating incorrect speculative paths is largely unimportant for both commercial workloads and SPEC integer benchmarks. Finally, we argue that correct simulation of I/O behavior, even in uniprocessors, can affect simulator accuracy.

1.0 Introduction and Motivation

For many years, simulation at various levels of abstraction has played a key role in the design of computer systems. There are numerous compelling reasons for implementing simulators, most of them obvious. Design teams need simulators throughout all phases of the design cycle. Initially, during high-level design, simulation is used to narrow the design space and establish credible and feasible alternatives that are likely to meet competitive performance objectives. Later, during microarchitectural definition, a simulator helps guide engineering trade-offs by enabling quantitative comparison of various alternatives. During design implementation, simulators are employed for testing, functional validation, and late-cycle design trade-offs. Finally, simulators provide a useful reference for performance validation once real hardware becomes available.

Outside of the industrial design cycle, simulators are also heavily used in the computer architecture academic research community. Within this context, simulators are primarily used as a vehicle for demonstrating or comparing the utility of new architectural features, compilation techniques, or microarchitectural techniques, rather than for helping to guide an actual design project. As a result, academic simulators are rarely used for functional or performance validation, but strictly for proof of concept, design space exploration, or quantitative trade-off analysis.

Simulation can occur at various levels of abstraction; some possible approaches and their benefits and drawbacks are summarized in Table 1. For example, flexible and powerful analytical models that exploit queueing theory can be used to examine system-level trade-offs, identify performance bottlenecks, and make coarse performance projections. Alternatively, equations that compute cycles per instruction (CPI) based on cache miss rates and fixed latencies can also be used to estimate performance, though this approach is not effective for systems that are able to mask latency with concurrent activity. These approaches are powerful and widely employed, but will not be considered further in this paper. Instead, we will focus on the last three alternatives: the use of trace-driven, execution-driven, or full-system simulation to simulate processors and computer systems.
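As a concrete illustration of such a CPI equation (our notation, not drawn from this paper), a common form charges each miss event a fixed, fully serialized penalty:

    \[
      \mathrm{CPI} \;=\; \mathrm{CPI}_{\mathrm{core}} \;+\; \sum_{i} m_i \, p_i
    \]

where \(m_i\) is the miss rate (misses per instruction) at level \(i\) of the memory hierarchy and \(p_i\) is its fixed miss penalty in cycles. Because every miss is assumed to stall the processor for its full latency, any overlap among concurrent misses, or between misses and useful computation, is ignored; this is precisely why such equations break down for machines that mask latency with concurrent activity.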
Trace-driven simulation utilizes execution traces collected from real systems. Collection schemes range from software-only schemes that instrument program binaries all the way to proprietary hardware devices that connect to processor debug ports. The former pollute the collected trace, since the software overhead slows program execution relative to I/O devices and other external events. The latter can run at full speed, but require expensive investment in proprietary hardware and knowledge of debug port interfaces, and are probably not feasible for future multi-gigahertz processors. Trace collection is also hampered by the fact that usually only non-speculative, or committed, instructions are recorded in the trace; hence, the effects of speculatively executed instructions from incorrect branch paths are lost. Furthermore, once a trace has been collected, the cost associated with the disk space used to store it may be a limiting factor.

Modeling Technique | Inputs | Benefits | Drawbacks
Analytical models | Cache miss rates; I/O rates | Flexible, fast, convenient, provide intuition | Cannot model concurrency; lack of precision
CPI equations | Core CPI, cache miss rates | Simple, intuitive, reasonably accurate | Cannot model concurrency; lack of precision
Trace-driven simulation | Hardware traces; software traces | Detailed, precise | Trace collection challenges; lack of speculative effects; implementation complexity
Execution-driven simulation | Programs, input sets | Detailed, precise, speculative paths | Implementation complexity; simulation time overhead; correctness requirement; lack of OS and system effects
Full-system, execution-driven simulation (PHARMsim) | Operating system, programs, input sets, disk images | Detailed, precise, accurate | Implementation complexity; simulation time overhead; correctness requirement

TABLE 1. Attributes of various performance modeling techniques.
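To make the mechanics concrete, the following minimal sketch (ours; the trace record layout, file name, and cache geometry are hypothetical choices, not details from this paper) shows the core loop of a trace-driven cache simulator. Because the trace was recorded at instruction commit, wrong-path references can never appear in the simulation:

    /* Minimal trace-driven cache simulator: a direct-mapped cache driven
       by a file of committed memory references. Illustrative only. */
    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BITS 6          /* 64-byte cache lines */
    #define SETS      1024       /* direct-mapped, 64 KB */

    struct trace_rec {           /* hypothetical record layout */
        uint64_t addr;           /* effective address */
        uint8_t  op;             /* 0 = load, 1 = store */
    };

    int main(int argc, char **argv)
    {
        static uint64_t tag[SETS];
        static uint8_t  valid[SETS];
        uint64_t refs = 0, misses = 0;
        struct trace_rec r;

        FILE *f = fopen(argc > 1 ? argv[1] : "trace.bin", "rb");
        if (!f) { perror("trace"); return 1; }

        /* Every record is a committed reference: speculative, wrong-path
           accesses were never written to the trace and so never reach us. */
        while (fread(&r, sizeof r, 1, f) == 1) {
            uint64_t line = r.addr >> LINE_BITS;
            uint64_t set  = line % SETS;
            refs++;
            if (!valid[set] || tag[set] != line) {   /* miss: allocate line */
                misses++;
                tag[set]   = line;
                valid[set] = 1;
            }
        }
        fclose(f);

        printf("refs=%llu misses=%llu missrate=%.4f\n",
               (unsigned long long)refs, (unsigned long long)misses,
               refs ? (double)misses / refs : 0.0);
        return 0;
    }

Note that the simulated machine cannot influence the instruction stream at all: the input is a fixed recording, which is exactly the limitation discussed above.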
Prior work has argued that trace-driven simulation is no longer adequate for simulating modern, out-of-order processors (e.g., [1]). In fact, the vast majority of research papers published today employ execution-driven simulation and utilize relatively detailed and presumably precise simulation. A recent paper argued that precise simulation is very important and can dramatically affect the conclusions one might draw about the relative benefits of specific microarchitectural techniques [6]; some of these conclusions were toned down in a subsequent publication [5].

One can reasonably conclude that the majority of recently published computer architecture research papers place a great deal of emphasis and effort on precision of simulation, and that researchers invest large amounts of time implementing and exercising detailed simulation models.

In this paper, we show that conventional approaches to exercising processor simulators do so poorly vis-a-vis accuracy that, practically speaking, precision is unimportant. We argue that the correct approach is to build a simulator that is both precise and accurate. We accomplish this by building a simulator, PHARMsim, that does not cheat with respect to any aspect of simulation. Without the investment in such a simulator, we assert that it is impossible to determine whether or not the right abstractions and simplifications have been applied to either the simulator or the workload that is driving it.

We provide evidence that counters conventional wisdom for three factors that affect precision and accuracy in simulation. First, we show that operating system effects are important not just for commercial workloads (as shown by [8] and numerous others), but also for SPEC integer benchmarks [13]. Surprisingly, omitting the operating system can introduce error that exceeds 100% for benchmarks that, according to conventional wisdom, hardly exercise the operating system. Next, we show that simulating incorrectly predicted speculative paths is largely unimportant for both commercial workloads and SPEC integer benchmarks. In most cases, even with an aggressively speculative processor model that issues twice as many instructions as it retires, the bottom-line effect on performance is usually less than 1%, and only 2.4% in the worst case. Finally, we argue that correct simulation of I/O behavior, even in uniprocessors, can affect simulator accuracy. We find that a direct-memory-access (DMA) engine that correctly models the timing of cache line invalidates to lines that are written by I/O devices can affect miss rates by up to 2% and performance by up to 1%, even in a uniprocessor with plenty of available memory bandwidth.

We have found that the main drawback of a simulator that does not cheat is the expense and overhead of correctly implementing it. For example, since our DMA engine implementation relies on the coherence protocol to operate correctly, the coherence protocol must be correctly implemented. Similarly, since the processors actually read values from the caches, rather than cheating by reading from an artificially-maintained flat memory image, DMA and multiprocessor coherence must be correctly maintained in the caches. Furthermore, within the processor core, the register renaming, branch redirection, [...]

[Figure: a spectrum ranging from Design Space Exploration through Quantitative analysis to Performance Validation]

[...] simulator to continue to explore a broad design space. As more and more features are modeled precisely, it becomes increasingly difficult to support design space exploration that strays too far from the chosen direction. This trend also mirrors what is occurring in development; [...]
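Returning to the DMA engine discussed above, the following minimal sketch (our illustration under assumed names and cache geometry, not PHARMsim's implementation) shows the coherence action such an engine must model: a device write invalidates any cached copy of each line it touches, so that later processor loads miss and re-fetch the freshly written data:

    /* Illustrative coherence action of a DMA engine in a simulator that
       does not cheat: device writes invalidate overlapping cached lines. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define LINE_BITS 6          /* 64-byte lines */
    #define SETS      256        /* direct-mapped, 16 KB */

    struct cache {
        uint64_t tag[SETS];
        uint8_t  valid[SETS];
    };

    /* Invalidate every cached line the DMA transfer overlaps, as the
       coherence protocol would on observing the device's writes. */
    static void dma_write(struct cache *c, uint64_t addr, uint64_t len)
    {
        uint64_t first = addr >> LINE_BITS;
        uint64_t last  = (addr + len - 1) >> LINE_BITS;
        for (uint64_t line = first; line <= last; line++) {
            uint64_t set = line % SETS;
            if (c->valid[set] && c->tag[set] == line)
                c->valid[set] = 0;   /* next processor load must miss */
        }
    }

    int main(void)
    {
        struct cache c;
        uint64_t line = 0x1000 >> LINE_BITS;   /* a line the CPU has cached */
        memset(&c, 0, sizeof c);
        c.tag[line % SETS]   = line;
        c.valid[line % SETS] = 1;

        dma_write(&c, 0x1000, 128);            /* device writes two lines */
        printf("cached after DMA? %u\n", c.valid[line % SETS]);   /* prints 0 */
        return 0;
    }

Modeling when these invalidations occur, rather than applying them instantaneously or cheating via a flat memory image, is what perturbs the miss rates and performance in the measurements cited above.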