
2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Systems (PMBS)

Validation of the gem5 Simulator for x86 Architectures

Ayaz Akram1
University of California, Davis, CA
[email protected]

Lina Sawalha2
Western Michigan University, Kalamazoo, MI
[email protected]

Abstract—gem5 has been extensively used in computer architecture simulations and in the evaluation of new architectures for HPC (high performance computing) systems. Previous work has validated gem5 against ARM platforms. However, gem5 still shows high inaccuracy when modeling x86 based processors. In this work, we focus on the simulation of a single node high performance system and study the sources of inaccuracies of gem5. Then we validate the gem5 simulator against an Intel processor, Core-i7 (Haswell microarchitecture). We configured gem5 as closely as possible to match the Core-i7 Haswell microarchitecture configuration, and made changes to the simulator to add some features, modified existing code, and tuned built-in configurations. As a result, we validated the simulator by fixing many sources of error to match real hardware results with less than 6% mean error rate for different control, memory, dependency and execution microbenchmarks.

I. INTRODUCTION

Architectural and microarchitectural simulators play a vital role in computer architecture research and performance evaluation, as building real systems is very time consuming and expensive. Unfortunately, simulators are susceptible to modeling, abstraction and specification errors [1]. Validation of simulators is important to find the sources of experimental errors and fix them, which leads to increased confidence in simulation results. Although some may argue that when evaluating a new design over a base design, relative accuracy is more important, not validating the absolute accuracy of simulators can lead to skewed results and misleading conclusions. Absolute experimental error (compared to real hardware) has been the main method of measuring a simulator's inaccuracy and the need for its validation [2]. In this paper, we validate one of the most widely used architecture and microarchitecture simulators, gem5 [3], against one of Intel's high-performance processors (Core i7-4770). Today's high performance computing (HPC) systems are very heterogeneous; for instance, they can be composed of CPUs of various types, can integrate accelerators and use heterogeneous memory systems. This makes gem5 one of the most appropriate architectural simulators to study HPC system architectures, as it allows for simulating heterogeneous systems composed of many cores and complex configurations. gem5 has been used by ARM research to perform HPC platform simulation and by AMD for their Exascale project [4].

gem5 [3] supports several ISAs (instruction set architectures) like x86, ARM, Alpha, MIPS, RISC-V and SPARC. It supports simple and quick functional simulation as well as detailed simulation modes. The detailed simulation can be run either with system call emulation or with full-system support, in which gem5 is capable of booting a full operating system. gem5 supports two different pipeline models, in-order (IO) and out-of-order (OoO). gem5 has been previously validated for ARM based targets [5], [6]. However, there does not exist any validation effort for x86 based targets (which is the primary ISA for most of the HPC systems today). In this work, we evaluate the accuracy of gem5 by comparing its results to those of real hardware (Haswell microarchitecture). Then we point out the sources of errors in this simulator and discuss how we resolved those errors to achieve higher accuracy. Finally, we validate the simulator for an out-of-order (OoO) core of Intel Core i7-4770. Our results show a significant reduction in the absolute mean error rate, from 136% to less than 6% after validation.

The paper is organized as follows: Section II describes our validation methodology for the gem5 simulator. Section III describes the sources of inaccuracies found in gem5 and discusses the modifications that we performed to validate the simulator. Section IV discusses related work and Section V concludes the paper.

II. VALIDATION METHODOLOGY

In many HPC systems, performance evaluation or simulation validation of different components can be done using a mathematical model, for example, network simulation. However, building a mathematical model is not suitable for validating microarchitecture simulators because of the huge number of configurations and details in such simulators, specifically gem5. In addition, many simulation results are not known without knowing the runtime behavior of programs. Thus, building a mathematical model will not result in an accurate validation of such simulators.

1 Ayaz Akram finished this work while at Western Michigan University.
2 Corresponding author: Lina Sawalha ([email protected]).

978-1-7281-5977-5/19/$31.00 ©2019 IEEE  DOI 10.1109/PMBS49563.2019.00012

In this work, we validate gem5 by comparing its results with the results obtained from real hardware runs using hardware monitoring counters (HMCs). We use the perf tool to profile the different executions with HMCs [7]. Then we find the sources of inaccuracies in gem5 using the calculated experimental error and by correlating the inaccuracies in simulated performance with different simulated architectural statistics and pipeline events. We configured gem5 as closely as possible to an Intel Haswell microarchitecture-based processor (Core i7-4770), relying on multiple available sources [8]–[10]. Table I shows the main configurations of the microarchitecture. The single-core experiments, which we focus on in this work, will eventually help to improve the accuracy of multi-core and multi-node HPC system simulations.

Our adopted methodology for the validation of the gem5 simulator is based on the use of a set of microbenchmarks to focus on problems with specific sub-systems of the modeled CPU. The microbenchmarks used in this work are the same as the ones used for the validation of the SiNUCA microarchitectural simulator [11], which were inspired by the microbenchmarks used for the SimpleScalar simulator's validation [2]. The microbenchmarks are divided into sets of four control, six dependency, five execution, and six memory microbenchmarks. Table II shows a summary of the different microbenchmarks used in this validation.

The first step in our validation methodology is to perform statistical analysis to find a correlation (Pearson's correlation coefficient [12]) between different simulation statistics and the percentage error in the simulated performance statistic (compared to the real hardware). The specific simulation statistics we use include: number of instructions, micro-operations, load operations, store operations, branch operations, floating point operations, integer operations, branch mispredictions, cache misses, speculated instructions, squashed instructions (after misspeculation happens), execution cycles, and full-queue events for various pipeline stages. The main performance statistic we use in this work is instructions per cycle (IPC). We then analyze the correlation and simulated performance results to discover the possible sources of error in the simulator. This knowledge is combined with detailed results and analysis of gem5's code to improve the simulation model. Finally, based on our findings, we apply two types of changes to the simulator: configurational changes and code changes. To discuss the configurational calibration, we refer to the base configuration of the simulator as base config and the calibrated configuration as calib config.

TABLE I: Target Configurations.

Parameter                    | Core i7
Pipeline                     | Out of order
Fetch width                  | 6 instructions per cycle
Decode width                 | 4-7 fused µ-ops
Decode queue                 | 56 µ-ops
Rename and issue widths      | 4 fused µ-ops
Dispatch width               | 8 µ-ops
Commit width                 | 4 fused µ-ops per cycle
Reservation station          | 60 entries
Reorder buffer               | 192 entries
Number of stages             | 19
L1 data cache                | 32KB, 8 way
L1 instruction cache         | 32KB, 8 way
L2 cache size                | 256KB, 8 way
L3 cache size                | 8 MB, 16 way
Cache line size              | 64 B
L1 cache latency             | 4 cycles
L2 cache latency             | 12 cycles
L3 cache latency             | 36 cycles
Integer latency              | 1 cycle
Floating point latency       | 5 cycles
Packed latency               | 5 cycles
Mul/div latency              | 10 cycles
Branch predictor             | Hybrid
Branch misprediction penalty | 14 cycles

base config models Haswell on gem5 based on the parameters listed in Table I, but this configuration is set without an in-depth knowledge of the simulator, as any new user would do. calib config is the configuration achieved based on the initial correlation analysis and knowledge of specific implementation details of the simulator. This calibration technique helps to improve the accuracy of our results. Code changes include both modifying existing code, to fix the errors that we found in the simulator, and developing and adding new features/optimizations to gem5, to better represent the Haswell microarchitecture. Most of our changes and tuned configurations are available on github1.

1 https://github.com/ayaz91/gem5Valid Haswell.git

III. RESULTS AND ANALYSIS

A. Observed Inaccuracies

Figure 1 shows the values of IPC (instructions per cycle) for all microbenchmarks on the unmodified and the modified versions of the simulator, normalized to the IPC values on real hardware. Figure 2 shows the average percentage inaccuracy observed on both versions of the simulator for all categories of the microbenchmarks separately. As shown in Figure 2, the average inaccuracy for the unmodified simulator can become very large. The observed average error in IPC values is 39% for control benchmarks, 8.5% for dependency benchmarks, 458% for execution benchmarks and 38.72% for memory benchmarks. As discussed in the previous section, in order to pinpoint the possible sources of inaccuracies in the simulation model, we also calculated the correlation between various gem5 events (statistics produced by the simulator) and the percentage error in IPC values produced by the simulator, as shown in Figure 3. The figure shows an average of Pearson's correlation coefficient (over all benchmarks), which can have a maximum value of +1 and a minimum value of -1. A positive correlation indicates a positive error in IPC values (overestimation of IPC by the simulator) and a negative correlation indicates a negative error in IPC values (underestimation of IPC by the simulator).
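The correlation step described above can be sketched as follows. The event name and all numbers here are illustrative placeholders, not measured values from our experiments.

```python
from statistics import mean

def pearson(xs, ys):
    # Pearson's correlation coefficient between two equal-length samples.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical per-microbenchmark data: real and simulated IPC, and one
# simulator event count (e.g. squashed instructions) per benchmark.
real_ipc = [2.0, 1.5, 1.0, 2.5]
sim_ipc  = [1.6, 1.3, 0.7, 2.4]
squashed = [900, 400, 1500, 100]

# Percentage error in IPC: positive = overestimation by the simulator.
ipc_err = [100.0 * (s - r) / r for s, r in zip(sim_ipc, real_ipc)]

# A strongly negative coefficient would flag this event as associated
# with underestimated performance.
r = pearson(squashed, ipc_err)
```

Repeating this for every collected statistic and averaging over benchmarks yields the per-event coefficients of the kind plotted in Figure 3.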

TABLE II: Description of the Used Microbenchmarks.

MicroBenchmark          | Target                                              | Construct
Control Benchmarks
Control Conditional     | conditional branches                                | if-then-else in a loop
Control Switch          | indirect jumps                                      | switch statements in a loop
Control Complex         | hard-to-predict branches                            | if-else and switch statements in a loop
Control Random          | random behavior of branches                         | if-then-else in a loop; branch outcome determined by a shift register
Dependency Benchmarks
Dep1                    | dependency chain                                    | dependent instructions with dependency length of 1 instruction
Dep2                    | dependency chain                                    | dependent instructions with dependency length of 2 instructions
Dep3                    | dependency chain                                    | dependent instructions with dependency length of 3 instructions
Dep4                    | dependency chain                                    | dependent instructions with dependency length of 4 instructions
Dep5                    | dependency chain                                    | dependent instructions with dependency length of 5 instructions
Dep6                    | dependency chain                                    | dependent instructions with dependency length of 6 instructions
Execution Benchmarks
Ex int-add              | integer adder                                       | 32 independent instructions in a loop
Ex int-multiply         | integer multiplier                                  | 32 independent instructions in a loop
Ex fp-add               | floating point adder                                | 32 independent instructions in a loop
Ex fp-multiply          | floating point multiplier                           | 32 independent instructions in a loop
Ex fp-division          | floating point divider                              | 32 independent instructions in a loop
Memory Benchmarks
Load Dependent 1 & 2    | memory hierarchy for dependent load instructions    | linked lists walked in loops
Load Independent 1 & 2  | memory hierarchy for independent load instructions  | different cache sizes with 32 independent loads in a loop
Store Independent 1 & 2 | memory hierarchy for independent store instructions | different cache sizes with 32 parallel independent stores in a loop
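Dependency-chain microbenchmarks of the kind listed above can be generated mechanically. The sketch below is illustrative only (it is not the SiNUCA/SimpleScalar benchmark source); it emits a C loop body whose statements form a dependency chain of a chosen length.

```python
def gen_dep_chain(length, iters=1000000):
    # Emit a C loop body with `length` chained dependent additions
    # (illustrative stand-in for the dependency microbenchmarks).
    vars_ = [f"a{i}" for i in range(length)]
    decls = " ".join(f"volatile long {v} = {i + 1};" for i, v in enumerate(vars_))
    # Each statement consumes the result of the previous one in the chain.
    chain = " ".join(
        f"{vars_[(i + 1) % length]} += {vars_[i]};" for i in range(length)
    )
    return f"{decls}\nfor (long i = 0; i < {iters}; i++) {{ {chain} }}\n"

print(gen_dep_chain(3))
```

Varying `length` from 1 to 6 reproduces the Dep1–Dep6 pattern: longer chains expose how well the simulated pipeline overlaps dependent operations.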

Fig. 1: Normalized IPC values of microbenchmarks compared to the real hardware. (figure)

Fig. 2: Average accuracy for microbenchmarks compared to the real hardware. (figure)

Most of the cache related events at all levels (cache misses and accesses) show a negative correlation with the percentage error in IPC values, indicating a possible overestimation of cache/memory latencies. Similarly, many of the pipeline stall events (like ROBFullEvents, IewSQFullEvents, IewIQFullEvents in gem5) show a strong negative correlation with the percentage error. This indicates that the number of wasted cycles in the simulator in case of pipeline stall events is overestimated, thus underestimating performance. The number of executed micro-operations (µ-ops) shows a positive correlation with the percentage error in IPC values, meaning that benchmarks with a very high number of µ-ops usually lead to overestimated IPC values. The number of store operations shows a very strong negative correlation, explaining the low accuracy of the memory store independent benchmarks shown in Figure 3. This strong negative correlation should be viewed in association with the strong negative correlation of IewSQFullEvents (store-queue-full events), which suggests that the main reason for the high inaccuracy of the memory store independent benchmarks is an overestimated number of wasted cycles in case of a store-queue-full event. A negative correlation of branch mispredictions with the percentage error in IPC suggests that the branch predictor is a potential source of error. This was also indicated in previous works [13], [14], as the actual branch predictors used in the real hardware are not published and the branch predictors of gem5 appear to lag in performance when compared to the actual hardware.

B. Applied Modifications to the Simulator

We use the knowledge obtained from the previously discussed statistical analysis to guide tuning and modifications of the simulator. The applied changes to the simulator and the configurational calibration improved the accuracy of the simulator, as shown in Figures 1 and 2.

Fig. 3: Correlation of gem5 events with the percentage error in performance (IPC). (figure)

For the modified version of the simulator, the observed average error in IPC values is reduced to 9% for control microbenchmarks, 0.5% for dependency microbenchmarks, 5.4% for execution microbenchmarks and 7.7% for memory benchmarks. Next, we discuss the changes made to the simulator (or its configuration) in the context of the observed problems.

We found that increasing the value of the issueToExecute delay parameter (the delay between the issue and the execute stages of the pipeline) beyond one inhibits "back-to-back" execution of two dependent instructions (this issue has been observed by others as well [15]). In base config this parameter was set to two, along with the required pipeline length of different delay values. Changing this value to one improved the achieved IPC values; for example, this change improved the IPC values, and thus the accuracy, for many control microbenchmarks.

In base config, the instruction cache (i-cache) access latency was set to four cycles (as in the real hardware). We observed that the fetch unit in the out-of-order (OoO) pipeline of gem5 does not send a request to the i-cache while it is waiting on a response from the i-cache (as is also discussed in [16]). In other words, this makes the fetch unit behave as if the i-cache were configured as a blocking cache, even if it is a non-blocking cache, which reduces the fetch throughput considerably. One possible solution to compensate for this is to keep the i-cache hit latency at one cycle and add an extra latency, equal to the missing i-cache latency, to one of the front end stages of the pipeline [16]. Specifically, in calib config we added this extra latency in between the fetch and the decode stages (the fetchToDecode delay parameter), which improved the fetch throughput by a factor of three. This change particularly improved the IPC of the control small microbenchmark, reducing its error from 33.6% to only 0.4%.

We also found that the ratio of micro-operations (µ-ops) to instructions was very high in many cases in gem5. Thus, we changed the source code of the simulator to make the decoding of x86 instructions into µ-ops more practical. To make these changes, we used information available in [9] and also looked at the code of other x86 simulators such as ZSim [17] and Sniper [18], which exhibit more realistic µ-ops to instructions ratios. Our modifications achieved a µ-ops to instructions ratio within 5% of what is observed on the real hardware for SPEC CPU2006 benchmarks [19], [20]. Since all pipeline widths and the sizes of pipeline structures (like reservation stations (RS) and reorder buffers (ROB)) are defined in terms of µ-ops, reducing the number of µ-ops improves the overall IPC values, and this change improves the accuracy as well. For example, the inaccuracy is reduced from 10.5% to 5.1% in the ex int add microbenchmark and from 6.5% to 1.5% in the dep5 microbenchmark.

The labelling of different classes of µ-ops in gem5 needs improvement as well. We noticed that floating point division operations are wrongly classified as floating point add operations in gem5. We modified gem5's code to correct their labels, which eliminated the large inaccuracies observed in the floating point related microbenchmarks (ex fp mul and ex fp div) shown in Figure 1.

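The configurational side of these changes can be summarized as a parameter-override table. The sketch below is a plain-Python stand-in, not an actual gem5 run script: the dictionary keys mirror gem5's O3 CPU option names and Table I, but treat them as assumptions for illustration, and the calibration function applies the changes described above.

```python
# Baseline O3-style parameters modeled after Table I (key names mirror
# gem5's O3 CPU options; this is a stand-in, not a gem5 config script).
base_config = {
    "fetchWidth": 6,
    "decodeWidth": 4,
    "issueWidth": 4,
    "commitWidth": 4,
    "numROBEntries": 192,
    "numIQEntries": 60,
    "issueToExecuteDelay": 2,  # >1 inhibits back-to-back execution
    "fetchToDecodeDelay": 1,
    "icacheHitLatency": 4,     # cycles, as on the real hardware
}

def calibrate(cfg):
    # Apply the calib config changes discussed in Section III-B.
    calib = dict(cfg)
    # Allow back-to-back issue/execute of dependent instructions.
    calib["issueToExecuteDelay"] = 1
    # Keep a 1-cycle i-cache hit latency; move the remaining latency
    # into the fetch-to-decode delay to restore fetch throughput.
    moved = calib["icacheHitLatency"] - 1
    calib["icacheHitLatency"] = 1
    calib["fetchToDecodeDelay"] += moved
    return calib

calib_config = calibrate(base_config)
```

Keeping the overrides in one place like this makes it easy to diff base config against calib config when attributing an accuracy change to a specific parameter.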
The memory load ind2 microbenchmark loads independent data values from the L1 data cache (of 32KB size) with an offset of 64 bytes (the cache line size). The number of data cache misses in this case should be very low, as the memory footprint of the loop is the same as the size of the L1 data cache. The simulator, however, showed a significant number of data cache misses for this microbenchmark. Digging into the origin of the problem, we observed that it is related to branch prediction. The microbenchmark consists of a nested loop, where the inner loop is responsible for traversing through the data array, while the outer loop repeats this process a specific number of times. We observed that on every last iteration of the inner loop, the branch instruction (responsible for this loop) is mispredicted to be taken. This results in an access to a new cache block (on the wrong path) in set 0 of the cache, kicking out a useful block. This in turn causes a high number of data cache misses, leading to lower IPC values. We implemented a loop predictor in gem5 that eliminated this problem and improved the accuracy of this type of microbenchmark, as shown in Figure 1. Similar is the case with the memory store ind2 microbenchmark.

Furthermore, we studied a couple of microarchitectural events to observe the differences in their values on the simulator and the real hardware. Tables III and IV show a comparison of the real hardware, the unmodified simulator and the modified version of the simulator using two microarchitectural events: the branch miss ratio (branch misses per thousand branch instructions) and L1 data cache misses per thousand instructions. The tables show only those microbenchmarks for which the events are relevant: the control and memory microbenchmarks. Generally, the difference between the values of these events in the modified and unmodified versions of the simulator is small. Moreover, the simulated event counts are generally underestimated. These event counts should be analyzed carefully, as the inaccuracy observed in most of the cases is insignificant due to the small values of the events, and it does not affect the overall performance results of the simulator. The difference between the events' values of the simulator and those of the real hardware can also be attributed to noise in the measured values of events on the real hardware, which can become visible when the counts of the measured events are very small, as is the case for most of the microbenchmarks. A few interesting cases where the accuracy of the event values in the modified version of the simulator improved significantly include: control random, memory load ind2 and memory store ind2 for branch miss ratios, and memory load ind2 for L1D cache MPKI (misses per thousand instructions).

TABLE III: Branch Miss Ratio

Benchmark      | Real   | SimUnMod | SimMod
ControlComplex | 0.0012 | 0.000205 | 0.00021
ControlRandom  | 28.26  | 30       | 28.78
ControlSmall   | 0.0004 | 0.000068 | 0.00007
ControlSwitch  | 0.0011 | 0.01609  | 0.000179
MemLoadind1    | 0.0576 | 0.00936  | 0.00947
MemLoadind2    | 0.0239 | 5.559    | 0.00513
MemLoadDep1    | 0.0698 | 0.0093   | 0.00953
MemLoadDep2    | 0.0528 | 5.55975  | 5.559
MemStoreind1   | 0.0528 | 0.009387 | 0.00956
MemStoreind2   | 0.0298 | 5.56     | 0.00513

TABLE IV: L1D Cache MPKI

Benchmark    | Real     | SimUnMod  | SimMod
MemLoadind1  | 0.01065  | 0.0020794 | 0.00209926
MemLoadind2  | 0.114602 | 120.399   | 0.00156352
MemLoadDep1  | 0.073911 | 0.001996  | 0.00200671
MemLoadDep2  | 15.825   | 25.774    | 25.774077
MemStoreind1 | 0.01583  | 0.001935  | 0.001996
MemStoreind2 | 0.132615 | 0.001522  | 0.001547
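The loop predictor we added for cases like memory load ind2 can be sketched as follows. This is an illustrative model, not the code we added to gem5: it learns a loop branch's trip count from one complete execution of the loop and then predicts not-taken on the final iteration.

```python
class LoopPredictor:
    # Minimal loop-predictor sketch: learn each loop branch's trip
    # count, then predict not-taken on the iteration that matches it.
    def __init__(self):
        self.trip_count = {}  # branch PC -> learned taken-run length
        self.current = {}     # branch PC -> taken count in current run

    def predict(self, pc):
        learned = self.trip_count.get(pc)
        seen = self.current.get(pc, 0)
        if learned is not None and seen >= learned:
            return False      # final iteration: predict fall-through
        return True           # otherwise predict taken (stay in loop)

    def update(self, pc, taken):
        if taken:
            self.current[pc] = self.current.get(pc, 0) + 1
        else:
            # Loop exit: remember how many taken outcomes preceded it.
            self.trip_count[pc] = self.current.get(pc, 0)
            self.current[pc] = 0
```

After one full pass over a loop with a fixed trip count, the exit branch is no longer mispredicted, which is exactly the pattern the memory load ind2 inner loop exhibits.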
We also observed that gem5 is quite conservative in its pipelining behavior in the front-end stages of the pipeline. Instead of proper queuing in case of blocking events, gem5 has SkidBuffers at each stage that absorb in-flight instructions whenever there is a blocking event. Moreover, upon removal of a blocking event, a stage gets unblocked completely only after its SkidBuffer is fully drained, and only then allows the previous stages to unblock as well. For example, in a worst case scenario, the pipeline bubbles (or stalls) caused by a rename stage block equal the sum of the delay between the decode and rename stages and the delay between the fetch and decode stages. This behavior has also been observed by [15], [21], and it should be analyzed in association with the observed strong negative correlation of pipeline stall events and the percentage error in simulation results (shown in Figure 3), as discussed previously.

To make the pipelining behavior less conservative, we might need more structural changes in the pipeline of the simulator. For the moment, we relied on tweaking the pipeline configuration (in calib config) to decrease the effect of additional pipeline bubbles. We figured out that we can change the values of the backward path delays between stages of the pipeline (which are set to 1 by default). Since these delays are set to 1, whenever any stage gets blocked it signals the previous stages to block as well on the next cycle, irrespective of any space in the queues between the stages and in the stages' SkidBuffers (gem5 structures used to store instructions in case of blocking events). Setting the delays on the backward paths equal to the forward path delay for a particular stage can avoid most of the wasted cycles in case of stalls, as any blocking stage then waits for the set number of cycles before conveying the blocking message to the previous stage. This change improved the performance specifically for the control random and control switch microbenchmarks.

IV. RELATED WORK

There are many validation works in the literature for different architectural and microarchitectural simulators, for example, SimpleScalar [2], SiNUCA [11], ESESC [22], ZSim [19], MARSSx86 [23], gem5 (ARM ISA) [5], [6], Multi2Sim [24], Sniper [25], and McSimA+ [26]. Nowatzki et al. [27] presented different observed issues with four common performance and power simulators including gem5. The issues they found with gem5 include wrongly classified and inefficient µ-ops and an inconsistent pipeline replay mechanism.

Weaver and McKee [28] showed that the use of unvalidated cycle-by-cycle simulators with large inaccuracies cannot be justified in comparison to fast tools, which can provide similar results at a much higher speed. Asri et al. [23] calibrated MARSSx86 to simulate "architecture-rich" heterogeneous processors. They showed that unvalidated simulators can result in significant differences between simulation and real hardware results, which can be more problematic for accelerator studies. Akram and Sawalha [14], [29], [30] compared experimental errors for different architectural simulators including gem5 and concluded that

the main sources of inaccuracies in simulation results are overestimated branch mispredictions, imprecise decoding of instructions into micro-operations and high cache misses. Walker et al. [13] recently presented GemStone, a systematic tool to compare reference hardware and simulator results, identify sources of error and apply modifications to the simulation model to improve its accuracy using statistical analysis. They used the ARM Cortex-A15 gem5 model to present GemStone, and identified branch prediction to be the main source of error in simulation results. Our work validates gem5 against an x86 single-core target platform, which has not been performed before.

V. CONCLUSION

Validation of architectural simulators is important as it increases the confidence in the results of simulators. While gem5 has already been validated for ARM based processors, there does not exist any validation effort for x86 based processors to the best of our knowledge. In this work, we have identified various errors in the x86 out-of-order CPU model of gem5. The accuracy of the x86 model of gem5 has been improved significantly (a reduction in mean error rate from 136% to 6% for the studied microbenchmarks) after applying different modifications to the simulator. Our modifications included simulator configuration calibration, code modification and adding new structures to the simulator. In the future, we would like to dig deeper into some of the problems that we have identified here to further reduce the inaccuracy, and to expand our study to use more realistic HPC benchmarks and multi-core/multi-node HPC architectures.

REFERENCES

[1] B. Black and J. P. Shen, "Calibration of Microprocessor Performance Models," Computer, vol. 31, pp. 59–65, May 1998.
[2] R. Desikan, D. Burger, and S. W. Keckler, "Measuring Experimental Error in Microprocessor Simulation," in Proceedings of the 28th Annual International Symposium on Computer Architecture, pp. 266–277, Gothenburg, Sweden, June 30–July 4, 2001.
[3] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, et al., "The gem5 Simulator," ACM SIGARCH Computer Architecture News, vol. 39, pp. 1–7, May 2011.
[4] M. J. Schulte, M. Ignatowski, G. H. Loh, B. M. Beckmann, W. C. Brantley, S. Gurumurthi, N. Jayasena, I. Paul, S. K. Reinhardt, and G. Rodgers, "Achieving Exascale Capabilities through Heterogeneous Computing," IEEE Micro, vol. 35, no. 4, pp. 26–36, 2015.
[5] A. Butko, R. Garibotti, L. Ost, and G. Sassatelli, "Accuracy Evaluation of GEM5 Simulator System," in IEEE 7th International Workshop on Reconfigurable Communication-centric Systems-on-Chip, pp. 1–7, York, UK, 9–11 July 2012.
[6] A. Gutierrez, J. Pusdesris, R. G. Dreslinski, T. Mudge, C. Sudanthi, C. D. Emmons, M. Hayenga, and N. Paver, "Sources of Error in Full-System Simulation," in IEEE International Symposium on Performance Analysis of Systems and Software, pp. 13–22, Monterey, CA, 2014.
[7] Perf, "perf: Linux profiling with performance counters," 2015. Retrieved August 5, 2015 from https://perf.wiki.kernel.org/index.php/Main Page.
[8] "Intel® 64 and IA-32 Architectures Software Developer Manuals." Available: http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html [Online; accessed Aug. 2019].
[9] A. Fog, "Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs." Available: http://www.agner.org/optimize/instruction tables.pdf [Online; accessed 3 September 2016].
[10] Intel's Haswell CPU Microarchitecture. Available: http://www.realworldtech.com/haswell-cpu/ [Online; accessed 1 June 2017].
[11] M. A. Z. Alves, C. Villavieja, M. Diener, F. B. Moreira, and P. O. A. Navaux, "SiNUCA: A Validated Micro-Architecture Simulator," in IEEE High Performance Computing and Communications (HPCC), pp. 605–610, 2015.
[12] J. Benesty, J. Chen, Y. Huang, and I. Cohen, "Pearson Correlation Coefficient," in Noise Reduction in Speech Processing, pp. 1–4, Springer, 2009.
[13] M. Walker, S. Bischoff, S. Diestelhorst, G. Merrett, and B. Al-Hashimi, "Hardware-Validated CPU Performance and Energy Modelling," in IEEE International Symposium on Performance Analysis of Systems and Software, Belfast, UK, pp. 44–53, April 2018.
[14] A. Akram and L. Sawalha, "x86 Computer Architecture Simulators: A Comparative Study," in IEEE 34th International Conference on Computer Design (ICCD), pp. 638–645, 2016.
[15] "CPU configuration (gem5-users)." https://www.mail-archive.com/[email protected]/msg12282.html, 2015. [Online; accessed 5 August 2015].
[16] "O3 fetch throughput when i-cache hit latency is more than 1 cycle (gem5-users)." https://www.mail-archive.com/[email protected]/msg10393.html, 2015. [Online; accessed 5 August 2017].
[17] D. Sanchez and C. Kozyrakis, "ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems," in Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA), pp. 475–486, June 2013.
[18] T. E. Carlson, W. Heirman, and L. Eeckhout, "Sniper: Exploring the Level of Abstraction for Scalable and Accurate Parallel Multi-Core Simulation," in ACM International Conference for High Performance Computing, Networking, Storage and Analysis, Seattle, WA, pp. 1–12, Nov. 2011.
[19] ZSim, ZSim Tutorial: Validation, 2015. Available: http://zsim.csail.mit.edu/tutorial/slides/validation.pdf [Online; accessed 5 June 2017].
[20] ISA Power Struggles. Available: http://research.cs.wisc.edu/vertical/wiki/index.php/Isa-power-struggles/Isa-power-struggles [Online; accessed 5 June 2017].
[21] T. Tanimoto, T. Ono, and K. Inoue, "Dependence Graph Model for Accurate Critical Path Analysis on Out-of-Order Processors," Journal of Information Processing, vol. 25, pp. 983–992, 2017.
[22] E. K. Ardestani and J. Renau, "ESESC: A Fast Multicore Simulator Using Time-Based Sampling," in IEEE 19th International Symposium on High Performance Computer Architecture, pp. 448–459, Shenzhen, China, 23–27 February 2013.
[23] M. Asri, A. Pedram, L. K. John, and A. Gerstlauer, "Simulator Calibration for Accelerator-Rich Architecture Studies," in International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, Samos, Greece, July 2016.
[24] R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli, "Multi2Sim: A Simulation Framework for CPU-GPU Computing," in Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, pp. 335–344, Minneapolis, MN, 2012.
[25] T. E. Carlson, W. Heirman, S. Eyerman, I. Hur, and L. Eeckhout, "An Evaluation of High-Level Mechanistic Core Models," ACM Transactions on Architecture and Code Optimization, vol. 11, no. 3, Article 28, 2014. 25 pages.
[26] J. H. Ahn, S. Li, O. Seongil, and N. P. Jouppi, "McSimA+: A Manycore Simulator with Application-level+ Simulation and Detailed Microarchitecture Modeling," in IEEE International Symposium on Performance Analysis of Systems and Software, pp. 74–85, Austin, TX, 2013.
[27] T. Nowatzki, J. Menon, C.-H. Ho, and K. Sankaralingam, "Architectural Simulators Considered Harmful," IEEE Micro, vol. 35, no. 6, pp. 4–12, 2015.
[28] V. M. Weaver and S. A. McKee, "Are Cycle Accurate Simulations a Waste of Time?," in 7th Workshop on Duplicating, Deconstructing, and Debunking (WDDD), pp. 40–53, 2008.
[29] A. Akram and L. Sawalha, "A Comparison of x86 Computer Architecture Simulators," Tech. Rep. TR-CASRL-1-2016, Western Michigan University, MI, October 2016.
[30] A. Akram and L. Sawalha, "A Survey of Computer Architecture Simulation Techniques and Tools," IEEE Access, 2019.
