Automatic Microprocessor Performance Bug Detection
Erick Carvajal Barboza∗, Sara Jacob∗, Mahesh Ketkar†, Michael Kishinevsky†, Paul Gratz∗, Jiang Hu∗
∗Texas A&M University (ecarvajal, sarajacob 96, [email protected], [email protected])
†Intel Corporation (mahesh.c.ketkar, [email protected])

arXiv:2011.08781v2 [cs.AR] 19 Nov 2020

Abstract—Processor design validation and debug is a difficult and complex task, which consumes the lion's share of the design process. Design bugs that affect processor performance rather than its functionality are especially difficult to catch, particularly in new microarchitectures. This is because, unlike functional bugs, the correct processor performance of new microarchitectures on complex, long-running benchmarks is typically not deterministically known. Thus, when performance benchmarking new microarchitectures, performance teams may assume that the design is correct when the performance of the new microarchitecture exceeds that of the previous generation, despite significant performance regressions existing in the design. In this work we present a two-stage, machine learning-based methodology that is able to detect the existence of performance bugs in microprocessors. Our results show that our best technique detects 91.5% of microprocessor core performance bugs whose average IPC impact across the studied applications is greater than 1% versus a bug-free design, with zero false positives. When evaluated on memory system bugs, our technique achieves 100% detection with zero false positives. Moreover, the detection is automatic, requiring very little performance engineer time.

I. INTRODUCTION

Verification and validation are typically the largest component of the design effort of a new processor. The effort can be broadly divided into two distinct disciplines, namely, functional and performance verification. The former is concerned with design correctness and has rightfully received significant attention in the literature. Even though challenging, it benefits from the availability of known correct output to compare against. Alternatively, performance verification, which is typically concerned with generation-over-generation workload performance improvement, suffers from the lack of a known correct output to check against. Given the complexity of computer systems, it is often extremely difficult to accurately predict the expected performance of a given design on a given workload. Performance bugs, nevertheless, are a critical concern as the cadence of process technology scaling slows, as they rob processor designs of the primary performance and efficiency gains to be had through improved microarchitecture.

Further, while simulation studies may provide an approximation of how long a given application might run on a real system, prior work has shown that these performance estimates typically vary greatly from real system performance [8], [18], [44]. Thus, the process to detect performance bugs, in practice, is via tedious and complex manual analysis, often taking long periods of time to isolate and eliminate a single bug. For example, consider the case of the Xeon Platinum 8160 processor snoop filter eviction bug [42]. Here the performance regression was > 10% on several benchmarks. The bug took several months of debugging effort before being fully characterized.

In a competitive environment where time to market and performance are essential, there is a critical need for new, more automated mechanisms for performance debugging.

To date, little prior research in performance debugging exists, despite its importance and difficulty. The few existing works [9], [51], [53] take an approach of manually constructing a reference model for comparison. However, as described by Ho et al. [26], the same bug can be repeated in the model and thus cannot be detected. Moreover, building an effective model is very time consuming and labor intensive. Thus, as a first step we focus on automating bug detection. We note that simply detecting a bug in the pre-silicon phase, or early in the post-silicon phase, is highly valuable to shorten the validation process or to avoid costly steppings.

Further, bug detection is a challenge by itself. Performance bugs, while potentially significant versus a bug-free version of the given microarchitecture, may be less than the difference between microarchitectures. Consider the data shown in Figure 1. The figure shows the performance of several SPEC CPU2006 benchmarks [2] on gem5 [6], configured to emulate two recent Intel microarchitectures: Ivybridge (3rd gen Intel "Core" [23]) and Skylake (6th gen [19]), using the methodology discussed in Sections III and IV. The figure shows that the baseline, bug-free Skylake microarchitecture provides nearly 1.7X the performance of the older Ivybridge, as expected given the greater core execution resources provided.

Fig. 1: Speedup of Skylake simulation with and without performance bugs, normalized against Ivybridge simulation. (Bar chart; configurations: Ivybridge (Bug-Free), Skylake (Bug-Free), Skylake (Bug 1), Skylake (Bug 2); benchmarks: 400.perlbench, 401.bzip2, 403.gcc, 433.milc, 436.cactusADM, 444.namd, 450.soplex, 458.sjeng, plus Geometric Mean; y-axis: Speedup, 0.00 to 3.00.)

In addition to the bug-free case, we collected performance data for two arbitrarily selected bug cases for Skylake. Bug 1 is an instruction scheduling bug wherein xor instructions are the only instructions selected to be issued when they are the oldest instruction in the instruction queue. As the figure shows, the overall impact of this bug is very low (< 1% on average across the given workloads). Bug 2 is another instruction scheduling bug, which causes sub instructions to be incorrectly marked as synchronizing: all younger instructions must wait to be issued until any sub instruction has been issued, and all older instructions must issue before any sub instruction, regardless of dependencies and hardware resources. The impact of this bug is larger, at ∼ 7% across the given workloads.

Nevertheless, for both bugs, the impact is less than the difference between Skylake and Ivybridge. Thus, from the designer's perspective, working on the newer Skylake design with Ivybridge in hand, if early performance models showed the results shown for Bug 2 (even more so Bug 1), designers might incorrectly assume that the buggy Skylake design did not have performance bugs, since it was much better than previous generations. Even as the performance difference between microarchitecture generations decreases, differentiating between performance bugs and expected behavior remains a challenge, given the variability between simulated performance and real systems.

Examining the figure, we note that each bug's impact varies from workload to workload. That said, in the case of Bug 1, no single benchmark shows a performance loss of more than 1%. This makes sense, since the bug only impacts the performance of applications which contain the relatively rare xor instructions. From this, one might conclude that Bug 1 is trivial and perhaps need not be fixed at all. This assumption, however, would be wrong, as many workloads do in fact rely heavily upon the xor instruction, enough so that a moderate impact on them translates into a significant performance impact overall.

In this work, we explore several approaches to this problem and ultimately develop an automated, two-stage, machine learning-based methodology that will allow designers to automatically detect the existence of performance bugs in microprocessor designs. Even though this work's focus is the processor core, we show that the methodology can be used

• Our methodology detects 91.5% of these processor core performance bugs leading to ≥ 1% IPC degradation with 0% false positives. It also achieves 100% accuracy for the evaluated performance bugs in memory systems.

As an early work in this space, we limit our scope somewhat to maintain tractability. This work is focused mainly on the processor core (the most complex component in the system); however, we show that the methodology can be generalized to other units by evaluating its usage on memory systems. Further, here we focus on performance bugs that affect cycle count and consider timing issues to be out of scope.

The goal of this work is to find bugs in new architectures that are incrementally evolved from those used in the training data; it may not work as well when there have been large shifts in microarchitectural details, such as going from in-order to out-of-order. However, major generalized microarchitectural changes are increasingly rare. For example, Intel processors' last major microarchitecture refactoring occurred ∼ 15 years ago, from P4 to "Core". Over this period, there have been several microarchitectures where our approach could provide value. In the event of larger changes, our method can be partially reused by augmenting it with additional features.

II. OBJECTIVE AND A BASELINE APPROACH

The objective of this work is the detection of performance bugs in new microprocessor designs. We explicitly look at microarchitectural-level bugs, as opposed to logic or architectural-level bugs. As grounds for the detection, one or multiple legacy microarchitecture designs are available and assumed to be bug-free, or to have only minor performance bugs remaining. Given the extensive pre-silicon and post-silicon debug, this assumption generally holds for legacy designs. There are no assumptions about what a bug may look like. As it is very difficult to define a theoretical coverage model for performance debug, our methodology is a heuristic approach. We introduce