PENTIUM 4 PERFORMANCE-MONITORING FEATURES

The Pentium 4's unique performance-monitoring features overcome many limitations and problems found in previous processors. Pentium 4 performance monitoring also supports the processor's simultaneous multithreaded execution features.

Brinkley Sprunt
Bucknell University

Most modern, high-performance processors have special, on-chip hardware that can monitor performance. The features of this monitoring hardware typically include event detectors and counters, qualification of event detection and counting by privilege mode and event characteristics, and support for event-based sampling. However, these features often suffer from a common set of problems, including a small number of counters, inability to distinguish between speculative and nonspeculative events, and imprecise event-based sampling. With the introduction of the Pentium 4 processor, Intel has overcome many of the performance-monitoring limitations of previous processors. The Pentium 4 supports 48 event detectors and 18 event counters, enabling the concurrent collection of a significantly larger set of performance event counts than any other processor. The Pentium 4 also provides several instruction-tagging mechanisms that enable counting of nonspeculative performance events (that is, events generated by instructions that retire). In addition to the support for imprecise event-based sampling (IEBS), the Pentium 4 also provides precise event-based sampling (PEBS) that unambiguously identifies instructions that cause key performance events. PEBS support also lets users create data address profiles for various memory events. Because the Pentium 4 Xeon processor is the first implementation of the x86 architecture to support simultaneous multithreading (SMT), its performance-monitoring support includes qualification of event detection by thread ID and qualification of event counting by thread mode.1

Event detectors and counters
At more than 42 million transistors, the Pentium 4 is significantly larger than its immediate predecessor, the Pentium III, which has 28 million transistors. With such a large design, the previous approaches used by Intel for organizing the performance event detectors and counters became impractical. These processor designs dispersed the event detectors across the chip, close to the units where the performance events occur; event counters were in a central location. This approach required routing all event count signals to the centrally located event counters. To follow a similar approach in the larger Pentium 4 design would have required significantly more silicon area to route the event count buses to the event counters.


Intel's computer architects considered two alternatives: incorporating a counter into the implementation of each event detector, or dispersing several blocks of counters across the chip and letting geographically close event detectors share these blocks. The first approach would eliminate signal routing among event detectors and counters while providing one event counter for each event detector. The second approach would reduce signal routing among event detectors because of the event detectors' close proximity to a counter block. The first approach would have eliminated routing costs and provided the user with more counters that they could use concurrently. Because the silicon area required to implement one counter per event detector was greater than that for implementing several shared groups of counters, Intel's computer architects chose the second approach.

Figure 1. The general structure of the Pentium 4 event counters and detectors. The event detectors control the selection of events and qualification of event detection by privilege mode (OS and/or USR) and thread ID. Each Pentium 4 unit contains two or six event detectors. Each event counter block contains four or six counters, and each counter can select an event detector, perform threshold comparisons and edge detection, and qualify event counting by thread mode. The counters in the counter block can also signal performance monitor interrupts on counter overflow. (The counters are 40 bits wide.)

Structure and function
Figure 1 shows the structure and function of these event detector and counter groups. The units in Figure 1 are logic sets that perform a significant function in the Pentium 4's design, such as the branch prediction or the trace cache units. Implementing the event detectors in pairs supports concurrent detection of the same event for different threads in a simultaneous multithreaded system. Most units have one pair of equally capable event detectors, although the checker/retire unit has three event detector pairs.

Each event detector has a maximum 4-bit-wide output bus, providing a maximum increment per cycle of 15. The Pentium 4 implementation routes output buses from each event detector to an event counter block. Most event counter blocks contain four counters, with the exception of the counter block in the instruction queue unit, which has six counters. Each counter block has two output signals for requesting interrupts on counter overflow for event-based sampling (EBS). With this organization, many event detectors share a set of four (or six) counters. Therefore, one of these groups can only count four (or six) events concurrently.

The event detectors allow the selection and masking of an event and can qualify event detection by privilege mode and thread ID. Event counters support threshold comparison, edge detection, and thread mode qualification. They can also generate performance monitor interrupts on counter overflow to support EBS. Users can configure these features using event select control registers (ESCRs) and counter configuration control registers (CCCRs).


(Figure 2 depicts four counter-block groups: the BPU group, fed by event detectors in the BPU, ISTEER, IXLAT, ITLB, PMH, MOB, FSB, and BSU units; the MS group, fed by the MS, TC, and TBPU units; the FLAME group, fed by the FLAME, FIRM, SAAT, U2L, and DAC units; and the IQ group, fed by the IQ, ALF, RAT, and CRU units, with the CRU providing ESCR2 through ESCR5 in addition to an ESCR0/ESCR1 pair. Each group contains four CCCR/counter pairs except the IQ group, which contains six.)

Figure 2. Interconnections among event detectors and event select control registers (ESCRs), and their associated counters and counter configuration control registers (CCCRs).

Groups
Figure 2 shows the four groups of event detectors and counters implemented in the Pentium 4. Each group consists of event detectors (containing ESCRs), shown for several units in the figure, and a block of counters (containing CCCRs). Designers named each group according to the Pentium 4 unit that contains the counter block for that group. For example, the branch prediction unit gives its name to the group that contains event detectors in the BPU, ISTEER, IXLAT, ITLB, PMH, MOB, FSB, and BSU units, and these detectors connect to a counter block in the BPU that contains four CCCR-counter pairs. Counters in the counter blocks are grouped into pairs, such as counters 0 and 1 in the BPU group. The paired counters share the same set of event detectors and ESCRs. As Figure 2 shows, although the Pentium 4 contains 44 event detectors, it can only count 18 events concurrently because only 18 counters are available. Even so, the ability to count 18 events concurrently is a substantial increase over the capabilities of other processors.

Event selection
An ESCR, shown in Figure 3, configures an event detector. The 7-bit event select field selects the desired event, and the 16-bit event mask field selects a subset of the desired event. For example, the Pentium 4 branch_retired event defines 4 bits in the event mask field to select from the following branch types: taken, not taken, predicted, and mispredicted. To count mispredicted branches, the user would select the branch_retired event and then set the branch-taken-mispredicted and branch-not-taken-mispredicted bits in the event mask.

Figure 3. Event select control register. (Fields: event select, event mask, tag value, tag enable, and the T0_OS, T0_USR, T1_OS, and T1_USR qualification bits.)

The ESCR's low-order 4 bits allow qualification of event detection by privilege mode and thread ID. The thread ID identifies one of the possible two threads concurrently executing via the Pentium 4's SMT capability. See the "Simultaneous multithreading: Improving instruction throughput for high-performance processors" sidebar for a brief introduction to the concepts, goals, and performance implications of the Pentium 4's SMT features. For example, to count all operating system events for both threads, the user would set both the T0_OS and T1_OS bits. Defining each of these bits as a combination of both thread and privilege mode provides a greater degree of control than is possible if the event detector used 2 bits (T0 and T1) to select thread qualification and the other 2 bits (operating system, OS, and user, USR) to select privilege mode. For example, the user could count all user events for thread 0 and all operating system events for thread 1 (by setting the T0_USR and T1_OS bits), which is not possible with individual T0, T1, OS, and USR bits in the ESCR.
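To make the event-selection fields concrete, the following fragment sketches how a monitoring tool might compose an ESCR value for counting retired mispredicted branches on thread 0 in both privilege modes. It is only a sketch: the field offsets, the branch_retired encoding, and the mask bit positions are placeholders that would have to be taken from Figure 3 and Intel's IA-32 manual, and wrmsr() stands in for whatever privileged mechanism actually writes the ESCR.

    /* Sketch only: field offsets and event encodings below are assumptions
       to be checked against Figure 3 and the IA-32 Software Developer's Manual. */
    #include <stdint.h>

    #define ESCR_T1_USR             (1u << 0)   /* assumed low-order qualification bits */
    #define ESCR_T1_OS              (1u << 1)
    #define ESCR_T0_USR             (1u << 2)
    #define ESCR_T0_OS              (1u << 3)
    #define ESCR_EVENT_MASK_SHIFT   9           /* assumed start of 16-bit event mask  */
    #define ESCR_EVENT_SELECT_SHIFT 25          /* assumed start of 7-bit event select */

    #define BRANCH_RETIRED_EVENT        0x06       /* placeholder encoding  */
    #define BR_TAKEN_MISPREDICTED       (1u << 3)  /* placeholder mask bits */
    #define BR_NOT_TAKEN_MISPREDICTED   (1u << 1)

    extern void wrmsr(uint32_t msr, uint64_t value);   /* hypothetical privileged helper */

    void select_mispredicted_branches(uint32_t escr_msr)
    {
        uint64_t escr = 0;
        escr |= (uint64_t)BRANCH_RETIRED_EVENT << ESCR_EVENT_SELECT_SHIFT;
        escr |= (uint64_t)(BR_TAKEN_MISPREDICTED | BR_NOT_TAKEN_MISPREDICTED)
                    << ESCR_EVENT_MASK_SHIFT;
        escr |= ESCR_T0_OS | ESCR_T0_USR;   /* thread 0, both privilege modes */
        wrmsr(escr_msr, escr);
    }

A counter must still be pointed at this detector through its CCCR, as the next section describes.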

Counter configuration
Figure 4 shows the bits and fields in a CCCR that

• enable the counter,
• select the event detector to use as the source for counter increments,
• qualify event counting by the processor's current thread mode,
• configure the counter's threshold and edge detection capabilities,
• configure the interrupt generation on counter overflow, and
• enable the cascading of paired counters.

Figure 4. Counter configuration control register. (Fields: enable, ESCR select, compare, complement, threshold, edge, force overflow, cascade, the OVF status bit, the OVF_PMI_T0 and OVF_PMI_T1 interrupt bits, and the 2-bit thread mode field: single, dual, no, or any thread.)

The CCCR also has one status bit, OVF, to indicate whether an overflow has occurred.

Using an event counter requires enabling the counter and selecting the event detector to supply increment values to the counter. The CCCR's enable bit turns on the counter, and the ESCR select field chooses the event detector output that the counter should use. In Figure 2, the bold numbers in the boxes for each pair of event detectors represent the value that the CCCR's ESCR select field uses to pick the indicated event detector.
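Continuing the earlier sketch, the fragment below programs a hypothetical ESCR/CCCR/counter triple and reads the result. The MSR addresses, the enable-bit position, and the ESCR select offset are again assumptions to be verified against Figure 4 and Intel's manual; wrmsr() and rdmsr() are stand-ins for privileged MSR accessors.

    /* Sketch only: MSR addresses and CCCR field positions are assumptions. */
    #include <stdint.h>

    #define CCCR_ENABLE             (1u << 12)  /* assumed enable-bit position      */
    #define CCCR_ESCR_SELECT_SHIFT  13          /* assumed ESCR select field offset */

    extern void     wrmsr(uint32_t msr, uint64_t value);  /* hypothetical helpers */
    extern uint64_t rdmsr(uint32_t msr);

    #define MY_ESCR_MSR     0x3B8   /* placeholder MSR addresses for one */
    #define MY_CCCR_MSR     0x360   /* detector/counter pair             */
    #define MY_COUNTER_MSR  0x300

    void start_counter(uint64_t escr_value, unsigned escr_select)
    {
        wrmsr(MY_ESCR_MSR, escr_value);      /* program the event detector        */
        wrmsr(MY_COUNTER_MSR, 0);            /* clear the 40-bit counter          */
        wrmsr(MY_CCCR_MSR,                   /* point the counter at the detector */
              ((uint64_t)escr_select << CCCR_ESCR_SELECT_SHIFT) | CCCR_ENABLE);
    }

    uint64_t stop_and_read_counter(void)
    {
        wrmsr(MY_CCCR_MSR, 0);                                 /* stop counting */
        return rdmsr(MY_COUNTER_MSR) & ((1ull << 40) - 1);     /* 40-bit result */
    }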


The CCCR's compare, complement, and threshold fields control the counter's threshold detection capabilities. When the compare bit is set, the counter compares the incoming value from an event detector to the value in the threshold field. The complement bit determines the type of comparison. A zero complement bit selects a greater-than comparison. Setting the complement bit to one selects a less-than-or-equal-to comparison. When the comparison of the input and threshold values is true, the counter increments by one. When the threshold feature is enabled, the counter's edge detection capability can also be activated via the edge bit. With the threshold comparisons enabled and the edge bit set, the edge detection hardware compares the last and current threshold comparison results. The counter will increment by one only when the previous threshold comparison was false and the current threshold comparison is true, thus detecting a rising edge on the threshold filter.

Since the Pentium 4 processor is the first implementation of the x86 architecture to support SMT, it also has performance-monitoring support that allows qualification of event counting by the processor's current thread mode. The CCCR's 2-bit thread mode field qualifies event counting by the processor's current thread mode. The four encodings for this field represent each of the processor's possible thread modes.

• In single-thread mode, only one thread is active, and the processor dedicates all of its resources to that thread.
• In dual-thread mode, two threads are active, and they share the processor's resources.
• In no-thread mode, no threads are active. However, some resources must respond to other events in the system, and this activity can cause performance events. For example, consider the case of a dual-processor system where one processor is in no-thread mode, and the other processor is in dual-thread mode. The processor in dual-thread mode could initiate memory transactions that the cache of the processor in no-thread mode might need to service. An example would be a read-for-ownership transaction necessary for modifying the cache line of the processor in no-thread mode.
• The any mode enables event counting independent of the current thread mode.
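Returning to the threshold and edge fields described above, a small worked example may help; it is a sketch with assumed field positions only.

    /* Sketch only: assumed CCCR field positions. */
    #include <stdint.h>

    #define CCCR_ENABLE            (1u << 12)  /* assumed */
    #define CCCR_ESCR_SELECT_SHIFT 13          /* assumed */
    #define CCCR_COMPARE           (1u << 18)  /* assumed: enable threshold test */
    #define CCCR_COMPLEMENT        (1u << 19)  /* assumed: 0 = greater-than test */
    #define CCCR_THRESHOLD_SHIFT   20          /* assumed 4-bit threshold field  */
    #define CCCR_EDGE              (1u << 24)  /* assumed */

    /* Count cycles in which the selected detector reports more than two events:
       compare set, complement clear (greater-than), threshold of two. Setting
       CCCR_EDGE as well would instead count rising edges, that is, the number
       of separate bursts rather than their total duration in cycles.           */
    static const uint64_t cccr_burst_cycles =
            CCCR_ENABLE
          | (2ull << CCCR_ESCR_SELECT_SHIFT)    /* example ESCR select value */
          | CCCR_COMPARE
          | (2ull << CCCR_THRESHOLD_SHIFT);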

Sidebar. Simultaneous multithreading: Improving instruction throughput for high-performance processors

Although modern, high-performance processors employ a wide variety of techniques to improve performance,1 these processors often stall waiting for various conditions to resolve. These stalls leave the processor's pipeline and its multiple execution units idle for many cycles. Probably the most egregious of these stall conditions is a cache miss that can take tens or even hundreds of cycles to resolve. Although compiler writers and processor designers do their best to schedule useful work during a cache miss, these efforts are often not successful. Consequently, the processor seldom realizes its potential throughput (measured in the maximum number of possible instructions completed per cycle). In cases where only one task is ready to execute on the processor, nothing more can be done except to wait for the stall to clear. However, when more than one task is ready to execute (as is often the case for servers), the processor can improve its overall instruction throughput by concurrently sharing resources between two or more ready-to-run tasks.

Simultaneous multithreading
Consider a processor design that lets two tasks concurrently share resources. Such a processor would concurrently have instructions from both tasks flowing through its pipeline and using its resources. If neither task incurs a stall, each task would consume roughly half the processor's resources and execute at roughly half the rate it could if it were the only task on the system. Although this concurrent processing halves each task's execution rate, the processor's instruction throughput is basically the same as the throughput attained with only one task that does not stall. Now consider the case where one of the tasks stalls. At this point, the other task can consume most of the processor's resources and approach the execution rate it would attain if it were the only task. With this sharing technique, the processor's instruction throughput remains high, even though one task stalls. This dynamic sharing by two tasks lets the processor maintain a high instruction throughput in the face of various stall conditions.

This concurrent sharing of a processor's resources by two or more threads is called simultaneous multithreading (SMT),2 and the Pentium 4 Xeon is the first implementation of the x86 architecture that supports SMT.3 A single Pentium 4 Xeon with SMT enabled can concurrently support two execution threads. As such, an SMT Pentium 4 Xeon appears to the operating system as a dual processor, even though it's physically only one processor. Because the operating system believes the system has two processors, it will assign ready-to-run tasks to both processors, allowing the sharing technique described previously to improve the overall instruction throughput.

Although SMT can improve overall throughput, it can also degrade throughput if concurrent sharing of a processor's resources by multiple tasks causes excessive resource conflicts. Simple benchmarks can demonstrate both SMT-induced improvements and degradations in throughput.

Results
For example, let's examine the performance of two benchmarks running in single- and dual-thread modes on the Pentium 4 Xeon. These benchmarks are fp_add_latency and l1_miss. The fp_add_latency benchmark consists of a long series of serially dependent floating-point (FP) adds. The l1_miss benchmark generates a steady stream of independent memory loads, all of which miss the L1 cache. I obtained the throughput data discussed here using the Pentium 4 Xeon's performance-monitoring support.

Consider the fp_add_latency benchmark. On the Pentium 4 Xeon, the FP adder can start a new FP add every cycle; each FP add has a latency of five cycles. Because the serial dependency between the FP adds is the bottleneck for the fp_add_latency benchmark, the benchmark attains a throughput of 0.21 microoperations (µops) per cycle (roughly one FP add every five cycles) when running in single-thread mode. When two copies of this benchmark run in dual-thread mode, the combined throughput is 0.41 µops per cycle, almost twice the throughput attained in single-thread mode. This example shows how SMT can significantly improve overall throughput by essentially doubling processor performance compared to running tasks sequentially in single-thread mode. The fp_add_latency benchmark improves performance because the FP adder's long latency and the serial dependency between the adds keep the FP adder utilization low. These features let two tasks share the FP adder and other processor resources with essentially no interference.

However, the performance of the l1_miss benchmark running in single-thread mode compared with two copies of the l1_miss benchmark running in dual-thread mode shows a very different throughput result. Because the memory loads in the l1_miss benchmark are all independent and each one misses the L1 cache, the memory system is quickly flooded with demand fetch requests from the L1 cache to the L2 cache. In single-thread mode, the l1_miss benchmark attains a throughput of 1.23 µops per cycle. However, when the processor executes two copies of the l1_miss benchmark in dual-thread mode, the combined throughput is only 0.65 µops per cycle. This is roughly half the throughput attainable if the processor ran two copies of the benchmark sequentially in single-thread mode. This happens because the memory system is already significantly loaded with only one copy of the l1_miss benchmark running. Doubling the demand on the memory system by running another copy of the benchmark in dual-thread mode results in excessive conflicts in the memory system, significantly degrading performance.

Thus, the Pentium 4 Xeon's SMT features can be both an advantage and a detriment to overall performance. To obtain the full potential of SMT while avoiding its pitfalls, operating systems must carefully select the tasks that will concurrently share an SMT processor. Researchers have proposed a new task-scheduling approach, symbiotic task scheduling, and investigated the use of hardware performance-monitoring data to guide the job scheduling on SMT processors.4 This study showed symbiotic task scheduling to improve response time performance by as much as 17 percent.

Sidebar references
1. K. Diefendorff, "PC Processor Microarchitecture: A Concise Review of the Techniques Used in Modern PC Processors," Microprocessor Report, 12 July 1999.
2. D.M. Tullsen, S.J. Eggers, and H.M. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," Proc. 22nd Ann. Int'l Symp. Computer Architecture, ACM Press, New York, 1995, pp. 392-403.
3. D. Marr et al., "Hyper-Threading Technology Architecture and Microarchitecture," Intel Technology J., vol. 6, no. 1, 14 Feb. 2002, pp. 1-12.
4. A. Snavely and D.M. Tullsen, "Symbiotic Jobscheduling for a Simultaneous Multithreading Processor," Proc. 9th Int'l Conf. Architectural Support for Programming Languages and Operating Systems, ACM Press, New York, 2000, pp. 234-244; http://www-cse.ucsd.edu/users/tullsen/asplos00.pdf.

Other features
To support EBS, Pentium 4 counters can generate interrupts on counter overflow using the OVF_PMI_T0 and OVF_PMI_T1 fields in the CCCR (see Figure 4). Setting the OVF_PMI_T0 bit will generate a thread 0 interrupt on counter overflow. Setting the OVF_PMI_T1 bit will generate a thread 1 interrupt on counter overflow. Additionally, setting the FORCE_OVF bit configures the counter to overflow on every counter increment. This feature is useful when the user wants to collect profile information about all occurrences of an infrequently occurring event. Because several counters can generate interrupts on overflow, each CCCR contains an OVF bit that indicates that the corresponding counter has overflowed. Upon an interrupt, the performance monitor interrupt's interrupt service routine (ISR) checks the OVF bits in the active CCCRs to determine which counter(s) overflowed. Note that EBS implemented in this manner suffers from the same inaccuracies observed for previous processors.

The Pentium 4 supports cascading counters, which the CCCR's cascade bit controls (see Figure 4). When two counters are cascaded, the first counter counts normally, but the second counter does not begin counting until the first counter overflows. By initializing the first counter to overflow after M events, the second counter to overflow after N events, and the second counter to generate an interrupt on overflow, the user can collect information on correlated events with a fairly fine degree of control. It's also possible to cascade corresponding counters in different counter pairs within a counter block. For example, BPU counter 2 can start counting when BPU counter 0 overflows by setting the cascade bit and clearing the enable bit in counter 2's CCCR. When BPU counter 0 overflows, BPU counter 2 will be enabled to count.
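A counter is typically preloaded with its maximum value minus the desired count so that it overflows after that many additional events. Under that assumption, the following sketch sets up the cascading arrangement just described; the MSR addresses are parameters, and the cascade, enable, and interrupt bit positions are placeholders to be checked against Figure 4.

    /* Sketch only: CCCR field positions are assumptions; wrmsr() is a
       hypothetical privileged helper.                                        */
    #include <stdint.h>

    #define COUNTER_BITS            40
    #define CCCR_ENABLE             (1u << 12)  /* assumed */
    #define CCCR_ESCR_SELECT_SHIFT  13          /* assumed */
    #define CCCR_OVF_PMI_T0         (1u << 26)  /* assumed */
    #define CCCR_CASCADE            (1u << 30)  /* assumed */

    extern void wrmsr(uint32_t msr, uint64_t value);

    /* Preload value that makes a 40-bit counter overflow after 'events' more
       increments.                                                            */
    static uint64_t overflow_after(uint64_t events)
    {
        return ((1ull << COUNTER_BITS) - events) & ((1ull << COUNTER_BITS) - 1);
    }

    /* The first counter counts normally and overflows after m events; the
       second counter (cascade set, enable clear) stays idle until then,
       counts n more events, and raises a thread 0 interrupt on overflow.     */
    void setup_cascade(uint32_t ctr0, uint32_t cccr0, uint64_t sel0,
                       uint32_t ctr2, uint32_t cccr2, uint64_t sel2,
                       uint64_t m, uint64_t n)
    {
        wrmsr(ctr0, overflow_after(m));
        wrmsr(cccr0, CCCR_ENABLE | (sel0 << CCCR_ESCR_SELECT_SHIFT));

        wrmsr(ctr2, overflow_after(n));
        wrmsr(cccr2, CCCR_CASCADE | CCCR_OVF_PMI_T0
                     | (sel2 << CCCR_ESCR_SELECT_SHIFT));
    }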


Obtaining nonspeculative event counts
When counting events, it's often helpful to distinguish between events caused by speculatively executed instructions that never retire versus events caused by instructions that do retire. For example, instructions that are executed in the shadow of a frequently mispredicted branch often do not retire. These nonretiring instructions can significantly increase the counts for certain events when compared to event counts obtained only from instructions that retire. These counts can even exceed the number of events caused by instructions that retire. Using the total event count rather than the separated speculative and nonspeculative event counts can lead a performance analyst to draw flawed conclusions. For example, consider a program that has a frequently accessed data structure that is small enough to fit in the data cache. However, this program also has a frequently executed branch instruction that is often mispredicted. The instructions in the shadow of this mispredicted branch (which never retire) access data that is not needed by the program and that is not in the data cache. If performance analysts couldn't distinguish between the nonspeculative and speculative data cache miss counts, they could draw the erroneous conclusion that data needed by the program don't fit in the cache. This erroneous conclusion could then lead to futile efforts to tune the data structure's size and access patterns in an effort to improve cache performance. However, with the nonspeculative and speculative data counts properly separated, analysts could focus efforts on improving the prediction rate for the problematic branch and/or adjust the code such that speculatively executed instructions in the mispredicted branch's shadow do not miss the data cache.

The Pentium 4 uses tagging mechanisms to provide nonspeculative event counts. As the processor decodes each x86 instruction, it breaks it into a sequence of one or more simple operations called microoperations, or µops. The Pentium 4's tagging mechanisms enable tagging these µops when they cause certain performance events. Once tagged, a µop retains the tag until the µop retires successfully or is cancelled. As the µops pass through the retirement logic, tagged µops can be counted, providing a nonspeculative event count.

The Pentium 4 implements three tagging mechanisms: front end, execution, and replay. The front-end tagging mechanism tags the µops responsible for events occurring early in the pipeline related to instruction fetch, instruction types, and µop delivery from the trace cache. The execution tagging mechanism tags certain classes of µops as they write their results back to the register file. The replay tagging mechanism tags µops that are reissued (replayed) because of conditions such as cache misses, branch mispredictions, dependence violations, and resource conflicts. Both the front-end and replay tagging mechanisms have essentially one tag bit per µop and thus can only handle one event at a time. However, the execution tagging mechanism has four tag bits available per µop and, as such, can concurrently track up to four events. Several machine-specific registers and, in some cases, the tag and tag value bits of the ESCRs (see Figure 3) enable these tagging mechanisms.
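The following fragment sketches the two-step setup implied by the execution tagging mechanism: one ESCR marks the µops of interest with a tag as they write back, and a second detector/counter pair at the retirement stage counts only µops carrying that tag. All field offsets are assumptions keyed to Figure 3, and the at-retirement event encoding is an explicit placeholder; the real encodings appear in Intel's manual.

    /* Sketch only: field offsets and encodings are assumptions/placeholders. */
    #include <stdint.h>

    #define ESCR_T0_USR             (1u << 2)   /* assumed */
    #define ESCR_T0_OS              (1u << 3)   /* assumed */
    #define ESCR_TAG_ENABLE         (1u << 4)   /* assumed */
    #define ESCR_TAG_VALUE_SHIFT    5           /* assumed 4-bit tag value field */
    #define ESCR_EVENT_MASK_SHIFT   9           /* assumed */
    #define ESCR_EVENT_SELECT_SHIFT 25          /* assumed */

    #define AT_RETIREMENT_EVENT     0x00        /* placeholder, not a real encoding */

    /* Step 1: the upstream detector selects a µop class and attaches tag bit 0
       to each matching µop as it writes back its result.                       */
    uint64_t make_upstream_escr(uint64_t event_select, uint64_t event_mask)
    {
        return (event_select << ESCR_EVENT_SELECT_SHIFT)
             | (event_mask   << ESCR_EVENT_MASK_SHIFT)
             | ESCR_TAG_ENABLE
             | (1ull << ESCR_TAG_VALUE_SHIFT)          /* tag value: bit 0 */
             | ESCR_T0_OS | ESCR_T0_USR;
    }

    /* Step 2: a detector in the checker/retire unit counts retiring µops whose
       tag bit 0 is set, yielding a nonspeculative event count.                 */
    uint64_t make_retirement_escr(void)
    {
        return ((uint64_t)AT_RETIREMENT_EVENT << ESCR_EVENT_SELECT_SHIFT)
             | (1ull << ESCR_EVENT_MASK_SHIFT)         /* mask selects tag bit 0 */
             | ESCR_T0_OS | ESCR_T0_USR;
    }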

Precise event-based sampling
The Pentium 4's PEBS support is a significant performance-monitoring advantage. Previous processors only supported IEBS, and the inaccuracy of these profiles often made them useless.

To provide PEBS support, the Pentium 4 takes a different approach to collecting sample data. Previous processors would merely issue a macroinstruction interrupt after a performance event counter overflowed, and the ISR collected the sample data. However, deep pipelines, superscalar execution, and latency between the counter overflow and the actual interrupt cause wide variations in the accuracy of samples collected in this manner. The Pentium 4 uses a microassist and a microcode-assist service routine to capture sample data for PEBS. A microassist changes the source of the next µops to be executed in a fashion analogous to the way an interrupt changes the source of the next instructions to be executed. Microassists typically handle infrequent or problematic conditions that occur during instruction execution (such as raising an exception when a divide-by-zero error occurs). When a microassist is signaled, a microcode-assist service routine handles the condition that caused the microassist. The use of a microassist and a microcode service to collect samples for PEBS avoids the inaccuracies of IEBS.

Implementation
PEBS support on the Pentium 4 works in the following manner. The user allocates a PEBS buffer in memory to hold the samples to collect and sets a bit in the PEBS_ENABLE machine-specific register (MSR) to enable PEBS. The user then configures a µop tagging mechanism to tag certain µops as they flow through the pipeline. The user also configures a counter to count tagged µops as they retire. Once the counter overflows, the Pentium 4's retirement logic examines all retiring µops. When it finds a tagged retiring µop, it forces a microassist to occur just before the tagged µop retires. A microassist is similar to a macroinstruction interrupt in that it halts the normal execution of instructions. However, unlike a software interrupt, the processor handles the microassist entirely in microcode, and no instructions after the microassist-causing instruction will retire before completion of the microassist service routine. The microassist service routine collects the actual sample data by storing the current program counter and the values in the general-purpose registers in the PEBS buffer allocated for the samples. After completing the microassist, the processor resumes normal execution and the event-causing instruction retires.

Using a microassist and a microcode-assist service routine to collect the sample data instead of a macroinstruction interrupt and an ISR avoids all of the inaccuracies of IEBS. Since the microassist is invoked immediately prior to the retirement of the event-causing instruction, the association of the sample data with the event-causing instruction is ensured to be precise. Also, because the microassist is taken just before the event-causing instruction retires, the inaccuracies observed with IEBS caused by the latency between the retirement of the event-causing instruction and the invocation of an ISR to collect the sample data are no longer a factor that can affect the accuracy of the sample data.

The microassist checks a high watermark for the PEBS buffer after storage of each sample. Once the buffer reaches this high watermark, an interrupt is signaled and the corresponding ISR copies the samples from the PEBS buffer to a more permanent location. With PEBS, the ISR copies samples already collected in the PEBS buffer to another location for analysis. It then empties the PEBS buffer to enable collection of more samples. Because the ISR does not collect the actual sample data when using PEBS, the latency of the interrupt no longer affects the accuracy of the event-based samples.

Buffering
Pentium 4 PEBS support has another advantage over IEBS: The overall overhead of event-based sample buffering is lowered because the PEBS ISR processes many samples in one invocation of the ISR. In contrast, IEBS requires a new invocation of the ISR for each sample. This improvement in sample-processing efficiency can either reduce the interference experienced by the monitored system or allow collection of more samples for the same level of interference.
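The following fragment sketches the interrupt-side half of this scheme: once the high watermark triggers the performance-monitor interrupt, the handler copies out every record accumulated since the last interrupt and resets the buffer. The record layout and the bookkeeping structure are assumptions for illustration only; the actual formats are defined in Intel's IA-32 manual.

    /* Sketch only: the PEBS record layout and buffer bookkeeping shown here are
       assumptions for illustration; Intel's manual defines the real formats.   */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct pebs_record {                 /* assumed: program counter plus the     */
        uint32_t eip;                    /* general-purpose register values       */
        uint32_t eax, ebx, ecx, edx;
        uint32_t esi, edi, ebp, esp;
    };

    struct pebs_buffer {                 /* assumed bookkeeping for the buffer    */
        struct pebs_record *base;        /* start of the allocated buffer         */
        struct pebs_record *next;        /* next free slot, advanced by microcode */
        struct pebs_record *threshold;   /* high watermark that triggers the PMI  */
    };

    /* Called from the performance-monitor interrupt handler: copy every sample
       stored since the last interrupt, then empty the buffer so microcode can
       continue storing new samples. One invocation handles many samples, which
       is where the efficiency advantage over IEBS comes from.                   */
    size_t drain_pebs_buffer(struct pebs_buffer *b,
                             struct pebs_record *out, size_t max_records)
    {
        size_t n = (size_t)(b->next - b->base);
        if (n > max_records)
            n = max_records;
        memcpy(out, b->base, n * sizeof(*out));
        b->next = b->base;               /* reset: buffer is empty again */
        return n;
    }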


Benchmark program
To demonstrate the advantages of PEBS over IEBS, I created a simple benchmark program. This benchmark uses a load instruction in a nested loop to cause many L1 data cache misses on the Pentium 4. Although only one load (the target load) in the nested loop actually causes a cache miss, the cache-missing load is preceded and followed by large sequential blocks of loads that always hit the cache. I then created event-based profiles for L1 data cache misses using PEBS and IEBS support on the Pentium 4. Figure 5 summarizes the sampling results from this benchmark.

Figure 5. Comparison of the sample accuracy for precise and imprecise event-based sampling: percentage of samples versus the distance, in instructions, of each event-based sample from the target instruction that misses the L1 cache.

For PEBS, 100 percent of the samples correctly identify the load instruction that misses the L1 data cache (identified as "+0" in Figure 5). In contrast, none of the IEBS samples identify the correct load instruction. Moreover, the instructions identified by IEBS were 65 or more sequential instructions after the target instruction (for example, 46.5 percent of all the IEBS samples identified a load instruction 69 instructions after the load instruction that misses the cache). This inaccuracy of IEBS on the Pentium 4 is much worse than that for the Alpha processor as assessed by the ProfileMe team.2 This greater inaccuracy arises from the Pentium 4's more aggressive design and the increasing difference between processor cycle time and main-memory access time. This greater inaccuracy also emphasizes the importance of the Pentium 4 PEBS support.

Pentium 4 PEBS support also enables the creation of data address profiles, a capability not possible with IEBS. A data address profile identifies locations in data memory that are associated with various memory system performance events, such as cache and translation look-aside buffer misses. The user can recreate the data memory address used by the load or store µop that caused the performance event. Doing so requires decoding the instruction identified by the PEBS sample and using the sampled register values to compute the effective address of the instruction's operands. By building a profile of these data memory addresses, programmers can identify the regions in data memory that cause common memory system performance problems. Once programmers identify these regions, it's often possible to rearrange an application's memory data structures to reduce the frequency of memory system performance problems. In previous processors, it was not possible to create these profiles with IEBS because the IEBS samples often incorrectly identified the instructions causing memory system performance events.

Let's return to the l1_miss benchmark used previously in the IEBS and PEBS discussion to demonstrate the creation and use of data address profiles. To create an address profile for the l1_miss benchmark, I configured the replay tagging mechanism to tag load µops that miss the L1 cache. I also configured a counter to count tagged µops as they retire. This benchmark generated 80 million L1 cache misses. A second benchmark, with PEBS enabled, collected a sample every 100,000 L1 cache misses, generating 800 samples. As mentioned previously, all 800 samples identify the same instruction as missing the L1 cache. The address, opcode, and operands for this instruction are

80484c3: mov (%EAX), %EAX

This load instruction uses the contents of the EAX register as the address for a load access to memory whose result is written into the EAX register. Therefore, simply identifying the EAX values for each sample provides the memory addresses responsible for the majority of the L1 misses in this benchmark.
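A small sketch of how the sampled register values turn into a data address profile for this particular load follows; the record layout is the same assumed layout used in the earlier PEBS fragment, and the instruction address is the one shown above.

    /* Sketch only: assumes the illustrative pebs_record layout used earlier. */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    struct pebs_record {
        uint32_t eip;
        uint32_t eax, ebx, ecx, edx;
        uint32_t esi, edi, ebp, esp;
    };

    /* For "mov (%EAX), %EAX" the effective address is simply the sampled EAX
       value, so printing (or tallying) the EAX field of every sample taken at
       0x080484c3 reproduces the address profile summarized in Table 1.        */
    void print_address_profile(const struct pebs_record *samples, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (samples[i].eip != 0x080484c3)   /* keep only the target load */
                continue;
            printf("L1 miss address: 0x%08x\n", samples[i].eax);
        }
    }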

Table 1 lists the number of samples obtained for each unique EAX value. In this table, the sample count column indicates the number of samples that contain the same EAX value. The extended-instruction-pointer values (0x080484c3) for these samples are all the same as that for the instruction shown previously. The EAX column gives the value that corresponds to the memory address used by the loads causing the L1 cache misses. The table rows are sorted from lowest to highest EAX value.

Table 1. Counts for each unique EAX sampled value.

Sample count    EAX
102             0x08049760
88              0x0804b760
106             0x0804d760
102             0x0804f760
100             0x08051760
100             0x08053760
101             0x08055760
101             0x08057760

By examining the memory address data in Table 1, I can deduce the behavior of the l1_miss benchmark. First, note that the lower 12 bits of each EAX value are the same. Second, the difference between successive EAX values is equal to 8 Kbytes. The Pentium 4 L1 cache is an 8-Kbyte, four-way set associative cache. By repeatedly accessing a set of eight locations in memory that are each 8 Kbytes apart, the benchmark thrashes the Pentium 4's L1 cache, which can only hold the memory contents of four accesses for this set of addresses at any one time. This analysis matches the benchmark's behavior, which causes L1 cache misses while performing one million executions of an eight-iteration loop that moves through a large array with an 8-Kbyte stride. If this were a data address profile from a real application, I could now use this analysis to reduce the frequency of cache misses by either changing the order in which the instructions access the data structure or by rearranging the data structure layout in memory.

Problems, limitations, and opportunities
Although the Pentium 4's performance-monitoring support significantly improves upon that of its predecessors, its current implementations do have problems and limitations. The most significant of these is the lack of documentation for performance-monitoring capabilities, implementation bugs, and undisclosed features. The sheer complexity of the Pentium 4 microarchitecture makes the creation of clear and accurate descriptions of its performance-monitoring events difficult. Early releases of this documentation provided cryptic descriptions of events and also omitted key information. For example, early descriptions of the front-end tagging mechanism did not list any events detectable by the front-end tagging mechanism.

However, Intel routinely updates this documentation online, making significant improvements with each release.3 The most recent releases of this documentation (which also include descriptions of the Xeon processor's hyperthreading capabilities) fill several omissions in the originals and also make numerous attempts to clarify the meaning and proper use of performance-monitoring events and features.

Implementation bugs in the performance-monitoring features have also made some features difficult to use. For example, the logic equations used to detect the IOQ_allocation event differ for different model versions, and the qualification of event counting by user and supervisor privilege levels is ignored for this event. Again, as Intel designers find these bugs, the documentation is updated accordingly.

Lastly, several performance-monitoring features remain undocumented. For example, Intel has only documented one front-end tagging event: µop_type. These features remain undocumented primarily because Intel designers have not yet sufficiently validated them for external release. Although performance-monitoring support is important and useful, it does not significantly impact the number of processors sold. Hence, the company dedicates fewer resources to implementing, validating, and maintaining performance-monitoring features than to admittedly more important processor features, such as functional correctness and overall performance.

Three areas for improvement will become increasingly important as symmetric multiprocessing and hyperthreading become more common and the use of performance-monitoring features grows.

First, the interface and mechanisms that support qualification of event detection by thread ID and event counting by thread mode are not easily extended to support more than two tasks executing concurrently. If future implementations increase the level of hyperthreading, providing these capabilities might require a different approach.

Second, the current performance-monitoring features don't provide support for attributing event counts to specific tasks executed on a machine with a time-sharing operating system. The performance counters count events for all tasks that execute on the processor (here, the word tasks refers to multiple applications ready to run on the processor, not the number of tasks that can concurrently execute on the processor via hyperthreading).


As a result, when the operating system shares the processor among several tasks, it's not possible to attribute event counts to the tasks that caused them without altering the operating system. This is also a problem for the buffering of PEBS samples by the processor. With the current Pentium 4 implementation, if the operating system switches tasks before the PEBS ISR empties the PEBS buffer, the buffer will contain samples from multiple tasks with no way to identify the source of each sample.

Third, the definition of performance-monitoring events, the capabilities of the performance-monitoring support, as well as the software interface to select events and configure various performance-monitoring mechanisms change with each new microarchitecture. For example, the most recent x86 processor implementations from Intel are based upon three distinct microarchitectures, all supporting different events and capabilities:

• the P5 microarchitecture, which debuted as the original Pentium and evolved into the Pentium MMX;
• the P6 microarchitecture, which debuted as the Pentium Pro and evolved into the Pentium II and Pentium III processors along with their Celeron and Xeon versions; and
• the Willamette microarchitecture, which debuted as the Pentium 4 and has evolved into the hyperthreaded Pentium 4 Xeon.

Although many performance events are by definition microarchitecture specific, the changes in event definitions and interfaces make it difficult to develop and maintain software that uses performance-monitoring capabilities across different microarchitectures. A few software tools manage to do this, such as Intel's VTune tool (http://developer.intel.com/software/products/vtune/vtune60/index.htm) and the PAPI tools (http://icl.cs.utk.edu/projects/papi/). However, it would be a significant improvement if a core set of key performance events, mechanisms, and interfaces could be defined and maintained for each new microarchitecture developed.

Intel's Pentium 4 processor provides performance-monitoring features that overcome many of the limitations of the performance-monitoring hardware of previous processors. The new features of the Pentium 4 greatly enhance collection of accurate processor performance data and will enable a new class of performance-tuning tools and capabilities, such as the dynamic tuning of applications and the operating system.

References
1. D. Marr et al., "Hyper-Threading Technology Architecture and Microarchitecture," Intel Technology J., vol. 6, no. 1, 14 Feb. 2002, pp. 1-12; http://developer.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01.pdf.
2. J. Dean et al., "ProfileMe: Hardware Support for Instruction-Level Profiling on Out-of-Order Processors," Proc. 30th Symp. Microarchitecture (Micro-30), IEEE CS Press, Los Alamitos, Calif., 1997, pp. 292-302.
3. IA-32 Intel Architecture Software Developer's Manual, Volume 3: System Programming Guide, order no. 245472, Intel, Santa Clara, Calif., 2002; http://developer.intel.com/design/pentium4/manuals/.

Brinkley Sprunt is an assistant professor of electrical engineering at Bucknell University. His research interests include computer performance modeling, measurement, and optimization. Sprunt has a PhD in electrical and computer engineering from Carnegie Mellon University. He previously worked at Intel, where he was a member of the architecture teams for the 80960, Pentium Pro, and Pentium 4 projects. He is a member of the IEEE and the ACM.

Direct questions and comments about this article to Brinkley Sprunt, Bucknell Univ., Electrical Engineering Dept., Moore Ave., Lewisburg, PA 17837; [email protected].
