PENTIUM 4 PERFORMANCE-MONITORING FEATURES

The Pentium 4's unique performance-monitoring features overcome many limitations and problems found in previous processors. Pentium 4 performance monitoring also supports the processor's simultaneous multithreaded execution features.

Brinkley Sprunt
Bucknell University

Most modern, high-performance processors have special, on-chip hardware that can monitor performance. The features of this monitoring hardware typically include event detectors and counters, qualification of event detection and counting by privilege mode and event characteristics, and support for event-based sampling. However, these features often suffer from a common set of problems, including a small number of counters, inability to distinguish between speculative and nonspeculative events, and imprecise event-based sampling. With the introduction of the Pentium 4 processor, Intel has overcome many of the performance-monitoring limitations of previous processors. The Pentium 4 supports 48 event detectors and 18 event counters, enabling the concurrent collection of a significantly larger set of performance event counts than any other processor. The Pentium 4 also provides several instruction-tagging mechanisms that enable counting of nonspeculative performance events (that is, events generated by instructions that retire). In addition to the support for imprecise event-based sampling (IEBS), the Pentium 4 also provides precise event-based sampling (PEBS) that unambiguously identifies instructions that cause key performance events. PEBS support also lets users create data address profiles for various memory events. Because the Pentium 4 Xeon processor is the first implementation of the x86 architecture to support simultaneous multithreading (SMT), its performance-monitoring support includes qualification of event detection by thread ID and qualification of event counting by thread mode.1

Event detectors and counters
At more than 42 million transistors, the Pentium 4 is significantly larger than its immediate predecessor, the Pentium III, which has 28 million transistors. With such a large design, the previous approaches used by Intel for organizing the performance event detectors and counters became impractical. These processor designs dispersed the event detectors across the chip, close to the units where the performance events occur; event counters were in a central location. This approach required routing all event count signals to the centrally located event counters. To follow a similar approach in the larger Pentium 4 design would have required significantly more silicon area to route the event count buses to the event counters.


Intel's computer architects considered two alternatives: incorporating a counter into the implementation of each event detector, or dispersing several blocks of counters across the chip and letting geographically close event detectors share these blocks. The first approach would eliminate signal routing among event detectors and counters while providing one event counter for each event detector. The second approach would reduce signal routing among event detectors because of the event detectors' close proximity to a counter block. The first approach would have eliminated routing costs and provided the user with more counters that they could use concurrently. Because the silicon area required to implement one counter per event detector was greater than that for implementing several shared groups of counters, Intel's computer architects chose the second approach.

Figure 1. The general structure of the Pentium 4 event counters and detectors. The event detectors control the selection of events and qualification of event detection by privilege mode (OS and/or USR) and thread ID. Each Pentium 4 unit contains two or six event detectors. Each event counter block contains four or six counters, and each counter can select an event detector, perform threshold comparisons and edge detection, and qualify event counting by thread mode. The counters in the counter block can also signal performance monitor interrupts on counter overflow. (The counters are 40 bits wide.)

Structure and function
Figure 1 shows the structure and function of these event detector and counter groups. The units in Figure 1 are logic sets that perform a significant function in the Pentium 4's design, such as the branch prediction or the trace cache units. Implementing the event detectors in pairs supports concurrent detection of the same event for different threads in a simultaneous multithreaded system. Most units have one pair of equally capable event detectors, although the checker/retire unit has three event detector pairs.

Each event detector has a maximum 4-bit-wide output bus, providing a maximum increment per cycle of 15. The Pentium 4 implementation routes output buses from each event detector to an event counter block. Most event counter blocks contain four counters, with the exception of the counter block in the instruction queue unit, which has six counters. Each counter block has two output signals for requesting interrupts on counter overflow for event-based sampling (EBS). With this organization, many event detectors share a set of four (or six) counters. Therefore, one of these groups can only count four (or six) events concurrently.

The event detectors allow the selection and masking of an event and can qualify event detection by privilege mode and thread ID. Event counters support threshold comparison, edge detection, and thread mode qualification. They can also generate performance monitor interrupts on counter overflow to support EBS. Users can configure these features using event select control registers (ESCRs) and counter configuration control registers (CCCRs).


(Figure 2 depicts four counter-block groups: the BPU group, fed by event detectors in the BPU, ISTEER, IXLAT, ITLB, PMH, MOB, FSB, and BSU units; the MS group, fed by the MS, TC, and TBPU units; the FLAME group, fed by the FLAME, FIRM, SAAT, U2L, and DAC units; and the IQ group, fed by the IQ, ALF, RAT, and CRU units, with the CRU providing ESCR2 through ESCR5 in addition to an ESCR0/ESCR1 pair. Each group contains four CCCR/counter pairs except the IQ group, which contains six.)

Figure 2. Interconnections among event detectors and event select control registers (ESCRs), and their associated counters and counter configuration control registers (CCCRs).

Groups
Figure 2 shows the four groups of event detectors and counters implemented in the Pentium 4. Each group consists of event detectors (containing ESCRs), shown for several units in the figure, and a block of counters (containing CCCRs). Designers named each group according to the Pentium 4 unit that contains the counter block for that group. For example, the branch prediction unit gives its name to the group that contains event detectors in the BPU, ISTEER, IXLAT, ITLB, PMH, MOB, FSB, and BSU units, and these detectors connect to a counter block in the BPU that contains four CCCR-counter pairs. Counters in the counter blocks are grouped into pairs, such as counters 0 and 1 in the BPU group. The paired counters share the same set of event detectors and ESCRs. As Figure 2 shows, although the Pentium 4 contains 44 event detectors, it can only count 18 events concurrently because only 18 counters are available. Even so, the ability to count 18 events concurrently is a substantial increase over the capabilities of other processors.

Event selection
An ESCR, shown in Figure 3, configures an event detector. The 7-bit event select field selects the desired event, and the 16-bit event mask field selects a subset of the desired event. For example, the Pentium 4 branch_retired event defines 4 bits in the event mask field to select from the following branch types: taken, not taken, predicted, and mispredicted. To count mispredicted branches, the user would select the branch_retired event and then set the branch-taken-mispredicted and branch-not-taken-mispredicted bits in the event mask.

Figure 3. Event select control register. (Fields: event select, event mask, tag value, tag enable, and the T0_OS, T0_USR, T1_OS, and T1_USR qualification bits.)

The ESCR's low-order 4 bits allow qualification of event detection by privilege mode and thread ID. The thread ID identifies one of the possible two threads concurrently executing via the Pentium 4's SMT capability. See the "Simultaneous multithreading: Improving instruction throughput for high-performance processors" sidebar for a brief introduction to the concepts, goals, and performance implications of the Pentium 4's SMT features. For example, to count all operating system events for both threads, the user would set both the T0_OS and T1_OS bits. Defining each of these bits as a combination of both thread and privilege mode provides a greater degree of control than is possible if the event detector used 2 bits (T0 and T1) to select thread qualification and the other 2 bits (operating system, OS, and user, USR) to select privilege mode. For example, the user could count all user events for thread 0 and all operating system events for thread 1 (by setting the T0_USR and T1_OS bits), which is not possible with individual T0, T1, OS, and USR bits in the ESCR.
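To make the event-selection fields concrete, the following fragment sketches how a monitoring tool might compose an ESCR value for counting retired mispredicted branches on thread 0 in both privilege modes. It is only a sketch: the field offsets, the branch_retired encoding, and the mask bit positions are placeholders that would have to be taken from Figure 3 and Intel's IA-32 manual, and wrmsr() stands in for whatever privileged mechanism actually writes the ESCR.

    /* Sketch only: field offsets and event encodings below are assumptions
       to be checked against Figure 3 and the IA-32 Software Developer's Manual. */
    #include <stdint.h>

    #define ESCR_T1_USR             (1u << 0)   /* assumed low-order qualification bits */
    #define ESCR_T1_OS              (1u << 1)
    #define ESCR_T0_USR             (1u << 2)
    #define ESCR_T0_OS              (1u << 3)
    #define ESCR_EVENT_MASK_SHIFT   9           /* assumed start of 16-bit event mask  */
    #define ESCR_EVENT_SELECT_SHIFT 25          /* assumed start of 7-bit event select */

    #define BRANCH_RETIRED_EVENT        0x06       /* placeholder encoding  */
    #define BR_TAKEN_MISPREDICTED       (1u << 3)  /* placeholder mask bits */
    #define BR_NOT_TAKEN_MISPREDICTED   (1u << 1)

    extern void wrmsr(uint32_t msr, uint64_t value);   /* hypothetical privileged helper */

    void select_mispredicted_branches(uint32_t escr_msr)
    {
        uint64_t escr = 0;
        escr |= (uint64_t)BRANCH_RETIRED_EVENT << ESCR_EVENT_SELECT_SHIFT;
        escr |= (uint64_t)(BR_TAKEN_MISPREDICTED | BR_NOT_TAKEN_MISPREDICTED)
                    << ESCR_EVENT_MASK_SHIFT;
        escr |= ESCR_T0_OS | ESCR_T0_USR;   /* thread 0, both privilege modes */
        wrmsr(escr_msr, escr);
    }

A counter must still be pointed at this detector through its CCCR, as the next section describes.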

Counter configuration
Figure 4 shows the bits and fields in a CCCR that

• enable the counter,
• select the event detector to use as the source for counter increments,
• qualify event counting by the processor's current thread mode,
• configure the counter's threshold and edge detection capabilities,
• configure the interrupt generation on counter overflow, and
• enable the cascading of paired counters.

Figure 4. Counter configuration control register. (Fields: enable, ESCR select, compare, complement, threshold, edge, force overflow, cascade, the OVF status bit, the OVF_PMI_T0 and OVF_PMI_T1 interrupt bits, and the 2-bit thread mode field: single, dual, no, or any thread.)

The CCCR also has one status bit, OVF, to indicate whether an overflow has occurred.

Using an event counter requires enabling the counter and selecting the event detector to supply increment values to the counter. The CCCR's enable bit turns on the counter, and the ESCR select field chooses the event detector output that the counter should use. In Figure 2, the bold numbers in the boxes for each pair of event detectors represent the value that the CCCR's ESCR select field uses to pick the indicated event detector.
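Continuing the earlier sketch, the fragment below programs a hypothetical ESCR/CCCR/counter triple and reads the result. The MSR addresses, the enable-bit position, and the ESCR select offset are again assumptions to be verified against Figure 4 and Intel's manual; wrmsr() and rdmsr() are stand-ins for privileged MSR accessors.

    /* Sketch only: MSR addresses and CCCR field positions are assumptions. */
    #include <stdint.h>

    #define CCCR_ENABLE             (1u << 12)  /* assumed enable-bit position      */
    #define CCCR_ESCR_SELECT_SHIFT  13          /* assumed ESCR select field offset */

    extern void     wrmsr(uint32_t msr, uint64_t value);  /* hypothetical helpers */
    extern uint64_t rdmsr(uint32_t msr);

    #define MY_ESCR_MSR     0x3B8   /* placeholder MSR addresses for one */
    #define MY_CCCR_MSR     0x360   /* detector/counter pair             */
    #define MY_COUNTER_MSR  0x300

    void start_counter(uint64_t escr_value, unsigned escr_select)
    {
        wrmsr(MY_ESCR_MSR, escr_value);      /* program the event detector        */
        wrmsr(MY_COUNTER_MSR, 0);            /* clear the 40-bit counter          */
        wrmsr(MY_CCCR_MSR,                   /* point the counter at the detector */
              ((uint64_t)escr_select << CCCR_ESCR_SELECT_SHIFT) | CCCR_ENABLE);
    }

    uint64_t stop_and_read_counter(void)
    {
        wrmsr(MY_CCCR_MSR, 0);                                 /* stop counting */
        return rdmsr(MY_COUNTER_MSR) & ((1ull << 40) - 1);     /* 40-bit result */
    }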


The CCCR's compare, complement, and threshold fields control the counter's threshold detection capabilities. When the compare bit is set, the counter compares the incoming value from an event detector to the value in the threshold field. The complement bit determines the type of comparison. A zero complement bit selects a greater-than comparison. Setting the complement bit to one selects a less-than-or-equal-to comparison. When the comparison of the input and threshold values is true, the counter increments by one. When the threshold feature is enabled, the counter's edge detection capability can also be activated via the edge bit. With the threshold comparisons enabled and the edge bit set, the edge detection hardware compares the last and current threshold comparison results. The counter will increment by one only when the previous threshold comparison was false and the current threshold comparison is true, thus detecting a rising edge on the threshold filter.

Since the Pentium 4 processor is the first implementation of the x86 architecture to support SMT, it also has performance-monitoring support that allows qualification of event counting by the processor's current thread mode. The CCCR's 2-bit thread mode field qualifies event counting by the processor's current thread mode. The four encodings for this field represent each of the processor's possible thread modes.

• In single-thread mode, only one thread is active, and the processor dedicates all of its resources to that thread.
• In dual-thread mode, two threads are active, and they share the processor's resources.
• In no-thread mode, no threads are active. However, some resources must respond to other events in the system, and this activity can cause performance events. For example, consider the case of a dual-processor system where one processor is in no-thread mode, and the other processor is in dual-thread mode. The processor in dual-thread mode could initiate memory transactions that the cache of the processor in no-thread mode might need to service. An example would be a read-for-ownership transaction necessary for modifying the cache line of the processor in no-thread mode.
• The any mode enables event counting independent of the current thread mode.
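Returning to the threshold and edge fields described above, a small worked example may help; it is a sketch with assumed field positions only.

    /* Sketch only: assumed CCCR field positions. */
    #include <stdint.h>

    #define CCCR_ENABLE            (1u << 12)  /* assumed */
    #define CCCR_ESCR_SELECT_SHIFT 13          /* assumed */
    #define CCCR_COMPARE           (1u << 18)  /* assumed: enable threshold test */
    #define CCCR_COMPLEMENT        (1u << 19)  /* assumed: 0 = greater-than test */
    #define CCCR_THRESHOLD_SHIFT   20          /* assumed 4-bit threshold field  */
    #define CCCR_EDGE              (1u << 24)  /* assumed */

    /* Count cycles in which the selected detector reports more than two events:
       compare set, complement clear (greater-than), threshold of two. Setting
       CCCR_EDGE as well would instead count rising edges, that is, the number
       of separate bursts rather than their total duration in cycles.           */
    static const uint64_t cccr_burst_cycles =
            CCCR_ENABLE
          | (2ull << CCCR_ESCR_SELECT_SHIFT)    /* example ESCR select value */
          | CCCR_COMPARE
          | (2ull << CCCR_THRESHOLD_SHIFT);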

Sidebar. Simultaneous multithreading: Improving instruction throughput for high-performance processors

Although modern, high-performance processors employ a wide variety of techniques to improve performance,1 these processors often stall waiting for various conditions to resolve. These stalls leave the processor's pipeline and its multiple execution units idle for many cycles. Probably the most egregious of these stall conditions is a cache miss that can take tens or even hundreds of cycles to resolve. Although compiler writers and processor designers do their best to schedule useful work during a cache miss, these efforts are often not successful. Consequently, the processor seldom realizes its potential throughput (measured in the maximum number of possible instructions completed per cycle). In cases where only one task is ready to execute on the processor, nothing more can be done except to wait for the stall to clear. However, when more than one task is ready to execute (as is often the case for servers), the processor can improve its overall instruction throughput by concurrently sharing resources between two or more ready-to-run tasks.

Simultaneous multithreading
Consider a processor design that lets two tasks concurrently share resources. Such a processor would concurrently have instructions from both tasks flowing through its pipeline and using its resources. If neither task incurs a stall, each task would consume roughly half the processor's resources and execute at roughly half the rate it could if it were the only task on the system. Although this concurrent processing halves each task's execution rate, the processor's instruction throughput is basically the same as the throughput attained with only one task that does not stall. Now consider the case where one of the tasks stalls. At this point, the other task can consume most of the processor's resources and approach the execution rate it would attain if it were the only task. With this sharing technique, the processor's instruction throughput remains high, even though one task stalls. This dynamic sharing by two tasks lets the processor maintain a high instruction throughput in the face of various stall conditions.

This concurrent sharing of a processor's resources by two or more threads is called simultaneous multithreading (SMT),2 and the Pentium 4 Xeon is the first implementation of the x86 architecture that supports SMT.3 A single Pentium 4 Xeon with SMT enabled can concurrently support two execution threads. As such, an SMT Pentium 4 Xeon appears to the operating system as a dual processor, even though it's physically only one processor. Because the operating system believes the system has two processors, it will assign ready-to-run tasks to both processors, allowing the sharing technique described previously to improve the overall instruction throughput.

Although SMT can improve overall throughput, it can also degrade throughput if concurrent sharing of a processor's resources by multiple tasks causes excessive resource conflicts. Simple benchmarks can demonstrate both SMT-induced improvements and degradations in throughput.

Results
For example, let's examine the performance of two benchmarks running in single- and dual-thread modes on the Pentium 4 Xeon. These benchmarks are fp_add_latency and l1_miss. The fp_add_latency benchmark consists of a long series of serially dependent floating-point (FP) adds. The l1_miss benchmark generates a steady stream of independent memory loads, all of which miss the L1 cache. I obtained the throughput data discussed here using the Pentium 4 Xeon's performance-monitoring support.

Consider the fp_add_latency benchmark. On the Pentium 4 Xeon, the FP adder can start a new FP add every cycle; each FP add has a latency of five cycles. Because the serial dependency between the FP adds is the bottleneck for the fp_add_latency benchmark, the benchmark attains a throughput of 0.21 microoperations (µops) per cycle (roughly one FP add every five cycles) when running in single-thread mode. When two copies of this benchmark run in dual-thread mode, the combined throughput is 0.41 µops per cycle, almost twice the throughput attained in single-thread mode. This example shows how SMT can significantly improve overall throughput by essentially doubling processor performance compared to running tasks sequentially in single-thread mode. The fp_add_latency benchmark improves performance because the FP adder's long latency and the serial dependency between the adds keep the FP adder utilization low. These features let two tasks share the FP adder and other processor resources with essentially no interference.

However, the performance of the l1_miss benchmark running in single-thread mode compared with two copies of the l1_miss benchmark running in dual-thread mode shows a very different throughput result. Because the memory loads in the l1_miss benchmark are all independent and each one misses the L1 cache, the memory system is quickly flooded with demand fetch requests from the L1 cache to the L2 cache. In single-thread mode, the l1_miss benchmark attains a throughput of 1.23 µops per cycle. However, when the processor executes two copies of the l1_miss benchmark in dual-thread mode, the combined throughput is only 0.65 µops per cycle. This is roughly half the throughput attainable if the processor ran two copies of the benchmark sequentially in single-thread mode. This happens because the memory system is already significantly loaded with only one copy of the l1_miss benchmark running. Doubling the demand on the memory system by running another copy of the benchmark in dual-thread mode results in excessive conflicts in the memory system, significantly degrading performance.

Thus, the Pentium 4 Xeon's SMT features can be both an advantage and a detriment to overall performance. To obtain the full potential of SMT while avoiding its pitfalls, operating systems must carefully select the tasks that will concurrently share an SMT processor. Researchers have proposed a new task-scheduling approach, symbiotic task scheduling, and investigated the use of hardware performance-monitoring data to guide the job scheduling on SMT processors.4 This study showed symbiotic task scheduling to improve response time performance by as much as 17 percent.

Sidebar references
1. K. Diefendorff, "PC Processor Microarchitecture: A Concise Review of the Techniques Used in Modern PC Processors," Microprocessor Report, 12 July 1999.
2. D.M. Tullsen, S.J. Eggers, and H.M. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," Proc. 22nd Ann. Int'l Symp. Computer Architecture, ACM Press, New York, 1995, pp. 392-403.
3. D. Marr et al., "Hyper-Threading Technology Architecture and Microarchitecture," Intel Technology J., vol. 6, no. 1, 14 Feb. 2002, pp. 1-12.
4. A. Snavely and D.M. Tullsen, "Symbiotic Jobscheduling for a Simultaneous Multithreading Processor," Proc. 9th Int'l Conf. Architectural Support for Programming Languages and Operating Systems, ACM Press, New York, 2000, pp. 234-244; http://www-cse.ucsd.edu/users/tullsen/asplos00.pdf.

Other features
To support EBS, Pentium 4 counters can generate interrupts on counter overflow using the OVF_PMI_T0 and OVF_PMI_T1 fields in the CCCR (see Figure 4). Setting the OVF_PMI_T0 bit will generate a thread 0 interrupt on counter overflow. Setting the OVF_PMI_T1 bit will generate a thread 1 interrupt on counter overflow. Additionally, setting the FORCE_OVF bit configures the counter to overflow on every counter increment. This feature is useful when the user wants to collect profile information about all occurrences of an infrequently occurring event. Because several counters can generate interrupts on overflow, each CCCR contains an OVF bit that indicates that the corresponding counter has overflowed. Upon an interrupt, the performance monitor interrupt's interrupt service routine (ISR) checks the OVF bits in the active CCCRs to determine which counter(s) overflowed. Note that EBS implemented in this manner suffers from the same inaccuracies observed for previous processors.

The Pentium 4 supports cascading counters, which the CCCR's cascade bit controls (see Figure 4). When two counters are cascaded, the first counter counts normally, but the second counter does not begin counting until the first counter overflows. By initializing the first counter to overflow after M events, the second counter to overflow after N events, and the second counter to generate an interrupt on overflow, the user can collect information on correlated events with a fairly fine degree of control. It's also possible to cascade corresponding counters in different counter pairs within a counter block. For example, BPU counter 2 can start counting when BPU counter 0 overflows by setting the cascade bit and clearing the enable bit in counter 2's CCCR. When BPU counter 0 overflows, BPU counter 2 will be enabled to count.
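A counter is typically preloaded with its maximum value minus the desired count so that it overflows after that many additional events. Under that assumption, the following sketch sets up the cascading arrangement just described; the MSR addresses are parameters, and the cascade, enable, and interrupt bit positions are placeholders to be checked against Figure 4.

    /* Sketch only: CCCR field positions are assumptions; wrmsr() is a
       hypothetical privileged helper.                                        */
    #include <stdint.h>

    #define COUNTER_BITS            40
    #define CCCR_ENABLE             (1u << 12)  /* assumed */
    #define CCCR_ESCR_SELECT_SHIFT  13          /* assumed */
    #define CCCR_OVF_PMI_T0         (1u << 26)  /* assumed */
    #define CCCR_CASCADE            (1u << 30)  /* assumed */

    extern void wrmsr(uint32_t msr, uint64_t value);

    /* Preload value that makes a 40-bit counter overflow after 'events' more
       increments.                                                            */
    static uint64_t overflow_after(uint64_t events)
    {
        return ((1ull << COUNTER_BITS) - events) & ((1ull << COUNTER_BITS) - 1);
    }

    /* The first counter counts normally and overflows after m events; the
       second counter (cascade set, enable clear) stays idle until then,
       counts n more events, and raises a thread 0 interrupt on overflow.     */
    void setup_cascade(uint32_t ctr0, uint32_t cccr0, uint64_t sel0,
                       uint32_t ctr2, uint32_t cccr2, uint64_t sel2,
                       uint64_t m, uint64_t n)
    {
        wrmsr(ctr0, overflow_after(m));
        wrmsr(cccr0, CCCR_ENABLE | (sel0 << CCCR_ESCR_SELECT_SHIFT));

        wrmsr(ctr2, overflow_after(n));
        wrmsr(cccr2, CCCR_CASCADE | CCCR_OVF_PMI_T0
                     | (sel2 << CCCR_ESCR_SELECT_SHIFT));
    }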


Obtaining nonspeculative event counts
When counting events, it's often helpful to distinguish between events caused by speculatively executed instructions that never retire versus events caused by instructions that do retire. For example, instructions that are executed in the shadow of a frequently mispredicted branch often do not retire. These nonretiring instructions can significantly increase the counts for certain events when compared to event counts obtained only from instructions that retire. These counts can even exceed the number of events caused by instructions that retire. Using the total event count rather than the separated speculative and nonspeculative event counts can lead a performance analyst to draw flawed conclusions. For example, consider a program that has a frequently accessed data structure that is small enough to fit in the data cache. However, this program also has a frequently executed branch instruction that is often mispredicted. The instructions in the shadow of this mispredicted branch (which never retire) access data that is not needed by the program and that is not in the data cache. If performance analysts couldn't distinguish between the nonspeculative and speculative data cache miss counts, they could draw the erroneous conclusion that data needed by the program don't fit in the cache. This erroneous conclusion could then lead to futile efforts to tune the data structure's size and access patterns in an effort to improve cache performance. However, with the nonspeculative and speculative data counts properly separated, analysts could focus efforts on improving the prediction rate for the problematic branch and/or adjust the code such that speculatively executed instructions in the mispredicted branch's shadow do not miss the data cache.

The Pentium 4 uses tagging mechanisms to provide nonspeculative event counts. As the processor decodes each x86 instruction, it breaks it into a sequence of one or more simple operations called microoperations, or µops. The Pentium 4's tagging mechanisms enable tagging these µops when they cause certain performance events. Once tagged, a µop retains the tag until the µop retires successfully or is cancelled. As the µops pass through the retirement logic, tagged µops can be counted, providing a nonspeculative event count.

The Pentium 4 implements three tagging mechanisms: front end, execution, and replay. The front-end tagging mechanism tags the µops responsible for events occurring early in the pipeline related to instruction fetch, instruction types, and µop delivery from the trace cache. The execution tagging mechanism tags certain classes of µops as they write their results back to the register file. The replay tagging mechanism tags µops that are reissued (replayed) because of conditions such as cache misses, branch mispredictions, dependence violations, and resource conflicts. Both the front-end and replay tagging mechanisms have essentially one tag bit per µop and thus can only handle one event at a time. However, the execution tagging mechanism has four tag bits available per µop and, as such, can concurrently track up to four events. Several machine-specific registers and, in some cases, the tag and tag value bits of the ESCRs (see Figure 3) enable these tagging mechanisms.
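The following fragment sketches the two-step setup implied by the execution tagging mechanism: one ESCR marks the µops of interest with a tag as they write back, and a second detector/counter pair at the retirement stage counts only µops carrying that tag. All field offsets are assumptions keyed to Figure 3, and the at-retirement event encoding is an explicit placeholder; the real encodings appear in Intel's manual.

    /* Sketch only: field offsets and encodings are assumptions/placeholders. */
    #include <stdint.h>

    #define ESCR_T0_USR             (1u << 2)   /* assumed */
    #define ESCR_T0_OS              (1u << 3)   /* assumed */
    #define ESCR_TAG_ENABLE         (1u << 4)   /* assumed */
    #define ESCR_TAG_VALUE_SHIFT    5           /* assumed 4-bit tag value field */
    #define ESCR_EVENT_MASK_SHIFT   9           /* assumed */
    #define ESCR_EVENT_SELECT_SHIFT 25          /* assumed */

    #define AT_RETIREMENT_EVENT     0x00        /* placeholder, not a real encoding */

    /* Step 1: the upstream detector selects a µop class and attaches tag bit 0
       to each matching µop as it writes back its result.                       */
    uint64_t make_upstream_escr(uint64_t event_select, uint64_t event_mask)
    {
        return (event_select << ESCR_EVENT_SELECT_SHIFT)
             | (event_mask   << ESCR_EVENT_MASK_SHIFT)
             | ESCR_TAG_ENABLE
             | (1ull << ESCR_TAG_VALUE_SHIFT)          /* tag value: bit 0 */
             | ESCR_T0_OS | ESCR_T0_USR;
    }

    /* Step 2: a detector in the checker/retire unit counts retiring µops whose
       tag bit 0 is set, yielding a nonspeculative event count.                 */
    uint64_t make_retirement_escr(void)
    {
        return ((uint64_t)AT_RETIREMENT_EVENT << ESCR_EVENT_SELECT_SHIFT)
             | (1ull << ESCR_EVENT_MASK_SHIFT)         /* mask selects tag bit 0 */
             | ESCR_T0_OS | ESCR_T0_USR;
    }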

Precise event-based sampling
The Pentium 4's PEBS support is a significant performance-monitoring advantage. Previous processors only supported IEBS, and the inaccuracy of these profiles often made them useless.

To provide PEBS support, the Pentium 4 takes a different approach to collecting sample data. Previous processors would merely issue a macroinstruction interrupt after a performance event counter overflowed, and the ISR collected the sample data. However, deep pipelines, superscalar execution, and latency between the counter overflow and the actual interrupt cause wide variations in the accuracy of samples collected in this manner. The Pentium 4 uses a microassist and a microcode-assist service routine to capture sample data for PEBS. A microassist changes the source of the next µops to be executed in a fashion analogous to the way an interrupt changes the source of the next instructions to be executed. Microassists typically handle infrequent or problematic conditions that occur during instruction execution (such as raising an exception when a divide-by-zero error occurs). When a microassist is signaled, a microcode-assist service routine handles the condition that caused the microassist. The use of a microassist and a microcode service to collect samples for PEBS avoids the inaccuracies of IEBS.

Implementation
PEBS support on the Pentium 4 works in the following manner. The user allocates a PEBS buffer in memory to hold the samples to collect and sets a bit in the PEBS_ENABLE machine-specific register (MSR) to enable PEBS. The user then configures a µop tagging mechanism to tag certain µops as they flow through the pipeline. The user also configures a counter to count tagged µops as they retire. Once the counter overflows, the Pentium 4's retirement logic examines all retiring µops. When it finds a tagged retiring µop, it forces a microassist to occur just before the tagged µop retires. A microassist is similar to a macroinstruction interrupt in that it halts the normal execution of instructions. However, unlike a software interrupt, the processor handles the microassist entirely in microcode, and no instructions after the microassist-causing instruction will retire before completion of the microassist service routine. The microassist service routine collects the actual sample data by storing the current program counter and the values in the general-purpose registers in the PEBS buffer allocated for the samples. After completing the microassist, the processor resumes normal execution and the event-causing instruction retires.

Using a microassist and a microcode-assist service routine to collect the sample data instead of a macroinstruction interrupt and an ISR avoids all of the inaccuracies of IEBS. Since the microassist is invoked immediately prior to the retirement of the event-causing instruction, the association of the sample data with the event-causing instruction is ensured to be precise. Also, because the microassist is taken just before the event-causing instruction retires, the inaccuracies observed with IEBS caused by the latency between the retirement of the event-causing instruction and the invocation of an ISR to collect the sample data are no longer a factor that can affect the accuracy of the sample data.

The microassist checks a high watermark for the PEBS buffer after storage of each sample. Once the buffer reaches this high watermark, an interrupt is signaled and the corresponding ISR copies the samples from the PEBS buffer to a more permanent location. With PEBS, the ISR copies samples already collected in the PEBS buffer to another location for analysis. It then empties the PEBS buffer to enable collection of more samples. Because the ISR does not collect the actual sample data when using PEBS, the latency of the interrupt no longer affects the accuracy of the event-based samples.

Buffering
Pentium 4 PEBS support has another advantage over IEBS: The overall overhead of event-based sample buffering is lowered because the PEBS ISR processes many samples in one invocation of the ISR. In contrast, IEBS requires a new invocation of the ISR for each sample. This improvement in sample-processing efficiency can either reduce the interference experienced by the monitored system or allow collection of more samples for the same level of interference.
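The following fragment sketches the interrupt-side half of this scheme: once the high watermark triggers the performance-monitor interrupt, the handler copies out every record accumulated since the last interrupt and resets the buffer. The record layout and the bookkeeping structure are assumptions for illustration only; the actual formats are defined in Intel's IA-32 manual.

    /* Sketch only: the PEBS record layout and buffer bookkeeping shown here are
       assumptions for illustration; Intel's manual defines the real formats.   */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct pebs_record {                 /* assumed: program counter plus the     */
        uint32_t eip;                    /* general-purpose register values       */
        uint32_t eax, ebx, ecx, edx;
        uint32_t esi, edi, ebp, esp;
    };

    struct pebs_buffer {                 /* assumed bookkeeping for the buffer    */
        struct pebs_record *base;        /* start of the allocated buffer         */
        struct pebs_record *next;        /* next free slot, advanced by microcode */
        struct pebs_record *threshold;   /* high watermark that triggers the PMI  */
    };

    /* Called from the performance-monitor interrupt handler: copy every sample
       stored since the last interrupt, then empty the buffer so microcode can
       continue storing new samples. One invocation handles many samples, which
       is where the efficiency advantage over IEBS comes from.                   */
    size_t drain_pebs_buffer(struct pebs_buffer *b,
                             struct pebs_record *out, size_t max_records)
    {
        size_t n = (size_t)(b->next - b->base);
        if (n > max_records)
            n = max_records;
        memcpy(out, b->base, n * sizeof(*out));
        b->next = b->base;               /* reset: buffer is empty again */
        return n;
    }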


Benchmark program
To demonstrate the advantages of PEBS over IEBS, I created a simple benchmark program. This benchmark uses a load instruction in a nested loop to cause many L1 data cache misses on the Pentium 4. Although only one load (the target load) in the nested loop actually causes a cache miss, the cache-missing load is preceded and followed by large sequential blocks of loads that always hit the cache. I then created event-based profiles for L1 data cache misses using PEBS and IEBS support on the Pentium 4. Figure 5 summarizes the sampling results from this benchmark.

Figure 5. Comparison of the sample accuracy for precise and imprecise event-based sampling: percentage of samples versus the distance, in instructions, of each event-based sample from the target instruction that misses the L1 cache.

For PEBS, 100 percent of the samples correctly identify the load instruction that misses the L1 data cache (identified as "+0" in Figure 5). In contrast, none of the IEBS samples identify the correct load instruction. Moreover, the instructions identified by IEBS were 65 or more sequential instructions after the target instruction (for example, 46.5 percent of all the IEBS samples identified a load instruction 69 instructions after the load instruction that misses the cache). This inaccuracy of IEBS on the Pentium 4 is much worse than that for the Alpha processor as assessed by the ProfileMe team.2 This greater inaccuracy arises from the Pentium 4's more aggressive design and the increasing difference between processor cycle time and main-memory access time. This greater inaccuracy also emphasizes the importance of the Pentium 4 PEBS support.

Pentium 4 PEBS support also enables the creation of data address profiles, a capability not possible with IEBS. A data address profile identifies locations in data memory that are associated with various memory system performance events, such as cache and translation look-aside buffer misses. The user can recreate the data memory address used by the load or store µop that caused the performance event. Doing so requires decoding the instruction identified by the PEBS sample and using the sampled register values to compute the effective address of the instruction's operands. By building a profile of these data memory addresses, programmers can identify the regions in data memory that cause common memory system performance problems. Once programmers identify these regions, it's often possible to rearrange an application's memory data structures to reduce the frequency of memory system performance problems. In previous processors, it was not possible to create these profiles with IEBS because the IEBS samples often incorrectly identified the instructions causing memory system performance events.

Let's return to the l1_miss benchmark used previously in the IEBS and PEBS discussion to demonstrate the creation and use of data address profiles. To create an address profile for the l1_miss benchmark, I configured the replay tagging mechanism to tag load µops that miss the L1 cache. I also configured a counter to count tagged µops as they retire. This benchmark generated 80 million L1 cache misses. A second benchmark, with PEBS enabled, collected a sample every 100,000 L1 cache misses, generating 800 samples. As mentioned previously, all 800 samples identify the same instruction as missing the L1 cache. The address, opcode, and operands for this instruction are

80484c3: mov (%EAX), %EAX

This load instruction uses the contents of the EAX register as the address for a load access to memory whose result is written into the EAX register. Therefore, simply identifying the EAX values for each sample provides the memory addresses responsible for the majority of the L1 misses in this benchmark.
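A small sketch of how the sampled register values turn into a data address profile for this particular load follows; the record layout is the same assumed layout used in the earlier PEBS fragment, and the instruction address is the one shown above.

    /* Sketch only: assumes the illustrative pebs_record layout used earlier. */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    struct pebs_record {
        uint32_t eip;
        uint32_t eax, ebx, ecx, edx;
        uint32_t esi, edi, ebp, esp;
    };

    /* For "mov (%EAX), %EAX" the effective address is simply the sampled EAX
       value, so printing (or tallying) the EAX field of every sample taken at
       0x080484c3 reproduces the address profile summarized in Table 1.        */
    void print_address_profile(const struct pebs_record *samples, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (samples[i].eip != 0x080484c3)   /* keep only the target load */
                continue;
            printf("L1 miss address: 0x%08x\n", samples[i].eax);
        }
    }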

Table 1 lists the number of samples obtained for each unique EAX value. In this table, the sample count column indicates the number of samples that contain the same EAX value. The extended-instruction-pointer values (0x080484c3) for these samples are all the same as that for the instruction shown previously. The EAX column gives the value that corresponds to the memory address used by the loads causing the L1 cache misses. The table rows are sorted from lowest to highest EAX value.

Table 1. Counts for each unique EAX sampled value.

Sample count    EAX
102             0x08049760
88              0x0804b760
106             0x0804d760
102             0x0804f760
100             0x08051760
100             0x08053760
101             0x08055760
101             0x08057760

By examining the memory address data in Table 1, I can deduce the behavior of the l1_miss benchmark. First, note that the lower 12 bits of each EAX value are the same. Second, the difference between successive EAX values is equal to 8 Kbytes. The Pentium 4 L1 cache is an 8-Kbyte, four-way set associative cache. By repeatedly accessing a set of eight locations in memory that are each 8 Kbytes apart, the benchmark thrashes the Pentium 4's L1 cache, which can only hold the memory contents of four accesses for this set of addresses at any one time. This analysis matches the benchmark's behavior, which causes L1 cache misses while performing one million executions of an eight-iteration loop that moves through a large array with an 8-Kbyte stride. If this were a data address profile from a real application, I could now use this analysis to reduce the frequency of cache misses by either changing the order in which the instructions access the data structure or by rearranging the data structure layout in memory.

Problems, limitations, and opportunities
Although the Pentium 4's performance-monitoring support significantly improves upon that of its predecessors, its current implementations do have problems and limitations. The most significant of these is the lack of documentation for performance-monitoring capabilities, implementation bugs, and undisclosed features. The sheer complexity of the Pentium 4 microarchitecture makes the creation of clear and accurate descriptions of its performance-monitoring events difficult. Early releases of this documentation provided cryptic descriptions of events and also omitted key information. For example, early descriptions of the front-end tagging mechanism did not list any events detectable by the front-end tagging mechanism.

However, Intel routinely updates this documentation online, making significant improvements with each release.3 The most recent releases of this documentation (which also include descriptions of the Xeon processor's hyperthreading capabilities) fill several omissions in the originals and also make numerous attempts to clarify the meaning and proper use of performance-monitoring events and features.

Implementation bugs in the performance-monitoring features have also made some features difficult to use. For example, the logic equations used to detect the IOQ_allocation event differ for different model versions, and the qualification of event counting by user and supervisor privilege levels is ignored for this event. Again, as Intel designers find these bugs, the documentation is updated accordingly.

Lastly, several performance-monitoring features remain undocumented. For example, Intel has only documented one front-end tagging event: µop_type. These features remain undocumented primarily because Intel designers have not yet sufficiently validated them for external release. Although performance-monitoring support is important and useful, it does not significantly impact the number of processors sold. Hence, the company dedicates fewer resources to implementing, validating, and maintaining performance-monitoring features than to admittedly more important processor features, such as functional correctness and overall performance.

Three areas for improvement will become increasingly important as symmetric multiprocessing and hyperthreading become more common and the use of performance-monitoring features grows.

First, the interface and mechanisms that support qualification of event detection by thread ID and event counting by thread mode are not easily extended to support more than two tasks executing concurrently. If future implementations increase the level of hyperthreading, providing these capabilities might require a different approach.

Second, the current performance-monitoring features don't provide support for attributing event counts to specific tasks executed on a machine with a time-sharing operating system. The performance counters count events for all tasks that execute on the processor (here, the word tasks refers to multiple applications ready to run on the processor, not the number of tasks that can concurrently execute on the processor via hyperthreading).


As a result, when the operating system shares the processor among several tasks, it's not possible to attribute event counts to the tasks that caused them without altering the operating system. This is also a problem for the buffering of PEBS samples by the processor. With the current Pentium 4 implementation, if the operating system switches tasks before the PEBS ISR empties the PEBS buffer, the buffer will contain samples from multiple tasks with no way to identify the source of each sample.

Third, the definition of performance-monitoring events, the capabilities of the performance-monitoring support, as well as the software interface to select events and configure various performance-monitoring mechanisms change with each new microarchitecture. For example, the most recent x86 processor implementations from Intel are based upon three distinct microarchitectures, all supporting different events and capabilities:

• the P5 microarchitecture, which debuted as the original Pentium and evolved into the Pentium MMX;
• the P6 microarchitecture, which debuted as the Pentium Pro and evolved into the Pentium II and Pentium III processors along with their Celeron and Xeon versions; and
• the Willamette microarchitecture, which debuted as the Pentium 4 and has evolved into the hyperthreaded Pentium 4 Xeon.

Although many performance events are by definition microarchitecture specific, the changes in event definitions and interfaces make it difficult to develop and maintain software that uses performance-monitoring capabilities across different microarchitectures. A few software tools manage to do this, such as Intel's VTune tool (http://developer.intel.com/software/products/vtune/vtune60/index.htm) and the PAPI tools (http://icl.cs.utk.edu/projects/papi/). However, it would be a significant improvement if a core set of key performance events, mechanisms, and interfaces could be defined and maintained for each new microarchitecture developed.

Intel's Pentium 4 processor provides performance-monitoring features that overcome many of the limitations of the performance-monitoring hardware of previous processors. The new features of the Pentium 4 greatly enhance collection of accurate processor performance data and will enable a new class of performance-tuning tools and capabilities, such as the dynamic tuning of applications and the operating system.

References
1. D. Marr et al., "Hyper-Threading Technology Architecture and Microarchitecture," Intel Technology J., vol. 6, no. 1, 14 Feb. 2002, pp. 1-12; http://developer.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01.pdf.
2. J. Dean et al., "ProfileMe: Hardware Support for Instruction-Level Profiling on Out-of-Order Processors," Proc. 30th Symp. Microarchitecture (Micro-30), IEEE CS Press, Los Alamitos, Calif., 1997, pp. 292-302.
3. IA-32 Intel Architecture Software Developer's Manual, Volume 3: System Programming Guide, order no. 245472, Intel, Santa Clara, Calif., 2002; http://developer.intel.com/design/pentium4/manuals/.

Brinkley Sprunt is an assistant professor of electrical engineering at Bucknell University. His research interests include computer performance modeling, measurement, and optimization. Sprunt has a PhD in electrical and computer engineering from Carnegie Mellon University. He previously worked at Intel, where he was a member of the architecture teams for the 80960, Pentium Pro, and Pentium 4 projects. He is a member of the IEEE and the ACM.

Direct questions and comments about this article to Brinkley Sprunt, Bucknell Univ., Electrical Engineering Dept., Moore Ave., Lewisburg, PA 17837; [email protected].
