Intel® Technology Journal

Volume 07 Issue 02 Published, May 21, 2003 ISSN 1535-864X Intel® Technology Journal

Intel® Centrino™ Mobile Technology

Mobile PC Platforms Enabled with Intel® Centrino™ Mobile Technology

A compiled version of all papers from this issue of the Intel Technology Journal can be found at: http://developer.intel.com/technology/itj/index.htm Mobile PC Platforms Enabled with Intel® Centrino™ Mobile Technology

Gordon Chinn, Mobile Platforms Group, Intel Corporation Sanjiv Desai, Mobile Platforms Group, Intel Corporation Eric DiStefano, Mobile Platforms Group, Intel Corporation Krishnan Ravichandran, Mobile Platforms Group, Intel Corporation Shreekant (Ticky) Thakkar, Mobile Platforms Group, Intel Corporation

Index words: Bluetooth™, Intel® Centrino™ mobile technology, Intel® WCS, WLAN, battery, coexistence, electronic cooling, notebook, power, validation, wireless

ABSTRACT When multiple wireless technologies such as Bluetooth∗  Wireless LAN, and Wireless WAN are embedded in the Mobile PC platforms enabled with Intel Centrino™ same mobile system, the risk of interference between these mobile technology incorporate innovative technologies to radio frequency (RF) technologies greatly increases significantly increase the mobility of notebook PCs. In because of their close proximity to each other. In this this paper, we describe the following innovations: paper we discuss general technical RF challenges for • How to reduce interference between wireless mobile platforms and describe innovative techniques used communication technologies using the Intel® Wireless on platforms enabled with Intel Centrino mobile Coexistence System (Intel® WCS). technology. These techniques mitigate interference by using the Intel Wireless Coexistence System (Intel WCS). • Two techniques for significantly extending average platform battery life. We also discuss new and advanced platform power- management features for improved battery life. Two new • Simple thermal solutions for thinner and lighter form techniques are described: asynchronous voltage regulator factors. (VR) control with a power status indicator (PSI), and  Finally, we outline a comprehensive platform-level multiphase Intel Mobile Voltage Positioning (IMVP). validation process that helps deliver a stable platform with Platforms enabled with Intel Centrino mobile technology Intel quality. also incorporate state-of-the-art techniques to enable compact, lightweight thermal solutions in smaller form INTRODUCTION factor systems. The design includes an efficient remote Platforms enabled with Intel Centrino mobile technology heat exchanger, an air flow set aside for system cooling, incorporate innovative technologies to significantly and system venting, that together cool the processor and increase the mobility of notebook PCs. This paper the system as a whole, using only a single fan. The latest describes innovations in reducing interference between interface materials, heat pipe technology, and fans provide wireless communication technologies, extending average the required performance in a small package. platform battery life, and in driving a thermal solution for Finally, Intel validation best-known methods (BKMs) are thinner and lighter form factors. applied to ensure that these platforms deliver optimum

∗  Intel and Intel Centrino are trademarks or registered Other names and brands are the property of their trademarks of Intel Corporation or its subsidiaries in the respective owners United States and other countries.  Intel is a trademark of Intel Corporation or its subsidiaries in the United States and other countries.

Mobile PC Platforms Enabled with Intel® Centrino™ Mobile Technology 6

Intel Technology Journal, Vol. 7, Issue 2, May 2003 system performance with multiple wireless technologies Intel Wireless Coexistence System operating concurrently within the enterprise as well as While antenna isolation provides some interference when connected to hotspots outside the enterprise. mitigation between 802.11b and Bluetooth radios, performance is still impacted to some degree. For WIRELESS TECHNOLOGIES example, 802.11b data throughput is degraded by Platforms enabled with Intel Centrino mobile technology Bluetooth interference, even with 40dB of antenna incorporate innovative radio frequency (RF) technologies isolation, as shown by the lower curve in Figure 1. The to significantly increase the mobility of notebook PCs. upper curve shows the ideal throughput when no These technologies include Wireless Personal Area Bluetooth interference is present. Networking (WPAN), e.g., Bluetooth; Wireless Local Area Networking (WLAN), e.g., 802.11a and 802.11b; 802.11b Throughput and Wireless Wide Area Networking (WWAN), e.g., With Bluetooth Interference 6.00 General Radio Packet Service (GPRS). Integrating multiple wireless technologies greatly enhances 5.00 ) connectivity to the enterprise Intranet and the Internet, as 4.00 well as to peripherals such as PDAs, printers, and 3.00 headsets. 2.00 Data Rate(Mbps Radio Frequency Challenges in Mobile 1.00 0.00 Notebook PCs 1.00 10.00 100.00 1000.00 Range (meters) When incorporating RF technologies into a digital computing environment such as a notebook PC, traditional Figure 1: 802.11b throughput with Bluetooth system designers are faced with new challenges previously interference relegated to RF engineers. Antennas must be added for each frequency band. Since notebook PCs generally To further mitigate interference between Bluetooth and consist of many metallic components such as framing 802.11b, Intel WCS was developed as one of the Intel structures, hard drives, displays, antenna radiation patterns Centrino-enabling technologies. Intel WCS consists of a can be greatly distorted. This distortion can potentially combination of antenna isolation techniques, a channel cause significant RF performance variations, as the exchange (Figure 2), and priority signaling between an notebook PC’s physical orientation is varied relative to the Intel PRO/Wireless Network Connection 802.11 solution location of the intended communicating device such as an and a third-party Bluetooth module. Phase 1 of Intel access point (AP), a peripheral such as a printer, and even WCS has been implemented: it mitigates Bluetooth another notebook PC. interference and restores 802.11b data throughput nearly completely as shown by the chart in Figure 1. These RF technologies consist of very sensitive receivers and potentially high-power transmitters. When multiple RF technologies are embedded (co-existent) in the same 802.11b Bluetooth notebook PC, the risk of interference between these NIC Channel Module technologies is greatly increased as the transmitters of one Info radio can couple with the sensitive receivers of another radio, even if they operate in different frequency bands. Bluetooth 802.11b

Technologies such as Bluetooth and 802.11b have even ( dBm) greater interference risks since they both operate in the same 2.4GHz frequency band. Antenna isolation techniques mitigate some of these interference issues, but with multiple antennas and decreasing form factor sizes, Signal Strength optimal placement of antennas is not always possible. New interference-mitigating technologies are then 2483 MHz 2400 MHz Frequency required. For example, Intel Centrino mobile technology (MH ) includes the Intel Wireless Coexistence System (Intel Figure 2: Intel WCS Channel Exchange WCS), which significantly mitigates the interference Intel WCS is designed to complement the Adaptive between 802.11b and Bluetooth technologies. Frequency Hopping (AFH) interference mitigation algorithm being developed by the Bluetooth Special Interest Group (SIG). AFH will mitigate the impact of

Mobile PC Platforms Enabled with Intel® Centrino™ Mobile Technology 7

Intel Technology Journal, Vol. 7, Issue 2, May 2003

802.11b on Bluetooth data throughput, but only between EXTENDED BATTERY LIFE AFH-compliant Bluetooth devices as shown in Figure 3. TECHNIQUES

Bluetooth Throughput The battery life of a mobile PC is a function of the power With 802.11b Interference consumed by each of its components and the capacity of the battery. Figure 5 shows the breakdown of the power 800 700 consumption of a typical notebook while running the 600 battery life benchmark Ziff-Davis BL4.01. 500 400 Average Power during Battery Life Benchmark 300 CP

Data Rate, Kbps Rate, Data 6 200 3 U % MC 100 % 3 ICH % H 0 35 10 GFX- 0.1 1 10 100 % % Memorext Range (meters) 2 HDy % Figure 3: Bluetooth throughput with 802.11b DVD 10 WLAND interference % Clock Gen 3 % Power Supply loss Phase 2 of Intel WCS will add Bluetooth priority 10 2 % % Rest of platform signaling from the Bluetooth module to the Intel 10 6 % LC PRO/Wireless network connection, resulting in a % D restoration of connection reliability for both AFH and non-AFH (legacy) Bluetooth devices as shown in Figure Figure 5: Platform average power distribution 4. As can be seen, some of the largest power consumers are

Bluetooth Connection Performance the display (LCD), power supply, hard disk drive (HDD) with 802.11b interference and graphics. It is also noteworthy that the CPU is one of 90 the lowest consumers of battery life power (only consumes 80 6%) due to the advanced power-management built into the 70 ® 60 processor (Intel SpeedStep technology, QuickStart, Deep 50 Sleep, etc.) To learn more about processor power 40 management, please see “Intel® Pentium® M Processor 30 20 Power Estimation, Budgeting, Optimization, and

Page Scan Failure Rate, % 10 Validation” in this issue of the Intel Technology Journal. 0 0.1 1 10 100 On Intel Centrino mobile technology, Intel developed and Range (meters) enabled the following techniques to reduce the power Figure 4: Bluetooth connection performance consumption of some of these subsystems and to help increase the battery life: Bluetooth Enabling • Asynchronous Voltage Regulator (VR) Control with Multiple third-party Bluetooth silicon and module vendors Power Status Indicator # have been enabled to be compatible with Intel WCS, • Multiphase Intel Mobile Voltage Positioning (IMVP) including SiliconWave and Cambridge Silicon Radio. technology Extensive verification and validation testing has been completed with these silicon vendors, providing pre- validated system solutions to the customer. Asynchronous Voltage Regulator Control with In summary, integrating multiple RF technologies into PSI# mobile notebook PCs provides new challenges to systems The fully power-optimized CPU represents a very designers, including antenna gain uniformity and complicated load to the voltage regulator (VR). This is interference mitigation. Intel WCS, an Intel Centrino because its current draw can range from a few mAs to mobile technology, provides powerful interference several tens of Amperes, depending on the workload put mitigation between 802.11b and Bluetooth and enhances on the processor. The transitions from low to high AFH. currents can also occur rapidly (in ms). This makes it difficult for the VR designer to maintain high-power

Mobile PC Platforms Enabled with Intel® Centrino™ Mobile Technology 8

Intel Technology Journal, Vol. 7, Issue 2, May 2003

conversion efficiency in the full operating range. Referring to Figure 7, the logic driving the gate of QSYN is Typically, the efficiency is maximized at the highest end gated, in addition to CO, by a Power Status Indicator of the current range. However, the power-conversion (PSI). PSI, when LO, keeps the gate of QSYN low efficiency of the VR drops off quickly toward the low- regardless of CO level. When PSI is HI, the Mosfet power condition. This is because the standby or quiescent driver drives the gate of QSYN normally. In this (zero output loading) power of the VR is approximately application, the STP_CPU# signal can be used in place of 0.2-0.5W, which is comparable to the low-power CPU a PSI. load, causing the efficiency to be <50%. The Geyserville-III transition (Performance Optimized In a battery life benchmark, such as the ZDBL4.01, the Mode to Battery Optimized Mode) also requires processor spends >80% of the time in the low-current additional modifications due to the voltage differences state (C3). Therefore, it is very important to maximize the between the two modes. Figure 8 shows the final circuitry power-conversion efficiency of the VR during the C3 state that is used in Intel Centrino mobile technology. to extend the battery life of the notebook. The VR Cntr Mosfet Driver CO V Asynchronous VR control with PSI# is a method used to Q CORE SW (1.35V) increase the efficiency of the VR during the C3 state, RS without affecting the efficiency during the high-power L CPU STP_CPU# Q state. SYN C A VR is used to convert an unregulated high input VID0 voltage, either the AC/DC adapter or battery, to a suitable regulated DC voltage rail to power the core of the CPU. VID1 In this particular application, the regulated voltage is 1.35V ± 7.5%, including VR DC error, noise, and VID2 transient response error, etc. A switching voltage One-Shot regulator is used to produce high-power conversion VID3 efficiency. Due to the high current demand of the CPU, this VR operates in a continuous synchronous topology. VID4

VID5 VDC (10.8V) VR Cntr Mosfet Driver CO V Q CORE SW (1.35V) RS Figure 8: STP_CPU# override with VID code L CPU Optimized Intel Mobile Voltage Positioning QSYN C A CPU enabled with Intel SpeedStep technology operates at a different CPU core frequency. For example, when Figure 6: Synchronous VR operating with an AC adapter plugged into the notebook, Referring to Figure 6, a VR is used to regulate the battery battery life is not an issue, as power is always available voltage (8.1-12.6V, 10.8V nominally) down to 1.35V for (although power consumption can limit performance as the CPU core rail. A VR controller monitors the output the internal temperature may rise.) The CPU is placed voltage, VCORE, and compares it to an internal 1.35V into a Performance-Optimized Mode (POM), where the reference. core frequency is high (i.e., 1GHz). When operating from the battery as the input power source, the CPU is placed VDC (10.8V) into the Battery-Optimized Mode (BOM) where the core frequency is dropped to a lower value (i.e., 600MHz) to VR Cntr Mosfet Driver CO V Q CORE reduce power consumption. The CPU power demand for SW (1.35V) RS POM is higher than that of BOM. More importantly, the L CPU current demand for POM is much higher than for BOM. This means the output-decoupling requirement for POM is PSI QSYN C much worse than that for BOM. The CPU voltage tolerance is ± 7.5% for both POM and BOM because the Figure 7: Non-synchronous VR with CPU power Intel Mobile Voltage Positioning (IMVP) load line remains constant for both POM and BOM. Therefore, status indicator when switched to BOM, using the GMUXSEL signal

Mobile PC Platforms Enabled with Intel® Centrino™ Mobile Technology 9

Intel Technology Journal, Vol. 7, Issue 2, May 2003

(with low indicating that the CPU is in BOM and high CPU Intel Mobile Voltage Positioning Feature indicating POM), the output voltage can be “shifted” Results down to reduce power consumption. Figure 10 shows a battery from a charged status, 12.0V, Intel Mobile Voltage Positioning Design down to a discharged status, ~8.5V. The average power Implementation over the entire ZDBL run of 3 hours, 40 minutes is 11.77W. Figure 9 shows positive offset voltage is controlled by ROFFSET-P. A small-signal mosfet switch is added in series 13 25 with this resistor and ground. During POM, GMUXSEL 12.5 12 20 is high connecting ROFFSET-P to ground, providing a positive offset voltage. During BOM, GMUXSEL is low 11.5 turning off the mosfet, removing the positive offset 11 15 voltage. 10.5 10 10

During DeepSleep state, a negative offset voltage is 9.5 Power (W) Battery Battery Voltage(V) Voltage applied through ROFFSET-N and pulled down with the 9 5 Power STP_CPU# signal. 8.5 8 0 INTEL MOSFET DRIVER VDC Vcore Intel TG CO L RCS 3:17:32 PM 3:28:50 PM 3:40:08 PM 3:51:26 PM 4:02:44 PM 4:14:02 PM 4:25:20 PM 4:36:38 PM 4:47:56 PM 4:59:14 PM 5:10:32 PM 5:21:50 PM 5:33:08 PM 5:44:26 PM PM 5:55:44 6:07:02 PM PM 6:18:20 PM 6:29:38 6:40:56 PM PM 6:52:14 Mobile Voltage Time (s) BG Positioning C DAC DAC Figure 10: Discharge battery voltage/power profile– C DAC baseline R CORE COMPARATOR DAC R CMPREF CORE R 13 25 CMP OH R 12.5 OFFSET_P R OFFSET_N 12 20

GMUXSEL 11.5

11 15 STP_CPU# 10.5 10 10

Figure 9: IMVP implementation 9.5 Battery Power (W) Battery Voltage (V) Voltage Battery 9 Voltage 5 Power RESULTS 8.5 The techniques were implemented on two IBM T20 8 0 notebooks with 600 MHz Pentium III processors. All

data, unless mentioned otherwise, are averaged power 1:00:51 PM 1:12:19 PM 1:23:47 PM 1:35:15 PM 1:46:43 PM 1:58:11 PM 2:09:39 PM 2:21:07 PM 2:32:35 PM 10:54:43 AM 11:06:11 AM 11:17:39 AM 11:29:07 AM 11:40:35 AM 11:52:03 AM 12:03:31 PM 12:14:59 PM 12:26:27 PM 12:37:55 PM 12:49:23 PM numbers measured during the entire battery discharge (~3- Time (s) 4 hours duration) during a Ziff-Davis Battery Life benchmark, ZDBL4.01. Figure 11: Discharge battery power profile with CPU To establish accurate improvement (if any), a baseline IMVP VR measurement is first established. Two complete ZDBL Measurements after CPU Intel Mobile Voltage runs were performed on a new T20 notebook platform Positioning Modification show the average power over the with the same battery pack charged overnight each time. entire ZDBL run is 11.49W. This is a 280mW power An average runtime was calculated along with an average reduction in the CPU during the ZDBL run (see Figure platform/battery power measurement. 11).

 Pentium III is a registered trademark of Intel Corporation or its subsidiaries in the United States and other countries.

Mobile PC Platforms Enabled with Intel® Centrino™ Mobile Technology 10

Intel Technology Journal, Vol. 7, Issue 2, May 2003

Non-Synchronous Voltage Regulator with PSI# THERMAL DESIGN Results The Intel Centrino mobile technology platform can be Figure 12 shows the battery discharge power over the cooled using a simple, compact, light-weight solution. entire 3 hour, 15 minute ZDBL run of the IBM T20 The approach, though simple, is fundamentally important notebook used for this study. to successful design. In this section, we discuss a strawman set of requirements, the overall cooling approach, cooling performance expectations, and the resulting cooling solution design. 23 21 For the purposes of this discussion, we define a set of 19 17 strawman requirements comprising the system form- 15 factor, a platform power scenario, and a set of boundary 13 conditions. We assume a system thickness of 25 mm 11 9 (total base and display) with a footprint of 320x260 mm, 7 which accommodates a 15" display. This thin form-factor 5 allows for approximately a 13 mm inside height below the keyboard, for a solution design. We further assume a power scenario in which the processor may be at 24.5W, 3:49:56 PM 4:01:16 PM 4:12:36 PM 4:23:56 PM 4:35:16 PM 4:46:36 PM 4:57:56 PM 5:09:16 PM 5:20:36 PM 5:31:56 PM 5:43:16 PM 5:54:36 PM 6:05:56 PM 6:17:16 PM 6:28:36 PM 6:39:56 PM 6:51:16 PM 7:02:36 PM and all the other components may be at 18.5W, for a total of 43W in the base of the system (see Figure 14). Figure 12: Discharge battery power profile–baseline Platform Power (High Power Application)

The average power over the entire ZDBL run of 3 hours, CPU MC 19 minutes, is 10.94W. 14 9% IC With the non-synchronous VR with PSI# (STP_CPU#) 1% GFX- 1% Memor 1% implemented, the discharge profile is as shown in Figure HDD 2% 13: 2% 57 DVD 7% WLAN 1% Clock generator 5% 23 Power Supply 21 Other 19 17 15 Figure 14: Platform TDP power 13 11 Finally, important boundary conditions are the maximum 9 7 temperature difference allowed between the component 5 and the user ambient ∆Tj-a. For the CPU, we assume 65ºC. Additionally, any given system design must consider chassis surface temperatures as well as acoustic

2:44:19 PM PM 2:56:36 3:08:53 PM 3:21:10 PM 3:33:27 PM PM 3:45:44 3:58:01 PM 4:10:18 PM 4:22:35 PM PM 4:34:52 4:47:09 PM 4:59:26 PM 5:11:43 PM PM 5:24:00 5:36:17 PM 5:48:34 PM 6:00:51 PM PM 6:13:08 limits on fan noise, for ergonomic constraints.

Figure 13: Discharge battery power profile with The overall thermal solution approach is illustrated in STP_CPU# VR Figure 15:

Air Outlet The average power over the entire ZDBL run of 3 hours, Heat Exchanger Air Outlet Heat Pipe 36 minutes, is 10.36W. This is a 580mW power reduction Fresh system in the CPU VR. airair flowflow Air Inlet SystemSystem airair outletoutlet In summary, as demonstrated in the above experiments, the two features implemented on the IBM T20 can result Fan NB Inside Air in over 1W of power savings during the battery life benchmark (ZDBL4.01) run. While the benefits may vary with different OEM systems, the expectation is that these CPU PCB Spreader Plate will save power and improve the life of the battery.

Figure 15: Notebook layout illustration

Mobile PC Platforms Enabled with Intel® Centrino™ Mobile Technology 11

Intel Technology Journal, Vol. 7, Issue 2, May 2003

It includes the now conventional remote heat exchanger and the effective θj-hp translates to approximately approach for component cooling, but with fresh air flow 0.9ºC/W. capability set aside to provide cool air to the system. The Prototypes from fan and heat exchange suppliers indicate fresh air flow is important to cool components and θ θ especially to keep skin temperatures cool. The passive that 1.1ºC/W can be achieved for hp-a. However, the hp-a cooling limit (i.e., dissipation by radiation and natural value must be de-rated to allow for fresh air to be set aside convection from the base of the system) is approximately for system cooling. Although the passive limits are close 18W, allowing for a chassis skin temperature of roughly to those accommodated by the maximum passive limits, 15ºC above ambient as is observed in typical system spreading is not perfect, and an allowance must be made designs; close to the non-processor power assumed above. for system constraints on effective passive dissipation. We assume 15% θhp-a performance set aside to The overall cooling performance expectations are complement system cooling. It is important to note that identified using the equation below: the air flow is provided directly from the exhaust of the ∆ − fan and flows underneath the motherboard to best protect T j−a Tsystem skin temperature. Two very important features of the θ − = j a system design provide short and direct air flow paths into PCPU the fan and out of the heat exchanger and adequate system where PCPU is the power we need to cool and ∆Tj-a is as air outlet vents, respectively. discussed above. T represents the net impact of the system Figure 16 shows a design rendition of the solution. The power of the rest of the platform on the component, in this weight and volume of prototypes are approximately 55g case the processor and the thermal solution. The value of and 4.3cm3, respectively, approximately one-third the T is dependent upon system design and is best system weight and size of solutions for higher-power processors. approximated prior to design by using system simulations. Assuming 10ºC for the strawman scenario discussed In summary, the thermal solution for Intel Centrino mobile above, the required performance θj-a is 2.25°C/W. technology platforms is compact and lightweight, utilizing only a single fan to cool the processor and the system. Figure 16 illustrates schematically the remote heat exchanger cooling solution and important locations of Temperature Location θ temperatures. The overall performance j-a is further Ambient approximated by the equation below: Fan θ ≈ θ + θ j−a j−hp hp−a 55x55 mm where θ is the thermal resistance from the component j-hp System air flow junction to the heat pipe and θhp-a is the remaining resistance from the heat pipe to the ambient temperature Heat Exchanger Heat Pipe through the heat exchanger. θj-hp includes resistance of Processor conduction through silicon, the interface material and contact resistances, and the resistance of the evaporator (region over the heat source) of the heat pipe. With new thermal interface materials, suitable thermal attach and Figure 16: Thermal solution heat pipe design, a performance of approximately 0.45 C- cm2/W can be achieved on a reference die. For PLATFORM-LEVEL VALIDATION performance verification, direct communication with This section outlines the comprehensive platform-level respective suppliers is recommended. All components of validation process that helped deliver a stable platform θ j-hp are affected by the source they are cooling and are with Intel quality. Intel validation BKMs were applied to θ  ® considered in j-hp. Although the Intel Pentium M ensure that a mobile platform with Intel Centrino mobile processor uses 0.13 micron process technology, the design technology delivered optimum system performance with is optimized to mitigate the effects of a smaller source, multiple wireless technologies operating concurrently within the enterprise as well as when connected to hotspots outside the enterprise.

 The goal for the platform with Intel Centrino mobile Pentium M is a registered trademark of Intel technology was extended validation of all platform Corporation or its subsidiaries in the United States and components to maximize reliability and interoperability. other countries.

Mobile PC Platforms Enabled with Intel® Centrino™ Mobile Technology 12

Intel Technology Journal, Vol. 7, Issue 2, May 2003

This validation was primarily divided into the following Figure 18: Intel Centrino technology platform three categories (see Figure 17): integration and validation 1. Validation of Intel and Third-Party Vendor (TPV) The focus of platform-level validation was the integration Intel Centrino mobile technology-enabled platform of Intel Centrino mobile technology components on an components. Intel Customer Reference Board platform and the performance of comprehensive compatibility and 2. Validation of Intel Centrino mobile technology- interoperability validation under mobile-specific stress test enabled platforms on customer reference boards. conditions as well as under typical usage-model scenarios. 3. End-User system validation. Figure 19 shows some examples of platform validation tests.

Systematic Approach to Validation Intel Platform Validation Tests Overview Antennas

Security S/W Test Category Test Examples Infrastructure Program BT Modules Pentium-M Processor and !Verify C3 and C4 State transitions occur with AAA Vendors the wireless device and its software Intel System Power Management Testing !Suspend and resume system from S1, S3, and Drivers Platform S4 states many times and verify the Wireless IT Seed Program Device and its software is in consistent state Display,Security LAN, h/w etc Validation TPM, Port Tokens Intel Si & Drivers OEM/ODM System Multiple Wireless Integration Perform the Banias Processor and System Banias, Odem, MontaraGM, Power States testing with all the wireless Calexico Validation Testing elements present in the system Display, LAN, PAN, Audio, Storage Stress Testing Verify wireless device and its software is in Component Validation Platform Validation End User System Validation consistent state while running CPU and network intensive applications like DVD Intel validations Si & Drivers in Intel validates full systems Intel runs Intel validation suite playback, and Packet Blaster own validation lab in own validation lablab on 5 OEM-provided notebooks 3rd parties & fellow travelers 12 configurations IT departments test seed units provided Banias platforms and validated Infrastructure programs validation consultation to run validate seed units at hot spots their own validation testing

Figure 19: Platform validation test examples Figure 17: Systematic approach to validation The third type of validation was platform-level validation Figure 18 shows Intel and TPV delivered components on using Intel Centrino mobile technology-based OEM a platform with Intel Centrino mobile technology. Each systems. The focus was testing in a real end-user Intel component underwent thorough comprehensive environment. For example, multiple Intel Centrino system and compatibility validation using Intel validation mobile technology-based systems with multiple platform BKMs traditionally used on processors and chipsets. To configurations were tested for compatibility with ensure high-quality validation of TPV components, Intel industry- standard Wireless LAN (WLAN) access points worked with TPVs to define comprehensive test plans for at various hotspots. validation of their components using Intel validation BKMs. Intel Wireless Verification Program Centrino Platform Integration & Validation As part of establishing Intel Centrino mobile technology as the premier Wireless LAN notebook PC client, Intel Internal to NB External to NB worked with many leading companies to accelerate the VPN, VPN, SECURITY ISV SECURITY Antennae ROAMING Stack/Driver/App deployment of 802.11 wireless communication STACKS STACKS capabilities in private and public spaces. The overall Bluetooth Drivers/ Bluetooth Utilities Device / AP goal is to help reduce the two major Wi-Fi adoption Drivers/ Utilities Intel WLAN 802.11x AP barriers: availability and awareness. The program works OS Drivers/ with traditional and emerging Wi-Fi providers and key WWAN Utilities location owners to achieve the following: Drivers/ Utilities LAN LAN 1. Deploy Wi-Fi in locations relevant to the business traveler, such as airports, hotels, and franchise chains. Intel Pentium-M Chipset 2. Verify Intel Centrino mobile technology in the BIOS / Power Management Centrino Platform installed infrastructure.

Mobile PC Platforms Enabled with Intel® Centrino™ Mobile Technology 13

Intel Technology Journal, Vol. 7, Issue 2, May 2003

3. Raise the awareness of the availability of the service incorporated into this paper; and Francis Truntzer for co- in these locations by directed co-marketing programs developing and driving wireless platform validation and a signage program. strategy.

CONCLUSION Platforms enabled with Intel Centrino mobile technology incorporate innovative technologies to significantly increase the mobility of notebook PCs. We described the following innovations: • Details are given on how to reduce interference between wireless communication technologies. The Intel Wireless Coexistence System is demonstrated to basically recover all WLAN performance under Bluetooth interference. • Two new, advanced platform power-management features for improved battery life are described in this paper: Asynchronous VR control with PSI, and multiphase IMVP. These features implemented on an IBM T20 can result in over 1W of power savings during the battery life benchmark (ZDBL4.01) run. While the benefits may vary with different OEM systems, the expectation is that these will save power and improve battery life • By coupling state-of-the-art thermal techniques with the advanced platform power-management features for lower power consumption, the thermal solution for a typical platform with Intel Centrino mobile technology was demonstrated to be compact and lightweight, utilizing only a single fan to cool the processor and the system. Finally, the Intel Centrino mobile technology platform is extensively tested and tuned for components of Intel Centrino mobile technology to optimally work together in order to maximize reliability and interoperability and to deliver on all four mobility vectors. Intel Centrino mobile technology continues to undergo extensive security validation with industry-standard security and leading third-party security solutions. Intel is continuing to conduct comprehensive infrastructure verification with the wireless LAN infrastructure ecosystem and public wireless LAN service providers. ACKNOWLEDGMENTS The authors thank the following individuals for their contributions: Al Bettner for antenna development and enabling; Marlon Cardenas and Je-Young Chang for driving thermal solution development; Camille Chen for driving WLAN coexistence integration and validation; Randy Durrant for Intel WCS systems design and Bluetooth enabling; Eddie Lin for driving Bluetooth coexistence integration and validation; Don Nguyen for developing the detailed white papers on each of these initiatives, which provided much of the data that is

Mobile PC Platforms Enabled with Intel® Centrino™ Mobile Technology 14

Intel Technology Journal, Vol. 7, Issue 2, May 2003

REFERENCES Internet experience. Ticky holds Ph.D. and M.S. degrees in Computer Science/Engineering from the University of For more information on Extended Battery Life Program, Manchester (England) and a B.S. degree in Computer please access the following Web site: Science/Statistics from the University of Manchester. His http://developer.intel.com/design/mobile/battery.htm e-mail is [email protected]. AUTHORS’ BIOGRAPHIES Copyright © Intel Corporation 2003. This publication Gordon Chinn is the Radio Frequency (RF) Systems was downloaded from http://developer.intel.com/. Architecture manager in Intel’s Mobile Platforms Group Legal notices at (MPG) and is responsible for enabling mobile platform http://www.intel.com/sites/corporate/tradmarx.htm. PCs with multiple, concurrent, embedded radio technologies. He has over 30 years of experience in various RF communications development positions in the commercial and aerospace industry. Gordon holds a B.S. and M.S. degree in Electrical Engineering and Computer Science from the University of California at Berkeley. His e-mail is [email protected]. Sanjiv Desai is the Mobile Systems and Wireless Architecture manager in Intel’s Mobile Platforms Group (MPG). He is responsible for the definition and architecture of mobile platforms at Intel. He has over 20 years of industry experience in various CPU, Chipset, and System design and development positions at Intel, National Semiconductor, and American Microsystems Inc. (Gould). Sanjiv holds an M.Sc. degree in Electrical Engineering from North Carolina A&T State University (USA) and a B.Sc. degree in Electrical Engineering from M.S. University (India). His e-mail address is [email protected]. Eric DiStefano is the manager of the Thermal Tools and Technology team in the Mobile Platforms Group. He has been with Intel defining thermal solutions for mobile processors and platforms for five years. Prior to Intel, he worked in the aerospace industry for 15 years developing advanced space propulsion, propellant feed systems, and cooling of various systems including avionics. His e-mail is [email protected]. Krishnan Ravichandran has a B.S.E.E. degree from IIT Madras, India and an M.S.E.E. degree from Carnegie Mellon University. He joined Intel in 1988 after completing his M.S. degree and has spent his entire career in the Mobile Group, performing a variety of roles (circuit design, chipset lead, CPU Design Manager, Program Manager, Platform Architecture and more recently Extended Battery Life Manager). His e-mail is [email protected]. Shreekant (Ticky) Thakkar is the co-director of Mobile Platform Architecture in Intel’s Mobile Platforms Group (MPG). He has over 24 years of experience in various development and planning positions at Intel and Sequent (now part of IBM). Ticky established direction for mobile notebook PCs to transition from portable to wireless computing–making the notebooks deliver the best wireless

Mobile PC Platforms Enabled with Intel® Centrino™ Mobile Technology 15

Paper1cover.qxd 1/31/03 9:11 AM Page 2

For further information visit: developer.intel.com/technology/itj/index.htm

Copyright © 2003, Intel Corporation. All rights reserved. Intel is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries. For a complete listing of trademark information vistit: www.intel.com/sites/corporate/tradmarx.htm Volume 07 Issue 02 Published, May 21, 2003 ISSN 1535-864X Intel® Technology Journal

Intel® Centrino™ Mobile Technology

The Intel® Pentium® M Processor: Microarchitecture and Performance

A compiled version of all papers from this issue of the Intel Technology Journal can be found at: http://developer.intel.com/technology/itj/index.htm The Intel® Pentium® M Processor: Microarchitecture and Performance

Simcha Gochman, Mobile Platforms Group, Intel Corporation Ronny Ronen, Microprocessor Research Labs, Intel Corporation Ittai Anati, Mobile Platforms Group, Intel Corporation Ariel Berkovits, Mobile Platforms Group, Intel Corporation Tsvika Kurts, Mobile Platforms Group, Intel Corporation Alon Naveh, Mobile Platforms Group, Intel Corporation Ali Saeed, Mobile Platforms Group, Intel Corporation Zeev Sperber, Mobile Platforms Group, Intel Corporation Robert C. Valentine, Mobile Platforms Group, Intel Corporation Index words: Pentium® M processor, microarchitecture, power-aware design, branch prediction, instruction fusion, processor bus, SpeedStep® technology

We next describe the Pentium M processor’s Enhanced ABSTRACT   ® Intel SpeedStep technology that allows significant The Intel Pentium M processor is a key component of reduction in energy consumption without compromising  ™ Intel Centrino mobile technology platform. It is performance. Intel’s first microprocessor designed specifically for mobility. It provides outstanding mobile performance1 We conclude with demonstrating the superior and its dynamic power management enables energy performance and power-awareness of the Pentium M saving for longer battery life. processor by comparing it with other mobile processors on a variety of known industry benchmarks. Designing a mobile processor calls for different power/ performance tradeoffs than designing a traditional high- INTRODUCTION performance processor. In this paper we explain the design philosophy that was adopted by the Intel Pentium The distinction between Mobile and Desktop computing M processor’s architects to achieve best performance at segments is not new. There are several vectors in which given power and thermal constraints. these segments differ, two of which are relevant to our discussion: power dissipation and battery life [1]. We present an overview of the Intel Pentium M processor’s major advanced power-aware performance • Power, Power Density, and Thermal. The overall features including the innovative branch predictor, the dissipated power, as well as the power dissipated by dedicated stack manager, the micro-operation fusion, the chip per unit area, are important factors. Power and the Intel Pentium M processor bus. generates heat. In order to keep transistors within their allowed operating temperature range, the generated heat has to be dissipated from its source in a cost-effective manner. These constraints limit the  Intel Pentium M is a trademark of Intel Corporation or processor’s peak power consumption. Peak power its subsidiaries in the United States and other countries. consumption limits apply both to desktops and  mobile computers. However, the mobile computer’s Intel Centrino is a trademark of Intel Corporation or its subsidiaries in the United States and other countries.

1 System performance, battery life and functionality will vary depending on your specific hardware and software SpeedStep is a trademark of Intel Corporation or its configurations. subsidiaries in the United States and other countries.

The Intel® Pentium® M Processor: Microarchitecture and Performance 21 Intel Technology Journal, Vol. 7, Issue 2, May 2003

smaller form-factor and lighter weight decrease the a lower power budget as well as minimizing the mobile processor’s power budget2. processor’s average power for extending battery life. The Intel Pentium M processor includes several new • Battery Life. Batteries are designed to support a innovative features that enable it to meet its design goals. certain Watts x Hours3. The higher the average power is, the shorter the time that a battery can This new processor has 77 million transistors operate. This constraint limits the processor’s implemented on Intel’s 0.13µ CMOS process, with six average power consumption, which is a crucial factor levels of copper interconnect. Its die size is 84 mm2 and for mobile computers but less relevant for desktop its peak power consumption is 24.5 watts at 1.6 GHz. Its computers. 3.2 GB/second processor bus helps to provide the high data bandwidths needed for today’s and tomorrow’s Until not so long ago, the size of the mobile market was demanding applications. It fully implements the IA32 significantly smaller than that of the desktop market, instruction set architecture [2], including Streaming causing mobile PC designers to retrofit processors SIMD Extension (SSE) and Streaming SIMD Extension designed to address the desktop market. Desktop 2 (SSE2) targeted for multimedia, content creation, processors were designed to achieve the highest scientific, and engineering applications. performance possible for all user profiles with little consideration of power consumption. Meeting the more We begin with an overview of the Intel Pentium M restrictive peak and average power constraints of mobile processor design philosophy. Then we examine in depth PCs involved a compromise. New processors were the major innovative power-aware and energy-aware adapted to the mobile market by either operating them at features of this processor. We conclude by comparing a lower voltage and frequency, hence compromising the performance and power-awareness of the Pentium M performance, or by delaying the implementation of their processor with those of other mobile processors. mobile version to the next-generation process technology, hence losing time-to-market. POWER-AWARENESS PHILOSOPHY The increased demand for mobile PCs, combined with AND STRATEGY the growing gap between desktop and mobile peak and Design tradeoffs for the mobile market are rather average power budgets, made it impractical to continue complicated and involve several challenges: the trend of using desktop processors for the mobile • market. Mobile users expect to have close-to-desktop Optimizing the design for maximum performance and performance even with the more restricted power and extended battery life. The challenge lies in how to thermal environment. Addressing this need called for a balance between these conflicting goals. processor designed with the mobile environment in • Trading performance for power. Performance mind. This is where the Intel Pentium M processor features–whether increasing instruction-level comes in. parallelism (ILP) or speeding up frequency–usually The Pentium M processor is Intel’s new flagship power- increase power consumption. Power-saving features aware microprocessor. Upon introduction (in March usually decrease performance. The challenge lies in 2003) its highest performing version ran at 1.6 GHz figuring out how much power one can afford to lose @1.47V4, its Low Voltage version ran at 1.1 GHz in order to implement a performance feature. @1.18V, and its Ultra Low Voltage version ran at Setting a clear direction for tradeoffs between higher 900 MHz @1.0V. It follows a new design approach performance and longer battery life and between with the goal of delivering breakthrough performance at performance and power was essential for converging the definition of the Intel Pentium M processor.

2 Currently (2003), typical desktop processor peak Higher Performance vs. Longer Battery Life power consumption is about 100W. Typical mobile Early in the design it was realized that the average power processor peak power is about 30W. consumption for typical usage of the Intel Pentium M 3 Currently (2003), typical battery capacity is 24- processor is only a small portion of the whole platform 72WxH. A typical Pentium M platform uses 48WxH power consumption–less than 10%. This low average batteries. A typical platform’s power is 13W, of which power is mainly due to the ability of the processor to the processor consumes about 1W on average. Smaller enter lower power states in idle periods and to the platforms use 24WxH to save weight. Enhanced Intel SpeedStep technology, which 4 1.47V is the highest operating voltage of the 0.13µ significantly reduces power in periods of low processor CMOS process on which the processor is implemented. activity. The majority of the power in the platform is

The Intel® Pentium® M Processor: Microarchitecture and Performance 22 Intel Technology Journal, Vol. 7, Issue 2, May 2003

consumed by other components: the LCD, the hard disk, where α is the activity factor, Power0 is the power the memory system, networking components, etc. Under consumption and C0 is the effective capacitance for a these circumstances, it was clear that the potential gain given design. The frequency is usually approximated as in system battery life by further reducing the processor’s being linearly proportional with the operating voltage, average power would be small5. With that in mind, we namely decided to optimize the design for the highest performance within power and thermal constraints, when F0 ≅ Kf * V0 the processor is active, and to focus on battery life when where Kf is the proportion constant. This leads to the it is idle. cubic dependency of power on the operating voltage It should be noted that the above may change in the α 3 future. We expect high-performing processors’ average Powermax = * C0 * V0 * Kf power to grow due to more complex processor logic and The performance at this operating condition is given by higher static power and new demanding workloads. We expect platforms to become smaller, simpler, and more Perf0 ≅ IPC0 * F0 efficient hence consuming less power. Therefore, the where IPC0 indicates the Instruction Per Cycle in portion of the processor’s average power as part of the Frequency F 7. overall platform power consumption will increase. 0 It can be derived from the above formulae (see also [3]) Trading Performance For Power that by increasing the voltage by 1%, for example, one The tradeoffs are different if we optimize for higher can increase performance by 1% through increased performance or for longer battery life. For higher frequency. This would result in a power increase of performing mobile processors (and in fact, now, in all approximately 3%. Thus, an alternate microarchitectural high-performing processors) the criterion is feature that gains less than 1% in performance for a “Maximizing performance at given thermal power increase of 3% or more should be rejected constraints.” For longer battery life the criterion is upfront. In general, a microarchitectural feature can be “Minimizing energy per task.” Below, we explain what regarded as power-aware, if the % ratio between the each criterion actually means. power increase and the performance gain is less than 3.

Maximizing Performance at Given Thermal Minimizing Energy Per Task Constraints Energy consumption in general is a sum of two The processor’s thermal map depends on the power components: active energy and idle energy. Minimizing consumption, the local power density at various points the idle energy consumption is relatively straightforward on the die, the cooling mechanism, and more. At the and does not involve conflicting design tradeoffs: the early stages of the Intel Pentium M processor’s processor enters a deep-sleep power state, stops the microarchitecture definition, when thermal information clocks, and lowers the operating voltage to the minimum was not available, we replaced the criterion “Maximizing allowed to sustain the internal state. Optimizing active performance at given thermal constraints” with: energy is more complex. A very slow execution “Maximizing performance at a given power envelope” 6. consumes less power for a longer period of time, while heavy parallelism reduces the active time but increases According to this criterion, a microarchitectural feature the active power. that gains performance or saves power should be better than simply using voltage/frequency scaling. For a given Energyactive = Poweractive * Timeactive working point of core voltage V0 and Frequency F0 the or power consumption of a processor is given by ≅ Energyactive Poweractive / Perfactive 2 Power0 = α * C0 * V0 * F0 This implies that in order to improve overall battery life, the % performance benefit must be greater than the additional power consumed.

5 In the ideal case where the 10% power is totally eliminated, battery life would be extended by only 11%. 7 For the sake of this discussion, we assume performance 6 The Intel Pentium M processor’s design assumed scales linearly with frequency. In reality, mainly due to power envelopes from 7W for passively cooled boxes up off-chips accesses (memory and I/O), performance does to 24W for Thin and Light platforms. not scale with frequency.

The Intel® Pentium® M Processor: Microarchitecture and Performance 23 Intel Technology Journal, Vol. 7, Issue 2, May 2003

During the definition of the Intel Pentium M processor 100% we tended to use the stricter criterion in each case: Energy Loss Constrained- Constrained Perf Loss 80% Performance • Performance improvement features were usually Breakeven line Wrong trade-off zone 60% included if they “minimized energy per task,” that is, Energy Loss 40% Constrained Perf they save energy. Features that pass this criterion Energy Gain improve performance and extend battery life–the Breakeven 20% line

ultimate win/win situation. Faster execution also 0%

% % % % % % % % % % % % % % % 0 7% 4 1 8% 2 -9% -6% -3 0 3 6 9 2 5 8% 4 7% implies longer idle time, allowing active units to be -3 -2 -2 -2 -1 -15 -1 1 1 1 21 2 2 30 -20% shut off, thus saving even more energy. Energy Gain Energy Gain Constrained Perf -40% Constrained Perf • Power-saving features may reduce Instructions Per Loss Gain

<=Power Gain | Po wer Loss => -60% Cycle (IPC) resulting in a performance loss. Such <= Performance loss | Performance gain => power-saving features were usually included only when they “maximized performance at given thermal Figure 1: Performance/power tradeoff zones constraints,” that is, the performance loss was Now that the tradeoffs are known, we examine the smaller than would have been achieved by just using strategies we used to identify power-aware features. voltage and frequency scaling. In practice, by Power-awareness means attacking power and energy applying this criterion, a tradeoff can be made consumption at all levels: between saved power and increased frequency, thus squeezing more performance at the peak allowed • Reducing number of instructions per task. power. • Reducing number of micro-ops per instruction. • Reducing number of transistor switches per micro-op. • Performance-improvement features that met the • Reducing the amount of energy per transistor switch. “maximize performance at given thermal constraints” criterion, but failed the “minimizing Reducing the Number of Instructions Per Task energy per task” one, were carefully judged and in From an architectural point of view the number of many cases included. Such features do increase instructions per task is fixed. However, from a performance but consume more energy. This loss is microarchitectural point of view, this is true only for the negligible, since, as mentioned above, the processor’s number of retired instructions. With branch prediction, average power as a portion of the overall system there are many speculated instructions running within the average power is relatively small. In fact, the faster processor that are not retired. A better branch predictor execution results in a longer idle time, potentially decreases the number of the speculated instructions, thus allowing additional energy savings. practically reducing the number of overall processed • Figure 1 illustrates the design tradeoffs from a instructions. Indeed, the extra logic involved in a better performance feature viewpoint. The magenta area (in branch predictor does consume power, but the gain in the lower right side) indicates a clear win–improving the reduced number of instructions exceeds that extra both performance at the power envelope and battery cost. life. The orange area (resides in the upper right side) Reducing the Number of Micro-ops Per Instruction indicates a tradeoff where constrained performance is Out-of-order implementations of the IA32 Instruction preferred over lower battery life. The green area Set Architecture (ISA) break macro-instructions into a (resides in the lower left side) indicates a tradeoff sequence of one or more simple operations, called where improved battery life is preferred over micro-operations, or micro-ops [4]. Handling and constrained performance. The white area (in the executing each micro-op consumes power. Eliminating upper left side) is a clear “drop.” micro-ops from the micro-op stream or combining several micro-ops together reduces the overall power. The Intel Pentium M processor micro-ops fusion and dedicated stack engine do exactly that. Here, as well, the gain in reduced micro-ops exceeds the cost of the extra logic involved.

Reducing the Number of Transistor Switches Per Micro-op This is more straightforward and intuitive. High performance processors run at a high frequency and

The Intel® Pentium® M Processor: Microarchitecture and Performance 24 Intel Technology Journal, Vol. 7, Issue 2, May 2003 provide a high degree of instruction-level parallelism– access latency, but the significant power saved switching a lot of transistors on the way. There is a clear justifies the small performance loss. gain in doing the same operation with a smaller number • Enhanced Intel SpeedStep technology. This of switches. In some cases, this is simple and involves advanced technology significantly reduces the only local optimizations such as accessing only the processor voltage (and temperature), hence leakage portion of a register or a cache line that is actually power, when processor activity is low. needed. In other cases, it calls for a global optimization to decide whether a unit will not be used for the next cycle, and thus can be shut off. Occasionally, such POWER-AWARE FEATURES power saving may involve a performance loss. The following sections describe several of the Intel Pentium M processor’s power-aware features. These Reducing the Amount of Energy Per Transistor features cover all the above-mentioned strategies: Switch Energy per switch depends on the transistor size and • Reducing the number of instructions per task: type, and on the operating voltage. Smaller transistors advanced branch prediction. and lower operating voltages reduce energy per switch • Reducing the number of micro-ops per instruction: [5]. The effect of microarchitecture here is rather micro-ops fusion and dedicated stack engine. limited. Transistor size and type are tuned so that they • Reducing the number of transistor switches per meet the timing constraints without wasting unnecessary micro-op: the Intel Pentium M processor bus and power. The Enhanced Intel SpeedStep technology various lower-level optimizations. reduces the operating voltage at low activity periods, • Reducing the amount of energy per transistor switch: thus reducing the energy per transistor switch. Intel SpeedStep technology. Microarchitecture can help reduce energy per switch by optimizing the amount of interconnect in the processor. ADVANCED BRANCH PREDICTION Each strategy affects performance, power, and energy in For high-frequency pipelined microprocessors, branch a different way. In most cases, features that fall under prediction continues to be one of the biggest ticket items one of these strategies save energy per task. However, for gaining performance. In the Intel Pentium M performance-improving features are likely to result in processor, the benefits are actually twofold: the decrease increased power consumption, e.g., better branch in speculative code gains performance and reduces the prediction reduces energy, but also reduces stalls, thus energy spent per instruction retired. increasing power. This may look bad, but, in fact, it is The advanced branch prediction in the Pentium M not. Higher performance at lower energy can be traded  processor is based on the Intel Pentium 4 processor’s for lower power by slowing down the processor either by [6] branch predictor. On top of that, two additional using voltage/frequency scaling or microarchitectural predictors to capture special program flows, were added: throttling. a Loop Detector and an Indirect Branch Predictor. Static Power

The power consumed by a processor consists of active Count Limit Prediction power (used to switch transistors) and static power (leakage of transistors under voltage). In this paper we +1 focus mainly on active power reduction, but it is worth mentioning how the Intel Pentium M processor also = reduces static power consumption. 0 The static power is roughly a function of the number of transistors, their type, the operating voltage, and the die temperature. The Pentium M processor reduces static power by several means: Figure 2: The Loop Detector logic • Low-leakage devices. The processor’s 1MB power managed L2 cache, which contains roughly two- thirds of the transistors in the processor, is built with low-leaking transistors. Low-leaking transistors are Pentium 4 is a registered trademark of Intel somewhat slower, thus slightly increasing the cache Corporation or its subsidiaries in the United States and other countries.

The Intel® Pentium® M Processor: Microarchitecture and Performance 25 Intel Technology Journal, Vol. 7, Issue 2, May 2003

The Loop Detector (Figure 2) analyzes branches to see if The Intel Pentium M processor branch predictor they have loop behavior. Loop behavior is defined as misprediction rate is 20% lower than that of previous moving in one direction (taken or not-taken) a fixed generation designs, resulting in as much as 7% in real number of times interspersed with a single movement in performance. Approximately 30% of this benefit comes the opposite direction. When such a branch is detected, from the combination of the Loop Detector and Indirect a set of counters are allocated in the predictor such that Branch Predictor. The Loop Predictor captures a the behavior of the program can be predicted completely common program behavior and benefits many accurately for larger iteration counts than typically applications regardless of compilation techniques. The captured by global or local history predictors. Indirect Target Predictor shows its gains more in specific applications where indirect branches are used to select The Indirect Branch Predictor solves the problematic data-dependent targets. In such applications, when the data-dependent indirect branches. Indirect branches are compilers use calculated branches rather than if-trees heavily used in object-oriented code (C++, Java), hence made from conditional branches, performance gains can they became a growing source of branch mispredictions. be a few percentage points. While most indirect branches have a single target at run time, some, such as a case statement in a byte-code interpreter, may have many targets. These targets are MICRO-OPS FUSION chosen in a data-dependent manner. Out-of-order implementations of the IA32 Instruction Set Architecture (ISA) break macro-instructions into a The Indirect Branch Predictor (Figure 3) chooses targets sequence of one or more simple operations, called based on a global control flow history, much the same micro-operations, or micro-ops. A conventional micro- way a global branch predictor chooses the direction of op consists of a single operation operating on two conditional branches using global control flow history. sources. The Instruction Decoder breaks a macro- As can be seen in the figure, it is an adjunct to the instruction into multiple micro-ops whenever the macro- normal target prediction device. Targets are always instruction operates on more than two sources or when allocated in the Instruction Pointer tagged table along the nature of the operation requires a sequence of with the type of branch. When a misprediction occurs unrelated operations. There are quite a few cases of due to a mispredicted target on an indirect branch, the macro-instructions that break into several micro-ops, two Indirect Branch Predictor allocates a new entry of which are store operations and load-and-op (read- corresponding to the global history leading to this modify) operations. instance of the indirect branch. This construction allows monotonic indirect branches to predict correctly from Macro-instructions that store data in memory are the IP-based Target array, and data-dependent indirect decoded as two independent micro-ops. The first branches to allocate as many targets as they may need operation–store-address–calculates the address of the for different global history patterns, which correlate with store, while the second operation–store-data–stores the the different targets. Entries in the Indirect Branch data into the Store Data buffer8. The separation between Predictor are tagged with the hit and type information in the store-data and the store-address operations is the IP-based target array to prevent “false positives” important for memory disambiguation. Breaking the from the Indirect Branch Predictor to lead to new store operation into two micro-ops allows the store- sources of target mispredictions. address operation to dispatch earlier, even before the stored data are known, enabling resolution of address Instruction Global conflicts and opening the memory pipeline for other Pointer History loads. A typical load-and-op macro-instruction consists of two micro-ops: the first operation reads the operand from an Target : type : hit target : hit address in memory, and the second operation calculates the result based on the data read from memory and the register operand. A load-and-op macro-instruction may have up to three register operands, so it must be implemented by two micro-ops. The atomic operations

hit target Figure 3: The Indirect Branch Predictor logic 8 The actual write to memory is done when the store retires. Only then are the data in the respective Store Data buffers written into the specified address.

The Intel® Pentium® M Processor: Microarchitecture and Performance 26 Intel Technology Journal, Vol. 7, Issue 2, May 2003 are inherently serial, and the second operation cannot The macro-instruction is decoded into a single fused start until the first operation completes. micro-op by the Instruction Decoder. The fused micro- op is allocated, renamed, and then issued into a single Splitting the macro-instruction into multiple micro-ops entry in the Reorder-Buffer and the Reservation-Station. also has its toll: To support fused micro-ops, each reservation-station • The increased number of micro-ops creates pressure entry can accommodate up to three source operands. on resources with limited bandwidth (rename, retire) When dispatching to the execution units, the Dispatcher or limited capacity (Reorder-Buffer, Reservation- controls the separate execution of each portion of the Station). This pressure eventually results in fused micro-op according to the readiness of its sources. performance loss. In a sense, the Dispatcher treats each portion as if it occupied the whole entry for itself. The Execution of • Splitting a macro-instruction into more than one each operation is performed in the same way as a non- micro-op is a complex operation that requires a fused micro-op with only minor changes made to the significantly more capable decoder. Due to its execution units. complexity, most implementations opt to have only one complex decoder; all other decoders are left to The fused store operation is depicted in Figure 5. handle macro-instructions that break only into a single micro-op. Decoded and renamed Fused store micro operation • Delivering more micro-ops through the system increases the energy required to complete a given instruction sequence. Dispatch Store Address Dispatch Store Data The Pentium M processor features the micro-ops fusion Save faults in Reg. File Save faults in Reg. File mechanism to reduce this performance and energy cost while maintaining the benefit of the out-of-order execution. With micro-ops fusion, the Instruction Decoder fuses two micro-ops into one micro-op and Retire values when both keeps them united throughout most parts of the out-of- operations completed order core of the processor–at allocation, dispatch, and retirement. To maintain their non-fused behavior benefits, the micro-ops are executed as non-fused Figure 5: Fused store flow operations at the execution level. This provides an The two micro-ops making up the fused store micro-op effectively wider instruction decoder, allocation, and can be issued to their relevant execution units in parallel. retirement. Figure 4 describes the different domains in The dispatch of the store-address operation to the which the micro-op is fused and unfused. address-generation unit is performed when its sources (the base and index registers) are ready. The dispatch of the store-data operation to the store data buffer unit can Instruction Decoder occur independently when its source operand is available. The retirement of the fused store can occur only after both operations complete. Micro-ops Allocation Fused and Renaming Micro-ops The fused load-and-op operation is depicted in Figure 6. domain

Un-fused Exe. Micro-ops Units domain

Figure 4: Micro-ops fusion domains

The Intel® Pentium® M Processor: Microarchitecture and Performance 27 Intel Technology Journal, Vol. 7, Issue 2, May 2003

Decode Macro-instruction into are typically used to implement a combined parameter Fused Micro-op and Rename and control-flow stack used in high-level programming languages. The ISA provides PUSH, POP, CALL, and RET, which have the obvious parameter stack and Dispatch load control-flow stack behaviors. The ISA dedicates the Save faults in Reg. File hardware Stack Pointer register (ESP) as the machine stack pointer, and this register is modified as a side Dispatch Operation effect of each of these instructions. Sequences of such Save Result in Reg. File instructions are quite common, for instance, PUSHing a set of operands and then using a CALL instruction is the standard mechanism for making a Procedure or Function Retire values when the Call. Operation completed In traditional out-of-order implementations of the IA32

ISA, these side-effect operations were performed by Figure 6: Fused load-and-op flow sending with each stack-related macro-instruction an additional micro-op to update the ESP register. This The two micro-ops making up the fused load-and-op micro-op adds or subtracts an immediate value to the micro-op are issued serially to the relevant execution ESP register. units. The dispatch of the load operation is performed when its sources (the base and index registers) are ready. The Intel Pentium M processor chose to implement the The dispatch of the “op” portion of the load-and-op ESP “side-effect” behavior in a more efficient way, operation to the execution unit can occur only after the using dedicated logic near the superscalar decoders. The load completes and the other operand is ready. The idea is to represent the programmer’s view of ESP retirement of the fused load-and-op micro-op can occur (ESPP) at any given point in time by some historic ESP only after both operations complete. living in the out-of-order execution core (ESPO) added to a delta (ESP ) that is maintained in the front end (see We have found that the fused micro-ops mechanism D also [7]): reduces the number of micro-ops handled by the out-of- order logic by more than 10%. The reduced number of ESPP := ESPO + ESPD micro-ops increases performance by effectively When, for example, a sequence of PUSHes and POPs is widening the issue, rename, and retire pipeline. The encountered in the instruction stream, the dedicated biggest boost is obtained during a burst of memory Stack Hardware executes the ESP side-effects in the operations, where micro-op fusion allows all decoders, decoders and updates the ESP register. Referring to rather than the one complex decoder, to process D Figure 7, we can see a superscalar implementation for incoming instructions. This practically widens the N+1 decoders passing the accumulated delta value processor decode, allocation, and retirement bandwidth across the decoders and updating the delta register with by a factor of three. the result, after the instructions are decoded. The The typical performance increase of the micro-op fusion hardware also patches the in-flight ESPD value into the is 5% for integer code and 9% for Floating Point (FP) address syllable of each of the stack referencing micro- code. The store fusion contributes most of the ops (patch HW not shown for clarity) so that the address performance increase for integer code. The two types of generation unit (AGU) can calculate the proper memory fused micro-ops contribute about equally to the location referenced by ESPP. This provides the performance increase of FP code. following benefits:

Delivering less micro-ops through the processor • Dependencies on ESP are removed since the ESPO decreases the energy required to complete a given value used for scheduling in the out-of-order machine instruction sequence since the same task is accomplished is not changed during the sequence of stack by processing fewer micro-ops. The gain in power operations. This allows more parallelism consumption is higher than the loss, due to the additional opportunities to be realized in the out-of-order logic required to implement the micro-ops fusion execution. mechanism. • ESPD updates are done using a small specialized dedicated adder, thus freeing the general execution DEDICATED STACK ENGINE units to work on other micro-ops. This allows a The IA32 Instruction Set Architecture (ISA) features higher degree of superscalarity for these instructions instructions for hardware-assisted stack management that

The Intel® Pentium® M Processor: Microarchitecture and Performance 28 Intel Technology Journal, Vol. 7, Issue 2, May 2003

without the cost associated with going to a higher including the ESP synchronization micro-ops. Clearly degree of superscalarity for all integer operations. this makes the out-of-order machine look virtually larger Additionally, since the ESP updating micro-ops have and frees execution bandwidth. However, the major been eliminated, the ALUs are free to be utilized by performance gain is to increase the front-end bandwidth more complex operations that would have been on these common instruction sequences to the full width blocked by the ESP updates, increasing execution of the superscalar decoders. These 5% of eliminated bandwidth. micro-ops result in a similar decrease in energy per instruction–in line with the Pentium M processor’s • Updating the delta register in the front end eliminates power-awareness direction. the ESP updates micro-ops from the out-of-order machine. Thus, power saving is realized since the large adders are not used for small operations, and the THE INTEL PENTIUM M PROCESSOR eliminated micro-ops do not toggle bits throughout BUS the machine. The Intel Pentium M processor bus was designed to provide a desktop-like performance while consuming Push EAX Push EDX Pop EBX significantly less power. Power saving is achieved by ESPD Decoder 40Decoder Decoder 4 the combination of protocol and circuit methods that are =8 0 1 N unique to the Pentium M processor bus. The processor bus supports 100 MHz bus clocks with a data rate of 400M transfers per second. It is a latched bus with an in-order queue of 8-pipelined transactions. Recovery Designed for mobile systems, the bus is optimized for a Information uni-processor environment. For example, it allows us to reduce the number of pins to save power:

• There are only 32 address bits that cover 4GB of Figure 7: The dedicated stack engine logic physical address space. There are two complications with the Dedicated Stack • The bus does not support dual-processors, since the Engine. Since it lives in the front of the pipeline, all its mobile systems’ power budget cannot support dual calculations are speculative. In order to recover a processors anyway. precise state at any point in the machine’s life, the value of ESPO and ESPD must be able to be recovered for all The Pentium M processor bus saves power aggressively instructions in the machine. ESPO is maintained by the when idle; it carefully controls its input buffer’s sense- out-of-order core as any other general-purpose register. amplifiers that sample the activity on the bus. When the The Intel Pentium M processor adds an additional table bus is idle, all sense amplifiers are disabled and do not (also shown in Figure 7) that saves the ESPD value and a consume any power. When the bus is active and address code relating to the effect on the ESPD register for every and data are driven on the bus, the input buffers are instruction in the machine. This allows the value of enabled in advance to ensure all information is captured ESPP to be recovered for all instructions either pre- or with no delay. post-execution. This allows for handling of either Faults The bus features many innovative mechanisms to reduce or Traps as defined in IA32. power while maintaining performance. Several of them The second complication occurs when the architectural are described below. value of ESP is needed inside the out-of-order machine, DPWR#: Data Bus Power Control. This is a special for instance, “XOR ESP,3” or, more commonly, when signal driven by the 855PM chipset whenever data are ESP is used in an address syllable. In this case, the transferred to the processor. DPWR# is used to decode logic automatically inserts a micro-op that carries dynamically enable the processor’s 64-bit data bus input out the ESP calculation. The ESP register can be P D sense amplifiers and their related controls (~80 signals) cleared since the architectural value is now coherent. A only when data are transferred to the bus. sync is not generated when the ESPD register is zero, so continued usage of ESP as a general-purpose register BPRI Control: This is a method to achieve the DPWR# will have no ill effects. functionality for the address bus. BPRI# is asserted whenever the 855PM chipset attempts to drive the bus. The Dedicated Stack Engine ESP typically eliminates It is used to dynamically enable the 32-bit address bus 5% of the micro-ops from an IA32 program compared to a processor not including this feature, even when

The Intel® Pentium® M Processor: Microarchitecture and Performance 29 Intel Technology Journal, Vol. 7, Issue 2, May 2003 input sense amplifiers and their related controls (~40 allow operations to resume seamlessly with no signals) only when a transaction is issued to the bus. performance penalty. Low Vtt: The Intel Pentium M processor’s I/O buffers One example for such a power predictor is the “Allocate work at a low voltage of 1.05V (Vtt). The low Vtt is an stall” predictor. Whenever the Reorder-Buffer is full, essential element to reduce the bus power. However, the Allocator stalls the pipeline. However, the Allocator operating at low Vtt introduces a new set of problems cannot tell if the Reorder-Buffer will remain full on the because the I/O buffer is working at the low linear point, next cycle. It therefore needs to re-evaluate the stall which affects the buffer’s characteristics. The bus condition every cycle. It turns out that in many cases includes a special Resistor Compensation (RCOMP) when the Reorder-Buffer is full, it stays so for very long method to adjust the buffer strength dynamically during periods. Therefore the power penalty in this case is run time. It accommodates the impacts of temperature, high. A specialized logic was defined to collect voltage drift, and bus topology. Thus, at any thermal information from the Reorder-Buffer and other units in and power state the Pentium M processor bus has full order to predict the nature of the next cycle. This logic impedance termination. It has split power planes that instructs the Allocator to hold on to the stall condition allow setting the I/O operating voltage to a fixed value and shut off its clocks. of 1.05V even though the core may be operating at a Another type of a microarchitectural power feature higher Enhanced Intel SpeedStep technology operating identifies the logic that is necessary for a specific point. operation and activates only that part of the processor. PSI: Power Status Indicator. The Pentium M processor Here are two examples for an implementation of such a bus provides a signal to reduce the overall platform power feature. power (not just the processor power!). This signal is In a conventional processor, all the units of a specific driven by the processor to control the current execution port electrically share the same source bus consumption of the Voltage Regulator (VR) when the wires. However, power can be saved if instead of processor operates at a low power state. driving the sources to all the execution units (EUs), only the wires that belong to the target EU are driven. LOWER-LEVEL POWER Therefore, the Pentium M processor execution units OPTIMIZATIONS were divided into a few segments (stacks), and a special This section describes several lower-level mechanisms logic was created in order to control the flow of data to that demonstrate the power awareness of the Intel every stack according to its actual destination (see Pentium M processor. Figure 8). A simple, yet effective method that was pursued in order Dispatch EUs Stack #1 to reduce power was to identify idle logic and shut it off. Port 0 Src1/Src2 Port 0 This was done locally and globally. Port 1 EU2 EU1 Locally, the design was thoroughly reviewed for any EU inefficiency during idle states. The goal was to gate the EU Stack Select Driver clocks as much as possible. Ideally, the clocks should be Enable EUs Stack #2 shut off for each pipe stage separately. However, if such Port 0 a naive approach is used, the added complexity may Port 1 Port 1 EU4 sometimes outweigh the gain. In these cases the logic is Src1/Src2 EU3 shut off for the entire unit only at the end of the operation, resulting in a smaller power saving. Globally, idle time identification is done at a higher Figure 8: Execution units stacking microarchitectural level, when the unit alone cannot A generic processor operates on several different data identify the idle period. For example, the first stage of a types with different widths. In an IA32 processor there unit is always kept awake in order to respond to are integer operations, operating on 32 bits, multimedia incoming messages. So, the goal was to create a few operations, operating on 64 bits or 128 bits, and floating- central controllers that can microarchitecturally identify point operations, operating on 80 bits. The most or predict idle periods and instruct the units to reduce common instructions are integer instructions that access power (either by shutting off their clocks or by disabling only a limited set of registers and use only 32-bit data parts of their logic). Also, the prediction logic should values. Toggling a wider bus and reading from a bigger register file consumes more power than is actually

The Intel® Pentium® M Processor: Microarchitecture and Performance 30 Intel Technology Journal, Vol. 7, Issue 2, May 2003 required. The Intel Pentium M processor saves power optimization according to the load. The Intel Pentium M by identifying integer operations in advance and processor introduces a multi-point Enhanced Intel activating only the appropriate hardware (see Figure 9). SpeedStep technology optimized for demand-based The savings include the narrower buses to and from the switching. EU during dispatch and writeback, and also other The Enhanced Intel SpeedStep technology attempts to elements in the renaming logic that are not accessed address the following challenges: while an integer operation is executing. This effectively transforms the processor into a 32-bit machine that • Minimizing system and processor unavailability. utilizes only resources needed for integer operations Operating point switching requires voltage to be while operating on integer data types. transitioned over a wide range (e.g., from 0.9V to 1.5V). Physical limitations of the power delivery Rename Dispatch EUs system translate this demand to over 100µs delay. A Data full clock generator Phase-Locked-Loop relock FP/XMM Integer Control Upper Integer WB [31:0] Register Register Array Data Data [85:0] Data requires approximately 30µs. The architecture needs Rename Rename Array Array [85:32] to ensure system memory access unavailability will [85:32] [31:0] not exceed 10-15µs, to match isochronous device Source needs. Write Integer [85:0] Enable Enable • Self-managed voltage and frequency stepping. The if non Enable if non Integer if non Integer Integer Enhanced Intel SpeedStep technology requires the migration of the mechanism from the chipset into the

processor. This introduces two challenges: (a) how to

Figure 9: Early identification of EU width sequence the operation when the processor clock is halted and (b) how to prevent loss of system events, ENHANCED INTEL SPEEDSTEP such as interrupts and snoops, previously blocked by the chipset during the transition. TECHNOLOGY Figure 10 depicts the high-level block diagram of the High-performance processors tend to have high power Enhanced Intel SpeedStep technology instantiation in the consumption during execution. This is a simple Intel Pentium M processor. derivative of the active power equation 2 INTR, NMI, SMI, INIT Power = α * C * V * F where V is the core voltage, F is the operating frequency, and α is the activity factor. However, proven mobile CPU Interrupt usage models, indicate that typical usage is bursty in Core Sampler Bus nature, requiring high performance only for short bursts Interface Core of time. Average power reduction can be achieved by Clock switching voltage and frequency to a lower operating point, when demand is low. The efficiency of the Frequency Sequencer solution depends on the ability to execute this operation- Controller point switch frequently and efficiently, to track demand. Core Clock Bus Clock Domain Previous generations of Intel mobile processors implemented the Intel basic SpeedStep technology [8]. It switches both voltage and frequency between two Figure 10: Enhanced Intel SpeedStep technology distinct states: Lowest Frequency Mode (LFM) and block diagram and clocking Highest Frequency Mode (HFM) by using the platform C3-idle state. While achieving low-power operation The Enhanced Intel SpeedStep technology uses three during LFM, the basic SpeedStep architecture does not novel principals to address the challenges stated above: fully address demand-based switching needs. The long 1. Voltage-Frequency switching separation. Unlike system unavailability time during transitions limits the previous architectures, the Enhanced Intel SpeedStep switching frequency due to interaction with streaming technology separates the voltage and frequency devices such as Audio Codec ’97 (AC ’97) and the transition stages (Figure11). Universal Serial Bus (USB). Additionally, having only two fixed operating points limits operating point

The Intel® Pentium® M Processor: Microarchitecture and Performance 31 Intel Technology Journal, Vol. 7, Issue 2, May 2003

Voltage is stepped in short increments, preventing Consequently, the Enhanced Intel SpeedStep technology clock noise and allowing processor execution during provides the Pentium M processor a flexible, multi-point the voltage transition stage. Thus, system memory operating mode, completely self managed, and with a and the processor are made available during the very low CPU and memory unavailability time, which longest segment of the operating point transition, optimizes its power and performance according to thereby minimizing unavailability time to only the demand. frequency transition stage. PERFORMANCE The Intel Pentium M processor architecture delivers breakthrough mobile performance and enables extended on Tra siti No-Change nsit battery life in notebook PCs. ran ion T wn Up Do Comprehensive information about this processor Voltage performance can be found in [9]. In this paper, we Frequency Stepping choose to demonstrate the power-awareness of the Transition to Pentium M processor. We measured the processor Target performance and average processor power consumption in various operation modes on several benchmarks and Voltage compared its performance and efficiency with those of Stepping Frequency 9 to Transition other mobile processors in similar configurations . Target Efficiency reflects energy per task. Benchmark efficiency is measured by dividing the benchmark Figure 11: Enhanced Intel SpeedStep® performance (1/execution-time) by the average processor technology transition sequencing power of that benchmark. As will be shown, the Pentium M processor exhibits higher performance and 2. Clock partitioning and recovery. During the superior efficiency. Enhanced Intel SpeedStep technology transition, only the core clock and Phase-Locked-Loop are stopped, The set of benchmarks includes (see more in Table 1): while the bus-clock is kept running. The Enhanced • Mobile Representative Office Productivity Intel SpeedStep technology logic was partitioned Workload such that only the command interface and core • Internet Experience workload controls operate on the core clock, while the • SPEC CPU 2000 V1.2 [10] sequencer and interrupt interface operate on the bus clock. Thus the logic can be kept active constantly, We compared the Pentium M processor with the even though the core clock has been halted. following: (see Table 2 for detailed system configurations): Additionally, the clock circuitry of the Pentium M processor was designed to utilize the active bus-clock • Intel® Pentium® M processor (1.6 GHz/600 MHz) to shorten core-clock relock time considerably. Thus • Mobile Intel® Pentium® 4 Processor - M (2.4/1. 2 core-clock restart time is set to only 10µs, GHz) minimizing the processor inactive time. • Mobile Intel® Pentium® III Processor - M (1.2 GHz/ 3. Event blocking. Interrupts, pin events, and snoop 800 MHz) requests sent during the frequency transition stage All system run the Windows∗ XP* operating system. The must not be lost, even though the core clock is not operation modes used are as follows: available to serve them.

The Enhanced Intel SpeedStep technology logic samples all pin events when the core clock is 9 The information included in this section was prepared stopped. These are re-sent to the processor once the specifically for this paper to provide insight into the core clock is available, preventing loss of events. success of the design criteria using known benchmarks as a workload. Some measurements were collected on Bus events (such as snoops and interrupt messages) reference boards that are not publicly available. are blocked off using the native BNR# protocol, Therefore, these results should be considered only as an which captures the bus for the frequency transition estimate for relative performance and efficiency. period. Thus bus and pin events are not missed; they are serviced once the core is capable and running.

The Intel® Pentium® M Processor: Microarchitecture and Performance 32 Intel Technology Journal, Vol. 7, Issue 2, May 2003

• Always On (Max Frequency) exhibits nearly double the efficiency advantage over the • Portable/Laptop (Adaptive Frequency) Intel Mobile Pentium 4 Processor - M on most of the • Maximum Battery (Min Frequency) tests in these benchmarks. Several tests, 179.art and 300.twolf, exhibit an even greater efficiency gain (over All scores are normalized to the Intel Mobile Pentium 4 8X and 5X respectively) mainly due to the large 1MB Processor - M (orange bar) score. The efficiency– power-managed L2 cache of the Intel Pentium M performance over power–is obtained from the processor. benchmark performance score divided by the average processor power for the duration of that benchmark. Portable/Laptop Mode Always On Mode In Portable/Laptop mode, frequency and voltage changes depend on the application demand. This mode is the In the Always On mode the processor always runs at its normal usage mode when the system is not connected to highest frequency. This mode is mostly used when the an AC power source. This mode demonstrates the system is connected to an AC power source. This mode effectiveness of combining the performance and power- demonstrates the inherent performance and power- awareness of the Pentium M processor with the energy- awareness of the Intel Pentium M processor without saving nature of the Enhanced Intel SpeedStep utilizing the Enhanced Intel SpeedStep technology. technology to provide end users with breakthrough

6 3 mobile performance and extended battery life. Relative Efficiency Relative Performance

5 2.5 In this mode, all three processors operate between their Intel Pentium M processor 1600MHz Intel Pentium 4 processor-M 2400MHz highest and lowest frequency operating points, Intel Pentium III processor-M 1200MHz 4 2 depending on the amount of work to be done. For processor-intensive workloads, each processor operates 3 1.5 at its highest operating voltage and runs at its maximum 2 1 frequency: the Intel Pentium M processor @ 1.6 GHz, the mobile Intel Pentium 4 Processor - M @ 2.4 GHz, 1 0.5 and the Mobile Intel Pentium III Processor - M @ 1.2

0 0 GHz. When there is no activity (idle period), each Mobile Office Internet SPECint base SPECfp base Mobile Office Internet SPECint base SPECfp base Productivity Experience 2000 2000 Productivity Experience 2000 2000 processor runs at its lowest frequency and voltage to conserve energy: the Intel Pentium M processor @ 600

MHz, the Mobile Intel Pentium 4 Processor - M @ 1.2 Figure 12: Always On mode performance and GHz, and the Mobile Intel Pentium III Processor - M @ efficiency 800 MHz. Using the efficient switching algorithms of Figure 12 presents comparative performance and the Enhanced Intel SpeedStep technology, the Pentium efficiency results in the Always On mode. M processor is transparently switched between the highest and lowest frequency and voltage states, giving In this mode, the Intel Pentium M processor performs the user the best of both worlds: maximum performance equal or better (2%-25%) than the Mobile Intel Pentium under demanding applications and lowest power during 4 Processor - M on all benchmarks, and is significantly idle periods. more efficient (2X-3X) than it. This shows that the 6 3 Pentium M processor power-awareness philosophy Relative Efficiency Relative Performance works: it does more work and consumes significantly 5 2.5 Intel Pentium M processor 1600/600 MHz Intel Pentium 4 processor-M 2400/1200 MHz less power in the same thermally constrained Intel Pentium III processor-M 1200/800 MHz environment. The Intel Mobile Pentium 4 Processor - 4 2

M, designed for higher performance at higher power 3 1.5 envelopes, cannot exploit its full performance potential in this thermally restricted environment and has to slow 2 1 down. 1 0.5

SPECint_base2000 and SPECfp_base2000 are of 0 0 Mobile Office Internet SPECint base SPECfp base Mobile Office Internet SPECint base SPECfp base particular interest here. The Pentium M processor Productivity Experience 2000 2000 Productivity Experience 2000 2000

∗ Figure 13: Portable/Laptop mode performance and Other brands and names are the property of their efficiency respective owners.

The Intel® Pentium® M Processor: Microarchitecture and Performance 33 Intel Technology Journal, Vol. 7, Issue 2, May 2003

Figure 13 presents comparative performance and Table 1: Benchmark description efficiency results in the Portable/Laptop mode. Results show that in this mode, the Pentium M processor Mobile Representative Office Productivity Work- performs equal or better (0%-30%) than the Mobile Intel load: Targeted to evaluate notebook user experience Pentium 4 Processor - M (a similar advantage to the under popular business-oriented applications in a Always On mode). However, the relative efficiency over Microsoft Windows* operating environment. Some the Mobile Intel Pentium 4 Processor - M in benchmarks usage models represented in this productivity that exhibit periods of lower activity went up, e.g., from workload include applications from Microsoft Office 3X to over 4X on the mobile representative Office XP∗ (i.e., Word 2002, Excel 2002, PowerPoint 2002, Productivity workload. This improved efficiency results Outlook 2002), McAfee* VirusScan*, Adobe* from the much lower power consumption of the Intel Photoshop*, WinZip* and others. Pentium M processor at its low frequency mode @ 600 MHz compared with the power consumption of the Internet Experience Workload: Measures PC client Mobile Intel Pentium 4 Processor - M @1.2 GHz. performance under a range of popular Internet technologies such as SSL*, XML*, VML*, Java*, etc. Maximum Battery Mode using applications such as Adobe Acrobat*, Apple In the Maximum Battery mode the processor runs at its Quicktime*, Cycore*, Cult3D*, Macromedia Flash*, lowest frequency. This mode is usually used when the Windows Media Player*, and RealNetworks user is away from an AC power source for a long time. RealVideo*. This mode demonstrates the ability of the Intel Pentium M processor to minimize energy consumption when SPEC CPU2000: The industry-standard benchmark longer battery life is crucial. that evaluates compute-intensive integer and floating- point application performance [10]. 6 3 Relative Efficiency Relative Performance

5 2.5 Intel Pentium M processor 600MHz Intel Pentium 4 processor-M 1200MHz Intel Pentium III processor-M 800MHz Table 2: System configurations 4 2 Intel Intel Dell Platform Reference Reference 3 1.5 Latitude C610 Platform Platform Mobile Intel® Intel® Mobile Intel® 2 1 CPU Pentium® III Pentium® M Pentium® 4 Processor-M Processor Processor-M 1 0.5 CPU Core Freq (MHz) 1200/800 1600/600 2400/1200 CPU Bus Freq 133 400 400 0 0 L2 Cache (KB) 512 1024 512 Mobile Office Internet SPECint base SPECfp base Mobile Office Internet SPECint base SPECfp base Productivity Experience 2000 2000 Productivity Experience 2000 2000 Chipset Intel 830M Intel 855PM Intel 845 Mem Size (MB) 512 Mem Type/Speed PC133 DDR 266 Mem CAS Latency 2-2-2 2-3-3 ATI Mobility Figure 14: Maximum Battery mode performance and Graphics Core ATI Radeon 9000 Radeon M6 efficiency Graphics Mem 16MB 64MB Gfx Driver 6.13.10.3293 6.13.10.6200 Figure 14 presents comparative performance and Screen Resolution 1024 x 768 x 32bpp 60Hz efficiency results in the Maximum Battery mode. When HDD Mfr/Model IBM IC25N040ATCS05-0 all three platforms are locked at their lowest processor HDD Size, Buffer, RPM 40GB IDE 8MB 5400RPM OS, Build, File System WinXP, SP1 5.1.2600, FAT 32 frequencies, the Intel Pentium M processor running at Intel ICH3 Intel ICH4 Intel ICH3 600 MHz draws much lower power than the other two LAN Integrated Integrated Integrated processors. This puts the Pentium M processor in a Ethernet Ctrl. Ethernet Ctrl. Ethernet Ctrl. better position to get more work done for the power consumed under the workloads tested. The processor does compromise performance in this mode: it is about 20% slower than the Mobile Intel Pentium 4 Processor - M. However, it is extremely more efficient–about 5X more–allowing it to do significantly more work with the same energy. ∗ Other brands and names are the property of their respective owners.

The Intel® Pentium® M Processor: Microarchitecture and Performance 34 Intel Technology Journal, Vol. 7, Issue 2, May 2003

CONCLUSION of the 27th International Symposium on Computer Architecture, June 2000, pp. 306-315. The Intel Pentium M processor is Intel’s first [8] Mobile Pentium® III processors–Intel SpeedStep® microprocessor designed specifically for the Technology requirements of tomorrow’s mobile PCs. It provides http://www.intel.com/support/processors/mobile/pen uncompromised performance while observing the tiumiii/ss.htm. thermal and energy requirements and limitations of the [9] Intel® Centrino™ Mobile Technology Performance mobile platform. Performance-enhancement features Brief were included only if proved to be power-efficient. The http://www.intel.com/performance/resources/mobiletec processor features many novel power-aware performance hnology/ mechanisms such as advanced branch prediction, micro- [10] SPEC CPU2000 V1.2 operation fusion, a dedicated stack engine, and the http://www.specbench.org/osg/cpu2000/. optimized Pentium M bus. It also features the Enhanced Intel SpeedStep technology to reduce energy AUTHORS’ BIOGRAPHIES consumption. Simcha Gochman is a principal engineer/architect with These unique features enable the Pentium M processor Intel’s Mobile Platforms Group in Haifa, Israel, leading to deliver breakthrough performance and enable the microarchitecture definition activities of the Intel extended battery life thereby providing users with a Pentium M processor and its future derivatives. Simcha superior mobile experience. has been with Intel for 18 years. Earlier in Intel he led the microarchitecture definition of the Intel Pentium ACKNOWLEDGMENTS Processor with MMX technology and was involved The authors thank all the researchers, architects, with the design of the 80860 processor and the 80387 designers, and validators who took the Intel Pentium M numeric coprocessor. Simcha received his M.Sc. degree processor from a vision to a real product. Special thanks from the Technion, Israel Institute of Technology in go to Wilson Huang from Intel’s Mobile Platforms 1984. His e-mail is [email protected]. Group who conducted the performance and power measurements used in this paper. Ronny Ronen is a principal engineer/researcher and director of Intel's Microprocessor Research Lab in Haifa, REFERENCES Israel, focusing on microarchitecture research. Ronny [1] R. Ronen, A. Mendelson, K. Lai, S.L. Lu, F. was heavily involved in the definition stages of the Intel Pollack, and J.P. Shen, “Coming Challenges in Pentium M processor. He has been with Intel for 22 Microarchitecture and Architecture,” in years. Earlier in Intel he led the compiler and Proceedings of the IEEE, Vol. 89, No. 3, March performance simulation activities in the Intel Israel 2001, pp. 325-340. Software department. Ronny received his M.Sc. degree [2] IA-32 Intel Architecture Software Developer’s from the Technion, Israel Institute of Technology in Manual Volume 1: Basic Architecture at 1979. His e-mail is [email protected]. http://developer.intel.com/design/pentium4/manuals/245 Ittai Anati is a senior architect with Intel’s Mobile 470.htm Platforms Group in Haifa, Israel, focusing on the micro- [3] D. M. Brooks et al., “Power-aware architecture of the Intel Pentium M processor family. microarchitecture: design and modeling challenges Ittai has been with Intel for 14 years. Earlier in Intel, for next-generation microprocessors,” IEEE Micro, Ittai was part of the i860XP, Pentium Overdrive, and Vol. 20, Issue: 6, Dec. 2000, pp. 26-44. Pentium with MMX technology design teams. He [4] D.B. Papworth, “Tuning the Pentium® Pro received his B.Sc. degree from the Technion, Israel microarchitecture,” IEEE Micro, Vol. 16-2, April Institute of Technology in 1990. His e-mail is 1996, p. 8. [email protected]. [5] C. Mead and L. Conway, Introduction to VLSI systems, Addison-Wesley Publishing Company, Ariel Berkovits is a senior architect with Intel’s Mobile Boston, Dec. 1980. Platforms Group in Haifa, Israel, leading the Intel [6] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Pentium M processor performance analysis, projection Carmean, A. Kyker, and P. Roussel, “The and performance validation activities. Ariel has been Microarchitecture of the Pentium 4 Processor,” Intel with Intel for 11 years. Earlier in Intel he led the Technology Journal, Issue 1, 2001, article 2. [7] M. Bekerman, A. Yoaz, F. Gabbay, S. Jourdan, M.  Kalaev, and R. Ronen, “Early Load Address MMX is a trademark of Intel Corporation or its Resolution via Register Tracking,” in Proceedings. subsidiaries in the United States and other countries

The Intel® Pentium® M Processor: Microarchitecture and Performance 35 Intel Technology Journal, Vol. 7, Issue 2, May 2003 development of performance simulators and was degree in Computer Engineering from Boston involved in the definition of graphic chipsets. He holds University. His e-mail is [email protected]. a B.Sc. degree from the Hebrew University in Jerusalem. Copyright © Intel Corporation 2003. This publication His e-mail is [email protected]. was downloaded from http://developer.intel.com/ . Tsvika Kurts is a senior architect with Intel’s Mobile Legal notices at Platforms Group in Haifa, Israel, focusing on the Intel http://www.intel.com/sites/corporate/tradmarx.htm. Pentium M processor bus, platform, and debug hooks. Tsvika has been with Intel for 18 years. Earlier in Intel Tsvika was part of the Intel Pentium® Pro processor bus architecture and system validation team and was involved in the Intel Pentium 4 processor bus protocol development. He received his B.Sc. degree in 1984 and his M.Sc. degree in 1992, both from the Technion, Israel Institute of Technology. His e-mail is [email protected]. Alon Naveh is a senior architect with Intel’s Mobile Platforms Group in Haifa, Israel, focusing on processor and platform power management. He received his B.Sc. degree from the Technion, Israel Institute of Technology in 1983, and holds an MBA degree from San Jose State University. Alon was involved in the definition of the Intel Pentium M processor power management, PCI-E, and the Odem chipset. Prior to Intel, Alon worked in Motorola Semiconductor and in National Semiconductor. His e-mail is [email protected]. Ali Saeed is a senior systems architect in Intel's Mobile Platforms Group. His primary research area includes performance and power profiling of processors and chipsets with an emphasis on performance, power, and battery life estimations. Prior to joining Intel, Ali was manager of performance analysis in Compaq’s Commercial PC Group. He holds a Bachelors degree in Electrical Engineering from the University of Houston. His e-mail is [email protected]. Zeev Sperber is a senior architect with Intel’s Mobile Platforms Group in Haifa, Israel. He received his B.Sc. degree from the Technion, Israel Institute of Technology in 1984. Zeev has been with Intel for 19 years. He was involved in the definition of the Intel Pentium M processor out-of-order and execution sections. Earlier in Intel, Zeev was part of the Intel Pentium processor with the MMX technology team. His e-mail is [email protected]. Robert C. Valentine is a senior architect with Intel’s Mobile Platforms Group in Haifa, Israel, focusing on the Intel Pentium M processor’s fetch and decode cluster. Robert has been with Intel for 13 years and has worked both as a microarchitect in the Haifa’s Microprocessor Platforms Group and as a member of Haifa’s Research team. Prior to Intel, Robert worked in Prime Computer and Honeywell Information Systems. He holds an M.Sc.

The Intel® Pentium® M Processor: Microarchitecture and Performance 36 Paper1cover.qxd 1/31/03 9:11 AM Page 2

For further information visit: developer.intel.com/technology/itj/index.htm

Intel® Centrino™ Mobile Technology

System-Level Validation of the Intel® Pentium® M Processor

A compiled version of all papers from this issue of the Intel Technology Journal can be found at: http://developer.intel.com/technology/itj/index.htm System-Level Validation of the Intel® Pentium® M Processor

Isic Silas, Mobile Platforms Group, Intel Corporation Igor Frumkin, Mobile Platforms Group, Intel Corporation Eilon Hazan, Mobile Platforms Group, Intel Corporation Ehud Mor, Mobile Platforms Group, Intel Corporation Genadiy Zobin, Mobile Platforms Group, Intel Corporation

Index words: Pentium® M processor, System Validation, Periodic SMI, Bus Marginality Testing

ABSTRACT processor with major architectural and micro-architectural changes in performance and power and frequency  ® The Intel Pentium M processor is the first Intel optimizations, which directly entered the mobile market, microprocessor with major architectural and micro- i.e., without a previously validated desktop version. architectural changes in performance and power and Features of the Pentium M processor such as Enhanced frequency optimizations, which directly entered the Intel SpeedStep technology, new branch prediction mobile market, i.e., without a previously validated desktop algorithms, and micro-ops fusion could potentially be a version. A very tight Intel Pentium M processor post- source of severe logic issues, while timing enhancements silicon period from first silicon to launch and lack of IA- and low voltage working points (to reduce the CPU’s 32 testing and debug expertise by the Mobile OEMs, power consumption) could result in marginality problems dictated a different approach to validate the CPU, at both in the CPU’s core and the I/O circuit. the Intel level and the OEMs’ level. The main goal was to ramp up the Pentium M validation capabilities very fast The traditional CPU post-silicon platform-level validation and uncover all silicon bugs before they were reported by cycle, at Intel, consists of parallel development and testing OEMs (to prevent the time and effort it would take to of system validation (SV), compatibility verification (CV), debug sightings in the OEMs’ environment), while and OEMs programs. maintaining the tight OEM development cycle from The objectives of CV are to ensure that the samples to launch. microprocessor functions correctly in the standard system, This paper describes a novel complex post-silicon system with existing operational systems and commercial validation methodology, that enabled us to deliver a applications, and that it runs on customer reference boards healthy Intel Pentium M processor for the launch of the equipped with commercially available (for the moment) Intel Centrino™ mobile technology, despite the tight hardware configurations. The drawbacks of using CV are validation schedule. This was achieved by Intel’s insufficient visibility and controllability on CPU features uncovering all logic and circuit issues so no silicon-related usage and testing coverage, and difficulties in debugging bugs were found by customers, and by reducing of failures that in many cases are platform hardware or dramatically the system-level-failures’ debug time. software rather than CPU-related. The OEMs’ primary goal is to maintain the tight customer INTRODUCTION development cycle from samples to launch. Their The Intel Pentium M microprocessor, a key element of validation process is very similar to that of CV, with a Intel Centrino mobile technology, is the first Intel focus on overall system functionality. SV is based on testing the microprocessor functionality in specially designed, SV-hardware configurations, by Intel, Pentium, Intel SpeedStep, and Intel Centrino are running unique SV random tests that typically have trademarks or registered trademarks of Intel Corporation nothing in common with any commercial software. The or its subsidiaries in the United States and other countries.

System-Level Validation of the Intel® Pentium® M Processor 37 Intel Technology Journal, Vol. 7, Issue 2, May 2003 special SV hardware aims to simulate all commercial Intel Pentium M Processor System Validation platform configurations including all the add-in cards Platform options. Similarly, the SV random testing aims to  simulate all the existing, as well as yet to be developed, “Golan”: Intel Pentium M processor/Intel 855PM commercial software code combinations. These two chipset SV Platform factors provide very good controllability on validation The design of the Pentium M SV platform focuses on coverage, since they allow for the accurate generation of enhancing the following capabilities that are a must for any desired event and IA-32 scenario. The SV qualitative and efficient validation of modern environment also assumes an efficient failure localization, microprocessors: which guarantees fast issue debugging and identification. • Coverage/controllability. This enables the most The very tight Intel Centrino mobile technology program complex scenarios on CPU busses and other pins, to schedule was mainly due to the short period between the be easily programmed via testing software. Thus the Intel Pentium M processor first samples and launch. By processor bus behavior can be simulated, when used not having debug hooks and expertise from the mobile with various chipset types and add-in cards, by using OEMs, we had to approach the validation cycle in a only one type of chipset and without any add-in cards. different way at both the Intel and OEM level. The goal • Determinism. This ensures that you can immediately of SV was to ramp up the Pentium M processor validation reproduce system failures, clock by clock. capabilities in the fastest way and uncover all silicon logic and circuit marginality issues before they were reported by • Automation. Automation allows for remote execution OEMs. Since OEMs got early Pentium M processor of SV tests and for remote/programmable engineering samples (on average, approximately two modification (shmoo) of CPU operating parameters weeks after SV got units) before validation was (e.g., voltage, frequency, temperature). completed, uncovering issues before they were uncovered by the OEMs was a huge challenge. At the same time, an Pentium M early enabling program was developed to allow OEMs to start their platform integration and validation; to test chipset, memory, and graphics subsystems; as well as Mobile other platform components, even before the Pentium M Agent ForkTG samples became available. PSB This paper describes different aspects of the Intel Pentium DDR M processor’s complex post-silicon validation program AGP Card 855PM and details the major results. Memory

THE INTEL PENTIUM M PROCESSOR Hub i/f SYSTEM VALIDATION Test Cards

The Intel Pentium M processor system validation (SV) ICH4-M PCI Test Cards methodology is based on the following major components: • A unique SV platform that differs from a standard USB, AC97, LAN etc. motherboard because of specially developed hardware agents connected to the CPU front-side bus and other pins. Figure 1: Block diagram of Golan, ® ®  • A package of random instruction-based SV testing an Intel Pentium M processor/Intel 855PM chipset software, that ensures excellent coverage for logic system validation platform and circuit marginality aspects. A block-diagram of the Intel Pentium M processor  • A Periodic SMI (PSMI) methodology for automatic configured with an Intel 855PM chipset SV platform, and very fast reproduction of system-level failures on named Golan, is shown in Figure 1.

the Pentium M processor software model (RTL) or The “brain” of this platform, the Mobile Agent, is the Debug Tester. implemented in a Field Programmable Gate Array (FPGA) and controls a wide set of features to support all SV requirements. The essential part of the Mobile Agent is the Programmable Signal Generator (PSG). This device

System-Level Validation of the Intel® Pentium® M Processor 38 Intel Technology Journal, Vol. 7, Issue 2, May 2003 was designed to enable simultaneous injection of signals true uni-processor environment. Nevertheless, neither the during CPU code execution. Those signals, which are processor nor the MCH are aware, logically, of its activated in different modes (e.g., single or burst, level or presence. When injecting PSB transactions, Fork TG pulse), include legacy signals like NMI, SMI, and INTR, manipulates each of the two chips to act as if the counter- which are connected directly to the processor pins, and chip is the source for those transactions. This innovative various break events, which are sent from the IO Control solution brings plenty of new features and benefits to the Hub (ICH). Each PSG signal duration and timing system validation environment. parameter can be programmed via software. In addition, Fork TG activity is based on controllable triggering and the Mobile Agent can identify different power inter-operating injection streams. The notion of injection management events and use them as trigger sources for implies placing transactions on the processor system-bus PSG signals. It guarantees good coverage of various and thus creating traffic. The density of this traffic as well border scenarios, essential for mobile platforms, that as its duration are regulated by Fork TG configuration previously were subject to real CPU problems or other parameters programmed via validation software as part of platform component issues. It should be emphasized that the SV tests. PSG completely eliminates the need for commercial Peripheral Component Interconnect (PCI) or graphics Three modes have been defined in order to regulate Fork cards for the SV board, ensuring at least the same level of TG injection streams: pre-defined, shadow, and self- CPU bus stressing and significantly better reproducibility. generated. The pre-defined mode is mainly used to exercise the cache cluster (L1 and L2) and the Memory The Golan SV board also includes a power management Ordering Buffer (MOB) in the processor. In this mode, Rule Checker/Tracker designed to monitor the CPU’s and Fork TG monitors the bus, looking for an event (e.g., a the system’s power management modes transitions, and bus transaction) that triggers it to stream the pre-defined the Enhanced Intel SpeedStep rules. package of transactions, which were programmed Very accurate, automated control over the system clock beforehand to Fork TG’s external RAM. scheme enables the CPU Processor System Bus (PSB) The shadow mode was conceived for exercising the cache frequency modification (shmoo), which is widely used by cluster of the CPU, along with the processor bus unit. In SV for Pentium M circuit marginality testing, together this mode the transaction sampled on the PSB undergoes with the Core voltage control (shmoo) capability, which is some updates in Fork TG, and the newly created also implemented in the Golan board. transaction is placed back on the bus as a part of the Fork Among other Golan SV board features, are power TG injection stream. In other words, the original consumption (V,I) measurements and history recording, transaction is ‘shadowed’ by the newly generated one, reset and straps, and a Joint Test Action Group (JTAG) altered according to some rule. This is the essence of this standard controller. operation. Possible alteration rules, chosen during SV test generation, include replacement of the request type (e.g., a Fork Transactions Generator: SV Processor System Read Invalidate operation instead of a Read Data Bus (PSB) Agent operation), replacement of the memory type, a change in

Another leading methodology for the Pentium M the length of the request, or an update of the transaction processor SV platform is the usage of active bus agents to address within the page (4K) boundary. provide bus activation right on the spot. The biggest advantage of these agents over end point agents (such as In Self-Generated mode, the PSB is stressed by bursting PCI cards) is their capability to inject aimed transactions, out injections irrespective of current bus events. Three triggered by normal bus events, and their high success rate parameters, start address, end address, and address step, in failure reproduction. The most important active bus along with the number of loops, set the duration of the agent used for the Pentium M processor validation was the self-generated test. Fork Transactions Generator (Fork TG), residing on the The Fork TG has been heavily used throughout the processor system bus (PSB). Pentium M processor post-silicon validation period, Because it was designed for mobile systems, the Pentium mostly for caches and the PSB stressing. This resulted in M processor bus supports uni-processor mode only, thus, the detection of two real silicon issues. preventing the usage of another CPU for its caches and bus unit stressing. Therefore, Fork TG, implemented as a Random Instruction Testing: Pentium M re-programmable FPGA, demonstrates a new concept in processor System Validation Methodology the architecture of a uni-processor validation platform. There are several different methodologies for processor Specially designed to be placed between the processor and system-level validation. The Pentium M processor system the Memory Control Hub (MCH), Fork TG is acting in a validation was principally based on the Random

System-Level Validation of the Intel® Pentium® M Processor 39 Intel Technology Journal, Vol. 7, Issue 2, May 2003

Instruction Testing (RIT) concept. This means that CPU Random Instruction Testing methodology assumes the correct functionality, logic and circuit wise, is being execution of many millions of seeds before the required checked by a tremendous amount of discrete tests aimed at coverage of all CPU features is achieved. Therefore, the covering all possible architectural and micro-architectural SV Lab infrastructure includes up to 100 Golan platforms scenarios enabled by Spec. Each random test (seed), and several hundred test generators able to produce seeds typically runs from reset, emulates some artificial 24-hours a day, executing approximately 1 trillion operational system with all its attributes (e.g., descriptor instructions per week. Enormous effort was put into and interrupt tables, exception handlers, paging enhancing generation and execution throughput, resolving hierarchy), and also contains several thousand instructions Lab network problems and achieving automatic control of “user’s code,” running under this pseudo-Operating over lab resources. As a result, around 10 billion random System (OS). Both OS attributes and user’s code are seeds, each containing several thousand random maximally random, in terms of instruction sets, memory instructions, were run by the Pentium M processor allocation for OS level structures, code and data, caching throughout its validation period from first silicon arrival to policy, etc. Random programming of SV platform launch. residing agents, namely, Mobile Agents (signals, events) One more important thing that influences the post-silicon and Fork TG (processor system bus transactions) are also validation success, especially during the first critical under seed responsibility. weeks, is the cleanliness of testing software. To ensure Random test generation is being done in a software the software is clean, the Pentium M processor test content environment, specially developed at Intel. It includes a has passed massive dry runs on Intel Pentium 4 kernel, which is applicable to any current and future platforms, on Golan SV boards equipped with the Erez microprocessor, and modules that define processor- interposer, described below, and on Pentium M models. specific test content. These modules, which, by As a result, when silicon arrived, the system validation themselves, are very complex software projects, are team faced few false alarms and could concentrate just on responsible for the coverage of microprocessor health. debugging real CPU issues. Test generation rules implemented in these modules result in the success of silicon in mass production, e.g., absence of escaped bugs. 100

The Pentium M SV test content has been developed, 80 taking into account these processor multiple architectural and micro-architectural enhancements, as compared to 60 % previous generations of CPUs, and the unique mobile 40 features of the Pentium M processor. SV engineers spent much time with architects and designers in order to 20 identify the most risky areas of this processor, before 0 implementing their coverage in testing software. Among 0 1020304050 the areas that required special attention were the weeks from silicon arrival following:

• Enhanced Intel SpeedStep technology assuming processor frequency transitions during regular code Figure 2: Cumulative bug detection rate by Intel ® execution. CPU response to different events, like Pentium M processor SV from first silicon to launch power management or external interrupts randomly All factors described in this section have resulted in very occurring during any of the transitions, was a special fast bug rate detection once Pentium M processor samples focus of SV arrived at the SV Lab. Figure 2 shows the cumulative bug • Various scenarios related to power management detection rate from first silicon to launch, and it clearly transitions of the CPU, e.g., to/from sleep and deep shows that 50% of the logic issues found in the Pentium sleep states, with heavy involvement in break events, M processor were uncovered during the first two to three interrupts, APIC messages, and snooping weeks of validation. Also, the latest bug that needed a silicon fix was uncovered after 20 weeks of validation, • Novel micro-architectural features, e.g., micro-ops while all further issues were of low severity and were fusion closed via errata. • Marginal data paths capable of causing system failures, specifically at high frequencies

System-Level Validation of the Intel® Pentium® M Processor 40 Intel Technology Journal, Vol. 7, Issue 2, May 2003

Periodic SMI (PSMI) Failures Reproduction and Pentium® 4 processors. The mobile-oriented features

Methodology of the Intel Pentium M processor brought new challenges derived from the following facts: An essential component of any silicon validation program is the capability to quickly and efficiently reproduce • SMI is heavily used in mobile platforms; therefore, system failures on the RTL model of the chip under test, the potential conflict between regular SMI and PSMI in the case of a logic issue, or on the Tester, in the case of should be eliminated. This has been done by a circuit issue. The major problem of reproduction is to separating SMI and PSMI both in the internal CPU ensure that silicon behavior on the system is absolutely design, and at the system memory space level. synchronized with that of the RTL/tester. This is because • Enhanced Intel SpeedStep technology, implemented most failures occur far beyond the reset point, while all in the Pentium M processor assumes internal clock processor arrays and registers, e.g., caches, TLBs, branch frequency changes during regular CPU running. predictor, contain some history, which is not visible These frequency transitions have to be captured by externally. The Periodic System Management Interrupt the PSMI handler and transferred correctly to the (PSMI) methodology, developed at Intel approximately simulation environment. five years ago, is aimed at resolving this problem and providing full synchronization between microprocessor • Power management features that cause partial behavior on the system and simulation environment during shutdown of Pentium M processor internal and the failure reproduction phase. This methodology external clocks are heavily used in the CPUs of assumes an injection of periodic signals on the SMI pin of mobile platforms. These also introduce serious the CPU, while the latter executes a failing test or reproduction issues. application, forcing it to enter periodically into a specially In order to ensure full readiness of the PSMI flow in all developed SMI handler, as shown in Figure 3. corner cases, and to accommodate the above challenges, PSMI PSMI PSMI PSMI PSMI Reset Failure PSMI was validated extensively in the pre-silicon period. Download A comprehensive test plan was written, which included 60 . . . Trace focus tests and over 100 random tests. Each test emulated the PSMI flow–a trace being taken from a simulated system and driven into the target simulation environment. A simulation acceleration machine was heavily used to

Synch Point Resume increase the execution throughput. State Synchronization Restore Despite the above increased complexity, the PSMI Dump Phase Phase methodology has demonstrated impressive results during

PSMI Handler Pentium M processor system-level validation. The PSMI flow was ready for use and reproduced the first logic issue Figure 3: Schematic description of periodic SMI the first week after first silicon arrival. Overall, the post- methodology silicon validation period, 100% (!) of all the logic issues The Logic Analyzer trace, downloaded for further brought from the system, have reproduced on the RTL reproduction, should reflect full CPU external activity model, with an average latency of only 1-2 days. As for (taken from all its pins) starting from the SMI closest to circuit issues mostly related to internal speed paths, Intel’s the failing point. The most critical parts of the success rate is approaching 94% – much higher than in sophisticated PSMI handler are the Dump and any previously validated microprocessors. Synchronization Phases as shown in Figure 3. They provide a dump of the content of the CPU internal state to PROCESSOR SYSTEM BUS memory, making it visible on the processor system bus, MARGINALITY VALIDATION and an initialization/flush of some arrays and state Due to the low-voltage and high-frequency Processor machines. Therefore, the CPU behavior, after the Sync System Bus (PSB) of the Pentium M processor, a point of the handler, is absolutely deterministic, and the thorough signal integrity validation became critical to trace is ready for conversion and reproduction on the RTL ensure the design robustness of customer reference boards model or Tester. and OEM’s mobile systems. Typically, bus marginality is PSMI technology has been successfully used for system- measured in two domains, namely, input buffer reference level validation of Intel Pentium MMX, Pentium® III

System-Level Validation of the Intel® Pentium® M Processor 41 Intel Technology Journal, Vol. 7, Issue 2, May 2003

voltage Vref, and data/strobe timing ∆t. A two- margin dimensional (Vref, ∆t) area of reliable system Target HOST network network Server functionality forms an ellipse, called an Eye Diagram, as margin shown in Figure 4. Target HOST One can see that the boundaries of an Eye Diagram reflect conditions where the system starts failing, while its center provides an optimal design point for the Vref and ∆t margin parameters. A single Eye Diagram measurement requires Target HOST several hundred iterations to be accomplished at different points, while measurements should be repeated under a wide spectrum of conditions and interconnection Figure 5: Block diagram of Bus Marginality Tester parameters, e.g., temperature, CPU or chipset skew units. (BMT) This makes the bus marginality testing extremely time BMT usage in the Pentium M processor project brought consuming (typically, about four to eight hours per “eye”). very impressive results. First of all, an average Eye ∆Vref Diagram measurement time was reduced from hours to ten Points of test: to fifteen minutes. It allowed us to build about 40,000 functional “eyes,” providing wide coverage of bus Fails or system interconnect validation, with a Shmoo of many corruption parameters: package and board impedance, silicon skew Pass ∆t material, temperature, IO voltage, various bus stress area patterns, various memory combinations, etc. BMT has uncovered and helped to resolve several serious system problems. Among them are two severe Small Outline Dual In-line Memory Module (SO-DIMM) Optimal Vref and ∆t memory issues, Double Data Rate (DDR) Logic Analyzer Interface issues, and DDR data/strobe de-centering Figure 4: An example of a bus marginality problems. The BMT was effectively used for the fast “Eye diagram” comparison of signal integrity bus robustness design of To intensify the process of Pentium M processor analog different OEM motherboards. Thus, we were able to give validation, a novel tool, named the Bus Marginality Tester OEMs early feedback on the robustness of their board (BMT) was developed by the Pentium M Signal Integrity design. and Analog Validation team. The BMT is actually a package of hardware and software tools that is used to Erez Interposer: Early Enabling Program measure automatic signal integrity bus marginality. Its A very tight Intel Pentium M processor post-silicon period structure is hierarchical, as shown in Figure 5, consisting from first silicon to launch has put significant pressure on of three levels: Target, Host, and Server. Bus stress runs OEMs’ platform testing and integration schedule. A on the Target level (actually a board under test), Host major breakthrough in this area was early availability of a controls the Eye Diagram building process and is specially designed small form-factor, full speed, pin connected to the Target via a PCI-to-PCI bridge. It compatible interposer, named Erez. Equipped with a regulates Eye Diagram parameters on the Target level via Pentium 4 processor, the Erez interposer was inserted into a specially designed margin card (Vref) and BIOS settings the CPU socket of the Pentium M system, and it (∆t). A single server manages several Target/Host blocks successfully imitated the functionality. This program via a Local Area Network (LAN) and provides the user paved the way for platform designers of OEM systems, interface with an operator. It allows massive automatic reference boards, and SV platforms to debug their parallel execution on several Target/Host blocks from one hardware and ramp-up the critical mass of systems, server. All Eye Diagram accumulated information coming months before Pentium M processor silicon arrived. In to the server is stored in a special database and is subject total, eight OEMs have participated in the early enabling to automatic sophisticated analysis. program, using Erez interposers. Each OEM used it for the development of several mobile platforms. In parallel, it also enabled BIOS and software developers to test and clean their products.

System-Level Validation of the Intel® Pentium® M Processor 42 Intel Technology Journal, Vol. 7, Issue 2, May 2003

RESULTS SUMMARY System Assurance department manager, responsible for developing hardware and software capabilities for An efficient System Validation program has been components and platform-level testing, to ensure the developed for the Intel Pentium M processor, a key delivery of healthy products and platforms. Isic joined element of Intel Centrino mobile technology. The Intel in 1988, working on many aspects of platform objective of system validation (SV) is to ensure the validation. He started as a hardware engineer developing delivery of a healthy Pentium M processor, from the point validation platforms, and then moved on to validation of view of logic and circuit marginality, in an extremely software and validation and debug tools. He is now tight schedule, taking into account that it is the first time a perceived as an Intel expert in the area of validation. His brand new processor enters the mobile market, without e-mail is [email protected] any sophisticated IA-32 testing and debug capabilities. Igor Frumkin is a validation group manager in Intel’s Intel uncovered 100% of the Pentium M logic and circuit Israel Design Center, currently leading system-level issues with no silicon-related bugs found by customers. In validation of Intel microprocessors for mobile platforms. addition, 50% of the tens of the silicon issues were Igor joined Intel in 1995. He received his B.Sc. and detected during the first two weeks of validation, showing M.Sc. degrees in Electrical Engineering from the Institute the high quality of the SV test suite and the excellent of Aviation Instruments Making in Leningrad (the former throughput of moving failures from system to RTL and Soviet Union) in 1981, and his D.Sc. degree in Quality Tester, using the PSMI tool. A combination of a unique Assurance from the Technion (Israel Institute of SV test environment, the PSMI tool, Erez early enabling Technology) in 1995. His e-mail is program, and the BMT allowed Intel to deliver the [email protected] Pentium M processor in time for the launch of Intel Centrino mobile technology. Eilon Hazan is a debug tools team manager in Intel’s Israel Design Center, leading the PSMI tool and other DISCUSSION debug capabilities for system-level validation in the The success of the Intel Pentium M processor system Mobile Platform Group. Eilon joined Intel in 1995. He validation program was predetermined by the following received his B.Sc. degree in Electrical Engineering from factors: the Technion (Israel Institute of Technology) in 1997. His e-mail is [email protected] • Validation engineers had extensive expertise in Ehud Mor is a Hardware Development group manager in processor and platform-level architecture, processor Intel’s Israel Design Center, responsible for the design of testing, and validation platform design hardware solutions for system validation. Mor received • Validation platforms and testing software were his B.Sc. degree in Electrical Engineering from the defined and developed in a very cooperative Technion (Israel Institute of Technology) in 1979. Before atmosphere joining Intel in 1994, Mor was a senior design engineer at Integrated Device Technology Inc., RAFAEL and Elbit • Innovative coverage and validation efficiency- Electronics. His e-mail is [email protected] oriented ideas (Fork Transactions Generator, Bus Marginality Tester, and Erez) were driven and Genadiy Zobin is the Signal Integrity Engineering Group implemented manager in the Intel Israel Development Center. He works • on interconnect signal integrity design and validation of Very thorough preparations were made to guarantee mobile platforms. Genadiy joined Intel in 1991. He has the readiness of the validation platform, testing wide experience in platform design and validation, software, and debugging tools (PSMI) on the eve of including logic, circuit, ASIC, and interconnect design. silicon arrival He received his M.Sc. degree in Electrical Engineering As modern microprocessors become more and more from Leningrad Polytechnic Institute in 1977. His e-mail sophisticated, and as market competition tightens the post- is [email protected] silicon schedule, it is essential to combine efficiently all Copyright © Intel Corporation 2003. This publication the above factors, as was done for the Intel Pentium M was downloaded from http://developer.intel.com/. processor.

AUTHORS’ BIOGRAPHIES Legal notices at Isic Silas received his B.Sc. degree in Electrical http://www.intel.com/sites/corporate/tradmarx.htm. Engineering from the Technion (Israel Institute of Technology) in 1988. He is the Intel Israel Mobile Group,

System-Level Validation of the Intel® Pentium® M Processor 43 Paper1cover.qxd 1/31/03 9:11 AM Page 2

For further information visit: developer.intel.com/technology/itj/index.htm

Intel® Centrino™ Mobile Technology

Intel® Pentium® M Processor Power Estimation, Budgeting, Optimization, and Validation

A compiled version of all papers from this issue of the Intel Technology Journal can be found at: http://developer.intel.com/technology/itj/index.htm Intel® Pentium® M Processor Power Estimation, Budgeting, Optimization, and Validation

Dani Genossar, Israel Mobile Platforms Group, Intel Corporation Nachum Shamir, Israel Mobile Platforms Group, Intel Corporation

Index words: power estimation, power budgeting, power reduction, Intel SpeedStep, thermal throttling.

ABSTRACT 1. Per block power estimation and break down to support the setting of the power Plan Of Record This paper describes the approach taken by the design (POR). team to ensure that the Intel Pentium® M processor will be a compelling microprocessor for the Intel® Centrino 2. Data to identify and plan power-reduction strategies. mobile technology-based platform. We discuss the power 3. On-going verification to ensure convergence towards estimation flow and describe the power and thermal driven the power POR. architecture and circuit enhancements of this architecture. To meet the aggressive POR, the design team used the The Intel Pentium M processor is Intel’s first CPU to data generated from the SDPE tool and invested heavily in provide an improved multi-gear low-overhead mechanism power optimizations. Silicon-based measurements  with Intel SpeedStep technology and an advanced demonstrate the success of the power reduction work and Thermal-Throttling-2 implementation. In normal show an excellent correlation with the power and thermal conditions, the operating system can dynamically adjust pre-silicon estimations. the processor speed according to the performance requirements, allowing a power-on-demand operation. To The end result is that the Intel Pentium M processor is a protect the device from overheating during extreme power compelling mobile product delivering high performance transitions, the Intel Pentium M processor uses a and improved battery life within a restricted mobile combined voltage and frequency control that provides thermal envelope. The developed architecture, efficient cooling with minimal impact on performance. methodologies, and knowledge base pave the way for the design of Intel’s next-generation power-efficient mobile Tools and methodologies were developed by the design products. team to extract and analyze the power data for each of the basic functional blocks of the Intel Pentium M processor. INTRODUCTION The flow core component, the Stochastic Dynamic Power Estimator (SDPE), is a novel statistical power estimation The design of a mobile processor is guided by the thermal tool. The power estimation activity provided the limitations of the platform. A thin form factor is desired, following: affecting the battery size and the efficiency of the cooling system. Reduction of active and idle power consumption provides longer battery life and allows operation at higher frequencies. The Intel Pentium M processor design team developed and applied pre-silicon analysis utilizing a statistical approach that enabled early estimation well Pentium M is a trademark of Intel Corporation or its before the design was stabilized. As a result we were able subsidiaries in the United States and other countries. to identify power-saving opportunities and overheating- Centrino is a trademark of Intel Corporation or its prone areas early in the design cycle. The Intel Pentium subsidiaries in the United States and other countries. M processor power saving amounted to 35% of the maximal active power of the processor core. A novel Intel SpeedStep is a trademark of Intel Corporation or its multi-gear low-overhead Intel SpeedStep technology subsidiaries in the United States and other countries. mechanism was implemented to provide frequency on

Intel® Pentium® M Processor Power Estimation, Budgeting, Optimization, and Validation 44 Intel Technology Journal, Vol. 7, Issue 2, May 2003 demand, and the advanced Thermal-Throttling-2 f is the frequency, C is the lumped capacitance, and significantly improved the utilization of the available is the supply voltage. cooling capabilities, contributing to the overall result of VCC making the Intel Pentium M processor an attractive The dynamic power analysis flow is described in Figure 1. mobile processor with high performance and low power The first stage is the development of the tests. For high- consumption. power tests, we maximize the command execution To accomplish our reduction goals the design team needed throughput, taking into account the processor parallel and to first set a POR for every functional unit in the out-of order capabilities. For low-power tests, we utilize processor. Once a POR was set, we needed to track the processor bottlenecks to achieve low execution adherence to this POR and provide the designer with the throughput. The high-power tests are used for examining relevant feedback. The power design work included a the thermal solution and power delivery efficiency. The wide range of activities. We started by developing our low-power tests help to identify power bugs and savings tools and methodologies. Analyses of previous-generation opportunities. CPUs served as a baseline for determining the power POR We developed low-power tests that are based on execution for the Intel Pentium M processor and for identifying bottlenecks. Use of slower commands such as division or power-saving opportunities. When the processor RTL square root cause the system to stall by filling the input model became functional we started developing tests and command queue. The other types of idle tests are power debug techniques and analyzed various high- and characterized by an empty command queue, for example, low-power applications. During the design phase we due to a second-level cache miss during code fetch. In identified power bugs and additional power-saving high-power tests, we optimize the command flow opportunities. When the circuit database was stabilized, according to the internal dependencies, execution ports we analyzed the thermal stress and guided the setting of configuration, and available dispatch and retire the temperature control logic. bandwidth. Some of the high-power tests maximize the Intel Pentium M processor silicon-based measurements overall power consumption, while others stress target show that the extensive power-reduction work was units. successful, allowing delivery of higher performance and higher frequency without increasing the power envelope.

The high power vs. idle ratio was significantly improved, Power tests coding: High/Low further reducing the average power consumption and power, Thermal stressing extending the battery life. A review of the pre-silicon dynamic power estimates showed an excellent correlation with overall error rates in the order of 5%, and it also Logic Simulation based activity factors and signal probabilities showed accurate identification of the device hot spots. extraction

In this paper we describe the power design efforts Circuit including the estimation and analysis flows, architecture SDPE capacitance enhancements, and the actual reduction work. We Power calculation data and ckt conclude with a few examples demonstrating our thermal directives throttling efficiency and the accuracy of our power- simulation models. Power analysis and Thermal analysis validation DYNAMIC POWER ESTIMATION FLOW Circuit The dynamic power is dissipated on charging the circuit Design Changes parasitic capacitances and is linearly dependent on the RTL number of signal toggles. To calculate the dynamic power we define the activity factor as the average number of Figure 1: The power design cycle of the Intel Pentium zero-to-one transitions during a clock cycle. To extract M processor. Logic simulations were performed to the activity factors we performed logic simulation and collect activity statistics. The data were used by SDPE counted toggles and state statistics. to calculate the power estimates. Analysis of the power results yielded design changes in the circuit and in the For each node we calculated the dynamic power using RTL. ⋅⋅⋅= VCfAFP 2 where AF is the activity factor, CC To calculate the dynamic power we first extract the activity statistics by logic simulation of the various power

Intel® Pentium® M Processor Power Estimation, Budgeting, Optimization, and Validation 45 Intel Technology Journal, Vol. 7, Issue 2, May 2003 tests. The activity statistics of the Functional Unit Block Using schematics analysis we identified all gated clocks (FUB) inputs were used by our power estimation flow, the and compared the activity in various tests. We identified Stochastic Dynamic Power Estimator (SDPE,) to generate clocks that operated during idle tests more than during vectors and calculate the FUB power. The SDPE high-power tests and clocks that operated at all times. estimates activity factors by transistor-level logic During the power debug we encountered power-related simulation. The SDPE flow uses the Monte-Carlo bugs that did not have a functional impact, hence they approach to generate the FUB input vectors according to were not detected by traditional validation methods. To given activity statistics. The input vectors are injected detect such bugs, tests were modified to step between into the FUB and propagated by a unit delay transistor- power modes. We verified that the power consumed was level simulation [1, 2]. Thanks to the statistical not affected by the history of the machine prior to the implementation we estimated the power from the earliest tested time window. design stages, even when mismatches between RTL and schematics existed on the FUB interface pins. The impact The estimation and automated analysis flows were of missing data on accuracy was measured by comparing developed and regularly run by a small group of three various default activity assignments. engineers. The total simulation run time is 13 hours for the core FUBs and an additional 30 hours for all the Level The statistical approach drastically reduces the data 2 cache arrays. Power analysis runs were performed every storage requirements and saves simulation time. In some few weeks during the design phase of the Intel Pentium M cases, the statistical approach affected the accuracy of the processor. power estimation. For most FUBs we obtained reasonable results and successfully identified the main power Logic Optimizations consumers and power-saving opportunities. Excellent correlation was found between the overall pre-silicon The most effective power-saving opportunities are to be power estimates and the results of the Intel Pentium M found at the logic level. First priority was given to clock processor power measurements. gating in order to prevent circuits from running when not used. The hierarchy in which this gating is to be Power Analysis and Validation implemented must be chosen carefully. Choosing a too high level hierarchy can significantly reduce the gating The results of the RTL simulation and SDPE runs provide opportunities while a too low level hierarchy can end up data for a variety of applications. Every few weeks we with a control logic that consumes more power than is calculated the power roll-up and compared each unit to saved by the gated logic. By carefully structuring the the POR commitment. The Max power test was chosen as microarchitecture, the machine is optimized to focus on the roll-up reference due to the wide coverage of the chip the required activities and minimize the redundant ones. resources. Other high-power tests were used to cover FUBs that were not operated by the Max power test. Idle New control logic was added to the first-level instruction tests were used for power design validation and for cache and data cache units to detect and eliminate cases identifying additional power-saving opportunities. where accesses are being requested within the same page/line, thereby eliminating the tag lookup and To monitor the clock-gating efficiency we analyzed the minimizing the number of accessed banks. The power idle tests activity. Unlike the power-down state, during saved by eliminating these cycles outweighs the power the idle tests the global clock is active and local logic consumed by the additional control logic, yielding a net gates the clock in the idle units. Observing the FUBs that savings for the overall design. are still active is an excellent way to identify gating candidates and power-design bugs. To implement the In the RAT (register renaming) unit, the register files were proposed gating one must justify the additional design partitioned according to the data types (MMX, Integer, effort. SDPE provides “what if” analysis for estimating Floating Point) instead of creating a worst-case combined the power-saving return on investment (ROI). The data width, and accesses are performed based on the analysis is performed by overriding the clock enable required data type. Thus the access of redundant data was signals. eliminated up front, reducing the active power consumed by these circuits. We applied several techniques to identify power bugs and additional gating candidates. Wide vectors that are active In the Branch Prediction Unit (BPU), which was designed during the idle tests were identified in RTL simulation and from scratch for the Intel Pentium M processor, many SDPE data. The root cause was mapped in most cases to one of the following categories: domino-driven gates,  wide latches with a non-gated clock, or bugs in the clock- MMX is a trademark of Intel Corporation or its gating control logic. subsidiaries in the United States and other countries.

Intel® Pentium® M Processor Power Estimation, Budgeting, Optimization, and Validation 46 Intel Technology Journal, Vol. 7, Issue 2, May 2003 power optimizations were made: Target Lookups take I/O Optimizations place only when needed, and Predictions are prevented for The I/O power supply was separated from the core power Unconditional Branches. supply to allow independent optimization of each of the power supply’s voltage levels, as is evident from the Intel Pentium M processor data sheet. A special Dynamic On Die Termination (ODT) circuit was added to the output buffer design that enables disconnecting the on die termination when the CPU drives the bus low, thereby reducing by half the power consumed by the I/O. Data inversion support (first implemented in the Pentium 4 processor) between the Intel® 855PM chipset and the Intel Pentium M processor further reduces the power dissipation, due to the line termination, by minimizing the number of bits driven low on the Gunning Transceiver Logic (GTL) bus. To minimize the time during which the bias current is on in the input buffers of the Processor Side CIRCUIT OPTIMIZATIONS Bus (PSB), a new signal (DPWR#) was added to the interface to indicate to the processor when to operate the Circuit techniques have a wide impact on the power input buffers. The overall impact of all these budget required for implementing a given functionality. optimizations reduced the active power of the PSB In the Intel Pentium M processor design, many domino- interface by a factor of 2 and the average power by a based circuits were replaced with static CMOS circuits factor of 10. eliminating the strong power effect domino circuits have on the input data polarity. Furthermore, by simplifying Power Management Optimization the timing restrictions on the domino inputs we were able to reach a favorable solution for both power and speed. The dynamic power component increases linearly with the frequency of the processor clock and by the square of the All arrays and register files were reviewed to ensure voltage, hence being able to dynamically adjust the optimal banks partitioning such that the capacitive load voltage and frequency to the workload has an enormous that was toggled per access was minimized. impact on the average power. The Multi-gear Intel In the 1M second-level cache, leakage current becomes a SpeedStep technology that was implemented in the Intel significant factor hence the required performance (i.e., Pentium M processor allows the operating system to access latency) was balanced with the power constraints, provide frequency on demand stepping through pre- yielding a design that reduced the second-level cache defined voltage-frequency pairs, spanning a range of leakage by a factor of two, by using non-minimal channel about 8x in the power and almost 3x in performance lengths for almost the entire array. between the lowest and highest power-performance points [3]. To meet the Intel Pentium M processor speed goal we used a high-performance-Low-Vt (LVT) device that has a Models generated from study of real applications usage in lower threshold voltage (Vt) at the expense of higher a mobile environment indicate that high performance is leakage. The need to carefully balance the utilization of typically needed only for short bursts of time. Average LVT devices to improve the speed of circuits with the power optimization therefore relies on the ability to switch increase in leakage, led to the setting of an explicit LVT the operating point frequently and in an efficient manner. usage POR for every block in the design. The synthesis To satisfy user interactive demands, the system response flow we used assigned the LVT devices at the cell level; time must be kept low. Optimization of the Intel therefore, the synthesis blocks had very high initial SpeedStep transition process implemented in the Pentium insertion levels. To reduce this insertion level, a device- M processor reduced the switching time and allowed level flow was utilized leaving LVT devices only on efficient use of the Multi-Gear Intel SpeedStep technology critical paths. This flow saved more than two-thirds of the mechanism with minimal latency and performance original assignments. A similar flow was used on data degradation. path designs to identify redundant assignments. When a block exceeded the allocated budget, a review was held to ensure the design was optimal. Pentium 4 is a trademark of Intel Corporation or its subsidiaries in the United States and other countries.

Intel® Pentium® M Processor Power Estimation, Budgeting, Optimization, and Validation 47 Intel Technology Journal, Vol. 7, Issue 2, May 2003

Previous generations of Intel CPUs integrated a linear PRE-SILICON ESTIMATES VALIDATION throttle mechanism, stopping the CPU for short periods of A correlation study was performed using silicon-based time and allowing it to cool. The power impact of Infra-Red Emission Microscopy (IREM) measurements of controlling both core voltage and frequency is selected power tests. A simulated power density map and significantly higher than the linear yield of the standard the corresponding IREM image are presented in Figure 3 frequency adaptation. The Pentium M processor is Intel’s and Figure 4. The color-coding represents the average first CPU designed specifically for the mobile market power and local emission densities. The levels are black implementing improved Thermal Throttle 2. Thermal (lowest), red, orange, yellow and white (highest). Throttle 2 uses frequency and voltage scaling to control Although the simulation results are presented as average maximum and average CPU power. The use of combined power density per FUB, and the infrared emission is voltage and frequency scaling results in a lower mainly obtained from non-stacked n-type devices, the performance degradation compared to standard clock- correlation can clearly be seen. throttling techniques within any given cooling solution. Multiple measurements under various test modes enable THERMAL THROTTLING VALIDATION us to measure and confirm the impact of the major power- saving features. The results confirm our pre-silicon The cooling intensity of mobile CPUs is adjusted by the estimates and sum up to 35% of the total active power. system according to the processor temperature. Control is achieved by using variable fan speeds. The fan is activated only at high core temperatures to save the battery and reduce acoustic noise. The response of the cooling system is significantly slower than the processor self-heating process. As a result, the system cannot react on time to prevent rapid temperature increases affecting the device reliability, degrading maximal operating speed, and possibly even causing accidental shut-down due to temporary overheating. To study the system dynamic thermal response, a special power-stepping test was written, providing repeated high- power pulses with a programmable duration. To achieve extreme abnormal conditions the device was operated without a heat sink! The test alternates between high and low power segments at a duty cycle of 1 to 10 causing significant power and temperature transitions. Figure 3: Simulated power density Thermal Throttle 2 provided excellent results, and the successful temperature clamping is shown in Figure 2. It demonstrates that the Intel Pentium M processor can modify working conditions to ensure that the thermal envelope limit is not exceeded.

Throttling set-point

Steady state

Temperature Initial warm-up

Time

Figure 2: Bare die thermal throttling experiment Figure 4: IREM measurement demonstrating successful temperature clamping

Intel® Pentium® M Processor Power Estimation, Budgeting, Optimization, and Validation 48 Intel Technology Journal, Vol. 7, Issue 2, May 2003

SUMMARY [3] E. Rotem, A. Naveh, N. Shamir, O. Nathan, Internal publication, 2003. We described the dynamic power estimation work and its impact on the design of the Intel Pentium M processor. Low-power tests and various analysis techniques were used for identifying power-saving opportunities and for AUTHORS’ BIOGRAPHIES validating the power design. Most of the identified Dani Genossar has been with Intel since he graduated opportunities were implemented in the design. with a B.Sc. in Electrical Engineering Cum Laude from the Technion, Israel Institute of Technology, in 1988. As The paper also described the architecture enhancements a manager within the Mobile Platform Group Design and various circuit, logic, and I/O optimizations that were Team he is also responsible for the power-estimation and implemented during the design of the Intel Pentium M optimization efforts. His e-mail is processor. [email protected] Silicon measurements of the Intel Pentium M processor Nachum Shamir is the owner of iMPG power-estimation, yielded excellent correlation to our pre-silicon estimates analysis, and validation flows and leads the silicon power for a wide range of tests that are all within +/- 5% of our and thermal characterization efforts. He joined Intel in estimates. Furthermore, measurements also confirmed 1999. Nachum received a B.Sc. degree in Electrical that the features implemented for the sole purpose of Engineering from the Technion, Israel Institute of saving power account for a reduction of over 35% of the Technology, in 1988, an M.Sc. degree in Electrical active power consumed. Engineering from Tel Aviv University in 1994, and a Ph.D. degree in Electrical Engineering from the Technion CONCLUSION in 1999. Nachum served as adjunct lecturer at the The Intel Pentium M processor is a compelling mobile Technion Electrical Engineering faculty from 1998 to product delivering high performance within a restricted 2000. His e-mail is [email protected] mobile thermal envelope with improved battery life. It provides significant advantages to the Intel Centrino mobile technology-based platform. Copyright © Intel Corporation 2003. This publication was downloaded from http://developer.intel.com/. The developed architecture, methodologies, and knowledge base pave the way for the design of Intel’s Legal notices at next-generation power-efficient mobile products. http://www.intel.com/sites/corporate/tradmarx.htm.

ACKNOWLEDGMENTS We thank the entire Intel Pentium M processor design team for being power aware, for supporting the power- estimation efforts, and for producing a power-conscious design. We also acknowledge the contribution of the Intel Strategic CAD Labs (SCL) for providing and supporting the logic simulation tools that are embedded in the Stochastic Dynamic Power Estimator (SDPE) power- estimation flow.

REFERENCES [1] M. D. Aagaard, R. B. Jones, and C.-J. H. Seger, “Formal verification using parametric representations of Boolean constraints,” in Proceedings of the 1999 IEEE Design Automation Conference, pp. 402-7, 1999. [2] J. Yang and C.-J. H. Seger, “Introduction to generalized symbolic trajectory evaluation,” in Proceedings of the 2001 IEEE International Conference on Computer Design: VLSI in Computers and Processors. pp. 360-5, 2001.

Intel® Pentium® M Processor Power Estimation, Budgeting, Optimization, and Validation 49 Paper1cover.qxd 1/31/03 9:11 AM Page 2

For further information visit: developer.intel.com/technology/itj/index.htm

Intel® Centrino™ Mobile Technology

Antenna Selection in Multicarrier Communication Systems

A compiled version of all papers from this issue of the Intel Technology Journal can be found at: http://developer.intel.com/technology/itj/index.htm Antenna Selection in Multicarrier Communication Systems

Yuval Finkelstein, Wireless Networking Group, Haifa, Intel Corporation

Index words: communication, wireless, OFDM, fading channels

ABSTRACT Low SNR at defective carriers is partly overcome by applying spatial diversity (antenna diversity). Receiving Imagine a packet of data transmitted by radio frequency the data simultaneously through multiple antennas through the air. This packet appears at two separate increases the detection reliability. However, full antenna antenna ports of a radio receiver. Even though the same diversity is an expensive solution. A more cost-saving data appear at both antennas, a different path in space is solution is to use semi-diversity. This is accomplished by associated with each antenna. Any path in space results in selecting only a single antenna out of a given set. This signal power reduction and usually in a power antenna should offer the most reliable detection. redistribution in the allocated frequency band. Both effects enhance the error rate and reduce the reliability of Selecting antenna is usually accomplished by considering the decoded data. The receiver usually supports the total antenna power. This criterion is unsatisfactory wireless link only through a single antenna due to cost because power is not the only factor that determines the considerations. This paper focuses on the particular case performance of the receiver. In fact, the distribution of where the signal at one antenna is strong, but does not the power among carriers is critical as well. For example, efficiently utilize the available frequency band, and at the take an antenna that has received a giant pulse of power. other antenna the signal is weak, but uniformly distributed Most of it emerges in a single carrier. Selecting this over the band. The paper provides a general algorithm for antenna while the rest of the carriers suffer low SNR selecting the optimal antenna and explains both exhibits a high error rate. quantitatively and qualitatively which antenna should be This paper provides a simple algorithm for choosing the selected. best antenna from among several. The algorithm improves the reliability of the data recovery process. The INTRODUCTION central pillar of this algorithm is an information-capacity- Multipath fading is a major limitation to high-rate data like parameter for each of the carriers. Using this transmission in wireless communication systems. Fading parameter, instead of estimating power alone, yields more is a result of destructive interference between replicas of useful information about the contribution of each carrier. the signal arriving through different paths. Another aspect The performance of the receiver, while equipped with of the same effect is an intersymbol interference. This each antenna, is predicted by combining the appropriate occurs when the replicas merge again at the antenna and contributions over all the carriers. successive data symbols are mixed together, forming a The antenna selection algorithm described in this paper mismatched symbol. was implemented inside Intel’s chipset that handles the Multicarrier communication methods overcome IEEE 802.11a standard. The chipset provides wireless  intersymbol interference by subdividing the allocated connectivity to mobile PCs and also serves Intel bandwidth into a few frequency sub-bands. At each Centrino™ mobile technology. carrier (sub-band), the data is transmitted using long The paper demonstrates the application of the algorithm symbol durations compared to the time delay between by referring to a simplified system consisting of only two replicas. Thus, the impact of intersymbol interference is carriers. Full treatment of the IEEE standard, which also reduced. Transmitting the data simultaneously through all includes error-correction coding, is beyond the scope of carriers results in an overall high rate of data transmission. this paper. However, the described principles are very However, destructive interference still distorts the data.

Those carriers that are subjected to destructive interference have a low Signal-to-Noise Ratio (SNR). Intel Centrino is a trademark of Intel Corporation or its subsidiaries in the United States and other countries.

Antenna Selection in Multicarrier Communication Systems 50 Intel Technology Journal, Vol. 7, Issue 2, May 2003 relevant to practical systems. An effort was made to make was assigned by the transmitter to each frequency band the contents clear to technical staff who are not directly constitutes the demodulation process. Recognizing the involved with communication. Thus, terms such as correct point despite the noise and the channel fading is constellation points, Signal-to-Noise ratio, equalization vital for successfully receiving the data. Finally the data and gain control are demonstrated explicitly in this paper. is extracted by decoding the constellation points back into After mathematically constituting the relevant glossary of binary. terms, the suggested antenna-selection algorithm is introduced. Finally, the algorithm is proved to be DEFINITION OF A SIMPLIFIED SYSTEM successful in predicting the performance of a receiver In order to clarify our strategy, let us assume a simplified equipped with an arbitrary characteristics antenna. This is multicarrier communication system, which uses only two established by numerical simulation of decoding noisy carriers. That is, the allocated frequency band is only data. divided into two sub-bands. At each sub-band information is modulated by the well-known Quadrature RELEVANT GUIDELINES TO Phase-Shift Keying (QPSK) method. Figure 1 presents ORTHOGONAL FREQUENCY DIVISION the constellation plane for QPSK and our preferred MULTIPLEXING encoding scheme. As was previously mentioned, the main idea behind a multicarrier communication system is the division of a given frequency band into smaller frequency sub-bands. Orthogonal Frequency Division Multiplexing (OFDM) is Q a particular case of a multicarrier communication system, 01 11 in which the frequency sub-bands overlap. This j overlapping is not harmful due to the fact that the carriers are mathematically orthogonal. A good reason for using I OFDM is its inherent immunity against narrowband -1 1 interference. If a specific sub-band is severely disturbed due to multipath fading or interference, error-correction -j coding can still overcome the disturbance, thus preventing 00 10 the entire link from failing, as is the case with a single carrier. Another reason for using OFDM concerns the Figure 1: Bits encoding scheme for QPSK constellation complexity of the equalizer. Assuming that each sub-band Suppose that the sequence to be delivered is “1110.” For is sufficiently narrow, the effect of the channel is taken simplicity, we assume that no error-correction coding is into account by assigning only a single complex number to applied. Encoding this sequence results in two complex each carrier. In comparison to a single-carrier system, this numbers: 1+j and 1-j, which comprise the baseband simplifies the implementation of the equalizer to a great representation of this message. The transmitter combines extent. these two complex numbers into a waveform in the time More extensive explanations about the principles of domain by using the Inverse Fourier Transformation. This OFDM are documented elsewhere [1]. However, a few waveform finally modulates the amplitude and phase of more guidelines are presented here to help clarify the the RF carrier that is transmitted through the air. information in the remaining part of this paper. Following IEEE standard 802.11a, all carriers that constitute a single Demodulation and Noise data frame are subjected to the same modulation scheme: At the receiver the reverse process is taking place, which QAM constellation with a given order. The data is results in measured points in the constellation plane. encoded by assigning a specific binary combination to However, the constellation points measured by the each of the constellation points. This process is receiver are corrupted by noise. This noise cannot demonstrated in the next section. Thus, a single completely be avoided, no matter what the quality of the constellation point is determined for each frequency sub- receiver is. Once the message is transmitted the receiver band. Synthesizing the sequence of constellation points measures two constellation points, (I1,Q1) and (I2,Q2), in into a time domain signal is accomplished by Inverse carriers “1” and “2,” respectively. Although the exact Fourier Transformation. The inverse process takes place values 1+j and 1-j were sent by the transmitter, this would at the receiver. The time domain signal is transformed never be the case with the measured values, due to the back to the frequency domain by means of Discrete noise. Since no error-correction coding was applied, Fourier Transform. Identifying which constellation point demodulation is carried out just by assigning each

Antenna Selection in Multicarrier Communication Systems 51 Intel Technology Journal, Vol. 7, Issue 2, May 2003 measured point to its nearest constellation point. That is, It is very important to emphasize here that we do not demodulation is performed by identifying the quarter in restrict the statistical nature of the noise to any particular which the points (I1,Q1) and (I2,Q2) appear. If the noise kind of noise. Although we used white Gaussian noise in power is more intensive than that of the signal power, our simulations, this kind of noise does not always have to demodulation becomes a random process. In this case be used. The mean of the noise, however, should be zero. successful decoding would happen with a high probability The SNR is now defined as follows: of 0.5. This of course is not satisfactory for constituting a communication link.  2 2 2 −+− 2  We then adopted an approach that makes use of an  1 1 2 nPnP 2  ()σ [] =10 LogSNR = Ensemble average. This enables us to treat the noise in a dB 10 2 2   + nn  statistical manner. Figure 2 represents measurements of  1 2  (I ,Q , I ,Q ) for an ensemble of receivers assuming 1 1 2 2 =  2  additive white Gaussian noise. 10 Log10 2  σ 

The value of σ for the Gaussian distribution in Figure 2 is

0.5, which results in an SNR of 9 dB. Constellation Plane of Carrier 1 Constellation Plane of Carrier 2 2 2

1.5 1.5 CHANNEL EFFECT AND EQUALIZATION 1 1

0.5 0.5 Channel Effect

0 0

Q1 Q2 So far only the additive noise was considered as a

-0.5 -0.5 distraction to demodulation and data extraction. However, any actual signal is disturbed to some extent by -1 -1 channel distortion. Replicas of the signal coming from -1.5 -1.5 walls and equipment interfere with the antenna. The result -2 -2 of this interference is the so-called “channel” effect. The -2 0 2 -2 0 2 I1 I2 total power that appears at the antenna port of the receiver is always smaller than the transmitted power, due to channel losses. However, the channel also distorts the Figure 2: Constellation planes of carriers “1” and “2” original signal by changing its spectral content. In a We can see that all measurements are concentrated around multicarrier system, decomposing the received signal back the constellation point 1+j and 1-j according to the to its spectral component is actually equivalent to message that was sent. However, the noise shifts the measuring constellation points for each of the frequency measured values to different locations in the constellation sub-bands that constitute the system. Under the effect of plane. The figure makes visible the fact that each the channel, these constellation points do not appear at measured point can be described by the sum of two their original location as in the transmitter. Instead, both components: the signal component (either 1+j or 1-j) and their magnitude and phase are changed. That is, the the noise component (n1 and n2 for carriers “1” and “2,” channel moves each constellation point to an arbitrary new respectively). This can be mathematically expressed as location, even if noise is completely absent. follows: Equalization ()1 ++=+= njjQIP 1111 The channel distortion can be partly overcome by ()+−=+= 1 njjQIP 2222 equalization. The data packets are usually preceded by a The Ensemble average of the squared-magnitude of the known training sequence. Thus, the channel distortion can noise components is equal for both carriers, as is evident be analyzed and can be taken into account throughout the from Figure 2. It is designated by σ2: process of demodulation. Assuming that the frequency sub-bands are narrow enough, the channel influence for each carrier can be expressed with a single complex 2 nn 2 == σ 2 number. The phase and magnitude of each number 1 2 describe the transformation that the constellation points undergo in the relevant carrier. These complex numbers are called “channel coefficients.” The equalizer

Antenna Selection in Multicarrier Communication Systems 52 Intel Technology Journal, Vol. 7, Issue 2, May 2003 coefficients are no more than the reciprocal of the channel sub-bands and how the relative strength of the signal coefficients. Multiplying the measured constellation compares to the noise. points by the appropriate equalizer coefficients moves The expression for the SNR in the gain-controlled receiver them back to their normal place. If the training sequence is modified, since the total signal power is not 4 is long enough and the channel has static characteristics, (|1+j|2+|1-j|2=4) any more. Instead, it is the Ensemble the estimation error of the channel coefficients can be average of the total power (both signal and noise) that made as small as it needs to be. Therefore we neglect this equals 4. Thus, the total SNR, which refers to both noise component and assume ideal equalizer coefficients carriers together, is for the rest of this paper.  2 −σ 2  Nevertheless, equalization does not completely cancel the ()σ []=10LogSNR   total dB 10  2  impairments of the channel. This is due to the fact that  σ  equalization cannot compensate for the intrinsically low SNR in carriers that are suppressed by the channel; in Equalization and Modified Signal-to-Noise Ratios other words, carriers with a magnitude of channel After equalization, the original constellation points are coefficient that is much lower than one. The interplay restored with modified noise components: n between the total power of both carriers and the Eq ()++= 1 1 1 jP magnitude of their channel coefficients is most relevant to C1 the topic of this paper, that is, antenna selection. n Eq ()1 jP +−= 2 2 C Gain Control 2 We continue with our double-carriers simplified system. The Ensemble averages of the squared-magnitudes of the This time we add the influence of a channel. We modified noise components are designate the channel coefficients by C1 and C2 for 2 2 2 2 carriers “1” and “2,” respectively. Thus, the constellation 2 n σ 2 n σ ()σ 1 , ()σ 2 ==== points measured at the receiver are redefined by 1 C 2 2 C 2 1 C1 2 C2

()1 +×+=+= nCjjQIP Accordingly, 11111 () +×−=+= 1 nCjjQIP 22222  2  []= 10LogdBSNR   101  []σ 2   1  Next, we present the concept of Gain Control. While this  2  []= 10LogdBSNR   function is accomplished both analogically and digitally in 102  []σ 2   2  a rather complicated way in real systems, here we are just concerned with the essence of this feature. Our ideal gain controller verifies that ANTENNA DIVERSITY Suppose the receiver is equipped with two antennas: 2 2 =+ 1 PP 2 4 antenna A and antenna B. During signal detection the receiver switches back and forth between these two The Ensemble average that appears in the equation above antennas. The gain controller sets the gain such that the 2 2 verifies that all the receivers in our ensemble have exactly condition <|P1| >+<|P2| >=4 is fulfilled for the highly the same gain. Substituting the explicit expression for the energized antenna. Thus, saturation of the receiver’s points P1 and P2 and recalling that the variance of the amplifier is avoided no matter which antenna is attached. noise was defined as σ2 we get Once the gain is set, the receiver estimates two sets of channel coefficients. Each set refers to a different antenna 2 2 σ 2 =++ 1 CC 2 2 and consists of two channel coefficients, a single coefficient for each carrier. As soon as estimations are The magnitudes of C and C can take values either higher 1 2 completed, the receiver has to select either antenna A or or lower than one. An increase in the magnitude of one antenna B, in order to receive the rest of the data. Which coefficient occurs at the expense of a decrease in the one should it select? magnitude of the other. The reader should note that channel coefficients with a magnitude higher than one do Power Considerations not imply increased power in an absolute manner. The magnitudes of the channel coefficients only indicate how Suppose that antenna A is the one with the higher total the measured power is distributed among the frequency power. The gain controller sets the gain such that 2 AA 2 σ 2 =++ 1 CC 2 2

Antenna Selection in Multicarrier Communication Systems 53 Intel Technology Journal, Vol. 7, Issue 2, May 2003

Indexes A and B are attached to the channel coefficients of applied, making the BER analytical prediction too antennas A and B, respectively. Assuming that the signal complicated. Thus, the numerical exploration of the power in antenna B is reduced by a factor of X we get simplified communication system, along with an empirical

2 2 construction of the antenna selection algorithm, actually 2  + CCX BB  σ 2 =+ 2  1 2  provides a complete methodology that is applicable to practical systems as well. Note that at this antenna, it is only the signal power that is reduced by X and not the noise power. Thus, the noise σ2 level is the same no matter which antenna is selected. 0 This statement is valid since the amplifier gain is set just 10 before the beginning of the antenna selection process. S=0.1

-2 Symmetry and Asymmetry 10 S=1 BER As will be evident soon, the performance of the receiver is optimal when the signal power is homogenously distributed between the two frequency sub-bands. We -4 10 define an antenna whose channel coefficients are identical 0 5 10 15 20 25 30 (the same for both carriers) as a symmetric antenna. In SNR [dB] contrast, an asymmetric antenna is one with different 1 channel coefficients. If the higher power antenna is also a symmetric one, then selecting the best antenna becomes a 0.5 trivial task. However, the decision becomes more S=1 S=0.1 complicated if the high-power antenna is asymmetric, and 0 the low-power antenna is symmetric. We focus our -0.5 discussion on the latter. Since we defined antenna A to be the one with the higher power, we extend its definition and -1 -10 -5 0 5 10 15 declare it asymmetric. In order to take into account the Excess Power [dB] asymmetry property in a quantitative manner, we define the asymmetry parameter S as follows:

C S = 2 C1 Figure 3: BER results for different values of the Thus, 0≤S<1 for antenna A whereas S=1 for antenna B. asymmetry-parameter, S When selecting the antenna only the parameters X and S should be taken into account; these describe power The plot at the top of Figure 3 presents BER results for reduction in B and the asymmetry of the carriers in A, our double-carriers system, which relies on QPSK respectively. The values of these two parameters are modulation. Demodulation was carried out, as was accessible to the receiver since they can be ascertained explained earlier, by identifying to which quarter the once the channel coefficients are estimated. measured constellation points belong.

The horizontal axis of the figure shows SNRtotal in a Simulation of Bits Error Rate logarithmic scale. Its definition is repeated here again for In order to realize the effect parameters X and S have on the convenience of the reader: receiver performance, we simulated the entire simplified  2 −σ 2  system following the structure we described so far. We ()σ []=10LogSNR   total dB 10  σ 2  emphasize again that we do not restrict the statistical   nature of the noise to any particular kind. Although we Each colored line in the figure describes BER results for used white Gaussian noise in the simulations, this should different values of the asymmetry-parameter, S. The not always be the case. If the statistical nature of the noise furthest right is the case of S=0.1 with the worst is known in advance, the Bits Error Rate (BER) can be performance, as expected. The furthest left is the case of predicted analytically. However, because we do not S=1 with the best performance. In between, S displays assume any specific statistical distribution we use a intermediate values. numerical attitude instead of an analytical one. Furthermore, in practical communication systems, The empty circles at the figure denote the intersection of interleaving and error-correction coding are probably the curved lines with an invisible horizontal line, which

Antenna Selection in Multicarrier Communication Systems 54 Intel Technology Journal, Vol. 7, Issue 2, May 2003 crosses the vertical axis at the BER value of 10-3. These the same way, decreasing the grade for defective carriers circles are plotted again at the lower part of the figure with should be restricted if SNR goes below another value. a modified horizontal axis. The modification incorporates The antenna is therefore selected by following these steps: a shift of the zero to be just below the furthest left circle. • Grade each carrier according to its SNR. Thus, the value that fits each circle in the new scale is just the distance from the furthest left circle. The reader • Add up all grades to get a total score. should realize that the furthest left circle actually • represents the symmetric antenna B. Therefore, the rest of Select the antenna with the highest score. the circles specify the amount of excess power that should be added to the asymmetric antenna A, in order to achieve “Kapasity” the same BER result of The missing building block for the algorithm above is the 10-3. This excess power depends on the value of the function by which grades are calculated, versus the SNR. asymmetry-parameter, S, as the figure shows. We already indicated what should be the asymptotic behavior of that function for high and low values of SNR. ANTENNA SELECTION This kind of asymptotic behavior is similar to the information-capacity of a channel [2]. Indeed, estimating In constituting the antenna selection algorithm we follow channel capacity can be a useful tool for predicting these logical steps: receiver performance. However, we are not going to • Understand qualitatively why excess power is make any explicit use of information theory here. The required for compensating for asymmetry. function we are looking for is an empirical one, with free parameters for tuning the receiver performance. Despite • Based on this understanding, suggest an algorithm all that, we named the total antenna score “Kapasity,” that predicts receiver performance for given values of after the well-known phrase. X and S. We define the Kapasity function f(σ) as • Validate the algorithm and tune its free parameters. 1 This is done by reconstructing the dependence of the f ()σ = σ − excess power on the symmetry parameter S, as +  d  exp1   depicted in Figure 3.  q  • Verify proper antenna selection for any combination Note that f is defined in terms of σ instead of SNR, which of asymmetry parameters SA, SB, and power difference should make the equations that follow easier to X. understand. However, the variance of the noise for each carrier (σ2) and the SNR of each carrier are related, as was Therefore, why is excess power required for compensating explained previously. It is easy to verify that f has the for asymmetry? Increasing asymmetry while keeping total required asymptotic behavior, as σ approaches either zero power constant causes Signal-to-Noise Ratio (SNR) (i.e., high SNR) or infinity (i.e., low SNR). The explicit degradation in one carrier and SNR improvement in the form of the function f can be chosen in many other ways, other. In other words, decreasing S gradually from 1 to if the correct asymptotic behavior is kept. For hardware lower values increases the spread of equalized points in implementation, any piecewise linear approximation of the one constellation and decreases the spread in the other. function f can fit as well. The parameters d and q are free Thus, more and more points move to a wrong quarter in parameters that should be tuned in order to optimize the the constellation plane of one of the carriers while at the antenna selection process. Factors that affect the same time spreading is reduced in the other carrier. optimized values of d and q are the type of error- However, if spreading is initially small and a good correction coding and the available hardware resources for communication link can be established for S=1, reducing Kapasity calculations. However, it is easy to guess what the spreading further has no benefit at all. In contrast, the should be the value of the parameter d. Notice that f=0.5 SNR degradation at one of the carriers will finally shift σ points to the wrong quarter and the communication link when =d. At this point the gradient of f is maximal, and σ will fail. Applying more and more power while minor changes of result in major changes in the asymmetry increases keeps all constellation points inside Kapasity. This tendency also exists with the the quarter they really belong to. demodulation error rate, as the noise standard deviation approaches half of the distance between two adjacent Based on this explanation we propose a grade for each constellation points. Thus, the likely assumption is that carrier, based on its effective SNR. The grade is the optimal value of d for our QPSK constellations should calculated based on the fact that the benefit from high be around 1. We will see soon that this is really the case. signal power is saturated beyond a certain SNR value. In The value of q determines the slope of f around σ=d. The

Antenna Selection in Multicarrier Communication Systems 55 Intel Technology Journal, Vol. 7, Issue 2, May 2003 higher the value of q the lower the slope of f is at that From Figure 3 we derive σ and S for each circle and insert A A point. both of them into σ1 and σ2 above. Thus, the left-hand side of the equal-Kapasity equation is calculated for each VALIDATION OF THE ALGORITHM empty circle in Figure 3. All that is left is to extract X, Figure 3 describes the excess power required for an which only appears at the right-hand side of the equation. asymmetric antenna compared to a symmetric one, in order to keep both antennas performing at the same level. Excess Power Versus Symmetry Parameter 20 This illustrates one scenario for antenna selection where d=1.2 q=0.04 SB=1 and the Kapasity is equal for both antennas. We 15 d=1.2 q=0.16 follow this trend and speculate what should be X for each QPSK results value of SA. Formulating this mathematically in terms of 10 Kapasity we get 5 ()σ ()σ =+ ()σ BAA 21 2 fff Excess Power [dB] Excess Power 0 The superscripts A and B indicate which antenna σ -5 belongs to, and the subscripts 1 and 2 relate to carriers 1 0 0.2 0.4 0.6 0.8 1 and 2, respectively. Because we assume that antenna B is S symmetric, both its carriers have the same value of σ. The "Kapasity" function Therefore, from now on we omit the subscripts for 1 antenna B. d=1.2 q=0.04 0.8 d=1.2 q=0.16 A A B As soon as we express σ1 ,σ2 and σ in terms of X and S, we can insert them into the Kapasity equation above. We 0.6 go back to the mathematical model of the simplified f 0.4 system. Recall that after equalization the noise component in each carrier can be characterized by an 0.2 effective variance, which depends on the relevant channel coefficients 0 0 0.5 1 1.5 2 2.5 3 Sigma ( ) σ 2 σ 2 σ 2 * ()σ A 2 ()σ A 2 ()σ B 2 === 1 , 2 , A 2 A 2 B 2 1 2 CCC Figure 4: Excess Power calculated by using the

A A Kapasity function, versus the asymmetry-parameter, S Extracting C1 and C2 (in terms of S and σ) from the equations that define gain and asymmetry, The empty circles at the top plot in Figure 4 describe the Excess Power versus S as calculated by the Kapasity A 2 2 C 2 AA σ 2 2 , SCC ==++ model. Red and blue circles illustrate the results while 1 2 A C1 using (d=1.2, q=0.04) and (d=1.2, q=0.16), respectively. The shape that the Kapasity function takes using each set and inserting them into (*), we get of parameters is depicted at the lower part of Figure 4, keeping the same convention of colors. The red σ 2 σ A 1 += S 2 continuous line in Figure 4 represents the Excess Power 1 2 2 − σ versus S derived from the open circles in Figure 3. It is 1 + S 2 σ 2 clear that using the Kapasity model with (d=1.2, q=0.04) σ A = 2 S 2 2 − σ 2 accurately predicts the results of the numerical experiment. We consider this to be proof of the validity B σ Extracting C (in terms of X and ) from the equation that of the antenna selection algorithm. It is also evident that defines X, the initial guess of d≈1 is very close to the actual value. 2 2 22 2 − σ The other set of d and q, which isn’t a good fit in Figure 4, 2CCC BBB ==+ 1 2 X is discussed in the last section of this paper. and inserting it into (*), we get FINAL VALIDATION OF THE σ 2 B ALGORITHM σ = 2X 2 −σ 2 The algorithm is formulated following these steps:

Antenna Selection in Multicarrier Communication Systems 56 Intel Technology Journal, Vol. 7, Issue 2, May 2003

• Calculate the total Kapasity for each of the q=0.16). Replacing the Kapasity with its complement of relevant antennas (might be more than two) by 2 is just for producing a monotonic decreasing function summing up Kapasity contributions over all instead of an increasing one. This makes easier the carriers (probably much more than two): comparison between Kapasity plots and BER plots. Each = ()σ colored line in Figure 3 is transformed into a line with the ∑ fKapasity n n same color in Figure 5. The open circles show how points with a definite BER value in Figure 3 appear in the • Chose the antenna with the highest Kapasity. Kapasity diagram. It is clear that an absolutely horizontal Although the excellent fit between the red circles and the line exists, which separates circles of different colors. continuous line in Figure 4 validates the algorithm, we Thus, points with the same BER in Figure 3 become points would like to demonstrate its applicability in a broader with the same Kapasity in Figure 5. This finally sense. In the previous section, we focused on the case completes the validation of the proposed antenna selection algorithm. where SB=1. In this section we consider the general case in which both SB and SA have arbitrary values between 0 It is evident from the left-hand side plot of Figure 5 that and 1. hardware implementation of Kapasity calculations Let us focus again on the results of the numerical demands an extremely large number of bits. Otherwise experiment presented in Figure 3. We depicted the the antenna-selector will not be able to choose between performance of our double-carriers system in terms of the antennas. However, the parameters d=1.2 q=0.16 enable binary representation of the Kapasity using less BER versus SNR (σ) and the asymmetry parameter S. total bits, as shown in the right-hand side plot in Figure 5. This This time we consider the entire set of continuous lines in results, however, in a reduction in the efficiency of Figure 3 (instead of only considering the set of open antenna selection. It is evident from the fact that the circles). Recall that each line shows how the BER is separating line between same color circles is not a straight changed with SNR (σ), for each definite value of S. total horizontal line anymore. Thus, there is a tradeoff between Thus, for final validation of the algorithm, we should antenna selection efficiency and hardware cost. reconstruct the complete BER diagram of Figure 3 in terms of Kapasity. Figure 5 presents exactly that. The left-hand plot shows the value of [2-Kapasity] versus SNR(σ) and S using (d=1.2, q=0.04). The right-hand plot shows [2-Kapasity] versus SNR(σ) and S, using (d=1.2,

1 10 1E-3 to 1E-2 1E-3 to 1E-2 d=1.2 q=0.16 d=1.2 q=0.04 1E-4 to 1E-3 1E-4 to 1E-3 0 10 1E-5 to 1E-4 1E-5 to 1E-4 0 10

-5 10 S=0.1 -1 10 S=0.1 2 - Kapasity 2 - Kapasity 2 -

-10 10 S=1 -2 10 S=1

-15 -3 10 10 0 5 10 15 20 25 30 35 0 5 10 15 20 25 30 35 SNR [dB] SNR [dB]

Figure 5: Kapasity versus SNR(σ) and S

Antenna Selection in Multicarrier Communication Systems 57 Intel Technology Journal, Vol. 7, Issue 2, May 2003

CONCLUSION We reviewed the basic principles of multicarrier communication systems and explained how channels can modify power distribution among carriers and how equalization might affect the noise component in each carrier. We described an antenna selection algorithm developed in Intel. Although the algorithm was applied to a simplified communication system, the described methodology was found to be applicable for practical systems as well. Hence it was implemented inside Intel’s Orthogonal Frequency Division Multiplexing (OFDM) modem that provides wireless connectivity for mobile PCs.

REFERENCES [1] R. Van Nee and R. Prasad, OFDM For Wireless Multimedia Communication, Artech House, Boston- London, 2000. [2] J. G. Proakis, Digital Communications, McGraw-Hill, 4th ed., Boston, 2001.

AUTHOR’S BIOGRAPHY Yuval Finkelstein joined the wireless Digital Signal Processing team in the Intel Development Center in Haifa in 2000. He has prior industrial experience in the areas of electro optics and solid-state physics. He received his Ph.D. degree in physics from the Israel Institute of Technology, the Technion, in 1997. His e-mail is [email protected] Copyright © Intel Corporation 2003. This publication was downloaded from http://developer.intel.com/. Legal notices at http://developer.intel.com/sites/corporate/tradmarx.htm .

Antenna Selection in Multicarrier Communication Systems 58 Paper1cover.qxd 1/31/03 9:11 AM Page 2

For further information visit: developer.intel.com/technology/itj/index.htm