45nm Next Generation ® Core™ (Penryn)

HOT CHIPS 2007

Varghese George Principal Engineer, Intel Corp

Legal Disclaimer Today’s presentation may contain forward-looking statements. All statements made that are not historical facts are subject to a number of risks and uncertainties, and actual results may differ materially. Please refer to our most recent Earnings Release and our most recent Form 10-Q or 10-K filing available on our website for more information on the risk factors that could cause actual results to differ. INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The Intel® Core™ Microarchitecture, Intel® , Intel® Pentium II, Intel® Pentium III, Intel® , Intel® , Intel® , Intel® , ®, ® may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. This document contains information on products in the design phase of development. Do not finalize a design with this information. Revised information will be published when the product is available. Verify with your local sales office that you have the latest datasheet before finalizing a design. All dates specified are target dates, are provided for planning purposes only and are subject to change. All products, dates, and figures specified are preliminary based on current expectations, provided for planning purposes only, and are subject to change without notice. Intel and the Intel logo is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries. *Other names and brands are the property of their respective owners. Copyright © 2007, Intel Corporation

2 Outline

Next Generation 45nm Process Technology benefits Overview of Penryn Architecture and New Features Highlights of New Microarchitecture changes New Power Management Features in Penryn Deep Power Down Technology Enhanced Dynamic Acceleration Technology CC3 in Servers Summary

3

Next generation 45 nm Hi-K metal gate process technology: Continuing Moore’s Law 65 nm 45 nm Intel® Core™ 2 Duo Penryn family die (dual core) 2 143 mm 107 mm2

4-wide Dynamic Execution Same good stuff plus: Advanced smart cache: 4M L2 - 6M L2 cache Smart Memory access - 47 new SSE4 instructions Advanced digital media boost - Micro-arch enhancements for further IPC perf Intelligent power capability - Deep Power Down Technology* - Enhanced Dynamic Acceleration Technology* Penryn advantages with this next generation process: Lower transistor switching power and leakage current => reduced idle power => longer battery life, quieter systems Up to 2x improvement in transistor density: Significant area scalability to support additional capabilities within the existing power envelope Higher core and bus frequency for more performance in the same power envelope * Mobile only features 4 Enhanced Intel® Core™ Microarchitecture

Intel Core Microarchitecture New with the Penryn Family

Fast Radix-16 Divider Intel® Wide Dynamic Execution Faster OS Primitive Support Enhanced Intel Virtualization Technology

Intel® Advanced Smart Cache Larger Caches: up to 12MB 24 Way Set Associativity

Improved Store Forwarding Intel® Smart Memory Access Higher bus speeds

Intel® Advanced Digital Media Boost Intel SSE4 instructions Super Shuffle Engine

Intel® Intelligent Power Capability Deep Power Down Technology Enhanced Intel Dynamic Acceleration Tech

Increased Performance and Energy Efficiency Across Applications

Covered briefly in this presentation Focus of this presentation 5

New Intel® SSE4 Instructions

Intel® Intel® Intel® Intel® ® SSE SSE2 SSE3 SSSE3 Intel 1999 2000 2004 2006 SSE4 2007 70 144 13 32 instructions instructions instructions instructions Single- Double- Complex Decode Precision precision Arithmetic 47 instructions Vectors, Vectors Video Accelerators streaming 128-bit operations vector Graphics building blocks integer Coprocessor Accelerators

Penryn supports 47 new instructions Largest set of new instructions introduced since 2000 This continues a trend set by SSE, SSE2, SSE3, and SSSE3 Addresses some popular requests for key functionality Features to improve the compiler’s ability to efficiently vectorize code Video Encode acceleration functions Floating-point dot product operation (3D content) Streaming load for high b/w to WC memory (imaging, GPU-CPU sharing)

6 Super Shuffle Engine

Input data (2 x 16 bytes) Shuffle operations required for SSE

src1 B0 B1 B2 B3 B4 . . . B14 B15 data formatting operations Unpacking src2 B0 B1 B2 B3 B4 . . . B14 B15 Packing . . . Align concatenated sources Wide shifts Insertion and extraction . . . Super Setup for horizontal arithmetic Shuffle functions engine Penryn super shuffle engine

logic op . . . logic op performs 128 bit operation in a single cycle . . .

dest B0 B1 B2 B3 B4 . . . B14 B15 No software changes required

Output data (16 bytes) Doubles shuffle throughput

7

New Radix-16 divider architecture

Dividend Divisor Leading edge divider performance >50% speed-up over previous generations on average Pre-processing Innovative radix-16 based architecture Double pumped: processes 2x2 bits per cycle Hybrid 68-bit CSA/CPA based for faster pipelined hybrid 68b QSL quotient select adder Optimized square root <<2 Maximum Processor Instruction Latency (lower is better)

hybrid 68b QSL adder

<<2 Clocks

Post-processing

32-bit SP divide DP EP SP DP EP int divide divide square square square Quotient/remainder divide Previous generation rootPenryn root root

8 Streaming Load Instruction 16 byte Aligned load instruction on WC (write combining) memory Improves read bandwidth from WC mem by reading cache-line size quantities Graphics Mem Bus DDR3 DDR3

L2 Cache High BW Writes

Core ™ GFX Temp Buffer inside Existing L1 Data Cache Platforms But … Low BW Reads A B C D 800 MB/s (partial reads) DDR3 DDR3

45nm Next 4th Access Generation GFX 3rd Access Core ™ Loads 2nd Access 1st Access Fast Streaming Reads! With Streaming (Data usage in processor core) Up to 6.4 GB/s (cache line reads)

Streaming Load is ~8X faster reading from WC Memory

9

Power Management Overview

Penryn builds on the capabilities in the Core™ microarchitecture Fine grained clock gating during runtime: Applied all across the chip Sleep transistors: In large arrays such as L2 cache to reduce leakage power C-states: Various low power states during CPU idle P-states: Multiple performance states for run-time energy efficiency Thermal Monitor: Self-managed thermal control based on multiple thermal sensors Digital Thermal Sensor: For software access to CPU temperature Intelligent Voltage Regulator control: To optimize VR efficiency during low loads PECI interface: For side-band access to temperature for fan control Key New Mobile Power/Thermal Features in Penryn Deep Power Down Technology (DPD): A radically new and innovative idle power management state Enhanced Dynamic Acceleration Technology (EDAT): A deterministic and (TDP) power-neutral performance boost for Single Threaded apps

10 Deep Power Down Technology Overview Concept: When idle, required CPU state is saved in a power-preserved domain on chip and then power is shut off Exit: Upon a break event, Power/Clocks are turned on, State is restored and execution resumes seamlessly from where it left off transparent to OS / applications Can occur few 100s of times per second (between keystrokes)

Flexible C-States to Select Idle Power Level vs. Responsiveness State in DPD Active state Vcc: << Retention Deep Power Down C0 C1 C3 C4 technology state Core clk: OFF Core voltage*

off off off off PLL: OFF Core clock N e w

PLL off off off f Caches: OFF o r flushed flushed off L1 caches flushed flushed P e

n Key benefits of DPD r

L2 cache y

off n partiiall fllush Lowest Leakage CPU

Wakeup time* active state you can get to! Agnostic to min Vcc Idle power* state retention issues * Rough approximation

DPD enables reaching lower limit of CPU idle power of 0 W

11

Penryn DPD Implementation Overview VccP 4 Major parts: 8KB SRAMs STATE STORAGE: 8KB (per core) Powered from I/O Vcc (VccP) VccP ECC protected

MICROCODE: STATE DEFINITION: Does state save and restore What to include? Criteria: “Software seamless” Manages core synchronization Inclusions: Supports Auto-demote policy All Architectural state Most micro-architectural state PM UNIT (“hardware”): Exclusions: Temp registers used by ucode Coordinates cores at hardware level Some others on a case by case basis Manages the DPD Power up sequence Manages entry/exit protocol with platform

12 Deep Power Down Technology Entry/Exit

VccP S/W instruction initiates processor DPD entry

Small amout CPU does rest of sequencing with chipset/platform of logic is left Vcc powered VRM Protocol with chipset to block snoops due to DMA up by VccP during DPD traffic (no CPU wakeup required) while in DPD state VID control VccP FSB I/F Exit initiated by an break event in platform (interrupt) through the chipset I/O Chipset requests CPU sequences through rest of exit sequence – including driving VID to VRM, internal hardware reset, state restore and execution resumption

OS: Decides OS: execute CPU: shrink L2; CPU: signal C/S: stops to idle the MWAIT save uarch chipset to CPU clk, blk I/O processors instruction state enter DPD CPU: VRM dn

Deep power down state CPU: continue CPU: restore CPU: Internal C/S: signals Interrupt execution to Arch/uarch state, RESET, CPU wakeup, break next instr. Expand L2 PLL relock etc VRM ramp, bclk event

13

Interrupt Rate Sensitivity of Average Power Deeper C-states have non-trivial energy cost for transitions Today, OS makes C-state decisions based on latency, not energy cost Latency could be ok, but energy cost could be significantly different Too frequent transitions into deep C-states could result in net energy loss Worse battery life

Example of Average Power MM02 Penryn implements a “Auto- demote” policy in the CPU to address this issue Heuristics to detects when energy cost of DPD state entry and exit is potentially > savings in DPD state Demotes such cases to shallower Average Power Average Deep Power Down cross over point: ~3-4ms C-state (C4), although the OS requested DPD 0.5ms1ms 2ms 3ms 4ms5ms 10ms 15ms 20ms 50ms Limit Silicon results prove that this was a Interrupt Rate with C4 with Deep Power Down technology good decision! Source: Intel

14 DPD Results (Average Power)

Average Power (MM05* - Office Productivity)

Actual Penryn with DPD Disabled Results measured in (Core (tm) 2 Duo processor) Penryn Penryn with DPD Enabled 44% Silicon – Up to 44% reduction in Average Power

Average Power Average 27%

Source: Intel Corp, June 2007** Leakage Current 27% to 44% (based on the leakage of the part) reduction on Average Power as measured by Mobile Mark – Office Productivity benchmark due to DPD feature Significant improvement compared to previous generation (Merom) Measured Exit latency for DPD state: ~ 150 - 200 us => In expected range

**Config: Data from a sample distribution of parts with various leakage MM05 – BAPCO* Mobile Mark* 2005 benchmark measured on tester, under respective product POR Vcc, temperature etc 15

Enhanced Dynamic Acceleration Technology (EDAT) Concept: In multi-core CPUs, use the power headroom of idle core to boost performance of the non-idle core How it works: When one core enters an idle power C-state (CC3 or deeper) AND OS requests a higher performance state on the running core, The non-idle core is boosted up to a higher voltage, higher frequency (EDAT freq) Overall chip Power still remains within the specified Thermal Design Power (TDP)

2 cores active: Guaranteed Single core active: EDAT freq (marked) frequency P = P0 + P1 <= P (TDP Spec) P = P0 + P1 = TDP Spec Applicability: EDAT is EDAT Bin applicable mostly to TDP Top Freq. TDP Top Freq. TDP Top Freq. thermally constrained platforms such as mobile where the Core #2 at guaranteed frequency C3-DPD is not max Vcc limited Frequency Frequency Min Vcc Min Vcc Min Vcc Leakage CORE 0 CORE 1 CORE 0 CORE 1

EDAT provides a significant and predictable single-threaded performance boost

16 EDAT Implementation Overview Microarchitecture Entry based on OS request AND other core idle OS P-state Request “P[0]” Idle core defined as “CC3” or deeper C-state F/V Max F/V EDAT Freq pre-programmed in chip based on Clipping power, reliability and other constraints EDAT F/V Logic Exit EDAT mode when Idle core wakes up Max F/V Hysteresis mechanism Guar. F/V Allows short durations where 2 cores active Reduces perf loss for low activity wakeups EDAT Logic Implemented using a few counters EDAT Voltage Regulator needs to provide for this Disable Core CC-state Benefits most at high timer tick rates EDAT OS interface Control Hysteresis Logic Mechanism EDAT Freq exposed as {Guaranteed + epsilon} P-state in ACPI table: P[0] OS requests P[0] state if perf demand exists EDAT logic grants it if power headroom exists

17

EDAT Results (Performance)

EDAT Performance on SPEC CPU2000 (Estimated)

SPECint_base2000 (estimated) 1.08 SPECfp_base2000 (estimated) Actual 1.06 Results measured in 1.04 Penryn Silicon – Up 1.02 to 7% boost 1.00 in ST perf due to EDAT 0.98 Relative Performance Relative 0.96 Baseline: EDAT OFF EDAT ON; Low EDAT ON; High EDAT ON with Source: Intel Corp, as of June 2007* Interrupt rate Interrupt rate Hysteresis: High Interrupt rate

5% (on SPECfp_base2000 (est*)) and 7% (on SPECint_base2000 (est*)) performance gains due to EDAT within the same TDP power envelope!** High timer tick rates would have reduced the gains by a few % due to frequent wake up of idle core and resulting exits from EDAT mode Intelligent hysteresis mechanism in Penryn recovers most of that loss

**Configuration used for measurement: Matanzas CRB board, Crestline chipset DDR2-667 *Estimated SPEC metrics based on measurements on preproduction memory, Penryn A1 silicon @ 2.4 GHz, FSB 800, OS: Windows XP SP2 platforms. SPEC, SPECint and SPECfp are trademarks of SPEC. For more information on these benchmarks, see www.spec.org 18 New PM feature for Servers: CC3 State Concept: Snoops burn ~30% of active core power By avoiding snoops into idle cores, we save power How it works: Today, idle cores are put into Core C1 (CC1), which is a snoopable state In Penryn, idle cores can be put into Core C3 (CC3), which is a non-snoopable state. First level caches are flushed into the L2 cache before putting cores into CC3 This prevents cross core snoops and therefore the additional power burnt for snoops

ACTIVE Core0 in CC3C1E Cores IDLE Cores (CC3)(CC1) IDLE Cores (CC1)(CC3)IDLE IDLE CoresCores (CC1)(CC3)

L2 L2 L2 Core Core Core Core 0 snoop Core 0 snoop Core 0 snoop Core 0 snoop snoop snoop Core 0 L2 Cache L2 Cache L2 Cache L2 Cache Core Core 1 CoreCore 1 CoreCore 1 Core 1 snoop L2 Cache cross-core snoop snoop snoop

L2 Lookup Core 1 DRAM access Chipset Memory CC3 is a simple but effective optimization for power savings

19

CC3 Results (Power savings)

Power savings due to CC3 in Web Server workloads

2.0% 0.0% -2.0%0.0% 20.0% 40.0% 60.0% 80.0% 100.0% -4.0% Up to 16% CPU -6.0% power -8.0% reduction -10.0% measured on -12.0% web server -14.0% % Perf loss workloads with -16.0% % CPU pwr reduction CC3 feature -18.0% Source: Intel Corp, as of June 2007* % CPU utilization

Up to ~16%** CPU power reduction measured on Web Server workload with negligible performance loss with CC3 feature enabled Similar power reductions seen on SPEC CPU2000 (estimated by measurement on preproduction silicon) (single user mode), other idle power benchmarks for Server

*Configuration used for measurement: Bridgeport2, DP Clovertown 2.33 GHz, 1333 MHz ** Note that the power savings due to this feature will be system Bus, 4x2GB FB DIMM 667 MHz, OS: Linux 2.6.18.1; Workload: Specweb99; DBS OFF; configuration, OS and workload dependent. Measured by mapping CC3 as CC1 in BIOS

20 Summary

New generation Intel® Core™ microarchitecture is leading the way with technology New microarchitecture enhancements for higher performance and improved energy efficiency New SSE4 instructions for improved video, imaging, and 3D content performance New Power management features for improved battery life, performance and energy efficiency

Moore’s Law is alive and well with Penryn; Innovative features with tangible benefits being introduced for Performance and Power

21

Acknowledgements

To the entire Penryn team across Folsom, Santa Clara, Costa Rica and Israel and the Technology Development Group for developing and delivering this product to market

22 Legal Notices and Important Information Regarding the performance and power measurements in this presentation

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/resources/index.htm

Intel may make changes to specifications, release dates and product descriptions at any time, without notice. Intel, Pentium and the Intel logo are trademarks or registered trademarks of the Intel Corporation or its subsidiaries in the United States and other countries.

*Other names and brands may be claimed as the property of others 23