IAIA--3232 ArchitectureArchitecture

Sunil Saxena Principal Engineer Corporation

September 11, 2000

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Labs

Agenda l Pentium ® III Processor New Features l Pentium ® 4 Processor New Features l Pentium ® 4 Processor Micro-architecture

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 2

1 IA Processor Roadmap

. . . Madison Extends IA Headroom, Extends IA Headroom, IA-64 Perf . . . ScalabilityScalability andand AvailabilityAvailability Deerfield forfor thethe MostMost IA-64 Price/Perf McKinley DemandingDemanding EnvironmentsEnvironments

Future . . . IA-32 ItaniumTM processor

Performance Foster Cascades OutstandingOutstanding ® Pentium III Xeon™ PerformancePerformance forfor processor 3232 BitBit VolumeVolume AppsApps

’99 ’00 ’01 ’02 .25µ .18µ .13µ Strong Execution on Itanium™ Processor, Continued Focus on the Long Term Intel ® Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 3

Pentium III Processor

l Pentium III Processor New Features –36-bit Physical Addressing – Physical Address Extension - PAE-36 – Page Size Extensions - PSE-36 –Page Attribute Table –Fast Floating-point save/restore –New Instructions –New Exceptions

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 4

2 36-bit Addressing

l 36-bit Addressing – PSE-36 – PAE-36 l PSE-36 – 4GB mapped through 4K of page directories and 4MB page tables – Memory above 4 GB is only accessible as 4 MB pages – Operating system can freely use both 4KB and 4MB pages without PDE structure change – All 4KB pages and page tables MUST reside below 4GB boundary – Reduces effort needed to develop & support changes in virtual memory subsystem

l PAE-36 – 4GB mapped through 16K of page directories and 16MB page tables – All Memory accessible as 4KB or 2MB pages – OS needs to load PDEPTRs for mapping changes on writes to CR3 – CONFIG_HIMEM to enable more than 4 GB physical memory Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 5

4KB Page Translation

Linear Address 10 10 12

PTE

PDE 4-Kb page

CR3 Page Table Page Directory 1024 Entries 1024 Entries

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 6

3 4KB PAE Translation

Linear Address 2 9 9 12

PTE 4-Kb page PDE

PDPE Page Table 512 Entries CR3 Page Directory Page Directory 512 Entries Pointer Table

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 7

4MB Translation

Linear Address 10 22

PDE

CR3 4-MB page Page Directory 1024 Entries Provides bits 31-22 of physical address of 4MB Page (Bits 12-21 are currently RESERVED)

31 22 17 12 0 Page Directory Entry

16 13 Provides bits 35-32 of physical address of 4MB Page (new) Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 8

4 Example Using PSE-36 8GB ONLY 4MB PAGES ABOVE 4GB

4- Entries 4MB Page 4-byte Entries 4K Page . . . 4K Page 4K Page 4GB 4MB Page CR3 Page Table Page Directory

4K & 4MB PAGES BELOW 4GB 0

Physical Memory Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 9

Page Attribute Table (PAT)

l Physical Memory Attributes Described through the Page-Tables – Builds upon enhanced memory type capability provided via MTRR’s in Pentium® Pro processor – Relaxes MTRR alignment/length requirements

l Builds upon PCD/PWT bits on IA-32 Architecture – These interact with effective memory type determination

l PAT Architecture – PAT is an 8-entry table indexed via PCD, PWT, and Resv. bits – Allows up to 8 memory attributes defined by the page tables – PAT is always enabled when Paging is used – Default table entries fully compatible with PCD/PWT/Resv settings

– PAT entries R/W programmable via RDMSR/WRMSR (0x277) – 8 bits per entry; 3 bits for attribute with other bits reserved – Memory attributes as specified by Pentium® Pro processor Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 10

5 Page Attribute Table (PAT)

l PAT Architecture (continued) – PAT Memory Types interact with MTRRs – As architecturally specified by Pentium® Pro processor – Implementation specific combinations remain undefined – Should not be depended upon by system software

l Precautions – OS Uses Page Directory as a Page Table: – Restricted to using 4 lowest PAT entries – PAT bit 7 in 4K PTE is PS bit when used as a PDE

– Memory type changes for pages require TLB invalidation – Follow procedure as when changing MTRRs – cache flush, TLB invalidation – PAT entries on multiple processors must be maintained in consistent manner by OS – All processors have same values in PAT Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 11

Page Attribute Table (PAT)

l Precautions (continued) – Page Aliasing – PAT maintains memory types according to linear addresses – Architecture allows OS to map single physical page with 2 linear addresses containing differing types – This may lead to undefined results and must be avoided

l PAT Uses – Essentially unlimited MTRRs – Provide support for more devices (frame buffers, RAID cards, etc…) to map memory as WC – Allows map system memory for specific optimizations – Memory shared with 3D accelerator/CPU for textures – Reduce eviction, read-for-ownership bus transactions and cache thrashing for common operations such as memory fill Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 12

6 Fast Floating Point Save/Restore

l These instructions minimize cost of saving/restoring Floating Point/MMX™ Technology State – Does NOT re-initialize the FPU state after saving – Performance improvements come from more natural format and alignment of the cpu state l State area is larger – 512 – MUST be aligned on 16 byte boundary, else GP(0) fault – Use of reserved fields risks incompatibility with future Intel Architecture processors l FXSAVE does not check for unmasked exceptions (i.e. like FNSAVE) – FXRSTOR does not fault when loading an image that contains pending exceptions Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 13

Pentium III New Instructions

Core Floating Multimedia Memory Architecture Point Arch. Architecture Architecture

MMX™ P6 bus + Pentium III FP Technology WC I/O processors= Dynamic + + + Execution New Media Streaming SIMD FP Instr. Mem Instr

l 52 New SIMD Single Precision Floating Point Instructions l up to 4 FP results per cycle l Eight 128 bit registers l 4 x Single precision FP numbers l 12 New Media Instructions l 8 New Cacheability Instructions Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 14

7 Prefetching Instruction

l Prefetch gets a cacheline at a time l Prefetch Hint (Load) Instructions – Instructions do not fault – Retires quickly to free up machine resources – Hints to cache at different levels – Store in different levels of cache hierarchy – Don’t store in the cache hierarchy (stream) l Potential OS tuning uses – e.g. TCP/IP Checksum gets ~2x speedup

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 15

Streaming Store Instruction l Store data to memory minimizing cache pollution l Potential OS tuning benefits – 128 bit registers used with streaming store to zero pages ~4x faster – mem copy ~2x faster using prefetch/stream together

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 16

8 New Exceptions

l Interrupt vector 19 used to invoke unmasked exception handlers – Provide larger (512 bytes of state) context record to handler – Handlers need to account for SIMD nature of Pentium III SSE numeric exceptions – One instruction can generate multiple exceptions l Exceptions are precise (reported when detected) l Pentium III SSE Instructions architecturally separate from x87- FP – Pentium III SSE Instructions do not report x87-FP/MMX™ Technology exceptions – New handlers must include IEEE filter to decode and emulate exception raising SIMD instructions

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 17

Pentium® 4 Processor

l Pentium 4 Processor New Features –SSE2 Instructions –Enhanced Prefetch Instructions –System Bus and Cache Enhancements l OS Recommendation l New Instruction support

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 18

9 Pentium 4 Architecture Overview

l Willamette is the next generation IA-32 processor microarchitecture l New micro-architecture – ~1.4x average performance of Pentium® III processor family on same process – Enables faster processor speeds (1 GHz+) – Trace Cache for Instruction Decode l Willamette New Instructions l New platform (chipsets, AGP4X)

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 19

Pentium 4 New Instructions l New 128 bit arithmetic instructions – Extend MMXä technology instructions from 64 bit to 128 bit data type – Operates on XMM registers instead of MMX/x87-FP registers – New 128-bit integer and SIMD-Integer instructions – Memory operands MUST be 128-bit aligned! Will cause Exception during executions if not aligned. – Packed 32 * 32 bit Multiply – Packed 64 bit Add/Subtract – Shift, Shuffle, Unpack, Move, Conversion l New SIMD Double Precision FP instructions – Full complement of FP arithmetic operations – Packed/Scalar DP Û SP conversions l New cache / memory management instructions – Cache line flush instruction – Fences (LFence / MFence) – New streaming store instructions Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 20

10 Streaming SIMD Extensions 2

Floating Point Registers Integer / x87 Registers (Scalar/packed SIMD-SP-FP, (64-bit Integer, x87 data) SIMD-DP-FP, 128-bit Integer) 80 128 64

XMM0 FP0 or MM0 . . . . .

XMM7 FP7 or MM7

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 21

Example SIMD Add (ADDPD)

l Effectively performs two double precision ops in one cycle l a1+b1=c1 in parallel with a0+b0=c0 l Useful for matrix operations

a1 a0 + + 128-bit Registers b1 b0 S2 c1 c0

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 22

11 Prefetches Prefetches

lThe Intel® Pentium® 4 processor has automatic prefetches which –Work on large buffers –Have Sequential access lEven fewer prefetches necessary UseUse sequentialsequential accessaccess toto buffersbuffers andand getget prefetchesprefetches “for“for free”free” Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 23

Prefetches But… l prefetch instructions may still be the best solution in some cases –PrefetchNTA reduces cache evictions of useful data (1.1-1.15x gain) –Benefits unusual (ie, non-contiguous) data access patterns –Can maximize read bandwidth to system memory –Increase fetch-ahead distance since memory- latency/computation delta increases

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 24

12 System Bus & Cache Enhancements

l The Pentium 4 system bus is an evolutionary extension of the P6 bus l 3.2 GByte/sec data transfer rate – 100MHz quad pumped data bus - similar to AGP-4X – Source synchronous 64 bit data bus l Caches – Trace cache for decoded instructions – 128 byte cache lines with 64 byte sectors – 256K on-die, 2nd level write-back, unified data and instruction cache l APIC – Messages now sent over front side bus – Physical destination mode expanded to 8-bits – ISR, IRR, TMR implementation increased to 256 bits – Remote read is no longer supported

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 25

OS Recommendations

l All spin-loops should include the PAUSE instruction – Backwards compatible with prior IA-32 processors – Significant performance benefit in future IA-32 processors – Already done in 2.4-test* kernels l Cache line size is 128 bytes with 64 byte sectors – Impact to hot locks – Hot locks should be on separate 64 byte sectors – Impact to alignment – 128 byte line allocation in cache l Use Non-execution based Timing Loops! – Already done in 2.4-test* kernels

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 26

13 Pentium 4 New Instruction Support

l FXSAVE/FXRSTOR support for Pentium 4 state – Already done if enabled for Pentium® III processor (Internet Streaming SIMD Extensions) – No New State! – Already done in 2.4-test* kernels l New Exception Handlers – Double Precision SIMD capable – IEEE Compliant l Prefetch and Streaming Store Optimizations – Integer state streaming store instruction MOVNTi – For zeroing, memcpy, etc. – Does not use FP state so DNA is avoided Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 27

PentiumPentium® 44 ProcessorProcessor MicroMicro--architecturearchitecture Next Generation IA-32 Micro-architecture

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Labs

14 Agenda

l IA-32 Processor Roadmap l Design Goals l Frequency l Instructions Per Cycle l Summary

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 29

Intel® Pentium® 4 Processor

Intel® NetBurst™ Micro-Architecture

NOW

P6 Micro-Architecture Performance Performance P5 Micro-Architecture

486 Micro-architecture

Time Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 30

15 Intel® Pentium® 4 Processor Design Goals

l Deliver world class performance across both existing and emerging applications l Deliver performance headroom and scalability for the future

MicroMicro--architecturearchitecture that that will will Drive Drive Performance Performance LeadershipLeadership for for the the Next Next Several Several Years Years

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 31

CPU Architecture 101

DeliveredDelivered PerformancePerformance == FrequencyFrequency ** InstructionsInstructions PerPer CycleCycle

FrequencyFrequencyFrequency

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 32

16 Frequency

l What limits frequency? –Process technology –Microarchitecture l On a given process technology –Fewer gates per pipeline stage will deliver higher frequency

FrequencyFrequency isis drivendriven byby MicroarchitectureMicroarchitecture Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 33

NetburstTM Micro-architecture Pipeline vs P6 Basic P6 Pipeline Intro at 733MHz 1 2 3 4 5 6 7 8 .18µ9 10 Fetch Fetch Decode Decode Decode Rename ROB Rd Rdy/Sch Dispatch Exec

Basic Pentium® 4 Processor Pipeline Intro at 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16³ 171.4GHz18 19 20 TC Nxt IP TC Fetch Drive Alloc Rename Que Sch Sch Sch Disp Disp RF RF Ex.18µFlgs Br Ck Drive

HyperHyper pipelinedpipelined TechnologyTechnology enablesenables industryindustry leadingleading performanceperformance andand clockclock raterateIntel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 34

17 Hyper Pipelined Technology 20 Netburst Micro-Architecture Today 1.13GHz ³1.4GHz 1.13GHz 10 P6 Micro-Architecture Frequency Frequency

233MHz 166MHz 5 60MHz P5 Micro-Architecture Introduction Time Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 35

CPU Architecture 101

DeliveredDelivered PerformancePerformance == FrequencyFrequency ** InstructionsInstructions PerPer CycleCycle

InstructionsInstructionsInstructions PerPerPer CycleCycleCycle

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 36

18 Improving Instructions Per Cycle

l Improve efficiency –Branch prediction –Do more things in a clock l Reduce time it takes to do something –Reducing latency

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 37

Improving Instructions Per Cycle

l Improve efficiency –Branch prediction –Do more things in a clock l Reduce time it takes to do something –Reducing latency

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 38

19 Branch Prediction

l Accurate branch prediction is key to enabling longer pipelines l Dramatic improvement over P6 branch predictor: – 8x the size (4K) – Eliminated 1/3 of the mispredictions l Proven to be better than all other publicly disclosed predictors – (g-share, hybrid, etc)

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 39

Execution Trace Cache

l Advanced L1 instruction cache – Caches “decoded” IA-32 instructions (uops) l Removes decoder pipeline latency l Capacity is ~12K uOps l Integrates branches into single line – Follows predicted path of program execution

ExecutionExecution TraceTrace CacheCache feedsfeeds fastfast engineengine

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 40

20 Execution Trace Cache

1 cmp 2 br -> T1 .. ... (unused code) Trace Cache Delivery T1: 3 sub 4 br -> T2 1 cmp 2 br T1 3 T1: sub .. ... (unused code) 4 br T2 5 mov 6 sub T2: 5 mov 7 br T3 8 T3:add 9 sub 6 sub 10 mul 11 cmp 12 br T4 7 br -> T3 .. ... (unused code) T3: 8 add 9 sub 10 mul 11 cmp 12 br -> T4

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 41

Advanced Dynamic Execution

l Extends basic features found in P6 core l Very deep speculative execution –126 instructions in flight (3x P6) –48 loads (3x P6) and 24 stores (2x P6) l Provides larger window of visibility –Better use of execution resources

DeepDeep SpeculationSpeculation ImprovesImproves ParallelismParallelism

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 42

21 Improving Instructions Per Cycle

l Improve efficiency –Branch prediction –Do more things in a clock l Reduce time it takes to do something –Reducing latency

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 43

Rapid Execution Engine

l Dramatically lower ALU latency l P6: –1 clock @ 1GHz 1ns l P4P: –½ clock @ >1.4GHz

<0.36ns Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 44

22 Example with Higher IPC and Faster Clock! Pentium® 4 Code P6 Processor Sequence @1GHz @1.4GHz Ld Add Add Ld Add Add

10 clocks 6 clocks 10ns 4.3ns IPC = 0.6 IPC = 1.0 IPCIntel = 1.0 Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 45

Recap Pentium®III Pentium® 4 Relative Processor Processor Improvement Frequency 1 GHz > 1.4 Ghz > 1.4 Adder Speed 1 ns < .36 ns > 2.8 L1 Cache Speed 3 ns < 1.42 ns > 2.1 L1 Cache Size 16 KB 8 KB 0.5 L1 Cache Bandwidth 16 GB/sec > 44.8 GB/sec > 2.8 L2 Cache Bandwidth 16 GB/sec > 44.8 GB/sec > 2.8 Uop Fetch Bandwidth 3 billion/sec > 4.2 billion/sec > 1.4 Adder Bandwidth 2 billion/sec > 5.6 billion/sec > 2.8 Branch targets 512 4092 8 Instructions In flight 40 126 3.15 Loads in flight 16 48 3 Stores in flight 12 24 2 Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 46

23 Example - Security and e-Commerce l Secure transactions enable e-Commerce l SSL is the standard for secure Web transactions – Protocol for secure communication – Built upon a core set of algorithms – Public-key encryption – RSA, DSA, Diffie-Hellman, etc. – Message digest – SHA-1, MD5, etc. – Digital signature – Bulk encryption – RC4, DES, 3DES, AES

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 47

Security Impacts Performance SSL - The basics Browser Web Server

Client Hello(settings,etc)

Server Hello (certificate, suite,etc)

Pre-master secret key

Session ID, Ready

Session ID, Ready

Data exchanges (bulk encryption) Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 48

24 Security Impacts Performance

The high cost of SSL Source - Intel Lab research Transaction time

1 2 4 8 16 32 Kbytes Transmitted

SecureSecure transactionstransactions areare ordersorders ofof Intel Labs Copyright © 2000 Intel Corporation.magnitudemagnitudeLinux slowerSuperclusterslower thanUsersthan Conference nonnon--securesecurePage 49

Identify Key Algorithms Computation in SSL

l Goal: Increase the number of secure transactions – Identify server performance issues in SSL – One server may deal with hundreds of clients

Montgomery Product

setup Authentication Bulk Data close Encryption

70%

Compute time consumed by one short SSL transaction Source - Intel Lab research Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 50

25 Breakthrough Performance on Pentium 4 Processor Architectural Features

l New instructions in SSE2 – PMULUDQ (32x32=>64) – PADDQ (64+64=>64) – PSHUFD (Re-arrange DWORDs) l All pipelined l SIMD

IncreaseIncrease sizesize andand reducereduce numbernumber ofof individualindividual multiplicationsmultiplications

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 51

Breakthrough Performance on Pentium 4 Processor Timings

Algorithm Bits Lang Clocks Ratio Naïve 32-bit 1x16 51000 19.07 Optimized ASM using MUL 1x32 asm 28750 10.75 Using Pentium 4 New Instructions2x32 asm 2675 1.00

AlmostAlmost 20x20x performanceperformance gaingain versusversus naïvenaïve implementationimplementation Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 52

26 Breakthrough Performance on Pentium 4 Processor Summary

l Expect 400+ 1024-bit RSA Decrypts/second l Breakthrough performance on public key algorithms for Intel® Pentium® 4 processor – The right architecture – The right instruction set

PentiumPentium®® 44processorprocessor deliversdelivers moremore securesecure transactionstransactions toto moremore usersusers

Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 53

27