IAIA--3232 ArchitectureArchitecture
Sunil Saxena Principal Engineer Intel Corporation
September 11, 2000
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Labs
Agenda l Pentium ® III Processor New Features l Pentium ® 4 Processor New Features l Pentium ® 4 Processor Micro-architecture
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 2
1 IA Processor Roadmap
. . . Madison Extends IA Headroom, Extends IA Headroom, IA-64 Perf . . . ScalabilityScalability andand AvailabilityAvailability Deerfield forfor thethe MostMost IA-64 Price/Perf McKinley DemandingDemanding EnvironmentsEnvironments
Future . . . IA-32 ItaniumTM processor
Performance Foster Cascades OutstandingOutstanding ® Pentium III Xeon™ PerformancePerformance forfor processor 3232 BitBit VolumeVolume AppsApps
’99 ’00 ’01 ’02 .25µ .18µ .13µ Strong Execution on Itanium™ Processor, Continued Focus on the Long Term Intel ® Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 3
Pentium III Processor
l Pentium III Processor New Features –36-bit Physical Addressing – Physical Address Extension - PAE-36 – Page Size Extensions - PSE-36 –Page Attribute Table –Fast Floating-point save/restore –New Instructions –New Exceptions
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 4
2 36-bit Addressing
l 36-bit Addressing – PSE-36 – PAE-36 l PSE-36 – 4GB mapped through 4K of page directories and 4MB page tables – Memory above 4 GB is only accessible as 4 MB pages – Operating system can freely use both 4KB and 4MB pages without PDE structure change – All 4KB pages and page tables MUST reside below 4GB boundary – Reduces effort needed to develop & support changes in virtual memory subsystem
l PAE-36 – 4GB mapped through 16K of page directories and 16MB page tables – All Memory accessible as 4KB or 2MB pages – OS needs to load PDEPTRs for mapping changes on writes to CR3 – CONFIG_HIMEM to enable more than 4 GB physical memory Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 5
4KB Page Translation
Linear Address 10 10 12
PTE
PDE 4-Kb page
CR3 Page Table Page Directory 1024 Entries 1024 Entries
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 6
3 4KB PAE Translation
Linear Address 2 9 9 12
PTE 4-Kb page PDE
PDPE Page Table 512 Entries CR3 Page Directory Page Directory 512 Entries Pointer Table
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 7
4MB Translation
Linear Address 10 22
PDE
CR3 4-MB page Page Directory 1024 Entries Provides bits 31-22 of physical address of 4MB Page (Bits 12-21 are currently RESERVED)
31 22 17 12 0 Page Directory Entry
16 13 Provides bits 35-32 of physical address of 4MB Page (new) Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 8
4 Example Using PSE-36 8GB ONLY 4MB PAGES ABOVE 4GB
4-byte Entries 4MB Page 4-byte Entries 4K Page . . . 4K Page 4K Page 4GB 4MB Page CR3 Page Table Page Directory
4K & 4MB PAGES BELOW 4GB 0
Physical Memory Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 9
Page Attribute Table (PAT)
l Physical Memory Attributes Described through the Page-Tables – Builds upon enhanced memory type capability provided via MTRR’s in Pentium® Pro processor – Relaxes MTRR alignment/length requirements
l Builds upon PCD/PWT bits on IA-32 Architecture – These interact with effective memory type determination
l PAT Architecture – PAT is an 8-entry table indexed via PCD, PWT, and Resv. bits – Allows up to 8 memory attributes defined by the page tables – PAT is always enabled when Paging is used – Default table entries fully compatible with PCD/PWT/Resv settings
– PAT entries R/W programmable via RDMSR/WRMSR (0x277) – 8 bits per entry; 3 bits for attribute with other bits reserved – Memory attributes as specified by Pentium® Pro processor Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 10
5 Page Attribute Table (PAT)
l PAT Architecture (continued) – PAT Memory Types interact with MTRRs – As architecturally specified by Pentium® Pro processor – Implementation specific combinations remain undefined – Should not be depended upon by system software
l Precautions – OS Uses Page Directory as a Page Table: – Restricted to using 4 lowest PAT entries – PAT bit 7 in 4K PTE is PS bit when used as a PDE
– Memory type changes for pages require TLB invalidation – Follow procedure as when changing MTRRs – cache flush, TLB invalidation – PAT entries on multiple processors must be maintained in consistent manner by OS – All processors have same values in PAT Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 11
Page Attribute Table (PAT)
l Precautions (continued) – Page Aliasing – PAT maintains memory types according to linear addresses – Architecture allows OS to map single physical page with 2 linear addresses containing differing types – This may lead to undefined results and must be avoided
l PAT Uses – Essentially unlimited MTRRs – Provide support for more devices (frame buffers, RAID cards, etc…) to map memory as WC – Allows map system memory for specific optimizations – Memory shared with 3D accelerator/CPU for textures – Reduce eviction, read-for-ownership bus transactions and cache thrashing for common operations such as memory fill Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 12
6 Fast Floating Point Save/Restore
l These instructions minimize cost of saving/restoring Floating Point/MMX™ Technology State – Does NOT re-initialize the FPU state after saving – Performance improvements come from more natural format and alignment of the cpu state l State area is larger – 512 bytes – MUST be aligned on 16 byte boundary, else GP(0) fault – Use of reserved fields risks incompatibility with future Intel Architecture processors l FXSAVE does not check for unmasked exceptions (i.e. like FNSAVE) – FXRSTOR does not fault when loading an image that contains pending exceptions Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 13
Pentium III New Instructions
Core Floating Multimedia Memory Architecture Point Arch. Architecture Architecture
MMX™ P6 bus + Pentium III FP Technology WC I/O processors= Dynamic + + + Execution New Media Streaming SIMD FP Instr. Mem Instr
l 52 New SIMD Single Precision Floating Point Instructions l up to 4 FP results per cycle l Eight 128 bit registers l 4 x Single precision FP numbers l 12 New Media Instructions l 8 New Cacheability Instructions Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 14
7 Prefetching Instruction
l Prefetch gets a cacheline at a time l Prefetch Hint (Load) Instructions – Instructions do not fault – Retires quickly to free up machine resources – Hints to cache at different levels – Store in different levels of cache hierarchy – Don’t store in the cache hierarchy (stream) l Potential OS tuning uses – e.g. TCP/IP Checksum gets ~2x speedup
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 15
Streaming Store Instruction l Store data to memory minimizing cache pollution l Potential OS tuning benefits – 128 bit registers used with streaming store to zero pages ~4x faster – mem copy ~2x faster using prefetch/stream together
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 16
8 New Exceptions
l Interrupt vector 19 used to invoke unmasked exception handlers – Provide larger (512 bytes of state) context record to handler – Handlers need to account for SIMD nature of Pentium III SSE numeric exceptions – One instruction can generate multiple exceptions l Exceptions are precise (reported when detected) l Pentium III SSE Instructions architecturally separate from x87- FP – Pentium III SSE Instructions do not report x87-FP/MMX™ Technology exceptions – New handlers must include IEEE filter to decode and emulate exception raising SIMD instructions
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 17
Pentium® 4 Processor
l Pentium 4 Processor New Features –SSE2 Instructions –Enhanced Prefetch Instructions –System Bus and Cache Enhancements l OS Recommendation l New Instruction support
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 18
9 Pentium 4 Architecture Overview
l Willamette is the next generation IA-32 processor microarchitecture l New micro-architecture – ~1.4x average performance of Pentium® III processor family on same process – Enables faster processor speeds (1 GHz+) – Trace Cache for Instruction Decode l Willamette New Instructions l New platform (chipsets, AGP4X)
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 19
Pentium 4 New Instructions l New 128 bit arithmetic instructions – Extend MMXä technology instructions from 64 bit to 128 bit data type – Operates on XMM registers instead of MMX/x87-FP registers – New 128-bit integer and SIMD-Integer instructions – Memory operands MUST be 128-bit aligned! Will cause Exception during executions if not aligned. – Packed 32 * 32 bit Multiply – Packed 64 bit Add/Subtract – Shift, Shuffle, Unpack, Move, Conversion l New SIMD Double Precision FP instructions – Full complement of FP arithmetic operations – Packed/Scalar DP Û SP conversions l New cache / memory management instructions – Cache line flush instruction – Fences (LFence / MFence) – New streaming store instructions Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 20
10 Streaming SIMD Extensions 2
Floating Point Registers Integer / x87 Registers (Scalar/packed SIMD-SP-FP, (64-bit Integer, x87 data) SIMD-DP-FP, 128-bit Integer) 80 128 64
XMM0 FP0 or MM0 . . . . .
XMM7 FP7 or MM7
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 21
Example SIMD Add (ADDPD)
l Effectively performs two double precision ops in one cycle l a1+b1=c1 in parallel with a0+b0=c0 l Useful for matrix operations
a1 a0 + + 128-bit Registers b1 b0 S2 c1 c0
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 22
11 Prefetches Prefetches
lThe Intel® Pentium® 4 processor has automatic prefetches which –Work on large buffers –Have Sequential access lEven fewer prefetches necessary UseUse sequentialsequential accessaccess toto buffersbuffers andand getget prefetchesprefetches “for“for free”free” Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 23
Prefetches But… l prefetch instructions may still be the best solution in some cases –PrefetchNTA reduces cache evictions of useful data (1.1-1.15x gain) –Benefits unusual (ie, non-contiguous) data access patterns –Can maximize read bandwidth to system memory –Increase fetch-ahead distance since memory- latency/computation delta increases
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 24
12 System Bus & Cache Enhancements
l The Pentium 4 system bus is an evolutionary extension of the P6 bus l 3.2 GByte/sec data transfer rate – 100MHz quad pumped data bus - similar to AGP-4X – Source synchronous 64 bit data bus l Caches – Trace cache for decoded instructions – 128 byte cache lines with 64 byte sectors – 256K on-die, 2nd level write-back, unified data and instruction cache l APIC – Messages now sent over front side bus – Physical destination mode expanded to 8-bits – ISR, IRR, TMR implementation increased to 256 bits – Remote read is no longer supported
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 25
OS Recommendations
l All spin-loops should include the PAUSE instruction – Backwards compatible with prior IA-32 processors – Significant performance benefit in future IA-32 processors – Already done in 2.4-test* kernels l Cache line size is 128 bytes with 64 byte sectors – Impact to hot locks – Hot locks should be on separate 64 byte sectors – Impact to data structure alignment – 128 byte line allocation in cache l Use Non-execution based Timing Loops! – Already done in 2.4-test* kernels
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 26
13 Pentium 4 New Instruction Support
l FXSAVE/FXRSTOR support for Pentium 4 state – Already done if enabled for Pentium® III processor (Internet Streaming SIMD Extensions) – No New State! – Already done in 2.4-test* kernels l New Exception Handlers – Double Precision SIMD capable – IEEE Compliant l Prefetch and Streaming Store Optimizations – Integer state streaming store instruction MOVNTi – For zeroing, memcpy, etc. – Does not use FP state so DNA is avoided Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 27
PentiumPentium® 44 ProcessorProcessor MicroMicro--architecturearchitecture Next Generation IA-32 Micro-architecture
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Labs
14 Agenda
l IA-32 Processor Roadmap l Design Goals l Frequency l Instructions Per Cycle l Summary
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 29
Intel® Pentium® 4 Processor
Intel® NetBurst™ Micro-Architecture
NOW
P6 Micro-Architecture Performance Performance P5 Micro-Architecture
486 Micro-architecture
Time Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 30
15 Intel® Pentium® 4 Processor Design Goals
l Deliver world class performance across both existing and emerging applications l Deliver performance headroom and scalability for the future
MicroMicro--architecturearchitecture that that will will Drive Drive Performance Performance LeadershipLeadership for for the the Next Next Several Several Years Years
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 31
CPU Architecture 101
DeliveredDelivered PerformancePerformance == FrequencyFrequency ** InstructionsInstructions PerPer CycleCycle
FrequencyFrequencyFrequency
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 32
16 Frequency
l What limits frequency? –Process technology –Microarchitecture l On a given process technology –Fewer gates per pipeline stage will deliver higher frequency
FrequencyFrequency isis drivendriven byby MicroarchitectureMicroarchitecture Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 33
NetburstTM Micro-architecture Pipeline vs P6 Basic P6 Pipeline Intro at 733MHz 1 2 3 4 5 6 7 8 .18µ9 10 Fetch Fetch Decode Decode Decode Rename ROB Rd Rdy/Sch Dispatch Exec
Basic Pentium® 4 Processor Pipeline Intro at 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16³ 171.4GHz18 19 20 TC Nxt IP TC Fetch Drive Alloc Rename Que Sch Sch Sch Disp Disp RF RF Ex.18µFlgs Br Ck Drive
HyperHyper pipelinedpipelined TechnologyTechnology enablesenables industryindustry leadingleading performanceperformance andand clockclock raterateIntel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 34
17 Hyper Pipelined Technology 20 Netburst Micro-Architecture Today 1.13GHz ³1.4GHz 1.13GHz 10 P6 Micro-Architecture Frequency Frequency
233MHz 166MHz 5 60MHz P5 Micro-Architecture Introduction Time Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 35
CPU Architecture 101
DeliveredDelivered PerformancePerformance == FrequencyFrequency ** InstructionsInstructions PerPer CycleCycle
InstructionsInstructionsInstructions PerPerPer CycleCycleCycle
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 36
18 Improving Instructions Per Cycle
l Improve efficiency –Branch prediction –Do more things in a clock l Reduce time it takes to do something –Reducing latency
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 37
Improving Instructions Per Cycle
l Improve efficiency –Branch prediction –Do more things in a clock l Reduce time it takes to do something –Reducing latency
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 38
19 Branch Prediction
l Accurate branch prediction is key to enabling longer pipelines l Dramatic improvement over P6 branch predictor: – 8x the size (4K) – Eliminated 1/3 of the mispredictions l Proven to be better than all other publicly disclosed predictors – (g-share, hybrid, etc)
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 39
Execution Trace Cache
l Advanced L1 instruction cache – Caches “decoded” IA-32 instructions (uops) l Removes decoder pipeline latency l Capacity is ~12K uOps l Integrates branches into single line – Follows predicted path of program execution
ExecutionExecution TraceTrace CacheCache feedsfeeds fastfast engineengine
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 40
20 Execution Trace Cache
1 cmp 2 br -> T1 .. ... (unused code) Trace Cache Delivery T1: 3 sub 4 br -> T2 1 cmp 2 br T1 3 T1: sub .. ... (unused code) 4 br T2 5 mov 6 sub T2: 5 mov 7 br T3 8 T3:add 9 sub 6 sub 10 mul 11 cmp 12 br T4 7 br -> T3 .. ... (unused code) T3: 8 add 9 sub 10 mul 11 cmp 12 br -> T4
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 41
Advanced Dynamic Execution
l Extends basic features found in P6 core l Very deep speculative execution –126 instructions in flight (3x P6) –48 loads (3x P6) and 24 stores (2x P6) l Provides larger window of visibility –Better use of execution resources
DeepDeep SpeculationSpeculation ImprovesImproves ParallelismParallelism
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 42
21 Improving Instructions Per Cycle
l Improve efficiency –Branch prediction –Do more things in a clock l Reduce time it takes to do something –Reducing latency
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 43
Rapid Execution Engine
l Dramatically lower ALU latency l P6: –1 clock @ 1GHz 1ns l P4P: –½ clock @ >1.4GHz
<0.36ns Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 44
22 Example with Higher IPC and Faster Clock! Pentium® 4 Code P6 Processor Sequence @1GHz @1.4GHz Ld Add Add Ld Add Add
10 clocks 6 clocks 10ns 4.3ns IPC = 0.6 IPC = 1.0 IPCIntel = 1.0 Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 45
Recap Pentium®III Pentium® 4 Relative Processor Processor Improvement Frequency 1 GHz > 1.4 Ghz > 1.4 Adder Speed 1 ns < .36 ns > 2.8 L1 Cache Speed 3 ns < 1.42 ns > 2.1 L1 Cache Size 16 KB 8 KB 0.5 L1 Cache Bandwidth 16 GB/sec > 44.8 GB/sec > 2.8 L2 Cache Bandwidth 16 GB/sec > 44.8 GB/sec > 2.8 Uop Fetch Bandwidth 3 billion/sec > 4.2 billion/sec > 1.4 Adder Bandwidth 2 billion/sec > 5.6 billion/sec > 2.8 Branch targets 512 4092 8 Instructions In flight 40 126 3.15 Loads in flight 16 48 3 Stores in flight 12 24 2 Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 46
23 Example - Security and e-Commerce l Secure transactions enable e-Commerce l SSL is the standard for secure Web transactions – Protocol for secure communication – Built upon a core set of algorithms – Public-key encryption – RSA, DSA, Diffie-Hellman, etc. – Message digest – SHA-1, MD5, etc. – Digital signature – Bulk encryption – RC4, DES, 3DES, AES
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 47
Security Impacts Performance SSL - The basics Browser Web Server
Client Hello(settings,etc)
Server Hello (certificate, suite,etc)
Pre-master secret key
Session ID, Ready
Session ID, Ready
Data exchanges (bulk encryption) Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 48
24 Security Impacts Performance
The high cost of SSL Source - Intel Lab research Transaction time
1 2 4 8 16 32 Kbytes Transmitted
SecureSecure transactionstransactions areare ordersorders ofof Intel Labs Copyright © 2000 Intel Corporation.magnitudemagnitudeLinux slowerSuperclusterslower thanUsersthan Conference nonnon--securesecurePage 49
Identify Key Algorithms Computation in SSL
l Goal: Increase the number of secure transactions – Identify server performance issues in SSL – One server may deal with hundreds of clients
Montgomery Product
setup Authentication Bulk Data close Encryption
70%
Compute time consumed by one short SSL transaction Source - Intel Lab research Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 50
25 Breakthrough Performance on Pentium 4 Processor Architectural Features
l New instructions in SSE2 – PMULUDQ (32x32=>64) – PADDQ (64+64=>64) – PSHUFD (Re-arrange DWORDs) l All pipelined l SIMD
IncreaseIncrease sizesize andand reducereduce numbernumber ofof individualindividual multiplicationsmultiplications
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 51
Breakthrough Performance on Pentium 4 Processor Timings
Algorithm Bits Lang Clocks Ratio Naïve 32-bit 1x16 C 51000 19.07 Optimized ASM using MUL 1x32 asm 28750 10.75 Using Pentium 4 New Instructions2x32 asm 2675 1.00
AlmostAlmost 20x20x performanceperformance gaingain versusversus naïvenaïve implementationimplementation Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 52
26 Breakthrough Performance on Pentium 4 Processor Summary
l Expect 400+ 1024-bit RSA Decrypts/second l Breakthrough performance on public key algorithms for Intel® Pentium® 4 processor – The right architecture – The right instruction set
PentiumPentium®® 44processorprocessor deliversdelivers moremore securesecure transactionstransactions toto moremore usersusers
Intel Labs Copyright © 2000 Intel Corporation. Linux Supercluster Users Conference Page 53
27