AMD Presentation for Linux Kernel Summit

Richard A. Brunner, AMD Fellow
July 2005

Progress Report

Linux is moving into the mainstream with AMD64 technology
• AMD64+Linux is penetrating deeper into the Data Center
  – Demanding mainframe/UNIX functionality: 64-bit, NUMA, multi-core, and virtualization
  – Requiring solutions to infrastructure issues: more power management, security, and manageability
  – Requesting innovation without disruption: evolution as opposed to revolution; need to maintain compatibility and stability
  – Using servers and workstations as the proving ground: Linux must do well in these areas before they move to Linux on the desktop
• AMD continues the trend of openly providing early technical information on our products to the developer community for feedback.
Page 2 July 2005 Linux Kernel Summit

AMD’s Technology Roadmap

Technology Roadmap
[Figure: roadmap timeline with columns N = Today, (N+1), (N+2), (N+3)]

130nm to 90nm Performance & Power

65nm Progress

Post-45nm Research Begun and Processing

AMD’s Processor Roadmap

Introducing AMD64 Dual-Core Processor
• Two AMD Opteron™ CPU cores on a single die, each with 1MB L2 cache
• 90nm, ~205 million transistors*
  – Approximately same die size as 130nm single-core AMD Opteron processor*
• 95 watt power envelope fits into 90nm power infrastructure
• Retains compatibility with existing 32-bit and 64-bit x86-based software
• Introduced with “K8” Revision E core in April 2005
*Based on current revisions of the design
[Die diagram labels: Core 0, 1-MB L2, Northbridge, 1-MB L2, Core 1]

Designed From The Start To Add Second Core
• Shared Northbridge
  – 3 HyperTransport™ technology links
  – Dual-channel (128 bit) DDR interface
• AMD Opteron™ CPU with Direct Connect Architecture was designed as a chip multiprocessor (CMP) from the start
  – Second port on SRI, request management, two APICs
• Two complete CPU cores
  – SMP model
  – Simpler, less-restrictive programming model than “logical core” approach
  – No need to “pause” one core to give other exclusive use of shared resources
[Block diagram, "Existing AMD64 Processor Design": Core 0 and Core 1, each with 1MB L2 Cache; SRI; X-bar; DDR1 DRAM Interface; HyperTransport™ Links 0,1,2]

AMD Dual-Core Technology

AMD Athlon™ 64 X2 Dual-Core Processor (Announced June 2005), Desktop
  Model #   Freq      L2 Cache
  4800+     2.4 GHz   1 MB + 1 MB
  4600+     2.4 GHz   512 KB + 512 KB
  4400+     2.2 GHz   1 MB + 1 MB
  4200+     2.2 GHz   512 KB + 512 KB
http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_8796_9240,00.html

AMD Opteron™ Processor Dual-Core Models (Announced on April 21, 2005), Server/Workstation
  Freq      1-way       Up to 2-way   Up to 8-way
  1.8 GHz   Model 165   Model 265     Model 865
  2.0 GHz   Model 170   Model 270     Model 870
  2.2 GHz   Model 175   Model 275     Model 875
http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_9485_13041%5E13076,00.html

Dual-Core Performance/Watt
• SPECweb®99_SSL Secure Web Connections Example.
• Data Center rack space and power budgets are often fixed.
• Perf/Watt focus maximizes use of resources.
• Typical 48U Rack has 9 kVA of Power.

AMD Direct Connect Architecture + Dual-Core
  K8 REV   cHT (MHz)   cHT BW 16x16, 1-w/2-w (GB/s)   MEM (MHz)   DDR BW, 1-ch/2-ch (GB/s)
  CG       HT1-800     3.2/6.4                        DDR1-400    3.2/6.4
  E        HT1-1000    4.0/8.0                        DDR1-400    3.2/6.4
  F        HT1-1000    4.0/8.0                        DDR2
[Diagram labels: four Opteron processors, DDR1, 16x16 cHT links at 800 MHz, PCI-E, Chipset, South Bridge, CORE 0, CORE 1]

SSE3 Support
• AMD K8 Revision “E” and newer are designed to support SSE3
• Supports SSE3 instructions reported by CPUID.SSE3 feature flag
• Ten new SSE instructions and one new x87 instruction (13 total opcodes).
  – ADDSUB[PD,PS] xmm1, xmm2/m128: provides interleaved packed add and subtract
  – FISTTP m16int/m32int/m64int: like FISTP but with forced truncation
  – HADD[PD,PS] xmm1, xmm2/m128: horizontal adds
  – HSUB[PD,PS] xmm1, xmm2/m128: horizontal subtracts
  – LDDQU xmm, m128: special 128-bit unaligned load
  – MOV[D,HD,LD]DUP xmm1, xmm2/m64: move and duplicate some elements
• Monitor/Mwait planned in 2007
• CMPXCHG16B planned in 2006

Desktop/Workstation Roadmap

Server/Workstation Roadmap

Planned 2006 Processor Features
• Multi-Core capable
• DDR2 support
• RDTSCP – see next slide
• CMPXCHG16B – compare 16 bytes, exchange 16 bytes
• Correctable Machine-Check Exception Thresholding
• HW Virtualization support (AMD “Pacifica”)

RDTSCP: Read Serialized TSC Pair
• New instruction, similar to RDTSC:
  – Returns 64-bit TSC value in %edx:%eax
  – Is a serializing operation: prevents speculative reads of TSC
  – Returns TSC_AUX[31:0] MSR in %ecx at same time as TSC
    » OS initializes TSC_AUX to meaningful value
    » Atomicity ensures no context switch between read of TSC & TSC_AUX.
  – Availability determined by new extended CPUID feature flag
• Allows TSC and OS-supplied value (such as CPU number) to be read atomically in a serializing way in user mode.
  – TSC rates between CPUs in MP-system may vary
  – Linux can put CPU number in TSC_AUX so user-mode get-time-of-day knows which per-cpu adjustments to use to fix up TSC value.

Planned 2007 Processor Features
• Multi-core capable
• DDR3 support
• 1-GB pages – see next slide
• 48-bit Physical Addressing – see later slide
• Greater than 32-socket support
• P-state Invariant TSC (APIC Timer already is)
• P-state Fire-n-Forget
• Monitor/Mwait
• Shared L3-cache
• Further Virtualization extensions

1 Gigabyte Pages & 48-bit Physical Addresses
  64-bit VA:          bits 63-48 Sign-Extend | bits 47-39 PML4-O | bits 38-30 PDP-O | bits 29-0 Offset
  Table walk:         CR3 -> Page Map Level 4 Table (PML4E) -> Page Dir Pointer Table (PDP)
  Physical Address:   bits 47-30 Page PA | bits 29-0 Offset
Plan is for Physical Address in PTEs to be 48 bits for all page sizes.

Virtualization Discussion

AMD Virtualization Directions
• AMD “Pacifica”: HW-Virtualization-Assist.
  Base features planned for launch in the 2006 generation
• Primary components of Architecture:
  – Host/guest management hardware support
  – Event Injection
    » Eliminates need for VMM code to emulate x86 exception delivery
    » Designed to reduce VMM development time significantly
  – Nested Page Tables
    » Designed to improve VMM performance, and reduce overhead
    » Helps reduce VMM complexity

Core “Pacifica” Architecture: VMRUN
• Virtualization based on Virtual Machine Run (VMRUN) instruction
• VMRUN executed by host causes the guest to run
• Guest runs until it exits back to the host
• Host resumes at the instruction following VMRUN
Host instruction stream (pseudocode; the VMCB data struct is shared with the guest instruction stream, and intercepts return control to the host):
  while (1) {
      // Do World Switch
      rAX = &VMCB
      VMLOAD(rAX)
      while (running_VMM) {
          VMRUN(rAX)
          switch (exitcode) {
              // handle intercept
              // within VMM context
          }
      }
      VMSAVE(rAX)
  }

Core “Pacifica” Architecture: Intercepts
• Guest runs until:
  – It performs an action that causes an exit to the host
  – It explicitly executes the VMMCALL instruction
• The VMCB for a guest has settings that determine what actions cause the guest to exit to host
  – These intercepts can vary from guest to guest
  – Two kinds of intercepts
    » Exception & Interrupt Intercepts
    » Instruction Intercepts
  – Rich set of intercepts allows the host to customize each guest’s privileges
• Information about the intercepted event is put into the VMCB on exit

Nested Paging
• CPU maps each Guest_PA to Host_VA and then translates to Host_PA
• CPU builds compound gVA_to_hPA TLB entries (guarded by ASID)
• Far more efficient than “Shadow Page Tables”, all handled by CPU
  Guest Translation                Host Translation
  gPA = gen_PML4(gCR3, gVA);       hPA = hTRANS( hVA = gPA ); entry = MEMORY[ hPA ];
  gPA = gen_PDP(gVA, entry);       hPA = hTRANS( hVA = gPA ); entry = MEMORY[ hPA ];
  gPA = gen_PDE(gVA, entry);       hPA = hTRANS( hVA = gPA ); entry = MEMORY[ hPA ];
  gPA = gen_PTE(gVA, entry);       hPA = hTRANS( hVA = gPA ); entry = MEMORY[ hPA ];
  gPA = gen_PA(gVA, entry);        hPA = hTRANS( hVA = gPA );

Challenges / Issues

Multi-core Numbering
• Assume system has non-power-of-two number-of-cores in at least 1 processor due to design or retirement of bad core(s).
  – How to tell OS? How to keep “sanity” in core/processor bit masks?
• BIOS calculates “Rounded Number of Cores” (RNC):
  – RNC = 2^ceil( log2(Number_of_Cores) )
• BIOS assigns APIC IDs of each processor’s cores to an RNC-aligned block of IDs:
  – APIC_ID[ proc=i, core=j ] = RNC * (OFFSET + i) + j
• Example: 2-processor system
  – proc 0 has 3-cores
  – proc 1 has 4-cores
  – RNC = 4 on all cores
  Proc   Core   APIC ID
  0      0      0x8 = 4*(2+0) + 0
  0      1      0x9 = 4*(2+0) + 1
  0      2      0xA = 4*(2+0) + 2
  0      rsvd   0xB = 4*(2+0) + 3 (reserved, no core present)
  1      0      0xC = 4*(2+1) + 0
  1      1      0xD = 4*(2+1) + 1
  1      2      0xE = 4*(2+1) + 2
  1      3      0xF = 4*(2+1) + 3
• Initial APIC ID layout: pppp … cccc
  – Want APIC_ID[M:0] to always specify core
  – Want APIC_ID[N:M+1] to always specify processor

Multi-core Numbering (cont)
• OS should use same process to discover topology of processors & cores.
• OS can not assume that BSP’s CPUID.number_of_cores is same for all processors.
• OS can assume that RNC calculated on any processor is same for all processors.
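
The numbering scheme on the Multi-core Numbering slides can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not BIOS or kernel code: the function names are hypothetical, OFFSET = 2 is taken from the slide's example, and taking the system-wide RNC as the largest per-processor value is an assumption made here so that RNC comes out the same on every processor, as the slide says it must.

```python
def rounded_number_of_cores(number_of_cores: int) -> int:
    """RNC = 2^ceil(log2(number_of_cores)), using integer arithmetic only."""
    if number_of_cores < 1:
        raise ValueError("a processor needs at least one core")
    return 1 << (number_of_cores - 1).bit_length()


def system_rnc(cores_per_processor: list[int]) -> int:
    """Assumption: one RNC for the whole system, sized for the largest processor."""
    return max(rounded_number_of_cores(n) for n in cores_per_processor)


def apic_id(rnc: int, offset: int, proc: int, core: int) -> int:
    """APIC_ID[proc=i, core=j] = RNC * (OFFSET + i) + j, as on the slide."""
    return rnc * (offset + proc) + core


def split_apic_id(apic: int, rnc: int, offset: int) -> tuple[int, int]:
    """Recover (proc, core): the low log2(RNC) bits are the core field."""
    core_bits = rnc.bit_length() - 1      # rnc is a power of two
    core = apic & (rnc - 1)               # APIC_ID[M:0]
    proc = (apic >> core_bits) - offset   # APIC_ID[N:M+1], minus OFFSET
    return proc, core


# Slide example: proc 0 has 3 cores, proc 1 has 4 cores, OFFSET = 2.
cores_per_processor = [3, 4]
OFFSET = 2
rnc = system_rnc(cores_per_processor)

for proc, ncores in enumerate(cores_per_processor):
    for core in range(rnc):
        ident = apic_id(rnc, OFFSET, proc, core)
        note = "" if core < ncores else " (reserved, no core present)"
        print(f"proc {proc} core {core}: APIC ID 0x{ident:X}{note}")
```

Because every processor gets an RNC-aligned block, the core number can always be read with a fixed mask and the processor number with a fixed shift, even when a processor has fewer than RNC working cores.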
