AMDAMD 3DNow!3DNow!TM TechnologyTechnology andand thethe K6-2K6-2 MicroprocessorMicroprocessor

Stuart Oberman Fred Weber Norbert Juffa Greg Favor

California Division OUTLINEOUTLINE

• Motivation for 3DNow! Technology • Features of the 3DNow! instruction set • AMD-K6-2 implementation • 3D graphics performance • Future AccelerationAcceleration ofof MultimediaMultimedia ApplicationsApplications • Multimedia applications have become an integral part of the PC platform – Multimedia algorithms are computationally intensive

Application or Geometry transform, Triangle Game Physics Clipping, & Lighting Set up Rendering

floating point floating point floating point & integer intensive intensive integer intensive intensive MotivationMotivation forfor 3DNow!3DNow! TechnologyTechnology • Why a new technology? – Previous focus has been on integer intensive pixel rendering tasks: MMX and 3D graphics hardware – 3D graphics performance now limited by floating-point intensive front-end of – New applications require realistic physical modeling • What is it? – New set of instructions to accelerate FP computation – Defined in collaboration with leading ISV’s – Maximizes the performance of graphics accelerator cards by improving the front-end of graphics pipeline 3DNow!3DNow! TechnologyTechnology

Application or Geometry transform, Triangle Pixel Game Physics Clipping, & Lighting Set up Rendering

floating point floating point floating point & integer intensive intensive integer intensive intensive

3DNow! Technology Accelerates Graphic Cards Accelerates • Benefits – Accelerates most floating-point intensive multimedia operations • Graphics pipeline (physics, geometry, setup) • Audio processing 3DNow!3DNow! TechnologyTechnology

• 21 new instructions – “LEAN and MEAN” design philosophy – Includes only performance critical features • SIMD floating-point instructions – Compatible with IEEE single precision data type – Two 32-bit FP values per 64-bit reg/mem operand – Uses MMX registers -> avoids register stack – No exceptions – Limited rounding modes – No switching overhead between 3DNow! and MMX – Peak throughput of 4 FLOPS per cycle • No core OS support required ClassesClasses ofof InstructionsInstructions -- FPFP

• Basic arithmetic – PFADD, PFSUB, PFSUBR, PFACC, PFMUL • Comparisons – PFCMPEQ, PFCMPGT, PFCMPGE • Min/max – PFMIN, PFMAX • Conversions – PF2ID, PI2FD • Reciprocal and reciprocal square root – PFRCP, PFRSQRT – PFRCPIT1, PFRSQIT1, PFRCPIT2 ClassesClasses ofof InstructionsInstructions -- NonNon FPFP

• Integer – PAVGUSB, PMULHRW

• Data movement – PREFETCH/PREFETCHW

• Overhead reduction – FEMMS ReciprocalReciprocal andand ReciprocalReciprocal SquareSquare RootRoot

• Alternative to “classical” DIV and SQRT – Reciprocal and reciprocal square root frequently used in graphics applications – Higher performance through reuse of common divisors and radicands • Choice of reduced (14-15b) or full precision – Reduced precision sufficient for many applications and is higher performance – Avoid all long latency operations; full precision synthesized from fully-pipelined Newton-Raphson iterations ops ReciprocalReciprocal IterationIteration InstructionsInstructions • Reciprocal Newton-Raphson iteration – To compute full-precision reciprocal of b using initial

approximation R0:

•Rfull = R0 x (2 - b x R0)

–R0 is accurate to about 14 bits, b is a 24 bit number

– PFRCPIT1 performs b x R0 rounded to 32 bits, inverts the result (one’s complement), and compresses out leading 8 bits known to be identical, leaving 24 bits – PFRCPIT2 expands the previous result to 32 bits,

multiplies by R0, adds a fixed bias, and rounds to 24 bits ReciprocalReciprocal AccuracyAccuracy • Fast approximations – PFRCP accurate to 14.9 bits – PFRSQRT accurate to 15.8 bits

• Full precision – PFRCP, PFRCPIT1, PFRCPIT2 sequence provides IEEE RN result for > 99% of all operands; remaining differ by 1 unit-in-the-last-place – PFRSQRT, PFMUL, PFRSQIT1, PFRCPIT2 sequence provides IEEE RN result for > 87% of all operands; remaining differ by 1 unit-in-the-last-place AMD-K6-2AMD-K6-2 MicroprocessorMicroprocessor

• Worldwide launch May 28, 1998 • Implemented in 0.25um CMOS process • 9.3M transistors on a die of 80 mm² • New features of AMD-K6-2 – Superscalar 3DNow! and MMX units • Dual decode and dual execution pipelines • No decode pairing restrictions • Only one cycle misalignment penalty on memory accesses – 100 MHz Front Side • Increases local bus and L2 cache bandwidth by 50% • Redesigned I/O timing to allow for low cost 100 MHz motherboard AMD-K6-2AMD-K6-2 BlockBlock DiagramDiagram

32KByte Level-OneLevel-One Instruction CacheCache PredecodePredecode 6464 EntryEntry ITLBITLB Logic 20KByte PredecodePredecode Cache

1616 ByteByte FetchFetch Level-OneLevel-One CacheCache Branch Logic ControllerController DualDual InstructionInstruction Decoders (8192-Entry BHT) x86x86 toto RISC86RISC86 (16-Entry BTC) 100100 MhzMhz (16-Entry(16-Entry RAS)RAS) Super7 BusBus Four RISC86 DecodeDecode InterfaceInterface Out-of-OrderOut-of-Order Scheduler ExecutionExecution EngineEngine Scheduler Instruction Buffer Control Unit BranchBranch (24 RISC86) Control Unit Resolution Unit SixSix RISC86RISC86 (24 RISC86) Resolution Unit OperationOperation IssueIssue

Load Store RegisterRegister UnitUnit XX RegisterRegister UnitUnit YY Floating- Point UnitUnit UnitUnit (Integer/MMX/3D)(Integer/MMX/3D) (Integer/MMX/3D)(Integer/MMX/3D) Unit

Store QueueQueue

Level-One Dual-PortDual-Port Data Cache 128 EntryEntry DTLBDTLB (32KByte)(32KByte) AMD-K6-2AMD-K6-2 MultimediaMultimedia UnitsUnits

Register X Execution Shared Resources Register Y Execution Pipeline Pipeline

FP Add Recip / RecipSqrt

MMX MMX ALU ALU MMX+FP Multiplier

MMX Shifter AMD-K6-2AMD-K6-2 MultimediaMultimedia PerformancePerformance

Latency Throughput Instruction Type (cycles) (cycles)

3DNow! FP 2 1

3DNow! / MMX integer ALU 1 1

MMX multiply 2 1 RecipRecip // RecipRecip SqrtSqrt PerformancePerformance

K6-2 PII Performance Performance

14 bit reciprocal 2 cycles - pipelined

15 bit reciprocal 2 cycles - square root pipelined

24 bit reciprocal 6 cycles ~ 17 cycles pipelined non-pipelined

24 bit reciprocal 8 cycles ~ 28 cycles square root pipelined non-pipelined SharedShared 3DNow!3DNow! andand MMXMMX MultiplierMultiplier

Data Format Selection 32 bits Booth 2 Encoders

Booth Muxes MMX Data Alignment 17 PP 16b 16b

4-2 CSA Sign EX1 AL x BL 4-2 CSA Extended

4-2 CSA Zeroed AH x BH 3,2 CSA out Binary Tree 64 bits

4-2 CSA 3,2 CSA 3,2 CSA

MUX

64b CPA 64b CPA EX2 Final Result Selection (32b) 3D3D WinbenchWinbench 9898 PerformancePerformance

3D Winbench 98 / Windows 95 (DirectX 6.0 optimized for 3DNow!TM)

AMD-K6-2/300 1110

Pentium II 300 961

Pentium II 350 1060

Pentium II 400 1120

600 700 800 900 1000 1100 1200

Windows 95 OSR 2.1, 32MB DRAM, Maxtor DiamondMax IDE HD, 512K L2 cache, Diamond Viper V330 4MB AGP. AMD-K6 3D processor based system: Microstar 5169 mainboard supporting 100MHz bus. Pentium® II 300 based system: Abit LX6 mainboard supporting 66MHz bus (Pentium II 300 based systems with 100MHz bus not currently commercially available). Pentium II 350 and Pentium II 400 based systems: Asus P2B mainboard supporting 100MHz bus QuakeQuake 22 PerformancePerformance

88

86 Highest Performance without Sound

84 Timedemo: DEMO1.DM2

82 80.9 79.7 80 77.7 78

76

74 72.4 71.8 72

70 68.4 70.5 68

66

Average Frames Per Second (FPS) Per Second Frames Average 64.5

64

62 60.3 AMD-K6-2, Dual 60 AMD-K6-2, Single Voodoo2 Intel Pentium-II, Dual Voodoo2 58 56.6 Intel Pentium-II, Single Voodoo2 56 640x480 800x600 1024x768 (horizontal x vertical) K6K6 FamilyFamily andand FutureFuture K7

Sharptooth µP 400 Forum ‘98 AMD-K6-2

AMD-K6 .35 µ process .25 µ process 300 MMX Enhanced 100 MHz Bus 8.8M Transistors On-chip, Processor 162 mm² .25 µ process Speed Backside L2 100 MHz Bus 100 MHz Frontside L3 .25 µ process Performance MHz 100 MHz Frontside L2 Superscalar MMX 200 Lower Power Superscalar MMXTM 3DNow! Technology Higher Speeds 3DNow!TM Technology 21.3M Transistors 8.8M Transistors 9.3M Transistors 118 mm² 68 mm² 80 mm² AMD-K6

1H’97 2H’97 1H’98 2H’98 1H’99