AMD 3DNow!™ Technology and the K6-2 Microprocessor

Stuart Oberman, Fred Weber, Norbert Juffa, Greg Favor
California Microprocessor Division
Advanced Micro Devices

OUTLINE
• Motivation for 3DNow! Technology
• Features of the 3DNow! instruction set
• AMD-K6-2 implementation
• 3D graphics performance
• Future

Acceleration of Multimedia Applications
• Multimedia applications have become an integral part of the PC platform
  – Multimedia algorithms are computationally intensive

  Graphics pipeline stage                     Workload
  Application or Game Physics                 floating point intensive
  Geometry transform, Clipping, & Lighting    floating point intensive
  Triangle Set up                             floating point & integer intensive
  Pixel Rendering                             integer intensive

Motivation for 3DNow! Technology
• Why a new technology?
  – Previous focus has been on integer intensive pixel rendering tasks: MMX and 3D graphics hardware
  – 3D graphics performance now limited by floating-point intensive front-end of graphics pipeline
  – New applications require realistic physical modeling
• What is it?
  – New set of instructions to accelerate FP computation
  – Defined in collaboration with leading ISVs
  – Maximizes the performance of graphics accelerator cards by improving the front-end of graphics pipeline

3DNow! Technology
[Figure: the graphics pipeline diagram repeated, with two annotations: "3DNow! Technology Accelerates" and "Graphic Cards Accelerates"]
• Benefits
  – Accelerates most floating-point intensive multimedia operations
    • Graphics pipeline (physics, geometry, setup)
    • Audio processing

3DNow! Technology
• 21 new instructions
  – "LEAN and MEAN" design philosophy
  – Includes only performance critical features
• SIMD floating-point instructions
  – Compatible with IEEE single precision data type
  – Two 32-bit FP values per 64-bit reg/mem operand
  – Uses MMX registers -> avoids x87 register stack
  – No exceptions
  – Limited rounding modes
  – No switching overhead between 3DNow! and MMX
  – Peak throughput of 4 FLOPS per cycle
• No core OS support required

Classes of Instructions - FP
• Basic arithmetic
  – PFADD, PFSUB, PFSUBR, PFACC, PFMUL
• Comparisons
  – PFCMPEQ, PFCMPGT, PFCMPGE
• Min/max
  – PFMIN, PFMAX
• Conversions
  – PF2ID, PI2FD
• Reciprocal and reciprocal square root
  – PFRCP, PFRSQRT
  – PFRCPIT1, PFRSQIT1, PFRCPIT2

Classes of Instructions - Non FP
• Integer
  – PAVGUSB, PMULHRW
• Data movement
  – PREFETCH/PREFETCHW
• Overhead reduction
  – FEMMS

Reciprocal and Reciprocal Square Root
• Alternative to "classical" DIV and SQRT
  – Reciprocal and reciprocal square root frequently used in graphics applications
  – Higher performance through reuse of common divisors and radicands
• Choice of reduced (14-15b) or full precision
  – Reduced precision sufficient for many applications and is higher performance
  – Avoid all long latency operations; full precision synthesized from fully-pipelined Newton-Raphson iteration ops

Reciprocal Iteration Instructions
• Reciprocal Newton-Raphson iteration
  – To compute full-precision reciprocal of b using initial approximation R0:
    • Rfull = R0 x (2 - b x R0)
  – R0 is accurate to about 14 bits, b is a 24 bit number
  – PFRCPIT1 performs b x R0 rounded to 32 bits, inverts the result (one's complement), and compresses out leading 8 bits known to be identical, leaving 24 bits
  – PFRCPIT2 expands the previous result to 32 bits, multiplies by R0, adds a fixed bias, and rounds to 24 bits

Reciprocal Accuracy
• Fast approximations
  – PFRCP accurate to 14.9 bits
  – PFRSQRT accurate to 15.8 bits
• Full precision
  – PFRCP, PFRCPIT1, PFRCPIT2 sequence provides IEEE RN result for > 99% of all operands; remaining differ by 1 unit-in-the-last-place
  – PFRSQRT, PFMUL, PFRSQIT1, PFRCPIT2 sequence provides IEEE RN result for > 87% of all operands; remaining differ by 1 unit-in-the-last-place

AMD-K6-2 Microprocessor
• Worldwide launch May 28, 1998
• Implemented in 0.25 µm CMOS process
• 9.3M transistors on a die of 80 mm²
• New features of AMD-K6-2
  – Superscalar 3DNow! and MMX units
    • Dual decode and dual execution pipelines
    • No decode pairing restrictions
    • Only one cycle misalignment penalty on memory accesses
  – 100 MHz Front Side bus
    • Increases local bus and L2 cache bandwidth by 50%
    • Redesigned I/O timing to allow for low cost 100 MHz motherboard

AMD-K6-2 Block Diagram
[Figure: block diagram. Labeled blocks:]
• 32KByte Level-One Instruction Cache, 20KByte Predecode Cache, Predecode Logic, 64 Entry ITLB
• 16 Byte Fetch, Level-One Cache Controller
• Branch Logic (8192-Entry BHT) (16-Entry BTC) (16-Entry RAS)
• Dual Instruction Decoders, x86 to RISC86; Four RISC86 Decode
• 100 MHz Super7 Bus Interface
• Out-of-Order Execution Engine: Scheduler Buffer (24 RISC86), Instruction Control Unit, Six RISC86 Operation Issue
• Load Unit, Store Unit, Register Unit X (Integer/MMX/3D), Register Unit Y (Integer/MMX/3D), Floating-Point Unit, Branch Resolution Unit
• Store Queue, Level-One Dual-Port Data Cache (32KByte), 128 Entry DTLB

AMD-K6-2 Multimedia Units
[Figure: multimedia unit organization. Columns: Register X Execution Pipeline, Shared Resources, Register Y Execution Pipeline. Blocks: FP Add, Recip / RecipSqrt, MMX ALU (two), MMX+FP Multiplier, MMX Shifter]

AMD-K6-2 Multimedia Performance

  Instruction Type            Latency (cycles)   Throughput (cycles)
  3DNow! FP                   2                  1
  3DNow! / MMX integer ALU    1                  1
  MMX multiply                2                  1

Recip / Recip Sqrt Performance

  Operation                        K6-2 Performance       PII Performance
  14 bit reciprocal                2 cycles, pipelined    -
  15 bit reciprocal square root    2 cycles, pipelined    -
  24 bit reciprocal                6 cycles, pipelined    ~ 17 cycles, non-pipelined
  24 bit reciprocal square root    8 cycles, pipelined    ~ 28 cycles, non-pipelined

Shared 3DNow! and MMX Multiplier
[Figure: multiplier datapath. Labels: Data Format Selection (32 bits), Booth 2 Encoders, Booth Muxes, MMX Data Alignment, 17 PP, AL x BL (16b), AH x BH (16b), Sign Extended, Zeroed, 4-2 CSA and 3,2 CSA Binary Tree (64 bits out), MUX, 64b CPA (two), Final Result Selection (32b); pipeline stages EX1 and EX2]

3D Winbench 98 Performance
3D Winbench 98 / Windows 95 (DirectX 6.0 optimized for 3DNow!™)

  System            Score
  AMD-K6-2/300      1110
  Pentium II 300    961
  Pentium II 350    1060
  Pentium II 400    1120

Windows 95 OSR 2.1, 32MB DRAM, Maxtor DiamondMax IDE HD, 512K L2 cache, Diamond Viper V330 4MB AGP.
AMD-K6 3D processor based system: Microstar 5169 mainboard supporting 100MHz bus.
Pentium® II 300 based system: Abit LX6 mainboard supporting 66MHz bus (Pentium II 300 based systems with 100MHz bus not currently commercially available).
Pentium II 350 and Pentium II 400 based systems: Asus P2B mainboard supporting 100MHz bus.

Quake 2 Performance
[Figure: line chart, "Highest Performance without Sound", Timedemo: DEMO1.DM2. Y axis: Average Frames Per Second (FPS). X axis: Display Resolution (horizontal x vertical): 640x480, 800x600, 1024x768. Series: AMD-K6-2, Dual Voodoo2; AMD-K6-2, Single Voodoo2; Intel Pentium-II, Dual Voodoo2; Intel Pentium-II, Single Voodoo2. Data labels (FPS): 80.9, 79.7, 77.7, 72.4, 71.8, 68.4, 70.5, 64.5, 60.3, 56.6]

K6 Family and Future
[Figure: roadmap chart. Y axis: Performance / Processor Speed (MHz): 200, 300, 400. X axis: 1H'97, 2H'97, 1H'98, 2H'98, 1H'99]
• AMD-K6: .35 µ process, MMX Enhanced Processor, 8.8M Transistors, 162 mm²
• AMD-K6: .25 µ process, Lower Power, Higher Speeds, 8.8M Transistors, 68 mm²
• AMD-K6-2: .25 µ process, 100 MHz Bus, 100 MHz Frontside L2, Superscalar MMX™, 3DNow!™ Technology, 9.3M Transistors, 80 mm²
• Sharptooth: .25 µ process, 100 MHz Bus, On-chip Backside L2, 100 MHz Frontside L3, Superscalar MMX, 3DNow! Technology, 21.3M Transistors, 118 mm²
• K7: µP Forum '98
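
The Reciprocal Iteration Instructions slide gives the refinement Rfull = R0 x (2 - b x R0). The following is a minimal numerical sketch of that formula in Python, not a model of the hardware: the helper names (truncate_to_bits, reciprocal_seed, refine_reciprocal, accurate_bits) are hypothetical, the 14-bit seed is simulated by truncating an exact reciprocal rather than by PFRCP's actual lookup, and the arithmetic uses Python double-precision floats instead of the 32-bit/24-bit intermediate formats of PFRCPIT1 and PFRCPIT2.

```python
import math


def truncate_to_bits(x: float, bits: int) -> float:
    """Keep only the top `bits` significand bits of x (simulates a coarse estimate)."""
    if x == 0.0:
        return 0.0
    mantissa, exponent = math.frexp(x)  # x == mantissa * 2**exponent, 0.5 <= |mantissa| < 1
    scale = 1 << bits
    return math.ldexp(math.floor(mantissa * scale) / scale, exponent)


def reciprocal_seed(b: float, bits: int = 14) -> float:
    """Hypothetical stand-in for a fast reciprocal estimate good to about `bits` bits."""
    return truncate_to_bits(1.0 / b, bits)


def refine_reciprocal(b: float, r0: float) -> float:
    """One Newton-Raphson step from the slide: Rfull = R0 x (2 - b x R0)."""
    return r0 * (2.0 - b * r0)


def accurate_bits(approx: float, exact: float) -> float:
    """Number of correct leading bits, measured as -log2 of the relative error."""
    rel = abs(approx - exact) / abs(exact)
    return math.inf if rel == 0.0 else -math.log2(rel)


if __name__ == "__main__":
    b = 3.0
    r0 = reciprocal_seed(b)
    r_full = refine_reciprocal(b, r0)
    print("seed bits:   ", accurate_bits(r0, 1.0 / b))
    print("refined bits:", accurate_bits(r_full, 1.0 / b))
```

If R0 = (1/b)(1 - e), one step yields (1/b)(1 - e^2), so the number of correct bits roughly doubles. That is why a single iteration starting from a seed of about 14 bits is enough to cover a 24-bit single-precision significand.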
