Madison Processor

TheThe RoadRoad toto BillionBillion TransistorTransistor ProcessorProcessor ChipsChips inin thethe NearNear FutureFuture DileepDileep BhandarkarBhandarkar RogerRoger GolliverGolliver Intel Corporation April 22th, 2003 Copyright © 2002 Intel Corporation. Outline yy SemiconductorSemiconductor TechnologyTechnology EvolutionEvolution yy Moore’sMoore’s LawLaw VideoVideo yy ParallelismParallelism inin MicroprocessorsMicroprocessors TodayToday yy MultiprocessorMultiprocessor SystemsSystems yy TheThe PathPath toto BillionBillion TransistorsTransistors yy SummarySummary ©2002, Intel Corporation Intel, the Intel logo, Pentium, Itanium and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries *Other names and brands may be claimed as the property of others BirthBirth ofof thethe RevolutionRevolution ---- TheThe IntelIntel 40044004 IntroducedIntroduced NovemberNovember 15,15, 19711971 108108 KHz,KHz, 5050 KIPsKIPs ,, 23002300 1010μμ transistorstransistors 20012001 –– Pentium®Pentium® 44 ProcessorProcessor Introduced November 20, 2000 @1.5 GHz core, 400 MT/s bus 42 Million 0.18µ transistors August 27, 2001 @2 GHz, 400 MT/s bus 640 SPECint_base2000* 704 SPECfp_base2000* SourceSource:: hhtttptp:/://www/www.specbench.org/cpu2000/results/.specbench.org/cpu2000/results/ 3030 YearsYears ofof ProgressProgress yy40044004 toto PentiumPentium®® 44 processorprocessor –– TransistorTransistor count:count: 20,000x20,000x increaseincrease –– Frequency:Frequency: 20,000x20,000x increaseincrease –– 39%39% CompoundCompound AnnualAnnual GrowthGrowth raterate 20022002 –– Pentium®Pentium® 44 ProcessorProcessor November 14, 2002 @3.06 GHz, 533 MT/s bus 1099 SPECint_base2000* 1077 SPECfp_base2000* 55 Million 130 nm process SourceSource:: hhtttptp:/://www/www.specbench.org/cpu2000/results/.specbench.org/cpu2000/results/ ItaniumItanium® 22 ProcessorProcessor OverviewOverview y .18μm bulk, 6 layer Al Branch Unit Floating Point Unit process IA32 Pipeline Control y 8 stage, fully stalled in- L1I order pipeline cache ALAT Integer Multi- Int L1D Medi y Symmetric six integer- Datapath RF cache a issue design CLK unit HPW y IA32 execution engine DTLB integrated y 3 levels of cache on-die 21.6 mm L2D Array and Control L3 Tag totaling 3.3MB y 221 Million transistors Bus Logic y 130W @1GHz, 1.5V y 421 mm2 die 2 y 142 mm CPU core L3 Cache 19.5mm MadisonMadison ProcessorProcessor Branch Unit Floating Point Unit y 130 nm process Pipeline Control y ~1.5 GHz L1I y ~1.5 GHz IA32 Cache Multi- ALAT Integer Int Media y 6 MB L3 Cache L1D DataPath y 6 MB L3 Cache RF Unit Cache CLK HPW y 24 way set DTLB associative L2D Cache L3 Tag/LRU PADS y ~130W y 410 M transistors Bus Logic y 374 square mm L3 Cache ContinuingContinuing atat thisthis RateRate byby EndEnd ofof thethe DecadeDecade 10,000 1,000 Itanium ® 2 100 • 221M in 2002 • 410M in 2003 Pentium® 4 Million 10 Million Pentium® Pro Transistors 1 486 Pentium proc 386 0.1 286 8085 8086 0.01 8080 8008 4004 0.001 ’70 ’80 ’90 ’00 ’10 Billion Transistors possible within 4 years NanotechnologyNanotechnology AdvancementsAdvancements Influenza virus 50nm 90nm Node Lgate = 50nm 65nm Node Production - 2003 45nm Node Lgate = 30nm Lgate = 20nm Production - 2005 Production - 2007 30nm Node Lgate = 15nm Production - 2009 Process Advancements Fulfill Moore’s Law Source: Intel SemiconductorSemiconductor ManufacturingManufacturing ProcessProcess EvolutionEvolution Actual Forecast Process name P852 P854 P856 P858 Px60 P1262 P1264 Production 1993 1995 1997 1999 2001 2003 2005 Generation 0.50 0.35 0.25 0.18µm 130 nm 90 nm 65 nm Gate Length 0.50 0.35 0.20 0.13 <70 nm <50 nm <35 nm Wafer Size (mm) 200 200 200 200 200/300 300 300 New generation every 2 years Outline yy SemiconductorSemiconductor TechnologyTechnology EvolutionEvolution yy Moore’sMoore’s LawLaw VideoVideo yy ParallelismParallelism inin MicroprocessorsMicroprocessors TodayToday yy MultiprocessorMultiprocessor SystemsSystems yy TheThe PathPath toto BillionBillion TransistorsTransistors yy SummarySummary ©2002, Intel Corporation Intel, the Intel logo, Pentium, Itanium and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries *Other names and brands may be claimed as the property of others Moore’sMoore’s LawLaw Outline yy SemiconductorSemiconductor TechnologyTechnology EvolutionEvolution yy Moore’sMoore’s LawLaw VideoVideo yy ParallelismParallelism inin MicroprocessorsMicroprocessors TodayToday yy MultiprocessorMultiprocessor SystemsSystems yy TheThe PathPath toto BillionBillion TransistorsTransistors yy SummarySummary ©2002, Intel Corporation Intel, the Intel logo, Pentium, Itanium and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries *Other names and brands may be claimed as the property of others DifferentDifferent FormsForms ofof ParallelismParallelism yyWithinWithin aa processorprocessor corecore –– multiplemultiple issueissue processorsprocessors withwith lotslots ofof executionexecution unitsunits –– widerwider superscalarsuperscalar –– explicitexplicit parallelismparallelism yyMultipleMultiple processorsprocessors onon aa chipchip –– HardwareHardware MultiMulti ThreadingThreading –– MultipleMultiple corescores EPICEPIC ArchitectureArchitecture FeaturesFeatures Explicitly Parallel Instruction Computing y Enable wide execution by providing processor implementations that compiler can take advantage of y Performance through parallelism – Multiple execution units and issue ports in parallel – 2 bundles (up to 6 Instructions) dispatched every cycle y Massive on-chip resources – 128 general registers, 128 floating point registers y 64 predicpredicateate registers, 8 branch registers – Exploit parallelism – Efficient management engines (register stack engine) y Provide features that enable compiler to reschedule programs using advanced features (predication, speculation, software pipelining) y Enable, enhance, express, and exploit parallelism ItaniumItanium® 22 ProcessorProcessor ArchitectureArchitecture Itanium 2 processor 6.4 GB/s 6.4 GB/s High speed 128 bits wide System bus 400 MHz System bus 3 MB L3, 256k L2, 32k L1 all on-die Large on-die cache, reduced latency 8-stage pipeline 8 12345678910 11 11 Issue ports 328 on-board Registers Large number of Execution units Lots of 6 Integer, 2 FP, 2 Load & 3 Branch 1 SIMD 2 Store parallelism within a 1 GHz single core 6 Instructions / Cycle Itanium 2 processor 221 million transistors total 25 million in CPU core logic HighHigh PerformancePerformance DesignDesign SpaceSpace With Each Process Generation y Frequency increases by about 1.5X y Vcc will scale by only ~0.8 y Active power will scale by ~0.9 y Active power density will increase by ~30-80% y Leakage power will make it even worse Doubling performance requires more than 4 times the transistors Instruction Thread Level Level Parallelism Parallelism Single-Stream Perf HW Multi-Threading Chip Multiprocessing Faster frequency Multi-Core Wider Superscalar More Out-of-order speculation Challenges: Power and Design Complexity PowerPower && PerformancePerformance TradeoffsTradeoffs y Throughput Perf α SQRT(Frequency) y But Power α Capacitance * Voltage2 * Frequency – Frequency α Voltage y Power increases non-linearly with Frequency y Throughput Performance can be achieved at Lower Power with multiple low power cores y Tradeoff between single stream and throughput performance for a constant power envelop Can optimize for either Throughput or Single Stream SingleSingle--streamstream PerformancePerformance vsvs CostsCosts 25 SPECInt vs 486 vs 486 20 SPECFP perf perf Die Size 15 Power 10 5 Relative power/die size/ Relative power/die size/ 0 486 Pentium® Pentium® III Pentium® 4 Processor Processor Processor Die Size and Power have increased with Scalar Performance LongLong LatencyLatency DRAMDRAM Accesses:Accesses: NeedsNeeds MemoryMemory LevelLevel ParallelismParallelism (MLP)(MLP) 1400 1200 1000 800 600 ions during DRAM Access ions during DRAM Access 400 200 Peak Instruct Peak Instruct 0 Pentium® Pentium-Pro Pentium III Pentium 4 Future CPUs Processor Processor Processor Processor 66 MHz 200 MHz 1100 MHz 2 GHz MultithreadingMultithreading y Introduced on Intel® Xeon™ Processor MP y Two logical processors for < 5% additional die area y Executes two tasks simultaneously – Two different applications – Two threads of same application – Overlap execution behind memory access y CPU maintains architecture state for two processors – Two logical processors per physical processor y Power efficient performance gain y 20-30% performance improvement on many throughput oriented workloads HyperThreadingHyperThreading Technology:Technology: WhatWhat waswas addedadded toto IntelIntel XeonXeon MPMP Processor?Processor? Instruction Streaming Buffers Next Instruction Pointer Return Stack Predictor Trace Cache Next IP Trace Cache Fill Buffers Instruction TLB Register Alias Tables <5% die size (& max power), up to 30% performance increase IBMIBM Power4Power4 DualDual ProcessorProcessor onon aa ChipChip Two cores (~30M transistors each) Large Shared L2: Multi-ported: 3 independent L3 & Mem slices Controller: L3 tags Chip-to-Chip & on-die for MCM-to-MCM full-speed Fabric: coherency Glueless SMP checks *Other names and brands may be claimed as the property of others MontecitoMontecito DetailsDetails yy DualDual--corecore processorprocessor Montecito Die yy EachEach corecore withwith itsits ownown L3L3 Core

Madison Processor

Outline ECE473 Computer Architecture and Organization • Technology Trends • Introduction to Computer Technology Trends Architecture

Lecture Note 1

On the Hardware Reduction of Z-Datapath of Vectoring CORDIC

18-447 Computer Architecture Lecture 6: Multi-Cycle and Microprogrammed Microarchitectures

The Advantages of FRAM-Based Smart Ics for Next Generation Government Electronic Ids

45-Year CPU Evolution: One Law and Two Equations

The Birth, Evolution and Future of Microprocessor

Nano-Cmos Scaling Problems and Implications

Performance and Energy Efficient Network-On-Chip Architectures

Benchmarking the Intel FPGA SDK for Opencl Memory Interface

Class-Action Lawsuit

Lecture 5: Scaled MOSFETS for Ics