Madison Processor

Total Page:16

File Type:pdf, Size:1020Kb

Madison Processor TheThe RoadRoad toto BillionBillion TransistorTransistor ProcessorProcessor ChipsChips inin thethe NearNear FutureFuture DileepDileep BhandarkarBhandarkar RogerRoger GolliverGolliver Intel Corporation April 22th, 2003 Copyright © 2002 Intel Corporation. Outline yy SemiconductorSemiconductor TechnologyTechnology EvolutionEvolution yy Moore’sMoore’s LawLaw VideoVideo yy ParallelismParallelism inin MicroprocessorsMicroprocessors TodayToday yy MultiprocessorMultiprocessor SystemsSystems yy TheThe PathPath toto BillionBillion TransistorsTransistors yy SummarySummary ©2002, Intel Corporation Intel, the Intel logo, Pentium, Itanium and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries *Other names and brands may be claimed as the property of others BirthBirth ofof thethe RevolutionRevolution ---- TheThe IntelIntel 40044004 IntroducedIntroduced NovemberNovember 15,15, 19711971 108108 KHz,KHz, 5050 KIPsKIPs ,, 23002300 1010μμ transistorstransistors 20012001 –– Pentium®Pentium® 44 ProcessorProcessor Introduced November 20, 2000 @1.5 GHz core, 400 MT/s bus 42 Million 0.18µ transistors August 27, 2001 @2 GHz, 400 MT/s bus 640 SPECint_base2000* 704 SPECfp_base2000* SourceSource:: hhtttptp:/://www/www.specbench.org/cpu2000/results/.specbench.org/cpu2000/results/ 3030 YearsYears ofof ProgressProgress yy40044004 toto PentiumPentium®® 44 processorprocessor –– TransistorTransistor count:count: 20,000x20,000x increaseincrease –– Frequency:Frequency: 20,000x20,000x increaseincrease –– 39%39% CompoundCompound AnnualAnnual GrowthGrowth raterate 20022002 –– Pentium®Pentium® 44 ProcessorProcessor November 14, 2002 @3.06 GHz, 533 MT/s bus 1099 SPECint_base2000* 1077 SPECfp_base2000* 55 Million 130 nm process SourceSource:: hhtttptp:/://www/www.specbench.org/cpu2000/results/.specbench.org/cpu2000/results/ ItaniumItanium® 22 ProcessorProcessor OverviewOverview y .18μm bulk, 6 layer Al Branch Unit Floating Point Unit process IA32 Pipeline Control y 8 stage, fully stalled in- L1I order pipeline cache ALAT Integer Multi- Int L1D Medi y Symmetric six integer- Datapath RF cache a issue design CLK unit HPW y IA32 execution engine DTLB integrated y 3 levels of cache on-die 21.6 mm L2D Array and Control L3 Tag totaling 3.3MB y 221 Million transistors Bus Logic y 130W @1GHz, 1.5V y 421 mm2 die 2 y 142 mm CPU core L3 Cache 19.5mm MadisonMadison ProcessorProcessor Branch Unit Floating Point Unit y 130 nm process Pipeline Control y ~1.5 GHz L1I y ~1.5 GHz IA32 Cache Multi- ALAT Integer Int Media y 6 MB L3 Cache L1D DataPath y 6 MB L3 Cache RF Unit Cache CLK HPW y 24 way set DTLB associative L2D Cache L3 Tag/LRU PADS y ~130W y 410 M transistors Bus Logic y 374 square mm L3 Cache ContinuingContinuing atat thisthis RateRate byby EndEnd ofof thethe DecadeDecade 10,000 1,000 Itanium ® 2 100 • 221M in 2002 • 410M in 2003 Pentium® 4 Million 10 Million Pentium® Pro Transistors 1 486 Pentium proc 386 0.1 286 8085 8086 0.01 8080 8008 4004 0.001 ’70 ’80 ’90 ’00 ’10 Billion Transistors possible within 4 years NanotechnologyNanotechnology AdvancementsAdvancements Influenza virus 50nm 90nm Node Lgate = 50nm 65nm Node Production - 2003 45nm Node Lgate = 30nm Lgate = 20nm Production - 2005 Production - 2007 30nm Node Lgate = 15nm Production - 2009 Process Advancements Fulfill Moore’s Law Source: Intel SemiconductorSemiconductor ManufacturingManufacturing ProcessProcess EvolutionEvolution Actual Forecast Process name P852 P854 P856 P858 Px60 P1262 P1264 Production 1993 1995 1997 1999 2001 2003 2005 Generation 0.50 0.35 0.25 0.18µm 130 nm 90 nm 65 nm Gate Length 0.50 0.35 0.20 0.13 <70 nm <50 nm <35 nm Wafer Size (mm) 200 200 200 200 200/300 300 300 New generation every 2 years Outline yy SemiconductorSemiconductor TechnologyTechnology EvolutionEvolution yy Moore’sMoore’s LawLaw VideoVideo yy ParallelismParallelism inin MicroprocessorsMicroprocessors TodayToday yy MultiprocessorMultiprocessor SystemsSystems yy TheThe PathPath toto BillionBillion TransistorsTransistors yy SummarySummary ©2002, Intel Corporation Intel, the Intel logo, Pentium, Itanium and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries *Other names and brands may be claimed as the property of others Moore’sMoore’s LawLaw Outline yy SemiconductorSemiconductor TechnologyTechnology EvolutionEvolution yy Moore’sMoore’s LawLaw VideoVideo yy ParallelismParallelism inin MicroprocessorsMicroprocessors TodayToday yy MultiprocessorMultiprocessor SystemsSystems yy TheThe PathPath toto BillionBillion TransistorsTransistors yy SummarySummary ©2002, Intel Corporation Intel, the Intel logo, Pentium, Itanium and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries *Other names and brands may be claimed as the property of others DifferentDifferent FormsForms ofof ParallelismParallelism yyWithinWithin aa processorprocessor corecore –– multiplemultiple issueissue processorsprocessors withwith lotslots ofof executionexecution unitsunits –– widerwider superscalarsuperscalar –– explicitexplicit parallelismparallelism yyMultipleMultiple processorsprocessors onon aa chipchip –– HardwareHardware MultiMulti ThreadingThreading –– MultipleMultiple corescores EPICEPIC ArchitectureArchitecture FeaturesFeatures Explicitly Parallel Instruction Computing y Enable wide execution by providing processor implementations that compiler can take advantage of y Performance through parallelism – Multiple execution units and issue ports in parallel – 2 bundles (up to 6 Instructions) dispatched every cycle y Massive on-chip resources – 128 general registers, 128 floating point registers y 64 predicpredicateate registers, 8 branch registers – Exploit parallelism – Efficient management engines (register stack engine) y Provide features that enable compiler to reschedule programs using advanced features (predication, speculation, software pipelining) y Enable, enhance, express, and exploit parallelism ItaniumItanium® 22 ProcessorProcessor ArchitectureArchitecture Itanium 2 processor 6.4 GB/s 6.4 GB/s High speed 128 bits wide System bus 400 MHz System bus 3 MB L3, 256k L2, 32k L1 all on-die Large on-die cache, reduced latency 8-stage pipeline 8 12345678910 11 11 Issue ports 328 on-board Registers Large number of Execution units Lots of 6 Integer, 2 FP, 2 Load & 3 Branch 1 SIMD 2 Store parallelism within a 1 GHz single core 6 Instructions / Cycle Itanium 2 processor 221 million transistors total 25 million in CPU core logic HighHigh PerformancePerformance DesignDesign SpaceSpace With Each Process Generation y Frequency increases by about 1.5X y Vcc will scale by only ~0.8 y Active power will scale by ~0.9 y Active power density will increase by ~30-80% y Leakage power will make it even worse Doubling performance requires more than 4 times the transistors Instruction Thread Level Level Parallelism Parallelism Single-Stream Perf HW Multi-Threading Chip Multiprocessing Faster frequency Multi-Core Wider Superscalar More Out-of-order speculation Challenges: Power and Design Complexity PowerPower && PerformancePerformance TradeoffsTradeoffs y Throughput Perf α SQRT(Frequency) y But Power α Capacitance * Voltage2 * Frequency – Frequency α Voltage y Power increases non-linearly with Frequency y Throughput Performance can be achieved at Lower Power with multiple low power cores y Tradeoff between single stream and throughput performance for a constant power envelop Can optimize for either Throughput or Single Stream SingleSingle--streamstream PerformancePerformance vsvs CostsCosts 25 SPECInt vs 486 vs 486 20 SPECFP perf perf Die Size 15 Power 10 5 Relative power/die size/ Relative power/die size/ 0 486 Pentium® Pentium® III Pentium® 4 Processor Processor Processor Die Size and Power have increased with Scalar Performance LongLong LatencyLatency DRAMDRAM Accesses:Accesses: NeedsNeeds MemoryMemory LevelLevel ParallelismParallelism (MLP)(MLP) 1400 1200 1000 800 600 ions during DRAM Access ions during DRAM Access 400 200 Peak Instruct Peak Instruct 0 Pentium® Pentium-Pro Pentium III Pentium 4 Future CPUs Processor Processor Processor Processor 66 MHz 200 MHz 1100 MHz 2 GHz MultithreadingMultithreading y Introduced on Intel® Xeon™ Processor MP y Two logical processors for < 5% additional die area y Executes two tasks simultaneously – Two different applications – Two threads of same application – Overlap execution behind memory access y CPU maintains architecture state for two processors – Two logical processors per physical processor y Power efficient performance gain y 20-30% performance improvement on many throughput oriented workloads HyperThreadingHyperThreading Technology:Technology: WhatWhat waswas addedadded toto IntelIntel XeonXeon MPMP Processor?Processor? Instruction Streaming Buffers Next Instruction Pointer Return Stack Predictor Trace Cache Next IP Trace Cache Fill Buffers Instruction TLB Register Alias Tables <5% die size (& max power), up to 30% performance increase IBMIBM Power4Power4 DualDual ProcessorProcessor onon aa ChipChip Two cores (~30M transistors each) Large Shared L2: Multi-ported: 3 independent L3 & Mem slices Controller: L3 tags Chip-to-Chip & on-die for MCM-to-MCM full-speed Fabric: coherency Glueless SMP checks *Other names and brands may be claimed as the property of others MontecitoMontecito DetailsDetails yy DualDual--corecore processorprocessor Montecito Die yy EachEach corecore withwith itsits ownown L3L3 Core
Recommended publications
  • Outline ECE473 Computer Architecture and Organization • Technology Trends • Introduction to Computer Technology Trends Architecture
    Outline ECE473 Computer Architecture and Organization • Technology Trends • Introduction to Computer Technology Trends Architecture Lecturer: Prof. Yifeng Zhu Fall, 2009 Portions of these slides are derived from: ECE473 Lec 1.1 ECE473 Lec 1.2 Dave Patterson © UCB Birth of the Revolution -- What If Your Salary? The Intel 4004 • Parameters – $16 base First Microprocessor in 1971 – 59% growth/year – 40 years • Intel 4004 • 2300 transistors • Initially $16 Æ buy book • Barely a processor • 3rd year’s $64 Æ buy computer game • Could access 300 bytes • 16th year’s $27 ,000 Æ buy cacar of memory • 22nd year’s $430,000 Æ buy house th @intel • 40 year’s > billion dollars Æ buy a lot Introduced November 15, 1971 You have to find fundamental new ways to spend money! 108 KHz, 50 KIPs, 2300 10μ transistors ECE473 Lec 1.3 ECE473 Lec 1.4 2002 - Intel Itanium 2 Processor for Servers 2002 – Pentium® 4 Processor • 64-bit processors Branch Unit Floating Point Unit • .18μm bulk, 6 layer Al process IA32 Pipeline Control November 14, 2002 L1I • 8 stage, fully stalled in- cache ALAT Integer Multi- Int order pipeline L1D Medi Datapath RF @3.06 GHz, 533 MT/s bus cache a • Symmetric six integer- CLK unit issue design HPW DTLB 1099 SPECint_base2000* • IA32 execution engine 1077 SPECfp_base2000* integrated 21.6 mm L2D Array and Control L3 Tag • 3 levels of cache on-die totaling 3.3MB 55 Million 130 nm process • 221 Million transistors Bus Logic • 130W @1GHz, 1.5V • 421 mm2 die @intel • 142 mm2 CPU core L3 Cache ECE473 Lec 1.5 ECE473 19.5mm Lec 1.6 Source: http://www.specbench.org/cpu2000/results/ @intel 2006 - Intel Core Duo Processors for Desktop 2008 - Intel Core i7 64-bit x86-64 PERFORMANCE • Successor to the Intel Core 2 family 40% • Max CPU clock: 2.66 GHz to 3.33 GHz • Cores :4(: 4 (physical)8(), 8 (logical) • 45 nm CMOS process • Adding GPU into the processor POWER 40% …relative to Intel® Pentium® D 960 When compared to the Intel® Pentium® D processor 960.
    [Show full text]
  • Lecture Note 1
    EE586 VLSI Design Partha Pande School of EECS Washington State University [email protected] Lecture 1 (Introduction) Why is designing digital ICs different today than it was before? Will it change in future? The First Computer The Babbage Difference Engine (1832) 25,000 parts cost: £17,470 ENIAC - The first electronic computer (1946) The Transistor Revolution First transistor Bell Labs, 1948 The First Integrated Circuits Bipolar logic 1960’s ECL 3-input Gate Motorola 1966 Intel 4004 Micro-Processor 1971 1000 transistors 1 MHz operation Intel Pentium (IV) microprocessor Moore’s Law In 1965, Gordon Moore noted that the number of transistors on a chip doubled every 18 to 24 months. He made a prediction that semiconductor technology will double its effectiveness every 18 months Moore’s Law 16 15 14 13 12 11 10 9 8 7 6 OF THE NUMBER OF 2 5 4 LOG 3 2 1 COMPONENTS PER INTEGRATED FUNCTION 0 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 Electronics, April 19, 1965. Evolution in Complexity Transistor Counts 1 Billion K Transistors 1,000,000 100,000 Pentium® III 10,000 Pentium® II Pentium® Pro 1,000 Pentium® i486 100 i386 80286 10 8086 Source: Intel 1 1975 1980 1985 1990 1995 2000 2005 2010 Projected Courtesy, Intel Moore’s law in Microprocessors 1000 2X growth in 1.96 years! 100 10 P6 Pentium® proc 1 486 386 0.1 286 Transistors (MT) Transistors 8086 Transistors8085 on Lead Microprocessors double every 2 years 0.01 8080 8008 4004 0.001 1970 1980 1990 2000 2010 Year Courtesy, Intel Die Size Growth 100 P6
    [Show full text]
  • On the Hardware Reduction of Z-Datapath of Vectoring CORDIC
    On the Hardware Reduction of z-Datapath of Vectoring CORDIC R. Stapenhurst*, K. Maharatna**, J. Mathew*, J.L.Nunez-Yanez* and D. K. Pradhan* *University of Bristol, Bristol, UK **University of Southampton, Southampton, UK [email protected] Abstract— In this article we present a novel design of a hardware wordlength larger than 18-bits the hardware requirement of it optimal vectoring CORDIC processor. We present a mathematical becomes more than the classical CORDIC. theory to show that using bipolar binary notation it is possible to eliminate all the arithmetic computations required along the z- In this particular work we propose a formulation to eliminate datapath. Using this technique it is possible to achieve three and 1.5 all the arithmetic operations along the z-datapath for conventional times reduction in the number of registers and adder respectively two-sided vector rotation and thereby reducing the hardware compared to classical CORDIC. Following this, a 16-bit vectoring while increasing the accuracy. Also the resulting architecture CORDIC is designed for the application in Synchronizer for IEEE shows significant hardware saving as the wordlength increases. 802.11a standard. The total area and dynamic power consumption Although we stick to the 2’s complement number system, without of the processor is 0.14 mm2 and 700μW respectively when loss of generality, this formulation can be adopted easily for synthesized in 0.18μm CMOS library which shows its effectiveness redundant arithmetic and higher radix formulation. A 16-bit as a low-area low-power processor. processor developed following this formulation requires 0.14 mm2 area and consumes 700 μW dynamic power when synthesized in 0.18μm CMOS library.
    [Show full text]
  • 18-447 Computer Architecture Lecture 6: Multi-Cycle and Microprogrammed Microarchitectures
    18-447 Computer Architecture Lecture 6: Multi-Cycle and Microprogrammed Microarchitectures Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 1/28/2015 Agenda for Today & Next Few Lectures n Single-cycle Microarchitectures n Multi-cycle and Microprogrammed Microarchitectures n Pipelining n Issues in Pipelining: Control & Data Dependence Handling, State Maintenance and Recovery, … n Out-of-Order Execution n Issues in OoO Execution: Load-Store Handling, … 2 Reminder on Assignments n Lab 2 due next Friday (Feb 6) q Start early! n HW 1 due today n HW 2 out n Remember that all is for your benefit q Homeworks, especially so q All assignments can take time, but the goal is for you to learn very well 3 Lab 1 Grades 25 20 15 10 5 Number of Students 0 30 40 50 60 70 80 90 100 n Mean: 88.0 n Median: 96.0 n Standard Deviation: 16.9 4 Extra Credit for Lab Assignment 2 n Complete your normal (single-cycle) implementation first, and get it checked off in lab. n Then, implement the MIPS core using a microcoded approach similar to what we will discuss in class. n We are not specifying any particular details of the microcode format or the microarchitecture; you can be creative. n For the extra credit, the microcoded implementation should execute the same programs that your ordinary implementation does, and you should demo it by the normal lab deadline. n You will get maximum 4% of course grade n Document what you have done and demonstrate well 5 Readings for Today n P&P, Revised Appendix C q Microarchitecture of the LC-3b q Appendix A (LC-3b ISA) will be useful in following this n P&H, Appendix D q Mapping Control to Hardware n Optional q Maurice Wilkes, “The Best Way to Design an Automatic Calculating Machine,” Manchester Univ.
    [Show full text]
  • The Advantages of FRAM-Based Smart Ics for Next Generation Government Electronic Ids
    The Advantages of FRAM-Based Smart ICs for Next Generation Government Electronic IDs By Joseph Pearson and Dr. Ted Moise Texas Instruments, Inc. September 27, 2007 Executive Summary Electronic versions of passports and other government-issued identification documents use an Integrated Circuit (IC) or chip to establish a digital link between the holder and personal biometric information, such as a digitized photo, fingerprint or iris image. Designed to enhance border, physical and IT security, electronic chips ensure that the person holding a passport or government document is the one to whom it was legitimately issued. The next generation ICs will employ an advanced embedded memory technology, called FRAM (Ferroelectric Random Access Memory), which considerably improves the speed and reliability of future smart, secure e-passports and government ID documents. More than 50 countries have electronic passport (e-passport) programs, and many countries are also putting in place more secure forms of electronic citizen, visitor and government employee identification. As the volume of document issuance increases and new security threats occur, there is an increased need for industry-standard, next-generation contactless smart IC solutions that securely store, process and communicate data. These new smart ICs will have increased writing speeds to produce and process documents faster and more efficiently, as well as enhanced memory for future security requirements. When FRAM is manufactured at the 130 nanometer semiconductor process node, and embedded in a smart IC, it surpasses the limitations of current Electrically Erasable Programmable Read-Only Memory (EEPROM) and other memory technologies used in government ID applications. The imminent introduction of this innovative memory technology for smart ICs signals a shift in performance in smart card applications deployed in government electronic ID documents.
    [Show full text]
  • 45-Year CPU Evolution: One Law and Two Equations
    45-year CPU evolution: one law and two equations Daniel Etiemble LRI-CNRS University Paris Sud Orsay, France [email protected] Abstract— Moore’s law and two equations allow to explain the a) IC is the instruction count. main trends of CPU evolution since MOS technologies have been b) CPI is the clock cycles per instruction and IPC = 1/CPI is the used to implement microprocessors. Instruction count per clock cycle. c) Tc is the clock cycle time and F=1/Tc is the clock frequency. Keywords—Moore’s law, execution time, CM0S power dissipation. The Power dissipation of CMOS circuits is the second I. INTRODUCTION equation (2). CMOS power dissipation is decomposed into static and dynamic powers. For dynamic power, Vdd is the power A new era started when MOS technologies were used to supply, F is the clock frequency, ΣCi is the sum of gate and build microprocessors. After pMOS (Intel 4004 in 1971) and interconnection capacitances and α is the average percentage of nMOS (Intel 8080 in 1974), CMOS became quickly the leading switching capacitances: α is the activity factor of the overall technology, used by Intel since 1985 with 80386 CPU. circuit MOS technologies obey an empirical law, stated in 1965 and 2 Pd = Pdstatic + α x ΣCi x Vdd x F (2) known as Moore’s law: the number of transistors integrated on a chip doubles every N months. Fig. 1 presents the evolution for II. CONSEQUENCES OF MOORE LAW DRAM memories, processors (MPU) and three types of read- only memories [1]. The growth rate decreases with years, from A.
    [Show full text]
  • The Birth, Evolution and Future of Microprocessor
    The Birth, Evolution and Future of Microprocessor Swetha Kogatam Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 [email protected] ABSTRACT timed sequence through the bus system to output devices such as The world's first microprocessor, the 4004, was co-developed by CRT Screens, networks, or printers. In some cases, the terms Busicom, a Japanese manufacturer of calculators, and Intel, a U.S. 'CPU' and 'microprocessor' are used interchangeably to denote the manufacturer of semiconductors. The basic architecture of 4004 same device. was developed in August 1969; a concrete plan for the 4004 The different ways in which microprocessors are categorized are: system was finalized in December 1969; and the first microprocessor was successfully developed in March 1971. a) CISC (Complex Instruction Set Computers) Microprocessors, which became the "technology to open up a new b) RISC (Reduced Instruction Set Computers) era," brought two outstanding impacts, "power of intelligence" and "power of computing". First, microprocessors opened up a new a) VLIW(Very Long Instruction Word Computers) "era of programming" through replacing with software, the b) Super scalar processors hardwired logic based on IC's of the former "era of logic". At the same time, microprocessors allowed young engineers access to "power of computing" for the creative development of personal 2. BIRTH OF THE MICROPROCESSOR computers and computer games, which in turn led to growth in the In 1970, Intel introduced the first dynamic RAM, which increased software industry, and paved the way to the development of high- IC memory by a factor of four.
    [Show full text]
  • Nano-Cmos Scaling Problems and Implications
    CHAPTER 1 NANO-CMOS SCALING PROBLEMS AND IMPLICATIONS 1.1 DESIGN METHODOLOGY IN THE NANO-CMOS ERA As process technology scales beyond 100-nm feature sizes, for functional and high-yielding silicon the traditional design approach needs to be modified to cope with the increased process variation, interconnect processing difficulties, and other newly exacerbated physical effects. The scaling of gate oxide (Figure 1.1) in the nano-CMOS regime results in a significant increase in gate direct tunnel- ing current. Subthreshold leakage and gate direct tunneling current (Figure 1.2) are no longer second-order effects [1,15]. The effect of gate-induced drain leak- age (GIDL) will be felt in designs, such as DRAM (Chapter 7) and low-power SRAM (Chapter 9), where the gate voltage is driven negative with respect to the source [15]. If these effects are not taken care of, the result will be a nonfunctional SRAM, DRAM, or any other circuit that uses this technique to reduce subthresh- old leakage. In some cases even wide muxes and flip-flops may be affected. Subthreshold leakage and gate current are not the only issues that we have to deal with at a functional level, but also the power management of chips for high-performance circuits such as microprocessors, digital signal processors, and graphics processing units. Power management is also a challenge in mobile applications. Furthermore, optical lithography will be stretched to the limit even when en- hanced resolution extension technologies (RETs) are employed. These techniques Nano-CMOS Circuit and Physical Design, by Ban P. Wong, Anurag Mittal, Yu Cao, and Greg Starr ISBN 0-471-46610-7 Copyright © 2005 John Wiley & Sons, Inc.
    [Show full text]
  • Performance and Energy Efficient Network-On-Chip Architectures
    Linköping Studies in Science and Technology Dissertation No. 1130 Performance and Energy Efficient Network-on-Chip Architectures Sriram R. Vangal Electronic Devices Department of Electrical Engineering Linköping University, SE-581 83 Linköping, Sweden Linköping 2007 ISBN 978-91-85895-91-5 ISSN 0345-7524 ii Performance and Energy Efficient Network-on-Chip Architectures Sriram R. Vangal ISBN 978-91-85895-91-5 Copyright Sriram. R. Vangal, 2007 Linköping Studies in Science and Technology Dissertation No. 1130 ISSN 0345-7524 Electronic Devices Department of Electrical Engineering Linköping University, SE-581 83 Linköping, Sweden Linköping 2007 Author email: [email protected] Cover Image A chip microphotograph of the industry’s first programmable 80-tile teraFLOPS processor, which is implemented in a 65-nm eight-metal CMOS technology. Printed by LiU-Tryck, Linköping University Linköping, Sweden, 2007 Abstract The scaling of MOS transistors into the nanometer regime opens the possibility for creating large Network-on-Chip (NoC) architectures containing hundreds of integrated processing elements with on-chip communication. NoC architectures, with structured on-chip networks are emerging as a scalable and modular solution to global communications within large systems-on-chip. NoCs mitigate the emerging wire-delay problem and addresses the need for substantial interconnect bandwidth by replacing today’s shared buses with packet-switched router networks. With on-chip communication consuming a significant portion of the chip power and area budgets, there is a compelling need for compact, low power routers. While applications dictate the choice of the compute core, the advent of multimedia applications, such as three-dimensional (3D) graphics and signal processing, places stronger demands for self-contained, low-latency floating-point processors with increased throughput.
    [Show full text]
  • Benchmarking the Intel FPGA SDK for Opencl Memory Interface
    The Memory Controller Wall: Benchmarking the Intel FPGA SDK for OpenCL Memory Interface Hamid Reza Zohouri*†1, Satoshi Matsuoka*‡ *Tokyo Institute of Technology, †Edgecortix Inc. Japan, ‡RIKEN Center for Computational Science (R-CCS) {zohour.h.aa@m, matsu@is}.titech.ac.jp Abstract—Supported by their high power efficiency and efficiency on Intel FPGAs with different configurations recent advancements in High Level Synthesis (HLS), FPGAs are for input/output arrays, vector size, interleaving, kernel quickly finding their way into HPC and cloud systems. Large programming model, on-chip channels, operating amounts of work have been done so far on loop and area frequency, padding, and multiple types of blocking. optimizations for different applications on FPGAs using HLS. However, a comprehensive analysis of the behavior and • We outline one performance bug in Intel’s compiler, and efficiency of the memory controller of FPGAs is missing in multiple deficiencies in the memory controller, leading literature, which becomes even more crucial when the limited to significant loss of memory performance for typical memory bandwidth of modern FPGAs compared to their GPU applications. In some of these cases, we provide work- counterparts is taken into account. In this work, we will analyze arounds to improve the memory performance. the memory interface generated by Intel FPGA SDK for OpenCL with different configurations for input/output arrays, II. METHODOLOGY vector size, interleaving, kernel programming model, on-chip channels, operating frequency, padding, and multiple types of A. Memory Benchmark Suite overlapped blocking. Our results point to multiple shortcomings For our evaluation, we develop an open-source benchmark in the memory controller of Intel FPGAs, especially with respect suite called FPGAMemBench, available at https://github.com/ to memory access alignment, that can hinder the programmer’s zohourih/FPGAMemBench.
    [Show full text]
  • Class-Action Lawsuit
    Case 3:20-cv-00863-SI Document 1 Filed 05/29/20 Page 1 of 279 Steve D. Larson, OSB No. 863540 Email: [email protected] Jennifer S. Wagner, OSB No. 024470 Email: [email protected] STOLL STOLL BERNE LOKTING & SHLACHTER P.C. 209 SW Oak Street, Suite 500 Portland, Oregon 97204 Telephone: (503) 227-1600 Attorneys for Plaintiffs [Additional Counsel Listed on Signature Page.] UNITED STATES DISTRICT COURT DISTRICT OF OREGON PORTLAND DIVISION BLUE PEAK HOSTING, LLC, PAMELA Case No. GREEN, TITI RICAFORT, MARGARITE SIMPSON, and MICHAEL NELSON, on behalf of CLASS ACTION ALLEGATION themselves and all others similarly situated, COMPLAINT Plaintiffs, DEMAND FOR JURY TRIAL v. INTEL CORPORATION, a Delaware corporation, Defendant. CLASS ACTION ALLEGATION COMPLAINT Case 3:20-cv-00863-SI Document 1 Filed 05/29/20 Page 2 of 279 Plaintiffs Blue Peak Hosting, LLC, Pamela Green, Titi Ricafort, Margarite Sampson, and Michael Nelson, individually and on behalf of the members of the Class defined below, allege the following against Defendant Intel Corporation (“Intel” or “the Company”), based upon personal knowledge with respect to themselves and on information and belief derived from, among other things, the investigation of counsel and review of public documents as to all other matters. INTRODUCTION 1. Despite Intel’s intentional concealment of specific design choices that it long knew rendered its central processing units (“CPUs” or “processors”) unsecure, it was only in January 2018 that it was first revealed to the public that Intel’s CPUs have significant security vulnerabilities that gave unauthorized program instructions access to protected data. 2. A CPU is the “brain” in every computer and mobile device and processes all of the essential applications, including the handling of confidential information such as passwords and encryption keys.
    [Show full text]
  • Lecture 5: Scaled MOSFETS for Ics
    Lecture 5: Scaled MOSFETS for ICs MSE 6001, Semiconductor Materials Lectures Fall 2006 5 MOSFETS and scaling Silicon is a mediocre semiconductor, and several other semiconductors have better electrical and optical properties. However, the very high quality of the electrical properties of the silicon-silicon dioxide interface allows very good metal-oxide-semiconductor field-effect transistors (MOSFETs) to be fabricated. These devices have several properties, such as operation frequency and power con- sumption, that improve as their size is scaled down to smaller dimensions, which also allows more transistors to be packed onto a chip. The smaller sizes are achieved using higher-resolution pho- tolithography, which is improved by steadily improving the fabrication technologies. The trends of steadily improving performance and greater integration density with time are generically refered to A 70 Mbit SRAM test vehicle with >as0.5 “Moore’s billion trans Law”.istors Of course, scaling has to end eventually, somewhere before the scale of atoms and incorporating all of the features described in this paper has been fabricated on this technologyis. reached.The aggressive de- sign rules allow for a small 0.57Pm2 6-T SRAM cell that is also compatible with high performance logic processing. A top view of the cell after poly patterni5.1ng is show Basicn in Figur MOSFETe 12. In addition to small size, this cell has a robust static noise margin down to 0.7V VDD to alloMOSFETsw low voltage turnopera- on and off via the non-linear gate capacitor between the gate electrode and the tion (Fig. 13). Figure 14 is a Shmosubstrate.o plot for the The 70 M MOSFETb of Fig.
    [Show full text]