“PUMA” Microarchitecture: PUMA = Programmable Ultra-Efficient Memristor-Based Accelerator


Accelerating Machine Learning and Neural Networks: Software for Hybrid Memristor-based Computing
Dejan Milojicic, Distinguished Technologist, Hewlett Packard Labs
Presentation to IEEE-CNSV, May 14, 2019
Co-authors from Hewlett Packard Labs, Penn State, Purdue, UIUC, USP, GaTech, ETH, ….
Materials from the IEEE ICRC’[16, 17, 18], IEEE ICDCS’18, ACM ASPLOS [16, 19], and more

Potential Approaches vs. Disruption in Computing Stack
[Figure: the computing stack (Algorithm, Language, API, Architecture, ISA, Microarchitecture, FU, logic, device) plotted against four levels of approach (“More Moore”, Hidden changes, Architectural changes, Non von Neumann computing; Levels 1 to 4). Legend: No Disruption to Total Disruption. Source: Tom Conte. Talk topics overlaid on the figure: Computing in Memory Revisited; System Software Stack for Dot Product Engine; Memory-Driven Computing.]

Generalize or Die: ISA and Compiler for Memristor Accelerators
Aayush Ankit, Pedro Bruel, Sai Rahul Chalamalasetti, Izzat El Hajj, Cat Graves, Dejan Milojicic, Geoffrey Ndu, John Paul Strachan
ACM ASPLOS 2019, IEEE ICRC 2018

Outline
• Introduction to Dot Product Engine (DPE) Memristor Accelerator Architecture and laboratory demonstrations
• New Instruction Set Architecture (ISA) to facilitate development
• Microarchitecture
• Compiler
• Performance Benchmarks
• Business Opportunity and Next Steps
• Summary

Intelligent Edge Opportunities for Machine Learning / AI
[Chart: Total Addressable Market, Edge Accelerator ($Billions), 2020 to 2025; values shown: 7.2, 6.0, 4.9, 3.6, 2.2, 1.2]
Machine learning is part of real products, real revenue, real growth
• Over $30 Billion invested annually
  – Reflects software and hardware investments by startups, established players (Google, Apple, Microsoft, Nvidia, etc.) and countries (China)
• HPE Labs has significant R&D history and IP in machine learning and cognitive systems
  – Current and past programs supported by US government
• Task-specific accelerators a growing opportunity to enable new applications at the Edge

• Perform vector-matrix multiplication (VMM) in a single cycle
• Memristors = nanoscale, non-volatile memory cells
• High data density; each memristor can be 1-8 bits
• Allows efficient multiplication in analog domain
• Key advantage is in-memory processing
• Many pathways for future scaling
  – Array size, Memristor levels, lower current, 3D layers, DAC/ADC
[Figure: memristor crossbar with an input voltage vector on the rows and an output current on each column, $I_j = \sum_i G_{ij} V_i$]

DPE, continued
Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R. Stanley Williams, and Vivek Srikumar. 2016. ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars. Proceedings of ACM ISCA 2016
[Figure: ISAAC hierarchy. A chip (node) is a grid of tiles (T) with an external IO interface; a tile holds an eDRAM buffer, OR, S+A, MP and sigmoid units plus several IMAs; an In-Situ Multiply Accumulate (IMA) unit holds IR, OR, S+A and several crossbars (XB), each with DAC, S+H and ADC.]
Legend:
  IR – Input Register
  OR – Output Register
  MP – Max Pool Unit
  S+A – Shift and Add
  σ – Sigmoid Unit
  XB – Memristor Crossbar
  S+H – Sample and Hold
  DAC – Digital to Analog
  ADC – Analog to Digital

Dot Product Engine: lab prototype
• Experimental demonstrations of analog vector-matrix multiplications (VMM) in memristor chip
• Each grayscale pixel is an analog memristor; >7 bits per memristor cell
• Custom control circuitry for in-memory computing
• SDL designing and taping out integrated memristor-control system
[Figure: chip image with array sizes 128x64, 64x64, 32x32, 16x16]

Dot Product Engine – signal processing

Dot Product Engine – single-layer neural network, MNIST
Successful Neural Network inference with memristor-based analog computing

Challenges: Generalize or Die
– Need to broaden application space for the DPE
– Need to reduce software development burden for customers
Solutions
– Developed an Instruction Set Architecture (ISA) supporting state-of-the-art neural network types
– Developed a high performance microarchitecture implementing ISA
– Developed compiler to add programming support and increased functionality
– Benchmarked performance of the resulting end-to-end solution

ISA instructions shown:
• Load, Store, Copy, Set
• MVM
• ALU – add, sub, mult, div, shift, logical, ReLU, etc.
• JMP, BEQ
• Halt

Implementing the ISA: the “PUMA” microarchitecture
PUMA = Programmable Ultra-efficient Memristor-based Accelerator
• Each core contains multiple memristor xbars performing matrix vector multiplications.
• Also supports simpler digital operations (logical, scalar, and vector units).
• 3-stage pipeline, instruction decoder, and instruction memory.
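The crossbar relation on the DPE slide, I_j = Σ_i G_ij V_i, can be sketched in a few lines of plain Python. This is a minimal numerical model under stated assumptions, not the DPE or PUMA implementation: the names `crossbar_vmm` and `quantize_to_levels`, the sample values, and the ideal behaviour (no noise, no DAC/ADC resolution limits) are all chosen here for illustration. The quantization helper only mirrors the slide's remark that a memristor cell can hold 1 to 8 bits.

```python
def quantize_to_levels(weight, bits, w_max=1.0):
    """Snap a weight in [0, w_max] to the nearest of 2**bits evenly spaced levels.

    Assumption: levels are uniformly spaced; real devices may differ.
    """
    if bits < 1:
        raise ValueError("a cell needs at least 1 bit")
    if not 0.0 <= weight <= w_max:
        raise ValueError("weight outside the programmable range")
    intervals = 2 ** bits - 1          # number of gaps between adjacent levels
    step = w_max / intervals
    return round(weight / step) * step


def crossbar_vmm(conductances, voltages):
    """Ideal crossbar: column current I_j = sum_i G[i][j] * V[i].

    conductances: list of rows, one row per input line.
    voltages:     one input voltage per row.
    """
    if len(conductances) != len(voltages):
        raise ValueError("one input voltage is needed per crossbar row")
    n_cols = len(conductances[0])
    return [
        sum(row[j] * v for row, v in zip(conductances, voltages))
        for j in range(n_cols)
    ]


if __name__ == "__main__":
    G = [[1.0, 2.0],
         [3.0, 4.0]]                   # 2 input rows x 2 output columns
    V = [1.0, 0.5]
    print(crossbar_vmm(G, V))          # [2.5, 4.0]
```

Each output column sums the contributions of every row at once, which is why the slides describe the multiplication as happening in place, inside the memory array.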
[Figure: Core, Tile]
Aayush Ankit, Izzat El Hajj, Sai Rahul Chalamalasetti, Geoffrey Ndu, Martin Foltin, R Stanley Williams, Paolo Faraboschi, Wen-mei W Hwu, John Paul Strachan, Kaushik Roy, Dejan S Milojicic. Proceedings of the Twenty-Fourth ASPLOS, 2019.

Implementing the ISA: the “PUMA” microarchitecture
• Tile: within each “tile” are multiple “cores” with a shared memory for holding data to/from other tiles and to/from cores.
• Node: each “node” consists of multiple tiles, with an on-chip network, implementing the NN layers.
• Cores perform all scalar, vector, and matrix operations.

Minimal area and power overhead

Architecture   Applications Supported               Tile-area (mm²)         Tile-power (mW)
ISAAC          CNN                                  0.372                   330
PUMA           CNN, RNN, LSTM, BM, RBM, MLP, etc.   0.491 (1.32x larger)    377.9 (1.14x larger)

ISAAC = accelerator for CNN. PUMA = accelerator for CNN, RNN, BM, RBM, MLP, etc.
The new ISA and microarchitecture in PUMA provide much broader application support, with a small area and power cost.

The PUMA compiler
– As applications grow in size and inter-core/tile dependencies become more complex, a compiler becomes mandatory to increase programmers’ productivity
– We designed and implemented a complete compiler that translates high-level programs to PUMA assembly
– Compiler supports the following aspects
  – Programming Interface
  – Graph Partitioning
  – Data Communication
  – Tile Placement
  – MVM Coalescing
  – Linearization
  – Memory Allocation

Memristor-based DPE programming
[Figure: compilation stages shown: Map Dot-products to Crossbars; Map Non-dot-products to Tiles; Virtual to Physical Tile Assignment; Build Task Graph; Code Generation; Host]

Graph partitioning and data communication

Instruction usage (importance of different execution units in PUMA core)
[Figure: stacked bars, 0% to 100%, showing the instruction mix per workload. Categories: Inter-Tile Data Transfer, Inter-Core Data Transfer, Control Flow, Scalar Functional Unit, Vector Functional Unit, MVM Unit (crossbar). Workloads: RBM (V500-H500), BM (V500-H500), RNN (26-93-61), LSTM (26-120-61), MLP (64-150-150-14), CNN (Lenet5).]
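The “Map Dot-products to Crossbars” stage can be pictured as cutting each weight matrix into crossbar-sized blocks. The sketch below only counts those blocks. It assumes square 128x128 crossbars and uses hypothetical helpers (`crossbars_needed`, `map_mlp`); it ignores biases, bit slicing, and placement, and it is not the PUMA compiler's actual partitioning algorithm. The layer sizes are those of the MLP (64-150-150-14) workload named on the instruction-usage slide.

```python
import math


def crossbars_needed(rows, cols, xbar_size=128):
    """Count xbar_size x xbar_size blocks needed to hold a rows x cols matrix.

    Assumption: one weight per crossbar cell, square crossbars.
    """
    if rows <= 0 or cols <= 0 or xbar_size <= 0:
        raise ValueError("dimensions must be positive")
    return math.ceil(rows / xbar_size) * math.ceil(cols / xbar_size)


def map_mlp(layer_sizes, xbar_size=128):
    """Per-layer crossbar counts for a fully connected network.

    layer_sizes lists neuron counts, so consecutive pairs give matrix shapes.
    """
    return [
        crossbars_needed(n_in, n_out, xbar_size)
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    ]


if __name__ == "__main__":
    counts = map_mlp([64, 150, 150, 14])
    print(counts)        # [2, 4, 2]
    print(sum(counts))   # 8
```

Even this toy count shows why a compiler helps: a 150-wide layer already spills across several crossbars, and the partial results must then be combined and moved between cores and tiles.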
All ISA components play an important role across various neural networks.

Software stack
• DPE-SW components
  • ONNX Interpreter: converts ONNX model to input to our compiler/assembler and eventually hardware
  • Applications optimization: improve performance of deployed models (quantization, weights, etc.)
  • Compiler and assembler: converts models into executables for our own hardware
  • Driver: operating system support for the hardware running DPE, e.g. for multiple PCIe boards
  • Emulator: used for hardware verification (Brazil) and as a development platform for software, QEMU
• Summary
  • Only the required DPE software components are being developed for the user and developer community; leverage the rest!
  • Focus on inference (hardware, performance, robustness)
  • Extensive collaborations to support use cases and demos, but also driving future support, e.g. training
[Figure: software stack, top to bottom. Applications; neural net frameworks (Tensorflow, CNTK, Caffe2, DPE Tool, …); converters (onnx-tf, onnx-cntk, onnx-caffe, onnx-dpe, …); onnx models; DPE framework (onnx interpreter, app optimiz., inference compiler & assembler, driver, emulator); DPE hardware (Memristor-based DPE, FPGA, ASIC-1, ASIC-2, …).]

Inference latency normalized to PUMA (lower is better)
[Chart: Normalized Latency, log scale from 0.10 to 100,000, by Types of Neural Networks: MLP (L4, L5), NMT (L3, L5), WLM (BigLSTM, LSTM-2048), CNN (Vgg16, Vgg19). Series: Intel processors (Haswell, Skylake), NVIDIA accelerators (Kepler, Maxwell, Pascal), our solution (PUMA).]
Our solution is better than Intel processors (10x to 10^4x) and NVIDIA accelerators (10x to 10^2x) except for one NN.

Inference energy normalized to PUMA (lower is better)
[Chart: Normalized Energy, log scale from 0.10 to 10,000,000, by Types of Neural Networks: MLP (L4, L5), NMT (L3, L5), WLM (BigLSTM, LSTM-2048), CNN (Vgg16, Vgg19). Series: Intel processors (Haswell, Skylake), NVIDIA accelerators (Kepler, Maxwell, Pascal), our solution (PUMA).]
Our solution is substantially better than both Intel processors (10^3x to 10^6x) and NVIDIA accelerators (10x to 10^3x).

Batch throughput normalized to Haswell (higher is better)
[Chart: Normalized Throughput, log scale from 0.01 to 1,000,000, for batch sizes B16, B32, B64, B128. Workloads shown: L4, L5, L3, L5, BigLSTM, LSTM-2048. Series: Intel processors (Haswell, Skylake), NVIDIA accelerators (Kepler, Maxwell, Pascal), our solution (PUMA).]
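The charts report values “normalized to PUMA” or “normalized to Haswell”, meaning every platform's measurement is divided by the baseline platform's measurement for the same workload. A minimal sketch of that arithmetic follows. The function name `normalize_to_baseline`, the platform labels, and the numbers are placeholders invented for illustration; they are not results from the talk.

```python
def normalize_to_baseline(measurements, baseline):
    """Divide each platform's value by the baseline platform's value.

    measurements: dict mapping platform name -> raw value (e.g. latency).
    baseline:     key of the platform that should come out as 1.0.
    """
    base = measurements[baseline]
    if base <= 0:
        raise ValueError("baseline measurement must be positive")
    return {name: value / base for name, value in measurements.items()}


if __name__ == "__main__":
    # Placeholder latencies in milliseconds, NOT measured data.
    latency_ms = {
        "cpu_example": 50.0,
        "gpu_example": 5.0,
        "baseline_example": 0.5,
    }
    print(normalize_to_baseline(latency_ms, "baseline_example"))
```

With latency or energy normalized to the accelerator, a value above 1.0 means the other platform is slower or uses more energy; with throughput normalized to a CPU, a value above 1.0 means higher throughput than that CPU.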