T04: Fundamental Memory Microarchitecture

ECE 4750 Computer Architecture, Fall 2020
School of Electrical and Computer Engineering, Cornell University
revision: 2020-10-28-14-41

Contents

1 Memory Microarchitectural Design Patterns
  1.1. Transactions and Steps
  1.2. Microarchitecture Overview
2 FSM Cache
  2.1. High-Level Idea for FSM Cache
  2.2. FSM Cache Datapath
  2.3. FSM Cache Control Unit
  2.4. Analyzing Performance
3 Pipelined Cache
  3.1. High-Level Idea for Pipelined Cache
  3.2. Pipelined Cache Datapath and Control Unit
  3.3. Analyzing Performance
  3.4. Pipelined Cache with TLB
4 Cache Microarchitecture Optimizations
  4.1. Reduce Hit Time
  4.2. Reduce Miss Rate
  4.3. Reduce Miss Penalty
  4.4. Cache Optimization Summary
5 Case Study: ARM Cortex A8 and Intel Core i7
  5.1. ARM Cortex A8
  5.2. Intel Core i7

Copyright © 2018 Christopher Batten, Christina Delimitrou. All rights reserved. This handout was originally prepared by Prof. Christopher Batten at Cornell University for ECE 4750 / CS 4420. It has since been updated by Prof. Christina Delimitrou in 2017-2020. Download and use of this handout is permitted for individual educational non-commercial purposes only. Redistribution either in part or in whole via either commercial or non-commercial means requires written permission.

1. Memory Microarchitectural Design Patterns

  Time/Sequence = (Mem Accesses/Sequence) × (Avg Cycles/Mem Access) × (Time/Cycle)

  Avg Cycles/Mem Access = (Avg Cycles/Hit) + (Num Misses/Num Accesses) × (Avg Extra Cycles/Miss)

  Microarchitecture       Hit Latency   Extra Accesses for Translation
  FSM Cache               >1            1+
  Pipelined Cache         ≈1            1+
  Pipelined Cache + TLB   ≈1            ≈0

1.1. Transactions and Steps

• We can think of each memory access as a transaction
• Executing a memory access involves a sequence of steps
  – Check Tag: Check one or more tags in cache
  – Select Victim: Select victim line from cache using replacement policy
  – Evict Victim: Evict victim line from cache and write victim to memory
  – Refill: Refill requested line by reading line from memory
  – Write Mem: Write requested word to memory
  – Access Data: Read or write requested word in cache

Steps for Write-Through with No Write Allocate

Steps for Write-Back with Write Allocate

1.2. Microarchitecture Overview

[Figure: cache microarchitecture overview. The memory management unit exchanges 32b cachereq/cacheresp messages with the cache. The cache contains a control unit, which exchanges control and status signals with a datapath holding the tag array and data array. The cache exchanges 128b memreq/memresp messages with a combinational main memory with >1 cycle latency.]

2. FSM Cache

(The design-pattern equations, microarchitecture comparison table, and overview block diagram from Section 1 are repeated here as a recap.)

Assumptions
• Page-based translation, no TLB, physically addressed cache
• Single-ported combinational SRAMs for tag and data storage
• Unrealistic combinational main memory
• Cache requests are 4 B

Configuration
• Four 16 B cache lines
• Two-way set-associative
• Replacement policy: LRU
• Write policy: write-through, no write allocate
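The datapath figures in the next section split the cache request address into a 27-bit tag, 1-bit index, and 2-bit word offset. The following minimal Python sketch (my addition, not part of the handout; the function name split_addr is purely illustrative) shows how that breakdown follows from the configuration above.

```python
# A minimal sketch (not from the handout) of the address breakdown implied by
# the example configuration above: four 16 B lines organized as two 2-way sets,
# so a 32-bit address splits into 27b tag | 1b index | 2b word offset | 2b "00".

def split_addr(addr: int):
    """Return (tag, idx, word_off) for a 32-bit byte address."""
    word_off = (addr >> 2) & 0x3   # which 4 B word within the 16 B line
    idx      = (addr >> 4) & 0x1   # which of the 2 sets
    tag      = addr >> 5           # remaining 27 bits
    return tag, idx, word_off

# Example: address 0x1004 maps to set 0, word 1, tag 0x80.
print(split_addr(0x1004))          # -> (128, 0, 1)
```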
2.1. High-Level Idea for FSM Cache

[Figure: FSM cache transaction flows. A read hit performs Check Tag then Access Data. A read miss performs Check Tag, Select Victim, Refill, then Access Data. A write performs Check Tag then Write Mem, with Access Data also performed on a write hit. A second diagram shows a sequence of read and write transactions (hits and misses) occupying the FSM cache one at a time over several cycles.]

2.2. FSM Cache Datapath

As with processors, we design our cache datapath by incrementally adding support for each transaction and resolving conflicts with muxes.

Implementing READ transactions that hit

[Figure: datapath for read hits. The cache request address is split into a 27b tag, 1b index, and 2b word offset (plus a 2b "00" byte offset). The index reads both ways of the tag array (Way 0 and Way 1); comparators produce tarray0_match and tarray1_match. The index also reads a 128b line from the data array, and the word offset selects one 32b word ([127:96], [95:64], [63:32], or [31:0]) for the cache response.]
  • MT: Check tag
  • MRD: Read data array, return cacheresp

Implementing READ transactions that miss

[Figure: datapath extended for read misses. On a miss, the victim signal selects which way's tag to write, the refill memreq address is formed from the request tag and index, and the 128b memresp data is written into the data array with all word enables set (1111). Tag and data array write enables (tarray0_wen, tarray1_wen, darray_wen) and the z4b block are added.]
  • MT: Check tag
  • MRD: Read data array, return cacheresp
  • R0: Send refill memreq, get memresp
  • R1: Write data array with refill cache line

Implementing WRITE transactions that miss

[Figure: datapath extended for write misses. The 32b cache request data is zero-extended (zext) to 128b for the write memreq, and a z4b_sel mux chooses between the refill address and the write address for memreq. Since there is no write allocate, a write miss only sends the write memreq.]
  • MWD: Send write memreq, write data array

Implementing WRITE transactions that hit

[Figure: datapath extended for write hits. A repl unit replicates the 32b write data across the 128b line, a word_en_sel mux chooses between 1111 (refill) and the one-hot word enable for the written word, and a darray_write_sel mux chooses between refill data and write data. On a write hit the word is written into the data array in addition to being written through to memory.]

Summary of steps:
  • MT: Check tag
  • MRD: Read data array, return cacheresp
  • R0: Send refill memreq, get memresp
  • R1: Write data array with refill cache line
  • MWD: Send write memreq, write data array

2.3. FSM Cache Control Unit

[Figure: control FSM. From MT, a read hit goes to MRD; a read miss goes to R0, then R1, then MRD; a write goes to MWD.]

We will need to keep valid bits in the control unit, with one valid bit for every cache line. We will also need to keep use bits which are updated on every access to indicate which was the last "used" line. Assume we create the following two control signals generated by the FSM control unit.

  hit    = ( tarray0_match && valid0[idx] ) || ( tarray1_match && valid1[idx] )
  victim = !use[idx]

Control signal table (to be filled in for each state):

  State | tarray0 en/wen | tarray1 en/wen | darray en/wen | darray write sel | worden sel | z4b sel | memreq val/op | cacheresp val
  MT    |                |                |               |                  |            |         |               |
  MRD   |                |                |               |                  |            |         |               |
  R0    |                |                |               |                  |            |         |               |
  R1    |                |                |               |                  |            |         |               |
  MWD   |                |                |               |                  |            |         |               |
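To tie the hit/victim signal definitions and the FSM states together, here is a small behavioral sketch in Python (my own illustration under the stated assumptions, not the handout's RTL): it encodes the two control signals above and the state transitions implied by the control FSM.

```python
# A minimal sketch (not the handout's design) of the control signals and state
# transitions for the write-through, no-write-allocate FSM cache. The state
# names MT, MRD, R0, R1, and MWD follow the handout.

NUM_SETS = 2                       # four 16 B lines, two-way set-associative

valid = [[False] * NUM_SETS for _ in range(2)]   # valid bit per way, per set
use   = [0] * NUM_SETS                           # last-used way per set

def control_signals(idx, tarray0_match, tarray1_match):
    # Exactly the two signals defined in Section 2.3.
    hit    = (tarray0_match and valid[0][idx]) or (tarray1_match and valid[1][idx])
    victim = 1 - use[idx]                        # victim = !use[idx]
    return hit, victim

def next_state(state, op, hit):
    # MT -> MRD on a read hit; MT -> R0 -> R1 -> MRD on a read miss;
    # MT -> MWD on a write (write-through, no write allocate).
    if state == "MT":
        if op == "write":
            return "MWD"
        return "MRD" if hit else "R0"
    if state == "R0":
        return "R1"
    if state == "R1":
        return "MRD"
    return "MT"                                  # MRD / MWD: transaction done
```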
2.4. Analyzing Performance

  Time/Sequence = (Mem Accesses/Sequence) × (Avg Cycles/Mem Access) × (Time/Cycle)

  Avg Cycles/Mem Access = (Avg Cycles/Hit) + (Num Misses/Num Accesses) × (Avg Extra Cycles/Miss)

Estimating cycle time

[Figure: the complete FSM cache datapath from Section 2.2, annotated with the component delays below.]
  • register read/write = 1t
  • tag array read/write = 10t
  • data array read/write = 10t
  • mem read/write = 20t
  • decoder = 3t
  • comparator = 10t
  • mux = 3t
  • repl unit = 0t
  • z4b = 0t

Estimating AMAL

Consider the following sequence of memory accesses, which might correspond to copying 4 B elements from a source array to a destination array. Each array contains 64 elements. What is the AMAL?

  rd 0x1000
  wr 0x2000
  rd 0x1004
  wr 0x2004
  rd 0x1008
  wr 0x2008
  ...
  rd 0x1040
  wr 0x2040

Consider the following sequence of memory accesses, which might correspond to incrementing 4 B elements in an array. The array contains 64 elements. What is the AMAL?

  rd 0x1000
  wr 0x1000
  rd 0x1004
  wr 0x1004
  rd 0x1008
  wr 0x1008
  ...
  rd 0x1040
  wr 0x1040

3. Pipelined Cache

(The design-pattern equations, microarchitecture comparison table, and overview block diagram from Section 1 are repeated here as a recap.)

Assumptions
• Page-based translation, no TLB, physically addressed cache
• Single-ported combinational SRAMs for tag and data storage
• Unrealistic combinational main memory
• Cache requests are 4 B

Configuration
• Four 16 B cache lines
• Direct-mapped
• Replacement policy: LRU
• Write policy: write-through, no write allocate
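Before working the AMAL exercises, it can help to see the recapped equation applied mechanically. The short Python sketch below is my own illustration; the numbers are hypothetical and are not the answers to the exercises in Section 2.4.

```python
# Illustrative sketch of the average memory access latency (AMAL) equation:
#   AMAL = avg_cycles_per_hit + (num_misses / num_accesses) * avg_extra_cycles_per_miss

def amal(avg_cycles_per_hit, num_misses, num_accesses, avg_extra_cycles_per_miss):
    return avg_cycles_per_hit + (num_misses / num_accesses) * avg_extra_cycles_per_miss

# Hypothetical example: single-cycle hits, 1 in 4 accesses misses, and each
# miss adds 2 extra cycles for the refill.
print(amal(1, 32, 128, 2))   # -> 1.5
```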