CS 211: Computer Architecture – Cache Memory Design

Course Objectives: Where are we?

CS 211: Part 2!
• Discussions thus far
  - Processor architectures to increase the processing speed
  - Focused entirely on how instructions can be executed faster
  - Have not addressed the other components that go into putting it all together
  - Other components: Memory, I/O, Compiler
• Next:
  - Memory design: how is memory organized to facilitate fast access
    - Focus on cache design
  - Compiler optimization: how does the compiler generate 'optimized' machine code
    - With a look at techniques for ILP
    - We have seen some techniques already, and will cover some more in memory design before getting to the formal architecture of compilers
  - Quick look at I/O and Virtual Memory

CS 211: Part 3
• Multiprocessing concepts
  - Multi-core processors
  - Parallel processing
  - Cluster computing
• Embedded Systems – another dimension
  - Challenges and what's different
• Reconfigurable architectures
  - What are they? And why?

Memory
• In our discussions (on the MIPS pipeline, superscalar, EPIC) we've constantly been assuming that we can access our operand from memory in 1 clock cycle…
  - This is possible, but it's complicated
  - We'll now discuss how this happens
• We'll talk about…
  - Memory technology
  - Memory hierarchy
  - Caches
  - Main memory
  - Virtual memory

Memory Technology
• Memory comes in many flavors
  - SRAM (Static Random Access Memory)
    - Like a register file; once data is written to SRAM its contents stay valid – no need to refresh it
  - DRAM (Dynamic Random Access Memory)
    - Like leaky capacitors – data is stored by charging memory cells to their maximum value; the charge slowly leaks and would eventually be too low to be valid, so refresh circuitry rewrites the data and recharges the cells
  - Static RAM is faster but more expensive
    - Cache uses static RAM
  - ROM, EPROM, EEPROM, Flash, etc.
    - Read-only memories – store the OS
  - Disks, tapes, etc.
• Differences in speed, price and "size"
  - Fast is small and/or expensive
  - Large is slow and/or cheap

Why Not Only DRAM?
• Can lose data when there is no power
  - Volatile storage
• Not large enough for some things
  - Backed up by storage (disk)
  - Virtual memory, paging, etc.
• Not fast enough for processor accesses
  - Takes hundreds of cycles to return data
  - OK in very regular applications
    - Can use SW pipelining, vectors
  - Not OK in most other applications

Is there a problem with DRAM?
[Figure: Processor-DRAM memory gap (latency), 1980-2000. CPU performance grows ~60%/yr ("Moore's Law", 2X/1.5yr) while DRAM performance grows ~9%/yr (2X/10yrs), so the processor-memory performance gap grows ~50% per year.]

The principle of locality…
• …says that most programs don't access all code or data uniformly
  - e.g. in a loop, a small subset of instructions might be executed over and over again…
  - …and a block of memory addresses might be accessed sequentially…
• This has led to "memory hierarchies"
• Some important things to note:
  - Fast memory is expensive
  - Levels of memory are usually smaller/faster than the previous level
  - Levels of memory usually "subset" one another
    - All the stuff in a higher level is in some level below it
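The loop behavior in the locality slide above is easy to see in code. Below is a minimal sketch in C (not from the original slides): the loop instructions and the running sum are reused on every iteration (temporal locality), while the array elements are touched at consecutive addresses (spatial locality).

```c
#include <stdio.h>

/* Illustrative sketch of the principle of locality (not from the slides).
 * Temporal locality: the loop body and `sum` are reused on every iteration.
 * Spatial locality: a[0], a[1], a[2], ... live at consecutive addresses,
 * so one cache block fetched on a miss serves several later accesses. */
int main(void) {
    int a[1024];
    for (int i = 0; i < 1024; i++)
        a[i] = i;

    long sum = 0;
    for (int i = 0; i < 1024; i++)
        sum += a[i];   /* sequential accesses: spatial locality */

    printf("sum = %ld\n", sum);
    return 0;
}
```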
Levels in a typical memory hierarchy
[Figure-only slide: registers, cache(s), main memory, disk and tape, ordered from smallest/fastest to largest/slowest.]

Memory Hierarchies – "always reuse a good idea"
• Key principles
  - Locality – most programs do not access code or data uniformly
  - Smaller hardware is faster
• Goal
  - Design a memory hierarchy "with cost almost as low as the cheapest level of the hierarchy and speed almost as fast as the fastest level"
  - This implies that we be clever about keeping more likely used data as "close" to the CPU as possible
• Levels provide subsets
  - Anything (data) found in a particular level is also found in the next level below
  - Each level maps from a slower, larger memory to a smaller but faster memory

The Full Memory Hierarchy
(Upper levels are faster and smaller; lower levels are larger and slower.)
• Registers: 100s of bytes, <10s ns; instruction operands (1-8 bytes) staged by the program/compiler
• Cache (our current focus): K bytes, 10-100 ns, 1-0.1 cents/bit; blocks (8-128 bytes) staged by the cache controller
• Main memory: M bytes, 200-500 ns, 0.0001-0.00001 cents/bit; pages (4K-16K bytes) staged by the OS
• Disk: G bytes, 10 ms (10,000,000 ns), 10^-5 to 10^-6 cents/bit; files (Mbytes) staged by the user/operator
• Tape: "infinite" capacity, sec-min access time, 10^-8 cents/bit

Propagation delay bounds where memory can be placed
[Figure-heavy slide; recoverable notes:]
• Double Data Rate (DDR) SDRAM
• XDR planned at 3 to 6 GHz

Cache: Terminology
• Cache is the name given to the first level of the memory hierarchy encountered once an address leaves the CPU
  - Takes advantage of the principle of locality
• The term cache is also now applied whenever buffering is employed to reuse items
• Cache controller
  - The HW that controls access to the cache or generates requests to memory
• No cache in 1980 PCs to 2-level caches by 1995..!

What is a cache?
• Small, fast storage used to improve average access time to slow memory
• Exploits spatial and temporal locality
• In computer architecture, almost everything is a cache!
  - Registers: "a cache" on variables – software managed
  - First-level cache: a cache on the second-level cache
  - Second-level cache: a cache on memory
  - Memory: a cache on disk (virtual memory)
  - TLB: a cache on the page table
  - Branch prediction: a cache on prediction information?
[Figure: Proc/Regs, L1-Cache, L2-Cache, Memory, Disk/Tape, etc.; capacity gets bigger going down the hierarchy, speed gets faster going up.]

Caches: multilevel
• Simple case: CPU, cache, main memory
• Multilevel case: CPU, L1 cache (16~32KB, 1~2 pclk latency), L2 cache (~256KB, ~10 pclk latency), L3 cache (~4MB, ~50 pclk latency), main memory

A brief description of a cache
• Cache = next level of the memory hierarchy up from the register file
  - All values in the register file should be in the cache
• Cache entries are usually referred to as "blocks"
  - A block is the minimum amount of information that can be in the cache
  - A fixed-size collection of data, retrieved from memory and placed into the cache
• The processor generates a request for data/instructions and first looks up the cache
• If we're looking for an item in the cache and find it, we have a cache hit; if not, a cache miss

Terminology Summary
• Hit: data appears in a block in the upper level (i.e. block X in cache)
  - Hit Rate: fraction of memory accesses found in the upper level
  - Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
• Miss: data needs to be retrieved from a block in the lower level (i.e. block Y in memory)
  - Miss Rate = 1 - (Hit Rate)
  - Miss Penalty: extra time to replace a block in the upper level + time to deliver the block to the processor
• Hit Time << Miss Penalty (500 instructions on the Alpha 21264)
[Figure: the processor exchanges block X with the upper-level memory; on a miss, block Y is brought in from the lower-level memory.]

Definitions
• Locating a block requires two attributes:
  - Size of block
  - Organization of blocks within the cache
• Block size (also referred to as line size)
  - Granularity at which the cache operates
  - Each block is a contiguous series of bytes in memory and begins on a naturally aligned boundary
  - E.g. a cache with 16-byte blocks
    - each block contains 16 bytes
    - the first byte is aligned to a 16-byte boundary in the address space, so the low-order 4 bits of the address of the first byte would be 0000
  - The smallest usable block size is the natural word size of the processor
    - else an access could be split across blocks, which slows down translation

Cache Basics
• Cache consists of block-sized lines
  - Line size is typically a power of two
  - Typically 16 to 128 bytes in size
• Example
  - Suppose the block size is 128 bytes
  - The lowest seven bits of the address determine the offset within the block
  - Read data at address A = 0x7fffa3f4
  - The address maps to the block with base address 0x7fffa380
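The 128-byte example above can be checked with a couple of bit operations. A minimal sketch follows; the constants match the slide's example, while the variable names are my own.

```c
#include <stdio.h>

#define BLOCK_SIZE  128u                   /* bytes, as in the slide's example  */
#define OFFSET_MASK (BLOCK_SIZE - 1u)      /* low 7 bits: offset within a block */

int main(void) {
    unsigned addr   = 0x7fffa3f4u;         /* address A from the slide          */
    unsigned offset = addr & OFFSET_MASK;  /* byte offset inside the block      */
    unsigned base   = addr & ~OFFSET_MASK; /* naturally aligned block base      */

    printf("address 0x%08x -> block base 0x%08x, offset %u\n",
           addr, base, offset);            /* base 0x7fffa380, offset 116       */
    return 0;
}
```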
Memory Hierarchy
• Placing the fastest memory near the CPU can result in increases in performance
• Consider the number of cycles the CPU is stalled waiting for a memory access – memory stall cycles
  - CPU execution time = (CPU clock cycles + memory stall cycles) * clock cycle time
  - Memory stall cycles = number of misses * miss penalty
    = IC * (memory accesses/instruction) * miss rate * miss penalty
  - (A small worked example follows after the next slide.)

Unified or Separate I-Cache and D-Cache
• Two types of accesses:
  - Instruction fetch
  - Data fetch (load/store instructions)
• Unified cache
  - One large cache for both instructions and data
  - Pros: simpler to manage, less hardware complexity
  - Cons: how to divide the cache between data and instructions? Confuses the standard Harvard architecture model; optimizations are difficult
• Separate instruction and data caches
  - Instruction fetches go to the I-cache
  - Data accesses from loads/stores go to the D-cache
  - Pros: easier to optimize each
  - Cons: more expensive; how to decide on the size of each?
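Here is the worked example promised above: a small calculator for the two formulas on the Memory Hierarchy slide. The instruction count, miss rate, and miss penalty below are made-up illustrative numbers, not values from the slides.

```c
#include <stdio.h>

int main(void) {
    /* Illustrative inputs -- not from the slides */
    double ic            = 1e9;    /* instruction count (IC)           */
    double cpi_base      = 1.0;    /* CPU clock cycles per instruction */
    double accesses      = 1.3;    /* memory accesses per instruction  */
    double miss_rate     = 0.02;   /* 2% of accesses miss              */
    double miss_penalty  = 100.0;  /* cycles per miss                  */
    double clk           = 1e-9;   /* clock cycle time: 1 ns           */

    /* Memory stall cycles = IC * (accesses/instruction) * miss rate * miss penalty */
    double stall_cycles = ic * accesses * miss_rate * miss_penalty;

    /* CPU execution time = (CPU clock cycles + memory stall cycles) * clock cycle time */
    double cpu_cycles = ic * cpi_base;
    double exec_time  = (cpu_cycles + stall_cycles) * clk;

    printf("stall cycles = %.3g, execution time = %.3f s\n",
           stall_cycles, exec_time);
    return 0;
}
```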
Cache Design – Questions
• Q1: Where can a block be placed in the upper level?
  - block placement
• Q2: How is a block found if it is in the upper level?
  - block identification
• Q3: Which block should be replaced on a miss?
  - block replacement
• Q4: What happens on a write?
  - write strategy

Where can a block be placed in a cache?
• 3 schemes for block placement in a cache:
  - Direct mapped cache:
    - A block (or data to be stored) can go to only 1 place in the cache
    - Usually: (block address) MOD (# of blocks in the cache)
  - Fully associative cache:
    - A block can be placed anywhere in the cache
  - Set associative cache:
    - "Set" = a group of blocks in the cache
    - A block is mapped onto a set and can then be placed anywhere within that set
    - Usually: (block address) MOD (# of sets in the cache)
    - If there are n blocks in a set, we call it n-way set associative
[Figure: placing memory block 12 (memory blocks numbered 1, 2, 3, …) in an 8-block cache. Fully associative: block 12 can go anywhere. Direct mapped: block 12 can go only into block 4 (12 mod 8). Set associative with sets Set 0-Set 3: block 12 can go anywhere in set 0 (12 mod 4). The arithmetic is checked in the code sketch after the next slide.]

Associativity
• If you have associativity > 1 you have to have a replacement policy
  - FIFO
  - LRU
  - Random
• "Full" or "full-map" associativity means you check every tag in parallel and a memory block can go into any cache block
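A minimal sketch (the helper names are mine) that reproduces the block-12 placement arithmetic from the figure above:

```c
#include <stdio.h>

/* Illustrative sketch of the placement rules above (helper names are mine). */

/* Direct mapped: (block address) MOD (number of blocks in the cache) */
static unsigned direct_mapped_slot(unsigned block_addr, unsigned num_blocks) {
    return block_addr % num_blocks;
}

/* Set associative: (block address) MOD (number of sets in the cache) */
static unsigned set_index(unsigned block_addr, unsigned num_sets) {
    return block_addr % num_sets;
}

int main(void) {
    unsigned block = 12;

    /* 8-block cache, direct mapped: block 12 -> slot 4 (12 mod 8) */
    printf("direct mapped: block %u -> slot %u\n",
           block, direct_mapped_slot(block, 8));

    /* Same 8 blocks organized as 2-way set associative: 4 sets, block 12 -> set 0 (12 mod 4) */
    printf("2-way set associative: block %u -> set %u\n",
           block, set_index(block, 4));

    /* Fully associative: block 12 may go into any of the 8 cache blocks */
    return 0;
}
```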