Register Pointer Architecture for Efficient Embedded Processors

Total Pages: 16

File Type: PDF, Size: 1020 KB

Register Pointer Architecture for Efficient Embedded Processors

JongSoo Park, Sung-Boem Park, James D. Balfour, David Black-Schaffer, Christos Kozyrakis and William J. Dally
Computer Systems Laboratory, Stanford University
{jongsoo, sbpark84, jbalfour, davidbbs, kozyraki, dally}@stanford.edu

Abstract

Conventional register file architectures cannot optimally exploit temporal locality in data references due to their limited capacity and the static encoding of register addresses in instructions. In conventional embedded architectures, the register file capacity cannot be increased without resorting to longer instruction words. Similarly, loop unrolling is often required to exploit locality in register file accesses across iterations because naming registers statically is inflexible. Both optimizations lead to significant code size increases, which is undesirable in embedded systems.

In this paper, we introduce the Register Pointer Architecture (RPA), which allows registers to be accessed indirectly through register pointers. Indirection allows a larger register file to be used without increasing the length of instruction words. The additional register file capacity allows many loads and stores, such as those introduced by spill code, to be eliminated, which improves performance and reduces energy consumption. Moreover, indirection affords additional flexibility in naming registers, which reduces the need to apply loop unrolling in order to maximize reuse of register-allocated variables.

1 Introduction

Embedded system designers must optimize three efficiency metrics: performance, energy consumption, and static code size. The processor register file helps improve the first two metrics. By storing frequently accessed data close to the functional units, the register file reduces the time and energy required to access data from caches or main memory. However, conventional register file architectures cannot fully exploit temporal locality because of their limited capacity and lack of support for indirection.

The Register Pointer Architecture (RPA) supports large register files to reduce the time and energy expended accessing data caches, without increasing code size. The main idea of the RPA is to allow instructions to access registers indirectly through register pointers. This provides two key benefits:

• Large Register File Capacity: Indirect register access relaxes the correlation between register file capacity and instruction word length. Hence, large register files can be used without sacrificing code density.
• Register Naming Flexibility: By dynamically modifying register pointers, a small set of instructions can flexibly access data allocated in the register file in a way that maximizes data reuse.

We introduce extensions to the ARM instruction set to implement the RPA. In addition to the conventional register file, the modified architecture includes a dereferencible register file (DRF). Existing arithmetic and load/store instructions can use a register pointer (RP) in any register operand position to access the contents of the DRF. We define efficient update policies for RPs to support common access patterns with minimal runtime overhead. At the microarchitecture level, we describe the interlocks and forwarding paths needed to minimize read-after-write hazards on RPs. We execute a set of embedded applications on a model of the modified processor to demonstrate that the RPA leads to a speedup of up to 2.8× and energy savings of up to 68%.

We compare the RPA to alternative techniques that provide register indexing flexibility or software-controlled storage near the processor. Loop unrolling can be used to follow arbitrary data access patterns within the register file; we show that the RPA provides similar flexibility in register file accesses without the code size increases introduced by longer register names and replicated loop bodies. A software-controlled scratchpad memory could also be used to capture temporal locality in embedded applications. Nevertheless, a scratchpad memory suffers from requiring explicit load and store instructions to make data available to arithmetic instructions.
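To make the register-naming contrast concrete, the following is a minimal C sketch of the idea (ours, not the paper's code): the DRF is modeled as a plain array and a register pointer as an index into it. With static register names, reusing register-allocated values forces the code to name each register explicitly, which is what unrolling produces; with an auto-incremented pointer, one loop body suffices. All identifiers here (drf, sum_static, sum_indirect) are illustrative assumptions.

    #include <stdio.h>

    /* Illustrative functional model, not the paper's implementation:
     * the DRF is an array of registers and an RP is an index into it. */
    enum { DRF_SIZE = 8 };
    static int drf[DRF_SIZE];

    /* With static names, accumulating over 4 register-allocated taps
     * requires naming each register explicitly (as unrolling would): */
    static int sum_static(int r0, int r1, int r2, int r3) {
        return r0 + r1 + r2 + r3;
    }

    /* With an RP, one loop body walks the same values indirectly;
     * the pointer is post-incremented, as "r[p0]!" does in the RPA. */
    static int sum_indirect(int p0, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += drf[p0++];   /* dereference DRF entry, then bump RP */
        return sum;
    }

    int main(void) {
        for (int i = 0; i < DRF_SIZE; i++)
            drf[i] = i + 1;
        printf("%d %d\n",
               sum_static(drf[0], drf[1], drf[2], drf[3]),
               sum_indirect(0, 4));   /* both print 10 */
        return 0;
    }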
In summary, the major contributions of this paper are: we introduce the RPA, which supports indirect register file access through register pointers; we explore design options for the RPA at the instruction set and microarchitecture levels, including parameters such as the number of additional registers; and we compare an embedded processor implementing the RPA to a conventional organization. We also compare it to techniques such as loop unrolling and scratchpad memory.

The remainder of this paper is organized as follows. Section 2 provides a detailed description of the RPA, including instruction set and microarchitectural considerations. Sections 3 and 4 describe the experimental methodology and results. Sections 5 and 6 present related work and conclusions.

2 Register Pointer Architecture

This section describes the RPA architecture at the instruction set and microarchitecture levels.

2.1 Instruction Set Architecture

The RPA extends a base instruction set, such as ARM, to support indirect register file accesses. An instruction indirectly accesses a register by identifying in its encoding a register pointer, whose contents provide the address for the actual register file reference. While implementations of existing ISAs may use indirect register accesses to implement techniques such as register renaming [1], the RPA exposes the indirection to the software.

In addition to the conventional register file (RF), the RPA defines a dereferencible register file (DRF) and register pointers (RPs). The DRF contains all registers accessed indirectly, while the RPs contain indirection information (a DRF address and other configuration fields). Data processing and transfer instructions can specify any of the following as an input or output operand: an RF entry, a DRF entry accessed through an RP, or an RP. Note that only register operands are modified by the RPA; no additional opcodes or memory addressing modes are added to the instruction set.

The example instruction shown below illustrates important features of the RPA ISA. The instruction adds r0, an RF entry, to the DRF entry pointed to by the RP p0. The sum is stored in the RP p1. The postfix operator "!" increments RP p0 after it is dereferenced.

    add p1, r[p0]!, r0

Assembly instructions access the DRF by specifying an RP enclosed in square brackets after the symbol r. An RP is accessed as though it were a conventional general-purpose register: by directly specifying the RP as an operand.

To reduce the overhead of RP updates, a register pointer may be incremented when named in an instruction. For further flexibility, we support circular addressing using two additional fields in each RP: a base address (begin) and a limit address (end). An attempt to increment the address beyond the end address causes the pointer to wrap around to the begin address. An overflow bit is set when the pointer wraps around, allowing software to detect the wrap-around if desired. Each RP stores the base address and limit address fields in its higher bits. When an instruction dereferences an RP, it accesses only the least significant bits, whose contents are the actual address for DRF accesses.

The number of bits required to encode the RP and DRF addresses depends on the number of RPs, the binding of RPs, and the number of access modes. Note that the number of DRF entries does not influence the encoding of instructions. The specific modifications to the ARM ISA are described in Section 3.
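The RP semantics just described can be summarized in a short functional model. The C sketch below is our reading of Section 2.1, not code from the paper: the begin, end, and overflow fields are modeled as struct members rather than packed bit-fields, and all names are assumptions. The post-increment with wrap-around models the "!" operator combined with circular addressing.

    #include <stdio.h>
    #include <stdbool.h>

    enum { DRF_SIZE = 64 };
    static int drf[DRF_SIZE];

    /* Modeled register pointer: a DRF address plus the circular-addressing
     * fields (begin/end) and the overflow bit described in Section 2.1. */
    typedef struct {
        unsigned addr;     /* least significant bits: actual DRF address */
        unsigned begin;    /* base of the circular region                */
        unsigned end;      /* limit of the circular region               */
        bool     overflow; /* set when the pointer wraps around          */
    } RegPtr;

    /* Dereference an RP with the "!" postfix: read the DRF entry the RP
     * points to, then post-increment, wrapping around to begin. */
    static int drf_read_postinc(RegPtr *p) {
        int value = drf[p->addr];
        if (p->addr == p->end) {    /* incrementing past end wraps */
            p->addr = p->begin;
            p->overflow = true;     /* software may poll this bit  */
        } else {
            p->addr++;
        }
        return value;
    }

    int main(void) {
        for (int i = 0; i < 4; i++)
            drf[i] = 10 * i;
        RegPtr p0 = { .addr = 0, .begin = 0, .end = 3, .overflow = false };
        /* Models the r[p0]! source operand executed five times: the
         * fifth dereference has wrapped back to DRF entry 0. */
        for (int i = 0; i < 5; i++)
            printf("%d ", drf_read_postinc(&p0));   /* 0 10 20 30 0 */
        printf("overflow=%d\n", p0.overflow);       /* overflow=1   */
        return 0;
    }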
2.2 Design Tradeoffs

The number and organization of the pertinent resources (DRF entries, DRF ports, and RPs) introduce interesting tradeoffs amongst performance, software flexibility, energy consumption, and hardware complexity, as described below.

Number of DRF Entries: The relationship between the reduction of cache accesses and the number of DRF entries varies by application. For applications such as 1DFIR, the cache access reduction is a linear function of the DRF size, while for matrix multiplication it follows a square root function. For applications with table lookups, it is a step function: it saturates once the lookup table fits in the DRF.

There are two costs associated with larger DRF sizes. First, the energy consumption of a DRF access increases with the number of DRF entries. Second, the DRF access time increases because the larger row decoders incur longer delays and the discharge time of the longer bit-lines increases. The additional access latency should not adversely impact program execution times unless it requires increasing the clock cycle time. Given a small DRF (fewer than 256 entries), we expect the clock cycle time will be limited by the cache access latency rather than by the DRF. In Section 4, we examine how these factors influence the DRF size that best balances performance improvements with energy consumption.

Number of DRF Ports: We can limit the number of DRF ports to one read port and one write port, which reduces the DRF area and energy consumption. However, the reduced number of DRF ports penalizes the performance of applications that have multiple independent data streams, such as 1DFIR.

Number and Binding of Register Pointers: The simplest design, in terms of hardware complexity, would provide one dedicated RP for each operand position. Using the ARM ISA as an example, we would have pd, pn and pm, which correspond to the d, n and m operands, respectively. However, such a scheme lacks flexibility and may introduce overheads when the same stream is used as both an input and an output, or in different input operand positions. Providing more RPs with flexible bindings tends to increase the encoded instruction width and the complexity of the interlocks required in pipelined processor implementations.

We evaluate these parameters quantitatively in Section 4.1, focusing primarily on the number of DRF entries.

2.3 RPA Processor Organization

Figure 1 shows a five-stage processor pipeline modified to implement the RPA. The main additions are the DRF and RPs, which, as in a conventional scalar pipeline, are read in the decode stage. The post-increment of the RPs is also per-
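As a closing illustration of the DRF-sizing relationships described under Number of DRF Entries in Section 2.2 above, the sketch below is a rough counting model of our own, not data or code from the paper: it estimates eliminated loads for a FIR filter, where savings grow linearly until all taps are DRF-resident, and for a table lookup, where savings are a step function of whether the whole table fits. The tap, table, and iteration counts are arbitrary assumptions.

    #include <stdio.h>

    /* Rough model (our assumption, not the paper's data) of cache-access
     * reduction as a function of DRF entries, for two of the access
     * patterns named in Section 2.2. */

    /* 1DFIR: each coefficient kept in the DRF saves one load per output
     * sample, so the reduction is linear in DRF size until all taps fit. */
    static long fir_loads_saved(int drf_entries, int taps, long outputs) {
        int resident = drf_entries < taps ? drf_entries : taps;
        return (long)resident * outputs;
    }

    /* Table lookup: a step function; loads are only eliminated once the
     * whole table fits in the DRF. */
    static long lut_loads_saved(int drf_entries, int table_size, long lookups) {
        return drf_entries >= table_size ? lookups : 0;
    }

    int main(void) {
        for (int entries = 16; entries <= 128; entries *= 2)
            printf("DRF=%3d  FIR saved=%6ld  LUT saved=%6ld\n",
                   entries,
                   fir_loads_saved(entries, 64, 1000),   /* 64-tap filter  */
                   lut_loads_saved(entries, 64, 1000));  /* 64-entry table */
        return 0;
    }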
Recommended publications
  • Analyzing the Benefits of Scratchpad Memories for Scientific Matrix
The Pennsylvania State University, The Graduate School, Department of Computer Science and Engineering. ANALYZING THE BENEFITS OF SCRATCHPAD MEMORIES FOR SCIENTIFIC MATRIX COMPUTATIONS. A Thesis in Computer Science and Engineering by Bryan Alan Cover, © 2008 Bryan Alan Cover. Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science, May 2008. The thesis of Bryan Alan Cover was reviewed and approved* by the following: Mary Jane Irwin, Evan Pugh Professor of Computer Science and Engineering, Thesis Co-Advisor; Padma Raghavan, Professor of Computer Science and Engineering, Thesis Co-Advisor; Raj Acharya, Professor of Computer Science and Engineering, Head of the Department of Computer Science and Engineering. *Signatures are on file in the Graduate School. Abstract: Scratchpad memories (SPMs) have been shown to be more energy efficient, have faster access times, and take up less area than traditional hardware-managed caches. This, coupled with the predictability of data presence and reduced thermal properties, makes SPMs an attractive alternative to cache for many scientific applications. In this work, SPM-based systems are considered for a variety of different functions. The first study performed analyzes SPMs for their thermal and area properties on a conventional RISC processor. Six performance-optimized variants of the architecture are explored to evaluate the impact of having an SPM in the on-chip memory hierarchy. Increasing the performance and energy efficiency of both dense and sparse matrix-vector multiplication on a chip multiprocessor is also examined. The efficient utilization of the SPM is ensured by profiling the application for the data structures which do not perform well in traditional cache.
  • Memory Architectures 13 Preeti Ranjan Panda
Memory Architectures 13 Preeti Ranjan Panda Abstract In this chapter we discuss the topic of memory organization in embedded systems and Systems-on-Chips (SoCs). We start with the simplest hardware-based systems needing registers for storage and proceed to hardware/software codesigned systems with several standard structures such as Static Random-Access Memory (SRAM) and Dynamic Random-Access Memory (DRAM). In the process, we touch upon concepts such as caches and Scratchpad Memories (SPMs). In general, the emphasis is on concepts that are more generally found in SoCs and less on general-purpose computing systems, although this distinction is not very clearly defined with respect to the memory subsystem. We touch upon implementations of these ideas in modern research and commercial scenarios. In this chapter, we also point out issues arising in the context of the memory architectures that become exported as problems to be addressed by the compiler and system designer. Acronyms: ALU Arithmetic-Logic Unit; ARM Advanced RISC Machines; CGRA Coarse-Grained Reconfigurable Architecture; CPU Central Processing Unit; DMA Direct Memory Access; DRAM Dynamic Random-Access Memory; GPGPU General-Purpose computing on Graphics Processing Units; GPU Graphics Processing Unit; HDL Hardware Description Language; PCM Phase Change Memory; SoC System-on-Chip; SPM Scratchpad Memory; SRAM Static Random-Access Memory; STT-RAM Spin-Transfer Torque Random-Access Memory; SWC Software Cache. P.R. Panda, Department of Computer Science and Engineering, Indian Institute of Technology Delhi, New Delhi, India. e-mail: [email protected]. © Springer Science+Business Media Dordrecht 2017. S. Ha, J. Teich (eds.), Handbook of Hardware/Software Codesign, DOI 10.1007/978-94-017-7267-9_14. Contents: 13.1 Motivating the Significance of Memory.
  • Computer Organization and Architecture Designing for Performance Ninth Edition
    COMPUTER ORGANIZATION AND ARCHITECTURE DESIGNING FOR PERFORMANCE NINTH EDITION William Stallings Boston Columbus Indianapolis New York San Francisco Upper Saddle River Amsterdam Cape Town Dubai London Madrid Milan Munich Paris Montréal Toronto Delhi Mexico City São Paulo Sydney Hong Kong Seoul Singapore Taipei Tokyo Editorial Director: Marcia Horton Designer: Bruce Kenselaar Executive Editor: Tracy Dunkelberger Manager, Visual Research: Karen Sanatar Associate Editor: Carole Snyder Manager, Rights and Permissions: Mike Joyce Director of Marketing: Patrice Jones Text Permission Coordinator: Jen Roach Marketing Manager: Yez Alayan Cover Art: Charles Bowman/Robert Harding Marketing Coordinator: Kathryn Ferranti Lead Media Project Manager: Daniel Sandin Marketing Assistant: Emma Snider Full-Service Project Management: Shiny Rajesh/ Director of Production: Vince O’Brien Integra Software Services Pvt. Ltd. Managing Editor: Jeff Holcomb Composition: Integra Software Services Pvt. Ltd. Production Project Manager: Kayla Smith-Tarbox Printer/Binder: Edward Brothers Production Editor: Pat Brown Cover Printer: Lehigh-Phoenix Color/Hagerstown Manufacturing Buyer: Pat Brown Text Font: Times Ten-Roman Creative Director: Jayne Conte Credits: Figure 2.14: reprinted with permission from The Computer Language Company, Inc. Figure 17.10: Buyya, Rajkumar, High-Performance Cluster Computing: Architectures and Systems, Vol I, 1st edition, ©1999. Reprinted and Electronically reproduced by permission of Pearson Education, Inc. Upper Saddle River, New Jersey, Figure 17.11: Reprinted with permission from Ethernet Alliance. Credits and acknowledgments borrowed from other sources and reproduced, with permission, in this textbook appear on the appropriate page within text. Copyright © 2013, 2010, 2006 by Pearson Education, Inc., publishing as Prentice Hall. All rights reserved. Manufactured in the United States of America.
  • Introduction to the Intel® Nios® II Soft Processor
Introduction to the Intel® Nios® II Soft Processor For Quartus® Prime 18.1 1 Introduction This tutorial presents an introduction to the Intel® Nios® II processor, which is a soft processor that can be instantiated on an Intel FPGA device. It describes the basic architecture of Nios II and its instruction set. The Nios II processor and its associated memory and peripheral components are easily instantiated by using Intel's SOPC Builder or Platform Designer tool in conjunction with the Quartus® Prime software. A full description of the Nios II processor is provided in the Nios II Processor Reference Handbook, which is available in the literature section of the Intel web site. Introductions to the SOPC Builder and Platform Designer tools are given in the tutorials Introduction to the Intel SOPC Builder and Introduction to the Intel Platform Designer Tool, respectively. Both can be found in the University Program section of the web site. Contents: • Nios II System • Overview of Nios II Processor Features • Register Structure • Accessing Memory and I/O Devices • Addressing • Instruction Set • Assembler Directives • Example Program • Exception Processing • Cache Memory • Tightly Coupled Memory. Intel Corporation - FPGA University Program, March 2019. 2 Background Intel's Nios II is a soft processor, defined in a hardware description language, which can be implemented in Intel's FPGA devices by using the Quartus Prime CAD system. This tutorial provides a basic introduction to the Nios II processor, intended for a user who wishes to implement a Nios II based system on an Intel Development and Education board.
  • Computer Organization & Architecture EIE
COMPUTER ORGANIZATION & ARCHITECTURE EIE 411 Course Lecturer: Engr. Banji Adedayo, Reg. COREN. The characteristics of different computers vary considerably from category to category. Computers for data processing activities have different features than those with scientific features. Even computers configured within the same application area have variations in design. Computer architecture is the science of integrating those components to achieve a level of functionality and performance. It is the logical organization or design of the hardware that makes up the computer system. The internal organization of a digital system is defined by the sequence of micro-operations it performs on the data stored in its registers. The internal structure of a MICROPROCESSOR is called its architecture and includes the number, layout and functionality of registers, memory cells, decoders, controllers and clocks. HISTORY OF COMPUTER HARDWARE The first use of the word 'Computer' was recorded in 1613, referring to a person who carried out calculations or computations. A brief history: the computer as we know it today had its beginning with a 19th-century English mathematics professor named Charles Babbage. He designed the Analytical Engine, and it is on this design that the basic framework of today's computers is based. 1st Generation 1937-1946: The first electronic digital computer was built by Dr. John V. Atanasoff and Clifford Berry (the Atanasoff-Berry Computer, ABC). In 1943 an electronic computer named Colossus was built for the military. 1946 – The first general-purpose digital computer, the Electronic Numerical Integrator and Computer (ENIAC), was built. This computer weighed 30 tons and had 18,000 vacuum tubes which were used for processing.
  • GPGPUs: Overview Pedagogically Precursor Concepts UNIVERSITY of ILLINOIS at URBANA-CHAMPAIGN
GPGPUs: Overview Pedagogically precursor concepts UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN © 2018 L. V. Kale at the University of Illinois Urbana-Champaign Precursor Concepts: Pedagogical • Architectural elements that, at least pedagogically, are precursors to understanding GPGPUs • SIMD and vector units. We have already seen those • Large-scale "Hyperthreading" for latency tolerance • Scratchpad memories • High bandwidth memory Tera MTA, Sun UltraSPARC T1 (Niagara) • Tera computers, with Burton Smith as a co-founder (1987) • Precursor: HEP processor (Denelcor Inc.), 1982 • First machine was MTA • MTA-1, MTA-2, MTA-3 • Basic idea: • Support a huge number of hardware threads (128) • Each with its own registers • No cache! • Switch among threads on every cycle, thus tolerating DRAM latency • These threads could be running different processes • Such processors are called "barrel processors" in the literature • But they switched to the "next thread" always, so your turn is 127 clocks away, always • Was especially good for highly irregular accesses Scratchpad memory • Caches are complex and can cause unpredictable impact on performance • Scratchpad memories are made from SRAM on chip • Separate part of the address space • As fast or faster than caches, because no tag-matching or associative search • Need explicit instructions to bring data into scratchpad from memory • Load/store instructions exist between registers and scratchpad • Example: IBM/Toshiba/Sony Cell processor used in PS3 • 1 PPE, and 8 SPE cores • Each SPE core has 256 KiB scratchpad • DMA is the mechanism for moving data between scratchpad and external DRAM High Bandwidth Memory • As you increase the compute capacity of processor chips, the bandwidth to DRAM comes under pressure • Many past improvements (e.g.
  • A Modular Soft Processor Core in VHDL
A Modular Soft Processor Core in VHDL Jack Whitham 2002-2003 This is a Third Year project submitted for the degree of MEng in the Department of Computer Science at the University of York. The project will attempt to demonstrate that a modular soft processor core can be designed and implemented on an FPGA, and that the core can be optimised to run a particular embedded application using a minimal amount of FPGA space. The word count of this project (as counted by the Unix wc command after detex was run on the LaTeX source) is 33647 words. This includes all text in the main report and Appendices A, B and C. Excluding source code, the project is 70 pages in length. Contents: I. Introduction; 1. Background and Literature; 1.1. Soft Processor Cores; 1.2. A Field Programmable Gate Array; 1.3. VHSIC Hardware Description Language (VHDL); 1.4. The Motorola 68020; II. High-level Project Decisions; 2. Should the design be based on an existing one?; 3. Which processor should the soft core be based upon?; 4. Which processor should be chosen?; 5. Restating the aims of the project in terms of the chosen processor; III. Modular Processor Design Decisions; 6. Processor Design; 6.1. Alternatives to a complete processor implementation; 6.2. A real processor; 6.3. Instruction Decoder and Control Logic; 6.4. Arithmetic and Logic Unit (ALU); 6.5. Register File; 6.6. Links between Components
  • Assignment Solutions
Week 1: Assignment Solutions 1. Which of the following statements are true? a. The ENIAC computer was built using mechanical relays. b. The Harvard Mark 1 computer was built using mechanical relays. c. The PASCALINE computer could multiply and divide numbers by repeated addition and subtraction. d. Charles Babbage built his automatic computing engine in the 19th century. Solution: ((b) and (c)) ENIAC was built using vacuum tubes. Charles Babbage designed his automatic computing engine but could not build it. 2. Which of the following statements are true for Moore's law? a. Moore's law predicts that power dissipation will double every 18 months. b. Moore's law predicts that the number of transistors per chip will double every 18 months. c. Moore's law predicts that the speed of VLSI circuits will double every 18 months. d. None of the above. Solution: (b) Moore's law only predicts that the number of transistors per chip will double every 18 months. 3. Which of the following generates the necessary signals required to execute an instruction in a computer? a. Arithmetic and Logic Unit b. Memory Unit c. Control Unit d. Input/Output Unit Solution: (c) The control unit acts as the nerve center of a computer and generates the necessary control signals required to execute an instruction. 4. An instruction ADD R1, A is stored at memory location 4004H. R1 is a processor register and A is a memory location with address 400CH. Each instruction is 32 bits long. What will be the values of PC, IR and MAR during execution of the instruction? a.
  • Hardware and Software Optimizations for GPU Resource Management
Hardware and Software Optimizations for GPU Resource Management A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy by Vishwesh Jatala to the DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING, INDIAN INSTITUTE OF TECHNOLOGY KANPUR, INDIA, December, 2018. Abstract Graphics Processing Units (GPUs) are widely adopted across various domains due to their massive thread level parallelism (TLP). The TLP that is present in the GPUs is limited by the number of resident threads, which in turn depends on the available resources in the GPUs, such as registers and scratchpad memory. Recent GPUs aim to improve the TLP, and consequently the throughput, by increasing the number of resources. Further, the improvements in semiconductor fabrication enable smaller feature sizes. However, for smaller feature sizes, the leakage power is a significant part of the total power consumption. In this thesis, we provide hardware and software solutions that aim towards the two problems of GPU design: improving throughput and reducing leakage energy. In the first work of the thesis, we focus on improving the performance of GPUs by effective resource management. In GPUs, resources (registers and scratchpad memory) are allocated at thread block level granularity; as a result, some of the resources may not be used up completely and hence will be wasted. We propose an approach that shares the resources of the SM to utilize the wasted resources by launching more thread blocks in each SM. We show the effectiveness of our approach for two resources: registers and scratchpad memory (shared memory). On evaluating our approach experimentally with 19 kernels from several benchmark suites, we observed that kernels that underutilize register resource show an average improvement of 11% with register sharing.
  • Registers Are Used to Quickly Accept, Store, and Transfer Data And
Registers are used to quickly accept, store, and transfer data and instructions that are being used immediately by the CPU. There are various types of registers, each used for a particular purpose. Among the most commonly used registers are the accumulator (AC), the data register (DR), the address register (AR), the program counter (PC), the memory data register (MDR), the index register, and the memory buffer register (MBR). These registers are used by the CPU to perform its operations: when we give some input to the system, the input is stored in registers, and when the system produces results after processing, those results are also delivered from registers. Registers take part in: 1) Fetch: the fetch operation takes the instructions given by the user, and the instructions stored in main memory are fetched using registers. 2) Decode: the decode operation interprets the instructions, meaning the CPU determines which operation is to be performed on the operands. 3) Execute: the execute operation is performed by the CPU, and the results produced are stored in memory and then displayed on the user's screen. Types of registers are as follows: 1. MAR, which stands for Memory Address Register. This register holds the memory addresses of data and instructions.
  • Improving Memory Subsystem Performance Using ViVA: Virtual Vector Architecture
Improving Memory Subsystem Performance using ViVA: Virtual Vector Architecture. Joseph Gebis, Leonid Oliker, John Shalf, Samuel Williams, Katherine Yelick. CRD/NERSC, Lawrence Berkeley National Laboratory, Berkeley, CA 94720; CS Division, University of California at Berkeley, Berkeley, CA 94720. {JGebis, LOliker, JShalf, SWWilliams, [email protected]} Abstract. The disparity between microprocessor clock frequencies and memory latency is a primary reason why many demanding applications run well below peak achievable performance. Software-controlled scratchpad memories, such as the Cell local store, attempt to ameliorate this discrepancy by enabling precise control over memory movement; however, scratchpad technology confronts the programmer and compiler with an unfamiliar and difficult programming model. In this work, we present the Virtual Vector Architecture (ViVA), which combines the memory semantics of vector computers with a software-controlled scratchpad memory in order to provide a more effective and practical approach to latency hiding. ViVA requires minimal changes to the core design and could thus be easily integrated with conventional processor cores. To validate our approach, we implemented ViVA on the Mambo cycle-accurate full system simulator, which was carefully calibrated to match the performance of our underlying PowerPC Apple G5 architecture. Results show that ViVA is able to deliver significant performance benefits over scalar techniques for a variety of memory access patterns as well as two important memory-bound compact kernels, corner turn and sparse matrix-vector multiplication, achieving 2x–13x improvement compared to the scalar version. Overall, our preliminary ViVA exploration points to a promising approach for improving application performance on leading microprocessors with minimal design and complexity costs, in a power efficient manner.
  • IAR C/C++ Compiler Reference Guide for V850
    IAR Embedded Workbench® IAR C/C++ Compiler Reference Guide for the Renesas V850 Microcontroller Family CV850-9 COPYRIGHT NOTICE © 1998–2013 IAR Systems AB. No part of this document may be reproduced without the prior written consent of IAR Systems AB. The software described in this document is furnished under a license and may only be used or copied in accordance with the terms of such a license. DISCLAIMER The information in this document is subject to change without notice and does not represent a commitment on any part of IAR Systems. While the information contained herein is assumed to be accurate, IAR Systems assumes no responsibility for any errors or omissions. In no event shall IAR Systems, its employees, its contractors, or the authors of this document be liable for special, direct, indirect, or consequential damage, losses, costs, charges, claims, demands, claim for lost profits, fees, or expenses of any nature or kind. TRADEMARKS IAR Systems, IAR Embedded Workbench, C-SPY, visualSTATE, The Code to Success, IAR KickStart Kit, I-jet, I-scope, IAR and the logotype of IAR Systems are trademarks or registered trademarks owned by IAR Systems AB. Microsoft and Windows are registered trademarks of Microsoft Corporation. Renesas is a registered trademark of Renesas Electronics Corporation. V850 is a trademark of Renesas Electronics Corporation. Adobe and Acrobat Reader are registered trademarks of Adobe Systems Incorporated. All other product names are trademarks or registered trademarks of their respective owners. EDITION NOTICE Ninth edition: May 2013 Part number: CV850-9 This guide applies to version 4.x of IAR Embedded Workbench® for the Renesas V850 microcontroller family.