Evaluating ASIC, DSP, and RISC Architectures for Embedded Applications


Marc E. Campbell
Northrop Grumman Corporation

Abstract

Mathematical analysis and empirical evaluation of the solid state power equation

    Power_CPU = Power_CMOS = C · V² · f · N · %N    (1-1)

is presented in this paper, which identifies a measurable metric for evaluating the relative advantages of ASIC, DSP, and RISC architectures for embedded applications. Relationships are examined which can help predict relative future architecture performance as new generations of CMOS solid state technology become available. In particular, Performance/Watt is shown to be an Architecture-Technology Metric which can be used to

• calibrate ASIC, DSP, & RISC performance density potential relative to solid state technology generations,
• measure & evaluate architectural changes, and
• project an architecture performance density roadmap.

1. Architecture Metric Basis

Intuitively, more efficient architectures will use less energy to complete the same task on the same generation of CMOS solid state technology. Beyond this intuition:

(1) How can the relative advantages between ASIC, DSP, and RISC architectures be quantified?
(2) How can the relative advantages between different ASIC, DSP, and RISC options be projected into future embedded system design plans?

The key concept is the power consumption behavior of a CMOS gate.

1.1. CMOS device power consumption

Power consumption is an externally measurable physical property of the CMOS ASIC, DSP, and RISC devices which are used to build larger computational systems. The power consumption for a CMOS device is shown in equation 1-1.

    Power_CPU = Power_CMOS = C · V² · f · N · %N    (1-1)

where,
    C  = capacitance,
    V  = CPU core voltage,
    f  = CPU core clock frequency,
    N  = number of gates, and
    %N = percentage of gates which change state on a given clock cycle.

Some primary architectural and technology components in the CMOS power consumption equation 1-1 can be noted.

1.2. Architecture

First, the architecture is a function of the number of gates N and how the gates toggle to perform a task, %N. Physical CMOS gate real estate is quantified by N. How the software interacts with the CMOS gates is quantified by how the gates toggle, %N. Thus software architecture is also reflected in %N for software programmable devices. The system architecture of combined hardware and software is largely represented by N·%N.

    Architecture = N · %N    (1-2)

More complex architectures require more gates. More specialized architectures have fewer gate toggles per unit task. Thus, an Architecture Specialization metric can be defined based on how many clock cycles and gates are required to complete a given functional task. The fewer gates and cycles used to get the task done, the more specialized, efficient, or "tuned" the processor is for the required task.

1.3. CMOS technology

Second, the advances in solid state process technologies can be observed as a function of gate geometry, voltage, and frequency. Newer technology systems are designed with smaller gate geometries, lower voltages, and higher frequencies. Changes in gate geometry affect changes in capacitance.

    SolidStateTechnology(1/C, 1/V², f)    (1-3)

2. Architecture Metric Observations

At this point, several observations based on sustained performance per watt can be made:

2.1. Clock frequency

Observation 1. Higher performance achieved by only increasing the clock rate f has no net performance per watt benefit.

Let,

    Performance = operations / unit_time

    Performance = O_c · f = (operations/cycle) · (cycles/second)    (2-1)

where,
    O_c = sustained operations per cycle, and
    f   = clock frequency.

Then it follows that,

    Performance / Power_CPU = (O_c · f) / (C · V² · f · N · %N)
                            = O_c / (C · V² · N · %N)    (2-2)

The observation that clock frequency f does not affect performance per watt for the same CPU architecture N·%N is also illustrated in Figure 2-1, based on information available from Digital Equipment Corporation (DEC) [2]. Figure 2-1 does not include the power consumption required by support logic.

[Figure 2-1. Alpha Performance/Watt vs. Clock — SPECfp95/Watt plotted against Alpha 21164 clock frequency, 350-500 MHz.]

2.2. Voltage

Observation 2. A voltage (V²) decrease significantly improves performance per watt for the same architecture.

For the same architecture N·%N with a corresponding task definition, the relative performance per watt improvement can be shown by equation 2-3,

    VoltageChangeBenefit = (Performance/Power_CPU)_new / (Performance/Power_CPU)_old
                         = [O_c / (C · V²_new · N · %N)] / [O_c / (C · V²_old · N · %N)]
                         = C · V²_old / (C · V²_new)    (2-3)

For the cases where C is near constant, or the change in C is small with respect to the change in V², the performance per watt benefit is approximated as,

    VoltageChangeBenefit ≈ V²_old / V²_new    (2-4)

The approximate performance per watt benefit as an architecture voltage changes from 3.3 volts toward 2.0 volts is calculated in equation 2-5.

    V²_3.3 / V²_2.0 = 3.3² / 2.0² = 2.7    (2-5)

Note also that,

    ∂P/∂f ∝ V²  ⇒  (∂P_new/∂f) / (∂P_old/∂f) = V²_new / V²_old    (2-6)

Thus, the 2x improvement in the ∂P/∂f slope shown in figure 2-2, and the associated 2x improvement in performance/watt, for the DEC Alpha 21164 [2] can be mostly attributed to the change in V² rather than to a change in architecture. The data in figure 2-2 does not include power consumed by support logic.

[Figure 2-2. DEC Alpha Watt vs. Clock — Watts/CPU vs. Alpha 21164 clock (200-500 MHz), comparing Vdd = 3.3 V (slope 5 watts / 34 MHz) with Vdd = 2.5 V, core = 2.0 V (slope 5 watts / 66 MHz).]

2.3. Architecture width

Observation 3. Smaller architecture widths will have lower power consumption per completed task.

Since a 64-bit floating point register uses 2x more gates N than a 32-bit floating point register, the 64-bit architecture can be expected to consume 2x more power for the same solid state CMOS generation. Using a 64-bit processor where only 32 bits are required will potentially consume on the order of 2x more energy on large scale systems. Likewise, Field Programmable Gate Array (FPGA) processors would be an energy efficient choice for bit level template matching algorithms.

The difference between 32-bit and 64-bit performance/watt can be observed in figure 2-3. The 32-bit MIPS and 32-bit PowerPC show a 2x SPECfp92 performance per watt compared with the 64-bit DEC Alpha and 64-bit Hewlett Packard PA-RISC processors in the 1995 and 1996 release years, as the respective architectures matured in the market.

[Figure 2-3. SPECfp92 Performance/Watt vs. Release Year — SPECfp92/Peak Watt for Alpha, MIPS, PA-RISC, and PowerPC processors released 1991-1996.]

2.4. Architecture specialization

Observation 4. For a given operational task O_c and a given technology generation, Performance/Watt provides a metric of Architecture Specialization.

The Architecture Specialization metric is concisely stated in equation 2-7.

    ArchitectureSpecialization_DSP/RISC = (Performance/Watt)_DSP / (Performance/Watt)_RISC    (2-7)

The Architecture Specialization metric has the nice feature of being directly measurable. In contrast, an N·%N count determination would require a gate level simulation, which is not as widely available as an actual production device.

The Architecture Specialization metric of a 32-bit SHARC DSP with respect to the 64-bit 21064A and 21164 Alpha RISC processors for sustained FFT throughput/watt is shown in figure 2-4. The 18x Architecture Specialization measurement is based on measured sustained throughput/watt for 1D complex FFTs averaged over vector lengths ranging from 64 to 32K for the same voltage system. The SHARC 21060 is designed for efficient FFT performance. The Architecture Specialization metric will vary significantly for other tasks.

[Figure 2-4. FFT MFLOPS/Watt vs. complex FFT length (64-32768 points) for the Alpha 21064A (275 MHz, 275 MFLOP peak, 3.3 volts, 33+8 watts), the Alpha 21164 (333 MHz, 666 MFLOP peak, 2.2 volt core, 25.4+6 watts), and the SHARC 21060 (40 MHz, 120 MFLOP peak, 3.3 volts, 1.75 watts). DSP32 vs. RISC64: (Ave.Ops/Sec/Watt) 58 / 3.14 = 18x. New vs. old Alpha: 10.9 / 4.8 = 2.3x.]

Notice that the 2.3x Alpha performance per watt increase can be largely attributed to a V² core voltage change and not to a fundamental shift to a more DSP-specialized architecture.

An aside observation is that more specialized architectures inherently require more specialized programming techniques and tools. FPGAs use a Hardware Description Language (HDL). Floating point DSP chipsets primarily have assembly and C support. General purpose RISC processors have a wide variety of programming languages, including niche languages such as Fortran, Prolog, and SmallTalk.

In figure 2-6, each architecture's performance/watt is normalized, with RISC 64-bit @ 5 volts being given the value of 1. The CMOS technology generation axis shows the corresponding minimum and/or specified device voltage. Notice that the DSP-RISC architecture offset is greater than the precision offset within a RISC Architecture Family. Figure 2-6 also illustrates how a lag of 2 CMOS generations can diminish the specialization advantage that an architecture might have.

[Figure 2-6. Two Generation Lag — Architecture Metric (Performance/Watt, log scale 1-1000) vs. Solid State Technology (Voltage: 5, 3.3, 2.2, 1.5, 1) for ASIC 16-bit, DSP 32-bit, RISC 32-bit, and RISC 64-bit, with a two-generation lag marked.]

The performance/watt offset between FPGA, Integer DSP, Floating Point DSP, and RISC will be persistent, as each niche architecture takes advantage of the same advancements in solid state technology.
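The cancellation of f in Observation 1 can be checked numerically. The following is a minimal sketch of equations 1-1, 2-1, and 2-2; the capacitance, gate count, and toggle-percentage values are hypothetical placeholders chosen for illustration, not measurements from the paper.

```python
def cmos_power(C, V, f, N, pct_N):
    """Equation 1-1: Power_CMOS = C * V^2 * f * N * %N."""
    return C * V**2 * f * N * pct_N

def perf_per_watt(Oc, C, V, f, N, pct_N):
    """Equation 2-2: (Oc * f) / (C * V^2 * f * N * %N)."""
    return (Oc * f) / cmos_power(C, V, f, N, pct_N)

# Observation 1: f appears in both numerator and denominator,
# so raising only the clock leaves performance/watt unchanged.
args = dict(Oc=2.0, C=1e-12, V=3.3, N=9.3e6, pct_N=0.25)  # hypothetical device
low = perf_per_watt(f=350e6, **args)
high = perf_per_watt(f=500e6, **args)
assert abs(low - high) < 1e-9 * low
```

Power itself still rises linearly with f (equation 1-1); only the ratio is invariant, which is why the paper treats frequency scaling as performance-density-neutral.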
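Equation 2-4 reduces Observation 2 to simple arithmetic. A sketch, assuming as the paper does that C is near constant across the voltage change:

```python
def voltage_change_benefit(v_old, v_new):
    """Equation 2-4: approximate performance/watt gain from a core
    voltage drop, valid when capacitance C is near constant."""
    return v_old**2 / v_new**2

# Equation 2-5: moving from 3.3 V toward 2.0 V.
print(round(voltage_change_benefit(3.3, 2.0), 1))  # -> 2.7
# The Alpha 21164's 3.3 V -> 2.2 V core change from figure 2-4:
print(round(voltage_change_benefit(3.3, 2.2), 2))  # -> 2.25
```

The second result is why the paper attributes the measured ~2.3x Alpha improvement mostly to voltage rather than to an architectural shift.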
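Equation 2-7 can likewise be evaluated directly from the sustained throughput/watt values reported with figure 2-4:

```python
def architecture_specialization(ppw_a, ppw_b):
    """Equation 2-7: ratio of two measured performance/watt values."""
    return ppw_a / ppw_b

# Ave.Ops/Sec/Watt values reported in figure 2-4 for sustained 1D complex FFTs:
sharc_ppw = 58.0   # 32-bit SHARC 21060 DSP
alpha_ppw = 3.14   # 64-bit Alpha RISC
print(round(architecture_specialization(sharc_ppw, alpha_ppw)))  # -> 18
```

Because both inputs are externally measurable, this metric needs no gate-level N·%N simulation, which is the feature the paper highlights.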