Introduction to Digital Signal Processors

Total Page:16

File Type:pdf, Size:1020Kb

Introduction to Digital Signal Processors INTRODUCTION TO Accumulator architecture DIGITAL SIGNAL PROCESSORS Memory-register architecture Prof. Brian L. Evans in collaboration with Niranjan Damera-Venkata and Magesh Valliappan Load-store architecture Embedded Signal Processing Laboratory The University of Texas at Austin Austin, TX 78712-1084 http://anchovy.ece.utexas.edu/ Outline n Signal processing applications n Conventional DSP architecture n Pipelining in DSP processors n RISC vs. DSP processor architectures n TI TMS320C6x VLIW DSP architecture n Signal and image processing applications n Signal processing on general-purpose processors n Conclusion 2 Signal Processing Applications n Low-cost embedded systems 4 Modems, cellular telephones, disk drives, printers n High-throughput applications 4 Halftoning, radar, high-resolution sonar, tomography n PC based multimedia 4 Compression/decompression of audio, graphics, video n Embedded processor requirements 4 Inexpensive with small area and volume 4 Deterministic interrupt service routine latency 4 Low power: ~50 mW (TMS320C5402/20: 0.32 mA/MIP) 3 Conventional DSP Architecture n High data throughput 4 Harvard architecture n Separate data memory/bus and program memory/bus n Three reads and one or two writes per instruction cycle 4 Short deterministic interrupt service routine latency 4 Multiply-accumulate (MAC) in a single instruction cycle 4 Special addressing modes supported in hardware n Modulo addressing for circular buffers (e.g. FIR filters) n Bit-reversed addressing (e.g. fast Fourier transforms) 4Instructions to keep the 3-4 stages of the pipeline full n Zero-overhead looping (one pipeline flush to set up) n Delayed branches 4 Conventional DSP Architecture (con’t) Data-shifting n Modulo Time Buffer contents Next sample x x x x x addressing n=N N-K+1 N-K+1 N-1 N N+1 4 implementing x x n=N+1 N-K+2 xN-K+3 N xN+1 x circular buffers N+2 x x and delay lines N-K+3 xN-K+4 N+1 xN+2 n=N+2 xN+3 Modulo addressing Time Buffer contents Next sample n Bit reversed n=N x x x x x x addressing N-2 N-1 N N-K+1 N-K+2 N+1 4 used to x x x x xN+2 n=N+1 N-2 xN-1 N N+1 xN-K+2 N-K+3 implement the radix-2 x x xx x x x x n=N+2 N-2 N-1 NN N+1 N+2 N-K+3 xN-K+4N-K+4 x FFT N+3 5 Conventional DSP Architecture (con’t) Fixed-Point Floating-Point Cost/Unit $5 $10 Architecture accumulator load-store or memory-register Registers 2–4 data 8 or 16 data 8 address 8 or 16 address Data Words 16 or 24 bit integer and 32 bit integer and fixed-point fixed/floating-point On-Chip 2–64 kwords data * 8–64 kwords data Memory 2–64 kwords program 8–64 kwords program Address 16 kw–8 Mw data 16 Mw–4Gw data Space 16–64 kw program ** 16 Mw–4 Gw program Compilers C compilers; C, C++ compilers; poor code generation better code generation Examples TI TMS320C5x; TI TMS320C3x; Motorola 56000 Analog Devices SHARC 6 Conventional DSP Architecture (con’t) n Market share: 95% fixed-point, 5% floating-point n Each processor has dozens of configurations 4 Size and map of data and program memory 4A/D, input/output buffers, interfaces, timers, and D/A n Drawbacks to conventional DSP processors 4 No byte addressing (desirable for image and video) 4 Limited on-chip memory 4Limited addressable memory on fixed-point DSPs: except Motorola 56300 (16 Mw data; 32 Mw program) and C548/C549/C54xx (8 Mw data; 256 kw program) 4 Non-standard C extensions to support fixed-point data 7 Pipelining Sequential (Motorola 56000) Fetch Decode Read Execute Pipelined (Most conventional DSP processors) Fetch Decode Read Execute Superscalar (Pentium, MIPS) Managing Pipelines •compiler or programmer •interlocking Fetch Decode Read Execute Superpipelined (CDC7600) •hardware instruction scheduling Fetch Decode Read Execute 8 Pipelining: Operation Fetch Decode Read n Time-stationary pipeline model Execute 4 Programmer controls each cycle F D R E 4 Motorola DSP56001 D C B A MAC X0,Y0,A X:(R0)+,X0 Y:(R4)-,Y0 E D C B F E D C n Data-stationary pipeline model G F E D H G F E 4 Programmer specifies data I H G F operations J I H G 4TMS320C30/40 K J I H L K J I MPYF *++AR0(1),*++AR1(IR0),R0 L - K J L - K n Interlocked pipeline L - 4 Programmer is “protected” from L pipeline effects 9 Pipelining: Hazards Fetch Decode Read n A control hazard occurs when a Execute branch instruction is decoded F D R E 4 “Flush” the pipeline D C B A 4 or: Delayed branch (expose pipeline) E D C B n A data hazard occurs because F E D C br F E D an operand cannot be read yet G br F E 4 Intended by programmer - - br F 4 or: Interlock hardware inserts - - - br “bubble” X - - - TMS320C5x example Y X - - Y - X - LAC #064h LAR AR2, DATA Z Y - X SAMM AR2 LACC *- NOP Z Y - LACC *- Z Y Z 10 Pipelining: Avoiding Control Hazards Fetch Decode Read A key factor in the numeric performance Execute of DSPs is the provision of special F D R E hardware to perform looping. D C B A RPT COUNT E D C B TBLR *+ F E D C rpt F E D n A repeat instruction repeats X rpt F E one instruction or a block of X - rpt F X - - rpt instructions after repeat X X - - n The pipeline is filled with X X X - X X X X repeated instruction (or X X X X block of instructions) X X X X n Cost: one pipeline flush only 11 RISC vs. DSP: Instruction Encoding n RISC: Superscalar Reorder Load/store FP Unit Integer Unit n DSP: Horizontal microcode Load/store Load/store Address ALU Multiplier 12 RISC vs. DSP: Memory Hierarchy n RISC Registers I/D Physical Cache memory Out of TLB order TLB: Translation Lookaside Buffer I Cache Internal n DSP memories Registers External memories DMA Controller DMA: Direct Memory Access 13 TI TMS320C6x VLIW DSP Architecture Simplified Architecture Program RAM Data RAM or Cache Addr Internal Buses DMA Data Regs (A0-A15) .D1 .D2 Regs (B0-B15) Serial Port External .M1 .M2 Host Port Memory .L1 .L2 -Sync Boot Load -Async .S1 .S2 Timers Control Regs Pwr Down CPU 14 TI TMS320C6x VLIW DSP Architecture n One instruction cycle per clock cycle n Two parallel data paths with single-cycle units: 4 Data unit - 32-bit address calculations (modulo, linear) 4 Multiplier unit - 16 bit x 16 bit with 32-bit result 4 Logical unit - 40-bit (saturation) arithmetic & compares 4 Shifter unit - 32-bit integer ALU and 40-bit shifter n 16 32-bit general purpose registers in each path 4 40 bits can be stored in adjacent even/odd registers n 32-bit addressing of 8/16/32 bit data n Fixed-point (C62x) and floating-point (C67x) n C67x computes floating-point multiply in 4 cycles 15 TI TMS320C6x VLIW DSP Architecture n TMS320C6211: $21 in volume 4 150 MHz, 300 million MACs/sec, 1200 RISC MIPS 4 on-chip: 4k x 8 bits program and 4k x 8 bits data (plus 64k x 8 bits L2 cache) n Deep pipeline 4 7-11 stages in C62x: fetch 4, decode 2, execute 1-5 4 7-16 stages in C67x: fetch 4, decode 2, execute 1-10 4 If a branch is in the pipeline, interrupts are disabled (the latency of a branch instruction is 5 cycles) 4 Avoid branches by using conditional execution n No hardware protection against pipeline hazards 4 Compiler and assembler must prevent pipeline hazards 16 C5x and C6x Addressing Modes n Immediate TMS320C5x TMS320C6x 4 The operand is part of the instruction ADD #0FFh add .L1 -13,A1,A6 n Register 4 The operand is (implied) add .L1 A7,A6,A7 specified in a register n Direct 4 The address of the operand is part of the ADD 010h not supported instruction (added to imply memory page) n Indirect ADD * ldw .L1 *A5++[8],A1 4 The address of the operand is stored in a register 17 TMS320C6x vs. Pentium MMX Processor Peak BDTI ISR Power Unit Area Volume MIPS marks latency Price Pentium 466 49 1.14 µs 4.25 W $213 5.5” x 2.5” 8.789 in3 MMX 233 Pentium 532 56 1.00 µs 4.85 W $348 5.5” x 2.5” 8.789 in3 MMX 266 C62x 1200 74 0.12 µs 1.45 W $25 1.3” x 1.3” 0.118 in3 150 MHz C62x 1600 99 0.09 µs 1.94 W $96 1.3” x 1.3” 0.118 in3 200 MHz BDTImarks: Berkeley Design Technology Inc. DSP benchmark results (larger means better) http://www.bdti.com/bdtimark/results.htm http://www.ece.utexas.edu/~bevans/courses/ee382c/lectures/processors.html 18 Application: FIR Filter z-1 z-1 z-1 n Each tap requires 4 Fetching one data sample 4Fetching one operand 4Multiplying two numbers 4Accumulating multiplication result 4Shifting one sample in the delay line n Computing an FIR tap in one instruction cycle 4Three data memory accesses 4Auto-increment or decrement addressing modes 4Modulo addressing to implement delay line as circular buffer 4 Eleven RISC instructions 19 Application: FIR Filter on a TMS320C5x Coefficients Data COEFFP .set 02000h ; Program mem address X .set 037Fh ; Newest data sample LASTAP .set 037FH ; Oldest data sample … LAR AR3, #LASTAP ; Point to oldest sample RPT #127 MACD COEFFP, *- ; Do the thing APAC SACH Y,1 ; Store result -- note shift 20 Application: FIR Filter on a TMS320C62x Coefficients Data Single-Cycle Loop ..
Recommended publications
  • EEMBC and the Purposes of Embedded Processor Benchmarking Markus Levy, President
    EEMBC and the Purposes of Embedded Processor Benchmarking Markus Levy, President ISPASS 2005 Certified Performance Analysis for Embedded Systems Designers EEMBC: A Historical Perspective • Began as an EDN Magazine project in April 1997 • Replace Dhrystone • Have meaningful measure for explaining processor behavior • Developed business model • Invited worldwide processor vendors • A consortium was born 1 EEMBC Membership • Board Member • Membership Dues: $30,000 (1st year); $16,000 (subsequent years) • Access and Participation on ALL Subcommittees • Full Voting Rights • Subcommittee Member • Membership Dues Are Subcommittee Specific • Access to Specific Benchmarks • Technical Involvement Within Subcommittee • Help Determine Next Generation Benchmarks • Special Academic Membership EEMBC Philosophy: Standardized Benchmarks and Certified Scores • Member derived benchmarks • Determine the standard, the process, and the benchmarks • Open to industry feedback • Ensures all processor/compiler vendors are running the same tests • Certification process ensures credibility • All benchmark scores officially validated before publication • The entire benchmark environment must be disclosed 2 Embedded Industry: Represented by Very Diverse Applications • Networking • Storage, low- and high-end routers, switches • Consumer • Games, set top boxes, car navigation, smartcards • Wireless • Cellular, routers • Office Automation • Printers, copiers, imaging • Automotive • Engine control, Telematics Traditional Division of Embedded Applications Low High Power
    [Show full text]
  • Cloud Workbench a Web-Based Framework for Benchmarking Cloud Services
    Bachelor August 12, 2014 Cloud WorkBench A Web-Based Framework for Benchmarking Cloud Services Joel Scheuner of Muensterlingen, Switzerland (10-741-494) supervised by Prof. Dr. Harald C. Gall Philipp Leitner, Jürgen Cito software evolution & architecture lab Bachelor Cloud WorkBench A Web-Based Framework for Benchmarking Cloud Services Joel Scheuner software evolution & architecture lab Bachelor Author: Joel Scheuner, [email protected] Project period: 04.03.2014 - 14.08.2014 Software Evolution & Architecture Lab Department of Informatics, University of Zurich Acknowledgements This bachelor thesis constitutes the last milestone on the way to my first academic graduation. I would like to thank all the people who accompanied and supported me in the last four years of study and work in industry paving the way to this thesis. My personal thanks go to my parents supporting me in many ways. The decision to choose a complex topic in an area where I had no personal experience in advance, neither theoretically nor technologically, made the past four and a half months challenging, demanding, work-intensive, but also very educational which is what remains in good memory afterwards. Regarding this thesis, I would like to offer special thanks to Philipp Leitner who assisted me during the whole process and gave valuable advices. Moreover, I want to thank Jürgen Cito for his mainly technologically-related recommendations, Rita Erne for her time spent with me on language review sessions, and Prof. Harald Gall for giving me the opportunity to write this thesis at the Software Evolution & Architecture Lab at the University of Zurich and providing fundings and infrastructure to realize this thesis.
    [Show full text]
  • Automatic Benchmark Profiling Through Advanced Trace Analysis Alexis Martin, Vania Marangozova-Martin
    Automatic Benchmark Profiling through Advanced Trace Analysis Alexis Martin, Vania Marangozova-Martin To cite this version: Alexis Martin, Vania Marangozova-Martin. Automatic Benchmark Profiling through Advanced Trace Analysis. [Research Report] RR-8889, Inria - Research Centre Grenoble – Rhône-Alpes; Université Grenoble Alpes; CNRS. 2016. hal-01292618 HAL Id: hal-01292618 https://hal.inria.fr/hal-01292618 Submitted on 24 Mar 2016 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Automatic Benchmark Profiling through Advanced Trace Analysis Alexis Martin , Vania Marangozova-Martin RESEARCH REPORT N° 8889 March 23, 2016 Project-Team Polaris ISSN 0249-6399 ISRN INRIA/RR--8889--FR+ENG Automatic Benchmark Profiling through Advanced Trace Analysis Alexis Martin ∗ † ‡, Vania Marangozova-Martin ∗ † ‡ Project-Team Polaris Research Report n° 8889 — March 23, 2016 — 15 pages Abstract: Benchmarking has proven to be crucial for the investigation of the behavior and performances of a system. However, the choice of relevant benchmarks still remains a challenge. To help the process of comparing and choosing among benchmarks, we propose a solution for automatic benchmark profiling. It computes unified benchmark profiles reflecting benchmarks’ duration, function repartition, stability, CPU efficiency, parallelization and memory usage.
    [Show full text]
  • DSP56303 Product Brief, Rev
    Freescale Semiconductor DSP56303PB Product Brief Rev. 2, 2/2005 DSP56303 24-Bit Digital Signal Processor 16 6 6 3 Memory Expansion Area The DSP56303 is intended Triple X Data HI08 ESSI SCI PrograM Y Data for use in telecommunication Timer RAM RAM RAM 4096 × 24 2048 × 24 2048 × 24 applications, such as multi- bits bits bits line voice/data/ fax (default) (default) (default) processing, video Peripheral Expansion Area conferencing, audio PM_EB XM_EB PIO_EB YAB YM_EB applications, control, and Address External 18 Generation XAB Address general digital signal Unit PAB Bus Address Six-Channel DAB Switch processing. DMA Unit External 24-Bit Bus 13 Bootstrap DSP56300 Interface ROM and Inst. Core Cache Control Control DDB 24 Internal YDB External Data XDB Data Bus Bus PDB Switch Data Switch GDB EXTAL Power Clock Management Data ALU 5 Generator Program Program Program × + → XTAL Interrupt Decode Address 24 24 56 56-bit MAC JTAG PLL Two 56-bit Accumulators Controller Controller Generator 56-bit Barrel Shifter OnCE™ DE 2 MODA/IRQA MODB/IRQB RESET MODC/IRQC PINIT/NMI MODD/IRQD Figure 1. DSP56303 Block Diagram The DSP56303 is a member of the DSP56300 core family of programmable CMOS DSPs. Significant architectural features of the DSP56300 core family include a barrel shifter, 24-bit addressing, instruction cache, and DMA. The DSP56303 offers 100 million multiply-accumulates per second (MMACS) using an internal 100 MHz clock at 3.0–3.6 volts. The DSP56300 core family offers a rich instruction set and low power dissipation, as well as increasing levels of speed and power to enable wireless, telecommunications, and multimedia products.
    [Show full text]
  • Low Power Asynchronous Digital Signal Processing
    LOW POWER ASYNCHRONOUS DIGITAL SIGNAL PROCESSING A thesis submitted to the University of Manchester for the degree of Doctor of Philosophy in the Faculty of Science & Engineering October 2000 Michael John George Lewis Department of Computer Science 1 Contents Chapter 1: Introduction ....................................................................................14 Digital Signal Processing ...............................................................................15 Evolution of digital signal processors ....................................................17 Architectural features of modern DSPs .........................................................19 High performance multiplier circuits .....................................................20 Memory architecture ..............................................................................21 Data address generation .........................................................................21 Loop management ..................................................................................23 Numerical precision, overflows and rounding .......................................24 Architecture of the GSM Mobile Phone System ...........................................25 Channel equalization ..............................................................................28 Error correction and Viterbi decoding ...................................................29 Speech transcoding ................................................................................31 Half-rate and enhanced
    [Show full text]
  • Automatic Benchmark Profiling Through Advanced Trace Analysis
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Hal - Université Grenoble Alpes Automatic Benchmark Profiling through Advanced Trace Analysis Alexis Martin, Vania Marangozova-Martin To cite this version: Alexis Martin, Vania Marangozova-Martin. Automatic Benchmark Profiling through Advanced Trace Analysis. [Research Report] RR-8889, Inria - Research Centre Grenoble { Rh^one-Alpes; Universit´eGrenoble Alpes; CNRS. 2016. <hal-01292618> HAL Id: hal-01292618 https://hal.inria.fr/hal-01292618 Submitted on 24 Mar 2016 HAL is a multi-disciplinary open access L'archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destin´eeau d´ep^otet `ala diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publi´esou non, lished or not. The documents may come from ´emanant des ´etablissements d'enseignement et de teaching and research institutions in France or recherche fran¸caisou ´etrangers,des laboratoires abroad, or from public or private research centers. publics ou priv´es. Automatic Benchmark Profiling through Advanced Trace Analysis Alexis Martin , Vania Marangozova-Martin RESEARCH REPORT N° 8889 March 23, 2016 Project-Team Polaris ISSN 0249-6399 ISRN INRIA/RR--8889--FR+ENG Automatic Benchmark Profiling through Advanced Trace Analysis Alexis Martin ∗ † ‡, Vania Marangozova-Martin ∗ † ‡ Project-Team Polaris Research Report n° 8889 — March 23, 2016 — 15 pages Abstract: Benchmarking has proven to be crucial for the investigation of the behavior and performances of a system. However, the choice of relevant benchmarks still remains a challenge. To help the process of comparing and choosing among benchmarks, we propose a solution for automatic benchmark profiling.
    [Show full text]
  • Rlsc & DSP Advanced Microprocessor System Design
    Purdue University Purdue e-Pubs ECE Technical Reports Electrical and Computer Engineering 3-1-1992 RlSC & DSP Advanced Microprocessor System Design; Sample Projects, Fall 1991 John E. Fredine Purdue University, School of Electrical Engineering Dennis L. Goeckel Purdue University, School of Electrical Engineering David G. Meyer Purdue University, School of Electrical Engineering Stuart E. Sailer Purdue University, School of Electrical Engineering Glenn E. Schmottlach Purdue University, School of Electrical Engineering Follow this and additional works at: http://docs.lib.purdue.edu/ecetr Fredine, John E.; Goeckel, Dennis L.; Meyer, David G.; Sailer, Stuart E.; and Schmottlach, Glenn E., "RlSC & DSP Advanced Microprocessor System Design; Sample Projects, Fall 1991" (1992). ECE Technical Reports. Paper 302. http://docs.lib.purdue.edu/ecetr/302 This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact [email protected] for additional information. RISC & DSP Advanced Microprocessor System Design Sample Projects, Fall 1991 John E. Fredine Dennis L. Goeckel David G. Meyer Stuart E. Sailer Glenn E. Schmottlach TR-EE 92- 11 March 1992 School of Electrical Engineering Purdue University West Lafayette, Indiana 47907 RlSC & DSP Advanced Microprocessor System Design Sample Projects, Fall 1991 John E. Fredine Dennis L. Goeckel David G. Meyer Stuart E. Sailer Glenn E. Schrnottlach School of Electrical Engineering Purdue University West Lafayette, Indiana 47907 Table of Contents Abstract ...................................................................................................................
    [Show full text]
  • BOOM): an Industry- Competitive, Synthesizable, Parameterized RISC-V Processor
    The Berkeley Out-of-Order Machine (BOOM): An Industry- Competitive, Synthesizable, Parameterized RISC-V Processor Christopher Celio David A. Patterson Krste Asanović Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2015-167 http://www.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-167.html June 13, 2015 Copyright © 2015, by the author(s). All rights reserved. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission. The Berkeley Out-of-Order Machine (BOOM): An Industry-Competitive, Synthesizable, Parameterized RISC-V Processor Christopher Celio, David Patterson, and Krste Asanovic´ University of California, Berkeley, California 94720–1770 [email protected] BOOM is a work-in-progress. Results shown are prelimi- nary and subject to change as of 2015 June. I$ L1 D$ (32k) L2 data 1. The Berkeley Out-of-Order Machine BOOM is a synthesizable, parameterized, superscalar out- exe of-order RISC-V core designed to serve as the prototypical baseline processor for future micro-architectural studies of uncore regfile out-of-order processors. Our goal is to provide a readable, issue open-source implementation for use in education, research, exe and industry. uncore BOOM is written in roughly 9,000 lines of the hardware L2 data (256k) construction language Chisel.
    [Show full text]
  • A Free, Commercially Representative Embedded Benchmark Suite
    MiBench: A free, commercially representative embedded benchmark suite Matthew R. Guthaus, Jeffrey S. Ringenberg, Dan Ernst, Todd M. Austin, Trevor Mudge, Richard B. Brown {mguthaus,jringenb,ernstd,taustin,tnm,brown}@eecs.umich.edu The University of Michigan Electrical Engineering and Computer Science 1301 Beal Ave., Ann Arbor, MI 48109-2122 Abstract Among the common features of these This paper examines a set of commercially microarchitectures are deep pipelines, significant representative embedded programs and compares them instruction level parallelism, sophisticated branch to an existing benchmark suite, SPEC2000. A new prediction schemes, and large caches. version of SimpleScalar that has been adapted to the Although this class of machines has been the chief ARM instruction set is used to characterize the focus of the computer architecture community, performance of the benchmarks using configurations relatively few microprocessors are employed in this similar to current and next generation embedded market segment. The vast majority of microprocessors processors. Several characteristics distinguish the are employed in embedded applications [8]. Although representative embedded programs from the existing many are just inexpensive microcontrollers, their SPEC benchmarks including instruction distribution, combined sales account for nearly half of all memory behavior, and available parallelism. The microprocessor revenue. Furthermore, the embedded embedded benchmarks, called MiBench, are freely application domain is the fastest growing market available to all researchers. segment in the microprocessor industry. The wide range of applications makes it difficult to 1. Introduction characterize the embedded domain. In fact, an Performance based design has made benchmarking embedded benchmark suite should reflect this by a critical part of the design process [1].
    [Show full text]
  • Low-Power High Performance Computing
    Low-Power High Performance Computing Panagiotis Kritikakos August 16, 2011 MSc in High Performance Computing The University of Edinburgh Year of Presentation: 2011 Abstract The emerging development of computer systems to be used for HPC require a change in the architecture for processors. New design approaches and technologies need to be embraced by the HPC community for making a case for new approaches in system design for making it possible to be used for Exascale Supercomputers within the next two decades, as well as to reduce the CO2 emissions of supercomputers and scientific clusters, leading to greener computing. Power is listed as one of the most important issues and constraint for future Exascale systems. In this project we build a hybrid cluster, investigating, measuring and evaluating the performance of low-power CPUs, such as Intel Atom and ARM (Marvell 88F6281) against commodity Intel Xeon CPU that can be found within standard HPC and data-center clusters. Three main factors are considered: computational performance and efficiency, power efficiency and porting effort. Contents 1 Introduction 1 1.1 Report organisation . 2 2 Background 3 2.1 RISC versus CISC . 3 2.2 HPC Architectures . 4 2.2.1 System architectures . 4 2.2.2 Memory architectures . 5 2.3 Power issues in modern HPC systems . 9 2.4 Energy and application efficiency . 10 3 Literature review 12 3.1 Green500 . 12 3.2 Supercomputing in Small Spaces (SSS) . 12 3.3 The AppleTV Cluster . 13 3.4 Sony Playstation 3 Cluster . 13 3.5 Microsoft XBox Cluster . 14 3.6 IBM BlueGene/Q .
    [Show full text]
  • The DENX U-Boot and Linux Guide (DULG) for Canyonlands
    The DENX U-Boot and Linux Guide (DULG) for canyonlands Table of contents: • 1. Abstract • 2. Introduction ♦ 2.1. Copyright ♦ 2.2. Disclaimer ♦ 2.3. Availability ♦ 2.4. Credits ♦ 2.5. Translations ♦ 2.6. Feedback ♦ 2.7. Conventions • 3. Embedded Linux Development Kit ♦ 3.1. ELDK Availability ♦ 3.2. ELDK Getting Help ♦ 3.3. Supported Host Systems ♦ 3.4. Supported Target Architectures ♦ 3.5. Installation ◊ 3.5.1. Product Packaging ◊ 3.5.2. Downloading the ELDK ◊ 3.5.3. Initial Installation ◊ 3.5.4. Installation and Removal of Individual Packages ◊ 3.5.5. Removal of the Entire Installation ♦ 3.6. Working with ELDK ◊ 3.6.1. Switching Between Multiple Installations ♦ 3.7. Mounting Target Components via NFS ♦ 3.8. Rebuilding ELDK Components ◊ 3.8.1. ELDK Source Distribution ◊ 3.8.2. Rebuilding Target Packages ◊ 3.8.3. Rebuilding ELDT Packages ♦ 3.9. ELDK Packages ◊ 3.9.1. List of ELDT Packages ◊ 3.9.2. List of Target Packages ♦ 3.10. Rebuilding the ELDK from Scratch ◊ 3.10.1. ELDK Build Process Overview ◊ 3.10.2. Setting Up ELDK Build Environment ◊ 3.10.3. build.sh Usage ◊ 3.10.4. Format of the cpkgs.lst and tpkgs.lst Files ♦ 3.11. Notes for Solaris 2.x Host Environment • 4. System Setup ♦ 4.1. Serial Console Access ♦ 4.2. Configuring the "cu" command ♦ 4.3. Configuring the "kermit" command ♦ 4.4. Using the "minicom" program ♦ 4.5. Permission Denied Problems ♦ 4.6. Configuration of a TFTP Server ♦ 4.7. Configuration of a BOOTP / DHCP Server ♦ 4.8. Configuring a NFS Server • 5.
    [Show full text]
  • Exploring Coremark™ – a Benchmark Maximizing Simplicity and Efficacy by Shay Gal-On and Markus Levy
    Exploring CoreMark™ – A Benchmark Maximizing Simplicity and Efficacy By Shay Gal-On and Markus Levy There have been many attempts to provide a single number that can totally quantify the ability of a CPU. Be it MHz, MOPS, MFLOPS - all are simple to derive but misleading when looking at actual performance potential. Dhrystone was the first attempt to tie a performance indicator, namely DMIPS, to execution of real code - a good attempt, which has long served the industry, but is no longer meaningful. BogoMIPS attempts to measure how fast a CPU can “do nothing”, for what that is worth. The need still exists for a simple and standardized benchmark that provides meaningful information about the CPU core - introducing CoreMark, available for free download from www.coremark.org. CoreMark ties a performance indicator to execution of simple code, but rather than being entirely arbitrary and synthetic, the code for the benchmark uses basic data structures and algorithms that are common in practically any application. Furthermore, EEMBC carefully chose the CoreMark implementation such that all computations are driven by run-time provided values to prevent code elimination during compile time optimization. CoreMark also sets specific rules about how to run the code and report results, thereby eliminating inconsistencies. CoreMark Composition To appreciate the value of CoreMark, it’s worthwhile to dissect its composition, which in general is comprised of lists, strings, and arrays (matrixes to be exact). Lists commonly exercise pointers and are also characterized by non-serial memory access patterns. In terms of testing the core of a CPU, list processing predominantly tests how fast data can be used to scan through the list.
    [Show full text]