Embedded DSP Processor Design: Application Specific Instruction Set Processors

“Liu: fm-p374123” — 2008/5/6 — 12:00 — pagei—#1 The Morgan Kaufmann Series in Systems on Silicon Series Editor: Wayne Wolf, Georgia Institute of Technology

The Designer’s Guide to VHDL, Second Edition
Peter J. Ashenden

The System Designer’s Guide to VHDL-AMS
Peter J. Ashenden, Gregory D. Peterson, and Darrell A. Teegarden

Modeling Embedded Systems and SoCs
Axel Jantsch

ASIC and FPGA Verification: A Guide to Component Modeling
Richard Munden

Multiprocessor Systems-on-Chips
Edited by Ahmed Amine Jerraya and Wayne Wolf

Functional Verification
Bruce Wile, John Goss, and Wolfgang Roesner

Customizable and Configurable Embedded Processors
Edited by Paolo Ienne and Rainer Leupers

Networks-on-Chips: Technology and Tools
Edited by Giovanni De Micheli and Luca Benini

VLSI Test Principles & Architectures
Edited by Laung-Terng Wang, Cheng-Wen Wu, and Xiaoqing Wen

Designing SoCs with Configured Processors
Steve Leibson

ESL Design and Verification
Grant Martin, Andrew Piziali, and Brian Bailey

Aspect-Oriented Programming with e
David Robinson

Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation
Edited by Scott Hauck and André DeHon

System-on-Chip Test Architectures
Edited by Laung-Terng Wang, Charles Stroud, and Nur Touba

Verification Techniques for System-Level Design
Masahiro Fujita, Indradeep Ghosh, and Mukul Prasad

VHDL-2008: Just the New Stuff
Peter J. Ashenden and Jim Lewis

On-Chip Communication Architectures: System on Chip Interconnect
Sudeep Pasricha and Nikil Dutt

Embedded DSP Processor Design: Application Specific Instruction Set Processors
Dake Liu

Processor Description Languages: Applications and Methodologies
Edited by Prabhat Mishra and Nikil Dutt

Embedded DSP Processor Design
Application Specific Instruction Set Processors

Dake Liu

AMSTERDAM • BOSTON • HEIDELBERG • LONDON NEW YORK • OXFORD • PARIS • SAN DIEGO SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Morgan Kaufmann Publishers is an imprint of Elsevier

Morgan Kaufmann Publishers is an imprint of Elsevier. 30 Corporate Drive, Suite 400, Burlington, MA 01803, USA

This book is printed on acid-free paper.

Copyright © 2008 by Elsevier Inc. All rights reserved.

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters. All trademarks that appear or are otherwise referred to in this work belong to their respective owners. Neither Morgan Kaufmann Publishers nor the authors and other contributors of this work have any relationship or affiliation with such trademark owners nor do such trademark owners confirm, endorse or approve the contents of this work. Readers, however, should contact the appropriate companies for more information regarding trademarks and any related registrations.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopying, scanning, or otherwise—without prior written permission of the publisher.

Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, E-mail: [email protected]. You may also complete your request online via the Elsevier homepage (http://elsevier.com), by selecting “Support & Contact” then “Copyright and Permission” and then “Obtaining Permissions.”

Library of Congress Cataloging-in-Publication Data
Liu, Dake, 1957-
Embedded DSP processor design: application specific instruction set processors / Dake Liu.
p. cm. – (The Morgan Kaufmann series in systems on silicon)
Includes index.
ISBN 978-0-12-374123-3
1. Embedded systems. 2. Signal processing–Digital techniques. 3. Digital integrated circuits. 4. Application-specific integrated circuits. I. Title.
TK7895.E42D35 2008
621.39’16–dc22
2008012910

ISBN: 978-0-12-374123-3

For information on all Morgan Kaufmann publications, visit our website at www.mkp.com or www.books.elsevier.com

Printed in the United States of America
08 09 10 11 12   5 4 3 2 1

To Meiying and Angie

Contents

Preface
List of Trademarks and Product Names

CHAPTER 1 Introduction
1.1 How to Read the Book
1.2 DSP Theory for Hardware
1.2.1 Review of DSP Theory and Fundamentals
1.2.2 ADC and Finite-Length Modeling
1.2.3 Digital Filters
1.2.4 Transform
1.2.5 Adaptive Filter and Signal Enhancement
1.2.6 Random and Autocorrelation
1.3 Theory, Applications, and Implementations
1.4 DSP Applications
1.4.1 Real-Time Concept
1.4.2 Communication Systems
1.4.3 Multimedia Signal Processing Systems
1.4.4 Review on Applications
1.5 DSP Implementations
1.5.1 DSP Implementation on GPP
1.5.2 DSP Implementation on GP DSP Processors
1.5.3 DSP Implementation on ASIP
1.5.4 DSP Implementation on ASIC
1.5.5 Trade-off and Decision of Implementations
1.6 Review of Processors and Systems
1.6.1 DSP Processor
1.6.2 DSP Firmware
1.6.3 Overview
1.6.4 DSP in an Embedded System
1.6.5 Fundamentals of Embedded Computing
1.7 Design Flow
1.7.1 Hardware Design Flow in General
1.7.2 ASIP Hardware Design Flow
1.7.3 ASIP Design Automation
1.8 Conclusions
Exercises
References


CHAPTER 2 Numerical Representation and Finite-Length DSP
2.1 Fixed-Point Numerical Representation
2.1.1 An Intuitive Example
2.1.2 Fixed-Point Numerical Representation
2.1.3 Fixed-Point Binary Representation
2.1.4 Integer Binary Representation
2.1.5 Fractional Binary Representation
2.1.6 Fixed-Point Operands
2.1.7 Integer or Fractional
2.1.8 Other Binary Data Formats
2.2 Data Quality Measure
2.2.1 Noise, Distortion, Dynamic Range, and Precision
2.2.2 Quantitative Concept of Dynamic Range and Precision
2.3 Floating-Point Numerical Representation
2.4 Block Floating-Point
2.5 DSP Based on Finite Precision
2.5.1 The Way of Quantization: Rounding and Truncation
2.5.2 Overflow Saturation and Guards
2.5.3 Requirements on Guards
2.5.4 Execution Order
2.6 Examples of Corner Cases
2.7 Conclusions
Exercises
References

CHAPTER 3 DSP Architectures
3.1 DSP Subsystem Architecture
3.2 Processor Architecture
3.2.1 Inside a DSP Subsystem
3.2.2 DSP (Memory ) Architecture
3.2.3 Functional Description at Top Architecture Level
3.2.4 DSP Architecture Design
3.3 Inside a DSP Core
3.3.1 The and Register Bus
3.3.2 MAC
3.3.3 ALU
3.3.4
3.3.5 Control Path
3.3.6 Address Generator (AGU)
3.4 The Difference between GPP and ASIP DSP
3.4.1 The Difference between Designing a GPP and ASIP DSP
3.4.2 Comparing DSP Processors to Other Processors
3.4.3 CISC or RISC


3.5 Advanced DSP Architecture
3.5.1 DSP with Extreme Specification
3.5.2 ILP DSP Processors
3.5.3 Dual MAC and SIMD
3.5.4 VLIW and Superscalar
3.5.5 On-Chip Multicore DSP
3.6 Conclusions
Exercises
References

CHAPTER 4 DSP ASIP Design Flow
4.1 Design and Use of ASIP
4.1.1 What Is ASIP?
4.1.2 DSP ASIP Design Flow
4.2 Understanding Applications Through Profiling
4.3 Architecture Selection
4.3.1 General Methodology
4.3.2 Architectures
4.3.3 Quantitative Approach
4.4 Designing Instruction Sets
4.5 Designing the Toolchain
4.6 Design
4.7 Firmware Design
4.7.1 Real-time Firmware
4.7.2 Firmware with Finite Precision
4.7.3 Firmware Design Flow for One Application
4.7.4 Firmware Design Flow for Multiapplications
4.8 Conclusions
Exercises
References

CHAPTER 5 A Simple DSP Core: The Junior Processor
5.1 Junior: A Simple DSP Processor
5.2 Instruction Set and Operations
5.2.1 Load/Store Instructions
5.2.2 Addressing for Data Memory Access
5.2.3 Instructions for Basic Arithmetic Operations
5.2.4 Logic and Shift Operations
5.2.5 Program Flow Control Instructions
5.3 Assembly Coding
5.4 Assembly Benchmarking
5.4.1 Benchmarking of Block Transfer
5.4.2 Benchmarking of Single-Sample FIR
5.4.3 Benchmarking of Frame FIR
5.4.4 Benchmarking of Single-Sample Biquad IIR


5.4.5 Benchmarking of 16-bit Division
5.4.6 Benchmarking of Vector Maximum Tracking
5.4.7 Benchmarking of 8 × 8 DCT
5.4.8 Benchmarking of 256-point FFT
5.4.9 Benchmarking of Windowing
5.5 Discussion of Junior DSP
5.6 Conclusions
Exercises
References

CHAPTER 6 Code Profiling for ASIP Design
6.1 Source Code Profiling
6.1.1 What Is Source Code Profiling?
6.1.2 Why Profiling?
6.1.3 What to Profile
6.1.4 How to Profile
6.1.5 The Language to Profile
6.2 Static Profiling
6.2.1 Dynamic and Static Profiling
6.2.2 Static Profiling
6.2.3 Fine-grained Static Profiling
6.2.4 Coarse-grained Static Profiling
6.3 Dynamic Profiling
6.3.1 Instrumentation for Coarse-grained Profiling
6.3.2 Instrumentation for Fine-grained Profiling
6.3.3 Implement Instrumentation
6.4 Use of Reference Assembly Codes
6.4.1 Expose Hidden Costs
6.4.2 Understanding Assembly Codes
6.5 Quality Evaluation of Results
6.5.1 Evaluating Results of Source Code Profiling
6.5.2 Using Profiling Results
6.6 Conclusions
Exercises
References

CHAPTER 7 Assembly Instruction Set Design
7.1 Methodology
7.1.1 Opportunities and Constraints
7.1.2 Classification of General Instructions
7.1.3 Design of General RISC Subset Instructions
7.1.4 Specify CISC Instructions
7.1.5 For Undergraduates: From Junior to Senior
7.2 Designing RISC Subset Instructions


7.2.1 Data Access Instructions
7.2.2 Basic Arithmetic Instructions
7.2.3 Unsigned ALU Instructions
7.2.4 Program Flow Control Instructions
7.3 CISC Subset Instructions
7.3.1 MAC and Multiplication Instructions
7.3.2 Double-Precision Arithmetic Instructions
7.3.3 Other CISC Instructions
7.4 Accelerated Extensions
7.4.1 Challenges
7.4.2 Methodology
7.5 Instructions for Instruction Level Parallel (ILP) Architecture
7.5.1 Superscalar
7.5.2 VLIW Instructions
7.5.3 SIMD Instructions
7.6 Memory and Register Addressing
7.6.1 Register Addressing
7.6.2 Data Memory Addressing
7.6.3 Hardware Accelerated Memory Addressing
7.7 Coding
7.7.1 Assembly Encoding
7.7.2 Machine Code Coding
7.7.3 Examples
7.8 Conclusions
Exercises
References

CHAPTER 8 Software Development Toolchain
8.1 What Is Toolchain and IDE?
8.1.1 ASIP User’s View on IDE
8.1.2 ASIP Designer’s View on IDE
8.2 Code Analysis
8.2.1 Lexical Analysis
8.2.2 Syntax Analysis
8.2.3 Semantic Analysis
8.3 Profiler and WCET Analyzer
8.4 Compiler Overview
8.4.1 Intermediate Code Generation
8.4.2 Code Optimization
8.4.3 Code Generation
8.4.4 Error Handler
8.4.5 Compiler Generator and Verification of a Generated Compiler
8.5 Assembler


8.6 Linker
8.7 Simulator and Basics
8.7.1 Instruction Set Simulator (ISS)
8.7.2 Processor Simulator
8.7.3 Architecture Simulator
8.8 Debugger and GUI
8.8.1 Debugger
8.8.2 SW Debugging
8.8.3 GUI
8.9 Evaluation of Programming Tools
8.10 Conclusions
Exercises
References

CHAPTER 9 Evaluation of an Instruction Set
9.1 Benchmarking
9.1.1 Benchmarking DSP Kernel
9.1.2 Some Benchmarking Examples
9.2 Instruction Use Profiling
9.3 Coverage Analysis
9.4 Conclusions
References

CHAPTER 10 Design of DSP Microarchitecture
10.1 Introduction to Microarchitecture
10.1.1 Microarchitecture versus Architecture
10.1.2 Microarchitecture Design
10.2 Microarchitecture-level Components
10.2.1 Basic Logic Components
10.2.2 Arithmetic Components
10.3 Hardware Design Fundamentals
10.3.1 Function Partitioning
10.3.2 Function Allocation
10.3.3 HW Multiplexing
10.3.4 Scheduling of Hardware Execution
10.3.5 Modeling and Simulation
10.4 Functional Specification at Microarchitecture Level
10.4.1 Intermodule Block Diagram
10.4.2 Microarchitecture Schematic
10.4.3 Module Functional Flowchart
10.4.4 Finite State Machine
10.4.5 Truth Table for Coding and Decoding
10.5 ASIP Microarchitecture Design Flow
10.5.1 Exposing Microoperations


10.5.2 Allocation and Partitioning of Microoperations
10.5.3 Scheduling Microoperations
10.5.4 HW Multiplexing of Microoperations
10.5.5 Microoperations Integration
10.6 Conclusions
Exercises
References

CHAPTER 11 Design of Register File and Register Bus
11.1 Datapath
11.2 Design of Register Files
11.2.1 General Register File
11.2.2 Design of a Simple Register File
11.2.3 Pipeline around Register File
11.2.4 Special Registers in a General Register File
11.3 Design of Advanced Register Files
11.3.1 Register File for Cluster Datapath
11.3.2 Ultra Large Register File
11.4 Conclusions
Exercises
References

CHAPTER 12 ALU HW Implementation
12.1 Arithmetic and Logic Unit (ALU)
12.2 Design of Arithmetic Unit (AU)
12.2.1 Implementation Methodology
12.2.2 Select Kernel Components
12.2.3 Implementing Simple AU Instructions
12.2.4 Implementing Special AU Instructions
12.3 Shift and Rotation
12.3.1 Design a Shifter Using a Shifter Primitive
12.3.2 Design a Shifter Using Truth Tables
12.3.3 Logic Operation and Data Manipulation
12.4 ALU Integration
12.4.1 Preprocessing and Postprocessing
12.4.2 ALU Integration
12.5 Conclusions
Exercises
References

CHAPTER 13 MAC Hardware Implementation
13.1 Introduction
13.1.1 Review of Convolution
13.1.2 MAC Fundamentals


13.2 MAC Implementation
13.2.1 MAC Instructions
13.2.2 Implementing Multiplications
13.2.3 Implementing MAC Instructions
13.2.4 Implementing Double-Precision Instructions
13.2.5 Accessing ACR Context
13.2.6 Flag Operations and Other Postoperations
13.3 A MAC Design Case
13.4 MAC Integrations
13.4.1 Physical Critical-Path
13.4.2 Pipeline in a MAC
13.5 Dual MAC, Multiple MAC, and VLIW
13.6 Conclusions
Exercises
References

CHAPTER 14 Control Path Design
14.1 Control Paths
14.2 Control Path Organization
14.2.1 Pipeline Consideration
14.2.2 Interrupt Management
14.3 Control Path Hardware Design
14.3.1 Top-level Structure
14.3.2 Design of Program Memory and Peripherals
14.3.3 Loading Code
14.3.4 Instruction Flow Controller
14.3.5 Loop Controller
14.3.6 PC Stack
14.3.7 Senior PC FSM Example
14.4 Instruction Decoder
14.4.1 Control Signal Decoding
14.4.2 Decoding Order
14.4.3 Decoding for Exception, Interrupt, Jump, and Conditional Execution
14.4.4 Issues of Multicycle Execution
14.4.5 VLIW Machine Decoding
14.4.6 Decoding for Superscalar
14.5 Conclusions
Exercises
References

CHAPTER 15 Design of Memory Subsystems
15.1 Memory and Peripherals
15.1.1 Memory Modules
15.1.2 Memory Peripheral Circuits


15.2 Design of Memory Addressing Circuitry
15.2.1 General Addressing Circuit
15.2.2 Modulo Addressing Circuit
15.3 Buses
15.4
15.4.1 Problems
15.4.2 Memory Hierarchy of DSP Processors
15.5 DMA
15.5.1 DMA Concepts
15.5.2 Configuring a Program for a DMA Task
15.5.3 A SoC View
15.6 Conclusions
Exercises
References

CHAPTER 16 DSP Core Peripherals
16.1 Peripherals
16.2 Design a Peripheral Module
16.2.1 Design of a Common Interface in Peripheral Modules
16.2.2 Protocol Design of Peripheral Modules
16.3 Interrupt Handler
16.3.1 Interrupt Basics
16.3.2 Interrupt Sources
16.3.3 Interrupt Requests
16.3.4 Interrupt Handling Process
16.3.5 A Case Study
16.4 Timers
16.5 Direct Memory Access (DMA)
16.5.1 DMA Basics
16.5.2 Design a Simple DMA
16.5.3 Advanced DMA Controller
16.5.4 DMA Benchmarking
16.6 Serial Ports
16.6.1 Bit Synchronization
16.6.2 Packet Synchronization
16.6.3 Arbitration
16.6.4 Control of a Serial Port
16.7 Parallel Ports
16.8 Conclusions
Exercises
References


CHAPTER 17 Design for DSP Functional Acceleration
17.1 Functional Acceleration
17.1.1 Loosely Connected Accelerator
17.1.2 Tightly Connected Accelerator
17.2 Accelerator Specification
17.2.1 Principle
17.2.2 An Accelerator with One Single Instruction
17.2.3 An Accelerator with Multiple Instructions
17.2.4 An Accelerator as a Slave Processor
17.3 Scalable Processor and Accelerator Interface
17.3.1 Configurability and Extendibility
17.3.2 Extendible Hardware Interface
17.3.3 Extendible Programmer Tools
17.4 Accelerator Design Flow
17.5 Conclusions
Exercises
References

CHAPTER 18 Real-time Fixed-point DSP Firmware
18.1 Firmware (FW)
18.2 Application Modeling Under HW Constraints
18.2.1 Understanding Applications
18.2.2 Understanding Hardware
18.2.3 Selection
18.2.4 Language Selection
18.2.5 Real-time Firmware Implementation
18.2.6 Firmware for Fixed-point Data
18.3 Assembly Implementation
18.3.1 General Flow and C-Compiling
18.3.2 Plan and Specify for Assembly Coding
18.3.3 Fixed-point Assembly Kernels
18.3.4 Low Cycle Cost Assembly Coding
18.3.5 Storage Efficient Assembly Kernels
18.3.6 Function Libraries
18.3.7 Optimize Control Codes
18.4 Assembly-level Integration and Release
18.5 Conclusions
References

CHAPTER 19 ASIP Integration and Verification
19.1 Integration
19.1.1 HW Integration of an ASIP Core
19.1.2 Integration of a DSP Subsystem and a DSP Processor


19.1.3 HW Integration of a SoC
19.1.4 Integration of SoC Simulator
19.2 Functional Verification
19.2.1 The Basics
19.2.2 Verification Process
19.2.3 Verification Techniques
19.2.4 Speed-up Verification
19.2.5 Simulation or Emulation
19.2.6 Verification of an ASIP
19.2.7 Writing Testbench
19.3 Conclusions
Exercises
References

CHAPTER 20 Parallel Streaming Signal Processing
20.1 Streaming DSP
20.1.1 Streaming Signals
20.1.2 Parallel Streaming DSP Processors
20.2 Parallel Architecture, Divide and Conquer
20.2.1 Review of Parallel Architectures
20.2.2 Divide and Conquer
20.3 Expose Control Complexities
20.3.1 General Control Handling
20.3.2 Exposing Challenges
20.3.3 SIMT Architecture for Low-level Parallel Applications
20.3.4 Design of Multicore DSP Subsystems
20.4 Streaming Data Manipulations
20.4.1 Data Complexity of Streaming DSP
20.4.2 Data Complexity: Case 1, Video
20.4.3 Data Complexity: Case 2, Radio Baseband
20.5 NoC for Parallel Memory Access
20.5.1
20.5.2 Analyses of Parallel Memory Access for NoC Design
20.6 Parallel Memory Architecture
20.6.1 Requirements for Parallel Algorithms
20.6.2
20.6.3 Ultra-large Register File
20.7 P3RMA for Streaming DSP Processors
20.7.1 Parallel Vector Scratchpad Memories
20.7.2 The Memory Subsystem Hardware


20.7.3 Parallel Programming by Hand
20.7.4 Programming Toolchain for P3RMA
20.8 Conclusions
References

Glossary

Appendix

Index

Preface

In the late 1990s, when I was preparing a course called “Design of Embedded DSP (Digital Signal Processing) Processors” at Linköping University, Sweden, I could not find a textbook describing the fundamentals of embedded processor design. That became my first and prime motivation for writing this book. During my time working in industry, I could not find a suitable and comprehensive reference book either, which became my second motivation. I believe that this book will be a valuable textbook or reference book for anyone interested in the design of embedded systems in all its aspects, from hardware design to firmware design.

Although this book was written mainly for ASIP (application-specific instruction set processor) or ASIC (application-specific integrated circuit) designers, it will also benefit software programmers who want more hardware knowledge, such as DSP application engineers. While reading this book, you will have the opportunity to go through the design of a programmable device for a class of applications, step by step.

The material in this book is suitable for teaching senior undergraduate students and graduate students in Electrical and Computer Engineering. This book can also be used as a reference book for engineers who are designing or want to design application-specific DSP processors, general processors, accelerators, peripheral modules, and even . Embedded system designers (e.g., DSP firmware designers) will also benefit from the knowledge of real-time system design elaborated in this book. Classical CPU designers will benefit from the differences exposed between CPU and ASIP. This book is also a fundamental reference book for researchers.

Fundamental DSP theory and basic digital logic design are addressed in this book as background knowledge. Very basic concepts and methods are used without further introduction. Readers without this fundamental knowledge should read relevant books about DSP theory, logic design, and before reading this book.

DSP, as opposed to general-purpose computing systems, has been a major technology driven by embedded applications and semiconductor technologies. This is evident from the growing market of DSP-based products. The increasing need for DSP and DSP processors can be found everywhere in today’s society, in areas such as multimedia, wireless communications, Internet terminals, car electronics, robotics, healthcare, environment monitoring and control, education, scientific computing, industrial control, transportation, and defense.

DSP is used widely for applications such as data enhancement, data compression, pattern recognition, simulation, emulation, and optimization. Signal recovery in advanced digital communications is a good example of data enhancement. Other data enhancement applications include error correction, echo cancellation, and noise suppression. Data compression is another important area in everyday equipment: for example, voice, music, image, video, and Internet data need to be compressed to fit into a limited bandwidth for transmission and storage. If voice were not compressed by a DSP processor in a mobile phone, the cost of a mobile phone call would be 10 to 15 times higher. If video signals were not compressed, DVD players and digital video broadcasting would not be possible.

Pattern recognition techniques are used for voice recognition, language recognition, and image target recognition for healthcare, car driving, and defense. Physical simulation has been used for gaming, training (education), scientific computing, defense, and experiments that are expensive or even impossible to realize in the real world.

DSP processors and microcontrollers accounted for more than 95% of the total volume of processors sold in 2006. DSP processors for embedded applications have led to a major shift in the semiconductor industry. The sales of DSP processors have reached 20% of the global semiconductor market since 2002. Taking only the DSP processors in mobile phones as an example, total sales in 2006 were more than US$10 billion.

General-purpose DSP processors (commercial off-the-shelf DSP processors) usually have a high degree of flexibility, a friendly design environment, and sufficient design references. They are preferred when requirements on power, performance, and silicon area are not very critical. When these requirements are strict, embedded DSP processors as ASIPs become a necessity. Figure P.1 shows the trend of the different DSP market shares. The figure suggests an exciting future for ASIP DSP; this is my third motivation for writing this book.

Most DSP applications can be categorized as streaming signal processing, in which the processing speed is higher than the speed of incoming signals. Classic DSP hardware for streaming signal processing was usually implemented on nonprogrammable ASICs to minimize the silicon cost.

FIGURE P.1 Trend of different DSP market shares (Freehand DSP, Sweden).

Recently, programmability has become a vital issue because complexity and design costs keep rising. Industry has required programmability in order to support multimode or multistandard applications. Thanks to the ongoing progress of modern VLSI (very large scale integrated circuits) technologies, programmable features have been realistic since the 1990s.

An ongoing trend is that the architecture of MPU (microprocessor as the central processor in personal computers) is converging, whereas the architecture of DSP processors is diverging. One reason is that the applications running on DSPs are diverging. However, the functionality is relatively fixed once a DSP processor is embedded in a system. Another reason is that the requirements on silicon efficiency, power consumption, and design cost are very critical for embedded processors or ASIPs.

General-purpose processor designers think of ultimate performance and ultimate flexibility. The instruction set must be general because the application is unknown and the programmer’s behavior is unknown. ASIP designers have to think about the application and the cost first. Usually the biggest challenge for ASIP designers is efficiency. Based on carefully specified function coverage, the goal of an ASIP design is to reach the highest performance relative to silicon cost, power consumption, and design cost. The flexibility should be sufficient rather than ultimate. The required performance is application-specific rather than the highest achievable. This book will expose and analyze the differences between general-purpose processor designers and ASIP designers.

An ASIP is often a SIP (silicon intellectual property, or silicon IP, or IP). More SoC (system on chip) solutions use ASIP IP. Therefore, the focus of this book is the design of ASIP IP cores for embedded systems on a chip. Silicon IP has been used as a component in silicon chips since the mid-1990s. The quality requirements for silicon IP design became higher after 2000 because silicon IP had become well accepted and widely used. In Figure P.2, the overall design complexity is divided into system design complexity and component design complexity.
Around the middle to late 1980s, RTL components (for example, multipliers and adders) were optimized as the lowest level components of system designs. RTL components took a certain degree of design complexity from the system design so that the system could be relatively more advanced comparing the system designed on a transistor level. During the mid-1990s, the system design became so advanced and complicated that programmable IP has to be used as the lowest level component to relax the system design complexity. Because an IP usually is designed by the third party or another design team, the system complexity can therefore be shared. Also, because an IP usually is designed for multiusers, the design cost usually is shared by multiusers; relatively high IP design cost is therefore acceptable. This is the fourth motivation of writing the book—to show ways to design high quality programmable IP as components for multiusers. The fourth motivation became even more important when the platform-based design concept was introduced recently.A platform is a partly designed application- specific system that can be used to adapt to a custom design with minimum cost. The platform-based system design requires the minimum design cost while plug- ging a programmable IP on the platform and running firmware on it. It means


FIGURE P.2 Handling complexity of the design using ASIP IP and platform.

that the design of an ASIP must be both silicon-efficient and platform-oriented. The platform-adaptive ASIP design skills offered by this book will thus be even more relevant.

I deeply acknowledge the research and teaching contributions of my PhD students in the Division of Computer Engineering, Department of Electrical Engineering at Linköping University, Sweden. The labs of the course and part of the contents of this book are based on their research work. Di Wu and Johan Eilert read through the book and provided a great many suggestions. Per Karlström managed all MS Word problems and figure formatting using his fantastic Word-VBA programming skills. Andreas Ehliar went through all code examples in the book. Per Karlström, Johan Eilert, Andreas Ehliar, Di Wu, and Master students Vinodh Ravinath and Bobo Svangård implemented the Senior DSP processor core. Research engineer Anders (S) Nilsson made the assembler and instruction set simulator of Senior, the processor used as the example in the book. Acknowledgment also goes to other PhD students: Rizwan Asghar, Dr. Anders Nilsson, Dr. Eric Tell, Dr. Tomas Henriksson, Dr. Daniel Wiklund, Dr. Ulf Nordqvist, and Lic. Mikael Olausson. I thank all Master students who participated in the course “Design of Embedded DSP Processors” from 1999 to 2007.

My sincere thanks go to Freehand DSP AB (Ltd.), Sweden (VIA Tech Sweden after 2002), a leading company developing DSP processors for communications and home electronic applications. Special thanks to my friend and boss, CEO Harald Bergh, who went through several chapters and provided very professional and valuable suggestions. I was the cofounder, CTO, and vice president of Freehand


DSP AB (Ltd.), Stockholm, Sweden, from 1999 to 2002; the company was acquired by VIA Technologies in 2002.

I thank Coresonic AB (Ltd.), Linköping, Sweden, a leading DSP core SIP company for programmable radio baseband solutions. I am a cofounder and currently the CTO of this company. I sincerely thank my best friend, Professor Christer Svensson, at the Department of Electrical Engineering (ISY), Linköping University, Sweden, who was my supervisor (1990–1994) during my research toward my doctor of technology degree. Christer is the cofounder and the Chairman of the Board of Coresonic. I also thank cofounders Dr. Eric Tell, Dr. Anders Nilsson, and Daniel Svensson for many useful discussions and much encouragement. All staff of Coresonic AB are gratefully acknowledged.

I gratefully acknowledge the following experts for their insightful discussions: Vodafone chair Professor Gerhard Fettweis, TU Dresden; Professor Christoph Kessler of Linköping University; Professor Lars Svensson of Chalmers University; Professor Viktor Öwall of Lund University; Professor Petru Eles of Linköping University; Dr. Carl-Fredrik Lenderson of Sony Ericsson; Professor Dr. Xiaoning Nie of Infineon Munich; Dr. Franz Dielacher, CSO, Infineon connections Villach; and Infineon fellow Professor Dr. Lajos Gazsi, Düsseldorf.

Finally, my deepest acknowledgment and gratitude go to my dear wife Meiying and my daughter Angie. Without their love, understanding, and support, this book would never have been possible.

Dake Liu
December 2007
Linköping, Sweden

REFERENCES

[1] Strauss, W. (2000). Digital signal processing, the new semiconductor industry technology driver. IEEE Signal Processing Magazine, March, 52–56.
[2] http://www.fwdconcepts.com.
[3] BDTI, DSP selection guide, http://www.bdti.com.
[4] Claasen, T. (2006). An industry perspective on current and future state-of-the-art in system-on-chip (SoC) technology. Proceedings of the IEEE 94(6).
[5] http://www.da.isy.liu.se/~dake.
[6] http://www.viatech.se.
[7] http://www.coresonic.com.

List of Trademarks and Product Names

ADI processors

ARM processors

CEVA DSP processors

Coresonic LeoCore processors

FreehandDSP

Freescale

Infineon Camel processor

Intel Pentium and 8x86 processors

NXP EVP16



OpenRISC of OpenCores

SPI CELL (Sony Panasonic IBM) processors

TI (Texas Instruments) DSP processors

Xilinx FPGA

Tools and programs include MATLAB and Simulink of MathWorks

Design Compiler of Synopsys

LISA and Processor Designer of CoWare

ZSP of LSI

GCC from GNU
