Digital Signal Processing – II
Digital Signal Processing – 8                              December 24, 2009

VIII. DSP Processors

2007 Syllabus: Introduction to programmable DSPs: Multiplier and Multiplier-Accumulator (MAC), modified bus structures and memory access schemes in DSPs, multiple-access memory, multiport memory, VLSI architecture, pipelining, special addressing modes, on-chip peripherals. Architecture of TMS320C5X – Introduction, bus structure, Central Arithmetic Logic Unit, auxiliary registers, index register, auxiliary register compare register, block move address register, Parallel Logic Unit, memory-mapped registers, program controller, some flags in the status registers, on-chip registers, on-chip peripherals.

Contents:
8.1 DSP Processors – Market
8.2 DSP Processors – Features
8.3 Multiply-and-Accumulate
8.4 Interrupts – handling incoming signal values
8.5 Fixed- and Floating-point
8.6 Real-time FIR filter example

We shall use "DSP" to mean Digital Signal Processor(s), also referred to as DSP processors or as programmable digital signal processors (PDSPs). The term covers

1. General-purpose DSPs (such as TI's TMS320 family and Motorola's DSP563 family, among others)
2. Special-purpose DSPs tailored to specific applications such as the FFT

However, the so-called programmability of DSPs pales in comparison to that of general-purpose CPUs. In what follows we mostly contrast general-purpose CPUs (whose strength is general-purpose programmability) with DSPs (whose strength is high-throughput, hardwired number crunching). For a convergence of the two kinds of processor, see the article on Intel's Larrabee in IEEE Spectrum, January 2009.

DSP-8 (DSP Processors)                                        Dr. Ravi Billa

8.1 DSP Processors – Market

1. Small, low-power, relatively weak DSPs – mass-produced consumer products (toys, automobiles); inexpensive
2. More capable fixed-point processors – cell phones, digital answering machines, modems
3. The strongest, often floating-point, DSPs – image and video processing, server applications

8.2 DSP Processors – Features

1. DSP-specific instructions (e.g., MAC)
2. Special address registers
3. Zero-overhead loops
4. Multiple memory buses and banks
5. Instruction pipelines
6. Fast interrupt servicing (fast context switch)
7. Specialized I/O ports
8. Special addressing modes (e.g., bit reversal)

8.3 Multiply-and-Accumulate

DSP algorithms are characterized by intensive number crunching that may exceed the capabilities of a general-purpose CPU. With arithmetic instructions tailored specifically to DSP needs, a DSP processor can be much faster at such tasks. The most common special-purpose task in digital signal processing is the multiply-and-accumulate (MAC) operation, illustrated by the FIR filter

    y(n) = Σ_{j=0}^{M} b_j x(n−j) = b_0 x(n) + b_1 x(n−1) + … + b_M x(n−M)

Such a repeated MAC operation occurs in other situations as well. Further, the operands b and x need not share the same index. Letting b and x have independent indices j and k, the following loop accomplishes the MAC operation:

    Loop: update j, update k
          a ← a + b_j x_k

[Aside: The operation a + (b × c) on floating-point numbers may be done with two roundings (once when b and c are multiplied, and again when the product is added to a), or with just one rounding (the entire expression a + (b × c) evaluated in one step). The latter is called a fused multiply-add (FMA) or fused multiply-accumulate (FMAC) and is included in IEEE Std 754.]
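The repeated MAC can be made concrete with a short sketch in Python (the function name and the convention x(n) = 0 for n < 0 are illustrative assumptions, not part of the original text):

```python
def fir_filter(b, x):
    """Direct-form FIR: y(n) = sum_{j=0}^{M} b[j] * x[n-j], with x[n] = 0 for n < 0."""
    M = len(b) - 1
    y = []
    for n in range(len(x)):
        acc = 0                          # the accumulator 'a'
        for j in range(M + 1):
            if n - j >= 0:
                acc += b[j] * x[n - j]   # one MAC operation per tap
        y.append(acc)
    return y

print(fir_filter([1, 2, 3], [1, 2, 3, 4]))  # [1, 4, 10, 16]
```

Each pass of the inner loop is exactly one "a ← a + b_j x_k" step; a hardware MAC unit collapses that multiply and add into a single instruction.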
Loop overhead

A general-purpose CPU would implement the above sum-of-products operation in a fixed-length loop such as

    For i = 0 to M {statements}

which involves considerable overhead apart from the statements themselves:

Loop overhead (general-purpose CPU)
1. Provide a CPU register to store the loop index
2. Initialize the index register
3. After each pass, increment the loop index and check it for termination
4. (If a CPU register is not available: provide a memory location for the index; retrieve it, increment it, check it, and store it back in memory)
5. Except on the last pass, jump back to the top of the loop

DSP processors provide a zero-overhead hardware mechanism (a REPEAT or DO instruction) that repeats an instruction, or a set of instructions, a prescribed number of times. Because the repetition is supported in hardware, no clock cycles are wasted on branching or on incrementing and checking the loop index. The number of loop iterations is necessarily limited, and where loop nesting is allowed, not all nested loops may be zero-overhead.

Enhancing the CPU architecture

Inside the loop  How would a general-purpose CPU carry out the computations inside the loop? Assume that {b} and {x} are stored as arrays in memory, and that the CPU has pointer registers j and k that can be directly updated and used to retrieve data from memory, two arithmetic registers b and x that serve as operands of arithmetic operations, a double-length register p to receive the product, and an accumulator a for summing the products. The instruction sequence for one pass through the loop on a general-purpose CPU looks like this:

Inside the loop
1. update j
2. update k
3. b ← b_j
4. x ← x_k
5. fetch (multiply) instruction
6. decode (multiply) instruction
7. execute (multiply) instruction (p ← b_j x_k)
8. fetch (add) instruction
9. decode (add) instruction
10. execute (add) instruction (a ← a + p)

Assume each line above takes one unit of time – call it an "instruction time" or "clock cycle." (Multiplication easily takes several units of time, but we assume it takes the same as the rest.) The sequence then takes 10 units of time to complete. We could add a multiply-and-add instruction (call it MAC) to the CPU's instruction set – that is, augment the CPU with the appropriate hardware. This merges the last six lines (5 through 10) into three, leaving seven lines that take 7 units of time:

Inside the loop
1. update j
2. update k
3. b ← b_j
4. x ← x_k
5. fetch MAC instruction
6. decode MAC instruction
7. execute MAC instruction (a ← a + b_j x_k)

A DSP can perform a MAC operation in a single unit of time; many take this capability as the very definition of a DSP. We describe below how this is accomplished.

Update the pointers simultaneously  Since the two pointer updates are independent of each other, we add two address-updating units to the processor hardware. The updates can now be done in parallel, shown as one line in the sequence, which takes 6 units of time:

Inside the loop
1. update j AND update k
2. b ← b_j
3. x ← x_k
4. fetch MAC instruction
5. decode MAC instruction
6. execute MAC (a ← a + b_j x_k)

Memory architecture

Load registers b and x simultaneously  Since b_j and x_k are completely independent, we can make provision to read them from memory into the appropriate registers simultaneously. In the standard CPU there is just one bus connection to memory; even connecting two buses to the same (single) memory does not help, and so-called "dual-port memories" are expensive and slow.
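The "update j AND update k" step is what the dedicated address-update units do, and in real DSPs those units typically support modulo (circular) updates – one of the special addressing modes listed in 8.2 – so the FIR delay line never has to be physically shifted. A hedged Python sketch of that pointer discipline (the function name and buffer layout are assumptions for illustration):

```python
def fir_circular(b, samples):
    """Model a DSP delay line with modulo (circular) pointer updates:
    each new sample overwrites the oldest slot instead of shifting data."""
    M = len(b) - 1
    delay = [0] * (M + 1)   # circular buffer holding x(n), x(n-1), ..., x(n-M)
    k = 0                   # write pointer: the 'update k' of the loop above
    out = []
    for s in samples:
        delay[k] = s
        acc = 0             # the accumulator 'a'
        p = k
        for j in range(M + 1):
            acc += b[j] * delay[p]      # MAC: a <- a + b_j * x_k
            p = (p - 1) % (M + 1)       # modulo pointer update, no data move
        out.append(acc)
        k = (k + 1) % (M + 1)
    return out
```

The modulo arithmetic on p and k is free on a DSP's address-generation units, whereas a general-purpose CPU would spend extra instructions on it.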
In a radical departure from the memory architecture of the standard CPU, the DSP can define multiple memory banks, each served by its own bus. Now b_j and x_k can be loaded from memory into the registers b and x simultaneously, shown in the sequence below by listing the two operations on the same line; the sequence now takes 5 units of time:

Inside the loop
1. update j AND update k
2. b ← b_j AND x ← x_k
3. fetch MAC instruction
4. decode MAC instruction
5. execute MAC (a ← a + b_j x_k)

We next turn to the last three lines (fetch, decode, and execute) of the sequence.

Caches

Standard CPUs use instruction caches to speed up execution. Caching implies different run times – that is, unpredictability – depending on the state of the caches when operation starts. DSPs, however, are designed for real-time use, where the prediction of exact timing may be critical. Caches are therefore usually avoided in DSPs because they complicate the calculation of program execution time.

Harvard architecture

Now consider fetching one instruction while previous ones are still being decoded or executed. A clash can occur when an instruction is fetched from memory at the same time that data belonging to a prior instruction is being transferred to or from memory. The solution, again, is separate memory banks served by separate buses. Previously we used different memory banks for different categories of data; now we are talking about one memory bank for instructions and another for data. The banks have independent address spaces and are called program memory and data memory – this is the Harvard architecture. The CPU can fetch the next instruction and simultaneously load or store a memory word. Standard computers use the same memory space for program and data, which is called the von Neumann architecture (also known as the Princeton architecture).
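A toy model – deliberately not cycle-accurate – of why the separate program and data buses matter: assume every instruction needs one program-memory fetch and one data-memory access. A single shared (von Neumann) bus must serialize the two, while Harvard buses let them overlap.

```python
def memory_cycles(n_instructions, harvard):
    """Toy bus model: each instruction performs one program fetch and one
    data access. A shared bus serializes them (2 slots per instruction);
    separate Harvard buses overlap them (1 slot per instruction)."""
    instruction_fetch = 1
    data_access = 1
    if harvard:
        per_instruction = max(instruction_fetch, data_access)  # overlapped
    else:
        per_instruction = instruction_fetch + data_access      # serialized
    return n_instructions * per_instruction

print(memory_cycles(1000, harvard=False))  # 2000 bus slots
print(memory_cycles(1000, harvard=True))   # 1000 bus slots
```

Under these assumptions the Harvard organization halves the bus traffic bottleneck, which is the overlap the next section builds on.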
Most DSPs abide by the Harvard architecture in order to be able to overlap instruction fetches with data transfers. The idea of overlapping brings us to pipelining.