
DSP Processors: Introduction

Overview

 What is a Digital Signal Processor (DSP)?

 Processor Trends – Architectures.

 What are Signal Processing Hardware Trends – Processor options?

 What is available in Market?

 How to Choose a DSP?

 Conclusions

What is a DSP?

 Digital Signal Processors are specifically designed to handle Digital Signal Processing tasks.

 DSPs must also have a predictable execution time.

 DSPs are designed to operate in real time.

Processor Trends - Architectures

 Hardware Units in DSP Processors

 Multiplier Accumulator (MAC) Unit.

 The most common operation in digital signal processing is array multiplication.

 Consider the implementation of an FIR filter, the most common DSP technique.

 To implement the operation in real time, we require a hardware multiplier unit that produces the result of a multiplication in a single clock cycle.

 We also need to add, or accumulate, the results, so we need an accumulator.

 Together these are known as a MAC (multiply-accumulate) unit, one of the mandatory requirements of a programmable DSP.
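As a plain C sketch of the operation the MAC unit accelerates (illustrative code, not vendor-specific DSP code; on a DSP, each loop iteration maps to one single-cycle MAC instruction):

```c
#include <stddef.h>

/* One FIR output sample: y[n] = a0*x[n] + a1*x[n-1] + ... .
 * Each loop iteration is one multiply-accumulate (MAC): multiply a
 * coefficient by a sample, then add the product into the accumulator. */
double fir_output(const double *coeff, const double *x, size_t ntaps)
{
    double acc = 0.0;                 /* the accumulator */
    for (size_t k = 0; k < ntaps; k++)
        acc += coeff[k] * x[k];       /* multiply, then accumulate */
    return acc;
}
```

Here x[0] holds the most recent sample x[n], x[1] holds x[n-1], and so on.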

 Circular Buffers:

 To calculate the output sample, we must have access to a certain number of the most recent samples from the input.

 For example, suppose we use eight coefficients in this filter, a0, a1, …, a7. This means we must know the value of the eight most recent samples from the input signal, x[n], x[n-1], …, x[n-7].

 These eight samples must be stored in memory and continually updated as new samples are acquired.

 The best way to manage these stored samples is circular buffering.

 The circular buffer is placed in eight consecutive memory locations, 20041 to 20048.

 The idea of circular buffering is that the end of this linear array is connected to its beginning; memory location 20041 is viewed as being next to 20048, just as 20044 is next to 20045.

 We keep track of the array by a pointer that indicates where the most recent sample resides.

 When a new sample is acquired, it replaces the oldest sample in the array, and the pointer is moved one address ahead.

 Four parameters are needed to manage a circular buffer.

 A pointer that indicates the start of the circular buffer in memory (in this example, 20041).

 A pointer indicating the end of the array (e.g., 20048), or a variable that holds its length (e.g., 8).

 The step size of the memory addressing must be specified.

 These three values define the size and configuration of the circular buffer, and will not change during the program operation.

 The fourth value, the pointer to the most recent sample, must be modified as each new sample is acquired.

 There must be program logic that controls how this fourth value is updated based on the first three values.
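A minimal sketch of this bookkeeping in C, using the addresses and eight-sample length from the example above (the names BUF_START, BUF_LEN, BUF_STEP, and buf_advance are illustrative):

```c
#define BUF_START 20041   /* first parameter: start of the buffer */
#define BUF_LEN   8       /* second parameter: buffer length      */
#define BUF_STEP  1       /* third parameter: address step size   */

/* Fourth parameter: address of the most recent sample. */
static int newest = BUF_START;

/* On acquiring a new sample: overwrite the oldest sample's slot and
 * advance the pointer, wrapping from the end back to the start. */
int buf_advance(void)
{
    newest += BUF_STEP;
    if (newest > BUF_START + (BUF_LEN - 1) * BUF_STEP)
        newest = BUF_START;          /* 20048 wraps around to 20041 */
    return newest;
}
```

The first three values stay fixed while the program runs; only `newest` changes, exactly as described above.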

Modified Structures and Memory Access Schemes:

 Multiple Access Memory:

 The number of memory accesses per clock cycle can also be increased by using a high-speed memory that permits more than one access per clock period (e.g., DARAM).

 Multiple-access RAM can be connected to the processing unit of the DSP.

 Multiported Memory:

 They dispense with the need for storing the program and data in two different memory chips.

 They are more expensive.

Processor Trends - Architectures

 Processor architecture trends are:

 VLIW

 Advanced Super Harvard

 SIMD

 Simplified instruction sets to increase clock speeds and compatibility (RISC).

 More complex instruction sets for higher performance (CISC).

 Mixed-width instruction sets to reduce memory usage.

 Deeper pipelines to enable higher clock speeds.

 DSP-enhanced GPPs.

Architecture Evolution

 In the traditional von Neumann architecture there is only a single memory and a single bus for transferring data into and out of the CPU.

 In the Harvard architecture, there are separate memories for data and program, with separate buses for each.

 Since the buses operate independently, program instructions and data can be fetched at the same time, improving speed.

 Another improvement is the Super Harvard architecture. A handicap of the basic Harvard design is that the data memory bus is busier than the program memory bus. To improve upon this situation, we start by relocating part of the "data" to program memory. For instance, we might place the filter coefficients in program memory, while keeping the input signal in data memory.

 However, DSPs generally spend most of their execution time in loops.

 This means that the same set of program instructions will continually pass from program memory to the CPU.

 The Super Harvard architecture takes advantage of this situation by including an instruction cache in the CPU.

 This is a small memory that contains about 32 of the most recent program instructions.

 An I/O controller is connected to data memory, through which the data streams enter and exit the system.

 Most of the processors contain both serial and parallel communications ports.

 Dedicated hardware allows these data streams to be transferred directly into memory (direct memory access, or DMA), without having to pass through the CPU's registers.

 This type of high speed I/O is a key characteristic of DSPs.

 Some DSPs have onboard analog-to-digital and digital-to-analog converters, a feature called mixed signal.

Exploiting ILP - VLIW

ILP - Instruction Level Parallelism

Ability to perform multiple operations (or instructions), from a single instruction stream, in parallel.

 It is a set of design techniques that speed up programs by executing several RISC-style operations in parallel, such as memory loads and stores, integer additions, and floating-point multiplications.

 These operations are taken from a single stream of execution rather than from parallel tasks.

 Available ILP: inherent in a region of the code.

 Achievable ILP: provided by the hardware.

 ILP Hardware: Hardware can offer ILP in several ways.

 Several of the functional units found in a processor can execute at the same time.

 Here we allow operations to execute simultaneously on each of the functional units.

 Having separate register banks for the integer and floating point data can help us to do this by reducing potential hardware resource conflicts.

 Multiple copies of the functional units, possibly accessing different register files to add register bandwidth, can be added for the purpose of executing in parallel.

 Functional units with latency longer than one cycle can be pipelined.

 That is, pipelining the floating-point and cache operations so that one can be initiated each cycle, even though each might take several cycles to finish.

General ILP Organization

[Diagram: a general ILP organization — an instruction memory feeds instruction fetch and decode units, which issue to multiple functional units (FU-1 through FU-5) connected through a register file and a bypassing network to data memory.]

 Example: Consider the code,

 Cycle 1: add t3=t1,t2

 Cycle 2: store [addr 0] = t3

 Cycle 3: fmul f6 = f7,f14

 Cycle 4: ....waiting….

 Cycle 5: ....waiting….

 Cycle 6: fmul f7 = f7, f15

 Cycle 7: ....waiting….

 Cycle 8: ....waiting….

 Cycle 9: add t1 = p2, p7

 Cycle 10: add t5 = p2, p10

 Cycle 11: add t4 = t1,t5

 Cycle 12: store [addr 1] = t4

 If we have 3 integer units, one floating-point unit and one load/store unit, then the code can be arranged as:

 Cycle 1:

 add t3=t1,t2

 add t1 = p2, p7

 add t5 = p2, p10

 fmul f6 = f7,f14

 Cycle 2:

 add t4 = t1,t5

 fmul f7 = f7, f15

 store [addr 0] = t3

 Cycle 3:

 store [addr 1] = t4

VLIW

VLIW = Very Long Instruction Word architecture

Instruction format: operation 1 | operation 2 | operation 3 | operation 4 | operation 5

 VLIW Architecture:

 Very Long Instruction Word architecture

 They have a number of processing units (data paths), i.e., a number of ALUs, MAC units, shifters, etc.

 The VLIW is accessed from memory and is used to specify the operands and operations to be performed by each of the data paths.

 The multiple functional units share a common multiported register file for fetching the operands and storing the results.

 The performance gains that can be achieved with a VLIW architecture depend on the degree of parallelism in the algorithm selected for a DSP application and on the number of functional units.

 The throughput will be higher only if the algorithm involves execution of independent operations.

 It is the compiler that does the job of determining ILP and scheduling it on the functional units.

A VLIW Architecture with 7 FUs

[Diagram: a VLIW datapath — an instruction memory issues to seven functional units (three integer FUs, two load/store units, two floating-point FUs), which share an integer register file and a floating-point register file and connect to data memory.]

SIMD

 SIMD (Single Instruction Multiple Data)

 A single stream of instructions is broadcast to a number of processors.

 All processors execute the same program but operate on different data.

 Nodes have Mesh or hypercube connectivity

 Each PE can exchange values with its neighbors, and has a few registers, some local memory, and an ALU.

An SIMD Organization

[Diagram: SIMD execution method — at each time step, the same instruction (Instruction 1, 2, 3, …, n) executes simultaneously on every node (node 1 through node K).]

Architecture Trends – The Down Side

 VLIW, SIMD and deep pipelines can increase

 Memory use.

 Energy consumption.

 Code generation complexity, programming difficulty.

 Simple instruction sets often increase memory usage.

 More instructions are needed to accomplish a given task.

 Complex instruction sets hinder compatibility.

 Compatibility can bring messy compromises.

Summary

Each processor makes different tradeoffs depending on its target application; top speed is often not the goal.

DSP Hardware Trends

 Today’s system engineers have a wealth of options for implementing DSP tasks.

 GPP

 DSPs

 Application Specific DSPs

 Customizable Cores

 ASSPs – Application Specific Standard Products

 ASICs - Application Specific Integrated Circuits

 FPGAs – Field Programmable Gate Arrays

How to Choose?

 Performance Analysis

 Comparing benchmarking approaches

Benchmarking approaches

 How to benchmark?

 Simplified metrics

 E.g., MIPS, MOPS, MMACS

 Full DSP Applications

 E.g., V.90

 DSP algorithm “kernel” benchmarks

 E.g., FIR filter, FFT, etc.

Algorithm Kernel Benchmarks

 Most of the benchmarks are based on DSP algorithm kernels

 DSP algorithm kernels are the most computationally intensive portions of DSP applications

 Examples include FFTs, IIR and FIR filters, and Viterbi decoders.

 Benchmark results are used with application profiling to predict overall performance.
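As an illustrative sketch of how a kernel benchmark works in practice (the kernel sizes, test data, and timing method here are assumptions for illustration, not any standard benchmark suite), one times the computational core — here an FIR inner loop — over a block of samples:

```c
#include <time.h>

#define NTAPS 64      /* illustrative kernel size */
#define NSAMP 4096    /* illustrative block length */

/* FIR kernel: the computationally intensive core being benchmarked. */
double fir(const double *a, const double *x, int ntaps)
{
    double acc = 0.0;
    for (int k = 0; k < ntaps; k++)
        acc += a[k] * x[k];
    return acc;
}

/* Time the kernel over a block of samples; the elapsed time (seconds)
 * is the benchmark figure compared across processors or compilers. */
double bench_fir(double *elapsed)
{
    static double a[NTAPS], x[NSAMP + NTAPS];
    double sink = 0.0;

    for (int i = 0; i < NTAPS; i++) a[i] = 1.0 / (i + 1);
    for (int i = 0; i < NSAMP + NTAPS; i++) x[i] = (double)(i % 7) - 3.0;

    clock_t t0 = clock();
    for (int n = 0; n < NSAMP; n++)
        sink += fir(a, &x[n], NTAPS);     /* one output sample per pass */
    *elapsed = (double)(clock() - t0) / CLOCKS_PER_SEC;

    return sink;   /* checksum keeps the loop from being optimized away */
}
```

Because the kernel is short, its functionality can be precisely specified and the implementation optimized per processor, which is exactly the appeal of kernel benchmarks noted below.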

Application Profile

[Chart: example application profile — IDCT 39%, Window 25%, Denorm 11%, Other 25%.]
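Kernel benchmark results are combined with a profile like the one above to predict overall performance. A sketch of the weighted combination (the function name and the per-kernel speedups in the note below are hypothetical):

```c
/* Predict overall application speedup from an application profile
 * (fraction of run time spent in each kernel) plus per-kernel
 * benchmark speedups, via the weighted-harmonic (Amdahl-style) rule:
 *     overall speedup = 1 / sum_i(fraction_i / speedup_i)            */
double predicted_speedup(const double *frac, const double *speedup, int n)
{
    double t = 0.0;                   /* predicted relative run time */
    for (int i = 0; i < n; i++)
        t += frac[i] / speedup[i];
    return 1.0 / t;
}
```

For the profile above (IDCT 0.39, Window 0.25, Denorm 0.11, Other 0.25), a hypothetical 2x speedup on the three kernels with "Other" unchanged predicts 1 / (0.39/2 + 0.25/2 + 0.11/2 + 0.25/1) = 1.6x overall.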

 Advantages

 Relevant: chosen by analysis of real DSP applications.

 Kernels are short, allowing

 Functionality to be precisely specified.

 Benchmarks to be implemented and optimized in a reasonable amount of time.

 Disadvantages

 Not practical to implement all important algorithms.

 Do not reflect application-level optimizations and trade-offs.

Emerging Benchmarking Challenges

 New technologies create performance-analysis challenges

 Multi-core Devices

 DSP-enhanced FPGAs

 Application-specific processors

 Customizable processors

 Reconfigurable processors

Emerging challenges

 Evolving applications and tools also lead to new challenges

 Increasing reliance on C compilers

 For technologies not well served by kernel benchmarks, such as

 FPGAs

 Application-specific Processors

 Practicality concerns can be partly addressed by

 Using off-the-shelf applications wherever available, or

 Using simplified applications.

What is available in Market?

 Latest Processors

 High performance processors

 Texas Instruments TMS320C64xx

 StarCore SC140

 Low Power Processors

 Texas Instruments TMS320C55xx

 Analog Devices Blackfin (ADSP-BF53x)

 General-purpose/DSP processors

 Intel PXA2xx

 Texas Instruments OMAP5910

DSP Speed

[Chart: speed benchmark scores (higher is better) for TI ‘C5502 (300 MHz), ADI ‘BF53x (600 MHz), TI ‘C6414 (720 MHz), StarCore SC140 (300 MHz), and Intel PXA2xx (400 MHz).]

 What factors affect DSP Speed?

 Parallelism

 How many parallel operations can be performed per cycle

 Instruction Set

 Suitability for the task at hand

 Clock Speed

 Data types

 Data Bandwidth

 Pipeline depth

 Instruction latencies

 Support for DSP oriented features

 DSP Addressing modes

 Zero-overhead looping

 Saturation, scaling, rounding

Memory Use

[Chart: memory use comparison in bytes (lower is better) for TI ‘C55xx (8/16/32/48-bit instructions), ADI ‘BF53x (16/32/64), TI ‘C64xx (32), StarCore SC140 (16/32), and Intel PXA2xx (16/32).]

 What factors affect Memory use?

 A processor’s memory usage is affected by

 Instruction Set

 Wider instructions take more memory

 Mixed-width instruction sets are becoming popular – use short, simple instructions for simple tasks and longer instructions for more complex tasks

 Suitability of the instruction set for the task at hand

 Architecture

 VLIW, SIMD and deeper pipelines may encourage optimizations that increase memory use to obtain speed-optimized code

 Compiler quality (for compiled code)

Energy Efficiency

[Chart: energy efficiency (higher is better) for TI ‘C5502 (300 MHz, 1.26 V), ADI ‘BF53x (600 MHz, 1.2 V), TI ‘C6414 (300 MHz, 1.0 V), Motorola MSC8101 (SC140) (300 MHz, 1.5 V), and Intel PXA2xx (400 MHz, 1.0 V).]

 What factors affect Energy efficiency?

 Processors’ energy efficiency is affected by

 Speed

 Fabrication process, voltage, circuit design, logic design

 Hardware Implementation

 Memory usage

 Compiler quality (for compiled code)

Cost Performance

[Chart: cost performance (higher is better) for TI ‘C5502 (300 MHz, $10), ADI ‘BF53x (600 MHz, $6), TI ‘C6414 (300 MHz, $45), Motorola MSC8101 (SC140) (300 MHz, $118), and Intel PXA2xx (400 MHz, $27).]

 What factors affect Cost Performance?

 Speed

 Chip Cost, which is affected by

 Fabrication process

 Size of on-chip memory – influenced by processor’s memory usage

 On-chip peripherals

 Manufacturing volume

 But good cost-performance results do not necessarily mean a chip is suitable for applications with severe cost constraints.

 Users do not want to pay for more performance than needed.

Conclusions

 DSP processor architecture innovation has accelerated greatly.

 New processor types are increasingly competitive:

 DSP enhanced general purpose processors

 Multiprocessor chips

 Customizable cores

 Non-processor approaches are increasingly competitive:

 DSP-enhanced FPGAs

Architectural options are expanding.

 Today’s DSP-oriented processors cannot be meaningfully compared using simplified metrics.

 Relevant, meaningful benchmark results are essential for processor evaluation

 There is no ideal processor

 Fastest does not mean best

 The “best” processor depends on the details of the application

 Different architectural approaches make different performance trade-offs

 Understanding these trade-offs is key to selecting a processor.