DSP Processors: Introduction

Overview
What is a Digital Signal Processor (DSP)?
Processor Trends – Architectures.
What are Signal Processing Hardware Trends – Other Processor options?
What is available in Market?
How to Choose a DSP?
Conclusions

What is a DSP?
Digital Signal Processors are microprocessors specifically designed to handle Digital Signal Processing tasks.
DSPs must also have a predictable execution time.
DSPs are designed to operate in real time.

Processor Trends - Architectures
Hardware Units in DSP Processors
Multiplier Accumulator (MAC) Unit.
The most common operation in digital signal processing is array multiplication.
Consider the implementation of an FIR digital filter, the most common DSP technique.
To implement this operation in real time, we require a hardware multiplier unit that produces the result of a multiplication in a single clock cycle. We also need to add, or accumulate, the results, so we need an adder. Together these are known as a MAC unit, one of the mandatory requirements of a programmable DSP.
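As a sketch, the FIR multiply-accumulate loop can be written in plain C (the function and variable names here are illustrative, not from any particular DSP toolchain); on a programmable DSP, the MAC unit completes the multiply and the accumulate of each iteration in a single cycle:

```c
#include <stddef.h>

/* One output sample of an N-tap FIR filter: y = sum of coeff[k] * x[k],
   where x[0] is the newest input sample. Each loop iteration is one
   multiply-accumulate (MAC); on a DSP the hardware multiplier and adder
   complete it in a single clock cycle. */
long fir_sample(const int *coeff, const int *x, size_t ntaps)
{
    long acc = 0;                     /* the accumulator register      */
    for (size_t k = 0; k < ntaps; k++)
        acc += (long)coeff[k] * x[k]; /* multiply, then accumulate     */
    return acc;
}
```

The accumulator is deliberately wider than the operands, mirroring the guard bits a real MAC unit provides to avoid overflow across many accumulations.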
Circular Buffers:
To calculate the output sample, we must have access to a certain number of the most recent samples from the input.
For example, suppose we use eight coefficients in this filter, a0, a1, ..., a7. This means we must know the value of the eight most recent samples from the input signal, x[n], x[n-1], ..., x[n-7].
These eight samples must be stored in memory and continually updated as new samples are acquired.
The best way to manage these stored samples is circular buffering.
The circular buffer is placed in eight consecutive memory locations, 20041 to 20048.
The idea of circular buffering is that the end of this linear array is connected to its beginning; memory location 20041 is viewed as being next to 20048, just as 20044 is next to 20045.
We keep track of the array by a pointer that indicates where the most recent sample resides.
When a new sample is acquired, it replaces the oldest sample in the array, and the pointer is moved one address ahead.
Four parameters are needed to manage a circular buffer.
A pointer that indicates the start of the circular buffer in memory (in this example, 20041).
A pointer indicating the end of the array (e.g., 20048), or a variable that holds its length (e.g., 8).
The step size of the memory addressing must be specified.
These three values define the size and configuration of the circular buffer, and will not change during the program operation.
The fourth value, the pointer to the most recent sample, must be modified as each new sample is acquired.
There must be program logic that controls how this fourth value is updated based on the first three.
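A minimal sketch of this update logic in C, with an ordinary array standing in for memory locations 20041 to 20048 (all names are illustrative; a DSP would do the wrap-around in its address-generation hardware rather than with a modulo operation):

```c
#define BUF_LEN 8               /* length of the circular buffer       */

static int buf[BUF_LEN];        /* the eight consecutive locations     */
static int newest = 0;          /* index of the most recent sample     */

/* Store a new sample: overwrite the oldest entry and advance the
   pointer one step, wrapping from the end of the array to the start. */
void put_sample(int sample)
{
    newest = (newest + 1) % BUF_LEN; /* move one address ahead, wrap   */
    buf[newest] = sample;            /* replaces the oldest sample     */
}

/* Fetch x[n-k]: the sample acquired k steps before the newest one.    */
int get_sample(int k)
{
    return buf[(newest - k + BUF_LEN) % BUF_LEN];
}
```

The start pointer, buffer length, and step size are fixed at assembly time (here BUF_LEN and the array base); only `newest`, the fourth parameter, changes as samples arrive.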
Modified Bus Structures and Memory Access Schemes:
Multiple Access Memory:
The number of memory accesses per clock cycle can also be increased by using a high-speed memory that permits more than one access per clock period (e.g., DARAM).
Multiple-access RAM can be connected to the processing unit of the Harvard architecture.
Multiported Memory:
They dispense with the need for storing the program and data in two different memory chips.
They are more expensive.
Processor architecture trends include:
VLIW
Advanced Super Harvard
SIMD
Simplified instruction sets to increase clock speed and compatibility (RISC)
More complex instruction sets for higher performance (CISC)
Mixed-width instruction sets to reduce memory usage
Deeper pipelines to enable higher clock speeds
DSP-enhanced GPPs

Architecture Evolution
In the traditional von Neumann architecture there is only a single memory and a single bus for transferring data into and out of the CPU.
In the Harvard architecture, there are separate memories for data and program, with separate buses for each.
Since the buses operate independently, program instructions and data can be fetched at the same time, improving speed.
Another improvement is the Super Harvard Architecture. A handicap of the basic Harvard design is that the data memory bus is busier than the program memory bus. To improve upon this situation, we start by relocating part of the "data" to program memory. For instance, we might place the filter coefficients in program memory, while keeping the input signal in data memory.
However, DSP algorithms generally spend most of their execution time in loops.
This means that the same set of program instructions will continually pass from program memory to the CPU.
The Super Harvard architecture takes advantage of this situation by including an instruction cache in the CPU.
This is a small memory that contains about 32 of the most recent program instructions.
The I/O controller is connected to data memory, through which the signals enter and exit the system.
Most of the processors contain both serial and parallel communications ports.
Dedicated hardware allows these data streams to be transferred directly into memory (Direct Memory Access, or DMA), without having to pass through the CPU's registers.
This type of high speed I/O is a key characteristic of DSPs.
Some DSPs have onboard analog-to-digital and digital-to-analog converters, a feature called mixed signal.

Exploiting ILP - VLIW
ILP - Instruction Level Parallelism
Ability to perform multiple operations (or instructions), from a single instruction stream, in parallel.
It is a set of design techniques that speed up programs by executing several RISC-style operations in parallel, such as memory loads and stores, integer additions, and floating-point multiplications.
These operations are taken from a single stream of execution rather than from parallel tasks.
Available ILP: inherent in a region of the code.
Achievable ILP: provided by the hardware.
ILP Hardware: Hardware can offer ILP in several ways.
Several of the functional units found in a processor can execute at the same time.
Here we allow operations to execute simultaneously on each of the functional units.
Having separate register banks for the integer and floating-point data can help us to do this by reducing potential hardware resource conflicts.
Multiple copies of the functional units, possibly accessing different register files to add register bandwidth, can be added for the purpose of executing in parallel.
Functional units with latency longer than one cycle can be pipelined.
That is, pipelining the floating-point and cache operations so that one can be initiated each cycle, even though each might take several cycles to finish.

General ILP Organization
[Figure: general ILP organization — an instruction unit (fetch and decode) and a register file feed several functional units (FU-1 to FU-5) through a bypassing network, with separate instruction memory and data memory.]
Example: Consider the code,
Cycle 1: add t3=t1,t2
Cycle 2: store [addr 0] = t3
Cycle 3: fmul f6 = f7,f14
Cycle 4: ....waiting….
Cycle 5: ....waiting….
Cycle 6: fmul f7 = f7, f15 Exploiting ILP
Cycle 7: ....waiting….
Cycle 8: ....waiting….
Cycle 9: add t1 = p2, p7
Cycle 10: add t5 = p2, p10
Cycle 11: add t4 = t1,t5
Cycle 12: store [addr 1] = t4

If we have 3 integer units, one floating-point unit, and one load/store unit, the code can be rearranged as:
Cycle 1:
add t3=t1,t2
add t1 = p2, p7
add t5 = p2, p10
fmul f6 = f7,f14
Cycle 2:
add t4 = t1,t5
fmul f7 = f7, f15
store [addr 0] = t3
Cycle 3:
store [addr 1] = t4

VLIW
VLIW = Very Long Instruction Word architecture
Instruction format: | operation 1 | operation 2 | operation 3 | operation 4 | operation 5 |
VLIW Architecture:
Very Long Instruction Word architecture
They have a number of processing units (data paths): ALUs, MAC units, shifters, etc.
The VLIW is fetched from memory and specifies the operands and the operations to be performed by each of the data paths.
The multiple functional units share a common multiported register file for fetching operands and storing results.
The performance gains that can be achieved with a VLIW architecture depend on the degree of parallelism in the algorithm selected for a DSP application and on the number of functional units.
The throughput will be higher only if the algorithm involves execution of independent operations.
It is the compiler that does the job of determining the ILP and scheduling it on the functional units.

A VLIW Architecture with 7 FUs
[Figure: a VLIW datapath — instruction memory feeding three integer FUs and two load/store units that share an integer register file, and two floating-point FUs that share a floating-point register file, all connected to data memory.]

SIMD
SIMD (Single Instruction Multiple Data)
A single stream of instructions is broadcast to a number of processors.
All processors execute the same program but operate on different data.
Nodes have mesh or hypercube connectivity.
Each PE has a few registers, some local memory, and an ALU, and can exchange values with its neighbors.
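The SIMD idea can be sketched in plain C: one instruction stream (the loop body) applied to many data elements. This is a scalar emulation for illustration only; on a real SIMD machine each PE would hold one element and all PEs would execute the operation simultaneously:

```c
#include <stddef.h>

/* SIMD-style elementwise add: a single operation ("a + b") applied to
   many data elements. Here the loop visits the "PEs" one at a time;
   SIMD hardware would perform all n additions in one step. */
void vec_add(const int *a, const int *b, int *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];  /* same instruction, different data */
}
```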
An SIMD Organization

[Figure: SIMD execution — instructions 1 through n are issued over time, each executed simultaneously on node 1 through node K.]

Architecture Trends – The Down Side
VLIW, SIMD and deep pipelines can increase
Memory use.
Energy consumption.
Code generation complexity, programming difficulty.
Simple instruction sets often increase memory usage.
More instructions are needed to accomplish a given task.
Complex instruction sets hinder compatibility.
Compatibility can bring messy compromises.

Summary
Each processor makes different trade-offs depending on its target application; top speed is often not the goal.

DSP Hardware Trends
Today’s system engineers have a wealth of options for implementing DSP tasks:
GPP
DSPs
Application Specific DSPs
Customizable Cores
ASSPs – Application Specific Standard Products
ASICs – Application-Specific Integrated Circuits
FPGAs – Field-Programmable Gate Arrays

How to Choose?
Performance Analysis
Comparing benchmarking approaches

Benchmarking Approaches
How to Benchmark?
Simplified metrics
E.g., MIPS, MOPS, MMACS
Full DSP applications
E.g., V.90 modem
DSP algorithm “kernel” benchmarks
E.g., FIR filter, FFT, etc.

Algorithm Kernel Benchmarks
Most benchmarks are based on DSP algorithm kernels.
DSP algorithm kernels are the most computationally intensive portions of DSP applications.
Examples include FFTs, IIR and FIR filters, and Viterbi decoders.
Benchmark results are used together with application profiling to predict overall performance.
[Figure: example application profile — IDCT 39%, Window 25%, Other 25%, Denorm 11%.]
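One way to combine kernel benchmark results with a profile is a simple weighted sum: each kernel's measured cost is weighted by its share of the application's execution time. The sketch below assumes this approach; the profile percentages are from the example above, and any cycle counts fed in would be hypothetical benchmark results:

```c
/* Predict application-level cost from kernel benchmark results by
   weighting each kernel's measured cycle count by the fraction of
   execution time the profile assigns to it. */
double predicted_cycles(const double *share, const double *cycles, int n)
{
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += share[i] * cycles[i]; /* weighted contribution */
    return total;
}
```

A kernel that dominates the profile (like the 39% IDCT above) dominates the prediction, which is why benchmark suites concentrate on exactly those kernels.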
Advantages
Relevant: chosen by analysis of real DSP applications.
Kernels are short, allowing
Functionality to be precisely specified.
Benchmarks to be implemented and optimized in a reasonable amount of time.
Disadvantages
Not practical to implement all important algorithms.
Do not reflect application-level optimizations and trade-offs.

Emerging Benchmarking Challenges
New technologies create performance-analysis challenges:
Multi-core Devices
DSP-enhanced FPGAs
Application-specific processors
Customizable processors
Reconfigurable processors
Evolving applications and tools also lead to new challenges
Increasing reliance on C compilers
For technologies not well served by kernel benchmarks, such as
FPGAs
Application-specific Processors
Practicality concerns can be partly addressed by
Using off-the-shelf applications wherever available, or
Using simplified applications.

What is available in Market?
Latest Processors
High performance processors
Texas Instruments TMS320C64xx
StarCore SC140
Low Power Processors
Texas Instruments TMS320C55xx
Analog Devices Blackfin (ADSP-BF53x)
General-purpose/DSP processors
Intel PXA2xx
Texas Instruments OMAP5910

DSP Speed
[Chart: speed comparison (higher is better) of TI ‘C5502 (300 MHz), ADI ‘BF53x (600 MHz), TI ‘C6414 (720 MHz), StarCore SC140 (300 MHz), and Intel PXA2xx (400 MHz).]

DSP Speed
What factors affect DSP Speed?
Parallelism
How many parallel operations can be performed per cycle
Instruction Set
Suitability for the task at hand
Clock Speed
Data types
Data Bandwidth
Pipeline Depth
Instruction latencies
Support for DSP oriented features
DSP Addressing modes
Zero-overhead looping
Saturation, scaling, rounding
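Saturation, for example, clamps a result at the largest or smallest representable value instead of letting it wrap around. DSPs provide this in hardware; the C sketch below is a scalar emulation for illustration:

```c
#include <stdint.h>

/* Saturating 16-bit addition: on overflow the result is clamped to
   INT16_MAX or INT16_MIN rather than wrapping around, which is far
   less audible/visible in signal processing than wrap-around. */
int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + b;      /* widen to avoid overflow  */
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}
```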
Memory Use

Lower is better.
[Chart: memory use comparison (in bytes) of TI ‘C55xx (8/16/32/48-bit instruction widths), ADI ‘BF53x (16/32/64), TI ‘C64xx (32), StarCore SC140 (16/32), and Intel PXA2xx (16/32).]
What factors affect memory use?
A processor’s memory usage is affected by
Instruction set
Wider instructions take more memory.
Mixed-width instructions are becoming popular – short, simple instructions for simple tasks, longer instructions for more complex tasks.
Suitability of the instruction set for the task at hand
Architecture
VLIW, SIMD and deeper pipelines may encourage optimizations that increase memory use to obtain speed optimized code
Compiler quality (for compiled code)

Energy Efficiency
Higher is better.
[Chart: energy-efficiency comparison of TI ‘C5502 (300 MHz, 1.26 V), ADI ‘BF53x (600 MHz, 1.2 V), TI ‘C6414 (300 MHz, 1.0 V), Motorola MSC8101/SC140 (300 MHz, 1.5 V), and Intel PXA2xx (400 MHz, 1.0 V).]

Energy Efficiency
What factors affect Energy efficiency?
Processors’ energy efficiency is affected by
Speed
Fabrication process, voltage, circuit design, logic design
Hardware Implementation
Memory usage
Compiler quality (for compiled code)

Cost Performance
Higher is better.
[Chart: cost–performance comparison of TI ‘C5502 (300 MHz, $10), ADI ‘BF53x (600 MHz, $6), TI ‘C6414 (300 MHz, $45), Motorola MSC8101/SC140 (300 MHz, $118), and Intel PXA2xx (400 MHz, $27).]

Cost Performance
What factors affect Cost Performance?
Speed
Chip Cost, which is affected by
Fabrication process
Size of on-chip memory – influenced by processor’s memory usage
On-chip peripherals
Manufacturing volume
But good cost–performance results do not necessarily mean a chip is suitable for applications with severe cost constraints.
Users do not want to pay for more performance than they need.

Conclusions
DSP processor architecture innovation has accelerated greatly.
New processor types are increasingly competitive:
DSP-enhanced general-purpose processors
Multiprocessor chips
Customizable cores
Non-processor approaches are increasingly competitive:
DSP-enhanced FPGAs
Architectural options are expanding.
Today’s DSP-oriented processors cannot be meaningfully compared using simplified metrics.
Relevant, meaningful benchmark results are essential for processor evaluation
There is no ideal processor
Fastest does not mean best
The “best” processor depends on the details of the application
Different architectural approaches make different performance trade-offs.
Understanding these trade-offs is key to selecting a processor.