Introduction to Digital Signal Processors

INTRODUCTION TO Accumulator architecture DIGITAL SIGNAL PROCESSORS Memory-register architecture Prof. Brian L. Evans in collaboration with Niranjan Damera-Venkata and Magesh Valliappan Load-store architecture Embedded Signal Processing Laboratory The University of Texas at Austin Austin, TX 78712-1084 http://signal.ece.utexas.edu/ Outline n Signal processing applications n Conventional DSP architecture n Pipelining in DSP processors n RISC vs. DSP processor architectures n TI TMS320C6x VLIW DSP architecture n Signal and image processing applications n Signal processing on general-purpose processors n Conclusion 2 Signal Processing Applications n Low-cost embedded systems 4 Modems, cellular telephones, disk drives, printers n High-throughput applications 4 Halftoning, base stations, 3-D sonar, tomography n PC based multimedia 4 Compression/decompression of audio, graphics, video n Embedded processor requirements 4 Inexpensive with small area and volume 4 Deterministic interrupt service routine latency 4 Low power: ~50 mW (TMS320C54x uses 0.36 mA/MIP) 3 Conventional DSP Architecture n Harvard architecture 4 Separate data memory/bus and program memory/bus 4 Three reads and one or two writes per instruction cycle n Deterministic interrupt service routine latency n Multiply-accumulate in single instruction cycle n Special addressing modes supported in hardware 4 Modulo addressing for circular buffers (e.g. FIR filters) 4 Bit-reversed addressing (e.g. fast Fourier transforms) n Instructions to keep the pipeline (3-4 stages) full 4 Zero-overhead looping (one pipeline flush to set up) 4 Delayed branches 4 Conventional DSP Architecture (con’t) Data-shifting n Modulo addressing Time Buffer contents Next sample x x x x x 4 implementing n=N N-K+1 N-K+1 N-1 N N+1 circular buffers and x x x x n=N+1 N-K+2 N-K+3 N N+1 x delay lines N+2 x x N-K+3 xN-K+4 N+1 xN+2 n=N+2 xN+3 Modulo addressing Time Buffer contents Next sample n Bit reversed n=N x x x x x x addressing N-2 N-1 N N-K+1 N-K+2 N+1 4 x used to x x x x x x N+2 n=N+1 N-2 N-1 N N+1 N-K+2 N-K+3 implement the radix-2 FFT x x xx x x x x n=N+2 N-2 N-1 NN N+1 N+2 N-K+3 xNN-K+4-K+4 xN+3 5 Conventional DSP Architecture (con’t) Fixed-Point Floating-Point Cost/Unit $5 - $79 $5 - $381 Architecture Accumulator load-store or memory-register Registers 2-4 data 8 or 16 data 8 address 8 or 16 address Data Words 16 or 24 bit integer 32 bit integer and and fixed-point fixed/floating-point On-Chip 2-64 kwords data 8-64 kwords data Memory 2-64 kwords program 8-64 kwords program Address 16-128 kw data 16 Mw – 4Gw data Space 16-64 kw program 16 Mw – 4 Gw program Compilers C compilers; C, C++ compilers; poor code generation better code generation Examples TI TMS320C5x; TI TMS320C3x; Motorola 56000 Analog Devices SHARC 6 Conventional DSP Architecture (con’t) n Market share: 95% fixed-point, 5% floating-point n Each processor family has dozens of members with different on-chip configurations 4 Size and map of data and program memory 4 A/D, input/output buffers, interfaces, timers, and D/A n Drawbacks to conventional DSP processors 4 No byte addressing (needed for image and video) 4 Limited on-chip memory 4 Limited addressable memory on fixed-point DSPs, except Motorola 56300 (16 Mw data; 64 Mw program) 4 Non-standard C extensions to support fixed-point data 7 Pipelining Sequential (Motorola 56000) Fetch Decode Read Execute Pipelined (Most conventional DSP processors) Fetch Decode Read Execute Superscalar (Pentium, MIPS) Managing Pipelines •compiler or programmer •pipeline interlocking Fetch Decode Read Execute in the processor Superpipelined (CDC7600) •hardware instruction scheduling Fetch Decode Read Execute 8 Pipelining: Operation Fetch Decode Read n Time-stationary pipeline model Execute 4 Programmer controls each cycle F D R E 4 Motorola DSP56001 D C B A E D C B MAC X0,Y0,A X:(R0)+,X0 Y:(R4)-,Y0 F E D C G F E D n Data-stationary pipeline model H G F E 4 Programmer specifies data I H G F operations J I H G 4 TMS320C30/40 K J I H L K J I MPYF *++AR0(1),*++AR1(IR0),R0 L - K J L - K n Interlocked pipeline L - 4 Programmer is “protected” from L pipeline effects 9 Pipelining: Hazards Fetch Decode Read n A control hazard occurs when a Execute branch instruction is decoded F D R E 4 “Flush” the pipeline D C B A 4 or: Delayed branch (expose pipeline) E D C B F E D C n A data hazard occurs because br F E D an operand cannot be read yet G br F E 4 Intended by programmer - - br F 4 or: Interlock hardware inserts - - - br “bubble” X - - - Y X - - TMS320C5x example Y - X - LAC #064h LAR AR2, DATA Z Y - X SAMM AR2 LACC *- Z Y - NOP Z Y LACC *- Z 10 Pipelining: Avoiding Control Hazards Fetch Decode Read A key factor in the numeric performance Execute of DSPs is the provision of special F D R E hardware to perform looping. D C B RPT COUNT E D AB CD TBLR *+ F E CD E rpt F E F n A repeat instruction repeats X rpt F rpt X - rpt one instruction or a block of - X - - - instructions after repeat X X - X X X X n The pipeline is filled with X X X X X repeated instruction (or X X X X block of instructions) X X n Cost: one pipeline flush only 11 RISC vs. DSP: Instruction Encoding n RISC: Superscalar Reorder Load/store FP Unit Integer Unit n DSP: Horizontal microcode Load/store Load/store Address ALU Multiplier 12 RISC vs. DSP: Memory Hierarchy n RISC Registers I/D Physical Cache memory Out of TLB order TLB: Translation Lookaside Buffer I Cache Internal n DSP memories Registers External memories DMA Controller DMA: Direct Memory Access 13 TI TMS320C6x VLIW DSP Architecture Simplified Architecture Program RAM Data RAM or Cache Addr Internal Buses DMA Data .D1 .D2 Regs Regs Serial Port (A0-A15) External .M1 .M2 (B0-B15) Host Port Memory .L1 .L2 -Sync Boot Load -Async .S1 .S2 Timers Control Regs Pwr Down CPU 14 TI TMS320C6x VLIW DSP Architecture n Two parallel data paths with single-cycle units: 4 Data unit - 32-bit address calculations (modulo, linear) 4 Multiplier unit - 16 bit x 16 bit with 32-bit result 4 Logical unit - 40-bit (saturation) arithmetic & compares 4 Shifter unit - 32-bit integer ALU and 40-bit shifter n 16 32-bit registers in each data path 4 40 bits can be stored in adjacent even/odd registers n Fixed-point (C62x) and floating-point (C67x) n TMS320C6201: $25 in volume 4 150 MHz, 300 million MACs/sec, 1200 RISC MIPS 4 On-chip memory: 16 k x 32 program, 32 k x 16 data 15 TI TMS320C6x VLIW DSP Architecture n One instruction cycle every clock cycle n Deep pipeline 4 7-11 stages in C62x: fetch 4, decode 2, execute 1-5 4 7-16 stages in C67x: fetch 4, decode 2, execute 1-10 4 If a branch is in the pipeline, interrupts are disabled (the latency of a branch is 5 cycles) 4 Avoid branches by using conditional execution n No hardware protection against pipeline hazards 4 Compiler and assembler must prevent pipeline hazards n C67x computes floating-point multiply in 4 cycles 16 C5x and C6x Addressing Modes n Immediate TMS320C5x TMS320C6x 4 The operand is part of the instruction ADD #0FFh add .L1 -13,A1,A6 n Register 4 The operand is specified in a register (implied) add .L1 A7,A6,A7 n Direct 4 The address of the operand is part of the instruction (added ADD 010h not supported to imply memory page) n Indirect 4 The address of the operand is ADD * ldw .L1 *A5++[8],A1 stored in a register 17 TMS320C6x vs. Pentium MMX Processor Peak BDTI ISR Power Unit Area Volume MIPS marks latency Price Pentium 466 49 1.14 ms 4.25 W $213 5.5” x 2.5” 8.789 in3 MMX 233 Pentium 532 56 1.00 ms 4.85 W $348 5.5” x 2.5” 8.789 in3 MMX 266 C62x 1200 74 0.12 ms 1.45 W $25 1.3” x 1.3” 0.118 in3 150 MHz C62x 1600 99 0.09 ms 1.94 W $96 1.3” x 1.3” 0.118 in3 200 MHz BDTImarks: Berkeley Design Technology Inc. DSP benchmark results (larger means better) http://www.bdti.com/bdtimark/results.htm http://www.ece.utexas.edu/~bevans/courses/ee382c/lectures/processors.html 18 Application: FIR Filter z-1 z-1 z-1 n Each tap requires 4 Fetching one data sample 4 Fetching one operand 4 Multiplying two numbers 4 Accumulating multiplication result 4 Shifting one sample in the delay line n Computing an FIR tap in one instruction cycle 4 Three data memory accesses 4 Auto-increment or decrement addressing modes 4 Modulo addressing to implement delay line as circular buffer 19 Application: FIR Filter on a TMS320C5x Coefficients Data COEFFP .set 02000h ; Program mem address X .set 037Fh ; Newest data sample LASTAP .set 037FH ; Oldest data sample … LAR AR3, #LASTAP ; Point to oldest sample RPT #127 MACD COEFFP, *- ; Do the thing APAC SACH Y,1 ; Store result -- note shift 20 Application: FIR Filter on a TMS320C62x Coefficients Data Single-Cycle Loop ..

Load more