Load-store architecture Load-store Accumulator architecture Accumulator Memory-register architecture Magesh Valliappan in collaboration with in collaboration Austin, TX 78712-1084 PROCESSORS Prof. Brian L. Evans L. Brian Prof.
DIGITAL SIGNAL http://anchovy.ece.utexas.edu/ INTRODUCTION TO Niranjan Damera-Venkata and The University of Texas at Austin at Texas of University The Embedded Signal Processing Laboratory Processing Signal Embedded Outline n Signal processing applications n Conventional DSP architecture n Pipelining in DSP processors n RISC vs. DSP processor architectures n TI TMS320C6x VLIW DSP architecture n Signal and image processing applications n Signal processing on general-purpose processors n Conclusion
2 Signal Processing Applications n Low-cost embedded systems 4 Modems, cellular telephones, disk drives, printers n High-throughput applications 4 Halftoning, radar, high-resolution sonar, tomography n PC based multimedia 4 Compression/decompression of audio, graphics, video n Embedded processor requirements 4 Inexpensive with small area and volume 4 Deterministic interrupt service routine latency 4 Low power: ~50 mW (TMS320C5402/20: 0.32 mA/MIP)
3 Conventional DSP Architecture n High data throughput 4 Harvard architecture
n Separate data memory/bus and program memory/bus
n Three reads and one or two writes per instruction cycle 4 Short deterministic interrupt service routine latency 4 Multiply-accumulate (MAC) in a single instruction cycle 4 Special addressing modes supported in hardware
n Modulo addressing for circular buffers (e.g. FIR filters)
n Bit-reversed addressing (e.g. fast Fourier transforms) 4Instructions to keep the 3-4 stages of the pipeline full
n Zero-overhead looping (one pipeline flush to set up)
n Delayed branches
4 Conventional DSP Architecture ( Data-shifting n Modulo Time Buffer contents Next sample x x x x x addressing n=N N-K+1 N-K+1 N-1 con’t)N N+1
4 implementing x x n=N+1 N-K+2 xN-K+3 N xN+1 x circular buffers N+2 x x and delay lines N-K+3 xN-K+4 N+1 xN+2 n=N+2 xN+3
Modulo addressing
Time Buffer contents Next sample n Bit reversed n=N x x x x x x addressing N-2 N-1 N N-K+1 N-K+2 N+1 4 used to x x x x xN+2 n=N+1 N-2 xN-1 N N+1 xN-K+2 N-K+3 implement the radix-2 x x xx x x x x n=N+2 N-2 N-1 NN N+1 N+2 N-K+3 xN-K+4N-K+4 x FFT N+3
5 Conventional DSP Architecture (
Fixed-Point Floating-Point Cost/Unit $5 $10con’t) Architecture accumulator load-store or memory-register Registers 2–4 data 8 or 16 data 8 address 8 or 16 address Data Words 16 or 24 bit integer and 32 bit integer and fixed-point fixed/floating-point On-Chip 2–64 kwords data * 8–64 kwords data Memory 2–64 kwords program 8–64 kwords program Address 16 kw–8 Mw data 16 Mw–4Gw data Space 16–64 kw program ** 16 Mw–4 Gw program Compilers C compilers; C, C++ compilers; poor code generation better code generation Examples TI TMS320C5x; TI TMS320C3x; Motorola 56000 Analog Devices SHARC 6 Conventional DSP Architecture ( n Market share: 95% fixed-point, 5% floating-point con’t) n Each processor has dozens of configurations 4 Size and map of data and program memory 4A/D, input/output buffers, interfaces, timers, and D/A n Drawbacks to conventional DSP processors 4 No byte addressing (desirable for image and video) 4 Limited on-chip memory 4Limited addressable memory on fixed-point DSPs: except Motorola 56300 (16 Mw data; 32 Mw program) and C548/C549/C54xx (8 Mw data; 256 kw program) 4 Non-standard C extensions to support fixed-point data
7 Pipelining
Sequential (Motorola 56000)
Fetch Decode Read Execute Pipelined (Most conventional DSP processors)
Fetch Decode Read Execute Superscalar (Pentium, MIPS) Managing Pipelines •compiler or programmer •interlocking Fetch Decode Read Execute Superpipelined (CDC7600) •hardware instruction scheduling
Fetch Decode Read Execute 8 Pipelining: Operation
Fetch Decode Read n Time-stationary pipeline model Execute 4 Programmer controls each cycle F D R E 4 Motorola DSP56001 D C B A MAC X0,Y0,A X:(R0)+,X0 Y:(R4)-,Y0 E D C B F E D C n Data-stationary pipeline model G F E D H G F E 4 Programmer specifies data I H G F operations J I H G 4TMS320C30/40 K J I H L K J I MPYF *++AR0(1),*++AR1(IR0),R0 L - K J L - K n Interlocked pipeline L - 4 Programmer is “protected” from L pipeline effects 9 Pipelining: Hazards
Fetch Decode Read n A control hazard occurs when a Execute branch instruction is decoded F D R E 4 “Flush” the pipeline D C B A 4 or: Delayed branch (expose pipeline) E D C B n A data hazard occurs because F E D C br F E D an operand cannot be read yet G br F E 4 Intended by programmer - - br F 4 or: Interlock hardware inserts - - - br “bubble” X - - - TMS320C5x example Y X - - Y - X - LAC #064h LAR AR2, DATA Z Y - X SAMM AR2 LACC *- NOP Z Y - LACC *- Z Y Z 10 Pipelining: Avoiding Control Hazards
Fetch Decode Read A key factor in the numeric performance Execute of DSPs is the provision of special F D R E hardware to perform looping. D C B A RPT COUNT E D C B TBLR *+ F E D C rpt F E D n A repeat instruction repeats X rpt F E one instruction or a block of X - rpt F X - - rpt instructions after repeat X X - - n The pipeline is filled with X X X - X X X X repeated instruction (or X X X X block of instructions) X X X X n Cost: one pipeline flush only
11 RISC vs. DSP: Instruction Encoding n RISC: Superscalar Reorder
Load/store
FP Unit Integer Unit n DSP: Horizontal microcode
Load/store
Load/store
Address ALU Multiplier
12 RISC vs. DSP: Memory Hierarchy n RISC Registers I/D Physical Cache memory Out of TLB order TLB: Translation Lookaside Buffer
I Cache Internal n DSP memories
Registers External memories
DMA Controller DMA: Direct Memory Access
13 TI TMS320C6x VLIW DSP Architecture
Simplified Architecture Program RAM Data RAM or Cache Addr Internal Buses DMA Data Regs (A0-A15) .D1 .D2 Regs (B0-B15) Serial Port
External .M1 .M2 Host Port Memory .L1 .L2 -Sync Boot Load -Async .S1 .S2 Timers Control Regs Pwr Down CPU
14 TI TMS320C6x VLIW DSP Architecture n One instruction cycle per clock cycle n Two parallel data paths with single-cycle units: 4 Data unit - 32-bit address calculations (modulo, linear) 4 Multiplier unit - 16 bit x 16 bit with 32-bit result 4 Logical unit - 40-bit (saturation) arithmetic & compares 4 Shifter unit - 32-bit integer ALU and 40-bit shifter n 16 32-bit general purpose registers in each path 4 40 bits can be stored in adjacent even/odd registers n 32-bit addressing of 8/16/32 bit data n Fixed-point (C62x) and floating-point (C67x) n C67x computes floating-point multiply in 4 cycles 15 TI TMS320C6x VLIW DSP Architecture n TMS320C6211: $21 in volume 4 150 MHz, 300 million MACs/sec, 1200 RISC MIPS 4 on-chip: 4k x 8 bits program and 4k x 8 bits data (plus 64k x 8 bits L2 cache) n Deep pipeline 4 7-11 stages in C62x: fetch 4, decode 2, execute 1-5 4 7-16 stages in C67x: fetch 4, decode 2, execute 1-10 4 If a branch is in the pipeline, interrupts are disabled (the latency of a branch instruction is 5 cycles) 4 Avoid branches by using conditional execution n No hardware protection against pipeline hazards 4 Compiler and assembler must prevent pipeline hazards 16 C5x and C6x Addressing Modes n Immediate TMS320C5x TMS320C6x 4 The operand is part of the instruction ADD #0FFh add .L1 -13,A1,A6 n Register 4 The operand is (implied) add .L1 A7,A6,A7 specified in a register n Direct 4 The address of the operand is part of the ADD 010h not supported instruction (added to imply memory page) n Indirect ADD * ldw .L1 *A5++[8],A1 4 The address of the operand is stored in a register 17 TMS320C6x vs. Pentium MMX
Processor Peak BDTI ISR Power Unit Area Volume MIPS marks latency Price
Pentium 466 49 1.14 µs 4.25 W $213 5.5” x 2.5” 8.789 in3 MMX 233
Pentium 532 56 1.00 µs 4.85 W $348 5.5” x 2.5” 8.789 in3 MMX 266
C62x 1200 74 0.12 µs 1.45 W $25 1.3” x 1.3” 0.118 in3 150 MHz
C62x 1600 99 0.09 µs 1.94 W $96 1.3” x 1.3” 0.118 in3 200 MHz
BDTImarks: Berkeley Design Technology Inc. DSP benchmark results (larger means better) http://www.bdti.com/bdtimark/results.htm http://www.ece.utexas.edu/~bevans/courses/ee382c/lectures/processors.html
18 Application: FIR Filter
z-1 z-1 z-1 n Each tap requires 4 Fetching one data sample 4Fetching one operand 4Multiplying two numbers 4Accumulating multiplication result 4Shifting one sample in the delay line n Computing an FIR tap in one instruction cycle 4Three data memory accesses 4Auto-increment or decrement addressing modes 4Modulo addressing to implement delay line as circular buffer 4 Eleven RISC instructions
19 Application: FIR Filter on a TMS320C5x
Coefficients
Data
COEFFP .set 02000h ; Program mem address X .set 037Fh ; Newest data sample LASTAP .set 037FH ; Oldest data sample
… LAR AR3, #LASTAP ; Point to oldest sample RPT #127 MACD COEFFP, *- ; Do the thing APAC SACH Y,1 ; Store result -- note shift
20 Application: FIR Filter on a TMS320C62x
Coefficients
Data
Single-Cycle Loop
...
C7: ldh .D1 *A1++, A2 ; Read coefficient || ldh .D2 *B1++, B2 ; Read data || [B0] sub .L2 B0, 1, B0 ; Decrement counter || [B0] B .S2 c7 ; Branch if not zero || mpy .M1x A2, B2, A3 ; Form product || add .L1 A4, A3, A4 ; Accumulate result
...
21 Ordered Dithering on a TMS320C62x
1/8 5/8
7/8 3/8
Single-Cycle Loop Array of thresholds
...
C7: ldb .D1 *A1++, A2 ; Read pixel || ldb .D2 *B1++, B2 ; Read threshold || [B0] sub .L2 B0, 1, B0 ; Decrement counter || [B0] B .S2 c7 ; Branch if not zero || cmpgtu .L1x A2, B2, A3 ; Threshold and store || stb .D1 A3, *A5++ ; Accumulate result
...
22 DSP Cores
n ASIC with: 4 Programmable DSP 4 RAM 4 ROM 4 Standard cells 4 Codec 4Peripherals 4 Gate array 4 Microcontroller
23 DSP on General Purpose Processors n Multimedia applications on PCs 4 Video, audio, graphics and animation 4 Repetitive parallel sequences of instructions n Native signal processing examples 4 Sun Visual Instruction Set (UltraSPARC 1/2) 4 Intel MMX (Pentium I/II/III) 4Intel Concurrent SIMD-FP (Pentium III) n Single Instruction Multiple Data (SIMD) 4 One instruction acts on multiple data in parallel 4 Well-suited for graphics
24 DSP on General Purpose Processors ( n Programming is considerably tougher 4 C/C++ compilers do not generate native signal processingcon’t) code except Metrowerks CodeWarrior 4 gives MMX code 4 Libraries of routines using native signal processing 4 Hand code using in-line assembly for best performance 4 Pack/unpack data not aligned on SIMD word boundaries 4 50-cycle penalty to switch out of MMX; 0 penalty for VIS 4 Saturation arithmetic in MMX; not supported in VIS 4 Extended-precision accumulation in MMX; none in VIS n Speedup for applications 4 Signal and image processing - 1.5:1 to 2:1 4 Graphics - 4:1 to 6:1 (no packing/unpacking)
25 Intel MMX Instruction Set n 64-bit SIMD register (4 data types) 4 64-bit quad word 4 Packed byte (8 bytes packed into 64 bits) 4 Packed word (4 16-bit words packed into 64 bits) 4 Packed double word (2 double words packed into 64 bits) n 57 new instructions 4 Pack and unpack 4Add, subtract, multiply, and multiply/accumulate n Saturation and wraparound arithmetic n Maximum parallelism possible 4 8:1 for 8-bit additions 4 4:1 for 8 x 16 multiplication or 16-bit additions
26 Concluding Remarks n Conventional digital signal processors 4 High performance vs. power consumption/cost/volume 4Excel at one-dimensional processing 4Have instructions tailored to specific applications n TMS320C6x VLIW DSP 4High performance vs. cost/volume 4Excel at multidimensional signal processing 4A maximum of 22 RISC instructions per cycle n Native Signal Processing 4 Available on desktop computers 4 Excels at graphics 4 A maximum of 8 RISC instructions per cycle n In-line assembly code for best performance 27 Concluding Remarks n Digital signal processor market 4 40% annual growth rate since 1990 4 $3.5 billion revenue in 1998 4 45% TI, 25% Lucent, 10% Motorola, 8% Analog Devices n Independent benchmarking by industry 4 Berkeley Design Technology Inc. http://www.bdti.com 4 EDN Embedded Microprocessor Benchmark Consortium http://www.eembc.org n Web resources 4 comp.dsp newsgroup: FAQ www.bdti.com/faq/dsp_faq.html 4 embedded processors and systems: www.eg3.com 4 on-line courses and DSP boards: www.techonline.com
28 References
1 G. E. Allen, B. L. Evans, and D. C. Schanbacher, “Real-Time Sonar Beamforming on a Unix Workstation,” Proc. IEEE Asilomar Conf. On Signals, Systems, and Computers, pp. 764-768, 1998. http://www.ece.utexas.edu/~bevans/papers/1998/beamforming/ 2 R. Bhargava, R. Radhakrishnan, B. L. Evans, and L. K. John, “Evaluating MMX Technology Using DSP and Multimedia Applications,” Proc. IEEE Sym. On Microarchitecture, pp. 37-46, 1998. http://www.ece.utexas.edu/~ravib/mmxdsp/ 3 W. Chen, H. J. Reekie, S. Bhave, and E. A. Lee, “Native Signal Processing on the UltraSPARC in the Ptolemy Environment,” Proc. IEEE Asilomar Conf. On Signals, Systems, and Computers, 1996. http://www.ece.utexas.edu/~bevans/courses/ee382c/lectures/21_nsp/vis/ 4 B. L. Evans, “EE379K-17 Real-Time DSP Laboratory,” UT Austin. http://www.ece.utexas.edu/~bevans/courses/realtime/ 5 B. L. Evans, “EE382C Embedded Software Systems,” UT Austin. http://www.ece.utexas.edu/~bevans/courses/ee382c/ 6 A. Kulkarni and A. Dube, “Evaluation of the Code Generation Domain in Ptolemy,” http://www.ece.utexas.edu/~bevans/talks/benchmarking97/sld001.htm 7 P. Lapsley, J. Bier, A. Shoham, and E. A. Lee, DSP Processor Fundamentals, IEEE Press, 1997.
29