INTRODUCTION TO DIGITAL SIGNAL PROCESSORS

Prof. Brian L. Evans
in collaboration with Niranjan Damera-Venkata and Magesh Valliappan

Embedded Signal Processing Laboratory
The University of Texas at Austin
Austin, TX 78712-1084
http://anchovy.ece.utexas.edu/

[Figure labels: accumulator architecture, load-store architecture, memory-register architecture]

Outline
- Signal processing applications
- Conventional DSP architecture
- Pipelining in DSP processors
- RISC vs. DSP processor architectures
- TI TMS320C6x VLIW DSP architecture
- Signal and image processing applications
- Signal processing on general-purpose processors
- Conclusion

Signal Processing Applications
- Low-cost embedded systems
  - Modems, cellular telephones, disk drives, printers
- High-throughput applications
  - Halftoning, radar, high-resolution sonar, tomography
- PC based multimedia
  - Compression/decompression of audio, graphics, video
- Embedded processor requirements
  - Inexpensive with small area and volume
  - Deterministic interrupt service routine latency
  - Low power: ~50 mW (TMS320C5402/20: 0.32 mA/MIP)

Conventional DSP Architecture
- High data throughput
  - Harvard architecture
    - Separate data memory/bus and program memory/bus
    - Three reads and one or two writes per instruction cycle
  - Short deterministic interrupt service routine latency
  - Multiply-accumulate (MAC) in a single instruction cycle
  - Special addressing modes supported in hardware
    - Modulo addressing for circular buffers (e.g. FIR filters)
    - Bit-reversed addressing (e.g. fast Fourier transforms)
  - Instructions to keep the 3-4 stages of the pipeline full
    - Zero-overhead looping (one pipeline flush to set up)
    - Delayed branches
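Bit-reversed addressing is generated by the address unit on a DSP at no cycle cost. As a rough software illustration of what that hardware computes, the sketch below produces the bit-reversed index order for a radix-2 FFT. The function names `bit_reverse` and `bit_reversed_order` are hypothetical, chosen for this example only.

```python
def bit_reverse(index, num_bits):
    """Reverse the lowest num_bits bits of index."""
    result = 0
    for _ in range(num_bits):
        result = (result << 1) | (index & 1)  # move the low bit of index into result
        index >>= 1
    return result


def bit_reversed_order(n):
    """Index order for a radix-2 FFT of length n (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    num_bits = n.bit_length() - 1
    return [bit_reverse(i, num_bits) for i in range(n)]


if __name__ == "__main__":
    print(bit_reversed_order(8))
```

Reversing the bits twice returns the original index, so the same permutation serves for both reordering the input and restoring natural order.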

Conventional DSP Architecture (con't)
- Modulo addressing
  - Implementing circular buffers and delay lines
- Bit-reversed addressing
  - Used to implement the radix-2 FFT

Data shifting

    Time     Buffer contents                                 Next sample
    n=N      x[N-K+1]  x[N-K+2]  ...  x[N-1]  x[N]           x[N+1]
    n=N+1    x[N-K+2]  x[N-K+3]  ...  x[N]    x[N+1]         x[N+2]
    n=N+2    x[N-K+3]  x[N-K+4]  ...  x[N+1]  x[N+2]         x[N+3]

Modulo addressing

    Time     Buffer contents                                                   Next sample
    n=N      ...  x[N-2]  x[N-1]  x[N]  x[N-K+1]  x[N-K+2]  ...                x[N+1]
    n=N+1    ...  x[N-2]  x[N-1]  x[N]  x[N+1]    x[N-K+2]  x[N-K+3]  ...      x[N+2]
    n=N+2    ...  x[N-2]  x[N-1]  x[N]  x[N+1]    x[N+2]    x[N-K+3]  x[N-K+4] x[N+3]
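The difference between a data-shifting delay line and one built with modulo addressing can be sketched in software. The two classes below are illustrative only (the names `ShiftDelayLine` and `CircularDelayLine` are made up for this example); on a DSP the modulo update of the pointer is done by the address generation hardware rather than by an explicit `%` operation.

```python
class ShiftDelayLine:
    """Delay line that moves every stored sample when a new one arrives."""

    def __init__(self, k):
        self.buf = [0] * k

    def push(self, sample):
        # Data shifting: drop the oldest sample, slide the rest, append the newest.
        self.buf = self.buf[1:] + [sample]

    def newest_first(self):
        return self.buf[::-1]


class CircularDelayLine:
    """Delay line that overwrites the oldest sample in place."""

    def __init__(self, k):
        self.buf = [0] * k
        self.pos = 0  # slot that currently holds the oldest sample

    def push(self, sample):
        self.buf[self.pos] = sample                # one write per input sample
        self.pos = (self.pos + 1) % len(self.buf)  # modulo address update

    def newest_first(self):
        k = len(self.buf)
        return [self.buf[(self.pos - 1 - i) % k] for i in range(k)]


if __name__ == "__main__":
    shift, circ = ShiftDelayLine(4), CircularDelayLine(4)
    for x in range(1, 7):
        shift.push(x)
        circ.push(x)
    print(shift.newest_first(), circ.newest_first(), circ.buf)
```

Both classes present the same samples to a filter; only the circular version leaves the stored samples where they are, which is what the modulo addressing table above depicts.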

Conventional DSP Architecture (con't)

                     Fixed-Point                  Floating-Point
    Cost/Unit        $5                           $10
    Architecture     accumulator                  load-store or memory-register
    Registers        2–4 data                     8 or 16 data
                     8 address                    8 or 16 address
    Data Words       16 or 24 bit integer and     32 bit integer and
                     fixed-point                  fixed/floating-point
    On-Chip Memory   2–64 kwords data *           8–64 kwords data
                     2–64 kwords program          8–64 kwords program
    Address Space    16 kw–8 Mw data              16 Mw–4 Gw data
                     16–64 kw program **          16 Mw–4 Gw program
    Compilers        C compilers;                 C, C++ compilers;
                     poor code generation         better code generation
    Examples         TI TMS320C5x;                TI TMS320C3x;
                     Motorola 56000               SHARC

Conventional DSP Architecture (con't)
- Market share: 95% fixed-point, 5% floating-point
- Each processor has dozens of configurations
  - Size and map of data and program memory
  - A/D, input/output buffers, interfaces, timers, and D/A
- Drawbacks to conventional DSP processors
  - No byte addressing (desirable for image and video)
  - Limited on-chip memory
  - Limited addressable memory on fixed-point DSPs, except Motorola 56300 (16 Mw data; 32 Mw program) and C548/C549/C54xx (8 Mw data; 256 kw program)
  - Non-standard C extensions to support fixed-point data

Pipelining

[Figure: timing diagrams of the Fetch, Decode, Read, and Execute stages for four pipeline styles]
- Sequential (Motorola 56000)
- Pipelined (most conventional DSP processors)
- Superscalar (Pentium, MIPS)
- Superpipelined (CDC7600)

Managing Pipelines
- ... or programmer
- Interlocking
- Hardware instruction scheduling

Pipelining: Operation

[Figure: occupancy of the Fetch (F), Decode (D), Read (R), and Execute (E) stages, one row per cycle]

    F  D  R  E
    D  C  B  A
    E  D  C  B
    F  E  D  C
    G  F  E  D
    H  G  F  E
    I  H  G  F
    J  I  H  G
    K  J  I  H
    L  K  J  I
    -  L  K  J
    -  -  L  K
    -  -  -  L

- Time-stationary pipeline model
  - Programmer controls each cycle
  - Motorola DSP56001

        MAC X0,Y0,A  X:(R0)+,X0  Y:(R4)-,Y0

- Data-stationary pipeline model
  - Programmer specifies data operations
  - TMS320C30/40

        MPYF *++AR0(1),*++AR1(IR0),R0

- Interlocked pipeline
  - Programmer is "protected" from pipeline effects

Pipelining: Hazards

[Figure: occupancy of the Fetch (F), Decode (D), Read (R), and Execute (E) stages around a branch (br) and an interlock bubble]

    F  D  R  E
    D  C  B  A
    E  D  C  B
    F  E  D  C
    br F  E  D
    G  br F  E
    -  -  br F
    -  -  -  br
    X  -  -  -
    Y  X  -  -
    Y  -  X  -
    Z  Y  -  X
    -  Z  Y  -
    -  -  Z  Y
    -  -  -  Z

- A control hazard occurs when a branch instruction is decoded
  - "Flush" the pipeline
  - or: Delayed branch (expose pipeline)
- A data hazard occurs because an operand cannot be read yet
  - Intended by programmer
  - or: Interlock hardware inserts "bubble"

TMS320C5x example

    LAC  #064h          LAR  AR2, DATA
    SAMM AR2            LACC *-
    NOP
    LACC *-

Pipelining: Avoiding Control Hazards

A key factor in the numeric performance of DSPs is the provision of special hardware to perform looping.

    RPT  COUNT
    TBLR *+

[Figure: occupancy of the Fetch (F), Decode (D), Read (R), and Execute (E) stages around a repeat (rpt) instruction; X is the repeated instruction]

    F   D   R   E
    D   C   B   A
    E   D   C   B
    F   E   D   C
    rpt F   E   D
    X   rpt F   E
    X   -   rpt F
    X   -   -   rpt
    X   X   -   -
    X   X   X   -
    X   X   X   X
    X   X   X   X
    X   X   X   X

- A repeat instruction repeats one instruction or a block of instructions after repeat
- The pipeline is filled with the repeated instruction (or block of instructions)
- Cost: one pipeline flush only
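The stage-occupancy diagrams on the pipelining slides can be reproduced with a small shift-register model. This is a simplified sketch, not a model of any particular processor: the helper names `pipeline_rows` and `flush_after_branch` are hypothetical, and a taken branch is modeled by fetching bubbles ("-") behind it until it reaches the last stage, rather than fetching and then squashing the following instruction.

```python
def pipeline_rows(fetch_stream, depth=4):
    """Return one tuple per cycle listing the contents of each stage, fetch stage first."""
    pipe = ["-"] * depth
    rows = []
    # Append bubbles at the end so the last instruction drains out of the pipeline.
    for instr in list(fetch_stream) + ["-"] * (depth - 1):
        pipe = [instr] + pipe[:-1]  # every instruction advances one stage per cycle
        rows.append(tuple(pipe))
    return rows


def flush_after_branch(program, branch, target_stream, depth=4):
    """Fetch stream for a taken branch: bubbles fill the slots behind the branch."""
    idx = program.index(branch)
    return list(program[: idx + 1]) + ["-"] * (depth - 1) + list(target_stream)


if __name__ == "__main__":
    stream = flush_after_branch(["A", "br", "G", "H"], "br", ["X", "Y"])
    for row in pipeline_rows(stream):
        print(" ".join(f"{s:>3}" for s in row))
```

With a four-stage pipeline the model loses three fetch slots per taken branch, which is the flush cost that delayed branches and the repeat instruction are meant to avoid or pay only once.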

RISC vs. DSP: Instruction Encoding
- RISC: Superscalar
  [Figure: reorder stage issuing to load/store, FP unit, and integer unit]
- DSP: Horizontal microcode
  [Figure: instruction word with fields for load/store, load/store, address, ALU, and multiplier]

RISC vs. DSP: Memory Hierarchy
- RISC
  [Figure: registers, out-of-order execution, I/D cache, TLB, physical memory]
  TLB: Translation Lookaside Buffer
- DSP
  [Figure: registers, I cache, internal memories, external memories, DMA controller]
  DMA: Direct Memory Access

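TI TMS320C6x VLIW DSP Architecture

Simplified Architecture
[Figure: block diagram. Program RAM and Data RAM or cache connect over internal address and data buses to the CPU, the DMA, the serial port, the host port, boot load, timers, power down logic, and external memory (sync and async). The CPU contains register files A0-A15 and B0-B15, functional units .D1, .M1, .L1, .S1 and .D2, .M2, .L2, .S2, and control registers.]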

TI TMS320C6x VLIW DSP Architecture
- One instruction cycle per clock cycle
- Two parallel data paths with single-cycle units:
  - Data unit - 32-bit address calculations (modulo, linear)
  - Multiplier unit - 16 bit x 16 bit with 32-bit result
  - Logical unit - 40-bit (saturation) arithmetic & compares
  - Shifter unit - 32-bit integer ALU and 40-bit shifter
- 16 32-bit general purpose registers in each path
  - 40 bits can be stored in adjacent even/odd registers
- 32-bit addressing of 8/16/32 bit data
- Fixed-point (C62x) and floating-point (C67x)
- C67x computes floating-point multiply in 4 cycles

TI TMS320C6x VLIW DSP Architecture
- TMS320C6211: $21 in volume
  - 150 MHz, 300 million MACs/sec, 1200 RISC MIPS
  - On-chip: 4k x 8 bits program and 4k x 8 bits data (plus 64k x 8 bits L2 cache)
- Deep pipeline
  - 7-11 stages in C62x: fetch 4, decode 2, execute 1-5
  - 7-16 stages in C67x: fetch 4, decode 2, execute 1-10
  - If a branch is in the pipeline, interrupts are disabled (the latency of a branch instruction is 5 cycles)
  - Avoid branches by using conditional execution
- No hardware protection against pipeline hazards
  - Compiler and assembler must prevent pipeline hazards

C5x and C6x Addressing Modes

    Mode        Meaning                                TMS320C5x     TMS320C6x
    Immediate   The operand is part of the             ADD #0FFh     add .L1 -13,A1,A6
                instruction
    Register    The operand is specified in a          (implied)     add .L1 A7,A6,A7
                register
    Direct      The address of the operand is part     ADD 010h      not supported
                of the instruction (added to the
                implied memory page)
    Indirect    The address of the operand is          ADD *         ldw .D1 *A5++[8],A1
                stored in a register

TMS320C6x vs. Pentium MMX

    Processor          Peak    BDTI-    ISR        Power     Unit     Area           Volume
                       MIPS    marks    latency              Price
    Pentium MMX 233     466     49      1.14 µs    4.25 W    $213     5.5" x 2.5"    8.789 in^3
    Pentium MMX 266     532     56      1.00 µs    4.85 W    $348     5.5" x 2.5"    8.789 in^3
    C62x 150 MHz       1200     74      0.12 µs    1.45 W    $25      1.3" x 1.3"    0.118 in^3
    C62x 200 MHz       1600     99      0.09 µs    1.94 W    $96      1.3" x 1.3"    0.118 in^3

BDTImarks: Berkeley Design Technology Inc. DSP benchmark results (larger means better)
http://www.bdti.com/bdtimark/results.htm
http://www.ece.utexas.edu/~bevans/courses/ee382c/lectures/processors.html
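The peak figures quoted for the C62x follow from simple arithmetic, assuming that "peak MIPS" counts one operation per functional unit per clock cycle and that each of the two multiplier units completes one multiply-accumulate per cycle. The sketch below works that out; the constant and function names are chosen here for illustration.

```python
FUNCTIONAL_UNITS = 8  # .L1 .S1 .M1 .D1 and .L2 .S2 .M2 .D2
MULTIPLIERS = 2       # .M1 and .M2


def peak_mips(clock_mhz, units=FUNCTIONAL_UNITS):
    """Peak operations per second in millions: one operation per unit per cycle."""
    return clock_mhz * units


def peak_mmacs(clock_mhz, multipliers=MULTIPLIERS):
    """Peak multiply-accumulates per second in millions: one per multiplier per cycle."""
    return clock_mhz * multipliers


if __name__ == "__main__":
    for mhz in (150, 200):
        print(mhz, "MHz:", peak_mips(mhz), "MIPS,", peak_mmacs(mhz), "MMACs/sec")
```

These are upper bounds: they are reached only when every execute packet keeps all eight units busy.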

Application: FIR Filter

[Figure: tapped delay line built from z^-1 delay elements]
- Each tap requires
  - Fetching one data sample
  - Fetching one operand
  - Multiplying two numbers
  - Accumulating multiplication result
  - Shifting one sample in the delay line
- Computing an FIR tap in one instruction cycle
  - Three data memory accesses
  - Auto-increment or decrement addressing modes
  - Modulo addressing to implement delay line as circular buffer
  - Eleven RISC instructions

Application: FIR Filter on a TMS320C5x

[Figure: coefficient and data arrays]

    COEFFP  .set  02000h         ; Program mem address
    X       .set  037Fh          ; Newest data sample
    LASTAP  .set  037FH          ; Oldest data sample

            …
            LAR   AR3, #LASTAP   ; Point to oldest sample
            RPT   #127
            MACD  COEFFP, *-     ; Do the thing
            APAC
            SACH  Y,1            ; Store result -- note shift
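The shift in `SACH Y,1` makes sense if the samples and coefficients are taken to be 16-bit Q15 fractions, which is an assumption here since the slide does not state the number format. Each product is then Q30, and shifting the accumulated sum left by one bit before storing the high 16 bits gives a Q15 result. The Python sketch below models only that arithmetic (it ignores the 32-bit accumulator width and the data move performed by `MACD`); the names `to_q15` and `fir_q15` are hypothetical.

```python
def to_q15(x):
    """Convert a value in [-1, 1) to a 16-bit Q15 integer, clamping at the limits."""
    return max(-32768, min(32767, int(round(x * 32768))))


def fir_q15(coeffs_q15, samples_q15):
    """Multiply-accumulate in Q30, then store the result as Q15."""
    acc = 0
    for c, x in zip(coeffs_q15, samples_q15):
        acc += c * x   # Q15 * Q15 gives a Q30 product
    acc <<= 1          # left shift by one: Q30 becomes Q31 (the ",1" in SACH)
    return acc >> 16   # keep the high 16 bits: Q31 becomes Q15


if __name__ == "__main__":
    coeffs = [to_q15(0.5), to_q15(0.25)]
    samples = [to_q15(0.5), to_q15(0.5)]
    y = fir_q15(coeffs, samples)
    print(y, y / 32768)
```

Without the one-bit shift the stored word would be half the intended value, because a Q30 sum has two sign bits in its top 16 bits.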

Application: FIR Filter on a TMS320C62x

[Figure: coefficient and data arrays]

Single-Cycle Loop

    ...
    c7:      ldh   .D1   *A1++, A2    ; Read coefficient
    ||       ldh   .D2   *B1++, B2    ; Read data
    || [B0]  sub   .L2   B0, 1, B0    ; Decrement counter
    || [B0]  B     .S2   c7           ; Branch if not zero
    ||       mpy   .M1x  A2, B2, A3   ; Form product
    ||       add   .L1   A4, A3, A4   ; Accumulate result
    ...

Ordered Dithering on a TMS320C62x

Array of thresholds

    1/8  5/8
    7/8  3/8

Single-Cycle Loop

    ...
    c7:      ldb     .D1   *A1++, A2    ; Read pixel
    ||       ldb     .D2   *B1++, B2    ; Read threshold
    || [B0]  sub     .L2   B0, 1, B0    ; Decrement counter
    || [B0]  B       .S2   c7           ; Branch if not zero
    ||       cmpgtu  .L1x  A2, B2, A3   ; Threshold
    ||       stb     .D1   A3, *A5++    ; Store result
    ...
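For reference, ordered dithering itself can be written in a few lines of Python. This is a minimal sketch that assumes 8-bit grayscale pixels and scales the 2 x 2 threshold array from the slide by the maximum pixel value; the function name `ordered_dither` and the `max_value` parameter are illustrative. The comparison mirrors the unsigned greater-than test (`cmpgtu`) in the loop above.

```python
THRESHOLDS = [[1 / 8, 5 / 8],
              [7 / 8, 3 / 8]]


def ordered_dither(image, thresholds=THRESHOLDS, max_value=255):
    """Return a binary image: 1 where the pixel exceeds the tiled threshold, else 0."""
    rows, cols = len(thresholds), len(thresholds[0])
    output = []
    for r, line in enumerate(image):
        output.append([
            # The threshold array is tiled periodically over the image.
            1 if pixel > thresholds[r % rows][c % cols] * max_value else 0
            for c, pixel in enumerate(line)
        ])
    return output


if __name__ == "__main__":
    mid_gray = [[128] * 4 for _ in range(4)]
    for line in ordered_dither(mid_gray):
        print(line)
```

A constant mid-gray input produces a checkerboard, since 128 exceeds the scaled 1/8 and 3/8 thresholds but not the 5/8 and 7/8 ones.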

DSP Cores
- ASIC with:
  - Programmable DSP
  - RAM
  - ROM
  - Standard cells
  - Codec
  - Peripherals
  - Gate array

DSP on General Purpose Processors
- Multimedia applications on PCs
  - Video, audio, graphics and animation
  - Repetitive parallel sequences of instructions
- Native signal processing examples
  - Sun Visual Instruction Set (UltraSPARC 1/2)
  - MMX (Pentium I/II/III)
  - Intel Concurrent SIMD-FP (Pentium III)
- Single Instruction Multiple Data (SIMD)
  - One instruction acts on multiple data in parallel
  - Well-suited for graphics

DSP on General Purpose Processors (con't)
- Programming is considerably tougher
  - C/C++ compilers do not generate native signal processing code, except that Metrowerks CodeWarrior 4 gives MMX code
  - Libraries of routines using native signal processing
  - Hand code using in-line assembly for best performance
  - Pack/unpack data not aligned on SIMD word boundaries
  - 50-cycle penalty to switch out of MMX; 0 penalty for VIS
  - Saturation arithmetic in MMX; not supported in VIS
  - Extended-precision accumulation in MMX; none in VIS
- Speedup for applications
  - Signal and image processing - 1.5:1 to 2:1
  - Graphics - 4:1 to 6:1 (no packing/unpacking)

Intel MMX Instruction Set
- 64-bit SIMD register (4 data types)
  - 64-bit quad word
  - Packed byte (8 bytes packed into 64 bits)
  - Packed word (4 16-bit words packed into 64 bits)
  - Packed double word (2 double words packed into 64 bits)
- 57 new instructions
  - Pack and unpack
  - Add, subtract, multiply, and multiply/accumulate
- Saturation and wraparound arithmetic
- Maximum parallelism possible
  - 8:1 for 8-bit additions
  - 4:1 for 8 x 16 multiplication or 16-bit additions
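The difference between wraparound and saturation arithmetic on packed bytes can be illustrated by emulating a 64-bit SIMD add in plain Python. This is only a software model of the idea (the function name `packed_add_bytes` is made up for this sketch); in MMX the two behaviors correspond to the unsigned packed-byte add instructions PADDB (wraparound) and PADDUSB (saturation).

```python
def packed_add_bytes(a, b, saturate):
    """Add eight unsigned bytes packed into two 64-bit integers, lane by lane."""
    result = 0
    for lane in range(8):
        shift = 8 * lane
        total = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        if saturate:
            total = min(total, 0xFF)  # clamp at the largest byte value
        else:
            total &= 0xFF             # discard the carry: wraparound
        result |= total << shift      # no carry ever crosses into the next lane
    return result


if __name__ == "__main__":
    a, b = 0x01FF, 0x0101
    print(hex(packed_add_bytes(a, b, saturate=False)))
    print(hex(packed_add_bytes(a, b, saturate=True)))
```

Saturation is usually the preferred choice for pixel data, because an overflow then yields the brightest value instead of wrapping to a dark one.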

Concluding Remarks
- Conventional digital signal processors
  - High performance vs. power consumption/cost/volume
  - Excel at one-dimensional processing
  - Have instructions tailored to specific applications
- TMS320C6x VLIW DSP
  - High performance vs. cost/volume
  - Excel at multidimensional signal processing
  - A maximum of 22 RISC instructions per cycle
- Native Signal Processing
  - Available on desktop computers
  - Excels at graphics
  - A maximum of 8 RISC instructions per cycle
- In-line assembly code for best performance

Concluding Remarks
- DSP market
  - 40% annual growth rate since 1990
  - $3.5 billion revenue in 1998
  - 45% TI, 25% Lucent, 10% Motorola, 8% Analog Devices
- Independent benchmarking by industry
  - Berkeley Design Technology Inc. http://www.bdti.com
  - EDN Embedded Benchmark Consortium http://www.eembc.org
- Web resources
  - comp.dsp newsgroup: FAQ www.bdti.com/faq/dsp_faq.html
  - Embedded processors and systems: www.eg3.com
  - On-line courses and DSP boards: www.techonline.com
