INTRODUCTION TO DIGITAL SIGNAL PROCESSORS

Prof. Brian L. Evans
in collaboration with Niranjan Damera-Venkata and Magesh Valliappan

Embedded Signal Processing Laboratory
The University of Texas at Austin
Austin, TX 78712-1084
http://signal.ece.utexas.edu/

[Figure labels: accumulator architecture, load-store architecture, memory-register architecture]

Outline

• Signal processing applications
• Conventional DSP architecture
• Pipelining in DSP processors
• RISC vs. DSP processor architectures
• TI TMS320C6x VLIW DSP architecture
• Signal and image processing applications
• Signal processing on general-purpose processors
• Conclusion

2-2 Signal Processing Applications

• Low-cost medium-throughput embedded systems
  - Modems, cellular telephones, disk drives, printers
• High-throughput embedded systems
  - Halftoning, cellular base stations, 3-D sonar, tomography
• Embedded systems to enable PC multimedia
  - Compression/decompression of audio, graphics, video
• Embedded processor requirements
  - Inexpensive with small area and volume
  - Predictable input/output rates to/from processor
  - Power constraints (severe for handheld applications)

2-3 Conventional DSP Architecture

• Harvard architecture
  - Separate data memory/bus and program memory/bus
  - Three reads from memory and one or two writes to memory per instruction cycle
• Deterministic interrupt service routine latency
• Multiply-accumulate in single instruction cycle
• Special addressing modes supported in hardware
  - Modulo addressing for circular buffers (e.g. FIR filters)
  - Bit-reversed addressing (e.g. fast Fourier transforms)

• Instructions to keep the pipeline (3-6 stages) full
  - Zero-overhead looping (one pipeline flush to set up)
  - Delayed branches

2-4 Conventional DSP Architecture (con’t)

• Modulo addressing
  - Implementing circular buffers and delay lines
• Bit-reversed addressing
  - Used to implement the radix-2 FFT

Data shifting (buffer holds the K most recent samples)

Time    Buffer contents                              Next sample
n=N     x[N-K+1]  x[N-K+2]  ...  x[N-1]  x[N]        x[N+1]
n=N+1   x[N-K+2]  x[N-K+3]  ...  x[N]    x[N+1]      x[N+2]
n=N+2   x[N-K+3]  x[N-K+4]  ...  x[N+1]  x[N+2]      x[N+3]

Modulo addressing (only the oldest sample is overwritten)

Time    Buffer contents                                              Next sample
n=N     ...  x[N-2]  x[N-1]  x[N]  x[N-K+1]  x[N-K+2]  ...           x[N+1]
n=N+1   ...  x[N-2]  x[N-1]  x[N]  x[N+1]  x[N-K+2]  x[N-K+3]  ...   x[N+2]
n=N+2   ...  x[N-2]  x[N-1]  x[N]  x[N+1]  x[N+2]  x[N-K+3]  x[N-K+4]  ...   x[N+3]
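Bit-reversed addressing, as used for the radix-2 FFT, can be illustrated in software. The following is a minimal Python sketch, not a model of any particular DSP's address generator: the function names `bit_reverse` and `bit_reversed_order` are hypothetical, and the sketch assumes the transform length is a power of two.

```python
def bit_reverse(index, num_bits):
    """Return index with its lowest num_bits bits in reversed order."""
    result = 0
    for _ in range(num_bits):
        result = (result << 1) | (index & 1)  # move the lowest bit of index into result
        index >>= 1
    return result


def bit_reversed_order(length):
    """List the bit-reversed index for every position of a power-of-two length."""
    num_bits = length.bit_length() - 1
    if length < 1 or (1 << num_bits) != length:
        raise ValueError("length must be a power of two")
    return [bit_reverse(i, num_bits) for i in range(length)]
```

For an 8-point transform, `bit_reversed_order(8)` gives the order in which a radix-2 FFT reads or writes its samples. A DSP with bit-reversed addressing produces this sequence in the address generator, so no extra instructions are spent on reordering.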

2-5 Conventional DSP Processors

                 Fixed-Point                 Floating-Point
Cost/Unit        $3 - $79                    $3 - $381
Architecture     Accumulator                 Load-store or memory-register
Registers        2-4 data,                   8 or 16 data,
                 8 address                   8 or 16 address
Data Words       16 or 24 bit integer        32 bit integer and
                 and fixed-point             fixed/floating-point
On-Chip Memory   2-64 kwords data,           8-64 kwords data,
                 2-64 kwords program         8-64 kwords program
Address Space    16-128 kw data,             16 Mw - 4 Gw data,
                 16-64 kw program            16 Mw - 4 Gw program
Compilers        C compilers;                C, C++ compilers;
                 poor code generation        better code generation
Examples         TI TMS320C5x;               TI TMS320C3x;
                 Motorola 56000              Analog Devices SHARC

2-6 Conventional DSP Processors (con’t)

• Market share: 95% fixed-point, 5% floating-point
• Different on-chip configurations within each family
  - Size and map of data and program memory
  - A/D, input/output buffers, interfaces, timers, and D/A
• Low power: 10-100 mW (C54 0.32 mA/MIPS; C55 0.05 mA/MIPS)
• Drawbacks to conventional DSP processors
  - No byte addressing (needed for image and video)
  - Limited on-chip memory
  - Limited addressable memory on fixed-point DSPs, except Motorola 56300 (16 Mw data; 64 Mw program) and Texas Instruments C54xx (8 Mw data)

2-7 Pipelining

Pipeline styles (each instruction passes through Fetch, Decode, Read, and Execute)

• Sequential (Motorola 56000)
• Pipelined (most conventional DSPs)
• Superscalar (Pentium, MIPS)
• Superpipelined (CDC7600, TMS320C6000)

Managing pipelines

• Compiler or programmer
• Pipeline interlocking in the processor
• Hardware instruction scheduling
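The benefit of overlapping the four stages can be sketched with an idealized cycle count. The Python below is a minimal illustration under stated assumptions: every stage takes one cycle, there are no hazards or stalls, and the function names are hypothetical rather than taken from any vendor tool.

```python
def sequential_cycles(num_instructions, num_stages=4):
    """Cycles when each instruction finishes all stages before the next starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions, num_stages=4):
    """Cycles when a new instruction enters the pipeline every cycle (no stalls)."""
    if num_instructions == 0:
        return 0
    # The first instruction needs num_stages cycles to drain through the pipeline;
    # each later instruction completes one cycle after its predecessor.
    return num_stages + num_instructions - 1
```

Under these assumptions, 100 instructions take 400 cycles sequentially and 103 cycles pipelined, so throughput approaches one instruction per cycle once the pipeline is full. Real pipelines fall short of this whenever a hazard forces a flush or a bubble.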

2-8 Pipelining: Operation

• Time-stationary pipeline model
  - Programmer controls each cycle
  - Motorola DSP56001 (X and Y data memories and registers; MAC means multiplication-accumulation)
      MAC X0,Y0,A  X:(R0)+,X0  Y:(R4)-,Y0
• Data-stationary pipeline model
  - Programmer specifies data operations
  - TI TMS320C30
      MPYF *++AR0(1),*++AR1(IR0),R0
• Interlocked pipeline
  - “Protection” from pipeline effects
  - Implemented on TI TMS320C30
  - May not be reported by simulators: inner loops may take extra cycles

Pipeline occupancy, one row per cycle (A-L are instructions; "-" is a bubble inserted by the interlock)

Fetch  Decode  Read  Execute
  D      C      B      A
  E      D      C      B
  F      E      D      C
  G      F      E      D
  H      G      F      E
  I      H      G      F
  J      I      H      G
  K      J      I      H
  L      K      J      I
  L      -      K      J
         L      -      K
                L      -
                       L

2-9 Pipelining: Hazards

• A control hazard occurs when a branch instruction is decoded
  - Processor “flushes” the pipeline, or
  - Use delayed branch (expose pipeline)
• A data hazard occurs because an operand cannot be read yet
  - Intended by programmer, or
  - Interlock hardware inserts “bubble”
• TI TMS320C5x (20 CPU and 16 I/O registers, one accumulator, and one address pointer ARP implied by *)
      LAR  AR2, ADDR  ; load address reg.
      LACC *-         ; load accumulator w/ contents of AR2
  - LAR: 2 cycles to update AR2 & ARP; need NOP after it

Pipeline occupancy, one row per cycle (br is a branch instruction; "-" is a flushed slot or bubble)

Fetch  Decode  Read  Execute
  D      C      B      A
  E      D      C      B
  F      E      D      C
  br     F      E      D
  G      br     F      E
  -      -      br     F
  -      -      -      br
  X      -      -      -
  Y      X      -      -
  Y      -      X      -
  Z      Y      -      X
         Z      Y      -
                Z      Y
                       Z

2-10 Pipelining: Avoiding Control Hazards

A key factor in the numeric performance of DSPs is the provision of special hardware to perform looping.

      ; repeat next inst. COUNT times
      RPT  COUNT
      TBLR *+

• A repeat instruction repeats one instruction or a block of instructions after repeat
• The pipeline is filled with repeated instruction (or block of instructions)
• Cost: one pipeline flush only

Pipeline occupancy, one row per cycle (rpt is the repeat instruction; X is the repeated instruction; "-" is a flushed slot)

Fetch  Decode  Read  Execute
  D      C      B      A
  E      D      C      B
  F      E      D      C
  rpt    F      E      D
  X      rpt    F      E
  X      -      rpt    F
  X      -      -      rpt
  X      X      -      -
  X      X      X      -
  X      X      X      X
  X      X      X      X
  X      X      X      X

2-11 RISC vs. DSP: Instruction Encoding

• RISC: Superscalar
  [Figure blocks: Reorder; Load/store; FP Unit; Integer Unit]
• DSP: Horizontal microcode
  [Figure blocks: Load/store; Load/store; Address; ALU; Multiplier]

2-12 RISC vs. DSP: Memory Hierarchy

• RISC
  [Figure blocks: Registers; Out of order; I/D Cache; TLB; Physical memory]
  TLB: Translation Lookaside Buffer
• DSP
  [Figure blocks: Registers; ICache; Internal memories; External memories; DMA Controller]
  DMA: Direct Memory Access

2-13 TI TMS320C6x VLIW DSP Architecture

[Figure: Simplified Architecture. Blocks shown: Program RAM; Data RAM or Cache; Internal Buses (Addr, Data); DMA; Serial Port; Host Port; Boot Load; Timers; Pwr Down; External Memory (Sync, Async); CPU with register files Regs (A0-A15) and Regs (B0-B15), functional units .D1/.D2, .M1/.M2, .L1/.L2, .S1/.S2, and Control Regs.]

2-14 TI TMS320C6x VLIW DSP Architecture

• Two parallel data paths with single-cycle units:
  - Data unit: 32-bit address calculations (modulo, linear)
  - Multiplier unit: 16 bit x 16 bit with 32-bit result
  - Logical unit: 40-bit (saturation) arithmetic & compares
  - Shifter unit: 32-bit integer ALU and 40-bit shifter
• 16 32-bit registers in each data path
  - 40 bits can be stored in adjacent even/odd registers
• Fixed-point (C62x) and floating-point (C67x)
• TMS320C6201: $25 in volume
  - 150 MHz, 300 million MACs/sec, 1200 RISC MIPS
  - On-chip memory: 16 k x 32 program, 32 k x 16 data

2-15 TI TMS320C6x VLIW DSP Architecture

• One instruction cycle every clock cycle
• Deep pipeline
  - 7-11 stages in C62x: fetch 4, decode 2, execute 1-5
  - 7-16 stages in C67x: fetch 4, decode 2, execute 1-10
  - If a branch is in the pipeline, interrupts are disabled (the latency of a branch is 5 cycles)
  - Avoid branches by using conditional execution
• No hardware protection against pipeline hazards
  - Compiler and assembler must prevent pipeline hazards
• C67x computes floating-point multiply in 4 cycles

2-16 C5x and C6x Addressing Modes

• Immediate: the operand is part of the instruction
    TMS320C5x:  ADD #0FFh
    TMS320C6x:  add .L1 -13,A1,A6
• Register: operand is specified in a register
    TMS320C5x:  (implied)
    TMS320C6x:  add .L1 A7,A6,A7
• Direct: address of operand is part of the instruction (added to the implied memory page)
    TMS320C5x:  ADD 010h
    TMS320C6x:  not supported
• Indirect: address of operand is stored in a register
    TMS320C5x:  ADD *
    TMS320C6x:  ldw .D1 *A5++[8],A1

2-17 Application: FIR Filter

[Figure: tapped delay line built from unit delays z^-1]

• Each tap requires
  - Fetching one data sample
  - Fetching one operand (the filter coefficient)
  - Multiplying two numbers
  - Accumulating multiplication result
  - Shifting one sample in the delay line
• Computing an FIR tap in one instruction cycle
  - Three data memory accesses
  - Auto-increment or decrement addressing modes
  - Modulo addressing to implement delay line as circular buffer
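The idea of implementing the delay line as a circular buffer can be sketched in software. The Python below is a minimal illustration, not DSP code: the class name `CircularBufferFIR` and its methods are hypothetical, and the `%` operator stands in for the modulo address arithmetic that a DSP performs in hardware at no extra cost.

```python
class CircularBufferFIR:
    """FIR filter whose delay line is a circular buffer (no samples are shifted)."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)
        self.buffer = [0] * len(self.coeffs)  # delay line, initially all zeros
        self.newest = 0                       # index of the most recent sample

    def step(self, sample):
        """Accept one input sample and return one output sample."""
        length = len(self.buffer)
        # Advance with wraparound so the new sample overwrites the oldest one.
        self.newest = (self.newest + 1) % length
        self.buffer[self.newest] = sample
        acc = 0
        index = self.newest
        for coeff in self.coeffs:             # coeffs[0] weights the newest sample
            acc += coeff * self.buffer[index]
            index = (index - 1) % length      # step back one sample in time
        return acc
```

Feeding a unit impulse through `CircularBufferFIR([1, 2, 3])` returns the coefficients in order, which is a quick check that the index wraparound is right. Compared with data shifting, each new sample costs one memory write instead of one write per tap.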

2-18 Application: FIR Filter on a TMS320C5x

[Figure: coefficient and data memory layout]

COEFFP  .set  02000h        ; Program mem address
X       .set  0300h         ; Newest data sample
LASTAP  .set  037Fh         ; Oldest data sample
…
        LAR   AR3, #LASTAP  ; Point to oldest sample
        RPT   #127          ; Repeat next inst. 128 times
        MACD  COEFFP, *-    ; Compute one tap of FIR
        APAC
        SACH  Y,1           ; Store result -- note shift
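The structure of this loop can be mirrored in a short Python model. This is a sketch under stated assumptions, not a cycle-accurate simulation: the function name `fir_macd_q15` is hypothetical, samples and coefficients are assumed to be Q15 integers, Python's unbounded integers stand in for the 32-bit accumulator (so overflow and saturation are not modeled), and the list index abstracts away how coefficients are ordered in program memory.

```python
def fir_macd_q15(coeffs, delay):
    """One FIR output in Q15; delay[0] is the newest sample, delay[-1] the oldest.

    The delay line is shifted in place, as the data move in MACD does.
    """
    if len(coeffs) != len(delay):
        raise ValueError("coeffs and delay must have the same length")
    acc = 0
    product = 0
    last = len(delay) - 1
    for k in range(last, -1, -1):        # walk from oldest to newest, like *-
        acc += product                   # add the product formed on the previous pass
        product = coeffs[k] * delay[k]   # Q15 x Q15 gives a Q30 product
        if k < last:
            delay[k + 1] = delay[k]      # data move: sample ages by one position
    acc += product                       # APAC: add the final product
    return (acc << 1) >> 16              # SACH Y,1: shift left 1, keep the high half
```

The left shift by one converts the Q30 sum to Q31, so the high 16 bits are again a Q15 value. After the call, `delay[0]` still holds the previous newest sample and is ready to be overwritten by the next input.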

2-19 Application: FIR Filter on a TMS320C62x

[Figure: coefficient and data memory layout]

Single-Cycle Loop

...

c7:        ldh  .D1   *A1++, A2   ; Read coefficient
   ||      ldh  .D2   *B1++, B2   ; Read data
   || [B0] sub  .L2   B0, 1, B0   ; Decrement counter
   || [B0] B    .S2   c7          ; Branch if not zero
   ||      mpy  .M1x  A2, B2, A3  ; Form product
   ||      add  .L1   A4, A3, A4  ; Accumulate result

...

2-20 Ordered Dithering on a TMS320C62x
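Ordered dithering compares each pixel with a threshold taken from a small periodic array and outputs 0 or 255. Before the C62x version, here is a minimal Python sketch of that comparison. It rests on assumptions not stated on the slide: pixels are 8-bit values, the thresholds are fractions of the full-scale value 255, and the names `THRESHOLD_EIGHTHS`, `dither_row`, and `dither_image` are hypothetical.

```python
# Threshold numerators in eighths, one row per image row (period 2 rows, 4 columns).
THRESHOLD_EIGHTHS = [
    [1, 5, 7, 3],
    [7, 3, 1, 5],
]


def dither_row(pixels, threshold_row):
    """Threshold one row of 8-bit pixels against a periodic row of thresholds."""
    period = len(threshold_row)
    output = []
    for i, pixel in enumerate(pixels):
        numerator = threshold_row[i % period]  # modulo indexing, like a circular buffer
        # pixel > (numerator / 8) * 255, written with integers only
        output.append(255 if pixel * 8 > numerator * 255 else 0)
    return output


def dither_image(image):
    """Dither a list of rows, cycling through the threshold rows."""
    rows = len(THRESHOLD_EIGHTHS)
    return [dither_row(row, THRESHOLD_EIGHTHS[y % rows]) for y, row in enumerate(image)]
```

The `i % period` index plays the role that modulo addressing of the threshold pointer plays in the assembly: the threshold array repeats without any explicit wraparound test in the loop.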

Periodic array of thresholds:

    1/8  5/8  7/8  3/8
    7/8  3/8  1/8  5/8

Throughput of two cycles

; remove next two lines if thresholds in linear array
           MVK    .S1   0x0001,AMR  ; modulo block size 2^2
           MVKH   .S1   0x4000,AMR  ; modulo addr reg B6
; initialize A6 and B6
           .trip  100               ; minimum loop count
dith:      LDB    .D1   *A6++,A4    ; read pixel
   ||      LDB    .D2   *B6++,B4    ; read threshold
   ||      CMPGTU .L1x  A4,B4,A1    ; threshold pixel
   ||      ZERO   .S1   A5          ; 0 if <= threshold
      [A1] MVK    .S1   255,A5      ; 255 if > threshold
   ||      STB    .D1   A5,*A6++    ; store result
   || [B0] SUB    .L2   B0,1,B0     ; decrement counter
   || [B0] B      .S2   dith        ; branch if not zero

2-21 DSP Cores

• ASIC with:
  - Programmable DSP
  - RAM
  - ROM
  - Standard cells
  - Codec
  - Peripherals
  - Gate array

2-22 DSP on General Purpose Processors

• Multimedia applications on PCs
  - Video, audio, graphics and animation
  - Repetitive parallel sequences of instructions
• Native signal processing examples
  - Sun Visual Instruction Set (UltraSPARC 1/2)
  - Intel MMX (Pentium I/II/III)
  - Intel Concurrent SIMD-FP (Pentium III)
• Single Instruction Multiple Data (SIMD)
  - One instruction acts on multiple data in parallel
  - Well-suited for graphics

2-23 DSP on General Purpose Processors (con’t)

• Programming is considerably tougher
  - Ability of compilers to generate code for instruction set extensions may lag (e.g. four years for Pentium MMX)
  - Libraries of routines using native signal processing
  - Hand code using in-line assembly for best performance
• Single-instruction multiple-data (SIMD) approach
  - Pack/unpack data not aligned on SIMD word boundaries
  - Saturation arithmetic in Intel MMX; not supported in Sun VIS
  - Extended-precision accumulation in MMX; none in Sun VIS
• Application speedup with Intel MMX and Sun VIS
  - Signal and image processing: 1.5:1 to 2:1

2-24 Intel MMX Instruction Set

• 64-bit SIMD register (4 data types)
  - 64-bit quad word
  - Packed byte (8 bytes packed into 64 bits)
  - Packed word (4 16-bit words packed into 64 bits)
  - Packed double word (2 double words packed into 64 bits)
• 57 new instructions
  - Pack and unpack
  - Add, subtract, multiply, and multiply/accumulate
  - Saturation and wraparound arithmetic
• Maximum parallelism possible
  - 8:1 for 8-bit additions
  - 4:1 for 8 x 16 multiplication or 16-bit additions
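The difference between wraparound and saturation arithmetic on packed bytes can be shown with a small software model. The Python below is a sketch only: it treats a 64-bit integer as eight unsigned byte lanes, and the function names are hypothetical. The two add functions behave in the spirit of the MMX instructions PADDB (wraparound) and PADDUSB (unsigned saturation), but nothing here executes MMX code.

```python
def pack_bytes(lanes):
    """Pack eight unsigned bytes into one 64-bit integer (lane 0 in the low byte)."""
    if len(lanes) != 8 or any(not 0 <= b <= 255 for b in lanes):
        raise ValueError("need exactly eight values in 0..255")
    word = 0
    for i, value in enumerate(lanes):
        word |= value << (8 * i)
    return word


def unpack_bytes(word):
    """Split a 64-bit integer into its eight unsigned byte lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(8)]


def add_wraparound(a, b):
    """Lane-wise add; each lane wraps modulo 256 and never carries into its neighbor."""
    return pack_bytes([(x + y) & 0xFF for x, y in zip(unpack_bytes(a), unpack_bytes(b))])


def add_unsigned_saturating(a, b):
    """Lane-wise add; each lane clamps at 255 instead of wrapping."""
    return pack_bytes([min(x + y, 255) for x, y in zip(unpack_bytes(a), unpack_bytes(b))])
```

Adding 10 to a pixel value of 250 gives 4 with wraparound but 255 with saturation, which is why saturation matters for image data: an overflow turns a bright pixel slightly brighter rather than nearly black. One such operation handles eight pixels, which is the 8:1 parallelism for 8-bit additions listed above.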

2-25 Concluding Remarks

• Conventional digital signal processors
  - High performance vs. power consumption/cost/volume
  - Excel at one-dimensional processing
  - Per cycle: 1 16x16 MAC & 4 16-bit RISC instructions
• TMS320C6000 VLIW DSP
  - High performance vs. cost/volume
  - Excels at multidimensional signal processing
  - Per cycle: 2 16x16 MACs & 4 32-bit RISC instructions
• Native Signal Processing
  - Available on desktop computers
  - Excels at graphics
  - Per cycle: 2 8x16 MACs OR 8 8-bit RISC instructions
  - Use in-line assembly code for best performance

2-26 Concluding Remarks

• Digital signal processor market
  - Revenue: $3.5 Billion '98, $4.4 Billion '99, and $6.1 Billion '00
  - 40% annual growth rate 1990-2000
  - 45% TI, 25% Lucent, 10% Motorola, 8% Analog Devices
• Independent benchmarking by industry
  - Berkeley Design Technology Inc. http://www.bdti.com
  - EDN Embedded Microprocessor Benchmark Consortium http://www.eembc.org
• Web resources
  - Newsgroup comp.dsp: FAQ www.bdti.com/faq/dsp_faq.html
  - Embedded processors and systems: www.eg3.com
  - On-line courses: www.techonline.com

2-27 References

1. G. E. Allen, B. L. Evans, and D. C. Schanbacher, “Real-Time Sonar Beamforming on a Unix Workstation,” Proc. IEEE Asilomar Conf. on Signals, Systems, and Computers, pp. 764-768, 1998. http://www.ece.utexas.edu/~bevans/papers/1998/beamforming/
2. R. Bhargava, R. Radhakrishnan, B. L. Evans, and L. K. John, “Evaluating MMX Technology Using DSP and Multimedia Applications,” Proc. IEEE Sym. on Microarchitecture, pp. 37-46, 1998. http://www.ece.utexas.edu/~ravib/mmxdsp/
3. W. Chen, H. J. Reekie, S. Bhave, and E. A. Lee, “Native Signal Processing on the UltraSPARC in the Ptolemy Environment,” Proc. IEEE Asilomar Conf. on Signals, Systems, and Computers, 1996. http://www.ece.utexas.edu/~bevans/courses/ee382c/lectures/21_nsp/vis/
4. B. L. Evans, “EE379K-17 Real-Time DSP Laboratory,” UT Austin. http://www.ece.utexas.edu/~bevans/courses/realtime/
5. B. L. Evans, “EE382C Embedded Software Systems,” UT Austin. http://www.ece.utexas.edu/~bevans/courses/ee382c/
6. A. Kulkarni and A. Dube, “Evaluation of the Code Generation Domain in Ptolemy,” http://www.ece.utexas.edu/~bevans/talks/benchmarking97/sld001.htm
7. P. Lapsley, J. Bier, A. Shoham, and E. A. Lee, DSP Processor Fundamentals, IEEE Press, 1997.
