Lecture 10a: Digital Signal Processors: A TI Architectural History
Collated by: Professor Kurt Keutzer
Computer Science 252, Spring 2000
With contributions from: Dr. Brock Barton, Clark Hise, TI; Dr. Surendar S. Magar, Berkeley Concept Research Corporation

DSP ARCHITECTURE EVOLUTION

[Figure: DSP architecture evolution, 1980 to 1995. Architecture classes shown: µC and analog µP; DSP µP and RISC; DSP building blocks and slice processors (multipliers, MUL, etc.); function/application-specific chips; multiprocessors (MP). Application examples shown: industrial control, low-end modems, instruments, voice coding, control, high-end modems, digital radios, radars, W-CDMA, video/imaging, multi-processing.]

DSP ARCHITECTURE Enabling Technologies

Time Frame / Approach / Primary Application / Enabling Technologies

Early 1970's
  Approach: Discrete logic
  Primary application: Non-real-time processing; simulation
  Enabling technologies: Bipolar SSI, MSI; FFT algorithm

Late 1970's
  Approach: Building block
  Primary application: Military radars; digital comm.
  Enabling technologies: Single chip bipolar multiplier; flash A/D

Early 1980's
  Approach: Single chip DSP µP
  Primary application: Telecom; control
  Enabling technologies: µP architectures; NMOS/CMOS

Late 1980's
  Approach: Function/application specific chips
  Primary application: Computers; communication
  Enabling technologies: Vector processing; parallel processing

Early 1990's
  Approach: -
  Primary application: Video/image processing
  Enabling technologies: Advanced multiprocessing; VLIW, MIMD, etc.

Late 1990's
  Approach: Single-chip multiprocessing
  Primary application: Wireless telephony; Internet related
  Enabling technologies: Low power single-chip DSP; multiprocessing

TMS320 Family Multiple DSP µP Generations

Device        First    Bit size          Clock    Instruction   MAC         MOPS   Device density
              sample                     speed    throughput    execution          (# of transistors)
                                         (MHz)                  (ns)

Uniprocessor Based
TMS32010      1982     16 integer        20       5 MIPS        400         5      58,000 (3µ)
TMS320C25     1985     16 integer        40       10 MIPS       100         20     160,000 (2µ)
TMS320C30     1988     32 flt.pt.        33       17 MIPS       60          33     695,000 (1µ)
TMS320C50     1991     16 integer        57       29 MIPS       35          60     1,000,000 (0.5µ)
TMS320C2XXX   1995     16 integer                 40 MIPS       25          80

Multiprocessor Based
TMS320C80     1996     32 integer/flt.            2 GOPS, 120 MFLOP (MIMD)
TMS320C62XX   1997     16 integer                 1600 MIPS     5           20 GOPS (VLIW)
TMS320C67XX   1997     32 flt. pt.                              5           1 GFLOP (VLIW)

First Generation DSP µP Case Study
TMS32010 (Texas Instruments) - 1982

Features
• 200 ns instruction cycle (5 MIPS)
• 144 words (16 bit) on-chip data RAM
• 1.5K words (16 bit) on-chip program ROM - TMS32010
• External program memory expansion to a total of 4K words at full speed
• 16-bit instruction/data word
• Single cycle 32-bit ALU/accumulator
• Single cycle 16 x 16-bit multiply in 200 ns
• Two cycle MAC (5 MOPS)
• Zero to 15-bit barrel shifter
• Eight input and eight output channels
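The throughput figures in the feature list follow from the 200 ns cycle time. The short sketch below reproduces that arithmetic. It assumes that MOPS counts the multiply and the add of a MAC as two separate operations, which the slide does not state explicitly; the function names are illustrative only.

```python
CYCLE_NS = 200      # instruction cycle time from the feature list
MAC_CYCLES = 2      # "two cycle MAC"
OPS_PER_MAC = 2     # assumption: one multiply plus one add per MAC


def mips(cycle_ns):
    """Millions of single-cycle instructions per second."""
    return 1000 / cycle_ns


def mac_time_ns(cycle_ns, mac_cycles):
    """Time for one multiply-accumulate in nanoseconds."""
    return cycle_ns * mac_cycles


def mops(cycle_ns, mac_cycles, ops_per_mac=OPS_PER_MAC):
    """Millions of arithmetic operations per second in a MAC loop."""
    return ops_per_mac * 1000 / mac_time_ns(cycle_ns, mac_cycles)


if __name__ == "__main__":
    print("MIPS:", mips(CYCLE_NS))
    print("MAC time (ns):", mac_time_ns(CYCLE_NS, MAC_CYCLES))
    print("MOPS:", mops(CYCLE_NS, MAC_CYCLES))
```

Under that assumption the 400 ns MAC time and the 5 MOPS figure are consistent with each other and with the family table.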

TMS32010 BLOCK DIAGRAM

TMS32010 Program Memory Maps

Microcomputer Mode
  Address   16-bit word
  0         Reset 1st word
  1         Reset 2nd word
  2         Interrupt
  ...       Internal memory space
  1525      Internal memory space reserved for testing
  1536      External memory space
  4095

Microprocessor Mode
  Address   16-bit word
  0         Reset 1st word
  1         Reset 2nd word
  2         Interrupt
  ...       External memory space
  4095

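Digital FIR Filter Implementation (Uniprocessor-Circular Buffer)

[Figure: circular-buffer FIR. Samples X0 ... Xn-1 are held in a circular buffer and paired with coefficients a(n-1) ... a0. The 1st and 2nd cycles are shown with their start and end positions ("start each time here"); the products are summed in the accumulator (Acc), and the starting value is replaced with the new value.]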
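The circular-buffer scheme in the figure can be sketched in a few lines of Python. This is an illustrative model, not the TMS32010 program: the class name `CircularFIR`, the convention that `h[0]` multiplies the newest sample, and the zero-initialized delay line are all assumptions made for the example.

```python
class CircularFIR:
    """FIR filter whose delay line is a fixed-size circular buffer."""

    def __init__(self, coeffs):
        self.h = list(coeffs)           # h[0] multiplies the newest sample
        self.buf = [0] * len(self.h)    # delay line, assumed to start at zero
        self.pos = 0                    # slot that holds the oldest sample

    def step(self, x_new):
        """Accept one input sample and return one output sample."""
        n = len(self.h)
        self.buf[self.pos] = x_new      # replace the oldest value with the new one
        acc = 0
        idx = self.pos                  # start at the newest sample
        for k in range(n):
            acc += self.h[k] * self.buf[idx]
            idx = (idx - 1) % n         # step back in time, wrapping around
        self.pos = (self.pos + 1) % n   # next call starts one slot later
        return acc


if __name__ == "__main__":
    fir = CircularFIR([1, 2, 3])
    # Feeding an impulse should read the coefficients back out.
    print([fir.step(x) for x in (1, 0, 0, 0)])
```

No samples are ever moved: only the start index advances, which is why each pass in the figure begins one position further along the buffer.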

TMS32010 FIR FILTER PROGRAM Indirect Addressing (Smaller Program Space)

Y(n) = x[n-(N-1)] . h(N-1) + x[n-(N-2)] . h(N-2) +…+ x(n) . h(0)

For N=50, Indirect Addressing t=42 µs (23.8 kHz)
For N=50, Direct Addressing t=21.6 µs (40.2 kHz)

TMS320C203/LC203 BLOCK DIAGRAM DSP Core Approach - 1995

Third Generation DSP µP Case Study TMS320C30 - 1988

TMS320C30 Key Features
• 60 ns single-cycle instruction execution time
  - 33.3 MFLOPS (million floating-point operations per second)
  - 16.7 MIPS (million instructions per second)
• One 4K x 32-bit single-cycle dual-access on-chip ROM block
• Two 1K x 32-bit single-cycle dual-access on-chip RAM blocks
• 64 x 32-bit instruction cache
• 32-bit instruction and data words, 24-bit addresses
• 40/32-bit floating-point/integer multiplier and ALU
• 32-bit barrel shifter

Third Generation DSP µP Case Study TMS320C30 - 1988

TMS320C30 Key Features (cont.)
• Eight extended precision registers (accumulators)
• Two address generators with eight auxiliary registers and two auxiliary register arithmetic units
• On-chip direct memory access (DMA) controller for concurrent I/O and CPU operation
• Parallel ALU and multiplier instructions
• Block repeat capability
• Interlocked instructions for multiprocessing support
• Two serial ports to support 8/16/32-bit transfers
• Two 32-bit timers
• 1 µ CDMOS

TMS320C30 BLOCK DIAGRAM

TMS320C3x CPU BLOCK DIAGRAM

TMS320C3x MEMORY BLOCK DIAGRAM

TMS320C30 Memory Organization

Address range         Microprocessor Mode                       Microcomputer Mode
0h - 0BFh             Interrupt locations & reserved (192)      Interrupt locations & reserved (192)
0C0h - 0FFFh          External, STRB active                     ROM (internal)
1000h - 7FFFFFh       External, STRB active                     External, STRB active
800000h - 801FFFh     Expansion bus, MSTRB active (8K)          Expansion bus, MSTRB active (8K)
802000h - 803FFFh     Reserved (8K)                             Reserved (8K)
804000h - 805FFFh     Expansion bus, IOSTRB active (8K)         Expansion bus, IOSTRB active (8K)
806000h - 807FFFh     Reserved (8K)                             Reserved (8K)
808000h - 8097FFh     Peripheral bus memory-mapped              Peripheral bus memory-mapped
                      registers (internal) (6K)                 registers (internal) (6K)
809800h - 809BFFh     RAM block 0 (1K) (internal)               RAM block 0 (1K) (internal)
809C00h - 809FFFh     RAM block 1 (1K) (internal)               RAM block 1 (1K) (internal)
80A000h - 0FFFFFFh    External, STRB active                     External, STRB active

TMS320C30 FIR FILTER PROGRAM

Y(n) = x[n-(N-1)] . h(N-1) + x[n-(N-2)] . h(N-2) +…+ x(n) . h(0)
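Each output Y(n) costs N multiply-accumulates, so the per-output times quoted in these slides for N=50 (42 µs on the TMS32010 with indirect addressing, 3.6 µs on the TMS320C30) translate directly into a maximum sample rate and a cost per tap. The sketch below is only a back-of-the-envelope check; the helper names are made up for the example, and the cycle times (200 ns and 60 ns) are taken from the feature lists.

```python
def max_sample_rate_khz(t_ns):
    """Highest input sample rate (kHz) if one output takes t_ns nanoseconds."""
    return 1e6 / t_ns


def cycles_per_tap(t_ns, cycle_ns, taps):
    """Average instruction cycles spent per filter tap."""
    return t_ns / cycle_ns / taps


if __name__ == "__main__":
    N = 50
    cases = (("TMS32010, indirect", 42000, 200), ("TMS320C30", 3600, 60))
    for name, t_ns, cycle_ns in cases:
        rate = max_sample_rate_khz(t_ns)
        cost = cycles_per_tap(t_ns, cycle_ns, N)
        print(f"{name}: {rate:.1f} kHz, {cost:.2f} cycles/tap")
```

On these numbers the TMS32010 loop spends a little over four cycles per tap, while the TMS320C30 comes close to one cycle per tap, which is what its parallel multiply/ALU instructions and block repeat are meant to achieve.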

For N=50, t=3.6 µs (277 kHz)

’C54x Architecture

TMS320C54x Internal Block Diagram

Architecture optimized for DSP
#1: CPU designed for efficient DSP processing
  - MAC unit, 2 Accumulators, Additional Adder, Barrel Shifter
#2: Multiple busses for efficient data and program flow
  - Four busses and large on-chip memory that result in sustained performance near peak
#3: Highly tuned instruction set for powerful DSP computing
  - Sophisticated instructions that execute in fewer cycles, with less code and low power demands

Key #1: DSP engine

$$Y = \sum_{n=1}^{40} a_n \, x_n$$

[Figure: inputs x and a feed a multiplier (MPY), followed by an adder (ADD) that produces y.]

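Key #1: MAC Unit

MAC *AR2+, *AR3+, A

[Figure: MAC unit. Labels: Data, Acc A, Temp, Coeff, Prgm, signed/unsigned (S/U) operand inputs, multiplier (MPY), fractional mode bit, adder (ADD) with inputs A, B, and 0, results to acc A and acc B.]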
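What `MAC *AR2+, *AR3+, A` does can be modelled in a few lines: multiply the two operands addressed by AR2 and AR3, add the product to accumulator A, and post-increment both pointers. The sketch below is a behavioural model under stated assumptions, not a cycle-accurate one: the 40-bit accumulator width and the one-bit left shift of the product in fractional mode are assumptions about the 'C54x that the slide does not spell out, and `mem`, `mac`, and `to_signed` are hypothetical names.

```python
def to_signed(value, bits):
    """Wrap an integer into a signed two's-complement range of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def mac(mem, ar2, ar3, acc, fractional=False):
    """One MAC step: acc += mem[ar2] * mem[ar3], then post-increment both pointers."""
    product = mem[ar2] * mem[ar3]
    if fractional:
        product <<= 1                      # assumed effect of the fractional mode bit
    acc = to_signed(acc + product, 40)     # assumed 40-bit accumulator
    return ar2 + 1, ar3 + 1, acc


if __name__ == "__main__":
    mem = [1, 2, 3, 4, 5, 6]               # data at 0..2, coefficients at 3..5
    ar2, ar3, acc = 0, 3, 0
    for _ in range(3):
        ar2, ar3, acc = mac(mem, ar2, ar3, acc)
    print(ar2, ar3, acc)                   # pointers advanced by 3, acc = 1*4 + 2*5 + 3*6
```

With the fractional flag set, two Q15 operands of 0.5 (16384) produce 2**29, which is 0.25 in Q31, so the extra sign bit of the integer product is removed.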

Key #1: Accumulators + Adder

General-Purpose Math example: t = s+e-r

    LD  @s, A
    ADD @e, A
    SUB @r, A
    STL A, @t

[Figure: accumulators and adder. Labels: A bus, B bus, inputs A, B, T, D, shifter, MUX, ALU, MAC, acc A, acc B, U bus.]

Key #1: Barrel shifter

    LD  @X, 16, A
    STH @B, Y

[Figure: barrel shifter with a shift range of -16 to +31. Labels: inputs A, B, C, D, S bus, ALU, E bus.]

Key #1: Temporary register

    LD  @x, T
    MPY @a, A

For example: A = xa

[Figure: temporary register. Labels: inputs D, X, EXP encoder, A, B, temporary register (T), T bus, MAC, ALU.]

Key #2: Efficient data/program flow

Key #2: Multiple busses

MAC *AR2+, *AR3+, A

[Figure: external and internal memory connected through MUXes and busses labeled P, D, C, and E to the central arithmetic logic unit (T, MAC, A, B, ALU, shifter).]

Key #2: Pipeline

Prefetch (P), Fetch (F), Decode (D), Access (A), Read (R), Execute (E)

• Prefetch: Calculate address of instruction
• Fetch: Collect instruction
• Decode: Interpret instruction
• Access: Collect address of operand
• Read: Collect operand
• Execute: Perform operation

Key #2: Bus usage

[Figure: bus usage. Labels: CNTL, PC, ARs, external memory, internal memory, MUXes, busses P, D, C, and E, central arithmetic logic unit (T, MAC, A, B, ALU, shifter).]

Key #2: Pipeline performance

CYCLES ->
P1 F1 D1 A1 R1 X1
   P2 F2 D2 A2 R2 X2
      P3 F3 D3 A3 R3 X3
         P4 F4 D4 A4 R4 X4
            P5 F5 D5 A5 R5 X5
               P6 F6 D6 A6 R6 X6

Fully loaded pipeline: in the sixth cycle all six stages are busy (X1, R2, A3, D4, F5, P6).
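The staircase above can be generated programmatically, which also makes the cycle counts explicit: with a six-stage pipeline and no stalls, n instructions take n + 5 cycles, and the pipeline is full from the sixth cycle on. The sketch below is an idealized model that assumes one instruction enters per cycle and nothing ever stalls; the function names are illustrative.

```python
STAGES = ["P", "F", "D", "A", "R", "X"]   # stage letters as used in the diagram


def render(n_instr, stages=STAGES):
    """Return the pipeline staircase as text, one instruction per row."""
    lines = []
    for i in range(n_instr):
        cells = ["  "] * i + [f"{s}{i + 1}" for s in stages]
        lines.append(" ".join(cells).rstrip())
    return "\n".join(lines)


def total_cycles(n_instr, depth=len(STAGES)):
    """Cycles needed to complete n_instr instructions with no stalls."""
    return n_instr + depth - 1 if n_instr else 0


def stages_busy(cycle, n_instr, depth=len(STAGES)):
    """Number of instructions in flight during a 1-based cycle."""
    return sum(1 for i in range(n_instr) if i + 1 <= cycle <= i + depth)


if __name__ == "__main__":
    print(render(6))
    print("total cycles:", total_cycles(6))
    print("busy stages in cycle 6:", stages_busy(6, 6))
```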

Key #3: Powerful instructions

Key #3: Advanced applications

Application               Instructions
Symmetric FIR filter      FIRS
Adaptive filtering        LMS
Polynomial evaluation     POLY
Code book search          STRCD, SACCD, SRCCD
Viterbi                   DADST, DSADT, CMPS

C62x Architecture

TMS320C6201 Revision 2

[Figure: TMS320C6201 block diagram. Program cache / program memory (32-bit address, 256-bit data, 512K bits RAM). C6201 CPU megamodule: program fetch, instruction dispatch, instruction decode, control registers, control logic, test, emulation, interrupts; data path 1 (A register file; L1, S1, M1, D1) and data path 2 (B register file; D2, M2, S2, L2). Data memory (32-bit address; 8-, 16-, 32-bit data; 512K bits RAM). Peripherals: power down, host port interface, 4-channel DMA, external memory interface, 2 timers, 2 multichannel buffered serial ports (T1/E1).]

C6201 Internal Memory Architecture

• Separate Internal Program and Data Spaces
• Program
  - 16K 32-bit instructions (2K Fetch Packets)
  - 256-bit Fetch Width
  - Configurable as either
    - Direct Mapped Cache
    - Memory Mapped Program Memory
• Data
  - 32K x 16
  - Single Ported, Accessible by Both CPU Data Buses
  - 4 x 8K 16-bit Banks
    - 2 Possible Simultaneous Memory Accesses (4 Banks)
    - 4-Way Interleave; Banks and Interleave Minimize Access Conflicts
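The sizes in this list are mutually consistent, and the 4-way interleave can be illustrated with a toy bank-selection function. In the sketch below the size arithmetic comes straight from the slide, but the bank mapping is an assumption made for illustration: consecutive 16-bit halfwords are taken to rotate through the four banks, and only 16-bit accesses are considered (a wider access would touch more than one bank).

```python
PROGRAM_INSTRUCTIONS = 16 * 1024   # 16K 32-bit instructions
FETCH_PACKET_INSTRUCTIONS = 8      # 256-bit fetch width / 32-bit instructions
DATA_HALFWORDS = 32 * 1024         # 32K x 16
N_BANKS = 4


def fetch_packets():
    """Number of 256-bit fetch packets held in program memory."""
    return PROGRAM_INSTRUCTIONS // FETCH_PACKET_INSTRUCTIONS


def program_bits():
    return PROGRAM_INSTRUCTIONS * 32


def data_bits():
    return DATA_HALFWORDS * 16


def bank_of(byte_addr):
    """Bank holding the halfword at byte_addr (assumed low-order interleave)."""
    return (byte_addr >> 1) % N_BANKS


def conflicts(addr_a, addr_b):
    """True if two simultaneous 16-bit accesses would hit the same bank."""
    return bank_of(addr_a) == bank_of(addr_b)


if __name__ == "__main__":
    print(fetch_packets(), program_bits() // 1024, data_bits() // 1024)
    print(conflicts(0x0000, 0x0002), conflicts(0x0000, 0x0008))
```

Under this mapping two accesses to neighbouring halfwords proceed in parallel, while two accesses whose halfword addresses differ by a multiple of four collide, which is the kind of conflict the interleave is meant to make rare.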

C62x Interrupts
• 12 Maskable Interrupts, Non-Maskable Interrupt (NMI)
• Interrupt Return Pointers (IRP, NRP)
• Fast Interrupt Handling
  - Branches Directly to 8-Instruction Service Fetch Packet
  - Can Branch out with no overhead for longer service
  - 7 Cycle Overhead: Time When No Code is Running
  - 12 Cycle Latency: Interrupt Response Time
• Interrupt Acknowledge (IACK) and Number (INUM) Signals
• Branch Delay Slots Protected From Interrupts
• Edge Triggered

C62x

[Figure: C62x datapaths. Register file A (A0 - A15) and register file B (B0 - B15), linked by cross paths 1X and 2X, feed functional units L1, S1, M1, D1 and D2, M2, S2, L2; each unit has source (S1, S2) and destination (D) ports, with long source/destination ports (SL, DL) on the L and S units. Memory ports: DDATA_I1 and DDATA_I2 (load data), DDATA_O1 and DDATA_O2 (store data), DADR1 and DADR2 (address). Legend: cross paths, 40-bit write paths (8 MSBs), 40-bit read paths/store paths.]

Functional Units
• L-Unit (L1, L2)
  - 40-bit Integer ALU, Comparisons
  - Bit Counting, Normalization
• S-Unit (S1, S2)
  - 32-bit ALU, 40-bit Shifter
  - Bitfield Operations, Branching
• M-Unit (M1, M2)
  - 16 x 16 -> 32
• D-Unit (D1, D2)
  - 32-bit Add/Subtract
  - Address Calculations

C62x Datapaths

[Figure: C62x datapaths. Register files A0 - A15 and B0 - B15 with cross paths 1X and 2X; functional units L1, S1, M1, D1, D2, M2, S2, L2; memory ports DDATA_I1/DDATA_I2 (load data), DDATA_O1/DDATA_O2 (store data), DADR1/DADR2 (address). Legend: cross paths, 40-bit write paths (8 MSBs), 40-bit read paths/store paths.]

C62x Instruction Packing
Instruction Packing: Advanced VLIW

• Fetch Packet
  - CPU fetches 8 instructions/cycle
• Execute Packet
  - CPU executes 1 to 8 instructions/cycle
  - Fetch packets can contain multiple execute packets
• Parallelism determined at compile / assembly time
• Examples
  - 1) 8 parallel instructions
  - 2) 8 serial instructions
  - 3) Mixed Serial/Parallel Groups
    - A // B
    - C
    - D
    - E // F // G // H
• Reduces Codesize, Number of Program Fetches, Power Consumption

[Figure: Example 1 shows A B C D E F G H in a single row (one execute packet). Example 2 shows A through H one per row (eight execute packets). Example 3 shows rows A B, C, D, and E F G H.]
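The three packing examples can be reproduced with a small function that splits a fetch packet into execute packets. The slide does not say how the grouping is encoded; the sketch below assumes one "parallel" flag per instruction, where a set flag means the next instruction issues in the same cycle as this one. The function name and the list-of-flags representation are illustrative.

```python
def split_execute_packets(instrs, p_bits):
    """Group a fetch packet into execute packets.

    p_bits[i] == 1 is assumed to mean: instruction i+1 runs in parallel
    with instruction i. A 0 closes the current execute packet.
    """
    packets, current = [], []
    for name, p in zip(instrs, p_bits):
        current.append(name)
        if not p:
            packets.append(current)
            current = []
    if current:                     # tolerate a trailing flag that is still set
        packets.append(current)
    return packets


if __name__ == "__main__":
    fetch_packet = list("ABCDEFGH")
    print(split_execute_packets(fetch_packet, [1, 1, 1, 1, 1, 1, 1, 0]))  # example 1
    print(split_execute_packets(fetch_packet, [0] * 8))                   # example 2
    print(split_execute_packets(fetch_packet, [1, 0, 0, 0, 1, 1, 1, 0]))  # example 3
```

Because serial code does not have to be padded with NOPs to a full eight-wide word, the same eight slots can carry anything from one to eight cycles of work, which is where the code size and fetch savings listed above come from.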

C62x Pipeline Operation
Pipeline Phases

Fetch: PG PS PW PR    Decode: DP DC    Execute: E1 E2 E3 E4 E5

• Single-Cycle Throughput
• Operate in Lock Step
• Fetch
  - PG Program Address Generate
  - PS Program Address Send
  - PW Program Access Ready Wait
  - PR Program Fetch Packet Receive
• Decode
  - DP Instruction Dispatch
  - DC Instruction Decode
• Execute
  - E1 - E5 Execute 1 through Execute 5

Execute Packet 1  PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 2     PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 3        PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 4           PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 5              PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 6                 PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 7                    PG PS PW PR DP DC E1 E2 E3 E4 E5

C62x Pipeline Operation
Delay Slots

• Delay Slots: number of extra cycles until the result is:
  - written to the register file
  - available for use by subsequent instructions
  - Multi-cycle NOP instruction can fill delay slots while minimizing codesize impact

Instruction type     Execute phases                            Delay slots
Most instructions    E1                                        None
Integer multiply     E1 E2                                     1
Loads                E1 E2 E3 E4 E5                            4
Branches             E1 (branch target: PG PS PW PR DP DC E1)  5
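The delay-slot counts translate into a simple scheduling rule: a result produced by an instruction with d delay slots can first be consumed d + 1 cycles after that instruction issues, and any of those cycles not filled with independent work must be padded with NOPs. The sketch below encodes the counts from this slide; the dictionary keys and function names are made up for the example.

```python
# Delay slots per instruction class, as listed on the slide.
DELAY_SLOTS = {"single": 0, "multiply": 1, "load": 4, "branch": 5}


def result_ready_cycle(issue_cycle, kind):
    """First cycle in which a later instruction may use the result."""
    return issue_cycle + 1 + DELAY_SLOTS[kind]


def nops_needed(kind, useful_between=0):
    """NOP cycles required if only `useful_between` independent cycles are available."""
    return max(0, DELAY_SLOTS[kind] - useful_between)


if __name__ == "__main__":
    print(result_ready_cycle(0, "load"))         # a load issued in cycle 0
    print(nops_needed("load", useful_between=1))
    print(nops_needed("branch"))
```

A single multi-cycle NOP can cover all of the padding cycles at once, which is the code size point made in the last bullet.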

C6000 Pipeline Operation Benefits
• Cycle Time
  - Allows 6 ns cycle time on C67x
  - Allows 5 ns cycle time & single cycle execution on C62x
• Parallelism
  - 8 new instructions can always be dispatched every cycle
• High Performance Internal Memory Access
  - Pipelined Program and Data Accesses
  - Two 32-bit Data Accesses/Cycle (C62x)
  - Two 64-bit Data Accesses/Cycle (C67x)
  - 256-bit Program Access/Cycle
• Good Compiler Target
  - Visible: No Variable-Length Pipeline Flow
  - Deterministic: Order and Time of Execution
  - Orthogonal: Independent Instructions

C6000 Instruction Set Features
Conditional Instructions

• All Instructions can be Conditional
  - A1, A2, B0, B1, B2 can be used as Conditions
  - Based on Zero or Non-Zero Value
  - Compare Instructions can allow other Conditions (<, >, etc)
• Reduces Branching
• Increases Parallelism

C6000 Instruction Set Addressing Features
• Load-Store Architecture
• Two Addressing Units (D1, D2)
• Orthogonal
  - Any Register can be used for Addressing or Indexing
• Signed/Unsigned Byte, Half-Word, Word, Double-Word Addressable
  - Indexes are Scaled by Type
• Register or 5-Bit Unsigned Constant Index

C6000 Instruction Set Addressing Features
• Indirect Addressing Modes
  - Pre-Increment     *++R[index]
  - Post-Increment    *R++[index]
  - Pre-Decrement     *--R[index]
  - Post-Decrement    *R--[index]
  - Positive Offset   *+R[index]
  - Negative Offset   *-R[index]
• 15-bit Positive/Negative Constant Offset from Either B14 or B15

C6000 Instruction Set Addressing Features
• Circular Addressing
  - Fast and Low Cost: Power of 2 Sizes and Alignment
  - Up to 8 Different Pointers/Buffers, Up to 2 Different Buffer Sizes
• Dual Endian Support
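Restricting circular buffers to power-of-2 sizes and alignment is what makes them "fast and low cost": the wrap-around needs no comparison or modulo, only a bit mask. The sketch below shows the idea on plain integers. It is a generic illustration of the masking trick under the stated size and alignment assumptions, not a description of the exact C6000 register encoding; `circ_update` is a hypothetical name.

```python
def circ_update(ptr, step, size):
    """Advance ptr by step inside a size-byte block aligned on a size boundary.

    size must be a power of two. The upper address bits (the block base)
    are kept and only the low log2(size) bits wrap around.
    """
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a power of two")
    mask = size - 1
    base = ptr & ~mask                   # aligned start of the buffer
    return base | ((ptr + step) & mask)  # wrap the offset with a mask


if __name__ == "__main__":
    p = 0x1006
    p = circ_update(p, 4, 8)             # steps past the end of the 8-byte block
    print(hex(p))
    print(hex(circ_update(0x1000, -2, 8)))  # steps back before the start
```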

C67x Architecture

TMS320C6701 DSP Block Diagram

[Figure: TMS320C6701 block diagram. Program cache/program memory (32-bit address, 256-bit data, 512K bits RAM). ’C67x floating-point CPU core: program fetch, instruction dispatch, instruction decode, control registers, control logic, test, emulation, interrupts; data path 1 (A register file; L1, S1, M1, D1) and data path 2 (B register file; D2, M2, S2, L2). Data memory (32-bit address; 8-, 16-, 32-bit data; 512K bits RAM). Peripherals: power down, host port interface, 4-channel DMA, external memory interface, 2 timers, 2 multichannel buffered serial ports (T1/E1).]

TMS320C6701 Advanced VLIW CPU (VelociTI™)

• 1 GFLOPS @ 167 MHz
  - 6-ns cycle time
  - 6 x 32-bit floating-point instructions/cycle
• Load store architecture
• 3.3-V I/Os, 1.8-V internal
• Single- and double-precision IEEE floating-point
• Dual data paths
  - 6 floating-point units / 8 x 32-bit instructions
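The headline rates are peak figures obtained by multiplying the issue width by the clock rate. The sketch below redoes that arithmetic for the cycle times quoted in these slides (6 ns for the ’C67x, 5 ns for the ’C62x); the function names are illustrative, and the results are upper bounds that assume every issue slot is filled in every cycle.

```python
def clock_mhz(cycle_ns):
    """Clock frequency in MHz for a given cycle time in nanoseconds."""
    return 1000 / cycle_ns


def peak_mops(per_cycle, cycle_ns):
    """Peak millions of operations per second at per_cycle operations per cycle."""
    return per_cycle * 1000 / cycle_ns


if __name__ == "__main__":
    print(round(clock_mhz(6)))   # 'C67x clock in MHz
    print(peak_mops(6, 6))       # 6 floating-point instructions per 6 ns cycle
    print(peak_mops(8, 5))       # 8 instructions per 5 ns cycle on the 'C62x
```

Six floating-point instructions per 6 ns cycle gives 1000 MFLOPS, the 1 GFLOPS on the slide, and eight instructions per 5 ns cycle gives the 1600 MIPS listed for the TMS320C62XX in the family table.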

TMS320C6701 Memory/Peripherals
• Same as ’C6201
• External interface supports
  - SDRAM, SRAM, SBSRAM
• 4-channel bootloading DMA
• 16-bit host port interface
• 1 Mbit on-chip SRAM
• 2 multichannel buffered serial ports (T1/E1)
• Pin compatible with ’C6201

TMS320C67x CPU Core

[Figure: ’C67x floating-point CPU core. Program fetch, instruction dispatch, instruction decode, control registers, control logic, test, emulation, interrupts; data path 1 (A register file; L1, S1, M1, D1) and data path 2 (B register file; D2, M2, S2, L2). Floating-point capabilities are marked on the arithmetic logic unit, the auxiliary logic unit, and the multiplier unit.]

C67x Interrupts
• 12 Maskable Interrupts
• Non-Maskable Interrupt (NMI)
• Interrupt Return Pointers (IRP, NRP)
• Fast Interrupt Handling
  - Branches Directly to 8-Instruction Service Fetch Packet
  - 7 Cycle Overhead: Time When No Code is Running
  - 12 Cycle Latency: Interrupt Response Time
• Interrupt Acknowledge (IACK) and Number (INUM) Signals
• Branch Delay Slots Protected From Interrupts
• Edge Triggered

C67x New Instructions

.L Unit                  .M Unit                  .S Unit
(Floating Point          (Floating Point          (Floating Point
Arithmetic Unit)         Multiply Unit)           Auxiliary Unit)

ADDSP                    MPYSP                    ABSSP
ADDDP                    MPYDP                    ABSDP
SUBSP                    MPYI                     CMPGTSP
SUBDP                    MPYID                    CMPEQSP
INTSP                    MPY24                    CMPLTSP
INTDP                    MPY24H                   CMPGTDP
SPINT                                             CMPEQDP
DPINT                                             CMPLTDP
SPTRUNC                                           RCPSP
DPTRUNC                                           RCPDP
DPSP                                              RSQRSP
                                                  RSQRDP
                                                  SPDP

C67x Datapaths

• L-Unit (L1, L2)
  - Floating-Point, 40-bit Integer ALU
  - Bit Counting, Normalization
• S-Unit (S1, S2)
  - Floating Point Auxiliary Unit
  - 32-bit ALU/40-bit shifter
  - Bitfield Operations, Branching
• M-Unit (M1, M2)
  - Multiplier: Integer & Floating-Point
• D-Unit (D1, D2)
  - 32-bit add/subtract
  - Addr Calculations

• 2 Data Paths
• 8 Functional Units
  - Orthogonal/Independent
  - 2 Floating Point Multipliers
  - 2 Floating Point Arithmetic
  - 2 Floating Point Auxiliary
• Control
  - Independent
  - Up to 8 32-bit Instructions
• Registers
  - 2 Files
  - 32, 32-bit registers total
• Cross paths (1X, 2X)

[Figure: C67x datapaths. Register files A0 - A15 and B0 - B15 with cross paths 1X and 2X; functional units L1, S1, M1, D1, D2, M2, S2, L2.]

C67x Instruction Packing
Instruction Packing: Enhanced VLIW

• Fetch Packet
  - CPU fetches 8 instructions/cycle
• Execute Packet
  - CPU executes 1 to 8 instructions/cycle
  - Fetch packets can contain multiple execute packets
• Parallelism determined at compile/assembly time
• Examples
  - 1) 8 parallel instructions
  - 2) 8 serial instructions
  - 3) Mixed Serial/Parallel Groups
    - A // B
    - C
    - D
    - E // F // G // H
• Reduces
  - Codesize
  - Number of Program Fetches
  - Power Consumption

[Figure: Example 1 shows A B C D E F G H in a single row (one execute packet). Example 2 shows A through H one per row (eight execute packets). Example 3 shows rows A B, C, D, and E F G H.]

C67x Pipeline Operation
Pipeline Phases

Fetch: PG PS PW PR    Decode: DP DC    Execute: E1 E2 E3 E4 E5 E6 E7 E8 E9 E10

• Operate in Lock Step
• Fetch
  - PG Program Address Generate
  - PS Program Address Send
  - PW Program Access Ready Wait
  - PR Program Fetch Packet Receive
• Decode
  - DP Instruction Dispatch
  - DC Instruction Decode
• Execute
  - E1 - E5 Execute 1 through Execute 5
  - E6 - E10 Double Precision Only

Execute Packet 1  PG PS PW PR DP DC E1 E2 E3 E4 E5 E6 E7 E8 E9 E10
Execute Packet 2     PG PS PW PR DP DC E1 E2 E3 E4 E5 E6 E7 E8 E9 E10
Execute Packet 3        PG PS PW PR DP DC E1 E2 E3 E4 E5 E6 E7 E8 E9 E10
Execute Packet 4           PG PS PW PR DP DC E1 E2 E3 E4 E5 E6 E7 E8 E9 E10
Execute Packet 5              PG PS PW PR DP DC E1 E2 E3 E4 E5 E6 E7 E8 E9 E10
Execute Packet 6                 PG PS PW PR DP DC E1 E2 E3 E4 E5 E6 E7 E8 E9 E10
Execute Packet 7                    PG PS PW PR DP DC E1 E2 E3 E4 E5 E6 E7 E8 E9 E10

C67x Pipeline Operation
Delay Slots

• Delay Slots: number of extra cycles until the result is:
  - written to the register file
  - available for use by subsequent instructions
  - Multi-cycle NOP instruction can fill delay slots while minimizing codesize impact

Instruction type     Execute phases                            Delay slots
Most integer         E1                                        None
Single-precision     E1 E2 E3 E4                               3
Loads                E1 E2 E3 E4 E5                            4
Branches             E1 (branch target: PG PS PW PR DP DC E1)  5

’C67x and ’C62x Commonality
• Driving commonality between ’C67x & ’C62x shortens ’C67x design time.
• Maintaining symmetry between datapaths shortens the ’C67x design time.

[Figure: ’C62x CPU and ’C67x CPU side by side. Both show program fetch & dispatch, decode, two register files, control registers, emulation, and paired units: M-Unit 1/2 (multiplier unit), D-Unit 1/2 (data load/store), S-Unit 1/2 (auxiliary logic unit), L-Unit 1/2 (arithmetic logic unit). In the ’C67x the M, S, and L units are marked "with Floating Point".]

TMS320C80 MIMD MULTIPROCESSOR Texas Instruments - 1996

Copyright 1999
