Lecture 10A: Digital Signal Processors: a TI Architectural History Collated By: Professor Kurt Keutzer Computer Science 252, Spring 2000 with Contributions From: Dr

Lecture 10a: Digital Signal Processors: A TI Architectural History Collated by: Professor Kurt Keutzer Computer Science 252, Spring 2000 With contributions from: Dr. Brock Barton, Clark Hise TI; Dr. Surendar S. Magar, Berkeley Concept Research Corporation 1 DSP ARCHITECTURE EVOLUTION Multipliers (MUL) Multiprocessors (MP) Multi-Processing Video/Imaging W-CDMA Radars DSP Building Blocks Function/Application Specific Digital Radios & Bit Slice Processors (MUL, etc.) ( MP) High-End Control Modems DSP µP and RISC Voice Coding ( MP ) Instruments µC and Analog µ Low-End P Modems Application Examples Application Industrial Control 1980 1985 1990 1995 2 DSP ARCHITECTURE Enabling Technologies Time Frame Approach Primary Application Enabling Technologies Early 1970’s • Discrete logic • Non-real time • Bipolar SSI, MSI procesing • FFT algorithm • Simulation Late 1970’s • Building block • Military radars • Single chip bipolar multiplier • Digital Comm. • Flash A/D Early 1980’s • Single Chip DSP µP • Telecom • µP architectures • Control • NMOS/CMOS Late 1980’s • Function/Application • Computers • Vector processing specific chips • Communication • Parallel processing Early 1990’s • Multiprocessing • Video/Image Processing • Advanced multiprocessing • VLIW, MIMD, etc. Late 1990’s • Single-chip • Wireless telephony • Low power single-chip DSP multiprocessing • Internet related • Multiprocessing 3 Texas Instruments TMS320 Family Multiple DSP µP Generations First Bit Size Clock Instruction MAC MOPS Device density (# Sample speed Throughput execution of transistors) (MHz) (ns) Uniprocessor Based ( Harvard Architecture) TMS32010 1982 16 integer 20 5 MIPS 400 5 58,000 (3µ) TMS320C25 1985 16 integer 40 10 MIPS 100 20 160,000 (2µ) TMS320C30 1988 32 flt.pt. 33 17 MIPS 60 33 695,000 (1µ) TMS320C50 1991 16 integer 57 29 MIPS 35 60 1,000,000 (0.5µ) TMS320C2XXX 1995 16 integer 40 MIPS 25 80 Multiprocessor Based TMS320C80 1996 32 integer/flt. 2 GOPS MIMD 120 MFLOP TMS320C62XX 1997 16 integer 1600 MIPS 5 20 GOPS VLIW TMS310C67XX 1997 32 flt. pt. 5 1 GFLOP VLIW 4 First Generation DSP µP Case Study TMS32010 (Texas Instruments) - 1982 Features u 200 ns instruction cycle (5 MIPS) u 144 words (16 bit) on-chip data RAM u 1.5K words (16 bit) on-chip program ROM - TMS32010 u External program memory expansion to a total of 4K words at full speed u 16-bit instruction/data word u single cycle 32-bit ALU/accumulator u Single cycle 16 x 16-bit multiply in 200 ns u Two cycle MAC (5 MOPS) u Zero to 15-bit barrel shifter u Eight input and eight output channels 5 TMS32010 BLOCK DIAGRAM 6 TMS32010 Program Memory Maps Microcomputer Mode Microprocessor Mode Address 16-bit word 16-bit word 0 Reset 1st Word 0 Reset 1st Word 1 Reset 2nd Word 1 Reset 2nd Word Internal Memory 2 Interrupt 2 Interrupt Space External Memory 1525 Space Internal Memory Space Reserved For Testing 1536 External Memory Space 4095 4095 7 Digital FIR Filter Implementation (Uniprocessor-Circular Buffer) Start each Time here 1st. Cycle 2nd. Cycle End X0 Start a a a a Start n-1 n-2 1 0 X1 X2 a a 0 n-1 X3 X4 X X5 X n-1 End Replace + starting value Acc with new value 8 TMS32010 FIR FILTER PROGRAM Indirect Addressing (Smaller Program Space) Y(n) = x[n-(N-1)] . h(N-1) + x[n-(N-2)] . h(N-2) +…+ x(n) . h(0) For N=50, Indirect Addressing t=42 µs (23.8 KHz) 9 For N=50, Direct Addressing t=21.6 µs (40.2 KHz) TMS320C203/LC203 BLOCK DIAGRAM DSP Core Approach - 1995 10 Third Generation DSP µP Case Study TMS320C30 - 1988 TMS320C30 Key Features u 60 ns single-cycle instruction execution time n 33.3 MFLOPS (million floating-point operations per second) n 16.7 MIPS (million instructions per second) u One 4K x 32-bit single-cycle dual-access on-chip ROM block u Two 1K x 32-bit single-cycle dual-access on-chip RAM blocks u 64 x 32-bit instruction cache u 32-bit instruction and data words, 24-bit addresses u 40/32-bit floating-point/integer multiplier and ALU u 32-bit barrel shifter 11 Third Generation DSP µP Case Study TMS320C30 - 1988 TMS320C30 Key Features (cont.) u Eight extended precision registers (accumulators) u Two address generators with eight auxiliary registers and two auxiliary register arithmetic units u On-chip direct memory Access (DMA) controller for concurrent I/O and CPU operation u Parallel ALU and multiplier instructions u Block repeat capability u Interlocked instructions for multiprocessing support u Two serial ports to support 8/16/32-bit transfers u Two 32-bit timers u 1 µ CDMOS Process 12 TMS320C30 BLOCK DIAGRAM 13 TMS320C3x CPU BLOCK DIAGRAM 14 TMS320C3x MEMORY BLOCK DIAGRAM 15 TMS320C30 Memory Organization Oh Interrupt locations Oh Interrupt locations & reserved (192) & reserved (192) BFh external STRB active BFh COh External COh ROM STRB Active 7FFFFFh 0FFFh (Internal) 800000h 1000h 7FFFFFh Expansion BUS MSTRB Expansion BUS MSTRB 800000h Active (8K) 801FFFh Active (8K) 801FFFh 802000h Reserved Reserved 802000h (8K) 803FFFh (8K) 803FFFh 804000h Expansion Bus Expansion Bus 804000h 805FFFh IOSTRB Active (8K) IOSTRB Active (8K) 805FFFh 806000h Reserved Reserved 806000h 807FFFH (8K) (8K) 807FFFH 80800h Peripheral Bus Memory Mapped Peripheral Bus Memory Mapped 80800h 8097FFh Registers (Internal) (6K) Registers (Internal) (6K) 8097FFh 809800h RAM Block 0 (1K) RAM Block 0 (1K) 809800h 809BFFh (Internal) (Internal) 809BFFh 809C00h RAM Block 1 (1K) RAM Block 1 (1K) 809C00h 809FFFh (Internal) (Internal) 80A00h 809FFFh External 80A00h External 0FFFFFFh STRB Active STRB Active 0FFFFFFh Microprocessor Mode Microcomputer Mode 16 TMS320C30 FIR FILTER PROGRAM Y(n) = x[n-(N-1)] . h(N-1) + x[n-(N-2)] . h(N-2) +…+ x(n) . h(0) For N=50, t=3.6 µs (277 KHz) 17 ‘C54x Architecture 18 TMS320C54x Internal Block Diagram 19 Architecture optimized for DSP #1: CPU designed for efficient DSP processing n MAC unit, 2 Accumulators, Additional Adder, Barrel Shifter #2: Multiple busses for efficient data and program flow n Four busses and large on-chip memory that result in sustained performance near peak #3: Highly tuned instruction set for powerful DSP computing n Sophisticated instructions that execute in fewer cycles, with less code and low power demands 20 Key #1: DSP engine 40 å Y = an * xn n = 1 x a MPY ADD y 21 Key #1: MAC Unit MAC *AR2+, *AR3+, A Data Acc A Temp Coeff Prgm Data Acc A S/U S/U MPY A Fractional B Mode Bit ADD O acc A acc B 22 Key #1: Accumulators + Adder General-Purpose Math example: t = s+e-r A Bus B Bus A B C T D Shifter LD @s, A acc A acc B ALU ADD @e, A MUX U Bus SUB @r, A STL A, @t A B MAC 23 Key #1: Barrel shifter LD @X, 16, A STH @B, Y A B C D Barrel Shifter (-16-+31) S Bus ALU E Bus 24 Key #1: Temporary register LD @x, T MPY @a, A D X EXP A Encoder B Temporary For example: Register A = xa T Bus MAC ALU 25 Key #2: Efficient data/program flow #1: CPU designed for efficient DSP processing n MAC unit, 2 Accumulators, Additional Adder, Barrel Shifter #2: Multiple busses for efficient data and program flow n Four busses and large on-chip memory that result in sustained performance near peak #3: Highly tuned instruction set for powerful DSP computing n Sophisticated instructions that execute in fewer cycles, with less code and low power demands 26 Key #2: Multiple busses MAC *AR2+, *AR3+, A EXTERNAL P MEMORY MM UU D MM XX C UU EE XX E MEMORY INTERNAL SS C D Central Arithmetic T MAC A B ALU SHIFTER Logic Unit M 27 Key #2: Pipeline Prefetch Fetch Decode Access Read Execute P F D A R E K Prefetch: Calculate address of instruction K Fetch: Collect instruction K Decode: Interpret instruction K Access: Collect address of operand K Read: Collect operand K Execute: Perform operation 28 Key #2: Bus usage CNTL PC ARs EXTERNAL P MEMORY MM UU D MM XX C UU EE XX MEMORY E INTERNAL SS Central Arithmetic T MAC A B ALU SHIFTER Logic Unit 29 Key #2: Pipeline performance CYCLES P1 F1 D1 A1 R1 X1 P2 F2 D2 A2 R2 X2 P3 F3 D3 A3 R3 X3 P4 F4 D4 A4 R4 X4 P5 F5 D5 A5 R5 X5 P6 F6 D6 A6 R6 X6 Fully loaded pipeline 30 Key #3: Powerful instructions #1: CPU designed for efficient DSP processing n MAC Unit, 2 Accumulators, Additional Adder, Barrel Shifter #2: Multiple busses for efficient data and program flow n Four busses and large on-chip memory that result in sustained performance near peak #3: Highly tuned instruction set for powerful DSP computing n Sophisticated instructions that execute in fewer cycles, with less code and low power demands 31 Key #3: Advanced applications Symmetric FIR filter FIRS Adaptive filtering LMS Polynomial evaluation POLY Code book search STRCD SACCD SRCCD Viterbi DADST DSADT CMPS 32 C62x Architecture 33 TMS320C6201 Revision 2 Program Cache / Program Memory 32-bit address, 256-Bit data512K Bits RAM Pwr C6201 CPU Megamodule Program Fetch Dwn Control Host Instruction Dispatch Registers Port Instruction Decode Interface Data Path 1 Data Path 2 Control 4- Logic DMA A Register File B Register File Test Emulation L1 S1 M1 D1 D2 M2 S2 L2 Ext. Interrupts Memory Interface 2 Timers 2 Multi- channel Data Memory buffered 32-Bit address, 8-, 16-, 32-Bit data serial ports 512K Bits RAM (T1/E1) 34 C6201 Internal Memory Architecture K Separate Internal Program and Data Spaces K Program n 16K 32-bit instructions (2K Fetch Packets) n 256-bit Fetch Width n Configurable as either w Direct Mapped Cache, Memory Mapped Program Memory K Data n 32K x 16 n Single Ported Accessible by Both CPU Data Buses n 4 x 8K 16-bit Banks w 2 Possible Simultaneous Memory Accesses (4 Banks) w 4-Way Interleave, Banks and Interleave Minimize Access Conflicts 35 C62x Interrupts K 12 Maskable Interrupts , Non-Maskable Interrupt (NMI) K Interrupt

Load more