
Assembly Languages

COMS W4995-02
Prof. Stephen A. Edwards
Fall 2002
Department of Computer Science

Assembly Languages

One step up from machine language.
Originally a more user-friendly way to program.
Now mostly a compiler target.
Model of computation: stored program computer.

Assembly Language Model

                . . .
                add r1,r2
                sub r2,r3
                cmp r3,r4
    PC →        bne I1            ALU ↔ Registers ↔ Memory
                sub r4,1
            I1: jmp I3
                . . .
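To make the stored-program model concrete, here is a minimal C sketch of a fetch-execute loop running the six instructions shown on the slide. The `Op` encoding, the `Insn` and `Machine` types, the two-operand reading of `add`/`sub` (destination first), and the single `zero` flag set by `cmp` are all illustrative assumptions, not any real instruction set. The label `I3` is not shown on the slide, so the final `jmp` is treated here as the end of the program.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcode encoding for the slide's example program. */
typedef enum { OP_ADD, OP_SUB, OP_SUBI, OP_CMP, OP_BNE, OP_JMP_END } Op;

typedef struct {
    Op  op;
    int a;   /* first operand: register number, or branch target for OP_BNE */
    int b;   /* second operand: register number, or immediate for OP_SUBI   */
} Insn;

/* Machine state: program counter, registers r0..r4, one condition flag. */
typedef struct {
    size_t pc;
    int    r[5];
    int    zero;   /* set by cmp when its operands are equal */
} Machine;

/* The slide's program; index 5 is the label I1. */
static const Insn program[] = {
    { OP_ADD,     1, 2 },  /*     add r1,r2 */
    { OP_SUB,     2, 3 },  /*     sub r2,r3 */
    { OP_CMP,     3, 4 },  /*     cmp r3,r4 */
    { OP_BNE,     5, 0 },  /*     bne I1    */
    { OP_SUBI,    4, 1 },  /*     sub r4,1  */
    { OP_JMP_END, 0, 0 },  /* I1: jmp I3 (I3 is off-slide: stop here) */
};

/* Fetch, decode, and execute until the final jmp; returns steps executed. */
int run(Machine *m)
{
    const size_t n = sizeof program / sizeof program[0];
    int steps = 0;

    m->pc = 0;
    for (;;) {
        assert(m->pc < n);                 /* PC must stay inside the program */
        const Insn in = program[m->pc++];  /* fetch, then advance the PC      */
        steps++;
        switch (in.op) {
        case OP_ADD:     m->r[in.a] += m->r[in.b];             break;
        case OP_SUB:     m->r[in.a] -= m->r[in.b];             break;
        case OP_SUBI:    m->r[in.a] -= in.b;                   break;
        case OP_CMP:     m->zero = (m->r[in.a] == m->r[in.b]); break;
        case OP_BNE:     if (!m->zero) m->pc = (size_t)in.a;   break;
        case OP_JMP_END: return steps;
        }
    }
}
```

Calling `run` on a `Machine` whose `r3` and `r4` differ takes the `bne` and skips `sub r4,1`; when they are equal the branch falls through and `r4` is decremented.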

Assembly Language Instructions

Built from two pieces:

    add R1, R3, 3

Opcode: what to do with the data
Operands: where to get the data

Types of Opcodes

Arithmetic, logical
• add, sub, mult
• and, or
• cmp
Memory load/store
• ld, st
Control transfer
• jmp
• bne
Complex
• movs

Operands

Each taken from a particular addressing mode:

    Examples:
    Register       add r1, r2, r3
    Immediate      add r1, r2, 10
    Indirect       mov r1, (r2)
    Offset         mov r1, 10(r3)
    PC Relative    beq 100

Reflect processor data pathways
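The addressing modes in the table can be read as five different rules for turning an operand into a value. The following C sketch spells those rules out; the `Cpu` and `Operand` types, the 64-word memory, and the function names are hypothetical and chosen only for illustration. For the PC-relative case the sketch returns the branch target address rather than a loaded value.

```c
#include <assert.h>
#include <stdint.h>

enum { MEM_WORDS = 64 };

/* Operand kinds from the slide's table. */
typedef enum { REGISTER, IMMEDIATE, INDIRECT, OFFSET, PC_RELATIVE } Mode;

typedef struct {
    Mode    mode;
    int     reg;    /* register number, where the mode uses one  */
    int32_t value;  /* immediate, displacement, or branch offset */
} Operand;

typedef struct {
    int32_t r[8];            /* register file         */
    int32_t mem[MEM_WORDS];  /* word-addressed memory */
    int32_t pc;              /* program counter       */
} Cpu;

/* Bounds-checked memory read. */
static int32_t load(const Cpu *c, int32_t addr)
{
    assert(addr >= 0 && addr < MEM_WORDS);
    return c->mem[addr];
}

/* Return the value an operand denotes when it is used as a source. */
int32_t read_operand(const Cpu *c, Operand o)
{
    switch (o.mode) {
    case REGISTER:    return c->r[o.reg];                   /* r2       */
    case IMMEDIATE:   return o.value;                       /* 10       */
    case INDIRECT:    return load(c, c->r[o.reg]);          /* (r2)     */
    case OFFSET:      return load(c, c->r[o.reg] + o.value);/* 10(r3)   */
    case PC_RELATIVE: return c->pc + o.value;               /* beq 100  */
    }
    assert(0 && "unknown addressing mode");
    return 0;
}
```

Each case touches a different path in the processor (register file only, instruction word only, register file then memory, and so on), which is the sense in which operands reflect the data pathways.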

Types of Assembly Languages

Assembly language closely tied to processor architecture
At least four main types:
CISC: Complex Instruction-Set Computer
RISC: Reduced Instruction-Set Computer
DSP: Digital Signal Processor
VLIW: Very Long Instruction Word

CISC Assembly Language

Developed when people wrote assembly language
Complicated, often specialized instructions with many effects
Examples from x86 architecture
• String move
• Procedure enter, leave
Many, complicated addressing modes
So complicated, often executed by a little program (microcode)
Examples: x86, 68000, PDP-11

RISC Assembly Language

Response to growing use of compilers
Easier-to-target, uniform instruction sets
“Make the most common operations as fast as possible”
Load-store architecture:
• Arithmetic only performed on registers
• Memory load/store instructions for memory-register transfers
Designed to be pipelined
Examples: SPARC, MIPS, HP-PA, PowerPC

DSP Assembly Language

Digital signal processors designed specifically for signal processing
Lots of regular arithmetic on vectors
Often written by hand
Irregular architectures to save power, area
Substantial instruction-level parallelism
Examples: TI 320, Motorola 56000, Analog Devices

VLIW Assembly Language

Response to growing desire for instruction-level parallelism
Using more transistors cheaper than running them faster
Many parallel ALUs
Objective: keep them all busy all the time
Heavily pipelined
More regular instruction set
Very difficult to program by hand
Looks like parallel RISC instructions
Examples: Itanium, TI 320C6000

Example: Euclid's Algorithm

    int gcd(int m, int n)
    {
      int r;
      while ((r = m % n) != 0) {
        m = n;
        n = r;
      }
      return n;
    }

i386 Programmer's Model

    31            0                          15       0
    eax      Mostly                          cs   Code segment
    ebx      General-                        ds   Data segment
    ecx      Purpose-                        ss   Stack segment
    edx      Registers                       es   Extra segment
    esi      Source index                    fs   Data segment
    edi      Destination index               gs   Data segment
    ebp      Base pointer
    esp      Stack pointer
    eflags   Status word
    eip      Instruction Pointer

Euclid on the i386

    Stack Before Call                Stack After Entry
              n      8(%esp)                   n         12(%ebp)
              m      4(%esp)                   m         8(%ebp)
    %esp →    R. A.  0(%esp)                   R. A.     4(%ebp)
                                     %ebp →    old ebp   0(%ebp)
                                     %esp →    old ebx   -4(%ebp)

            .file   "euclid.c"      # Boilerplate
            .version "01.01"
    gcc2_compiled.:
            .text
            .align 4                # Align to a 4-byte boundary (ELF .align counts bytes)
            .globl gcd              # Make "gcd" linker-visible
            .type   gcd,@function
    gcd:
            pushl %ebp
            movl %esp,%ebp
            pushl %ebx
            movl 8(%ebp),%eax
            movl 12(%ebp),%ecx
            jmp .L6                 # Jump to local label .L6
            .p2align 4,,7           # Skip ≤ 7 bytes to a multiple of 16
    .L4:
            movl %ecx,%eax          # m = n
            movl %ebx,%ecx          # n = r
    .L6:
            cltd                    # Sign-extend eax to edx:eax
            idivl %ecx              # Compute edx:eax / ecx (remainder in edx)
            movl %edx,%ebx
            testl %edx,%edx         # AND of edx and edx
            jne .L4                 # Branch if edx was ≠ 0
            movl %ecx,%eax          # Return n
            movl -4(%ebp),%ebx
            leave                   # Move ebp to esp, pop ebp
            ret                     # Pop return address and branch

SPARC Programmer's Model

    31            0                          31            0
    r0       Always 0                        r24/i0   Input Registers
    r1       Global Registers                  .
      .                                        .
      .                                      r30/i6   Frame Pointer
    r7                                       r31/i7   Return Address
    r8/o0    Output Registers
      .                                      PSW      Status Word
      .                                      PC       Program Counter
    r14/o6   Stack Pointer                   nPC      Next PC
    r15/o7
    r16/l0   Local Registers
      .
    r23/l7

SPARC Register Windows

The output registers of the calling procedure become the inputs to the called procedure
The global registers remain unchanged
The local registers are not visible across procedures

    r8/o0 .. r15/o7
    r16/l0 .. r23/l7
    r24/i0 .. r31/i7  =  r8/o0 .. r15/o7
                         r16/l0 .. r23/l7
                         r24/i0 .. r31/i7  =  r8/o0 .. r15/o7
                                              r16/l0 .. r23/l7
                                              r24/i0 .. r31/i7

Euclid on the SPARC

            .file   "euclid.c"      # Boilerplate
    gcc2_compiled.:
            .global .rem            # Make .rem linker-visible
            .section ".text"        # Executable code
            .align 4
            .global gcd             # Make gcd linker-visible
            .type   gcd, #function
            .proc   04
    gcd:
            save %sp, -112, %sp     # Next window, move SP
            mov %i0, %o1            # Move m into o1
            b .LL3                  # Unconditional branch
            mov %i1, %i0            # Move n into i0
    .LL5:
            mov %o0, %i0            # n = r
    .LL3:
            mov %o1, %o0            # Compute the remainder of
            call .rem, 0            # m / n, result in o0
            mov %i0, %o1
            cmp %o0, 0
            bne .LL5
            mov %i0, %o1            # m = n (always executed)
            ret                     # Return (actually jmp i7 + 8)
            restore                 # Restore previous window
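The SPARC register-window overlap can be modelled in a few lines of C. This is a deliberately simplified sketch: the `RegFile` type and function names are hypothetical, only four windows are modelled, `save` is taken to step the window pointer upward (real hardware steps it the other way), and window overflow/underflow traps and the stack-pointer arithmetic of `save` are ignored. What it does show is that a value written to `o0` by the caller is the same physical register the callee reads as `i0`.

```c
#include <assert.h>

enum { NWINDOWS = 4, NPHYS = NWINDOWS * 16 };

typedef struct {
    int global[8];    /* r0..r7, shared by every procedure (r0 reads as 0) */
    int phys[NPHYS];  /* windowed registers, 16 new ones per window        */
    int cwp;          /* current window pointer                            */
} RegFile;

/* Map architectural register r8..r31 in the current window to a physical slot. */
static int slot(const RegFile *rf, int r)
{
    int base = rf->cwp * 16;
    int off;

    assert(r >= 8 && r <= 31);
    if (r >= 24)
        off = r - 24;          /* i0..i7: first 8 slots of this window      */
    else if (r >= 16)
        off = 8 + (r - 16);    /* l0..l7: private middle 8 slots            */
    else
        off = 16 + (r - 8);    /* o0..o7: first 8 slots of the next window  */
    return (base + off) % NPHYS;
}

int get_reg(const RegFile *rf, int r)
{
    if (r == 0) return 0;                 /* r0 is always zero */
    if (r < 8)  return rf->global[r];
    return rf->phys[slot(rf, r)];
}

void set_reg(RegFile *rf, int r, int v)
{
    if (r == 0) return;                   /* writes to r0 are discarded */
    if (r < 8)  rf->global[r] = v;
    else        rf->phys[slot(rf, r)] = v;
}

/* Procedure entry: move to the next window. */
void save_window(RegFile *rf)    { rf->cwp = (rf->cwp + 1) % NWINDOWS; }

/* Procedure exit: return to the caller's window. */
void restore_window(RegFile *rf) { rf->cwp = (rf->cwp + NWINDOWS - 1) % NWINDOWS; }
```

In this model, the gcd listing's pattern of writing the result into `%i0` before `restore` works because the caller then finds it in its own `%o0`.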

Digital Signal Processor Apps.

Low-cost embedded systems
• Modems, cellular telephones, disk drives, printers
High-throughput applications
• Halftoning, base stations, 3-D sonar, tomography
PC based multimedia
• Compression/decompression of audio, graphics, video

Embedded Processor Requirements

Inexpensive with small area and volume
Deterministic interrupt service routine latency
Low power: ≈50 mW (TMS320C54x uses 0.36 µA/MIPS)

Conventional DSP Architecture

Harvard architecture
• Separate data memory/bus and program memory/bus
• Three reads and one or two writes per instruction cycle
Deterministic interrupt service routine latency
Multiply-accumulate in single instruction cycle
Special addressing modes supported in hardware
• Modulo addressing for circular buffers for FIR filters
• Bit-reversed addressing for fast Fourier transforms
Instructions to keep the pipeline (3-4 stages) full
• Zero-overhead looping (one pipeline flush to set up)
• Delayed branches
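Bit-reversed addressing is easiest to see in software, where it costs a loop per index; a DSP produces the same sequence in its address generator at no cost. The C sketch below is only an illustration of the index pattern (function names are hypothetical), not a model of any particular chip's address unit.

```c
#include <assert.h>

/* Reverse the low `bits` bits of `index` (with 3 bits: 001 -> 100, 011 -> 110). */
unsigned bit_reverse(unsigned index, unsigned bits)
{
    unsigned reversed = 0;

    assert(bits > 0 && bits <= 16);
    for (unsigned i = 0; i < bits; i++) {
        reversed = (reversed << 1) | (index & 1u);  /* append index's lowest bit */
        index >>= 1;                                /* and drop it from index    */
    }
    return reversed;
}

/* Copy an n-point sequence (n = 2^bits) into the bit-reversed order
   that a radix-2 FFT consumes or produces. */
void bit_reverse_copy(const int *in, int *out, unsigned bits)
{
    unsigned n = 1u << bits;

    for (unsigned i = 0; i < n; i++)
        out[bit_reverse(i, bits)] = in[i];
}
```

For an 8-point transform the indices 0..7 are visited in the order 0, 4, 2, 6, 1, 5, 3, 7.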

Conventional DSPs

                     Fixed-Point            Floating-Point
    Cost/Unit        $5–$79                 $5–$381
    Architecture     Accumulator            load-store
    Registers        2–4 data, 8 address    8–16 data, 8–16 address
    Data Words       16 or 24 bit           32 bit
    Chip Memory      2–64K data+program     8–64K data+program
    Address Space    16–128K data           16M–4G data
                     16–64K program         16M–4G program
    Compilers        Bad C                  Better C, C++
    Examples         TI TMS320C5x           TI TMS320C3x
                                            Analog Devices SHARC

Conventional DSPs

Market share: 95% fixed-point, 5% floating-point
Each processor comes in dozens of configurations
• Data and program memory size
• Peripherals: A/D, D/A, serial, parallel ports, timers
Drawbacks
• No byte addressing (needed for image and video)
• Limited on-chip memory
• Limited addressable memory on most fixed-point DSPs
• Non-standard C extensions to support fixed-point data

Example

Finite Impulse Response filter (FIR)
Can be used for lowpass, highpass, bandpass, etc.
Basic DSP operation
For each sample, computes

$$y_n = \sum_{i=0}^{k} a_i x_{n+i}$$

where
a_0, . . . , a_k are filter coefficients,
x_n is the nth input sample,
y_n is the nth output sample.

56000 Programmer's Model

    Data ALU
    55  48 47      24 23       0
             x1          x0         Source
             y1          y0         Registers
      a2     a1          a0         Accumulator
      b2     b1          b0         Accumulator

    Address registers
    15   0   15   0   15   0
      r7       n7       m7
      .        .        .
      r4       n4       m4
      r3       n3       m3
      .        .        .
      r0       n0       m0

    Program control
    15          0
    Program Counter
    Status Register
    Loop Address
    Loop Count
    PC Stack (entries 15 . . . 0)
    SR Stack (entries 15 . . . 0)
    Stack pointer

56001 Memory Spaces

Three memory regions, each 64K:
• 24-bit Program memory
• 24-bit X data memory
• 24-bit Y data memory
Idea: enable simultaneous access of program, sample, and coefficient memory
Three on-chip memory spaces can be used this way
One off-chip memory pathway connected to all three memory spaces
Only one off-chip access per cycle maximum

56001 Address Generation

Addresses come from pointer register r0 . . . r7
Offset registers n0 . . . n7 can be added to pointer
Modifier registers cause the address to wrap around
Zero modifier causes reverse-carry arithmetic

    Address    Notation    Next value of r0
    r0         (r0)        r0
    r0 + n0    (r0+n0)     r0
    r0         (r0)+       (r0 + 1) mod m0
    r0 - 1     -(r0)       (r0 - 1) mod m0
    r0         (r0)-       (r0 - 1) mod m0
    r0         (r0)+n0     (r0 + n0) mod m0
    r0         (r0)-n0     (r0 - n0) mod m0
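The pointer-update rules in the address-generation table can be sketched in C as follows. This is a simplified model under stated assumptions: the buffer is described by an explicit `base` and `len` rather than by the encoding the hardware uses in its modifier register, and the `ModPtr` type and function names are hypothetical. Each function returns the address used for the access and then applies the update shown in the table's last column.

```c
#include <assert.h>

/* A pointer register confined to a circular buffer [base, base + len). */
typedef struct {
    unsigned base;  /* first address of the buffer           */
    unsigned len;   /* buffer length (the modulus)           */
    unsigned r;     /* current pointer, base <= r < base+len */
} ModPtr;

/* Add a signed step to the pointer, wrapping inside the buffer. */
static void advance(ModPtr *p, int step)
{
    int len, off;

    assert(p->len > 0 && p->r >= p->base && p->r < p->base + p->len);
    len = (int)p->len;
    off = ((int)(p->r - p->base) + step) % len;
    if (off < 0)
        off += len;               /* C's % can be negative; fold it back */
    p->r = p->base + (unsigned)off;
}

/* (r)+ : use r, then post-increment. */
unsigned post_inc(ModPtr *p) { unsigned a = p->r; advance(p, 1);  return a; }

/* (r)- : use r, then post-decrement. */
unsigned post_dec(ModPtr *p) { unsigned a = p->r; advance(p, -1); return a; }

/* (r)+n : use r, then add the offset n. */
unsigned post_add(ModPtr *p, int n) { unsigned a = p->r; advance(p, n); return a; }

/* -(r) : pre-decrement, then use the new r. */
unsigned pre_dec(ModPtr *p) { advance(p, -1); return p->r; }
```

With a four-entry buffer at `0x40`, repeated `post_inc` calls visit `0x40, 0x41, 0x42, 0x43, 0x40, ...`, which is what lets an FIR loop walk its sample window without any explicit wrap test.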

FIR Filter in 56001

    n       equ 20                  # Define symbolic constants
    start   equ $40
    samples equ $0
    coeffs  equ $0
    input   equ $ffe0               # Memory-mapped I/O
    output  equ $ffe1

            org p:start             # Locate in prog. memory
            move #samples, r0       # Pointers to samples
            move #coeffs, r4        # and coefficients
            move #n-1, m0           # Prepare circular buffer
            move m0, m4

FIR Filter in 56001

            movep y:input, x:(r0)   # Load sample into memory

                                    # Clear accumulator A
                                    # Load a sample into x0
                                    # Load a coefficient
            clr a      x:(r0)+, x0   y:(r4)+, y0

            rep #n-1                # Repeat next instruction n-1 times

                                    # a = x0 × y0
                                    # Next sample
                                    # Next coefficient
            mac x0,y0,a   x:(r0)+, x0   y:(r4)+, y0

            macr x0,y0,a  (r0)-
            movep a, y:output       # Write output sample

TI TMS320C6000 VLIW DSP

Eight instruction units dispatched by one very long instruction word
Designed for DSP applications
Orthogonal instruction set
Big, uniform register file (16 32-bit registers)
Better compiler target than 56001
Deeply pipelined (up to 15 levels)
Complicated, but more regular, datapath
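For comparison with the 56001 FIR listing, the same per-sample computation can be written in C. This is a sketch under several assumptions: it uses plain integer arithmetic instead of the 56001's fractional fixed-point with rounding (`macr`), it uses 4 taps where the listing uses `n = 20`, and the `Fir` type and `fir_step` name are hypothetical. The comments map each step to the instruction it stands in for.

```c
#include <assert.h>
#include <stddef.h>

enum { NTAPS = 4 };  /* the listing uses n = 20; 4 keeps the example small */

typedef struct {
    int    coeffs[NTAPS];   /* filter coefficients (Y memory in the listing) */
    int    samples[NTAPS];  /* circular buffer of inputs (X memory)          */
    size_t pos;             /* where the next sample is stored (r0)          */
} Fir;

/* Store one input sample and return the corresponding output sample. */
long fir_step(Fir *f, int input)
{
    long   acc = 0;                         /* clr a                      */
    size_t r0  = f->pos;

    assert(r0 < NTAPS);
    f->samples[r0] = input;                 /* movep y:input, x:(r0)      */
    for (size_t i = 0; i < NTAPS; i++) {    /* n multiply-accumulates     */
        acc += (long)f->coeffs[i] * f->samples[r0];
        r0 = (r0 + 1) % NTAPS;              /* (r0)+ with modulo wrap     */
    }
    /* After NTAPS increments r0 is back where it started; the final (r0)-
       steps it back one slot so the next sample overwrites the oldest.   */
    f->pos = (f->pos + NTAPS - 1) % NTAPS;
    return acc;
}
```

Because each new sample is stored one slot below the previous one, the first coefficient always multiplies the newest input and the last coefficient the oldest one still in the buffer.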

Pipelining on the C6

One instruction issued per clock cycle
Very deep pipeline
• 4 fetch cycles
• 2 decode cycles
• 1-10 execute cycles
Branch in pipeline disables interrupts
Conditional instructions avoid branch-induced stalls
No hardware to protect against hazards
• Assembler or compiler’s responsibility

FIR in One 'C6 Assembly Instruction

    FIRLOOP:
                LDH  .D1   *A1++, A2    ; Fetch next sample
    ||          LDH  .D2   *B1++, B2    ; Fetch next coeff.
    ||  [B0]    SUB  .L2   B0, 1, B0    ; Decrement count
    ||  [B0]    B    .S2   FIRLOOP      ; Branch if non-zero
    ||          MPY  .M1X  A2, B2, A3   ; Sample × Coeff.
    ||          ADD  .L1   A4, A3, A4   ; Accumulate result

LDH: load a halfword (16 bits)
.D1: do this on unit D1
.M1X: use the cross path
[B0]: predicated instruction (only if B0 non-zero)
||: run these instructions in parallel

Peripherals

Often the whole point of the system
Memory-mapped I/O
• Magical memory locations that make something happen or change on their own
Typical meanings:
• Configuration (write)
• Status (read)
• Address/Data (access more peripheral state)

Example: 56001 Port C

Nine pins each usable as either simple parallel I/O or as part of two serial interfaces.

    Pins:
    Parallel    Serial
    PC0         RxD       Serial Communication Interface (SCI)
    PC1         TxD
    PC2         SCLK
    PC3         SC0       Synchronous Serial Interface (SSI)
    PC4         SC1
    PC5         SC2
    PC6         SCK
    PC7         SRD
    PC8         STD

Port C Registers for Parallel Port

Port C Control Register
Selects mode (parallel or serial) of each pin
X: $FFE1    Lower 9 bits: 0 = parallel, 1 = serial

Port C Data Direction Register
I/O direction of parallel pins
X: $FFE3    Lower 9 bits: 0 = input, 1 = output

Port C Data Register
Read = parallel input data, Write = parallel data out
X: $FFE5    Lower 9 bits

Port C SCI

Three-pin interface
422 Kbit/s NRZ asynchronous interface (RS-232-like)
3.375 Mbit/s synchronous serial mode
Multidrop mode for multiprocessor systems
Two Wakeup modes
• Idle line
• Address bit
Wired-OR mode
On-chip or external baud rate generator
Four interrupt priority levels

Port C SCI Registers

SCI Control Register
X: $FFF0

    Bits    Function
    0–2     Word select bits
    3       Shift direction
    4       Send break
    5       Wakeup mode select
    6       Receiver wakeup enable
    7       Wired-OR mode select
    8       Receiver enable
    9       Transmitter enable
    10      Idle line interrupt enable
    11      Receive interrupt enable
    12      Transmit interrupt enable
    13      Timer interrupt enable
    15      Clock polarity

SCI Status Register (Read only)
X: $FFF1

    Bits    Function
    0       Transmitter Empty
    1       Transmitter Reg Empty
    2       Receive Data Full
    3       Idle Line
    4       Overrun Error
    5       Parity Error
    6       Framing Error
    7       Received bit 8

SCI Clock Control Register
X: $FFF2

    Bits    Function
    11–0    Clock Divider
    12      Clock Output Divider
    13      Clock Prescaler
    14      Receive Clock Source
    15      Transmit Clock Source
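Register tables like these usually end up in driver code as bit masks. The C sketch below encodes a few of the SCI control and status bits from the tables; the constant and function names are hypothetical, and the register values are passed in as plain integers so the code runs on a host. On the actual chip they would be read from and written to the memory-mapped addresses X:$FFF0 and X:$FFF1.

```c
#include <assert.h>
#include <stdint.h>

/* SCI Control Register bits (X:$FFF0), taken from the table. */
enum {
    SCR_WORD_SELECT_MASK = 0x7,      /* bits 0-2 */
    SCR_RX_ENABLE        = 1 << 8,
    SCR_TX_ENABLE        = 1 << 9,
    SCR_RX_IRQ_ENABLE    = 1 << 11,
    SCR_TX_IRQ_ENABLE    = 1 << 12
};

/* SCI Status Register bits (X:$FFF1, read only), taken from the table. */
enum {
    SSR_TX_EMPTY     = 1 << 0,
    SSR_TX_REG_EMPTY = 1 << 1,
    SSR_RX_DATA_FULL = 1 << 2,
    SSR_IDLE_LINE    = 1 << 3,
    SSR_OVERRUN_ERR  = 1 << 4,
    SSR_PARITY_ERR   = 1 << 5,
    SSR_FRAMING_ERR  = 1 << 6,
    SSR_RX_BIT8      = 1 << 7
};

/* Build a control-register value from a word-select code and enable flags. */
uint16_t scr_value(unsigned word_select, int rx_enable, int tx_enable, int rx_irq)
{
    unsigned v;

    assert(word_select <= SCR_WORD_SELECT_MASK);
    v = word_select;
    if (rx_enable) v |= SCR_RX_ENABLE;
    if (tx_enable) v |= SCR_TX_ENABLE;
    if (rx_irq)    v |= SCR_RX_IRQ_ENABLE;
    return (uint16_t)v;
}

/* True if the status word reports an overrun, parity, or framing error. */
int sci_has_error(uint16_t ssr)
{
    return (ssr & (SSR_OVERRUN_ERR | SSR_PARITY_ERR | SSR_FRAMING_ERR)) != 0;
}

/* True if a received word is waiting to be read. */
int sci_can_read(uint16_t ssr)
{
    return (ssr & SSR_RX_DATA_FULL) != 0;
}
```

A polling receive loop would check `sci_can_read` on each status read and consult `sci_has_error` before trusting the data; the meaning of each word-select code is not given on the slide and is left to the caller.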

Port C SSI

Intended for synchronous, constant-rate protocols
Easy interface to serial ADCs and DACs
Many more operating modes than SCI
Six Pins (Rx, Tx, Clk, Rx Clk, Frame Sync, Tx Clk)
8, 12, 16, or 24-bit words

Port C SSI Registers

SSI Control Register A    $FFEC
Prescaler, frame rate, word length

SSI Control Register B    $FFED
Interrupt enables, various mode settings

SSI Status/Time Slot Register    $FFEE
Sync, empty, overrun

SSI Receive/Transmit Data Register    $FFEF
8, 16, or 24 bits of read/write data.