Assembly Languages Assembly Languages Assembly Language

Assembly Languages Assembly Language Model One step up from machine . Assembly Languages language . Originally a more add r1,r2 COMS W4995-02 user-friendly way to program sub r2,r3 Prof. Stephen A. Edwards Now mostly a compiler target cmp r3,r4 Fall 2002 Model of computation: PC ! bne I1 ALU $ Registers $ Memory Columbia University stored program computer sub r4,1 Department of Computer Science I1: jmp I3 . Assembly Language Instructions Types of Opcodes Operands Arithmetic, logical Built from two pieces: Each operand taken from a particular addressing mode: • add, sub, mult Examples: add R1, R3, 3 • and, or • Cmp Register add r1, r2, r3 Opcode Operands Memory load/store Immediate add r1, r2, 10 What to do with the data Where to get the data • ld, st Indirect mov r1, (r2) Control transfer Offset mov r1, 10(r3) • jmp PC Relative beq 100 • bne Reflect processor data pathways Complex • movs Types of Assembly Languages CISC Assembly Language RISC Assembly Language Assembly language closely tied to processor architecture Developed when people wrote assembly language Response to growing use of compilers At least four main types: Complicated, often specialized instructions with many Easier-to-target, uniform instruction sets effects CISC: Complex Instruction-Set Computer “Make the most common operations as fast as possible” Examples from x86 architecture RISC: Reduced Instruction-Set Computer Load-store architecture: • String move DSP: Digital Signal Processor • Arithmetic only performed on registers • Procedure enter, leave VLIW: Very Long Instruction Word • Memory load/store instructions for memory-register Many, complicated addressing modes transfers So complicated, often executed by a little program (microcode) Designed to be pipelined Examples: Intel x86, 68000, PDP-11 Examples: SPARC, MIPS, HP-PA, PowerPC DSP Assembly Language VLIW Assembly Language Example: Euclid's Algorithm Digital signal processors designed specifically for signal Response to growing desire for instruction-level int gcd(int m, int n) processing algorithms parallelism { Lots of regular arithmetic on vectors Using more transistors cheaper than running them faster int r; while ((r = m % n) != 0) { Many parallel ALUs Often written by hand m = n; Objective: keep them all busy all the time Irregular architectures to save power, area n = r; Heavily pipelined } Substantial instruction-level parallelism More regular instruction set return n; Examples: TI 320, Motorola 56000, Analog Devices } Very difficult to program by hand Looks like parallel RISC instructions Examples: Itanium, TI 320C6000 i386 Programmer's Model Euclid on the i386 Euclid on the i386 .file "euclid.c" # Boilerplate 31 0 15 0 .version "01.01" .file "euclid.c" gcc2 compiled.: .version "01.01" eax Mostly cs Code segment gcc2 compiled.: .text # Executable Stack Before Call ebx General- ds Data segment .text n 8(%esp) .align 4 # Start on 16-byte boundary .align 4 ecx Purpose- ss Stack segment .globl gcd # Make “gcd” linker-visible m 4(%esp) .globl gcd %esp! R. A. 0(%esp) edx Registers es Extra segment .type gcd,@function .type gcd,@function gcd: fs Data segment gcd: Stack After Entry esi Source index pushl %ebp pushl %ebp gs Data segment n 12(%ebp) edi Destination index movl %esp,%ebp movl %esp,%ebp pushl %ebx m 8(%ebp) ebp Base pointer pushl %ebx R. A. 4(%ebp) movl 8(%ebp),%eax movl 8(%ebp),%eax ! esp Stack pointer %ebp old ebp 0(%ebp) movl 12(%ebp),%ecx movl 12(%ebp),%ecx ! − jmp .L6 %esp old ebx 4(%ebp) eflags Status word jmp .L6 .p2align 4,,7 .p2align 4,,7 eip Instruction Pointer Euclid in the i386 Euclid on the i386 SPARC Programmer's Model jmp .L6 # Jump to local label .L6 jmp .L6 .p2align 4,,7 # Skip ≤ 7 bytes to a multiple of 16 .p2align 4,,7 31 0 31 0 .L4: .L4: r0 Always 0 r24/i0 Input Registers movl %ecx,%eax movl %ecx,%eax # m = n . r1 Global Registers . movl %ebx,%ecx movl %ebx,%ecx # n = r . .L6: .L6: . r30/i6 Frame Pointer cltd # Sign-extend eax to edx:eax cltd r7 r31/i7 Return Address idivl %ecx # Compute edx:eax = ecx idivl %ecx r8/o0 Output Registers movl %edx,%ebx movl %edx,%ebx . PSW Status Word testl %edx,%edx testl %edx,%edx # AND of edx and edx jne .L4 jne .L4 # branch if edx was 6= 0 r14/o6 Stack Pointer PC Program Counter movl %ecx,%eax movl %ecx,%eax # Return n r15/o7 nPC Next PC movl -4(%ebp),%ebx movl -4(%ebp),%ebx r16/l0 Local Registers leave leave # Move ebp to esp, pop ebp . ret ret # Pop return address and branch . r23/l7 SPARC Register Windows Euclid on the SPARC Euclid on the SPARC .file "euclid.c" # Boilerplate mov %i0, %o1 r8/o0. gcc2 compiled.: b .LL3 mov %i1, %i0 The output registers of r15/o7 .global .rem # make .rem linker-visible r16/l0. .section ".text" # Executable code .LL5: the calling procedure . r23/l7 .align 4 mov %o0, %i0 # n = r become the inputs to r8/o0. r24/i0. .global gcd # make gcd linker-visible .LL3: . mov %o1, %o0 # Compute the remainder of the called procedure r15/o7 r31/i7 .type gcd, #function call .rem, 0 # m = n, result in o0 r16/l0. .proc 04 The global registers . mov %i0, %o1 r23/l7 gcd: remain unchanged save %sp, -112, %sp r8/o0. r24/i0. # Next window, move SP . cmp %o0, 0 The local registers are r15/o7 r31/i7 bne .LL5 mov %i0, %o1 # Move m into o1 r16/l0. mov %i0, %o1 # m = n (always executed) not visible across . b .LL3 # Unconditional branch ret # Return (actually jmp i7 + 8) procedures r23/l7 mov %i1, %i0 # Move n into i0 r24/i0. restore # Restore previous window . r31/i7 Digital Signal Processor Apps. Embedded Processor Conventional DSP Architecture Requirements Low-cost embedded systems Harvard architecture Inexpensive with small area and volume • Separate data memory/bus and program memory/bus • Modems, cellular telephones, disk drives, printers Deterministic interrupt service routine latency • Three reads and one or two writes per instruction cycle High-throughput applications Low power: ≈50 mW (TMS320C54x uses 0.36 µA/MIPS) Deterministic interrupt service routine latency • Halftoning, base stations, 3-D sonar, tomography Multiply-accumulate in single instruction cycle Special addressing modes supported in hardware PC based multimedia • Modulo addressing for circular buffers for FIR filters • Compression/decompression of audio, graphics, video • Bit-reversed addressing for fast Fourier transforms Instructions to keep the pipeline (3-4 stages) full • Zero-overhead looping (one pipeline flush to set up) • Delayed branches Conventional DSPs Conventional DSPs Example Fixed-Point Floating-Point Market share: 95% fixed-point, 5% floating-point Finite Impulse Response filter (FIR) Cost/Unit $5–$79 $5–$381 Each processor comes in dozens of configurations Can be used for lowpass, highpass, bandpass, etc. Architecture Accumulator load-store • Data and program memory size Basic DSP operation Registers 2–4 data, 8 address 8–16 data, 8–16 address • Peripherals: A/D, D/A, serial, parallel ports, timers For each sample, computes Data Words 16 or 24 bit 32 bit Chip Memory 2–64K data+program 8–64K data+program Drawbacks k = Address Space 16–128K data 16M–4G data • No byte addressing (needed for image and video) yn X aixn+i 16–64K program 16M–4G program i=0 • Limited on-chip memory Compilers Bad C Better C, C++ where Examples TI TMS320C5x TI TMS320C3x • Limited addressable memory on most fixed-point Motorola 56000 Analog Devices SHARC DSPs a0; : : : ; ak are filter coffecients, • Non-standard C extensions to support fixed-point data xn is the nth input sample, yn is the nth output sample. 56000 Programmer's Model 56001 Memory Spaces 56001 Address Generation 55 4847 2423 0 15 0 Three memory regions, each 64K: Addresses come from pointer register r0 . r7 x1 x0 Source Program Counter y1 y0 Registers Status Register • 24-bit Program memory a2 a1 a0 Accumulator Loop Address Offset registers n0 . n7 can be added to pointer b2 b1 b0 Accumulator Loop Count • 24-bit X data memory Modifier registers cause the address to wrap around 15. PC Stack 15 0 15 0 15 0 . • r7 n7 m7 24-bit Y data memory Zero modifier causes reverse-carry arithmetic . 0 . Idea: enable simultaneous access of program, sample, Address Notation Next value of r0 r4 n4 m4 Address 15. SR Stack r3 n3 m3 . r0 (r0) r0 . Registers and coefficient memory . 0 r0 + n0 (r0+n0) r0 r0 n0 m0 Stack pointer Three on-chip memory spaces can be used this way r0 (r0)+ (r0 + 1) mod m0 r0 - 1 -(r0) r0 - 1 mod m0 One off-chip memory pathway connected to all three r0 (r0)- (r0 - 1) mod m0 memory spaces r0 (r0)+n0 (r0 + n0) mod m0 Only one off-chip access per cycle maximum r0 (r0)-n0 (r0 - n0) mod m0 FIR Filter in 56001 FIR Filter in 56001 TI TMS320C6000 VLIW DSP n equ 20 # Define symbolic constants movep y:input, x:(r0) # Load sample into memory start equ $40 # Clear accumulator A Eight instruction units dispatched by one very long samples equ $0 # Load a sample into x0 instruction word coeffs equ $0 # Load a coefficient Designed for DSP applications input equ $ffe0 # Memory-mapped I/O clr a x:(r0)+, x0 y:(r4)+, y0 output equ $ffe1 Orthogonal instruction set rep #n-1 # Repeat next instruction n-1 times org p:start # Locate in prog. memory # a = x0 × y0 Big, uniform register file (16 32-bit registers) move #samples, r0 # Pointers to samples # Next sample move #coeffs, r4 # and coefficients # Next coefficient Better compiler target than 56001 move #n-1, m0 # Prepare circular buffer mac x0,y0,a x:(r0)+, x0 y:(r4)+, y0 Deeply pipelined (up to 15 levels) move m0, m4 macr x0,y0,a (r0)- Complicated, but more regular, datapath movep a, y:output # Write output sample Pipelining on the C6 FIR in One 'C6 Assembly Instruction Peripherals One instruction issued per clock cycle Load a halfword (16 bits) Often the whole point of the system Very deep pipeline Do this on unit D1 Memory-mapped I/O FIRLOOP: • 4 fetch cycles LDH .D1 *A1++, A2 ; Fetch next sample • Magical memory locations that make something || LDH .D2 *B1++, B2 ; Fetch next coeff.

Load more