CSC 252/452: Computer Organization

Today
• Arrays
  – One-dimensional
  – Multi-dimensional (nested)
  – Multi-level
• Structures
  – Allocation
  – Access
  – Alignment
• Unions
• Floating Point

Programming with SSE3: XMM Registers
◼ 16 total, each 16 bytes
◼ 16 single-byte integers
◼ 8 16-bit integers
◼ 4 32-bit integers
◼ 4 single-precision floats
◼ 2 double-precision floats
◼ 1 single-precision float
◼ 1 double-precision float

Scalar & SIMD Operations
◼ Scalar Operations: Single Precision
      addss %xmm0,%xmm1
◼ SIMD Operations: Single Precision
      addps %xmm0,%xmm1
◼ Scalar Operations: Double Precision
      addsd %xmm0,%xmm1
[Figures: each instruction adds %xmm0 into %xmm1; addss and addsd operate on a single element, addps on four elements in parallel]

FP Basics
• Arguments passed in %xmm0, %xmm1, ...
• Result returned in %xmm0
• All XMM registers caller-saved

    float fadd(float x, float y)
    {
        return x + y;
    }

    # x in %xmm0, y in %xmm1
    addss %xmm1, %xmm0
    ret

    double dadd(double x, double y)
    {
        return x + y;
    }

    # x in %xmm0, y in %xmm1
    addsd %xmm1, %xmm0
    ret

FP Memory Referencing
• Integer (and pointer) arguments passed in regular registers
• FP values passed in XMM registers
• Different mov instructions to move between XMM registers, and between memory and XMM registers

    double dincr(double *p, double v)
    {
        double x = *p;
        *p = x + v;
        return x;
    }

    # p in %rdi, v in %xmm0
    movapd %xmm0, %xmm1    # Copy v
    movsd  (%rdi), %xmm0   # x = *p
    addsd  %xmm0, %xmm1    # t = x + v
    movsd  %xmm1, (%rdi)   # *p = t
    ret

Other Aspects of FP Code
• Lots of instructions
  – Different operations, different formats, ...
• Floating-point comparisons
  – Instructions ucomiss and ucomisd
  – Set condition codes CF, ZF, and PF
• Using constant values
  – Set XMM0 register to 0 with instruction xorpd %xmm0, %xmm0
  – Others loaded from memory

Breakout
Consider the following declaration of a two-dimensional array:

    int Array[n][n];

Assume n in %rdi; Array in %rsi; i in %rdx; j in %rcx.
Write the assembly code (x86-based) to read Array[i][j] into register %eax.

n X n Matrix Access
Array Elements
▪ Address A + i * (C * K) + j * K
▪ C = n, K = 4
▪ Must perform integer multiplication

    /* Get element a[i][j] */
    int var_ele(size_t n, int a[n][n], size_t i, size_t j)
    {
        return a[i][j];
    }

    # n in %rdi, a in %rsi, i in %rdx, j in %rcx
    imulq %rdx, %rdi             # n*i
    leaq  (%rsi,%rdi,4), %rax    # a + 4*n*i
    movl  (%rax,%rcx,4), %eax    # a + 4*n*i + 4*j
    ret

CSC 252: Processor Architecture
How do we go from a sequence of instructions to actual execution?

Instruction Set Architecture
• Assembly Language View
  – Processor state
    • Registers, memory, …
  – Instructions
    • addl, movl, leal, …
    • How instructions are encoded as bytes
[Figure: abstraction layers, top to bottom: Application Program; Compiler, OS; ISA; CPU Design]
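The address formula from the n X n Matrix Access slide, A + i * (C * K) + j * K, can be checked in plain C. The sketch below is illustrative only: the helper names elem_offset and var_ele_manual are hypothetical and do not come from the slides, and K is taken as sizeof(int), which is 4 on the x86-64 targets the slides assume.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of a[i][j] from the start of an n x n int array:
   i * (C * K) + j * K, with C = n and K = sizeof(int). */
static size_t elem_offset(size_t n, size_t i, size_t j)
{
    assert(i < n && j < n);                 /* stay inside the array */
    size_t row_bytes = n * sizeof(int);     /* C * K: bytes per row */
    return i * row_bytes + j * sizeof(int);
}

/* Same result as var_ele from the slide, but using the hand-computed
   byte offset instead of the compiler's a[i][j] indexing. */
static int var_ele_manual(size_t n, int a[n][n], size_t i, size_t j)
{
    const char *base = (const char *)a;     /* A: address of a[0][0] */
    return *(const int *)(base + elem_offset(n, i, j));
}
```

For a 3 x 3 array, elem_offset(3, 1, 2) is 5 * sizeof(int): one full row of three elements plus two more. The multiplication by n is the step the slide refers to as "must perform integer multiplication", since n is not known at compile time.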
[Figure, continued: lower abstraction layers: Circuit Design; Chip Layout]

Overview of Logic Design
• Fundamental Hardware Requirements
  – Communication
    • How to get values from one place to another
  – Computation: combinational logic
  – Storage: sequential logic
  – Clock to drive the next computation
• Bits are Our Friends
  – Everything expressed in terms of values 0 and 1
  – Communication
    • Low or high voltage on wire
  – Computation
    • Compute Boolean functions
  – Storage
    • Store bits of information

Digital Signals
[Figure: voltage plotted against time, decoded as the bit sequence 0 1 0]
– Use voltage thresholds to extract discrete values from continuous signal
– Simplest version: 1-bit signal
  • Either high range (1) or low range (0)
  • With guard range between them
– Not strongly affected by noise or low-quality circuit elements
  • Can make circuits simple, small, and fast

Basic Building Block: Transistors
[Figures only]

CMOS: Complementary MOS
• Use both n-type and p-type

CMOS: NOR and NAND Gates
NAND Gate (NOT + AND)
[Figure: CMOS NAND gate. By Reza Mirhosseini - originally uploaded to en.wikipedia (file log), Public Domain, https://commons.wikimedia.org/w/index.php?curid=12271062 https://upload.wikimedia.org/wikipedia/commons/thumb/e/e2/CMOS_NAND.svg/280px-CMOS_NAND.svg.png]

Computing with Logic Gates
  And: out = a && b
  Or:  out = a || b
  Not: out = !a
– Outputs are Boolean functions of inputs
– Respond continuously to changes in inputs
  • With some small delay
[Figure: voltage against time for a, b, and a && b, showing the rising delay and the falling delay]

Combinational Circuits
[Figure: acyclic network of gates between primary inputs and primary outputs]
• Acyclic Network of Logic Gates
  – Continuously responds to changes on primary inputs
  – Primary outputs become (after some delay) Boolean functions of primary inputs

Arithmetic Logic Unit
[Figure: four ALU configurations, each with inputs Y (port A) and X (port B) and outputs OF, ZF, CF; control value 0 computes X + Y, 1 computes X - Y, 2 computes X & Y, 3 computes X ^ Y]
– Combinational logic
  • Continuously responding to inputs
– Control signal selects function computed
  • Corresponding to 4 arithmetic/logical operations in Y86
– Also computes values for condition codes

Sequential Logic: Memory and Control
• Sequential:
  – Output depends on the current input values and the previous sequence of input values.
  – Sequential circuits are cyclic:
    • Output of a gate feeds its input at some future time.
  – Memory:
    • Remember results of previous operations
    • Use them as inputs.
  – Example of use:
    • Build registers and memory units.

Clocks
• Signal used to synchronize activity in a processor
• Every operation must be completed in the time between two clock pulses (or rising edges): the cycle time
• Maximum clock rate (frequency) determined by the slowest logic path in the circuit (the critical path)

Edge-Triggered Latch
[Figure: latch circuit with inputs D (Data) and C (Clock), trigger pulse T, internal signals R and S, outputs Q+ and Q–; timing diagram of C, T, D, and Q+ over time]
– Only in latching mode for brief period
  • Rising clock edge
– Value latched depends on data as clock rises
– Output remains stable at all other times

Registers
[Figure: structure of a register: eight edge-triggered latches in parallel, inputs i7..i0, outputs o7..o0, one shared Clock]
– Stores word of data
  • Different from program registers seen in assembly code
– Collection of edge-triggered latches
– Loads input on rising edge of clock

Register Operation
[Figure: before the rising clock edge, State = x, Input = y, Output = x; after it, State = y, Output = y]
– Stores data bits
– For most of time acts as barrier between input and output
– As clock rises, loads input

State Machine Example
[Figure: combinational logic (ALU plus MUX, with MUX input 0 from the ALU and input 1 from In, selected by Load) feeding a clocked register whose output is Out]
– Accumulator circuit
– Load or accumulate on each cycle

  Clock cycle timing:
  Load   1    0       0          1    0       0
  In     x0   x1      x2         x3   x4      x5
  Out    x0   x0+x1   x0+x1+x2   x3   x3+x4   x3+x4+x5

Random-Access Memory
[Figure: register file with read ports A (srcA, valA) and B (srcB, valB), write port W (dstW, valW), and Clock]
– Stores multiple words of memory
  • Address input specifies which word to read or write
– Register file
  • Holds values of program registers
  • %eax, %esp, etc.
  • Register identifier serves as address
    – ID 8 implies no read or write performed
– Multiple Ports
  • Can read and/or write multiple words in one cycle
    – Each has separate address and data input/output

Register File Timing
• Reading
  – Like combinational logic
  – Output data generated based on input address
    • After some delay
• Writing
  – Like register
  – Update only as clock rises
[Figure: reading with srcA = 2 and srcB = 2 returns x on valA and valB; writing valW = y with dstW = 2 changes register 2 from x to y at the rising clock edge]

Building Blocks
[Figure: ALU with control input fun, MUX, and clocked register file]
• Combinational Logic
  – Compute Boolean functions of inputs
  – Continuously respond to input changes
  – Operate on data and implement control
• Storage Elements
  – Store bits
  – Addressable memories
  – Non-addressable registers
  – Loaded only as clock rises
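The accumulator from the State Machine Example slide can be modeled in software to make the split between combinational logic and clocked storage concrete. This is a rough sketch under stated assumptions, not a hardware description: the names accum_t, accum_next, accum_clock, and accum_run are hypothetical, values are plain int, and each call to accum_clock stands for one rising clock edge.

```c
#include <assert.h>
#include <stddef.h>

/* Storage element: one register whose contents appear on Out. */
typedef struct {
    int state;
} accum_t;

/* Combinational logic: the MUX passes In when Load is set,
   otherwise it passes the ALU result Out + In. */
static int accum_next(const accum_t *acc, int load, int in)
{
    return load ? in : acc->state + in;
}

/* Rising clock edge: the register loads the value computed by the
   combinational logic; at all other times the state is unchanged. */
static int accum_clock(accum_t *acc, int load, int in)
{
    acc->state = accum_next(acc, load, in);
    return acc->state;
}

/* Drive the circuit for count cycles, recording Out after each edge. */
static void accum_run(const int *load, const int *in, int *out, size_t count)
{
    accum_t acc = { 0 };
    assert(load && in && out);
    for (size_t k = 0; k < count; k++)
        out[k] = accum_clock(&acc, load[k], in[k]);
}
```

With Load set in cycles 0 and 3 and inputs x0..x5, the recorded outputs follow the slide's timing diagram: x0, x0+x1, x0+x1+x2, x3, x3+x4, x3+x4+x5. Keeping accum_next free of side effects mirrors the slide's point that combinational logic only computes, while state changes happen solely at the clock edge.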