Computer Architectures

Processor+Memory

Richard Šusta, Pavel Píša

Czech Technical University in Prague, Faculty of Electrical Engineering

AE0B36APO Computer Architectures Ver.1.00 2019

The main instruction cycle of the CPU

1. Initial setup/reset – set initial PC value, PSW, etc.
2. Read the instruction from the memory
   • PC → to the address bus
   • Read the memory contents (machine instruction) and transfer it to the IR
   • PC+l → PC, where l is the length of the instruction
3. Decode operation code (opcode)
4. Execute the operation
   • compute effective address, select registers, read operands, pass them through ALU and store result
5. Check for exceptions/interrupts (and service them)
6. Repeat from step 2

Single cycle CPU – implementation of the load instruction

lw: type I, rs – base address, imm – offset, rt – register where to store fetched data

I | opcode(6), 31:26 | rs(5), 25:21 | rt(5), 20:16 | immediate (16), 15:0

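[Figure: single-cycle datapath for lw, first step. PC addresses the instruction memory; Instr 25:21 (rs) selects the register read at A1/RD1 (SrcA); Instr 15:0 passes through Sign Ext (SignImm) to SrcB; the ALU output (AluOut) addresses the data memory, which returns ReadData.]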

B35APO Computer Architectures

Single cycle CPU – implementation of the load instruction

lw: type I, rs – base address, imm – offset, rt – register where to store fetched data

I | opcode(6), 31:26 | rs(5), 25:21 | rt(5), 20:16 | immediate (16), 15:0

RegWrite = 1; the register file is written on the rising edge of CLK.

[Figure: the lw datapath extended with the write-back path. Instr 20:16 (rt) drives A3 and ReadData from the data memory drives WD3 of the register file.]

Single cycle CPU – implementation of the load instruction

lw: type I, rs – base address, imm – offset, rt – register where to store fetched data

I | opcode(6), 31:26 | rs(5), 25:21 | rt(5), 20:16 | immediate (16), 15:0

RegWrite = 1

[Figure: the lw datapath extended with an adder that computes PCPlus4 = PC + 4 and feeds it to PC'.]

Single cycle CPU – implementation of the store instruction

sw: type I, rs – base address, imm – offset, rt – register whose value is stored into memory

I | opcode(6), 31:26 | rs(5), 25:21 | rt(5), 20:16 | immediate (16), 15:0

RegWrite = 0, MemWrite = 1

[Figure: the datapath extended for sw. Instr 20:16 (rt) also drives A2; RD2 is connected to the write-data input (WD) of the data memory, and MemWrite drives its WE input.]
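The address computation shared by lw and sw can be sketched in C. This is only an illustration of the I-type bit fields listed above, not a model of the whole datapath; the names `itype_t`, `decode_itype` and `effective_address` are hypothetical and do not come from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a MIPS I-type instruction word (bit ranges as on the slide). */
typedef struct {
    uint32_t opcode; /* bits 31:26 */
    uint32_t rs;     /* bits 25:21, base register number */
    uint32_t rt;     /* bits 20:16, destination (lw) or source (sw) register */
    int32_t  imm;    /* bits 15:0 after sign extension (SignImm) */
} itype_t;

/* Split an instruction word into its I-type fields. */
itype_t decode_itype(uint32_t instr)
{
    itype_t f;
    uint32_t low = instr & 0xFFFFu;

    f.opcode = (instr >> 26) & 0x3Fu;
    f.rs     = (instr >> 21) & 0x1Fu;
    f.rt     = (instr >> 16) & 0x1Fu;
    /* Sign Ext block: bit 15 decides whether the offset is negative. */
    f.imm    = (low & 0x8000u) ? (int32_t)low - 0x10000 : (int32_t)low;
    return f;
}

/* ALU result for lw/sw: R[rs] + SignImm, with 32-bit wrap-around. */
uint32_t effective_address(const uint32_t regs[32], itype_t f)
{
    assert(f.rs < 32u);
    return regs[f.rs] + (uint32_t)f.imm;
}
```

For example, the word 0x8C022000 decodes to opcode 0x23 (lw), rs = 0, rt = 2 and offset 0x2000, which corresponds to `lw $2, 0x2000($0)`.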

Switch analogy

Multiplexer 2 to 1

1-bit Select (address):
Select = 0: z = x0
Select = 1: z = x1

Single cycle CPU – implementation of the add instruction

add: type R, rs, rt – source, rd – destination, funct – select ALU operation = add

R | opcode(6), 31:26 | rs(5), 25:21 | rt(5), 20:16 | rd(5), 15:11 | shamt(5) | funct(6), 5:0

RegWrite = 1, ALUSrc = 0, MemToReg = 0, RegDst = 1

[Figure: the datapath extended with three multiplexers. ALUSrc selects SrcB from RD2 (0) or SignImm (1); MemToReg selects Result from AluOut (0) or ReadData (1); RegDst selects WriteReg from Rt, Instr 20:16 (0), or Rd, Instr 15:11 (1).]

B35APO Computer Architectures 8 Single cycle CPU – sub, and, or, slt

Only difference is another ALU operation selection (ALUcontrol). The data path is the same as for add instruction

RegWrite = 1 ALUSrc = 0 MemToReg = 0 RegDst = 1 ALUControl

25:21 WE3 SrcA Zero WE PC’ PC Instr A1 RD1 A RD ALU 0 Result A RD 1 20:16 Instr. A2 RD2 0 SrcB AluOut Data ReadData

Memory 1 Memory A3 Reg. WD3 File WriteData WD 20:16 Rt 0 WriteReg 1 15:11 Rd + 15:0 4 Sign Ext SignImm PCPlus4

B35APO Computer Architectures 9 Single cycle CPU – implementation of beq • beq – branch if equal; imm–offset; PC´ = PC+4 + SignImm*4 I opcode(6), 31:26 rs(5), 25:21 rt(5), 20:16 immediate (16), 15:0

RegWrite = 0 ALUSrc = 0 Branch = 1 MemToReg = x RegDst = x ALUControl

25:21 WE3 SrcA Zero WE 0 PC’ PC Instr A1 RD1 A RD ALU 0 Result 1 A RD 1 20:16 Instr. A2 RD2 0 SrcB AluOut Data ReadData

Memory 1 Memory A3 Reg. WD3 File WriteData WD 20:16 Rt 0 WriteReg 1 15:11 Rd + 4 15:0 <<2 Sign Ext SignImm PCBranch + PCPlus4
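The next-PC selection for beq can be sketched as follows. The formula PC' = PC+4 + SignImm*4 is taken from the slide; that the PC multiplexer is driven by Branch AND Zero is an assumption about the figure, and the function name `next_pc` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Returns PC' for one instruction.
 * branch: the Branch control signal (0 or 1)
 * zero:   the Zero flag of the ALU (nonzero when rs == rt) */
uint32_t next_pc(uint32_t pc, uint16_t imm16, int branch, int zero)
{
    uint32_t pc_plus4 = pc + 4u;
    /* Sign Ext block: extend the 16-bit immediate to 32 bits. */
    int32_t sign_imm = (imm16 & 0x8000u) ? (int32_t)imm16 - 0x10000
                                         : (int32_t)imm16;
    /* <<2 block and the PCBranch adder. */
    uint32_t pc_branch = pc_plus4 + ((uint32_t)sign_imm << 2);

    assert(branch == 0 || branch == 1);
    /* Assumed multiplexer control: take the branch only if Branch AND Zero. */
    return (branch && zero) ? pc_branch : pc_plus4;
}
```

With PC = 0x1000 and an offset of 3, a taken branch continues at 0x1010; an offset of -1 (0xFFFF) branches back to the beq itself.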

B35APO Computer Architectures 10 Single cycle CPU – Throughput: IPS = IC / T = IPCstr.fCLK • What is the maximal possible frequency of the CPU? • It is given by latency on the critical path – it is lw instruction in our case:

Tc = tPC + tMem + tRFread + tALU + tMem + tMux + tRFsetup

25:21 WE3 SrcA WE 0 PC’ PC Instr A1 RD1 Zero A RD ALU 0 Result 1 A RD 1 20:16 Instr. A2 RD2 0 SrcB AluOut Data ReadData

Memory 1 Memory A3 Reg. WD3 File WriteData WD 20:16 Rt 0 WriteReg 1 15:11 Rd + 4 15:0 <<2 Sign Ext SignImm PCBranch + PCPlus4

B35APO Computer Architectures 11 Single cycle CPU – Throughput: IPS = IC / T = IPCstr.fCLK

• Tc = Tcinstr + Tcproc = (tPC + tMem) + (tRFread + tALU + tMem + tMux + tRFsetup) • Consider following parameters:

tPC = 30 ns tMem = 300 ns tRFread = 50 ns tALU = 200 ns tMux = 20 ns tRFsetup = 20 ns

Then Tc = 920 ns --> fCLK max = 1,08 MHz, IPS = 1 080 000 [instructions per second]

If Tc instr is executed paralel with Tc proc, then Tc i < Tc p, and Tc p = 50+200+300+20+20 = 590 ns = 1.69 MHz -> IPS = 1 690 000
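The arithmetic above can be checked with a few lines of C. The delay values are the ones given on the slide; the type and function names are hypothetical, and the sketch assumes one instruction per clock cycle, so IPS equals fCLK.

```c
#include <assert.h>

/* Component delays in nanoseconds. */
typedef struct {
    double t_pc, t_mem, t_rf_read, t_alu, t_mux, t_rf_setup;
} delays_ns_t;

/* Cycle time when instruction fetch and execution follow each other. */
double tc_sequential(delays_ns_t d)
{
    return (d.t_pc + d.t_mem)
         + (d.t_rf_read + d.t_alu + d.t_mem + d.t_mux + d.t_rf_setup);
}

/* Cycle time when the fetch part runs in parallel with the execute part:
 * the longer of the two parts determines the clock period. */
double tc_overlapped(delays_ns_t d)
{
    double instr = d.t_pc + d.t_mem;
    double proc  = d.t_rf_read + d.t_alu + d.t_mem + d.t_mux + d.t_rf_setup;
    return instr > proc ? instr : proc;
}

/* Instructions per second for a single-cycle CPU (IPC = 1). */
double ips_from_tc(double tc_ns)
{
    assert(tc_ns > 0.0);
    return 1e9 / tc_ns;
}
```

With the slide's parameters `{30, 300, 50, 200, 20, 20}`, `tc_sequential` gives 920 ns and `tc_overlapped` gives 590 ns.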

B35APO Computer Architectures 12 Note

• Remember the result, so you can compare it with result for pipelined CPU.

B35APO Computer Architectures 13 Key Technology Gaps Prediction

Note: The increase of algorithm complexity over time has been formalized in literature with the so-defined Shannon's law.

B35APO Computer Architectures 14 Memory and CPU speed – Moore's law more exact numbers CPU 25% 52% 20% performance per year per year per year

Processor-Memory Performance Gap Growing Throughput of memory only +7% per year

Source: Hennessy, Patterson, CaaQA, 4th ed., 2006

Layout of a Program in Memory

lower address
    Other programs
    Text Segment: .ent start – entry point, initial value of PC
    Data Segment: Static Initialized Data
                  Static Uninitialized Data
    Dynamic Area, heap (cz: halda): the heap grows to higher addresses
    Stack Segment (cz: zásobník): the stack grows to lower addresses
higher address

Memory Addressing

• Memory address in load and store instructions is specified by a base register and offset

• This is called base addressing

    addi a0, zero, 10
    // load the word from absolute address
    lw $2, 0x2000($0)
    // store the word to absolute address
    sw $2, 0x2004($0)

Data Types in Instructions

MIPS registers hold 32-bit (4-byte) words.

Other common data sizes include byte and halfword.

Byte = 8 bits – example: lb / lbu – load byte with signed/unsigned extension

Halfword = 2 bytes – example: lh / lhu – load halfword with signed/unsigned extension

Word = 4 bytes – example: lw – load word

u – unsigned extension

Data Directives

.WORD Directive
    Stores the list as 32-bit values aligned on a word boundary
.WORD w:n Directive
    Stores the 32-bit value w into n consecutive words aligned on a word boundary
.HALF Directive
    Stores the list as 16-bit values aligned on a half-word boundary
.HALF w:n Directive
    Stores the 16-bit value w into n consecutive half-words aligned on a half-word boundary
.BYTE Directive
    Stores the list of values as 8-bit bytes
.BYTE w:n Directive
    Stores the 8-bit value w into n consecutive bytes

String Directives

.ASCII Directive
    Allocates a sequence of bytes for an ASCII string
.ASCIIZ Directive
    Same as the .ASCII directive, but adds a NULL char at the end of the string
    Strings are null-terminated, as in the C programming language
.SPACE n Directive
    Allocates space of n uninitialized bytes in the data segment

Special characters in strings follow the C convention
    Newline: \n    Tab: \t    Quote: \"

Memory Alignment

.align n directive – aligns the next data definition on a 2^n byte boundary
.align 2 – the least significant 2 bits of the address must be 00

Memory

Memory is addressed as an array of bytes.
Words occupy 4 consecutive bytes (MIPS is a 32-bit processor).

[Figure: byte array with addresses 0, 4, 8, 12, . . . showing one aligned word and two words that are not aligned]
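The alignment rule can be written as a small address computation. This is a sketch of what an assembler has to do for `.align n`, not the code of any particular assembler; `align_up` and `is_aligned` are hypothetical names, and overflow at the top of the address space is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up to the next multiple of 2^n (the effect of .align n). */
uint32_t align_up(uint32_t addr, unsigned n)
{
    uint32_t mask;

    assert(n < 32u);
    mask = ((uint32_t)1 << n) - 1u;   /* the n least significant bits */
    return (addr + mask) & ~mask;
}

/* An address is aligned on 2^n when its n least significant bits are 0. */
int is_aligned(uint32_t addr, unsigned n)
{
    assert(n < 32u);
    return (addr & (((uint32_t)1 << n) - 1u)) == 0u;
}
```

For example, `align_up(0x2005, 2)` returns 0x2008 and `align_up(0x200C, 3)` returns 0x2010, while an address that is already aligned is left unchanged.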

Align

The assembler aligns data in the data segment

    .DATA
    .ALIGN 2
    var1: .BYTE 3, 5,'A','P','O'
    var2: .WORD 0x12345678
    .ALIGN 3
    var3: .HALF 1000

Example on BIG ENDIAN

Address   0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
0x2000    03 05 41 50 4F          12 34 56 78
0x2010    03 E8

var1 occupies 0x2000–0x2004, var2 occupies 0x2008–0x200B, var3 occupies 0x2010–0x2011 (1000 = 0x03E8).

Lecture motivation

Quick Quiz 1.: Is the result of both code fragments the same?
Quick Quiz 2.: Which of the code fragments is processed faster and why?

A: B: int matrix[M][N]; int matrix[M][N]; int i, j, sum = 0; int i, j, sum = 0; … … for(i=0; i

Is there a rule for iterating over matrix elements efficiently?

PC Computer

[Figure: PC computer. Source: http://www.pcplanetsystems.com/abc/product_details.php?item_id=3263&category_id=208]

Computer architecture (desktop PC)

[Figure: block diagrams of a desktop PC, an example and a generic one]

From UMA to NUMA development (even in PC segment)

[Figure: four system diagrams, each with CPU 1, CPU 2 and a Southbridge (USB, SATA, PCI-E). The memory controller (MC) moves from the Northbridge, with all RAM attached to it, into the CPUs, each with its own RAM: Non-Uniform Memory Architecture.]

MC – memory controller – contains circuitry responsible for SDRAM reads and writes. It also takes care of refreshing each memory cell every 64 ms.

Core 2 generation

Intel i3/5/7 generation

Memory subsystem – terms and definitions

● Memory address – fixed-length sequences of bits or index

● Data value – the visible content of a memory location

● Memory location can hold even more control/bookkeeping information

● validity flag, parity and ECC bits etc.

Basic memory parameters:

● Access time – delay or latency between a request and the access being completed or the requested data returned

● Memory latency – time between request and data being available (does not include time required for refresh and deactivation)

● Throughput/bandwidth – main performance indicator. Rate of transferred data units per time.

● Maximal, average and other latency parameters

Internal architecture of the DRAM memory chip

This 4M × 1 DRAM (Dynamic Random Access Memory) is internally realized as a 2048 × 2048 array of 1-bit memory cells
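Since 2048 × 2048 = 4M, a 22-bit cell address splits into an 11-bit row and an 11-bit column. A minimal sketch, assuming the upper half of the address selects the row (the slide does not state which half is which); `dram_addr_t` and `split_address` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t row; /* 0..2047 */
    uint32_t col; /* 0..2047 */
} dram_addr_t;

/* Split a 22-bit cell address of a 4M x 1 chip into row and column. */
dram_addr_t split_address(uint32_t addr)
{
    dram_addr_t a;

    assert(addr < (1u << 22));  /* 4M cells in total */
    a.row = addr >> 11;         /* assumed: upper 11 bits select the row */
    a.col = addr & 0x7FFu;      /* lower 11 bits select the column */
    return a;
}
```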

Memory Cell

[Figure: a memory cell with a stored bit, connected to a bitline and selected by row-address]

row-address   stored bit   bitline
1             0            0
0             0            Z
1             1            1
0             1            Z

Switch Analogy of Decoder

One Hot Decoder, 1 of 4 (cz: Dekodér 1 ze 4)

[Figure: switch analogy of the decoder. The 2-bit address y1 y0 routes the constant '1' to exactly one of the outputs x0, x1, x2, x3.]
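The behaviour of the 1-of-4 decoder can be expressed in one line of C. In this sketch the four outputs are packed into one integer, with bit i standing for output x_i; that packing and the name `decode_1_of_4` are choices made for the illustration only.

```c
#include <assert.h>

/* One-hot decoder 1 of 4: exactly one output bit is 1.
 * y is the 2-bit address (y1 y0); bit i of the result is output x_i. */
unsigned decode_1_of_4(unsigned y)
{
    assert(y < 4u);
    return 1u << y;
}
```

Address 2 (y1 y0 = 10) therefore yields 0100 in binary: only x2 is active.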

Switch analogy

Multiplexer 2 to 1 or 1 of 2 (cz: 2kanálový (2-vstupový) multiplexor)

Select = 0: z = x0
Select = 1: z = x1

Multiplexer 4 to 1 or 1 of 4 (cz: 4kanálový (4-vstupový) multiplexor)

[Figure: switch analogy and symbol of the 4-to-1 multiplexer. The 2-bit select y1 y0 connects one of the inputs x0, x1, x2, x3 to the output z.]

y1 y0 = 00: z = x0
y1 y0 = 01: z = x1
y1 y0 = 10: z = x2
y1 y0 = 11: z = x3
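Both multiplexers can be modelled as selection functions. Building the 4-to-1 multiplexer from three 2-to-1 multiplexers is one possible structure chosen for this sketch; the slides do not say how it is implemented, and the names `mux2` and `mux4` are hypothetical.

```c
#include <assert.h>

/* Multiplexer 2 to 1: select = 0 passes x0, select = 1 passes x1. */
int mux2(int select, int x0, int x1)
{
    assert(select == 0 || select == 1);
    return select ? x1 : x0;
}

/* Multiplexer 4 to 1: y is the 2-bit select (y1 y0). */
int mux4(unsigned y, int x0, int x1, int x2, int x3)
{
    int low, high;

    assert(y < 4u);
    low  = mux2((int)(y & 1u), x0, x1);        /* y0 selects within x0/x1 */
    high = mux2((int)(y & 1u), x2, x3);        /* y0 selects within x2/x3 */
    return mux2((int)((y >> 1) & 1u), low, high); /* y1 selects the pair */
}
```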

Memory matrix

[Figure: a clocked 4-bit address register splits the address into two 2-bit parts. One part drives the one-hot decoder that selects row 0 to row 3; the other drives the 1-of-4 multiplexer that selects one of the bitlines (Data 3 to Data 0) as data out.]

Stored bits:

         Data 3   Data 2   Data 1   Data 0
row 0    0        1        0        0
row 1    1        0        0        0
row 2    1        1        0        0
row 3    0        1        1        1

An address register is necessary for the implementation and also reduces power consumption.
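A read from this 4 × 4 matrix can be simulated directly. The sketch assumes that the stored bits are listed in the order Data 3 to Data 0 and that the upper two address bits select the row while the lower two select the column; `cells` and `read_bit` are hypothetical names.

```c
#include <assert.h>

/* cells[row][k] holds the bit on bitline "Data k" of the given row. */
static const int cells[4][4] = {
    /* Data 0, Data 1, Data 2, Data 3 */
    { 0, 0, 1, 0 },  /* row 0 */
    { 0, 0, 0, 1 },  /* row 1 */
    { 0, 0, 1, 1 },  /* row 2 */
    { 1, 1, 1, 0 },  /* row 3 */
};

/* Read one bit: the decoder activates a row, the multiplexer picks a bitline. */
int read_bit(unsigned addr)
{
    unsigned row, col;

    assert(addr < 16u);
    row = (addr >> 2) & 3u;  /* upper two bits go to the one-hot decoder */
    col = addr & 3u;         /* lower two bits go to the output multiplexer */
    return cells[row][col];
}
```

For address 6 = 0110 the decoder activates row 1 (01) and the multiplexer selects Data 2 (10), so the value read is 0.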

Memory matrix - example 1/4

[Figure: the memory matrix with Address 6 = 0110 at the input of the address register]

The address value is waiting for the rising edge of the clock.

Memory matrix - example 2/4

[Figure: the address register holds 0110; the upper bits 01 go to the one-hot decoder and the lower bits 10 go to the multiplexer]

The address was loaded on the rising edge of the clock and divided into two parts, the higher and the lower bits.

Memory matrix - example 3/4

[Figure: decoder input 01 activates row 1]

The output of the one-hot decoder (1 of N) activates one row and all the cells in it.

Memory matrix - example 4/4

[Figure: multiplexer select 10 passes Data 2 to the output; data out = 0]

The cells are connected to the bitlines and the output multiplexer selects one value: Data 2 = 0.

Memory terminology

• Types: RWM (RAM), ROM, FLASH
• RAM memories: SRAM (static), DRAM (dynamic)
• RAM = Random Access Memory

type   transistors   area for 1 bit   access             latency
SRAM   approx. 6     < 0.1 μm^2       always             < 1 ns – 5 ns
DRAM   1             < 0.001 μm^2     requires refresh   tens of ns

Typical chip and SRAM cell

Principle: 6-transistor CMOS SRAM cell; 4-transistor versions also exist.

Area of cell: [Figure]

Source: http://educypedia.karadimov.info/library/SEC08.PDF

Typical Chip and SRAM cell

Typical SRAM chip. Example of reading: [Figure]

The OE signal is asynchronous.

Source: https://www.ece.cmu.edu/~ece548/localcpy/sramop.pdf

Typical Chip and SRAM cell

Bigger memory?

Dynamic memory cell detail

Single transistor dynamic memory cell

The nMOS transistor is a switch that connects (or not) a capacitor to a "bitline" wire. The connection is controlled by the wordline wire.

Figure: http://www.eetimes.com/document.asp?doc_id=1281315

Capacitor

Commercial DRAM parameters

                    Capacity [fF]
Storage capacitor   from 10 fF to 50 fF
Bit line            around 2 fF

[Source: l'INSA de Toulouse]

fF – femtofarad: an SI unit of electrical capacitance equal to 10^-15 farads.

10^-6 F = 1 μF = 10^3 nF = 10^6 pF = 10^9 fF
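For a sense of scale, the capacitance of an ideal parallel-plate capacitor in vacuum, C = ε0 · A / d, can be evaluated numerically. The formula and the value of ε0 are standard physics rather than something stated on the slide, and the function names are hypothetical.

```c
#include <assert.h>

/* Vacuum permittivity in farads per metre. */
#define EPSILON_0 8.8541878128e-12

/* Capacitance of an ideal parallel-plate capacitor in vacuum:
 * C = epsilon_0 * area / distance (edge effects neglected). */
double plate_capacitance_farads(double area_m2, double distance_m)
{
    assert(distance_m > 0.0);
    return EPSILON_0 * area_m2 / distance_m;
}

/* Unit conversion: 1 F = 10^15 fF. */
double farads_to_femtofarads(double farads)
{
    return farads * 1e15;
}
```

Two plates of 1 mm^2 (1e-6 m^2) at a distance of 1 mm (1e-3 m) give roughly 8.85 fF, the same order of magnitude as a DRAM storage capacitor.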

~9 fF – the capacitance of a capacitor formed by two 1 mm^2 plates 1 mm apart in vacuum

Development of DRAM chips in time

Year   Capacity   Price [$]/GB   Access time [ns]
1980   64 Kb      1 500 000      250
1983   256 Kb     500 000        185
1985   1 Mb       200 000        135
1989   4 Mb       50 000         110
1992   16 Mb      15 000         90
1996   64 Mb      10 000         60
1998   128 Mb     4 000          60
2000   256 Mb     1 000          55
2004   512 Mb     250            50
2007   1 Gb       50             40

RAS – Row Address Strobe, CAS – Column Address Strobe

Classical DRAM - asynchronous interface

• The address is transferred in two phases – this reduces the number of chip module pins and is natural for the internal DRAM organization
• This method is preserved even in today's chips

RAS – Row Address Strobe, CAS – Column Address Strobe

EDO DRAM – about 1995

• The output register holds data while the CAS phase of the next read overlaps with the data transfer of the previous access
• This overlap ("pipelining") increases throughput

SDRAM – late 1990s – synchronous DRAM

The SDRAM chip is equipped with a counter that can be used to define the length of a continuous block (burst) that is read together.

Consecutive Read/Write

From SDRAM to Contemporary Memories

• SDRAM – clock frequency up to 100 MHz, 2.5 V
• DDR (Double Data Rate) SDRAM – data transfer at both CLK edges, 2.5 V, I/O bus clock 100-200 MHz, speed 0.2-0.4 GT/s (gigatransfers per second)
• DDR2 SDRAM – lower power consumption thanks to a 1.8 V supply, I/O bus clock 400 MHz, 0.8 GT/s
• DDR3 SDRAM – lower power consumption thanks to a 1.5 V supply, I/O bus clock up to 800 MHz, 1.6 GT/s
• DDR4 SDRAM – 1.05 – 1.2 V, I/O bus clock 1.2 GHz, 2.4 GT/s
• DDR5 SDRAM – expected during 2019-20, supposed transfer rate ~6 GT/s

All these innovations are focused mainly on throughput, not on the random access latency.
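The transfer rates in the list follow from the I/O bus clock, because DDR memories transfer data on both clock edges. A minimal sketch of that relation; the function name is hypothetical.

```c
#include <assert.h>

/* Transfer rate of a DDR-type memory in GT/s:
 * two transfers per I/O bus clock cycle (both CLK edges). */
double ddr_gigatransfers_per_s(double io_clock_mhz)
{
    assert(io_clock_mhz > 0.0);
    return 2.0 * io_clock_mhz / 1000.0;
}
```

An I/O bus clock of 400 MHz gives 0.8 GT/s (DDR2), 800 MHz gives 1.6 GT/s (DDR3) and 1200 MHz gives 2.4 GT/s (DDR4), matching the figures above.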

Other Memories

• QDRx SDRAM (Quad Data Rate) – however, they are not twice as fast; they only allow simultaneous read and write, since they have separate clocks for RD and WR, while DDRs are more efficient than QDR for one type of access.
• GDDR SDRAM – today up to GDDR6, suitable for graphics cards
  - based on DDR memory
  - their speed is increased by the extended output bus
• Furthermore, there are RDRAM memories and modules (Rambus DRAMs), which have a completely different interface and method of use. Because of patent litigation, they have not been used in personal computers since 2003.