Computer Architectures
Processor+Memory
Richard Šusta, Pavel Píša
Czech Technical University in Prague, Faculty of Electrical Engineering
AE0B36APO Computer Architectures Ver.1.00 2019 1 The main instruction cycle of the CPU
1. Initial setup/reset – set initial PC value, PSW, etc. 2. Read the instruction from the memory
PC → to the address bus
Read the memory contents (machine instruction) and transfer it to the IR
PC+l → PC, where l is length of the instruction 3. Decode operation code (opcode) 4. Execute the operation
compute effective address, select registers, read operands, pass them through ALU and store result 5. Check for exceptions/interrupts (and service them) 6. Repeat from the step 2 AE0B36APO Computer Architectures 2 Single cycle CPU – implementation of the load instruction lw: type I, rs – base address, imm – offset, rt – register where to store fetched data I opcode(6), 31:26 rs(5), 25:21 rt(5), 20:16 immediate (16), 15:0
ALUControl
25:21 WE3 SrcA Zero WE PC’ PC Instr A1 RD1 A RD ALU
A RD Instr. A2 RD2 SrcB AluOut Data ReadData
Memory Memory A3 Reg. WD3 File WD
15:0 Sign Ext SignImm
B35APO Computer Architectures 3 Single cycle CPU – implementation of the load instruction lw: type I, rs – base address, imm – offset, rt – register where to store fetched data I opcode(6), 31:26 rs(5), 25:21 rt(5), 20:16 immediate (16), 15:0
Write on rising edge of CLK RegWrite = 1 ALUControl
25:21 WE3 SrcA Zero WE PC’ PC Instr A1 RD1 A RD ALU
A RD Instr. A2 RD2 SrcB AluOut Data ReadData 20:16 Memory Memory A3 Reg. WD3 File WD
15:0 Sign Ext SignImm
B35APO Computer Architectures 4 Single cycle CPU – implementation of the load instruction lw: type I, rs – base address, imm – offset, rt – register where to store fetched data I opcode(6), 31:26 rs(5), 25:21 rt(5), 20:16 immediate (16), 15:0
RegWrite = 1 ALUControl
25:21 WE3 SrcA Zero WE PC’ PC Instr A1 RD1 A RD ALU
A RD Instr. A2 RD2 SrcB AluOut Data ReadData
Memory 20:16 Memory A3 Reg. WD3 File WD
+ 15:0 4 Sign Ext SignImm PCPlus4
B35APO Computer 5 Architectures Single cycle CPU – implementation of the store instruction sw: type I, rs – base address, imm – offset, rt – select register to store into memory I opcode(6), 31:26 rs(5), 25:21 rt(5), 20:16 immediate (16), 15:0
RegWrite = 0 MemWrite = 1 ALUControl
25:21 WE3 SrcA Zero WE PC’ PC Instr A1 RD1 A RD ALU
20:16 A RD Instr. A2 RD2 SrcB AluOut Data ReadData
Memory 20:16 Memory A3 Reg. WD3 File WD
+ 15:0 4 Sign Ext SignImm PCPlus4
B35APO Computer Architectures 6 Switch analogy
Multiplexer 2 to 1
1bit Select (address) Select z z x0 0 x0 x1 1 x1
B35APO Computer Architectures 7 Single cycle CPU – implementation of the add instruction add: type R, rs, rt – source, rd – destination, funct – select ALU operation = add R opcode(6), 31:26 rs(5), 25:21 rt(5), 20:16 rd(5), 15:11 shamt(5) funct(6), 5:0
RegWrite = 1 ALUSrc = 0 MemToReg = 0 RegDst = 1 ALUControl
25:21 WE3 SrcA Zero WE PC’ PC Instr A1 RD1 A RD ALU 0 Result A RD 1 20:16 Instr. A2 RD2 0 SrcB AluOut Data ReadData
Memory 1 Memory A3 Reg. WD3 File WriteData WD 20:16 Rt 0 WriteReg 1 15:11 Rd + 15:0 4 Sign Ext SignImm PCPlus4
B35APO Computer Architectures 8 Single cycle CPU – sub, and, or, slt
Only difference is another ALU operation selection (ALUcontrol). The data path is the same as for add instruction
RegWrite = 1 ALUSrc = 0 MemToReg = 0 RegDst = 1 ALUControl
25:21 WE3 SrcA Zero WE PC’ PC Instr A1 RD1 A RD ALU 0 Result A RD 1 20:16 Instr. A2 RD2 0 SrcB AluOut Data ReadData
Memory 1 Memory A3 Reg. WD3 File WriteData WD 20:16 Rt 0 WriteReg 1 15:11 Rd + 15:0 4 Sign Ext SignImm PCPlus4
B35APO Computer Architectures 9 Single cycle CPU – implementation of beq • beq – branch if equal; imm–offset; PC´ = PC+4 + SignImm*4 I opcode(6), 31:26 rs(5), 25:21 rt(5), 20:16 immediate (16), 15:0
RegWrite = 0 ALUSrc = 0 Branch = 1 MemToReg = x RegDst = x ALUControl
25:21 WE3 SrcA Zero WE 0 PC’ PC Instr A1 RD1 A RD ALU 0 Result 1 A RD 1 20:16 Instr. A2 RD2 0 SrcB AluOut Data ReadData
Memory 1 Memory A3 Reg. WD3 File WriteData WD 20:16 Rt 0 WriteReg 1 15:11 Rd + 4 15:0 <<2 Sign Ext SignImm PCBranch + PCPlus4
B35APO Computer Architectures 10 Single cycle CPU – Throughput: IPS = IC / T = IPCstr.fCLK • What is the maximal possible frequency of the CPU? • It is given by latency on the critical path – it is lw instruction in our case:
Tc = tPC + tMem + tRFread + tALU + tMem + tMux + tRFsetup
25:21 WE3 SrcA WE 0 PC’ PC Instr A1 RD1 Zero A RD ALU 0 Result 1 A RD 1 20:16 Instr. A2 RD2 0 SrcB AluOut Data ReadData
Memory 1 Memory A3 Reg. WD3 File WriteData WD 20:16 Rt 0 WriteReg 1 15:11 Rd + 4 15:0 <<2 Sign Ext SignImm PCBranch + PCPlus4
B35APO Computer Architectures 11 Single cycle CPU – Throughput: IPS = IC / T = IPCstr.fCLK
• Tc = Tcinstr + Tcproc = (tPC + tMem) + (tRFread + tALU + tMem + tMux + tRFsetup) • Consider following parameters:
tPC = 30 ns tMem = 300 ns tRFread = 50 ns tALU = 200 ns tMux = 20 ns tRFsetup = 20 ns
Then Tc = 920 ns --> fCLK max = 1,08 MHz, IPS = 1 080 000 [instructions per second]
If Tc instr is executed paralel with Tc proc, then Tc i < Tc p, and Tc p = 50+200+300+20+20 = 590 ns = 1.69 MHz -> IPS = 1 690 000
B35APO Computer Architectures 12 Note
• Remember the result, so you can compare it with result for pipelined CPU.
B35APO Computer Architectures 13 Key Technology Gaps Prediction
Note: The increase of algorithm complexity over time has been formalized in literature with the so-defined Shannon's law.
B35APO Computer Architectures 14 Memory and CPU speed – Moore's law more exact numbers CPU 25% 52% 20% performance per year per year per year
Processor-Memory Performance Gap Growing Throughput of memory only +7% per year
Source: Hennesy, Patterson CaaQA 4th ed. 2006 B35APO Computer Architectures 15 Layout of a Program in Memory lower address Other programs .ent start Text Segment entry point, initial value of PC Static Initialized Data Data Segment Static Uninitialized Data
Dynamic Area, heap Heap grows (cz: halda) to higher address
Stack grows Stack Segment to lower address higher address (cz:zásobník)
B35APO Computer Architectures Memory Addressing
Memory address in load and store instructions is specified by a base register and offset
This is called base addressing
addi a0, zero, 10 // load the word from absolute address lw $2, 0x2000($0) // store the word to absolute address sw $2, 0x2004($0) Data Types in Instructions MIPS registers hold 32-bit (4-byte) words.
Other common data sizes include byte and halfword. Byte = 8 bits - example: lb / lbu - load byte extend signed/unsigned
Halfword = 2 bytes example: - lh / lhu - load halfword extend signed/unsigned
Word = 4 bytes - example: - lw - load word
u-unsigned extension 18 Data Directives
.WORD Directive Stores the list as 32-bit values aligned on a word boundary .WORD w:n Directive Stores the 32-bit value w into n consecutive words aligned on a word boundary. .HALF Directive Stores the list as 16-bit values aligned on half-word boundary .HALF w:n Directive Stores the 16-bit value w into n consecutive half-words aligned on a half-word boundary. .BYTE Directive Stores the list of values as 8-bit bytes .BYTE w:n Directive Stores the 8-bit value w into n consecutive bytes.
B35APO Computer Architectures
String Directives
.ASCII Directive Allocates a sequence of bytes for an ASCII string .ASCIIZ Directive Same as .ASCII directive, but adds a NULL char at end of string Strings are null-terminated, as in the C programming language .SPACE n Directive Allocates space of n uninitialized bytes in the data segment
Special characters in strings follow C convention Newline: \n Tab:\t Quote: \”
B35APO Computer Architectures Memory Alignment
.align n directive - aligns the next data definition begins on a 2n byte boundary .align 2 - the least significant 2 bits of address should be 00
Memory
Memory is addressed as an . . .
array of bytes address aligned word Words occupy 4 consecutive 12 not aligned 8 bytes (MIPS is 32bit processor), 4 0 not aligned
B35APO Computer Architectures Align
Assembler align data in data segment .DATA .ALIGN 2 var1: .BYTE 3, 5,'A','P','O' var2: .WORD 0x12345678 .ALIGN 3 var3: .HALF 1000
Example on BIG ENDIAN
var1 var2
Address 0 1 2 3 4 5 6 7 8 9 A B C D E F 0x2000 3 5 41 50 4F 12 34 56 78 0x2010 10 00
var3 B35APO Computer Architectures Lecture motivation
Quick Quiz 1.: Is the result of both code fragments a same? Quick Quiz 2.: Which of the code fragments is processed faster and why?
A: B: int matrix[M][N]; int matrix[M][N]; int i, j, sum = 0; int i, j, sum = 0; … … for(i=0; i Is there a rule how to iterate over matrix element efficiently? B35APO Computer Architectures 23 PC Computer Motherboard B35APO Computer Architectures 24 http://www.pcplanetsystems.com/abc/product_details.php?item_id=3263&category_id=208 Computer architecture (desktop x86 PC) example generic B35APO Computer Architectures 25 From UMA to NUMA development (even in PC segment) Non-Uniform Memory Architecture CPU 1 CPU 2 CPU 1 CPU 2 RAM RAM RAM RAM RAM RAM MC MC MC MC MC Northbridge MC Northbridge CPU 1 CPU 2 CPU 1 CPU 2 RAM Southbridge Southbridge Southbridge Southbridge USB USB USB USB SATA PCI-E SATA PCI-E SATA PCI-E SATA PCI-E MC - Memory controller – contains circuitry responsible for SDRAM read and writes. It also takes care of refreshing each memory cell every 64 ms. B35APO Computer Architectures 26 Intel Core 2 generation B35APO Computer Architectures 27 Intel i3/5/7 generation B35APO Computer Architectures 28 Memory subsystem – terms and definitions ● Memory address – fixed-length sequences of bits or index ● Data value – the visible content of a memory location ● Memory location can hold even more control/bookkeeping information ● validity flag, parity and ECC bits etc. Basic memory parameters: ● Access time – delay or latency between a request and the access being completed or the requested data returned ● Memory latency – time between request and data being available (does not include time required for refresh and deactivation) ● Throughput/bandwidth – main performance indicator. Rate of transferred data units per time. ● Maximal, average and other latency parameters B35APO Computer Architectures 29 Internal architecture of the DRAM memory chip This 4M × 1 DRAM (Dynamic Random Access Memory) is internally realized as an 2048x2048 array of 1b memory cells B35APO Computer Architectures 30 Memory Cell bitline row-address stored bit bitline =0 bitline =Z row-address 1 row-address = 0 stored stored bit = 0 bit = 0 bitline =Z bitline =1 row-address 1 row-address= 0 stored stored bit = 1 bit = 1 B35APO Computer Architectures Switch Analogy of Decoder y y One Hot Decoder cz: Dekodér 1 ze 4 1 0 y1 y0 x 0 x 0 0 x x 1 1 '1' 1 x x 2 2 2 x x 3 3 3 B35APO Computer 32 Architectures Switch analogy Multiplexer 2 to 1 or 1 of 2 cz :2 kanálový (2-vstupový) multiplexor Select Select z z x0 0 x0 x1 1 x1 Multiplexer 4 to 1 or 1 of 4 cz : 4 kanálový (4-vstupový) multiplexor y0 y1 x0 x1 x2 z x3 y1 y0 y1 y0 x 0 x 0 0 z z x 1 x 1 1 x 2 x 2 2 x 3 x 3 3 B35APO Computer Architectures 33 Memory matrix Decoder one-hot bitline2 bitline1 bitline01 bitline0 row 0 0 stored stored stored stored bit = 0 bit = 1 bit = 0 bit = 0 1 row 1 stored stored stored stored bit = 1 bit = 0 bit = 0 bit = 0 2 row 2 stored stored stored stored 2 bit = 1 bit = 1 bit = 0 bit = 0 3 row 3 Address stored stored stored stored 4 4 bit = 0 bit = 1 bit = 1 bit = 1 address register of data Data 3 Data 2 Data 1 Data 1 1 0 2 2 3 Multiplexer clock data out 1 of 4 Register of address is a necessary condition for the implementation and to reduce power consumption B35APO Computer . Architectures Memory matrix - example 1/4 Decoder one-hot bitline2 bitline1 bitline01 bitline0 row 0 0 stored stored stored stored bit = 0 bit = 1 bit = 0 bit = 0 1 row 1 stored stored stored stored bit = 1 bit = 0 bit = 0 bit = 0 2 row 2 stored stored stored stored 2 bit = 1 bit = 1 bit = 0 bit = 0 address 3 row 3 Address 6 = 0110 stored stored stored stored 4 register bit = 0 bit = 1 bit = 1 bit = 1 Data 3 Data 2 Data 1 Data 1 1 0 2 2 3 Multiplexer clock 1 of 4 Address value is waiting for the rising edge of clocks B35APO Computer Architectures Memory matrix - example 2/4 Decoder one-hot bitline2 bitline1 bitline01 bitline0 row 0 0 stored stored stored stored bit = 0 bit = 1 bit = 0 bit = 0 1 row 1 stored stored stored stored bit = 1 bit = 0 bit = 0 bit = 0 2 row 2 stored stored stored stored 01 bit = 1 bit = 1 bit = 0 bit = 0 address 3 row 3 6 = 0110 0110 stored stored stored stored register bit = 0 bit = 1 bit = 1 bit = 1 Data 3 Data 2 Data 1 Data 1 1 0 2 10 3 Multiplexer clock 1 of 4 Address was loaded on the rising edge of clocks and divided to two parts, higher and lower bits B35APO Computer Architectures Memory matrix - example 3/4 Decoder one-hot bitline2 bitline1 bitline01 bitline0 row 0 0 stored stored stored stored bit = 0 bit = 1 bit = 0 bit = 0 1 row 1 stored stored stored stored bit = 1 bit = 0 bit = 0 bit = 0 2 row 2 stored stored stored stored 01 bit = 1 bit = 1 bit = 0 bit = 0 address 3 row 3 6 = 0110 0110 stored stored stored stored register bit = 0 bit = 1 bit = 1 bit = 1 Data 3 Data 2 Data 1 Data 1 1 0 2 10 3 Multiplexer clock 1 of 4 Output of one-hot decoder 1 from N aktivate row and of cell in it. B35APO Computer Architectures Memory matrix - example 4/4 Decoder one-hot bitline2 bitline1 bitline01 bitline0 row 0 0 stored stored stored stored bit = 0 bit = 1 bit = 0 bit = 0 1 row 1 stored stored stored stored bit = 1 bit = 0 bit = 0 bit = 0 2 row 2 stored stored stored stored 01 bit = 1 bit = 1 bit = 0 bit = 0 address 3 row 3 6 = 0110 0110 stored stored stored stored register bit = 0 bit = 1 bit = 1 bit = 1 Data 3 Data 2 Data 1 Data 1 1 0 2 10 3 Multiplexer clock 1 of 4 0 The cells are connected to bitlines and the output multiplexer selects one value - Data 2 = 0 B35APO Computer Architectures Memory terminology • Types: RWM (RAM), ROM, FLASH, • RAM memories: SRAM (static), DRAM (dynamic). • RAM = Random Access Memory type transistors area for access latence 1 bit SRAM cca 6 < 0,1 m2 always < 1ns – 5ns DRAM 1 < 0,001 m2 requires refresh tens of ns B35APO Computer Architectures 39 Typical chip and SRAM cell Principle: SRAM cell 6-transistors CMOS, but there are also 4 transistors versions Area of cell: B35APO Computer Architectures 40 http://educypedia.karadimov.info/library/SEC08.PDF Typical Chip and SRAM cell Typical SRAM chip Example of reading: OE signál je asynchronní B35APO Computer Architectures 41 https://www.ece.cmu.edu/~ece548/localcpy/sramop.pdf Typical Chip and SRAM cell Bigger memory? B35APO Computer Architectures 42 Dynamic memory cell detail Single transistor dynamic memory cell The nMOS transistor is a switch that connects (or not) a capacitor to a "bitline" wire. The connection is controlled by the wordline wire. B35APO Computer Architectures 43 Obrázek: http://www.eetimes.com/document.asp?doc_id=1281315 Capacitor Commercial DRAM parameters Capacity fF [femtofarad] Storage capacitor from 10 fF to 50 fF Bit line around 2 fF [Source: l'INSA de Toulouse] fF - femtofarad An SI unit of electrical capacitance equal to 10−15 farads. 10-6 F = 1 μF = 103 nF = 106 pF = 109 fF ~9 fF - the capacitor is created by two 1 mm2 plates in 1 mm distance inside vacuum B35APO Computer Architectures 44 Development of DRAM chips in time Year Capacity Price[$]/GB Access time [ns] 1980 64 Kb 1 500 000 250 1983 256 Kb 500 000 185 1985 1 Mb 200 000 135 1989 4 Mb 50 000 110 1992 16 Mb 15 000 90 1996 64 Mb 10 000 60 1998 128 Mb 4 000 60 2000 256 Mb 1 000 55 2004 512 Mb 250 50 2007 1 Gb 50 40 RAS – Row Address Strobe, CAS – Column Address Strobe B35APO Computer Architectures 45 Classical DRAM - asynchronous interface • The address is transferred in two phases – reduces number of chip module pins and is natural for internal DRAM organization • This method is preserved even for today chips RAS – Row Address Strobe, CAS – Column Address Strobe B35APO Computer Architectures 46 EDO DRAM – about 1995 • Output register holds data during overlap of next read CAS phase with previous access data transfer • this overlap (“pipelining”) increases throughput B35APO Computer Architectures 47 SDRAM – konec 90.let – synchronní DRAM SDRAM chip is equipped by counter that can be used to define continuous block length (burst) which is read together. B35APO Computer Architectures 48 Consecutive Read/Write B35APO Computer Architectures 49 From SDRAM to Contemporary mMemories • SDRAM – clock frequency up to 100 MHz, 2.5V • DDR (double data rate) SDRAM – data transfer at both CLK edges, 2.5V, I/O bus clock 100-200 MHz, speed 0.2-0.4 GT/s (gigatransfers per second) • DDR2 SDRAM – lower power consumption by power 1.8V, frequency of IO bus 400 MHz, 0.8 GT/s • DDR3 SDRAM – lower power consumption by power 1.5V, IO bus frequency up to 800 MHz, 1.6 GT/s • DDR4 SDRAM - 1.05 – 1.2V, I/O bus clock 1.2 GHz, 2.4 GT/s • DDR5 SDRAM - expected during 2019-20, supposed transfer rate ~6 GT/s All these innovations are focused mainly on throughput, not on the random access latency. B35APO Computer Architectures 50 Other Memories • QDRx SDRAM (Quad Data Rate) - however, they are not 2 * faster, they only allow simultaneous read and write since they have separate clocks for RD and WR, while DDRs are more efficient than QDR for one type of access. • GDDR SDRAM - today up to GDDR6, suitable for graphics cards - based on DDR memory. - their speed is accelerated by the extended output bus. • Furthermore, there are memories and RDRAM modules (RAMBUS DRAMs), which have a completely different interface and method of use. For patent litigation, they have not been used in personal computers since 2003. B35APO Computer Architectures 51