Computer Architectures

Computer Architectures Processor+Memory Richard Šusta, Pavel Píša Czech Technical University in Prague, Faculty of Electrical Engineering AE0B36APO Computer Architectures Ver.1.00 2019 1 The main instruction cycle of the CPU 1. Initial setup/reset – set initial PC value, PSW, etc. 2. Read the instruction from the memory PC → to the address bus Read the memory contents (machine instruction) and transfer it to the IR PC+l → PC, where l is length of the instruction 3. Decode operation code (opcode) 4. Execute the operation compute effective address, select registers, read operands, pass them through ALU and store result 5. Check for exceptions/interrupts (and service them) 6. Repeat from the step 2 AE0B36APO Computer Architectures 2 Single cycle CPU – implementation of the load instruction lw: type I, rs – base address, imm – offset, rt – register where to store fetched data I opcode(6), 31:26 rs(5), 25:21 rt(5), 20:16 immediate (16), 15:0 ALUControl 25:21 WE3 SrcA Zero WE PC’ PC Instr A1 RD1 A RD ALU A RD Instr. A2 RD2 SrcB AluOut Data ReadData Memory Memory A3 Reg. WD3 File WD 15:0 Sign Ext SignImm B35APO Computer Architectures 3 Single cycle CPU – implementation of the load instruction lw: type I, rs – base address, imm – offset, rt – register where to store fetched data I opcode(6), 31:26 rs(5), 25:21 rt(5), 20:16 immediate (16), 15:0 Write on rising edge of CLK RegWrite = 1 ALUControl 25:21 WE3 SrcA Zero WE PC’ PC Instr A1 RD1 A RD ALU A RD Instr. A2 RD2 SrcB AluOut Data ReadData 20:16 Memory Memory A3 Reg. WD3 File WD 15:0 Sign Ext SignImm B35APO Computer Architectures 4 Single cycle CPU – implementation of the load instruction lw: type I, rs – base address, imm – offset, rt – register where to store fetched data I opcode(6), 31:26 rs(5), 25:21 rt(5), 20:16 immediate (16), 15:0 RegWrite = 1 ALUControl 25:21 WE3 SrcA Zero WE PC’ PC Instr A1 RD1 A RD ALU A RD Instr. A2 RD2 SrcB AluOut Data ReadData Memory 20:16 Memory A3 Reg. WD3 File WD + 15:0 4 Sign Ext SignImm PCPlus4 B35APO Computer 5 Architectures Single cycle CPU – implementation of the store instruction sw: type I, rs – base address, imm – offset, rt – select register to store into memory I opcode(6), 31:26 rs(5), 25:21 rt(5), 20:16 immediate (16), 15:0 RegWrite = 0 MemWrite = 1 ALUControl 25:21 WE3 SrcA Zero WE PC’ PC Instr A1 RD1 A RD ALU 20:16 A RD Instr. A2 RD2 SrcB AluOut Data ReadData Memory 20:16 Memory A3 Reg. WD3 File WD + 15:0 4 Sign Ext SignImm PCPlus4 B35APO Computer Architectures 6 Switch analogy Multiplexer 2 to 1 1bit Select (address) Select z x x z 0 0 0 x1 1 x1 B35APO Computer Architectures 7 Single cycle CPU – implementation of the add instruction add: type R, rs, rt – source, rd – destination, funct – select ALU operation = add R opcode(6), 31:26 rs(5), 25:21 rt(5), 20:16 rd(5), 15:11 shamt(5) funct(6), 5:0 RegWrite = 1 ALUSrc = 0 MemToReg = 0 RegDst = 1 ALUControl 25:21 WE3 SrcA Zero WE PC’ PC Instr A1 RD1 A RD ALU 0 Result A RD 1 20:16 Instr. A2 RD2 0 SrcB AluOut Data ReadData Memory 1 Memory A3 Reg. WD3 File WriteData WD 20:16 Rt 0 WriteReg 1 15:11 Rd + 15:0 4 Sign Ext SignImm PCPlus4 B35APO Computer Architectures 8 Single cycle CPU – sub, and, or, slt Only difference is another ALU operation selection (ALUcontrol). The data path is the same as for add instruction RegWrite = 1 ALUSrc = 0 MemToReg = 0 RegDst = 1 ALUControl 25:21 WE3 SrcA Zero WE PC’ PC Instr A1 RD1 A RD ALU 0 Result A RD 1 20:16 Instr. A2 RD2 0 SrcB AluOut Data ReadData Memory 1 Memory A3 Reg. WD3 File WriteData WD 20:16 Rt 0 WriteReg 1 15:11 Rd + 15:0 4 Sign Ext SignImm PCPlus4 B35APO Computer Architectures 9 Single cycle CPU – implementation of beq • beq – branch if equal; imm–offset; PC´ = PC+4 + SignImm*4 I opcode(6), 31:26 rs(5), 25:21 rt(5), 20:16 immediate (16), 15:0 RegWrite = 0 ALUSrc = 0 Branch = 1 MemToReg = x RegDst = x ALUControl 25:21 WE3 SrcA Zero WE 0 PC’ PC Instr A1 RD1 A RD ALU 0 Result 1 A RD 1 20:16 Instr. A2 RD2 0 SrcB AluOut Data ReadData Memory 1 Memory A3 Reg. WD3 File WriteData WD 20:16 Rt 0 WriteReg 1 15:11 Rd + 4 15:0 <<2 Sign Ext SignImm PCBranch + PCPlus4 B35APO Computer Architectures 10 Single cycle CPU – Throughput: IPS = IC / T = IPCstr.fCLK • What is the maximal possible frequency of the CPU? • It is given by latency on the critical path – it is lw instruction in our case: Tc = tPC + tMem + tRFread + tALU + tMem + tMux + tRFsetup 25:21 WE3 SrcA WE 0 PC’ PC Instr A1 RD1 Zero A RD ALU 0 Result 1 A RD 1 20:16 Instr. A2 RD2 0 SrcB AluOut Data ReadData Memory 1 Memory A3 Reg. WD3 File WriteData WD 20:16 Rt 0 WriteReg 1 15:11 Rd + 4 15:0 <<2 Sign Ext SignImm PCBranch + PCPlus4 B35APO Computer Architectures 11 Single cycle CPU – Throughput: IPS = IC / T = IPCstr.fCLK • Tc = Tcinstr + Tcproc = (tPC + tMem) + (tRFread + tALU + tMem + tMux + tRFsetup) • Consider following parameters: tPC = 30 ns tMem = 300 ns tRFread = 50 ns tALU = 200 ns tMux = 20 ns tRFsetup = 20 ns Then Tc = 920 ns --> fCLK max = 1,08 MHz, IPS = 1 080 000 [instructions per second] If Tc instr is executed paralel with Tc proc, then Tc i < Tc p, and Tc p = 50+200+300+20+20 = 590 ns = 1.69 MHz -> IPS = 1 690 000 B35APO Computer Architectures 12 Note • Remember the result, so you can compare it with result for pipelined CPU. B35APO Computer Architectures 13 Key Technology Gaps Prediction Note: The increase of algorithm complexity over time has been formalized in literature with the so-defined Shannon's law. B35APO Computer Architectures 14 Memory and CPU speed – Moore's law more exact numbers CPU 25% 52% 20% performance per year per year per year Processor-Memory Performance Gap Growing Throughput of memory only +7% per year Source: Hennesy, Patterson CaaQA 4th ed. 2006 B35APO Computer Architectures 15 Layout of a Program in Memory lower address Other programs .ent start Text Segment entry point, initial value of PC Static Initialized Data Data Segment Static Uninitialized Data Dynamic Area, heap Heap grows (cz: halda) to higher address Stack grows Stack Segment to lower address higher address (cz:zásobník) B35APO Computer Architectures Memory Addressing Memory address in load and store instructions is specified by a base register and offset This is called base addressing addi a0, zero, 10 // load the word from absolute address lw $2, 0x2000($0) // store the word to absolute address sw $2, 0x2004($0) Data Types in Instructions MIPS registers hold 32-bit (4-byte) words. Other common data sizes include byte and halfword. Byte = 8 bits - example: lb / lbu - load byte extend signed/unsigned Halfword = 2 bytes example: - lh / lhu - load halfword extend signed/unsigned Word = 4 bytes - example: - lw - load word u-unsigned extension 18 Data Directives .WORD Directive Stores the list as 32-bit values aligned on a word boundary .WORD w:n Directive Stores the 32-bit value w into n consecutive words aligned on a word boundary. .HALF Directive Stores the list as 16-bit values aligned on half-word boundary .HALF w:n Directive Stores the 16-bit value w into n consecutive half-words aligned on a half-word boundary. .BYTE Directive Stores the list of values as 8-bit bytes .BYTE w:n Directive Stores the 8-bit value w into n consecutive bytes. B35APO Computer Architectures String Directives .ASCII Directive Allocates a sequence of bytes for an ASCII string .ASCIIZ Directive Same as .ASCII directive, but adds a NULL char at end of string Strings are null-terminated, as in the C programming language .SPACE n Directive Allocates space of n uninitialized bytes in the data segment Special characters in strings follow C convention Newline: \n Tab:\t Quote: \” B35APO Computer Architectures Memory Alignment .align n directive - aligns the next data definition begins on a 2n byte boundary .align 2 - the least significant 2 bits of address should be 00 Memory Memory is addressed as an . array of bytes address aligned word Words occupy 4 consecutive 12 not aligned 8 bytes (MIPS is 32bit processor), 4 0 not aligned B35APO Computer Architectures Align Assembler align data in data segment .DATA .ALIGN 2 var1: .BYTE 3, 5,'A','P','O' var2: .WORD 0x12345678 .ALIGN 3 var3: .HALF 1000 Example on BIG ENDIAN var1 var2 Address 0 1 2 3 4 5 6 7 8 9 A B C D E F 0x2000 3 5 41 50 4F 12 34 56 78 0x2010 10 00 var3 B35APO Computer Architectures Lecture motivation Quick Quiz 1.: Is the result of both code fragments a same? Quick Quiz 2.: Which of the code fragments is processed faster and why? A: B: int matrix[M][N]; int matrix[M][N]; int i, j, sum = 0; int i, j, sum = 0; … … for(i=0; i<M; i++) for(j=0; j<N; j++) for(j=0; j<N; j++) for(i=0; i<M; i++) sum += matrix[i][j]; sum += matrix[i][j]; Is there a rule how to iterate over matrix element efficiently? B35APO Computer Architectures 23 PC Computer Motherboard B35APO Computer Architectures 24 http://www.pcplanetsystems.com/abc/product_details.php?item_id=3263&category_id=208 Computer architecture (desktop x86 PC) example generic B35APO Computer Architectures 25 From UMA to NUMA development (even in PC segment) Non-Uniform Memory Architecture CPU 1 CPU 2 CPU 1 CPU 2 RAM RAM RAM RAM RAM RAM MC MC MC MC MC Northbridge MC Northbridge CPU 1 CPU 2 CPU 1 CPU 2 RAM Southbridge Southbridge Southbridge Southbridge USB USB USB USB SATA PCI-E SATA PCI-E SATA PCI-E SATA PCI-E MC - Memory controller – contains circuitry responsible for SDRAM read and writes.

Computer Architectures

Load Management and Demand Response in Small and Medium Data Centers

A Modern Primer on Processing in Memory

(BEER): Determining DRAM On-Die ECC Functions by Exploiting DRAM Data Retention Characteristics

256M(16Mx16) GDDR SDRAM HY5DU561622CT

An Analysis of Solderability and Manufacturability of Moisture

Mithril: Cooperative Row Hammer Protection on Commodity DRAM Leveraging Managed Refresh

Product Guide SAMSUNG ELECTRONICS RESERVES the RIGHT to CHANGE PRODUCTS, INFORMATION and SPECIFICATIONS WITHOUT NOTICE

Constructing Effective UVM Testbench for DRAM Memory Controllers

Introducing Micron DDR5 SDRAM: More Than a Generational Update

Видове Dram Памети - Sdr Sdram, Ddr, Ddr2, Ddr3, Ddr4, Ddr5 И Rambus

US 2015/0160872 A1 CHEN (43) Pub

R60, R60e, R61, and R61i