Hardware Design I (Chap. 10): Design of Microprocessor
Computing Architecture Lab.
Hajime Shimada
E-mail: [email protected]

Outline
- What is a microprocessor?
- The microprocessor from a sequential-machine viewpoint
  - Microprocessor and von Neumann computer
  - Memory hierarchy
  - Instruction set architecture
- Microarchitecture of the microprocessor
  - Microarchitecture with sequential processing
  - Microarchitecture with pipelined processing

Computing Architecture Lab., Hajime Shimada: Hardware Design I (Chap. 10)

What is a microprocessor?
- An LSI for processing data at the center of the computer
  - Also called a "processor"
- There are several types of processors:
  - Central Processing Unit (CPU)
  - Microcontroller
  - Graphic accelerator
  - Several other accelerators

CPU
- The nucleus of a von Neumann computer
  - Details will be taught in later slides
  - Sometimes the word "microprocessor" denotes the CPU alone
- By combining a CPU with memories, disks, and I/Os, we can create a PC or a server
- Examples: Intel Core i7, Fujitsu SPARC64 VII, AMD Opteron, ...


Microcontroller
- A processor used for control of electric devices
  - Optimized for such uses
    - e.g., gives high current drive ability to output pins to directly drive LEDs
- Many of them can organize a computer with one chip
  - They implement a memory hierarchy on the chip
- Examples: Renesas H8, Atmel AT91, ...
  - Very many companies provide them

Graphic accelerator
- A processor used for processing graphics
  - Also called a Graphics Processing Unit (GPU)
- Implements very many ALUs to utilize parallelism
  - In graphic processing, we can usually process each pixel independently
  - Also utilized for highly parallel arithmetic
- Examples: NVIDIA GeForce, AMD (ATI), ...


Other accelerators
- There are several processors that accelerate data processing which is not suitable for a CPU or GPU
  - But recently, GPUs have intruded into this area
- Usually they implement many ALUs to supply high arithmetic performance
- Examples: ClearSpeed CSX, Ageia PhysX, ...

Outline
- What is a microprocessor?
- The microprocessor from a sequential-machine viewpoint
  - Microprocessor and von Neumann computer
  - Memory hierarchy
  - Instruction set architecture
- Microarchitecture of the microprocessor
  - Microarchitecture with sequential processing
  - Microarchitecture with pipelined processing


The microprocessor from a sequential-circuit viewpoint
- We can abstract the microprocessor as the following sequential machine:
  - Inputs:
    - Programs (= instructions)
    - Data for processing
  - Outputs: processed data
  - State: registers (and register file)
[Figure: a combinational logic circuit plus registers (and a register file), driven by a clock; programs and data for processing enter, processed data leaves]

The von Neumann computer
- The von Neumann computer is the current major organization
- Key points of the von Neumann computer:
  - Instructions and data are placed in main memory
  - Instructions manipulate the data in flip-flops and memories
  - We can apply a different manipulation with a different instruction
[Figure: a von Neumann computer applies operations notated in instructions to data in main memory; a non-von Neumann computer applies a fixed operation to its data]

Advantages and disadvantages of the von Neumann computer
- Advantages:
  - We don't have to modify the hardware for different data processing
    - EDSAC (1949) is one of the early von Neumann computers
    - ENIAC had to change wire connections to change its processing
  - We can execute complicated processing with multiple instructions
- Disadvantages (the von Neumann bottleneck):
  - Communication between processor and memory increases
  - Slow memory drags down processor performance
- How about non-von Neumann computers?
  - They remain in some specific uses (e.g., movie codecs)
  - They are beginning to be repositioned with reconfigurable hardware -> Chap. 9

The von Neumann computer and the microprocessor
- The microprocessor is the hardware which operates on data in a von Neumann computer
  - Also called "processor" or "Central Processing Unit (CPU)"
  - It includes a part of main memory (see memory hierarchy)
- The hardware organization of a microprocessor differs with its purpose
  - For servers, for PCs, for high-performance embedded, for embedded, ...
[Figure: the combinational logic, instruction register, and register file form the microprocessor; instructions and data reside in main memory]


States in the microprocessor
- How do we define the states of the sequential machine in the processor?
  - Usually, we call them registers
- There are many types of registers:
  - Special purpose registers (SPR)
    - Program counter (PC): denotes the position of the instruction being executed
    - Flag register: denotes carry generation, overflow, ...
  - General purpose registers (GPR)
    - Used to hold data before/after processing (they work as a part of main memory)
    - Also used for intermediate data during arithmetic
- The organization of the registers differs between instruction set architectures
(The relationship between the GPR and the memory hierarchy is shown in later slides)

Inputs for the microprocessor
- There are two inputs to the sequential machine in the processor:
  - Instructions: must be defined when we design the sequential machine
  - Data: we don't have to define it
- What's an instruction?
  - A series of bits, e.g., 00000000100001010100000000100000
  - Usually, we use assembly language to represent it
    - A programming language which has a one-to-one relationship to instructions
    - It basically defines operations between registers and main memory
    - e.g., add R8, R4, R5 (GPR #8 = GPR #4 + GPR #5)
      <-> 00000000100001010100000000100000
(How to define instructions efficiently is introduced in later slides)
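The bit string above can be unpacked mechanically. Below is a minimal Python sketch that splits a 32-bit word into the R-type fields defined later in this chapter; the function name is illustrative, and the func code for add (0b100000) follows the standard MIPS encoding.

```python
# Decode a 32-bit R-type word into its fields, using the boundaries
# op[31:26] rs[25:21] rt[20:16] rd[15:11] shift[10:6] func[5:0].

def decode_r_type(word: int) -> dict:
    return {
        "op":    (word >> 26) & 0x3F,   # 6-bit operation type
        "rs":    (word >> 21) & 0x1F,   # 5-bit source register 1
        "rt":    (word >> 16) & 0x1F,   # 5-bit source register 2
        "rd":    (word >> 11) & 0x1F,   # 5-bit destination register
        "shift": (word >> 6)  & 0x1F,   # 5-bit shift amount
        "func":  word & 0x3F,           # 6-bit arithmetic type
    }

# "add R8, R4, R5" with the standard MIPS func code for add (0b100000)
word = int("00000000100001010100000000100000", 2)
print(decode_r_type(word))  # rs=4, rt=5, rd=8
```

Running the decoder confirms that the bit string really names GPR #4 and #5 as sources and GPR #8 as the destination.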

GPR and memory hierarchy (1/2)
- In recent processors, the GPR has become a part of main memory:
  - First, the processor moves data from main memory to a register
  - The processor applies operations to the data in the register
  - After the operations, it writes the data back to main memory
- This organization effectively reduces the workload on main memory
  - Assuming that we apply multiple operations to the data
[Figure: data transmission between main memory, the register, and the processor internals]

GPR and memory hierarchy (2/2)
- We call this layered organization the "memory hierarchy"
  - Including the disk
  - It also reduces the performance degradation caused by slow devices
- Recently, the number of hierarchy levels is increasing because the speed difference between devices is increasing
[Figure: traditional hierarchy: register, main memory, disk. Recent hierarchy: register (FF or SRAM), L1 cache (SRAM), L2 cache (SRAM), main memory (DRAM), disk (HDD or SSD)]
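The load-operate-store pattern described above can be sketched in a few lines. This is only an illustration of the access pattern; the addresses, register numbers, and values are invented for the example.

```python
# Sketch of the load-operate-store pattern: data moves from main
# memory into registers, is operated on there, then written back.

memory = {0x100: 7, 0x104: 5, 0x108: 0}   # illustrative main memory
regs = [0] * 32                           # 32 GPRs, as in MIPS

# 1. Load operands from main memory into registers
regs[4] = memory[0x100]
regs[5] = memory[0x104]

# 2. Apply multiple operations entirely inside the register file;
#    main memory is not touched during this step
regs[8] = regs[4] + regs[5]
regs[8] = regs[8] * 2

# 3. Write the final result back to main memory
memory[0x108] = regs[8]

print(memory[0x108])  # 24
```

Because step 2 can contain many operations, main memory sees only one load per operand and one final store, which is exactly the workload reduction the slide claims.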

Instruction set architecture (ISA)
- To create the sequential machine, we have to define the format of the inputs and the internal state
  - Internal state: denoted by registers
  - Inputs: instructions
- We usually call this definition the Instruction Set Architecture (ISA)
  - Including a systematic instruction construction method
- By defining the ISA carefully, you can reduce:
  - States (registers)
  - Combinational logic

Instruction encoding
- An instruction is encoded into a chunk of binary under the ISA definition
  - e.g., add R8, R4, R5 (GPR #8 = GPR #4 + GPR #5)
    <-> 00000000100001010100000000100000
- In the usual encoding, we give meaning to chunks of bits:

  R type: op [31:26] | rs [25:21] | rt [20:16] | rd [15:11] | shift [10:6] | func [5:0]
  I type: op [31:26] | rs [25:21] | rt [20:16] | addr [15:0]


Example of instruction encoding (1/3)
- Example: the instruction encoding of MIPS
- The total length is 32 bits
- Meaning is given to several chunks of bits:
  - op: type of operation (arithmetic, load, store, branch, ...)
  - rs: source operand 1 for arithmetic
  - rt: source operand 2 for arithmetic
  - rd: destination operand for arithmetic (stores the arithmetic result)
  - shift: amount of shift
  - func: type of arithmetic (supplemental to op)
  - addr: immediate value for arithmetic

Example of instruction encoding (2/3)
- add R8, R4, R5
  000000 00100 00101 01000 (no use) 100000
  Operation: R8 = R4 + R5 (add: addition)
- sub R8, R4, R5
  000000 00100 00101 01000 (no use) 100010
  Operation: R8 = R4 - R5 (sub: subtract)
- Only the func field differs; this difference indicates a different arithmetic operation
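The two encodings above can be reproduced mechanically. The sketch below is a minimal R-type encoder; the function name is illustrative, and it assumes the standard MIPS func codes (add = 0b100000, sub = 0b100010).

```python
# Pack R-type fields into a 32-bit binary string:
# op[31:26] rs[25:21] rt[20:16] rd[15:11] shift[10:6] func[5:0].

def encode_r_type(op: int, rs: int, rt: int, rd: int,
                  shift: int, func: int) -> str:
    word = (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shift << 6) | func
    return format(word, "032b")

# add R8, R4, R5  ->  rd=8, rs=4, rt=5
print(encode_r_type(0, 4, 5, 8, 0, 0b100000))
# 00000000100001010100000000100000

# sub R8, R4, R5  ->  same fields; only func differs
print(encode_r_type(0, 4, 5, 8, 0, 0b100010))
# 00000000100001010100000000100010
```

The two outputs differ only in the last six bits, which is exactly the point the slide makes: op selects the R-type arithmetic class, and func selects the specific operation.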

Example of instruction encoding (3/3)
- lw R8, 8(R4)
  100011 00100 01000 0000000000001000
  Operation: load the value at position (R4 + 8) in main memory into R8 (lw: load word)
- bne R4, R5, -5
  000101 00100 00101 1111111111111011
  Operation: if R4 != R5, branch back to the instruction 5 positions earlier (bne: branch if not equal)

Short exercise
- Let's translate the following assembly into an instruction notated in binary
  - Refer to the R-type instruction notation in the slides

  add R10, R13, R14
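The I-type encodings above, including the two's-complement offset of bne, can be checked the same way. A minimal sketch, assuming the standard MIPS opcodes (lw = 0b100011, bne = 0b000101):

```python
# Pack I-type fields into a 32-bit binary string:
# op[31:26] rs[25:21] rt[20:16] addr[15:0].
# Negative immediates are stored as 16-bit two's complement.

def encode_i_type(op: int, rs: int, rt: int, imm: int) -> str:
    imm16 = imm & 0xFFFF          # 16-bit two's-complement wrap
    word = (op << 26) | (rs << 21) | (rt << 16) | imm16
    return format(word, "032b")

# lw R8, 8(R4): for loads, rt names the destination register
print(encode_i_type(0b100011, 4, 8, 8))
# 10001100100010000000000000001000

# bne R4, R5, -5: the offset field becomes 1111111111111011
print(encode_i_type(0b000101, 4, 5, -5))
# 00010100100001011111111111111011
```

Masking with `0xFFFF` is what turns -5 into the 1111111111111011 pattern shown on the slide.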


Answer
- Let's translate the following assembly into an instruction notated in binary
  - Refer to the R-type instruction notation in the slides

  add R10, R13, R14
  000000 01101 01110 01010 00000 100000
  (the shift field is not used)

Outline
- What is a microprocessor?
- The microprocessor from a sequential-machine viewpoint
  - Microprocessor and von Neumann computer
  - Memory hierarchy
  - Instruction set architecture
- Microarchitecture of the microprocessor
  - Microarchitecture with sequential processing
  - Microarchitecture with pipelined processing


What's microarchitecture?
- An implementation of a processor in hardware
- We can choose among several possible microarchitectures for the same ISA
  - e.g., Intel Core i7, AMD Phenom, ...
  - They can execute the same programs (e.g., Windows) because the ISA is the same
- Usually, we choose the microarchitecture to match the purpose of the computer
  - e.g., choose a low-power-consumption microarchitecture for a notebook PC

One organization of the microprocessor (1/3)
- Combinational logic:
  - ALU: executes add, sub, logical arithmetic, shift, ...
  - Decoder: constructs the data path from instructions and values in registers
  - "+1" after the PC: increments the PC to indicate the next instruction
  - Adder beside the ALU: calculates the branch target for branch instructions
[Figure: datapath with PC, IR, register file (RF), "+1" incrementer, ALU, branch-target adder, and main memory connected by the address bus and data bus]


One organization of the microprocessor (2/3)
- Buses for main memory:
  - Address bus: sends the address value which indicates the read/write position in main memory
  - Data bus:
    - Sends data values read from main memory
    - Sends data values to be written into main memory

One organization of the microprocessor (3/3)
- Registers:
  - Register file (RF): a chunk of GPRs
    - The number differs between architectures (e.g., 32 in MIPS)
  - Program counter (PC)
  - Instruction register (IR): holds the instruction that comes from main memory
[Figure: the same datapath, highlighting the address bus, data bus, PC, IR, and RF]


How can we understand the operation of this chunk of hardware?
- It seems hard to understand the operation of such a large piece of hardware -> True
- How can we understand it easily?
  -> Decompose the hardware into 5 parts and understand each part's operation

Decomposition into 5 parts
1. Instruction fetch (IF)
2. Instruction decode (ID), register read
3. Execution (EX)
4. Memory access (MA)
5. Write back to register (WB), and commit
- This 5-part decomposition is important for processor operation:
  - Operate the 5 parts sequentially: 5-phase operation
    - One instruction finishes in 5 clock pulses
  - Operate the 5 parts simultaneously: 5-stage pipelined processor
    - One instruction finishes per clock pulse (in the general case)
[Figure: the datapath annotated with parts 1-5; the PC and RF write ports belong to part 5]
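The 5-phase sequential operation above can be sketched as a tiny interpreter. Everything here is illustrative (the function name, the trace labels, the chosen registers); it only shows one instruction walking through the five parts in order.

```python
# Minimal sketch of 5-phase operation: one add instruction passes
# through IF, ID, EX, MA, WB, taking five clock pulses in total.

regs = [0] * 32
regs[4], regs[5] = 3, 4

def run_add(rd: int, rs: int, rt: int) -> list:
    trace = []
    trace.append("IF")            # 1. fetch the instruction via the PC
    a, b = regs[rs], regs[rt]     # 2. ID: read rs/rt from the RF
    trace.append("ID")
    result = a + b                # 3. EX: ALU arithmetic
    trace.append("EX")
    trace.append("MA")            # 4. no memory access for add
    regs[rd] = result             # 5. WB: write back, increment PC
    trace.append("WB")
    return trace

print(run_add(8, 4, 5), regs[8])  # ['IF', 'ID', 'EX', 'MA', 'WB'] 7
```

In this sequential scheme the next instruction's IF can only begin after this WB, which is exactly the inefficiency pipelining removes later in the chapter.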

Operation of IF (1/2)
- Manages PC updating:
  - If the instruction is not a branch (or is a not-taken branch): update with the incremented PC value
  - If the instruction is a (taken conditional) branch: update with the branch target address
- Reads the instruction indicated by the PC value

Operation of IF (2/2)
- Reads the instruction indicated by the PC value:
  - Send the content of the PC to the address bus
  - Capture the instruction that comes from the data bus (into the IR)
[Figure: the instruction "add R4, R5, R6" is read from main memory into the IR]


Operation of ID
- Read registers selected by the rs and rt bits of the instruction
  - Usually the RF consists of RAM, so rs and rt become addresses for that RAM
- Decode the instruction
  - Generate several control signals from the instruction
    - e.g., the select signal for the multiplexer before the PC

Operation of EX
- Arithmetic
  - Address generation for memory accesses is also performed here
  - The detailed arithmetic is indicated by the "func" part of the instruction
- Calculate the branch target address
  - Add PC+1 and the immediate value that comes from the instruction


Operation of MA
- Memory access:
  - Send the generated address to the address bus
  - If the memory-access instruction is a load, capture the data on the data bus
  - If the memory-access instruction is a store, send the store data onto the data bus
- Other instructions: no operation

Operation of WB
- Write back the operation result to the RF
  - For instructions which create a result
- If a branch is taken, write the branch target address to the PC
  - Largely coordinated with the IF part -> see the IF slides


Example execution of "add R8, R4, R5"
1. Send the PC to main memory and read the instruction into the IR
2. Read registers by a part of the instruction, decode the instruction, and generate signals
3. Apply the arithmetic denoted by a part of the instruction
4. Do nothing
5. Write back the arithmetic result to the RF and increment the PC

Example execution of "lw R8, 8(R4)"
1. Send the PC to main memory and read the instruction into the IR
2. Read registers by a part of the instruction, decode the instruction, generate signals, and apply sign extension to the immediate value
3. Create the memory address by adding the register and immediate values
4. Send the address to memory and read the data
5. Write back the read data to the RF and increment the PC


Example execution of "bne R4, R5, -5"
1. Send the PC to main memory and read the instruction into the IR
2. Read registers by a part of the instruction, decode the instruction, generate signals, and apply sign extension to the immediate value
3. Check the branch condition with the ALU result and generate the target address with the adder
4. Do nothing
5. If the branch is taken, write the target address back to the PC; otherwise increment the PC

Pipelined processing on the processor
- The prior examples are sequential executions of the 5 parts
- Is there any way to make the parts work simultaneously?
  - Process consecutive instructions, one in each part
  - This is called "pipelined processing" -> Prof. Yao's lecture on Jan. 26
  - Imagine creating products on a belt conveyor in a factory
[Figure: simultaneously fetching "bne R2, R6, -5", decoding "lw R7, 8(R3)", and executing "add R8, R4, R5"]


Additional hardware for pipelined processing
- We have to prepare additional FFs to keep the different instructions apart
  - The FFs between the parts are called "pipeline registers"
  - They keep not only the instructions but also related information (read register values, arithmetic results, ...)
- Each part is called a "pipeline stage", or simply a stage
[Figure: pipeline registers between the stages, holding "add R8, R4, R5" and "lw R7, 8(R3)" together with their related information]

Outlined notation of the pipeline
- We usually represent parallel execution by putting the operation names (IF, ID, ...) in boxes
  - The horizontal axis denotes time (in clock cycles)
  - The vertical axis denotes instruction order (program order)

Sequential:
  add R8, R4, R5:  IF ID EX MEM WB
  lw R7, 8(R3):                    IF ID EX MEM ...

Pipelined:
  add R8, R4, R5:  IF ID EX MEM WB
  lw R7, 8(R3):       IF ID EX MEM WB
  bne R2, R6, -5:        IF ID EX MEM WB
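The outlined notation above can be generated programmatically: each instruction simply starts one cycle after the previous one. The sketch below is illustrative (the function name and column widths are invented) and assumes an ideal pipeline with no stalls.

```python
# Print the outlined pipeline notation: instruction i occupies
# stages IF..WB starting at cycle i.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(instructions: list) -> str:
    rows = []
    for i, insn in enumerate(instructions):
        # shift each row right by one stage slot per instruction
        row = ["    "] * i + [f"{s:4}" for s in STAGES]
        rows.append(f"{insn:16}" + "".join(row))
    return "\n".join(rows)

print(pipeline_diagram(["add R8, R4, R5", "lw R7, 8(R3)", "bne R2, R6, -5"]))
# With an ideal pipeline, n instructions finish in (n - 1) + 5 cycles.
```

For the three instructions above that is 7 cycles in total, against 15 cycles for the sequential 5-phase scheme.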

Pipelined processing from a sequential-circuit viewpoint
- It becomes an unusual sequential circuit
  - The updated state is written into the next stage's FFs
[Figure: each pipeline register holds the current state of one stage and supplies the next state of the following stage]

Pipeline hazard caused by a data hazard
- Let's consider the following instructions:
  add R8, R4, R5:  IF ID EX MEM WB
  sub R6, R8, R7:     IF ID ID ID ID EX MEM WB
- There is a data dependency: sub reads R8 in ID, but R8 has not been written at that point!
  - The pipeline continuously tries to read R8 and stops at the ID stage; this is called a "pipeline stall"
  - This imposes an additional constraint on the state, caused by the relationship between instructions
  - "Pipeline hazard": pipeline processing stops for one of several reasons
  - This particular hazard, caused by a data dependency, is called a "data hazard"
  - Data dependency: a relationship in which a later instruction uses the result of a prior instruction
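The stall length in the diagram above can be counted mechanically. A minimal sketch, assuming (as the diagram does) a 5-stage pipeline with no forwarding in which a register becomes readable only in the cycle after the producer's WB; the function name is illustrative.

```python
# Count stall cycles for a read-after-write hazard: the consumer's ID
# must repeat until the cycle after the producer's WB has written the RF.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stall_cycles(producer_start: int, consumer_start: int) -> int:
    wb_cycle = producer_start + STAGES.index("WB")  # cycle add writes R8
    id_cycle = consumer_start + STAGES.index("ID")  # cycle sub first wants R8
    # the register is readable only in the cycle after WB
    return max(0, wb_cycle - id_cycle + 1)

# add starts at cycle 0, sub issues back-to-back at cycle 1
print(stall_cycles(0, 1))  # 3
```

Three extra ID repetitions plus the original one give the four ID boxes shown in the diagram; an instruction issued far enough behind the producer (`stall_cycles(0, 5)` and beyond) needs no stall at all.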

Data hazard avoidance with result forwarding
- A pipeline stall is realized as an additional constraint on the sequential machine
- Is there any way to avoid it?
  -> Result forwarding: pass the value to the later instruction without going through the RF
  - An additional data path is required
  add R8, R4, R5:  IF ID EX MEM WB
                         | R8
  sub R6, R8, R7:     IF ID EX MEM WB
[Figure: an additional data path for result forwarding, routing the ALU result back to the ALU input]

Pipeline and length of logic (1/2)
- The length of the (combinational) logic between stages looks the same in the outlined figure
- But in practice, it differs
[Figure: the IF/ID/EX/MEM/WB boxes drawn with equal widths, although the underlying logic depths differ]
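The effect of forwarding can be expressed in the same cycle-counting style as before. This sketch (function name illustrative) models an ALU result forwarded out of EX, as in the add/sub example above: the value exists at the end of the producer's EX and is needed at the start of the consumer's EX.

```python
# With result forwarding, an ALU result is available at the end of the
# producer's EX stage, not only after WB, so back-to-back ALU
# instructions need no stall.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stall_with_forwarding(producer_start: int, consumer_start: int) -> int:
    ready = producer_start + STAGES.index("EX")    # value leaves the ALU here
    needed = consumer_start + STAGES.index("EX")   # value enters the ALU here
    return max(0, ready - needed + 1)

# add starts at cycle 0, sub issues back-to-back at cycle 1
print(stall_with_forwarding(0, 1))  # 0: the data hazard disappears
```

Compared with the three stall cycles of the RF-only design, the extra forwarding data path removes the stall entirely for this instruction pair.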

Pipeline and depth of logic (2/2)
- How do we clock those logics?
  - We have to operate all of them at the speed of the slowest one (the one with the longest logic), because they work with the same clock pulse
- We have to balance the logic depths as far as possible
- If we consider the sequential (non-pipelined) organization, there's no such problem
[Figure: stage boxes of unequal logic depth; the clock period must cover the deepest stage]
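The "slowest stage sets the clock" rule above can be made concrete with a small calculation. The stage delays below are invented, illustrative numbers, not measurements from any real design.

```python
# The clock period of a pipeline is set by the slowest stage, since all
# stages share one clock pulse. Delays are illustrative.

stage_delay_ns = {"IF": 2.0, "ID": 1.0, "EX": 2.5, "MEM": 2.0, "WB": 0.5}

clock_period = max(stage_delay_ns.values())      # limited by EX here
sequential_time = sum(stage_delay_ns.values())   # one non-pipelined pass

print(clock_period)     # 2.5
print(sequential_time)  # 8.0
# Unbalanced stages waste time: 5 stage slots of 2.5 ns each = 12.5 ns
# per instruction, against only 8.0 ns of actual logic, which is why
# balancing the stage depths matters.
```

In the sequential organization the whole 8.0 ns path fits in one long clock period, so unbalanced sub-parts cost nothing there, just as the slide says.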
