Page 1 a Take on Moore’S Law Technology Trends

Outline

• Why Take CS252? CS252 Graduate Computer Architecture • Fundamental Abstractions & Concepts Lecture 1 • Instruction Set Architecture & Organization Introduction • Administrivia • Pipelined Instruction Processing • Performance • The Memory Abstraction January 22, 2002 • Summary Prof. David E Culler Computer Science 252 Spring 2002

CS252/Culler CS252/Culler 1/22/02 Lec 1. 1 1/22/02 Lec 1. 2

Why take CS252? Example Hot Developments ca. 2002 • Manipulating the instruction set abstraction • To design the next great instruction set?...well... – itanium: translate ISA64 -> micro-op sequences – instruction set architecture has largely converged – transmeta: continuous dynamic translation of IA32 – especially in the desktop / server / laptop space – tinsilica: synthesize the ISA from the application – dictated by powerful market forces – reconfigurable HW • Tremendous organizational innovation relative to • Virtualization established ISA abstractions – vmware: emulate full virtual machine – JIT: compile to abstract virtual machine, dynamically compile • Many New instruction sets or equivalent to host – embedded space, controllers, specialized devices, ... • Parallelism • Design, analysis, implementation concepts vital to all – wide issue, dynamic instruction scheduling, EPIC aspects of EE & CS – multithreading (SMT) – systems, PL, theory, circuit design, VLSI, comm. – chip multiprocessors • Equip you with an intellectual toolbox for dealing with • Communication a host of systems design challenges – network processors, network interfaces • Exotic explorations – nanotechnology, quantum computing

CS252/Culler CS252/Culler 1/22/02 Lec 1. 3 1/22/02 Lec 1. 4

Forces on Computer Architecture Amazing Underlying Technology Change

Technology Programming Languages

Applications Computer Architecture

Operating Systems History (A = F / M)

CS252/Culler CS252/Culler 1/22/02 Lec 1. 5 1/22/02 Lec 1. 6

Page 1 A take on Moore’s Law Technology Trends

Bit-level parallelism Instruction-level Thread-level (?) 100,000,000 • Clock Rate: ~30% per year u • Transistor Density: ~35% 10,000,000 u u u u u R10000 • Chip Area: ~15% u u uu u u uuuuu u u u • Transistors per chip: ~55% 1,000,000 uuu uuuu u uu Pentium u • Total Performance Capability: ~100% u uu i80386 u

ransistors i80286 u u u R3000 • by the time you graduate... T 100,000 u u u R2000 – 3x clock rate (3-4 GHz) – 10x transistor count (1 Billion transistors) u i8086

10,000 – 30x raw capability i8080 u u i8008 u u i4004 • plus 16x dram density, 32x disk density 1,000 1970 1975 1980 1985 1990 1995 2000 2005

CS252/Culler CS252/Culler 1/22/02 Lec 1. 7 1/22/02 Lec 1. 8

Measurement and Evaluation Performance Trends Architecture is an iterative process -- searching the space of possible designs Design -- at all levels of computer systems 100 Supercomputers Analysis

10 Mainframes Microprocessors Creativity Minicomputers Performance 1 Cost / Performance Analysis

0.1 1965 1970 1975 1980 1985 1990 1995 Good Ideas Mediocre Ideas Bad Ideas

CS252/Culler CS252/Culler 1/22/02 Lec 1. 9 1/22/02 Lec 1. 10

What is “Computer Architecture”? Coping with CS 252

Application • Students with too varied background? Operating – In past, CS grad students took written prelim exams on System undergraduate material in hardware, software, and theory – 1st 5 weeks reviewed background, helped 252, 262, 270 Compiler Firmware Instruction Set – Prelims were dropped => some unprepared for CS 252? Architecture Instr . Set Proc. I/O system • In class exam on Tues Jan. 29 (30 mins) Datapath & Control – Doesn’t affect grade, only admission into class – 2 grades: Admitted or audit/take CS 152 1st Digital Design – Improve your experience if recapture common background Circuit Design Layout • Review: Chapters 1, CS 152 home page, maybe “Computer Organization and Design (COD)2/e” • Coordination of many levels of abstraction – Chapters 1 to 8 of COD if never took prerequisite • Under a rapidly changing set of forces – If took a class, be sure COD Chapters 2, 6, 7 are familiar – Copies in Bechtel Library on 2-hour reserve • Design, Measurement, and Evaluation • FAST review this week of basic concepts

CS252/Culler CS252/Culler 1/22/02 Lec 1. 11 1/22/02 Lec 1. 12

Page 2 The Instruction Set: a Critical Interface Review of Fundamental Concepts

• Instruction Set Architecture software • Machine Organization • Instruction Execution Cycle instruction set • Pipelining

• Memory hardware • Bus (Peripheral Hierarchy) • Performance Iron Triangle

CS252/Culler CS252/Culler 1/22/02 Lec 1. 13 1/22/02 Lec 1. 14

Instruction Set Architecture Organization

... the attributes of a [computing] system as seen • Capabilities & Performance Logic Designer's View by the programmer, i.e. the conceptual structure Characteristics of Principal and functional behavior, as distinct from the Functional Units ISA Level organization of the data flows and controls the logic – (e.g., Registers, ALU, Shifters, Logic FUs & Interconnect design, and the physical implementation. Units, ...) – Amdahl, Blaaw, and Brooks, 1964 • Ways in which these components SOFTWARE are interconnected -- Organization of Programmable • Information flows between Storage components -- Data Types & Data Structures: Encodings & Representations • Logic and means by which such information flow is controlled. -- Instruction Formats • Choreography of FUs to -- Instruction (or Operation Code) Set realize the ISA -- Modes of Addressing and Accessing Data Items and Instructions • Register Transfer Level (RTL) -- Exceptional Conditions Description

CS252/Culler CS252/Culler 1/22/02 Lec 1. 15 1/22/02 Lec 1. 16

Review: MIPS R3000 (core) Review: Basic ISA Classes r0 0 Programmable storage Data types ? r1 Accumulator: ° 2^32 x bytes Format ? 1 address add A acc ¬ acc + mem[A] ° 31 x 32-bit GPRs (R0=0) Addressing Modes? ° 1+x address addx A acc ¬ acc + mem[A + x] 32 x 32-bit FP regs (paired DP) r31 Stack: PC HI, LO, PC lo 0 address add tos ¬ tos + next hi General Purpose Register: Arithmetic logical 2 address add A B EA(A) ¬ EA(A) + EA(B) Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU, 3 address add A B C EA(A) ¬ EA(B) + EA(C) AddI , AddIU, SLTI, SLTIU, AndI , OrI, XorI, LUI Load/Store: SLL, SRL, SRA, SLLV, SRLV, SRAV 3 address add Ra Rb Rc Ra ¬ Rb + Rc Memory Access load Ra Rb Ra ¬ mem[Rb] LB, LBU, LH, LHU, LW, LWL,LWR store Ra Rb mem[Rb] ¬ Ra SB, SH, SW, SWL, SWR Control 32-bit instructions on word boundary

J, JAL, JR, JALR

BEq, BNE, BLEZ,BGTZ,BLTZ,BGEZ,BLTZAL,BGEZAL CS252/Culler CS252/Culler 1/22/02 Lec 1. 17 1/22/02 Lec 1. 18

Page 3 Instruction Formats MIPS Addressing Modes & Formats Variable: … • Simple addressing modes • All instructions 32 bits wide Fixed: Register (direct) op rs rt rd Hybrid: register Immediate op rs rt immed •Addressing modes Base+index –each operand requires addess specifier => variable format op rs rt immed Memory •code size => variable length instructions register + •performance => fixed length instructions PC-relative –simple decoding, predictable operations op rs rt immed Memory •With load/store instruction arch, only one memory PC + address and few addressing modes • Register Indirect? •=> simple format, address mode given by opcode CS252/Culler CS252/Culler 1/22/02 Lec 1. 19 1/22/02 Lec 1. 20

Cray-1: the original RISC VAX-11: the canonical CISC

15 9 8 6 5 3 2 0 Byte 0 1 n m Op Rd Rs1 R2 OpCode A/M A/M A/M

Load, Store and Branch 15 9 8 6 5 3 2 0 15 0 • Rich set of orthogonal address modes Op Rd Rs1 Immediate – immediate, offset, indexed, autoinc/dec , indirect, indirect+offset – applied to any operand • Simple and complex instructions – synchronization instructions – data structure operations (queues) – polynomial evaluation

CS252/Culler CS252/Culler 1/22/02 Lec 1. 21 1/22/02 Lec 1. 22

Review: Load/Store Architectures MIPS R3000 ISA (Summary) Registers • Instruction Categories ° 3 address GPR MEM reg ° Register to register arithmetic – Load/Store R0 - R31 ° Load and store with simple addressing modes (reg + immediate) – Computational ° Simple conditionals – Jump and Branch compare ops + branch z – Floating Point compare&branch op r r r » coprocessor PC condition code + branch on condition – Memory Management op r r immed HI ° Simple fixed-format encoding – Special op offset LO

3 Instruction Formats: all 32 bits wide ° Substantial increase in instructions OP rs rt rd sa funct ° Decrease in data BW (due to many registers) OP rs rt immediate ° Even more significant decrease in CPI (pipelining) ° Cycle time, Real estate, Design time, Design complexity OP jump target

CS252/Culler CS252/Culler 1/22/02 Lec 1. 23 1/22/02 Lec 1. 24

Page 4 CS 252 Administrivia • TA: Jason Hill, [email protected] Research Paper Reading • All assignments, lectures via WWW page: http://www.cs.berkeley.edu/~culler/252S02/ • As graduate students, you are now researchers. • 2 Quizzes: 3/21 and ~14th week (maybe take home) • Most information of importance to you will be in • Text: research papers. – Pages of 3rd edition of Computer Architecture: A Quantitative Approach » available from Cindy Palwick (MWF) or Jeanette Cook ($30 1-5) • Ability to rapidly scan and understand research – “Readings in Computer Architecture” by Hill et al papers is key to your success. • In class, prereq quiz 1/29 last 30 minutes • So: 1-2 paper / week in this course – Improve 252 experience if recapture common background – Quick 1 paragraph summaries will be due in class – Bring 1 sheet of paper with notes on both sides – Doesn’t affect grade, only admission into class – Important supplement to book. – 2 grades: Admitted or audit/take CS 152 1st – Will discuss papers in class – Review: Chapters 1, CS 152 home page, maybe “Computer Organizat ion and Design (COD)2/e” • Papers “Readings in Computer Architecture” or online – If did take a class, be sure COD Chapters 2, 5, 6, 7 are familiar – Copies in Bechtel Library on 2-hour reserve • Think about methodology and approach

CS252/Culler CS252/Culler 1/22/02 Lec 1. 25 1/22/02 Lec 1. 26

Grading First Assignment (due Tu 2/5) • 10% Homeworks (work in pairs) • 40% Examinations (2 Quizzes) • Read • 40% Research Project (work in pairs) – Amdahl, Blaauw, and Brooks, Architecture of the IBM – Draft of Conference Quality Paper System/360 – Transition from undergrad to grad student – Lonergan and King, B5000 – Berkeley wants you to succeed, but you need to show initiative – pick topic • Four each prepare for in-class debate 1/29 – meet 3 times with faculty/TA to see progress – give oral presentation • rest write analysis of the debate – give poster session – written report like conference paper • Read “Programming the EDSAC”, Cambell-Kelly – 3 weeks work full time for 2 people (over more weeks) – Opportunity to do “research in the small” to help make transition – write subroutine sum(A,n) to sum an array A of n numbers from good student to research colleague • 10% Class Participation – write recursive fact(n) = if n==1 then 1 else n*fact(n-1)

CS252/Culler CS252/Culler 1/22/02 Lec 1. 27 1/22/02 Lec 1. 28

Levels of Representation (61C Review) Course Profile temp = v[k]; High Level Language v[k] = v[k+1]; • 3 weeks: basic concepts Program v[k+1] = temp; – instruction processing, storage • 3 weeks: hot areas Compiler – latency tolerance, low power, embedded design, lw $15,0($2) network processors, NIs, virtualization Assembly Language lw $16,4($2) Program • Proposals due sw $16,0($2) • 2 weeks: advanced microprocessor design Assembler sw $15,4($2) • Quiz & Spring Break 0000 1001 1100 0110 1010 1111 0101 1000 Machine Language 1010 1111 0101 1000 0000 1001 1100 0110 • 3 weeks: Parallelism (MPs, CMPs, Networks) Program 1100 0110 1010 1111 0101 1000 0000 1001 0101 1000 0000 1001 1100 0110 1010 1111 • 2 weeks: Methodology / Analysis / Theory Machine Interpretation • 1 weeks: Topics: nano, quantum

• 1 week: Project Presentations Control Signal ALUOP[0:3] <= InstReg[9:11] & MASK Specification ° ° CS252/Culler CS252/Culler 1/22/02 Lec 1. 29 1/22/02 Lec 1. 30

Page 5 Execution Cycle What’s a Clock Cycle? Obtain instruction from program storage Instruction Fetch

Latch combinational Instruction Determine required actions and instruction size or logic Decode register

Operand Locate and obtain operand data Fetch

Execute Compute result value or status

Result Deposit results in storage for later use • Old days: 10 levels of gates Store • Today: determined by numerous time-of- flight issues + gate delays Next Determine successor instruction – clock propagation, wire lengths, drivers Instruction

CS252/Culler CS252/Culler 1/22/02 Lec 1. 31 1/22/02 Lec 1. 32

Fast, Pipelined Instruction Interpretation Sequential Laundry

Next Instruction 6 PM 7 8 9 10 11 Midnight NI NI NI NI NI Instruction Address IF IF IF IF IF Time D D D D D Instruction Fetch E E E E E 30 40 20 30 40 20 30 40 20 30 40 20 W W W W W T Instruction Register a A s Decode & Time k Operand Fetch B Operand Registers O r C Execute d e Result Registers r D Store Results • Sequential laundry takes 6 hours for 4 loads

Registers or Mem • If they learned pipelining, how long would laundry take? CS252/Culler CS252/Culler 1/22/02 Lec 1. 33 1/22/02 Lec 1. 34

Pipelined Laundry Start work ASAP Pipelining Lessons

6 PM 7 8 9 10 11 Midnight • Pipelining doesn’t help 6 PM 7 8 9 latency of single task, it Time Time helps throughput of entire workload 30 40 40 40 40 20 T a 30 40 40 40 40 20 • Pipeline rate limited by T slowest pipeline stage a A s s k A • Multiple tasks operating k simultaneously B O • Potential speedup = O r B Number pipe stages r d C e • Unbalanced lengths of d pipe stages reduces r C e speedup r D D • Time to “fill” pipeline and time to “drain” it • Pipelined laundry takes 3.5 hours for 4 loads reduces speedup CS252/Culler CS252/Culler 1/22/02 Lec 1. 35 1/22/02 Lec 1. 36

Page 6 Example: MIPS (Note register location) Instruction Pipelining Register- Register

• Execute billions of instructions, so throughput is 31 26 25 21 20 16 15 11 10 6 5 0 what matters Op Rs1 Rs2 Rd Opx – except when? • What is desirable in instruction sets for pipelining? Register-Immediate – Variable length instructions vs. 31 26 25 21 20 16 15 0 all instructions same length? Op Rs1 Rd immediate – Memory operands part of any operation vs. Branch memory operands only in loads or stores? 31 26 25 21 20 16 15 0 – Register operand many places in instruction Op Rs1 Rs2/ Opx immediate format vs. registers located in same place? Jump / Call

31 26 25 0 Op target

CS252/Culler CS252/Culler 1/22/02 Lec 1. 37 1/22/02 Lec 1. 38

5 Steps of MIPS Datapath 5 Steps of MIPS Datapath Figure 3.1, Page 130, CA:AQA 2e Figure 3.4, Page 134 , CA:AQA 2e Instruction Instr. Decode Execute Memory Write Instruction Instr. Decode Execute Memory Write Fetch Reg. Fetch Addr. Calc Access Back Fetch Reg. Fetch Addr. Calc Access Back Next PC MUX MUX Next PC Next SEQ PC Next SEQ PC Adder Adder Next SEQ PC Zero? Zero? 4 RS1 MUX Address Memory MEM/WB

4 RS1 Reg EX/MEM MUX Address ID/EX

Memory Reg IF/ID RS2

RS2 ALU Inst Memory ALU File MUX Memory Data File MUX Data RD L MUX M MUX D Sign Sign Imm Extend Imm Extend

RD RD RD WB Data

WB Data • Data stationary control

CS252/Culler – local decode for each instruction phase / pipeline stage CS252/Culler 1/22/02 Lec 1. 39 1/22/02 Lec 1. 40

Visualizing Pipelining Figure 3.3, Page 133 , CA:AQA 2e Its Not That Easy for Computers Time (clock cycles)

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 • Limits to pipelining: Hazards prevent next I Ifetch Reg DMem Reg instruction from executing during its designated n ALU s clock cycle t – Structural hazards: HW cannot support this combination of r. Ifetch Reg DMem Reg instructions (single person to fold and put clothes away) ALU – Data hazards: Instruction depends on result of prior O instruction still in the pipeline (missing sock)

r Ifetch Reg DMem Reg ALU – Control hazards: Caused by delay between the fetching of d instructions and decisions about changes in control flow e (branches and jumps).

r Ifetch Reg DMem Reg ALU

CS252/Culler CS252/Culler 1/22/02 Lec 1. 41 1/22/02 Lec 1. 42

Page 7 Which is faster?

Plane DC to Speed Passengers Throughput Paris (pmph)

Review of Performance Boeing 747 6.5 hours 610 mph 470 286,700

BAD/Sud 3 hours 1350 mph 132 178,200 Concorde

• Time to run the task (ExTime) – Execution time, response time, latency • Tasks per day, hour, week, sec, ns … (Performance) – Throughput, bandwidth CS252/Culler CS252/Culler 1/22/02 Lec 1. 43 1/22/02 Lec 1. 44

CPI Definitions Computer Performance • Performance is in units of things per sec – bigger is better inst count Cycle time • If we are primarily concerned with response time CPUCPU timetime == SecondsSeconds == InstructionsInstructions xx CyclesCycles xx SecondsSeconds ProgramProgram ProgramProgram Instruction Instruction Cycle Cycle – performance(x) = 1 execution_time(x) Inst Count CPI Clock Rate Program X " X is n times faster than Y" means Compiler X (X)

Performance(X) Execution_time(Y) Inst. Set. X X n = = Performance(Y) Execution_time(Y) Organization X X

Technology X

CS252/Culler CS252/Culler 1/22/02 Lec 1. 45 1/22/02 Lec 1. 46

Cycles Per Instruction (Throughput) Example: Calculating CPI bottom up

“Average Cycles per Instruction” Base Machine (Reg / Reg) CPI = (CPU Time * Clock Rate) / Instruction Count Op Freq Cycles CPI(i) (% Time) = Cycles / Instruction Count ALU 50% 1 .5 (33%) n Load 20% 2 .4 (27%) CPU time = Cycle Time ´ å CPI j ´ I j j=1 Store 10% 2 .2 (13%) Branch 20% 2 .4 (27%) 1.5 Typical Mix of instruction types n Ij CPI = å CPI j ´ Fj where Fj = in program j =1 Instructio n Count

“Instruction Frequency”

CS252/Culler CS252/Culler 1/22/02 Lec 1. 47 1/22/02 Lec 1. 48

Page 8 Example: Branch Stall Impact Speed Up Equation for Pipelining

• Assume CPI = 1.0 ignoring branches (ideal) CPIpipelined = Ideal CPI + Average Stall cycles per Inst • Assume solution was stalling for 3 cycles • If 30% branch, Stall 3 cycles on 30% Ideal CPI ´ Pipeline depth Cycle Timeunpipeline d Speedup = ´ + • Op Freq Cycles CPI(i) (% Time) Ideal CPI Pipeline stall CPI Cycle Timepipelined • Other 70% 1 .7 (37%) • Branch 30% 4 1.2 (63%) For simple RISC pipeline, CPI = 1:

Pipeline depth Cycle Time • => new CPI = 1.9 Speedup = ´ unpipeline d 1 + Pipeline stall CPI Cycle Time • New machine is 1/1.9 = 0.52 times faster (i.e. slow!) pipelined

CS252/Culler CS252/Culler 1/22/02 Lec 1. 49 1/22/02 Lec 1. 50

The Memory Abstraction

• Association of pairs – typically named as byte addresses – often values aligned on multiples of size • Sequence of Reads and Writes Now, Review of Memory Hierarchy • Write binds a value to an address • Read of addr returns most recently written value bound to that address

command (R/W) address (name) data (W)

data (R)

done CS252/Culler CS252/Culler 1/22/02 Lec 1. 51 1/22/02 Lec 1. 52

Recap: Who Cares About the Memory Hierarchy? Levels of the Memory Hierarchy

Capacity Upper Level Access Time Staging Processor-DRAM Memory Gap (latency) Cost Xfer Unit faster CPU Registers 100s Bytes Registers <1s ns µProc Instr. Operands prog./compiler 1000 CPU 1-8 bytes 60%/yr. Cache “Joy’s Law” 10s -100s K Bytes Cache (2X/1.5yr 1-10 ns $10/ MByte cache cntl 100 Processor-) Memory Blocks 8-128 bytes Main Memory Performance Gap: M Bytes 100ns- 300ns Memory (grows 50% / year) $1/ MByte 10 OS DRAM Pages 512- 4K bytes Disk Performance DRAM 9%/yr. 10s G Bytes, 10 ms Disk (10,000,000 ns) $0.0031/ MByte 1 (2X/10 user/operator Files Mbytes yrs) Larger 1980 1982 1983 1984 1986 1987 1990 1991 1992 1993 1994 1997 1998 1999 2000 1981 1985 1988 1989 1995 1996 Tape infinite Tape Lower Level sec-min Time $0.0014/ MByte CS252/Culler CS252/Culler 1/22/02 Lec 1. 53 1/22/02 Lec 1. 54

Page 9 The Principle of Locality Memory Hierarchy: Terminology • Hit: data appears in some block in the upper level (example: Block X) • The Principle of Locality: – Hit Rate: the fraction of memory access found in the upper level – Program access a relatively small portion of the address space at – Hit Time: Time to access the upper level which consists of any instant of time. RAM access time + Time to determine hit/miss • Two Different Types of Locality: • Miss: data needs to be retrieve from a block in the lower – Temporal Locality (Locality in Time): If an item is referenced, it level (Block Y) will tend to be referenced again soon (e.g., loops, reuse) – Miss Rate = 1 - (Hit Rate) – Spatial Locality (Locality in Space): If an item is referenced, – Miss Penalty: Time to replace a block in the upper level + items whose addresses are close by tend to be referenced soon Time to deliver the block the processor (e.g., straightline code, array access) • Hit Time << Miss Penalty (500 instructions on 21264!) • Last 15 years, HW (hardware) relied on locality for speed Lower Level To Processor Upper Level Memory Memory Blk X From Processor Blk Y

CS252/Culler CS252/Culler 1/22/02 Lec 1. 55 1/22/02 Lec 1. 56

Simplest Cache: Direct Mapped Cache Measures Memory Address Memory 0 4 Byte Direct Mapped Cache 1 Cache Index • Hit rate: fraction found in that level 2 0 – So high that usually talk about Miss rate 3 1 – Miss rate fallacy: as MIPS to CPU performance, 4 2 miss rate to average memory access time in memory 5 3 • Average memory-access time 6 7 • Location 0 can be occupied by = Hit time + Miss rate x Miss penalty data from: (ns or clocks) 8 9 – Memory location 0, 4, 8, ... etc. • Miss penalty: time to replace a block from A – In general: any memory location lower level, including time to replace in CPU B whose 2 LSBs of the address are 0s – Address<1:0> => cache index – access time: time to lower level C = f(latency to lower level) D • Which one should we place in – transfer time: time to transfer block E the cache? F =f(BW between upper & lower levels) • How can we tell which one is in CS252/Culler CS252/Culler 1/22/02 Lec 1. 57 1/22/02 the cache? Lec 1. 58

1 KB Direct Mapped Cache, 32B blocks The Cache Design Space

• For a 2 ** N byte cache: • Several interacting dimensions Cache Size – The uppermost (32 - N) bits are always the Cache Tag – cache size – The lowest M bits are the Byte Select (Block Size = 2 ** M) – block size Associativity – associativity 31 9 4 0 Cache Tag Example: 0x50 Cache Index Byte Select – replacement policy Ex: 0x01 Ex: 0x00 – write-through vs write-back Stored as part of the cache “state” Block Size • The optimal choice is a compromise Valid Bit Cache Tag Cache Data – depends on access characteristics Byte 31 : Byte 1 Byte 0 0 » workload Bad 0x50 Byte 63 : Byte 33 Byte 32 1 » use (I-cache, D-cache, TLB) 2 – depends on technology / cost 3 Good Factor A Factor B : • Simplicity often wins : : Less More

Byte 1023 : Byte 992 31

CS252/Culler CS252/Culler 1/22/02 Lec 1. 59 1/22/02 Lec 1. 60

Page 10 Relationship of Caching and Pipelining Computer System Components Proc I-Cache

Next PC MUX Caches Next SEQ PC Next SEQ PC Busses Adder

adapters Zero? 4 RS1 Memory MUX Address Memory MEM/WB Reg EX/MEM ID/EX

IF/ID RS2

ALU Controllers Memory File MUX Data

MUX Disks I/O Devices: Displays Networks Keyboards

Sign Cache Extend -

Imm D

RD RD RD WB Data • All have interfaces & organizations • Bus & Bus Protocol is key to composition • => perhipheral hierarchy

CS252/Culler CS252/Culler 1/22/02 Lec 1. 61 1/22/02 Lec 1. 62

A Modern Memory Hierarchy TLB, Virtual Memory

• By taking advantage of the principle of locality: • Caches, TLBs, Virtual Memory all understood by – Present the user with as much memory as is available in the chea pest technology. examining how they deal with 4 questions: 1) – Provide access at the speed offered by the fastest technology. Where can block be placed? 2) How is block found? • Requires servicing faults on the processor 3) What block is repalced on miss? 4) How are writes handled?

Processor • Page tables map virtual address to physical address • TLBs make virtual memory practical Control Tertiary – Locality in data => locality in addresses of data, Secondary Storage temporal and spatial Storage (Disk/Tape)

Registers On Second Main Cache (Disk) • TLB misses are significant in processor performance

- Level Memory Datapath Chip Cache (DRAM) – funny times, as most systems can’t access all of 2nd level cache (SRAM) without TLB misses! • Today VM allows many processes to share single

Speed (ns): 1s 10s 100s 10,000,000s 10,000,000,000s memory without having to swap all processes to Size (bytes): 100s (10s ms) (10s sec) disk; today VM protection is more important than Ks Ms Gs Ts memory hierarchy CS252/Culler CS252/Culler 1/22/02 Lec 1. 63 1/22/02 Lec 1. 64

Summary

• Modern Computer Architecture is about managing and optimizing across several levels of abstraction wrt dramatically changing technology and application load • Key Abstractions – instruction set architecture – memory – bus • Key concepts – HW/SW boundary – Compile Time / Run Time – Pipelining – Caching • Performance Iron Triangle relates combined effects – Total Time = Inst. Count x CPI + Cycle Time

CS252/Culler 1/22/02 Lec 1. 65

Page 11