COMP 212 Computer Organization & Architecture - Fall 2008
Lecture 3: Cache & Memory System

Re-Cap of Lecture #2

• The textbook is required for the class
  – You will need it for homework, projects, reviews, etc.
  – To get it at a good price:
    » Check with senior students for a used book
    » Check with the university bookstore
    » Try this website: addall.com
    » Anyone got the book? Care to share your experience?

Comp 212 Computer Org & Arch, Z. Li, 2008

Components & Connections

• CPU: processing
• Mem: store data
• I/O & Network: exchange data with an outside world
• Connection: the bus, a broadcasting medium

Instruction

• The instruction word has 2 parts
  – Opcode: e.g., 4 bits gives a total of 2^4 = 16 different instructions
  – Operand: the address or immediate number the instruction can operate on
• In a von Neumann computer, instructions and data share the same memory space
  – Addressable space: 2^W for address width W
    » e.g., an 8-bit address gives 2^8 = 256 addressable locations
    » a 16-bit address gives 2^16 = 65536 addressable locations (room numbers)

Register & Memory Operations During Instruction Cycle

• The Instruction Cycle has 3 phases
  – Instruction fetch:
    » Pull the instruction from memory into IR, according to PC
    » The CPU can't operate on memory data directly!
  – Instruction execution:
    » Operate on the operand: load or save data to memory, move data among registers, or perform ALU operations
  – Interrupt handling: to achieve parallel operation with slower I/O devices
    » Sequential
    » Nested
• Pay attention to the following registers' change over cycles:
  – PC, IR
  – AC
  – Mem at [940], [941]


Homework

• Compute the summation of an array in memory:
  – The key is to use a counter to control how many times we add
  – Use the JMP command to control the flow of the program
  – Give the program
  – Walk thru the cycles, and give the register status at selected states
  – See the website for more detail

Lecture #3: Cache & Memory System

• Summary:
  – Memory is hierarchical:
    » CPU >> Registers >> Cache >> Memory >> Hard Disk >> Tape/Optical Disk
    » Access speed: decreasing dramatically as we go down the hierarchy
    » Cost: decreasing also
  – How to design the cache/memory system and access algorithm, such that the average access time is the best?
    » Cache design
    » Cache performance

Memory Hierarchy - Diagram

The computer memory hierarchy


Memory types

• Registers
  – Directly operable by the CPU
• L1/L2 Cache
  – Directly accessible by the CPU
• Main memory
  – Needs an address, and then a load from memory
• Disk cache
  – Memory on the disk
• Disk
• Optical, e.g. DVD, CD, Blu-ray
• Tape

Physical Types

• Semiconductor
  – RAM
• Magnetic
  – Disk & Tape
• Optical
  – CD & DVD
• Others
  – Bubble
  – Hologram

Physical Characteristics

• Volatility
  – Volatile: when power is off, the info is gone, e.g. RAM
  – Non-volatile: SSD, Hard Disk
• Erasability
  – ROM: read only memory
  – CD: write once, read multiple times
  – RAM/Disk: read/write many times
• Power consumption

The Design Goals and Constraints

• How much?
  – Capacity hierarchy: register < cache < memory < disk < tape/optical storage
• How fast?
  – Register/Cache: CPU speed of access, in the GHz range
  – Memory: limited by bus speed and data bus width, in the MHz range
• How expensive?
  – Register > cache > memory > solid state disk > disk > tape/optical


Key Characteristics of Computer Memory System

Cache system design and performance

Location – where it resides

• CPU:
  – Registers, L1 and L2 cache
• Internal
  – Main Memory
  – SSD (Solid State Disk), e.g. on the EEE PC
• External
  – Secondary storage: hard disk, optical disk, tapes

Capacity

• Word size
  – For external memory, typically bytes (8 bits), e.g., a 120GB hard disk, a 20GB SSD
  – For internal memory, a word can be 8, 16, or 32 bits
• Number of words
  – Determined by the address size for internal memory
    » E.g. a 32-bit mem address gives us 2^32 words, or a 4G-word addressable space
    » Installed memory can be less than that, e.g. 1G mem


Unit of Transfer

• Internal
  – Usually governed by the data bus width and bus clock rate
• External
  – Usually a block, which is much larger than a word
• Addressable unit
  – The smallest location which can be uniquely addressed: 2^A units, where A is the address width
  – Word internally
  – Cluster on M$ disks

Access Methods (1)

• Sequential
  – Start at the beginning and read through in order
  – Access time depends on the location of the data and the previous location
  – e.g. tape
• Direct
  – Individual blocks have unique addresses
  – Access is by jumping to the vicinity plus a sequential search
  – Access time depends on location and previous location
  – e.g. disk

Access Methods (2)

• Random
  – Individual addresses identify locations exactly
  – Access time is independent of location or previous access
  – e.g. internal memory, RAM
• Associative
  – Data is located by a comparison with the contents of a portion of the store: e.g. get the word with MSB 1110,xxxx,xxxx,xxxx
  – Access time is independent of location or previous access
  – e.g. cache

Performance

• Access time (latency)
  – Time between presenting the address and getting the valid data
• Memory cycle time – for RAM
  – Time may be required for the memory to "recover" before the next access, related to bus operations
  – Cycle time is access + recovery
• Transfer rate
  – The rate at which data can be moved
  – 1 unit per cycle time for RAM: e.g. a 32-bit data bus at a 500MHz clock gives us 2G Bytes/sec transfer rate
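The transfer-rate arithmetic in the last bullet can be checked in a couple of lines of Python (the bus width and clock rate are the slide's example numbers):

```python
# Peak transfer rate for a 32-bit data bus clocked at 500 MHz,
# moving one bus-width unit per cycle (as on the slide).
bus_width_bytes = 32 // 8       # 4 bytes per transfer
clock_hz = 500_000_000          # 500 MHz
rate = bus_width_bytes * clock_hz
print(rate)                     # 2000000000 bytes/sec = 2 GB/s
```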


Design goals of computer memory

• It is about tradeoffs between
  – More space
  – Faster access
  – Cost
• Use a cache system to balance out
  – Between CPU and Main Memory
  – Between Main Memory and Disk

Average memory access time with cache

• If it takes
  – t1 to access the cache
  – t2 to access memory, with t2 >> t1, say 20 times
  – h: prob of a data access hitting in the cache, or hit ratio
• The average access time:
  – h*t1 + (1-h)*(t1+t2)
  – h*t1: time to directly access the cache
  – (1-h)*(t1+t2): on a miss, time to load the data into the cache, and access it
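Sketched in Python with sample numbers (the values of t1 and t2 are hypothetical, with t2 = 20*t1 as the slide suggests):

```python
# Average memory access time with a cache (sketch; t1, t2, h are example values).
def avg_access_time(t1, t2, h):
    # h*t1: hit, access the cache directly.
    # (1-h)*(t1+t2): miss, load the block from memory into cache, then access it.
    return h * t1 + (1 - h) * (t1 + t2)

t1, t2 = 1.0, 20.0                           # cache 1 ns, memory 20x slower (assumed)
print(round(avg_access_time(t1, t2, 0.95), 3))   # 2.0 ns
print(round(avg_access_time(t1, t2, 0.50), 3))   # 11.0 ns
```

Even a 95% hit ratio already brings the average close to the cache speed, which is why locality matters so much.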

Average access time as a function of hit ratio

• How to improve the hit ratio?
• The good news is that data access has strong locality
  – E.g. loop operations repeatedly access a small set of data
• What is the right cache size?
• How to design the cache replacement algorithm?
  – i.e. what to keep in the cache, based on past access patterns?


Cache

• Small amount of fast memory
• Sits between normal main memory and the CPU
• May be located on the CPU chip or module

Cache Design

Typical Cache to Memory Diagram

Cache/Main Memory Structure

• Memory has 2^n addressable words
• Memory is accessed in blocks
  – K words per block
  – Total 2^n/K blocks
  – E.g.
    » n=24: we have 2^24 = 16M address space
    » If K=4, we have 2^24 / 2^2 = 2^22 memory blocks
    » So the block address is the first 22 bits of the word address
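A quick Python check of the block arithmetic in this example (the sample word address 16339Ch is borrowed from the lecture's direct-mapping example):

```python
# Verify the block counts for n = 24 address bits and K = 4 words per block.
n, K = 24, 4
assert 2 ** n == 16 * 2 ** 20       # 16M addressable words
assert (2 ** n) // K == 2 ** 22     # number of memory blocks

# Dropping the low log2(K) = 2 bits of a word address gives its block address.
word_addr = 0x16339C
print(hex(word_addr >> 2))          # 0x58ce7
```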


Cache/Main Memory Structure

• The cache contains C lines
• Each line holds one block, or K words
• Each line has a tag to indicate which block in memory is in the cache
• To uniquely identify a cache line, we need h bits, if C = 2^h

Cache operation – overview

• The CPU requests the contents of a memory location
• Check the cache for this data
• If present, get it from the cache (fast)
• If not present, read the required block from main memory into a cache line
• Then deliver from cache to CPU
• The cache includes tags to identify which block of main memory is in each cache slot

Cache Read Operation - Flowchart

Cache Design Issues

• Size of Cache
• Block Size
• Levels of Cache
  – 1, 2 or 3 levels?
• Mapping Function
  – Direct
  – Associative
  – Set Associative


Cache Design Issues (2)

• Cache Replacement Algorithm: what to keep in cache?
  – Least Recently Used (LRU)
  – First In First Out (FIFO)
  – Least Frequently Used (LFU)
  – Random
• Write Policy
  – Write thru
  – Write back
  – Write once

Cache Size

• Cost
  – Cache is expensive compared with memory, in dollars per bit
• Speed
  – A cache that is too big is not good for fast access
    » More gates and logic are needed for addressing
    » Checking the cache for data takes time

Comparison of Cache Sizes

Mapping Functions: Direct Mapping

• Simply computed as:
  – i = j mod m, where
  – i is the line number in the cache, j is the memory block number, and m is the number of lines in the cache
• Each block of main memory maps to only one cache line
  – i.e. if a block is in cache, it must be in one specific place
• The address is in two parts: total s+w bits
  – The least significant w bits identify a unique word
  – The most significant s bits specify one memory block
  – The MSBs are split into a cache line field r and a tag of s-r bits (most significant)


Direct Mapping Example

• Address format: Tag (s-r) = 8 bits | Line or Slot (r) = 14 bits | Word (w) = 2 bits
• Cache of 64 kBytes, cache block of 4 bytes
  – i.e. the cache is 16k (2^14) lines of 4 bytes, m = 2^14 = 16k
• 24-bit memory address: s+w = 24, 16M bytes of memory (2^24 = 16M)
  – Block size 2^w = 4 bytes
  – 2^(s+w) / 2^w = 2^s blocks in memory
• E.g. (next page)
  – Addr = 16339Ch = 0001 0110 0011 0011 1001 1100
  – Tag = left 8 bits = 0001 0110 = 16h
  – Line Addr = 14 bits in the middle = 00 1100 1110 0111 = 0CE7h
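The address split above can be sketched in a few lines of Python, using the field widths 8/14/2 and the address 16339Ch from the example:

```python
# Direct-mapped address split: tag = 8 bits, line = 14 bits, word = 2 bits
# (s + w = 24 total address bits).
def split_direct(addr, w=2, r=14):
    word = addr & ((1 << w) - 1)          # low w bits: word within the block
    line = (addr >> w) & ((1 << r) - 1)   # next r bits: cache line number
    tag = addr >> (w + r)                 # remaining s - r bits: tag
    return tag, line, word

tag, line, word = split_direct(0x16339C)
print(hex(tag), hex(line), word)   # 0x16 0xce7 0
```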

Direct Mapping Cache Organization


Mem block mapping to cache lines

Cache line | Main Memory blocks held
0          | 0, m, 2m, 3m, ..., 2^s - m
1          | 1, m+1, 2m+1, ..., 2^s - m + 1
...        | ...
m-1        | m-1, 2m-1, 3m-1, ..., 2^s - 1

• No two mem blocks that are mapped to the same cache line have the same Tag!

Direct Mapping Summary

• Address length = (s + w) bits, given, e.g. 24
• Number of addressable units = 2^(s+w) words or bytes
• Block size = line size = 2^w words or bytes
• Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
• Number of lines in cache = m = 2^r, so the cache size determines r
• Size of tag = (s - r) bits

Direct Mapping pros & cons

• Simple
• Inexpensive
• Fixed location for a given block
  – Thrashing: if a program repeatedly accesses 2 blocks that map to the same line, cache misses are very high

Associative Mapping

• A main memory block can load into any line of the cache
  – Address space: s+w bits
  – Block size: 2^w bytes
  – Total memory blocks: 2^s
• The memory address is interpreted as tag and word
  – The first s bits are the tag: each mem block has a unique tag
  – Line size: 2^w bytes
• Every line's tag is examined for a match
• Cache searching gets expensive


Fully Associative Cache Organization

• Address format: Tag = 22 bits | Word = 2 bits
• A 22-bit tag is stored with each 32-bit block of data
• Compare the tag field with the tag entries in the cache to check for a hit (an expensive operation in circuit complexity)
• The least significant 2 bits of the address identify which 8-bit word is required from the 32-bit data block
• e.g.
  – Mem Address = 16339C, Tag (first 22 bits) = 058CE7, Data = FEDCAB98, Cache line = any of the m lines of cache (anywhere you want)
  – 0001 0110 0011 0011 1001 1100 -> first 22 bits = 0001 0110 0011 0011 1001 11

Example

• Each cache line can store any memory block
• Only by comparing all the Tags in the cache can we know if there's a miss or a hit

Associative Mapping Summary

• Total flexibility: a memory block can be in any line in the cache
• Allows for complex cache replacement algorithms that improve the hit ratio
• Costly in hardware implementation


Cache Mapping: Set Associative Mapping

• A compromise between direct and associative mapping
  – The cache is divided into v sets
  – Each set contains k lines
    » E.g. k=2 gives a 2-way associative cache system
  – So we have m = k x v lines in the cache
  – Each memory block can be in any line of a given cache set
  – It is like a student dorm system:
    » The cache set is the dorm room number, and the tags are the names of the students living in that room
    » To locate a student, the cache set address is pulled from the address, and then the Tags are compared

Cache Set Address Structure

• Address format: Tag = 9 bits | Cache Set = 13 bits | Word = 2 bits
• Use the set field to locate one of the 2^13 cache sets
• 2-way cache set, so data can be stored in either one of the lines in the set
• Compare the tag field to see if we have a hit
• e.g. (fig on next page)
  – Address | Tag | Data | Set number
  – 02C 7FFC | 02C | 12345678 | 1FFF
  – 1FF 7FFC | 1FF | 24682478 | 1FFF

Two Way Set Associative Cache Organization

Two Way Example


Replacement algorithms

• When there's a cache miss, a new memory block is loaded into the cache, so we need to replace cache content
  – In direct mapping, we don't have a choice: the new block has a fixed location in the cache
  – In set associative mapping, we need to choose which line in the set to replace
  – In associative mapping, there are more choices, a larger space to choose from
• Typically hardware implemented, no CPU involvement

Cache Performance

Replacement algorithms

• Algorithms used
  – Least Recently Used (LRU)
    » e.g. in a 2-way set associative cache, which of the 2 blocks is LRU?
  – First In First Out (FIFO)
    » Replace the block that has been in the cache longest
  – Least Frequently Used (LFU)
    » Replace the block which has had the fewest hits
  – Random
    » Generate a random number to determine which one to replace

Write Policy

• Memory data consistency issue – no free lunch theorem
  – When replacing a cache line, if the cache data changed, it needs to be written back to the corresponding memory location before it is replaced
  – When I/O modifies a memory word via DMA, the cache word becomes invalid and needs to be reloaded into the cache
  – In a multi-core CPU where each core has its own cache, a cache word is invalid if changed by one of the CPUs
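As an illustration of the LRU bookkeeping (a software sketch only, not the hardware implementation; the tag values are arbitrary), one set of a k-way set associative cache can be modeled with an ordered dictionary:

```python
from collections import OrderedDict

class LRUSet:
    """One set of a k-way set associative cache with LRU replacement (sketch)."""
    def __init__(self, k=2):
        self.k = k
        self.lines = OrderedDict()   # tag -> block data, least recently used first

    def access(self, tag, block=None):
        if tag in self.lines:            # hit: mark as most recently used
            self.lines.move_to_end(tag)
            return True
        if len(self.lines) >= self.k:    # miss with a full set: evict the LRU line
            self.lines.popitem(last=False)
        self.lines[tag] = block          # load the new block
        return False

s = LRUSet(k=2)
s.access(0x2C)
s.access(0x1FF)
s.access(0x2C)     # hit: 0x1FF becomes least recently used
s.access(0x0AA)    # miss: evicts 0x1FF
print([hex(t) for t in s.lines])   # ['0x2c', '0xaa']
```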


Write through

• All writes go to main memory as well as the cache
• Multiple CPUs can monitor main memory traffic to keep the local (to CPU) cache up to date
• Problem:
  – Many writes to memory
  – Lots of traffic on the bus

Write back

• The purpose is to minimize write operations on the bus
• When a cache line is updated, a bit is set to indicate that
• At the time of cache line replacement, only write to memory those lines updated
  – The average cache update rate is 15%, but for vector computing 33%, and for matrix transposition 50%
  – A write involves a line instead of a word, so only if a cache word gets written multiple times before replacement does this pay off

Example

• A memory write is 32 bits and takes 30ns
• A cache line is 16 bytes, i.e. 128 bits
• Average word writes per replacement: 12 times
• How much bus time will write back save over write thru?
• Solution
  – Write thru: 12 x 30 = 360ns / replacement cycle
  – Write back: (128/32) x 30 = 120ns / replacement cycle

Cache Performance

• Cost per bit for a two-level cache system
  – C1: cost per bit for cache
  – C2: cost per bit for memory
  – S1: cache size
  – S2: memory size
• What is the average cost per bit?
  – (C1*S1 + C2*S2) / (S1 + S2)
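The bus-time comparison can be redone in a few lines (using the example's numbers, with the line size taken as 16 bytes = 128 bits = 4 words of 32 bits):

```python
# Bus time per replacement cycle: write thru vs. write back.
word_write_ns = 30           # one 32-bit write to memory
writes_per_replacement = 12  # average word writes between replacements
words_per_line = 128 // 32   # 4 words per 16-byte cache line

write_through = writes_per_replacement * word_write_ns  # every write uses the bus
write_back = words_per_line * word_write_ns             # one line write-back at eviction
print(write_through, write_back)   # 360 120
```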


Cache Performance - Cost

Cache Performance – Access

• Consider the following 2-level system:
  – The cache hit ratio is h, i.e. the prob that a memory word access is in the cache
  – Time to access a word in the L1 and L2 cache: T1, T2
• What is the average word access time?
  – Ts = h*T1 + (1-h)*(T1+T2)
  – => T1/Ts = 1 / (1 + (1-h)*(T2/T1))
• We want T1/Ts to be close to 1.0
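A quick sketch of the T1/Ts ratio as a function of h (the ratio T2/T1 = 20 is an assumed example value, consistent with the earlier slide's t2 being about 20 times t1):

```python
# Access-time efficiency T1/Ts as a function of hit ratio h.
def efficiency(h, t2_over_t1=20.0):
    return 1.0 / (1.0 + (1.0 - h) * t2_over_t1)

for h in (0.90, 0.99, 1.00):
    print(h, round(efficiency(h), 3))
```

Even at h = 0.90 the average access time is about 3x the cache time; the ratio only approaches 1.0 as h gets very close to 1.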

Cache access as a function of hit ratio

Hit ratio vs data access locality

• Different programs have different access locality characteristics
• How does the cache size affect the hit ratio?
  – With no locality, the hit ratio is simply proportional to the S1/S2 ratio


Modern Computer Cache System Examples (Informational)

Intel CPU Cache

• 80386
  – No on-chip cache
• 80486
  – 8k, using 16-byte lines and a 4-way set associative organization
• Pentium (all versions)
  – Two on-chip L1 caches, for data & instructions separately
• Pentium III
  – L3 cache added off chip

Cache

• Pentium 4
  – L1 caches
    » 8k bytes
    » 64-byte lines (so 128 cache lines)
    » 4-way set associative
  – L2 cache
    » Feeding both L1 caches
    » 256k
    » 128-byte lines
    » 8-way set associative
  – L3 cache on chip

Pentium 4 Block Diagram


Pentium 4 Core Processor

• Fetch/Decode Unit
  – Fetches instructions from the L2 cache
  – Decodes into micro-ops
  – Stores micro-ops in the L1 cache
• Out-of-order execution logic
  – Schedules micro-ops
  – Based on data dependence and resources
  – May speculatively execute
• Execution units
  – Execute micro-ops
  – Data from the L1 cache
  – Results in registers
• Memory subsystem
  – L2 cache and system bus

PowerPC Cache Organization

• 601 – single 32kb, 8-way set associative
• 603 – 16kb (2 x 8kb), two-way set associative
• 604 – 32kb
• 620 – 64kb
• G3 & G4
  – 64kb L1 cache
    » 8-way set associative
  – 256k, 512k or 1M L2 cache
    » Two-way set associative
• G5
  – 32kB instruction cache
  – 64kB data cache

PowerPC G5 Block Diagram


Review Questions & Homework

• Review the notes and textbook, and try to answer the review questions
  – 4.1, 4.2, and 4.3
• Homework 1: due Oct 9, 2008
  – Please check the course webpage; it will be announced on Friday
  – It will be given in parts, covering Lectures 1~4
